feat(CLDSRV-884): Add OpenTelemetry tracing instrumentation#6140
Conversation
Hello delthas,My role is to assist you with the merge of this Available options
Available commands
Status report is not available. |
Request integration branchesWaiting for integration branch creation to be requested by the user. To request integration branches, please comment on this pull request with the following command: Alternatively, the |
Codecov Report❌ Patch coverage is
Additional details and impacted files
... and 1 file with indirect coverage changes @@ Coverage Diff @@
## development/9.4 #6140 +/- ##
===================================================
+ Coverage 85.30% 85.35% +0.04%
===================================================
Files 208 208
Lines 13919 13932 +13
===================================================
+ Hits 11874 11891 +17
+ Misses 2045 2041 -4
Flags with carried forward coverage won't be shown. Click here to find out more. 🚀 New features to boost your workflow:
|
9c93a1d to
97e63fc
Compare
97e63fc to
f978280
Compare
f978280 to
06eea4e
Compare
06eea4e to
75c3afb
Compare
75c3afb to
207b785
Compare
|
@DarkIsDude ready for review. For easier review, ignore the prettier commit: https://github.com/scality/cloudserver/pull/6140/changes/837c50f2b91d2a391ba1266761de74a17230e87c..396404d3f22623c06bc924f9669d431964983555 |
DarkIsDude
left a comment
There was a problem hiding this comment.
Almost good for me. My main concern is about the package.json 🙏
Waiting for approvalThe following approvals are needed before I can proceed with the merge:
|
Pre-existing prettier debt on lib/api/api.js and lib/server.js — the 9.4 base did not run prettier on these. Extracted as a separate commit so the subsequent CLDSRV-884 OTEL commits show only logic changes, not formatting noise. Issue: CLDSRV-884
Waiting for approvalThe following approvals are needed before I can proceed with the merge:
|
|
LGTM |
|
/approve |
In the queueThe changeset has received all authorizations and has been added to the The changeset will be merged in:
The following branches will NOT be impacted:
This pull request does not target the following hotfix branch(es) so they
There is no action required on your side. You will be notified here once IMPORTANT Please do not attempt to modify this pull request.
If you need this pull request to be removed from the queue, please contact a The following options are set: approve |
|
I have successfully merged the changeset of this pull request
The following branches have NOT changed:
Please check the status of the associated issue CLDSRV-884. Goodbye delthas. |
Summary
Add OpenTelemetry tracing to cloudserver, gated behind
ENABLE_OTEL=true. When the flag is unset, no@opentelemetry/*package is loaded — zero overhead off the OTEL path.The bootstrap, trust-boundary host filter, health-probe filter, and per-handler span helper now live in arsenal's shared
lib/tracingmodule (scality/Arsenal#2632, ARSN-586); cloudserver consumes it instead of carrying its own copy. Rebased ontodevelopment/9.4, leading with a prettier-only commit so the feature commits show logic, not formatting noise.Commits
chore: prettier-format files touched by CLDSRV-884— isolates the pre-existing prettier reformat oflib/api/api.js+lib/server.jsso the subsequentfeat:commits carry only logic deltas.chore: add OpenTelemetry dependencies and consume arsenal tracing module— pin arsenal at the ARSN-586 branch (which carries the shared tracing module and the W3C trace-context stamping on MongoDB metadata writes, ARSN-572, so async consumers can continue the trace across the oplog boundary). Drop the SDK-core packages now that arsenal carries them asoptionalDependencies, and keep the three instrumentation packages (instrumentation-http/-ioredis/-mongodb) here — the consumer owns and configures them. Noauto-instrumentations-nodebundle (which would pull ~36 unused instrumentations).feat: wire OpenTelemetry bootstrap and shutdown via arsenalindex.jsinitializes tracing viarequire('arsenal/build/lib/tracing')beforeserver.js. The deep-require intobuild/lib/tracingis deliberate:require('arsenal')would eagerly pull ioredis/mongodb through the barrel and defeatrequire-in-the-middlepatching. The call is gated ontracing.isEnabled()so the OTEL-off path never loads Config (and the arsenal barrel) early, and the cluster primary skips init (workers re-exec and init themselves). Cloudserver constructs the three instrumentations itself and passes the http one throughtracing.makeHttpInstrumentationConfig({ healthPaths })for the k8s liveness/readiness/metrics paths — which wires the trust-boundaryrequestHook(stripstraceparent/tracestateon outbound to hosts outsideOTEL_TRUSTED_HOSTS) plus anOPTIONS+ health-path ignore hook.S3Server.cleanUp()is async: itawaitstracing.close()afterserver.close()and beforeprocess.exit.close()is race-safe and bounded at 5 s viaPromise.racewith a.unref()'d timer, so an unreachable collector can't block past Kubernetes' 30 s grace.caughtExceptionShutdown()flushes on the non-cluster and cluster-worker paths; the cluster primary keeps its survive-and-log behavior.instrumentation-httpalready makes the server span a child of any remotetraceparent. A manual extract would demote downstreamapi.<handler>spans to siblings.feat: instrument all S3 API handlers with OTEL spans(lib/api/api.js) — apply arsenal'sinstrumentApiMethodvia anObject.entries(api)loop with aNON_INSTRUMENTED_KEYSskip set (callApiMethod,checkAuthResults,handleAuthorizationResults). Every S3 handler produces anapi.<method>span (instrumentation scopearsenal.api) owning the whole handler execution; auto-instrumentation spans nest beneath. New handlers are auto-instrumented — no per-handler boilerplate. When OTEL is disabled,instrumentApiMethodreturns the original function unchanged.Configuration
ENABLE_OTELtrueto load the SDK. When unset, no@opentelemetry/*package is required — zero overhead.OTEL_EXPORTER_OTLP_TRACES_ENDPOINTOTEL_SAMPLING_RATIO0.01[0, 1]. Use1to trace everything during testing.OTEL_TRUSTED_HOSTStraceparent. Loopback is always trusted. Unset ⇒ only loopback trusted, so cross-servicetraceparentis stripped until an operator sets this (safe default).OTEL_SERVICE_NAMEcloudserverOTEL_SERVICE_VERSIONpackage.json#versionOTEL_SERVICE_NAMESPACEscalityTracing-correctness coverage (bootstrap, trust boundary, health filter, span helper) lives with the code in arsenal's
tests/unit/tracing/; the full cloudserver unit suite remains green.Out of scope (follow-ups)
The trust-boundary enforcement (read
OTEL_TRUSTED_HOSTS, striptraceparenton untrusted outbound) lives entirely in arsenal and is wired automatically. Populating that env var per deployment is routine operator config — no cloudserver-side work.Related tickets
Issue: CLDSRV-884