test(longhaul): add long-haul test driver core#405
Conversation
Adds the test/longhaul Go module: a long-running canary driver that exercises a persistent DocumentDB cluster with continuous workload, scheduled disruptive operations, health monitoring, and reporting. Scope of this PR (Part 2/5 of documentdb#348 split): * cmd/longhaul: driver entrypoint * config: env-driven configuration * workload: writer/verifier/metrics * operations: scheduler + scale + upgrade ops * monitor: cluster client (consumes test/shared), health, leak detector * journal: thread-safe event log + outage policy oracle * report: markdown report, periodic checkpoint, GitHub Actions alerts * Dockerfile, README, deploy/setup.yaml bootstrap manifest Reuses test/shared/documentdb (lifecycle) and test/shared/mongo (data plane) extracted in documentdb#401 — no duplication of CR/Mongo logic. Excluded (in follow-up PRs): CI/CD workflows + deploy/deployment.yaml + deploy/rbac.yaml (PR-3), auto-upgrade logic (PR-4). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Copilot <223556219+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
Adds the core of the new test/longhaul Go module: a long-running “canary” driver that applies continuous workload, schedules disruptive operations (scale/upgrade), monitors cluster health/resources, journals events/windows, and produces checkpoint + markdown reporting.
Changes:
- Introduces workload writer/verifier with atomic metrics and uniqueness/indexing for durability checks.
- Adds the disruption scheduler and operations (scale up/down, DocumentDB version upgrade) with outage-policy tracking via a journal.
- Adds Kubernetes monitoring client (typed
controller-runtime+test/shared), leak detection, and reporting/checkpoint + GitHub Actions annotations.
Reviewed changes
Copilot reviewed 41 out of 42 changed files in this pull request and generated 16 comments.
Show a summary per file
| File | Description |
|---|---|
| test/longhaul/cmd/longhaul/main.go | Driver wiring: config, mongo, monitors, workload, scheduler, checkpoint reporting |
| test/longhaul/config/config.go | Env-driven configuration defaults + validation |
| test/longhaul/config/config_test.go | Unit tests for env parsing/validation/enabled flag |
| test/longhaul/config/suite_test.go | Ginkgo suite bootstrap for config package |
| test/longhaul/workload/metrics.go | Atomic counters + snapshot helpers for workload metrics |
| test/longhaul/workload/metrics_test.go | Unit tests for metrics snapshot/rates/concurrency safety |
| test/longhaul/workload/writer.go | Majority-write workload writer + index creation |
| test/longhaul/workload/writer_test.go | Unit tests for checksum and writer counter behavior |
| test/longhaul/workload/verifier.go | Majority-read verifier scanning per-writer sequences for gaps/checksum mismatches |
| test/longhaul/workload/verifier_test.go | Constructor/resume-point tests (with integration boundary noted) |
| test/longhaul/workload/suite_test.go | Ginkgo suite bootstrap for workload package |
| test/longhaul/operations/scheduler.go | Weighted operation scheduler with cooldown + steady-state gating |
| test/longhaul/operations/scheduler_test.go | Unit tests for selection, disruption windows, and error journaling |
| test/longhaul/operations/scale.go | Scale up/down operations with budgets + recovery waits |
| test/longhaul/operations/scale_test.go | Unit tests for scale operation bounds and policy values |
| test/longhaul/operations/upgrade.go | Upgrade operation reading desired version from ConfigMap + readiness waits |
| test/longhaul/operations/upgrade_test.go | Unit tests for desired-version reads, preconditions, and outage policy |
| test/longhaul/operations/suite_test.go | Ginkgo suite bootstrap for operations package |
| test/longhaul/monitor/k8sclient.go | Typed K8s/CR client + metrics-server integration; uses test/shared/documentdb |
| test/longhaul/monitor/k8sclient_test.go | Unit tests with fake clientsets/controller-runtime fake client |
| test/longhaul/monitor/health.go | Health monitor with steady-state tracking and wait helper |
| test/longhaul/monitor/health_test.go | Unit tests for steady-state logic and wait behavior |
| test/longhaul/monitor/leakdetect.go | Linear regression leak detector for memory/CPU trends |
| test/longhaul/monitor/leakdetect_test.go | Unit tests for slope/leak threshold logic |
| test/longhaul/monitor/suite_test.go | Ginkgo suite bootstrap for monitor package |
| test/longhaul/journal/journal.go | Thread-safe append-only event journal + disruption window tracking |
| test/longhaul/journal/journal_test.go | Unit tests for events/windows/policy violation detection |
| test/longhaul/journal/policy.go | Outage policy + window semantics (duration, exceeded policy) |
| test/longhaul/journal/policy_test.go | Unit tests for policy/window semantics |
| test/longhaul/journal/suite_test.go | Ginkgo suite bootstrap for journal package |
| test/longhaul/report/report.go | Markdown report generator (metrics, windows, leaks, recent events) |
| test/longhaul/report/report_test.go | Unit tests for report content/sections/truncation |
| test/longhaul/report/checkpoint.go | Periodic checkpoint reporter writing ConfigMap + stdout + annotations |
| test/longhaul/report/checkpoint_test.go | Unit tests using fake clientset for ConfigMap persistence |
| test/longhaul/report/alert.go | GitHub Actions workflow command annotations (error/notice/warning) |
| test/longhaul/report/alert_test.go | Unit tests capturing stdout for annotations |
| test/longhaul/report/suite_test.go | Ginkgo suite bootstrap for report package |
| test/longhaul/deploy/setup.yaml | Bootstrap namespace/secret/DocumentDB CR manifest for canary cluster |
| test/longhaul/Dockerfile | Container image build for the longhaul driver |
| test/longhaul/README.md | Usage/configuration documentation for running + deploying longhaul |
| test/longhaul/go.mod | New longhaul module definition with replace to local operator/shared modules |
| test/longhaul/go.sum | Dependency lockfile for the longhaul module |
|
🤖 Auto-triaged by documentdb-triage-tool. Applied: Reasoningcomponent from path globs (test, dependencies, docs); effort from diff stats (5029+0 LOC, 42 files); LLM: Large multi-package long-haul test driver core spanning several new Go packages, a Dockerfile, manifests, and docs — cross-cutting test infrastructure addition as part of a 5-PR series. If a label is wrong, remove it manually and ping |
Copilot inline comments (16): - writer.go: fix StartWriters doc; switch to math/rand/v2 (auto-seeded) - main.go: header says Deployment (not Job); drop-collection gated on LONGHAUL_RESET_DATA - upgrade.go: wrap+return error from GetCurrentDocumentDBImageTag - checkpoint.go: bounded ctx on final emit; final PASS persists as PASS (not RUNNING) - alert.go: escape % / CR / LF in GH Actions annotation messages - deploy/setup.yaml: password is placeholder; plugin field uses spec.plugins.sidecarInjectorName - Dockerfile + README: build from repo root so go.mod replace paths resolve; Go 1.25; update relationship table for shared module Local code-review-agent (3 critical): - C1: writer no longer pre-increments seq; advance only on success or DupKey ack, so non-DupKey failures retry the same seq instead of being mistaken for data loss - C2: main calls reporter.EmitFinal() synchronously before exit so the authoritative final verdict reaches the longhaul-report ConfigMap - C3: writers seed seq from FindOne(sort:-1) on startup so a Deployment-driven pod restart resumes past the prior tip without colliding with the unique (writer_id, seq) index Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Copilot <223556219+Copilot@users.noreply.github.com>
xgerman
left a comment
There was a problem hiding this comment.
Review: structure, duplication & terseness
Focused pass on code duplication, structure, and style (the existing Copilot threads already covered correctness/security and are addressed in 3e9ac70). The module is well-documented and go build/go vet/go test ./... all pass locally. 👍
🟠 Must fix — gofmt (CI blocker)
Four files are not gofmt-clean (extra-space alignment), so make fmt/lint will fail:
operations/scale.go, operations/scheduler.go, monitor/health.go, report/checkpoint.go.
Fix: gofmt -w operations/ monitor/ report/.
🟡 Duplication / structure (highest-value dedup)
ScaleUpandScaleDownare ~95% identical — collapse into one parameterized type (inline).main.buildRESTConfigre-implementsmonitor.buildRestConfigand builds a second clientset thatK8sClusterClientalready owns (inline).shareddb.Get(...)repeated 4× ink8sclient.go— extract agetCRhelper (inline).- The ticker +
select{ctx.Done/ticker.C}loop recurs 5× (writer/verifier/scheduler/health/metrics) — optional shared helper (inline).
🟢 Style / terseness / dead code
journal.DefaultOutagePolicy()is unused (inline).report.go:b.WriteString(fmt.Sprintf(...))→fmt.Fprintf(&b, ...)(inline).checkpoint.go:map[string]interface{}→map[string]any(inline).metrics.go:MetricsSnapshotdoc comment should start with the type name (inline).
No blocking design issues — just the gofmt fix for CI plus the two consolidations (#1, #2) which are the biggest wins. Nice work on the durability oracle and the bounded verifier scan.
| } | ||
|
|
||
| // ScaleDown decreases spec.instancesPerNode by 1 (HA scale dimension; range 1-3). | ||
| type ScaleDown struct { |
There was a problem hiding this comment.
ScaleUp (above) and ScaleDown are ~95% identical: same fields, and Precondition/Execute differ only by +1/-1, >=/<=, the bound, and a couple of constants. Consider collapsing into one parameterized type to halve the file:
type scaleOp struct {
client monitor.ClusterClient
healthMon *monitor.HealthMonitor
name string
weight int
delta int // +1 up, -1 down
limit int // max (up) or min (down)
atLimit func(cur int) bool
recovery time.Duration
policy journal.OutagePolicy
}Thin NewScaleUp/NewScaleDown constructors keep call sites unchanged.
(Also: Name()/Weight() on lines 37-38 and 98-99 fail gofmt — extra spaces.)
There was a problem hiding this comment.
Fixed in ed498df — collapsed into a shared scaleOp struct parameterized by delta/bound/boundKind/policy. ScaleUp and ScaleDown are now thin embedding wrappers so call sites and the existing NewScaleUp/NewScaleDown constructors are unchanged. Tests use small maxInstances()/minInstances() accessors instead of field access. Net: scale.go went from 136 -> ~130 lines and has zero duplication. Also gofmt-fixed (the extra-space alignment you flagged on lines 37-38 / 98-99 is gone).
| return client, clientset, nil | ||
| } | ||
|
|
||
| func buildRESTConfig() (*rest.Config, error) { |
There was a problem hiding this comment.
buildRESTConfig here re-implements monitor.buildRestConfig (k8sclient.go:86) almost verbatim, and initK8sClient then creates a second clientset (line 231) even though K8sClusterClient already built one (k8sclient.go:59). Expose the existing one instead:
func (k *K8sClusterClient) Clientset() kubernetes.Interface { return k.clientset }Then delete buildRESTConfig and the redundant NewForConfig — one client, one config path.
There was a problem hiding this comment.
Fixed in ed498df — added K8sClusterClient.Clientset() kubernetes.Interface, deleted main.buildRESTConfig and initK8sClient entirely, and dropped the unused k8s.io/client-go/{kubernetes,rest,tools/clientcmd} imports from main.go. main.go now does the one monitor.NewK8sClusterClient(...) call and reuses clusterClient.Clientset() for the reporter. main.go shrank by ~40 lines.
| // Get the DocumentDB CR status via the shared typed helper. Using | ||
| // shareddb.IsHealthy keeps the readiness predicate consistent with | ||
| // the e2e suite (single source of truth for ReadyStatus). | ||
| dd, err := shareddb.Get(ctx, k.crClient, types.NamespacedName{Namespace: k.namespace, Name: k.clusterName}) |
There was a problem hiding this comment.
shareddb.Get(ctx, k.crClient, types.NamespacedName{Namespace: k.namespace, Name: k.clusterName}) is repeated verbatim in 4 methods (GetClusterHealth, GetInstancesPerNode, GetCurrentDocumentDBImageTag, UpgradeDocumentDB). Extract a private helper:
func (k *K8sClusterClient) getCR(ctx context.Context) (*previewv1.DocumentDB, error) {
return shareddb.Get(ctx, k.crClient, types.NamespacedName{Namespace: k.namespace, Name: k.clusterName})
}There was a problem hiding this comment.
Fixed in ed498df — added private (k *K8sClusterClient) getCR(ctx) helper. All four call sites (GetClusterHealth, GetInstancesPerNode, GetCurrentDocumentDBImageTag, UpgradeDocumentDB) now use it. types/shareddb are still imported once at the top for the helper itself.
| ticker := time.NewTicker(healthCheckInterval) | ||
| defer ticker.Stop() | ||
|
|
||
| for { |
There was a problem hiding this comment.
This ticker + select { <-ctx.Done() / <-ticker.C } loop is duplicated 5× across the module (here, writer.Run, verifier.Run, scheduler.Run, runMetricsSampling). Optional, since they span packages, but a tiny shared helper would DRY them:
func tick(ctx context.Context, d time.Duration, fn func(context.Context)) {
t := time.NewTicker(d); defer t.Stop()
for { select { case <-ctx.Done(): return; case <-t.C: fn(ctx) } }
}There was a problem hiding this comment.
Skipped intentionally. Agree the pattern recurs 5 times, but a cross-package utility for a 5-line idiom isn't a clear win — it would mean either a new test/longhaul/internal package (overhead) or sticking it in one of the existing packages (awkward import direction). Each call site is also a Run() method with package-specific logging, so the saved lines per site are closer to 2-3. Happy to revisit if more components get added in PR-3/PR-4.
| } | ||
|
|
||
| // DefaultOutagePolicy returns a conservative policy suitable for most operations. | ||
| func DefaultOutagePolicy() OutagePolicy { |
There was a problem hiding this comment.
DefaultOutagePolicy() is dead code — it's never referenced anywhere (each operation builds its OutagePolicy inline). Either use it as the base for the scale/upgrade policies or remove it.
There was a problem hiding this comment.
Keeping DefaultOutagePolicy() — it's referenced from 3 test files (journal/journal_test.go:70-71, journal/policy_test.go:80, operations/scheduler_test.go:38) where the tests need a "reasonable defaults" stand-in to construct fake operations / open disruption windows. Removing it would push three literal OutagePolicy{...} copies into the tests, which is worse. Open to renaming if you'd prefer something like BaselineOutagePolicy() to signal it's a test/baseline default rather than a production default.
| b.WriteString("# Long Haul Test Report\n\n") | ||
|
|
||
| // Header | ||
| b.WriteString(fmt.Sprintf("**Result:** %s\n", s.Result)) |
There was a problem hiding this comment.
b.WriteString(fmt.Sprintf(...)) appears ~15× in this function. fmt.Fprintf(&b, ...) is the idiomatic, terser form (a strings.Builder implements io.Writer), e.g.:
fmt.Fprintf(&b, "**Result:** %s\n", s.Result)There was a problem hiding this comment.
Fixed in ed498df — all ~15 occurrences switched to fmt.Fprintf(&b, ...). Pure b.WriteString("literal") calls (no formatting needed) kept as-is since Fprintf would be unnecessary indirection there.
| } | ||
|
|
||
| // Also log the summary as JSON for structured log consumers. | ||
| summaryJSON, _ := json.Marshal(map[string]interface{}{ |
There was a problem hiding this comment.
map[string]interface{} → map[string]any (Go 1.18+, and go.mod is 1.25). Minor: fmt.Printf("\n%s\n", "=== CHECKPOINT REPORT ===") on lines 93/95 can just be fmt.Println(...). Note this file also fails gofmt (the JSON map literal alignment).
There was a problem hiding this comment.
Fixed in ed498df — map[string]any, Printf("\n%s\n", "...") collapsed to Println("\n...") / Print("...\n\n"), and gofmt -w cleaned the JSON map alignment. go vet and gofmt -l are both clean now.
| } | ||
| } | ||
|
|
||
| // Snapshot returns a point-in-time copy of all metric values. |
There was a problem hiding this comment.
This doc comment is attached to type MetricsSnapshot but reads "Snapshot returns a point-in-time copy...". Per Go convention the comment should start with the identifier it documents, e.g. // MetricsSnapshot is a point-in-time copy of all metric values.
There was a problem hiding this comment.
Fixed in ed498df — comment now reads // MetricsSnapshot is a point-in-time copy of all metric values.
- gofmt -w across the module (4 files were not gofmt-clean: scale.go, scheduler.go, health.go, checkpoint.go) — fixes the CI lint gate
- collapse ScaleUp/ScaleDown into a shared scaleOp struct (parameterized by delta/bound/policy); thin ScaleUp / ScaleDown wrappers keep call sites and tests unchanged
- expose K8sClusterClient.Clientset() and drop the duplicate REST-config + clientset construction in cmd/longhaul/main.go
- extract K8sClusterClient.getCR() helper; the namespaced-name lookup no longer repeats in every method
- report.go: b.WriteString(fmt.Sprintf(...)) -> fmt.Fprintf(&b, ...)
- checkpoint.go: map[string]interface{} -> map[string]any; collapse double Printf wrappers
- metrics.go: MetricsSnapshot doc comment now starts with the type name
Skipped (with rationale in PR replies):
- cross-package tick helper (xgerman marked optional; ~5 lines saved is not worth a new utility package)
- removing DefaultOutagePolicy (still used by journal/operations tests as a reasonable-defaults stand-in)
go build / go vet / go test ./... all clean; gofmt -l reports no remaining issues.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Copilot <223556219+Copilot@users.noreply.github.com>
Verifier: - Drop NumVerifiers config knob and StartVerifiers helper; always run exactly one verifier. - Multiple verifiers were a bug, not a feature: each one independently scanned the whole collection and wrote to the SHARED Metrics.VerifyGapsDetected counter, so a single real gap of N missing seqs was reported as N x NumVerifiers in alerts, and read load on the cluster scaled with NumVerifiers (turning the verifier's own load into a confounding signal for the test report). - One verifier is sufficient: the scan is stateless against the DB and bounded by the per-writer nextSeq resume map, so adding more instances adds noise, not coverage. - LONGHAUL_NUM_VERIFIERS env var removed (along with the cfg field, default, parse, validate, README row, and config_test references). Metrics: - Add per-field godoc on every Metrics field clarifying ack-vs-fail semantics, what triggers FAIL (verifier-side only), DupKey ack behavior, scan-cycle vs document semantics, and StartTime / Elapsed restart behavior. - The user asked 'what is VerifyGapsDetected' — that the question came up at all is evidence the field comments were insufficient. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Metrics: one-line godoc per field; drop the multi-paragraph prose. - Verifier: drop the dead 'id' field (only one verifier exists), update test/log sites accordingly. - Fix incorrect nextSeq comment: once we step past a gap, expectedSeq advances to seq+1, so a late-arriving fill at the missing seq is filtered out by 'seq >= nextSeq' on subsequent cycles — gaps are counted exactly once, not re-checked. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Copilot <223556219+Copilot@users.noreply.github.com>
Without this, the verifier could miss data loss in the tail: if seqs above the verifier's last-observed point were acked then lost, and no later writes arrived to expose the gap via the per-doc check, the loss was invisible. Concretely: writer acks 1..110, DB loses 101..110, writer crashes/idles. Verifier scans seq>=101, sees nothing, and the durability oracle stays clean — false negative. Fix: snapshot writer.Seq() at the top of each per-writer scan, bound the Find filter to seq <= that snapshot, and after the per-doc loop check whether expectedSeq <= snapshot. Any residual is tail loss; count it under VerifyGapsDetected and log a distinct 'tail loss' message. Advance nextSeq to snapshot+1 so the next cycle scans only newly committed seqs. Race safety: reading writer.Seq() BEFORE the Find guarantees we never count in-flight writes (which commit after Seq() advances). Verified expected <= actual in steady state, so missing>0 only when seqs are truly gone. Also drops the aggregation-based writer-id discovery (we now iterate the writers slice the verifier owns) and adds Writer.Seq() public accessor + unit test. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Copilot <223556219+Copilot@users.noreply.github.com>
Verifier: extract auditDocs as a pure function that takes the decoded docs and (expectedSeq, maxSeq) and returns (newExpectedSeq, internalGaps, tailLoss, checksumErrors) plus a list of structured findings the caller logs. This decouples the gap/tail/checksum math from cursor iteration and the journal, so it's unit-testable without a live mongo. Writer: introduce writeBackend interface (insert + isDuplicate + highestSeq) and wrap *mongo.Collection in mongoBackend. writeOne and Resume now go through the interface, so we can stub it in tests. Adds 11 audit table cases (no docs, clean run, gap-in-middle, gap-at-start, empty-with-tail, tail-after-docs, gap+tail, late-write-exposes-gap, single checksum, multi-checksum, nextSeq advance invariant) and 9 writer cases (success, DupKey-as-ack, transient error, retry-after-error, monotonic advance; Resume empty / has-data / on-error). Behavior unchanged. Net: ~+200 lines test, ~+50/-40 lines source. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Copilot <223556219+Copilot@users.noreply.github.com>
Merge audit_test.go into verifier_test.go and writer_backend_test.go into writer_test.go. Each source file now has exactly one matching _test.go (Go convention), no test changes. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Add doc.go with the package-level overview explaining why report.go / checkpoint.go / alert.go are separate (data model vs. orchestration vs. CI surface) so the next reader doesn't have to ask. - Add per-field godoc on Summary clarifying who populates each field, when it's empty, and the PASS/FAIL relationship to LeakAnalysis (warning only, not a verdict flip). - Drop the narrow one-liner package doc from report.go in favor of the richer one in doc.go. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Copilot <223556219+Copilot@users.noreply.github.com>
AllowedDowntime: set by every operation's OutagePolicy() and DefaultOutagePolicy() but never read by ExceededPolicy (which is the only consumer). Misleading public API surface — readers assume it's enforced. Removed the field, all assignments, and the TODO. The downtime budget can be re-introduced together with the writer-side timestamp plumbing it would actually require. Event.Metadata: declared, allocated, threaded through Record's signature, set to nil at every Info/Warn/Error call site, and never read. Dropped the field and simplified Record(level, component, message). No behavior change. All callers were updated; tests still pass. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Copilot <223556219+Copilot@users.noreply.github.com>
On a multi-day run the in-memory events slice grew unbounded — ~200 events/day baseline (op open/close + verifier batch errors) means ~6k after a month, harmless on its own but Events() returns a full slice copy on every checkpoint (~30s), so allocation/GC pressure scales with run length. Cap the ring at 10k with 1k headroom so the trim is amortized to one copy every ~1k appends, not paid on every append. The rendered report only ever surfaces the last 20 events so the cap is invisible to consumers. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Copilot <223556219+Copilot@users.noreply.github.com>
Part 2 of 5: long-haul test driver core
Part 2 of the #348 split (design doc landed in #400, shared test module in #401).
What this PR adds
A new
test/longhaulGo module — a long-running canary driver that exercises a persistent DocumentDB cluster with continuous workload, scheduled disruptive operations, health monitoring, and reporting. See the design indocs/designs/long-haul-test-design.md(merged via #400) for the why.Packages:
cmd/longhaulconfigworkloadoperationsmonitortest/shared), health gating, leak detectorjournalreportdeploy/setup.yamlDockerfile,README.mdCode reuse with e2e
Per the design doc (and #401), the driver consumes:
test/shared/documentdb—Get,PatchSpec,PatchInstances,WaitHealthy,Delete,ReadyStatus(no duplicated CR logic; no hardcoded readiness strings)test/shared/mongo—NewClient,BuildURI,Ping,Seed,Count, TLS plumbing (no rawmongo.Connect)go.modusesreplacefor both modules so changes to shared helpers flow into the driver immediately.Excluded from this PR (follow-ups)
.github/workflows/longhaul-*.yaml) +deploy/deployment.yaml+deploy/rbac.yamlVerification
Reviewer notes
monitor/k8sclient.gois the typedcontroller-runtimewrapper that addresses xgerman's feat: long-haul (canary) test driver for the operator (#220) #348 review feedback (dropped rawdynamic.Interface+ hardcoded GVR).operations/upgrade.goreads the desired DocumentDB version from alonghaul-versionsConfigMap that PR-4's workflow will populate; today it no-ops if the ConfigMap is missing.