Skip to content

test(longhaul): add long-haul test driver core#405

Open
WentingWu666666 wants to merge 11 commits into
documentdb:mainfrom
WentingWu666666:developer/wentingwu/longhaul-core
Open

test(longhaul): add long-haul test driver core#405
WentingWu666666 wants to merge 11 commits into
documentdb:mainfrom
WentingWu666666:developer/wentingwu/longhaul-core

Conversation

@WentingWu666666

Copy link
Copy Markdown
Collaborator

Part 2 of 5: long-haul test driver core

Part 2 of the #348 split (design doc landed in #400, shared test module in #401).

What this PR adds

A new test/longhaul Go module — a long-running canary driver that exercises a persistent DocumentDB cluster with continuous workload, scheduled disruptive operations, health monitoring, and reporting. See the design in docs/designs/long-haul-test-design.md (merged via #400) for the why.

Packages:

Path Role
cmd/longhaul Driver entrypoint; wires workload + ops + monitor + reporting
config Env-driven configuration with validation + unit tests
workload Writer / verifier / atomic metrics (acked-write durability oracle)
operations Weighted-random scheduler + scale/upgrade ops
monitor Cluster client (consumes test/shared), health gating, leak detector
journal Thread-safe event log + per-operation outage-policy oracle
report Markdown report, periodic checkpoint to ConfigMap, GitHub Actions alerts
deploy/setup.yaml Bootstrap manifest (namespace + Secret + DocumentDB CR) for canary cluster
Dockerfile, README.md Container image + operator-facing documentation

Code reuse with e2e

Per the design doc (and #401), the driver consumes:

  • test/shared/documentdbGet, PatchSpec, PatchInstances, WaitHealthy, Delete, ReadyStatus (no duplicated CR logic; no hardcoded readiness strings)
  • test/shared/mongoNewClient, BuildURI, Ping, Seed, Count, TLS plumbing (no raw mongo.Connect)

go.mod uses replace for both modules so changes to shared helpers flow into the driver immediately.

Excluded from this PR (follow-ups)

  • PR-3: CI/CD workflows (.github/workflows/longhaul-*.yaml) + deploy/deployment.yaml + deploy/rbac.yaml
  • PR-4: auto-upgrade logic for operator + DocumentDB version drift

Verification

cd test/longhaul
go build ./...   # clean
go test ./...    # 6/6 packages pass: config / journal / monitor / operations / report / workload

Reviewer notes

Adds the test/longhaul Go module: a long-running canary driver that

exercises a persistent DocumentDB cluster with continuous workload,

scheduled disruptive operations, health monitoring, and reporting.

Scope of this PR (Part 2/5 of documentdb#348 split):

  * cmd/longhaul: driver entrypoint

  * config: env-driven configuration

  * workload: writer/verifier/metrics

  * operations: scheduler + scale + upgrade ops

  * monitor: cluster client (consumes test/shared), health, leak detector

  * journal: thread-safe event log + outage policy oracle

  * report: markdown report, periodic checkpoint, GitHub Actions alerts

  * Dockerfile, README, deploy/setup.yaml bootstrap manifest

Reuses test/shared/documentdb (lifecycle) and test/shared/mongo

(data plane) extracted in documentdb#401 — no duplication of CR/Mongo logic.

Excluded (in follow-up PRs): CI/CD workflows + deploy/deployment.yaml

+ deploy/rbac.yaml (PR-3), auto-upgrade logic (PR-4).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 16, 2026 19:52

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds the core of the new test/longhaul Go module: a long-running “canary” driver that applies continuous workload, schedules disruptive operations (scale/upgrade), monitors cluster health/resources, journals events/windows, and produces checkpoint + markdown reporting.

Changes:

  • Introduces workload writer/verifier with atomic metrics and uniqueness/indexing for durability checks.
  • Adds the disruption scheduler and operations (scale up/down, DocumentDB version upgrade) with outage-policy tracking via a journal.
  • Adds Kubernetes monitoring client (typed controller-runtime + test/shared), leak detection, and reporting/checkpoint + GitHub Actions annotations.

Reviewed changes

Copilot reviewed 41 out of 42 changed files in this pull request and generated 16 comments.

Show a summary per file
File Description
test/longhaul/cmd/longhaul/main.go Driver wiring: config, mongo, monitors, workload, scheduler, checkpoint reporting
test/longhaul/config/config.go Env-driven configuration defaults + validation
test/longhaul/config/config_test.go Unit tests for env parsing/validation/enabled flag
test/longhaul/config/suite_test.go Ginkgo suite bootstrap for config package
test/longhaul/workload/metrics.go Atomic counters + snapshot helpers for workload metrics
test/longhaul/workload/metrics_test.go Unit tests for metrics snapshot/rates/concurrency safety
test/longhaul/workload/writer.go Majority-write workload writer + index creation
test/longhaul/workload/writer_test.go Unit tests for checksum and writer counter behavior
test/longhaul/workload/verifier.go Majority-read verifier scanning per-writer sequences for gaps/checksum mismatches
test/longhaul/workload/verifier_test.go Constructor/resume-point tests (with integration boundary noted)
test/longhaul/workload/suite_test.go Ginkgo suite bootstrap for workload package
test/longhaul/operations/scheduler.go Weighted operation scheduler with cooldown + steady-state gating
test/longhaul/operations/scheduler_test.go Unit tests for selection, disruption windows, and error journaling
test/longhaul/operations/scale.go Scale up/down operations with budgets + recovery waits
test/longhaul/operations/scale_test.go Unit tests for scale operation bounds and policy values
test/longhaul/operations/upgrade.go Upgrade operation reading desired version from ConfigMap + readiness waits
test/longhaul/operations/upgrade_test.go Unit tests for desired-version reads, preconditions, and outage policy
test/longhaul/operations/suite_test.go Ginkgo suite bootstrap for operations package
test/longhaul/monitor/k8sclient.go Typed K8s/CR client + metrics-server integration; uses test/shared/documentdb
test/longhaul/monitor/k8sclient_test.go Unit tests with fake clientsets/controller-runtime fake client
test/longhaul/monitor/health.go Health monitor with steady-state tracking and wait helper
test/longhaul/monitor/health_test.go Unit tests for steady-state logic and wait behavior
test/longhaul/monitor/leakdetect.go Linear regression leak detector for memory/CPU trends
test/longhaul/monitor/leakdetect_test.go Unit tests for slope/leak threshold logic
test/longhaul/monitor/suite_test.go Ginkgo suite bootstrap for monitor package
test/longhaul/journal/journal.go Thread-safe append-only event journal + disruption window tracking
test/longhaul/journal/journal_test.go Unit tests for events/windows/policy violation detection
test/longhaul/journal/policy.go Outage policy + window semantics (duration, exceeded policy)
test/longhaul/journal/policy_test.go Unit tests for policy/window semantics
test/longhaul/journal/suite_test.go Ginkgo suite bootstrap for journal package
test/longhaul/report/report.go Markdown report generator (metrics, windows, leaks, recent events)
test/longhaul/report/report_test.go Unit tests for report content/sections/truncation
test/longhaul/report/checkpoint.go Periodic checkpoint reporter writing ConfigMap + stdout + annotations
test/longhaul/report/checkpoint_test.go Unit tests using fake clientset for ConfigMap persistence
test/longhaul/report/alert.go GitHub Actions workflow command annotations (error/notice/warning)
test/longhaul/report/alert_test.go Unit tests capturing stdout for annotations
test/longhaul/report/suite_test.go Ginkgo suite bootstrap for report package
test/longhaul/deploy/setup.yaml Bootstrap namespace/secret/DocumentDB CR manifest for canary cluster
test/longhaul/Dockerfile Container image build for the longhaul driver
test/longhaul/README.md Usage/configuration documentation for running + deploying longhaul
test/longhaul/go.mod New longhaul module definition with replace to local operator/shared modules
test/longhaul/go.sum Dependency lockfile for the longhaul module

Comment thread test/longhaul/workload/writer.go Outdated
Comment thread test/longhaul/cmd/longhaul/main.go
Comment thread test/longhaul/cmd/longhaul/main.go Outdated
Comment thread test/longhaul/operations/upgrade.go Outdated
Comment thread test/longhaul/operations/scheduler.go
Comment thread test/longhaul/README.md Outdated
Comment thread test/longhaul/README.md Outdated
Comment thread test/longhaul/Dockerfile
Comment thread test/longhaul/deploy/setup.yaml Outdated
Comment thread test/longhaul/report/checkpoint.go Outdated
@WentingWu666666 WentingWu666666 marked this pull request as draft June 16, 2026 20:23
@documentdb-triage-tool documentdb-triage-tool Bot added dependencies Pull requests that update a dependency file documentation Improvements or additions to documentation enhancement New feature or request test labels Jun 16, 2026
@documentdb-triage-tool

Copy link
Copy Markdown

🤖 Auto-triaged by documentdb-triage-tool.

Applied: test, dependencies, documentation, enhancement
Project fields suggested: Component test · Priority P2 · Effort XL · Status In Progress
Confidence: 0.88 (mixed)

Reasoning

component from path globs (test, dependencies, docs); effort from diff stats (5029+0 LOC, 42 files); LLM: Large multi-package long-haul test driver core spanning several new Go packages, a Dockerfile, manifests, and docs — cross-cutting test infrastructure addition as part of a 5-PR series.

If a label is wrong, remove it manually and ping @patty-chow so the rules can be tuned. The bot will not re-label items that already have component labels.

@WentingWu666666 WentingWu666666 changed the title test(longhaul): add long-haul test driver core (Part 2/5 of #348) test(longhaul): add long-haul test driver core Jun 17, 2026
Copilot inline comments (16):

- writer.go: fix StartWriters doc; switch to math/rand/v2 (auto-seeded)

- main.go: header says Deployment (not Job); drop-collection gated on LONGHAUL_RESET_DATA

- upgrade.go: wrap+return error from GetCurrentDocumentDBImageTag

- checkpoint.go: bounded ctx on final emit; final PASS persists as PASS (not RUNNING)

- alert.go: escape % / CR / LF in GH Actions annotation messages

- deploy/setup.yaml: password is placeholder; plugin field uses spec.plugins.sidecarInjectorName

- Dockerfile + README: build from repo root so go.mod replace paths resolve; Go 1.25; update relationship table for shared module

Local code-review-agent (3 critical):

- C1: writer no longer pre-increments seq; advance only on success or DupKey ack, so non-DupKey failures retry the same seq instead of being mistaken for data loss

- C2: main calls reporter.EmitFinal() synchronously before exit so the authoritative final verdict reaches the longhaul-report ConfigMap

- C3: writers seed seq from FindOne(sort:-1) on startup so a Deployment-driven pod restart resumes past the prior tip without colliding with the unique (writer_id, seq) index

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Copilot <223556219+Copilot@users.noreply.github.com>
@WentingWu666666 WentingWu666666 marked this pull request as ready for review June 17, 2026 17:26

@xgerman xgerman left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: structure, duplication & terseness

Focused pass on code duplication, structure, and style (the existing Copilot threads already covered correctness/security and are addressed in 3e9ac70). The module is well-documented and go build/go vet/go test ./... all pass locally. 👍

🟠 Must fix — gofmt (CI blocker)

Four files are not gofmt-clean (extra-space alignment), so make fmt/lint will fail:
operations/scale.go, operations/scheduler.go, monitor/health.go, report/checkpoint.go.
Fix: gofmt -w operations/ monitor/ report/.

🟡 Duplication / structure (highest-value dedup)

  1. ScaleUp and ScaleDown are ~95% identical — collapse into one parameterized type (inline).
  2. main.buildRESTConfig re-implements monitor.buildRestConfig and builds a second clientset that K8sClusterClient already owns (inline).
  3. shareddb.Get(...) repeated 4× in k8sclient.go — extract a getCR helper (inline).
  4. The ticker + select{ctx.Done/ticker.C} loop recurs 5× (writer/verifier/scheduler/health/metrics) — optional shared helper (inline).

🟢 Style / terseness / dead code

  • journal.DefaultOutagePolicy() is unused (inline).
  • report.go: b.WriteString(fmt.Sprintf(...))fmt.Fprintf(&b, ...) (inline).
  • checkpoint.go: map[string]interface{}map[string]any (inline).
  • metrics.go: MetricsSnapshot doc comment should start with the type name (inline).

No blocking design issues — just the gofmt fix for CI plus the two consolidations (#1, #2) which are the biggest wins. Nice work on the durability oracle and the bounded verifier scan.

Comment thread test/longhaul/operations/scale.go Outdated
}

// ScaleDown decreases spec.instancesPerNode by 1 (HA scale dimension; range 1-3).
type ScaleDown struct {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ScaleUp (above) and ScaleDown are ~95% identical: same fields, and Precondition/Execute differ only by +1/-1, >=/<=, the bound, and a couple of constants. Consider collapsing into one parameterized type to halve the file:

type scaleOp struct {
    client    monitor.ClusterClient
    healthMon *monitor.HealthMonitor
    name      string
    weight    int
    delta     int        // +1 up, -1 down
    limit     int        // max (up) or min (down)
    atLimit   func(cur int) bool
    recovery  time.Duration
    policy    journal.OutagePolicy
}

Thin NewScaleUp/NewScaleDown constructors keep call sites unchanged.

(Also: Name()/Weight() on lines 37-38 and 98-99 fail gofmt — extra spaces.)

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in ed498df — collapsed into a shared scaleOp struct parameterized by delta/bound/boundKind/policy. ScaleUp and ScaleDown are now thin embedding wrappers so call sites and the existing NewScaleUp/NewScaleDown constructors are unchanged. Tests use small maxInstances()/minInstances() accessors instead of field access. Net: scale.go went from 136 -> ~130 lines and has zero duplication. Also gofmt-fixed (the extra-space alignment you flagged on lines 37-38 / 98-99 is gone).

Comment thread test/longhaul/cmd/longhaul/main.go Outdated
return client, clientset, nil
}

func buildRESTConfig() (*rest.Config, error) {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

buildRESTConfig here re-implements monitor.buildRestConfig (k8sclient.go:86) almost verbatim, and initK8sClient then creates a second clientset (line 231) even though K8sClusterClient already built one (k8sclient.go:59). Expose the existing one instead:

func (k *K8sClusterClient) Clientset() kubernetes.Interface { return k.clientset }

Then delete buildRESTConfig and the redundant NewForConfig — one client, one config path.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in ed498df — added K8sClusterClient.Clientset() kubernetes.Interface, deleted main.buildRESTConfig and initK8sClient entirely, and dropped the unused k8s.io/client-go/{kubernetes,rest,tools/clientcmd} imports from main.go. main.go now does the one monitor.NewK8sClusterClient(...) call and reuses clusterClient.Clientset() for the reporter. main.go shrank by ~40 lines.

Comment thread test/longhaul/monitor/k8sclient.go Outdated
// Get the DocumentDB CR status via the shared typed helper. Using
// shareddb.IsHealthy keeps the readiness predicate consistent with
// the e2e suite (single source of truth for ReadyStatus).
dd, err := shareddb.Get(ctx, k.crClient, types.NamespacedName{Namespace: k.namespace, Name: k.clusterName})

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shareddb.Get(ctx, k.crClient, types.NamespacedName{Namespace: k.namespace, Name: k.clusterName}) is repeated verbatim in 4 methods (GetClusterHealth, GetInstancesPerNode, GetCurrentDocumentDBImageTag, UpgradeDocumentDB). Extract a private helper:

func (k *K8sClusterClient) getCR(ctx context.Context) (*previewv1.DocumentDB, error) {
    return shareddb.Get(ctx, k.crClient, types.NamespacedName{Namespace: k.namespace, Name: k.clusterName})
}

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in ed498df — added private (k *K8sClusterClient) getCR(ctx) helper. All four call sites (GetClusterHealth, GetInstancesPerNode, GetCurrentDocumentDBImageTag, UpgradeDocumentDB) now use it. types/shareddb are still imported once at the top for the helper itself.

ticker := time.NewTicker(healthCheckInterval)
defer ticker.Stop()

for {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This ticker + select { <-ctx.Done() / <-ticker.C } loop is duplicated 5× across the module (here, writer.Run, verifier.Run, scheduler.Run, runMetricsSampling). Optional, since they span packages, but a tiny shared helper would DRY them:

func tick(ctx context.Context, d time.Duration, fn func(context.Context)) {
    t := time.NewTicker(d); defer t.Stop()
    for { select { case <-ctx.Done(): return; case <-t.C: fn(ctx) } }
}

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Skipped intentionally. Agree the pattern recurs 5 times, but a cross-package utility for a 5-line idiom isn't a clear win — it would mean either a new test/longhaul/internal package (overhead) or sticking it in one of the existing packages (awkward import direction). Each call site is also a Run() method with package-specific logging, so the saved lines per site are closer to 2-3. Happy to revisit if more components get added in PR-3/PR-4.

}

// DefaultOutagePolicy returns a conservative policy suitable for most operations.
func DefaultOutagePolicy() OutagePolicy {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DefaultOutagePolicy() is dead code — it's never referenced anywhere (each operation builds its OutagePolicy inline). Either use it as the base for the scale/upgrade policies or remove it.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keeping DefaultOutagePolicy() — it's referenced from 3 test files (journal/journal_test.go:70-71, journal/policy_test.go:80, operations/scheduler_test.go:38) where the tests need a "reasonable defaults" stand-in to construct fake operations / open disruption windows. Removing it would push three literal OutagePolicy{...} copies into the tests, which is worse. Open to renaming if you'd prefer something like BaselineOutagePolicy() to signal it's a test/baseline default rather than a production default.

Comment thread test/longhaul/report/report.go Outdated
b.WriteString("# Long Haul Test Report\n\n")

// Header
b.WriteString(fmt.Sprintf("**Result:** %s\n", s.Result))

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

b.WriteString(fmt.Sprintf(...)) appears ~15× in this function. fmt.Fprintf(&b, ...) is the idiomatic, terser form (a strings.Builder implements io.Writer), e.g.:

fmt.Fprintf(&b, "**Result:** %s\n", s.Result)

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in ed498df — all ~15 occurrences switched to fmt.Fprintf(&b, ...). Pure b.WriteString("literal") calls (no formatting needed) kept as-is since Fprintf would be unnecessary indirection there.

Comment thread test/longhaul/report/checkpoint.go Outdated
}

// Also log the summary as JSON for structured log consumers.
summaryJSON, _ := json.Marshal(map[string]interface{}{

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

map[string]interface{}map[string]any (Go 1.18+, and go.mod is 1.25). Minor: fmt.Printf("\n%s\n", "=== CHECKPOINT REPORT ===") on lines 93/95 can just be fmt.Println(...). Note this file also fails gofmt (the JSON map literal alignment).

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in ed498dfmap[string]any, Printf("\n%s\n", "...") collapsed to Println("\n...") / Print("...\n\n"), and gofmt -w cleaned the JSON map alignment. go vet and gofmt -l are both clean now.

Comment thread test/longhaul/workload/metrics.go Outdated
}
}

// Snapshot returns a point-in-time copy of all metric values.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doc comment is attached to type MetricsSnapshot but reads "Snapshot returns a point-in-time copy...". Per Go convention the comment should start with the identifier it documents, e.g. // MetricsSnapshot is a point-in-time copy of all metric values.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in ed498df — comment now reads // MetricsSnapshot is a point-in-time copy of all metric values.

Copilot AI added 5 commits June 17, 2026 18:17
- gofmt -w across the module (4 files were not gofmt-clean: scale.go, scheduler.go, health.go, checkpoint.go) — fixes the CI lint gate

- collapse ScaleUp/ScaleDown into a shared scaleOp struct (parameterized by delta/bound/policy); thin ScaleUp / ScaleDown wrappers keep call sites and tests unchanged

- expose K8sClusterClient.Clientset() and drop the duplicate REST-config + clientset construction in cmd/longhaul/main.go

- extract K8sClusterClient.getCR() helper; the namespaced-name lookup no longer repeats in every method

- report.go: b.WriteString(fmt.Sprintf(...)) -> fmt.Fprintf(&b, ...)

- checkpoint.go: map[string]interface{} -> map[string]any; collapse double Printf wrappers

- metrics.go: MetricsSnapshot doc comment now starts with the type name

Skipped (with rationale in PR replies):

- cross-package tick helper (xgerman marked optional; ~5 lines saved is not worth a new utility package)

- removing DefaultOutagePolicy (still used by journal/operations tests as a reasonable-defaults stand-in)

go build / go vet / go test ./... all clean; gofmt -l reports no remaining issues.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Copilot <223556219+Copilot@users.noreply.github.com>
Verifier:

- Drop NumVerifiers config knob and StartVerifiers helper; always run exactly one verifier.

- Multiple verifiers were a bug, not a feature: each one independently scanned the whole collection and wrote to the SHARED Metrics.VerifyGapsDetected counter, so a single real gap of N missing seqs was reported as N x NumVerifiers in alerts, and read load on the cluster scaled with NumVerifiers (turning the verifier's own load into a confounding signal for the test report).

- One verifier is sufficient: the scan is stateless against the DB and bounded by the per-writer nextSeq resume map, so adding more instances adds noise, not coverage.

- LONGHAUL_NUM_VERIFIERS env var removed (along with the cfg field, default, parse, validate, README row, and config_test references).

Metrics:

- Add per-field godoc on every Metrics field clarifying ack-vs-fail semantics, what triggers FAIL (verifier-side only), DupKey ack behavior, scan-cycle vs document semantics, and StartTime / Elapsed restart behavior.

- The user asked 'what is VerifyGapsDetected' — that the question came up at all is evidence the field comments were insufficient.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Metrics: one-line godoc per field; drop the multi-paragraph prose.

- Verifier: drop the dead 'id' field (only one verifier exists), update test/log sites accordingly.

- Fix incorrect nextSeq comment: once we step past a gap, expectedSeq advances to seq+1, so a late-arriving fill at the missing seq is filtered out by 'seq >= nextSeq' on subsequent cycles — gaps are counted exactly once, not re-checked.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Copilot <223556219+Copilot@users.noreply.github.com>
Without this, the verifier could miss data loss in the tail: if seqs above the verifier's last-observed point were acked then lost, and no later writes arrived to expose the gap via the per-doc check, the loss was invisible. Concretely: writer acks 1..110, DB loses 101..110, writer crashes/idles. Verifier scans seq>=101, sees nothing, and the durability oracle stays clean — false negative.

Fix: snapshot writer.Seq() at the top of each per-writer scan, bound the Find filter to seq <= that snapshot, and after the per-doc loop check whether expectedSeq <= snapshot. Any residual is tail loss; count it under VerifyGapsDetected and log a distinct 'tail loss' message. Advance nextSeq to snapshot+1 so the next cycle scans only newly committed seqs.

Race safety: reading writer.Seq() BEFORE the Find guarantees we never count in-flight writes (which commit after Seq() advances). Verified expected <= actual in steady state, so missing>0 only when seqs are truly gone.

Also drops the aggregation-based writer-id discovery (we now iterate the writers slice the verifier owns) and adds Writer.Seq() public accessor + unit test.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Copilot <223556219+Copilot@users.noreply.github.com>
Verifier: extract auditDocs as a pure function that takes the decoded docs and (expectedSeq, maxSeq) and returns (newExpectedSeq, internalGaps, tailLoss, checksumErrors) plus a list of structured findings the caller logs. This decouples the gap/tail/checksum math from cursor iteration and the journal, so it's unit-testable without a live mongo.

Writer: introduce writeBackend interface (insert + isDuplicate + highestSeq) and wrap *mongo.Collection in mongoBackend. writeOne and Resume now go through the interface, so we can stub it in tests.

Adds 11 audit table cases (no docs, clean run, gap-in-middle, gap-at-start, empty-with-tail, tail-after-docs, gap+tail, late-write-exposes-gap, single checksum, multi-checksum, nextSeq advance invariant) and 9 writer cases (success, DupKey-as-ack, transient error, retry-after-error, monotonic advance; Resume empty / has-data / on-error).

Behavior unchanged. Net: ~+200 lines test, ~+50/-40 lines source.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Copilot <223556219+Copilot@users.noreply.github.com>
@WentingWu666666 WentingWu666666 marked this pull request as draft June 18, 2026 16:29
Copilot AI added 4 commits June 18, 2026 12:29
Merge audit_test.go into verifier_test.go and writer_backend_test.go into writer_test.go. Each source file now has exactly one matching _test.go (Go convention), no test changes.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Add doc.go with the package-level overview explaining why report.go / checkpoint.go / alert.go are separate (data model vs. orchestration vs. CI surface) so the next reader doesn't have to ask.

- Add per-field godoc on Summary clarifying who populates each field, when it's empty, and the PASS/FAIL relationship to LeakAnalysis (warning only, not a verdict flip).

- Drop the narrow one-liner package doc from report.go in favor of the richer one in doc.go.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Copilot <223556219+Copilot@users.noreply.github.com>
AllowedDowntime: set by every operation's OutagePolicy() and DefaultOutagePolicy() but never read by ExceededPolicy (which is the only consumer). Misleading public API surface — readers assume it's enforced. Removed the field, all assignments, and the TODO. The downtime budget can be re-introduced together with the writer-side timestamp plumbing it would actually require.

Event.Metadata: declared, allocated, threaded through Record's signature, set to nil at every Info/Warn/Error call site, and never read. Dropped the field and simplified Record(level, component, message).

No behavior change. All callers were updated; tests still pass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Copilot <223556219+Copilot@users.noreply.github.com>
On a multi-day run the in-memory events slice grew unbounded — ~200 events/day baseline (op open/close + verifier batch errors) means ~6k after a month, harmless on its own but Events() returns a full slice copy on every checkpoint (~30s), so allocation/GC pressure scales with run length. Cap the ring at 10k with 1k headroom so the trim is amortized to one copy every ~1k appends, not paid on every append.

The rendered report only ever surfaces the last 20 events so the cap is invisible to consumers.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Copilot <223556219+Copilot@users.noreply.github.com>
@WentingWu666666 WentingWu666666 marked this pull request as ready for review June 18, 2026 16:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

dependencies Pull requests that update a dependency file documentation Improvements or additions to documentation enhancement New feature or request test

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants