Skip to content

feat(snapshot-skip B2): plumb metaAppliedIndex through raft-Apply + both snapshot persist sites#915

Merged
bootjp merged 11 commits into
mainfrom
impl/snapshot-skip-b2-meta-key
Jun 4, 2026
Merged

feat(snapshot-skip B2): plumb metaAppliedIndex through raft-Apply + both snapshot persist sites#915
bootjp merged 11 commits into
mainfrom
impl/snapshot-skip-b2-meta-key

Conversation

@bootjp
Copy link
Copy Markdown
Owner

@bootjp bootjp commented Jun 3, 2026

Summary

Implements Branch 2 of the cold-start snapshot-restore skip optimisation designed in PR #910. After this lands the metaAppliedIndex Pebble meta key is durably written on every raft-Apply data mutation AND at every snapshot persist — but the skip gate itself (Branch 3) is NOT yet wired, so behaviour is observationally identical to main except for the new meta key in fsm.db. Branch 2 is meant to soak in production for at least one release before Branch 3 enables the skip; this PR is intentionally a no-op-from-the-outside change with comprehensive plumbing.

Reading order (6 commits, designed to review one-at-a-time)

# commit scope
1 2339a6f2 raftengine: opt-in interfaces (AppliedIndexReader / AppliedIndexWriter)
2 525fc152 pebbleStore: metaAppliedIndex const + LastAppliedIndex + SetDurableAppliedIndex (with pebble.Sync UNCONDITIONALLY)
3 aa9b8acc MVCCStore interface extension: ApplyMutationsRaftAt / DeletePrefixAtRaftAt overloads, threading appliedIndex through applyMutationsWithOpts + deletePrefixAtWithOpts
4 7cd72bda kvFSM seam wiring: AppliedIndexReader() / SetDurableAppliedIndex() accessors + all 7 data-Apply leaves switched to *RaftAt with f.pendingApplyIdx
5 f1e8748c engine hooks at BOTH snapshot persist sites: persistCreatedSnapshot + e.persistLocalSnapshotPayload call SetDurableAppliedIndex BEFORE persist.SaveSnap
6 2c42f7d6 tests (10 new tests across store + engine)

Design constraints honoured

All from docs/design/2026_06_02_idempotent_snapshot_restore.md:

  • §2 "Why both leaves": meta key bundle in BOTH applyMutationsWithOpts AND deletePrefixAtWithOpts so DEL_PREFIX entries don't silently leave LastAppliedIndex behind. Tested by TestDeletePrefixAtRaftAt_BundlesMetaAppliedIndex.
  • §3 dbMu.RLock(): both LastAppliedIndex and SetDurableAppliedIndex acquire the read-lock, matching the lock-ordering discipline at lsm_store.go:153 / :553 / :675.
  • §4 fallback policy: AppliedIndexReader() returns nil when the store doesn't implement the seam; LastAppliedIndex returns (0, false, nil) for missing OR truncated meta key. Branch 3 will then fall back to full restore conservatively.
  • §6 ELASTICKV_FSM_SYNC_MODE=nosync mode: SetDurableAppliedIndex is pinned to pebble.Sync unconditionally. Rationale documented at length in the method's doc-comment — once persist.SaveSnap returns, WAL compaction discards every log entry ≤ snap.Metadata.Index, so there's no source to replay the meta key bump from. +1 fsync per snapshot persist (rare; default SnapshotCount=10000). Tested by TestSetDurableAppliedIndex_UsesPebbleSync.
  • §6 "HLC lease entries — checkpoint at snapshot persist": BOTH persistCreatedSnapshot (config snapshots) AND e.persistLocalSnapshotPayload (steady-state SnapshotCount-triggered hot path) call the hook. Both crash-ordering tested by TestPersistCreatedSnapshot_*.
  • §8 compatibility: StateMachine.Apply's public signature is unchanged. New interfaces are opt-in. Old call sites (ApplyMutationsRaft without *At) still work, just pass appliedIndex=0 to opt out of the meta key bump.

Test results

go vet ./...                                                    → 0 issues
go test ./store/ -short                                         → ok 29.4s
go test ./kv/    -short                                         → ok 10.4s
go test ./internal/raftengine/... -short                        → ok 32.8s
go test ./store/ -run 'TestLastAppliedIndex|TestSetDurable...|TestApply...|TestDelete...' → ok 1.6s
go test ./internal/raftengine/etcd/ -run 'TestRecording|TestPersistCreatedSnapshot_' → ok 0.03s

10 new tests added (see commit 2c42f7d6 for the full inventory).

What this does NOT do

  • Does NOT enable the skip gate. restoreSnapshotState still always restores. Branch 3 wires the fsmAlreadyAtIndex check + applyHeaderStateOnSkip + the two-phase SnapshotHeaderApplier seam.
  • Does NOT change HEALTH_TIMEOUT_SECONDS=300. Branch 4 lowers it once Branch 3 has soaked.
  • Does NOT touch the snapshot-install hot path (Engine.applySnapshot) per Non-Goals in the design.

Soak plan

Branch 2 should run in production for at least one release before Branch 3 opens. Operators can verify the meta key is being written via:

# Inspect a pebble fsm.db (read-only)
ldb --db=/var/lib/elastickv/n3/fsm.db get '_meta_applied_index' --hex
# Expected: 8 little-endian bytes equal to the current applied index

Refs

Summary by CodeRabbit

  • New Features

    • Durable tracking of Raft-applied indexes to ensure consistent snapshot/save ordering.
  • Bug Fixes

    • Improved snapshot persistence reliability by pinning durable applied index before snapshot writes.
    • Stronger durability for writes bundled with Raft entry indices, reducing restore/recovery surprises.
  • Tests

    • Added comprehensive tests covering applied-index ordering, failure handling, and persistence behavior.

bootjp added 6 commits June 3, 2026 21:10
Two new opt-in extensions to raftengine.StateMachine for the
cold-start snapshot-restore skip optimisation
(docs/design/2026_06_02_idempotent_snapshot_restore.md):

  - AppliedIndexReader.LastAppliedIndex() (uint64, bool, error) lets
    the engine query the FSM's durable applied-index for the
    fsmAlreadyAtIndex gate. (0, false, nil) means 'missing' and falls
    back to the full restore path. Any error MUST NOT abort cold
    start; the caller collapses (false, _, err) into 'restore'.

  - AppliedIndexWriter.SetDurableAppliedIndex(idx uint64) error lets
    the engine pin the FSM's durable applied-index to
    snap.Metadata.Index before calling persist.SaveSnap, so a
    successful snapshot persist always implies
    LastAppliedIndex >= snapshot.Metadata.Index. Implementations
    MUST use pebble.Sync (or equivalent) unconditionally — the
    checkpoint is the only durable carrier of the meta key at this
    point, because WAL compaction starts at the snapshot index after
    SaveSnap.

Interface only; no implementations or wirings yet. Branch 2 of the
design will:
  - Add metaAppliedIndex Pebble meta key + pebbleStore impls
  - Thread f.pendingApplyIdx through ApplyMutationsRaftAt /
    DeletePrefixAtRaftAt overloads into both batch-bundle sites
  - Wire kvFSM accessors that forward to the underlying pebbleStore
  - Hook persistCreatedSnapshot AND e.persistLocalSnapshotPayload
    (both snapshot persist sites)

Refs PR #910 (design), part of Branch 2 implementation series.
Two new methods on pebbleStore for the cold-start snapshot-restore
skip optimisation (PR #910 design, Branch 2 step 2):

  - metaAppliedIndex / metaAppliedIndexBytes — new Pebble meta key,
    'sibling' to metaLastCommitTS in layout and locking discipline.
    Bundled in the same WriteBatch as the data mutation by the
    raft-Apply path (next commit).

  - isPebbleMetaKey extended to include the new key so the snapshot
    encoder skips it (matches the precedent for the other meta keys).

  - LastAppliedIndex() (uint64, bool, error) — read under dbMu.RLock.
    (0, false, nil) for absent / truncated meta key so callers fall
    back to the full restore path. ErrNotFound is the expected outcome
    on a pre-upgrade fsm.db or freshly restored store; anything else
    propagates as a real I/O error (the caller in wal_store.go
    fsmAlreadyAtIndex collapses to 'false → restore' regardless, but
    this layer surfaces the typed error so other consumers can
    distinguish).

  - SetDurableAppliedIndex(idx) error — write under dbMu.RLock with
    pebble.Sync UNCONDITIONALLY (even under ELASTICKV_FSM_SYNC_MODE=
    nosync). Rationale documented inline: the snapshot-persist
    checkpoint is the only durable carrier of the meta key at this
    point because WAL compaction following persist.SaveSnap discards
    every log entry <= snap.Metadata.Index — so an OS-buffered Pebble
    flush dropped on crash would leave metaAppliedIndex behind the
    snapshot pointer permanently. +1 fsync per snapshot persist (rare;
    default SnapshotCount=10000) is the right price.

These are the pebbleStore-side implementations of
raftengine.AppliedIndexReader and raftengine.AppliedIndexWriter
added in the preceding commit. The kvFSM accessors that delegate to
these will land in a follow-up step alongside the engine hook sites.

Refs PR #910 (design), part of Branch 2 implementation series.
Plumbing for the raft entry index to reach the leaf pebble.Batch so
metaAppliedIndex can be bundled atomically with the data mutation
(PR #910 design §2, Branch 2 step 3).

Public surface (store.MVCCStore extension):
  - ApplyMutationsRaftAt(ctx, muts, readKeys, startTS, commitTS,
    appliedIndex)
  - DeletePrefixAtRaftAt(ctx, prefix, exclude, commitTS, appliedIndex)

appliedIndex==0 means 'no index, do not bump the meta key' so
existing callers that still use the non-At variants keep their
current behaviour byte-for-byte.

Implementations:
  - pebbleStore: bundles setPebbleUint64InBatch(metaAppliedIndex) in
    both applyMutationsWithOpts and deletePrefixAtWithOpts when
    appliedIndex > 0. +16 bytes per batch, zero extra fsync.
  - mvccStore: in-memory store discards appliedIndex. LastAppliedIndex
    on this backend would always be 'missing'; OK because it is
    only used by tests / catalog bootstrap.
  - LeaderRoutedStore: forwards to local.
  - ShardStore.ApplyMutationsRaftAt: routes to single owning shard.
  - ShardStore.DeletePrefixAtRaftAt: broadcasts to every group; the
    single-group production raft-apply path gets correct bundling;
    the multi-group broadcast case is admin-driven FLUSHALL only,
    never raft-applied.

No call sites are switched over yet; the next commit threads
f.pendingApplyIdx into the kvFSM data-Apply path through these
overloads.

Tests: go vet ./store/ ./kv/ ./internal/raftengine/... clean;
go test ./store/ -short ok 29.4s.

Refs PR #910 (design), part of Branch 2 implementation series.
Wires the raft applied-index seam through the kvFSM data-Apply path so
metaAppliedIndex is bundled atomically with every data mutation
(PR #910 design §2 + §3, Branch 2 step 4).

New methods on kvFSM:

  - AppliedIndexReader() raftengine.AppliedIndexReader — exposes the
    underlying store's reader when it implements the seam (pebbleStore
    does; in-memory mvccStore does not). nil propagates to the
    fsmAlreadyAtIndex caller and triggers the conservative
    full-restore fallback.

  - SetDurableAppliedIndex(idx) error — forwards to the underlying
    store's writer seam (pebbleStore does pebble.Sync; the in-memory
    store's no-op default is preserved). The engine calls this at
    each snapshot persist site BEFORE persist.SaveSnap.

Threaded f.pendingApplyIdx through every kv/fsm.go data-Apply leaf
(7 call sites):

  - handleNonTxnRequest → ApplyMutationsRaftAt
  - handleDelPrefix → DeletePrefixAtRaftAt
  - handlePrepareRequest (single-shard fast path)
  - handleOnePhaseTxnRequest (with-readkeys path)
  - applyCommitWithIdempotencyFallback (commit path)
  - applyCommitWithIdempotencyFallback (idempotency replay fallback)
  - handleAbortRequest (rollback marker write)

Each site passes f.pendingApplyIdx — the value the engine stashed
via SetApplyIndex immediately before Apply through the
raftengine.ApplyIndexAware seam. Engine-bypassing tests that drive
Apply directly without setting the index will pass 0, which the
leaf treats as 'do not bump the meta key' — keeping their behaviour
byte-for-byte identical to before.

Encryption opcode dispatch is unchanged: applyReservedOpcode still
threads pendingApplyIdx into applyEncryption (its existing consumer
for the WriteSidecar RaftAppliedIndex slot). The meta key bump
above is purely additive on the data-Apply leaves; encryption-only
entries continue to leave metaAppliedIndex unchanged (PR #910
design §6 'encryption opcodes' subsection).

Tests: go vet ./kv/ ./store/ ./internal/raftengine/... clean;
go test ./kv/ -short ok 10.4s.

Refs PR #910 (design), part of Branch 2 implementation series.
Two new hooks pinning metaAppliedIndex to the snapshot index BEFORE
the corresponding persist.SaveSnap call. After a successful snapshot
persist, LastAppliedIndex >= snap.Metadata.Index holds
unconditionally (PR #910 design §6, Branch 2 step 5).

Site 1 - persistCreatedSnapshot (engine.go:2679) — drives
config snapshots (membership-change snapshots).

Site 2 - e.persistLocalSnapshotPayload (engine.go:4048) — the
steady-state SnapshotCount-triggered hot path:
  maybePersistLocalSnapshot
   -> e.persistLocalSnapshotPayload (this method)
   -> free persistLocalSnapshotPayload (wal_store.go:519)
   -> persist.SaveSnap (wal_store.go:524)
The hook lives in the engine wrapper, not the free function, so
the free function stays signature-stable and the call holds
e.snapshotMu.Lock() before invoking SetDurableAppliedIndex
(serializing against follower-snapshot restore).

Both sites call e.fsm.(raftengine.AppliedIndexWriter) and silently
no-op when the FSM does not implement the seam (legacy test fakes,
in-memory backends). pebble.Sync is forced on the writer side
(lsm_store.SetDurableAppliedIndex inline comment) regardless of
ELASTICKV_FSM_SYNC_MODE — the checkpoint is the only durable
carrier of metaAppliedIndex at this point because WAL compaction
following SaveSnap discards every entry <= snap.Metadata.Index.

Crash ordering (matches design §6 table): the bump is fsynced
before SaveSnap returns. The only observable states are
  (metaAppliedIndex, snapshot pointer) in
  {(Y, X'<X), (X, X'), (X, X), (Z>X, X)}.
None of these can yield 'snapshot pointer = X but metaAppliedIndex
< X'. Over-restore impossible; round-3 P2 + round-4 P2 closed.

Tests: go vet ./internal/raftengine/etcd/ ./kv/ ./store/ clean;
go test ./internal/raftengine/... -short ok 32.8s.

Refs PR #910 (design), part of Branch 2 implementation series.
…ring

PR #910 design B2 test list, step 6 of the implementation series.

store/lsm_store_applied_index_test.go (7 tests):

  - TestLastAppliedIndex_MissingMetaKey — fresh store reports
    (0, false, nil); strictly-additive fallback to full restore
    is intact for pre-upgrade fsm.db files.

  - TestSetDurableAppliedIndex_RoundTrip — round-trips an arbitrary
    uint64 (0xDEAD_BEEF_CAFE_F00D) through Set + Get.

  - TestApplyMutationsRaftAt_BundlesMetaAppliedIndex — verifies the
    leaf bundles metaAppliedIndex when appliedIndex > 0, AND the
    data mutation also lands (sanity check the bundling did not
    drop the data path).

  - TestApplyMutationsRaftAt_ZeroIndexLeavesMetaKey — escape hatch:
    appliedIndex=0 MUST NOT touch metaAppliedIndex (preserves
    ApplyMutationsRaft byte-compat for callers not yet wired).

  - TestDeletePrefixAtRaftAt_BundlesMetaAppliedIndex — analogous
    round-trip for the DEL_PREFIX leaf, which builds its own
    pebble.Batch separate from applyMutationsWithOpts.

  - TestSetDurableAppliedIndex_UsesPebbleSync — sets
    ELASTICKV_FSM_SYNC_MODE=nosync, writes via the checkpoint,
    closes + reopens the store, asserts the value persists. Black-
    box regression guard for the pebble.Sync-unconditionally claim
    (a NoSync write would not deterministically survive a
    Close+Open).

  - TestLastAppliedIndex_CorruptValue — directly writes a 2-byte
    payload under the meta key and verifies LastAppliedIndex
    collapses to (0, false, nil) rather than propagating an error.
    Keeps cold start safe under partial-write corruption.

internal/raftengine/etcd/engine_applied_index_test.go (3 tests):

  - TestRecordingFSM_SatisfiesAppliedIndexWriter — compile-time
    adjacency check that the recording FSM implements the writer
    seam.

  - TestPersistCreatedSnapshot_BumpsAppliedIndex — drives the
    persistCreatedSnapshot hook directly, asserts the ordered
    event stream is exactly [bump@42, save@42].

  - TestPersistCreatedSnapshot_NilFSMNoOp — legacy FSM (no writer
    seam) path: SaveSnap still runs, no bump emitted; preserves
    backward compat with test fakes.

  - TestPersistCreatedSnapshot_BumpErrorAborts — if
    SetDurableAppliedIndex returns an error, the engine MUST
    surface it AND NOT call SaveSnap. Preserves the crash-ordering
    invariant 'snapshot pointer never durable without
    metaAppliedIndex first'.

go.mod: github.com/coreos/go-semver promoted to direct (was indirect
because the test stubs out etcdstorage.Storage.MinimalEtcdVersion
which returns *semver.Version).

Refs PR #910 (design), B2 step 6 — final step before opening PR.

All tests: ok 0.028s (engine) + 1.6s (store).
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Jun 3, 2026

Review Change Stack

Warning

Review limit reached

@bootjp, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 17 minutes and 6 seconds. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 75a5b008-6d08-4fbe-aa9a-0b42a162c90c

📥 Commits

Reviewing files that changed from the base of the PR and between 7001a93 and 3951337.

📒 Files selected for processing (7)
  • docs/design/2026_06_02_idempotent_snapshot_restore.md
  • internal/raftengine/etcd/engine.go
  • internal/raftengine/etcd/engine_applied_index_test.go
  • kv/fsm.go
  • kv/fsm_applied_index_iface_check.go
  • store/lsm_store.go
  • store/lsm_store_applied_index_test.go
📝 Walkthrough

Walkthrough

Adds durable raft applied-index support end-to-end: optional FSM reader/writer interfaces, MVCCStore raft-at APIs, Pebble metaAppliedIndex with atomic bundling, kvFSM/store/engine plumbing to bump the durable index before snapshot persist, and tests plus a dependency promotion.

Changes

Durable Applied Index Tracking

Layer / File(s) Summary
Applied index interfaces and MVCC API
internal/raftengine/statemachine.go, store/store.go
Adds AppliedIndexReader/AppliedIndexWriter interfaces and MVCCStore methods ApplyMutationsRaftAt / DeletePrefixAtRaftAt accepting appliedIndex.
Pebble meta key and store implementation
store/lsm_store.go
Introduces metaAppliedIndex meta key, LastAppliedIndex() and SetDurableAppliedIndex(), and raft-aware ApplyMutationsRaftAt/DeletePrefixAtRaftAt that bundle the meta key atomically when appliedIndex>0.
Pebble store applied-index tests
store/lsm_store_applied_index_test.go
Tests cover missing meta key, round-trip persistence, bundling with mutations and deletes, zero-index no-op, Sync persistence across restarts, and corrupt meta handling.
Store wrappers and aggregation
store/mvcc_store.go, kv/leader_routed_store.go, kv/shard_store.go
In-memory store forwards/ignores appliedIndex; leader-routed store forwards new methods and exposes LastAppliedIndex/SetDurableAppliedIndex; shard store routes/broadcasts appliedIndex and aggregates per-group meta.
kvFSM applied-index plumbing
kv/fsm.go
kvFSM exposes AppliedIndexReader()/SetDurableAppliedIndex() and passes pendingApplyIdx into all mutation/delete apply paths via *RaftAt variants.
Engine snapshot bump-before-save
internal/raftengine/etcd/engine.go
Adds helper to call SetDurableAppliedIndex(index) before SaveSnap() on created and local snapshot persist paths; refactors local persist result handling and purging logic.
Engine snapshot applied-index tests
internal/raftengine/etcd/engine_applied_index_test.go
Tests assert "bump" occurs before "save", legacy no-op behavior when FSM lacks writer, and abort-on-bump-error for both snapshot sites using recording doubles.
Dependency promotion
go.mod
github.com/coreos/go-semver v0.3.1 moved from indirect to direct require block.

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • bootjp/elastickv#910: Implements the design for bumping durable applied index before snapshot persists and related snapshot-restore gating.

Poem

🐰 I hop through batches, index in paw,

I pin each snapshot with careful saw,
Meta key bundled, safe in the ground,
Raft's little promise — solid and sound.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title directly and specifically describes the main change: plumbing metaAppliedIndex through raft-Apply and both snapshot persist sites, which aligns with the PR's core objective to add durable applied-index tracking.
Docstring Coverage ✅ Passed Docstring coverage is 88.89% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch impl/snapshot-skip-b2-meta-key

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 3, 2026

TLA+ spec divergence review (auto-triggered)

This PR touches files that the TLA+ safety spec has an anchor on (per
docs/design/2026_05_28_partial_tla_safety_spec.md §3),
so an AI review is requested below to verify the implementation has not drifted
from the model.

Anchored files changed in this PR head (2c42f7d):

  • kv/fsm.go
  • store/mvcc_store.go

What to check, by subsystem:

  • kv/hlc*.goNext() must respect the HLC-4 preconditions (i)/(ii)/(iii) from the design doc: bounded skew, logical-counter handoff on leader change (strategy (c) Observe(MaxAppliedHLC)), and the commit-time ceiling fence (fail-closed when wall_now >= physicalCeiling). Any change to the bit layout (48/16), the CAS loop, or the ceiling getter/setter is in scope.
  • kv/coordinator.go, kv/sharded_coordinator.goRunHLCLeaseRenewal, hlcRenewalInterval, hlcPhysicalWindowMs constants, and the new-term detection that calls Observe(fsm.MaxAppliedHLC()) (strategy (c)). Any change to renewal cadence, group selection, or fail-closed behaviour is in scope.
  • kv/transaction.go, kv/lock_resolver.go — OCC commit-ts assignment, lock-map encoding (key, lock_ts) -> start_ts, and the LockResolver action OCC-3 depends on. (M2 spec will land OCC-1..OCC-5; until then the spec doc §5.2 is the contract.)
  • kv/fsm.go — FSM apply of HLC lease entries (SetPhysicalCeiling), and any future MaxAppliedHLC() accessor that strategy (c) needs.
  • store/mvcc_store.go — version visibility, snapshot install, and the MVCC-1..MVCC-4 invariants (M3 scope).
  • distribution/** — route catalog versioning, SplitRange atomicity, and CatalogWatcher async fan-out (M4 scope).

If the change is correct but requires a spec update, edit tla/hlc/HLC.tla (or the corresponding M2..M5 module once landed) and the design doc in the same PR. The tla-check workflow runs the TLC model check on the same paths.


@claude review please verify TLA+ spec divergence per the checklist above.

@codex review please verify TLA+ spec divergence per the checklist above.

@chatgpt-codex-connector
Copy link
Copy Markdown

To use Codex here, create a Codex account and connect to github.

@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Jun 3, 2026

@claude review

PR #910 design Branch 2 implementation. 6 commits, no behaviour change yet (skip gate lands in B3); B2 is the soak phase.

Review checklist for this PR:

  • §2 "why both leaves": is the metaAppliedIndex bundle present in BOTH applyMutationsWithOpts (1324) AND deletePrefixAtWithOpts (1393)? (verified with grep + tests)
  • §3 dbMu.RLock(): LastAppliedIndex + SetDurableAppliedIndex both acquire the read-lock matching the discipline at lsm_store.go:153
  • §4 strictly-additive fallback: AppliedIndexReader() returns nil for non-seam stores; LastAppliedIndex returns (0, false, nil) for missing OR truncated; never propagates an error that would abort cold start
  • §6 nosync durability: SetDurableAppliedIndex uses pebble.Sync UNCONDITIONALLY (not raftApplyWriteOpts()) — the rationale is in the method comment; round-7 retraction in design doc explains why
  • §6 both persist sites hooked: persistCreatedSnapshot (engine.go:2679) AND e.persistLocalSnapshotPayload (engine.go:4048), BEFORE persist.SaveSnap each
  • crash-ordering: TestPersistCreatedSnapshot_BumpErrorAborts verifies that a failed bump prevents SaveSnap (no observable state where 'snapshot pointer durable, metaAppliedIndex stale')
  • semantic-change audit: all 7 f.store.ApplyMutationsRaft / DeletePrefixAtRaft call sites in kv/fsm.go switched to *RaftAt with f.pendingApplyIdx; the existing non-At variants stay backwards-compatible (appliedIndex=0 no-ops the meta key bump)

PR description has full reading order + soak plan + ldb verification command.

@claude
Copy link
Copy Markdown

claude Bot commented Jun 3, 2026

Claude finished @bootjp's task in 4m 16s —— View job


PR #915 Review — Branch 2: metaAppliedIndex plumbing

Review checklist:

  • §2 both leaves: metaAppliedIndex bundle in BOTH applyMutationsWithOpts AND deletePrefixAtWithOpts
  • §3 dbMu.RLock(): LastAppliedIndex + SetDurableAppliedIndex lock discipline
  • §4 fallback: AppliedIndexReader() nil for non-seam stores; LastAppliedIndex (0,false,nil) for missing/truncated
  • §6 nosync durability: SetDurableAppliedIndex uses pebble.Sync unconditionally
  • §6 both persist sites hooked: persistCreatedSnapshot AND persistLocalSnapshotPayload, BEFORE SaveSnap
  • Crash-ordering: TestPersistCreatedSnapshot_BumpErrorAborts prevents SaveSnap on bump error
  • Semantic-change audit: all 7 FSM call sites switched to *RaftAt with f.pendingApplyIdx
  • TLA+ spec divergence check
  • General code quality + edge cases

§2 — Both leaves ✅

applyMutationsWithOpts (lsm_store.go:1438) and deletePrefixAtWithOpts (lsm_store.go:1531) both guard the bundle with appliedIndex > 0. The conditional is identical in both callees so the escape hatch (non-raft callers pass 0) is consistent across both code paths.

setPebbleUint64InBatch uses pebble.NoSync on the individual batch.Set call; durability comes from batch.Commit(writeOpts) — correct Pebble semantics, the per-set opts are irrelevant once committed.

§3 — Lock discipline ✅

Both LastAppliedIndex (lsm_store.go:583-584) and SetDurableAppliedIndex (lsm_store.go:629-630) open with s.dbMu.RLock() / defer s.dbMu.RUnlock(). This matches the read-path lock ordering at :153 / :675. Neither takes mtx, which is correct — metaAppliedIndex is not an in-memory field that needs mtx coherence.

One nuance: applyMutationsWithOpts writes metaAppliedIndex while holding both dbMu.RLock() AND mtx.Lock() (the latter guards lastCommitTS in the same batch). That's fine — the lock ordering is respected (dbMu before mtx).

§4 — Fallback policy ✅

  • kvFSM.AppliedIndexReader() (fsm.go:141-145): type-asserts f.store to raftengine.AppliedIndexReader, returns nil on failure (non-seam store). ✅
  • pebbleStore.LastAppliedIndex() (lsm_store.go:586-600): pebble.ErrNotFound → (0, false, nil); len(val) < 8 → (0, false, nil). Both cases return "missing" and never surface an error that would abort cold start. ✅
  • TestLastAppliedIndex_CorruptValue covers the truncated-value branch. ✅
  • The in-memory mvccStore.ApplyMutationsRaftAt silently discards appliedIndex and delegates to ApplyMutations. LastAppliedIndex on the in-memory store would always return missing (the seam isn't implemented), so Branch 3 will always fall back to full restore for in-memory backends — the right behaviour for tests and the catalog bootstrap path. ✅

§6 — nosync durability ✅

SetDurableAppliedIndex (lsm_store.go:635):

return errors.WithStack(batch.Commit(pebble.Sync))

Hardcoded pebble.Sync, not s.raftApplyWriteOpts(). The rationale (WAL compaction discards the log after SaveSnap, leaving no replay source) is documented both in the inline comment and the design doc. TestSetDurableAppliedIndex_UsesPebbleSync exercises this with ELASTICKV_FSM_SYNC_MODE=nosync. ✅

§6 — Both persist sites hooked BEFORE SaveSnap ✅

Site 1 — persistCreatedSnapshot (engine.go:2694-2699):

if w, ok := e.fsm.(raftengine.AppliedIndexWriter); ok {
    if err := w.SetDurableAppliedIndex(snap.Metadata.Index); err != nil {
        return errors.WithStack(err)  // ← aborts; SaveSnap never reached
    }
}
if err := e.persist.SaveSnap(snap); err != nil { ...

Ordering: bump → SaveSnap. Error return prevents SaveSnap. ✅

Site 2 — persistLocalSnapshotPayload (engine.go:4075-4081):

if w, ok := e.fsm.(raftengine.AppliedIndexWriter); ok {
    if err := w.SetDurableAppliedIndex(index); err != nil {
        return errors.WithStack(err)  // ← aborts; free-function never reached
    }
}
_, err = persistLocalSnapshotPayload(e.storage, e.persist, index, payload)

The free-function call (persistLocalSnapshotPayload) internally calls persist.SaveSnap. Ordering: bump → SaveSnap. Error return prevents SaveSnap. ✅

Crash-ordering test ✅

TestPersistCreatedSnapshot_BumpErrorAborts (engine_applied_index_test.go:137-148): injects failNext: true into recordingAppliedIndexFSM. SetDurableAppliedIndex returns the error WITHOUT recording to rec. The test asserts rec.snapshot() is empty — confirming neither "bump" nor "save" was recorded. This proves the crash invariant (no state where snapshot pointer is durable but meta key is stale). ✅

Semantic-change audit ✅

All 7 data-Apply leaves in kv/fsm.go use f.pendingApplyIdx:

call site function line
1 handleRawRequestApplyMutationsRaftAt :492
2 handleDelPrefixDeletePrefixAtRaftAt :509
3 handlePrepareRequestApplyMutationsRaftAt :767
4 handleOnePhaseTxnRequestApplyMutationsRaftAt :831
5 applyCommitWithIdempotencyFallback (first attempt) → ApplyMutationsRaftAt :935
6 applyCommitWithIdempotencyFallback (idempotency retry) → ApplyMutationsRaftAt :952
7 handleAbortRequestApplyMutationsRaftAt :1003

Non-data leaves (applyHLCLease, applyEncryption) correctly don't call any store mutation, so there's nothing to switch. The raftEncodeHLCLease path applies only to the HLC ceiling — no store write. This is also why Site 2 (persistLocalSnapshotPayload) exists: HLC lease entries advance the snapshot index without bumping metaAppliedIndex via the Apply path.

TLA+ spec divergence ✅

Anchored files kv/fsm.go and store/mvcc_store.go are changed, but the changes are purely additive plumbing:

  • kv/fsm.go: HLC-4 strategy (c) Observe(commitTS) at :452 is unchanged. SetPhysicalCeiling apply path is unchanged. FSM Apply() signature is unchanged. The B2 additions (AppliedIndexReader(), SetDurableAppliedIndex(), pendingApplyIdx threading) don't touch any spec-anchored invariant.
  • store/mvcc_store.go: ApplyMutationsRaftAt / DeletePrefixAtRaftAt on mvccStore discard appliedIndex and delegate to ApplyMutations. MVCC-1..MVCC-4 visibility logic is unaffected.
  • OCC, HLC, transaction paths: unchanged.

No TLA+ spec update required.

go.mod

github.com/coreos/go-semver promoted from indirect to direct. Correct: engine_applied_index_test.go directly imports github.com/coreos/go-semver/semver for the MinimalEtcdVersion() *semver.Version stub on recordingPersistStorage.


Issues and observations

1. Missing test for Site 2 (persistLocalSnapshotPayload) crash-ordering ⚠️

TestPersistCreatedSnapshot_BumpErrorAborts covers Site 1. Site 2 has identical bump-before-save logic but no analogous test. The PR description says "both crash-ordering tested by TestPersistCreatedSnapshot_*" but those tests only reach persistCreatedSnapshot. A TestPersistLocalSnapshotPayload_BumpErrorAborts exercising the goroutine method directly (similar to the Site 1 test, but calling e.persistLocalSnapshotPayload with a failing FSM) would close this gap.

Fix this →

2. SetDurableAppliedIndex can decrease metaAppliedIndex (non-monotonic) — design note

When persistLocalSnapshotPayload(M) runs concurrently with in-flight applies that have already written metaAppliedIndex = N > M, the checkpoint call writes M < N. This is not a correctness bug for Branch 3 (the skip invariant LastAppliedIndex >= snap.Metadata.Index still holds for the snapshot at M — both sides are M), and raft log replay covers entries M+1..N on the next cold start. However, the comment on SetDurableAppliedIndex ("last-writer-wins always satisfies have >= want") is slightly misleading here — it only holds for the snapshot's own invariant, not for the global monotonicity of the key. Worth documenting this explicitly so Branch 3's implementor doesn't inadvertently assume the key is always non-decreasing.

3. ShardStore.DeletePrefixAtRaftAt multi-group caveat — documentation-only risk

The comment at shard_store.go:1294-1301 says tests that exercise ShardStore.DeletePrefixAtRaftAt across multiple groups MUST pass appliedIndex=0, but this is a contract enforced only by convention. There's no guard. Given that the production FSM never uses ShardStore as its direct f.store (each FSM has a single-group store), this is fine — just worth flagging for anyone adding tests later.


Summary

All 9 design constraints are correctly implemented and verified. The implementation is clean and the test coverage is solid. The one material gap is a missing test for Site 2 (persistLocalSnapshotPayload) crash-ordering; everything else is well-covered. The non-monotonic write nuance (#2) is safe for Branch 3 but worth a clarifying comment. Ready to land subject to closing the Site 2 test gap.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a mechanism to track and persist the durable raft-applied index (metaAppliedIndex) in the Pebble-backed state machine to enable a cold-start skip optimization for snapshot restores. It adds AppliedIndexReader and AppliedIndexWriter interfaces, implements them in kvFSM and pebbleStore, and updates the write paths to atomically bundle the applied index with mutations. Feedback highlights that wrapper stores like LeaderRoutedStore and ShardStore must also implement these interfaces to prevent the skip optimization from silently no-oping in production. Additionally, it is recommended to ensure monotonic updates in SetDurableAppliedIndex using atomic operations to prevent the durable index from going backwards on disk.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread kv/leader_routed_store.go
Comment on lines +360 to +365
func (s *LeaderRoutedStore) DeletePrefixAtRaftAt(ctx context.Context, prefix []byte, excludePrefix []byte, commitTS, appliedIndex uint64) error {
if s == nil || s.local == nil {
return errors.WithStack(store.ErrNotSupported)
}
return errors.WithStack(s.local.DeletePrefixAtRaftAt(ctx, prefix, excludePrefix, commitTS, appliedIndex))
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The LeaderRoutedStore wrapper does not implement raftengine.AppliedIndexReader or raftengine.AppliedIndexWriter. Since kvFSM type-asserts its underlying store to these interfaces, using LeaderRoutedStore in production (e.g., single-raft-group deployments) will cause the skip optimization to silently no-op and always fall back to full restore.

To fix this, LeaderRoutedStore should implement both interfaces by forwarding the calls to s.local using duck-typing (to avoid importing raftengine if desired).

func (s *LeaderRoutedStore) DeletePrefixAtRaftAt(ctx context.Context, prefix []byte, excludePrefix []byte, commitTS, appliedIndex uint64) error {
	if s == nil || s.local == nil {
		return errors.WithStack(store.ErrNotSupported)
	}
	return errors.WithStack(s.local.DeletePrefixAtRaftAt(ctx, prefix, excludePrefix, commitTS, appliedIndex))
}

func (s *LeaderRoutedStore) LastAppliedIndex() (uint64, bool, error) {
	if s == nil || s.local == nil {
		return 0, false, nil
	}
	if r, ok := s.local.(interface {
		LastAppliedIndex() (uint64, bool, error)
	}); ok {
		return r.LastAppliedIndex()
	}
	return 0, false, nil
}

func (s *LeaderRoutedStore) SetDurableAppliedIndex(idx uint64) error {
	if s == nil || s.local == nil {
		return nil
	}
	if w, ok := s.local.(interface {
		SetDurableAppliedIndex(uint64) error
	}); ok {
		return w.SetDurableAppliedIndex(idx)
	}
	return nil
}

Comment thread kv/shard_store.go
Comment on lines +1285 to +1305
func (s *ShardStore) DeletePrefixAtRaftAt(ctx context.Context, prefix []byte, excludePrefix []byte, commitTS, appliedIndex uint64) error {
for _, g := range s.groups {
if g == nil || g.Store == nil {
continue
}
// Pass appliedIndex through to every group. In the
// single-group call-path (the production raft-apply case)
// this is correct: appliedIndex IS that group's raft entry
// index. In a hypothetical multi-group call, only one group
// would see the matching index and the rest would treat it
// as a non-monotonic stray write — but the rest of the
// raft-apply contract (single FSM per raft log) makes that
// case impossible to reach in production. Tests that
// exercise ShardStore.DeletePrefixAtRaftAt across multiple
// groups MUST pass appliedIndex=0 to opt out.
if err := g.Store.DeletePrefixAtRaftAt(ctx, prefix, excludePrefix, commitTS, appliedIndex); err != nil {
return errors.WithStack(err)
}
}
return nil
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The ShardStore wrapper does not implement raftengine.AppliedIndexReader or raftengine.AppliedIndexWriter. Since kvFSM type-asserts its underlying store to these interfaces, using ShardStore in production (e.g., sharded deployments) will cause the skip optimization to silently no-op and always fall back to full restore.

To fix this, ShardStore should implement both interfaces by propagating the calls to all underlying group stores.

func (s *ShardStore) DeletePrefixAtRaftAt(ctx context.Context, prefix []byte, excludePrefix []byte, commitTS, appliedIndex uint64) error {
	for _, g := range s.groups {
		if g == nil || g.Store == nil {
			continue
		}
		// Pass appliedIndex through to every group. In the
		// single-group call-path (the production raft-apply case)
		// this is correct: appliedIndex IS that group's raft entry
		// index. In a hypothetical multi-group call, only one group
		// would see the matching index and the rest would treat it
		// as a non-monotonic stray write — but the rest of the
		// raft-apply contract (single FSM per raft log) makes that
		// case impossible to reach in production. Tests that
		// exercise ShardStore.DeletePrefixAtRaftAt across multiple
		// groups MUST pass appliedIndex=0 to opt out.
		if err := g.Store.DeletePrefixAtRaftAt(ctx, prefix, excludePrefix, commitTS, appliedIndex); err != nil {
			return errors.WithStack(err)
		}
	}
	return nil
}

func (s *ShardStore) LastAppliedIndex() (uint64, bool, error) {
	if s == nil || len(s.groups) == 0 {
		return 0, false, nil
	}
	var minIdx uint64
	hasMin := false
	for _, g := range s.groups {
		if g == nil || g.Store == nil {
			continue
		}
		if r, ok := g.Store.(raftengine.AppliedIndexReader); ok {
			idx, present, err := r.LastAppliedIndex()
			if err != nil {
				return 0, false, err
			}
			if present {
				if !hasMin || idx < minIdx {
					minIdx = idx
					hasMin = true
				}
			}
		}
	}
	return minIdx, hasMin, nil
}

func (s *ShardStore) SetDurableAppliedIndex(idx uint64) error {
	if s == nil {
		return nil
	}
	for _, g := range s.groups {
		if g == nil || g.Store == nil {
			continue
		}
		if w, ok := g.Store.(raftengine.AppliedIndexWriter); ok {
			if err := w.SetDurableAppliedIndex(idx); err != nil {
				return err
			}
		}
	}
	return nil
}

Comment thread store/lsm_store.go
Comment on lines +627 to +636
func (s *pebbleStore) SetDurableAppliedIndex(idx uint64) error {
s.dbMu.RLock()
defer s.dbMu.RUnlock()
batch := s.db.NewBatch()
defer func() { _ = batch.Close() }()
if err := setPebbleUint64InBatch(batch, metaAppliedIndexBytes, idx); err != nil {
return err
}
return errors.WithStack(batch.Commit(pebble.Sync))
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Since SetDurableAppliedIndex does not check if the existing durable index is already greater than idx, a concurrent raft apply of a higher index (e.g., 100) could be overwritten by an asynchronous snapshot persist of a lower index (e.g., 90). While logically safe due to idempotency, it can cause the durable index to go backwards on disk, leading to redundant log replay on cold start.

To prevent this, we should ensure the index is updated monotonically. Since this is a frequently accessed field on a hot path, prefer using atomic operations (such as Compare-And-Swap) over mutexes to avoid performance overhead.

func (s *pebbleStore) SetDurableAppliedIndex(idx uint64) error {
	s.dbMu.RLock()
	defer s.dbMu.RUnlock()

	for {
		current := atomic.LoadUint64(&s.durableAppliedIndex)
		if idx <= current {
			return nil
		}
		if atomic.CompareAndSwapUint64(&s.durableAppliedIndex, current, idx) {
			break
		}
	}

	batch := s.db.NewBatch()
	defer func() { _ = batch.Close() }()
	if err := setPebbleUint64InBatch(batch, metaAppliedIndexBytes, idx); err != nil {
		return err
	}
	return errors.WithStack(batch.Commit(pebble.Sync))
}
References
  1. For frequently accessed fields that require monotonic updates (like a configuration index), prefer atomic operations (e.g., CAS) over mutexes to improve performance on hot paths.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (2)
internal/raftengine/etcd/engine_applied_index_test.go (1)

93-148: ⚡ Quick win

Add a mirror test for persistLocalSnapshotPayload.

This suite only exercises persistCreatedSnapshot, but the steady-state hook in persistLocalSnapshotPayload is a separate code path with its own early-return behavior. A direct test for bump-before-persist ordering and bump-error aborts there would close the main regression gap in this change. Based on learnings: Ensure FSM apply operations are idempotent and that committed entries cannot be lost. Check snapshot/restore round-trips and crash-restart paths.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/raftengine/etcd/engine_applied_index_test.go` around lines 93 - 148,
Add mirror unit tests for persistLocalSnapshotPayload that replicate the three
behaviors covered for persistCreatedSnapshot: (1) a test like
TestPersistCreatedSnapshot_BumpsAppliedIndex that uses applyIndexOrderRecorder +
recordingAppliedIndexFSM + recordingPersistStorage and asserts
SetDurableAppliedIndex (bump) is called before SaveSnap when calling
Engine.persistLocalSnapshotPayload with a snapshot; (2) a legacy-path test using
testStateMachine that verifies a non-AppliedIndexWriter FSM still lets the
snapshot save run (no bump); and (3) a bump-error-aborts test where
recordingAppliedIndexFSM is configured to fail (failNext / failErr) and you
assert persistLocalSnapshotPayload returns the error and SaveSnap is not
invoked; reuse the same helper types (applyIndexOrderRecorder,
recordingAppliedIndexFSM, recordingPersistStorage) and snapshot metadata values
as in the existing tests to keep behavior parity.
kv/shard_store.go (1)

1271-1305: ⚡ Quick win

Potential footgun: multi-group DeletePrefixAtRaftAt relies on caller discipline.

The implementation passes appliedIndex to every group in the loop (line 1300). For a multi-group ShardStore, this would write the same appliedIndex to multiple independent Pebble stores, corrupting their separate raft-applied-index tracking.

The comment (lines 1271-1285) documents that callers "MUST pass appliedIndex=0" for multi-group scenarios, but there's no runtime enforcement. Consider adding a defensive check:

if appliedIndex > 0 && len(s.groups) > 1 {
    return errors.New("DeletePrefixAtRaftAt with appliedIndex > 0 requires single-group ShardStore")
}

If the design guarantees that production FSM paths only see single-group ShardStore (as the comment claims), this check would catch test misuse without impacting production.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@kv/shard_store.go` around lines 1271 - 1305, The DeletePrefixAtRaftAt method
currently forwards appliedIndex to every group which can corrupt per-store
raft-applied tracking in multi-group ShardStore; add a defensive runtime check
at the start of ShardStore.DeletePrefixAtRaftAt that returns an error if
appliedIndex > 0 and the store has more than one group (use s.groups length or a
helper like s.groupCount()), so callers must explicitly opt out by passing
appliedIndex=0 for multi-group scenarios; keep the existing loop and behavior
otherwise and reference the DeletePrefixAtRaftAt receiver, s.groups, and the
appliedIndex parameter when adding the check.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@internal/raftengine/etcd/engine_applied_index_test.go`:
- Around line 93-148: Add mirror unit tests for persistLocalSnapshotPayload that
replicate the three behaviors covered for persistCreatedSnapshot: (1) a test
like TestPersistCreatedSnapshot_BumpsAppliedIndex that uses
applyIndexOrderRecorder + recordingAppliedIndexFSM + recordingPersistStorage and
asserts SetDurableAppliedIndex (bump) is called before SaveSnap when calling
Engine.persistLocalSnapshotPayload with a snapshot; (2) a legacy-path test using
testStateMachine that verifies a non-AppliedIndexWriter FSM still lets the
snapshot save run (no bump); and (3) a bump-error-aborts test where
recordingAppliedIndexFSM is configured to fail (failNext / failErr) and you
assert persistLocalSnapshotPayload returns the error and SaveSnap is not
invoked; reuse the same helper types (applyIndexOrderRecorder,
recordingAppliedIndexFSM, recordingPersistStorage) and snapshot metadata values
as in the existing tests to keep behavior parity.

In `@kv/shard_store.go`:
- Around line 1271-1305: The DeletePrefixAtRaftAt method currently forwards
appliedIndex to every group which can corrupt per-store raft-applied tracking in
multi-group ShardStore; add a defensive runtime check at the start of
ShardStore.DeletePrefixAtRaftAt that returns an error if appliedIndex > 0 and
the store has more than one group (use s.groups length or a helper like
s.groupCount()), so callers must explicitly opt out by passing appliedIndex=0
for multi-group scenarios; keep the existing loop and behavior otherwise and
reference the DeletePrefixAtRaftAt receiver, s.groups, and the appliedIndex
parameter when adding the check.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 05a21b08-93fa-497f-9450-99f76bd3da7f

📥 Commits

Reviewing files that changed from the base of the PR and between 384cacc and 2c42f7d.

📒 Files selected for processing (11)
  • go.mod
  • internal/raftengine/etcd/engine.go
  • internal/raftengine/etcd/engine_applied_index_test.go
  • internal/raftengine/statemachine.go
  • kv/fsm.go
  • kv/leader_routed_store.go
  • kv/shard_store.go
  • store/lsm_store.go
  • store/lsm_store_applied_index_test.go
  • store/mvcc_store.go
  • store/store.go

Address PR #915 round-1 feedback:

1) claude[bot] BLOCKER — Site 2 (persistLocalSnapshotPayload) crash-
   ordering test gap:

   internal/raftengine/etcd/engine_applied_index_test.go gains:
     - localSnapshotEngine helper seeding etcdraft.MemoryStorage so
       buildLocalSnapshot's storage.Term(applied) lookup succeeds
       (without seeding, the free persistLocalSnapshotPayload short-
       circuits at Term and SaveSnap is never reached).
     - TestPersistLocalSnapshotPayload_BumpsAppliedIndex — happy path:
       asserts the recorded event sequence is exactly [bump@123,
       save@123].
     - TestPersistLocalSnapshotPayload_BumpErrorAborts — mirror of
       Site 1: a failed SetDurableAppliedIndex MUST surface the
       error AND prevent the free-function from running (so SaveSnap
       is never called). Preserves the crash invariant for Site 2.

   These close the only blocker from claude[bot]'s round-1 review.

2) reviewdog golangci-lint findings (3 issues → 0):

   - lsm_store.go gci: re-aligned the metaAppliedIndex const block;
     golangci-lint fmt picked it up.
   - kv/fsm.go wrapcheck on SetDurableAppliedIndex return: wrapped
     the interface-method return in errors.WithStack so the lint
     allow-list is honoured.
   - engine.go cyclop on persistLocalSnapshotPayload (12 > 10):
     extracted two helpers:
       * bumpDurableAppliedIndexBeforeSave(index) — shared by BOTH
         persist sites, deduplicates the writer-seam type assertion
         and the inline doc-comment (round-1 had it twice).
       * handleLocalSnapshotPersistResult(err) + purgeAfterLocalSnapshot()
         — collapses the four-arm error switch into a helper. The
         three ErrCompacted / ErrUnavailable / ErrSnapOutOfDate
         branches were already a single logical case so they fold
         into one switch arm.
     Cyclop now reports 0 issues; the post-extract function body is
     also easier to read.

3) Defensive forwards (gemini HIGH false-positive — pushed back in
   the round-2 PR comment with grep evidence):

   kv/leader_routed_store.go and kv/shard_store.go gain
   LastAppliedIndex / SetDurableAppliedIndex methods. Both forwards
   are currently DEAD CODE for the cold-start skip optimisation —
   verified by grep on origin/main that the kvFSM holds a
   *pebbleStore directly in every production path (LeaderRoutedStore
   is the adapter/server read-routing wrapper; ShardStore is the
   coordinator-facing fanout wrapper; neither is f.store).

   They are added defensively so a future refactor that wraps the
   FSM store does not silently degrade the optimisation to full-
   restore-always. ShardStore aggregates LastAppliedIndex as MIN
   over groups so the cold-start skip gate refuses to skip whenever
   ANY group lags (conservative 'over-restore beats under-restore'
   rule from design §4).

Tests: go vet ./store/ ./kv/ ./internal/raftengine/... clean;
go test ./store/ ./kv/ ./internal/raftengine/... -short ok 53s
total; new Site 2 tests pass on first run after seeding storage.
golangci-lint run ./store/ ./kv/ ./internal/raftengine/etcd/
→ 0 issues.

Refs PR #915 round-1 (claude[bot] verdict + gemini-code-assist HIGH +
reviewdog lint).
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 3, 2026

TLA+ spec divergence review (auto-triggered)

This PR touches files that the TLA+ safety spec has an anchor on (per
docs/design/2026_05_28_partial_tla_safety_spec.md §3),
so an AI review is requested below to verify the implementation has not drifted
from the model.

Anchored files changed in this PR head (7001a93):

  • kv/fsm.go
  • store/mvcc_store.go

What to check, by subsystem:

  • kv/hlc*.goNext() must respect the HLC-4 preconditions (i)/(ii)/(iii) from the design doc: bounded skew, logical-counter handoff on leader change (strategy (c) Observe(MaxAppliedHLC)), and the commit-time ceiling fence (fail-closed when wall_now >= physicalCeiling). Any change to the bit layout (48/16), the CAS loop, or the ceiling getter/setter is in scope.
  • kv/coordinator.go, kv/sharded_coordinator.goRunHLCLeaseRenewal, hlcRenewalInterval, hlcPhysicalWindowMs constants, and the new-term detection that calls Observe(fsm.MaxAppliedHLC()) (strategy (c)). Any change to renewal cadence, group selection, or fail-closed behaviour is in scope.
  • kv/transaction.go, kv/lock_resolver.go — OCC commit-ts assignment, lock-map encoding (key, lock_ts) -> start_ts, and the LockResolver action OCC-3 depends on. (M2 spec will land OCC-1..OCC-5; until then the spec doc §5.2 is the contract.)
  • kv/fsm.go — FSM apply of HLC lease entries (SetPhysicalCeiling), and any future MaxAppliedHLC() accessor that strategy (c) needs.
  • store/mvcc_store.go — version visibility, snapshot install, and the MVCC-1..MVCC-4 invariants (M3 scope).
  • distribution/** — route catalog versioning, SplitRange atomicity, and CatalogWatcher async fan-out (M4 scope).

If the change is correct but requires a spec update, edit tla/hlc/HLC.tla (or the corresponding M2..M5 module once landed) and the design doc in the same PR. The tla-check workflow runs the TLC model check on the same paths.


@claude review please verify TLA+ spec divergence per the checklist above.

@codex review please verify TLA+ spec divergence per the checklist above.

@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Jun 3, 2026

Round-2 push: 7001a934. All blocking items from round-1 resolved.

🔴 claude[bot] BLOCKER — Site 2 crash-ordering test gap

Closed via two new tests:

  • TestPersistLocalSnapshotPayload_BumpsAppliedIndex — asserts the recorded event sequence is exactly [bump@123, save@123]
  • TestPersistLocalSnapshotPayload_BumpErrorAborts — mirror of Site 1: failed SetDurableAppliedIndex MUST surface the error AND prevent the free-function persistLocalSnapshotPayload from running

Discovery from writing the happy-path test: buildLocalSnapshot (wal_store.go:544) calls storage.Term(applied). A fresh etcdraft.MemoryStorage has no entries, so without seeding, Term returns ErrUnavailable and the free function short-circuits before SaveSnap. localSnapshotEngine(t, rec, fsm, applied) now seeds the storage with applied entries to reach the actual SaveSnap path.

🟢 reviewdog golangci-lint (3 issues → 0)

  • lsm_store.go gci: re-aligned the metaAppliedIndex const block (golangci-lint fmt picked it up automatically)
  • kv/fsm.go wrapcheck on SetDurableAppliedIndex return: wrapped with errors.WithStack
  • engine.go cyclop on persistLocalSnapshotPayload (12 > 10): extracted three helpers and folded the three "snapshot already moved on" error cases into a single switch arm:
    • bumpDurableAppliedIndexBeforeSave(index) — shared by BOTH persist sites, deduplicates the writer-seam type assertion and inline doc-comment (round-1 had it duplicated)
    • handleLocalSnapshotPersistResult(err) — collapses the four-arm switch
    • purgeAfterLocalSnapshot() — extracted the on-success disk-side cleanup

golangci-lint run ./store/ ./kv/ ./internal/raftengine/etcd/0 issues.

🟢 Defensive forwards (push-back on gemini HIGH false positive)

LeaderRoutedStore and ShardStore gain LastAppliedIndex / SetDurableAppliedIndex. The gemini HIGH finding was a false positive in current origin/main — verified by grep:

$ grep -rn 'NewKvFSMWithHLC(' --include='*.go' . | grep -v _test.go
./cmd/server/demo.go:453:  fsm := kv.NewKvFSMWithHLC(st, hlc)         # st = *pebbleStore
./main.go:851:             sm := kv.NewKvFSMWithHLC(st, clock, ...)   # st = *pebbleStore

The kvFSM holds a *pebbleStore directly in every production path. LeaderRoutedStore is used by adapter/server code for read-routing (cmd/server/demo.go:265-378 for grpc/redis/s3/dynamodb adapters); ShardStore is the coordinator-facing fanout wrapper. Neither is f.store. The skip optimisation works correctly in production.

The defensive forwards are added anyway because a future refactor that wraps the FSM store would otherwise silently degrade the skip to "full-restore-always" with no failure signal. ShardStore.LastAppliedIndex aggregates as MIN over groups so the cold-start skip gate refuses to skip whenever ANY group lags (conservative "over-restore beats under-restore" per design §4).

🟢 Other round-1 nits

  • claude[bot] note on non-monotonic SetDurableAppliedIndex: kept the existing comment for now; the "logical correctness" reasoning (snapshot at M still satisfies its own invariant) is sound and a strictly-monotonic write would mean adding a read-modify-write under dbMu.Lock (not RLock), changing the lock-discipline more invasively than this round warrants. Worth a follow-up if production data shows the rare backward-write actually causes meaningful regression.
  • claude[bot] note on ShardStore multi-group convention: the inline comment at shard_store.go:1294 already documents the "tests MUST pass appliedIndex=0" convention; not strengthening to a runtime guard because no production caller hits this path.

Verification

go vet ./store/ ./kv/ ./internal/raftengine/...                    → exit 0
golangci-lint run ./store/ ./kv/ ./internal/raftengine/etcd/       → 0 issues
go test ./store/ ./kv/ ./internal/raftengine/... -short            → ok 53s
go test ./internal/raftengine/etcd/ -run 'TestPersistLocal...'     → ok 25ms (Site 2 new tests pass)

@claude review

@chatgpt-codex-connector
Copy link
Copy Markdown

To use Codex here, create a Codex account and connect to github.

@claude
Copy link
Copy Markdown

claude Bot commented Jun 3, 2026

Claude finished @bootjp's task in 2m 14s —— View job


PR #915 Round-2 Review — 7001a934

Todo list:

  • Load tools
  • Read updated engine test file (Site 2 new tests)
  • Read updated engine.go (cyclop refactor + bumpDurableAppliedIndexBeforeSave helper)
  • Read updated LeaderRoutedStore defensive forwards
  • Read updated ShardStore defensive forwards
  • Read updated lsm_store.go (gci fix)
  • Read updated fsm.go (wrapcheck fix)
  • Verify TLA+ spec divergence for new changes
  • Post complete review

Round-1 blockers resolution

All three blockers from round 1 are resolved. Details below.


Site 2 crash-ordering tests ✅

TestPersistLocalSnapshotPayload_BumpsAppliedIndex (engine_applied_index_test.go:195-210)

The localSnapshotEngine helper correctly seeds MemoryStorage with applied entries (via storage.Append(entries)) so buildLocalSnapshot's storage.Term(applied) call succeeds and the path reaches persist.SaveSnap. The discovery is clearly documented in comments at :162-176. The test asserts:

events[0] == {kind:"bump", index:123}
events[1] == {kind:"save", index:123}

Note: uses require.GreaterOrEqual(t, len(events), 2) as a guard before the per-index assertions — slightly looser than Site 1's require.Equal(t, []orderEvent{...}, ...) slice comparison, but the ordering constraints are still fully checked. The infrastructure can only produce one "bump" and one "save", so this is harmless.

TestPersistLocalSnapshotPayload_BumpErrorAborts (engine_applied_index_test.go:215-225)

failNext: true + failErr: io.ErrShortBufferSetDurableAppliedIndex returns the error without recording. require.Empty(t, rec.snapshot()) verifies neither "bump" nor "save" was recorded. Mirrors Site 1 exactly. ✅

No NilFSMNoOp mirror for Site 2 — this is fine. bumpDurableAppliedIndexBeforeSave is a shared helper; the Site 1 NilFSMNoOp test already exercises the !ok early-return on that helper, covering both sites.


cyclop refactor ✅

persistLocalSnapshotPayload (engine.go:4060) is now under budget. The three extracted helpers are correct:

  • bumpDurableAppliedIndexBeforeSave (engine.go:2696-2701): clean type-assert, errors.WithStack on the writer call, nil return for non-implementing FSMs. Doc-comment explains the pebble.Sync rationale correctly. Now shared by both persist sites — eliminates the duplication that round-1 noted. ✅
  • handleLocalSnapshotPersistResult (engine.go:4089-4101): the three raft "snapshot moved on" errors (ErrCompacted, ErrUnavailable, ErrSnapOutOfDate) are correctly grouped as no-ops; only err == nil runs purgeAfterLocalSnapshot. Semantics preserved. ✅
  • purgeAfterLocalSnapshot (engine.go:4107-4116): extracted the disk-side cleanup correctly; both log-on-error but don't propagate, matching the original behavior. ✅

Minor observation (non-blocking): persistCreatedSnapshot (engine.go:2718-2725) still has the same two inline purge calls rather than using purgeAfterLocalSnapshot. Minor inconsistency — fine for this PR since Site 1 is within complexity budget.


LeaderRoutedStore defensive forwards ✅

  • LastAppliedIndex (leader_routed_store.go:382-397): type-asserts via local interface literal (avoids a circular import), (0, false, nil) on all fallback paths, errors.WithStack on propagated errors. ✅
  • SetDurableAppliedIndex (leader_routed_store.go:403-414): same duck-typing pattern, nil return for non-implementing stores. ✅
  • Doc-comment correctly explains these are defensive dead code today (production FSM holds *pebbleStore directly) and cites the specific production call sites (demo.go:453, main.go:851). ✅

ShardStore defensive forwards ✅

  • LastAppliedIndex (shard_store.go:1337-1370): MIN aggregation over groups. The "one group missing → immediately return (0, false, nil)" at line 1357-1360 is the correct conservative policy — a group with no meta key means the skip is unsafe, not just for that group but globally. ✅
  • SetDurableAppliedIndex (shard_store.go:1384-1400): broadcasts to all groups that implement the writer seam, skips those that don't, errors.WithStack on first error. Doc-comment warns about multi-group semantics. ✅

golangci-lint fixes ✅

  • gci (lsm_store.go): imports are correctly grouped — stdlib block first, then external. ✅
  • wrapcheck (kv/fsm.go:163): return errors.WithStack(w.SetDurableAppliedIndex(idx)) — interface method return is now wrapped. ✅
  • cyclop (engine.go): persistLocalSnapshotPayload reduced by extracting three helpers as above. ✅

TLA+ spec divergence ✅

New changes are: two test functions, three extracted engine helpers, and defensive no-op forwards on wrapper stores. None of these touch HLC, OCC, MVCC visibility, FSM apply semantics, or any spec-anchored invariant. No spec update required.


Summary

All Round-1 blockers resolved. The Site 2 test pair is correct and well-motivated. The cyclop refactor is clean and doesn't alter semantics. The defensive forwards follow the existing duck-typing pattern established by ApplyMutationsRaftAt / DeletePrefixAtRaftAt. No new blockers. Ready to merge.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
internal/raftengine/etcd/engine_applied_index_test.go (1)

204-209: ⚡ Quick win

Consider asserting the exact event sequence for tighter crash-ordering coverage.

Site 1 (TestPersistCreatedSnapshot_BumpsAppliedIndex) asserts the full slice with require.Equal, but here require.GreaterOrEqual(len, 2) + positional checks would not catch a spurious extra save/bump (e.g. an accidental double SaveSnap). Since these tests exist specifically to lock down the bump-before-save invariant, an exact assertion matches Site 1 and is slightly stricter.

♻️ Proposed tightening
-	events := rec.snapshot()
-	require.GreaterOrEqual(t, len(events), 2,
-		"both bump and save events MUST be recorded; got %d", len(events))
-	require.Equal(t, "bump", events[0].kind, "bump MUST be first")
-	require.Equal(t, index, events[0].index)
-	require.Equal(t, "save", events[1].kind, "save MUST be second")
-	require.Equal(t, index, events[1].index)
+	require.Equal(t, []orderEvent{
+		{kind: "bump", index: index},
+		{kind: "save", index: index},
+	}, rec.snapshot(),
+		"hook MUST call SetDurableAppliedIndex BEFORE SaveSnap exactly once each")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/raftengine/etcd/engine_applied_index_test.go` around lines 204 -
209, The test currently uses require.GreaterOrEqual(t, len(events), 2) plus
positional checks on events[0] and events[1], which misses spurious extra
entries; change the assertion to require an exact sequence of two events to
match the bump-before-save invariant (as done in
TestPersistCreatedSnapshot_BumpsAppliedIndex). Concretely, assert the length is
exactly 2 (require.Len/require.Equal with 2) and then assert the full slice
equals the expected slice (e.g., build an expected []event{{kind:"bump",
index:index}, {kind:"save", index:index}} and compare it to events) so any extra
or out-of-order bump/save entries fail the test.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@internal/raftengine/etcd/engine_applied_index_test.go`:
- Around line 204-209: The test currently uses require.GreaterOrEqual(t,
len(events), 2) plus positional checks on events[0] and events[1], which misses
spurious extra entries; change the assertion to require an exact sequence of two
events to match the bump-before-save invariant (as done in
TestPersistCreatedSnapshot_BumpsAppliedIndex). Concretely, assert the length is
exactly 2 (require.Len/require.Equal with 2) and then assert the full slice
equals the expected slice (e.g., build an expected []event{{kind:"bump",
index:index}, {kind:"save", index:index}} and compare it to events) so any extra
or out-of-order bump/save entries fail the test.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b32d8fbe-fc66-4f75-8b2c-e399c938c425

📥 Commits

Reviewing files that changed from the base of the PR and between 2c42f7d and 7001a93.

📒 Files selected for processing (6)
  • internal/raftengine/etcd/engine.go
  • internal/raftengine/etcd/engine_applied_index_test.go
  • kv/fsm.go
  • kv/leader_routed_store.go
  • kv/shard_store.go
  • store/lsm_store.go
🚧 Files skipped from review as they are similar to previous changes (3)
  • internal/raftengine/etcd/engine.go
  • kv/fsm.go
  • store/lsm_store.go

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7001a934f1

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread store/lsm_store.go
defer s.dbMu.RUnlock()
batch := s.db.NewBatch()
defer func() { _ = batch.Close() }()
if err := setPebbleUint64InBatch(batch, metaAppliedIndexBytes, idx); err != nil {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Avoid lowering the durable apply index

When a SnapshotCount snapshot is persisted by the background worker (startSnapshotWorker launches persistLocalSnapshot in a goroutine), raft apply can continue and ApplyMutationsRaftAt can advance _meta_applied_index past the snapshot's older idx before this method runs. This unconditional set then rewrites the key backwards, so a store that has durably applied entries N+k can later report only N, defeating the soak/verification invariant and causing the future skip gate to fall back unnecessarily until another data apply repairs it. Please write max(existing, idx) or otherwise serialize with apply so the meta key remains monotonic.

Useful? React with 👍 / 👎.

…x P2)

Round-2 of PR #915 review surfaced a P2 (codex) on the
non-monotonic write that claude[bot] had flagged as a 'design note' in
round-1.

The race: a background SnapshotCount-triggered persistLocalSnapshot
worker calls SetDurableAppliedIndex(snap.Metadata.Index). Concurrently,
the raft-apply loop's ApplyMutationsRaftAt has already advanced the
per-Apply metaAppliedIndex past that snapshot's index. The
unconditional write rewinds the meta key — defeating the soak/
verification invariant and causing the future cold-start skip gate
to fall back unnecessarily until another data Apply repairs the key.

Round-3 fix: read-modify-write under applyMu serialisation.

  - SetDurableAppliedIndex now acquires dbMu.RLock THEN applyMu.Lock
    (same order applyMutationsWithOpts / deletePrefixAtWithOpts use,
    so no deadlock).
  - Reads the existing meta key via the new readAppliedIndexLocked
    helper (shares its decode logic with the unlocked
    LastAppliedIndex(); same (0, false, nil) semantics for absent /
    truncated meta keys).
  - If existing >= idx, no-op. The skip invariant still holds:
    LastAppliedIndex >= idx >= snap.Metadata.Index, so the snapshot's
    own claim is satisfied by the larger-already-present value.
  - Otherwise, batch.Commit(pebble.Sync) as before.

Tests (store/lsm_store_applied_index_test.go):

  - TestSetDurableAppliedIndex_Monotonic — simulates the exact race
    in single-threaded form: ApplyMutationsRaftAt(applied=250) then
    SetDurableAppliedIndex(snapIdx=100). Asserts LastAppliedIndex
    still reads 250 after the older snapshot write.
  - TestSetDurableAppliedIndex_AdvancesOnlyForward — exercises the
    forward-only contract: Set(50) -> Set(100, advances) ->
    Set(75, no-op) -> Set(100, no-op). LastAppliedIndex finishes at
    100.

Tests: go vet ./store/ + golangci-lint ./store/ clean;
go test ./store/ -run 'TestSetDurable...|TestLast...|TestApply...|TestDelete...' ok 0.59s;
go test ./store/ ./kv/ ./internal/raftengine/... -short ok ~53s.

Refs PR #915 round-2 codex review (P2 'Avoid lowering the durable
apply index').
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 4, 2026

TLA+ spec divergence review (auto-triggered)

This PR touches files that the TLA+ safety spec has an anchor on (per
docs/design/2026_05_28_partial_tla_safety_spec.md §3),
so an AI review is requested below to verify the implementation has not drifted
from the model.

Anchored files changed in this PR head (b957e1a):

  • kv/fsm.go
  • store/mvcc_store.go

What to check, by subsystem:

  • kv/hlc*.goNext() must respect the HLC-4 preconditions (i)/(ii)/(iii) from the design doc: bounded skew, logical-counter handoff on leader change (strategy (c) Observe(MaxAppliedHLC)), and the commit-time ceiling fence (fail-closed when wall_now >= physicalCeiling). Any change to the bit layout (48/16), the CAS loop, or the ceiling getter/setter is in scope.
  • kv/coordinator.go, kv/sharded_coordinator.goRunHLCLeaseRenewal, hlcRenewalInterval, hlcPhysicalWindowMs constants, and the new-term detection that calls Observe(fsm.MaxAppliedHLC()) (strategy (c)). Any change to renewal cadence, group selection, or fail-closed behaviour is in scope.
  • kv/transaction.go, kv/lock_resolver.go — OCC commit-ts assignment, lock-map encoding (key, lock_ts) -> start_ts, and the LockResolver action OCC-3 depends on. (M2 spec will land OCC-1..OCC-5; until then the spec doc §5.2 is the contract.)
  • kv/fsm.go — FSM apply of HLC lease entries (SetPhysicalCeiling), and any future MaxAppliedHLC() accessor that strategy (c) needs.
  • store/mvcc_store.go — version visibility, snapshot install, and the MVCC-1..MVCC-4 invariants (M3 scope).
  • distribution/** — route catalog versioning, SplitRange atomicity, and CatalogWatcher async fan-out (M4 scope).

If the change is correct but requires a spec update, edit tla/hlc/HLC.tla (or the corresponding M2..M5 module once landed) and the design doc in the same PR. The tla-check workflow runs the TLC model check on the same paths.


@claude review please verify TLA+ spec divergence per the checklist above.

@codex review please verify TLA+ spec divergence per the checklist above.

@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Jun 4, 2026

Round-3 push: b957e1aa — addressing codex round-2 P2.

🟡 codex P2 line 632 — "Avoid lowering the durable apply index"

You're right that round-2's unconditional write can rewind metaAppliedIndex when the background persistLocalSnapshot goroutine fires after ApplyMutationsRaftAt has already advanced the key past the snapshot's index. claude[bot] round-1 had flagged this as a "design note"; the round-2 P2 promotion is the right call — even though the skip invariant (LastAppliedIndex >= snap.Metadata.Index) still holds for the snapshot itself, the rewind defeats the soak/verification value of the meta key (the future Branch 3 gate would unnecessarily fall back until another data Apply repairs it).

Round-3 fix

SetDurableAppliedIndex is now a read-modify-write under applyMu serialisation:

func (s *pebbleStore) SetDurableAppliedIndex(idx uint64) error {
    s.dbMu.RLock()
    defer s.dbMu.RUnlock()
    s.applyMu.Lock()
    defer s.applyMu.Unlock()

    existing, present, err := s.readAppliedIndexLocked()
    if err != nil { return err }
    if present && existing >= idx {
        return nil // skip-invariant still holds: existing >= idx
    }

    batch := s.db.NewBatch()
    defer func() { _ = batch.Close() }()
    if err := setPebbleUint64InBatch(batch, metaAppliedIndexBytes, idx); err != nil {
        return err
    }
    return errors.WithStack(batch.Commit(pebble.Sync))
}

Lock ordering: dbMu.RLock THEN applyMu.Lock matches applyMutationsWithOpts / deletePrefixAtWithOpts discipline (both acquire in that order), so no deadlock. The applyMu.Lock serialises against the per-Apply meta key writes — between the read and the batch commit, no concurrent Apply can advance the key.

readAppliedIndexLocked is the locked variant of the decode logic (factored out of LastAppliedIndex); same (0, false, nil) semantics for absent / truncated meta keys.

Tests

Two new tests in store/lsm_store_applied_index_test.go:

  • TestSetDurableAppliedIndex_Monotonic — simulates the exact race in single-threaded form: ApplyMutationsRaftAt(applied=250) then SetDurableAppliedIndex(snapIdx=100). Asserts LastAppliedIndex still reads 250 after the older snapshot write.
  • TestSetDurableAppliedIndex_AdvancesOnlyForwardSet(50)Set(100) advances → Set(75) no-op → Set(100) no-op. Final state: 100.
go vet ./store/                                                       → exit 0
golangci-lint run ./store/                                            → 0 issues
go test ./store/ -run 'TestSetDurable...|TestLast...|TestApply...'    → ok 0.59s
go test ./store/ ./kv/ ./internal/raftengine/... -short               → ok ~55s

@claude review

@chatgpt-codex-connector
Copy link
Copy Markdown

To use Codex here, create a Codex account and connect to github.

@claude
Copy link
Copy Markdown

claude Bot commented Jun 4, 2026

Claude finished @bootjp's task in 3m 2s —— View job


PR #915 Round-3 Review — b957e1aa

Todo list:

  • Load tools
  • Read store/lsm_store.goSetDurableAppliedIndex, readAppliedIndexLocked, lock ordering
  • Read store/lsm_store_applied_index_test.go — new monotonic tests
  • Read internal/raftengine/etcd/engine_applied_index_test.go — Site 2 assertion tightness
  • Verify lock ordering: dbMu.RLockapplyMu.Lock matches applyMutationsWithOpts
  • Check TLA+ divergence for round-3 changes
  • Post comprehensive review

Codex P2 fix — SetDurableAppliedIndex monotonicity ✅

Implementation (lsm_store.go:634-658):

The read-modify-write under applyMu is correct. applyMu.Lock serializes against all applyMutationsWithOpts and deletePrefixAtWithOpts callers (both acquire it after dbMu.RLock). While SetDurableAppliedIndex holds applyMu, no concurrent data-Apply can be between its own conflict-check and batch commit, so the Pebble Get in readAppliedIndexLocked observes a fully-consistent, up-to-date value of metaAppliedIndex.

Lock ordering (lsm_store.go:635-638):

  • dbMu.RLock() first, then applyMu.Lock() — matches applyMutationsWithOpts at :1442/1447 and deletePrefixAtWithOpts at :1534/1536. ✅
  • No deadlock risk: neither applyMutationsWithOpts nor deletePrefixAtWithOpts ever attempts to acquire dbMu.WLock while holding applyMu — only Restore/Close does, and those hold maintenanceMu exclusively. ✅

Skip-invariant correctness (lsm_store.go:644-650):
The present && existing >= idx guard is correct for all three cases:

  • !present (fresh store): always writes. ✅
  • existing < idx (snapshot is ahead of all applies, rare): writes idx. ✅
  • existing >= idx (P2 race: applies have run past snapshot index): no-op. The skip-invariant LastAppliedIndex >= snap.Metadata.Index still holds because existing >= idx >= snap.Metadata.Index. ✅

New tests

  • TestSetDurableAppliedIndex_Monotonic (:185-210): correctly mirrors the P2 race scenario in single-threaded form — Apply at index 250, then snapshot persist at index 100; verifies LastAppliedIndex remains 250 after the stale write attempt. ✅
  • TestSetDurableAppliedIndex_AdvancesOnlyForward (:216-230): four-call sequence covering Set(50) → Set(100) → Set(75)→no-op → Set(100)→no-op, verifies final state is 100. The existing >= idx guard handles equal-to as no-op (correct). ✅

Issues and observations

1. applyMu absent from struct-level lock ordering comment — documentation gap

lsm_store.go:170-177 documents the lock ordering as maintenanceMu → dbMu → mtx. But applyMu is acquired between dbMu and mtx (in applyMutationsWithOpts at :1447 and now in SetDurableAppliedIndex at :637). The actual ordering is maintenanceMu → dbMu → applyMu → mtx. The gap means a reader maintaining a new operation that only reads the struct comment will miss applyMu entirely.

Non-blocking for this PR (no new incorrect lock acquisition), but the comment is now stale given round-3's change. Worth a follow-up or a one-line update to the struct comment.

Fix this →

2. LastAppliedIndex duplicates readAppliedIndexLocked body — pre-existing DRY gap, not new in round-3

The doc-comment on readAppliedIndexLocked (:660-665) says "The body is shared with the unlocked LastAppliedIndex() reader", but LastAppliedIndex doesn't actually call readAppliedIndexLocked — both contain identical Pebble-Get + error-handling bodies. Ideally LastAppliedIndex would call s.readAppliedIndexLocked() after acquiring dbMu.RLock (it doesn't need applyMu for the public API). Pre-existing; not a round-3 regression; mentioning it so the inconsistency is tracked.

3. applyMu held for full pebble.Sync duration in snapshot-worker goroutine — performance note

Round-2's SetDurableAppliedIndex held no applyMu; round-3 holds it across the entire batch.Commit(pebble.Sync). Because the snapshot worker is a background goroutine, this means a rare snapshot fsync (default every 10000 applies) can stall concurrent ApplyMutationsRaftAt callers waiting for applyMu for the fsync duration (typically 1–10ms). For a soak-phase Branch 2 this is acceptable and the same trade-off is already made by applyMutationsWithOpts (which also holds applyMu across its own batch commit). Not a blocker, but worth a doc-comment callout on SetDurableAppliedIndex noting the contention implication.

4. TestPersistLocalSnapshotPayload_BumpsAppliedIndex — GreaterOrEqual vs exact slice — open coderabbit round-2 nit

Site 2's happy-path test (:204-209) uses require.GreaterOrEqual(len, 2) + positional checks, while Site 1's identical scenario uses require.Equal(t, []orderEvent{...}, rec.snapshot()). A spurious third event (e.g. an accidental double SaveSnap) would not fail the Site 2 test. Since the infrastructure can only produce one bump and one save this is low risk, but exact slice comparison would match Site 1 and catch regressions more tightly. The coderabbit round-2 nitpick on this is still open.

Fix this →


TLA+ spec divergence ✅

Round-3 only modifies store/lsm_store.go. The anchored files kv/fsm.go and store/mvcc_store.go are unchanged since round-2. The readAppliedIndexLocked helper and the applyMu serialization in SetDurableAppliedIndex don't touch HLC, OCC, MVCC visibility, or FSM apply semantics. No spec update required.


Summary

The codex P2 fix (monotonic SetDurableAppliedIndex) is correctly implemented and well-tested. Lock ordering is sound. The two new tests directly cover the production race scenario and the forwards-only invariant. No new blockers. Items 1–4 are documentation/style nits. Ready to merge.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b957e1aa54

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread store/lsm_store.go
Comment on lines +1574 to +1575
if appliedIndex > 0 {
if err := setPebbleUint64InBatch(batch, metaAppliedIndexBytes, appliedIndex); err != nil {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Serialize DEL_PREFIX meta bumps with snapshot pinning

When the background snapshot worker pins index N via SetDurableAppliedIndex, a later raft-applied DEL_PREFIX can concurrently commit this batch with _meta_applied_index = N+k; however SetDurableAppliedIndex only serializes its max-read/write with applyMu, and deletePrefixAtWithOpts never takes that lock, so the snapshot pin can still commit its stale lower N afterward and rewind the durable applied index. Fresh evidence beyond the earlier comment is this newly index-aware DEL_PREFIX path, which writes the meta key without participating in the same serialization; please include this path in the same monotonic/max protection.

Useful? React with 👍 / 👎.

claude[bot] round-3 verdict was 'Ready to merge' with 4 non-blocking
notes. Round-4 closes three of them and addresses the round-3
reviewdog gci finding. The fourth note (LastAppliedIndex duplicates
readAppliedIndexLocked body) is pre-existing and explicitly called
out by claude[bot] as 'not a round-3 regression'; deferred to a
follow-up to keep this round focused.

Codex round-3 P2 ('deletePrefixAtWithOpts never takes that lock') is
factually wrong — store/lsm_store.go:1536 shows the function DOES
acquire s.applyMu.Lock(). The function holds applyMu across its
entire batch.Commit, so it serialises against SetDurableAppliedIndex
through the same mutex. The race codex describes cannot occur.
Push-back posted in the PR comment with grep evidence; no code change
for this finding.

Round-4 changes:

1) reviewdog gci on store/lsm_store_applied_index_test.go:182.
   golangci-lint fmt auto-corrected the alignment (whitespace-only
   diff in the test's comment block).

2) Site 2 test (engine_applied_index_test.go:
   TestPersistLocalSnapshotPayload_BumpsAppliedIndex) switched from
   GreaterOrEqual+positional checks to exact slice comparison —
   matches Site 1's style and catches a spurious third event (e.g.
   double SaveSnap) that the previous assertion would miss. Closes
   claude[bot] note #4 and the coderabbit round-2 nit.

3) pebbleStore struct doc-comment lock-ordering table extended to
   include applyMu between dbMu and mtx (lsm_store.go:170-181). The
   ordering already matched the code; the comment was stale before
   round-3 added the SetDurableAppliedIndex serialisation. Closes
   claude[bot] note #1.

Tests: go vet + golangci-lint clean across store/ kv/
internal/raftengine/etcd/; targeted test sweep ok (store 0.82s +
etcd 0.40s).

Refs PR #915 round-3 claude[bot] verdict (Ready-to-merge with 4
notes) + round-3 codex P2 (rejected as false-positive).
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 4, 2026

TLA+ spec divergence review (auto-triggered)

This PR touches files that the TLA+ safety spec has an anchor on (per
docs/design/2026_05_28_partial_tla_safety_spec.md §3),
so an AI review is requested below to verify the implementation has not drifted
from the model.

Anchored files changed in this PR head (5684857):

  • kv/fsm.go
  • store/mvcc_store.go

What to check, by subsystem:

  • kv/hlc*.goNext() must respect the HLC-4 preconditions (i)/(ii)/(iii) from the design doc: bounded skew, logical-counter handoff on leader change (strategy (c) Observe(MaxAppliedHLC)), and the commit-time ceiling fence (fail-closed when wall_now >= physicalCeiling). Any change to the bit layout (48/16), the CAS loop, or the ceiling getter/setter is in scope.
  • kv/coordinator.go, kv/sharded_coordinator.goRunHLCLeaseRenewal, hlcRenewalInterval, hlcPhysicalWindowMs constants, and the new-term detection that calls Observe(fsm.MaxAppliedHLC()) (strategy (c)). Any change to renewal cadence, group selection, or fail-closed behaviour is in scope.
  • kv/transaction.go, kv/lock_resolver.go — OCC commit-ts assignment, lock-map encoding (key, lock_ts) -> start_ts, and the LockResolver action OCC-3 depends on. (M2 spec will land OCC-1..OCC-5; until then the spec doc §5.2 is the contract.)
  • kv/fsm.go — FSM apply of HLC lease entries (SetPhysicalCeiling), and any future MaxAppliedHLC() accessor that strategy (c) needs.
  • store/mvcc_store.go — version visibility, snapshot install, and the MVCC-1..MVCC-4 invariants (M3 scope).
  • distribution/** — route catalog versioning, SplitRange atomicity, and CatalogWatcher async fan-out (M4 scope).

If the change is correct but requires a spec update, edit tla/hlc/HLC.tla (or the corresponding M2..M5 module once landed) and the design doc in the same PR. The tla-check workflow runs the TLC model check on the same paths.


@claude review please verify TLA+ spec divergence per the checklist above.

@codex review please verify TLA+ spec divergence per the checklist above.

@chatgpt-codex-connector
Copy link
Copy Markdown

To use Codex here, create a Codex account and connect to github.

@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Jun 4, 2026

Round-4 push: 56848574. claude[bot] round-3 nits + reviewdog gci. codex P2 line 1575 push-back below.

📝 codex round-3 P2 line 1575 — false positive (push-back)

"deletePrefixAtWithOpts never takes that lock"

Factually incorrect. Verified directly:

$ grep -n 'func .s \*pebbleStore. deletePrefixAtWithOpts\|s\.applyMu' store/lsm_store.go | head -8
637:    s.applyMu.Lock()
638:    defer s.applyMu.Unlock()
1532: func (s *pebbleStore) deletePrefixAtWithOpts(...) error {
1536:   s.applyMu.Lock()
1537:   defer s.applyMu.Unlock()

deletePrefixAtWithOpts:1536 acquires s.applyMu.Lock() immediately after s.dbMu.RLock() — same discipline as applyMutationsWithOpts:637 and now SetDurableAppliedIndex:637. All three writers contend on the same applyMu, so the snapshot pin and a concurrent DEL_PREFIX are serialised; the read-modify-write inside SetDurableAppliedIndex (round-3) sees the post-DEL_PREFIX value of metaAppliedIndex and no-ops if existing >= idx.

The race codex describes ("the snapshot pin can still commit its stale lower N afterward and rewind the durable applied index") cannot occur with the current locking. The DEL_PREFIX path participates in the same monotonic/max protection by virtue of the shared applyMu and the round-3 RMW guard.

No code change for this finding.

🟢 claude[bot] round-3 nits — addressed

#1applyMu missing from struct-level lock-ordering comment

store/lsm_store.go:170-181 was stale (still listed maintenanceMu → dbMu → mtx). Round-4 extends it to maintenanceMu → dbMu → applyMu → mtx and documents applyMu's purpose (serialises conflict-check → batch-commit for the raft-apply leaves and the SetDurableAppliedIndex RMW guard).

#4 — Site 2 happy-path test should use exact slice match

TestPersistLocalSnapshotPayload_BumpsAppliedIndex now uses

require.Equal(t, []orderEvent{
    {kind: "bump", index: index},
    {kind: "save", index: index},
}, rec.snapshot(), ...)

matching Site 1's style. A spurious third event (double SaveSnap, double bump) would fail the assertion — the previous GreaterOrEqual + positional would not. Also closes the coderabbit round-2 nit.

#2LastAppliedIndex duplicates readAppliedIndexLocked

Deferred: claude[bot] explicitly called this out as "pre-existing; not a round-3 regression". LastAppliedIndex and readAppliedIndexLocked have identical Pebble-Get + error-handling bodies; the public API does dbMu.RLock only, the private helper assumes dbMu.RLock + applyMu.Lock already held. Refactoring LastAppliedIndex to call readAppliedIndexLocked would either require acquiring applyMu on every read (perf regression) or splitting the helper further (extra surface area). Worth doing but in a follow-up.

#3applyMu held across pebble.Sync fsync

Doc-only note; same trade-off applyMutationsWithOpts already makes. Acceptable for soak-phase B2; revisit if production data shows tail-latency regression.

🟢 reviewdog gci on store/lsm_store_applied_index_test.go:182

Auto-corrected via golangci-lint fmt. Whitespace-only diff in the test's comment block.

Verification

go vet ./store/ ./kv/ ./internal/raftengine/...                         → exit 0
golangci-lint run ./store/ ./kv/ ./internal/raftengine/etcd/            → 0 issues
go test ./store/ ./internal/raftengine/etcd/ -run '<applied-index suite>' → ok

@claude review

@claude
Copy link
Copy Markdown

claude Bot commented Jun 4, 2026

Claude finished @bootjp's task in 1m 38s —— View job


PR #915 Round-4 Review — 56848574

Todo list:

  • Load tools
  • Read store/lsm_store.go — lock-ordering comment update (nit Configure Renovate #1) + verify deletePrefixAtWithOpts acquires applyMu (codex P2 pushback)
  • Read internal/raftengine/etcd/engine_applied_index_test.go — exact slice match for Site 2 (nit Dependency Dashboard #4)
  • Read store/lsm_store_applied_index_test.go — gci fix
  • Verify TLA+ spec divergence for round-4 changes
  • Post complete review

Codex P2 line 1575 — false positive confirmed ✅

The pushback is correct. All three write paths are serialized through the same applyMu:

Method dbMu applyMu
applyMutationsWithOpts :1448 RLock :1453 Lock
deletePrefixAtWithOpts :1539 RLock :1542 Lock
SetDurableAppliedIndex :641 RLock :643 Lock

The codex claim that deletePrefixAtWithOpts "never takes that lock" is factually wrong — it acquires s.applyMu.Lock() at :1542, two lines after dbMu.RLock. The snapshot pin and a concurrent DEL_PREFIX are fully serialized. SetDurableAppliedIndex's RMW guard (round-3) reads a consistent post-DEL_PREFIX metaAppliedIndex value under the lock and no-ops if existing >= idx. No code change was required; no divergence exists.


Nit #1applyMu lock-ordering comment ✅

pebbleStore struct comment at lsm_store.go:170-183 is updated to document the full ordering maintenanceMu → dbMu → applyMu → mtx. The applyMu entry is accurate and complete — it names all three callers that hold it (raft-apply conflict-check leaves, deletePrefixAtWithOpts, and SetDurableAppliedIndex's monotonic guard) and cites the PR that introduced the RMW guard (round-3 of PR #915).


Nit #4 — Site 2 exact slice match ✅

TestPersistLocalSnapshotPayload_BumpsAppliedIndex (engine_applied_index_test.go:207-211) now uses:

require.Equal(t, []orderEvent{
    {kind: "bump", index: index},
    {kind: "save", index: index},
}, rec.snapshot(),
    "hook MUST call SetDurableAppliedIndex BEFORE SaveSnap exactly once each")

Exact parity with Site 1's TestPersistCreatedSnapshot_BumpsAppliedIndex. A spurious extra event (double SaveSnap, double bump) would fail this assertion — the previous GreaterOrEqual + positional approach would not. Closes both claude[bot] round-3 note #4 and the open coderabbit round-2 nit. ✅


gci fix — store/lsm_store_applied_index_test.go

Imports correctly formatted: stdlib block (context, os, testing) first, external packages (github.com/...) second, no misplaced blank lines. reviewdog passes. ✅


TLA+ spec divergence ✅

Round-4 changes are: a struct-level doc-comment update in lsm_store.go and a test assertion tightening in engine_applied_index_test.go. The anchored files kv/fsm.go and store/mvcc_store.go are unchanged since round-2. No spec update required.


Open items from prior rounds (acknowledged deferrals)

  • Nit Update dependency go to v1.21.4 #2 (LastAppliedIndex DRY gap): pre-existing, explicitly deferred to a follow-up. The public API needs dbMu.RLock only; merging it with readAppliedIndexLocked would force applyMu onto every cold-read or require a second helper split. Low-risk as is.
  • Nit Update docker/setup-qemu-action action to v3 #3 (applyMu held across pebble.Sync in snapshot worker): same trade-off as applyMutationsWithOpts. Acceptable for soak-phase B2; revisit if tail-latency data shows regression.
  • coderabbit ShardStore multi-group guard: convention is documented in the inline comment at shard_store.go:1271-1285; no production caller hits this path with appliedIndex > 0 across multiple groups (one FSM per raft log). Not runtime-enforced; appropriate for a soak-phase no-op PR.

Summary

All round-4 items resolved. The codex P2 claim is a false positive — deletePrefixAtWithOpts holds applyMu throughout its batch commit, so the monotonic invariant from round-3 is airtight. The lock-ordering comment is now accurate, the Site 2 test assertion matches Site 1 exactly, and the gci lint is clean. No new issues. Ready to merge.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 56848574e4

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread kv/fsm.go Outdated
Comment on lines +140 to +145
func (f *kvFSM) AppliedIndexReader() raftengine.AppliedIndexReader {
if r, ok := f.store.(raftengine.AppliedIndexReader); ok {
return r
}
return nil
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Implement the reader interface on kvFSM

When the cold-start skip gate queries the FSM through the optional raftengine.AppliedIndexReader interface added in this change, kvFSM will not match it because this method is named AppliedIndexReader() instead of LastAppliedIndex(). In the pebble-backed production FSM that means the newly persisted _meta_applied_index remains unreachable through the advertised interface, so the skip path will conservatively fall back even after raft applies and snapshot checkpoints have populated the meta key.

Useful? React with 👍 / 👎.

Comment on lines +4075 to +4077
if err := e.bumpDurableAppliedIndexBeforeSave(index); err != nil {
return err
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Include the snapshot checkpoint in the payload

In the steady-state snapshot path, the FSM snapshot has already been serialized in persistLocalSnapshot before this bump runs, so the .fsm file / legacy payload that gets saved with the raft snapshot does not contain the _meta_applied_index value just written here. If that snapshot is later restored on a peer or after restart, especially after HLC/encryption-only entries since the last data apply, the restored store still reports the older or missing applied index and the snapshot-skip invariant promised by this checkpoint does not hold for the snapshot artifact.

Useful? React with 👍 / 👎.

…codex P2)

Codex round-4 P2 on kv/fsm.go:145 caught a real wiring bug: the
round-1 method was named AppliedIndexReader() returning the interface
(factory pattern), intended to be called through a separate
AppliedIndexReporter interface. But that AppliedIndexReporter was
never added to the raftengine package, and the planned Branch 3
cold-start skip gate type-asserts fsm.(raftengine.AppliedIndexReader)
directly. A factory method with a different signature does NOT
satisfy the interface, so the skip would always fall back to full
restore even after meta keys are populated on every apply.

Fix: rename kvFSM.AppliedIndexReader() to LastAppliedIndex() and
inline the type-assert forward. kvFSM now directly satisfies
raftengine.AppliedIndexReader.

Added kv/fsm_applied_index_iface_check.go with compile-time guards:

  var _ raftengine.AppliedIndexReader = (*kvFSM)(nil)
  var _ raftengine.AppliedIndexWriter = (*kvFSM)(nil)

so a future rename or signature drift fails at build time rather
than silently degrading the skip optimisation to full-restore-always.

No production callers of the old AppliedIndexReader() factory exist
(verified by grep) — the consumer is the future Branch 3 skip gate
which is not yet wired. Safe rename.

Tests: go vet + golangci-lint clean across all touched packages;
go test ./kv/ ./store/ ./internal/raftengine/... -short ok 56s.

Refs PR #915 round-4 codex P2 (kv/fsm.go:145). Codex P2 on
engine.go:4077 (snapshot artifact does not contain post-bump
metaAppliedIndex) is acknowledged but out of B2 scope — push-back
explanation in the PR comment.
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 4, 2026

TLA+ spec divergence review (auto-triggered)

This PR touches files that the TLA+ safety spec has an anchor on (per
docs/design/2026_05_28_partial_tla_safety_spec.md §3),
so an AI review is requested below to verify the implementation has not drifted
from the model.

Anchored files changed in this PR head (a8e2cf5):

  • kv/fsm.go
  • store/mvcc_store.go

What to check, by subsystem:

  • kv/hlc*.goNext() must respect the HLC-4 preconditions (i)/(ii)/(iii) from the design doc: bounded skew, logical-counter handoff on leader change (strategy (c) Observe(MaxAppliedHLC)), and the commit-time ceiling fence (fail-closed when wall_now >= physicalCeiling). Any change to the bit layout (48/16), the CAS loop, or the ceiling getter/setter is in scope.
  • kv/coordinator.go, kv/sharded_coordinator.goRunHLCLeaseRenewal, hlcRenewalInterval, hlcPhysicalWindowMs constants, and the new-term detection that calls Observe(fsm.MaxAppliedHLC()) (strategy (c)). Any change to renewal cadence, group selection, or fail-closed behaviour is in scope.
  • kv/transaction.go, kv/lock_resolver.go — OCC commit-ts assignment, lock-map encoding (key, lock_ts) -> start_ts, and the LockResolver action OCC-3 depends on. (M2 spec will land OCC-1..OCC-5; until then the spec doc §5.2 is the contract.)
  • kv/fsm.go — FSM apply of HLC lease entries (SetPhysicalCeiling), and any future MaxAppliedHLC() accessor that strategy (c) needs.
  • store/mvcc_store.go — version visibility, snapshot install, and the MVCC-1..MVCC-4 invariants (M3 scope).
  • distribution/** — route catalog versioning, SplitRange atomicity, and CatalogWatcher async fan-out (M4 scope).

If the change is correct but requires a spec update, edit tla/hlc/HLC.tla (or the corresponding M2..M5 module once landed) and the design doc in the same PR. The tla-check workflow runs the TLC model check on the same paths.


@claude review please verify TLA+ spec divergence per the checklist above.

@codex review please verify TLA+ spec divergence per the checklist above.

@chatgpt-codex-connector
Copy link
Copy Markdown

To use Codex here, create a Codex account and connect to github.

@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Jun 4, 2026

Round-5 push: a8e2cf52. One real codex P2 fixed, one push-back.

🟡 codex P2 kv/fsm.go:145 — real bug, fixed

You're right. Round-1's AppliedIndexReader() raftengine.AppliedIndexReader was a factory method, intended to be called through an AppliedIndexReporter interface — but that interface was never added, and Branch 3's planned skip gate will do fsm.(raftengine.AppliedIndexReader) directly. The factory method does NOT satisfy that interface; the skip would always fall back even after meta keys are populated.

Renamed kvFSM.AppliedIndexReader()kvFSM.LastAppliedIndex() (uint64, bool, error), inlining the type-assert forward to f.store.(raftengine.AppliedIndexReader). kvFSM now directly satisfies raftengine.AppliedIndexReader.

Added a separate compile-time guard file (kv/fsm_applied_index_iface_check.go):

var _ raftengine.AppliedIndexReader = (*kvFSM)(nil)
var _ raftengine.AppliedIndexWriter = (*kvFSM)(nil)

so any future rename or signature drift fails at build time rather than silently degrading the skip optimisation. No existing production callers — verified by grep before renaming.

📝 codex P2 engine.go:4077 — out-of-scope for B2 (push-back with rationale)

"the snapshot artifact does not contain the _meta_applied_index value just written here"

Correct observation, but the invariant claimed in PR #910 design §6 is local-only:

"After a successful snapshot persist, LastAppliedIndex >= snapshot.Metadata.Index holds unconditionally"

This is a claim about the LIVE pebble store on the node that persisted the snapshot, NOT about the snapshot artifact's serialized state. The local cold-start skip path (Branch 3) reads the live store, not the artifact, so the invariant holds where it's used.

The codex finding's actual consequence:

Path Effect of stale metaAppliedIndex in artifact
Local cold-start None — uses live store, which has post-bump value
Peer InstallSnapshot → restart Peer's restored store has artifact's stale value; B3's skip gate would conservatively fall back to full restore

The peer-after-InstallSnapshot path is explicitly listed as a Non-Goal in the design (docs/design/2026_06_02_idempotent_snapshot_restore.md):

"Not changing the snapshot-install hot path (Engine.applySnapshot, engine.go:1641)."

When B3 wires the skip gate, the peer-after-InstallSnapshot scenario will conservatively fall back to full restore — which IS the strictly-additive guarantee the design promises (§4 fallback policy). Not a correctness bug; a missed optimisation.

Why not bump before fsm.Snapshot()? Because the FSM Snapshot is taken via s.db.NewSnapshot() which captures a consistent point-in-time view BEFORE the bump can occur. Bumping the live store first doesn't help: the pebble snapshot has already been captured. Closing this gap requires either:

  1. Refactoring maybePersistLocalSnapshot to bump BEFORE calling e.fsm.Snapshot(). This changes the snapshot worker's flow significantly (the bump would need to be synchronous with the e.applied read, before queuing) and risks introducing other races.
  2. Having Engine.applySnapshot (the InstallSnapshot receiver) explicitly call SetDurableAppliedIndex(snap.Metadata.Index) after fsm.Restore returns. Symmetric with persistLocalSnapshotPayload and parallel to the design's HLC ceiling / cutover restore handoff (PR docs(design): idempotent FSM snapshot restore on cold start #910 §5).

Option 2 is the right answer, but it belongs in Branch 3 (the path that consumes the meta key) or as a parallel follow-up — not in B2 (the soak-only PR). Filing as a TODO with a sentinel comment in engine.applySnapshot rather than expanding B2's scope.

Verification

go vet ./kv/ ./store/ ./internal/raftengine/...                    → exit 0
golangci-lint run ./kv/ ./store/ ./internal/raftengine/etcd/       → 0 issues
go test ./kv/ ./store/ ./internal/raftengine/etcd/ -short          → ok 56s

The new compile-time interface guard means any future regression would fail the build immediately:

$ go build ./kv/
$ # If kvFSM stops satisfying AppliedIndexReader:
$ go build ./kv/
./kv/fsm_applied_index_iface_check.go:7:5: cannot use (*kvFSM)(nil) ...

@claude review

@claude
Copy link
Copy Markdown

claude Bot commented Jun 4, 2026

Claude finished @bootjp's task in 3m 52s —— View job


PR #915 Round-5 Review — a8e2cf52

Todo list:

  • Load tools
  • Read kv/fsm.go — verify LastAppliedIndex rename + type-assert forward
  • Read kv/fsm_applied_index_iface_check.go — compile-time guard
  • Read internal/raftengine/statemachine.go — verify AppliedIndexReader interface shape
  • Verify design doc Non-Goal claim for codex P2 push-back
  • TLA+ spec divergence check
  • Post complete review

Codex P2 kv/fsm.go:145 — real bug, correctly fixed ✅

The root cause was accurate: a factory method named AppliedIndexReader() raftengine.AppliedIndexReader does not satisfy the raftengine.AppliedIndexReader interface (whose single method is LastAppliedIndex() (uint64, bool, error)). Branch 3's fsm.(raftengine.AppliedIndexReader) type-assertion would always fail, the skip gate would permanently fall back, and the entire Branch 2 soak investment would be silently wasted.

Implementation (kv/fsm.go:152-161):

LastAppliedIndex() (uint64, bool, error) exactly matches the interface contract defined in statemachine.go:68. The forwarding logic is clean: type-assert f.store to raftengine.AppliedIndexReader, (0, false, nil) for non-seam stores, errors.WithStack on propagated errors. The doc-comment correctly explains the round-1 mistake and the rationale for the rename. ✅

Compile-time guard (kv/fsm_applied_index_iface_check.go):

var _ raftengine.AppliedIndexReader = (*kvFSM)(nil)
var _ raftengine.AppliedIndexWriter = (*kvFSM)(nil)

Both assertions are in a single 9-line file with a clear comment referencing the round-4 fix. Any future rename or signature drift fails the build immediately — the soak investment is protected at the compiler level. This is the right design for an interface that must remain structurally stable across branches. ✅

Old factory method gone: confirmed by grep — the only occurrence of AppliedIndexReader() in kv/ is in the doc-comment explaining why it was changed (line 137). No call sites exist for the old pattern. ✅


Codex P2 engine.go:4077 — push-back accepted ✅

The codex finding is factually correct (the snapshot artifact doesn't embed the post-bump metaAppliedIndex), but the design doc's Non-Goals section at docs/design/2026_06_02_idempotent_snapshot_restore.md:71-74 is explicit:

"Not changing the snapshot-install hot path (Engine.applySnapshot, engine.go:1641). That path runs at runtime when a leader ships us a snapshot we genuinely don't have — the FSM is stale by construction there and Restore must run."

The invariant that LastAppliedIndex >= snap.Metadata.Index is a claim about the live Pebble store on the local node that persisted the snapshot, not about the serialized artifact. Branch 3's skip gate (fsmAlreadyAtIndex) reads the live store. The peer-after-InstallSnapshot path conservatively falls back to full restore, which is strictly correct per §4's fallback policy. The push-back is sound.


Issues and observations

1. Missing TODO sentinel in applyReadySnapshot — minor intent drift

The trigger comment states: "Filing as a TODO with a sentinel comment in engine.applySnapshot rather than expanding B2's scope." No such TODO was added to applyReadySnapshot (the actual receive-and-install-snapshot function, engine.go:1980). This is low-stakes — the design doc Non-Goals section is the authoritative record — but the stated action wasn't taken. A one-line comment at engine.go:2001 (the openAndRestoreFSMSnapshot call site) would be the natural place:

// B3/follow-up: call SetDurableAppliedIndex(tok.Index) here after Restore
// so peer-after-InstallSnapshot also populates metaAppliedIndex.
// Design doc Non-Goals §: "Engine.applySnapshot"; see PR #915 rationale.

Non-blocking for B2, but the stated intent should land somewhere.

Fix this →

2. Design doc pseudocode is now stale — B3 implementor risk ⚠️

docs/design/2026_06_02_idempotent_snapshot_restore.md has stale pseudocode in three places following the round-5 rename:

  • §3 fsmAlreadyAtIndex pseudocode (:242-293): uses AppliedIndexReporter + reporter.AppliedIndexReader() factory. The actual B3 pattern is fsm.(raftengine.AppliedIndexReader) directly.
  • B2 row in Implementation Plan (:866): says "kvFSM.AppliedIndexReader() accessor" — now stale, the method is LastAppliedIndex().
  • §8 compatibility list (:817): lists AppliedIndexReporter as a new opt-in interface, but it was never shipped and is no longer planned.

A B3 implementor reading the design doc would try to add an AppliedIndexReporter shim that was already superseded. The doc should be updated to reflect the direct-interface approach. This is the most actionable item in this round.

Fix this →


Invariant audit (round-5 changes only)

Invariant Status
kvFSM satisfies raftengine.AppliedIndexReader ✅ — LastAppliedIndex() signature matches; compile-time guard confirms
kvFSM satisfies raftengine.AppliedIndexWriter ✅ — SetDurableAppliedIndex() forwarded; compile-time guard confirms
(0, false, nil) fallback for non-seam stores ✅ — f.store.(raftengine.AppliedIndexReader) type-assert fails gracefully
errors.WithStack on propagated store errors ✅ — fsm.go:159
Skip gate type-assertion will succeed in B3 ✅ — interface name and method signature now match exactly

TLA+ spec divergence ✅

Round-5 changes are: method rename (AppliedIndexReader()LastAppliedIndex()), inline doc-comment update, and a new compile-time guard file. HLC, OCC, MVCC visibility, FSM apply semantics, and coordinator/transaction paths are all unchanged. No spec update required.


Summary

The real bug (round-4 codex P2) is fixed correctly and backed by a build-time safety net. The push-back on engine.go:4077 is well-reasoned and consistent with the design doc Non-Goals. The only substantive open item is the design doc's stale pseudocode (#2), which will mislead the B3 implementor if not corrected before Branch 3 begins. The missing TODO sentinel (#1) is cosmetic. Ready to merge pending design doc update.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a8e2cf52de

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +2708 to +2710
if err := e.bumpDurableAppliedIndexBeforeSave(snap.Metadata.Index); err != nil {
return err
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Include config snapshot checkpoint in payload

When the config-snapshot path reaches this hook, persistConfigSnapshotPayloadLocked has already obtained payload from snapshotPayload, which writes/closes the .fsm file (or serializes the legacy bytes) before calling persistCreatedSnapshot. Bumping _meta_applied_index here only updates the live store, so a peer or restart that restores this saved config snapshot gets the older/missing meta key while the raft snapshot is at snap.Metadata.Index; after WAL entries up to that index are released, the advertised LastAppliedIndex >= snapshot index invariant does not hold for the snapshot artifact. This is separate from the steady-state snapshot site: the same ordering occurs through persistConfigSnapshotPayloadLocked -> snapshotPayload -> persistCreatedSnapshot.

Useful? React with 👍 / 👎.

…items

claude[bot] round-5 verdict was 'Ready to merge pending design doc
update'. Two non-code items:

1) engine.go:2001 TODO sentinel — the round-5 PR comment claimed a
   'sentinel comment in engine.applySnapshot' would land, but no
   such comment was added. Round-6 adds it directly above the
   openAndRestoreFSMSnapshot call so a future B3 / follow-up
   implementor can find the planned SetDurableAppliedIndex hook
   site without re-discovering the rationale. Cites the design doc
   Non-Goals scope-out and PR #915 round-4/5 codex P2.

2) docs/design/2026_06_02_idempotent_snapshot_restore.md — three
   stale references to AppliedIndexReporter / factory-method
   pattern, superseded by PR #915 round-5 (kvFSM directly
   satisfies raftengine.AppliedIndexReader via LastAppliedIndex).
   The stale pseudocode would mislead the B3 implementor into
   adding a reporter shim that was already superseded. Updates:

   - §3 'kvFSM exposes its store' block rewritten to show direct
     interface satisfaction + the compile-time guard.
   - §4 fsmAlreadyAtIndex pseudocode rewritten to use direct
     type-assert against raftengine.AppliedIndexReader.
   - Implementation Plan B2 row updated to reflect the actual
     kvFSM.LastAppliedIndex() shape + the round-5 monotonic
     applyMu.Lock() RMW guard on SetDurableAppliedIndex.
   - §8 compatibility list: AppliedIndexReporter removed; replaced
     by AppliedIndexWriter (the actually-shipped interface) plus a
     retrospective sentence explaining the round-5 supersession.

Three retrospective mentions of AppliedIndexReporter remain (as
'why this is NOT what we use') so future readers understand the
design history without re-deriving it.

No code semantics change in round-6. Doc + comment-only.

Tests: go vet + golangci-lint clean across kv/, store/,
internal/raftengine/etcd/.

Refs PR #915 round-5 claude[bot] verdict ('Ready to merge pending
design doc update').
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Jun 4, 2026

Round-6 push: 39513379. Closes both round-5 actionable items. Doc + comment only.

🟢 Item #1 — TODO sentinel in applyReadySnapshot

Added a multi-line comment block right above the openAndRestoreFSMSnapshot call in engine.go (was around :2001, now :2001-:2009 after the insert):

// B3/follow-up: also call SetDurableAppliedIndex(tok.Index) here
// after Restore so peer-after-InstallSnapshot populates the meta
// key. The local-snapshot persist path already bumps the live
// store (engine.persistLocalSnapshotPayload), but the receiving
// node's restored store inherits the pre-bump value embedded in
// the snapshot artifact. Design Non-Goals §
// docs/design/2026_06_02_idempotent_snapshot_restore.md:71-74
// scopes this out of Branch 2; see PR #915 round-4/5 codex P2 on
// engine.go:4077 for the rationale.

Future B3 implementor will find the planned hook site + the round-4/5 rationale without re-discovering it.

🟠 Item #2 — Design doc stale pseudocode ✅

Three places in docs/design/2026_06_02_idempotent_snapshot_restore.md referenced the round-1 AppliedIndexReporter factory pattern that PR #915 round-5 superseded:

Where Before After
§3 "kvFSM exposes its store" factory AppliedIndexReader() AppliedIndexReader + AppliedIndexReporter interface direct LastAppliedIndex() + compile-time guard from kv/fsm_applied_index_iface_check.go
§4 fsmAlreadyAtIndex pseudocode reporter, ok := fsm.(AppliedIndexReporter); reader := reporter.AppliedIndexReader() r, ok := fsm.(raftengine.AppliedIndexReader) direct
Implementation Plan B2 row "kvFSM.AppliedIndexReader() accessor" + no mention of monotonic guard "kvFSM.LastAppliedIndex() directly satisfies raftengine.AppliedIndexReader (compile-time guard)" + applyMu.Lock() RMW from round-3
§8 compatibility list AppliedIndexReporter listed as new opt-in replaced by AppliedIndexWriter (actually shipped) + retrospective note

Three retrospective mentions of AppliedIndexReporter are kept as "this is NOT what we use" — so future readers understand the design history without re-deriving it.

Verification

go vet ./kv/ ./store/ ./internal/raftengine/...                    → exit 0
golangci-lint run ./kv/ ./store/ ./internal/raftengine/etcd/       → 0 issues

No code semantics changed in round-6. The compile-time interface guards from round-5 remain the structural protection.

@claude review

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 4, 2026

TLA+ spec divergence review (auto-triggered)

This PR touches files that the TLA+ safety spec has an anchor on (per
docs/design/2026_05_28_partial_tla_safety_spec.md §3),
so an AI review is requested below to verify the implementation has not drifted
from the model.

Anchored files changed in this PR head (3951337):

  • kv/fsm.go
  • store/mvcc_store.go

What to check, by subsystem:

  • kv/hlc*.goNext() must respect the HLC-4 preconditions (i)/(ii)/(iii) from the design doc: bounded skew, logical-counter handoff on leader change (strategy (c) Observe(MaxAppliedHLC)), and the commit-time ceiling fence (fail-closed when wall_now >= physicalCeiling). Any change to the bit layout (48/16), the CAS loop, or the ceiling getter/setter is in scope.
  • kv/coordinator.go, kv/sharded_coordinator.goRunHLCLeaseRenewal, hlcRenewalInterval, hlcPhysicalWindowMs constants, and the new-term detection that calls Observe(fsm.MaxAppliedHLC()) (strategy (c)). Any change to renewal cadence, group selection, or fail-closed behaviour is in scope.
  • kv/transaction.go, kv/lock_resolver.go — OCC commit-ts assignment, lock-map encoding (key, lock_ts) -> start_ts, and the LockResolver action OCC-3 depends on. (M2 spec will land OCC-1..OCC-5; until then the spec doc §5.2 is the contract.)
  • kv/fsm.go — FSM apply of HLC lease entries (SetPhysicalCeiling), and any future MaxAppliedHLC() accessor that strategy (c) needs.
  • store/mvcc_store.go — version visibility, snapshot install, and the MVCC-1..MVCC-4 invariants (M3 scope).
  • distribution/** — route catalog versioning, SplitRange atomicity, and CatalogWatcher async fan-out (M4 scope).

If the change is correct but requires a spec update, edit tla/hlc/HLC.tla (or the corresponding M2..M5 module once landed) and the design doc in the same PR. The tla-check workflow runs the TLC model check on the same paths.


@claude review please verify TLA+ spec divergence per the checklist above.

@codex review please verify TLA+ spec divergence per the checklist above.

@chatgpt-codex-connector
Copy link
Copy Markdown

To use Codex here, create a Codex account and connect to github.

@claude
Copy link
Copy Markdown

claude Bot commented Jun 4, 2026

Claude finished @bootjp's task in 2m 13s —— View job


PR #915 Round-6 Review — 39513379

Todo list:

  • Load tools
  • Read internal/raftengine/etcd/engine.go — verify TODO sentinel location and content
  • Read docs/design/2026_06_02_idempotent_snapshot_restore.md — verify all three pseudocode fixes
  • TLA+ spec divergence check
  • Post complete review

Item #1 — TODO sentinel in applyReadySnapshot

The comment is correctly placed at engine.go:2001-2009, inside the isSnapshotToken branch, immediately before the openAndRestoreFSMSnapshot call:

// B3/follow-up: also call SetDurableAppliedIndex(tok.Index) here
// after Restore so peer-after-InstallSnapshot populates the meta
// key. The local-snapshot persist path already bumps the live
// store (engine.persistLocalSnapshotPayload), but the receiving
// node's restored store inherits the pre-bump value embedded in
// the snapshot artifact. Design Non-Goals §
// docs/design/2026_06_02_idempotent_snapshot_restore.md:71-74
// scopes this out of Branch 2; see PR #915 round-4/5 codex P2 on
// engine.go:4077 for the rationale.

All three elements from round-5 item #1 are present: the planned action, the Non-Goals scope reference, and the rationale cross-reference. The legacy non-token else branch at :2013 does not need a matching TODO — Branch 3's skip gate operates on token-based snapshots only; the legacy path is full-restore-always by construction.

Formatting is correct (// on every line, no missing prefix). The :4077 line reference is from the round-4/5 review comments and will have shifted with subsequent edits, but this is advisory guidance for a future implementor — grep context is sufficient.


Item #2 — Design doc pseudocode ✅

All three stale locations are updated:

§3 "kvFSM exposes its store" (:239-268): Replaced the factory AppliedIndexReader() AppliedIndexReader + AppliedIndexReporter shim with the direct LastAppliedIndex() method + compile-time guard. The doc explicitly names the superseded round-1 pattern and explains why the direct approach is simpler:

round-1 had a factory AppliedIndexReader() AppliedIndexReader method intended to be called through a separate AppliedIndexReporter interface, but that pattern would require the skip gate to know about the reporter shim; the direct interface satisfaction is simpler and a compile-time guard catches future signature drift

§4 fsmAlreadyAtIndex pseudocode (:296-314): Updated to r, ok := fsm.(raftengine.AppliedIndexReader) / r.LastAppliedIndex() direct pattern with a cross-reference to the compile-time guard file. A B3 implementor can copy this pseudocode directly.

B2 Implementation Plan row (:890): Comprehensive update covering: kvFSM.LastAppliedIndex() directly satisfies raftengine.AppliedIndexReader, compile-time guard, applyMu.Lock() RMW monotonic guard, and both snapshot persist sites. This row now accurately describes the shipped implementation.

§8 compatibility list (:836-842): Correctly lists only AppliedIndexReader, AppliedIndexWriter, SnapshotHeaderApplier as the new interfaces, with a parenthetical noting the AppliedIndexReporter shim from rounds 1-2 was superseded.


Minor observation — not_reporter in §9 metrics spec

The fallback_reason label at :872 still has:

fallback_reason="not_reporter|no_reader|read_err|missing_meta|behind_snapshot"

not_reporter is a leftover from when AppliedIndexReporter was the planned interface (rounds 1-2). After round-5 the direct interface is AppliedIndexReader, and no_reader (the second label value) already covers the same condition. This means the list has a duplicate of slightly different vintage.

This is advisory pseudocode for a future Branch 3 implementation — the B3 author will choose the actual Prometheus label values. Non-blocking, but worth a one-line cleanup (not_reporter| → remove) when Branch 3 lands and the metrics are wired up for real. Not worth a round-7 for a spec-only comment.

Fix when wiring B3 metrics →


TLA+ spec divergence ✅

Round-6 changes are a comment addition in engine.go and design doc text updates. No code semantics changed. The anchored files (kv/fsm.go, store/mvcc_store.go) are unchanged since round-5. No spec update required.


Summary

Both round-5 actionable items are correctly addressed. The TODO sentinel is well-placed with complete context for the B3 implementor. The design doc pseudocode is now consistent with the actual shipped implementation across all three locations. No new issues. Ready to merge.

@bootjp bootjp merged commit 13a1b61 into main Jun 4, 2026
12 checks passed
@bootjp bootjp deleted the impl/snapshot-skip-b2-meta-key branch June 4, 2026 13:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant