feat(snapshot-skip B2): plumb metaAppliedIndex through raft-Apply + both snapshot persist sites#915
Conversation
Two new opt-in extensions to raftengine.StateMachine for the
cold-start snapshot-restore skip optimisation
(docs/design/2026_06_02_idempotent_snapshot_restore.md):
- AppliedIndexReader.LastAppliedIndex() (uint64, bool, error) lets
the engine query the FSM's durable applied-index for the
fsmAlreadyAtIndex gate. (0, false, nil) means 'missing' and falls
back to the full restore path. Any error MUST NOT abort cold
start; the caller collapses (false, _, err) into 'restore'.
- AppliedIndexWriter.SetDurableAppliedIndex(idx uint64) error lets
the engine pin the FSM's durable applied-index to
snap.Metadata.Index before calling persist.SaveSnap, so a
successful snapshot persist always implies
LastAppliedIndex >= snapshot.Metadata.Index. Implementations
MUST use pebble.Sync (or equivalent) unconditionally — the
checkpoint is the only durable carrier of the meta key at this
point, because WAL compaction starts at the snapshot index after
SaveSnap.
Interface only; no implementations or wirings yet. Branch 2 of the
design will:
- Add metaAppliedIndex Pebble meta key + pebbleStore impls
- Thread f.pendingApplyIdx through ApplyMutationsRaftAt /
DeletePrefixAtRaftAt overloads into both batch-bundle sites
- Wire kvFSM accessors that forward to the underlying pebbleStore
- Hook persistCreatedSnapshot AND e.persistLocalSnapshotPayload
(both snapshot persist sites)
Refs PR #910 (design), part of Branch 2 implementation series.
Two new methods on pebbleStore for the cold-start snapshot-restore skip optimisation (PR #910 design, Branch 2 step 2): - metaAppliedIndex / metaAppliedIndexBytes — new Pebble meta key, 'sibling' to metaLastCommitTS in layout and locking discipline. Bundled in the same WriteBatch as the data mutation by the raft-Apply path (next commit). - isPebbleMetaKey extended to include the new key so the snapshot encoder skips it (matches the precedent for the other meta keys). - LastAppliedIndex() (uint64, bool, error) — read under dbMu.RLock. (0, false, nil) for absent / truncated meta key so callers fall back to the full restore path. ErrNotFound is the expected outcome on a pre-upgrade fsm.db or freshly restored store; anything else propagates as a real I/O error (the caller in wal_store.go fsmAlreadyAtIndex collapses to 'false → restore' regardless, but this layer surfaces the typed error so other consumers can distinguish). - SetDurableAppliedIndex(idx) error — write under dbMu.RLock with pebble.Sync UNCONDITIONALLY (even under ELASTICKV_FSM_SYNC_MODE= nosync). Rationale documented inline: the snapshot-persist checkpoint is the only durable carrier of the meta key at this point because WAL compaction following persist.SaveSnap discards every log entry <= snap.Metadata.Index — so an OS-buffered Pebble flush dropped on crash would leave metaAppliedIndex behind the snapshot pointer permanently. +1 fsync per snapshot persist (rare; default SnapshotCount=10000) is the right price. These are the pebbleStore-side implementations of raftengine.AppliedIndexReader and raftengine.AppliedIndexWriter added in the preceding commit. The kvFSM accessors that delegate to these will land in a follow-up step alongside the engine hook sites. Refs PR #910 (design), part of Branch 2 implementation series.
Plumbing for the raft entry index to reach the leaf pebble.Batch so metaAppliedIndex can be bundled atomically with the data mutation (PR #910 design §2, Branch 2 step 3). Public surface (store.MVCCStore extension): - ApplyMutationsRaftAt(ctx, muts, readKeys, startTS, commitTS, appliedIndex) - DeletePrefixAtRaftAt(ctx, prefix, exclude, commitTS, appliedIndex) appliedIndex==0 means 'no index, do not bump the meta key' so existing callers that still use the non-At variants keep their current behaviour byte-for-byte. Implementations: - pebbleStore: bundles setPebbleUint64InBatch(metaAppliedIndex) in both applyMutationsWithOpts and deletePrefixAtWithOpts when appliedIndex > 0. +16 bytes per batch, zero extra fsync. - mvccStore: in-memory store discards appliedIndex. LastAppliedIndex on this backend would always be 'missing'; OK because it is only used by tests / catalog bootstrap. - LeaderRoutedStore: forwards to local. - ShardStore.ApplyMutationsRaftAt: routes to single owning shard. - ShardStore.DeletePrefixAtRaftAt: broadcasts to every group; the single-group production raft-apply path gets correct bundling; the multi-group broadcast case is admin-driven FLUSHALL only, never raft-applied. No call sites are switched over yet; the next commit threads f.pendingApplyIdx into the kvFSM data-Apply path through these overloads. Tests: go vet ./store/ ./kv/ ./internal/raftengine/... clean; go test ./store/ -short ok 29.4s. Refs PR #910 (design), part of Branch 2 implementation series.
Wires the raft applied-index seam through the kvFSM data-Apply path so metaAppliedIndex is bundled atomically with every data mutation (PR #910 design §2 + §3, Branch 2 step 4). New methods on kvFSM: - AppliedIndexReader() raftengine.AppliedIndexReader — exposes the underlying store's reader when it implements the seam (pebbleStore does; in-memory mvccStore does not). nil propagates to the fsmAlreadyAtIndex caller and triggers the conservative full-restore fallback. - SetDurableAppliedIndex(idx) error — forwards to the underlying store's writer seam (pebbleStore does pebble.Sync; the in-memory store's no-op default is preserved). The engine calls this at each snapshot persist site BEFORE persist.SaveSnap. Threaded f.pendingApplyIdx through every kv/fsm.go data-Apply leaf (7 call sites): - handleNonTxnRequest → ApplyMutationsRaftAt - handleDelPrefix → DeletePrefixAtRaftAt - handlePrepareRequest (single-shard fast path) - handleOnePhaseTxnRequest (with-readkeys path) - applyCommitWithIdempotencyFallback (commit path) - applyCommitWithIdempotencyFallback (idempotency replay fallback) - handleAbortRequest (rollback marker write) Each site passes f.pendingApplyIdx — the value the engine stashed via SetApplyIndex immediately before Apply through the raftengine.ApplyIndexAware seam. Engine-bypassing tests that drive Apply directly without setting the index will pass 0, which the leaf treats as 'do not bump the meta key' — keeping their behaviour byte-for-byte identical to before. Encryption opcode dispatch is unchanged: applyReservedOpcode still threads pendingApplyIdx into applyEncryption (its existing consumer for the WriteSidecar RaftAppliedIndex slot). The meta key bump above is purely additive on the data-Apply leaves; encryption-only entries continue to leave metaAppliedIndex unchanged (PR #910 design §6 'encryption opcodes' subsection). Tests: go vet ./kv/ ./store/ ./internal/raftengine/... clean; go test ./kv/ -short ok 10.4s. Refs PR #910 (design), part of Branch 2 implementation series.
Two new hooks pinning metaAppliedIndex to the snapshot index BEFORE the corresponding persist.SaveSnap call. After a successful snapshot persist, LastAppliedIndex >= snap.Metadata.Index holds unconditionally (PR #910 design §6, Branch 2 step 5). Site 1 - persistCreatedSnapshot (engine.go:2679) — drives config snapshots (membership-change snapshots). Site 2 - e.persistLocalSnapshotPayload (engine.go:4048) — the steady-state SnapshotCount-triggered hot path: maybePersistLocalSnapshot -> e.persistLocalSnapshotPayload (this method) -> free persistLocalSnapshotPayload (wal_store.go:519) -> persist.SaveSnap (wal_store.go:524) The hook lives in the engine wrapper, not the free function, so the free function stays signature-stable and the call holds e.snapshotMu.Lock() before invoking SetDurableAppliedIndex (serializing against follower-snapshot restore). Both sites call e.fsm.(raftengine.AppliedIndexWriter) and silently no-op when the FSM does not implement the seam (legacy test fakes, in-memory backends). pebble.Sync is forced on the writer side (lsm_store.SetDurableAppliedIndex inline comment) regardless of ELASTICKV_FSM_SYNC_MODE — the checkpoint is the only durable carrier of metaAppliedIndex at this point because WAL compaction following SaveSnap discards every entry <= snap.Metadata.Index. Crash ordering (matches design §6 table): the bump is fsynced before SaveSnap returns. The only observable states are (metaAppliedIndex, snapshot pointer) in {(Y, X'<X), (X, X'), (X, X), (Z>X, X)}. None of these can yield 'snapshot pointer = X but metaAppliedIndex < X'. Over-restore impossible; round-3 P2 + round-4 P2 closed. Tests: go vet ./internal/raftengine/etcd/ ./kv/ ./store/ clean; go test ./internal/raftengine/... -short ok 32.8s. Refs PR #910 (design), part of Branch 2 implementation series.
…ring PR #910 design B2 test list, step 6 of the implementation series. store/lsm_store_applied_index_test.go (7 tests): - TestLastAppliedIndex_MissingMetaKey — fresh store reports (0, false, nil); strictly-additive fallback to full restore is intact for pre-upgrade fsm.db files. - TestSetDurableAppliedIndex_RoundTrip — round-trips an arbitrary uint64 (0xDEAD_BEEF_CAFE_F00D) through Set + Get. - TestApplyMutationsRaftAt_BundlesMetaAppliedIndex — verifies the leaf bundles metaAppliedIndex when appliedIndex > 0, AND the data mutation also lands (sanity check the bundling did not drop the data path). - TestApplyMutationsRaftAt_ZeroIndexLeavesMetaKey — escape hatch: appliedIndex=0 MUST NOT touch metaAppliedIndex (preserves ApplyMutationsRaft byte-compat for callers not yet wired). - TestDeletePrefixAtRaftAt_BundlesMetaAppliedIndex — analogous round-trip for the DEL_PREFIX leaf, which builds its own pebble.Batch separate from applyMutationsWithOpts. - TestSetDurableAppliedIndex_UsesPebbleSync — sets ELASTICKV_FSM_SYNC_MODE=nosync, writes via the checkpoint, closes + reopens the store, asserts the value persists. Black- box regression guard for the pebble.Sync-unconditionally claim (a NoSync write would not deterministically survive a Close+Open). - TestLastAppliedIndex_CorruptValue — directly writes a 2-byte payload under the meta key and verifies LastAppliedIndex collapses to (0, false, nil) rather than propagating an error. Keeps cold start safe under partial-write corruption. internal/raftengine/etcd/engine_applied_index_test.go (3 tests): - TestRecordingFSM_SatisfiesAppliedIndexWriter — compile-time adjacency check that the recording FSM implements the writer seam. - TestPersistCreatedSnapshot_BumpsAppliedIndex — drives the persistCreatedSnapshot hook directly, asserts the ordered event stream is exactly [bump@42, save@42]. - TestPersistCreatedSnapshot_NilFSMNoOp — legacy FSM (no writer seam) path: SaveSnap still runs, no bump emitted; preserves backward compat with test fakes. - TestPersistCreatedSnapshot_BumpErrorAborts — if SetDurableAppliedIndex returns an error, the engine MUST surface it AND NOT call SaveSnap. Preserves the crash-ordering invariant 'snapshot pointer never durable without metaAppliedIndex first'. go.mod: github.com/coreos/go-semver promoted to direct (was indirect because the test stubs out etcdstorage.Storage.MinimalEtcdVersion which returns *semver.Version). Refs PR #910 (design), B2 step 6 — final step before opening PR. All tests: ok 0.028s (engine) + 1.6s (store).
|
Warning Review limit reached
More reviews will be available in 17 minutes and 6 seconds. Learn how PR review limits work. Your organization has run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (7)
📝 WalkthroughWalkthroughAdds durable raft applied-index support end-to-end: optional FSM reader/writer interfaces, MVCCStore raft-at APIs, Pebble metaAppliedIndex with atomic bundling, kvFSM/store/engine plumbing to bump the durable index before snapshot persist, and tests plus a dependency promotion. ChangesDurable Applied Index Tracking
🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
TLA+ spec divergence review (auto-triggered)This PR touches files that the TLA+ safety spec has an anchor on (per Anchored files changed in this PR head (2c42f7d):
What to check, by subsystem:
If the change is correct but requires a spec update, edit @claude review please verify TLA+ spec divergence per the checklist above. @codex review please verify TLA+ spec divergence per the checklist above. |
|
To use Codex here, create a Codex account and connect to github. |
|
@claude review PR #910 design Branch 2 implementation. 6 commits, no behaviour change yet (skip gate lands in B3); B2 is the soak phase. Review checklist for this PR:
PR description has full reading order + soak plan + ldb verification command. |
|
Claude finished @bootjp's task in 4m 16s —— View job PR #915 Review — Branch 2:
|
| call site | function | line |
|---|---|---|
| 1 | handleRawRequest → ApplyMutationsRaftAt |
:492 |
| 2 | handleDelPrefix → DeletePrefixAtRaftAt |
:509 |
| 3 | handlePrepareRequest → ApplyMutationsRaftAt |
:767 |
| 4 | handleOnePhaseTxnRequest → ApplyMutationsRaftAt |
:831 |
| 5 | applyCommitWithIdempotencyFallback (first attempt) → ApplyMutationsRaftAt |
:935 |
| 6 | applyCommitWithIdempotencyFallback (idempotency retry) → ApplyMutationsRaftAt |
:952 |
| 7 | handleAbortRequest → ApplyMutationsRaftAt |
:1003 |
Non-data leaves (applyHLCLease, applyEncryption) correctly don't call any store mutation, so there's nothing to switch. The raftEncodeHLCLease path applies only to the HLC ceiling — no store write. This is also why Site 2 (persistLocalSnapshotPayload) exists: HLC lease entries advance the snapshot index without bumping metaAppliedIndex via the Apply path.
TLA+ spec divergence ✅
Anchored files kv/fsm.go and store/mvcc_store.go are changed, but the changes are purely additive plumbing:
kv/fsm.go: HLC-4 strategy (c)Observe(commitTS)at:452is unchanged.SetPhysicalCeilingapply path is unchanged. FSMApply()signature is unchanged. The B2 additions (AppliedIndexReader(),SetDurableAppliedIndex(),pendingApplyIdxthreading) don't touch any spec-anchored invariant.store/mvcc_store.go:ApplyMutationsRaftAt/DeletePrefixAtRaftAtonmvccStorediscardappliedIndexand delegate toApplyMutations. MVCC-1..MVCC-4 visibility logic is unaffected.- OCC, HLC, transaction paths: unchanged.
No TLA+ spec update required.
go.mod ✅
github.com/coreos/go-semver promoted from indirect to direct. Correct: engine_applied_index_test.go directly imports github.com/coreos/go-semver/semver for the MinimalEtcdVersion() *semver.Version stub on recordingPersistStorage.
Issues and observations
1. Missing test for Site 2 (persistLocalSnapshotPayload) crash-ordering
TestPersistCreatedSnapshot_BumpErrorAborts covers Site 1. Site 2 has identical bump-before-save logic but no analogous test. The PR description says "both crash-ordering tested by TestPersistCreatedSnapshot_*" but those tests only reach persistCreatedSnapshot. A TestPersistLocalSnapshotPayload_BumpErrorAborts exercising the goroutine method directly (similar to the Site 1 test, but calling e.persistLocalSnapshotPayload with a failing FSM) would close this gap.
2. SetDurableAppliedIndex can decrease metaAppliedIndex (non-monotonic) — design note
When persistLocalSnapshotPayload(M) runs concurrently with in-flight applies that have already written metaAppliedIndex = N > M, the checkpoint call writes M < N. This is not a correctness bug for Branch 3 (the skip invariant LastAppliedIndex >= snap.Metadata.Index still holds for the snapshot at M — both sides are M), and raft log replay covers entries M+1..N on the next cold start. However, the comment on SetDurableAppliedIndex ("last-writer-wins always satisfies have >= want") is slightly misleading here — it only holds for the snapshot's own invariant, not for the global monotonicity of the key. Worth documenting this explicitly so Branch 3's implementor doesn't inadvertently assume the key is always non-decreasing.
3. ShardStore.DeletePrefixAtRaftAt multi-group caveat — documentation-only risk
The comment at shard_store.go:1294-1301 says tests that exercise ShardStore.DeletePrefixAtRaftAt across multiple groups MUST pass appliedIndex=0, but this is a contract enforced only by convention. There's no guard. Given that the production FSM never uses ShardStore as its direct f.store (each FSM has a single-group store), this is fine — just worth flagging for anyone adding tests later.
Summary
All 9 design constraints are correctly implemented and verified. The implementation is clean and the test coverage is solid. The one material gap is a missing test for Site 2 (persistLocalSnapshotPayload) crash-ordering; everything else is well-covered. The non-monotonic write nuance (#2) is safe for Branch 3 but worth a clarifying comment. Ready to land subject to closing the Site 2 test gap.
There was a problem hiding this comment.
Code Review
This pull request introduces a mechanism to track and persist the durable raft-applied index (metaAppliedIndex) in the Pebble-backed state machine to enable a cold-start skip optimization for snapshot restores. It adds AppliedIndexReader and AppliedIndexWriter interfaces, implements them in kvFSM and pebbleStore, and updates the write paths to atomically bundle the applied index with mutations. Feedback highlights that wrapper stores like LeaderRoutedStore and ShardStore must also implement these interfaces to prevent the skip optimization from silently no-oping in production. Additionally, it is recommended to ensure monotonic updates in SetDurableAppliedIndex using atomic operations to prevent the durable index from going backwards on disk.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| func (s *LeaderRoutedStore) DeletePrefixAtRaftAt(ctx context.Context, prefix []byte, excludePrefix []byte, commitTS, appliedIndex uint64) error { | ||
| if s == nil || s.local == nil { | ||
| return errors.WithStack(store.ErrNotSupported) | ||
| } | ||
| return errors.WithStack(s.local.DeletePrefixAtRaftAt(ctx, prefix, excludePrefix, commitTS, appliedIndex)) | ||
| } |
There was a problem hiding this comment.
The LeaderRoutedStore wrapper does not implement raftengine.AppliedIndexReader or raftengine.AppliedIndexWriter. Since kvFSM type-asserts its underlying store to these interfaces, using LeaderRoutedStore in production (e.g., single-raft-group deployments) will cause the skip optimization to silently no-op and always fall back to full restore.
To fix this, LeaderRoutedStore should implement both interfaces by forwarding the calls to s.local using duck-typing (to avoid importing raftengine if desired).
func (s *LeaderRoutedStore) DeletePrefixAtRaftAt(ctx context.Context, prefix []byte, excludePrefix []byte, commitTS, appliedIndex uint64) error {
if s == nil || s.local == nil {
return errors.WithStack(store.ErrNotSupported)
}
return errors.WithStack(s.local.DeletePrefixAtRaftAt(ctx, prefix, excludePrefix, commitTS, appliedIndex))
}
func (s *LeaderRoutedStore) LastAppliedIndex() (uint64, bool, error) {
if s == nil || s.local == nil {
return 0, false, nil
}
if r, ok := s.local.(interface {
LastAppliedIndex() (uint64, bool, error)
}); ok {
return r.LastAppliedIndex()
}
return 0, false, nil
}
func (s *LeaderRoutedStore) SetDurableAppliedIndex(idx uint64) error {
if s == nil || s.local == nil {
return nil
}
if w, ok := s.local.(interface {
SetDurableAppliedIndex(uint64) error
}); ok {
return w.SetDurableAppliedIndex(idx)
}
return nil
}| func (s *ShardStore) DeletePrefixAtRaftAt(ctx context.Context, prefix []byte, excludePrefix []byte, commitTS, appliedIndex uint64) error { | ||
| for _, g := range s.groups { | ||
| if g == nil || g.Store == nil { | ||
| continue | ||
| } | ||
| // Pass appliedIndex through to every group. In the | ||
| // single-group call-path (the production raft-apply case) | ||
| // this is correct: appliedIndex IS that group's raft entry | ||
| // index. In a hypothetical multi-group call, only one group | ||
| // would see the matching index and the rest would treat it | ||
| // as a non-monotonic stray write — but the rest of the | ||
| // raft-apply contract (single FSM per raft log) makes that | ||
| // case impossible to reach in production. Tests that | ||
| // exercise ShardStore.DeletePrefixAtRaftAt across multiple | ||
| // groups MUST pass appliedIndex=0 to opt out. | ||
| if err := g.Store.DeletePrefixAtRaftAt(ctx, prefix, excludePrefix, commitTS, appliedIndex); err != nil { | ||
| return errors.WithStack(err) | ||
| } | ||
| } | ||
| return nil | ||
| } |
There was a problem hiding this comment.
The ShardStore wrapper does not implement raftengine.AppliedIndexReader or raftengine.AppliedIndexWriter. Since kvFSM type-asserts its underlying store to these interfaces, using ShardStore in production (e.g., sharded deployments) will cause the skip optimization to silently no-op and always fall back to full restore.
To fix this, ShardStore should implement both interfaces by propagating the calls to all underlying group stores.
func (s *ShardStore) DeletePrefixAtRaftAt(ctx context.Context, prefix []byte, excludePrefix []byte, commitTS, appliedIndex uint64) error {
for _, g := range s.groups {
if g == nil || g.Store == nil {
continue
}
// Pass appliedIndex through to every group. In the
// single-group call-path (the production raft-apply case)
// this is correct: appliedIndex IS that group's raft entry
// index. In a hypothetical multi-group call, only one group
// would see the matching index and the rest would treat it
// as a non-monotonic stray write — but the rest of the
// raft-apply contract (single FSM per raft log) makes that
// case impossible to reach in production. Tests that
// exercise ShardStore.DeletePrefixAtRaftAt across multiple
// groups MUST pass appliedIndex=0 to opt out.
if err := g.Store.DeletePrefixAtRaftAt(ctx, prefix, excludePrefix, commitTS, appliedIndex); err != nil {
return errors.WithStack(err)
}
}
return nil
}
func (s *ShardStore) LastAppliedIndex() (uint64, bool, error) {
if s == nil || len(s.groups) == 0 {
return 0, false, nil
}
var minIdx uint64
hasMin := false
for _, g := range s.groups {
if g == nil || g.Store == nil {
continue
}
if r, ok := g.Store.(raftengine.AppliedIndexReader); ok {
idx, present, err := r.LastAppliedIndex()
if err != nil {
return 0, false, err
}
if present {
if !hasMin || idx < minIdx {
minIdx = idx
hasMin = true
}
}
}
}
return minIdx, hasMin, nil
}
func (s *ShardStore) SetDurableAppliedIndex(idx uint64) error {
if s == nil {
return nil
}
for _, g := range s.groups {
if g == nil || g.Store == nil {
continue
}
if w, ok := g.Store.(raftengine.AppliedIndexWriter); ok {
if err := w.SetDurableAppliedIndex(idx); err != nil {
return err
}
}
}
return nil
}| func (s *pebbleStore) SetDurableAppliedIndex(idx uint64) error { | ||
| s.dbMu.RLock() | ||
| defer s.dbMu.RUnlock() | ||
| batch := s.db.NewBatch() | ||
| defer func() { _ = batch.Close() }() | ||
| if err := setPebbleUint64InBatch(batch, metaAppliedIndexBytes, idx); err != nil { | ||
| return err | ||
| } | ||
| return errors.WithStack(batch.Commit(pebble.Sync)) | ||
| } |
There was a problem hiding this comment.
Since SetDurableAppliedIndex does not check if the existing durable index is already greater than idx, a concurrent raft apply of a higher index (e.g., 100) could be overwritten by an asynchronous snapshot persist of a lower index (e.g., 90). While logically safe due to idempotency, it can cause the durable index to go backwards on disk, leading to redundant log replay on cold start.
To prevent this, we should ensure the index is updated monotonically. Since this is a frequently accessed field on a hot path, prefer using atomic operations (such as Compare-And-Swap) over mutexes to avoid performance overhead.
func (s *pebbleStore) SetDurableAppliedIndex(idx uint64) error {
s.dbMu.RLock()
defer s.dbMu.RUnlock()
for {
current := atomic.LoadUint64(&s.durableAppliedIndex)
if idx <= current {
return nil
}
if atomic.CompareAndSwapUint64(&s.durableAppliedIndex, current, idx) {
break
}
}
batch := s.db.NewBatch()
defer func() { _ = batch.Close() }()
if err := setPebbleUint64InBatch(batch, metaAppliedIndexBytes, idx); err != nil {
return err
}
return errors.WithStack(batch.Commit(pebble.Sync))
}References
- For frequently accessed fields that require monotonic updates (like a configuration index), prefer atomic operations (e.g., CAS) over mutexes to improve performance on hot paths.
There was a problem hiding this comment.
🧹 Nitpick comments (2)
internal/raftengine/etcd/engine_applied_index_test.go (1)
93-148: ⚡ Quick winAdd a mirror test for
persistLocalSnapshotPayload.This suite only exercises
persistCreatedSnapshot, but the steady-state hook inpersistLocalSnapshotPayloadis a separate code path with its own early-return behavior. A direct test for bump-before-persist ordering and bump-error aborts there would close the main regression gap in this change. Based on learnings: Ensure FSM apply operations are idempotent and that committed entries cannot be lost. Check snapshot/restore round-trips and crash-restart paths.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/raftengine/etcd/engine_applied_index_test.go` around lines 93 - 148, Add mirror unit tests for persistLocalSnapshotPayload that replicate the three behaviors covered for persistCreatedSnapshot: (1) a test like TestPersistCreatedSnapshot_BumpsAppliedIndex that uses applyIndexOrderRecorder + recordingAppliedIndexFSM + recordingPersistStorage and asserts SetDurableAppliedIndex (bump) is called before SaveSnap when calling Engine.persistLocalSnapshotPayload with a snapshot; (2) a legacy-path test using testStateMachine that verifies a non-AppliedIndexWriter FSM still lets the snapshot save run (no bump); and (3) a bump-error-aborts test where recordingAppliedIndexFSM is configured to fail (failNext / failErr) and you assert persistLocalSnapshotPayload returns the error and SaveSnap is not invoked; reuse the same helper types (applyIndexOrderRecorder, recordingAppliedIndexFSM, recordingPersistStorage) and snapshot metadata values as in the existing tests to keep behavior parity.kv/shard_store.go (1)
1271-1305: ⚡ Quick winPotential footgun: multi-group DeletePrefixAtRaftAt relies on caller discipline.
The implementation passes
appliedIndexto every group in the loop (line 1300). For a multi-group ShardStore, this would write the sameappliedIndexto multiple independent Pebble stores, corrupting their separate raft-applied-index tracking.The comment (lines 1271-1285) documents that callers "MUST pass appliedIndex=0" for multi-group scenarios, but there's no runtime enforcement. Consider adding a defensive check:
if appliedIndex > 0 && len(s.groups) > 1 { return errors.New("DeletePrefixAtRaftAt with appliedIndex > 0 requires single-group ShardStore") }If the design guarantees that production FSM paths only see single-group ShardStore (as the comment claims), this check would catch test misuse without impacting production.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@kv/shard_store.go` around lines 1271 - 1305, The DeletePrefixAtRaftAt method currently forwards appliedIndex to every group which can corrupt per-store raft-applied tracking in multi-group ShardStore; add a defensive runtime check at the start of ShardStore.DeletePrefixAtRaftAt that returns an error if appliedIndex > 0 and the store has more than one group (use s.groups length or a helper like s.groupCount()), so callers must explicitly opt out by passing appliedIndex=0 for multi-group scenarios; keep the existing loop and behavior otherwise and reference the DeletePrefixAtRaftAt receiver, s.groups, and the appliedIndex parameter when adding the check.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@internal/raftengine/etcd/engine_applied_index_test.go`:
- Around line 93-148: Add mirror unit tests for persistLocalSnapshotPayload that
replicate the three behaviors covered for persistCreatedSnapshot: (1) a test
like TestPersistCreatedSnapshot_BumpsAppliedIndex that uses
applyIndexOrderRecorder + recordingAppliedIndexFSM + recordingPersistStorage and
asserts SetDurableAppliedIndex (bump) is called before SaveSnap when calling
Engine.persistLocalSnapshotPayload with a snapshot; (2) a legacy-path test using
testStateMachine that verifies a non-AppliedIndexWriter FSM still lets the
snapshot save run (no bump); and (3) a bump-error-aborts test where
recordingAppliedIndexFSM is configured to fail (failNext / failErr) and you
assert persistLocalSnapshotPayload returns the error and SaveSnap is not
invoked; reuse the same helper types (applyIndexOrderRecorder,
recordingAppliedIndexFSM, recordingPersistStorage) and snapshot metadata values
as in the existing tests to keep behavior parity.
In `@kv/shard_store.go`:
- Around line 1271-1305: The DeletePrefixAtRaftAt method currently forwards
appliedIndex to every group which can corrupt per-store raft-applied tracking in
multi-group ShardStore; add a defensive runtime check at the start of
ShardStore.DeletePrefixAtRaftAt that returns an error if appliedIndex > 0 and
the store has more than one group (use s.groups length or a helper like
s.groupCount()), so callers must explicitly opt out by passing appliedIndex=0
for multi-group scenarios; keep the existing loop and behavior otherwise and
reference the DeletePrefixAtRaftAt receiver, s.groups, and the appliedIndex
parameter when adding the check.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 05a21b08-93fa-497f-9450-99f76bd3da7f
📒 Files selected for processing (11)
go.modinternal/raftengine/etcd/engine.gointernal/raftengine/etcd/engine_applied_index_test.gointernal/raftengine/statemachine.gokv/fsm.gokv/leader_routed_store.gokv/shard_store.gostore/lsm_store.gostore/lsm_store_applied_index_test.gostore/mvcc_store.gostore/store.go
Address PR #915 round-1 feedback: 1) claude[bot] BLOCKER — Site 2 (persistLocalSnapshotPayload) crash- ordering test gap: internal/raftengine/etcd/engine_applied_index_test.go gains: - localSnapshotEngine helper seeding etcdraft.MemoryStorage so buildLocalSnapshot's storage.Term(applied) lookup succeeds (without seeding, the free persistLocalSnapshotPayload short- circuits at Term and SaveSnap is never reached). - TestPersistLocalSnapshotPayload_BumpsAppliedIndex — happy path: asserts the recorded event sequence is exactly [bump@123, save@123]. - TestPersistLocalSnapshotPayload_BumpErrorAborts — mirror of Site 1: a failed SetDurableAppliedIndex MUST surface the error AND prevent the free-function from running (so SaveSnap is never called). Preserves the crash invariant for Site 2. These close the only blocker from claude[bot]'s round-1 review. 2) reviewdog golangci-lint findings (3 issues → 0): - lsm_store.go gci: re-aligned the metaAppliedIndex const block; golangci-lint fmt picked it up. - kv/fsm.go wrapcheck on SetDurableAppliedIndex return: wrapped the interface-method return in errors.WithStack so the lint allow-list is honoured. - engine.go cyclop on persistLocalSnapshotPayload (12 > 10): extracted two helpers: * bumpDurableAppliedIndexBeforeSave(index) — shared by BOTH persist sites, deduplicates the writer-seam type assertion and the inline doc-comment (round-1 had it twice). * handleLocalSnapshotPersistResult(err) + purgeAfterLocalSnapshot() — collapses the four-arm error switch into a helper. The three ErrCompacted / ErrUnavailable / ErrSnapOutOfDate branches were already a single logical case so they fold into one switch arm. Cyclop now reports 0 issues; the post-extract function body is also easier to read. 3) Defensive forwards (gemini HIGH false-positive — pushed back in the round-2 PR comment with grep evidence): kv/leader_routed_store.go and kv/shard_store.go gain LastAppliedIndex / SetDurableAppliedIndex methods. Both forwards are currently DEAD CODE for the cold-start skip optimisation — verified by grep on origin/main that the kvFSM holds a *pebbleStore directly in every production path (LeaderRoutedStore is the adapter/server read-routing wrapper; ShardStore is the coordinator-facing fanout wrapper; neither is f.store). They are added defensively so a future refactor that wraps the FSM store does not silently degrade the optimisation to full- restore-always. ShardStore aggregates LastAppliedIndex as MIN over groups so the cold-start skip gate refuses to skip whenever ANY group lags (conservative 'over-restore beats under-restore' rule from design §4). Tests: go vet ./store/ ./kv/ ./internal/raftengine/... clean; go test ./store/ ./kv/ ./internal/raftengine/... -short ok 53s total; new Site 2 tests pass on first run after seeding storage. golangci-lint run ./store/ ./kv/ ./internal/raftengine/etcd/ → 0 issues. Refs PR #915 round-1 (claude[bot] verdict + gemini-code-assist HIGH + reviewdog lint).
TLA+ spec divergence review (auto-triggered)This PR touches files that the TLA+ safety spec has an anchor on (per Anchored files changed in this PR head (7001a93):
What to check, by subsystem:
If the change is correct but requires a spec update, edit @claude review please verify TLA+ spec divergence per the checklist above. @codex review please verify TLA+ spec divergence per the checklist above. |
|
Round-2 push: 🔴 claude[bot] BLOCKER — Site 2 crash-ordering test gapClosed via two new tests:
Discovery from writing the happy-path test: 🟢 reviewdog golangci-lint (3 issues → 0)
🟢 Defensive forwards (push-back on gemini HIGH false positive)
$ grep -rn 'NewKvFSMWithHLC(' --include='*.go' . | grep -v _test.go
./cmd/server/demo.go:453: fsm := kv.NewKvFSMWithHLC(st, hlc) # st = *pebbleStore
./main.go:851: sm := kv.NewKvFSMWithHLC(st, clock, ...) # st = *pebbleStoreThe kvFSM holds a The defensive forwards are added anyway because a future refactor that wraps the FSM store would otherwise silently degrade the skip to "full-restore-always" with no failure signal. 🟢 Other round-1 nits
Verification@claude review |
|
To use Codex here, create a Codex account and connect to github. |
|
Claude finished @bootjp's task in 2m 14s —— View job PR #915 Round-2 Review —
|
There was a problem hiding this comment.
🧹 Nitpick comments (1)
internal/raftengine/etcd/engine_applied_index_test.go (1)
204-209: ⚡ Quick winConsider asserting the exact event sequence for tighter crash-ordering coverage.
Site 1 (
TestPersistCreatedSnapshot_BumpsAppliedIndex) asserts the full slice withrequire.Equal, but hererequire.GreaterOrEqual(len, 2)+ positional checks would not catch a spurious extrasave/bump(e.g. an accidental doubleSaveSnap). Since these tests exist specifically to lock down the bump-before-save invariant, an exact assertion matches Site 1 and is slightly stricter.♻️ Proposed tightening
- events := rec.snapshot() - require.GreaterOrEqual(t, len(events), 2, - "both bump and save events MUST be recorded; got %d", len(events)) - require.Equal(t, "bump", events[0].kind, "bump MUST be first") - require.Equal(t, index, events[0].index) - require.Equal(t, "save", events[1].kind, "save MUST be second") - require.Equal(t, index, events[1].index) + require.Equal(t, []orderEvent{ + {kind: "bump", index: index}, + {kind: "save", index: index}, + }, rec.snapshot(), + "hook MUST call SetDurableAppliedIndex BEFORE SaveSnap exactly once each")🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/raftengine/etcd/engine_applied_index_test.go` around lines 204 - 209, The test currently uses require.GreaterOrEqual(t, len(events), 2) plus positional checks on events[0] and events[1], which misses spurious extra entries; change the assertion to require an exact sequence of two events to match the bump-before-save invariant (as done in TestPersistCreatedSnapshot_BumpsAppliedIndex). Concretely, assert the length is exactly 2 (require.Len/require.Equal with 2) and then assert the full slice equals the expected slice (e.g., build an expected []event{{kind:"bump", index:index}, {kind:"save", index:index}} and compare it to events) so any extra or out-of-order bump/save entries fail the test.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@internal/raftengine/etcd/engine_applied_index_test.go`:
- Around line 204-209: The test currently uses require.GreaterOrEqual(t,
len(events), 2) plus positional checks on events[0] and events[1], which misses
spurious extra entries; change the assertion to require an exact sequence of two
events to match the bump-before-save invariant (as done in
TestPersistCreatedSnapshot_BumpsAppliedIndex). Concretely, assert the length is
exactly 2 (require.Len/require.Equal with 2) and then assert the full slice
equals the expected slice (e.g., build an expected []event{{kind:"bump",
index:index}, {kind:"save", index:index}} and compare it to events) so any extra
or out-of-order bump/save entries fail the test.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: b32d8fbe-fc66-4f75-8b2c-e399c938c425
📒 Files selected for processing (6)
internal/raftengine/etcd/engine.gointernal/raftengine/etcd/engine_applied_index_test.gokv/fsm.gokv/leader_routed_store.gokv/shard_store.gostore/lsm_store.go
🚧 Files skipped from review as they are similar to previous changes (3)
- internal/raftengine/etcd/engine.go
- kv/fsm.go
- store/lsm_store.go
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7001a934f1
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| defer s.dbMu.RUnlock() | ||
| batch := s.db.NewBatch() | ||
| defer func() { _ = batch.Close() }() | ||
| if err := setPebbleUint64InBatch(batch, metaAppliedIndexBytes, idx); err != nil { |
There was a problem hiding this comment.
Avoid lowering the durable apply index
When a SnapshotCount snapshot is persisted by the background worker (startSnapshotWorker launches persistLocalSnapshot in a goroutine), raft apply can continue and ApplyMutationsRaftAt can advance _meta_applied_index past the snapshot's older idx before this method runs. This unconditional set then rewrites the key backwards, so a store that has durably applied entries N+k can later report only N, defeating the soak/verification invariant and causing the future skip gate to fall back unnecessarily until another data apply repairs it. Please write max(existing, idx) or otherwise serialize with apply so the meta key remains monotonic.
Useful? React with 👍 / 👎.
…x P2) Round-2 of PR #915 review surfaced a P2 (codex) on the non-monotonic write that claude[bot] had flagged as a 'design note' in round-1. The race: a background SnapshotCount-triggered persistLocalSnapshot worker calls SetDurableAppliedIndex(snap.Metadata.Index). Concurrently, the raft-apply loop's ApplyMutationsRaftAt has already advanced the per-Apply metaAppliedIndex past that snapshot's index. The unconditional write rewinds the meta key — defeating the soak/ verification invariant and causing the future cold-start skip gate to fall back unnecessarily until another data Apply repairs the key. Round-3 fix: read-modify-write under applyMu serialisation. - SetDurableAppliedIndex now acquires dbMu.RLock THEN applyMu.Lock (same order applyMutationsWithOpts / deletePrefixAtWithOpts use, so no deadlock). - Reads the existing meta key via the new readAppliedIndexLocked helper (shares its decode logic with the unlocked LastAppliedIndex(); same (0, false, nil) semantics for absent / truncated meta keys). - If existing >= idx, no-op. The skip invariant still holds: LastAppliedIndex >= idx >= snap.Metadata.Index, so the snapshot's own claim is satisfied by the larger-already-present value. - Otherwise, batch.Commit(pebble.Sync) as before. Tests (store/lsm_store_applied_index_test.go): - TestSetDurableAppliedIndex_Monotonic — simulates the exact race in single-threaded form: ApplyMutationsRaftAt(applied=250) then SetDurableAppliedIndex(snapIdx=100). Asserts LastAppliedIndex still reads 250 after the older snapshot write. - TestSetDurableAppliedIndex_AdvancesOnlyForward — exercises the forward-only contract: Set(50) -> Set(100, advances) -> Set(75, no-op) -> Set(100, no-op). LastAppliedIndex finishes at 100. Tests: go vet ./store/ + golangci-lint ./store/ clean; go test ./store/ -run 'TestSetDurable...|TestLast...|TestApply...|TestDelete...' ok 0.59s; go test ./store/ ./kv/ ./internal/raftengine/... -short ok ~53s. Refs PR #915 round-2 codex review (P2 'Avoid lowering the durable apply index').
TLA+ spec divergence review (auto-triggered)This PR touches files that the TLA+ safety spec has an anchor on (per Anchored files changed in this PR head (b957e1a):
What to check, by subsystem:
If the change is correct but requires a spec update, edit @claude review please verify TLA+ spec divergence per the checklist above. @codex review please verify TLA+ spec divergence per the checklist above. |
|
Round-3 push: 🟡 codex P2 line 632 — "Avoid lowering the durable apply index"You're right that round-2's unconditional write can rewind Round-3 fix
func (s *pebbleStore) SetDurableAppliedIndex(idx uint64) error {
s.dbMu.RLock()
defer s.dbMu.RUnlock()
s.applyMu.Lock()
defer s.applyMu.Unlock()
existing, present, err := s.readAppliedIndexLocked()
if err != nil { return err }
if present && existing >= idx {
return nil // skip-invariant still holds: existing >= idx
}
batch := s.db.NewBatch()
defer func() { _ = batch.Close() }()
if err := setPebbleUint64InBatch(batch, metaAppliedIndexBytes, idx); err != nil {
return err
}
return errors.WithStack(batch.Commit(pebble.Sync))
}Lock ordering:
TestsTwo new tests in
@claude review |
|
To use Codex here, create a Codex account and connect to github. |
|
Claude finished @bootjp's task in 3m 2s —— View job PR #915 Round-3 Review —
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b957e1aa54
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if appliedIndex > 0 { | ||
| if err := setPebbleUint64InBatch(batch, metaAppliedIndexBytes, appliedIndex); err != nil { |
There was a problem hiding this comment.
Serialize DEL_PREFIX meta bumps with snapshot pinning
When the background snapshot worker pins index N via SetDurableAppliedIndex, a later raft-applied DEL_PREFIX can concurrently commit this batch with _meta_applied_index = N+k; however SetDurableAppliedIndex only serializes its max-read/write with applyMu, and deletePrefixAtWithOpts never takes that lock, so the snapshot pin can still commit its stale lower N afterward and rewind the durable applied index. Fresh evidence beyond the earlier comment is this newly index-aware DEL_PREFIX path, which writes the meta key without participating in the same serialization; please include this path in the same monotonic/max protection.
Useful? React with 👍 / 👎.
claude[bot] round-3 verdict was 'Ready to merge' with 4 non-blocking
notes. Round-4 closes three of them and addresses the round-3
reviewdog gci finding. The fourth note (LastAppliedIndex duplicates
readAppliedIndexLocked body) is pre-existing and explicitly called
out by claude[bot] as 'not a round-3 regression'; deferred to a
follow-up to keep this round focused.
Codex round-3 P2 ('deletePrefixAtWithOpts never takes that lock') is
factually wrong — store/lsm_store.go:1536 shows the function DOES
acquire s.applyMu.Lock(). The function holds applyMu across its
entire batch.Commit, so it serialises against SetDurableAppliedIndex
through the same mutex. The race codex describes cannot occur.
Push-back posted in the PR comment with grep evidence; no code change
for this finding.
Round-4 changes:
1) reviewdog gci on store/lsm_store_applied_index_test.go:182.
golangci-lint fmt auto-corrected the alignment (whitespace-only
diff in the test's comment block).
2) Site 2 test (engine_applied_index_test.go:
TestPersistLocalSnapshotPayload_BumpsAppliedIndex) switched from
GreaterOrEqual+positional checks to exact slice comparison —
matches Site 1's style and catches a spurious third event (e.g.
double SaveSnap) that the previous assertion would miss. Closes
claude[bot] note #4 and the coderabbit round-2 nit.
3) pebbleStore struct doc-comment lock-ordering table extended to
include applyMu between dbMu and mtx (lsm_store.go:170-181). The
ordering already matched the code; the comment was stale before
round-3 added the SetDurableAppliedIndex serialisation. Closes
claude[bot] note #1.
Tests: go vet + golangci-lint clean across store/ kv/
internal/raftengine/etcd/; targeted test sweep ok (store 0.82s +
etcd 0.40s).
Refs PR #915 round-3 claude[bot] verdict (Ready-to-merge with 4
notes) + round-3 codex P2 (rejected as false-positive).
TLA+ spec divergence review (auto-triggered)This PR touches files that the TLA+ safety spec has an anchor on (per Anchored files changed in this PR head (5684857):
What to check, by subsystem:
If the change is correct but requires a spec update, edit @claude review please verify TLA+ spec divergence per the checklist above. @codex review please verify TLA+ spec divergence per the checklist above. |
|
To use Codex here, create a Codex account and connect to github. |
|
Round-4 push: 📝 codex round-3 P2 line 1575 — false positive (push-back)
Factually incorrect. Verified directly: $ grep -n 'func .s \*pebbleStore. deletePrefixAtWithOpts\|s\.applyMu' store/lsm_store.go | head -8
637: s.applyMu.Lock()
638: defer s.applyMu.Unlock()
1532: func (s *pebbleStore) deletePrefixAtWithOpts(...) error {
1536: s.applyMu.Lock()
1537: defer s.applyMu.Unlock()
The race codex describes ("the snapshot pin can still commit its stale lower N afterward and rewind the durable applied index") cannot occur with the current locking. The DEL_PREFIX path participates in the same monotonic/max protection by virtue of the shared No code change for this finding. 🟢 claude[bot] round-3 nits — addressed#1 —
|
|
Claude finished @bootjp's task in 1m 38s —— View job PR #915 Round-4 Review —
|
| Method | dbMu |
applyMu |
|---|---|---|
applyMutationsWithOpts |
:1448 RLock |
:1453 Lock |
deletePrefixAtWithOpts |
:1539 RLock |
:1542 Lock |
SetDurableAppliedIndex |
:641 RLock |
:643 Lock |
The codex claim that deletePrefixAtWithOpts "never takes that lock" is factually wrong — it acquires s.applyMu.Lock() at :1542, two lines after dbMu.RLock. The snapshot pin and a concurrent DEL_PREFIX are fully serialized. SetDurableAppliedIndex's RMW guard (round-3) reads a consistent post-DEL_PREFIX metaAppliedIndex value under the lock and no-ops if existing >= idx. No code change was required; no divergence exists.
Nit #1 — applyMu lock-ordering comment ✅
pebbleStore struct comment at lsm_store.go:170-183 is updated to document the full ordering maintenanceMu → dbMu → applyMu → mtx. The applyMu entry is accurate and complete — it names all three callers that hold it (raft-apply conflict-check leaves, deletePrefixAtWithOpts, and SetDurableAppliedIndex's monotonic guard) and cites the PR that introduced the RMW guard (round-3 of PR #915).
Nit #4 — Site 2 exact slice match ✅
TestPersistLocalSnapshotPayload_BumpsAppliedIndex (engine_applied_index_test.go:207-211) now uses:
require.Equal(t, []orderEvent{
{kind: "bump", index: index},
{kind: "save", index: index},
}, rec.snapshot(),
"hook MUST call SetDurableAppliedIndex BEFORE SaveSnap exactly once each")Exact parity with Site 1's TestPersistCreatedSnapshot_BumpsAppliedIndex. A spurious extra event (double SaveSnap, double bump) would fail this assertion — the previous GreaterOrEqual + positional approach would not. Closes both claude[bot] round-3 note #4 and the open coderabbit round-2 nit. ✅
gci fix — store/lsm_store_applied_index_test.go ✅
Imports correctly formatted: stdlib block (context, os, testing) first, external packages (github.com/...) second, no misplaced blank lines. reviewdog passes. ✅
TLA+ spec divergence ✅
Round-4 changes are: a struct-level doc-comment update in lsm_store.go and a test assertion tightening in engine_applied_index_test.go. The anchored files kv/fsm.go and store/mvcc_store.go are unchanged since round-2. No spec update required.
Open items from prior rounds (acknowledged deferrals)
- Nit Update dependency go to v1.21.4 #2 (
LastAppliedIndexDRY gap): pre-existing, explicitly deferred to a follow-up. The public API needsdbMu.RLockonly; merging it withreadAppliedIndexLockedwould forceapplyMuonto every cold-read or require a second helper split. Low-risk as is. - Nit Update docker/setup-qemu-action action to v3 #3 (
applyMuheld acrosspebble.Syncin snapshot worker): same trade-off asapplyMutationsWithOpts. Acceptable for soak-phase B2; revisit if tail-latency data shows regression. - coderabbit
ShardStoremulti-group guard: convention is documented in the inline comment atshard_store.go:1271-1285; no production caller hits this path withappliedIndex > 0across multiple groups (one FSM per raft log). Not runtime-enforced; appropriate for a soak-phase no-op PR.
Summary
All round-4 items resolved. The codex P2 claim is a false positive — deletePrefixAtWithOpts holds applyMu throughout its batch commit, so the monotonic invariant from round-3 is airtight. The lock-ordering comment is now accurate, the Site 2 test assertion matches Site 1 exactly, and the gci lint is clean. No new issues. Ready to merge.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 56848574e4
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| func (f *kvFSM) AppliedIndexReader() raftengine.AppliedIndexReader { | ||
| if r, ok := f.store.(raftengine.AppliedIndexReader); ok { | ||
| return r | ||
| } | ||
| return nil | ||
| } |
There was a problem hiding this comment.
Implement the reader interface on kvFSM
When the cold-start skip gate queries the FSM through the optional raftengine.AppliedIndexReader interface added in this change, kvFSM will not match it because this method is named AppliedIndexReader() instead of LastAppliedIndex(). In the pebble-backed production FSM that means the newly persisted _meta_applied_index remains unreachable through the advertised interface, so the skip path will conservatively fall back even after raft applies and snapshot checkpoints have populated the meta key.
Useful? React with 👍 / 👎.
| if err := e.bumpDurableAppliedIndexBeforeSave(index); err != nil { | ||
| return err | ||
| } |
There was a problem hiding this comment.
Include the snapshot checkpoint in the payload
In the steady-state snapshot path, the FSM snapshot has already been serialized in persistLocalSnapshot before this bump runs, so the .fsm file / legacy payload that gets saved with the raft snapshot does not contain the _meta_applied_index value just written here. If that snapshot is later restored on a peer or after restart, especially after HLC/encryption-only entries since the last data apply, the restored store still reports the older or missing applied index and the snapshot-skip invariant promised by this checkpoint does not hold for the snapshot artifact.
Useful? React with 👍 / 👎.
…codex P2) Codex round-4 P2 on kv/fsm.go:145 caught a real wiring bug: the round-1 method was named AppliedIndexReader() returning the interface (factory pattern), intended to be called through a separate AppliedIndexReporter interface. But that AppliedIndexReporter was never added to the raftengine package, and the planned Branch 3 cold-start skip gate type-asserts fsm.(raftengine.AppliedIndexReader) directly. A factory method with a different signature does NOT satisfy the interface, so the skip would always fall back to full restore even after meta keys are populated on every apply. Fix: rename kvFSM.AppliedIndexReader() to LastAppliedIndex() and inline the type-assert forward. kvFSM now directly satisfies raftengine.AppliedIndexReader. Added kv/fsm_applied_index_iface_check.go with compile-time guards: var _ raftengine.AppliedIndexReader = (*kvFSM)(nil) var _ raftengine.AppliedIndexWriter = (*kvFSM)(nil) so a future rename or signature drift fails at build time rather than silently degrading the skip optimisation to full-restore-always. No production callers of the old AppliedIndexReader() factory exist (verified by grep) — the consumer is the future Branch 3 skip gate which is not yet wired. Safe rename. Tests: go vet + golangci-lint clean across all touched packages; go test ./kv/ ./store/ ./internal/raftengine/... -short ok 56s. Refs PR #915 round-4 codex P2 (kv/fsm.go:145). Codex P2 on engine.go:4077 (snapshot artifact does not contain post-bump metaAppliedIndex) is acknowledged but out of B2 scope — push-back explanation in the PR comment.
TLA+ spec divergence review (auto-triggered)This PR touches files that the TLA+ safety spec has an anchor on (per Anchored files changed in this PR head (a8e2cf5):
What to check, by subsystem:
If the change is correct but requires a spec update, edit @claude review please verify TLA+ spec divergence per the checklist above. @codex review please verify TLA+ spec divergence per the checklist above. |
|
To use Codex here, create a Codex account and connect to github. |
|
Round-5 push: 🟡 codex P2 kv/fsm.go:145 — real bug, fixedYou're right. Round-1's Renamed Added a separate compile-time guard file ( var _ raftengine.AppliedIndexReader = (*kvFSM)(nil)
var _ raftengine.AppliedIndexWriter = (*kvFSM)(nil)so any future rename or signature drift fails at build time rather than silently degrading the skip optimisation. No existing production callers — verified by grep before renaming. 📝 codex P2 engine.go:4077 — out-of-scope for B2 (push-back with rationale)
Correct observation, but the invariant claimed in PR #910 design §6 is local-only:
This is a claim about the LIVE pebble store on the node that persisted the snapshot, NOT about the snapshot artifact's serialized state. The local cold-start skip path (Branch 3) reads the live store, not the artifact, so the invariant holds where it's used. The codex finding's actual consequence:
The peer-after-InstallSnapshot path is explicitly listed as a Non-Goal in the design (
When B3 wires the skip gate, the peer-after-InstallSnapshot scenario will conservatively fall back to full restore — which IS the strictly-additive guarantee the design promises (§4 fallback policy). Not a correctness bug; a missed optimisation. Why not bump before
Option 2 is the right answer, but it belongs in Branch 3 (the path that consumes the meta key) or as a parallel follow-up — not in B2 (the soak-only PR). Filing as a TODO with a sentinel comment in VerificationThe new compile-time interface guard means any future regression would fail the build immediately: $ go build ./kv/
$ # If kvFSM stops satisfying AppliedIndexReader:
$ go build ./kv/
./kv/fsm_applied_index_iface_check.go:7:5: cannot use (*kvFSM)(nil) ...@claude review |
|
Claude finished @bootjp's task in 3m 52s —— View job PR #915 Round-5 Review —
|
| Invariant | Status |
|---|---|
kvFSM satisfies raftengine.AppliedIndexReader |
✅ — LastAppliedIndex() signature matches; compile-time guard confirms |
kvFSM satisfies raftengine.AppliedIndexWriter |
✅ — SetDurableAppliedIndex() forwarded; compile-time guard confirms |
(0, false, nil) fallback for non-seam stores |
✅ — f.store.(raftengine.AppliedIndexReader) type-assert fails gracefully |
errors.WithStack on propagated store errors |
✅ — fsm.go:159 |
| Skip gate type-assertion will succeed in B3 | ✅ — interface name and method signature now match exactly |
TLA+ spec divergence ✅
Round-5 changes are: method rename (AppliedIndexReader() → LastAppliedIndex()), inline doc-comment update, and a new compile-time guard file. HLC, OCC, MVCC visibility, FSM apply semantics, and coordinator/transaction paths are all unchanged. No spec update required.
Summary
The real bug (round-4 codex P2) is fixed correctly and backed by a build-time safety net. The push-back on engine.go:4077 is well-reasoned and consistent with the design doc Non-Goals. The only substantive open item is the design doc's stale pseudocode (#2), which will mislead the B3 implementor if not corrected before Branch 3 begins. The missing TODO sentinel (#1) is cosmetic. Ready to merge pending design doc update.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a8e2cf52de
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if err := e.bumpDurableAppliedIndexBeforeSave(snap.Metadata.Index); err != nil { | ||
| return err | ||
| } |
There was a problem hiding this comment.
Include config snapshot checkpoint in payload
When the config-snapshot path reaches this hook, persistConfigSnapshotPayloadLocked has already obtained payload from snapshotPayload, which writes/closes the .fsm file (or serializes the legacy bytes) before calling persistCreatedSnapshot. Bumping _meta_applied_index here only updates the live store, so a peer or restart that restores this saved config snapshot gets the older/missing meta key while the raft snapshot is at snap.Metadata.Index; after WAL entries up to that index are released, the advertised LastAppliedIndex >= snapshot index invariant does not hold for the snapshot artifact. This is separate from the steady-state snapshot site: the same ordering occurs through persistConfigSnapshotPayloadLocked -> snapshotPayload -> persistCreatedSnapshot.
Useful? React with 👍 / 👎.
…items claude[bot] round-5 verdict was 'Ready to merge pending design doc update'. Two non-code items: 1) engine.go:2001 TODO sentinel — the round-5 PR comment claimed a 'sentinel comment in engine.applySnapshot' would land, but no such comment was added. Round-6 adds it directly above the openAndRestoreFSMSnapshot call so a future B3 / follow-up implementor can find the planned SetDurableAppliedIndex hook site without re-discovering the rationale. Cites the design doc Non-Goals scope-out and PR #915 round-4/5 codex P2. 2) docs/design/2026_06_02_idempotent_snapshot_restore.md — three stale references to AppliedIndexReporter / factory-method pattern, superseded by PR #915 round-5 (kvFSM directly satisfies raftengine.AppliedIndexReader via LastAppliedIndex). The stale pseudocode would mislead the B3 implementor into adding a reporter shim that was already superseded. Updates: - §3 'kvFSM exposes its store' block rewritten to show direct interface satisfaction + the compile-time guard. - §4 fsmAlreadyAtIndex pseudocode rewritten to use direct type-assert against raftengine.AppliedIndexReader. - Implementation Plan B2 row updated to reflect the actual kvFSM.LastAppliedIndex() shape + the round-5 monotonic applyMu.Lock() RMW guard on SetDurableAppliedIndex. - §8 compatibility list: AppliedIndexReporter removed; replaced by AppliedIndexWriter (the actually-shipped interface) plus a retrospective sentence explaining the round-5 supersession. Three retrospective mentions of AppliedIndexReporter remain (as 'why this is NOT what we use') so future readers understand the design history without re-deriving it. No code semantics change in round-6. Doc + comment-only. Tests: go vet + golangci-lint clean across kv/, store/, internal/raftengine/etcd/. Refs PR #915 round-5 claude[bot] verdict ('Ready to merge pending design doc update').
|
Round-6 push: 🟢 Item #1 — TODO sentinel in
|
| Where | Before | After |
|---|---|---|
| §3 "kvFSM exposes its store" | factory AppliedIndexReader() AppliedIndexReader + AppliedIndexReporter interface |
direct LastAppliedIndex() + compile-time guard from kv/fsm_applied_index_iface_check.go |
§4 fsmAlreadyAtIndex pseudocode |
reporter, ok := fsm.(AppliedIndexReporter); reader := reporter.AppliedIndexReader() |
r, ok := fsm.(raftengine.AppliedIndexReader) direct |
| Implementation Plan B2 row | "kvFSM.AppliedIndexReader() accessor" + no mention of monotonic guard | "kvFSM.LastAppliedIndex() directly satisfies raftengine.AppliedIndexReader (compile-time guard)" + applyMu.Lock() RMW from round-3 |
| §8 compatibility list | AppliedIndexReporter listed as new opt-in |
replaced by AppliedIndexWriter (actually shipped) + retrospective note |
Three retrospective mentions of AppliedIndexReporter are kept as "this is NOT what we use" — so future readers understand the design history without re-deriving it.
Verification
go vet ./kv/ ./store/ ./internal/raftengine/... → exit 0
golangci-lint run ./kv/ ./store/ ./internal/raftengine/etcd/ → 0 issues
No code semantics changed in round-6. The compile-time interface guards from round-5 remain the structural protection.
@claude review
TLA+ spec divergence review (auto-triggered)This PR touches files that the TLA+ safety spec has an anchor on (per Anchored files changed in this PR head (3951337):
What to check, by subsystem:
If the change is correct but requires a spec update, edit @claude review please verify TLA+ spec divergence per the checklist above. @codex review please verify TLA+ spec divergence per the checklist above. |
|
To use Codex here, create a Codex account and connect to github. |
|
Claude finished @bootjp's task in 2m 13s —— View job PR #915 Round-6 Review —
|
Summary
Implements Branch 2 of the cold-start snapshot-restore skip optimisation designed in PR #910. After this lands the
metaAppliedIndexPebble meta key is durably written on every raft-Apply data mutation AND at every snapshot persist — but the skip gate itself (Branch 3) is NOT yet wired, so behaviour is observationally identical tomainexcept for the new meta key in fsm.db. Branch 2 is meant to soak in production for at least one release before Branch 3 enables the skip; this PR is intentionally a no-op-from-the-outside change with comprehensive plumbing.Reading order (6 commits, designed to review one-at-a-time)
2339a6f2AppliedIndexReader/AppliedIndexWriter)525fc152metaAppliedIndexconst +LastAppliedIndex+SetDurableAppliedIndex(withpebble.SyncUNCONDITIONALLY)aa9b8accMVCCStoreinterface extension:ApplyMutationsRaftAt/DeletePrefixAtRaftAtoverloads, threading appliedIndex throughapplyMutationsWithOpts+deletePrefixAtWithOpts7cd72bdaAppliedIndexReader()/SetDurableAppliedIndex()accessors + all 7 data-Apply leaves switched to*RaftAtwithf.pendingApplyIdxf1e8748cpersistCreatedSnapshot+e.persistLocalSnapshotPayloadcallSetDurableAppliedIndexBEFOREpersist.SaveSnap2c42f7d6Design constraints honoured
All from
docs/design/2026_06_02_idempotent_snapshot_restore.md:applyMutationsWithOptsANDdeletePrefixAtWithOptsso DEL_PREFIX entries don't silently leaveLastAppliedIndexbehind. Tested byTestDeletePrefixAtRaftAt_BundlesMetaAppliedIndex.dbMu.RLock(): bothLastAppliedIndexandSetDurableAppliedIndexacquire the read-lock, matching the lock-ordering discipline atlsm_store.go:153 / :553 / :675.AppliedIndexReader()returns nil when the store doesn't implement the seam;LastAppliedIndexreturns(0, false, nil)for missing OR truncated meta key. Branch 3 will then fall back to full restore conservatively.ELASTICKV_FSM_SYNC_MODE=nosyncmode:SetDurableAppliedIndexis pinned topebble.Syncunconditionally. Rationale documented at length in the method's doc-comment — oncepersist.SaveSnapreturns, WAL compaction discards every log entry ≤snap.Metadata.Index, so there's no source to replay the meta key bump from. +1 fsync per snapshot persist (rare; defaultSnapshotCount=10000). Tested byTestSetDurableAppliedIndex_UsesPebbleSync.persistCreatedSnapshot(config snapshots) ANDe.persistLocalSnapshotPayload(steady-stateSnapshotCount-triggered hot path) call the hook. Both crash-ordering tested byTestPersistCreatedSnapshot_*.StateMachine.Apply's public signature is unchanged. New interfaces are opt-in. Old call sites (ApplyMutationsRaftwithout*At) still work, just passappliedIndex=0to opt out of the meta key bump.Test results
10 new tests added (see commit
2c42f7d6for the full inventory).What this does NOT do
restoreSnapshotStatestill always restores. Branch 3 wires thefsmAlreadyAtIndexcheck +applyHeaderStateOnSkip+ the two-phaseSnapshotHeaderApplierseam.HEALTH_TIMEOUT_SECONDS=300. Branch 4 lowers it once Branch 3 has soaked.Engine.applySnapshot) per Non-Goals in the design.Soak plan
Branch 2 should run in production for at least one release before Branch 3 opens. Operators can verify the meta key is being written via:
Refs
HEALTH_TIMEOUT_SECONDSband-aid that this series eventually obviatesSummary by CodeRabbit
New Features
Bug Fixes
Tests