fix(derivation): verify fill-gap region against L1 blob to heal local forks#1008
fix(derivation): verify fill-gap region against L1 blob to heal local forks#1008curryxbo wants to merge 1 commit into
Conversation
|
Warning Review limit reached
More reviews will be available in 59 minutes and 9 seconds. Learn how PR review limits work. Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file). ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based credits. 🚦 How do rate limits work?CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan review availability. For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, additional reviews become available more gradually as earlier reviews age out of the rolling window. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
📝 WalkthroughWalkthroughLocal-verify scenario C now reconciles through ChangesLocal verify reconciliation
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@node/derivation/derivation.go`:
- Around line 416-421: The fallback around `rebuildBlob` in `derivation.go` is
returning too early on divergence-like sentinel errors, so it never reaches the
existing `deriveForce(..., 0)` self-heal path. Update the caller that handles
`rebuildBlob` results to detect the wrapped sentinel divergence cases (including
`ErrBatchVerifyDivergence` / `blob_count_mismatch`) and route them through the
same full-rewrite branch used after `fillGap`-detected mismatches. Keep the
existing self-heal behavior centralized in the `deriveForce` path so both
verification failures and `rebuildBlob` divergence errors are handled
identically.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 894f04f7-2b4f-46a8-806e-3a62e60d6ca7
📒 Files selected for processing (1)
node/derivation/derivation.go
rebuildBlob signals a fork two ways: a per-blob hash mismatch (returned in the rebuilt slice) and a wrapped ErrBatchVerifyDivergence error (blob_count_mismatch). The caller only handled the former and returned early on the latter, so a fork that changes the reconstructed blob count aborted the pull instead of self-healing. Extract the full-rewrite (deriveForce(.., 0)) into a shared selfHeal closure and invoke it from both the hash-mismatch loop and the ErrBatchVerifyDivergence error branch. Non-sentinel rebuildBlob errors remain transient failures that must not trigger a reorg. Addresses review feedback on #1008. Co-authored-by: Cursor <cursoragent@cursor.com>
504fb79 to
7fc178d
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@node/derivation/derivation.go`:
- Around line 398-400: `deriveForce` is still using `d.ctx` and
`context.Background()` instead of the caller’s `derivationBlock` context, so
cancellation and deadlines are not propagated during quiesced work. Update the
`deriveForce` call sites in `derivation.go` to pass the active context from
`withReactorsQuiesced`/`derivationBlock`, and thread that context through
`deriveForce` itself and the related helpers it calls so all L2 reads/writes
respect cancellation promptly.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 6afbe52b-03cd-462b-860c-6ad6c646a345
📒 Files selected for processing (2)
node/derivation/batch_info.gonode/derivation/derivation.go
| err = d.withReactorsQuiesced(ctx, batchInfo.batchIndex, func() error { | ||
| localLatest, err := d.l2Client.BlockNumber(ctx) | ||
| if err != nil { | ||
| return fmt.Errorf("read local latest: %w", err) | ||
| } | ||
| var derErr error | ||
| lastHeader, derErr = d.deriveForce(batchInfoFull, localLatest) | ||
| lastHeader, derErr = d.deriveForce(batchInfoFull) |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟠 Major | ⚡ Quick win
Thread the caller context through deriveForce.
deriveForce can perform many L2 reads/writes while reactors are quiesced, but it uses d.ctx and context.Background() instead of the derivationBlock context. Cancellation won’t stop an in-flight reconcile promptly, and each rewritten block can wait up to 60s.
Proposed fix
- lastHeader, derErr = d.deriveForce(batchInfoFull)
+ lastHeader, derErr = d.deriveForce(ctx, batchInfoFull)- lastHeader, derErr = d.deriveForce(batchInfoFull)
+ lastHeader, derErr = d.deriveForce(ctx, batchInfoFull)-func (d *Derivation) deriveForce(rollupData *BatchInfo) (*eth.Header, error) {
+func (d *Derivation) deriveForce(ctx context.Context, rollupData *BatchInfo) (*eth.Header, error) {
@@
- anchor, err := d.l2Client.HeaderByNumber(d.ctx, big.NewInt(int64(anchorNum)))
+ anchor, err := d.l2Client.HeaderByNumber(ctx, big.NewInt(int64(anchorNum)))
@@
- local, lErr := d.l2Client.BlockByNumber(d.ctx, big.NewInt(int64(num)))
+ local, lErr := d.l2Client.BlockByNumber(ctx, big.NewInt(int64(num)))
@@
- ctx, cancel := context.WithTimeout(context.Background(), time.Duration(60)*time.Second)
+ callCtx, cancel := context.WithTimeout(ctx, time.Duration(60)*time.Second)
defer cancel()
- return d.l2Client.NewSafeL2Block(ctx, &safeData)
+ return d.l2Client.NewSafeL2Block(callCtx, &safeData)Also applies to: 442-444, 918-980
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@node/derivation/derivation.go` around lines 398 - 400, `deriveForce` is still
using `d.ctx` and `context.Background()` instead of the caller’s
`derivationBlock` context, so cancellation and deadlines are not propagated
during quiesced work. Update the `deriveForce` call sites in `derivation.go` to
pass the active context from `withReactorsQuiesced`/`derivationBlock`, and
thread that context through `deriveForce` itself and the related helpers it
calls so all L2 reads/writes respect cancellation promptly.
…anch Cherry-pick the PR #1008 fix (deriveForce verifies each local block against the batch's content + canonical parent before keeping it, rewriting from the first divergent/missing height) so the QA fork-injection node actually exercises L1 self-heal. Scenario C now reconciles in one pass; the REORG-TEST [2/3]/[3/3] markers are retained for log visibility. Co-authored-by: Cursor <cursoragent@cursor.com>
1a011d4 to
149b664
Compare
…them (#1008) Verify-driven deriveForce: walking from the canonical anchor (firstBlockNumber-1), keep each local block while its content matches the batch, then rewrite via NewSafeL2Block from the first divergent or missing height onward. Replaces the old skipNumber fill-gap, which blindly trusted local blocks and could grow a permanent shadow chain (correct tx content, wrong parent/state hashes) that never self-healed. This is the fix under test by the chaos hook in the preceding commit; it is the same change as PR #1008. Co-authored-by: Cursor <cursoragent@cursor.com>
…them (#1008) Verify-driven deriveForce: walking from the canonical anchor (firstBlockNumber-1), keep each local block while its content matches the batch, then rewrite via NewSafeL2Block from the first divergent or missing height onward. Replaces the old skipNumber fill-gap, which blindly trusted local blocks and could grow a permanent shadow chain (correct tx content, wrong parent/state hashes) that never self-healed. This is the fix under test by the chaos hook in the preceding commit; it is the same change as PR #1008. Co-authored-by: Cursor <cursoragent@cursor.com>
…them A node that diverges from canonical at its unsafe tip and gets stuck below a committed batch tip never self-heals via L1 derivation. A batch is ~480 blocks, so the fork height is almost always inside a batch (batch.lastBlockNumber > fork height); HeaderByNumber(lastBlockNumber) returns nil, routing every batch to scenario C (fill-gap) before any content comparison runs. fill-gap then called deriveForce with skipNumber = localLatest, which skipped every block already present locally and appended only the missing tail, anchoring it on the local head. The batch carries block content but not parent hashes, so that skip was blind: if the present blocks were a fork, fill-gap built the tail on top of the fork and geth kept advancing a shadow chain (correct tx content, wrong parent/state hashes) while P2P stayed wedged on "parent block not found". Only a data wipe + resync recovered it. Fix: replace the skipNumber-driven deriveForce with a single verify-driven walk. Starting from a known-canonical anchor (firstBlockNumber-1, the tip the previous batch already verified), each height is KEPT only when the local block both links to the running canonical parent AND its content matches the batch (timestamp, gas limit, base fee, ordered txs); otherwise the walk switches to rewrite mode via NewSafeL2Block for the rest of the range. The parent-link check is what makes a keep safe — the batch has no parent hashes, but canonical anchor + matching parent + matching content implies deterministic execution reproduces the canonical block. The first divergent or missing height flips the walk into rewrite mode, and because every later local block then points at the now-stale parent, the parent check keeps rewriting to the tip automatically. This unifies the old scenario-B (self-heal, rewrite all) and scenario-C (fill-gap, append tail) paths: the divergence point is discovered by comparison instead of being supplied as a skip boundary, so a clean prefix is skipped for free, a fork is reorged onto canonical, and a missing tail is appended — without ever blindly trusting un-verified local blocks. Scenario C now reconciles in one pass and advances the cursor instead of deferring to a later poll. Co-authored-by: Cursor <cursoragent@cursor.com>
149b664 to
9d3eb84
Compare
Problem
A node that diverges from canonical at its unsafe tip and then gets stuck below a committed batch tip never self-heals via L1 derivation — it advances a permanent "shadow chain" and P2P stays wedged on
parent block not found. Only a data wipe + resync recovers it.Root cause is in the local verify scenario dispatch:
H; a batch is ~480 blocks, soHis almost always inside a batch, i.e.batch.lastBlockNumber > H.HeaderByNumber(batch.lastBlockNumber)returnsnil(tail missing) → dispatch routes to scenario C (fill-gap) before any blob comparison runs. The blob-mismatch self-heal (scenario B,deriveForce(.., 0)full rewrite) is therefore never reached.deriveForce(batchInfoFull, localLatest), which skips every block already present locally (Number <= skipNumber) and appends only the missing tail, anchoring it on the local head.Result: geth keeps producing blocks with correct tx content but wrong parent/state hashes (a shadow chain), batch after batch.
latestBatchIndexadvances, block height climbs, but every canonical block from peers fails withparent block not foundand the node never reconciles.Fix
After fill-gap completes, the whole
[firstBlock, lastBlock]range is present locally. So do notbreakout of the switch — fall through to the existingrebuildBlobverification:deriveForce(.., 0)full rewrite, reorging the fork onto canonical.This reuses the exact verification + self-heal machinery already used by scenario A/B; the diff is just "skip the early break and verify".
Why this is complete
The only blind spot of blob-content comparison is a fork that differs only in parent linkage with identical content — but any genuine fork's root block must differ in content (txs or a blob-encoded context field), otherwise it would be canonical. Since batches are processed in order, the root is caught and rewritten in its own batch, making every later batch's
firstBlock-1anchor canonical by induction.Test plan
go build ./...innode/go vet ./derivation/blob hash mismatch; triggering self-heal reorgfires on the fill-gap batch and the node reorgs back onto canonical (P2P recovers, no moreparent block not found).Made with Cursor
Summary by CodeRabbit