
fix(rpc): release shard lock before cold-storage awaits in get_filter_changes#145

Open
Evalir wants to merge 1 commit into main from evalir/fix-filter-changes-deadlock

Evalir commented May 18, 2026

get_filter_changes held a DashMap RefMut (a parking_lot RwLock write
guard) across cold-storage .await points. On a current_thread tokio
runtime, two concurrent polls that landed on the same shard could
deadlock: the first task suspended at an .await while still holding
the guard, and the second task then blocked the runtime's only OS
thread waiting for the lock, leaving no way for the first task to
resume and release it.

Refactored into snapshot -> cold I/O -> commit, with the RefMut scoped
to two short critical sections (no .await inside either).
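
The three phases can be sketched roughly as follows. This is a minimal,
std-only illustration and not the code in this PR: a Mutex-guarded HashMap
stands in for one DashMap shard, and FilterState, fetch_cold, YieldOnce,
NoopWake and block_on are hypothetical names introduced only for the
example. The point it shows is that each guard lives inside its own block,
so nothing is locked while the task is suspended.

```rust
use std::collections::HashMap;
use std::future::Future;
use std::pin::{pin, Pin};
use std::sync::{Arc, Mutex};
use std::task::{Context, Poll, Wake, Waker};

/// Hypothetical per-filter state: the next block the filter has not reported yet.
struct FilterState {
    next_block: u64,
}

/// Stand-in for one DashMap shard (assumption: the real code uses DashMap).
type Filters = Mutex<HashMap<u64, FilterState>>;

/// Future that suspends exactly once, modelling a real cold-storage await.
struct YieldOnce(bool);

impl Future for YieldOnce {
    type Output = ();
    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        if self.0 {
            return Poll::Ready(());
        }
        self.0 = true;
        cx.waker().wake_by_ref();
        Poll::Pending
    }
}

/// Hypothetical cold-storage read: returns the block numbers in `from..=to`.
async fn fetch_cold(from: u64, to: u64) -> Vec<u64> {
    YieldOnce(false).await;
    (from..=to).collect()
}

async fn get_filter_changes(filters: &Filters, id: u64, head: u64) -> Option<Vec<u64>> {
    // 1. Snapshot: copy what the cold read needs; the guard drops at the closing brace.
    let from = {
        let map = filters.lock().unwrap();
        let state = map.get(&id)?;
        state.next_block
    };

    // 2. Cold I/O: no lock is held while this task is suspended.
    let blocks = fetch_cold(from, head).await;

    // 3. Commit: re-lock briefly; the filter may have been removed in the meantime.
    {
        let mut map = filters.lock().unwrap();
        let state = map.get_mut(&id)?;
        state.next_block = state.next_block.max(head + 1);
    }
    Some(blocks)
}

/// Waker that does nothing; enough for a busy-polling demo executor.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Minimal single-threaded executor used only to run the sketch.
fn block_on<F: Future>(fut: F) -> F::Output {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut fut = pin!(fut);
    loop {
        if let Poll::Ready(out) = fut.as_mut().poll(&mut cx) {
            return out;
        }
    }
}
```

Because the commit phase runs after a suspension, it has to tolerate the
entry having changed or disappeared in between; the sketch handles that by
re-looking-up the filter and only moving the cursor forward.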

Verified by temporarily forcing DashMap to 2 shards (~50% collision):

  • Pre-fix: test_rpc_filter_edge_cases hung 10 of 30 runs.
  • Post-fix: test_rpc_filter_edge_cases hung 0 of 30 runs.

The default shard count (~128 on dev, ~8 on 2-core CI) made the hang
rare enough to slip through into main.
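
A rough way to see why the shard count matters, assuming two concurrently
polled filters hash uniformly and independently across shards, and assuming
DashMap's default shard count is (CPU count * 4) rounded up to a power of
two (which is consistent with the ~8 and ~128 figures above). The function
names are hypothetical and exist only for this illustration.

```rust
/// Assumed default shard count for a machine with `cpus` logical CPUs.
fn default_shards(cpus: usize) -> usize {
    (cpus * 4).next_power_of_two()
}

/// Probability that two independently, uniformly hashed keys share a shard.
fn collision_probability(shards: usize) -> f64 {
    1.0 / shards as f64
}
```

Under these assumptions a 2-shard map collides about half the time, an
8-shard map about one time in eight, and a 128-shard map less than 1% of
the time, so the hang needs many runs to show up with default settings.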

Also adds .config/nextest.toml with a 5-minute per-test timeout so any
future hang fails fast instead of burning the GitHub Actions 6-hour
job ceiling (the failure mode of node-components#134).
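
For reference, a 5-minute per-test ceiling in nextest is typically
expressed through the slow-timeout setting. The fragment below is a guess
at the shape of such a config, not the file added by this PR; the 60s
period and the use of the default profile are assumptions.

```toml
# .config/nextest.toml (illustrative; actual values in the PR may differ)
[profile.default]
# Mark a test as slow every 60s and terminate it after 5 such periods (5 minutes).
slow-timeout = { period = "60s", terminate-after = 5 }
```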

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@Evalir Evalir requested a review from a team as a code owner May 18, 2026 17:27
Evalir commented May 18, 2026

This stack of pull requests is managed by Graphite. Learn more about stacking.
