Skip to content

fix(storage): verify ClientPut receiver against over-query window#142

Closed
jacderida wants to merge 1 commit into
WithAutonomi:rc-2026.6.2from
jacderida:fix/receiver-membership-hybrid
Closed

fix(storage): verify ClientPut receiver against over-query window#142
jacderida wants to merge 1 commit into
WithAutonomi:rc-2026.6.2from
jacderida:fix/receiver-membership-hybrid

Conversation

@jacderida

Copy link
Copy Markdown
Collaborator

Problem

After #140 (XOR-only verification) and #141 (issuer-closeness over-query window + network fallback), upload success improved dramatically — large files 100%, small files ~92–98%. The residual few-percent failures are now dominated by the receiver storage-responsibility check (AntProtocol::validate_store_membership):

ClientPut receiver <peer> is not among this node's local 9 closest peers (close group plus storage margin)

They surface most visibly as Failed to store public DataMap — the public DataMap is a single critical chunk per upload, so one receiver-check rejection of it fails the whole upload regardless of file size.

Cause

This is the same divergence #141 fixed for the issuer check, on the receiver side. The uploader queries 2 * CLOSE_GROUP_SIZE peers and PUTs each chunk to the CLOSE_GROUP_SIZE closest successful responders (ant-client get_store_quotes). When closer peers are slow or NAT-stuck, the storer it legitimately PUT to sits at XOR positions up to 2 * close_group_size. But the receiver check verified only the bare close_group_size + storage margin (9) of the node's local routing table, with exact self-membership — so it rejected honest PUTs.

#141 only touched the issuer check (PaymentVerifier); the receiver check still had the strict local-only / width-9 / exact-membership logic.

Fix (symmetric with #141)

  • Widen to 2 * close_group_size, matching the uploader's over-query window. This does not amplify replica count — the uploader still PUTs to only its selected storers — it just lets a legitimately-selected storer at position 10..14 accept.
  • Keep XOR-only ordering (find_closest_nodes_local_with_self reranks by reachability and would demote XOR-close relay-only / NAT'd storers).
  • Hybrid source: try the local routing-table view first; only on a local miss fall back to an authoritative find_closest_nodes_network lookup (the same view the uploader used to choose the storers), wrapped in the shared CLOSENESS_LOOKUP_TIMEOUT. Reject only if absent from both.

CLOSENESS_LOOKUP_TIMEOUT is promoted to pub(crate) so the receiver check reuses the merkle/issuer timeout.

Note

Same cost caveat as #141: the fallback can issue a per-chunk network lookup when the local view misses; the local fast-path keeps that off the hot path for storers we already know. One storage-specific consideration: a node may now accept a chunk it is the ~10th–14th-closest to, which the replication/pruning close-group logic could later treat as borderline; replica count is unaffected since the uploader controls the PUT targets.

Builds on #140 and #141.

🤖 Generated with Claude Code

Mirror the paid-quote issuer fix (WithAutonomi#141) on the receiver
storage-responsibility check. After WithAutonomi#140/WithAutonomi#141 removed the reachability
re-rank and widened the issuer check, uploads recovered substantially, but a
residual few percent still fail on:

  ClientPut receiver <peer> is not among this node's local 9 closest peers
  (close group plus storage margin)

most visibly as "Failed to store public DataMap" — the DataMap is a single
critical chunk, so one receiver-check rejection fails the whole upload
regardless of file size.

Cause is the same divergence the issuer check had: the uploader queries
2 * CLOSE_GROUP_SIZE peers and PUTs each chunk to the CLOSE_GROUP_SIZE
closest *successful responders* (ant-client get_store_quotes), so when closer
peers are slow or NAT-stuck the storer it legitimately PUT to sits at XOR
positions up to 2 * close_group_size. The receiver check verified only the
bare close_group_size + storage margin (9) of the node's *local* routing
table with exact self-membership, so it rejected honest PUTs.

Bring it in line with the issuer check:

- Widen to 2 * close_group_size, matching the uploader's over-query window.
  This does not amplify replica count — the uploader still PUTs to only its
  selected storers — it just lets a legitimately-selected storer at position
  10..14 accept.
- Keep the XOR-only lookup (find_closest_nodes_local_with_self reranks by
  reachability and would demote XOR-close relay-only / NAT'd storers).
- Hybrid source: try the local routing-table view first, and only on a local
  miss fall back to an authoritative find_closest_nodes_network lookup (the
  same view the uploader used to choose the storers), wrapped in the shared
  CLOSENESS_LOOKUP_TIMEOUT. Reject only if absent from both.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@dirvine

dirvine commented Jun 14, 2026

Copy link
Copy Markdown
Collaborator

Hermes review for PR #142

Verdict: changes requested before merge (review-team consensus: 6 agents / 2 groups). CI is green, and local git diff --check, cargo test -q storage::handler::tests --lib, and cargo check -q pass, but I found two blocking issues.

Blockers

  1. Unauthenticated PUTs can trigger a 240s network lookup before payment verification

handle_put_inner calls validate_store_membership() before verify_payment() (src/storage/handler.rs:278-315). With this PR, a local membership miss runs find_closest_nodes_network(address, lookup_width) under PaymentVerifier::CLOSENESS_LOOKUP_TIMEOUT (src/storage/handler.rs:446-466), now 240s.

A requester only needs syntactically valid content/address bytes; no valid payment proof is required to reach this lookup. For random addresses the local top-2 * close_group_size check will usually miss, so this becomes an unpaid DHT lookup amplifier.

This is made worse by the chunk protocol handler semaphore (src/node.rs:744, src/node.rs:766-771): 64 concurrent unpaid local-miss PUTs can hold all permits for up to the lookup timeout, stalling valid PUT/GET/quote handling.

Suggested fix: do not perform the authoritative network fallback in the pre-payment hot path unless it is separately bounded. Options include rejecting missing/obviously invalid proofs first, moving the expensive fallback after cheap/payment validation, adding a dedicated short timeout/concurrency limit, and adding singleflight/cache/rate-limiting for receiver-membership lookups.

  1. Receiver admission width now diverges from storage retention/replication responsibility width

The PR changes direct ClientPut receiver admission to 2 * close_group_size (src/storage/handler.rs:428), but replication/admission/pruning still use storage_admission_width(close_group_size) = close group + margin (src/replication/config.rs:50-54, src/replication/admission.rs:74-76,116-121, src/replication/pruning.rs:383-424).

That means a node ranked 10–14 (for K=7) can accept/store a paid direct PUT while the rest of the storage responsibility machinery may treat that record as out-of-range and eventually prune it. That may be intended as an ingress-overquery policy, but the code/comments call it storage responsibility. Please either align the retention/replication semantics with the widened window, or explicitly document and test the distinction between “ClientPut ingress acceptance width” and “stored-record retention / proof authority width”.

Non-blocking but should be fixed

  • Stale comments/error text still say the receiver check is local close-group-plus-margin (src/storage/handler.rs:278-281, src/storage/handler.rs:383-386; related VerificationContext comments in src/payment/verifier.rs:125-130).
  • No focused tests cover the new production branch: local XOR hit at 2*K, local miss + network hit, network error/timeout, and final rejection when both views omit self. Existing storage handler tests pass but mostly use the test override / no-P2P test path.

Verified

  • GitHub checks for current head f377d970aebda318f041836827e211211ca34317: all passing.
  • Local checks:
    • git diff --check origin/rc-2026.6.2...HEAD — pass
    • cargo test -q storage::handler::tests --lib — 16 passed
    • cargo check -q — pass

I’d hold merge until the pre-payment lookup DoS path is bounded and the widened receiver/retention semantics are made explicit.

@dirvine

dirvine commented Jun 14, 2026

Copy link
Copy Markdown
Collaborator

Follow-up after Mick's note: I agree the cleaner direction is likely to fix the uploader's quote target selection rather than widen receiver admission.

If the protocol invariant is that quotes/storage responsibility are bound to the strict close group (or the existing close-group+storage-margin admission width), then accepting 2 * close_group_size at the receiver is treating a client-side over-query workaround as protocol authority. That is what creates the admission/retention mismatch called out above.

So my recommendation would be stronger than “document the distinction”: unless maintainers explicitly want a new ingress-overquery role, keep receiver admission aligned with the intended close-group/storage-admission width and adjust the uploader so it does not select quote/storage targets outside that authority set.

@dirvine

dirvine commented Jun 14, 2026

Copy link
Copy Markdown
Collaborator

Checked the ant-client history for the single-node payment upload path.

The over-query appears to have been introduced in WithAutonomi/ant-client by commit bf4aa562323d4a09006bdf1f7a5cf8becd34ddf3 on 2026-04-01:

fix: over-query 2x peers for quote collection fault tolerance

That changed Client::get_store_quotes from:

let mut remote_peers = self.close_group_peers(address).await?;
remote_peers.truncate(CLOSE_GROUP_SIZE);

to:

let over_query_count = CLOSE_GROUP_SIZE * 2;
let remote_peers = self
    .network()
    .find_closest_peers(address, over_query_count)
    .await?;

and then it sorts successful responses by XOR distance and truncates back to CLOSE_GROUP_SIZE.

So the answer to Mick's question is: since 2026-04-01 in ant-client commit bf4aa56. Before that, this path queried/truncated to CLOSE_GROUP_SIZE only. That supports the direction of reverting/fixing the client quote collection rather than widening ant-node receiver authority.

@dirvine

dirvine commented Jun 14, 2026

Copy link
Copy Markdown
Collaborator

Follow-up answer to Chris' question about #139/#140/#141 against Mick's “no over-querying for SNP” principle:

So I’d keep #140’s XOR-only ordering idea, but revert/replace the #141 SNP issuer widening, and not revive #139’s K+margin issuer widening unless maintainers explicitly decide SNP quote authority is wider than strict K.

@jacderida jacderida closed this Jun 14, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants