Skip to content

Corruption-pattern survey tool for FEC design#85

Merged
josephnef merged 2 commits into
masterfrom
corruption-survey
Jun 7, 2026
Merged

Corruption-pattern survey tool for FEC design#85
josephnef merged 2 commits into
masterfrom
corruption-survey

Conversation

@josephnef
Copy link
Copy Markdown
Collaborator

Summary

Follow-up B from #83 (depends on #84's phy soft metrics): a chip-side DEVOURER_RX_DUMP_ALL=1 env var that emits one line per RX frame with the chip's full integrity + phy soft-metric vector, plus an aggregate analyser that turns those into FEC-design-grade statistics.

The previous work showed that the chip-corrupt pipeline now reaches the application layer (#83) and that per-frame phy metrics let the analyser correlate BER with SNR (#84). This PR is the third leg: a long-capture survey tool that characterises the actual corruption-pattern distribution real-world deployments face, so a FEC layer on top of the stream link can be sized empirically rather than guessed.

Changes

  • demo/main.cpp — new DEVOURER_RX_DUMP_ALL=1 knob emits <devourer-corrupt-any>len=L crc_err=X icv_err=Y rate=R rssi=A,B evm=A,B snr=A,B. Body bytes are deliberately omitted (a hot survey would inflate the log past usable size); the aggregate report only needs length + flags + phy.
  • tools/precoder/corruption_survey.py — new tool that reads those lines and reports:
    • headline chip-clean vs chip-corrupt counts
    • corruption rate broken down by DESC_RATE (the CCK-vs-OFDM split — without this the headline is dominated by always-clean CCK ACKs/beacons and underestimates what OFDM data faces)
    • frame-size distribution for each population
    • phy-metric stats per population, filtered to frames where the chip populated phy stats (CCK reports 0/0; we treat as "no measurement" instead of "0 dB" so the buckets don't collapse)
    • per-SNR-bucket corruption rate (where measurable)
    • temporal clustering (live captures only)
    • a heuristic FEC recommendation based on median-vs-peak corruption rate

Bench finding

60-second ch6 capture in a busy office environment with several APs in range:

=== corruption survey (2266 frames, file/pipe) ===
chip-clean       :   1663 ( 73.4%)
chip-corrupt     :    603 ( 26.6%)
corruption rate  : 26.61%
no-phy-measurement:  2103  (CCK/short frames, chip reports 0/0)

Corruption rate by DESC_RATE:
   idx name            count      %    corrupt    rate
  0x00 1M CCK           2075  91.6%        412  19.9%
  0x02 5.5M CCK            2   0.1%          2 100.0%
  0x03 11M CCK             1   0.0%          1 100.0%
  0x04 6M OFDM            17   0.8%         17 100.0%
  0x05 9M OFDM            19   0.8%         19 100.0%
  0x06 12M OFDM           20   0.9%         20 100.0%
  0x07 18M OFDM           31   1.4%         31 100.0%
  0x08 24M OFDM           22   1.0%         22 100.0%
  0x09 36M OFDM           30   1.3%         30 100.0%
  0x0a 48M OFDM           31   1.4%         31 100.0%
  0x0b 54M OFDM           18   0.8%         18 100.0%

Reading the result

  • 1M CCK loses ~20% even at this location — CCK is robust but background interference still nukes one in five ACKs/beacons.
  • Every OFDM rate above CCK is 100% corrupt because we're hearing distant APs at marginal SNR — the chip detects them, decodes them, fails the FCS, and now (with Surface CRC/ICV-corrupted RX frames + analysis tool #83's RCR change) surfaces them.

The FEC-design takeaway:

  • The PoC's 6M OFDM stream link only works because TX and RX are co-located. At any real range the chip will surface FCS failures at high rate.
  • The stream layer needs inter-frame parity (Reed-Solomon over N frames + K parity, Raptor, etc.) to recover from blocks of lost frames, not just per-frame FEC.
  • For a P2P link's typical "moderate range" use case (e.g. OpenIPC long-range video), expect frame loss rates in the 30–70% range. FEC overhead has to be sized accordingly — at 50% loss you need K/N ≈ 0.5 to be reliable.

Follow-ups (for whoever picks up the FEC layer)

  • Pick a parity scheme (Reed-Solomon is simplest, Raptor scales better) and parametrise N, K against captures from realistic ranges.
  • Decide where parity rides: in-band on the same SA (current TX path) vs. on a dedicated SA / frame type. In-band keeps the link simple but eats stream airtime.
  • Consider degrading rate gracefully (rateless codes) so the receiver can decode at whatever fraction of N+K frames it actually receives.

Builds on #83 (chip-level filter open, merged) and #84 (phy soft metrics, open).

🤖 Generated with Claude Code

josephnef and others added 2 commits June 7, 2026 16:31
Follow-up A from #83. Adds per-path RSSI / EVM / SNR to every
<devourer-stream> line so corruption_analysis.py can correlate BER
with link quality on a per-frame basis instead of relying on
aggregated statistics.

* demo/main.cpp: <devourer-stream>rate=R len=L crc_err=X icv_err=Y
  rssi=A,B evm=A,B snr=A,B body=HEX. Same source as the Tier-2
  diagnostics in <devourer-body>; no new RX-status fields, just
  surfacing what FrameParser already populates.
* tools/precoder/corruption_analysis.py: parses the new fields,
  reports
    - SNR distribution (min/p25/med/p75/max) for chip-clean vs
      chip-corrupt populations
    - BER per 5-dB SNR bucket
  Uses max(snr_A, snr_B) as the "effective" SNR — on single-antenna
  1T1R sticks path B reads 0 (no signal, not "0 dB"), so a naive min
  would always report 0 and the bucket view collapses; max picks
  the active path on 1T1R and the stronger path on 2T2R
  single-stream operation.
* stream_rx.py / tun_p2p.py / precoder_stream_roundtrip.py: regex
  updated to tolerate the new optional rssi/evm/snr fields (none
  read them yet — pass-through compatibility).

Verification

Hardware (500 frames at default TX power, RTL8812AU → T2U Plus
RTL8821AU, ch 6):

    phy SNR (stronger path, dB):
      chip-clean    : n=467 min=0 p25=30 med=33 p75=38 max=51
      chip-corrupt  : n=0
    BER by SNR bucket (stronger path, 5-dB buckets):
      bucket       frames   bits-cmp   bit-err    BER
           0-5 dB        1        192        0   0.000e+00
         20-25 dB       11       2112        0   0.000e+00
         25-30 dB       76      14592        0   0.000e+00
         30-35 dB      178      34176        0   0.000e+00
         35-40 dB      122      23424        0   0.000e+00
         40-45 dB       55      10560        0   0.000e+00
         45-50 dB       19       3648        0   0.000e+00
         50-55 dB        5        960        0   0.000e+00

Bench link is too clean for chip-corrupt events even at the SNR tails,
which matches the post-PR-investigation finding for #83: at bench
distance the loss is at PHY sync, not FCS. The analyser is ready for
noisier deployments / range-extended captures (follow-up B).

Offline smoke (synthetic 5-clean@28dB + 5-corrupt@5dB injection)
correctly buckets BER=0 in the 25-30 dB bucket and BER=1.04e-2 in the
5-10 dB bucket — the per-bucket correlation works as designed.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Follow-up B from #83 (and depends on #84's phy soft metrics): adds a
chip-side DEVOURER_RX_DUMP_ALL env var that emits a
<devourer-corrupt-any> line for every RX frame, plus an aggregate
analyser that turns those into FEC-design-grade statistics.

* demo/main.cpp: DEVOURER_RX_DUMP_ALL=1 emits one body-less line per
  frame with len + chip-flag bits + rate + per-path rssi/evm/snr.
  Body bytes are deliberately omitted (a hot survey would inflate
  the log past usable size); pkt_len + flags + phy is what the
  aggregate report needs.

* tools/precoder/corruption_survey.py: parses the new lines and
  reports
    - headline chip-clean / chip-corrupt counts
    - corruption rate broken down by DESC_RATE (the CCK vs OFDM
      split — without this the headline number is dominated by
      always-clean CCK ACKs and beacons and underestimates what
      OFDM data faces)
    - frame-size distribution for chip-clean vs chip-corrupt
    - phy-metric stats (rssi/evm/snr) per population, filtered to
      frames where the chip actually populated phy stats (CCK and
      short mgmt frames report 0/0; we treat those as "no
      measurement" instead of "0 dB" so the bucket views don't
      collapse)
    - per-SNR-bucket corruption rate (where measurable)
    - temporal clustering (when running live for >1 s; skipped on
      file/pipe input where all lines arrive at once)
  Output ends with a heuristic FEC recommendation based on
  median-vs-peak corruption rate.

Bench finding (60 s ch6 capture, busy office environment near
several APs):

  === corruption survey (2266 frames, file/pipe) ===
  chip-clean       :   1663 ( 73.4%)
  chip-corrupt     :    603 ( 26.6%)
  corruption rate  : 26.61%
  no-phy-measurement:  2103  (CCK/short frames, chip reports 0/0)

  Corruption rate by DESC_RATE:
     idx name            count      %    corrupt    rate
    0x00 1M CCK           2075  91.6%        412  19.9%
    0x02 5.5M CCK            2   0.1%          2 100.0%
    0x03 11M CCK             1   0.0%          1 100.0%
    0x04 6M OFDM            17   0.8%         17 100.0%
    0x05 9M OFDM            19   0.8%         19 100.0%
    0x06 12M OFDM           20   0.9%         20 100.0%
    0x07 18M OFDM           31   1.4%         31 100.0%
    0x08 24M OFDM           22   1.0%         22 100.0%
    0x09 36M OFDM           30   1.3%         30 100.0%
    0x0a 48M OFDM           31   1.4%         31 100.0%
    0x0b 54M OFDM           18   0.8%         18 100.0%

The FEC-design takeaway: 1M CCK is robust at ~20% loss because the
modulation is simple; every OFDM rate is 100% corrupt because we're
hearing distant APs at marginal SNR. The PoC's 6M OFDM stream link
works only because TX and RX are co-located — at any real range the
chip will surface FCS failures at high rate and the stream layer
needs inter-frame parity (Reed-Solomon / Raptor) to recover, not
just per-frame FEC. The tool gives FEC designers the concrete
inputs (rate distribution, snr distribution, time clustering) to
size the parity block and overhead.

Builds on #83 (chip-level filter open) and #84 (phy soft metrics).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@josephnef josephnef merged commit 1f5c843 into master Jun 7, 2026
5 checks passed
@josephnef josephnef deleted the corruption-survey branch June 7, 2026 14:20
josephnef added a commit that referenced this pull request Jun 7, 2026
## Summary

The corruption survey in #85 showed real-range OFDM frames on this link
will see **30–70% loss**. tun_p2p.py's blind `--repeat N` is a
fixed-cost workaround that can't compose to handle the tail; this PR
ships a real erasure code on top of the existing stream framing.

## Library

`raptorq` from cberner (Rust+PyO3 binding to the RFC 6330 reference
port). MIT, manylinux abi3 wheels on PyPI, ~26 Gbps enc / ~7 Gbps dec at
K=1000 on commodity x86. `uv add raptorq` is the only install step.

## Wire format

The existing `stream.py` framing stays untouched. FEC is an **inner
envelope** living inside `StreamFrame.payload`:

```
   FEC_MAGIC      (2)  = 0xF52E
   VERSION/FLAGS  (1)  = 0
   K              (1)  = source symbols per block
   KREAL          (1)  = real source symbols in this block (≤ K). Trailing
                        (K - KREAL) decoded symbols are zero-pad to discard.
   SYMBOL_SIZE    (2)  = LE u16
   BLOCK_ID       (2)  = LE u16 wraps
   RAPTORQ_PKT    (var) = lib-managed SBN+ESI+symbol
   inner overhead   = 9 B + raptorq's 4 B SBN/ESI = 13 B
```

Source symbols are themselves concatenations of length-prefixed IP
packets:

```
[u16 len_a][packet_a]…[u16 len_b][packet_b]…[zero pad to SYMBOL_SIZE]
```

So small packets (ACK floods) share symbols instead of each burning a
whole symbol's worth of airtime.

## Files

- `tools/precoder/pyproject.toml` — add `raptorq>=2`.
- `tools/precoder/stream_fec.py` — `FecConfig`, `FecEncoder`
(concatenation packing + block encoding), `FecDecoder`
(block-incremental decode + late-symbol drop + block expiry).
- `tools/precoder/test_stream_fec.py` — 19 unit tests: round-trip, loss
tolerance 0/20/40% at R/K=1, 50% at R/K=2, unrecoverable-block
bookkeeping at 70%, concatenation, partial flush, block-id wrap, MTU
enforcement, garbage envelopes.
- `tools/precoder/tun_p2p.py` — new
`--fec-k`/`--fec-overhead`/`--fec-symbol-size`/`--fec-flush-ms`/`--fec-block-expire-ms`
flags. tx_thread feeds packets through the encoder; a parallel
`fec_flush_thread` force-encodes partial blocks every flush-ms (sparse
traffic doesn't stall). rx_thread feeds payloads through the decoder;
decoded IP packets go to TUN. Outer `SeqWindow` dedup is forced OFF when
FEC is on (RaptorQ symbols self-dedup via SBN+ESI). New `fec=[...]`
segment in the periodic stderr report. Docstring extended.

## Hardware verification

Two-netns single-host bench (RTL8812AU `0x8812` + TP-Link Archer T2U
Plus / RTL8821AU `2357:0120`, ch 6, no `--repeat`, `ping -c 30 -i 1`):

| Config | RTT min/avg/max | Loss | DUP | Blocks ok/lost |
|---|---|---:|---:|---:|
| `--fec-k 16 --fec-overhead 1.0 --fec-flush-ms 50` | 121 / **160** /
207 ms | 0% | 0 | 30 / 1 (startup) |
| `--fec-k 8 --fec-overhead 1.0 --fec-flush-ms 20` | 73 / **95** / 145
ms | 0% | 0 | 30 / 1 (startup) |

The K=8 config trades a bit of recovery margin for a 65 ms drop in
median RTT. Both decode 100% of source packets on a healthy link; the
survey's noisier regimes are what motivates `--fec-overhead > 1`.

For comparison from PR #82's earlier numbers (same bench, byte mode):

| Mode | Loss | Avg RTT |
|---|---:|---:|
| Byte mode `--repeat 1` | 10% | 7 ms |
| Byte mode `--repeat 4` + dedup | 0% | 10 ms (with up to 25 DUPs per
ping eaten by dedup) |
| **FEC K=8 R/K=1 flush=20**  | **0%** | **95 ms** |

FEC moves us from "blind redundancy + dedup" to "real erasure code". The
latency cost is the K-source-symbol encode buffer; the win is that the
codec scales gracefully to higher loss rates by raising `--fec-overhead`
instead of running out at `--repeat=∞`.

## Test plan

- [x] `cd tools/precoder && uv run pytest` → 87 passed (31 pipeline + 37
stream + 19 fec)
- [x] `python -m pytest tests/precoder_smoke.py
tests/precoder_stream_smoke.py` → 8 passed
- [x] tun_p2p.py --help parses cleanly (incl. all FEC flags)
- [x] Bench: K=16/R=1 and K=8/R=1, both 30/30 ping with 0% loss and 0
DUPs

## Open caveats (documented in script)

- Strict block boundaries — no cross-block FEC, no Raptor carousel. Good
enough at K=8–16 + 20–50 ms flush; revisit if the latency budget
tightens further.
- No rateless dynamic overhead — R/K is fixed at construction. A future
PR could let RX hint TX to send more repair symbols via a
reverse-channel feedback envelope.
- Patent note: RFC 6330 has Qualcomm patents largely expired in primary
jurisdictions by 2026; cberner's MIT lib explicitly notes this.

Builds on #82 (TUN bridge, merged), #83 (corrupted-frame surfacing,
merged), #84 (phy soft metrics, open), #85 (corruption survey, open).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant