Surface CRC/ICV-corrupted RX frames + analysis tool by josephnef · Pull Request #83 · OpenIPC/devourer

josephnef · 2026-06-07T09:35:00Z

Summary

Previously, devourer's RX path silently dropped every frame whose chip flagged CRC or ICV error — first at the chip's WMAC filter (RCR_ACRC32 / RCR_AICV both cleared in monitor-mode setup), then at FrameParser (if (crc_err || icv_err) break;, which threw out the bad frame AND every subsequent frame in the same USB aggregate). The application saw a clean-or-missing erasure channel with no way to inspect or recover from corruption.

This PR opens both gates behind a single env var (DEVOURER_RX_KEEP_CORRUPTED=1), keeping default behaviour unchanged for IP-stack consumers, and ships an analysis tool that quantifies the corruption pattern against a known TX source.

Changes

src/RadioManagementModule.cpp — hw_var_set_monitor adds RCR_ACRC32 | RCR_AICV to the monitor-mode RCR when DEVOURER_RX_KEEP_CORRUPTED is set. The chip's WMAC filter would otherwise drop corrupted frames before they reach the host at all; this was the silent gating bug that made the parser change a no-op on its own.
src/FrameParser.cpp — pkt_len sanity check moves before the crc/icv check (still needed to find the next aggregate boundary). On crc_err || icv_err the parser now logs + surfaces the packet with RxAtrib.crc_err/icv_err intact and continues processing the rest of the aggregate, instead of dropping it AND its aggregate-mates.
demo/main.cpp — <devourer-stream> lines now include crc_err=0/1 icv_err=0/1. Corrupted bodies are gated behind the same DEVOURER_RX_KEEP_CORRUPTED=1 flag, in lockstep with the chip filter.
txdemo/stream_tx_demo/main.cpp — DEVOURER_TX_POWER env var (default 40 unchanged), useful for stress-testing the receive path at attenuated SNR.
tools/precoder/corruption_analysis.py — reconstructs expected TX bodies from a source file, compares byte- and bit-wise against captured RX frames (clean or chip-corrupt), reports chip-clean vs chip-corrupt counts, total bit errors / BER, per-frame error distribution, and a byte-position histogram.
Regex updates in stream_rx.py, tun_p2p.py, and the roundtrip harness — accept the new optional crc_err=/icv_err= fields without breaking older logs.

Verification

On-air, real crc_err=1 events through the new path (RTL8821AU / TP-Link Archer T2U Plus 2357:0120, channel 6, DEVOURER_RX_KEEP_CORRUPTED=1, ~25 s of background-traffic capture):

Total 'RX corrupted frame surfaced' events: 746
Distribution by pkt_len: 364, 488, 547, 1057, 1087, 1099, 1278, 1296, 1330, 1379,
                          and 9 frames at 113  (mix of data and small mgmt frames)
Total RX pkts processed:    #8500

746 frames whose chip-FCS check failed were surfaced through FrameParser::recvbuf2recvframe. The unmodified parser would have dropped every one of them, plus their USB-aggregate-mates (each break discards the rest of the aggregate — typically 4–8 frames). The real-world deployment value of the fix is exactly this kind of traffic — frames the chip could tell us about but the old path threw on the floor.

Where the controlled stream's missing frames went (post-review verification):

We confirmed that the canonical-SA TX→RX stream itself stays clean even with DEVOURER_TX_POWER=1, by enabling a debug mode that dumps the first 30 header bytes of every corrupted frame regardless of SA match:

449 clean devourer-stream frames at len=1528  (our TX signature; all crc_err=0)
  0 corrupt-any frames at len 1500-1560        (no corrupted frames matching our size)
  0 corrupt-any frames containing ANY 5-byte fragment of canonical SA
985 corrupt-any frames captured                (top sizes: 32 [ACKs], 364 [mgmt],
                                                334 [mgmt], 1394 [background data])

So the 51 missing frames in 500 sent → 449 received are lost at PHY sync, not at FCS — they never reach the chip's decoder so no descriptor is produced. The 10% loss in the earlier tun_p2p --repeat 1 ping result is the same phenomenon. The bench link is too clean for FCS failures on the controlled stream; the value of this PR is for noisier real-world deployments (and for the 746 background events captured above, which prove the path works on live traffic).

Offline analyser validation (synthetic 5-clean + 5-corrupt mix injected into <devourer-stream> log, run through corruption_analysis.py):

captured        : 10
  chip-clean    : 5
  chip-corrupt  : 5  (crc_err or icv_err set)
matched seq     : 10
bit errors      : 10
BER (compared)  : 5.208e-03
byte-position error histogram:
   10       5/   10    50.0%
   15       5/   10    50.0%

Exact counts, exact positions — the analyser correctly identifies what was corrupted, where, and how badly.

Follow-ups (not in this PR)

Surface phy-level soft metrics (per-stream EVM/SNR) alongside the corruption flag so the analyser can correlate corruption with link quality.
Range-extended capture campaign to characterise real-world error distributions for a stream-layer FEC.

Builds on #82 (TUN p2p bridge), which is on master.

🤖 Generated with Claude Code

Previously, devourer's FrameParser dropped every RX frame whose chip flagged CRC or ICV error (`if (crc_err || icv_err) break;`), AND broke out of the loop entirely — so a single corrupted frame in a USB aggregate threw away every subsequent frame in the same aggregate too. The application saw a clean-or-missing-only "packet-erasure" channel, with no way to know what the corruption looked like. This PR: * `src/FrameParser.cpp`: reorder so the pkt_len sanity check (needed to find the next aggregate boundary) runs first; on crc/icv error we now log + surface the packet with the flag bits intact on `RxAtrib` instead of breaking. Consumers can still filter (existing behaviour if they ignore the flags) or analyse the corruption pattern. * `demo/main.cpp`: `<devourer-stream>` lines now include `crc_err=0/1 icv_err=0/1`. Filtering is opt-in via `DEVOURER_RX_KEEP_CORRUPTED=1` so a stream-mode consumer (stream_rx.py / tun_p2p.py) doesn't accidentally feed garbage into the IP stack — the byte-stream pipeline still drops corrupted frames by default, and the analysis tool opts in explicitly. * `tools/precoder/corruption_analysis.py`: new tool that reconstructs the expected TX-side bodies from a source file and compares them byte-by-byte and bit-by-bit against captured RX frames (clean OR chip-corrupt). Reports chip-clean vs chip-corrupt counts, total bit errors / BER, per-frame error stats, and a byte-position histogram — useful for spotting whether corruption is uniform across the body, clustered near the SERVICE-field offset, or concentrated in the trailing OFDM symbols where the 802.11 FCS lives. * Python regex helpers (`stream_rx.py`, `tun_p2p.py`, the harness) accept the new optional `crc_err=` / `icv_err=` fields without breaking on existing logs. Verification * Offline synthetic smoke: inject 5 corrupted + 5 clean bodies into a fake `<devourer-stream>` log, run corruption_analysis.py against the known source. Reports 5/5 chip-clean, 5/5 chip-corrupt, 10 byte errors at positions {10, 15} matching the injected XOR pattern, BER 5.2e-3, all matched seqs recovered correctly. * On-air run (channel 6, RTL8812AU TX → T2U Plus / RTL8821AU RX, 500 frames at `--repeat 1`): 461 frames captured, 0 chip-corrupt. The bench link is too clean to produce real FCS failures (the ~8% loss is sync-level — frames never reached the decoder, not corrupted-but-recoverable). The fix is for noisier real-world deployments where chip-corrupt frames will now surface for analysis or, eventually, FEC-style recovery. Follow-ups (not in this PR) * Surface phy-level soft metrics (per-stream EVM/SNR) alongside the flag so the analyser can correlate corruption with link quality. * Real-world capture at extended range to characterise actual error distributions and feed a FEC layer on top of the stream framing. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…knob The parser change alone was insufficient: the chip's WMAC drops CRC32-error and ICV-error frames BEFORE they hit the RX descriptor when RCR_ACRC32 / RCR_AICV are off. RadioManagementModule's monitor-mode RCR config left both bits clear, so the FrameParser's new "surface corrupted frame" path never fired — corrupted frames never made it past the chip filter to the host in the first place. Now `hw_var_set_monitor` reads `DEVOURER_RX_KEEP_CORRUPTED` (same env var as the demo's filter) and adds `RCR_ACRC32 | RCR_AICV` when set. Default behaviour is unchanged. Also adds `DEVOURER_TX_POWER` to StreamTxDemo (default 40 unchanged) for stress-testing the receive-error path at attenuated SNR. Verified on the bench: with both KEEP_CORRUPTED bits set, a 25-second capture surfaced **746 real `crc_err=1` events** from background ch6 traffic, e.g.: <devourer>RX corrupted frame surfaced: crc_err=1 icv_err=0 pkt_len=364 <devourer>RX corrupted frame surfaced: crc_err=1 icv_err=0 pkt_len=488 <devourer>RX corrupted frame surfaced: crc_err=1 icv_err=0 pkt_len=547 These are 802.11 data frames in the wild whose FCS failed; the unmodified parser would have dropped every one of them, plus likely many more in the same USB aggregate after each `break`. The canonical-SA stream itself stayed clean (8812 → T2U Plus bench link is too short-range to produce real FCS failures on a 6M-OFDM payload), so the analyser-vs-known-source path still relies on the synthetic smoke for end-to-end validation; the chip-level fix is what makes the parser path actually firing on real-world deployment. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

## Summary Follow-up A from #83. Adds per-path RSSI / EVM / SNR to every `<devourer-stream>` line so `corruption_analysis.py` can correlate BER with link quality on a per-frame basis instead of aggregated-only statistics. ## Changes - **`demo/main.cpp`** — `<devourer-stream>rate=R len=L crc_err=X icv_err=Y rssi=A,B evm=A,B snr=A,B body=HEX`. Same source as the Tier-2 diagnostics in `<devourer-body>`; no new RX-status fields, just surfacing what `FrameParser` already populates on `RxAtrib`. - **`tools/precoder/corruption_analysis.py`** — parses the new fields, reports two new sections: - SNR distribution (min/p25/med/p75/max) for chip-clean vs chip-corrupt populations - BER per 5-dB SNR bucket Uses `max(snr_A, snr_B)` as the "effective" SNR — on single-antenna 1T1R sticks path B reads 0 (no signal, not "0 dB"), so a naive `min` would collapse the bucket view; `max` picks the active path on 1T1R and the stronger path on 2T2R single-stream operation. - **`stream_rx.py` / `tun_p2p.py` / `precoder_stream_roundtrip.py`** — regex updated to tolerate the new optional `rssi=`/`evm=`/`snr=` fields. None of them use the metrics yet (pass-through compatibility). ## Hardware verification 500 frames at default TX power, RTL8812AU → T2U Plus RTL8821AU, ch 6: ``` phy SNR (stronger path, dB): chip-clean : n=467 min=0 p25=30 med=33 p75=38 max=51 chip-corrupt : n=0 BER by SNR bucket (stronger path, 5-dB buckets): bucket frames bits-cmp bit-err BER 0-5 dB 1 192 0 0.000e+00 20-25 dB 11 2112 0 0.000e+00 25-30 dB 76 14592 0 0.000e+00 30-35 dB 178 34176 0 0.000e+00 35-40 dB 122 23424 0 0.000e+00 40-45 dB 55 10560 0 0.000e+00 45-50 dB 19 3648 0 0.000e+00 50-55 dB 5 960 0 0.000e+00 ``` Bench link is too clean for chip-corrupt events even at the SNR tails — same finding as the post-PR-investigation in #83 (loss is at PHY sync, not FCS). The analyser is ready for noisier deployments / range-extended captures (follow-up B). ## Offline analyser smoke Synthetic 5-clean@28dB + 5-corrupt@5dB injection. Analyser correctly buckets: ``` BER by SNR bucket (stronger path, 5-dB buckets): bucket frames bits-cmp bit-err BER 5-10 dB 5 960 10 1.042e-02 25-30 dB 5 960 0 0.000e+00 ``` The per-bucket correlation works as designed — corrupted samples land in the 5-10 dB bucket at 1.04×10⁻² BER, clean samples land at high SNR with BER 0. Builds on #83 (merged). Next: follow-up B — characterise real-world background corruption patterns (burst-length distribution, byte-position distribution) to inform stream-layer FEC design. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

## Summary Follow-up B from #83 (depends on #84's phy soft metrics): a chip-side `DEVOURER_RX_DUMP_ALL=1` env var that emits one line per RX frame with the chip's full integrity + phy soft-metric vector, plus an aggregate analyser that turns those into FEC-design-grade statistics. The previous work showed that the **chip-corrupt** pipeline now reaches the application layer (#83) and that per-frame phy metrics let the analyser correlate BER with SNR (#84). This PR is the third leg: a **long-capture survey** tool that characterises the actual corruption-pattern distribution real-world deployments face, so a FEC layer on top of the stream link can be sized empirically rather than guessed. ## Changes - **`demo/main.cpp`** — new `DEVOURER_RX_DUMP_ALL=1` knob emits `<devourer-corrupt-any>len=L crc_err=X icv_err=Y rate=R rssi=A,B evm=A,B snr=A,B`. Body bytes are deliberately omitted (a hot survey would inflate the log past usable size); the aggregate report only needs length + flags + phy. - **`tools/precoder/corruption_survey.py`** — new tool that reads those lines and reports: - headline chip-clean vs chip-corrupt counts - **corruption rate broken down by DESC_RATE** (the CCK-vs-OFDM split — without this the headline is dominated by always-clean CCK ACKs/beacons and underestimates what OFDM data faces) - frame-size distribution for each population - phy-metric stats per population, filtered to frames where the chip populated phy stats (CCK reports 0/0; we treat as "no measurement" instead of "0 dB" so the buckets don't collapse) - per-SNR-bucket corruption rate (where measurable) - temporal clustering (live captures only) - a heuristic FEC recommendation based on median-vs-peak corruption rate ## Bench finding 60-second ch6 capture in a busy office environment with several APs in range: ``` === corruption survey (2266 frames, file/pipe) === chip-clean : 1663 ( 73.4%) chip-corrupt : 603 ( 26.6%) corruption rate : 26.61% no-phy-measurement: 2103 (CCK/short frames, chip reports 0/0) Corruption rate by DESC_RATE: idx name count % corrupt rate 0x00 1M CCK 2075 91.6% 412 19.9% 0x02 5.5M CCK 2 0.1% 2 100.0% 0x03 11M CCK 1 0.0% 1 100.0% 0x04 6M OFDM 17 0.8% 17 100.0% 0x05 9M OFDM 19 0.8% 19 100.0% 0x06 12M OFDM 20 0.9% 20 100.0% 0x07 18M OFDM 31 1.4% 31 100.0% 0x08 24M OFDM 22 1.0% 22 100.0% 0x09 36M OFDM 30 1.3% 30 100.0% 0x0a 48M OFDM 31 1.4% 31 100.0% 0x0b 54M OFDM 18 0.8% 18 100.0% ``` ## Reading the result - **1M CCK loses ~20%** even at this location — CCK is robust but background interference still nukes one in five ACKs/beacons. - **Every OFDM rate above CCK is 100% corrupt** because we're hearing distant APs at marginal SNR — the chip detects them, decodes them, fails the FCS, and now (with #83's RCR change) surfaces them. The FEC-design takeaway: - The PoC's 6M OFDM stream link only works because TX and RX are co-located. At any real range the chip will surface FCS failures at high rate. - The stream layer needs **inter-frame parity** (Reed-Solomon over N frames + K parity, Raptor, etc.) to recover from blocks of lost frames, not just per-frame FEC. - For a P2P link's typical "moderate range" use case (e.g. OpenIPC long-range video), expect frame loss rates in the 30–70% range. FEC overhead has to be sized accordingly — at 50% loss you need K/N ≈ 0.5 to be reliable. ## Follow-ups (for whoever picks up the FEC layer) - Pick a parity scheme (Reed-Solomon is simplest, Raptor scales better) and parametrise N, K against captures from realistic ranges. - Decide where parity rides: in-band on the same SA (current TX path) vs. on a dedicated SA / frame type. In-band keeps the link simple but eats stream airtime. - Consider degrading rate gracefully (rateless codes) so the receiver can decode at whatever fraction of N+K frames it actually receives. Builds on #83 (chip-level filter open, merged) and #84 (phy soft metrics, open). 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

## Summary The corruption survey in #85 showed real-range OFDM frames on this link will see **30–70% loss**. tun_p2p.py's blind `--repeat N` is a fixed-cost workaround that can't compose to handle the tail; this PR ships a real erasure code on top of the existing stream framing. ## Library `raptorq` from cberner (Rust+PyO3 binding to the RFC 6330 reference port). MIT, manylinux abi3 wheels on PyPI, ~26 Gbps enc / ~7 Gbps dec at K=1000 on commodity x86. `uv add raptorq` is the only install step. ## Wire format The existing `stream.py` framing stays untouched. FEC is an **inner envelope** living inside `StreamFrame.payload`: ``` FEC_MAGIC (2) = 0xF52E VERSION/FLAGS (1) = 0 K (1) = source symbols per block KREAL (1) = real source symbols in this block (≤ K). Trailing (K - KREAL) decoded symbols are zero-pad to discard. SYMBOL_SIZE (2) = LE u16 BLOCK_ID (2) = LE u16 wraps RAPTORQ_PKT (var) = lib-managed SBN+ESI+symbol inner overhead = 9 B + raptorq's 4 B SBN/ESI = 13 B ``` Source symbols are themselves concatenations of length-prefixed IP packets: ``` [u16 len_a][packet_a]…[u16 len_b][packet_b]…[zero pad to SYMBOL_SIZE] ``` So small packets (ACK floods) share symbols instead of each burning a whole symbol's worth of airtime. ## Files - `tools/precoder/pyproject.toml` — add `raptorq>=2`. - `tools/precoder/stream_fec.py` — `FecConfig`, `FecEncoder` (concatenation packing + block encoding), `FecDecoder` (block-incremental decode + late-symbol drop + block expiry). - `tools/precoder/test_stream_fec.py` — 19 unit tests: round-trip, loss tolerance 0/20/40% at R/K=1, 50% at R/K=2, unrecoverable-block bookkeeping at 70%, concatenation, partial flush, block-id wrap, MTU enforcement, garbage envelopes. - `tools/precoder/tun_p2p.py` — new `--fec-k`/`--fec-overhead`/`--fec-symbol-size`/`--fec-flush-ms`/`--fec-block-expire-ms` flags. tx_thread feeds packets through the encoder; a parallel `fec_flush_thread` force-encodes partial blocks every flush-ms (sparse traffic doesn't stall). rx_thread feeds payloads through the decoder; decoded IP packets go to TUN. Outer `SeqWindow` dedup is forced OFF when FEC is on (RaptorQ symbols self-dedup via SBN+ESI). New `fec=[...]` segment in the periodic stderr report. Docstring extended. ## Hardware verification Two-netns single-host bench (RTL8812AU `0x8812` + TP-Link Archer T2U Plus / RTL8821AU `2357:0120`, ch 6, no `--repeat`, `ping -c 30 -i 1`): | Config | RTT min/avg/max | Loss | DUP | Blocks ok/lost | |---|---|---:|---:|---:| | `--fec-k 16 --fec-overhead 1.0 --fec-flush-ms 50` | 121 / **160** / 207 ms | 0% | 0 | 30 / 1 (startup) | | `--fec-k 8 --fec-overhead 1.0 --fec-flush-ms 20` | 73 / **95** / 145 ms | 0% | 0 | 30 / 1 (startup) | The K=8 config trades a bit of recovery margin for a 65 ms drop in median RTT. Both decode 100% of source packets on a healthy link; the survey's noisier regimes are what motivates `--fec-overhead > 1`. For comparison from PR #82's earlier numbers (same bench, byte mode): | Mode | Loss | Avg RTT | |---|---:|---:| | Byte mode `--repeat 1` | 10% | 7 ms | | Byte mode `--repeat 4` + dedup | 0% | 10 ms (with up to 25 DUPs per ping eaten by dedup) | | **FEC K=8 R/K=1 flush=20** | **0%** | **95 ms** | FEC moves us from "blind redundancy + dedup" to "real erasure code". The latency cost is the K-source-symbol encode buffer; the win is that the codec scales gracefully to higher loss rates by raising `--fec-overhead` instead of running out at `--repeat=∞`. ## Test plan - [x] `cd tools/precoder && uv run pytest` → 87 passed (31 pipeline + 37 stream + 19 fec) - [x] `python -m pytest tests/precoder_smoke.py tests/precoder_stream_smoke.py` → 8 passed - [x] tun_p2p.py --help parses cleanly (incl. all FEC flags) - [x] Bench: K=16/R=1 and K=8/R=1, both 30/30 ping with 0% loss and 0 DUPs ## Open caveats (documented in script) - Strict block boundaries — no cross-block FEC, no Raptor carousel. Good enough at K=8–16 + 20–50 ms flush; revisit if the latency budget tightens further. - No rateless dynamic overhead — R/K is fixed at construction. A future PR could let RX hint TX to send more repair symbols via a reverse-channel feedback envelope. - Patent note: RFC 6330 has Qualcomm patents largely expired in primary jurisdictions by 2026; cberner's MIT lib explicitly notes this. Builds on #82 (TUN bridge, merged), #83 (corrupted-frame surfacing, merged), #84 (phy soft metrics, open), #85 (corruption survey, open). 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

josephnef and others added 2 commits June 7, 2026 12:34

josephnef merged commit b210f7e into master Jun 7, 2026
5 checks passed

josephnef deleted the surface-corrupted-rx branch June 7, 2026 13:24

This was referenced Jun 7, 2026

Phy-level soft metrics on stream lines + BER-vs-SNR analyser #84

Merged

Corruption-pattern survey tool for FEC design #85

Merged

RaptorQ (RFC 6330) FEC layer for the stream link #86

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Surface CRC/ICV-corrupted RX frames + analysis tool#83

Surface CRC/ICV-corrupted RX frames + analysis tool#83
josephnef merged 2 commits into
masterfrom
surface-corrupted-rx

josephnef commented Jun 7, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

josephnef commented Jun 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes

Verification

Follow-ups (not in this PR)

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

josephnef commented Jun 7, 2026 •

edited

Loading