diff --git a/docs/development/run-log-encoding-options.md b/docs/development/run-log-encoding-options.md new file mode 100644 index 0000000..a787ab7 --- /dev/null +++ b/docs/development/run-log-encoding-options.md @@ -0,0 +1,136 @@ +# Run-log encoding options (course pipeline) + +**Status:** proposal for review. **Tracking:** [#140](https://github.com/reiserlab/webDisplayTools/issues/140). + +## TL;DR + +The bench run-log (FicTrac-bridge JSONL, auto-committed per run to +`runlogs//` in the course repo) should carry **turning behaviour + +forward locomotion** without bloating. Recommendation: log **4 curated FicTrac +fields** (per-frame side / forward / yaw + integrated heading) as **compact CSV +frame rows** (relative-ms timestamps, 4–5 significant-figure floats) interleaved +with JSON event rows. That's **~1.8 MB for a 15-minute run** — ~90 % smaller than +dumping all 25 columns as JSON, and even smaller than today's (behaviour-less) +default — while staying human-readable and self-describing. + +## Context + +- **One committed log = one run.** The bridge rotates a fresh file at each run + start, so each file is scoped start-to-finish (bench-verified 2026-07-04). +- **GitHub ceiling ≈ 35 MiB/file** (Contents API; measured). Anything under a few + MB is comfortable. +- **Frame rate ≈ 50 Hz** on the current bench (measured: 2506 frames / 50.1 s). + A 15-min run ≈ **45,000 frames**. **Frame rows are ~99 % of the bytes**; + session/runner events are a handful. +- All sizes below are estimates at 50 Hz. **Double the camera rate → double the + file.** + +## The behavioural fields to keep + +Goal: turning + forward locomotion. FicTrac's **delta rotation vector in lab +coordinates (cols 6–8)** gives per-frame velocities on exactly these axes; +**integrated heading (col 17)** gives absolute orientation. + +| key | FicTrac col (1-based) | `fields[]` idx | meaning | unit | +| ------ | --------------------- | -------------- | ---------------------------- | --------- | +| `side` | 6 | 5 | sideways / slip velocity | rad/frame | +| `fwd` | 7 | 6 | **forward** velocity | rad/frame | +| `yaw` | 8 | 7 | **turning** velocity | rad/frame | +| `hd` | 17 | 16 | integrated heading (facing) | rad | + +**Scaling for analysis:** forward mm/s = `fwd × ball_radius × fps`; turning °/s = +`yaw × fps × 57.3`. The axis mapping (x = side, y = forward, z = yaw) is FicTrac's +standard lab-frame convention — worth a 10-second bench check (walk the ball +straight → `fwd` dominates; spin in place → `yaw` dominates) to confirm signs +before the course. + +### Every frame row also carries identity + timing + +The 4 fields above are the *behaviour*. Each frame row keeps them **in addition +to** the frame's identity + timing — **not instead of**: + +| key | source | meaning | +| ------- | ------------- | ------------------------------------------------------------------------------------------------------------------ | +| `ms` | bridge clock | timestamp, stored as **relative ms since run start** (the absolute epoch start is logged once in `run_metadata`) | +| `index` | arena | the frame index the controller **displayed** at that instant — needed to correlate behaviour with what the fly saw | +| `seq` | FicTrac | FicTrac's own frame counter — lets you detect **dropped frames**; the one field you may drop (~8 B) if you don't need that QC | + +So a full frame row is **`ms, seq, index, side, fwd, yaw, hd`** (7 values). The +size estimates below already include `index` + timestamp; `seq` adds ~8 B/frame +(~0.35 MB over a 15-min run) if kept. + +Optional add-ons: cols 15–16 (integrated x/y position) for the 2-D walking path; +the full 25 columns for archival/QC. + +## Encoding options — sizes for a 15-min run (45,000 frames @ 50 Hz) + +| # | Scheme | data kept | ~B/frame | 15-min file | vs #0 | readable | +| ----- | -------------------------------------------------- | ---------------------- | -------- | ----------- | ----- | ----------- | +| 0 | Full 25 cols — verbose JSON (`--log-frames` today) | everything | ~350 | **~16 MB** | — | yes | +| 1 | Full 25 cols — CSV rows + JSON events | everything | ~260 | ~12 MB | −26 % | yes | +| 2 | Curated 4 — verbose JSON | turn + fwd + slip + hd | ~125 | ~5.6 MB | −64 % | yes | +| 3 | Curated 4 — short-key JSON (drop `"type"`) | same | ~85 | ~3.8 MB | −76 % | ~ | +| 4 | Curated 4 — CSV rows + JSON events (abs ms) | same | ~53 | ~2.4 MB | −85 % | yes+schema | +| **5★**| **Curated 4 — CSV + relative-ms + rounded floats** | same | **~40** | **~1.8 MB** | −89 % | yes+schema | +| 6 | #5 + gzip (`.jsonl.gz`) | same | ~11 | ~0.5 MB | −97 % | no (opaque) | +| — | _today's default (no behaviour data)_ | seq / index / t | ~66 | ~3.0 MB | — | yes | + +**Two levers do most of the work** (independent of the row format): + +- **Relative time** — store ms-since-run-start (the epoch start is logged once in + `run_metadata`) instead of the 13-digit absolute epoch. Saves ~7–8 B/frame — + the single biggest field. +- **Float rounding** — FicTrac deltas are tiny; 4–5 significant figures is ample + behavioural resolution. + +**gzip** layers on top of _any_ row format (≈ 3–4× on CSV, ≈ 7× on verbose JSON) +as an escape hatch for an unusually long run — at the cost of GitHub inline +preview. + +## Recommended scheme (#5) + +Interleave: **JSON** for the few session/runner event rows, **bare CSV** for the +frame rows, with a one-time schema line. Parser rule: a line starting with `{` is +a JSON event; otherwise it's a CSV frame per the declared schema. + +```jsonl +{"type":"logging_started","file":"arena-log-20260704-134533-129.jsonl","ms":1783187133129} +{"type":"run_metadata","run_id":"nkts77qj","experimenter":"michael","genotype":"Canton-S","rig_id":"bench01","protocol_filename":"g6_2x10_smoke.yaml","t0_ms":1783187133129,"ball_radius_mm":4.5,"fps":50} +{"type":"frame_schema","cols":["ms","seq","index","side","fwd","yaw","hd"]} +0,1054712,64,-0.0123,0.0345,0.0082,1.2345 +20,1054713,64,-0.0110,0.0361,0.0075,1.2352 +40,1054714,65,-0.0098,0.0357,0.0069,1.2359 +{"type":"runner","phase":"sequence-complete","rx_ms":1783187183251} +{"type":"logging_stopped","ms":1783187183252} +``` + +- **Self-describing** — `frame_schema` gives column order; `run_metadata` carries + the epoch start, ball radius, and fps needed to convert to real units. +- **Streamable** — the bridge writes each frame row as it arrives (no buffering). +- **~2.2 MB / 15-min run** with `seq` kept (~1.8 MB — table row #5 — if `seq` is + dropped), far under the 35 MB ceiling even at 2× the frame rate or 3× the + duration. + +## Open questions + +1. **Fields** — are `side / fwd / yaw / hd` the right minimal set, or do you also + want raw delta-rotation-cam (cols 2–4) or absolute orientation (cols 12–14) + for QC? +2. **Frame rate** — confirm the course cameras run ~50 Hz (drives every size). + At 100 Hz, #5 is ~3.6 MB/15-min (still fine); full JSON (#0) would be ~32 MB, + right at the ceiling. +3. **Archival** — keep a `full` (all-25) option alongside `curated` in case + reprocessing later needs a dropped column? +4. **gzip** — acceptable to opt into `.jsonl.gz` for long runs, or must every + committed log stay plain text on GitHub? +5. **Precision** — OK to round floats to ~5 sig figs? (`seq`, `index`, and the + timestamp are all **kept** per frame alongside the 4 behavioural fields; `seq` + is the only droppable one, ~8 B/frame, if dropped-frame QC isn't needed.) + +## References + +- Tracking issue: [reiserlab/webDisplayTools#140](https://github.com/reiserlab/webDisplayTools/issues/140) +- Shipped pipeline: PR #139 (Arena Studio v0.5 course data pipeline) +- FicTrac column reference: `rjdmoore/fictrac` → `doc/data_header.txt` +- This repo's column map: `fictrac-bridge/fictrac_sim.py` (header comment) + + `frame_index_from_fictrac()` in `fictrac-bridge/bridge.py`