Skip to content

Run-log: curated FicTrac column subset (middle ground) + optional size reduction #140

Description

@mbreiser

Motivation

Bridge run logs (JSONL committed to the course repo) currently offer two extremes per FicTrac frame (fictrac-bridge/bridge.py write_frame):

  • default — just {"type":"fictrac_frame","seq":…,"index":…,"t":…} (the arena index + timing; raw tracking discarded). ~65 B/line, ~196 KB per 50 s run.
  • --log-frames — the full ~25-column FicTrac record appended as "fictrac":[…]. ~3–4× bigger (~8 MB per 10 min run).

We want a middle ground: keep a few scientifically-useful columns, not all 25, and not just seq/index/t. The index alone lets you recover heading (via gain) but permanently loses position/velocity/direction — unrecoverable after the fact.

Ask 1 — curated FicTrac column subset

Log a bounded, configurable subset. Candidate default set:

  • integrated animal heading (col 17)
  • integrated x / y position (cols 15–16)
  • animal movement direction (col 18) + speed (col 19)

Implementation sketch:

  • In Pipeline.handle_line the full fields list is already parsed (~line 254); write_frame decides what to store (~line 221).
  • Add a --log-fields option (named preset like curated/full/min, or an explicit column list) so the bench chooses the payload. Store curated fields under short, stable keys (e.g. hd,x,y,dir,spd) so downstream parsing is simple.
  • Keep --log-frames (= full) working.

Ask 2 — optional log size reduction (related)

Logs are plain JSONL with heavily repeated keys → very compressible. Options, best-first:

  1. Gzip before commit (.jsonl.gz): ~5–10× lossless (browser CompressionStream('gzip') or bridge-side). Trades GitHub inline-preview for size; analysis tools read gzip natively. Turns the 60 MB worst case into ~6–8 MB.
  2. Curated columns (Ask 1) — fewer floats/frame.
  3. Terser schema — drop the constant "type":"fictrac_frame", short keys, or a columnar frame block with one header row (~40–60%, stays text).
  4. Decimate — every Nth frame / on-change (loses temporal resolution).
  5. Binary packing — smallest, opaque, needs a parser (last resort).

Recommendation

Ship the curated subset as the course default, keep small runs as human-readable JSONL, and add opt-in gzip for long runs. Revisit only if real course logs approach the ~35 MB GitHub Contents-API ceiling.

Context

  • Shipped in Arena Studio v0.5: CSHL course data pipeline (Session 2) #139 (Arena Studio v0.5 course pipeline). Bench-verified: ~196 KB / 50 s default run.
  • GitHub Contents-API commit ceiling ≈ 35 MiB/file (measured); bridge→browser WS export tested to 50 MB.
  • Per-run rotation already scopes each committed log to one experiment.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Fields

    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions