fix: token usage once per message (+ Codex per-step attribution); kind v1.1.0 by akesling · Pull Request #106 · empathic/toolpath

akesling · 2026-06-10T18:38:05Z

Situation

Derived documents over-counted token spend, and the kind spec was silent on how token_usage on steps relates to API-message accounting — so every consumer was left to guess, and the natural reading (sum per step) gave inflated numbers.

- Claude: Claude Code writes one JSONL line per content block of an assistant API message. toolpath-claude emitted one step per line, each carrying that line's usage (~3× output-token inflation on real sessions), and dropped the message.id that ties the lines back to one message.
- Codex: toolpath-codex stamped the cumulative session counter (total_token_usage) onto each assistant turn, so per-step sums grew quadratically. Codex also re-emits a stale last_token_usage on repeated token_count events, so any sum-the-deltas approach double-counts.
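To make the quadratic growth concrete, here is a minimal sketch. The function names are hypothetical and the figures are illustrative, not taken from real sessions; it only models what happens when each step is stamped with the running session counter rather than its own spend.

```rust
/// Real spend: each call's own tokens, counted once.
fn true_total(per_call_spend: &[u64]) -> u64 {
    per_call_spend.iter().sum()
}

/// What a consumer gets by summing steps that were each stamped with the
/// cumulative session counter (hypothetical model of the old behavior).
fn naive_sum_of_cumulative(per_call_spend: &[u64]) -> u64 {
    let mut running = 0u64; // the cumulative counter after each call
    let mut naive = 0u64; // the sum over the stamped per-step values
    for spend in per_call_spend {
        running += spend;
        naive += running;
    }
    naive
}
```

With four calls of 100 tokens each, the true total is 400 while the naive sum is 100 + 200 + 300 + 400 = 1000, and the gap widens with every additional step.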

Investigation against real data (all 287 Claude sessions, all Codex sessions on disk) plus authoritative docs settled two questions the data alone couldn't:

- Claude per-line usage is a cumulative streaming snapshot, not a per-block bill. The Anthropic streaming API seeds output_tokens near zero at message_start and reports the running cumulative total on each message_delta; Claude Code stamps each line with whatever snapshot was current at flush time. Across 1,959 streamed groups the final line is the max in 100% of cases, and 87% of non-final prose lines report ≤5 output tokens, so differencing the lines does not recover per-block costs.
- Codex per-call deltas are real and provider-reported. OpenAI's own issue tracker (openai/codex #14489, #17539) defines total_token_usage as cumulative and last_token_usage as the per-call delta, and documents the stale re-emission that makes summing last_token_usage over-count.
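The Claude finding suggests reducing the snapshots of one message to a field-wise maximum instead of trusting the last line. The following is a sketch under assumptions: the `Usage` struct and `field_wise_max` are hypothetical names, and `cache_read_tokens` stands in for whatever cache counters the source carries.

```rust
/// Hypothetical per-line usage snapshot; field names are illustrative.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
struct Usage {
    input_tokens: u64,
    output_tokens: u64,
    cache_read_tokens: u64,
}

/// Reduce all snapshots of one message to the field-wise maximum.
/// Because each field is maximized independently, the result does not
/// depend on the order in which the lines were written.
fn field_wise_max(snapshots: &[Usage]) -> Option<Usage> {
    snapshots.iter().copied().reduce(|a, b| Usage {
        input_tokens: a.input_tokens.max(b.input_tokens),
        output_tokens: a.output_tokens.max(b.output_tokens),
        cache_read_tokens: a.cache_read_tokens.max(b.cache_read_tokens),
    })
}
```

For snapshots whose output_tokens read 3, 1, and 120, the reduction yields 120 in any line order. The differences between consecutive snapshots reflect flush timing, which is why they are not used as per-block costs.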

Resolution

The contract (kind `agent-coding-session` v1.1.0, both fields optional)

- token_usage = "the total for a group", on the group's final step. Σ token_usage over a path = session total. Verbatim from the source (for Claude, the field-wise max across the split: the final streaming snapshot, found without trusting line order).
- attributed_token_usage = "this step's own attributed spend", on its own key so the sum above is unaffected. Populated only where the source genuinely reports per-step spend. Whether a number is a total or a share is structural (the key it's under), never positional. The unattributed remainder (group total − Σ attributed) is computed by consumers, never recorded.
- Optional ⇒ any producer can populate attribution now or later with no further kind version. Folded into the still-unreleased v1.1.0.
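A consumer-side reading of this contract might look like the sketch below. The `Step` struct is a hypothetical simplification that collapses each usage record to a single token count; the real fields carry several token classes.

```rust
/// Hypothetical, simplified step: one number per field instead of a full
/// usage record.
#[derive(Debug, Clone, Default)]
struct Step {
    /// Group total; present only on the group's final step.
    token_usage: Option<u64>,
    /// This step's own share, if the source reports one.
    attributed_token_usage: Option<u64>,
}

/// Session total: a plain sum over token_usage, no dedup needed.
fn session_total(steps: &[Step]) -> u64 {
    steps.iter().filter_map(|s| s.token_usage).sum()
}

/// Unattributed remainder of one group: total minus the sum of shares.
/// Computed on demand by the consumer, never stored in the document.
fn unattributed_remainder(group: &[Step]) -> Option<u64> {
    let total = group.iter().rev().find_map(|s| s.token_usage)?;
    let attributed: u64 = group
        .iter()
        .filter_map(|s| s.attributed_token_usage)
        .sum();
    Some(total.saturating_sub(attributed))
}
```

Because totals and shares live under different keys, `session_total` never has to inspect attribution, and a group with no attribution at all (the Claude case) simply has a remainder equal to its total.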

The v1.0.0 page carries an erratum; path p validate bundles both schemas. PATH_KIND_AGENT_CODING_SESSION → v1.1.0 (old URI kept as …_V1_0_0).

Per provider

- toolpath-convo: Turn.group_id + Turn.attributed_token_usage; derive_path writes token_usage once per group and attributed_token_usage per step; extract_conversation reads both back. Schema gains the optional field.
- toolpath-claude: each group_id run reduced to the field-wise-max total on its final turn (never under-counts whatever the line order). No attributed_token_usage, because the per-line snapshots aren't per-block costs. total_usage deduped by group; the projector re-expands the total onto every line of a split (with the shared message.id).
- toolpath-codex: per-step spend = increase in the cumulative total_token_usage (differencing is dedup-safe against the stale re-emission; summing last_token_usage would double). Each per-call delta becomes that step's attributed_token_usage; a round's total is the sum of its steps' attributions (one source of truth, so total and shares can't drift). The projector emits a turn_context per group and a cumulative token_count after each step, so grouping and attribution round-trip.
- toolpath-pi / toolpath-opencode: all-zero wire usage counters decode as None ("spend unknown"), not Some(zeros).
- cross-harness matrix: survival invariant compares the order-preserving usage sequence on common fields.
- Gemini and Cursor audited clean (one message = one turn, per-message deltas).
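The Codex rule above (difference the cumulative counter rather than sum the reported deltas) can be sketched as follows. `TokenCount` is a hypothetical flattening of a token_count event to two numbers; the real event carries structured usage records.

```rust
/// Hypothetical flattened token_count event.
#[derive(Debug, Clone, Copy)]
struct TokenCount {
    total_token_usage: u64, // session-cumulative counter
    last_token_usage: u64,  // per-call delta, possibly re-emitted stale
}

/// Per-step spend as the increase in the cumulative counter. A repeated
/// event leaves the counter unchanged, so it contributes a delta of 0.
fn attributed_deltas(events: &[TokenCount]) -> Vec<u64> {
    let mut previous = 0u64;
    events
        .iter()
        .map(|event| {
            let delta = event.total_token_usage.saturating_sub(previous);
            previous = previous.max(event.total_token_usage);
            delta
        })
        .collect()
}

/// The approach to avoid: summing last_token_usage counts every stale
/// re-emission again.
fn sum_of_last(events: &[TokenCount]) -> u64 {
    events.iter().map(|e| e.last_token_usage).sum()
}
```

If two calls costing 100 and 150 tokens are each emitted twice, the deltas come out as 100, 0, 150, 0 and sum to 250, whereas summing last_token_usage gives 500.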

Verification

- 1,622 tests passing, clippy clean, site builds; new tests at each layer (derive grouping, extract round-trip, Claude max-total / out-of-order / repeated-total canonicalization with no fabricated attribution, projector total re-expansion, Codex deduped-delta attribution).
- Real data: Claude per-step token_usage sums to the source ground truth (439,429 = max per message.id) with zero fabricated attribution; Codex Σ attributed == Σ token_usage == 14,233 and survives a full export→reimport round-trip.
- Conclusions corroborated against primary sources: the Anthropic streaming API reference (Claude) and openai/codex #14489 / #17539 (Codex). Format docs (usage.md, codex.md), the v1.1.0 spec page, and the schema carry the corrected semantics and citations.

Versioning

Minor bumps across all toolpath crates (the emitted kind URI changes for every deriver): toolpath 0.7.0, toolpath-convo 0.11.0, toolpath-claude 0.12.0, toolpath-codex 0.6.0, toolpath-gemini 0.6.0, toolpath-opencode 0.5.0, toolpath-cursor 0.2.0, toolpath-pi 0.6.0, toolpath-git/toolpath-github 0.6.0, toolpath-dot 0.5.0, toolpath-md 0.7.0, path-cli/toolpath-cli 0.14.0.

Existing v1.0.0 documents remain over-counted; the version gate is the consumer signal — sum v1.1.0 naively, keep dedup heuristics only for the v1.0.0 backlog (where tuple-equality still cannot repair Codex-derived documents).
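A consumer's version gate could be sketched as below. Everything here is an assumption for illustration: the `KindVersion` enum, the single-number usage values, and in particular the dedup heuristic (collapsing consecutive equal values), which stands in for whatever tuple-equality check a consumer already uses on v1.0.0 documents.

```rust
/// Hypothetical version tag parsed from the document's kind URI.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum KindVersion {
    V1_0_0,
    V1_1_0,
}

/// Total output tokens for a path, gated on the kind version.
fn output_total(version: KindVersion, per_step: &[Option<u64>]) -> u64 {
    let mut values: Vec<u64> = per_step.iter().flatten().copied().collect();
    if version == KindVersion::V1_0_0 {
        // Assumed legacy heuristic: treat consecutive equal values as
        // one message repeated across split lines.
        values.dedup();
    }
    // v1.1.0 documents are summed as-is.
    values.iter().sum()
}
```

The heuristic recovers a v1.0.0 Claude-style run such as 120, 120, 120, 50, but it cannot repair a v1.0.0 Codex-derived sequence such as 100, 200, 300: every cumulative value differs from its neighbor, so nothing is collapsed and the result stays inflated.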

Pathbase reported ~3x token over-counting on real Claude sessions: Claude Code writes one JSONL line per content block of an assistant API message, repeating the message-level usage on every line, and toolpath-claude emitted one step per line, each carrying the full usage, while dropping the message.id that disambiguates them. Codex was worse in a different way: cumulative session counters stamped per turn.

The contract, now specified by agent-coding-session v1.1.0 (and pinned by PATH_KIND_AGENT_CODING_SESSION): token_usage always means "the total for a message", verbatim from the source. Steps split from one message share a message_id; the total sits on the group's final step; a step without message_id is a one-step message. Summing token_usage over a path's steps therefore equals the session totals. Future per-step attribution is reserved for a separate field (the distinction is structural, not positional), so the summing rule is permanent.

- toolpath-convo: Turn.message_id (grouping key); derive_path writes it into conversation.append and emits token_usage once per group; extract_conversation reads it back.
- toolpath-claude: fills message_id from message.id; canonicalizes the IR (total only on the group's final turn) so trait-level consumers are correct by construction; dedups total_usage; the projector re-expands the group total and message.id onto every JSONL line of a split for wire fidelity.
- toolpath-codex: groups a round's assistant turns under its turn_id; accumulates token_count spend per round (last_token_usage preferred, else saturating delta of cumulative totals) onto the round's final assistant turn; the projector closes usage-bearing turns with a task_complete boundary so re-reads attribute spend correctly.
- toolpath-pi / toolpath-opencode: decode absent/all-zero wire counters as token_usage None ("spend unknown"), never Some(zeros).
- path-cli: bundles the v1.1.0 kind schema alongside v1.0.0; the cross-harness matrix asserts the order-preserving usage sequence instead of index-paired presence (which only ever passed because duplication blanketed every index).
- spec/site: v1.1.0 page with the accounting rules and forward-compat field reservation; erratum on the v1.0.0 page documenting historical duplication; format docs gain the don't-sum-per-line warning (claude) and cumulative-vs-round scoping (codex).

Verified against real sessions: zero duplicate usage tuples under Pathbase's jq check, and naive per-step sums exactly match the source ground truth summed once per distinct message.id.

Crates bumped (kind URI changes for every deriver): toolpath 0.7.0, toolpath-convo 0.11.0, toolpath-claude 0.12.0, toolpath-codex 0.6.0, toolpath-gemini 0.6.0, toolpath-opencode 0.5.0, toolpath-cursor 0.2.0, toolpath-pi 0.6.0, toolpath-git 0.6.0, toolpath-github 0.6.0, toolpath-dot 0.5.0, toolpath-md 0.7.0, path-cli 0.14.0, toolpath-cli 0.14.0.

Reported-by: Pathbase team (local/toolpath-usage-duplication.md)
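The pi/opencode decoding rule in this commit can be sketched in a few lines. `WireUsage` and `decode_usage` are hypothetical names, and the two-field record is a simplification of the real wire counters.

```rust
/// Hypothetical wire-level usage counters.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
struct WireUsage {
    input: u64,
    output: u64,
}

/// Absent or all-zero counters mean the spend is unknown, so they decode
/// to None rather than to a record claiming zero tokens were spent.
fn decode_usage(wire: Option<WireUsage>) -> Option<WireUsage> {
    wire.filter(|usage| *usage != WireUsage::default())
}
```

Decoding to `None` keeps a step with unknown spend out of the usage sequence entirely, instead of reporting a measured zero.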

github-actions · 2026-06-10T18:41:48Z

🔍 Preview deployed: https://9f0b0116.toolpath.pages.dev

akesling · 2026-06-10T18:56:18Z

Related to #105

Follows the v1.1.0 accounting fix with two things checking the data forced.

First, the "split lines repeat byte-identical usage" assumption was false: across every Claude session on disk, ~27% of split messages disagree, because output_tokens streams upward to a final total while input/cache stay constant. The Claude reader no longer trusts line order; it reduces each message_id run to the field-wise MAX total (never under-counts whatever the order) on the run's final turn.

Second, those cumulative snapshots ARE per-step data. Differencing consecutive output_tokens yields genuine per-block attribution. Codex is even cleaner: its token_count carries the cumulative total_token_usage, and differencing it (dedup-safe: Codex emits each event twice, so a repeated total is a 0 delta, where summing last_token_usage would double) gives per-step deltas aligned to the step each count follows.

New model (kind agent-coding-session v1.1.0, folded in while unreleased so no version churn): token_usage stays "the total for a message" (sum over a path = session total); a new optional attributed_token_usage holds "this step's own attributed spend" on its own key, so the total sum is unaffected. Total vs share is structural (the key), never positional. Remainder (group total minus sum attributed) is computed by consumers, never recorded. Both fields optional, so any producer can add attribution later with no further kind version.

- toolpath-convo: Turn.attributed_token_usage; derive writes it per step, extract reads it back; schema gains the optional field.
- toolpath-claude: field-wise-max group total; streamed output deltas (running-max differenced so the sum telescopes to the total) become attribution; projector reconstructs the cumulative per-line wire so attribution survives a round-trip.
- toolpath-codex: per-step deltas from differencing the cumulative total_token_usage; round total derived from the sum of its steps' attributions (one source of truth); projector emits a turn_context per group and a cumulative token_count after each step, so grouping and attribution round-trip.
- toolpath-pi / toolpath-opencode: all-zero wire usage decodes as None.
- cross-harness matrix: survival invariant compares the order-preserving usage sequence on common fields.

Verified on real data: Claude streamed sessions produce correct deltas that telescope to message totals; Codex sum attributed == sum token_usage == session ground truth and survives a full export-reimport round-trip.

Checking the on-disk data against authoritative docs overturned the per-step attribution assumption for Claude, and confirmed it for Codex.

Claude: the per-content-block `usage` values are cumulative STREAMING SNAPSHOTS, not per-block bills. The Anthropic streaming API seeds `output_tokens` near zero at `message_start` and reports the running cumulative total on each `message_delta`; Claude Code stamps each JSONL line with whatever snapshot was current at flush time. Across 1959 streamed message groups the final line is the max in 100% of cases, and 87% of non-final lines that contain real prose report <=5 output tokens, i.e. differencing the lines does NOT recover per-block costs. So:

- drop Claude `attributed_token_usage` entirely (it was fabricated from snapshot deltas); keep the field-wise-max message total, which the streaming semantics confirm is the right read.
- simplify the projector to re-expand that total onto every line of a split (the dominant real-Claude pattern); drop the now-unneeded streaming reconstruction.

Codex: per-step attribution is real and confirmed by OpenAI's own issue tracker (openai/codex #14489, #17539): `total_token_usage` is the session-cumulative counter, `last_token_usage` the per-call delta, and Codex re-emits a stale `last_token_usage` on repeated events, so summing it double-counts while differencing the cumulative is dedup-safe. That is exactly what `toolpath-codex` already does; left intact.

`attributed_token_usage` stays in the IR/schema/spec as an optional, provider-populated field (Codex populates it; Claude does not). No kind version change is needed, per the optional-field design.

Docs updated with the corrected semantics and citations: claude-code/usage.md (streaming-snapshot explanation, why no per-block attribution), codex.md (field definitions + the #14489/#17539 doubling trap), the v1.1.0 spec page and schema description, CHANGELOG, and CLAUDE.md.

Verified on real data: Claude per-step token_usage sums to the source ground truth (439429, max per message.id) with zero fabricated attribution; Codex sum attributed == sum token_usage == 14233.

…ng key)

The grouping key isn't always a message id. For Codex the group is a *round* (its turn_id), which contains several messages, so `message_id` overfit to Claude and was actively misleading there. It also collided namewise with the pre-existing provider-internal `message_id` fields (Claude's JSONL envelope, gemini, opencode), adding to the confusion.

`group_id` names the actual role: the identifier of the source accounting unit a step was derived from (a message for Claude, a round for Codex). The stored value is unchanged; only the field/key name and its documented meaning generalize. v1.1.0 is unreleased, so this is free.

Scope was surgical: only the IR `Turn` field and the `conversation.append` `group_id` structural key were renamed. The unrelated provider-internal `message_id` fields (Claude `ConversationEntry.message_id` and its event-data round-trip, gemini/opencode/cursor entry fields, opencode SQL columns, the jsonl-envelope doc) are left untouched.

- toolpath-convo: `Turn.group_id`; derive writes the `group_id` key, extract reads it.
- claude/codex/gemini/pi/opencode/cursor: Turn-field usages renamed (compiler-verified); provider envelope fields preserved.
- spec/schema/docs reworded to provider-neutral "group / accounting unit" framing (message for Claude, round for Codex); the "Message accounting" section is now "Group accounting"; ID stays capitalized in prose, `group_id` backticked as a symbol.

Verified: 1622 tests pass, clippy clean, site builds, schema valid. Real data. Claude: 1063 steps carry group_id, 0 attribution (correct); Codex: attribution telescopes to the session total (older sessions whose turn_context lacks a turn_id degrade to per-turn groups, still correct).
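To illustrate what `group_id` keys, here is a sketch of placing one total per group on the group's final turn. The `Turn` struct is a hypothetical single-number simplification, and the rule that a turn without a `group_id` forms a one-turn group follows the earlier commit's description of ungrouped steps.

```rust
/// Hypothetical, simplified IR turn.
#[derive(Debug, Clone, PartialEq, Eq)]
struct Turn {
    group_id: Option<String>,
    token_usage: Option<u64>,
}

/// Illustrative constructor used to keep call sites short.
fn turn(group_id: Option<&str>, token_usage: Option<u64>) -> Turn {
    Turn { group_id: group_id.map(str::to_string), token_usage }
}

/// For each consecutive run of turns sharing a group_id, keep a single
/// total (the max seen in the run) on the run's final turn.
fn canonicalize(turns: &mut [Turn]) {
    let mut start = 0;
    while start < turns.len() {
        let mut end = start + 1;
        // A turn without a group_id is its own one-turn group.
        if turns[start].group_id.is_some() {
            while end < turns.len() && turns[end].group_id == turns[start].group_id {
                end += 1;
            }
        }
        let total = turns[start..end].iter().filter_map(|t| t.token_usage).max();
        for t in &mut turns[start..end] {
            t.token_usage = None;
        }
        turns[end - 1].token_usage = total;
        start = end;
    }
}
```

The same loop serves a Claude message split across lines and a Codex round spanning several turns, which is the reason the key is named for its role rather than for one provider's unit.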

…ssage "A turn's group id" → "group ID", "a unique synthesized id" → "ID", "share that id as" → "ID", "gets a unique id" → "ID" (codex), and the claude test assertion "carry no id" → "carry no ID". Per the house rule: "ID" in prose; lowercase `id` only as a backticked literal symbol. Pre-existing `id` prose elsewhere in the tree (e.g. toolpath-convo extract/derive comments) is left alone — out of scope for this PR.

…er-count

Adds an optional `breakdowns` field to `TokenUsage`: a priced, informational decomposition of a top-level token class, keyed by class (e.g. "output") with an inner sub-class -> tokens map (e.g. {"output": {"reasoning": 243}}). Breakdowns are never summed into any total (the parent class already counts those tokens), so the session-total guarantee is untouched. Invariant: Σ(inner) ≤ parent; the field is omitted when empty (byte-identical to prior docs). Rides both `token_usage` and `attributed_token_usage`. Additive to the unreleased kind v1.1.0 (no new kind version, no version bumps).

Per provider:

- Gemini: `thoughts` (reasoning) was being dropped, under-counting generated tokens. Now folded into `output_tokens` (Google bills it as output) and recorded under breakdowns["output"]["reasoning"]; the projector un-folds it on the reverse path for a lossless round-trip, preserving the Some(0) vs absent distinction (Gemini-3 zero-reasoning vs Gemini-2.5 no-reasoning-concept).
- OpenCode: keeps folding reasoning into output; also records the slice.
- Codex: differences the cumulative `reasoning_output_tokens` (⊆ output, dedup-safe) into the breakdown per-step and per-round.
- Claude: none (its JSONL usage does not itemize thinking tokens).

Also includes deep-dive doc/comment corrections for Pi and Cursor token reporting (Cursor tokenCount reliability is unverified; Pi totalTokens is version-dependent and not read). Schema tokenUsage $def, CHANGELOG, CLAUDE.md, kind index.md, and the gemini/opencode/codex format docs updated. Full workspace green (1637 tests), clippy clean, examples validate, site builds 11 pages.
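The fold-and-record behavior described for Gemini, together with the Σ(inner) ≤ parent invariant, can be sketched as below. The struct layout and function names are assumptions for illustration; only the `breakdowns` shape (class -> sub-class -> tokens) comes from the commit message.

```rust
use std::collections::BTreeMap;

/// Hypothetical, reduced TokenUsage carrying only the output class.
#[derive(Debug, Clone, Default, PartialEq, Eq)]
struct TokenUsage {
    output_tokens: u64,
    /// class -> (sub-class -> tokens); informational, never summed.
    breakdowns: BTreeMap<String, BTreeMap<String, u64>>,
}

/// Fold reasoning tokens into output and record the slice. A reported
/// zero is kept as an explicit entry; an absent count records nothing.
fn fold_reasoning(visible_output: u64, thoughts: Option<u64>) -> TokenUsage {
    let mut usage = TokenUsage {
        output_tokens: visible_output + thoughts.unwrap_or(0),
        ..Default::default()
    };
    if let Some(reasoning) = thoughts {
        usage
            .breakdowns
            .entry("output".to_string())
            .or_default()
            .insert("reasoning".to_string(), reasoning);
    }
    usage
}

/// Invariant check: the output sub-classes never exceed the parent.
fn output_breakdown_is_valid(usage: &TokenUsage) -> bool {
    match usage.breakdowns.get("output") {
        Some(inner) => inner.values().sum::<u64>() <= usage.output_tokens,
        None => true,
    }
}
```

Since the reasoning tokens are already inside `output_tokens`, a consumer summing totals ignores `breakdowns` entirely; the map only explains how the parent number decomposes.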

`breakdowns` stores token counts keyed by arbitrary sub-class strings; it encodes no price, cost, or rate. "Priced" described the motivation (sub-classes like text vs image bill differently) but mischaracterized the data. Reworded to "decomposition of a top-level class into named sub-classes" across the field doc, schema $def, kind spec, CHANGELOG, and CLAUDE.md.

benbaarber

LGTM

akesling changed the title ~~fix: count token usage once per message; kind v1.1.0 specifies the rule~~ fix: token usage once per message + per-step attribution; kind v1.1.0 Jun 16, 2026

akesling changed the title ~~fix: token usage once per message + per-step attribution; kind v1.1.0~~ fix: token usage once per message (+ Codex per-step attribution); kind v1.1.0 Jun 16, 2026

akesling added 4 commits June 16, 2026 15:30

akesling requested a review from benbaarber June 18, 2026 14:46

akesling assigned benbaarber Jun 18, 2026

benbaarber approved these changes Jun 22, 2026


benbaarber assigned akesling Jun 22, 2026

akesling merged commit dca4604 into main Jun 22, 2026
2 checks passed

akesling deleted the fix/token-usage-once-per-message branch June 22, 2026 17:55

benbaarber mentioned this pull request Jun 23, 2026

Cross-harness context-compaction provenance (kind v1.2.0) #108

Open

akesling merged 7 commits into main from fix/token-usage-once-per-message
