
fix: token usage once per message (+ Codex per-step attribution); kind v1.1.0 #106

Merged
akesling merged 7 commits into main from fix/token-usage-once-per-message
Jun 22, 2026


@akesling commented Jun 10, 2026


Situation

Derived documents over-counted token spend, and the kind spec was silent on how token_usage on steps relates to API-message accounting — so every consumer was left to guess, and the natural reading (sum per step) gave inflated numbers.

  • Claude: Claude Code writes one JSONL line per content block of an assistant API message. toolpath-claude emitted one step per line, each carrying that line's usage (~3× output-token inflation on real sessions), and dropped the message.id that ties the lines back to one message.
  • Codex: toolpath-codex stamped the cumulative session counter (total_token_usage) onto each assistant turn, so per-step sums grew quadratically. Codex also re-emits a stale last_token_usage on repeated token_count events, so any sum-the-deltas approach double-counts.

Investigation against real data (all 287 Claude sessions, all Codex sessions on disk) plus authoritative docs settled two questions the data alone couldn't:

  • Claude per-line usage is a cumulative streaming snapshot, not a per-block bill. The Anthropic streaming API seeds output_tokens near zero at message_start and reports the running cumulative total on each message_delta; Claude Code stamps each line with whatever snapshot was current at flush time. Across 1,959 streamed groups the final line is the max in 100% of cases, and 87% of non-final prose lines report ≤5 output tokens — so differencing the lines does not recover per-block costs.
  • Codex per-call deltas are real and provider-reported. OpenAI's own issue tracker (#14489, #17539) defines total_token_usage as cumulative and last_token_usage as the per-call delta, and documents the stale-re-emission that makes summing last_token_usage over-count.
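The Codex finding can be illustrated with a minimal sketch. The `Counts` and `TokenCountEvent` types below are hypothetical simplifications (two token classes only), not the actual toolpath-codex types, and the event values are made up for illustration. The sketch shows why differencing the cumulative `total_token_usage` tolerates a re-emitted event while summing `last_token_usage` does not.

```rust
/// Token counts for one report. Hypothetical, simplified to two classes.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub struct Counts {
    pub input: u64,
    pub output: u64,
}

impl Counts {
    pub fn plus(self, o: Counts) -> Counts {
        Counts { input: self.input + o.input, output: self.output + o.output }
    }

    pub fn minus_saturating(self, o: Counts) -> Counts {
        Counts {
            input: self.input.saturating_sub(o.input),
            output: self.output.saturating_sub(o.output),
        }
    }
}

/// One `token_count` event (assumed shape): the session-cumulative total
/// plus the provider's own per-call delta.
#[derive(Debug, Clone, Copy)]
pub struct TokenCountEvent {
    pub total_token_usage: Counts,
    pub last_token_usage: Counts,
}

/// Per-event spend obtained by differencing the cumulative total.
/// A re-emitted event repeats the previous total, so its delta is zero.
pub fn deltas_by_differencing(events: &[TokenCountEvent]) -> Vec<Counts> {
    let mut prev = Counts::default();
    let mut out = Vec::with_capacity(events.len());
    for e in events {
        out.push(e.total_token_usage.minus_saturating(prev));
        prev = e.total_token_usage;
    }
    out
}

/// The tempting alternative: add up `last_token_usage` as-is.
pub fn sum_of_last(events: &[TokenCountEvent]) -> Counts {
    events.iter().fold(Counts::default(), |acc, e| acc.plus(e.last_token_usage))
}

/// Sum a list of per-event deltas.
pub fn sum(counts: &[Counts]) -> Counts {
    counts.iter().fold(Counts::default(), |acc, c| acc.plus(*c))
}

/// Made-up events: the second one is a stale re-emission of the first.
pub fn sample_events() -> Vec<TokenCountEvent> {
    let first = TokenCountEvent {
        total_token_usage: Counts { input: 100, output: 20 },
        last_token_usage: Counts { input: 100, output: 20 },
    };
    let second_call = TokenCountEvent {
        total_token_usage: Counts { input: 250, output: 50 },
        last_token_usage: Counts { input: 150, output: 30 },
    };
    vec![first, first, second_call]
}
```

On the sample events, the differenced deltas sum back to the final cumulative total, while `sum_of_last` counts the first call twice.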

Resolution

The contract (kind agent-coding-session v1.1.0, both fields optional)

  • token_usage = "the total for a group", on the group's final step. Σ token_usage over a path = session total. Verbatim from the source (for Claude, the field-wise max across the split — the final streaming snapshot, found without trusting line order).
  • attributed_token_usage = "this step's own attributed spend", on its own key so the sum above is unaffected. Populated only where the source genuinely reports per-step spend. Whether a number is a total or a share is structural (the key it's under), never positional. The unattributed remainder (group total − Σ attributed) is computed by consumers, never recorded.
  • Optional ⇒ any producer can populate attribution now or later with no further kind version. Folded into the still-unreleased v1.1.0.
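From a consumer's side, the contract above might be read as in the following sketch. The `Step` struct is a hypothetical stand-in for a derived step (a single `u64` replaces the real multi-field usage object), and `step`, `session_total`, and `unattributed_remainder` are illustrative names, not part of any toolpath crate.

```rust
use std::collections::BTreeMap;

/// Hypothetical, flattened view of a derived step.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Step {
    pub group_id: Option<String>,
    /// Group total; present only on the group's final step.
    pub token_usage: Option<u64>,
    /// This step's own share, where the source reports one.
    pub attributed_token_usage: Option<u64>,
}

/// Convenience constructor for examples.
pub fn step(group: Option<&str>, total: Option<u64>, attributed: Option<u64>) -> Step {
    Step {
        group_id: group.map(str::to_string),
        token_usage: total,
        attributed_token_usage: attributed,
    }
}

/// Session total: a plain sum of `token_usage`, with no dedup needed
/// because each group total appears exactly once.
pub fn session_total(steps: &[Step]) -> u64 {
    steps.iter().filter_map(|s| s.token_usage).sum()
}

/// Per-group remainder: group total minus the sum of attributed shares.
/// Computed on the fly by the consumer; it is never stored in the document.
/// Steps without a `group_id` are skipped in this sketch.
pub fn unattributed_remainder(steps: &[Step]) -> BTreeMap<String, u64> {
    let mut totals: BTreeMap<String, u64> = BTreeMap::new();
    let mut attributed: BTreeMap<String, u64> = BTreeMap::new();
    for s in steps {
        let Some(g) = &s.group_id else { continue };
        if let Some(t) = s.token_usage {
            *totals.entry(g.clone()).or_default() += t;
        }
        if let Some(a) = s.attributed_token_usage {
            *attributed.entry(g.clone()).or_default() += a;
        }
    }
    totals
        .into_iter()
        .map(|(g, t)| {
            let a = attributed.get(&g).copied().unwrap_or(0);
            (g, t.saturating_sub(a))
        })
        .collect()
}
```

Because totals and shares live under different keys, `session_total` never has to inspect `attributed_token_usage`, which is the structural (not positional) distinction the contract describes.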

The v1.0.0 page carries an erratum; path p validate bundles both schemas. PATH_KIND_AGENT_CODING_SESSION → v1.1.0 (old URI kept as …_V1_0_0).

Per provider

  • toolpath-convo: Turn.group_id + Turn.attributed_token_usage; derive_path writes token_usage once per group and attributed_token_usage per step; extract_conversation reads both back. Schema gains the optional field.
  • toolpath-claude: each group_id run reduced to the field-wise-max total on its final turn (never under-counts whatever the line order). No attributed_token_usage — the per-line snapshots aren't per-block costs. total_usage deduped by group; the projector re-expands the total onto every line of a split (with the shared message.id).
  • toolpath-codex: per-step spend = increase in the cumulative total_token_usage (differencing is dedup-safe against the stale-re-emission; summing last_token_usage would double). Each per-call delta becomes that step's attributed_token_usage; a round's total is the sum of its steps' attributions (one source of truth — total and shares can't drift). The projector emits a turn_context per group and a cumulative token_count after each step, so grouping and attribution round-trip.
  • toolpath-pi / toolpath-opencode: all-zero wire usage counters decode as None ("spend unknown"), not Some(zeros).
  • cross-harness matrix: survival invariant compares the order-preserving usage sequence on common fields.
  • Gemini and Cursor audited clean (one message = one turn, per-message deltas).
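The toolpath-claude reduction described above can be sketched as follows. This is an illustration under assumptions, not the crate's code: `Usage`, `Turn`, and `canonicalize` are hypothetical names, and the three usage fields are a simplified subset. Each run of consecutive turns sharing a `group_id` is reduced to the field-wise max, placed on the run's final turn.

```rust
/// Hypothetical subset of per-line usage counters.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub struct Usage {
    pub input_tokens: u64,
    pub output_tokens: u64,
    pub cache_read_tokens: u64,
}

impl Usage {
    /// Field-wise maximum of two snapshots.
    pub fn field_max(self, other: Usage) -> Usage {
        Usage {
            input_tokens: self.input_tokens.max(other.input_tokens),
            output_tokens: self.output_tokens.max(other.output_tokens),
            cache_read_tokens: self.cache_read_tokens.max(other.cache_read_tokens),
        }
    }
}

/// Hypothetical, minimal turn: only the fields this sketch needs.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Turn {
    pub group_id: Option<String>,
    pub token_usage: Option<Usage>,
}

/// Reduce every run of consecutive turns with the same `group_id` to one
/// total on the run's final turn. Taking the max per field means the result
/// does not depend on the order in which the snapshots were written.
pub fn canonicalize(turns: &mut [Turn]) {
    let mut start = 0;
    while start < turns.len() {
        // A turn without a group_id forms a run of length one.
        let mut end = start + 1;
        if turns[start].group_id.is_some() {
            while end < turns.len() && turns[end].group_id == turns[start].group_id {
                end += 1;
            }
        }
        // A run that carried no usage at all stays `None`.
        let total = turns[start..end]
            .iter()
            .filter_map(|t| t.token_usage)
            .reduce(Usage::field_max);
        for t in &mut turns[start..end] {
            t.token_usage = None;
        }
        turns[end - 1].token_usage = total;
        start = end;
    }
}
```

After `canonicalize`, a naive sum over `token_usage` counts each group once, even when the largest snapshot was not on the last line of the split.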

Verification

  • 1,622 tests passing, clippy clean, site builds; new tests at each layer (derive grouping, extract round-trip, Claude max-total / out-of-order / repeated-total canonicalization with no fabricated attribution, projector total re-expansion, Codex deduped-delta attribution).
  • Real data: Claude per-step token_usage sums to the source ground truth (439,429 = max per message.id) with zero fabricated attribution; Codex Σ attributed == Σ token_usage == 14233 and survives a full export→reimport round-trip.
  • Conclusions corroborated against primary sources: the Anthropic streaming API reference (Claude) and openai/codex #14489 / #17539 (Codex). Format docs (usage.md, codex.md), the v1.1.0 spec page, and the schema carry the corrected semantics and citations.

Versioning

Minor bumps across all toolpath crates (the emitted kind URI changes for every deriver): toolpath 0.7.0, toolpath-convo 0.11.0, toolpath-claude 0.12.0, toolpath-codex 0.6.0, toolpath-gemini 0.6.0, toolpath-opencode 0.5.0, toolpath-cursor 0.2.0, toolpath-pi 0.6.0, toolpath-git/toolpath-github 0.6.0, toolpath-dot 0.5.0, toolpath-md 0.7.0, path-cli/toolpath-cli 0.14.0.

Existing v1.0.0 documents remain over-counted; the version gate is the consumer signal — sum v1.1.0 naively, keep dedup heuristics only for the v1.0.0 backlog (where tuple-equality still cannot repair Codex-derived documents).

Pathbase reported ~3x token over-counting on real Claude sessions:
Claude Code writes one JSONL line per content block of an assistant API
message, repeating the message-level usage on every line, and
toolpath-claude emitted one step per line each carrying the full usage —
while dropping the message.id that disambiguates them. Codex was worse
in a different way: cumulative session counters stamped per turn.

The contract, now specified by agent-coding-session v1.1.0 (and pinned
by PATH_KIND_AGENT_CODING_SESSION): token_usage always means "the total
for a message", verbatim from the source. Steps split from one message
share a message_id; the total sits on the group's final step; a step
without message_id is a one-step message. Summing token_usage over a
path's steps therefore equals the session totals. Future per-step
attribution is reserved for a separate field (a structural, not
positional, distinction), so the summing rule is permanent.

- toolpath-convo: Turn.message_id (grouping key); derive_path writes it
  into conversation.append and emits token_usage once per group;
  extract_conversation reads it back.
- toolpath-claude: fills message_id from message.id; canonicalizes the
  IR (total only on the group's final turn) so trait-level consumers
  are correct by construction; dedups total_usage; the projector
  re-expands the group total and message.id onto every JSONL line of a
  split for wire fidelity.
- toolpath-codex: groups a round's assistant turns under its turn_id;
  accumulates token_count spend per round (last_token_usage preferred,
  else saturating delta of cumulative totals) onto the round's final
  assistant turn; the projector closes usage-bearing turns with a
  task_complete boundary so re-reads attribute spend correctly.
- toolpath-pi / toolpath-opencode: decode absent/all-zero wire counters
  as token_usage None ("spend unknown"), never Some(zeros).
- path-cli: bundles the v1.1.0 kind schema alongside v1.0.0; the
  cross-harness matrix asserts the order-preserving usage sequence
  instead of index-paired presence (which only ever passed because
  duplication blanketed every index).
- spec/site: v1.1.0 page with the accounting rules and forward-compat
  field reservation; erratum on the v1.0.0 page documenting historical
  duplication; format docs gain the don't-sum-per-line warning (claude)
  and cumulative-vs-round scoping (codex).
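
The toolpath-pi / toolpath-opencode rule in the list above (absent or all-zero counters mean "spend unknown") could look roughly like this. `WireUsage`, `TokenUsage`, and `decode_usage` are hypothetical names with a simplified two-field shape; the real decoders are not shown in this PR text.

```rust
/// Hypothetical wire-level counters as read from a session file.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct WireUsage {
    pub input: u64,
    pub output: u64,
}

/// Hypothetical decoded usage.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct TokenUsage {
    pub input_tokens: u64,
    pub output_tokens: u64,
}

/// Decode wire counters. Absent or all-zero counters carry no information
/// about spend, so they decode to `None` rather than `Some` of zeros.
pub fn decode_usage(wire: Option<WireUsage>) -> Option<TokenUsage> {
    let w = wire?;
    if w.input == 0 && w.output == 0 {
        return None;
    }
    Some(TokenUsage { input_tokens: w.input, output_tokens: w.output })
}
```

Returning `None` keeps "unknown" distinguishable from a genuinely reported spend, so downstream sums skip the step instead of recording a zero.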

Verified against real sessions: zero duplicate usage tuples under
Pathbase's jq check, and naive per-step sums exactly match the source
ground truth summed once per distinct message.id.

Crates bumped (kind URI changes for every deriver): toolpath 0.7.0,
toolpath-convo 0.11.0, toolpath-claude 0.12.0, toolpath-codex 0.6.0,
toolpath-gemini 0.6.0, toolpath-opencode 0.5.0, toolpath-cursor 0.2.0,
toolpath-pi 0.6.0, toolpath-git 0.6.0, toolpath-github 0.6.0,
toolpath-dot 0.5.0, toolpath-md 0.7.0, path-cli 0.14.0,
toolpath-cli 0.14.0.

Reported-by: Pathbase team (local/toolpath-usage-duplication.md)

github-actions Bot commented Jun 10, 2026


🔍 Preview deployed: https://9f0b0116.toolpath.pages.dev

@akesling


Related to #105

Follows the v1.1.0 accounting fix with two changes that checking the data
forced. First, the "split lines repeat byte-identical usage" assumption
was false: across every Claude session on disk, ~27% of split messages
disagree — output_tokens streams upward to a final total while
input/cache stay constant. The Claude reader no longer trusts line
order; it reduces each message_id run to the field-wise MAX total (never
under-counts whatever the order) on the run's final turn.

Second, those cumulative snapshots ARE per-step data. Differencing
consecutive output_tokens yields genuine per-block attribution. Codex is
even cleaner: its token_count carries the cumulative total_token_usage,
and differencing it (dedup-safe — Codex emits each event twice, so a
repeated total is a 0 delta, where summing last_token_usage would
double) gives per-step deltas aligned to the step each count follows.

New model (kind agent-coding-session v1.1.0, folded in while unreleased
so no version churn): token_usage stays "the total for a message" (sum
over a path = session total); a new optional attributed_token_usage
holds "this step's own attributed spend" on its own key, so the total
sum is unaffected. Total vs share is structural (the key), never
positional. Remainder (group total minus sum attributed) is computed by
consumers, never recorded. Both fields optional, so any producer can add
attribution later with no further kind version.

- toolpath-convo: Turn.attributed_token_usage; derive writes it per
  step, extract reads it back; schema gains the optional field.
- toolpath-claude: field-wise-max group total; streamed output deltas
  (running-max differenced so the sum telescopes to the total) become
  attribution; projector reconstructs the cumulative per-line wire so
  attribution survives a round-trip.
- toolpath-codex: per-step deltas from differencing the cumulative
  total_token_usage; round total derived from the sum of its steps'
  attributions (one source of truth); projector emits a turn_context per
  group and a cumulative token_count after each step, so grouping and
  attribution round-trip.
- toolpath-pi / toolpath-opencode: all-zero wire usage decodes as None.
- cross-harness matrix: survival invariant compares the order-preserving
  usage sequence on common fields.

Verified on real data: Claude streamed sessions produce correct deltas
that telescope to message totals; Codex sum attributed == sum token_usage
== session ground truth and survives a full export-reimport round-trip.
@akesling changed the title from "fix: count token usage once per message; kind v1.1.0 specifies the rule" to "fix: token usage once per message + per-step attribution; kind v1.1.0" on Jun 16, 2026
Checking the on-disk data against authoritative docs overturned the
per-step attribution assumption for Claude, and confirmed it for Codex.

Claude — the per-content-block `usage` values are cumulative STREAMING
SNAPSHOTS, not per-block bills. The Anthropic streaming API seeds
`output_tokens` near zero at `message_start` and reports the running
cumulative total on each `message_delta`; Claude Code stamps each JSONL
line with whatever snapshot was current at flush time. Across 1959
streamed message groups the final line is the max in 100% of cases, and
87% of non-final lines that contain real prose report <=5 output tokens
— i.e. differencing the lines does NOT recover per-block costs. So:
  - drop Claude `attributed_token_usage` entirely (it was fabricated
    from snapshot deltas); keep the field-wise-max message total, which
    the streaming semantics confirm is the right read.
  - simplify the projector to re-expand that total onto every line of a
    split (the dominant real-Claude pattern); drop the now-unneeded
    streaming reconstruction.

Codex — per-step attribution is real and confirmed by OpenAI's own
issue tracker (openai/codex #14489, #17539): `total_token_usage` is the
session-cumulative counter, `last_token_usage` the per-call delta, and
Codex re-emits a stale `last_token_usage` on repeated events — so
summing it double-counts while differencing the cumulative is dedup-safe.
That is exactly what `toolpath-codex` already does; left intact.

`attributed_token_usage` stays in the IR/schema/spec as an optional,
provider-populated field (Codex populates it; Claude does not) — no kind
version change needed, per the optional-field design.

Docs updated with the corrected semantics and citations: claude-code/
usage.md (streaming-snapshot explanation, why no per-block attribution),
codex.md (field definitions + the #14489/#17539 doubling trap), the
v1.1.0 spec page and schema description, CHANGELOG, and CLAUDE.md.

Verified on real data: Claude per-step token_usage sums to the source
ground truth (439429, max per message.id) with zero fabricated
attribution; Codex sum attributed == sum token_usage == 14233.
@akesling changed the title from "fix: token usage once per message + per-step attribution; kind v1.1.0" to "fix: token usage once per message (+ Codex per-step attribution); kind v1.1.0" on Jun 16, 2026
akesling added 4 commits June 16, 2026 15:30
…ng key)

The grouping key isn't always a message id. For Codex the group is a
*round* (its turn_id), which contains several messages — so `message_id`
overfit to Claude and was actively misleading there. It also collided
namewise with the pre-existing provider-internal `message_id` fields
(Claude's JSONL envelope, gemini, opencode), adding to the confusion.
`group_id` names the actual role: the identifier of the source
accounting unit a step was derived from (a message for Claude, a round
for Codex). The stored value is unchanged — only the field/key name and
its documented meaning generalize. v1.1.0 is unreleased, so this is free.

Scope was surgical: only the IR `Turn` field and the `conversation.append`
`group_id` structural key were renamed. The unrelated provider-internal
`message_id` fields (Claude `ConversationEntry.message_id` and its
event-data round-trip, gemini/opencode/cursor entry fields, opencode SQL
columns, the jsonl-envelope doc) are left untouched.

- toolpath-convo: `Turn.group_id`; derive writes the `group_id` key,
  extract reads it.
- claude/codex/gemini/pi/opencode/cursor: Turn-field usages renamed
  (compiler-verified); provider envelope fields preserved.
- spec/schema/docs reworded to provider-neutral "group / accounting
  unit" framing (message for Claude, round for Codex); the "Message
  accounting" section is now "Group accounting"; ID stays capitalized in
  prose, `group_id` backticked as a symbol.

Verified: 1622 tests pass, clippy clean, site builds, schema valid.
Real data — Claude: 1063 steps carry group_id, 0 attribution (correct);
Codex: attribution telescopes to the session total (older sessions whose
turn_context lacks a turn_id degrade to per-turn groups, still correct).
…ssage

"A turn's group id" → "group ID", "a unique synthesized id" → "ID",
"share that id as" → "ID", "gets a unique id" → "ID" (codex), and the
claude test assertion "carry no id" → "carry no ID". Per the house rule:
"ID" in prose; lowercase `id` only as a backticked literal symbol.

Pre-existing `id` prose elsewhere in the tree (e.g. toolpath-convo
extract/derive comments) is left alone — out of scope for this PR.
…er-count

Adds an optional `breakdowns` field to `TokenUsage`: a priced,
informational decomposition of a top-level token class, keyed by class
(e.g. "output") with an inner sub-class -> tokens map (e.g.
{"output": {"reasoning": 243}}). Breakdowns are never summed into any
total -- the parent class already counts those tokens -- so the
session-total guarantee is untouched. Invariant: Σ(inner) ≤ parent;
the field is omitted when empty (byte-identical to prior docs). Rides
both `token_usage` and `attributed_token_usage`. Additive to the
unreleased kind v1.1.0 (no new kind version, no version bumps).
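
A minimal sketch of the `breakdowns` rules described above, assuming a simplified `TokenUsage` with two top-level classes and the class keys "input" and "output". The names `usage_with_reasoning`, `total`, and `breakdowns_consistent` are hypothetical helpers for illustration only.

```rust
use std::collections::BTreeMap;

/// Hypothetical, simplified usage record with an optional decomposition.
#[derive(Debug, Clone, PartialEq, Eq, Default)]
pub struct TokenUsage {
    pub input_tokens: u64,
    pub output_tokens: u64,
    /// class -> (sub-class -> tokens), e.g. {"output": {"reasoning": 243}}.
    pub breakdowns: BTreeMap<String, BTreeMap<String, u64>>,
}

/// Build a usage record whose output class itemizes a reasoning slice.
pub fn usage_with_reasoning(input: u64, output: u64, reasoning: u64) -> TokenUsage {
    let mut usage = TokenUsage { input_tokens: input, output_tokens: output, ..Default::default() };
    usage
        .breakdowns
        .entry("output".to_string())
        .or_default()
        .insert("reasoning".to_string(), reasoning);
    usage
}

/// Total tokens: only the top-level classes. Breakdowns are never added,
/// because the parent class already counts those tokens.
pub fn total(u: &TokenUsage) -> u64 {
    u.input_tokens + u.output_tokens
}

/// Check the invariant sum(inner) <= parent for every decomposed class.
/// Treating an unknown class key as inconsistent is a choice of this sketch.
pub fn breakdowns_consistent(u: &TokenUsage) -> bool {
    u.breakdowns.iter().all(|(class, parts)| {
        let parent = match class.as_str() {
            "input" => u.input_tokens,
            "output" => u.output_tokens,
            _ => return false,
        };
        parts.values().sum::<u64>() <= parent
    })
}
```

Note that `total` returns the same value whether or not a breakdown is present, which is why the session-total guarantee is unaffected.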

Per provider:
- Gemini: `thoughts` (reasoning) was being dropped, under-counting
  generated tokens. Now folded into `output_tokens` (Google bills it as
  output) and recorded under breakdowns["output"]["reasoning"]; the
  projector un-folds it on the reverse path for a lossless round-trip,
  preserving the Some(0) vs absent distinction (Gemini-3 zero-reasoning
  vs Gemini-2.5 no-reasoning-concept).
- OpenCode: keeps folding reasoning into output; also records the slice.
- Codex: differences the cumulative `reasoning_output_tokens` (⊆ output,
  dedup-safe) into the breakdown per-step and per-round.
- Claude: none (its JSONL usage does not itemize thinking tokens).
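
The Gemini fold and un-fold in the list above can be sketched like this. `WireTokens` is a hypothetical two-field stand-in for the Gemini wire counters (only `thoughts` is named in this PR; `output` is an assumed name), and `fold` / `unfold` are illustrative, not the crate's functions.

```rust
use std::collections::BTreeMap;

/// Hypothetical wire counters: `thoughts` is absent when the source has no
/// reasoning concept and `Some(0)` when it reports zero reasoning tokens.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct WireTokens {
    pub output: u64,
    pub thoughts: Option<u64>,
}

/// Hypothetical decoded usage, reduced to the output class.
#[derive(Debug, Clone, PartialEq, Eq, Default)]
pub struct TokenUsage {
    pub output_tokens: u64,
    pub breakdowns: BTreeMap<String, BTreeMap<String, u64>>,
}

/// Forward path: count reasoning as output and record the slice.
pub fn fold(wire: &WireTokens) -> TokenUsage {
    let mut usage = TokenUsage {
        output_tokens: wire.output + wire.thoughts.unwrap_or(0),
        breakdowns: BTreeMap::new(),
    };
    if let Some(reasoning) = wire.thoughts {
        // Recorded even when zero, so Some(0) survives the round-trip.
        usage
            .breakdowns
            .entry("output".to_string())
            .or_default()
            .insert("reasoning".to_string(), reasoning);
    }
    usage
}

/// Reverse path: subtract the recorded slice to recover the wire values.
pub fn unfold(usage: &TokenUsage) -> WireTokens {
    let thoughts = usage
        .breakdowns
        .get("output")
        .and_then(|parts| parts.get("reasoning"))
        .copied();
    WireTokens {
        output: usage.output_tokens.saturating_sub(thoughts.unwrap_or(0)),
        thoughts,
    }
}
```

The presence or absence of the "reasoning" key is what carries the `Some(0)` versus absent distinction through the round-trip.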

Also includes deep-dive doc/comment corrections for Pi and Cursor token
reporting (Cursor tokenCount reliability is unverified; Pi totalTokens
is version-dependent and not read).

Schema tokenUsage $def, CHANGELOG, CLAUDE.md, kind index.md, and the
gemini/opencode/codex format docs updated. Full workspace green
(1637 tests), clippy clean, examples validate, site builds 11 pages.

`breakdowns` stores token counts keyed by arbitrary sub-class strings;
it encodes no price, cost, or rate. "Priced" described the motivation
(sub-classes like text vs image bill differently) but mischaracterized
the data. Reworded to "decomposition of a top-level class into named
sub-classes" across the field doc, schema $def, kind spec, CHANGELOG,
and CLAUDE.md.

@benbaarber left a comment


LGTM

@akesling merged commit dca4604 into main Jun 22, 2026
2 checks passed
@akesling deleted the fix/token-usage-once-per-message branch June 22, 2026 17:55