fix: token usage once per message (+ Codex per-step attribution); kind v1.1.0 #106
Merged
Pathbase reported ~3x token over-counting on real Claude sessions:
Claude Code writes one JSONL line per content block of an assistant API
message, repeating the message-level usage on every line, and
toolpath-claude emitted one step per line each carrying the full usage —
while dropping the message.id that disambiguates them. Codex was worse
in a different way: cumulative session counters stamped per turn.
The contract, now specified by agent-coding-session v1.1.0 (and pinned
by PATH_KIND_AGENT_CODING_SESSION): token_usage always means "the total
for a message", verbatim from the source. Steps split from one message
share a message_id; the total sits on the group's final step; a step
without message_id is a one-step message. Summing token_usage over a
path's steps therefore equals the session totals. A separate field is
reserved for future per-step attribution (the distinction is structural,
not positional), so the summing rule is permanent.
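A minimal sketch of what the summing rule means for a consumer. The `Step` and `TokenUsage` shapes below are simplified, hypothetical stand-ins (not the toolpath types); only the `message_id` / `token_usage` keys come from the text above.

```rust
/// Simplified stand-in for a usage record; the field names are assumptions.
#[derive(Debug, Clone, Copy, Default, PartialEq, Eq)]
pub struct TokenUsage {
    pub input_tokens: u64,
    pub output_tokens: u64,
}

/// Hypothetical step shape carrying only the two keys the rule talks about.
#[derive(Debug, Clone)]
pub struct Step {
    /// Shared by every step split from one message; `None` is a one-step message.
    pub message_id: Option<String>,
    /// Present only on the final step of a group.
    pub token_usage: Option<TokenUsage>,
}

/// Session total under the v1.1.0 rule: a plain sum over steps, no dedup.
pub fn session_total(steps: &[Step]) -> TokenUsage {
    steps
        .iter()
        .filter_map(|step| step.token_usage)
        .fold(TokenUsage::default(), |acc, u| TokenUsage {
            input_tokens: acc.input_tokens + u.input_tokens,
            output_tokens: acc.output_tokens + u.output_tokens,
        })
}

/// One message split into three steps, followed by a one-step message.
pub fn example_steps() -> Vec<Step> {
    let step = |id: Option<&str>, usage: Option<(u64, u64)>| Step {
        message_id: id.map(str::to_string),
        token_usage: usage.map(|(i, o)| TokenUsage {
            input_tokens: i,
            output_tokens: o,
        }),
    };
    vec![
        step(Some("msg_1"), None),
        step(Some("msg_1"), None),
        step(Some("msg_1"), Some((100, 40))), // group total on the final step
        step(None, Some((50, 10))),           // no message_id: one-step message
    ]
}
```

Because the total appears once per group, `session_total` never needs to look at `message_id` at all.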
- toolpath-convo: Turn.message_id (grouping key); derive_path writes it
into conversation.append and emits token_usage once per group;
extract_conversation reads it back.
- toolpath-claude: fills message_id from message.id; canonicalizes the
IR (total only on the group's final turn) so trait-level consumers
are correct by construction; dedups total_usage; the projector
re-expands the group total and message.id onto every JSONL line of a
split for wire fidelity.
- toolpath-codex: groups a round's assistant turns under its turn_id;
accumulates token_count spend per round (last_token_usage preferred,
else saturating delta of cumulative totals) onto the round's final
assistant turn; the projector closes usage-bearing turns with a
task_complete boundary so re-reads attribute spend correctly.
- toolpath-pi / toolpath-opencode: decode absent/all-zero wire counters
as token_usage None ("spend unknown"), never Some(zeros).
- path-cli: bundles the v1.1.0 kind schema alongside v1.0.0; the
cross-harness matrix asserts the order-preserving usage sequence
instead of index-paired presence (which only ever passed because
duplication blanketed every index).
- spec/site: v1.1.0 page with the accounting rules and forward-compat
field reservation; erratum on the v1.0.0 page documenting historical
duplication; format docs gain the don't-sum-per-line warning (claude)
and cumulative-vs-round scoping (codex).
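The "spend unknown" rule for toolpath-pi / toolpath-opencode can be sketched as below. `WireCounters` is a hypothetical wire shape invented for illustration; the real decoders read provider-specific structures.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct TokenUsage {
    pub input_tokens: u64,
    pub output_tokens: u64,
}

/// Hypothetical wire counters: each field may be missing on the wire.
#[derive(Debug, Clone, Copy, Default)]
pub struct WireCounters {
    pub input: Option<u64>,
    pub output: Option<u64>,
}

/// Absent or all-zero counters decode as `None` ("spend unknown"),
/// never as `Some` of an all-zero usage.
pub fn decode_usage(wire: Option<WireCounters>) -> Option<TokenUsage> {
    let counters = wire?;
    let input = counters.input.unwrap_or(0);
    let output = counters.output.unwrap_or(0);
    if input == 0 && output == 0 {
        return None;
    }
    Some(TokenUsage {
        input_tokens: input,
        output_tokens: output,
    })
}
```

Keeping "unknown" distinct from "zero" means a consumer summing `token_usage` can tell a free step from a step whose cost was never reported.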
Verified against real sessions: zero duplicate usage tuples under
Pathbase's jq check, and naive per-step sums exactly match the source
ground truth summed once per distinct message.id.
Crates bumped (kind URI changes for every deriver): toolpath 0.7.0,
toolpath-convo 0.11.0, toolpath-claude 0.12.0, toolpath-codex 0.6.0,
toolpath-gemini 0.6.0, toolpath-opencode 0.5.0, toolpath-cursor 0.2.0,
toolpath-pi 0.6.0, toolpath-git 0.6.0, toolpath-github 0.6.0,
toolpath-dot 0.5.0, toolpath-md 0.7.0, path-cli 0.14.0,
toolpath-cli 0.14.0.
Reported-by: Pathbase team (local/toolpath-usage-duplication.md)
🔍 Preview deployed: https://9f0b0116.toolpath.pages.dev
Related to #105
Follows the v1.1.0 accounting fix with two things checking the data forced.
First, the "split lines repeat byte-identical usage" assumption was false: across every Claude session on disk, ~27% of split messages disagree (output_tokens streams upward to a final total while input/cache stay constant). The Claude reader no longer trusts line order; it reduces each message_id run to the field-wise MAX total (never under-counts whatever the order) on the run's final turn.
Second, those cumulative snapshots ARE per-step data. Differencing consecutive output_tokens yields genuine per-block attribution. Codex is even cleaner: its token_count carries the cumulative total_token_usage, and differencing it (dedup-safe: Codex emits each event twice, so a repeated total is a 0 delta, where summing last_token_usage would double) gives per-step deltas aligned to the step each count follows.
New model (kind agent-coding-session v1.1.0, folded in while unreleased so no version churn): token_usage stays "the total for a message" (sum over a path = session total); a new optional attributed_token_usage holds "this step's own attributed spend" on its own key, so the total sum is unaffected. Total vs share is structural (the key), never positional. Remainder (group total minus sum attributed) is computed by consumers, never recorded. Both fields optional, so any producer can add attribution later with no further kind version.
- toolpath-convo: Turn.attributed_token_usage; derive writes it per step, extract reads it back; schema gains the optional field.
- toolpath-claude: field-wise-max group total; streamed output deltas (running-max differenced so the sum telescopes to the total) become attribution; projector reconstructs the cumulative per-line wire so attribution survives a round-trip.
- toolpath-codex: per-step deltas from differencing the cumulative total_token_usage; round total derived from the sum of its steps' attributions (one source of truth); projector emits a turn_context per group and a cumulative token_count after each step, so grouping and attribution round-trip.
- toolpath-pi / toolpath-opencode: all-zero wire usage decodes as None.
- cross-harness matrix: survival invariant compares the order-preserving usage sequence on common fields.
Verified on real data: Claude streamed sessions produce correct deltas that telescope to message totals; Codex sum attributed == sum token_usage == session ground truth and survives a full export-reimport round-trip.
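The remainder rule (group total minus sum attributed, computed by consumers) might look like this on the consumer side. This is a sketch over a single token class; the function name is hypothetical and the saturating subtraction is an assumption, chosen so that inconsistent input yields 0 rather than a panic.

```rust
/// Unattributed remainder for one group and one token class.
///
/// `group_total` is the group's `token_usage` value; `attributed` holds each
/// step's optional `attributed_token_usage` value for the same class.
pub fn unattributed_remainder(group_total: u64, attributed: &[Option<u64>]) -> u64 {
    let attributed_sum: u64 = attributed.iter().map(|a| a.unwrap_or(0)).sum();
    // Assumption: clamp at zero if attribution ever exceeds the total.
    group_total.saturating_sub(attributed_sum)
}
```

For example, `unattributed_remainder(1000, &[Some(300), None, Some(450)])` evaluates to 250, and a group whose steps account for the whole total yields 0.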
Checking the on-disk data against authoritative docs overturned the
per-step attribution assumption for Claude, and confirmed it for Codex.
Claude — the per-content-block `usage` values are cumulative STREAMING
SNAPSHOTS, not per-block bills. The Anthropic streaming API seeds
`output_tokens` near zero at `message_start` and reports the running
cumulative total on each `message_delta`; Claude Code stamps each JSONL
line with whatever snapshot was current at flush time. Across 1959
streamed message groups the final line is the max in 100% of cases, and
87% of non-final lines that contain real prose report <=5 output tokens
— i.e. differencing the lines does NOT recover per-block costs. So:
- drop Claude `attributed_token_usage` entirely (it was fabricated
from snapshot deltas); keep the field-wise-max message total, which
the streaming semantics confirm is the right read.
- simplify the projector to re-expand that total onto every line of a
split (the dominant real-Claude pattern); drop the now-unneeded
streaming reconstruction.
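A minimal sketch of the field-wise-max read of one split message. The `Usage` struct is a simplified stand-in with assumed field names, not the toolpath-claude type; the point is only that the reduction is independent of line order.

```rust
/// Simplified per-line usage snapshot; field names are assumptions.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Usage {
    pub input_tokens: u64,
    pub output_tokens: u64,
    pub cache_read_input_tokens: u64,
}

/// Reduce every line of one message group to its field-wise maximum.
/// Returns `None` for an empty group.
pub fn field_wise_max(lines: &[Usage]) -> Option<Usage> {
    lines.iter().copied().reduce(|a, b| Usage {
        input_tokens: a.input_tokens.max(b.input_tokens),
        output_tokens: a.output_tokens.max(b.output_tokens),
        cache_read_input_tokens: a.cache_read_input_tokens.max(b.cache_read_input_tokens),
    })
}
```

Since each line is a cumulative snapshot, the maximum is the final total whichever line it was flushed on, so shuffling the lines does not change the result.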
Codex — per-step attribution is real and confirmed by OpenAI's own
issue tracker (openai/codex #14489, #17539): `total_token_usage` is the
session-cumulative counter, `last_token_usage` the per-call delta, and
Codex re-emits a stale `last_token_usage` on repeated events — so
summing it double-counts while differencing the cumulative is dedup-safe.
That is exactly what `toolpath-codex` already does; left intact.
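The doubling trap can be illustrated with a small sketch. `TokenCountEvent` is a hypothetical, flattened stand-in that collapses each counter to a single number; real Codex events carry structured usage objects. Resetting the baseline to the latest total is also an assumption of this sketch.

```rust
/// Hypothetical flattened `token_count` event.
#[derive(Debug, Clone, Copy)]
pub struct TokenCountEvent {
    /// Session-cumulative counter.
    pub total_token_usage: u64,
    /// Per-call delta, which may be re-emitted stale on a repeated event.
    pub last_token_usage: u64,
}

/// Per-step deltas from differencing the cumulative counter.
/// A repeated event has the same total, so it contributes a delta of 0.
pub fn per_step_deltas(events: &[TokenCountEvent]) -> Vec<u64> {
    let mut previous_total = 0u64;
    events
        .iter()
        .map(|event| {
            let delta = event.total_token_usage.saturating_sub(previous_total);
            previous_total = event.total_token_usage;
            delta
        })
        .collect()
}

/// The approach the text warns against: summing `last_token_usage` directly.
pub fn naive_sum_of_last(events: &[TokenCountEvent]) -> u64 {
    events.iter().map(|event| event.last_token_usage).sum()
}
```

With every event emitted twice, the differenced deltas still sum to the final cumulative total, while the naive sum comes out at double that.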
`attributed_token_usage` stays in the IR/schema/spec as an optional,
provider-populated field (Codex populates it; Claude does not) — no kind
version change needed, per the optional-field design.
Docs updated with the corrected semantics and citations: claude-code/
usage.md (streaming-snapshot explanation, why no per-block attribution),
codex.md (field definitions + the #14489/#17539 doubling trap), the
v1.1.0 spec page and schema description, CHANGELOG, and CLAUDE.md.
Verified on real data: Claude per-step token_usage sums to the source
ground truth (439429, max per message.id) with zero fabricated
attribution; Codex sum attributed == sum token_usage == 14233.
…ng key)
The grouping key isn't always a message id. For Codex the group is a *round* (its turn_id), which contains several messages, so `message_id` overfit to Claude and was actively misleading there. It also collided namewise with the pre-existing provider-internal `message_id` fields (Claude's JSONL envelope, gemini, opencode), adding to the confusion.
`group_id` names the actual role: the identifier of the source accounting unit a step was derived from (a message for Claude, a round for Codex). The stored value is unchanged; only the field/key name and its documented meaning generalize. v1.1.0 is unreleased, so this is free.
Scope was surgical: only the IR `Turn` field and the `conversation.append` `group_id` structural key were renamed. The unrelated provider-internal `message_id` fields (Claude `ConversationEntry.message_id` and its event-data round-trip, gemini/opencode/cursor entry fields, opencode SQL columns, the jsonl-envelope doc) are left untouched.
- toolpath-convo: `Turn.group_id`; derive writes the `group_id` key, extract reads it.
- claude/codex/gemini/pi/opencode/cursor: Turn-field usages renamed (compiler-verified); provider envelope fields preserved.
- spec/schema/docs reworded to provider-neutral "group / accounting unit" framing (message for Claude, round for Codex); the "Message accounting" section is now "Group accounting"; ID stays capitalized in prose, `group_id` backticked as a symbol.
Verified: 1622 tests pass, clippy clean, site builds, schema valid. Real data: Claude has 1063 steps carrying group_id and 0 attribution (correct); Codex attribution telescopes to the session total (older sessions whose turn_context lacks a turn_id degrade to per-turn groups, still correct).
…ssage
"A turn's group id" → "group ID", "a unique synthesized id" → "ID", "share that id as" → "ID", "gets a unique id" → "ID" (codex), and the claude test assertion "carry no id" → "carry no ID". Per the house rule: "ID" in prose; lowercase `id` only as a backticked literal symbol. Pre-existing `id` prose elsewhere in the tree (e.g. toolpath-convo extract/derive comments) is left alone, as it is out of scope for this PR.
…er-count
Adds an optional `breakdowns` field to `TokenUsage`: a priced,
informational decomposition of a top-level token class, keyed by class
(e.g. "output") with an inner sub-class -> tokens map (e.g.
{"output": {"reasoning": 243}}). Breakdowns are never summed into any
total -- the parent class already counts those tokens -- so the
session-total guarantee is untouched. Invariant: Σ(inner) ≤ parent;
the field is omitted when empty (byte-identical to prior docs). Rides
both `token_usage` and `attributed_token_usage`. Additive to the
unreleased kind v1.1.0 (no new kind version, no version bumps).
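A sketch of the `breakdowns` shape and its invariant, under stated assumptions: this `TokenUsage` has only two top-level classes, the constructor and helper names are hypothetical, and treating a breakdown keyed by an unknown class as invalid is a choice made here, not something the text specifies.

```rust
use std::collections::BTreeMap;

/// Simplified usage record with the optional decomposition map.
#[derive(Debug, Clone, Default, PartialEq, Eq)]
pub struct TokenUsage {
    pub input_tokens: u64,
    pub output_tokens: u64,
    /// class -> (sub-class -> tokens), e.g. {"output": {"reasoning": 243}}.
    pub breakdowns: BTreeMap<String, BTreeMap<String, u64>>,
}

impl TokenUsage {
    pub fn new(input_tokens: u64, output_tokens: u64) -> Self {
        Self { input_tokens, output_tokens, breakdowns: BTreeMap::new() }
    }

    /// Record a sub-class slice under a top-level class.
    pub fn with_breakdown(mut self, class: &str, sub_class: &str, tokens: u64) -> Self {
        self.breakdowns
            .entry(class.to_string())
            .or_default()
            .insert(sub_class.to_string(), tokens);
        self
    }

    fn class_total(&self, class: &str) -> Option<u64> {
        match class {
            "input" => Some(self.input_tokens),
            "output" => Some(self.output_tokens),
            _ => None,
        }
    }

    /// Total spend: breakdowns are never added, the parent already counts them.
    pub fn total(&self) -> u64 {
        self.input_tokens + self.output_tokens
    }

    /// Invariant from the text: the inner sum must not exceed the parent class.
    pub fn breakdowns_valid(&self) -> bool {
        self.breakdowns.iter().all(|(class, inner)| {
            match self.class_total(class) {
                Some(parent) => inner.values().sum::<u64>() <= parent,
                None => false, // assumption: unknown parent class is invalid
            }
        })
    }
}
```

Adding or removing a breakdown leaves `total()` unchanged, which is what keeps the session-total guarantee intact.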
Per provider:
- Gemini: `thoughts` (reasoning) was being dropped, under-counting
generated tokens. Now folded into `output_tokens` (Google bills it as
output) and recorded under breakdowns["output"]["reasoning"]; the
projector un-folds it on the reverse path for a lossless round-trip,
preserving the Some(0) vs absent distinction (Gemini-3 zero-reasoning
vs Gemini-2.5 no-reasoning-concept).
- OpenCode: keeps folding reasoning into output; also records the slice.
- Codex: differences the cumulative `reasoning_output_tokens` (⊆ output,
dedup-safe) into the breakdown per-step and per-round.
- Claude: none (its JSONL usage does not itemize thinking tokens).
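The Gemini fold and un-fold described above can be sketched as a pair of inverse functions. Both structs are hypothetical simplifications: `WireTokens` stands in for the session's token record and `IrUsage` flattens `breakdowns["output"]["reasoning"]` into a single optional field.

```rust
/// Hypothetical wire record: `thoughts` may be absent entirely.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct WireTokens {
    pub output: u64,
    pub thoughts: Option<u64>,
}

/// Hypothetical IR view: output already includes reasoning.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct IrUsage {
    pub output_tokens: u64,
    /// Stand-in for breakdowns["output"]["reasoning"]; `None` means no entry.
    pub reasoning: Option<u64>,
}

/// Reader direction: fold thoughts into output and record the slice.
pub fn fold(wire: WireTokens) -> IrUsage {
    IrUsage {
        output_tokens: wire.output + wire.thoughts.unwrap_or(0),
        reasoning: wire.thoughts,
    }
}

/// Projector direction: subtract the recorded slice to recover the wire value.
pub fn unfold(usage: IrUsage) -> WireTokens {
    WireTokens {
        output: usage.output_tokens.saturating_sub(usage.reasoning.unwrap_or(0)),
        thoughts: usage.reasoning,
    }
}
```

Carrying the slice as an `Option` is what lets `Some(0)` and an absent value survive the round-trip as distinct cases.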
Also includes deep-dive doc/comment corrections for Pi and Cursor token
reporting (Cursor tokenCount reliability is unverified; Pi totalTokens
is version-dependent and not read).
Schema tokenUsage $def, CHANGELOG, CLAUDE.md, kind index.md, and the
gemini/opencode/codex format docs updated. Full workspace green
(1637 tests), clippy clean, examples validate, site builds 11 pages.
`breakdowns` stores token counts keyed by arbitrary sub-class strings; it encodes no price, cost, or rate. "Priced" described the motivation (sub-classes like text vs image bill differently) but mischaracterized the data. Reworded to "decomposition of a top-level class into named sub-classes" across the field doc, schema $def, kind spec, CHANGELOG, and CLAUDE.md.
Situation
Derived documents over-counted token spend, and the kind spec was silent on how `token_usage` on steps relates to API-message accounting, so every consumer was left to guess, and the natural reading (sum per step) gave inflated numbers.
- `toolpath-claude` emitted one step per line, each carrying that line's `usage` (~3× output-token inflation on real sessions), and dropped the `message.id` that ties the lines back to one message.
- `toolpath-codex` stamped the cumulative session counter (`total_token_usage`) onto each assistant turn, so per-step sums grew quadratically. Codex also re-emits a stale `last_token_usage` on repeated `token_count` events, so any sum-the-deltas approach double-counts.
Investigation against real data (all 287 Claude sessions, all Codex sessions on disk) plus authoritative docs settled two questions the data alone couldn't:
- Claude: per-line `usage` is a cumulative streaming snapshot, not a per-block bill. The Anthropic streaming API seeds `output_tokens` near zero at `message_start` and reports the running cumulative total on each `message_delta`; Claude Code stamps each line with whatever snapshot was current at flush time. Across 1,959 streamed groups the final line is the max in 100% of cases, and 87% of non-final prose lines report ≤5 output tokens, so differencing the lines does not recover per-block costs.
- Codex: OpenAI's issue tracker (openai/codex #14489, #17539) confirms `total_token_usage` as cumulative and `last_token_usage` as the per-call delta, and documents the stale re-emission that makes summing `last_token_usage` over-count.
Resolution
The contract (kind `agent-coding-session` v1.1.0, both fields optional)
- `token_usage` = "the total for a group", on the group's final step. Σ `token_usage` over a path = session total. Verbatim from the source (for Claude, the field-wise max across the split: the final streaming snapshot, found without trusting line order).
- `attributed_token_usage` = "this step's own attributed spend", on its own key so the sum above is unaffected. Populated only where the source genuinely reports per-step spend.
Whether a number is a total or a share is structural (the key it's under), never positional. The unattributed remainder (group total − Σ attributed) is computed by consumers, never recorded.
The v1.0.0 page carries an erratum; `path p validate` bundles both schemas. `PATH_KIND_AGENT_CODING_SESSION` → v1.1.0 (old URI kept as `…_V1_0_0`).
Per provider
- `toolpath-convo`: `Turn.group_id` + `Turn.attributed_token_usage`; `derive_path` writes `token_usage` once per group and `attributed_token_usage` per step; `extract_conversation` reads both back. Schema gains the optional field.
- `toolpath-claude`: each `group_id` run reduced to the field-wise-max total on its final turn (never under-counts whatever the line order). No `attributed_token_usage`, because the per-line snapshots aren't per-block costs. `total_usage` deduped by group; the projector re-expands the total onto every line of a split (with the shared `message.id`).
- `toolpath-codex`: per-step deltas from differencing the cumulative `total_token_usage` (differencing is dedup-safe against the stale re-emission; summing `last_token_usage` would double). Each per-call delta becomes that step's `attributed_token_usage`; a round's total is the sum of its steps' attributions (one source of truth: total and shares can't drift). The projector emits a `turn_context` per group and a cumulative `token_count` after each step, so grouping and attribution round-trip.
- `toolpath-pi` / `toolpath-opencode`: absent/all-zero wire counters decode as `None` ("spend unknown"), not `Some(zeros)`.
Verification
- On real data, Claude per-step `token_usage` sums to the source ground truth (439,429 = max per `message.id`) with zero fabricated attribution; Codex `Σ attributed == Σ token_usage == 14233` and survives a full export→reimport round-trip.
- The format docs (`usage.md`, `codex.md`), the v1.1.0 spec page, and the schema carry the corrected semantics and citations.
Versioning
Minor bumps across all toolpath crates (the emitted kind URI changes for every deriver):
`toolpath` 0.7.0, `toolpath-convo` 0.11.0, `toolpath-claude` 0.12.0, `toolpath-codex` 0.6.0, `toolpath-gemini` 0.6.0, `toolpath-opencode` 0.5.0, `toolpath-cursor` 0.2.0, `toolpath-pi` 0.6.0, `toolpath-git` / `toolpath-github` 0.6.0, `toolpath-dot` 0.5.0, `toolpath-md` 0.7.0, `path-cli` / `toolpath-cli` 0.14.0.
Existing v1.0.0 documents remain over-counted; the version gate is the consumer signal: sum v1.1.0 naively, keep dedup heuristics only for the v1.0.0 backlog (where tuple-equality still cannot repair Codex-derived documents).