Merged
3 changes: 3 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
@@ -10,3 +10,6 @@ site/wasm/
.claude/worktrees/
# Synthetic benchmark fixtures — generate locally via gen_synthetic_path.
/bench/fixtures/

# macOS
.DS_Store
108 changes: 108 additions & 0 deletions CHANGELOG.md
@@ -2,6 +2,114 @@

All notable changes to the Toolpath workspace are documented here.

## Token usage: once per message, with per-step attribution + kind v1.1.0 — 2026-06-17

Fixes token over-counting in derived documents (~3× output-token
inflation on real Claude sessions, unbounded on Codex) and adds per-step
token attribution where the source genuinely reports it (Codex). The
release covers two over-counting bugs, one spec gap, and a capability
that the corrected reads make possible. Verified against every Claude
and Codex session on disk, and cross-checked against the Anthropic
streaming API reference and OpenAI's codex issue tracker.

- **Claude**: Claude Code writes one JSONL line per content block of an
assistant API message, repeating the message-level `usage` on every
line. `toolpath-claude` emitted one step per line, each carrying the
full usage — so summing `token_usage` per step over-counted by the
block count, and the disambiguating `message.id` was dropped.
- **Codex**: `toolpath-codex` stamped the *cumulative* session counter
(`total_token_usage`) onto each assistant turn instead of per-step
spend, so per-step sums grew quadratically.
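
The Codex failure mode above can be reproduced with plain arithmetic. The sketch below is illustrative only: `naive_sum` and `per_step_deltas` are hypothetical helpers, not functions from `toolpath-codex`, and a single `u64` stands in for the full cumulative counter.

```rust
/// What the buggy derivation effectively did: treat each cumulative
/// snapshot as if it were that step's own spend, then add them up.
fn naive_sum(cumulative: &[u64]) -> u64 {
    cumulative.iter().sum()
}

/// Per-step spend as the increase of the cumulative counter since the
/// previous observation. A repeated (stale) snapshot contributes zero.
fn per_step_deltas(cumulative: &[u64]) -> Vec<u64> {
    let mut prev = 0u64;
    let mut deltas = Vec::with_capacity(cumulative.len());
    for &count in cumulative {
        // saturating_sub guards against a counter that moves backwards.
        deltas.push(count.saturating_sub(prev));
        prev = prev.max(count);
    }
    deltas
}
```

For the snapshots `[100, 250, 250, 400]` the session really spent 400 tokens. Summing the stamped snapshots reports 1000, while the deltas `[100, 150, 0, 150]` add back up to 400.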

Core model (kind `agent-coding-session` **v1.1.0**; both usage keys are
optional, so any producer can populate per-step attribution later
without a further kind version bump):

- `token_usage` always means **the total for a message**, on the
group's final step (`Σ token_usage` over a path = session total).
- `attributed_token_usage` (new) is **this step's own attributed
spend**, on its own key so the sum above is unaffected. Whether a
number is a total or a share is structural (the key), never
positional. The unattributed remainder
(`group token_usage − Σ attributed`) is computed by consumers, never
recorded — stored values stay verbatim source observations.
- `breakdowns` (new, optional) is a **decomposition of a top-level
class into named sub-classes** — keyed by the class being broken down (e.g.
`"output"`), inner map sub-class → tokens (e.g.
`{"output": {"reasoning": 243}}`). It is **informational and never summed into
any total** — the parent class already counts those tokens — so the
session-total guarantee is untouched. Invariant: `Σ(inner) ≤` the
parent class's value; the field is omitted when empty. It rides both
`token_usage` and `attributed_token_usage`.
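
A minimal sketch of how a consumer might read these three rules. `Usage`, `Step`, and every function here are simplified, hypothetical stand-ins (the real `toolpath_convo::TokenUsage` carries more fields), and only the `output` class is modeled.

```rust
use std::collections::BTreeMap;

/// Simplified, hypothetical usage record.
#[derive(Debug, Clone, Default, PartialEq)]
struct Usage {
    output: u64,
    /// class -> (sub-class -> tokens); informational, never summed into totals.
    breakdowns: BTreeMap<String, BTreeMap<String, u64>>,
}

/// Simplified, hypothetical step carrying the two optional usage keys.
#[derive(Debug, Clone, Default)]
struct Step {
    token_usage: Option<Usage>,
    attributed_token_usage: Option<Usage>,
}

/// Build a usage value, optionally recording reasoning tokens under
/// `breakdowns["output"]["reasoning"]`.
fn usage(output: u64, reasoning: Option<u64>) -> Usage {
    let mut breakdowns = BTreeMap::new();
    if let Some(r) = reasoning {
        let inner = BTreeMap::from([("reasoning".to_string(), r)]);
        breakdowns.insert("output".to_string(), inner);
    }
    Usage { output, breakdowns }
}

/// Sum of `token_usage` over the steps: message totals only.
fn total_output(steps: &[Step]) -> u64 {
    steps
        .iter()
        .filter_map(|s| s.token_usage.as_ref())
        .map(|u| u.output)
        .sum()
}

/// Unattributed remainder for one group: computed on read, never stored.
fn unattributed_output(group: &[Step]) -> u64 {
    let attributed: u64 = group
        .iter()
        .filter_map(|s| s.attributed_token_usage.as_ref())
        .map(|u| u.output)
        .sum();
    total_output(group).saturating_sub(attributed)
}

/// Invariant check: the sub-classes of "output" never exceed "output".
fn breakdown_holds(u: &Usage) -> bool {
    u.breakdowns
        .get("output")
        .map_or(true, |inner| inner.values().sum::<u64>() <= u.output)
}
```

Because the total and the shares live on separate keys, `total_output` never has to guess whether a number is a total or a share, and the breakdown values are only ever checked, never added.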

Changes:

- `toolpath_convo::TokenUsage` gains `breakdowns`
(`BTreeMap<class, BTreeMap<sub-class, tokens>>`); the kind
`tokenUsage` `$def` gains a matching optional `breakdowns` property.
- **Gemini under-count FIX**: Gemini reports `thoughts` (reasoning) as
an additive sibling of `output_tokens` that the derivation was
**dropping** — so Gemini output totals were under-counted by the
reasoning spend. `thoughts` is now **folded into `output_tokens`**
(correcting the total) *and* recorded under
`breakdowns["output"]["reasoning"]`; the projector **un-folds** it on
the reverse path for a lossless round-trip (`Some(0)` is preserved as
a real Gemini-3 zero-reasoning signal, not collapsed to absent).
- **OpenCode**: continues folding `reasoning` into `output_tokens`, and
now also records it under `breakdowns["output"]["reasoning"]`.
- **Codex**: `reasoning_output_tokens` (a subset of `output_tokens`,
cumulative → differenced like the other counters) is surfaced under
`breakdowns["output"]["reasoning"]` on both the per-step
`attributed_token_usage` and the per-round `token_usage`.
- **Claude**: records no breakdown — its JSONL `usage` does not itemize
thinking tokens.
- `toolpath_convo::Turn` gains `group_id` (grouping key) and
`attributed_token_usage`. `derive_path` writes `token_usage` once per
`group_id` group and `attributed_token_usage` on each step that has
it; `extract_conversation` reads both back.
- `toolpath-claude`: a split message's lines carry `message.usage` as a
**cumulative streaming snapshot**, not a per-line bill — per the
Anthropic streaming API, `message_start` seeds `output_tokens` near
zero and each `message_delta` reports the running cumulative total
(confirmed across every session sampled: input/cache constant, output
climbing to the final-line total; ~27% of multi-line messages vary).
Each `group_id` run is reduced to the **field-wise maximum** total
(never under-counts whatever the line order) on its final turn. The
intermediate snapshots are flush-time artifacts, *not* per-block costs
(a real prose block routinely shows `output_tokens: 1`), so Claude
emits **no** `attributed_token_usage`. `total_usage` is deduped by
group; the projector re-expands the total onto every line of a split.
- `toolpath-codex`: per-step spend is the increase in the cumulative
`total_token_usage` since the previous count — **differencing the
cumulative is dedup-safe**, where summing `last_token_usage` would
double-count because Codex re-emits a stale `last_token_usage` on
repeated `token_count` events (a documented trap: openai/codex #14489,
#17539). Each per-call delta is attributed to the step it follows as
`attributed_token_usage`; a round's `token_usage` total is the sum of
its steps' attributions (one source of truth — total and shares cannot
drift). The projector emits a `turn_context` per group and a cumulative
`token_count` after each step, so grouping and attribution survive the
round-trip.
- `toolpath-pi` and `toolpath-opencode` decode absent/all-zero wire
usage counters as `token_usage: None` ("spend unknown") instead of
`Some(zeros)` — their wires require usage fields, which
foreign-source projections zero-fill.
- `PATH_KIND_AGENT_CODING_SESSION` now points at v1.1.0;
`PATH_KIND_AGENT_CODING_SESSION_V1_0_0` names the old URI.
`path p validate` bundles both schemas. The v1.0.0 spec page gains an erratum
documenting the historical duplication (consumers of v1.0.0 documents
still need dedup heuristics; the byte-identical-tuple heuristic does
not repair Codex documents).
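
The field-wise-maximum reduction described in the `toolpath-claude` item can be sketched as follows. This assumes a two-field usage record and consecutive lines per group; `Usage`, `Line`, `line`, and `group_totals` are hypothetical names for illustration, not the crate's API.

```rust
/// Simplified, hypothetical usage snapshot (two of the real counters).
#[derive(Debug, Clone, Copy, Default, PartialEq)]
struct Usage {
    input_tokens: u64,
    output_tokens: u64,
}

/// One JSONL line of a split assistant message, keyed by its group id.
#[derive(Debug, Clone)]
struct Line {
    group_id: String,
    usage: Usage,
}

/// Convenience constructor for the example.
fn line(group_id: &str, input_tokens: u64, output_tokens: u64) -> Line {
    Line {
        group_id: group_id.to_string(),
        usage: Usage { input_tokens, output_tokens },
    }
}

/// Reduce each run of consecutive lines sharing a `group_id` to a single
/// total, taking the maximum of every field so that the result does not
/// depend on the order in which the snapshots were written.
fn group_totals(lines: &[Line]) -> Vec<(String, Usage)> {
    let mut totals: Vec<(String, Usage)> = Vec::new();
    for l in lines {
        let same_group = totals.last().map_or(false, |(id, _)| *id == l.group_id);
        if same_group {
            let (_, total) = totals.last_mut().expect("non-empty when same_group is true");
            total.input_tokens = total.input_tokens.max(l.usage.input_tokens);
            total.output_tokens = total.output_tokens.max(l.usage.output_tokens);
        } else {
            totals.push((l.group_id.clone(), l.usage));
        }
    }
    totals
}
```

With snapshots of 1, 1, and 120 output tokens for one message, the group total is 120 rather than the 122 a per-line sum would give, and a group whose largest snapshot happens to come first still resolves to that largest value.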

Crates bumped (every crate that depends on `toolpath`, matching the
domain-rename precedent since the emitted kind URI changes): `toolpath`
0.7.0, `toolpath-convo` 0.11.0, `toolpath-git` 0.6.0, `toolpath-github`
0.6.0, `toolpath-claude` 0.12.0, `toolpath-gemini` 0.6.0,
`toolpath-codex` 0.6.0, `toolpath-opencode` 0.5.0, `toolpath-cursor`
0.2.0, `toolpath-pi` 0.6.0, `toolpath-dot` 0.5.0, `toolpath-md` 0.7.0,
`path-cli` 0.14.0, `toolpath-cli` 0.14.0. `pathbase-client` is
unaffected.

## toolpath-claude 0.11.1 + path-cli 0.13.1 + toolpath-cli 0.13.1: derive `project_path` from the file's parent directory — 2026-06-09

`ConversationReader::read_conversation_metadata` used to set
26 changes: 14 additions & 12 deletions CLAUDE.md
@@ -175,18 +175,18 @@ the server publishes that operation.

Tests live alongside the code (`#[cfg(test)] mod tests`), plus `path-cli` has integration tests in `tests/`. Per-crate counts:

- `toolpath`: 32 unit + 9 doc tests (serde roundtrip, builders, query)
- `toolpath-convo`: 58 unit + 1 doc test (types, enrichment, display, ConversationView -> Path derivation)
- `toolpath`: 69 unit + 11 doc tests (serde roundtrip, builders, query)
- `toolpath-convo`: 118 unit + 4 doc tests (types, enrichment, display, ConversationView -> Path derivation, message-group usage accounting, breakdowns)
- `toolpath-git`: 33 unit + 3 doc tests (derive, branch detection, diffstat)
- `toolpath-github`: 28 unit + 2 doc tests (mapping, DAG construction, fixtures)
- `toolpath-claude`: 278 unit + 6 doc tests (path resolution, conversation reading, query, chaining, watcher, derive, metadata first-user-message)
- `toolpath-gemini`: 163 unit + 12 integration + 4 doc tests (path resolution, chat-file parsing, query, watcher, derive, provider, round-trip fidelity)
- `toolpath-codex`: 69 unit + 33 integration + 1 doc test (rollout parsing, provider assembly, patch-fidelity derive, real-session fixture, source→path fidelity invariants, JSON wire-level round-trip)
- `toolpath-opencode`: 43 unit + 1 doc test (SQLite reader, JSON payload serde, provider assembly, snapshot-based derive, tool-input fallback for gitignored paths)
- `toolpath-cursor`: 70 unit + 8 integration round-trip + 1 real-DB sanity + 1 doc test (state.vscdb SQLite reader, bubble store + composer header parsing, content-addressed blob lookup, projector with full TOOL_TABLE coverage, JSONL transcript ingest in `examples/dump_fixture.rs`)
- `toolpath-pi`: 123 unit + 4 doc tests (types, paths, error, reader, io, provider)
- `toolpath-github`: 32 unit + 3 doc tests (mapping, DAG construction, fixtures)
- `toolpath-claude`: 229 unit + 18 integration + 6 doc tests (path resolution, conversation reading, query, chaining, watcher, derive, metadata first-user-message, group_id grouping + once-per-message usage totals)
- `toolpath-gemini`: 161 unit + 29 integration + 5 doc tests (path resolution, chat-file parsing, query, watcher, derive, provider, round-trip fidelity, thoughts-folded-into-output + reasoning breakdown round-trip)
- `toolpath-codex`: 80 unit + 51 integration + 2 doc tests (rollout parsing, provider assembly, patch-fidelity derive, real-session fixture, source→path fidelity invariants, JSON wire-level round-trip, per-turn token deltas from cumulative counters, reasoning breakdown)
- `toolpath-opencode`: 52 unit + 19 integration + 1 doc test (SQLite reader, JSON payload serde, provider assembly, snapshot-based derive, tool-input fallback for gitignored paths, reasoning breakdown)
- `toolpath-cursor`: 78 unit + 8 integration round-trip + 1 real-DB sanity + 1 doc test (state.vscdb SQLite reader, bubble store + composer header parsing, content-addressed blob lookup, projector with full TOOL_TABLE coverage, JSONL transcript ingest in `examples/dump_fixture.rs`)
- `toolpath-pi`: 133 unit + 26 integration + 5 doc tests (types, paths, error, reader, io, provider)
- `toolpath-dot`: 30 unit + 2 doc tests (render, visual conventions, escaping)
- `path-cli`: 260 unit + 63 integration tests (import/export/cache, track sessions, merge, validate, roundtrip, render-md snapshots, deprecation aliases, pathbase HTTP mock-server tests, fzf-friendly TSV output, `path resume` orchestration with injectable `ExecStrategy`). For an end-to-end check against a real Pathbase deployment, run `scripts/test-pathbase-live.sh <url>` — it does an anon round-trip in a sandboxed config dir and, if you're logged into that URL, an authed pathstash round-trip too.
- `path-cli`: 294 unit + 65 integration tests (import/export/cache, track sessions, merge, validate, roundtrip, render-md snapshots, deprecation aliases, pathbase HTTP mock-server tests, fzf-friendly TSV output, `path resume` orchestration with injectable `ExecStrategy`). For an end-to-end check against a real Pathbase deployment, run `scripts/test-pathbase-live.sh <url>` — it does an anon round-trip in a sandboxed config dir and, if you're logged into that URL, an authed pathstash round-trip too.
- `toolpath-cli`: 0 tests (it's a one-line `path_cli::run()` shim crate that exists only so `cargo install toolpath-cli` keeps installing the `path` binary)

Validate example documents: `for f in examples/*.json; do cargo run -p path-cli -- p validate --input "$f"; done`
@@ -229,7 +229,7 @@ When changing a crate's public API (new types, new trait impls, new public metho

The `toolpath-cli` shim lives **outside** the workspace (`exclude = ["crates/toolpath-cli"]` in the root `Cargo.toml`). Both `toolpath-cli` and `path-cli` produce a binary literally named `path`, and cargo can't write two bin targets to the same workspace `target/debug/path` — so the shim opts out and gets its own `crates/toolpath-cli/target/` (covered by the `crates/*/target` line in `.gitignore`). Practical consequences: `cargo build --workspace`, `cargo test --workspace`, and `cargo run -p toolpath-cli` from the repo root **do not** include the shim. To touch it, use `--manifest-path crates/toolpath-cli/Cargo.toml`. The release script special-cases the shim in `get_version` and `publish` so the workflow is otherwise unchanged.

Build the site after changes: `cd site && pnpm run build` (should produce 7 pages).
Build the site after changes: `cd site && pnpm run build` (should produce 11 pages).

## Things to know

@@ -242,7 +242,9 @@ Build the site after changes: `cd site && pnpm run build` (should produce 7 page
- `toolpath-gemini` treats main file + sibling sub-agent UUID dir as one conversation. Sub-agent files are folded into `DelegatedWork` with populated `turns` (unlike `toolpath-claude`, whose sub-agent turns live in separate session files and stay empty). See `docs/agents/formats/gemini.md` for the full format reference.
- Provider-specific extras convention: `Turn.extra` and `WatcherEvent::Progress.data` use provider-namespaced keys (e.g. `extra["claude"]`, `extra["gemini"]`). `toolpath-claude` populates `Turn.extra["claude"]` from `ConversationEntry.extra`; `toolpath-gemini` populates `Turn.extra["gemini"]` with the full `tokens` struct, per-thought metadata, and tool-call status. This lets trait-only consumers access provider metadata without importing provider types.
- Shared derivation: `toolpath-convo` provides a provider-agnostic `ConversationView → Path` mapping via `toolpath_convo::derive_path`. New conversation providers should build on it rather than re-implementing the mapping.
- Path kinds: `toolpath::v1::PathMeta.kind` is an optional URI naming a hosted kind spec; URIs are immutable and semver-versioned. The only one defined so far is `https://toolpath.net/kinds/agent-coding-session/v1.0.0` (constant `toolpath::v1::PATH_KIND_AGENT_CODING_SESSION`); every conversation → `Path` derivation sets it via the shared `toolpath_convo::derive_path` or each provider crate's own. Carried through the JSONL form via `PathOpen.meta` and `PathMeta` patch lines. Spec sources live in `site/kinds/<name>/<version>/{index.md,schema.json}` and publish under `https://toolpath.net/kinds/`; the registry index is `site/kinds/index.md`. RFC: "Document Kind". JSON Schema: `$defs/pathMeta`.
- Path kinds: `toolpath::v1::PathMeta.kind` is an optional URI naming a hosted kind spec; URIs are immutable and semver-versioned. The only one defined so far is `https://toolpath.net/kinds/agent-coding-session/v1.1.0` (constant `toolpath::v1::PATH_KIND_AGENT_CODING_SESSION`; `…_V1_0_0` names the superseded URI); every conversation → `Path` derivation sets it via the shared `toolpath_convo::derive_path` or each provider crate's own. Carried through the JSONL form via `PathOpen.meta` and `PathMeta` patch lines. Spec sources live in `site/kinds/<name>/<version>/{index.md,schema.json}` (schema.json is a symlink into `crates/path-cli/kinds/`, which `path p validate` bundles — both versions) and publish under `https://toolpath.net/kinds/`; the registry index is `site/kinds/index.md`. RFC: "Document Kind". JSON Schema: `$defs/pathMeta`.
- Token accounting (kind v1.1.0): two keys on `conversation.append`/`Turn`, both optional. `token_usage` = "the total for a message" (on the group's final step; `Σ` over a path = session total). `attributed_token_usage` = "this step's own attributed spend", populated only where the source genuinely reports per-step spend (its own key, so the sum is unaffected; remainder = group total − Σ attributed, computed not stored). One provider message can span several steps (Claude writes one JSONL line per content block); `Turn.group_id` groups them. `toolpath-claude` fills `group_id` from `message.id` and takes the **field-wise-max** group total (line order not trusted). Claude's per-line `usage` is a cumulative *streaming snapshot* (Anthropic streaming API: `message_start` seeds output near 0, `message_delta` is cumulative), NOT a per-block cost — so Claude emits no `attributed_token_usage`; the projector re-expands the total onto every line. `toolpath-codex` differences the cumulative `total_token_usage` (dedup-safe: never sum `last_token_usage` — Codex re-emits it stale; openai/codex #14489), attributes each per-call delta to the step it follows, and derives the round total from those attributions. pi/opencode decode all-zero wire counters as `None`. Never stamp a cumulative counter, a repeated message total, or zero-filled placeholders onto a step; never derive attribution from Claude's streaming snapshots.
- Token usage `breakdowns` (kind v1.1.0, additive): an optional third key on `TokenUsage` — a decomposition of a top-level class into named sub-classes, keyed by class (e.g. `"output"`), inner map sub-class → tokens (e.g. `breakdowns["output"]["reasoning"] = 243`). INFORMATIONAL ONLY: **never summed into any total** (the parent class already counts those tokens, so the session-total guarantee is untouched); invariant `Σ(inner) ≤ parent`; omitted when empty; rides both `token_usage` and `attributed_token_usage`. Per-provider reality: **Gemini** reports `thoughts` (reasoning) as an additive sibling that the derivation used to **drop** (under-counting output) — it's now folded into `output_tokens` *and* recorded as `breakdowns["output"]["reasoning"]`, with the projector un-folding it on the reverse path for a lossless round-trip (`Some(0)` preserved as a real Gemini-3 zero-reasoning signal). **OpenCode** folds `reasoning` into output and records the same breakdown. **Codex** differences `reasoning_output_tokens` (⊆ output, cumulative) into `breakdowns["output"]["reasoning"]` on both per-step `attributed_token_usage` and per-round `token_usage`. **Claude** records no breakdown (its JSONL `usage` doesn't itemize thinking tokens).
- Pi provider: `toolpath-pi` reads Pi session JSONL from `~/.pi/agent/sessions/`. Sessions use a tree (id/parentId) in a single file, and may link to a parent file via `parentSession` in the header. The tree is preserved as a DAG in the derived `Path`.
- Codex provider: `toolpath-codex` reads Codex CLI rollout files from `~/.codex/sessions/YYYY/MM/DD/rollout-*.jsonl`. Sessions are date-bucketed (not project-keyed). File-change fidelity is excellent — Codex's `patch_apply_end` events carry either the unified diff (for updates) or the full file content (for adds), so the derived `Path` gets a real `raw` perspective on every file artifact. See `docs/agents/formats/codex.md` for the full format reference.
- opencode provider: `toolpath-opencode` reads a SQLite database at `~/.local/share/opencode/opencode.db` (opened read-only). Each session's messages and 12 typed part variants (text, reasoning, tool, step-start/-finish, snapshot, patch, file, agent, subtask, retry, compaction) land as one step per message with tool invocations attached. File diffs come from a sibling bare git repo at `snapshot/<project-id>/[<sha1(worktree)>]/` via `git2` tree↔tree diffs — opencode respects the user's `.gitignore`, so changes under gitignored paths fall back to tool-input-derived structural changes with no `raw` perspective. Project id is the SHA of the repo's first root commit. See `docs/agents/formats/opencode.md` for the full format reference.