Skip to content

feat: estimate cycles#17

Open
not-matthias wants to merge 6 commits into
masterfrom
feat/cycle-estimation
Open

feat: estimate cycles#17
not-matthias wants to merge 6 commits into
masterfrom
feat/cycle-estimation

Conversation

@not-matthias

@not-matthias not-matthias commented Jun 9, 2026

Copy link
Copy Markdown
Member

No description provided.

@codspeed-hq

codspeed-hq Bot commented Jun 9, 2026

Copy link
Copy Markdown

Merging this PR will improve performance by 11.8%

⚡ 1 improved benchmark
❌ 3 regressed benchmarks
✅ 36 untouched benchmarks
⏩ 80 skipped benchmarks1

Warning

Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

Benchmark BASE HEAD Efficiency
test_valgrind[valgrind-3.25.1, python3 testdata/test.py, full-with-inline] 6.9 s 234.3 s -97.05%
test_valgrind[valgrind-3.25.1, testdata/take_strings-aarch64 varbinview_non_null, full-with-inline] 7.7 s 94.9 s -91.86%
test_valgrind[valgrind-3.26.0, python3 testdata/test.py, full-no-inline] 6.8 s 8.7 s -22.18%
test_valgrind[valgrind-3.25.1, echo Hello, World!, full-with-inline] 612,627.1 ms 733.7 ms ×830

Tip

Investigate this regression by commenting @codspeedbot fix this regression on this PR, or directly use the CodSpeed MCP with your agent.


Comparing feat/cycle-estimation (d2d09c3) with master (fa9ee2e)

Open in CodSpeed

Footnotes

  1. 80 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

Add callgrind/lut-gen/ (offline generators for the --cycle-estimation cost
tables) and the regenerated x86_caps_lut.inc / arm64_caps_lut.inc.

- lib/: sigkey.py (Python mirror of sigkey.h's packed-key contract) and
  gen_common.py (shared measurement parsing, collision collapse, emit).
- x86/gen_x86_lut.py: Zen4-tuned table from uops.info instructions.xml.
- arm64/: measured Cortex-A72 table from ocxtal/insn_bench_aarch64 via
  insn_bench_to_xml.py -> gen_arm64_lut.py -> merge_arm64_lut.py, with a
  hand-frozen guide supplement for ops insn_bench does not benchmark.
- test_lut.py: validates the committed tables against the runtime keying
  contract and asserts sigkey.py stays in lockstep with sigkey.h.
Add the --cycle-estimation=yes runtime: decode each guest instruction with
Capstone and look its reciprocal-throughput cost up in the generated cost
table (x86/arm64_caps_lut.inc), accumulating a per-instruction Cy event.

- cycledecode.c/.h: isolated Capstone decode + LUT lookup behind a plain-C
  ABI (no Valgrind tool headers, to avoid type clashes); freestanding libc
  shims since the tool links -nodefaultlibs.
- sigkey.h: packed (insn-id, signature) key contract shared with the offline
  LUT generators in lut-gen/.
- main.c/sim.c/bbcc.c/global.h/clo.c: wire Cy through callgrind (CLI option,
  EventSet, self + inclusive cost, fallback warning when an instruction has
  no table match).
- configure.ac/Makefile.am: build cycledecode against Capstone.

Emits only Cy (reciprocal throughput); no latency event.
@not-matthias not-matthias force-pushed the feat/cycle-estimation branch from 4b2e2b7 to 22e0934 Compare June 11, 2026 17:56
@not-matthias not-matthias marked this pull request as ready for review June 12, 2026 06:49
@greptile-apps

greptile-apps Bot commented Jun 12, 2026

Copy link
Copy Markdown

Greptile Summary

This PR adds --cycle-estimation=yes to Callgrind: at instrumentation time each guest instruction is decoded by Capstone and looked up in a generated cost table (x86_caps_lut.inc / arm64_caps_lut.inc) keyed by a packed (insn-id, operand-sig) value defined in sigkey.h. The resulting centi-cycle throughput cost is stored per-InstrInfo, accumulated into a running cy_incl sum for O(1) call-graph inclusive updates, and surfaced as a new Cy event in Callgrind output.

  • Capstone is statically linked from a hardening-free, x86+arm64-only build; configure.ac makes it mandatory for this fork. CI workflows build and cache the library before the Valgrind configure step.
  • The cycledecode.c translation unit is deliberately isolated from Valgrind tool headers and provides all required libc symbols as freestanding shims backed by a 1 MiB bump arena.
  • A Python/C lockstep test (lut-gen/test_lut.py) validates that sigkey.py and sigkey.h pack keys identically, and golden anchor rows guard against cost regressions in the LUT.

Confidence Score: 5/5

Safe to merge. The cycle-estimation path is entirely opt-in and off by default; all new code paths are guarded by CLG_(clo).cycle_estimation. Capstone is statically linked with no runtime dependency.

The change introduces a substantial new subsystem but it is well-isolated: cycledecode.c is a self-contained translation unit with its own freestanding shims, the LUT key contract is validated by a lockstep test, and all integration points in main.c/bbcc.c/sim.c are guarded behind the new CLI flag. The only technical gap is in the realloc shim, which is latent code that Capstone's cs_disasm_iter flow does not exercise in practice.

callgrind/cycledecode.c — specifically the realloc shim's memcpy source-length bound.

Important Files Changed

Filename Overview
callgrind/cycledecode.c New file: Capstone-backed cycle-cost decoder with freestanding libc shims and binary-search LUT. One over-read hazard in the realloc shim (uses new size for memcpy source length).
callgrind/cycledecode.h Clean public C ABI header for the cycle decoder; plain types only, no tool-header pollution.
callgrind/sigkey.h Packed 64-bit LUT key contract shared between the C runtime and the Python generator; bit layout is clearly documented and tested by test_lut.py.
callgrind/main.c Adds cycle-cost decode at IMark time (first translation only) and the cy_incl running-sum pass; guarded by CLG_(clo).cycle_estimation throughout. Graceful fallback when Capstone init fails.
callgrind/sim.c Adds EG_CYCLES event group (conditionally registered, unconditionally added to full set per existing codebase pattern) and per-instruction cycle cost accumulation in cachesim_add_icost.
callgrind/bbcc.c Uses cy_incl (pre-computed running sum) for O(1) inclusive cycle-cost update at each side exit; correctly handles both normal and skipped-function paths.
configure.ac Makes Capstone mandatory for this fork with clear error messages; fortify-off CFLAGS match the -nodefaultlibs tool link constraint.
.github/workflows/ci.yml Adds Capstone static build step before configure; switches to --enable-only64bit, removes now-unneeded 32-bit multilib deps.
.github/workflows/codspeed.yml Capstone build correctly gated on cache-hit != 'true' && matrix.valgrind == 'local', matching the Valgrind build gate so CAPSTONE_DIR is always set when needed.
.github/workflows/release.yml Capstone built and CAPSTONE_DIR forwarded via debuild -e; additional-deps matrix entries removed correctly.
debian/rules Adds --enable-only64bit universally and forwards CAPSTONE_DIR via --with-capstone when set; consistent with the release workflow.
callgrind/lut-gen/lib/sigkey.py Python mirror of sigkey.h; correctness enforced by test_lut.py's lockstep test against the C header defines.
callgrind/lut-gen/test_lut.py Comprehensive LUT validation: lockstep sigkey.h vs sigkey.py bit layout, cost non-negativity, flag/width constraints, and golden measured/grafted anchor rows.

Sequence Diagram

sequenceDiagram
    participant VG as Valgrind Core
    participant INST as CLG_(instrument)
    participant CD as cycledecode.c
    participant LUT as Cost LUT (sorted array)
    participant BBCC as setup_bbcc
    participant SIM as cachesim_add_icost

    VG->>INST: instrument(BB, first_translation)
    loop Each IMark (first translation only)
        INST->>CD: "clg_cycle_cost(cia, len, &cy)"
        CD->>CD: cs_disasm_iter → compute_sig
        CD->>LUT: binary search CLG_CSKEY(insn_id, sig)
        LUT-->>CD: "Row{cy, cl} or miss"
        CD-->>INST: "cy (centi-cycles) or fallback=100"
        INST->>INST: "ii->cy_cost = cy"
    end
    INST->>INST: compute cy_incl running sums over BB

    VG->>BBCC: setup_bbcc (side exit / call)
    BBCC->>BBCC: "cost[EG_CYCLES] += last_ii->cy_incl"

    VG->>SIM: cachesim_add_icost(cost, bbcc, ii, exe_count)
    SIM->>SIM: "cost[EG_CYCLES] += exe_count * ii->cy_cost"
Loading

Reviews (3): Last reviewed commit: "ci(callgrind): build 64-bit only so cycl..." | Re-trigger Greptile

Comment thread callgrind/cycledecode.c
Comment thread callgrind/main.c
Comment on lines 2132 to +2142

CLG_(init_dumps)();

if (CLG_(clo).cycle_estimation && !clg_cycledecode_init()) {
extern int clg_cd_init_stage;
VG_(message)(Vg_UserMsg,
"--cycle-estimation=yes needs a Capstone build "
"(amd64/arm64 guest); disabling it. [init stage %d]\n",
clg_cd_init_stage);
CLG_(clo).cycle_estimation = False;
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Diagnostic variable clg_cd_init_stage not declared in the public header

clg_cd_init_stage is accessed here via an extern declaration inside a function body, which is an unusual pattern that bypasses the header contract. Since cycledecode.h already defines the public ABI for this translation unit, adding extern int clg_cd_init_stage; to that header would keep all callers in sync and make the diagnostic visible to any future caller without copy-pasting the extern declaration.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Comment thread bench/generate_config.py
- configure: require Capstone via --with-capstone or $CAPSTONE_DIR; error if absent
- cycledecode: #error instead of no-op stub when built without Capstone
- flake: slim (x86+arm64) static Capstone; hardeningDisable on the devShell
- ci/codspeed/release: inline cmake step builds slim static hardening-free Capstone
- debian/rules: forward $CAPSTONE_DIR to configure via --with-capstone
- changelog: document the build requirement
…-bit Capstone

The amd64 host builds both the primary (amd64) tool and a 32-bit x86
secondary tool, but Capstone is only built 64-bit, so CLG_WITH_CAPSTONE
is set only for the primary build. The secondary build compiled
cycledecode.c without it and tripped the mandatory-Capstone #error.

CodSpeed only ever runs the 64-bit tool, so build 64-bit only everywhere:
add --enable-only64bit to the CI configure, the release deb (debian/rules,
now unconditional), and the Justfile, and drop the now-unneeded
gcc-multilib / libc6-dev-i386 deps. This also roughly halves build time by
skipping the entire 32-bit toolchain.
Comment thread callgrind/cycledecode.c
Comment on lines +110 to +118
void* realloc(void* old, size_t n)
{
/* Old size is unknown; copy n bytes (stays within the arena array, so no
* fault) and leak the old block. Rarely hit under cs_disasm_iter. */
void* p = malloc(n);
if (p && old)
memcpy(p, old, n);
return p;
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 realloc copies n (new size) bytes from old, potentially reading past the arena

The comment says "stays within the arena array, so no fault", but that holds only if (old - arena) + n ≤ sizeof(arena). If old sits late in the 1 MiB arena and n is larger than the remaining arena space, memcpy(p, old, n) over-reads past arena[1u << 20] into adjacent memory. A simple clamp to the remaining arena space preserves the copy-what-you-can semantics without needing to track the old block's actual size.

Suggested change
void* realloc(void* old, size_t n)
{
/* Old size is unknown; copy n bytes (stays within the arena array, so no
* fault) and leak the old block. Rarely hit under cs_disasm_iter. */
void* p = malloc(n);
if (p && old)
memcpy(p, old, n);
return p;
}
void* realloc(void* old, size_t n)
{
/* Old size is unknown; copy min(n, remaining arena from old) bytes and
* leak the old block. Rarely hit under cs_disasm_iter. */
void* p = malloc(n);
if (p && old) {
size_t arena_end = (size_t)(arena + sizeof(arena));
size_t old_pos = (size_t)old;
size_t avail = (old_pos < arena_end) ? (arena_end - old_pos) : 0;
memcpy(p, old, avail < n ? avail : n);
}
return p;
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant