feat: estimate cycles#17
Conversation
Merging this PR will improve performance by 11.8%
Warning Please fix the performance issues or acknowledge them on CodSpeed. Performance Changes
Tip Investigate this regression by commenting Comparing Footnotes
|
Add callgrind/lut-gen/ (offline generators for the --cycle-estimation cost tables) and the regenerated x86_caps_lut.inc / arm64_caps_lut.inc. - lib/: sigkey.py (Python mirror of sigkey.h's packed-key contract) and gen_common.py (shared measurement parsing, collision collapse, emit). - x86/gen_x86_lut.py: Zen4-tuned table from uops.info instructions.xml. - arm64/: measured Cortex-A72 table from ocxtal/insn_bench_aarch64 via insn_bench_to_xml.py -> gen_arm64_lut.py -> merge_arm64_lut.py, with a hand-frozen guide supplement for ops insn_bench does not benchmark. - test_lut.py: validates the committed tables against the runtime keying contract and asserts sigkey.py stays in lockstep with sigkey.h.
Add the --cycle-estimation=yes runtime: decode each guest instruction with Capstone and look its reciprocal-throughput cost up in the generated cost table (x86/arm64_caps_lut.inc), accumulating a per-instruction Cy event. - cycledecode.c/.h: isolated Capstone decode + LUT lookup behind a plain-C ABI (no Valgrind tool headers, to avoid type clashes); freestanding libc shims since the tool links -nodefaultlibs. - sigkey.h: packed (insn-id, signature) key contract shared with the offline LUT generators in lut-gen/. - main.c/sim.c/bbcc.c/global.h/clo.c: wire Cy through callgrind (CLI option, EventSet, self + inclusive cost, fallback warning when an instruction has no table match). - configure.ac/Makefile.am: build cycledecode against Capstone. Emits only Cy (reciprocal throughput); no latency event.
4b2e2b7 to
22e0934
Compare
Greptile SummaryThis PR adds
Confidence Score: 5/5Safe to merge. The cycle-estimation path is entirely opt-in and off by default; all new code paths are guarded by CLG_(clo).cycle_estimation. Capstone is statically linked with no runtime dependency. The change introduces a substantial new subsystem but it is well-isolated: cycledecode.c is a self-contained translation unit with its own freestanding shims, the LUT key contract is validated by a lockstep test, and all integration points in main.c/bbcc.c/sim.c are guarded behind the new CLI flag. The only technical gap is in the realloc shim, which is latent code that Capstone's cs_disasm_iter flow does not exercise in practice. callgrind/cycledecode.c — specifically the realloc shim's memcpy source-length bound. Important Files Changed
|
|
|
||
| CLG_(init_dumps)(); | ||
|
|
||
| if (CLG_(clo).cycle_estimation && !clg_cycledecode_init()) { | ||
| extern int clg_cd_init_stage; | ||
| VG_(message)(Vg_UserMsg, | ||
| "--cycle-estimation=yes needs a Capstone build " | ||
| "(amd64/arm64 guest); disabling it. [init stage %d]\n", | ||
| clg_cd_init_stage); | ||
| CLG_(clo).cycle_estimation = False; | ||
| } |
There was a problem hiding this comment.
Diagnostic variable
clg_cd_init_stage not declared in the public header
clg_cd_init_stage is accessed here via an extern declaration inside a function body, which is an unusual pattern that bypasses the header contract. Since cycledecode.h already defines the public ABI for this translation unit, adding extern int clg_cd_init_stage; to that header would keep all callers in sync and make the diagnostic visible to any future caller without copy-pasting the extern declaration.
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
- configure: require Capstone via --with-capstone or $CAPSTONE_DIR; error if absent - cycledecode: #error instead of no-op stub when built without Capstone - flake: slim (x86+arm64) static Capstone; hardeningDisable on the devShell - ci/codspeed/release: inline cmake step builds slim static hardening-free Capstone - debian/rules: forward $CAPSTONE_DIR to configure via --with-capstone - changelog: document the build requirement
…-bit Capstone The amd64 host builds both the primary (amd64) tool and a 32-bit x86 secondary tool, but Capstone is only built 64-bit, so CLG_WITH_CAPSTONE is set only for the primary build. The secondary build compiled cycledecode.c without it and tripped the mandatory-Capstone #error. CodSpeed only ever runs the 64-bit tool, so build 64-bit only everywhere: add --enable-only64bit to the CI configure, the release deb (debian/rules, now unconditional), and the Justfile, and drop the now-unneeded gcc-multilib / libc6-dev-i386 deps. This also roughly halves build time by skipping the entire 32-bit toolchain.
| void* realloc(void* old, size_t n) | ||
| { | ||
| /* Old size is unknown; copy n bytes (stays within the arena array, so no | ||
| * fault) and leak the old block. Rarely hit under cs_disasm_iter. */ | ||
| void* p = malloc(n); | ||
| if (p && old) | ||
| memcpy(p, old, n); | ||
| return p; | ||
| } |
There was a problem hiding this comment.
realloc copies n (new size) bytes from old, potentially reading past the arena
The comment says "stays within the arena array, so no fault", but that holds only if (old - arena) + n ≤ sizeof(arena). If old sits late in the 1 MiB arena and n is larger than the remaining arena space, memcpy(p, old, n) over-reads past arena[1u << 20] into adjacent memory. A simple clamp to the remaining arena space preserves the copy-what-you-can semantics without needing to track the old block's actual size.
| void* realloc(void* old, size_t n) | |
| { | |
| /* Old size is unknown; copy n bytes (stays within the arena array, so no | |
| * fault) and leak the old block. Rarely hit under cs_disasm_iter. */ | |
| void* p = malloc(n); | |
| if (p && old) | |
| memcpy(p, old, n); | |
| return p; | |
| } | |
| void* realloc(void* old, size_t n) | |
| { | |
| /* Old size is unknown; copy min(n, remaining arena from old) bytes and | |
| * leak the old block. Rarely hit under cs_disasm_iter. */ | |
| void* p = malloc(n); | |
| if (p && old) { | |
| size_t arena_end = (size_t)(arena + sizeof(arena)); | |
| size_t old_pos = (size_t)old; | |
| size_t avail = (old_pos < arena_end) ? (arena_end - old_pos) : 0; | |
| memcpy(p, old, avail < n ? avail : n); | |
| } | |
| return p; | |
| } |
No description provided.