
Add multi-event sampling #7

Open
GuillaumeLagrange wants to merge 5 commits into codspeed from cod-2810-check-samply-possibilitieslimitations-regarding-additional

Conversation

@GuillaumeLagrange


Related: AvalancheHQ/linux-perf-event-reader@908775c

Which is updated in our linux-perf-data reference

Initially opened by mistake as #6

GuillaumeLagrange and others added 5 commits June 11, 2026 18:10
Open extra hardware events as counting-only siblings in each sampling
event's kernel event group, so every sample carries the group's counter
values via PERF_SAMPLE_READ. The converter turns the cumulative values
into per-sample deltas, tracked per (kernel event id, tid) because with
inherit every thread has its own counter instance sharing the group's
id.
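
A minimal sketch of the delta tracking described above, not the converter's actual code. `DeltaTracker` and its method names are hypothetical, and treating the first reading for a key as "no delta yet" is an assumption.

```rust
use std::collections::HashMap;

/// Turns cumulative counter readings into per-sample deltas.
/// Keyed by (kernel event id, tid): with inherit, each thread has its own
/// counter instance even though all of them report the group's shared id.
#[derive(Default)]
pub struct DeltaTracker {
    last_seen: HashMap<(u64, i32), u64>,
}

impl DeltaTracker {
    /// Records `cumulative` for the key and returns the increase since the
    /// previous reading, or `None` if this is the first reading for the key
    /// (assumption: no baseline means no delta).
    pub fn delta(&mut self, event_id: u64, tid: i32, cumulative: u64) -> Option<u64> {
        let previous = self.last_seen.insert((event_id, tid), cumulative)?;
        // saturating_sub guards against a reading that goes backwards.
        Some(cumulative.saturating_sub(previous))
    }
}
```

Keying by id alone would mix readings from different threads, which is why the tid is part of the key.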

The deltas ride along on the thread's stack samples and are serialized
as a new samplyEventDeltas object in the samples table (one column per
event, null when a sample carries no value), next to threadCPUDelta.
This keeps the association between stack and event values exact by
construction, with per-thread attribution, instead of requiring a
timestamp join against per-process counter tracks. The field is a
samply extension which the Firefox Profiler UI ignores.
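
To illustrate the shape described above (one column per event, `null` where a sample carries no value), here is a rough sketch. Only the `samplyEventDeltas` key comes from this PR; the function name, the event names used as column keys, and the exact nesting are assumptions, and the real serializer may differ.

```rust
/// Serializes per-sample deltas as a JSON object with one array column per
/// event. Assumes column names are plain ASCII that needs no JSON escaping.
pub fn event_deltas_json(columns: &[(&str, Vec<Option<u64>>)]) -> String {
    let body: Vec<String> = columns
        .iter()
        .map(|(name, values)| {
            let cells: Vec<String> = values
                .iter()
                .map(|v| match v {
                    Some(n) => n.to_string(),
                    // A sample without a value for this event becomes null.
                    None => "null".to_string(),
                })
                .collect();
            format!("\"{}\":[{}]", name, cells.join(","))
        })
        .collect();
    format!("{{\"samplyEventDeltas\":{{{}}}}}", body.join(","))
}
```

Because each column has one entry per sample, row `i` of every column lines up with stack sample `i`, so no timestamp join is needed.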

If the extra events cannot be opened (PMU not exposed, or inherit +
PERF_SAMPLE_READ requiring Linux 6.12), retry hardware-cycles sampling
without them before falling back to software cpu-clock.
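
The fallback order above could be sketched as follows. `SamplingConfig`, `pick_sampling_config`, and the `try_open` callback are hypothetical names standing in for the real perf event setup, which is not shown here.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SamplingConfig {
    /// Hardware cycles sampling with the extra counting-only siblings.
    CyclesWithExtraEvents,
    /// Hardware cycles sampling alone.
    CyclesOnly,
    /// Software cpu-clock sampling, the last resort.
    SoftwareCpuClock,
}

/// Tries each configuration in order and returns the first one that
/// `try_open` accepts. If all fail, returns the last error seen.
pub fn pick_sampling_config<E>(
    mut try_open: impl FnMut(SamplingConfig) -> Result<(), E>,
) -> Result<SamplingConfig, E> {
    const ORDER: [SamplingConfig; 3] = [
        SamplingConfig::CyclesWithExtraEvents,
        SamplingConfig::CyclesOnly,
        SamplingConfig::SoftwareCpuClock,
    ];
    let mut last_err = None;
    for config in ORDER {
        match try_open(config) {
            Ok(()) => return Ok(config),
            Err(e) => last_err = Some(e),
        }
    }
    Err(last_err.expect("ORDER is non-empty"))
}
```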

No events are wired up yet; event selection comes separately.

Refs COD-2810
Co-Authored-By: Claude <noreply@anthropic.com>
The interface is deliberately low-level: the logic of choosing which events
to open, and how to call it, is left up to the caller.
Pre-6.12 kernels reject `inherit` combined with `PERF_SAMPLE_READ`, so the
extra-event group (the per-sample hardware counters) couldn't be opened
alongside the inherited sampling events; samply hit EINVAL and silently
dropped the extra counters.

Probe the kernel once for inherited sample-read support. When it's missing,
open the sampling group per-CPU system-wide (pid=-1, no inherit) across all
CPUs instead of per-task. This still captures every thread on the CPU —
including ones spawned after attach, which the per-task no-inherit approach
would miss — and keeps stacks and counters in a single sample record.
Samples are filtered to the launched process tree (seeded from the profiled
pids, grown on fork) so the system-wide buffers don't pull in unrelated
processes.
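
A minimal sketch of the process-tree filter described above, assuming fork events carry a parent and a child pid. `ProcessTreeFilter` and its methods are hypothetical names, and pid reuse and process exit are deliberately ignored here.

```rust
use std::collections::HashSet;

/// Tracks the pids that belong to the launched process tree.
pub struct ProcessTreeFilter {
    pids: HashSet<i32>,
}

impl ProcessTreeFilter {
    /// Seeds the tree with the initially profiled pids.
    pub fn new(profiled_pids: impl IntoIterator<Item = i32>) -> Self {
        Self { pids: profiled_pids.into_iter().collect() }
    }

    /// Grows the tree on fork: the child joins only if its parent is
    /// already tracked, so unrelated processes stay out.
    pub fn on_fork(&mut self, parent_pid: i32, child_pid: i32) {
        if self.pids.contains(&parent_pid) {
            self.pids.insert(child_pid);
        }
    }

    /// Returns true if a sample from `pid` should be kept.
    pub fn accepts(&self, pid: i32) -> bool {
        self.pids.contains(&pid)
    }
}
```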

System-wide events aren't tied to the launched task, so they're enabled
explicitly (ENABLE_ON_EXEC never fires for them) and profiling stops as soon
as the launched process exits (system-wide events never close on their own).

On >=6.12 the kernel accepts inherit + PERF_SAMPLE_READ, so the existing
per-task + inherit path is used unchanged.
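
The "probe once, then pick a path" decision might look roughly like this. `KernelCaps`, `CaptureMode`, and the `probe` callback are hypothetical; the real probe would attempt to open an event with inherit and PERF_SAMPLE_READ set, which this sketch leaves to the caller.

```rust
use std::sync::OnceLock;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CaptureMode {
    /// Existing path: per-task events with inherit.
    PerTaskInherit,
    /// Fallback: per-CPU system-wide events (pid=-1, no inherit).
    SystemWidePerCpu,
}

/// Caches the result of the inherited sample-read probe.
pub struct KernelCaps {
    inherit_sample_read: OnceLock<bool>,
}

impl KernelCaps {
    pub const fn new() -> Self {
        Self { inherit_sample_read: OnceLock::new() }
    }

    /// Runs `probe` at most once for this instance and reuses the
    /// cached answer on every later call.
    pub fn capture_mode(&self, probe: impl FnOnce() -> bool) -> CaptureMode {
        if *self.inherit_sample_read.get_or_init(probe) {
            CaptureMode::PerTaskInherit
        } else {
            CaptureMode::SystemWidePerCpu
        }
    }
}
```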

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@not-matthias not-matthias left a comment

Member


Looks good overall, just want to make sure we don't accidentally introduce variance/noise into benchmarks due to now profiling the entire system.

Member


Is there a perf drawback to using system-wide capture? This could lead to unwanted variance in walltime benchmarks on macro runners.

