smp: always wake up core when ready by zyuiop · Pull Request #2468 · hermit-os/kernel

zyuiop · 2026-06-06T14:21:53Z

This is a partial revert of commit f0bb0a0.

This fixes a very tricky race condition that caused some cores to not be woken up, causing tests to fail (and weird bugs in the scheduling also)

Context

In some PRs, cargo xtask ci rs --arch x86_64 --profile dev --package rusty_demo --features fs --smp 4 qemu --accel --sudo --devices virtio-fs-pci fails due to timeout.

This bug is reproducible locally, and is even more frequent with higher core counts (e.g., try it with cargo xtask ci rs --arch x86_64 --profile dev --package rusty_demo --features fs --smp 32 qemu --accel --sudo --devices virtio-fs-pci on main and you're almost certain to get the bug. You can similarly confirm that this patch solves the bug.

The symptom is that the "Laplace" computation will get stuck. You can confirm this is not just the loop taking a long time by adding a small log in compute in laplace.rs

	for i in 0..iterations {
		iteration(current, next, size_x, size_y);
		mem::swap(&mut current, &mut next);
		println!("iteration {i}")
	}

You will observe the counter go up, then suddenly stop and nothing happens. This is the bug.

I initially believed this was a futex problem (so I rewrote them partially, another PR is incoming), but it was actually something different.

The Bug (i think)

The thing that tipped me was that the bug was absent with the idle-loop flag. This greatly reduce the area that needed to be searched, since this flag is only present in a couple of places.

The bug is caused by a race condition when a core B wants to wake up a core A, which is going into halt.

The relevant call points are these:

PerCoreScheduler::run. This is the idle task. It uses a backoff, checking periodically the core inbox to see if there is work. If there is none and the backoff is exhausted, it goes to sleep, calling enable_and_wait.
wakeup_core. This is called by another CPU, after adding a task to the inbox. It sends an interrupt to the other core, to wake it up.

f0bb0a0 introduced a boolean for each core that indicates if the core is halting. If this boolean is set to false, wakeup_core skips the interrupt and returns immediately.
This works most of the time because PerCoreScheduler::run will always pick up tasks when some are available.

However, there is a possible race condition. Imagine the two following cores:

Core 1                                     Core 2

                                           run:
                                              check_inbox // inbox is empty
                                              enable_and_wait // go to enable interrupts

                                           enable_and_wait:
custom_wakeup:
  add_task_to_inbox
  wakeup_core

wakeup_core:
  check_hlt_state // hlt state is currently false
  return          // do nothing

                                              set_hlt_state(true)
                                              interrupts_enable
                                              hlt

I hope you appreciate my artistic talents, but that's the gist.

The fix

Simply a partial revert of f0bb0a0.
I was considering changing the boolean into a state machine, with a "don't sleep please" variant that could be sent by wakeup_core if the halt state was false, to indicate to a core that was about to go to sleep that it should not.
I'm not sure it was working well because I was getting weird values in my compare_and_exchange calls, but this may be a viable alternative to reduce the number of interrupts we send.

Notes

Increasing the core count increases the likelihood of the race condition happening. Laplace appears to be a good test-case because all threads are waiting on each other, so if ONE thread does not wake up, the program remains stuck.

It is possible that this patch will also solve other issues related to the scheduler suddenly becoming unresponsive until you cause an interrupt, but I have not tested that yet.

Fixes #2317

This is a partial revert of commit f0bb0a0

github-actions

Benchmark Results

Details

Benchmark	Current: `f4942c9`	Previous: `28c7089`	Performance Ratio
startup_benchmark Build Time	`113.60` s	`115.28` s	`0.99` ❗
startup_benchmark File Size	`0.77` MB	`0.77` MB	`1.00` ❗
Startup Time - 1 core	`1.00` s (`±0.02` s)	`1.00` s (`±0.03` s)	`1.01`
Startup Time - 2 cores	`1.03` s (`±0.06` s)	`1.04` s (`±0.05` s)	`0.99`
Startup Time - 4 cores	`0.85` s (`±0.10` s)	`1.00` s (`±0.06` s)	`0.86` ❗
multithreaded_benchmark Build Time	`113.78` s	`118.33` s	`0.96` ❗
multithreaded_benchmark File Size	`0.86` MB	`0.87` MB	`0.99` ❗
Multithreaded Pi Efficiency - 2 Threads	`89.84` % (`±15.27` %)	`92.34` % (`±11.70` %)	`0.97`
Multithreaded Pi Efficiency - 4 Threads	`44.71` % (`±7.25` %)	`45.83` % (`±5.87` %)	`0.98`
Multithreaded Pi Efficiency - 8 Threads	`25.10` % (`±3.77` %)	`25.71` % (`±3.50` %)	`0.98`
micro_benchmarks Build Time	`91.92` s	`96.08` s	`0.96` ❗
micro_benchmarks File Size	`0.86` MB	`0.88` MB	`0.99` ❗
Scheduling time - 1 thread	`84.39` ticks (`±3.50` ticks)	`73.85` ticks (`±3.74` ticks)	`1.14` ❗
Scheduling time - 2 threads	`46.44` ticks (`±4.08` ticks)	`41.46` ticks (`±4.37` ticks)	`1.12`
Micro - Time for syscall (getpid)	`3.11` ticks (`±0.43` ticks)	`3.08` ticks (`±0.19` ticks)	`1.01`
Memcpy speed - (built_in) block size 4096	`74528.23` MByte/s (`±51550.28` MByte/s)	`73208.48` MByte/s (`±50616.82` MByte/s)	`1.02`
Memcpy speed - (built_in) block size 1048576	`29435.39` MByte/s (`±24156.07` MByte/s)	`29392.18` MByte/s (`±24125.47` MByte/s)	`1.00`
Memcpy speed - (built_in) block size 16777216	`26248.27` MByte/s (`±21612.65` MByte/s)	`23751.74` MByte/s (`±19605.56` MByte/s)	`1.11`
Memset speed - (built_in) block size 4096	`74689.57` MByte/s (`±51661.39` MByte/s)	`73313.38` MByte/s (`±50689.78` MByte/s)	`1.02`
Memset speed - (built_in) block size 1048576	`30180.99` MByte/s (`±24573.26` MByte/s)	`30169.43` MByte/s (`±24587.38` MByte/s)	`1.00`
Memset speed - (built_in) block size 16777216	`26959.34` MByte/s (`±22039.42` MByte/s)	`24402.29` MByte/s (`±20008.02` MByte/s)	`1.10`
Memcpy speed - (rust) block size 4096	`66129.21` MByte/s (`±46109.14` MByte/s)	`66427.37` MByte/s (`±46490.49` MByte/s)	`1.00`
Memcpy speed - (rust) block size 1048576	`29519.67` MByte/s (`±24267.87` MByte/s)	`29425.11` MByte/s (`±24175.79` MByte/s)	`1.00`
Memcpy speed - (rust) block size 16777216	`26484.97` MByte/s (`±21804.24` MByte/s)	`24403.77` MByte/s (`±20179.96` MByte/s)	`1.09`
Memset speed - (rust) block size 4096	`66520.04` MByte/s (`±46379.04` MByte/s)	`66262.32` MByte/s (`±46403.75` MByte/s)	`1.00`
Memset speed - (rust) block size 1048576	`30279.23` MByte/s (`±24693.83` MByte/s)	`30172.11` MByte/s (`±24599.55` MByte/s)	`1.00`
Memset speed - (rust) block size 16777216	`27221.49` MByte/s (`±22251.69` MByte/s)	`25069.95` MByte/s (`±20590.63` MByte/s)	`1.09`
alloc_benchmarks Build Time	`91.97` s	`91.43` s	`1.01` ❗
alloc_benchmarks File Size	`0.85` MB	`0.85` MB	`1.00` ❗
Allocations - Allocation success	`100.00` %	`100.00` %	`1`
Allocations - Deallocation success	`100.00` %	`100.00` %	`1`
Allocations - Pre-fail Allocations	`100.00` %	`100.00` %	`1`
Allocations - Average Allocation time	`4647.33` Ticks (`±57.72` Ticks)	`4633.96` Ticks (`±62.19` Ticks)	`1.00`
Allocations - Average Allocation time (no fail)	`4647.33` Ticks (`±57.72` Ticks)	`4633.96` Ticks (`±62.19` Ticks)	`1.00`
Allocations - Average Deallocation time	`1198.13` Ticks (`±109.06` Ticks)	`683.28` Ticks (`±99.25` Ticks)	`1.75` ❗
mutex_benchmark Build Time	`95.02` s	`90.68` s	`1.05` ❗
mutex_benchmark File Size	`0.87` MB	`0.88` MB	`0.99` ❗
Mutex Stress Test Average Time per Iteration - 1 Threads	`13.80` ns (`±0.75` ns)	`13.82` ns (`±0.84` ns)	`1.00`
Mutex Stress Test Average Time per Iteration - 2 Threads	`19.48` ns (`±4.29` ns)	`14.48` ns (`±0.75` ns)	`1.35` ❗

This comment was automatically generated by workflow using github-action-benchmark.

zyuiop · 2026-06-06T15:00:48Z

I will note that after running a couple of times

cargo xtask ci rs --arch x86_64 --profile release  --package rusty_demo --features fs --features hermit/idle-poll --smp 4 qemu --accel

and

cargo xtask ci rs --arch x86_64 --profile release  --package rusty_demo --features fs --smp 4 qemu --accel

idle-poll seems to outperform the non idle-poll version by a factor around 2x. Maybe this flag should become the default again?

mkroening

Thank you so much for finding the cause of that issue! :)

This PR does indeed solve it, but the slowdown feels too large for me to merge immediately.

On main, Laplace takes around 31 ms with and without idle-poll for me. With this PR, idle-poll stays within that range, which makes sense, but without idle-poll, Laplace takes more than 700 ms. This is in line with your performance comment.

I don't think we should default to idle-poll, since that wastes many resources and should be opt-in for high-performance scenarios.

I'll take a look at #2470 later. Maybe we can land that directly.

zyuiop · 2026-06-15T13:59:59Z

That's weird. Locally I observed similar times with and without the patch, and certainly not a x30.
How many cores do you use? It looks like Laplace is only using a single core at a time anyway.

For this specific bench, idle-poll was faster than main, and my patch was basically identical

mkroening · 2026-06-15T15:42:35Z

Okay, I have rerun the tests. I run cargo xtask ci rs --arch x86_64 --package rusty_demo --release --smp 4 qemu --accel and sometimes add --feature hermit/idle-poll.

In an Intel-based VM (Hermit is nested), I get the aforementioned slowdown (no poll-idle). On a bare-metal AMD EPYC 9354P machine, I only get a slowdown from about 20ms to about 30ms for Laplace (no poll-idle).

Laplace is using multiple threads through Rayon's par_chunks_mut.

zyuiop · 2026-06-15T15:57:36Z

Thanks for the details.

My remark on "single thread at a time" was due to looking deeper into the futexes usage by rayon, and noticing that basically all threads are always waiting for the lock, so a single thread is used at a time - which you can also observe because iirc the single core time is faster than multiple cores.

On my side, I'm using an AMD Ryzen AI 9 HX PRO 370 w/ Radeon 890M (my laptop). Using cargo xtask ci to run 20 times laplace, in each config, I get:

main: 14.4 ms average, 10.970 median. #wakeups (per core): 4/23/25/26
this branch: 32.45 ms average, 31.441 median. #wakeups (per core): 12813/12781/12215/11863
feat(smp): re-add an anti-spurious-wakeup feature #2470: 10.2 ms average, 8.512 median, #wakeups (per core): 11/34/31/28

So it looks like #2470 completely fixes the issue AND brings the missing performance.

EDIT:

Here are the results with 8 cores.

main: only 15 successful attempts. 886ms average, 775.791ms median
this branch: 3858ms average, 3690ms median
feat(smp): re-add an anti-spurious-wakeup feature #2470: 608ms average, 523ms median.

Although I must say that I observe quite a lot of variability on this benchmark, so maybe it's not the best case to verify multicore performance.

mkroening · 2026-06-16T08:30:35Z

#2470 brings the performance back down somewhat, but not better than before (before 30ms, this PR 700ms, that PR 38ms).

Still, correctness first, this bug is annoying. Let's merge this and try to gain the performance back eventually. Thanks again for this! :)

My remark on "single thread at a time" was due to looking deeper into the futexes usage by rayon, and noticing that basically all threads are always waiting for the lock, so a single thread is used at a time - which you can also observe because iirc the single core time is faster than multiple cores.

I did not look at the futexes in detail yet, but this should scale with more cores. It does so on Linux in an expected way. On Hermit, it is weird. I opened #2480 to discuss this.

smp: always wake up core when ready

f4942c9

This is a partial revert of commit f0bb0a0

zyuiop force-pushed the feat/smp-race-cond branch from 1b60684 to f4942c9 Compare June 6, 2026 14:26

github-actions Bot reviewed Jun 6, 2026

View reviewed changes

zyuiop mentioned this pull request Jun 6, 2026

feat: bucketized futexes #2469

Open

zyuiop mentioned this pull request Jun 6, 2026

feat(smp): re-add an anti-spurious-wakeup feature #2470

Open

mkroening self-assigned this Jun 8, 2026

mkroening self-requested a review June 8, 2026 13:36

mkroening requested changes Jun 15, 2026

View reviewed changes

mkroening approved these changes Jun 16, 2026

View reviewed changes

mkroening added this pull request to the merge queue Jun 16, 2026

Merged via the queue into hermit-os:main with commit 30cb3f9 Jun 16, 2026
20 checks passed

zyuiop mentioned this pull request Jun 16, 2026

Scheduler scales poorly #2480

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

smp: always wake up core when ready#2468

smp: always wake up core when ready#2468
mkroening merged 1 commit into
hermit-os:mainfrom
zyuiop:feat/smp-race-cond

zyuiop commented Jun 6, 2026 •

edited

Loading

Uh oh!

github-actions Bot left a comment •

edited

Loading

Uh oh!

zyuiop commented Jun 6, 2026

Uh oh!

mkroening left a comment

Uh oh!

zyuiop commented Jun 15, 2026 •

edited

Loading

Uh oh!

mkroening commented Jun 15, 2026 •

edited

Loading

Uh oh!

zyuiop commented Jun 15, 2026 •

edited

Loading

Uh oh!

mkroening commented Jun 16, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

zyuiop commented Jun 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Context

The Bug (i think)

The fix

Notes

Uh oh!

github-actions Bot left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Benchmark Results

Uh oh!

zyuiop commented Jun 6, 2026

Uh oh!

mkroening left a comment

Choose a reason for hiding this comment

Uh oh!

zyuiop commented Jun 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mkroening commented Jun 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

zyuiop commented Jun 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mkroening commented Jun 16, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

zyuiop commented Jun 6, 2026 •

edited

Loading

github-actions Bot left a comment •

edited

Loading

zyuiop commented Jun 15, 2026 •

edited

Loading

mkroening commented Jun 15, 2026 •

edited

Loading

zyuiop commented Jun 15, 2026 •

edited

Loading