WIP: `Test.run` intermittent `<<loop>>` under `-threaded -N` by omnibs · Pull Request #156 · NoRedInk/haskell-libraries

omnibs · 2026-06-04T18:07:28Z

WIP: `Test.run` intermittent `<<loop>>` under `-threaded -N`

WIP: this PR adds a minimal reproduction under
nri-prelude/scripts/parallel-loop-bug so we can work on a fix. No library
change yet.

Summary

A test executable built with Test.run, compiled with -threaded and run with
+RTS -N (multiple capabilities), intermittently crashes with an uncaught
<<loop>> (NonTermination) exception before printing a report, i.e. a CI flake
(~0.2–0.3% of runs on a 12-core box). It comes from the parallel execution of
ungrouped tests (Test.Internal.runStrategy → Task.parallel →
Async.forConcurrently). Wrapping the suite in Test.serialize makes it
disappear, which is the current workaround.

Reproduction

See scripts/parallel-loop-bug/ (README + Main.hs). It's a dozen ungrouped
tests, each decoding a tiny YAML document to a distinct type. Build it with
the parallel-loop-bug cabal flag enabled and loop it on a multi-core box:

cabal build parallel-loop-bug -fparallel-loop-bug
BIN=$(cabal list-bin parallel-loop-bug -fparallel-loop-bug)
loops=0
for i in $(seq 1 10000); do
  "$BIN" +RTS -N -RTS >/dev/null 2>err || { grep -q '<<loop>>' err && loops=$((loops+1)); }
done
echo "loop crashes: $loops / 10000"

Observed 27/10000 (~0.27%) at -N12; 0/10000 at -N1.
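
To put the 0/10000 result at -N1 in context: with zero crashes in n independent runs, the approximate 95% upper bound on the true crash rate is 3/n (the "rule of three"). The sketch below does that arithmetic. The helper names rate_pct and zero_upper_pct are made up for illustration and are not part of the repro script.

```shell
# rate_pct CRASHES RUNS
# Print the observed crash rate as a percentage with two decimals.
rate_pct() {
  awk -v c="$1" -v n="$2" 'BEGIN { printf "%.2f\n", 100 * c / n }'
}

# zero_upper_pct RUNS
# Print the approximate 95% upper bound (in percent) on the true crash rate
# when 0 crashes were observed in RUNS independent runs (rule of three: 3/n).
zero_upper_pct() {
  awk -v n="$1" 'BEGIN { printf "%.2f\n", 100 * 3 / n }'
}
```

With the numbers above, `rate_pct 27 10000` prints 0.27 and `zero_upper_pct 10000` prints 0.03. Assuming runs are independent, the -N1 result is therefore consistent with a true rate roughly an order of magnitude below the -N12 rate, not merely with an unlucky sample.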

Symptom

- Exits non-zero with <<loop>> on stderr and no test report.
- The exception is uncaught. Exceptions thrown inside a test body are wrapped
  by the runner and reported as failures; this one escapes that wrapping, so
  it most likely originates in the orchestration/forcing layer, not inside a
  test.
- Fully intermittent: same binary, same inputs.

Observations (GHC 9.8.4, 12-core, looping the process)

| Configuration | `<<loop>>` rate |
| --- | --- |
| `-N1` / `-N2` | 0 |
| `-N4` | ~0.07% |
| `-N8` / `-N12` | ~0.3% |
| 12 distinct YAML decoders (this repro), `-N12` | 27 / 10000 |
| 50 identical decoders (one type, repeated), `-N12` | 0 |
| 200 trivial `Expect.pass` tests, `-N12` | 0 |
| aeson decoders (trivial `"{}"`, rich nested record, and 20 recursive types), `-N12` | 0 (all) |
| Same decode via bare `Async.forConcurrently` (no `Test.run`), `-N12` | 0 |
| Suite wrapped in `Test.serialize`, `-N12` | 0 / 8000 |
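
The per -N rows can be regenerated with a loop like the one below. This is a sketch, not the script used to produce the numbers above: count_loops is a hypothetical helper, and it assumes BIN has been set as in the Reproduction section. It counts only runs that both exit non-zero and print <<loop>> on stderr, so ordinary test failures are not counted as loop crashes.

```shell
# count_loops BIN CAPS RUNS
# Run BIN RUNS times with +RTS -N<CAPS> and print how many runs exited
# non-zero with <<loop>> on stderr. The body runs in a subshell, so its
# variables do not leak into the caller.
count_loops() (
  bin=$1 caps=$2 runs=$3
  loops=0
  err=$(mktemp)
  i=0
  while [ "$i" -lt "$runs" ]; do
    # "! cmd && grep" parses as "(! cmd) && grep": count only failed runs
    # whose stderr contains the <<loop>> marker.
    if ! "$bin" +RTS "-N$caps" -RTS >/dev/null 2>"$err" && grep -q '<<loop>>' "$err"; then
      loops=$((loops + 1))
    fi
    i=$((i + 1))
  done
  rm -f "$err"
  echo "$loops"
)
```

A sweep could then look like `for n in 1 2 4 8 12; do echo "-N$n: $(count_loops "$BIN" "$n" 10000) / 10000"; done`.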

Takeaways:

- Concurrency-gated: 0 at -N1/-N2; scales with capability count.
- Needs the runner: bare Async.forConcurrently of the same decode work
  never reproduced it; going through Test.run does.
- Needs a variety of distinct decoders, not volume: identical decoders
  and Expect.pass suites never loop; ~12 distinct ones do.
- Appears specific to the YAML/libyaml path: a pure-aeson equivalent did
  not reproduce in any variant tried (trivial "{}", a rich nested record, and
  20 distinct recursive types, all 0/10000), while YAML does. So the trigger
  seems tied to what the yaml package does (libyaml over FFI, with
  unsafePerformIO), not to concurrent decoding in general.
- Test.serialize reliably avoids it (0 / 8000 runs).

What we ruled out

- A cyclic binding in user code: -N1 is clean across thousands of runs; a
  real self-referential thunk would loop regardless of -N.
- GHC #13751 (<<loop>> under concurrent STM): fixed in 8.2.1; this is 9.8.4.
- Concurrent decoding in general: a pure-aeson equivalent did not reproduce
  (see above), so the problem appears specific to the libyaml-backed YAML
  path. (We did not exhaustively rule out aeson.)

Hypothesis

The threaded RTS's black-hole / deadlock detection firing on concurrent
forcing of multiple distinct shared CAFs (per-type decoder
dictionaries/thunks). When several capabilities each enter a different CAF
and then need one another's (already black-holed) CAFs, an unlucky
interleaving forms a transient block-cycle that the RTS reports as
NonTermination. This is consistent with identical decoders not tripping it,
with the rate scaling with -N, and with the crash being masked by profiling:
a -fprof-late build run with +RTS -xc does not reproduce, so there is no
Haskell-level stack to go on. That points to a timing race, not a
deterministic cycle.

Open questions / directions

- Is this a known interaction with Task.parallel? Should test parallelism be
  opt-in or capped, or Test.serialize documented as the remedy for <<loop>>?
- Does anything in runSingle (per-test tracing-span MVar/IORef, the
  Task.timeout racer) widen the window, or is this a pure GHC RTS issue
  to report upstream?

🤖 Generated with Claude Code

A minimal reproduction for an intermittent `<<loop>>` (NonTermination) crash from Test.run under -threaded with multiple capabilities (+RTS -N): a dozen ungrouped tests, each decoding a tiny document to a distinct type (a distinct decoder CAF), run in parallel by the test runner. Crashes in up to ~0.3% of runs at -N>=4 and never at -N1. Flag-gated behind `parallel-loop-bug` (off by default), so normal builds are unaffected. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Tried aeson eitherDecodeStrict' (trivial "{}", a rich nested record, and 20 distinct recursive types) — all 0/10000 at -N12, vs ~0.27% for the YAML version. The trigger appears specific to the libyaml-backed YAML decode path, not concurrent decoding in general. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

omnibs · 2026-06-04T18:25:04Z

This is just a repro script, not a fix!

omnibs and others added 2 commits June 4, 2026 10:36

ArthurJordao approved these changes Jun 4, 2026

omnibs wants to merge 2 commits into trunk from parallel-loop-bug-repro

2 participants
