
WIP: Test.run intermittent <<loop>> under -threaded -N #156

Draft
omnibs wants to merge 2 commits into trunk from parallel-loop-bug-repro

omnibs commented Jun 4, 2026

Member

WIP: Test.run intermittent <<loop>> under -threaded -N

WIP: this PR adds a minimal reproduction under
nri-prelude/scripts/parallel-loop-bug so we can work on a fix.
No library change yet.

Summary

A test executable built with Test.run, compiled with -threaded and run with
+RTS -N (multiple capabilities), intermittently crashes with an uncaught
<<loop>> (NonTermination) before printing a report. In CI this shows up as a
flake (~0.2–0.3% of runs on a 12-core box). It comes from the parallel
execution of ungrouped tests
(Test.Internal.runStrategyTask.parallelAsync.forConcurrently).
Wrapping the suite in Test.serialize makes it disappear, which is the current
workaround.

Reproduction

See scripts/parallel-loop-bug/ (README + Main.hs). It's a dozen ungrouped
tests, each decoding a tiny document to a distinct type. Build it with the
parallel-loop-bug flag enabled and run it in a loop on a multi-core box:

cabal build parallel-loop-bug -fparallel-loop-bug
BIN=$(cabal list-bin parallel-loop-bug -fparallel-loop-bug)
loops=0
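for i in $(seq 1 10000); do
  if ! "$BIN" +RTS -N -RTS >/dev/null 2>err; then
    # Count the run only if it failed AND stderr contains <<loop>>.
    # (a || b && c parses as (a || b) && c, which would also count successes.)
    grep -q '<<loop>>' err && loops=$((loops+1))
  fi
done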
echo "loop crashes: $loops / 10000"

Observed 27/10000 (~0.27%) at -N12; 0/10000 at -N1.
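
The -N12 vs -N1 comparison generalizes to a sweep over capability counts. A minimal sketch, assuming the binary accepts +RTS -N<x> (the function names count_loops and sweep are hypothetical, not part of the repro script):

```shell
# Hypothetical helper: count_loops BIN CAPS RUNS
# Runs BIN with +RTS -N<CAPS> -RTS RUNS times and prints how many runs
# both exited non-zero and wrote <<loop>> to stderr.
count_loops() {
  cl_bin="$1"; cl_caps="$2"; cl_runs="$3"
  cl_loops=0
  cl_i=0
  cl_err=$(mktemp)
  while [ "$cl_i" -lt "$cl_runs" ]; do
    # Only inspect stderr when the process actually failed.
    if ! "$cl_bin" +RTS "-N$cl_caps" -RTS >/dev/null 2>"$cl_err"; then
      if grep -q '<<loop>>' "$cl_err"; then
        cl_loops=$((cl_loops + 1))
      fi
    fi
    cl_i=$((cl_i + 1))
  done
  rm -f "$cl_err"
  echo "$cl_loops"
}

# Hypothetical helper: sweep BIN RUNS CAPS...
# Prints one "-N<caps>: <loops> / <runs>" line per capability count.
sweep() {
  sw_bin="$1"; sw_runs="$2"
  shift 2
  for sw_caps in "$@"; do
    echo "-N$sw_caps: $(count_loops "$sw_bin" "$sw_caps" "$sw_runs") / $sw_runs"
  done
}
```

With BIN set as above, something like `sweep "$BIN" 10000 1 2 4 8 12` would print one line per -N value.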

Symptom

  • Exits non-zero with <<loop>> on stderr and no test report.
  • The exception is uncaught — it escapes the per-test bodies (which the
    runner wraps and reports as failures), so it originates in the
    orchestration/forcing layer, not inside a test.
  • Fully intermittent: same binary, same inputs.
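
The symptom above can be told apart from an ordinary test failure mechanically. A rough sketch, assuming the test report goes to stdout, so an empty stdout file means "no report" (classify_run is a hypothetical helper, not part of the repro):

```shell
# Hypothetical helper: classify_run OUT_FILE ERR_FILE EXIT_CODE
# Prints one of: ok, loop, other-failure.
classify_run() {
  cr_out="$1"; cr_err="$2"; cr_code="$3"
  if [ "$cr_code" -eq 0 ]; then
    echo ok
  elif grep -q '<<loop>>' "$cr_err" && [ ! -s "$cr_out" ]; then
    # Non-zero exit, <<loop>> on stderr, and nothing on stdout (no report).
    echo loop
  else
    # Non-zero exit for some other reason, e.g. a reported test failure.
    echo other-failure
  fi
}
```

For example: `"$BIN" +RTS -N -RTS >out 2>err; classify_run out err $?`.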

Observations (GHC 9.8.4, 12-core, looping the process)

| Configuration | <<loop>> rate |
| --- | --- |
| -N1 / -N2 | 0 |
| -N4 | ~0.07% |
| -N8 / -N12 | ~0.3% |
| 12 distinct YAML decoders (this repro), -N12 | 27 / 10000 |
| 50 identical decoders (one type, repeated), -N12 | 0 |
| 200 trivial Expect.pass tests, -N12 | 0 |
| aeson decoders: trivial "{}", rich nested record, AND 20 recursive types, -N12 | 0 (all) |
| Same decode via bare Async.forConcurrently (no Test.run), -N12 | 0 |
| Suite wrapped in Test.serialize, -N12 | 0 / 8000 |

Takeaways:

  1. Concurrency-gated: 0 at -N1/-N2; scales with capability count.
  2. Needs the runner: bare Async.forConcurrently of the same decode work
    never reproduced it; going through Test.run does.
  3. Needs a variety of distinct decoders, not volume: identical decoders
    and Expect.pass suites never loop; ~12 distinct ones do.
  4. Appears specific to the YAML/libyaml path: a pure-aeson equivalent did
    not reproduce in any variant tried (trivial "{}", a rich nested record, and
    20 distinct recursive types — all 0/10000), while YAML does. So the trigger
    seems tied to what the yaml package does (libyaml over FFI, with
    unsafePerformIO), not to concurrent decoding in general.
  5. Test.serialize reliably avoids it.
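
To gauge how much weight the zero-crash rows carry, here is a back-of-the-envelope check. It assumes runs are independent and that the true crash rate in the control equals the ~0.27% observed for the YAML repro; under those assumptions the chance of seeing zero crashes in n runs is (1 - p)^n. p_zero is a hypothetical helper:

```shell
# Hypothetical helper: p_zero RATE RUNS
# Prints (1 - RATE)^RUNS: the probability of observing zero crashes in RUNS
# independent runs if each run crashed with probability RATE.
p_zero() {
  awk -v p="$1" -v n="$2" 'BEGIN { printf "%.1e\n", exp(n * log(1 - p)) }'
}
```

`p_zero 0.0027 8000` comes out at roughly 4e-10, so 0 / 8000 under Test.serialize is very unlikely to be luck if the rate were unchanged. The same check applies to the 0 / 10000 aeson runs.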

What we ruled out

  • A cyclic binding in user code: -N1 is clean across thousands of runs; a
    real self-referential thunk would loop regardless of -N.
  • GHC #13751 (<<loop>> under concurrent STM) — fixed in 8.2.1; this is 9.8.4.
  • Concurrent decoding in general: a pure-aeson equivalent did not reproduce
    (see above), so the trigger appears specific to the libyaml-backed YAML
    path. (We did not exhaustively rule out aeson.)

Hypothesis

The threaded RTS's black-hole / deadlock detector fires on concurrent forcing
of multiple distinct shared CAFs (per-type decoder dictionaries/thunks). When
several capabilities each enter a different CAF and then need one another's
(already black-holed) CAFs, an unlucky interleaving forms a transient
block-cycle that the RTS reports as NonTermination. This is consistent with
identical decoders not tripping it, with the rate scaling with -N, and with
profiling masking it: a -fprof-late build run with +RTS -xc does not
reproduce, so we have no Haskell-level stack trace, which points to a timing
race rather than a deterministic cycle.

Open questions / directions

  1. Is this a known interaction with Task.parallel? Should test parallelism be
    opt-in or capped, or serialize documented as the remedy for <<loop>>?
  2. Does anything in runSingle (per-test tracing-span MVar/IORef, the
    Task.timeout racer) widen the window, vs. this being a pure GHC RTS issue
    to report upstream?

🤖 Generated with Claude Code

omnibs and others added 2 commits June 4, 2026 10:36
A minimal reproduction for an intermittent `<<loop>>` (NonTermination)
crash from Test.run under -threaded with multiple capabilities (+RTS -N):
a dozen ungrouped tests, each decoding a tiny document to a distinct type
(a distinct decoder CAF), run in parallel by the test runner. Crashes in
up to ~0.3% of runs at -N>=4 and never at -N1.

Flag-gated behind `parallel-loop-bug` (off by default), so normal builds
are unaffected.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Tried aeson eitherDecodeStrict' (trivial "{}", a rich nested record, and
20 distinct recursive types) — all 0/10000 at -N12, vs ~0.27% for the YAML
version. The trigger appears specific to the libyaml-backed YAML decode
path, not concurrent decoding in general.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

omnibs commented Jun 4, 2026

Member Author

This is just a repro script, not a fix!
