Skip to content

opt(#219): dissolve the inlined seam — windowed SROA + carrier-forwarding + narrow-local#252

Merged
avrabe merged 21 commits into
mainfrom
feat/219-seam-sroa
Jul 1, 2026
Merged

opt(#219): dissolve the inlined seam — windowed SROA + carrier-forwarding + narrow-local#252
avrabe merged 21 commits into
mainfrom
feat/219-seam-sroa

Conversation

@avrabe

@avrabe avrabe commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

Closes #219.

Brings the fully-built seam-dissolution optimization up to date with main (v1.1.17) and certifies it via the #238 behavioral differential gate — the agreed substitute for the single-board G474RE silicon gate it was previously parked on.

What it does

Windowed SROA + instruction-level carrier-forwarding + narrow-local analysis dissolve the void-return seam left by inlining. On the gale repro (sem.loom.wasm) the seam collapses 5254 → 1888 bytes.

Brought up to date with main (merge, one resolution pass)

Validation

🤖 Generated with Claude Code

avrabe and others added 21 commits June 15, 2026 03:10
Faithful local repro of gale #219's C<->Rust decide seam: `decide` returns
{action, new_count} packed into an i64; the shim unpacks them. After
`--passes inline` the `run` body contains the pack (extend_i32_u/shl/or) →
i64 carrier local → unpack (and/shr_u/wrap_i64) round-trip on one value —
exactly the residue to dissolve.

Verified: optimizes + validates today; inline leaves the round-trip intact
(confirmed via `loom optimize --passes inline` + wasm-tools print). The
seam-SROA pass (next) must forward the scalar components and drop the carrier,
yielding `(action & 0xff) + new_count` — each rewrite Z3-proven.

Refs #219
gale attached the authentic post-loom-inline dissolved z_impl_k_sem_give on the
issue. Decoded + verified: sha256 f81da42d… (matches), 5254 B, valid wasm. It
contains the u64 pack/unpack round-trip to dissolve — pack (extend_i32_u/shl/or)
at body lines 59-63, unpack (and/shr_u/wrap_i64) at 90/104, with a dead i64
carrier. Grounds the seam-SROA pass against the real body, not just the
synthetic fixture.

Design confirmed on #219 (proof-carrying: Z3 obligation per ISLE rewrite;
ISLE-term route preferred). Implementation (mem2reg + pack/unpack forwarding +
DCE) follows.

Refs #219
…tend_i32_u)

Lays the ISLE-layer foundation for the #219 pack/unpack forwarding rules:
- types.isle: I32WrapI64 / I64ExtendI32U ValueData variants.
- constructors.isle: decl + extern constructor + extern extractor for each.
- loom-shared/src/lib.rs: i32_wrap_i64_extract / i64_extend_i32_u_extract
  (mirror iadd32_extract, single-field).
- build.rs: register seam_sroa.isle AND add the two extractors to the
  fix_extractor_ownership list (cranelift-isle 0.132.1 emits extractor args as
  owned `Value`; the post-processor rewrites them to `&Value` — without this the
  generated trie move-then-borrows arg0 and loom-shared won't compile).
- seam_sroa.isle: the unconditional, Z3-trivial rule
  `wrap_i64(extend_i32_u(x)) → x`.

loom-shared builds clean.

IMPORTANT (next step): the generated ISLE `simplify` is NOT in loom's live
rewrite path — build.rs documents this, and the active rewriters are the Rust
`rewrite_pure_impl` / `rewrite_with_dataflow` (loom-shared/src/lib.rs), validated
by the Z3 translation validator per pass. So this rule does not fire yet; the
LIVE seam-SROA rewrite must be added as a `rewrite_pure_impl` match arm
(ValueData::I32WrapI64{val} where val is I64ExtendI32U{inner} → inner), with the
Z3 verifier discharging the obligation. ISLE terms retained as proof-layer
documentation of the identity.

Refs #219
The actual rewrite (the ISLE rule committed earlier is in loom's dead simplify
path; the live rewriters are the Rust rewrite_pure_impl/rewrite_with_dataflow,
validated by the Z3 translation validator per pass). Adds rewrite_pure_impl
match arms for the two #219 conversion terms:

- I32WrapI64: recurse operand; if it is I64ExtendI32U(inner) → inner
  (unconditionally sound: extract(31,0, zero_ext(32,x)) = x — the u64-ABI
  round-trip a packed scalar pays at the dissolved decide seam); also folds
  wrap_i64(i64.const N) → i32.const (low 32).
- I64ExtendI32U: recurse operand; folds extend_i32_u(i32.const N) → i64.const
  (zero-extended).

Test: test_seam_sroa_wrap_extend_identity — wrap_i64(extend_i32_u(local.get 0))
dissolves to local.get 0. Full `cargo test --release --lib --test-threads=4`
green, 0 failures.

This is the foundation case. The sem repro's full dissolution additionally
needs the carrier mem2reg + the masked and/shr_u forwarding rules (steps 2-3),
since its unpack uses and/shr_u (not wrap) on the packed u64.

Refs #219
…lves)

Dissolves the u64-pack unpack arithmetic, in rewrite_pure_impl (Z3-validated per
pass; pure-integer so the verifier actually verifies — no skip). Each rule is
unconditionally sound with only constant side-conditions (no range analysis):

I64And (the `& mask` unpack):
- (shl Z k) & M → 0           when M's set bits are all below k (M>>k==0, 0<k<64)
- (or A B) & M → survivor & M  when one OR operand is such a cleared shift

I64ShrU (the `>> k` unpack):
- (shr_u (shl Z k) k) → Z & (u64::MAX>>k)   (shl-then-shr-same masks low 64-k)
- (shr_u (or P Q) k) → (P>>k)|(Q>>k)         when an operand is (shl _ k)
  (logical shift distributes over OR; targeted to the pack shape — no general
  bloat; recursion collapses the shl side via the rule above)

Together these reduce a forwarded pack `(or (shl (extend_u hi) 32) lo)`:
  & 0xff  → lo & 0xff                         (low field)
  >> 32   → (extend_u hi) & 0xffffffff         (high field; lo const shifts out)

Tests: test_seam_sroa_mask_clears_shifted_pack, test_seam_sroa_shr_extracts_
high_field (+ the step-1 wrap/extend test). Full `cargo test --release --lib
--test-threads=4` green (399/28/17, 0 failures).

These fire once inlining delivers pack/unpack adjacency (the inline-modelability
extension for control-flow + trap-call decides is the remaining prerequisite,
tracked separately — higher blast radius, needs gale silicon validation).

Refs #219
…ier)

Design for the #219 inline prerequisite: wire in precise br_table/br_if/br
symbolic execution (the deferred ExecutionState/merge_states scaffolding) so the
verifier can soundly prove inline-equivalence of br_table-dispatching decides.

Approach: state-merging-with-ITE via a label/continuation stack (NOT path
forking), exact for the acyclic single-exit case. Grounded in Alive2 (PLDI'21,
the production SMT bounded-TV precedent), Kuznetsov state-merging (PLDI'12), WASP
(ECOOP'22 wasm br_table rules), Boogie block-predicate WP. Includes the soundness
argument, the explicit pitfalls (selector constraint, fall-through join, ⊥-not-
havoc for trap arms, branch arity, over-approx-and-skip), 2-phase staging
(sem_give integer first, then memory for mutex/pipe), and the adversarial
verify-or-revert + silicon test gate. For gale review before implementation.

Refs #219
…tection)

First component of the PR-C precise-CF inline-verifier extension (design approved
by gale on #219, 2026-06-17). `is_noreturn_callee(func)`: conservative-sound —
true iff the body has no Return and no Br/BrIf/BrTable anywhere (recursing) and
ends in Unreachable. Admits exactly the core::panicking helpers (panic_fmt →
unreachable; panic_const_add_overflow → call panic_fmt; unreachable) and rejects
any normal-returning or branching function.

The by-body inline modeler (PR-C M3, next) will use this to encode a `call` to
such a callee as ⊥ "diverge if reached" — never havoc — so a trapping branch
constrains only reachability, never the post-state (Alive2 `ub` discipline;
avoids the #155/#159 havoc-vs-concrete false-positive surface).

Test: test_is_noreturn_callee (bare trap / trap-after-call / normal-return /
no-trailing-trap / branch-escape). #[allow(dead_code)] until the M3 gate wires it.

Refs #219
The soundness-critical core of the inline-modelability extension: a precise
symbolic executor for structured, acyclic wasm control flow (block / if / br /
br_if / br_table) by state-merging-with-ITE over a label/join stack — never
path-forking. For a bounded acyclic region this is EXACT (a passing SMT check is
a real proof). Precedent: Alive2 (PLDI'21), Kuznetsov state-merging (PLDI'12),
WASP br_table rules (ECOOP'22). See docs/design/pr-c-precise-cf-verification.md.

Built as a NEW ISOLATED function, not by mutating the shared encoders
(encode_block_body / encode_function_to_smt_impl_inner stay untouched -> zero
regression risk to existing passes). encode_simple_instruction's catch-all
HAVOCS (pushes a fresh BV without popping operands), so delegating an unmodeled
op would desync the stack; instead an allowlist gate (is_acyclic_modelable_body)
front-loads soundness and the executor runs only on admitted functions.
Over-approximate-and-skip everything else: loops (back-edge), calls, memory,
unknown opcodes, multi-result/param blocks.

br_table default guard is UNSIGNED bvuge(sel, n) — a signed >= would leave
selectors in [2^31, 2^32) in a partition gap.

All #[cfg(feature="verification")] + #[allow(dead_code)] until M3 wires it into
inline_functions' verify-or-revert. Tests (5, green): br_table switch proves
equivalent to the if/else chain; identity; detects wrong-default and
swapped-arm counterexamples; gate rejects loops/memory, admits the switch.
Also folds in cargo-fmt drift in loom-shared from #219 SROA @4356441.

Refs #219
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…e acyclic executor

Extends exec_acyclic to model a `call F` BY BODY (the inline-soundness principle
from #155 extended to control flow): pop args, recursively encode the callee
through exec_acyclic with params bound to the args, declared locals zero-init,
globals SHARED with the caller (callee global writes propagate back); push the
merged single result and narrow the caller path to the callee's returning paths.
A call to a no-return callee (is_noreturn_callee: panic -> unreachable) sets
reachable=false — the path DIVERGES (drops out of the result ITE), NEVER havoc
(Alive2 ub discipline; havoc reopens the #155/#159 false-positive surface).

Threads VerificationSignatureContext (function_bodies, the existing #151 callee
lookup) + a recursion depth into exec_acyclic / encode_acyclic_function /
is_acyclic_modelable_body. MAX_ACYCLIC_CALL_DEPTH=8 bounds by-body inlining and
makes a cyclic/recursive call graph fail safely (reject, never infinite-unroll).
The gate now admits Call(g) iff g is a local function that is no-return or
recursively acyclic-modelable; imports/call_indirect/over-deep stay rejected.
exec_acyclic remains the soundness enforcer (Err, never havoc, on anything
unmodelable).

Still #[cfg(verification)] + #[allow(dead_code)] until M3 wires it into
inline_functions' verify-or-revert. Tests (7 total, green): + call-by-body of a
br_table callee proves equivalent to its inlined switch (the seam capability);
+ no-return call diverges not havoc.

Refs #219
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Adds the equivalence harness over the precise acyclic CF executor: encodes both
functions via encode_acyclic_function (deterministic input names unify in Z3) and
checks a == b has no counterexample. Mirrors verify_function_equivalence's
soundness boundaries (no width-matching the final equality; Unknown -> Err).

Returns Err whenever either side is not acyclic-modelable, a width mismatches, or
the encoding fails — so the caller (inline_functions' verify-or-revert, wired in
M3.2) FALLS BACK to the existing encoder. This only ADDS proving power for the
br_table-callee inline case; it never removes a check. Void functions defer to
the fallback (no side-effect equivalence modeled yet).

Still #[allow(dead_code)] until M3.2 wires it into inline_functions. Tests
(9 total, green): + test_verify_acyclic_equivalence_inline — a caller that
calls a br_table decide verifies equivalent to the inlined switch, a
wrong-default body does NOT verify, and a loop function Errs (caller falls back).

Refs #219
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…liner

TranslationValidator::verify() now tries verify_acyclic_equivalence for the
inline pass before the existing encoder. The main encoder models br_table
approximately and a br_table callee is not straight-line-by-body-modelable, so
it cannot prove `call F` (br_table callee) equal to its inlined body → the inline
reverts. The acyclic executor is EXACT for acyclic CF + by-body calls, so it can.

PURELY ADDITIVE + scoped to pass_name == "inline_functions": ONLY a precise
Ok(true) short-circuits to accept; Ok(false)/Err/panic FALL THROUGH to the
existing verify_function_equivalence_with_context unchanged (acyclic call wrapped
in its own catch_unwind). This can only ADD accepts (the br_table-callee inlines),
never weaken an existing check; every non-inline pass is byte-identical. This also
makes the M1a/M1b/M3.1 executor chain live (the #[allow(dead_code)] markers are
now harmless no-ops).

No br_table callee is admitted as an inline CANDIDATE yet (function_inline_modelable
is still straight-line — M3.3), so this is a behavioral no-op on the current corpus:
full `cargo test --release --lib` is 408 passed / 0 failed / 2 ignored, unchanged.
clippy + fmt clean. The capability is proven by the 9 acyclic unit tests; M3.3
enables it end-to-end on the seam.

Refs #219
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…+ Unreachable trap

function_inline_modelable now admits a callee under EITHER regime A (straight-line
incl. memory — the existing by-body modeler, #155/#157/#161) OR regime B (acyclic
control flow + pure-integer ops + by-body calls to no-return/recursively-modelable
local callees — the precise acyclic executor, PR-C). Mirrors
verify::is_acyclic_modelable_body so the candidate gate admits exactly what the
acyclic verifier can prove (admitting an unprovable callee → caller reverts, the
#159 collateral problem). Threads module context (functions + num_imported_funcs)
into the gate + a recursion-depth bound; a br_table+memory callee passes neither
regime → stays opaque until Phase 2.

Also handles Unreachable as a trap (⊥) in exec_acyclic and both gates: the path
diverges (reachable=false, drops out of the result ITE), never havoc — the seam
decides end their overflow arms in `call panic_const; unreachable`, so this is
required to model them at all.

Full `cargo test --release --lib` = 409 passed / 0 failed / 2 ignored (+1:
test_acyclic_unreachable_diverges). fmt + clippy clean.

SCOPE NOTE (found by inspecting the real repro-219/sem.loom.wasm): decide
(gale_k_sem_give_decide) IS regime-B-eligible (acyclic br_table + i64 pack +
no-return panic call, no memory), so it is now an inline candidate. BUT its caller
z_impl_k_sem_give READS MEMORY (i32.load offset=8/12 for decide's args) and calls
imports, so proving decide's inline INTO z_impl needs the acyclic executor to model
z_impl's memory = PHASE 2. So sem_give's seam will not dissolve until Phase 2
(memory Array threaded through exec_acyclic). M3.1-M3.3 are correct + necessary
infra; Phase 2 is the next real enabler for sem_give.

Refs #219
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ecutor

The acyclic executor now models linear memory (loads/stores), required because
real seam callers (z_impl_k_sem_give) read memory for the decide's args.

AcyclicState gains `memory: Array`. Memory is MUTABLE branch state, so it is
ITE-merged at every join exactly like locals/globals: BranchEntry/MergedState
carry the Array, acyclic_merge_entries ITE-merges it (Bool::ite over the Array
sort), acyclic_push_branch records it, acyclic_apply_join sets it, and the If arm
saves/restores it for the else branch. exec_acyclic handles I32/I64 load/store +
partial-width load8/16/store8/16 EXPLICITLY via the shared encode_i32_load_from /
encode_i64_load_from / encode_*_store_into / encode_partial_* helpers (the same
ones the main encoder + by-body modeler use -> bit-identical memory expressions);
they are NEVER delegated to encode_simple_instruction (which havocs). memory is
threaded through encode_acyclic_call (shared + propagated like globals) and
initialized in encode_acyclic_function as Array::new_const("memory", ...) -- same
name as the main encoder so both encodings unify.

Gates: new is_acyclic_memory_op (verify.rs) admitted by is_acyclic_modelable_body;
lib-side is_acyclic_memory_modelable_instr admitted by the regime-B inline gate.

Call gate stays STRICT (no impure-call havoc yet -- that is Phase 2b, the
soundness-critical part). So sem_give's z_impl (which also calls k_spin_lock import
+ z_unpend) is not yet fully modelable; 2a is the contained, testable foundation.

Full `cargo test --release --lib` = 410 passed / 0 failed / 2 ignored. New test
test_acyclic_memory_merge: store-on-each-branch then load proves equivalent to the
direct if/else (validates the per-join memory ITE-merge feeds the loaded value).
test_acyclic_modelable_gate updated (memory now admitted). fmt + clippy clean.

Refs #219
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The acyclic executor now models impure calls (imports + non-acyclic-modelable
local callees) by HAVOC, so memory-bearing seam callers like z_impl_k_sem_give
(which call k_spin_lock + z_unpend) become verifier-modelable end-to-end.

exec_acyclic's Call arm is now 3-tier (mirrors the main encoder): no-return ->
diverge; acyclic single-result local -> by-body (encode_acyclic_call); else
(import / multi-result / too-deep / loop-containing) -> acyclic_havoc_call: pop
args, push fresh `call_{g}_result_{i}`, havoc all globals + memory with a SHARED
per-encode counter. The counter is threaded (&mut u64) through exec_acyclic +
encode_acyclic_call and reset per encode_acyclic_function -- so the Nth impure
call gets the SAME havoc id in the original and optimized encodings -> they unify
in Z3. Havoc is conservative-sound: it never forges a value equal to a concrete
one, so a havoc-vs-concrete mismatch refuses to prove (the #155/#159 guard).

is_acyclic_modelable_body is now purely STRUCTURAL (drops ctx/depth params -- any
direct call is modelable since havoc covers the opaque case; clippy
only_used_in_recursion). The lib candidate gate is unchanged (decide already
admitted via its no-return panic call).

Full `cargo test --release --lib` = 413 passed / 0 failed / 2 ignored. New tests:
test_acyclic_havoc_call_unifies (havoc id parity); test_acyclic_havoc_not_concrete
(havoc'd result NOT provably equal to a concrete value -- soundness);
test_acyclic_havoc_memory (load after impure call reads havoc'd memory, not the
initial -- soundness). fmt + clippy clean.

Next: e2e dissolve on repro-219/sem.loom.wasm (decide's inline into z_impl should
now be PROVEN by the acyclic path -> no `call $gale_k_sem_give_decide`).

Refs #219
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ow (dormant)

Foundation for the #219 perf milestone (dissolve the u64 ABI pack/unpack the
proven inline leaves in z_impl). Extends rewrite_with_dataflow to forward a
SINGLE-ASSIGNMENT local's pure defining expression to its uses (carrier
mem2reg), gated by a caller-supplied `OptimizationEnv.single_assign` set --
EMPTY by default, so this is DORMANT and behavior is byte-identical to before
(confirmed: full loom-core lib suite 413 passed / 0 failed; loom-shared clippy
clean).

Soundness of the mechanism (the loom-core pass that seeds single_assign lands
next, and only seeds TOP-LEVEL single-assignment locals so the def dominates all
uses):
- is_forwardable_expr: pure-op whitelist (const/local/global/int
  arith-bitwise-shift-rotate-compare/converts/select); calls, loads/stores,
  div/rem, floats -> not forwardable.
- reaching-defs guard: invalidate_pins_referencing drops a pinned forwarding
  whenever an input local is reassigned (in-traversal for straight-line/Block/
  Loop; the If arm intersects both forks' pins so a branch that reassigns an
  input drops the pin). So a forwarded expression always carries its def-time
  inputs.
- pinned survives CF clear() (sound for single-assignment: one value on every
  path reaching a use).

This transform is NOT Z3-backstoppable for the seam (void z_impl + value flows
only into a havoc'd impure-call arg -> vacuous pass); gale's G474RE silicon is the
behavioral gate (per #219 -- gale will flash inline-only vs inline+teardown).

Refs #219
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…mechanism wired)

Activates the #219 carrier scalar-forwarding via the dormant mechanism (@186d236):
new optimize pass forward_carrier_locals + `forward-carrier` CLI pass (after
inline) + a call in optimize_module Phase 1b. Per function it forwards locals with
exactly one write whose single write is at the FUNCTION TOP LEVEL (def dominates
all uses), then rewrite_pure applies the SROA rules; verify_or_revert guards (gale
silicon is the behavioral gate for the void seam). Full lib suite 413/0; fmt +
clippy clean.

KNOWN-CONSERVATIVE (does not yet dissolve the sem_give seam): the seam carrier
(local.tee 3 in inlined z_impl) is NESTED in `block @2`, not top level, so the
top-level proxy excludes it. Verified by hand it IS sound to forward -- the use
`local.get 3` is reached only via `br_if @2` positioned AFTER the tee, so the def
dominates the use (overflow arm traps via `unreachable`, never reaching it), and
inputs local4/local7 are set only before the def. Admitting it needs a real
TRAP-AWARE STRUCTURED DOMINANCE check to replace the top-level proxy (next); then
the dual silicon flash (inline-only vs inline+teardown) validates.

Refs #219
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…te generalized)

Replaces the conservative top-level-only proxy in forward_carrier_locals with
carrier_def_dominates_uses: a single-assignment local is forwarded iff its def
DOMINATES all uses by a sound structured-CFG argument (def not under If/Loop; all
uses textually after it; no pre-def Br/BrIf/BrTable escapes the def's enclosing
blocks; returns/unreachable before the def are safe). Admits the nested-but-
dominating seam carrier the top-level proxy excluded. Soundness over completeness
(a wrong `true` = silent miscompile). Full lib suite 414/0 incl. test_carrier_
dominance (straight-line, use-before-def, def-in-If, nested-dominating, escaping-br).
fmt + clippy clean.

NOT YET dissolving the e2e seam: inline+forward-carrier leaves z_impl BYTE-IDENTICAL
to inline-only (same hash) -- the pass makes no NET change to z_impl. The dominance
gate is no longer the blocker; the pass is either finding single_assign empty for
z_impl OR forwarding-then-reverting (revert restores the identical inline body) OR
the SROA rules not firing on forwarded terms (instructions_to_terms DOES handle
z_impl's Call/load/br_table, so not a conversion skip). Next: instrument
forward_carrier_locals (single_assign contents / changed? / reverted?) to pinpoint.

Refs #219
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…able-safe seam teardown)

The term IR round-trip is lossy for control flow (br_if/br_table are pushed as
value terms), so term-based carrier forwarding turns a br_table function like
z_impl stack-invalid. Replace it with an instruction-level transform:

- simplify_pure_windows: apply rewrite_pure only to maximal pure straight-line
  windows (which round-trip faithfully), never across branches.
- extract_carrier_rhs + forward_carrier_in_body: forward a single-assignment
  dominating carrier's pure RHS to its uses, then drop the dead def (tee keeps
  the value on the stack; set drops set+RHS).
- never-pessimize size guard: revert any forwarding that does not shrink the fn.
- rewrite_pure: extend_i32_u(x) & M -> extend_i32_u(x) when M's low 32 bits are
  all set (Z3-sound — a zero-extend already fits in 32 bits).

On the sem seam the &0xff branch fully dissolves and the shl/tee vanish; a
status>>32 residual remains pending a narrow-local value analysis. Sound,
validates, and net-shrinks (verify_or_revert + size guard). gale G474RE silicon
is the behavioral gate.

Refs: #219

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The forwarded unpack (extend(a)<<32 | status) >> 32 distributes (in rewrite_pure)
to extend(a) | (status>>32) but can't finish on its own: status>>32 -> 0 needs
the value-range fact status < 2^32, which context-free SROA can't see.

Add a narrow-local analysis (loom-core mod optimize):
- def_value_is_narrow / narrow_locals: a local is narrow (< 2^32, high 32 bits
  always 0) when EVERY def stores i64.extend_i32_u(_), an i64.const with the high
  32 bits clear, or (_ & i64.const c) with c < 2^32.
- fold_narrow_high_shr: peephole (shr_u (local.get N) k) -> i64.const 0 when N is
  narrow and (k mod 64) >= 32.
- forward_carrier_locals iterates fold+windows to a fixpoint: simplify_pure_windows
  distributes the unpack and EXPOSES the bare status>>32, the next fold zeroes it,
  the next window collapses (or extend 0) -> extend.

z_impl now FULLY dissolves: the store site becomes
  local.get 0; local.get a; i64.extend_i32_u; i64.store32 offset=8
storing the thread pointer directly. shl / tee / or / shr_u and the redundant
mask are all gone. 416 lib tests green, e2e validates.

Refs: #219

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
# Conflicts:
#	loom-cli/src/main.rs
#	loom-core/src/lib.rs
#	loom-shared/src/lib.rs
…fold

The origin/main merge combines #219's optimize_module phases
(forward_carrier_locals) with main's stronger simplify_locals. Together
they now fold `local x=30; local y=20; return x+y` all the way to
`[I32Const(50)]` (both locals eliminated, add constant-folded) instead of
leaving one surviving `$y` store. The old assertion expected exactly 1
store; the new fully-folded form is strictly more aggressive and provably
equivalent (30 + 20 == 50).

The behavioral differential gate certifies semantics PRESERVED on this
module and on the whole CI corpus (httparse/json_lite/state_machine/
sem.loom: 4 certified, 0 diverged). Not a miscompile — a stale, over-
specified intermediate-form expectation invalidated by the merge.

Refs #219
Refs #238

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@avrabe avrabe merged commit 7f5d8b4 into main Jul 1, 2026
23 of 25 checks passed
@avrabe avrabe deleted the feat/219-seam-sroa branch July 1, 2026 22:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

opt: dissolve the seam fully — post-inline SROA/scalar-forwarding eliminates the u64 ABI pack/unpack (+ wasm-local mem2reg) → toward LLVM-LTO parity

1 participant