
JVNAUTOSCI-2507: bootstrap recursive episode instance_terminal trigger disabled #286

Merged
witbrock merged 1 commit into main from JVNAUTOSCI-2507-disable-recursive-episode-binding
Jun 14, 2026


@witbrock

Member

Problem

_ensure_episode_evaluation_event_bindings re-enabled both episode_evaluation event bindings on every server start (enabled=True, replace_existing=True), including workflow.instance_terminal.

That trigger is a recursive firehose: every workflow instance terminal — including episode_evaluation's own terminals — re-fires episode evaluation, so a single event self-sustains into an unbounded backlog. Observed live: 138k pending instances, amplified by old-build foreign workers on the shared Atlas cluster that don't run the storm damper (#285). Disabling the binding at runtime was not durable — the next restart's bootstrap re-enabled it and the storm returned.
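To make the recursion concrete, here is a toy model of the loop described above. It is only a sketch: `simulate`, the queue, and the `exclude_self` flag are hypothetical names for illustration, not code from this repository, and it assumes every spawned instance eventually reaches a terminal state.

```python
from collections import deque


def simulate(initial_terminals, max_steps, exclude_self):
    """Count episode_evaluation instances spawned by terminal events.

    Toy model (hypothetical): each queue entry is the workflow whose
    instance just reached a terminal state and emitted a
    workflow.instance_terminal event.
    """
    pending = deque(["turn_execution"] * initial_terminals)
    spawned = 0
    steps = 0
    while pending and steps < max_steps:
        source_workflow = pending.popleft()
        steps += 1
        if exclude_self and source_workflow == "episode_evaluation":
            # Hypothetical self-exclusion condition: do not evaluate
            # the evaluator's own terminals.
            continue
        spawned += 1
        # The new episode_evaluation instance will terminate later and
        # emit another workflow.instance_terminal event.
        pending.append("episode_evaluation")
    return spawned, len(pending)


# Blanket trigger: one seed event keeps the queue non-empty forever,
# so the loop only stops at the step limit.
print(simulate(1, 1000, exclude_self=False))  # (1000, 1)

# With self-exclusion the chain ends after the first evaluation.
print(simulate(1, 1000, exclude_self=True))   # (1, 0)
```

In this model the blanket trigger never drains on its own, which matches the "single event self-sustains" behaviour described above; the real system adds concurrency and multiple workers on top.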

Change

Bootstrap the workflow.instance_terminal → episode_evaluation binding as disabled (the per-turn turn_execution.completion_gate trigger stays enabled). replace_existing keeps the disabled state self-healing across restarts. The binding row still exists, so it remains discoverable and can be re-enabled once a self-exclusion condition and throttling are added; re-enabling a blanket every-terminal trigger without those would just recreate the firehose.

Why disabled rather than removed

Keeping the binding row (disabled) documents intent and leaves a clean re-enable path; removing it would silently drop the capability. The damper (#285) caps per-(workflow_id, source_event_type) backlog, but (a) it applies only on workers that run it and (b) it partitions by source event type, so it mitigates the recursion rather than fixing it.
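For readers who have not seen #285, the following is a rough sketch of what a per-(workflow_id, source_event_type) backlog cap could look like. It is not the damper's actual code: `BacklogDamper`, `admit`, `complete`, and the `cap` parameter are hypothetical names chosen to illustrate the two limitations listed above.

```python
from collections import Counter


class BacklogDamper:
    """Hypothetical cap on pending instances per partition key."""

    def __init__(self, cap):
        self.cap = cap
        self.pending = Counter()

    def admit(self, workflow_id, source_event_type):
        """Return True if a new instance may be queued for this key."""
        key = (workflow_id, source_event_type)
        if self.pending[key] >= self.cap:
            return False
        self.pending[key] += 1
        return True

    def complete(self, workflow_id, source_event_type):
        """Release one slot when an instance finishes."""
        key = (workflow_id, source_event_type)
        if self.pending[key] > 0:
            self.pending[key] -= 1


damper = BacklogDamper(cap=2)
# The third instance_terminal-sourced instance is rejected...
print([damper.admit("episode_evaluation", "workflow.instance_terminal")
       for _ in range(3)])  # [True, True, False]
# ...but a different source event type has its own, separate budget.
print(damper.admit("episode_evaluation", "turn_execution.completion_gate"))  # True
```

A cap like this bounds each partition's backlog but leaves the recursive trigger in place, and a worker that never calls `admit` is not bounded at all, which is why the binding is also disabled at bootstrap.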

Tests

Updated test_bootstrap_materialises_episode_evaluation_workflow_family: completion-gate binding enabled; instance_terminal binding present but enabled is False and absent from the enabled-only list. File green (4 passed).

Context (JVNAUTOSCI-2507 completion)

Final piece of the storm work: damper (#285, merged) + one-off cleanup of 138,508 backlog instances + this durable disable. Note the ticket's original "unsupported actions" premise was stale — episode_critic.* actions are registered; the real cause was this recursive trigger pattern. Shared-cluster foreign-worker governance (old builds bypassing the damper) remains JVNAUTOSCI-2510.

🤖 Generated with Claude Code

…gger DISABLED

_ensure_episode_evaluation_event_bindings re-enabled BOTH episode_evaluation
bindings on every server start (enabled=True, replace_existing=True). The
workflow.instance_terminal trigger is a recursive firehose: every workflow
instance terminal — including episode_evaluation's OWN terminals — re-fires
episode evaluation, so a single event self-sustains into an unbounded backlog
(138k instances observed live, amplified by old-build foreign workers on the
shared cluster). Disabling the binding at runtime was therefore not durable:
the next restart's bootstrap re-enabled it and the storm returned.

Bootstrap it DISABLED instead (completion-gate per-turn trigger stays enabled).
replace_existing keeps the disabled state self-healing across restarts. The
binding still exists (discoverable) and can be re-enabled once a
self-exclusion condition + throttling are added.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@witbrock witbrock merged commit 9c18b3c into main Jun 14, 2026
4 checks passed