JVNAUTOSCI-2507: bootstrap recursive episode instance_terminal trigger disabled #286
Merged
…gger DISABLED

_ensure_episode_evaluation_event_bindings re-enabled BOTH episode_evaluation bindings on every server start (enabled=True, replace_existing=True). The workflow.instance_terminal trigger is a recursive firehose: every workflow instance terminal, including episode_evaluation's OWN terminals, re-fires episode evaluation, so a single event self-sustains into an unbounded backlog (138k instances observed live, amplified by old-build foreign workers on the shared cluster). Disabling the binding at runtime was therefore not durable: the next restart's bootstrap re-enabled it and the storm returned.

Bootstrap it DISABLED instead (the completion-gate per-turn trigger stays enabled). replace_existing keeps the disabled state self-healing across restarts. The binding still exists, so it stays discoverable and can be re-enabled once a self-exclusion condition and throttling are added.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Problem
_ensure_episode_evaluation_event_bindings re-enabled both episode_evaluation event bindings on every server start (enabled=True, replace_existing=True), including workflow.instance_terminal. That trigger is a recursive firehose: every workflow instance terminal, including episode_evaluation's own terminals, re-fires episode evaluation, so a single event self-sustains into an unbounded backlog. Observed live: 138k pending instances, amplified by old-build foreign workers on the shared Atlas cluster that don't run the storm damper (#285). Disabling the binding at runtime was not durable: the next restart's bootstrap re-enabled it and the storm returned.
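The feedback loop and the non-durable runtime disable can be illustrated with a small model. This is only a sketch under stated assumptions: `Registry`, `ensure`, `set_enabled`, `bootstrap`, `run`, the workflow name `any_workflow`, and the `cap` parameter are all hypothetical names invented for illustration, not the project's real API. Only the event type, the target workflow name, and the `enabled` / `replace_existing` flags come from the description above.

```python
from collections import deque
from dataclasses import dataclass

# Hypothetical model of the pattern; none of these classes exist in the project.
TERMINAL = "workflow.instance_terminal"


@dataclass
class Binding:
    event_type: str
    workflow: str
    enabled: bool


class Registry:
    def __init__(self):
        self.bindings = {}

    def ensure(self, event_type, workflow, *, enabled, replace_existing):
        """Create the binding; overwrite an existing one only if replace_existing."""
        key = (event_type, workflow)
        if key in self.bindings and not replace_existing:
            return
        self.bindings[key] = Binding(event_type, workflow, enabled)

    def set_enabled(self, event_type, workflow, enabled):
        """Runtime toggle, as an operator might do during an incident."""
        self.bindings[(event_type, workflow)].enabled = enabled

    def targets(self, event_type):
        """Workflows started by an event, considering enabled bindings only."""
        return [
            b.workflow
            for b in self.bindings.values()
            if b.event_type == event_type and b.enabled
        ]


def bootstrap(registry, *, terminal_enabled):
    """Stand-in for the server-start bootstrap of the terminal binding."""
    registry.ensure(
        TERMINAL,
        "episode_evaluation",
        enabled=terminal_enabled,
        replace_existing=True,
    )


def run(registry, first_workflow, cap=1000):
    """Start instances until the queue drains or `cap` instances have started."""
    queue = deque([first_workflow])
    started = 0
    while queue and started < cap:
        queue.popleft()
        started += 1
        # Every instance that reaches a terminal state emits TERMINAL,
        # including instances of episode_evaluation itself.
        queue.extend(registry.targets(TERMINAL))
    return started


if __name__ == "__main__":
    reg = Registry()
    bootstrap(reg, terminal_enabled=True)
    print(run(reg, "any_workflow"))   # hits the cap: the loop never drains

    reg.set_enabled(TERMINAL, "episode_evaluation", False)
    print(run(reg, "any_workflow"))   # one instance: the storm is stopped

    bootstrap(reg, terminal_enabled=True)   # simulated restart, old behaviour
    print(run(reg, "any_workflow"))   # hits the cap again
```

In this model, calling `bootstrap(reg, terminal_enabled=False)` instead keeps every later run at a single instance, and because `replace_existing=True` rewrites the binding on each start, a restart restores the disabled state rather than undoing it.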
Change
Bootstrap the workflow.instance_terminal → episode_evaluation binding disabled (the per-turn turn_execution.completion_gate trigger stays enabled). Because the bootstrap runs with replace_existing on every start, the disabled state is self-healing across restarts. The binding still exists, so it stays discoverable and can be re-enabled once a self-exclusion condition and throttling are added; re-enabling a blanket every-terminal trigger without those would just recreate the firehose.
Why disabled rather than removed
Keeping the binding row (disabled) documents intent and leaves a clean re-enable path; removing it would silently drop the capability. The damper (#285) caps the backlog per (workflow_id, source_event_type) pair, but (a) only on workers that run it and (b) it partitions by source event type, so it mitigates rather than fixes the recursion.
Tests
Updated test_bootstrap_materialises_episode_evaluation_workflow_family: the completion-gate binding is enabled; the instance_terminal binding is present, but its enabled is False and it is absent from the enabled-only list. File green (4 passed).
Context (JVNAUTOSCI-2507 completion)
Final piece of the storm work: damper (#285, merged) + one-off cleanup of 138,508 backlog instances + this durable disable. Note that the ticket's original "unsupported actions" premise was stale: episode_critic.* actions are registered; the real cause was this recursive trigger pattern. Shared-cluster foreign-worker governance (old builds bypassing the damper) remains JVNAUTOSCI-2510.
🤖 Generated with Claude Code