JVNAUTOSCI-2507: event-storm backlog damper for event-triggered instances#285
Merged
Conversation
…stances Self-sustaining event cascades (paper_recommendation_evaluation_workflow and episode_evaluation_workflow rewrite graph relations, which re-trigger themselves via their own text_relation.*/relationship.*/instance_terminal bindings) create hundreds of pending instances that never drain, polluting workflow_instances, slowing worker claim scans, and (until JVNAUTOSCI-2512) churning the routing capability index. Add a high-watermark backlog damper at the single event-triggered creation choke point (submit_verified_workflow_instance, event-idempotency branch): when a (workflow_id, source_event_type) pair already has >= N pending/running instances within a bounded recent window, suppress new creation and return a throttled_event_backlog result (callers already tolerate rejection statuses). - instance_manager: find_instance_id_by_event_key (keeps idempotent redelivery un-throttled) and count_recent_event_backlog (served by the workflow_created index, always time-bounded so it cannot become an unindexed scan on Atlas). - Config: VON_WORKFLOW_EVENT_BACKLOG_LIMIT (default 50; <=0 disables) and VON_WORKFLOW_EVENT_BACKLOG_WINDOW_SECONDS (default 6h). High-watermark, not a per-event dedupe: normal low-volume event processing is unaffected. - Tests: unit coverage of the throttle decision + readers, plus an integration test through submit_verified_workflow_instance. Does not address 2507's other facets (unrunnable-binding guard, episode_critic.* actions, one-off backlog cleanup) or stop the existing backlog from draining. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
scripts/cleanup_event_storm_backlog.py cancels the pending/paused (and orphaned-running) backlog of the storm workflows via a direct bulk update_many, deliberately bypassing WorkflowInstanceManager.mark_cancelled so it emits no workflow.instance_terminal events (which would re-spawn episode_evaluation instances and feed the storm). Dry-run by default; --execute to apply; batches by _id so it never full-scans the loaded Atlas collection; raises the socket timeout for this one-off run. Live run 2026-06-13 (after the damper was deployed to throttle the source): cancelled 138,508 instances (61,881 paper_recommendation + 76,623 episode pending/paused + 4 stale running); residual backlog now ~14, capped by the damper high-watermark. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
Self-sustaining event cascades mint workflow instances that never drain.
paper_recommendation_evaluation_workflowandepisode_evaluation_workflowrewrite graph relations as a side effect, which re-triggers them via their owntext_relation.*/relationship.*/workflow.instance_terminalevent bindings — observed live 2026-06-13 at ~1 new instance every 2-3s (853 paper-recommendation + 151 episode instances in a short window; 16k+ historical backlog per the ticket). This pollutesworkflow_instances, slows worker claim scans, makes unindexed diagnostics queries time out, and (until JVNAUTOSCI-2512) churned the routing capability index.This PR implements the storm damper facet of JVNAUTOSCI-2507.
Change
A high-watermark backlog guard at the single event-triggered creation choke point —
submit_verified_workflow_instance, theuse_event_idempotency_submissionbranch (true only for event-binding submissions, not schedule/HTTP/turn paths). When a(workflow_id, source_event_type)pair already has>= Npending/running instances within a bounded recent window, new creation is suppressed and athrottled_event_backlogresult is returned (callers already tolerate rejection statuses likerejected_preflight).instance_manager.find_instance_id_by_event_key— keeps idempotent redelivery of the same event un-throttled (it must still resolve to its existing instance).instance_manager.count_recent_event_backlog— served by the existingworkflow_created(workflow_id + created_at) index and always time-bounded viasince_utc, so it can't degrade into an unindexed scan on the shared Atlas cluster. No new index build on the 16k+ collection.Config
VON_WORKFLOW_EVENT_BACKLOG_LIMIT(default 50;<= 0disables)VON_WORKFLOW_EVENT_BACKLOG_WINDOW_SECONDS(default 6h)This is a high-watermark guard, not a per-event dedupe: at the default limit, normal low-volume event processing is unaffected; it only engages during an actual runaway. The limit is the main review knob.
Tests
tests/backend/test_event_storm_damper.py: throttle decision (disabled / idempotent-redelivery skip / below-limit / at-limit) + env readers + time-bound assertion.tests/backend/test_workflow_instance_submission_service.py: integration test throughsubmit_verified_workflow_instanceproving the wiring (throttled path returns without callingcreate_instance_for_event).Out of scope (remaining 2507 facets)
Unrunnable-binding guard (refuse instances for workflows referencing unsupported actions), implementing/disabling the
episode_critic.*actions, and the one-off cleanup of the existing pending/paused/orphaned backlog. This PR throttles new runaway creation; it does not drain the existing backlog.Related: JVNAUTOSCI-2512 (decoupled the routing index from this storm; #284).
🤖 Generated with Claude Code