Skip to content

JVNAUTOSCI-2507: event-storm backlog damper for event-triggered instances#285

Merged
witbrock merged 2 commits into
mainfrom
JVNAUTOSCI-2507-event-storm-damper
Jun 13, 2026
Merged

JVNAUTOSCI-2507: event-storm backlog damper for event-triggered instances#285
witbrock merged 2 commits into
mainfrom
JVNAUTOSCI-2507-event-storm-damper

Conversation

@witbrock

Copy link
Copy Markdown
Member

Problem

Self-sustaining event cascades mint workflow instances that never drain. paper_recommendation_evaluation_workflow and episode_evaluation_workflow rewrite graph relations as a side effect, which re-triggers them via their own text_relation.* / relationship.* / workflow.instance_terminal event bindings — observed live 2026-06-13 at ~1 new instance every 2-3s (853 paper-recommendation + 151 episode instances in a short window; 16k+ historical backlog per the ticket). This pollutes workflow_instances, slows worker claim scans, makes unindexed diagnostics queries time out, and (until JVNAUTOSCI-2512) churned the routing capability index.

This PR implements the storm damper facet of JVNAUTOSCI-2507.

Change

A high-watermark backlog guard at the single event-triggered creation choke point — submit_verified_workflow_instance, the use_event_idempotency_submission branch (true only for event-binding submissions, not schedule/HTTP/turn paths). When a (workflow_id, source_event_type) pair already has >= N pending/running instances within a bounded recent window, new creation is suppressed and a throttled_event_backlog result is returned (callers already tolerate rejection statuses like rejected_preflight).

  • instance_manager.find_instance_id_by_event_key — keeps idempotent redelivery of the same event un-throttled (it must still resolve to its existing instance).
  • instance_manager.count_recent_event_backlog — served by the existing workflow_created (workflow_id + created_at) index and always time-bounded via since_utc, so it can't degrade into an unindexed scan on the shared Atlas cluster. No new index build on the 16k+ collection.

Config

  • VON_WORKFLOW_EVENT_BACKLOG_LIMIT (default 50; <= 0 disables)
  • VON_WORKFLOW_EVENT_BACKLOG_WINDOW_SECONDS (default 6h)

This is a high-watermark guard, not a per-event dedupe: at the default limit, normal low-volume event processing is unaffected; it only engages during an actual runaway. The limit is the main review knob.

Tests

  • tests/backend/test_event_storm_damper.py: throttle decision (disabled / idempotent-redelivery skip / below-limit / at-limit) + env readers + time-bound assertion.
  • tests/backend/test_workflow_instance_submission_service.py: integration test through submit_verified_workflow_instance proving the wiring (throttled path returns without calling create_instance_for_event).
  • Full files green (36 in the submission file + 8 damper unit tests).

Out of scope (remaining 2507 facets)

Unrunnable-binding guard (refuse instances for workflows referencing unsupported actions), implementing/disabling the episode_critic.* actions, and the one-off cleanup of the existing pending/paused/orphaned backlog. This PR throttles new runaway creation; it does not drain the existing backlog.

Related: JVNAUTOSCI-2512 (decoupled the routing index from this storm; #284).

🤖 Generated with Claude Code

witbrock and others added 2 commits June 13, 2026 18:44
…stances

Self-sustaining event cascades (paper_recommendation_evaluation_workflow and
episode_evaluation_workflow rewrite graph relations, which re-trigger themselves
via their own text_relation.*/relationship.*/instance_terminal bindings) create
hundreds of pending instances that never drain, polluting workflow_instances,
slowing worker claim scans, and (until JVNAUTOSCI-2512) churning the routing
capability index.

Add a high-watermark backlog damper at the single event-triggered creation
choke point (submit_verified_workflow_instance, event-idempotency branch):
when a (workflow_id, source_event_type) pair already has >= N pending/running
instances within a bounded recent window, suppress new creation and return a
throttled_event_backlog result (callers already tolerate rejection statuses).

- instance_manager: find_instance_id_by_event_key (keeps idempotent redelivery
  un-throttled) and count_recent_event_backlog (served by the workflow_created
  index, always time-bounded so it cannot become an unindexed scan on Atlas).
- Config: VON_WORKFLOW_EVENT_BACKLOG_LIMIT (default 50; <=0 disables) and
  VON_WORKFLOW_EVENT_BACKLOG_WINDOW_SECONDS (default 6h). High-watermark, not a
  per-event dedupe: normal low-volume event processing is unaffected.
- Tests: unit coverage of the throttle decision + readers, plus an integration
  test through submit_verified_workflow_instance.

Does not address 2507's other facets (unrunnable-binding guard, episode_critic.*
actions, one-off backlog cleanup) or stop the existing backlog from draining.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
scripts/cleanup_event_storm_backlog.py cancels the pending/paused (and
orphaned-running) backlog of the storm workflows via a direct bulk update_many,
deliberately bypassing WorkflowInstanceManager.mark_cancelled so it emits no
workflow.instance_terminal events (which would re-spawn episode_evaluation
instances and feed the storm). Dry-run by default; --execute to apply; batches
by _id so it never full-scans the loaded Atlas collection; raises the socket
timeout for this one-off run.

Live run 2026-06-13 (after the damper was deployed to throttle the source):
cancelled 138,508 instances (61,881 paper_recommendation + 76,623 episode
pending/paused + 4 stale running); residual backlog now ~14, capped by the
damper high-watermark.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@witbrock witbrock merged commit 21486ac into main Jun 13, 2026
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant