feat(studio): add resolver-shadow attempt counter for soak-gate denominator#1926
Conversation
…inator Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com>
This stack of pull requests is managed by Graphite. Learn more about stacking. |
miga-heygen
left a comment
There was a problem hiding this comment.
Review — feat(studio): add resolver-shadow attempt counter for soak-gate denominator
Verdict: LGTM with nits
Clean, well-scoped addition. The counter + periodic rollup is the right shape for this — you get the denominator without recreating the per-event chattiness the divergence-only design explicitly avoided. Tests are thorough (15 new, covering both paths of all three emit functions, flag-off, and timer wiring). Design doc and plan linked. Ship it.
Findings
1. recordAttempt is outside the try/catch in all three emit functions (Minor)
runResolverShadow, recordResolverParity, and recordAnimationResolverParity all place recordAttempt(opLabel) before their try block. The contract on these functions is "never throws; never propagates from the shadow path," but recordAttempt → ensureAttemptFlushScheduled touches setInterval and document.addEventListener, which can throw in non-standard environments (SSR, web workers, restrictive iframe sandboxes).
If recordAttempt throws, the exception propagates past the catch and breaks the caller — the exact failure mode these functions are designed to prevent.
Suggestion: move recordAttempt inside the existing try block (right at the top). It's a one-line move in each function and preserves the "count regardless of outcome" guarantee since the catch at the bottom swallows everything:
// e.g. in runResolverShadow:
try {
recordAttempt("dom-edit");
const mismatches = sdkResolverShadowCheck(session, hfId, ops);
// ...
} catch {
// never propagate from the shadow path
}Alternatively, wrap the body of recordAttempt itself in a try/catch so the counter is self-hardened regardless of call site.
2. __resetAttemptSchedulingForTests resets the flag but doesn't remove the listener (Nit)
__resetAttemptSchedulingForTests sets attemptUnloadListenerRegistered = false and clears the interval, but does not call document.removeEventListener. This means after a reset + re-init cycle in tests, a second visibilitychange listener is registered. In the current test suite this is harmless (the timer tests don't fire visibility events), but it could cause a double-flush surprise if a future test exercises the tab-hide path after a reset. Not blocking — just something to be aware of if the visibility flush ever gets its own test.
3. Listener-order dependency for the visibilitychange flush (Acknowledged)
The PR body already calls this out ("documented-but-fragile DOM event-order dependency for the tab-hide flush"). The new visibilitychange handler calls flushAndEmitAttempts() → trackStudioEvent(), which queues an event into studioTelemetry.ts's batch queue. If studioTelemetry's own visibilitychange handler runs first (listener registration order is not guaranteed by spec), the attempt rollup event arrives in the queue after the sendBeacon flush and sits there unflushed until the tab returns (or forever, if the tab is closed). This is a low-probability data loss for the very last rollup in a session, and the 5-minute periodic flush covers the vast majority of data. Fine for now — the follow-up to add sendBeacon to this handler (mirroring studioTelemetry.ts lines 121-123) would close the gap when someone gets to it.
Ponytail (over-engineering audit)
Lean already. The counter + rollup is the minimum viable denominator — no unnecessary abstractions, no premature generalization, no config surface. The __resetAttemptSchedulingForTests export is justified overhead for the fake-timer tests. Ship.
Silent failure patterns
One: finding #1 above (recordAttempt outside try/catch). The rest of the error handling follows the established fail-open pattern correctly.
Diff stats: 2 files changed, +230 / -1
— Miga
james-russo-rames-d-jusso
left a comment
There was a problem hiding this comment.
Reviewed at 8562e0d.
Verdict
🟠 Solid design and shape; one real-behavior finding on the tab-hide flush order that partly defeats the "short session" claim in the PR body, plus a few smaller notes. Nothing that blocks merge if the tab-hide-flush trade-off is acknowledged; happy to stamp on either a fix or a decision to accept-and-document.
What this PR adds (observability shape)
Introduces a denominator for the existing sdk_resolver_shadow divergence signal:
- Module-private
attemptCounts: Record<string, number>insdkResolverShadow.ts. - Incremented at the top of all three emit functions (
runResolverShadow,recordResolverParity,recordAnimationResolverParity), after theSTUDIO_SDK_RESOLVER_SHADOW_ENABLEDguard and after the null-input early returns — so attempts count on both the parity (silent) path AND the divergence (emits) path, and don't count when input is missing. - Rolled up into ONE
sdk_resolver_shadow_attemptPostHog event every 5 minutes (setInterval) plus avisibilitychange → hiddenflush, both installed lazily on the FIRST attempt viaensureAttemptFlushScheduled. - Event shape:
{ counts: JSON.stringify({ [opLabel]: count, ... }) }— one field, JSON-string-nested.
Same-file locality, no new flag, no changes to evaluateSoakGate. Tests: 15 new (39 → 54), covering pure counter behavior, per-function attempt counting on both parity + divergence paths, and production timer/rollup wiring with fake timers.
Divergence-vs-parity coherence
- Dim alignment: the divergence event
sdk_resolver_shadowfromrunResolverShadow(sdkResolverShadow.ts:358) emits noopLabel— it only has one shape (DOM-edit). The attempt counter labels DOM-edit attempts as"dom-edit". Sodivergences_dom_edit = count(sdk_resolver_shadow WHERE opLabel IS NULL),attempts_dom_edit = counts["dom-edit"]. Works, but the "NULL means dom-edit" convention is implicit. PR body Minor #1 already acknowledges this ((dom-edit)label convention on the dashboard). - opLabel field-shape drift: parity attempts serialize opLabel as a nested JSON string field (
counts: JSON.stringify({ setTiming: N })) while divergences emitopLabelas a top-level property onsdk_resolver_shadow. Analyst has toJSON_EXTRACTon the numerator side but not the denominator. PR body Minor #2 flags this too. Fine to ship; makes the eventual rate view a bit more finicky. - Time window: attempts flush every 5 minutes; divergences fire per-event through the shared
studioTelemetry30-second batch. Both hit the same PostHog project, no clock drift concern.
Failure-path coverage
Covered. recordAttempt(...) is called BEFORE each function's try {}:
sdkResolverShadow.ts:346—runResolverShadow, before the try that catchessdkResolverShadowCheckthrows.sdkResolverShadow.ts:399—recordResolverParity, before the try that catchesresolveSnapshot/ source-read errors.sdkResolverShadow.ts:462—recordAnimationResolverParity, before the try that catchessession.getElements()failures.
Every non-guarded invocation gets counted, so the denominator can't silently under-count the failure cohort — the exact rubric miss recurring in observability PRs. Well done.
Null-input paths (!hfId, !session) return before recordAttempt — those aren't attempts on the tripwire, they're skipped calls. Correct.
Findings (severity-ordered)
-
🟠
visibilitychangeflush is dropped for short sessions due to listener-order —sdkResolverShadow.ts:284-294(ensureAttemptFlushScheduled) vs.studioTelemetry.ts:93-115(module-load-time visibility handler). PR body claims the visibilitychange path exists to catch "short sessions" that never see a 5-min interval fire. But:studioTelemetry.ts:93registers its visibility handler at module load time.ensureAttemptFlushScheduledregisters its visibility handler lazily on the firstrecordAttempt, which by construction runs AFTERstudioTelemetryhas been imported and initialized.- DOM
addEventListenerfires handlers in registration order. - So on
visibilitychange → hidden:studioTelemetry's handler runs FIRST, drains the queue viasendBeacon(queue is EMPTY of the rollup event, since the PR's handler hasn't run yet). Then PR's handler runs, callsflushAndEmitAttempts→trackStudioEvent→ enqueues the rollup — but studioTelemetry'ssendBeaconpath has already fired, and the queue-consuming flushTimer was just cleared+nulled by studioTelemetry's handler (studioTelemetry.ts:96-99). - The rollup event now sits in
queue— the NEXTtrackStudioEventcall restarts the interval, so it goes out ~30s later IF the tab stays open. If the user closes the tab / navigates away, the rollup is lost.
Impact: short sessions (< 5 min) that would have relied on the tab-hide flush lose their attempt count with high probability. The very cohort the visibilitychange path targets. Options: (a) fire the rollup in a
sendBeacon-flavored path fromflushAndEmitAttemptsdirectly (bypass the studioTelemetry queue for this one event); (b)capture: trueon the addEventListener to run before studioTelemetry's bubble-phase handler; (c) accept the trade-off and drop the visibilitychange path entirely + doc "5-min-min-session for denominator accuracy" (my recommendation — simplest, and short sessions are a small share of the fleet). Whatever you pick, worth calling out in the PR body / commit message so operators know the true guarantee. -
🟡
countsas JSON-string field —sdkResolverShadow.ts:277.counts: JSON.stringify(counts)forces every downstream query toJSON_EXTRACTper opLabel. Flattening tocount_setTiming: N, count_dom_edit: M, count_reorderElements: K, ...(per-key top-level fields, bounded set of ~13 known labels enumerated atsdkCutover.ts:311/329/348/377/395/475+useDomEditSession.ts:304+sdkResolverShadow.ts:346/399/462) would make the rate view a direct division in the PostHog UI with no JSON parsing. PR body Minor #2 acknowledges this — flagging as a 🟡 vs a 🟠 because the set is bounded so no cardinality risk and analysts can work around it, but the rate view will be more painful than it needs to be. Nit as scoped, or a follow-up. -
🟡
windowvsdocumentvisibility listener divergence —sdkResolverShadow.ts:290bindsdocument.addEventListener("visibilitychange", ...), while siblingstudioTelemetry.ts:94bindswindow.addEventListener("visibilitychange", ...). Both work (visibilitychange bubbles), but visually diverges from the referenced sibling pattern. Not a bug; consistency nit. -
🟡 Listener leaks for module lifetime —
sdkResolverShadow.ts:288-293. NoremoveEventListenerandattemptUnloadListenerRegisterednever resets in production. The sibling pattern instudioTelemetry.tsalso doesn't remove, so this is consistent with house style — but the sibling registers at module-load (guaranteed once), while this PR registers on first-attempt behind a mutable-in-tests bool.__resetAttemptSchedulingForTestsclears the bool but does NOT actuallyremoveEventListener— so on the nextrecordAttempt, a NEW listener is added while the old one stays alive. Not observable in tests (jsdom cleanup +vi.useRealTimersteardown), but if anyone else ever calls the reset helper in a non-test long-lived environment, the leak is real. Cheapest fix: capture the listener as a named function and remove it in the reset helper too. Or accept + document "test-only, jsdom cleans up between suites." -
🟢 PII / redaction: rollup event carries only
{ opLabel: int }counts. No user content, no hfIds, no error strings. Consistent withredactValuetreatment in the divergence-emit path. No new PII surface. -
🟢 Cardinality:
opLabelset is bounded to ~13 hard-coded strings (grepped from all callers). No user-controlled input reaches the counter's keyspace. No cardinality bomb. -
🟢 Idempotency:
flushAttemptCountscorrectly returns null on empty state, andflushAndEmitAttemptsno-ops on null. Double-flushes are safe. -
🟢 Failure-path emission: attempts count on the throw path of all three emit functions. Passes the observability-PR-must-cover-failure-path rubric.
Cross-repo consumer awareness
New event name studio:sdk_resolver_shadow_attempt (via trackStudioEvent's studio: prefix in studioTelemetry.ts:55). No existing PostHog dashboards / monitors reference it yet. Anyone consuming the divergence event studio:sdk_resolver_shadow for the soak-gate readout will want to know the denominator is now available — a heads-up in the SDK cutover soak thread would land the new signal in the analyst's mental model.
Test coverage
- Pure counter behavior: ✅ (5 tests).
- Attempt counting on parity AND divergence paths of all three emit functions: ✅ (6 tests).
- Multi-chokepoint accumulation into one rollup: ✅.
- Flag-off no-op: ✅.
- Production timer wiring, both "no attempts → no emit" and "attempts → emit at 5 min": ✅.
Missing:
- No test covers the
visibilitychange → hiddencode path. Given the 🟠 finding above, this is the exact path where the order-of-listener bug lurks. A test that fires avisibilitychangeevent in jsdom and asserts the rollup arrives intrackedEventswould have caught the interaction (or made you notice it during test-writing). - No assertion on the rollup event's
distinct_id/ session properties — but those aretrackStudioEvent's responsibility, not this module's, so fair to leave.
AI-trailer sweep
Commit trailer Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com> present. No leftover trailer or scaffolding in the source. Diff is disciplined — module-private state, no config sprawl, no new flags, no hallucinated APIs (all DOM APIs used are real). Docstrings are honest about the trade-offs (test-only export explanation, ordering fragility acknowledged in PR body). Reads like well-vetted code, not scaffold.
Peer awareness
No prior reviews on this PR at HEAD 8562e0d0 (gh api /repos/heygen-com/hyperframes/pulls/1926/reviews returned [] before I started). Fresh R1.
Slack-ready summary
hyperframes#1926 @ 8562e0d0 — 🟠 Solid parity-denominator signal; one real listener-order bug on the tab-hide flush that undoes the "short session" claim, plus known-and-shipped JSON-nested counts field. Not blocking merge if the tab-hide trade-off is acknowledged.
Top findings:
- 🟠
visibilitychangeflush drops the rollup for short sessions —studioTelemetry's handler registers at module load (fires first), PR's handler registers lazily on first attempt (fires second, aftersendBeaconalready went out). Fix options: capture-phase handler, directsendBeaconinflushAndEmitAttempts, or accept + doc. - 🟡
countsas JSON-string forcesJSON_EXTRACTin every rate query; bounded opLabel set (~13) would flatten cleanly to top-level fields. PR body Minor #2 acknowledges. - 🟡
documentvswindowvisibility listener style diverges from siblingstudioTelemetry.ts. Consistency nit.
Failure-path coverage: ✅ — recordAttempt fires before the try in all three emit functions.
Divergence-vs-parity dim coherence: ✅ modulo the opLabel field-shape drift the PR body already flags.
PII: clean. Cardinality: bounded. CI: all required green at HEAD.
— Review by Rames D Jusso
miguel-heygen
left a comment
There was a problem hiding this comment.
The resolver-shadow denominator shape is sound: all three shadow emit paths are still gated on STUDIO_SDK_RESOLVER_SHADOW_ENABLED, op labels are static/low-cardinality, and the new rollup event stays separate from the divergence event. The tests cover parity and divergence paths for DOM edit, element resolver, and animation resolver attempts.
Important follow-up, already raised by RDJ: packages/studio/src/utils/sdkResolverShadow.ts:290 enqueues the tab-hide rollup through trackStudioEvent, but packages/studio/src/utils/studioTelemetry.ts:94 has its own module-load visibilitychange beacon flush. On a short session, studioTelemetry can flush first, then this handler queues the attempt rollup after the beacon is gone. That weakens the PR body's "visibilitychange flush for short sessions" claim, though the 5-minute periodic denominator remains intact. Direct sendBeacon in flushAndEmitAttempts or capture-phase ordering would close it.
Nits from prior reviews still stand: recordAttempt is outside the existing try/catch, test reset doesn't remove the listener, and counts as a JSON-string property will make rate queries a little clunkier than top-level bounded fields.
CI is green at 8562e0d0.
Verdict: APPROVE
Reasoning: Core counter/rollup implementation is correct and CI is clean; the tab-hide ordering issue is real but limited to short-session telemetry completeness, so I’m treating it as follow-up rather than a merge blocker.
…sh ordering PR review feedback (4 reviewers): recordAttempt() sat outside the try/catch in all three emit functions, so a throw inside it (e.g. setInterval/ addEventListener failing in a non-standard environment) would break the "never throws" contract. Also, the new visibilitychange listener races studioTelemetry.ts's own tab-hide handler — whichever fires first can beacon the queue before or after this module's rollup lands in it, silently dropping the attempt count for short sessions closed before the 5-minute timer fires. Fixes: move recordAttempt() inside each function's try block; export flushViaBeacon() from studioTelemetry.ts and call it explicitly after queuing the rollup, so delivery no longer depends on listener registration order; capture the visibilitychange handler by reference so __resetAttemptSchedulingForTests() actually removes it instead of leaking a duplicate on re-arm. Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com>
|
Addressed the two real findings (thanks all four for the thorough passes):
Also added test coverage for the tab-hide path itself (previously untested, which is exactly where the ordering bug was hiding) — switched the file to Left as documented follow-ups (all three reviewers flagged these as non-blocking):
Full studio suite green (1328 passed), typecheck clean, lint/format clean. |
miguel-heygen
left a comment
There was a problem hiding this comment.
Re-reviewed the 8562e0d0..2c60ae98 delta. The follow-up fixes address the prior findings cleanly:
packages/studio/src/utils/sdkResolverShadow.tsnow callsrecordAttemptinside each existing fail-opentry, so scheduling/counting cannot break the shadow emitters' never-throw contract.- The visibility handler is now held by reference and removed in
__resetAttemptSchedulingForTests, with a regression test proving reset+re-arm does not double-register. packages/studio/src/utils/studioTelemetry.tsexposesflushViaBeacon(), and the resolver-shadow hidden-tab handler calls it after queuing the rollup. That closes the listener-order bug RDJ flagged: whether studioTelemetry's own visibility handler runs before or after this module's handler, the attempt rollup is beacon-flushed instead of sitting behind a cleared timer.- The new happy-dom tests cover
visibilitychange -> hiddenand duplicate-listener reset behavior; the repoTestjob is green at this head.
Remaining notes are the already-accepted query-shape nits: counts is still a JSON string, and dom-edit remains the special denominator label. Those don't block merge.
CI is green at 2c60ae98.
Verdict: APPROVE
Reasoning: The new delta directly fixes the exception-safety, tab-hide delivery, and listener-reset issues without changing the core low-volume denominator design; all checks are green on the exact head.

What
Adds a low-volume attempt counter to the resolver-shadow tripwire (
packages/studio/src/utils/sdkResolverShadow.ts), rolled up into a new PostHog eventsdk_resolver_shadow_attemptevery 5 minutes (plus avisibilitychangeflush for short sessions).Why
sdk_resolver_shadow(the existing tripwire event) only fires on divergence — parity is silent by design, to avoid spamming an event on every editor edit. That makes it a pure numerator: we can count SDK-resolver failures, but never attempts. Without a denominator there's no way to compute an actual divergence rate or decide when the SDK-cutover soak sample is big enough — "0 divergences" means very different things at 5 attempts vs 5,000.How
Record<string, number>counter (attemptCounts) insidesdkResolverShadow.ts, incremented by all three existing emit functions (runResolverShadow,recordResolverParity,recordAnimationResolverParity) right after their existingSTUDIO_SDK_RESOLVER_SHADOW_ENABLEDguard — so every real invocation is counted regardless of outcome, with no new flag check needed.sdk_resolver_shadow_attempt) kept distinct fromsdk_resolver_shadow, so the two streams (activity vs failures) stay independently queryable.visibilitychangelistener start lazily on the firstrecordAttemptcall, mirroringstudioTelemetry.ts's existing lazy-timer pattern — a session that never touches the tripwire never runs a background timer.evaluateSoakGateor any existing function's public signature — this only ships the raw attempt data; thedivergences / attemptsrate view is a follow-up once real data exists.Design doc:
docs/hyperframes/plans/sdk/2026-07-04-resolver-shadow-attempt-counter-design.mdPlan:
docs/hyperframes/plans/sdk/2026-07-04-resolver-shadow-attempt-counter-plan.mdTest plan
sdkResolverShadow.test.ts): pure counter/rollup behavior, attempt-counting on both the parity and divergence paths of all three emit functions, and production timer/rollup wiring with fake timers.(dom-edit)convention, opLabel being JSON-nested vs top-level, and a documented-but-fragile DOM event-order dependency for the tab-hide flush) — none block this change.