[FE POC] ETL implementation for evaluation runs. #4354
Draft
ardaerzin wants to merge 41 commits into
…L engine RFCs
Three paired design docs covering the evaluation frontend refactor:
- eval-filtering.md (390 lines, 8 mermaid diagrams): what to build.
  Reuses tracing Filtering/Condition primitive, v1 frontend / v2 backend
  split, field-path convention, UI states, compare-mode gap with the
  query-backed-runs join-key problem flagged.
- eval-package-architecture.md (647 lines, 11 diagrams): where it lives.
  Maps current OSS data layer (28 atom files, ~9k LOC) to target
  @agenta/entities/evaluationRun package boundary. Documents existing
  molecule ground truth, the 4-namespace convention to follow, phased
  migration with Phase 1 (scenario + metrics molecules) as the
  prerequisite for the filter primitive.
- eval-etl-engine.md (804 lines, 7 diagrams, 7 code blocks): how
  transforms compose. JP's huddle diagrams as instances of one chunked
  iteration loop with 5 guarantees (memory-bounded, cancellation,
  progress, backpressure, idempotent resume). Source/Transform/Sink
  contracts, the ~40-line loop, three worked examples (filter,
  query->testset, file export), and full integration story with entity
  packages via the adapter pattern.
All three docs cross-link, share yellow/green/red diagram color
conventions, and explicitly defer DSL/vocabulary work until 3+ transforms
exist with comparable shapes.
Not pushed; staged on fe-experiment/etl-engine for review.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…future-improvement designs
Honest performance audit of the trio surfaced several overstated
guarantees and missing constraints. This commit applies the corrections,
restructures the migration phases, and adds substantive design treatment
for four future improvements that didn't make v1.
Corrections applied across the trio:
- "Memory bounded by chunk size" was overstated. Loop runtime IS bounded;
  cumulative session state is NOT. Doc now distinguishes pipeline memory
  from session memory, with resident-memory estimates per run size.
- "Cancellation propagates" was a partial truth. Mid-flight HTTP requests
  aren't aborted unless AbortSignal is plumbed through the API layer.
  Promoted to Phase 1d (was implicit / "v2 polish").
- Row eviction was Phase 4+. Promoted to Phase 3a with explicit
  sliding-window policy + atomFamily.remove() in lockstep.
- v1 client-side join sizing tightened from ~10k to ~5k per side based on
  memory math (10KB/row × 10k = 100MB hash map; browser struggles).
New mandatory performance constraints in eval-filtering.md:
- C1: ≥250ms debounce on scenarioFilterAtom writes (predicate eval is O(N))
- C2: Three predicate operator tiers (cheap / moderate / expensive); UI
  surfaces only Tier 1/2 by default, Tier 3 auto-escalates to v2
- C3: Eager v2 escalation on three triggers (hit-ratio, loaded > 10k,
  Tier 3 operator), not just hit-ratio
- C4: Background tab pause via visibility-aware AbortSignal wrapper
- C5: AtomFamily eviction in lockstep with row eviction
New "Limitations and required discipline" section in
eval-package-architecture.md spelling out what IS bounded vs what ISN'T,
sizing expectations table, and what the design explicitly does NOT fix
(server aggregations, cross-table joins beyond compare-mode, real-time
streaming, offline resume).
ETL engine doc clarifies the 5 guarantees with explicit caveats about
loop-local vs caller-managed state, and adds Performance properties
section with per-chunk cost tables.
Four future-improvement designs added with concrete shapes:
Filter RFC:
- F1: Skip-ahead UX on filter transitions using last-visible row ID as a
  content anchor (not opaque cursor). Primitive findNearestPosition() on
  derived view; v2 needs API extension to accept anchor in query payload.
- F2: Predicate explain mode (dev tool) with ring buffer of timed
  evaluations, tier classification recommendations, three enable
  mechanisms (URL param / devtools / env var). Key insight: tier
  classification should be measured, not stipulated.
Package architecture RFC:
- F1: Worker-thread predicate evaluation via snapshot-based ship-once
  pattern. Snapshot shape acts as predicate field schema; predicate
  changes ship only the predicate, not the data. Performance comparison
  table shows ~25x speedup for repeated predicates after first eval.
- F2: Memoized derived results: LRU cache keyed by predicate hash with
  per-entry revision tracking. v1 coarse invalidation (all entries on any
  correlated update); fine-grained field-path tracking promoted on user
  feedback. Cache is tiny (~180KB for 10 entries × 500 matches).
The four future improvements compose cleanly across the
worker/main-thread boundary and are cross-linked between the two RFCs.
Working tree returns to clean. Not pushed; stays on
fe-experiment/etl-engine for review.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ter schema design
Two structural changes the trio needed before it could be considered
load-bearing.
1. Split the ETL engine doc into general + consumer
The original eval-etl-engine.md mixed two concerns: the general loop
engine (zero entity coupling) and eval's specific adoption (adapters,
folder structure, filter pipeline worked example). Resolved by:
- NEW: docs/designs/etl-engine.md (572 lines, 2 diagrams)
General loop engine. Contracts, runtime, performance properties,
5 guarantees with caveats, generic worked examples (streaming file
export, cross-entity query->testset, multi-source join). Open
questions are engine-level only. "What to do next" is engine-level.
- REFOCUSED: docs/designs/eval-etl-engine.md (331 lines, 3 diagrams,
down from 922)
Eval-specific adoption only. The filter pipeline as canonical first
consumer of the engine. Adapter folder structure under
@agenta/entities/evaluationRun/etl/. Per-chunk sequence for the
eval filter case. Migration path aligned to architecture RFC phases.
Explicit "why this doc is small" note: eval is a consumer, most work
is in shared infrastructure.
This pattern scales — future entity packages adopting the engine
write similar small consumer docs (testset-etl-integration.md, etc.).
The general engine doc stays stable.
2. Filter schema and field declarations (D4)
The trio defined the predicate vocabulary (D1) and field-path
convention (D3) but never specified how each entity declares its
filterable surface. Without that, the filter UI has nothing to render,
the predicate validator has nothing to check against, and tier-based
escalation can't reason per-field. Resolved by:
- NEW D4 in eval-filtering.md (~280 lines):
- FilterSchema and FilterFieldSchema type definitions
- Type-to-operator matrix (9 field types x tier classification)
- Canonical scenario filter schema with static + dynamic fields
- Evaluator output type -> FilterFieldType mapping
- Schema-driven UI rendering pattern (one input shape per type)
- Predicate validator (3 checks: field, operator, value type)
- Tier propagation: per-field tier drives per-predicate escalation
- Server-side parity strategy (3 options, v1 picks "independently
authored + integration test")
- Schema versioning rules
- New subsection in eval-package-architecture.md "Cross-entity filter
schemas" (~70 lines):
- Folder structure: shared/paginated/filter/ for generic types +
validator + tier walker; entity/etl/filterSchema.ts for
per-entity schema builders
- Construction flow diagram
- Why this lives at the shared layer (not the engine)
- Pattern for future transforms (ProjectionSchema, MapSchema)
- Clarifying section in etl-engine.md "Filter / transform schemas are
NOT engine concerns" (~20 lines):
- Engine has zero knowledge of fields/types/operators
- Schemas live one layer up at the derived layer
- Cross-references to filter and architecture RFCs
- Folder update in eval-etl-engine.md noting filterSchema.ts location
and pattern for other entities
The key architectural property: the same FilterSchema drives UI
rendering, predicate validation, tier classification, and runtime
field resolution. One source of truth per entity per context. Other
entities (testset, observability, etc.) get the same leverage by
writing their own filterSchema.ts following the shared contract.
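To make the "one source of truth" property concrete, here is a minimal sketch of a schema-driven validator performing the three checks named above (field, operator, value type) and returning the per-field tier. Only the names FilterSchema and FilterFieldSchema come from the RFC; the field layout, the operator list, and the sample schema are assumptions made for illustration.

```typescript
// Hypothetical shapes: the RFC names FilterSchema / FilterFieldSchema,
// but every property below is an assumption for this sketch.
type FilterFieldType = "string" | "number" | "boolean"
type Operator = "eq" | "ne" | "in" | "gt" | "gte" | "lt" | "lte"
type Tier = 1 | 2 | 3

interface FilterFieldSchema {
    type: FilterFieldType
    // Operators allowed on this field, each with its cost tier.
    operators: Partial<Record<Operator, Tier>>
}
type FilterSchema = Record<string, FilterFieldSchema>

interface Predicate {
    field: string
    operator: Operator
    value: unknown
}

type Validation =
    | {ok: true; tier: Tier}
    | {ok: false; reason: "field" | "operator" | "value"}

function valueMatchesType(type: FilterFieldType, operator: Operator, value: unknown): boolean {
    const check = (v: unknown): boolean => typeof v === type
    // "in" takes a list of values; every other operator takes a scalar.
    return operator === "in" ? Array.isArray(value) && value.every(check) : check(value)
}

function validatePredicate(schema: FilterSchema, p: Predicate): Validation {
    const known = Object.prototype.hasOwnProperty.call(schema, p.field)
    const field = known ? schema[p.field] : undefined
    if (!field) return {ok: false, reason: "field"} // check 1: field exists
    const tier = field.operators[p.operator]
    if (tier === undefined) return {ok: false, reason: "operator"} // check 2: operator allowed
    if (!valueMatchesType(field.type, p.operator, p.value)) {
        return {ok: false, reason: "value"} // check 3: value type matches
    }
    return {ok: true, tier} // the tier can feed per-predicate escalation
}

// Illustrative schema; the field names are made up.
const scenarioSchema: FilterSchema = {
    status: {type: "string", operators: {eq: 1, ne: 1, in: 2}},
    "metrics.tokens": {type: "number", operators: {gt: 1, gte: 1, lt: 1, lte: 1}},
}
```

The same object could drive UI rendering (iterate the schema's keys and allowed operators) and validation, which is the leverage the paragraph above describes.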
Working tree returns to clean. Not pushed.
Final state — 4 docs, 31 mermaid diagrams, 2,914 lines:
- etl-engine.md — 572 lines, 2 diagrams (general engine)
- eval-etl-engine.md — 331 lines, 3 diagrams (eval adoption)
- eval-filtering.md — 937 lines, 11 diagrams (what to filter)
- eval-package-architecture.md — 1,074 lines, 15 diagrams (where state lives)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The entire architecture is environment-agnostic: no React, no DOM, no
browser APIs. Adding a "PoC strategy" section that captures how to
validate the design end-to-end against a real backend before touching
the frontend.
The section covers:
- Why headless: faster iteration than UI work, cleaner architectural
  proof (if contracts hold in Node they hold in the UI), real performance
  measurement via process.memoryUsage and hrtime
- What the PoC validates: the engine's 5 guarantees + prefetch hook +
  filter schema validator + tier escalator, all against real data
- File layout: PoC files become the v1 implementation. The package paths
  in the PoC script are the real package paths.
- Three layers of testing (unit / integration / E2E)
- A complete sketch of scripts/etl-poc.ts (~50 lines, executable)
- Suggested ordering with time budget (~5-6 days total)
- What the PoC's run report should capture as the empirical complement
  to the design RFCs
- Preconditions (dev stack runnable, test run with realistic shape,
  step-0 verification that createPaginatedEntityStore runs in Node)
- Branch strategy: fe-experiment/etl-poc follows this trio
Working tree clean after this commit. Not pushed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…es integration
End-to-end validation of the ETL architecture against a real Agenta backend.
The engine consumes data through the existing entities-package paginated
store with full cursor pagination, cancellation, and memory bounds.
ENGINE — web/packages/agenta-entities/src/etl/
- core/types.ts: Source, Transform, Sink, Chunk, MultiSourceTransform,
JoinState, Progress, LoopResult. Plain TS, no DSL.
- runtime/runLoop.ts: ~50-line AsyncGenerator implementing the loop. All
five design-RFC guarantees fall out of the code.
- adapters/makeSourceFromPaginatedStore.ts: wraps any
createPaginatedEntityStore instance as a Source<TApiRow>. Drives the
store's reactive controller subscription, uses scheduleNextPageAtomFamily
to advance the cursor.
- index.ts: public exports including the new adapter.
- __tests__/runLoop.guarantees.test.ts: 5 guarantee tests via node:test
(built-in, no vitest/jest dep).
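For readers without the branch checked out, the following is a rough sketch of what an AsyncGenerator-based chunk loop of this kind can look like. It is not the contents of runtime/runLoop.ts: the interface fields, the `nextCursor` convention, and the `arraySource` helper are assumptions chosen to keep the example self-contained.

```typescript
// Illustrative contracts only; the real ones live in core/types.ts.
interface Chunk<T> {
    rows: T[]
    nextCursor: string | null // null signals end of stream (assumed convention)
}
interface Source<T> {
    read(cursor: string | null, signal: AbortSignal): Promise<Chunk<T>>
}
type Transform<T> = (rows: T[]) => T[] | Promise<T[]>
interface Sink<T> {
    write(rows: T[]): void | Promise<void>
}
interface Progress {
    chunks: number
    scanned: number
    matched: number
}

async function* runLoop<T>(
    source: Source<T>,
    transforms: Transform<T>[],
    sink: Sink<T>,
    signal: AbortSignal,
): AsyncGenerator<Progress, void, void> {
    let cursor: string | null = null
    const progress: Progress = {chunks: 0, scanned: 0, matched: 0}
    // Cancellation is checked between chunks; only one chunk is held at a time.
    while (!signal.aborted) {
        const chunk: Chunk<T> = await source.read(cursor, signal)
        let rows: T[] = chunk.rows
        progress.scanned += rows.length
        for (const transform of transforms) rows = await transform(rows)
        await sink.write(rows) // awaiting the sink provides backpressure
        progress.matched += rows.length
        progress.chunks += 1
        yield {...progress} // the consumer decides when the next chunk is pulled
        if (chunk.nextCursor === null) return
        cursor = chunk.nextCursor
    }
}

// Hypothetical in-memory source, used only to exercise the loop.
function arraySource<T>(data: T[], chunkSize: number): Source<T> {
    return {
        async read(cursor) {
            const start = cursor === null ? 0 : Number(cursor)
            const end = start + chunkSize
            return {
                rows: data.slice(start, end),
                nextCursor: end < data.length ? String(end) : null,
            }
        },
    }
}
```

A caller iterates with `for await (const p of runLoop(...))` and calls `controller.abort()` to stop after the current chunk. Resuming would mean passing a saved cursor back into the source, which this sketch leaves out.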
EVAL-SPECIFIC ADAPTER — web/packages/agenta-entities/src/evaluationRun/etl/
- realScenarioSource.ts: minimal Source<EvaluationScenario> with the OSS
cursor-fallback pattern.
POC SCRIPTS
- scripts/etl-poc-smoke.ts: synthetic-data engine validation (5/5 pass)
- scripts/etl-poc.ts: minimum-viable real-backend PoC via direct fetch
- web/oss/poc/etl-entities-probe.ts: 4-stage Node-portability probe
covering shared axios + Zod + Jotai + createPaginatedEntityStore (4/4
pass against real backend)
- web/oss/poc/etl-poc-entities.ts: full "really using entities" PoC —
wraps real createPaginatedEntityStore via makeSourceFromPaginatedStore,
runs runLoop end-to-end (3 chunks, 150 rows, cursor advance via
scheduleNextPageAtomFamily, all 5 assertions pass)
PACKAGE EXPORTS — web/packages/agenta-entities/package.json
- "./etl" → ./src/etl/index.ts
- "./evaluationRun/etl" → ./src/evaluationRun/etl/index.ts
- "test:etl" script runs node:test guarantee tests via tsx
ARCHITECTURAL FINDING: @agenta/entities barrel exports transitively pull
React components (via shared/user/UserAuthorLabel.tsx → @agenta/ui → CSS).
Workaround: deep relative imports. Follow-up: split entities/shared barrel
so data-layer consumers don't pull UI. Not blocking.
VERIFIED against real backend (run 019e3701-...):
- etl-poc.ts: 7 chunks, 300 scanned, 64 matched, ~73ms/chunk
- etl-poc-entities.ts: 3 chunks via store, 150 rows, ~70ms/chunk
- all 5 engine guarantees hold against real network + real data
Branched from fe-experiment/etl-engine. Not pushed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The PoC's observational "heap=+X MB" logs were anecdotes, not tests.
This commit turns the design RFC's claims about memory bounds and engine
overhead into enforceable assertions that fail on regression.
NEW TESTS — web/packages/agenta-entities/src/etl/__tests__/
runLoop.memory.test.ts (4 assertions, --expose-gc required)
- 100 chunks × 1000 fat rows (~1MB each) — heap delta stays under
25MB budget. Unbounded would show 100MB+ growth.
- Heap delta at chunk 100 does NOT exceed delta at chunk 25 by more
than 10MB. Catches monotonic per-chunk growth.
- After AbortSignal cancellation mid-stream, forced GC twice, heap
returns within 15MB of baseline. Catches "abort doesn't release
in-flight chunk references" regressions.
- 10-transform identity chain over 100 chunks × 500 rows — heap
stays under 30MB. Catches "transform array retains intermediate
chunks" regressions.
All four skip gracefully without --expose-gc (default test:etl
unaffected for contributors without the flag).
runLoop.overhead.test.ts (2 assertions)
- runLoop vs hand-written equivalent: same source, transform, sink,
iteration. Median of 5 runs with warmup. Asserts engine overhead
< 25% of baseline. Measured locally at 9.4%. Reports numbers in
test output for CI log inspection.
- Correctness parity: engine and baseline produce identical
scanned/matched/loaded counts. Catches "engine drops/double-counts
rows" regressions independently of timing.
SCRIPTS — web/packages/agenta-entities/package.json
- test:etl — guarantees only (fast, no --expose-gc, every PR)
- test:etl:memory — memory + overhead (--expose-gc, ~400ms)
Scope B (next commit) adds: AtomFamily leak detection,
per-chunk latency budgets per scenario, long-run (10k iter) leak
regression test, budget tuning docs.
Verified locally:
- test:etl: 9/9 pass (existing guarantees, unchanged)
- test:etl:memory: 6/6 pass (memory: 4, overhead: 2)
- Engine overhead: 9.4% over baseline (budget 25%)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…n + docs
Completes the perf/memory test suite started in Scope A. Adds per-scenario
latency budgets, atomFamily-style leak detection via long-run heap sampling,
and a README explaining what each test catches and how to handle failures.
NEW TESTS — web/packages/agenta-entities/src/etl/__tests__/
runLoop.benchmark.test.ts (7 scenarios)
- passthrough — 200 rows: p95 budget 5ms (local p95: 0.09ms)
- tier1 eq filter — 200 rows: p95 budget 5ms (0.08ms)
- tier1 gte filter — 200 rows: p95 budget 5ms (0.06ms)
- tier2 in-set filter — 200 rows: p95 budget 10ms (0.07ms)
- map transform — 200 rows: p95 budget 8ms (0.06ms)
- large chunk — 1000 rows: p95 budget 15ms (0.75ms)
- multi-transform chain (5 filters) — 200 rows: p95 budget 12ms (0.09ms)
Budgets carry roughly 20-150x headroom over the local p95 figures above to absorb CI variance. Each test reports its
actual p50/p95/p99/max in test output for trend observability.
runLoop.leak.test.ts (2 assertions, --expose-gc required, ~300ms)
- 100 iter fresh source/sink: heap regression slope < 50 KB/iter.
Local: 1.51 KB/iter.
- 500 iter: heap range (max - min) under 5MB. Local: 0.77MB.
Catches atomFamily leaks in makeSourceFromPaginatedStore indirectly:
persistent atom entries manifest as monotonic heap growth.
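The two leak metrics above reduce to simple statistics over per-iteration heap samples. A small sketch, assuming the slope is an ordinary least-squares fit of heap bytes against iteration index (the commit message does not say which estimator the test uses):

```typescript
// Least-squares slope of y against x = 0, 1, 2, ...
// Assumption: the real test may use a different estimator.
function leastSquaresSlope(samples: number[]): number {
    const n = samples.length
    if (n < 2) return 0
    const meanX = (n - 1) / 2
    const meanY = samples.reduce((a, b) => a + b, 0) / n
    let num = 0
    let den = 0
    samples.forEach((y, x) => {
        num += (x - meanX) * (y - meanY)
        den += (x - meanX) ** 2
    })
    return num / den
}

// Spread between the largest and smallest sample.
function heapRange(samples: number[]): number {
    return Math.max(...samples) - Math.min(...samples)
}

// Budgets quoted in the commit message.
const SLOPE_BUDGET_BYTES_PER_ITER = 50 * 1024
const RANGE_BUDGET_BYTES = 5 * 1024 * 1024

function withinLeakBudgets(heapSamples: number[]): boolean {
    return (
        leastSquaresSlope(heapSamples) < SLOPE_BUDGET_BYTES_PER_ITER &&
        heapRange(heapSamples) < RANGE_BUDGET_BYTES
    )
}
```

In a real test the samples would come from `process.memoryUsage().heapUsed` taken after a forced GC on each iteration; a slope is less sensitive to a single noisy sample than a first-versus-last comparison.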
DOCS — web/packages/agenta-entities/src/etl/__tests__/README.md
What each test catches, how to interpret failures, how budgets are
calibrated, how to add new tests under the existing convention.
SCRIPTS — web/packages/agenta-entities/package.json
- test:etl guarantees only (~300ms, every PR)
- test:etl:memory memory + overhead + benchmark (~1s, every PR)
- test:etl:longrun leak detection (~30s, nightly)
Verified locally:
- test:etl: 9/9 pass
- test:etl:memory: 13/13 pass (4 memory + 2 overhead + 7 benchmark)
- test:etl:longrun: 2/2 pass
Together Scope A + B turn design RFC claims into enforceable assertions:
Claim                              Test
────────────────────────────────────────────────────────────
Memory bounded by chunk size    →  runLoop.memory.test.ts
Cancellation releases held      →  runLoop.memory.test.ts
Engine adds minimal overhead    →  runLoop.overhead.test.ts
Per-scenario p95 latency        →  runLoop.benchmark.test.ts
No leaks across pipelines       →  runLoop.leak.test.ts
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PoC observation that surfaced this work: the same eval run shows different
"Total scanned" values across runs (200 / 200 / 300) — not because the data
differs, but because viewport-driven cancellation stops the loop early when
matches arrive before the viewport target.
POC OUTPUT — web/oss/poc/etl-poc-entities.ts
Throughput section now reports:
- Stop reason: "viewport-fill cancellation" vs "source exhausted"
- Dataset coverage: "100% — scanned all N rows" or "partial — N rows"
- Over-fetch waste (viewport-cancelled only): "180 rows matched beyond
viewport target of 20 (900% over)"
Before: "Rows scanned: 200" was ambiguous between dataset size and
where we stopped.
ARCHITECTURE — docs/designs/eval-package-architecture.md
New section "Chunk size selection — the RTT vs over-fetch trade-off"
in Limitations section:
- Mermaid diagram showing the small-chunks-vs-big-chunks cost surface
- Measured table from real PoC (300-row eval):
chunk=25, viewport=200 → 8 RTTs, 0 over-fetch
chunk=200, viewport=20 → 1 RTT, 180 over-fetch (9× viewport)
chunk=1000, viewport=20 → 1 RTT, 980 over-fetch (49× viewport)
- Recommended sizing per 6 consumer patterns
- Filter UX implications + two mitigation strategies
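The measured table follows from simple arithmetic under a simplifying model. The sketch below assumes every fetched row matches and that the dataset is at least as large as what is fetched; neither holds in general, and the function name is made up for illustration.

```typescript
interface FetchCost {
    rtts: number // round trips needed to fill the viewport
    fetched: number // rows pulled over the wire
    overFetch: number // rows fetched beyond the viewport target
}

// Simplified model: 100% hit ratio, dataset larger than `fetched`.
function viewportFetchCost(chunkSize: number, viewportTarget: number): FetchCost {
    const rtts = Math.ceil(viewportTarget / chunkSize)
    const fetched = rtts * chunkSize
    return {rtts, fetched, overFetch: fetched - viewportTarget}
}

const small = viewportFetchCost(25, 200) // { rtts: 8, fetched: 200, overFetch: 0 }
const medium = viewportFetchCost(200, 20) // { rtts: 1, fetched: 200, overFetch: 180 }
const large = viewportFetchCost(1000, 20) // { rtts: 1, fetched: 1000, overFetch: 980 }
```

With a selective filter the hit ratio drops, so more chunks are scanned per viewport and the balance shifts toward larger chunks.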
FILTER RFC — docs/designs/eval-filtering.md
New constraint C6 in Performance constraints with same empirical data
scoped to filter use case. Cross-references the architecture doc.
Both docs now name the over-fetch cost explicitly rather than letting
chunk_size be an arbitrary default.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… JSON output
POC OUTPUT — web/oss/poc/etl-poc-entities.ts
New "Rows per RTT" metric in Throughput section:
Rows per RTT 200 (1 RTT(s) for 200 rows)
Rows per RTT 25 (8 RTT(s) for 200 rows)
New "Network requests (HTTP)" subsection showing every HTTP call
grouped by endpoint with count, total ms, median ms, bytes.
Implemented via axios interceptors capturing per-call latency
and bytes; stored for both human-readable output and JSON report.
AGENTA_OUTPUT=json mode: suppresses human output, emits a single
structured JSON report on stdout covering config / runtime /
outcome / throughput / latency / network / memory / chunks /
assertions. Useful for CI artifacts, benchmark pipelines, scripted
analysis.
CURSOR IMPROVEMENT — realScenarioSource.ts + PoC fetchPage callback
Refined three-case cursor resolution:
Case 1: server returns windowing.next as string — use it
Case 2: server returns windowing object with next=null/missing —
authoritative end-of-stream, skip heuristic
Case 3: server omits windowing entirely — fall back to OSS
last-row-id heuristic
Plus: items.length < limit → definitive end
Saves one RTT for backends signalling end via windowing:{next:null}.
Doesn't help local Agenta /evaluations/scenarios/query (omits
windowing — still pays phantom RTT). Improvement is correct for
spec-compliant backends.
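The three cases plus the short-page rule can be written as one pure function. This is a sketch of the decision logic only, not the code in realScenarioSource.ts; the response shape and the choice to test the short-page rule first are assumptions.

```typescript
interface PageResponse<T extends {id: string}> {
    items: T[]
    // Absent entirely on backends that do not implement windowing.
    windowing?: {next?: string | null}
}

type CursorDecision = {done: true} | {done: false; cursor: string}

function resolveNextCursor<T extends {id: string}>(
    res: PageResponse<T>,
    limit: number,
): CursorDecision {
    // Short page: definitive end, whatever the windowing says (assumed ordering).
    if (res.items.length < limit) return {done: true}
    if (res.windowing) {
        const next = res.windowing.next
        // Case 1: explicit next cursor. Case 2: null/missing is an authoritative end.
        return typeof next === "string" ? {done: false, cursor: next} : {done: true}
    }
    // Case 3: no windowing at all, fall back to the last-row-id heuristic.
    const last = res.items[res.items.length - 1]
    return last ? {done: false, cursor: last.id} : {done: true}
}
```

Case 3 is where the phantom RTT comes from: a full final page with no windowing object cannot be told apart from a page with more data behind it, so one extra request is needed to see the empty page.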
ARCHITECTURE — docs/designs/eval-package-architecture.md
Updated chunk-size table with "Rows per RTT" and "Scan rate"
columns. New paragraph: rows-per-RTT is the load-bearing metric,
not rows-per-second. RTT amortization explanation. Recommendation:
size chunks for rows-per-RTT.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The shared/index barrel transitively re-exports React-coupled paginated
table helpers and the UserAuthorLabel TSX component, dragging @agenta/ui
CSS modules into any consumer. That broke Node-side execution (scripts,
tests, ETL adapters): importing safeParseWithLogging via the barrel
crashed on the first .module.css.
Patch each api+core+state file to deep-import from the pure source
modules (shared/utils/zodSchema, shared/molecule/...) instead of the
contaminated barrel. The api+schema layers have no React surface and
must stay Node-safe so they can be reused outside the browser.
Also adds EvaluationMetric schema + queryEvaluationMetrics API to the
evaluationRun module. The metric-query endpoint was the last entity the
hydrate pipeline needs and wasn't represented in the package yet.
Plus a bug fix: fetchTestcasesBatch was missing the cache-write side
effect that fetchTestcasesPage already had. Both batch fetchers now
consistently populate the TanStack cache.
Files patched: annotation, testcase, trace, evaluationRun api/core/state layers.
Scenarios as returned by /evaluations/scenarios/query are *references* —
id, status, run_id, testcase_id only. To render anything meaningful in
the UI (input data, app outputs, evaluator scores, traces) we have to
join 4 additional entities per chunk: results, metrics, testcases, and
traces. This transform stages those bulk fetches behind one chunk-level
boundary so the rest of the pipeline sees fully-materialized rows.
The four sub-fetches are injected via a HydrateFetchers interface, not
hardcoded. Default is raw-HTTP via the entities-package api functions;
callers can swap in molecule-backed or cache-aware variants without
touching the transform itself. Two-stage internally:
1. results + metrics (parallel, scoped by scenarioIds)
2. testcases + traces (parallel, derived from result.testcase_id and
result.trace_id once stage 1 completes)
Bounded request budget: 4 bulk calls per chunk, independent of chunk
size or column count.
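A compact sketch of the two-stage shape described above, with injected fetchers. Only the HydrateFetchers name and the testcase_id / trace_id fields come from the commit message; the method names, the row shapes, and the Map-based lookups are assumptions.

```typescript
interface ScenarioRef {
    id: string
}
interface ResultRow {
    scenario_id: string
    testcase_id?: string
    trace_id?: string
}
interface MetricRow {
    scenario_id: string
    data: Record<string, unknown>
}
// Hypothetical method names; the real interface may differ.
interface HydrateFetchers {
    results(scenarioIds: string[]): Promise<ResultRow[]>
    metrics(scenarioIds: string[]): Promise<MetricRow[]>
    testcases(ids: string[]): Promise<Map<string, unknown>>
    traces(ids: string[]): Promise<Map<string, unknown>>
}
interface HydratedRow {
    scenario: ScenarioRef
    results: ResultRow[]
    metrics: MetricRow[]
    testcases: unknown[]
    traces: unknown[]
}

const unique = (ids: (string | undefined)[]): string[] =>
    Array.from(new Set(ids.filter((id): id is string => typeof id === "string")))

const pick = (lookup: Map<string, unknown>, ids: string[]): unknown[] =>
    ids.filter((id) => lookup.has(id)).map((id) => lookup.get(id))

async function hydrateChunk(
    scenarios: ScenarioRef[],
    fetchers: HydrateFetchers,
): Promise<HydratedRow[]> {
    const ids = scenarios.map((s) => s.id)
    // Stage 1: results + metrics in parallel, scoped by scenario ids.
    const [results, metrics] = await Promise.all([fetchers.results(ids), fetchers.metrics(ids)])
    // Stage 2: testcases + traces in parallel, keyed off stage-1 results.
    const [testcases, traces] = await Promise.all([
        fetchers.testcases(unique(results.map((r) => r.testcase_id))),
        fetchers.traces(unique(results.map((r) => r.trace_id))),
    ])
    return scenarios.map((scenario) => {
        const own = results.filter((r) => r.scenario_id === scenario.id)
        return {
            scenario,
            results: own,
            metrics: metrics.filter((m) => m.scenario_id === scenario.id),
            testcases: pick(testcases, unique(own.map((r) => r.testcase_id))),
            traces: pick(traces, unique(own.map((r) => r.trace_id))),
        }
    })
}
```

Whatever the chunk size, the function issues exactly four fetcher calls, which is the bounded request budget the paragraph above refers to. A production version would index results by scenario id instead of filtering per row.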
The transform is what makes the architecture RFC's Convention 7
('correlatedDataPrefetch — data presence is a store concern, not a cell
concern') concrete. Without this stage, the v1 filter has nothing to
materialize predicates against.
…uping
The run document (POST /evaluations/runs/query) carries run.data.steps
and run.data.mappings — the eval graph and the column manifest the UI
renders. Each mapping declares { column.kind, column.name, step.key,
step.path } and the renderer is supposed to resolve each cell value
declaratively against the joined entities.
This module implements that resolver. Dispatch on step.type (not
column.kind) so future custom step types just register a strategy
without editing existing branches:
input → resolveFromTestcase (testcase.data.*)
invocation → resolveFromTrace (walks spans tree on result.trace_id)
annotation → composeResolvers(metric, trace)
• metric.data[step.key][flat-path] (cheap, pre-aggregated)
• trace fallback if metric absent
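The dispatch table can be pictured as a registry keyed by step type, with composeResolvers providing the first-non-undefined fallback. The resolver names below follow the list above, but the row shape, the flat trace lookup (the real resolver walks a span tree), and the registry API are assumptions made for this sketch.

```typescript
// Assumed shape of a hydrated row; the real entities are richer.
interface JoinedRow {
    testcase?: {data: Record<string, unknown>}
    trace?: Record<string, unknown>
    metric?: {data: Record<string, Record<string, unknown>>}
}
interface Target {
    stepKey: string
    path: string
}
// A resolver returns undefined when it has no value for the target.
type Resolver = (row: JoinedRow, target: Target) => unknown

// Try each resolver in order and return the first defined value.
const composeResolvers =
    (...resolvers: Resolver[]): Resolver =>
    (row, target) => {
        for (const resolve of resolvers) {
            const value = resolve(row, target)
            if (value !== undefined) return value
        }
        return undefined
    }

const resolveFromTestcase: Resolver = (row, t) => row.testcase?.data[t.path]
// Simplification: flat lookup instead of walking the span tree.
const resolveFromTrace: Resolver = (row, t) => row.trace?.[t.path]
const resolveFromMetric: Resolver = (row, t) => row.metric?.data[t.stepKey]?.[t.path]

// Dispatch on step.type; a new step type registers itself here.
const resolverRegistry = new Map<string, Resolver>([
    ["input", resolveFromTestcase],
    ["invocation", resolveFromTrace],
    ["annotation", composeResolvers(resolveFromMetric, resolveFromTrace)],
])

function resolveCell(stepType: string, row: JoinedRow, target: Target): unknown {
    const resolve = resolverRegistry.get(stepType)
    return resolve ? resolve(row, target) : undefined
}
```

Adding a custom step type is then a single `resolverRegistry.set("my-step", myResolver)` call, with no edits to the existing branches.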
Plus column grouping: derived from step.type + step.references so two
evaluators with the same column name (e.g. both emit 'success') resolve
to distinct group keys ('evaluator:exact-match' vs 'evaluator:fuzzy-match')
and the UI's grouped-header layout mirrors the live screenshot.
Path-based override: paths under attributes.ag.metrics.* land in the
cross-cutting 'Metrics' group regardless of the underlying step type —
matches the UI's Cost/Duration/Tokens/Errors column placement.
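As a rough illustration of the grouping rule, the sketch below derives a group key from the step type and an evaluator slug, with the metrics path override applied first. The `evaluatorSlug` field stands in for whatever step.references actually carries, and the input-to-testset / invocation-to-application mapping is an assumption based on the UI group names.

```typescript
interface StepInfo {
    type: string
    // Hypothetical flattening of step.references for this sketch.
    evaluatorSlug?: string
}

const METRICS_PATH_PREFIX = "attributes.ag.metrics."

function groupKey(step: StepInfo, path: string): string {
    // Path-based override: metrics paths land in the cross-cutting group.
    if (path.startsWith(METRICS_PATH_PREFIX)) return "metrics"
    switch (step.type) {
        case "input":
            return "testset"
        case "invocation":
            return "application"
        case "annotation":
            // Namespacing by slug keeps same-named columns apart.
            return `evaluator:${step.evaluatorSlug ?? "unknown"}`
        default:
            return `step:${step.type}`
    }
}

// Two evaluators that both emit a "success" column:
const exact = groupKey({type: "annotation", evaluatorSlug: "exact-match"}, "success")
const fuzzy = groupKey({type: "annotation", evaluatorSlug: "fuzzy-match"}, "success")
```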
Includes a multi-shape findInTrace walker that handles {spans:{<name>:span}},
{spans:[...]}, {response:{tree:[...]}}, deep child trees, and the
envelope-as-span case — so different trace endpoints don't require a
resolver patch.
41 unit tests covering each strategy, the multi-evaluator collision
case, metric-override-on-path, group ordering, edge cases, extensibility
via customResolvers + fallbackResolver.
… tests
Adds the entity layer the hydrate transform needed:
evaluationResultMolecule.actions.prefetchByScenarioIds (new)
evaluationMetricMolecule.actions.prefetchByScenarioIds (new)
prefetchTestcasesByIds (new sidecar)
prefetchTracesByIds (new sidecar)
buildMoleculeBackedFetchers / MOLECULE_BACKED_HYDRATE_FETCHERS
Every action consults the shared TanStack QueryClient before bulk-fetching
misses, then writes results back. Cache keys:
['evaluation-results', projectId, runId, scenarioId]
['evaluation-metrics', projectId, runId, scenarioId]
['testcase', projectId, testcaseId]
['trace-entity', projectId, traceId]
Empty arrays are cached too — a scenario with no metrics doesn't refetch
every time. Caller-friendly stat block returned per call (cacheHits,
cacheMisses, fetchMs) so observability surfaces can report hit ratios.
evictByRunId on the result/metric molecules bulk-clears every entry for
a run via prefix match — required for long-run ETL passes that rotate
through many runs and would otherwise leak TanStack observer state.
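The cache contract (consult before fetching, cache empty arrays, evict by run prefix) can be modelled with a plain Map. The class below is a stand-in for the shared TanStack QueryClient, not a wrapper around it; the class and method names are hypothetical.

```typescript
type QueryKey = readonly string[]

// Stand-in for the shared QueryClient, keyed by the JSON-encoded key array.
class EntityCache<V> {
    private entries = new Map<string, V>()

    set(key: QueryKey, value: V): void {
        this.entries.set(JSON.stringify(key), value)
    }

    // Split requested keys into cached values and keys that still need a bulk fetch.
    // Uses has(), so a cached empty array counts as a hit and is not refetched.
    partition(keys: QueryKey[]): {hits: V[]; misses: QueryKey[]} {
        const hits: V[] = []
        const misses: QueryKey[] = []
        for (const key of keys) {
            const encoded = JSON.stringify(key)
            if (this.entries.has(encoded)) hits.push(this.entries.get(encoded) as V)
            else misses.push(key)
        }
        return {hits, misses}
    }

    // Remove every entry whose key starts with `prefix`; returns the count removed.
    evictByPrefix(prefix: QueryKey): number {
        let removed = 0
        for (const encoded of Array.from(this.entries.keys())) {
            const key = JSON.parse(encoded) as string[]
            if (prefix.every((part, i) => key[i] === part)) {
                this.entries.delete(encoded)
                removed += 1
            }
        }
        return removed
    }

    get size(): number {
        return this.entries.size
    }
}

const resultKey = (projectId: string, runId: string, scenarioId: string): QueryKey => [
    "evaluation-results",
    projectId,
    runId,
    scenarioId,
]
```

An evictByRunId action then amounts to `cache.evictByPrefix(["evaluation-results", projectId, runId])`, which leaves entries for other runs untouched.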
Trace prefetch uses fetchAllPreviewTraces with an IN filter directly
(one network round-trip for the entire batch). Routing through the
existing per-id traceBatchFetcher would have hit its maxBatchSize=50
cap and turned 100 ids into 2 round-trips instead of 1 — measured
~50% throughput regression vs raw fetcher. Bypassing the per-id
coalescer for already-bulk inputs keeps performance flat.
Tests:
15 unit tests for the cache contract (read, write, invalidate, isolation)
5 leak tests (--expose-gc) verifying:
- re-prefetching same scenarios doesn't grow cache
- evictByRunId fully releases run-scoped entries
- evictByRunId is run-scoped (other runs untouched)
- 100 iterations with evict-between → heap slope ~7 KB/iter
- WITHOUT eviction → heap grows linearly (documents caller responsibility)
package.json: test:etl:longrun split into per-file subprocesses to
avoid cross-test Jotai store pollution. --test-force-exit ensures
the process terminates after the test runner finishes.
…cache diagnostics
jotai-family's atomFamily memoizes one atom per unique param and exposes
.remove() for eviction, but no way to ask 'how many entries does this
family hold right now?'. That makes memory diagnosis impossible — a
family holding 50 ids and one holding 50,000 look identical from outside.
This wrapper adds .size(), .params(), .clear() plus an optional global
registry so consumers can list every instrumented family and its current
param count via inspectAtomFamilies(). Drop-in compatible with the base
factory's two call shapes (create only / create + areEqual).
The trace store's 9 atom families and the 16 atom families across
createPaginatedEntityStore + createInfiniteTableStore + createInfiniteDatasetStore
migrate to the instrumented variant. Each paginated store now exposes:
store.dispose() → releases every owned family's params AND
removes the store's TanStack queries by prefix
store.familySizes() → diagnostic snapshot per internal family
Result on the combined leak test (50 iterations of pipeline + adapter +
molecules with full teardown):
before: 70 KB/iter heap slope (~50 KB unattributed paginated overhead)
after: 0.5 KB/iter heap slope (~140× reduction, completely flat)
The unattributed bleed was two distinct concerns:
1. 16 atomFamily closures per store instance retained forever — fixed
by instrumentedAtomFamily.clear() in dispose()
2. TanStack queries keyed by [options.key, scopeId, ...] never evicted
when the scopeId rotates — fixed by removeQueries({queryKey:[options.key]})
in dispose()
Adds cacheDiagnostics module: inspectCache() walks the QueryClient by
prefix, inspectMemory() bundles cache + atomFamily + heap, clearCacheByPrefix()
for explicit teardown. Default prefix list includes span-level cache
(populated as a side-effect of traceBatchFetcher) so per-trace memory
cost isn't under-counted.
Subtle bug fix in the migration: the python-script injection initially
produced atomFamily(create, areEqual, name) but the local wrapper only
accepted (create, name) and silently dropped the areEqual function.
Without areEqual, params dedup by reference identity → every call
creates a fresh atom → pagination state lost between chunks. Caught
by re-running the PoC against the real backend (only 50 rows scanned
instead of 300). Wrapper now accepts both call shapes.
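The areEqual bug is easy to reproduce with a framework-free stand-in for the instrumented family. The sketch below does not use jotai; it memoizes arbitrary values per param and exposes the same diagnostic surface (.size(), .params(), .clear()) so the dedup behaviour can be seen in isolation. The function name and the linear-scan storage are assumptions.

```typescript
interface InstrumentedFamily<P, A> {
    (param: P): A
    remove(param: P): void
    size(): number
    params(): P[]
    clear(): void
}

// Accepts both call shapes: (create) and (create, areEqual).
function instrumentedFamily<P, A>(
    create: (param: P) => A,
    areEqual: (a: P, b: P) => boolean = Object.is,
): InstrumentedFamily<P, A> {
    // Linear scan keeps the sketch short; a Map would do when areEqual is omitted.
    let entries: {param: P; value: A}[] = []

    const family = (param: P): A => {
        const hit = entries.find((e) => areEqual(e.param, param))
        if (hit) return hit.value
        const value = create(param)
        entries.push({param, value})
        return value
    }

    return Object.assign(family, {
        remove: (param: P): void => {
            entries = entries.filter((e) => !areEqual(e.param, param))
        },
        size: (): number => entries.length,
        params: (): P[] => entries.map((e) => e.param),
        clear: (): void => {
            entries = []
        },
    })
}

interface Scope {
    scopeId: string
}
const sameScope = (a: Scope, b: Scope): boolean => a.scopeId === b.scopeId

// With areEqual: structurally equal params share one entry.
const deduped = instrumentedFamily((s: Scope) => ({cursor: null as string | null, s}), sameScope)
// Without areEqual: every fresh object literal creates a new entry.
const byIdentity = instrumentedFamily((s: Scope) => ({cursor: null as string | null, s}))
```

Calling `byIdentity({scopeId: "run-1"})` twice yields two separate state objects, which is how pagination state was lost between chunks once areEqual was dropped.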
Combined leak test verifies the full chain: real paginatedStore adapter
+ molecule prefetch + atom family teardown stays under 30 KB/iter
budget (actual: 3-7 KB/iter).
…calation signal)
Implements the v1 client-side filter the eval-filtering RFC commits to
shipping first, plus the meter that decides when to escalate to v2.
rowPredicateFilter:
Post-hydrate transform that drops materialized rows failing a value-
equality predicate against any resolved UI column. Targets columns by
their group (testset / application / evaluator / metrics) + name +
optional group slug for evaluator namespacing.
Supports AND-composition via predicates: [...]. Operators: eq, ne, in,
nin, lt, lte, gt, gte.
Includes unwrapStatsForCompare — collapses metric stats blobs to their
dominant value before comparison so callers write 'value: false'
regardless of whether the column resolves via metric ({type:'binary',
freq:[...]}) or trace (raw boolean).
This transform runs AFTER hydrate because predicates target joined
values that don't exist on the bare scenario (evaluator output, metric
thresholds). Wasted hydration on dropped rows is the explicit cost the
hit-ratio meter decides whether to tolerate.
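A condensed sketch of the filter semantics described above: unwrap stats blobs, compare with one of the eight operators, AND the clauses together. The stats shape ({type, freq: [{value, count}]}), the flat column keys, and the helper names other than unwrapStatsForCompare are assumptions.

```typescript
type Op = "eq" | "ne" | "in" | "nin" | "lt" | "lte" | "gt" | "gte"
interface Clause {
    column: string
    op: Op
    value: unknown
}
type Row = Record<string, unknown>

// Assumed stats shape: {type, freq: [{value, count}]}. Anything else passes through.
function unwrapStatsForCompare(value: unknown): unknown {
    if (typeof value !== "object" || value === null) return value
    const freq = (value as {freq?: unknown}).freq
    if (!Array.isArray(freq) || freq.length === 0) return value
    const entries = freq as {value: unknown; count: number}[]
    // Dominant value = the entry with the highest count.
    return entries.reduce((top, e) => (e.count > top.count ? e : top)).value
}

function matches(actual: unknown, op: Op, expected: unknown): boolean {
    switch (op) {
        case "eq":
            return actual === expected
        case "ne":
            return actual !== expected
        case "in":
            return Array.isArray(expected) && expected.includes(actual)
        case "nin":
            return Array.isArray(expected) && !expected.includes(actual)
        default: {
            // Ordering operators are defined for numbers only in this sketch.
            if (typeof actual !== "number" || typeof expected !== "number") return false
            if (op === "lt") return actual < expected
            if (op === "lte") return actual <= expected
            if (op === "gt") return actual > expected
            return actual >= expected
        }
    }
}

// AND-composition: a row survives only if every clause matches.
function rowPredicateFilter(rows: Row[], predicates: Clause[]): Row[] {
    return rows.filter((row) =>
        predicates.every((c) => matches(unwrapStatsForCompare(row[c.column]), c.op, c.value)),
    )
}
```

Because the unwrap happens before comparison, a clause like `{column: "success", op: "eq", value: false}` matches both a raw boolean and a binary stats blob whose dominant value is false.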
hitRatioMeter:
Tracks (matched / scanned) per chunk via rolling window. State machine:
warming → < windowSize chunks observed
client → rolling ratio ≥ threshold (v1 comfortable, keep client)
escalate → rolling ratio < threshold (recommend v2 server-side filter)
RFC defaults: windowSize=3, threshold=0.10. Configurable.
REPORTS the regime — does NOT swap engines. The actual swap (next
chunk's source request carries 'filtering' payload, predicateFilter
becomes a no-op) is the v2 milestone. The meter is the seam where
that integration will land.
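The state machine is small enough to sketch in full. This version pools matched and scanned counts across the window to compute the rolling ratio; whether the real meter pools counts or averages per-chunk ratios is not stated, so treat that, the factory name, and the zero-scanned handling as assumptions.

```typescript
type Regime = "warming" | "client" | "escalate"

interface HitRatioMeter {
    observe(scanned: number, matched: number): Regime
}

function createHitRatioMeter(windowSize = 3, threshold = 0.1): HitRatioMeter {
    if (windowSize < 1) throw new RangeError("windowSize must be >= 1")
    if (threshold < 0 || threshold > 1) throw new RangeError("threshold must be in [0, 1]")
    const recent: {scanned: number; matched: number}[] = []

    return {
        observe(scanned, matched) {
            recent.push({scanned, matched})
            if (recent.length > windowSize) recent.shift() // slide the window
            if (recent.length < windowSize) return "warming"
            const totalScanned = recent.reduce((sum, c) => sum + c.scanned, 0)
            const totalMatched = recent.reduce((sum, c) => sum + c.matched, 0)
            // Nothing scanned yet: no evidence either way (assumed behaviour).
            if (totalScanned === 0) return "warming"
            return totalMatched / totalScanned >= threshold ? "client" : "escalate"
        },
    }
}

// Usage: report the regime after each chunk; the caller decides what to do with it.
const meter = createHitRatioMeter()
const regimes = [
    meter.observe(50, 43),
    meter.observe(50, 43),
    meter.observe(50, 43),
]
```

With the RFC defaults, a filter passing about 86% of rows reaches "client" on the third chunk, while one passing about 1% reaches "escalate" on the third chunk.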
Verified against this PR's run:
- 86% pass-rate filter → stays 'client' from chunk 3 onward
- 1% pass-rate composite filter → trips 'escalate' at chunk 3
13 unit tests covering state transitions, window sliding, edge cases,
config validation, custom resolver registration.
Brings the PoC from the previous 'engine + synthetic source' shape up to
a full v1 integration that exercises every layer added in this series:
source: createPaginatedEntityStore via makeSourceFromPaginatedStore
transforms:
1. statusFilter (cheap scenario-level filter)
2. hydrateScenarios (4 bulk fetches per chunk via molecules)
3. predicateFilter (opt) (post-hydrate, AND-composed clauses)
sink: in-memory accumulator
Run schema (run.data.steps + run.data.mappings) is fetched in pre-flight
and threaded into resolveMappings so the row dump mirrors the UI's
grouped-header layout exactly (Testset / Application / <Evaluator> /
Metrics). Synthesizes implicit Metrics columns (tokens/cost/duration)
for predicate targeting since run.data.mappings only declares user-
visible columns.
Observability per chunk + final report:
- timing breakdown (fetch / transform / sink ms)
- cache hit/miss per entity per chunk (results/metrics/testcases/traces)
- entity cache size + per-prefix breakdown
- atom family params (instrumented registry snapshot)
- hit-ratio meter regime evolution + final recommendation
- cache eviction verification (before/after, atom-family cleanup)
- human + JSON output modes
A/B knob: AGENTA_FETCHER_MODE=raw|molecule swaps the hydrate fetchers
between direct-HTTP and molecule-backed. After the trace-prefetch fix,
both modes hit identical 33 HTTP requests and within-noise timing.
Filter knobs:
AGENTA_PREDICATE_KIND testset | application | evaluator | metrics
AGENTA_PREDICATE_GROUP optional slug (e.g. 'exact-match')
AGENTA_PREDICATE_COLUMN e.g. 'success', 'tokens.cumulative.total'
AGENTA_PREDICATE_OP eq | ne | in | nin | lt | lte | gt | gte
AGENTA_PREDICATE_VALUE JSON-parsed
AGENTA_PREDICATE2_* second clause for AND composition
End-to-end assertions (15 checks): engine guarantees, entity-layer
integration, cache reuse (100% hits on rerun, 0ms network), correct
row materialization, predicate filter shape.
Validated against run 019e3701-523f-7782-8813-9ca438f48399:
300 scenarios, 6 chunks of 50, 33 HTTP requests, ~1s end-to-end.
Filter 'evaluator:exact-match.success eq false' → 258/300 (86%) client
Filter '... AND tokens > 35' → 3/300 (1%) escalate
Colocate the PoC with the package it validates. The script imports
exclusively from @agenta/entities + @agenta/shared; its placement in
web/oss/ was inertia from the previous PoC harness
(etl-entities-probe.ts already lived there), not because it has any
OSS-specific code.
After move:
  before: web/oss/poc/etl-poc-entities.ts
  after:  web/packages/agenta-entities/poc/etl-poc-entities.ts
All relative imports updated from ../../packages/agenta-entities/src/
to ../src/. No external-facing API changes; tests + lint pass; PoC runs
end-to-end against the same run with identical output (300 rows, 33
HTTP requests, all assertions pass).
Run from web/packages/agenta-entities/:
  pnpm exec tsx poc/etl-poc-entities.ts
function row(label: string, value: string | number): void {
    const padded = label.padEnd(28)
    log(` ${padded} ${value}`)
The shared trace cache key ["trace-entity", projectId, traceId] stores a
TracesApiResponse envelope ({count, traces: {[traceIdNoDashes]: data}})
because every other consumer — traceEntityAtomFamily, traceRootSpanAtomFamily,
AnnotationTraceContent, EvalRunDetails/atoms/traces — expects that shape and
indexes via data.traces[noDash] to drill in.
Previously, cacheAwareFetchers.fetchTraces pre-unwrapped the envelope
before passing to the hydrate transform so resolveMappings could walk
it. Downstream cell consumers would have had to do the same unwrap.
Consolidate by teaching findInTrace to handle the envelope directly as
its own case (alongside the existing 4 trace shapes). Cell-side and
fetcher-side unwraps drop out, and the shared cache contract is
preserved unchanged for all existing consumers.
41/41 resolveMappings tests still pass.
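A rough sketch of the envelope case described above, assuming only the {count, traces} shape stated in this message. The names isEnvelope and unwrapTraceEnvelope are hypothetical, and the other 4 trace shapes that findInTrace handles are not modelled.

```typescript
// Sketch only: helper names are hypothetical; just the envelope case is shown.
interface TracesEnvelope {
    count: number
    traces: Record<string, unknown>
}

function isEnvelope(value: unknown): value is TracesEnvelope {
    if (typeof value !== "object" || value === null) return false
    const traces = (value as {traces?: unknown}).traces
    return typeof traces === "object" && traces !== null
}

// Returns the per-trace payload when given an envelope, otherwise the value unchanged.
function unwrapTraceEnvelope(value: unknown, traceId: string): unknown {
    if (!isEnvelope(value)) return value
    // The envelope is keyed by the trace id with dashes removed.
    const noDash = traceId.replace(/-/g, "")
    return value.traces[noDash]
}
```

Keeping the unwrap next to the resolver means the cache keeps storing the envelope that the other consumers index into.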
…ties
The 4 entity types the ETL hydrate pipeline materializes per page —
evaluation results, evaluation metrics, testcases, trace spans —
previously had inconsistent prefetch entry points:
evaluationResultMolecule.actions.prefetchByScenarioIds (proper molecule)
evaluationMetricMolecule.actions.prefetchByScenarioIds (proper molecule)
prefetchTestcasesByIds (sidecar function)
prefetchTracesByIds (sidecar function)
Future maintainers had to know two patterns. With predicate-driven
hydrate (next commit) and cell-side materialization landing soon, the
gap becomes painful — consumers call `molecule.actions.prefetch*` and
expect the convention to hold.
Add `testcaseMolecule.actions.prefetchByIds` and
`traceSpanMolecule.actions.prefetchByIds` that wrap the existing
sidecar functions (kept as separate exports for backwards compatibility
with non-molecule callers). All 4 entities now expose the same shape:
*Molecule.actions.prefetchBy{ScenarioIds|Ids}({projectId, ...ids})
→ {cacheHits, cacheMisses, fetchMs, ...}
Also re-export EvaluationMetric type from the evaluationRun barrel so
downstream callers can use molecule.actions.prefetchByScenarioIds without
deep-importing for the type.
Today the ETL hydrate pipeline fetches 4 entity slices per page
unconditionally — results + metrics + testcases + traces. When a
predicate is active, most of that data isn't needed to evaluate the
filter. For the common evaluator-output filter ("success eq true"),
testcases + traces are entirely irrelevant — traces alone account for
~70% of bytes and ~67% of loop time on the 1000-scenario reference run.
Add predicateToEntitySlices: given a run schema + predicate(s), return
the minimum set of entity slices the hydrate stage needs to fetch.
Mirrors resolveMappings's read-side step.type → entity convention
inversely (column → slice instead of column → value):
  testset     (step.type = input)      → testcases (+ results for testcase_id)
  application (step.type = invocation) → traces (+ results for trace_id)
  evaluator   (step.type = annotation) → metrics (+ results, + traces if not metrics-flat)
  metrics     (path = attributes.ag.metrics.*) → metrics
Unresolvable predicates fall back to all 4 slices to stay correct —
over-fetching is safer than silently dropping a predicate.
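The mapping above can be sketched as a small pure function. This is an illustration under assumptions, not the actual predicateToEntitySlices: the PredicateTarget shape is invented, and the metricsFlat flag stands in for whatever schema check decides "metrics-flat".

```typescript
// Sketch only: PredicateTarget and metricsFlat are hypothetical stand-ins.
type Slice = "results" | "metrics" | "testcases" | "traces"
type Kind = "testset" | "application" | "evaluator" | "metrics"

interface PredicateTarget {
    kind?: Kind // undefined models a predicate that could not be resolved
    metricsFlat?: boolean // assumption: caller knows the value is readable from metrics
}

const ALL_SLICES: Slice[] = ["results", "metrics", "testcases", "traces"]

function slicesForTarget(target: PredicateTarget): Slice[] {
    switch (target.kind) {
        case "testset":
            return ["testcases", "results"]
        case "application":
            return ["traces", "results"]
        case "evaluator":
            return target.metricsFlat ? ["metrics", "results"] : ["metrics", "results", "traces"]
        case "metrics":
            return ["metrics"]
        default:
            // Unresolvable: over-fetch rather than silently drop the predicate.
            return ALL_SLICES
    }
}

// Union across clauses, returned in a stable order.
function slicesForPredicates(targets: PredicateTarget[]): Slice[] {
    const needed = new Set<Slice>()
    targets.forEach((t) => slicesForTarget(t).forEach((s) => needed.add(s)))
    return ALL_SLICES.filter((s) => needed.has(s))
}
```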
Measured impact on the 1000-scenario reference run with predicate on
evaluator:exact-match.success:
loop time 12.5s → 2.5s (-80%)
bytes 27 MB → 9 MB (-67%)
requests 103 → 63 (-39%)
peak heap 60 MB → 25 MB (-57%)
Consumers (PoC + UI test page) wire this in subsequent commits.
Bundles uncommitted PoC improvements from the ETL investigation thread.
1) AGENTA_SINK_MODE (accumulate | streaming)
- accumulate (default): sink keeps every hydrated row. Useful for
post-hoc sample dumps.
- streaming: sink updates running aggregates per row and releases the
chunk. Mirrors what a production sink does.
Decoupled the downstream reports (per-row summary stats, sample dump,
engine guarantees, JSON output) from matchedRows[] — all read from a
running SinkAggregate so streaming mode produces the same headline
numbers without retaining rows.
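To illustrate the two sink modes, here is a minimal sketch in which both modes feed the same running aggregate and only accumulate mode retains rows. The Row and SinkAggregate fields are hypothetical; the PoC's aggregate tracks more than this.

```typescript
// Sketch only: Row and SinkAggregate fields are hypothetical.
interface Row {
    scenarioId: string
    tokens: number
}

interface SinkAggregate {
    rows: number
    tokenSum: number
    tokenMax: number
}

type SinkMode = "accumulate" | "streaming"

function createSink(mode: SinkMode) {
    const aggregate: SinkAggregate = {rows: 0, tokenSum: 0, tokenMax: 0}
    const matchedRows: Row[] = []
    return {
        write(chunk: Row[]): void {
            // Both modes update the running aggregate per row.
            for (const row of chunk) {
                aggregate.rows += 1
                aggregate.tokenSum += row.tokens
                aggregate.tokenMax = Math.max(aggregate.tokenMax, row.tokens)
            }
            // Only accumulate mode keeps the rows; streaming lets the chunk go.
            if (mode === "accumulate") matchedRows.push(...chunk)
        },
        report: (): SinkAggregate => ({...aggregate}),
        retainedRows: (): number => matchedRows.length,
    }
}
```

Because reports read from report() rather than from the retained rows, the headline numbers are the same in either mode.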
2) AGENTA_HEAP_WALK (1 = run residual investigation walk)
When enabled, after the loop completes the script tears down suspected
retainers one at a time, force-GCs, snapshots heap delta after each
step, dumps a V8 heap snapshot, and prints a per-step delta table.
Gated because (a) writeHeapSnapshot writes ~50 MB to /tmp, (b) the walk
side-effects aggregate state, (c) the steady-state Memory bounded
assertion already catches regressions without it.
3) Inlined reprefetch* into IIFEs
The cache-reuse verification kept reprefetchResults/Metrics/Testcases/
Traces as main() locals — those return objects hold 1000 results
each, pinning ~25 MB of heap on the main() stack for the rest of the
script. Heap-snapshot retainer-path analysis traced post-eviction
residual to those four bindings. Inlining the stat extraction into
IIFEs releases the function return values the line after they're
used. Cut residual from 25-48 MB to 9 MB; the Memory bounded engine
guarantee now passes.
4) AGENTA_HYDRATE_SLICES (comma-separated subset of
results,metrics,testcases,traces; default = all 4)
Wraps the chosen HydrateFetchers — slices not in the active set
return empty results without network. Same hydrate transform body
for both modes, so this is a fair A/B of just the fetch cost.
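One way the slice gating in item 4 might look, as a sketch. The Fetcher signature and the names parseSlices and gateFetchers are assumptions; the real HydrateFetchers type is not shown in this message.

```typescript
// Sketch only: Fetcher signature and helper names are hypothetical.
type Slice = "results" | "metrics" | "testcases" | "traces"
type Fetcher = (ids: string[]) => Promise<unknown[]>
type HydrateFetchers = Record<Slice, Fetcher>

const ALL_SLICES: Slice[] = ["results", "metrics", "testcases", "traces"]

// Parses a comma-separated knob value; unset means all 4 slices.
function parseSlices(raw: string | undefined): Set<Slice> {
    if (!raw) return new Set(ALL_SLICES)
    const wanted = raw.split(",").map((s) => s.trim())
    return new Set(ALL_SLICES.filter((s) => wanted.includes(s)))
}

// Inactive slices resolve to empty results without touching the network.
function gateFetchers(fetchers: HydrateFetchers, active: Set<Slice>): HydrateFetchers {
    const empty: Fetcher = async () => []
    return {
        results: active.has("results") ? fetchers.results : empty,
        metrics: active.has("metrics") ? fetchers.metrics : empty,
        testcases: active.has("testcases") ? fetchers.testcases : empty,
        traces: active.has("traces") ? fetchers.traces : empty,
    }
}
```

The hydrate transform calls the gated fetchers exactly as before, which is what keeps the comparison limited to fetch cost.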
Measured impact, 1000-scenario reference run, warm backend:
                                     loop        requests   bytes    peak heap
  Baseline (all 4 slices)            12,520 ms   103        ~27 MB   +60 MB
  Slice-filtered (results+metrics)    2,494 ms    63         ~9 MB   +25 MB
  delta                              -80%        -39%       -67%     -57%
IIFE-inlined reprefetches (post-loop residual):
  before: 25-48 MB   ✗ Memory bounded
  after:  ~9 MB      ✓ Memory bounded
…ipeline
Standalone debug route that mounts the production InfiniteVirtualTable
with the entities-package ETL hydrate strategy wired into a real React
+ IVT context. Validates the architecture end-to-end before production
wiring (which lives in a separate follow-up).
URL: /etl-poc/<runId>?project_id=<projectId>
Architecture (file map):
  pages/etl-poc/[evaluation_id].tsx   page route
  components/EtlPocScenarios/
    index.tsx                         main component
    scenarioPaginatedStore.ts         thin IVT row store ({key,id,scenarioId})
    useHydrateScenarios.ts            page-level bulk prefetch, predicate-driven
    useCellMaterialization.ts         cell-level lazy fetch, batched per tick
    cellMaterializerContext.ts        provider seam
    useEtlColumns.tsx                 columns from runSchema via resolveMappings
    useScopeChangeEviction.ts         evictByRunId on (projectId, runId) change
    EtlColumnHeader.tsx               testset/app entity-name resolver for headers
    PredicateFilterBar.tsx            one-predicate dropdown UI
    cells/EtlResolvedCell.tsx         per-cell molecule cache + resolveMappings
Key architectural moves:
- Thin row shape ({key, id, scenarioId, __isSkeleton}) — same convention
as testcasePaginatedStore. All column data is materialized via molecule
caches; the IVT row carries only identity.
- Predicate-driven hydrate via predicateToEntitySlices (see prior commit).
When a predicate is active the page-level pass only fetches slices the
filter touches. Hydrate-strategy toggle in the header lets you A/B
auto-mode (predicate-driven) vs all-mode (legacy 4-slice fetch) live.
- Cell-side lazy materialization for slices the page-level hydrate
skipped. 30 visible cells requesting (slice, id) in the same render
tick coalesce into 1 bulk fetch per slice via a microtask queue.
Uses the now-symmetric *Molecule.actions.prefetch* surface.
- hydrationVersionAtom — bumped after each completed batch so cells
whose useMemo settled before stage 2 (testcases/traces) lands
re-render and pick up the late cache writes.
- Scope-change eviction (useScopeChangeEviction) — the cleanup snippet
the production scenarios controller should call on runId change.
Today production has no such call; this hook demonstrates the wire-up.
- Production-realistic layout chain: registered /etl-poc with the
Layout's isFullHeight matcher; container className set on the IVT so
its scroll container doesn't grow to content height (matches what
InfiniteVirtualTableFeatureShell does internally for production
tables).
- Resolver gets envelope-unwrapped values via findInTrace (prior commit)
+ stats-blob unwrap via unwrapStatsForCompare for display.
Out of scope for this commit (next-PR follow-ups):
- Wiring the molecule-backed hydrate into the actual production
scenarios table (web/oss/src/components/EvalRunDetails/Table.tsx).
- Migrating evaluationPreviewTableStore to a thin entities-package
store (this PR keeps it for the production view; the test page uses
its own thin store).
- HTTP 429 retry/backoff in prefetchTracesByIds (currently silent warn).
…slices)
Previously, sliceMode="auto" had two meanings:
- With predicate: fetch only what the predicate touches.
- Without predicate: fetch all 4 slices (display fallback).
That bifurcation made "Auto" inconsistent — same toggle label, very
different network behavior depending on filter state.
Single consistent semantic now: Auto means "fetch only what's needed
right now."
- No predicate: 0 page-level fetches. Cells materialize their own
data via useCellMaterialization (virtualization-aware, batched per
microtask). Same path v2 server-side filtering will land on.
- Predicate active: fetch the predicate's slice set so the filter
can run client-side.
- Unresolvable predicate: fall back to all 4 (correctness over speed).
Trade-off: no-predicate first paint shows skeleton cells until the
first cell-side batch lands (~200-500ms typical). In exchange the page
fetches exactly the data the visible window needs — no eager fetching
of data the user may never scroll to. Smaller memory profile, more
honest semantics.
The "All slices" toggle option remains for A/B and for workflows that
need every column populated up-front (bulk actions, exports).
Header chip updated to show 'slices: none (cell-side on-demand)' when
the page-level hydrate skips entirely.
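The resulting Auto semantics fit in a few lines. This is a sketch with hypothetical names (pageLevelSlices, PredicateSlices); only the 'none' chip text is taken from this message, and the non-empty label format is an assumption.

```typescript
// Sketch only: function and type names are hypothetical.
type Slice = "results" | "metrics" | "testcases" | "traces"
type SliceMode = "auto" | "all"
// null = no predicate; "unresolvable" = predicate that could not be mapped to slices
type PredicateSlices = Slice[] | "unresolvable" | null

const ALL_SLICES: Slice[] = ["results", "metrics", "testcases", "traces"]

function pageLevelSlices(mode: SliceMode, predicate: PredicateSlices): Slice[] {
    if (mode === "all") return ALL_SLICES
    if (predicate === null) return [] // cells materialize their own data
    if (predicate === "unresolvable") return ALL_SLICES // correctness over speed
    return predicate
}

function chipLabel(slices: Slice[]): string {
    return slices.length === 0
        ? "slices: none (cell-side on-demand)"
        : `slices: ${slices.join(", ")}`
}
```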
When a strict predicate reduces the visible row count below the viewport
(e.g. 1 match in the first 50-row page), the IVT's native scroll-bottom
trigger never fires because the table never scrolls. loadMore is never
called and the user is stuck looking at a partial table with no way to
load more matches.
Add a viewport-fill effect: while a predicate is active and the loaded
dataset isn't exhausted, drive loadNextPage ourselves until either:
  - matched row count >= VIEWPORT_FILL_TARGET (30), or
  - paginationInfo.hasMore becomes false (full scan).
isFetching dedupes concurrent calls; the effect re-runs after each page
lands so we walk through pages one at a time. Skipped entirely when no
predicate is active; the native scroll trigger handles that case.
Production scenarios doesn't hit this because filtering runs
server-side (the server returns already-matched rows). Test page does
client-side v1 filter, so we own the fill loop.
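The stopping rule of the viewport-fill effect can be sketched as a synchronous loop. In the page itself this is a React effect that re-runs after each page lands; the Pager interface and arrayPager helper below are hypothetical stand-ins for the paginated store.

```typescript
// Sketch only: Pager and arrayPager are hypothetical; the real code is an effect.
const VIEWPORT_FILL_TARGET = 30

interface Pager<T> {
    hasMore(): boolean
    loadNextPage(): T[]
}

function fillViewport<T>(
    pager: Pager<T>,
    matches: (row: T) => boolean,
    target: number = VIEWPORT_FILL_TARGET,
): {matched: T[]; pagesLoaded: number} {
    const matched: T[] = []
    let pagesLoaded = 0
    // Stop once enough rows matched, or once the dataset is exhausted (full scan).
    while (matched.length < target && pager.hasMore()) {
        const page = pager.loadNextPage()
        pagesLoaded += 1
        for (const row of page) {
            if (matches(row)) matched.push(row)
        }
    }
    return {matched, pagesLoaded}
}

// In-memory pager used only to exercise the loop.
function arrayPager<T>(rows: T[], pageSize: number): Pager<T> {
    let offset = 0
    return {
        hasMore: () => offset < rows.length,
        loadNextPage: () => {
            const page = rows.slice(offset, offset + pageSize)
            offset += pageSize
            return page
        },
    }
}
```

With 1000 rows in pages of 50 and a predicate matching every 17th row, the loop stops after 10 pages with 30 matches; a predicate that matches nothing scans all 20 pages.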
request() marked IDs as in-flight BEFORE pushing to the queue. The
subsequent drain's collectUnique then filtered out anything in
inflightIds — i.e., every ID in the queue. Net effect: drain reads
empty arrays for every slice, no prefetch fires, cells stay at '—'
forever.
Fix the dedup ordering:
- request() filters duplicates within the current tick by checking
inflightIds (across-tick dedup) AND scanning the existing queue
(within-tick dedup). Doesn't mark inflight yet.
- drain snapshots the queues, dedups within batch, then marks
inflightIds before firing the fetch. .finally() clears the marks
after the fetch resolves so later cells can re-request.
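The corrected ordering can be sketched as follows. This is not the materializer itself: createMaterializer, FetchBulk, and the injectable schedule parameter are assumptions added so the ordering is easy to exercise by hand.

```typescript
// Sketch only: names and the injectable scheduler are hypothetical.
type Slice = "results" | "metrics" | "testcases" | "traces"
type FetchBulk = (slice: Slice, ids: string[]) => Promise<unknown>

function createMaterializer(
    fetchBulk: FetchBulk,
    // Default: drain once per microtask tick.
    schedule: (fn: () => void) => void = (fn) => {
        void Promise.resolve().then(fn)
    },
) {
    let queue = new Map<Slice, string[]>()
    const inflight = new Map<Slice, Set<string>>()
    let drainScheduled = false

    function drain(): void {
        drainScheduled = false
        const snapshot = queue
        queue = new Map()
        snapshot.forEach((queued, slice) => {
            const ids = Array.from(new Set(queued)) // dedup within the batch
            const marks = inflight.get(slice) ?? new Set<string>()
            inflight.set(slice, marks)
            ids.forEach((id) => marks.add(id)) // mark in-flight only now
            const clear = () => ids.forEach((id) => marks.delete(id))
            // Clear on success or failure so later cells can re-request.
            fetchBulk(slice, ids).then(clear, clear)
        })
    }

    function request(slice: Slice, id: string): void {
        if (inflight.get(slice)?.has(id)) return // across-tick dedup
        const queued = queue.get(slice) ?? []
        if (queued.includes(id)) return // within-tick dedup
        queued.push(id)
        queue.set(slice, queued)
        // Deliberately not marked in-flight here; drain does that.
        if (!drainScheduled) {
            drainScheduled = true
            schedule(drain)
        }
    }

    return {request, drain}
}
```

Marking in-flight inside drain is the point: an ID queued in this tick is still visible to the drain that is supposed to fetch it.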
Test page Auto mode + no predicate now correctly materializes visible
cells: page-level hydrate fetches nothing, cells request their slices
on first render, materializer batches into 1 bulk fetch per slice,
hydrationVersionAtom bumps, cells re-render with cached data.
Cell-side materialization on mount works for visible cells but lags
when the user scrolls into a freshly-loaded page — cells mount,
request, wait, render. Visible delay between page-arrival and
cell-with-data, especially noticeable in Auto + no-predicate mode
where page-level hydrate fetches zero slices.
Close the gap with useLookaheadPrefetch — runs alongside cell-side
materialization. Two stages, both routed through the existing
materializer (dedup + batching reused for free):
stage 1: new scenarios in pagination.rows → request results + metrics
(the two slices keyed directly by scenarioId)
stage 2: re-fires on every hydrationVersion bump (i.e., after
stage-1 results land in cache). For each row's cached
results, derive testcase_id / trace_id and request those
slices. Self-bounded by per-id seen-sets so it doesn't
re-fire for already-prefetched ids.
Net effect: when IVT loads page N+1 via loadMore, all 4 slices begin
fetching for that page's scenarios immediately. By the time the user
scrolls cells into view (~few hundred ms later), the data is cached
and cells render with no flash of '—'.
Skipped when sliceMode === 'all' (page-level hydrate already covered
every slice; nothing for lookahead to add).
Materializer's dedup ensures requests against scenarios already in
cache (because cells materialized them on mount earlier) are no-ops.
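A sketch of the two lookahead stages with per-id seen-sets. The names createLookahead, onRows, and onHydrationBump are hypothetical, as is the CachedResult shape; in the page the request callback would be the materializer's request and the stages would be driven by React effects.

```typescript
// Sketch only: names and the CachedResult shape are hypothetical.
type Slice = "results" | "metrics" | "testcases" | "traces"

interface CachedResult {
    testcaseId?: string
    traceId?: string
}

function createLookahead(
    request: (slice: Slice, id: string) => void,
    readCachedResults: (scenarioId: string) => CachedResult[] | undefined,
) {
    const seen: Record<Slice, Set<string>> = {
        results: new Set(),
        metrics: new Set(),
        testcases: new Set(),
        traces: new Set(),
    }

    // Self-bounding: each (slice, id) pair is requested at most once.
    const requestOnce = (slice: Slice, id: string): void => {
        if (seen[slice].has(id)) return
        seen[slice].add(id)
        request(slice, id)
    }

    return {
        // Stage 1: new scenarios appear; request the slices keyed by scenarioId.
        onRows(scenarioIds: string[]): void {
            scenarioIds.forEach((id) => {
                requestOnce("results", id)
                requestOnce("metrics", id)
            })
        },
        // Stage 2: after results land in cache, derive testcase and trace ids.
        onHydrationBump(scenarioIds: string[]): void {
            scenarioIds.forEach((id) => {
                const results = readCachedResults(id) ?? []
                results.forEach((r) => {
                    if (r.testcaseId) requestOnce("testcases", r.testcaseId)
                    if (r.traceId) requestOnce("traces", r.traceId)
                })
            })
        },
    }
}
```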
When a predicate is active the IVT's viewport is *constructed* from
multiple pagination pages: the viewport-fill loop loads pages 1-10 to
accumulate 30 matched rows, with the other 470 immediately filtered out
by the predicate.
Previously useLookaheadPrefetch took pagination.rows and prefetched
results+metrics (stage 1) + testcases+traces (stage 2) for every loaded
scenario. With a strict predicate (~3% match rate) that meant ~94% of
the lookahead work was wasted on rows the user will never see,
particularly stage 2's testcase + trace fetches that derive IDs per row.
Switch the input to filteredRows. Now lookahead targets only the
constructed viewport.
  No predicate:   filteredRows == pagination.rows → no behavior change
  With predicate: filteredRows ⊂ pagination.rows  → prefetch only matched
Edge case: filteredRows includes 'pending' rows (matchesPredicate's
keep-visible-until-known fallback for unhydrated rows). Some of those
will later drop out as the predicate slices land and the filter
re-evaluates, so we'll have over-prefetched for them. Accepted: the
predicate-driven page hydrate already fetches predicate slices for the
same set so stage 1 is net zero, and stage 2 over-prefetch is the cost
of not flashing rows in/out during predicate eval. Manageable.
Hook also moved below the filteredRows useMemo so it sees the correct
data; it was previously above with the wrong input. Updates the hook's
file header + the call-site comment to make the 'filteredRows not
pagination.rows' contract explicit.
…ches
Symptom: with a predicate active (e.g. evaluator.score eq false), rows
where the data clearly shows the predicate as false (e.g. score=true)
were visible in the table. After a delay, the table updated to show
only correctly-matching rows.
Root cause: matchesPredicate has a 'keep visible until known' fallback
— if predicate slices aren't cached yet for a scenario, the row passes
through. filteredRows useMemo's deps were [pagination.rows, predicate,
schema, projectId, runId] — none of which change when the molecule
cache updates. So:
1. Pagination page loads → 50 new rows, predicate slices not cached
2. Filter runs: matchesPredicate falls back to 'keep visible' for all
3. Predicate slices land → hydrationVersionAtom bumps → cells
re-render with real data
4. filteredRows DOESN'T re-run because its deps didn't change
5. Result: rows that don't actually match stay visible until the
next pagination event triggers a filteredRows re-eval
Fix: subscribe to hydrationVersionAtom + add hydrationVersion to
filteredRows deps. When the cache lands, filteredRows re-evaluates,
'keep visible until known' rows that turn out not to match get
correctly filtered out.
User-visible: brief 'keep visible' period during predicate evaluation
remains (~few hundred ms while results+metrics land per page), then
rows correctly filter. No more stale rows lingering until the next
page load.
…explicit pending/confirmed count chip
Two issues from real-world predicate UX on 1000-scenario run.
(1) Slow load: predicate on evaluator.llm-as-a-judge.score triggered
'slices: results, metrics, traces (predicate-driven)'. The trace fetch
is by far the slowest endpoint (~100ms median per call) and dominates
loop time. But the score value lives in metric.data: the metric writer
unfolds the evaluator's emitted attributes (incl.
attributes.ag.data.outputs.*) as flat keys under data[stepKey][path].
predicateToEntitySlices previously added traces speculatively for any
annotation predicate whose path didn't start with
'attributes.ag.metrics.*'. The heuristic was over-cautious: for the
common case (evaluators that write to metric.data), trace is unused.
Drop the speculative trace add. resolveMappings's composeResolvers
(metric -> trace) still falls back to trace at read time if metric
returns null for a column, so correctness is preserved. The cell-side
materializer requests traces on demand for cells where they're actually
needed (e.g. invocation outputs from a span tree).
Predicate hydrate becomes results+metrics only, which cuts ~60-70% of
loop time on the 1000-scenario reference run.
(2) Flaky count chip: '100 matched / 600 loaded' then '57 matched / 650
loaded'. The 100 was inflated by 'keep visible until known' pending
rows during predicate evaluation; once data landed, the predicate
evaluated to false and the count dropped to 57.
Replace the simple length-based chip with PredicateCountChip that
distinguishes:
  - confirmed: predicate slices loaded + predicate returned true
  - pending: predicate slices not loaded yet (keep-visible fallback)
  - totalLoaded: scenarios in pagination buffer
Display: 'X matched · Y pending / Z loaded'. The user sees stable
'confirmed' growth instead of a number that oscillates while predicate
slices land. Recomputed on each hydrationVersion bump via parent
re-render.
41/41 resolveMappings tests still pass.
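The confirmed/pending split behind the chip can be sketched as a tri-state classification. The PredicateLookup callback is an assumption made for illustration: it returns undefined while a scenario's predicate slices are not cached, which is the keep-visible case.

```typescript
// Sketch only: PredicateLookup and helper names are hypothetical.
type MatchState = "confirmed" | "pending" | "rejected"

// undefined = predicate slices not cached yet for this scenario
type PredicateLookup = (scenarioId: string) => boolean | undefined

function classify(scenarioId: string, lookup: PredicateLookup): MatchState {
    const verdict = lookup(scenarioId)
    if (verdict === undefined) return "pending"
    return verdict ? "confirmed" : "rejected"
}

function countChip(scenarioIds: string[], lookup: PredicateLookup) {
    let confirmed = 0
    let pending = 0
    for (const id of scenarioIds) {
        const state = classify(id, lookup)
        if (state === "confirmed") confirmed += 1
        else if (state === "pending") pending += 1
    }
    const totalLoaded = scenarioIds.length
    return {
        confirmed,
        pending,
        totalLoaded,
        label: `${confirmed} matched · ${pending} pending / ${totalLoaded} loaded`,
    }
}
```

Only the pending count moves as slices land, so the confirmed number grows steadily and does not overshoot.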
…429)
Reported 189+ failed requests piling up against /tracing/spans/query —
all 429s, all retried forever. Root cause: prefetchTracesByIds (and
the other slice prefetches) swallow errors and return empty outcomes
with no cache writes. The materializer's drain .finally() clears
inflightIds unconditionally, then the next cell render finds the cache
still empty and re-requests. Tight retry loop.
Add per-slice failedIds tracking to the materializer:
1. After each slice's bulk fetch resolves, read the cache for every
requested ID. If the cache entry is undefined, the fetch failed
silently (rate-limited, network blip, etc.) — mark the ID as
failed in state.failedIds[slice].
2. request() now checks failedIds first; if the ID is there, skip
queueing. The lookahead hook's own seen-set already prevents
re-queueing too, so combined they prevent any future request for
a failed ID.
Permanent for the session — user reloads to retry. Could add a TTL +
exponential backoff later, but for the test page this is the simpler
correct behavior.
Cells for failed IDs render '—' indefinitely. Acceptable trade-off vs.
the alternative of hammering a rate-limited endpoint and degrading
every other consumer.
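A sketch of the failedIds bookkeeping described above, kept separate from the queueing logic. The names createFailureTracker, shouldQueue, and recordBatch are hypothetical, and readCache stands in for a read of the molecule cache.

```typescript
// Sketch only: names are hypothetical; readCache stands in for the molecule cache.
type Slice = "results" | "metrics" | "testcases" | "traces"

function createFailureTracker(readCache: (slice: Slice, id: string) => unknown) {
    const failedIds: Record<Slice, Set<string>> = {
        results: new Set(),
        metrics: new Set(),
        testcases: new Set(),
        traces: new Set(),
    }
    return {
        // request() consults this first and skips queueing for failed ids.
        shouldQueue(slice: Slice, id: string): boolean {
            return !failedIds[slice].has(id)
        },
        // Called after a slice's bulk fetch resolves: an id with no cache
        // entry means the fetch failed silently (e.g. a swallowed 429).
        recordBatch(slice: Slice, requestedIds: string[]): void {
            requestedIds.forEach((id) => {
                if (readCache(slice, id) === undefined) failedIds[slice].add(id)
            })
        },
        failedCount(slice: Slice): number {
            return failedIds[slice].size
        },
    }
}
```

Nothing here ever removes an id, which matches the permanent-for-the-session behavior; a TTL would be a later addition.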
If the user scrolled to row 200, then added a filter that narrows the
table to a different subset, the IVT preserved the prior scroll offset.
The user landed somewhere mid-way through the filtered list, often
past the first matches — feels like the filter applied to a different
table or the filter is broken.
Wire a tableRef into the IVT, watch the predicate state, and call
tableRef.scrollTo({index: 0, align: 'top'}) on any predicate change
(added, cleared, or modified). Skipped on the very first render so we
don't fire a scroll on initial mount.
Scheduled inside a requestAnimationFrame so the IVT has the new
filteredRows mounted before we ask it to scroll.
EE's Next.js app uses filesystem routing over web/ee/src/pages/ and
does not auto-inherit OSS pages; each route needs an explicit re-export
file. The /etl-poc test page existed only in web/oss/src/pages/, so it
404'd when running the EE web frontend.
Add web/ee/src/pages/etl-poc/[evaluation_id].tsx, a plain pass-through
re-export of the OSS page (no EE-specific behaviour). Mirrors the
existing EE page re-export convention (auth, _document, etc.).
Now the test page is reachable on both OSS and EE web.
[fix] Resolve broken CORS headers
Summary
PoC + entity-layer foundation for the eval-filtering RFC's v1.
Testing
Verified locally
Added or updated tests
QA follow-up
Demo
Checklist
Contributor Resources