[FE POC] ETL implementation for evaluation runs. by ardaerzin · Pull Request #4354 · Agenta-AI/agenta

ardaerzin · 2026-05-18T15:14:43Z

Summary

PoC + entity-layer foundation for the eval-filtering RFC's v1.


…L engine RFCs Three paired design docs covering the evaluation frontend refactor: - eval-filtering.md (390 lines, 8 mermaid diagrams) — what to build. Reuses tracing Filtering/Condition primitive, v1 frontend / v2 backend split, field-path convention, UI states, compare-mode gap with the query-backed-runs join-key problem flagged. - eval-package-architecture.md (647 lines, 11 diagrams) — where it lives. Maps current OSS data layer (28 atom files, ~9k LOC) to target @agenta/entities/evaluationRun package boundary. Documents existing molecule ground truth, the 4-namespace convention to follow, phased migration with Phase 1 (scenario + metrics molecules) as the prerequisite for the filter primitive. - eval-etl-engine.md (804 lines, 7 diagrams, 7 code blocks) — how transforms compose. JP's huddle diagrams as instances of one chunked iteration loop with 5 guarantees (memory-bounded, cancellation, progress, backpressure, idempotent resume). Source/Transform/Sink contracts, the ~40-line loop, three worked examples (filter, query->testset, file export), and full integration story with entity packages via the adapter pattern. All three docs cross-link, share yellow/green/red diagram color conventions, and explicitly defer DSL/vocabulary work until 3+ transforms exist with comparable shapes. Not pushed; staged on fe-experiment/etl-engine for review. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…future-improvement designs

Honest performance audit of the trio surfaced several overstated guarantees and missing constraints. This commit applies the corrections, restructures the migration phases, and adds substantive design treatment for four future improvements that didn't make v1.

Corrections applied across the trio:
- "Memory bounded by chunk size" was overstated. Loop runtime IS bounded; cumulative session state is NOT. Doc now distinguishes pipeline memory from session memory, with resident-memory estimates per run size.
- "Cancellation propagates" was a partial truth. Mid-flight HTTP requests aren't aborted unless AbortSignal is plumbed through the API layer. Promoted to Phase 1d (was implicit / "v2 polish").
- Row eviction was Phase 4+. Promoted to Phase 3a with explicit sliding-window policy + atomFamily.remove() in lockstep.
- v1 client-side join sizing tightened from ~10k to ~5k per side based on memory math (10KB/row × 10k = 100MB hash map; browser struggles).

New mandatory performance constraints in eval-filtering.md:
- C1: ≥250ms debounce on scenarioFilterAtom writes (predicate eval is O(N))
- C2: Three predicate operator tiers (cheap / moderate / expensive); UI surfaces only Tier 1/2 by default, Tier 3 auto-escalates to v2
- C3: Eager v2 escalation on three triggers (hit-ratio, loaded > 10k, Tier 3 operator), not just hit-ratio
- C4: Background tab pause via visibility-aware AbortSignal wrapper
- C5: AtomFamily eviction in lockstep with row eviction

New "Limitations and required discipline" section in eval-package-architecture.md spelling out what IS bounded vs what ISN'T, sizing expectations table, and what the design explicitly does NOT fix (server aggregations, cross-table joins beyond compare-mode, real-time streaming, offline resume).

ETL engine doc clarifies the 5 guarantees with explicit caveats about loop-local vs caller-managed state, and adds Performance properties section with per-chunk cost tables.

Four future-improvement designs added with concrete shapes:

Filter RFC:
- F1: Skip-ahead UX on filter transitions using last-visible row ID as a content anchor (not opaque cursor). Primitive findNearestPosition() on derived view; v2 needs API extension to accept anchor in query payload.
- F2: Predicate explain mode (dev tool) with ring buffer of timed evaluations, tier classification recommendations, three enable mechanisms (URL param / devtools / env var). Key insight: tier classification should be measured, not stipulated.

Package architecture RFC:
- F1: Worker-thread predicate evaluation via snapshot-based ship-once pattern. Snapshot shape acts as predicate field schema; predicate changes ship only the predicate, not the data. Performance comparison table shows ~25x speedup for repeated predicates after first eval.
- F2: Memoized derived results: LRU cache keyed by predicate hash with per-entry revision tracking. v1 coarse invalidation (all entries on any correlated update); fine-grained field-path tracking promoted on user feedback. Cache is tiny (~180KB for 10 entries × 500 matches).

The four future improvements compose cleanly across the worker/main-thread boundary and are cross-linked between the two RFCs.

Working tree returns to clean. Not pushed; stays on fe-experiment/etl-engine for review.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
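Constraint C3 in this commit (eager v2 escalation on three triggers) can be read as a small pure function. The sketch below is illustrative only: the type names, the function name, and the config shape are hypothetical, the 10k row limit is the figure quoted in C3, and the 0.10 hit-ratio threshold is borrowed from the hit-ratio meter defaults described later in this PR.

```typescript
// Hypothetical shapes; the RFC's real types are not shown in this PR text.
interface EscalationInput {
  rollingHitRatio: number | null; // null while no ratio is available yet
  loadedRows: number;
  usesTier3Operator: boolean;
}

interface EscalationConfig {
  hitRatioThreshold: number; // assumed default 0.10 (meter default quoted later)
  maxLoadedRows: number; // C3 names "loaded > 10k" as a trigger
}

type EscalationReason = "hit-ratio" | "loaded-rows" | "tier3-operator";

const DEFAULT_ESCALATION_CONFIG: EscalationConfig = {
  hitRatioThreshold: 0.1,
  maxLoadedRows: 10_000,
};

// Returns every trigger that fired; an empty array means "stay on the client".
function escalationReasons(
  input: EscalationInput,
  config: EscalationConfig = DEFAULT_ESCALATION_CONFIG,
): EscalationReason[] {
  const reasons: EscalationReason[] = [];
  if (input.rollingHitRatio !== null && input.rollingHitRatio < config.hitRatioThreshold) {
    reasons.push("hit-ratio");
  }
  if (input.loadedRows > config.maxLoadedRows) reasons.push("loaded-rows");
  if (input.usesTier3Operator) reasons.push("tier3-operator");
  return reasons;
}
```

Returning the list of reasons rather than a boolean keeps the decision explainable, which fits the "predicate explain mode" idea in F2.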

…ter schema design Two structural changes the trio needed before it could be considered load-bearing. 1. Split the ETL engine doc into general + consumer The original eval-etl-engine.md mixed two concerns: the general loop engine (zero entity coupling) and eval's specific adoption (adapters, folder structure, filter pipeline worked example). Resolved by: - NEW: docs/designs/etl-engine.md (572 lines, 2 diagrams) General loop engine. Contracts, runtime, performance properties, 5 guarantees with caveats, generic worked examples (streaming file export, cross-entity query->testset, multi-source join). Open questions are engine-level only. "What to do next" is engine-level. - REFOCUSED: docs/designs/eval-etl-engine.md (331 lines, 3 diagrams, down from 922) Eval-specific adoption only. The filter pipeline as canonical first consumer of the engine. Adapter folder structure under @agenta/entities/evaluationRun/etl/. Per-chunk sequence for the eval filter case. Migration path aligned to architecture RFC phases. Explicit "why this doc is small" note: eval is a consumer, most work is in shared infrastructure. This pattern scales — future entity packages adopting the engine write similar small consumer docs (testset-etl-integration.md, etc.). The general engine doc stays stable. 2. Filter schema and field declarations (D4) The trio defined the predicate vocabulary (D1) and field-path convention (D3) but never specified how each entity declares its filterable surface. Without that, the filter UI has nothing to render, the predicate validator has nothing to check against, and tier-based escalation can't reason per-field. 
Resolved by: - NEW D4 in eval-filtering.md (~280 lines): - FilterSchema and FilterFieldSchema type definitions - Type-to-operator matrix (9 field types x tier classification) - Canonical scenario filter schema with static + dynamic fields - Evaluator output type -> FilterFieldType mapping - Schema-driven UI rendering pattern (one input shape per type) - Predicate validator (3 checks: field, operator, value type) - Tier propagation: per-field tier drives per-predicate escalation - Server-side parity strategy (3 options, v1 picks "independently authored + integration test") - Schema versioning rules - New subsection in eval-package-architecture.md "Cross-entity filter schemas" (~70 lines): - Folder structure: shared/paginated/filter/ for generic types + validator + tier walker; entity/etl/filterSchema.ts for per-entity schema builders - Construction flow diagram - Why this lives at the shared layer (not the engine) - Pattern for future transforms (ProjectionSchema, MapSchema) - Clarifying section in etl-engine.md "Filter / transform schemas are NOT engine concerns" (~20 lines): - Engine has zero knowledge of fields/types/operators - Schemas live one layer up at the derived layer - Cross-references to filter and architecture RFCs - Folder update in eval-etl-engine.md noting filterSchema.ts location and pattern for other entities The key architectural property: the same FilterSchema drives UI rendering, predicate validation, tier classification, and runtime field resolution. One source of truth per entity per context. Other entities (testset, observability, etc.) get the same leverage by writing their own filterSchema.ts following the shared contract. Working tree returns to clean. Not pushed. 
Final state — 4 docs, 31 mermaid diagrams, 2,914 lines: - etl-engine.md — 572 lines, 2 diagrams (general engine) - eval-etl-engine.md — 331 lines, 3 diagrams (eval adoption) - eval-filtering.md — 937 lines, 11 diagrams (what to filter) - eval-package-architecture.md — 1,074 lines, 15 diagrams (where state lives) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
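The D4 predicate validator is described as three checks (field, operator, value type) driven by the same schema that carries the tier. A minimal sketch of that idea follows. Every name and shape here is an assumption made for illustration: the RFC's actual FilterSchema and FilterFieldSchema definitions are not reproduced in this PR text, and only three of the nine field types are modelled.

```typescript
// Hypothetical, simplified stand-ins for the RFC's FilterSchema types.
type FilterFieldType = "string" | "number" | "boolean";
type Operator = "eq" | "ne" | "in" | "nin" | "lt" | "lte" | "gt" | "gte";
type Tier = 1 | 2 | 3;

interface FilterFieldSchema {
  path: string;
  type: FilterFieldType;
  operators: Operator[];
  tier: Tier;
}

interface FilterSchema {
  fields: FilterFieldSchema[];
}

interface Predicate {
  field: string;
  operator: Operator;
  value: unknown;
}

type ValidationResult =
  | { ok: true; tier: Tier }
  | { ok: false; reason: "unknown-field" | "operator-not-allowed" | "value-type-mismatch" };

function validatePredicate(schema: FilterSchema, predicate: Predicate): ValidationResult {
  // Check 1: the field must be declared by the entity's schema.
  const field = schema.fields.find((f) => f.path === predicate.field);
  if (!field) return { ok: false, reason: "unknown-field" };

  // Check 2: the operator must be allowed for this field.
  if (!field.operators.includes(predicate.operator)) {
    return { ok: false, reason: "operator-not-allowed" };
  }

  // Check 3: the value (or each member, for set operators) must match the field type.
  const isSetOperator = predicate.operator === "in" || predicate.operator === "nin";
  if (isSetOperator && !Array.isArray(predicate.value)) {
    return { ok: false, reason: "value-type-mismatch" };
  }
  const candidates: unknown[] = isSetOperator ? (predicate.value as unknown[]) : [predicate.value];
  if (!candidates.every((v) => typeof v === field.type)) {
    return { ok: false, reason: "value-type-mismatch" };
  }

  // Tier propagation: the field's tier becomes the predicate's tier.
  return { ok: true, tier: field.tier };
}
```

Because the tier travels with the validation result, a caller could feed it straight into an escalation decision without a second schema lookup.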

The entire architecture is environment-agnostic — no React, no DOM, no browser APIs. Adding a "PoC strategy" section that captures how to validate the design end-to-end against a real backend before touching the frontend. The section covers: - Why headless: faster iteration than UI work, cleaner architectural proof (if contracts hold in Node they hold in the UI), real performance measurement via process.memoryUsage and hrtime - What the PoC validates: the engine's 5 guarantees + prefetch hook + filter schema validator + tier escalator, all against real data - File layout: PoC files become the v1 implementation. The package paths in the PoC script are the real package paths. - Three layers of testing (unit / integration / E2E) - A complete sketch of scripts/etl-poc.ts (~50 lines, executable) - Suggested ordering with time budget (~5-6 days total) - What the PoC's run report should capture as the empirical complement to the design RFCs - Preconditions (dev stack runnable, test run with realistic shape, step-0 verification that createPaginatedEntityStore runs in Node) - Branch strategy: fe-experiment/etl-poc follows this trio Working tree clean after this commit. Not pushed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…es integration

End-to-end validation of the ETL architecture against a real Agenta backend. The engine consumes data through the existing entities-package paginated store with full cursor pagination, cancellation, and memory bounds.

ENGINE: web/packages/agenta-entities/src/etl/
- core/types.ts: Source, Transform, Sink, Chunk, MultiSourceTransform, JoinState, Progress, LoopResult. Plain TS, no DSL.
- runtime/runLoop.ts: ~50-line AsyncGenerator implementing the loop. All five design-RFC guarantees fall out of the code.
- adapters/makeSourceFromPaginatedStore.ts: wraps any createPaginatedEntityStore instance as a Source<TApiRow>. Drives the store's reactive controller subscription, uses scheduleNextPageAtomFamily to advance the cursor.
- index.ts: public exports including the new adapter.
- __tests__/runLoop.guarantees.test.ts: 5 guarantee tests via node:test (built-in, no vitest/jest dep).

EVAL-SPECIFIC ADAPTER: web/packages/agenta-entities/src/evaluationRun/etl/
- realScenarioSource.ts: minimal Source<EvaluationScenario> with the OSS cursor-fallback pattern.

POC SCRIPTS
- scripts/etl-poc-smoke.ts: synthetic-data engine validation (5/5 pass)
- scripts/etl-poc.ts: minimum-viable real-backend PoC via direct fetch
- web/oss/poc/etl-entities-probe.ts: 4-stage Node-portability probe covering shared axios + Zod + Jotai + createPaginatedEntityStore (4/4 pass against real backend)
- web/oss/poc/etl-poc-entities.ts: full "really using entities" PoC. Wraps real createPaginatedEntityStore via makeSourceFromPaginatedStore, runs runLoop end-to-end (3 chunks, 150 rows, cursor advance via scheduleNextPageAtomFamily, all 5 assertions pass)

PACKAGE EXPORTS: web/packages/agenta-entities/package.json
- "./etl" → ./src/etl/index.ts
- "./evaluationRun/etl" → ./src/evaluationRun/etl/index.ts
- "test:etl" script runs node:test guarantee tests via tsx

ARCHITECTURAL FINDING: @agenta/entities barrel exports transitively pull React components (via shared/user/UserAuthorLabel.tsx → @agenta/ui → CSS). Workaround: deep relative imports. Follow-up: split entities/shared barrel so data-layer consumers don't pull UI. Not blocking.

VERIFIED against real backend (run 019e3701-...):
- etl-poc.ts: 7 chunks, 300 scanned, 64 matched, ~73ms/chunk
- etl-poc-entities.ts: 3 chunks via store, 150 rows, ~70ms/chunk
- all 5 engine guarantees hold against real network + real data

Branched from fe-experiment/etl-engine. Not pushed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
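To make the Source / Transform / Sink contracts and the AsyncGenerator loop concrete, here is a minimal sketch of a chunked iteration loop. It is not the code in runtime/runLoop.ts: the interface shapes, the `fetchChunk` and `write` method names, and the `runLoopSketch` name are all assumptions, and transforms are simplified to keep the row type unchanged.

```typescript
// Hypothetical contracts; the real ones live in core/types.ts and may differ.
interface Chunk<T> {
  rows: T[];
  cursor: string | null; // null means the source is exhausted
}

interface Source<T> {
  fetchChunk(cursor: string | null, signal: AbortSignal): Promise<Chunk<T>>;
}

type Transform<T> = (rows: T[]) => T[] | Promise<T[]>;

interface Sink<T> {
  write(rows: T[]): void | Promise<void>;
}

interface Progress {
  chunks: number;
  scanned: number;
  matched: number;
}

async function* runLoopSketch<T>(
  source: Source<T>,
  transforms: Transform<T>[],
  sink: Sink<T>,
  signal: AbortSignal,
): AsyncGenerator<Progress> {
  let cursor: string | null = null;
  const progress: Progress = { chunks: 0, scanned: 0, matched: 0 };

  // Cancellation: the signal is checked before every fetch and passed to the source.
  while (!signal.aborted) {
    const chunk: Chunk<T> = await source.fetchChunk(cursor, signal);

    // Only the current chunk is held by the loop; earlier chunks are not retained here.
    let rows = chunk.rows;
    for (const transform of transforms) rows = await transform(rows);
    await sink.write(rows);

    progress.chunks += 1;
    progress.scanned += chunk.rows.length;
    progress.matched += rows.length;

    // Backpressure: the next fetch does not start until the consumer asks for more.
    yield { ...progress };

    if (chunk.cursor === null) return;
    cursor = chunk.cursor;
  }
}
```

A consumer drives it with `for await (const progress of runLoopSketch(...))` and can stop early either by breaking out of the loop or by aborting the signal.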

The PoC's observational "heap=+X MB" logs were anecdotes, not tests. This commit turns the design RFC's claims about memory bounds and engine overhead into enforceable assertions that fail on regression.

NEW TESTS: web/packages/agenta-entities/src/etl/__tests__/

runLoop.memory.test.ts (4 assertions, --expose-gc required)
- 100 chunks × 1000 fat rows (~1MB per chunk): heap delta stays under 25MB budget. Unbounded would show 100MB+ growth.
- Heap delta at chunk 100 does NOT exceed delta at chunk 25 by more than 10MB. Catches monotonic per-chunk growth.
- After AbortSignal cancellation mid-stream and two forced GCs, heap returns within 15MB of baseline. Catches "abort doesn't release in-flight chunk references" regressions.
- 10-transform identity chain over 100 chunks × 500 rows: heap stays under 30MB. Catches "transform array retains intermediate chunks" regressions.
All four skip gracefully without --expose-gc (default test:etl unaffected for contributors without the flag).

runLoop.overhead.test.ts (2 assertions)
- runLoop vs hand-written equivalent: same source, transform, sink, iteration. Median of 5 runs with warmup. Asserts engine overhead < 25% of baseline. Measured locally at 9.4%. Reports numbers in test output for CI log inspection.
- Correctness parity: engine and baseline produce identical scanned/matched/loaded counts. Catches "engine drops/double-counts rows" regressions independently of timing.

SCRIPTS: web/packages/agenta-entities/package.json
- test:etl: guarantees only (fast, no --expose-gc, every PR)
- test:etl:memory: memory + overhead (--expose-gc, ~400ms)

Scope B (next commit) adds: AtomFamily leak detection, per-chunk latency budgets per scenario, long-run (10k iter) leak regression test, budget tuning docs.

Verified locally:
- test:etl: 9/9 pass (existing guarantees, unchanged)
- test:etl:memory: 6/6 pass (memory: 4, overhead: 2)
- Engine overhead: 9.4% over baseline (budget 25%)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
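The overhead assertion above (median of 5 runs with warmup, engine within 25% of a hand-written baseline) reduces to a few lines of arithmetic. The helpers below are a hedged sketch of that measurement, not the contents of runLoop.overhead.test.ts; the function names are hypothetical and the clock is injectable purely so the example can be checked deterministically.

```typescript
// Median of a non-empty sample set (average of the two middle values when even).
function median(samples: number[]): number {
  if (samples.length === 0) throw new Error("median of empty sample set");
  const sorted = [...samples].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Runs `fn` a few times untimed (warmup), then returns the median timed duration.
function medianRunMs(
  fn: () => void,
  runs = 5,
  warmup = 1,
  now: () => number = Date.now, // injectable clock; assumption made for testability
): number {
  for (let i = 0; i < warmup; i++) fn();
  const samples: number[] = [];
  for (let i = 0; i < runs; i++) {
    const start = now();
    fn();
    samples.push(now() - start);
  }
  return median(samples);
}

// Overhead of the engine relative to the baseline, as a percentage.
function overheadPercent(engineMs: number, baselineMs: number): number {
  return ((engineMs - baselineMs) / baselineMs) * 100;
}
```

With these, a budget check is `overheadPercent(engineMedian, baselineMedian) < 25`. Using the median rather than the mean keeps a single slow run (GC pause, noisy CI host) from failing the test.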

…n + docs

Completes the perf/memory test suite started in Scope A. Adds per-scenario latency budgets, atomFamily-style leak detection via long-run heap sampling, and a README explaining what each test catches and how to handle failures.

NEW TESTS: web/packages/agenta-entities/src/etl/__tests__/

runLoop.benchmark.test.ts (7 scenarios)
- passthrough, 200 rows: p95 budget 5ms (local p95: 0.09ms)
- tier1 eq filter, 200 rows: p95 budget 5ms (0.08ms)
- tier1 gte filter, 200 rows: p95 budget 5ms (0.06ms)
- tier2 in-set filter, 200 rows: p95 budget 10ms (0.07ms)
- map transform, 200 rows: p95 budget 8ms (0.06ms)
- large chunk, 1000 rows: p95 budget 15ms (0.75ms)
- multi-transform chain (5 filters), 200 rows: p95 budget 12ms (0.09ms)
Budgets carry roughly 50-150x headroom over the local p95 for the 200-row scenarios (20x for the 1000-row chunk) to absorb CI variance. Each test reports its actual p50/p95/p99/max in test output for trend observability.

runLoop.leak.test.ts (2 assertions, --expose-gc required, ~300ms)
- 100 iter fresh source/sink: heap regression slope < 50 KB/iter. Local: 1.51 KB/iter.
- 500 iter: heap range (max - min) under 5MB. Local: 0.77MB.
Catches atomFamily leaks in makeSourceFromPaginatedStore indirectly: persistent atom entries manifest as monotonic heap growth.

DOCS: web/packages/agenta-entities/src/etl/__tests__/README.md
What each test catches, how to interpret failures, how budgets are calibrated, how to add new tests under the existing convention.

SCRIPTS: web/packages/agenta-entities/package.json
- test:etl: guarantees only (~300ms, every PR)
- test:etl:memory: memory + overhead + benchmark (~1s, every PR)
- test:etl:longrun: leak detection (~30s, nightly)

Verified locally:
- test:etl: 9/9 pass
- test:etl:memory: 13/13 pass (4 memory + 2 overhead + 7 benchmark)
- test:etl:longrun: 2/2 pass

Together Scope A + B turn design RFC claims into enforceable assertions:

Claim                              Test
Memory bounded by chunk size     → runLoop.memory.test.ts
Cancellation releases held       → runLoop.memory.test.ts
Engine adds minimal overhead     → runLoop.overhead.test.ts
Per-scenario p95 latency         → runLoop.benchmark.test.ts
No leaks across pipelines        → runLoop.leak.test.ts

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
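Two statistics carry the Scope B assertions: a p95 over per-chunk latencies and a heap-growth slope over iterations. The sketch below shows one plausible way to compute each. The commit message does not say which percentile or regression method the tests use, so nearest-rank and ordinary least squares are assumptions here, as are the function names.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p% of samples are <= it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("percentile of empty sample set");
  const sorted = [...samples].sort((a, b) => a - b);
  // Multiply before dividing so integer inputs stay exact.
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(0, rank - 1)];
}

// Ordinary least-squares slope of y against the sample index (0, 1, 2, ...).
// Fed with heap sizes sampled once per iteration, this yields "bytes per iteration".
function leastSquaresSlope(ys: number[]): number {
  const n = ys.length;
  if (n < 2) throw new Error("need at least two samples to fit a slope");
  const meanX = (n - 1) / 2;
  const meanY = ys.reduce((sum, y) => sum + y, 0) / n;
  let numerator = 0;
  let denominator = 0;
  for (let i = 0; i < n; i++) {
    numerator += (i - meanX) * (ys[i] - meanY);
    denominator += (i - meanX) ** 2;
  }
  return numerator / denominator;
}
```

A slope is more robust than comparing first and last samples because GC timing makes individual heap readings noisy; the range check (max - min) in the second leak assertion covers the complementary case of a large transient spike.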

PoC observation that surfaced this work: the same eval run shows different "Total scanned" values across runs (200 / 200 / 300). The data does not differ; viewport-driven cancellation stops the loop early when matches arrive before the viewport target.

POC OUTPUT: web/oss/poc/etl-poc-entities.ts
Throughput section now reports:
- Stop reason: "viewport-fill cancellation" vs "source exhausted"
- Dataset coverage: "100% — scanned all N rows" or "partial — N rows"
- Over-fetch waste (viewport-cancelled only): "180 rows matched beyond viewport target of 20 (900% over)"
Before: "Rows scanned: 200" was ambiguous between dataset size and where we stopped.

ARCHITECTURE: docs/designs/eval-package-architecture.md
New section "Chunk size selection — the RTT vs over-fetch trade-off" in Limitations section:
- Mermaid diagram showing the small-chunks-vs-big-chunks cost surface
- Measured table from real PoC (300-row eval):

  chunk   viewport   RTTs   over-fetch
  25      200        8      0
  200     20         1      180 (9× viewport)
  1000    20         1      980 (49× viewport)

- Recommended sizing per 6 consumer patterns
- Filter UX implications + two mitigation strategies

FILTER RFC: docs/designs/eval-filtering.md
New constraint C6 in Performance constraints with same empirical data scoped to filter use case. Cross-references the architecture doc.

Both docs now name the over-fetch cost explicitly rather than letting chunk_size be an arbitrary default.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
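The RTT vs over-fetch trade-off in the table above follows from simple arithmetic. The sketch below reproduces the table's numbers under two simplifying assumptions that are mine, not the RFC's: every scanned row counts toward the viewport target, and the source always has a full chunk available. The function name is hypothetical.

```typescript
interface ChunkSizingEstimate {
  rtts: number; // round-trips needed to fill the viewport
  fetched: number; // rows pulled over the network
  overFetched: number; // rows fetched beyond the viewport target
  overFetchRatio: number; // over-fetch expressed as a multiple of the viewport
}

function estimateChunkSizing(chunkSize: number, viewportTarget: number): ChunkSizingEstimate {
  if (chunkSize <= 0 || viewportTarget <= 0) {
    throw new RangeError("chunkSize and viewportTarget must be positive");
  }
  const rtts = Math.ceil(viewportTarget / chunkSize);
  const fetched = rtts * chunkSize;
  const overFetched = fetched - viewportTarget;
  return { rtts, fetched, overFetched, overFetchRatio: overFetched / viewportTarget };
}
```

For example, `estimateChunkSizing(200, 20)` gives 1 round-trip and 180 over-fetched rows (9× the viewport), while `estimateChunkSizing(25, 200)` gives 8 round-trips and no over-fetch. With a selective filter the real round-trip count grows, since fewer rows per chunk match.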

… JSON output

POC OUTPUT: web/oss/poc/etl-poc-entities.ts
New "Rows per RTT" metric in Throughput section:
  Rows per RTT   200 (1 RTT(s) for 200 rows)
  Rows per RTT   25 (8 RTT(s) for 200 rows)
New "Network requests (HTTP)" subsection showing every HTTP call grouped by endpoint with count, total ms, median ms, bytes. Implemented via axios interceptors capturing per-call latency and bytes; stored for both human-readable output and JSON report.
AGENTA_OUTPUT=json mode: suppresses human output, emits a single structured JSON report on stdout covering config / runtime / outcome / throughput / latency / network / memory / chunks / assertions. Useful for CI artifacts, benchmark pipelines, scripted analysis.

CURSOR IMPROVEMENT: realScenarioSource.ts + PoC fetchPage callback
Refined three-case cursor resolution:
  Case 1: server returns windowing.next as string: use it
  Case 2: server returns windowing object with next=null/missing: authoritative end-of-stream, skip heuristic
  Case 3: server omits windowing entirely: fall back to OSS last-row-id heuristic
  Plus: items.length < limit → definitive end
Saves one RTT for backends signalling end via windowing:{next:null}. Doesn't help local Agenta /evaluations/scenarios/query (omits windowing, so it still pays the phantom RTT). Improvement is correct for spec-compliant backends.

ARCHITECTURE: docs/designs/eval-package-architecture.md
Updated chunk-size table with "Rows per RTT" and "Scan rate" columns. New paragraph: rows-per-RTT is the load-bearing metric, not rows-per-second. RTT amortization explanation. Recommendation: size chunks for rows-per-RTT.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
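The three-case cursor resolution can be sketched as a single function. The response shape and the function name below are hypothetical, and two details are my assumptions because the commit message leaves them open: the short-page check is applied only in the fallback case (a server-provided windowing object is trusted first), and a `windowing: null` field is treated the same as an omitted one.

```typescript
// Hypothetical page shape; the real response type is not shown in this PR text.
interface PageResponse<T extends { id: string }> {
  items: T[];
  windowing?: { next?: string | null } | null;
}

// Returns the cursor for the next request, or null when the stream has ended.
function resolveNextCursor<T extends { id: string }>(
  page: PageResponse<T>,
  limit: number,
): string | null {
  const windowing = page.windowing;
  if (windowing !== undefined && windowing !== null) {
    // Case 1: the server names the next cursor, so use it.
    if (typeof windowing.next === "string" && windowing.next.length > 0) return windowing.next;
    // Case 2: windowing is present but next is null or missing: authoritative end.
    return null;
  }
  // Case 3: no windowing at all. A short page is a definitive end ...
  if (page.items.length < limit) return null;
  // ... otherwise fall back to the last row's id, which may cost one extra
  // (empty) request when the dataset size is an exact multiple of the limit.
  return page.items[page.items.length - 1].id;
}
```

The "phantom RTT" mentioned above is exactly that last branch: without a windowing object, a full final page cannot be told apart from a page with more data behind it.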

The shared/index barrel transitively re-exports React-coupled paginated table helpers and the UserAuthorLabel TSX component, dragging @agenta/ui CSS modules into any consumer. That broke Node-side execution (scripts, tests, ETL adapters) — importing safeParseWithLogging via the barrel crashed on the first .module.css. Patch each api+core+state file to deep-import from the pure source modules (shared/utils/zodSchema, shared/molecule/...) instead of the contaminated barrel. The api+schema layers have no React surface and must stay Node-safe so they can be reused outside the browser. Also adds EvaluationMetric schema + queryEvaluationMetrics API to the evaluationRun module — the metric-query endpoint was the last entity the hydrate pipeline needs and wasn't represented in the package yet. Plus a bug fix: fetchTestcasesBatch was missing the cache-write side effect that fetchTestcasesPage already had — both batch fetchers now consistently populate the TanStack cache. Files patched: annotation, testcase, trace, evaluationRun api/core/state layers.

Scenarios as returned by /evaluations/scenarios/query are *references* — id, status, run_id, testcase_id only. To render anything meaningful in the UI (input data, app outputs, evaluator scores, traces) we have to join 4 additional entities per chunk: results, metrics, testcases, and traces. This transform stages those bulk fetches behind one chunk-level boundary so the rest of the pipeline sees fully-materialized rows. The four sub-fetches are injected via a HydrateFetchers interface, not hardcoded. Default is raw-HTTP via the entities-package api functions; callers can swap in molecule-backed or cache-aware variants without touching the transform itself. Two-stage internally: 1. results + metrics (parallel, scoped by scenarioIds) 2. testcases + traces (parallel, derived from result.testcase_id and result.trace_id once stage 1 completes) Bounded request budget: 4 bulk calls per chunk, independent of chunk size or column count. The transform is what makes the architecture RFC's Convention 7 ('correlatedDataPrefetch — data presence is a store concern, not a cell concern') concrete. Without this stage, the v1 filter has nothing to materialize predicates against.
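A minimal sketch of the two-stage hydrate described above, with the four fetchers injected. The entity shapes are reduced to the fields the join needs and are assumptions (only `testcase_id` and `trace_id` on results are named in the commit message); the interface and function names are illustrative rather than the package's actual exports.

```typescript
// Simplified, hypothetical entity shapes.
interface Scenario { id: string }
interface EvalResult { scenario_id: string; testcase_id?: string; trace_id?: string }
interface Metric { scenario_id: string; data: Record<string, unknown> }
interface Testcase { id: string; data: Record<string, unknown> }
interface Trace { id: string }

// The four bulk fetches are injected so callers can swap raw-HTTP for cache-aware variants.
interface HydrateFetchers {
  results(scenarioIds: string[]): Promise<EvalResult[]>;
  metrics(scenarioIds: string[]): Promise<Metric[]>;
  testcases(testcaseIds: string[]): Promise<Testcase[]>;
  traces(traceIds: string[]): Promise<Trace[]>;
}

interface HydratedRow {
  scenario: Scenario;
  results: EvalResult[];
  metrics: Metric[];
  testcases: Testcase[];
  traces: Trace[];
}

const uniqueDefined = (ids: Array<string | undefined>): string[] =>
  Array.from(new Set(ids.filter((id): id is string => typeof id === "string")));

async function hydrateChunk(scenarios: Scenario[], fetchers: HydrateFetchers): Promise<HydratedRow[]> {
  const scenarioIds = scenarios.map((s) => s.id);

  // Stage 1: results + metrics in parallel, scoped by scenario ids.
  const [results, metrics] = await Promise.all([
    fetchers.results(scenarioIds),
    fetchers.metrics(scenarioIds),
  ]);

  // Stage 2: testcases + traces in parallel, using ids discovered in stage 1.
  const [testcases, traces] = await Promise.all([
    fetchers.testcases(uniqueDefined(results.map((r) => r.testcase_id))),
    fetchers.traces(uniqueDefined(results.map((r) => r.trace_id))),
  ]);

  const testcaseById = new Map(testcases.map((t) => [t.id, t] as const));
  const traceById = new Map(traces.map((t) => [t.id, t] as const));

  return scenarios.map((scenario) => {
    const own = results.filter((r) => r.scenario_id === scenario.id);
    return {
      scenario,
      results: own,
      metrics: metrics.filter((m) => m.scenario_id === scenario.id),
      testcases: own.flatMap((r) => {
        const t = r.testcase_id ? testcaseById.get(r.testcase_id) : undefined;
        return t ? [t] : [];
      }),
      traces: own.flatMap((r) => {
        const t = r.trace_id ? traceById.get(r.trace_id) : undefined;
        return t ? [t] : [];
      }),
    };
  });
}
```

However many scenarios the chunk holds, the function issues exactly four fetcher calls, which is the "bounded request budget" property the commit describes.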

…uping The run document (POST /evaluations/runs/query) carries run.data.steps and run.data.mappings — the eval graph and the column manifest the UI renders. Each mapping declares { column.kind, column.name, step.key, step.path } and the renderer is supposed to resolve each cell value declaratively against the joined entities. This module implements that resolver. Dispatch on step.type (not column.kind) so future custom step types just register a strategy without editing existing branches: input → resolveFromTestcase (testcase.data.*) invocation → resolveFromTrace (walks spans tree on result.trace_id) annotation → composeResolvers(metric, trace) • metric.data[step.key][flat-path] (cheap, pre-aggregated) • trace fallback if metric absent Plus column grouping: derived from step.type + step.references so two evaluators with the same column name (e.g. both emit 'success') resolve to distinct group keys ('evaluator:exact-match' vs 'evaluator:fuzzy-match') and the UI's grouped-header layout mirrors the live screenshot. Path-based override: paths under attributes.ag.metrics.* land in the cross-cutting 'Metrics' group regardless of the underlying step type — matches the UI's Cost/Duration/Tokens/Errors column placement. Includes a multi-shape findInTrace walker that handles {spans:{<name>:span}}, {spans:[...]}, {response:{tree:[...]}}, deep child trees, and the envelope-as-span case — so different trace endpoints don't require a resolver patch. 41 unit tests covering each strategy, the multi-evaluator collision case, metric-override-on-path, group ordering, edge cases, extensibility via customResolvers + fallbackResolver.
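The dispatch-on-step.type design can be illustrated with a small strategy registry. This is a simplified sketch, not the module's code: the row and mapping shapes are assumed, `step.path` is treated as a dotted path relative to the resolved entity, and the trace resolver does a plain path lookup instead of the multi-shape span walker the commit describes.

```typescript
// Hypothetical, reduced shapes for illustration.
interface Mapping {
  column: { kind: string; name: string };
  step: { key: string; path: string };
}

interface JoinedRow {
  testcase?: { data: Record<string, unknown> };
  metric?: { data: Record<string, Record<string, unknown>> };
  trace?: Record<string, unknown>;
}

type Resolver = (row: JoinedRow, mapping: Mapping) => unknown;

// Walks a dotted path; returns undefined as soon as a segment is missing.
const getByPath = (root: unknown, path: string): unknown =>
  path.split(".").reduce<unknown>(
    (acc, key) =>
      acc !== null && typeof acc === "object" ? (acc as Record<string, unknown>)[key] : undefined,
    root,
  );

const resolveFromTestcase: Resolver = (row, m) => getByPath(row.testcase?.data, m.step.path);
const resolveFromTrace: Resolver = (row, m) => getByPath(row.trace, m.step.path);
// Metrics are looked up by step key, then by the flat (undotted) path.
const resolveFromMetric: Resolver = (row, m) => row.metric?.data[m.step.key]?.[m.step.path];

// Tries each resolver in order and returns the first defined value.
const composeResolvers = (...resolvers: Resolver[]): Resolver => (row, m) => {
  for (const resolve of resolvers) {
    const value = resolve(row, m);
    if (value !== undefined) return value;
  }
  return undefined;
};

// Registry keyed by step.type; new step types register here without editing existing branches.
const resolverByStepType = new Map<string, Resolver>([
  ["input", resolveFromTestcase],
  ["invocation", resolveFromTrace],
  ["annotation", composeResolvers(resolveFromMetric, resolveFromTrace)],
]);

function resolveCell(row: JoinedRow, mapping: Mapping, stepType: string): unknown {
  const resolver = resolverByStepType.get(stepType);
  return resolver ? resolver(row, mapping) : undefined;
}
```

Keeping the registry as a Map means a custom step type is one `resolverByStepType.set(...)` call, which mirrors the extensibility claim in the commit message.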

… tests Adds the entity layer the hydrate transform needed: evaluationResultMolecule.actions.prefetchByScenarioIds (new) evaluationMetricMolecule.actions.prefetchByScenarioIds (new) prefetchTestcasesByIds (new sidecar) prefetchTracesByIds (new sidecar) buildMoleculeBackedFetchers / MOLECULE_BACKED_HYDRATE_FETCHERS Every action consults the shared TanStack QueryClient before bulk-fetching misses, then writes results back. Cache keys: ['evaluation-results', projectId, runId, scenarioId] ['evaluation-metrics', projectId, runId, scenarioId] ['testcase', projectId, testcaseId] ['trace-entity', projectId, traceId] Empty arrays are cached too — a scenario with no metrics doesn't refetch every time. Caller-friendly stat block returned per call (cacheHits, cacheMisses, fetchMs) so observability surfaces can report hit ratios. evictByRunId on the result/metric molecules bulk-clears every entry for a run via prefix match — required for long-run ETL passes that rotate through many runs and would otherwise leak TanStack observer state. Trace prefetch uses fetchAllPreviewTraces with an IN filter directly (one network round-trip for the entire batch). Routing through the existing per-id traceBatchFetcher would have hit its maxBatchSize=50 cap and turned 100 ids into 2 round-trips instead of 1 — measured ~50% throughput regression vs raw fetcher. Bypassing the per-id coalescer for already-bulk inputs keeps performance flat. Tests: 15 unit tests for the cache contract (read, write, invalidate, isolation) 5 leak tests (--expose-gc) verifying: - re-prefetching same scenarios doesn't grow cache - evictByRunId fully releases run-scoped entries - evictByRunId is run-scoped (other runs untouched) - 100 iterations with evict-between → heap slope ~7 KB/iter - WITHOUT eviction → heap grows linearly (documents caller responsibility) package.json: test:etl:longrun split into per-file subprocesses to avoid cross-test Jotai store pollution. 
--test-force-exit ensures the process terminates after the test runner finishes.
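The cache contract described in this commit (consult the cache, bulk-fetch only the misses, write back including empty arrays, evict by run) is sketched below. A plain `Map` stands in for the TanStack QueryClient so the example stays self-contained; that substitution, the string key encoding, and all names are assumptions and not the molecules' real implementation.

```typescript
interface PrefetchStats { cacheHits: number; cacheMisses: number; fetchMs: number }
interface RunScope { prefix: string; projectId: string; runId: string }

// Key prefix shared by every entry of one run; the scenario id is appended to it.
const runPrefix = (s: RunScope): string => `${s.prefix}|${s.projectId}|${s.runId}|`;

async function prefetchByScenarioIds<V extends { scenario_id: string }>(
  cache: Map<string, V[]>, // stand-in for the shared query cache
  scope: RunScope,
  scenarioIds: string[],
  fetchBulk: (missingIds: string[]) => Promise<V[]>,
): Promise<{ byScenario: Map<string, V[]>; stats: PrefetchStats }> {
  const byScenario = new Map<string, V[]>();
  const misses: string[] = [];
  for (const id of scenarioIds) {
    const cached = cache.get(runPrefix(scope) + id);
    if (cached) byScenario.set(id, cached); // an empty array is still a hit
    else misses.push(id);
  }

  let fetchMs = 0;
  if (misses.length > 0) {
    const started = Date.now();
    const fetched = await fetchBulk(misses); // one bulk call for all misses
    fetchMs = Date.now() - started;
    const missing = new Set(misses);
    for (const id of misses) byScenario.set(id, []);
    for (const row of fetched) {
      if (missing.has(row.scenario_id)) byScenario.get(row.scenario_id)?.push(row);
    }
    // Write back every miss, so scenarios with no rows are not refetched next time.
    for (const id of misses) cache.set(runPrefix(scope) + id, byScenario.get(id) ?? []);
  }

  return {
    byScenario,
    stats: { cacheHits: scenarioIds.length - misses.length, cacheMisses: misses.length, fetchMs },
  };
}

// Removes every cached entry for one run via prefix match; returns how many were removed.
function evictByRunId<V>(cache: Map<string, V[]>, scope: RunScope): number {
  const doomed = Array.from(cache.keys()).filter((key) => key.startsWith(runPrefix(scope)));
  doomed.forEach((key) => cache.delete(key));
  return doomed.length;
}
```

The last leak test in the list above is the reason `evictByRunId` exists as a separate call: the cache does not shrink on its own, so a caller that rotates through many runs has to evict between them.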

…cache diagnostics jotai-family's atomFamily memoizes one atom per unique param and exposes .remove() for eviction, but no way to ask 'how many entries does this family hold right now?'. That makes memory diagnosis impossible — a family holding 50 ids and one holding 50,000 look identical from outside. This wrapper adds .size(), .params(), .clear() plus an optional global registry so consumers can list every instrumented family and its current param count via inspectAtomFamilies(). Drop-in compatible with the base factory's two call shapes (create only / create + areEqual). The trace store's 9 atom families and the 16 atom families across createPaginatedEntityStore + createInfiniteTableStore + createInfiniteDatasetStore migrate to the instrumented variant. Each paginated store now exposes: store.dispose() → releases every owned family's params AND removes the store's TanStack queries by prefix store.familySizes() → diagnostic snapshot per internal family Result on the combined leak test (50 iterations of pipeline + adapter + molecules with full teardown): before: 70 KB/iter heap slope (~50 KB unattributed paginated overhead) after: 0.5 KB/iter heap slope (~140× reduction, completely flat) The unattributed bleed was two distinct concerns: 1. 16 atomFamily closures per store instance retained forever — fixed by instrumentedAtomFamily.clear() in dispose() 2. TanStack queries keyed by [options.key, scopeId, ...] never evicted when the scopeId rotates — fixed by removeQueries({queryKey:[options.key]}) in dispose() Adds cacheDiagnostics module: inspectCache() walks the QueryClient by prefix, inspectMemory() bundles cache + atomFamily + heap, clearCacheByPrefix() for explicit teardown. Default prefix list includes span-level cache (populated as a side-effect of traceBatchFetcher) so per-trace memory cost isn't under-counted. 
Subtle bug fix in the migration: the python-script injection initially produced atomFamily(create, areEqual, name) but the local wrapper only accepted (create, name) and silently dropped the areEqual function. Without areEqual, params dedup by reference identity → every call creates a fresh atom → pagination state lost between chunks. Caught by re-running the PoC against the real backend (only 50 rows scanned instead of 300). Wrapper now accepts both call shapes. Combined leak test verifies the full chain: real paginatedStore adapter + molecule prefetch + atom family teardown stays under 30 KB/iter budget (actual: 3-7 KB/iter).
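The instrumentation idea, and the areEqual bug described above, can be shown with a generic memoizing family. This sketch deliberately does not use jotai or jotai-family: it is a standalone stand-in with hypothetical names, it uses a linear scan for lookups, and it omits the optional name argument and global registry mentioned in the commit.

```typescript
interface InstrumentedFamily<P, V> {
  (param: P): V;
  remove(param: P): void;
  size(): number;
  params(): P[];
  clear(): void;
}

// Supports both call shapes: (create) and (create, areEqual).
// Without areEqual, params are compared by identity (Object.is).
function instrumentedFamily<P, V>(
  create: (param: P) => V,
  areEqual: (a: P, b: P) => boolean = Object.is,
): InstrumentedFamily<P, V> {
  let entries: Array<{ param: P; value: V }> = [];
  const indexOf = (param: P): number => entries.findIndex((e) => areEqual(e.param, param));

  const get = (param: P): V => {
    const existing = entries[indexOf(param)]; // undefined when indexOf returns -1
    if (existing) return existing.value;
    const value = create(param);
    entries.push({ param, value });
    return value;
  };

  return Object.assign(get, {
    remove(param: P): void {
      const i = indexOf(param);
      if (i >= 0) entries.splice(i, 1);
    },
    size: (): number => entries.length,
    params: (): P[] => entries.map((e) => e.param),
    clear(): void {
      entries = [];
    },
  });
}
```

Calling a family created without `areEqual` twice with `{runId: "r1"}` object literals yields two entries, because the literals are different references. That is the failure mode the commit describes: state attached to the first entry is invisible to the second lookup. Passing `(a, b) => a.runId === b.runId` collapses them into one entry.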

…calation signal)

Implements the v1 client-side filter the eval-filtering RFC commits to shipping first, plus the meter that decides when to escalate to v2.

rowPredicateFilter:
Post-hydrate transform that drops materialized rows failing a value-equality predicate against any resolved UI column. Targets columns by their group (testset / application / evaluator / metrics) + name + optional group slug for evaluator namespacing. Supports AND-composition via predicates: [...]. Operators: eq, ne, in, nin, lt, lte, gt, gte.
Includes unwrapStatsForCompare, which collapses metric stats blobs to their dominant value before comparison so callers write 'value: false' regardless of whether the column resolves via metric ({type:'binary', freq:[...]}) or trace (raw boolean).
This transform runs AFTER hydrate because predicates target joined values that don't exist on the bare scenario (evaluator output, metric thresholds). Wasted hydration on dropped rows is the explicit cost the hit-ratio meter decides whether to tolerate.

hitRatioMeter:
Tracks (matched / scanned) per chunk via rolling window. State machine:
  warming  → < windowSize chunks observed
  client   → rolling ratio ≥ threshold (v1 comfortable, keep client)
  escalate → rolling ratio < threshold (recommend v2 server-side filter)
RFC defaults: windowSize=3, threshold=0.10. Configurable.
REPORTS the regime; does NOT swap engines. The actual swap (next chunk's source request carries 'filtering' payload, predicateFilter becomes a no-op) is the v2 milestone. The meter is the seam where that integration will land.

Verified against this PR's run:
- 86% pass-rate filter → stays 'client' from chunk 3 onward
- 1% pass-rate composite filter → trips 'escalate' at chunk 3

13 unit tests covering state transitions, window sliding, edge cases, config validation, custom resolver registration.
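The hit-ratio meter's state machine is small enough to sketch in full. The version below follows the regimes and defaults quoted above (windowSize=3, threshold=0.10); the factory name, the return shape, and the choice to pool matched and scanned counts across the window (rather than average per-chunk ratios) are assumptions of this sketch.

```typescript
type Regime = "warming" | "client" | "escalate";

interface MeterConfig {
  windowSize: number; // number of most recent chunks considered
  threshold: number; // minimum rolling hit ratio to stay on the client
}

function createHitRatioMeter(config: Partial<MeterConfig> = {}) {
  const windowSize = config.windowSize ?? 3;
  const threshold = config.threshold ?? 0.1;
  if (!Number.isInteger(windowSize) || windowSize < 1) {
    throw new RangeError("windowSize must be a positive integer");
  }
  if (!(threshold >= 0 && threshold <= 1)) {
    throw new RangeError("threshold must be within [0, 1]");
  }

  const chunks: Array<{ scanned: number; matched: number }> = [];

  // Pooled ratio over the window; an empty or all-zero window reports 1 (nothing wasted yet).
  const ratio = (): number => {
    const scanned = chunks.reduce((sum, c) => sum + c.scanned, 0);
    const matched = chunks.reduce((sum, c) => sum + c.matched, 0);
    return scanned === 0 ? 1 : matched / scanned;
  };

  const regime = (): Regime => {
    if (chunks.length < windowSize) return "warming";
    return ratio() >= threshold ? "client" : "escalate";
  };

  // Records one chunk, slides the window, and returns the resulting regime.
  const observe = (scanned: number, matched: number): Regime => {
    chunks.push({ scanned, matched });
    if (chunks.length > windowSize) chunks.shift();
    return regime();
  };

  return { observe, ratio, regime };
}
```

The meter only reports; a caller decides what to do with `"escalate"`. That matches the commit's note that the actual engine swap is left to the v2 milestone.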

Brings the PoC from the previous 'engine + synthetic source' shape up to a full v1 integration that exercises every layer added in this series:

  source:     createPaginatedEntityStore via makeSourceFromPaginatedStore
  transforms: 1. statusFilter           (cheap scenario-level filter)
              2. hydrateScenarios       (4 bulk fetches per chunk via molecules)
              3. predicateFilter (opt)  (post-hydrate, AND-composed clauses)
  sink:       in-memory accumulator

Run schema (run.data.steps + run.data.mappings) is fetched in pre-flight and threaded into resolveMappings so the row dump mirrors the UI's grouped-header layout exactly (Testset / Application / <Evaluator> / Metrics). Synthesizes implicit Metrics columns (tokens/cost/duration) for predicate targeting since run.data.mappings only declares user-visible columns.

Observability per chunk + final report:
- timing breakdown (fetch / transform / sink ms)
- cache hit/miss per entity per chunk (results/metrics/testcases/traces)
- entity cache size + per-prefix breakdown
- atom family params (instrumented registry snapshot)
- hit-ratio meter regime evolution + final recommendation
- cache eviction verification (before/after, atom-family cleanup)
- human + JSON output modes

A/B knob: AGENTA_FETCHER_MODE=raw|molecule swaps the hydrate fetchers between direct-HTTP and molecule-backed. After the trace-prefetch fix, both modes hit identical 33 HTTP requests and within-noise timing.

Filter knobs:
  AGENTA_PREDICATE_KIND     testset | application | evaluator | metrics
  AGENTA_PREDICATE_GROUP    optional slug (e.g. 'exact-match')
  AGENTA_PREDICATE_COLUMN   e.g. 'success', 'tokens.cumulative.total'
  AGENTA_PREDICATE_OP       eq | ne | in | nin | lt | lte | gt | gte
  AGENTA_PREDICATE_VALUE    JSON-parsed
  AGENTA_PREDICATE2_*       second clause for AND composition

End-to-end assertions (15 checks): engine guarantees, entity-layer integration, cache reuse (100% hits on rerun, 0ms network), correct row materialization, predicate filter shape.

Validated against run 019e3701-523f-7782-8813-9ca438f48399: 300 scenarios, 6 chunks of 50, 33 HTTP requests, ~1s end-to-end.

  Filter 'evaluator:exact-match.success eq false' → 258/300 (86%)  client
  Filter '... AND tokens > 35'                    → 3/300 (1%)    escalate
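As a rough illustration of how the filter knobs could map onto AND-composed clauses, here is a small sketch. It is not the PoC's code: `clauseFromEnv`, `matchesAll`, the `Clause` shape, and the flat row keyed by `kind:group.column` are hypothetical names and conventions chosen for this example. Only the environment variable names and the eight operators come from the commit message.

```typescript
type Op = "eq" | "ne" | "in" | "nin" | "lt" | "lte" | "gt" | "gte"
const OPS: readonly Op[] = ["eq", "ne", "in", "nin", "lt", "lte", "gt", "gte"]

interface Clause {
    column: string // assumed key format: "<kind>:<group>.<column>" or "<kind>:<column>"
    op: Op
    value: unknown
}

// Numeric comparisons only succeed when both sides are numbers.
function compareNumbers(a: unknown, b: unknown, test: (x: number, y: number) => boolean): boolean {
    return typeof a === "number" && typeof b === "number" && test(a, b)
}

function compare(actual: unknown, op: Op, expected: unknown): boolean {
    switch (op) {
        case "eq": return actual === expected
        case "ne": return actual !== expected
        case "in": return Array.isArray(expected) && expected.includes(actual)
        case "nin": return Array.isArray(expected) && !expected.includes(actual)
        case "lt": return compareNumbers(actual, expected, (x, y) => x < y)
        case "lte": return compareNumbers(actual, expected, (x, y) => x <= y)
        case "gt": return compareNumbers(actual, expected, (x, y) => x > y)
        case "gte": return compareNumbers(actual, expected, (x, y) => x >= y)
    }
}

// Hypothetical helper: builds one clause from AGENTA_PREDICATE_* style variables.
function clauseFromEnv(env: Record<string, string | undefined>, prefix = "AGENTA_PREDICATE"): Clause | null {
    const kind = env[`${prefix}_KIND`]
    const column = env[`${prefix}_COLUMN`]
    const op = env[`${prefix}_OP`]
    const raw = env[`${prefix}_VALUE`]
    if (!kind || !column || !op || raw === undefined) return null
    if (!OPS.includes(op as Op)) throw new Error(`unsupported operator: ${op}`)
    const group = env[`${prefix}_GROUP`]
    const key = group ? `${kind}:${group}.${column}` : `${kind}:${column}`
    return {column: key, op: op as Op, value: JSON.parse(raw)} // VALUE is JSON-parsed
}

// AND composition: a row passes only if every clause passes.
function matchesAll(row: Record<string, unknown>, clauses: Clause[]): boolean {
    return clauses.every((c) => compare(row[c.column], c.op, c.value))
}
```

A second clause would be read with `clauseFromEnv(env, "AGENTA_PREDICATE2")` and appended to the clause list before calling `matchesAll`.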

Colocate the PoC with the package it validates. The script imports exclusively from @agenta/entities + @agenta/shared — its placement in web/oss/ was inertia from the previous PoC harness (etl-entities-probe.ts already lived there), not because it has any OSS-specific code. After move: before: web/oss/poc/etl-poc-entities.ts after: web/packages/agenta-entities/poc/etl-poc-entities.ts All relative imports updated from ../../packages/agenta-entities/src/ to ../src/. No external-facing API changes; tests + lint pass; PoC runs end-to-end against the same run with identical output (300 rows, 33 HTTP requests, all assertions pass). Run from web/packages/agenta-entities/: pnpm exec tsx poc/etl-poc-entities.ts

The shared trace cache key ["trace-entity", projectId, traceId] stores a TracesApiResponse envelope ({count, traces: {[traceIdNoDashes]: data}}) because every other consumer — traceEntityAtomFamily, traceRootSpanAtomFamily, AnnotationTraceContent, EvalRunDetails/atoms/traces — expects that shape and indexes via data.traces[noDash] to drill in. Previously, cacheAwareFetchers.fetchTraces pre-unwrapped the envelope before passing to the hydrate transform so resolveMappings could walk it. Downstream cell consumers would have had to do the same unwrap. Consolidate by teaching findInTrace to handle the envelope directly as its own case (alongside the existing 4 trace shapes). Cell-side and fetcher-side unwraps drop out, the shared cache contract is preserved unchanged for all existing consumers. 41/41 resolveMappings tests still pass.
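The envelope case can be pictured with a small sketch. This is an illustration only, not the actual findInTrace code: `isTraceEnvelope` and `unwrapTraceEnvelope` are hypothetical helper names, and the type below models only the two envelope fields the commit message names (`count` and `traces` keyed by the dash-less trace ID).

```typescript
// Shape described in the commit message: {count, traces: {[traceIdNoDashes]: data}}
interface TracesApiResponse {
    count: number
    traces: Record<string, unknown>
}

// Hypothetical type guard: treats any object with an object-valued `traces` field as an envelope.
function isTraceEnvelope(value: unknown): value is TracesApiResponse {
    if (typeof value !== "object" || value === null) return false
    const traces = (value as {traces?: unknown}).traces
    return typeof traces === "object" && traces !== null
}

// Hypothetical helper: drills into the envelope, or passes other shapes through untouched
// so the existing (non-envelope) trace shapes keep their own handling.
function unwrapTraceEnvelope(value: unknown, traceId: string): unknown {
    if (!isTraceEnvelope(value)) return value
    const noDash = traceId.replace(/-/g, "")
    return value.traces[noDash]
}
```

Handling the envelope inside the resolver keeps the cached value in the shape every other consumer already expects, so nothing upstream needs to pre-unwrap it.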

…ties

The 4 entity types the ETL hydrate pipeline materializes per page (evaluation results, evaluation metrics, testcases, trace spans) previously had inconsistent prefetch entry points:

  evaluationResultMolecule.actions.prefetchByScenarioIds  (proper molecule)
  evaluationMetricMolecule.actions.prefetchByScenarioIds  (proper molecule)
  prefetchTestcasesByIds                                  (sidecar function)
  prefetchTracesByIds                                     (sidecar function)

Future maintainers had to know two patterns. With predicate-driven hydrate (next commit) and cell-side materialization landing soon, the gap becomes painful: consumers call `molecule.actions.prefetch*` and expect the convention to hold.

Add `testcaseMolecule.actions.prefetchByIds` and `traceSpanMolecule.actions.prefetchByIds` that wrap the existing sidecar functions (kept as separate exports for backwards compatibility with non-molecule callers). All 4 entities now expose the same shape:

  *Molecule.actions.prefetchBy{ScenarioIds|Ids}({projectId, ...ids}) → {cacheHits, cacheMisses, fetchMs, ...}

Also re-export EvaluationMetric type from the evaluationRun barrel so downstream callers can use molecule.actions.prefetchByScenarioIds without deep-importing for the type.

Today the ETL hydrate pipeline fetches 4 entity slices per page unconditionally: results + metrics + testcases + traces. When a predicate is active, most of that data isn't needed to evaluate the filter. For the common evaluator-output filter ("success eq true"), testcases + traces are entirely irrelevant; traces alone account for ~70% of bytes and ~67% of loop time on the 1000-scenario reference run.

Add predicateToEntitySlices: given a run schema + predicate(s), return the minimum set of entity slices the hydrate stage needs to fetch. Mirrors resolveMappings's read-side step.type → entity convention inversely (column → slice instead of column → value):

  testset     (step.type = input)                 → testcases (+ results for testcase_id)
  application (step.type = invocation)            → traces (+ results for trace_id)
  evaluator   (step.type = annotation)            → metrics (+ results, + traces if not metrics-flat)
  metrics     (path = attributes.ag.metrics.*)    → metrics

Unresolvable predicates fall back to all 4 slices to stay correct: over-fetching is safer than silently dropping a predicate.

Measured impact on the 1000-scenario reference run with predicate on evaluator:exact-match.success:

  loop time   12.5s → 2.5s   (-80%)
  bytes       27 MB → 9 MB   (-67%)
  requests    103 → 63       (-39%)
  peak heap   60 MB → 25 MB  (-57%)

Consumers (PoC + UI test page) wire this in subsequent commits.
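The column-to-slice mapping above can be sketched roughly as follows. This is a simplified stand-in for predicateToEntitySlices, not its real signature: the function name `slicesForPredicates`, the `PredicateTarget` shape, and representing an unresolvable predicate as `kind: null` are assumptions, and the "+ traces if not metrics-flat" branch for evaluators is deliberately left out to keep the example short.

```typescript
type Slice = "results" | "metrics" | "testcases" | "traces"
type PredicateKind = "testset" | "application" | "evaluator" | "metrics"

const ALL_SLICES: readonly Slice[] = ["results", "metrics", "testcases", "traces"]

// Hypothetical input: the kind the predicate resolved to, or null if it could not be resolved.
interface PredicateTarget {
    kind: PredicateKind | null
}

const SLICES_BY_KIND: Record<PredicateKind, readonly Slice[]> = {
    testset: ["testcases", "results"], // results carry testcase_id
    application: ["traces", "results"], // results carry trace_id
    evaluator: ["metrics", "results"], // traces branch omitted in this sketch
    metrics: ["metrics"],
}

// Returns the union of slices needed by all predicates.
function slicesForPredicates(predicates: PredicateTarget[]): Set<Slice> {
    const needed = new Set<Slice>()
    for (const p of predicates) {
        // Unresolvable predicate: over-fetch everything rather than drop the filter.
        if (p.kind === null) return new Set(ALL_SLICES)
        for (const slice of SLICES_BY_KIND[p.kind]) needed.add(slice)
    }
    return needed
}
```

For a single evaluator predicate this yields results + metrics, which is the slice set behind the measured 103 → 63 request reduction quoted above.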

Bundles uncommitted PoC improvements from the ETL investigation thread.

1) AGENTA_SINK_MODE (accumulate | streaming)
   - accumulate (default): sink keeps every hydrated row. Useful for post-hoc sample dumps.
   - streaming: sink updates running aggregates per row and releases the chunk. Mirrors what a production sink does.

   Decoupled the downstream reports (per-row summary stats, sample dump, engine guarantees, JSON output) from matchedRows[]. All read from a running SinkAggregate so streaming mode produces the same headline numbers without retaining rows.

2) AGENTA_HEAP_WALK (1 = run residual investigation walk)
   When enabled, after the loop completes the script tears down suspected retainers one at a time, force-GCs, snapshots heap delta after each step, dumps a V8 heap snapshot, and prints a per-step delta table. Gated because (a) writeHeapSnapshot is ~50 MB to /tmp, (b) the walk side-effects aggregate state, (c) the steady-state Memory bounded assertion already catches regressions without it.

3) Inlined reprefetch* into IIFEs
   The cache-reuse verification kept reprefetchResults/Metrics/Testcases/Traces as main() locals. Those return objects hold 1000 results each, pinning ~25 MB of heap on the main() stack for the rest of the script. Heap-snapshot retainer-path analysis traced post-eviction residual to those four bindings. Inlining the stat extraction into IIFEs releases the function return values the line after they're used. Cut residual from 25-48 MB to 9 MB; the Memory bounded engine guarantee now passes.

4) AGENTA_HYDRATE_SLICES (comma-separated subset of results,metrics,testcases,traces; default = all 4)
   Wraps the chosen HydrateFetchers so that slices not in the active set return empty results without network. Same hydrate transform body for both modes, so this is a fair A/B of just the fetch cost.

Measured impact, 1000-scenario reference run, warm backend:

                                     loop        requests   bytes    peak heap
  Baseline (all 4 slices)            12,520 ms   103        ~27 MB   +60 MB
  Slice-filtered (results+metrics)   2,494 ms    63         ~9 MB    +25 MB
  delta                              -80%        -39%       -67%     -57%

IIFE-inlined reprefetches (post-loop residual):
  before: 25-48 MB  ✗ Memory bounded
  after:  ~9 MB     ✓ Memory bounded
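The difference between the two sink modes can be shown with a small sketch. It is an assumption-laden illustration, not the PoC's sink: `createSink`, the `HydratedRow` fields, and the two aggregate fields are hypothetical; only the mode names and the idea of a running SinkAggregate come from the commit message.

```typescript
type SinkMode = "accumulate" | "streaming"

// Hypothetical row shape; the real hydrated rows carry far more data.
interface HydratedRow {
    scenarioId: string
    tokens: number
}

// Running aggregate that every downstream report reads from.
interface SinkAggregate {
    rowCount: number
    tokenTotal: number
}

function createSink(mode: SinkMode) {
    const aggregate: SinkAggregate = {rowCount: 0, tokenTotal: 0}
    const retained: HydratedRow[] = []

    return {
        write(chunk: HydratedRow[]): void {
            // Both modes update the aggregate row by row.
            for (const row of chunk) {
                aggregate.rowCount += 1
                aggregate.tokenTotal += row.tokens
            }
            // Only accumulate mode keeps a reference to the rows;
            // streaming mode lets the chunk be garbage-collected.
            if (mode === "accumulate") retained.push(...chunk)
        },
        report: (): SinkAggregate => ({...aggregate}),
        retainedCount: (): number => retained.length,
    }
}
```

Because the report reads only from the aggregate, both modes produce the same headline numbers while streaming mode retains no rows.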

…ipeline

Standalone debug route that mounts the production InfiniteVirtualTable with the entities-package ETL hydrate strategy wired into a real React + IVT context. Validates the architecture end-to-end before production wiring (which lives in a separate follow-up).

URL: /etl-poc/<runId>?project_id=<projectId>

Architecture (file map):

  pages/etl-poc/[evaluation_id].tsx    page route
  components/EtlPocScenarios/
    index.tsx                          main component
    scenarioPaginatedStore.ts          thin IVT row store ({key,id,scenarioId})
    useHydrateScenarios.ts             page-level bulk prefetch, predicate-driven
    useCellMaterialization.ts          cell-level lazy fetch, batched per tick
    cellMaterializerContext.ts         provider seam
    useEtlColumns.tsx                  columns from runSchema via resolveMappings
    useScopeChangeEviction.ts          evictByRunId on (projectId, runId) change
    EtlColumnHeader.tsx                testset/app entity-name resolver for headers
    PredicateFilterBar.tsx             one-predicate dropdown UI
    cells/EtlResolvedCell.tsx          per-cell molecule cache + resolveMappings

Key architectural moves:

- Thin row shape ({key, id, scenarioId, __isSkeleton}), the same convention as testcasePaginatedStore. All column data is materialized via molecule caches; the IVT row carries only identity.
- Predicate-driven hydrate via predicateToEntitySlices (see prior commit). When a predicate is active the page-level pass only fetches slices the filter touches. Hydrate-strategy toggle in the header lets you A/B auto-mode (predicate-driven) vs all-mode (legacy 4-slice fetch) live.
- Cell-side lazy materialization for slices the page-level hydrate skipped. 30 visible cells requesting (slice, id) in the same render tick coalesce into 1 bulk fetch per slice via a microtask queue. Uses the now-symmetric *Molecule.actions.prefetch* surface.
- hydrationVersionAtom: bumped after each completed batch so cells whose useMemo settled before stage 2 (testcases/traces) lands re-render and pick up the late cache writes.
- Scope-change eviction (useScopeChangeEviction): the cleanup snippet the production scenarios controller should call on runId change. Today production has no such call; this hook demonstrates the wire-up.
- Production-realistic layout chain: registered /etl-poc with the Layout's isFullHeight matcher; container className set on the IVT so its scroll container doesn't grow to content height (matches what InfiniteVirtualTableFeatureShell does internally for production tables).
- Resolver gets envelope-unwrapped values via findInTrace (prior commit) + stats-blob unwrap via unwrapStatsForCompare for display.

Out of scope for this commit (next-PR follow-ups):

- Wiring the molecule-backed hydrate into the actual production scenarios table (web/oss/src/components/EvalRunDetails/Table.tsx).
- Migrating evaluationPreviewTableStore to a thin entities-package store (this PR keeps it for the production view; the test page uses its own thin store).
- HTTP 429 retry/backoff in prefetchTracesByIds (currently silent warn).

…slices)

Previously, sliceMode="auto" had two meanings:
- With predicate: fetch only what the predicate touches.
- Without predicate: fetch all 4 slices (display fallback).

That bifurcation made "Auto" inconsistent: same toggle label, very different network behavior depending on filter state.

Single consistent semantic now: Auto means "fetch only what's needed right now."
- No predicate: 0 page-level fetches. Cells materialize their own data via useCellMaterialization (virtualization-aware, batched per microtask). Same path v2 server-side filtering will land on.
- Predicate active: fetch the predicate's slice set so the filter can run client-side.
- Unresolvable predicate: fall back to all 4 (correctness over speed).

Trade-off: no-predicate first paint shows skeleton cells until the first cell-side batch lands (~200-500ms typical). In exchange the page fetches exactly the data the visible window needs, with no eager fetching of data the user may never scroll to. Smaller memory profile, more honest semantics.

The "All slices" toggle option remains for A/B and for workflows that need every column populated up-front (bulk actions, exports). Header chip updated to show 'slices: none (cell-side on-demand)' when the page-level hydrate skips entirely.

When a strict predicate reduces the visible row count below the viewport (e.g. 1 match in the first 50-row page), the IVT's native scroll-bottom trigger never fires because the table never scrolls. loadMore is never called and the user is stuck looking at a partial table with no way to load more matches. Add a viewport-fill effect: while a predicate is active and the loaded dataset isn't exhausted, drive loadNextPage ourselves until either: - matched row count >= VIEWPORT_FILL_TARGET (30), or - paginationInfo.hasMore becomes false (full scan). isFetching dedupes concurrent calls; the effect re-runs after each page lands so we walk through pages one at a time. Skipped entirely when no predicate is active — native scroll-trigger handles that case. Production scenarios doesn't hit this because filtering runs server- side (the server returns already-matched rows). Test page does client-side v1 filter, so we own the fill loop.
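The stop conditions of the fill loop can be captured in a pure decision function, sketched below. This is not the test page's effect code: `shouldLoadNextPage` and the `FillState` shape are hypothetical, and the real implementation runs inside a React effect that re-evaluates after each page lands. The target of 30 and the four conditions come from the commit message.

```typescript
const VIEWPORT_FILL_TARGET = 30

// Hypothetical snapshot of the state the effect would read on each run.
interface FillState {
    predicateActive: boolean
    matchedCount: number
    hasMore: boolean // mirrors paginationInfo.hasMore
    isFetching: boolean
}

function shouldLoadNextPage(state: FillState, target = VIEWPORT_FILL_TARGET): boolean {
    if (!state.predicateActive) return false // native scroll trigger handles this case
    if (state.isFetching) return false // dedupe concurrent calls
    if (!state.hasMore) return false // full scan reached
    return state.matchedCount < target
}

// Simulates pages landing one at a time; returns how many pages were loaded.
function simulateFill(matchesPerPage: number, totalPages: number): number {
    let loadedPages = 1 // the first page is loaded by the table itself
    const state = (): FillState => ({
        predicateActive: true,
        matchedCount: loadedPages * matchesPerPage,
        hasMore: loadedPages < totalPages,
        isFetching: false,
    })
    while (shouldLoadNextPage(state())) loadedPages += 1
    return loadedPages
}
```

With 3 matches per 50-row page the simulation stops after 10 pages (30 matches); with 1 match per page and only 6 pages available it stops at 6 because the dataset is exhausted first.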

Request() marked IDs as in-flight BEFORE pushing to the queue. The subsequent drain's collectUnique then filtered out anything in inflightIds — i.e., every ID in the queue. Net effect: drain reads empty arrays for every slice, no prefetch fires, cells stay at '—' forever. Fix the dedup ordering: - request() filters duplicates within the current tick by checking inflightIds (across-tick dedup) AND scanning the existing queue (within-tick dedup). Doesn't mark inflight yet. - drain snapshots the queues, dedups within batch, then marks inflightIds before firing the fetch. .finally() clears the marks after the fetch resolves so later cells can re-request. Test page Auto mode + no predicate now correctly materializes visible cells: page-level hydrate fetches nothing, cells request their slices on first render, materializer batches into 1 bulk fetch per slice, hydrationVersionAtom bumps, cells re-render with cached data.
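The corrected ordering can be sketched as below. This is a simplified, single-slice model, not the materializer itself: `createBatcher`, `drain`, and `settle` are hypothetical names, and the drain is called manually here to keep the example synchronous, whereas the commit describes it running from a microtask with `.finally()` clearing the marks.

```typescript
function createBatcher() {
    const queue: string[] = []
    const inflightIds = new Set<string>()

    return {
        // Returns true if the ID was queued, false if it was deduplicated.
        request(id: string): boolean {
            // Across-tick dedup (already being fetched) and within-tick dedup (already queued).
            if (inflightIds.has(id) || queue.includes(id)) return false
            queue.push(id) // note: NOT marked in-flight yet
            return true
        },

        // Snapshots the queue, then marks the batch in-flight right before the fetch fires.
        drain(): string[] {
            const batch = [...new Set(queue)]
            queue.length = 0
            for (const id of batch) inflightIds.add(id)
            return batch
        },

        // Stand-in for the .finally() step: clears marks so later cells can re-request.
        settle(ids: string[]): void {
            for (const id of ids) inflightIds.delete(id)
        },
    }
}
```

With the old ordering, marking IDs in-flight inside `request()` would make the drain's own filter discard every queued ID; deferring the mark to `drain()` is what lets the batch through.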

Cell-side materialization on mount works for visible cells but lags when the user scrolls into a freshly-loaded page — cells mount, request, wait, render. Visible delay between page-arrival and cell-with-data, especially noticeable in Auto + no-predicate mode where page-level hydrate fetches zero slices. Close the gap with useLookaheadPrefetch — runs alongside cell-side materialization. Two stages, both routed through the existing materializer (dedup + batching reused for free): stage 1: new scenarios in pagination.rows → request results + metrics (the two slices keyed directly by scenarioId) stage 2: re-fires on every hydrationVersion bump (i.e., after stage-1 results land in cache). For each row's cached results, derive testcase_id / trace_id and request those slices. Self-bounded by per-id seen-sets so it doesn't re-fire for already-prefetched ids. Net effect: when IVT loads page N+1 via loadMore, all 4 slices begin fetching for that page's scenarios immediately. By the time the user scrolls cells into view (~few hundred ms later), the data is cached and cells render with no flash of '—'. Skipped when sliceMode === 'all' (page-level hydrate already covered every slice; nothing for lookahead to add). Materializer's dedup ensures requests against scenarios already in cache (because cells materialized them on mount earlier) are no-ops.

When a predicate is active the IVT's viewport is *constructed* from multiple pagination pages — the viewport-fill loop loads pages 1-10 to accumulate 30 matched rows, with the other 470 immediately filtered out by the predicate. Previously useLookaheadPrefetch took pagination.rows and prefetched results+metrics (stage 1) + testcases+traces (stage 2) for every loaded scenario. With a strict predicate (~3% match rate) that meant ~94% of the lookahead work was wasted on rows the user will never see — particularly stage 2's testcase + trace fetches that derive IDs per row. Switch the input to filteredRows. Now lookahead targets only the constructed viewport. No predicate: filteredRows == pagination.rows → no behavior change With predicate: filteredRows ⊂ pagination.rows → prefetch only matched Edge case: filteredRows includes 'pending' rows (matchesPredicate's keep-visible-until-known fallback for unhydrated rows). Some of those will later drop out as the predicate slices land and the filter re-evaluates — we'll have over-prefetched for them. Accepted: the predicate-driven page hydrate already fetches predicate slices for the same set so stage 1 is net zero, and stage 2 over-prefetch is the cost of not flashing rows in/out during predicate eval. Manageable. Hook also moved below the filteredRows useMemo so it sees the correct data — was previously above with the wrong input. Updates the hook's file header + the call-site comment to make the 'filteredRows not pagination.rows' contract explicit.

…ches

Symptom: with a predicate active (e.g. evaluator.score eq false), rows where the data clearly shows the predicate as false (e.g. score=true) were visible in the table. After a delay, the table updated to show only correctly-matching rows.

Root cause: matchesPredicate has a 'keep visible until known' fallback: if predicate slices aren't cached yet for a scenario, the row passes through. filteredRows useMemo's deps were [pagination.rows, predicate, schema, projectId, runId], none of which change when the molecule cache updates. So:

1. Pagination page loads → 50 new rows, predicate slices not cached
2. Filter runs: matchesPredicate falls back to 'keep visible' for all
3. Predicate slices land → hydrationVersionAtom bumps → cells re-render with real data
4. filteredRows DOESN'T re-run because its deps didn't change
5. Result: rows that don't actually match stay visible until the next pagination event triggers a filteredRows re-eval

Fix: subscribe to hydrationVersionAtom + add hydrationVersion to filteredRows deps. When the cache lands, filteredRows re-evaluates, and 'keep visible until known' rows that turn out not to match get correctly filtered out.

User-visible: brief 'keep visible' period during predicate evaluation remains (~few hundred ms while results+metrics land per page), then rows correctly filter. No more stale rows lingering until the next page load.

…explicit pending/confirmed count chip Two issues from real-world predicate UX on 1000-scenario run. (1) Slow load — predicate on evaluator.llm-as-a-judge.score triggered 'slices: results, metrics, traces (predicate-driven)'. The trace fetch is by far the slowest endpoint (~100ms median per call) and dominates loop time. But the score value lives in metric.data — the metric writer unfolds the evaluator's emitted attributes (incl. attributes.ag.data.outputs.*) as flat keys under data[stepKey][path]. predicateToEntitySlices previously added traces speculatively for any annotation predicate whose path didn't start with 'attributes.ag.metrics.*'. The heuristic was over-cautious — for the common case (evaluators that write to metric.data), trace is unused. Drop the speculative trace add. resolveMappings's composeResolvers (metric -> trace) still falls back to trace at read time if metric returns null for a column, so correctness is preserved. The cell-side materializer requests traces on demand for cells where they're actually needed (e.g. invocation outputs from a span tree). Predicate hydrate becomes results+metrics only — cuts ~60-70% of loop time on the 1000-scenario reference run. (2) Flaky count chip — '100 matched / 600 loaded' then '57 matched / 650 loaded'. The 100 was inflated by 'keep visible until known' pending rows during predicate evaluation; once data landed, the predicate evaluated to false and the count dropped to 57. Replace the simple length-based chip with PredicateCountChip that distinguishes: - confirmed: predicate slices loaded + predicate returned true - pending: predicate slices not loaded yet (keep-visible fallback) - totalLoaded: scenarios in pagination buffer Display: 'X matched · Y pending / Z loaded'. The user sees stable 'confirmed' growth instead of a number that oscillates while predicate slices land. Recomputed on each hydrationVersion bump via parent re-render. 41/41 resolveMappings tests still pass.
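The three counters behind the chip can be sketched as a pure function. This is illustrative only: `countPredicateRows`, `formatChip`, and the `RowStatus` shape are hypothetical, while the confirmed / pending / totalLoaded definitions and the display format follow the commit message.

```typescript
// Hypothetical per-row status the page would derive from the molecule caches.
interface RowStatus {
    slicesLoaded: boolean // predicate slices are in cache for this scenario
    matches: boolean // predicate result; only meaningful when slicesLoaded is true
}

interface PredicateCounts {
    confirmed: number // slices loaded + predicate returned true
    pending: number // slices not loaded yet (keep-visible fallback)
    totalLoaded: number // scenarios in the pagination buffer
}

function countPredicateRows(rows: RowStatus[]): PredicateCounts {
    let confirmed = 0
    let pending = 0
    for (const row of rows) {
        if (!row.slicesLoaded) pending += 1
        else if (row.matches) confirmed += 1
    }
    return {confirmed, pending, totalLoaded: rows.length}
}

function formatChip(c: PredicateCounts): string {
    return `${c.confirmed} matched · ${c.pending} pending / ${c.totalLoaded} loaded`
}
```

Keeping pending rows out of the matched number is what stops the displayed count from dropping once predicate slices land: confirmed can only grow as more rows are evaluated.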

…429) Reported 189+ failed requests piling up against /tracing/spans/query — all 429s, all retried forever. Root cause: prefetchTracesByIds (and the other slice prefetches) swallow errors and return empty outcomes with no cache writes. The materializer's drain .finally() clears inflightIds unconditionally, then the next cell render finds the cache still empty and re-requests. Tight retry loop. Add per-slice failedIds tracking to the materializer: 1. After each slice's bulk fetch resolves, read the cache for every requested ID. If the cache entry is undefined, the fetch failed silently (rate-limited, network blip, etc.) — mark the ID as failed in state.failedIds[slice]. 2. request() now checks failedIds first; if the ID is there, skip queueing. The lookahead hook's own seen-set already prevents re-queueing too, so combined they prevent any future request for a failed ID. Permanent for the session — user reloads to retry. Could add a TTL + exponential backoff later, but for the test page this is the simpler correct behavior. Cells for failed IDs render '—' indefinitely. Acceptable trade-off vs. the alternative of hammering a rate-limited endpoint and degrading every other consumer.
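The failure-tracking steps above can be sketched with two small functions. This is a simplified single-slice model under stated assumptions: `recordFailures` and `shouldQueue` are hypothetical names, the cache is modelled as a plain `Map`, and a missing cache entry after the fetch resolves is taken as the sign of a silent failure, as the commit message describes.

```typescript
// Step 1: after a bulk fetch resolves, any requested ID still absent from the
// cache is treated as a silent failure and remembered for the session.
function recordFailures(
    requestedIds: string[],
    cache: Map<string, unknown>,
    failedIds: Set<string>,
): void {
    for (const id of requestedIds) {
        if (cache.get(id) === undefined) failedIds.add(id)
    }
}

// Step 2: request() consults failedIds first and skips queueing known failures.
function shouldQueue(id: string, cache: Map<string, unknown>, failedIds: Set<string>): boolean {
    if (failedIds.has(id)) return false // permanent for the session, no retry loop
    return !cache.has(id) // already cached IDs need no fetch either
}
```

A TTL or exponential backoff, mentioned in the commit message as a possible later addition, would replace the plain `Set` with timestamps; the sketch keeps the session-permanent behaviour described for the test page.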

If the user scrolled to row 200, then added a filter that narrows the table to a different subset, the IVT preserved the prior scroll offset. The user landed somewhere mid-way through the filtered list, often past the first matches — feels like the filter applied to a different table or the filter is broken. Wire a tableRef into the IVT, watch the predicate state, and call tableRef.scrollTo({index: 0, align: 'top'}) on any predicate change (added, cleared, or modified). Skipped on the very first render so we don't fire a scroll on initial mount. Scheduled inside a requestAnimationFrame so the IVT has the new filteredRows mounted before we ask it to scroll.

EE's Next.js app uses filesystem routing over web/ee/src/pages/ and does not auto-inherit OSS pages — each route needs an explicit re-export file. The /etl-poc test page existed only in web/oss/src/pages/, so it 404'd when running the EE web frontend. Add web/ee/src/pages/etl-poc/[evaluation_id].tsx — a plain pass-through re-export of the OSS page (no EE-specific behaviour). Mirrors the existing EE page re-export convention (auth, _document, etc.). Now the test page is reachable on both OSS and EE web.

[fix] Resolve broken CORS headers

ardaerzin and others added 16 commits May 18, 2026 16:58

github-advanced-security AI found potential problems May 18, 2026

View reviewed changes

Comment thread web/packages/agenta-entities/poc/etl-poc-entities.ts Fixed

Comment thread web/packages/agenta-entities/poc/etl-poc-entities.ts

function row(label: string, value: string | number): void {
    const padded = label.padEnd(28)
    log(`  ${padded} ${value}`)
}

ardaerzin added 5 commits May 19, 2026 17:23

Merge branch 'release/v0.100.0' into fe-experiment/etl-poc

f514ccc

ardaerzin added 2 commits May 20, 2026 11:14

fix(entities): add missing comma in package.json scripts

56bcd5f

Merge branch 'release/v0.100.0' into fe-experiment/etl-poc

2b4a9f8

jp-agenta and others added 6 commits May 20, 2026 14:50

[fix] Resolve broken CORS headers

3d73f5f

v0.100.1

cae7a4c

Merge branch 'release/v0.100.1' into fix/broken-exposed-cors-headers

62e303e

Add missing check

17884b9

Merge pull request #4381 from Agenta-AI/fix/broken-exposed-cors-headers

14f43c1

[fix] Resolve broken CORS headers

Merge branch 'release/v0.100.1' into fe-experiment/etl-poc

2a6c2bd

[FE POC] ETL implementation for evaluation runs. #4354
ardaerzin wants to merge 41 commits into main from fe-experiment/etl-poc

4 participants