feat(eval): evaluate the dashboard-summary skill via the /summary endpoint #1646

Open
romrak wants to merge 5 commits into master from rr/summary-endpoint-eval

romrak commented Jun 4, 2026

What

Adds a dashboard_summary test kind to gooddata-eval so the GoodData dashboard-summary feature can be evaluated end-to-end, plus two supporting fixes uncovered while running it locally.

1. dashboard_summary test kind (Path B — dedicated /summary endpoint)

  • SummaryClient: sends summary_input via POST /api/v1/ai/workspaces/{ws}/summary and maps the JSON summary into a ChatResult. AFM is executed server-side, so there is no SSE and no client-side result_id handling.
  • DashboardSummaryEvaluator: rubric-based LLM judge. expected_output is a rubric with must_include, must_not_include, and rubric criteria. Each criterion is scored independently, so quality_score is the fraction of criteria satisfied. A case passes only if every gating (must_*) criterion is satisfied; rubric items affect the score but do not gate. must_not_include is scored as a plain presence check whose result is then inverted, because asking the judge to reason about "avoidance" under an EXPECTED OUTPUT label flipped its verdicts.
  • SummaryInput dataset field (only dashboard_id is required).
  • Runner ChatBackend now receives the whole DatasetItem; the CLI routes summary items to SummaryClient and everything else to ChatClient.
  • Registered as a lazy [llm-judge] evaluator (skipped without the extra, like general_question).
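The scoring rule described for DashboardSummaryEvaluator can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: CriterionResult, its fields, and score are hypothetical names, and the judge's verdict is reduced to a single boolean "present" per criterion.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class CriterionResult:
    """Hypothetical per-criterion verdict; not the PR's actual type."""

    kind: str      # "must_include", "must_not_include", or "rubric"
    text: str      # the criterion as written in expected_output
    present: bool  # judge's answer to "does the summary contain this?"

    @property
    def satisfied(self) -> bool:
        # Negative criteria are asked as plain presence checks, then inverted.
        if self.kind == "must_not_include":
            return not self.present
        return self.present


def score(results: list[CriterionResult]) -> tuple[float, bool]:
    """Return (quality_score, passed) for one summary."""
    if not results:
        # Assumption of this sketch: an empty rubric cannot pass.
        return 0.0, False
    quality_score = sum(r.satisfied for r in results) / len(results)
    # Only must_* criteria gate the pass/fail outcome.
    gating = [r for r in results if r.kind != "rubric"]
    passed = all(r.satisfied for r in gating)
    return quality_score, passed


example = [
    CriterionResult("must_include", "mentions total revenue", present=True),
    CriterionResult("must_not_include", "mentions raw SQL", present=False),
    CriterionResult("rubric", "is concise", present=False),
]
quality, passed = score(example)
```

In this example the unmet rubric item lowers quality_score to two thirds, but the case still passes because both gating criteria are satisfied.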

2. Fix: resolve the active LLM provider by type, not a fixed id

The workspace ACTIVE_LLM_PROVIDER setting is keyed by type on the backend and may exist under any id. Reading it by the hardcoded id activeLlmProvider missed existing settings (reported as "no active LLM provider"), and activating created a second setting of the same type (HTTP 409). The lookup now goes by setting_type, and activation reuses the existing id so the call performs an UPDATE rather than a CREATE.
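A rough sketch of the lookup-by-type idea, under stated assumptions: settings are modelled here as plain dicts with "id" and "setting_type" keys, which is a stand-in for whatever the real workspace-settings listing returns, and both function names are hypothetical.

```python
from __future__ import annotations

from typing import Optional

DEFAULT_ID = "activeLlmProvider"       # the id the old code always assumed
SETTING_TYPE = "ACTIVE_LLM_PROVIDER"


def find_setting_by_type(settings: list[dict], setting_type: str) -> Optional[dict]:
    """Return the first setting of the given type, whatever its id is."""
    return next((s for s in settings if s.get("setting_type") == setting_type), None)


def id_for_activation(settings: list[dict]) -> str:
    """Pick the id to write to, reusing an existing setting when there is one."""
    existing = find_setting_by_type(settings, SETTING_TYPE)
    # Reusing the existing id turns the write into an update and avoids
    # creating a second setting of the same type.
    return existing["id"] if existing else DEFAULT_ID


# A setting created elsewhere (for example by the UI) under a different id.
workspace_settings = [
    {"id": "ui-generated-123", "setting_type": "ACTIVE_LLM_PROVIDER"},
    {"id": "locale", "setting_type": "LOCALE"},
]
target_id = id_for_activation(workspace_settings)
```

Here target_id resolves to the existing setting's id instead of the fixed default, which is the behaviour the fix relies on.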

3. Fix: evaluator-agnostic FAIL note in the console report

The FAIL "Notes" column hardcoded visualization check names and showed a misleading "no visualization created" for summary items. It now lists whichever boolean checks are False.
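One way to build such an evaluator-agnostic note is sketched below. The function name and the shape of the checks mapping are assumptions for illustration, not the reporting code in this PR.

```python
from __future__ import annotations


def fail_note(checks: dict[str, object]) -> str:
    """List the names of all boolean checks that are False.

    Non-boolean values (scores, strings) are ignored, so the note works
    for any evaluator without knowing its check names in advance.
    """
    # "is False" skips falsy non-booleans such as 0 or "".
    failed = [name for name, value in checks.items() if value is False]
    return ", ".join(failed)


note = fail_note(
    {
        "must_include_ok": False,
        "must_not_include_ok": True,
        "quality_score": 0.0,  # numeric, not reported as a failed check
    }
)
```

With the input above the note contains only must_include_ok; the zero quality_score is not mistaken for a failed boolean check.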

Docs & examples

  • README: dashboard_summary section (rubric shape, summary_input) + supported-test-kinds row.
  • Three self-describing example cases under examples/summary_dataset/: full-dashboard, selected-visualizations (scoping), and format-hint-brief (format adherence). Each rubric is aligned to the endpoint's actual output and uses a small gating set.

Testing

  • New tests: test_summary_client.py, test_summary_evaluator.py, plus workspace lookup-by-type tests.
  • ruff format and ruff check are clean; the full suite passes (115 tests).
  • Manually verified against a local gen-ai instance: endpoint path /summary and request/response shapes confirmed; all three example cases pass with the OpenAI judge.

Notes

  • Verified the endpoint path is /summary (not the /summarize referenced in some gen-ai notebooks). SummaryClient._PATH is the single place to change if it is ever renamed.
  • The chat-based summary skill (Path A — userContext + AFM result_ids) is intentionally not covered here; it is better smoke-tested via a Playwright e2e where the UI assembles that context. See docs/summary-eval.md (separate PR) for the strategy.

🤖 Generated with Claude Code

Roman Rakus and others added 3 commits June 4, 2026 17:26
Implements Path B: evaluate the dashboard-summary feature through the
dedicated synchronous endpoint (POST /api/v1/ai/workspaces/{ws}/summary),
which executes AFM server-side — no SSE or client-side result_id wrangling.

- SummaryClient: posts summary_input, maps the JSON summary into ChatResult
- DashboardSummaryEvaluator: rubric-based LLM judge (must_include / must_not_include /
  rubric), scored per-criterion so quality_score is the fraction satisfied
- SummaryInput dataset field; dashboard_id is the only required input
- runner ChatBackend now receives the DatasetItem; CLI routes summary items
  to SummaryClient and everything else to ChatClient
- registered as a lazy [llm-judge] evaluator; docs + example dataset + tests

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The workspace ACTIVE_LLM_PROVIDER setting is keyed by type on the backend
and may exist under any id (e.g. UI-generated). Reading it by the hardcoded
id "activeLlmProvider" missed existing settings (-> "no active LLM provider"),
and activating re-created a second setting of the same type (-> HTTP 409).

Now look it up by setting_type via list_workspace_settings, and reuse the
existing setting's id on activate so create_or_update performs an UPDATE.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

- evaluator: score must_not_include via plain presence-detection then invert,
  instead of asking the judge to reason about avoidance under an "EXPECTED
  OUTPUT" label (which inverted verdicts on negative criteria)
- reporting: make the FAIL note evaluator-agnostic (list whichever boolean
  checks are False) instead of the visualization-only "no visualization created"
- examples: replace the single template with three self-describing cases —
  full-dashboard, selected-visualizations (scoping), and format-hint-brief
  (format adherence) — each with a small gating set and rubric aligned to the
  endpoint's actual output

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Roman Rakus and others added 2 commits June 4, 2026 19:03
…-eval

# Conflicts:
#	packages/gooddata-eval/README.md
#	packages/gooddata-eval/src/gooddata_eval/cli/main.py
#	packages/gooddata-eval/src/gooddata_eval/core/workspace.py
#	packages/gooddata-eval/tests/test_workspace.py

Running without --model takes the default branch, which set provider_name
but never provider_type, causing UnboundLocalError when building ResolvedModel.
Initialize it to "" alongside provider_name.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
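The failure mode described in the last commit message can be reproduced in isolation. The sketch below uses hypothetical function names and placeholder provider values; it only shows why a variable assigned in one branch raises UnboundLocalError when the other branch runs, and how initialising it up front avoids that.

```python
from __future__ import annotations

from typing import Optional


def resolve_buggy(model: Optional[str]) -> tuple[str, str]:
    if model is not None:
        provider_name, provider_type = "example-provider", "EXAMPLE"
    else:
        # Default branch: provider_type is never assigned here.
        provider_name = "workspace-default"
    return provider_name, provider_type  # fails when model is None


def resolve_fixed(model: Optional[str]) -> tuple[str, str]:
    provider_name = ""
    provider_type = ""  # initialised alongside provider_name
    if model is not None:
        provider_name, provider_type = "example-provider", "EXAMPLE"
    else:
        provider_name = "workspace-default"
    return provider_name, provider_type


try:
    resolve_buggy(None)
    buggy_raised = False
except UnboundLocalError:
    buggy_raised = True

fixed_result = resolve_fixed(None)
```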
codecov Bot commented Jun 4, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 79.10%. Comparing base (27021da) to head (a43d45b).

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #1646   +/-   ##
=======================================
  Coverage   79.10%   79.10%           
=======================================
  Files         231      231           
  Lines       15718    15718           
=======================================
  Hits        12433    12433           
  Misses       3285     3285           

