feat(eval): evaluate the dashboard-summary skill via the /summary endpoint by romrak · Pull Request #1646 · gooddata/gooddata-python-sdk

romrak · 2026-06-04T16:57:33Z

What

Adds a dashboard_summary test kind to gooddata-eval so the GoodData dashboard-summary feature can be evaluated end-to-end, plus two supporting fixes uncovered while running it locally.

1. `dashboard_summary` test kind (Path B — dedicated `/summary` endpoint)

SummaryClient: posts summary_input to POST /api/v1/ai/workspaces/{ws}/summary (AFM is executed server-side; no SSE, no client-side result_id handling) and maps the JSON summary into a ChatResult.
DashboardSummaryEvaluator: rubric-based LLM judge. expected_output is a rubric of must_include / must_not_include / rubric criteria, each scored independently, so quality_score is the fraction of criteria satisfied. An item passes only if every gating (must_*) criterion is satisfied; rubric items affect the score but not pass/fail. must_not_include is scored by plain presence detection followed by inversion, because asking the judge to reason about "avoidance" under an EXPECTED OUTPUT label flipped verdicts.
SummaryInput dataset field (only dashboard_id is required).
Runner ChatBackend now receives the whole DatasetItem; the CLI routes summary items to SummaryClient and everything else to ChatClient.
Registered as a lazy [llm-judge] evaluator (skipped without the extra, like general_question).
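
The scoring rules described for DashboardSummaryEvaluator can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `score_summary`, `RubricResult`, and `substring_judge` are hypothetical names, the real judge is an LLM call rather than a substring test, and returning 0.0 for an empty rubric is an assumption.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical judge signature: answers only "is this criterion present in the summary?"
Judge = Callable[[str, str], bool]


@dataclass
class RubricResult:
    quality_score: float            # fraction of all criteria satisfied
    passed: bool                    # True only if every gating (must_*) criterion holds
    checks: list[tuple[str, bool]]  # (label, satisfied) per criterion


def score_summary(summary: str, expected: dict[str, list[str]], is_present: Judge) -> RubricResult:
    checks: list[tuple[str, bool]] = []
    gating: list[bool] = []

    for criterion in expected.get("must_include", []):
        ok = is_present(summary, criterion)
        checks.append((f"must_include: {criterion}", ok))
        gating.append(ok)

    for criterion in expected.get("must_not_include", []):
        # Plain presence detection, then invert; the judge never reasons about "avoidance".
        ok = not is_present(summary, criterion)
        checks.append((f"must_not_include: {criterion}", ok))
        gating.append(ok)

    for criterion in expected.get("rubric", []):
        # Graded only: contributes to the score, never to pass/fail.
        checks.append((f"rubric: {criterion}", is_present(summary, criterion)))

    satisfied = sum(1 for _, ok in checks if ok)
    quality = satisfied / len(checks) if checks else 0.0  # assumption for the empty case
    return RubricResult(quality_score=quality, passed=all(gating), checks=checks)


def substring_judge(summary: str, criterion: str) -> bool:
    # Stand-in for the LLM judge so the sketch runs offline.
    return criterion.lower() in summary.lower()


result = score_summary(
    "Revenue grew 12% in Q2, driven by the EMEA region.",
    {
        "must_include": ["revenue", "EMEA"],
        "must_not_include": ["churn"],
        "rubric": ["mentions a time period"],
    },
    substring_judge,
)
print(result.passed, result.quality_score)
```

With the stand-in judge, the three gating criteria hold and the rubric item does not, so the item passes with a score below 1.0.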

2. Fix: resolve the active LLM provider by type, not a fixed id

The workspace ACTIVE_LLM_PROVIDER setting is keyed by type on the backend and may exist under any id. Reading it by the hardcoded id activeLlmProvider missed existing settings ("no active LLM provider"), and activating created a second setting of the same type (HTTP 409). Now it looks the setting up by setting_type and reuses the existing id on activate (UPDATE, not CREATE).
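
The lookup-by-type behaviour can be illustrated with an in-memory stand-in for the workspace settings. Everything here is hypothetical (the `Setting` dataclass, `find_by_type`, `activate`, and the shape of `content`); the sketch only shows why reusing the existing id turns a CREATE into an UPDATE.

```python
from dataclasses import dataclass
from typing import Optional

ACTIVE_LLM_PROVIDER = "ACTIVE_LLM_PROVIDER"


@dataclass
class Setting:
    id: str
    setting_type: str
    content: dict  # assumed shape, for illustration only


def find_by_type(settings: list[Setting], setting_type: str) -> Optional[Setting]:
    # Look the setting up by its type; its id may be anything (e.g. UI-generated).
    return next((s for s in settings if s.setting_type == setting_type), None)


def activate(settings: list[Setting], provider_id: str, default_id: str = "activeLlmProvider") -> str:
    existing = find_by_type(settings, ACTIVE_LLM_PROVIDER)
    # Reuse the existing id so the write below replaces instead of adding a second setting.
    setting_id = existing.id if existing else default_id
    updated = Setting(setting_id, ACTIVE_LLM_PROVIDER, {"id": provider_id})
    for i, s in enumerate(settings):
        if s.id == setting_id:
            settings[i] = updated  # UPDATE path
            return setting_id
    settings.append(updated)       # CREATE path, only when no setting of this type exists
    return setting_id


settings = [Setting("ui-generated-123", ACTIVE_LLM_PROVIDER, {"id": "old-provider"})]
used_id = activate(settings, "new-provider")
print(used_id, len(settings))
```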

3. Fix: evaluator-agnostic FAIL note in the console report

The FAIL "Notes" column hardcoded visualization check names and showed a misleading "no visualization created" for summary items. It now lists whichever boolean checks are False.
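
A minimal sketch of an evaluator-agnostic note, assuming each result exposes its checks as a name-to-value mapping (the function name `fail_note` and the check names are hypothetical):

```python
def fail_note(checks: dict[str, object]) -> str:
    # List only checks whose value is the boolean False; scores and other
    # non-boolean fields are ignored, and no check name is hardcoded.
    failed = [name for name, value in checks.items() if value is False]
    return ", ".join(failed)


note = fail_note({"must_include_ok": False, "must_not_include_ok": True, "quality_score": 0.5})
print(note)
```

Because nothing is tied to visualization check names, the same function works for summary items and visualization items alike.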

Docs & examples

README: dashboard_summary section (rubric shape, summary_input) + supported-test-kinds row.
Three self-describing example cases under examples/summary_dataset/: full-dashboard, selected-visualizations (scoping), and format-hint-brief (format adherence). Each rubric is aligned to the endpoint's actual output and uses a small gating set.

Testing

New tests: test_summary_client.py, test_summary_evaluator.py, plus workspace lookup-by-type tests.
ruff format/check clean; full suite 115 passed.
Manually verified against a local gen-ai instance: endpoint path /summary and request/response shapes confirmed; all three example cases pass with the OpenAI judge.

Notes

Verified the endpoint path is /summary (not the /summarize referenced in some gen-ai notebooks). SummaryClient._PATH is the single place to change if it is ever renamed.
The chat-based summary skill (Path A — userContext + AFM result_ids) is intentionally not covered here; it is better smoke-tested via a Playwright e2e where the UI assembles that context. See docs/summary-eval.md (separate PR) for the strategy.
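
The point about `SummaryClient._PATH` being the single place to change can be pictured with a small sketch. This is not the class from this PR: `SummaryClientSketch`, its constructor arguments, the bearer-token header, and sending `summary_input` as the JSON body unchanged are all assumptions. The request is only built, never sent.

```python
import json
from urllib.request import Request


class SummaryClientSketch:
    # Single place to change if the endpoint is ever renamed.
    _PATH = "/api/v1/ai/workspaces/{ws}/summary"

    def __init__(self, host: str, workspace_id: str, token: str) -> None:
        self.host = host.rstrip("/")
        self.workspace_id = workspace_id
        self.token = token

    def build_request(self, summary_input: dict) -> Request:
        url = self.host + self._PATH.format(ws=self.workspace_id)
        body = json.dumps(summary_input).encode("utf-8")
        return Request(
            url,
            data=body,
            method="POST",
            headers={
                "Content-Type": "application/json",
                "Authorization": f"Bearer {self.token}",  # assumed auth scheme
            },
        )


client = SummaryClientSketch("https://example.invalid/", "demo_ws", "secret")
request = client.build_request({"dashboard_id": "sales_overview"})  # dashboard_id is the only required field
print(request.get_method(), request.full_url)
```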

🤖 Generated with Claude Code

Implements Path B: evaluate the dashboard-summary feature through the dedicated synchronous endpoint (POST /api/v1/ai/workspaces/{ws}/summary), which executes AFM server-side, with no SSE or client-side result_id wrangling.

- SummaryClient: posts summary_input, maps the JSON summary into ChatResult
- DashboardSummaryEvaluator: rubric-based LLM judge (must_include / must_not_include / rubric), scored per-criterion so quality_score is the fraction satisfied
- SummaryInput dataset field; dashboard_id is the only required input
- runner ChatBackend now receives the DatasetItem; CLI routes summary items to SummaryClient and everything else to ChatClient
- registered as a lazy [llm-judge] evaluator; docs + example dataset + tests

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The workspace ACTIVE_LLM_PROVIDER setting is keyed by type on the backend and may exist under any id (e.g. UI-generated). Reading it by the hardcoded id "activeLlmProvider" missed existing settings (-> "no active LLM provider"), and activating re-created a second setting of the same type (-> HTTP 409). Now look it up by setting_type via list_workspace_settings, and reuse the existing setting's id on activate so create_or_update performs an UPDATE.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

- evaluator: score must_not_include via plain presence-detection then invert, instead of asking the judge to reason about avoidance under an "EXPECTED OUTPUT" label (which inverted verdicts on negative criteria)
- reporting: make the FAIL note evaluator-agnostic (list whichever boolean checks are False) instead of the visualization-only "no visualization created"
- examples: replace the single template with three self-describing cases: full-dashboard, selected-visualizations (scoping), and format-hint-brief (format adherence), each with a small gating set and rubric aligned to the endpoint's actual output

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…-eval

# Conflicts:
#	packages/gooddata-eval/README.md
#	packages/gooddata-eval/src/gooddata_eval/cli/main.py
#	packages/gooddata-eval/src/gooddata_eval/core/workspace.py
#	packages/gooddata-eval/tests/test_workspace.py

Running without --model takes the default branch, which set provider_name but never provider_type, causing an UnboundLocalError when building ResolvedModel. Initialize provider_type to "" alongside provider_name.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
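
The failure mode described in this commit can be reproduced in isolation. The functions below are hypothetical and much simpler than the CLI code; the "openai"/"OPENAI" values and the tuple return (standing in for ResolvedModel) are placeholders.

```python
from typing import Optional


def resolve_buggy(model: Optional[str] = None) -> tuple[str, str]:
    if model is None:
        provider_name = ""  # default branch never assigns provider_type
    else:
        provider_name, provider_type = "openai", "OPENAI"  # placeholder values
    return provider_name, provider_type  # UnboundLocalError when model is None


def resolve_fixed(model: Optional[str] = None) -> tuple[str, str]:
    # Initialize both names up front so every branch leaves them bound.
    provider_name = ""
    provider_type = ""
    if model is not None:
        provider_name, provider_type = "openai", "OPENAI"  # placeholder values
    return provider_name, provider_type


try:
    resolve_buggy()
    raised = False
except UnboundLocalError:
    raised = True

print(raised, resolve_fixed())
```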

codecov · 2026-06-04T17:11:15Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 79.10%. Comparing base (27021da) to head (a43d45b).

Additional details and impacted files

@@           Coverage Diff           @@
##           master    #1646   +/-   ##
=======================================
  Coverage   79.10%   79.10%           
=======================================
  Files         231      231           
  Lines       15718    15718           
=======================================
  Hits        12433    12433           
  Misses       3285     3285


Roman Rakus and others added 3 commits June 4, 2026 17:26

romrak requested review from hkad98, jaceksan, lupko and pcerny as code owners June 4, 2026 16:57

Roman Rakus and others added 2 commits June 4, 2026 19:03

romrak wants to merge 5 commits into master from rr/summary-endpoint-eval
