feat(eval): evaluate the dashboard-summary skill via the /summary endpoint #1646
Open
romrak wants to merge 5 commits into
Implements Path B: evaluate the dashboard-summary feature through the
dedicated synchronous endpoint (POST /api/v1/ai/workspaces/{ws}/summary),
which executes AFM server-side — no SSE or client-side result_id wrangling.
- SummaryClient: posts summary_input, maps the JSON summary into ChatResult
- DashboardSummaryEvaluator: rubric-based LLM judge (must_include / must_not_include /
rubric), scored per-criterion so quality_score is the fraction satisfied
- SummaryInput dataset field; dashboard_id is the only required input
- runner ChatBackend now receives the DatasetItem; CLI routes summary items
to SummaryClient and everything else to ChatClient
- registered as a lazy [llm-judge] evaluator; docs + example dataset + tests
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
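The per-criterion scoring described in this commit can be sketched roughly as below. This is only an illustration under stated assumptions, not the PR's code: `CriterionVerdict`, `GATING_KINDS`, and `score_rubric` are hypothetical names, the handling of an empty rubric is a guess, and the pass rule (every must_include / must_not_include criterion satisfied, rubric items graded only) is taken from the PR description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CriterionVerdict:
    """One judged criterion (hypothetical shape, for illustration only)."""

    kind: str  # "must_include", "must_not_include", or "rubric"
    text: str
    satisfied: bool


# Criteria that gate the pass/fail decision; "rubric" items only affect the score.
GATING_KINDS = {"must_include", "must_not_include"}


def score_rubric(verdicts: list[CriterionVerdict]) -> tuple[float, bool]:
    """Return (quality_score, passed) from independent per-criterion verdicts."""
    if not verdicts:
        # Assumption: an empty rubric scores 0.0 and does not pass.
        return 0.0, False
    # quality_score is the fraction of all criteria that were satisfied.
    quality_score = sum(v.satisfied for v in verdicts) / len(verdicts)
    # Passing requires every gating criterion; rubric items cannot fail the item.
    passed = all(v.satisfied for v in verdicts if v.kind in GATING_KINDS)
    return quality_score, passed


verdicts = [
    CriterionVerdict("must_include", "mentions total revenue", True),
    CriterionVerdict("must_not_include", "invents a forecast", True),
    CriterionVerdict("rubric", "is concise", False),
    CriterionVerdict("rubric", "names the top region", True),
]
print(score_rubric(verdicts))  # (0.75, True)
```

With this shape an unmet rubric item lowers quality_score without failing the item, while a single unmet gating criterion fails it regardless of the score.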
The workspace ACTIVE_LLM_PROVIDER setting is keyed by type on the backend and may exist under any id (e.g. UI-generated). Reading it by the hardcoded id "activeLlmProvider" missed existing settings (-> "no active LLM provider"), and activating re-created a second setting of the same type (-> HTTP 409). Now look it up by setting_type via list_workspace_settings, and reuse the existing setting's id on activate so create_or_update performs an UPDATE.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
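The lookup-by-type behaviour can be illustrated with an in-memory stand-in. Everything here is an assumption made for the sketch: `Setting`, `find_by_type`, and `activate_provider` are hypothetical, the `content` shape is invented, and the list emulates a store keyed by id rather than calling the real SDK.

```python
from dataclasses import dataclass
from typing import Optional

ACTIVE_LLM_PROVIDER = "ACTIVE_LLM_PROVIDER"
DEFAULT_SETTING_ID = "activeLlmProvider"  # used only when no setting exists yet


@dataclass
class Setting:
    """Stand-in for a workspace setting (hypothetical shape)."""

    id: str
    setting_type: str
    content: dict


def find_by_type(settings: list[Setting], setting_type: str) -> Optional[Setting]:
    """Return the first setting of the given type, whatever its id is."""
    return next((s for s in settings if s.setting_type == setting_type), None)


def activate_provider(settings: list[Setting], provider_id: str) -> Setting:
    """Emulate create-or-update keyed by id, reusing an existing setting's id."""
    existing = find_by_type(settings, ACTIVE_LLM_PROVIDER)
    setting_id = existing.id if existing else DEFAULT_SETTING_ID
    updated = Setting(setting_id, ACTIVE_LLM_PROVIDER, {"id": provider_id})
    for index, current in enumerate(settings):
        if current.id == setting_id:
            settings[index] = updated  # same id, so this is an update
            return updated
    settings.append(updated)  # no setting of this type yet, so create one
    return updated


store = [Setting("ui-generated-123", ACTIVE_LLM_PROVIDER, {"id": "old-provider"})]
result = activate_provider(store, "new-provider")
print(result.id, len(store))  # ui-generated-123 1
```

Because the existing id is reused, the store never ends up with two settings of the same type, which is the situation the commit associates with the HTTP 409.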
- evaluator: score must_not_include via plain presence-detection then invert, instead of asking the judge to reason about avoidance under an "EXPECTED OUTPUT" label (which inverted verdicts on negative criteria)
- reporting: make the FAIL note evaluator-agnostic (list whichever boolean checks are False) instead of the visualization-only "no visualization created"
- examples: replace the single template with three self-describing cases: full-dashboard, selected-visualizations (scoping), and format-hint-brief (format adherence). Each has a small gating set and a rubric aligned to the endpoint's actual output

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
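A minimal sketch of the "detect presence, then invert" idea for must_not_include follows. It assumes the judge is only ever asked the positive question "is this present?"; `keyword_judge` is a toy substring stand-in for the LLM judge, and `score_must_not_include` is a hypothetical helper, not the evaluator's actual API.

```python
from typing import Callable

# A presence judge answers one question: does the summary contain this content?
PresenceJudge = Callable[[str, str], bool]


def keyword_judge(summary: str, criterion: str) -> bool:
    """Toy stand-in for the LLM judge: case-insensitive substring match."""
    return criterion.lower() in summary.lower()


def score_must_not_include(
    summary: str,
    forbidden: list[str],
    is_present: PresenceJudge = keyword_judge,
) -> dict[str, bool]:
    """Map each forbidden criterion to True when it is satisfied (i.e. absent)."""
    # The judge reports presence; the inversion happens here, in plain code,
    # so the judge never has to reason about "avoidance".
    return {criterion: not is_present(summary, criterion) for criterion in forbidden}


summary = "Revenue grew 12% in Q3, driven by the EMEA region."
print(score_must_not_include(summary, ["churn", "EMEA"]))
# {'churn': True, 'EMEA': False}
```

Keeping the negation outside the judge prompt means positive and negative criteria share one detection question, and only the interpretation of the answer differs.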
…-eval

# Conflicts:
#	packages/gooddata-eval/README.md
#	packages/gooddata-eval/src/gooddata_eval/cli/main.py
#	packages/gooddata-eval/src/gooddata_eval/core/workspace.py
#	packages/gooddata-eval/tests/test_workspace.py
Running without --model takes the default branch, which set provider_name but never provider_type, causing UnboundLocalError when building ResolvedModel. Initialize it to "" alongside provider_name.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
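The failure mode is easy to reproduce in isolation. The sketch below is a reduced illustration, not the CLI code: this `ResolvedModel` is a stand-in with assumed fields, and the provider values are placeholders.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ResolvedModel:
    """Stand-in for the real class; the fields are assumed for illustration."""

    provider_name: str
    provider_type: str
    model: Optional[str]


def resolve_buggy(model: Optional[str]) -> ResolvedModel:
    if model is not None:
        provider_name = "example-provider"
        provider_type = "EXAMPLE"
    else:
        provider_name = ""  # provider_type is never bound on this branch
    return ResolvedModel(provider_name, provider_type, model)


def resolve_fixed(model: Optional[str]) -> ResolvedModel:
    # Initialize both names up front so every branch leaves them bound.
    provider_name = ""
    provider_type = ""
    if model is not None:
        provider_name = "example-provider"
        provider_type = "EXAMPLE"
    return ResolvedModel(provider_name, provider_type, model)


try:
    resolve_buggy(None)
except UnboundLocalError as exc:
    print("buggy:", type(exc).__name__)

print("fixed:", resolve_fixed(None))
```

Python treats a name as local to the function as soon as it is assigned anywhere in the body, so reading it on a path that skipped the assignment raises UnboundLocalError rather than falling back to a default.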
Codecov Report
✅ All modified and coverable lines are covered by tests.

@@           Coverage Diff           @@
##           master    #1646   +/-   ##
=======================================
  Coverage   79.10%   79.10%
=======================================
  Files         231      231
  Lines       15718    15718
=======================================
  Hits        12433    12433
  Misses       3285     3285
What
Adds a dashboard_summary test kind to gooddata-eval so the GoodData dashboard-summary feature can be evaluated end-to-end, plus two supporting fixes uncovered while running it locally.

1. dashboard_summary test kind (Path B, dedicated /summary endpoint)
- SummaryClient: posts summary_input to POST /api/v1/ai/workspaces/{ws}/summary (AFM executed server-side; no SSE, no client-side result_id handling) and maps the JSON summary into a ChatResult.
- DashboardSummaryEvaluator: rubric-based LLM judge. expected_output is a rubric of must_include / must_not_include / rubric criteria, each scored independently, so quality_score is the fraction satisfied. Pass requires all gating (must_*) criteria; rubric items are graded-only. must_not_include is scored by plain presence-detection + invert (asking the judge to reason about "avoidance" under an EXPECTED OUTPUT label flipped verdicts).
- SummaryInput dataset field (only dashboard_id is required).
- Runner ChatBackend now receives the whole DatasetItem; the CLI routes summary items to SummaryClient and everything else to ChatClient.
- Registered as a lazy [llm-judge] evaluator (skipped without the extra, like general_question).

2. Fix: resolve the active LLM provider by type, not a fixed id
The workspace ACTIVE_LLM_PROVIDER setting is keyed by type on the backend and may exist under any id. Reading it by the hardcoded id activeLlmProvider missed existing settings ("no active LLM provider"), and activating created a second setting of the same type (HTTP 409). Now it looks the setting up by setting_type and reuses the existing id on activate (UPDATE, not CREATE).

3. Fix: evaluator-agnostic FAIL note in the console report
The FAIL "Notes" column hardcoded visualization check names and showed a misleading "no visualization created" for summary items. It now lists whichever boolean checks are False.
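One way to build such an evaluator-agnostic note is sketched below. The check names and the exact wording of the note are assumptions for illustration, and `fail_note` is a hypothetical helper rather than the reporting code itself.

```python
def fail_note(checks: dict[str, object]) -> str:
    """List every boolean check that is exactly False, in insertion order."""
    # "is False" skips non-boolean values such as a 0.0 score or a None field.
    failed = [name for name, value in checks.items() if value is False]
    if not failed:
        return ""
    return "failed: " + ", ".join(failed)


summary_checks = {
    "must_include_ok": True,
    "must_not_include_ok": False,
    "quality_score": 0.0,
}
print(fail_note(summary_checks))  # failed: must_not_include_ok
```

Since the note is derived from whatever boolean checks an evaluator reports, a new test kind gets a sensible FAIL note without any change to the report.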
Docs & examples
- dashboard_summary section (rubric shape, summary_input) + supported-test-kinds row.
- examples/summary_dataset/: full-dashboard, selected-visualizations (scoping), and format-hint-brief (format adherence). Each rubric is aligned to the endpoint's actual output and uses a small gating set.

Testing
- test_summary_client.py, test_summary_evaluator.py, plus workspace lookup-by-type tests.
- ruff format/check clean; full suite 115 passed.
- /summary and request/response shapes confirmed; all three example cases pass with the OpenAI judge.

Notes
- The endpoint is /summary (not the /summarize referenced in some gen-ai notebooks). SummaryClient._PATH is the single place to change if it is ever renamed.
- The other path (userContext + AFM result_ids) is intentionally not covered here; it is better smoke-tested via a Playwright e2e where the UI assembles that context. See docs/summary-eval.md (separate PR) for the strategy.

🤖 Generated with Claude Code