docs(eval): summary-skill evaluation strategy deck by romrak · Pull Request #1644 · gooddata/gooddata-python-sdk

romrak · 2026-06-04T14:09:10Z

What

Adds packages/gooddata-eval/docs/summary-eval.md — a presentation deck outlining the proposed strategy for evaluating the GoodData dashboard-summary skill.

Why

Summaries are free-text and non-deterministic, so a single golden answer doesn't work. The deck proposes splitting the problem into two concerns, each tested with the right tool:

Is it good? (quality) → evaluate via the dedicated /summary endpoint in gd-eval, using a rubric-based LLM judge (must-include facts, must-not-include guards, soft quality criteria). A graded quality_score absorbs run-to-run variation in wording and reuses the existing gd-eval scoring.
Does it work? (integration) → a Playwright e2e smoke test of the chat summary skill. The real UI assembles userContext and runs AFM for resultIds, so we avoid reimplementing that plumbing in Python.
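
To make the quality half concrete, here is a minimal sketch of how a rubric could turn into a graded score. Everything in it is an assumption for illustration: `SummaryRubric`, `quality_score`, and `contains` are hypothetical names rather than gd-eval APIs, and a plain case-insensitive substring check stands in for the LLM judge. Soft quality criteria are left out because they need a real judge to assess.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class SummaryRubric:
    """Hypothetical rubric for one dataset item (field names are illustrative)."""

    must_include: list[str]
    must_not_include: list[str] = field(default_factory=list)


def contains(summary: str, statement: str) -> bool:
    """Stand-in for the LLM judge: case-insensitive substring match."""
    return statement.lower() in summary.lower()


def quality_score(
    summary: str,
    rubric: SummaryRubric,
    judge: Callable[[str, str], bool] = contains,
) -> float:
    """Return a score in [0, 1] instead of a pass/fail against a golden answer."""
    # Guards are hard failures: one forbidden statement zeroes the score.
    if any(judge(summary, bad) for bad in rubric.must_not_include):
        return 0.0
    # With no required facts there is nothing left to grade.
    if not rubric.must_include:
        return 1.0
    # Otherwise the score is the fraction of required facts that were found.
    found = sum(judge(summary, fact) for fact in rubric.must_include)
    return found / len(rubric.must_include)


rubric = SummaryRubric(
    must_include=["revenue grew", "EMEA"],
    must_not_include=["churn"],
)
print(quality_score("Revenue grew 12% in EMEA.", rubric))  # 1.0
print(quality_score("Revenue grew 12%.", rubric))  # 0.5
```

Because the score is graded, two differently worded summaries that both state the required facts receive the same result, which is the property the deck relies on to handle non-determinism. A real judge could be passed in through the `judge` parameter without changing the scoring logic.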

The deck also covers the dataset structure and what is reused out-of-the-box vs. what must be implemented.

Note

Open prerequisite flagged in the deck: confirm whether the chat skill and the /summary endpoint share the same summarizer — this determines whether endpoint quality-eval fully covers the chat skill.

Type

Docs only — no code changes.

🤖 Generated with Claude Code

Presentation outlining how to evaluate the dashboard-summary skill: quality via the /summary endpoint in gd-eval (rubric-based LLM judge), integration via a Playwright smoke test of the chat skill. Covers the dataset structure and what is reused vs. to be implemented.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

codecov · 2026-06-04T14:13:10Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 79.10%. Comparing base (c028c31) to head (d44a595).

Additional details and impacted files

@@           Coverage Diff           @@
##           master    #1644   +/-   ##
=======================================
  Coverage   79.10%   79.10%           
=======================================
  Files         231      231           
  Lines       15718    15718           
=======================================
  Hits        12433    12433           
  Misses       3285     3285


romrak requested review from hkad98, jaceksan, lupko and pcerny as code owners June 4, 2026 14:09

romrak wants to merge 1 commit into master from rr/summary-skill-eval-presentation