docs(eval): summary-skill evaluation strategy deck #1644

Open: romrak wants to merge 1 commit into master from rr/summary-skill-eval-presentation

@romrak romrak commented Jun 4, 2026

What

Adds packages/gooddata-eval/docs/summary-eval.md — a presentation deck outlining the proposed strategy for evaluating the GoodData dashboard-summary skill.

Why

Summaries are free-text and non-deterministic, so a single golden answer doesn't work. The deck proposes splitting the problem into two concerns, each tested with the right tool:

  • Is it good? (quality) → evaluate via the dedicated /summary endpoint in gd-eval, using a rubric-based LLM judge (must-include facts, must-not-include guards, soft quality criteria). Graded quality_score tames non-determinism and reuses existing gd-eval scoring.
  • Does it work? (integration) → a Playwright e2e smoke test of the chat summary skill. The real UI assembles userContext and runs AFM for resultIds, so we avoid reimplementing that plumbing in Python.
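
The rubric idea in the quality bullet can be illustrated with a minimal sketch. Everything here is hypothetical and not taken from gd-eval: the `Rubric` fields, the `Judge` signature, the equal weighting of criteria, and the rule that a violated guard zeroes the score are all assumptions. The `keyword_judge` stand-in replaces the LLM judge so the example runs offline.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable

# Hypothetical judge signature: returns True when the summary satisfies
# (or, for guards, contains) the given statement. In practice this would
# be an LLM call; here it is an injectable callable.
Judge = Callable[[str, str], bool]


@dataclass
class Rubric:
    """Assumed shape of one dataset item's rubric (not the gd-eval schema)."""
    must_include: list[str]
    must_not_include: list[str]
    soft_criteria: list[str] = field(default_factory=list)


def quality_score(summary: str, rubric: Rubric, judge: Judge) -> float:
    """Return a graded score in [0, 1] instead of a pass/fail golden match."""
    # Assumption: any violated guard is disqualifying.
    if any(judge(summary, guard) for guard in rubric.must_not_include):
        return 0.0
    graded = rubric.must_include + rubric.soft_criteria
    if not graded:
        return 1.0
    # Assumption: required facts and soft criteria are weighted equally.
    passed = sum(1 for item in graded if judge(summary, item))
    return passed / len(graded)


def keyword_judge(summary: str, statement: str) -> bool:
    """Offline stand-in for the LLM judge: case-insensitive substring match."""
    return statement.lower() in summary.lower()


if __name__ == "__main__":
    rubric = Rubric(
        must_include=["revenue", "Q3"],
        must_not_include=["forecast"],
        soft_criteria=["concise"],
    )
    print(quality_score("Revenue grew in Q3.", rubric, keyword_judge))
```

Because the score is graded rather than binary, two differently worded summaries that cover the same required facts receive the same result, which is one way to absorb non-determinism.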

The deck also covers the dataset structure and what is reused out-of-the-box vs. what must be implemented.

Note

Open prerequisite flagged in the deck: confirm whether the chat skill and the /summary endpoint share the same summarizer — this determines whether endpoint quality-eval fully covers the chat skill.

Type

Docs only — no code changes.

🤖 Generated with Claude Code

Presentation outlining how to evaluate the dashboard-summary skill:
quality via the /summary endpoint in gd-eval (rubric-based LLM judge),
integration via a Playwright smoke test of the chat skill. Covers the
dataset structure and what is reused vs. to be implemented.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

codecov Bot commented Jun 4, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 79.10%. Comparing base (c028c31) to head (d44a595).

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #1644   +/-   ##
=======================================
  Coverage   79.10%   79.10%           
=======================================
  Files         231      231           
  Lines       15718    15718           
=======================================
  Hits        12433    12433           
  Misses       3285     3285           
