feat(eval): multi-model comparison, gd-eval models command, Langfuse … by zdenekmusil-gd · Pull Request #1642 · gooddata/gooddata-python-sdk

zdenekmusil-gd · 2026-06-04T11:18:05Z

…enhancements

Multi-model comparison (--model repeated):

gd-eval run --model gpt-5.2 --model gpt-4o evaluates the full dataset against each model sequentially and prints a side-by-side comparison table with the winner. --model accepts provider/model syntax to disambiguate (e.g. --model Foundry4o/gpt-5.2 or --model HN_Anthropic/claude-opus-4-7).
--provider flag removed — superseded by provider/model syntax; ambiguous-model error now suggests the syntax directly.
The workspace's original active model is restored in a finally block after all models run.
JSON report always uses nested {models, runs, comparison} shape.
Langfuse: one named dataset run per model (gd-eval-{ts}-{model_id}).
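
The flow in this list can be pictured roughly as below. This is a minimal sketch, not the code from this PR: `parse_model_spec`, `run_models`, the `evaluate`, `get_active` and `set_active` callables, the per-run `score` field, and the "highest score wins" rule are all assumptions made for illustration. Only the provider/model syntax, the sequential runs, the restore in a finally block, and the {models, runs, comparison} shape come from the description.

```python
from typing import Callable, Optional


def parse_model_spec(spec: str) -> tuple[Optional[str], str]:
    """Split 'provider/model' into (provider, model); provider is None if omitted."""
    provider, sep, model = spec.partition("/")
    if not sep:
        return None, spec  # bare model id, no provider given
    if not provider or not model:
        raise ValueError(f"invalid model spec: {spec!r}")
    return provider, model


def run_models(
    specs: list[str],
    evaluate: Callable[[Optional[str], str], float],  # hypothetical: returns a score
    get_active: Callable[[], str],                    # hypothetical workspace accessor
    set_active: Callable[[str], None],                # hypothetical workspace mutator
) -> dict:
    """Evaluate each model in turn and always restore the original active model."""
    original = get_active()
    runs: dict[str, dict] = {}
    try:
        for spec in specs:  # sequential, one model at a time
            provider, model = parse_model_spec(spec)
            set_active(model)
            runs[spec] = {
                "provider": provider,
                "model": model,
                "score": evaluate(provider, model),
            }
    finally:
        # Runs even if an evaluation raises, so the workspace is not left
        # pointing at the last evaluated model.
        set_active(original)
    # Assumption: the winner is the run with the highest score.
    winner = max(runs, key=lambda s: runs[s]["score"]) if runs else None
    return {"models": list(runs), "runs": runs, "comparison": {"winner": winner}}
```

Keeping the nested shape even for a single model means consumers of the JSON report never have to branch on how many --model flags were passed.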

New command: gd-eval models

Lists all LLM providers and their models with provider ID, family, and active marker. Marks the workspace's currently active model when --workspace is given.

Provider type labelling:

Always fetches model family live via list_llm_provider_models_by_id() — automatically correct for new families without code changes.
Combines gateway + family: OPENAI→family only, AWS_BEDROCK→BEDROCK/family, AZURE_FOUNDRY→AZURE/family (e.g. BEDROCK/ANTHROPIC, AZURE/OPENAI).
Model and provider_type are recorded in Langfuse trace metadata, tags, and dataset-run-item runDescription + metadata.
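
The gateway + family rule can be expressed as a small pure function. The sketch below is illustrative only: `provider_type_label` and `GATEWAY_PREFIX` are hypothetical names, and the fallback for gateways not listed in the description (use the gateway name itself as the prefix) is an assumption. In the PR the family is fetched live via list_llm_provider_models_by_id(); here it is simply passed in as a string.

```python
from typing import Optional

# Prefix per gateway, as listed in the description; None means "family only".
GATEWAY_PREFIX: dict[str, Optional[str]] = {
    "OPENAI": None,
    "AWS_BEDROCK": "BEDROCK",
    "AZURE_FOUNDRY": "AZURE",
}


def provider_type_label(gateway: str, family: str) -> str:
    """Combine a gateway type and a model family into one display label."""
    # Assumption: an unlisted gateway keeps its own name as the prefix.
    prefix = GATEWAY_PREFIX.get(gateway, gateway)
    return family if prefix is None else f"{prefix}/{family}"


print(provider_type_label("AWS_BEDROCK", "ANTHROPIC"))  # BEDROCK/ANTHROPIC
print(provider_type_label("AZURE_FOUNDRY", "OPENAI"))   # AZURE/OPENAI
print(provider_type_label("OPENAI", "OPENAI"))          # OPENAI
```

Because the family is data rather than a hard-coded enum, a new family needs no change to a function like this; only a new gateway type would need a mapping entry.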

EvalReport gains provider_name and provider_type; comparison table shows provider/model to distinguish runs with the same model id across providers.
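
To show why the provider/model label matters, here is a hedged sketch of a report record and the rows a comparison table might print. `EvalReportSketch`, its `model_id` and `score` fields, and `comparison_rows` are hypothetical; the description only confirms that EvalReport gains provider_name and provider_type and that the table shows provider/model.

```python
from dataclasses import dataclass


@dataclass
class EvalReportSketch:
    """Illustrative stand-in for EvalReport; not the real class."""
    model_id: str
    provider_name: str
    provider_type: str
    score: float  # hypothetical aggregate metric

    @property
    def label(self) -> str:
        # provider/model keeps two runs of the same model id apart.
        return f"{self.provider_name}/{self.model_id}"


def comparison_rows(reports: list[EvalReportSketch]) -> list[tuple[str, str, float]]:
    """One (label, provider_type, score) row per run, in input order."""
    return [(r.label, r.provider_type, r.score) for r in reports]


rows = comparison_rows([
    EvalReportSketch("gpt-4o", "ProviderA", "OPENAI", 0.81),
    EvalReportSketch("gpt-4o", "ProviderB", "AZURE/OPENAI", 0.79),
])
for label, ptype, score in rows:
    print(f"{label:<20} {ptype:<14} {score:.2f}")
```

Without the provider prefix both rows would read gpt-4o and could not be told apart.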

JIRA: GDAI-1831
Risk: low — new isolated features; no changes to existing packages.

codecov · 2026-06-04T11:21:52Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 79.10%. Comparing base (c028c31) to head (c253bdf).

Additional details and impacted files

@@           Coverage Diff           @@
##           master    #1642   +/-   ##
=======================================
  Coverage   79.10%   79.10%           
=======================================
  Files         231      231           
  Lines       15718    15718           
=======================================
  Hits        12433    12433           
  Misses       3285     3285

…enhancements

Multi-model comparison (--model repeated):
- gd-eval run --model gpt-5.2 --model gpt-4o evaluates the full dataset against each model sequentially and prints a side-by-side comparison table with winner. --model accepts provider/model syntax to disambiguate (e.g. --model Foundry4o/gpt-5.2 or --model HN_Anthropic/claude-opus-4-7).
- --provider flag removed; superseded by provider/model syntax. The ambiguous-model error now suggests the syntax directly.
- Workspace original active model restored in finally block after all models run.
- JSON report always uses nested {models, runs, comparison} shape.
- Langfuse: one named dataset run per model (gd-eval-{ts}-{model_id}).

New command: gd-eval models
- Lists all LLM providers and their models with provider ID, family, and active marker. Marks the workspace's currently active model when --workspace given.

Provider type labelling:
- Always fetches model family live via list_llm_provider_models_by_id(), so it is automatically correct for new families without code changes.
- Combines gateway + family: OPENAI→family only, AWS_BEDROCK→BEDROCK/family, AZURE_FOUNDRY→AZURE/family (e.g. BEDROCK/ANTHROPIC, AZURE/OPENAI).
- Model and provider_type in Langfuse trace metadata, tags, and dataset-run-item runDescription + metadata.

EvalReport gains provider_name and provider_type; comparison table shows provider/model to distinguish runs with the same model id across providers.

111 tests, ruff + ty clean.

JIRA: GDAI-1831
Risk: low (new isolated features; no changes to existing packages).

zdenekmusil-gd requested review from hkad98, jaceksan, lupko and pcerny as code owners June 4, 2026 11:18

zdenekmusil-gd force-pushed the zmu/gdai-1831-multimodel-run branch 4 times, most recently from bf9a524 to 0a1000f Compare June 4, 2026 12:45

zdenekmusil-gd force-pushed the zmu/gdai-1831-multimodel-run branch from 0a1000f to c253bdf Compare June 4, 2026 12:47

hkad98 approved these changes Jun 4, 2026

zdenekmusil-gd merged commit 27021da into master Jun 4, 2026
13 checks passed

zdenekmusil-gd deleted the zmu/gdai-1831-multimodel-run branch June 4, 2026 15:22

zdenekmusil-gd merged 1 commit into master from zmu/gdai-1831-multimodel-run
