feat(eval): multi-model comparison, gd-eval models command, Langfuse …#1642

Merged: zdenekmusil-gd merged 1 commit into master from zmu/gdai-1831-multimodel-run on Jun 4, 2026

Conversation

@zdenekmusil-gd
Contributor

…enhancements

Multi-model comparison (--model repeated):

  • gd-eval run --model gpt-5.2 --model gpt-4o evaluates the full dataset against each model sequentially and prints a side-by-side comparison table with the winner. --model accepts provider/model syntax to disambiguate (e.g. --model Foundry4o/gpt-5.2 or --model HN_Anthropic/claude-opus-4-7).
  • --provider flag removed; it is superseded by provider/model syntax, and the ambiguous-model error now suggests that syntax directly.
  • The workspace's original active model is restored in a finally block after all models run.
  • JSON report always uses nested {models, runs, comparison} shape.
  • Langfuse: one named dataset run per model (gd-eval-{ts}-{model_id}).
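
For reviewers, the sequential run and restore behaviour described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: run_models, get_active_model, set_active_model, evaluate, and the "score" and "winner" keys are hypothetical names; only the top-level {models, runs, comparison} keys and the gd-eval-{ts}-{model_id} run name come from the description.

```python
from typing import Callable


def langfuse_run_name(ts: str, model_id: str) -> str:
    # Run name pattern from the description; the timestamp format is not specified there.
    return f"gd-eval-{ts}-{model_id}"


def run_models(
    model_specs: list[str],
    get_active_model: Callable[[], str],
    set_active_model: Callable[[str], None],
    evaluate: Callable[[str], dict],
) -> dict:
    """Evaluate each model in turn, then restore the workspace's original model."""
    original = get_active_model()
    runs: dict[str, dict] = {}
    try:
        for spec in model_specs:
            set_active_model(spec)  # switch the workspace to the model under test
            runs[spec] = evaluate(spec)  # hypothetical: returns {"score": float, ...}
    finally:
        # Executes even when an evaluation raises, so the workspace
        # is never left pointing at a model under test.
        set_active_model(original)

    # Hypothetical winner rule: highest "score" wins.
    winner = max(runs, key=lambda m: runs[m]["score"]) if runs else None
    return {"models": list(runs), "runs": runs, "comparison": {"winner": winner}}
```

Keeping the restore in finally rather than after the loop means a failure midway through the second or third model still leaves the workspace as it was found.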

New command: gd-eval models

  • Lists all LLM providers and their models with provider ID, family, and active marker. Marks the workspace's currently active model when --workspace is given.

Provider type labelling:

  • Always fetches model family live via list_llm_provider_models_by_id() — automatically correct for new families without code changes.
  • Combines gateway + family: OPENAI→family only, AWS_BEDROCK→BEDROCK/family, AZURE_FOUNDRY→AZURE/family (e.g. BEDROCK/ANTHROPIC, AZURE/OPENAI).
  • Model and provider_type are recorded in Langfuse trace metadata, tags, and the dataset-run-item runDescription and metadata.
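
The gateway + family rule above can be written as a small mapping. The sketch below is illustrative only: provider_type_label is a hypothetical name, and the fallback for gateways not listed in the description (gateway/family) is an assumption rather than documented behaviour.

```python
# Gateway prefixes from the description; None means "family only".
_GATEWAY_PREFIX = {
    "OPENAI": None,
    "AWS_BEDROCK": "BEDROCK",
    "AZURE_FOUNDRY": "AZURE",
}


def provider_type_label(gateway: str, family: str) -> str:
    """Combine a provider gateway and a model family into one display label."""
    if gateway in _GATEWAY_PREFIX:
        prefix = _GATEWAY_PREFIX[gateway]
    else:
        # Assumption: unknown gateways keep their own name as the prefix.
        prefix = gateway
    return family if prefix is None else f"{prefix}/{family}"
```

For example, provider_type_label("AWS_BEDROCK", "ANTHROPIC") gives BEDROCK/ANTHROPIC, matching the example in the bullet above.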

EvalReport gains provider_name and provider_type; comparison table shows provider/model to distinguish runs with the same model id across providers.
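
As a rough picture of how a --model value might resolve to a provider/model pair, and why the same model id under two providers needs the qualified form, consider the sketch below. resolve_model, the catalog shape, and the error wording are hypothetical; it also assumes model ids contain no "/", so the value is split at the first slash.

```python
def resolve_model(spec: str, catalog: dict[str, list[str]]) -> tuple[str, str]:
    """Resolve 'model' or 'provider/model' to a (provider, model) pair.

    catalog maps a provider name to the model ids it offers (hypothetical shape).
    """
    if "/" in spec:
        provider, model = spec.split("/", 1)
        if model not in catalog.get(provider, []):
            raise ValueError(f"unknown model {spec!r}")
        return provider, model

    # Bare model id: find every provider that offers it.
    matches = [p for p, models in catalog.items() if spec in models]
    if not matches:
        raise ValueError(f"unknown model {spec!r}")
    if len(matches) > 1:
        # Suggest the qualified syntax directly in the error.
        options = ", ".join(f"{p}/{spec}" for p in matches)
        raise ValueError(f"model {spec!r} is ambiguous; use one of: {options}")
    return matches[0], spec
```

The resolved pair can then be rendered as provider/model in the comparison table so that two runs of the same model id stay distinguishable.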

JIRA: GDAI-1831
Risk: low — new isolated features; no changes to existing packages.

@codecov

codecov Bot commented Jun 4, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 79.10%. Comparing base (c028c31) to head (c253bdf).

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #1642   +/-   ##
=======================================
  Coverage   79.10%   79.10%           
=======================================
  Files         231      231           
  Lines       15718    15718           
=======================================
  Hits        12433    12433           
  Misses       3285     3285           

@zdenekmusil-gd zdenekmusil-gd force-pushed the zmu/gdai-1831-multimodel-run branch 4 times, most recently from bf9a524 to 0a1000f Compare June 4, 2026 12:45
…enhancements

Multi-model comparison (--model repeated):
- gd-eval run --model gpt-5.2 --model gpt-4o evaluates the full dataset
  against each model sequentially and prints a side-by-side comparison table
  with winner. --model accepts provider/model syntax to disambiguate
  (e.g. --model Foundry4o/gpt-5.2 or --model HN_Anthropic/claude-opus-4-7).
- --provider flag removed — superseded by provider/model syntax; ambiguous-model
  error now suggests the syntax directly.
- Workspace original active model restored in finally block after all models run.
- JSON report always uses nested {models, runs, comparison} shape.
- Langfuse: one named dataset run per model (gd-eval-{ts}-{model_id}).

New command: gd-eval models
- Lists all LLM providers and their models with provider ID, family, and
  active marker. Marks the workspace's currently active model when --workspace given.

Provider type labelling:
- Always fetches model family live via list_llm_provider_models_by_id() —
  automatically correct for new families without code changes.
- Combines gateway + family: OPENAI→family only, AWS_BEDROCK→BEDROCK/family,
  AZURE_FOUNDRY→AZURE/family (e.g. BEDROCK/ANTHROPIC, AZURE/OPENAI).
- Model and provider_type in Langfuse trace metadata, tags, and
  dataset-run-item runDescription + metadata.

EvalReport gains provider_name and provider_type; comparison table shows
provider/model to distinguish runs with the same model id across providers.

111 tests, ruff + ty clean.

JIRA: GDAI-1831
Risk: low — new isolated features; no changes to existing packages.
@zdenekmusil-gd zdenekmusil-gd force-pushed the zmu/gdai-1831-multimodel-run branch from 0a1000f to c253bdf Compare June 4, 2026 12:47
@zdenekmusil-gd zdenekmusil-gd merged commit 27021da into master Jun 4, 2026
13 checks passed
@zdenekmusil-gd zdenekmusil-gd deleted the zmu/gdai-1831-multimodel-run branch June 4, 2026 15:22