diff --git a/packages/gooddata-eval/README.md b/packages/gooddata-eval/README.md index 0c9552b3a..44733933f 100644 --- a/packages/gooddata-eval/README.md +++ b/packages/gooddata-eval/README.md @@ -1,7 +1,7 @@ # gooddata-eval CLI to evaluate the GoodData AI agent against a dataset of natural-language -questions on a chosen workspace and LLM model. +questions on a chosen workspace and LLM model — including multi-model comparison. ## Install @@ -11,68 +11,145 @@ Or install `gd-eval` as a standalone tool: uv tool install gooddata-eval -## Quick start +## Commands + +| Command | Description | +|---|---| +| `gd-eval run` | Run an evaluation dataset against one or more models. | +| `gd-eval models` | List LLM providers and models configured in the org. | + +--- + +## `gd-eval run` + +### Quick start — single model ```bash export GOODDATA_TOKEN='your-api-token' gd-eval run \ --host https://your.gooddata.cloud \ - --workspace demo \ + --workspace ecommerce_demo \ --dataset ./my-dataset \ --model gpt-5.2 \ - --runs 2 \ + --runs 1 \ --json results.json ``` -## All flags +### Multi-model comparison + +Pass `--model` multiple times to evaluate the same dataset against several +models and get a side-by-side comparison: + +```bash +gd-eval run \ + --host https://your.gooddata.cloud \ + --workspace ecommerce_demo \ + --dataset ./my-dataset \ + --model gpt-5.2 \ + --model claude-opus-4-7 \ + --runs 1 \ + --json comparison.json +``` + +When the same model id is offered by multiple providers, use the +`provider/model` syntax to disambiguate: + +```bash + --model "Foundry4o_4.1_5.2/gpt-5.2" \ + --model "HN_Anthropic/claude-opus-4-7" +``` -### Connection +Both provider name and provider id are accepted as the prefix. + +### All flags + +#### Connection | Flag | Env var | Description | |---|---|---| -| `--host HOST` | — | GoodData host URL (e.g. `https://your.gooddata.cloud`). | +| `--host HOST` | — | GoodData host URL. | | `--token TOKEN` | `GOODDATA_TOKEN` | API token. 
Pass via flag or env var. | -| `--profile NAME` | — | Profile name in `~/.gooddata/profiles.yaml` (same file as the `gdc` CLI). Provides host + token when both flags are omitted. | +| `--profile NAME` | — | Profile name in `~/.gooddata/profiles.yaml` (same file as the `gdc` CLI). | | `--workspace ID` | — | **Required.** Workspace id to evaluate against. | -### Dataset source (pick one) +#### Dataset source (pick one) | Flag | Description | |---|---| -| `--dataset PATH` | Path to a flat folder of JSON files — one question per file. | -| `--langfuse-dataset NAME` | Pull dataset items by name from Langfuse. Requires `LANGFUSE_PUBLIC_KEY`, `LANGFUSE_SECRET_KEY`, `LANGFUSE_HOST` env vars. | +| `--dataset PATH` | Flat folder of JSON files — one question per file. | +| `--langfuse-dataset NAME` | Pull items by name from a Langfuse dataset. Requires `LANGFUSE_PUBLIC_KEY`, `LANGFUSE_SECRET_KEY`, `LANGFUSE_HOST`. | -### Model selection +#### Model selection | Flag | Description | |---|---| -| `--model ID` | LLM model id to evaluate (e.g. `gpt-5.2`). Defaults to the workspace's currently active model. If the model is offered by a different provider than the active one, the workspace's active provider is switched automatically. | -| `--provider NAME_OR_ID` | LLM provider name or id. Use when `--model` is offered by multiple providers and you need to pick one. Accepts either the human-readable provider name or its UUID id. | +| `--model MODEL` | Model id to evaluate. Repeat to compare multiple models. Accepts `provider/model` syntax to disambiguate when a model is offered by multiple providers (e.g. `--model "Foundry4o/gpt-5.2"`). Defaults to the workspace's current active model. | -### Evaluation +#### Evaluation | Flag | Default | Description | |---|---|---| -| `--runs K` | `2` | Number of independent conversation runs per item (pass@K). An item passes if any run passes. | +| `--runs K` | `2` | Independent runs per item (pass@K). An item passes if any run passes. 
| -### Output +#### Output | Flag | Description | |---|---| -| `--json PATH` | Write a machine-readable JSON report (keyed by item id, with per-item scores) to this path. Console summary is always printed. | -| `--quiet` | Suppress per-item progress output. Only the final table and summary are printed. | +| `--json PATH` | Write a JSON report to this path. Always uses the nested `{models, runs, comparison}` shape even for a single model. | +| `--quiet` | Suppress per-item progress. Per-model result tables and the comparison summary are still printed. | -### Langfuse sink +#### Langfuse sink | Flag | Description | |---|---| -| `--langfuse` | Log evaluation results to Langfuse after each item. Requires `--langfuse-dataset` (so item ids can be linked to Langfuse dataset items). Creates a named experiment run (`gd-eval-{timestamp}-{model}`) in the Langfuse dataset. Requires `LANGFUSE_PUBLIC_KEY`, `LANGFUSE_SECRET_KEY`, `LANGFUSE_HOST`. | +| `--langfuse` | Log scores and traces to Langfuse after each item. Requires `--langfuse-dataset`. Creates one named experiment run per model (`gd-eval-{timestamp}-{model}`). Requires `LANGFUSE_PUBLIC_KEY`, `LANGFUSE_SECRET_KEY`, `LANGFUSE_HOST`. | + +### JSON report shape + +The JSON report always uses the nested multi-model shape: + +```json +{ + "models": ["gpt-5.2", "claude-opus-4-7"], + "runs": { + "gpt-5.2": { "summary": { "passed": 22, ... }, "items": { ... } }, + "claude-opus-4-7": { "summary": { "passed": 18, ... }, "items": { ... } } + }, + "comparison": { + "gpt-5.2": { "passed": 22, "total": 31, "pass_rate": 0.71, "avg_quality_score": 0.81, ... }, + "claude-opus-4-7": { "passed": 18, "total": 31, "pass_rate": 0.58, "avg_quality_score": 0.72, ... } + } +} +``` + +Winner is selected by **pass rate → quality score → latency** (lower latency wins all-equal ties). + +--- + +## `gd-eval models` + +List all LLM providers and their models in the org. 
Marks the active model +for a workspace when `--workspace` is given: + +```bash +gd-eval models \ + --host https://your.gooddata.cloud \ + --workspace ecommerce_demo +``` + +``` +┃ Provider ┃ Provider ID ┃ Model ID ┃ Family ┃ Active ┃ +│ Foundry4o │ foundry_… │ gpt-5.2 │ OPENAI │ ◀ active │ +│ │ │ gpt-4o │ OPENAI │ │ +│ HN_Anthropic │ hn_anthr_… │ claude-opus-4-7 │ ANTHROPIC │ │ +``` + +--- ## Dataset format -A dataset is a folder of `.json` files, one per question. Each file must -contain a common envelope: +A dataset is a folder of `.json` files, one per question: ```json { @@ -87,8 +164,6 @@ contain a common envelope: Supported `test_kind` values: `visualization`, `metric_skill`, `alert_skill`, `search_tool`, `general_question`, `guardrail`. -See the full dataset specification for `expected_output` shapes per test kind. - ## Supported test kinds | test_kind | What the agent must produce | Extra required | @@ -104,31 +179,22 @@ See the full dataset specification for `expected_output` shapes per test kind. ### `[llm-judge]` — LLM-as-judge evaluators -`general_question` and `guardrail` items are scored by an LLM judge (GPT-4o) -that compares the agent's text response against your expected-output description. -This requires the OpenAI Python package and an API key: +`general_question` and `guardrail` items are scored by a GPT-4o judge. +Requires the OpenAI package and `OPENAI_API_KEY`: ```bash -uv add 'gooddata-eval[llm-judge]' # project dependency -# or, for the standalone gd-eval tool: +uv add 'gooddata-eval[llm-judge]' +# or for the standalone tool: uv tool install 'gooddata-eval[llm-judge]' ``` -Set your OpenAI key before running: - -```bash -export OPENAI_API_KEY='sk-...' -``` - -Without `[llm-judge]`, items with `test_kind: general_question` or `guardrail` -are reported as **skipped**. - +Without `[llm-judge]`, those items are **skipped**. ## Exit codes | Code | Meaning | |---|---| -| `0` | Run completed. 
Evaluation failures do **not** cause a non-zero exit — check the report. | +| `0` | Run completed. Evaluation failures do **not** cause a non-zero exit. | | `2` | Operational error: bad connection, missing model, unreadable dataset, missing credentials. | ## Scores (in JSON report and Langfuse) @@ -137,5 +203,6 @@ are reported as **skipped**. |---|---| | `pass_at_k` | 1 if any of the K runs passed strict checks, else 0. | | `quality_score` | Fraction of strict check flags that are `True` (0.0–1.0). Shown in CLI as a percentage. | -| `value_score` | Weighted blend: 0.6 × quality + 0.2 × speed (where speed = max(0, 1 − latency/60s)). | +| `value_score` | Weighted blend: 0.6 × quality + 0.2 × speed (speed = max(0, 1 − latency/60s)). | | `latency_s` | Average per-run latency in seconds. | +| `provider_type` | Model vendor + gateway label (e.g. `ANTHROPIC`, `BEDROCK/ANTHROPIC`, `AZURE/OPENAI`). Stored in Langfuse trace metadata and tags. | diff --git a/packages/gooddata-eval/src/gooddata_eval/cli/main.py b/packages/gooddata-eval/src/gooddata_eval/cli/main.py index 05b27664f..ff041f18b 100644 --- a/packages/gooddata-eval/src/gooddata_eval/cli/main.py +++ b/packages/gooddata-eval/src/gooddata_eval/cli/main.py @@ -9,6 +9,7 @@ import httpx from gooddata_api_client.exceptions import ApiException from rich.console import Console +from rich.table import Table from gooddata_eval.core.chat.sse_client import ChatClient from gooddata_eval.core.config import RunConfig @@ -16,8 +17,8 @@ from gooddata_eval.core.dataset.local import load_local_dataset from gooddata_eval.core.langfuse.sink import LangfuseSink from gooddata_eval.core.models import DatasetItem -from gooddata_eval.core.reporting.console import render_console -from gooddata_eval.core.reporting.json_report import write_json_report +from gooddata_eval.core.reporting.console import render_comparison, render_console +from gooddata_eval.core.reporting.json_report import write_multi_model_report from gooddata_eval.core.runner 
import ItemReport, run_items from gooddata_eval.core.workspace import ModelResolutionError, WorkspaceModelController @@ -37,11 +38,18 @@ def _build_parser() -> argparse.ArgumentParser: source = run.add_mutually_exclusive_group(required=True) source.add_argument("--dataset", help="Path to a folder of dataset JSON files.") source.add_argument("--langfuse-dataset", dest="langfuse_dataset", help="Langfuse dataset name.") - run.add_argument("--model", help="Model id (default: workspace's current active model).") run.add_argument( - "--provider", - help="LLM provider name or id (default: the workspace's active provider; " - "auto-selected when --model is offered by exactly one provider).", + "--model", + action="append", + dest="models", + metavar="MODEL", + help=( + "Model id to evaluate (e.g. --model gpt-5.2). " + "Prefix with provider name or id to disambiguate: " + "--model ProviderName/gpt-5.2 or --model provider_id/gpt-5.2. " + "Repeat to compare multiple models. " + "Default: workspace's current active model." + ), ) run.add_argument("--runs", type=int, default=2, help="Independent runs per item (pass@K). Default 2.") run.add_argument("--json", dest="json_path", help="Write a JSON report to this path.") @@ -51,6 +59,14 @@ def _build_parser() -> argparse.ArgumentParser: action="store_true", help="Log scores and traces to Langfuse (requires --langfuse-dataset and LANGFUSE_* env vars).", ) + models_cmd = sub.add_parser("models", help="List LLM providers and models configured in the org.") + models_cmd.add_argument("--host", help="GoodData host URL.") + models_cmd.add_argument("--token", help="API token (or set GOODDATA_TOKEN).") + models_cmd.add_argument("--profile", help="Profile name in ~/.gooddata/profiles.yaml.") + models_cmd.add_argument( + "--workspace", + help="Workspace id. 
When provided, marks the currently active model.", + ) return parser @@ -62,6 +78,23 @@ def _truncate(text: str, limit: int = 80) -> str: return text if len(text) <= limit else text[: limit - 1] + "…" +def _parse_model_arg(val: str) -> tuple[str | None, str]: + """Parse a model argument that may include a provider prefix. + + Accepted forms: + "gpt-5.2" → (None, "gpt-5.2") + "ProviderName/gpt-5.2" → ("ProviderName", "gpt-5.2") + "provider_id.../model_id" → ("provider_id...", "model_id") + + The provider part (if present) is passed to resolve_and_activate and + accepted as either a provider name or provider id. + """ + if "/" in val: + provider_ref, _, model_id = val.partition("/") + return provider_ref.strip() or None, model_id.strip() + return None, val + + def _make_progress_callbacks(console: Console): """Build (on_item_start, on_run_done, on_item_done) callbacks that stream progress.""" @@ -104,8 +137,58 @@ def _load_dataset(config: RunConfig): return load_langfuse_dataset(config.langfuse_dataset) +def _list_models(host: str, token: str, workspace_id: str | None) -> int: + """List all LLM providers and their models; mark the active one if --workspace given.""" + from gooddata_eval.core.workspace import WorkspaceModelController # noqa: PLC0415 + + controller = WorkspaceModelController(host, token, workspace_id or "") + info = controller._provider_info() # {provider_id: {name, models: [{id, family}]}} + + active_provider_id: str | None = None + active_model_id: str | None = None + if workspace_id: + active = controller.get_active() + if active: + active_provider_id = active.provider_id + active_model_id = active.default_model_id + + console = Console() + + if not info: + console.print("[yellow]No LLM providers configured in this organisation.[/yellow]") + return _EXIT_OK + + table = Table(title=f"LLM Providers and Models{f' (workspace: {workspace_id})' if workspace_id else ''}") + table.add_column("Provider") + table.add_column("Provider ID") + 
table.add_column("Model ID") + table.add_column("Family") + table.add_column("Active") + + for provider_id, pinfo in sorted(info.items(), key=lambda kv: kv[1].get("name") or kv[0]): + name = pinfo.get("name") or provider_id + models = pinfo.get("models") or [] + if not models: + is_active_provider = provider_id == active_provider_id + table.add_row(name, provider_id, "[dim](none listed)[/dim]", "", "◀" if is_active_provider else "") + for i, model in enumerate(models): + model_id = model.get("id", "?") if isinstance(model, dict) else str(model) + family = model.get("family", "") if isinstance(model, dict) else "" + is_active = provider_id == active_provider_id and model_id == active_model_id + active_marker = "[green]◀ active[/green]" if is_active else "" + table.add_row( + name if i == 0 else "", + provider_id if i == 0 else "", + model_id, + family, + active_marker, + ) + + console.print(table) + return _EXIT_OK + + def _run(config: RunConfig) -> int: - # Enforce: --langfuse only valid with --langfuse-dataset if config.log_to_langfuse and config.langfuse_dataset is None: print( "error: --langfuse requires --langfuse-dataset (local datasets have no Langfuse item ids to link to).", @@ -114,63 +197,113 @@ def _run(config: RunConfig) -> int: return _EXIT_OPERATIONAL_ERROR items = _load_dataset(config) + models = config.models or [] + run_ts = datetime.now(timezone.utc).strftime("%Y-%m-%d-%H-%M") + n_models = len(models) if models else 1 controller = WorkspaceModelController(config.host, config.token, config.workspace_id) - resolved = controller.resolve_and_activate(config.model, config.provider_id) - - run_name = f"gd-eval-{datetime.now(timezone.utc).strftime('%Y-%m-%d-%H-%M')}-{resolved.model_id}" - - on_item_start = None - on_run_done = None - on_item_done = None - if not config.quiet: - progress_console = Console(stderr=True) - switched = " [switched active provider]" if resolved.switched else "" - provider_display = resolved.provider_name or resolved.provider_id 
- progress_console.print( - f"Evaluating {len(items)} item(s) on workspace '{config.workspace_id}' " - f"(provider={provider_display}, model={resolved.model_id}){switched}..." - ) - if config.log_to_langfuse: - progress_console.print(f"Logging to Langfuse dataset run '{run_name}'...") - on_item_start, on_run_done, on_item_done = _make_progress_callbacks(progress_console) - - on_langfuse_item_done = None - if config.log_to_langfuse: - assert config.langfuse_dataset is not None # guarded above - sink = LangfuseSink(dataset_name=config.langfuse_dataset, run_name=run_name) + original_active = controller.get_active() - def on_langfuse_item_done(index: int, total: int, report: ItemReport) -> None: - sink.log_item(report, dataset_item_id=report.id) + progress_console = Console(stderr=True) if not config.quiet else None + if progress_console: + multi_suffix = f" — {n_models} model(s)" if n_models > 1 else "" + progress_console.print(f"Evaluating {len(items)} item(s) on workspace '{config.workspace_id}'{multi_suffix}") - backend = ChatClient(host=config.host, token=config.token, workspace_id=config.workspace_id) + reports: list = [] try: - report = run_items( - items, - backend, - runs=config.runs, - model=resolved.model_id, - workspace_id=config.workspace_id, - on_item_start=on_item_start, - on_run_done=on_run_done, - on_item_done=on_item_done, - on_langfuse_item_done=on_langfuse_item_done, - ) + for k, model_id in enumerate(models or [None], start=1): + provider_ref, bare_model_id = _parse_model_arg(model_id or "") + effective_provider = provider_ref + try: + resolved = controller.resolve_and_activate(bare_model_id or None, effective_provider) + except (ModelResolutionError, httpx.HTTPError, ApiException, RuntimeError) as exc: + print(f"warning: skipping model '{model_id}': {exc}", file=sys.stderr) + continue + + if progress_console: + if n_models > 1: + progress_console.print(f"\n── Model {k}/{n_models}: {resolved.model_id} " + "─" * 40) + else: + switched = " [switched 
active provider]" if resolved.switched else "" + provider_display = resolved.provider_name or resolved.provider_id + progress_console.print(f"Provider={provider_display}, model={resolved.model_id}{switched}") + + run_name = f"gd-eval-{run_ts}-{resolved.model_id}" + if progress_console and config.log_to_langfuse: + progress_console.print(f"Logging to Langfuse run '{run_name}'...") + + on_item_start, on_run_done, on_item_done = (None, None, None) + if progress_console: + on_item_start, on_run_done, on_item_done = _make_progress_callbacks(progress_console) + + on_langfuse_item_done = None + if config.log_to_langfuse: + assert config.langfuse_dataset is not None + sink = LangfuseSink( + dataset_name=config.langfuse_dataset, + run_name=run_name, + model_id=resolved.model_id, + provider_type=resolved.provider_type, + ) + + def on_langfuse_item_done( + index: int, + total: int, + report: ItemReport, + _sink: LangfuseSink = sink, + _model_id: str = resolved.model_id, + _provider_type: str = resolved.provider_type, + ) -> None: + _sink.log_item(report, dataset_item_id=report.id) + + backend = ChatClient(host=config.host, token=config.token, workspace_id=config.workspace_id) + try: + report = run_items( + items, + backend, + runs=config.runs, + model=resolved.model_id, + provider_name=resolved.provider_name or resolved.provider_id, + provider_type=resolved.provider_type, + workspace_id=config.workspace_id, + on_item_start=on_item_start, + on_run_done=on_run_done, + on_item_done=on_item_done, + on_langfuse_item_done=on_langfuse_item_done, + ) + finally: + if hasattr(backend, "close"): + backend.close() + + skipped_kinds = sorted({i.test_kind for i in report.items if i.skipped}) + if skipped_kinds: + print( + f"warning: skipped {sum(i.skipped for i in report.items)} item(s) with " + f"unsupported test_kind(s): {', '.join(skipped_kinds)}", + file=sys.stderr, + ) + + render_console(report) + reports.append(report) finally: - if hasattr(backend, "close"): - backend.close() + try: 
+ controller.restore(original_active) + except Exception as _restore_exc: + print( + f"warning: failed to restore workspace active model: {_restore_exc}", + file=sys.stderr, + ) - skipped_kinds = sorted({i.test_kind for i in report.items if i.skipped}) - if skipped_kinds: - print( - f"warning: skipped {sum(i.skipped for i in report.items)} item(s) with " - f"unsupported test_kind(s): {', '.join(skipped_kinds)}", - file=sys.stderr, - ) + if not reports: + print("error: no models evaluated successfully.", file=sys.stderr) + return _EXIT_OPERATIONAL_ERROR + + if len(reports) > 1: + render_comparison(reports) - render_console(report) if config.json_path is not None: - write_json_report(report, config.json_path) + write_multi_model_report(reports, config.json_path) + return _EXIT_OK @@ -178,14 +311,15 @@ def main(argv: list[str] | None = None) -> int: args = parse_args(argv if argv is not None else sys.argv[1:]) try: host, token = resolve_connection(host=args.host, token=args.token, profile=args.profile) + if args.command == "models": + return _list_models(host, token, getattr(args, "workspace", None)) config = RunConfig( host=host, token=token, workspace_id=args.workspace, dataset_folder=Path(args.dataset) if args.dataset else None, langfuse_dataset=args.langfuse_dataset, - model=args.model, - provider_id=args.provider, + models=args.models or [], runs=args.runs, json_path=Path(args.json_path) if args.json_path else None, log_to_langfuse=args.langfuse, diff --git a/packages/gooddata-eval/src/gooddata_eval/core/config.py b/packages/gooddata-eval/src/gooddata_eval/core/config.py index d6152b8ed..1effe8613 100644 --- a/packages/gooddata-eval/src/gooddata_eval/core/config.py +++ b/packages/gooddata-eval/src/gooddata_eval/core/config.py @@ -1,7 +1,7 @@ # (C) 2026 GoodData Corporation """Validated run configuration produced by the CLI and consumed by the runner.""" -from dataclasses import dataclass +from dataclasses import dataclass, field from pathlib import Path @@ -12,8 
+12,7 @@ class RunConfig: workspace_id: str dataset_folder: Path | None = None langfuse_dataset: str | None = None - model: str | None = None - provider_id: str | None = None + models: list[str] = field(default_factory=list) runs: int = 2 json_path: Path | None = None log_to_langfuse: bool = False diff --git a/packages/gooddata-eval/src/gooddata_eval/core/langfuse/sink.py b/packages/gooddata-eval/src/gooddata_eval/core/langfuse/sink.py index f51a57c46..0e9650cf3 100644 --- a/packages/gooddata-eval/src/gooddata_eval/core/langfuse/sink.py +++ b/packages/gooddata-eval/src/gooddata_eval/core/langfuse/sink.py @@ -48,9 +48,11 @@ def compute_scores( class LangfuseSink: """Posts evaluation results to Langfuse via the ingestion REST API.""" - def __init__(self, dataset_name: str, run_name: str): + def __init__(self, dataset_name: str, run_name: str, model_id: str = "", provider_type: str = ""): self._dataset_name = dataset_name self._run_name = run_name + self._model_id = model_id + self._provider_type = provider_type host = os.environ.get("LANGFUSE_HOST", "https://cloud.langfuse.com").rstrip("/") pub = os.environ.get("LANGFUSE_PUBLIC_KEY", "") sec = os.environ.get("LANGFUSE_SECRET_KEY", "") @@ -93,8 +95,10 @@ def _event(event_type: str, body: dict[str, Any]) -> dict[str, Any]: "dataset_name": report.dataset_name, "test_kind": report.test_kind, "item_id": report.id, + "model": self._model_id, + "provider_type": self._provider_type, }, - "tags": [report.test_kind], + "tags": [t for t in [report.test_kind, self._provider_type] if t], }, ), ] @@ -152,6 +156,15 @@ def _event(event_type: str, body: dict[str, Any]) -> dict[str, Any]: "/api/public/dataset-run-items", json={ "runName": self._run_name, + "runDescription": ( + f"{self._provider_type}/{self._model_id}" + if self._provider_type and self._model_id + else self._model_id or "" + ), + "metadata": { + "model": self._model_id, + "provider_type": self._provider_type, + }, "datasetItemId": dataset_item_id, "traceId": trace_id, 
}, diff --git a/packages/gooddata-eval/src/gooddata_eval/core/reporting/console.py b/packages/gooddata-eval/src/gooddata_eval/core/reporting/console.py index bb8aecfaf..316490a84 100644 --- a/packages/gooddata-eval/src/gooddata_eval/core/reporting/console.py +++ b/packages/gooddata-eval/src/gooddata_eval/core/reporting/console.py @@ -53,3 +53,62 @@ def render_console(report: EvalReport, *, console: Console | None = None) -> str f"in {report.latency_s:.2f}s (avg {report.avg_latency_s:.2f}s/run)" ) return out.export_text() if out.record else "" + + +def render_comparison(reports: list[EvalReport], *, console: Console | None = None) -> str: + """Render a side-by-side comparison for multiple model runs. + + Winner is selected by (pass_rate, avg_quality_score, -avg_latency_s) — + higher pass rate first, then quality, then lower latency as final tiebreaker. + Returns an empty string when fewer than two reports are provided. + """ + if len(reports) < 2: + return "" + + out = console or Console(record=True, width=120) + + table = Table(title="Model Comparison") + table.add_column("Model") + table.add_column("Passed") + table.add_column("Quality") + table.add_column("Avg/run") + table.add_column("Total time") + + for r in reports: + evaluated = r.total - r.skipped + pass_pct = f"{r.passed / evaluated:.0%}" if evaluated else "—" + model_label = f"{r.provider_name}/{r.model}" if r.provider_name else r.model or "?" 
+ table.add_row( model_label, f"{r.passed}/{r.total} ({pass_pct})", f"{r.avg_quality_score:.0%}", f"{r.avg_latency_s:.2f}s", f"{r.latency_s:.0f}s", ) + + out.print(table) + + evaluated_reports = [r for r in reports if r.total > 0] + if evaluated_reports: + winner = max( + evaluated_reports, + key=lambda r: ( + r.passed / r.total if r.total else 0, + r.avg_quality_score, + -r.avg_latency_s, # lower latency wins when pass rate and quality tie + ), + ) + runner_up = max( + (r for r in evaluated_reports if r is not winner), + key=lambda r: r.passed / r.total if r.total else 0, + default=None, + ) + delta = "" + if runner_up and runner_up.total > 0: + delta_items = winner.passed - runner_up.passed + delta_quality = winner.avg_quality_score - runner_up.avg_quality_score + delta = f" ({delta_items:+d} item(s) passed, {delta_quality:+.0%} quality)" + winner_label = f"{winner.provider_name}/{winner.model}" if winner.provider_name else winner.model or "?" + out.print(f"\n[bold]Winner: {winner_label}[/bold]{delta}") + + return out.export_text() if out.record else "" diff --git a/packages/gooddata-eval/src/gooddata_eval/core/reporting/json_report.py b/packages/gooddata-eval/src/gooddata_eval/core/reporting/json_report.py index 983b88906..1ad4a1cd7 100644 --- a/packages/gooddata-eval/src/gooddata_eval/core/reporting/json_report.py +++ b/packages/gooddata-eval/src/gooddata_eval/core/reporting/json_report.py @@ -1,5 +1,5 @@ # (C) 2026 GoodData Corporation -"""Build and write the machine-readable JSON report, keyed by item id.""" +"""Build and write machine-readable reports (single-model or multi-model).""" from pathlib import Path @@ -8,7 +8,7 @@ from gooddata_eval.core.runner import EvalReport -def build_json_report(report: EvalReport) -> dict: +def _build_run_dict(report: EvalReport) -> dict: return { "model": report.model, "workspace_id": report.workspace_id, @@ -39,6 +39,42 @@ def build_json_report(report: EvalReport) -> dict: } +def _build_comparison_entry(report: 
EvalReport) -> dict: + total = report.total + passed = report.passed + return { + "provider_name": report.provider_name, + "passed": passed, + "total": total, + "pass_rate": round(passed / total, 4) if total else 0.0, + "avg_quality_score": round(report.avg_quality_score, 4), + "avg_latency_s": round(report.avg_latency_s, 3), + "total_latency_s": round(report.latency_s, 3), + } + + +def _run_key(report: EvalReport) -> str: + """Collision-free key matching the console comparison table label.""" + return f"{report.provider_name}/{report.model}" if report.provider_name else report.model or "?" + + +def build_multi_model_report(reports: list[EvalReport]) -> dict: + """Build the nested multi-model JSON report (used for single-model runs too).""" + return { + "models": [_run_key(r) for r in reports], + "runs": {_run_key(r): _build_run_dict(r) for r in reports}, + "comparison": {_run_key(r): _build_comparison_entry(r) for r in reports}, + } + + +def write_multi_model_report(reports: list[EvalReport], path: Path) -> None: + Path(path).write_bytes(orjson.dumps(build_multi_model_report(reports), option=orjson.OPT_INDENT_2)) + + +# Backward-compatible aliases so existing callers keep working. 
+def build_json_report(report: EvalReport) -> dict: + return _build_run_dict(report) + + def write_json_report(report: EvalReport, path: Path) -> None: - path = Path(path) - path.write_bytes(orjson.dumps(build_json_report(report), option=orjson.OPT_INDENT_2)) + write_multi_model_report([report], path) diff --git a/packages/gooddata-eval/src/gooddata_eval/core/runner.py b/packages/gooddata-eval/src/gooddata_eval/core/runner.py index daae73634..7fb0d36d6 100644 --- a/packages/gooddata-eval/src/gooddata_eval/core/runner.py +++ b/packages/gooddata-eval/src/gooddata_eval/core/runner.py @@ -48,7 +48,9 @@ def quality_score(self) -> float: @dataclass class EvalReport: model: str | None - workspace_id: str + provider_name: str = "" + provider_type: str = "" + workspace_id: str = "" items: list[ItemReport] = field(default_factory=list) @property @@ -144,6 +146,8 @@ def run_items( *, runs: int = 2, model: str | None = None, + provider_name: str = "", + provider_type: str = "", workspace_id: str = "", on_item_start: Callable[[int, int, DatasetItem], None] | None = None, on_run_done: Callable[[int, int, int, int, bool, float], None] | None = None, @@ -159,7 +163,9 @@ def run_items( - on_item_done(index, total, report) after an item is fully evaluated - on_langfuse_item_done(index, total, report) after non-skipped, non-errored items only """ - report = EvalReport(model=model, workspace_id=workspace_id) + report = EvalReport( + model=model, provider_name=provider_name, provider_type=provider_type, workspace_id=workspace_id + ) total = len(items) for index, item in enumerate(items, start=1): if on_item_start is not None: diff --git a/packages/gooddata-eval/src/gooddata_eval/core/workspace.py b/packages/gooddata-eval/src/gooddata_eval/core/workspace.py index cdb322a86..e849c0aea 100644 --- a/packages/gooddata-eval/src/gooddata_eval/core/workspace.py +++ b/packages/gooddata-eval/src/gooddata_eval/core/workspace.py @@ -3,7 +3,8 @@ from dataclasses import dataclass, field -from 
gooddata_api_client.exceptions import NotFoundException +import httpx +from gooddata_api_client.exceptions import ApiException, NotFoundException from gooddata_sdk import CatalogWorkspaceSetting, GoodDataSdk _SETTING_ID = "activeLlmProvider" @@ -22,6 +23,7 @@ class ResolvedModel: model_id: str switched: bool # True when activation changed the workspace's active provider provider_name: str = field(default="") # human-readable name, empty if not available + provider_type: str = field(default="") # vendor/gateway label, e.g. ANTHROPIC, BEDROCK/ANTHROPIC, AZURE/OPENAI class ModelResolutionError(Exception): @@ -96,7 +98,8 @@ def select_provider_and_model( if len(candidates) > 1: raise ModelResolutionError( f"Model '{requested_model}' is offered by multiple providers: {', '.join(candidates)}. " - "Pass --provider to choose one." + "Use the provider/model syntax to pick one " + f"(e.g. --model '{candidates[0]}/{requested_model}')." ) available = sorted({m for models in providers_models.values() for m in (models or [])}) raise ModelResolutionError( @@ -157,15 +160,20 @@ def _provider_info(self) -> dict[str, dict]: for p in providers: attrs = p.attributes models = (attrs.models if attrs else None) or [] + cfg = attrs.provider_config if attrs else None result[p.id] = { "name": (attrs.name if attrs else None) or p.id, - "models": [m.id for m in models], + "models": [{"id": m.id, "family": m.family} for m in models], + "type": (cfg.type if cfg else "") or "", } return result def all_provider_models(self) -> dict[str, list[str]]: """Map each configured LLM provider id to the model ids it offers.""" - return {pid: info["models"] for pid, info in self._provider_info().items()} + return { + pid: [m["id"] if isinstance(m, dict) else m for m in info["models"]] + for pid, info in self._provider_info().items() + } def activate(self, provider_id: str, model_id: str) -> None: setting = CatalogWorkspaceSetting( @@ -175,6 +183,12 @@ def activate(self, provider_id: str, model_id: str) -> None: ) 
         self._sdk.catalog_workspace.create_or_update_workspace_setting(self._workspace_id, setting)

+    def restore(self, original: ActiveLlmProvider | None) -> None:
+        """Restore the workspace active model to a previously saved state."""
+        if original is None:
+            return
+        self.activate(original.provider_id, original.default_model_id)
+
     def resolve_and_activate(self, requested_model: str | None, requested_provider: str | None = None) -> ResolvedModel:
         """Resolve provider+model (by name or id), activate them, and report what was chosen."""
         active = self.get_active()
@@ -183,7 +197,9 @@ def resolve_and_activate(self, requested_model: str | None, requested_provider:
             provider_name = ""
         else:
             info = self._provider_info()
-            providers_models = {pid: d["models"] for pid, d in info.items()}
+            providers_models = {
+                pid: [m["id"] if isinstance(m, dict) else m for m in d["models"]] for pid, d in info.items()
+            }
             resolved_provider = None
             if requested_provider is not None:
                 resolved_provider = _resolve_provider_ref(requested_provider, info)
@@ -191,6 +207,35 @@ def resolve_and_activate(self, requested_model: str | None, requested_provider:
                 requested_model, resolved_provider, active, providers_models
             )
             provider_name = info.get(provider_id, {}).get("name", "")
+            _pinfo = info.get(provider_id, {})
+            _models = _pinfo.get("models", [])
+            _family = next(
+                (m["family"] for m in _models if isinstance(m, dict) and m.get("id") == model_id),
+                "",
+            )
+            _gateway = _pinfo.get("type", "")
+            # Always fetch live model info from the provider — correct for any new family
+            # without requiring code changes. One call per run, non-fatal on failure.
+            if model_id:
+                try:
+                    live = self._sdk.catalog_organization.list_llm_provider_models_by_id(provider_id)
+                    if live.success and live.models:
+                        _live_family = next(
+                            (m.family for m in live.models if m.id == model_id),
+                            "",
+                        )
+                        _family = _live_family or _family
+                except (httpx.HTTPError, ApiException, OSError, AttributeError):  # non-fatal
+                    pass
+            # Build a label that preserves both infrastructure and model vendor:
+            # OPENAI gateway → family only (direct API, gateway is just protocol)
+            # AWS_BEDROCK → BEDROCK/family (e.g. BEDROCK/ANTHROPIC)
+            # AZURE_FOUNDRY → AZURE/family (e.g. AZURE/OPENAI)
+            if _gateway in ("AWS_BEDROCK", "AZURE_FOUNDRY") and _family:
+                _prefix = "BEDROCK" if _gateway == "AWS_BEDROCK" else "AZURE"
+                provider_type = f"{_prefix}/{_family}"
+            else:
+                provider_type = _family or _gateway
         switched = active is None or provider_id != active.provider_id
         self.activate(provider_id, model_id)
         return ResolvedModel(
@@ -198,4 +243,5 @@ def resolve_and_activate(self, requested_model: str | None, requested_provider:
             model_id=model_id,
             switched=switched,
             provider_name=provider_name,
+            provider_type=provider_type,
         )
diff --git a/packages/gooddata-eval/tests/test_cli.py b/packages/gooddata-eval/tests/test_cli.py
index 55fa320fb..c99aa888d 100644
--- a/packages/gooddata-eval/tests/test_cli.py
+++ b/packages/gooddata-eval/tests/test_cli.py
@@ -3,12 +3,13 @@
 import orjson
 import pytest
 from gooddata_eval.cli import main as cli_main
+from gooddata_eval.cli.main import _parse_model_arg
 from gooddata_eval.core.connection import (
     ConnectionError_,  # noqa: F401 - used in test_cli_operational_error_exits_nonzero
 )
 from gooddata_eval.core.models import DatasetItem
 from gooddata_eval.core.runner import EvalReport, ItemReport
-from gooddata_eval.core.workspace import ResolvedModel
+from gooddata_eval.core.workspace import ActiveLlmProvider, ResolvedModel
 from rich.console import Console


@@ -28,11 +29,15 @@ def test_cli_run_end_to_end(monkeypatch, tmp_path, fixtures_dir):
     class _FakeController:
         def __init__(self, *a, **k): ...

+        def get_active(self):
+            return ActiveLlmProvider(provider_id="prov", default_model_id="gpt-5.2")
+
         def resolve_and_activate(self, requested, provider=None):
             return ResolvedModel(
                 provider_id="prov", model_id=requested or "gpt-5.2", switched=False, provider_name="Test Provider"
             )

+        def restore(self, original): ...
         def close(self): ...

     monkeypatch.setattr(cli_main, "WorkspaceModelController", _FakeController)
@@ -44,10 +49,7 @@ def _fake_run(
         runs,
         model,
         workspace_id,
-        on_item_start=None,
-        on_run_done=None,
-        on_item_done=None,
-        on_langfuse_item_done=None,
+        **kw,
     ):
         return EvalReport(
             model=model,
@@ -84,7 +86,7 @@ def _fake_run(
         ]
     )
     assert exit_code == 0
-    assert orjson.loads(out.read_bytes())["summary"]["passed"] == 1
+    assert orjson.loads(out.read_bytes())["runs"]["gpt-5.2"]["summary"]["passed"] == 1


 def test_cli_operational_error_exits_nonzero(monkeypatch, fixtures_dir):
@@ -103,9 +105,13 @@ def test_cli_http_error_exits_nonzero(monkeypatch, fixtures_dir):
     class _BoomController:
         def __init__(self, *a, **k): ...

+        def get_active(self):
+            return None
+
         def resolve_and_activate(self, requested, provider=None):
             raise httpx.HTTPError("401 unauthorized")

+        def restore(self, original): ...
         def close(self): ...

     monkeypatch.setattr(cli_main, "WorkspaceModelController", _BoomController)
@@ -130,9 +136,13 @@ def test_cli_warns_on_skipped_kinds(monkeypatch, tmp_path, capsys):
     class _FakeController:
         def __init__(self, *a, **k): ...

+        def get_active(self):
+            return ActiveLlmProvider(provider_id="prov", default_model_id="gpt-5.2")
+
         def resolve_and_activate(self, requested, provider=None):
             return ResolvedModel(provider_id="prov", model_id="gpt-5.2", switched=False, provider_name="Test Provider")

+        def restore(self, original): ...
         def close(self): ...

     monkeypatch.setattr(cli_main, "WorkspaceModelController", _FakeController)
@@ -146,10 +156,7 @@ def _fake_run(
         runs,
         model,
         workspace_id,
-        on_item_start=None,
-        on_run_done=None,
-        on_item_done=None,
-        on_langfuse_item_done=None,
+        **kw,
     ):
         return EvalReport(
             model=model,
@@ -201,9 +208,13 @@ def test_cli_langfuse_without_langfuse_dataset_exits_with_error(monkeypatch, fix
     class _FakeController:
         def __init__(self, *a, **k): ...

+        def get_active(self):
+            return ActiveLlmProvider(provider_id="p", default_model_id="gpt-5.2")
+
         def resolve_and_activate(self, requested, provider=None):
             return ResolvedModel(provider_id="p", model_id="gpt-5.2", switched=False, provider_name="P")

+        def restore(self, original): ...
         def close(self): ...

     monkeypatch.setattr(cli_main, "WorkspaceModelController", _FakeController)
@@ -229,9 +240,13 @@ def test_cli_langfuse_sink_called_per_item(monkeypatch, fixtures_dir):
     class _FakeController:
         def __init__(self, *a, **k): ...

+        def get_active(self):
+            return ActiveLlmProvider(provider_id="p", default_model_id="gpt-5.2")
+
         def resolve_and_activate(self, requested, provider=None):
             return ResolvedModel(provider_id="p", model_id="gpt-5.2", switched=False, provider_name="P")

+        def restore(self, original): ...
         def close(self): ...

     monkeypatch.setattr(cli_main, "WorkspaceModelController", _FakeController)
@@ -241,7 +256,7 @@ def close(self): ...
     langfuse_calls: list = []

     class _FakeSink:
-        def __init__(self, dataset_name, run_name): ...
+        def __init__(self, dataset_name, run_name, model_id="", provider_type=""): ...

         def log_item(self, report, *, dataset_item_id):
             langfuse_calls.append(dataset_item_id)
@@ -254,18 +269,16 @@ def _fake_run(
         runs,
         model,
         workspace_id,
-        on_item_start=None,
-        on_run_done=None,
-        on_item_done=None,
-        on_langfuse_item_done=None,
+        **kw,
     ):
+        on_lf = kw.get("on_langfuse_item_done")
         r = EvalReport(model=model, workspace_id=workspace_id)
         item_report = ItemReport(
             id="acme-001", dataset_name="d", test_kind="visualization", question="q", pass_at_k=True, runs=1
         )
         r.items.append(item_report)
-        if on_langfuse_item_done:
-            on_langfuse_item_done(1, 1, item_report)
+        if on_lf:
+            on_lf(1, 1, item_report)
         return r

     monkeypatch.setattr(cli_main, "run_items", _fake_run)
@@ -286,3 +299,194 @@ def _fake_run(
     )
     assert exit_code == 0
     assert langfuse_calls == ["acme-001"]
+
+
+def test_cli_multimodel_runs_each_model(monkeypatch, fixtures_dir):
+    monkeypatch.setattr(cli_main, "resolve_connection", lambda host, token, profile: ("https://h", "tok"))
+    activated_models = []
+
+    class _FakeController:
+        def __init__(self, *a, **k): ...
+        def get_active(self):
+            return ActiveLlmProvider(provider_id="p", default_model_id="orig")
+
+        def resolve_and_activate(self, requested, provider=None):
+            activated_models.append(requested)
+            return ResolvedModel(provider_id="p", model_id=requested or "orig", switched=False, provider_name="P")
+
+        def restore(self, original):
+            activated_models.append(f"restore:{original.default_model_id if original else 'none'}")
+
+        def close(self): ...
+
+    monkeypatch.setattr(cli_main, "WorkspaceModelController", _FakeController)
+    monkeypatch.setattr(cli_main, "ChatClient", lambda **k: object())
+
+    def _fake_run(items, backend, *, runs, model, workspace_id, **kw):
+        return EvalReport(
+            model=model,
+            workspace_id=workspace_id,
+            items=[
+                ItemReport(id="i1", dataset_name="d", test_kind="visualization", question="q", pass_at_k=True, runs=1)
+            ],
+        )
+
+    monkeypatch.setattr(cli_main, "run_items", _fake_run)
+    exit_code = cli_main.main(
+        [
+            "run",
+            "--host",
+            "https://h",
+            "--token",
+            "tok",
+            "--workspace",
+            "ws1",
+            "--dataset",
+            str(fixtures_dir / "sample_dataset"),
+            "--model",
+            "gpt-5.2",
+            "--model",
+            "gpt-4o",
+            "--quiet",
+        ]
+    )
+    assert exit_code == 0
+    assert "gpt-5.2" in activated_models
+    assert "gpt-4o" in activated_models
+    assert any("restore:" in m for m in activated_models)
+
+
+def test_cli_multimodel_writes_nested_json(monkeypatch, tmp_path, fixtures_dir):
+    monkeypatch.setattr(cli_main, "resolve_connection", lambda host, token, profile: ("https://h", "tok"))
+
+    class _FakeController:
+        def __init__(self, *a, **k): ...
+        def get_active(self):
+            return ActiveLlmProvider(provider_id="p", default_model_id="orig")
+
+        def resolve_and_activate(self, requested, provider=None):
+            return ResolvedModel(provider_id="p", model_id=requested or "orig", switched=False, provider_name="P")
+
+        def restore(self, original): ...
+        def close(self): ...
+
+    monkeypatch.setattr(cli_main, "WorkspaceModelController", _FakeController)
+    monkeypatch.setattr(cli_main, "ChatClient", lambda **k: object())
+
+    def _fake_run(items, backend, *, runs, model, workspace_id, **kw):
+        return EvalReport(
+            model=model,
+            workspace_id=workspace_id,
+            items=[
+                ItemReport(id="i1", dataset_name="d", test_kind="visualization", question="q", pass_at_k=True, runs=1)
+            ],
+        )
+
+    monkeypatch.setattr(cli_main, "run_items", _fake_run)
+    out = tmp_path / "results.json"
+    cli_main.main(
+        [
+            "run",
+            "--host",
+            "https://h",
+            "--token",
+            "tok",
+            "--workspace",
+            "ws1",
+            "--dataset",
+            str(fixtures_dir / "sample_dataset"),
+            "--model",
+            "gpt-5.2",
+            "--model",
+            "gpt-4o",
+            "--json",
+            str(out),
+            "--quiet",
+        ]
+    )
+    data = orjson.loads(out.read_bytes())
+    assert data["models"] == ["gpt-5.2", "gpt-4o"]
+    assert "runs" in data and "comparison" in data
+    assert data["comparison"]["gpt-5.2"]["passed"] == 1
+
+
+def test_cli_restore_fires_even_when_model_loop_raises(monkeypatch, fixtures_dir):
+    """workspace.restore() must be called even if all model activations raise."""
+    monkeypatch.setattr(cli_main, "resolve_connection", lambda host, token, profile: ("https://h", "tok"))
+
+    restored = []
+
+    class _FakeController:
+        def __init__(self, *a, **k): ...
+
+        def get_active(self):
+            return ActiveLlmProvider(provider_id="p", default_model_id="orig")
+
+        def resolve_and_activate(self, requested, provider=None):
+            raise RuntimeError("activation failed")
+
+        def restore(self, original):
+            restored.append(original.default_model_id if original else None)
+
+        def close(self): ...
+
+    monkeypatch.setattr(cli_main, "WorkspaceModelController", _FakeController)
+    exit_code = cli_main.main(
+        [
+            "run",
+            "--host",
+            "https://h",
+            "--token",
+            "tok",
+            "--workspace",
+            "ws1",
+            "--dataset",
+            str(fixtures_dir / "sample_dataset"),
+            "--model",
+            "gpt-5.2",
+            "--quiet",
+        ]
+    )
+    assert exit_code == 2  # no models succeeded → error
+    assert restored == ["orig"]  # restore was still called
+
+
+def test_parse_model_arg_plain_model():
+    assert _parse_model_arg("gpt-5.2") == (None, "gpt-5.2")
+
+
+def test_parse_model_arg_provider_slash_model():
+    assert _parse_model_arg("Foundry4o/gpt-5.2") == ("Foundry4o", "gpt-5.2")
+
+
+def test_parse_model_arg_provider_id_slash_model():
+    p, m = _parse_model_arg("foundry_6563e0a3-9354/gpt-4o")
+    assert p == "foundry_6563e0a3-9354" and m == "gpt-4o"
+
+
+def test_parse_model_arg_strips_whitespace():
+    assert _parse_model_arg(" Provider / gpt-5.2 ") == ("Provider", "gpt-5.2")
+
+
+def test_cli_models_command_exits_ok(monkeypatch):
+    """gd-eval models dispatches to _list_models and exits 0."""
+    monkeypatch.setattr(cli_main, "resolve_connection", lambda host, token, profile: ("https://h", "tok"))
+    monkeypatch.setattr(cli_main, "_list_models", lambda host, token, workspace_id: cli_main._EXIT_OK)
+    exit_code = cli_main.main(
+        [
+            "models",
+            "--host",
+            "https://h",
+            "--token",
+            "tok",
+            "--workspace",
+            "ws1",
+        ]
+    )
+    assert exit_code == 0
+
+
+def test_parse_model_arg_plain_model_no_strip():
+    # The no-slash path does not strip whitespace. Shells drop unquoted padding before
+    # argparse sees it, so this documents the current behaviour for quoted values.
+    assert _parse_model_arg(" gpt-5.2 ") == (None, " gpt-5.2 ")
diff --git a/packages/gooddata-eval/tests/test_reporting.py b/packages/gooddata-eval/tests/test_reporting.py
index 8edba6f9c..f191de38e 100644
--- a/packages/gooddata-eval/tests/test_reporting.py
+++ b/packages/gooddata-eval/tests/test_reporting.py
@@ -1,7 +1,12 @@
 # (C) 2026 GoodData Corporation
 import orjson
-from gooddata_eval.core.reporting.console import render_console
-from gooddata_eval.core.reporting.json_report import build_json_report, write_json_report
+from gooddata_eval.core.reporting.console import render_comparison, render_console
+from gooddata_eval.core.reporting.json_report import (
+    build_json_report,
+    build_multi_model_report,
+    write_json_report,
+    write_multi_model_report,
+)
 from gooddata_eval.core.runner import EvalReport, ItemReport


@@ -49,7 +54,7 @@ def test_write_json_report_creates_file(tmp_path):
     path = tmp_path / "out.json"
     write_json_report(_report(), path)
     loaded = orjson.loads(path.read_bytes())
-    assert loaded["items"]["i2"]["pass_at_k"] is False
+    assert loaded["runs"]["gpt-5.2"]["items"]["i2"]["pass_at_k"] is False


 def test_render_console_returns_summary_text():
@@ -58,3 +63,129 @@ def test_render_console_returns_summary_text():
     assert "1/3" in text
     assert "2.50s" in text  # i1 total latency
     assert "1.25s" in text  # i1 avg/run latency
+
+
+def _two_reports():
+    return [
+        EvalReport(
+            model="gpt-5.2",
+            workspace_id="ws",
+            items=[
+                ItemReport(
+                    id="i1",
+                    dataset_name="d",
+                    test_kind="visualization",
+                    question="q1",
+                    pass_at_k=True,
+                    runs=1,
+                    latency_s=10.0,
+                    best_detail={"metrics_correct": True},
+                ),
+                ItemReport(
+                    id="i2",
+                    dataset_name="d",
+                    test_kind="visualization",
+                    question="q2",
+                    pass_at_k=False,
+                    runs=1,
+                    latency_s=20.0,
+                    best_detail={"metrics_correct": False},
+                ),
+            ],
+        ),
+        EvalReport(
+            model="gpt-4o",
+            workspace_id="ws",
+            items=[
+                ItemReport(
+                    id="i1",
+                    dataset_name="d",
+                    test_kind="visualization",
+                    question="q1",
+                    pass_at_k=True,
+                    runs=1,
+                    latency_s=8.0,
+                    best_detail={"metrics_correct": True},
+                ),
+                ItemReport(
+                    id="i2",
+                    dataset_name="d",
+                    test_kind="visualization",
+                    question="q2",
+                    pass_at_k=True,
+                    runs=1,
+                    latency_s=12.0,
+                    best_detail={"metrics_correct": True},
+                ),
+            ],
+        ),
+    ]
+
+
+def test_build_multi_model_report_structure():
+    data = build_multi_model_report(_two_reports())
+    assert data["models"] == ["gpt-5.2", "gpt-4o"]
+    assert "gpt-5.2" in data["runs"]
+    assert "gpt-4o" in data["runs"]
+    assert data["runs"]["gpt-5.2"]["model"] == "gpt-5.2"
+    assert "comparison" in data
+    assert data["comparison"]["gpt-5.2"]["passed"] == 1
+    assert data["comparison"]["gpt-4o"]["passed"] == 2
+
+
+def test_build_multi_model_report_comparison_keys():
+    data = build_multi_model_report(_two_reports())
+    cmp = data["comparison"]["gpt-4o"]
+    assert cmp["total"] == 2
+    assert cmp["pass_rate"] == 1.0
+    assert "avg_quality_score" in cmp
+    assert "avg_latency_s" in cmp
+    assert "total_latency_s" in cmp
+
+
+def test_write_multi_model_report_creates_file(tmp_path):
+    path = tmp_path / "out.json"
+    write_multi_model_report(_two_reports(), path)
+    loaded = orjson.loads(path.read_bytes())
+    assert loaded["models"] == ["gpt-5.2", "gpt-4o"]
+    assert "comparison" in loaded
+
+
+def test_render_comparison_shows_both_models_and_winner():
+    text = render_comparison(_two_reports())
+    assert "gpt-5.2" in text
+    assert "gpt-4o" in text
+    assert "Winner" in text
+    # gpt-4o passed 2/2, gpt-5.2 passed 1/2 — gpt-4o wins
+    assert "gpt-4o" in text.split("Winner")[1]
+
+
+def test_render_comparison_single_report_returns_empty():
+    assert render_comparison([EvalReport(model="gpt-5.2", workspace_id="ws")]) == ""
+
+
+def test_build_multi_model_report_no_key_collision_same_model_different_providers():
+    """Two runs with the same model_id but different providers must both survive in JSON."""
+    r1 = EvalReport(
+        model="claude-opus",
+        provider_name="DirectAnthropic",
+        workspace_id="ws",
+        items=[
+            ItemReport(id="i1", dataset_name="d", test_kind="visualization", question="q", pass_at_k=True, runs=1),
+        ],
+    )
+    r2 = EvalReport(
+        model="claude-opus",
+        provider_name="HN_Anthropic",
+        workspace_id="ws",
+        items=[
+            ItemReport(id="i1", dataset_name="d", test_kind="visualization", question="q", pass_at_k=False, runs=1),
+        ],
+    )
+    data = build_multi_model_report([r1, r2])
+    assert len(data["runs"]) == 2, "both runs must be present — no silent overwrite"
+    assert len(data["comparison"]) == 2
+    assert "DirectAnthropic/claude-opus" in data["runs"]
+    assert "HN_Anthropic/claude-opus" in data["runs"]
+    assert data["runs"]["DirectAnthropic/claude-opus"]["summary"]["passed"] == 1
+    assert data["runs"]["HN_Anthropic/claude-opus"]["summary"]["passed"] == 0
diff --git a/packages/gooddata-eval/tests/test_workspace.py b/packages/gooddata-eval/tests/test_workspace.py
index f919a8bdc..fbeb06095 100644
--- a/packages/gooddata-eval/tests/test_workspace.py
+++ b/packages/gooddata-eval/tests/test_workspace.py
@@ -1,8 +1,11 @@
 # (C) 2026 GoodData Corporation
+from unittest.mock import MagicMock
+
 import pytest
 from gooddata_eval.core.workspace import (
     ActiveLlmProvider,
     ModelResolutionError,
+    WorkspaceModelController,
     _resolve_provider_ref,
     active_provider_content,
     resolve_model,
@@ -106,3 +109,26 @@ def test_resolve_provider_ref_ambiguous_name():
     }
     with pytest.raises(ModelResolutionError, match="Multiple"):
         _resolve_provider_ref("Shared Name", info)
+
+
+def test_workspace_controller_restore_calls_activate():
+    ctrl = WorkspaceModelController.__new__(WorkspaceModelController)
+    ctrl._workspace_id = "ws"
+    ctrl._host = "https://h"
+    ctrl._headers = {}
+    ctrl._sdk = MagicMock()
+
+    activated = []
+    ctrl.activate = lambda pid, mid: activated.append((pid, mid))
+
+    original = ActiveLlmProvider(provider_id="orig-prov", default_model_id="gpt-5.2")
+    ctrl.restore(original)
+    assert activated == [("orig-prov", "gpt-5.2")]
+
+
+def test_workspace_controller_restore_skips_when_none():
+    ctrl = WorkspaceModelController.__new__(WorkspaceModelController)
+    activated = []
+    ctrl.activate = lambda pid, mid: activated.append((pid, mid))
+    ctrl.restore(None)
+    assert activated == []
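
Review note: the restore tests above, together with `test_cli_restore_fires_even_when_model_loop_raises`, pin down a save/restore contract around the per-model loop. The sketch below illustrates that contract in isolation. It is only a guess at the shape of the caller, not the code in `cli/main.py`: `run_models` and `FakeController` are hypothetical names, the local `ActiveLlmProvider` is a stand-in for the real dataclass, and skipping a model on `RuntimeError` is an assumption made for the example.

```python
from dataclasses import dataclass


@dataclass
class ActiveLlmProvider:
    # Stand-in for gooddata_eval.core.workspace.ActiveLlmProvider,
    # limited to the two fields the tests in this diff use.
    provider_id: str
    default_model_id: str


class FakeController:
    """Hypothetical stand-in for WorkspaceModelController; it only records calls."""

    def __init__(self, fail_on=()):
        self.calls = []
        self._fail_on = set(fail_on)

    def get_active(self):
        return ActiveLlmProvider(provider_id="p", default_model_id="orig")

    def resolve_and_activate(self, requested, provider=None):
        if requested in self._fail_on:
            raise RuntimeError(f"activation failed for {requested}")
        self.calls.append(f"activate:{requested}")

    def restore(self, original):
        # Same rule as restore() in the diff: nothing to restore when no state was saved.
        if original is None:
            return
        self.calls.append(f"restore:{original.default_model_id}")


def run_models(controller, models):
    """Hypothetical driver: try each model, then always restore the saved state."""
    original = controller.get_active()  # snapshot before any switch
    succeeded = []
    try:
        for model in models:
            try:
                controller.resolve_and_activate(model)
            except RuntimeError:
                continue  # assumption: a failed model is skipped, the loop goes on
            succeeded.append(model)
    finally:
        # Runs on success, on skipped models, and on unexpected exceptions alike.
        controller.restore(original)
    return succeeded
```

With `FakeController(fail_on={"gpt-5.2"})` and models `["gpt-5.2", "gpt-4o"]`, the driver returns `["gpt-4o"]` and the recorded calls end with `restore:orig`. Putting the restore in `finally` rather than after the loop is what lets the "all activations raise" test still observe it.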