gooddata · zdenekmusil-gd · Jun 4, 2026 · Jun 4, 2026
@@ -1,7 +1,7 @@
 # gooddata-eval
 
 CLI to evaluate the GoodData AI agent against a dataset of natural-language
-questions on a chosen workspace and LLM model.
+questions on a chosen workspace and LLM model — including multi-model comparison.
 
 ## Install
 
@@ -11,68 +11,145 @@ Or install `gd-eval` as a standalone tool:
 
     uv tool install gooddata-eval
 
-## Quick start
+## Commands
+
+| Command | Description |
+|---|---|
+| `gd-eval run` | Run an evaluation dataset against one or more models. |
+| `gd-eval models` | List LLM providers and models configured in the org. |
+
+---
+
+## `gd-eval run`
+
+### Quick start — single model
 
 ```bash
 export GOODDATA_TOKEN='your-api-token'
 
 gd-eval run \
   --host  https://your.gooddata.cloud \
-  --workspace  demo \
+  --workspace  ecommerce_demo \
   --dataset  ./my-dataset \
   --model  gpt-5.2 \
-  --runs  2 \
+  --runs  1 \
   --json  results.json
 ```
 
-## All flags
+### Multi-model comparison
+
+Pass `--model` multiple times to evaluate the same dataset against several
+models and get a side-by-side comparison:
+
+```bash
+gd-eval run \
+  --host  https://your.gooddata.cloud \
+  --workspace  ecommerce_demo \
+  --dataset  ./my-dataset \
+  --model  gpt-5.2 \
+  --model  claude-opus-4-7 \
+  --runs  1 \
+  --json  comparison.json
+```
+
+When the same model id is offered by multiple providers, use the
+`provider/model` syntax to disambiguate:
+
+```bash
+  --model  "Foundry4o_4.1_5.2/gpt-5.2" \
+  --model  "HN_Anthropic/claude-opus-4-7"
+```
 
-### Connection
+Either the provider name or the provider id is accepted as the prefix.
+
+### All flags
+
+#### Connection
 
 | Flag | Env var | Description |
 |---|---|---|
-| `--host HOST` | — | GoodData host URL (e.g. `https://your.gooddata.cloud`). |
+| `--host HOST` | — | GoodData host URL. |
 | `--token TOKEN` | `GOODDATA_TOKEN` | API token. Pass via flag or env var. |
-| `--profile NAME` | — | Profile name in `~/.gooddata/profiles.yaml` (same file as the `gdc` CLI). Provides host + token when both flags are omitted. |
+| `--profile NAME` | — | Profile name in `~/.gooddata/profiles.yaml` (same file as the `gdc` CLI). |
 | `--workspace ID` | — | **Required.** Workspace id to evaluate against. |
 
-### Dataset source (pick one)
+#### Dataset source (pick one)
 
 | Flag | Description |
 |---|---|
-| `--dataset PATH` | Path to a flat folder of JSON files — one question per file. |
-| `--langfuse-dataset NAME` | Pull dataset items by name from Langfuse. Requires `LANGFUSE_PUBLIC_KEY`, `LANGFUSE_SECRET_KEY`, `LANGFUSE_HOST` env vars. |
+| `--dataset PATH` | Flat folder of JSON files — one question per file. |
+| `--langfuse-dataset NAME` | Pull items by name from a Langfuse dataset. Requires `LANGFUSE_PUBLIC_KEY`, `LANGFUSE_SECRET_KEY`, `LANGFUSE_HOST`. |
 
-### Model selection
+#### Model selection
 
 | Flag | Description |
 |---|---|
-| `--model ID` | LLM model id to evaluate (e.g. `gpt-5.2`). Defaults to the workspace's currently active model. If the model is offered by a different provider than the active one, the workspace's active provider is switched automatically. |
-| `--provider NAME_OR_ID` | LLM provider name or id. Use when `--model` is offered by multiple providers and you need to pick one. Accepts either the human-readable provider name or its UUID id. |
+| `--model MODEL` | Model id to evaluate. Repeat the flag to compare multiple models. Accepts `provider/model` syntax to disambiguate when a model is offered by multiple providers (e.g. `--model "Foundry4o/gpt-5.2"`). Defaults to the workspace's currently active model. |
 
-### Evaluation
+#### Evaluation
 
 | Flag | Default | Description |
 |---|---|---|
-| `--runs K` | `2` | Number of independent conversation runs per item (pass@K). An item passes if any run passes. |
+| `--runs K` | `2` | Independent runs per item (pass@K). An item passes if any run passes. |
 
-### Output
+#### Output
 
 | Flag | Description |
 |---|---|
-| `--json PATH` | Write a machine-readable JSON report (keyed by item id, with per-item scores) to this path. Console summary is always printed. |
-| `--quiet` | Suppress per-item progress output. Only the final table and summary are printed. |
+| `--json PATH` | Write a JSON report to this path. Always uses the nested `{models, runs, comparison}` shape even for a single model. |
+| `--quiet` | Suppress per-item progress. Per-model result tables and the comparison summary are still printed. |
 
-### Langfuse sink
+#### Langfuse sink
 
 | Flag | Description |
 |---|---|
-| `--langfuse` | Log evaluation results to Langfuse after each item. Requires `--langfuse-dataset` (so item ids can be linked to Langfuse dataset items). Creates a named experiment run (`gd-eval-{timestamp}-{model}`) in the Langfuse dataset. Requires `LANGFUSE_PUBLIC_KEY`, `LANGFUSE_SECRET_KEY`, `LANGFUSE_HOST`. |
+| `--langfuse` | Log scores and traces to Langfuse after each item. Requires `--langfuse-dataset`. Creates one named experiment run per model (`gd-eval-{timestamp}-{model}`). Requires `LANGFUSE_PUBLIC_KEY`, `LANGFUSE_SECRET_KEY`, `LANGFUSE_HOST`. |
+
+### JSON report shape
+
+The JSON report always uses the nested multi-model shape:
+
+```json
+{
+  "models": ["gpt-5.2", "claude-opus-4-7"],
+  "runs": {
+    "gpt-5.2":        { "summary": { "passed": 22, ... }, "items": { ... } },
+    "claude-opus-4-7": { "summary": { "passed": 18, ... }, "items": { ... } }
+  },
+  "comparison": {
+    "gpt-5.2":        { "passed": 22, "total": 31, "pass_rate": 0.71, "avg_quality_score": 0.81, ... },
+    "claude-opus-4-7": { "passed": 18, "total": 31, "pass_rate": 0.58, "avg_quality_score": 0.72, ... }
+  }
+}
+```
+
+The winner is selected by **pass rate → quality score → latency**: the higher pass rate wins, ties go to the higher quality score, and any remaining tie goes to the lower latency.
+
+---
+
+## `gd-eval models`
+
+List all LLM providers and their models in the org. Marks the active model
+for a workspace when `--workspace` is given:
+
+```bash
+gd-eval models \
+  --host  https://your.gooddata.cloud \
+  --workspace  ecommerce_demo
+```
+
+```
+┃ Provider       ┃ Provider ID ┃ Model ID          ┃ Family    ┃ Active   ┃
+│ Foundry4o      │ foundry_…   │ gpt-5.2           │ OPENAI    │ ◀ active │
+│                │             │ gpt-4o            │ OPENAI    │          │
+│ HN_Anthropic   │ hn_anthr_…  │ claude-opus-4-7   │ ANTHROPIC │          │
+```
+
+---
 
 ## Dataset format
 
-A dataset is a folder of `.json` files, one per question. Each file must
-contain a common envelope:
+A dataset is a folder of `.json` files, one per question:
 
 ```json
 {
@@ -87,8 +164,6 @@ contain a common envelope:
 Supported `test_kind` values: `visualization`, `metric_skill`, `alert_skill`,
 `search_tool`, `general_question`, `guardrail`.
 
-See the full dataset specification for `expected_output` shapes per test kind.
-
 ## Supported test kinds
 
 | test_kind | What the agent must produce | Extra required |
@@ -104,31 +179,22 @@ See the full dataset specification for `expected_output` shapes per test kind.
 
 ### `[llm-judge]` — LLM-as-judge evaluators
 
-`general_question` and `guardrail` items are scored by an LLM judge (GPT-4o)
-that compares the agent's text response against your expected-output description.
-This requires the OpenAI Python package and an API key:
+`general_question` and `guardrail` items are scored by a GPT-4o judge. This requires
+the OpenAI package (installed by the `[llm-judge]` extra) and the `OPENAI_API_KEY` env var:
 
 ```bash
-uv add 'gooddata-eval[llm-judge]'        # project dependency
-# or, for the standalone gd-eval tool:
+uv add 'gooddata-eval[llm-judge]'
+# or for the standalone tool:
 uv tool install 'gooddata-eval[llm-judge]'
 ```
 
-Set your OpenAI key before running:
-
-```bash
-export OPENAI_API_KEY='sk-...'
-```
-
-Without `[llm-judge]`, items with `test_kind: general_question` or `guardrail`
-are reported as **skipped**.
-
+Without `[llm-judge]`, those items are **skipped**.
 
 ## Exit codes
 
 | Code | Meaning |
 |---|---|
-| `0` | Run completed. Evaluation failures do **not** cause a non-zero exit — check the report. |
+| `0` | Run completed. Evaluation failures do **not** cause a non-zero exit. |
 | `2` | Operational error: bad connection, missing model, unreadable dataset, missing credentials. |
 
 ## Scores (in JSON report and Langfuse)
@@ -137,5 +203,6 @@ are reported as **skipped**.
 |---|---|
 | `pass_at_k` | 1 if any of the K runs passed strict checks, else 0. |
 | `quality_score` | Fraction of strict check flags that are `True` (0.0–1.0). Shown in CLI as a percentage. |
-| `value_score` | Weighted blend: 0.6 × quality + 0.2 × speed (where speed = max(0, 1 − latency/60s)). |
+| `value_score` | Weighted blend: 0.6 × quality + 0.2 × speed (speed = max(0, 1 − latency/60s)). |
 | `latency_s` | Average per-run latency in seconds. |
+| `provider_type` | Model vendor + gateway label (e.g. `ANTHROPIC`, `BEDROCK/ANTHROPIC`, `AZURE/OPENAI`). Stored in Langfuse trace metadata and tags. |
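
The score definitions in the table above, together with the winner rule from the JSON report section, can be sketched in a few lines of Python. This is an illustrative reading of the README text, not the tool's actual implementation: all function names are hypothetical, and the `avg_latency_s` key is an assumption, since the README elides the remaining `comparison` fields.

```python
def pass_at_k(run_passed):
    """Return 1 if any of the K runs passed strict checks, else 0."""
    return 1 if any(run_passed) else 0


def quality_score(check_flags):
    """Fraction of strict check flags that are True (0.0 to 1.0)."""
    if not check_flags:
        return 0.0
    return sum(1 for flag in check_flags if flag) / len(check_flags)


def speed(latency_s):
    """speed = max(0, 1 - latency/60s), as stated in the scores table."""
    return max(0.0, 1.0 - latency_s / 60.0)


def value_score(quality, latency_s):
    """Weighted blend exactly as documented: 0.6 * quality + 0.2 * speed.

    The documented weights sum to 0.8; any further term is not described
    in the README, so none is added here.
    """
    return 0.6 * quality + 0.2 * speed(latency_s)


def pick_winner(comparison, latency_key="avg_latency_s"):
    """Pick a winner by pass rate, then quality score, then lower latency.

    `latency_key` is a hypothetical field name; a missing value is
    treated as 0.0 so that models without latency data still compare.
    """
    def sort_key(model):
        stats = comparison[model]
        return (
            -stats["pass_rate"],          # higher pass rate first
            -stats["avg_quality_score"],  # then higher quality
            stats.get(latency_key, 0.0),  # then lower latency
        )

    return min(comparison, key=sort_key)


if __name__ == "__main__":
    # Values taken from the example report in the README.
    comparison = {
        "gpt-5.2": {"pass_rate": 0.71, "avg_quality_score": 0.81},
        "claude-opus-4-7": {"pass_rate": 0.58, "avg_quality_score": 0.72},
    }
    print(pick_winner(comparison))
    print(value_score(quality_score([True, True, False, True]), 30.0))
```

Negating pass rate and quality inside the sort key lets a single `min` call apply all three criteria in order, which keeps the tie-breaking explicit.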