149 changes: 108 additions & 41 deletions packages/gooddata-eval/README.md
@@ -1,7 +1,7 @@
# gooddata-eval

CLI to evaluate the GoodData AI agent against a dataset of natural-language
questions on a chosen workspace and LLM model.
questions on a chosen workspace and LLM model — including multi-model comparison.

## Install

@@ -11,68 +11,145 @@ Or install `gd-eval` as a standalone tool:

uv tool install gooddata-eval

## Quick start
## Commands

| Command | Description |
|---|---|
| `gd-eval run` | Run an evaluation dataset against one or more models. |
| `gd-eval models` | List LLM providers and models configured in the org. |

---

## `gd-eval run`

### Quick start — single model

```bash
export GOODDATA_TOKEN='your-api-token'

gd-eval run \
--host https://your.gooddata.cloud \
--workspace demo \
--workspace ecommerce_demo \
--dataset ./my-dataset \
--model gpt-5.2 \
--runs 2 \
--runs 1 \
--json results.json
```

## All flags
### Multi-model comparison

Pass `--model` multiple times to evaluate the same dataset against several
models and get a side-by-side comparison:

```bash
gd-eval run \
--host https://your.gooddata.cloud \
--workspace ecommerce_demo \
--dataset ./my-dataset \
--model gpt-5.2 \
--model claude-opus-4-7 \
--runs 1 \
--json comparison.json
```

When the same model id is offered by multiple providers, use the
`provider/model` syntax to disambiguate:

```bash
--model "Foundry4o_4.1_5.2/gpt-5.2" \
--model "HN_Anthropic/claude-opus-4-7"
```

### Connection
Both provider name and provider id are accepted as the prefix.

### All flags

#### Connection

| Flag | Env var | Description |
|---|---|---|
| `--host HOST` | — | GoodData host URL (e.g. `https://your.gooddata.cloud`). |
| `--host HOST` | — | GoodData host URL. |
| `--token TOKEN` | `GOODDATA_TOKEN` | API token. Pass via flag or env var. |
| `--profile NAME` | — | Profile name in `~/.gooddata/profiles.yaml` (same file as the `gdc` CLI). Provides host + token when both flags are omitted. |
| `--profile NAME` | — | Profile name in `~/.gooddata/profiles.yaml` (same file as the `gdc` CLI). |
| `--workspace ID` | — | **Required.** Workspace id to evaluate against. |

### Dataset source (pick one)
#### Dataset source (pick one)

| Flag | Description |
|---|---|
| `--dataset PATH` | Path to a flat folder of JSON files — one question per file. |
| `--langfuse-dataset NAME` | Pull dataset items by name from Langfuse. Requires `LANGFUSE_PUBLIC_KEY`, `LANGFUSE_SECRET_KEY`, `LANGFUSE_HOST` env vars. |
| `--dataset PATH` | Flat folder of JSON files — one question per file. |
| `--langfuse-dataset NAME` | Pull items by name from a Langfuse dataset. Requires `LANGFUSE_PUBLIC_KEY`, `LANGFUSE_SECRET_KEY`, `LANGFUSE_HOST`. |

### Model selection
#### Model selection

| Flag | Description |
|---|---|
| `--model ID` | LLM model id to evaluate (e.g. `gpt-5.2`). Defaults to the workspace's currently active model. If the model is offered by a different provider than the active one, the workspace's active provider is switched automatically. |
| `--provider NAME_OR_ID` | LLM provider name or id. Use when `--model` is offered by multiple providers and you need to pick one. Accepts either the human-readable provider name or its UUID id. |
| `--model MODEL` | Model id to evaluate. Repeat to compare multiple models. Accepts `provider/model` syntax to disambiguate when a model is offered by multiple providers (e.g. `--model "Foundry4o/gpt-5.2"`). Defaults to the workspace's current active model. |

### Evaluation
#### Evaluation

| Flag | Default | Description |
|---|---|---|
| `--runs K` | `2` | Number of independent conversation runs per item (pass@K). An item passes if any run passes. |
| `--runs K` | `2` | Independent runs per item (pass@K). An item passes if any run passes. |

### Output
#### Output

| Flag | Description |
|---|---|
| `--json PATH` | Write a machine-readable JSON report (keyed by item id, with per-item scores) to this path. Console summary is always printed. |
| `--quiet` | Suppress per-item progress output. Only the final table and summary are printed. |
| `--json PATH` | Write a JSON report to this path. Always uses the nested `{models, runs, comparison}` shape even for a single model. |
| `--quiet` | Suppress per-item progress. Per-model result tables and the comparison summary are still printed. |

### Langfuse sink
#### Langfuse sink

| Flag | Description |
|---|---|
| `--langfuse` | Log evaluation results to Langfuse after each item. Requires `--langfuse-dataset` (so item ids can be linked to Langfuse dataset items). Creates a named experiment run (`gd-eval-{timestamp}-{model}`) in the Langfuse dataset. Requires `LANGFUSE_PUBLIC_KEY`, `LANGFUSE_SECRET_KEY`, `LANGFUSE_HOST`. |
| `--langfuse` | Log scores and traces to Langfuse after each item. Requires `--langfuse-dataset`. Creates one named experiment run per model (`gd-eval-{timestamp}-{model}`). Requires `LANGFUSE_PUBLIC_KEY`, `LANGFUSE_SECRET_KEY`, `LANGFUSE_HOST`. |

### JSON report shape

The JSON report always uses the nested multi-model shape:

```json
{
"models": ["gpt-5.2", "claude-opus-4-7"],
"runs": {
"gpt-5.2": { "summary": { "passed": 22, ... }, "items": { ... } },
"claude-opus-4-7": { "summary": { "passed": 18, ... }, "items": { ... } }
},
"comparison": {
"gpt-5.2": { "passed": 22, "total": 31, "pass_rate": 0.71, "avg_quality_score": 0.81, ... },
"claude-opus-4-7": { "passed": 18, "total": 31, "pass_rate": 0.58, "avg_quality_score": 0.72, ... }
}
}
```

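The winner is selected by **pass rate → quality score → latency**: models are ranked by pass rate first, ties are broken by quality score, and if both are equal the model with the lower latency wins.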
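
The ranking rule can be sketched in a few lines of Python. This is an illustrative reimplementation over the `comparison` block of the report, not the tool's own code. The field name `avg_latency_s` and the latency values are assumptions made for the example; only `pass_rate` and `avg_quality_score` appear in the report excerpt above.

```python
from typing import Any


def pick_winner(comparison: dict[str, dict[str, Any]]) -> str:
    """Return the model id ranked first by pass rate, then quality, then latency."""

    def sort_key(model: str) -> tuple[float, float, float]:
        stats = comparison[model]
        return (
            -stats["pass_rate"],          # higher pass rate ranks first
            -stats["avg_quality_score"],  # then higher quality score
            # ASSUMPTION: `avg_latency_s` is a hypothetical field name;
            # a model without it sorts last on the latency tie-break.
            stats.get("avg_latency_s", float("inf")),
        )

    return min(comparison, key=sort_key)


# Pass rates and quality scores are taken from the report excerpt;
# the latencies are invented for illustration.
comparison = {
    "gpt-5.2": {"pass_rate": 0.71, "avg_quality_score": 0.81, "avg_latency_s": 14.0},
    "claude-opus-4-7": {"pass_rate": 0.58, "avg_quality_score": 0.72, "avg_latency_s": 9.0},
}

print(pick_winner(comparison))  # gpt-5.2
```

Negating the first two fields lets a single `min` call express "highest pass rate, highest quality, lowest latency" as one tuple comparison.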

---

## `gd-eval models`

List all LLM providers and their models in the org. Marks the active model
for a workspace when `--workspace` is given:

```bash
gd-eval models \
--host https://your.gooddata.cloud \
--workspace ecommerce_demo
```

```
┃ Provider ┃ Provider ID ┃ Model ID ┃ Family ┃ Active ┃
│ Foundry4o │ foundry_… │ gpt-5.2 │ OPENAI │ ◀ active │
│ │ │ gpt-4o │ OPENAI │ │
│ HN_Anthropic │ hn_anthr_… │ claude-opus-4-7 │ ANTHROPIC │ │
```

---

## Dataset format

A dataset is a folder of `.json` files, one per question. Each file must
contain a common envelope:
A dataset is a folder of `.json` files, one per question:

```json
{
@@ -87,8 +164,6 @@ contain a common envelope:
Supported `test_kind` values: `visualization`, `metric_skill`, `alert_skill`,
`search_tool`, `general_question`, `guardrail`.

See the full dataset specification for `expected_output` shapes per test kind.

## Supported test kinds

| test_kind | What the agent must produce | Extra required |
@@ -104,31 +179,22 @@ See the full dataset specification for `expected_output` shapes per test kind.

### `[llm-judge]` — LLM-as-judge evaluators

`general_question` and `guardrail` items are scored by an LLM judge (GPT-4o)
that compares the agent's text response against your expected-output description.
This requires the OpenAI Python package and an API key:
`general_question` and `guardrail` items are scored by a GPT-4o judge.
Requires the OpenAI package and `OPENAI_API_KEY`:

```bash
uv add 'gooddata-eval[llm-judge]' # project dependency
# or, for the standalone gd-eval tool:
uv add 'gooddata-eval[llm-judge]'
# or for the standalone tool:
uv tool install 'gooddata-eval[llm-judge]'
```

Set your OpenAI key before running:

```bash
export OPENAI_API_KEY='sk-...'
```

Without `[llm-judge]`, items with `test_kind: general_question` or `guardrail`
are reported as **skipped**.

Without `[llm-judge]`, those items are **skipped**.

## Exit codes

| Code | Meaning |
|---|---|
| `0` | Run completed. Evaluation failures do **not** cause a non-zero exit — check the report. |
| `0` | Run completed. Evaluation failures do **not** cause a non-zero exit. |
| `2` | Operational error: bad connection, missing model, unreadable dataset, missing credentials. |
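
Because exit code `0` only means the run completed, a CI job that should fail on poor results has to inspect the JSON report itself. The sketch below shows one way to do that, assuming the nested `{models, runs, comparison}` shape described above. The helper name `models_below_threshold` and the `0.6` threshold are hypothetical choices for this example.

```python
import json
from typing import Any


def models_below_threshold(report: dict[str, Any], min_pass_rate: float) -> list[str]:
    """Return the model ids whose pass rate is under the given threshold."""
    return [
        model
        for model, stats in report["comparison"].items()
        if stats["pass_rate"] < min_pass_rate
    ]


# Inline sample in the nested report shape. In CI, load the file written by
# `--json` instead, e.g. json.load(open("results.json")).
sample_report = json.loads(
    """
    {
      "models": ["gpt-5.2", "claude-opus-4-7"],
      "runs": {},
      "comparison": {
        "gpt-5.2": {"passed": 22, "total": 31, "pass_rate": 0.71},
        "claude-opus-4-7": {"passed": 18, "total": 31, "pass_rate": 0.58}
      }
    }
    """
)

failing = models_below_threshold(sample_report, min_pass_rate=0.6)
print(failing)  # ['claude-opus-4-7']
```

A CI step could then call `sys.exit(1)` when the returned list is non-empty. The gate uses exit code `1` rather than `2` so that it stays distinct from the operational-error code above.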

## Scores (in JSON report and Langfuse)
@@ -137,5 +203,6 @@ are reported as **skipped**.
|---|---|
| `pass_at_k` | 1 if any of the K runs passed strict checks, else 0. |
| `quality_score` | Fraction of strict check flags that are `True` (0.0–1.0). Shown in CLI as a percentage. |
| `value_score` | Weighted blend: 0.6 × quality + 0.2 × speed (where speed = max(0, 1 − latency/60s)). |
| `value_score` | Weighted blend: 0.6 × quality + 0.2 × speed (speed = max(0, 1 − latency/60s)). |
| `latency_s` | Average per-run latency in seconds. |
| `provider_type` | Model vendor + gateway label (e.g. `ANTHROPIC`, `BEDROCK/ANTHROPIC`, `AZURE/OPENAI`). Stored in Langfuse trace metadata and tags. |
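
The score definitions in the table can be restated as small Python functions. This is a minimal sketch for a single item, not the package's implementation. All function names are hypothetical, and how flags are aggregated across the K runs is not specified here. The table lists two weighted terms for `value_score` (0.6 and 0.2), so the last function computes only those two terms and makes no claim about any further term.

```python
def pass_at_k(run_passed: list[bool]) -> int:
    """1 if any of the K runs passed the strict checks, else 0."""
    return 1 if any(run_passed) else 0


def quality_score(check_flags: list[bool]) -> float:
    """Fraction of strict check flags that are True (0.0 to 1.0)."""
    if not check_flags:
        return 0.0  # ASSUMPTION: no flags is scored as 0.0
    return sum(check_flags) / len(check_flags)


def speed_score(latency_s: float) -> float:
    """speed = max(0, 1 - latency / 60 s); 60 s or slower scores 0."""
    return max(0.0, 1.0 - latency_s / 60.0)


def value_score_listed_terms(quality: float, latency_s: float) -> float:
    """Only the two terms listed in the table: 0.6 * quality + 0.2 * speed."""
    return 0.6 * quality + 0.2 * speed_score(latency_s)


print(pass_at_k([False, True]))                  # 1
print(quality_score([True, True, False, True]))  # 0.75
print(speed_score(15.0))                         # 0.75
print(speed_score(90.0))                         # 0.0
print(value_score_listed_terms(0.75, 15.0))
```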