RAG-Diff

Regression testing and diagnostic diffing for RAG pipelines.

RAG-Diff is a CLI tool that helps you answer three questions every time you change your RAG system:

1. Did I break it? Snapshot-based regression testing.
2. Was it retrieval or the model? Heuristic triage.
3. Show me what changed. Context-level diffs with LOST/NEW tracking.

Think of it as pytest for RAG — lightweight, fast, developer-first.

Install

pip install rag-diff

# With LLM-as-Judge support (optional; quoted so shells such as zsh don't expand the brackets):
pip install "rag-diff[openai]"

Requirements: Python 3.10+

Quick Start

1. Initialize a sample project

rag-diff init

This creates sample_adapter.py and sample_testset.json.

2. Run your first test

rag-diff run --testset sample_testset.json --target sample_adapter.py:ask

3. Make a change and run again

# Edit your adapter, change retrieval, swap models, etc.
rag-diff run --testset sample_testset.json --target sample_adapter.py:ask

On the second run, RAG-Diff automatically compares against the previous run and shows you what changed:

+- Diff: run_20260410_1000_abc12345 -> run_20260410_1015_def67890 -+
| 1 regression(s) | 1 changed | 8 unchanged                       |
+------------------------------------------------------------------+

  REGRESSION How is overtime calculated?
    Triage: Suspected Retrieval Issue
    Latency: 120ms -> 350ms (+192%)
    Judge:
      faithfulness: PASS (1.0) -> FAIL (0.2)
        Reason: Answer mentions "3x pay" but new contexts lack this policy.
    Contexts:
      - LOST 4a8b2c1d9e3f7a05...
      + NEW  f1e2d3c4b5a69078...
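
The LOST/NEW lines in the report above can be read as a set difference over context IDs between the two runs. The following is a minimal sketch of that idea, not RAG-Diff's actual `diff.py`; the function name `diff_contexts` and the shortened IDs are hypothetical.

```python
def diff_contexts(old_ids: list[str], new_ids: list[str]) -> dict[str, list[str]]:
    """Return context IDs that disappeared (lost) or appeared (new) between runs."""
    old_set, new_set = set(old_ids), set(new_ids)
    return {
        # List comprehensions keep the original retrieval order within each group.
        "lost": [cid for cid in old_ids if cid not in new_set],
        "new": [cid for cid in new_ids if cid not in old_set],
    }


# Shortened, made-up IDs for illustration only.
changes = diff_contexts(["4a8b2c1d", "77aa00bb"], ["77aa00bb", "f1e2d3c4"])
for cid in changes["lost"]:
    print(f"- LOST {cid}")
for cid in changes["new"]:
    print(f"+ NEW  {cid}")
```

A context that appears in both runs is reported in neither group, which is why only the swapped document shows up.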

Writing an Adapter

An adapter is a Python function, async or sync, that takes a query string and returns a dict with an `answer` and a list of `contexts`:

# Minimal: just answer and contexts (my_rag_pipeline stands in for your own pipeline)
async def ask(query: str) -> dict:
    result = my_rag_pipeline(query)
    return {
        "answer": result.answer,
        "contexts": [chunk.text for chunk in result.chunks],
    }

# Recommended alternative: structured contexts with IDs and scores
async def ask(query: str) -> dict:
    result = my_rag_pipeline(query)
    return {
        "answer": result.answer,
        "contexts": [
            {"id": c.id, "text": c.text, "score": c.score}
            for c in result.chunks
        ],
    }

When contexts are plain strings, RAG-Diff derives an ID for each one from its SHA-256 hash, so you don't need to provide IDs yourself.
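
As a rough illustration of how a context fingerprint like this can be computed, here is a sketch using Python's standard `hashlib`. The helper name `context_id` is hypothetical, and the exact normalization RAG-Diff applies before hashing (if any) is not documented here, so this hashes the raw UTF-8 text as an assumption.

```python
import hashlib


def context_id(context) -> str:
    """Return the explicit ID of a structured context, or a SHA-256 hex digest of its text."""
    if isinstance(context, dict) and context.get("id"):
        return str(context["id"])
    # Plain strings (or dicts without an ID) fall back to hashing the raw text.
    # Assumption: no whitespace or case normalization is applied first.
    text = context["text"] if isinstance(context, dict) else context
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


plain = context_id("Overtime is paid at 1.5x the hourly rate.")
structured = context_id({"id": "doc-42", "text": "Overtime is paid at 1.5x.", "score": 0.9})
print(plain[:16] + "...", structured)
```

Because the hash depends only on the text, re-chunking or editing a document changes its ID and will show up as a LOST/NEW pair in the diff.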

Test Set Format

A JSON array of test cases:

[
    {"query": "What is the refund policy?"},
    {"query": "How many vacation days?", "expected_answer": "15 days after 5 years"},
    {"query": "Overtime rules", "metadata": {"category": "hr", "priority": "high"}}
]

query (required): The question to ask your RAG system
expected_answer (optional): Golden answer for higher judge accuracy
metadata (optional): Arbitrary tags for your own tracking
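
If you generate test sets programmatically, a small check like the one below can catch a missing `query` before a run. This is a sketch based only on the field list above; `parse_testset` is a hypothetical helper, not part of RAG-Diff.

```python
import json


def parse_testset(raw: str) -> list[dict]:
    """Parse a test set JSON string and check that every case has a non-empty query."""
    cases = json.loads(raw)
    if not isinstance(cases, list):
        raise ValueError("test set must be a JSON array")
    for index, case in enumerate(cases):
        query = case.get("query") if isinstance(case, dict) else None
        if not isinstance(query, str) or not query.strip():
            raise ValueError(f"case {index}: 'query' is required and must be a non-empty string")
    return cases


cases = parse_testset(
    '[{"query": "What is the refund policy?"},'
    ' {"query": "Overtime rules", "metadata": {"category": "hr"}}]'
)
print(len(cases), "cases loaded")
```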

CLI Reference

`rag-diff run`

Run the test set and record a snapshot.

rag-diff run --testset tests.json --target my_adapter.py:ask [OPTIONS]

| Option | Default | Description |
| --- | --- | --- |
| `--testset` | (required) | Path to the test set JSON file |
| `--target` | (required) | Adapter as `file.py:function`, for example `my_adapter.py:ask` |
| `--output-dir` | `.ragdiff` | Directory for run data |
| `--concurrency` | `5` | Max concurrent adapter calls |
| `--compare-to` | (auto: HEAD) | Specific run ID to diff against |
| `--judge / --no-judge` | `--judge` | Enable/disable LLM evaluation |
| `--judge-model` | `gpt-4o-mini` | Model for LLM-as-Judge |
| `--regressions-only` | off | Only show regressions in output |
| `--dump` | off | Export diff report as JSON |
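
The `--concurrency` option caps how many adapter calls are in flight at once. RAG-Diff's own runner is not shown in this README; the sketch below only illustrates one common way to get that behavior with `asyncio.Semaphore` while accepting both async and sync adapters. The names `run_cases`, `async_adapter`, and `sync_adapter` are hypothetical.

```python
import asyncio
import inspect


async def run_cases(adapter, queries, concurrency=5):
    """Call the adapter once per query, with at most `concurrency` calls in flight."""
    semaphore = asyncio.Semaphore(concurrency)

    async def call(query):
        async with semaphore:
            if inspect.iscoroutinefunction(adapter):
                return await adapter(query)
            # Sync adapters run in a worker thread so they don't block the event loop.
            return await asyncio.to_thread(adapter, query)

    # gather() returns results in the same order as the input queries.
    return await asyncio.gather(*(call(q) for q in queries))


async def async_adapter(query: str) -> dict:
    await asyncio.sleep(0)  # stands in for real I/O
    return {"answer": f"async: {query}", "contexts": []}


def sync_adapter(query: str) -> dict:
    return {"answer": f"sync: {query}", "contexts": []}


async_results = asyncio.run(run_cases(async_adapter, ["a", "b", "c"], concurrency=2))
sync_results = asyncio.run(run_cases(sync_adapter, ["a", "b"], concurrency=2))
print([r["answer"] for r in async_results])
```

If your pipeline calls a rate-limited API, lowering `--concurrency` is the knob to reach for first.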

`rag-diff init [DIRECTORY]`

Create sample adapter and test set files.

`rag-diff list`

List all recorded runs and mark the one that HEAD points to.

How It Works

Architecture

rag-diff run --testset test.json --target adapter.py:ask
     |
     v
  cli.py --> loader.py (dynamic import)
     |
     v
  runner.py --> adapter(query) --> CaseResult
     |              |
     |              v
     |       context_hash.py (SHA-256 fingerprint)
     |
     v
  judge.py --> LLM (faithfulness + relevancy)
     |              |
     |              v
     |       judge_cache.py (SQLite, version-aware)
     |
     v
  run_manager.py --> .ragdiff/runs/{run_id}/snapshot.json
     |
     v
  diff.py --> heuristic triage (retrieval vs model)
     |
     v
  console.py --> Rich terminal output
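
The "dynamic import" step in the diagram resolves a target such as `adapter.py:ask` into a callable. The sketch below shows how that can be done with the standard `importlib.util` machinery; it is an assumption about the general approach, not a copy of `loader.py`, and `load_adapter` is a hypothetical name.

```python
import importlib.util
from pathlib import Path


def load_adapter(target: str):
    """Resolve 'path/to/file.py:function' into the function object it names."""
    # rpartition splits on the last colon, so Windows drive letters stay in the path.
    file_part, sep, func_name = target.rpartition(":")
    if not sep or not file_part or not func_name:
        raise ValueError(f"expected 'file.py:function', got {target!r}")

    path = Path(file_part).resolve()
    spec = importlib.util.spec_from_file_location(path.stem, path)
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot import {path}")

    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)  # runs the adapter file's top-level code

    func = getattr(module, func_name, None)
    if not callable(func):
        raise AttributeError(f"{path.name} has no callable named {func_name!r}")
    return func
```

Note that loading a file this way executes its top-level code, so an adapter module should keep expensive setup inside the function or behind lazy initialization.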

Heuristic Triage

When a regression is detected, RAG-Diff uses a simple heuristic to classify the likely root cause:

| Scenario | Triage | What it means |
| --- | --- | --- |
| Judge failed + contexts changed | Retrieval Issue | Your retriever returned different documents |
| Judge failed + contexts same | Model Issue | Same context, worse answer: a model or prompt problem |
| No judge + contexts changed | Changed | Structural change detected; enable the judge for classification |
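
The table above maps directly onto a small decision function. The sketch below is one way to express it; the function name `triage` is hypothetical, and returning `None` for combinations the table does not list (judge passed, or no judge with unchanged contexts) is an assumption.

```python
from typing import Optional


def triage(judge_failed: Optional[bool], contexts_changed: bool) -> Optional[str]:
    """Classify a case. judge_failed=None means the judge was disabled."""
    if judge_failed is None:
        # Without a verdict we can only report that something structural changed.
        return "Changed" if contexts_changed else None
    if not judge_failed:
        return None  # assumption: a passing verdict is not triaged
    return "Retrieval Issue" if contexts_changed else "Model Issue"


print(triage(judge_failed=True, contexts_changed=True))
print(triage(judge_failed=True, contexts_changed=False))
print(triage(judge_failed=None, contexts_changed=True))
```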

Judge Cache

LLM-as-Judge calls are expensive, so RAG-Diff caches every verdict in a local SQLite database. The cache key is SHA-256(query + answer + sorted_context_hashes), scoped to the judge prompt version. If the inputs haven't changed, the cached verdict is reused; typical cache hit rates are 80%+ during iterative development.
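
To make the keying scheme concrete, here is a sketch of a version-scoped verdict cache built on the standard `sqlite3` module. The table layout, the separator used when joining the key parts, and the names `cache_key` and `get_or_judge` are all assumptions made for illustration; the real `judge_cache.py` may differ.

```python
import hashlib
import sqlite3


def cache_key(query: str, answer: str, context_hashes: list[str]) -> str:
    # Sorting makes the key independent of retrieval order.
    parts = [query, answer, *sorted(context_hashes)]
    # Joining with a unit separator is an assumption; the text above only says "+".
    return hashlib.sha256("\x1f".join(parts).encode("utf-8")).hexdigest()


def get_or_judge(conn, key, prompt_version, judge_fn):
    """Return (verdict, was_cached); judge_fn is only called on a cache miss."""
    row = conn.execute(
        "SELECT verdict FROM verdicts WHERE key = ? AND prompt_version = ?",
        (key, prompt_version),
    ).fetchone()
    if row is not None:
        return row[0], True
    verdict = judge_fn()
    conn.execute(
        "INSERT INTO verdicts (key, prompt_version, verdict) VALUES (?, ?, ?)",
        (key, prompt_version, verdict),
    )
    return verdict, False


conn = sqlite3.connect(":memory:")  # a real cache would use a file on disk
conn.execute(
    "CREATE TABLE verdicts (key TEXT, prompt_version TEXT, verdict TEXT, "
    "PRIMARY KEY (key, prompt_version))"
)
key = cache_key("How is overtime calculated?", "1.5x pay", ["f1e2", "4a8b"])
first = get_or_judge(conn, key, "v1.0", lambda: "PASS")   # miss: judge is called
second = get_or_judge(conn, key, "v1.0", lambda: "FAIL")  # hit: judge is skipped
bumped = get_or_judge(conn, key, "v2.0", lambda: "FAIL")  # hypothetical new version: miss
print(first, second, bumped)
```

Scoping lookups by prompt version means a version bump never returns a stale verdict, even though the old rows are still in the table.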

Judge Prompt Version

Judge prompts are locked to v1.0 to prevent evaluation drift. When prompts change in a future release, the version bumps and cached results are automatically invalidated.

Data Layout

.ragdiff/
├── runs/
│   ├── run_20260410_1000_abc12345/
│   │   └── snapshot.json          # Full run results
│   └── run_20260410_1015_def67890/
│       └── snapshot.json
├── cache/
│   └── judge_cache_v1.db          # SQLite judge cache
├── dumps/
│   └── diff_20260410_1015.json    # --dump output
└── HEAD.json                      # Points to latest run

Add .ragdiff/ to your .gitignore.
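
Because run directories follow the `run_YYYYMMDD_HHMM_<id>` pattern shown above, sorting them by name also sorts them chronologically. The sketch below uses that to enumerate runs straight from disk; `list_runs` is a hypothetical helper that relies only on the directory layout, not on the contents of `HEAD.json` or `snapshot.json`.

```python
from pathlib import Path


def list_runs(output_dir: str = ".ragdiff") -> list[str]:
    """Return run IDs found under <output_dir>/runs, oldest first."""
    runs_dir = Path(output_dir) / "runs"
    if not runs_dir.is_dir():
        return []
    # Only count directories that actually contain a snapshot.
    return sorted(p.name for p in runs_dir.iterdir() if (p / "snapshot.json").is_file())


runs = list_runs()
print(runs[-1] if runs else "no runs recorded yet")
```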

Using with CI

# GitHub Actions example
- name: RAG regression test
  run: |
    pip install "rag-diff[openai]"
    rag-diff run --testset tests/rag_tests.json --target src/adapter.py:ask
  env:
    OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

Roadmap

Phase 2: Agentic route diff, HTML report exporter
Phase 3: Custom judge injection, multi-dimensional failure classification

License

MIT
