test(llm-coverage): fix and unskip the backup-judge test (#48) by serkancakmakk · Pull Request #53 · weval-org/app

serkancakmakk · 2026-07-01T18:46:28Z

Summary

Closes #48.

The "should use backup judge when one primary judge fails" test in src/cli/evaluators/__tests__/llm-coverage-evaluator.test.ts was marked .skip because its expectations were stale: they were written for an older DEFAULT_JUDGES set (qwen / gpt-oss / glm) that no longer exists. This PR rewrites the test against the current DEFAULT_JUDGES and un-skips it.

What changed

Current DEFAULT_JUDGES (all holistic): google/gemini-2.5-flash, openai/gpt-4.1-mini, anthropic/claude-haiku-4.5; backup judge: anthropic/claude-haiku-4.5 (id backup-claude-4-5-haiku).

The rewritten test:

- Fails one primary judge (gpt-4.1-mini) and lets the other two succeed, so successfulJudgements < totalJudgesAttempted triggers the backup phase.
- Matches the backup judge by its id (backup-claude-4-5-haiku) first, since it shares its model with a primary judge and couldn't otherwise be distinguished in the mock.
- Averages the two surviving primaries plus the backup: (0.8 + 0.6 + 0.4) / 3 = 0.6, matching the source's avgScore.toFixed(2) consensus logic.
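For reviewers, the id-first matching described above can be sketched roughly as follows. This is an illustration only, not the code in the test file: the Judge and MockResult shapes and the mockJudgeResponse helper are hypothetical, and the assignment of 0.8 and 0.6 to specific primaries (and 0.4 to the backup) is an assumption made for the example.

```typescript
// Hypothetical judge shape; the real type in the repo may differ.
interface Judge {
  id?: string;
  model: string;
}

type MockResult = { score: number } | { error: string };

// Check the backup id BEFORE the model. The backup judge and one primary
// both use anthropic/claude-haiku-4.5, so a model-only match cannot
// tell them apart.
function mockJudgeResponse(judge: Judge): MockResult {
  if (judge.id === "backup-claude-4-5-haiku") return { score: 0.4 }; // assumed backup score
  if (judge.model === "openai/gpt-4.1-mini") return { error: "simulated failure" };
  if (judge.model === "google/gemini-2.5-flash") return { score: 0.8 }; // assumed
  if (judge.model === "anthropic/claude-haiku-4.5") return { score: 0.6 }; // assumed
  throw new Error(`unexpected judge: ${judge.model}`);
}
```

If the model check ran first, the backup call would receive the primary's score and the expected average would no longer be distinguishable from a run without the backup.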

The assertions still verify the meaningful behavior — backup activation (DEFAULT_JUDGES.length + 1 judge calls), the consensus score, the "NOTE: Backup judge was used to supplement failed primary judges." reflection, and no error — so it reflects intended behavior rather than being forced to pass.
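As a rough model of what those assertions check, the sketch below reproduces the arithmetic and the call count under stated assumptions. The consensus function and its return shape are hypothetical and are not the evaluator's implementation; the only details taken from this PR are the trigger condition (fewer successful judgements than judges attempted), the extra backup call, and rounding via toFixed(2).

```typescript
// Hypothetical model of the backup flow; null marks a failed primary judge.
function consensus(primaryScores: (number | null)[], backupScore: number) {
  const successful = primaryScores.filter((s): s is number => s !== null);
  // Backup phase runs only when at least one primary failed.
  const usedBackup = successful.length < primaryScores.length;
  const scores = usedBackup ? [...successful, backupScore] : successful;
  const avg = scores.reduce((a, b) => a + b, 0) / scores.length;
  return {
    judgeCalls: primaryScores.length + (usedBackup ? 1 : 0),
    score: parseFloat(avg.toFixed(2)), // assumed: rounded string parsed back to a number
    usedBackup,
  };
}

// One primary fails, so the backup is called: 3 primaries + 1 backup.
const result = consensus([0.8, null, 0.6], 0.4);
console.log(result.judgeCalls, result.score, result.usedBackup);
```

With all three primaries succeeding, the same function makes three calls and never consults the backup, which is the case the DEFAULT_JUDGES.length + 1 assertion rules out.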

Acceptance criteria

- The "backup judge" test is un-skipped and passing
- Test reflects the actual intended backup-judge behavior
- llm-coverage-evaluator suite green

Verification

pnpm exec vitest run --project cli src/cli/evaluators/__tests__/llm-coverage-evaluator.test.ts
# Tests  39 passed (39)   ← 0 skipped

No .skip remains in the file.

The "should use backup judge when one primary judge fails" test was .skipped with stale expectations written for an older DEFAULT_JUDGES set (qwen / gpt-oss / glm). Rewrite it against the current DEFAULT_JUDGES (gemini-2.5-flash, gpt-4.1-mini, claude-haiku-4.5):

- Fail one primary (gpt-4.1-mini) so the backup judge is activated.
- Match the backup by its id (backup-claude-4-5-haiku) since it shares a model with a primary judge.
- Average the two surviving primaries plus the backup ((0.8+0.6+0.4)/3).

Un-skipped and passing; the full llm-coverage-evaluator suite is green (39 passed, 0 skipped).

Closes weval-org#48

serkancakmakk wants to merge 1 commit into weval-org:main from serkancakmakk:fix/48-unskip-backup-judge-test
