feat(judge): ✨ replace self-attested goal completion with independent Judge subagent #227

Open
jorben wants to merge 18 commits into master from refact/goal-llm-judugement

@jorben jorben commented Jun 11, 2026

Contributor

Summary

  • Replace the self-attested goal_scored flow with an independent Judge acceptance subagent that evaluates the project state against the goal without relying on the main agent's self-report
  • Add new judge.md / output_contract.judge.md prompts and a typed JudgeRequest / JudgeReport contract, wired into execute_judge_tool
  • Drop the per-goal time_used_seconds field in favor of run-level elapsed tracking; remove the legacy mark_complete / complete verdict path
  • Make the task parameter on the subagent tool optional, fix UTF-8 safe truncation, and raise the builtin default max delegation depth to 5

Test Plan

  • cargo check --manifest-path src-tauri/Cargo.toml passes
  • cargo test --locked --manifest-path src-tauri/Cargo.toml — 811 passed / 0 failed
  • cargo fmt --check --manifest-path src-tauri/Cargo.toml passes
  • Run an end-to-end goal with the new Judge in the Tauri desktop app
  • Verify the rejection / re-verify path when the Judge returns passed=false

🤖 Generated with TiyCode

jorben added 11 commits June 7, 2026 11:54
…udge acceptance agent

Remove the `goal_scored` tool that allowed the main agent to
self-attest goal completion, replacing it with an `agent_judge`
built-in subagent that independently verifies goal attainment
against the project's current state.

Key changes:
- Add `SubagentProfile::Judge` with read-only file tools and
  diagnostic-only shell (soft constraint via prompt)
- Add `JudgeReport` structured contract (passed, completeness_pct,
  findings, summary) with safe fallback parsing
- Add `agent_judge` tool injection only for the main agent when
  an unverified goal exists; runtime gate blocks subagent/parallel
  recursion into Judge
- Add DB migration for `judge_passed`, `judge_completeness`,
  `judge_findings`, `judge_summary`, `judge_evaluated_run_id`
  columns with backfill for legacy `status='complete'` goals
- Replace continuation stop condition: `Complete && judge_passed`
  instead of `goal_scored`-driven status flip
- Rewrite continuation prompt to instruct main agent to call
  `agent_judge` and follow findings on rejection
- Add Judge prompt surface, templates, and output contract
- Update `active_goal.tpl.md` to reflect Judge acceptance flow
- Extend goal lifecycle tests for Judge pass/fail/legacy compat
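
The new stop condition (`Complete && judge_passed`) and the `JudgeReport` contract can be pictured with a minimal sketch. Everything below is illustrative only: the type names `GoalStatus` and `Goal`, the field types, and the helpers `rejected_fallback`, `apply_report`, and `should_stop_continuation` are assumptions, not code from this PR. It also assumes that "safe fallback parsing" means unparseable Judge output is treated as a rejection.

```rust
/// Simplified, hypothetical goal status; the real enum likely has more variants.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum GoalStatus {
    Active,
    Paused,
    Complete,
}

/// Fields named in the commit message; the concrete types are assumptions.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct JudgeReport {
    pub passed: bool,
    pub completeness_pct: u8,
    pub findings: Vec<String>,
    pub summary: String,
}

impl JudgeReport {
    /// Assumed "safe fallback": Judge output that cannot be parsed is
    /// recorded as a rejection rather than silently counted as a pass.
    pub fn rejected_fallback(reason: &str) -> Self {
        JudgeReport {
            passed: false,
            completeness_pct: 0,
            findings: vec![format!("unparseable Judge output: {reason}")],
            summary: String::from("Judge report could not be parsed"),
        }
    }
}

/// Hypothetical slice of the goal row: `judge_passed` is `None` until a
/// Judge has evaluated the goal.
pub struct Goal {
    pub status: GoalStatus,
    pub judge_passed: Option<bool>,
}

impl Goal {
    /// Store the verdict from a Judge run.
    pub fn apply_report(&mut self, report: &JudgeReport) {
        self.judge_passed = Some(report.passed);
    }
}

/// Continuation stops only when the goal is complete AND the Judge passed it.
pub fn should_stop_continuation(goal: &Goal) -> bool {
    goal.status == GoalStatus::Complete && goal.judge_passed == Some(true)
}
```

Under these assumptions a goal that is `Complete` but not yet judged, or judged with `passed = false`, keeps the continuation loop running, which matches the "follow findings on rejection" flow described above.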

Remove the mark_complete pathway from goals as completion will be
handled through a different mechanism:

- Remove mark_complete method from GoalManager
- Remove "complete" from GoalEvaluateResult verdict type
- Remove mark_complete test cases (evidence validation, etc.)
- Update subagent surface comments to include judge

BREAKING CHANGE: GoalEvaluateResult.verdict no longer includes "complete"

Update the feature descriptions and reorder the bullet points in both
README.md and README_zh.md to better reflect the current product
capabilities and improve readability. Changes include:

- Reordering features to highlight persistent goal management, real-time
  streaming, and extensibility earlier in the list
- Updating descriptions for several features to be more accurate
- Maintaining consistency between English and Chinese versions
- Keeping the overall structure while improving flow

These are documentation-only changes that do not affect functionality.

- Extract inline status key resolution into a pure exported function
  so the complete→verified (judgePassed) branch can be unit-tested
  without mounting the component
- Add unit tests covering all status mappings and judgePassed variants
- Add test for skipped verdict passthrough in goalEvaluate

Raise `BUILTIN_DEFAULT_MAX_DELEGATION_DEPTH` from 3 to 5 to match the
existing `GLOBAL_MAX_DELEGATION_DEPTH`, allowing built-in subagents
(explore/review) to be delegated to the same depth as custom profiles.

Update delegation validation tests to reflect the new depth limits.

- Downgrade Judge prompt versions from 2 to 1 (likely a revert of unintended bump)
- Change `task` field from required to optional in Judge tool schema, with updated description clarifying it is an optional note
- Replace byte-based truncation with character-safe truncation to avoid panicking on multi-byte UTF-8 in process compliance summary
- Simplify Judge request validation to only check input validity, discarding the parsed result used only for backward compatibility
- Skip abandoned task boards when building summary to focus on relevant goal state
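
For context on the truncation fix: slicing a `&str` by byte index (for example `&s[..n]`) panics when `n` does not fall on a UTF-8 character boundary. A minimal sketch of character-safe truncation follows; the helper name `truncate_chars` is hypothetical and not taken from this PR.

```rust
/// Truncate `s` to at most `max_chars` Unicode scalar values without
/// ever slicing inside a multi-byte UTF-8 sequence.
pub fn truncate_chars(s: &str, max_chars: usize) -> &str {
    // `char_indices` yields the byte offset at which each char starts,
    // so the offset of the (max_chars + 1)-th char is a valid cut point.
    match s.char_indices().nth(max_chars) {
        Some((byte_idx, _)) => &s[..byte_idx],
        // Fewer than or exactly `max_chars` chars: nothing to cut.
        None => s,
    }
}
```

This returns a borrowed slice, so no allocation is needed; a caller that wants an ellipsis suffix would have to build a `String` instead.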

Merge origin/master (fe7fbfc) into refact/goal-llm-judugement.

Resolution philosophy: keep the Judge independent-verifier stance from
our 4481759 (Judge evaluates project state alone, no main-agent
self-assessment is injected into the prompt). Where master added
genuine improvements, absorb them:

- agent_session_execution.rs: keep task_board + process_compliance
  injection from our branch; add master's prior-verdict block
  sourced from goal.judge_summary / judge_findings as objective
  re-verification context (not main-agent input).
- judge.md: keep our independent-auditor framing; absorb master's
  size-first verification structure (small/medium/large change
  budget) as a pure strategy improvement. Keep our call-chain
  verification and pre-existing / prior-judge prohibition clauses.
- judge_contract.rs, output_contract.judge.md, active_goal.tpl.md,
  runtime_orchestration.rs: keep our versions (task optional,
  agent_judge() no-arg call, no "required" on task).

Co-authored-by: merge-resolver

github-actions Bot commented Jun 11, 2026

AI Code Review Summary

PR: #227 (feat(judge): ✨ replace self-attested goal completion with independent Judge subagent)
Preferred language: English

Overall Assessment

No blocking issue was detected in the reviewed diff; keep focused regression testing before merge.

Major Findings by Severity

No major issues identified from the reviewed diff.

Actionable Suggestions

  • Restore a DB status change for budget‑limited goals (or add a visible indicator) to prevent confusion.
  • Consider tightening has_process_requirements keywords to avoid “preview” false positives.
  • If auto‑resume removal is permanent, update user documentation and release notes so users are not surprised.

Potential Risks

  • UI shows active goal after token budget hit, but agent stops – users may not understand why.
  • Process compliance layer may flag goals that never needed reviews, causing noise.

Test Suggestions

  • Add integration test for BudgetLimited verdict propagation through run loop.
  • Add a test demonstrating that a goal containing “preview” unintentionally triggers the compliance layer.

File-Level Coverage Notes

  • src-tauri/src/core/agent_session_execution.rs: Judge context enrichment (task board, process compliance) improves independence; new helpers are well‑tested. Removing the agent note from the prompt is intentional. Minor risk from loose keyword matching.
  • src-tauri/src/core/context_compression.rs: Dynamic reserve calculation (20% of context window, min floor) is correct and well‑tested. No regressions.
  • src-tauri/src/core/goal_manager.rs: Removal of auto‑resume and tool‑based auto‑pause is a deliberate design change that may confuse users expecting the old automatic behaviour. Budget‑limited verdict no longer updates status, which could hide the stop reason.
  • src-tauri/src/core/subagent/judge_contract.rs: Change to allow empty task is consistent with discarding the agent note; tests updated.
  • src-tauri/src/core/subagent/runtime_orchestration.rs: Tool schema update aligns with contract change; description now accurate.
  • src-tauri/src/model/goal.rs: Removal of auto_resume_on_user_message matches the behaviour change in goal manager.
  • src-tauri/src/persistence/repo/run_helper_repo.rs: New list_by_thread_id is correct and tested; no issues.
  • src-tauri/src/persistence/repo/run_repo.rs: Fix to accumulate elapsed time on interruption is a welcome improvement.
  • src-tauri/tests/goal_lifecycle.rs: Integration tests updated to reflect new logic; coverage of removed features is appropriate.
  • src/modules/workbench-shell/ui/dashboard-workbench-logic.ts: Addition of compressionThresholdRatio is a clean mirror of the backend’s new 80% rule.
  • src/modules/workbench-shell/ui/dashboard-workbench.test.ts: Unit test for the new exposure is correct and sufficient.
  • src/modules/workbench-shell/ui/dashboard-workbench.tsx: Visual marker for compression threshold is consistent and uses the new ratio.
  • src/modules/workbench-shell/ui/runtime-thread-surface-state.test.ts: New tests for mapRunSummaryToContextUsage cover edge cases well; no issues.

Inline Downgraded Items (processed but not inline)

  • None

Coverage Status

  • Target files: 13
  • Covered files: 13
  • Uncovered files: 0
  • No-patch/binary covered as file-level: 0
  • Findings with unknown confidence (N/A): 0

Uncovered list:

  • None

No-patch covered list:

  • None

Runtime/Budget

  • Rounds used: 1/4
  • Planned batches: 1
  • Executed batches: 1
  • Sub-agent runs: 1
  • Planner calls: 1
  • Reviewer calls: 1
  • Model calls: 2/64
  • Structured-output summary-only degradation: NO

@github-actions github-actions Bot left a comment

Automated PR review completed.

  • Findings kept: 5
  • Findings with unknown confidence: 0
  • Inline comments attempted: 3
  • Target files: 4
  • Covered files: 4
  • Uncovered files: 0
    See the summary comment for detailed analysis and coverage details.

String::new()
};

let judge_task = format!(

This comment was marked as outdated.

"{}. `{}` called at {} (status: {})\n Scope: {}\n",
i + 1,
review.helper_kind,
&review.started_at[..review.started_at.len().min(19)],

This comment was marked as outdated.

// Conditionally include process compliance layer for goals that
// require reviews or phase-by-phase verification.
let process_compliance = if has_process_requirements(&goal.objective) {
build_process_compliance_summary(&self.pool, &goal.thread_id).await

This comment was marked as outdated.

…size()

Cherry-pick the master commit (a03d9ba) that bumps tiycore from 0.2.9
to 0.2.10-rc.2 and unifies context_size semantics across
RunUsageDto / frontend badge / auto-compression, removing the old
initial_context_calibration heuristic path. No file conflict with
the Judge work in this branch — the 25 files touched here do not
overlap with the 6 Judge files resolved in the previous merge.

@github-actions github-actions Bot left a comment

Automated PR review completed.

  • Findings kept: 5
  • Findings with unknown confidence: 0
  • Inline comments attempted: 5
  • Target files: 26
  • Covered files: 26
  • Uncovered files: 0
    See the summary comment for detailed analysis and coverage details.

// layer is typed against `Usage`; `context_size` is
// re-derived inside the `From` impl on read, so storing only
// the raw fields is sufficient and forward-compatible.
let usage = tiycore::types::Usage {

This comment was marked as outdated.

/// Check whether the goal objective contains process requirements (e.g.,
/// "review each phase", "每阶段验收"). When true, the Judge prompt will
/// include a process compliance layer showing the thread's review call history.
fn has_process_requirements(objective: &str) -> bool {

This comment was marked as outdated.

}
},
"required": ["task"]
}

This comment was marked as outdated.

// is a pure pass-through — no clone of messages, no LLM call, no
// DB access. This exercises the most common hot-path behaviour.
let messages = vec![make_user("hi"), make_assistant("hello")];
async fn returns_messages_unchanged_when_empty_or_dangling_weak() {

This comment was marked as outdated.


#[test]
fn message_end_usage_updates_consume_pending_prompt_estimate_once() {
fn message_end_usage_updates_record_observed_usage_once_per_change() {

This comment was marked as outdated.

@github-actions github-actions Bot left a comment

Automated PR review completed.

  • Findings kept: 1
  • Findings with unknown confidence: 0
  • Inline comments attempted: 1
  • Target files: 29
  • Covered files: 29
  • Uncovered files: 0
    See the summary comment for detailed analysis and coverage details.

goal_id = goal.id,
status = goal.status,
objective = goal.objective,
task_board_summary = task_board_summary,

This comment was marked as outdated.

…ments tests

Replace byte-index slicing with char-aware truncation to prevent
panics on multi-byte UTF-8 boundaries in timestamp formatting.

Add unit tests for `has_process_requirements()` covering English
and CJK keywords, substring match behaviour, edge cases, and
case-insensitive matching.
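
The substring and case-insensitive behaviour those tests cover can be illustrated with a small sketch. The keyword list here is an assumption: only "review" and "每阶段验收" are suggested by the doc comment on `has_process_requirements`, and the function name `has_process_requirements_sketch` is hypothetical. It also shows why the review bot flagged "preview" as a false positive.

```rust
/// Hypothetical keyword list; the real list in the PR may differ.
const PROCESS_KEYWORDS: &[&str] = &["review", "每阶段验收"];

/// Case-insensitive substring match of the objective against the keywords.
pub fn has_process_requirements_sketch(objective: &str) -> bool {
    // Lowercasing only affects cased scripts; CJK text is left unchanged.
    let lowered = objective.to_lowercase();
    PROCESS_KEYWORDS.iter().any(|kw| lowered.contains(kw))
}
```

With plain substring matching, an objective such as "Add a preview pane" matches because "preview" contains "review"; tightening the check to whole words would avoid that at the cost of more involved tokenisation.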

@github-actions github-actions Bot left a comment

Automated PR review completed.

  • Findings kept: 10
  • Findings with unknown confidence: 0
  • Inline comments attempted: 10
  • Target files: 29
  • Covered files: 29
  • Uncovered files: 0
    See the summary comment for detailed analysis and coverage details.

/// Build a human-readable summary of the task board state for the Judge.
/// Returns a string describing each step and its stage, or a note that no
/// task board exists.
async fn build_task_board_summary(pool: &sqlx::SqlitePool, thread_id: &str) -> String {

This comment was marked as outdated.

_ => {}
}
}
// Tool-based auto-pausing has been removed: status transitions are

This comment was marked as outdated.

}

// Completion claim without tool call
// Completion claim without tool call — keep nudging agent toward

This comment was marked as outdated.

// when the field is missing — e.g. older persisted snapshots
// written before the upgrade.
const explicitContextSize = run.usage.contextSize ?? 0;
const contextSize =

This comment was marked as outdated.

/// `lock_or_recover` poison-recovery helper so a panic in any consumer
/// cannot corrupt the trigger state.
fn record_observed_usage(last_observed_usage: &StdMutex<Option<Usage>>, usage: &Usage) {
if let Ok(mut guard) = last_observed_usage.lock() {

This comment was marked as outdated.
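
The doc comment above refers to a `lock_or_recover` poison-recovery helper, while the visible diff line uses `if let Ok(..) = ...lock()`, which skips the update when the mutex is poisoned. As a hedged sketch of what poison recovery usually looks like with the standard library: the helper body and the simplified `Usage` struct below are assumptions, not the code in this PR.

```rust
use std::sync::{Mutex, MutexGuard};

/// Hypothetical stand-in for `tiycore::types::Usage`.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Usage {
    pub input: u64,
    pub output: u64,
}

/// Acquire the lock even if a previous holder panicked. A poisoned
/// mutex still owns its data; `into_inner` hands back the guard.
pub fn lock_or_recover<T>(mutex: &Mutex<T>) -> MutexGuard<'_, T> {
    mutex.lock().unwrap_or_else(|poisoned| poisoned.into_inner())
}

/// Record the latest observed usage, recovering from poisoning instead
/// of silently dropping the update.
pub fn record_observed_usage(last_observed_usage: &Mutex<Option<Usage>>, usage: &Usage) {
    let mut guard = lock_or_recover(last_observed_usage);
    *guard = Some(usage.clone());
}
```

Recovering is reasonable here only because the stored value is replaced wholesale; state that could be left half-updated by a panic would need more care.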

let latest_historical_run =
// No longer read after the calibration seed was removed; the
// `find_latest_with_prompt_usage_by_thread_excluding_run` call
// above is kept as documentation that historical run usage is

This comment was marked as outdated.


let resumed = mgr.try_auto_resume().await.unwrap();
assert!(resumed, "ClarifyPending should auto-resume");
// No auto-resume path exists; status stays paused.

This comment was marked as outdated.

// when the field is missing — e.g. older persisted snapshots
// written before the upgrade.
const explicitContextSize = run.usage.contextSize ?? 0;
const contextSize =

This comment was marked as outdated.

// cache_write). Fall back to that sum when the field is missing
// (older payloads or hand-crafted events).
contextSize:
event.usage.contextSize > 0

This comment was marked as outdated.

// cache_write). Fall back to that sum when the field is missing
// (older payloads or hand-crafted events).
contextSize:
event.usage.contextSize > 0

This comment was marked as outdated.

jorben added 4 commits June 11, 2026 19:31
…trigger

Backend: replace fixed 16,384 token reserve with 20% of model context
window (min floor 16,384).  Small-window models keep the floor; GPT-4o
class windows reserve ~25.6K, Claude-class ~40K, 1M-window ~200K.

Frontend: add dashed threshold marker at 80% position in the thread
header context pill so users can see when auto-compression will fire.
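
The reserve arithmetic described above can be checked with a short sketch. The function and constant names are hypothetical. The window sizes 128,000 and 200,000 are inferred from the stated reserves (25.6K and 40K at 20%). Treating the compression threshold as "window minus reserve" is likewise an assumption.

```rust
/// Minimum reserve, from the commit message.
pub const MIN_RESERVE_TOKENS: u64 = 16_384;

/// Reserve 20% of the context window, but never less than the floor.
pub fn compression_reserve(context_window: u64) -> u64 {
    (context_window / 5).max(MIN_RESERVE_TOKENS)
}

/// Assumed trigger point: compress once usage exceeds window minus reserve.
pub fn compression_threshold(context_window: u64) -> u64 {
    context_window.saturating_sub(compression_reserve(context_window))
}
```

For windows large enough that the 20% rule applies, the threshold sits at 80% of the window, which lines up with the dashed marker in the frontend; for small windows held at the floor, the effective ratio under this assumption is lower than 80%.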

…llback

Add four integration tests in agent_session_execution.rs for the
Judge-prompt context builders (build_task_board_summary,
build_process_compliance_summary) covering absent boards, active/abandoned
board filtering, review-only helper filtering, status symbol mapping, and
200-char input truncation.

Add six unit tests in runtime-thread-surface-state.test.ts for
mapRunSummaryToContextUsage covering null input, explicit contextSize
precedence, fallback to per-bucket sum, and full-field passthrough.

Addresses review feedback from PR #227 (round 4):
- #1 New DB-backed Judge summary functions lack tests
- #4/#8 contextSize parsing/fallback logic lacks unit tests

- context_compression.rs: keep HEAD additions (compression_settings_reserves_twenty_percent_of_context_window,
  compression_settings_reserve_clamps_to_minimum_for_small_windows) and the shared
  should_compress_via_context_size_triggers_when_last_usage_exceeds_budget test body.

All other files auto-merged cleanly. Validation: cargo fmt --check, npm run typecheck,
npm run test:unit (853 passed), cargo test context_compression (34 passed),
cargo test judge_summary_tests (4 passed).

@github-actions github-actions Bot left a comment

Automated PR review completed.

  • Findings kept: 0
  • Findings with unknown confidence: 0
  • Inline comments attempted: 1
  • Target files: 13
  • Covered files: 13
  • Uncovered files: 0
    See the summary comment for detailed analysis and coverage details.

@@ -1612,21 +1612,23 @@ impl AgentSession {
tool_call_storage_id: &str,

Automated review completed for this PR diff. No concrete inline issue was selected after aggregation.
