feat(judge): ✨ replace self-attested goal completion with independent Judge subagent#227
feat(judge): ✨ replace self-attested goal completion with independent Judge subagent#227jorben wants to merge 18 commits into
Conversation
…udge acceptance agent Remove the `goal_scored` tool that allowed the main agent to self-attest goal completion, replacing it with an `agent_judge` built-in subagent that independently verifies goal attainment against the project's current state. Key changes: - Add `SubagentProfile::Judge` with read-only file tools and diagnostic-only shell (soft constraint via prompt) - Add `JudgeReport` structured contract (passed, completeness_pct, findings, summary) with safe fallback parsing - Add `agent_judge` tool injection only for the main agent when an unverified goal exists; runtime gate blocks subagent/parallel recursion into Judge - Add DB migration for `judge_passed`, `judge_completeness`, `judge_findings`, `judge_summary`, `judge_evaluated_run_id` columns with backfill for legacy `status='complete'` goals - Replace continuation stop condition: `Complete && judge_passed` instead of `goal_scored`-driven status flip - Rewrite continuation prompt to instruct main agent to call `agent_judge` and follow findings on rejection - Add Judge prompt surface, templates, and output contract - Update `active_goal.tpl.md` to reflect Judge acceptance flow - Extend goal lifecycle tests for Judge pass/fail/legacy compat
Remove the mark_complete pathway from goals as completion will be handled through a different mechanism: - Remove mark_complete method from GoalManager - Remove "complete" from GoalEvaluateResult verdict type - Remove mark_complete test cases (evidence validation, etc.) - Update subagent surface comments to include judge BREAKING CHANGE: GoalEvaluateResult.verdict no longer includes "complete"
Update the feature descriptions and reorder the bullet points in both README.md and README_zh.md to better reflect the current product capabilities and improve readability. Changes include: - Reordering features to highlight persistent goal management, real-time streaming, and extensibility earlier in the list - Updating descriptions for several features to be more accurate - Maintaining consistency between English and Chinese versions - Keeping the overall structure while improving flow These are documentation-only changes that do not affect functionality.
- Extract inline status key resolution into a pure exported function so the complete→verified (judgePassed) branch can be unit-tested without mounting the component - Add unit tests covering all status mappings and judgePassed variants - Add test for skipped verdict passthrough in goalEvaluate
Raise `BUILTIN_DEFAULT_MAX_DELEGATION_DEPTH` from 3 to 5 to match the existing `GLOBAL_MAX_DELEGATION_DEPTH`, allowing built-in subagents (explore/review) to be delegated to the same depth as custom profiles. Update delegation validation tests to reflect the new depth limits.
…n-level elapsed tracking
- Downgrade Judge prompt versions from 2 to 1 (likely a revert of unintended bump) - Change `task` field from required to optional in Judge tool schema, with updated description clarifying it is an optional note - Replace byte-based truncation with character-safe truncation to avoid panicking on multi-byte UTF-8 in process compliance summary - Simplify Judge request validation to only check input validity, discarding the parsed result used only for backward compatibility - Skip abandoned task boards when building summary to focus on relevant goal state
Merge origin/master (fe7fbfc) into refact/goal-llm-judugement. Resolution philosophy: keep the Judge independent-verifier stance from our 4481759 (Judge evaluates project state alone, no main-agent self-assessment is injected into the prompt). Where master added genuine improvements, absorb them: - agent_session_execution.rs: keep task_board + process_compliance injection from our branch; add master's prior-verdict block sourced from goal.judge_summary / judge_findings as objective re-verification context (not main-agent input). - judge.md: keep our independent-auditor framing; absorb master's size-first verification structure (small/medium/large change budget) as a pure strategy improvement. Keep our call-chain verification and pre-existing / prior-judge prohibition clauses. - judge_contract.rs, output_contract.judge.md, active_goal.tpl.md, runtime_orchestration.rs: keep our versions (task optional, agent_judge() no-arg call, no "required" on task). Co-authored-by: merge-resolver
AI Code Review SummaryPR: #227 (feat(judge): ✨ replace self-attested goal completion with independent Judge subagent) Overall AssessmentNo blocking issue was detected in the reviewed diff; keep focused regression testing before merge. Major Findings by SeverityNo major issues identified from the reviewed diff. Actionable Suggestions
Potential Risks
Test Suggestions
File-Level Coverage Notes
Inline Downgraded Items (processed but not inline)
Coverage Status
Uncovered list:
No-patch covered list:
Runtime/Budget
|
| String::new() | ||
| }; | ||
|
|
||
| let judge_task = format!( |
This comment was marked as outdated.
This comment was marked as outdated.
Sorry, something went wrong.
| "{}. `{}` called at {} (status: {})\n Scope: {}\n", | ||
| i + 1, | ||
| review.helper_kind, | ||
| &review.started_at[..review.started_at.len().min(19)], |
This comment was marked as outdated.
This comment was marked as outdated.
Sorry, something went wrong.
| // Conditionally include process compliance layer for goals that | ||
| // require reviews or phase-by-phase verification. | ||
| let process_compliance = if has_process_requirements(&goal.objective) { | ||
| build_process_compliance_summary(&self.pool, &goal.thread_id).await |
This comment was marked as outdated.
This comment was marked as outdated.
Sorry, something went wrong.
…size() Cherry-pick the master commit (a03d9ba) that bumps tiycore from 0.2.9 to 0.2.10-rc.2 and unifies context_size semantics across RunUsageDto / frontend badge / auto-compression, removing the old initial_context_calibration heuristic path. No file conflict with the Judge work in this branch — the 25 files touched here do not overlap with the 6 Judge files resolved in the previous merge.
| // layer is typed against `Usage`; `context_size` is | ||
| // re-derived inside the `From` impl on read, so storing only | ||
| // the raw fields is sufficient and forward-compatible. | ||
| let usage = tiycore::types::Usage { |
This comment was marked as outdated.
This comment was marked as outdated.
Sorry, something went wrong.
| /// Check whether the goal objective contains process requirements (e.g., | ||
| /// "review each phase", "每阶段验收"). When true, the Judge prompt will | ||
| /// include a process compliance layer showing the thread's review call history. | ||
| fn has_process_requirements(objective: &str) -> bool { |
This comment was marked as outdated.
This comment was marked as outdated.
Sorry, something went wrong.
| } | ||
| }, | ||
| "required": ["task"] | ||
| } |
This comment was marked as outdated.
This comment was marked as outdated.
Sorry, something went wrong.
| // is a pure pass-through — no clone of messages, no LLM call, no | ||
| // DB access. This exercises the most common hot-path behaviour. | ||
| let messages = vec![make_user("hi"), make_assistant("hello")]; | ||
| async fn returns_messages_unchanged_when_empty_or_dangling_weak() { |
This comment was marked as outdated.
This comment was marked as outdated.
Sorry, something went wrong.
|
|
||
| #[test] | ||
| fn message_end_usage_updates_consume_pending_prompt_estimate_once() { | ||
| fn message_end_usage_updates_record_observed_usage_once_per_change() { |
This comment was marked as outdated.
This comment was marked as outdated.
Sorry, something went wrong.
… and Judge verdicts
| goal_id = goal.id, | ||
| status = goal.status, | ||
| objective = goal.objective, | ||
| task_board_summary = task_board_summary, |
This comment was marked as outdated.
This comment was marked as outdated.
Sorry, something went wrong.
…ments tests Replace byte-index slicing with char-aware truncation to prevent panics on multi-byte UTF-8 boundaries in timestamp formatting. Add unit tests for `has_process_requirements()` covering English and CJK keywords, substring match behaviour, edge cases, and case-insensitive matching.
| /// Build a human-readable summary of the task board state for the Judge. | ||
| /// Returns a string describing each step and its stage, or a note that no | ||
| /// task board exists. | ||
| async fn build_task_board_summary(pool: &sqlx::SqlitePool, thread_id: &str) -> String { |
This comment was marked as outdated.
This comment was marked as outdated.
Sorry, something went wrong.
| _ => {} | ||
| } | ||
| } | ||
| // Tool-based auto-pausing has been removed: status transitions are |
This comment was marked as outdated.
This comment was marked as outdated.
Sorry, something went wrong.
| } | ||
|
|
||
| // Completion claim without tool call | ||
| // Completion claim without tool call — keep nudging agent toward |
This comment was marked as outdated.
This comment was marked as outdated.
Sorry, something went wrong.
| // when the field is missing — e.g. older persisted snapshots | ||
| // written before the upgrade. | ||
| const explicitContextSize = run.usage.contextSize ?? 0; | ||
| const contextSize = |
This comment was marked as outdated.
This comment was marked as outdated.
Sorry, something went wrong.
| /// `lock_or_recover` poison-recovery helper so a panic in any consumer | ||
| /// cannot corrupt the trigger state. | ||
| fn record_observed_usage(last_observed_usage: &StdMutex<Option<Usage>>, usage: &Usage) { | ||
| if let Ok(mut guard) = last_observed_usage.lock() { |
This comment was marked as outdated.
This comment was marked as outdated.
Sorry, something went wrong.
| let latest_historical_run = | ||
| // No longer read after the calibration seed was removed; the | ||
| // `find_latest_with_prompt_usage_by_thread_excluding_run` call | ||
| // above is kept as documentation that historical run usage is |
This comment was marked as outdated.
This comment was marked as outdated.
Sorry, something went wrong.
|
|
||
| let resumed = mgr.try_auto_resume().await.unwrap(); | ||
| assert!(resumed, "ClarifyPending should auto-resume"); | ||
| // No auto-resume path exists; status stays paused. |
This comment was marked as outdated.
This comment was marked as outdated.
Sorry, something went wrong.
| // when the field is missing — e.g. older persisted snapshots | ||
| // written before the upgrade. | ||
| const explicitContextSize = run.usage.contextSize ?? 0; | ||
| const contextSize = |
This comment was marked as outdated.
This comment was marked as outdated.
Sorry, something went wrong.
| // cache_write). Fall back to that sum when the field is missing | ||
| // (older payloads or hand-crafted events). | ||
| contextSize: | ||
| event.usage.contextSize > 0 |
This comment was marked as outdated.
This comment was marked as outdated.
Sorry, something went wrong.
| // cache_write). Fall back to that sum when the field is missing | ||
| // (older payloads or hand-crafted events). | ||
| contextSize: | ||
| event.usage.contextSize > 0 |
This comment was marked as outdated.
This comment was marked as outdated.
Sorry, something went wrong.
…trigger Backend: replace fixed 16,384 token reserve with 20% of model context window (min floor 16,384). Small-window models keep the floor; GPT-4o class windows reserve ~25.6K, Claude-class ~40K, 1M-window ~200K. Frontend: add dashed threshold marker at 80% position in the thread header context pill so users can see when auto-compression will fire.
…llback Add four integration tests in agent_session_execution.rs for the Judge-prompt context builders (build_task_board_summary, build_process_compliance_summary) covering absent boards, active/abandoned board filtering, review-only helper filtering, status symbol mapping, and 200-char input truncation. Add six unit tests in runtime-thread-surface-state.test.ts for mapRunSummaryToContextUsage covering null input, explicit contextSize precedence, fallback to per-bucket sum, and full-field passthrough. Addresses review feedback from PR #227 (round 4): - #1 New DB-backed Judge summary functions lack tests - #4/#8 contextSize parsing/fallback logic lack unit tests
- context_compression.rs: keep HEAD additions (compression_settings_reserves_twenty_percent_of_context_window, compression_settings_reserve_clamps_to_minimum_for_small_windows) and the shared should_compress_via_context_size_triggers_when_last_usage_exceeds_budget test body. All other files auto-merged cleanly. Validation: cargo fmt --check, npm run typecheck, npm run test:unit (853 passed), cargo test context_compression (34 passed), cargo test judge_summary_tests (4 passed).
| @@ -1612,21 +1612,23 @@ impl AgentSession { | |||
| tool_call_storage_id: &str, | |||
There was a problem hiding this comment.
Automated review completed for this PR diff. No concrete inline issue was selected after aggregation.
Summary
goal_scoredflow with an independent Judge acceptance subagent that evaluates the project state against the goal without relying on the main agent's self-reportjudge.md/output_contract.judge.mdprompts and a typedJudgeRequest/JudgeReportcontract, wired intoexecute_judge_tooltime_used_secondsfield in favor of run-level elapsed tracking; remove the legacymark_complete/completeverdict pathtaskparameter on the subagent tool optional, fix UTF-8 safe truncation, and raise the builtin default max delegation depth to 5Test Plan
cargo check --manifest-path src-tauri/Cargo.tomlpassescargo test --locked --manifest-path src-tauri/Cargo.toml— 811 passed / 0 failedcargo fmt --check --manifest-path src-tauri/Cargo.tomlpassespassed=false🤖 Generated with TiyCode