feat(judge): ✨ replace self-attested goal completion with independent Judge subagent by jorben · Pull Request #227 · tiylabs/tiycode

jorben · 2026-06-11T02:52:01Z

Summary

Replace the self-attested goal_scored flow with an independent Judge acceptance subagent that evaluates the project state against the goal without relying on the main agent's self-report
Add new judge.md / output_contract.judge.md prompts and a typed JudgeRequest / JudgeReport contract, wired into execute_judge_tool
Drop the per-goal time_used_seconds field in favor of run-level elapsed tracking; remove the legacy mark_complete / complete verdict path
Make the task parameter on the subagent tool optional, fix UTF-8 safe truncation, and raise the builtin default max delegation depth to 5

Test Plan

cargo check --manifest-path src-tauri/Cargo.toml passes
cargo test --locked --manifest-path src-tauri/Cargo.toml — 811 passed / 0 failed
cargo fmt --check --manifest-path src-tauri/Cargo.toml passes
Run an end-to-end goal with the new Judge in the Tauri desktop app
Verify the rejection / re-verify path when the Judge returns passed=false

🤖 Generated with TiyCode

…udge acceptance agent Remove the `goal_scored` tool that allowed the main agent to self-attest goal completion, replacing it with an `agent_judge` built-in subagent that independently verifies goal attainment against the project's current state. Key changes: - Add `SubagentProfile::Judge` with read-only file tools and diagnostic-only shell (soft constraint via prompt) - Add `JudgeReport` structured contract (passed, completeness_pct, findings, summary) with safe fallback parsing - Add `agent_judge` tool injection only for the main agent when an unverified goal exists; runtime gate blocks subagent/parallel recursion into Judge - Add DB migration for `judge_passed`, `judge_completeness`, `judge_findings`, `judge_summary`, `judge_evaluated_run_id` columns with backfill for legacy `status='complete'` goals - Replace continuation stop condition: `Complete && judge_passed` instead of `goal_scored`-driven status flip - Rewrite continuation prompt to instruct main agent to call `agent_judge` and follow findings on rejection - Add Judge prompt surface, templates, and output contract - Update `active_goal.tpl.md` to reflect Judge acceptance flow - Extend goal lifecycle tests for Judge pass/fail/legacy compat

Remove the mark_complete pathway from goals as completion will be handled through a different mechanism: - Remove mark_complete method from GoalManager - Remove "complete" from GoalEvaluateResult verdict type - Remove mark_complete test cases (evidence validation, etc.) - Update subagent surface comments to include judge BREAKING CHANGE: GoalEvaluateResult.verdict no longer includes "complete"

Update the feature descriptions and reorder the bullet points in both README.md and README_zh.md to better reflect the current product capabilities and improve readability. Changes include: - Reordering features to highlight persistent goal management, real-time streaming, and extensibility earlier in the list - Updating descriptions for several features to be more accurate - Maintaining consistency between English and Chinese versions - Keeping the overall structure while improving flow These are documentation-only changes that do not affect functionality.

- Extract inline status key resolution into a pure exported function so the complete→verified (judgePassed) branch can be unit-tested without mounting the component - Add unit tests covering all status mappings and judgePassed variants - Add test for skipped verdict passthrough in goalEvaluate

Raise `BUILTIN_DEFAULT_MAX_DELEGATION_DEPTH` from 3 to 5 to match the existing `GLOBAL_MAX_DELEGATION_DEPTH`, allowing built-in subagents (explore/review) to be delegated to the same depth as custom profiles. Update delegation validation tests to reflect the new depth limits.

…idelines

…n-level elapsed tracking

…eness

- Downgrade Judge prompt versions from 2 to 1 (likely a revert of unintended bump) - Change `task` field from required to optional in Judge tool schema, with updated description clarifying it is an optional note - Replace byte-based truncation with character-safe truncation to avoid panicking on multi-byte UTF-8 in process compliance summary - Simplify Judge request validation to only check input validity, discarding the parsed result used only for backward compatibility - Skip abandoned task boards when building summary to focus on relevant goal state

Merge origin/master (fe7fbfc) into refact/goal-llm-judugement. Resolution philosophy: keep the Judge independent-verifier stance from our 4481759 (Judge evaluates project state alone, no main-agent self-assessment is injected into the prompt). Where master added genuine improvements, absorb them: - agent_session_execution.rs: keep task_board + process_compliance injection from our branch; add master's prior-verdict block sourced from goal.judge_summary / judge_findings as objective re-verification context (not main-agent input). - judge.md: keep our independent-auditor framing; absorb master's size-first verification structure (small/medium/large change budget) as a pure strategy improvement. Keep our call-chain verification and pre-existing / prior-judge prohibition clauses. - judge_contract.rs, output_contract.judge.md, active_goal.tpl.md, runtime_orchestration.rs: keep our versions (task optional, agent_judge() no-arg call, no "required" on task). Co-authored-by: merge-resolver

github-actions · 2026-06-11T02:55:10Z

AI Code Review Summary

PR: #227 (feat(judge): ✨ replace self-attested goal completion with independent Judge subagent)
Preferred language: English

Overall Assessment

No blocking issue was detected in the reviewed diff; keep focused regression testing before merge.

Major Findings by Severity

No major issues identified from the reviewed diff.

Actionable Suggestions

Restore a DB status change for budget‑limited goals (or add a visible indicator) to prevent confusion.
Consider tightening has_process_requirements keywords to avoid “preview” false positives.
If auto‑resume removal is permanent, update user documentation and release notes so users are not surprised.

Potential Risks

UI shows active goal after token budget hit, but agent stops – users may not understand why.
Process compliance layer may flag goals that never needed reviews, causing noise.

Test Suggestions

Add integration test for BudgetLimited verdict propagation through run loop.
Add test uncovering that a goal with “preview” triggers compliance layer unintentionally.

File-Level Coverage Notes

src-tauri/src/core/agent_session_execution.rs: Judge context enrichment (task board, process compliance) improves independence; new helpers are well‑tested. Removing the agent note from the prompt is intentional. Minor risk from loose keyword matching.
src-tauri/src/core/context_compression.rs: Dynamic reserve calculation (20% of context window, min floor) is correct and well‑tested. No regressions.
src-tauri/src/core/goal_manager.rs: Removal of auto‑resume and tool‑based auto‑pause is a deliberate design change that may confuse users expecting the old automatic behaviour. Budget‑limited verdict no longer updates status, which could hide the stop reason.
src-tauri/src/core/subagent/judge_contract.rs: Change to allow empty task is consistent with discarding the agent note; tests updated.
src-tauri/src/core/subagent/runtime_orchestration.rs: Tool schema update aligns with contract change; description now accurate.
src-tauri/src/model/goal.rs: Removal of auto_resume_on_user_message matches the behaviour change in goal manager.
src-tauri/src/persistence/repo/run_helper_repo.rs: New list_by_thread_id is correct and tested; no issues.
src-tauri/src/persistence/repo/run_repo.rs: Fix to accumulate elapsed time on interruption is a welcome improvement.
src-tauri/tests/goal_lifecycle.rs: Integration tests updated to reflect new logic; coverage of removed features is appropriate.
src/modules/workbench-shell/ui/dashboard-workbench-logic.ts: Addition of compressionThresholdRatio is a clean mirror of the backend’s new 80% rule.
src/modules/workbench-shell/ui/dashboard-workbench.test.ts: Unit test for the new exposure is correct and sufficient.
src/modules/workbench-shell/ui/dashboard-workbench.tsx: Visual marker for compression threshold is consistent and uses the new ratio.
src/modules/workbench-shell/ui/runtime-thread-surface-state.test.ts: New tests for mapRunSummaryToContextUsage cover edge cases well; no issues.

Inline Downgraded Items (processed but not inline)

None

Coverage Status

Target files: 13
Covered files: 13
Uncovered files: 0
No-patch/binary covered as file-level: 0
Findings with unknown confidence (N/A): 0

Uncovered list:

None

No-patch covered list:

None

Runtime/Budget

Rounds used: 1/4
Planned batches: 1
Executed batches: 1
Sub-agent runs: 1
Planner calls: 1
Reviewer calls: 1
Model calls: 2/64
Structured-output summary-only degradation: NO

github-actions

Automated PR review completed.

Findings kept: 5
Findings with unknown confidence: 0
Inline comments attempted: 3
Target files: 4
Covered files: 4
Uncovered files: 0
See the summary comment for detailed analysis and coverage details.

Sign in to view

+            String::new()
+        };

        let judge_task = format!(


Sign in to view

+            "{}. `{}` called at {} (status: {})\n   Scope: {}\n",
+            i + 1,
+            review.helper_kind,
+            &review.started_at[..review.started_at.len().min(19)],


Sign in to view

+        // Conditionally include process compliance layer for goals that
+        // require reviews or phase-by-phase verification.
+        let process_compliance = if has_process_requirements(&goal.objective) {
+            build_process_compliance_summary(&self.pool, &goal.thread_id).await


…size() Cherry-pick the master commit (a03d9ba) that bumps tiycore from 0.2.9 to 0.2.10-rc.2 and unifies context_size semantics across RunUsageDto / frontend badge / auto-compression, removing the old initial_context_calibration heuristic path. No file conflict with the Judge work in this branch — the 25 files touched here do not overlap with the 6 Judge files resolved in the previous merge.

github-actions

Automated PR review completed.

Findings kept: 5
Findings with unknown confidence: 0
Inline comments attempted: 5
Target files: 26
Covered files: 26
Uncovered files: 0
See the summary comment for detailed analysis and coverage details.

Sign in to view

+                // layer is typed against `Usage`; `context_size` is
+                // re-derived inside the `From` impl on read, so storing only
+                // the raw fields is sufficient and forward-compatible.
                let usage = tiycore::types::Usage {


Sign in to view

+/// Check whether the goal objective contains process requirements (e.g.,
+/// "review each phase", "每阶段验收"). When true, the Judge prompt will
+/// include a process compliance layer showing the thread's review call history.
+fn has_process_requirements(objective: &str) -> bool {


Sign in to view

                    }
-                },
-                "required": ["task"]
+                }


Sign in to view

-            // is a pure pass-through — no clone of messages, no LLM call, no
-            // DB access. This exercises the most common hot-path behaviour.
-            let messages = vec![make_user("hi"), make_assistant("hello")];
+        async fn returns_messages_unchanged_when_empty_or_dangling_weak() {


Sign in to view


    #[test]
-    fn message_end_usage_updates_consume_pending_prompt_estimate_once() {
+    fn message_end_usage_updates_record_observed_usage_once_per_change() {


… and Judge verdicts

github-actions

Automated PR review completed.

Findings kept: 1
Findings with unknown confidence: 0
Inline comments attempted: 1
Target files: 29
Covered files: 29
Uncovered files: 0
See the summary comment for detailed analysis and coverage details.

Sign in to view

-            goal_id = goal.id,
-            status = goal.status,
            objective = goal.objective,
+            task_board_summary = task_board_summary,


…ments tests Replace byte-index slicing with char-aware truncation to prevent panics on multi-byte UTF-8 boundaries in timestamp formatting. Add unit tests for `has_process_requirements()` covering English and CJK keywords, substring match behaviour, edge cases, and case-insensitive matching.

github-actions

Automated PR review completed.

Findings kept: 10
Findings with unknown confidence: 0
Inline comments attempted: 10
Target files: 29
Covered files: 29
Uncovered files: 0
See the summary comment for detailed analysis and coverage details.

Sign in to view

+/// Build a human-readable summary of the task board state for the Judge.
+/// Returns a string describing each step and its stage, or a note that no
+/// task board exists.
+async fn build_task_board_summary(pool: &sqlx::SqlitePool, thread_id: &str) -> String {


Sign in to view

-                _ => {}
-            }
-        }
+        // Tool-based auto-pausing has been removed: status transitions are


Sign in to view

            }

-            // Completion claim without tool call
+            // Completion claim without tool call — keep nudging agent toward


Sign in to view

+  // when the field is missing — e.g. older persisted snapshots
+  // written before the upgrade.
+  const explicitContextSize = run.usage.contextSize ?? 0;
+  const contextSize =


Sign in to view

+/// `lock_or_recover` poison-recovery helper so a panic in any consumer
+/// cannot corrupt the trigger state.
+fn record_observed_usage(last_observed_usage: &StdMutex<Option<Usage>>, usage: &Usage) {
+    if let Ok(mut guard) = last_observed_usage.lock() {


Sign in to view

-    let latest_historical_run =
+    // No longer read after the calibration seed was removed; the
+    // `find_latest_with_prompt_usage_by_thread_excluding_run` call
+    // above is kept as documentation that historical run usage is


Sign in to view


-        let resumed = mgr.try_auto_resume().await.unwrap();
-        assert!(resumed, "ClarifyPending should auto-resume");
+        // No auto-resume path exists; status stays paused.


Sign in to view

+  // when the field is missing — e.g. older persisted snapshots
+  // written before the upgrade.
+  const explicitContextSize = run.usage.contextSize ?? 0;
+  const contextSize =


Sign in to view

+          // cache_write). Fall back to that sum when the field is missing
+          // (older payloads or hand-crafted events).
+          contextSize:
+            event.usage.contextSize > 0


Sign in to view

+          // cache_write). Fall back to that sum when the field is missing
+          // (older payloads or hand-crafted events).
+          contextSize:
+            event.usage.contextSize > 0


…trigger Backend: replace fixed 16,384 token reserve with 20% of model context window (min floor 16,384). Small-window models keep the floor; GPT-4o class windows reserve ~25.6K, Claude-class ~40K, 1M-window ~200K. Frontend: add dashed threshold marker at 80% position in the thread header context pill so users can see when auto-compression will fire.

…llback Add four integration tests in agent_session_execution.rs for the Judge-prompt context builders (build_task_board_summary, build_process_compliance_summary) covering absent boards, active/abandoned board filtering, review-only helper filtering, status symbol mapping, and 200-char input truncation. Add six unit tests in runtime-thread-surface-state.test.ts for mapRunSummaryToContextUsage covering null input, explicit contextSize precedence, fallback to per-bucket sum, and full-field passthrough. Addresses review feedback from PR #227 (round 4): - #1 New DB-backed Judge summary functions lack tests - #4/#8 contextSize parsing/fallback logic lack unit tests

- context_compression.rs: keep HEAD additions (compression_settings_reserves_twenty_percent_of_context_window, compression_settings_reserve_clamps_to_minimum_for_small_windows) and the shared should_compress_via_context_size_triggers_when_last_usage_exceeds_budget test body. All other files auto-merged cleanly. Validation: cargo fmt --check, npm run typecheck, npm run test:unit (853 passed), cargo test context_compression (34 passed), cargo test judge_summary_tests (4 passed).

github-actions

Automated PR review completed.

Findings kept: 0
Findings with unknown confidence: 0
Inline comments attempted: 1
Target files: 13
Covered files: 13
Uncovered files: 0
See the summary comment for detailed analysis and coverage details.

github-actions · 2026-06-12T05:10:09Z

@@ -1612,21 +1612,23 @@ impl AgentSession {
        tool_call_storage_id: &str,


Automated review completed for this PR diff. No concrete inline issue was selected after aggregation.

jorben added 11 commits June 7, 2026 11:54

docs: 📝 remove obsolete design document

c15e885

docs(judge): 📝 add size-first verification strategy and delegation gu…

d60daec

…idelines

refactor(goal): ♻️ remove goal-level time_used_seconds in favor of ru…

dc8fca0

…n-level elapsed tracking

feat(judge): ✨ redesign Judge evaluation for independence and complet…

4481759

…eness

github-actions Bot reviewed Jun 11, 2026

View reviewed changes

refactor(goal): ♻️ centralize status transitions to explicit commands…

afd221e

… and Judge verdicts

github-actions Bot reviewed Jun 11, 2026

View reviewed changes

Comment thread src-tauri/src/core/agent_session_execution.rs

goal_id = goal.id,

status = goal.status,

objective = goal.objective,

task_board_summary = task_board_summary,

This comment was marked as outdated.

Sign in to view

github-actions Bot reviewed Jun 11, 2026

View reviewed changes

jorben added 4 commits June 11, 2026 19:31

fix(run): 🐛 record elapsed running time when interrupting active runs

73c7cb5

github-actions Bot reviewed Jun 12, 2026

View reviewed changes

		@@ -1612,21 +1612,23 @@ impl AgentSession {
		tool_call_storage_id: &str,

Conversation

jorben commented Jun 11, 2026

Summary

Test Plan

Uh oh!

github-actions Bot commented Jun 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

AI Code Review Summary

Overall Assessment

Major Findings by Severity

Actionable Suggestions

Potential Risks

Test Suggestions

File-Level Coverage Notes

Inline Downgraded Items (processed but not inline)

Coverage Status

Runtime/Budget

Uh oh!

github-actions Bot left a comment

Choose a reason for hiding this comment

Uh oh!

This comment was marked as outdated.

Uh oh!

This comment was marked as outdated.

Uh oh!

This comment was marked as outdated.

Uh oh!

github-actions Bot left a comment

Choose a reason for hiding this comment

Uh oh!

This comment was marked as outdated.

Uh oh!

This comment was marked as outdated.

Uh oh!

This comment was marked as outdated.

Uh oh!

This comment was marked as outdated.

Uh oh!

This comment was marked as outdated.

Uh oh!

github-actions Bot left a comment

Choose a reason for hiding this comment

Uh oh!

This comment was marked as outdated.

Uh oh!

github-actions Bot left a comment

Choose a reason for hiding this comment

Uh oh!

This comment was marked as outdated.

Uh oh!

This comment was marked as outdated.

Uh oh!

This comment was marked as outdated.

Uh oh!

This comment was marked as outdated.

Uh oh!

This comment was marked as outdated.

Uh oh!

This comment was marked as outdated.

Uh oh!

This comment was marked as outdated.

Uh oh!

This comment was marked as outdated.

Uh oh!

This comment was marked as outdated.

Uh oh!

This comment was marked as outdated.

Uh oh!

github-actions Bot left a comment

Choose a reason for hiding this comment

Uh oh!

github-actions Bot Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

github-actions Bot commented Jun 11, 2026 •

edited

Loading