Skip to content

[CLI/Core] Polymorphic agent dispatch + TestRun owns trial counter#933

Merged
podkidyshev merged 4 commits into
NVIDIA:mainfrom
rutayan-nv:rpatro/agent-run-polymorphism
Jun 23, 2026
Merged

[CLI/Core] Polymorphic agent dispatch + TestRun owns trial counter#933
podkidyshev merged 4 commits into
NVIDIA:mainfrom
rutayan-nv:rpatro/agent-run-polymorphism

Conversation

@rutayan-nv

Copy link
Copy Markdown
Contributor

Issue

  • TestRun.step had multiple writers; RLlib's frequent reset() collapsed every trial onto step=1, overwriting trajectory.csv/env.csv rows.
  • The DSE loop lacked a uniform agent entry point and didn't distinguish recoverable failures from bugs.

Fix

  • BaseAgent.run() is the single polymorphic entry point; handle_dse_job is just err |= agent.run() — agents implement their own training loop in run().
  • TestRun.increment_step() is the sole mutator, called only by CloudAIGymEnv.step() (advances before output_path/trajectory are computed).
  • Hybrid failure contract: a non-zero rc accumulates and the sweep continues; an unexpected exception hard-fails (reports still generated, error written to <scenario_root>/dse_failure.txt, then re-raised).

Testing

pytest tests green; ruff + pyright + vulture clean.

Replaces #893 (branch renamed custom-training-loop-dispatchagent-run-polymorphism; history rebuilt into 2 clean commits, no add-then-remove churn). Stack base for #900#901#927#928#930#932.

@coderabbitai

coderabbitai Bot commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 7b860695-6d85-4725-8bc9-3c794d0bb371

📥 Commits

Reviewing files that changed from the base of the PR and between 88d35c5 and 13355ba.

📒 Files selected for processing (7)
  • src/cloudai/_core/test_scenario.py
  • src/cloudai/cli/handlers.py
  • src/cloudai/configurator/base_agent.py
  • src/cloudai/configurator/cloudai_gym.py
  • tests/test_cloudaigym.py
  • tests/test_handlers.py
  • tests/test_test_scenario.py

📝 Walkthrough

Walkthrough

TestRun gains an increment_step() method called at the start of CloudAIGymEnv.step(). BaseAgent gains a default run() loop that orchestrates reset/select_action/env.step/update_policy. handle_dse_job is refactored to call agent.run() per TestRun, with exception capture, re-raise after report generation, and a new _record_run_failure helper that writes dse_failure.txt. generate_reports accepts an optional error parameter.

Changes

DSE Agent Run Delegation and Failure Reporting

Layer / File(s) Summary
TestRun.increment_step() and CloudAIGymEnv integration
src/cloudai/_core/test_scenario.py, src/cloudai/configurator/cloudai_gym.py, tests/test_cloudaigym.py, tests/test_test_scenario.py
increment_step() is added to TestRun and called at the start of CloudAIGymEnv.step(); gym and scenario tests are updated to assert that step is advanced before output paths and trajectory rows are recorded.
BaseAgent.run() default exploration loop
src/cloudai/configurator/base_agent.py
Adds logging import, updates select_action() signature to return tuple[int, dict[str, Any]] | None, and implements BaseAgent.run() which resets the environment, iterates select_action/env.step/update_policy up to max_steps with per-step logging, and returns 0.
handle_dse_job delegation and failure reporting
src/cloudai/cli/handlers.py
handle_dse_job now calls agent.run() per TestRun inside a try/except; on exception it calls generate_reports with the error then re-raises. _record_run_failure writes dse_failure.txt with exception details. generate_reports accepts an optional error parameter.
handle_dse_job agent.run() delegation tests
tests/test_handlers.py
Adds CustomRunStubAgent, custom_run_agent_name fixture, and six tests covering successful delegation, non-zero return code propagation and accumulation, exception propagation, abort-on-exception, and dse_failure.txt generation before re-raise.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐇 Hippity-hop, the agent now leads,
increment_step() plants the step-counting seeds.
On failure we write, then re-raise with care,
dse_failure.txt floats gently through the air.
The tests all confirm what the rabbit decreed —
each run() delegates at bunny-fast speed! 🎉

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and concisely describes the two main architectural changes: polymorphic agent dispatch via BaseAgent.run() and centralized step mutation via TestRun.increment_step().
Description check ✅ Passed The description is directly related to the changeset, explaining the issue, the fix, and the testing approach with clear connection to the code changes across all modified files.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/cloudai/configurator/base_agent.py`:
- Around line 143-145: The run() method checks if select_action() returns None
and uses this as a stop signal, but the abstract select_action() method's return
type annotation and documentation do not declare None as a valid return value.
Update the return type annotation of the select_action() method to include None
(e.g., change from tuple[int, dict[str, Any]] to tuple[int, dict[str, Any]] |
None), and update its docstring to document that returning None signals
termination of the agent loop. This aligns the method's declared contract with
how it is actually used in run().
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: f47da465-3eed-4c1e-88f9-d4d27b909176

📥 Commits

Reviewing files that changed from the base of the PR and between 1b0e8cc and 9188272.

📒 Files selected for processing (7)
  • src/cloudai/_core/test_scenario.py
  • src/cloudai/cli/handlers.py
  • src/cloudai/configurator/base_agent.py
  • src/cloudai/configurator/cloudai_gym.py
  • tests/test_cloudaigym.py
  • tests/test_handlers.py
  • tests/test_test_scenario.py

Comment thread src/cloudai/configurator/base_agent.py
This was referenced Jun 16, 2026
@rutayan-nv

Copy link
Copy Markdown
Contributor Author

@coderabbitai review

@coderabbitai

coderabbitai Bot commented Jun 16, 2026

Copy link
Copy Markdown
Contributor
✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@rutayan-nv

Copy link
Copy Markdown
Contributor Author

@podkidyshev — this PR supersedes the closed #893 (head branch was renamed, which auto-closed it). Carrying over your two review threads from there since GitHub stranded them on the closed PR:

1. handlers.py:121 — move out of handlers / mirror test structure
Resolved by the agent.run() polymorphism refactor: the custom-loop dispatch, the CustomTrainingLoopAgent Protocol, and the TypeGuard are gone from handlers.py. handle_dse_job now collapses to err |= agent.run() and agents that drive their own loop just override run(). Since no agent-loop logic remains in handlers.py, the only tests left are handle_dse_job delegation tests, which stay in tests/test_handlers.py (per your "not related to tests against handle_dse_job" note). If you'd still like a tests/configurator/test_base_agent.py split, say the word.

2. handlers.py:215 — early return err on err > 0? — still your call.
Current behavior is a hybrid: a graceful non-zero rc from agent.run() is accumulated (err |= ...) so the remaining independent test_runs still execute and produce results/reports, and the handler returns non-zero if any failed. An unexpected exception (a bug) is a hard-fail — it's caught only long enough to generate reports, then re-raised. Happy to switch to fail-fast for the whole scenario if you prefer that semantics.

@rutayan-nv

Copy link
Copy Markdown
Contributor Author

@podkidyshev — re: item 2 above (early return err on err > 0):

This is resolved — handled as a hybrid policy in handle_dse_job:

  • Recoverable failures (non-zero rc from agent.run()) accumulate via err |= agent.run() and the sweep continues; the accumulated err is returned. No early-return.
  • An unexpected exception is a hard-fail: it's captured, reports are generated (generate_reports(error=...), which writes dse_failure.txt with the traceback), then the exception is re-raised.

Lives on #933 (agent-run-polymorphism), inherited up the stack; routing + the mixed-job guard are on #934. Covered by test_handle_dse_job_accumulates_nonzero_rc_and_continues, ..._propagates_agent_run_exception, ..._hard_fail_aborts_remaining_runs, and ..._documents_failure_in_reports_before_raising. No further decision needed.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/cloudai/configurator/base_agent.py (1)

91-103: 🧹 Nitpick | 🔵 Trivial | ⚡ Quick win

Clarify step index semantics in the select_action() docstring.

The docstring describes the return as "The current step index" but doesn't specify whether it's 0-indexed or 1-indexed, or how it relates to the test_run.step counter. Since select_action() is abstract and defines a contract for implementers, the docstring should be precise.

Based on the run() usage (comparing step to max_steps, logging "Running step N (of max_steps)", and the fact that CloudAIGymEnv.step() calls increment_step() at the start), the step index should be 1-indexed and match the value that TestRun.increment_step() will return.

Consider updating the Returns section to:

Returns:
    Tuple[int, Dict[str, Any]] | None: The 1-indexed trial number (matching 
        ``test_run.step`` after ``env.step()`` increments it) and a dictionary 
        mapping action keys to selected values, or ``None`` to signal loop termination.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cloudai/configurator/base_agent.py` around lines 91 - 103, The Returns
section of the select_action() docstring uses vague language ("The current step
index") without clarifying the indexing scheme or its relationship to
TestRun.increment_step(). Update the Returns section to explicitly state that
the returned step index is 1-indexed and matches the value of test_run.step
after env.step() increments it, making the contract clear for implementers of
this abstract method.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@src/cloudai/configurator/base_agent.py`:
- Around line 91-103: The Returns section of the select_action() docstring uses
vague language ("The current step index") without clarifying the indexing scheme
or its relationship to TestRun.increment_step(). Update the Returns section to
explicitly state that the returned step index is 1-indexed and matches the value
of test_run.step after env.step() increments it, making the contract clear for
implementers of this abstract method.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: fae375c0-dcee-49e0-95b7-5e6da91bd513

📥 Commits

Reviewing files that changed from the base of the PR and between ef8af80 and 88d35c5.

📒 Files selected for processing (1)
  • src/cloudai/configurator/base_agent.py

…ole mutator

The trial index lives on TestRun and is advanced exactly once per
CloudAIGymEnv.step(), before output_path and the trajectory row are computed.
Step artifacts and trajectory rows are therefore tagged with the advanced
index rather than the pre-step value.
handle_dse_job calls agent.run() uniformly; agents (e.g. RLlib-based) implement
their own training loop in run(). Failure handling follows a hybrid contract: a
non-zero rc accumulates via `err |= rc` and the sweep continues, while an
unexpected exception hard-fails the scenario after reports are generated and the
aborting error is documented.
run() already treats a None return from select_action() as a stop signal, but
the abstract method declared only tuple[int, dict[str, Any]]. Widen the return
annotation to tuple[int, dict[str, Any]] | None and document the termination
semantics so the declared contract matches run()'s usage.
Convert the two f-string logging.info calls in the default run loop to
%-style args so formatting is deferred when the log level is disabled
(Ruff G004, CodeRabbit).
@rutayan-nv rutayan-nv force-pushed the rpatro/agent-run-polymorphism branch from 88d35c5 to 13355ba Compare June 21, 2026 15:34
@podkidyshev podkidyshev merged commit e4dc657 into NVIDIA:main Jun 23, 2026
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants