[CLI/Core] Polymorphic agent dispatch + TestRun owns trial counter by rutayan-nv · Pull Request #933 · NVIDIA/cloudai

rutayan-nv · 2026-06-16T13:54:06Z

Issue

TestRun.step had multiple writers; RLlib's frequent reset() collapsed every trial onto step=1, overwriting trajectory.csv/env.csv rows.
The DSE loop lacked a uniform agent entry point and didn't distinguish recoverable failures from bugs.

Fix

BaseAgent.run() is the single polymorphic entry point; handle_dse_job is just err |= agent.run() — agents implement their own training loop in run().
TestRun.increment_step() is the sole mutator, called only by CloudAIGymEnv.step() (advances before output_path/trajectory are computed).
Hybrid failure contract: a non-zero rc accumulates and the sweep continues; an unexpected exception hard-fails (reports still generated, error written to <scenario_root>/dse_failure.txt, then re-raised).

Testing

pytest tests green; ruff + pyright + vulture clean.

Replaces #893 (branch renamed custom-training-loop-dispatch → agent-run-polymorphism; history rebuilt into 2 clean commits, no add-then-remove churn). Stack base for #900 → #901 → #927 → #928 → #930 → #932.

coderabbitai · 2026-06-16T13:54:23Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 7b860695-6d85-4725-8bc9-3c794d0bb371

📥 Commits

Reviewing files that changed from the base of the PR and between 88d35c5 and 13355ba.

📒 Files selected for processing (7)

src/cloudai/_core/test_scenario.py
src/cloudai/cli/handlers.py
src/cloudai/configurator/base_agent.py
src/cloudai/configurator/cloudai_gym.py
tests/test_cloudaigym.py
tests/test_handlers.py
tests/test_test_scenario.py

📝 Walkthrough

Walkthrough

TestRun gains an increment_step() method called at the start of CloudAIGymEnv.step(). BaseAgent gains a default run() loop that orchestrates reset/select_action/env.step/update_policy. handle_dse_job is refactored to call agent.run() per TestRun, with exception capture, re-raise after report generation, and a new _record_run_failure helper that writes dse_failure.txt. generate_reports accepts an optional error parameter.

Changes

DSE Agent Run Delegation and Failure Reporting

Layer / File(s)	Summary
TestRun.increment_step() and CloudAIGymEnv integration `src/cloudai/_core/test_scenario.py`, `src/cloudai/configurator/cloudai_gym.py`, `tests/test_cloudaigym.py`, `tests/test_test_scenario.py`	`increment_step()` is added to `TestRun` and called at the start of `CloudAIGymEnv.step()`; gym and scenario tests are updated to assert that `step` is advanced before output paths and trajectory rows are recorded.
BaseAgent.run() default exploration loop `src/cloudai/configurator/base_agent.py`	Adds logging import, updates `select_action()` signature to return `tuple[int, dict[str, Any]] \| None`, and implements `BaseAgent.run()` which resets the environment, iterates `select_action`/`env.step`/`update_policy` up to `max_steps` with per-step logging, and returns `0`.
handle_dse_job delegation and failure reporting `src/cloudai/cli/handlers.py`	`handle_dse_job` now calls `agent.run()` per `TestRun` inside a try/except; on exception it calls `generate_reports` with the error then re-raises. `_record_run_failure` writes `dse_failure.txt` with exception details. `generate_reports` accepts an optional `error` parameter.
handle_dse_job agent.run() delegation tests `tests/test_handlers.py`	Adds `CustomRunStubAgent`, `custom_run_agent_name` fixture, and six tests covering successful delegation, non-zero return code propagation and accumulation, exception propagation, abort-on-exception, and `dse_failure.txt` generation before re-raise.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐇 Hippity-hop, the agent now leads,
increment_step() plants the step-counting seeds.
On failure we write, then re-raise with care,
dse_failure.txt floats gently through the air.
The tests all confirm what the rabbit decreed —
each run() delegates at bunny-fast speed! 🎉

🚥 Pre-merge checks | ✅ 4

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly and concisely describes the two main architectural changes: polymorphic agent dispatch via BaseAgent.run() and centralized step mutation via TestRun.increment_step().
Description check	✅ Passed	The description is directly related to the changeset, explaining the issue, the fix, and the testing approach with clear connection to the code changes across all modified files.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/cloudai/configurator/base_agent.py`:
- Around line 143-145: The run() method checks if select_action() returns None
and uses this as a stop signal, but the abstract select_action() method's return
type annotation and documentation do not declare None as a valid return value.
Update the return type annotation of the select_action() method to include None
(e.g., change from tuple[int, dict[str, Any]] to tuple[int, dict[str, Any]] |
None), and update its docstring to document that returning None signals
termination of the agent loop. This aligns the method's declared contract with
how it is actually used in run().

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: f47da465-3eed-4c1e-88f9-d4d27b909176

📥 Commits

Reviewing files that changed from the base of the PR and between 1b0e8cc and 9188272.

📒 Files selected for processing (7)

src/cloudai/_core/test_scenario.py
src/cloudai/cli/handlers.py
src/cloudai/configurator/base_agent.py
src/cloudai/configurator/cloudai_gym.py
tests/test_cloudaigym.py
tests/test_handlers.py
tests/test_test_scenario.py

rutayan-nv · 2026-06-16T14:08:59Z

@coderabbitai review

coderabbitai · 2026-06-16T14:09:08Z

✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

rutayan-nv · 2026-06-16T14:14:33Z

@podkidyshev — this PR supersedes the closed #893 (head branch was renamed, which auto-closed it). Carrying over your two review threads from there since GitHub stranded them on the closed PR:

1. handlers.py:121 — move out of handlers / mirror test structure
Resolved by the agent.run() polymorphism refactor: the custom-loop dispatch, the CustomTrainingLoopAgent Protocol, and the TypeGuard are gone from handlers.py. handle_dse_job now collapses to err |= agent.run() and agents that drive their own loop just override run(). Since no agent-loop logic remains in handlers.py, the only tests left are handle_dse_job delegation tests, which stay in tests/test_handlers.py (per your "not related to tests against handle_dse_job" note). If you'd still like a tests/configurator/test_base_agent.py split, say the word.

2. handlers.py:215 — early return err on err > 0? — still your call.
Current behavior is a hybrid: a graceful non-zero rc from agent.run() is accumulated (err |= ...) so the remaining independent test_runs still execute and produce results/reports, and the handler returns non-zero if any failed. An unexpected exception (a bug) is a hard-fail — it's caught only long enough to generate reports, then re-raised. Happy to switch to fail-fast for the whole scenario if you prefer that semantics.

rutayan-nv · 2026-06-16T22:23:49Z

@podkidyshev — re: item 2 above (early return err on err > 0):

This is resolved — handled as a hybrid policy in handle_dse_job:

Recoverable failures (non-zero rc from agent.run()) accumulate via err |= agent.run() and the sweep continues; the accumulated err is returned. No early-return.
An unexpected exception is a hard-fail: it's captured, reports are generated (generate_reports(error=...), which writes dse_failure.txt with the traceback), then the exception is re-raised.

Lives on #933 (agent-run-polymorphism), inherited up the stack; routing + the mixed-job guard are on #934. Covered by test_handle_dse_job_accumulates_nonzero_rc_and_continues, ..._propagates_agent_run_exception, ..._hard_fail_aborts_remaining_runs, and ..._documents_failure_in_reports_before_raising. No further decision needed.

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

src/cloudai/configurator/base_agent.py (1)
91-103: 🧹 Nitpick | 🔵 Trivial | ⚡ Quick win

Clarify step index semantics in the select_action() docstring.

The docstring describes the return as "The current step index" but doesn't specify whether it's 0-indexed or 1-indexed, or how it relates to the test_run.step counter. Since select_action() is abstract and defines a contract for implementers, the docstring should be precise.

Based on the run() usage (comparing step to max_steps, logging "Running step N (of max_steps)", and the fact that CloudAIGymEnv.step() calls increment_step() at the start), the step index should be 1-indexed and match the value that TestRun.increment_step() will return.

Consider updating the Returns section to:
Returns:
    Tuple[int, Dict[str, Any]] | None: The 1-indexed trial number (matching 
        ``test_run.step`` after ``env.step()`` increments it) and a dictionary 
        mapping action keys to selected values, or ``None`` to signal loop termination.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cloudai/configurator/base_agent.py` around lines 91 - 103, The Returns
section of the select_action() docstring uses vague language ("The current step
index") without clarifying the indexing scheme or its relationship to
TestRun.increment_step(). Update the Returns section to explicitly state that
the returned step index is 1-indexed and matches the value of test_run.step
after env.step() increments it, making the contract clear for implementers of
this abstract method.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@src/cloudai/configurator/base_agent.py`:
- Around line 91-103: The Returns section of the select_action() docstring uses
vague language ("The current step index") without clarifying the indexing scheme
or its relationship to TestRun.increment_step(). Update the Returns section to
explicitly state that the returned step index is 1-indexed and matches the value
of test_run.step after env.step() increments it, making the contract clear for
implementers of this abstract method.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: fae375c0-dcee-49e0-95b7-5e6da91bd513

📥 Commits

Reviewing files that changed from the base of the PR and between ef8af80 and 88d35c5.

📒 Files selected for processing (1)

src/cloudai/configurator/base_agent.py

…ole mutator The trial index lives on TestRun and is advanced exactly once per CloudAIGymEnv.step(), before output_path and the trajectory row are computed. Step artifacts and trajectory rows are therefore tagged with the advanced index rather than the pre-step value.

handle_dse_job calls agent.run() uniformly; agents (e.g. RLlib-based) implement their own training loop in run(). Failure handling follows a hybrid contract: a non-zero rc accumulates via `err |= rc` and the sweep continues, while an unexpected exception hard-fails the scenario after reports are generated and the aborting error is documented.

run() already treats a None return from select_action() as a stop signal, but the abstract method declared only tuple[int, dict[str, Any]]. Widen the return annotation to tuple[int, dict[str, Any]] | None and document the termination semantics so the declared contract matches run()'s usage.

Convert the two f-string logging.info calls in the default run loop to %-style args so formatting is deferred when the log level is disabled (Ruff G004, CodeRabbit).

rutayan-nv requested review from jeffnvidia, podkidyshev and srivatsankrishnan as code owners June 16, 2026 13:54

rutayan-nv mentioned this pull request Jun 16, 2026

[CLI/Core] Polymorphic agent dispatch + TestRun owns trial counter #893

Closed

coderabbitai Bot reviewed Jun 16, 2026

View reviewed changes

Comment thread src/cloudai/configurator/base_agent.py

This was referenced Jun 16, 2026

More Gym enhancements #884

Draft

Gym enhancements #863

Draft

This was referenced Jun 17, 2026

feat(cli): route live_rl_mode jobs through the DSE handler #934

Draft

[TDD-red] CloudAIGymEnv cache key must include env_params (drives domain-randomization correctness fix) #900

Closed

coderabbitai Bot reviewed Jun 21, 2026

View reviewed changes

rutayan-nv added 4 commits June 21, 2026 11:34

style(configurator): use %-style deferred logging in BaseAgent.run

13355ba

Convert the two f-string logging.info calls in the default run loop to %-style args so formatting is deferred when the log level is disabled (Ruff G004, CodeRabbit).

rutayan-nv force-pushed the rpatro/agent-run-polymorphism branch from 88d35c5 to 13355ba Compare June 21, 2026 15:34

podkidyshev approved these changes Jun 23, 2026

View reviewed changes

podkidyshev merged commit e4dc657 into NVIDIA:main Jun 23, 2026
4 checks passed

This was referenced Jun 23, 2026

fix(configurator): make env_params first-class to fix the trajectory cache key #901

Open

feat(core): add ContinuousSpace action-space primitive #927

Draft

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[CLI/Core] Polymorphic agent dispatch + TestRun owns trial counter#933

[CLI/Core] Polymorphic agent dispatch + TestRun owns trial counter#933
podkidyshev merged 4 commits into
NVIDIA:mainfrom
rutayan-nv:rpatro/agent-run-polymorphism

rutayan-nv commented Jun 16, 2026

Uh oh!

coderabbitai Bot commented Jun 16, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Poem

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

rutayan-nv commented Jun 16, 2026

Uh oh!

coderabbitai Bot commented Jun 16, 2026 •

edited

Loading

Uh oh!

rutayan-nv commented Jun 16, 2026

Uh oh!

rutayan-nv commented Jun 16, 2026

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

rutayan-nv commented Jun 16, 2026

Issue

Fix

Testing

Uh oh!

coderabbitai Bot commented Jun 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Poem

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

rutayan-nv commented Jun 16, 2026

Uh oh!

coderabbitai Bot commented Jun 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

rutayan-nv commented Jun 16, 2026

Uh oh!

rutayan-nv commented Jun 16, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

coderabbitai Bot commented Jun 16, 2026 •

edited

Loading

coderabbitai Bot commented Jun 16, 2026 •

edited

Loading