Skip to content

feat(core): add ContinuousSpace action-space primitive#927

Draft
rutayan-nv wants to merge 13 commits into
NVIDIA:mainfrom
rutayan-nv:rpatro/action-space-continuous
Draft

feat(core): add ContinuousSpace action-space primitive#927
rutayan-nv wants to merge 13 commits into
NVIDIA:mainfrom
rutayan-nv:rpatro/action-space-continuous

Conversation

@rutayan-nv

Copy link
Copy Markdown
Contributor

Issue

  • CloudAI action spaces can only express discrete (list) candidate domains; there is no primitive for a continuous tunable parameter.

Fix

  • Add ContinuousSpace(low, high, dtype="int"|"float") in cloudai/_core/action_space.py (validates low < high), exported via cloudai.core. Consumers (agents, GymnasiumAdapter) read low/high/dtype to build their representation and quantize decoded samples.

Testing

  • tests/test_action_space.py (5 tests): defaults, int-bound coercion, low < high rejection, unknown-dtype rejection, extra-field rejection. ruff + pyright + vulture clean.

Stack: main#893#900#901this (first of the gymnasium-adapter upstreaming). Independent of #901 in content; stacked for linear history.

@coderabbitai

coderabbitai Bot commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: e5fabbb2-968a-448e-97e2-4fc528ca9790

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds env-parameter sampling and validation support, wires env_params into DSE execution and CloudAIGymEnv caching/trajectory writing, adds CLI validation plus failure reporting, and exports a new ContinuousSpace action-space model.

Changes

DSE Environment Randomization and Execution

Layer / File(s) Summary
Action and env-parameter contracts
src/cloudai/_core/action_space.py, src/cloudai/configurator/env_params.py, src/cloudai/_core/test_scenario.py, src/cloudai/models/workload.py, src/cloudai/core.py
ContinuousSpace defines a closed interval action-space dimension with dtype validation; EnvParamSpec validates optional weights; TestRun gains current_env_params; TestDefinition gains env_params; ContinuousSpace is exported from cloudai.core.
EnvParams sampling and persistence
src/cloudai/configurator/env_params.py
EnvParamsSampler samples per-trial categorical env parameters with deterministic seeds; CsvSink writes step-aligned env.csv rows; StepObserver defines step hooks; EnvParamsObserver samples into test_run.current_env_params.
Workload validation and DSE filtering
src/cloudai/models/workload.py, src/cloudai/_core/test_scenario.py
is_dse_job skips env_params-backed cmd_args branches, validate_env_params checks env_params against cmd_args structure and weights, and param_space excludes list-valued cmd_args already covered by env_params.
CloudAIGymEnv observer and cache flow
src/cloudai/configurator/cloudai_gym.py
TrajectoryEntry stores env_params, CloudAIGymEnv builds env observers, injects sampled env params into step execution, records env_params in trajectories, writes env.csv, and requires action plus env_params to match cached results.
CLI validation and failure reporting
src/cloudai/cli/handlers.py
validate_dse_env_params rejects env_params on unsupported scenarios, startup and scenario verification call it, handle_dse_job catches unexpected exceptions, and generate_reports can record dse_failure.txt.
Tests for contracts, runtime, gym, and handlers
tests/test_action_space.py, tests/test_env_params.py, tests/test_cloudaigym.py, tests/test_handlers.py
Adds coverage for ContinuousSpace, env-parameter validation and sampling, CloudAIGymEnv env_params-aware step/cache behavior, and handler validation rules.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • NVIDIA/cloudai#933: Extends the same handle_dse_job/generate_reports path and includes TestRun.increment_step(), overlapping with the exception-handling and reporting changes here.

Suggested reviewers

  • srivatsankrishnan
  • jeffnvidia
  • podkidyshev

Poem

🐇 I hopped through fields of low and high,
With env params sampled by moonlit sky.
The tray of steps went clickety-clack,
And dse_failure.txt kept the stack.
Fresh caches, fresh trails, a bunny’s cheer —
CloudAI runs now wiggle with grace here.

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and concisely summarizes the main change: adding the ContinuousSpace action-space primitive.
Description check ✅ Passed The description is directly related to the changeset and accurately describes the new ContinuousSpace primitive and tests.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/cloudai/configurator/cloudai_gym.py (1)

145-170: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Constraint-failure early return breaks env.csvtrajectory.csv step alignment.

Line 145 triggers before_step observers (which persist env.csv), but Line 168/Line 170 can return before write_trajectory(...). With declared env_params, this creates one-sided env rows for failed trials and violates the step-alignment contract.

Proposed fix
         if not self.test_run.test.constraint_check(self.test_run, self.runner.system):
             logging.info("Constraint check failed. Skipping step.")
-            return [-1.0], self.rewards.constraint_failure, True, {}
+            observation = [-1.0]
+            reward = self.rewards.constraint_failure
+            self.write_trajectory(
+                TrajectoryEntry(
+                    step=self.test_run.step,
+                    action=action,
+                    reward=reward,
+                    observation=observation,
+                    env_params=dict(self.test_run.current_env_params),
+                )
+            )
+            for observer in self.observers:
+                observer.after_step(self.test_run, observation, reward)
+            return observation, reward, True, {}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cloudai/configurator/cloudai_gym.py` around lines 145 - 170, The issue is
that the `before_step` observer (line 145) writes to env.csv, but when the
constraint check fails (line 168), the method returns early without calling
`write_trajectory`, creating a mismatch between env.csv and trajectory.csv rows.
To fix this, when the constraint check fails, you must write a corresponding
trajectory entry before returning. Create a TrajectoryEntry with the current
step, action, the constraint failure reward, and current env_params (similar to
how it's done in the cached result case), pass it to `write_trajectory`, and
only then return the constraint failure response. This ensures every
`before_step` call has a matching trajectory entry.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/cloudai/configurator/env_params.py`:
- Line 2: The copyright headers in both files use a ranged year format
(2024-2026) that does not match the repository's CI expectations. Update the
copyright year in src/cloudai/configurator/env_params.py at line 2 by changing
the year range to a single year format of 2026. Apply the identical copyright
year correction to tests/test_env_params.py at line 2, also changing to the
single year 2026.

In `@src/cloudai/models/workload.py`:
- Around line 115-122: The new env_params field added to the Workload class is
not being checked in the TestDefinition.is_dse_job method, which means test
scenarios that use only environment parameter randomization could be incorrectly
classified as non-DSE jobs and skip the CloudAIGymEnv sampling flow. Update the
is_dse_job method to include a check for env_params alongside any existing DSE
parameter checks, ensuring that a job is classified as a DSE job if env_params
is non-empty.

In `@tests/test_action_space.py`:
- Around line 43-50: The test functions
test_continuous_space_rejects_unknown_dtype and
test_continuous_space_forbids_extra_fields are calling the ContinuousSpace
constructor directly with invalid arguments (dtype="double" and step=0.1), which
violates the type signature and blocks CI type checking. Refactor both test
functions to use the model_validate() class method instead, passing dictionaries
containing the invalid payloads as arguments. This approach will exercise the
runtime validation logic while maintaining static type safety, allowing the
tests to verify that ValidationError is raised for invalid inputs without
triggering pyright type errors.

In `@tests/test_cloudaigym.py`:
- Around line 567-615: Create a new test case (either as a separate test
function or within the existing test structure) that validates step alignment
between env.csv and trajectory.csv when a constraint_check failure occurs. Set
up a test scenario similar to test_env_csv_is_step_aligned_with_trajectory with
declared env_params, but configure or mock the environment so that
constraint_check fails during execution. After the steps complete, verify that
both env_steps and traj_steps remain aligned (same step values in the same
order) even after the constraint failure, ensuring the step alignment contract
holds for trials that hit constraint violations.

---

Outside diff comments:
In `@src/cloudai/configurator/cloudai_gym.py`:
- Around line 145-170: The issue is that the `before_step` observer (line 145)
writes to env.csv, but when the constraint check fails (line 168), the method
returns early without calling `write_trajectory`, creating a mismatch between
env.csv and trajectory.csv rows. To fix this, when the constraint check fails,
you must write a corresponding trajectory entry before returning. Create a
TrajectoryEntry with the current step, action, the constraint failure reward,
and current env_params (similar to how it's done in the cached result case),
pass it to `write_trajectory`, and only then return the constraint failure
response. This ensures every `before_step` call has a matching trajectory entry.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: b6c39d36-ea91-40c3-ac4f-5345ef1e77f6

📥 Commits

Reviewing files that changed from the base of the PR and between 1b0e8cc and c291ba4.

📒 Files selected for processing (13)
  • src/cloudai/_core/action_space.py
  • src/cloudai/_core/test_scenario.py
  • src/cloudai/cli/handlers.py
  • src/cloudai/configurator/base_agent.py
  • src/cloudai/configurator/cloudai_gym.py
  • src/cloudai/configurator/env_params.py
  • src/cloudai/core.py
  • src/cloudai/models/workload.py
  • tests/test_action_space.py
  • tests/test_cloudaigym.py
  • tests/test_env_params.py
  • tests/test_handlers.py
  • tests/test_test_scenario.py

Comment thread src/cloudai/configurator/env_params.py Outdated
Comment thread src/cloudai/models/workload.py
Comment thread tests/test_action_space.py Outdated
Comment thread tests/test_cloudaigym.py Outdated

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/cloudai/configurator/base_agent.py (1)

90-104: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Type contract mismatch: select_action signature doesn't allow None return.

The abstract method signature declares tuple[int, dict[str, Any]] as the return type (line 91), but run() handles None as an early-termination signal (lines 143-145). This breaks the type contract and will cause static analysis issues for subclasses that want to return None.

Update the return type to reflect this:

🔧 Proposed fix
 `@abstractmethod`
-def select_action(self, observation: list[float] | None = None) -> tuple[int, dict[str, Any]]:
+def select_action(self, observation: list[float] | None = None) -> tuple[int, dict[str, Any]] | None:
     """
     Select an action from the action space.

     Args:
         observation: Latest observation produced by the environment (``env.reset()`` on the
             first call, then the result of the prior ``env.step()``). Stateless agents such
             as grid search or Bayesian optimization may ignore this; observation-conditioned
             agents (RL, contextual bandits) should use it.

     Returns:
-        Tuple[int, Dict[str, Any]]: The current step index and a dictionary mapping action keys to selected values.
+        Tuple[int, Dict[str, Any]] | None: The current step index and a dictionary mapping
+            action keys to selected values, or None to signal early termination.
     """
     pass

Also applies to: 143-146

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cloudai/configurator/base_agent.py` around lines 90 - 104, The abstract
method `select_action` declares a return type of `tuple[int, dict[str, Any]]`
but the `run()` method handles `None` as a valid return value for early
termination. Update the return type annotation of the `select_action` method to
allow `None` by changing it to `tuple[int, dict[str, Any]] | None` to correctly
reflect the actual contract and prevent type checking errors in subclasses.
♻️ Duplicate comments (2)
src/cloudai/models/workload.py (1)

115-122: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

env_params is not integrated into is_dse_job detection.

The is_dse_job property (lines 148-159) only checks cmd_args_dict and extra_env_vars for list values. A workload that declares only env_params (no list-valued cmd_args) would return is_dse_job = False, potentially bypassing the CloudAIGymEnv step/sampling flow.

🔧 Proposed fix
     `@property`
     def is_dse_job(self) -> bool:
         def check_dict(d: dict, parent_key: str = "") -> bool:
             if isinstance(d, dict):
                 for key, value in d.items():
                     path = f"{parent_key}.{key}" if parent_key else key
                     if self.is_dse_excluded_arg(path):
                         continue
                     if isinstance(value, list) or (isinstance(value, dict) and check_dict(value, path)):
                         return True
             return False

-        return check_dict(self.cmd_args_dict) or check_dict(self.extra_env_vars)
+        return check_dict(self.cmd_args_dict) or check_dict(self.extra_env_vars) or bool(self.env_params)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cloudai/models/workload.py` around lines 115 - 122, The `is_dse_job`
property does not check for values in the newly added `env_params` field when
determining whether a workload is a DSE job. Update the `is_dse_job` property
(which currently checks only `cmd_args_dict` and `extra_env_vars` for list
values) to also include a check for whether `env_params` is non-empty. This
ensures that workloads declaring only `env_params` without list-valued
`cmd_args` will be correctly identified as DSE jobs and properly routed through
the `CloudAIGymEnv` sampling flow.
tests/test_action_space.py (1)

43-50: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Switch invalid-input tests to model_validate() payloads to unblock pyright.

Line 45 and Line 50 currently call ContinuousSpace(...) with arguments that intentionally violate the constructor signature; this is causing the CI pyright failure. Keep the negative-runtime validation intent, but feed invalid payloads through model_validate({...}) instead.

✅ Minimal patch
 def test_continuous_space_rejects_unknown_dtype() -> None:
     with pytest.raises(ValidationError):
-        ContinuousSpace(low=0.0, high=1.0, dtype="double")
+        ContinuousSpace.model_validate({"low": 0.0, "high": 1.0, "dtype": "double"})
@@
 def test_continuous_space_forbids_extra_fields() -> None:
     with pytest.raises(ValidationError):
-        ContinuousSpace(low=0.0, high=1.0, step=0.1)
+        ContinuousSpace.model_validate({"low": 0.0, "high": 1.0, "step": 0.1})
#!/bin/bash
# Read-only verification: confirm the two constructor calls that trigger pyright.
rg -n 'ContinuousSpace\(low=0\.0,\s*high=1\.0,\s*dtype="double"\)|ContinuousSpace\(low=0\.0,\s*high=1\.0,\s*step=0\.1\)' tests/test_action_space.py
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_action_space.py` around lines 43 - 50, Replace the direct
constructor calls to ContinuousSpace that use invalid arguments with
model_validate() calls instead. In test_continuous_space_rejects_unknown_dtype()
at line 45, change ContinuousSpace(low=0.0, high=1.0, dtype="double") to
ContinuousSpace.model_validate({"low": 0.0, "high": 1.0, "dtype": "double"}). In
test_continuous_space_forbids_extra_fields() at line 50, change
ContinuousSpace(low=0.0, high=1.0, step=0.1) to
ContinuousSpace.model_validate({"low": 0.0, "high": 1.0, "step": 0.1}). This
preserves the negative-input validation testing intent while avoiding pyright
type-checking errors.

Source: Pipeline failures

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/test_env_params.py`:
- Line 2: The copyright header in tests/test_env_params.py on line 2 uses a year
range format (2025-2026) which fails the repository's copyright header
validation. Update the copyright year to use the single-year format (2026) as
expected by the tests/test_check_copyright_headers.py validation rules. Change
the year range in the copyright line from "2025-2026" to just "2026".

---

Outside diff comments:
In `@src/cloudai/configurator/base_agent.py`:
- Around line 90-104: The abstract method `select_action` declares a return type
of `tuple[int, dict[str, Any]]` but the `run()` method handles `None` as a valid
return value for early termination. Update the return type annotation of the
`select_action` method to allow `None` by changing it to `tuple[int, dict[str,
Any]] | None` to correctly reflect the actual contract and prevent type checking
errors in subclasses.

---

Duplicate comments:
In `@src/cloudai/models/workload.py`:
- Around line 115-122: The `is_dse_job` property does not check for values in
the newly added `env_params` field when determining whether a workload is a DSE
job. Update the `is_dse_job` property (which currently checks only
`cmd_args_dict` and `extra_env_vars` for list values) to also include a check
for whether `env_params` is non-empty. This ensures that workloads declaring
only `env_params` without list-valued `cmd_args` will be correctly identified as
DSE jobs and properly routed through the `CloudAIGymEnv` sampling flow.

In `@tests/test_action_space.py`:
- Around line 43-50: Replace the direct constructor calls to ContinuousSpace
that use invalid arguments with model_validate() calls instead. In
test_continuous_space_rejects_unknown_dtype() at line 45, change
ContinuousSpace(low=0.0, high=1.0, dtype="double") to
ContinuousSpace.model_validate({"low": 0.0, "high": 1.0, "dtype": "double"}). In
test_continuous_space_forbids_extra_fields() at line 50, change
ContinuousSpace(low=0.0, high=1.0, step=0.1) to
ContinuousSpace.model_validate({"low": 0.0, "high": 1.0, "step": 0.1}). This
preserves the negative-input validation testing intent while avoiding pyright
type-checking errors.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 1bc10229-4b90-4cdf-84d0-690819822eea

📥 Commits

Reviewing files that changed from the base of the PR and between c291ba4 and 4e52bc0.

📒 Files selected for processing (13)
  • src/cloudai/_core/action_space.py
  • src/cloudai/_core/test_scenario.py
  • src/cloudai/cli/handlers.py
  • src/cloudai/configurator/base_agent.py
  • src/cloudai/configurator/cloudai_gym.py
  • src/cloudai/configurator/env_params.py
  • src/cloudai/core.py
  • src/cloudai/models/workload.py
  • tests/test_action_space.py
  • tests/test_cloudaigym.py
  • tests/test_env_params.py
  • tests/test_handlers.py
  • tests/test_test_scenario.py

Comment thread tests/test_env_params.py Outdated
This was referenced Jun 16, 2026
@rutayan-nv

Copy link
Copy Markdown
Contributor Author

@coderabbitai review

@coderabbitai

coderabbitai Bot commented Jun 16, 2026

Copy link
Copy Markdown
Contributor
✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@rutayan-nv rutayan-nv force-pushed the rpatro/action-space-continuous branch from 4e52bc0 to 5245d81 Compare June 16, 2026 15:06

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (2)
tests/test_action_space.py (1)

43-50: 🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Replace # type: ignore constructor negatives with payload-based validation calls.

Line 45 and Line 50 currently bypass static typing. Use ContinuousSpace.model_validate({...}) to keep the negative-runtime checks while preserving type safety in tests.

Suggested patch
 def test_continuous_space_rejects_unknown_dtype() -> None:
     with pytest.raises(ValidationError):
-        ContinuousSpace(low=0.0, high=1.0, dtype="double")  # type: ignore
+        ContinuousSpace.model_validate({"low": 0.0, "high": 1.0, "dtype": "double"})
 
 
 def test_continuous_space_forbids_extra_fields() -> None:
     with pytest.raises(ValidationError):
-        ContinuousSpace(low=0.0, high=1.0, step=0.1)  # type: ignore
+        ContinuousSpace.model_validate({"low": 0.0, "high": 1.0, "step": 0.1})
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_action_space.py` around lines 43 - 50, Replace the direct
ContinuousSpace constructor calls that use `# type: ignore` comments with
`ContinuousSpace.model_validate({...})` calls in both
test_continuous_space_rejects_unknown_dtype and
test_continuous_space_forbids_extra_fields functions. Instead of passing
parameters directly to the constructor with type ignore comments, pass them as a
dictionary to model_validate to maintain type safety while still allowing the
negative test cases to validate the runtime behavior.
tests/test_cloudaigym.py (1)

567-615: 🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Add a constraint-failure alignment regression beside this step-alignment test.

This block validates alignment for successful/cache trials, but it still doesn’t pin the declared-env_params path when constraint_check fails. Add a case asserting env.csv and trajectory.csv remain step-aligned in that failure branch as well.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_cloudaigym.py` around lines 567 - 615, Add a regression test that
validates env.csv and trajectory.csv remain step-aligned when constraint_check
fails. Create a new test case (or extend the existing
test_env_csv_is_step_aligned_with_trajectory) that simulates a constraint check
failure scenario, then verify that the step alignment between env.csv and
trajectory.csv is preserved even in this failure branch. This ensures the
declared env_params contract holds for both successful execution paths and
constraint-failure paths.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/test_env_params.py`:
- Around line 86-87: The pytest.raises(ValueError) assertion is too broad and
may mask unrelated failures. Replace the broad ValueError check with a more
specific assertion that matches the exact step-validation error message that
should be raised when sink.write is called with step 0 and drop_rate 0.0. Use
pytest.raises with the match parameter to verify the specific error message
expected from step validation.

---

Duplicate comments:
In `@tests/test_action_space.py`:
- Around line 43-50: Replace the direct ContinuousSpace constructor calls that
use `# type: ignore` comments with `ContinuousSpace.model_validate({...})` calls
in both test_continuous_space_rejects_unknown_dtype and
test_continuous_space_forbids_extra_fields functions. Instead of passing
parameters directly to the constructor with type ignore comments, pass them as a
dictionary to model_validate to maintain type safety while still allowing the
negative test cases to validate the runtime behavior.

In `@tests/test_cloudaigym.py`:
- Around line 567-615: Add a regression test that validates env.csv and
trajectory.csv remain step-aligned when constraint_check fails. Create a new
test case (or extend the existing test_env_csv_is_step_aligned_with_trajectory)
that simulates a constraint check failure scenario, then verify that the step
alignment between env.csv and trajectory.csv is preserved even in this failure
branch. This ensures the declared env_params contract holds for both successful
execution paths and constraint-failure paths.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 1af5e41b-bb3a-409b-a00f-58669b8a911c

📥 Commits

Reviewing files that changed from the base of the PR and between 4e52bc0 and 5245d81.

📒 Files selected for processing (9)
  • src/cloudai/_core/action_space.py
  • src/cloudai/_core/test_scenario.py
  • src/cloudai/configurator/cloudai_gym.py
  • src/cloudai/configurator/env_params.py
  • src/cloudai/core.py
  • src/cloudai/models/workload.py
  • tests/test_action_space.py
  • tests/test_cloudaigym.py
  • tests/test_env_params.py

Comment thread tests/test_env_params.py Outdated
@rutayan-nv

Copy link
Copy Markdown
Contributor Author

@coderabbitai review

@coderabbitai

coderabbitai Bot commented Jun 16, 2026

Copy link
Copy Markdown
Contributor
✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@rutayan-nv rutayan-nv force-pushed the rpatro/action-space-continuous branch from 5245d81 to 8a0e03d Compare June 16, 2026 21:17

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/cloudai/cli/handlers.py`:
- Around line 324-328: The validate_dse_env_params function call in the try
block currently validates env_params only against tr.is_dse_job, but does not
account for the --single-sbatch flag which routes execution through
handle_non_dse_job where CloudAIGymEnv sampling is not used. This allows
env_params to pass validation and then be silently ignored at runtime. Enhance
the validation logic to also check if --single-sbatch mode is enabled, and if
so, reject any env_params specification by raising a TestScenarioParsingError to
explicitly block this unsupported combination at validation time rather than
failing silently.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 3934bf03-5c1d-4d10-a50e-1ec080b51156

📥 Commits

Reviewing files that changed from the base of the PR and between 5245d81 and 8a0e03d.

📒 Files selected for processing (5)
  • src/cloudai/_core/action_space.py
  • src/cloudai/cli/handlers.py
  • src/cloudai/core.py
  • tests/test_action_space.py
  • tests/test_handlers.py

Comment thread src/cloudai/cli/handlers.py
@rutayan-nv rutayan-nv force-pushed the rpatro/action-space-continuous branch 4 times, most recently from 23558cd to 7c3c7b1 Compare June 17, 2026 22:03
rutayan-nv added a commit to rutayan-nv/cloudai that referenced this pull request Jun 17, 2026
GymnasiumAdapter._decode_continuous clamps to [low, high] then rounds, so
a dtype="int" space with non-integer bounds (e.g. [0.2, 0.8]) can decode
to a value outside the declared range (1). Validate that int-quantized
spaces declare integer bounds (CodeRabbit, NVIDIA#932/NVIDIA#927).
rutayan-nv added a commit to rutayan-nv/cloudai that referenced this pull request Jun 20, 2026
GymnasiumAdapter._decode_continuous clamps to [low, high] then rounds, so
a dtype="int" space with non-integer bounds (e.g. [0.2, 0.8]) can decode
to a value outside the declared range (1). Validate that int-quantized
spaces declare integer bounds (CodeRabbit, NVIDIA#932/NVIDIA#927).
@rutayan-nv rutayan-nv force-pushed the rpatro/action-space-continuous branch from 017842c to ea9af74 Compare June 20, 2026 22:02
rutayan-nv added a commit to rutayan-nv/cloudai that referenced this pull request Jun 21, 2026
GymnasiumAdapter._decode_continuous clamps to [low, high] then rounds, so
a dtype="int" space with non-integer bounds (e.g. [0.2, 0.8]) can decode
to a value outside the declared range (1). Validate that int-quantized
spaces declare integer bounds (CodeRabbit, NVIDIA#932/NVIDIA#927).
@rutayan-nv rutayan-nv force-pushed the rpatro/action-space-continuous branch from ea9af74 to 79854c6 Compare June 21, 2026 13:49
rutayan-nv added a commit to rutayan-nv/cloudai that referenced this pull request Jun 21, 2026
GymnasiumAdapter._decode_continuous clamps to [low, high] then rounds, so
a dtype="int" space with non-integer bounds (e.g. [0.2, 0.8]) can decode
to a value outside the declared range (1). Validate that int-quantized
spaces declare integer bounds (CodeRabbit, NVIDIA#932/NVIDIA#927).

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/cloudai/_core/action_space.py`:
- Around line 35-49: The ContinuousSpace class has been defined but is not
integrated into the action space extraction pipeline. You need to update
TestRun.param_space to extract ContinuousSpace domains from cmd_args and
extra_env_vars alongside the existing list-valued parameter extraction, then
ensure CloudAIGymEnv.define_action_space() includes these extracted
ContinuousSpace domains in the returned action_space. Finally, add an
integration test that verifies a ContinuousSpace parameter defined under
cmd_args correctly appears in the environment's action_space and is accessible
to agents/adapters.
- Around line 45-55: The _validate_bounds method in the ContinuousSpace class
needs explicit validation for non-finite bounds. The current comparison check
(self.low >= self.high) does not reject NaN values since NaN comparisons always
return False, and it allows infinities which can create invalid intervals. Add
checks to ensure both self.low and self.high are finite values (using a method
like math.isfinite) before or alongside the existing bounds validation, raising
a ValueError with a descriptive message if either bound is NaN or infinite.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: abbf4c85-c061-455b-8c33-7f12050b7a7c

📥 Commits

Reviewing files that changed from the base of the PR and between 8a0e03d and 463efee.

📒 Files selected for processing (13)
  • src/cloudai/_core/action_space.py
  • src/cloudai/_core/test_scenario.py
  • src/cloudai/cli/handlers.py
  • src/cloudai/configurator/base_agent.py
  • src/cloudai/configurator/cloudai_gym.py
  • src/cloudai/configurator/env_params.py
  • src/cloudai/core.py
  • src/cloudai/models/workload.py
  • tests/test_action_space.py
  • tests/test_cloudaigym.py
  • tests/test_env_params.py
  • tests/test_handlers.py
  • tests/test_test_scenario.py

Comment thread src/cloudai/_core/action_space.py
Comment on lines +45 to +55
model_config = ConfigDict(extra="forbid")

low: float
high: float
dtype: Literal["int", "float"] = "float"

@model_validator(mode="after")
def _validate_bounds(self) -> Self:
if self.low >= self.high:
raise ValueError(f"ContinuousSpace requires low < high; got low={self.low}, high={self.high}")
if self.dtype == "int" and (not self.low.is_integer() or not self.high.is_integer()):

@coderabbitai coderabbitai Bot Jun 23, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
python - <<'PY'
from pydantic import BaseModel, ConfigDict

class Model(BaseModel):
    model_config = ConfigDict(extra="forbid")
    x: float

for value in (float("nan"), float("inf"), float("-inf")):
    try:
        print("accepted", value, "->", Model(x=value).x)
    except Exception as exc:
        print("rejected", value, type(exc).__name__)
PY

Repository: NVIDIA/cloudai

Length of output: 214


🏁 Script executed:

# First, locate and examine the ContinuousSpace class
cat -n src/cloudai/_core/action_space.py | head -70

Repository: NVIDIA/cloudai

Length of output: 2879


🏁 Script executed:

# Search for usage of ContinuousSpace in the codebase
rg "ContinuousSpace" --type py -A 2 -B 2

Repository: NVIDIA/cloudai

Length of output: 152


🏁 Script executed:

# Check for any existing finite validation or tests related to bounds
rg "isfinite|NaN|infinity" --type py src/cloudai/

Repository: NVIDIA/cloudai

Length of output: 396


🏁 Script executed:

# Retry searching for ContinuousSpace usage with broader search
rg "ContinuousSpace" --type py

Repository: NVIDIA/cloudai

Length of output: 152


🏁 Script executed:

# Find test files related to action_space
find . -name "*test*action*" -o -name "*action*test*" | grep -E "\.py$"

Repository: NVIDIA/cloudai

Length of output: 181


🏁 Script executed:

# Search for imports from action_space to find consumers
rg "from.*action_space import|import.*action_space" --type py

Repository: NVIDIA/cloudai

Length of output: 152


🏁 Script executed:

# Check test file for ContinuousSpace
cat tests/test_action_space.py

Repository: NVIDIA/cloudai

Length of output: 2514


🏁 Script executed:

# Search broadly for ContinuousSpace in all text files
rg "ContinuousSpace" .

Repository: NVIDIA/cloudai

Length of output: 1501


🏁 Script executed:

# Look for files that use low/high bounds or action space
rg "\.low|\.high" --type py | grep -i "space\|adapter" | head -20

Repository: NVIDIA/cloudai

Length of output: 152


Reject non-finite bounds explicitly.

low >= high does not reject NaN (comparison with NaN is always False), and unbounded infinities can create invalid intervals for downstream adapters. Add finite-bound validation.

Proposed fix
+import math
 from typing import Literal
@@
     `@model_validator`(mode="after")
     def _validate_bounds(self) -> Self:
+        if not math.isfinite(self.low) or not math.isfinite(self.high):
+            raise ValueError(f"ContinuousSpace requires finite bounds; got low={self.low}, high={self.high}")
         if self.low >= self.high:
             raise ValueError(f"ContinuousSpace requires low < high; got low={self.low}, high={self.high}")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
model_config = ConfigDict(extra="forbid")
low: float
high: float
dtype: Literal["int", "float"] = "float"
@model_validator(mode="after")
def _validate_bounds(self) -> Self:
if self.low >= self.high:
raise ValueError(f"ContinuousSpace requires low < high; got low={self.low}, high={self.high}")
if self.dtype == "int" and (not self.low.is_integer() or not self.high.is_integer()):
model_config = ConfigDict(extra="forbid")
low: float
high: float
dtype: Literal["int", "float"] = "float"
`@model_validator`(mode="after")
def _validate_bounds(self) -> Self:
if not math.isfinite(self.low) or not math.isfinite(self.high):
raise ValueError(f"ContinuousSpace requires finite bounds; got low={self.low}, high={self.high}")
if self.low >= self.high:
raise ValueError(f"ContinuousSpace requires low < high; got low={self.low}, high={self.high}")
if self.dtype == "int" and (not self.low.is_integer() or not self.high.is_integer()):
🧰 Tools
🪛 Ruff (0.15.18)

[warning] 54-54: Avoid specifying long messages outside the exception class

(TRY003)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cloudai/_core/action_space.py` around lines 45 - 55, The _validate_bounds
method in the ContinuousSpace class needs explicit validation for non-finite
bounds. The current comparison check (self.low >= self.high) does not reject NaN
values since NaN comparisons always return False, and it allows infinities which
can create invalid intervals. Add checks to ensure both self.low and self.high
are finite values (using a method like math.isfinite) before or alongside the
existing bounds validation, raising a ValueError with a descriptive message if
either bound is NaN or infinite.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Valid; deferred with the PR. The finite-bound (NaN/inf) guard will land when ContinuousSpace is revived and wired into param_space.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rutayan-nv, understood — thanks for the context. Would you like me to open a GitHub issue to track adding the finite-bound (math.isfinite) guard to ContinuousSpace._validate_bounds so it doesn't get lost before the wiring-in PR lands?


🧠 Learnings used
Learnt from: amaslenn
Repo: NVIDIA/cloudai PR: 754
File: src/cloudai/_core/registry.py:226-234
Timestamp: 2025-12-16T19:47:41.994Z
Learning: In this repository, prefer expressing behavioral documentation through tests rather than docstrings. Tests act as living, verified documentation. Reserve docstrings for interfaces or high-level descriptions, and avoid duplicating behavior that is already covered by tests.

Learnt from: amaslenn
Repo: NVIDIA/cloudai PR: 799
File: src/cloudai/workloads/vllm/report_generation_strategy.py:47-48
Timestamp: 2026-02-10T13:29:25.671Z
Learning: In src/cloudai/workloads/vllm/report_generation_strategy.py, it is acceptable to annotate parse_vllm_bench_output with a caching decorator (e.g., functools.cache) because benchmark result files are immutable after creation in typical usage. This improves repeated calls by avoiding re-parsing unchanged data. Ensure the function is side-effect free and that cache invalidation is not needed for these inputs. If inputs include non-file-state variability, consider cache keys on file path or content hash.

Learnt from: podkidyshev
Repo: NVIDIA/cloudai PR: 821
File: src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py:87-97
Timestamp: 2026-03-10T11:01:25.158Z
Learning: In NVIDIA/cloudai repository, there is no secrets management infrastructure; environment variable values (including tokens/secrets) are serialized as literal strings into generated commands and run artifacts (e.g., cloudai_generated_command.sh). This is an architectural limitation acknowledged by the maintainer (podkidyshev). During code reviews for this repo, do not flag secret-serialization issues unless a secrets management solution is available. If a future change introduces proper secrets handling, update this guideline accordingly.

Learnt from: amaslenn
Repo: NVIDIA/cloudai PR: 836
File: tests/workloads/test_llm_serving.py:35-83
Timestamp: 2026-03-16T14:10:11.280Z
Learning: In the NVIDIA/cloudai repository, require meaningful docstrings for non-test Python classes (production/library code) to improve readability and maintainability. Test doubles, test helper classes, and other classes defined inside test files (e.g., Fake*, Plain*, stub classes) are exempt from the docstring requirement. Enforce via automated checks (e.g., a docstring style/quality gate using CI) that validate presence and quality of class docstrings for production code while skipping test-only classes within test files.

Learnt from: amaslenn
Repo: NVIDIA/cloudai PR: 839
File: src/cloudai/workloads/sglang/slurm_command_gen_strategy.py:41-41
Timestamp: 2026-03-17T20:01:53.137Z
Learning: In src/cloudai/workloads/sglang/slurm_command_gen_strategy.py, 0.0.0.0 bindings (default prefill_host/decode_host in get_serve_commands and --host 0.0.0.0 in get_helper_command) are intentional for in-cluster serving. Ruff S104 warnings on these lines should not be flagged or require suppression comments, as the maintainer (amaslenn) has explicitly confirmed this is the intended behavior.

Learnt from: podkidyshev
Repo: NVIDIA/cloudai PR: 846
File: src/cloudai/reporter.py:168-168
Timestamp: 2026-03-24T23:24:17.668Z
Learning: In the NVIDIA/cloudai repository, follow the project-wide logging convention: use root/global logging calls like `logging.info(...)` and `logging.warning(...)` rather than creating module-level loggers via `logging.getLogger(__name__)`. When reviewing Python code, do not flag or reference LOG015 for this reason and do not suggest switching to module-level loggers. Separately, Ruff’s formatting rule (e.g., G004 f-string vs %-style) is independent and may still be raised if applicable.

Learnt from: podkidyshev
Repo: NVIDIA/cloudai PR: 869
File: src/cloudai/_core/test_scenario.py:23-23
Timestamp: 2026-04-14T11:05:59.444Z
Learning: In the NVIDIA/cloudai repository, do not surface or fail code-review findings for Ruff UP035 warnings about deprecated typing aliases (e.g., typing.List, typing.Set, typing.Type) on individual files. This deprecation cleanup is planned for a dedicated whole-repo modernization bump, so UP035 should be treated as an expected/known issue rather than something reviewers should block at the per-file level.

Learnt from: orbalayla-nvidia
Repo: NVIDIA/cloudai PR: 899
File: src/cloudai/_core/grading_strategy.py:2-2
Timestamp: 2026-06-07T12:01:43.285Z
Learning: In the NVIDIA/cloudai repo, copyright headers’ year formatting must match the test logic enforced by tests/test_check_copyright_headers.py (via collect_years_same_file and _format_years_to_ranges): use comma-separated year lists for non-consecutive “touched” years (e.g., "2024, 2026") and use hyphen ranges only for consecutive years (e.g., "2024-2026"). Do not “normalize” "YYYY, YYYY" into "YYYY-YYYY" unless the missing years are actually present in git history; otherwise the copyright header check will fail.

Learnt from: rutayan-nv
Repo: NVIDIA/cloudai PR: 901
File: src/cloudai/cli/handlers.py:147-162
Timestamp: 2026-06-24T01:05:39.500Z
Learning: For NVIDIA/cloudai: Ruff rule G004 (logging-f-string) is not enabled in this repo’s Ruff configuration, and CI Ruff checks pass even with f-string usage in logging calls. During code reviews, do not flag or request changes specifically for Ruff G004 violations; leave logging f-strings as-is in .py files unless CI or other enabled rules fail.

rutayan-nv added a commit to rutayan-nv/cloudai that referenced this pull request Jun 23, 2026
GymnasiumAdapter._decode_continuous clamps to [low, high] then rounds, so
a dtype="int" space with non-integer bounds (e.g. [0.2, 0.8]) can decode
to a value outside the declared range (1). Validate that int-quantized
spaces declare integer bounds (CodeRabbit, NVIDIA#932/NVIDIA#927).
@rutayan-nv rutayan-nv force-pushed the rpatro/action-space-continuous branch from 463efee to 3ca21af Compare June 23, 2026 21:50

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/cloudai/models/workload.py`:
- Around line 236-245: The validation loop in the method iterating over
self.env_params.items() currently skips validation for unweighted candidates
when spec.weights is None, allowing empty cmd_args lists to pass through without
validation. This causes a crash later in EnvParamsSampler.sample() when
rng.choice() is called on an empty list. Remove the continue statement that
skips validation when spec.weights is None, and instead apply the same candidate
list validation (checking that the value is a list with at least 2 values) to
both weighted and unweighted cases. This ensures empty unweighted candidate
lists are caught and rejected during validation, preventing runtime crashes in
the sampler.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 290f3ff0-f5c1-446a-a2a0-3b7c87d25901

📥 Commits

Reviewing files that changed from the base of the PR and between 463efee and 3ca21af.

📒 Files selected for processing (11)
  • src/cloudai/_core/action_space.py
  • src/cloudai/_core/test_scenario.py
  • src/cloudai/cli/handlers.py
  • src/cloudai/configurator/cloudai_gym.py
  • src/cloudai/configurator/env_params.py
  • src/cloudai/core.py
  • src/cloudai/models/workload.py
  • tests/test_action_space.py
  • tests/test_cloudaigym.py
  • tests/test_env_params.py
  • tests/test_handlers.py

Comment thread src/cloudai/models/workload.py Outdated
rutayan-nv added a commit to rutayan-nv/cloudai that referenced this pull request Jun 24, 2026
GymnasiumAdapter._decode_continuous clamps to [low, high] then rounds, so
a dtype="int" space with non-integer bounds (e.g. [0.2, 0.8]) can decode
to a value outside the declared range (1). Validate that int-quantized
spaces declare integer bounds (CodeRabbit, NVIDIA#932/NVIDIA#927).
@rutayan-nv rutayan-nv force-pushed the rpatro/action-space-continuous branch from 3ca21af to 720b16a Compare June 24, 2026 01:02

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
tests/test_action_space.py (1)

33-33: 📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Split the combined type assertion for clearer pytest failure output.

Line 33 combines two checks in one assert, which makes failures less precise and triggers PT018.

Suggested patch
-    assert isinstance(space.low, float) and isinstance(space.high, float)
+    assert isinstance(space.low, float)
+    assert isinstance(space.high, float)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_action_space.py` at line 33, Split the combined isinstance check
for space.low and space.high from the single assert statement into two separate
assertions. Create one assertion to check isinstance(space.low, float) and
another to check isinstance(space.high, float). This approach provides clearer
pytest failure output when individual checks fail and resolves the PT018 linting
violation.

Source: Linters/SAST tools

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Duplicate comments:
In `@tests/test_action_space.py`:
- Line 33: Split the combined isinstance check for space.low and space.high from
the single assert statement into two separate assertions. Create one assertion
to check isinstance(space.low, float) and another to check
isinstance(space.high, float). This approach provides clearer pytest failure
output when individual checks fail and resolves the PT018 linting violation.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: f3855293-a669-461a-a392-e7a1363e04b4

📥 Commits

Reviewing files that changed from the base of the PR and between 3ca21af and 720b16a.

📒 Files selected for processing (6)
  • src/cloudai/_core/action_space.py
  • src/cloudai/configurator/cloudai_gym.py
  • src/cloudai/core.py
  • src/cloudai/models/workload.py
  • tests/test_action_space.py
  • tests/test_env_params.py

rutayan-nv added a commit to rutayan-nv/cloudai that referenced this pull request Jun 24, 2026
GymnasiumAdapter._decode_continuous clamps to [low, high] then rounds, so
a dtype="int" space with non-integer bounds (e.g. [0.2, 0.8]) can decode
to a value outside the declared range (1). Validate that int-quantized
spaces declare integer bounds (CodeRabbit, NVIDIA#932/NVIDIA#927).
@rutayan-nv rutayan-nv force-pushed the rpatro/action-space-continuous branch from 720b16a to 852569a Compare June 24, 2026 10:32
@rutayan-nv rutayan-nv marked this pull request as draft June 24, 2026 10:35
@rutayan-nv

Copy link
Copy Markdown
Contributor Author

Parking this PR as draft pending a continuous-tunable use case. The ContinuousSpace primitive and the downstream GymnasiumAdapter plumbing are complete and tested, but the producer side (TestRun.param_space) does not yet surface ContinuousSpace, and no current RL workload needs it: the aiconfigurator action space is fully discrete (parallelism degrees, worker counts, batch sizes). Will revive and complete the param_space wiring when a workload with continuous tunables arrives.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/test_action_space.py`:
- Line 33: The combined type check in the action space test triggers PT018 and
hides which side fails, so split the isinstance assertion in the test around
space.low and space.high into two separate asserts for clearer failure output.
Update the assertion in the test that references space.low and space.high so
each type check stands alone.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 24222cf4-ba61-4cd1-a40c-f0dd56e67b71

📥 Commits

Reviewing files that changed from the base of the PR and between 720b16a and 852569a.

📒 Files selected for processing (5)
  • src/cloudai/_core/action_space.py
  • src/cloudai/core.py
  • src/cloudai/models/workload.py
  • tests/test_action_space.py
  • tests/test_env_params.py

def test_continuous_space_coerces_int_bounds_and_keeps_int_dtype() -> None:
space = ContinuousSpace(low=0, high=200, dtype="int")
assert space.dtype == "int"
assert isinstance(space.low, float) and isinstance(space.high, float)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Split the combined type assertion for clearer failure output.

Line 33 triggers PT018 and makes failures less precise; use two asserts.

Suggested patch
-    assert isinstance(space.low, float) and isinstance(space.high, float)
+    assert isinstance(space.low, float)
+    assert isinstance(space.high, float)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
assert isinstance(space.low, float) and isinstance(space.high, float)
assert isinstance(space.low, float)
assert isinstance(space.high, float)
🧰 Tools
🪛 Ruff (0.15.18)

[warning] 33-33: Assertion should be broken down into multiple parts

Break down assertion into multiple parts

(PT018)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_action_space.py` at line 33, The combined type check in the action
space test triggers PT018 and hides which side fails, so split the isinstance
assertion in the test around space.low and space.high into two separate asserts
for clearer failure output. Update the assertion in the test that references
space.low and space.high so each type check stands alone.

Source: Linters/SAST tools

Make env_params a first-class part of CloudAIGymEnv trial identity so the
trajectory cache keys on (action, env_params) rather than action alone,
fixing the domain-randomization correctness bug where the same action under
a different env_params sample returned a stale reward.

- Cache key now includes env_params; cache-key tests pin the contract
  (formerly the TDD-red specs of NVIDIA#900, folded in here).
- Keep env.csv and trajectory.csv 1:1 step-aligned: a single TrajectoryEntry
  sinks both files coherently, including on constraint failure.
- Reject env_params on non-DSE jobs; reject non-finite / negative weights.
- Add cache-hit + declared-env_params integration coverage.

Folds the test-only PR NVIDIA#900 (cache-key TDD) into this PR so the stack has no
permanently-red standalone PR.
…rom search space

Make env_params a thin annotation over cmd_args fields instead of a holder of
candidate values. Candidate values live in cmd_args (the single source of truth,
exactly like an action-space dimension); env_params.<name> only marks a field as
env-sampled and carries optional sampling weights, never the values.

- EnvParamSpec drops `values`; validates weights (finite, non-negative, sum=1.0).
- Sampler/observer resolve candidate lists from cmd_args; scalar knobs are no-ops.
- TestDefinition.validate_env_params cross-checks annotations against cmd_args
  (key must be a real field; weights require a list and must match its length).
- Exclude env_params keys from both param_space and is_dse_job: an env-sampled
  list is not a search dimension, so an env-params-only workload is not a DSE job.
- validate_dse_env_params rejects env_params on non-DSE runs and on grid_search
  (exhaustive search cannot exploit per-trial randomization).
- Scrub private-implementation references from public docstrings.
- Unit tests use generic Atari Breakout semantics (ball_speed / paddle_width).
…pyright

- validate_env_params: reject structured (non-leaf) cmd_args targets. The
  observer cannot sample them, yet param_space/is_dse_job exclude the whole
  key, which would silently drop nested action dimensions.
- CloudAIGymEnv.write_trajectory: rebind the env.csv sink to the current
  iteration path before each write, so env.csv stays 1:1 aligned with
  trajectory.csv when the env instance is reused across iterations.
- test_env_params: assert the unknown-field rejection via model_validate so the
  negative test no longer trips pyright's call-arg check (CI Linting fix); add a
  structured-target rejection test.
An unweighted env_params spec skipped the candidate-list check, so an empty cmd_args list
(e.g. ball_speed = []) passed validation and only failed later in EnvParamsSampler.sample()
via rng.choice([]) (IndexError). Guard against an empty candidate list in validate_env_params
so the error surfaces at TestDefinition build time. Addresses CodeRabbit feedback.
…ms value objects

Replace the EnvParamsSampler class and the StepObserver/EnvParamsObserver
indirection with two frozen dataclasses: EnvParam (one resolved knob: candidates,
optional weights, single draw) and EnvParams (per-run knobs + seed, built via
from_test, sampled per trial). The sampling RNG lives in the env: step() draws
this trial's values and hands concrete values to TestRun.apply_params_set(action,
env_params=...), which overlays action and sample through one deterministic path.

Centralize the cmd_args -> env_params lookup in TestDefinition.is_env_sampled and
access current_env_params directly. Expand EnvParam/EnvParams unit tests to cover
draw, from_test, sample, and immutability.
Drop the EnvParamsSink Protocol + CsvSink pair (and runtime_checkable) for a
single concrete EnvParamsSink, built unconditionally in CloudAIGymEnv. The
sink is now stateless: write() takes the record path per call and skips empty
samples, so non-DR runs write nothing and write_trajectory needs no branch.

Derive both records from a new iteration_dir property and expose the env
record via the env_params_record_path property (was _env_csv_path), keeping
env.csv and trajectory.csv step-aligned without coupling the name to CSV.
…ty flag

Replace the hardcoded `agent == "grid_search"` check with a BaseAgent.samples_env_params
capability flag (opt-in, defaults False). Only agents whose search consumes per-trial
env_params sampling set it True; enumerating/surrogate agents leave it False, so a config
that declares env_params for an agent that would ignore them is rejected up front instead
of silently no-op'ing. New agents answer for themselves with no string to maintain.

Relocate validate_dse_env_params out of the CLI handlers into configurator/env_params.py
next to the logic it guards, looking the agent up via the Registry. Unknown agents are
deferred to the dedicated agent-resolution error rather than masked here.

Keep all public-facing comments, docstrings, and the error message generic (no internal
agent names). Cover the full validator matrix, including the unknown-agent deferral.
Compress multi-line inline comments down to the single non-obvious rationale (or drop
them where the code already speaks), per the self-documenting-code principle. Public API
docstrings and test intent comments are left intact.
apply_params_set overlays sampled scalar draws onto cmd_args, then
reconstructs the TestDefinition to validate the applied action values.
That pass re-ran validate_env_params, which rejects a weighted env_param
whose cmd_args target is no longer a candidate list - exactly what the
overlay produces. env_params is already validated at parse time, so drop
it from the validation-only dump. Adds a regression test covering a
weighted env_param's scalar draw.
rutayan-nv added a commit to rutayan-nv/cloudai that referenced this pull request Jun 25, 2026
GymnasiumAdapter._decode_continuous clamps to [low, high] then rounds, so
a dtype="int" space with non-integer bounds (e.g. [0.2, 0.8]) can decode
to a value outside the declared range (1). Validate that int-quantized
spaces declare integer bounds (CodeRabbit, NVIDIA#932/NVIDIA#927).
@rutayan-nv rutayan-nv force-pushed the rpatro/action-space-continuous branch from 852569a to ba39c51 Compare June 25, 2026 21:15
@rutayan-nv

Copy link
Copy Markdown
Contributor Author

Parking this PR for now. Per a change in release plan, this work isn't needed for the upcoming aiconfigurator release, so I'm moving it to draft to keep it out of the review queue. @podkidyshev — no need to review this one yet; I'll mark it ready again when it's scheduled for release. Thanks!

@rutayan-nv

Copy link
Copy Markdown
Contributor Author

Heads-up for when this is unparked: the GymnasiumAdapter (#930) previously imported ContinuousSpace from this PR. To let #930 ship independently, its continuous-action support (the ContinuousSpace->gym.Box mapping plus decode/encode/clamp/quantize handling and the continuous contract tests) was removed in commit 0417e79. The full pre-decouple adapter is preserved on branch rpatro/gym-adapter-continuous so that support can be rebased on top of #927 + #930 and re-applied once this lands. @podkidyshev

An env_params entry only reclassifies a list-valued cmd_args sweep as
env-sampled; a scalar is already fixed, so annotating it is a meaningless
label. Previously such an annotation was tolerated as a silent no-op, which
let it slip through parse-time validation and inconsistently trip (or not)
the downstream "no agent will sample them" check depending on run mode.

Reject it where the contract lives - TestDefinition.validate_env_params -
so the failure is immediate and mode-independent. EnvParams.from_test's
non-list guard becomes defensive (parse-time now guarantees lists); the
post-overlay path already drops env_params before re-validating, so concrete
scalar draws are unaffected.

Extract the per-field checks into a helper to keep the validator under the
complexity limit, and update tests: scalar annotations now assert rejection
instead of no-op tolerance.
Introduce ContinuousSpace, a closed-interval [low, high] action-space
dimension with a dtype ("int" | "float") that declares how consumers
should quantize decoded samples. CloudAI action spaces can now express
continuous tunables alongside discrete (list) candidate domains; agents
and adapters read low/high/dtype to build their own representation.

Validated low < high; exported via cloudai.core.
…ejection tests

These negative tests pass invalid dtype/extra kwargs to assert pydantic
ValidationError; mark the deliberate type violations with type: ignore.
GymnasiumAdapter._decode_continuous clamps to [low, high] then rounds, so
a dtype="int" space with non-integer bounds (e.g. [0.2, 0.8]) can decode
to a value outside the declared range (1). Validate that int-quantized
spaces declare integer bounds (CodeRabbit, NVIDIA#932/NVIDIA#927).
@rutayan-nv rutayan-nv force-pushed the rpatro/action-space-continuous branch from ba39c51 to fb2c239 Compare June 26, 2026 15:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant