fix(configurator): make env_params first-class to fix the trajectory cache key#901
fix(configurator): make env_params first-class to fix the trajectory cache key#901rutayan-nv wants to merge 10 commits into
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: ASSERTIVE Plan: Enterprise Run ID: 📒 Files selected for processing (3)
📝 WalkthroughWalkthroughAdds per-trial env parameter sampling and validation, threads sampled ChangesEnv parameter sampling and DSE flow
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
d0375b4 to
d076095
Compare
There was a problem hiding this comment.
Actionable comments posted: 7
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
src/cloudai/configurator/cloudai_gym.py (2)
143-171: 🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick winKeep
trajectory.csvaligned on the constraint-failure path.
observer.before_step()has already persisted the sampled env row by Line 147, but the early return on Lines 169-171 skipswrite_trajectory(). Any trial rejected byconstraint_check()now leaves an orphanenv.csvrow, which breaks the PR’s stated 1:1env.csv↔trajectory.csvcontract and makes per-trial artifacts inconsistent.Suggested fix
if not self.test_run.test.constraint_check(self.test_run, self.runner.system): logging.info("Constraint check failed. Skipping step.") - return [-1.0], self.rewards.constraint_failure, True, {} + observation = [-1.0] + reward = self.rewards.constraint_failure + self.write_trajectory( + TrajectoryEntry( + step=self.test_run.step, + action=action, + reward=reward, + observation=observation, + env_params=dict(self.test_run.current_env_params), + ) + ) + for observer in self.observers: + observer.after_step(self.test_run, observation, reward) + return observation, reward, True, {}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/cloudai/configurator/cloudai_gym.py` around lines 143 - 171, The early return on constraint_check leaves an env row without a matching trajectory; to fix, when test_run.test.constraint_check(test_run, runner.system) returns False, write a TrajectoryEntry (use self.write_trajectory with step=self.test_run.step, action=action, reward=self.rewards.constraint_failure, observation=[-1.0], env_params=dict(self.test_run.current_env_params)) and call each observer.after_step(self.test_run, observation, reward) before returning the same quadruple (observation, reward, True, {}); this mirrors the cached-result path and preserves the 1:1 env.csv ↔ trajectory.csv contract.
186-202: 🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick winPreserve
current_env_paramswhen recovering from runner failures.On Line 189 the fallback rebuilds
self.test_runfromself.original_test_run, which resetscurrent_env_paramsto its default. The trajectory row written on Line 202 then records{}even thoughenv.csvalready contains the sampled params, and the cached failure is stored under(action, {})instead of(action, sampled_env_params). Repeating the same failing trial will miss the cache and produce disagreeing artifacts.Suggested fix
if self.runner.test_scenario.test_runs and self.runner.test_scenario.test_runs[0].output_path.exists(): self.test_run = self.runner.test_scenario.test_runs[0] else: - self.test_run = copy.deepcopy(self.original_test_run) + self.test_run = copy.deepcopy(self.original_test_run).apply_params_set(action) self.test_run.step = new_tr.step self.test_run.output_path = new_tr.output_path + self.test_run.current_env_params = dict(new_tr.current_env_params)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/cloudai/configurator/cloudai_gym.py` around lines 186 - 202, When rebuilding self.test_run in the fallback branch (inside cloudai_gym.py) you must preserve the sampled env params so the trajectory/cache keys match the actual env.csv; after creating the deep-copied self.test_run from self.original_test_run and after setting step/output_path, copy the sampled params into self.test_run.current_env_params (e.g. assign a deepcopy of self.original_test_run.current_env_params or the sampled params present on the runner) before calling get_observation/compute_reward and write_trajectory; this ensures dict(self.test_run.current_env_params) in the TrajectoryEntry reflects the real sampled parameters.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/cloudai/configurator/base_agent.py`:
- Around line 116-149: In run(), replace the eager f-string logging calls with
%-style deferred logging so formatting happens only if the message will be
emitted; specifically update the two logging.info calls inside the run method
(the "Running step ..." and "Step {step}: Observation..." messages) to use
logging.info("Running step %s (of %s) with action %s", step, self.max_steps,
action) and logging.info("Step %s: Observation: %s, Reward: %.4f", step, ...,
reward) respectively, ensuring you format the observation list without using
f-strings so the logging framework can lazily interpolate.
In `@src/cloudai/configurator/env_params.py`:
- Around line 1-2: The file header years in
src/cloudai/configurator/env_params.py are incorrect for the touched year;
update the two top lines (the SPDX copyright line and the subsequent Copyright
line) by replacing "2024-2026" with the exact year string the repo's header test
expects for this file (e.g., "2026") so the header matches
tests/test_check_copyright_headers.py.
In `@src/cloudai/configurator/gymnasium_adapter.py`:
- Around line 133-165: Add a public close(self) -> None method to
GymnasiumAdapter that safely delegates to the wrapped environment's close
method: check for an attribute (e.g. getattr(self._env, "close", None)), call it
if callable, and return None; do not assume inheritance from gymnasium.Env. Add
the method near other lifecycle helpers (e.g. alongside
_sync_underlying_step_counter, _step_with_params) so callers that call
env.close() won't raise AttributeError.
- Around line 77-80: GymnasiumAdapter.__init__ currently filters action space
lists into _tunable_params (len>1) and _fixed_params (len==1) but silently drops
any keys with empty lists; modify GymnasiumAdapter.__init__ to detect any entry
in the raw_action_space where len(v) == 0 and raise a ValueError identifying the
offending key(s) so callers (e.g., TestRun.param_space) fail fast; add a unit
test that constructs a TestRun or raw_action_space containing an empty list for
a cmd_args key and asserts that GymnasiumAdapter(...) raises ValueError with a
message mentioning the empty key(s).
In `@tests/test_env_params.py`:
- Around line 86-87: The test uses a broad pytest.raises(ValueError) when
calling sink.write(0, {"drop_rate": 0.0}); narrow it by adding a match parameter
to pytest.raises that matches the specific ValueError message expected (e.g.,
the message about invalid drop_rate or index) so the test asserts the exact
error condition raised by sink.write; update the pytest.raises invocation around
sink.write to include match="expected message substring" referencing the
sink.write call and pytest.raises.
- Around line 1-2: The copyright header in the file contains the range
"2025-2026" which fails CI; update the header lines (the SPDX line and the
Copyright line that currently reads "2025-2026 NVIDIA CORPORATION & AFFILIATES")
to use a single year "2026" so they match the expected format for new 2026
files.
In `@tests/test_handlers.py`:
- Around line 287-288: The override of select_action in tests/test_handlers.py
has the wrong signature; change the function declaration for select_action in
that file to match the base class signature BaseAgent.select_action(self,
observation: list[float] | None = None) (i.e., add the observation parameter
with the same type and default) and keep the existing body (raising
AssertionError) so pyright/CI type-checking passes while preserving the test
intent.
---
Outside diff comments:
In `@src/cloudai/configurator/cloudai_gym.py`:
- Around line 143-171: The early return on constraint_check leaves an env row
without a matching trajectory; to fix, when
test_run.test.constraint_check(test_run, runner.system) returns False, write a
TrajectoryEntry (use self.write_trajectory with step=self.test_run.step,
action=action, reward=self.rewards.constraint_failure, observation=[-1.0],
env_params=dict(self.test_run.current_env_params)) and call each
observer.after_step(self.test_run, observation, reward) before returning the
same quadruple (observation, reward, True, {}); this mirrors the cached-result
path and preserves the 1:1 env.csv ↔ trajectory.csv contract.
- Around line 186-202: When rebuilding self.test_run in the fallback branch
(inside cloudai_gym.py) you must preserve the sampled env params so the
trajectory/cache keys match the actual env.csv; after creating the deep-copied
self.test_run from self.original_test_run and after setting step/output_path,
copy the sampled params into self.test_run.current_env_params (e.g. assign a
deepcopy of self.original_test_run.current_env_params or the sampled params
present on the runner) before calling get_observation/compute_reward and
write_trajectory; this ensures dict(self.test_run.current_env_params) in the
TrajectoryEntry reflects the real sampled parameters.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 0097c65c-5ab7-4bd2-8683-6e1ea26cc182
⛔ Files ignored due to path filters (1)
uv.lockis excluded by!**/*.lock
📒 Files selected for processing (14)
pyproject.tomlsrc/cloudai/_core/test_scenario.pysrc/cloudai/cli/handlers.pysrc/cloudai/configurator/__init__.pysrc/cloudai/configurator/base_agent.pysrc/cloudai/configurator/cloudai_gym.pysrc/cloudai/configurator/env_params.pysrc/cloudai/configurator/gymnasium_adapter.pysrc/cloudai/models/workload.pytests/test_cloudaigym.pytests/test_env_params.pytests/test_gymnasium_adapter.pytests/test_handlers.pytests/test_test_scenario.py
d076095 to
5c85e08
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/cloudai/configurator/cloudai_gym.py (1)
169-171:⚠️ Potential issue | 🟠 Major | ⚡ Quick winPreserve env/traj alignment and env_params identity on non-happy paths.
Line 170 currently returns on constraint failure after
before_step()has already persistedenv.csv, soenv.csvandtrajectory.csvcan diverge.
Also, in Lines 189-203, replacingself.test_runwithoriginal_test_runcan dropcurrent_env_params, so the trajectory row is written with empty env params and weakens(action, env_params)cache identity.Proposed fix
if not self.test_run.test.constraint_check(self.test_run, self.runner.system): logging.info("Constraint check failed. Skipping step.") - return [-1.0], self.rewards.constraint_failure, True, {} + observation = [-1.0] + reward = self.rewards.constraint_failure + self.write_trajectory( + TrajectoryEntry( + step=self.test_run.step, + action=action, + reward=reward, + observation=observation, + env_params=dict(self.test_run.current_env_params), + ) + ) + for observer in self.observers: + observer.after_step(self.test_run, observation, reward) + return observation, reward, True, {} @@ if self.runner.test_scenario.test_runs and self.runner.test_scenario.test_runs[0].output_path.exists(): self.test_run = self.runner.test_scenario.test_runs[0] else: + preserved_env_params = dict(new_tr.current_env_params) self.test_run = copy.deepcopy(self.original_test_run) self.test_run.step = new_tr.step self.test_run.output_path = new_tr.output_path + self.test_run.current_env_params = preserved_env_paramsAlso applies to: 189-203
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/cloudai/configurator/cloudai_gym.py` around lines 169 - 171, The constraint check at lines 169-171 returns after before_step() has already persisted env.csv, causing env.csv and trajectory.csv to diverge. Move the constraint check before the before_step() call so that early returns don't create misalignment between the two files. Additionally, at lines 189-203, ensure that self.test_run is used instead of original_test_run to preserve current_env_params in the trajectory row, maintaining proper action-env_params cache identity.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/cloudai/configurator/gymnasium_adapter.py`:
- Around line 151-163: The _sync_underlying_step_counter method has an
off-by-one error where it sets test_run.step to self._step_count + 1, but
CloudAIGymEnv.step() also increments test_run.step on entry, causing a double
increment. Fix this by changing the assignment to test_run.step =
self._step_count instead, since the increment will be handled by
CloudAIGymEnv.step() when it is called, ensuring the first adapter-driven step
is recorded as trial 1 rather than trial 2.
---
Outside diff comments:
In `@src/cloudai/configurator/cloudai_gym.py`:
- Around line 169-171: The constraint check at lines 169-171 returns after
before_step() has already persisted env.csv, causing env.csv and trajectory.csv
to diverge. Move the constraint check before the before_step() call so that
early returns don't create misalignment between the two files. Additionally, at
lines 189-203, ensure that self.test_run is used instead of original_test_run to
preserve current_env_params in the trajectory row, maintaining proper
action-env_params cache identity.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: ea1d6771-30cc-49dd-bc66-394901097b13
⛔ Files ignored due to path filters (1)
uv.lockis excluded by!**/*.lock
📒 Files selected for processing (12)
pyproject.tomlsrc/cloudai/_core/test_scenario.pysrc/cloudai/cli/handlers.pysrc/cloudai/configurator/__init__.pysrc/cloudai/configurator/cloudai_gym.pysrc/cloudai/configurator/env_params.pysrc/cloudai/configurator/gymnasium_adapter.pysrc/cloudai/models/workload.pytests/test_cloudaigym.pytests/test_env_params.pytests/test_gymnasium_adapter.pytests/test_handlers.py
5c85e08 to
0efcd11
Compare
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
src/cloudai/configurator/cloudai_gym.py (1)
146-171:⚠️ Potential issue | 🟠 Major | ⚡ Quick winKeep
env.csvandtrajectory.csvaligned on constraint-failure steps.
before_step()runs before the constraint check, soenv.csvis already appended. The early return on Line 171 skipswrite_trajectory(), creating step drift between artifacts when constraints fail.Proposed fix
if not self.test_run.test.constraint_check(self.test_run, self.runner.system): logging.info("Constraint check failed. Skipping step.") - return [-1.0], self.rewards.constraint_failure, True, {} + observation = [-1.0] + reward = self.rewards.constraint_failure + self.write_trajectory( + TrajectoryEntry( + step=self.test_run.step, + action=action, + reward=reward, + observation=observation, + env_params=dict(self.test_run.current_env_params), + ) + ) + for observer in self.observers: + observer.after_step(self.test_run, observation, reward) + return observation, reward, True, {}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/cloudai/configurator/cloudai_gym.py` around lines 146 - 171, The constraint check failure on line 171 returns early without calling write_trajectory(), while before_step() has already been called on line 146, causing env.csv and trajectory.csv to become misaligned. Add a write_trajectory() call before the early return on constraint failure, similar to what is done in the cached result case, to write a TrajectoryEntry with the current step, action, constraint_failure reward, an appropriate observation, and current environment parameters to keep the CSV artifacts synchronized.tests/test_cloudaigym.py (1)
165-194: 🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick winAdd an env_params-enabled constraint-failure regression test.
Current constraint-failure coverage does not exercise observer-enabled runs. Add a variant with
tdef.env_paramsdeclared and assert failed steps keepenv.csvandtrajectory.csvstep-aligned.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/test_cloudaigym.py` around lines 165 - 194, The test_constraint_failure function needs to be expanded to cover observer-enabled runs. Add a new test variant by either parameterizing the existing test or creating an additional test case where tdef.env_params is declared before creating the CloudAIGymEnv. After the env.step() call, add assertions to verify that the env.csv and trajectory.csv output files are kept step-aligned following the constraint failure, ensuring that observer-enabled runs handle failed steps correctly without misaligning the step records across output files.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@src/cloudai/configurator/cloudai_gym.py`:
- Around line 146-171: The constraint check failure on line 171 returns early
without calling write_trajectory(), while before_step() has already been called
on line 146, causing env.csv and trajectory.csv to become misaligned. Add a
write_trajectory() call before the early return on constraint failure, similar
to what is done in the cached result case, to write a TrajectoryEntry with the
current step, action, constraint_failure reward, an appropriate observation, and
current environment parameters to keep the CSV artifacts synchronized.
In `@tests/test_cloudaigym.py`:
- Around line 165-194: The test_constraint_failure function needs to be expanded
to cover observer-enabled runs. Add a new test variant by either parameterizing
the existing test or creating an additional test case where tdef.env_params is
declared before creating the CloudAIGymEnv. After the env.step() call, add
assertions to verify that the env.csv and trajectory.csv output files are kept
step-aligned following the constraint failure, ensuring that observer-enabled
runs handle failed steps correctly without misaligning the step records across
output files.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 804b42a1-6f0a-462b-b3b1-291ec8eea021
⛔ Files ignored due to path filters (1)
uv.lockis excluded by!**/*.lock
📒 Files selected for processing (12)
pyproject.tomlsrc/cloudai/_core/test_scenario.pysrc/cloudai/configurator/__init__.pysrc/cloudai/configurator/base_agent.pysrc/cloudai/configurator/cloudai_gym.pysrc/cloudai/configurator/env_params.pysrc/cloudai/configurator/gymnasium_adapter.pysrc/cloudai/models/workload.pytests/test_cloudaigym.pytests/test_env_params.pytests/test_gymnasium_adapter.pytests/test_handlers.py
0efcd11 to
b3adf41
Compare
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/cloudai/configurator/cloudai_gym.py`:
- Around line 146-148: The constraint-failure early return in the code is
preventing `trajectory.csv` from being written and `after_step()` hooks from
being called, which breaks trial artifact consistency. Additionally, the early
return provides a fixed one-element observation instead of one matching the
configured observation space. Restructure the constraint-failure handling at
both affected locations to ensure that after encountering a constraint failure,
the code still executes the necessary finalization steps including writing
`trajectory.csv` and calling `after_step()` on all observers, while also
returning an observation that matches the shape of the configured observation
space rather than a hardcoded single-element observation. This applies at
src/cloudai/configurator/cloudai_gym.py lines 146-148 (anchor) and lines 169-172
(sibling).
In `@tests/test_cloudaigym.py`:
- Around line 567-615: Add a new test function to validate step alignment when
constraint_check fails during the env_params flow. Create a test similar to
test_env_csv_is_step_aligned_with_trajectory but configure the environment and
test scenario to trigger a constraint violation during env.step() calls. The
test must verify that despite the constraint failure: (1) env.csv and
trajectory.csv maintain 1:1 step row alignment (same step values in both files),
and (2) the observation returned from env.step() has length matching the number
of entries in agent_metrics. Use the same CloudAIGymEnv, TestRun, TestScenario,
and BaseRunner mocking patterns as the existing alignment test, but introduce
constraint conditions that will cause failures and ensure the alignment
verification assertions apply to the failure case.
In `@tests/test_gymnasium_adapter.py`:
- Around line 70-76: The `_GymWithTestRun` class initializes a `test_run` object
with a step counter but never increments it in the `step()` method, unlike the
real `CloudAIGymEnv.step()`. This means test assertions won't catch counter-sync
regressions in the adapter. Override or update the `step()` method in
`_GymWithTestRun` to increment `self.test_run.step` each time the method is
called, ensuring the test model accurately reflects the behavior of
`CloudAIGymEnv` and allows the adapter's pre-step sync assertions to detect
regressions.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 0af1c210-2370-43d6-90f0-330aa01c925f
⛔ Files ignored due to path filters (1)
uv.lockis excluded by!**/*.lock
📒 Files selected for processing (13)
pyproject.tomlsrc/cloudai/_core/test_scenario.pysrc/cloudai/cli/handlers.pysrc/cloudai/configurator/__init__.pysrc/cloudai/configurator/base_agent.pysrc/cloudai/configurator/cloudai_gym.pysrc/cloudai/configurator/env_params.pysrc/cloudai/configurator/gymnasium_adapter.pysrc/cloudai/models/workload.pytests/test_cloudaigym.pytests/test_env_params.pytests/test_gymnasium_adapter.pytests/test_handlers.py
b3adf41 to
b06d621
Compare
b06d621 to
2411f39
Compare
|
@coderabbitai review |
✅ Action performedReview finished.
|
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/cloudai/configurator/cloudai_gym.py (1)
168-170:⚠️ Potential issue | 🟠 Major | ⚡ Quick winConstraint-failure early return still breaks
env.csv/trajectory.csvalignment.When
constraint_checkfails,before_step()has already executed (line 146), writing toenv.csv. But this early return skips bothwrite_trajectory()andafter_step()calls. This creates misalignment:env.csvhas step N whiletrajectory.csvdoes not.The past review flagged this and was marked "Addressed", but the current code still shows the bare early return. Either the fix was incomplete or reverted.
🛠️ Proposed fix
if not self.test_run.test.constraint_check(self.test_run, self.runner.system): logging.info("Constraint check failed. Skipping step.") - return [-1.0], self.rewards.constraint_failure, True, {} + failure_obs = [-1.0] + failure_reward = self.rewards.constraint_failure + self.write_trajectory( + TrajectoryEntry( + step=self.test_run.step, + action=action, + reward=failure_reward, + observation=failure_obs, + env_params=dict(self.test_run.current_env_params), + ) + ) + for observer in self.observers: + observer.after_step(self.test_run, failure_obs, failure_reward) + return failure_obs, failure_reward, True, {}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/cloudai/configurator/cloudai_gym.py` around lines 168 - 170, The constraint_check failure early return at line 168-170 skips both write_trajectory() and after_step() calls while before_step() has already executed and written to env.csv, causing misalignment between env.csv and trajectory.csv. Move the write_trajectory() and after_step() method calls to execute before this early return statement to ensure both files stay synchronized, so that whenever a step is written to env.csv, the corresponding trajectory is written to trajectory.csv.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@src/cloudai/configurator/cloudai_gym.py`:
- Around line 168-170: The constraint_check failure early return at line 168-170
skips both write_trajectory() and after_step() calls while before_step() has
already executed and written to env.csv, causing misalignment between env.csv
and trajectory.csv. Move the write_trajectory() and after_step() method calls to
execute before this early return statement to ensure both files stay
synchronized, so that whenever a step is written to env.csv, the corresponding
trajectory is written to trajectory.csv.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 3128d272-6f62-499c-b04a-76f177dbb799
📒 Files selected for processing (10)
src/cloudai/_core/test_scenario.pysrc/cloudai/cli/handlers.pysrc/cloudai/configurator/base_agent.pysrc/cloudai/configurator/cloudai_gym.pysrc/cloudai/configurator/env_params.pysrc/cloudai/models/workload.pytests/test_cloudaigym.pytests/test_env_params.pytests/test_handlers.pytests/test_test_scenario.py
2411f39 to
c53596d
Compare
|
@coderabbitai review |
✅ Action performedReview finished.
|
There was a problem hiding this comment.
♻️ Duplicate comments (1)
src/cloudai/configurator/cloudai_gym.py (1)
168-170:⚠️ Potential issue | 🟠 Major | ⚡ Quick winPreserve step finalization on constraint failure to keep artifacts aligned.
At Line 168, this early return skips both trajectory persistence and observer
after_step()even thoughbefore_step()already ran. That can leaveenv.csvandtrajectory.csvout of sync by step and breaks the observer lifecycle contract on failure paths.💡 Proposed fix
if not self.test_run.test.constraint_check(self.test_run, self.runner.system): logging.info("Constraint check failed. Skipping step.") - return [-1.0], self.rewards.constraint_failure, True, {} + failure_observation = [-1.0] * max(len(self.test_run.test.agent_metrics), 1) + failure_reward = self.rewards.constraint_failure + self.write_trajectory( + TrajectoryEntry( + step=self.test_run.step, + action=action, + reward=failure_reward, + observation=failure_observation, + env_params=dict(self.test_run.current_env_params), + ) + ) + for observer in self.observers: + observer.after_step(self.test_run, failure_observation, failure_reward) + return failure_observation, failure_reward, True, {}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/cloudai/configurator/cloudai_gym.py` around lines 168 - 170, The constraint check failure at line 168 returns early before executing trajectory persistence and the observer after_step() method, breaking the observer lifecycle since before_step() already executed. Refactor this code so that the finalization steps (trajectory persistence and observer after_step() call) run even when constraint_check() fails. Move the early return to after all finalization code completes, ensuring env.csv and trajectory.csv remain synchronized and the observer contract is maintained on all code paths including constraint failures.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Duplicate comments:
In `@src/cloudai/configurator/cloudai_gym.py`:
- Around line 168-170: The constraint check failure at line 168 returns early
before executing trajectory persistence and the observer after_step() method,
breaking the observer lifecycle since before_step() already executed. Refactor
this code so that the finalization steps (trajectory persistence and observer
after_step() call) run even when constraint_check() fails. Move the early return
to after all finalization code completes, ensuring env.csv and trajectory.csv
remain synchronized and the observer contract is maintained on all code paths
including constraint failures.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: e0a47585-a77f-469b-ad79-79b222c1b20a
📒 Files selected for processing (6)
src/cloudai/_core/test_scenario.pysrc/cloudai/configurator/cloudai_gym.pysrc/cloudai/configurator/env_params.pysrc/cloudai/models/workload.pytests/test_cloudaigym.pytests/test_env_params.py
1298463 to
9e379d1
Compare
4fce989 to
5c3f65e
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/cloudai/cli/handlers.py`:
- Around line 147-162: Replace the f-string formatting in the logging.error call
(starting with "No agent available for type") with lazy logging interpolation by
converting the f-string to a regular string with placeholder values and passing
the variables as separate arguments to the logging.error function. The same
pattern should also be applied to the logging.info call at line 162 (starting
with "Using deterministic first sweep"). This defers string formatting until the
log is actually emitted, improving performance and satisfying the Ruff G004
rule.
In `@src/cloudai/configurator/cloudai_gym.py`:
- Around line 66-67: The CsvSink for env.csv is created once during
initialization but trajectory_file_path is iteration-dependent, causing the sink
to append to stale directories when the current_iteration changes. Instead of
creating the env sink once in the initialization method, move the CsvSink
creation logic for self._env_sink from the initialization code around line 80 to
be called per iteration where current_iteration is known. Initialize
self._env_sink as None in __init__ and then recreate it for each iteration using
the current iteration-dependent trajectory_file_path to ensure proper
per-iteration alignment of env.csv with the trajectory files.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 062e4bc1-2770-4594-8ee6-896015e5c6b2
📒 Files selected for processing (10)
src/cloudai/_core/test_scenario.pysrc/cloudai/cli/handlers.pysrc/cloudai/configurator/base_agent.pysrc/cloudai/configurator/cloudai_gym.pysrc/cloudai/configurator/env_params.pysrc/cloudai/models/workload.pytests/test_cloudaigym.pytests/test_env_params.pytests/test_handlers.pytests/test_test_scenario.py
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (1)
src/cloudai/configurator/cloudai_gym.py (1)
80-81: 🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick winRebind
env.csvsink using the current iteration path before each write.Line 81 initializes
CsvSinkonce, but trajectory writes use the dynamictrajectory_file_path. Ifcurrent_iterationchanges,trajectory.csvmoves whileenv.csvkeeps appending to the old sink path, breaking 1:1 alignment.🛠️ Proposed fix
if self._env_sink is not None: - self._env_sink.write(entry.step, entry.env_params) + # Keep env.csv colocated with the current iteration's trajectory.csv. + self._env_sink = CsvSink(self._env_csv_path()) + self._env_sink.write(entry.step, entry.env_params)Also applies to: 303-305
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/cloudai/configurator/cloudai_gym.py` around lines 80 - 81, The CsvSink for _env_sink is initialized once at the beginning but needs to be dynamically rebinded before each write to track the current iteration path, similar to how trajectory files use dynamic paths. Instead of setting self._env_sink once during initialization in the constructor, you need to rebind the _env_sink to the current iteration path before each write operation by calling CsvSink(self._env_csv_path()) dynamically when needed. This ensures that env.csv stays in sync with the trajectory.csv files as the current_iteration changes. Apply this same fix to all locations where _env_sink is used for writing, including the area referenced at lines 303-305.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/cloudai/models/workload.py`:
- Around line 236-244: The validation loop that checks env_params should reject
structured cmd_args targets instead of silently skipping them. Remove or modify
the early continue statement when spec.weights is None, and instead add
validation to check if the cmd_args field referenced by name is a structured
type (not a list). If it is a structured field without weights declared, raise a
ValueError to prevent the field from being silently excluded from the action
space by param_space or is_dse_job checks. This ensures that env_params only
references candidate list fields in cmd_args, preventing action dimensions from
being unintentionally removed.
---
Duplicate comments:
In `@src/cloudai/configurator/cloudai_gym.py`:
- Around line 80-81: The CsvSink for _env_sink is initialized once at the
beginning but needs to be dynamically rebinded before each write to track the
current iteration path, similar to how trajectory files use dynamic paths.
Instead of setting self._env_sink once during initialization in the constructor,
you need to rebind the _env_sink to the current iteration path before each write
operation by calling CsvSink(self._env_csv_path()) dynamically when needed. This
ensures that env.csv stays in sync with the trajectory.csv files as the
current_iteration changes. Apply this same fix to all locations where _env_sink
is used for writing, including the area referenced at lines 303-305.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 1960514a-f9b1-4af4-b0c4-ed40d1947f92
📒 Files selected for processing (8)
src/cloudai/_core/test_scenario.pysrc/cloudai/cli/handlers.pysrc/cloudai/configurator/cloudai_gym.pysrc/cloudai/configurator/env_params.pysrc/cloudai/models/workload.pytests/test_cloudaigym.pytests/test_env_params.pytests/test_handlers.py
|
Core owns the randomization and the Real-world use. Domain randomization: vary an uncontrolled environment condition per trial so the learned policy generalizes instead of overfitting one fixed setting. Example: the agent tunes Injection. Core vs vs. sweep. Inverse of a sweep over the same list: a bare list in #900. |
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/cloudai/models/workload.py (1)
228-230: 🎯 Functional Correctness | 🟡 Minor | ⚡ Quick winDon’t skip validation for empty
cmd_argsschemas.Line 229 treats an empty
model_fieldsmapping the same as a missing one, soenv_params={"foo": ...}is accepted whencmd_argsis a Pydantic model with no fields. Only skip whenmodel_fieldsis absent.Proposed fix
- if not cmd_args_fields: + if cmd_args_fields is None: return self🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/cloudai/models/workload.py` around lines 228 - 230, The validation logic in the workload model is treating an empty cmd_args schema the same as no schema at all, so env_params can slip through when the Pydantic model has zero fields. Update the check in the method that inspects cmd_args_fields to only return early when model_fields is missing (None), not when it is an empty mapping, so empty schemas still trigger validation and reject unexpected env_params.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@src/cloudai/models/workload.py`:
- Around line 228-230: The validation logic in the workload model is treating an
empty cmd_args schema the same as no schema at all, so env_params can slip
through when the Pydantic model has zero fields. Update the check in the method
that inspects cmd_args_fields to only return early when model_fields is missing
(None), not when it is an empty mapping, so empty schemas still trigger
validation and reject unexpected env_params.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: c111e710-1247-4ba9-b834-99fd924d1ef8
📒 Files selected for processing (2)
src/cloudai/models/workload.pytests/test_env_params.py
| if isinstance(value, list) or (isinstance(value, dict) and check_dict(value, path)): | ||
| # env_params lists are sampled by the env, not searched by the agent, so they | ||
| # are not action-space dimensions and must not make a run count as DSE. | ||
| if skip_env_params and path.split(".", 1)[0] in self.env_params: |
There was a problem hiding this comment.
I keep seeing this path.split(".", 1)[0] in regards of env_params. Does it really need to be dot-separated in all of those places?
moreover, some workloads declare more than just the root cmd_args to contain CLIs. e.g. vllm has bench_cmd_args just as well. at the same time certain feature are truly constrained to support only cmd_args
What I'm saying, I suggest the format of env_params is the full dot-separated path to a given attributed inside TestDefinition, not CmdArgs. I believe it will look cleaner in the toml:
num_nodes = 2
[cmd_args]
foo = 1
[cmd_args.args]
bar = [1, 2, 3]
[env_params]
"cmd_args.bar" = [0.1, 0.1, 0.8]There was a problem hiding this comment.
done - f25ef4d (removed the redundant split('.', 1)[0]; env_params keys are flat cmd_args field names. Kept scope to cmd_args for now rather than full TestDefinition dot-paths — can generalize to bench_cmd_args etc. in a follow-up if you'd prefer)
There was a problem hiding this comment.
Yes, I would do a follow-up later to support full paths cmd_args.foo.bar or anything.abc.def meaning any nested attribute starting from TestDefinition. So that all parsing feature act consistently
just to be clear - don't need to be this PR but highly appreciated not to be prolonged. I can take it myself as well
|
Overall arch suggestion (summary of some of my comments): move parsing and applying of |
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (1)
src/cloudai/configurator/cloudai_gym.py (1)
157-159: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick winReturn a failure observation with the configured metric shape.
BaseAgent.run()intentionally does not stop ondone=True; it forwards this observation toupdate_policy(). With multipleagent_metrics,[-1.0]has the wrong shape compared with normalget_observation()output.Proposed fix
if not self.test_run.test.constraint_check(self.test_run, self.runner.system): logging.info("Constraint check failed. Skipping step.") - return [-1.0], self.rewards.constraint_failure, True, {} + failure_observation = [self.rewards.metric_failure] * max(len(self.test_run.test.agent_metrics), 1) + return failure_observation, self.rewards.constraint_failure, True, {}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/cloudai/configurator/cloudai_gym.py` around lines 157 - 159, The constraint-check early return in cloudai_gym.py uses a hardcoded [-1.0] observation, which does not match the configured agent_metrics shape expected by BaseAgent.run() and update_policy(). Update the failure path in the gym step logic to return a failure observation built with the same structure as get_observation(), using the configured metric set rather than a single scalar placeholder, while keeping the existing reward, done flag, and info dict behavior.Source: Learnings
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/cloudai/_core/test_scenario.py`:
- Around line 205-215: The revalidation pass in test_scenario’s constructor call
is re-running validate_env_params after action and env_params have already been
applied, so the post-overlay cmd_args can make weighted specs fail. Update the
validation-only path around type(tdef)(**tdef.model_dump()) to clear env_params
before the second construction, or otherwise avoid revalidating that field after
overlays in this flow. Keep the fix localized to the trial setup logic that
applies _apply to action and env_params.
---
Duplicate comments:
In `@src/cloudai/configurator/cloudai_gym.py`:
- Around line 157-159: The constraint-check early return in cloudai_gym.py uses
a hardcoded [-1.0] observation, which does not match the configured
agent_metrics shape expected by BaseAgent.run() and update_policy(). Update the
failure path in the gym step logic to return a failure observation built with
the same structure as get_observation(), using the configured metric set rather
than a single scalar placeholder, while keeping the existing reward, done flag,
and info dict behavior.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: d61df24e-8964-44da-b280-80330b0f1f78
📒 Files selected for processing (9)
src/cloudai/_core/test_scenario.pysrc/cloudai/cli/handlers.pysrc/cloudai/configurator/base_agent.pysrc/cloudai/configurator/cloudai_gym.pysrc/cloudai/configurator/env_params.pysrc/cloudai/models/workload.pytests/test_cloudaigym.pytests/test_env_params.pytests/test_handlers.py
|
Make env_params a first-class part of CloudAIGymEnv trial identity so the trajectory cache keys on (action, env_params) rather than action alone, fixing the domain-randomization correctness bug where the same action under a different env_params sample returned a stale reward. - Cache key now includes env_params; cache-key tests pin the contract (formerly the TDD-red specs of NVIDIA#900, folded in here). - Keep env.csv and trajectory.csv 1:1 step-aligned: a single TrajectoryEntry sinks both files coherently, including on constraint failure. - Reject env_params on non-DSE jobs; reject non-finite / negative weights. - Add cache-hit + declared-env_params integration coverage. Folds the test-only PR NVIDIA#900 (cache-key TDD) into this PR so the stack has no permanently-red standalone PR.
…rom search space Make env_params a thin annotation over cmd_args fields instead of a holder of candidate values. Candidate values live in cmd_args (the single source of truth, exactly like an action-space dimension); env_params.<name> only marks a field as env-sampled and carries optional sampling weights, never the values. - EnvParamSpec drops `values`; validates weights (finite, non-negative, sum=1.0). - Sampler/observer resolve candidate lists from cmd_args; scalar knobs are no-ops. - TestDefinition.validate_env_params cross-checks annotations against cmd_args (key must be a real field; weights require a list and must match its length). - Exclude env_params keys from both param_space and is_dse_job: an env-sampled list is not a search dimension, so an env-params-only workload is not a DSE job. - validate_dse_env_params rejects env_params on non-DSE runs and on grid_search (exhaustive search cannot exploit per-trial randomization). - Scrub private-implementation references from public docstrings. - Unit tests use generic Atari Breakout semantics (ball_speed / paddle_width).
…pyright - validate_env_params: reject structured (non-leaf) cmd_args targets. The observer cannot sample them, yet param_space/is_dse_job exclude the whole key, which would silently drop nested action dimensions. - CloudAIGymEnv.write_trajectory: rebind the env.csv sink to the current iteration path before each write, so env.csv stays 1:1 aligned with trajectory.csv when the env instance is reused across iterations. - test_env_params: assert the unknown-field rejection via model_validate so the negative test no longer trips pyright's call-arg check (CI Linting fix); add a structured-target rejection test.
An unweighted env_params spec skipped the candidate-list check, so an empty cmd_args list (e.g. ball_speed = []) passed validation and only failed later in EnvParamsSampler.sample() via rng.choice([]) (IndexError). Guard against an empty candidate list in validate_env_params so the error surfaces at TestDefinition build time. Addresses CodeRabbit feedback.
…ms value objects Replace the EnvParamsSampler class and the StepObserver/EnvParamsObserver indirection with two frozen dataclasses: EnvParam (one resolved knob: candidates, optional weights, single draw) and EnvParams (per-run knobs + seed, built via from_test, sampled per trial). The sampling RNG lives in the env: step() draws this trial's values and hands concrete values to TestRun.apply_params_set(action, env_params=...), which overlays action and sample through one deterministic path. Centralize the cmd_args -> env_params lookup in TestDefinition.is_env_sampled and access current_env_params directly. Expand EnvParam/EnvParams unit tests to cover draw, from_test, sample, and immutability.
Drop the EnvParamsSink Protocol + CsvSink pair (and runtime_checkable) for a single concrete EnvParamsSink, built unconditionally in CloudAIGymEnv. The sink is now stateless: write() takes the record path per call and skips empty samples, so non-DR runs write nothing and write_trajectory needs no branch. Derive both records from a new iteration_dir property and expose the env record via the env_params_record_path property (was _env_csv_path), keeping env.csv and trajectory.csv step-aligned without coupling the name to CSV.
…ty flag Replace the hardcoded `agent == "grid_search"` check with a BaseAgent.samples_env_params capability flag (opt-in, defaults False). Only agents whose search consumes per-trial env_params sampling set it True; enumerating/surrogate agents leave it False, so a config that declares env_params for an agent that would ignore them is rejected up front instead of silently no-op'ing. New agents answer for themselves with no string to maintain. Relocate validate_dse_env_params out of the CLI handlers into configurator/env_params.py next to the logic it guards, looking the agent up via the Registry. Unknown agents are deferred to the dedicated agent-resolution error rather than masked here. Keep all public-facing comments, docstrings, and the error message generic (no internal agent names). Cover the full validator matrix, including the unknown-agent deferral.
Compress multi-line inline comments down to the single non-obvious rationale (or drop them where the code already speaks), per the self-documenting-code principle. Public API docstrings and test intent comments are left intact.
apply_params_set overlays sampled scalar draws onto cmd_args, then reconstructs the TestDefinition to validate the applied action values. That pass re-ran validate_env_params, which rejects a weighted env_param whose cmd_args target is no longer a candidate list - exactly what the overlay produces. env_params is already validated at parse time, so drop it from the validation-only dump. Adds a regression test covering a weighted env_param's scalar draw.
5790ab9 to
7fec7f5
Compare
There was a problem hiding this comment.
Actionable comments posted: 3
♻️ Duplicate comments (1)
src/cloudai/configurator/cloudai_gym.py (1)
153-155: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick winReturn a failure observation with the normal metric width.
This branch still returns
[-1.0]unconditionally. Ifagent_metricshas more than one entry, the agent sees a different observation shape on constraint-failure trials than on every successful trial.Proposed fix
if not self.test_run.test.constraint_check(self.test_run, self.runner.system): logging.info("Constraint check failed. Skipping step.") - return [-1.0], self.rewards.constraint_failure, True, {} + failure_observation = [self.rewards.metric_failure] * max(len(self.test_run.test.agent_metrics), 1) + return failure_observation, self.rewards.constraint_failure, True, {}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/cloudai/configurator/cloudai_gym.py` around lines 153 - 155, The constraint-failure branch in the gym step logic returns a fixed one-element observation, which can change the observation shape when agent_metrics normally has multiple values. Update the early return in the step method that performs the constraint_check so it builds a failure observation with the same width as the normal observation from agent_metrics, while still returning the failure reward and done flag.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/cloudai/_core/test_scenario.py`:
- Around line 206-222: The num-nodes sweep handling in test_scenario should
special-case NUM_NODES before the generic action overlay, since _apply()
currently treats every action key as a cmd_args field and can incorrectly write
NUM_NODES into the test definition. Update the logic in the action application
block so NUM_NODES is removed or handled separately from the loop that calls
_apply(), then continue using the existing new_tr.num_nodes assignment to
preserve the sweep value on the copied trajectory object.
In `@src/cloudai/cli/handlers.py`:
- Around line 307-312: The validation in handle_dry_run_and_run only checks
validate_dse_env_params against the parsed test_scenario, so env_params can
still slip through when args.single_sbatch forces the non-DSE path. Update the
validation flow in handle_dry_run_and_run to also consider args.single_sbatch
and reject env-aware DSE scenarios before dispatching to handle_non_dse_job. Use
the existing symbols validate_dse_env_params, handle_dry_run_and_run, and
handle_non_dse_job to locate the check and add the explicit guard for the
single-sbatch case.
In `@src/cloudai/configurator/env_params.py`:
- Around line 181-188: The validation loop in EnvParams.from_test is treating
any non-empty tr.test.env_params as requiring DSE sampling, which incorrectly
rejects scalar-only annotations that are meant to be fixed no-ops. Update the
check around the sampled calculation so it distinguishes scalar markers from
actual sampled env params, and only raises/blocks when the env_params entry
contains values that would truly be sampled. Keep the logic localized to
EnvParams.from_test and use the existing agent resolution path (agents.get,
tr.is_dse_job, agent.samples_env_params) to preserve the current unknown-agent
behavior.
---
Duplicate comments:
In `@src/cloudai/configurator/cloudai_gym.py`:
- Around line 153-155: The constraint-failure branch in the gym step logic
returns a fixed one-element observation, which can change the observation shape
when agent_metrics normally has multiple values. Update the early return in the
step method that performs the constraint_check so it builds a failure
observation with the same width as the normal observation from agent_metrics,
while still returning the failure reward and done flag.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 998335ea-88c1-4486-8001-0d885dc9d775
📒 Files selected for processing (9)
src/cloudai/_core/test_scenario.pysrc/cloudai/cli/handlers.pysrc/cloudai/configurator/base_agent.pysrc/cloudai/configurator/cloudai_gym.pysrc/cloudai/configurator/env_params.pysrc/cloudai/models/workload.pytests/test_cloudaigym.pytests/test_env_params.pytests/test_handlers.py
An env_params entry only reclassifies a list-valued cmd_args sweep as env-sampled; a scalar is already fixed, so annotating it is a meaningless label. Previously such an annotation was tolerated as a silent no-op, which let it slip through parse-time validation and inconsistently trip (or not) the downstream "no agent will sample them" check depending on run mode. Reject it where the contract lives - TestDefinition.validate_env_params - so the failure is immediate and mode-independent. EnvParams.from_test's non-list guard becomes defensive (parse-time now guarantees lists); the post-overlay path already drops env_params before re-validating, so concrete scalar draws are unaffected. Extract the per-field checks into a helper to keep the validator under the complexity limit, and update tests: scalar annotations now assert rejection instead of no-op tolerance.
| for key, value in action.items(): | ||
|
|
||
| def _apply(key: str, value: Any) -> None: |
There was a problem hiding this comment.
i'd suggest reverting the changes with the _apply function (local functions are a troublesome thing during debugging) and rather replace the previous root
for key, value in action.items():with
full_action = action | (env_params or {})
for key, value in full_action.items():it makes the delta of the PR quite shorter + no inner function
| new_tr.test = tdef | ||
| if "NUM_NODES" in action: | ||
| new_tr.num_nodes = action["NUM_NODES"] | ||
| new_tr.current_env_params = dict(env_params or {}) |
There was a problem hiding this comment.
nit
| new_tr.current_env_params = dict(env_params or {}) | |
| new_tr.current_env_params = env_params or {} |
| action=action, | ||
| reward=cached_result.reward, | ||
| observation=cached_result.observation, | ||
| env_params=dict(self.test_run.current_env_params), |
There was a problem hiding this comment.
nit
and I see this (to my understanding) redundant dict(...) call across entire PR. is there really a need for it?
| env_params=dict(self.test_run.current_env_params), | |
| env_params=self.test_run.current_env_params, |
| # The test_run rebuild above drops the sample; restore it so the entry, cache key, and env.csv match. | ||
| self.test_run.current_env_params = new_tr.current_env_params |
There was a problem hiding this comment.
I see the same line in src/cloudai/_core/test_scenario.py:223. If it's indeed a duplicate - I would vote for removing this one and keeping one in scenario
| if not self._values_match_exact(entry.action, action): | ||
| continue | ||
| if self._values_match_exact(entry.env_params, current_env_params): | ||
| return entry |
There was a problem hiding this comment.
nit again, but I'd like to push a bit for readibility. we sacrifice a few ms for validation but I believe it's just mush easier to read
| if not self._values_match_exact(entry.action, action): | |
| continue | |
| if self._values_match_exact(entry.env_params, current_env_params): | |
| return entry | |
| action_match = self._values_match_exact(entry.action, action) | |
| env_params_match = self._values_match_exact(entry.env_params, current_env_params) | |
| if action_match and env_params_match: | |
| return entry |
| seed: int | ||
|
|
||
| @classmethod | ||
| def from_test(cls, test: Any) -> Optional["EnvParams"]: |
| if isinstance(value, list) or (isinstance(value, dict) and check_dict(value, path)): | ||
| # env_params lists are sampled by the env, not searched by the agent, so they | ||
| # are not action-space dimensions and must not make a run count as DSE. | ||
| if skip_env_params and path.split(".", 1)[0] in self.env_params: |
There was a problem hiding this comment.
Yes, I would do a follow-up later to support full paths cmd_args.foo.bar or anything.abc.def meaning any nested attribute starting from TestDefinition. So that all parsing feature act consistently
just to be clear - don't need to be this PR but highly appreciated not to be prolonged. I can take it myself as well
| return {name: param.draw(random.Random(f"{self.seed}:{name}:{trial}")) for name, param in self.params.items()} | ||
|
|
||
|
|
||
| class EnvParamsSink: |
There was a problem hiding this comment.
redundant class definition: no state, no polymorphism
| writer.writerow([step, sample]) | ||
|
|
||
|
|
||
| def validate_dse_env_params(test_scenario: "TestScenario") -> None: |
There was a problem hiding this comment.
I wonder if you could just merge this validation with TestDefinition.validate_env_params (eliminating this one at all and calling it explicitly from handlers.py)
Follow-up gap: single-candidate (
|
Issue
actiononly → revisiting an action under a differentenv_paramssample serves the stale reward, corrupting the signal (the bug [TDD-red] CloudAIGymEnv cache key must include env_params (drives domain-randomization correctness fix) #900 pins).env.csvlived in workload code and was skipped on cache hits — sparse and misaligned withtrajectory.csv.Fix
env_paramsbecomes first-class schema (TestDefinition.env_params,TestRun.current_env_params,TrajectoryEntry.env_params); the cache key is now(action, current_env_params).CloudAIGymEnvbuilds anEnvParamsObserverthat samples deterministically and writesenv.csvbefore the cache lookup, so it aligns 1:1 withtrajectory.csv; workloads withoutenv_paramspay zero overhead.Testing
test_env_params.py) +2 integration (test_cloudaigym.py); fullpytest tests: 1641 passed, 4 skipped; ruff + pyright clean.Stack:
main← #893 ← #900 ← #901 (top). Depends on #900 → #893.