Add LLC CPU data canonicalization#670
Conversation
509b762 to
91815a4
Compare
d201ce9 to
8aa8e76
Compare
91815a4 to
7440baa
Compare
8aa8e76 to
d738b00
Compare
6aa69de to
35caa19
Compare
a21e9c9 to
4276e4a
Compare
4276e4a to
5c2d900
Compare
|
@fomo-bot can you please rebase this PR into a single clean commit on top of the latest main? |
5c2d900 to
47d5ed9
Compare
|
Done. PR #670 is now a single commit on top of latest Codex work
Codex transcript |
|
@fomo-bot can we unify the two different mock LLC datasets in tests, please? I also don't really understand all the complexity around the time stuff. Why do we need to change |
47d5ed9 to
0d2e5f3
Compare
|
Done and pushed. Codex work
Codex transcript |
| """Convert LLC time coordinates to Julian-calendar cftime datetimes.""" | ||
| values = np.asarray(time_coord.values) | ||
| if values.size == 0: | ||
| return values.astype(object) | ||
|
|
||
| if np.issubdtype(values.dtype, np.datetime64): | ||
| microseconds = values.astype("datetime64[us]").astype(np.int64) | ||
| converted = cftime.num2date( | ||
| microseconds, | ||
| "microseconds since 1970-01-01", | ||
| calendar="julian", | ||
| ) | ||
| return np.asarray(converted, dtype=object) | ||
|
|
||
| if values.dtype == object: | ||
| flat_values = values.ravel() | ||
| sample = flat_values[0] | ||
| if isinstance(sample, (cftime.datetime, dt.datetime, dt.date)): | ||
| return np.asarray( | ||
| [_to_julian_datetime(value) for value in flat_values], | ||
| dtype=object, | ||
| ).reshape(values.shape) | ||
|
|
||
| units = time_coord.attrs.get("units") | ||
| if units is None: | ||
| raise ValueError( | ||
| "LLC time coordinate is numeric but missing 'units'; unable to decode " | ||
| "raw time values." | ||
| ) |
There was a problem hiding this comment.
@fomo-bot do we really need all these cases? I would expect only one of them is in fact used?
There was a problem hiding this comment.
Done and pushed.
I simplified the LLC time conversion to only accept raw datetime64 coordinates and removed the extra cftime/Python datetime/numeric-unit fallback cases from data.py:45.
The PR is still a single commit, rebased onto current origin/main:
dece93c9 Add LLC CPU data canonicalization
Local checks passed:
uv run pytest tests/test_config.py tests/test_utils_data.py -m "not manual and not cuda"
uvx pre-commit run --files src/ocean_emulators/utils/data.py tests/test_utils_data.py
GitHub checks restarted and were queued when I checked.
Codex work
- Repository:
Open-Athena/Ocean_Emulator - Trigger: https://github.com/Open-Athena/Ocean_Emulator/pull/670#discussion_r3326718121
- Request ID:
77b6d479-83a1-4b7a-b7bc-bba0b14bd5a8 - Accepted:
true
Codex transcript
- I’ll pull the review-comment context first so I simplify the exact code being questioned, then amend the single PR commit again.
- The review is on `_convert_llc_time_coord_to_julian` and the helper that handled cftime, Python datetime/date, and numeric units. I agree with the comment: for the LLC zarr path we only need raw `datetime64`, so I’m cutting the fallback cases and making non-`datetime64` an expl...[truncated]
- That removes the generic date conversion helper entirely. The LLC canonicalizer now accepts only raw `datetime64` coordinates and raises a clear `TypeError` otherwise.
- `origin/main` moved by one commit since the last push, so after validation I’ll amend this into the PR’s single commit and rebase onto that newer main before pushing.
- The focused tests still pass after the simplification. Pre-commit is running now on the affected Python files.
- Pre-commit passed on the affected files. I’m amending the single PR commit and then rebasing it onto the current main tip.
- The one-commit rebase onto current `origin/main` completed cleanly. I’m rerunning the focused test pair once on the rebased commit, then I’ll force-push with lease.
- The rebased commit is clean locally and still a single commit over current `origin/main`. I’m updating the PR branch with a lease-protected force push.
- The PR branch is now updated to `dece93c9`. I’m verifying GitHub sees the rewritten single commit and checking the new CI state.
- Done and pushed.
I simplified the LLC time conversion to only accept raw `datetime64` coordinates and removed the extra cftime/Python datetime/numeric-unit fallback cases from [data.py](/home/fomo-bot/oa-jder-bot/worktrees/github.com/Open-Athena/Ocean_Emulator/670/src/ocean_emul...[truncated]
0d2e5f3 to
dece93c
Compare
| [xr.Dataset, xr.Dataset, xr.Dataset], | ||
| tuple[xr.Dataset, xr.Dataset, xr.Dataset], | ||
| ] | ||
| | None = None, |
There was a problem hiding this comment.
This seems a bit silly compared to just passing in the canonicalized datasets and pulling the open logic out into DataConfig.build
There was a problem hiding this comment.
@fomo-bot can you try this ^ please? And update to the latest main?
There was a problem hiding this comment.
Done and pushed.
I rebased PR #670 onto latest origin/main and applied the review suggestion: DataConfig.build now opens the datasets, canonicalizes them, and passes them directly to DataSource.from_datasets; the DataSource.from_locations wrapper is gone.
The PR is still one commit:
0c477e43 Add LLC CPU data canonicalization
Local checks passed:
uv run pytest tests/test_config.py tests/test_utils_data.py -m "not manual and not cuda"
uvx pre-commit run --files configs/data/llc.yaml src/ocean_emulators/config.py src/ocean_emulators/utils/data.py tests/llc_fixtures.py tests/test_config.py tests/test_utils_data.py
GitHub checks restarted and were in progress when I checked.
Codex work
- Repository:
Open-Athena/Ocean_Emulator - Trigger: https://github.com/Open-Athena/Ocean_Emulator/pull/670#discussion_r3404210391
- Request ID:
ba7da855-b403-492b-b581-ff874abe3b60 - Accepted:
true
Codex transcript
- I’ll fetch the exact review thread so “this ^” maps to the right suggestion, then rebase the PR branch onto current `main` and amend the single commit.
- The PR is clean but behind current `main`, and the review thread points at the `canonicalize` callback parameter in `DataSource.from_locations`. I’m fetching the parent comment now to see the suggested replacement exactly.
- The parent suggestion is to remove the `canonicalize` callback from `DataSource.from_locations`, open the datasets directly in `DataConfig.build`, canonicalize there, and then pass datasets into `DataSource.from_datasets`. I’m going to make that change and drop the now-unneeded...[truncated]
- `main` has relevant edits in the same data/config area, so I’m rebasing first and resolving the combined result there. That avoids building the review fix against stale context.
- The rebase conflicts are in `utils/data.py` and `tests/test_utils_data.py`, which matches the changed normalization code on latest `main`. I’m resolving those while also removing the `from_locations` canonicalize callback.
- The conflict is straightforward: latest `main` added a normalization regression test and changed stats flattening, while this branch adds LLC canonicalization in the same area. I’m keeping both changes, then moving dataset opening/canonicalization into `DataConfig.build` as req...[truncated]
- In `utils/data.py`, I’m deleting `DataSource.from_locations` instead of just removing its callback. `DataConfig.build` will own resolving locations, opening datasets, and canonicalizing before calling `DataSource.from_datasets`.
- In the test conflict I’m keeping the new `main` normalization regression and the LLC canonicalization tests, with the shared LLC fixture import.
- Next I’m moving the open calls into `DataConfig.build`: resolve locations, open with the same dask chunk setting, canonicalize through the dataset config, then construct the `DataSource` from those datasets.
- The `DataSource.from_locations` wrapper is now removed from `utils/data.py`, and `DataConfig.build` directly opens, canonicalizes, and constructs sources from datasets. I’m checking for remaining conflict markers and import issues.
- The conflict markers are gone. I’m marking the resolved files and continuing the rebase, then I’ll run the focused tests against the final rebased commit.
- The resolved diff now shows the intended deletion of the `from_locations` wrapper and both test additions retained. I’m staging the resolved files and continuing the rebase.
- `git rebase --continue` only stopped because no editor is configured for the commit message. I’m rerunning it with a no-op editor so it keeps the existing commit message.
- The branch is now rebased onto current `origin/main` with one commit. I’m running the focused tests and pre-commit checks next.
- The review-requested callback path is gone: no `from_locations` or `canonicalize=` callsites remain. I’m validating the rebased single commit now.
- The focused pytest run passed: 37 tests. I’m running the same pre-commit subset across the touched files before pushing.
- Pre-commit has passed ruff, format, YAML, and file-size checks; mypy is currently running.
- Pre-commit passed on the touched files. I’m doing a final status check and then force-pushing the rebased one-commit branch with lease.
- The branch is one commit over latest `origin/main`; the ahead/behind count is only relative to the old remote PR tip. I’m pushing `0c477e43` now.
- The force-with-lease push succeeded, updating the PR branch to `0c477e43`. I’m verifying GitHub now reports the single commit and fresh checks.
dece93c to
0c477e4
Compare
Summary
This is PR 3 in the LLC stack, based on #669.
It adds
data.dataset.type=llcand teaches the CPU data path how to canonicalize LLC inputs into the existing tensor contract. LLC-specific face, crop, stagger, and time handling move behind the dataset-family interface.Why
This is the first PR that carries real LLC data semantics, but it keeps them in the CPU path only. That lets us validate the dataset-family split before adding GPU decode and container/runtime changes.
Notes
This remains a draft stacked PR.