Skip to content

Add LLC CPU data canonicalization#670

Draft
jder wants to merge 1 commit into
mainfrom
llc-stack-03-llc-cpu-canonicalization
Draft

Add LLC CPU data canonicalization#670
jder wants to merge 1 commit into
mainfrom
llc-stack-03-llc-cpu-canonicalization

Conversation

@jder

@jder jder commented Apr 8, 2026

Copy link
Copy Markdown
Member

Summary

This is PR 3 in the LLC stack, based on #669.

It adds data.dataset.type=llc and teaches the CPU data path how to canonicalize LLC inputs into the existing tensor contract. LLC-specific face, crop, stagger, and time handling move behind the dataset-family interface.

Why

This is the first PR that carries real LLC data semantics, but it keeps them in the CPU path only. That lets us validate the dataset-family split before adding GPU decode and container/runtime changes.

Notes

This remains a draft stacked PR.

@jder jder mentioned this pull request Apr 8, 2026
@oa-jder-bot oa-jder-bot force-pushed the llc-stack-03-llc-cpu-canonicalization branch from 509b762 to 91815a4 Compare April 8, 2026 20:01
@oa-jder-bot oa-jder-bot force-pushed the llc-stack-02-dataset-specs branch from d201ce9 to 8aa8e76 Compare April 8, 2026 21:07
@oa-jder-bot oa-jder-bot force-pushed the llc-stack-03-llc-cpu-canonicalization branch from 91815a4 to 7440baa Compare April 8, 2026 21:07
@fomo-bot fomo-bot force-pushed the llc-stack-02-dataset-specs branch from 8aa8e76 to d738b00 Compare April 13, 2026 15:22
@fomo-bot fomo-bot force-pushed the llc-stack-03-llc-cpu-canonicalization branch 4 times, most recently from 6aa69de to 35caa19 Compare April 15, 2026 21:12
@fomo-bot fomo-bot force-pushed the llc-stack-03-llc-cpu-canonicalization branch 2 times, most recently from a21e9c9 to 4276e4a Compare April 22, 2026 19:42
@fomo-bot fomo-bot force-pushed the llc-stack-03-llc-cpu-canonicalization branch from 4276e4a to 5c2d900 Compare April 29, 2026 18:19
Base automatically changed from llc-stack-02-dataset-specs to main April 30, 2026 18:48
@jder

jder commented May 7, 2026

Copy link
Copy Markdown
Member Author

@fomo-bot can you please rebase this PR into a single clean commit on top of the latest main?

@fomo-bot fomo-bot force-pushed the llc-stack-03-llc-cpu-canonicalization branch from 5c2d900 to 47d5ed9 Compare May 7, 2026 14:16
@fomo-bot

fomo-bot commented May 7, 2026

Copy link
Copy Markdown
Collaborator

Done. PR #670 is now a single commit on top of latest main:



47d5ed9e Add LLC CPU data canonicalization



I force-pushed with lease to llc-stack-03-llc-cpu-canonicalization. Local validation passed:



uv run pytest tests/test_config.py tests/test_utils_data.py -m "not manual and not cuda"

37 passed, 1 warning.



GitHub checks have started and were still queued when I checked.

Codex work

Codex transcript

- I’ll inspect the current branch state, fetch the latest base branch, then squash the PR changes into one commit and update the remote branch with a lease-protected force push.
- The branch is clean and the PR targets `main`. I see the PR history includes several older stack commits plus merge commits, so I’m fetching `origin/main` now and will squash the actual branch diff against the fresh base rather than preserving that noisy ancestry.
- I’m going to keep a local backup ref at the current branch tip, then replay only the three LLC canonicalization commits onto `origin/main`. That should avoid carrying the old pre-merge dataset-spec history into the rewritten PR.
- The replay onto latest `origin/main` completed without conflicts. I’m checking the rewritten history and staged diff now before collapsing it to one commit.
- The rebased branch now has exactly the three LLC commits on top of current `origin/main`, with a five-file diff. I’m squashing those into a single commit named after the PR.
- The branch is now one local commit on top of current `origin/main`. I’m going to run a focused validation on the affected config/data tests before pushing the rewritten branch.
- `uv` had to create the local virtualenv, so the focused pytest run is still in progress. I’ll wait for that result before updating the remote branch.
- The focused tests passed: 37 tests in `tests/test_config.py` and `tests/test_utils_data.py`. I’m doing a final status check, then I’ll force-push with lease to update the PR branch.
- The force-with-lease push succeeded, updating the PR branch from `5c2d9008` to `47d5ed9e`. I’m verifying GitHub now reports the PR as a single commit on the rewritten branch.
- Done. PR #670 is now a single commit on top of latest `main`:

`47d5ed9e Add LLC CPU data canonicalization`

I force-pushed with lease to `llc-stack-03-llc-cpu-canonicalization`. Local validation passed:

`uv run pytest tests/test_config.py tests/test_utils_data.py -m "not manual...[truncated]

@jder

jder commented May 29, 2026

Copy link
Copy Markdown
Member Author

@fomo-bot can we unify the two different mock LLC datasets in tests, please?

I also don't really understand all the complexity around the time stuff. Why do we need to change slice if we are already remapping the times in the canonicalize function?

@fomo-bot fomo-bot force-pushed the llc-stack-03-llc-cpu-canonicalization branch from 47d5ed9 to 0d2e5f3 Compare May 29, 2026 18:50
@fomo-bot

Copy link
Copy Markdown
Collaborator

Done and pushed.



The PR is still a single commit, now rebased onto current origin/main:



0d2e5f36 Add LLC CPU data canonicalization



I unified the LLC test data in tests/llc_fixtures.py and removed the generic numeric-time DataSource.slice changes from data.py. The LLC path now relies on canonicalize_llc_datasets to remap raw LLC times to Julian cftime, so slice stays on the existing xarray label-slice path.



Local checks passed:



uv run pytest tests/test_config.py tests/test_utils_data.py -m "not manual and not cuda"

uvx pre-commit run --files ...



GitHub checks are running; a couple of data jobs had already passed when I checked, with the longer CI jobs still in progress.

Codex work

Codex transcript

- I’ll inspect the current test helpers and the LLC canonicalization path first, then simplify the tests and the time handling where the code proves it’s redundant.
- The local branch is clean, but the local `origin/main` ref is stale enough that its comparison is misleading. I’m going to fetch current refs before editing so this cleanup lands on the actual current PR state.
- `main` has moved six commits since this PR was last rebased. I’m going to make the requested cleanup against the PR diff, then rebase the single commit onto current `origin/main` if the touched files still merge cleanly.
- I found the duplication: one LLC fixture writes zarr/netcdf in `test_config.py`, while another builds an in-memory copy in `test_utils_data.py`. I’m moving that to one shared test helper, and I’m reverting the generic `DataSource.slice` time-coordinate coercion so LLC time no...[truncated]
- The code now leaves `DataSource.slice` on the existing xarray label-slice implementation and keeps only the LLC-specific time conversion in `canonicalize_llc_datasets`. I’m checking formatting/import fallout before running the focused tests.
- `ruff` is not installed in this uv environment, so I’ll rely on `git diff --check` plus the focused pytest run for now. The cleanup is otherwise syntactically small: the new shared helper is untracked until I stage/amend.
- The focused tests pass with the shared fixture and without the generic numeric-time slice path. I’m amending the single PR commit now, then rebasing that one commit onto today’s `origin/main` because the PR is currently behind.
- The rebase onto current `origin/main` completed without conflicts. I’m rerunning the focused tests on the rebased commit and checking the final one-commit shape before pushing.
- The rebased focused tests pass. `pre-commit` is not installed in the project environment, so I’m trying it through `uvx` to catch the CI pre-commit checks before I push.
- The pre-commit run has passed ruff, formatting, YAML, and basic file checks; it is currently on mypy. I’ll let it finish because this is the same class of check that previously failed on the PR.
- Pre-commit passed on the touched files, including reuse/license lint. I’m force-pushing the rewritten one-commit branch with lease now.
- The branch was updated on GitHub from `47d5ed9e` to `0d2e5f36`. I’m verifying the PR metadata and check status now.
- GitHub checks are running now. I started a short watch to catch any immediate failures, but I won’t block on the longer EC2/GPU jobs unless they fail quickly.
- Done and pushed.

The PR is still a single commit, now rebased onto current `origin/main`:

`0d2e5f36 Add LLC CPU data canonicalization`

I unified the LLC test data in [tests/llc_fixtures.py](/home/fomo-bot/oa-jder-bot/worktrees/github.com/Open-Athena/Ocean_Emulator/670/tests/ll...[truncated]

Comment thread src/ocean_emulators/utils/data.py Outdated
Comment on lines +79 to +107
"""Convert LLC time coordinates to Julian-calendar cftime datetimes."""
values = np.asarray(time_coord.values)
if values.size == 0:
return values.astype(object)

if np.issubdtype(values.dtype, np.datetime64):
microseconds = values.astype("datetime64[us]").astype(np.int64)
converted = cftime.num2date(
microseconds,
"microseconds since 1970-01-01",
calendar="julian",
)
return np.asarray(converted, dtype=object)

if values.dtype == object:
flat_values = values.ravel()
sample = flat_values[0]
if isinstance(sample, (cftime.datetime, dt.datetime, dt.date)):
return np.asarray(
[_to_julian_datetime(value) for value in flat_values],
dtype=object,
).reshape(values.shape)

units = time_coord.attrs.get("units")
if units is None:
raise ValueError(
"LLC time coordinate is numeric but missing 'units'; unable to decode "
"raw time values."
)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fomo-bot do we really need all these cases? I would expect only one of them is in fact used?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done and pushed.



I simplified the LLC time conversion to only accept raw datetime64 coordinates and removed the extra cftime/Python datetime/numeric-unit fallback cases from data.py:45.



The PR is still a single commit, rebased onto current origin/main:



dece93c9 Add LLC CPU data canonicalization



Local checks passed:



uv run pytest tests/test_config.py tests/test_utils_data.py -m "not manual and not cuda"

uvx pre-commit run --files src/ocean_emulators/utils/data.py tests/test_utils_data.py



GitHub checks restarted and were queued when I checked.

Codex work

Codex transcript

- I’ll pull the review-comment context first so I simplify the exact code being questioned, then amend the single PR commit again.
- The review is on `_convert_llc_time_coord_to_julian` and the helper that handled cftime, Python datetime/date, and numeric units. I agree with the comment: for the LLC zarr path we only need raw `datetime64`, so I’m cutting the fallback cases and making non-`datetime64` an expl...[truncated]
- That removes the generic date conversion helper entirely. The LLC canonicalizer now accepts only raw `datetime64` coordinates and raises a clear `TypeError` otherwise.
- `origin/main` moved by one commit since the last push, so after validation I’ll amend this into the PR’s single commit and rebase onto that newer main before pushing.
- The focused tests still pass after the simplification. Pre-commit is running now on the affected Python files.
- Pre-commit passed on the affected files. I’m amending the single PR commit and then rebasing it onto the current main tip.
- The one-commit rebase onto current `origin/main` completed cleanly. I’m rerunning the focused test pair once on the rebased commit, then I’ll force-push with lease.
- The rebased commit is clean locally and still a single commit over current `origin/main`. I’m updating the PR branch with a lease-protected force push.
- The PR branch is now updated to `dece93c9`. I’m verifying GitHub sees the rewritten single commit and checking the new CI state.
- Done and pushed.

I simplified the LLC time conversion to only accept raw `datetime64` coordinates and removed the extra cftime/Python datetime/numeric-unit fallback cases from [data.py](/home/fomo-bot/oa-jder-bot/worktrees/github.com/Open-Athena/Ocean_Emulator/670/src/ocean_emul...[truncated]

@fomo-bot fomo-bot force-pushed the llc-stack-03-llc-cpu-canonicalization branch from 0d2e5f3 to dece93c Compare May 29, 2026 20:22
Comment thread src/ocean_emulators/utils/data.py Outdated
[xr.Dataset, xr.Dataset, xr.Dataset],
tuple[xr.Dataset, xr.Dataset, xr.Dataset],
]
| None = None,

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems a bit silly compared to just passing in the canonicalized datasets and pulling the open logic out into DataConfig.build

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fomo-bot can you try this ^ please? And update to the latest main?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done and pushed.



I rebased PR #670 onto latest origin/main and applied the review suggestion: DataConfig.build now opens the datasets, canonicalizes them, and passes them directly to DataSource.from_datasets; the DataSource.from_locations wrapper is gone.



The PR is still one commit:



0c477e43 Add LLC CPU data canonicalization



Local checks passed:



uv run pytest tests/test_config.py tests/test_utils_data.py -m "not manual and not cuda"

uvx pre-commit run --files configs/data/llc.yaml src/ocean_emulators/config.py src/ocean_emulators/utils/data.py tests/llc_fixtures.py tests/test_config.py tests/test_utils_data.py



GitHub checks restarted and were in progress when I checked.

Codex work

Codex transcript

- I’ll fetch the exact review thread so “this ^” maps to the right suggestion, then rebase the PR branch onto current `main` and amend the single commit.
- The PR is clean but behind current `main`, and the review thread points at the `canonicalize` callback parameter in `DataSource.from_locations`. I’m fetching the parent comment now to see the suggested replacement exactly.
- The parent suggestion is to remove the `canonicalize` callback from `DataSource.from_locations`, open the datasets directly in `DataConfig.build`, canonicalize there, and then pass datasets into `DataSource.from_datasets`. I’m going to make that change and drop the now-unneeded...[truncated]
- `main` has relevant edits in the same data/config area, so I’m rebasing first and resolving the combined result there. That avoids building the review fix against stale context.
- The rebase conflicts are in `utils/data.py` and `tests/test_utils_data.py`, which matches the changed normalization code on latest `main`. I’m resolving those while also removing the `from_locations` canonicalize callback.
- The conflict is straightforward: latest `main` added a normalization regression test and changed stats flattening, while this branch adds LLC canonicalization in the same area. I’m keeping both changes, then moving dataset opening/canonicalization into `DataConfig.build` as req...[truncated]
- In `utils/data.py`, I’m deleting `DataSource.from_locations` instead of just removing its callback. `DataConfig.build` will own resolving locations, opening datasets, and canonicalizing before calling `DataSource.from_datasets`.
- In the test conflict I’m keeping the new `main` normalization regression and the LLC canonicalization tests, with the shared LLC fixture import.
- Next I’m moving the open calls into `DataConfig.build`: resolve locations, open with the same dask chunk setting, canonicalize through the dataset config, then construct the `DataSource` from those datasets.
- The `DataSource.from_locations` wrapper is now removed from `utils/data.py`, and `DataConfig.build` directly opens, canonicalizes, and constructs sources from datasets. I’m checking for remaining conflict markers and import issues.
- The conflict markers are gone. I’m marking the resolved files and continuing the rebase, then I’ll run the focused tests against the final rebased commit.
- The resolved diff now shows the intended deletion of the `from_locations` wrapper and both test additions retained. I’m staging the resolved files and continuing the rebase.
- `git rebase --continue` only stopped because no editor is configured for the commit message. I’m rerunning it with a no-op editor so it keeps the existing commit message.
- The branch is now rebased onto current `origin/main` with one commit. I’m running the focused tests and pre-commit checks next.
- The review-requested callback path is gone: no `from_locations` or `canonicalize=` callsites remain. I’m validating the rebased single commit now.
- The focused pytest run passed: 37 tests. I’m running the same pre-commit subset across the touched files before pushing.
- Pre-commit has passed ruff, format, YAML, and file-size checks; mypy is currently running.
- Pre-commit passed on the touched files. I’m doing a final status check and then force-pushing the rebased one-commit branch with lease.
- The branch is one commit over latest `origin/main`; the ahead/behind count is only relative to the old remote PR tip. I’m pushing `0c477e43` now.
- The force-with-lease push succeeded, updating the PR branch to `0c477e43`. I’m verifying GitHub now reports the single commit and fresh checks.

@fomo-bot fomo-bot force-pushed the llc-stack-03-llc-cpu-canonicalization branch from dece93c to 0c477e4 Compare June 12, 2026 14:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants