Add DUACS daily->5-day-average preprocessing script #758

Draft: alxmrs wants to merge 2 commits into main from worktree-duacs-5day

alxmrs commented Jun 5, 2026

What

New ocean_preprocessing.make_duacs_5day utility that coarsens the daily (P1D) global gridded DUACS L4 altimetry product to non-overlapping consecutive 5-day means, mirroring the OM4 5-daily convention so the observations can sit alongside the emulator inputs.

How

  • Binning: coarsen(time=5, boundary="trim").mean() drops the trailing 1-day remainder, so 366 daily steps become 73 five-day steps (73 × 5 = 365). Each window is labelled by its midpoint timestamp.
  • Variables: keeps the core ocean fields adt, sla, ugos, ugosa, vgos, vgosa, plus flag_ice re-derived as a 0–1 ice-presence fraction (metadata relabelled). err_* and tpa_correction are dropped.
  • Output: uncompressed float32, one chunk per timestep, written to s3://emulators/am16581/data/2026-06/...P5D...zarr (P1D→P5D name swap).
  • Compute: reuses the OM4 pipeline's init_cluster, blosc single-thread guards, and retry-on-write. Runs on Coiled by default; set OCEAN_DUACS_CLUSTER=local to run locally. The native time chunk (50) is an exact multiple of the 5-step window, so no window straddles a chunk boundary and no cross-chunk shuffle is needed.
  • Native DUACS naming/grid preserved (latitude/longitude, −180…180, 0.125°) — not conformed to the emulator x/y/0–360 schema.
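
The binning bullet can be sketched on synthetic data. This is a minimal illustration rather than the script's code: the tiny 2×3 grid, the field values, and the 2022-01-01 start date are placeholders, and it relies on xarray's default coord_func="mean" to produce the midpoint labels.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in for the daily store: 366 daily steps on a tiny 2x3 grid.
# Each field value equals its day index, so window means are easy to check.
time = pd.date_range("2022-01-01", periods=366, freq="D")  # assumed start date
day_index = np.arange(366, dtype="float32")[:, None, None]
daily = xr.Dataset(
    {
        "sla": (
            ("time", "latitude", "longitude"),
            day_index * np.ones((1, 2, 3), dtype="float32"),
        )
    },
    coords={"time": time},
)

# Non-overlapping 5-day means. boundary="trim" drops the leftover 366th day;
# the default coord_func="mean" averages the timestamps in each window,
# which labels the window by its midpoint.
five_day = daily.coarsen(time=5, boundary="trim").mean()

print(five_day.sizes["time"])         # number of 5-day windows
print(five_day.time.values[[0, -1]])  # first and last window-centre labels
print(float(five_day.sla[0, 0, 0]))   # mean of day indices 0..4
```

With these assumptions the result has 73 steps, and the first and last labels fall on 2022-01-03 and 2022-12-29, matching the time axis reported in the dry run.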

Validation

--dry_run against the live store: structure correct, mid-grid sample yields physically sensible values (adt ≈ 0.43 m, sla ≈ 0.045 m, geostrophic velocities within ±0.5 m/s). Dry-run output (final dataset repr + full ds.time) to be pasted in a comment below for review.

Tests

tests/test_make_duacs_5day.py — block-mean correctness, ice-flag→fraction, missing-var guard.
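
For reviewers, here is a rough idea of what the block-mean and ice-fraction checks amount to, written as a NumPy-only reference. The helper name block_mean and the sample flag values are hypothetical, not taken from the test file, and the sketch assumes the daily flag_ice is a binary 0/1 field.

```python
import numpy as np


def block_mean(daily, window=5):
    """Mean over non-overlapping windows along axis 0, trimming any remainder."""
    n_kept = (daily.shape[0] // window) * window
    return daily[:n_kept].reshape(-1, window, *daily.shape[1:]).mean(axis=1)


# Eleven daily 0/1 ice flags at one grid point (hypothetical values).
# The 11th day does not fill a window and is trimmed.
flag = np.array([0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0], dtype="float32")
ice_fraction = block_mean(flag)

print(ice_fraction)  # fraction of ice-flagged days in each 5-day window
```

Averaging a binary flag gives the share of days in the window that were flagged: here 3 of 5 days in the first window and 5 of 5 in the second. That is one way to read "0–1 ice-presence fraction".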

Run:

cd data/
export FSSPEC_S3_ENDPOINT_URL=https://nyu1.osn.mghpcc.org/
export AWS_ACCESS_KEY_ID=...  AWS_SECRET_ACCESS_KEY=...
OCEAN_DUACS_CLUSTER=local python -m ocean_preprocessing.make_duacs_5day --dry_run

🤖 Generated with Claude Code

New `make_duacs_5day` utility coarsens the daily (P1D) DUACS L4 altimetry
product to non-overlapping consecutive 5-day means (coarsen boundary='trim',
366 -> 73 steps), mirroring the OM4 5-daily convention so the observations can
sit alongside the emulator inputs.

- Keeps core ocean fields (adt, sla, ugos, ugosa, vgos, vgosa) plus flag_ice
  re-derived as a 0-1 ice-presence fraction; drops err_* and tpa_correction.
- Writes uncompressed float32, one chunk per timestep, to the user's bucket
  with a P1D->P5D name. Coiled by default, local via OCEAN_DUACS_CLUSTER.
- Reuses the OM4 pipeline's init_cluster, blosc single-thread guards, and
  retry-on-write logic. Streaming pass: native time chunk (50) is a clean
  multiple of the window, so there's no cross-chunk shuffle.
- --dry_run validates structure + a mid-grid sample and prints the full time
  axis without writing. Native DUACS naming/grid preserved (not conformed to
  the emulator x/y/0-360 schema).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
alxmrs commented Jun 5, 2026

Dry-run output

OCEAN_DUACS_CLUSTER=local python -m ocean_preprocessing.make_duacs_5day --dry_run against the live source store (nothing written). Reduction validated against a mid-grid sample.

Final dataset (out)

<xarray.Dataset> Size: 8GB
Dimensions:    (time: 73, latitude: 1440, longitude: 2880)
Coordinates:
  * time       (time) datetime64[ns] 584B 2022-01-03 2022-01-08 ... 2022-12-29
  * latitude   (latitude) float32 6kB -89.94 -89.81 -89.69 ... 89.69 89.81 89.94
  * longitude  (longitude) float32 12kB -179.9 -179.8 -179.7 ... 179.8 179.9
Data variables:
    adt        (time, latitude, longitude) float32 1GB dask.array<chunksize=(1, 1440, 2880), meta=np.ndarray>
    sla        (time, latitude, longitude) float32 1GB dask.array<chunksize=(1, 1440, 2880), meta=np.ndarray>
    ugos       (time, latitude, longitude) float32 1GB dask.array<chunksize=(1, 1440, 2880), meta=np.ndarray>
    ugosa      (time, latitude, longitude) float32 1GB dask.array<chunksize=(1, 1440, 2880), meta=np.ndarray>
    vgos       (time, latitude, longitude) float32 1GB dask.array<chunksize=(1, 1440, 2880), meta=np.ndarray>
    vgosa      (time, latitude, longitude) float32 1GB dask.array<chunksize=(1, 1440, 2880), meta=np.ndarray>
    flag_ice   (time, latitude, longitude) float32 1GB dask.array<chunksize=(1, 1440, 2880), meta=np.ndarray>
Attributes: (12/48)
    Conventions:                       CF-1.6
    Metadata_Conventions:              Unidata Dataset Discovery v1.0
    cdm_data_type:                     Grid
    comment:                           Sea Surface Height measured by Altimet...
    contact:                           servicedesk.cmems@mercator-ocean.eu
    copernicusmarine_version:          2.3.0
    ...                                ...
    title:                             DT merged all satellites Global Ocean ...
    m2lines/source_dataset:            s3://emulators/jr7309/data/duacs/cmems...
    m2lines/temporal_coarsening:       5-day consecutive simple mean (coarsen...
    m2lines/ocean_emulators_git_hash:  https://github.com/Open-Athena/Ocean_E...
    m2lines/date_created:              2026-06-05T11:18:07.434449
    m2lines/cli_args:                  /Users/alxmrs/git/Ocean_Emulator/.clau...
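
The sizes in the repr follow from the grid shape. A quick back-of-the-envelope check, assuming 4-byte float32 values, no compression, and one chunk per timestep as stated in the PR description:

```python
# Dimensions and variable count as shown in the dataset repr.
n_time, n_lat, n_lon = 73, 1440, 2880
n_vars = 7
bytes_per_value = 4  # float32

chunk_bytes = n_lat * n_lon * bytes_per_value  # one timestep of one variable
var_bytes = n_time * chunk_bytes               # one full variable
total_bytes = n_vars * var_bytes               # all data variables

# Grid spacing implied by a global grid of this shape.
lat_step = 180 / n_lat
lon_step = 360 / n_lon

print(f"chunk:    {chunk_bytes / 1e6:.1f} MB")
print(f"variable: {var_bytes / 1e9:.2f} GB")
print(f"dataset:  {total_bytes / 1e9:.2f} GB")
print(f"spacing:  {lat_step} x {lon_step} degrees")
```

This gives roughly 16.6 MB per chunk, 1.21 GB per variable, and 8.48 GB in total, which xarray rounds to the 1GB and 8GB shown. The implied spacing of 0.125° matches the native DUACS grid.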

ds.time (73 window-center labels, 5 days apart)

<xarray.DataArray 'time' (time: 73)> Size: 584B
array(['2022-01-03T00:00:00.000000000', '2022-01-08T00:00:00.000000000',
       '2022-01-13T00:00:00.000000000', '2022-01-18T00:00:00.000000000',
       '2022-01-23T00:00:00.000000000', '2022-01-28T00:00:00.000000000',
       '2022-02-02T00:00:00.000000000', '2022-02-07T00:00:00.000000000',
       '2022-02-12T00:00:00.000000000', '2022-02-17T00:00:00.000000000',
       '2022-02-22T00:00:00.000000000', '2022-02-27T00:00:00.000000000',
       '2022-03-04T00:00:00.000000000', '2022-03-09T00:00:00.000000000',
       '2022-03-14T00:00:00.000000000', '2022-03-19T00:00:00.000000000',
       '2022-03-24T00:00:00.000000000', '2022-03-29T00:00:00.000000000',
       '2022-04-03T00:00:00.000000000', '2022-04-08T00:00:00.000000000',
       '2022-04-13T00:00:00.000000000', '2022-04-18T00:00:00.000000000',
       '2022-04-23T00:00:00.000000000', '2022-04-28T00:00:00.000000000',
       '2022-05-03T00:00:00.000000000', '2022-05-08T00:00:00.000000000',
       '2022-05-13T00:00:00.000000000', '2022-05-18T00:00:00.000000000',
       '2022-05-23T00:00:00.000000000', '2022-05-28T00:00:00.000000000',
       '2022-06-02T00:00:00.000000000', '2022-06-07T00:00:00.000000000',
       '2022-06-12T00:00:00.000000000', '2022-06-17T00:00:00.000000000',
       '2022-06-22T00:00:00.000000000', '2022-06-27T00:00:00.000000000',
       '2022-07-02T00:00:00.000000000', '2022-07-07T00:00:00.000000000',
       '2022-07-12T00:00:00.000000000', '2022-07-17T00:00:00.000000000',
       '2022-07-22T00:00:00.000000000', '2022-07-27T00:00:00.000000000',
       '2022-08-01T00:00:00.000000000', '2022-08-06T00:00:00.000000000',
       '2022-08-11T00:00:00.000000000', '2022-08-16T00:00:00.000000000',
       '2022-08-21T00:00:00.000000000', '2022-08-26T00:00:00.000000000',
       '2022-08-31T00:00:00.000000000', '2022-09-05T00:00:00.000000000',
       '2022-09-10T00:00:00.000000000', '2022-09-15T00:00:00.000000000',
       '2022-09-20T00:00:00.000000000', '2022-09-25T00:00:00.000000000',
       '2022-09-30T00:00:00.000000000', '2022-10-05T00:00:00.000000000',
       '2022-10-10T00:00:00.000000000', '2022-10-15T00:00:00.000000000',
       '2022-10-20T00:00:00.000000000', '2022-10-25T00:00:00.000000000',
       '2022-10-30T00:00:00.000000000', '2022-11-04T00:00:00.000000000',
       '2022-11-09T00:00:00.000000000', '2022-11-14T00:00:00.000000000',
       '2022-11-19T00:00:00.000000000', '2022-11-24T00:00:00.000000000',
       '2022-11-29T00:00:00.000000000', '2022-12-04T00:00:00.000000000',
       '2022-12-09T00:00:00.000000000', '2022-12-14T00:00:00.000000000',
       '2022-12-19T00:00:00.000000000', '2022-12-24T00:00:00.000000000',
       '2022-12-29T00:00:00.000000000'], dtype='datetime64[ns]')
Coordinates:
  * time     (time) datetime64[ns] 584B 2022-01-03 2022-01-08 ... 2022-12-29
Attributes:
    axis:           T
    long_name:      Time
    standard_name:  time
    unit_long:      Days Since 1950-01-01
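
The label sequence can be reproduced with plain date arithmetic. The sketch assumes the daily axis starts on 2022-01-01, which is inferred from the first label minus two days rather than stated in the output.

```python
from datetime import date, timedelta

window = 5
n_daily = 366
first_day = date(2022, 1, 1)  # assumed first daily timestamp

n_windows, remainder = divmod(n_daily, window)

# Centre of window i is its first day plus half the window (2 days).
labels = [
    first_day + timedelta(days=window * i + window // 2)
    for i in range(n_windows)
]

# The single trimmed day is the one after the last complete window.
trimmed_day = first_day + timedelta(days=n_windows * window)

print(n_windows, remainder)   # windows kept, days trimmed
print(labels[0], labels[-1])  # first and last window centres
print(trimmed_day)            # the daily step dropped by boundary="trim"
```

Under that assumption there are 73 windows with one day left over, the centres run from 2022-01-03 to 2022-12-29, and the trimmed step is 2023-01-01.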

Sample values (mid-grid 64×64×2 corner)

  adt: min=0.3831 max=0.4754 mean=0.4337
  sla: min=0.02266 max=0.1084 mean=0.0455
  ugos: min=-0.4018 max=0.2412 mean=-0.02374
  ugosa: min=-0.3034 max=0.1707 mean=-0.01237
  vgos: min=-0.2278 max=0.1878 mean=-0.04271
  vgosa: min=-0.1528 max=0.06514 mean=-0.04667

flag_ice is 0 in that equatorial corner (no ice), as expected. Absolute fields (adt/ugos/vgos) are legitimately NaN near the poles — no MDT reference under the ice — while the anomaly fields retain values there.

Both the DUACS source and the output live on OSN (no egress fees), so running
the reduction on Torch HPC instead of Coiled/AWS removes the cloud bill while
OSN<->Torch transfer stays free. The job is CPU-only and the script is already
cluster-agnostic, so no Python changes are needed.

- scripts/slurm_duacs_5day.sbatch: single CPU node, no Apptainer (uses the
  ocean_preprocessing mamba env directly via `mamba/conda run -n`), sources OSN
  creds from ~/.osn_env, runs --cluster=local sized to --cpus-per-task, streams
  from OSN and writes back. DRY_RUN/SRC/OUTPUT_PATH/WINDOW/ARGS overrides.
- docs/torch.md: new "Data Preprocessing On Torch" section documenting the env
  setup, credential handling, and dry-run-then-write flow.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>