Skip to content

fix(serialization): use HDF5 for WFSC snippets#1

Open
dmittman wants to merge 2 commits into
masterfrom
develop
Open

fix(serialization): use HDF5 for WFSC snippets#1
dmittman wants to merge 2 commits into
masterfrom
develop

Conversation

@dmittman
Copy link
Copy Markdown
Owner

Summary

This PR converts a localized pickle workflow to HDF5 while keeping behavior equivalent and test-backed.

What changed

  • Converted WFSC snippet artifact writes from *_snippet.pkl to *_snippet.h5:
    • falco/wfsc.py
  • Added HDF5 serializer/deserializer for the out object graph:
    • falco/hdf5utils.py
  • Extended an existing functional test to verify HDF5 readback equivalence:
    • tests/functional/test_wfsc_flc.py
  • Added h5py dependency wiring:
    • requirements.txt
    • setup.py
    • .github/workflows/pyfalco_test.yml

Included related fixes from the same working session

  • Fixed WFSC estimator shape bug causing TypeError in test_wfsc_flc:
    • falco/est.py
  • Fixed string/doc escape warnings seen in CI:
    • falco/plot.py
    • falco/proper/prop_psd_errormap.py

Functional equivalence

Validated with existing regression test:

  • tests/functional/test_wfsc_flc.py
  • Result: pass before and after conversion.

Performance measurement

Measured test runtime with:

  • /usr/bin/time -p python -m pytest tests/functional/test_wfsc_flc.py -q

Results:

  • Before conversion: real 135.22s (pytest: 134.72s)
  • After conversion: real 133.83s (pytest: 133.40s)

Observed delta:

  • -1.39s wall time (about 1.0% faster). This is small and within normal noise for long functional runs, with no regression.

Security assessment

Before (pickle snippet path):

  • pickle.dump/pickle.load for snippet artifacts; deserialization can execute arbitrary code if file is attacker-controlled.

After (HDF5 snippet path):

  • Snippet path uses typed HDF5 encoding/decoding in falco/hdf5utils.py.
  • Loader supports explicit node types only (falco.config.Object, dict/list/tuple, ndarray/scalars/strings/bytes, types.SimpleNamespace).
  • No dynamic module/class import from file metadata.

Net effect:

  • Converted path removes pickle code-execution risk for snippet readback.
  • Remaining risk is data-parsing/DoS style concerns from malformed/huge files, not pickle opcode execution.

Notes

  • Full-workspace artifact (*_all.pkl) was intentionally left unchanged to keep the conversion localized.

dmittman and others added 2 commits March 27, 2026 19:02
Added workflow_dispatch event to trigger the workflow manually.
Convert WFSC snippet artifact writes from pickle to HDF5 and add
an HDF5 serializer/deserializer for the out object tree.

Also:
- add a functional test readback assertion for WFSC FLC
- wire h5py into requirements, setup.py, and CI workflow
- keep full-workspace pickle path unchanged
- include the earlier WFSC estimator shape fix and warning-string fixes

Co-authored-by: GitHub Copilot <copilot[bot]@users.noreply.github.com>
Copilot AI review requested due to automatic review settings March 28, 2026 17:19
@dmittman
Copy link
Copy Markdown
Owner Author

dmittman commented Mar 28, 2026

Task transcript and timing metadata

Timing

  • Task start (PDT): 2026-03-28T10:10:15-0700
  • Task end (PDT): 2026-03-28T10:16:02-0700
  • Total duration: 0:05:47

Notes on timing:

  • Start time is inferred from the first created conversion artifact (falco/hdf5utils.py) in this session.
  • End time is when implementation, verification, commit, push, and PR creation were complete.

Conversation transcript (best-effort full transcript from available chat context)

User: Analyze the codebase.

Assistant: Performed architecture/packaging/test/risk analysis; identified core modules, test structure, and maintainability/security hotspots.

User: I've installed rg for your use.

Assistant: Re-ran repository inventory with rg; gathered file stats, hotspot metrics, and quality indicators.

User: I ran the test suite using GitHub Actions run 23675032314. Analyze it.

Assistant: Pulled run details via gh; identified the failing test and traceback in test_wfsc_flc, plus warning/deprecation context.

User: You can use the gh tool; let me know if scopes are missing.

Assistant: Confirmed gh auth scopes were sufficient (repo, workflow, etc.); extracted failed log details.

User: Proceed.

Assistant: Implemented fix for TypeError in falco/est.py and warning-string fixes in falco/plot.py and falco/proper/prop_psd_errormap.py; validated targeted failing test.

User: Use a virtual environment.

Assistant: Created .venv, installed dependencies, and validated tests/functional/test_wfsc_flc.py passes.

User: Analyze the code's use of pickle. Explain.

Assistant: Produced a complete map of pickle read/write sites, call flow, trust boundaries, and security implications.

User: I would like to select a localized example of pickle write/read covered by one of our tests, convert to HDF5, keep equivalent behavior, pass relevant test, measure performance before/after, and assess security before/after.

Assistant: Converted WFSC snippet persistence path from pickle to HDF5; added falco/hdf5utils.py; updated test_wfsc_flc to verify HDF5 readback; measured before/after runtime; documented security posture change.

User: Commit and push to develop, then open a PR to master containing this explanation; if possible add full transcript with start/completion/duration.

Assistant:

Work summary included in PR

  • Converted WFSC snippet artifact writes to HDF5 (*_snippet.h5).
  • Added HDF5 save/load helper for object trees.
  • Extended functional test to assert HDF5 readback equivalence.
  • Added h5py to requirements, setup, and CI workflow install list.
  • Included previously validated estimator shape fix and warning-string fixes.

Performance results (same functional test)

  • Before conversion: real 135.22s (pytest 134.72s)
  • After conversion: real 133.83s (pytest 133.40s)
  • Observed change: -1.39s (~1.0%), no regression.

Security assessment

  • Before: pickle snippet path has arbitrary code execution risk on untrusted deserialization.
  • After: converted snippet path uses typed HDF5 encoding/decoding with explicit supported node types and no dynamic import/class loading from file metadata, reducing deserialization attack surface.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR replaces WFSC “snippet” artifact persistence from pickle to HDF5, adding an HDF5 serializer/deserializer and updating the functional regression test and dependency wiring accordingly.

Changes:

  • Write WFSC snippet artifacts as *_snippet.h5 via a new falco.hdf5utils HDF5 object-tree serializer.
  • Extend test_wfsc_flc to load the HDF5 snippet and assert key fields match the in-memory out.
  • Add h5py to dependency/install wiring (requirements, setup, and CI workflow), plus small related fixes (estimator shape, docstring escapes).

Reviewed changes

Copilot reviewed 8 out of 9 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
falco/wfsc.py Switch snippet persistence from pickle to HDF5 (*_snippet.h5) using save_object_to_hdf5.
falco/hdf5utils.py New HDF5 serializer/deserializer for falco.config.Object and common container/scalar types.
tests/functional/test_wfsc_flc.py Load snippet HDF5 after WFSC loop and assert readback equivalence for selected fields.
falco/est.py Adjust Epix zero-shape for estimator path to avoid shape-related errors.
falco/plot.py Escape sequence tweak in plot label to avoid string/doc warnings.
falco/proper/prop_psd_errormap.py Escape sequence tweaks in docstring math blocks to avoid warnings.
requirements.txt Add pinned h5py dependency.
setup.py Add h5py to install_requires.
.github/workflows/pyfalco_test.yml Install h5py in CI and add workflow_dispatch.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread falco/est.py
@@ -611,7 +611,7 @@ def pairwise_probing(mp, ev, jacStruct=np.array([])):
# If <2 probe pairs had good measurements, can't do pinv. Leave Eest as zero.
if NpairsGood < 2:
zerosCounter = zerosCounter + 1
Copy link

Copilot AI Mar 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the NpairsGood < 2 branch, zerosCounter is incremented twice (zerosCounter = zerosCounter + 1 and then zerosCounter += 1). This will double-count the number of zeroed pixels and make the summary printout inaccurate. Remove one of the increments (and prefer a single += 1).

Suggested change
zerosCounter = zerosCounter + 1

Copilot uses AI. Check for mistakes.
Comment thread falco/hdf5utils.py
value = node[()]

if node_type == "bytes":
return bytes(np.asarray(value, dtype=np.uint8).tolist())
Copy link

Copilot AI Mar 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_read_node reconstructs byte datasets via bytes(np.asarray(value, dtype=np.uint8).tolist()), which is very slow and memory-heavy for larger byte payloads. Prefer a zero-copy-ish approach like np.asarray(value, dtype=np.uint8).tobytes() (or value.tobytes() if already uint8).

Suggested change
return bytes(np.asarray(value, dtype=np.uint8).tolist())
return np.asarray(value, dtype=np.uint8).tobytes()

Copilot uses AI. Check for mistakes.
Comment thread falco/hdf5utils.py
Comment on lines +41 to +97
def _write_node(parent, name: str, value) -> None:
if isinstance(value, falco.config.Object):
group = parent.create_group(name)
group.attrs[_NODE_TYPE_ATTR] = "falco_object"
for key, sub_value in value.data.items():
_write_node(group, str(key), sub_value)
return

if isinstance(value, dict):
group = parent.create_group(name)
group.attrs[_NODE_TYPE_ATTR] = "dict"
for key, sub_value in value.items():
_write_node(group, str(key), sub_value)
return

if isinstance(value, (list, tuple)):
group = parent.create_group(name)
group.attrs[_NODE_TYPE_ATTR] = "tuple" if isinstance(value, tuple) else "list"
for index, sub_value in enumerate(value):
_write_node(group, f"item_{index:06d}", sub_value)
return

if isinstance(value, np.ndarray):
dataset = parent.create_dataset(name, data=value)
dataset.attrs[_NODE_TYPE_ATTR] = "ndarray"
return

if value is None:
group = parent.create_group(name)
group.attrs[_NODE_TYPE_ATTR] = "none"
return

if isinstance(value, str):
dataset = parent.create_dataset(
name, data=np.array(value, dtype=h5py.string_dtype("utf-8"))
)
dataset.attrs[_NODE_TYPE_ATTR] = "str"
return

if isinstance(value, bytes):
dataset = parent.create_dataset(name, data=np.frombuffer(value, dtype=np.uint8))
dataset.attrs[_NODE_TYPE_ATTR] = "bytes"
return

if isinstance(value, (np.generic, bool, int, float, complex)):
dataset = parent.create_dataset(name, data=np.asarray(value))
dataset.attrs[_NODE_TYPE_ATTR] = "scalar"
return

if isinstance(value, types.SimpleNamespace):
group = parent.create_group(name)
group.attrs[_NODE_TYPE_ATTR] = "simple_namespace"
for key, sub_value in vars(value).items():
_write_node(group, str(key), sub_value)
return

raise TypeError(f"Unsupported type for HDF5 serialization: {type(value)}")
Copy link

Copilot AI Mar 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

falco/hdf5utils.py introduces a general-purpose serializer/deserializer with multiple supported node types (Object/dict/list/tuple/ndarray/scalars/str/bytes/SimpleNamespace), but the PR only adds a functional assertion on a couple of fields. Please add a focused unit test that round-trips representative values for each supported type (including nested containers) and asserts that unsupported types raise TypeError—this will help prevent silent format regressions.

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants