fix(serialization): use HDF5 for WFSC snippets by dmittman · Pull Request #1 · dmittman/falco-python

dmittman · 2026-03-28T17:19:22Z

Summary

This PR converts a localized pickle workflow to HDF5 while keeping behavior equivalent and test-backed.

What changed

Converted WFSC snippet artifact writes from *_snippet.pkl to *_snippet.h5:
- falco/wfsc.py
Added HDF5 serializer/deserializer for the out object graph:
- falco/hdf5utils.py
Extended an existing functional test to verify HDF5 readback equivalence:
- tests/functional/test_wfsc_flc.py
Added h5py dependency wiring:
- requirements.txt
- setup.py
- .github/workflows/pyfalco_test.yml

Included related fixes from the same working session

Fixed WFSC estimator shape bug causing TypeError in test_wfsc_flc:
- falco/est.py
Fixed string/doc escape warnings seen in CI:
- falco/plot.py
- falco/proper/prop_psd_errormap.py

Functional equivalence

Validated with existing regression test:

tests/functional/test_wfsc_flc.py
Result: pass before and after conversion.

Performance measurement

Measured test runtime with:

/usr/bin/time -p python -m pytest tests/functional/test_wfsc_flc.py -q

Results:

Before conversion: real 135.22s (pytest: 134.72s)
After conversion: real 133.83s (pytest: 133.40s)

Observed delta:

-1.39s wall time (about 1.0% faster). This is small and within normal noise for long functional runs, with no regression.

Security assessment

Before (pickle snippet path):

pickle.dump/pickle.load for snippet artifacts; deserialization can execute arbitrary code if file is attacker-controlled.

After (HDF5 snippet path):

Snippet path uses typed HDF5 encoding/decoding in falco/hdf5utils.py.
Loader supports explicit node types only (falco.config.Object, dict/list/tuple, ndarray/scalars/strings/bytes, types.SimpleNamespace).
No dynamic module/class import from file metadata.

Net effect:

Converted path removes pickle code-execution risk for snippet readback.
Remaining risk is data-parsing/DoS style concerns from malformed/huge files, not pickle opcode execution.

Notes

Full-workspace artifact (*_all.pkl) was intentionally left unchanged to keep the conversion localized.

Added workflow_dispatch event to trigger the workflow manually.

Convert WFSC snippet artifact writes from pickle to HDF5 and add an HDF5 serializer/deserializer for the out object tree. Also: - add a functional test readback assertion for WFSC FLC - wire h5py into requirements, setup.py, and CI workflow - keep full-workspace pickle path unchanged - include the earlier WFSC estimator shape fix and warning-string fixes Co-authored-by: GitHub Copilot <copilot[bot]@users.noreply.github.com>

dmittman · 2026-03-28T17:19:43Z

Task transcript and timing metadata

Timing

Task start (PDT): 2026-03-28T10:10:15-0700
Task end (PDT): 2026-03-28T10:16:02-0700
Total duration: 0:05:47

Notes on timing:

Start time is inferred from the first created conversion artifact (falco/hdf5utils.py) in this session.
End time is when implementation, verification, commit, push, and PR creation were complete.

Conversation transcript (best-effort full transcript from available chat context)

User: Analyze the codebase.

Assistant: Performed architecture/packaging/test/risk analysis; identified core modules, test structure, and maintainability/security hotspots.

User: I've installed rg for your use.

Assistant: Re-ran repository inventory with rg; gathered file stats, hotspot metrics, and quality indicators.

User: I ran the test suite using GitHub Actions run 23675032314. Analyze it.

Assistant: Pulled run details via gh; identified the failing test and traceback in test_wfsc_flc, plus warning/deprecation context.

User: You can use the gh tool; let me know if scopes are missing.

Assistant: Confirmed gh auth scopes were sufficient (repo, workflow, etc.); extracted failed log details.

User: Proceed.

Assistant: Implemented fix for TypeError in falco/est.py and warning-string fixes in falco/plot.py and falco/proper/prop_psd_errormap.py; validated targeted failing test.

User: Use a virtual environment.

Assistant: Created .venv, installed dependencies, and validated tests/functional/test_wfsc_flc.py passes.

User: Analyze the code's use of pickle. Explain.

Assistant: Produced a complete map of pickle read/write sites, call flow, trust boundaries, and security implications.

User: I would like to select a localized example of pickle write/read covered by one of our tests, convert to HDF5, keep equivalent behavior, pass relevant test, measure performance before/after, and assess security before/after.

Assistant: Converted WFSC snippet persistence path from pickle to HDF5; added falco/hdf5utils.py; updated test_wfsc_flc to verify HDF5 readback; measured before/after runtime; documented security posture change.

User: Commit and push to develop, then open a PR to master containing this explanation; if possible add full transcript with start/completion/duration.

Assistant:

Committed and pushed changes to develop.
Opened PR: fix(serialization): use HDF5 for WFSC snippets #1
Added this timing/transcript comment.

Work summary included in PR

Converted WFSC snippet artifact writes to HDF5 (*_snippet.h5).
Added HDF5 save/load helper for object trees.
Extended functional test to assert HDF5 readback equivalence.
Added h5py to requirements, setup, and CI workflow install list.
Included previously validated estimator shape fix and warning-string fixes.

Performance results (same functional test)

Before conversion: real 135.22s (pytest 134.72s)
After conversion: real 133.83s (pytest 133.40s)
Observed change: -1.39s (~1.0%), no regression.

Security assessment

Before: pickle snippet path has arbitrary code execution risk on untrusted deserialization.
After: converted snippet path uses typed HDF5 encoding/decoding with explicit supported node types and no dynamic import/class loading from file metadata, reducing deserialization attack surface.

Copilot

Pull request overview

This PR replaces WFSC “snippet” artifact persistence from pickle to HDF5, adding an HDF5 serializer/deserializer and updating the functional regression test and dependency wiring accordingly.

Changes:

Write WFSC snippet artifacts as *_snippet.h5 via a new falco.hdf5utils HDF5 object-tree serializer.
Extend test_wfsc_flc to load the HDF5 snippet and assert key fields match the in-memory out.
Add h5py to dependency/install wiring (requirements, setup, and CI workflow), plus small related fixes (estimator shape, docstring escapes).

Reviewed changes

Copilot reviewed 8 out of 9 changed files in this pull request and generated 3 comments.

Show a summary per file

File	Description
`falco/wfsc.py`	Switch snippet persistence from pickle to HDF5 (`*_snippet.h5`) using `save_object_to_hdf5`.
`falco/hdf5utils.py`	New HDF5 serializer/deserializer for `falco.config.Object` and common container/scalar types.
`tests/functional/test_wfsc_flc.py`	Load snippet HDF5 after WFSC loop and assert readback equivalence for selected fields.
`falco/est.py`	Adjust `Epix` zero-shape for estimator path to avoid shape-related errors.
`falco/plot.py`	Escape sequence tweak in plot label to avoid string/doc warnings.
`falco/proper/prop_psd_errormap.py`	Escape sequence tweaks in docstring math blocks to avoid warnings.
`requirements.txt`	Add pinned `h5py` dependency.
`setup.py`	Add `h5py` to `install_requires`.
`.github/workflows/pyfalco_test.yml`	Install `h5py` in CI and add `workflow_dispatch`.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-03-28T17:22:39Z

@@ -611,7 +611,7 @@ def pairwise_probing(mp, ev, jacStruct=np.array([])):
                        # If <2 probe pairs had good measurements, can't do pinv. Leave Eest as zero.
                        if NpairsGood < 2:
                            zerosCounter = zerosCounter + 1


In the NpairsGood < 2 branch, zerosCounter is incremented twice (zerosCounter = zerosCounter + 1 and then zerosCounter += 1). This will double-count the number of zeroed pixels and make the summary printout inaccurate. Remove one of the increments (and prefer a single += 1).

Suggested change

zerosCounter = zerosCounter + 1

Copilot · 2026-03-28T17:22:39Z

+        value = node[()]
+
+        if node_type == "bytes":
+            return bytes(np.asarray(value, dtype=np.uint8).tolist())


_read_node reconstructs byte datasets via bytes(np.asarray(value, dtype=np.uint8).tolist()), which is very slow and memory-heavy for larger byte payloads. Prefer a zero-copy-ish approach like np.asarray(value, dtype=np.uint8).tobytes() (or value.tobytes() if already uint8).

Suggested change

return bytes(np.asarray(value, dtype=np.uint8).tolist())

return np.asarray(value, dtype=np.uint8).tobytes()

Copilot · 2026-03-28T17:22:40Z

+def _write_node(parent, name: str, value) -> None:
+    if isinstance(value, falco.config.Object):
+        group = parent.create_group(name)
+        group.attrs[_NODE_TYPE_ATTR] = "falco_object"
+        for key, sub_value in value.data.items():
+            _write_node(group, str(key), sub_value)
+        return
+
+    if isinstance(value, dict):
+        group = parent.create_group(name)
+        group.attrs[_NODE_TYPE_ATTR] = "dict"
+        for key, sub_value in value.items():
+            _write_node(group, str(key), sub_value)
+        return
+
+    if isinstance(value, (list, tuple)):
+        group = parent.create_group(name)
+        group.attrs[_NODE_TYPE_ATTR] = "tuple" if isinstance(value, tuple) else "list"
+        for index, sub_value in enumerate(value):
+            _write_node(group, f"item_{index:06d}", sub_value)
+        return
+
+    if isinstance(value, np.ndarray):
+        dataset = parent.create_dataset(name, data=value)
+        dataset.attrs[_NODE_TYPE_ATTR] = "ndarray"
+        return
+
+    if value is None:
+        group = parent.create_group(name)
+        group.attrs[_NODE_TYPE_ATTR] = "none"
+        return
+
+    if isinstance(value, str):
+        dataset = parent.create_dataset(
+            name, data=np.array(value, dtype=h5py.string_dtype("utf-8"))
+        )
+        dataset.attrs[_NODE_TYPE_ATTR] = "str"
+        return
+
+    if isinstance(value, bytes):
+        dataset = parent.create_dataset(name, data=np.frombuffer(value, dtype=np.uint8))
+        dataset.attrs[_NODE_TYPE_ATTR] = "bytes"
+        return
+
+    if isinstance(value, (np.generic, bool, int, float, complex)):
+        dataset = parent.create_dataset(name, data=np.asarray(value))
+        dataset.attrs[_NODE_TYPE_ATTR] = "scalar"
+        return
+
+    if isinstance(value, types.SimpleNamespace):
+        group = parent.create_group(name)
+        group.attrs[_NODE_TYPE_ATTR] = "simple_namespace"
+        for key, sub_value in vars(value).items():
+            _write_node(group, str(key), sub_value)
+        return
+
+    raise TypeError(f"Unsupported type for HDF5 serialization: {type(value)}")


falco/hdf5utils.py introduces a general-purpose serializer/deserializer with multiple supported node types (Object/dict/list/tuple/ndarray/scalars/str/bytes/SimpleNamespace), but the PR only adds a functional assertion on a couple of fields. Please add a focused unit test that round-trips representative values for each supported type (including nested containers) and asserts that unsupported types raise TypeError—this will help prevent silent format regressions.

dmittman and others added 2 commits March 27, 2026 19:02

Add workflow_dispatch to pyfalco_test.yml

8c9d34a

Added workflow_dispatch event to trigger the workflow manually.

Copilot AI review requested due to automatic review settings March 28, 2026 17:19

Copilot started reviewing on behalf of dmittman March 28, 2026 17:19 View session

Copilot AI reviewed Mar 28, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(serialization): use HDF5 for WFSC snippets#1

fix(serialization): use HDF5 for WFSC snippets#1
dmittman wants to merge 2 commits into
masterfrom
develop

dmittman commented Mar 28, 2026

Uh oh!

dmittman commented Mar 28, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Mar 28, 2026

Uh oh!

Copilot AI Mar 28, 2026

Uh oh!

Copilot AI Mar 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

	return bytes(np.asarray(value, dtype=np.uint8).tolist())
	return np.asarray(value, dtype=np.uint8).tobytes()

Conversation

dmittman commented Mar 28, 2026

Summary

What changed

Included related fixes from the same working session

Functional equivalence

Performance measurement

Security assessment

Notes

Uh oh!

dmittman commented Mar 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Task transcript and timing metadata

Timing

Conversation transcript (best-effort full transcript from available chat context)

Work summary included in PR

Performance results (same functional test)

Security assessment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Mar 28, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 28, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 28, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

dmittman commented Mar 28, 2026 •

edited

Loading