[bug] `backend_write_file` should catch serialization errors and write them to the output file #982

When `dump()` fails during result serialization in `backend_write_file`, the process crashes without writing an error into the HDF5 file. The file is then left in the `_r.h5` state permanently, and `get_future_from_cache` raises `FileNotFoundError` because it only looks for `_i.h5` or `_o.h5`. The failure is therefore undetectable through the normal API.

**Reproduction:**

Submit a task whose return value is large enough that `np.void(cloudpickle.dumps(data_value))` exceeds the NumPy void array size limit (~2 GB). The process crashes with:

```
File ".../executorlib/task_scheduler/file/backend.py", line 49, in backend_write_file
    dump(
        file_name=file_name_out + "_r.h5",
        data_dict={"output": output["result"], "runtime": runtime},
    )
File ".../executorlib/standalone/hdf.py", line 39, in dump
    data=np.void(cloudpickle.dumps(data_value)),
TypeError: byte-like to large to store inside array.
```
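
To check whether a given return value is likely to hit this path, the pickled size can be measured before submitting. This is a minimal sketch, not part of executorlib: the helper names are hypothetical, the limit of `2**31 - 1` bytes is an assumption (the report only establishes "~2 GB"), and the standard-library `pickle` stands in for `cloudpickle`, so sizes may differ slightly.

```python
import pickle

# Assumption: the limit is the largest signed 32-bit integer.
# The traceback above only establishes that it is roughly 2 GB.
ASSUMED_VOID_LIMIT = 2**31 - 1


def pickled_size(obj) -> int:
    """Return the number of bytes the pickled object occupies."""
    return len(pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL))


def fits_in_void(obj, limit: int = ASSUMED_VOID_LIMIT) -> bool:
    """Return True if the pickled payload is at most `limit` bytes."""
    return pickled_size(obj) <= limit
```

Calling `fits_in_void(result)` on a representative return value gives a rough indication of whether `dump()` would raise for it.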

**Current behavior:**

`backend_write_file` renames `_i.h5` → `_r.h5`, then calls `dump()`, which raises. The exception propagates out, so the final `os.rename(_r.h5 → _o.h5)` never executes. The file remains as `_r.h5` with neither an `"output"` nor an `"error"` key, a state that `get_future_from_cache` cannot interpret.
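
The stuck state can be illustrated without executorlib. The sketch below is not the real `get_future_from_cache`; `find_task_file` is a hypothetical lookup that, like the behavior described above, only recognises the `_i.h5` and `_o.h5` suffixes, and an empty file stands in for the HDF5 file.

```python
import os
import tempfile


def find_task_file(cache_dir: str, task_key: str) -> str:
    """Hypothetical lookup that only recognises `_i.h5` and `_o.h5` files."""
    for suffix in ("_i.h5", "_o.h5"):
        candidate = os.path.join(cache_dir, task_key + suffix)
        if os.path.exists(candidate):
            return candidate
    raise FileNotFoundError(f"no _i.h5 or _o.h5 file for task {task_key!r}")


with tempfile.TemporaryDirectory() as cache_dir:
    # Simulate a crash between the two renames: only the `_r.h5` file exists.
    open(os.path.join(cache_dir, "task_r.h5"), "wb").close()
    try:
        find_task_file(cache_dir, "task")
        stuck_file_found = True
    except FileNotFoundError:
        # The task file is on disk, but the lookup cannot see it.
        stuck_file_found = False
```

Under these assumptions the caller sees the same `FileNotFoundError` as for a task that was never submitted, even though the `_r.h5` file is still on disk.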

**Expected behavior:**

If `dump()` fails when writing the result, `backend_write_file` should catch the exception, write it as an `"error"` key into the `_r.h5` file, and then rename to `_o.h5`. This way `get_future_from_cache` can detect it as a failed task via the normal `"error"` path.

**Suggested fix:**

In `backend_write_file` (`executorlib/task_scheduler/file/backend.py`):

```python
import os
from typing import Any

from executorlib.standalone.hdf import dump


def backend_write_file(file_name: str, output: Any, runtime: float) -> None:
    # Strip the ".h5" extension and the trailing "_i" to get the base name.
    file_name_out = os.path.splitext(file_name)[0][:-2]
    # Mark the task as running while its output is being written.
    os.rename(file_name, file_name_out + "_r.h5")
    try:
        if "result" in output:
            dump(
                file_name=file_name_out + "_r.h5",
                data_dict={"output": output["result"], "runtime": runtime},
            )
        else:
            dump(
                file_name=file_name_out + "_r.h5",
                data_dict={"error": output["error"], "runtime": runtime},
            )
    except Exception as serialize_error:
        # Serialization failed: store the error so the job is not stuck in "_r.h5".
        dump(
            file_name=file_name_out + "_r.h5",
            data_dict={"error": serialize_error, "runtime": runtime},
        )
    # Mark the task as finished, whether the file holds a result or an error.
    os.rename(file_name_out + "_r.h5", file_name_out + "_o.h5")
```

This ensures that any serialization failure (including the `np.void` size limit) is surfaced through the existing error-handling path rather than leaving the job in a limbo state.
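
The control flow of the fix can be exercised in isolation. The following is a self-contained sketch under stated assumptions, not a test against executorlib itself: `fake_dump`, `fake_load`, and `write_with_fallback` are hypothetical stand-ins, a pickle-based file replaces HDF5, and a 1 KiB `SIZE_LIMIT` replaces the ~2 GB limit.

```python
import os
import pickle
import tempfile

SIZE_LIMIT = 1024  # tiny stand-in for the ~2 GB limit


def fake_dump(file_name: str, data_dict: dict) -> None:
    """Stand-in for `dump()`: append pickled entries, rejecting oversized ones."""
    for key, value in data_dict.items():
        payload = pickle.dumps(value)
        if len(payload) > SIZE_LIMIT:
            raise TypeError("payload too large for the stand-in store")
        with open(file_name, "ab") as handle:
            pickle.dump((key, payload), handle)


def fake_load(file_name: str) -> dict:
    """Read back every (key, payload) entry written by `fake_dump`."""
    result = {}
    with open(file_name, "rb") as handle:
        while True:
            try:
                key, payload = pickle.load(handle)
            except EOFError:
                break
            result[key] = pickle.loads(payload)
    return result


def write_with_fallback(file_name: str, output: dict, runtime: float) -> str:
    """Mirror the suggested fix: on a failed write, store the exception instead."""
    base = os.path.splitext(file_name)[0][:-2]
    running, finished = base + "_r.h5", base + "_o.h5"
    os.rename(file_name, running)
    try:
        if "result" in output:
            fake_dump(running, {"output": output["result"], "runtime": runtime})
        else:
            fake_dump(running, {"error": output["error"], "runtime": runtime})
    except Exception as serialize_error:
        fake_dump(running, {"error": serialize_error, "runtime": runtime})
    os.rename(running, finished)
    return finished


with tempfile.TemporaryDirectory() as cache_dir:
    task_file = os.path.join(cache_dir, "task_i.h5")
    open(task_file, "wb").close()
    oversized = {"result": b"x" * (SIZE_LIMIT * 2)}
    finished_file = write_with_fallback(task_file, oversized, runtime=1.5)
    finished_name = os.path.basename(finished_file)
    stored = fake_load(finished_file)
```

With the oversized result, the file ends up as `task_o.h5` and holds an `"error"` entry plus `"runtime"`, with no `"output"` entry. That is the state the existing `"error"` path is expected to handle.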

**Environment:**
- executorlib version: 1.9.2
- Python 3.13
- Triggered by a 10,000-atom melt-quench simulation returning ~2+ GB of pickled data

