[bug] backend_write_file should catch serialization errors and write them to the output file #982

@ltalirz

Description

When dump() fails during result serialization in backend_write_file, the process crashes without writing an error into the HDF5 file. This leaves the file in the _r.h5 state permanently — get_future_from_cache then raises FileNotFoundError (since it only looks for _i.h5 or _o.h5), making the failure undetectable through the normal API.

Reproduction:

Submit a task whose return value is large enough that np.void(cloudpickle.dumps(data_value)) exceeds the numpy void array size limit (~2 GB). The process crashes with:

File ".../executorlib/task_scheduler/file/backend.py", line 49, in backend_write_file
    dump(
        file_name=file_name_out + "_r.h5",
        data_dict={"output": output["result"], "runtime": runtime},
    )
File ".../executorlib/standalone/hdf.py", line 39, in dump
    data=np.void(cloudpickle.dumps(data_value)),
TypeError: byte-like to large to store inside array.

Current behavior:

backend_write_file renames _i.h5 → _r.h5, then calls dump(), which raises. The exception propagates out, so the final os.rename(_r.h5 → _o.h5) never executes. The file remains as _r.h5 with neither an "output" nor an "error" key, a state that get_future_from_cache cannot interpret.
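
The stuck state can be illustrated without executorlib or a 2 GB payload. The following is a minimal sketch, not the library's code: `fake_dump`, `write_without_guard`, and `leftover_files_after_failure` are hypothetical names, and `fake_dump` simply raises the same TypeError as in the traceback above. Only the rename, dump, rename sequence is taken from the description.

```python
import os
import tempfile


def fake_dump(file_name: str, data_dict: dict) -> None:
    """Hypothetical stand-in for dump(); always fails like the np.void limit."""
    raise TypeError("byte-like to large to store inside array.")


def write_without_guard(file_name: str, output: dict, runtime: float) -> None:
    """Mimic the unguarded rename -> dump -> rename sequence."""
    file_name_out = os.path.splitext(file_name)[0][:-2]
    os.rename(file_name, file_name_out + "_r.h5")
    fake_dump(
        file_name=file_name_out + "_r.h5",
        data_dict={"output": output["result"], "runtime": runtime},
    )
    # Not reached when fake_dump raises.
    os.rename(file_name_out + "_r.h5", file_name_out + "_o.h5")


def leftover_files_after_failure() -> list[str]:
    """Run the unguarded write once and list what remains in the cache directory."""
    with tempfile.TemporaryDirectory() as tmp:
        task_file = os.path.join(tmp, "task_i.h5")
        open(task_file, "wb").close()  # empty placeholder for the input file
        try:
            write_without_guard(task_file, {"result": b"x"}, runtime=0.1)
        except TypeError:
            pass  # in the real backend the process crashes here
        return sorted(os.listdir(tmp))


if __name__ == "__main__":
    print(leftover_files_after_failure())
```

In this sketch only `task_r.h5` is left behind, so a reader that looks only for `_i.h5` or `_o.h5` finds nothing.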

Expected behavior:

If dump() fails when writing the result, backend_write_file should catch the exception, write it as an "error" key into the _r.h5 file, and then rename to _o.h5. This way get_future_from_cache can detect it as a failed task via the normal "error" path.

Suggested fix:

In backend_write_file (executorlib/task_scheduler/file/backend.py):

def backend_write_file(file_name: str, output: Any, runtime: float) -> None:
    file_name_out = os.path.splitext(file_name)[0][:-2]
    os.rename(file_name, file_name_out + "_r.h5")
    try:
        if "result" in output:
            dump(
                file_name=file_name_out + "_r.h5",
                data_dict={"output": output["result"], "runtime": runtime},
            )
        else:
            dump(
                file_name=file_name_out + "_r.h5",
                data_dict={"error": output["error"], "runtime": runtime},
            )
    except Exception as serialize_error:
        # Serialization failed — store the error so the job is not stuck
        dump(
            file_name=file_name_out + "_r.h5",
            data_dict={"error": serialize_error, "runtime": runtime},
        )
    os.rename(file_name_out + "_r.h5", file_name_out + "_o.h5")

This ensures that any serialization failure (including the np.void size limit) is surfaced through the existing error-handling path rather than leaving the job in a limbo state.
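
To check the control flow of the suggested fix in isolation, here is a rough sketch under stated assumptions. It uses a pickle-based `fake_dump` with an artificial `MAX_BYTES` cap in place of the real HDF5 `dump()` and the np.void limit; `fake_dump`, `guarded_write`, `demo`, and `MAX_BYTES` are all hypothetical names and not part of executorlib.

```python
import os
import pickle
import tempfile

MAX_BYTES = 1024  # hypothetical cap standing in for the np.void size limit


def fake_dump(file_name: str, data_dict: dict) -> None:
    """Stand-in for dump(): merge keys into a pickle file, reject oversized values."""
    payload = {key: pickle.dumps(value) for key, value in data_dict.items()}
    if any(len(blob) > MAX_BYTES for blob in payload.values()):
        raise TypeError("byte-like to large to store inside array.")
    stored = {}
    if os.path.getsize(file_name) > 0:
        with open(file_name, "rb") as f:
            stored = pickle.load(f)
    stored.update(payload)
    with open(file_name, "wb") as f:
        pickle.dump(stored, f)


def guarded_write(file_name: str, output: dict, runtime: float) -> str:
    """Same shape as the suggested fix, but calling fake_dump."""
    file_name_out = os.path.splitext(file_name)[0][:-2]
    running = file_name_out + "_r.h5"
    os.rename(file_name, running)
    try:
        if "result" in output:
            fake_dump(running, {"output": output["result"], "runtime": runtime})
        else:
            fake_dump(running, {"error": output["error"], "runtime": runtime})
    except Exception as serialize_error:
        # Store the serialization failure so the task is reported as failed.
        fake_dump(running, {"error": serialize_error, "runtime": runtime})
    os.rename(running, file_name_out + "_o.h5")
    return file_name_out + "_o.h5"


def demo() -> tuple[str, dict]:
    """Write an oversized result and return the final file name and stored keys."""
    with tempfile.TemporaryDirectory() as tmp:
        task_file = os.path.join(tmp, "task_i.h5")
        open(task_file, "wb").close()
        oversized = {"result": b"x" * (MAX_BYTES + 1)}
        out_file = guarded_write(task_file, oversized, runtime=0.1)
        with open(out_file, "rb") as f:
            stored = {k: pickle.loads(v) for k, v in pickle.load(f).items()}
        return os.path.basename(out_file), stored
```

With the oversized result, the file ends up as `task_o.h5` holding an "error" and a "runtime" entry, which is the state the normal error path expects. One caveat that the sketch does not address: if the fallback dump itself raises (for example, an exception object that cannot be pickled), the file would still remain in the _r.h5 state, so storing a string representation of the error may be a safer fallback.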

Environment:

  • executorlib version: 1.9.2
  • Python 3.13
  • Triggered by a 10,000-atom melt-quench simulation returning ~2+ GB of pickled data
