mlenv

Snapshot the machine learning environment to a single file, and diff two snapshots to see exactly what changed.

"It trained fine last week" usually comes down to something in the stack moving: a different CUDA build behind PyTorch, a minor Python bump, a driver upgrade, a GPU swapped on a shared box. mlenv captures all of that into one JSON file you can commit, share, and compare, with the changes most likely to affect results ranked first.

$ mlenv snapshot -o good.json     # when results were correct
$ mlenv snapshot -o now.json      # after something changed
$ mlenv diff good.json now.json
risk  section   key         change
high  cuda      torch_cuda  12.1 -> 11.8
high  python    version     3.11.9 -> 3.12.0
low   packages  rich        13.7.0 -> 13.8.0

Install

$ pip install mlenv-cli                 # from PyPI, once released
$ pip install git+https://github.com/jmweb-org/mlenv   # latest, available now          # core: Python, platform, packages, CUDA, env vars
$ pip install "mlenv-cli[gpu]"   # adds GPU model and driver capture via NVML

mlenv has no heavy dependencies and runs anywhere. GPU details are read through NVML only when the gpu extra is installed and a driver is present; without it, every other section is still captured.

What it captures

Section	Contents
`python`	Version, implementation, interpreter path
`platform`	OS, release, architecture, libc
`packages`	Every installed distribution and its version
`cuda`	PyTorch build, its CUDA and cuDNN versions, availability
`gpus`	GPU count, model per index, driver version
`env`	Training-relevant variables (`CUDA_`, `NCCL_`, `OMP_NUM_THREADS`, ...)

Commands

$ mlenv snapshot -o env.json   # write a snapshot (omit -o to print to stdout)
$ mlenv show                   # pretty-print the current environment
$ mlenv diff a.json b.json     # show changes, highest risk first
$ mlenv diff a.json b.json --json    # machine-readable changes
$ mlenv diff a.json b.json --check   # exit non-zero on any high-risk change

In CI

--check turns the diff into a gate. Commit a baseline snapshot and fail the build when the runner drifts from it in a way that could change results:

- run: mlenv snapshot -o current.json
- run: mlenv diff baseline.json current.json --check

Risk levels

A change is high risk when it routinely moves numerical results: anything under cuda, a GPU model or count or driver change, or a Python minor-version bump. A major-version bump of a sensitive package (torch, numpy, tensorflow, jax, transformers) is high; a smaller bump is medium. Everything else is low. Only high-risk changes trip --check.

Exit codes

Code	Meaning
0	Ran; no high-risk change (or `--check` not set)
1	`--check` found a high-risk change
2	A snapshot file was missing or invalid

License

MIT. See LICENSE.

Name		Name	Last commit message	Last commit date
Latest commit History 17 Commits
.github/workflows		.github/workflows
src/mlenv		src/mlenv
tests		tests
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
CHANGELOG.md		CHANGELOG.md
Dockerfile		Dockerfile
LICENSE		LICENSE
README.md		README.md
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

mlenv

Install

What it captures

Commands

In CI

Risk levels

Exit codes

License

About

Uh oh!

Releases 2

Packages

Uh oh!

Contributors

Uh oh!

Languages

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

mlenv

Install

What it captures

Commands

In CI

Risk levels

Exit codes

License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 2

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages