Bug Report
Description
The Neuron vLLM DLC image public.ecr.aws/neuron/pytorch-inference-vllm-neuronx:0.16.0-neuronx-py312-sdk2.29.0-ubuntu24.04 contains a directory named [ at the filesystem root (/[). This directory breaks Enroot container execution, which is the standard container runtime used with Slurm + Pyxis on AWS ParallelCluster.
Environment
- Image: public.ecr.aws/neuron/pytorch-inference-vllm-neuronx:0.16.0-neuronx-py312-sdk2.29.0-ubuntu24.04
- Platform: AWS ParallelCluster 3.12+, Ubuntu 22.04
- Instance type: trn2.3xlarge
- Slurm: 23.x with Pyxis/Enroot
- Enroot: 3.4.1
Steps to Reproduce
- Set up a ParallelCluster with Pyxis/Enroot using the official postinstall scripts.
- Build a Docker image based on the vLLM Neuron DLC:
FROM public.ecr.aws/neuron/pytorch-inference-vllm-neuronx:0.16.0-neuronx-py312-sdk2.29.0-ubuntu24.04
WORKDIR /workspace
- Build the image and convert it to an Enroot squashfs:
docker build -t neuron-inference:latest .
enroot import -o neuron-inference.sqsh dockerd://neuron-inference:latest
- Run via Slurm + Pyxis:
srun --container-image=${PWD}/neuron-inference.sqsh python3 -c "import vllm"
Expected Behavior
The container starts and executes the command.
Actual Behavior
slurmstepd: error: pyxis: container start failed with error code: 1
slurmstepd: error: pyxis: printing enroot log file:
slurmstepd: error: pyxis: enroot-mount: failed to mount: /local_scratch/enroot/data/user-1000/pyxis_5.0[ at /local_scratch/enroot/data/user-1000/pyxis_5.0/[: No such file or directory
Root Cause
The DLC image contains a directory named [ at the filesystem root:
$ docker run --rm public.ecr.aws/neuron/pytorch-inference-vllm-neuronx:0.16.0-neuronx-py312-sdk2.29.0-ubuntu24.04 ls -la /\[
total 8
drwxr-xr-x 2 root root 4096 Apr 18 17:41 .
drwxr-xr-x 1 root root 4096 Apr 18 17:41 ..
When Enroot extracts the squashfs image, this [ directory gets permissions d--------- (000). Enroot then attempts to use it as a mount point, which fails because:
- The directory name [ is a shell metacharacter
- The permissions are stripped to 000 during extraction
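To check whether an image carries this kind of entry before importing it into Enroot, the following is a minimal sketch. It assumes the image's root filesystem has already been unpacked into a local directory (for example from a docker export tarball). The function name find_glob_named_entries is hypothetical, and the check only looks at names; it does not reproduce Enroot's mount logic.

```shell
#!/bin/sh
# Hypothetical helper: print top-level entries of an unpacked root filesystem
# whose names contain a shell glob metacharacter ([, ], * or ?).
find_glob_named_entries() {
    rootfs="$1"
    for path in "$rootfs"/*; do
        # Skip the unexpanded pattern when the directory is empty.
        [ -e "$path" ] || [ -L "$path" ] || continue
        name="${path##*/}"
        case "$name" in
            *\[*|*\]*|*\**|*\?*) printf '%s\n' "$name" ;;
        esac
    done
}
```

Calling find_glob_named_entries /path/to/unpacked/rootfs on an affected image should list the [ entry; no output means no top-level name matched.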
Verification
Other images (e.g., ubuntu:22.04) do not contain this directory and work correctly with the same Enroot/Pyxis setup.
Workaround
Add RUN rm -rf /\[ to the Dockerfile:
FROM public.ecr.aws/neuron/pytorch-inference-vllm-neuronx:0.16.0-neuronx-py312-sdk2.29.0-ubuntu24.04
RUN rm -rf /\[
WORKDIR /workspace
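The same removal can be tried outside a Docker build. The sketch below assumes a scratch directory standing in for the image root; remove_bracket_dir is a hypothetical helper name. It uses single quotes instead of the backslash escape, which keeps [ literal in the same way, and then confirms the entry is gone.

```shell
#!/bin/sh
# Hypothetical helper: delete a directory literally named "[" under the given
# root, then succeed only if no such entry remains.
remove_bracket_dir() {
    root="$1"
    # Single quotes keep "[" literal, equivalent to writing \[.
    rm -rf -- "$root"/'['
    [ ! -e "$root/[" ]
}
```

For example, remove_bracket_dir /tmp/rootfs returns success once /tmp/rootfs/[ no longer exists and leaves the other entries untouched.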
Impact
This issue affects anyone using the vLLM Neuron DLC with Slurm + Pyxis/Enroot, which is the recommended container orchestration for AWS ParallelCluster. The error message is cryptic: it reports a failed mount without indicating that a directory inside the image is the cause, which makes the problem difficult to diagnose.
Suggested Fix
Remove the /[ directory from the DLC image build process. This directory appears to serve no purpose and breaks Enroot compatibility.