public.ecr.aws/neuron/pytorch-inference-vllm-neuronx contains a [ directory at root, breaking Enroot/Pyxis container execution #5968

@littlemex

Bug Report

Description

The Neuron vLLM DLC image public.ecr.aws/neuron/pytorch-inference-vllm-neuronx:0.16.0-neuronx-py312-sdk2.29.0-ubuntu24.04 contains a directory named [ at the filesystem root (/[). This directory breaks Enroot container execution, which is the standard container runtime used with Slurm + Pyxis on AWS ParallelCluster.

Environment

  • Image: public.ecr.aws/neuron/pytorch-inference-vllm-neuronx:0.16.0-neuronx-py312-sdk2.29.0-ubuntu24.04
  • Platform: AWS ParallelCluster 3.12+, Ubuntu 22.04
  • Instance type: trn2.3xlarge
  • Slurm: 23.x with Pyxis/Enroot
  • Enroot: 3.4.1

Steps to Reproduce

  1. Set up a ParallelCluster with Pyxis/Enroot using the official postinstall scripts.

  2. Write a Dockerfile based on the vLLM Neuron DLC:

    FROM public.ecr.aws/neuron/pytorch-inference-vllm-neuronx:0.16.0-neuronx-py312-sdk2.29.0-ubuntu24.04
    WORKDIR /workspace

  3. Build the image and convert it to an Enroot squashfs file:

    docker build -t neuron-inference:latest .
    enroot import -o neuron-inference.sqsh dockerd://neuron-inference:latest

  4. Run via Slurm + Pyxis:

    srun --container-image=${PWD}/neuron-inference.sqsh python3 -c "import vllm"

Expected Behavior

The container starts and executes the command.

Actual Behavior

slurmstepd: error: pyxis: container start failed with error code: 1
slurmstepd: error: pyxis: printing enroot log file:
slurmstepd: error: pyxis:     enroot-mount: failed to mount: /local_scratch/enroot/data/user-1000/pyxis_5.0[ at /local_scratch/enroot/data/user-1000/pyxis_5.0/[: No such file or directory

Root Cause

The DLC image contains a directory named [ at the filesystem root:

$ docker run --rm public.ecr.aws/neuron/pytorch-inference-vllm-neuronx:0.16.0-neuronx-py312-sdk2.29.0-ubuntu24.04 ls -la /\[
total 8
drwxr-xr-x 2 root root 4096 Apr 18 17:41 .
drwxr-xr-x 1 root root 4096 Apr 18 17:41 ..

When Enroot extracts the squashfs image, this [ directory ends up with mode 000 (d---------). Enroot then attempts to use it as a mount point, which fails because:

  1. The directory name [ is a shell glob metacharacter (it opens a bracket expression), so it is easy to mishandle in shell-based path processing
  2. The permissions are stripped to 000 during extraction
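
To check both conditions on an unpacked image, something like the following sketch may help. It lists top-level entries whose names contain [ together with their modes. The helper name list_bracket_entries is hypothetical (it is not part of Enroot or Pyxis), the scratch directory only stands in for the extracted root filesystem, and stat -c assumes GNU coreutils.

```shell
# Hypothetical helper, not part of Enroot or Pyxis: print "<mode> <path>" for
# every top-level entry of the directory $1 whose name contains "[".
list_bracket_entries() {
    # -name '*\[*' matches a literal "[" anywhere in the entry name.
    # stat -c is GNU coreutils syntax: %A is the symbolic mode, %n the path.
    find "$1" -mindepth 1 -maxdepth 1 -name '*\[*' -exec stat -c '%A %n' {} +
}

# Demonstration on a scratch directory standing in for the extracted image.
scratch=$(mktemp -d)
mkdir "$scratch/[" "$scratch/usr"
chmod 000 "$scratch/["            # mimic the mode reported after extraction

list_bracket_entries "$scratch"   # one line: d--------- followed by the path

# Clean up; restore permissions first so removal cannot be blocked.
chmod 755 "$scratch/["
rm -rf "$scratch"
```

On a cluster node, the same function could be pointed at the extracted root filesystem, for example the /local_scratch/enroot/data/user-1000/pyxis_5.0 directory named in the log above.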

Verification

Other images (e.g., ubuntu:22.04) do not contain this directory and work correctly with the same Enroot/Pyxis setup.
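
A comparison like this can be scripted. The sketch below reports, for each unpacked root filesystem passed as an argument, whether it contains an entry named [. The function name report_bracket_entry is hypothetical, and the two scratch directories are stand-ins for real extracted images.

```shell
# Hypothetical helper: for each root directory given as an argument, report
# whether it contains a top-level entry named "[".
report_bracket_entry() {
    for rootfs in "$@"; do
        # Quoting the path keeps "[" literal; -L also catches a dangling symlink.
        if [ -e "$rootfs/[" ] || [ -L "$rootfs/[" ]; then
            echo "$rootfs: has /["
        else
            echo "$rootfs: clean"
        fi
    done
}

# Demonstration with two scratch directories standing in for extracted images.
affected=$(mktemp -d)
clean=$(mktemp -d)
mkdir "$affected/["

report_bracket_entry "$affected" "$clean"

rm -rf "$affected" "$clean"
```

To test an image directly rather than an unpacked copy, a command along the lines of docker run --rm IMAGE test -e '/[' should exit non-zero when the entry is absent, assuming the image passes the command through unchanged as it does for the ls invocation under Root Cause.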

Workaround

Add RUN rm -rf /\[ to the Dockerfile:

FROM public.ecr.aws/neuron/pytorch-inference-vllm-neuronx:0.16.0-neuronx-py312-sdk2.29.0-ubuntu24.04
RUN rm -rf /\[
WORKDIR /workspace

Impact

This issue affects anyone running the vLLM Neuron DLC under Slurm + Pyxis/Enroot, which is the recommended container orchestration for AWS ParallelCluster. The error message is cryptic, which makes the failure extremely difficult to diagnose.

Suggested Fix

Remove the /[ directory from the DLC image build process. This directory appears to serve no purpose and breaks Enroot compatibility.
