sync_daemon_mpi: give daemon a distinct PMIX_NAMESPACE#514

Merged
TApplencourt merged 1 commit into devel from fix/sync-daemon-mpi-pmix-namespace
Jun 24, 2026

@TApplencourt

Collaborator

The MPI sync daemon inherits the traced application's PMIx identity (PMIX_NAMESPACE). Two MPI instances that claim the same (namespace, rank) pair collide in the node-local SHM bootstrap and either deadlock (n=128 --ppn 64) or abort with a truncation error (n=4).

Append a fixed suffix to PMIX_NAMESPACE before any MPI call so that the daemon's session gets its own WORLD. Verified on 2 nodes with THAPI_SYNC_DAEMON=mpi: with the fix, exit 0 and full tally; with the fix disabled (control), hang. The default remains fs pending MPICH feedback.
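
For readers who want to see the shape of the change, here is a minimal sketch of the approach described above, not the code from the merged commit. The function name `append_pmix_namespace_suffix` and the suffix value `-sync-daemon` are hypothetical; the PR only states that a fixed suffix is appended to PMIX_NAMESPACE before any MPI call.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical suffix: the PR says "a fixed suffix" but does not name it. */
#define DAEMON_NS_SUFFIX "-sync-daemon"

/*
 * Append `suffix` to PMIX_NAMESPACE in this process's environment so that
 * an MPI library initialized afterwards sees a namespace distinct from the
 * traced application's.
 *
 * Returns 0 on success (or if PMIX_NAMESPACE is unset), -1 on failure.
 * Intended to be called before any MPI call, e.g.:
 *     append_pmix_namespace_suffix(DAEMON_NS_SUFFIX);
 */
int append_pmix_namespace_suffix(const char *suffix) {
    assert(suffix != NULL);

    const char *ns = getenv("PMIX_NAMESPACE");
    if (ns == NULL)
        return 0; /* variable not inherited: nothing to rename */

    /* Build "<namespace><suffix>" in a temporary buffer. */
    size_t len = strlen(ns) + strlen(suffix) + 1;
    char *renamed = malloc(len);
    if (renamed == NULL)
        return -1;
    snprintf(renamed, len, "%s%s", ns, suffix);

    /* setenv (POSIX) copies the value, so the buffer can be freed. */
    int rc = setenv("PMIX_NAMESPACE", renamed, 1);
    free(renamed);
    return rc;
}
```

In the daemon this would presumably run at the very top of `main`, ahead of MPI initialization, since the environment has to be changed before the MPI library reads it.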

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@TApplencourt

Collaborator Author

Related to pmodels/mpich#7852

@nscottnichols nscottnichols left a comment

Contributor

LGTM!

@TApplencourt TApplencourt merged commit cb7d9c4 into devel Jun 24, 2026
28 checks passed