@alang, sorry to revive an old thread, but I'm still having trouble building GROMACS with cuFFTMp.
NVSHMEM is now recognized, but some tests still fail for reasons that appear related to either a cuFFTMp/NVSHMEM mismatch or a cuFFTMp+NVSHMEM/OpenMPI mismatch. I'm using OpenMPI 4.1.4 built CUDA-aware with CUDA 11.7 and HPC-SDK 22.7, compiling against the cuFFTMp and NVSHMEM shipped in that same HPC-SDK build. The GROMACS build itself goes fine, but `make check` then fails. This is particularly confusing because `22.7/comm_libs/11.7/nvshmem` is version 2.6.0 (I checked with `22.7/comm_libs/11.7/nvshmem/bin/nvshmem-info`), and cuFFTMp is `22.7/math_libs/11.7/lib64/libcufftMp.so.10.8.1`. Both are consistent with the Release Notes Version 22.7, and they should be compatible according to the "NVSHMEM and cuFFTMp" section of the cuFFTMp 11.0.14 documentation. Here is the error:
```
30/86 Test #30: FFTMpiUnitTests …***Failed 12.03 sec
[==========] Running 4 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 4 tests from GpuFft/GpuFftTest3D
[ RUN ] GpuFft/GpuFftTest3D.GpuFftDecomposition/0
WARN: GDRCopy open call failed, falling back to not using GDRCopy
WARN: GDRCopy open call failed, falling back to not using GDRCopy
WARN: GDRCopy open call failed, falling back to not using GDRCopy
WARN: GDRCopy open call failed, falling back to not using GDRCopy
src/comm/transports/ibrc/ibrc.cpp:598: NULL value mem registration failed
src/mem/mem.cpp:294: non-zero status: 2 transport get memhandle failed
src/comm/transports/ibrc/ibrc.cpp:598: NULL value mem registration failed
src/mem/mem.cpp:294: non-zero status: 2 transport get memhandle failed
src/comm/transports/ibrc/ibrc.cpp:598: NULL value mem registration failed
src/mem/mem.cpp:294: non-zero status: 2 transport get memhandle failed
src/comm/transports/ibrc/ibrc.cpp:598: NULL value mem registration failed
src/mem/mem.cpp:294: non-zero status: 2 transport get memhandle failed
src/init/init.cu:726: non-zero status: 7 nvshmem setup heap failed
src/init/init.cu:726: non-zero status: 7 nvshmem setup heap failed
src/init/init.cu:726: non-zero status: 7 nvshmem setup heap failed
src/init/init.cu:726: non-zero status: 7 nvshmem setup heap failed
src/comm/transports/ibrc/ibrc.cpp:598: NULL value mem registration failed
src/mem/mem.cpp:294: non-zero status: 2 transport get memhandle failed
src/comm/transports/ibrc/ibrc.cpp:598: NULL value mem registration failed
src/mem/mem.cpp:294: non-zero status: 2 transport get memhandle failed
src/comm/transports/ibrc/ibrc.cpp:598: NULL value mem registration failed
src/mem/mem.cpp:294: non-zero status: 2 transport get memhandle failed
src/comm/transports/ibrc/ibrc.cpp:598: NULL value mem registration failed
src/mem/mem.cpp:294: non-zero status: 2 transport get memhandle failed
src/init/init.cu:726: non-zero status: 7 nvshmem setup heap failed
src/init/init.cu:796: non-zero status: 7 nvshmemi_common_init failed …src/init/init_device.cu:nvshmemi_check_state_and_init:44: nvshmem initialization failed, exiting
src/util/cs.cpp:21: non-zero status: 16: Bad address, exiting… mutex destroy failed
[uscacl-0-34:25089:0:25089] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x40)
src/init/init.cu:726: non-zero status: 7 nvshmem setup heap failed
src/init/init.cu:796: non-zero status: 7 nvshmemi_common_init failed …src/init/init_device.cu:nvshmemi_check_state_and_init:44: nvshmem initialization failed, exiting
src/util/cs.cpp:21: non-zero status: 16: Bad address, exiting… mutex destroy failed
[uscacl-0-34:25091:0:25091] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x41)
src/init/init.cu:726: non-zero status: 7 nvshmem setup heap failed
src/init/init.cu:796: non-zero status: 7 nvshmemi_common_init failed …src/init/init_device.cu:nvshmemi_check_state_and_init:44: nvshmem initialization failed, exiting
src/util/cs.cpp:21: non-zero status: 16: Bad address, exiting… mutex destroy failed
[uscacl-0-34:25090:0:25090] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x41)
src/init/init.cu:726: non-zero status: 7 nvshmem setup heap failed
src/init/init.cu:796: non-zero status: 7 nvshmemi_common_init failed …src/init/init_device.cu:nvshmemi_check_state_and_init:44: nvshmem initialization failed, exiting
src/util/cs.cpp:21: non-zero status: 16: Bad address, exiting… mutex destroy failed
[uscacl-0-34:25092:0:25092] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x41)
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
mpiexec noticed that process rank 2 with PID 0 on node uscacl-0-34 exited on signal 11 (Segmentation fault).
```
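For reference, my configure step looks roughly like this. The paths and parallelism are illustrative for my system; `-DGMX_USE_CUFFTMP=ON` and `-DcuFFTMp_ROOT` are the flags documented in the GROMACS install guide for cuFFTMp builds:

```shell
# Configure GROMACS against cuFFTMp from HPC-SDK 22.7
# (NVHPC path and binary names are illustrative for my cluster).
NVHPC=/opt/nvidia/hpc_sdk/Linux_x86_64/22.7
cmake .. \
  -DGMX_MPI=ON \
  -DGMX_GPU=CUDA \
  -DGMX_USE_CUFFTMP=ON \
  -DcuFFTMp_ROOT=${NVHPC}/math_libs/11.7 \
  -DCMAKE_C_COMPILER=mpicc \
  -DCMAKE_CXX_COMPILER=mpicxx
make -j 8 && make check
```

mpicc/mpicxx here are the wrappers from the CUDA-aware OpenMPI 4.1.4 described above.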
I also noticed that HPC-SDK 23.5 (but not 22.7 or 22.9) ships a second NVSHMEM build, `23.5/comm_libs/11.0/nvshmem_cufftmp_compat`, in addition to `23.5/comm_libs/11.0/nvshmem`. So I tried building GROMACS against this `nvshmem_cufftmp_compat`, with cuFFTMp from `22.7/math_libs` and the same CUDA-aware OpenMPI 4.1.4 built with CUDA 11.7 and HPC-SDK 22.7. This time I got the following test failure:
```
30/86 Test #30: FFTMpiUnitTests …***Failed 7.43 sec
[==========] Running 4 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 4 tests from GpuFft/GpuFftTest3D
[ RUN ] GpuFft/GpuFftTest3D.GpuFftDecomposition/0
src/bootstrap/bootstrap_mpi.c:nvshmemi_bootstrap_plugin_init:101: MPI bootstrap version (20800) is not compatible with NVSHMEM version (-1429415672)[uscacl-0-34:20708:0:20708] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x40)
src/bootstrap/bootstrap_mpi.c:nvshmemi_bootstrap_plugin_init:101: MPI bootstrap version (20800) is not compatible with NVSHMEM version (-1429415672)[uscacl-0-34:20710:0:20710] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x40)
src/bootstrap/bootstrap_mpi.c:nvshmemi_bootstrap_plugin_init:101: MPI bootstrap version (20800) is not compatible with NVSHMEM version (-1429415672)[uscacl-0-34:20711:0:20711] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x40)
src/bootstrap/bootstrap_mpi.c:nvshmemi_bootstrap_plugin_init:101: MPI bootstrap version (20800) is not compatible with NVSHMEM version (-1429415672)[uscacl-0-34:20709:0:20709] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x40)
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
mpiexec noticed that process rank 0 with PID 0 on node uscacl-0-34 exited on signal 11 (Segmentation fault).
```
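Since the garbage NVSHMEM version number in the bootstrap message made me suspect the test binary is resolving a different `libnvshmem` at run time than the one I built against, I checked what it actually links (the binary path is illustrative for my build tree):

```shell
# Print which NVSHMEM shared library (if any) a binary resolves at run time.
# Falls back to a message if the binary is missing or nothing matches.
check_nvshmem_link() {
  libs=$(ldd "$1" 2>/dev/null | grep -i nvshmem)
  if [ -n "$libs" ]; then
    printf '%s\n' "$libs"
  else
    echo "no NVSHMEM library linked (or binary not found)"
  fi
}

# Illustrative test-binary path from my GROMACS build tree:
check_nvshmem_link bin/fft-mpi-test
```

If a library from a different HPC-SDK release shows up here (or earlier on `LD_LIBRARY_PATH`), that might explain the version mismatch, but I haven't been able to confirm that is what's happening.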
Any suggestions?