Unable to build with cuFFTMp

GROMACS version: 2023
GROMACS modification: No

I am following the instructions here for building GROMACS with support for decomposing PME work across multiple GPUs, but the resulting build is unable to do so. As instructed, I’m passing the following CMake options (with the placeholder replaced by the actual path to the HPC SDK):

-DGMX_USE_CUFFTMP=ON
-DcuFFTMp_ROOT=<path to NVIDIA HPC SDK math_libs folder>
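
In other words, the configure step looks something like this (a sketch with placeholders for my local paths; other options omitted):

cmake .. \
    -DGMX_GPU=CUDA -DCMAKE_BUILD_TYPE=Release \
    -DGMX_USE_CUFFTMP=ON \
    -DcuFFTMp_ROOT=<path to NVIDIA HPC SDK math_libs folder>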

However, -DGMX_USE_CUFFTMP=ON appears to be ignored completely – it does not show up among the options when I examine them with ccmake. The other option, -DcuFFTMp_ROOT, raises a warning:

CMake Warning:
Manually-specified variables were not used by the project:

cuFFTMp_ROOT
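
For what it’s worth, a quick way to cross-check what the resulting binary was actually configured with (besides ccmake) is the version output, e.g.

gmx --version | grep -i fft

which lists the FFT back-ends the build was set up to use.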

When I run this build with -npme 2, I get the error I would expect if the build did not include the multi-GPU PME support at all. However, the wording about “not implemented with more than one PME rank” is confusing – isn’t it already implemented?

Program: gmx mdrun, version 2023
Source file: src/gromacs/taskassignment/decidegpuusage.cpp (line 277)
Function: bool gmx::decideWhetherToUseGpusForPmeWithThreadMpi(bool, gmx::TaskTarget, gmx::TaskTarget, int, const std::vector<int>&, const t_inputrec&, int, int)

Feature not implemented:
PME tasks were required to run on GPUs, but that is not implemented with more
than one PME rank. Use a single rank simulation, or a separate PME rank, or
permit PME tasks to be assigned to the CPU.
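
For completeness, the run command looked roughly like this (file name and thread counts are just placeholders):

gmx mdrun -deffnm md -ntmpi 4 -ntomp 8 -nb gpu -pme gpu -npme 2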

I should add that I’m using CUDA runtime version 11.6 and CUDA driver version 11.4. Looking in the Linux_x86_64/2023/math_libs/ directory of the HPC SDK, which has subdirectories 11.0, 11.8, and 12.1, I see that only 11.8/lib64 and 12.1/lib64 contain libcufftMp.so; 11.0/lib64 does not. There is also math_libs/lib64, which is a symlink to 12.1/lib64.
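
In other words, something like

ls <HPC SDK root>/Linux_x86_64/2023/math_libs/*/lib64/libcufftMp.so*

only returns hits under 11.8 and 12.1.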

So does the build system detect that my CUDA version is <11.8 and refuse to even attempt building the multi-GPU PME capability?

@pszilard: In your article on the NVIDIA developer blog the instructions for building GROMACS with GMX_USE_CUFFTMP=ON specified an MPI build (GMX_MPI=ON), not thread-MPI. Are GMX_THREAD_MPI=ON and GMX_USE_CUFFTMP=ON mutually exclusive?

I confirmed that GMX_USE_CUFFTMP=ON only works with GMX_MPI=ON, not with GMX_THREAD_MPI=ON, but now I’m getting compilation errors:

[ 17%] Building NVCC (Device) object src/gromacs/CMakeFiles/libgromacs.dir/mdlib/libgromacs_generated_leapfrog_gpu.cpp.o
gromacs-2023/src/gromacs/fft/gpu_3dfft_cufftmp.cpp(115): error: identifier "cufftBox3d" is undefined
gromacs-2023/src/gromacs/fft/gpu_3dfft_cufftmp.cpp(115): error: expected a ";"
gromacs-2023/src/gromacs/fft/gpu_3dfft_cufftmp.cpp(122): error: expected a ";"
gromacs-2023/src/gromacs/fft/gpu_3dfft_cufftmp.cpp(143): error: identifier "complexBox" is undefined
gromacs-2023/src/gromacs/fft/gpu_3dfft_cufftmp.cpp(171): error: identifier "complexBox" is undefined
gromacs-2023/src/gromacs/fft/gpu_3dfft_cufftmp.cpp(188): error: identifier "realBox" is undefined
gromacs-2023/src/gromacs/fft/gpu_3dfft_cufftmp.cpp(188): error: identifier "complexBox" is undefined
gromacs-2023/src/gromacs/fft/gpu_3dfft_cufftmp.cpp(188): error: too few arguments in function call
gromacs-2023/src/gromacs/fft/gpu_3dfft_cufftmp.cpp(190): error: too few arguments in function call

9 errors detected in the compilation of "gromacs-2023/src/gromacs/fft/gpu_3dfft_cufftmp.cpp".
CMake Error at libgromacs_generated_gpu_3dfft_cufftmp.cpp.o.Release.cmake:280 (message):
Error generating file build/src/gromacs/CMakeFiles/libgromacs.dir/fft/./libgromacs_generated_gpu_3dfft_cufftmp.cpp.o

I am using GCC 11.2, CUDA 11.6, OpenMPI 4.1.1 (compiled with GCC 11.2), and hpc-sdk 23.5.

This is an API mismatch between the GROMACS and hpc-sdk versions, which we fixed recently - please just use the latest patch release, GROMACS 2023.2.
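
If you build from the git repository, that is just, e.g.

git fetch --tags
git checkout v2023.2

otherwise the 2023.2 tarball from the download page works equally well.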

Thanks for letting me know!

Are there any plans to implement this in the thread-MPI context? I would love to use this to run, for example, a 6-GPU simulation on a single node, dedicating two or three GPUs to PME work.

No, there should not be a (major) benefit from using thread-MPI over MPI. If you see any, please let us know.

I am able to compile the 2023.2 version with cuFFTMp, but most of the tests are failing because of this:

error while loading shared libraries: libnvshmem_host.so.2: cannot open shared object file: No such file or directory

libnvshmem_host.so.2 is located here:

/usr/prog/hpc-sdk/23.5/Linux_x86_64/23.5/comm_libs/11.0/nvshmem/lib/libnvshmem_host.so.2,

part of the same HPC SDK as this:

-DcuFFTMp_ROOT=/usr/prog/hpc-sdk/23.5/Linux_x86_64/2023/math_libs,

but obviously not in the math_libs directory. I am still using CUDA 11.6 – could this be an incompatibility between CUDA 11.6 and HPC SDK 23.5, or is something else going on?

Please can you try adding

export LD_LIBRARY_PATH=/usr/prog/hpc-sdk/23.5/Linux_x86_64/23.5/comm_libs/11.0/nvshmem/lib/:$LD_LIBRARY_PATH

to your run script, so that this library can be picked up at runtime.
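
You can then confirm that the library is resolved with something like

ldd ./bin/gmx_mpi | grep nvshmem

(or the same on whichever test binary is failing), which should no longer report libnvshmem_host.so.2 as not found.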

@alang, sorry to revive an old thread, but I’m still having trouble building GROMACS with cuFFTMp.

NVSHMEM is now recognized, but some tests are still failing for reasons that appear related to either a cuFFTMp / NVSHMEM mismatch or a cuFFTMp+NVSHMEM / OpenMPI mismatch. I’m using OpenMPI 4.1.4, built CUDA-aware with CUDA 11.7 and HPC-SDK 22.7, and compiling against the cuFFTMp and NVSHMEM that ship in the same HPC-SDK build. The GROMACS build itself goes fine, but then I get a test failure during make check. This is particularly confusing because 22.7/comm_libs/11.7/nvshmem is version 2.6.0 (I checked using 22.7/comm_libs/11.7/nvshmem/bin/nvshmem-info) and cuFFTMp is 22.7/math_libs/11.7/lib64/libcufftMp.so.10.8.1, which are both consistent with this: Release Notes Version 22.7, and should be compatible according to NVSHMEM and cuFFTMp — cuFFTMp 11.0.14 documentation. Here is the error:

30/86 Test #30: FFTMpiUnitTests …***Failed 12.03 sec
[==========] Running 4 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 4 tests from GpuFft/GpuFftTest3D
[ RUN ] GpuFft/GpuFftTest3D.GpuFftDecomposition/0
WARN: GDRCopy open call failed, falling back to not using GDRCopy
WARN: GDRCopy open call failed, falling back to not using GDRCopy
WARN: GDRCopy open call failed, falling back to not using GDRCopy
WARN: GDRCopy open call failed, falling back to not using GDRCopy
src/comm/transports/ibrc/ibrc.cpp:598: NULL value mem registration failed
src/mem/mem.cpp:294: non-zero status: 2 transport get memhandle failed
src/comm/transports/ibrc/ibrc.cpp:598: NULL value mem registration failed
src/mem/mem.cpp:294: non-zero status: 2 transport get memhandle failed
src/comm/transports/ibrc/ibrc.cpp:598: NULL value mem registration failed
src/mem/mem.cpp:294: non-zero status: 2 transport get memhandle failed
src/comm/transports/ibrc/ibrc.cpp:598: NULL value mem registration failed
src/mem/mem.cpp:294: non-zero status: 2 transport get memhandle failed
src/init/init.cu:726: non-zero status: 7 nvshmem setup heap failed
src/init/init.cu:726: non-zero status: 7 nvshmem setup heap failed
src/init/init.cu:726: non-zero status: 7 nvshmem setup heap failed
src/init/init.cu:726: non-zero status: 7 nvshmem setup heap failed
src/comm/transports/ibrc/ibrc.cpp:598: NULL value mem registration failed
src/mem/mem.cpp:294: non-zero status: 2 transport get memhandle failed
src/comm/transports/ibrc/ibrc.cpp:598: NULL value mem registration failed
src/mem/mem.cpp:294: non-zero status: 2 transport get memhandle failed
src/comm/transports/ibrc/ibrc.cpp:598: NULL value mem registration failed
src/mem/mem.cpp:294: non-zero status: 2 transport get memhandle failed
src/comm/transports/ibrc/ibrc.cpp:598: NULL value mem registration failed
src/mem/mem.cpp:294: non-zero status: 2 transport get memhandle failed
src/init/init.cu:726: non-zero status: 7 nvshmem setup heap failed
src/init/init.cu:796: non-zero status: 7 nvshmemi_common_init failed …src/init/init_device.cu:nvshmemi_check_state_and_init:44: nvshmem initialization failed, exiting
src/util/cs.cpp:21: non-zero status: 16: Bad address, exiting… mutex destroy failed
[uscacl-0-34:25089:0:25089] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x40)
src/init/init.cu:726: non-zero status: 7 nvshmem setup heap failed
src/init/init.cu:796: non-zero status: 7 nvshmemi_common_init failed …src/init/init_device.cu:nvshmemi_check_state_and_init:44: nvshmem initialization failed, exiting
src/util/cs.cpp:21: non-zero status: 16: Bad address, exiting… mutex destroy failed
[uscacl-0-34:25091:0:25091] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x41)
src/init/init.cu:726: non-zero status: 7 nvshmem setup heap failed
src/init/init.cu:796: non-zero status: 7 nvshmemi_common_init failed …src/init/init_device.cu:nvshmemi_check_state_and_init:44: nvshmem initialization failed, exiting
src/util/cs.cpp:21: non-zero status: 16: Bad address, exiting… mutex destroy failed
[uscacl-0-34:25090:0:25090] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x41)
src/init/init.cu:726: non-zero status: 7 nvshmem setup heap failed
src/init/init.cu:796: non-zero status: 7 nvshmemi_common_init failed …src/init/init_device.cu:nvshmemi_check_state_and_init:44: nvshmem initialization failed, exiting
src/util/cs.cpp:21: non-zero status: 16: Bad address, exiting… mutex destroy failed
[uscacl-0-34:25092:0:25092] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x41)

Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.

mpiexec noticed that process rank 2 with PID 0 on node uscacl-0-34 exited on signal 11 (Segmentation fault).

I also noticed that HPC-SDK 23.5 (but not 22.7 or 22.9) ships an additional NVSHMEM build, 23.5/comm_libs/11.0/nvshmem_cufftmp_compat, alongside 23.5/comm_libs/11.0/nvshmem. So I tried building GROMACS with this nvshmem_cufftmp_compat, cuFFTMp from 22.7/math_libs, and the same CUDA-aware OpenMPI 4.1.4 built with CUDA 11.7 and hpc-sdk 22.7. This time I got the following test failure:

30/86 Test #30: FFTMpiUnitTests …***Failed 7.43 sec
[==========] Running 4 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 4 tests from GpuFft/GpuFftTest3D
[ RUN ] GpuFft/GpuFftTest3D.GpuFftDecomposition/0
src/bootstrap/bootstrap_mpi.c:nvshmemi_bootstrap_plugin_init:101: MPI bootstrap version (20800) is not compatible with NVSHMEM version (-1429415672)[uscacl-0-34:20708:0:20708] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x40)
src/bootstrap/bootstrap_mpi.c:nvshmemi_bootstrap_plugin_init:101: MPI bootstrap version (20800) is not compatible with NVSHMEM version (-1429415672)[uscacl-0-34:20710:0:20710] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x40)
src/bootstrap/bootstrap_mpi.c:nvshmemi_bootstrap_plugin_init:101: MPI bootstrap version (20800) is not compatible with NVSHMEM version (-1429415672)[uscacl-0-34:20711:0:20711] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x40)
src/bootstrap/bootstrap_mpi.c:nvshmemi_bootstrap_plugin_init:101: MPI bootstrap version (20800) is not compatible with NVSHMEM version (-1429415672)[uscacl-0-34:20709:0:20709] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x40)

Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.

mpiexec noticed that process rank 0 with PID 0 on node uscacl-0-34 exited on signal 11 (Segmentation fault).

Any suggestions?

Hi Roman,

I wonder if your LD_LIBRARY_PATH is not set up correctly and is perhaps picking up another version of NVSHMEM at runtime. I just tested with HPC-SDK 23.5 and it works OK for me. Here are my explicit commands to build and test, including setting LD_LIBRARY_PATH:

HPCSDK=/lustre/fsw/coreai_devtech_all/alang/packages/nvhpc/nvhpc_2023_235_Linux_x86_64_cuda_multi-install
HPCSDK_LIBDIR=$HPCSDK/Linux_x86_64/2023/math_libs/12.1
NVSHMEM_HOME=$HPCSDK/Linux_x86_64/2023/comm_libs/12.1/nvshmem_cufftmp_compat

export LD_LIBRARY_PATH=$NVSHMEM_HOME/lib:$LD_LIBRARY_PATH

cd gromacs
git checkout v2023.3

rm -rf build
mkdir build
cd build

cmake \
    ../ \
    -DGMX_OPENMP=ON -DGMX_MPI=ON -DGMX_BUILD_OWN_FFTW=ON \
    -DGMX_GPU=CUDA  -DCMAKE_BUILD_TYPE=Release -DGMX_DOUBLE=off \
    -DGMX_USE_CUFFTMP=ON -DcuFFTMp_ROOT=$HPCSDK_LIBDIR \
    -DBUILD_TESTING=ON -DGMX_BUILD_UNITTESTS=ON -DGMX_DEVELOPER_BUILD=ON

make -j

mpirun --allow-run-as-root -np 4 ./bin/fft-mpi-test

Note that I am using OpenMPI 4.1.6a1 and CUDA 12.2.
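
The same test is also registered with CTest (it is the FFTMpiUnitTests test in your make check output), so from the build directory you can equivalently run

ctest -R FFTMpiUnitTests --output-on-failure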

Scrolling up, I see that I already mentioned LD_LIBRARY_PATH, but I specified the “nvshmem” dir in the SDK rather than “nvshmem_cufftmp_compat”. Apologies - I didn’t previously realize that the default NVSHMEM in this version of the SDK has a compatibility issue with cuFFTMp, which is why the SDK also ships the “compat” version. This may well be the source of the issue: LD_LIBRARY_PATH should be set to the “compat” version whenever it exists in the SDK.
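
In your install that would be something like

export LD_LIBRARY_PATH=/usr/prog/hpc-sdk/23.5/Linux_x86_64/23.5/comm_libs/11.0/nvshmem_cufftmp_compat/lib:$LD_LIBRARY_PATH

(adjusting the CUDA subdirectory to match the toolkit you build against, if your SDK ships more than one).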

Hi Alan,

I will definitely try OpenMPI built CUDA-aware with hpc-sdk 23.5, using both cuFFTMp and nvshmem (nvshmem_cufftmp_compat, obviously) from the same SDK. Unfortunately, because of the outdated operating system – specifically the CUDA driver – I can’t use CUDA 12. The latest version of CUDA available to me at the moment is 11.7, though I will ask our HPC admin to install 11.8. Even then, it remains to be seen whether GROMACS will be able to use CUDA runtime 11.8 with CUDA driver 11.4. Both hpc-sdk and OpenMPI will need to be built with CUDA 11.7 or 11.8.

However, I am concerned that my build with OpenMPI compiled with CUDA 11.7 and hpc-sdk 22.7, using nvshmem and cuFFTMp from that same SDK, did not work (see the first error, “nvshmem setup heap failed”, above). There is no nvshmem_cufftmp_compat shipped with 22.7, which presumably means the default nvshmem should be compatible with cuFFTMp. Did you ever build GROMACS with these versions, or was the work in Massively Improved Multi-node NVIDIA GPU Scalability with GROMACS | NVIDIA Technical Blog done with later versions of CUDA and hpc-sdk?

I’ve never tried an hpc-sdk version as old as 22.7 - I think it is best to stick to a more recent one. I think your existing CUDA versions should be OK: I just tested on a different cluster, building with nvcc 11.7 combined with the 11.8 dirs in the SDK, and that worked fine.
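
Concretely, relative to the recipe above, that just means pointing the two SDK variables at the 11.8 subdirectories - a sketch, assuming your SDK ships matching 11.8 dirs under both math_libs and comm_libs:

HPCSDK_LIBDIR=$HPCSDK/Linux_x86_64/2023/math_libs/11.8
NVSHMEM_HOME=$HPCSDK/Linux_x86_64/2023/comm_libs/11.8/nvshmem_cufftmp_compat

while compiling GROMACS itself with nvcc 11.7.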

That’s great to know! I will try to do everything with 11.8 (subject to constraints imposed by our cluster) and fall back to nvcc 11.7 plus the 11.8 directories from the SDK. Thank you so much for your help!