cuFFTMp + MPI with GROMACS 2025.2 on NVIDIA DGX A100

GROMACS version: 2025.2
GROMACS modification: No
cmake version 3.31.7

I am attempting to install GROMACS with cuFFTMp and Open MPI on our group’s NVIDIA DGX (8 x A100-SXM4-40GB). I installed NVIDIA’s HPC SDK as recommended in the GROMACS documentation, and my cmake command (run in the build directory) is below:

~/bin/cmake-3.31.7-linux-x86_64/bin/cmake .. -DGMX_MPI=on -DGMX_GPU=CUDA \
-DREGRESSIONTEST_DOWNLOAD=ON -DGMX_BUILD_OWN_FFTW=ON \
-DCMAKE_CUDA_ARCHITECTURES=native \
-DCMAKE_BUILD_TYPE=Debug -DGMX_USE_CUFFTMP=ON \
-DcuFFTMp_ROOT=/opt/nvidia/hpc_sdk/Linux_x86_64/25.5/math_libs -DMPIEXEC=srun \
-DCMAKE_INSTALL_PREFIX=/raid/gromacs_builds/gromacs-2025.2 \
-DCMAKE_C_COMPILER=/opt/nvidia/hpc_sdk/Linux_x86_64/25.5/comm_libs/hpcx/bin/mpicc \
-DCMAKE_CXX_COMPILER=/opt/nvidia/hpc_sdk/Linux_x86_64/25.5/comm_libs/hpcx/bin/mpicxx \
-DMPI_C_COMPILER=/opt/nvidia/hpc_sdk/Linux_x86_64/25.5/comm_libs/hpcx/bin/mpicc \
-DMPI_CXX_COMPILER=/opt/nvidia/hpc_sdk/Linux_x86_64/25.5/comm_libs/hpcx/bin/mpicxx
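
To double-check that these options actually ended up in the build tree being tested, the CMake cache can be inspected. The snippet below is only a minimal sketch: show_cache_options is a hypothetical helper name of mine, and GMX_USE_HEFFTE is included because the FFT test output further down mentions the HeFFTe backend.

```shell
# Print selected GROMACS/CMake options from a CMake cache file.
# show_cache_options is a hypothetical helper, not part of GROMACS or CMake.
show_cache_options() {
    cache_file="$1"
    # Cache entries have the form NAME:TYPE=VALUE
    grep -E '^(GMX_MPI|GMX_GPU|GMX_USE_CUFFTMP|GMX_USE_HEFFTE|cuFFTMp_ROOT|CMAKE_INSTALL_PREFIX)(:[A-Z]+)?=' "$cache_file"
}

# Usage, from inside the build directory:
#   show_cache_options CMakeCache.txt
```

If GMX_USE_CUFFTMP does not show up as ON in the directory the tests are run from, the failing tests were probably built from a different configuration than the command above.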

cmake and make run with minimal errors, but when I run make check many of my tests time out, and I have not been able to find an appropriate way to fix this. Some help from the discussion forums and ChatGPT led me to believe that one of the initial issues was inappropriate linking.
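
One way to check the linking suspicion would be to see which libgromacs the test binaries resolve at run time. In the backtrace further down, the test binary lives in my build directory but libgromacs_mpi.so.10 is loaded from /usr/local/gromacs/lib, so an older install may be getting picked up. The following is a minimal sketch; gromacs_libs_from_ldd is a hypothetical helper name, and it only filters standard ldd output.

```shell
# Read `ldd` output on stdin and print the path each GROMACS library resolves to.
# gromacs_libs_from_ldd is a hypothetical helper name.
gromacs_libs_from_ldd() {
    # ldd lines look like: "libname.so => /resolved/path (0xaddress)"
    awk '/libgromacs/ && $2 == "=>" { print $3 }'
}

# Usage, from inside the build directory:
#   ldd bin/domdec-mpi-test | gromacs_libs_from_ldd
```

If the printed path is outside the build directory, checking LD_LIBRARY_PATH for the other install would be my next step.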

Some examples of the errors that get thrown are:

Start 23: PlumedAppliedForcesUnitTests
23/95 Test #23: PlumedAppliedForcesUnitTests ..............***Failed    0.01 sec
/home/ss171/9-bin/gromacs-2025.2/build_no_cufftmp/bin/plumed_applied_forces-test: symbol lookup error: /home/ss171/9-bin/gromacs-2025.2/build_no_cufftmp/bin/plumed_applied_forces-test: undefined symbol: _ZN3gmx20PlumedOptionProvider13setPlumedFileERKSt8optionalINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEE
Start 32: DomDecMpiTests
32/95 Test #32: DomDecMpiTests ............................***Failed   21.87 sec
[LOG_CAT_ML] You must specify a valid HCA device by setting:
-x HCOLL_MAIN_IB=<dev_name:port> or -x UCX_NET_DEVICES=<dev_name:port>.
If no device was specified for HCOLL (or the calling library), automatic device detection will be run.
In case of unfounded HCA device please contact your system administrator.
[tvpdgx:1985260] Error: ../../../../../ompi/mca/coll/hcoll/coll_hcoll_module.c:310 - mca_coll_hcoll_comm_query() Hcol library init failed
[LOG_CAT_ML] You must specify a valid HCA device by setting:
-x HCOLL_MAIN_IB=<dev_name:port> or -x UCX_NET_DEVICES=<dev_name:port>.
If no device was specified for HCOLL (or the calling library), automatic device detection will be run.
In case of unfounded HCA device please contact your system administrator.
[tvpdgx:1985259] Error: ../../../../../ompi/mca/coll/hcoll/coll_hcoll_module.c:310 - mca_coll_hcoll_comm_query() Hcol library init failed
[LOG_CAT_ML] You must specify a valid HCA device by setting:
-x HCOLL_MAIN_IB=<dev_name:port> or -x UCX_NET_DEVICES=<dev_name:port>.
If no device was specified for HCOLL (or the calling library), automatic device detection will be run.
In case of unfounded HCA device please contact your system administrator.
[tvpdgx:1985258] Error: ../../../../../ompi/mca/coll/hcoll/coll_hcoll_module.c:310 - mca_coll_hcoll_comm_query() Hcol library init failed
[LOG_CAT_ML] You must specify a valid HCA device by setting:
-x HCOLL_MAIN_IB=<dev_name:port> or -x UCX_NET_DEVICES=<dev_name:port>.
If no device was specified for HCOLL (or the calling library), automatic device detection will be run.
In case of unfounded HCA device please contact your system administrator.
[tvpdgx:1985257] Error: ../../../../../ompi/mca/coll/hcoll/coll_hcoll_module.c:310 - mca_coll_hcoll_comm_query() Hcol library init failed
[==========] Running 4 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 4 tests from HaloExchangeTest
[ RUN      ] HaloExchangeTest.Coordinates1dHaloWith1Pulse
free(): invalid pointer
[tvpdgx:1985260] *** Process received signal ***
[tvpdgx:1985260] Signal: Aborted (6)
[tvpdgx:1985260] Signal code:  (-6)
[tvpdgx:1985260] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f2159dcf520]
[tvpdgx:1985260] [ 1] /lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7f2159e239fc]
[tvpdgx:1985260] [ 2] /lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7f2159dcf476]
[tvpdgx:1985260] [ 3] /lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x7f2159db57f3]
[tvpdgx:1985260] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x89676)[0x7f2159e16676]
[tvpdgx:1985260] [ 5] /lib/x86_64-linux-gnu/libc.so.6(+0xa0cfc)[0x7f2159e2dcfc]
[tvpdgx:1985260] [ 6] /lib/x86_64-linux-gnu/libc.so.6(+0xa2a44)[0x7f2159e2fa44]
[tvpdgx:1985260] [ 7] /lib/x86_64-linux-gnu/libc.so.6(free+0x73)[0x7f2159e32453]
[tvpdgx:1985260] [ 8] /home/ss171/9-bin/gromacs-2025.2/build_no_cufftmp/bin/domdec-mpi-test(_ZN17gmx_domdec_comm_tD1Ev+0x299)[0x559e4aae86f9]
[tvpdgx:1985260] [ 9] /usr/local/gromacs/lib/libgromacs_mpi.so.10(_ZN12gmx_domdec_tD1Ev+0xf5)[0x7f215aa9f985]
[tvpdgx:1985260] [10] /home/ss171/9-bin/gromacs-2025.2/build_no_cufftmp/bin/domdec-mpi-test(+0xec3b)[0x559e4aae4c3b]
[tvpdgx:1985260] [11] /home/ss171/9-bin/gromacs-2025.2/build_no_cufftmp/bin/../lib/libgtest.so.1.13.0(_ZN7testing8internal35HandleExceptionsInMethodIfSupportedINS_4TestEvEET0_PT_MS4_FS3_vEPKc+0x51)[0x7f215a24c731]
[tvpdgx:1985260] [12] /home/ss171/9-bin/gromacs-2025.2/build_no_cufftmp/bin/../lib/libgtest.so.1.13.0(_ZN7testing4Test3RunEv+0xd6)[0x7f215a2387f6]
[tvpdgx:1985260] [13] /home/ss171/9-bin/gromacs-2025.2/build_no_cufftmp/bin/../lib/libgtest.so.1.13.0(_ZN7testing8TestInfo3RunEv+0x195)[0x7f215a2389b5]
[tvpdgx:1985260] [14] /home/ss171/9-bin/gromacs-2025.2/build_no_cufftmp/bin/../lib/libgtest.so.1.13.0(_ZN7testing9TestSuite3RunEv+0xf5)[0x7f215a238ea5]
[tvpdgx:1985260] [15] /home/ss171/9-bin/gromacs-2025.2/build_no_cufftmp/bin/../lib/libgtest.so.1.13.0(_ZN7testing8internal12UnitTestImpl11RunAllTestsEv+0x49f)[0x7f215a240b2f]
[tvpdgx:1985260] [16] /home/ss171/9-bin/gromacs-2025.2/build_no_cufftmp/bin/../lib/libgtest.so.1.13.0(_ZN7testing8UnitTest3RunEv+0x9a)[0x7f215a238a7a]
[tvpdgx:1985260] [17] /home/ss171/9-bin/gromacs-2025.2/build_no_cufftmp/bin/domdec-mpi-test(+0xbf04)[0x559e4aae1f04]
[tvpdgx:1985260] [18] /lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7f2159db6d90]
[tvpdgx:1985260] [19] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7f2159db6e40]
[tvpdgx:1985260] [20] /home/ss171/9-bin/gromacs-2025.2/build_no_cufftmp/bin/domdec-mpi-test(+0xc095)[0x559e4aae2095]
[tvpdgx:1985260] *** End of error message ***
free(): invalid pointer
[tvpdgx:1985259] *** Process received signal ***
[tvpdgx:1985259] Signal: Aborted (6)
[tvpdgx:1985259] Signal code:  (-6)
[tvpdgx:1985259] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7fd9993cf520]
[tvpdgx:1985259] [ 1] /lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7fd9994239fc]
[tvpdgx:1985259] [ 2] /lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7fd9993cf476]
[tvpdgx:1985259] [ 3] /lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x7fd9993b57f3]
[tvpdgx:1985259] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x89676)[0x7fd999416676]
[tvpdgx:1985259] [ 5] /lib/x86_64-linux-gnu/libc.so.6(+0xa0cfc)[0x7fd99942dcfc]
[tvpdgx:1985259] [ 6] /lib/x86_64-linux-gnu/libc.so.6(+0xa2a44)[0x7fd99942fa44]
[tvpdgx:1985259] [ 7] /lib/x86_64-linux-gnu/libc.so.6(free+0x73)[0x7fd999432453]
[tvpdgx:1985259] [ 8] /home/ss171/9-bin/gromacs-2025.2/build_no_cufftmp/bin/domdec-mpi-test(_ZN17gmx_domdec_comm_tD1Ev+0x299)[0x565062af56f9]
[tvpdgx:1985259] [ 9] /usr/local/gromacs/lib/libgromacs_mpi.so.10(_ZN12gmx_domdec_tD1Ev+0xf5)[0x7fd99a09f985]
[tvpdgx:1985259] [10] /home/ss171/9-bin/gromacs-2025.2/build_no_cufftmp/bin/domdec-mpi-test(+0xec3b)[0x565062af1c3b]
[tvpdgx:1985259] [11] /home/ss171/9-bin/gromacs-2025.2/build_no_cufftmp/bin/../lib/libgtest.so.1.13.0(_ZN7testing8internal35HandleExceptionsInMethodIfSupportedINS_4TestEvEET0_PT_MS4_FS3_vEPKc+0x51)[0x7fd99984c731]
[tvpdgx:1985259] [12] /home/ss171/9-bin/gromacs-2025.2/build_no_cufftmp/bin/../lib/libgtest.so.1.13.0(_ZN7testing4Test3RunEv+0xd6)[0x7fd9998387f6]
[tvpdgx:1985259] [13] /home/ss171/9-bin/gromacs-2025.2/build_no_cufftmp/bin/../lib/libgtest.so.1.13.0(_ZN7testing8TestInfo3RunEv+0x195)[0x7fd9998389b5]
[tvpdgx:1985259] [14] /home/ss171/9-bin/gromacs-2025.2/build_no_cufftmp/bin/../lib/libgtest.so.1.13.0(_ZN7testing9TestSuite3RunEv+0xf5)[0x7fd999838ea5]
[tvpdgx:1985259] [15] /home/ss171/9-bin/gromacs-2025.2/build_no_cufftmp/bin/../lib/libgtest.so.1.13.0(_ZN7testing8internal12UnitTestImpl11RunAllTestsEv+0x49f)[0x7fd999840b2f]
[tvpdgx:1985259] [16] /home/ss171/9-bin/gromacs-2025.2/build_no_cufftmp/bin/../lib/libgtest.so.1.13.0(_ZN7testing8UnitTest3RunEv+0x9a)[0x7fd999838a7a]
[tvpdgx:1985259] [17] /home/ss171/9-bin/gromacs-2025.2/build_no_cufftmp/bin/domdec-mpi-test(+0xbf04)[0x565062aeef04]
[tvpdgx:1985259] [18] /lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7fd9993b6d90]
[tvpdgx:1985259] [19] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7fd9993b6e40]
[tvpdgx:1985259] [20] /home/ss171/9-bin/gromacs-2025.2/build_no_cufftmp/bin/domdec-mpi-test(+0xc095)[0x565062aef095]
[tvpdgx:1985259] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpiexec noticed that process rank 3 with PID 0 on node tvpdgx exited on signal 6 (Aborted).
--------------------------------------------------------------------------
Start 35: FFTMpiUnitTests
35/95 Test #35: FFTMpiUnitTests ...........................***Failed   10.96 sec
[LOG_CAT_ML] You must specify a valid HCA device by setting:
-x HCOLL_MAIN_IB=<dev_name:port> or -x UCX_NET_DEVICES=<dev_name:port>.
If no device was specified for HCOLL (or the calling library), automatic device detection will be run.
In case of unfounded HCA device please contact your system administrator.
[tvpdgx:1990618] Error: ../../../../../ompi/mca/coll/hcoll/coll_hcoll_module.c:310 - mca_coll_hcoll_comm_query() Hcol library init failed
[LOG_CAT_ML] You must specify a valid HCA device by setting:
-x HCOLL_MAIN_IB=<dev_name:port> or -x UCX_NET_DEVICES=<dev_name:port>.
If no device was specified for HCOLL (or the calling library), automatic device detection will be run.
In case of unfounded HCA device please contact your system administrator.
[tvpdgx:1990615] Error: ../../../../../ompi/mca/coll/hcoll/coll_hcoll_module.c:310 - mca_coll_hcoll_comm_query() Hcol library init failed
[LOG_CAT_ML] You must specify a valid HCA device by setting:
-x HCOLL_MAIN_IB=<dev_name:port> or -x UCX_NET_DEVICES=<dev_name:port>.
If no device was specified for HCOLL (or the calling library), automatic device detection will be run.
In case of unfounded HCA device please contact your system administrator.
[tvpdgx:1990617] Error: ../../../../../ompi/mca/coll/hcoll/coll_hcoll_module.c:310 - mca_coll_hcoll_comm_query() Hcol library init failed
[LOG_CAT_ML] You must specify a valid HCA device by setting:
-x HCOLL_MAIN_IB=<dev_name:port> or -x UCX_NET_DEVICES=<dev_name:port>.
If no device was specified for HCOLL (or the calling library), automatic device detection will be run.
In case of unfounded HCA device please contact your system administrator.
[tvpdgx:1990616] Error: ../../../../../ompi/mca/coll/hcoll/coll_hcoll_module.c:310 - mca_coll_hcoll_comm_query() Hcol library init failed
[==========] Running 4 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 4 tests from GpuFft/GpuFftTest3D
[ RUN      ] GpuFft/GpuFftTest3D.GpuFftDecomposition/0

-------------------------------------------------------
Program:     fft-mpi-test, version 2025-rc
Source file: src/gromacs/fft/gpu_3dfft.cpp (line 145)
Function:    gmx::Gpu3dFft::Gpu3dFft(gmx::FftBackend, bool, MPI_Comm, gmx::ArrayRef<const int>, gmx::ArrayRef<const int>, int, bool, const DeviceContext&, const DeviceStream&, int*, int*, int*, float**, float**)::<lambda()>
MPI rank:    0 (out of 4)

Assertion failed:
Condition: backend == FftBackend::HeFFTe_CUDA
Unsupported FFT backend requested

For more information and tips for troubleshooting, please check the GROMACS
website at https://manual.gromacs.org/current/user-guide/run-time-errors.html
-------------------------------------------------------
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------

Any suggestions on troubleshooting or pointers on what I may have missed would be appreciated. I’ve also taken a look at the “How to Build and Run GROMACS” article on NVIDIA’s technical blog, but I ran into similar timeout issues and test failures.
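
For the repeated HCOLL messages in the logs above, one thing I could try is switching off the HCOLL collective component through Open MPI's MCA environment variables before rerunning the two MPI tests. This is a sketch under the assumption that HCOLL is not needed for single-node tests on this machine; I do not expect it to explain the free() crash or the FFT backend assertion.

```shell
# Assumption: single-node runs do not need HCOLL collectives, so disable that component.
export OMPI_MCA_coll_hcoll_enable=0

# Alternative (untested): keep HCOLL and name the device as the log message suggests.
# The <dev_name:port> value is elided in the log and has to come from the system itself.
#   export UCX_NET_DEVICES=<dev_name:port>

# Then rerun only the affected tests from the build directory:
#   ctest -R 'DomDecMpiTests|FFTMpiUnitTests' --output-on-failure
```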