Cannot figure out why gmx_mpi cannot detect nvidia GPU

ansel0250 · November 14, 2024, 9:59am

GROMACS version: 2024.4
GROMACS modification: No

I have been compiling GROMACS for a while, using CUDA 12.5, and the card is detected by nvidia-smi.

I have also checked the output of gmx --version to confirm that the CUDA driver and runtime are reported:
GROMACS version: 2024.4
Precision: mixed
Memory model: 64 bit
MPI library: MPI
MPI library version: Intel(R) MPI Library 2021.9 for Linux* OS
OpenMP support: enabled (GMX_OPENMP_MAX_THREADS = 128)
GPU support: CUDA
NBNxM GPU setup: super-cluster 2x2x2 / cluster 8
SIMD instructions: AVX2_256
CPU FFT library: fftw-3.3.8-sse2-avx-avx2-avx2_128
GPU FFT library: cuFFT
Multi-GPU FFT: none
RDTSCP usage: enabled
TNG support: enabled
Hwloc support: disabled
Tracing support: disabled
C compiler: /bin/cc GNU 11.4.1
C compiler flags: -fexcess-precision=fast -funroll-all-loops -mavx2 -mfma -Wno-missing-field-initializers -O3 -DNDEBUG
C++ compiler: /bin/c++ GNU 11.4.1
C++ compiler flags: -fexcess-precision=fast -funroll-all-loops -mavx2 -mfma -Wno-missing-field-initializers -Wno-cast-function-type-strict -fopenmp -O3 -DNDEBUG
BLAS library: External - detected on the system
LAPACK library: External - detected on the system
CUDA compiler: /usr/local/cuda-12.5/bin/nvcc nvcc: NVIDIA (R) Cuda compiler driver;Copyright (c) 2005-2024 NVIDIA Corporation;Built on Thu_Jun__6_02:18:23_PDT_2024;Cuda compilation tools, release 12.5, V12.5.82;Build cuda_12.5.r12.5/compiler.34385749_0
CUDA compiler flags:-std=c++17;--generate-code=arch=compute_89,code=sm_89;-use_fast_math;-Xptxas;-warn-double-usage;-Xptxas;-Werror;-D_FORCE_INLINES;-Xcompiler;-fopenmp;-fexcess-precision=fast -funroll-all-loops -mavx2 -mfma -Wno-missing-field-initializers -Wno-cast-function-type-strict -fopenmp -O3 -DNDEBUG
CUDA driver: 12.50
CUDA runtime: 12.50

Strangely, GROMACS cannot detect the GPU, as the log file shows:
Running on 1 node with total 24 cores, 32 processing units (GPU detection failed)
Hardware detected on host gpu5 (the node of MPI rank 0):
CPU info:
Vendor: Intel
Brand: 13th Gen Intel(R) Core™ i9-13900K
Family: 6 Model: 183 Stepping: 1
Features: aes apic avx avx2 clfsh cmov cx8 cx16 f16c fma htt intel lahf mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd rdtscp sha sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
Hardware topology: Basic
Packages, cores, and logical processors:
[indices refer to OS logical processors]
Package 0: [ 0 1] [ 2 3] [ 4 5] [ 6 7] [ 8 9] [ 10 11] [ 12 13] [ 14 15] [ 16] [ 17] [ 18] [ 19] [ 20] [ 21] [ 22] [ 23] [ 24] [ 25] [ 26] [ 27] [ 28] [ 29] [ 30] [ 31]
CPU limit set by OS: -1 Recommended max number of threads: 32

The build uses Intel oneAPI, with the with-gpu and with-mpi options. I guess I need some help here.
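The symptom in the log excerpt above is the string "GPU detection failed" on the "Running on ..." line. A minimal sketch of how one might check a log for it is shown below; the helper name `gpu_detection_failed` is hypothetical and not part of GROMACS, and md.log is assumed to be the log file name.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of GROMACS): succeed if the given mdrun log
# contains the "GPU detection failed" message quoted in the excerpt above.
gpu_detection_failed() {
    local logfile="$1"
    grep -q "GPU detection failed" "$logfile"
}

# Usage sketch; md.log is an assumed file name in the current directory.
if [ -r md.log ]; then
    if gpu_detection_failed md.log; then
        echo "md.log: GPU detection failed"
    else
        echo "md.log: no detection failure reported"
    fi
fi
```

This only confirms the symptom. It says nothing about the cause, which has to be narrowed down on the node itself.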

MagnusL · November 15, 2024, 6:27am

Is this via a queuing system/workload management system, such as slurm? Have you requested any GPUs (the slurm -G option)?
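One way to answer this question from inside a job is to print the GPU-related variables the job step sees. The sketch below assumes that Slurm sets `SLURM_JOB_GPUS` and `SLURM_GPUS` for jobs that requested GPUs; whether they appear depends on the Slurm version and site configuration, and the helper name `report_slurm_gpu_env` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print selected Slurm variables, or "<unset>" when a
# variable is not exported. The variable list is an assumption about what
# Slurm provides; adjust it to the local installation.
report_slurm_gpu_env() {
    local name value
    for name in SLURM_JOB_ID SLURM_JOB_GPUS SLURM_GPUS; do
        if value=$(printenv "$name"); then
            echo "$name=$value"
        else
            echo "$name=<unset>"
        fi
    done
}

report_slurm_gpu_env
```

If all three print `<unset>` inside a batch script, the job was probably not started through Slurm or did not request a GPU.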

ansel0250 · November 17, 2024, 2:55pm

Yes, we use SLURM, and we have also applied the --gres option there. The main problem is that the RTX 4090 cannot be found.

Other software, such as LAMMPS, can use the GPU to accelerate calculations.

For debugging purposes, we also ran it directly with mpirun, and gmx_mpi still cannot find the GPU. Running with -nb cpu works fine.

MagnusL · November 18, 2024, 7:40am

I’m afraid I don’t know what might be the problem then. @pszilard or @al42and, do you have any ideas?

al42and · November 18, 2024, 11:15am

Could you please share the md.log file from a problematic run (feel free to redact the usernames etc if you want)?

ansel0250 · November 18, 2024, 4:25pm

log.txt (14.7 KB)
How about the log here?

al42and · November 20, 2024, 12:03pm

Ok, thanks for sharing. I see nothing wrong in the log file.

As Magnus said, such behavior (device visible with nvidia-smi, driver detected by GROMACS, but still getting “GPU detection failed” message) usually means that CUDA_VISIBLE_DEVICES is not set correctly (either by the batch system or by the MPI library, although, as far as I recall, IntelMPI does not limit the visibility of NVIDIA devices).

Could you try adding the lines below to your batch script right before gmx_mpi mdrun is called, and share the output?

echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES-unset}"
env | grep '^I_MPI_OFFLOAD'
export I_MPI_OFFLOAD_DEVICES=all
export I_MPI_DEBUG=3
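The first line above distinguishes an unset variable from an empty one, which matters: an unset `CUDA_VISIBLE_DEVICES` leaves all devices visible, while an empty value hides them all. A rough sketch of a classifier for the value follows. The helper name is hypothetical, and it assumes that valid entries are integer indices or `GPU-`/`MIG-` UUID strings, so anything else is only flagged as suspicious, not proven wrong.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of CUDA or GROMACS): classify a
# CUDA_VISIBLE_DEVICES value passed as $1. Pass the literal "__unset__"
# when the variable is not set at all.
classify_cuda_visible_devices() {
    local value="$1"
    if [ "$value" = "__unset__" ]; then
        echo "unset: all devices visible to CUDA"
        return 0
    fi
    if [ -z "$value" ]; then
        echo "empty: no devices visible to CUDA"
        return 0
    fi
    local entry
    local IFS=','
    for entry in $value; do
        case "$entry" in
            GPU-*|MIG-*) ;;   # UUID-style entry, assumed acceptable
            ''|*[!0-9]*) echo "suspicious entry: $entry"; return 0 ;;
        esac
    done
    echo "set: $value"
}

classify_cuda_visible_devices "${CUDA_VISIBLE_DEVICES-__unset__}"
```

Running it at the same point in the batch script as the lines above shows what the CUDA runtime will be given, before mdrun starts.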

ansel0250 · November 29, 2024, 6:48am

ss.txt (17.1 KB)

Please find attached the log from the calculation (mdrun -nb gpu) after setting CUDA_VISIBLE_DEVICES to ALL, setting I_MPI_OFFLOAD_DEVICES to all, and turning on I_MPI_DEBUG level 3.

al42and · December 2, 2024, 5:30pm

Thanks. Quite strange.

Could you try compiling this simple device detection utility in a similar SLURM session on the GPU node and see what it prints?

$ wget https://raw.githubusercontent.com/al42and/cuda-smi/refs/heads/master/cuda-smi.cpp
$ /usr/local/cuda-12.5/bin/nvcc cuda-smi.cpp -lcudart_static -lnvidia-ml -o cuda-smi
$ ./cuda-smi
$ mpirun --perhost 1 ./cuda-smi

ansel0250 · December 5, 2024, 9:27am

OK, after looking at the cuda-smi output, we found that the CUDA query function was not working. We dug further into it using modinfo and re-installed the CUDA RPM.

It turned out that the chroot environment of the deployment node images cannot build the nvidia_uvm and related modules. We had discovered that problem (and its remedy) a month ago; it is undocumented and does not seem to be discussed on the NVIDIA forum either.

Now it is functional :)
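Since the root cause here was a missing nvidia_uvm kernel module, a quick check of the loaded modules on the GPU node may save time in similar cases. The sketch below assumes a Linux node with /proc/modules, and it assumes that /dev/nvidia-uvm is the device node that normally appears once the module is in use. The helper name `module_loaded` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if module $1 is listed in the modules file
# $2 (default /proc/modules). Returns 2 if the file cannot be read.
module_loaded() {
    local name="$1" modules_file="${2:-/proc/modules}"
    [ -r "$modules_file" ] || return 2
    # The module name is the first whitespace-separated field on each line.
    awk -v m="$name" '$1 == m { found = 1 } END { exit !found }' "$modules_file"
}

for mod in nvidia nvidia_uvm; do
    if module_loaded "$mod"; then
        echo "$mod: loaded"
    else
        echo "$mod: NOT loaded"
    fi
done

# Assumption: this device node exists when nvidia_uvm is loaded and in use.
if [ -e /dev/nvidia-uvm ]; then
    echo "/dev/nvidia-uvm present"
else
    echo "/dev/nvidia-uvm missing"
fi
```

A node where nvidia is loaded but nvidia_uvm is not would match the pattern in this thread: nvidia-smi lists the card, yet CUDA applications fail to detect it.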
