Cannot figure out why gmx_mpi cannot detect NVIDIA GPU

GROMACS version: 2024.4
GROMACS modification: No

I have been compiling GROMACS for a while, using CUDA 12.5, and the card is detected by nvidia-smi.

I have also checked the output of gmx --version to see whether the CUDA driver and runtime are reported:
GROMACS version: 2024.4
Precision: mixed
Memory model: 64 bit
MPI library: MPI
MPI library version: Intel(R) MPI Library 2021.9 for Linux* OS
OpenMP support: enabled (GMX_OPENMP_MAX_THREADS = 128)
GPU support: CUDA
NBNxM GPU setup: super-cluster 2x2x2 / cluster 8
SIMD instructions: AVX2_256
CPU FFT library: fftw-3.3.8-sse2-avx-avx2-avx2_128
GPU FFT library: cuFFT
Multi-GPU FFT: none
RDTSCP usage: enabled
TNG support: enabled
Hwloc support: disabled
Tracing support: disabled
C compiler: /bin/cc GNU 11.4.1
C compiler flags: -fexcess-precision=fast -funroll-all-loops -mavx2 -mfma -Wno-missing-field-initializers -O3 -DNDEBUG
C++ compiler: /bin/c++ GNU 11.4.1
C++ compiler flags: -fexcess-precision=fast -funroll-all-loops -mavx2 -mfma -Wno-missing-field-initializers -Wno-cast-function-type-strict -fopenmp -O3 -DNDEBUG
BLAS library: External - detected on the system
LAPACK library: External - detected on the system
CUDA compiler: /usr/local/cuda-12.5/bin/nvcc nvcc: NVIDIA (R) Cuda compiler driver;Copyright (c) 2005-2024 NVIDIA Corporation;Built on Thu_Jun__6_02:18:23_PDT_2024;Cuda compilation tools, release 12.5, V12.5.82;Build cuda_12.5.r12.5/compiler.34385749_0
CUDA compiler flags:-std=c++17;--generate-code=arch=compute_89,code=sm_89;-use_fast_math;-Xptxas;-warn-double-usage;-Xptxas;-Werror;-D_FORCE_INLINES;-Xcompiler;-fopenmp;-fexcess-precision=fast -funroll-all-loops -mavx2 -mfma -Wno-missing-field-initializers -Wno-cast-function-type-strict -fopenmp -O3 -DNDEBUG
CUDA driver: 12.50
CUDA runtime: 12.50

Strangely, GROMACS cannot detect the GPU, as the log file shows:
Running on 1 node with total 24 cores, 32 processing units (GPU detection failed)
Hardware detected on host gpu5 (the node of MPI rank 0):
CPU info:
Vendor: Intel
Brand: 13th Gen Intel(R) Core™ i9-13900K
Family: 6 Model: 183 Stepping: 1
Features: aes apic avx avx2 clfsh cmov cx8 cx16 f16c fma htt intel lahf mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd rdtscp sha sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
Hardware topology: Basic
Packages, cores, and logical processors:
[indices refer to OS logical processors]
Package 0: [ 0 1] [ 2 3] [ 4 5] [ 6 7] [ 8 9] [ 10 11] [ 12 13] [ 14 15] [ 16] [ 17] [ 18] [ 19] [ 20] [ 21] [ 22] [ 23] [ 24] [ 25] [ 26] [ 27] [ 28] [ 29] [ 30] [ 31]
CPU limit set by OS: -1 Recommended max number of threads: 32

The build was compiled with Intel oneAPI, with the with-gpu and with-mpi options. I would appreciate some help here.

Is this via a queuing system/workload management system, such as slurm? Have you requested any GPUs (the slurm -G option)?
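
One quick way to check this is to look at what the job itself is allowed to see. Below is a minimal sketch, assuming a POSIX shell and that it is run inside the same job step that launches gmx_mpi. The function name gpu_visibility is made up for this example. It only inspects CUDA_VISIBLE_DEVICES, which Slurm typically sets for jobs that request GPUs, and then lists the devices with nvidia-smi -L if that tool is on the PATH.

```shell
#!/bin/sh
# Hypothetical helper: report how CUDA_VISIBLE_DEVICES is set in this shell.
# An empty (but set) value hides every GPU from CUDA applications.
gpu_visibility() {
    if [ -z "${CUDA_VISIBLE_DEVICES+x}" ]; then
        echo "unset"                              # no restriction from the environment
    elif [ -z "$CUDA_VISIBLE_DEVICES" ]; then
        echo "empty"                              # set but empty: no devices visible
    else
        echo "restricted:$CUDA_VISIBLE_DEVICES"   # only the listed devices are visible
    fi
}

echo "CUDA_VISIBLE_DEVICES is $(gpu_visibility)"

# List the GPUs the driver exposes here, if nvidia-smi is available.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi -L || echo "nvidia-smi is present but failed"
else
    echo "nvidia-smi not found on PATH"
fi
```

If this prints "empty", or nvidia-smi lists nothing inside the job while it works in a login shell, the allocation is the more likely culprit than the GROMACS build.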

Yes, we use SLURM and have also passed the --gres option to it. The main problem is that the RTX 4090 still cannot be found.

However, other software such as LAMMPS can use the GPU to accelerate its calculations.

For debugging purposes, I have also run it with mpirun, and gmx_mpi still cannot find the GPU. Running with -nb cpu works fine.