Mdrun detects GPUs but no GPU utilization

GROMACS version: 2025.0
GROMACS modification: No

Hello everyone,

I’ve been trying to run REMD simulations on a single-node GPU “cluster” that my lab recently purchased. I have a working gmx_mpi executable:

  gmx_mpi -version

GROMACS version:     2025.0
Precision:           mixed
Memory model:        64 bit
MPI library:         MPI
MPI library version: Open MPI v5.0.6, package: Open MPI sheppard@sn4622123733 Distribution, ident: 5.0.6, repo rev: v5.0.6, Nov 15, 2024
OpenMP support:      enabled (GMX_OPENMP_MAX_THREADS = 128)
GPU support:         CUDA
NBNxM GPU setup:     super-cluster 2x2x2 / cluster 8 (cluster-pair splitting on)
SIMD instructions:   AVX_512
CPU FFT library:     fftw-3.3.10-sse2-avx-avx2-avx2_128-avx512
GPU FFT library:     cuFFT
Multi-GPU FFT:       none
RDTSCP usage:        enabled
TNG support:         enabled
Hwloc support:       disabled
Tracing support:     disabled
C compiler:          /usr/bin/cc GNU 11.5.0
C compiler flags:    -fexcess-precision=fast -funroll-all-loops -march=skylake-avx512 -Wno-missing-field-initializers -O3 -DNDEBUG
C++ compiler:        /usr/bin/c++ GNU 11.5.0
C++ compiler flags:  -fexcess-precision=fast -funroll-all-loops -march=skylake-avx512 -Wno-missing-field-initializers -Wno-old-style-cast -Wno-cast-qual -Wno-suggest-override -Wno-suggest-destructor-override -Wno-zero-as-null-pointer-constant -Wno-cast-function-type-strict SHELL:-fopenmp -O3 -DNDEBUG
BLAS library:        Internal
LAPACK library:      Internal
CUDA compiler:       /home/sheppard/miniconda3/envs/cuda_env/bin/nvcc nvcc: NVIDIA (R) Cuda compiler driver;Copyright (c) 2005-2024 NVIDIA Corporation;Built on Thu_Mar_28_02:18:24_PDT_2024;Cuda compilation tools, release 12.4, V12.4.131;Build cuda_12.4.r12.4/compiler.34097967_0
CUDA compiler flags: -O3 -DNDEBUG
CUDA driver:         12.70
CUDA runtime:        12.40

However, my REMD simulations with 64 replicas are running very slowly: roughly 5 ns/day, where I would have expected at least 50 ns/day. I have tried to force the bonded/nonbonded and PME calculations onto the GPUs with:

mpirun -np $SLURM_NTASKS gmx_mpi mdrun -v -multidir <64 replicas> -replex 1500 -dlb yes -nb gpu -bonded gpu -pme gpu -pmefft gpu
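
For completeness, the job is submitted through Slurm roughly as follows (a minimal sketch; the partition name, CPU counts, walltime, and sourcing line are placeholders rather than my exact script):

  #!/bin/bash
  #SBATCH --job-name=remd
  #SBATCH --partition=gpu          # placeholder partition name
  #SBATCH --nodes=1
  #SBATCH --ntasks=64              # one MPI rank per replica
  #SBATCH --cpus-per-task=3
  #SBATCH --gres=gpu:4
  #SBATCH --time=48:00:00

  # however gmx_mpi ends up on PATH on this machine
  source /path/to/gromacs-2025.0/bin/GMXRC

  mpirun -np $SLURM_NTASKS gmx_mpi mdrun -v -multidir <64 replicas> \
      -replex 1500 -dlb yes -nb gpu -bonded gpu -pme gpu -pmefft gpu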

And I do see the following in the log file of one of the replicas:

Running on 1 node with total 104 cores, 208 processing units, 4 compatible GPUs
Hardware detected on host cs.ucsb.edu (the node of MPI rank 0):
  GPU info:
    Number of GPUs detected: 4
    #0: NVIDIA NVIDIA H200 MIG 1g.18gb, compute cap.: 9.0, ECC: yes, stat: compatible
    #1: NVIDIA NVIDIA H200 MIG 1g.18gb, compute cap.: 9.0, ECC: yes, stat: compatible
    #2: NVIDIA NVIDIA H200 MIG 1g.18gb, compute cap.: 9.0, ECC: yes, stat: compatible
    #3: NVIDIA NVIDIA H200 MIG 1g.18gb, compute cap.: 9.0, ECC: yes, stat: compatible

Grepping that log for PME gives:

$ grep -i PME md.log
   coulombtype                    = PME
   pme-order                      = 4
   lj-pme-comb-rule               = Geometric
  PP:0,PME:0,PP:0,PME:0,PP:0,PME:0,PP:0,PME:0,PP:0,PME:0,PP:0,PME:0,PP:0,PME:0,PP:0,PME:0,PP:0,PME:0,PP:0,PME:0,PP:0,PME:0,PP:0,PME:0,PP:0,PME:0,PP:0,PME:0,PP:0,PME:0,PP:0,PME:0,PP:1,PME:1,PP:1,PME:1,PP:1,PME:1,PP:1,PME:1,PP:1,PME:1,PP:1,PME:1,PP:1,PME:1,PP:1,PME:1,PP:1,PME:1,PP:1,PME:1,PP:1,PME:1,PP:1,PME:1,PP:1,PME:1,PP:1,PME:1,PP:1,PME:1,PP:1,PME:1,PP:2,PME:2,PP:2,PME:2,PP:2,PME:2,PP:2,PME:2,PP:2,PME:2,PP:2,PME:2,PP:2,PME:2,PP:2,PME:2,PP:2,PME:2,PP:2,PME:2,PP:2,PME:2,PP:2,PME:2,PP:2,PME:2,PP:2,PME:2,PP:2,PME:2,PP:2,PME:2,PP:3,PME:3,PP:3,PME:3,PP:3,PME:3,PP:3,PME:3,PP:3,PME:3,PP:3,PME:3,PP:3,PME:3,PP:3,PME:3,PP:3,PME:3,PP:3,PME:3,PP:3,PME:3,PP:3,PME:3,PP:3,PME:3,PP:3,PME:3,PP:3,PME:3,PP:3,PME:3
PME tasks will do all aspects on the GPU
Will do PME sum in reciprocal space for electrostatic interactions.
step  600: timed with pme grid 72 72 72, coulomb cutoff 1.200: 3062.6 M-cycles
step  800: timed with pme grid 60 60 60, coulomb cutoff 1.350: 3572.5 M-cycles
step 1000: timed with pme grid 64 64 64, coulomb cutoff 1.266: 3665.3 M-cycles
step 1200: timed with pme grid 72 72 72, coulomb cutoff 1.200: 3208.5 M-cycles
              optimal pme grid 72 72 72, coulomb cutoff 1.200
  PP:0,PME:0,PP:0,PME:0,PP:0,PME:0,PP:0,PME:0,PP:0,PME:0,PP:0,PME:0,PP:0,PME:0,PP:0,PME:0,PP:0,PME:0,PP:0,PME:0,PP:0,PME:0,PP:0,PME:0,PP:0,PME:0,PP:0,PME:0,PP:0,PME:0,PP:0,PME:0,PP:1,PME:1,PP:1,PME:1,PP:1,PME:1,PP:1,PME:1,PP:1,PME:1,PP:1,PME:1,PP:1,PME:1,PP:1,PME:1,PP:1,PME:1,PP:1,PME:1,PP:1,PME:1,PP:1,PME:1,PP:1,PME:1,PP:1,PME:1,PP:1,PME:1,PP:1,PME:1,PP:2,PME:2,PP:2,PME:2,PP:2,PME:2,PP:2,PME:2,PP:2,PME:2,PP:2,PME:2,PP:2,PME:2,PP:2,PME:2,PP:2,PME:2,PP:2,PME:2,PP:2,PME:2,PP:2,PME:2,PP:2,PME:2,PP:2,PME:2,PP:2,PME:2,PP:2,PME:2,PP:3,PME:3,PP:3,PME:3,PP:3,PME:3,PP:3,PME:3,PP:3,PME:3,PP:3,PME:3,PP:3,PME:3,PP:3,PME:3,PP:3,PME:3,PP:3,PME:3,PP:3,PME:3,PP:3,PME:3,PP:3,PME:3,PP:3,PME:3,PP:3,PME:3,PP:3,PME:3
PME tasks will do all aspects on the GPU
Will do PME sum in reciprocal space for electrostatic interactions.
step 4889900: timed with pme grid 72 72 72, coulomb cutoff 1.200: 3804.6 M-cycles
step 4890100: timed with pme grid 60 60 60, coulomb cutoff 1.350: 4386.0 M-cycles
step 4890300: timed with pme grid 64 64 64, coulomb cutoff 1.266: 4930.7 M-cycles
step 4890500: timed with pme grid 72 72 72, coulomb cutoff 1.200: 4243.6 M-cycles
              optimal pme grid 72 72 72, coulomb cutoff 1.200
  PP:0,PME:0,PP:0,PME:0,PP:0,PME:0,PP:0,PME:0,PP:0,PME:0,PP:0,PME:0,PP:0,PME:0,PP:0,PME:0,PP:0,PME:0,PP:0,PME:0,PP:0,PME:0,PP:0,PME:0,PP:0,PME:0,PP:0,PME:0,PP:0,PME:0,PP:0,PME:0,PP:1,PME:1,PP:1,PME:1,PP:1,PME:1,PP:1,PME:1,PP:1,PME:1,PP:1,PME:1,PP:1,PME:1,PP:1,PME:1,PP:1,PME:1,PP:1,PME:1,PP:1,PME:1,PP:1,PME:1,PP:1,PME:1,PP:1,PME:1,PP:1,PME:1,PP:1,PME:1,PP:2,PME:2,PP:2,PME:2,PP:2,PME:2,PP:2,PME:2,PP:2,PME:2,PP:2,PME:2,PP:2,PME:2,PP:2,PME:2,PP:2,PME:2,PP:2,PME:2,PP:2,PME:2,PP:2,PME:2,PP:2,PME:2,PP:2,PME:2,PP:2,PME:2,PP:2,PME:2,PP:3,PME:3,PP:3,PME:3,PP:3,PME:3,PP:3,PME:3,PP:3,PME:3,PP:3,PME:3,PP:3,PME:3,PP:3,PME:3,PP:3,PME:3,PP:3,PME:3,PP:3,PME:3,PP:3,PME:3,PP:3,PME:3,PP:3,PME:3,PP:3,PME:3,PP:3,PME:3
PME tasks will do all aspects on the GPU
Will do PME sum in reciprocal space for electrostatic interactions.
step 9975100: timed with pme grid 72 72 72, coulomb cutoff 1.200: 3882.0 M-cycles
step 9975300: timed with pme grid 60 60 60, coulomb cutoff 1.350: 4269.5 M-cycles
step 9975500: timed with pme grid 52 52 52, coulomb cutoff 1.558: 7345.5 M-cycles
step 9975700: timed with pme grid 56 56 56, coulomb cutoff 1.447: 3737.7 M-cycles
step 9975900: timed with pme grid 64 64 64, coulomb cutoff 1.266: 3896.2 M-cycles
step 9976100: timed with pme grid 72 72 72, coulomb cutoff 1.200: 3512.4 M-cycles
step 9976300: timed with pme grid 56 56 56, coulomb cutoff 1.447: 3871.3 M-cycles
step 9976500: timed with pme grid 64 64 64, coulomb cutoff 1.266: 3566.4 M-cycles
step 9976700: timed with pme grid 72 72 72, coulomb cutoff 1.200: 3386.8 M-cycles
step 9976900: timed with pme grid 56 56 56, coulomb cutoff 1.447: 3864.3 M-cycles
step 9977100: timed with pme grid 64 64 64, coulomb cutoff 1.266: 3583.7 M-cycles
step 9977300: timed with pme grid 72 72 72, coulomb cutoff 1.200: 3538.1 M-cycles
              optimal pme grid 72 72 72, coulomb cutoff 1.200
 PME GPU mesh              1    3   15025501   95917.664     575521.615  36.4
 Breakdown of PME mesh activities
 Wait PME GPU gather       1    3   15025501   78961.413     473781.347  30.0
 Reduce GPU PME F          1    3   15025501    2149.764      12898.935   0.8
 Launch PME GPU ops.       1    3  120204021     951.779       5710.831   0.4

So GROMACS appears to “see” the GPUs and thinks it is running on them. However, if I run nvidia-smi on the node running my job, it reports “No running processes found”. Additionally, top shows roughly 100% CPU usage for each of my replicas. And finally, dcgmi shows:

$ dcgmi dmon -e 100,203,252
#Entity   SMCLK   GPUTL   FBUSD
ID
GPU 3     1980    0       63494
GPU 2     1980    N/A     48950
GPU 1     1980    N/A     32088
GPU 0     1980    N/A     9329
GPU 3     1980    0       63494
GPU 2     1980    N/A     48950
GPU 1     1980    N/A     32088
GPU 0     1980    N/A     9329

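For reference, the monitoring commands behind the observations above were roughly the following (as I understand the DCGM field IDs, 100 is the SM clock, 203 is GPU utilization, and 252 is framebuffer memory used):

  # reports “No running processes found” while the replicas are running
  nvidia-smi

  # an alternative per-process view (output not included above)
  nvidia-smi --query-compute-apps=pid,process_name,used_gpu_memory --format=csv

  # DCGM sampling of SM clock, GPU utilization, and FB used
  dcgmi dmon -e 100,203,252
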
This all makes me think the GPUs are not actually being used, which would explain the slow run times. Has anyone come across an issue like this?

Please let me know if I can provide any follow-up details, or if you believe I have misinterpreted the logs above.

Thank you all so much!
Jackson