GROMACS only detects 1 GPU on a 4 GPU node

GROMACS version: 2023.2 / 2024.2
GROMACS modification: No

Hello,

I’m setting up GROMACS on an HPC system and have encountered an issue where GROMACS only detects one GPU on a node that has four GPUs available. Below is an excerpt from a log file:

Hardware detected on host  (the node of MPI rank 0):
  CPU info:
    Vendor: AMD
    Brand:  AMD EPYC 7742 64-Core Processor
    Family: 23   Model: 49   Stepping: 0
    Features: aes amd apic avx avx2 clfsh cmov cx8 cx16 f16c fma htt lahf misalignsse mmx msr nonstop_tsc pclmuldq pdpe1gb popcnt pse rdrnd rdtscp sha sse2 sse3 sse4a sse4.1 sse4.2 ssse3 x2apic
  Hardware topology: Full, with devices
    Packages, cores, and logical processors:
    [indices refer to OS logical processors]
      Package  0: [   0 128] [   1 129] [   2 130] [   3 131] [   4 132] [   5 133] [   6 134] [   7 135] [   8 136] [   9 137] [  10 138] [  11 139] [  12 140] [  13 141] [  14 142] [  15 143] [  16 144] [  17 145] [  18 146] [  19 147] [  20 148] [  21 149] [  22 150] [  23 151] [  24 152] [  25 153] [  26 154] [  27 155] [  28 156] [  29 157] [  30 158] [  31 159] [  32 160] [  33 161] [  34 162] [  35 163] [  36 164] [  37 165] [  38 166] [  39 167] [  40 168] [  41 169] [  42 170] [  43 171] [  44 172] [  45 173] [  46 174] [  47 175] [  48 176] [  49 177] [  50 178] [  51 179] [  52 180] [  53 181] [  54 182] [  55 183] [  56 184] [  57 185] [  58 186] [  59 187] [  60 188] [  61 189] [  62 190] [  63 191]
      Package  1: [  64 192] [  65 193] [  66 194] [  67 195] [  68 196] [  69 197] [  70 198] [  71 199] [  72 200] [  73 201] [  74 202] [  75 203] [  76 204] [  77 205] [  78 206] [  79 207] [  80 208] [  81 209] [  82 210] [  83 211] [  84 212] [  85 213] [  86 214] [  87 215] [  88 216] [  89 217] [  90 218] [  91 219] [  92 220] [  93 221] [  94 222] [  95 223] [  96 224] [  97 225] [  98 226] [  99 227] [ 100 228] [ 101 229] [ 102 230] [ 103 231] [ 104 232] [ 105 233] [ 106 234] [ 107 235] [ 108 236] [ 109 237] [ 110 238] [ 111 239] [ 112 240] [ 113 241] [ 114 242] [ 115 243] [ 116 244] [ 117 245] [ 118 246] [ 119 247] [ 120 248] [ 121 249] [ 122 250] [ 123 251] [ 124 252] [ 125 253] [ 126 254] [ 127 255]
    CPU limit set by OS: -1   Recommended max number of threads: 256
    Numa nodes:
      Node  0 (66902454272 bytes mem):   0 128   1 129   2 130   3 131   4 132   5 133   6 134   7 135   8 136   9 137  10 138  11 139  12 140  13 141  14 142  15 143
      Node  1 (67636203520 bytes mem):  16 144  17 145  18 146  19 147  20 148  21 149  22 150  23 151  24 152  25 153  26 154  27 155  28 156  29 157  30 158  31 159
      Node  2 (67636203520 bytes mem):  32 160  33 161  34 162  35 163  36 164  37 165  38 166  39 167  40 168  41 169  42 170  43 171  44 172  45 173  46 174  47 175
      Node  3 (67623620608 bytes mem):  48 176  49 177  50 178  51 179  52 180  53 181  54 182  55 183  56 184  57 185  58 186  59 187  60 188  61 189  62 190  63 191
      Node  4 (67636203520 bytes mem):  64 192  65 193  66 194  67 195  68 196  69 197  70 198  71 199  72 200  73 201  74 202  75 203  76 204  77 205  78 206  79 207
      Node  5 (67590402048 bytes mem):  80 208  81 209  82 210  83 211  84 212  85 213  86 214  87 215  88 216  89 217  90 218  91 219  92 220  93 221  94 222  95 223
      Node  6 (67636203520 bytes mem):  96 224  97 225  98 226  99 227 100 228 101 229 102 230 103 231 104 232 105 233 106 234 107 235 108 236 109 237 110 238 111 239
      Node  7 (67623944192 bytes mem): 112 240 113 241 114 242 115 243 116 244 117 245 118 246 119 247 120 248 121 249 122 250 123 251 124 252 125 253 126 254 127 255
      Latency:
               0     1     2     3     4     5     6     7
         0  1.00  1.20  1.20  1.20  3.20  3.20  3.20  3.20
         1  1.20  1.00  1.20  1.20  3.20  3.20  3.20  3.20
         2  1.20  1.20  1.00  1.20  3.20  3.20  3.20  3.20
         3  1.20  1.20  1.20  1.00  3.20  3.20  3.20  3.20
         4  3.20  3.20  3.20  3.20  1.00  1.20  1.20  1.20
         5  3.20  3.20  3.20  3.20  1.20  1.00  1.20  1.20
         6  3.20  3.20  3.20  3.20  1.20  1.20  1.00  1.20
         7  3.20  3.20  3.20  3.20  1.20  1.20  1.20  1.00
    Caches:
      L1: 32768 bytes, linesize 64 bytes, assoc. 8, shared 2 ways
      L2: 524288 bytes, linesize 64 bytes, assoc. 8, shared 2 ways
      L3: 16777216 bytes, linesize 64 bytes, assoc. 16, shared 8 ways
    PCI devices:
      0000:62:00.0  Id: 1a03:2000  Class: 0x0300  Numa: 0
      0000:43:00.0  Id: 15b3:101b  Class: 0x0207  Numa: 1
      0000:44:00.0  Id: 10de:20b0  Class: 0x0302  Numa: 1
      0000:45:00.0  Id: 1000:00b2  Class: 0x0107  Numa: 1
      0000:03:00.0  Id: 10de:20b0  Class: 0x0302  Numa: 3
      0000:05:00.0  Id: 1000:00b2  Class: 0x0107  Numa: 3
      0000:e1:00.0  Id: 8086:1523  Class: 0x0200  Numa: 4
      0000:e1:00.1  Id: 8086:1523  Class: 0x0200  Numa: 4
      0000:c4:00.0  Id: 10de:20b0  Class: 0x0302  Numa: 5
      0000:c5:00.0  Id: 1000:00b2  Class: 0x0107  Numa: 5
      0000:c8:00.0  Id: 1022:7901  Class: 0x0106  Numa: 5
      0000:83:00.0  Id: 15b3:101b  Class: 0x0207  Numa: 7
      0000:84:00.0  Id: 10de:20b0  Class: 0x0302  Numa: 7
      0000:85:00.0  Id: 1000:00b2  Class: 0x0107  Numa: 7
  GPU info:
    Number of GPUs detected: 1
    #0: NVIDIA NVIDIA A100-SXM4-40GB, compute cap.: 8.0, ECC: yes, stat: compatible

As you can see, GROMACS detects only one GPU even though four GPUs are available (the PCI devices with Id 10de:20b0).
These GPUs also show up with nvidia-smi -L:

GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-d0e4a7a4-d046-9f66-460e-81d527826f93)

GPU 1: NVIDIA A100-SXM4-40GB (UUID: GPU-0ed2dcdd-44de-9d8c-e180-cbc33e4a21df)

GPU 2: NVIDIA A100-SXM4-40GB (UUID: GPU-cb37e533-f45e-f022-c991-052d1990e13f)

GPU 3: NVIDIA A100-SXM4-40GB (UUID: GPU-35f9efa7-d3ea-9325-4c54-34cb5586ea1b)

and echo $CUDA_VISIBLE_DEVICES shows:

0,1,2,3

The HPC system uses Slurm, and I have tried various combinations of resource requests, to no avail. The output above was obtained by requesting the node in --exclusive mode.

Similarly, using different combinations of compilers and MPI libraries did not make a difference. I tested:

GCC/12.3.0 + ParaStationMPI/5.9.2-1 + CUDA/12
GCC/12.3.0 + OpenMPI/4.1.5 + CUDA/12
GCC/12.3.0 + OpenMPI/4.1.5 + CUDA/12 + hwloc/2.9.1
Intel/2023.2.1 + IntelMPI/2021.10.0 + CUDA/12

If anyone has encountered a comparable issue, any help is highly appreciated.

Best Regards,
Florian

What was your command or setup to run the MD?

The original command was:

export GMX_ENABLE_DIRECT_GPU_COMM=1

srun -n 4 --cpus-per-task=32  gmx_mpi mdrun -pme gpu -bonded gpu -update gpu -nb gpu -ntomp $SLURM_CPUS_PER_TASK -noconfout -s topol.tpr -deffnm $SLURM_JOB_NAME -pin on  -nsteps 2
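For reference, a minimal batch-script sketch around this command might look like the following; the job name, GRES request syntax, and #SBATCH option set are assumptions that will differ per site:

```shell
#!/bin/bash
#SBATCH --job-name=gmx_run          # hypothetical job name
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4         # one MPI rank per GPU
#SBATCH --cpus-per-task=32
#SBATCH --gres=gpu:4                # request all four GPUs (GRES syntax is site-specific)
#SBATCH --exclusive

export GMX_ENABLE_DIRECT_GPU_COMM=1

srun -n 4 --cpus-per-task=32 gmx_mpi mdrun \
    -nb gpu -pme gpu -bonded gpu -update gpu \
    -ntomp $SLURM_CPUS_PER_TASK -pin on \
    -noconfout -s topol.tpr -deffnm $SLURM_JOB_NAME -nsteps 2
```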

Fortunately, we managed to resolve the issue: it turned out that Slurm on this system was configured such that each rank is assigned only one GPU.

Putting export CUDA_VISIBLE_DEVICES=0,1,2,3 before the srun statement overrode this behavior. This was not immediately obvious to us, because we had always queried CUDA_VISIBLE_DEVICES in the (main) batch script, where it shows all four GPUs, whereas srun -n 4 echo $CUDA_VISIBLE_DEVICES would only show one GPU per rank.
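As an aside, srun -n 4 echo $CUDA_VISIBLE_DEVICES can itself be misleading: the variable is expanded by the submitting shell before srun launches any task, so it does not necessarily show the per-rank environment. A small sketch of the quoting pitfall, using env to stand in for Slurm setting a per-task value (no cluster needed to see the effect):

```shell
export CUDA_VISIBLE_DEVICES=0,1,2,3

# Double quotes (or no quotes): $CUDA_VISIBLE_DEVICES is substituted by the
# PARENT shell before the child starts, so a per-task value set by the
# launcher (mimicked here with `env`) is never seen.
env CUDA_VISIBLE_DEVICES=0 bash -c "echo parent-expanded: $CUDA_VISIBLE_DEVICES"
# -> parent-expanded: 0,1,2,3

# Single quotes: expansion is deferred to the child, which reads its OWN
# environment. This is what you want when inspecting ranks under srun, e.g.
#   srun -n 4 bash -c 'echo "rank $SLURM_PROCID sees $CUDA_VISIBLE_DEVICES"'
env CUDA_VISIBLE_DEVICES=0 bash -c 'echo "child-expanded: $CUDA_VISIBLE_DEVICES"'
# -> child-expanded: 0
```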

Yes, that is why I asked: your script assigned only one GPU per rank.

Hi!

In this situation, GROMACS indeed reports only one GPU (because each process only sees a single GPU), but SLURM takes care of assigning each rank to its own GPU, so the actual hardware allocation is correct: each of the four ranks on the node gets its own GPU. The only problem with this setup is how the hardware is reported in the log. We are planning to fix the issue in GROMACS 2025.

Not only is setting CUDA_VISIBLE_DEVICES manually unnecessary in this situation, it can also interfere with how SLURM does the CPU-GPU mapping.
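For completeness, the per-rank GPU binding can usually be steered through the resource request itself rather than through CUDA_VISIBLE_DEVICES; exact behavior depends on the site's Slurm configuration and version, so treat this as a sketch:

```shell
# One GPU bound per rank: each task sees only its own device,
# and GROMACS will (correctly) report a single GPU per process.
srun -n 4 --gpus-per-task=1 gmx_mpi mdrun -nb gpu -s topol.tpr

# All four GPUs visible to every rank on the node; the ranks then
# need to be mapped to GPUs explicitly (e.g. via mdrun -gpu_id).
srun -n 4 --gpus-per-node=4 gmx_mpi mdrun -nb gpu -s topol.tpr
```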