All CUDA-capable devices are busy or unavailable

GROMACS version: 2022.2
GROMACS modification: Yes/No

OpenMPI: 4.1.5

I have an issue running simulations with CUDA-aware OpenMPI.
I have a cluster with 2 GPUs per node in EXCLUSIVE PROCESS mode.
I configured the hostfile slots to equal the number of GPUs on each node (2).
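For reference, the hostfile is essentially of this form (a sketch only; node names taken from the logs below, two slots per node to match the two GPUs):

g-node010 slots=2
g-node013 slots=2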

Specifically, I tried to submit to multiple hosts but got the following error:

/site/eclub/app/x86_64/tools/openmpi/4.1.3-cuda/bin/mpirun --hostfile hostfile -x GMX_ENABLE_DIRECT_GPU_COMM -x PATH -x LD_LIBRARY_PATH -np 4 gmx_mpi mdrun -ntomp 12 -nb gpu -bonded gpu -s md2.tpr -g test.log 2>&1

--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   g-node013
  Local device: mlx4_0
--------------------------------------------------------------------------
                       :-) GROMACS - gmx mdrun, 2022 (-:

Executable:   /site/eclub/app/x86_64/discovery/gromacs/2022_gpu/bin/gmx_mpi
Data prefix:  /site/eclub/app/x86_64/discovery/gromacs/2022_gpu
Working dir:  /site/eclub/work/users/appadmin/sample/soft/gromacs/local_gpu
Command line:
  gmx_mpi mdrun -ntomp 12 -nb gpu -bonded gpu -s md2.tpr -g test.log


Back Off! I just backed up test.log to ./#test.log.44#
Reading file md2.tpr, VERSION 5.1.3 (single precision)
Note: file tpx version 103, software tpx version 127
GMX_ENABLE_DIRECT_GPU_COMM environment variable detected, enabling direct GPU communication using GPU-aware MPI.
Changing nstlist from 10 to 100, rlist from 1 to 1.16


On host g-node010 2 GPUs selected for this run.
Mapping of GPU IDs to the 2 GPU tasks in the 2 ranks on this node:
  PP:0,PP:1
PP tasks will do (non-perturbed) short-ranged and most bonded interactions on the GPU
PP task will update and constrain coordinates on the CPU
GPU direct communication will be used between MPI ranks.

To simplify, I then submitted on one node only, still using MPI (I know it’s not optimal and I should use thread-MPI, which I have done and it worked perfectly, but this is for testing purposes).

/site/eclub/app/x86_64/tools/openmpi/4.1.3-cuda/bin/mpirun --hostfile hostfile -x GMX_ENABLE_DIRECT_GPU_COMM -x PATH -x LD_LIBRARY_PATH -np 2 gmx_mpi mdrun -ntomp 12 -nb gpu -bonded gpu -s md2.tpr -g test.log 2>&1

Same error.

Then I switched my GPUs to DEFAULT mode on the execution node and it worked.
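For reference, switching the compute mode is done with nvidia-smi along these lines (a sketch only; requires root, -i selects the GPU index and -c sets the compute mode):

$ sudo nvidia-smi -i 0 -c DEFAULT            # allow shared access on GPU 0
$ sudo nvidia-smi -i 1 -c DEFAULT            # and on GPU 1
$ sudo nvidia-smi -i 0 -c EXCLUSIVE_PROCESS  # restore the cluster policy afterwards
$ sudo nvidia-smi -i 1 -c EXCLUSIVE_PROCESS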

Using nvidia-smi to monitor what was going on on my node, I saw that two MPI processes were being handled by one GPU, which explains my previous error (in EXCLUSIVE_PROCESS mode, my GPU cannot handle more than one process at a time):

Thu Aug 25 17:02:35 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla K40m          On   | 00000000:02:00.0 Off |                    0 |
| N/A   23C    P0    61W / 235W |     72MiB / 11441MiB |      1%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K40m          On   | 00000000:84:00.0 Off |                    0 |
| N/A   22C    P0    62W / 235W |    140MiB / 11441MiB |      1%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     22568      C   gmx_mpi                            69MiB |
|    1   N/A  N/A     22568      C   gmx_mpi                            65MiB |
|    1   N/A  N/A     22569      C   gmx_mpi                            69MiB |
+-----------------------------------------------------------------------------+

So my question is: how can I prevent that behaviour while keeping my GPUs in EXCLUSIVE_PROCESS mode (which is mandatory on our cluster for policy reasons) when using OpenMPI?

Also, why did I not observe the same thing with the thread-MPI version of GROMACS?

Thanks for your help.

Can you use srun in place of mpirun? If so, try srun --overlap (which is the opposite of srun --exclusive). Not sure if mpirun has the same type of option. I suspect that it’s not your GPUs that are blocking sharing, but mpirun that is doing this.
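Something along these lines (a sketch only, assuming a recent Slurm with --overlap support; the mdrun options are the ones from your original single-node command):

$ srun --overlap -n 2 --ntasks-per-node=2 gmx_mpi mdrun -ntomp 12 -nb gpu -bonded gpu -s md2.tpr -g test.log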

Hi,

I’ll check that. But in the end, I need to have only one process per GPU because of my company’s policy for our cluster.

So what I want to understand is why there are two processes on the same GPU instead of two clearly separated MPI processes (one process per GPU, no overlapping).

Thanks for your help.

Got it. I didn’t understand before. srun will not change what you see.

I don’t know if/how you can avoid what you are seeing with GROMACS 2022.2, but you could use GROMACS 2019.6 (and maybe some versions in between too).

Running 2022.2 with GPUs, I see the rank with the lowest PID on every GPU, and also one unique rank on each of the other GPUs. This is similar to what you reported.

$ srun -n 4 -c 1 --exclusive --threads-per-core=1 gmx_mpi mdrun -s my.tpr -nb gpu -bonded gpu -update cpu -pme cpu -npme 0 -dlb no -tunepme -maxh -deffnm out -ntomp 1

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     31064      C   …22.2/mpi_cuda/bin/gmx_mpi        259MiB |
|    1   N/A  N/A     31064      C   …22.2/mpi_cuda/bin/gmx_mpi        257MiB |
|    1   N/A  N/A     31065      C   …22.2/mpi_cuda/bin/gmx_mpi        259MiB |
|    2   N/A  N/A     31064      C   …22.2/mpi_cuda/bin/gmx_mpi        257MiB |
|    2   N/A  N/A     31066      C   …22.2/mpi_cuda/bin/gmx_mpi        259MiB |
|    3   N/A  N/A     31064      C   …22.2/mpi_cuda/bin/gmx_mpi        257MiB |
|    3   N/A  N/A     31067      C   …22.2/mpi_cuda/bin/gmx_mpi        259MiB |
+-----------------------------------------------------------------------------+

The results are similar whether or not I define GMX_ENABLE_DIRECT_GPU_COMM.
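For reference, the variable just has to be present in the environment before the srun call, e.g. something like:

$ export GMX_ENABLE_DIRECT_GPU_COMM=1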

If I stop actually doing anything on the GPU, but keep the GPU framework, the rank with the lowest PID is still on every GPU, but the other ranks are not.

$ srun -n 4 -c 1 --exclusive --threads-per-core=1 gmx_mpi mdrun -s my.tpr -nb cpu -bonded cpu -update cpu -pme cpu -npme 0 -dlb no -tunepme -maxh -deffnm out -ntomp 1

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     32424      C   …22.2/mpi_cuda/bin/gmx_mpi        257MiB |
|    1   N/A  N/A     32424      C   …22.2/mpi_cuda/bin/gmx_mpi        257MiB |
|    2   N/A  N/A     32424      C   …22.2/mpi_cuda/bin/gmx_mpi        257MiB |
|    3   N/A  N/A     32424      C   …22.2/mpi_cuda/bin/gmx_mpi        257MiB |
+-----------------------------------------------------------------------------+

Re-enable GPU usage with -nb gpu and add -gputasks 0001, and the results are what one would expect given the above behavior.

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     34419      C   …22.2/mpi_cuda/bin/gmx_mpi        259MiB |
|    0   N/A  N/A     34420      C   …22.2/mpi_cuda/bin/gmx_mpi        259MiB |
|    0   N/A  N/A     34421      C   …22.2/mpi_cuda/bin/gmx_mpi        259MiB |
|    1   N/A  N/A     34419      C   …22.2/mpi_cuda/bin/gmx_mpi        257MiB |
|    1   N/A  N/A     34422      C   …22.2/mpi_cuda/bin/gmx_mpi        259MiB |
|    2   N/A  N/A     34419      C   …22.2/mpi_cuda/bin/gmx_mpi        257MiB |
|    3   N/A  N/A     34419      C   …22.2/mpi_cuda/bin/gmx_mpi        257MiB |
+-----------------------------------------------------------------------------+
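For comparison, here is roughly what an explicit one-rank-per-GPU mapping would look like with four ranks (a sketch only, reusing the options above; note that -gputasks only controls task placement, and the tables above suggest the lowest-PID rank still opens a context on every GPU, so this alone would not satisfy exclusive-process mode):

$ srun -n 4 -c 1 --exclusive --threads-per-core=1 gmx_mpi mdrun -s my.tpr -nb gpu -bonded gpu -pme cpu -npme 0 -gputasks 0123 -ntomp 1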

And here is the same with GROMACS 2019.6:

$ srun -n 4 -c 1 --exclusive --threads-per-core=1 gmx_mpi mdrun -s my.tpr -dlb no -tunepme -maxh -deffnm out -ntomp 1

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     36799      C   …6/mpi_gnu_gpu/bin/gmx_mpi        321MiB |
|    1   N/A  N/A     36800      C   …6/mpi_gnu_gpu/bin/gmx_mpi        321MiB |
|    2   N/A  N/A     36801      C   …6/mpi_gnu_gpu/bin/gmx_mpi        321MiB |
|    3   N/A  N/A     36802      C   …6/mpi_gnu_gpu/bin/gmx_mpi        321MiB |
+-----------------------------------------------------------------------------+

I guess GROMACS changed something about which processes go on which GPUs in the last few years. I am sure a developer could give details about the exact cause.

Hope it helps,
Chris.

Hi,

Many thanks for all your help and for confirming that I’m not mad.

Actually, thread-MPI should be enough (300 ns/day on human KRAS), but it’s quite frustrating to have such a cluster and not be able to use it fully.
Nevertheless, I saw that the performance gain was very limited on multi-node GPU runs.


Even with up to 1.5 million atoms in 2022.2, I find the non-MPI performance on 1 node with 64 cores + 4 GPUs is equivalent to the MPI performance on 4 nodes, after which adding more nodes does make it faster (meaning 2 nodes with MPI is slower than 1 node without MPI). I think it is due to the fact that PME can only go on one GPU. That’s pretty bad strong-scaling efficiency if you use the non-MPI full single node as the denominator. I think you are probably not missing out on anything.

One of those processes is the MPI launcher itself. You may be seeing issues with exclusive-process mode, possibly because the launcher itself initializes (or tries to initialize) a GPU context on both GPUs.

With thread-MPI you don’t see the same because it uses pthreads instead of processes to implement MPI, so you have a single process with multiple threads.
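For reference, the thread-MPI equivalent of the run above is a single process, something roughly like this (a sketch only; gmx here is the non-MPI build, and -ntmpi sets the number of thread-MPI ranks):

$ gmx mdrun -ntmpi 2 -ntomp 12 -nb gpu -bonded gpu -s md2.tpr -g test.log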

Cheers,
Szilárd

Chris, the scaling you can expect greatly depends on your hardware and the version of the code you are using. For reference, Fig. 12 of the linked paper illustrates roughly what you can expect from production GROMACS, though in 2021/2022 there have been slight improvements.

It is correct that with current versions PME decomposition is not efficient (note that preliminary support is available in 2022, but only in a hybrid mode where the FFTs run on the CPU, which does not always deliver), so a single PME GPU is still often best. This will soon change thanks to work we have done over the last year or two (and thanks to the cuFFTmp library).

On slides 25-27 you can get a peek at the performance you can expect soon: