Does anyone have a good set of mpirun/gmx options for large MPI/GPU jobs?

GROMACS version: 2022.2
GROMACS modification: No

Does anyone know of a site with scaling info and specific mpirun and mdrun options for simulations on >10 nodes with GPUs? I did find this: "Running GROMACS on GPU instances: multi-node price-performance" (AWS HPC Blog) and even watched the associated video, but I did not find any mdrun options there, just (presumably) optimized outcomes.

I realize that scaling performance is system-specific, but for me the same things generally work over and over again for many systems of different size and composition on a single node with GPUs, or on many nodes without GPUs. Trying to get good performance for systems with 2-20 million atoms on 2-50 nodes, each with multiple P100 GPUs, is certainly more difficult. I can’t yet figure out whether that’s because I haven’t found the magic yet, or whether it’s just a lot more system-specific. It would be wonderful if somebody had a recipe that yields good performance at high node counts, even for only a single system.

I think there must be well over a thousand ways to run an MPI GROMACS simulation on thousands of cores plus tens or hundreds of GPUs.

Thank you,
Chris.

Dear Chris,

Yes, there are indeed many different ways you can run a GROMACS simulation across a group of GPU-equipped nodes. On a single node with 2 or 4 GPUs you often get good performance by running 4 MPI ranks, using three for the short-range (PP) and one for the PME interactions, and offloading both PP and PME work to the GPUs. This way, PME can also run on a GPU.
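
As a minimal sketch (the OpenMP thread count and the .tpr file name are placeholders you will need to adapt to your node), such a single-node run with the thread-MPI build could look like:

gmx mdrun -ntmpi 4 -ntomp 8 -npme 1 -nb gpu -pme gpu -pin on -s topol.tpr

Here -npme 1 dedicates one of the four ranks to PME. With an MPI build you would instead start the four ranks via mpirun -np 4 gmx_mpi mdrun ... and drop -ntmpi.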

As the PME implementation in GROMACS 2022 cannot use more than a single GPU, this approach does not work if you have many GPU-equipped nodes. In that case, I usually use a homogeneous setup with several MPI ranks per node (ideally as many MPI ranks per node as there are GPUs on that node, or a multiple of that), while offloading only the short-range nonbonded interactions to the GPUs. Here, PME runs on the CPU and can thus be computed in parallel across all available MPI ranks.
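
A minimal sketch of such a homogeneous run on, say, 16 nodes with 4 GPUs and 32 physical cores each (all counts are placeholders, and I assume your scheduler or hostfile places 4 ranks on each node):

mpirun -np 64 gmx_mpi mdrun -ntomp 8 -npme 0 -nb gpu -pme cpu -pin on -s topol.tpr

Here -npme 0 disables separate PME ranks, so PME is computed on the CPU by all 64 ranks, while -nb gpu offloads the short-range nonbonded work to the GPUs.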

I would always try to use all available cores on each node, but play around with their distribution into MPI ranks and OpenMP threads. Usually, 2-8 OpenMP threads per MPI rank yield the best performance, but when approaching the scaling limit, using more OpenMP threads (and fewer MPI ranks) makes sense. On a large number of nodes, you will probably get the best performance when using only the physical cores, not all available hardware threads (which are twice the number of physical cores).
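
For example, on 16 nodes with 40 physical cores and 4 GPUs each (hypothetical numbers), you could benchmark two decompositions of the same hardware and keep whichever is faster. The --map-by syntax below is Open MPI's; with other MPI libraries or with SLURM the rank placement options differ:

# 4 MPI ranks x 10 OpenMP threads per node
mpirun -np 64 --map-by ppr:4:node gmx_mpi mdrun -ntomp 10 -npme 0 -nb gpu -pin on -s topol.tpr
# 8 MPI ranks x 5 OpenMP threads per node
mpirun -np 128 --map-by ppr:8:node gmx_mpi mdrun -ntomp 5 -npme 0 -nb gpu -pin on -s topol.tpr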

We have recently done scaling experiments on up to 32 GPU nodes that you can find in the following paper: https://pubs.acs.org/doi/10.1021/acs.jcim.2c00044 (Tables 8-9). However, as these nodes only had a standard interconnect, the parallel efficiency is not very high at large node counts. The optimal configuration in terms of MPI ranks times OpenMP threads is listed there, though. Those runs were done with GROMACS 2020.2 under SLURM using

srun -N "number of nodes" --tasks-per-node="MPI ranks per node" --cpus-per-task="OpenMP threads per MPI rank" gmx_mpi mdrun -ntomp "OpenMP threads per MPI rank" -npme 0 -pin on -s in.tpr
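
Filled in for a hypothetical run on 8 nodes with 4 MPI ranks per node and 6 OpenMP threads per rank, that would be:

srun -N 8 --tasks-per-node=4 --cpus-per-task=6 gmx_mpi mdrun -ntomp 6 -npme 0 -pin on -s in.tpr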

Below is also an older publication that lists performances and run parameters for scaling across up to 256 nodes, each equipped with 2 GPUs:
https://onlinelibrary.wiley.com/doi/10.1002/jcc.24030 (see Table 12).

That should be a good start!

Best,
Carsten