Optimizing Production Run with GPUs and CHARMM36

GROMACS version: 2024.4
GROMACS modification: Yes/No

I am simulating a system of approximately 330,000 atoms using CHARMM36 and aiming to significantly improve the performance of my production run. Currently I am using 3 GPUs (NVIDIA A100) and 64 CPU cores on a single node (the node can support up to 6 GPUs). I have also increased nstlist from 10 to 40 to make the pair search more efficient. I am currently getting 134 ns/day, but I see that the PME wait and GPU state copy timers account for 30-35% of the runtime. I am looking for further optimizations to increase the throughput of mdrun.
With -tunepme I managed to get 138 ns/day. Any suggestions?

Command line (from the log file):

gmx mdrun -ntmpi 3 -ntomp 10 -npme 3 -nb gpu -bonded gpu -pme gpu -update gpu
-g gpu3_test_multigpu_pme_optimized -nsteps -1 -maxh 0.017 -resethway -notunepme
-s benchmark.tpr -deffnm gpu3_test_multigpu_pme_optimized

A couple of general considerations:

  1. Run the simulation for a bit longer; I would use a couple of minutes of wall time, even without PME tuning (a very minor point, but -maxh 0.017 is only about one minute).
  2. How did you increase nstlist? GROMACS fine-tunes it at the start of the run anyway; if it is set in the .mdp file it will be overridden, and on GPUs it will probably end up around 100 (see the sketch after this list).
  3. Have you tried with just one or two GPUs?
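On the nstlist point: with Verlet lists the .mdp value is only a minimum, and mdrun raises it automatically for GPU runs. If you want to control it yourself, the -nstlist flag of mdrun overrides both the .mdp setting and the auto-tuning. A minimal sketch (the value 100 is just an illustration; the optimum is system-dependent):

gmx mdrun -nstlist 100 -nb gpu -pme gpu -update gpu -s benchmark.tpr

You can check which value was actually used near the top of the md.log file.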
  1. From the GROMACS wiki: it recommends increasing nstlist from 10 to 20-40 for GPU runs.

  2. Yes, but with 2 GPUs I got around 80-90 ns/day, if I remember correctly.

With your setup, offloading everything to the GPUs should be optimal. As a rule of thumb, you want to aim for a 3:1 ratio of non-bonded (PP) to PME ranks. You mentioned that your nodes can support up to 6 GPUs, so I would start by requesting 4 GPUs, with one dedicated to PME (i.e. -npme 1) and 3 ranks for everything else; see the sketch below. Note that you are limited to one GPU for PME unless you have compiled GROMACS with cuFFTMp. You could try requesting more GPUs for the (short-range) non-bonded computations, but this is potentially not very efficient because they may have to wait for PME to catch up.
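For example, a 4-GPU run along those lines could look something like this (a sketch, not a tested command; with your 64 cores, 16 OpenMP threads per rank is an assumption, and the -deffnm name is hypothetical):

gmx mdrun -ntmpi 4 -ntomp 16 -npme 1 -nb gpu -bonded gpu -pme gpu -update gpu
-s benchmark.tpr -deffnm gpu4_test_npme1

Here 3 thread-MPI ranks handle the short-range work and the fourth is the dedicated PME rank.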

If you really want to get into the nitty-gritty of performance optimization, you might want to take a look at some of the NVIDIA developer blog posts on GROMACS GPU performance.
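One thing those posts discuss that is directly relevant to your PME wait / GPU state copy numbers: recent GROMACS versions with thread-MPI builds support direct GPU-to-GPU communication between ranks, which avoids staging data through the CPU. It is enabled with an environment variable; a sketch, assuming a CUDA thread-MPI build like yours:

export GMX_ENABLE_DIRECT_GPU_COMM=1
gmx mdrun -ntmpi 4 -ntomp 16 -npme 1 -nb gpu -bonded gpu -pme gpu -update gpu -s benchmark.tpr

Whether this helps depends on the build and interconnect, so compare the timing breakdown in the log with and without it.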

However, 138 ns/day already seems reasonably good for this hardware and system size.