I am simulating a system of approximately 330,000 atoms with the CHARMM36 force field and am trying to significantly improve the performance of my production run. Currently I am using 3 GPUs (NVIDIA A100) and 64 CPU cores on a single node (the node can support up to 6 GPUs). I have also increased nstlist from 10 to 40 to make the pair search more efficient. I am currently getting 134 ns/day, but the "PME wait" and "GPU state copy" timers account for 30-35 percent of the run time. I am looking for further optimizations to increase the throughput of mdrun.
With -tunepme I managed to get 138 ns/day. Any suggestions?
Run the simulation a bit longer when benchmarking; I would stick to at least a couple of minutes of wall time, even without PME tuning (a very minor point).
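For reference, a minimal way to get a quick but representative benchmark is to run a limited number of steps and reset the timing counters halfway through, so the reported ns/day excludes start-up and initial load-balancing overhead. This is only a sketch; the file name and step count are placeholders:

```
# Hypothetical short benchmark: ~50k steps, performance counters reset at the
# halfway point (-resethway), no final coordinate output (-noconfout).
gmx mdrun -deffnm prod_bench -nsteps 50000 -resethway -noconfout
```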
How did you increase nstlist? GROMACS will fine-tune it at the beginning of the simulation anyway; if it is set in the mdp file, it will be overridden. It will probably end up using ~100.
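If you do want to control it yourself, nstlist can also be set on the mdrun command line rather than in the mdp file. A sketch, where the value 100 is just an illustrative choice:

```
# nstlist in the mdp is treated as a minimum and may be increased by mdrun for
# GPU runs; the -nstlist mdrun flag sets the value explicitly at launch.
gmx mdrun -deffnm prod -nstlist 100
```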
With your setup, offloading everything to the GPUs should be optimal. As a rule of thumb, you want to aim for a 3:1 ratio of non-bonded (PP) to PME ranks. You mentioned that your nodes can support up to 6 GPUs, so I would start by requesting 4 GPUs and dedicating one of them to PME (i.e. -npme 1), with the other 3 ranks handling everything else. Note that you are limited to a single GPU for PME unless you have compiled GROMACS with cuFFTMp. You could try requesting more GPUs for the (short-range) non-bonded computations, but that is potentially not very efficient because they may have to wait for PME to catch up.
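Something along these lines, as a rough sketch rather than a tested command. It assumes GROMACS 2022 or newer built with CUDA; the file name, thread counts, and GPU mapping are assumptions for your 64-core, 4-GPU case, and older versions used GMX_GPU_DD_COMMS / GMX_GPU_PME_PP_COMMS instead of the single environment variable shown here:

```
# 4 GPUs on one node: 3 PP ranks (non-bonded, bonded) + 1 dedicated PME rank,
# 16 OpenMP threads per rank (64 cores / 4 ranks), everything offloaded.
export GMX_ENABLE_DIRECT_GPU_COMM=1   # direct GPU-GPU halo/PME transfers (2022+)
gmx mdrun -deffnm prod \
  -ntmpi 4 -ntomp 16 -npme 1 \
  -nb gpu -pme gpu -bonded gpu -update gpu \
  -gputasks 0123                      # one GPU per task: 3x non-bonded + 1x PME
```

The direct GPU communication is usually what shrinks the "PME wait" and "GPU state copy" timers you are seeing. Whether -update gpu is accepted together with multiple ranks depends on the GROMACS version; mdrun will complain at start-up if the combination is not supported in your build.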
If you really want to get into the nitty-gritty of performance optimization, you might want to take a look at some of the NVIDIA blog posts on GROMACS GPU performance.
That said, 138 ns/day already seems reasonable for this hardware and system size.