I am simulating a system of approximately 330,000 atoms with the CHARMM36 force field and am trying to significantly improve the performance of my production run. Currently I am using 3 GPUs (NVIDIA A100) and 64 CPU cores on a single node (the node can support up to 6 GPUs). I have also increased nstlist from 10 to 40 to make the pair search more efficient. I am currently getting 134 ns/day, but the "PME wait" and "GPU state copy" timers account for 30-35 percent of the run time. I am looking for further optimizations to increase the throughput of mdrun.
With -tunepme I managed to get 138 ns/day. Any suggestions?
Run the simulation a bit longer when benchmarking; I would stick to at least a couple of minutes of wall time, even without PME tuning (a very minor point).
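For reference, a minimal way to get a quick but representative benchmark is to run a limited number of steps and reset the timing counters halfway through, so the reported ns/day excludes start-up and initial load-balancing overhead. This is only a sketch; the file name and step count are placeholders:

```
# Hypothetical short benchmark: ~50k steps, performance counters reset at the
# halfway point (-resethway), no final coordinate output (-noconfout).
gmx mdrun -deffnm prod_bench -nsteps 50000 -resethway -noconfout
```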
How did you increase nstlist? GROMACS will fine-tune it at the beginning of the simulation anyway; if it is set in the mdp file, it will be overridden. It will probably end up using ~100.
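If you do want to control it yourself, nstlist can also be set on the mdrun command line rather than in the mdp file. A sketch, where the value 100 is just an illustrative choice:

```
# nstlist in the mdp is treated as a minimum and may be increased by mdrun for
# GPU runs; the -nstlist mdrun flag sets the value explicitly at launch.
gmx mdrun -deffnm prod -nstlist 100
```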
With your setup, offloading everything to the GPUs should be optimal. As a rule of thumb, you want to aim for a 3:1 ratio of non-bonded (PP) to PME ranks. You mentioned that your nodes can support up to 6 GPUs, so I would start by requesting 4 GPUs and dedicating one of them to PME (i.e. -npme 1), with the other 3 ranks handling everything else. Note that you are limited to a single GPU for PME unless you have compiled GROMACS with cuFFTMp. You could try requesting more GPUs for the (short-range) non-bonded computations, but that is potentially not very efficient because they may have to wait for PME to catch up.
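Something along these lines, as a rough sketch rather than a tested command. It assumes GROMACS 2022 or newer built with CUDA; the file name, thread counts, and GPU mapping are assumptions for your 64-core, 4-GPU case, and older versions used GMX_GPU_DD_COMMS / GMX_GPU_PME_PP_COMMS instead of the single environment variable shown here:

```
# 4 GPUs on one node: 3 PP ranks (non-bonded, bonded) + 1 dedicated PME rank,
# 16 OpenMP threads per rank (64 cores / 4 ranks), everything offloaded.
export GMX_ENABLE_DIRECT_GPU_COMM=1   # direct GPU-GPU halo/PME transfers (2022+)
gmx mdrun -deffnm prod \
  -ntmpi 4 -ntomp 16 -npme 1 \
  -nb gpu -pme gpu -bonded gpu -update gpu \
  -gputasks 0123                      # one GPU per task: 3x non-bonded + 1x PME
```

The direct GPU communication is usually what shrinks the "PME wait" and "GPU state copy" timers you are seeing. Whether -update gpu is accepted together with multiple ranks depends on the GROMACS version; mdrun will complain at start-up if the combination is not supported in your build.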
If you really want to get into the nitty-gritty of performance optimization, you might want to take a look at some of the NVIDIA blog posts on GROMACS GPU performance.
That said, 138 ns/day already seems reasonable for this hardware and system size.