Optimizing CPU/GPU efficiency and performance in GROMACS simulations

GROMACS version: 2024
GROMACS modification: No

Hello everyone,

I have been testing different configurations to optimize CPU/GPU usage and simulation speed in GROMACS. Below is a summary of my test runs:

Speed (ns/day) | Command Used | CPU Usage | GPU Usage
---------------|--------------------------------------|----------------------------------|-----------
895.51        | gmx mdrun -s md2.tpr -deffnm md2 -v -cpi step7_1.cpt -noappend -ntmpi 1 -ntomp 24 -gpu_id 0 -pme gpu -bonded gpu -nb gpu -pin on | One core at 100%, others at 40-60% | 80-90%
754.19        | gmx mdrun -s md2.tpr -deffnm md2 -v -cpi step7_1.cpt -noappend | All cores fully utilized | 80-90%
439.29        | gmx mdrun -s md2.tpr -deffnm md2 -v -cpi step7_1.cpt -noappend -ntmpi 2 -ntomp 12 -gpu_id 0 -pme gpu -npme 1 -bonded gpu -nb gpu -pin on | Half of cores at full usage, rest idle | 80-90%
419.79        | gmx mdrun -s md2.tpr -deffnm md2 -v -cpi step7_1.cpt -noappend -ntmpi 8 -ntomp 6 -gpu_id 0 -pme gpu -npme 1 -bonded gpu -nb gpu -pin on | CPU usage 30-60% | 80-90%
389.95        | gmx mdrun -s md2.tpr -deffnm md2 -v -cpi step7_1.cpt -noappend -ntmpi 8 -ntomp 3 -gpu_id 0 -pme gpu -npme 1 -bonded gpu -nb gpu -update gpu -pin on | All CPUs used, but two at 20-30% | 80-90%
201.18        | gmx mdrun -s md2.tpr -deffnm md2 -v -cpi step7_1.cpt -noappend -ntmpi 8 -ntomp 6 -gpu_id 0 -pme cpu -npme 3 -bonded gpu -nb gpu -pin on | CPU cores at 50-70% | Below 30%
  • The fastest run (895.51 ns/day) had one CPU core at 100% while others were at 40-60%. Does this imbalance affect CPU longevity?
  • Most configurations had GPU usage at 80-90%, except the slowest one. Should I adjust PP/PME load distribution?
  • Some setups had underutilized CPU cores (20-30%), while others had full CPU usage. What is the best way to balance MPI (-ntmpi) and OpenMP (-ntomp)?
  • The fastest run used -ntmpi 1 -ntomp 24, but would it be better to distribute work across more cores?

Looking for insights on how to fine-tune CPU/GPU settings for the best balance between speed and efficiency. Would appreciate any suggestions!

Hi,

That’s a comprehensive set of tests you’ve run!

I don’t know your hardware, but I’m assuming a high-end single-CPU, single-GPU workstation. In that case, the settings you discovered are likely close to optimal, but there are a few more tips worth trying.

On CPU longevity: it should not be affected. If anything, lower load could even improve it: less power → less heat → better longevity, not only for the CPU but for the whole power-delivery chain.

On PP/PME load distribution: I don’t see how it would help. Except for the slowest run (PME on the CPU), you have both PME and NB running on the GPU, so balancing work between them won’t change much.

What could increase GPU utilization (and very likely improve performance) is making the CPU-side neighbor search less frequent, moving even more work from the CPU to the GPU. Try adding the -nstlist 300 flag (the default when running on a GPU is usually around 100; you can check your md2.log for mentions of nstlist).
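For example, taking your fastest command as the starting point, the run could look like this (just a sketch of what you already ran, with -update gpu spelled out even though it is the 2024 default):

    gmx mdrun -s md2.tpr -deffnm md2 -v -cpi step7_1.cpt -noappend \
              -ntmpi 1 -ntomp 24 -gpu_id 0 \
              -nb gpu -pme gpu -bonded gpu -update gpu \
              -pin on -nstlist 300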

Also, don’t read too much into the GPU and CPU utilization percentages. They are good for a rough idea, but, as you’ve discovered, higher CPU usage does not mean better performance; being busy doing useless work is not good.

On -ntmpi vs -ntomp: the answer depends a lot on whether you use a GPU. In most cases, you should set -ntmpi 1 when you use a GPU. There are cases where you might want to run more than one rank per GPU, but they usually arise when you have multiple GPUs. Set -ntomp to the number of cores. You could perhaps even reduce -ntomp to 12 with little performance penalty, since not much is happening on the CPU anyway, and, in GROMACS 2024, a lot of what does happen uses only one thread (that changed in GROMACS 2025, where -ntomp can play a larger role).
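For instance, a sketch of a reduced-thread run (the 12 is just an assumption for a 24-core box; note that -pin on has to be explicit here because not all cores are used):

    gmx mdrun -s md2.tpr -deffnm md2 -v -cpi step7_1.cpt -noappend \
              -ntmpi 1 -ntomp 12 -nb gpu -pme gpu -bonded gpu -update gpu -pin on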

As for spreading the work across more cores: in this setup you have 24 threads. The first thread coordinates the GPU and the remaining threads; it is always busy, while the other 23 are only busy occasionally, when there is CPU work. Since most things are offloaded to the GPU, the CPU work is either the neighbor search every ~100 steps or some of the bonded forces (even with -bonded gpu, the CPU handles some of the bonded interactions).

Usually, when you have a powerful GPU, it’s better to run as much as possible on the GPU and let the CPU sit underutilized, hence -bonded gpu (-nb, -pme, and -update are set to gpu by default in GROMACS 2024). Trying to “balance” the work would just add the extra overhead of copying data between the GPU and the CPU.

At 900 ns/day (assuming a 2 fs timestep), that’s already only ~0.2 ms per step, and the CPU can struggle to keep launching work on the GPU at such a rate. If you’re using CUDA, you can try export GMX_CUDA_GRAPH=1; this should reduce the cost of launching work on the GPU.
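Something along these lines (CUDA Graphs support is still experimental in 2024, so keep an eye on the log and your results):

    export GMX_CUDA_GRAPH=1    # opt in to CUDA Graphs for scheduling the GPU work
    gmx mdrun -s md2.tpr -deffnm md2 -v -cpi step7_1.cpt -noappend \
              -ntmpi 1 -ntomp 24 -nb gpu -pme gpu -bonded gpu -update gpu -pin on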

I assume you’ve looked at our docs, but just in case: see the final section of Getting good performance from mdrun - GROMACS 2024.4 documentation. At the end of md2.log there is a performance table showing how time is spent in different parts of the code. Admittedly, it can be cryptic if you’re not familiar with the internal workings of the code, but the linked doc has some suggestions on what to look at.
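For a quick look without opening the whole file (the line count is a rough guess):

    tail -n 100 md2.log             # the cycle/time accounting table near the end of the log
    grep "Performance:" md2.log     # the ns/day and hours/ns summary line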


Hello Andrey,

First of all, thank you for your detailed response! I really appreciate the time you took to explain everything.

I have two follow-up questions:

  1. Do I need to manually specify -pme gpu -bonded gpu -nb gpu -pin on, or are these automatically set in GROMACS 2024? I have seen these flags included in many example commands, but I’m unsure if they are necessary.

  2. You suggested increasing -nstlist to 300. However, in most tutorials and research papers, I often see values like 1, 5, 10, or 20 being used. Would setting it to 300 be too large? What would be the trade-offs of such a high value?

Looking forward to your insights!

  • -pme gpu -nb gpu -update gpu are the defaults when a GPU is detected and the configuration allows it. I prefer to specify things explicitly when tuning for performance (a sketch follows after this list). For example, if you have constraints=all-bonds, GROMACS will throw an error if -update gpu is spelled out explicitly, but will silently fall back to the CPU otherwise. You decide which behavior works best for you.
  • -bonded gpu needs to be explicit; the default is cpu even if the bonded forces are supported on the GPU.
  • -pin on is the default when you are using all cores (ntmpi × ntomp = number of cores); otherwise GROMACS cannot safely guess which cores you want to use and will not pin threads.

GROMACS 2025 behaves the same. However, GROMACS 2023 defaults to -update cpu.
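To make the difference concrete, here is a sketch of the two styles for your system (same input files as in your commands):

    # Relying on the 2024 defaults: nb/pme/update go to the GPU when supported,
    # and silently fall back to the CPU when not
    gmx mdrun -s md2.tpr -deffnm md2 -bonded gpu

    # Fully explicit: mdrun stops with an error if any of these cannot run on the GPU
    gmx mdrun -s md2.tpr -deffnm md2 -nb gpu -pme gpu -bonded gpu -update gpu -pin on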

A few cases where you might want to launch fewer threads than you have cores, and where you definitely need pinning:

  • If you have a chiplet-based CPU, like the recent AMD Zen series, you might be better off limiting GROMACS to a single chiplet (“CCX” in AMD terminology; equivalent to the “LL cache” or “L3 cache” domains in various hardware-topology-reporting utilities). Check the CPU specs online for the number of cores per chiplet, then set -ntomp accordingly and use -pin on (see the sketch after this list).
  • Similarly, with the P/E-core split in recent Intel CPUs, it can be better to limit GROMACS to the P-cores only. Again, look up the number of P-cores and set -pin on.
  • Otherwise, for a single-socket machine, it should not matter that much.
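A sketch of what that could look like (the 8 threads and offset 0 are placeholders; adjust them to your chiplet or P-core layout):

    # Keep GROMACS on the first 8 cores only; -pinoffset shifts the pinned range,
    # e.g. to land on a specific chiplet
    gmx mdrun -s md2.tpr -deffnm md2 -ntmpi 1 -ntomp 8 -pin on -pinoffset 0 \
              -nb gpu -pme gpu -bonded gpu -update gpu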

Are you sure about that? The nstlist value from the MDP file is a minimum (unless set to 1, which actually enforces a list update every step – very slow). GROMACS will set nstlist to around 100 when running on a GPU. Using the -nstlist flag is the only way to pin nstlist to a specific, fixed value.

You can look for the “Changing nstlist” line in your md2.log to see the actual value used.
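For example:

    grep -i nstlist md2.log    # shows the input value and any "Changing nstlist ..." note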

GROMACS uses an adaptive, dual pair list to keep the simulation physically correct regardless of the pair-list update frequency (see verlet-buffer-tolerance).

A larger nstlist means the “outer” neighbor list is updated less often → the list must be larger to accommodate particles moving farther between updates → the GPU has more work to do on each step. But it also means the slow list update (neighbor search) runs less often, which is why large values are good for performance when you have a fast GPU.
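If you’d rather measure than guess, a short benchmark sweep is cheap (a sketch; -nsteps caps the run length and -resethway excludes the startup cost from the reported performance):

    for nl in 100 200 300 400; do
        gmx mdrun -s md2.tpr -deffnm bench_nstlist_${nl} -nstlist ${nl} \
                  -ntmpi 1 -ntomp 24 -nb gpu -pme gpu -bonded gpu -update gpu -pin on \
                  -nsteps 50000 -resethway
    done
    grep "Performance:" bench_nstlist_*.log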

But it should never lead to any missing interactions (we had a nasty bug prior to GROMACS 2024 due to underestimation of the pair-list range, but, well, it was a bug, and only relatively uncommon simulation setups, inhomogeneous and without PME, were affected).


Thank you for your detailed explanation! Your insights on GPU defaults, -nstlist behavior, CPU architecture, and thread pinning were very helpful.