Guidance on Identifying CPU vs. GPU Bottlenecks in GROMACS Simulations

GROMACS version: 2025.2
GROMACS modification: No

Dear GROMACS Community,

I hope this message finds you well. I am currently running simulations and am trying to determine whether my runs are bottlenecked by CPU or GPU compute power (or perhaps their communication).

This information is crucial for optimizing resource allocation on cloud providers and high-performance computing clusters, where various combinations of CPU and GPU are available.

Could you please provide guidance or recommend any tools or methods to accurately assess where the bottleneck might be occurring in my simulations? Any insights or suggestions would be greatly appreciated.

Thank you for your assistance.

Best regards,

Ivan

Hi,

The best way to start investigating is to look at the performance counters table at the end of the log file.

There are some details available at: Getting good performance from mdrun - GROMACS 2025.2 documentation
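To make that table easier to digest, the wall-time percentages can be pulled out programmatically. Below is a minimal sketch (not an official GROMACS tool, and the column layout is assumed to match recent GROMACS versions) that extracts per-activity percentages from the "R E A L   C Y C L E" accounting table so the most expensive activities stand out:

```python
import re

# Data rows have exactly one leading space, an activity name, then
# numeric columns ending with the wall-time percentage.
ROW = re.compile(r"^ (\S.*?)\s{2,}.*?([\d.]+)\s*$")

def top_activities(log_text, n=3):
    """Return the n activities with the largest wall-time percentage."""
    rows, in_table = [], False
    for line in log_text.splitlines():
        if "R E A L   C Y C L E" in line:
            in_table = True
        elif in_table and line.startswith(" Total"):
            break
        elif in_table:
            m = ROW.match(line)
            if m:
                rows.append((m.group(1), float(m.group(2))))
    return sorted(rows, key=lambda r: r[1], reverse=True)[:n]

# Tiny excerpt of an mdrun log as a self-contained demo:
sample = """\
      R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G

On 1 MPI rank, each using 16 OpenMP threads

 Activity:              Num   Num      Call    Wall time         Giga-Cycles
                        Ranks Threads  Count      (s)         total sum    %
--------------------------------------------------------------------------------
 Force                     1   16  200000001   30548.341    1221932.607   5.1
 PME GPU mesh              1   16  200000001  130433.960    5217354.051  21.7
 Wait GPU NB local         1   16  200000001  299630.121   11985194.848  49.8
--------------------------------------------------------------------------------
 Total                                        601082.617   24043284.622 100.0
"""
for name, pct in top_activities(sample):
    print(f"{pct:5.1f}%  {name}")
```

Activities whose names start with "Wait" are CPU time spent idle waiting on the GPU, which is the first place to look when judging CPU vs. GPU balance.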

That’s great. Thank you.

I have to say that I find it challenging to interpret.

For example, my last mdrun execution used GROMACS 2024.4, and it had these metrics at the end of the log file:

       M E G A - F L O P S   A C C O U N T I N G

 NB=Group-cutoff nonbonded kernels    NxN=N-by-N cluster Verlet kernels
 RF=Reaction-Field  VdW=Van der Waals  QSTab=quadratic-spline table
 W3=SPC/TIP3p  W4=TIP4p (single or pairs)
 V&F=Potential and force  V=Potential only  F=Force only

 Computing:                               M-Number         M-Flops  % Flops
-----------------------------------------------------------------------------
 Pair Search distance check        23512170.866768   211609537.801     0.0
 NxN Ewald Elec. + LJ [F]         29578669465.438080 1952192184718.913    84.6
 NxN Ewald Elec. + LJ [V&F]       3286518993.655168 351657532321.103    15.2
 1,4 nonbonded interactions         2767800.013839   249102001.246     0.0
 Shift-X                             152322.076161      913932.457     0.0
 Bonds                               554000.002770    32686000.163     0.0
 Propers                            2700200.013501   618345803.092     0.0
 Impropers                           178800.000894    37190400.186     0.0
 Virial                             1524120.076206    27434161.372     0.0
 Stop-CM                            1523220.076161    15232200.762     0.0
 Calc-Ekin                         15232200.152322   411269404.113     0.0
 Lincs                               516400.002582    30984000.155     0.0
 Lincs-Mat                          2512800.012564    10051200.050     0.0
 Constraint-V                      11647400.058237   104826600.524     0.0
 Constraint-Vir                     1113100.055655    26714401.336     0.0
 Settle                             3538200.017691  1309134006.546     0.1
 Virtual Site 3                     3892020.035382   144004741.309     0.0
 CMAP                                 68600.000343   116620000.583     0.0
 Urey-Bradley                       1916800.009584   350774401.754     0.0
-----------------------------------------------------------------------------
 Total                                             2307546609833.464   100.0
-----------------------------------------------------------------------------


      R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G

On 1 MPI rank, each using 16 OpenMP threads

 Activity:              Num   Num      Call    Wall time         Giga-Cycles
                        Ranks Threads  Count      (s)         total sum    %
--------------------------------------------------------------------------------
 Vsite constr.             1   16  200000001   17135.216     685408.076   2.9
 Neighbor search           1   16    2000001   14881.037     595240.993   2.5
 Launch PP GPU ops.        1   16  200000001    9395.024     375800.645   1.6
 Force                     1   16  200000001   30548.341    1221932.607   5.1
 PME GPU mesh              1   16  200000001  130433.960    5217354.051  21.7
 Wait GPU NB local         1   16  200000001  299630.121   11985194.848  49.8
 NB X/F buffer ops.        1   16  398000001   17725.823     709032.327   2.9
 Vsite spread              1   16  220000002    5698.201     227927.850   0.9
 Write traj.               1   16      40660    1325.255      53010.148   0.2
 Update                    1   16  200000001   20235.398     809415.257   3.4
 Constraints               1   16  200000001   24053.523     962140.110   4.0
 Rest                                          30020.718    1200827.709   5.0
--------------------------------------------------------------------------------
 Total                                        601082.617   24043284.622 100.0
--------------------------------------------------------------------------------
 Breakdown of PME mesh activities
--------------------------------------------------------------------------------
 Wait PME GPU gather       1   16  200000001  102728.439    4109134.134  17.1
 Reduce GPU PME F          1   16  200000001    4772.940     190917.426   0.8
 Launch PME GPU ops.       1   16 1800000009   11870.427     474816.675   2.0
--------------------------------------------------------------------------------

               Core t (s)   Wall t (s)        (%)
       Time:  9617321.820   601082.617     1600.0
                         6d22h58:02
                 (ns/day)    (hour/ns)
Performance:       57.496        0.417
Finished mdrun on rank 0 Mon Apr 14 21:52:08 2025

My question again, given the metrics above: is my bottleneck my CPU or my GPU?

Thank you.

Ivan

Hi,
I guess the optimal cloud instance is simply the one that gives you the highest performance-to-price ratio for your use case, regardless of whether CPU/GPU usage is well balanced for GROMACS or not. Maybe this article sheds some light on it: https://pubs.acs.org/doi/10.1021/acs.jcim.2c00044
Best,
Carsten

Hi Carsten.
Your paper is an alternative way of answering my question. I appreciate it.
By the way, great article.
Thank you.
Ivan

Hi, some time ago I was testing GROMACS performance on my new system to get better results, or you could say to minimize simulation time, and I discovered that 14 threads out of 16 worked much better than running on all 16 threads. Using 8 threads also gave good performance, but 14 was best, as it shaved a few hours of simulation time compared with running on the full 16. I have an i5-12500H (12-core, 16-thread) CPU and an RTX 4060 GPU. This also works on servers or cloud computing: just leave 1 core (2 threads) empty.

This is puzzling but still highly appreciated.

I’ll try it.

By the way, I did notice that using two threads per core performs just as well as one thread per core: the performance is indistinguishable. That is true at least for both v2024.4 and v2025.2.

Thank you.

Ivan

I am assuming that, when running two threads per core, you are using all the CPU’s hardware threads (HT), whereas with one thread per core, you are only using the number of physical cores. Whether using one or two threads per core is better for performance depends on various factors, such as the total number of cores, the number of atoms in your simulation system, whether you run update on the GPU or CPU, and whether your run is CPU- or GPU-bound. If you perform the update step on the CPU, using all HTs is usually beneficial for large systems as there will be enough work for each HT. However, if the system is small, using just the physical cores is often faster as there is not much work per core anyway, and the reductions across the threads will be faster with half the number of threads in an OpenMP region.

If not specified manually, GROMACS mdrun has built-in heuristics that will choose to utilise either all HTs or all physical cores. To be sure which is better, you would need to run a quick benchmark for each configuration. I have no experience of running GROMACS on systems like the one Ashutosh mentioned, where the CPU cores are not all equal. I suppose all sorts of things could happen there. :)
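A quick benchmark sweep over thread counts, as suggested above, can be scripted. This is a hedged sketch: the mdrun flags used (-ntmpi, -ntomp, -nsteps, -resethway, -noconfout, -g) are standard options, but the .tpr filename and step count are placeholders you would adapt to your system:

```python
def benchmark_commands(thread_counts, tpr="topol.tpr", nsteps=10000):
    """Build short mdrun benchmark commands for a sweep over OpenMP
    thread counts. -resethway resets the timers halfway through so
    start-up costs do not skew the ns/day figure; -noconfout skips
    writing the final configuration."""
    cmds = []
    for nt in thread_counts:
        cmds.append(
            f"gmx mdrun -s {tpr} -ntmpi 1 -ntomp {nt} "
            f"-nsteps {nsteps} -resethway -noconfout -g bench_nt{nt}.log"
        )
    return cmds

for cmd in benchmark_commands([8, 12, 14, 16]):
    print(cmd)
```

Comparing the ns/day reported at the end of each bench_nt*.log then tells you directly whether all hardware threads or only the physical cores are faster for your particular system.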
Best,
Carsten

The simple answer is: 49.8% + 17.1% of the time is spent on the CPU waiting for the GPU to finish its tasks, so the GPU is the bottleneck. However, this does not mean that the CPU has no performance impact, or that getting a faster GPU is the best way forward.

To elaborate: in your case, you are running the NB and PME tasks on the GPU, but Bonded and Update on the CPU. This means that, for each time step, the NB and PME tasks are launched on the GPU while, in parallel, the CPU computes bonded and miscellaneous forces. After computing its forces (which account for 5.1% of the time), the CPU waits for the forces from the GPU and then performs the reduction (“NB X/F buffer ops.”), integration (“Update”), and Constraints. The next step cannot start before the new coordinates are available, so a 2x slower CPU would make Update + Constraints (currently 3.4% + 4.0%) take twice as long, reducing overall application performance. There are other tasks, but the main idea still holds.
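The reasoning above can be made concrete with back-of-the-envelope arithmetic (a simplified model, assuming the listed CPU activities sit entirely on the critical path and everything else is unchanged), using the percentages from the log:

```python
# CPU work on the critical path, from the log's wall-time percentages:
# Update (3.4%) + Constraints (4.0%) + NB X/F buffer ops. (2.9%).
cpu_critical = (3.4 + 4.0 + 2.9) / 100.0

# Hypothetical CPU that is 2x slower: only the CPU-critical fraction
# of the wall time stretches; the GPU-bound remainder stays fixed.
slowdown = 2.0
new_time = (1 - cpu_critical) + slowdown * cpu_critical
print(f"estimated wall-time increase: {100 * (new_time - 1):.1f}%")
```

So even in a GPU-bound run like this one, halving CPU speed would cost roughly 10% of overall performance, which is why CPU choice still matters when picking an instance type.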

You can also notice that in this case we have to copy coordinates and forces between CPU and GPU on each step. If you enable the fully GPU-resident mode (-update gpu -bonded gpu), you will likely get a speed-up: the GPU will do more work and the CPU will be mostly idle, but overall simulation performance will benefit from avoiding the need to copy data back and forth and to synchronize CPU and GPU execution. Sadly, this mode is not compatible with virtual sites, so it can only work if your simulation parameters can be changed.
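For reference, a GPU-resident launch along the lines described above could be composed like this (a sketch: -nb, -pme, -bonded, and -update are real mdrun options, while the -deffnm name is a placeholder; as noted, -update gpu will refuse to run if the topology contains virtual sites):

```python
# Offload all four major force/update tasks to the GPU so the
# coordinates stay resident on the device between steps.
resident_flags = ["-nb", "gpu", "-pme", "gpu", "-bonded", "gpu", "-update", "gpu"]
cmd = "gmx mdrun -deffnm md " + " ".join(resident_flags)
print(cmd)
```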