GROMACS version: 2025.2
GROMACS modification: No
Dear GROMACS Community,
I hope this message finds you well. I am currently running simulations and am trying to determine whether my runs are bottlenecked by CPU or GPU compute power (or perhaps their communication).
This information is crucial for optimizing resource allocation on cloud providers and high-performance computing clusters, where various combinations of CPU and GPU are available.
Could you please provide guidance or recommend any tools or methods to accurately assess where the bottleneck might be occurring in my simulations? Any insights or suggestions would be greatly appreciated.
Thank you for your assistance.
Best regards,
Ivan
Hi,
The best way to start investigating is to look at the performance counters table at the end of the log file.
There are some details available at: Getting good performance from mdrun - GROMACS 2025.2 documentation
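The tables you need are at the very end of the log, so something like this will show them without scrolling through the whole file (a sketch only, assuming your log file is called md.log; adjust the name and line count to your run):

# show the flop/cycle accounting tables and the final performance summary
tail -n 80 md.log

# or just the headline throughput
grep "Performance:" md.log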
That’s great. Thank you.
I have to say that I find it challenging to interpret.
For example, my last mdrun execution was with GROMACS 2024.4, and it had these metrics at the end of the log file:
M E G A - F L O P S A C C O U N T I N G
NB=Group-cutoff nonbonded kernels NxN=N-by-N cluster Verlet kernels
RF=Reaction-Field VdW=Van der Waals QSTab=quadratic-spline table
W3=SPC/TIP3p W4=TIP4p (single or pairs)
V&F=Potential and force V=Potential only F=Force only
Computing: M-Number M-Flops % Flops
-----------------------------------------------------------------------------
Pair Search distance check 23512170.866768 211609537.801 0.0
NxN Ewald Elec. + LJ [F] 29578669465.438080 1952192184718.913 84.6
NxN Ewald Elec. + LJ [V&F] 3286518993.655168 351657532321.103 15.2
1,4 nonbonded interactions 2767800.013839 249102001.246 0.0
Shift-X 152322.076161 913932.457 0.0
Bonds 554000.002770 32686000.163 0.0
Propers 2700200.013501 618345803.092 0.0
Impropers 178800.000894 37190400.186 0.0
Virial 1524120.076206 27434161.372 0.0
Stop-CM 1523220.076161 15232200.762 0.0
Calc-Ekin 15232200.152322 411269404.113 0.0
Lincs 516400.002582 30984000.155 0.0
Lincs-Mat 2512800.012564 10051200.050 0.0
Constraint-V 11647400.058237 104826600.524 0.0
Constraint-Vir 1113100.055655 26714401.336 0.0
Settle 3538200.017691 1309134006.546 0.1
Virtual Site 3 3892020.035382 144004741.309 0.0
CMAP 68600.000343 116620000.583 0.0
Urey-Bradley 1916800.009584 350774401.754 0.0
-----------------------------------------------------------------------------
Total 2307546609833.464 100.0
-----------------------------------------------------------------------------
R E A L C Y C L E A N D T I M E A C C O U N T I N G
On 1 MPI rank, each using 16 OpenMP threads
Activity:  Num Ranks  Num Threads  Call Count  Wall time (s)  Giga-Cycles total sum  %
--------------------------------------------------------------------------------
Vsite constr. 1 16 200000001 17135.216 685408.076 2.9
Neighbor search 1 16 2000001 14881.037 595240.993 2.5
Launch PP GPU ops. 1 16 200000001 9395.024 375800.645 1.6
Force 1 16 200000001 30548.341 1221932.607 5.1
PME GPU mesh 1 16 200000001 130433.960 5217354.051 21.7
Wait GPU NB local 1 16 200000001 299630.121 11985194.848 49.8
NB X/F buffer ops. 1 16 398000001 17725.823 709032.327 2.9
Vsite spread 1 16 220000002 5698.201 227927.850 0.9
Write traj. 1 16 40660 1325.255 53010.148 0.2
Update 1 16 200000001 20235.398 809415.257 3.4
Constraints 1 16 200000001 24053.523 962140.110 4.0
Rest 30020.718 1200827.709 5.0
--------------------------------------------------------------------------------
Total 601082.617 24043284.622 100.0
--------------------------------------------------------------------------------
Breakdown of PME mesh activities
--------------------------------------------------------------------------------
Wait PME GPU gather 1 16 200000001 102728.439 4109134.134 17.1
Reduce GPU PME F 1 16 200000001 4772.940 190917.426 0.8
Launch PME GPU ops. 1 16 1800000009 11870.427 474816.675 2.0
--------------------------------------------------------------------------------
Core t (s) Wall t (s) (%)
Time: 9617321.820 601082.617 1600.0
6d22h58:02
(ns/day) (hour/ns)
Performance: 57.496 0.417
Finished mdrun on rank 0 Mon Apr 14 21:52:08 2025
My question again, given the metrics above: is my bottleneck my CPU or my GPU?
Thank you.
Ivan
Hi,
I guess the optimal cloud instance is simply the one that gives you the highest performance to price ratio for your use case, regardless of whether CPU/GPU usage is well balanced for GROMACS or not. Maybe this article sheds some light on it: https://pubs.acs.org/doi/10.1021/acs.jcim.2c00044
Best,
Carsten
Hi Carsten.
Your paper is an alternative way of answering my question. I appreciate it.
By the way, great article.
Thank you.
Ivan
Hi, some time ago I was testing GROMACS performance on my new system to minimize simulation time, and I found that running on 14 of the 16 threads worked noticeably better than using all 16. 8 threads also gave good performance, but 14 was best: it shaved a few hours off the simulation compared to running on all 16 threads. I have an i5-12500H (12 cores, 16 threads) CPU and an RTX 4060 GPU. The same applies on servers or cloud instances: just leave one core or two threads free.
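If you want to try this yourself, a sweep over thread counts is easy to script. This is only a sketch, assuming a single-rank GPU run from a placeholder topol.tpr; a few thousand steps are enough for a rough ns/day comparison:

# short benchmark runs with different OpenMP thread counts;
# -resethway excludes the start-up phase from the timings,
# -noconfout skips writing the final configuration
for nt in 8 14 16; do
  gmx mdrun -s topol.tpr -ntmpi 1 -ntomp $nt -pin on \
            -nsteps 10000 -resethway -noconfout -g bench_${nt}t.log
done
grep "Performance:" bench_*t.log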
This is puzzling but still highly appreciated.
I’ll try it.
By the way, I did notice that using two threads per core performs just as well as one thread per core; the performance is indistinguishable. That is true at least for both v2024.4 and v2025.2.
Thank you.
Ivan
I am assuming that, when running two threads per core, you are using all the CPU’s hardware threads (HT), whereas with one thread per core, you are only using the number of physical cores. Whether using one or two threads per core is better for performance depends on various factors, such as the total number of cores, the number of atoms in your simulation system, whether you run update on the GPU or CPU, and whether your run is CPU- or GPU-bound. If you perform the update step on the CPU, using all HTs is usually beneficial for large systems as there will be enough work for each HT. However, if the system is small, using just the physical cores is often faster as there is not much work per core anyway, and the reductions across the threads will be faster with half the number of threads in an OpenMP region.
If not specified manually, GROMACS mdrun has built-in heuristics that will choose to utilise either all HTs or all physical cores. To be sure which is better, you would need to run a quick benchmark for each configuration (see the sketch below). I have no experience running GROMACS on systems like the one Ashutosh mentioned, where the CPU cores are not all equal. I suppose all sorts of things could happen there. :)
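As a rough illustration of such a benchmark (a sketch only, with a placeholder topol.tpr, assuming a CPU with 8 physical cores / 16 hardware threads; adjust -ntomp and the pin stride to your machine):

# all hardware threads: one OpenMP thread per HT
gmx mdrun -s topol.tpr -ntmpi 1 -ntomp 16 -pin on -pinstride 1 \
          -nsteps 10000 -resethway -noconfout -g bench_allHT.log

# physical cores only: one OpenMP thread per core, skipping every second HT
gmx mdrun -s topol.tpr -ntmpi 1 -ntomp 8 -pin on -pinstride 2 \
          -nsteps 10000 -resethway -noconfout -g bench_cores.log

grep "Performance:" bench_allHT.log bench_cores.log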
Best,
Carsten
The simple answer is: 49.8% + 17.1% of the wall time is spent on the CPU waiting for the GPU to finish its tasks, so the GPU is the bottleneck. However, this does not mean that the CPU has no performance impact, or that getting a faster GPU is the best way forward.
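If you want to pull those two numbers straight out of the log, something along these lines works (a sketch, assuming the log is called md.log and the accounting table is formatted as above, with the percentage in the last column):

# sum the wall-time percentages of the two GPU-wait rows
awk '/Wait GPU NB local|Wait PME GPU gather/ {s += $NF} END {print s "% of wall time spent waiting for the GPU"}' md.log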
Elaborating a bit more: in your case you are doing the NB and PME tasks on the GPU, but Bonded and Update on the CPU. This means that, for each time step, the NB and PME tasks are launched on the GPU while, in parallel, the CPU computes the bonded and miscellaneous forces. After computing these forces (which account for 5.1% of the time), the CPU waits for the forces from the GPU and then performs the reduction ("NB X/F buffer ops."), the integration ("Update"), and the Constraints. The next step cannot start before the new coordinates are available, so a 2x slower CPU will make the Update + Constraints part (currently 3.4% + 4.0% of the time) 2x slower, reducing the overall application performance. There are other tasks, but the main idea still holds.
You can also notice that in this case we have to copy coordinates and forces between the CPU and the GPU on each step. If you enable the fully GPU-resident mode (-update gpu -bonded gpu), you will likely get a speed-up: the GPU will do more work and the CPU will be mostly idle, but the overall simulation performance will benefit from avoiding the need to copy data back and forth and to synchronize CPU and GPU execution on every step. This, sadly, is not compatible with virtual sites, so it could only work if your simulation parameters can be changed.
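For reference, a GPU-resident run of this kind could look roughly like the command below. This is a sketch only: topol.tpr is a placeholder, and it assumes the virtual sites have been removed from the topology, since -update gpu does not support them.

# offload nonbonded, PME and bonded forces plus integration/constraints to the GPU
gmx mdrun -s topol.tpr -ntmpi 1 -ntomp 16 \
          -nb gpu -pme gpu -bonded gpu -update gpu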