Performance regression in 2025.4: suboptimal CUDA NBNxM kernel selection vs 2024.2

GROMACS version: 2024.2 + 2025.4
GROMACS modification: No

I am observing a significant performance regression in GROMACS 2025.4 compared to 2024.2 for a realistic protein-in-water system on a single GPU. The regression appears to be caused by different CUDA NBNxM nonbonded kernel selection heuristics.

Observation:

  • A TPR generated with GROMACS 2024.2 runs at ~860–870 ns/day

  • A TPR generated with GROMACS 2025.4, using identical mdp/topology/coordinates, runs at ~580–630 ns/day

In both cases, the simulation is run using the same GROMACS 2025.4 mdrun binary. The only difference is the GROMACS version used to generate the TPR.

Key differences in log files:

  • 2024.2-generated TPR: Using GPU 8x8 nonbonded short-range kernels

  • 2025.4-generated TPR: Using GPU 8x4 nonbonded short-range kernels
    cluster-pair splitting on

My workaround for now is to generate the TPR with GROMACS 2024.2 and then run it with GROMACS 2025.4, which restores full performance (~865 ns/day), indicating that the slower kernel choice in 2025.4 is not required for correctness. The pipeline is sketched below.
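For reference, this is roughly what the mixed-version pipeline looks like (install paths and input file names are placeholders, not our exact setup):

    # Generate the TPR with the 2024.2 grompp (path assumed; adjust to your install)
    /opt/gromacs/2024.2/bin/gmx grompp -f md.mdp -c npt.gro -p topol.top -o md.tpr

    # ... then run it with the 2025.4 mdrun, fully offloaded to the GPU
    /opt/gromacs/2025.4/bin/gmx mdrun -s md.tpr -deffnm md -nb gpu -pme gpu -update gpu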

Any suggestion on why 2025.4 is forcing an 8x4 kernel size, and how I can force it to use an 8x8?

Thanks in advance for your support.

This is just a change in reporting. There are no kernels to choose from (except for analytical vs tabulated Ewald correction). This must be caused by something else.

Could you run gmx check -s1 2024.tpr -s2 2025.tpr and report the differences here?

Thank you for your reply. I can’t spot much of a difference with the check command (other than the random seed). But I compared the timings in the log files, and it looks like there are meaningful differences. I am reporting the results below. Sorry for the large post.

===== run_2025tpr.log =====

R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G

On 1 MPI rank, each using 12 OpenMP threads

 Activity:              Num    Num      Call    Wall time   Giga-Cycles
                        Ranks  Threads  Count      (s)      total sum    %
--------------------------------------------------------------------------
 Neighbor search           1    12        201      0.310       11.509   6.1
 Launch PP GPU ops.        1    12      39801      1.332       49.489  26.2
 Force                     1    12      20001      0.083        3.076   1.6
 PME GPU mesh              1    12      20001      0.752       27.934  14.8
 Wait Bonded GPU           1    12        201      0.001        0.026   0.0
 Wait GPU NB local         1    12      20001      0.508       18.857  10.0
 Wait GPU state copy       1    12      12603      1.611       59.845  31.6
 NB X/F buffer ops.        1    12       4001      0.080        2.963   1.6
 Write traj.               1    12          2      0.050        1.859   1.0
 GPU constr. setup         1    12          1      0.000        0.008   0.0
 Kinetic energy            1    12       8001      0.195        7.241   3.8
 Rest                                               0.170        6.315   3.3
--------------------------------------------------------------------------
 Total                                              5.091      189.123 100.0
--------------------------------------------------------------------------

===== run_2024tpr.log =====
R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G

On 1 MPI rank, each using 12 OpenMP threads

 Activity:              Num    Num      Call    Wall time   Giga-Cycles
                        Ranks  Threads  Count      (s)      total sum    %
--------------------------------------------------------------------------
 Neighbor search           1    12        201      0.308       11.457   7.8
 Launch PP GPU ops.        1    12      39801      0.538       19.978  13.5
 Force                     1    12      20001      0.015        0.567   0.4
 PME GPU mesh              1    12      20001      0.465       17.264  11.7
 Wait Bonded GPU           1    12        201      0.002        0.067   0.0
 Wait GPU NB local         1    12      20001      0.124        4.591   3.1
 Wait GPU state copy       1    12       5003      2.238       83.141  56.3
 NB X/F buffer ops.        1    12        401      0.009        0.350   0.2
 Write traj.               1    12          2      0.055        2.035   1.4
 GPU constr. setup         1    12          1      0.000        0.010   0.0
 Kinetic energy            1    12       4001      0.141        5.225   3.5
 Rest                                               0.083        3.096   2.1
--------------------------------------------------------------------------
 Total                                              3.978      147.782 100.0
--------------------------------------------------------------------------

             :-) GROMACS - gmx check, 2025.4 (-: 

Executable: /opt/gromacs/2025.4/bin/gmx
Data prefix: /opt/gromacs/2025.4
Working dir: /home/alex/Modeling/MD/Clean_up_PDB/Models/MD2/test_tpr
Command line:
gmx check -s1 md_0_1_2024.tpr -s2 md_0_1_2025.tpr

Note: When comparing run input files, default tolerances are reduced.
Reading file md_0_1_2024.tpr, VERSION 2024.2 (single precision)
Note: file tpx version 133, software tpx version 137
Reading file md_0_1_2025.tpr, VERSION 2025.4 (single precision)
comparing inputrec
inputrec->ld_seed (-268634125 - -1489504335)
comparing mtop topology
comparing force field parameters
comparing molecule types
comparing atoms
comparing t_resinfo
[… “comparing t_resinfo” repeated for each remaining residue …]
comparing InteractionLists
comparing blocka excls[0]
comparing atoms
comparing t_resinfo
comparing InteractionLists
comparing blocka excls[1]
comparing atoms
comparing t_resinfo
comparing InteractionLists
comparing blocka excls[2]
comparing molecule blocks
comparing InteractionLists
comparing groups
comparing intermolecular exclusions
comparing moleculeBlockIndices
comparing flags
comparing box
comparing box_rel
comparing boxv
comparing x
comparing v

GROMACS reminds you: “Your Bones Got a Little Machine” (Pixies)

It seems like there are no differences between the two tpr files, so I don’t see how the tpr file can be the cause.

The run with the 2025 tpr spends much more time launching the PP GPU ops. Can’t it simply be that the machine was busy with other things when you ran this test?

That was my first guess as well. But it is reproducible, to the point where we built it into our pipeline: we generate the TPR file with gmx 2024.2 and then do the MD run with gmx 2025.4. We also tested it on two different nodes:

  1. Node 1: Debian GNU/Linux 13 (trixie), with an Intel i7-9700F and an NVIDIA GeForce RTX 3090 (driver 550.163.01, CUDA 12.4).

  2. Node 2: Debian GNU/Linux 12 (bookworm), with an Intel i9-9900 and an NVIDIA GeForce RTX 3090 (driver 555.42.02, CUDA 12.5).

Thanks again.

Alex

Could you also compare the initial part of the log files? Maybe there are differences in the tpr that gmx check does not pick up.
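For example, something along these lines (bash; the 400-line cutoff is arbitrary, adjust it so it covers the parameter listing at the top of each log):

    # Diff the input-parameter dump at the top of the two logs
    diff <(head -n 400 run_2024tpr.log) <(head -n 400 run_2025tpr.log)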

OK, I found the culprit. In the mdp file, I am not setting an explicit value for nstpcouple; I presume that means it is chosen automatically (other relevant parameters are tau-p = 2 ps and dt = 2 fs).

When I generate the TPR with gmx 2024.2, nstpcouple is assigned a value of 50. When I generate the TPR with gmx 2025.4, nstpcouple receives a value of 5, and that slows down the simulation. Setting nstpcouple=50 explicitly in gmx 2025.4 restores the speed.

I don’t know why the two versions of GROMACS assign different values to nstpcouple.
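For anyone hitting the same issue: the value grompp actually wrote can be read back from the TPRs with gmx dump (file names as in the gmx check output above):

    # gmx dump prints the full input record stored in the tpr,
    # including the nstpcouple value grompp chose
    gmx dump -s md_0_1_2024.tpr | grep nstpcouple
    gmx dump -s md_0_1_2025.tpr | grep nstpcouple

On our files, this shows the 50 vs 5 difference described above. The workaround itself is a single mdp line, nstpcouple = 50, which then carries over into the 2025.4 TPR.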


Good that you found it. This should have shown up in the output of gmx check that you posted above, though. Did you compare the wrong files there?

An optimization I added a few years ago set nstpcouple (and nsttcouple) to too large values in certain cases. This has been corrected in 2025.4. If you want good performance, I would suggest increasing tau_p instead, for which you apparently chose a small value.
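A minimal mdp sketch of that suggestion (values illustrative only, not a recommendation for your system):

    ; a longer pressure-coupling time constant allows a larger coupling interval
    tau_p      = 5.0     ; ps, instead of the 2 ps used above
    ; alternatively, pin the interval explicitly, as found earlier in this thread:
    ; nstpcouple = 50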