Performance regression in 2025.4: suboptimal CUDA NBNxM kernel selection vs 2024.2

GROMACS version: 2024.2 + 2025.4
GROMACS modification: No

I am observing a significant performance regression in GROMACS 2025.4 compared to 2024.2 for a realistic protein-in-water system on a single GPU. The regression appears to be caused by different CUDA NBNxM nonbonded kernel selection heuristics.

Observation:

  • A TPR generated with GROMACS 2024.2 runs at ~860–870 ns/day

  • A TPR generated with GROMACS 2025.4, using identical mdp/topology/coordinates, runs at ~580–630 ns/day

In both cases, the simulation is run using the same GROMACS 2025.4 mdrun binary. The only difference is the GROMACS version used to generate the TPR.

Key differences in log files:

  • 2024.2-generated TPR: Using GPU 8x8 nonbonded short-range kernels

  • 2025.4-generated TPR: Using GPU 8x4 nonbonded short-range kernels
    cluster-pair splitting on
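For anyone who wants to reproduce the comparison, here is a minimal sketch of how the kernel line and the performance figure can be pulled out of two md.log files. It assumes the log contains the "Using GPU ... nonbonded short-range kernels" line quoted above and the usual "Performance:" line with ns/day as its first number. The function name `summarize_log` and the log file names passed on the command line are hypothetical.

```python
import re
import sys

# Matches the kernel report line quoted in this post, e.g. "Using GPU 8x4 nonbonded ..."
KERNEL_RE = re.compile(r"Using GPU (\d+x\d+) nonbonded short-range kernels")
# Assumption: the final "Performance:" line lists ns/day as its first number
PERF_RE = re.compile(r"^Performance:\s+([\d.]+)", re.MULTILINE)


def summarize_log(text):
    """Return the reported kernel layout, splitting flag and ns/day from md.log text."""
    kernel = KERNEL_RE.search(text)
    perf = PERF_RE.search(text)
    return {
        "kernel": kernel.group(1) if kernel else None,
        "splitting": "cluster-pair splitting on" in text,
        "ns_per_day": float(perf.group(1)) if perf else None,
    }


if __name__ == "__main__":
    # Hypothetical usage: python summarize_log.py md_2024tpr.log md_2025tpr.log
    for path in sys.argv[1:]:
        with open(path) as fh:
            print(path, summarize_log(fh.read()))
```

Running it on both logs side by side makes it easy to check whether the kernel line is the only thing that differs between the two runs.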

My current workaround is to generate the TPR with GROMACS 2024.2 and then run it with GROMACS 2025.4, which restores full performance (~865 ns/day). This indicates that the slower kernel choice in 2025.4 is not required for correctness.

Any suggestions on why 2025.4 is forcing an 8x4 kernel size, and how I can force it to use 8x8?

Thanks in advance for your support.

The different kernel line is just a change in reporting. There are no kernels to choose from (except for analytical vs. tabulated Ewald correction). The slowdown must be caused by something else.

Could you run gmx check -s1 2024.tpr -s2 2025.tpr and report the differences here?
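If the gmx check output is hard to read, a complementary option is to dump both TPR files to text and diff them. The following is only a sketch under stated assumptions: it relies on `gmx dump -s` being available on the PATH, the helper names `dump_tpr` and `changed_lines` are hypothetical, and the file names are the same placeholders as in the command above.

```python
import difflib
import subprocess


def dump_tpr(path, gmx="gmx"):
    """Return the text printed by `gmx dump -s <path>` (requires GROMACS on PATH)."""
    result = subprocess.run(
        [gmx, "dump", "-s", path],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def changed_lines(text_a, text_b):
    """Return only the removed ('-') and added ('+') lines between two text dumps."""
    diff = list(difflib.unified_diff(
        text_a.splitlines(), text_b.splitlines(),
        "2024.tpr", "2025.tpr", lineterm="", n=0,
    ))
    # Skip the two file-header lines; "@@" hunk markers are filtered out below
    return [line for line in diff[2:] if line.startswith(("+", "-"))]


if __name__ == "__main__":
    # Placeholder file names, matching the gmx check command above
    for line in changed_lines(dump_tpr("2024.tpr"), dump_tpr("2025.tpr")):
        print(line)
```

The dumps are long, so expect some noise; the useful part is any run parameter that differs even though the mdp files are identical.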