GROMACS version: 2024.2 + 2025.4
GROMACS modification: No
I am observing a significant performance regression in GROMACS 2025.4 compared to 2024.2 for a realistic protein-in-water system on a single GPU. The regression appears to be caused by different CUDA NBNxM nonbonded kernel selection heuristics.
Observation:
- A TPR generated with GROMACS 2024.2 runs at ~860–870 ns/day
- A TPR generated with GROMACS 2025.4, using identical mdp/topology/coordinates, runs at ~580–630 ns/day
In both cases, the simulation is run using the same GROMACS 2025.4 mdrun binary. The only difference is the GROMACS version used to generate the TPR.
Key differences in log files:
- 2024.2-generated TPR: Using GPU 8x8 nonbonded short-range kernels
- 2025.4-generated TPR: Using GPU 8x4 nonbonded short-range kernels
  cluster-pair splitting on
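
For anyone who wants to reproduce the comparison, here is a minimal sketch of how the kernel line and the throughput can be pulled out of two md.log files. It assumes the log contains a line of the form "Using GPU NxM nonbonded short-range kernels" (as quoted above) and a final "Performance:" line whose first number is ns/day; the function names and the sample text are hypothetical and only for illustration.

```python
import re

# Assumption: the kernel line looks like the one quoted from the logs above.
KERNEL_RE = re.compile(r"Using GPU (\d+x\d+) nonbonded short-range kernels")
# Assumption: the first number on the "Performance:" line is ns/day.
PERF_RE = re.compile(r"^\s*Performance:\s+([\d.]+)", re.MULTILINE)


def summarize_log(text):
    """Return the GPU kernel layout and ns/day found in md.log text."""
    kernel = KERNEL_RE.search(text)
    perf = PERF_RE.search(text)
    return {
        "kernel": kernel.group(1) if kernel else None,
        "ns_per_day": float(perf.group(1)) if perf else None,
    }


def summarize_log_file(path):
    """Read an md.log file from disk and summarize it."""
    with open(path, errors="replace") as fh:
        return summarize_log(fh.read())


if __name__ == "__main__":
    # Illustrative stand-in for a log excerpt, not real output.
    sample = (
        "Using GPU 8x4 nonbonded short-range kernels\n"
        "                 (ns/day)    (hour/ns)\n"
        "Performance:      600.000        0.040\n"
    )
    print(summarize_log(sample))
```

Running `summarize_log_file` on the log from each TPR gives a side-by-side view of kernel layout versus ns/day, which makes it easy to check that the slowdown tracks the 8x4 selection.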
My current workaround is to generate the TPR with GROMACS 2024.2 and then run it with GROMACS 2025.4, which restores full performance (~865 ns/day). This indicates that the slower kernel choice made for the 2025.4-generated TPR is not required for correctness.
Any suggestions on why 2025.4 selects the 8x4 kernel layout for its own TPR, and how I can force it to use 8x8 instead?
Thanks in advance for your support.