GROMACS version: 2020.2
GROMACS modification: No
I am running multiple (long diverged by now) simulations of the same large system: an oligomeric protein in explicit water. GROMACS is compiled for thread-MPI and each simulation runs on a separate compute node on a cluster. The compute nodes have exactly the same hardware, so all the simulations are launched with the same PP/PME setup (selected with gmx tune_pme) and run at nearly identical rates.
gmx mdrun -ntmpi 40 -ntomp 2 -npme 12 -pin on -tunepme no -dlb yes
A couple of days ago there was an issue with the shared file system on the cluster, and some linked libraries ended up getting recompiled, for example the whole GCC suite, including libgomp. After I restarted the simulations, they all ran at about 2/3 of the previous rate. Since then I have recompiled GROMACS multiple times with multiple versions of GCC, but the simulations continue to run at 2/3 speed with 12 of 40 ranks doing PME.
So I ran tune_pme again, and figured out that the optimal PP/PME setup is now 16 of 40 ranks doing PME, which gets the simulation rate to about 4/5 of what it had previously been. Again, this change happened instantaneously across multiple simulations, so it has nothing to do with a particular configuration of the simulated system.
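To make the two setups easier to compare, here is a small sketch of the rank bookkeeping implied by the mdrun command above. The helper name describe_split is just an illustration, not part of GROMACS; the numbers (40 ranks, 2 OpenMP threads each, 12 or 16 PME ranks) are the ones from this post.

```shell
#!/bin/sh
# Rank/thread bookkeeping for: gmx mdrun -ntmpi 40 -ntomp 2 -npme N
# describe_split is a hypothetical helper, not a GROMACS tool.
ntmpi=40   # thread-MPI ranks per node
ntomp=2    # OpenMP threads per rank

describe_split() {
    npme=$1                          # ranks dedicated to PME
    npp=$((ntmpi - npme))            # remaining ranks do particle-particle work
    pct=$((100 * npme / ntmpi))      # integer percentage of ranks doing PME
    echo "npme=$npme npp=$npp pme_share=${pct}% threads=$((ntmpi * ntomp))"
}

describe_split 12   # old optimum
describe_split 16   # new optimum
```

So the optimum moved from 30% to 40% of the ranks doing PME, with the total thread count unchanged at 80.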
Does anyone have any idea what change to an external library, or swap of library version, or maybe a library becoming unavailable and GROMACS falling back on some other code, could cause something like this? What output might help diagnose this sudden change in optimal PP/PME balance?
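If it helps, this is roughly how the PP/PME load lines could be pulled out of an old and a new md.log for a side-by-side comparison. It is only a sketch: the function name pme_balance_lines and the file names in the usage comment are made up, and the grep patterns assume the log contains the usual "Average PME mesh/force load" and "PP/PME imbalance" lines. The build details printed by gmx --version and the output of ldd on the gmx binary (to see which libgomp is actually picked up) can be posted alongside.

```shell
#!/bin/sh
# Print the PP/PME load-balance summary lines from one mdrun log.
# Assumption: the log uses the usual GROMACS wording for these lines.
pme_balance_lines() {
    log=$1
    grep -E 'Average PME mesh/force load|PP/PME imbalance' "$log"
}

# Example usage (hypothetical file names):
#   pme_balance_lines md_before.log
#   pme_balance_lines md_after.log
```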
I would really appreciate any input!