Sudden change in optimal PP/PME balance across many simulations

GROMACS version: 2020.2
GROMACS modification: No

I am running multiple (long since diverged) simulations of the same large system: an oligomeric protein in explicit water. GROMACS is compiled with thread-MPI, and each simulation runs on a separate compute node of a cluster. The compute nodes have identical hardware, so all the simulations are launched with the same PP/PME setup (selected with gmx tune_pme) and run at nearly identical rates:

gmx mdrun -ntmpi 40 -ntomp 2 -npme 12 -pin on -tunepme no -dlb yes
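
For reference, the tuning itself was done with something along these lines (the .tpr filename is just a placeholder):

# Let tune_pme time mdrun with varying numbers of PME-only ranks
gmx tune_pme -mdrun 'gmx mdrun' -ntmpi 40 -s topol.tpr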

A couple of days ago there was an issue with the shared file system on the cluster, and some linked libraries ended up getting recompiled; for example, the whole GCC suite, including libgomp. After I restarted the simulations, they all ran at about 2/3 of the previous rate. GROMACS itself has been recompiled multiple times with multiple versions of GCC since then, but the simulations continued to run at 2/3 speed with 12 of 40 ranks doing PME.

So I ran gmx tune_pme again and found that the optimal PP/PME setup is now 16 of 40 ranks doing PME, which brings the simulation rate to about 4/5 of what it had previously been. Again, this change happened at the same moment across multiple independent simulations, so it has nothing to do with the particular configuration of any one simulated system.
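
That is, the launch line is unchanged except for the PME rank count:

gmx mdrun -ntmpi 40 -ntomp 2 -npme 16 -pin on -tunepme no -dlb yes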

Does anyone have any idea what change to an external library (a swapped library version, or perhaps a library becoming unavailable and GROMACS falling back on some other code path) could cause something like this? What output might help diagnose this sudden change in the optimal PP/PME balance?

I would really appreciate any input!

Roman

My guess would be the FFT. Make sure the FFT library is as capable as before (e.g. that it uses AVX2). I would suggest looking at the log file to see which kernel takes the most time (the performance breakdown at the end). Also check whether there is any performance “NOTE”.
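
For example, something like this (assuming the default log name md.log):

# Which SIMD level and FFT library does the binary report?
grep -E 'SIMD instructions|FFT library' md.log
# Any performance notes printed by mdrun?
grep -i 'NOTE' md.log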

The FFT library would make sense, but I had GROMACS build its own FFTW3 library, so the same library (recompiled several times by now, but from the same source) has been used throughout:

SIMD instructions: AVX_512
FFT library: fftw-3.3.8-sse2-avx-avx2-avx2_128-avx512
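
The build used roughly the following CMake flags (from memory, so take the exact set with a grain of salt):

# Have GROMACS download and build its own FFTW, with AVX-512 SIMD kernels enabled
cmake .. -DGMX_BUILD_OWN_FFTW=ON -DGMX_SIMD=AVX_512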

However, I did find something in the “accounting” sections at the end of the log. The megaflops accounting is literally identical, but the real cycle / time accounting isn’t. It used to look like this:

[cycle and time accounting table from the old log]

But now it looks like this:

[cycle and time accounting table from the new log]

So the PME FFT is requiring more than twice the number of cycles and, accordingly, taking much longer. But it’s being done by the same library! Would failing to use AVX (of any flavor) account for this? If so, would it be because some library that FFTW3 was able to use before is no longer available? Or is this something unrelated to vectorization?
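
In case it helps, this is how I am sanity-checking the nodes and the binary now (binary name and paths are from our setup):

# Does the CPU still expose AVX-512 to user space?
grep -o -m1 avx512f /proc/cpuinfo
# Which shared libraries (e.g. libgomp) does the gmx binary resolve to after the rebuild?
ldd $(which gmx)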