Domain Decomposition issue upgrading from v2016.2 to v2021.1

GROMACS version: 2021.1 and 2016.2
GROMACS modification: No

Hi all,

I am working on installing GROMACS 2021.1 on a new cluster at my university. On our older cluster we used GROMACS 2016.2, so I am benchmarking 2021.1 performance on the new cluster with systems that were previously run under 2016.2 on the old cluster.

However, something is going wrong with the domain decomposition on the new install that I cannot figure out. I am simulating a very small system, a tripeptide in explicit water in a 4x4x4 nm box. Both systems were equilibrated identically. I use the same mdrun command for both simulations, given below (nothing special, I just name the outputs instead of using -deffnm):

${MPIRUN} ${GMXBIN}/gmx_mpi mdrun -s md-equilibration/pep_md.tpr -o md-equilibration/pep_ion_md.trr -x md-equilibration/pep_ion_md.xtc -c md-equilibration/pep_ion_md_100ns.gro -g md-equilibration/md.log -e md-equilibration/md.edr -cpo md-equilibration/1.cpt

The sample output below is an excerpt from the mdrun log file of each version, showing some of the domain decomposition information that must be behind the failure. The portions of the two outputs that differ, and which explain why my simulations will not run on the new cluster, are the lines reporting the minimum cell size and the maximum allowed number of cells.

2016:

Initializing Domain Decomposition on 48 ranks
Dynamic load balancing: auto
Initial maximum inter charge-group distances:
two-body bonded interactions: 0.382 nm, LJ-14, atoms 12 22
multi-body bonded interactions: 0.382 nm, Proper Dih., atoms 12 22
Minimum cell size due to bonded interactions: 0.421 nm
Guess for relative PME load: 0.21
Will use 36 particle-particle and 12 PME only ranks
This is a guess, check the performance at the end of the log file
Using 12 separate PME ranks, as guessed by mdrun
Scaling the initial minimum size with 1/0.8 (option -dds) = 1.25
Optimizing the DD grid for 36 cells with a minimum initial size of 0.526 nm
The maximum allowed number of cells is: X 7 Y 7 Z 7
Domain decomposition grid 4 x 3 x 3, separate PME ranks 12
PME domain decomposition: 4 x 3 x 1
Interleaving PP and PME ranks
This rank does only particle-particle work.

This output shows that I should be able to decompose the system onto up to a 7 x 7 x 7 grid of cells, which means (if I were not charged for time on this cluster) I could throw many more than 48 ranks at this job. However, when I go to 2021.1:
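(For reference, that limit follows from dividing the 4 nm box edge by the scaled minimum cell size: 4.0 nm / 0.526 nm ≈ 7.6, which rounds down to 7 cells per dimension, i.e. up to 7 x 7 x 7 = 343 domains.)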

Initializing Domain Decomposition on 48 ranks
Dynamic load balancing: auto
Using update groups, nr 2164, average size 4.0 atoms, max. radius 0.078 nm
Minimum cell size due to atom displacement: 1.447 nm
Minimum cell size due to bonded interactions: 0.456 nm
Scaling the initial minimum size with 1/0.8 (option -dds) = 1.25
Guess for relative PME load: 0.20
Will use 36 particle-particle and 12 PME only ranks
This is a guess, check the performance at the end of the log file
Using 12 separate PME ranks, as guessed by mdrun
Optimizing the DD grid for 36 cells with a minimum initial size of 1.462 nm
The maximum allowed number of cells is: X 2 Y 2 Z 2


Program: gmx mdrun, version 2021.1-UNCHECKED
Source file: src/gromacs/domdec/domdec.cpp (line 2262)
MPI rank: 0 (out of 48)

Fatal error:
There is no domain decomposition for 36 ranks that is compatible with the
given box and a minimum cell size of 1.46162 nm
Change the number of ranks or mdrun option -rdd or -dds
Look in the log file for details on the domain decomposition

As you can see from comparing those lines, the number of domains I can have drops from 343 to 8 (4.0 nm / 1.462 nm ≈ 2.7, so only 2 cells fit along each box edge), and in 2021.1 the minimum cell size is determined by atom displacement, whereas in 2016.2 it is determined by bonded interactions. I have since tried other, larger systems, and every system I have tested has been restricted to a maximum of 2x2x2 cells. Changing the -rdd, -dds, -rcon, etc. flags has not resolved the issue (an example invocation is below). Any help would be greatly appreciated!
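For reference, the kind of override I tried looked roughly like this; the distances and scaling factor here are illustrative, not the exact values I used:

${MPIRUN} ${GMXBIN}/gmx_mpi mdrun -s md-equilibration/pep_md.tpr -rdd 1.0 -dds 0.9 -rcon 1.0 ...

(-rdd and -rcon set the bonded and constraint communication distances, and -dds scales the initial minimum cell size.)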

Best,

Brian

I would just like to add 2 things to this issue:

  1. I am equilibrating with GROMACS 5.1.2, but that works fine with GROMACS 2016.2, so I hope that is not the issue.
  2. I tried reducing the emtol variable in the energy minimization mdp file by 20%, in case the system had simply been poorly equilibrated (an illustrative snippet is below this list). However, the output of the attempted simulation in GROMACS 2021.1 was the same as in the original post, so the issue does not seem to be equilibration-related.
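A minimal sketch of that change, assuming the minimization mdp started from emtol = 1000.0 kJ mol-1 nm-1 (the actual value and other settings in my file may differ):

; em.mdp (excerpt, values illustrative)
integrator = steep      ; steepest-descent energy minimization
emtol      = 800.0      ; reduced by 20% from 1000.0 kJ mol-1 nm-1
emstep     = 0.01       ; initial step size in nm
nsteps     = 50000      ; maximum number of minimization steps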

I did find this resource (https://mailman-1.sys.kth.se/pipermail/gromacs.org_gmx-users/2018-February/118515.html) where another user was having a similar issue, but that one was related to P-LINCS, which does not seem to be the case for me. I am having trouble finding the source of this atom displacement; if I can reduce that quantity, I should be able to decompose my system further.

Thanks.

Brian,

Please share the full log files. I suspect the change in minimum domain size is due to a combination of a few algorithmic changes: the use of update groups, and changes in the pair search setup and pair list update frequency. These should all be indicated in the log.
If the latter are the main contributors, one thing you can do is reduce nstlist, which should reduce rlist. Combined with more OpenMP threads per rank (which work better than in v2016), that should hopefully compensate for the smaller maximum number of ranks you can use; a rough sketch is below.
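For example, something along these lines, where the rank count (set via your MPI launcher) and thread count are only illustrative and should be matched to your node layout:

${MPIRUN} ${GMXBIN}/gmx_mpi mdrun -s md-equilibration/pep_md.tpr -nstlist 10 -ntomp 4 ...

Here -nstlist overrides the mdp nstlist value (shortening the pair list lifetime and thus rlist) and -ntomp sets the number of OpenMP threads per MPI rank.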

Cheers,
Szilárd

Hi Szilárd,

Somehow I never received a notification of your response and only saw it when I logged in to post a new question. Sorry about not posting the full log files.

I did solve this issue by simply increasing the box edge from 4 nm to 5 nm, and I do not see the problem with larger systems (roughly what I did is sketched below). Those systems are still very small, so the change did not affect speed much. I should have logged in to report that I had resolved the issue regardless of whether anyone had replied.
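Roughly the commands I used to enlarge the box and re-solvate before re-equilibrating; the file names are just placeholders:

gmx editconf -f peptide.gro -o peptide_5nm.gro -c -box 5 5 5
gmx solvate -cp peptide_5nm.gro -cs spc216.gro -o peptide_solv.gro -p topol.top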

Thank you for your suggestions and I will test them when I attempt to simulate some smaller systems again.

Brian