Missing bonded calculation - running simulation on many nodes, rdd option

GROMACS version: 2020.2
GROMACS modification: No
I am trying to run my simulation on 864 cores (36 nodes), but I get the following error right at the start of the run.

WARNING: This run will generate roughly 257397 Mb of data

Not all bonded interactions have been properly assigned to the domain decomposition cells
A list of missing interactions:
Bond of 297000 missing 3191
Angle of 1274400 missing 15691
Ryckaert-Bell. of 1765800 missing 30665
LJ-14 of 1771200 missing 22474
Molecule type ‘mgdg’
the first 10 missing interactions, except for exclusions:
Ryckaert-Bell. atoms 112 114 116 118 global 886 888 890 892
LJ-14 atoms 112 118 global 886 892
Angle atoms 114 116 118 global 888 890 892

Fatal error:
72021 of the 5630310 bonded interactions could not be calculated because some
atoms involved moved further apart than the multi-body cut-off distance
(0.927311 nm) or the two-body cut-off distance (1.31595 nm), see option -rdd,
for pairs and tabulated bonds also see option -ddcheck

For more information and tips for troubleshooting, please check the GROMACS
website at http://www.gromacs.org/Documentation/Errors

application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0

I can think of a few ways to handle this problem, and I have some questions:
1. I can run my simulation on fewer cores, for example 432 cores on 18 nodes. Will that prevent this error? Does increasing parallelization mean a higher chance of such errors? I would have to wait longer, but the simulation would complete if I used fewer cores and nodes.
2. I can increase -rdd, for example to 1.4 or 1.6. The GROMACS documentation says: “Particles beyond the non-bonded cut-off are only communicated when they have missing bonded interactions; this means that the extra cost is minor and nearly independent of the value of -rdd.”
So what is the effect of increasing -rdd on my system? Will the precision of the bonded and non-bonded calculations stay the same? Does this "extra cost" mean CPU hours? How many more CPU hours, as a percentage, will I need if I increase -rdd?
3. Why does this problem occur with the domain decomposition algorithm?
4. Is it safe to use -noddcheck? The documentation says: "When inter charge-group bonded interactions are beyond the bonded cut-off distance, terminates with an error message." I am afraid that I will lose some interactions.

Thanks in advance

Hi Jakub,

I guess you have already read through the manual.

  1. You are right: the more nodes you use, the harder it is to get a good domain decomposition. The practical limit is around 100 atoms per core, so your system should contain at least around 100,000 atoms, preferably more, for your setup with 864 cores to make sense. It is often more useful to improve sampling and have more atoms per core by, e.g., running two simulations with half the nodes each.

  2. The precision is not affected by this. The change in compute cost depends strongly on the system you are simulating and the compute architecture you are running on, so I cannot give you a precise number, just the hunch that changing the simulation strategy might be more beneficial than changing -rdd.

  3. There are two possible culprits: either your system ends up with too few atoms per core, or something in your topology setup is off, so that you have unusual, very long bonds in your system. This typically happens if atoms get mixed up after some renaming.

  4. Keep ddcheck
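
To put rough numbers on point 1, here is a minimal Python sketch of the atoms-per-core arithmetic. It assumes the ~100 atoms per core rule of thumb stated above; the function names are made up for illustration and are not part of GROMACS.

```python
# Rule of thumb from point 1: roughly 100 atoms per core is the lower limit.
# This value is an assumption taken from the reply, not a GROMACS setting.
MIN_ATOMS_PER_CORE = 100


def min_atoms_for(cores, limit=MIN_ATOMS_PER_CORE):
    """Smallest system size (in atoms) that keeps every core at or above the limit."""
    return cores * limit


def atoms_per_core(n_atoms, cores):
    """Average number of atoms each core handles for a given system size."""
    return n_atoms / cores


# Core counts mentioned in the question: 864 (36 nodes) and 432 (18 nodes).
for cores in (864, 432):
    print(f"{cores} cores -> at least {min_atoms_for(cores):,} atoms")
```

For 864 cores this gives a floor of 86,400 atoms, which is where the "around 100,000 atoms" estimate comes from; halving the core count halves the floor to 43,200. To check your own run, pass your actual atom count to `atoms_per_core`.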

Overall, I would try running a few steps of your simulation on a much smaller machine (e.g. your workstation) and verify that things look good topology-wise. If it makes sense in your case, I would also advise simply using fewer resources for your simulation; you will get better resource utilisation.

Hope that helped a bit,
Christian

Thank you so much Christian. You helped me a lot.