System blows up when restarting from a checkpoint

GROMACS version: 2020.3
GROMACS modification: No

Dear All:

I have a strange problem when I continue a simulation from a checkpoint.

I am using GPU acceleration to run atomistic simulations (membrane bilayers with several kinds of lipids).
For this simulation I initially used 1 GPU card. Afterwards, I compiled GROMACS (the same version, 2020.3) so that it can run simulations on several GPU cards within 1 node. I then tried to continue this simulation from the checkpoint using 4 GPU cards with the following command:

gmx mdrun -v -deffnm myfile -cpi myfile.cpt -append no -npme 1 -ntmpi 4 -ntomp 4 -pme gpu -nb gpu -bonded gpu -nstlist 200

and I got the following errors:


Program: gmx mdrun, version 2020.3-plumed-2.7.4-dev-20220218-660d9bc-dirty-unknown
Source file: src/gromacs/domdec/domdec_topology.cpp (line 421)
MPI rank: 0 (out of 4)

Fatal error:
5035 of the 741069 bonded interactions could not be calculated because some
atoms involved moved further apart than the multi-body cut-off distance
(1.81175 nm) or the two-body cut-off distance (1.81175 nm), see option -rdd,
for pairs and tabulated bonds also see option -ddcheck


I am actually familiar with this error, and I tried setting -rdd 1.4 and played with -ntomp and -nt, but this time it did not work.
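For example, one of the variants I tried looked roughly like this (same options as in the command above, just with -rdd added):

gmx mdrun -v -deffnm myfile -cpi myfile.cpt -append no -rdd 1.4 -npme 1 -ntmpi 4 -ntomp 4 -pme gpu -nb gpu -bonded gpu -nstlist 200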
The strange part is that if I continue from the checkpoint using 1 GPU, it runs smoothly. And if I start the simulation from the beginning (not using the checkpoint) with 4 GPUs, it also runs smoothly and I get the acceleration I expected. The error only appears when I try to continue from the checkpoint using 4 GPUs.
If I extract the last snapshot from the trajectory file and regenerate a tpr file to run the simulation, it goes smoothly with either 1 GPU or 4 GPUs (I think this further rules out a problem with the system itself); a rough sketch of what I did is below.
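Roughly, the regeneration looked like this (file names such as md.mdp, topol.top and the output names are placeholders for my actual inputs; the trajectory could also be a .trr, and -dump with a time past the end of the run just grabs the last frame; velocities can be kept by also passing the checkpoint to grompp via -t):

gmx trjconv -f myfile.xtc -s myfile.tpr -dump 999999 -o last_frame.gro

gmx grompp -f md.mdp -c last_frame.gro -t myfile.cpt -p topol.top -o myfile_restart.tpr

gmx mdrun -v -deffnm myfile_restart -npme 1 -ntmpi 4 -ntomp 4 -pme gpu -nb gpu -bonded gpu -nstlist 200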

This occurs for all the systems I have tried (4 in total), so it seems unlikely to be a problem with my system setup.

Thanks for reading this message, and I look forward to hearing your suggestions. I really appreciate any help you can provide. Thanks in advance.

With my best regards,
Ruo-Xu

This is likely caused by a bug in 2020.3 which has been fixed in 2020.4.

Dear Hess:

Thanks very much for your reply. Could you elaborate? What kind of bug is it? I am worried about how it may have affected my simulations (the part which I have already run).

Thanks again for your help.

With my best regards,
Ruo-Xu

I see now that the release note is rather lacking: the effects of the bug are not mentioned and there is no issue number. I think no issue was filed for this bug at all. But since the bug caused incorrect memory access, I would expect all runs affected by it to crash, so it is unlikely to have caused incorrect results.

As the release note says, runs with a single domain (e.g. single GPU) were not affected.