Restarting a 1-GPU simulation with 2 GPUs fails

GROMACS version: 2020.1
GROMACS modification: No
Dear gmx community,

I have restarted a simulation on 2 GPUs that was previously running on 1 GPU. I get dd_dump_error*.pdb files and errors, and the job fails. Is this normal? The log shows this output:

Program:     gmx mdrun, version 2020.1
Source file: src/gromacs/domdec/domdec_topology.cpp (line 421)
MPI rank:    0 (out of 8)

Fatal error:
3700 of the 238965 bonded interactions could not be calculated because some
atoms involved moved further apart than the multi-body cut-off distance
(1.21815 nm) or the two-body cut-off distance (1.59775 nm), see option -rdd,
for pairs and tabulated bonds also see option -ddcheck

For more information and tips for troubleshooting, please check the GROMACS
website at http://www.gromacs.org/Documentation/Errors
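
For reference, the restart command was of this form (the file names and rank/thread flags here are placeholders, not my exact ones):

gmx mdrun -deffnm md -cpi md.cpt -ntmpi 8 -ntomp 6 -gpu_id 01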

Thanks a lot!
Sergio

Hi, can you tell me the nstlist value in your mdp file?

Here you go:
nstlist = 10

I'll paste the rest of the mdp file for reference:

integrator               = md
tinit                    = 0
dt                       = 0.002
nsteps                   = 25000000
init_step                = 0
comm-mode                = Linear
nstcomm                  = 100
comm-grps                = all 
nstxout                  = 0
nstvout                  = 0
nstfout                  = 0
nstlog                   = 10000
nstcalcenergy            = -1
nstenergy                = 1000
nstxout-compressed       = 10000
compressed-x-precision   = 1000
energygrps               = 
cutoff-scheme            = Verlet
nstlist                  = 10
ns_type                  = grid
pbc                      = xyz
periodic_molecules       = no
coulombtype              = PME
rcoulomb                 = 1.2
vdw-type                 = Cut-off
rvdw-switch              = 1.0
rvdw                     = 1.2
vdw-modifier             = Force-switch
DispCorr                 = no 
fourierspacing           = 0.15
pme_order                = 4
ewald_rtol               = 1e-05
ewald_geometry           = 3d
epsilon_surface          = 0
tcoupl                   = Berendsen
tc-grps                  = protein waters_or_ions resname_POPC_POPS_CHL1 
tau-t                    = 0.5 0.5 0.5
ref-t                    = 310 310 310
pcoupl                   = Berendsen 
pcoupltype               = semiisotropic
nstpcouple               = -1
tau-p                    = 5.0
compressibility          = 4.5e-5 4.5e-5
ref-p                    = 1.0 1.0
refcoord_scaling         = No 
gen_vel                  = yes 
gen-temp                 = 310
gen-seed                 = -1
constraints              = h-bonds 
constraint-algorithm     = Lincs
continuation             = no
lincs-order              = 4
lincs-iter               = 1
lincs-warnangle          = 30

Thank you!

I would suggest using a smaller nstlist, or setting gen_vel = no (see the snippet below).
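
Concretely, that would mean something like this in your mdp; the continuation line is my assumption for a restart, not something taken from your file:

nstlist                  = 1      ; rebuild the pair list every step
gen_vel                  = no     ; keep velocities from the previous part
continuation             = yes    ; do not re-apply constraints to the starting configuration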

I'm afraid this didn't work, not even with nstlist = 1.

If the tolerance-based pair-list buffer estimate is used (the default; see verlet-buffer-tolerance), nstlist is a free parameter that can be set almost arbitrarily, and mdrun itself tunes it at startup. Hence, changing nstlist in the mdp file may have no effect on the actual value used at runtime.
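
If you want to force a particular value anyway, one option is to override it on the mdrun command line instead of in the mdp, e.g. (file names are placeholders):

gmx mdrun -nstlist 1 -deffnm md -cpi md.cpt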

Any other ideas to fix this?

How long did you equilibrate your system?
Did you try removing gen_vel in this run?
I suggest running an additional short equilibration under NVT conditions with gen_vel before this run, as sketched below. Then you can run the NPT production without gen_vel, even in multiple replicas.
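
A minimal sketch of the mdp changes for that short NVT step, assuming everything else stays as in your file (nsteps is just an example value):

nsteps                   = 50000  ; e.g. 100 ps at dt = 0.002
pcoupl                   = no     ; NVT: no pressure coupling
gen_vel                  = yes
gen-temp                 = 310
continuation             = no

Then for the NPT production, switch to gen_vel = no and continuation = yes with pressure coupling re-enabled.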

The system is well equilibrated (on the ns scale) and stable as long as I don't change the number of GPUs. The problem must be something more algorithmic.

Please check with different GROMACS versions.