Restarting a 1-GPU simulation with 2 GPUs fails

GROMACS version: 2020.1
GROMACS modification: No
Dear gmx community,

I have restarted a simulation on 2 GPUs that was previously running on 1 GPU. I get dd_dump_error*.pdb files and errors, and the job fails. Is this normal? The log shows this output:

Program:     gmx mdrun, version 2020.1
Source file: src/gromacs/domdec/domdec_topology.cpp (line 421)
MPI rank:    0 (out of 8)

Fatal error:
3700 of the 238965 bonded interactions could not be calculated because some
atoms involved moved further apart than the multi-body cut-off distance
(1.21815 nm) or the two-body cut-off distance (1.59775 nm), see option -rdd,
for pairs and tabulated bonds also see option -ddcheck

For more information and tips for troubleshooting, please check the GROMACS
website at http://www.gromacs.org/Documentation/Errors
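
For reference, the restart command was of this form (the file names and rank/thread flags here are placeholders, not my exact ones):

gmx mdrun -deffnm md -cpi md.cpt -ntmpi 8 -ntomp 6 -gpu_id 01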

Thanks a lot!
Sergio

Hi, can you tell me the nstlist value in your mdp file?

Here you go:
nstlist = 10

I'll paste the rest of the mdp file for reference:

integrator               = md
tinit                    = 0
dt                       = 0.002
nsteps                   = 25000000
init_step                = 0
comm-mode                = Linear
nstcomm                  = 100
comm-grps                = all 
nstxout                  = 0
nstvout                  = 0
nstfout                  = 0
nstlog                   = 10000
nstcalcenergy            = -1
nstenergy                = 1000
nstxout-compressed       = 10000
compressed-x-precision   = 1000
energygrps               = 
cutoff-scheme            = Verlet
nstlist                  = 10
ns_type                  = grid
pbc                      = xyz
periodic_molecules       = no
coulombtype              = PME
rcoulomb                 = 1.2
vdw-type                 = Cut-off
rvdw-switch              = 1.0
rvdw                     = 1.2
vdw-modifier             = Force-switch
DispCorr                 = no 
fourierspacing           = 0.15
pme_order                = 4
ewald_rtol               = 1e-05
ewald_geometry           = 3d
epsilon_surface          = 0
tcoupl                   = Berendsen
tc-grps                  = protein waters_or_ions resname_POPC_POPS_CHL1 
tau-t                    = 0.5 0.5 0.5
ref-t                    = 310 310 310
pcoupl                   = Berendsen 
pcoupltype               = semiisotropic
nstpcouple               = -1
tau-p                    = 5.0
compressibility          = 4.5e-5 4.5e-5
ref-p                    = 1.0 1.0
refcoord_scaling         = No 
gen_vel                  = yes 
gen-temp                 = 310
gen-seed                 = -1
constraints              = h-bonds 
constraint-algorithm     = Lincs
continuation             = no
lincs-order              = 4
lincs-iter               = 1
lincs-warnangle          = 30

Thank you!

I would suggest using a smaller nstlist, or setting gen_vel = no (see the snippet below).
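
Concretely, that would mean something like this in your mdp; the continuation line is my assumption for a restart, not something taken from your file:

nstlist                  = 1      ; rebuild the pair list every step
gen_vel                  = no     ; keep velocities from the previous part
continuation             = yes    ; do not re-apply constraints to the starting configuration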

I'm afraid this didn't work, not even with nstlist = 1.

If the tolerance-based pair-list buffer estimate is used (the default; see verlet-buffer-tolerance), nstlist is a free parameter that can be set almost arbitrarily, and mdrun itself tunes it at startup. Hence, changing nstlist in the mdp file may have no effect on the actual value used at runtime.
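
If you want to force a particular value anyway, one option is to override it on the mdrun command line instead of in the mdp, e.g. (file names are placeholders):

gmx mdrun -nstlist 1 -deffnm md -cpi md.cpt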

Any other ideas to fix this?

How long did you equilibrate your system?
Did you try removing gen_vel in this run?
I suggest running an additional short equilibration under NVT conditions with gen_vel before this run, as sketched below. Then you can run the NPT production without gen_vel, even in multiple replicas.
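
A minimal sketch of the mdp changes for that short NVT step, assuming everything else stays as in your file (nsteps is just an example value):

nsteps                   = 50000  ; e.g. 100 ps at dt = 0.002
pcoupl                   = no     ; NVT: no pressure coupling
gen_vel                  = yes
gen-temp                 = 310
continuation             = no

Then for the NPT production, switch to gen_vel = no and continuation = yes with pressure coupling re-enabled.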

The system is well equilibrated (on the ns scale) and stable as long as I don't change the number of GPUs. The problem must be something more algorithmic.

Please check with different GROMACS versions.