Another MPI rank encountered an exception

GROMACS version: 2020.2
GROMACS modification: Yes/No
Command line: srun gmx_mpi mdrun -maxh 24 -deffnm XXXX -cpi XXXX.cpt -append
The above command line fails and returns the following errors:

Program: gmx mdrun, version 2020.2
Source file: src/gromacs/mdrunutility/handlerestart.cpp (line 681)
Function: std::tuple<gmx::StartingBehavior, std::unique_ptr<t_fileio, gmx::functor_wrapper<t_fileio, gmx::closeLogF
ile> > > gmx::handleRestart(bool, MPI_Comm, const gmx_multisim_t*, gmx::AppendingBehavior, int, t_filenm*)
MPI rank: 123 (out of 160)

Communication (parallel processing) problem:


Program: gmx mdrun, version 2020.2
Source file: src/gromacs/mdrunutility/handlerestart.cpp (line 681)
Function: std::tuple<gmx::StartingBehavior, std::unique_ptr<t_fileio, gmx::functor_wrapper<t_fileio, gmx::closeLogF
ile> > > gmx::handleRestart(bool, MPI_Comm, const gmx_multisim_t*, gmx::AppendingBehavior, int, t_filenm*)
MPI rank: 43 (out of 160)

Communication (parallel processing) problem:



Program: gmx mdrun, version 2020.2
Source file: src/gromacs/mdrunutility/handlerestart.cpp (line 681)
Function: std::tuple<gmx::StartingBehavior, std::unique_ptr<t_fileio, gmx::functor_wrapper<t_fileio, gmx::closeLogF
ile> > > gmx::handleRestart(bool, MPI_Comm, const gmx_multisim_t*, gmx::AppendingBehavior, int, t_filenm*)
MPI rank: 126 (out of 160)

…etc

application called MPI_Abort(MPI_COMM_WORLD, 1) - process 9
slurmstepd: error: *** STEP 1058046.0 ON node033 CANCELLED AT 2020-12-09T23:24:43 ***
srun: error: node146: tasks 40-59: Killed
srun: Terminating job step 1058046.0
srun: error: node033: tasks 0-19: Killed
srun: error: node271: tasks 120-139: Killed
srun: error: node160: tasks 60-79: Killed
srun: error: node145: tasks 20-39: Killed
srun: error: node272: tasks 140-159: Killed
srun: error: node191: tasks 100-119: Killed

Thank you in advance!

I also encountered this problem. Have you found a solution? Thank you so much in advance.

In my case, I was getting this error with GROMACS/2020.3 when restarting from a checkpoint file. Using a different checkpoint file (run.cpt instead of run_prev.cpt in my case) solved it. Maybe this helps.
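
To automate that workaround, here is a minimal sketch of a helper that picks which checkpoint to pass to -cpi. It assumes the usual GROMACS naming, where XXXX.cpt is the latest checkpoint and XXXX_prev.cpt is the previous backup; the function name pick_checkpoint is my own, not part of GROMACS.

```shell
# Hypothetical helper: choose the checkpoint file to restart from.
# Prefers the latest checkpoint (deffnm.cpt) and falls back to the
# previous backup (deffnm_prev.cpt) only if the latest is missing/empty.
pick_checkpoint() {
    deffnm="$1"
    if [ -s "${deffnm}.cpt" ]; then
        echo "${deffnm}.cpt"
    elif [ -s "${deffnm}_prev.cpt" ]; then
        echo "${deffnm}_prev.cpt"
    else
        echo "no checkpoint found for ${deffnm}" >&2
        return 1
    fi
}
```

It could then be used in the submission script like:

```shell
srun gmx_mpi mdrun -maxh 24 -deffnm run -cpi "$(pick_checkpoint run)" -append
```

Note that -s only checks that the file exists and is non-empty; it cannot detect a truncated or corrupt checkpoint, so if the restart still aborts, trying the other .cpt file by hand (as above) is the next step.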