Another MPI rank encountered an exception

GROMACS version: 2020.2
GROMACS modification: Yes/No
Command line: srun gmx_mpi mdrun -maxh 24 -deffnm XXXX -cpi XXXX.cpt -append
The command above fails with the following errors:

Program: gmx mdrun, version 2020.2
Source file: src/gromacs/mdrunutility/handlerestart.cpp (line 681)
Function: std::tuple<gmx::StartingBehavior, std::unique_ptr<t_fileio, gmx::functor_wrapper<t_fileio, gmx::closeLogFile> > > gmx::handleRestart(bool, MPI_Comm, const gmx_multisim_t*, gmx::AppendingBehavior, int, t_filenm*)
MPI rank: 123 (out of 160)

Communication (parallel processing) problem:


Program: gmx mdrun, version 2020.2
Source file: src/gromacs/mdrunutility/handlerestart.cpp (line 681)
Function: std::tuple<gmx::StartingBehavior, std::unique_ptr<t_fileio, gmx::functor_wrapper<t_fileio, gmx::closeLogFile> > > gmx::handleRestart(bool, MPI_Comm, const gmx_multisim_t*, gmx::AppendingBehavior, int, t_filenm*)
MPI rank: 43 (out of 160)

Communication (parallel processing) problem:



Program: gmx mdrun, version 2020.2
Source file: src/gromacs/mdrunutility/handlerestart.cpp (line 681)
Function: std::tuple<gmx::StartingBehavior, std::unique_ptr<t_fileio, gmx::functor_wrapper<t_fileio, gmx::closeLogFile> > > gmx::handleRestart(bool, MPI_Comm, const gmx_multisim_t*, gmx::AppendingBehavior, int, t_filenm*)
MPI rank: 126 (out of 160)

…etc

application called MPI_Abort(MPI_COMM_WORLD, 1) - process 9
slurmstepd: error: *** STEP 1058046.0 ON node033 CANCELLED AT 2020-12-09T23:24:43 ***
srun: error: node146: tasks 40-59: Killed
srun: Terminating job step 1058046.0
srun: error: node033: tasks 0-19: Killed
srun: error: node271: tasks 120-139: Killed
srun: error: node160: tasks 60-79: Killed
srun: error: node145: tasks 20-39: Killed
srun: error: node272: tasks 140-159: Killed
srun: error: node191: tasks 100-119: Killed

Thank you in advance!

Me too, I ran into the same problem. Did you find a solution? Thank you so much in advance.

In my case, I was getting this error with GROMACS/2020.3 when re-starting from a checkpoint file. Using a different checkpoint file (run.cpt instead of run_prev.cpt in my case) solved it. Maybe this helps.

I also have this problem, and passing a different checkpoint file to -cpi didn’t help. Is there an explanation for why this happens, and/or a general way to fix it? Thanks!

Well, the actual error is printed a few lines higher up in the output:
Inconsistency in user input:
Some output files listed in the checkpoint file umbrella94.cpt are not present
or not named as the output files by the current program:
Expected output files that are present:
umbrella94.log
umbrella94.xtc
umbrella94.edr

Expected output files that are not present or named differently:
umbrella94_pullx.xvg
umbrella94_pullf.xvg

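So the problem is that mdrun tried to append to the earlier output (as far as I know, with -cpi this is the default, so it happens whether or not you pass -append explicitly), but some of the output files recorded in the checkpoint were missing or named differently (my case here). To fix it, either fix the naming of the files so they match what the checkpoint expects, or restart with -noappend.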
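To catch this before submitting a 160-rank job, you can check the expected files yourself first. This is only a rough sketch: missing_outputs is a made-up helper name, and the list of file names has to come from you (for example copied from the "Expected output files" part of the error message); it is not read from the checkpoint.

```shell
#!/bin/sh
# Hypothetical helper: print every given file name that does not exist
# in the current directory, one per line. Prints nothing if all exist.
missing_outputs() {
    for f in "$@"; do
        [ -e "$f" ] || printf '%s\n' "$f"
    done
}

# Example use, with the file names from the error message above
# (commented out so that sourcing this file never launches mdrun):
#
#   missing=$(missing_outputs umbrella94.log umbrella94.xtc umbrella94.edr \
#                             umbrella94_pullx.xvg umbrella94_pullf.xvg)
#   if [ -z "$missing" ]; then
#       srun gmx_mpi mdrun -deffnm umbrella94 -cpi umbrella94.cpt
#   else
#       echo "Missing or renamed output files:"
#       echo "$missing"
#       # Either rename the files back, or add -noappend to the mdrun line.
#   fi
```

If the helper prints anything, those are the files to rename or restore before restarting; otherwise appending should at least get past this particular check.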