REMD: init_step/-replex is not equal for all subsystems

GROMACS version: 2019
GROMACS modification: Yes/No

I run REMD on our cluster and continue the jobs in append mode from the checkpoint (.cpt) files. However, the cluster sometimes crashes, so jobs are terminated at random times. As a result, the checkpoints of the individual replicas end up at different steps, and mdrun refuses to restart:

Initializing Replica Exchange
Repl  There are 16 replicas:
Multi-checking the number of atoms ... OK
Multi-checking the integrator ... OK
Multi-checking init_step+nsteps ... OK
Multi-checking first exchange step: init_step/-replex ... 
first exchange step: init_step/-replex is not equal for all subsystems
  subsystem 0: 75849
  subsystem 1: 75455
  subsystem 2: 75849
  subsystem 3: 75849
  subsystem 4: 75849
  subsystem 5: 75849
  subsystem 6: 75849
  subsystem 7: 75455
  subsystem 8: 75849
  subsystem 9: 75849
  subsystem 10: 75849
  subsystem 11: 75849
  subsystem 12: 75849
  subsystem 13: 75849
  subsystem 14: 75849
  subsystem 15: 75849

Program:     mdrun_mpi, version 2019.3
Source file: src/gromacs/gmxlib/network.cpp (line 745)
MPI rank:    15 (out of 16)

Fatal error:
The 16 subsystems are not compatible

Is there a way to tell REMD to continue from the last “common” step shared by all replicas? In my case, could it continue from 75455?

REMD requires huge resources to run, and cluster crashes are inevitable.
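
To see at a glance which replicas are behind, the "subsystem N: value" lines from the error output above can be parsed with a short script. This is only a minimal sketch: the function names are made up for illustration, and it only reads the numbers as mdrun reports them under the label init_step/-replex. It does not modify any checkpoint and cannot by itself make the run restartable.

```python
import re

# Matches lines such as "  subsystem 7: 75455" from the mdrun error output.
SUBSYSTEM_RE = re.compile(r"^\s*subsystem\s+(\d+):\s+(\d+)\s*$")


def parse_exchange_indices(log_text):
    """Return {replica index: value reported for init_step/-replex}."""
    values = {}
    for line in log_text.splitlines():
        match = SUBSYSTEM_RE.match(line)
        if match:
            values[int(match.group(1))] = int(match.group(2))
    return values


def lagging_replicas(values):
    """Return (lowest value, replicas at that value, replicas ahead of it)."""
    common = min(values.values())
    behind = sorted(i for i, v in values.items() if v == common)
    ahead = sorted(i for i, v in values.items() if v > common)
    return common, behind, ahead


# Excerpt of the output quoted in this post (hypothetical variable name).
excerpt = """
  subsystem 0: 75849
  subsystem 1: 75455
  subsystem 2: 75849
  subsystem 7: 75455
"""
common, behind, ahead = lagging_replicas(parse_exchange_indices(excerpt))
print(common, behind)  # prints: 75455 [1, 7]
```

For the full output above, this reports replicas 1 and 7 as the ones that stopped earlier. All the other replicas would need a checkpoint from that earlier point to restart consistently.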

I had the same issue a while ago, and the only solution I found was to run mdrun with the -noappend flag, write checkpoint files more often, and back up the checkpoint files frequently. That way I could restart the simulations from a recent checkpoint file if necessary. I am eager to know if a better solution exists.
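
The backup step can be automated with a small script run periodically alongside the simulation. The following is a minimal sketch under stated assumptions: the function name, the one-directory-per-replica layout, and the `keep` parameter are hypothetical, and `state.cpt` is assumed as the checkpoint file name (the mdrun default unless changed with -cpo). It does not guard against copying a file while mdrun is in the middle of writing it.

```python
import shutil
import time
from pathlib import Path


def backup_checkpoints(replica_dirs, backup_root, cpt_name="state.cpt", keep=5):
    """Copy each replica's checkpoint into one timestamped snapshot folder.

    All replicas are copied in the same pass so that a snapshot holds a
    set of checkpoints taken at (nearly) the same time.
    """
    if keep < 1:
        raise ValueError("keep must be at least 1")
    backup_root = Path(backup_root)
    backup_root.mkdir(parents=True, exist_ok=True)
    target = backup_root / time.strftime("%Y%m%d-%H%M%S")

    copied = []
    for rep in map(Path, replica_dirs):
        src = rep / cpt_name
        if not src.is_file():
            continue  # this replica has not written a checkpoint yet
        dest_dir = target / rep.name
        dest_dir.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dest_dir / cpt_name)
        copied.append(dest_dir / cpt_name)

    # Snapshot folder names sort chronologically; drop all but the newest.
    snapshots = sorted(p for p in backup_root.iterdir() if p.is_dir())
    for old in snapshots[:-keep]:
        shutil.rmtree(old)
    return copied
```

A call such as `backup_checkpoints(sorted(Path(".").glob("rep_*")), "cpt_backup")` (directory names are placeholders) could be run from a loop or a cron job. Before trusting a snapshot, it is worth checking that all replicas in it are at the same step. How often mdrun itself writes checkpoints is set by its -cpt option, given in minutes.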