Multidir and continuing simulations using cpt file

GROMACS version: 2020
GROMACS modification: Yes/No

Dear all,
I would like to continue four independent simulations together using -multidir, as shown below:

gmx_mpi mdrun -multidir 1 2 3 4 -s topol.tpr -deffnm md -g md.log -cpi md.cpt -noappend

The md.cpt file corresponding to each simulation exists inside its directory, and each simulation can be continued individually, but with -multidir the run fails with the error below:

simulation checkpoint files were from the following respective
simulation parts:
Simulation 0: 7
Simulation 1: 9
Simulation 2: 11
Simulation 3: 10

Multi-simulations must all start in the same way, either a new
simulation, a restart with appending, or a restart without appending.
MPT ERROR: Rank 0(g:0) is aborting with error code 1.
Process ID: 47980, Host: r4i0n6, Program: /p/app/gromacs/gromacs-2020.1/bin/gmx_mpi
MPT Version: HPE MPT 2.20 08/30/19 04:33:45

MPT: --------stack traceback-------
For more information and tips for troubleshooting, please check the GROMACS
website at Errors - Gromacs

However, the checkpoint files you specified were from different
simulation parts. Either remove the checkpoint file from each directory,
or ensure each directory has a checkpoint file from the same simulation
part (and, if you want to append to output files, ensure the old output
files are present and named as they were when the checkpoint file was
written).

Would you please let me know what might be the reason?

Thank you
Alex

We check for identical simulation parts because, in many cases, something is wrong when they don’t match and the user might end up with simulations that run out of sync.

If you want to avoid this check, remove the “if (!identicalSimulationParts)” conditional in src/gromacs/mdrunutility/handlerestart.cpp
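
For orientation, a rough sketch of what that guard looks like inside handlerestart.cpp is shown below. This is only an approximation: the conditional name and the throw statement are the ones discussed in this thread, the message text is a placeholder, and the exact surrounding code differs between GROMACS versions, so check your own source tree before editing.

    // Sketch of the restart-consistency guard in
    // src/gromacs/mdrunutility/handlerestart.cpp (structure approximated;
    // the real code builds a more detailed message and differs between versions).
    if (!identicalSimulationParts)
    {
        // Placeholder; the real message lists the part number found in each
        // simulation's checkpoint file.
        std::string message =
                "Multi-simulations must all start in the same way, either a new "
                "simulation, a restart with appending, or a restart without appending.";
        // Removing this conditional (or just the throw below) disables the check.
        GMX_THROW(InconsistentInputError(message));
    }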

Thanks for the response.

What does “identical simulation” mean here?
The four simulations I want to continue have the same number of atoms and atom types, all have previously run for 300 ns, and all are to be continued for identical times. And I use “-cpi an.cpt -noappend” to continue them all in the same way.

I wonder whether I should then recompile the source if I remove the “if (!identicalSimulationParts)” conditional from handlerestart.cpp?

BTW, without “-cpi an.cpt -noappend”, the multidir run works just fine.

Thanks

No, I meant identical “simulation part numbers”, so only the number that mdrun complains about.

Yes, you will need to recompile.

Sorry @hess for exhuming this old discussion.

I guess the check for identical simulation parts is used as a safeguard for -hrex or, in general, for communicating replicas. In my case I am exploiting -multidir to run a few independent, non-communicating replicas so that GROMACS can handle the hardware itself and I just have to fine-tune the MPI/OMP tasks. Do you foresee any possible issue in commenting out the line GMX_THROW(InconsistentInputError(message)); in handlerestart.cpp (considering I am using versions 22, 23, and 24)?

EDIT: Just to be clearer, this is the error-throwing line within the !identicalSimulationParts if statement that you suggested modifying.
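
Concretely, the minimal edit being discussed would look roughly like this (again only a sketch; the GMX_THROW line is the one quoted above, and the rest of the block is left as it is in your source):

    if (!identicalSimulationParts)
    {
        // ... message construction left unchanged ...
        // Disabled so that independent replicas with different part numbers
        // can be restarted under -multidir; use at your own risk.
        // GMX_THROW(InconsistentInputError(message));
    }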

When you run multiple completely independent simulations with -multidir there should be no errors at all. You might get a warning/note about the number of steps or the initial step.

Or are you using a quite old version?

I am using 22.5, 23.0, and 24.0, which are (arguably?) relatively recent and in any case came out after this thread.

Ah, I see now that this check is in a different source file.

This is a safeguard in general. When you are running a set of independent simulations as a multisim resubmission job, you still want this check. If you just want to throw in arbitrary runs, then you do not. We could add an environment variable to override this check.
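
A minimal sketch of what such an override could look like, assuming a hypothetical variable name (GMX_IGNORE_SIMULATION_PART_MISMATCH is made up here for illustration and is not taken from GROMACS), with message and identicalSimulationParts as in the existing guard:

    #include <cstdlib>

    // Hypothetical override: skip the part-number consistency check when the
    // (made-up) environment variable GMX_IGNORE_SIMULATION_PART_MISMATCH is set.
    const bool overridePartCheck =
            (std::getenv("GMX_IGNORE_SIMULATION_PART_MISMATCH") != nullptr);

    if (!identicalSimulationParts && !overridePartCheck)
    {
        GMX_THROW(InconsistentInputError(message));
    }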

It would be cool to be able to override it, as I sometimes find it much easier to let GROMACS handle the task subdivision rather than do it myself. However, this is a very specific problem that only arises on HPC clusters where I have access to large nodes and cannot properly manage sub-tasks and job parallelization due to queueing-software limitations, so the number of users who would benefit from this is probably close to zero.

On a side note

When you are running a set of independent simulations as a multisim resubmission job, you still want this check.

Why does the check play a role even if they are totally independent?

You could have messed up your files and overwritten some output or redone a part. If you run N simulations in the same directories with a resubmission script, the simulation parts should match.

Thanks for this. I am applying the same change in my 22.5/23 installations, hoping it won’t break the compilation; otherwise I’m just going to comment out the error-throwing call. I will report back on GitHub if I find any problems during my runs.