Checkpoint files of different replicas correspond to different timesteps

GROMACS version: 2020.4-MODIFIED
GROMACS modification: Yes - Patched with Plumed

Hello,

I have been seeing the following error upon restarting my multi-simulation with mdrun:

Fatal error: The initial step is not consistent across multi simulations which share the
state

Indeed, the checkpoints of the different replicas do not correspond to the same step. I would like to understand why this happens and how to enforce synchronized checkpoint files. In my simulation no exchanges are attempted, but the replicas are synchronized every 500 steps by Plumed.

Looking forward to your feedback.

Best regards,

Pablo Piaggi

GROMACS version: 2020.6-MODIFIED
GROMACS modification: Yes - Patched with Plumed 2.7.2

Hi everyone,

I am running replica exchange simulations and I am experiencing a similar issue: the checkpoint files of the parallel simulations are out of sync. I’ve ended up writing checkpoint files every 20 minutes and keeping them numbered with ‘-cpnum -cpt 20’. With this I’m sometimes “lucky” and get checkpoint files at identical steps across all my replicas (see my suppositions about what being lucky means below). However, being “lucky” gets harder and harder as the number of parallel simulations increases.
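For reference, this is roughly how I check whether the replicas’ checkpoints are in sync. It is only a minimal sketch: it assumes the replica directories from my -multidir list, a gmx_mpi (or gmx) binary that provides gmx dump, and that the step is reported in the checkpoint header printed by ‘gmx dump -cp’ (the exact wording of that field may differ between versions):

# Print the step stored in each replica's current checkpoint (sketch; adjust
# directory and file names to your own setup).
for d in E_-3.45000 E_-3.40175 E_-3.35350 E_-3.30525 E_-3.25700 TMP_-3.20875 TMP_-3.16050 TMP_-3.11225; do
    # 'gmx dump -cp' prints the checkpoint header, which includes the step.
    step=$(gmx_mpi dump -cp "$d/state.cpt" 2>/dev/null | grep -m1 'step')
    echo "$d: $step"
done

If the printed steps differ between directories, a restart with -multidir will fail with the error Pablo quoted.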

I would like to add that I am sharing GPUs between simulations to reduce idle GPU time. This degrades the performance of each individual simulation but increases the aggregate simulation performance. After benchmarking, my optimal mdrun setup was:

srun --ntasks-per-node=8 -c 8 gmx_mpi mdrun -ntomp 8 -cpi $state -cpnum -cpt 20 -stepout 10000 -v -s md.tpr -maxh 45.60 -plumed plumed.dat -multidir E_-3.45000/ E_-3.40175/ E_-3.35350/ E_-3.30525/ E_-3.25700/ TMP_-3.20875/ TMP_-3.16050/ TMP_-3.11225/ -replex 100 -hrex -gputasks 0011001122332233 -nb gpu -pme gpu >& md.ch0.lis

The above runs on 1 node with 4 GPUs, but I also run this over 2 or 3 nodes (with almost no performance loss); only the number of working directories and the resource request to my job manager have to be changed.

As my simulations run asynchronously, I guess this causes asynchronous checkpoint output. However, the replicas have to wait for each other every 100 steps to attempt a configuration exchange, so I also assume that when a checkpoint happens to be written at the same time as an exchange attempt, that is when I get “lucky” with synchronized checkpoint files across replicas. If my suppositions are correct, would it be possible to improve the code so that checkpoint files are written at the moment of exchange, to avoid this kind of issue?
If you have any other ideas about the cause of this problem and how I could work around it, I would be pleased to hear them. Thank you for your input.

Best regards,
Jeremy

[ Updates ]

By browsing the Plumed mailing list I’ve noticed that at least one other user had the exact same issue as Pablo and me:
https://groups.google.com/g/plumed-users/c/zZyPDjPJgkA/m/DYO6z7LvBgAJ

and at least one other had a related issue:
https://groups.google.com/g/plumed-users/c/59_osq0mPAI/m/QfqzIY5jHGUJ

I guess the main issue is that when GROMACS does not finish gracefully, the replica simulations do not necessarily end at the same time step, hence the out-of-sync checkpoint files.
I guess the safest solution covering all cases would be to force replica synchronization every n-th step and to write a checkpoint file for every replica at that same step, with n being a user input. This way checkpoint files would always be available for a consistent restart, and we would avoid restarting simulations from scratch.
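In the meantime, a possible manual workaround is to restart from the most recent step for which every replica has a numbered checkpoint. This is only a sketch under my assumptions: the numbered checkpoints kept by -cpnum match state*.cpt in each replica directory, and ‘gmx dump -cp’ reports the step in its header (check your actual file names and dump output before relying on it):

# List the step of every numbered checkpoint in every replica directory (sketch).
dirs="E_-3.45000 E_-3.40175 E_-3.35350 E_-3.30525 E_-3.25700 TMP_-3.20875 TMP_-3.16050 TMP_-3.11225"
ndirs=$(echo $dirs | wc -w)
for d in $dirs; do
    for cpt in "$d"/state*.cpt; do
        step=$(gmx_mpi dump -cp "$cpt" 2>/dev/null | grep -m1 'step' | awk '{print $NF}')
        echo "$step $d $cpt"
    done
done | sort -n > cpt_steps.txt
# A step that appears once per replica directory is a consistent restart
# candidate; since the list is sorted, the last such step is the most recent.
awk '{print $1}' cpt_steps.txt | uniq -c | awk -v n="$ndirs" '$1 == n {print $2}' | tail -n 1

One would then point -cpi at the matching numbered checkpoint in each directory (or copy it over the default checkpoint name) before restarting, at the cost of losing the steps simulated after that common step.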

Best regards,
Jeremy

Thanks, Jeremy, for confirming this issue. If this is a problem that appears only when GROMACS is used with Plumed, then perhaps it would be best to create an issue on Plumed’s GitHub repo or mention it on the Plumed mailing list.

Unfortunately, I don’t have time to understand or attempt to solve this issue.

Regards,

Pablo