GROMACS version: 2020.4-MODIFIED
GROMACS modification: Yes - Patched with Plumed
Hello,
I have been seeing the following error upon restarting my multi simulation with mdrun:
Fatal error: The initial step is not consistent across multi simulations which share the state
Indeed, the checkpoints for each replica do not correspond to the same initial step. I would like to understand why this is the case and how to enforce synchronized checkpoint files. In my simulation no exchanges are attempted, but the replicas are synchronized every 500 steps by Plumed.
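For reference, the restart is launched roughly like this (the replica count and file names are illustrative, not my exact setup):

    mpirun -np 4 gmx_mpi mdrun -multidir rep0 rep1 rep2 rep3 \
        -plumed plumed.dat -cpi state.cpt

mdrun aborts with the error above as soon as it finds that the steps stored in the replicas' state.cpt files differ.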
I am running replica exchange simulations and I am experiencing similar issues: the checkpoint files from the parallel simulations are out of sync. I have ended up writing checkpoint files every 20 minutes and keeping them numbered with -cpnum -cpt 20. With this I am sometimes “lucky” and get checkpoint files at identical steps across all my replicas (see my guess below about what being “lucky” means). However, getting “lucky” becomes harder and harder as the number of parallel simulations increases.
I would like to add that I am sharing GPUs between simulations to reduce GPU idle time. This degrades individual performance but increases the aggregate simulation performance. After benchmarking, my optimal mdrun setup was the following:
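Something along these lines, where the replica count, directory names, thread count and GPU offload flags are illustrative rather than my exact values:

    mpirun -np 8 gmx_mpi mdrun -multidir rep0 rep1 rep2 rep3 rep4 rep5 rep6 rep7 \
        -plumed plumed.dat -replex 100 \
        -nb gpu -pme gpu -ntomp 4 \
        -cpt 20 -cpnum

With 8 replicas on a node that has 4 GPUs, two simulations end up mapped to each GPU, which gives the sharing I describe.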
The above runs on 1 node with 4 GPUs, but I also run it over 2 or 3 nodes with almost no performance loss; only the number of working directories and the resource request to my job manager have to be changed.
As my simulations run asynchronously, I guess this causes the asynchronous checkpoint output. However, they have to wait for each other every 100 steps to attempt a configuration exchange, so I also assume that when an exchange attempt happens at the same time as checkpoint creation, that is when I get “lucky” with synchronized checkpoint files across replicas. If my suppositions are correct, would it be possible to improve the code so that checkpoint files are written at the moment of exchange, to avoid this kind of issue?
If you have any other ideas about the cause of this problem and how I could overcome it, I would be pleased to hear them. Thank you for your input.
I guess the main issue is that when GROMACS does not finish gracefully, the replica simulations do not necessarily end at the same time step, hence the out-of-sync checkpoint files.
I guess the safest solution to cover all cases would be to force replica synchronization every n-th step and write a checkpoint file for every replica at that step, with n given as user input. This way we would always have consistent checkpoint files available for restart and avoid restarting simulations from scratch.
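In the meantime, a possible workaround using only existing tools might be to run in fixed-length chunks, since every replica then stops and writes its final checkpoint at exactly the same step. A rough sketch for four replicas (chunk length, directory names and extra options such as -replex are placeholders to adapt to your setup):

    # Each pass runs all replicas to the same end step from their .tpr files,
    # so the final checkpoints of a chunk are consistent across replicas.
    for i in 1 2 3 4; do
        mpirun -np 4 gmx_mpi mdrun -multidir rep0 rep1 rep2 rep3 \
            -plumed plumed.dat -cpi state.cpt
        # Extend every replica's .tpr by the same amount (here 1000 ps)
        # before launching the next chunk.
        for d in rep0 rep1 rep2 rep3; do
            (cd "$d" && gmx convert-tpr -s topol.tpr -extend 1000 -o topol.tpr)
        done
    done

If a chunk dies partway through, the checkpoints written at the previous chunk boundary are guaranteed to be at the same step for every replica, so that is the furthest a restart ever has to go back; keeping checkpoints numbered with -cpnum ensures those boundary checkpoints are not overwritten.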
Thanks, Jeremy, for confirming this issue. If this is a problem that appears only when GROMACS is used with Plumed, then perhaps it would be best to create an issue on Plumed’s GitHub repo or mention it on the Plumed mailing list.
Unfortunately, I don’t have time to understand or attempt to solve this issue.