GROMACS version: 2020.4-MODIFIED
This program has been built from source code that has been altered and does not match the code released as part of the official GROMACS version 2020.4-MODIFIED. If you did not intend to use an altered GROMACS version, make sure to download an intact source distribution and compile that before proceeding.
If you have modified the source code, you are strongly encouraged to set your custom version suffix (using -DGMX_VERSION_STRING_OF_FORK) which will can help later with scientific reproducibility but also when reporting bugs.
Release checksum: 79c2857291b034542c26e90512b92fd4b184a1c9d6fa59c55f2e24ccf14e7281
Computed checksum: 71730c3e53f008bf8a6c6ee90f305b5807e001d8824f2de8ace37d9da6377c65
Precision: single
Memory model: 64 bit
MPI library: MPI
OpenMP support: enabled (GMX_OPENMP_MAX_THREADS = 64)
GPU support: CUDA
SIMD instructions: IBM_VSX
FFT library: fftw-3.3.7
RDTSCP usage: disabled
TNG support: enabled
Hwloc support: hwloc-1.11.8
Tracing support: disabled
C compiler: /apps/GCC/7.3.0/bin/gcc GNU 7.3.0
C compiler flags: -mcpu=power9 -mtune=power9 -mvsx -pthread -fexcess-precision=fast -funroll-all-loops -O3 -DNDEBUG
C++ compiler: /apps/GCC/7.3.0/bin/g++ GNU 7.3.0
C++ compiler flags: -mcpu=power9 -mtune=power9 -mvsx -pthread -fexcess-precision=fast -funroll-all-loops -fopenmp -O3 -DNDEBUG
CUDA compiler: /usr/local/cuda-10.2/bin/nvcc nvcc: NVIDIA (R) Cuda compiler driver;Copyright (c) 2005-2019 NVIDIA Corporation;Built on Thu_Oct_24_17:58:26_PDT_2019;Cuda compilation tools, release 10.2, V10.2.89
CUDA compiler flags:-gencode;arch=compute_30,code=sm_30;-gencode;arch=compute_35,code=sm_35;-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_52,code=sm_52;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_35,code=compute_35;-gencode;arch=compute_50,code=compute_50;-gencode;arch=compute_52,code=compute_52;-gencode;arch=compute_60,code=compute_60;-gencode;arch=compute_61,code=compute_61;-gencode;arch=compute_70,code=compute_70;-gencode;arch=compute_75,code=compute_75;-use_fast_math;;-mcpu=power9 -mtune=power9 -mvsx -pthread -fexcess-precision=fast -funroll-all-loops -fopenmp -O3 -DNDEBUG
CUDA driver: 10.20
CUDA runtime: 10.20
Dear forum, I am having an issue with multi-replica simulation systems. For maximum flexibility, the simulations are run in 72 h chunks and restarted from checkpoint. However, in 2 of the 3 systems that include the -nsteps flag to limit the number of steps to simulate, I run into an error when the simulations are restarted from the last checkpoint ("init_step+nsteps is not equal for all subsystems"). This happens even though the -maxh flag is used to make sure that the run is terminated cleanly by GROMACS rather than being killed by SLURM.
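In case it is useful, the step recorded in each replica's last checkpoint can be listed with something like the following (a rough sketch, assuming the prod_* directory layout and md.cpt naming shown in the Annex; the exact label printed by gmx dump may vary between versions):

for d in prod_*; do
    printf '%s: ' "$d"
    gmx dump -cp "$d/md.cpt" 2>/dev/null | grep -m 1 -w 'step'
done

The per-subsystem values listed in the error below already suggest that the replicas did not all write their final checkpoint at the same step.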
Could this be related to the use of the -nsteps flag? None of the other 5 systems, which do not use -nsteps, have failed. On the other hand, GROMACS does seem to acknowledge -maxh, as can be seen in the Annex.
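If -nsteps is indeed the culprit, I assume the alternative suggested by the deprecation notice in the log (see Annex) would look roughly like this, writing the step limit into each replica's tpr instead of passing it to mdrun (just a sketch; GROMACS should back up the old tpr files automatically):

for d in prod_*; do
    gmx convert-tpr -s "$d/md.tpr" -nsteps 100000000 -o "$d/md.tpr"
done

and then restarting without the -nsteps flag. Still, I would like to understand why the replicas end up with checkpoints at different steps in the first place.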
The GROMACS version had to be modified due to an incompatibility with POWER9 CPUs.
I attach the GROMACS command line used to launch the simulation, as well as an extract of the error reported by GROMACS (see Annex below).
Thanks!
Annex:
### GROMACS COMMAND LINE ###
gmx_mpi mdrun -multidir prod_0 prod_1 prod_2 prod_3 prod_4 prod_5 prod_6 prod_7 prod_8 prod_9 prod_10 prod_11 prod_12 prod_13 prod_14 prod_15 prod_16 prod_17 prod_18 prod_19 prod_20 prod_21 prod_22 prod_23 prod_24 prod_25 prod_26 prod_27 prod_28 prod_29 prod_30 prod_31 prod_32 prod_33 prod_34 prod_35 prod_36 prod_37 prod_38 prod_39 -maxh 72 -nsteps 100000000 -cpi md.cpt -deffnm md -replex 2000 -plumed plumed_PTWTE.dat
###################
### ERROR BELOW ###
###################
Step 12397840: Run time exceeded 71.280 hours, will terminate the run within 400 steps
Replica exchange at step 12398000 time 24796.00000
Repl 0 <-> 1 dE_term = 3.214e+00 (kT)
dpV = -6.787e-04 d = 3.213e+00
dplumed = -3.738e+00 dE_Term = -5.249e-01 (kT)
Repl ex 0 x 1 2 3 4 x 5 6 7 8 x 9 10 11 12 13 14 15 16 x 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 x 37 38 39
Repl pr 1.0 .34 .76 .79 1.0 .37 .03 .02 .96 .00 .19 .00 .33 .06 .26 .08 .04 .22 1.0 .03
Step Time
12398001 24796.00200
Writing checkpoint, step 12398001 at Sun Jan 7 13:27:36 2024
Reading checkpoint file md.cpt
file generated by: /apps/GROMACS/2020.4-plumed.2.7.0-fftw3.3.7/GCC/OPENMPI/bin/gmx_mpi
file generated at: Sun Jan 7 13:27:36 2024
GROMACS double prec.: 0
simulation part #: 1
step: 12398001
time: 24796.002000
-----------------------------------------------------------
Restarting from checkpoint, appending to previous log file.
:-) GROMACS - gmx mdrun, 2020.4-MODIFIED (-:
Executable: /apps/GROMACS/2020.4-plumed.2.7.0-fftw3.3.7/GCC/OPENMPI/bin/gmx_mpi
Data prefix: /apps/GROMACS/2020.4-plumed.2.7.0-fftw3.3.7/GCC/OPENMPI
Working dir: /gpfs/projects/csic35/md_folders/PACAP/solo/WTE_plumed/prod_0
Process ID: 74107
Command line:
gmx_mpi mdrun -multidir prod_0 prod_1 prod_2 prod_3 prod_4 prod_5 prod_6 prod_7 prod_8 prod_9 prod_10 prod_11 prod_12 prod_13 prod_14 prod_15 prod_16 prod_17 prod_18 prod_19 prod_20 prod_21 prod_22 prod_23 prod_24 prod_25 prod_26 prod_27 prod_28 prod_29 prod_30 prod_31 prod_32 prod_33 prod_34 prod_35 prod_36 prod_37 prod_38 prod_39 -maxh 72 -nsteps 100000000 -cpi md.cpt -deffnm md -replex 2000 -plumed plumed_PTWTE.dat
GROMACS version: 2020.4-MODIFIED
This program has been built from source code that has been altered and does not match the code released as part of the official GROMACS version 2020.4-MODIFIED. If you did not intend to use an altered GROMACS version, make sure to download an intact source distribution and compile that before proceeding.
If you have modified the source code, you are strongly encouraged to set your custom version suffix (using -DGMX_VERSION_STRING_OF_FORK) which will can help later with scientific reproducibility but also when reporting bugs.
Release checksum: 79c2857291b034542c26e90512b92fd4b184a1c9d6fa59c55f2e24ccf14e7281
Computed checksum: 71730c3e53f008bf8a6c6ee90f305b5807e001d8824f2de8ace37d9da6377c65
Precision: single
Memory model: 64 bit
MPI library: MPI
OpenMP support: enabled (GMX_OPENMP_MAX_THREADS = 64)
GPU support: CUDA
SIMD instructions: IBM_VSX
FFT library: fftw-3.3.7
RDTSCP usage: disabled
TNG support: enabled
Hwloc support: hwloc-1.11.8
Tracing support: disabled
C compiler: /apps/GCC/7.3.0/bin/gcc GNU 7.3.0
C compiler flags: -mcpu=power9 -mtune=power9 -mvsx -pthread -fexcess-precision=fast -funroll-all-loops -O3 -DNDEBUG
C++ compiler: /apps/GCC/7.3.0/bin/g++ GNU 7.3.0
C++ compiler flags: -mcpu=power9 -mtune=power9 -mvsx -pthread -fexcess-precision=fast -funroll-all-loops -fopenmp -O3 -DNDEBUG
CUDA compiler: /usr/local/cuda-10.2/bin/nvcc nvcc: NVIDIA (R) Cuda compiler driver;Copyright (c) 2005-2019 NVIDIA Corporation;Built on Thu_Oct_24_17:58:26_PDT_2019;Cuda compilation tools, release 10.2, V10.2.89
CUDA compiler flags:-gencode;arch=compute_30,code=sm_30;-gencode;arch=compute_35,code=sm_35;-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_52,code=sm_52;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_35,code=compute_35;-gencode;arch=compute_50,code=compute_50;-gencode;arch=compute_52,code=compute_52;-gencode;arch=compute_60,code=compute_60;-gencode;arch=compute_61,code=compute_61;-gencode;arch=compute_70,code=compute_70;-gencode;arch=compute_75,code=compute_75;-use_fast_math;;-mcpu=power9 -mtune=power9 -mvsx -pthread -fexcess-precision=fast -funroll-all-loops -fopenmp -O3 -DNDEBUG
CUDA driver: 10.20
CUDA runtime: 10.20
The -nsteps functionality is deprecated, and may be removed in a future version. Consider using gmx convert-tpr -nsteps or changing the appropriate .mdp file field.
Overriding nsteps with value passed on the command line: 100000000 steps, 2e+05 ps
Changing nstlist from 10 to 80, rlist from 1.003 to 1.155
4 GPUs selected for this run.
Mapping of GPU IDs to the 80 GPU tasks in the 40 ranks on this node:
PP:0,PME:0,PP:0,PME:0,PP:0,PME:0,PP:0,PME:0,PP:0,PME:0,PP:0,PME:0,PP:0,PME:0,PP:0,PME:0,PP:0,PME:0,PP:0,PME:0,PP:1,PME:1,PP:1,PME:1,PP:1,PME:1,PP:1,PME:1,PP:1,PME:1,PP:1,PME:1,PP:1,PME:1,PP:1,PME:1,PP:1,PME:1,PP:1,PME:1,PP:2,PME:2,PP:2,PME:2,PP:2,PME:2,PP:2,PME:2,PP:2,PME:2,PP:2,PME:2,PP:2,PME:2,PP:2,PME:2,PP:2,PME:2,PP:2,PME:2,PP:3,PME:3,PP:3,PME:3,PP:3,PME:3,PP:3,PME:3,PP:3,PME:3,PP:3,PME:3,PP:3,PME:3,PP:3,PME:3,PP:3,PME:3,PP:3,PME:3
PP tasks will do (non-perturbed) short-ranged interactions on the GPU
PP task will update and constrain coordinates on the CPU
PME tasks will do all aspects on the GPU
This is simulation 0 out of 40 running as a composite GROMACS
multi-simulation job. Setup for this simulation:
Using 1 MPI process
Non-default thread affinity set, disabling internal thread affinity
Using 1 OpenMP thread
System total charge: -0.000
Will do PME sum in reciprocal space for electrostatic interactions.
++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
U. Essmann, L. Perera, M. L. Berkowitz, T. Darden, H. Lee and L. G. Pedersen
A smooth particle mesh Ewald method
J. Chem. Phys. 103 (1995) pp. 8577-8592
-------- -------- --- Thank You --- -------- --------
Using a Gaussian width (1/beta) of 0.320163 nm for Ewald
Potential shift: LJ r^-12: -1.000e+00 r^-6: -1.000e+00, Ewald -1.000e-05
Initialized non-bonded Ewald tables, spacing: 9.33e-04 size: 1073
Generated table with 1077 data points for 1-4 COUL.
Tabscale = 500 points/nm
Generated table with 1077 data points for 1-4 LJ6.
Tabscale = 500 points/nm
Generated table with 1077 data points for 1-4 LJ12.
Tabscale = 500 points/nm
Using GPU 8x8 nonbonded short-range kernels
Using a dual 8x8 pair-list setup updated with dynamic, rolling pruning:
outer list: updated every 80 steps, buffer 0.155 nm, rlist 1.155 nm
inner list: updated every 8 steps, buffer 0.002 nm, rlist 1.002 nm
At tolerance 0.005 kJ/mol/ps per atom, equivalent classical 1x1 list would be:
outer list: updated every 80 steps, buffer 0.291 nm, rlist 1.291 nm
inner list: updated every 8 steps, buffer 0.033 nm, rlist 1.033 nm
Using Lorentz-Berthelot Lennard-Jones combination rule
Long Range LJ corr.: <C6> 2.4138e-04
Initializing LINear Constraint Solver
++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
B. Hess and H. Bekker and H. J. C. Berendsen and J. G. E. M. Fraaije
LINCS: A Linear Constraint Solver for molecular simulations
J. Comp. Chem. 18 (1997) pp. 1463-1472
-------- -------- --- Thank You --- -------- --------
The number of constraints is 339
++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
S. Miyamoto and P. A. Kollman
SETTLE: An Analytical Version of the SHAKE and RATTLE Algorithms for Rigid
Water Models
J. Comp. Chem. 13 (1992) pp. 952-962
-------- -------- --- Thank You --- -------- --------
There are: 80862 Atoms
There are: 26682 VSites
Initializing Replica Exchange
Repl There are 40 replicas:
Multi-checking the number of atoms ... OK
Multi-checking the integrator ... OK
Multi-checking init_step+nsteps ...
init_step+nsteps is not equal for all subsystems
subsystem 0: 112398001
subsystem 1: 112398001
subsystem 2: 112398080
subsystem 3: 112398080
subsystem 4: 112398001
subsystem 5: 112398001
subsystem 6: 112398080
subsystem 7: 112398080
subsystem 8: 112398001
subsystem 9: 112398001
subsystem 10: 112398080
subsystem 11: 112398080
subsystem 12: 112398080
subsystem 13: 112398080
subsystem 14: 112398080
subsystem 15: 112398080
subsystem 16: 112398001
subsystem 17: 112398001
subsystem 18: 112398080
subsystem 19: 112398080
subsystem 20: 112398080
subsystem 21: 112398080
subsystem 22: 112398080
subsystem 23: 112398080
subsystem 24: 112398080
subsystem 25: 112398080
subsystem 26: 112398080
subsystem 27: 112398080
subsystem 28: 112398080
subsystem 29: 112398080
subsystem 30: 112398080
subsystem 31: 112398080
subsystem 32: 112398050
subsystem 33: 112398050
subsystem 34: 112398050
subsystem 35: 112398050
subsystem 36: 112398001
subsystem 37: 112398001
subsystem 38: 112398050
subsystem 39: 112398050
-------------------------------------------------------
Program: gmx mdrun, version 2020.4-MODIFIED
Source file: src/gromacs/mdrunutility/multisim.cpp (line 381)
MPI rank: 0 (out of 40)
Fatal error:
The 40 subsystems are not compatible
For more information and tips for troubleshooting, please check the GROMACS
website at http://www.gromacs.org/Documentation/Errors
-------------------------------------------------------