MDRUN crash during gREST simulation under NVT ensemble

GROMACS version: 2022.3
GROMACS modification: No

Hello everyone.

I am currently using GROMACS ver. 2022.3 patched with PLUMED ver. 2.8.1, and I’ve been trying to replicate the gREST simulations from Oshima et al. (J. Chem. Inf. Model. 2020, 60, 11, 5382–5394) in GROMACS.

In the paper, the authors perform gREST simulations with 12 replicas in the NVT ensemble.

I summarized my simulation conditions below:

Simulation Procedure

I basically followed the procedure described in the paper.

I first built a protein–ligand complex system using tLeap from AmberTools and converted it to .top & .gro formats using acpype.
The force fields I used were Amber ff14SB for the protein and GAFF2 for the ligands.
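
The conversion step was essentially a one-liner along these lines (the filenames here are placeholders, not my actual ones):

acpype -p complex.prmtop -x complex.inpcrd

which writes out a *_GMX.top / *_GMX.gro pair.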

For the solute potential term modification, I had to change the relevant parameters in the topology file by hand, since PLUMED’s partial_tempering script only modifies the potential terms relevant to REST2 simulations (dihedrals, LJ, and electrostatic charges). The equations for the scaling ratios of each potential term are given in the paper by Kamiya and Sugita (J. Chem. Phys. 149, 072304 (2018)).
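
As a rough illustration of what these by-hand edits look like (the atom indices and parameter values below are made up, and I am only showing the REST2-style factors, i.e. dihedral force constants scaled by λ = T0/Tm and solute charges by √λ; the remaining terms follow the ratios from the Kamiya–Sugita paper), the hottest replica (T = 999.8 K, λ ≈ 0.300) is edited roughly like this:

[ dihedrals ]
;   ai    aj    ak    al  funct   phase       kd   pn
;    5     7     9    11      9  180.00  4.60240    2    <- original
     5     7     9    11      9  180.00  1.38072    2    ; kd scaled by lambda = 0.300

[ atoms ]
;  nr  type  resnr  residue  atom  cgnr    charge     mass
;   1    c3      1      LIG    C1     1  -0.09420   12.010   <- original
    1    c3      1      LIG    C1     1  -0.05160   12.010   ; charge scaled by sqrt(0.300)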

I also added a flat-bottom (FB) potential so that the ligands don’t dissociate from the protein, using the pull options in the mdp file (R_0 = 1.0 nm, K = 418.4 kJ/mol/nm^2).
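
The restraint is set up through the pull code; the relevant mdp lines look roughly like this (the group names are placeholders for the groups defined in my index.ndx):

pull                    = yes
pull-ncoords            = 1
pull-ngroups            = 2
pull-group1-name        = Pocket        ; binding-pocket residues (placeholder name)
pull-group2-name        = LIG           ; ligand (placeholder name)
pull-coord1-type        = flat-bottom
pull-coord1-geometry    = distance
pull-coord1-groups      = 1 2
pull-coord1-init        = 1.0           ; R_0 = 1.0 nm
pull-coord1-k           = 418.4         ; kJ/mol/nm^2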

In the paper, the solute temperatures are set to 300.0, 324.5, 355.7, 389.5, 431.2, 476.8, 529.2, 593.6, 670.5, 763.7, 871.7, and 999.8 K.
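
(These temperatures correspond to the replica directory names used with -multidir in the production step below through the scaling factor λ_m = T_0/T_m, e.g. 300.0/324.5 ≈ 0.924 and 300.0/999.8 ≈ 0.300, giving the set 1.000, 0.924, …, 0.300.)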

The simulation procedure was as follows:

  1. EM
**grompp** 
gmx grompp -f 1_EM.mdp -c box.gro -n index.ndx -p topol.top -o em.tpr

**mdrun**
export OMP_NUM_THREADS=8
gmx mdrun -ntmpi 1 -ntomp 8 -v -deffnm em
  2. 100 ps position-restrained NVT EQ (T = 300 K for each replica)
**grompp**
gmx grompp -f 2_NVT.mdp -c em.gro -r em.gro -n index.ndx -p topol.top -o nvt.tpr

**mdrun**
gmx mdrun -ntmpi 1 -ntomp 8 -v -deffnm nvt
  3. 100 ps position-restrained NPT EQ (T = 300 K, P = 1 atm for each replica)
**grompp**
gmx grompp -f 3_NPT.mdp -c nvt.gro -r nvt.gro -n index.ndx -p topol.top -o npt.tpr

**mdrun**
gmx mdrun -ntmpi 1 -ntomp 8 -v -deffnm npt
  4. 500 ps unrestrained NVT EQ (T = 300 K for each replica)
**grompp**
gmx grompp -f 4_EQ1.mdp -c npt.gro -n index.ndx -p topol.top -o eq1.tpr

**mdrun**
gmx mdrun -ntmpi 1 -ntomp 8 -v -deffnm eq1
  5. Modify topology according to solute temperature
  6. 1000 ps unrestrained NVT EQ (different T’s for each replica, no replica exchange)
**grompp**
gmx grompp -f 5_EQ2.mdp -c eq1.gro -n index.ndx -p processed.top -o eq2.tpr

**mdrun**
gmx mdrun -ntmpi 1 -ntomp 8 -v -deffnm eq2
  7. NVT Production (different T’s for each replica with replica exchange)
**grompp**
gmx grompp -f 6_PROD.mdp -c eq2.gro -n index.ndx -p processed.top -o prod.tpr

**mdrun**
mpirun -np 12 --mca btl_tcp_if_exclude docker0,lo gmx_mpi mdrun -ntomp 8 -v -deffnm prod -plumed /home/jimmych/test/MD_Oshima/build/plumed.dat -multidir 1.000 0.924 0.843 0.770 0.696 0.629 0.567 0.505 0.447 0.393 0.344 0.300 -replex 1000 -hrex -dlb no

plumed.dat was only used to collect dRMSD values between the ligands and the pocket residues (file not attached).
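
It does nothing beyond monitoring, roughly along these lines (the reference structure name and cutoffs are placeholders, not my actual values):

# monitoring only, no bias applied
d: DRMSD REFERENCE=pocket_ligand.pdb LOWER_CUTOFF=0.1 UPPER_CUTOFF=0.8
PRINT ARG=d FILE=COLVAR STRIDE=500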

I have attached the mdp files, the index file, and the original topology, as well as one example modified topology (T = 999.8 K) for the simulation.

All of the steps seem to run fine until I get to the production stage, where, after the first replica exchange, the simulation suddenly crashes due to a bad-contacts issue:

:-) GROMACS - gmx mdrun, 2022.3-plumed_2.8.1 (-:

Executable:   /root/opt/gromacs-2022.3/bin/gmx_mpi
Data prefix:  /root/opt/gromacs-2022.3
Working dir:  /home/jimmych/test/MD_Oshima/Production/ensembles/0/lambdas
Command line:
  gmx_mpi mdrun -ntomp 8 -v -deffnm prod -plumed /home/jimmych/test/MD_Oshima/build/plumed.dat -multidir 1.000 0.924 0.843 0.770 0.696 0.629 0.567 0.505 0.447 0.393 0.344 0.300 -replex 1000 -dlb no

Reading file prod.tpr, VERSION 2022.3 (single precision)
Reading file prod.tpr, VERSION 2022.3 (single precision)
Reading file prod.tpr, VERSION 2022.3 (single precision)
Reading file prod.tpr, VERSION 2022.3 (single precision)
Reading file prod.tpr, VERSION 2022.3 (single precision)
Reading file prod.tpr, VERSION 2022.3 (single precision)
Reading file prod.tpr, VERSION 2022.3 (single precision)
Reading file prod.tpr, VERSION 2022.3 (single precision)
Reading file prod.tpr, VERSION 2022.3 (single precision)
Reading file prod.tpr, VERSION 2022.3 (single precision)
Reading file prod.tpr, VERSION 2022.3 (single precision)
Reading file prod.tpr, VERSION 2022.3 (single precision)
Changing nstlist from 20 to 100, rlist from 0.8 to 0.887

1 GPU selected for this run.
Mapping of GPU IDs to the 2 GPU tasks in the 1 rank on this node:
  PP:0,PME:0
PP tasks will do (non-perturbed) short-ranged interactions on the GPU
PP task will update and constrain coordinates on the CPU
PME tasks will do all aspects on the GPU

This is simulation 0 out of 12 running as a composite GROMACS
multi-simulation job. Setup for this simulation:

Using 1 MPI process
Changing nstlist from 20 to 100, rlist from 0.8 to 0.884

1 GPU selected for this run.
Mapping of GPU IDs to the 2 GPU tasks in the 1 rank on this node:
  PP:0,PME:0
PP tasks will do (non-perturbed) short-ranged interactions on the GPU
PP task will update and constrain coordinates on the CPU
PME tasks will do all aspects on the GPU
Changing nstlist from 20 to 100, rlist from 0.8 to 0.883

1 GPU selected for this run.
Mapping of GPU IDs to the 2 GPU tasks in the 1 rank on this node:
  PP:0,PME:0
PP tasks will do (non-perturbed) short-ranged interactions on the GPU
PP task will update and constrain coordinates on the CPU
PME tasks will do all aspects on the GPU

This is simulation 3 out of 12 running as a composite GROMACS
multi-simulation job. Setup for this simulation:

Using 1 MPI process

This is simulation 2 out of 12 running as a composite GROMACS
multi-simulation job. Setup for this simulation:

Using 1 MPI process
Changing nstlist from 20 to 100, rlist from 0.8 to 0.886

1 GPU selected for this run.
Mapping of GPU IDs to the 2 GPU tasks in the 1 rank on this node:
  PP:0,PME:0
PP tasks will do (non-perturbed) short-ranged interactions on the GPU
PP task will update and constrain coordinates on the CPU
PME tasks will do all aspects on the GPU
Changing nstlist from 20 to 100, rlist from 0.8 to 0.884

1 GPU selected for this run.
Mapping of GPU IDs to the 2 GPU tasks in the 1 rank on this node:
  PP:0,PME:0
PP tasks will do (non-perturbed) short-ranged interactions on the GPU
PP task will update and constrain coordinates on the CPU
PME tasks will do all aspects on the GPU
Changing nstlist from 20 to 100, rlist from 0.8 to 0.883

1 GPU selected for this run.
Mapping of GPU IDs to the 2 GPU tasks in the 1 rank on this node:
  PP:0,PME:0
PP tasks will do (non-perturbed) short-ranged interactions on the GPU
PP task will update and constrain coordinates on the CPU
PME tasks will do all aspects on the GPU
Changing nstlist from 20 to 100, rlist from 0.8 to 0.883

1 GPU selected for this run.
Mapping of GPU IDs to the 2 GPU tasks in the 1 rank on this node:
  PP:0,PME:0
PP tasks will do (non-perturbed) short-ranged interactions on the GPU
PP task will update and constrain coordinates on the CPU
PME tasks will do all aspects on the GPU
Changing nstlist from 20 to 100, rlist from 0.8 to 0.882

1 GPU selected for this run.
Mapping of GPU IDs to the 2 GPU tasks in the 1 rank on this node:
  PP:0,PME:0
PP tasks will do (non-perturbed) short-ranged interactions on the GPU
PP task will update and constrain coordinates on the CPU
PME tasks will do all aspects on the GPU
Changing nstlist from 20 to 100, rlist from 0.8 to 0.886

1 GPU selected for this run.
Mapping of GPU IDs to the 2 GPU tasks in the 1 rank on this node:
  PP:0,PME:0
PP tasks will do (non-perturbed) short-ranged interactions on the GPU
PP task will update and constrain coordinates on the CPU
PME tasks will do all aspects on the GPU
Changing nstlist from 20 to 100, rlist from 0.8 to 0.885

1 GPU selected for this run.
Mapping of GPU IDs to the 2 GPU tasks in the 1 rank on this node:
  PP:0,PME:0
PP tasks will do (non-perturbed) short-ranged interactions on the GPU
PP task will update and constrain coordinates on the CPU
PME tasks will do all aspects on the GPU
Changing nstlist from 20 to 100, rlist from 0.8 to 0.885

1 GPU selected for this run.
Mapping of GPU IDs to the 2 GPU tasks in the 1 rank on this node:
  PP:0,PME:0
PP tasks will do (non-perturbed) short-ranged interactions on the GPU
PP task will update and constrain coordinates on the CPU
PME tasks will do all aspects on the GPU
Changing nstlist from 20 to 100, rlist from 0.8 to 0.885

1 GPU selected for this run.
Mapping of GPU IDs to the 2 GPU tasks in the 1 rank on this node:
  PP:0,PME:0
PP tasks will do (non-perturbed) short-ranged interactions on the GPU
PP task will update and constrain coordinates on the CPU
PME tasks will do all aspects on the GPU

This is simulation 10 out of 12 running as a composite GROMACS
multi-simulation job. Setup for this simulation:

Using 1 MPI process

This is simulation 11 out of 12 running as a composite GROMACS
multi-simulation job. Setup for this simulation:

Using 1 MPI process

This is simulation 8 out of 12 running as a composite GROMACS
multi-simulation job. Setup for this simulation:

Using 1 MPI process

This is simulation 1 out of 12 running as a composite GROMACS
multi-simulation job. Setup for this simulation:

Using 1 MPI process

This is simulation 9 out of 12 running as a composite GROMACS
multi-simulation job. Setup for this simulation:

Using 1 MPI process

This is simulation 5 out of 12 running as a composite GROMACS
multi-simulation job. Setup for this simulation:

Using 1 MPI process

This is simulation 6 out of 12 running as a composite GROMACS
multi-simulation job. Setup for this simulation:

Using 1 MPI process

This is simulation 7 out of 12 running as a composite GROMACS
multi-simulation job. Setup for this simulation:

Using 1 MPI process

This is simulation 4 out of 12 running as a composite GROMACS
multi-simulation job. Setup for this simulation:

Using 1 MPI process
Using 8 OpenMP threads 

Using 8 OpenMP threads 

Using 8 OpenMP threads 

Using 8 OpenMP threads 

Using 8 OpenMP threads 

Using 8 OpenMP threads 

Using 8 OpenMP threads 

Using 8 OpenMP threads 

Using 8 OpenMP threads 

Using 8 OpenMP threads 

Using 8 OpenMP threads 

Using 8 OpenMP threads 

WARNING: This run will generate roughly 11941 Mb of data

WARNING: This run will generate roughly 11941 Mb of data

WARNING: This run will generate roughly 11941 Mb of data

WARNING: This run will generate roughly 11941 Mb of data

WARNING: This run will generate roughly 11941 Mb of data

WARNING: This run will generate roughly 11941 Mb of data

WARNING: This run will generate roughly 11941 Mb of data

WARNING: This run will generate roughly 11941 Mb of data

WARNING: This run will generate roughly 11941 Mb of data

WARNING: This run will generate roughly 11941 Mb of data

WARNING: This run will generate roughly 11941 Mb of data

WARNING: This run will generate roughly 11941 Mb of data

starting mdrun 'wet.complex'
starting mdrun 'wet.complex'
500000000 steps, 1000000.0 ps.
500000000 steps, 1000000.0 ps.
starting mdrun 'wet.complex'
starting mdrun 'wet.complex'
500000000 steps, 1000000.0 ps.
starting mdrun 'wet.complex'
500000000 steps, 1000000.0 ps.
starting mdrun 'wet.complex'
starting mdrun 'wet.complex'
500000000 steps, 1000000.0 ps.
starting mdrun 'wet.complex'
starting mdrun 'wet.complex'
starting mdrun 'wet.complex'
500000000 steps, 1000000.0 ps.
500000000 steps, 1000000.0 ps.
starting mdrun 'wet.complex'
500000000 steps, 1000000.0 ps.
500000000 steps, 1000000.0 ps.
500000000 steps, 1000000.0 ps.
500000000 steps, 1000000.0 ps.
starting mdrun 'wet.complex'
500000000 steps, 1000000.0 ps.

step 0
step 100, will finish Mon Jun  5 21:54:33 2023
step 200, will finish Mon Jun  5 23:56:43 2023
step 300, will finish Mon Jun  5 06:00:12 2023
step 400, will finish Mon Jun  5 04:11:26 2023
step 500, will finish Sun Jun  4 17:41:58 2023
step 600, will finish Sun Jun  4 13:40:23 2023
step 700, will finish Sun Jun  4 07:27:08 2023
step 800, will finish Sun Jun  4 04:36:49 2023
step 900, will finish Sun Jun  4 00:37:29 2023
step 1000, will finish Sat Jun  3 23:25:01 2023
step 1001: One or more water molecules can not be settled.
Check for bad contacts and/or reduce the timestep if appropriate.

step 1100, will finish Sat Jun  3 21:41:19 2023
step 1133: One or more water molecules can not be settled.
Check for bad contacts and/or reduce the timestep if appropriate.
Wrote pdb files with previous and current coordinates
Wrote pdb files with previous and current coordinates
[spot-dy-g4dn2xlarge-22:27464] *** Process received signal ***
[spot-dy-g4dn2xlarge-22:27464] *** Process received signal ***
[spot-dy-g4dn2xlarge-22:27464] *** Process received signal ***
[spot-dy-g4dn2xlarge-22:27464] *** Process received signal ***
[spot-dy-g4dn2xlarge-22:27464] *** Process received signal ***
[spot-dy-g4dn2xlarge-22:27464] *** Process received signal ***
[spot-dy-g4dn2xlarge-22:27464] *** Process received signal ***
[spot-dy-g4dn2xlarge-22:27464] Signal: Segmentation fault (11)
[spot-dy-g4dn2xlarge-22:27464] Signal code: Address not mapped (1)
[spot-dy-g4dn2xlarge-22:27464] Failing at address: 0xfffffffc055eeb78
[spot-dy-g4dn2xlarge-21:27469] *** Process received signal ***
[spot-dy-g4dn2xlarge-21:27469] *** Process received signal ***
[spot-dy-g4dn2xlarge-21:27469] *** Process received signal ***
[spot-dy-g4dn2xlarge-21:27469] *** Process received signal ***
[spot-dy-g4dn2xlarge-21:27469] *** Process received signal ***
[spot-dy-g4dn2xlarge-21:27469] *** Process received signal ***
[spot-dy-g4dn2xlarge-21:27469] *** Process received signal ***
[spot-dy-g4dn2xlarge-21:27469] *** Process received signal ***
[spot-dy-g4dn2xlarge-21:27469] Signal: Segmentation fault (11)
[spot-dy-g4dn2xlarge-21:27469] Signal code: Address not mapped (1)
[spot-dy-g4dn2xlarge-21:27469] Failing at address: 0xfffffffc05a638b8
[spot-dy-g4dn2xlarge-21:27469] Signal: Segmentation fault (11)
[spot-dy-g4dn2xlarge-21:27469] Signal code: Address not mapped (1)
[spot-dy-g4dn2xlarge-21:27469] Failing at address: 0xfffffffc05a638b8
[spot-dy-g4dn2xlarge-21:27469] Signal: Segmentation fault (11)
[spot-dy-g4dn2xlarge-21:27469] Signal code: Address not mapped (1)
[spot-dy-g4dn2xlarge-21:27469] Failing at address: 0xfffffffc05a638b8
[spot-dy-g4dn2xlarge-21:27469] Signal: Segmentation fault (11)
[spot-dy-g4dn2xlarge-21:27469] Signal code: Address not mapped (1)
[spot-dy-g4dn2xlarge-21:27469] Failing at address: 0xfffffffc05a638b8
[spot-dy-g4dn2xlarge-21:27469] Signal: Segmentation fault (11)
[spot-dy-g4dn2xlarge-21:27469] Signal code: Address not mapped (1)
[spot-dy-g4dn2xlarge-21:27469] Failing at address: 0xfffffffc05a638b8
[spot-dy-g4dn2xlarge-21:27469] Signal: Segmentation fault (11)
[spot-dy-g4dn2xlarge-21:27469] Signal code: Address not mapped (1)
[spot-dy-g4dn2xlarge-21:27469] Failing at address: 0xfffffffc05a638b8
[spot-dy-g4dn2xlarge-21:27469] Signal: Segmentation fault (11)
[spot-dy-g4dn2xlarge-21:27469] Signal code: Address not mapped (1)
[spot-dy-g4dn2xlarge-21:27469] Failing at address: 0xfffffffc05a638b8
[spot-dy-g4dn2xlarge-21:27469] Signal: Segmentation fault (11)
[spot-dy-g4dn2xlarge-21:27469] Signal code: Address not mapped (1)
[spot-dy-g4dn2xlarge-21:27469] Failing at address: 0xfffffffc05a638b8
[spot-dy-g4dn2xlarge-21:27469] [ 0] /lib64/libpthread.so.0(+0x117e0)[0x7f41c357b7e0]
[spot-dy-g4dn2xlarge-21:27469] [ 1] [spot-dy-g4dn2xlarge-21:27469] [ 0] [spot-dy-g4dn2xlarge-21:27469] [ 0] [spot-dy-g4dn2xlarge-21:27469] [ 0] [spot-dy-g4dn2xlarge-21:27469] [ 0] [spot-dy-g4dn2xlarge-21:27469] [ 0] [spot-dy-g4dn2xlarge-21:27469] [ 0] [spot-dy-g4dn2xlarge-21:27469] [ 0] /workdir_efs/root/opt/gromacs-2022.3/lib64/libgromacs_mpi.so.7(+0x43aa50)[0x7f41c42e1a50]
[spot-dy-g4dn2xlarge-21:27469] [ 2] /lib64/libpthread.so.0(+0x117e0)[0x7f41c357b7e0]
[spot-dy-g4dn2xlarge-21:27469] [ 1] /home/jimmych/anaconda3/lib/libgomp.so.1(+0x146d5)[0x7f41c86e86d5]
[spot-dy-g4dn2xlarge-21:27469] [ 3] /lib64/libpthread.so.0(+0x740b)[0x7f41c357140b]
[spot-dy-g4dn2xlarge-21:27469] [ 4] /lib64/libc.so.6(clone+0x3f)[0x7f41c2f6c40f]
[spot-dy-g4dn2xlarge-21:27469] *** End of error message ***
--------------------------------------------------------------------------
An MPI communication peer process has unexpectedly disconnected.  This
usually indicates a failure in the peer process (e.g., a crash or
otherwise exiting without calling MPI_FINALIZE first).

Although this local MPI process will likely now behave unpredictably
(it may even hang or crash), the root cause of this problem is the
failure of the peer -- that is what you need to investigate.  For
example, there may be a core file that you can examine.  More
generally: such peer hangups are frequently caused by application bugs
or other external events.

  Local host: spot-dy-g4dn2xlarge-16
  Local PID:  27502
  Peer host:  spot-dy-g4dn2xlarge-22
--------------------------------------------------------------------------
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 9 with PID 27464 on node spot-dy-g4dn2xlarge-22 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
[spot-dy-g4dn2xlarge-13:27479] 3 more processes have sent help message help-mpi-btl-tcp.txt / peer hung up
[spot-dy-g4dn2xlarge-13:27479] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
2 total processes killed (some possibly by mpirun during cleanup)

When I change the production simulation ensemble from NVT to NPT, however, it runs without any problems.

I’m just curious as to why it crashes when the ensemble is set to NVT but not when it is set to NPT.

Thank you for reading, and have a nice day.

Sincerely,
jimmy_chang


topol.top (1.3 MB)
processed.top (1.3 MB)
1_EM.mdp (1.1 KB)
2_NVT.mdp (2.8 KB)
3_NPT.mdp (2.6 KB)
4_EQ1.mdp (2.5 KB)
5_EQ2.mdp (2.5 KB)
6_PROD.mdp (2.5 KB)

Since this is a PLUMED-patched GROMACS, not an unmodified installation, it would be good to first establish that the error is indeed due to GROMACS. Can you please test with an official, unmodified GROMACS installation?

Cheers,
Szilárd