LINCS warnings and system crashes -- changing the number of MPI and OpenMP threads helps the run go further

GROMACS version: 2022.3
GROMACS modification: No

I am trying to simulate a large system using the MARTINI 3.0 FF, containing more than 100 proteins, ~1400 lipids, ions, and ~1.7M water molecules (a total of 1.9M particles). I had a similar system, which I discussed in another post last year (here); it gave me segmentation errors, and that issue was resolved.
Unlike my previous system, where the proteins were dispersed and placed randomly in the simulation box, in my current system the proteins are placed on top of each other (using a docking-like method). Note that I am using the same proteins and the same itp files for the force field.

After the EM step (I set the number of steps to 30k, and it usually finishes at ~6k steps), I try to run NPT equilibration with position restraints, using the following mdp options:

include = -I../include
define         = -DPOSRES_all_proteins_lipids
title          = Martini
tinit          = 0.0

integrator        = md
dt = XX
nsteps          = 500000
comm-mode        =none
nstcomm         = 100
comm-grps		 = 

nstxout         = 100000
nstvout         = 0
nstfout         = 0
nstlog          = 100 ; Output frequency for energies to log file 
nstenergy        = 10  ; Output frequency for energies to energy file
nstxtcout        = 1000 ; Output frequency for .xtc file
xtc_precision      = 100
compressed-x-grps    = protein

cutoff-scheme      = Verlet
nstlist         = 20
ns_type         = grid
pbc           = xyz
verlet-buffer-tolerance = 0.005

coulombtype       = reaction-field
rcoulomb         = 1.1
epsilon_r        = 15	; 2.5 (with polarizable water)
epsilon_rf        = 0 
vdw_type         = cutoff ;(for use with Verlet-pairlist)  
rvdw           = 1.1 ;(for use with Verlet-pairlist)

tcoupl          = berendsen; v-rescale 
tc-grps         = DOPC_DOPS protein W_ION
tau_t          = 1.0 1.0 1.0
ref_t          = 310 310 310
Pcoupl          = berendsen; parrinello-rahman; berendsen; C-rescale;berendsen ; parrinello-rahman
Pcoupltype        = isotropic
tau_p          = 20.0 ; I have used 2.0, and still I have a similar problem
compressibility     = 3e-4;  3e-4  3e-4
ref_p          = 1.0 ;  1.0   1.0

annealing        = single single single
annealing-npoints    = 4 4 4
annealing-time      = 0 200 400 600  0 200 400 600  0 200 400 600 ;; for smaller dt these are smaller
annealing-temp      = 50 150 250 310 50 150 250 310 50 150 250 310

gen_vel         = no; yes
;gen_temp         = 310
gen_seed         = 473529

constraints       = none 
constraint_algorithm   = Lincs
lincs_order       = 30 ; 6 is needed for large time-steps with virtual sites or BD
lincs_warnangle     = 30
;used charmm-gui from dops mdp generated!

refcoord_scaling     = all
sc-r-power        = 6
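
For completeness, I generate the .tpr for this restrained NPT step roughly as follows (file names and the index file are placeholders for my actual inputs; -r supplies the reference coordinates for the position restraints):

gmx grompp -f npt.mdp -c em.gro -r em.gro -p topol.top -n index.ndx -o npt.tpr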

So, when I set dt to 20 fs (the usual MARTINI time step) in place of dt = XX in the mdp file above, I get LINCS errors, so I keep reducing it until I reach 0.5 fs (dt = 0.0005), which is much smaller than a typical all-atom time step.
At that point I get almost no LINCS errors and can run the equilibration for the full 500k steps. I then try to increase dt step by step back to the usual MARTINI value (dt = 0.020).

I increase dt in steps of 0.5 fs (i.e., dt_current = dt_previous + 0.0005) and run each stage for 500k steps. Starting from dt = 0.001, I get LINCS errors at around step 200k and beyond; when I re-run the simulation from the checkpoint it proceeds, and the next LINCS error appears at, say, ~300k. By stubbornly restarting from the previous checkpoint over and over, the simulation eventually completes the full 500k steps (and the system looks as expected, with no flying particles; everything is as it should be!).
For larger time steps, e.g., dt = 0.004 (which is still close to all-atom territory), the LINCS errors and crashes become more frequent, going from roughly every 100k steps to every ~2k steps, and in some cases the simulation gets stuck at a step until I choose a different number of MPI ranks and OpenMP threads! It then advances by ~2k steps and crashes at a later step. I am currently at dt = 0.0045.
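
For reference, each stage of this dt ramp boils down to something like the following sketch (placeholder file names, not my exact script; the template mdp contains the dt = XX line shown above):

#!/bin/bash
# Rough sketch of the dt ramp: substitute the new dt, regenerate the tpr, run the stage.
prev=em.gro
for dt in 0.0005 0.0010 0.0015 0.0020; do
  sed "s/dt = XX/dt = ${dt}/" npt_template.mdp > npt_${dt}.mdp
  gmx grompp -f npt_${dt}.mdp -c ${prev} -r em.gro -p topol.top -n index.ndx -o npt_${dt}.tpr
  mpiexec gmx_mpi mdrun -s npt_${dt}.tpr -deffnm npt_${dt} -cpi npt_${dt}.cpt
  prev=npt_${dt}.gro
done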

A point not to forget: the LINCS errors appear in the proteins, and the system was centred (using editconf; see the example command below) in a simulation box with 5 nm between the system and each box edge.
Another point worth mentioning is that I have also tried running NVT first and then NPT, and I ended up with similar issues.
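
For reference, the editconf centring mentioned above was roughly the following (file names are placeholders):

gmx editconf -f docked_system.gro -o centered.gro -c -d 5.0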

The two issues I am facing are:

  1. Why does changing the number of MPI and OpenMP threads make the simulation go ahead?

  2. I am using the same FF and the same itp and mdp files as in the previous system I studied (the dispersed proteins). Why can't I just use dt = 20 fs from the start, as in that system? I believe it has to do with the crowded simulation box in the current system, but I am not sure whether it is a GROMACS issue with the domain decomposition or something else.

My submission script is:

#!/bin/bash
#SBATCH --job-name=z-v3-npt
#SBATCH --output=%x-%j.out
#SBATCH --mail-user=mhkh2976@me.com
#SBATCH --mail-type=ALL
#SBATCH --nodes=1      # number of nodes
#SBATCH --ntasks-per-node=6   # request 6 MPI tasks per node
#SBATCH --cpus-per-task=4    # 4 OpenMP threads per MPI task => total: 6 x 4 = 24 CPUs/node
#SBATCH --mem=40GB         # request 40 GB of memory on the node
#SBATCH --time=0-0:15:00      # time limit (D-HH:MM)
# SBATCH --constraint="[skylake|cascade]" # restrict to AVX512 capable nodes.
# SBATCH -p debug

module purge --force
module load StdEnv/2020 gcc/9.3.0 openmpi/4.0.3 gromacs/2022.3

export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"

mpiexec gmx_mpi mdrun -s npt.tpr -deffnm npt -v -dlb yes -cpt 1 -cpi npt.cpt -pin on -ntomp ${SLURM_CPUS_PER_TASK:-1} -rdd 3.0
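
When the run gets stuck, I restart from the checkpoint with a different rank/thread split, for example like this (assuming the SLURM request is changed to match, e.g. --ntasks-per-node=4 and --cpus-per-task=6 on the same 24-core allocation):

mpiexec gmx_mpi mdrun -s npt.tpr -deffnm npt -v -dlb yes -cpt 1 -cpi npt.cpt -pin on -ntomp 6 -rdd 3.0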

Any ideas? @hess or @jalemkul?

My guess would be that your starting configuration is badly equilibrated or has issues such as parts of molecules sticking through each other. As MD is chaotic, changing any parameter or detail of the run configuration will change the step at which the instability occurs.
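
A quick way to check for such close contacts could be, for example, gmx mindist between pairs of protein groups after minimization (file and group names here are only illustrative):

gmx mindist -f em.gro -s npt.tpr -n index.ndx -od mindist.xvg

Very small minimum distances between different chains would point to an overlapping starting structure.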

I appreciate your input.

Hello!

Your issue sounds strikingly similar to a challenge I recently faced. Like you, I was working with coarse-grained simulations, specifically 25 different protein systems. All of these systems were built with the same modelling parameters and run with the same mdp files, including a simulated annealing step at 700 K. Of these 25 systems, a few did not run smoothly (though they could be continued to completion), but some failed right at the start with segmentation errors, despite having gone through the energy minimization step.

To address these failures, my first approach was to reduce the time step to 1 fs. This solved the issue for most of the problematic systems. However, three systems still would not run, even after I decreased the time step further to 0.5 fs. I also tried relaxing the solvent and the protein before running the simulated annealing, but this did not resolve the issue either; the simulations still would not run. Even setting the temperature to 300 K did not help.

Do you have any suggestions or advice on how to tackle this problem? Any insights would be greatly appreciated! @hess @jalemkul @Mohammad