Getting a segmentation fault from mdrun

GROMACS version: 2021.2
GROMACS modification: NO
I am trying to run a large Martini 3 simulation with 22.5 million particles in a cubic box of 130 nm per side; the box contains ~250 proteins.
I am at the NVT equilibration step, with backbone positions restrained. When I run the system on multiple nodes of a cluster, I get segmentation faults and the job crashes, leaving core.[six digit numbers] files. The .log file does not even show the start of the simulation at t=0, so there is no sign the run got past the preparation stage.

Here is my submission script:

#!/bin/bash
#SBATCH --job-name=test-mar-large
#SBATCH --output=%x-%j.out
#SBATCH --mail-user=mhkh2976@me.com
#SBATCH --mail-type=ALL
#SBATCH --nodes=2 # number of nodes
#SBATCH --ntasks-per-node=10 # request 10 MPI tasks per node
#SBATCH --cpus-per-task=4 # 4 OpenMP threads per MPI task => total: 10 x 4 = 40 CPUs/node
#SBATCH --mem=0 # request all available memory on the node (~202 GB)
#SBATCH --time=00:15:00 # time limit (HH:MM:SS)

module purge --force
module load CCEnv
module load arch/avx512 # switch architecture for up to 30% speedup
module load CCEnv StdEnv/2020 gcc/9.3.0 openmpi/4.0.3 gromacs/2021.2
export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"

srun -N 2 gmx_mpi mdrun -deffnm nvt -maxh .25
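
(Not part of the script as submitted: the mdrun NOTE in the screen output below suggests pinning threads. A hedged variant of the last line that makes the OpenMP thread count and pinning explicit, using the standard -ntomp and -pin mdrun options, would be:)

# Sketch only: same run, but with an explicit thread count and thread pinning,
# as the "consider using -pin on" NOTE from mdrun suggests.
srun gmx_mpi mdrun -deffnm nvt -maxh 0.25 -ntomp "${SLURM_CPUS_PER_TASK}" -pin on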

Here are the last few lines of the log file for the run with two nodes:

Initializing Parallel LINear Constraint Solver

++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
B. Hess
P-LINCS: A Parallel Linear Constraint Solver for molecular simulation
J. Chem. Theory Comput. 4 (2008) pp. 116-122
-------- -------- --- Thank You --- -------- --------

The number of constraints is 197396
There are constraints between atoms in different decomposition domains,
will communicate selected coordinates each lincs iteration
77436 constraints are involved in constraint triangles,
will apply an additional matrix expansion of order 6 for couplings
between constraints inside triangles

Linking all bonded interactions to atoms

There are 6608 inter update-group virtual sites,
will an extra communication step for selected coordinates and forces

Intra-simulation communication will occur every 5 steps.

++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++

In the screen output, the run crashes right after the warning about how much data it will generate.
Here are the last few lines of the screen output:

Using 20 MPI processes
Using 4 OpenMP threads per MPI process

NOTE: The number of threads is not equal to the number of (logical) cores
and the -pin option is set to auto: will not pin threads to cores.
This can lead to significant performance degradation.
Consider using -pin on (and -pinoffset in case you run multiple jobs).

WARNING: This run will generate roughly 2895 Mb of data

srun: error: nia0702: tasks 14-15,18: Segmentation fault (core dumped)
srun: Terminating job step 6121990.0
slurmstepd: error: *** STEP 6121990.0 ON nia0701 CANCELLED AT 2021-09-18T10:55:23 ***
srun: error: nia0701: tasks 4-6: Segmentation fault (core dumped)
srun: error: nia0702: tasks 10,12,17,19: Segmentation fault (core dumped)
srun: error: nia0701: tasks 7-9: Segmentation fault (core dumped)
srun: error: nia0702: tasks 11,13: Segmentation fault (core dumped)
srun: error: nia0701: task 1: Segmentation fault (core dumped)
srun: error: nia0701: task 3: Segmentation fault (core dumped)
srun: error: nia0701: task 0: Segmentation fault (core dumped)
srun: error: nia0701: task 2: Killed
srun: error: nia0702: task 16: Killed
srun: Force Terminated job step 6121990.0

scontrol show jobid 6121990
JobId=6121990 JobName=test-mar-large
UserId=mkhatami(3007269) GroupId=pmkim(6000985) MCS_label=N/A
Priority=799457 Nice=0 Account=rrg-pmkim QOS=normal
JobState=FAILED Reason=NonZeroExitCode Dependency=(null)
Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=139:0
RunTime=00:01:30 TimeLimit=00:15:00 TimeMin=N/A
SubmitTime=2021-09-18T10:49:48 EligibleTime=2021-09-18T10:49:48
AccrueTime=2021-09-18T10:49:48
StartTime=2021-09-18T10:55:02 EndTime=2021-09-18T10:56:32 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-09-18T10:55:02
Partition=compute AllocNode:Sid=nia-login06:426581
ReqNodeList=(null) ExcNodeList=(null)
NodeList=nia[0701-0702]
BatchHost=nia0701
NumNodes=2 NumCPUs=160 NumTasks=20 CPUs/Task=4 ReqB:S:C:T=0:0::
TRES=cpu=160,mem=350000M,node=2,billing=80
Socks/Node=* NtasksPerN:B:S:C=10:0:: CoreSpec=*
MinCPUsNode=40 MinMemoryNode=175000M MinTmpDiskNode=0
Features=[skylake|cascade] DelayBoot=00:00:00
OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
Command=/gpfs/fs0/project/p/pmkim/mkhatami/work/gmx/gold-protein-martini3-warren/4times-more-152-60-20/test-martini-large/submit.sh
WorkDir=/gpfs/fs0/project/p/pmkim/mkhatami/work/gmx/gold-protein-martini3-warren/4times-more-152-60-20/test-martini-large
StdErr=/gpfs/fs0/project/p/pmkim/mkhatami/work/gmx/gold-protein-martini3-warren/4times-more-152-60-20/test-martini-large/test-mar-large-6121990.out
StdIn=/dev/null
StdOut=/gpfs/fs0/project/p/pmkim/mkhatami/work/gmx/gold-protein-martini3-warren/4times-more-152-60-20/test-martini-large/test-mar-large-6121990.out
Power=
MailUser=mhkh2976@me.com MailType=BEGIN,END,FAIL,REQUEUE,STAGE_OUT

sacct -j 6121990
JobID         JobName     Account    Elapsed   MaxVMSize   MaxRSS     SystemCPU  UserCPU    ExitCode
6121990       test-mar-+  rrg-pmkim  00:01:30                         00:40.710  03:31.859  11:0
6121990.bat+  batch       rrg-pmkim  00:01:30  576.50M     9672K      00:00.636  00:01.297  11:0
6121990.ext+  extern      rrg-pmkim  00:01:30  138360K     876K       00:00:00   00:00.001  0:0
6121990.0     gmx_mpi     rrg-pmkim  00:01:21  1831796K    1382004K   00:40.074  03:30.561  0:11
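
(A note on reading these numbers: the scontrol ExitCode=139:0 and the sacct 0:11 for the gmx_mpi step both point to signal 11, SIGSEGV, since the shell reports death-by-signal as 128 plus the signal number. A purely illustrative check:)

# 139 = 128 + 11, i.e. the job step was killed by SIGSEGV (signal 11)
echo $((139 - 128))   # prints 11
kill -l 11            # prints SEGV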

Kernel messages produced during job execution:
[Sep18 10:34] gmx_mpi[325228]: segfault at 2b467c22d7e0 ip 00002b467c22d7e0 sp 00002b4688a8aef8 error 15
[ +0.000008] gmx_mpi[325237]: segfault at 2b467c22d7e0 ip 00002b467c22d7e0 sp 00002b4688c8bef8 error 15
[ +0.000006] in ucx_shm_posix_43e21cbb (deleted)[2b467c175000+107000]
[ +0.001087] gmx_mpi[325231]: segfault at 2ba0066587e0 ip 00002ba0066587e0 sp 00002ba017190ef8 error 15
[ +0.000003] gmx_mpi[325223]: segfault at 2ba0066587e0 ip 00002ba0066587e0 sp 00002ba007f3bef8 error 15
[ +0.000002] in ucx_shm_posix_78898a24 (deleted)[2ba00658f000+107000]
[ +0.000004] in ucx_shm_posix_78898a24 (deleted)[2ba00658f000+107000]
[ +0.004825] gmx_mpi[325226]: segfault at 2ba989b5a7e0 ip 00002ba989b5a7e0 sp 00002ba99a58fef8 error 15
[ +0.000011] in ucx_shm_posix_3cdaf27e (deleted)[2ba989a91000+107000]
[ +0.000388] gmx_mpi[325214]: segfault at 2b4ef37677e0 ip 00002b4ef37677e0 sp 00002b4f03f23ef8 error 15
[ +0.000019] in ucx_shm_posix_5c48c660 (deleted)[2b4ef369e000+107000]
[ +0.000063] gmx_mpi[325215]: segfault at 2ad65d3477e0 ip 00002ad65d3477e0 sp 00002ad66db8fef8 error 15
[ +0.000019] in ucx_shm_posix_5c48c660 (deleted)[2ad65d27e000+107000]
[ +0.000590] gmx_mpi[325218]: segfault at 2b1ecd1077e0 ip 00002b1ecd1077e0 sp 00002b1ecfdeaef8 error 15
[ +0.000005] gmx_mpi[325239]: segfault at 2b1ecd1077e0 ip 00002b1ecd1077e0 sp 00002b1eddb8fef8 error 15
[ +0.000001] in ucx_shm_posix_3b57c078 (deleted)[2b1ecd03e000+107000]
[ +0.000003] in ucx_shm_posix_3b57c078 (deleted)[2b1ecd03e000+107000]
[ +0.000696] gmx_mpi[325211]: segfault at 2b270a3577e0 ip 00002b270a3577e0 sp 00002b270ba39ef8 error 15
[ +0.000023] in ucx_shm_posix_70ff78be (deleted)[2b270a28e000+107000]
[ +0.155688] in ucx_shm_posix_43e21cbb (deleted)[2b467c175000+107000]
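
(An observation rather than a diagnosis: every faulting address above lies inside a UCX posix shared-memory segment. Purely as a guess to test, not a known fix, one could restrict the UCX transports before srun and see whether the crash changes; UCX_TLS is a standard UCX environment variable:)

# Diagnostic guess only, not a verified fix: take posix shared memory out of the
# transport list and check whether the segfaults persist or move elsewhere.
export UCX_TLS=self,tcp
srun -N 2 gmx_mpi mdrun -deffnm nvt -maxh .25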

I have tried the same simulation on a single node and it runs fine.
I have also tried a smaller version of the same system, a 60 nm box with 1.6 million particles, and it runs fine with the same script on multiple nodes (I have gone up to 10 nodes x 40 CPUs). My all-atom simulations also run fine on the same cluster with the same submission script.

What could cause these crashes in the large system but not in the smaller ones?

Best,
Mohammad

Can you provide access to your tpr file? Then I can try to figure out what goes wrong.

Here is the Google Drive link to the tpr file:
https://drive.google.com/file/d/1wZVnhrip8ybkl26yPV1LptOncKOvzyzm/view?usp=sharing
Just request access and I will grant it.

Mohammad

@hess

I found and fixed the bug. The fix is up for review here:

You only need to change a few lines, so you can fix it yourself and try.

But even with this fix I still can't run your system on my machine: mdrun stops with the message "Killed", and the debugger gives no information on what happened. That crash is unrelated to the bug I fixed. Could you check whether the system now runs on your machine with the fix applied?

I am now quite sure that mdrun gets killed on my workstation because it needs more resources than the machine has, not because of an issue in GROMACS. So this should be solved now.

Thanks @hess, it works like a charm!
Mohammad

@hess @Mohammad
I am facing a similar problem during the production step of the simulation.
The energy minimization and equilibration steps were successful.
Any help would be hugely appreciated.

Terminal shows:

Using 1 MPI thread
Using 8 OpenMP threads

starting mdrun 'Title'
250000 steps, 1000.0 ps.

Step 0, time 0 (ps) LINCS WARNING
relative constraint deviation after LINCS:
rms 0.013830, max 0.479731 (between atoms 2137 and 2138)
bonds that rotated more than 30 degrees:
atom 1 atom 2 angle previous, current, constraint length
2074 2087 37.7 0.2508 0.1627 0.1450
2087 2088 30.5 0.1111 0.1177 0.1111
2087 2089 41.4 0.2311 0.1558 0.1400
2089 2090 41.1 0.0960 0.1088 0.0960
2136 2137 65.2 0.2186 0.1827 0.1420
2137 2138 84.6 0.1111 0.1644 0.1111
2137 2139 57.4 0.3271 0.1971 0.1456
2137 2151 67.3 0.2861 0.1886 0.1450
2151 2152 32.7 0.1111 0.1184 0.1111
2151 2153 36.0 0.1373 0.1543 0.1400
step 0
Step 1, time 0.004 (ps) LINCS WARNING
relative constraint deviation after LINCS:
rms 168882618368.000000, max 9011285983232.000000 (between atoms 2087 and 2088)
bonds that rotated more than 30 degrees:
atom 1 atom 2 angle previous, current, constraint length
2064 2067 132.2 0.1603 1.2749 0.1600
2067 2068 81.2 0.1476 5.7118 0.1440
2068 2069 86.9 0.1142 5.5788 0.1111
.
.
.
2162 2164 46.9 0.1111 0.1683 0.1111
2162 2165 39.2 0.1513 0.2036 0.1512
Wrote pdb files with previous and current coordinates
zsh: segmentation fault gmx mdrun -v -deffnm pro

This is unrelated to the issue above, and it is not really a segmentation-fault problem. It looks like the usual case of an unstable initial configuration. You either need to do more energy minimization, or fix clashes or incorrect conformations in your starting structure.
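
In practice that usually means something like the following, where the file names (em.mdp, start.gro, topol.top, pro.mdp) are placeholders for your own inputs, not settings taken from this system:

# Re-minimize the starting structure more thoroughly, then rebuild and rerun the
# production tpr from the minimized coordinates. All file names are placeholders.
gmx grompp -f em.mdp -c start.gro -p topol.top -o em.tpr
gmx mdrun -v -deffnm em
gmx grompp -f pro.mdp -c em.gro -p topol.top -o pro.tpr
gmx mdrun -v -deffnm pro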