GROMACS version: 2021.2
GROMACS modification: NO
I am trying to run a large Martini 3 simulation with 22.5 million particles in a cubic box of 130 nm per side. The simulation box contains ~250 proteins.
I am at the NVT equilibration step, with the backbone positions restrained. When I try to run the system on multiple nodes of a cluster, I get segmentation faults and the job crashes, leaving core.[six-digit number] files behind. The .log file does not even show the start of the simulation at t = 0, and there is no sign that the run got past the preparation stage.
Here is my submission script:
#!/bin/bash
#SBATCH --job-name=test-mar-large
#SBATCH --output=%x-%j.out
#SBATCH --mail-user=mhkh2976@me.com
#SBATCH --mail-type=ALL
#SBATCH --nodes=2 # number of nodes
#SBATCH --ntasks-per-node=10 # request 10 MPI tasks per node
#SBATCH --cpus-per-task=4 # 4 OpenMP threads per MPI task => total: 10 x 4 = 40 CPUs/node
#SBATCH --mem=0 # request all available memory on the node 202 GB
#SBATCH --time=00:15:00 # time limit (HH:MM:SS)
module purge --force
module load CCEnv
module load arch/avx512 # switch architecture for up to 30% speedup
module load CCEnv StdEnv/2020 gcc/9.3.0 openmpi/4.0.3 gromacs/2021.2
export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"
srun -N 2 gmx_mpi mdrun -deffnm nvt -maxh .25
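For what it is worth, I have also been considering making the thread count and pinning explicit on the mdrun line instead of relying on the defaults. This is only a sketch of what I had in mind (the -ntomp and -pin options are taken from the mdrun help; I have not yet verified that they change the outcome):
export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"
srun --ntasks=20 --cpus-per-task="${SLURM_CPUS_PER_TASK}" \
     gmx_mpi mdrun -deffnm nvt -ntomp "${OMP_NUM_THREADS}" -pin on -maxh 0.25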
Here are the last few lines of the log file for the run with two nodes:
Initializing Parallel LINear Constraint Solver
++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
B. Hess
P-LINCS: A Parallel Linear Constraint Solver for molecular simulation
J. Chem. Theory Comput. 4 (2008) pp. 116-122
-------- -------- --- Thank You --- -------- --------
The number of constraints is 197396
There are constraints between atoms in different decomposition domains,
will communicate selected coordinates each lincs iteration
77436 constraints are involved in constraint triangles,
will apply an additional matrix expansion of order 6 for couplings
between constraints inside triangles
Linking all bonded interactions to atoms
There are 6608 inter update-group virtual sites,
will an extra communication step for selected coordinates and forces
Intra-simulation communication will occur every 5 steps.
++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
When I check the screen output, I can see that the simulation crashes right after the warning about how large the output files will be.
Here are the last few lines of the screen output of the simulation:
Using 20 MPI processes
Using 4 OpenMP threads per MPI process

NOTE: The number of threads is not equal to the number of (logical) cores
and the -pin option is set to auto: will not pin threads to cores.
This can lead to significant performance degradation.
Consider using -pin on (and -pinoffset in case you run multiple jobs).

WARNING: This run will generate roughly 2895 Mb of data
srun: error: nia0702: tasks 14-15,18: Segmentation fault (core dumped)
srun: Terminating job step 6121990.0
slurmstepd: error: *** STEP 6121990.0 ON nia0701 CANCELLED AT 2021-09-18T10:55:23 ***
srun: error: nia0701: tasks 4-6: Segmentation fault (core dumped)
srun: error: nia0702: tasks 10,12,17,19: Segmentation fault (core dumped)
srun: error: nia0701: tasks 7-9: Segmentation fault (core dumped)
srun: error: nia0702: tasks 11,13: Segmentation fault (core dumped)
srun: error: nia0701: task 1: Segmentation fault (core dumped)
srun: error: nia0701: task 3: Segmentation fault (core dumped)
srun: error: nia0701: task 0: Segmentation fault (core dumped)
srun: error: nia0701: task 2: Killed
srun: error: nia0702: task 16: Killed
srun: Force Terminated job step 6121990.0

Here is the output of scontrol show jobid 6121990:
JobId=6121990 JobName=test-mar-large
UserId=mkhatami(3007269) GroupId=pmkim(6000985) MCS_label=N/A
Priority=799457 Nice=0 Account=rrg-pmkim QOS=normal
JobState=FAILED Reason=NonZeroExitCode Dependency=(null)
Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=139:0
RunTime=00:01:30 TimeLimit=00:15:00 TimeMin=N/A
SubmitTime=2021-09-18T10:49:48 EligibleTime=2021-09-18T10:49:48
AccrueTime=2021-09-18T10:49:48
StartTime=2021-09-18T10:55:02 EndTime=2021-09-18T10:56:32 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-09-18T10:55:02
Partition=compute AllocNode:Sid=nia-login06:426581
ReqNodeList=(null) ExcNodeList=(null)
NodeList=nia[0701-0702]
BatchHost=nia0701
NumNodes=2 NumCPUs=160 NumTasks=20 CPUs/Task=4 ReqB:S:C:T=0:0::
TRES=cpu=160,mem=350000M,node=2,billing=80
Socks/Node=* NtasksPerN:B:S:C=10:0:: CoreSpec=*
MinCPUsNode=40 MinMemoryNode=175000M MinTmpDiskNode=0
Features=[skylake|cascade] DelayBoot=00:00:00
OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
Command=/gpfs/fs0/project/p/pmkim/mkhatami/work/gmx/gold-protein-martini3-warren/4times-more-152-60-20/test-martini-large/submit.sh
WorkDir=/gpfs/fs0/project/p/pmkim/mkhatami/work/gmx/gold-protein-martini3-warren/4times-more-152-60-20/test-martini-large
StdErr=/gpfs/fs0/project/p/pmkim/mkhatami/work/gmx/gold-protein-martini3-warren/4times-more-152-60-20/test-martini-large/test-mar-large-6121990.out
StdIn=/dev/null
StdOut=/gpfs/fs0/project/p/pmkim/mkhatami/work/gmx/gold-protein-martini3-warren/4times-more-152-60-20/test-martini-large/test-mar-large-6121990.out
Power=
MailUser=mhkh2976@me.com MailType=BEGIN,END,FAIL,REQUEUE,STAGE_OUT

Here is the output of sacct -j 6121990:
JobID JobName Account Elapsed MaxVMSize MaxRSS SystemCPU UserCPU ExitCode
6121990 test-mar-+ rrg-pmkim 00:01:30 00:40.710 03:31.859 11:0
6121990.bat+ batch rrg-pmkim 00:01:30 576.50M 9672K 00:00.636 00:01.297 11:0
6121990.ext+ extern rrg-pmkim 00:01:30 138360K 876K 00:00:00 00:00.001 0:0
6121990.0 gmx_mpi rrg-pmkim 00:01:21 1831796K 1382004K 00:40.074 03:30.561 0:11

Here are the kernel messages produced during job execution:
[Sep18 10:34] gmx_mpi[325228]: segfault at 2b467c22d7e0 ip 00002b467c22d7e0 sp 00002b4688a8aef8 error 15
[ +0.000008] gmx_mpi[325237]: segfault at 2b467c22d7e0 ip 00002b467c22d7e0 sp 00002b4688c8bef8 error 15
[ +0.000006] in ucx_shm_posix_43e21cbb (deleted)[2b467c175000+107000]
[ +0.001087] gmx_mpi[325231]: segfault at 2ba0066587e0 ip 00002ba0066587e0 sp 00002ba017190ef8 error 15
[ +0.000003] gmx_mpi[325223]: segfault at 2ba0066587e0 ip 00002ba0066587e0 sp 00002ba007f3bef8 error 15
[ +0.000002] in ucx_shm_posix_78898a24 (deleted)[2ba00658f000+107000]
[ +0.000004] in ucx_shm_posix_78898a24 (deleted)[2ba00658f000+107000]
[ +0.004825] gmx_mpi[325226]: segfault at 2ba989b5a7e0 ip 00002ba989b5a7e0 sp 00002ba99a58fef8 error 15
[ +0.000011] in ucx_shm_posix_3cdaf27e (deleted)[2ba989a91000+107000]
[ +0.000388] gmx_mpi[325214]: segfault at 2b4ef37677e0 ip 00002b4ef37677e0 sp 00002b4f03f23ef8 error 15
[ +0.000019] in ucx_shm_posix_5c48c660 (deleted)[2b4ef369e000+107000]
[ +0.000063] gmx_mpi[325215]: segfault at 2ad65d3477e0 ip 00002ad65d3477e0 sp 00002ad66db8fef8 error 15
[ +0.000019] in ucx_shm_posix_5c48c660 (deleted)[2ad65d27e000+107000]
[ +0.000590] gmx_mpi[325218]: segfault at 2b1ecd1077e0 ip 00002b1ecd1077e0 sp 00002b1ecfdeaef8 error 15
[ +0.000005] gmx_mpi[325239]: segfault at 2b1ecd1077e0 ip 00002b1ecd1077e0 sp 00002b1eddb8fef8 error 15
[ +0.000001] in ucx_shm_posix_3b57c078 (deleted)[2b1ecd03e000+107000]
[ +0.000003] in ucx_shm_posix_3b57c078 (deleted)[2b1ecd03e000+107000]
[ +0.000696] gmx_mpi[325211]: segfault at 2b270a3577e0 ip 00002b270a3577e0 sp 00002b270ba39ef8 error 15
[ +0.000023] in ucx_shm_posix_70ff78be (deleted)[2b270a28e000+107000]
[ +0.155688] in ucx_shm_posix_43e21cbb (deleted)[2b467c175000+107000]
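Since all of the kernel messages point at ucx_shm_posix shared-memory segments, the next thing I was planning to test is steering UCX away from shared memory for intra-node communication. This is only a guess on my part, assuming the OpenMPI 4.0.3 build on this cluster uses UCX; the transport list below is based on my reading of the UCX documentation and may need adjusting for this machine:
# hypothetical test: disable UCX shared-memory transports so same-node ranks fall back to TCP
export UCX_TLS=self,tcp
srun -N 2 gmx_mpi mdrun -deffnm nvt -maxh .25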
I have tried the same simulation on a single node and it runs fine.
I have also tried a smaller version of the same system (a 60 nm box with 1.6 million particles), and it runs fine with the same script on multiple CPU nodes (I have tried up to 10 nodes x 40 CPUs). My all-atom simulations also run fine on the same cluster with the same submission script.
What could cause these crashes for the large system but not for the smaller ones?
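If it helps with the diagnosis: the other test I had in mind was to request fewer MPI ranks per node (so each rank gets more memory and fewer shared-memory segments are created) while keeping all 40 cores per node busy. Again, this is just a sketch of what I would try, not something I have confirmed helps:
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4 # fewer MPI ranks per node than before
#SBATCH --cpus-per-task=10 # 4 ranks x 10 threads = 40 cores per node
export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"
srun gmx_mpi mdrun -deffnm nvt -maxh .25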
Best,
Mohammad