GROMACS production runs take very long and sometimes stop running

GROMACS version: 2024
GROMACS modification: Yes

System Description:

  • Protein simulation in water.
  • Systems range from 76,000 to 800,000 atoms.
  • Two single node servers, one with a GPU.
  • Pre-production steps, from pdb2gmx through NPT equilibration, ran without problems.

Problem Description:

  • Production run starts smoothly and runs fast.
  • However, it freezes at some point (around 600,000 to 1,000,000 steps).
  • The process appears to be stuck, not progressing further, without crashing.
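
One way to confirm that the run has really stalled, rather than just slowed down, is to check whether the log file is still being written. The sketch below is only an illustration in Python and is not part of GROMACS. The function names and the 30-minute threshold are assumptions, and `md_0_1.log` is the log name implied by `-deffnm md_0_1`. The threshold should be longer than the time the run normally needs between two log writes.

```python
import os
import time


def seconds_since_update(path, now=None):
    """Return how many seconds ago `path` was last modified."""
    if now is None:
        now = time.time()
    return now - os.path.getmtime(path)


def is_stalled(path, max_idle_seconds, now=None):
    """True if `path` has not been written for longer than max_idle_seconds."""
    return seconds_since_update(path, now) > max_idle_seconds


if __name__ == "__main__":
    log_file = "md_0_1.log"  # assumed name, derived from -deffnm md_0_1
    threshold = 30 * 60      # assumed threshold: 30 minutes without a log write
    if os.path.exists(log_file):
        print(log_file, "stalled:", is_stalled(log_file, threshold))
    else:
        print(log_file, "not found in the current directory")
```

Run from the working directory, this reports whether the log has gone quiet for longer than the chosen threshold.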

System Settings:

  • Data prefix: /usr/local/gromacs
  • Working directory: /data/emanuel/HECW1/WT
  • Process ID: 169340
  • Command line:

gmx mdrun -deffnm md_0_1 -ntomp 32 -nb gpu -pin on
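
If the run has to be killed after a freeze, it can be continued from the last checkpoint rather than started over. The sketch below only assembles the command line. It assumes the checkpoint file is `md_0_1.cpt` (the name that follows from `-deffnm md_0_1`), and the helper name `restart_command` is made up for illustration.

```python
import shlex


def restart_command(deffnm, extra_args=()):
    """Build a `gmx mdrun` command that continues from <deffnm>.cpt."""
    cmd = ["gmx", "mdrun", "-deffnm", deffnm, "-cpi", deffnm + ".cpt"]
    cmd.extend(extra_args)
    return cmd


if __name__ == "__main__":
    # Same options as the original command line, plus -cpi for the checkpoint.
    cmd = restart_command("md_0_1", ["-ntomp", "32", "-nb", "gpu", "-pin", "on"])
    print(shlex.join(cmd))
```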

GROMACS Version:

  • Version: 2024.1
  • Precision: Mixed
  • Memory model: 64-bit
  • MPI library: thread_mpi
  • OpenMP support: Enabled (GMX_OPENMP_MAX_THREADS = 128)
  • GPU support: CUDA
  • NBNxM GPU setup: Super-cluster 2x2x2 / Cluster 8
  • SIMD instructions: AVX2_128
  • CPU FFT library: fftw-3.3.8-sse2-avx-avx2-avx2_128
  • GPU FFT library: cuFFT
  • Multi-GPU FFT: None
  • RDTSCP usage: Enabled
  • TNG support: Enabled
  • Hwloc support: Disabled
  • Tracing support: Disabled
  • C compiler: /usr/bin/cc GNU 9.4.0
  • C compiler flags: -fexcess-precision=fast -funroll-all-loops -mavx2 -mfma -Wno-missing-field-initializers -O3 -DNDEBUG

Notes:

  • The issue occurs during the production run phase.
  • The simulation appears to halt at some point without progressing.
  • No error messages or crashes are reported.
  • Further investigation is needed to identify the root cause of the freezing. Potential areas include system resources, memory leaks, or algorithmic issues.
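
To tell a suspended or blocked process apart from a busy one, and to watch for memory growth, the process state and resident memory can be read from `/proc/<pid>/status`. This is a minimal sketch that assumes a Linux host. The PID 169340 is the one listed under System Settings and will differ for every run, and the function name is illustrative.

```python
def parse_proc_status(text):
    """Extract process state and resident memory (kB) from /proc/<pid>/status text."""
    info = {"state": None, "vm_rss_kb": None}
    for line in text.splitlines():
        key, _, value = line.partition(":")
        value = value.strip()
        if key == "State":
            info["state"] = value  # e.g. "R (running)" or "T (stopped)"
        elif key == "VmRSS":
            info["vm_rss_kb"] = int(value.split()[0])
    return info


if __name__ == "__main__":
    pid = 169340  # PID from the report above; replace with the current mdrun PID
    try:
        with open(f"/proc/{pid}/status") as fh:
            print(parse_proc_status(fh.read()))
    except OSError:
        print(f"no readable /proc entry for PID {pid}")
```

A state of `T (stopped)` means the process has been suspended (for example with Ctrl+Z), and `D (disk sleep)` means it is waiting on I/O. A `VmRSS` value that keeps rising across repeated samples would point toward a memory leak.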

Input Parameters:
integrator = md
tinit = 0
dt = 0.002
nsteps = 5000000
init-step = 0
simulation-part = 1
mts = false
mass-repartition-factor = 1
comm-mode = Linear
nstcomm = 100
bd-fric = 0
ld-seed = -5517954
emtol = 10
emstep = 0.01
niter = 20
fcstep = 0
nstcgsteep = 1000
nbfgscorr = 10
rtpi = 0.05
nstxout = 0
nstvout = 0
nstfout = 0
nstlog = 5000
nstcalcenergy = 100
nstenergy = 5000
nstxout-compressed = 5000
compressed-x-precision = 1000
cutoff-scheme = Verlet
nstlist = 10
pbc = xyz
periodic-molecules = false
verlet-buffer-tolerance = 0.005
verlet-buffer-pressure-tolerance = 0.5
rlist = 1
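
For orientation, the parameters above translate into simulated time as follows. This is plain arithmetic on the listed values (`dt`, `nsteps`, `nstxout-compressed`) and the reported freeze window of 600,000 to 1,000,000 steps. The helper name `steps_to_ns` is illustrative.

```python
def steps_to_ns(steps, dt_ps):
    """Convert a step count to nanoseconds, given the time step in picoseconds."""
    return steps * dt_ps / 1000.0


if __name__ == "__main__":
    dt = 0.002                 # ps, from the parameter listing
    nsteps = 5_000_000
    nstxout_compressed = 5000

    print("total length (ns):", steps_to_ns(nsteps, dt))
    print("freeze window (ns):", steps_to_ns(600_000, dt), "to", steps_to_ns(1_000_000, dt))
    print("frame interval (ps):", nstxout_compressed * dt)
    print("frame intervals in full run:", nsteps // nstxout_compressed)
```

In other words, the run is set up for about 10 ns, and the freeze happens after roughly 1.2 to 2 ns of simulated time.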

Your help will be highly appreciated.
Thanks

Hi, Aman. I have the same problem. My process has frozen at the same step twice.
step 66485500, will finish Thu May 30 21:59:14 2024^Zb F 14% pme/F 0.98 F 0.92
I’m not sure whether the issue is due to software settings or hardware problems.