GPU likely crashing out mid-simulation, causing long run times

GROMACS version: 2023.3 with CUDA 12.3
GROMACS modification: No

Hi all,

Whenever I run a production simulation of a bilayer built with CHARMM-GUI, everything runs efficiently for the first ~10 hours of what is projected to be a 30-hour run. Then, at about the 10-hour mark, the estimated remaining time starts ticking up by roughly 1 second every 1.5 seconds, so the simulation would then take a week or more to complete.

I have had a similar issue before, though with a simpler ~7-hour simulation. The cause was the GPU crashing out mid-simulation, and the fix was to add the -update gpu and -nstlist 400 arguments to mdrun (e.g., gmx mdrun -s test.tpr -v -x test.xtc -c test.gro -nb gpu -bonded gpu -pme gpu -update gpu -nstlist 400). However, the same arguments are not fixing my current simulation, perhaps because this run is much longer and the problem only kicks in around hour 10.

Does anyone have advice on how to keep the GPU from crashing out mid-simulation? My hardware is an RTX 4090 (24 GB), an Intel i9-13900K, and 64 GB of RAM.
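
In case it helps with diagnosis, this is the kind of GPU logging I could run alongside the job to check whether the card throttles or drops off around the 10-hour mark (just a sketch; the 30-second interval and the log file name are arbitrary choices on my part):

nvidia-smi --query-gpu=timestamp,temperature.gpu,clocks.sm,power.draw,utilization.gpu --format=csv -l 30 > gpu_log.csv &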

In case it is of value, here are the .mdp file and the terminal command I am running.

integrator = md
dt = 0.004
nsteps = 250000000
nstxout-compressed = 25000
nstxout = 0
nstvout = 0
nstfout = 0
nstcalcenergy = 100
nstenergy = 1000
nstlog = 1000
;
cutoff-scheme = Verlet
nstlist = 400
rlist = 1.2
vdwtype = Cut-off
vdw-modifier = Force-switch
rvdw_switch = 1.0
rvdw = 1.2
coulombtype = PME
rcoulomb = 1.2
;
tcoupl = v-rescale
tc_grps = MEMB SOLV
tau_t = 1.0 1.0
ref_t = 303.15 303.15
;
pcoupl = C-rescale
pcoupltype = semiisotropic
tau_p = 5.0
compressibility = 4.5e-5 4.5e-5
ref_p = 1.0 1.0
;
constraints = h-bonds
constraint_algorithm = LINCS
continuation = yes
;
nstcomm = 100
comm_mode = linear
comm_grps = MEMB SOLV

gmx mdrun -v -deffnm ${istep} -nb gpu -bonded gpu -pme gpu -update gpu -nstlist 400
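
For completeness, if I do end up having to kill the run once it slows down, my plan is to resume from the checkpoint with the standard -cpi flag, roughly like this (a sketch; with -deffnm the checkpoint is written as ${istep}.cpt, and mdrun appends to the existing output files on restart):

gmx mdrun -v -deffnm ${istep} -cpi ${istep}.cpt -nb gpu -bonded gpu -pme gpu -update gpu -nstlist 400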

Thanks in advance!