CUDA error with GROMACS 2023.3 (CUDA error #700: an illegal memory access)

GROMACS version: 2023.3
GROMACS modification: No
I'm getting a CUDA error with the standard GROMACS 2023.3 build on my laptop, using:
an NVIDIA RTX 4060 GPU and 20 OpenMP threads (13th-generation Core i7).
CUDA version: 12.3, NVIDIA driver version: 545.84. OS: Ubuntu 22.04.2 (installed on WSL2, Windows 11).
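For anyone reproducing the setup: the driver and CUDA versions above can be read off inside WSL2 with the standard tools:

nvidia-smi
nvcc --version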

The error message is:

Program: gmx mdrun, version 2023.3
Source file: src/gromacs/gpu_utils/device_stream.cu (line 100)
Function: DeviceStream::synchronize() const::<lambda()>
Assertion failed:
Condition: stat == cudaSuccess
cudaStreamSynchronize failed. CUDA error #700 (cudaErrorIllegalAddress): an
illegal memory access was encountered.

I am attaching the .tpr file of my simulation to help the developer team reproduce the error and find its root cause.
tpr file link : https://filetransfer.io/data-package/IjCaW1po#link

P.S. To give some perspective: I am trying to run a fairly long AWH simulation. I get this error every few hours (roughly every 1 ns of simulation) and have to restart the job with the -cpi option, but the same scenario keeps happening over and over again, which makes the whole process rather painstaking and bumpy.

@pszilard: Could you kindly look into this issue and offer any insight? The same issue has been raised on the GROMACS GitLab page as issue #4841, where it was resolved by further equilibration before the MD production run. In my case, though, there seems to be a more systematic problem, since the same error comes up once every few hours (ca. 1 ns of simulation). During the course of an AWH simulation the system moves through different states as the algorithm spans the reaction-coordinate interval defined in the .mdp file, so it seems unlikely that the system’s initial conditions are the root cause of the error.
My job submission command line is:

gmx mdrun -deffnm awh -px pullx.xvg -nb gpu -pme gpu -update gpu
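When the run dies, the restart is just the same command with -cpi added (with -deffnm awh the checkpoint file defaults to awh.cpt):

gmx mdrun -deffnm awh -cpi awh.cpt -px pullx.xvg -nb gpu -pme gpu -update gpu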

Kind regards,
roozi

Hi,
@pszilard, sorry to interrupt, but could you take a look at the error? It keeps popping up…

I suspect it’s an ordinary instability issue, but as you say, equilibrating for longer might not be enough. Could you post your AWH and pull settings?

Sure.

pull = yes
pull-ngroups = 2
pull-ncoords = 1
pull-nstxout = 5000
pull-nstfout = 5000
pull-group1-name = left
pull-group2-name = right
pull-coord1-groups = 1 2
pull-coord1-geometry = direction
pull-coord1-vec = 126.44 -12.53 7.82
pull-coord1-dim = Y Y Y
pull-coord1-type = external-potential
pull-coord1-potential-provider = AWH
pull-coord1-start = yes
pull-group1-pbcatom = 3135
pull-group2-pbcatom = 14153
pull-pbc-ref-prev-step-com = yes

awh = yes
awh-nstout = 50000
awh-nbias = 1
awh1-ndim = 1
awh1-dim1-coord-index = 1
awh1-dim1-start = 8.5
awh1-dim1-end = 12.8
awh1-dim1-force-constant = 100000
awh1-dim1-diffusion = 5e-5

FYI, I tweaked the AWH diffusion parameter (the last line of my AWH settings) from 5e-5 to 1e-5 to check whether slowing down the traversal of the reaction coordinate could stabilize the simulation and prevent the error. Unfortunately, with the lower diffusion rate of 1e-5, not only did I get the same error, but my laptop also glitched and became totally unresponsive, and I had to restart it to get it working again.
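That is, the single .mdp line I changed for that test:

awh1-dim1-diffusion = 1e-5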

None of these settings look like they should cause any problems, as long as the whole AWH coordinate range (from 8.5 to 12.8) is accessible. Unless you have very high curvature in your free-energy landscape/profile, you might want to test with an AWH force constant a factor of 5-10 lower, but I don’t think that should cause crashes like this (provided that the whole reaction-coordinate range is accessible).
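For example, a factor of 10 lower would be this one-line change in the .mdp file (pick the value to suit your landscape):

awh1-dim1-force-constant = 10000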

I just solved my problem by reducing dt from 0.002 to 0.001 ps.
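For reference, that is this one-line .mdp change (note that nsteps then needs to be doubled to cover the same total simulated time):

dt = 0.001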