CUDA error with GROMACS 2023.3 (CUDA error #700: an illegal memory access)

roozi · November 7, 2023, 10:16pm

GROMACS version: 2023.3
GROMACS modification: No
I'm getting a CUDA error when running the standard GROMACS 2023.3 release on my laptop, using an
NVIDIA RTX 4060 GPU and 20 OpenMP threads (13th-generation Core i7).
CUDA version: 12.3; NVIDIA driver version: 545.84. OS: Ubuntu 22.04.2 (installed on WSL2, Windows 11).

The error message is:

Program: gmx mdrun, version 2023.3
Source file: src/gromacs/gpu_utils/device_stream.cu (line 100)
Function: DeviceStream::synchronize() const::<lambda()>
Assertion failed:
Condition: stat == cudaSuccess
cudaStreamSynchronize failed. CUDA error #700 (cudaErrorIllegalAddress): an
illegal memory access was encountered.

I am attaching the .tpr file of my simulation to help the developer team reproduce the error and find its root cause.
tpr file link: https://filetransfer.io/data-package/IjCaW1po#link

PS: To give some perspective, I am trying to run a fairly long AWH simulation. I get this error every few hours (roughly every 1 ns of simulation) and have to restart the job with the -cpi option, but the same thing keeps happening over and over, which makes the whole process painstaking and bumpy.

@pszilard: Could you kindly look into this issue and offer any insight? I can see that the same error was raised on the GROMACS GitLab page as issue #4841, which in that case was resolved by further equilibration before the production MD run. In my case, though, the problem seems more systematic: the same error comes up once every few hours (ca. 1 ns of simulation), even though during an AWH simulation the system moves through different states as the algorithm spans the reaction coordinate interval defined in the mdp file. It therefore seems unlikely that the system’s initial condition is the root cause of the error.
My mdrun command line is:

gmx mdrun -deffnm awh -px pullx.xvg -nb gpu -pme gpu -update gpu

kind regards,
roozi
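
The manual routine described in this post (rerun with -cpi after every crash) could be automated with a small wrapper. What follows is a minimal Python sketch, not something proposed in the thread: the function names `mdrun_command` and `run_with_restarts`, the restart cap, and the checkpoint name `awh.cpt` (inferred from `-deffnm awh`) are assumptions for illustration. It only works around the crash and does nothing about its cause.

```python
import subprocess


def mdrun_command(deffnm="awh", pullx="pullx.xvg"):
    """Build the mdrun command from the post, plus -cpi for continuation."""
    return [
        "gmx", "mdrun",
        "-deffnm", deffnm,
        "-px", pullx,
        "-nb", "gpu", "-pme", "gpu", "-update", "gpu",
        # Assumption: the checkpoint is <deffnm>.cpt, as implied by -deffnm.
        "-cpi", f"{deffnm}.cpt",
    ]


def run_with_restarts(command, max_restarts=20, runner=subprocess.call):
    """Rerun `command` until it exits with code 0 or the cap is reached.

    Returns the number of attempts made, or -1 if every attempt failed.
    `runner` is injectable so the loop can be exercised without GROMACS.
    """
    # One initial attempt plus at most `max_restarts` restarts.
    for attempt in range(1, max_restarts + 2):
        if runner(command) == 0:
            return attempt
    return -1
```

Calling `run_with_restarts(mdrun_command())` in the run directory would launch the job and continue from the latest checkpoint after each failure. The cap matters: if the run crashes immediately rather than every few hours, an unbounded loop would spin forever.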

roozi · November 15, 2023, 3:12pm

Hi,
@pszilard, sorry to bother you, but could you take a look at this error? It keeps popping up…

MagnusL · November 17, 2023, 10:43am

I suspect it’s an ordinary instability issue, but as you say, it might not be enough to equilibrate longer. Could you post your AWH and pull settings?

roozi · November 18, 2023, 11:19am

Sure.

pull = yes
pull-ngroups = 2
pull-ncoords = 1
pull-nstxout = 5000
pull-nstfout = 5000
pull-group1-name = left
pull-group2-name = right
pull-coord1-groups = 1 2
pull-coord1-geometry = direction
pull-coord1-vec = 126.44 -12.53 7.82
pull-coord1-dim = Y Y Y
pull-coord1-type = external-potential
pull-coord1-potential-provider = AWH
pull_coord1_start = yes
pull-group1-pbcatom = 3135
pull-group2-pbcatom = 14153
pull-pbc-ref-prev-step-com = yes

awh = yes
awh-nstout = 50000
awh-nbias = 1
awh1-ndim = 1
awh1-dim1-coord-index = 1
awh1-dim1-start = 8.5
awh1-dim1-end = 12.8
awh1-dim1-force-constant = 100000
awh1-dim1-diffusion = 5e-5

roozi · November 18, 2023, 6:58pm

FYI, I changed the AWH diffusion parameter (the last line of my AWH settings) from 5e-5 to 1e-5 to check whether slowing down the traversal of the reaction coordinate would stabilize the simulation and prevent the error. Unfortunately, with the lower diffusion value of 1e-5, not only did I get the same error, but my laptop also glitched, became totally unresponsive, and had to be restarted before it worked again.

MagnusL · November 20, 2023, 2:14pm

None of these settings look like they should cause any problems, as long as the whole AWH coordinate range is accessible (from 8.5 to 12.8). Unless you have a very high curvature in your free energy landscape/profile, you might want to test with an AWH force constant that is a factor of 5-10 lower, but I don’t think that should cause crashes like this (provided that the whole reaction coordinate range is accessible).
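
To put the suggested factor of 5-10 in perspective: for a harmonic umbrella with force constant k, the thermal width of the coordinate distribution is sigma = sqrt(kB*T/k), which in AWH sets the resolution of the sampled profile. A minimal Python sketch of this arithmetic follows. The temperature of 300 K is an assumption (the thread does not state it), the helper name `umbrella_sigma` is hypothetical, and the units assume a force constant in kJ/mol/nm^2 for a distance-like coordinate.

```python
import math

K_B = 0.0083144626  # Boltzmann constant in kJ mol^-1 K^-1


def umbrella_sigma(force_constant, temperature=300.0):
    """Thermal width (nm) of a harmonic umbrella: sqrt(kB*T / k)."""
    return math.sqrt(K_B * temperature / force_constant)


# Force constant from the thread, and the suggested 5x and 10x reductions.
# Assumption: T = 300 K (not stated in the thread).
for k in (100000, 20000, 10000):
    print(f"k = {k:6d} kJ/mol/nm^2 -> sigma = {umbrella_sigma(k):.4f} nm")
```

Under these assumptions the posted force constant of 100000 corresponds to a width of about 0.005 nm, and a tenfold reduction widens it by a factor of sqrt(10), to about 0.016 nm.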

qingfuns · September 6, 2024, 4:11pm

I just solved my problem by reducing dt from 0.002 to 0.001.
