CUDA Error #700 Random Encounter

dmorse · April 18, 2024, 4:35pm

GROMACS version: 2024.1
GROMACS modification: No
I am trying to set up GPU-driven parallelization of GROMACS in a cloud compute environment on NVIDIA hardware, and I randomly encounter CUDA error #700 (or some other similar memory error). The machine type is very large, with over 60 GB of available memory, so I am at a loss for how to approach the problem. My CUDA version is 12.4, and I installed it following the method provided by NVIDIA: CUDA Toolkit 12.4 Update 1 Downloads | NVIDIA Developer

Any advice?

-Best

MagnusL · April 18, 2024, 7:28pm

Do you get any messages, warnings or errors before the crash?

dmorse · April 18, 2024, 8:12pm

Yes, common warnings are: “You are using Ewald electrostatics in a system with net charge…” (which I am ignoring, as the net charge is below the float error cutoff), “The Berendsen barostat does not generate any strictly correct ensemble, and should not be used for new production simulations (in our opinion). We recommend using the C-rescale barostat instead.” (which I can switch in future), and “You are using soft-core interactions while the Van der Waals interactions are not decoupled (note that the sc-coul option is only active when using lambda states). Although this will not lead to errors, you will need much more sampling than without soft-core interactions. Consider using sc-alpha=0.” (which I have implemented to correct for near atom-atom inf errors). However, these warnings are not always followed by the CUDA #700 illegal memory error, which is why I did not suspect them as the cause. The error is sporadic, hence my confusion.

MagnusL · April 19, 2024, 1:25pm

Nothing about LINCS or pressure coupling during the simulations?
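One quick way to check is to search the mdrun log for the relevant messages. The following is only a minimal sketch: the keyword list and the `scan_log` helper are illustrative assumptions (not part of GROMACS), and you need to pass the path to your own log file, for example `md.log` if you kept mdrun's default name.

```python
import sys

# Illustrative keywords only; this is an assumption, not an exhaustive
# list of the messages GROMACS can print.
KEYWORDS = ("lincs", "pressure", "warning", "cuda error", "illegal memory access")


def scan_log(lines, keywords=KEYWORDS):
    """Return (line_number, text) for every line containing any keyword.

    Matching is case-insensitive; line numbers start at 1.
    """
    hits = []
    for number, line in enumerate(lines, start=1):
        lowered = line.lower()
        if any(key in lowered for key in keywords):
            hits.append((number, line.rstrip("\n")))
    return hits


# Only runs when a log path is given on the command line,
# e.g.: python scan_log.py md.log
if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], errors="replace") as handle:
        for number, text in scan_log(handle):
            print(f"{number}: {text}")
```

If LINCS or pressure-related lines appear shortly before the CUDA error, that would suggest the system is becoming unstable before the GPU kernel fails.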

dmorse · April 20, 2024, 12:08am

Not that I have seen so far. The error is always related to “illegal memory access” or “freeing of device buffer failed”. Do you think recompiling with double precision would help?

MagnusL · April 22, 2024, 6:28am

No, I don’t think compiling with double precision would help.
