CUDA Error #700 Random Encounter

GROMACS version: 2024.1
GROMACS modification: No
I am trying to set up GPU-driven parallelization of GROMACS in a cloud compute environment on NVIDIA hardware, and I randomly encounter CUDA Error #700 (or some other similar memory-related error). The machine type is very large, with over 60 GB of available memory, so I am at a loss for how to approach the problem. My CUDA version is 12.4, and I installed it following the instructions provided by NVIDIA: CUDA Toolkit 12.4 Update 1 Downloads | NVIDIA Developer

Any advice?

-Best

Do you get any messages, warnings or errors before the crash?

Yes, the common warnings are: “You are using Ewald electrostatics in a system with net charge…” (which I am ignoring, as the net charge is below the floating-point error cutoff); “The Berendsen barostat does not generate any strictly correct ensemble, and should not be used for new production simulations (in our opinion). We recommend using the C-rescale barostat instead.” (which I can switch in the future); and “You are using soft-core interactions while the Van der Waals interactions are not decoupled (note that the sc-coul option is only active when using lambda states). Although this will not lead to errors, you will need much more sampling than without soft-core interactions. Consider using sc-alpha=0.” (which I have implemented to correct for infinities when atoms nearly overlap). However, these warnings are not always followed by the CUDA #700 illegal memory error, which is why I did not suspect them as the cause. The error is sporadic, hence my confusion.

Nothing about LINCS or pressure coupling during the simulations?
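One way to check this systematically is to scan the run log for such messages rather than reading it by eye. Below is a minimal Python sketch, not a GROMACS tool: the function name `scan_log` and the search patterns are assumptions chosen for illustration, and the exact wording of messages in your log may differ, so adjust the patterns to match.

```python
import re
from collections import Counter

# Hypothetical search patterns (assumptions, not an official list of
# GROMACS messages); edit them to match what your own log contains.
PATTERNS = {
    "lincs": re.compile(r"lincs", re.IGNORECASE),
    "pressure": re.compile(r"pressure (scaling|coupling)", re.IGNORECASE),
    "warning": re.compile(r"\bwarning\b", re.IGNORECASE),
}


def scan_log(lines):
    """Scan an iterable of log lines.

    Returns (counts, hits): counts is a Counter keyed by pattern label,
    hits is a list of (line_number, label, text) tuples in file order.
    """
    counts = Counter()
    hits = []
    for line_no, line in enumerate(lines, start=1):
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[label] += 1
                hits.append((line_no, label, line.rstrip()))
    return counts, hits
```

To use it, pass an open file, for example `with open("md.log", errors="replace") as f: counts, hits = scan_log(f)` (the file name `md.log` is assumed here), and then look at the last few entries in `hits` to see what was reported shortly before the crash.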

Not that I have seen so far. The error is always related to “illegal memory access” or “freeing of device buffer failed”. Do you think recompiling with double precision would help?

No, I don’t think compiling with double precision would help.