Why does "Fatal error: Unexpected cudaStreamQuery failure" happen in GROMACS 2019?

GROMACS version: 2019
GROMACS modification: No
I have GROMACS 2019. When I try to run mdrun on the GPU, I get the error "Fatal error: Unexpected cudaStreamQuery failure: an illegal memory access was encountered". What is the problem and how can I solve it?

The version that you are using is more than four years old and is no longer actively supported. Is there a reason why you are not using the 2023 version (2023.2 is the latest patch release)? If you have to stick to the 2019 version, make sure that you are using the latest patched version (2019.6). Hopefully that will fix the problem.
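If you are not sure which patch release you actually have installed, the version banner printed by the binary shows it (a quick check, assuming gmx is the name of the binary on your PATH; adjust if you built it with a suffix such as gmx_mpi):

gmx --version | grep -i "GROMACS version"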

thanks for your answer

Hello, @MagnusL !

I had the same problem with Gromacs 2023.1 with CUDA 12.2 and nvidia-driver-535 on Ubuntu 23.10.

Program:     gmx mdrun, version 2023.1
Source file: src/gromacs/gpu_utils/cudautils.cuh (line 190)

Fatal error:
Unexpected cudaStreamQuery failure. CUDA error #719 (cudaErrorLaunchFailure):
unspecified launch failure.

The simulations were performed on a machine with a single RTX 4090 GPU and an Intel i9-13900. I tried to simulate different systems on this machine using various GROMACS versions from 2021 to 2023.1 with different CUDA toolkits (11.7, 12.1, and 12.2); in all cases, the same problem occurred.

What do you think? How can I solve this problem?

Does the same tpr run stably when running on CPU only?

Hi, @hess !

I have not yet tried running the calculation on the CPU only. I will write here what happens in a couple of days.

Do you get the error at step 0? If so, then trying on a CPU takes a few minutes, even on a laptop.
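For example, something along these lines should be enough for a quick check (a sketch; topol.tpr stands in for your actual tpr file, and -nsteps just caps the length of the test run):

gmx mdrun -s topol.tpr -nb cpu -pme cpu -bonded cpu -nsteps 10000 -v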

Hello, @hess ,

When I started the calculation only with the CPU using the command

gmx mdrun -v -deffnm 300 -bonded cpu -pme cpu -nb cpu -s topol.tpr -cpi 300.cpt

The speed is much slower, and the calculation was stable for 200 ns. I did not run it for longer than that.

This error occurs 100–150 ns after the simulation begins. Yes, I initially tried to start the simulation using the CPU and then switched to the GPU. However, the error did not disappear.

I also tried changing the screen lock time, as suggested in the thread "Fatal error: Unexpected cudaStreamQuery failure". However, in my case, it did not help me get rid of the error.

I believe it might be related to the hardware of the machine, but so far I have not figured out whether it is the video card, RAM, or CPU.

If it occurs only after so much time, that makes it difficult to debug and also difficult to guess where it comes from. If there is a bug in GROMACS, the chance is very low that it would only show up after 100 million time steps. It could of course be that your system becomes unstable and that the first error triggered is a CUDA error. But then it is also unlikely that your simulation would become unstable only after so many steps.

Dear all,

Are there any updates on this? I’ve got this too.
I used the "lysozyme in water" tutorial to test my GROMACS installation, and the error only appeared in the middle of the MD stage (extended to 100 ns), at step 18100000 (time 36200 ps).
I have one RTX 4090, Gromacs 2023.2, cuda 12.2, nvidia-driver-535.154.05, ubuntu 20.04 LTS.

------------------------------------------------------
Program:     gmx mdrun, version 2023.2
Source file: src/gromacs/gpu_utils/device_stream.cu (line 100)
Function:    DeviceStream::synchronize() const::<lambda()>

Assertion failed:
Condition: stat == cudaSuccess
cudaStreamSynchronize failed. CUDA error #719 (cudaErrorLaunchFailure):
unspecified launch failure.

For more information and tips for troubleshooting, please check the GROMACS
website at http://www.gromacs.org/Documentation/Errors
-------------------------------------------------------


No, no updates.

I have no clue if this would be due to a bug in GROMACS, an instability in your simulation (which still should not result in an assertion failure, but in an error message), or a CUDA driver issue.

Has this happened only once or multiple times?

Thanks for responding. I have just set this machine up. So far, 6-7 trials of the lysozyme-in-water MD have been run, and all failed at some point in the middle; I have not been able to complete this task at all. We also tried another system, a small peptide in water, a few times (6-7), which likewise always failed with this error at some point. I also tried 2023.3 with no luck. At the moment, I cannot use this machine to do any MD work at all.

The only known way to get such a failure is with an unstable system and PME on the GPU: an illegal memory access can then occur when particles are far outside of the unit cell. Could you run with the option -pme cpu to check whether this avoids the issue?

Hello @hess,

I have run experiments on several systems at varying temperatures. The outcome depends on the stability of the systems. In some cases, switching to -pme cpu and running a few more nanoseconds makes it possible to return to -pme gpu afterwards, but it does not work in every case.
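For reference, the switch I do is roughly the following (a sketch based on my earlier command; the 300.* file names are from my setup):

# continue from the last checkpoint with PME moved to the CPU for a few ns
gmx mdrun -v -deffnm 300 -s topol.tpr -cpi 300.cpt -pme cpu

# once it has run stably for a while, continue again with PME back on the GPU
gmx mdrun -v -deffnm 300 -s topol.tpr -cpi 300.cpt -pme gpu -nb gpu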

Hi, I tried with “-pme cpu” and got an error and core dump:

Step 4882651 Pressure scaling more than 1%. This may mean your system is not yet equilibrated. Use of Parrinello-Rahman pressure coupling during equilibration can lead to simulation instability, and is discouraged.
Segmentation fault (core dumped)

Does it reveal anything? I ran this same task (without "-pme cpu") on a different server with 4 GPUs and it completed without any issues.

Then I guess that the CUDA error is triggered by instability of the simulation. The question is then why your simulation becomes unstable.

What are the mdp settings for your thermostat and barostat?

Hi,

As follows:

; Temperature coupling is on
tcoupl                  = V-rescale             ; modified Berendsen thermostat
tc-grps                 = Protein Non-Protein   ; two coupling groups - more accurate
tau_t                   = 0.1     0.1           ; time constant, in ps
ref_t                   = 300     300           ; reference temperature, one for each group, in K
; Pressure coupling is on
pcoupl                  = Parrinello-Rahman     ; Pressure coupling on in NPT
pcoupltype              = isotropic             ; uniform scaling of box vectors
tau_p                   = 2.0                   ; time constant, in ps
ref_p                   = 1.0                   ; reference pressure, in bar
compressibility         = 4.5e-5                ; isothermal compressibility of water, bar^-1

The test was taken from the lysozyme tutorial. The only change I made was to run it for longer (changed from 1 ns to 100 ns).

http://www.mdtutorials.com/gmx/lysozyme/08_MD.html
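For reference, the change amounts to the run-length lines in md.mdp (a sketch; with the tutorial's 2 fs time step, 100 ns corresponds to 50,000,000 steps):

integrator  = md
dt          = 0.002       ; 2 fs
nsteps      = 50000000    ; 50,000,000 * 0.002 ps = 100 ns (tutorial default: 500000 = 1 ns)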

Thanks. It is quite probable that such an old tutorial was never tested with (or meant for) 100 ns runs, and that there are instabilities that only show up when you run that long.

It would be interesting to see if your simulation is stable if you modify the thermostat and barostat to this:

; Temperature coupling is on
tcoupl                  = V-rescale
tc-grps                 = Protein Non-Protein   ; two coupling groups - more accurate
tau_t                   = 1       1             ; time constant, in ps
ref_t                   = 300     300           ; reference temperature, one for each group, in K
; Pressure coupling is on
pcoupl                  = c-rescale             ; Pressure coupling on in NPT
pcoupltype              = isotropic             ; uniform scaling of box vectors
tau_p                   = 5.0                   ; time constant, in ps
ref_p                   = 1.0                   ; reference pressure, in bar
compressibility         = 4.5e-5                ; isothermal compressibility of water, bar^-1

I.e., change the barostat to c-rescale (more stable - there is a risk of oscillations when using Parrinello-Rahman, especially with low tau-p) and change tau-t and tau-p.

If this does not help, it is possible that the system is not equilibrated enough to run this long. The NVT and NPT stages might need to be extended as well.
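For example (a sketch, assuming the tutorial's 2 fs time step), the equilibration could be lengthened by increasing nsteps in nvt.mdp and npt.mdp:

nsteps = 500000    ; 500,000 * 0.002 ps = 1 ns of equilibration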

Thank you for the suggestion. I tried that, but it resulted in the same error message.

-------------------------------------------------------
Program:     gmx mdrun, version 2023.3
Source file: src/gromacs/gpu_utils/device_stream.cu (line 100)
Function:    DeviceStream::synchronize() const::<lambda()>

Assertion failed:
Condition: stat == cudaSuccess
cudaStreamSynchronize failed. CUDA error #719 (cudaErrorLaunchFailure):
unspecified launch failure.

For more information and tips for troubleshooting, please check the GROMACS
website at http://www.gromacs.org/Documentation/Errors
-------------------------------------------------------

In fact, I tried the same tutorial on a different machine (different CPU and GPU) with similar software (Ubuntu 20.04 LTS, GROMACS 2023.3, CUDA 12.2) and it had no problem completing the 100 ns MD. I have a feeling that the real cause may be hardware related…

Try with tau_p = 5. Maybe we are lucky and the (too) short period for the barostat is what is causing the issues. Or switch to C-rescale, which is what we would recommend anyway.