Gmx fails to release GPU resources

Hi all,

We just bumped into a weird problem after a CUDA upgrade. In our Slurm wrappers, both Gromacs and CUDA are loaded as modules, and since the CUDA upgrade, every time a Gromacs job completes the GPUs remain occupied, which prevents any further runs. Specifically, everything except mdrun works; mdrun just hangs silently and occupies the target node with one CPU core at 100% (no logs or output files).
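A quick way to confirm the node is still held is to list the compute processes on the GPUs; something like this should show any leftover gmx processes still holding GPU memory:

nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv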

We have rebuilt gromacs and now load:

module load cuda/12.2.1
module load gromacs/2023.2/gcc-9.4.0-cuda-12.2.1

However, the problem is still there.

For now we are resorting to a silly workaround: every Gromacs user inserts pkills into their workflows to make sure stray processes are killed, but that is not a good long-term solution.
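For reference, the kind of cleanup we sprinkle into the batch scripts is roughly this (the match pattern is site-specific and just a sketch):

pkill -u $USER -f 'gmx mdrun' || true    # kill any leftover mdrun processes of this user; ignore the error if none is found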

Any suggestions?

Thank you!

Hi,

I have not encountered this issue, but it sounds annoying and it should not happen.

Can you help with debugging it? Build GROMACS with -DCMAKE_BUILD_TYPE=RelWithDebInfo, run a normal simulation to reproduce the hanging process, and when the hang occurs, log in to the node and attach a debugger to see where it is hanging. To do so you can type: gdb /path/to/gmx PID, where PID is the process ID of the running gmx executable (you can find this with a tool like top or with pidof gmx). This will attach to the running gmx process and give you a prompt where you can type gdb commands. Next type “bt” to get a backtrace and share its output (additionally, if the current thread is not thread 1, it may be useful to type “thread 1” and then “bt” and share that too if it is different).
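In other words, roughly this on the affected node (the path and PID are placeholders):

pidof gmx                 # find the PID of the hanging process (or pgrep -u $USER gmx)
gdb /path/to/gmx PID      # attach gdb to the running process
(gdb) bt                  # backtrace of the thread gdb stopped in
(gdb) thread 1            # switch to the main thread if needed
(gdb) bt
(gdb) detach
(gdb) quit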

Cheers,
Szilárd

Hi Szilárd,

Thank you! I have sent the link for this exchange to our HPC support folks, will update here as soon as I hear back.

Sasha

The issue just disappeared and I am not sure who did what, so for now it will remain a mystery. However, I am now seeing spontaneous crystallization from 1M NaCl under OPLS-AA. This is a known issue with OPLS, but I’ve never seen this happen at 1M.

Did the code get changed in the NB department?

Not that I know of. Do you see anything suspicious in the conserved energy drift reported? Are you using any additional features compared to those you used before?

Let me look at the conserved energy, but no, I’ve simulated these systems many times before… The only relatively new feature I am using is the AC e-field, but I have a bunch of clean simulations using it from right before we built the latest version. The release notes for 2023.2 mention something LJ-related, so I thought I’d ask…
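For context, I apply the field through the standard mdp electric-field options, roughly like this (the numbers are illustrative, not my production settings):

electric-field-x = 0.5 0.25 0 0    ; E0 (V/nm), omega (1/ps), t0 (ps), sigma (ps); a nonzero omega gives an oscillating (AC) field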

I am using v-rescale and the conserved energy drifts linearly, which seems consistent with the discussion in the thread ‘"Conserved En." is increasing linearly with simulation time - Is this normal?’, though I did not check the amount of drift…
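If it helps, I can pull the actual drift with something like this (assuming the default ener.edr name; the exact term label may vary by version):

echo "Conserved En." | gmx energy -f ener.edr -o conserved.xvg    # writes the conserved-energy time series to conserved.xvg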

Not that many people want to see it in their simulations, but, at least in principle, spontaneous crystallization at 1M is reasonably physical, so, according to some of my colleagues, “maybe your Gromacs finally started working correctly.” :)

In any case, I am moving back to a 2022 version to check. Will report.

The original OPLS-AA force field, as implemented in GROMACS, uses the ion parameters from Aqvist. These parameters are not supposed to be combined for Na+ and Cl-. They will produce crystals at relatively low concentrations, as you have observed.

I’m fully aware, as posted earlier. We’ve just never seen this at 1M… We may be creating a crystallization precursor somewhere, who knows.

There are many papers on bad NaCl force-field behavior. In one of our papers we write that with Aqvist parameters you likely get crystals already at 4.5M.