Gmx fails to release GPU resources

Hi all,

We just bumped into a weird problem after a CUDA upgrade. In our Slurm wrappers, both Gromacs and CUDA are loaded as modules, and since the CUDA upgrade, every time a Gromacs job completes the GPUs remain occupied, which prevents any further runs. Specifically, everything except mdrun works; mdrun just hangs silently and occupies the target node with one CPU core at 100% (no logs or output files).
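A quick way to confirm the node is still held is to list the compute processes on the GPUs; something like this should show any leftover gmx processes still holding GPU memory:

nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv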

We have rebuilt gromacs and now load:

module load cuda/12.2.1
module load gromacs/2023.2/gcc-9.4.0-cuda-12.2.1

However, the problem is still there.

For now we are resorting to a silly workaround: every Gromacs user inserts pkills into their workflows to make sure stray processes are killed, but that is not a good long-term solution.
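For reference, the kind of cleanup we sprinkle into the batch scripts is roughly this (the match pattern is site-specific and just a sketch):

pkill -u $USER -f 'gmx mdrun' || true    # kill any leftover mdrun processes of this user; ignore the error if none is found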

Any suggestions?

Thank you!

Hi,

I have not encountered this issue, but it sounds annoying and it should not happen.

Can you help with debugging it? Build GROMACS with -DCMAKE_BUILD_TYPE=RelWithDebInfo, run a normal simulation to reproduce the hanging process, and when the hang occurs, log in to the node and attach a debugger to see where it is hanging. To do so you can type: gdb /path/to/gmx PID, where PID is the process ID of the running gmx executable (you can find this with a tool like top or with pidof gmx). This will attach to the running gmx process and give you a prompt where you can type gdb commands. Next type “bt” to get a backtrace and share its output (additionally, if the current thread is not thread 1, it may be useful to type “thread 1” and then “bt” and share that too if it is different).
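In other words, roughly this on the affected node (the path and PID are placeholders):

pidof gmx                 # find the PID of the hanging process (or pgrep -u $USER gmx)
gdb /path/to/gmx PID      # attach gdb to the running process
(gdb) bt                  # backtrace of the thread gdb stopped in
(gdb) thread 1            # switch to the main thread if needed
(gdb) bt
(gdb) detach
(gdb) quit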

Cheers,
Szilárd

Hi Szilárd,

Thank you! I have sent the link for this exchange to our HPC support folks, will update here as soon as I hear back.

Sasha

The issue just disappeared and I am not sure who did what, so for now it will remain a mystery. However, I am now seeing spontaneous crystallization from 1M NaCl under OPLS-AA. This is a known issue with OPLS, but I’ve never seen this happen at 1M.

Did the code get changed in the NB department?

Not that I know of. Do you see anything suspicious in the conserved energy drift reported? Are you using any additional features compared to those you used before?

Let me look at the conserved energy, but no, I’ve simulated these systems many times before… The only relatively new feature I am using is the AC e-field, but I have a bunch of clean simulations using it from right before we built the latest version. The release notes for 2023.2 mention something LJ-related, so I thought I’d ask…
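For context, I apply the field through the standard mdp electric-field options, roughly like this (the numbers are illustrative, not my production settings):

electric-field-x = 0.5 0.25 0 0    ; E0 (V/nm), omega (1/ps), t0 (ps), sigma (ps); a nonzero omega gives an oscillating (AC) field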

I am using v-rescale and the conserved energy drifts linearly, which seems consistent with the discussion in the thread ‘"Conserved En." is increasing linearly with simulation time - Is this normal?’, though I did not check the amount of drift…
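If it helps, I can pull the actual drift with something like this (assuming the default ener.edr name; the exact term label may vary by version):

echo "Conserved En." | gmx energy -f ener.edr -o conserved.xvg    # writes the conserved-energy time series to conserved.xvg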

Not that many people want to see it in their simulations, but, at least in principle, spontaneous crystallization at 1M is reasonably physical, so, according to some of my colleagues, “maybe your Gromacs finally started working correctly.” :)

In any case, I am moving back to a 2022 version to check. Will report.

The original OPLS-AA force field, as implemented in GROMACS, uses the ion parameters from Aqvist. These parameters are not supposed to be combined for Na+ and Cl-. They will produce crystals at relatively low concentrations, as you have observed.

I’m fully aware, as posted earlier. We’ve just never seen this at 1M… We may be creating a crystallization precursor somewhere, who knows.

There are many papers on bad NaCl force-field behavior. In one of our papers we write that with Aqvist parameters you likely get crystals already at 4.5M.