GROMACS 2024.2 mysteriously hanging

Hi,

We are longtime users of GROMACS but recently installed 2024.2 on a GPU-based cluster with queueing software. In testing the new version, we consistently find that runs hang in the following way:

  • No files get written out after a certain point, although the process appears to still be using CPU and memory on the client node. This happens no matter where the output is written, i.e. to a networked filesystem or to the local client node’s disk.

  • The jobs hang after the same amount of time for a given “type” of job (bigger/slower jobs hang earlier, so it is almost as if the cumulative amount of resources consumed is what matters). If I restart a job that hung from where it left off (a checkpoint restart; see the sketch after this list), it again runs for that same amount of time and then hangs. So the problem is not something going awry in the actual numbers in the run; if it were, the job would not resume and run fine, only to hang at exactly twice the point where it hung the first time.

  • We tried writing out smaller output files in case there was a per-job limit on disk writes, but that did not affect when it hung.

  • We also tried running directly on the client node, without the queueing system’s “overhead” structure in place, and saw the same behavior.

  • So it appears to be either a problem with the compilation (which completed with no obvious errors), some odd issue with the hardware on the local machines (we tried two different ones), or a compatibility bug in this version of GROMACS.
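
For reference, the restarts mentioned above are plain checkpoint restarts, roughly like this (a minimal sketch; the md_run file prefix is a placeholder for our actual run names):

  # Continue from the last checkpoint written before the hang:
  # -cpi reads the checkpoint file, -deffnm sets the common file-name prefix
  gmx mdrun -deffnm md_run -cpi md_run.cpt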

I see that other people have posted about similar behavior, but I have not found clear responses to those questions when searching, so I would really appreciate any suggestions. Thanks!

Mala

It’s difficult to troubleshoot this kind of problem without access to the system. Hopefully we can at least narrow down the circumstances under which it hangs. Does it hang if you are not running on GPUs? Does it hang if you are using thread-MPI instead of library MPI (which also means that you cannot run on more than one node)?
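
If it helps, the comparison could look roughly like this (a sketch; the topol.tpr name and the thread counts are placeholders to adapt to your system):

  # Thread-MPI build: ranks are started inside a single process,
  # so no external MPI launcher is involved
  gmx mdrun -s topol.tpr -ntmpi 4 -ntomp 8

  # Library-MPI build: the same four ranks launched through mpirun
  mpirun -np 4 gmx_mpi mdrun -s topol.tpr -ntomp 8

  # Take the GPUs out of the picture by forcing all force and update
  # tasks onto the CPU
  gmx mdrun -s topol.tpr -nb cpu -pme cpu -bonded cpu -update cpu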

Thanks for responding and for these follow-up questions. It also hangs when we use only the CPUs and do not engage the GPUs (it just runs a lot slower, so it takes more wall time to reach the hanging point for a given simulation). We are currently running on a single GPU node (with 4 GPUs recognized by GROMACS), not across more than one node, so I think this means four thread-MPI ranks. We have not tried running on more than one node (and we shouldn’t need to).
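
In case it is useful, the runs are launched roughly like this (a sketch with placeholder file names; -ntmpi 4 is spelled out here to match the rank count mdrun chooses on this node):

  # One thread-MPI rank per GPU; mdrun assigns the four detected GPUs
  # to the four ranks and reports the mapping in the .log file
  gmx mdrun -s topol.tpr -ntmpi 4 -ntomp 8 -nb gpu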

Does this help?
Thank you!
Mala

Thanks. Would you be able to open an issue at Issues · GROMACS / GROMACS · GitLab with as much information as possible? Hopefully that will get more attention from other developers, and it will be easier to keep track of progress there.

Thanks! I just created an issue, but some of the formatting might be messed up because I’m not really that familiar with GitLab. I hope it’s still readable!

Mala

Thanks a lot!