GROMACS 2024.2 mysteriously hanging

Hi,

We are longtime users of GROMACS but recently installed 2024.2 on a GPU-based cluster with queueing software. In testing the new version, we consistently find that runs hang in the following way:

  • No files get written out after a certain point, although the process appears to still be using CPU and memory on the client node. This happens no matter where the output is written, i.e. to a networked filesystem or to the local client node’s disk.

  • The jobs hang after the same amount of time for a given “type” of job (bigger/slower jobs hang earlier, so it is almost as if the cumulative amount of resources consumed is what matters). If I restart a job that hung from where it left off (a checkpoint restart; see the sketch after this list), it again runs for that same amount of time and then hangs. So the problem is not something going awry in the actual numbers in the run; if it were, the job would not resume and run fine, only to hang at exactly twice the point where it hung the first time.

  • We tried writing out smaller output files in case there was a per-job limit on disk writes, but that did not affect when it hung.

  • We also tried running directly on the client node, without the queueing system’s “overhead” structure in place, and saw the same behavior.

  • So it appears to be either a problem with the compilation (which completed with no obvious errors), some odd issue with the hardware on the local machines (we tried two different ones), or a compatibility bug in this version of GROMACS.
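
For reference, the restarts mentioned above are plain checkpoint restarts, roughly like this (a minimal sketch; the md_run file prefix is a placeholder for our actual run names):

  # Continue from the last checkpoint written before the hang:
  # -cpi reads the checkpoint file, -deffnm sets the common file-name prefix
  gmx mdrun -deffnm md_run -cpi md_run.cpt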

I see that other people have posted about similar behavior, but I have not found clear responses to those questions when searching, so I would really appreciate any suggestions. Thanks!

Mala

It’s difficult to troubleshoot this kind of problem without access to the system. Hopefully we can at least narrow down the circumstances under which it hangs. Does it hang if you are not running on GPUs? Does it hang if you are using thread-MPI instead of library MPI (which also means that you cannot run on more than one node)?
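
If it helps, the comparison could look roughly like this (a sketch; the topol.tpr name and the thread counts are placeholders to adapt to your system):

  # Thread-MPI build: ranks are started inside a single process,
  # so no external MPI launcher is involved
  gmx mdrun -s topol.tpr -ntmpi 4 -ntomp 8

  # Library-MPI build: the same four ranks launched through mpirun
  mpirun -np 4 gmx_mpi mdrun -s topol.tpr -ntomp 8

  # Take the GPUs out of the picture by forcing all force and update
  # tasks onto the CPU
  gmx mdrun -s topol.tpr -nb cpu -pme cpu -bonded cpu -update cpu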

Thanks for responding and for these follow-up questions. It also hangs when we use only the CPUs and do not engage the GPUs (it just runs a lot slower, so it takes more wall time to reach the hanging point for a given simulation). We are currently running on a single GPU node (with 4 GPUs recognized by GROMACS), not across more than one node, so I think this means four thread-MPI ranks. We have not tried running on more than one node (and we shouldn’t need to).
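
In case it is useful, the runs are launched roughly like this (a sketch with placeholder file names; -ntmpi 4 is spelled out here to match the rank count mdrun chooses on this node):

  # One thread-MPI rank per GPU; mdrun assigns the four detected GPUs
  # to the four ranks and reports the mapping in the .log file
  gmx mdrun -s topol.tpr -ntmpi 4 -ntomp 8 -nb gpu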

Does this help?
Thank you!
Mala

Thanks. Would you be able to open an issue at Issues · GROMACS / GROMACS · GitLab with as much information as possible? Hopefully that will get more attention from other developers, and it will be easier to keep track of progress there.

Thanks! I just created an issue, but some of the formatting might be messed up because I’m not really that familiar with GitLab. I hope it’s still readable!

Mala

Thanks a lot!