Hi,
We are longtime users of GROMACS and recently installed version 2024.2 on a GPU-based cluster with queueing software. In testing the new version, we consistently find that runs hang in the following way:
- No files are written after a certain point, although the process still appears to be using CPU and memory on the client node. This happens no matter where the output is written, i.e. to a networked filesystem or to the client node’s local disk.
- For a given “type” of job, the run hangs after the same amount of time. Bigger/slower jobs hang earlier, so it is almost as if the cumulative amount of resources used is what matters. If I restart a hung job from where it left off, it runs for that same amount of time and then hangs again. So the problem does not seem to be something going awry in the numbers of the run itself: if it were, the restarted job would not run fine and then hang at exactly twice the point where it hung the first time.
- We tried writing smaller output files in case a job was hitting a disk write limit, but that did not affect when it hung.
- We also tried running directly on the client node, without any of the queueing “overhead” in place, and saw the same behavior.
- So it appears to be a problem with the compilation (which produced no obvious errors), some odd issue with the hardware on the local machines (we tried two different ones), or a compatibility bug in this version of GROMACS.
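In case it helps anyone trying to reproduce this, below is a minimal sketch of how the "no output but still using CPU" state could be detected automatically. It is only an illustration, not something taken from our actual setup: the function names, the 600-second threshold, and the use of `md.log` as the file to watch are all assumptions.

```python
import os
import time


def seconds_since_last_write(path, now=None):
    """Return how many seconds have passed since `path` was last modified."""
    if now is None:
        now = time.time()
    return now - os.path.getmtime(path)


def is_stalled(path, threshold_s=600.0, now=None):
    """Return True if `path` has not been modified for more than `threshold_s` seconds.

    The 600 s default is an arbitrary, illustrative value; it should be
    longer than the interval at which the run normally writes output.
    """
    return seconds_since_last_write(path, now) > threshold_s
```

Calling something like `is_stalled("md.log")` periodically from a separate process, and noting the wall-clock time at which it first returns True, would give a comparable "time to hang" for each job type.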
I see that other people have reported similar behavior, but I did not find clear responses to those questions when I searched, so I’d really appreciate any suggestions. Thanks!
Mala