Simulations hang without crashing

GROMACS version: 2025.3, 2026-rc
We have simulations that run perfectly fine with older versions of GROMACS (2023, I think) but die about halfway through with the versions mentioned above. When I say “die,” there is no crash, nothing interesting in the logs, no step*.pdb dumps. The trajectory file size stops changing, md.log stops updating, CPU load drops significantly, and GPU load ceases. The job is eventually killed when it hits the wall-time limit. This exact behavior occurs on completely unrelated machines: one is in fact stand-alone, while the other is a cluster (and there it happens regardless of the node). We have not identified the issue, and we are not blaming GROMACS (yet).
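For what it’s worth, next time it stalls we could attach a debugger to the mdrun process and capture stack traces of all threads, assuming gdb is available on the node (<pid> is a placeholder for the actual process ID):

    # attach to the stalled process, dump a backtrace of every thread, then detach
    gdb -p <pid> -batch -ex "thread apply all bt"

If the hang is a deadlock (e.g. a stuck MPI or I/O call), the traces should at least show where each thread is blocked.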

Would the developers be interested in trying our exact inputs? Thanks!

How long does it take until the run stops making progress?

About 12-15 hours.

I am trying to isolate the issue by (1) running the exact same simulations with gmx 2023.2 and (2) limiting trajectory output to xtc files with the newer gmx versions. The original failing simulations write temporary trr outputs that grow to about 62 GB, and they hang about halfway through. The output-limiting settings I am testing are sketched below.
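Roughly, the mdp changes look like this (option names are from the standard mdp reference; the intervals are illustrative, not our exact values):

    ; turn off uncompressed trr output entirely
    nstxout                 = 0
    nstvout                 = 0
    nstfout                 = 0
    ; keep only compressed xtc output
    nstxout-compressed      = 5000
    compressed-x-precision  = 1000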

Initially we thought one of the nodes was overheating, but then I tested on other nodes and saw the same behavior.

Have you up till now only seen crashes when writing tens of gigabytes of output?

I have not seen any crashes of this type before. We output trajectories that grow to 30–50 GB, simulations finish, postprocessing does its thing, and all the huge files get deleted. This is new. We are specifically interested in running the newer GROMACS versions because there is an insane performance difference: I have one cluster that yields a 2x performance jump when going from 2023.2 to 2025.3. We have both installed as modules, and it is crazy.

We now understand why there is a performance difference depending on the GROMACS version (the older version uses an older CUDA), but after eliminating uncompressed trajectory output in favor of xtc-only output (files roughly 10x smaller), the simulations still hang the same way – regardless of the version, actually. Very weird. Any ideas?

It used to be possible to get enormously long loops when particles jumped very far out of the box due to large forces. But as we removed those loops some years ago, I have no clue what could cause this.

You could try the -pforce mdrun option to check for large forces, with a value of e.g. 5000. But I see now that you need to run the update on the CPU to make this work (we should add a check for this).
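Something like this, where the file name is just a placeholder (-pforce prints any force larger than the given value, in kJ mol⁻¹ nm⁻¹, and -update cpu keeps the update step on the CPU, per the note above):

    gmx mdrun -deffnm md -pforce 5000 -update cpu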

Don’t per-atom forces get dumped as part of the trr? So far I have inspected the trajectories visually, and there is absolutely nothing alarming, not before the hang, not anywhere else. Probably the weirdest issue I’ve seen in years…
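(For reference, trr files contain forces only when nstfout > 0. This is roughly how I have been pulling them out for inspection; file names are placeholders:)

    # check which fields each trr frame actually contains
    gmx check -f md.trr
    # write the per-atom forces of a selected group to an xvg file for plotting
    gmx traj -s md.tpr -f md.trr -of forces.xvg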

If large forces occur, for whatever reason, they can crash the simulation within a few steps. Trajectory output is almost never frequent enough to catch this.

Unfortunately, that is true. The timing of our hangs is also, very unfortunately, such that there is simply no way to output at every timestep. The closest workaround I can think of is sketched below.
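In principle we could restart from the last checkpoint before the hang with per-step force output enabled just for that window; a rough sketch, where all file names are placeholders and the mdp has been edited to set nstfout = 1:

    # rebuild a tpr that continues from the checkpoint, with per-step force output
    gmx grompp -f md_forceout.mdp -c md.gro -t state.cpt -p topol.top -o debug.tpr
    # run only a short stretch covering the point where the hang occurs
    gmx mdrun -s debug.tpr -deffnm debug -nsteps 500000

But since we cannot predict the hang precisely, even a “short” window produces an enormous trr, which is exactly the problem.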

But the mystery deepens: we have one cluster where these simulations complete without any issue whatsoever.