Simulations hang without crashing

GROMACS version: 2025.3, 2026-rc
We have simulations that run perfectly fine with older versions of GROMACS (2023, I think) but die about halfway through with the versions mentioned above. When I say “die,” there is no crash, nothing interesting in the logs, no step*.pdb dumps. The trajectory file size stops changing, md.log stops updating, CPU load drops significantly, and GPU load ceases. The job is eventually killed when it hits the wall-time limit. This exact behavior occurs on completely unrelated machines: one is in fact stand-alone, while the other is a cluster (and there it happens regardless of the node). We have not identified the issue, and we are not blaming GROMACS (yet).
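For what it’s worth, next time it stalls we could attach a debugger to the mdrun process and capture stack traces of all threads, assuming gdb is available on the node (<pid> is a placeholder for the actual process ID):

    # attach to the stalled process, dump a backtrace of every thread, then detach
    gdb -p <pid> -batch -ex "thread apply all bt"

If the hang is a deadlock (e.g. a stuck MPI or I/O call), the traces should at least show where each thread is blocked.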

Would the developers be interested in trying our exact inputs? Thanks!

How long does it take until the run stops making progress?

About 12-15 hours.

I am trying to isolate the issue by (1) running the exact same simulations with gmx 2023.2 and (2) limiting trajectory output to xtc files with the newer gmx versions. The original failing simulations write temporary trr outputs that grow to about 62 GB, and they hang about halfway through. The output-limiting settings I am testing are sketched below.
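Roughly, the mdp changes look like this (option names are from the standard mdp reference; the intervals are illustrative, not our exact values):

    ; turn off uncompressed trr output entirely
    nstxout                 = 0
    nstvout                 = 0
    nstfout                 = 0
    ; keep only compressed xtc output
    nstxout-compressed      = 5000
    compressed-x-precision  = 1000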

Initially we thought one of the nodes was overheating, but then I tested on other nodes and saw the same behavior.

Have you up till now only seen crashes when writing tens of gigabytes of output?

I have not seen any crashes of this type before. We output trajectories that grow to 30–50 GB, simulations finish, postprocessing does its thing, and all the huge files get deleted. This is new. We are specifically interested in running the newer GROMACS versions because there is an insane performance difference: I have one cluster that yields a 2x performance jump when going from 2023.2 to 2025.3. We have both installed as modules, and it is crazy.

We now understand why there is a performance difference depending on the GROMACS version (the older version uses an older CUDA), but after eliminating uncompressed trajectory output in favor of xtc-only output (files roughly 10x smaller), the simulations still hang the same way – regardless of the version, actually. Very weird. Any ideas?

It used to be possible to get enormously long loops when particles jumped very far out of the box due to large forces. But as we removed those loops some years ago, I have no clue what could cause this.

You could try the -pforce mdrun option to check for large forces, with a value of e.g. 5000. But I see now that you need to run the update on the CPU to make this work (we should add a check for this).
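Something like this, where the file name is just a placeholder (-pforce prints any force larger than the given value, in kJ mol⁻¹ nm⁻¹, and -update cpu keeps the update step on the CPU, per the note above):

    gmx mdrun -deffnm md -pforce 5000 -update cpu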

Don’t per-atom forces get dumped as part of the trr? So far I have inspected the trajectories visually, and there is absolutely nothing alarming, not before the hang, not anywhere else. Probably the weirdest issue I’ve seen in years…
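(For reference, trr files contain forces only when nstfout > 0. This is roughly how I have been pulling them out for inspection; file names are placeholders:)

    # check which fields each trr frame actually contains
    gmx check -f md.trr
    # write the per-atom forces of a selected group to an xvg file for plotting
    gmx traj -s md.tpr -f md.trr -of forces.xvg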

If large forces occur, for whatever reason, they can crash the simulation within a few steps. Trajectory output is almost never frequent enough to catch this.

Unfortunately, that is true. The timing of our hangs is also, very unfortunately, such that there is simply no way to output at every timestep. The closest workaround I can think of is sketched below.
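In principle we could restart from the last checkpoint before the hang with per-step force output enabled just for that window; a rough sketch, where all file names are placeholders and the mdp has been edited to set nstfout = 1:

    # rebuild a tpr that continues from the checkpoint, with per-step force output
    gmx grompp -f md_forceout.mdp -c md.gro -t state.cpt -p topol.top -o debug.tpr
    # run only a short stretch covering the point where the hang occurs
    gmx mdrun -s debug.tpr -deffnm debug -nsteps 500000

But since we cannot predict the hang precisely, even a “short” window produces an enormous trr, which is exactly the problem.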

But the mystery deepens: we have one cluster where these simulations complete without any issue whatsoever.