Zombie simulations

GROMACS version: 2023.4, 2024.0, and 2024.1
GROMACS modification: No

My parallelized simulations, using thread-MPI and multiple OpenMP threads per rank, turn into zombies after running for more than 24 hours but less than 48. Here's an example mdrun command:

gmx mdrun -ntmpi 7 -ntomp 4 -npme 1 -ntomp_pme 8

This happens with both CPU-only and GPU-accelerated simulations. A healthy simulation typically has slightly fewer than (ntmpi-1)*ntomp + ntomp_pme running threads (6*4 + 8 = 32 for the command above), plus some sleeping ones. A zombie simulation has exactly ntmpi running threads and the rest sleeping, and the simulation no longer advances in this state. My interpretation is that all of the OpenMP threads go to sleep for some reason, turning the simulation into a zombie. Mind you, the gmx process itself is still running, and does so indefinitely.
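
In case anyone wants to reproduce this check, here is roughly how I count thread states, assuming a Linux node and a single gmx process (the pgrep pattern matches my setup; adjust it to your binary name):

# List every thread of the gmx process with its scheduler state
# (R = running, S = sleeping)
ps -L -o tid,stat,comm -p "$(pgrep -x gmx)"

# Tally threads by state; a stalled run shows only ntmpi threads in R
ps -L -o stat= -p "$(pgrep -x gmx)" | cut -c1 | sort | uniq -c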

I understand that this is most likely the result of how compute nodes behave on my HPC resource, rather than something inherent to GROMACS, especially given that I'm getting this behavior with multiple versions. But I'm hoping someone has encountered this kind of behavior before and could provide some insight for me to take to our HPC admins.

I suspect that it has to do with the HPC setup. Is there any documentation about maximum run time or resource usage on your system? As a temporary fix, I would use the -maxh option, e.g. -maxh 23.5, and allocate resources for no more than 24 hours per job (if you are using a queuing system such as Slurm). I would also contact the system admins, explain the situation to them, and ask if they have any advice.
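
A minimal sketch of what I mean, assuming Slurm (job name, core count, and time limit are placeholders for your setup):

#!/bin/bash
#SBATCH --job-name=md_run
#SBATCH --time=24:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=32

# -maxh makes mdrun write a checkpoint and exit cleanly before the
# allocation's wallclock limit is reached
gmx mdrun -ntmpi 7 -ntomp 4 -npme 1 -ntomp_pme 8 -maxh 23.5

You can then chain such jobs and continue from the checkpoint with gmx mdrun ... -cpi state.cpt (state.cpt is the default checkpoint file name).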