Zombie simulations

GROMACS version: 2023.4, 2024.0, and 2024.1
GROMACS modification: No

My parallelized simulations, using thread-MPI and multiple OpenMP threads per rank, turn into zombies after running for more than 24 hours but less than 48. Here's an example mdrun command:

gmx mdrun -ntmpi 7 -ntomp 4 -npme 1 -ntomp_pme 8

This happens with both CPU-only and GPU-accelerated simulations. A healthy simulation typically has slightly fewer than (ntmpi-1)*ntomp + ntomp_pme running threads (6*4 + 8 = 32 for the command above), plus some sleeping ones. A zombie simulation has exactly ntmpi running threads and the rest sleeping, and the simulation no longer advances in this state. My interpretation is that all of the OpenMP threads go to sleep for some reason, turning the simulation into a zombie. Mind you, the gmx process itself is still running, and does so indefinitely.
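
In case anyone wants to reproduce this check, here is roughly how I count thread states, assuming a Linux node and a single gmx process (the pgrep pattern matches my setup; adjust it to your binary name):

# List every thread of the gmx process with its scheduler state
# (R = running, S = sleeping)
ps -L -o tid,stat,comm -p "$(pgrep -x gmx)"

# Tally threads by state; a stalled run shows only ntmpi threads in R
ps -L -o stat= -p "$(pgrep -x gmx)" | cut -c1 | sort | uniq -c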

I understand that this is most likely the result of how compute nodes behave on my HPC resource, rather than something inherent to GROMACS, especially given that I'm getting this behavior with multiple versions. But I'm hoping someone has encountered this kind of behavior before and could provide some insight for me to take to our HPC admins.

I suspect that it has to do with the HPC setup. Is there any documentation about maximum run time or resource usage on your system? As a temporary fix, I would use the -maxh option, e.g. -maxh 23.5, and allocate resources for no more than 24 hours per job (if you are using a queuing system such as Slurm). I would also contact the system admins, explain the situation to them, and ask if they have any advice.
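
A minimal sketch of what I mean, assuming Slurm (job name, core count, and time limit are placeholders for your setup):

#!/bin/bash
#SBATCH --job-name=md_run
#SBATCH --time=24:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=32

# -maxh makes mdrun write a checkpoint and exit cleanly before the
# allocation's wallclock limit is reached
gmx mdrun -ntmpi 7 -ntomp 4 -npme 1 -ntomp_pme 8 -maxh 23.5

You can then chain such jobs and continue from the checkpoint with gmx mdrun ... -cpi state.cpt (state.cpt is the default checkpoint file name).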