GROMACS version: 2024.2
GROMACS modification: No
Dear all, I am one of the system administrators of an HPC cluster at my university. We have some DGX A100 machines running in a small SLURM cluster, with BeeGFS 7.4.5 as our scratch storage and NVIDIA GDS support enabled.
We noticed that GROMACS processes running on our DGX A100 nodes randomly get stuck in the "Z" (zombie) state after some time. The jobs are then either killed by SLURM when they hit the 3-day time limit, or terminated by an administrator once we notice they are stuck. We could not pinpoint which part of the run triggers the problem, since it happens at seemingly random points in time, anywhere from a few hours to a few days into the run. The graph below gives an overview:
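For reference, this is roughly how we check the state on an affected node; the process name (`gmx_mpi`) is just what our MPI build produces and may differ for other installs:

```bash
# list GROMACS ranks and their kernel state on a compute node
ps -eo pid,ppid,stat,etime,wchan:30,comm | grep -E 'gmx|mdrun'
# STAT "Z" = zombie (process exited but was never reaped by its parent);
# STAT "D" would instead mean uninterruptible I/O wait, worth ruling out too
```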
We used exactly the same input files, starting from a clean state (all scratch and output files deleted), for every test run. We tried different GROMACS versions (v2024.4, v2024.2, v2024.1, v2023.2), and the result is the same: a run either gets stuck at a random point or completes without issue.
We also tried running on the node's local NVMe, and there the runs do not seem to get stuck. So we suspect this has something to do with our BeeGFS storage, although the behaviour does not show up for jobs from other applications such as TeraChem, Amber or LAMMPS.
We have also tried the GROMACS container from NVIDIA GPU Cloud (NGC), and the result is the same, so I believe our own GROMACS compilation/installation should be okay?
Since I am not a GROMACS user myself, and our users at the university are mostly not experienced enough with GROMACS, I figured it would be best to post here and ask for help. I have uploaded all the input files, output files and the submission script we used for testing to the following link. There are files from multiple runs in the directory, sorry for the mess!
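To save people a download, the submission script is essentially the skeleton below; the resource numbers, module name, file names and mdrun options here are simplified placeholders, the exact script is in the uploaded files:

```bash
#!/bin/bash
#SBATCH --job-name=gmx-test
#SBATCH --nodes=1
#SBATCH --gpus-per-node=8
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=16
#SBATCH --time=3-00:00:00

module load gromacs/2024.2          # site-specific module name

cd $SLURM_SUBMIT_DIR                # job runs from the BeeGFS scratch directory
srun gmx_mpi mdrun -deffnm md -ntomp $SLURM_CPUS_PER_TASK
```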
I would appreciate any insight into the problem we are encountering; we have tried various troubleshooting steps but still could not identify the cause on our side. Here is a list of everything we have tried:
- Running the calculation with different versions of GROMACS (2023.2, 2024.1, 2024.2, 2024.4, and 2023.2 from NGC)
- Cleaning up all files in the directory before rerunning the same calculation
- Recompiling GROMACS against OpenMPI (previously compiled with HPC-X); a rough sketch of the configure line is after this list
- Setting GMX_ENABLE_DIRECT_GPU_COMM=0
- Upgrading the BeeGFS stack with newer NVIDIA GDS (1.11.1.6) and NVIDIA FS (2.17)
- Upgrading DGX OS and most libraries to the latest compatible versions
- Running the calculation both with and without other jobs on the same node (the jobs do not share the same set of resources)
- Running the GROMACS calculation directly, without going through the SLURM scheduler
- Running on other storage (local NVMe and CephFS), where the runs do not seem to get stuck
- Running on older GPU nodes without InfiniBand (10G Ethernet only), which seems to be okay (v2022.1)
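For the OpenMPI rebuild mentioned above, the configure step was along these lines; the module names, versions and install prefix here are placeholders rather than the exact commands:

```bash
# rebuild GROMACS against OpenMPI with CUDA support (paths/versions are placeholders)
module load openmpi/4.1.x cuda/12.x
cmake .. -DGMX_MPI=ON -DGMX_GPU=CUDA \
         -DCMAKE_C_COMPILER=mpicc -DCMAKE_CXX_COMPILER=mpicxx \
         -DCMAKE_INSTALL_PREFIX=/opt/gromacs/2024.2
make -j 32 && make install
```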
Sorry for the long post, and thank you in advance for any help.