Simulations hang on AMD CPU system

GROMACS version: 2022.3
GROMACS modification: No

I am attempting to run ~150 ns simulations that take ~12 hours on a single-node AMD CPU system, but the simulations sometimes hang after ~8 hours of run time (i.e., the system resources remain occupied but the simulation stops progressing). This appears to be the same issue reported here, but my understanding is that the workaround of using intelmpi is incompatible with my system.

For reference, here is my GROMACS and system info:

GROMACS version:    2022.3
Precision:          mixed
Memory model:       64 bit
MPI library:        thread_mpi
OpenMP support:     enabled (GMX_OPENMP_MAX_THREADS = 128)
GPU support:        disabled
SIMD instructions:  AVX2_256
CPU FFT library:    fftw-3.3.8-sse2-avx-avx2-avx2_128
GPU FFT library:    none
RDTSCP usage:       enabled
TNG support:        enabled
Hwloc support:      disabled
Tracing support:    disabled
C compiler:         /usr/bin/clang Clang 10.0.0
C compiler flags:   -mavx2 -mfma -Wno-missing-field-initializers -O3 -DNDEBUG
C++ compiler:       /usr/bin/clang++ Clang 10.0.0
C++ compiler flags: -mavx2 -mfma -Wno-missing-field-initializers -Weverything -Wno-c++98-compat -Wno-c++98-compat-pedantic -Wno-source-uses-openmp -Wno-c++17-extensions -Wno-documentation-unknown-command -Wno-covered-switch-default -Wno-switch-enum -Wno-extra-semi-stmt -Wno-weak-vtables -Wno-shadow -Wno-padded -Wno-reserved-id-macro -Wno-double-promotion -Wno-exit-time-destructors -Wno-global-constructors -Wno-documentation -Wno-format-nonliteral -Wno-used-but-marked-unused -Wno-float-equal -Wno-conditional-uninitialized -Wno-conversion -Wno-disabled-macro-expansion -Wno-unused-macros -fopenmp=libomp -O3 -DNDEBUG

Running on 1 node with total 32 cores, 64 processing units
Hardware detected:
  CPU info:
    Vendor: AMD
    Brand:  AMD Ryzen Threadripper 3970X 32-Core Processor
    Family: 23   Model: 49   Stepping: 0
    Features: aes amd apic avx avx2 clfsh cmov cx8 cx16 f16c fma htt lahf misalignsse mmx msr nonstop_tsc pclmuldq pdpe1gb popcnt pse rdrnd rdtscp sha sse2 sse3 sse4a sse4.1 sse4.2 ssse3
  Hardware topology: Basic
    Packages, cores, and logical processors:
    [indices refer to OS logical processors]
      Package  0: [   0  40] [   1   9] [   2  10] [   3  11] [   4  12] [   5  13] [   6  14] [   7  15] [   8  16] [  17  25] [  18  26] [  19  27] [  20  28] [  21  29] [  22  30] [  23  31] [  24  32] [  33  41] [  34  42] [  35  43] [  36  44] [  37  45] [  38  46] [  39  47] [  48  56] [  49  57] [  50  58] [  51  59] [  52  60] [  53  61] [  54  62] [  55  63]
    CPU limit set by OS: -1   Recommended max number of threads: 64

I compiled GROMACS with clang because I experienced the make check issue that was reported here.

Any advice would be much appreciated - thank you for your help!

While you are seeing the same symptom, the cause is unlikely to be the same.

You could check whether the mdrun hang also occurs when:

  • using gcc
  • using OpenMP only, i.e. -ntmpi 1 -ntomp 32 or 64 (assuming you are using the default domain decomposition)
  • using lib-MPI (install MPI and configure with -DGMX_MPI=ON).
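As a rough sketch, the three checks above might look like the following. The -deffnm prefix "md", the build directory layout, and the 8 ranks x 4 threads split for the lib-MPI run are assumptions for illustration, not values from the original post. The script only prints the commands unless RUN=1 is set, so it is safe to try first.

```shell
#!/bin/sh
# Dry run by default: prints each command. Set RUN=1 to actually execute.
# DEFFNM is a hypothetical run prefix (expects e.g. md.tpr); use your own.
DEFFNM="${DEFFNM:-md}"

run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        echo "$@"
    fi
}

# 1) Rebuild with gcc instead of clang (run from an empty build directory
#    inside the GROMACS source tree; assumed layout).
run cmake .. -DCMAKE_C_COMPILER=gcc -DCMAKE_CXX_COMPILER=g++

# 2) OpenMP only: a single thread-MPI rank, so no domain decomposition.
run gmx mdrun -ntmpi 1 -ntomp 32 -deffnm "$DEFFNM"
#    Same test using all 64 hardware threads.
run gmx mdrun -ntmpi 1 -ntomp 64 -deffnm "$DEFFNM"

# 3) lib-MPI build: configure with -DGMX_MPI=ON, then launch via mpirun.
#    8 ranks x 4 OpenMP threads is an assumed split for 32 cores.
run cmake .. -DGMX_MPI=ON
run mpirun -np 8 gmx_mpi mdrun -ntomp 4 -deffnm "$DEFFNM"
```

If the hang disappears in only one of these configurations, that narrows the cause to the compiler, thread-MPI, or domain decomposition respectively.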