GROMACS hangs on AMD GPU

GROMACS version: 2025.2
GROMACS modification: No
Hi,
I’ve installed GROMACS with the following cmake options:

cmake .. -DCMAKE_INSTALL_PREFIX=/opt/gromacs-2025.2 \
-DCMAKE_C_COMPILER=/usr/lib64/llvm18/bin/clang \
-DCMAKE_CXX_COMPILER=/usr/lib64/llvm18/bin/clang++ \
-DGMX_GPU=SYCL \
-DGMX_SYCL=ACPP \
-DHIPSYCL_TARGETS='hip:gfx1030' \
-DGMX_SIMD=AVX2_256 \
-DREGRESSIONTEST_DOWNLOAD=ON \
-DGMX_BUILD_OWN_FFTW=ON

My hardware:

  • CPU: Ryzen 7 5800X
  • GPU: AMD Radeon 6800XT

Software environment:

  • OS: Fedora 42 (but I experienced the same issue on another machine with a 6900XT and Ubuntu 24.04)
  • ROCm: 6.3.1 (installed via package manager)
  • AdaptiveCpp: 25.02

When running make check, some MPI-based tests fail due to timeouts (mostly the two-rank tests).
When I try to run a simulation, mdrun hangs during PME tuning. However, if I keep PME on the CPU and offload only the bonded interactions to the GPU, the simulation runs without problems.
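
To be concrete, the two cases look roughly like this (file names are placeholders, and the exact flags may differ slightly from what I used):

gmx mdrun -s system.tpr -deffnm test -nb gpu -bonded gpu -pme gpu    (hangs during PME tuning)
gmx mdrun -s system.tpr -deffnm test -nb gpu -bonded gpu -pme cpu    (runs fine)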
Any suggestions?

Thanks in advance!

Hi!

That’s not a known issue, so more info would be helpful.

Have you checked dmesg or any other system log for errors/warnings that look related? This could offer some insights.

How did you install AdaptiveCpp?

Could you share the md.log files from a failed and from a successful GROMACS run?

That’s a good find.

Could you try running -pme gpu -notunepme? This will keep PME on the GPU but disable its autotuning.
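
That is, something like this (your own .tpr and output names go in place of the placeholders):

gmx mdrun -s your_system.tpr -deffnm test -pme gpu -notunepme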

Another option to try is building GROMACS with -DGMX_GPU_FFT_LIBRARY=rocfft.
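
That would mean reconfiguring in a clean build directory with your existing options plus that flag, roughly:

cmake .. <your existing cmake options> -DGMX_GPU_FFT_LIBRARY=rocfft

and then rebuilding and rerunning make check as usual.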


Disclaimer: the AMD ROCm stack does not officially support your GPU. This usually does not stop things from working, but it means the drivers etc. have been less tested, so you’re more likely to encounter bugs.

Hi!

  • I checked dmesg and didn’t find any unusual warnings or errors.
  • I’ve installed ACpp with the following cmake options:
cmake .. -DCMAKE_INSTALL_PREFIX=/opt/AdaptiveCpp-25.02.0 \
-DROCM_PATH=/usr/lib64/rocm \
-DCMAKE_C_COMPILER=/usr/lib64/llvm18/bin/clang \
-DCMAKE_CXX_COMPILER=/usr/lib64/llvm18/bin/clang++ \
-DLLVM_DIR=/usr/lib64/llvm18/lib/cmake/llvm/ \
-DWITH_ROCM_BACKEND=ON \
-DWITH_SSCP_COMPILER=OFF \
-DDEFAULT_TARGETS='hip:gfx1030' \
-DROCM_DEVICE_LIBS_PATH=/usr/lib64/rocm/lib \
-DCLANG_INCLUDE_PATH=/lib64/rocm/llvm/lib/clang/18/include \
-DCLANG_EXECUTABLE_PATH=/usr/lib64/llvm18/bin/clang++ \
-DCLANG_INCLUDE_PATH=/usr/lib64/llvm18/include \
-DROCM_CLANG_PATH=/usr/lib64/llvm18/bin/clang++ \
-DROCM_CXX_FLAGS="--rocm-device-lib-path=/usr/lib64/rocm/llvm/lib/clang/18/amdgcn/bitcode"

  • Here’s the log file when PME is offloaded to the GPU. The simulation hangs, with one CPU core fully used. The GPU shows high utilization, but the power draw is low (~40W):
    test_pme_gpu.log (19.6 KB)

  • If I try to run with -notunepme it starts, but eventually hangs as well.

  • Here’s the log with the PME offloaded to the CPU.
    test_pme_cpu.log (55.3 KB)
    (I manually stopped this run — it wasn’t hanging.)

  • I’ve also tried building GROMACS with -DGMX_GPU_FFT_LIBRARY=rocfft, but it hangs as well. Just to be safe, I’ll try to recompile it again with that option and report back.

Really appreciate your time!

Thanks for checking. No good leads so far, unfortunately :(

A few more things to try:

  1. What if you run GROMACS with extra debug logging: ACPP_DEBUG_LEVEL=3 AMD_LOG_LEVEL=4 gmx mdrun -s testsystem.tpr -deffnm test -pin on -pinstride 1 -ntmpi 1 -ntomp 8 -v -bonded gpu &> debug_log.txt? The file will be pretty large, but could give some clues.
  2. What if you run -pme gpu -bonded cpu (example below)? You have CPU update anyway due to constraints, so the bonded calculation on the CPU should overlap fine with the non-bondeds and PME running on the GPU.
  3. I see you built the most recent development snapshot of AdaptiveCpp, which is a bit ahead of the latest release. What if you download the release archive from the AdaptiveCpp 25.02.0 release page (AdaptiveCpp/AdaptiveCpp on GitHub) and install it instead? Rebuilding GROMACS in full would be wise after that.
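
For point 2, the command would look roughly like the one in point 1, just with the offload flags swapped, e.g.:

gmx mdrun -s testsystem.tpr -deffnm test -pin on -pinstride 1 -ntmpi 1 -ntomp 8 -v -pme gpu -bonded cpu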

I tried with extra debug logging and found something strange.
When I run
ACPP_DEBUG_LEVEL=3 AMD_LOG_LEVEL=4 gmx mdrun -s testsystem.tpr -deffnm test2 -pin on -pinstride 1 -ntmpi 1 -ntomp 8 -v -bonded gpu &> debug_log.txt
the simulation crashes with a core dump; the last lines of the output are:

[AdaptiveCpp Info] inorder_executor: Dispatching to lane 0x3886c2c0: Memcpy: CPU-Device0 #1 {0, 0, 0}+{1, 1, 49152}-->ROCm-Device0 #1 {0, 0, 0}+{1, 1, 49152}{1, 1, 49152}
:3:hip_memory.cpp           :1543: 2052086538d us:  hipMemcpyAsync ( 0x152cdc3ef000, 0x3a962350, 49152, hipMemcpyHostToDevice, stream:0x389cacf0 )
:4:command.cpp              :352 : 2052086549d us:  Command (CopyHostToDevice) enqueued: 0x382f0a50
:3:rocvirtual.cpp           :168 : 2052086555d us:  Signal = (0x152fca7fa680), Translated start/end = 2052086386999 / 2052086389719, Elapsed = 2720 ns, ticks start/end = 207224658597 / 207224658869, Ticks elapsed = 272
:4:command.cpp              :167 : 2052086020d us:  Command 0x382ef510 complete (Wall: 4811369, CPU: 0, GPU: 281 us)
/usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/bits/stl_vector.h:1282: const_reference std::vector<amd::roc::ProfilingSignal *>::operator[](size_type) const [_Tp = amd::roc::ProfilingSignal *, _Alloc = std::allocator<amd::roc::ProfilingSignal *>]: Assertion '__n < this->size()' failed.
:4:rocblit.cpp              :822 : 2052086568d us:  HSA Async Copy staged H2D dst=0x152cdc3ef000, src=0x152ec0200000, size=49152, completion_signal=0x152fca7fa600
:4:commandqueue.cpp         :151 : 2052086608d us:  Marker queued to ensure finish

On the other hand, if I run the same command but pipe the output through tee:

ACPP_DEBUG_LEVEL=3 AMD_LOG_LEVEL=4 gmx mdrun -s testsystem.tpr -deffnm test2 -pin on -pinstride 1 -ntmpi 1 -ntomp 8 -v -bonded gpu 2>&1 | tee debug_log2.txt

The run continues for a while (I stopped it manually after the log file reached ~20 GB).

If I run with -pme gpu -bonded gpu and the extra debug logging, I get the same assertion failure.

I’ll try to reinstall AdaptiveCpp from the stable release archive and rebuild GROMACS, then report back.
Thanks

Hm, at this point I’d consider trying another version of ROCm… But that is perhaps not the easiest thing to try if you’ve installed the current one from the Fedora repositories.

Irrespective of that, what if you run with HSA_ENABLE_SDMA=0 or AMD_SERIALIZE_COPY=2 instead of the two debug variables I mentioned above? Again, a shot in the dark, but the error is close to a memory copy and is somehow sensitive to the timings (given how it magically fixes itself when you pipe the output to tee), and the copy engines are somehow finicky on consumer AMD cards in my experience.
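
That is, something along these lines (one variable at a time, with your usual mdrun flags; file names are just placeholders):

HSA_ENABLE_SDMA=0 gmx mdrun -s testsystem.tpr -deffnm test -bonded gpu
AMD_SERIALIZE_COPY=2 gmx mdrun -s testsystem.tpr -deffnm test -bonded gpu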

Oh, just to be sure: you did run make check after compiling GROMACS and it all passed?

Yes, as I mentioned in my first post, I ran make check. Some tests failed, but only due to timeouts.

With AMD_SERIALIZE_COPY=2 and HSA_ENABLE_SDMA=0 I managed to do 500k steps successfully, with the latter being faster by a few ns/day.

Additionally, I tried running make check again with HSA_ENABLE_SDMA=0, and encountered the following test failures:
1 - GmxapiExternalInterfaceTests (Failed) GTest IntegrationTest QuickGpuTest
2 - GmxapiInternalInterfaceTests (Subprocess aborted) GTest IntegrationTest QuickGpuTest
3 - NbLibListedForcesTests (Failed) GTest IntegrationTest
4 - NbLibSamplesTestArgon (Failed)
5 - NbLibSamplesTestMethaneWater (Failed)
6 - NbLibUtilTests (Failed) GTest UnitTest
7 - NbLibSetupTests (Failed) GTest IntegrationTest
8 - NbLibTprTests (SEGFAULT) GTest UnitTest
9 - NbLibIntegrationTests (Failed) GTest IntegrationTest
10 - NbLibIntegratorTests (Failed) GTest IntegrationTest
49 - PullTest (Subprocess aborted) GTest UnitTest
68 - MdrunIOTests (Subprocess aborted) GTest IntegrationTest SlowGpuTest
89 - MdrunPullTests (Subprocess aborted) GTest IntegrationTest QuickGpuTest

With error messages like:

  • bin/nblib-integration-test: symbol lookup error: undefined symbol: _ZN5nblib3BoxC1Ef
  • Fatal glibc error: malloc.c:4434 (_int_malloc): assertion failed: (unsigned long) (size) >= (unsigned long) (nb)

Good, I guess. So it’s something with the copy engines indeed. The HSA_ENABLE_SDMA=0 workaround is safe, so this could be a long-term solution (test failures notwithstanding). And it’s still faster than running -pme cpu, right?
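
If it works out, the variable can simply be set once in your shell profile or job script, e.g.:

export HSA_ENABLE_SDMA=0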

ldd bin/nblib-integration-test could provide some info, but that very much looks like a mismatch between the two GROMACS versions installed, at least for Gmxapi and NbLib tests. Have you tried completely clearing up your GROMACS build directory and any installed versions, and then rebuilding from scratch?

Provided rebuilding GROMACS does not fix the issue: could you run gdb -ex run -ex bt -ex quit --args ./bin/mdrun-pull-test? It will run the test under debugger and will print a stack trace if it crashes.

Yes, it’s a lot faster. With -pme cpu I get ~65 ns/day, with the PME offloaded to the GPU I get ~120 ns/day on my system.

Actually, I did start from scratch and:

  • exported HSA_ENABLE_SDMA=0 as environment variable in my .bashrc
  • recompiled AdaptiveCpp (development branch; the release did not recognize some of the cmake variables) with the following cmake options:
cmake .. \
-DCMAKE_INSTALL_PREFIX=/opt/AdaptiveCpp-25.02.0 \
-DROCM_PATH=/usr/lib64/rocm \
-DCMAKE_C_COMPILER=/usr/lib64/llvm18/bin/clang \
-DCMAKE_CXX_COMPILER=/usr/lib64/llvm18/bin/clang++ \
-DLLVM_DIR=/usr/lib64/llvm18/lib/cmake/llvm/ \
-DACPP_COMPILER_FEATURE_PROFILE=full \
-DWITH_ROCM_BACKEND=ON \
-DWITH_SSCP_COMPILER=OFF \
-DWITH_OPENCL_BACKEND=OFF \
-DWITH_LEVEL_ZERO_BACKEND=OFF \
-DWITH_CUDA_BACKEND=OFF \
-DDEFAULT_TARGETS='hip:gfx1030' \
-DROCM_DEVICE_LIBS_PATH=/usr/lib64/rocm/lib \
-DCLANG_EXECUTABLE_PATH=/usr/lib64/llvm18/bin/clang++ \
-DCLANG_INCLUDE_PATH=/usr/lib64/llvm18/include \
-DROCM_CLANG_PATH=/usr/lib64/llvm18/bin/clang++ \
-DROCM_CXX_FLAGS="--rocm-device-lib-path=/usr/lib64/rocm/llvm/lib/clang/18/amdgcn/bitcode"
  • recompiled GROMACS 2025.2 with the following cmake options:
cmake .. \
-DCMAKE_INSTALL_PREFIX=/opt/gromacs-2025.2 \
-DCMAKE_C_COMPILER=/usr/lib64/llvm18/bin/clang \
-DCMAKE_CXX_COMPILER=/usr/lib64/llvm18/bin/clang++ \
-DGMX_GPU=SYCL \
-DGMX_SYCL=ACPP \
-DHIPSYCL_TARGETS='hip:gfx1030' \
-DGMX_GPU_FFT_LIBRARY=VkFFT \
-DGMX_ENABLE_AMD_RDNA_SUPPORT=ON  \
-DGMX_SIMD=AVX2_256 \
-DREGRESSIONTEST_DOWNLOAD=ON \
-DGMX_BUILD_OWN_FFTW=ON

And… Tadaaaaaa!

100% tests passed, 0 tests failed out of 94

Label Time Summary:
GTest              = 270.99 sec*proc (90 tests)
IntegrationTest    = 201.03 sec*proc (29 tests)
MpiTest            = 161.36 sec*proc (21 tests)
QuickGpuTest       =  53.27 sec*proc (23 tests)
SlowGpuTest        = 351.00 sec*proc (16 tests)
SlowTest           =  64.59 sec*proc (14 tests)
UnitTest           =   5.38 sec*proc (47 tests)

Total Test time (real) = 277.35 sec
