Extreme performance loss with version 2026.1 on APUs

GROMACS version: 2026.1
GROMACS modification: No

Dear all,

I would like to ask the community for help with an issue I encountered when running simulations on APUs. I am running these simulations on an HPC system where each node has two AMD Instinct MI300A APUs plus 24 CPU cores per APU.

With GROMACS 2025.3 the performance was as expected: I have simulated the same system on NVIDIA A100s and H200s, and the MI300A sits somewhere in between. However, there was a big issue: a fraction of my jobs would enter a “coma” state, where they would technically run, but their performance would drop to under 1 ns/day. This issue affected 30-50% of the jobs and occurred seemingly at random; at least we never managed to pin it down, and since version 2026.1 changed how GROMACS interacts with APUs, we decided to simply sit it out.

Unfortunately, version 2026.1 made things worse, not better. Below I am sharing a performance benchmark comparing versions 2025.3 and 2026.1. The plots show the simulation characteristics for my ~1.3 million atom system; each simulation ran for 30 minutes or until it reached 100k steps. The left plot shows the performance and the right plot shows how many of the three replicates did not finish. A run counts as unfinished if it either did not cross the halfway point (50,000 steps) or returned an error; in this benchmark I only encountered the former, i.e. runs that kept going but at less than ~0.1 ns/day, and none ended with an error. The x-axis shows the number of nodes and MPI ranks used for the simulations.

Things are kind of OK when PME is done on the CPUs. Performance drops off a cliff once you use more than four MPI ranks per node, but that is to be expected. However, when PME is performed on the GPU, things get ugly. There is a noticeable performance gap between versions 2026.1 and 2025.3, and once you use more than two MPI ranks, all simulations enter a “coma” state in which they technically run, but their performance is so low that they cannot complete 50,000 steps. Since we only measure performance after the first 50,000 steps, they do not produce any output.
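For clarity, the two benchmark series differ only in where PME runs; the mdrun invocations were roughly along these lines (a sketch, not the exact commands; the tpr name and srun options are placeholders):

# PME handled by the CPU ranks
srun gmx_mpi mdrun -s benchmark.tpr -nb gpu -pme cpu -maxh 0.5 -resethway

# PME offloaded to the GPU, with one dedicated PME rank
srun gmx_mpi mdrun -s benchmark.tpr -nb gpu -pme gpu -npme 1 -maxh 0.5 -resethway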

2025.3 scales quite well, but it is not without flaws; it has an apparently stochastic issue where some jobs suffer an extreme drop in performance. With the low number of replicates, this occurred only twice in this test (two of the two-node, four-MPI-rank runs).

I am not sure what to do about this. I am fairly sure the issue is related to offloading PME to the GPU (or, more broadly, performing all updates on the GPU). However, this raises the question of why 2026.1 still works when using a single node with one PP and one PME rank.

I had a hunch that you cannot run PP and PME on the same node, so I tried assigning GPUs explicitly using the -gputasks argument of mdrun. However, this did not help.
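For reference, with two ranks per node and both APUs visible, that explicit assignment amounts to something like the following (a sketch; -gputasks takes one device ID per node-local GPU task, so here task 0 goes to APU 0 and task 1 to APU 1):

srun gmx_mpi mdrun -s benchmark.tpr -nb gpu -pme gpu -npme 1 -gputasks 01 -maxh 0.5 -resethway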

I am not sure whether this is an issue with our GROMACS installation, our node settings, or GROMACS itself. If anyone has encountered something similar or has an idea, I would really appreciate some input.

Best Regards,
Florian

Hi,

Running PP and PME on the same node should not be a problem. Good that you tested it, though.

If you share the gmx -version output and some details about your simulation (mdp file, Slurm script, output log) and hardware, that would help. Checking the dmesg log on a node with “comatose” runs could also shed some light.

Can you try setting ROCR_VISIBLE_DEVICES to limit GPU/APU visibility for each MPI rank (e.g., export ROCR_VISIBLE_DEVICES=${SLURM_LOCALID})? GROMACS “touches” all visible GPU devices before starting the simulation, and we have seen this cause trouble with some versions of the AMD GPU driver. I don’t think it was ever this bad, but it could help.

Could you elaborate on what changed in how GROMACS interacts with APUs? We did some optimizations for AMD hardware, but nothing APU-specific in the main release, as far as I recall.

Hi,

Thank you so much for taking the time to respond!

                       :-) GROMACS - gmx_mpi, 2026.1 (-:

Executable:   /mpcdf/soft/RHEL_9/packages/znver4/gromacs/gcc_15-15.1.0-openmpi_gpu_5.0-5.0.9-rocm_7.2-7.2.0/2026.1/bin/gmx_mpi
Data prefix:  /mpcdf/soft/RHEL_9/packages/znver4/gromacs/gcc_15-15.1.0-openmpi_gpu_5.0-5.0.9-rocm_7.2-7.2.0/2026.1
Working dir:  /viper/u2/fleidne
Command line:
  gmx_mpi -version

GROMACS version:     2026.1
Precision:           mixed
Memory model:        64 bit
MPI library:         MPI
MPI version:         Open MPI v5.0.9, package: Open MPI abuild@znver4-03 Distribution, ident: 5.0.9, repo rev: v5.0.9, Oct 30, 2025
OpenMP support:      enabled (GMX_OPENMP_MAX_THREADS = 128)
GPU support:         HIP
NBNxM GPU setup:     super-cluster 2x2x2 / cluster 8 (cluster-pair splitting off)
SIMD instructions:   AVX_512
CPU FFT library:     fftw-3.3.10-sse2-avx-avx2-avx2_128-avx512
GPU FFT library:     VkFFT internal (1.3.1) with HIP backend
Multi-GPU FFT:       none
RDTSCP:              enabled
TNG support:         enabled
Hwloc support:       disabled
Tracing support:     disabled
Colvars support:     enabled (version 2025-10-13)
CP2K support:        disabled
Torch support:       disabled
Plumed support:      enabled
C compiler:          /mpcdf/soft/RHEL_9/packages/x86_64/gcc/15.1.0/bin/gcc GNU 15.1.0
C compiler flags:    -fexcess-precision=fast -funroll-all-loops -march=skylake-avx512 -Wno-missing-field-initializers -O3 -DNDEBUG
C++ compiler:        /mpcdf/soft/RHEL_9/packages/x86_64/gcc/15.1.0/bin/g++ GNU 15.1.0
C++ compiler flags:  -fexcess-precision=fast -funroll-all-loops -march=skylake-avx512 -Wno-missing-field-initializers -Wno-old-style-cast -Wno-cast-qual -Wno-suggest-override -Wno-suggest-destructor-override -Wno-zero-as-null-pointer-constant -Wno-unused-parameter -Wno-unused-variable -Wno-newline-eof -Wno-old-style-cast -Wno-zero-as-null-pointer-constant -Wno-unused-but-set-variable -Wno-sign-compare -Wno-unused-result -Wno-unused-value -Wno-stringop-truncation -Wno-cast-function-type-strict SHELL:-fopenmp -O3 -DNDEBUG
BLAS library:        Internal
LAPACK library:      Internal
HIP compiler:        /mpcdf/soft/RHEL_9/packages/x86_64/rocm/7.2.0/bin/hipcc 7.2.26015-fc0010cf6a
HIP compiler flags:  --offload-arch=gfx942;-fPIC;-fno-gpu-rdc;-ffast-math;-munsafe-fp-atomics;-fdenormal-fp-math=ieee;-fcuda-flush-denormals-to-zero;-fno-slp-vectorize;-Wno-unused-command-line-argument;-Wno-pass-failed;-DNDEBUG -O3 -DNDEBUG
HIP driver/runtime:  7.2.26015

The dmesg log will take me a while.

I will try limiting the visibility of the GPUs. These runs are in exclusive mode, so I would not expect any external interference, but I take it you are suggesting this might still fix some issues.

Here are the MDP file, my SLURM job script, and a tar archive containing the GROMACS and SLURM logs for a successful and an unsuccessful run with 2026.1. I added the .txt suffix to the batch file and the archive so that I could upload them. I can also upload the raw logs if that is preferred.

step2_production.mdp (1.2 KB)

logs.tar.gz.txt (20.0 KB)

jobscript_pme_gpu.sh.txt (459 Bytes)

My mistake: I thought full HIP support had been added in version 2026.1, but I checked the release notes and it was actually added in version 2026.0. That is what I was referring to, with rather sloppy wording; what I meant to say is that we wanted to wait for full HIP support, because that would also get us more support from AMD.

Hope that clarifies things,

All the best,

Florian

Hello,

If you could also share the gmx --version output from the previous runs, it would help to see what you are comparing against. I am responsible for the HIP version and will have a look at what is going on. Can you also provide the output from rocm-smi and rocminfo?

All the best
Paul

Hi Paul,

Attached is the output of

gmx -version

rocm-smi

rocminfo

with the 2025.3 / 2026.1 module loaded.

info2025.txt (27.7 KB)

info2026.txt (27.4 KB)

I had a quick glance and noticed that with 2026.1 I get a low-power-state warning, which appears to be a known bug in ROCm 7.1 and is unrelated to performance.

[EDIT] The files I initially uploaded included the output of gmx -version when it should have been gmx_mpi -version. I have updated the files with the correct GROMACS version output.

The low-power-state warning is something you can ignore; as you say, it is a bug in ROCm and does not indicate anything about the status of the GPU.
The APU is very sensitive to how you pin the CPU cores to the GPUs, and I see that you only rely on the pinning that GROMACS does. We have reworked some of the pinning strategies in 2026, so I am wondering if that is having an impact.
I also know that oversubscription on the MI300A has a very bad performance impact (at least in my testing with the HIP backend); would it be possible to use one of the GPU partitioning schemes instead when running more than one rank per GPU?
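For reference, the current partitioning of the APUs can be queried and changed with rocm-smi, roughly like this (a sketch; the exact option names and supported modes can differ between ROCm versions):

rocm-smi --showcomputepartition      # SPX = whole APU exposed as one device
rocm-smi --showmemorypartition       # NPS1/NPS4 memory layout
rocm-smi --setcomputepartition CPX   # e.g. split each APU into several smaller logical devices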

Thanks for the suggestions. I will have to experiment with pinning a bit to see if I can find a configuration that restores performance.

That’s good to know. I’m still experiencing severe performance loss in the two-node, two-task-per-node setup (which corresponds to one rank per APU). I think this means that oversubscription cannot be the only issue, but I’ll definitely avoid it in future so that we can rule it out as a source of error.

Unfortunately, I’m not very familiar with GPU partitioning. But from what you mention, I think my next two logical steps are to figure out which partitioning scheme I am using and to pin ranks using a wrapper script.

Best,

Florian

I ran two tests today. In the first one I “manually” assigned the tasks/CPUs to APUs using the following job script:

#!/bin/bash -l
#
#SBATCH -o ./job.out.%j
#SBATCH -e ./job.err.%j
#SBATCH -D ./
#
#SBATCH --time=00:35:00
#SBATCH -N 2
#SBATCH --mem=0
#SBATCH --constraint="apu"   # Request nodes providing APUs.
#SBATCH --gres=gpu:2         # Request 2 APUs per node.
#SBATCH --tasks-per-node=2
#SBATCH --cpus-per-task=24
#SBATCH --distribution=block:block

module purge
ml gromacs/2026.1

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

srun -n 4 --cpu_bind=cores ./gpu_wrapper.sh gmx_mpi mdrun -s benchmark.tpr -deffnm traj -nb gpu  -pme gpu -npme 1 -maxh 0.5 -resethway

and gpu_wrapper.sh:

#!/bin/bash
#

# Map the local rank ID to the ROCR_VISIBLE_DEVICES variable
export ROCR_VISIBLE_DEVICES=$SLURM_LOCALID

exec "$@"

This should ensure that each rank only sees one APU and that all the cores are assigned to the correct APU.

This did not work; the jobs still did not progress.

Next, I decided to run a job where I set

GMX_DISABLE_DIRECT_GPU_COMM=1

I know the issue is related to running multiple ranks on multiple APUs, so turning off direct GPU communication was one way to change how the jobs run.
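Concretely, this just means making sure the variable is in the environment of the mdrun ranks, e.g. by exporting it in the job script right before the srun line used above:

export GMX_DISABLE_DIRECT_GPU_COMM=1   # fall back to staging GPU communication through CPU buffers

srun -n 4 --cpu_bind=cores ./gpu_wrapper.sh gmx_mpi mdrun -s benchmark.tpr -deffnm traj -nb gpu -pme gpu -npme 1 -maxh 0.5 -resethway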

This did work. Using the same setup as the first job, I got 40 ns/day. This is only 66% of the performance of the 2025.3 jobs, but those jobs can use direct GPU communication.

@pbauer do you have any idea what could cause the jobs to get stuck when using direct GPU communication?

Best,
Florian

Hello Florian,
I had another look at your job log files and see that the bonded task is not on the GPU. I am wondering whether the job gets stuck because it is waiting on some communication that never happens. Can you run one of those jobs under rocgdb and break once it gets stuck? I would like to know whether it is stopped on a communication step or on something else. The fact that it runs when not using direct GPU comms tells me that it could be stuck there.
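Once a run is hung, something along these lines would already tell us a lot (a sketch; it assumes you can attach to the process on the compute node, and rocgdb is simply AMD's GPU-aware gdb):

rocgdb -p <pid of the stuck gmx_mpi rank>
(gdb) thread apply all bt     # CPU-side backtrace of every thread
(gdb) info agents             # GPU agents visible to the debugger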
Also, can you give me some information on how your MPI build is configured? I have run into issues with slow GPU-direct MPI before when the build did not include certain options.
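For example, a first check would be whether UCX and Open MPI were actually built with ROCm support, roughly like this (a sketch; component and transport names may differ between versions):

ucx_info -v                                        # configure line should mention --with-rocm
ucx_info -d | grep -i rocm                         # ROCm transports (e.g. rocm_ipc) should be listed
ompi_info --all | grep -i -E 'rocm|accelerator'    # Open MPI 5 accelerator/ROCm components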

/Paul

This will take me a little while. Thus far I have been using the GROMACS build provided by our HPC admins, but it does not have debug symbols. I am now compiling my own version, which will also make further tests easier.

Here is the output of:

ompi_info

ucx_info -v

ucx_info -c

mpi_build_info.txt (28.7 KB)

I can’t run the jobs under rocgdb because I don’t have direct access to the nodes. I did create a core dump after the job had been running for a while; this is the backtrace for node 1:
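For reference, one way to take such a dump without an interactive shell on the compute node is an overlapping job step, roughly like this (a sketch with placeholder names; it assumes a reasonably recent Slurm and that gdb/gcore is available on the node):

srun --jobid=<JOBID> --overlap -N1 -n1 -w <nodename> \
     bash -c 'gcore -o gmx_core $(pgrep -u $USER -f mdrun | head -n1)'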

[Current thread is 1 (Thread 0x1492c30c9500 (LWP 914153))]
(gdb) bt
#0  0x00001492c2bb8dfc in mca_pml_ucx_send () from /mpcdf/soft/RHEL_9/packages/znver4/openmpi_gpu/gcc_15-15.1.0-rocm_7.2-7.2.0/5.0.9/lib/openmpi/mca_pml_ucx.so
#1  0x00001492cf8203ea in PMPI_Send () from /viper/u2/system/soft/RHEL_9/packages/znver4/openmpi_gpu/gcc_15-15.1.0-rocm_7.2-7.2.0/5.0.9/lib/libmpi.so.40
#2  0x00001492d64d9c6a in gmx::GpuHaloExchange::Impl::communicateHaloForces(bool, gmx::FixedCapacityVector<GpuEventSynchronizer*, 2ul>*) ()
   from /viper/u2/fleidne/software/gromacs/2026.1/bin/../lib64/libgromacs_mpi.so.11
#3  0x00001492d64831df in communicateGpuHaloForces(t_commrec const&, bool, gmx::FixedCapacityVector<GpuEventSynchronizer*, 2ul>*) ()
   from /viper/u2/fleidne/software/gromacs/2026.1/bin/../lib64/libgromacs_mpi.so.11
#4  0x00001492d6a91566 in gmx::do_force(_IO_FILE*, t_commrec const*, t_inputrec const&, gmx::MDModulesNotifiers const&, gmx::Awh*, gmx_enfrot*, gmx::ImdSession*, pull_t*, long, t_nrnb*, gmx_wallcycle*, gmx_localtop_t const*, float const (*) [3], gmx::ArrayRefWithPadding<gmx::BasicVector<float> >, gmx::ArrayRef<gmx::BasicVector<float> >, history_t const*, gmx::ForceBuffersView*, float (*) [3], t_mdatoms const*, gmx_enerdata_t*, gmx::ArrayRef<float const>, t_forcerec*, gmx::MdrunScheduleWorkload const&, gmx::VirtualSitesHandler*, float*, double, gmx_edsam*, CpuPpLongRangeNonbondeds*, DDBalanceRegionHandler const&) () from /viper/u2/fleidne/software/gromacs/2026.1/bin/../lib64/libgromacs_mpi.so.11
#5  0x00001492d6cdda7c in gmx::LegacySimulator::do_md() () from /viper/u2/fleidne/software/gromacs/2026.1/bin/../lib64/libgromacs_mpi.so.11
#6  0x00001492d6d15433 in gmx::Mdrunner::mdrunner() () from /viper/u2/fleidne/software/gromacs/2026.1/bin/../lib64/libgromacs_mpi.so.11
#7  0x0000000000409862 in gmx::gmx_mdrun(ompi_communicator_t*, gmx_hw_info_t const&, int, char**) ()
#8  0x00000000004099cd in gmx::gmx_mdrun(int, char**) ()
#9  0x00001492d6456c83 in gmx::CommandLineModuleManager::run(int, char**) () from /viper/u2/fleidne/software/gromacs/2026.1/bin/../lib64/libgromacs_mpi.so.11
#10 0x0000000000405ecd in main ()

And for node 2:

[Current thread is 1 (Thread 0x148c70d63500 (LWP 3621680))]
(gdb) bt
#0  0x0000148c6bcc6b98 in uct_rc_mlx5_iface_progress_cyclic () from /mpcdf/soft/RHEL_9/packages/znver4/UCX-GPU/gcc_15-15.1.0-rocm_7.2-7.2.0/1.18.0/lib/ucx/libuct_ib_mlx5.so.0
#1  0x0000148c7056979a in ucp_worker_progress () from /mpcdf/soft/RHEL_9/packages/znver4/UCX-GPU/gcc_15-15.1.0-rocm_7.2-7.2.0/1.18.0/lib/libucp.so.0
#2  0x0000148c7ce87033 in opal_progress () from /mpcdf/soft/RHEL_9/packages/znver4/openmpi_gpu/gcc_15-15.1.0-rocm_7.2-7.2.0/5.0.9/lib/libopen-pal.so.80
#3  0x0000148c7d48b260 in ompi_request_default_wait () from /viper/u2/system/soft/RHEL_9/packages/znver4/openmpi_gpu/gcc_15-15.1.0-rocm_7.2-7.2.0/5.0.9/lib/libmpi.so.40
#4  0x0000148c7d4bfaf6 in PMPI_Wait () from /viper/u2/system/soft/RHEL_9/packages/znver4/openmpi_gpu/gcc_15-15.1.0-rocm_7.2-7.2.0/5.0.9/lib/libmpi.so.40
#5  0x0000148c84173c76 in gmx::GpuHaloExchange::Impl::communicateHaloForces(bool, gmx::FixedCapacityVector<GpuEventSynchronizer*, 2ul>*) ()
   from /viper/u2/fleidne/software/gromacs/2026.1/bin/../lib64/libgromacs_mpi.so.11
#6  0x0000148c8411d1df in communicateGpuHaloForces(t_commrec const&, bool, gmx::FixedCapacityVector<GpuEventSynchronizer*, 2ul>*) ()
   from /viper/u2/fleidne/software/gromacs/2026.1/bin/../lib64/libgromacs_mpi.so.11
#7  0x0000148c8472b566 in gmx::do_force(_IO_FILE*, t_commrec const*, t_inputrec const&, gmx::MDModulesNotifiers const&, gmx::Awh*, gmx_enfrot*, gmx::ImdSession*, pull_t*, long, t_nrnb*, gmx_wallcycle*, gmx_localtop_t const*, float const (*) [3], gmx::ArrayRefWithPadding<gmx::BasicVector<float> >, gmx::ArrayRef<gmx::BasicVector<float> >, history_t const*, gmx::ForceBuffersView*, float (*) [3], t_mdatoms const*, gmx_enerdata_t*, gmx::ArrayRef<float const>, t_forcerec*, gmx::MdrunScheduleWorkload const&, gmx::VirtualSitesHandler*, float*, double, gmx_edsam*, CpuPpLongRangeNonbondeds*, DDBalanceRegionHandler const&) () from /viper/u2/fleidne/software/gromacs/2026.1/bin/../lib64/libgromacs_mpi.so.11
#8  0x0000148c84977a7c in gmx::LegacySimulator::do_md() () from /viper/u2/fleidne/software/gromacs/2026.1/bin/../lib64/libgromacs_mpi.so.11
#9  0x0000148c849af433 in gmx::Mdrunner::mdrunner() () from /viper/u2/fleidne/software/gromacs/2026.1/bin/../lib64/libgromacs_mpi.so.11
#10 0x0000000000409862 in gmx::gmx_mdrun(ompi_communicator_t*, gmx_hw_info_t const&, int, char**) ()
#11 0x00000000004099cd in gmx::gmx_mdrun(int, char**) ()
#12 0x0000148c840f0c83 in gmx::CommandLineModuleManager::run(int, char**) () from /viper/u2/fleidne/software/gromacs/2026.1/bin/../lib64/libgromacs_mpi.so.11
#13 0x0000000000405ecd in main ()

I hope this is helpful.

Best,

Florian