Extreme performance loss with version 2026.1 on APUs

GROMACS version: 2026.1
GROMACS modification: No

Dear all,

I want to ask the community for help with an issue I encountered when running simulations on APUs. I am running these simulations on an HPC system where each node has two AMD Instinct MI300A APUs plus 24 CPU cores per APU.

With GROMACS 2025.3 the performance was as expected; I have simulated the same system on NVIDIA A100s and H200s, and the MI300A lands somewhere in between. However, there was a big issue: a fraction of my jobs would enter a "coma" state, where they would technically run, but their performance would drop to under 1 ns/day. This issue affected 30-50% of the jobs and occurred seemingly at random. At least we never managed to pin it down, and since version 2026.1 changed how GROMACS interacts with APUs, we decided to simply sit it out.

Unfortunately, version 2026.1 made things worse, not better. Below I'm sharing a performance benchmark comparing versions 2025.3 and 2026.1. The plots show the simulation characteristics for my ~1.3 million atom system. The simulations ran either for 30 minutes or until they reached 100,000 steps. The left plot shows the performance, and the right plot shows how many of the three replicates did not finish. A simulation counted as unfinished if it either did not cross the halfway point (50,000 steps) or returned an error; in this benchmark I only encountered the former: unfinished simulations kept running, but at less than ~0.1 ns/day, and none ended with an error. The x-axis shows the number of nodes and MPI ranks used for the simulations.

Things are kind of OK when PME is done on the CPU. Performance drops off a cliff once you use more than four MPI ranks per node, but this is to be expected. However, when PME is performed on the GPU, things get ugly. There is a noticeable performance gap between versions 2026.1 and 2025.3, and once you use more than two MPI ranks, all simulations enter a "coma" state in which they technically run, but at a rate too low to ever complete 50,000 steps. Since we only calculate performance after the first 50,000 steps, these runs produce no output.
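For context, the GPU-resident runs are launched along these lines (a sketch only; the file name, rank counts, and thread counts here are illustrative placeholders, not my exact job script):

```shell
# Sketch of a GPU-resident run with one dedicated PME rank (names illustrative).
# 4 MPI ranks total: 3 PP + 1 PME; nonbonded, PME, bonded, and update all on the GPU.
srun --ntasks=4 gmx_mpi mdrun -deffnm benchmark \
    -nb gpu -pme gpu -bonded gpu -update gpu \
    -npme 1 -ntomp 12 \
    -nsteps 100000 -maxh 0.5
```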

2025.3 scales quite well, but it is not without flaws: it has an apparently stochastic issue where some jobs experience an extreme drop in performance. Due to the low number of replicates, this only occurred twice in this test (two of the two-node, four-MPI-rank runs).

I am not sure what to do about this. I am pretty sure the issue is related to offloading PME to the GPU (or, more broadly, to performing all updates on the GPU). However, this raises the question of why 2026.1 still works when using a single node with one PP and one PME rank.

I had a hunch that you can't run PP and PME on the same node, so I tried assigning GPUs directly using the -gputasks argument to mdrun. However, this did not help.
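To be concrete about what I mean by assigning GPUs directly (a sketch; the device mapping assumes four ranks on one node with two APUs, which may not match everyone's setup):

```shell
# Sketch: map the GPU tasks of 4 ranks onto the two APUs with -gputasks
# (one digit per GPU task). Here ranks 0-1 use device 0, ranks 2-3 device 1,
# so the dedicated PME rank ends up on its own APU together with one PP rank.
srun --ntasks=4 gmx_mpi mdrun -deffnm benchmark \
    -nb gpu -pme gpu -update gpu -npme 1 \
    -gputasks 0011
```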

I am not sure whether this is an issue with our GROMACS installation, our node settings, or GROMACS itself. If anyone has encountered something similar or has an idea, I would really appreciate some input.

Best Regards,
Florian

Hi,

That should not be a problem. Good that you tested it, though.

If you share the gmx -version output and some details about your simulation (mdp file, SLURM script, output log) and hardware, that would help. Checking the dmesg log on a node with "comatose" runs could also shed some light.

Can you try setting ROCR_VISIBLE_DEVICES to limit GPU/APU visibility for each MPI rank (e.g., export ROCR_VISIBLE_DEVICES=${SLURM_LOCALID})? GROMACS "touches" all visible GPU devices before starting the simulation, and we have seen this cause trouble with some versions of the AMD GPU driver. I don't think it was ever this bad, but it could help.
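One way to apply this per rank is a small wrapper script that srun launches instead of the binary (a sketch; the script name is a placeholder):

```shell
#!/bin/bash
# wrapper.sh -- sketch of a per-rank GPU binding wrapper (name is a placeholder).
# Each SLURM task only sees the device matching its node-local rank ID, so
# GROMACS never touches the other APU on the node.
# With more ranks per node than APUs, you would map e.g. with a modulo:
#   export ROCR_VISIBLE_DEVICES=$((SLURM_LOCALID % 2))
export ROCR_VISIBLE_DEVICES=${SLURM_LOCALID}
exec "$@"
```

It would then be used like `srun --ntasks-per-node=2 ./wrapper.sh gmx_mpi mdrun ...`.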

Could you elaborate? We did some optimizations for AMD hardware, but nothing APU-specific in the main release as far as I recall.