Please help me! Low GROMACS performance on RTX 5090

Dear GROMACS Community,

I recently set up a new workstation with the following hardware:

  • CPU: Intel Core Ultra 9 285K

  • GPU: NVIDIA RTX 5090 (CUDA 12.8)

  • RAM: 32 GB

  • Storage: 2 TB NVMe SSD

I built GROMACS 2025.2 from source with GPU support enabled. Below is the output of gmx --version:

GROMACS version: 2025.2

Precision: mixed

Memory model: 64 bit

MPI library: thread_mpi

OpenMP support: enabled (GMX_OPENMP_MAX_THREADS = 128)

GPU support: CUDA

NBNxM GPU setup: super-cluster 2x2x2 / cluster 8 (cluster-pair splitting on)

SIMD instructions: AVX2_256

CPU FFT library: fftw-3.3.10-sse2-avx-avx2-avx2_128

GPU FFT library: cuFFT

Multi-GPU FFT: none

RDTSCP usage: enabled

TNG support: enabled

Hwloc support: disabled

Tracing support: disabled

C compiler: /usr/bin/gcc-11 GNU 11.4.0

C++ compiler: /usr/bin/g++-11 GNU 11.4.0

CUDA compiler: /usr/local/cuda-12.8/bin/nvcc (v12.8.93)

CUDA flags: --allow-unsupported-compiler; -arch=sm_89 -O3 -DNDEBUG

CUDA runtime: 12.80

CUDA driver: 12.80

I ran a simple test with a ~30,000-atom system. However, performance was unexpectedly low — only around 150 ns/day.
Monitoring with nvidia-smi showed GPU utilization hovering around 15–20%, and power draw remained low (~110W), far below the card’s 575W capacity.
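
For reference, I was polling the card with something along these lines:

    nvidia-smi --query-gpu=utilization.gpu,power.draw --format=csv -l 1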

I’ve confirmed that the run uses CUDA GPU kernels, and GROMACS was compiled with sm_89 enabled. The system is running under WSL2 on Windows 11 (which may be a factor).

Thanks in advance!

Best regards,

Thang.


Here is my log file.

test.log (31.1 KB)

Hi Thang,

Here is my log file.

Great that you attached it! Let’s look at the performance counter table at the end of the file:

      R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G

On 1 MPI rank, each using 24 OpenMP threads

 Activity:              Num   Num      Call    Wall time         Giga-Cycles
                        Ranks Threads  Count      (s)         total sum    %
--------------------------------------------------------------------------------
 Neighbor search           1   24        126       7.445        657.051  12.1
 Launch PP GPU ops.        1   24      19876      14.975       1321.567  24.4
 Force                     1   24      10001       1.810        159.753   2.9
 PME GPU mesh              1   24      10001      13.258       1170.098  21.6
 Wait Bonded GPU           1   24        101       0.000          0.025   0.0
 Wait GPU NB local         1   24      10001       0.007          0.575   0.0
 Wait GPU state copy       1   24        854       0.074          6.529   0.1
 NB X/F buffer ops.        1   24        101       0.902         79.581   1.5
 Write traj.               1   24          3       1.403        123.824   2.3
 GPU constr. setup         1   24          1       0.039          3.447   0.1
 Kinetic energy            1   24        201       1.380        121.821   2.2
 Rest                                             20.144       1777.751  32.8
--------------------------------------------------------------------------------
 Total                                            61.437       5422.021 100.0
--------------------------------------------------------------------------------
 Breakdown of PME mesh activities
--------------------------------------------------------------------------------
 Wait PME GPU gather       1   24      10001       0.002          0.166   0.0
 Reduce GPU PME F          1   24      10001       0.845         74.611   1.4
 Launch PME GPU ops.       1   24      90009      12.342       1089.197  20.1
--------------------------------------------------------------------------------

First thing to note: you’re running for only around 60 seconds here. That should be enough, but it could also be that the first few steps are exceptionally slow and skew the timing. I’m assuming you’ve run longer simulations with roughly the same performance, so we can discard this possible effect.

Almost 45% of the time is spent just launching the GPU tasks (see “Launch PME GPU ops.” + “Launch PP GPU ops.”). Additionally, 32.8% of the time is spent in “Rest”. “Neighbor search” at over 10% is not great either, but it is minor compared to the other two issues.

  • High “Rest” time is a huge red flag: that is time not accounted for elsewhere, and it should normally be below 5%, so something unexpected is wasting a lot of time. This could be due to the P/E cores in your CPU: GROMACS launches 24 threads, which get split across the 8 P-cores and 16 E-cores, which is not great. Running GROMACS with -ntmpi 1 -ntomp 8 -pin on (see the example command after this list) would make GROMACS use only the P-cores and leave the E-cores free for whatever else is running on your machine. But unless you’re running other CPU-heavy tasks alongside the simulation, the effect should not be that large.
  • High launch time is, to an extent, normal: you have a rather small system, so the GPU can compute things nearly as fast as the CPU manages to throw tasks at it. Enabling CUDA Graphs could help here. But I’d still expect the launch time to be much lower even without Graphs; a 30k-atom system is not that small.
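
For example, a run pinned to the P-cores could look like the line below. This is only a sketch: test.tpr is a placeholder for your own input, and the offload flags assume you want non-bonded, PME, bonded, and update all on the GPU, so adjust them to match your setup.

    gmx mdrun -s test.tpr -ntmpi 1 -ntomp 8 -pin on \
              -nb gpu -pme gpu -bonded gpu -update gpu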

As you suspect, it is possible that one or both issues are greatly exacerbated by WSL and/or the Windows driver stack. There seem to be reports of large launch overheads when using CUDA on Windows while the same card is also used for graphics (via WDDM), and there does not seem to be a solution for that. Using CUDA Graphs might help by reducing the number of GPU launch operations.
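
If you want to try CUDA Graphs, they are enabled with the GMX_CUDA_GRAPH environment variable. As far as I remember they are still experimental and require a GPU-resident run, so the offload flags below are again an assumption about your setup:

    GMX_CUDA_GRAPH=1 gmx mdrun -s test.tpr -ntmpi 1 -ntomp 8 -pin on \
                               -nb gpu -pme gpu -bonded gpu -update gpu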

Building with Visual Studio + CUDA is another thing you can try, to at least rule out WSL-specific effects. It won’t help with lower-level WDDM issues, though.
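
A rough configure sketch for that route, run from a build directory (the generator string is an assumption about your Visual Studio version, and I have not tested this myself; the install guide has the details):

    cmake .. -G "Visual Studio 17 2022" -DGMX_GPU=CUDA
    cmake --build . --config Release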

We don’t have any Windows machines around for testing, so please share your findings if you try any of the suggestions above.


Thank you so much. I will try to rebuild GROMACS following your advice. It worked normally on my old system with an RTX 3080, but after I upgraded to the RTX 5090 the performance was too low.

I think the P and E cores were the problem. I changed to -ntmpi 1 -ntomp 8 -pin on and GPU utilization increased to 50%. Thank you again.