Dear GROMACS Community,
I recently set up a new workstation with an RTX 5090 GPU and a hybrid P/E-core CPU. I built GROMACS 2025.2 from source with GPU support enabled. Below is the output of gmx --version:
GROMACS version: 2025.2
Precision: mixed
Memory model: 64 bit
MPI library: thread_mpi
OpenMP support: enabled (GMX_OPENMP_MAX_THREADS = 128)
GPU support: CUDA
NBNxM GPU setup: super-cluster 2x2x2 / cluster 8 (cluster-pair splitting on)
SIMD instructions: AVX2_256
CPU FFT library: fftw-3.3.10-sse2-avx-avx2-avx2_128
GPU FFT library: cuFFT
Multi-GPU FFT: none
RDTSCP usage: enabled
TNG support: enabled
Hwloc support: disabled
Tracing support: disabled
C compiler: /usr/bin/gcc-11 GNU 11.4.0
C++ compiler: /usr/bin/g++-11 GNU 11.4.0
CUDA compiler: /usr/local/cuda-12.8/bin/nvcc (v12.8.93)
CUDA flags: --allow-unsupported-compiler; -arch=sm_89 -O3 -DNDEBUG
CUDA runtime: 12.80
CUDA driver: 12.80
I ran a simple test with a ~30,000-atom system. However, performance was unexpectedly low: only around 150 ns/day.
Monitoring with nvidia-smi showed GPU utilization hovering around 15–20%, and power draw remained low (~110 W), far below the card’s 575 W capacity.
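For anyone wanting to reproduce this kind of monitoring, a query along these lines (using nvidia-smi’s query interface; run it in a second terminal while mdrun is active) reports utilization and power once per second:

```shell
# Poll GPU utilization and power draw once per second while the simulation runs.
nvidia-smi --query-gpu=utilization.gpu,power.draw --format=csv -l 1
```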
I’ve confirmed that the run uses CUDA GPU kernels and that GROMACS was compiled with sm_89 enabled. The system is running under WSL2 on Windows 11 (which may be a factor).
Thanks in advance!
Best regards,
Thang
Here is my log file.

Hi Thang,
Great that you attached it! Let’s look at the performance counter table at the end of the file:
R E A L C Y C L E A N D T I M E A C C O U N T I N G
On 1 MPI rank, each using 24 OpenMP threads
Activity: Num Num Call Wall time Giga-Cycles
Ranks Threads Count (s) total sum %
--------------------------------------------------------------------------------
Neighbor search 1 24 126 7.445 657.051 12.1
Launch PP GPU ops. 1 24 19876 14.975 1321.567 24.4
Force 1 24 10001 1.810 159.753 2.9
PME GPU mesh 1 24 10001 13.258 1170.098 21.6
Wait Bonded GPU 1 24 101 0.000 0.025 0.0
Wait GPU NB local 1 24 10001 0.007 0.575 0.0
Wait GPU state copy 1 24 854 0.074 6.529 0.1
NB X/F buffer ops. 1 24 101 0.902 79.581 1.5
Write traj. 1 24 3 1.403 123.824 2.3
GPU constr. setup 1 24 1 0.039 3.447 0.1
Kinetic energy 1 24 201 1.380 121.821 2.2
Rest 20.144 1777.751 32.8
--------------------------------------------------------------------------------
Total 61.437 5422.021 100.0
--------------------------------------------------------------------------------
Breakdown of PME mesh activities
--------------------------------------------------------------------------------
Wait PME GPU gather 1 24 10001 0.002 0.166 0.0
Reduce GPU PME F 1 24 10001 0.845 74.611 1.4
Launch PME GPU ops. 1 24 90009 12.342 1089.197 20.1
--------------------------------------------------------------------------------
First thing to note: you’re running for around 60 seconds here. That should be enough, but the first few steps can be exceptionally slow and skew the time measurement. I’m assuming that you’ve run longer simulations with roughly the same performance, so we can discard this possible effect.
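One way to exclude startup costs from the timings when benchmarking (a sketch; "md" is a placeholder for your .tpr name, and the step count should be adjusted to your system) is to reset the performance counters partway through the run:

```shell
# Run a short benchmark and reset the cycle/time counters halfway through,
# so initialization and the slow first steps don't distort the ns/day figure.
# -noconfout skips writing final coordinates, convenient for pure benchmarks.
gmx mdrun -deffnm md -nsteps 20000 -resethway -noconfout
```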
Almost 45% of the time is spent just launching the GPU tasks (see “Launch PME GPU ops.” + “Launch PP GPU ops.”). Additionally, 32.8% of the time goes to “Rest”. “Neighbor search” above 10% is not nice either, but that is minor compared to the other two issues.
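For reference, the combined launch fraction can be pulled straight out of the cycle accounting table. The excerpt below is copied from the log above into a file for demonstration; with a real log you would point awk at md.log instead, and the field layout is assumed to match GROMACS 2025:

```shell
# Sum the percentages of the two "Launch ... GPU ops." rows from the
# cycle accounting table; the percentage is the last field on each row.
cat > cycle_table.txt <<'EOF'
 Launch PP GPU ops.    1   24   19876   14.975   1321.567   24.4
 Launch PME GPU ops.   1   24   90009   12.342   1089.197   20.1
EOF
awk '/Launch P(P|ME) GPU ops\./ {sum += $NF} END {printf "%.1f%% of wall time spent launching GPU work\n", sum}' cycle_table.txt
```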
- High “Rest” time is a huge red flag: that’s time not accounted for elsewhere, and it should normally be below 5%, so something unexpected is wasting a lot of time. This could be due to the P/E cores in your CPU: GROMACS is launching 24 threads, which get split between the 8 P-cores and 16 E-cores, which is not great. Running GROMACS with -ntmpi 1 -ntomp 8 -pin on would ensure GROMACS only uses P-cores and leave the E-cores free for whatever other tasks run on your machine. But unless you’re running other CPU-heavy tasks alongside your simulation, the effect should not be that large.
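As a concrete sketch (the -deffnm name and the explicit GPU offload flags are assumptions; -update gpu in particular requires compatible simulation settings, so adapt this to how you normally launch your runs), the full command line could look like:

```shell
# One thread-MPI rank, 8 OpenMP threads pinned to the first 8 (P-)cores,
# with nonbonded, PME, bonded, and update work all offloaded to the GPU.
gmx mdrun -deffnm md \
    -ntmpi 1 -ntomp 8 -pin on \
    -nb gpu -pme gpu -bonded gpu -update gpu
```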
- High launch time is, to an extent, normal: you have a rather small system, so the GPU can compute things nearly as fast as the CPU manages to throw tasks at it. Enabling CUDA Graphs could help here. But I’d still expect the launch time to be much lower even without Graphs; a 30k-atom system is not that small.
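If you want to try CUDA Graphs, in recent GROMACS versions they are an experimental feature toggled via an environment variable (GMX_CUDA_GRAPH is the variable introduced in GROMACS 2023; check the manual for your exact version):

```shell
# Enable the experimental CUDA Graphs code path, which replaces many
# individual kernel launches per step with a single graph launch.
export GMX_CUDA_GRAPH=1
gmx mdrun -deffnm md -ntmpi 1 -ntomp 8 -pin on
```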
As you suspect, it is possible that one or both issues are greatly exacerbated by WSL and/or the Windows driver stack. There are reports of large launch overheads when using CUDA on Windows while the same card also drives the display (via WDDM), and there does not seem to be a solution for it. Using CUDA Graphs might help by reducing the number of GPU launch operations.
Building with Visual Studio + CUDA is another thing you can try, to at least rule out the effects of WSL; it won’t help with lower-level WDDM issues, though.
We don’t have any Windows machines around for testing, so please share your findings if you try any of the suggestions above.
Thank you so much. I will try rebuilding GROMACS as you advise. It worked normally on my old RTX 3080 system, but after upgrading to the RTX 5090 the performance was too low.
I think the P and E cores were the problem: after switching to -ntmpi 1 -ntomp 8 -pin on, GPU utilization increased to 50%. Thank you again.