Low GROMACS performance on RTX 4090

GROMACS version: 2023
GROMACS modification: No

Dear community!
I have two slightly different PC assemblies, both based on the RTX 4090 GPU, that differ in GROMACS performance by more than a factor of two. I have no idea what the reason could be, so I would be happy if someone could give me reasonable advice.

Technical details (OS Ubuntu 22.04 LTS, CUDA 12.1, NVIDIA Driver 530.30.02, default BIOS settings on both):
Assembly #1
GPU: Palit GeForce RTX 4090 24 GB
CPU: AMD Ryzen 9 7900X 12 cores
Motherboard: GIGABYTE X670 GAMING X AX
RAM: Kingston FURY Beast (16 GB x 2) DDR5 6000 MHz

Assembly #2
GPU: Palit GeForce RTX 4090 24 GB
CPU: AMD Ryzen 9 7950X 16 cores
Motherboard: MSI PRO X670-P WIFI RTL
RAM: Kingston FURY Beast (16 GB x 4) DDR5 5600 MHz

gmx --version output (similar for both assemblies)

                          :-) GROMACS - gmx, 2023 (-:

Executable:   /usr/local/gromacs-2023/bin/gmx
Data prefix:  /usr/local/gromacs-2023
Working dir:  /media/data/Egor/m1_new/prod_re
Command line:
  gmx --version

GROMACS version:    2023
Precision:          mixed
Memory model:       64 bit
MPI library:        thread_mpi
OpenMP support:     enabled (GMX_OPENMP_MAX_THREADS = 128)
GPU support:        CUDA
NB cluster size:    8
SIMD instructions:  AVX2_256
CPU FFT library:    fftw-3.3.8-sse2-avx-avx2-avx2_128
GPU FFT library:    cuFFT
Multi-GPU FFT:      none
RDTSCP usage:       enabled
TNG support:        enabled
Hwloc support:      disabled
Tracing support:    disabled
C compiler:         /usr/bin/cc GNU 11.3.0
C compiler flags:   -fexcess-precision=fast -funroll-all-loops -mavx2 -mfma -Wno-missing-field-initializers -O3 -DNDEBUG
C++ compiler:       /usr/bin/c++ GNU 11.3.0
C++ compiler flags: -fexcess-precision=fast -funroll-all-loops -mavx2 -mfma -Wno-missing-field-initializers -Wno-cast-function-type-strict -fopenmp -O3 -DNDEBUG
BLAS library:       
LAPACK library:     
CUDA compiler:      /usr/local/cuda-12.1/bin/nvcc nvcc: NVIDIA (R) Cuda compiler driver;Copyright (c) 2005-2023 NVIDIA Corporation;Built on Tue_Feb__7_19:32:13_PST_2023;Cuda compilation tools, release 12.1, V12.1.66;Build cuda_12.1.r12.1/compiler.32415258_0
CUDA compiler flags:-std=c++17;--generate-code=arch=compute_50,code=sm_50;--generate-code=arch=compute_52,code=sm_52;--generate-code=arch=compute_60,code=sm_60;--generate-code=arch=compute_61,code=sm_61;--generate-code=arch=compute_70,code=sm_70;--generate-code=arch=compute_75,code=sm_75;--generate-code=arch=compute_80,code=sm_80;--generate-code=arch=compute_86,code=sm_86;--generate-code=arch=compute_89,code=sm_89;--generate-code=arch=compute_90,code=sm_90;-Wno-deprecated-gpu-targets;--generate-code=arch=compute_53,code=sm_53;--generate-code=arch=compute_80,code=sm_80;-use_fast_math;-Xptxas;-warn-double-usage;-Xptxas;-Werror;-D_FORCE_INLINES;-fexcess-precision=fast -funroll-all-loops -mavx2 -mfma -Wno-missing-field-initializers -Wno-cast-function-type-strict -fopenmp -O3 -DNDEBUG
CUDA driver:        12.10
CUDA runtime:       12.10

Benchmarks

  1. Unigine Valley - OpenGL (UNIGINE Benchmarks)
    Assembly #1: FPS 277.7 (min 71.8, max 517.2), score 11620
    Assembly #2: FPS 263.5 (min 92.9, max 507.3), score 11025

  2. mixbench - CUDA, double precision (GitHub: ekondis/mixbench)
    Assembly #1: 1085.85 GFLOPS
    Assembly #2: 917.73 GFLOPS

GROMACS tests
Command: gmx mdrun -deffnm prod_re -v -nsteps -1 -nb gpu -bonded gpu -update gpu -ntomp 12

  1. ~30000 atoms system
    Assembly #1: ~900 ns/day
    Assembly #2: ~1500 ns/day

  2. ~100000 atoms system
    Assembly #1: ~170 ns/day
    Assembly #2: ~360 ns/day

As you can see, in the general benchmarks both assemblies show roughly similar performance (#1 is even slightly better), but in the GROMACS tests #2 runs 1.7-2.1 times faster than #1. We also have an older assembly with an RTX 3080 Ti that reaches ~190 ns/day on the ~100,000-atom system, which is still faster than #1.

Thanks in advance!

Best,
Egor

Hi Egor,

Can you please post complete log files of the two runs you are comparing?

Thanks,
Szilárd

assembly_1.log (10.9 KB)
assembly_2.log (29.2 KB)

Dear Szilard,

Please see the two log files attached. Thank you for your interest in my problem.

Best,
Egor

Dear Egor,

There is something clearly peculiar about your runs, but I can’t say what based on these logs. I have spotted some differences: one run is a continuation and uses thread pinning, while the other is not a continuation, does not pin threads, and ends up with slightly different settings (cutoffs, table sizes, etc.). I assume the two simulation systems are slightly different?

What I’d recommend as a first step is to try to isolate the issue by executing two short benchmark runs on the two machines with identical inputs and command lines, e.g.:
gmx mdrun -nsteps 100000 -ntmpi 1 -ntomp 12 -pin on -nb gpu -pme gpu -update gpu -bonded gpu
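
For example, you could generate a single .tpr and reuse it unchanged on both machines, along these lines (the file names below are only placeholders; use whatever matches your system):
gmx grompp -f prod.mdp -c conf.gro -p topol.top -o bench.tpr
gmx mdrun -s bench.tpr -nsteps 100000 -resetstep 50000 -noconfout -ntmpi 1 -ntomp 12 -pin on -nb gpu -pme gpu -update gpu -bonded gpu
The -resetstep 50000 and -noconfout options just reset the performance counters halfway through the run and skip writing the final configuration, so the reported ns/day reflects steady-state throughput rather than startup overhead.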

This may still not explain the difference, since it looks like the GPU execution is what is slower on “assembly 1”, and, unlike the timing details of the CPU execution printed at the end of the log, such information is not available for the GPU computation.

Therefore, it could be helpful in diagnosing the issue if you run the above benchmarks prefixed with:
CUDA_LAUNCH_BLOCKING=1 nsys profile -c cudaProfilerApi --stats=true gmx mdrun ...
Here nsys is a profiling tool which should be part of your CUDA installation. This will collect GPU performance metrics at runtime and also print a summary of them.
If you manage to run this on the two machines, please post here the “gpukernsum” table printed after the completion of mdrun.
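
In case that summary scrolls by or you want to regenerate it later, the same per-kernel table can also be extracted from the report file that nsys saves (by default something like report1.nsys-rep; note that recent nsys versions name this report cuda_gpu_kern_sum rather than gpukernsum):
nsys stats --report gpukernsum report1.nsys-rep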

Cheers,
Szilárd