GROMACS version: 2023
GROMACS modification: No
Dear community!
I have two slightly different PC assemblies, both based on an RTX 4090 GPU, that differ in GROMACS performance by more than a factor of two. I have no idea what the reason might be, so I would be grateful for any advice.
Technical details (OS Ubuntu 22.04 LTS, CUDA 12.1, NVIDIA Driver 530.30.02, default BIOS settings on both):
Assembly #1
GPU: Palit GeForce RTX 4090 24 GB
CPU: AMD Ryzen 9 7900X 12 cores
Motherboard: GIGABYTE X670 GAMING X AX
RAM: Kingston FURY Beast (16 GB x 2) DDR5 6000 MHz
Assembly #2
GPU: Palit GeForce RTX 4090 24 GB
CPU: AMD Ryzen 9 7950X 16 cores
Motherboard: MSI PRO X670-P WIFI RTL
RAM: Kingston FURY Beast (16 GB x 4) DDR5 5600 MHz
gmx --version output (similar for both assemblies)
:-) GROMACS - gmx, 2023 (-:
Executable: /usr/local/gromacs-2023/bin/gmx
Data prefix: /usr/local/gromacs-2023
Working dir: /media/data/Egor/m1_new/prod_re
Command line:
gmx --version
GROMACS version: 2023
Precision: mixed
Memory model: 64 bit
MPI library: thread_mpi
OpenMP support: enabled (GMX_OPENMP_MAX_THREADS = 128)
GPU support: CUDA
NB cluster size: 8
SIMD instructions: AVX2_256
CPU FFT library: fftw-3.3.8-sse2-avx-avx2-avx2_128
GPU FFT library: cuFFT
Multi-GPU FFT: none
RDTSCP usage: enabled
TNG support: enabled
Hwloc support: disabled
Tracing support: disabled
C compiler: /usr/bin/cc GNU 11.3.0
C compiler flags: -fexcess-precision=fast -funroll-all-loops -mavx2 -mfma -Wno-missing-field-initializers -O3 -DNDEBUG
C++ compiler: /usr/bin/c++ GNU 11.3.0
C++ compiler flags: -fexcess-precision=fast -funroll-all-loops -mavx2 -mfma -Wno-missing-field-initializers -Wno-cast-function-type-strict -fopenmp -O3 -DNDEBUG
BLAS library:
LAPACK library:
CUDA compiler: /usr/local/cuda-12.1/bin/nvcc nvcc: NVIDIA (R) Cuda compiler driver;Copyright (c) 2005-2023 NVIDIA Corporation;Built on Tue_Feb__7_19:32:13_PST_2023;Cuda compilation tools, release 12.1, V12.1.66;Build cuda_12.1.r12.1/compiler.32415258_0
CUDA compiler flags:-std=c++17;--generate-code=arch=compute_50,code=sm_50;--generate-code=arch=compute_52,code=sm_52;--generate-code=arch=compute_60,code=sm_60;--generate-code=arch=compute_61,code=sm_61;--generate-code=arch=compute_70,code=sm_70;--generate-code=arch=compute_75,code=sm_75;--generate-code=arch=compute_80,code=sm_80;--generate-code=arch=compute_86,code=sm_86;--generate-code=arch=compute_89,code=sm_89;--generate-code=arch=compute_90,code=sm_90;-Wno-deprecated-gpu-targets;--generate-code=arch=compute_53,code=sm_53;--generate-code=arch=compute_80,code=sm_80;-use_fast_math;-Xptxas;-warn-double-usage;-Xptxas;-Werror;-D_FORCE_INLINES;-fexcess-precision=fast -funroll-all-loops -mavx2 -mfma -Wno-missing-field-initializers -Wno-cast-function-type-strict -fopenmp -O3 -DNDEBUG
CUDA driver: 12.10
CUDA runtime: 12.10
Benchmarks
- Unigine Valley - OpenCL (UNIGINE Benchmarks)
  Assembly #1: FPS 277.7 (min 71.8, max 517.2), score 11620
  Assembly #2: FPS 263.5 (min 92.9, max 507.3), score 11025
- mixbench - CUDA, double precision (GitHub - ekondis/mixbench: A GPU benchmark tool for evaluating GPUs and CPUs on mixed operational intensity kernels (CUDA, OpenCL, HIP, SYCL, OpenMP))
  Assembly #1: 1085.85 GFLOPS
  Assembly #2: 917.73 GFLOPS
GROMACS tests
Command: gmx mdrun -deffnm prod_re -v -nsteps -1 -nb gpu -bonded gpu -update gpu -ntomp 12
- ~30000 atoms system
  Assembly #1: ~900 ns/day
  Assembly #2: ~1500 ns/day
- ~100000 atoms system
  Assembly #1: ~170 ns/day
  Assembly #2: ~360 ns/day
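For reference, the command above does not pin threads or reset the timers; a variant along the following lines (all flags are standard mdrun options, but the added -pme gpu, -ntmpi 1 and the step count are assumptions for illustration rather than what was actually run) might make the two machines easier to compare:

# Sketch of a more controlled benchmark run, assuming the same prod_re input:
# -resethway restarts the timers halfway through so startup costs are excluded,
# -noconfout skips writing the final configuration,
# -pin on fixes threads to cores; -pme gpu / -ntmpi 1 make explicit what mdrun
# would normally choose by default with this offload setup.
gmx mdrun -deffnm prod_re -nsteps 50000 -resethway -noconfout \
          -nb gpu -pme gpu -bonded gpu -update gpu \
          -ntmpi 1 -ntomp 12 -pin on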
As you can see, in the general-purpose benchmarks both assemblies show roughly the same performance (#1 is even slightly better), yet in the GROMACS tests #2 runs 1.7-2.1 times faster than #1. We also have an older assembly with an RTX 3080 Ti that reaches ~190 ns/day on the 100000-atom system, which is still faster than #1.
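In case the timing breakdowns are useful, they can be pulled from the mdrun logs (prod_re.log follows from -deffnm prod_re above; the patterns below are headings GROMACS prints in every log):

# Cycle/time accounting table: shows how much wall time each task
# (nonbonded, PME, update, waiting for the GPU, ...) consumed.
grep -A 30 "C Y C L E" prod_re.log
# Final throughput line in ns/day and hours/ns.
grep "Performance:" prod_re.log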
Thanks in advance!
Best,
Egor