GROMACS 2018.8 PERFORMANCE ON GPUs

GROMACS version: 2018.8

Hello Everyone!

I have installed GROMACS 2018.8 on two different workstations. The workstation with an NVIDIA Quadro RTX 5000 (3072 CUDA cores) gives better performance (approx. 350 ns/day for lysozyme) than the one with an NVIDIA Quadro RTX 6000 (4608 CUDA cores, approx. 200 ns/day). Please see the log files for both below. It seems something went wrong during installation. Please help.


Log for the NVIDIA Quadro RTX 6000:
GROMACS is free software; you can redistribute it and/or modify it
under the terms of the GNU Lesser General Public License
as published by the Free Software Foundation; either version 2.1
of the License, or (at your option) any later version.

GROMACS: gmx mdrun, version 2018.8
Executable: /usr/local/gromacs-2018.8/bin/gmx
Data prefix: /usr/local/gromacs-2018.8
Working dir: /home/pglab-6000/Desktop/new
Command line:
gmx mdrun -v -deffnm md_0_1

GROMACS version: 2018.8
Precision: single
Memory model: 64 bit
MPI library: thread_mpi
OpenMP support: enabled (GMX_OPENMP_MAX_THREADS = 64)
GPU support: CUDA
SIMD instructions: AVX_512
FFT library: fftw-3.3.8-sse2-avx
RDTSCP usage: enabled
TNG support: enabled
Hwloc support: disabled
Tracing support: disabled
Built on: 2020-12-11 17:22:07
Built by: root@pglab6000-ThinkStation-P520 [CMAKE]
Build OS/arch: Linux 5.4.0-54-generic x86_64
Build CPU vendor: Unknown
Build CPU brand: Unknown
Build CPU family: 0 Model: 0 Stepping: 0
Build CPU features: Unknown
C compiler: /usr/bin/gcc-7 GNU 7.5.0
C compiler flags: -mavx512f -mfma -O3 -DNDEBUG -funroll-all-loops -fexcess-precision=fast
C++ compiler: /usr/bin/g++-7 GNU 7.5.0
C++ compiler flags: -mavx512f -mfma -std=c++11 -O3 -DNDEBUG -funroll-all-loops -fexcess-precision=fast
CUDA compiler: /usr/bin/nvcc nvcc: NVIDIA ® Cuda compiler driver;Copyright © 2005-2019 NVIDIA Corporation;Built on Sun_Jul_28_19:07:16_PDT_2019;Cuda compilation tools, release 10.1, V10.1.243
CUDA compiler flags:-gencode;arch=compute_30,code=sm_30;-gencode;arch=compute_35,code=sm_35;-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_52,code=sm_52;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=compute_75;-use_fast_math;-D_FORCE_INLINES;; ;-mavx512f;-mfma;-std=c++11;-O3;-DNDEBUG;-funroll-all-loops;-fexcess-precision=fast;
CUDA driver: 10.20
CUDA runtime: 10.10

Running on 1 node with total 10 cores, 20 logical cores, 1 compatible GPU
Hardware detected:
CPU info:
Vendor: Intel
Brand: Intel® Xeon® W-2155 CPU @ 3.30GHz
Family: 6 Model: 85 Stepping: 4
Features: aes apic avx avx2 avx512f avx512cd avx512bw avx512vl clfsh cmov cx8 cx16 f16c fma hle htt intel lahf mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd rdtscp rtm sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
Number of AVX-512 FMA units: 2
Hardware topology: Basic
Sockets, cores, and logical processors:
Socket 0: [ 0 10] [ 1 11] [ 2 12] [ 3 13] [ 4 14] [ 5 15] [ 6 16] [ 7 17] [ 8 18] [ 9 19]
GPU info:
Number of GPUs detected: 1
#0: NVIDIA Quadro RTX 6000, compute cap.: 7.5, ECC: no, stat: compatible

++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, E.
Lindahl
GROMACS: High performance molecular simulations through multi-level
parallelism from laptops to supercomputers
SoftwareX 1 (2015) pp. 19-25
-------- -------- --- Thank You --- -------- --------

Log for the NVIDIA Quadro RTX 5000:
GROMACS is free software; you can redistribute it and/or modify it
under the terms of the GNU Lesser General Public License
as published by the Free Software Foundation; either version 2.1
of the License, or (at your option) any later version.

GROMACS: gmx mdrun, version 2018.8
Executable: /usr/local/gromacs-2018.8/bin/gmx
Data prefix: /usr/local/gromacs-2018.8
Working dir: /home/pglab-5000/Desktop/Test_Simualtion
Command line:
gmx mdrun -v -deffnm md_0_1

GROMACS version: 2018.8
Precision: single
Memory model: 64 bit
MPI library: thread_mpi
OpenMP support: enabled (GMX_OPENMP_MAX_THREADS = 64)
GPU support: CUDA
SIMD instructions: AVX_512
FFT library: fftw-3.3.8-sse2-avx
RDTSCP usage: enabled
TNG support: enabled
Hwloc support: disabled
Tracing support: disabled
Built on: 2020-12-09 23:44:03
Built by: root@pglab5000-ThinkStation-P520 [CMAKE]
Build OS/arch: Linux 5.4.0-42-generic x86_64
Build CPU vendor: Unknown
Build CPU brand: Unknown
Build CPU family: 0 Model: 0 Stepping: 0
Build CPU features: Unknown
C compiler: /usr/bin/gcc-8 GNU 8.4.0
C compiler flags: -mavx512f -mfma -O3 -DNDEBUG -funroll-all-loops -fexcess-precision=fast
C++ compiler: /usr/bin/c++ GNU 9.3.0
C++ compiler flags: -mavx512f -mfma -std=c++11 -O3 -DNDEBUG -funroll-all-loops -fexcess-precision=fast
CUDA compiler: /usr/bin/nvcc nvcc: NVIDIA ® Cuda compiler driver;Copyright © 2005-2019 NVIDIA Corporation;Built on Sun_Jul_28_19:07:16_PDT_2019;Cuda compilation tools, release 10.1, V10.1.243
CUDA compiler flags:-gencode;arch=compute_30,code=sm_30;-gencode;arch=compute_35,code=sm_35;-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_52,code=sm_52;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=compute_75;-use_fast_math;-D_FORCE_INLINES;; ;-mavx512f;-mfma;-std=c++11;-O3;-DNDEBUG;-funroll-all-loops;-fexcess-precision=fast;
CUDA driver: 11.10
CUDA runtime: 10.10

Running on 1 node with total 10 cores, 20 logical cores, 1 compatible GPU
Hardware detected:
CPU info:
Vendor: Intel
Brand: Intel® Xeon® W-2155 CPU @ 3.30GHz
Family: 6 Model: 85 Stepping: 4
Features: aes apic avx avx2 avx512f avx512cd avx512bw avx512vl clfsh cmov cx8 cx16 f16c fma hle htt intel lahf mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd rdtscp rtm sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
Number of AVX-512 FMA units: 2
Hardware topology: Basic
Sockets, cores, and logical processors:
Socket 0: [ 0 10] [ 1 11] [ 2 12] [ 3 13] [ 4 14] [ 5 15] [ 6 16] [ 7 17] [ 8 18] [ 9 19]
GPU info:
Number of GPUs detected: 1
#0: NVIDIA Quadro RTX 5000, compute cap.: 7.5, ECC: no, stat: compatible

Hi. Your Quadro RTX 5000 setup was built with a more recent version of gcc and runs on a CUDA 11.1 driver rather than 10.2 (both builds use the CUDA 10.1 toolkit). However, I wouldn't expect those differences to affect performance that drastically, and everything else about your build seems fine.

Please post the runtime performance statistics from the end of the log files for both runs. There may be something there we can see.
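For readers who want to compare two mdrun log headers without reading them side by side, here is a minimal Python sketch. It assumes the header consists of simple "Key: value" lines, as in the logs above. The function names `parse_header` and `diff_headers` are made up for this illustration and are not part of GROMACS. The two excerpts are copied from the posted logs.

```python
def parse_header(text):
    """Return {key: value} for every 'Key: value' line in a log header."""
    fields = {}
    for line in text.splitlines():
        # Split on the first colon only, so values may contain colons.
        key, sep, value = line.partition(":")
        if sep and key.strip():
            fields[key.strip()] = value.strip()
    return fields


def diff_headers(a, b):
    """Return {key: (value_a, value_b)} for keys whose values differ."""
    fa, fb = parse_header(a), parse_header(b)
    return {
        k: (fa.get(k), fb.get(k))
        for k in sorted(set(fa) | set(fb))
        if fa.get(k) != fb.get(k)
    }


# Excerpts copied from the two logs posted in this thread.
RTX6000 = """\
C compiler: /usr/bin/gcc-7 GNU 7.5.0
C++ compiler: /usr/bin/g++-7 GNU 7.5.0
SIMD instructions: AVX_512
CUDA driver: 10.20
CUDA runtime: 10.10
"""
RTX5000 = """\
C compiler: /usr/bin/gcc-8 GNU 8.4.0
C++ compiler: /usr/bin/c++ GNU 9.3.0
SIMD instructions: AVX_512
CUDA driver: 11.10
CUDA runtime: 10.10
"""

differences = diff_headers(RTX6000, RTX5000)
for key, (v6000, v5000) in differences.items():
    print(f"{key}: {v6000!r} (RTX 6000) vs {v5000!r} (RTX 5000)")
```

On these excerpts only the compilers and the CUDA driver differ. The SIMD level and the CUDA runtime are identical on both machines.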

Hi!
From your logs, I see a few differences that could favor the GROMACS build on the RTX 5000 machine:

Different C/C++ compilers and CUDA driver.
These can affect performance to some extent.

Are there any other differences, for example in the number of water molecules?

Best
Xavier

Thank you Kevin.
I got the same reduced performance from the RTX 6000 even after switching to CUDA 11 and GCC 8.


 R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G

On 1 MPI rank, each using 10 OpenMP threads

Computing:              Num     Num       Call   Wall time   Giga-Cycles
                      Ranks Threads      Count         (s)     total sum      %

Neighbor search           1      10     500001     979.840     32452.306    2.5
Launch GPU ops.           1      10  100000002    2499.516     82783.936    6.5
Force                     1      10   50000001    1957.389     64828.719    5.1
Wait PME GPU gather       1      10   50000001    7582.033    251116.868   19.7
Reduce GPU PME F          1      10   50000001     279.883      9269.726    0.7
Wait GPU NB local         1      10   50000001   19279.318    638530.844   50.1
NB X/F buffer ops.        1      10   99500001    1653.156     54752.504    4.3
Write traj.               1      10      10041      21.452       710.492    0.1
Update                    1      10   50000001     976.390     32338.017    2.5
Constraints               1      10   50000001    1857.287     61513.343    4.8
Rest                                              1414.804     46858.306    3.7

Total                                            38501.070   1275155.060  100.0

           Core t (s)   Wall t (s)        (%)
   Time:   385010.696    38501.070     1000.0
                     10h41:41
             (ns/day)    (hour/ns)

Performance:  224.409        0.107
Finished mdrun on rank 0 Sat Dec 12 10:23:51 2020
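To make the accounting table easier to read, here is a small Python sketch that adds up the rows in which the CPU sits waiting for the GPU. The rows are copied from the log above. The regular expression and the helper name `parse_accounting` are assumptions made for this illustration. The sketch also checks the reported ns/day against the wall time.

```python
import re

# A data row: name, three integers (ranks, threads, calls), then
# wall time (s), giga-cycles and percentage. "Rest" and "Total" lack
# the integer columns and are deliberately not matched.
ROW = re.compile(
    r"^\s*(?P<name>.+?)\s+(?P<ranks>\d+)\s+(?P<threads>\d+)\s+(?P<calls>\d+)"
    r"\s+(?P<wall>\d+\.\d+)\s+(?P<cycles>\d+\.\d+)\s+(?P<pct>\d+\.\d+)\s*$"
)

# Rows copied from the RTX 6000 log posted in this thread.
LOG = """\
Neighbor search 1 10 500001 979.840 32452.306 2.5
Launch GPU ops. 1 10 100000002 2499.516 82783.936 6.5
Force 1 10 50000001 1957.389 64828.719 5.1
Wait PME GPU gather 1 10 50000001 7582.033 251116.868 19.7
Reduce GPU PME F 1 10 50000001 279.883 9269.726 0.7
Wait GPU NB local 1 10 50000001 19279.318 638530.844 50.1
NB X/F buffer ops. 1 10 99500001 1653.156 54752.504 4.3
Write traj. 1 10 10041 21.452 710.492 0.1
Update 1 10 50000001 976.390 32338.017 2.5
Constraints 1 10 50000001 1857.287 61513.343 4.8
Rest 1414.804 46858.306 3.7
Total 38501.070 1275155.060 100.0
"""


def parse_accounting(text):
    """Return {row name: (wall seconds, percent)} for each data row."""
    rows = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            rows[m["name"]] = (float(m["wall"]), float(m["pct"]))
    return rows


rows = parse_accounting(LOG)
total_wall = 38501.070  # "Total" wall time from the log, in seconds
wait_wall = sum(wall for name, (wall, _) in rows.items()
                if name.startswith("Wait"))
wait_fraction = wait_wall / total_wall
print(f"CPU waiting on GPU: {wait_wall:.0f} s of {total_wall:.0f} s "
      f"({100 * wait_fraction:.1f}%)")

# Consistency check: ns/day times wall time in days gives simulated length.
ns_per_day = 224.409
simulated_ns = ns_per_day * total_wall / 86400
print(f"Simulated length implied by the log: {simulated_ns:.1f} ns")
```

The two "Wait" rows together account for roughly 70% of the wall time, so the CPU is idle most of the time while the GPU side finishes. The table alone does not say whether the GPU kernels or the host-device transfers are slow.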

Thanks!
No, the system (protein and water molecules) and all mdp parameters are the same for both.

I tried using the same C/C++ compiler and CUDA driver for the RTX 6000, but got the same reduced performance.

Please share full log files rather than just excerpts.

Compilers alone can't cause such a difference, unless of course something is broken.
It seems to me that your system is too small to be parallelized across all of the card's cores. The Quadro RTX 5000 has a higher core clock, so it is faster.
How many atoms do you have? If it is about 10,000-15,000, that is not enough.
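To answer the atom-count question quickly, the number of atoms can be read from the second line of a .gro structure file. Here is a minimal Python sketch. The three-atom sample is a made-up example, not data from this thread. The 15,000-atom threshold is only the rule of thumb quoted in the reply above, not a GROMACS limit.

```python
def gro_atom_count(gro_text):
    """Return the atom count stored on the second line of a .gro file."""
    lines = gro_text.splitlines()
    return int(lines[1].strip())


def is_small(n_atoms, threshold=15000):
    """Apply the rule of thumb from the reply above (threshold assumed)."""
    return n_atoms <= threshold


# Hypothetical three-atom .gro file, for illustration only.
SAMPLE = """\
Tiny hypothetical example
    3
    1SOL     OW    1   0.126   1.624   1.679
    1SOL    HW1    2   0.190   1.661   1.747
    1SOL    HW2    3   0.177   1.568   1.613
   1.86206   1.86206   1.86206
"""

n_atoms = gro_atom_count(SAMPLE)
print(f"{n_atoms} atoms, small system: {is_small(n_atoms)}")

# With a real file (file name assumed from the -deffnm value used above):
# with open("md_0_1.gro") as fh:
#     print(gro_atom_count(fh.read()))
```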

Thank you, everyone, for helping me out.
The problem was a hardware issue.
The NVIDIA graphics card was reporting that it was inserted in a PCIe slot with a link width of x4, when it should have been x16. I moved the graphics card to the secondary PCIe slot, which has a link width of x16. Performance increased from 224 ns/day to around 415 ns/day for the same system (lysozyme in a 10 Å cubic box using the OPLS-AA force field).

Thank you, everyone!
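For anyone who hits the same problem, the negotiated PCIe link width can be checked with nvidia-smi. Below is a minimal Python sketch that parses the CSV output of `nvidia-smi --query-gpu=name,pcie.link.gen.current,pcie.link.width.current --format=csv,noheader`. The helper names are made up for this illustration. The expected width of 16 is an assumption for cards like these. The sample line is a hypothetical output showing the x4 state described above, not a capture from the original machine.

```python
import subprocess

QUERY = "name,pcie.link.gen.current,pcie.link.width.current"


def parse_link_report(csv_text):
    """Parse 'name, gen, width' lines into a list of dicts."""
    gpus = []
    for line in csv_text.strip().splitlines():
        # Split from the right so a comma in the name cannot break parsing.
        name, gen, width = (field.strip() for field in line.rsplit(",", 2))
        gpus.append({"name": name, "gen": int(gen), "width": int(width)})
    return gpus


def narrow_links(gpus, expected_width=16):
    """Return the GPUs whose current link is narrower than expected.

    expected_width=16 is an assumption; adjust it for your hardware.
    """
    return [gpu for gpu in gpus if gpu["width"] < expected_width]


def query_nvidia_smi():
    """Run nvidia-smi and return the parsed report (needs an NVIDIA driver)."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    return parse_link_report(result.stdout)


# Hypothetical output line illustrating the x4 link reported in this thread.
SAMPLE = "Quadro RTX 6000, 3, 4\n"

for gpu in narrow_links(parse_link_report(SAMPLE)):
    print(f"{gpu['name']}: PCIe gen {gpu['gen']} link is x{gpu['width']}, "
          f"expected x16")

# On a real machine, call: narrow_links(query_nvidia_smi())
```

The check compares against a fixed expected width instead of the maximum width nvidia-smi reports. The reported maximum can reflect what the slot supports, so a card sitting in a narrow slot may not stand out from it.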