Optimizing GPU performance for GROMACS?

GROMACS version: 2018

Hi,
The system I am trying to simulate has about 600,000 atoms, and I want to simulate it for 1 µs.
I am running on a supercomputer with Tesla K80 GPUs, using 1 node and a total of 28 cores, but the performance is poor: only about 10 ns/day.

               Core t (s)   Wall t (s)        (%)
       Time:   221835.265     7922.688     2800.0
                         2h12:02
                 (ns/day)    (hour/ns)
Performance:       10.905        2.201

I noticed this note in the log file:
NOTE: GROMACS was configured without NVML support hence it can not exploit
application clocks of the detected Tesla K80 GPU to improve performance.
Recompile with the NVML library (compatible with the driver used) or set application clocks manually.

According to the log file, PME mesh takes about 70% of the computation time.

What are the ways to optimize performance and speed up the simulation?

Thank you all

Hi,

Please post the contents of the beginning and end of the log file (you can skip the actual per-step data).

The NVML message shouldn’t be a huge deal. PME definitely should not take that long.

Sure, thank you very much.

gmx_mpi mdrun -v -deffnm 6pij_equilnpt -s npt.tpr

GROMACS version: 2018
Precision: single
Memory model: 64 bit
MPI library: MPI
OpenMP support: enabled (GMX_OPENMP_MAX_THREADS = 256)
GPU support: CUDA
SIMD instructions: AVX_256
FFT library: fftw-3.3.3-sse2
RDTSCP usage: enabled
TNG support: enabled
Hwloc support: disabled
Tracing support: disabled
Built on: 2018-02-20 21:19:31
Built by: mmadrid@gpu048.pvt.bridges.psc.edu [CMAKE]
Build OS/arch: Linux 3.10.0-693.11.6.el7.x86_64 x86_64
Build CPU vendor: Intel
Build CPU brand: Intel® Xeon® CPU E5-2683 v4 @ 2.10GHz
Build CPU family: 6 Model: 79 Stepping: 1
Build CPU features: aes apic avx avx2 clfsh cmov cx8 cx16 f16c fma hle htt intel lahf mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd rdtscp rtm sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
C compiler: /usr/lib64/ccache/cc GNU 4.8.5
C compiler flags: -mavx -O3 -DNDEBUG -funroll-all-loops -fexcess-precision=fast
C++ compiler: /usr/lib64/ccache/c++ GNU 4.8.5
C++ compiler flags: -mavx -std=c++11 -O3 -DNDEBUG -funroll-all-loops -fexcess-precision=fast
CUDA compiler: /opt/packages/cuda/9.0RC/bin/nvcc nvcc: NVIDIA ® Cuda compiler driver;Copyright © 2005-2017 NVIDIA Corporation;Built on Mon_Jun_26_16:13:28_CDT_2017;Cuda compilation tools, release 9.0, V9.0.102
CUDA compiler flags:-gencode;arch=compute_30,code=sm_30;-gencode;arch=compute_35,code=sm_35;-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_52,code=sm_52;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_70,code=compute_70;-use_fast_math;;; ;-mavx;-std=c++11;-O3;-DNDEBUG;-funroll-all-loops;-fexcess-precision=fast;
CUDA driver: 10.20
CUDA runtime: 9.0

Running on 1 node with total 28 cores, 28 logical cores, 4 compatible GPUs
Hardware detected on host gpu013.pvt.bridges.psc.edu (the node of MPI rank 0):
CPU info:
Vendor: Intel
Brand: Intel® Xeon® CPU E5-2695 v3 @ 2.30GHz
Family: 6 Model: 63 Stepping: 2
Features: aes apic avx avx2 clfsh cmov cx8 cx16 f16c fma htt intel lahf mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd rdtscp sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
Hardware topology: Basic
Sockets, cores, and logical processors:
Socket 0: [ 0] [ 14] [ 1] [ 15] [ 2] [ 16] [ 3] [ 17] [ 4] [ 18] [ 5] [ 19] [ 6] [ 20]
Socket 1: [ 7] [ 21] [ 8] [ 22] [ 9] [ 23] [ 10] [ 24] [ 11] [ 25] [ 12] [ 26] [ 13] [ 27]
GPU info:
Number of GPUs detected: 4
#0: NVIDIA Tesla K80, compute cap.: 3.7, ECC: yes, stat: compatible
#1: NVIDIA Tesla K80, compute cap.: 3.7, ECC: yes, stat: compatible
#2: NVIDIA Tesla K80, compute cap.: 3.7, ECC: yes, stat: compatible
#3: NVIDIA Tesla K80, compute cap.: 3.7, ECC: yes, stat: compatible

Highest SIMD level requested by all nodes in run: AVX2_256
SIMD instructions selected at compile time: AVX_256
This program was compiled for different hardware than you are running on,
which could influence performance.

++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, E.
Lindahl
GROMACS: High performance molecular simulations through multi-level
parallelism from laptops to supercomputers
SoftwareX 1 (2015) pp. 19-25
-------- -------- — Thank You — -------- --------

++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
S. Páll, M. J. Abraham, C. Kutzner, B. Hess, E. Lindahl
Tackling Exascale Software Challenges in Molecular Dynamics Simulations with
GROMACS
In S. Markidis & E. Laure (Eds.), Solving Software Challenges for Exascale 8759 (2015) pp. 3-27
-------- -------- — Thank You — -------- --------

++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
S. Pronk, S. Páll, R. Schulz, P. Larsson, P. Bjelkmar, R. Apostolov, M. R.
Shirts, J. C. Smith, P. M. Kasson, D. van der Spoel, B. Hess, and E. Lindahl
GROMACS 4.5: a high-throughput and highly parallel open source molecular
simulation toolkit
Bioinformatics 29 (2013) pp. 845-54
-------- -------- — Thank You — -------- --------

++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
B. Hess and C. Kutzner and D. van der Spoel and E. Lindahl
GROMACS 4: Algorithms for highly efficient, load-balanced, and scalable
molecular simulation
J. Chem. Theory Comput. 4 (2008) pp. 435-447
-------- -------- — Thank You — -------- --------

++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
D. van der Spoel, E. Lindahl, B. Hess, G. Groenhof, A. E. Mark and H. J. C.
Berendsen
GROMACS: Fast, Flexible and Free
J. Comp. Chem. 26 (2005) pp. 1701-1719
-------- -------- — Thank You — -------- --------

++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
E. Lindahl and B. Hess and D. van der Spoel
GROMACS 3.0: A package for molecular simulation and trajectory analysis
J. Mol. Mod. 7 (2001) pp. 306-317
-------- -------- — Thank You — -------- --------

++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
H. J. C. Berendsen, D. van der Spoel and R. van Drunen
GROMACS: A message-passing parallel molecular dynamics implementation
Comp. Phys. Comm. 91 (1995) pp. 43-56
-------- -------- — Thank You — -------- --------

Input Parameters:
integrator = md
tinit = 0
dt = 0.002
nsteps = 500000
init-step = 0
simulation-part = 1
comm-mode = Linear
nstcomm = 100
bd-fric = 0
ld-seed = -1404411713
emtol = 10
emstep = 0.01
niter = 20
fcstep = 0
nstcgsteep = 1000
nbfgscorr = 10
rtpi = 0.05
nstxout = 1000
nstvout = 1000
nstfout = 1000
nstlog = 1000
nstcalcenergy = 100
nstenergy = 1000
nstxout-compressed = 1000
compressed-x-precision = 1000
cutoff-scheme = Verlet
nstlist = 10
ns-type = Grid
pbc = xyz
periodic-molecules = false
verlet-buffer-tolerance = 0.005
rlist = 1.2
coulombtype = PME
coulomb-modifier = Potential-shift
rcoulomb-switch = 0
rcoulomb = 1.2
epsilon-r = 1
epsilon-rf = inf
vdw-type = Cut-off
vdw-modifier = Potential-shift
rvdw-switch = 0.8
rvdw = 1.2
DispCorr = EnerPres
table-extension = 1
fourierspacing = 0.12
fourier-nx = 128
fourier-ny = 160
fourier-nz = 192
pme-order = 4
ewald-rtol = 1e-05
ewald-rtol-lj = 0.001
lj-pme-comb-rule = Geometric
ewald-geometry = 0
epsilon-surface = 0
implicit-solvent = No
gb-algorithm = Still
nstgbradii = 1
rgbradii = 1
gb-epsilon-solvent = 80
gb-saltconc = 0
gb-obc-alpha = 1
gb-obc-beta = 0.8
gb-obc-gamma = 4.85
gb-dielectric-offset = 0.009
sa-algorithm = Ace-approximation
sa-surface-tension = 2.05016
tcoupl = Nose-Hoover
nsttcouple = 10
nh-chain-length = 1
print-nose-hoover-chain-variables = false
pcoupl = Parrinello-Rahman
pcoupltype = Isotropic
nstpcouple = 10
tau-p = 2.5
compressibility (3x3):
compressibility[ 0]={ 4.50000e-05, 0.00000e+00, 0.00000e+00}
compressibility[ 1]={ 0.00000e+00, 4.50000e-05, 0.00000e+00}
compressibility[ 2]={ 0.00000e+00, 0.00000e+00, 4.50000e-05}
ref-p (3x3):
ref-p[ 0]={ 1.00000e+00, 0.00000e+00, 0.00000e+00}
ref-p[ 1]={ 0.00000e+00, 1.00000e+00, 0.00000e+00}
ref-p[ 2]={ 0.00000e+00, 0.00000e+00, 1.00000e+00}
refcoord-scaling = No
posres-com (3):
posres-com[0]= 0.00000e+00
posres-com[1]= 0.00000e+00
posres-com[2]= 0.00000e+00
posres-comB (3):
posres-comB[0]= 0.00000e+00
posres-comB[1]= 0.00000e+00
posres-comB[2]= 0.00000e+00
QMMM = false
QMconstraints = 0
QMMMscheme = 0
MMChargeScaleFactor = 1
qm-opts:
ngQM = 0
constraint-algorithm = Lincs
continuation = true
Shake-SOR = false
shake-tol = 0.0001
lincs-order = 4
lincs-iter = 1
lincs-warnangle = 30
nwall = 0
wall-type = 9-3
wall-r-linpot = -1
wall-atomtype[0] = -1
wall-atomtype[1] = -1
wall-density[0] = 0
wall-density[1] = 0
wall-ewald-zfac = 3
pull = false
awh = false
rotation = false
interactiveMD = false
disre = No
disre-weighting = Conservative
disre-mixed = false
dr-fc = 1000
dr-tau = 0
nstdisreout = 100
orire-fc = 0
orire-tau = 0
nstorireout = 100
free-energy = no
cos-acceleration = 0
deform (3x3):
deform[ 0]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
deform[ 1]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
deform[ 2]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
simulated-tempering = false
swapcoords = no
userint1 = 0
userint2 = 0
userint3 = 0
userint4 = 0
userreal1 = 0
userreal2 = 0
userreal3 = 0
userreal4 = 0
applied-forces:
electric-field:
x:
E0 = 0
omega = 0
t0 = 0
sigma = 0
y:
E0 = 0
omega = 0
t0 = 0
sigma = 0
z:
E0 = 0
omega = 0
t0 = 0
sigma = 0
grpopts:
nrdf: 147287 3010.99 5098.98 959295
ref-t: 300 300 300 300
tau-t: 1 1 1 1
annealing: No No No No
annealing-npoints: 0 0 0 0
acc: 0 0 0
nfreeze: N N N
energygrp-flags[ 0]: 0

Changing nstlist from 10 to 100, rlist from 1.2 to 1.337

Initializing Domain Decomposition on 4 ranks
Dynamic load balancing: locked
Initial maximum inter charge-group distances:
two-body bonded interactions: 0.446 nm, LJ-14, atoms 36965 36972
multi-body bonded interactions: 0.446 nm, Proper Dih., atoms 36965 36972
Minimum cell size due to bonded interactions: 0.490 nm
Maximum distance for 5 constraints, at 120 deg. angles, all-trans: 0.218 nm
Estimated maximum distance required for P-LINCS: 0.218 nm
Using 0 separate PME ranks
Scaling the initial minimum size with 1/0.8 (option -dds) = 1.25
Optimizing the DD grid for 4 cells with a minimum initial size of 0.613 nm
The maximum allowed number of cells is: X 24 Y 29 Z 33
Domain decomposition grid 1 x 4 x 1, separate PME ranks 0
PME domain decomposition: 1 x 4 x 1
Domain decomposition rank 0, coordinates 0 0 0

The initial number of communication pulses is: Y 1
The initial domain decomposition cell size is: Y 4.47 nm

The maximum allowed distance for charge groups involved in interactions is:
non-bonded interactions 1.337 nm
(the following are initial values, they could change due to box deformation)
two-body bonded interactions (-rdd) 1.337 nm
multi-body bonded interactions (-rdd) 1.337 nm
atoms separated by up to 5 constraints (-rcon) 4.465 nm

When dynamic load balancing gets turned on, these settings will change to:
The maximum number of communication pulses is: Y 1
The minimum size for domain decomposition cells is 1.337 nm
The requested allowed shrink of DD cells (option -dds) is: 0.80
The allowed shrink of domain decomposition cells is: Y 0.30
The maximum allowed distance for charge groups involved in interactions is:
non-bonded interactions 1.337 nm
two-body bonded interactions (-rdd) 1.337 nm
multi-body bonded interactions (-rdd) 1.337 nm
atoms separated by up to 5 constraints (-rcon) 1.337 nm

Using 4 MPI processes
Using 7 OpenMP threads per MPI process

On host gpu013.pvt.bridges.psc.edu 4 GPUs auto-selected for this run.
Mapping of GPU IDs to the 4 GPU tasks in the 4 ranks on this node:
PP:0,PP:1,PP:2,PP:3

NOTE: GROMACS was configured without NVML support hence it can not exploit
application clocks of the detected Tesla K80 GPU to improve performance.
Recompile with the NVML library (compatible with the driver used) or set application clocks manually.

Non-default thread affinity set probably by the OpenMP library,
disabling internal thread affinity
System total charge: 0.000
Will do PME sum in reciprocal space for electrostatic interactions.

++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
U. Essmann, L. Perera, M. L. Berkowitz, T. Darden, H. Lee and L. G. Pedersen
A smooth particle mesh Ewald method
J. Chem. Phys. 103 (1995) pp. 8577-8592
-------- -------- — Thank You — -------- --------

Using a Gaussian width (1/beta) of 0.384195 nm for Ewald
Potential shift: LJ r^-12: -1.122e-01 r^-6: -3.349e-01, Ewald -8.333e-06
Initialized non-bonded Ewald correction tables, spacing: 1.02e-03 size: 1176

Long Range LJ corr.: 3.3098e-04
Generated table with 1168 data points for Ewald.
Tabscale = 500 points/nm
Generated table with 1168 data points for LJ6.
Tabscale = 500 points/nm
Generated table with 1168 data points for LJ12.
Tabscale = 500 points/nm
Generated table with 1168 data points for 1-4 COUL.
Tabscale = 500 points/nm
Generated table with 1168 data points for 1-4 LJ6.
Tabscale = 500 points/nm
Generated table with 1168 data points for 1-4 LJ12.
Tabscale = 500 points/nm

Using GPU 8x8 nonbonded short-range kernels

Using a dual 8x4 pair-list setup updated with dynamic, rolling pruning:
outer list: updated every 100 steps, buffer 0.137 nm, rlist 1.337 nm
inner list: updated every 12 steps, buffer 0.002 nm, rlist 1.202 nm
At tolerance 0.005 kJ/mol/ps per atom, equivalent classical 1x1 list would be:
outer list: updated every 100 steps, buffer 0.290 nm, rlist 1.490 nm
inner list: updated every 12 steps, buffer 0.051 nm, rlist 1.251 nm

Using Lorentz-Berthelot Lennard-Jones combination rule

Initializing Parallel LINear Constraint Solver

++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
B. Hess
P-LINCS: A Parallel Linear Constraint Solver for molecular simulation
J. Chem. Theory Comput. 4 (2008) pp. 116-122
-------- -------- — Thank You — -------- --------

The number of constraints is 30135
There are inter charge-group constraints,
will communicate selected coordinates each lincs iteration

++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
S. Miyamoto and P. A. Kollman
SETTLE: An Analytical Version of the SHAKE and RATTLE Algorithms for Rigid
Water Models
J. Comp. Chem. 13 (1992) pp. 952-962
-------- -------- — Thank You — -------- --------

Linking all bonded interactions to atoms

Intra-simulation communication will occur every 10 steps.
Center of mass motion removal mode is Linear
We have the following groups for center of mass motion removal:
0: Protein
1: non-Protein
There are: 541456 Atoms
Atom distribution over 4 domains: av 135364 stddev 1229 min 134101 max 136761

NOTE: DLB will not turn on during the first phase of PME tuning

Started mdrun on rank 0 Tue Dec 8 10:12:41 2020
Step Time
0 0.00000

<====== ############### ==>
<==== A V E R A G E S ====>
<== ############### ======>

    Statistics over 500001 steps using 5001 frames

Energies (kJ/mol)
Bond Angle Proper Dih. Improper Dih. LJ-14
4.96873e+04 1.32300e+05 2.04227e+05 7.50529e+03 5.98631e+04
Coulomb-14 LJ (SR) Disper. corr. Coulomb (SR) Coul. recip.
5.75874e+05 8.38201e+05 -4.35894e+04 -8.78241e+06 2.43953e+04
Potential Kinetic En. Total Energy Conserved En. Temperature
-6.93394e+06 1.39019e+06 -5.54375e+06 -5.52014e+06 2.99997e+02
Pres. DC (bar) Pressure (bar) Constr. rmsd
-1.34316e+02 7.74791e-01 0.00000e+00

      Box-X          Box-Y          Box-Z
1.48049e+01    1.77294e+01    2.05447e+01

Total Virial (kJ/mol)
4.63374e+05 -1.12038e+01 1.10092e+02
-1.33525e+01 4.63136e+05 -2.05774e+02
1.19072e+02 -2.01882e+02 4.63314e+05

Pressure (bar)
-2.96153e-01 3.11122e-01 -2.61415e-01
3.24377e-01 1.43441e+00 1.14228e-01
-3.16707e-01 9.02453e-02 1.18612e+00

  T-Protein          T-DNA          T-RNA T-Water_and_ions
2.99993e+02    2.99907e+02    3.00034e+02    2.99997e+02


   P P   -   P M E   L O A D   B A L A N C I N G

PP/PME load balancing changed the cut-off and PME settings:
particle-particle PME
rcoulomb rlist grid spacing 1/beta
initial 1.200 nm 1.202 nm 128 160 192 0.117 nm 0.384 nm
final 1.243 nm 1.245 nm 120 144 168 0.124 nm 0.398 nm
cost-ratio 1.11 0.74
(note that these numbers concern only part of the total PP and PME load)

    M E G A - F L O P S   A C C O U N T I N G

NB=Group-cutoff nonbonded kernels NxN=N-by-N cluster Verlet kernels
RF=Reaction-Field VdW=Van der Waals QSTab=quadratic-spline table
W3=SPC/TIP3p W4=TIP4p (single or pairs)
V&F=Potential and force V=Potential only F=Force only

Computing: M-Number M-Flops % Flops

Pair Search distance check 342958.776080 3086628.985 0.0
NxN Ewald Elec. + LJ [F] 370479090.320064 24451619961.124 95.5
NxN Ewald Elec. + LJ [V&F] 3742960.944896 400496821.104 1.6
1,4 nonbonded interactions 81278.662557 7315079.630 0.0
Calc Weights 812185.624368 29238682.477 0.1
Spread Q Bspline 17326626.653184 34653253.306 0.1
Gather F Bspline 17326626.653184 103959759.919 0.4
3D-FFT 62300263.264306 498402106.114 1.9
Solve PME 34549.957120 2211197.256 0.0
Reset In Box 2706.738544 8120.216 0.0
CG-CoM 2707.821456 8123.464 0.0
Bonds 16351.532703 964740.429 0.0
Angles 56610.113220 9510499.021 0.0
Propers 100840.701681 23092520.685 0.1
Impropers 6374.512749 1325898.652 0.0
Virial 27082.341636 487482.149 0.0
Stop-CM 2707.821456 27078.215 0.0
Calc-Ekin 54146.682912 1461960.439 0.0
Lincs 15382.511746 922950.705 0.0
Lincs-Mat 77204.433180 308817.733 0.0
Constraint-V 275163.445736 2201307.566 0.0
Constraint-Vir 25978.561030 623485.465 0.0
Settle 81466.140748 26313563.462 0.1

Total 25598240038.115 100.0

D O M A I N   D E C O M P O S I T I O N   S T A T I S T I C S

av. #atoms communicated per step for force: 2 x 168610.4
av. #atoms communicated per step for LINCS: 2 x 10246.4

Dynamic load balancing report:
DLB was off during the run due to low measured imbalance.
Average load imbalance: 3.4%.
The balanceable part of the MD step is 52%, load imbalance is computed from this.
Part of the total run time spent waiting due to load imbalance: 1.8%.

 R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G

On 4 MPI ranks, each using 7 OpenMP threads

Computing: Num Num Call Wall time Giga-Cycles
Ranks Threads Count (s) total sum %

Domain decomp. 4 7 5000 128.063 8227.873 1.6
DD comm. load 4 7 4981 0.304 19.515 0.0
DD comm. bounds 4 7 36 0.007 0.423 0.0
Neighbor search 4 7 5001 51.576 3313.687 0.7
Launch GPU ops. 4 7 1000002 66.556 4276.142 0.8
Comm. coord. 4 7 495000 258.585 16613.734 3.3
Force 4 7 500001 482.030 30969.761 6.1
Wait + Comm. F 4 7 500001 268.613 17258.012 3.4
PME mesh 4 7 500001 5830.035 374571.548 73.6
Wait GPU NB nonloc. 4 7 500001 14.829 952.731 0.2
Wait GPU NB local 4 7 500001 17.990 1155.843 0.2
NB X/F buffer ops. 4 7 1990002 153.995 9893.932 1.9
Write traj. 4 7 509 39.913 2564.356 0.5
Update 4 7 500001 166.687 10709.377 2.1
Constraints 4 7 500001 337.901 21709.685 4.3
Comm. energies 4 7 50001 46.248 2971.352 0.6
Rest 59.357 3813.610 0.7

Total 7922.688 509021.581 100.0

Breakdown of PME mesh computation

PME redist. X/F 4 7 1000002 1011.957 65016.814 12.8
PME spread 4 7 500001 1497.578 96217.291 18.9
PME gather 4 7 500001 991.901 63728.249 12.5
PME 3D-FFT 4 7 1000002 965.924 62059.233 12.2
PME 3D-FFT Comm. 4 7 1000002 1191.254 76536.370 15.0
PME solve Elec 4 7 500001 167.511 10762.318 2.1

Breakdown of PP computation

DD redist. 4 7 4999 9.609 617.359 0.1
DD NS grid + sort 4 7 5000 24.148 1551.501 0.3
DD setup comm. 4 7 5000 20.788 1335.625 0.3
DD make top. 4 7 5000 22.761 1462.342 0.3
DD make constr. 4 7 5000 18.140 1165.446 0.2
DD top. other 4 7 5000 27.911 1793.211 0.4
NS grid non-loc. 4 7 5001 3.464 222.543 0.0
NS search local 4 7 5001 34.511 2217.260 0.4
NS search non-loc. 4 7 5001 9.279 596.164 0.1
Bonded F 4 7 500001 402.997 25892.014 5.1
Listed buffer ops. 4 7 500001 19.210 1234.215 0.2
Launch NB GPU tasks 4 7 1000002 66.358 4263.396 0.8
NB X buffer ops. 4 7 990000 66.498 4272.430 0.8
NB F buffer ops. 4 7 1000002 87.298 5608.803 1.1

           Core t (s)   Wall t (s)        (%)
   Time:   221835.265     7922.688     2800.0
                     2h12:02
             (ns/day)    (hour/ns)

Performance: 10.905 2.201

You should recompile with AVX2_256 SIMD support, although with GPUs that will not give a massive speedup and isn’t the source of your problem.

Your GPU driver supports CUDA 10.2, but GROMACS was compiled against CUDA 9.0. Again, compiling with CUDA 10 would probably bring small gains, but likely not a massive improvement.

Here is where you’re likely having issues: PME is being done on the CPU (the GPU task mapping in your log lists only PP tasks), so the run is probably imbalanced. The short-range forces finish on the GPU, but the CPU takes a lot longer to finish its part.
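
One way to check this on your own log is to pull the "PME mesh" row out of the cycle accounting table. The snippet below is only a sketch: the helper name pme_share is made up for illustration, and it assumes the row layout shown in the log above, with the percentage in the last column.

```shell
# pme_share [LOGFILE]: print the run-time percentage from the "PME mesh" row.
# Reads standard input when no file is given. The helper name is hypothetical.
pme_share() {
    awk '/^ *PME mesh / { print $NF; exit }' "$@"
}

# Demo on the row from this thread; on a real run, pass the mdrun log file instead.
printf 'PME mesh 4 7 500001 5830.035 374571.548 73.6\n' | pme_share
```

A value above 70% here, next to well under 1% for the "Wait GPU NB" rows, is what points to CPU-side PME as the bottleneck.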

You may want to try running PME on the GPU. You can set it up so that 3 of your GPUs run the short-range computation and 1 of them does PME. Something like:

gmx mdrun -ntmpi 4 -ntomp 7 -nb gpu -pme gpu -npme 1 -gputasks 0123

You may have to play around with it for a while.

A few more notes:

  1. Tesla K80s are useful but not all that fast. Anecdotally, I’ve only ever seen speedups of 3-4x over running on the same node without GPUs. Very roughly, that’s the performance increase you should expect, but it depends heavily on system size and other factors, so don’t be surprised if it’s higher or lower. The point is that old GPUs don’t give the large performance increases you see with more modern ones.
  2. You’ll get a lot more efficiency out of running multiple simulations on the same node, with 1 simulation per GPU (or even 2 per GPU), rather than splitting one simulation across multiple GPUs. GROMACS does not scale well to multiple GPUs per simulation (though more recent versions are addressing that issue). If you’re interested, I wrote a tool called Gromax that auto-generates a set of possible run configurations for your node. Check it out on GitHub; it could help you find optimal runtime parameters for a single large simulation using all the GPUs, or for dividing up the node as I suggested.
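
The one-simulation-per-GPU layout from point 2 could look roughly like this. Treat it as a sketch under assumptions: the input names sim0 ... sim3 are hypothetical, the core and GPU counts are taken from the log above, and the function only prints the commands so you can inspect them first.

```shell
# per_gpu_cmds NGPUS CORES_PER_SIM: print one mdrun command per GPU,
# each pinned to its own block of cores. The helper name is hypothetical.
# Assumes run inputs sim0.tpr, sim1.tpr, ... (hypothetical names) and a
# gmx_mpi binary that can be started as a single rank without the MPI launcher.
per_gpu_cmds() {
    ngpus=$1
    cores_per_sim=$2
    gpu=0
    while [ "$gpu" -lt "$ngpus" ]; do
        # First core of this simulation's block: 0, 7, 14, 21 for 4 x 7 cores.
        offset=$((gpu * cores_per_sim))
        echo "gmx_mpi mdrun -deffnm sim$gpu -ntomp $cores_per_sim -gpu_id $gpu -pin on -pinoffset $offset -pinstride 1 -nb gpu -pme gpu"
        gpu=$((gpu + 1))
    done
}

per_gpu_cmds 4 7
```

To actually launch them, append & to each printed command and finish with wait. Whether gmx_mpi can be started without the MPI launcher depends on the cluster's MPI setup, so check your site documentation.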

Cheers,

Kevin

That is great! Thank you very much.
I was able to improve my performance (17 ns/day). Still not enough!
I will try a different GPU type (P100 and Volta V100).
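
To put these numbers in perspective for the 1 µs target, here is a quick back-of-the-envelope calculation. It is only a sketch: the helper names are made up, and it assumes the rate stays constant over the whole run.

```shell
# ns_per_day STEPS DT_PS WALL_SECONDS: simulated ns per day of wall time.
ns_per_day() {
    awk -v steps="$1" -v dt="$2" -v wall="$3" \
        'BEGIN { printf "%.3f\n", (steps * dt / 1000) / (wall / 86400) }'
}

# days_for TARGET_NS RATE_NS_PER_DAY: wall-clock days needed to reach the target.
days_for() {
    awk -v target="$1" -v rate="$2" 'BEGIN { printf "%.1f\n", target / rate }'
}

ns_per_day 500001 0.002 7922.688   # reproduces the 10.905 ns/day in the log
days_for 1000 10.905               # 1 us at the original rate: 91.7 days
days_for 1000 17                   # 1 us at 17 ns/day: 58.8 days
```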

I have a small question: when I used the -ntmpi option in the command, I got the following error:

Setting the number of thread-MPI ranks is only supported with thread-MPI and
GROMACS was compiled without thread-MPI
How can I solve this error?

Thank you very much

You are using an MPI binary (gmx_mpi), which uses a different parallelization model: the number of ranks is set through the MPI launcher, e.g. mpirun -np, rather than with -ntmpi. Thread-MPI is an alternative that can be used instead of MPI for single-node runs. For more details see the docs, e.g. the installation guide in the GROMACS 2020.5 documentation.
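
With an MPI build, the layout suggested earlier in the thread (3 PP ranks plus 1 PME rank, one GPU each) would be launched roughly as below. This is a sketch: it assumes the launcher is called mpirun (some clusters use srun or mpiexec instead) and reuses the file names from the original command; the helper name is hypothetical.

```shell
# build_mdrun_cmd RANKS OMP_THREADS: print the launch line for an MPI build.
# The rank count goes to the launcher (-np) instead of mdrun's -ntmpi.
# Assumption: the launcher is "mpirun"; file names are from the original command.
build_mdrun_cmd() {
    ranks=$1        # number of MPI ranks
    omp_threads=$2  # OpenMP threads per rank
    echo "mpirun -np $ranks gmx_mpi mdrun -ntomp $omp_threads" \
         "-nb gpu -pme gpu -npme 1 -gputasks 0123" \
         "-deffnm 6pij_equilnpt -s npt.tpr"
}

# Print the command for 4 ranks x 7 threads (28 cores in total).
build_mdrun_cmd 4 7
```

The function only prints the command so it can be checked before use; the -gputasks string has one digit per GPU task, so it has to match the rank count.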