Maximizing GPU performance with GROMACS 2021

GROMACS version: 2021.2
GROMACS modification: Yes - PLUMED patched

I have been running Justin Lemkul's umbrella sampling tutorial. The performance, however, is lower than I would expect from a GPU-accelerated run. I am using a single workstation with 3 GPUs. Any ideas on how to improve it? I have read that some variables need to be set, but the instructions I found apply only to GROMACS 2020.
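One experiment that may be worth benchmarking is a single-rank run restricted to one of the faster cards, with PME and bonded work offloaded in addition to the nonbonded work. The sketch below only assembles such a command line in Python so it can be inspected or handed to `subprocess.run`. The options `-nb`, `-pme`, `-bonded`, `-ntmpi`, `-ntomp`, `-pin` and `-gpu_id` are standard `gmx mdrun` options, but the helper name `build_mdrun_command`, the thread count of 16 and the choice of GPU 0 are assumptions for illustration, not tuned values. Whether this helps on this system, or with a PLUMED-patched build, has to be checked by benchmarking.

```python
def build_mdrun_command(deffnm, gpu_id="0", omp_threads=16):
    """Assemble a single-rank gmx mdrun command that offloads more work to one GPU.

    The default GPU id and thread count are illustrative assumptions, not tuned values.
    """
    return [
        "gmx", "mdrun",
        "-deffnm", deffnm,
        "-nb", "gpu",                # short-range nonbonded on the GPU (as in the original run)
        "-pme", "gpu",               # long-range PME on the GPU as well
        "-bonded", "gpu",            # bonded interactions on the GPU
        "-ntmpi", "1",               # a single thread-MPI rank
        "-ntomp", str(omp_threads),  # OpenMP threads for that rank (assumed value)
        "-pin", "on",                # pin threads to cores
        "-gpu_id", gpu_id,           # restrict the run to one device (assumed: GPU 0)
    ]


command = build_mdrun_command("umbrella495")
print(" ".join(command))
```

Running the printed command for a short test and comparing the ns/day at the end of its log against the 83.456 ns/day reported in the log in this post is a cheap way to see whether the change is worthwhile.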

Here is an example log:

              :-) GROMACS - gmx mdrun, 2021.2-MODIFIED (-:

                        GROMACS is written by:
 Andrey Alekseenko              Emile Apol              Rossen Apostolov     
     Paul Bauer           Herman J.C. Berendsen           Par Bjelkmar       
   Christian Blau           Viacheslav Bolnykh             Kevin Boyd        
 Aldert van Buuren           Rudi van Drunen             Anton Feenstra      
Gilles Gouaillardet             Alan Gray               Gerrit Groenhof      
   Anca Hamuraru            Vincent Hindriksen          M. Eric Irrgang      
  Aleksei Iupinov           Christoph Junghans             Joe Jordan        
Dimitrios Karkoulis            Peter Kasson                Jiri Kraus        
  Carsten Kutzner              Per Larsson              Justin A. Lemkul     
   Viveca Lindahl            Magnus Lundborg             Erik Marklund       
    Pascal Merz             Pieter Meulenhoff            Teemu Murtola       
    Szilard Pall               Sander Pronk              Roland Schulz       
   Michael Shirts            Alexey Shvetsov             Alfons Sijbers      
   Peter Tieleman              Jon Vincent              Teemu Virolainen     
 Christian Wennberg            Maarten Wolf              Artem Zhmurov       
                       and the project leaders:
    Mark Abraham, Berk Hess, Erik Lindahl, and David van der Spoel

Copyright (c) 1991-2000, University of Groningen, The Netherlands.
Copyright (c) 2001-2019, The GROMACS development team at
Uppsala University, Stockholm University and
the Royal Institute of Technology, Sweden.
check out http://www.gromacs.org for more information.

GROMACS is free software; you can redistribute it and/or modify it
under the terms of the GNU Lesser General Public License
as published by the Free Software Foundation; either version 2.1
of the License, or (at your option) any later version.

GROMACS: gmx mdrun, version 2021.2-MODIFIED
Executable: /usr/local/gromacs/bin/gmx
Data prefix: /usr/local/gromacs
Working dir: /home/tga_user/gromacs_tutorials/Umbrella2
Process ID: 67364
Command line:
gmx mdrun -deffnm umbrella495 -nb gpu

GROMACS version: 2021.2-MODIFIED
This program has been built from source code that has been altered and does not match the code released as part of the official GROMACS version 2021.2-MODIFIED. If you did not intend to use an altered GROMACS version, make sure to download an intact source distribution and compile that before proceeding.
If you have modified the source code, you are strongly encouraged to set your custom version suffix (using -DGMX_VERSION_STRING_OF_FORK) which will can help later with scientific reproducibility but also when reporting bugs.
Release checksum: d91a739522d82c53dc47c2276b9ac5fae3b2119283e84e3a1c7bed8ce741fda2
Computed checksum: 5d0beb2d24e8cff7911423126493f055ef6a7d6ddb7306aab1f5f2c9748aedc7
Precision: mixed
Memory model: 64 bit
MPI library: thread_mpi
OpenMP support: enabled (GMX_OPENMP_MAX_THREADS = 64)
GPU support: CUDA
SIMD instructions: AVX2_256
FFT library: fftw-3.3.8-sse2-avx-avx2-avx2_128
RDTSCP usage: enabled
TNG support: enabled
Hwloc support: disabled
Tracing support: disabled
C compiler: /usr/bin/gcc GNU 9.3.0
C compiler flags: -mavx2 -mfma -Wno-missing-field-initializers -fexcess-precision=fast -funroll-all-loops -pthread -O3 -DNDEBUG
C++ compiler: /usr/bin/c++ GNU 9.3.0
C++ compiler flags: -mavx2 -mfma -Wno-missing-field-initializers -fexcess-precision=fast -funroll-all-loops -pthread -fopenmp -O3 -DNDEBUG
CUDA compiler: /usr/local/cuda-11.2/bin/nvcc nvcc: NVIDIA (R) Cuda compiler driver;Copyright (c) 2005-2021 NVIDIA Corporation;Built on Sun_Feb_14_21:12:58_PST_2021;Cuda compilation tools, release 11.2, V11.2.152;Build cuda_11.2.r11.2/compiler.29618528_0
CUDA compiler flags:-std=c++17;-gencode;arch=compute_35,code=sm_35;-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_52,code=sm_52;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-Wno-deprecated-gpu-targets;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_35,code=compute_35;-gencode;arch=compute_53,code=compute_53;-gencode;arch=compute_80,code=compute_80;-use_fast_math;-D_FORCE_INLINES;-mavx2 -mfma -Wno-missing-field-initializers -fexcess-precision=fast -funroll-all-loops -pthread -fopenmp -O3 -DNDEBUG
CUDA driver: 11.20
CUDA runtime: 11.20

Running on 1 node with total 32 cores, 64 logical cores, 3 compatible GPUs
Hardware detected:
CPU info:
Vendor: Intel
Brand: Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz
Family: 6 Model: 85 Stepping: 7
Features: aes apic avx avx2 avx512f avx512cd avx512bw avx512vl clfsh cmov cx8 cx16 f16c fma htt intel lahf mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd rdtscp sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
Number of AVX-512 FMA units: 1 (AVX2 is faster w/o 2 AVX-512 FMA units)
Hardware topology: Basic
Sockets, cores, and logical processors:
Socket 0: [ 0 32] [ 1 33] [ 2 34] [ 3 35] [ 4 36] [ 5 37] [ 6 38] [ 7 39] [ 8 40] [ 9 41] [ 10 42] [ 11 43] [ 12 44] [ 13 45] [ 14 46] [ 15 47]
Socket 1: [ 16 48] [ 17 49] [ 18 50] [ 19 51] [ 20 52] [ 21 53] [ 22 54] [ 23 55] [ 24 56] [ 25 57] [ 26 58] [ 27 59] [ 28 60] [ 29 61] [ 30 62] [ 31 63]
GPU info:
Number of GPUs detected: 3
#0: NVIDIA Quadro P5000, compute cap.: 6.1, ECC: no, stat: compatible
#1: NVIDIA Quadro P5000, compute cap.: 6.1, ECC: no, stat: compatible
#2: NVIDIA Quadro P620, compute cap.: 6.1, ECC: no, stat: compatible


Input Parameters:
integrator = md
tinit = 0
dt = 0.002
nsteps = 5000000
init-step = 0
simulation-part = 1
mts = false
comm-mode = Linear
nstcomm = 100
bd-fric = 0
ld-seed = -537013001
emtol = 10
emstep = 0.01
niter = 20
fcstep = 0
nstcgsteep = 1000
nbfgscorr = 10
rtpi = 0.05
nstxout = 0
nstvout = 0
nstfout = 0
nstlog = 1000
nstcalcenergy = 100
nstenergy = 5000
nstxout-compressed = 5000
compressed-x-precision = 1000
cutoff-scheme = Verlet
nstlist = 20
pbc = xyz
periodic-molecules = false
verlet-buffer-tolerance = 0.005
rlist = 1.419
coulombtype = PME
coulomb-modifier = Potential-shift
rcoulomb-switch = 0
rcoulomb = 1.4
epsilon-r = 1
epsilon-rf = inf
vdw-type = Cut-off
vdw-modifier = Potential-shift
rvdw-switch = 0
rvdw = 1.4
DispCorr = EnerPres
table-extension = 1
fourierspacing = 0.12
fourier-nx = 56
fourier-ny = 40
fourier-nz = 104
pme-order = 4
ewald-rtol = 1e-05
ewald-rtol-lj = 0.001
lj-pme-comb-rule = Geometric
ewald-geometry = 0
epsilon-surface = 0
tcoupl = Nose-Hoover
nsttcouple = 10
nh-chain-length = 1
print-nose-hoover-chain-variables = false
pcoupl = Parrinello-Rahman
pcoupltype = Isotropic
nstpcouple = 10
tau-p = 2
compressibility (3x3):
compressibility[ 0]={ 4.50000e-05, 0.00000e+00, 0.00000e+00}
compressibility[ 1]={ 0.00000e+00, 4.50000e-05, 0.00000e+00}
compressibility[ 2]={ 0.00000e+00, 0.00000e+00, 4.50000e-05}
ref-p (3x3):
ref-p[ 0]={ 1.00000e+00, 0.00000e+00, 0.00000e+00}
ref-p[ 1]={ 0.00000e+00, 1.00000e+00, 0.00000e+00}
ref-p[ 2]={ 0.00000e+00, 0.00000e+00, 1.00000e+00}
refcoord-scaling = COM
posres-com (3):
posres-com[0]= 5.03393e-01
posres-com[1]= 5.02192e-01
posres-com[2]= 2.47462e-01
posres-comB (3):
posres-comB[0]= 5.03393e-01
posres-comB[1]= 5.02192e-01
posres-comB[2]= 2.47462e-01
QMMM = false
qm-opts:
ngQM = 0
constraint-algorithm = Lincs
continuation = true
Shake-SOR = false
shake-tol = 0.0001
lincs-order = 4
lincs-iter = 1
lincs-warnangle = 30
nwall = 0
wall-type = 9-3
wall-r-linpot = -1
wall-atomtype[0] = -1
wall-atomtype[1] = -1
wall-density[0] = 0
wall-density[1] = 0
wall-ewald-zfac = 3
pull = true
pull-cylinder-r = 1.5
pull-constr-tol = 1e-06
pull-print-COM = false
pull-print-ref-value = false
pull-print-components = false
pull-nstxout = 50
pull-nstfout = 50
pull-pbc-ref-prev-step-com = true
pull-xout-average = false
pull-fout-average = false
pull-ngroups = 3
pull-group 0:
atom: not available
weight: not available
pbcatom = -1
pull-group 1:
atom (226):
atom[0,…,225] = {0,…,225}
weight: not available
pbcatom = 127
pull-group 2:
atom (226):
atom[0,…,225] = {226,…,451}
weight: not available
pbcatom = 338
pull-ncoords = 1
pull-coord 0:
type = umbrella
geometry = distance
group[0] = 1
group[1] = 2
dim (3):
dim[0]=0
dim[1]=0
dim[2]=1
origin (3):
origin[0]= 0.00000e+00
origin[1]= 0.00000e+00
origin[2]= 0.00000e+00
vec (3):
vec[0]= 0.00000e+00
vec[1]= 0.00000e+00
vec[2]= 0.00000e+00
start = true
init = 5.47532
rate = 0
k = 1000
kB = 1000
awh = false
rotation = false
interactiveMD = false
disre = No
disre-weighting = Conservative
disre-mixed = false
dr-fc = 1000
dr-tau = 0
nstdisreout = 100
orire-fc = 0
orire-tau = 0
nstorireout = 100
free-energy = no
cos-acceleration = 0
deform (3x3):
deform[ 0]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
deform[ 1]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
deform[ 2]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
simulated-tempering = false
swapcoords = no
userint1 = 0
userint2 = 0
userint3 = 0
userint4 = 0
userreal1 = 0
userreal2 = 0
userreal3 = 0
userreal4 = 0
applied-forces:
electric-field:
x:
E0 = 0
omega = 0
t0 = 0
sigma = 0
y:
E0 = 0
omega = 0
t0 = 0
sigma = 0
z:
E0 = 0
omega = 0
t0 = 0
sigma = 0
density-guided-simulation:
active = false
group = protein
similarity-measure = inner-product
atom-spreading-weight = unity
force-constant = 1e+09
gaussian-transform-spreading-width = 0.2
gaussian-transform-spreading-range-in-multiples-of-width = 4
reference-density-filename = reference.mrc
nst = 1
normalize-densities = true
adaptive-force-scaling = false
adaptive-force-scaling-time-constant = 4
shift-vector =
transformation-matrix =
grpopts:
nrdf: 2254.9 64203.1
ref-t: 310 310
tau-t: 1 1
annealing: No No
annealing-npoints: 0 0
acc: 0 0 0
nfreeze: N N N
energygrp-flags[ 0]: 0

Changing nstlist from 20 to 100, rlist from 1.419 to 1.536


               Core t (s)   Wall t (s)        (%)
       Time:   621165.993    10352.767     6000.0
                         2h52:32
                 (ns/day)    (hour/ns)
Performance:       83.456        0.288
Finished mdrun on rank 0 Sat Jul 24 11:31:29 2021
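
As a sanity check on the figures above, the reported performance can be reproduced from `nsteps`, `dt` and the wall time in the log. This is a minimal sketch with the values copied by hand from the log; the variable names are my own.

```python
# Values copied from the log above.
nsteps = 5_000_000    # nsteps
dt_ps = 0.002         # dt, in ps
wall_s = 10352.767    # Wall t (s)
core_s = 621165.993   # Core t (s)

simulated_ns = nsteps * dt_ps / 1000.0           # total simulated time: 10 ns
ns_per_day = simulated_ns / (wall_s / 86400.0)   # throughput in ns/day
hours_per_ns = (wall_s / 3600.0) / simulated_ns  # wall-clock hours per simulated ns
cpu_percent = 100.0 * core_s / wall_s            # summed core time relative to wall time

print(f"{ns_per_day:.3f} ns/day, {hours_per_ns:.3f} hour/ns, {cpu_percent:.1f} %")
# prints: 83.456 ns/day, 0.288 hour/ns, 6000.0 %
```

The numbers match the log. The 6000 % figure means the accumulated core time is 60 times the wall time, which is consistent with roughly 60 CPU threads being accounted for during the run.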