MD performance help using V100 GPU

GROMACS version: 2021.5
GROMACS modification: No

I have tried various ways to improve the simulation performance, but so far none have helped. I have included my log below. Would someone be willing to take a look? I would greatly appreciate it.

                  :-) GROMACS - gmx mdrun, 2021.5 (-:

                        GROMACS is written by:
 Andrey Alekseenko              Emile Apol              Rossen Apostolov     
     Paul Bauer           Herman J.C. Berendsen           Par Bjelkmar       
   Christian Blau           Viacheslav Bolnykh             Kevin Boyd        
 Aldert van Buuren           Rudi van Drunen             Anton Feenstra      
Gilles Gouaillardet             Alan Gray               Gerrit Groenhof      
   Anca Hamuraru            Vincent Hindriksen          M. Eric Irrgang      
  Aleksei Iupinov           Christoph Junghans             Joe Jordan        
Dimitrios Karkoulis            Peter Kasson                Jiri Kraus        
  Carsten Kutzner              Per Larsson              Justin A. Lemkul     
   Viveca Lindahl            Magnus Lundborg             Erik Marklund       
    Pascal Merz             Pieter Meulenhoff            Teemu Murtola       
    Szilard Pall               Sander Pronk              Roland Schulz       
   Michael Shirts            Alexey Shvetsov             Alfons Sijbers      
   Peter Tieleman              Jon Vincent              Teemu Virolainen     
 Christian Wennberg            Maarten Wolf              Artem Zhmurov       
                       and the project leaders:
    Mark Abraham, Berk Hess, Erik Lindahl, and David van der Spoel

Copyright (c) 1991-2000, University of Groningen, The Netherlands.
Copyright (c) 2001-2019, The GROMACS development team at
Uppsala University, Stockholm University and
the Royal Institute of Technology, Sweden.
check out http://www.gromacs.org for more information.

GROMACS is free software; you can redistribute it and/or modify it
under the terms of the GNU Lesser General Public License
as published by the Free Software Foundation; either version 2.1
of the License, or (at your option) any later version.

GROMACS: gmx mdrun, version 2021.5
Executable: /cvmfs/soft.ccr.buffalo.edu/versions/2023.01/easybuild/software/avx512/MPI/gcc/11.2.0/openmpi/4.1.1/gromacs/2021.5-CUDA-11.5.1/bin/gmx_mpi
Data prefix: /cvmfs/soft.ccr.buffalo.edu/versions/2023.01/easybuild/software/avx512/MPI/gcc/11.2.0/openmpi/4.1.1/gromacs/2021.5-CUDA-11.5.1
Working dir: /projects/academic/tdgrant/pkoduro/new_gromacs/gromacs_degs1
Process ID: 134079
Command line:
gmx_mpi mdrun -v -deffnm step6.1_equilibration -ntomp 8 -nb gpu -pme gpu -npme 1 -gputasks 0011

GROMACS version: 2021.5
Precision: mixed
Memory model: 64 bit
MPI library: MPI
OpenMP support: enabled (GMX_OPENMP_MAX_THREADS = 64)
GPU support: CUDA
SIMD instructions: AVX_512
FFT library: fftw-3.3.10-sse2-avx-avx2-avx2_128
RDTSCP usage: enabled
TNG support: enabled
Hwloc support: disabled
Tracing support: disabled
C compiler: /cvmfs/soft.ccr.buffalo.edu/versions/2023.01/easybuild/software/avx512/Compiler/gcc/11.2.0/openmpi/4.1.1/bin/mpicc GNU 11.2.0
C compiler flags: -mavx512f -mfma -Wno-missing-field-initializers -fexcess-precision=fast -funroll-all-loops -O3 -DNDEBUG
C++ compiler: /cvmfs/soft.ccr.buffalo.edu/versions/2023.01/easybuild/software/avx512/Compiler/gcc/11.2.0/openmpi/4.1.1/bin/mpicxx GNU 11.2.0
C++ compiler flags: -mavx512f -mfma -Wno-missing-field-initializers -fexcess-precision=fast -funroll-all-loops -fopenmp -O3 -DNDEBUG
CUDA compiler: /cvmfs/soft.ccr.buffalo.edu/versions/2023.01/easybuild/software/Core/cuda/11.5.1/bin/nvcc nvcc: NVIDIA (R) Cuda compiler driver;Copyright (c) 2005-2021 NVIDIA Corporation;Built on Thu_Nov_18_09:45:30_PST_2021;Cuda compilation tools, release 11.5, V11.5.119;Build cuda_11.5.r11.5/compiler.30672275_0
CUDA compiler flags:-std=c++17;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-use_fast_math;-D_FORCE_INLINES;-mavx512f -mfma -Wno-missing-field-initializers -fexcess-precision=fast -funroll-all-loops -fopenmp -O3 -DNDEBUG
CUDA driver: 11.40
CUDA runtime: 11.50

Running on 1 node with total 40 cores, 40 logical cores, 2 compatible GPUs
Hardware detected on host cpn-v10-03.compute.cbls.ccr.buffalo.edu (the node of MPI rank 0):
CPU info:
Vendor: Intel
Brand: Intel(R) Xeon(R) Gold 6230 CPU @ 2.10GHz
Family: 6 Model: 85 Stepping: 7
Features: aes apic avx avx2 avx512f avx512cd avx512bw avx512vl avx512secondFMA clfsh cmov cx8 cx16 f16c fma hle htt intel lahf mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd rdtscp rtm sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
Number of AVX-512 FMA units: 2
Hardware topology: Basic
Sockets, cores, and logical processors:
Socket 0: [ 0] [ 4] [ 8] [ 6] [ 2] [ 12] [ 16] [ 18] [ 14] [ 10] [ 20] [ 24] [ 28] [ 26] [ 22] [ 32] [ 36] [ 38] [ 34] [ 30]
Socket 1: [ 1] [ 5] [ 9] [ 7] [ 3] [ 13] [ 17] [ 19] [ 15] [ 11] [ 21] [ 25] [ 29] [ 27] [ 23] [ 33] [ 37] [ 39] [ 35] [ 31]
GPU info:
Number of GPUs detected: 2
#0: NVIDIA Tesla V100-PCIE-32GB, compute cap.: 7.0, ECC: yes, stat: compatible
#1: NVIDIA Tesla V100-PCIE-32GB, compute cap.: 7.0, ECC: yes, stat: compatible

++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, E.
Lindahl
GROMACS: High performance molecular simulations through multi-level
parallelism from laptops to supercomputers
SoftwareX 1 (2015) pp. 19-25
-------- -------- — Thank You — -------- --------

++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
S. Páll, M. J. Abraham, C. Kutzner, B. Hess, E. Lindahl
Tackling Exascale Software Challenges in Molecular Dynamics Simulations with
GROMACS
In S. Markidis & E. Laure (Eds.), Solving Software Challenges for Exascale 8759 (2015) pp. 3-27
-------- -------- — Thank You — -------- --------

++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
S. Pronk, S. Páll, R. Schulz, P. Larsson, P. Bjelkmar, R. Apostolov, M. R.
Shirts, J. C. Smith, P. M. Kasson, D. van der Spoel, B. Hess, and E. Lindahl
GROMACS 4.5: a high-throughput and highly parallel open source molecular
simulation toolkit
Bioinformatics 29 (2013) pp. 845-54
-------- -------- — Thank You — -------- --------

++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
B. Hess and C. Kutzner and D. van der Spoel and E. Lindahl
GROMACS 4: Algorithms for highly efficient, load-balanced, and scalable
molecular simulation
J. Chem. Theory Comput. 4 (2008) pp. 435-447
-------- -------- — Thank You — -------- --------

++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
D. van der Spoel, E. Lindahl, B. Hess, G. Groenhof, A. E. Mark and H. J. C.
Berendsen
GROMACS: Fast, Flexible and Free
J. Comp. Chem. 26 (2005) pp. 1701-1719
-------- -------- — Thank You — -------- --------

++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
E. Lindahl and B. Hess and D. van der Spoel
GROMACS 3.0: A package for molecular simulation and trajectory analysis
J. Mol. Mod. 7 (2001) pp. 306-317
-------- -------- — Thank You — -------- --------

++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
H. J. C. Berendsen, D. van der Spoel and R. van Drunen
GROMACS: A message-passing parallel molecular dynamics implementation
Comp. Phys. Comm. 91 (1995) pp. 43-56
-------- -------- — Thank You — -------- --------

++++ PLEASE CITE THE DOI FOR THIS VERSION OF GROMACS ++++

-------- -------- — Thank You — -------- --------

Input Parameters:
integrator = md
tinit = 0
dt = 0.001
nsteps = 125000
init-step = 0
simulation-part = 1
mts = false
comm-mode = Linear
nstcomm = 100
bd-fric = 0
ld-seed = -134250497
emtol = 10
emstep = 0.01
niter = 20
fcstep = 0
nstcgsteep = 1000
nbfgscorr = 10
rtpi = 0.05
nstxout = 0
nstvout = 100000
nstfout = 100000
nstlog = 1000
nstcalcenergy = 100
nstenergy = 1000
nstxout-compressed = 5000
compressed-x-precision = 1000
cutoff-scheme = Verlet
nstlist = 20
pbc = xyz
periodic-molecules = false
verlet-buffer-tolerance = 0.005
rlist = 1.2
coulombtype = PME
coulomb-modifier = Potential-shift
rcoulomb-switch = 0
rcoulomb = 1.2
epsilon-r = 1
epsilon-rf = inf
vdw-type = Cut-off
vdw-modifier = Force-switch
rvdw-switch = 1
rvdw = 1.2
DispCorr = No
table-extension = 1
fourierspacing = 0.12
fourier-nx = 96
fourier-ny = 96
fourier-nz = 96
pme-order = 4
ewald-rtol = 1e-05
ewald-rtol-lj = 0.001
lj-pme-comb-rule = Geometric
ewald-geometry = 0
epsilon-surface = 0
tcoupl = V-rescale
nsttcouple = 10
nh-chain-length = 0
print-nose-hoover-chain-variables = false
pcoupl = No
pcoupltype = Isotropic
nstpcouple = -1
tau-p = 1
compressibility (3x3):
compressibility[ 0]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
compressibility[ 1]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
compressibility[ 2]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
ref-p (3x3):
ref-p[ 0]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
ref-p[ 1]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
ref-p[ 2]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
refcoord-scaling = No
posres-com (3):
posres-com[0]= 0.00000e+00
posres-com[1]= 0.00000e+00
posres-com[2]= 0.00000e+00
posres-comB (3):
posres-comB[0]= 0.00000e+00
posres-comB[1]= 0.00000e+00
posres-comB[2]= 0.00000e+00
QMMM = false
qm-opts:
ngQM = 0
constraint-algorithm = Lincs
continuation = false
Shake-SOR = false
shake-tol = 0.0001
lincs-order = 4
lincs-iter = 1
lincs-warnangle = 30
nwall = 0
wall-type = 9-3
wall-r-linpot = -1
wall-atomtype[0] = -1
wall-atomtype[1] = -1
wall-density[0] = 0
wall-density[1] = 0
wall-ewald-zfac = 3
pull = false
awh = false
rotation = false
interactiveMD = false
disre = No
disre-weighting = Conservative
disre-mixed = false
dr-fc = 1000
dr-tau = 0
nstdisreout = 100
orire-fc = 0
orire-tau = 0
nstorireout = 100
free-energy = no
cos-acceleration = 0
deform (3x3):
deform[ 0]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
deform[ 1]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
deform[ 2]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
simulated-tempering = false
swapcoords = no
userint1 = 0
userint2 = 0
userint3 = 0
userint4 = 0
userreal1 = 0
userreal2 = 0
userreal3 = 0
userreal4 = 0
applied-forces:
electric-field:
x:
E0 = 0
omega = 0
t0 = 0
sigma = 0
y:
E0 = 0
omega = 0
t0 = 0
sigma = 0
z:
E0 = 0
omega = 0
t0 = 0
sigma = 0
density-guided-simulation:
active = false
group = protein
similarity-measure = inner-product
atom-spreading-weight = unity
force-constant = 1e+09
gaussian-transform-spreading-width = 0.2
gaussian-transform-spreading-range-in-multiples-of-width = 4
reference-density-filename = reference.mrc
nst = 1
normalize-densities = true
adaptive-force-scaling = false
adaptive-force-scaling-time-constant = 4
shift-vector =
transformation-matrix =
grpopts:
nrdf: 13298.7 122492 149364
ref-t: 303.15 303.15 303.15
tau-t: 1 1 1
annealing: No No No
annealing-npoints: 0 0 0
acc: 0 0 0
nfreeze: N N N
energygrp-flags[ 0]: 0

Changing nstlist from 20 to 100, rlist from 1.2 to 1.261

Initializing Domain Decomposition on 4 ranks
Dynamic load balancing: auto
Using update groups, nr 47714, average size 2.7 atoms, max. radius 0.139 nm
Minimum cell size due to atom displacement: 0.421 nm
Initial maximum distances in bonded interactions:
two-body bonded interactions: 0.425 nm, LJ-14, atoms 34254 34263
multi-body bonded interactions: 0.489 nm, CMAP Dih., atoms 5017 5030
Minimum cell size due to bonded interactions: 0.538 nm
Scaling the initial minimum size with 1/0.8 (option -dds) = 1.25
Using 1 separate PME ranks
Optimizing the DD grid for 3 cells with a minimum initial size of 0.672 nm
The maximum allowed number of cells is: X 16 Y 16 Z 15
Domain decomposition grid 3 x 1 x 1, separate PME ranks 1
PME domain decomposition: 1 x 1 x 1
Interleaving PP and PME ranks
This rank does only particle-particle work.
Domain decomposition rank 0, coordinates 0 0 0

The initial number of communication pulses is: X 1
The initial domain decomposition cell size is: X 3.79 nm

The maximum allowed distance for atom groups involved in interactions is:
non-bonded interactions 1.539 nm
two-body bonded interactions (-rdd) 1.539 nm
multi-body bonded interactions (-rdd) 1.539 nm

When dynamic load balancing gets turned on, these settings will change to:
The maximum number of communication pulses is: X 1
The minimum size for domain decomposition cells is 1.539 nm
The requested allowed shrink of DD cells (option -dds) is: 0.80
The allowed shrink of domain decomposition cells is: X 0.41
The maximum allowed distance for atom groups involved in interactions is:
non-bonded interactions 1.539 nm
two-body bonded interactions (-rdd) 1.539 nm
multi-body bonded interactions (-rdd) 1.539 nm

On host cpn-v10-03.compute.cbls.ccr.buffalo.edu 2 GPUs selected for this run.
Mapping of GPU IDs to the 4 GPU tasks in the 4 ranks on this node:
PP:0,PP:0,PP:1,PME:1
PP tasks will do (non-perturbed) short-ranged interactions on the GPU
PP task will update and constrain coordinates on the CPU
PME tasks will do all aspects on the GPU

NOTE: You assigned the same GPU ID(s) to multiple ranks, which is a good idea if you have measured the performance of alternatives.

Using 4 MPI processes

Non-default thread affinity set, disabling internal thread affinity

Using 8 OpenMP threads per MPI process

System total charge: 0.000
Will do PME sum in reciprocal space for electrostatic interactions.

++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
U. Essmann, L. Perera, M. L. Berkowitz, T. Darden, H. Lee and L. G. Pedersen
A smooth particle mesh Ewald method
J. Chem. Phys. 103 (1995) pp. 8577-8592
-------- -------- — Thank You — -------- --------

Using a Gaussian width (1/beta) of 0.384195 nm for Ewald
Potential shift: LJ r^-12: -2.648e-01 r^-6: -5.349e-01, Ewald -8.333e-06
Initialized non-bonded Ewald tables, spacing: 1.02e-03 size: 1176

Generated table with 1130 data points for 1-4 COUL.
Tabscale = 500 points/nm
Generated table with 1130 data points for 1-4 LJ6.
Tabscale = 500 points/nm
Generated table with 1130 data points for 1-4 LJ12.
Tabscale = 500 points/nm

Using GPU 8x8 nonbonded short-range kernels

Using a dual 8x8 pair-list setup updated with dynamic, rolling pruning:
outer list: updated every 100 steps, buffer 0.061 nm, rlist 1.261 nm
inner list: updated every 38 steps, buffer 0.001 nm, rlist 1.201 nm
At tolerance 0.005 kJ/mol/ps per atom, equivalent classical 1x1 list would be:
outer list: updated every 100 steps, buffer 0.197 nm, rlist 1.397 nm
inner list: updated every 38 steps, buffer 0.077 nm, rlist 1.277 nm
Removing pbc first time

Initializing LINear Constraint Solver

++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
B. Hess and H. Bekker and H. J. C. Berendsen and J. G. E. M. Fraaije
LINCS: A Linear Constraint Solver for molecular simulations
J. Comp. Chem. 18 (1997) pp. 1463-1472
-------- -------- — Thank You — -------- --------

The number of constraints is 33841

++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
S. Miyamoto and P. A. Kollman
SETTLE: An Analytical Version of the SHAKE and RATTLE Algorithms for Rigid
Water Models
J. Comp. Chem. 13 (1992) pp. 952-962
-------- -------- — Thank You — -------- --------

Linking all bonded interactions to atoms

Intra-simulation communication will occur every 10 steps.

++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
G. Bussi, D. Donadio and M. Parrinello
Canonical sampling through velocity rescaling
J. Chem. Phys. 126 (2007) pp. 014101
-------- -------- — Thank You — -------- --------

There are: 131113 Atoms
Atom distribution over 3 domains: av 43704 stddev 649 min 42836 max 44292

NOTE: DLB will not turn on during the first phase of PME tuning

Constraining the starting coordinates (step 0)

Constraining the coordinates at t0-dt (step 0)
Center of mass motion removal mode is Linear
We have the following groups for center of mass motion removal:
0: SOLU_MEMB
1: SOLV
RMS relative constraint deviation after constraining: 1.99e-06
Initial temperature: 302.649 K

Started mdrun on rank 0 Thu Jun 15 15:39:16 2023

Energy conservation over simulation part #1 of length 125 ps, time 0 to 125 ps
Conserved energy drift: 1.74e-04 kJ/mol/ps per atom

<======  ###############  ==>
<====  A V E R A G E S  ====>
<==  ###############  ======>

Statistics over 125001 steps using 1251 frames

Energies (kJ/mol)
Bond U-B Proper Dih. Improper Dih. CMAP Dih.
2.76098e+04 1.36198e+05 1.01719e+05 1.52593e+03 -9.28820e+01
LJ-14 Coulomb-14 LJ (SR) Coulomb (SR) Coul. recip.
2.04226e+04 1.65712e+05 2.09739e+04 -1.59034e+06 7.23204e+03
Position Rest. Dih. Rest. Potential Kinetic En. Total Energy
2.83604e+03 8.38407e+02 -1.10537e+06 3.57318e+05 -7.48050e+05
Conserved En. Temperature Pressure (bar) Constr. rmsd
-1.00672e+06 3.01419e+02 -7.33768e+02 0.00000e+00

Total Virial (kJ/mol)
1.45085e+05 -9.61927e+01 -5.40617e+02
-9.67665e+01 1.45493e+05 -4.06045e+02
-5.33296e+02 -3.97303e+02 1.58705e+05

Pressure (bar)
-6.62188e+02 2.69532e+00 1.18316e+01
2.70906e+00 -6.70220e+02 9.10453e+00
1.16564e+01 8.89527e+00 -8.68896e+02

     T-SOLU         T-MEMB         T-SOLV
3.01045e+02    3.01726e+02    3.01199e+02


M E G A - F L O P S   A C C O U N T I N G

NB=Group-cutoff nonbonded kernels NxN=N-by-N cluster Verlet kernels
RF=Reaction-Field VdW=Van der Waals QSTab=quadratic-spline table
W3=SPC/TIP3p W4=TIP4p (single or pairs)
V&F=Potential and force V=Potential only F=Force only

Computing: M-Number M-Flops % Flops

Pair Search distance check 18271.141632 164440.275 0.0
NxN QSTab Elec. + LJ [F] 18022793.272704 955208043.453 97.3
NxN QSTab Elec. + LJ [V&F] 182195.185920 14757810.060 1.5
1,4 nonbonded interactions 18992.276937 1709304.924 0.2
Reset In Box 164.022363 492.067 0.0
CG-CoM 164.153476 492.460 0.0
Bonds 2819.897559 166373.956 0.0
Propers 22823.057583 5226480.187 0.5
Impropers 191.626533 39858.319 0.0
Dihedral Restr. 129.251034 25850.207 0.0
Pos. Restr. 385.503084 19275.154 0.0
Virial 164.191248 2955.442 0.0
Stop-CM 164.153476 1641.535 0.0
Calc-Ekin 3278.087226 88508.355 0.0
Lincs 4230.226523 253813.591 0.0
Lincs-Mat 26981.147532 107924.590 0.0
Constraint-V 17752.659038 159773.931 0.0
Constraint-Vir 135.330678 3247.936 0.0
Settle 3097.449337 1146056.255 0.1
CMAP 40.125321 68213.046 0.0
Urey-Bradley 13485.357882 2467820.492 0.3

Total 981618376.236 100.0

D O M A I N   D E C O M P O S I T I O N   S T A T I S T I C S

av. #atoms communicated per step for force: 2 x 39050.7

Dynamic load balancing report:
DLB was off during the run due to low measured imbalance.
Average load imbalance: 4.3%.
The balanceable part of the MD step is 72%, load imbalance is computed from this.
Part of the total run time spent waiting due to load imbalance: 3.1%.
Average PME mesh/force load: 1.093
Part of the total run time spent waiting due to PP/PME imbalance: 3.2 %

 R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G

On 3 MPI ranks doing PP, each using 8 OpenMP threads, and
on 1 MPI rank doing PME, using 8 OpenMP threads

Computing: Num Num Call Wall time Giga-Cycles
Ranks Threads Count (s) total sum %

Domain decomp. 3 8 1251 20.212 1016.205 5.4
DD comm. load 3 8 1233 0.039 1.981 0.0
Send X to PME 3 8 125001 38.286 1924.876 10.2
Neighbor search 3 8 1251 5.424 272.679 1.4
Launch GPU ops. 3 8 250002 8.367 420.687 2.2
Comm. coord. 3 8 123750 17.310 870.305 4.6
Force 3 8 125001 65.808 3308.620 17.6
Wait + Comm. F 3 8 125001 24.261 1219.768 6.5
PME mesh * 1 8 125001 135.307 2267.590 12.1
PME wait for PP * 145.266 2434.490 12.9
Wait + Recv. PME F 3 8 125001 36.972 1858.844 9.9
Wait PME GPU gather 3 8 125001 43.440 2184.008 11.6
Wait GPU NB nonloc. 3 8 125001 25.717 1292.961 6.9
Wait GPU NB local 3 8 125001 0.776 39.023 0.2
NB X/F buffer ops. 3 8 497502 7.843 394.294 2.1
Write traj. 3 8 26 0.226 11.350 0.1
Update 3 8 125001 3.905 196.338 1.0
Constraints 3 8 125003 22.743 1143.425 6.1
Comm. energies 3 8 12501 1.008 50.655 0.3

Total 280.596 18809.876 100.0

(*) Note that with separate PME ranks, the walltime column actually sums to
twice the total reported, but the cycle count total and % are correct.

           Core t (s)   Wall t (s)        (%)
   Time:     8978.804      280.596     3199.9
             (ns/day)    (hour/ns)

Performance: 38.490 0.624
Finished mdrun on rank 0 Thu Jun 15 15:43:57 2023

@pszilard I would be grateful if you could take a look and offer some advice. Thank you, I appreciate your time.

Hi,

if I read your log file correctly, you have about 131,000 atoms in your system. It may well be that a system of this size does not scale well across multiple strong GPUs (only large systems do). I would first try running the system (both PME and PP) on a single V100. Another thing that may be hurting your performance is that the threads are not pinned to the compute cores. If you have the whole node to yourself, you can let GROMACS handle the pinning by running two simulations at the same time, e.g. with the multidir functionality:

mpirun -np 2 mdrun_mpi -s in.tpr -multidir sim1 sim2 -pin on -nsteps 50000 -resethway

This way each simulation should run on one GPU, using half of the available CPU cores. That should give you a good baseline for the single-GPU performance, and you can then check whether two GPUs give you a significant gain.
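
If you just want a quick single-GPU baseline without the multidir setup, a rough sketch would be something like the following (the thread count assumes half of your 40-core node; adjust it to your actual allocation):

mpirun -np 1 gmx_mpi mdrun -v -deffnm step6.1_equilibration -nb gpu -pme gpu -gpu_id 0 -ntomp 20 -pin on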

Is there a reason why you are using a 1 fs time step? With h-bond constraints, a 2 fs time step would probably work and instantly double your performance.
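
For reference, a minimal sketch of the relevant .mdp settings for that (assuming the rest of your CHARMM-GUI input stays as it is):

dt                   = 0.002    ; 2 fs time step
constraints          = h-bonds  ; constrain bonds involving hydrogen
constraint-algorithm = lincs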

Best,
Carsten

Dear Carsten,

I am deeply grateful for your assistance, insightful comments, and guidance, all of which have significantly contributed to the progress of my work.

Following your recommendations, I have implemented changes and extensively benchmarked my system to identify optimal performance and scaling configurations.

However, when I ran your suggested command and started multiple simulations on my V100 GPU card, I observed a decrease in performance. The results are as follows:

Core time (s): 7629.778
Wall time (s): 381.513
Performance (ns/day): 11.324
Performance (hour/ns): 2.119
Simulation completion time: Tue Jun 20 11:56:18 2023

On the other hand, I also set up another benchmark using A100 GPU cards. This time, I observed a considerable improvement in performance with the following benchmarking command line:

gmx mdrun -v -s step7_1.tpr -nsteps 100000 -ntmpi 4 -ntomp 8 -resetstep 90000 -noconfout -pme gpu -nb gpu -bonded gpu -npme 1 -nstlist 100 -pin on

The dynamic load balancing report showed an average load imbalance of 1.9%, with the balanceable part of the MD step being 72%. The total run time spent waiting due to load imbalance was 1.4% and due to PP/PME imbalance was 4.6%.

The detailed real cycle and time accounting results are as follows:
On 3 MPI ranks doing PP, each using 8 OpenMP threads, and
on 1 MPI rank doing PME, using 8 OpenMP threads

Computing: Num Num Call Wall time Giga-Cycles
Ranks Threads Count (s) total sum %

Domain decomp. 3 8 101 0.617 29.490 4.2
DD comm. load 3 8 101 0.008 0.402 0.1
Send X to PME 3 8 10001 1.007 48.141 6.9
Neighbor search 3 8 101 0.350 16.745 2.4
Launch GPU ops. 3 8 20002 0.701 33.537 4.8
Comm. coord. 3 8 9900 1.216 58.174 8.3
Force 3 8 10001 0.306 14.637 2.1
Wait + Comm. F 3 8 10001 0.822 39.298 5.6
PME mesh * 1 8 10001 4.523 72.104 10.3
PME wait for PP * 6.454 102.891 14.7
Wait + Recv. PME F 3 8 10001 0.815 38.990 5.6
Wait PME GPU gather 3 8 10001 1.354 64.761 9.2
Wait Bonded GPU 3 8 101 0.000 0.007 0.0
Wait GPU NB nonloc. 3 8 10001 1.497 71.609 10.2
Wait GPU NB local 3 8 10001 0.649 31.015 4.4
NB X/F buffer ops. 3 8 39802 0.570 27.255 3.9
Write traj. 3 8 1 0.018 0.841 0.1
Update 3 8 10001 0.347 16.578 2.4
Constraints 3 8 10001 1.560 74.619 10.7
Comm. energies 3 8 1001 0.099 4.717 0.7

Total 10.982 700.274 100.0

(*) Note that with separate PME ranks, the walltime column actually sums to
twice the total reported, but the cycle count total and % are correct.

           Core t (s)   Wall t (s)        (%)
   Time:      350.776       10.982     3194.1
             (ns/day)    (hour/ns)

Performance: 157.362 0.153

The reason I am using a 1 fs time step in this particular instance is that it is a stage of my NPT equilibration. I initially start with a 2 fs time step and then transition to a 1 fs time step in the subsequent phase of the NPT equilibration, based on the CHARMM-GUI protocol.

Indeed, I have observed that a 2 fs time step can increase performance; however, the improvement is not as large as one might expect. Specifically, the increase amounts to only about 4 ns/day.

For clarity, the information above is from my production step; its input parameters are:
Input Parameters:
integrator = md
tinit = 0
dt = 0.002
nsteps = 500000000
init-step = 0
simulation-part = 1
mts = false
comm-mode = Linear
nstcomm = 100
bd-fric = 0
ld-seed = -1073819660
emtol = 10
emstep = 0.01
niter = 20
fcstep = 0
nstcgsteep = 1000
nbfgscorr = 10
rtpi = 0.05
nstxout = 50000
nstvout = 50000
nstfout = 50000
nstlog = 1000
nstcalcenergy = 100
nstenergy = 1000
nstxout-compressed = 50000
compressed-x-precision = 1000
cutoff-scheme = Verlet
nstlist = 400
pbc = xyz
periodic-molecules = false
verlet-buffer-tolerance = 0.005
rlist = 2.212
coulombtype = PME
coulomb-modifier = Potential-shift
rcoulomb-switch = 0
rcoulomb = 1.2
epsilon-r = 1
epsilon-rf = inf
vdw-type = Cut-off
vdw-modifier = Force-switch
rvdw-switch = 1
rvdw = 1.2
DispCorr = No
table-extension = 1
fourierspacing = 0.12
fourier-nx = 96
fourier-ny = 96
fourier-nz = 96
pme-order = 4
ewald-rtol = 1e-05
ewald-rtol-lj = 0.001
lj-pme-comb-rule = Geometric
ewald-geometry = 0
epsilon-surface = 0
tcoupl = V-rescale
nsttcouple = 10
nh-chain-length = 0
print-nose-hoover-chain-variables = false
pcoupl = Parrinello-Rahman
pcoupltype = Semiisotropic
nstpcouple = 10
tau-p = 5
compressibility (3x3):
compressibility[ 0]={ 4.50000e-05, 0.00000e+00, 0.00000e+00}
compressibility[ 1]={ 0.00000e+00, 4.50000e-05, 0.00000e+00}
compressibility[ 2]={ 0.00000e+00, 0.00000e+00, 4.50000e-05}

A twice as long time step should give you nearly a factor of two higher simulation performance. If that is not the case, then something else differs between those runs. You could run xxdiff (or a similar text comparison tool) on the md.log files of both runs to spot the differences easily. Unfortunately, the .log output pasted here is almost impossible to read.
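
For example, something along these lines (the log file names are just placeholders for your two runs):

# extract only the input-parameter block from each log, then compare
sed -n '/Input Parameters:/,/grpopts:/p' run_1fs.log > params_1fs.txt
sed -n '/Input Parameters:/,/grpopts:/p' run_2fs.log > params_2fs.txt
diff -u params_1fs.txt params_2fs.txt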

You should see a large improvement in multi-GPU runs by enabling GPU direct communication. Please see my comments at Using multiple GPUs on one machine - User discussions - GROMACS forums (bioexcel.eu). Note that you should use the latest 2023 version of GROMACS and build it with thread-MPI (-DGMX_MPI=OFF). You can also use lib-MPI (-DGMX_MPI=ON), but it must then be a CUDA-aware MPI build to allow GPU direct communication.
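
As a rough sketch only (the cmake options are the generic ones from the install guide, the install prefix is just an example, and the run line assumes a thread-MPI 2023 build on a single node; your cluster setup will differ):

# configure a thread-MPI CUDA build from a build directory under the unpacked 2023 source
cmake .. -DGMX_GPU=CUDA -DGMX_MPI=OFF -DGMX_BUILD_OWN_FFTW=ON -DCMAKE_INSTALL_PREFIX=$HOME/gromacs-2023

# at run time, enable direct GPU halo exchange and PME-PP communication
export GMX_ENABLE_DIRECT_GPU_COMM=1
gmx mdrun -v -s step7_1.tpr -ntmpi 4 -ntomp 8 -nb gpu -pme gpu -bonded gpu -npme 1 -pin on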

Hi Carsten,

Yes, that is right; that is exactly what I get with the A100 GPU cards. With a 1 fs time step, the performance was:
           Core t (s)   Wall t (s)        (%)
   Time:   353159.055    11036.245     3200.0
                       3h03:56
             (ns/day)    (hour/ns)
Performance:   78.288        0.307
When I switched to a 2 fs time step in the next step, it was:
           Core t (s)   Wall t (s)        (%)
   Time:   175241.142     5476.301     3200.0
                       1h31:16
             (ns/day)    (hour/ns)
Performance:  157.771        0.152

Hi Alang,

That's a fantastic suggestion! I will try enabling GPU direct communication. However, regarding GROMACS 2023: when I asked my school's HPC systems administrator to build it, they couldn't find an EasyBuild config for it yet. I attempted to build it on the cluster myself, but without success. As a result, we went with GROMACS 2021.7, which was released in February 2023.
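
In the meantime, with 2021.7 my understanding is that the equivalent switches are still development-stage environment variables and require a thread-MPI build. A rough sketch of what I plan to try, assuming I can get a thread-MPI build of 2021.7:

export GMX_GPU_DD_COMMS=1               # direct GPU halo exchange between PP ranks
export GMX_GPU_PME_PP_COMMS=1           # direct GPU PME-PP communication
export GMX_FORCE_UPDATE_DEFAULT_GPU=1   # default update/constraints to the GPU
gmx mdrun -v -s step7_1.tpr -ntmpi 4 -ntomp 8 -nb gpu -pme gpu -bonded gpu -npme 1 -pin on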