GROMACS version: 2020.2-dev-20200430-5e78835-unknown
GROMACS modification: https://catalog.ngc.nvidia.com/orgs/hpc/containers/gromacs
I’m performing protein-bilayer simulations of ~360k atoms on a cluster running a Docker image of GROMACS 2020.2 from nvcr.io.
Since GPUs are my most limited resource, I’m running on a single NVIDIA Quadro RTX 6000 GPU with 16 CPUs (Intel(R) Xeon(R) Gold 6248 @ 2.50GHz). My performance is currently ~30.8 ns/day (timed here over 500 ps).
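As a sanity check on that figure, here is a minimal Python sketch (my own illustration, not GROMACS code) that recomputes the throughput from values in the log below: nsteps = 250000, dt = 0.002 ps and a wall time of 1402.965 s. The function names are hypothetical.

```python
SECONDS_PER_DAY = 86_400


def simulated_ns(nsteps: int, dt_ps: float) -> float:
    """Length of the simulated trajectory in nanoseconds."""
    return nsteps * dt_ps / 1000.0  # ps -> ns


def ns_per_day(nsteps: int, dt_ps: float, wall_s: float) -> float:
    """Simulated nanoseconds per day of wall-clock time."""
    return simulated_ns(nsteps, dt_ps) * SECONDS_PER_DAY / wall_s


def hours_per_ns(nsteps: int, dt_ps: float, wall_s: float) -> float:
    """Wall-clock hours needed per simulated nanosecond."""
    return wall_s / 3600.0 / simulated_ns(nsteps, dt_ps)


# Values taken from the log: nsteps, dt (ps), total wall time (s).
rate = ns_per_day(250_000, 0.002, 1402.965)
cost = hours_per_ns(250_000, 0.002, 1402.965)
print(f"{rate:.3f} ns/day, {cost:.3f} hour/ns")
```

This agrees with the 30.792 ns/day and 0.779 hour/ns that mdrun reports at the end of the log.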
This is the command I’m using:
‘gmx mdrun -ntmpi 1 -ntomp 16 -nb gpu -bonded gpu -pme gpu -deffnm run’
I have enabled the environment variables GMX_GPU_DD_COMMS, GMX_GPU_PME_PP_COMMS and GMX_FORCE_UPDATE_DEFAULT_GPU.
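For reproducibility, the setup can be sketched in Python as an environment plus an argument list. This is only an illustration under assumptions: the value "1" for each variable is my assumption (the log only mentions GMX_USE_GPU_BUFFER_OPS=1 as an equivalent), and the helper names are hypothetical.

```python
import shlex

# The three variables named in this post.
GPU_ENV_VARS = (
    "GMX_GPU_DD_COMMS",
    "GMX_GPU_PME_PP_COMMS",
    "GMX_FORCE_UPDATE_DEFAULT_GPU",
)


def build_env(base: dict) -> dict:
    """Return a copy of `base` with the GPU-related variables set."""
    env = dict(base)
    for name in GPU_ENV_VARS:
        env[name] = "1"  # assumption: the value used here is "1"
    return env


def build_command(ntmpi: int = 1, ntomp: int = 16, deffnm: str = "run") -> list:
    """Argument list matching the mdrun command quoted above."""
    return [
        "gmx", "mdrun",
        "-ntmpi", str(ntmpi),
        "-ntomp", str(ntomp),
        "-nb", "gpu", "-bonded", "gpu", "-pme", "gpu",
        "-deffnm", deffnm,
    ]


env = build_env({})
cmd = build_command()
print(" ".join(f"{name}={env[name]}" for name in GPU_ENV_VARS), shlex.join(cmd))
```

To actually launch the run one could pass these to `subprocess.run(cmd, env=build_env(os.environ), check=True)`; the sketch itself only prints the command.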
Two things in particular stand out to me in the log file. First, waiting for the GPU state copy accounts for 73.2% of the run time:
‘Wait GPU state copy 1 16 237500 1027.031 41081.436 73.2’
Second, for some reason ‘-update gpu’ does not support the Nose-Hoover thermostat, so the update falls back to the CPU:
‘Nose-Hoover temperature coupling is not supported.’
‘Will use CPU version of update.’
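To make the fallback explicit, here is a small Python sketch of a pre-flight check on the .mdp settings. It is deliberately incomplete: it encodes only the single condition that my log reports, while mdrun checks other conditions that are not modelled here. The function name and the dictionary layout are hypothetical.

```python
def gpu_update_blockers(mdp: dict) -> list:
    """Return reasons (as reported in my log) why update would run on the CPU.

    Only the Nose-Hoover condition seen in the log is covered; this is not
    a full reimplementation of the checks mdrun performs.
    """
    blockers = []
    tcoupl = str(mdp.get("tcoupl", "no")).strip().lower().replace("_", "-")
    if tcoupl == "nose-hoover":
        blockers.append("Nose-Hoover temperature coupling is not supported.")
    return blockers


# Settings taken from the "Input Parameters" section of the log.
settings = {
    "integrator": "md",
    "tcoupl": "Nose-Hoover",
    "pcoupl": "Parrinello-Rahman",
}

reasons = gpu_update_blockers(settings)
for reason in reasons:
    print(reason)
if reasons:
    print("Will use CPU version of update.")
```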
Included is a shortened version of my log file.
I’d be thrilled if someone could point me in the right direction for teasing better performance out of my resources. I’ve tried multiple combinations of thread-MPI ranks and OpenMP threads, and thus far 1 : 16 gave the best performance.
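For reference, the rank/thread combinations that use exactly 16 threads can be enumerated with a short Python sketch. The helper name is hypothetical, and the sketch only lists the -ntmpi/-ntomp pairs; other mdrun options may need adjusting when more than one rank is used.

```python
def rank_thread_splits(total_threads: int) -> list:
    """All (ntmpi, ntomp) pairs with ntmpi * ntomp == total_threads."""
    return [
        (ranks, total_threads // ranks)
        for ranks in range(1, total_threads + 1)
        if total_threads % ranks == 0
    ]


splits = rank_thread_splits(16)
for ntmpi, ntomp in splits:
    print(f"-ntmpi {ntmpi} -ntomp {ntomp}")
```

The first pair, 1 : 16, is the one used in the command above.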
Thank you in advance!
GROMACS: gmx mdrun, version 2020.2-dev-20200430-5e78835-unknown
Executable: /usr/local/gromacs/sm70/bin/gmx
Data prefix: /usr/local/gromacs/sm70
Command line:
gmx mdrun -ntmpi 1 -ntomp 16 -nb gpu -bonded gpu -pme gpu -deffnm run
GROMACS version: 2020.2-dev-20200430-5e78835-unknown
GIT SHA1 hash: 5e788350ad75c15ba91d2ba02779f1f8200f61ee
Branched from: unknown
Precision: single
Memory model: 64 bit
MPI library: thread_mpi
OpenMP support: enabled (GMX_OPENMP_MAX_THREADS = 64)
GPU support: CUDA
SIMD instructions: AVX2_256
FFT library: fftw-3.3.8
RDTSCP usage: enabled
TNG support: enabled
Hwloc support: disabled
Tracing support: disabled
C compiler: /usr/bin/gcc GNU 8.4.0
C compiler flags: -mavx2 -mfma -Wall -Wno-unused -Wunused-value -Wunused-parameter -Wextra -Wno-missing-field-initializers -Wno-sign-compare -Wpointer-arith -Wundef -Werror=stringop-truncation -fexcess-precision=fast -funroll-all-loops -Wno-array-bounds -mtune=generic -march=x86-64 -O2 -pipe -mavx -DNDEBUG
C++ compiler: /usr/bin/g++ GNU 8.4.0
C++ compiler flags: -mavx2 -mfma -Wall -Wextra -Wno-missing-field-initializers -Wpointer-arith -Wmissing-declarations -Wundef -Wstringop-truncation -fexcess-precision=fast -funroll-all-loops -Wno-array-bounds -fopenmp -mtune=generic -march=x86-64 -O2 -pipe -mavx -DNDEBUG
CUDA compiler: /usr/local/cuda/bin/nvcc nvcc: NVIDIA (R) Cuda compiler driver;Copyright (c) 2005-2020 NVIDIA Corporation;Built on Wed_May__6_19:09:25_PDT_2020;Cuda compilation tools, release 11.0, V11.0.167;Build cuda_11.0_bu.TC445_37.28358933_0
CUDA compiler flags:-std=c++14;-gencode;arch=compute_70,code=sm_70;-use_fast_math;-D_FORCE_INLINES;-mavx2 -mfma -Wall -Wextra -Wno-missing-field-initializers -Wpointer-arith -Wmissing-declarations -Wundef -Wstringop-truncation -fexcess-precision=fast -funroll-all-loops -Wno-array-bounds -fopenmp -mtune=generic -march=x86-64 -O2 -pipe -mavx -DNDEBUG
CUDA driver: 11.20
CUDA runtime: 11.0
Running on 1 node with total 40 cores, 80 logical cores, 1 compatible GPU
Hardware detected:
CPU info:
Vendor: Intel
Brand: Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz
Family: 6 Model: 85 Stepping: 7
Features: aes apic avx avx2 avx512f avx512cd avx512bw avx512vl clfsh cmov cx8 cx16 f16c fma htt intel lahf mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd rdtscp sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
Number of AVX-512 FMA units: 2
Hardware topology: Basic
Sockets, cores, and logical processors:
Socket 0: [ 0 40] [ 1 41] [ 2 42] [ 3 43] [ 4 44] [ 5 45] [ 6 46] [ 7 47] [ 8 48] [ 9 49] [ 10 50] [ 11 51] [ 12 52] [ 13 53] [ 14 54] [ 15 55] [ 16 56] [ 17 57] [ 18 58] [ 19 59]
Socket 1: [ 20 60] [ 21 61] [ 22 62] [ 23 63] [ 24 64] [ 25 65] [ 26 66] [ 27 67] [ 28 68] [ 29 69] [ 30 70] [ 31 71] [ 32 72] [ 33 73] [ 34 74] [ 35 75] [ 36 76] [ 37 77] [ 38 78] [ 39 79]
GPU info:
Number of GPUs detected: 1
#0: NVIDIA Quadro RTX 6000, compute cap.: 7.5, ECC: no, stat: compatible
Highest SIMD level requested by all nodes in run: AVX_512
SIMD instructions selected at compile time: AVX2_256
This program was compiled for different hardware than you are running on,
which could influence performance. This build might have been configured on a
login node with only a single AVX-512 FMA unit (in which case AVX2 is faster),
while the node you are running on has dual AVX-512 FMA units.
This run will default to '-update gpu' as requested by the GMX_FORCE_UPDATE_DEFAULT_GPU environment variable. GPU update with domain decomposition lacks substantial testing and should be used with caution.
Enabling GPU buffer operations required by GMX_GPU_DD_COMMS (equivalent with GMX_USE_GPU_BUFFER_OPS=1).
This run uses the 'GPU halo exchange' feature, enabled by the GMX_GPU_DD_COMMS environment variable.
This run uses the 'GPU PME-PP communications' feature, enabled by the GMX_GPU_PME_PP_COMMS environment variable.
Input Parameters:
integrator = md
tinit = 0
dt = 0.002
nsteps = 250000
init-step = 0
simulation-part = 1
comm-mode = Linear
nstcomm = 100
bd-fric = 0
ld-seed = 375672286
emtol = 10
emstep = 0.01
niter = 20
fcstep = 0
nstcgsteep = 1000
nbfgscorr = 10
rtpi = 0.05
nstxout = 0
nstvout = 0
nstfout = 0
nstlog = 5000
nstcalcenergy = 100
nstenergy = 5000
nstxout-compressed = 5000
compressed-x-precision = 1000
cutoff-scheme = Verlet
nstlist = 20
pbc = xyz
periodic-molecules = false
verlet-buffer-tolerance = 0.005
rlist = 1.212
coulombtype = PME
coulomb-modifier = Potential-shift
rcoulomb-switch = 0
rcoulomb = 1.2
epsilon-r = 1
epsilon-rf = inf
vdw-type = Cut-off
vdw-modifier = Force-switch
rvdw-switch = 1
rvdw = 1.2
DispCorr = No
table-extension = 1
fourierspacing = 0.12
fourier-nx = 128
fourier-ny = 128
fourier-nz = 144
pme-order = 4
ewald-rtol = 1e-05
ewald-rtol-lj = 0.001
lj-pme-comb-rule = Geometric
ewald-geometry = 0
epsilon-surface = 0
tcoupl = Nose-Hoover
nsttcouple = 20
nh-chain-length = 1
print-nose-hoover-chain-variables = false
pcoupl = Parrinello-Rahman
pcoupltype = Semiisotropic
nstpcouple = 20
tau-p = 5
compressibility (3x3):
compressibility[ 0]={ 4.50000e-05, 0.00000e+00, 0.00000e+00}
compressibility[ 1]={ 0.00000e+00, 4.50000e-05, 0.00000e+00}
compressibility[ 2]={ 0.00000e+00, 0.00000e+00, 4.50000e-05}
ref-p (3x3):
ref-p[ 0]={ 1.00000e+00, 0.00000e+00, 0.00000e+00}
ref-p[ 1]={ 0.00000e+00, 1.00000e+00, 0.00000e+00}
ref-p[ 2]={ 0.00000e+00, 0.00000e+00, 1.00000e+00}
refcoord-scaling = No
posres-com (3):
posres-com[0]= 0.00000e+00
posres-com[1]= 0.00000e+00
posres-com[2]= 0.00000e+00
posres-comB (3):
posres-comB[0]= 0.00000e+00
posres-comB[1]= 0.00000e+00
posres-comB[2]= 0.00000e+00
QMMM = false
QMconstraints = 0
QMMMscheme = 0
MMChargeScaleFactor = 1
qm-opts:
ngQM = 0
constraint-algorithm = Lincs
continuation = true
Shake-SOR = false
shake-tol = 0.0001
lincs-order = 4
lincs-iter = 1
lincs-warnangle = 30
nwall = 0
wall-type = 9-3
wall-r-linpot = -1
wall-atomtype[0] = -1
wall-atomtype[1] = -1
wall-density[0] = 0
wall-density[1] = 0
wall-ewald-zfac = 3
pull = false
awh = false
rotation = false
interactiveMD = false
disre = No
disre-weighting = Conservative
disre-mixed = false
dr-fc = 1000
dr-tau = 0
nstdisreout = 100
orire-fc = 0
orire-tau = 0
nstorireout = 100
free-energy = no
cos-acceleration = 0
deform (3x3):
deform[ 0]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
deform[ 1]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
deform[ 2]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
simulated-tempering = false
swapcoords = no
userint1 = 0
userint2 = 0
userint3 = 0
userint4 = 0
userreal1 = 0
userreal2 = 0
userreal3 = 0
userreal4 = 0
applied-forces:
electric-field:
x:
E0 = 0
omega = 0
t0 = 0
sigma = 0
y:
E0 = 0
omega = 0
t0 = 0
sigma = 0
z:
E0 = 0
omega = 0
t0 = 0
sigma = 0
density-guided-simulation:
active = false
group = protein
similarity-measure = inner-product
atom-spreading-weight = unity
force-constant = 1e+09
gaussian-transform-spreading-width = 0.2
gaussian-transform-spreading-range-in-multiples-of-width = 4
reference-density-filename = reference.mrc
nst = 1
normalize-densities = true
adaptive-force-scaling = false
adaptive-force-scaling-time-constant = 4
grpopts:
nrdf: 62495.2 179918 515571
ref-t: 300 300 300
tau-t: 1 1 1
annealing: No No No
annealing-npoints: 0 0 0
acc: 0 0 0
nfreeze: N N N
energygrp-flags[ 0]: 0
Changing nstlist from 20 to 100, rlist from 1.212 to 1.327
Update task on the GPU was required, by the GMX_FORCE_UPDATE_DEFAULT_GPU environment variable, but the following condition(s) were not satisfied:
Nose-Hoover temperature coupling is not supported.
Will use CPU version of update.
1 GPU selected for this run.
Mapping of GPU IDs to the 2 GPU tasks in the 1 rank on this node:
PP:0,PME:0
PP tasks will do (non-perturbed) short-ranged and most bonded interactions on the GPU
PP task will update and constrain coordinates on the CPU
PME tasks will do all aspects on the GPU
Using 1 MPI thread
Non-default thread affinity set, disabling internal thread affinity
Using 16 OpenMP threads
System total charge: 0.000
Will do PME sum in reciprocal space for electrostatic interactions.
Using a Gaussian width (1/beta) of 0.384195 nm for Ewald
Potential shift: LJ r^-12: -2.648e-01 r^-6: -5.349e-01, Ewald -8.333e-06
Initialized non-bonded Ewald tables, spacing: 1.02e-03 size: 1176
Generated table with 1163 data points for 1-4 COUL.
Tabscale = 500 points/nm
Generated table with 1163 data points for 1-4 LJ6.
Tabscale = 500 points/nm
Generated table with 1163 data points for 1-4 LJ12.
Tabscale = 500 points/nm
Using GPU 8x8 nonbonded short-range kernels
Using a dual 8x8 pair-list setup updated with dynamic, rolling pruning:
outer list: updated every 100 steps, buffer 0.127 nm, rlist 1.327 nm
inner list: updated every 14 steps, buffer 0.001 nm, rlist 1.201 nm
At tolerance 0.005 kJ/mol/ps per atom, equivalent classical 1x1 list would be:
outer list: updated every 100 steps, buffer 0.284 nm, rlist 1.484 nm
inner list: updated every 14 steps, buffer 0.058 nm, rlist 1.258 nm
Initializing LINear Constraint Solver
There are: 357992 Atoms
Center of mass motion removal mode is Linear
We have the following groups for center of mass motion removal:
0: SOLU_MEMB
1: SOLV
Started mdrun on rank 0 Wed Jan 26 14:17:01 2022
Step Time
0 0.00000
   Energies (kJ/mol)
           Bond            U-B    Proper Dih.  Improper Dih.      CMAP Dih.
    5.57838e+04    2.48719e+05    1.84803e+05    4.28059e+03   -2.28378e+03
          LJ-14     Coulomb-14        LJ (SR)   Coulomb (SR)   Coul. recip.
    4.20522e+04    1.34794e+04    2.15573e+05   -4.59834e+06    1.67067e+04
      Potential    Kinetic En.   Total Energy  Conserved En.    Temperature
   -3.81922e+06    9.59816e+05   -2.85941e+06   -2.85920e+06    3.04596e+02
 Pressure (bar)   Constr. rmsd
    3.99005e+02    4.59983e-06
step 1000: timed with pme grid 128 128 144, coulomb cutoff 1.200: 1385.7 M-cycles
step 1200: timed with pme grid 120 120 128, coulomb cutoff 1.340: 1544.8 M-cycles
step 1400: timed with pme grid 108 108 120, coulomb cutoff 1.429: 1688.2 M-cycles
step 1600: timed with pme grid 108 108 128, coulomb cutoff 1.416: 1719.0 M-cycles
step 1800: timed with pme grid 112 112 128, coulomb cutoff 1.365: 1571.8 M-cycles
step 2000: timed with pme grid 120 120 128, coulomb cutoff 1.340: 1541.0 M-cycles
step 2200: timed with pme grid 120 120 144, coulomb cutoff 1.274: 1445.4 M-cycles
step 2400: timed with pme grid 128 128 144, coulomb cutoff 1.200: 1346.2 M-cycles
step 2600: timed with pme grid 120 120 144, coulomb cutoff 1.274: 1444.9 M-cycles
step 2800: timed with pme grid 128 128 144, coulomb cutoff 1.200: 1351.0 M-cycles
optimal pme grid 128 128 144, coulomb cutoff 1.200
Step Time
5000 10.00000
<====== ############### ==>
<==== A V E R A G E S ====>
<== ############### ======>
Statistics over 250001 steps using 2501 frames
   Energies (kJ/mol)
           Bond            U-B    Proper Dih.  Improper Dih.      CMAP Dih.
    5.55737e+04    2.48513e+05    1.83543e+05    4.43680e+03   -2.23988e+03
          LJ-14     Coulomb-14        LJ (SR)   Coulomb (SR)   Coul. recip.
    4.20071e+04    1.50332e+04    2.17391e+05   -4.60268e+06    1.67342e+04
      Potential    Kinetic En.   Total Energy  Conserved En.    Temperature
   -3.82169e+06    9.45369e+05   -2.87632e+06   -2.82994e+06    3.00011e+02
 Pressure (bar)   Constr. rmsd
    1.32668e+00    0.00000e+00
          Box-X          Box-Y          Box-Z
    1.50846e+01    1.30636e+01    1.75994e+01
   Total Virial (kJ/mol)
    3.12620e+05    2.94396e+02   -1.20361e+02
    2.91342e+02    3.12180e+05    2.18304e+02
   -1.25560e+02    2.17851e+02    3.20159e+05
   Pressure (bar)
   -1.65116e+00   -2.03722e+00   -2.01050e+00
   -2.00800e+00    4.38258e+00   -1.36357e+00
   -1.96069e+00   -1.35918e+00    1.24861e+00
         T-SOLU         T-MEMB         T-SOLV
    3.00017e+02    3.00015e+02    3.00009e+02
M E G A - F L O P S A C C O U N T I N G
NB=Group-cutoff nonbonded kernels NxN=N-by-N cluster Verlet kernels
RF=Reaction-Field VdW=Van der Waals QSTab=quadratic-spline table
W3=SPC/TIP3p W4=TIP4p (single or pairs)
V&F=Potential and force V=Potential only F=Force only
Computing:                                M-Number         M-Flops % Flops
-----------------------------------------------------------------------------
Pair Search distance check           110154.048272      991386.434     0.0
NxN Ewald Elec. + LJ [F]          112862084.144832  8803242563.297    97.9
NxN Ewald Elec. + LJ [V&F]          1140476.197760   147121429.511     1.6
1,4 nonbonded interactions            68948.525793     6205367.321     0.1
Shift-X                                 895.337992        5372.028     0.0
Bonds                                 10477.541910      618174.973     0.0
Propers                               78973.315892    18084889.339     0.2
Impropers                              1222.004888      254177.017     0.0
Virial                                 4475.820537       80564.770     0.0
Stop-CM                                 895.337992        8953.380     0.0
Calc-Ekin                              8950.515984      241663.932     0.0
Lincs                                 14729.058916      883743.535     0.0
Lincs-Mat                             96528.386112      386113.544     0.0
Constraint-V                          93725.874902      749806.999     0.0
Constraint-Vir                         3950.140986       94803.384     0.0
Settle                                21422.585690     6919495.178     0.1
CMAP                                    386.251545      656627.626     0.0
Urey-Bradley                          48227.692910     8825667.803     0.1
-----------------------------------------------------------------------------
Total                                               8995370800.071   100.0
-----------------------------------------------------------------------------
R E A L C Y C L E A N D T I M E A C C O U N T I N G
On 1 MPI rank, each using 16 OpenMP threads
Computing:              Num     Num      Call   Wall time    Giga-Cycles
                      Ranks Threads     Count         (s)    total sum      %
-----------------------------------------------------------------------------
Neighbor search           1      16      2501      53.643     2145.720    3.8
Launch GPU ops.           1      16    250001      28.859     1154.360    2.1
Force                     1      16    250001      27.565     1102.619    2.0
Wait PME GPU gather       1      16    250001       7.569      302.758    0.5
Wait Bonded GPU           1      16      2501       0.004        0.142    0.0
Reduce GPU PME F          1      16    250001       1.405       56.206    0.1
Wait GPU NB local         1      16    237500       7.637      305.477    0.5
Wait GPU state copy       1      16    237500    1027.031    41081.436   73.2
NB X/F buffer ops.        1      16    500002       7.404      296.176    0.5
Write traj.               1      16        52       1.582       63.262    0.1
Update                    1      16    250001      60.261     2410.469    4.3
Constraints               1      16    250001     101.647     4065.882    7.2
Rest                                               78.359     3134.361    5.6
-----------------------------------------------------------------------------
Total                                            1402.965    56118.869  100.0
-----------------------------------------------------------------------------
               Core t (s)   Wall t (s)          (%)
       Time:    22447.426     1402.965       1600.0
                 (ns/day)    (hour/ns)
Performance:       30.792        0.779
Finished mdrun on rank 0 Wed Jan 26 14:40:25 2022
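To put the timing table in perspective, here is a minimal Python sketch (my own illustration) that parses the rows of the cycle accounting table and ranks them by wall time. The rows are copied from the log above; the parsing rule, which treats the third-from-last number on each row as the wall time, is an assumption that fits this particular table layout.

```python
# Rows copied from the cycle and time accounting table of the log.
CYCLE_TABLE = """\
Neighbor search 1 16 2501 53.643 2145.720 3.8
Launch GPU ops. 1 16 250001 28.859 1154.360 2.1
Force 1 16 250001 27.565 1102.619 2.0
Wait PME GPU gather 1 16 250001 7.569 302.758 0.5
Wait Bonded GPU 1 16 2501 0.004 0.142 0.0
Reduce GPU PME F 1 16 250001 1.405 56.206 0.1
Wait GPU NB local 1 16 237500 7.637 305.477 0.5
Wait GPU state copy 1 16 237500 1027.031 41081.436 73.2
NB X/F buffer ops. 1 16 500002 7.404 296.176 0.5
Write traj. 1 16 52 1.582 63.262 0.1
Update 1 16 250001 60.261 2410.469 4.3
Constraints 1 16 250001 101.647 4065.882 7.2
Rest 78.359 3134.361 5.6
"""


def is_number(token: str) -> bool:
    try:
        float(token)
    except ValueError:
        return False
    return True


def parse_row(line: str) -> tuple:
    """Split a row into (activity name, wall time in seconds)."""
    tokens = line.split()
    first_numeric = next(i for i, tok in enumerate(tokens) if is_number(tok))
    name = " ".join(tokens[:first_numeric])
    wall_s = float(tokens[-3])  # last three columns: wall time, Giga-cycles, %
    return name, wall_s


def rank_by_wall_time(table: str) -> list:
    """Return (name, wall_s, percent of summed wall time), largest first."""
    rows = [parse_row(line) for line in table.splitlines() if line.strip()]
    total = sum(wall_s for _, wall_s in rows)
    ranked = [(name, wall_s, 100.0 * wall_s / total) for name, wall_s in rows]
    return sorted(ranked, key=lambda row: row[1], reverse=True)


ranking = rank_by_wall_time(CYCLE_TABLE)
for name, wall_s, share in ranking[:3]:
    print(f"{name:<22s} {wall_s:10.3f} s {share:6.1f} %")
```

The ranking puts 'Wait GPU state copy' first at about 73% of the wall time, followed by 'Constraints' and 'Rest', so the CPU-side update and constraints are where I would expect any improvement to come from.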