alexmas
January 14, 2022, 5:49am
#1
GROMACS version: 2021.4
GROMACS modification: No
Hello,
I just completed a test run on a new installation and the performance was much lower than I expected: 0.661 ns/day for a 41,175-atom system on an 8-core CPU.
Below I am listing more details about my simulation and my hardware.
The GROMACS website showcases an example reaching ~50 ns/day for a 24,000-atom system on a 6-core CPU (no GPU). Since my system is less than 2x larger, I would have expected to achieve at least 10 ns/day.
I have not installed a GPU yet, and before I invest in it I would like to make sure I have the system optimally set up.
Is there something I am missing?
This is my simulation:
100 ps (0.1 ns) NPT run, 50,000 steps, 2 fs/step
Total: 41,175 Atoms
Solvent: 12,580 water molecules (37,740 atoms)
Protein: 3,431 atoms (255 residues)
GROMACS version: 2021.4
Running on 1 node with total 8 cores, 16 logical cores
Using 1 MPI thread
Using 16 OpenMP threads
Using SIMD 4x4 nonbonded short-range kernels
Hardware:
AMD Ryzen 7 1700, 3.0GHz 8 cores (16 threads) non overclocked, 16M/AM4/65W
RAM: 16GB DDR4 2400UDIMM
During the simulation, all 16 threads were engaged at ~3.1 GHz with >95% load (according to Conky).
Thanks in advance for your help,
Al
That does indeed seem very low. Please post a full log file; that would help identify any issues.
alexmas
January 15, 2022, 11:00pm
#3
Hi,
Thanks for your reply. As a new user I can’t post attachments. I have stripped the irrelevant content from the npt.log file and posted the rest below:
GROMACS: gmx mdrun, version 2021.4
Executable: /usr/local/bin/gmx
Data prefix: /usr/local
Working dir: /home/alex/Modeling/MD/IAB
Process ID: 39199
Command line:
gmx mdrun -deffnm npt
GROMACS version: 2021.4
Precision: mixed
Memory model: 64 bit
MPI library: thread_mpi
OpenMP support: enabled (GMX_OPENMP_MAX_THREADS = 64)
GPU support: disabled
SIMD instructions: AVX2_128
FFT library: fftw-3.3.8-sse2-avx-avx2-avx2_128
RDTSCP usage: enabled
TNG support: enabled
Hwloc support: disabled
Tracing support: disabled
C compiler: /usr/bin/cc GNU 9.3.0
C compiler flags: -mavx2 -mfma -Wno-missing-field-initializers -fexcess-precision=fast -funroll-all-loops -O3 -DNDEBUG
C++ compiler: /usr/bin/c++ GNU 9.3.0
C++ compiler flags: -mavx2 -mfma -Wno-missing-field-initializers -fexcess-precision=fast -funroll-all-loops -fopenmp
Running on 1 node with total 8 cores, 16 logical cores
Hardware detected:
CPU info:
Vendor: AMD
Brand: AMD Ryzen 7 1700 Eight-Core Processor
Family: 23 Model: 1 Stepping: 1
Features: aes amd apic avx avx2 clfsh cmov cx8 cx16 f16c fma htt lahf misalignsse mmx msr nonstop_tsc pclmuldq pdpe1gb popcnt pse rdrnd rdtscp sha sse2 sse3 sse4a sse4.1 sse4.2 ssse3
Hardware topology: Basic
Sockets, cores, and logical processors:
Socket 0: [ 0 1] [ 2 3] [ 4 5] [ 6 7] [ 8 9] [ 10 11] [ 12 13] [ 14 15]
Input Parameters:
integrator = md
tinit = 0
dt = 0.002
nsteps = 50000
init-step = 0
simulation-part = 1
mts = false
comm-mode = Linear
nstcomm = 100
bd-fric = 0
ld-seed = -272737541
emtol = 10
emstep = 0.01
niter = 20
fcstep = 0
nstcgsteep = 1000
nbfgscorr = 10
rtpi = 0.05
nstxout = 500
nstvout = 500
nstfout = 0
nstlog = 500
nstcalcenergy = 100
nstenergy = 500
nstxout-compressed = 0
compressed-x-precision = 1000
cutoff-scheme = Verlet
nstlist = 10
pbc = xyz
periodic-molecules = false
verlet-buffer-tolerance = 0.005
rlist = 1
coulombtype = PME
coulomb-modifier = Potential-shift
rcoulomb-switch = 0
rcoulomb = 1
epsilon-r = 1
epsilon-rf = inf
vdw-type = Cut-off
vdw-modifier = Potential-shift
rvdw-switch = 0
rvdw = 1
DispCorr = EnerPres
table-extension = 1
fourierspacing = 0.16
fourier-nx = 48
fourier-ny = 48
fourier-nz = 48
pme-order = 4
ewald-rtol = 1e-05
ewald-rtol-lj = 0.001
lj-pme-comb-rule = Geometric
ewald-geometry = 0
epsilon-surface = 0
tcoupl = V-rescale
nsttcouple = 10
nh-chain-length = 0
print-nose-hoover-chain-variables = false
pcoupl = Parrinello-Rahman
pcoupltype = Isotropic
nstpcouple = 10
tau-p = 2
compressibility (3x3):
compressibility[ 0]={ 4.50000e-05, 0.00000e+00, 0.00000e+00}
compressibility[ 1]={ 0.00000e+00, 4.50000e-05, 0.00000e+00}
compressibility[ 2]={ 0.00000e+00, 0.00000e+00, 4.50000e-05}
ref-p (3x3):
ref-p[ 0]={ 1.00000e+00, 0.00000e+00, 0.00000e+00}
ref-p[ 1]={ 0.00000e+00, 1.00000e+00, 0.00000e+00}
ref-p[ 2]={ 0.00000e+00, 0.00000e+00, 1.00000e+00}
refcoord-scaling = COM
posres-com (3):
posres-com[0]= 5.00485e-01
posres-com[1]= 5.02474e-01
posres-com[2]= 5.01594e-01
posres-comB (3):
posres-comB[0]= 5.00485e-01
posres-comB[1]= 5.02474e-01
posres-comB[2]= 5.01594e-01
QMMM = false
qm-opts:
ngQM = 0
constraint-algorithm = Lincs
continuation = true
Shake-SOR = false
shake-tol = 0.0001
lincs-order = 4
lincs-iter = 1
lincs-warnangle = 30
nwall = 0
wall-type = 9-3
wall-r-linpot = -1
wall-atomtype[0] = -1
wall-atomtype[1] = -1
wall-density[0] = 0
wall-density[1] = 0
wall-ewald-zfac = 3
pull = false
awh = false
rotation = false
interactiveMD = false
disre = No
disre-weighting = Conservative
disre-mixed = false
dr-fc = 1000
dr-tau = 0
nstdisreout = 100
orire-fc = 0
orire-tau = 0
nstorireout = 100
free-energy = no
cos-acceleration = 0
deform (3x3):
deform[ 0]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
deform[ 1]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
deform[ 2]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
simulated-tempering = false
swapcoords = no
userint1 = 0
userint2 = 0
userint3 = 0
userint4 = 0
userreal1 = 0
userreal2 = 0
userreal3 = 0
userreal4 = 0
applied-forces:
electric-field:
x:
E0 = 0
omega = 0
t0 = 0
sigma = 0
y:
E0 = 0
omega = 0
t0 = 0
sigma = 0
z:
E0 = 0
omega = 0
t0 = 0
sigma = 0
density-guided-simulation:
active = false
group = protein
similarity-measure = inner-product
atom-spreading-weight = unity
force-constant = 1e+09
gaussian-transform-spreading-width = 0.2
gaussian-transform-spreading-range-in-multiples-of-width = 4
reference-density-filename = reference.mrc
nst = 1
normalize-densities = true
adaptive-force-scaling = false
adaptive-force-scaling-time-constant = 4
shift-vector =
transformation-matrix =
grpopts:
nrdf: 8602.69 75489.3
ref-t: 300 300
tau-t: 0.1 0.1
annealing: No No
annealing-npoints: 0 0
acc: 0 0 0
nfreeze: N N N
energygrp-flags[ 0]: 0
Changing nstlist from 10 to 50, rlist from 1 to 1.11
Using 1 MPI thread
Using 16 OpenMP threads
Pinning threads with an auto-selected logical core stride of 1
System total charge: -0.000
Will do PME sum in reciprocal space for electrostatic interactions.
Using a Gaussian width (1/beta) of 0.320163 nm for Ewald
Potential shift: LJ r^-12: -1.000e+00 r^-6: -1.000e+00, Ewald -1.000e-05
Initialized non-bonded Ewald tables, spacing: 9.33e-04 size: 1073
Generated table with 1055 data points for 1-4 COUL.
Tabscale = 500 points/nm
Generated table with 1055 data points for 1-4 LJ6.
Tabscale = 500 points/nm
Generated table with 1055 data points for 1-4 LJ12.
Tabscale = 500 points/nm
Long Range LJ corr.: <C6> 3.2909e-04
Using SIMD 4x4 nonbonded short-range kernels
Using a dual 4x4 pair-list setup updated with dynamic pruning:
outer list: updated every 50 steps, buffer 0.110 nm, rlist 1.110 nm
inner list: updated every 13 steps, buffer 0.003 nm, rlist 1.003 nm
At tolerance 0.005 kJ/mol/ps per atom, equivalent classical 1x1 list would be:
outer list: updated every 50 steps, buffer 0.239 nm, rlist 1.239 nm
inner list: updated every 13 steps, buffer 0.052 nm, rlist 1.052 nm
Using Lorentz-Berthelot Lennard-Jones combination rule
Initializing LINear Constraint Solver
The number of constraints is 1690
There are: 41175 Atoms
Center of mass motion removal mode is Linear
We have the following groups for center of mass motion removal:
0: rest
Started mdrun on rank 0 Sun Jan 9 22:12:49 2022
Step Time
0 0.00000
Energies (kJ/mol)
Bond Angle Proper Dih. Improper Dih. LJ-14
2.62752e+03 6.87239e+03 8.87775e+03 4.26168e+02 3.37074e+03
Coulomb-14 LJ (SR) Disper. corr. Coulomb (SR) Coul. recip.
3.98406e+04 1.01956e+05 -5.29819e+03 -7.67027e+05 3.88859e+03
Position Rest. Potential Kinetic En. Total Energy Conserved En.
2.20376e-01 -6.04466e+05 1.05252e+05 -4.99214e+05 -4.99188e+05
Temperature Pres. DC (bar) Pressure (bar) Constr. rmsd
3.01072e+02 -1.99963e+02 -1.12481e+03 2.92565e-06
Step Time
500 1.00000
Energies (kJ/mol)
Bond Angle Proper Dih. Improper Dih. LJ-14
2.76232e+03 6.56919e+03 8.91485e+03 3.74066e+02 3.29520e+03
Coulomb-14 LJ (SR) Disper. corr. Coulomb (SR) Coul. recip.
3.96421e+04 1.10388e+05 -5.67387e+03 -7.82778e+05 3.73430e+03
Position Rest. Potential Kinetic En. Total Energy Conserved En.
7.80381e+02 -6.11992e+05 1.05707e+05 -5.06285e+05 -4.98936e+05
Temperature Pres. DC (bar) Pressure (bar) Constr. rmsd
3.02375e+02 -2.29287e+02 2.16987e+02 3.08949e-06
Step Time
1000 2.00000
Energies (kJ/mol)
Bond Angle Proper Dih. Improper Dih. LJ-14
2.60550e+03 6.52426e+03 8.88778e+03 4.06964e+02 3.25137e+03
Coulomb-14 LJ (SR) Disper. corr. Coulomb (SR) Coul. recip.
3.98419e+04 1.08540e+05 -5.70377e+03 -7.82129e+05 3.68749e+03
Position Rest. Potential Kinetic En. Total Energy Conserved En.
8.44355e+02 -6.13243e+05 1.04718e+05 -5.08526e+05 -4.98940e+05
Temperature Pres. DC (bar) Pressure (bar) Constr. rmsd
2.99544e+02 -2.31707e+02 5.17654e-01 2.86946e-06
Step Time
1500 3.00000
Energies (kJ/mol)
Bond Angle Proper Dih. Improper Dih. LJ-14
2.55369e+03 6.47372e+03 9.01303e+03 4.09099e+02 3.34877e+03
Coulomb-14 LJ (SR) Disper. corr. Coulomb (SR) Coul. recip.
3.96822e+04 1.08273e+05 -5.71906e+03 -7.80746e+05 3.56991e+03
Position Rest. Potential Kinetic En. Total Energy Conserved En.
8.64326e+02 -6.12277e+05 1.04020e+05 -5.08257e+05 -4.98916e+05
Temperature Pres. DC (bar) Pressure (bar) Constr. rmsd
2.97548e+02 -2.32950e+02 1.46750e+02 2.83228e-06
...
...
Step Time
50000 100.00000
Writing checkpoint, step 50000 at Mon Jan 10 01:50:32 2022
Energies (kJ/mol)
Bond Angle Proper Dih. Improper Dih. LJ-14
2.65868e+03 6.46689e+03 8.87458e+03 4.26712e+02 3.29757e+03
Coulomb-14 LJ (SR) Disper. corr. Coulomb (SR) Coul. recip.
3.97034e+04 1.07877e+05 -5.73794e+03 -7.82740e+05 3.63468e+03
Position Rest. Potential Kinetic En. Total Energy Conserved En.
9.03844e+02 -6.14635e+05 1.04525e+05 -5.10110e+05 -4.98818e+05
Temperature Pres. DC (bar) Pressure (bar) Constr. rmsd
2.98994e+02 -2.34488e+02 2.32401e+01 3.11058e-06
Energy conservation over simulation part #1 of length 100 ns, time 0 to 100 ns
Conserved energy drift: 8.96e-05 kJ/mol/ps per atom
<====== ############### ==>
<==== A V E R A G E S ====>
<== ############### ======>
Statistics over 50001 steps using 501 frames
Energies (kJ/mol)
Bond Angle Proper Dih. Improper Dih. LJ-14
2.58042e+03 6.57565e+03 8.87256e+03 4.09996e+02 3.27516e+03
Coulomb-14 LJ (SR) Disper. corr. Coulomb (SR) Coul. recip.
3.96588e+04 1.07828e+05 -5.73306e+03 -7.81547e+05 3.64830e+03
Position Rest. Potential Kinetic En. Total Energy Conserved En.
8.64482e+02 -6.13566e+05 1.04913e+05 -5.08653e+05 -4.98868e+05
Temperature Pres. DC (bar) Pressure (bar) Constr. rmsd
3.00104e+02 -2.34097e+02 2.17823e+00 0.00000e+00
Box-X Box-Y Box-Z
7.41182e+00 7.41182e+00 7.41182e+00
Total Virial (kJ/mol)
3.49632e+04 -1.77442e+02 -2.68029e+02
-1.77536e+02 3.50165e+04 -9.25268e+01
-2.68075e+02 -9.23627e+01 3.48659e+04
Pressure (bar)
4.00992e+00 1.14163e+01 2.23425e+01
1.14240e+01 -6.13640e+00 9.09152e+00
2.23463e+01 9.07817e+00 8.66119e+00
T-Protein T-non-Protein
3.00231e+02 3.00089e+02
M E G A - F L O P S A C C O U N T I N G
NB=Group-cutoff nonbonded kernels NxN=N-by-N cluster Verlet kernels
RF=Reaction-Field VdW=Van der Waals QSTab=quadratic-spline table
W3=SPC/TIP3p W4=TIP4p (single or pairs)
V&F=Potential and force V=Potential only F=Force only
Computing: M-Number M-Flops % Flops
-----------------------------------------------------------------------------
Pair Search distance check 14313.835096 128824.516 0.3
NxN QSTab Elec. + LJ [F] 561563.609592 23024107.993 53.9
NxN QSTab Elec. + LJ [V&F] 5683.876680 335348.724 0.8
NxN QSTab Elec. [F] 464090.345800 15779071.757 36.9
NxN QSTab Elec. [V&F] 4696.691736 192564.361 0.5
1,4 nonbonded interactions 450.059001 40505.310 0.1
Calc Weights 6176.373525 222349.447 0.5
Spread Q Bspline 131762.635200 263525.270 0.6
Gather F Bspline 131762.635200 790575.811 1.9
3D-FFT 185299.305912 1482394.447 3.5
Solve PME 115.202304 7372.947 0.0
Shift-X 41.216175 247.297 0.0
Bonds 89.101782 5257.005 0.0
Angles 311.706234 52366.647 0.1
Propers 481.909638 110357.307 0.3
Impropers 35.400708 7363.347 0.0
Pos. Restr. 87.051741 4352.587 0.0
Virial 206.141220 3710.542 0.0
Stop-CM 20.628675 206.287 0.0
Calc-Ekin 411.832350 11119.473 0.0
Lincs 84.501690 5070.101 0.0
Lincs-Mat 421.208424 1684.834 0.0
Constraint-V 2056.041120 18504.370 0.0
Constraint-Vir 197.189430 4732.546 0.0
Settle 629.012580 232734.655 0.5
-----------------------------------------------------------------------------
Total 42724347.584 100.0
-----------------------------------------------------------------------------
R E A L C Y C L E A N D T I M E A C C O U N T I N G
On 1 MPI rank, each using 16 OpenMP threads
Computing: Num Num Call Wall time Giga-Cycles
Ranks Threads Count (s) total sum %
-----------------------------------------------------------------------------
Neighbor search 1 16 1001 171.904 8235.853 1.3
Force 1 16 50001 11761.402 563483.831 90.0
PME mesh 1 16 50001 902.885 43256.855 6.9
NB X/F buffer ops. 1 16 99001 51.463 2465.549 0.4
Write traj. 1 16 113 1.387 66.428 0.0
Update 1 16 50001 28.337 1357.622 0.2
Constraints 1 16 50001 126.837 6076.727 1.0
Rest 18.569 889.653 0.1
-----------------------------------------------------------------------------
Total 13062.785 625832.517 100.0
-----------------------------------------------------------------------------
Breakdown of PME mesh computation
-----------------------------------------------------------------------------
PME spread 1 16 50001 317.864 15228.748 2.4
PME gather 1 16 50001 497.147 23818.110 3.8
PME 3D-FFT 1 16 100002 39.412 1888.205 0.3
PME solve Elec 1 16 50001 47.287 2265.490 0.4
-----------------------------------------------------------------------------
Core t (s) Wall t (s) (%)
Time: 209004.518 13062.785 1600.0
3h37:42
(ns/day) (hour/ns)
Performance: 0.661 36.285
Finished mdrun on rank 0 Mon Jan 10 01:50:32 2022
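As a sanity check, the ns/day figure in the log can be reproduced from three values it already contains: nsteps, dt, and the wall time. The snippet below is only a minimal sketch; the function name ns_per_day is made up for illustration, and it ignores the extra step 0 that mdrun counts (50001 steps in total), which does not change the result at three decimals.

```shell
#!/bin/sh
# ns_per_day is a hypothetical helper, not part of GROMACS.
# $1 = number of steps, $2 = time step in ps, $3 = wall time in seconds
ns_per_day() {
    awk -v n="$1" -v dt="$2" -v w="$3" 'BEGIN {
        sim_ns = n * dt / 1000              # simulated time in ns
        printf "%.3f\n", sim_ns * 86400 / w # ns simulated per 24 h of wall time
    }'
}

# Values copied from the log: nsteps = 50000, dt = 0.002 ps, Wall t = 13062.785 s
ns_per_day 50000 0.002 13062.785   # prints 0.661
```

So the run covered 0.1 ns of simulated time in about 3.6 hours, which matches the reported 0.661 ns/day.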
I see no reason why this would run so slowly. It is also strange that >90% of the runtime is spent in the short-range force calculation (the “Force” counter); that is typically more like 70-80%. How is the performance with 4 / 8 threads? Can you rebuild with cmake . -DGMX_CYCLE_SUBCOUNTERS=ON
and post a log from that build? This will show a further breakdown of the wall times.
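One way to compare thread counts is to run a few short benchmarks from the same input and look at the Performance line of each log. The script below is a rough sketch under some assumptions: the run input is named npt.tpr (matching -deffnm npt in the log), the output prefix bench_omp and the helper bench_cmd are made up for illustration, and 5000 steps is an arbitrary benchmark length. By default it only prints the commands; set RUN=1 to execute them.

```shell
#!/bin/sh
# bench_cmd is a hypothetical helper that builds one mdrun command line.
# $1 = number of OpenMP threads
bench_cmd() {
    # -ntmpi 1    : one thread-MPI rank, as in the original run
    # -ntomp N    : number of OpenMP threads to test
    # -pin on     : pin threads to cores
    # -nsteps     : override the step count for a short benchmark
    # -resethway  : reset the timers halfway, so start-up cost is excluded
    echo "gmx mdrun -s npt.tpr -deffnm bench_omp$1 -ntmpi 1 -ntomp $1 -pin on -nsteps 5000 -resethway"
}

for n in 4 8 16; do
    cmd=$(bench_cmd "$n")
    echo "$cmd"
    # Only execute when explicitly requested (RUN=1).
    if [ "${RUN:-0}" = "1" ]; then
        $cmd
    fi
done
```

After the runs finish, something like grep Performance bench_omp*.log should give the ns/day for each thread count side by side.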
alexmas
January 28, 2022, 6:10am
#5
I was able to figure it out. Apparently there is another gmx executable in:
/usr/local/bin/gmx
I thought it was the version of GROMACS that comes from the Ubuntu repository, but when I tried to remove it with apt-get, it told me gromacs was not installed:
Package 'gromacs' is not installed, so not removed
I sourced the GMXRC file and now gmx points to the right executable:
source /usr/local/gromacs/bin/GMXRC
which gmx
/usr/local/gromacs/bin/gmx
Now my simulation runs at 21 ns/day.
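For anyone hitting the same problem: which gmx shows only the first match, so a second installation further down the PATH is easy to miss. The following is a minimal sketch that lists every gmx found on the PATH in lookup order; the function name list_on_path is made up for illustration.

```shell
#!/bin/sh
# list_on_path is a hypothetical helper, not a GROMACS or system command.
# $1 = program name, $2 = colon-separated directory list (defaults to $PATH)
list_on_path() {
    old_ifs=$IFS
    IFS=:
    # Unquoted on purpose: split the list on ':' into directories.
    for dir in ${2:-$PATH}; do
        # Report regular executables only, skipping directories of that name.
        if [ -x "$dir/$1" ] && [ ! -d "$dir/$1" ]; then
            echo "$dir/$1"
        fi
    done
    IFS=$old_ifs
}

# The first line printed is the executable the shell will actually run.
list_on_path gmx
```

If more than one path is printed, the Executable line in the header of the .log file (or in the output of gmx --version) tells you which binary was actually used for a given run.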