GROMACS version: 2021.4
GROMACS modification: No
Hello,
I just completed a test run on a new installation, and the performance was much lower than I expected: 0.661 ns/day for a 41,175-atom system on an 8-core CPU.
Below I am listing more details about my simulation and my hardware.
The GROMACS website showcases an example with ~50 ns/day for a 24,000-atom system on a 6-core CPU (no GPU). Since my system is less than 2x larger, I would have expected to achieve at least 10 ns/day.
I have not installed a GPU yet, and before I invest in one I would like to make sure I have the system optimally set up.
Is there something I am missing?
This is my simulation:
100 ps (0.1 ns) NPT run, 50,000 steps, 2 fs/step
Total: 41,175 Atoms
Solvent: 12,580 water molecules (37,740 atoms)
Protein: 3,431 atoms (255 residues)
GROMACS version: 2021.4
Running on 1 node with total 8 cores, 16 logical cores
Using 1 MPI thread
Using 16 OpenMP threads
Using SIMD 4x4 nonbonded short-range kernels
Hardware:
AMD Ryzen 7 1700, 3.0GHz 8 cores (16 threads) non overclocked, 16M/AM4/65W
RAM: 16GB DDR4 2400UDIMM
During the simulation, all 16 threads engaged at ~3.1GHz with >95% load (according to Conky).
Thanks in advance for your help,
Al
That does indeed seem very low. Please post a full log file; that would help identify any issues.
alexmas, January 15, 2022, 11:00pm
Hi,
Thanks for your reply. As a new user I can’t post attachments. I have stripped the npt.log file of irrelevant content and posted it below:
GROMACS: gmx mdrun, version 2021.4
Executable: /usr/local/bin/gmx
Data prefix: /usr/local
Working dir: /home/alex/Modeling/MD/IAB
Process ID: 39199
Command line:
gmx mdrun -deffnm npt
GROMACS version: 2021.4
Precision: mixed
Memory model: 64 bit
MPI library: thread_mpi
OpenMP support: enabled (GMX_OPENMP_MAX_THREADS = 64)
GPU support: disabled
SIMD instructions: AVX2_128
FFT library: fftw-3.3.8-sse2-avx-avx2-avx2_128
RDTSCP usage: enabled
TNG support: enabled
Hwloc support: disabled
Tracing support: disabled
C compiler: /usr/bin/cc GNU 9.3.0
C compiler flags: -mavx2 -mfma -Wno-missing-field-initializers -fexcess-precision=fast -funroll-all-loops -O3 -DNDEBUG
C++ compiler: /usr/bin/c++ GNU 9.3.0
C++ compiler flags: -mavx2 -mfma -Wno-missing-field-initializers -fexcess-precision=fast -funroll-all-loops -fopenmp
Running on 1 node with total 8 cores, 16 logical cores
Hardware detected:
CPU info:
Vendor: AMD
Brand: AMD Ryzen 7 1700 Eight-Core Processor
Family: 23 Model: 1 Stepping: 1
Features: aes amd apic avx avx2 clfsh cmov cx8 cx16 f16c fma htt lahf misalignsse mmx msr nonstop_tsc pclmuldq pdpe1gb popcnt pse rdrnd rdtscp sha sse2 sse3 sse4a sse4.1 sse4.2 ssse3
Hardware topology: Basic
Sockets, cores, and logical processors:
Socket 0: [ 0 1] [ 2 3] [ 4 5] [ 6 7] [ 8 9] [ 10 11] [ 12 13] [ 14 15]
Input Parameters:
integrator = md
tinit = 0
dt = 0.002
nsteps = 50000
init-step = 0
simulation-part = 1
mts = false
comm-mode = Linear
nstcomm = 100
bd-fric = 0
ld-seed = -272737541
emtol = 10
emstep = 0.01
niter = 20
fcstep = 0
nstcgsteep = 1000
nbfgscorr = 10
rtpi = 0.05
nstxout = 500
nstvout = 500
nstfout = 0
nstlog = 500
nstcalcenergy = 100
nstenergy = 500
nstxout-compressed = 0
compressed-x-precision = 1000
cutoff-scheme = Verlet
nstlist = 10
pbc = xyz
periodic-molecules = false
verlet-buffer-tolerance = 0.005
rlist = 1
coulombtype = PME
coulomb-modifier = Potential-shift
rcoulomb-switch = 0
rcoulomb = 1
epsilon-r = 1
epsilon-rf = inf
vdw-type = Cut-off
vdw-modifier = Potential-shift
rvdw-switch = 0
rvdw = 1
DispCorr = EnerPres
table-extension = 1
fourierspacing = 0.16
fourier-nx = 48
fourier-ny = 48
fourier-nz = 48
pme-order = 4
ewald-rtol = 1e-05
ewald-rtol-lj = 0.001
lj-pme-comb-rule = Geometric
ewald-geometry = 0
epsilon-surface = 0
tcoupl = V-rescale
nsttcouple = 10
nh-chain-length = 0
print-nose-hoover-chain-variables = false
pcoupl = Parrinello-Rahman
pcoupltype = Isotropic
nstpcouple = 10
tau-p = 2
compressibility (3x3):
compressibility[ 0]={ 4.50000e-05, 0.00000e+00, 0.00000e+00}
compressibility[ 1]={ 0.00000e+00, 4.50000e-05, 0.00000e+00}
compressibility[ 2]={ 0.00000e+00, 0.00000e+00, 4.50000e-05}
ref-p (3x3):
ref-p[ 0]={ 1.00000e+00, 0.00000e+00, 0.00000e+00}
ref-p[ 1]={ 0.00000e+00, 1.00000e+00, 0.00000e+00}
ref-p[ 2]={ 0.00000e+00, 0.00000e+00, 1.00000e+00}
refcoord-scaling = COM
posres-com (3):
posres-com[0]= 5.00485e-01
posres-com[1]= 5.02474e-01
posres-com[2]= 5.01594e-01
posres-comB (3):
posres-comB[0]= 5.00485e-01
posres-comB[1]= 5.02474e-01
posres-comB[2]= 5.01594e-01
QMMM = false
qm-opts:
ngQM = 0
constraint-algorithm = Lincs
continuation = true
Shake-SOR = false
shake-tol = 0.0001
lincs-order = 4
lincs-iter = 1
lincs-warnangle = 30
nwall = 0
wall-type = 9-3
wall-r-linpot = -1
wall-atomtype[0] = -1
wall-atomtype[1] = -1
wall-density[0] = 0
wall-density[1] = 0
wall-ewald-zfac = 3
pull = false
awh = false
rotation = false
interactiveMD = false
disre = No
disre-weighting = Conservative
disre-mixed = false
dr-fc = 1000
dr-tau = 0
nstdisreout = 100
orire-fc = 0
orire-tau = 0
nstorireout = 100
free-energy = no
cos-acceleration = 0
deform (3x3):
deform[ 0]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
deform[ 1]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
deform[ 2]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
simulated-tempering = false
swapcoords = no
userint1 = 0
userint2 = 0
userint3 = 0
userint4 = 0
userreal1 = 0
userreal2 = 0
userreal3 = 0
userreal4 = 0
applied-forces:
electric-field:
x:
E0 = 0
omega = 0
t0 = 0
sigma = 0
y:
E0 = 0
omega = 0
t0 = 0
sigma = 0
z:
E0 = 0
omega = 0
t0 = 0
sigma = 0
density-guided-simulation:
active = false
group = protein
similarity-measure = inner-product
atom-spreading-weight = unity
force-constant = 1e+09
gaussian-transform-spreading-width = 0.2
gaussian-transform-spreading-range-in-multiples-of-width = 4
reference-density-filename = reference.mrc
nst = 1
normalize-densities = true
adaptive-force-scaling = false
adaptive-force-scaling-time-constant = 4
shift-vector =
transformation-matrix =
grpopts:
nrdf: 8602.69 75489.3
ref-t: 300 300
tau-t: 0.1 0.1
annealing: No No
annealing-npoints: 0 0
acc: 0 0 0
nfreeze: N N N
energygrp-flags[ 0]: 0
Changing nstlist from 10 to 50, rlist from 1 to 1.11
Using 1 MPI thread
Using 16 OpenMP threads
Pinning threads with an auto-selected logical core stride of 1
System total charge: -0.000
Will do PME sum in reciprocal space for electrostatic interactions.
Using a Gaussian width (1/beta) of 0.320163 nm for Ewald
Potential shift: LJ r^-12: -1.000e+00 r^-6: -1.000e+00, Ewald -1.000e-05
Initialized non-bonded Ewald tables, spacing: 9.33e-04 size: 1073
Generated table with 1055 data points for 1-4 COUL.
Tabscale = 500 points/nm
Generated table with 1055 data points for 1-4 LJ6.
Tabscale = 500 points/nm
Generated table with 1055 data points for 1-4 LJ12.
Tabscale = 500 points/nm
Long Range LJ corr.: <C6> 3.2909e-04
Using SIMD 4x4 nonbonded short-range kernels
Using a dual 4x4 pair-list setup updated with dynamic pruning:
outer list: updated every 50 steps, buffer 0.110 nm, rlist 1.110 nm
inner list: updated every 13 steps, buffer 0.003 nm, rlist 1.003 nm
At tolerance 0.005 kJ/mol/ps per atom, equivalent classical 1x1 list would be:
outer list: updated every 50 steps, buffer 0.239 nm, rlist 1.239 nm
inner list: updated every 13 steps, buffer 0.052 nm, rlist 1.052 nm
Using Lorentz-Berthelot Lennard-Jones combination rule
Initializing LINear Constraint Solver
The number of constraints is 1690
There are: 41175 Atoms
Center of mass motion removal mode is Linear
We have the following groups for center of mass motion removal:
0: rest
Started mdrun on rank 0 Sun Jan 9 22:12:49 2022
Step Time
0 0.00000
Energies (kJ/mol)
Bond Angle Proper Dih. Improper Dih. LJ-14
2.62752e+03 6.87239e+03 8.87775e+03 4.26168e+02 3.37074e+03
Coulomb-14 LJ (SR) Disper. corr. Coulomb (SR) Coul. recip.
3.98406e+04 1.01956e+05 -5.29819e+03 -7.67027e+05 3.88859e+03
Position Rest. Potential Kinetic En. Total Energy Conserved En.
2.20376e-01 -6.04466e+05 1.05252e+05 -4.99214e+05 -4.99188e+05
Temperature Pres. DC (bar) Pressure (bar) Constr. rmsd
3.01072e+02 -1.99963e+02 -1.12481e+03 2.92565e-06
Step Time
500 1.00000
Energies (kJ/mol)
Bond Angle Proper Dih. Improper Dih. LJ-14
2.76232e+03 6.56919e+03 8.91485e+03 3.74066e+02 3.29520e+03
Coulomb-14 LJ (SR) Disper. corr. Coulomb (SR) Coul. recip.
3.96421e+04 1.10388e+05 -5.67387e+03 -7.82778e+05 3.73430e+03
Position Rest. Potential Kinetic En. Total Energy Conserved En.
7.80381e+02 -6.11992e+05 1.05707e+05 -5.06285e+05 -4.98936e+05
Temperature Pres. DC (bar) Pressure (bar) Constr. rmsd
3.02375e+02 -2.29287e+02 2.16987e+02 3.08949e-06
Step Time
1000 2.00000
Energies (kJ/mol)
Bond Angle Proper Dih. Improper Dih. LJ-14
2.60550e+03 6.52426e+03 8.88778e+03 4.06964e+02 3.25137e+03
Coulomb-14 LJ (SR) Disper. corr. Coulomb (SR) Coul. recip.
3.98419e+04 1.08540e+05 -5.70377e+03 -7.82129e+05 3.68749e+03
Position Rest. Potential Kinetic En. Total Energy Conserved En.
8.44355e+02 -6.13243e+05 1.04718e+05 -5.08526e+05 -4.98940e+05
Temperature Pres. DC (bar) Pressure (bar) Constr. rmsd
2.99544e+02 -2.31707e+02 5.17654e-01 2.86946e-06
Step Time
1500 3.00000
Energies (kJ/mol)
Bond Angle Proper Dih. Improper Dih. LJ-14
2.55369e+03 6.47372e+03 9.01303e+03 4.09099e+02 3.34877e+03
Coulomb-14 LJ (SR) Disper. corr. Coulomb (SR) Coul. recip.
3.96822e+04 1.08273e+05 -5.71906e+03 -7.80746e+05 3.56991e+03
Position Rest. Potential Kinetic En. Total Energy Conserved En.
8.64326e+02 -6.12277e+05 1.04020e+05 -5.08257e+05 -4.98916e+05
Temperature Pres. DC (bar) Pressure (bar) Constr. rmsd
2.97548e+02 -2.32950e+02 1.46750e+02 2.83228e-06
...
...
Step Time
50000 100.00000
Writing checkpoint, step 50000 at Mon Jan 10 01:50:32 2022
Energies (kJ/mol)
Bond Angle Proper Dih. Improper Dih. LJ-14
2.65868e+03 6.46689e+03 8.87458e+03 4.26712e+02 3.29757e+03
Coulomb-14 LJ (SR) Disper. corr. Coulomb (SR) Coul. recip.
3.97034e+04 1.07877e+05 -5.73794e+03 -7.82740e+05 3.63468e+03
Position Rest. Potential Kinetic En. Total Energy Conserved En.
9.03844e+02 -6.14635e+05 1.04525e+05 -5.10110e+05 -4.98818e+05
Temperature Pres. DC (bar) Pressure (bar) Constr. rmsd
2.98994e+02 -2.34488e+02 2.32401e+01 3.11058e-06
Energy conservation over simulation part #1 of length 100 ns, time 0 to 100 ns
Conserved energy drift: 8.96e-05 kJ/mol/ps per atom
<====== ############### ==>
<==== A V E R A G E S ====>
<== ############### ======>
Statistics over 50001 steps using 501 frames
Energies (kJ/mol)
Bond Angle Proper Dih. Improper Dih. LJ-14
2.58042e+03 6.57565e+03 8.87256e+03 4.09996e+02 3.27516e+03
Coulomb-14 LJ (SR) Disper. corr. Coulomb (SR) Coul. recip.
3.96588e+04 1.07828e+05 -5.73306e+03 -7.81547e+05 3.64830e+03
Position Rest. Potential Kinetic En. Total Energy Conserved En.
8.64482e+02 -6.13566e+05 1.04913e+05 -5.08653e+05 -4.98868e+05
Temperature Pres. DC (bar) Pressure (bar) Constr. rmsd
3.00104e+02 -2.34097e+02 2.17823e+00 0.00000e+00
Box-X Box-Y Box-Z
7.41182e+00 7.41182e+00 7.41182e+00
Total Virial (kJ/mol)
3.49632e+04 -1.77442e+02 -2.68029e+02
-1.77536e+02 3.50165e+04 -9.25268e+01
-2.68075e+02 -9.23627e+01 3.48659e+04
Pressure (bar)
4.00992e+00 1.14163e+01 2.23425e+01
1.14240e+01 -6.13640e+00 9.09152e+00
2.23463e+01 9.07817e+00 8.66119e+00
T-Protein T-non-Protein
3.00231e+02 3.00089e+02
M E G A - F L O P S A C C O U N T I N G
NB=Group-cutoff nonbonded kernels NxN=N-by-N cluster Verlet kernels
RF=Reaction-Field VdW=Van der Waals QSTab=quadratic-spline table
W3=SPC/TIP3p W4=TIP4p (single or pairs)
V&F=Potential and force V=Potential only F=Force only
Computing: M-Number M-Flops % Flops
-----------------------------------------------------------------------------
Pair Search distance check 14313.835096 128824.516 0.3
NxN QSTab Elec. + LJ [F] 561563.609592 23024107.993 53.9
NxN QSTab Elec. + LJ [V&F] 5683.876680 335348.724 0.8
NxN QSTab Elec. [F] 464090.345800 15779071.757 36.9
NxN QSTab Elec. [V&F] 4696.691736 192564.361 0.5
1,4 nonbonded interactions 450.059001 40505.310 0.1
Calc Weights 6176.373525 222349.447 0.5
Spread Q Bspline 131762.635200 263525.270 0.6
Gather F Bspline 131762.635200 790575.811 1.9
3D-FFT 185299.305912 1482394.447 3.5
Solve PME 115.202304 7372.947 0.0
Shift-X 41.216175 247.297 0.0
Bonds 89.101782 5257.005 0.0
Angles 311.706234 52366.647 0.1
Propers 481.909638 110357.307 0.3
Impropers 35.400708 7363.347 0.0
Pos. Restr. 87.051741 4352.587 0.0
Virial 206.141220 3710.542 0.0
Stop-CM 20.628675 206.287 0.0
Calc-Ekin 411.832350 11119.473 0.0
Lincs 84.501690 5070.101 0.0
Lincs-Mat 421.208424 1684.834 0.0
Constraint-V 2056.041120 18504.370 0.0
Constraint-Vir 197.189430 4732.546 0.0
Settle 629.012580 232734.655 0.5
-----------------------------------------------------------------------------
Total 42724347.584 100.0
-----------------------------------------------------------------------------
R E A L C Y C L E A N D T I M E A C C O U N T I N G
On 1 MPI rank, each using 16 OpenMP threads
Computing: Num Num Call Wall time Giga-Cycles
Ranks Threads Count (s) total sum %
-----------------------------------------------------------------------------
Neighbor search 1 16 1001 171.904 8235.853 1.3
Force 1 16 50001 11761.402 563483.831 90.0
PME mesh 1 16 50001 902.885 43256.855 6.9
NB X/F buffer ops. 1 16 99001 51.463 2465.549 0.4
Write traj. 1 16 113 1.387 66.428 0.0
Update 1 16 50001 28.337 1357.622 0.2
Constraints 1 16 50001 126.837 6076.727 1.0
Rest 18.569 889.653 0.1
-----------------------------------------------------------------------------
Total 13062.785 625832.517 100.0
-----------------------------------------------------------------------------
Breakdown of PME mesh computation
-----------------------------------------------------------------------------
PME spread 1 16 50001 317.864 15228.748 2.4
PME gather 1 16 50001 497.147 23818.110 3.8
PME 3D-FFT 1 16 100002 39.412 1888.205 0.3
PME solve Elec 1 16 50001 47.287 2265.490 0.4
-----------------------------------------------------------------------------
Core t (s) Wall t (s) (%)
Time: 209004.518 13062.785 1600.0
3h37:42
(ns/day) (hour/ns)
Performance: 0.661 36.285
Finished mdrun on rank 0 Mon Jan 10 01:50:32 2022
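As a sanity check, the performance figures at the end of the log can be recomputed from the step count, the time step, and the wall time. The sketch below is only an illustration: the function names are made up for this example, and the step count of 50001 is taken from the "Statistics over 50001 steps" line on the assumption that mdrun counts step 0 as well. Note that 50,000 steps of 2 fs correspond to 100 ps (0.1 ns) of simulated time.

```shell
# Recompute ns/day and hour/ns from the numbers reported in an mdrun log.
# Arguments: number of steps, time step in ps, wall time in seconds.
# (ns_per_day and hours_per_ns are hypothetical helper names.)
ns_per_day() {
  awk -v steps="$1" -v dt_ps="$2" -v wall_s="$3" 'BEGIN {
    sim_ns = steps * dt_ps / 1000          # simulated time in ns
    printf "%.3f\n", sim_ns * 86400 / wall_s
  }'
}

hours_per_ns() {
  awk -v steps="$1" -v dt_ps="$2" -v wall_s="$3" 'BEGIN {
    sim_ns = steps * dt_ps / 1000          # simulated time in ns
    printf "%.3f\n", wall_s / 3600 / sim_ns
  }'
}

# Values from the log: 50001 steps, dt = 0.002 ps, wall time 13062.785 s
ns_per_day 50001 0.002 13062.785
hours_per_ns 50001 0.002 13062.785
```

With these inputs the two functions reproduce the 0.661 ns/day and 36.285 hour/ns printed by mdrun, so the reported figure is consistent with the wall time; the run really was that slow.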
I see no reason why this would run so slowly. It is also strange that more than 90% of the runtime is spent in the short-range force calculation (the “Force” counter); that is typically more like 70-80%. How is the performance with 4 or 8 threads? Can you rebuild with
cmake . -DGMX_CYCLE_SUBCOUNTERS=ON
and post a log from that build? This will show a further breakdown of the wall times.
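One way to compare thread counts is a short benchmark loop. This is a rough sketch, not a recommended procedure: the helper names, the output prefix bench_ntomp, and the choice of 5000 steps are assumptions made for illustration, and npt.tpr is assumed to be the run input that goes with the -deffnm npt command above. The mdrun options used (-ntmpi, -ntomp, -pin, -nsteps, -resethway, -noconfout) are standard ones.

```shell
# Run (or just print) a command, depending on DRY_RUN.
run_or_print() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Short benchmark of one .tpr file for several OpenMP thread counts.
# Usage: bench_threads FILE.tpr N1 [N2 ...]
bench_threads() {
  tpr="$1"
  shift
  for n in "$@"; do
    # -resethway resets the timers halfway through, so start-up cost
    # and load balancing do not distort the reported ns/day.
    run_or_print gmx mdrun -s "$tpr" -deffnm "bench_ntomp$n" \
      -ntmpi 1 -ntomp "$n" -pin on -nsteps 5000 -resethway -noconfout
  done
}

# Preview the commands for 4, 8 and 16 threads without running them
( DRY_RUN=1; bench_threads npt.tpr 4 8 16 )
```

Dropping the DRY_RUN=1 setting runs the benchmarks for real; each run writes its own log (bench_ntomp4.log and so on), whose Performance lines can then be compared.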
I was able to figure it out. Apparently there is another gmx executable in:
/usr/local/bin/gmx
I thought it was the version of GROMACS that comes from the Ubuntu repository, but when I tried to remove it with apt-get, it told me gromacs was not installed:
Package 'gromacs' is not installed, so not removed
I sourced the GMXRC file and now gmx points to the right executable:
source /usr/local/gromacs/bin/GMXRC
which gmx
/usr/local/gromacs/bin/gmx
Now my simulation runs at 21 ns/day.
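To check for this kind of mix-up, it can help to list every gmx executable on PATH rather than only the first one that which reports. The following is a minimal sketch; list_gmx is a hypothetical helper name, not a GROMACS tool.

```shell
# Print every executable file named gmx found on PATH, in lookup order.
# The first line printed is the one the shell will actually run.
# The function body runs in a subshell so that changing IFS has no side effects.
list_gmx() (
  IFS=:
  for dir in $PATH; do
    if [ -f "$dir/gmx" ] && [ -x "$dir/gmx" ]; then
      printf '%s\n' "$dir/gmx"
    fi
  done
)

list_gmx
```

If more than one path is printed, the "Executable:" line at the top of the mdrun log shows which binary a given run actually used; in the log above it was /usr/local/bin/gmx. Running gmx --version on each candidate prints its build information for comparison.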
Hi alexmas,
I have the same issue. The performance was lower than I expected (0.636 ns/day).
Could you be more specific about how you figured that out?
Thank you in advance.
Note:
I don’t know anything about the GMXRC file. I couldn’t find it.
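For readers in the same situation: in the earlier post GMXRC was found at /usr/local/gromacs/bin/GMXRC, but the location depends on where GROMACS was installed. A small search like the one below may help locate it. This is only a sketch; find_gmxrc is a hypothetical helper, and the directories searched in the example are guesses at common install prefixes.

```shell
# Search the given directories (up to 4 levels deep) for a file named GMXRC.
# Prints one path per match and nothing if no match is found.
find_gmxrc() {
  find "$@" -maxdepth 4 -type f -name GMXRC 2>/dev/null || true
}

# Example: look in a few likely places (assumed, adjust as needed)
find_gmxrc /usr/local /opt "$HOME"
```

Any path printed can then be passed to source, as in the source /usr/local/gromacs/bin/GMXRC command shown earlier in the thread.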
MagnusL, February 20, 2024, 7:23am
In order to help you more, it would be good to have some more information. How did you install GROMACS? Was it after compilation from source? Or did you install a pre-compiled version, e.g., by using a package manager?
Hi Magnus,
Thanks for asking. My problem is now solved.
I re-installed GROMACS and it worked. Before that, I didn’t know the details because a friend had helped me install GROMACS. I think he missed something that affected the install directory (as alexmas said, the problem was the directory).
My simulation performance has now increased to 28 ns/day. However, I noticed that it dropped slightly between runs: at first it ran at 32 ns/day, then 30, and then 28 ns/day, although these were different stages (NVT, NPT, and production).
Is that actually OK?
Regards,
Antonius
MagnusL, February 20, 2024, 10:06am
Good to hear that it’s working better now. It’s difficult to say what performance to expect, since it depends on the system size and hardware resources, but at least 28-32 ns/day sounds reasonable, whereas the previously reported (0.6 ns/day) sounded very low.
The performance may vary a little bit from one run to another. It would also be affected by system load (from other programs and processes) during the simulation, especially if you are not running the simulations on a computer dedicated only to that.
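To compare runs side by side, the ns/day value can be pulled out of each mdrun log. A minimal sketch follows, assuming the logs end with the usual "Performance:" line as in the log posted earlier; perf_summary is a hypothetical helper, and the file names in the example are placeholders.

```shell
# Print "<log file> <ns/day>" for each mdrun log given as an argument.
# Files that do not exist are skipped silently.
perf_summary() {
  for log in "$@"; do
    [ -f "$log" ] || continue
    # The second field of the "Performance:" line is the ns/day figure.
    awk -v name="$log" '$1 == "Performance:" { print name, $2 }' "$log"
  done
}

# Example with placeholder file names for the three stages
perf_summary nvt.log npt.log md.log
```

Differences of a few ns/day between NVT, NPT, and production runs are not surprising, since the stages use different settings (for example pressure coupling) and the machine load may differ between runs.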