GROMACS performance on an 8-core workstation

GROMACS version: 2021.4
GROMACS modification: No

Hello,
I just completed a test run on a new installation, and the performance was much lower than I expected: 0.661 ns/day for a 41,175-atom system on an 8-core CPU.

Below I am listing more details about my simulation and my hardware.

The GROMACS website showcases an example reaching ~50 ns/day for a 24,000-atom system on a 6-core CPU (no GPU). Since my system is less than 2x larger, I would have expected to achieve at least 10 ns/day.
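
As a rough sanity check of that expectation, here is a minimal sketch that scales the quoted benchmark linearly with atom count. It assumes throughput is inversely proportional to system size and ignores every hardware and run-setting difference between the two machines, so the result is only an order-of-magnitude guide. The variable names are illustrative.

```python
# Naive size-scaling estimate.
# Assumption: ns/day is inversely proportional to the number of atoms,
# and the two CPUs deliver comparable performance.
ref_ns_per_day = 50.0        # approximate figure quoted for the website example
ref_atoms = 24_000           # atoms in that example
my_atoms = 41_175            # atoms in this system
observed_ns_per_day = 0.661  # performance reported for this run

estimate = ref_ns_per_day * ref_atoms / my_atoms
shortfall = estimate / observed_ns_per_day

print(f"naive estimate: {estimate:.1f} ns/day")
print(f"observed run is about {shortfall:.0f}x slower than that estimate")
```

Under these assumptions the estimate lands well above the 10 ns/day lower bound, so the observed 0.661 ns/day is off by more than an order of magnitude rather than by a tuning margin.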

I have not installed a GPU yet, and before I invest in one I would like to make sure the system is set up optimally.

Is there something I am missing?

This is my simulation:
0.1 ns (100 ps) NPT run, 50,000 steps, 2 fs/step
Total: 41,175 Atoms
Solvent: 12,580 water molecules (37,740 atoms)
Protein: 3,431 atoms (255 residues)
GROMACS version: 2021.4
Running on 1 node with total 8 cores, 16 logical cores
Using 1 MPI thread
Using 16 OpenMP threads
Using SIMD 4x4 nonbonded short-range kernels

Hardware:
AMD Ryzen 7 1700, 3.0GHz 8 cores (16 threads) non overclocked, 16M/AM4/65W
RAM: 16GB DDR4 2400UDIMM

During the simulation, all 16 threads were engaged at ~3.1 GHz with >95% load (according to Conky).

Thanks in advance for your help,

Al

That does indeed seem very low. Please post a full log file; that would help identify any issues.

Hi,
Thanks for your reply. As a new user I can’t post attachments. I have stripped the irrelevant content from the npt.log file and posted the rest below:

GROMACS:      gmx mdrun, version 2021.4
Executable:   /usr/local/bin/gmx
Data prefix:  /usr/local
Working dir:  /home/alex/Modeling/MD/IAB
Process ID:   39199
Command line:
  gmx mdrun -deffnm npt

GROMACS version:    2021.4
Precision:          mixed
Memory model:       64 bit
MPI library:        thread_mpi
OpenMP support:     enabled (GMX_OPENMP_MAX_THREADS = 64)
GPU support:        disabled
SIMD instructions:  AVX2_128
FFT library:        fftw-3.3.8-sse2-avx-avx2-avx2_128
RDTSCP usage:       enabled
TNG support:        enabled
Hwloc support:      disabled
Tracing support:    disabled
C compiler:         /usr/bin/cc GNU 9.3.0
C compiler flags:   -mavx2 -mfma -Wno-missing-field-initializers -fexcess-precision=fast -funroll-all-loops -O3 -DNDEBUG
C++ compiler:       /usr/bin/c++ GNU 9.3.0
C++ compiler flags: -mavx2 -mfma -Wno-missing-field-initializers -fexcess-precision=fast -funroll-all-loops -fopenmp 


Running on 1 node with total 8 cores, 16 logical cores
Hardware detected:
  CPU info:
    Vendor: AMD
    Brand:  AMD Ryzen 7 1700 Eight-Core Processor          
    Family: 23   Model: 1   Stepping: 1
    Features: aes amd apic avx avx2 clfsh cmov cx8 cx16 f16c fma htt lahf misalignsse mmx msr nonstop_tsc pclmuldq pdpe1gb popcnt pse rdrnd rdtscp sha sse2 sse3 sse4a sse4.1 sse4.2 ssse3
  Hardware topology: Basic
    Sockets, cores, and logical processors:
      Socket  0: [   0   1] [   2   3] [   4   5] [   6   7] [   8   9] [  10  11] [  12  13] [  14  15]

Input Parameters:
   integrator                     = md
   tinit                          = 0
   dt                             = 0.002
   nsteps                         = 50000
   init-step                      = 0
   simulation-part                = 1
   mts                            = false
   comm-mode                      = Linear
   nstcomm                        = 100
   bd-fric                        = 0
   ld-seed                        = -272737541
   emtol                          = 10
   emstep                         = 0.01
   niter                          = 20
   fcstep                         = 0
   nstcgsteep                     = 1000
   nbfgscorr                      = 10
   rtpi                           = 0.05
   nstxout                        = 500
   nstvout                        = 500
   nstfout                        = 0
   nstlog                         = 500
   nstcalcenergy                  = 100
   nstenergy                      = 500
   nstxout-compressed             = 0
   compressed-x-precision         = 1000
   cutoff-scheme                  = Verlet
   nstlist                        = 10
   pbc                            = xyz
   periodic-molecules             = false
   verlet-buffer-tolerance        = 0.005
   rlist                          = 1
   coulombtype                    = PME
   coulomb-modifier               = Potential-shift
   rcoulomb-switch                = 0
   rcoulomb                       = 1
   epsilon-r                      = 1
   epsilon-rf                     = inf
   vdw-type                       = Cut-off
   vdw-modifier                   = Potential-shift
   rvdw-switch                    = 0
   rvdw                           = 1
   DispCorr                       = EnerPres
   table-extension                = 1
   fourierspacing                 = 0.16
   fourier-nx                     = 48
   fourier-ny                     = 48
   fourier-nz                     = 48
   pme-order                      = 4
   ewald-rtol                     = 1e-05
   ewald-rtol-lj                  = 0.001
   lj-pme-comb-rule               = Geometric
   ewald-geometry                 = 0
   epsilon-surface                = 0
   tcoupl                         = V-rescale
   nsttcouple                     = 10
   nh-chain-length                = 0
   print-nose-hoover-chain-variables = false
   pcoupl                         = Parrinello-Rahman
   pcoupltype                     = Isotropic
   nstpcouple                     = 10
   tau-p                          = 2
   compressibility (3x3):
      compressibility[    0]={ 4.50000e-05,  0.00000e+00,  0.00000e+00}
      compressibility[    1]={ 0.00000e+00,  4.50000e-05,  0.00000e+00}
      compressibility[    2]={ 0.00000e+00,  0.00000e+00,  4.50000e-05}
   ref-p (3x3):
      ref-p[    0]={ 1.00000e+00,  0.00000e+00,  0.00000e+00}
      ref-p[    1]={ 0.00000e+00,  1.00000e+00,  0.00000e+00}
      ref-p[    2]={ 0.00000e+00,  0.00000e+00,  1.00000e+00}
   refcoord-scaling               = COM
   posres-com (3):
      posres-com[0]= 5.00485e-01
      posres-com[1]= 5.02474e-01
      posres-com[2]= 5.01594e-01
   posres-comB (3):
      posres-comB[0]= 5.00485e-01
      posres-comB[1]= 5.02474e-01
      posres-comB[2]= 5.01594e-01
   QMMM                           = false
qm-opts:
   ngQM                           = 0
   constraint-algorithm           = Lincs
   continuation                   = true
   Shake-SOR                      = false
   shake-tol                      = 0.0001
   lincs-order                    = 4
   lincs-iter                     = 1
   lincs-warnangle                = 30
   nwall                          = 0
   wall-type                      = 9-3
   wall-r-linpot                  = -1
   wall-atomtype[0]               = -1
   wall-atomtype[1]               = -1
   wall-density[0]                = 0
   wall-density[1]                = 0
   wall-ewald-zfac                = 3
   pull                           = false
   awh                            = false
   rotation                       = false
   interactiveMD                  = false
   disre                          = No
   disre-weighting                = Conservative
   disre-mixed                    = false
   dr-fc                          = 1000
   dr-tau                         = 0
   nstdisreout                    = 100
   orire-fc                       = 0
   orire-tau                      = 0
   nstorireout                    = 100
   free-energy                    = no
   cos-acceleration               = 0
   deform (3x3):
      deform[    0]={ 0.00000e+00,  0.00000e+00,  0.00000e+00}
      deform[    1]={ 0.00000e+00,  0.00000e+00,  0.00000e+00}
      deform[    2]={ 0.00000e+00,  0.00000e+00,  0.00000e+00}
   simulated-tempering            = false
   swapcoords                     = no
   userint1                       = 0
   userint2                       = 0
   userint3                       = 0
   userint4                       = 0
   userreal1                      = 0
   userreal2                      = 0
   userreal3                      = 0
   userreal4                      = 0
   applied-forces:
     electric-field:
       x:
         E0                       = 0
         omega                    = 0
         t0                       = 0
         sigma                    = 0
       y:
         E0                       = 0
         omega                    = 0
         t0                       = 0
         sigma                    = 0
       z:
         E0                       = 0
         omega                    = 0
         t0                       = 0
         sigma                    = 0
     density-guided-simulation:
       active                     = false
       group                      = protein
       similarity-measure         = inner-product
       atom-spreading-weight      = unity
       force-constant             = 1e+09
       gaussian-transform-spreading-width = 0.2
       gaussian-transform-spreading-range-in-multiples-of-width = 4
       reference-density-filename = reference.mrc
       nst                        = 1
       normalize-densities        = true
       adaptive-force-scaling     = false
       adaptive-force-scaling-time-constant = 4
       shift-vector               = 
       transformation-matrix      = 
grpopts:
   nrdf:     8602.69     75489.3
   ref-t:         300         300
   tau-t:         0.1         0.1
annealing:          No          No
annealing-npoints:           0           0
   acc:	           0           0           0
   nfreeze:           N           N           N
   energygrp-flags[  0]: 0

Changing nstlist from 10 to 50, rlist from 1 to 1.11

Using 1 MPI thread
Using 16 OpenMP threads 

Pinning threads with an auto-selected logical core stride of 1
System total charge: -0.000
Will do PME sum in reciprocal space for electrostatic interactions.

Using a Gaussian width (1/beta) of 0.320163 nm for Ewald
Potential shift: LJ r^-12: -1.000e+00 r^-6: -1.000e+00, Ewald -1.000e-05
Initialized non-bonded Ewald tables, spacing: 9.33e-04 size: 1073

Generated table with 1055 data points for 1-4 COUL.
Tabscale = 500 points/nm
Generated table with 1055 data points for 1-4 LJ6.
Tabscale = 500 points/nm
Generated table with 1055 data points for 1-4 LJ12.
Tabscale = 500 points/nm
Long Range LJ corr.: <C6> 3.2909e-04


Using SIMD 4x4 nonbonded short-range kernels

Using a dual 4x4 pair-list setup updated with dynamic pruning:
  outer list: updated every 50 steps, buffer 0.110 nm, rlist 1.110 nm
  inner list: updated every 13 steps, buffer 0.003 nm, rlist 1.003 nm
At tolerance 0.005 kJ/mol/ps per atom, equivalent classical 1x1 list would be:
  outer list: updated every 50 steps, buffer 0.239 nm, rlist 1.239 nm
  inner list: updated every 13 steps, buffer 0.052 nm, rlist 1.052 nm

Using Lorentz-Berthelot Lennard-Jones combination rule

Initializing LINear Constraint Solver

The number of constraints is 1690

There are: 41175 Atoms
Center of mass motion removal mode is Linear
We have the following groups for center of mass motion removal:
  0:  rest

Started mdrun on rank 0 Sun Jan  9 22:12:49 2022

           Step           Time
              0        0.00000

   Energies (kJ/mol)
           Bond          Angle    Proper Dih.  Improper Dih.          LJ-14
    2.62752e+03    6.87239e+03    8.87775e+03    4.26168e+02    3.37074e+03
     Coulomb-14        LJ (SR)  Disper. corr.   Coulomb (SR)   Coul. recip.
    3.98406e+04    1.01956e+05   -5.29819e+03   -7.67027e+05    3.88859e+03
 Position Rest.      Potential    Kinetic En.   Total Energy  Conserved En.
    2.20376e-01   -6.04466e+05    1.05252e+05   -4.99214e+05   -4.99188e+05
    Temperature Pres. DC (bar) Pressure (bar)   Constr. rmsd
    3.01072e+02   -1.99963e+02   -1.12481e+03    2.92565e-06

           Step           Time
            500        1.00000

   Energies (kJ/mol)
           Bond          Angle    Proper Dih.  Improper Dih.          LJ-14
    2.76232e+03    6.56919e+03    8.91485e+03    3.74066e+02    3.29520e+03
     Coulomb-14        LJ (SR)  Disper. corr.   Coulomb (SR)   Coul. recip.
    3.96421e+04    1.10388e+05   -5.67387e+03   -7.82778e+05    3.73430e+03
 Position Rest.      Potential    Kinetic En.   Total Energy  Conserved En.
    7.80381e+02   -6.11992e+05    1.05707e+05   -5.06285e+05   -4.98936e+05
    Temperature Pres. DC (bar) Pressure (bar)   Constr. rmsd
    3.02375e+02   -2.29287e+02    2.16987e+02    3.08949e-06

           Step           Time
           1000        2.00000

   Energies (kJ/mol)
           Bond          Angle    Proper Dih.  Improper Dih.          LJ-14
    2.60550e+03    6.52426e+03    8.88778e+03    4.06964e+02    3.25137e+03
     Coulomb-14        LJ (SR)  Disper. corr.   Coulomb (SR)   Coul. recip.
    3.98419e+04    1.08540e+05   -5.70377e+03   -7.82129e+05    3.68749e+03
 Position Rest.      Potential    Kinetic En.   Total Energy  Conserved En.
    8.44355e+02   -6.13243e+05    1.04718e+05   -5.08526e+05   -4.98940e+05
    Temperature Pres. DC (bar) Pressure (bar)   Constr. rmsd
    2.99544e+02   -2.31707e+02    5.17654e-01    2.86946e-06

           Step           Time
           1500        3.00000

   Energies (kJ/mol)
           Bond          Angle    Proper Dih.  Improper Dih.          LJ-14
    2.55369e+03    6.47372e+03    9.01303e+03    4.09099e+02    3.34877e+03
     Coulomb-14        LJ (SR)  Disper. corr.   Coulomb (SR)   Coul. recip.
    3.96822e+04    1.08273e+05   -5.71906e+03   -7.80746e+05    3.56991e+03
 Position Rest.      Potential    Kinetic En.   Total Energy  Conserved En.
    8.64326e+02   -6.12277e+05    1.04020e+05   -5.08257e+05   -4.98916e+05
    Temperature Pres. DC (bar) Pressure (bar)   Constr. rmsd
    2.97548e+02   -2.32950e+02    1.46750e+02    2.83228e-06


...
...


           Step           Time
          50000      100.00000

Writing checkpoint, step 50000 at Mon Jan 10 01:50:32 2022


   Energies (kJ/mol)
           Bond          Angle    Proper Dih.  Improper Dih.          LJ-14
    2.65868e+03    6.46689e+03    8.87458e+03    4.26712e+02    3.29757e+03
     Coulomb-14        LJ (SR)  Disper. corr.   Coulomb (SR)   Coul. recip.
    3.97034e+04    1.07877e+05   -5.73794e+03   -7.82740e+05    3.63468e+03
 Position Rest.      Potential    Kinetic En.   Total Energy  Conserved En.
    9.03844e+02   -6.14635e+05    1.04525e+05   -5.10110e+05   -4.98818e+05
    Temperature Pres. DC (bar) Pressure (bar)   Constr. rmsd
    2.98994e+02   -2.34488e+02    2.32401e+01    3.11058e-06


Energy conservation over simulation part #1 of length 100 ns, time 0 to 100 ns
  Conserved energy drift: 8.96e-05 kJ/mol/ps per atom


	<======  ###############  ==>
	<====  A V E R A G E S  ====>
	<==  ###############  ======>

	Statistics over 50001 steps using 501 frames

   Energies (kJ/mol)
           Bond          Angle    Proper Dih.  Improper Dih.          LJ-14
    2.58042e+03    6.57565e+03    8.87256e+03    4.09996e+02    3.27516e+03
     Coulomb-14        LJ (SR)  Disper. corr.   Coulomb (SR)   Coul. recip.
    3.96588e+04    1.07828e+05   -5.73306e+03   -7.81547e+05    3.64830e+03
 Position Rest.      Potential    Kinetic En.   Total Energy  Conserved En.
    8.64482e+02   -6.13566e+05    1.04913e+05   -5.08653e+05   -4.98868e+05
    Temperature Pres. DC (bar) Pressure (bar)   Constr. rmsd
    3.00104e+02   -2.34097e+02    2.17823e+00    0.00000e+00

          Box-X          Box-Y          Box-Z
    7.41182e+00    7.41182e+00    7.41182e+00

   Total Virial (kJ/mol)
    3.49632e+04   -1.77442e+02   -2.68029e+02
   -1.77536e+02    3.50165e+04   -9.25268e+01
   -2.68075e+02   -9.23627e+01    3.48659e+04

   Pressure (bar)
    4.00992e+00    1.14163e+01    2.23425e+01
    1.14240e+01   -6.13640e+00    9.09152e+00
    2.23463e+01    9.07817e+00    8.66119e+00

      T-Protein  T-non-Protein
    3.00231e+02    3.00089e+02


	M E G A - F L O P S   A C C O U N T I N G

 NB=Group-cutoff nonbonded kernels    NxN=N-by-N cluster Verlet kernels
 RF=Reaction-Field  VdW=Van der Waals  QSTab=quadratic-spline table
 W3=SPC/TIP3p  W4=TIP4p (single or pairs)
 V&F=Potential and force  V=Potential only  F=Force only

 Computing:                               M-Number         M-Flops  % Flops
-----------------------------------------------------------------------------
 Pair Search distance check           14313.835096      128824.516     0.3
 NxN QSTab Elec. + LJ [F]            561563.609592    23024107.993    53.9
 NxN QSTab Elec. + LJ [V&F]            5683.876680      335348.724     0.8
 NxN QSTab Elec. [F]                 464090.345800    15779071.757    36.9
 NxN QSTab Elec. [V&F]                 4696.691736      192564.361     0.5
 1,4 nonbonded interactions             450.059001       40505.310     0.1
 Calc Weights                          6176.373525      222349.447     0.5
 Spread Q Bspline                    131762.635200      263525.270     0.6
 Gather F Bspline                    131762.635200      790575.811     1.9
 3D-FFT                              185299.305912     1482394.447     3.5
 Solve PME                              115.202304        7372.947     0.0
 Shift-X                                 41.216175         247.297     0.0
 Bonds                                   89.101782        5257.005     0.0
 Angles                                 311.706234       52366.647     0.1
 Propers                                481.909638      110357.307     0.3
 Impropers                               35.400708        7363.347     0.0
 Pos. Restr.                             87.051741        4352.587     0.0
 Virial                                 206.141220        3710.542     0.0
 Stop-CM                                 20.628675         206.287     0.0
 Calc-Ekin                              411.832350       11119.473     0.0
 Lincs                                   84.501690        5070.101     0.0
 Lincs-Mat                              421.208424        1684.834     0.0
 Constraint-V                          2056.041120       18504.370     0.0
 Constraint-Vir                         197.189430        4732.546     0.0
 Settle                                 629.012580      232734.655     0.5
-----------------------------------------------------------------------------
 Total                                                42724347.584   100.0
-----------------------------------------------------------------------------


     R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G

On 1 MPI rank, each using 16 OpenMP threads

 Computing:          Num   Num      Call    Wall time         Giga-Cycles
                     Ranks Threads  Count      (s)         total sum    %
-----------------------------------------------------------------------------
 Neighbor search        1   16       1001     171.904       8235.853   1.3
 Force                  1   16      50001   11761.402     563483.831  90.0
 PME mesh               1   16      50001     902.885      43256.855   6.9
 NB X/F buffer ops.     1   16      99001      51.463       2465.549   0.4
 Write traj.            1   16        113       1.387         66.428   0.0
 Update                 1   16      50001      28.337       1357.622   0.2
 Constraints            1   16      50001     126.837       6076.727   1.0
 Rest                                          18.569        889.653   0.1
-----------------------------------------------------------------------------
 Total                                      13062.785     625832.517 100.0
-----------------------------------------------------------------------------
 Breakdown of PME mesh computation
-----------------------------------------------------------------------------
 PME spread             1   16      50001     317.864      15228.748   2.4
 PME gather             1   16      50001     497.147      23818.110   3.8
 PME 3D-FFT             1   16     100002      39.412       1888.205   0.3
 PME solve Elec         1   16      50001      47.287       2265.490   0.4
-----------------------------------------------------------------------------

               Core t (s)   Wall t (s)        (%)
       Time:   209004.518    13062.785     1600.0
                         3h37:42
                 (ns/day)    (hour/ns)
Performance:        0.661       36.285
Finished mdrun on rank 0 Mon Jan 10 01:50:32 2022
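
The headline numbers in this log can be cross-checked by hand. The sketch below recomputes the simulated time, the throughput, and the share of wall time spent in the "Force" row, using only values copied from the log above. Small differences in the last digit are expected, since mdrun's own bookkeeping may count steps slightly differently.

```python
# Values copied from the log above.
nsteps = 50_000        # nsteps
dt_ps = 0.002          # dt, in ps
wall_s = 13062.785     # "Wall t (s)"
core_s = 209004.518    # "Core t (s)"
force_s = 11761.402    # wall time of the "Force" row

simulated_ns = nsteps * dt_ps / 1000.0       # ps -> ns
ns_per_day = simulated_ns * 86_400 / wall_s  # 86,400 seconds per day
hours_per_ns = wall_s / 3600.0 / simulated_ns
threads_busy = core_s / wall_s               # core time per unit of wall time
force_share = 100.0 * force_s / wall_s       # percent of wall time in "Force"

print(f"simulated time : {simulated_ns:.3f} ns")
print(f"performance    : {ns_per_day:.3f} ns/day, {hours_per_ns:.2f} hour/ns")
print(f"core/wall ratio: {threads_busy:.1f}")
print(f"Force share    : {force_share:.1f} %")
```

The run covers 0.1 ns of simulated time, and the recomputed throughput agrees with the reported 0.661 ns/day. The core/wall ratio of 16 matches the 16 OpenMP threads, and the "Force" row accounts for roughly 90% of the wall time.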

I see no reason why this would run so slowly. It is also strange that more than 90% of the runtime is spent in the short-range force calculation (the “Force” counter); that is typically more like 70-80%. How is the performance with 4 or 8 threads? Can you rebuild with cmake . -DGMX_CYCLE_SUBCOUNTERS=ON and post a log from that build? It will show a further breakdown of the wall times.
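
One possible way to set up the thread-count comparison is sketched below. The script only builds and prints the mdrun command lines; it does not run anything. It assumes the run input is npt.tpr (inferred from -deffnm npt in the log). The bench_ntompN output prefix and the 2000-step run length are hypothetical choices, not values from this thread.

```python
import shlex


def build_scan(tpr="npt.tpr", thread_counts=(4, 8, 16), nsteps=2000):
    """Return one gmx mdrun argument list per OpenMP thread count."""
    commands = []
    for n in thread_counts:
        commands.append([
            "gmx", "mdrun",
            "-s", tpr,                     # reuse the existing run input (assumed name)
            "-deffnm", f"bench_ntomp{n}",  # separate output files per run (hypothetical prefix)
            "-ntmpi", "1",                 # one thread-MPI rank, as in the log
            "-ntomp", str(n),              # OpenMP threads for this run
            "-pin", "on",                  # pin threads to cores
            "-nsteps", str(nsteps),        # short benchmark run, overrides the .tpr value
            "-noconfout",                  # skip writing the final structure
        ])
    return commands


# Print the commands so they can be reviewed and pasted into a shell.
for cmd in build_scan():
    print(shlex.join(cmd))
```

After running the printed commands, the Performance line at the end of each bench_ntompN.log gives the ns/day figure to compare across thread counts.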