GROMACS performance on an 8-core workstation

GROMACS version: 2021.4
GROMACS modification: No

Hello,
I just completed a test run on a new installation and the performance was much lower than I expected: 0.661 ns/day for a 41,175-atom system on an 8-core CPU.

Below I am listing more details about my simulation and my hardware.

The GROMACS website showcases an example with ~50 ns/day for a 24,000-atom system on a 6-core CPU (no GPU). Since my system is less than 2x larger, I would have expected to achieve at least 10 ns/day.
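As a rough check of that expectation, here is a minimal sketch that scales the reference figure by atom count. It assumes throughput is inversely proportional to the number of atoms and ignores every hardware difference between the two machines, so it is only an order-of-magnitude guide. The function name is made up for illustration.

```python
def scaled_ns_per_day(ref_ns_per_day, ref_atoms, atoms):
    """Naive estimate: assume throughput scales inversely with atom count.

    This ignores CPU model, core count, cut-offs and PME settings.
    """
    return ref_ns_per_day * ref_atoms / atoms


# Reference figures quoted above: ~50 ns/day for 24,000 atoms.
estimate = scaled_ns_per_day(50.0, 24_000, 41_175)
measured = 0.661  # ns/day, from the test run

print(f"naive estimate: {estimate:.1f} ns/day")
print(f"measured value is about {estimate / measured:.0f}x lower")
```

Even with generous allowance for hardware differences, a gap of this size suggests a problem with the setup rather than with the system size.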

I have not installed a GPU yet, and before I invest in one I would like to make sure the system is set up optimally.

Is there something I am missing?

This is my simulation:
100 ps (0.1 ns) NPT run, 50,000 steps, 2 fs/step
Total: 41,175 Atoms
Solvent: 12,580 water molecules (37,740 atoms)
Protein: 3,431 atoms (255 residues)
GROMACS version: 2021.4
Running on 1 node with total 8 cores, 16 logical cores
Using 1 MPI thread
Using 16 OpenMP threads
Using SIMD 4x4 nonbonded short-range kernels

Hardware:
AMD Ryzen 7 1700, 3.0 GHz, 8 cores (16 threads), not overclocked, 16M/AM4/65W
RAM: 16GB DDR4 2400UDIMM

During the simulation, all 16 threads were engaged at ~3.1 GHz with >95% load (according to Conky).

Thanks in advance for your help,

Al

That does indeed seem very low. Please post a full log file; that would help identify any issues.

Hi,
Thanks for your reply. As a new user I can’t post attachments, so I have stripped the npt.log file of irrelevant content and posted it below:

GROMACS:      gmx mdrun, version 2021.4
Executable:   /usr/local/bin/gmx
Data prefix:  /usr/local
Working dir:  /home/alex/Modeling/MD/IAB
Process ID:   39199
Command line:
  gmx mdrun -deffnm npt

GROMACS version:    2021.4
Precision:          mixed
Memory model:       64 bit
MPI library:        thread_mpi
OpenMP support:     enabled (GMX_OPENMP_MAX_THREADS = 64)
GPU support:        disabled
SIMD instructions:  AVX2_128
FFT library:        fftw-3.3.8-sse2-avx-avx2-avx2_128
RDTSCP usage:       enabled
TNG support:        enabled
Hwloc support:      disabled
Tracing support:    disabled
C compiler:         /usr/bin/cc GNU 9.3.0
C compiler flags:   -mavx2 -mfma -Wno-missing-field-initializers -fexcess-precision=fast -funroll-all-loops -O3 -DNDEBUG
C++ compiler:       /usr/bin/c++ GNU 9.3.0
C++ compiler flags: -mavx2 -mfma -Wno-missing-field-initializers -fexcess-precision=fast -funroll-all-loops -fopenmp 


Running on 1 node with total 8 cores, 16 logical cores
Hardware detected:
  CPU info:
    Vendor: AMD
    Brand:  AMD Ryzen 7 1700 Eight-Core Processor          
    Family: 23   Model: 1   Stepping: 1
    Features: aes amd apic avx avx2 clfsh cmov cx8 cx16 f16c fma htt lahf misalignsse mmx msr nonstop_tsc pclmuldq pdpe1gb popcnt pse rdrnd rdtscp sha sse2 sse3 sse4a sse4.1 sse4.2 ssse3
  Hardware topology: Basic
    Sockets, cores, and logical processors:
      Socket  0: [   0   1] [   2   3] [   4   5] [   6   7] [   8   9] [  10  11] [  12  13] [  14  15]

Input Parameters:
   integrator                     = md
   tinit                          = 0
   dt                             = 0.002
   nsteps                         = 50000
   init-step                      = 0
   simulation-part                = 1
   mts                            = false
   comm-mode                      = Linear
   nstcomm                        = 100
   bd-fric                        = 0
   ld-seed                        = -272737541
   emtol                          = 10
   emstep                         = 0.01
   niter                          = 20
   fcstep                         = 0
   nstcgsteep                     = 1000
   nbfgscorr                      = 10
   rtpi                           = 0.05
   nstxout                        = 500
   nstvout                        = 500
   nstfout                        = 0
   nstlog                         = 500
   nstcalcenergy                  = 100
   nstenergy                      = 500
   nstxout-compressed             = 0
   compressed-x-precision         = 1000
   cutoff-scheme                  = Verlet
   nstlist                        = 10
   pbc                            = xyz
   periodic-molecules             = false
   verlet-buffer-tolerance        = 0.005
   rlist                          = 1
   coulombtype                    = PME
   coulomb-modifier               = Potential-shift
   rcoulomb-switch                = 0
   rcoulomb                       = 1
   epsilon-r                      = 1
   epsilon-rf                     = inf
   vdw-type                       = Cut-off
   vdw-modifier                   = Potential-shift
   rvdw-switch                    = 0
   rvdw                           = 1
   DispCorr                       = EnerPres
   table-extension                = 1
   fourierspacing                 = 0.16
   fourier-nx                     = 48
   fourier-ny                     = 48
   fourier-nz                     = 48
   pme-order                      = 4
   ewald-rtol                     = 1e-05
   ewald-rtol-lj                  = 0.001
   lj-pme-comb-rule               = Geometric
   ewald-geometry                 = 0
   epsilon-surface                = 0
   tcoupl                         = V-rescale
   nsttcouple                     = 10
   nh-chain-length                = 0
   print-nose-hoover-chain-variables = false
   pcoupl                         = Parrinello-Rahman
   pcoupltype                     = Isotropic
   nstpcouple                     = 10
   tau-p                          = 2
   compressibility (3x3):
      compressibility[    0]={ 4.50000e-05,  0.00000e+00,  0.00000e+00}
      compressibility[    1]={ 0.00000e+00,  4.50000e-05,  0.00000e+00}
      compressibility[    2]={ 0.00000e+00,  0.00000e+00,  4.50000e-05}
   ref-p (3x3):
      ref-p[    0]={ 1.00000e+00,  0.00000e+00,  0.00000e+00}
      ref-p[    1]={ 0.00000e+00,  1.00000e+00,  0.00000e+00}
      ref-p[    2]={ 0.00000e+00,  0.00000e+00,  1.00000e+00}
   refcoord-scaling               = COM
   posres-com (3):
      posres-com[0]= 5.00485e-01
      posres-com[1]= 5.02474e-01
      posres-com[2]= 5.01594e-01
   posres-comB (3):
      posres-comB[0]= 5.00485e-01
      posres-comB[1]= 5.02474e-01
      posres-comB[2]= 5.01594e-01
   QMMM                           = false
qm-opts:
   ngQM                           = 0
   constraint-algorithm           = Lincs
   continuation                   = true
   Shake-SOR                      = false
   shake-tol                      = 0.0001
   lincs-order                    = 4
   lincs-iter                     = 1
   lincs-warnangle                = 30
   nwall                          = 0
   wall-type                      = 9-3
   wall-r-linpot                  = -1
   wall-atomtype[0]               = -1
   wall-atomtype[1]               = -1
   wall-density[0]                = 0
   wall-density[1]                = 0
   wall-ewald-zfac                = 3
   pull                           = false
   awh                            = false
   rotation                       = false
   interactiveMD                  = false
   disre                          = No
   disre-weighting                = Conservative
   disre-mixed                    = false
   dr-fc                          = 1000
   dr-tau                         = 0
   nstdisreout                    = 100
   orire-fc                       = 0
   orire-tau                      = 0
   nstorireout                    = 100
   free-energy                    = no
   cos-acceleration               = 0
   deform (3x3):
      deform[    0]={ 0.00000e+00,  0.00000e+00,  0.00000e+00}
      deform[    1]={ 0.00000e+00,  0.00000e+00,  0.00000e+00}
      deform[    2]={ 0.00000e+00,  0.00000e+00,  0.00000e+00}
   simulated-tempering            = false
   swapcoords                     = no
   userint1                       = 0
   userint2                       = 0
   userint3                       = 0
   userint4                       = 0
   userreal1                      = 0
   userreal2                      = 0
   userreal3                      = 0
   userreal4                      = 0
   applied-forces:
     electric-field:
       x:
         E0                       = 0
         omega                    = 0
         t0                       = 0
         sigma                    = 0
       y:
         E0                       = 0
         omega                    = 0
         t0                       = 0
         sigma                    = 0
       z:
         E0                       = 0
         omega                    = 0
         t0                       = 0
         sigma                    = 0
     density-guided-simulation:
       active                     = false
       group                      = protein
       similarity-measure         = inner-product
       atom-spreading-weight      = unity
       force-constant             = 1e+09
       gaussian-transform-spreading-width = 0.2
       gaussian-transform-spreading-range-in-multiples-of-width = 4
       reference-density-filename = reference.mrc
       nst                        = 1
       normalize-densities        = true
       adaptive-force-scaling     = false
       adaptive-force-scaling-time-constant = 4
       shift-vector               = 
       transformation-matrix      = 
grpopts:
   nrdf:     8602.69     75489.3
   ref-t:         300         300
   tau-t:         0.1         0.1
annealing:          No          No
annealing-npoints:           0           0
   acc:	           0           0           0
   nfreeze:           N           N           N
   energygrp-flags[  0]: 0

Changing nstlist from 10 to 50, rlist from 1 to 1.11

Using 1 MPI thread
Using 16 OpenMP threads 

Pinning threads with an auto-selected logical core stride of 1
System total charge: -0.000
Will do PME sum in reciprocal space for electrostatic interactions.

Using a Gaussian width (1/beta) of 0.320163 nm for Ewald
Potential shift: LJ r^-12: -1.000e+00 r^-6: -1.000e+00, Ewald -1.000e-05
Initialized non-bonded Ewald tables, spacing: 9.33e-04 size: 1073

Generated table with 1055 data points for 1-4 COUL.
Tabscale = 500 points/nm
Generated table with 1055 data points for 1-4 LJ6.
Tabscale = 500 points/nm
Generated table with 1055 data points for 1-4 LJ12.
Tabscale = 500 points/nm
Long Range LJ corr.: <C6> 3.2909e-04


Using SIMD 4x4 nonbonded short-range kernels

Using a dual 4x4 pair-list setup updated with dynamic pruning:
  outer list: updated every 50 steps, buffer 0.110 nm, rlist 1.110 nm
  inner list: updated every 13 steps, buffer 0.003 nm, rlist 1.003 nm
At tolerance 0.005 kJ/mol/ps per atom, equivalent classical 1x1 list would be:
  outer list: updated every 50 steps, buffer 0.239 nm, rlist 1.239 nm
  inner list: updated every 13 steps, buffer 0.052 nm, rlist 1.052 nm

Using Lorentz-Berthelot Lennard-Jones combination rule

Initializing LINear Constraint Solver

The number of constraints is 1690

There are: 41175 Atoms
Center of mass motion removal mode is Linear
We have the following groups for center of mass motion removal:
  0:  rest

Started mdrun on rank 0 Sun Jan  9 22:12:49 2022

           Step           Time
              0        0.00000

   Energies (kJ/mol)
           Bond          Angle    Proper Dih.  Improper Dih.          LJ-14
    2.62752e+03    6.87239e+03    8.87775e+03    4.26168e+02    3.37074e+03
     Coulomb-14        LJ (SR)  Disper. corr.   Coulomb (SR)   Coul. recip.
    3.98406e+04    1.01956e+05   -5.29819e+03   -7.67027e+05    3.88859e+03
 Position Rest.      Potential    Kinetic En.   Total Energy  Conserved En.
    2.20376e-01   -6.04466e+05    1.05252e+05   -4.99214e+05   -4.99188e+05
    Temperature Pres. DC (bar) Pressure (bar)   Constr. rmsd
    3.01072e+02   -1.99963e+02   -1.12481e+03    2.92565e-06

           Step           Time
            500        1.00000

   Energies (kJ/mol)
           Bond          Angle    Proper Dih.  Improper Dih.          LJ-14
    2.76232e+03    6.56919e+03    8.91485e+03    3.74066e+02    3.29520e+03
     Coulomb-14        LJ (SR)  Disper. corr.   Coulomb (SR)   Coul. recip.
    3.96421e+04    1.10388e+05   -5.67387e+03   -7.82778e+05    3.73430e+03
 Position Rest.      Potential    Kinetic En.   Total Energy  Conserved En.
    7.80381e+02   -6.11992e+05    1.05707e+05   -5.06285e+05   -4.98936e+05
    Temperature Pres. DC (bar) Pressure (bar)   Constr. rmsd
    3.02375e+02   -2.29287e+02    2.16987e+02    3.08949e-06

           Step           Time
           1000        2.00000

   Energies (kJ/mol)
           Bond          Angle    Proper Dih.  Improper Dih.          LJ-14
    2.60550e+03    6.52426e+03    8.88778e+03    4.06964e+02    3.25137e+03
     Coulomb-14        LJ (SR)  Disper. corr.   Coulomb (SR)   Coul. recip.
    3.98419e+04    1.08540e+05   -5.70377e+03   -7.82129e+05    3.68749e+03
 Position Rest.      Potential    Kinetic En.   Total Energy  Conserved En.
    8.44355e+02   -6.13243e+05    1.04718e+05   -5.08526e+05   -4.98940e+05
    Temperature Pres. DC (bar) Pressure (bar)   Constr. rmsd
    2.99544e+02   -2.31707e+02    5.17654e-01    2.86946e-06

           Step           Time
           1500        3.00000

   Energies (kJ/mol)
           Bond          Angle    Proper Dih.  Improper Dih.          LJ-14
    2.55369e+03    6.47372e+03    9.01303e+03    4.09099e+02    3.34877e+03
     Coulomb-14        LJ (SR)  Disper. corr.   Coulomb (SR)   Coul. recip.
    3.96822e+04    1.08273e+05   -5.71906e+03   -7.80746e+05    3.56991e+03
 Position Rest.      Potential    Kinetic En.   Total Energy  Conserved En.
    8.64326e+02   -6.12277e+05    1.04020e+05   -5.08257e+05   -4.98916e+05
    Temperature Pres. DC (bar) Pressure (bar)   Constr. rmsd
    2.97548e+02   -2.32950e+02    1.46750e+02    2.83228e-06


...
...


           Step           Time
          50000      100.00000

Writing checkpoint, step 50000 at Mon Jan 10 01:50:32 2022


   Energies (kJ/mol)
           Bond          Angle    Proper Dih.  Improper Dih.          LJ-14
    2.65868e+03    6.46689e+03    8.87458e+03    4.26712e+02    3.29757e+03
     Coulomb-14        LJ (SR)  Disper. corr.   Coulomb (SR)   Coul. recip.
    3.97034e+04    1.07877e+05   -5.73794e+03   -7.82740e+05    3.63468e+03
 Position Rest.      Potential    Kinetic En.   Total Energy  Conserved En.
    9.03844e+02   -6.14635e+05    1.04525e+05   -5.10110e+05   -4.98818e+05
    Temperature Pres. DC (bar) Pressure (bar)   Constr. rmsd
    2.98994e+02   -2.34488e+02    2.32401e+01    3.11058e-06


Energy conservation over simulation part #1 of length 100 ns, time 0 to 100 ns
  Conserved energy drift: 8.96e-05 kJ/mol/ps per atom


	<======  ###############  ==>
	<====  A V E R A G E S  ====>
	<==  ###############  ======>

	Statistics over 50001 steps using 501 frames

   Energies (kJ/mol)
           Bond          Angle    Proper Dih.  Improper Dih.          LJ-14
    2.58042e+03    6.57565e+03    8.87256e+03    4.09996e+02    3.27516e+03
     Coulomb-14        LJ (SR)  Disper. corr.   Coulomb (SR)   Coul. recip.
    3.96588e+04    1.07828e+05   -5.73306e+03   -7.81547e+05    3.64830e+03
 Position Rest.      Potential    Kinetic En.   Total Energy  Conserved En.
    8.64482e+02   -6.13566e+05    1.04913e+05   -5.08653e+05   -4.98868e+05
    Temperature Pres. DC (bar) Pressure (bar)   Constr. rmsd
    3.00104e+02   -2.34097e+02    2.17823e+00    0.00000e+00

          Box-X          Box-Y          Box-Z
    7.41182e+00    7.41182e+00    7.41182e+00

   Total Virial (kJ/mol)
    3.49632e+04   -1.77442e+02   -2.68029e+02
   -1.77536e+02    3.50165e+04   -9.25268e+01
   -2.68075e+02   -9.23627e+01    3.48659e+04

   Pressure (bar)
    4.00992e+00    1.14163e+01    2.23425e+01
    1.14240e+01   -6.13640e+00    9.09152e+00
    2.23463e+01    9.07817e+00    8.66119e+00

      T-Protein  T-non-Protein
    3.00231e+02    3.00089e+02


	M E G A - F L O P S   A C C O U N T I N G

 NB=Group-cutoff nonbonded kernels    NxN=N-by-N cluster Verlet kernels
 RF=Reaction-Field  VdW=Van der Waals  QSTab=quadratic-spline table
 W3=SPC/TIP3p  W4=TIP4p (single or pairs)
 V&F=Potential and force  V=Potential only  F=Force only

 Computing:                               M-Number         M-Flops  % Flops
-----------------------------------------------------------------------------
 Pair Search distance check           14313.835096      128824.516     0.3
 NxN QSTab Elec. + LJ [F]            561563.609592    23024107.993    53.9
 NxN QSTab Elec. + LJ [V&F]            5683.876680      335348.724     0.8
 NxN QSTab Elec. [F]                 464090.345800    15779071.757    36.9
 NxN QSTab Elec. [V&F]                 4696.691736      192564.361     0.5
 1,4 nonbonded interactions             450.059001       40505.310     0.1
 Calc Weights                          6176.373525      222349.447     0.5
 Spread Q Bspline                    131762.635200      263525.270     0.6
 Gather F Bspline                    131762.635200      790575.811     1.9
 3D-FFT                              185299.305912     1482394.447     3.5
 Solve PME                              115.202304        7372.947     0.0
 Shift-X                                 41.216175         247.297     0.0
 Bonds                                   89.101782        5257.005     0.0
 Angles                                 311.706234       52366.647     0.1
 Propers                                481.909638      110357.307     0.3
 Impropers                               35.400708        7363.347     0.0
 Pos. Restr.                             87.051741        4352.587     0.0
 Virial                                 206.141220        3710.542     0.0
 Stop-CM                                 20.628675         206.287     0.0
 Calc-Ekin                              411.832350       11119.473     0.0
 Lincs                                   84.501690        5070.101     0.0
 Lincs-Mat                              421.208424        1684.834     0.0
 Constraint-V                          2056.041120       18504.370     0.0
 Constraint-Vir                         197.189430        4732.546     0.0
 Settle                                 629.012580      232734.655     0.5
-----------------------------------------------------------------------------
 Total                                                42724347.584   100.0
-----------------------------------------------------------------------------


     R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G

On 1 MPI rank, each using 16 OpenMP threads

 Computing:          Num   Num      Call    Wall time         Giga-Cycles
                     Ranks Threads  Count      (s)         total sum    %
-----------------------------------------------------------------------------
 Neighbor search        1   16       1001     171.904       8235.853   1.3
 Force                  1   16      50001   11761.402     563483.831  90.0
 PME mesh               1   16      50001     902.885      43256.855   6.9
 NB X/F buffer ops.     1   16      99001      51.463       2465.549   0.4
 Write traj.            1   16        113       1.387         66.428   0.0
 Update                 1   16      50001      28.337       1357.622   0.2
 Constraints            1   16      50001     126.837       6076.727   1.0
 Rest                                          18.569        889.653   0.1
-----------------------------------------------------------------------------
 Total                                      13062.785     625832.517 100.0
-----------------------------------------------------------------------------
 Breakdown of PME mesh computation
-----------------------------------------------------------------------------
 PME spread             1   16      50001     317.864      15228.748   2.4
 PME gather             1   16      50001     497.147      23818.110   3.8
 PME 3D-FFT             1   16     100002      39.412       1888.205   0.3
 PME solve Elec         1   16      50001      47.287       2265.490   0.4
-----------------------------------------------------------------------------

               Core t (s)   Wall t (s)        (%)
       Time:   209004.518    13062.785     1600.0
                         3h37:42
                 (ns/day)    (hour/ns)
Performance:        0.661       36.285
Finished mdrun on rank 0 Mon Jan 10 01:50:32 2022
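The performance figure at the end of the log can be reproduced from the input parameters and the wall time. The sketch below is only an arithmetic check, not how mdrun itself computes the value. It uses nsteps = 50000 and dt = 0.002 ps from the parameter list and the wall time of 13062.785 s from the timing table. The function names are made up for illustration.

```python
def simulated_ns(nsteps, dt_ps):
    """Simulated time in ns for nsteps steps of dt_ps picoseconds each."""
    return nsteps * dt_ps / 1000.0


def ns_per_day(sim_ns, wall_seconds):
    """Throughput in ns of simulated time per day of wall-clock time."""
    return sim_ns * 86_400.0 / wall_seconds


sim = simulated_ns(50_000, 0.002)   # nsteps and dt from the log
perf = ns_per_day(sim, 13062.785)   # wall time (s) from the log

print(f"simulated time: {sim:.3f} ns")
print(f"performance:    {perf:.3f} ns/day, {24.0 / perf:.2f} hour/ns")
```

The check also shows that 50,000 steps of 2 fs cover 100 ps (0.1 ns) of simulated time, which took about 3 h 38 min of wall time here.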

I see no reason why this would run so slowly. It is also strange that more than 90% of the runtime is spent in the short-range force calculation (“Force” counter); that is typically more like 70-80%. How is the performance with 4 or 8 threads? Can you rebuild with cmake . -DGMX_CYCLE_SUBCOUNTERS=ON and post a log from that build? It will show a further breakdown of the wall times.
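One way to run the 4 / 8 thread comparison is with short benchmark runs that differ only in the OpenMP thread count. The sketch below only builds and prints the command lines; it does not run them. It assumes the run input file is npt.tpr (implied by -deffnm npt in the log). The output prefix bench_ompN and the 5000-step length are arbitrary choices for illustration. -ntmpi, -ntomp, -nsteps, -resethway and -noconfout are standard gmx mdrun options.

```python
import shlex


def benchmark_commands(tpr="npt.tpr", thread_counts=(4, 8, 16), nsteps=5000):
    """Build one short mdrun command per OpenMP thread count."""
    commands = []
    for n in thread_counts:
        commands.append([
            "gmx", "mdrun",
            "-s", tpr,                    # assumed input file name
            "-deffnm", f"bench_omp{n}",   # hypothetical output prefix
            "-ntmpi", "1",                # one thread-MPI rank, as in the log
            "-ntomp", str(n),             # OpenMP threads for this run
            "-nsteps", str(nsteps),       # override the run length
            "-resethway",                 # reset timers halfway through
            "-noconfout",                 # skip writing the final structure
        ])
    return commands


for cmd in benchmark_commands():
    print(shlex.join(cmd))
```

Comparing the Performance lines of the resulting log files shows whether using all 16 logical cores helps or hurts relative to one thread per physical core.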

I was able to figure it out. Apparently there is another gmx executable in:

/usr/local/bin/gmx

I thought it was the version of GROMACS that comes from the Ubuntu repository, but when I tried to remove it with apt-get, it told me gromacs was not installed:

Package 'gromacs' is not installed, so not removed

I sourced the GMXRC file and now gmx points to the right executable:

source /usr/local/gromacs/bin/GMXRC
which gmx
/usr/local/gromacs/bin/gmx

Now my simulation runs at 21 ns/day.
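For anyone checking for the same problem: the "Executable:" line at the top of the log records which binary produced the run (here /usr/local/bin/gmx). A minimal sketch for listing every gmx found on PATH, in the order the shell would try them, is given below. The helper name is made up for illustration and the script relies only on the Python standard library.

```python
import os


def find_all_on_path(name, path=None):
    """Return every executable called `name` on PATH, in lookup order."""
    if path is None:
        path = os.environ.get("PATH", "")
    hits = []
    for directory in path.split(os.pathsep):
        if not directory:
            continue  # empty entries (current directory) are skipped here
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            if candidate not in hits:
                hits.append(candidate)
    return hits


for position, exe in enumerate(find_all_on_path("gmx")):
    note = "  <- first match, used by the shell" if position == 0 else ""
    print(f"{exe}{note}")
```

If more than one path is printed, the first one is what a plain gmx command runs, so it is worth comparing it with the "Executable:" line in the log.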

Hi alexmas,

I have the same issue. The performance was lower than I expected (0.636 ns/day).
Could you be more specific about how you figured that out?
Thank you in advance.

Note:
I don’t know anything about the GMXRC file, and I couldn’t find it.

In order to help you more, it would be good to have some more information. How did you install GROMACS? Was it after compilation from source? Or did you install a pre-compiled version, e.g., by using a package manager?

Hi Magnus,
Thanks for asking. My problem is now solved.
I re-installed GROMACS and it worked. I didn’t know the details of the earlier installation because a friend had helped me install GROMACS. I think he missed something that affected the installation directory (as alexmas said, the problem was the directory).

Now my simulation performance has increased to 28 ns/day. However, I noticed that it decreases a little between runs: it started at 32 ns/day, then dropped to 30 and then 28 ns/day, although these were different runs (NVT, NPT, and production).
Is that actually OK?

Regards,
Antonius

Good to hear that it’s working better now. It’s difficult to say what performance to expect, since it depends on the system size and hardware resources, but at least 28-32 ns/day sounds reasonable, whereas the previously reported (0.6 ns/day) sounded very low.

The performance may vary a little bit from one run to another. It would also be affected by system load (from other programs and processes) during the simulation, especially if you are not running the simulations on a computer dedicated only to that.
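To put numbers on such variation, the ns/day value can be read from the "Performance:" line of each log and the spread compared with the mean. The sketch below is one way to do that. The regular expression and function names are made up for illustration. The three values are the ones reported above, and they come from different run types (NVT, NPT, production), so the spread is not purely run-to-run noise.

```python
import re

# Matches the final "Performance:   <ns/day>   <hour/ns>" line of an mdrun log.
PERF_RE = re.compile(r"^Performance:\s+([\d.]+)\s+([\d.]+)", re.MULTILINE)


def ns_per_day_from_log(text):
    """Return the ns/day value from log text, or None if it is absent."""
    match = PERF_RE.search(text)
    return float(match.group(1)) if match else None


def relative_spread(values):
    """(max - min) / mean of a list of performance values."""
    mean = sum(values) / len(values)
    return (max(values) - min(values)) / mean


sample = """                 (ns/day)    (hour/ns)
Performance:        0.661       36.285
"""
print(ns_per_day_from_log(sample))

reported = [32.0, 30.0, 28.0]  # ns/day for NVT, NPT and production
print(f"spread relative to mean: {relative_spread(reported):.1%}")
```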