Efficiency low

GROMACS version: 2021
GROMACS modification: No

Hello GROMACS users,

Unfortunately I am having trouble with the efficiency of my jobs. I am using two nodes and 16 cores for my job, but at the end of the run GROMACS reports an efficiency of 25% (and the job took a long time). Does anybody know how to improve the efficiency? Thank you in advance!

Hi,

Please post your log file contents; we can’t help without more information.

Hi IlseF,

maybe you can find something useful here: Getting good performance from mdrun — GROMACS 2021.1 documentation.

Michele

Do you mean the complete file or only the last part? I found this:

     R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G

On 4 MPI ranks

 Computing:          Num   Num      Call    Wall time         Giga-Cycles
                     Ranks Threads  Count      (s)         total sum    %
-----------------------------------------------------------------------------
 Domain decomp.         4    1     125000     378.257       3782.561   0.2
 DD comm. load          4    1     125000       1.495         14.947   0.0
 DD comm. bounds        4    1      75198       3.261         32.608   0.0
 Neighbor search        4    1     125001    1167.797      11677.935   0.8
 Comm. coord.           4    1   12375000     476.096       4760.944   0.3
 Force                  4    1   12500001  126226.745    1262264.071  82.6
 Wait + Comm. F         4    1   12500001     431.946       4319.445   0.3
 PME mesh               4    1   12500001   21290.964     212909.069  13.9
 NB X/F buffer ops.     4    1   37250001     679.277       6792.753   0.4
 Write traj.            4    1        420      28.558        285.577   0.0
 Update                 4    1   12500001     352.373       3523.723   0.2
 Constraints            4    1   12500001    1420.637      14206.334   0.9
 Comm. energies         4    1    1250001      80.893        808.927   0.1
 Rest                                         236.878       2368.771   0.2
-----------------------------------------------------------------------------
 Total                                     152775.175    1527747.664 100.0
-----------------------------------------------------------------------------
 Breakdown of PME mesh computation
-----------------------------------------------------------------------------
 PME redist. X/F        4    1   25000002    4788.207      47881.944   3.1
 PME spread             4    1   12500001    5049.845      50498.320   3.3
 PME gather             4    1   12500001    2965.756      29657.479   1.9
 PME 3D-FFT             4    1   25000002    5932.719      59327.032   3.9
 PME 3D-FFT Comm.       4    1   25000002    1747.479      17474.748   1.1
 PME solve Elec         4    1   12500001     782.267       7822.651   0.5
-----------------------------------------------------------------------------

               Core t (s)   Wall t (s)        (%)
       Time:   611100.697   152775.175      400.0
                         1d18h26:15
                 (ns/day)    (hour/ns)
Performance:       14.138        1.698
Finished mdrun on rank 0 Sun Apr  4 09:29:33 2021

Can you paste the beginning and end, i.e. everything except the per-step energies in the middle? What message is GROMACS giving you specifically about efficiency? It should also show up in the log.

Beginning:

 :-) GROMACS - gmx mdrun, 2021-MODIFIED (-:

                            GROMACS is written by:
     Andrey Alekseenko              Emile Apol              Rossen Apostolov     
         Paul Bauer           Herman J.C. Berendsen           Par Bjelkmar       
       Christian Blau           Viacheslav Bolnykh             Kevin Boyd        
     Aldert van Buuren           Rudi van Drunen             Anton Feenstra      
    Gilles Gouaillardet             Alan Gray               Gerrit Groenhof      
       Anca Hamuraru            Vincent Hindriksen          M. Eric Irrgang      
      Aleksei Iupinov           Christoph Junghans             Joe Jordan        
    Dimitrios Karkoulis            Peter Kasson                Jiri Kraus        
      Carsten Kutzner              Per Larsson              Justin A. Lemkul     
       Viveca Lindahl            Magnus Lundborg             Erik Marklund       
        Pascal Merz             Pieter Meulenhoff            Teemu Murtola       
        Szilard Pall               Sander Pronk              Roland Schulz       
       Michael Shirts            Alexey Shvetsov             Alfons Sijbers      
       Peter Tieleman              Jon Vincent              Teemu Virolainen     
     Christian Wennberg            Maarten Wolf              Artem Zhmurov       
                           and the project leaders:
        Mark Abraham, Berk Hess, Erik Lindahl, and David van der Spoel

Copyright (c) 1991-2000, University of Groningen, The Netherlands.
Copyright (c) 2001-2019, The GROMACS development team at
Uppsala University, Stockholm University and
the Royal Institute of Technology, Sweden.
check out http://www.gromacs.org for more information.

GROMACS is free software; you can redistribute it and/or modify it
under the terms of the GNU Lesser General Public License
as published by the Free Software Foundation; either version 2.1
of the License, or (at your option) any later version.

GROMACS:      gmx mdrun, version 2021-MODIFIED
Executable:   /software/software/GROMACS/2021-foss-2020b/bin/gmx_mpi
Data prefix:  /software/software/GROMACS/2021-foss-2020b
Working dir:  /home/s3347648/MD
Process ID:   28346
Command line:
  gmx_mpi mdrun -v -deffnm step6_1

GROMACS version:    2021-MODIFIED
This program has been built from source code that has been altered and does not match the code released as part of the official GROMACS version 2021-MODIFIED. If you did not intend to use an altered GROMACS version, make sure to download an intact source distribution and compile that before proceeding.
If you have modified the source code, you are strongly encouraged to set your custom version suffix (using -DGMX_VERSION_STRING_OF_FORK) which will can help later with scientific reproducibility but also when reporting bugs.
Release checksum: 3e06a5865d6ff726fc417dea8d55afd37ac3cbb94c02c54c76d7a881c49c5dd8
Computed checksum: 703b977784a0aa51372c3d549c8d6a3d866be317e94e2b89ea42bf0257c5aa04
Precision:          mixed
Memory model:       64 bit
MPI library:        MPI
OpenMP support:     enabled (GMX_OPENMP_MAX_THREADS = 64)
GPU support:        disabled
SIMD instructions:  AVX2_256
FFT library:        fftw-3.3.8-sse2-avx-avx2-avx2_128
RDTSCP usage:       enabled
TNG support:        enabled
Hwloc support:      disabled
Tracing support:    disabled
C compiler:         /software/software/OpenMPI/4.0.5-GCC-10.2.0/bin/mpicc GNU 10.2.0
C compiler flags:   -mavx2 -mfma -Wno-missing-field-initializers -fexcess-precision=fast -funroll-all-loops -O3 -DNDEBUG
C++ compiler:       /software/software/OpenMPI/4.0.5-GCC-10.2.0/bin/mpicxx GNU 10.2.0
C++ compiler flags: -mavx2 -mfma -Wno-missing-field-initializers -fexcess-precision=fast -funroll-all-loops -fopenmp -O3 -DNDEBUG


Running on 2 nodes with total 48 cores, 48 logical cores
  Cores per node:           24
  Logical cores per node:   24
Hardware detected on host pg-node024 (the node of MPI rank 0):
  CPU info:
    Vendor: Intel
    Brand:  Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
    Family: 6   Model: 63   Stepping: 2
    Features: aes apic avx avx2 clfsh cmov cx8 cx16 f16c fma htt intel lahf mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd rdtscp sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
  Hardware topology: Only logical processor count


++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, E.
Lindahl
GROMACS: High performance molecular simulations through multi-level
parallelism from laptops to supercomputers
SoftwareX 1 (2015) pp. 19-25
-------- -------- --- Thank You --- -------- --------


++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
S. Páll, M. J. Abraham, C. Kutzner, B. Hess, E. Lindahl
Tackling Exascale Software Challenges in Molecular Dynamics Simulations with
GROMACS
In S. Markidis & E. Laure (Eds.), Solving Software Challenges for Exascale 8759 (2015) pp. 3-27
-------- -------- --- Thank You --- -------- --------


++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
S. Pronk, S. Páll, R. Schulz, P. Larsson, P. Bjelkmar, R. Apostolov, M. R.
Shirts, J. C. Smith, P. M. Kasson, D. van der Spoel, B. Hess, and E. Lindahl
GROMACS 4.5: a high-throughput and highly parallel open source molecular
simulation toolkit
Bioinformatics 29 (2013) pp. 845-54
-------- -------- --- Thank You --- -------- --------


++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
B. Hess and C. Kutzner and D. van der Spoel and E. Lindahl
GROMACS 4: Algorithms for highly efficient, load-balanced, and scalable
molecular simulation
J. Chem. Theory Comput. 4 (2008) pp. 435-447
-------- -------- --- Thank You --- -------- --------


++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
D. van der Spoel, E. Lindahl, B. Hess, G. Groenhof, A. E. Mark and H. J. C.
Berendsen
GROMACS: Fast, Flexible and Free
J. Comp. Chem. 26 (2005) pp. 1701-1719
-------- -------- --- Thank You --- -------- --------


++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
E. Lindahl and B. Hess and D. van der Spoel
GROMACS 3.0: A package for molecular simulation and trajectory analysis
J. Mol. Mod. 7 (2001) pp. 306-317
-------- -------- --- Thank You --- -------- --------


++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
H. J. C. Berendsen, D. van der Spoel and R. van Drunen
GROMACS: A message-passing parallel molecular dynamics implementation
Comp. Phys. Comm. 91 (1995) pp. 43-56
-------- -------- --- Thank You --- -------- --------


++++ PLEASE CITE THE DOI FOR THIS VERSION OF GROMACS ++++
https://doi.org/10.5281/zenodo.4457626
-------- -------- --- Thank You --- -------- --------


The number of OpenMP threads was set by environment variable OMP_NUM_THREADS to 1

Input Parameters:
   integrator                     = md
   tinit                          = 0
   dt                             = 0.002
   nsteps                         = 12500000
   init-step                      = 0
   simulation-part                = 1
   mts                            = false
   comm-mode                      = Linear
   nstcomm                        = 100
   bd-fric                        = 0
   ld-seed                        = 2147205111
   emtol                          = 10
   emstep                         = 0.01
   niter                          = 20
   fcstep                         = 0
   nstcgsteep                     = 1000
   nbfgscorr                      = 10
   rtpi                           = 0.05
   nstxout                        = 0
   nstvout                        = 50000
   nstfout                        = 50000
   nstlog                         = 1000
   nstcalcenergy                  = 100
   nstenergy                      = 1000
   nstxout-compressed             = 50000
   compressed-x-precision         = 1000
   cutoff-scheme                  = Verlet
   nstlist                        = 20
   pbc                            = xyz
   periodic-molecules             = false
   verlet-buffer-tolerance        = 0.005
   rlist                          = 1.216
   coulombtype                    = PME
   coulomb-modifier               = Potential-shift
   rcoulomb-switch                = 0
   rcoulomb                       = 1.2
   epsilon-r                      = 1
   epsilon-rf                     = inf
   vdw-type                       = Cut-off
   vdw-modifier                   = Force-switch
   rvdw-switch                    = 1
   rvdw                           = 1.2
   DispCorr                       = No
   table-extension                = 1
   fourierspacing                 = 0.12
   fourier-nx                     = 52
   fourier-ny                     = 52
   fourier-nz                     = 52
   pme-order                      = 4
   ewald-rtol                     = 1e-05
   ewald-rtol-lj                  = 0.001
   lj-pme-comb-rule               = Geometric
   ewald-geometry                 = 0
   epsilon-surface                = 0
   tcoupl                         = Nose-Hoover
   nsttcouple                     = 10
   nh-chain-length                = 1
   print-nose-hoover-chain-variables = false
   pcoupl                         = Parrinello-Rahman
   pcoupltype                     = Isotropic
   nstpcouple                     = 10
   tau-p                          = 5
   compressibility (3x3):
      compressibility[    0]={ 4.50000e-05,  0.00000e+00,  0.00000e+00}
      compressibility[    1]={ 0.00000e+00,  4.50000e-05,  0.00000e+00}
      compressibility[    2]={ 0.00000e+00,  0.00000e+00,  4.50000e-05}
   ref-p (3x3):
      ref-p[    0]={ 1.00000e+00,  0.00000e+00,  0.00000e+00}
      ref-p[    1]={ 0.00000e+00,  1.00000e+00,  0.00000e+00}
      ref-p[    2]={ 0.00000e+00,  0.00000e+00,  1.00000e+00}
   refcoord-scaling               = No
   posres-com (3):
      posres-com[0]= 0.00000e+00
      posres-com[1]= 0.00000e+00
      posres-com[2]= 0.00000e+00
   posres-comB (3):
      posres-comB[0]= 0.00000e+00
      posres-comB[1]= 0.00000e+00
      posres-comB[2]= 0.00000e+00
   QMMM                           = false
qm-opts:
   ngQM                           = 0
   constraint-algorithm           = Lincs
   continuation                   = true
   Shake-SOR                      = false
   shake-tol                      = 0.0001
   lincs-order                    = 4
   lincs-iter                     = 1
   lincs-warnangle                = 30
   nwall                          = 0
   wall-type                      = 9-3
   wall-r-linpot                  = -1
   wall-atomtype[0]               = -1
   wall-atomtype[1]               = -1
   wall-density[0]                = 0
   wall-density[1]                = 0
   wall-ewald-zfac                = 3
   pull                           = false
   awh                            = false
   rotation                       = false
   interactiveMD                  = false
   disre                          = No
   disre-weighting                = Conservative
   disre-mixed                    = false
   dr-fc                          = 1000
   dr-tau                         = 0
   nstdisreout                    = 100
   orire-fc                       = 0
   orire-tau                      = 0
   nstorireout                    = 100
   free-energy                    = no
   cos-acceleration               = 0
   deform (3x3):
      deform[    0]={ 0.00000e+00,  0.00000e+00,  0.00000e+00}
      deform[    1]={ 0.00000e+00,  0.00000e+00,  0.00000e+00}
      deform[    2]={ 0.00000e+00,  0.00000e+00,  0.00000e+00}
   simulated-tempering            = false
   swapcoords                     = no
   userint1                       = 0
   userint2                       = 0
   userint3                       = 0
   userint4                       = 0
   userreal1                      = 0
   userreal2                      = 0
   userreal3                      = 0
   userreal4                      = 0
   applied-forces:
     electric-field:
       x:
         E0                       = 0
         omega                    = 0
         t0                       = 0
         sigma                    = 0
       y:
         E0                       = 0
         omega                    = 0
         t0                       = 0
         sigma                    = 0
       z:
         E0                       = 0
         omega                    = 0
         t0                       = 0
         sigma                    = 0
     density-guided-simulation:
       active                     = false
       group                      = protein
       similarity-measure         = inner-product
       atom-spreading-weight      = unity
       force-constant             = 1e+09
       gaussian-transform-spreading-width = 0.2
       gaussian-transform-spreading-range-in-multiples-of-width = 4
       reference-density-filename = reference.mrc
       nst                        = 1
       normalize-densities        = true
       adaptive-force-scaling     = false
       adaptive-force-scaling-time-constant = 4
       shift-vector               = 
       transformation-matrix      = 
grpopts:
   nrdf:       10237       32970
   ref-t:      310.15      310.15
   tau-t:           1           1
annealing:          No          No
annealing-npoints:           0           0
   acc:	           0           0           0
   nfreeze:           N           N           N
   energygrp-flags[  0]: 0

Changing nstlist from 20 to 100, rlist from 1.216 to 1.334


Initializing Domain Decomposition on 4 ranks
Dynamic load balancing: auto
Minimum cell size due to atom displacement: 0.760 nm
Initial maximum distances in bonded interactions:
    two-body bonded interactions: 0.436 nm, LJ-14, atoms 2432 2439
  multi-body bonded interactions: 0.498 nm, CMAP Dih., atoms 3286 3295
Minimum cell size due to bonded interactions: 0.548 nm
Maximum distance for 5 constraints, at 120 deg. angles, all-trans: 0.222 nm
Estimated maximum distance required for P-LINCS: 0.222 nm
Scaling the initial minimum size with 1/0.8 (option -dds) = 1.25
Using 0 separate PME ranks, as there are too few total
 ranks for efficient splitting
Optimizing the DD grid for 4 cells with a minimum initial size of 0.950 nm
The maximum allowed number of cells is: X 6 Y 6 Z 6
Domain decomposition grid 4 x 1 x 1, separate PME ranks 0
PME domain decomposition: 4 x 1 x 1
Domain decomposition rank 0, coordinates 0 0 0

The initial number of communication pulses is: X 1
The initial domain decomposition cell size is: X 1.50 nm

The maximum allowed distance for atoms involved in interactions is:
                 non-bonded interactions           1.334 nm
(the following are initial values, they could change due to box deformation)
            two-body bonded interactions  (-rdd)   1.334 nm
          multi-body bonded interactions  (-rdd)   1.334 nm
  atoms separated by up to 5 constraints  (-rcon)  1.500 nm

When dynamic load balancing gets turned on, these settings will change to:
The maximum number of communication pulses is: X 2
The minimum size for domain decomposition cells is 1.024 nm
The requested allowed shrink of DD cells (option -dds) is: 0.80
The allowed shrink of domain decomposition cells is: X 0.68
The maximum allowed distance for atoms involved in interactions is:
                 non-bonded interactions           1.334 nm
            two-body bonded interactions  (-rdd)   1.334 nm
          multi-body bonded interactions  (-rdd)   1.024 nm
  atoms separated by up to 5 constraints  (-rcon)  1.024 nm
Using two step summing over 2 groups of on average 2.0 ranks


Using 4 MPI processes

Non-default thread affinity set, disabling internal thread affinity

Using 1 OpenMP thread per MPI process

System total charge: 0.000
Will do PME sum in reciprocal space for electrostatic interactions.

++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
U. Essmann, L. Perera, M. L. Berkowitz, T. Darden, H. Lee and L. G. Pedersen 
A smooth particle mesh Ewald method
J. Chem. Phys. 103 (1995) pp. 8577-8592
-------- -------- --- Thank You --- -------- --------

Using a Gaussian width (1/beta) of 0.384195 nm for Ewald
Potential shift: LJ r^-12: -2.648e-01 r^-6: -5.349e-01, Ewald -8.333e-06
Initialized non-bonded Ewald tables, spacing: 1.02e-03 size: 1176

Generated table with 1167 data points for 1-4 COUL.
Tabscale = 500 points/nm
Generated table with 1167 data points for 1-4 LJ6.
Tabscale = 500 points/nm
Generated table with 1167 data points for 1-4 LJ12.
Tabscale = 500 points/nm


Using SIMD 4x8 nonbonded short-range kernels

Using a dual 4x8 pair-list setup updated with dynamic pruning:
  outer list: updated every 100 steps, buffer 0.134 nm, rlist 1.334 nm
  inner list: updated every  15 steps, buffer 0.002 nm, rlist 1.202 nm
At tolerance 0.005 kJ/mol/ps per atom, equivalent classical 1x1 list would be:
  outer list: updated every 100 steps, buffer 0.289 nm, rlist 1.489 nm
  inner list: updated every  15 steps, buffer 0.060 nm, rlist 1.260 nm

Initializing Parallel LINear Constraint Solver

++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
B. Hess
P-LINCS: A Parallel Linear Constraint Solver for molecular simulation
J. Chem. Theory Comput. 4 (2008) pp. 116-122
-------- -------- --- Thank You --- -------- --------

The number of constraints is 2051
There are constraints between atoms in different decomposition domains,
will communicate selected coordinates each lincs iteration

++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
S. Miyamoto and P. A. Kollman
SETTLE: An Analytical Version of the SHAKE and RATTLE Algorithms for Rigid
Water Models
J. Comp. Chem. 13 (1992) pp. 952-962
-------- -------- --- Thank You --- -------- --------


Linking all bonded interactions to atoms


Intra-simulation communication will occur every 10 steps.
There are: 20580 Atoms
Atom distribution over 4 domains: av 5145 stddev 171 min 4930 max 5354
Center of mass motion removal mode is Linear
We have the following groups for center of mass motion removal:
  0:  SOLU
  1:  SOLV

End:

Energy conservation over simulation part #1 of length 25000 ps, time 0 to 25000 ps
  Conserved energy drift: -3.11e-04 kJ/mol/ps per atom


	<======  ###############  ==>
	<====  A V E R A G E S  ====>
	<==  ###############  ======>

	Statistics over 12500001 steps using 125001 frames

   Energies (kJ/mol)
           Bond            U-B    Proper Dih.  Improper Dih.      CMAP Dih.
    3.51018e+03    9.83976e+03    1.03884e+04    5.96427e+02   -4.44274e+02
          LJ-14     Coulomb-14        LJ (SR)   Coulomb (SR)   Coul. recip.
    2.68622e+03    3.66119e+04    1.57117e+04   -3.31705e+05    1.09901e+03
      Potential    Kinetic En.   Total Energy  Conserved En.    Temperature
   -2.51706e+05    5.57103e+04   -1.95996e+05   -2.68776e+05    3.10154e+02
 Pressure (bar)   Constr. rmsd
    1.38171e+00    0.00000e+00

          Box-X          Box-Y          Box-Z
    5.86104e+00    5.86104e+00    5.86104e+00

   Total Virial (kJ/mol)
    1.85910e+04   -3.72208e+00   -1.96834e+01
   -3.68913e+00    1.85317e+04    8.71106e-01
   -1.96866e+01    8.31952e-01    1.85673e+04

   Pressure (bar)
    1.16684e+00   -2.82005e-01    1.58376e-01
   -2.87441e-01    2.72299e+00   -8.67067e-01
    1.58895e-01   -8.60609e-01    2.55302e-01

         T-SOLU         T-SOLV
    3.10164e+02    3.10150e+02


	M E G A - F L O P S   A C C O U N T I N G

 NB=Group-cutoff nonbonded kernels    NxN=N-by-N cluster Verlet kernels
 RF=Reaction-Field  VdW=Van der Waals  QSTab=quadratic-spline table
 W3=SPC/TIP3p  W4=TIP4p (single or pairs)
 V&F=Potential and force  V=Potential only  F=Force only

 Computing:                               M-Number         M-Flops  % Flops
-----------------------------------------------------------------------------
 Pair Search distance check         1502727.649524    13524548.846     0.1
 NxN Ewald Elec. + LJ [F]         239350552.475760 18669343093.109    94.0
 NxN Ewald Elec. + LJ [V&F]         2417701.692688   311883518.357     1.6
 NxN Ewald Elec. [F]                1991262.773808   121467029.202     0.6
 NxN Ewald Elec. [V&F]                20113.927824     1689569.937     0.0
 1,4 nonbonded interactions          134537.510763    12108375.969     0.1
 Calc Weights                        771750.061740    27783002.223     0.1
 Spread Q Bspline                  16464001.317120    32928002.634     0.2
 Gather F Bspline                  16464001.317120    98784007.903     0.5
 3D-FFT                            60114554.809164   480916438.473     2.4
 Solve PME                            33800.002704     2163200.173     0.0
 Reset In Box                          2572.088400        7716.265     0.0
 CG-CoM                                2572.520580        7717.562     0.0
 Bonds                                26012.502081     1534737.623     0.0
 Propers                             131400.010512    30090602.407     0.2
 Impropers                             8362.500669     1739400.139     0.0
 Virial                               25950.020760      467100.374     0.0
 Stop-CM                               2572.520580       25725.206     0.0
 Calc-Ekin                            51450.041160     1389151.111     0.0
 Lincs                                27299.779975     1637986.798     0.0
 Lincs-Mat                           153289.864968      613159.460     0.0
 Constraint-V                        272487.121859     2452384.097     0.0
 Constraint-Vir                       24518.751814      588450.044     0.0
 Settle                               72629.187303    26872799.302     0.1
 CMAP                                  3262.500261     5546250.444     0.0
 Urey-Bradley                         93562.507485    17121938.870     0.1
-----------------------------------------------------------------------------
 Total                                             19862685906.527   100.0
-----------------------------------------------------------------------------


    D O M A I N   D E C O M P O S I T I O N   S T A T I S T I C S

 av. #atoms communicated per step for force:  2 x 18740.9
 av. #atoms communicated per step for LINCS:  2 x 1159.2


Dynamic load balancing report:
 DLB was turned on during the run due to measured imbalance.
 Average load imbalance: 2.3%.
 The balanceable part of the MD step is 83%, load imbalance is computed from this.
 Part of the total run time spent waiting due to load imbalance: 1.9%.
 Steps where the load balancing was limited by -rdd, -rcon and/or -dds: X 0 %


     R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G

On 4 MPI ranks

 Computing:          Num   Num      Call    Wall time         Giga-Cycles
                     Ranks Threads  Count      (s)         total sum    %
-----------------------------------------------------------------------------
 Domain decomp.         4    1     125000     378.257       3782.561   0.2
 DD comm. load          4    1     125000       1.495         14.947   0.0
 DD comm. bounds        4    1      75198       3.261         32.608   0.0
 Neighbor search        4    1     125001    1167.797      11677.935   0.8
 Comm. coord.           4    1   12375000     476.096       4760.944   0.3
 Force                  4    1   12500001  126226.745    1262264.071  82.6
 Wait + Comm. F         4    1   12500001     431.946       4319.445   0.3
 PME mesh               4    1   12500001   21290.964     212909.069  13.9
 NB X/F buffer ops.     4    1   37250001     679.277       6792.753   0.4
 Write traj.            4    1        420      28.558        285.577   0.0
 Update                 4    1   12500001     352.373       3523.723   0.2
 Constraints            4    1   12500001    1420.637      14206.334   0.9
 Comm. energies         4    1    1250001      80.893        808.927   0.1
 Rest                                         236.878       2368.771   0.2
-----------------------------------------------------------------------------
 Total                                     152775.175    1527747.664 100.0
-----------------------------------------------------------------------------
 Breakdown of PME mesh computation
-----------------------------------------------------------------------------
 PME redist. X/F        4    1   25000002    4788.207      47881.944   3.1
 PME spread             4    1   12500001    5049.845      50498.320   3.3
 PME gather             4    1   12500001    2965.756      29657.479   1.9
 PME 3D-FFT             4    1   25000002    5932.719      59327.032   3.9
 PME 3D-FFT Comm.       4    1   25000002    1747.479      17474.748   1.1
 PME solve Elec         4    1   12500001     782.267       7822.651   0.5
-----------------------------------------------------------------------------

               Core t (s)   Wall t (s)        (%)
       Time:   611100.697   152775.175      400.0
                         1d18h26:15
                 (ns/day)    (hour/ns)
Performance:       14.138        1.698
Finished mdrun on rank 0 Sun Apr  4 09:29:33 2021

I hope this helps

Interesting - it looks like only 4 MPI ranks were spawned.

I’m a bit confused - the log states that you have 2 nodes and 48 cores available to you. With gmx_mpi mdrun -v -deffnm step6_1 as your run command, it should have started 1 rank per thread.

My only thought would be something off with an mpirun or srun command. Something like mpirun -np 4 would limit you like that… how did you dispatch the simulation?

Hello Kevin,

I dispatched the simulation using the command sbatch Gromacs_script.sh
In the script, the mdrun command is gmx_mpi mdrun -v -deffnm step6_1.

Looking at this manual (Getting good performance from mdrun — GROMACS 5.1 documentation) I can try two commands:

option 1) gmx mpirun -np 4 -v -deffnm step6_1.
option 2) gmx mpirun_mpi -np 4 -v -deffnm step6_1.

Do you think using one of the two commands will improve the efficiency?
Kind regards,
Ilse

Hi,

Your issue is not efficiency but that you are not using most of the cores in the nodes.

You allocated resources with 2x24 cores in total, but you are only using four of these, as you have launched four MPI ranks with a single thread per rank. The following note in the log indicates the latter:
The number of OpenMP threads was set by environment variable OMP_NUM_THREADS to 1
OMP_NUM_THREADS=1 was most likely set by the job scheduler (this is likely the default if you have not specified how many threads you want).

Start by launching as many ranks as cores; you can also try using 2-4 times fewer ranks with 2-4 OpenMP threads each to use all cores – the thread count you will have to specify at job launch (either as an argument to the MPI launcher or using -ntomp).
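
For illustration, a hedged sketch of what those two launch strategies could look like on these 2x24-core nodes (the exact launcher flags depend on the cluster’s MPI/SLURM setup; an OpenMPI-style mpirun is assumed here, and only the file name is taken from the thread):

  # Option A: one MPI rank per core, one OpenMP thread each (48 ranks across the 2 nodes)
  export OMP_NUM_THREADS=1
  mpirun -np 48 gmx_mpi mdrun -v -deffnm step6_1

  # Option B: fewer, wider ranks, e.g. 12 ranks with 4 OpenMP threads each (still 48 cores)
  export OMP_NUM_THREADS=4
  mpirun -np 12 gmx_mpi mdrun -ntomp 4 -v -deffnm step6_1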

Cheers,
Szilárd

@pszilard it looks like the underuse issue is happening with the default run command, which doesn’t seem right.

  gmx_mpi mdrun -v -deffnm step6_1

So did GROMACS decide to spawn 12 OpenMP threads per rank but was then overridden by the env variable?

Setting the OpenMP thread count via either the command line or the environment variable overrides the default of using all cores per node; and as OMP_NUM_THREADS=1 was set in the environment according to the log (and -ntomp was not explicitly set), mdrun used what was requested, i.e. 1 thread per rank.
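
As a small hedged illustration of that behaviour (the command names are those used elsewhere in the thread; the thread counts are arbitrary examples):

  # Neither set: mdrun spreads OpenMP threads over all cores available to each rank.
  # Either of the following overrides that default; here 1 thread per rank is requested:
  export OMP_NUM_THREADS=1                          # via the environment (what happened in this job)
  srun gmx_mpi mdrun -v -deffnm step6_1
  srun gmx_mpi mdrun -ntomp 1 -v -deffnm step6_1    # equivalently, via the command line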

Ah, thanks, that explains it.

Thanks Kevin and Pszilard for your help. Just a follow-up for anyone who might run into similar issues: the job has now run with 99.86% efficiency. The script I wrote is attached below.

After contact via email I learned the following:
I had to make sure that both OMP_NUM_THREADS and -ntomp were set to the same value as I use for --cpus-per-task, as the latter is SLURM terminology for the number of OpenMP threads. I could even leave -ntomp out, because GROMACS would probably use OMP_NUM_THREADS, but it does no harm to use both as long as they are set to the same value. Finally, since I am using multiple tasks and nodes, I had to start gmx_mpi with srun, otherwise the MPI functionality would not work.
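
The attached script itself is not reproduced in this thread; purely as a hedged sketch of the layout described above (the module name is guessed from the install prefix in the log, scheduler details such as partition and wall time are omitted, and the 12 tasks x 2 cores split is a placeholder, not the value from the actual script):

  #!/bin/bash
  #SBATCH --nodes=2
  #SBATCH --ntasks-per-node=12         # MPI ranks per node (placeholder)
  #SBATCH --cpus-per-task=2            # OpenMP threads per rank, in SLURM terms (placeholder)

  module load GROMACS/2021-foss-2020b  # module name assumed from the install prefix

  # keep OMP_NUM_THREADS and -ntomp consistent with --cpus-per-task
  export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

  # srun is needed so the MPI ranks actually start across both nodes
  srun gmx_mpi mdrun -ntomp $SLURM_CPUS_PER_TASK -v -deffnm step6_1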

You seem to be using fairly wide tasks (--ntasks-per-node=2), which is generally not optimal. I recommend testing a few settings; in general, start testing with 1-2 cores / task, so with those 24-core nodes you used before, use e.g. --ntasks-per-node=12 or 24 (and adjust the rest of the SLURM parameters accordingly).
You were trying to simulate a pretty small system, which may run into domain decomposition limitations with many ranks, but it should still work for up to at least 64-96 ranks.

Also note that you should always bind / pin MPI tasks / OpenMP threads, either using SLURM (e.g. --cpu-bind="cores") or by passing -pin on to mdrun.
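
For example (a hedged sketch; the exact binding flag spelling can vary between SLURM versions):

  # Pin via SLURM ...
  srun --cpu-bind=cores gmx_mpi mdrun -v -deffnm step6_1
  # ... or let mdrun handle the pinning itself
  srun gmx_mpi mdrun -pin on -v -deffnm step6_1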