Performance on Intel(R) Xeon Phi(TM) CPU 7250 @ 1.40GHz

GROMACS version: 2020.4
GROMACS modification: NO
I am trying to check the performance of a simulation system of a protein in water with 1 M atoms. I have tested with the settings below:
Running on 1 node with total 68 cores, 272 logical cores
Hardware detected on host c0131.ofp (the node of MPI rank 0):
CPU info:
Vendor: Intel
Brand: Intel(R) Xeon Phi™ CPU 7250 @ 1.40GHz
Family: 6 Model: 87 Stepping: 1
Features: aes apic avx avx2 avx512f avx512pf avx512er avx512cd clfsh cmov cx8 cx16 f16c fma htt intel lahf mmx msr nonstop_tsc pclmuldq pdcm pdpe1gb popcnt pse rdrnd rdtscp sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
Number of AVX-512 FMA units: Cannot run AVX-512 detection - assuming 2
Hardware topology: Basic
Sockets, cores, and logical processors:
Socket 0: [ 0 68 136 204] [ 1 69 137 205] [ 2 70 138 206] [ 3 71 139 207] [ 4 72 140 208] [ 5 73 141 209] [ 6 74 142 210] [ 7 75 143 211] [ 8 76 144 212] [ 9 77 145 213] [ 10 78 146 214] [ 11 79 147 215] [ 12 80 148 216] [ 13 81 149 217] [ 14 82 150 218] [ 15 83 151 219] [ 16 84 152 220] [ 17 85 153 221] [ 18 86 154 222] [ 19 87 155 223] [ 20 88 156 224] [ 21 89 157 225] [ 22 90 158 226] [ 23 91 159 227] [ 24 92 160 228] [ 25 93 161 229] [ 26 94 162 230] [ 27 95 163 231] [ 28 96 164 232] [ 29 97 165 233] [ 30 98 166 234] [ 31 99 167 235] [ 32 100 168 236] [ 33 101 169 237] [ 34 102 170 238] [ 35 103 171 239] [ 36 104 172 240] [ 37 105 173 241] [ 38 106 174 242] [ 39 107 175 243] [ 40 108 176 244] [ 41 109 177 245] [ 42 110 178 246] [ 43 111 179 247] [ 44 112 180 248] [ 45 113 181 249] [ 46 114 182 250] [ 47 115 183 251] [ 48 116 184 252] [ 49 117 185 253] [ 50 118 186 254] [ 51 119 187 255] [ 52 120 188 256] [ 53 121 189 257] [ 54 122 190 258] [ 55 123 191 259] [ 56 124 192 260] [ 57 125 193 261] [ 58 126 194 262] [ 59 127 195 263] [ 60 128 196 264] [ 61 129 197 265] [ 62 130 198 266] [ 63 131 199 267] [ 64 132 200 268] [ 65 133 201 269] [ 66 134 202 270] [ 67 135 203 271]

I have tried up to 20 nodes, and the performance I am getting is around 3 ns per day, which is weird. Is it normal for GROMACS to perform this badly on Xeon Phi processors? Any comments would be appreciated.

That is unusual; I would expect performance to be in that ballpark on a single node. What do you get on a single node? Can you post a log file?

It is Oakforest-PACS. The script they provide doesn't optimise the GROMACS performance. I have optimised it and got 116 ns/day on 10 nodes (I still think it is low).
Here are the important parts of the log file:

gmx_mpi mdrun -deffnm 6 -ntomp 1

GROMACS version:    2020.4
Verified release checksum is 79c2857291b034542c26e90512b92fd4b184a1c9d6fa59c55f2e24ccf14e7281
Precision:          single
Memory model:       64 bit
MPI library:        MPI
OpenMP support:     enabled (GMX_OPENMP_MAX_THREADS = 64)
GPU support:        disabled
SIMD instructions:  AVX_512_KNL
FFT library:        fftw-3.3.8-avx-avx2-avx2_128-avx512
RDTSCP usage:       enabled
TNG support:        enabled
Hwloc support:      hwloc-1.11.5rc2
Tracing support:    disabled
C compiler:         /opt/intel/impi/2019.5.281/intel64/bin/mpiicc Intel 19.0.5.20190815
C compiler flags:   -xMIC-AVX512 -std=gnu99 -ip -funroll-all-loops -alias-const -ansi-alias -no-prec-div -fimf-domain-exclusion=14 -qoverride-limits -O3 -DNDEBUG
C++ compiler:       /opt/intel/impi/2019.5.281/intel64/bin/mpiicpc Intel 19.0.5.20190815
C++ compiler flags: -xMIC-AVX512 -ip -funroll-all-loops -alias-const -ansi-alias -no-prec-div -fimf-domain-exclusion=14 -qoverride-limits -qopenmp -O3 -DNDEBUG


Running on 10 nodes with total 680 cores, 2720 logical cores
  Cores per node:           68
  Logical cores per node:   272
Hardware detected on host c0201.ofp (the node of MPI rank 0):
  CPU info:
    Vendor: Intel
    Brand:  Intel(R) Xeon Phi(TM) CPU 7250 @ 1.40GHz
    Family: 6   Model: 87   Stepping: 1
    Features: aes apic avx avx2 avx512f avx512pf avx512er avx512cd clfsh cmov cx8 cx16 f16c fma htt intel lahf mmx msr nonstop_tsc pclmuldq pdcm pdpe1gb popcnt pse rdrnd rdtscp sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
    Number of AVX-512 FMA units: Cannot run AVX-512 detection - assuming 2
  Hardware topology: Basic
    Sockets, cores, and logical processors:
      Socket  0: [   0  68 136 204] [   1  69 137 205] [   2  70 138 206] [   3  71 139 207] [   4  72 140 208] [   5  73 141 209] [   6  74 142 210] [   7  75 143 211] [   8  76 144 212] [   9  77 145 213] [  10  78 146 214] [  11  79 147 215] [  12  80 148 216] [  13  81 149 217] [  14  82 150 218] [  15  83 151 219] [  16  84 152 220] [  17  85 153 221] [  18  86 154 222] [  19  87 155 223] [  20  88 156 224] [  21  89 157 225] [  22  90 158 226] [  23  91 159 227] [  24  92 160 228] [  25  93 161 229] [  26  94 162 230] [  27  95 163 231] [  28  96 164 232] [  29  97 165 233] [  30  98 166 234] [  31  99 167 235] [  32 100 168 236] [  33 101 169 237] [  34 102 170 238] [  35 103 171 239] [  36 104 172 240] [  37 105 173 241] [  38 106 174 242] [  39 107 175 243] [  40 108 176 244] [  41 109 177 245] [  42 110 178 246] [  43 111 179 247] [  44 112 180 248] [  45 113 181 249] [  46 114 182 250] [  47 115 183 251] [  48 116 184 252] [  49 117 185 253] [  50 118 186 254] [  51 119 187 255] [  52 120 188 256] [  53 121 189 257] [  54 122 190 258] [  55 123 191 259] [  56 124 192 260] [  57 125 193 261] [  58 126 194 262] [  59 127 195 263] [  60 128 196 264] [  61 129 197 265] [  62 130 198 266] [  63 131 199 267] [  64 132 200 268] [  65 133 201 269] [  66 134 202 270] [  67 135 203 271]

When dynamic load balancing gets turned on, these settings will change to:
The maximum number of communication pulses is: X 3 Y 3 Z 3
The minimum size for domain decomposition cells is 0.672 nm
The requested allowed shrink of DD cells (option -dds) is: 0.80
The allowed shrink of domain decomposition cells is: X 0.80 Y 0.32 Z 0.64
The maximum allowed distance for atom groups involved in interactions is:
                 non-bonded interactions           1.620 nm
            two-body bonded interactions  (-rdd)   1.620 nm
          multi-body bonded interactions  (-rdd)   0.672 nm
Using two step summing over 10 groups of on average 32.0 ranks


Using 410 MPI processes

Non-default thread affinity set, disabling internal thread affinity

Using 1 OpenMP thread per MPI process


On 320 MPI ranks doing PP, and
on 90 MPI ranks doing PME

 Computing:          Num   Num      Call    Wall time         Giga-Cycles
                     Ranks Threads  Count      (s)         total sum    %
-----------------------------------------------------------------------------
 Domain decomp.       320    1        765       1.180        528.504   0.8
 DD comm. load        320    1         17       0.001          0.496   0.0
 Send X to PME        320    1      76501       0.654        293.082   0.4
 Neighbor search      320    1        766       1.355        606.828   0.9
 Comm. coord.         320    1      75735       7.736       3465.537   5.3
 Force                320    1      76501      52.209      23388.873  35.9
 Wait + Comm. F       320    1      76501      14.530       6509.069  10.0
 PME mesh *            90    1      76501      93.982      11841.319  18.2
 PME wait for PP *                             17.765       2238.333   3.4
 Wait + Recv. PME F   320    1      76501      24.066      10781.342  16.5
 NB X/F buffer ops.   320    1     227971       3.833       1717.197   2.6
 Write traj.          320    1         17       0.073         32.659   0.1
 Update               320    1      76501       1.366        611.992   0.9
 Constraints          320    1      76501       2.190        981.049   1.5
 Comm. energies       320    1       3826       2.154        964.945   1.5
 Rest                                           2.174        973.828   1.5
-----------------------------------------------------------------------------
 Total                                        113.521      65158.484 100.0
-----------------------------------------------------------------------------
(*) Note that with separate PME ranks, the walltime column actually sums to
    twice the total reported, but the cycle count total and % are correct.
-----------------------------------------------------------------------------
 Breakdown of PME mesh computation
-----------------------------------------------------------------------------
 PME redist. X/F       90    1     153002      23.303       2936.136   4.5
 PME spread            90    1      76501      15.991       2014.841   3.1
 PME gather            90    1      76501      14.080       1774.026   2.7
 PME 3D-FFT            90    1     153002      18.150       2286.765   3.5
 PME 3D-FFT Comm.      90    1     306004      20.337       2562.336   3.9
 PME solve Elec        90    1      76501       1.083        136.413   0.2
-----------------------------------------------------------------------------

               Core t (s)   Wall t (s)        (%)
       Time:    46539.223      113.521    40996.3
                 (ns/day)    (hour/ns)
Performance:      116.449        0.206
Finished mdrun on rank 0 Mon Jan 10 20:42:49 2022
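
As a sanity check on the figure above, the reported ns/day can be roughly reproduced from the step count, the time step, and the wall time. The Python sketch below is only an illustration: it assumes dt = 0.002 ps (the value in the single-node input parameters, assumed to be the same for this run) and uses the Force call count (76501) as a stand-in for the number of steps.

```python
SECONDS_PER_DAY = 86400.0

def ns_per_day(n_steps: int, dt_ps: float, wall_s: float) -> float:
    """Simulated nanoseconds per day of wall-clock time."""
    simulated_ns = n_steps * dt_ps / 1000.0  # ps -> ns
    return simulated_ns * SECONDS_PER_DAY / wall_s

# Step count and wall time are taken from the 10-node log excerpt;
# dt_ps is an assumption (see the note above).
perf = ns_per_day(n_steps=76501, dt_ps=0.002, wall_s=113.521)
print(f"{perf:.1f} ns/day")
```

The result lands within rounding of the 116.449 ns/day that mdrun reports, so the headline number is consistent with the timing table.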

How is the performance on a single node? Are you getting good parallel efficiency? I can’t see anything wrong in the information you shared. GROMACS performance on KNL is in general quite good (relative to the capabilities of the hardware), but less so compared to more recent CPUs, and it may seem rather low, especially when compared to GPUs.
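
For reference, parallel efficiency here means the measured multi-node performance divided by the node count times the single-node performance. A minimal Python sketch, using only the 116.449 ns/day on 10 nodes quoted in this thread; the efficiency values in the loop are illustrative assumptions, not measurements:

```python
def parallel_efficiency(perf_n: float, n_nodes: int, perf_1: float) -> float:
    """Measured speed-up divided by ideal (linear) speed-up."""
    return perf_n / (n_nodes * perf_1)

def single_node_perf_for(perf_n: float, n_nodes: int, efficiency: float) -> float:
    """Single-node ns/day that would correspond to a given efficiency."""
    return perf_n / (n_nodes * efficiency)

perf_10_nodes = 116.449  # ns/day, from the 10-node log
for eff in (1.0, 0.75, 0.5):  # hypothetical efficiencies, for illustration only
    perf_1 = single_node_perf_for(perf_10_nodes, 10, eff)
    print(f"efficiency {eff:.0%} <-> single node {perf_1:.1f} ns/day")
```

Plugging the measured single-node ns/day into `parallel_efficiency` gives the actual figure for this system.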

Perhaps you could get better performance by increasing the threads/rank and improving PP-PME load balance, but don’t expect huge improvements.
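
To make the threads/rank point concrete, here is a small Python sketch of the arithmetic. It takes the hardware figures from the log (68 cores and 272 logical cores per node) and the 10-node run's 410 ranks with 1 OpenMP thread each. The list of thread counts is an assumption for illustration, and which layout is fastest (including whether using all four hardware threads per core helps) has to be benchmarked.

```python
# Hardware figures from the log: 68 cores and 272 logical cores per node.
CORES_PER_NODE = 68
LOGICAL_PER_NODE = 272

def candidate_layouts(logical_per_node, omp_choices=(1, 2, 4, 8, 16)):
    """Return (ranks_per_node, omp_threads) pairs that fill every logical core."""
    return [(logical_per_node // t, t) for t in omp_choices
            if logical_per_node % t == 0]

# The 10-node run above: 410 MPI ranks with 1 OpenMP thread each.
nodes, ranks, omp_threads = 10, 410, 1
total_threads = ranks * omp_threads
core_fraction = total_threads / (nodes * CORES_PER_NODE)
print(f"{total_threads} threads on {nodes * CORES_PER_NODE} cores "
      f"({core_fraction:.1%} of physical cores)")

# Candidate layouts to benchmark; these are not recommendations.
layouts = candidate_layouts(LOGICAL_PER_NODE)
for ranks_per_node, threads in layouts:
    print(f"{ranks_per_node:4d} ranks/node x {threads:2d} threads/rank")
```

With 1 thread per rank, 410 ranks cannot occupy all 680 physical cores, which is one reason more threads per rank may be worth testing.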

A full log would be more useful; from this excerpt it is not obvious how much DD headroom you have, how much load imbalance there is, etc.
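
A rough first look at the PP-PME balance is possible from the timing table alone. The Python sketch below uses only numbers from the 10-node excerpt. Treat it as a back-of-the-envelope reading, not a substitute for the load-balance lines in the full log.

```python
# Values copied from the 10-node timing table.
pp_ranks, pme_ranks = 320, 90
wall_total_s = 113.521
pp_wait_for_pme_s = 24.066   # "Wait + Recv. PME F" (on PP ranks)
pme_wait_for_pp_s = 17.765   # "PME wait for PP" (on PME ranks)

# Share of all ranks dedicated to PME.
pme_rank_fraction = pme_ranks / (pp_ranks + pme_ranks)

# Share of the run's wall time each side spends waiting for the other.
pp_wait_fraction = pp_wait_for_pme_s / wall_total_s
pme_wait_fraction = pme_wait_for_pp_s / wall_total_s

print(f"PME ranks:             {pme_rank_fraction:.1%} of all ranks")
print(f"PP ranks wait for PME: {pp_wait_fraction:.1%} of wall time")
print(f"PME ranks wait for PP: {pme_wait_fraction:.1%} of wall time")
```

Both sides spend a noticeable share of the run waiting on each other, which fits the suggestion that the PP-PME balance has some room for tuning.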

Cheers,
Szilárd

Attached below is the single-node log:

                      :-) GROMACS - gmx mdrun, 2020.4 (-:

                            GROMACS is written by:
     Emile Apol      Rossen Apostolov      Paul Bauer     Herman J.C. Berendsen
    Par Bjelkmar      Christian Blau   Viacheslav Bolnykh     Kevin Boyd    
 Aldert van Buuren   Rudi van Drunen     Anton Feenstra       Alan Gray     
  Gerrit Groenhof     Anca Hamuraru    Vincent Hindriksen  M. Eric Irrgang  
  Aleksei Iupinov   Christoph Junghans     Joe Jordan     Dimitrios Karkoulis
    Peter Kasson        Jiri Kraus      Carsten Kutzner      Per Larsson    
  Justin A. Lemkul    Viveca Lindahl    Magnus Lundborg     Erik Marklund   
    Pascal Merz     Pieter Meulenhoff    Teemu Murtola       Szilard Pall   
    Sander Pronk      Roland Schulz      Michael Shirts    Alexey Shvetsov  
   Alfons Sijbers     Peter Tieleman      Jon Vincent      Teemu Virolainen 
 Christian Wennberg    Maarten Wolf      Artem Zhmurov   
                           and the project leaders:
        Mark Abraham, Berk Hess, Erik Lindahl, and David van der Spoel

Copyright (c) 1991-2000, University of Groningen, The Netherlands.
Copyright (c) 2001-2019, The GROMACS development team at
Uppsala University, Stockholm University and
the Royal Institute of Technology, Sweden.
check out http://www.gromacs.org for more information.

GROMACS is free software; you can redistribute it and/or modify it
under the terms of the GNU Lesser General Public License
as published by the Free Software Foundation; either version 2.1
of the License, or (at your option) any later version.

GROMACS:      gmx mdrun, version 2020.4
Executable:   /work/opt/local/apps/intel/2019.5.281/impi/2019.5.281/gromacs/2020.4/bin/gmx_mpi
Data prefix:  /work/opt/local/apps/intel/2019.5.281/impi/2019.5.281/gromacs/2020.4
Working dir:  /work/2/hp210295/u18000/test
Process ID:   39745
Command line:
  gmx_mpi mdrun -deffnm 6

GROMACS version:    2020.4
Verified release checksum is 79c2857291b034542c26e90512b92fd4b184a1c9d6fa59c55f2e24ccf14e7281
Precision:          single
Memory model:       64 bit
MPI library:        MPI
OpenMP support:     enabled (GMX_OPENMP_MAX_THREADS = 64)
GPU support:        disabled
SIMD instructions:  AVX_512_KNL
FFT library:        fftw-3.3.8-avx-avx2-avx2_128-avx512
RDTSCP usage:       enabled
TNG support:        enabled
Hwloc support:      hwloc-1.11.5rc2
Tracing support:    disabled
C compiler:         /opt/intel/impi/2019.5.281/intel64/bin/mpiicc Intel 19.0.5.20190815
C compiler flags:   -xMIC-AVX512 -std=gnu99 -ip -funroll-all-loops -alias-const -ansi-alias -no-prec-div -fimf-domain-exclusion=14 -qoverride-limits -O3 -DNDEBUG
C++ compiler:       /opt/intel/impi/2019.5.281/intel64/bin/mpiicpc Intel 19.0.5.20190815
C++ compiler flags: -xMIC-AVX512 -ip -funroll-all-loops -alias-const -ansi-alias -no-prec-div -fimf-domain-exclusion=14 -qoverride-limits -qopenmp -O3 -DNDEBUG


Running on 1 node with total 68 cores, 272 logical cores
Hardware detected on host c0253.ofp (the node of MPI rank 0):
  CPU info:
    Vendor: Intel
    Brand:  Intel(R) Xeon Phi(TM) CPU 7250 @ 1.40GHz
    Family: 6   Model: 87   Stepping: 1
    Features: aes apic avx avx2 avx512f avx512pf avx512er avx512cd clfsh cmov cx8 cx16 f16c fma htt intel lahf mmx msr nonstop_tsc pclmuldq pdcm pdpe1gb popcnt pse rdrnd rdtscp sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
    Number of AVX-512 FMA units: Cannot run AVX-512 detection - assuming 2
  Hardware topology: Basic
    Sockets, cores, and logical processors:
      Socket  0: [   0  68 136 204] [   1  69 137 205] [   2  70 138 206] [   3  71 139 207] [   4  72 140 208] [   5  73 141 209] [   6  74 142 210] [   7  75 143 211] [   8  76 144 212] [   9  77 145 213] [  10  78 146 214] [  11  79 147 215] [  12  80 148 216] [  13  81 149 217] [  14  82 150 218] [  15  83 151 219] [  16  84 152 220] [  17  85 153 221] [  18  86 154 222] [  19  87 155 223] [  20  88 156 224] [  21  89 157 225] [  22  90 158 226] [  23  91 159 227] [  24  92 160 228] [  25  93 161 229] [  26  94 162 230] [  27  95 163 231] [  28  96 164 232] [  29  97 165 233] [  30  98 166 234] [  31  99 167 235] [  32 100 168 236] [  33 101 169 237] [  34 102 170 238] [  35 103 171 239] [  36 104 172 240] [  37 105 173 241] [  38 106 174 242] [  39 107 175 243] [  40 108 176 244] [  41 109 177 245] [  42 110 178 246] [  43 111 179 247] [  44 112 180 248] [  45 113 181 249] [  46 114 182 250] [  47 115 183 251] [  48 116 184 252] [  49 117 185 253] [  50 118 186 254] [  51 119 187 255] [  52 120 188 256] [  53 121 189 257] [  54 122 190 258] [  55 123 191 259] [  56 124 192 260] [  57 125 193 261] [  58 126 194 262] [  59 127 195 263] [  60 128 196 264] [  61 129 197 265] [  62 130 198 266] [  63 131 199 267] [  64 132 200 268] [  65 133 201 269] [  66 134 202 270] [  67 135 203 271]


++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, E.
Lindahl
GROMACS: High performance molecular simulations through multi-level
parallelism from laptops to supercomputers
SoftwareX 1 (2015) pp. 19-25
-------- -------- --- Thank You --- -------- --------


++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
S. Páll, M. J. Abraham, C. Kutzner, B. Hess, E. Lindahl
Tackling Exascale Software Challenges in Molecular Dynamics Simulations with
GROMACS
In S. Markidis & E. Laure (Eds.), Solving Software Challenges for Exascale 8759 (2015) pp. 3-27
-------- -------- --- Thank You --- -------- --------


++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
S. Pronk, S. Páll, R. Schulz, P. Larsson, P. Bjelkmar, R. Apostolov, M. R.
Shirts, J. C. Smith, P. M. Kasson, D. van der Spoel, B. Hess, and E. Lindahl
GROMACS 4.5: a high-throughput and highly parallel open source molecular
simulation toolkit
Bioinformatics 29 (2013) pp. 845-54
-------- -------- --- Thank You --- -------- --------


++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
B. Hess and C. Kutzner and D. van der Spoel and E. Lindahl
GROMACS 4: Algorithms for highly efficient, load-balanced, and scalable
molecular simulation
J. Chem. Theory Comput. 4 (2008) pp. 435-447
-------- -------- --- Thank You --- -------- --------


++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
D. van der Spoel, E. Lindahl, B. Hess, G. Groenhof, A. E. Mark and H. J. C.
Berendsen
GROMACS: Fast, Flexible and Free
J. Comp. Chem. 26 (2005) pp. 1701-1719
-------- -------- --- Thank You --- -------- --------


++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
E. Lindahl and B. Hess and D. van der Spoel
GROMACS 3.0: A package for molecular simulation and trajectory analysis
J. Mol. Mod. 7 (2001) pp. 306-317
-------- -------- --- Thank You --- -------- --------


++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
H. J. C. Berendsen, D. van der Spoel and R. van Drunen
GROMACS: A message-passing parallel molecular dynamics implementation
Comp. Phys. Comm. 91 (1995) pp. 43-56
-------- -------- --- Thank You --- -------- --------


++++ PLEASE CITE THE DOI FOR THIS VERSION OF GROMACS ++++
https://doi.org/10.5281/zenodo.4054979
-------- -------- --- Thank You --- -------- --------

Input Parameters:
   integrator                     = md
   tinit                          = 0
   dt                             = 0.002
   nsteps                         = 50000000
   init-step                      = 0
   simulation-part                = 1
   comm-mode                      = Linear
   nstcomm                        = 100
   bd-fric                        = 0
   ld-seed                        = -1642179826
   emtol                          = 10
   emstep                         = 0.01
   niter                          = 20
   fcstep                         = 0
   nstcgsteep                     = 1000
   nbfgscorr                      = 10
   rtpi                           = 0.05
   nstxout                        = 0
   nstvout                        = 0
   nstfout                        = 0
   nstlog                         = 5000
   nstcalcenergy                  = 100
   nstenergy                      = 5000
   nstxout-compressed             = 5000
   compressed-x-precision         = 1000
   cutoff-scheme                  = Verlet
   nstlist                        = 20
   pbc                            = xyz
   periodic-molecules             = false
   verlet-buffer-tolerance        = 0.005
   rlist                          = 1.222
   coulombtype                    = PME
   coulomb-modifier               = Potential-shift
   rcoulomb-switch                = 0
   rcoulomb                       = 1.2
   epsilon-r                      = 1
   epsilon-rf                     = inf
   vdw-type                       = Cut-off
   vdw-modifier                   = Force-switch
   rvdw-switch                    = 1
   rvdw                           = 1.2
   DispCorr                       = No
   table-extension                = 1
   fourierspacing                 = 0.12
   fourier-nx                     = 72
   fourier-ny                     = 72
   fourier-nz                     = 72
   pme-order                      = 4
   ewald-rtol                     = 1e-05
   ewald-rtol-lj                  = 0.001
   lj-pme-comb-rule               = Geometric
   ewald-geometry                 = 0
   epsilon-surface                = 0
   tcoupl                         = Nose-Hoover
   nsttcouple                     = 20
   nh-chain-length                = 1
   print-nose-hoover-chain-variables = false
   pcoupl                         = Parrinello-Rahman
   pcoupltype                     = Isotropic
   nstpcouple                     = 20
   tau-p                          = 5
   compressibility (3x3):
      compressibility[    0]={ 4.50000e-05,  0.00000e+00,  0.00000e+00}
      compressibility[    1]={ 0.00000e+00,  4.50000e-05,  0.00000e+00}
      compressibility[    2]={ 0.00000e+00,  0.00000e+00,  4.50000e-05}
   ref-p (3x3):
      ref-p[    0]={ 1.00000e+00,  0.00000e+00,  0.00000e+00}
      ref-p[    1]={ 0.00000e+00,  1.00000e+00,  0.00000e+00}
      ref-p[    2]={ 0.00000e+00,  0.00000e+00,  1.00000e+00}
   refcoord-scaling               = COM
   posres-com (3):
      posres-com[0]= 0.00000e+00
      posres-com[1]= 0.00000e+00
      posres-com[2]= 0.00000e+00
   posres-comB (3):
      posres-comB[0]= 0.00000e+00
      posres-comB[1]= 0.00000e+00
      posres-comB[2]= 0.00000e+00
   QMMM                           = false
   QMconstraints                  = 0
   QMMMscheme                     = 0
   MMChargeScaleFactor            = 1
qm-opts:
   ngQM                           = 0
   constraint-algorithm           = Lincs
   continuation                   = true
   Shake-SOR                      = false
   shake-tol                      = 0.0001
   lincs-order                    = 4
   lincs-iter                     = 1
   lincs-warnangle                = 30
   nwall                          = 0
   wall-type                      = 9-3
   wall-r-linpot                  = -1
   wall-atomtype[0]               = -1
   wall-atomtype[1]               = -1
   wall-density[0]                = 0
   wall-density[1]                = 0
   wall-ewald-zfac                = 3
   pull                           = false
   awh                            = false
   rotation                       = false
   interactiveMD                  = false
   disre                          = No
   disre-weighting                = Conservative
   disre-mixed                    = false
   dr-fc                          = 1000
   dr-tau                         = 0
   nstdisreout                    = 100
   orire-fc                       = 0
   orire-tau                      = 0
   nstorireout                    = 100
   free-energy                    = no
   cos-acceleration               = 0
   deform (3x3):
      deform[    0]={ 0.00000e+00,  0.00000e+00,  0.00000e+00}
      deform[    1]={ 0.00000e+00,  0.00000e+00,  0.00000e+00}
      deform[    2]={ 0.00000e+00,  0.00000e+00,  0.00000e+00}
   simulated-tempering            = false
   swapcoords                     = no
   userint1                       = 0
   userint2                       = 0
   userint3                       = 0
   userint4                       = 0
   userreal1                      = 0
   userreal2                      = 0
   userreal3                      = 0
   userreal4                      = 0
   applied-forces:
     electric-field:
       x:
         E0                       = 0
         omega                    = 0
         t0                       = 0
         sigma                    = 0
       y:
         E0                       = 0
         omega                    = 0
         t0                       = 0
         sigma                    = 0
       z:
         E0                       = 0
         omega                    = 0
         t0                       = 0
         sigma                    = 0
     density-guided-simulation:
       active                     = false
       group                      = protein
       similarity-measure         = inner-product
       atom-spreading-weight      = unity
       force-constant             = 1e+09
       gaussian-transform-spreading-width = 0.2
       gaussian-transform-spreading-range-in-multiples-of-width = 4
       reference-density-filename = reference.mrc
       nst                        = 1
       normalize-densities        = true
       adaptive-force-scaling     = false
       adaptive-force-scaling-time-constant = 4
grpopts:
   nrdf:      114689
   ref-t:         310
   tau-t:           1
annealing:          No
annealing-npoints:           0
   acc:	           0           0           0
   nfreeze:           N           N           N
   energygrp-flags[  0]: 0

Changing nstlist from 20 to 100, rlist from 1.222 to 1.342


Initializing Domain Decomposition on 64 ranks
Dynamic load balancing: auto
Using update groups, nr 19464, average size 2.9 atoms, max. radius 0.139 nm
Minimum cell size due to atom displacement: 0.666 nm
Initial maximum distances in bonded interactions:
    two-body bonded interactions: 0.470 nm, LJ-14, atoms 3436 3931
  multi-body bonded interactions: 0.499 nm, CMAP Dih., atoms 654 666
Minimum cell size due to bonded interactions: 0.548 nm
Scaling the initial minimum size with 1/0.8 (option -dds) = 1.25
Guess for relative PME load: 0.19
Will use 48 particle-particle and 16 PME only ranks
This is a guess, check the performance at the end of the log file
Using 16 separate PME ranks, as guessed by mdrun
Optimizing the DD grid for 48 cells with a minimum initial size of 0.832 nm
The maximum allowed number of cells is: X 10 Y 10 Z 10
Domain decomposition grid 4 x 4 x 3, separate PME ranks 16
PME domain decomposition: 4 x 4 x 1
Interleaving PP and PME ranks
This rank does only particle-particle work.
Domain decomposition rank 0, coordinates 0 0 0

The initial number of communication pulses is: X 1 Y 1 Z 1
The initial domain decomposition cell size is: X 2.10 nm Y 2.10 nm Z 2.80 nm

The maximum allowed distance for atom groups involved in interactions is:
                 non-bonded interactions           1.620 nm
(the following are initial values, they could change due to box deformation)
            two-body bonded interactions  (-rdd)   1.620 nm
          multi-body bonded interactions  (-rdd)   1.620 nm

When dynamic load balancing gets turned on, these settings will change to:
The maximum number of communication pulses is: X 1 Y 1 Z 1
The minimum size for domain decomposition cells is 1.620 nm
The requested allowed shrink of DD cells (option -dds) is: 0.80
The allowed shrink of domain decomposition cells is: X 0.77 Y 0.77 Z 0.58
The maximum allowed distance for atom groups involved in interactions is:
                 non-bonded interactions           1.620 nm
            two-body bonded interactions  (-rdd)   1.620 nm
          multi-body bonded interactions  (-rdd)   1.620 nm

Using 64 MPI processes

Non-default thread affinity set, disabling internal thread affinity

Using 4 OpenMP threads per MPI process

System total charge: 0.000
Will do PME sum in reciprocal space for electrostatic interactions.

++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
U. Essmann, L. Perera, M. L. Berkowitz, T. Darden, H. Lee and L. G. Pedersen 
A smooth particle mesh Ewald method
J. Chem. Phys. 103 (1995) pp. 8577-8592
-------- -------- --- Thank You --- -------- --------

Using a Gaussian width (1/beta) of 0.384195 nm for Ewald
Potential shift: LJ r^-12: -2.648e-01 r^-6: -5.349e-01, Ewald -8.333e-06
Initialized non-bonded Ewald tables, spacing: 1.02e-03 size: 1176

Generated table with 1171 data points for 1-4 COUL.
Tabscale = 500 points/nm
Generated table with 1171 data points for 1-4 LJ6.
Tabscale = 500 points/nm
Generated table with 1171 data points for 1-4 LJ12.
Tabscale = 500 points/nm

Using SIMD 4x8 nonbonded short-range kernels

Using a dual 4x8 pair-list setup updated with dynamic pruning:
  outer list: updated every 100 steps, buffer 0.142 nm, rlist 1.342 nm
  inner list: updated every  13 steps, buffer 0.001 nm, rlist 1.201 nm
At tolerance 0.005 kJ/mol/ps per atom, equivalent classical 1x1 list would be:
  outer list: updated every 100 steps, buffer 0.296 nm, rlist 1.496 nm
  inner list: updated every  13 steps, buffer 0.052 nm, rlist 1.252 nm


Initializing LINear Constraint Solver

++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
B. Hess and H. Bekker and H. J. C. Berendsen and J. G. E. M. Fraaije
LINCS: A Linear Constraint Solver for molecular simulations
J. Comp. Chem. 18 (1997) pp. 1463-1472
-------- -------- --- Thank You --- -------- --------

The number of constraints is 2047

++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
S. Miyamoto and P. A. Kollman
SETTLE: An Analytical Version of the SHAKE and RATTLE Algorithms for Rigid
Water Models
J. Comp. Chem. 13 (1992) pp. 952-962
-------- -------- --- Thank You --- -------- --------


Linking all bonded interactions to atoms


Intra-simulation communication will occur every 20 steps.
There are: 56315 Atoms
Atom distribution over 48 domains: av 1173 stddev 46 min 1110 max 1323
Center of mass motion removal mode is Linear
We have the following groups for center of mass motion removal:
  0:  System

Started mdrun on rank 0 Tue Jan 11 16:31:21 2022

           Step           Time
              0        0.00000

   Energies (kJ/mol)
           Bond            U-B    Proper Dih.  Improper Dih.      CMAP Dih.
    3.71683e+03    1.09179e+04    1.20019e+04    7.06692e+02   -6.36002e+02
          LJ-14     Coulomb-14        LJ (SR)   Coulomb (SR)   Coul. recip.
    2.73344e+03    3.72157e+04    6.66096e+04   -8.51480e+05    3.12529e+03
      Potential    Kinetic En.   Total Energy  Conserved En.    Temperature
   -7.15089e+05    1.49243e+05   -5.65846e+05   -5.65810e+05    3.13017e+02
 Pressure (bar)   Constr. rmsd
   -4.37170e+02    3.68922e-06


DD  step 99 load imb.: force 22.5%  pme mesh/force 3.184
step  600: timed with pme grid 72 72 72, coulomb cutoff 1.200: 503.2 M-cycles
step  800: timed with pme grid 60 60 60, coulomb cutoff 1.400: 524.1 M-cycles
step 1000: timed with pme grid 52 52 52, coulomb cutoff 1.615: 684.2 M-cycles
step 1200: timed with pme grid 56 56 56, coulomb cutoff 1.500: 606.1 M-cycles
step 1400: timed with pme grid 60 60 60, coulomb cutoff 1.400: 528.0 M-cycles
step 1600: timed with pme grid 64 64 64, coulomb cutoff 1.313: 460.5 M-cycles
step 1800: timed with pme grid 72 72 72, coulomb cutoff 1.200: 486.3 M-cycles
step 2000: timed with pme grid 64 64 64, coulomb cutoff 1.313: 469.8 M-cycles
step 2200: timed with pme grid 72 72 72, coulomb cutoff 1.200: 497.6 M-cycles
              optimal pme grid 64 64 64, coulomb cutoff 1.313

DD  step 4999 load imb.: force  8.8%  pme mesh/force 1.014
           Step           Time
           5000       10.00000

   Energies (kJ/mol)
           Bond            U-B    Proper Dih.  Improper Dih.      CMAP Dih.
    3.59132e+03    1.05560e+04    1.15225e+04    6.60512e+02   -5.45052e+02
          LJ-14     Coulomb-14        LJ (SR)   Coulomb (SR)   Coul. recip.
    2.62548e+03    3.72929e+04    6.86834e+04   -8.59544e+05    2.07033e+03
      Potential    Kinetic En.   Total Energy  Conserved En.    Temperature
   -7.23086e+05    1.48147e+05   -5.74940e+05   -5.65841e+05    3.10718e+02
 Pressure (bar)   Constr. rmsd
    1.36763e+02    3.09292e-06


DD  step 9999 load imb.: force  8.1%  pme mesh/force 1.007
           Step           Time
          10000       20.00000

   Energies (kJ/mol)
           Bond            U-B    Proper Dih.  Improper Dih.      CMAP Dih.
    3.50481e+03    1.04389e+04    1.13793e+04    6.51941e+02   -5.48826e+02
          LJ-14     Coulomb-14        LJ (SR)   Coulomb (SR)   Coul. recip.
    2.61028e+03    3.72691e+04    6.82935e+04   -8.63144e+05    2.01637e+03
      Potential    Kinetic En.   Total Energy  Conserved En.    Temperature
   -7.27528e+05    1.46082e+05   -5.81446e+05   -5.65470e+05    3.06388e+02
 Pressure (bar)   Constr. rmsd
    1.16957e+01    2.98622e-06



Received the TERM signal, stopping within 100 steps

           Step           Time
          14700       29.40000

Writing checkpoint, step 14700 at Tue Jan 11 16:32:41 2022


   Energies (kJ/mol)
           Bond            U-B    Proper Dih.  Improper Dih.      CMAP Dih.
    3.64271e+03    1.05973e+04    1.12037e+04    6.14280e+02   -5.71026e+02
          LJ-14     Coulomb-14        LJ (SR)   Coulomb (SR)   Coul. recip.
    2.55995e+03    3.71370e+04    6.87223e+04   -8.63461e+05    2.03921e+03
      Potential    Kinetic En.   Total Energy  Conserved En.    Temperature
   -7.27515e+05    1.48024e+05   -5.79491e+05   -5.65332e+05    3.10460e+02
 Pressure (bar)   Constr. rmsd
    7.06370e+01    3.13095e-06


	<======  ###############  ==>
	<====  A V E R A G E S  ====>
	<==  ###############  ======>

	Statistics over 14701 steps using 148 frames

   Energies (kJ/mol)
           Bond            U-B    Proper Dih.  Improper Dih.      CMAP Dih.
    3.65743e+03    1.04505e+04    1.14343e+04    6.48923e+02   -5.53127e+02
          LJ-14     Coulomb-14        LJ (SR)   Coulomb (SR)   Coul. recip.
    2.59964e+03    3.73598e+04    6.84031e+04   -8.61622e+05    2.07372e+03
      Potential    Kinetic En.   Total Energy  Conserved En.    Temperature
   -7.25548e+05    1.47890e+05   -5.77658e+05   -5.65614e+05    3.10178e+02
 Pressure (bar)   Constr. rmsd
   -1.91684e+00    0.00000e+00

          Box-X          Box-Y          Box-Z
    8.24066e+00    8.24066e+00    8.24066e+00

   Total Virial (kJ/mol)
    4.92424e+04    6.43378e+01    2.64065e+02
    6.68609e+01    4.96511e+04   -1.88036e+02
    2.68684e+02   -1.84937e+02    4.92011e+04

   Pressure (bar)
    4.31721e+00   -2.09554e+00   -1.44692e+01
   -2.24871e+00   -1.79220e+01    1.13233e+01
   -1.47432e+01    1.11393e+01    7.85428e+00


       P P   -   P M E   L O A D   B A L A N C I N G

 PP/PME load balancing changed the cut-off and PME settings:
           particle-particle                    PME
            rcoulomb  rlist            grid      spacing   1/beta
   initial  1.200 nm  1.201 nm      72  72  72   0.117 nm  0.384 nm
   final    1.313 nm  1.314 nm      64  64  64   0.131 nm  0.420 nm
 cost-ratio           1.31             0.70
 (note that these numbers concern only part of the total PP and PME load)


	M E G A - F L O P S   A C C O U N T I N G

 NB=Group-cutoff nonbonded kernels    NxN=N-by-N cluster Verlet kernels
 RF=Reaction-Field  VdW=Van der Waals  QSTab=quadratic-spline table
 W3=SPC/TIP3p  W4=TIP4p (single or pairs)
 V&F=Potential and force  V=Potential only  F=Force only

 Computing:                               M-Number         M-Flops  % Flops
-----------------------------------------------------------------------------
 Pair Search distance check            5668.050006       51012.450     0.1
 NxN Ewald Elec. + LJ [F]            962883.301248    75104897.497    94.6
 NxN Ewald Elec. + LJ [V&F]            9792.196000     1263193.284     1.6
 NxN LJ [F]                               7.628544         343.284     0.0
 NxN LJ [V&F]                             0.077056           5.009     0.0
 NxN Ewald Elec. [F]                  18005.907744     1098360.372     1.4
 NxN Ewald Elec. [V&F]                  183.107328       15381.016     0.0
 1,4 nonbonded interactions             158.315069       14248.356     0.0
 Calc Weights                          2483.660445       89411.776     0.1
 Spread Q Bspline                     52984.756160      105969.512     0.1
 Gather F Bspline                     52984.756160      317908.537     0.4
 3D-FFT                              140814.384784     1126515.078     1.4
 Solve PME                              242.537984       15522.431     0.0
 Reset In Box                             8.278305          24.835     0.0
 CG-CoM                                   8.334620          25.004     0.0
 Bonds                                   30.827997        1818.852     0.0
 Propers                                152.934503       35022.001     0.0
 Impropers                                9.981979        2076.252     0.0
 Virial                                  43.037600         774.677     0.0
 Stop-CM                                  8.334620          83.346     0.0
 Calc-Ekin                               82.895680        2238.183     0.0
 Lincs                                   30.092947        1805.577     0.0
 Lincs-Mat                              160.005684         640.023     0.0
 Constraint-V                           827.666300        6621.330     0.0
 Constraint-Vir                          39.930208         958.325     0.0
 Settle                                 255.826802       82632.057     0.1
 CMAP                                     3.939868        6697.776     0.0
 Urey-Bradley                           109.948779       20120.627     0.0
-----------------------------------------------------------------------------
 Total                                                79364307.467   100.0
-----------------------------------------------------------------------------


    D O M A I N   D E C O M P O S I T I O N   S T A T I S T I C S

 av. #atoms communicated per step for force:  2 x 151814.8


Dynamic load balancing report:
 DLB was off during the run due to low measured imbalance.
 Average load imbalance: 14.8%.
 The balanceable part of the MD step is 22%, load imbalance is computed from this.
 Part of the total run time spent waiting due to load imbalance: 3.3%.
 Average PME mesh/force load: 1.669
 Part of the total run time spent waiting due to PP/PME imbalance: 11.9 %

NOTE: 11.9 % performance was lost because the PME ranks
      had more work to do than the PP ranks.
      You might want to increase the number of PME ranks
      or increase the cut-off and the grid spacing.


     R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G

On 48 MPI ranks doing PP, each using 4 OpenMP threads, and
on 16 MPI ranks doing PME, each using 4 OpenMP threads

 Computing:          Num   Num      Call    Wall time         Giga-Cycles
                     Ranks Threads  Count      (s)         total sum    %
-----------------------------------------------------------------------------
 Domain decomp.        48    4        147       0.729        196.056   0.7
 DD comm. load         48    4          4       0.001          0.299   0.0
 Send X to PME         48    4      14701       0.288         77.533   0.3
 Neighbor search       48    4        148       1.197        321.746   1.1
 Comm. coord.          48    4      14553       5.178       1391.771   4.9
 Force                 48    4      14701      43.968      11818.152  41.3
 Wait + Comm. F        48    4      14701       7.199       1935.031   6.8
 PME mesh *            16    4      14701      56.152       5031.034  17.6
 PME wait for PP *                             20.387       1826.611   6.4
 Wait + Recv. PME F    48    4      14701      15.359       4128.482  14.4
 NB X/F buffer ops.    48    4      43807       2.704        726.750   2.5
 Write traj.           48    4          4       0.030          7.981   0.0
 Update                48    4      14701       0.804        216.067   0.8
 Constraints           48    4      14701       1.129        303.559   1.1
 Comm. energies        48    4        736       1.211        325.507   1.1
 Rest                                           0.021          5.701   0.0
-----------------------------------------------------------------------------
 Total                                         79.819      28606.180 100.0
-----------------------------------------------------------------------------
(*) Note that with separate PME ranks, the walltime column actually sums to
    twice the total reported, but the cycle count total and % are correct.
-----------------------------------------------------------------------------
 Breakdown of PME mesh computation
-----------------------------------------------------------------------------
 PME redist. X/F       16    4      29402      16.948       1518.472   5.3
 PME spread            16    4      14701      12.011       1076.144   3.8
 PME gather            16    4      14701       8.699        779.402   2.7
 PME 3D-FFT            16    4      29402      10.534        943.824   3.3
 PME 3D-FFT Comm.      16    4      58804       6.875        615.991   2.2
 PME solve Elec        16    4      14701       0.586         52.468   0.2
-----------------------------------------------------------------------------

               Core t (s)   Wall t (s)        (%)
       Time:    20431.170       79.819    25597.0
                 (ns/day)    (hour/ns)
Performance:       31.826        0.754
Finished mdrun on rank 0 Tue Jan 11 16:32:42 2022


This run now uses 4 threads per rank, rather than the 1 thread per rank of the 10-node run, which is likely part of the reason why PME is slow (note the significant PP/PME imbalance: 11.9% of the run time was lost waiting on the PME ranks).

By increasing the threads per rank, I meant keeping the rank count the same while using, e.g., 2 threads per rank.
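
To make the arithmetic concrete, here is a minimal sketch in Python. It takes the 48 PP + 16 PME ranks from the log above purely as an illustration (the rank count of the 10-node run is not shown here), and the function name `thread_layout` is made up for this example; it is not part of GROMACS.

```python
def thread_layout(pp_ranks, pme_ranks, threads_per_rank):
    """Return (total ranks, total threads, fraction of ranks doing PME)."""
    total_ranks = pp_ranks + pme_ranks
    total_threads = total_ranks * threads_per_rank
    return total_ranks, total_threads, pme_ranks / total_ranks


# Rank counts taken from the log above; only the threads per rank vary.
for threads in (4, 2, 1):
    ranks, total, pme_frac = thread_layout(48, 16, threads)
    print(f"{ranks} ranks x {threads} threads/rank = {total} threads, "
          f"PME rank fraction {pme_frac:.2f}")
```

With 64 ranks, 4 threads per rank occupy 256 hardware threads, which matches the roughly 25600% core-to-wall-time ratio reported in the log; 2 threads per rank would occupy 128.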

Regardless, this system clearly does not scale well to 10 nodes. It is also not the 1M-atom system you were initially talking about: it has only 50k atoms, so this is not too surprising. With fewer than ~500 atoms/core, scaling inefficiencies become increasingly pronounced.
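
As a rough check of the atoms/core figure, here is a small Python sketch. It assumes exactly 50,000 atoms, 68 physical cores per node (as in the hardware report at the top of the post), and that every core on all 10 nodes is used; the helper `atoms_per_core` is hypothetical and only does the division.

```python
def atoms_per_core(n_atoms, n_nodes, cores_per_node):
    """Average number of atoms handled by each physical core."""
    return n_atoms / (n_nodes * cores_per_node)


# Assumed values: ~50k atoms, 10 nodes, 68 cores per node.
small = atoms_per_core(50_000, 10, 68)
# The 1M-atom system mentioned originally, on the same hardware.
large = atoms_per_core(1_000_000, 10, 68)

print(f"50k atoms: {small:.1f} atoms/core")   # about 73.5
print(f"1M atoms:  {large:.1f} atoms/core")   # about 1470.6
```

Under these assumptions the 50k-atom system lands far below ~500 atoms/core on 10 nodes, while a 1M-atom system would stay well above it.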