Large % of "rest" time when running on GPU

GROMACS version: 2020.5
GROMACS modification: No

Hi!

I’m trying to run a simulation of ~450,000 TIP4P/ice water molecules on a GPU (more specifically, it’s an ice slab with half of the molecules frozen and a vacuum layer above the surface). To check the performance, I run:

gmx_mpi -quiet mdrun -nsteps 10000 -resethway -noconfout

The performance turns out to be unsatisfactory, and on checking md.log I found that the “Rest” part takes ~20% of the total time (see the log below). If I run on the CPU only, “Rest” occupies <1%. Is this normal, and how can I speed the simulation up? (nvidia-smi shows GPU utilization fluctuating between 0% and ~50%, so I suspect there is headroom.)
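
In case the exact utilization numbers matter: I watched plain nvidia-smi by hand; a per-second trace can also be obtained with something like

nvidia-smi dmon -s u -d 1

where -s u selects the utilization counters and -d 1 sets a 1-second sampling interval.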

The simulation runs on an E5-2640 v4 CPU (with 5 cores available) and a Tesla V100 GPU (the cluster I’m using enforces a 5:1 CPU-core-to-GPU ratio). The full md.log is provided below.
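
In the cycle accounting at the bottom of the log, Update, Constraints, Vsite constr. and Vsite spread together already account for ~40% of the wall time, so the next thing I was planning to try is forcing the update onto the GPU as well. All the flags below exist in mdrun 2020, but I am not sure GPU update is even allowed with freeze groups and virtual sites; if it is not, explicitly requesting it should make mdrun exit with an error rather than silently fall back:

gmx_mpi -quiet mdrun -nb gpu -pme gpu -update gpu -pin on -ntomp 5 -nsteps 10000 -resethway -noconfout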

Thanks in advance!


               :-) GROMACS - gmx mdrun, 2020.5-dev-UNCHECKED (-:

                            GROMACS is written by:
     Emile Apol      Rossen Apostolov      Paul Bauer     Herman J.C. Berendsen
    Par Bjelkmar      Christian Blau   Viacheslav Bolnykh     Kevin Boyd    
 Aldert van Buuren   Rudi van Drunen     Anton Feenstra       Alan Gray     
  Gerrit Groenhof     Anca Hamuraru    Vincent Hindriksen  M. Eric Irrgang  
  Aleksei Iupinov   Christoph Junghans     Joe Jordan     Dimitrios Karkoulis
    Peter Kasson        Jiri Kraus      Carsten Kutzner      Per Larsson    
  Justin A. Lemkul    Viveca Lindahl    Magnus Lundborg     Erik Marklund   
    Pascal Merz     Pieter Meulenhoff    Teemu Murtola       Szilard Pall   
    Sander Pronk      Roland Schulz      Michael Shirts    Alexey Shvetsov  
   Alfons Sijbers     Peter Tieleman      Jon Vincent      Teemu Virolainen 
 Christian Wennberg    Maarten Wolf      Artem Zhmurov
   
                           and the project leaders:
        Mark Abraham, Berk Hess, Erik Lindahl, and David van der Spoel

Copyright (c) 1991-2000, University of Groningen, The Netherlands.
Copyright (c) 2001-2019, The GROMACS development team at
Uppsala University, Stockholm University and
the Royal Institute of Technology, Sweden.
check out http://www.gromacs.org for more information.

GROMACS is free software; you can redistribute it and/or modify it
under the terms of the GNU Lesser General Public License
as published by the Free Software Foundation; either version 2.1
of the License, or (at your option) any later version.

GROMACS:      gmx mdrun, version 2020.5-dev-UNCHECKED
Executable:   ***********/software/gromacs/gromacs-2020/bin/gmx_mpi
Data prefix:  ***********/software/gromacs/gromacs-2020
Working dir:  *********** (path redacted)
Process ID:   9114
Command line:
  gmx_mpi -quiet mdrun -nsteps 10000 -resethway -noconfout

GROMACS version:    2020.5-dev-UNCHECKED
The source code this program was compiled from has not been verified because the reference checksum was missing during compilation. This means you have an incomplete GROMACS distribution, please make sure to download an intact source distribution and compile that before proceeding.
Computed checksum: NoChecksumFile
Precision:          single
Memory model:       64 bit
MPI library:        MPI
OpenMP support:     enabled (GMX_OPENMP_MAX_THREADS = 64)
GPU support:        CUDA
SIMD instructions:  AVX2_256
FFT library:        fftw-3.3.3-sse2
RDTSCP usage:       enabled
TNG support:        enabled
Hwloc support:      hwloc-1.11.8
Tracing support:    disabled
C compiler:         /software/intel/parallelstudio/2017u8/compilers_and_libraries_2017.8.262/linux/mpi/intel64/bin/mpicc GNU 6.3.1
C compiler flags:   -mavx2 -mfma -Wall -Wno-unused -Wunused-value -Wunused-parameter -Wextra -Wno-missing-field-initializers -Wno-sign-compare -Wpointer-arith -Wundef -fexcess-precision=fast -funroll-all-loops -Wno-array-bounds -O3 -DNDEBUG
C++ compiler:       /software/intel/parallelstudio/2017u8/compilers_and_libraries_2017.8.262/linux/mpi/intel64/bin/mpicxx GNU 6.3.1
C++ compiler flags: -mavx2 -mfma -Wall -Wextra -Wno-missing-field-initializers -Wpointer-arith -Wmissing-declarations -Wundef -fexcess-precision=fast -funroll-all-loops -Wno-array-bounds -fopenmp -O3 -DNDEBUG
CUDA compiler:      /software/nvidia/cuda/10.0/bin/nvcc nvcc: NVIDIA (R) Cuda compiler driver;Copyright (c) 2005-2018 NVIDIA Corporation;Built on Sat_Aug_25_21:08:01_CDT_2018;Cuda compilation tools, release 10.0, V10.0.130
CUDA compiler flags:-gencode;arch=compute_30,code=sm_30;-gencode;arch=compute_35,code=sm_35;-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_52,code=sm_52;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_35,code=compute_35;-gencode;arch=compute_50,code=compute_50;-gencode;arch=compute_52,code=compute_52;-gencode;arch=compute_60,code=compute_60;-gencode;arch=compute_61,code=compute_61;-gencode;arch=compute_70,code=compute_70;-gencode;arch=compute_75,code=compute_75;-use_fast_math;; -O3 -DNDEBUG
CUDA driver:        11.40
CUDA runtime:       10.0

Note: 20 CPUs configured, but only 5 were detected to be online.

Running on 1 node with total 5 cores, 5 logical cores, 1 compatible GPU
Hardware detected on host g0018 (the node of MPI rank 0):
  CPU info:
    Vendor: Intel
    Brand:  Intel(R) Xeon(R) CPU E5-2640 v4 @ 2.40GHz
    Family: 6   Model: 79   Stepping: 1
    Features: aes apic avx avx2 clfsh cmov cx8 cx16 f16c fma hle htt intel lahf mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd rdtscp rtm sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
  Hardware topology: Full, with devices
    Sockets, cores, and logical processors:
      Socket  0: [   0] [   1] [   2] [   3] [   4]
    Numa nodes:
      Node  0 (68602982400 bytes mem):   0   1   2   3   4
      Node  1 (68719476736 bytes mem):
      Latency:
               0     1
         0  1.00  2.10
         1  2.10  1.00
    Caches:
      L1: 32768 bytes, linesize 64 bytes, assoc. 8, shared 1 ways
      L2: 262144 bytes, linesize 64 bytes, assoc. 8, shared 1 ways
      L3: 26214400 bytes, linesize 64 bytes, assoc. 20, shared 5 ways
    PCI devices:
      0000:05:00.0  Id: 10de:1db1  Class: 0x0302  Numa: -1
      0000:06:00.0  Id: 10de:1db1  Class: 0x0302  Numa: -1
      0000:00:11.4  Id: 8086:8d62  Class: 0x0106  Numa: -1
      0000:07:00.0  Id: 8086:1528  Class: 0x0200  Numa: -1
      0000:07:00.1  Id: 8086:1528  Class: 0x0200  Numa: -1
      0000:09:00.0  Id: 1a03:2000  Class: 0x0300  Numa: -1
      0000:00:1f.2  Id: 8086:8d02  Class: 0x0106  Numa: -1
      0000:83:00.0  Id: 8086:24f0  Class: 0x0208  Numa: -1
      0000:84:00.0  Id: 10de:1db1  Class: 0x0302  Numa: -1
      0000:85:00.0  Id: 10de:1db1  Class: 0x0302  Numa: -1
  GPU info:
    Number of GPUs detected: 1
    #0: NVIDIA Tesla V100-SXM2-16GB, compute cap.: 7.0, ECC: yes, stat: compatible


++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, E.
Lindahl
GROMACS: High performance molecular simulations through multi-level
parallelism from laptops to supercomputers
SoftwareX 1 (2015) pp. 19-25
-------- -------- --- Thank You --- -------- --------


++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
S. Páll, M. J. Abraham, C. Kutzner, B. Hess, E. Lindahl
Tackling Exascale Software Challenges in Molecular Dynamics Simulations with
GROMACS
In S. Markidis & E. Laure (Eds.), Solving Software Challenges for Exascale 8759 (2015) pp. 3-27
-------- -------- --- Thank You --- -------- --------


++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
S. Pronk, S. Páll, R. Schulz, P. Larsson, P. Bjelkmar, R. Apostolov, M. R.
Shirts, J. C. Smith, P. M. Kasson, D. van der Spoel, B. Hess, and E. Lindahl
GROMACS 4.5: a high-throughput and highly parallel open source molecular
simulation toolkit
Bioinformatics 29 (2013) pp. 845-54
-------- -------- --- Thank You --- -------- --------


++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
B. Hess and C. Kutzner and D. van der Spoel and E. Lindahl
GROMACS 4: Algorithms for highly efficient, load-balanced, and scalable
molecular simulation
J. Chem. Theory Comput. 4 (2008) pp. 435-447
-------- -------- --- Thank You --- -------- --------


++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
D. van der Spoel, E. Lindahl, B. Hess, G. Groenhof, A. E. Mark and H. J. C.
Berendsen
GROMACS: Fast, Flexible and Free
J. Comp. Chem. 26 (2005) pp. 1701-1719
-------- -------- --- Thank You --- -------- --------


++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
E. Lindahl and B. Hess and D. van der Spoel
GROMACS 3.0: A package for molecular simulation and trajectory analysis
J. Mol. Mod. 7 (2001) pp. 306-317
-------- -------- --- Thank You --- -------- --------


++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
H. J. C. Berendsen, D. van der Spoel and R. van Drunen
GROMACS: A message-passing parallel molecular dynamics implementation
Comp. Phys. Comm. 91 (1995) pp. 43-56
-------- -------- --- Thank You --- -------- --------


++++ PLEASE CITE THE DOI FOR THIS VERSION OF GROMACS ++++
https://doi.org/10.5281/zenodo.4420785
-------- -------- --- Thank You --- -------- --------

Input Parameters:
   integrator                     = md
   tinit                          = 0
   dt                             = 0.002
   nsteps                         = 50000000
   init-step                      = 0
   simulation-part                = 1
   comm-mode                      = None
   nstcomm                        = 0
   bd-fric                        = 0
   ld-seed                        = 1069350751
   emtol                          = 10
   emstep                         = 0.01
   niter                          = 20
   fcstep                         = 0
   nstcgsteep                     = 1000
   nbfgscorr                      = 10
   rtpi                           = 0.05
   nstxout                        = 50000
   nstvout                        = 0
   nstfout                        = 0
   nstlog                         = 50000
   nstcalcenergy                  = 10000
   nstenergy                      = 50000
   nstxout-compressed             = 0
   compressed-x-precision         = 1000
   cutoff-scheme                  = Verlet
   nstlist                        = 40
   pbc                            = xyz
   periodic-molecules             = false
   verlet-buffer-tolerance        = 0.005
   rlist                          = 0.954
   coulombtype                    = PME
   coulomb-modifier               = Potential-shift
   rcoulomb-switch                = 0
   rcoulomb                       = 0.9
   epsilon-r                      = 1
   epsilon-rf                     = inf
   vdw-type                       = Cut-off
   vdw-modifier                   = Potential-shift
   rvdw-switch                    = 0
   rvdw                           = 0.9
   DispCorr                       = EnerPres
   table-extension                = 1
   fourierspacing                 = 0.12
   fourier-nx                     = 600
   fourier-ny                     = 600
   fourier-nz                     = 84
   pme-order                      = 4
   ewald-rtol                     = 1e-05
   ewald-rtol-lj                  = 0.001
   lj-pme-comb-rule               = Geometric
   ewald-geometry                 = 0
   epsilon-surface                = 0
   tcoupl                         = V-rescale
   nsttcouple                     = 40
   nh-chain-length                = 0
   print-nose-hoover-chain-variables = false
   pcoupl                         = No
   pcoupltype                     = Isotropic
   nstpcouple                     = -1
   tau-p                          = 1
   compressibility (3x3):
      compressibility[    0]={ 0.00000e+00,  0.00000e+00,  0.00000e+00}
      compressibility[    1]={ 0.00000e+00,  0.00000e+00,  0.00000e+00}
      compressibility[    2]={ 0.00000e+00,  0.00000e+00,  0.00000e+00}
   ref-p (3x3):
      ref-p[    0]={ 0.00000e+00,  0.00000e+00,  0.00000e+00}
      ref-p[    1]={ 0.00000e+00,  0.00000e+00,  0.00000e+00}
      ref-p[    2]={ 0.00000e+00,  0.00000e+00,  0.00000e+00}
   refcoord-scaling               = No
   posres-com (3):
      posres-com[0]= 0.00000e+00
      posres-com[1]= 0.00000e+00
      posres-com[2]= 0.00000e+00
   posres-comB (3):
      posres-comB[0]= 0.00000e+00
      posres-comB[1]= 0.00000e+00
      posres-comB[2]= 0.00000e+00
   QMMM                           = false
   QMconstraints                  = 0
   QMMMscheme                     = 0
   MMChargeScaleFactor            = 1
qm-opts:
   ngQM                           = 0
   constraint-algorithm           = Lincs
   continuation                   = false
   Shake-SOR                      = false
   shake-tol                      = 0.0001
   lincs-order                    = 4
   lincs-iter                     = 1
   lincs-warnangle                = 30
   nwall                          = 0
   wall-type                      = 9-3
   wall-r-linpot                  = -1
   wall-atomtype[0]               = -1
   wall-atomtype[1]               = -1
   wall-density[0]                = 0
   wall-density[1]                = 0
   wall-ewald-zfac                = 3
   pull                           = false
   ramd                           = false
   awh                            = false
   rotation                       = false
   interactiveMD                  = false
   disre                          = No
   disre-weighting                = Conservative
   disre-mixed                    = false
   dr-fc                          = 1000
   dr-tau                         = 0
   nstdisreout                    = 100
   orire-fc                       = 0
   orire-tau                      = 0
   nstorireout                    = 100
   free-energy                    = no
   cos-acceleration               = 0
   deform (3x3):
      deform[    0]={ 0.00000e+00,  0.00000e+00,  0.00000e+00}
      deform[    1]={ 0.00000e+00,  0.00000e+00,  0.00000e+00}
      deform[    2]={ 0.00000e+00,  0.00000e+00,  0.00000e+00}
   simulated-tempering            = false
   swapcoords                     = no
   userint1                       = 0
   userint2                       = 0
   userint3                       = 0
   userint4                       = 0
   userreal1                      = 0
   userreal2                      = 0
   userreal3                      = 0
   userreal4                      = 0
   applied-forces:
     electric-field:
       x:
         E0                       = 0
         omega                    = 0
         t0                       = 0
         sigma                    = 0
       y:
         E0                       = 0
         omega                    = 0
         t0                       = 0
         sigma                    = 0
       z:
         E0                       = 0
         omega                    = 0
         t0                       = 0
         sigma                    = 0
     density-guided-simulation:
       active                     = false
       group                      = protein
       similarity-measure         = inner-product
       atom-spreading-weight      = unity
       force-constant             = 1e+09
       gaussian-transform-spreading-width = 0.2
       gaussian-transform-spreading-range-in-multiples-of-width = 4
       reference-density-filename = reference.mrc
       nst                        = 1
       normalize-densities        = true
       adaptive-force-scaling     = false
       adaptive-force-scaling-time-constant = 4
grpopts:
   nrdf:  1.34784e+06           0
   ref-t:         269           0
   tau-t:         0.5         0.5
annealing:          No          No
annealing-npoints:           0           0
   acc:	           0           0           0
   nfreeze:           Y           Y           Y           N           N           N
   energygrp-flags[  0]: 0


The -nsteps functionality is deprecated, and may be removed in a future version. Consider using gmx convert-tpr -nsteps or changing the appropriate .mdp file field.

Overriding nsteps with value passed on the command line: 10000 steps, 20 ps
Changing nstlist from 40 to 100, rlist from 0.954 to 1.016


1 GPU selected for this run.
Mapping of GPU IDs to the 2 GPU tasks in the 1 rank on this node:
  PP:0,PME:0
PP tasks will do (non-perturbed) short-ranged interactions on the GPU
PP task will update and constrain coordinates on the CPU
PME tasks will do all aspects on the GPU
Using 1 MPI process
Using 5 OpenMP threads 

Pinning threads with an auto-selected logical core stride of 1
System total charge: 0.000
Will do PME sum in reciprocal space for electrostatic interactions.

++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
U. Essmann, L. Perera, M. L. Berkowitz, T. Darden, H. Lee and L. G. Pedersen 
A smooth particle mesh Ewald method
J. Chem. Phys. 103 (1995) pp. 8577-8592
-------- -------- --- Thank You --- -------- --------

Using a Gaussian width (1/beta) of 0.288146 nm for Ewald
Potential shift: LJ r^-12: -3.541e+00 r^-6: -1.882e+00, Ewald -1.111e-05
Initialized non-bonded Ewald tables, spacing: 8.85e-04 size: 1018


Using GPU 8x8 nonbonded short-range kernels

Using a dual 8x8 pair-list setup updated with dynamic, rolling pruning:
  outer list: updated every 100 steps, buffer 0.116 nm, rlist 1.016 nm
  inner list: updated every  18 steps, buffer 0.004 nm, rlist 0.904 nm
At tolerance 0.005 kJ/mol/ps per atom, equivalent classical 1x1 list would be:
  outer list: updated every 100 steps, buffer 0.257 nm, rlist 1.157 nm
  inner list: updated every  18 steps, buffer 0.066 nm, rlist 0.966 nm

Using geometric Lennard-Jones combination rule

Long Range LJ corr.: <C6> 2.2243e-04

Removing pbc first time

++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
S. Miyamoto and P. A. Kollman
SETTLE: An Analytical Version of the SHAKE and RATTLE Algorithms for Rigid
Water Models
J. Comp. Chem. 13 (1992) pp. 952-962
-------- -------- --- Thank You --- -------- --------


The -noconfout functionality is deprecated, and may be removed in a future version.

++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
G. Bussi, D. Donadio and M. Parrinello
Canonical sampling through velocity rescaling
J. Chem. Phys. 126 (2007) pp. 014101
-------- -------- --- Thank You --- -------- --------

There are: 1347840 Atoms
There are: 449280 VSites

Constraining the starting coordinates (step 0)

Constraining the coordinates at t0-dt (step 0)
RMS relative constraint deviation after constraining: 0.00e+00
Initial temperature: 270.205 K

Started mdrun on rank 0 Thu Dec 23 18:18:10 2021


The -resethway functionality is deprecated, and may be removed in a future version.
           Step           Time
              0        0.00000

   Energies (kJ/mol)
        LJ (SR)  Disper. corr.   Coulomb (SR)   Coul. recip.      Potential
    6.78704e+06   -8.22205e+04   -3.64499e+07    3.18710e+05   -2.94264e+07
    Kinetic En.   Total Energy  Conserved En.    Temperature Pres. DC (bar)
    1.51535e+06   -2.79110e+07   -2.79110e+07    2.70439e+02   -2.74455e+01
 Pressure (bar)
    4.78317e+03

step  600: timed with pme grid 600 600 84, coulomb cutoff 0.900: 8855.3 M-cycles
step  800: timed with pme grid 512 512 72, coulomb cutoff 1.042: 8933.4 M-cycles
step 1000: timed with pme grid 448 480 64, coulomb cutoff 1.183: 9341.4 M-cycles
step 1200: timed with pme grid 416 432 60, coulomb cutoff 1.274: 9710.0 M-cycles
step 1400: timed with pme grid 384 384 56, coulomb cutoff 1.381: 10029.1 M-cycles
step 1600: timed with pme grid 400 400 56, coulomb cutoff 1.339: 9866.4 M-cycles
step 1800: timed with pme grid 400 400 60, coulomb cutoff 1.326: 9875.9 M-cycles
step 2000: timed with pme grid 416 416 60, coulomb cutoff 1.275: 9743.4 M-cycles
step 2200: timed with pme grid 416 432 60, coulomb cutoff 1.274: 9704.1 M-cycles
step 2400: timed with pme grid 432 432 60, coulomb cutoff 1.250: 9551.5 M-cycles
step 2600: timed with pme grid 432 432 64, coulomb cutoff 1.228: 9391.8 M-cycles
step 2800: timed with pme grid 448 448 64, coulomb cutoff 1.184: 9216.8 M-cycles
step 3000: timed with pme grid 448 480 64, coulomb cutoff 1.183: 9281.3 M-cycles
step 3200: timed with pme grid 480 480 64, coulomb cutoff 1.172: 9297.0 M-cycles
step 3400: timed with pme grid 480 480 72, coulomb cutoff 1.105: 9103.3 M-cycles
step 3600: timed with pme grid 512 512 72, coulomb cutoff 1.042: 8856.1 M-cycles
step 3800: timed with pme grid 512 560 80, coulomb cutoff 1.035: 9039.6 M-cycles
step 4000: timed with pme grid 560 560 80, coulomb cutoff 0.947: 8824.9 M-cycles
step 4200: timed with pme grid 560 576 80, coulomb cutoff 0.946: 8769.2 M-cycles
step 4400: timed with pme grid 576 576 80, coulomb cutoff 0.938: 8748.5 M-cycles
step 4600: timed with pme grid 576 576 84, coulomb cutoff 0.921: 8792.6 M-cycles
step 4800: timed with pme grid 600 600 84, coulomb cutoff 0.900: 8910.0 M-cycles
              optimal pme grid 576 576 80, coulomb cutoff 0.938

step 5000: resetting all time and cycle counters
Restarted time on rank 0 Thu Dec 23 18:22:23 2021

           Step           Time
          10000       20.00000

   Energies (kJ/mol)
        LJ (SR)  Disper. corr.   Coulomb (SR)   Coul. recip.      Potential
    6.48658e+06   -8.22205e+04   -3.43656e+07    2.32747e+05   -2.77285e+07
    Kinetic En.   Total Energy  Conserved En.    Temperature Pres. DC (bar)
    1.50739e+06   -2.62211e+07   -2.78944e+07    2.69020e+02   -2.74455e+01
 Pressure (bar)
    5.22932e+03


	<======  ###############  ==>
	<====  A V E R A G E S  ====>
	<==  ###############  ======>

	Statistics over 10001 steps using 2 frames

   Energies (kJ/mol)
        LJ (SR)  Disper. corr.   Coulomb (SR)   Coul. recip.      Potential
    6.63681e+06   -8.22205e+04   -3.54078e+07    2.75729e+05   -2.85774e+07
    Kinetic En.   Total Energy  Conserved En.    Temperature Pres. DC (bar)
    1.51137e+06   -2.70661e+07   -2.79027e+07    2.69729e+02   -2.74455e+01
 Pressure (bar)
    5.00624e+03

   Total Virial (kJ/mol)
   -7.17627e+06    7.05301e+02   -5.85084e+03
    3.45133e+03   -7.17287e+06   -2.91071e+03
   -4.50451e+03   -3.46621e+03   -6.73872e+06

   Pressure (bar)
    5.10232e+03   -4.08236e-01    3.35906e+00
   -2.23316e+00    5.10002e+03    2.16240e+00
    2.46433e+00    2.53156e+00    4.81638e+03

 T-z>1.48_f0_t0.000   T-z<1.48_f0_t0.000
        2.69729e+02          0.00000e+00


       P P   -   P M E   L O A D   B A L A N C I N G

 PP/PME load balancing changed the cut-off and PME settings:
           particle-particle                    PME
            rcoulomb  rlist            grid      spacing   1/beta
   initial  0.900 nm  0.904 nm     600 600  84   0.119 nm  0.288 nm
   final    0.938 nm  0.942 nm     576 576  80   0.125 nm  0.300 nm
 cost-ratio           1.13             0.88
 (note that these numbers concern only part of the total PP and PME load)


	M E G A - F L O P S   A C C O U N T I N G

 NB=Group-cutoff nonbonded kernels    NxN=N-by-N cluster Verlet kernels
 RF=Reaction-Field  VdW=Van der Waals  QSTab=quadratic-spline table
 W3=SPC/TIP3p  W4=TIP4p (single or pairs)
 V&F=Potential and force  V=Potential only  F=Force only

 Computing:                               M-Number         M-Flops  % Flops
-----------------------------------------------------------------------------
 Pair Search distance check            5238.684928       47148.164     0.0
 NxN Ewald Elec. + LJ [F]           7301383.174400   481891289.510    99.8
 NxN Ewald Elec. + LJ [V&F]            1460.833792      156309.216     0.0
 Shift-X                                 91.653120         549.919     0.0
 Virial                                   1.797165          32.349     0.0
 Calc-Ekin                              900.357120       24309.642     0.0
 Constraint-V                          6740.547840       53924.383     0.0
 Constraint-Vir                           1.347840          32.348     0.0
 Settle                                2246.849280      725732.317     0.2
 Virtual Site 3                        2247.298560       83150.047     0.0
-----------------------------------------------------------------------------
 Total                                               482982477.895   100.0
-----------------------------------------------------------------------------


     R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G

On 1 MPI rank, each using 5 OpenMP threads

 Computing:          Num   Num      Call    Wall time         Giga-Cycles
                     Ranks Threads  Count      (s)         total sum    %
-----------------------------------------------------------------------------
 Vsite constr.          1    5       5001      16.015        192.177   8.8
 Neighbor search        1    5         51       8.254         99.046   4.5
 Launch GPU ops.        1    5       5001       1.315         15.774   0.7
 Force                  1    5       5001       4.532         54.381   2.5
 Wait PME GPU gather    1    5       5001      12.236        146.833   6.7
 Reduce GPU PME F       1    5       5001       7.493         89.911   4.1
 Wait GPU NB local                             12.653        151.836   6.9
 NB X/F buffer ops.     1    5       9951      24.680        296.161  13.5
 Vsite spread           1    5       5002      17.601        211.205   9.6
 Update                 1    5       5001      28.461        341.533  15.6
 Constraints            1    5       5001      13.784        165.408   7.5
 Rest                                          35.640        427.669  19.5
-----------------------------------------------------------------------------
 Total                                        182.663       2191.934 100.0
-----------------------------------------------------------------------------

               Core t (s)   Wall t (s)        (%)
       Time:      913.206      182.663      499.9
                 (ns/day)    (hour/ns)
Performance:        4.731        5.073
Finished mdrun on rank 0 Thu Dec 23 18:25:26 2021