GROMACS version: 2019.6
I'm benchmarking GROMACS 2019.6 built with Intel MPI, and the odd thing is that the same simulation runs slower on two nodes than on one node. Here is the configuration:
Node configuration: 96 logical CPUs and 4 NVIDIA T4 GPUs per node
Node names: g-t4-4-worker0001 and g-t4-4-worker0002
Run configuration: 8 MPI ranks, 24 OpenMP threads per MPI rank (submission script sketched below)
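For reference, the job is launched through SLURM roughly as sketched here; the module name and mpirun invocation are illustrative, while the environment settings (OMP_NUM_THREADS=24, I_MPI_DEBUG=5) and the mdrun command line match what appears in the logs:

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4        # 4 MPI ranks per node, one per GPU
#SBATCH --cpus-per-task=24         # 24 OpenMP threads per rank
#SBATCH --gres=gpu:4

module load GROMACS/2019.6-intelmpi-cuda   # illustrative module name

export OMP_NUM_THREADS=24
export I_MPI_DEBUG=5

mpirun -np 8 gmx_mpi mdrun -v -pin on -nb gpu -bonded gpu -pme gpu -npme 1 -ntomp 24 -deffnm md_100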
SLURM log:
[0] MPI startup(): Multi-threaded optimized library
[0] DAPL startup: RLIMIT_MEMLOCK too small
[1] DAPL startup: RLIMIT_MEMLOCK too small
[2] DAPL startup: RLIMIT_MEMLOCK too small
[3] DAPL startup: RLIMIT_MEMLOCK too small
[4] DAPL startup: RLIMIT_MEMLOCK too small
[5] DAPL startup: RLIMIT_MEMLOCK too small
[6] DAPL startup: RLIMIT_MEMLOCK too small
[7] DAPL startup: RLIMIT_MEMLOCK too small
[4] MPI startup(): cannot load default tmi provider
[5] MPI startup(): cannot load default tmi provider
[6] MPI startup(): cannot load default tmi provider
[7] MPI startup(): cannot load default tmi provider
[0] MPI startup(): cannot load default tmi provider
[1] MPI startup(): cannot load default tmi provider
[2] MPI startup(): cannot load default tmi provider
[3] MPI startup(): cannot load default tmi provider
[4] MPI startup(): RLIMIT_MEMLOCK too small
[5] MPI startup(): RLIMIT_MEMLOCK too small
[6] MPI startup(): RLIMIT_MEMLOCK too small
[7] MPI startup(): RLIMIT_MEMLOCK too small
[0] MPI startup(): RLIMIT_MEMLOCK too small
[1] MPI startup(): RLIMIT_MEMLOCK too small
[2] MPI startup(): RLIMIT_MEMLOCK too small
[3] MPI startup(): RLIMIT_MEMLOCK too small
[2] MPI startup(): shm and tcp data transfer modes
[0] MPI startup(): shm and tcp data transfer modes
[3] MPI startup(): shm and tcp data transfer modes
[1] MPI startup(): shm and tcp data transfer modes
[5] MPI startup(): shm and tcp data transfer modes
[6] MPI startup(): shm and tcp data transfer modes
[4] MPI startup(): shm and tcp data transfer modes
[7] MPI startup(): shm and tcp data transfer modes
[0] MPI startup(): Rank Pid Node name Pin cpu
[0] MPI startup(): 0 15026 g-t4-4-worker0001 {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23}
[0] MPI startup(): 1 15027 g-t4-4-worker0001 {24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47}
[0] MPI startup(): 2 15028 g-t4-4-worker0001 {48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71}
[0] MPI startup(): 3 15029 g-t4-4-worker0001 {72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95}
[0] MPI startup(): 4 15042 g-t4-4-worker0002 {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23}
[0] MPI startup(): 5 15043 g-t4-4-worker0002 {24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47}
[0] MPI startup(): 6 15044 g-t4-4-worker0002 {48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71}
[0] MPI startup(): 7 15045 g-t4-4-worker0002 {72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95}
[0] MPI startup(): I_MPI_DEBUG=5
[0] MPI startup(): I_MPI_HYDRA_UUID=ac3a0000-aeee-ff9b-b0bf-0500e05d0a08
[0] MPI startup(): I_MPI_INFO_NUMA_NODE_NUM=1
[0] MPI startup(): I_MPI_PIN_MAPPING=4:0 0,1 24,2 48,3 72
:-) GROMACS - gmx mdrun, 2019.6 (-:
GROMACS is written by:
Emile Apol Rossen Apostolov Paul Bauer Herman J.C. Berendsen
Par Bjelkmar Christian Blau Viacheslav Bolnykh Kevin Boyd
Aldert van Buuren Rudi van Drunen Anton Feenstra Alan Gray
Gerrit Groenhof Anca Hamuraru Vincent Hindriksen M. Eric Irrgang
Aleksei Iupinov Christoph Junghans Joe Jordan Dimitrios Karkoulis
Peter Kasson Jiri Kraus Carsten Kutzner Per Larsson
Justin A. Lemkul Viveca Lindahl Magnus Lundborg Erik Marklund
Pascal Merz Pieter Meulenhoff Teemu Murtola Szilard Pall
Sander Pronk Roland Schulz Michael Shirts Alexey Shvetsov
Alfons Sijbers Peter Tieleman Jon Vincent Teemu Virolainen
Christian Wennberg Maarten Wolf
and the project leaders:
Mark Abraham, Berk Hess, Erik Lindahl, and David van der Spoel
Copyright (c) 1991-2000, University of Groningen, The Netherlands.
Copyright (c) 2001-2018, The GROMACS development team at
Uppsala University, Stockholm University and
the Royal Institute of Technology, Sweden.
check out http://www.gromacs.org for more information.
GROMACS is free software; you can redistribute it and/or modify it
under the terms of the GNU Lesser General Public License
as published by the Free Software Foundation; either version 2.1
of the License, or (at your option) any later version.
GROMACS: gmx mdrun, version 2019.6
Executable: /public/software/.local/easybuild/software/GROMACS/2019.6-intelmpi-cuda/bin/gmx_mpi
Data prefix: /public/software/.local/easybuild/software/GROMACS/2019.6-intelmpi-cuda
Working dir: /home/cloudam/gromacs-demo/gromacs_test/MD
Command line:
gmx_mpi mdrun -v -pin on -nb gpu -bonded gpu -pme gpu -npme 1 -ntomp 24 -deffnm md_100
Back Off! I just backed up md_100.log to ./#md_100.log.53#
Compiled SIMD: AVX2_256, but for this host/run AVX_512 might be better (see
log).
Reading file md_100.tpr, VERSION 2019.6 (single precision)
Changing nstlist from 20 to 100, rlist from 1.228 to 1.356
Using 8 MPI processes
Using 24 OpenMP threads per MPI process
On host g-t4-4-worker0001 4 GPUs selected for this run.
Mapping of GPU IDs to the 4 GPU tasks in the 4 ranks on this node:
PP:0,PP:1,PP:2,PP:3
PP tasks will do (non-perturbed) short-ranged and most bonded interactions on the GPU
PME tasks will do all aspects on the GPU
NOTE: Your choice of number of MPI ranks and amount of resources results in using 24 OpenMP threads per rank, which is most likely inefficient. The optimum is usually between 2 and 6 threads per rank.
Overriding thread affinity set outside gmx mdrun
Back Off! I just backed up md_100.xtc to ./#md_100.xtc.41#
Back Off! I just backed up md_100.edr to ./#md_100.edr.41#
WARNING: This run will generate roughly 29602 Mb of data
NOTE: DLB will not turn on during the first phase of PME tuning
starting mdrun 'Protein in water'
50000000 steps, 100000.0 ps.
step 0
imb F 11% pme/F 1.02 step 100, will finish Sat Apr 17 13:40:41 2021
imb F 12% pme/F 0.94 step 200, will finish Sat Apr 17 05:34:43 2021
imb F 12% pme/F 0.96 step 300, will finish Sat Apr 17 02:42:06 2021
imb F 11% pme/F 0.94 step 400, will finish Sat Apr 17 01:28:01 2021
imb F 12% pme/F 0.96 step 500, will finish Sat Apr 17 00:32:33 2021
imb F 12% pme/F 0.95 step 600, will finish Sat Apr 17 00:08:24 2021
imb F 16% pme/F 0.89 step 700, will finish Fri Apr 16 23:41:09 2021
imb F 9% pme/F 0.99 step 800, will finish Fri Apr 16 23:30:44 2021
imb F 12% pme/F 0.98 step 900, will finish Fri Apr 16 23:21:05 2021
imb F 12% pme/F 0.95 step 1000, will finish Fri Apr 16 23:44:59 2021
imb F 10% pme/F 0.96 step 1100, will finish Fri Apr 16 23:45:08 2021
imb F 13% pme/F 0.95 step 1200, will finish Fri Apr 16 23:36:00 2021
imb F 13% pme/F 0.96 step 1300, will finish Fri Apr 16 23:26:45 2021
imb F 13% pme/F 0.95 step 1400, will finish Fri Apr 16 23:15:02 2021
imb F 13% pme/F 0.92 step 1500, will finish Fri Apr 16 23:09:06 2021
imb F 17% pme/F 0.88 step 1600, will finish Fri Apr 16 23:00:16 2021
imb F 16% pme/F 0.89 step 1700, will finish Fri Apr 16 22:55:14 2021
imb F 11% pme/F 0.95 step 1800, will finish Fri Apr 16 22:51:26 2021
imb F 10% pme/F 0.98 step 1900, will finish Fri Apr 16 22:45:44 2021
imb F 14% pme/F 0.95 step 2000, will finish Fri Apr 16 22:41:10 2021
imb F 18% pme/F 0.91 step 2100, will finish Fri Apr 16 22:33:45 2021
imb F 12% pme/F 0.93 step 2200, will finish Fri Apr 16 22:24:03 2021
imb F 12% pme/F 0.94 step 2300, will finish Fri Apr 16 22:17:08 2021
imb F 11% pme/F 0.94 step 2400, will finish Fri Apr 16 22:12:50 2021
imb F 16% pme/F 0.90 step 2500, will finish Fri Apr 16 22:23:02 2021
imb F 11% pme/F 0.97 step 2600, will finish Fri Apr 16 22:17:49 2021
imb F 12% pme/F 0.94 step 2700, will finish Fri Apr 16 22:16:54 2021
imb F 14% pme/F 0.91 step 2800, will finish Fri Apr 16 22:17:57 2021
imb F 10% pme/F 0.96 step 2900, will finish Fri Apr 16 22:13:35 2021
imb F 12% pme/F 0.95 step 3000, will finish Fri Apr 16 22:11:23 2021
imb F 12% pme/F 0.93 step 3100, will finish Fri Apr 16 22:11:33 2021
imb F 13% pme/F 0.97 step 3200, will finish Fri Apr 16 22:08:39 2021
imb F 10% pme/F 0.99 step 3300, will finish Fri Apr 16 22:08:44 2021
imb F 12% pme/F 0.98 step 3400, will finish Fri Apr 16 22:17:20 2021
imb F 11% pme/F 0.98 step 3500, will finish Fri Apr 16 22:16:02 2021
imb F 18% pme/F 0.91 step 3600, will finish Fri Apr 16 22:12:21 2021
imb F 18% pme/F 0.88 step 3700, will finish Fri Apr 16 22:10:23 2021
imb F 12% pme/F 0.94 step 3800, will finish Fri Apr 16 22:08:45 2021
imb F 17% pme/F 0.90 step 3900, will finish Fri Apr 16 22:08:25 2021
md_100.log:
GROMACS: gmx mdrun, version 2019.6
Executable: /public/software/.local/easybuild/software/GROMACS/2019.6-intelmpi-cuda/bin/gmx_mpi
Data prefix: /public/software/.local/easybuild/software/GROMACS/2019.6-intelmpi-cuda
Working dir: /home/cloudam/gromacs-demo/gromacs_test/MD
Process ID: 15026
Command line:
gmx_mpi mdrun -v -pin on -nb gpu -bonded gpu -pme gpu -npme 1 -ntomp 24 -deffnm md_100
GROMACS version: 2019.6
Precision: single
Memory model: 64 bit
MPI library: MPI
OpenMP support: enabled (GMX_OPENMP_MAX_THREADS = 64)
GPU support: CUDA
SIMD instructions: AVX2_256
FFT library: fftw-3.3.8-sse2-avx-avx2-avx2_128
RDTSCP usage: enabled
TNG support: enabled
Hwloc support: hwloc-1.11.12
Tracing support: disabled
C compiler: /public/software/.local/easybuild/software/impi/2018.5.288-iccifort-2019.5.281/bin64/mpicc GNU 8.3.0
C compiler flags: -mavx2 -mfma -O3 -DNDEBUG -funroll-all-loops -fexcess-precision=fast
C++ compiler: /public/software/.local/easybuild/software/impi/2018.5.288-iccifort-2019.5.281/bin64/mpicxx GNU 8.3.0
C++ compiler flags: -mavx2 -mfma -std=c++11 -O3 -DNDEBUG -funroll-all-loops -fexcess-precision=fast
CUDA compiler: /public/software/.local/easybuild/software/CUDA/10.1.243-iccifort-2019.5.281/bin/nvcc nvcc: NVIDIA (R) Cuda compiler driver;Copyright (c) 2005-2019 NVIDIA Corporation;Built on Sun_Jul_28_19:07:16_PDT_2019;Cuda compilation tools, release 10.1, V10.1.243
CUDA compiler flags:-gencode;arch=compute_30,code=sm_30;-gencode;arch=compute_35,code=sm_35;-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_52,code=sm_52;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=compute_75;-use_fast_math;;; ;-mavx2;-mfma;-std=c++11;-O3;-DNDEBUG;-funroll-all-loops;-fexcess-precision=fast;
CUDA driver: 10.10
CUDA runtime: 10.10
Running on 2 nodes with total 96 cores, 192 logical cores, 8 compatible GPUs
Cores per node: 48
Logical cores per node: 96
Compatible GPUs per node: 4
All nodes have identical type(s) of GPUs
Hardware detected on host g-t4-4-worker0001 (the node of MPI rank 0):
CPU info:
Vendor: Intel
Brand: Intel(R) Xeon(R) Platinum 8163 CPU @ 2.50GHz
Family: 6 Model: 85 Stepping: 4
Features: aes apic avx avx2 avx512f avx512cd avx512bw avx512vl clfsh cmov cx8 cx16 f16c fma hle htt intel lahf mmx msr nonstop_tsc pcid pclmuldq pdpe1gb popcnt pse rdrnd rdtscp rtm sse2 sse3 sse4.1 sse4.2 ssse3 x2apic
Number of AVX-512 FMA units: 2
Hardware topology: Full, with devices
Sockets, cores, and logical processors:
Socket 0: [ 0 1] [ 2 3] [ 4 5] [ 6 7] [ 8 9] [ 10 11] [ 12 13] [ 14 15] [ 16 17] [ 18 19] [ 20 21] [ 22 23] [ 24 25] [ 26 27] [ 28 29] [ 30 31] [ 32 33] [ 34 35] [ 36 37] [ 38 39] [ 40 41] [ 42 43] [ 44 45] [ 46 47] [ 48 49] [ 50 51] [ 52 53] [ 54 55] [ 56 57] [ 58 59] [ 60 61] [ 62 63] [ 64 65] [ 66 67] [ 68 69] [ 70 71] [ 72 73] [ 74 75] [ 76 77] [ 78 79] [ 80 81] [ 82 83] [ 84 85] [ 86 87] [ 88 89] [ 90 91] [ 92 93] [ 94 95]
Numa nodes:
Node 0 (392897486848 bytes mem): 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
Latency:
0
0 1.00
Caches:
L1: 32768 bytes, linesize 64 bytes, assoc. 8, shared 2 ways
L2: 1048576 bytes, linesize 64 bytes, assoc. 16, shared 2 ways
L3: 34603008 bytes, linesize 64 bytes, assoc. 11, shared 64 ways
PCI devices:
0000:00:01.1 Id: 8086:7010 Class: 0x0101 Numa: 0
0000:00:02.0 Id: 1013:00b8 Class: 0x0300 Numa: 0
0000:00:05.0 Id: 1af4:1001 Class: 0x0100 Numa: 0
0000:00:06.0 Id: 1af4:1000 Class: 0x0200 Numa: 0
0000:00:07.0 Id: 10de:1eb8 Class: 0x0302 Numa: 0
0000:00:08.0 Id: 10de:1eb8 Class: 0x0302 Numa: 0
0000:00:09.0 Id: 10de:1eb8 Class: 0x0302 Numa: 0
0000:00:0a.0 Id: 10de:1eb8 Class: 0x0302 Numa: 0
GPU info:
Number of GPUs detected: 4
#0: NVIDIA Tesla T4, compute cap.: 7.5, ECC: yes, stat: compatible
#1: NVIDIA Tesla T4, compute cap.: 7.5, ECC: yes, stat: compatible
#2: NVIDIA Tesla T4, compute cap.: 7.5, ECC: yes, stat: compatible
#3: NVIDIA Tesla T4, compute cap.: 7.5, ECC: yes, stat: compatible
Highest SIMD level requested by all nodes in run: AVX_512
SIMD instructions selected at compile time: AVX2_256
This program was compiled for different hardware than you are running on,
which could influence performance. This build might have been configured on a
login node with only a single AVX-512 FMA unit (in which case AVX2 is faster),
while the node you are running on has dual AVX-512 FMA units.
The number of OpenMP threads was set by environment variable OMP_NUM_THREADS to 24 (and the command-line setting agreed with that)
Input Parameters:
integrator = md
tinit = 0
dt = 0.002
nsteps = 50000000
init-step = 0
simulation-part = 1
comm-mode = Linear
nstcomm = 100
bd-fric = 0
ld-seed = -1222564944
emtol = 10
emstep = 0.01
niter = 20
fcstep = 0
nstcgsteep = 1000
nbfgscorr = 10
rtpi = 0.05
nstxout = 0
nstvout = 0
nstfout = 0
nstlog = 2500
nstcalcenergy = 100
nstenergy = 2500
nstxout-compressed = 2500
compressed-x-precision = 1000
cutoff-scheme = Verlet
nstlist = 20
ns-type = Grid
pbc = xyz
periodic-molecules = false
verlet-buffer-tolerance = 0.005
rlist = 1.228
coulombtype = PME
coulomb-modifier = Potential-shift
rcoulomb-switch = 0
rcoulomb = 1.2
epsilon-r = 1
epsilon-rf = inf
vdw-type = Cut-off
vdw-modifier = Force-switch
rvdw-switch = 1
rvdw = 1.2
DispCorr = EnerPres
table-extension = 1
fourierspacing = 0.16
fourier-nx = 96
fourier-ny = 96
fourier-nz = 96
pme-order = 4
ewald-rtol = 1e-05
ewald-rtol-lj = 0.001
lj-pme-comb-rule = Geometric
ewald-geometry = 0
epsilon-surface = 0
tcoupl = V-rescale
nsttcouple = 20
nh-chain-length = 0
print-nose-hoover-chain-variables = false
pcoupl = Parrinello-Rahman
pcoupltype = Isotropic
nstpcouple = 20
tau-p = 2
compressibility (3x3):
compressibility[ 0]={ 4.50000e-05, 0.00000e+00, 0.00000e+00}
compressibility[ 1]={ 0.00000e+00, 4.50000e-05, 0.00000e+00}
compressibility[ 2]={ 0.00000e+00, 0.00000e+00, 4.50000e-05}
ref-p (3x3):
ref-p[ 0]={ 1.00000e+00, 0.00000e+00, 0.00000e+00}
ref-p[ 1]={ 0.00000e+00, 1.00000e+00, 0.00000e+00}
ref-p[ 2]={ 0.00000e+00, 0.00000e+00, 1.00000e+00}
refcoord-scaling = No
posres-com (3):
posres-com[0]= 0.00000e+00
posres-com[1]= 0.00000e+00
posres-com[2]= 0.00000e+00
posres-comB (3):
posres-comB[0]= 0.00000e+00
posres-comB[1]= 0.00000e+00
posres-comB[2]= 0.00000e+00
QMMM = false
QMconstraints = 0
QMMMscheme = 0
MMChargeScaleFactor = 1
qm-opts:
ngQM = 0
constraint-algorithm = Lincs
continuation = true
Shake-SOR = false
shake-tol = 0.0001
lincs-order = 4
lincs-iter = 1
lincs-warnangle = 30
nwall = 0
wall-type = 9-3
wall-r-linpot = -1
wall-atomtype[0] = -1
wall-atomtype[1] = -1
wall-density[0] = 0
wall-density[1] = 0
wall-ewald-zfac = 3
pull = false
awh = false
rotation = false
interactiveMD = false
disre = No
disre-weighting = Conservative
disre-mixed = false
dr-fc = 1000
dr-tau = 0
nstdisreout = 100
orire-fc = 0
orire-tau = 0
nstorireout = 100
free-energy = no
cos-acceleration = 0
deform (3x3):
deform[ 0]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
deform[ 1]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
deform[ 2]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
simulated-tempering = false
swapcoords = no
userint1 = 0
userint2 = 0
userint3 = 0
userint4 = 0
userreal1 = 0
userreal2 = 0
userreal3 = 0
userreal4 = 0
applied-forces:
electric-field:
x:
E0 = 0
omega = 0
t0 = 0
sigma = 0
y:
E0 = 0
omega = 0
t0 = 0
sigma = 0
z:
E0 = 0
omega = 0
t0 = 0
sigma = 0
grpopts:
nrdf: 28794.8 447873
ref-t: 300 300
tau-t: 0.1 0.1
annealing: No No
annealing-npoints: 0 0
acc: 0 0 0
nfreeze: N N N
energygrp-flags[ 0]: 0
Changing nstlist from 20 to 100, rlist from 1.228 to 1.356
Initializing Domain Decomposition on 8 ranks
Dynamic load balancing: locked
Minimum cell size due to atom displacement: 0.762 nm
Initial maximum distances in bonded interactions:
two-body bonded interactions: 0.438 nm, LJ-14, atoms 8608 8616
multi-body bonded interactions: 0.438 nm, Ryckaert-Bell., atoms 8608 8616
Minimum cell size due to bonded interactions: 0.482 nm
Maximum distance for 5 constraints, at 120 deg. angles, all-trans: 0.218 nm
Estimated maximum distance required for P-LINCS: 0.218 nm
Using 1 separate PME ranks
Scaling the initial minimum size with 1/0.8 (option -dds) = 1.25
Optimizing the DD grid for 7 cells with a minimum initial size of 0.953 nm
The maximum allowed number of cells is: X 12 Y 12 Z 11
Domain decomposition grid 7 x 1 x 1, separate PME ranks 1
PME domain decomposition: 1 x 1 x 1
Interleaving PP and PME ranks
This rank does only particle-particle work.
Domain decomposition rank 0, coordinates 0 0 0
The initial number of communication pulses is: X 1
The initial domain decomposition cell size is: X 1.74 nm
The maximum allowed distance for atoms involved in interactions is:
non-bonded interactions 1.356 nm
(the following are initial values, they could change due to box deformation)
two-body bonded interactions (-rdd) 1.356 nm
multi-body bonded interactions (-rdd) 1.356 nm
virtual site constructions (-rcon) 1.740 nm
atoms separated by up to 5 constraints (-rcon) 1.740 nm
When dynamic load balancing gets turned on, these settings will change to:
The maximum number of communication pulses is: X 1
The minimum size for domain decomposition cells is 1.356 nm
The requested allowed shrink of DD cells (option -dds) is: 0.80
The allowed shrink of domain decomposition cells is: X 0.78
The maximum allowed distance for atoms involved in interactions is:
non-bonded interactions 1.356 nm
two-body bonded interactions (-rdd) 1.356 nm
multi-body bonded interactions (-rdd) 1.356 nm
virtual site constructions (-rcon) 1.356 nm
atoms separated by up to 5 constraints (-rcon) 1.356 nm
Using two step summing over 2 groups of on average 3.5 ranks
Using 8 MPI processes
Using 24 OpenMP threads per MPI process
On host g-t4-4-worker0001 4 GPUs selected for this run.
Mapping of GPU IDs to the 4 GPU tasks in the 4 ranks on this node:
PP:0,PP:1,PP:2,PP:3
PP tasks will do (non-perturbed) short-ranged and most bonded interactions on the GPU
PME tasks will do all aspects on the GPU
NOTE: Your choice of number of MPI ranks and amount of resources results in using 24 OpenMP threads per rank, which is most likely inefficient. The optimum is usually between 2 and 6 threads per rank.
Overriding thread affinity set outside gmx mdrun
Pinning threads with an auto-selected logical core stride of 1
System total charge: 0.000
Will do PME sum in reciprocal space for electrostatic interactions.
++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
U. Essmann, L. Perera, M. L. Berkowitz, T. Darden, H. Lee and L. G. Pedersen
A smooth particle mesh Ewald method
J. Chem. Phys. 103 (1995) pp. 8577-8592
-------- -------- --- Thank You --- -------- --------
Using a Gaussian width (1/beta) of 0.384195 nm for Ewald
Potential shift: LJ r^-12: -2.648e-01 r^-6: -5.349e-01, Ewald -8.333e-06
Initialized non-bonded Ewald correction tables, spacing: 1.02e-03 size: 1176
Long Range LJ corr.: 1.7854e-04
Generated table with 1178 data points for Ewald.
Tabscale = 500 points/nm
Generated table with 1178 data points for LJ6Shift.
Tabscale = 500 points/nm
Generated table with 1178 data points for LJ12Shift.
Tabscale = 500 points/nm
Generated table with 1178 data points for 1-4 COUL.
Tabscale = 500 points/nm
Generated table with 1178 data points for 1-4 LJ6.
Tabscale = 500 points/nm
Generated table with 1178 data points for 1-4 LJ12.
Tabscale = 500 points/nm
Using GPU 8x8 nonbonded short-range kernels
Using a dual 8x4 pair-list setup updated with dynamic, rolling pruning:
outer list: updated every 100 steps, buffer 0.156 nm, rlist 1.356 nm
inner list: updated every 10 steps, buffer 0.002 nm, rlist 1.202 nm
At tolerance 0.005 kJ/mol/ps per atom, equivalent classical 1x1 list would be:
outer list: updated every 100 steps, buffer 0.307 nm, rlist 1.507 nm
inner list: updated every 10 steps, buffer 0.044 nm, rlist 1.244 nm
Initializing Parallel LINear Constraint Solver
++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
B. Hess
P-LINCS: A Parallel Linear Constraint Solver for molecular simulation
J. Chem. Theory Comput. 4 (2008) pp. 116-122
-------- -------- --- Thank You --- -------- --------
The number of constraints is 5765
There are constraints between atoms in different decomposition domains,
will communicate selected coordinates each lincs iteration
++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
S. Miyamoto and P. A. Kollman
SETTLE: An Analytical Version of the SHAKE and RATTLE Algorithms for Rigid
Water Models
J. Comp. Chem. 13 (1992) pp. 952-962
-------- -------- --- Thank You --- -------- --------
Linking all bonded interactions to atoms
There are 74428 inter charge-group virtual sites,
will an extra communication step for selected coordinates and forces
Intra-simulation communication will occur every 20 steps.
Center of mass motion removal mode is Linear
We have the following groups for center of mass motion removal:
0: rest
++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
G. Bussi, D. Donadio and M. Parrinello
Canonical sampling through velocity rescaling
J. Chem. Phys. 126 (2007) pp. 014101
-------- -------- --- Thank You --- -------- --------
There are: 235240 Atoms
There are: 74428 VSites
Atom distribution over 7 domains: av 44238 stddev 277 min 43920 max 44619
NOTE: DLB will not turn on during the first phase of PME tuning
Started mdrun on rank 0 Sun Apr 11 19:22:27 2021
Step Time
0 0.00000
Energies (kJ/mol)
Bond Harmonic Pot. Angle Ryckaert-Bell. Improper Dih.
8.41905e+03 9.05915e+00 2.29383e+04 1.37466e+04 1.56581e+03
LJ-14 Coulomb-14 LJ (SR) Disper. corr. Coulomb (SR)
1.46827e+04 6.84097e+04 5.48357e+05 -2.21763e+04 -4.04786e+06
Coul. recip. Potential Kinetic En. Total Energy Conserved En.
1.16819e+04 -3.38023e+06 5.92384e+05 -2.78784e+06 -2.78770e+06
Temperature Pres. DC (bar) Pressure (bar) Constr. rmsd
2.98939e+02 -1.57011e+02 3.95213e+01 4.56933e-06
DD step 99 load imb.: force 11.2% pme mesh/force 1.020
DD step 2499 load imb.: force 16.2% pme mesh/force 0.895
Step Time
2500 5.00000
Energies (kJ/mol)
Bond Harmonic Pot. Angle Ryckaert-Bell. Improper Dih.
8.47799e+03 1.78844e+01 2.30134e+04 1.37323e+04 1.65879e+03
LJ-14 Coulomb-14 LJ (SR) Disper. corr. Coulomb (SR)
1.48272e+04 6.85086e+04 5.52541e+05 -2.22215e+04 -4.05390e+06
Coul. recip. Potential Kinetic En. Total Energy Conserved En.
1.14112e+04 -3.38194e+06 5.93595e+05 -2.78834e+06 -2.78767e+06
Temperature Pres. DC (bar) Pressure (bar) Constr. rmsd
2.99551e+02 -1.57651e+02 1.18464e+02 4.44390e-06
DD step 4999 load imb.: force 15.1% pme mesh/force 0.952
Step Time
5000 10.00000
Energies (kJ/mol)
Bond Harmonic Pot. Angle Ryckaert-Bell. Improper Dih.
8.32716e+03 1.96349e+01 2.33778e+04 1.35488e+04 1.50029e+03
LJ-14 Coulomb-14 LJ (SR) Disper. corr. Coulomb (SR)
1.46913e+04 6.81196e+04 5.50667e+05 -2.21641e+04 -4.05276e+06
Coul. recip. Potential Kinetic En. Total Energy Conserved En.
1.15729e+04 -3.38310e+06 5.96289e+05 -2.78681e+06 -2.78768e+06