Gmx_mpi GPU and HPC clusters

GROMACS version: 2020.4
GROMACS modification: No

OK ladies and gentlemen, I have tried to figure this out as much as possible by reading the documentation and am still in a quandary. I have always run GROMACS on my home system (i9-9900K and RTX 2080 Super), where I generally get around 210 ns/day on a solvated scFv (1.2 cubic box).

Now I have access to a Cray cluster with an array of GPUs and the performance I am getting is lackluster to say the least.

Background
GROMACS 2020.4 compiled with -DGMX_MPI=ON -DGMX_GPU=ON -DGMX_BUILD_OWN_FFTW=ON
Modules loaded: openmpi/4.0.1-cuda, cuda/10.1 and cmake/3.18.3

Results:

500000 steps, 1000.0 ps. scFv (25 kDa), 1.2 cubic box

select=1:ncpus=32:gputype=K80:ngpus=4

  1. mpirun -np 4 gmx_mpi mdrun -ntomp 8 -v -deffnm md_0_1 (163.884 ns/day)
  2. mpirun -np 1 gmx_mpi mdrun -ntomp 32 -v -deffnm md_0_1 (37.518 ns/day)
  3. mpirun -np 4 gmx_mpi mdrun -ntomp 8 -v -npme 2 -deffnm md_0_1 (117.9 ns/day)
  4. mpirun -np 4 gmx_mpi mdrun -ntomp 8 -v -nb gpu -deffnm md_0_1 (161.102 ns/day)
  5. mpirun -np 4 gmx_mpi mdrun -ntomp 8 -v -nb gpu -bonded gpu -deffnm md_0_1 -gpu_id 0123 (160 ns/day)

select=1:ncpus=32:gputype=K80:ngpus=6

  6. mpirun -np 6 gmx_mpi mdrun -ntomp 5 -v -deffnm md_0_1 (173 ns/day)

select=1:ncpus=16:gputype=V100:ngpus=1

  7. gmx mdrun -v -deffnm md_0_1 (229 ns/day)

select=1:ncpus=16:gputype=V100:ngpus=2

  8. mpirun -np 2 gmx_mpi mdrun -ntomp 8 -v -deffnm md_0_1 (took too long and I quit)
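
To put these numbers side by side, here is a small Python sketch that divides each throughput by the number of GPUs requested and converts ns/day into wall-clock time for the 1000 ps run. It only uses figures quoted above; the assumption (not confirmed anywhere in this thread) is that `ngpus` in the `select=` line equals the number of devices mdrun actually used. The helper names `per_gpu` and `wall_seconds` are made up for illustration.

```python
# Throughput figures copied from the results list (ns/day), paired with
# the number of GPUs requested in the matching "select=" line.
# Assumption: every requested GPU was actually used by mdrun.
runs = {
    "run 1: 4x K80, -np 4 -ntomp 8": (163.884, 4),
    "run 6: 6x K80, -np 6 -ntomp 5": (173.0, 6),
    "run 7: 1x V100, gmx mdrun without mpirun": (229.0, 1),
}


def per_gpu(ns_per_day, n_gpus):
    """Throughput divided evenly over the requested GPUs."""
    return ns_per_day / n_gpus


def wall_seconds(ns_per_day, sim_ns=1.0):
    """Wall-clock seconds for sim_ns of simulation (default: the 1000 ps run)."""
    return 86400.0 * sim_ns / ns_per_day


for label, (rate, gpus) in runs.items():
    print(f"{label}: {per_gpu(rate, gpus):6.1f} ns/day per GPU, "
          f"{wall_seconds(rate):5.0f} s for 1 ns")
```

Seen this way, going from 4 to 6 K80s adds under 10 ns/day in total, so the extra GPUs contribute very little to a single simulation of this size.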

Observations:

  1. Nothing I do with the mdrun command-line switches seems to help the speed.
  2. mpirun is abysmal compared with the stock gmx command with no MPI. I tried to compile GROMACS and run it on the cluster without MPI, but it crashed if I tried to use anything other than 1 CPU.
  3. How can 6 Tesla K80s and 32 CPUs be much, much worse than my home i9-9900K and single RTX 2080?

I must be doing something wrong or obviously not getting something. Any tips?

In general it’s helpful to post log file outputs when asking for help. The first step would be to make sure the GPUs are actually being found and used by GROMACS.
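
One way to check this quickly is to pull the GPU-related lines out of the mdrun log. Below is a minimal Python sketch, assuming the log wording matches the 2020.4 logs posted in this thread; the function name `gpu_usage_summary` is made up for illustration, and the sample text is just three lines copied from those logs plus one more status line.

```python
import re


def gpu_usage_summary(log_text):
    """Collect GPU detection and task-mapping lines from an mdrun log string."""
    summary = {"selected": None, "mapping": None, "statuses": []}
    for raw in log_text.splitlines():
        line = raw.strip()
        selected = re.search(r"(\d+) GPUs? selected for this run", line)
        status = re.match(r"#(\d+): .*stat: (\w+)", line)
        if selected:
            summary["selected"] = int(selected.group(1))
        elif re.fullmatch(r"(?:(?:PP|PME):\d+,?)+", line):
            # e.g. "PP:0,PP:2": one entry per GPU task, with the GPU ID it uses
            summary["mapping"] = line
        elif status:
            summary["statuses"].append((int(status.group(1)), status.group(2)))
    return summary


sample = """\
#0: NVIDIA Tesla V100-SXM2-32GB, compute cap.: 7.0, ECC: no, stat: compatible
#1: NVIDIA Tesla V100-SXM2-32GB, compute cap.: 7.0, ECC: no, stat: unavailable
On host ktchpccg015 2 GPUs selected for this run.
PP:0,PP:2
"""

print(gpu_usage_summary(sample))
```

If `selected` stays `None` or no mapping line is found, the run most likely did not use a GPU at all.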

You shouldn’t need MPI for a single node, if you have access to the entire node’s resources. What kind of crash did you get?

Overall, GROMACS doesn’t yet scale very well to multiple GPUs in the same simulation; it’s a lot more efficient to run 1 (or even 2) concurrent simulations per GPU. But post your logs first, as there may be some gains to be had in a single simulation.
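
As a sketch of what one run per GPU could look like on a single node, the Python helper below just builds the command lines. It assumes a thread-MPI `gmx` build, that the job's cores are numbered contiguously from 0 with no hyperthreading, and that the GPU IDs passed in are ones mdrun reports as compatible. The function name `concurrent_mdrun_commands` and the run names `md_a` and `md_b` are made up for the example.

```python
def concurrent_mdrun_commands(run_names, gpu_ids, total_cores):
    """Build one single-GPU mdrun command per run, on non-overlapping cores."""
    if len(run_names) != len(gpu_ids):
        raise ValueError("need exactly one GPU id per run")
    threads = total_cores // len(run_names)  # OpenMP threads for each run
    commands = []
    for i, (name, gpu) in enumerate(zip(run_names, gpu_ids)):
        commands.append(
            f"gmx mdrun -ntmpi 1 -ntomp {threads} "
            # pin threads, starting each run at its own first core
            f"-pin on -pinoffset {i * threads} -pinstride 1 "
            f"-gpu_id {gpu} -deffnm {name}"
        )
    return commands


# Example: two independent runs sharing 20 cores, on GPUs 0 and 2.
for cmd in concurrent_mdrun_commands(["md_a", "md_b"], [0, 2], 20):
    print(cmd)
```

Each command would be launched as its own background process. The separate `-pinoffset` values are there so the runs do not land on the same cores.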

Ok sure thanks.

Here are two logs:
select=1:cpus=20:gputype=V100:ngpus=2
gmx mdrun -ntmpi 2 -ntomp 10 -v -deffnm md_0_1

This ends with a segfault (it runs fine with -ntmpi 1).

                  :-) GROMACS - gmx mdrun, 2020.4 (-:

                        GROMACS is written by:
 Emile Apol      Rossen Apostolov      Paul Bauer     Herman J.C. Berendsen
Par Bjelkmar      Christian Blau   Viacheslav Bolnykh     Kevin Boyd    

Aldert van Buuren Rudi van Drunen Anton Feenstra Alan Gray
Gerrit Groenhof Anca Hamuraru Vincent Hindriksen M. Eric Irrgang
Aleksei Iupinov Christoph Junghans Joe Jordan Dimitrios Karkoulis
Peter Kasson Jiri Kraus Carsten Kutzner Per Larsson
Justin A. Lemkul Viveca Lindahl Magnus Lundborg Erik Marklund
Pascal Merz Pieter Meulenhoff Teemu Murtola Szilard Pall
Sander Pronk Roland Schulz Michael Shirts Alexey Shvetsov
Alfons Sijbers Peter Tieleman Jon Vincent Teemu Virolainen
Christian Wennberg Maarten Wolf Artem Zhmurov
and the project leaders:
Mark Abraham, Berk Hess, Erik Lindahl, and David van der Spoel

Copyright © 1991-2000, University of Groningen, The Netherlands.
Copyright © 2001-2019, The GROMACS development team at
Uppsala University, Stockholm University and
the Royal Institute of Technology, Sweden.
check out http://www.gromacs.org for more information.

GROMACS is free software; you can redistribute it and/or modify it
under the terms of the GNU Lesser General Public License
as published by the Free Software Foundation; either version 2.1
of the License, or (at your option) any later version.

GROMACS: gmx mdrun, version 2020.4
Executable: /SFS/user/ry/waight/tools/gromacs-2020/bin/gmx
Data prefix: /SFS/user/ry/waight/tools/gromacs-2020
Working dir: /mnt/lustre2/craycs/scratch/ABW_gromacs_scratch
Process ID: 196128
Command line:
gmx mdrun -ntmpi 2 -ntomp 10 -v -deffnm md_0_1

GROMACS version: 2020.4
Verified release checksum is 79c2857291b034542c26e90512b92fd4b184a1c9d6fa59c55f2e24ccf14e7281
Precision: single
Memory model: 64 bit
MPI library: thread_mpi
OpenMP support: enabled (GMX_OPENMP_MAX_THREADS = 64)
GPU support: CUDA
SIMD instructions: AVX_512
FFT library: fftw-3.3.8-sse2-avx-avx2-avx2_128-avx512
RDTSCP usage: enabled
TNG support: enabled
Hwloc support: disabled
Tracing support: disabled
C compiler: /SFS/product/eb_tcl/GCC/7.5/software/GCCcore/7.5.0/bin/gcc GNU 7.5.0
C compiler flags: -mavx512f -mfma -fexcess-precision=fast -funroll-all-loops -O3 -DNDEBUG
C++ compiler: /SFS/product/eb_tcl/GCC/7.5/software/GCCcore/7.5.0/bin/g++ GNU 7.5.0
C++ compiler flags: -mavx512f -mfma -fexcess-precision=fast -funroll-all-loops -fopenmp -O3 -DNDEBUG
CUDA compiler: /SFS/product/cuda/10.1/centos76_x86_64/bin/nvcc nvcc: NVIDIA ® Cuda compiler driver;Copyright © 2005-2019 NVIDIA Corporation;Built on Sun_Jul_28_19:07:16_PDT_2019;Cuda compilation tools, release 10.1, V10.1.243
CUDA compiler flags:-gencode;arch=compute_30,code=sm_30;-gencode;arch=compute_35,code=sm_35;-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_52,code=sm_52;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_35,code=compute_35;-gencode;arch=compute_50,code=compute_50;-gencode;arch=compute_52,code=compute_52;-gencode;arch=compute_60,code=compute_60;-gencode;arch=compute_61,code=compute_61;-gencode;arch=compute_70,code=compute_70;-gencode;arch=compute_75,code=compute_75;-use_fast_math;;-mavx512f -mfma -fexcess-precision=fast -funroll-all-loops -fopenmp -O3 -DNDEBUG
CUDA driver: 10.10
CUDA runtime: 10.10

Running on 1 node with total 40 cores, 40 logical cores, 4 compatible GPUs
Hardware detected:
CPU info:
Vendor: Intel
Brand: Intel® Xeon® Gold 6148 CPU @ 2.40GHz
Family: 6 Model: 85 Stepping: 4
Features: aes apic avx avx2 avx512f avx512cd avx512bw avx512vl clfsh cmov cx8 cx16 f16c fma hle htt intel lahf mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd rdtscp rtm sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
Number of AVX-512 FMA units: 2
Hardware topology: Basic
Sockets, cores, and logical processors:
Socket 0: [ 0] [ 1] [ 2] [ 3] [ 4] [ 5] [ 6] [ 7] [ 8] [ 9] [ 10] [ 11] [ 12] [ 13] [ 14] [ 15] [ 16] [ 17] [ 18] [ 19]
Socket 1: [ 20] [ 21] [ 22] [ 23] [ 24] [ 25] [ 26] [ 27] [ 28] [ 29] [ 30] [ 31] [ 32] [ 33] [ 34] [ 35] [ 36] [ 37] [ 38] [ 39]
GPU info:
Number of GPUs detected: 8
#0: NVIDIA Tesla V100-SXM2-32GB, compute cap.: 7.0, ECC: no, stat: compatible
#1: NVIDIA Tesla V100-SXM2-32GB, compute cap.: 7.0, ECC: no, stat: unavailable
#2: NVIDIA Tesla V100-SXM2-32GB, compute cap.: 7.0, ECC: no, stat: compatible
#3: NVIDIA Tesla V100-SXM2-32GB, compute cap.: 7.0, ECC: no, stat: unavailable
#4: NVIDIA Tesla V100-SXM2-32GB, compute cap.: 7.0, ECC: no, stat: unavailable
#5: NVIDIA Tesla V100-SXM2-32GB, compute cap.: 7.0, ECC: no, stat: compatible
#6: NVIDIA Tesla V100-SXM2-32GB, compute cap.: 7.0, ECC: no, stat: unavailable
#7: NVIDIA Tesla V100-SXM2-32GB, compute cap.: 7.0, ECC: no, stat: compatible

++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, E.
Lindahl
GROMACS: High performance molecular simulations through multi-level
parallelism from laptops to supercomputers
SoftwareX 1 (2015) pp. 19-25
-------- -------- — Thank You — -------- --------

++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
S. Páll, M. J. Abraham, C. Kutzner, B. Hess, E. Lindahl
Tackling Exascale Software Challenges in Molecular Dynamics Simulations with
GROMACS
In S. Markidis & E. Laure (Eds.), Solving Software Challenges for Exascale 8759 (2015) pp. 3-27
-------- -------- — Thank You — -------- --------

++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
S. Pronk, S. Páll, R. Schulz, P. Larsson, P. Bjelkmar, R. Apostolov, M. R.
Shirts, J. C. Smith, P. M. Kasson, D. van der Spoel, B. Hess, and E. Lindahl
GROMACS 4.5: a high-throughput and highly parallel open source molecular
simulation toolkit
Bioinformatics 29 (2013) pp. 845-54
-------- -------- — Thank You — -------- --------

++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
B. Hess and C. Kutzner and D. van der Spoel and E. Lindahl
GROMACS 4: Algorithms for highly efficient, load-balanced, and scalable
molecular simulation
J. Chem. Theory Comput. 4 (2008) pp. 435-447
-------- -------- — Thank You — -------- --------

++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
D. van der Spoel, E. Lindahl, B. Hess, G. Groenhof, A. E. Mark and H. J. C.
Berendsen
GROMACS: Fast, Flexible and Free
J. Comp. Chem. 26 (2005) pp. 1701-1719
-------- -------- — Thank You — -------- --------

++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
E. Lindahl and B. Hess and D. van der Spoel
GROMACS 3.0: A package for molecular simulation and trajectory analysis
J. Mol. Mod. 7 (2001) pp. 306-317
-------- -------- — Thank You — -------- --------

++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
H. J. C. Berendsen, D. van der Spoel and R. van Drunen
GROMACS: A message-passing parallel molecular dynamics implementation
Comp. Phys. Comm. 91 (1995) pp. 43-56
-------- -------- — Thank You — -------- --------

++++ PLEASE CITE THE DOI FOR THIS VERSION OF GROMACS ++++
https://doi.org/10.5281/zenodo.4054979
-------- -------- — Thank You — -------- --------

Input Parameters:
integrator = md
tinit = 0
dt = 0.002
nsteps = 500000
init-step = 0
simulation-part = 1
comm-mode = Linear
nstcomm = 100
bd-fric = 0
ld-seed = 898235739
emtol = 10
emstep = 0.01
niter = 20
fcstep = 0
nstcgsteep = 1000
nbfgscorr = 10
rtpi = 0.05
nstxout = 0
nstvout = 0
nstfout = 0
nstlog = 100000
nstcalcenergy = 100
nstenergy = 100000
nstxout-compressed = 100000
compressed-x-precision = 1000
cutoff-scheme = Verlet
nstlist = 10
pbc = xyz
periodic-molecules = false
verlet-buffer-tolerance = 0.005
rlist = 1
coulombtype = PME
coulomb-modifier = Potential-shift
rcoulomb-switch = 0
rcoulomb = 1
epsilon-r = 1
epsilon-rf = inf
vdw-type = Cut-off
vdw-modifier = Potential-shift
rvdw-switch = 0
rvdw = 1
DispCorr = EnerPres
table-extension = 1
fourierspacing = 0.16
fourier-nx = 52
fourier-ny = 52
fourier-nz = 52
pme-order = 4
ewald-rtol = 1e-05
ewald-rtol-lj = 0.001
lj-pme-comb-rule = Geometric
ewald-geometry = 0
epsilon-surface = 0
tcoupl = V-rescale
nsttcouple = 10
nh-chain-length = 0
print-nose-hoover-chain-variables = false
pcoupl = Parrinello-Rahman
pcoupltype = Isotropic
nstpcouple = 10
tau-p = 2
compressibility (3x3):
compressibility[ 0]={ 4.50000e-05, 0.00000e+00, 0.00000e+00}
compressibility[ 1]={ 0.00000e+00, 4.50000e-05, 0.00000e+00}
compressibility[ 2]={ 0.00000e+00, 0.00000e+00, 4.50000e-05}
ref-p (3x3):
ref-p[ 0]={ 1.00000e+00, 0.00000e+00, 0.00000e+00}
ref-p[ 1]={ 0.00000e+00, 1.00000e+00, 0.00000e+00}
ref-p[ 2]={ 0.00000e+00, 0.00000e+00, 1.00000e+00}
refcoord-scaling = No
posres-com (3):
posres-com[0]= 0.00000e+00
posres-com[1]= 0.00000e+00
posres-com[2]= 0.00000e+00
posres-comB (3):
posres-comB[0]= 0.00000e+00
posres-comB[1]= 0.00000e+00
posres-comB[2]= 0.00000e+00
QMMM = false
QMconstraints = 0
QMMMscheme = 0
MMChargeScaleFactor = 1
qm-opts:
ngQM = 0
constraint-algorithm = Lincs
continuation = true
Shake-SOR = false
shake-tol = 0.0001
lincs-order = 4
lincs-iter = 1
lincs-warnangle = 30
nwall = 0
wall-type = 9-3
wall-r-linpot = -1
wall-atomtype[0] = -1
wall-atomtype[1] = -1
wall-density[0] = 0
wall-density[1] = 0
wall-ewald-zfac = 3
pull = false
awh = false
rotation = false
interactiveMD = false
disre = No
disre-weighting = Conservative
disre-mixed = false
dr-fc = 1000
dr-tau = 0
nstdisreout = 100
orire-fc = 0
orire-tau = 0
nstorireout = 100
free-energy = no
cos-acceleration = 0
deform (3x3):
deform[ 0]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
deform[ 1]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
deform[ 2]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
simulated-tempering = false
swapcoords = no
userint1 = 0
userint2 = 0
userint3 = 0
userint4 = 0
userreal1 = 0
userreal2 = 0
userreal3 = 0
userreal4 = 0
applied-forces:
electric-field:
x:
E0 = 0
omega = 0
t0 = 0
sigma = 0
y:
E0 = 0
omega = 0
t0 = 0
sigma = 0
z:
E0 = 0
omega = 0
t0 = 0
sigma = 0
density-guided-simulation:
active = false
group = protein
similarity-measure = inner-product
atom-spreading-weight = unity
force-constant = 1e+09
gaussian-transform-spreading-width = 0.2
gaussian-transform-spreading-range-in-multiples-of-width = 4
reference-density-filename = reference.mrc
nst = 1
normalize-densities = true
adaptive-force-scaling = false
adaptive-force-scaling-time-constant = 4
grpopts:
nrdf: 8718.77 106107
ref-t: 300 300
tau-t: 0.1 0.1
annealing: No No
annealing-npoints: 0 0
acc: 0 0 0
nfreeze: N N N
energygrp-flags[ 0]: 0

Changing nstlist from 10 to 100, rlist from 1 to 1.157

Initializing Domain Decomposition on 2 ranks
Dynamic load balancing: auto
Using update groups, nr 19456, average size 2.9 atoms, max. radius 0.104 nm
Minimum cell size due to atom displacement: 0.652 nm
Initial maximum distances in bonded interactions:
two-body bonded interactions: 0.449 nm, LJ-14, atoms 1973 3103
multi-body bonded interactions: 0.449 nm, Proper Dih., atoms 3103 1973
Minimum cell size due to bonded interactions: 0.494 nm
Scaling the initial minimum size with 1/0.8 (option -dds) = 1.25
Using 0 separate PME ranks
Optimizing the DD grid for 2 cells with a minimum initial size of 0.815 nm
The maximum allowed number of cells is: X 10 Y 10 Z 10
Domain decomposition grid 2 x 1 x 1, separate PME ranks 0
PME domain decomposition: 2 x 1 x 1
Domain decomposition rank 0, coordinates 0 0 0

The initial number of communication pulses is: X 1
The initial domain decomposition cell size is: X 4.14 nm

The maximum allowed distance for atom groups involved in interactions is:
non-bonded interactions 1.366 nm
(the following are initial values, they could change due to box deformation)
two-body bonded interactions (-rdd) 1.366 nm
multi-body bonded interactions (-rdd) 1.366 nm

When dynamic load balancing gets turned on, these settings will change to:
The maximum number of communication pulses is: X 1
The minimum size for domain decomposition cells is 1.366 nm
The requested allowed shrink of DD cells (option -dds) is: 0.80
The allowed shrink of domain decomposition cells is: X 0.33
The maximum allowed distance for atom groups involved in interactions is:
non-bonded interactions 1.366 nm
two-body bonded interactions (-rdd) 1.366 nm
multi-body bonded interactions (-rdd) 1.366 nm

On host ktchpccg015 2 GPUs selected for this run.
Mapping of GPU IDs to the 2 GPU tasks in the 2 ranks on this node:
PP:0,PP:2
PP tasks will do (non-perturbed) short-ranged and most bonded interactions on the GPU
PP task will update and constrain coordinates on the CPU
Using 2 MPI threads
Using 10 OpenMP threads per tMPI thread

NOTE: Your choice of number of MPI ranks and amount of resources results in using 10 OpenMP threads per rank, which is most likely inefficient. The optimum is usually between 2 and 6 threads per rank.

NOTE: The number of threads is not equal to the number of (logical) cores
and the -pin option is set to auto: will not pin threads to cores.
This can lead to significant performance degradation.
Consider using -pin on (and -pinoffset in case you run multiple jobs).
System total charge: -0.000
Will do PME sum in reciprocal space for electrostatic interactions.

++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
U. Essmann, L. Perera, M. L. Berkowitz, T. Darden, H. Lee and L. G. Pedersen
A smooth particle mesh Ewald method
J. Chem. Phys. 103 (1995) pp. 8577-8592
-------- -------- — Thank You — -------- --------

Using a Gaussian width (1/beta) of 0.320163 nm for Ewald
Potential shift: LJ r^-12: -1.000e+00 r^-6: -1.000e+00, Ewald -1.000e-05
Initialized non-bonded Ewald tables, spacing: 9.33e-04 size: 1073

Generated table with 1078 data points for 1-4 COUL.
Tabscale = 500 points/nm
Generated table with 1078 data points for 1-4 LJ6.
Tabscale = 500 points/nm
Generated table with 1078 data points for 1-4 LJ12.
Tabscale = 500 points/nm

Using GPU 8x8 nonbonded short-range kernels

Using a dual 8x8 pair-list setup updated with dynamic, rolling pruning:
outer list: updated every 100 steps, buffer 0.157 nm, rlist 1.157 nm
inner list: updated every 10 steps, buffer 0.001 nm, rlist 1.001 nm
At tolerance 0.005 kJ/mol/ps per atom, equivalent classical 1x1 list would be:
outer list: updated every 100 steps, buffer 0.305 nm, rlist 1.305 nm
inner list: updated every 10 steps, buffer 0.042 nm, rlist 1.042 nm

Using Lorentz-Berthelot Lennard-Jones combination rule

Long Range LJ corr.: 3.0483e-04

Initializing LINear Constraint Solver

++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
B. Hess and H. Bekker and H. J. C. Berendsen and J. G. E. M. Fraaije
LINCS: A Linear Constraint Solver for molecular simulations
J. Comp. Chem. 18 (1997) pp. 1463-1472
-------- -------- — Thank You — -------- --------

The number of constraints is 1706

++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
S. Miyamoto and P. A. Kollman
SETTLE: An Analytical Version of the SHAKE and RATTLE Algorithms for Rigid
Water Models
J. Comp. Chem. 13 (1992) pp. 952-962
-------- -------- — Thank You — -------- --------

Linking all bonded interactions to atoms

Intra-simulation communication will occur every 10 steps.

++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
G. Bussi, D. Donadio and M. Parrinello
Canonical sampling through velocity rescaling
J. Chem. Phys. 126 (2007) pp. 014101
-------- -------- — Thank You — -------- --------

There are: 56528 Atoms
Atom distribution over 2 domains: av 28264 stddev 43 min 28221 max 28307

NOTE: DLB will not turn on during the first phase of PME tuning
Center of mass motion removal mode is Linear
We have the following groups for center of mass motion removal:
0: rest

Started mdrun on rank 0 Sat Nov 7 02:55:20 2020

       Step           Time
          0        0.00000

   Energies (kJ/mol)
           Bond          Angle    Proper Dih.  Improper Dih.          LJ-14
    2.64995e+03    7.08888e+03    4.11627e+03    3.90657e+02    3.95263e+03
     Coulomb-14        LJ (SR)  Disper. corr.   Coulomb (SR)   Coul. recip.
    4.59261e+04    1.02064e+05   -7.17474e+03   -9.29911e+05    5.10661e+03
      Potential    Kinetic En.   Total Energy  Conserved En.    Temperature
   -7.65790e+05    2.51272e+09    2.51196e+09    2.51196e+09    5.26381e+06
 Pres. DC (bar) Pressure (bar)   Constr. rmsd
   -2.10007e+02   -5.00312e+07    2.83630e-06
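
The step-0 energies above are worth a second look: the log reports a temperature of about 5.26e+06 K, while ref-t is 300 and continuation = true. The short Python check below recomputes the temperature from the logged kinetic energy and the nrdf values using the equipartition relation KE = (1/2) N_df k_B T. All inputs are copied from the log; the value of k_B is the standard gas constant in kJ/(mol K), and the function name `temperature_from_kinetic` is only illustrative.

```python
K_B = 0.0083144626  # Boltzmann constant in kJ/(mol K), i.e. the gas constant R

# Values copied from the log above.
kinetic_energy = 2.51272e09      # "Kinetic En." at step 0, kJ/mol
logged_temperature = 5.26381e06  # "Temperature" at step 0, K
nrdf = 8718.77 + 106107          # degrees of freedom of both coupling groups
ref_t = 300.0                    # "ref-t" target temperature, K


def temperature_from_kinetic(ke, n_dof):
    """Equipartition: KE = (1/2) * N_df * k_B * T, solved for T."""
    return 2.0 * ke / (n_dof * K_B)


t = temperature_from_kinetic(kinetic_energy, nrdf)
print(f"T from kinetic energy: {t:.4g} K (log reports {logged_temperature:.4g} K)")
print(f"ratio to ref-t: {t / ref_t:.0f}x")
```

The recomputed value agrees with the logged one, so the number is not a printing glitch. The 2-rank run is already unphysical at step 0, which would fit the SETTLE error seen with the MPI build and suggests the crash is a correctness problem rather than a performance one.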

Here is the second run on the same system, but using

mpirun -np 2 gmx_mpi mdrun -ntomp 10 -v -deffnm md_0_1

Interestingly, this segfaults with a “one or more molecules cannot be settled” error, but only with dual GPUs (-np 1 also runs fine).

GROMACS: gmx mdrun, version 2020.4
Executable: /SFS/user/ry/waight/tools/gromacs-2020-mpi/bin/gmx_mpi
Data prefix: /SFS/user/ry/waight/tools/gromacs-2020-mpi
Working dir: /mnt/lustre2/craycs/scratch/ABW_gromacs_scratch
Process ID: 42352
Command line:
gmx_mpi mdrun -ntomp 10 -v -deffnm md_0_1

GROMACS version: 2020.4
Verified release checksum is 79c2857291b034542c26e90512b92fd4b184a1c9d6fa59c55f2e24ccf14e7281
Precision: single
Memory model: 64 bit
MPI library: MPI
OpenMP support: enabled (GMX_OPENMP_MAX_THREADS = 64)
GPU support: CUDA
SIMD instructions: AVX_512
FFT library: fftw-3.3.8-sse2-avx-avx2-avx2_128-avx512
RDTSCP usage: enabled
TNG support: enabled
Hwloc support: disabled
Tracing support: disabled
C compiler: /SFS/product/eb_tcl/GCC/7.5/software/GCCcore/7.5.0/bin/gcc GNU 7.5.0
C compiler flags: -mavx512f -mfma -pthread -fexcess-precision=fast -funroll-all-loops -O3 -DNDEBUG
C++ compiler: /SFS/product/eb_tcl/GCC/7.5/software/GCCcore/7.5.0/bin/g++ GNU 7.5.0
C++ compiler flags: -mavx512f -mfma -pthread -fexcess-precision=fast -funroll-all-loops -fopenmp -O3 -DNDEBUG
CUDA compiler: /SFS/product/cuda/10.1/centos76_x86_64/bin/nvcc nvcc: NVIDIA ® Cuda compiler driver;Copyright © 2005-2019 NVIDIA Corporation;Built on Sun_Jul_28_19:07:16_PDT_2019;Cuda compilation tools, release 10.1, V10.1.243
CUDA compiler flags:-gencode;arch=compute_30,code=sm_30;-gencode;arch=compute_35,code=sm_35;-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_52,code=sm_52;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_35,code=compute_35;-gencode;arch=compute_50,code=compute_50;-gencode;arch=compute_52,code=compute_52;-gencode;arch=compute_60,code=compute_60;-gencode;arch=compute_61,code=compute_61;-gencode;arch=compute_70,code=compute_70;-gencode;arch=compute_75,code=compute_75;-use_fast_math;;-mavx512f -mfma -pthread -fexcess-precision=fast -funroll-all-loops -fopenmp -O3 -DNDEBUG
CUDA driver: 10.10
CUDA runtime: 10.10

Running on 1 node with total 40 cores, 40 logical cores, 4 compatible GPUs
Hardware detected on host ktchpccg015 (the node of MPI rank 0):
CPU info:
Vendor: Intel
Brand: Intel® Xeon® Gold 6148 CPU @ 2.40GHz
Family: 6 Model: 85 Stepping: 4
Features: aes apic avx avx2 avx512f avx512cd avx512bw avx512vl clfsh cmov cx8 cx16 f16c fma hle htt intel lahf mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd rdtscp rtm sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
Number of AVX-512 FMA units: 2
Hardware topology: Basic
Sockets, cores, and logical processors:
Socket 0: [ 0] [ 1] [ 2] [ 3] [ 4] [ 5] [ 6] [ 7] [ 8] [ 9] [ 10] [ 11] [ 12] [ 13] [ 14] [ 15] [ 16] [ 17] [ 18] [ 19]
Socket 1: [ 20] [ 21] [ 22] [ 23] [ 24] [ 25] [ 26] [ 27] [ 28] [ 29] [ 30] [ 31] [ 32] [ 33] [ 34] [ 35] [ 36] [ 37] [ 38] [ 39]
GPU info:
Number of GPUs detected: 8
#0: NVIDIA Tesla V100-SXM2-32GB, compute cap.: 7.0, ECC: no, stat: compatible
#1: NVIDIA Tesla V100-SXM2-32GB, compute cap.: 7.0, ECC: no, stat: unavailable
#2: NVIDIA Tesla V100-SXM2-32GB, compute cap.: 7.0, ECC: no, stat: compatible
#3: NVIDIA Tesla V100-SXM2-32GB, compute cap.: 7.0, ECC: no, stat: unavailable
#4: NVIDIA Tesla V100-SXM2-32GB, compute cap.: 7.0, ECC: no, stat: unavailable
#5: NVIDIA Tesla V100-SXM2-32GB, compute cap.: 7.0, ECC: no, stat: compatible
#6: NVIDIA Tesla V100-SXM2-32GB, compute cap.: 7.0, ECC: no, stat: unavailable
#7: NVIDIA Tesla V100-SXM2-32GB, compute cap.: 7.0, ECC: no, stat: compatible

Input Parameters:
integrator = md
tinit = 0
dt = 0.002
nsteps = 500000
init-step = 0
simulation-part = 1
comm-mode = Linear
nstcomm = 100
bd-fric = 0
ld-seed = -1826774160
emtol = 10
emstep = 0.01
niter = 20
fcstep = 0
nstcgsteep = 1000
nbfgscorr = 10
rtpi = 0.05
nstxout = 0
nstvout = 0
nstfout = 0
nstlog = 100000
nstcalcenergy = 100
nstenergy = 100000
nstxout-compressed = 100000
compressed-x-precision = 1000
cutoff-scheme = Verlet
nstlist = 10
pbc = xyz
periodic-molecules = false
verlet-buffer-tolerance = 0.005
rlist = 1
coulombtype = PME
coulomb-modifier = Potential-shift
rcoulomb-switch = 0
rcoulomb = 1
epsilon-r = 1
epsilon-rf = inf
vdw-type = Cut-off
vdw-modifier = Potential-shift
rvdw-switch = 0
rvdw = 1
DispCorr = EnerPres
table-extension = 1
fourierspacing = 0.16
fourier-nx = 52
fourier-ny = 52
fourier-nz = 52
pme-order = 4
ewald-rtol = 1e-05
ewald-rtol-lj = 0.001
lj-pme-comb-rule = Geometric
ewald-geometry = 0
epsilon-surface = 0
tcoupl = V-rescale
nsttcouple = 10
nh-chain-length = 0
print-nose-hoover-chain-variables = false
pcoupl = Parrinello-Rahman
pcoupltype = Isotropic
nstpcouple = 10
tau-p = 2
compressibility (3x3):
compressibility[ 0]={ 4.50000e-05, 0.00000e+00, 0.00000e+00}
compressibility[ 1]={ 0.00000e+00, 4.50000e-05, 0.00000e+00}
compressibility[ 2]={ 0.00000e+00, 0.00000e+00, 4.50000e-05}
ref-p (3x3):
ref-p[ 0]={ 1.00000e+00, 0.00000e+00, 0.00000e+00}
ref-p[ 1]={ 0.00000e+00, 1.00000e+00, 0.00000e+00}
ref-p[ 2]={ 0.00000e+00, 0.00000e+00, 1.00000e+00}
refcoord-scaling = No
posres-com (3):
posres-com[0]= 0.00000e+00
posres-com[1]= 0.00000e+00
posres-com[2]= 0.00000e+00
posres-comB (3):
posres-comB[0]= 0.00000e+00
posres-comB[1]= 0.00000e+00
posres-comB[2]= 0.00000e+00
QMMM = false
QMconstraints = 0
QMMMscheme = 0
MMChargeScaleFactor = 1
qm-opts:
ngQM = 0
constraint-algorithm = Lincs
continuation = true
Shake-SOR = false
shake-tol = 0.0001
lincs-order = 4
lincs-iter = 1
lincs-warnangle = 30
nwall = 0
wall-type = 9-3
wall-r-linpot = -1
wall-atomtype[0] = -1
wall-atomtype[1] = -1
wall-density[0] = 0
wall-density[1] = 0
wall-ewald-zfac = 3
pull = false
awh = false
rotation = false
interactiveMD = false
disre = No
disre-weighting = Conservative
disre-mixed = false
dr-fc = 1000
dr-tau = 0
nstdisreout = 100
orire-fc = 0
orire-tau = 0
nstorireout = 100
free-energy = no
cos-acceleration = 0
deform (3x3):
deform[ 0]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
deform[ 1]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
deform[ 2]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
simulated-tempering = false
swapcoords = no
userint1 = 0
userint2 = 0
userint3 = 0
userint4 = 0
userreal1 = 0
userreal2 = 0
userreal3 = 0
userreal4 = 0
applied-forces:
electric-field:
x:
E0 = 0
omega = 0
t0 = 0
sigma = 0
y:
E0 = 0
omega = 0
t0 = 0
sigma = 0
z:
E0 = 0
omega = 0
t0 = 0
sigma = 0
density-guided-simulation:
active = false
group = protein
similarity-measure = inner-product
atom-spreading-weight = unity
force-constant = 1e+09
gaussian-transform-spreading-width = 0.2
gaussian-transform-spreading-range-in-multiples-of-width = 4
reference-density-filename = reference.mrc
nst = 1
normalize-densities = true
adaptive-force-scaling = false
adaptive-force-scaling-time-constant = 4
grpopts:
nrdf: 8718.77 106107
ref-t: 300 300
tau-t: 0.1 0.1
annealing: No No
annealing-npoints: 0 0
acc: 0 0 0
nfreeze: N N N
energygrp-flags[ 0]: 0

Changing nstlist from 10 to 100, rlist from 1 to 1.157

Initializing Domain Decomposition on 2 ranks
Dynamic load balancing: auto
Using update groups, nr 19456, average size 2.9 atoms, max. radius 0.104 nm
Minimum cell size due to atom displacement: 0.652 nm
Initial maximum distances in bonded interactions:
two-body bonded interactions: 0.450 nm, LJ-14, atoms 1973 3103
multi-body bonded interactions: 0.450 nm, Proper Dih., atoms 3103 1973
Minimum cell size due to bonded interactions: 0.495 nm
Scaling the initial minimum size with 1/0.8 (option -dds) = 1.25
Using 0 separate PME ranks
Optimizing the DD grid for 2 cells with a minimum initial size of 0.815 nm
The maximum allowed number of cells is: X 10 Y 10 Z 10
Domain decomposition grid 2 x 1 x 1, separate PME ranks 0
PME domain decomposition: 2 x 1 x 1
Domain decomposition rank 0, coordinates 0 0 0

The initial number of communication pulses is: X 1
The initial domain decomposition cell size is: X 4.14 nm

The maximum allowed distance for atom groups involved in interactions is:
non-bonded interactions 1.366 nm
(the following are initial values, they could change due to box deformation)
two-body bonded interactions (-rdd) 1.366 nm
multi-body bonded interactions (-rdd) 1.366 nm

When dynamic load balancing gets turned on, these settings will change to:
The maximum number of communication pulses is: X 1
The minimum size for domain decomposition cells is 1.366 nm
The requested allowed shrink of DD cells (option -dds) is: 0.80
The allowed shrink of domain decomposition cells is: X 0.33
The maximum allowed distance for atom groups involved in interactions is:
non-bonded interactions 1.366 nm
two-body bonded interactions (-rdd) 1.366 nm
multi-body bonded interactions (-rdd) 1.366 nm

On host ktchpccg015 2 GPUs selected for this run.
Mapping of GPU IDs to the 2 GPU tasks in the 2 ranks on this node:
PP:0,PP:2
PP tasks will do (non-perturbed) short-ranged and most bonded interactions on the GPU
PP task will update and constrain coordinates on the CPU
Using 2 MPI processes

Non-default thread affinity set, disabling internal thread affinity

Using 10 OpenMP threads per MPI process

NOTE: Your choice of number of MPI ranks and amount of resources results in using 10 OpenMP threads per rank, which is most likely inefficient. The optimum is usually between 2 and 6 threads per rank.
System total charge: -0.000
Will do PME sum in reciprocal space for electrostatic interactions.

++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
U. Essmann, L. Perera, M. L. Berkowitz, T. Darden, H. Lee and L. G. Pedersen
A smooth particle mesh Ewald method
J. Chem. Phys. 103 (1995) pp. 8577-8592
-------- -------- — Thank You — -------- --------

Using a Gaussian width (1/beta) of 0.320163 nm for Ewald
Potential shift: LJ r^-12: -1.000e+00 r^-6: -1.000e+00, Ewald -1.000e-05
Initialized non-bonded Ewald tables, spacing: 9.33e-04 size: 1073

Generated table with 1078 data points for 1-4 COUL.
Tabscale = 500 points/nm
Generated table with 1078 data points for 1-4 LJ6.
Tabscale = 500 points/nm
Generated table with 1078 data points for 1-4 LJ12.
Tabscale = 500 points/nm

Using GPU 8x8 nonbonded short-range kernels

Using a dual 8x8 pair-list setup updated with dynamic, rolling pruning:
outer list: updated every 100 steps, buffer 0.157 nm, rlist 1.157 nm
inner list: updated every 10 steps, buffer 0.001 nm, rlist 1.001 nm
At tolerance 0.005 kJ/mol/ps per atom, equivalent classical 1x1 list would be:
outer list: updated every 100 steps, buffer 0.306 nm, rlist 1.306 nm
inner list: updated every 10 steps, buffer 0.042 nm, rlist 1.042 nm

Using Lorentz-Berthelot Lennard-Jones combination rule

Long Range LJ corr.: 3.0483e-04

Initializing LINear Constraint Solver

++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
B. Hess and H. Bekker and H. J. C. Berendsen and J. G. E. M. Fraaije
LINCS: A Linear Constraint Solver for molecular simulations
J. Comp. Chem. 18 (1997) pp. 1463-1472
-------- -------- — Thank You — -------- --------

The number of constraints is 1706

++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
S. Miyamoto and P. A. Kollman
SETTLE: An Analytical Version of the SHAKE and RATTLE Algorithms for Rigid
Water Models
J. Comp. Chem. 13 (1992) pp. 952-962
-------- -------- — Thank You — -------- --------

Linking all bonded interactions to atoms

Intra-simulation communication will occur every 10 steps.

++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
G. Bussi, D. Donadio and M. Parrinello
Canonical sampling through velocity rescaling
J. Chem. Phys. 126 (2007) pp. 014101
-------- -------- — Thank You — -------- --------

There are: 56528 Atoms
Atom distribution over 2 domains: av 28264 stddev 67 min 28197 max 28331

NOTE: DLB will not turn on during the first phase of PME tuning
Center of mass motion removal mode is Linear
We have the following groups for center of mass motion removal:
0: rest

Started mdrun on rank 0 Sat Nov 7 03:20:48 2020

       Step           Time
          0        0.00000

   Energies (kJ/mol)
           Bond          Angle    Proper Dih.  Improper Dih.          LJ-14
    2.77117e+03    7.36667e+03    4.10178e+03    3.94997e+02    3.95671e+03
     Coulomb-14        LJ (SR)  Disper. corr.   Coulomb (SR)   Coul. recip.
    4.58217e+04    1.02373e+05   -7.19928e+03   -9.29511e+05    5.22782e+03
      Potential    Kinetic En.   Total Energy  Conserved En.    Temperature
   -7.64697e+05    2.49244e+09    2.49167e+09    2.49167e+09    5.22132e+06
 Pres. DC (bar) Pressure (bar)   Constr. rmsd
   -2.11444e+02   -4.98122e+07    2.93781e-06

A system of your size (~50K atoms) will not scale well to multiple V100s. There may be a small benefit to using 2, but certainly not more than that. If possible, you should set up multiple simulations in parallel, each on a single GPU. You’ll also want to offload more of the computation to the GPU, using the -pme gpu and likely -update gpu options.
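
As a rough sketch of the one-simulation-per-GPU suggestion, the script below prints one mdrun command per GPU, each with its own block of cores. It only echoes the commands so they can be inspected before going into a job script. The sim0, sim1, ... directories, the function name, and the 2 GPU / 16 core figures are assumptions for illustration; core numbering on your nodes may differ, and -update gpu could be appended if your build and system support it.

```shell
#!/usr/bin/env bash
# Sketch: print one single-GPU mdrun command per GPU (nothing is executed).
# Assumptions: thread-MPI build (gmx), one run directory per job named
# sim0, sim1, ... (hypothetical), logical cores numbered contiguously.
print_per_gpu_jobs() {
  local ngpus=$1 ncores=$2
  local per_job=$(( ncores / ngpus ))   # cores given to each simulation
  local i
  for (( i = 0; i < ngpus; i++ )); do
    echo "(cd sim$i && gmx mdrun -deffnm md_0_1 -ntmpi 1 -ntomp $per_job" \
         "-nb gpu -pme gpu -gpu_id $i -pin on -pinoffset $(( i * per_job )) -pinstride 1) &"
  done
  echo "wait"   # the job script should wait for all background runs
}

print_per_gpu_jobs 2 16
```

Each job gets a distinct -gpu_id and a non-overlapping -pinoffset, so the runs should not compete for the same cores or the same GPU.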

Check out the performance section of the GROMACS documentation for tips on how to specify cores and GPUs. There have also been a bunch of discussions on the old forum with good examples, such as https://mailman-1.sys.kth.se/pipermail/gromacs.org_gmx-users/2019-July/126014.html

You have a number of warnings here that indicate some of what’s going on. In particular, only two of your GPUs were selected, and threads were not pinned:

On host ktchpccg015 2 GPUs selected for this run.
Mapping of GPU IDs to the 2 GPU tasks in the 2 ranks on this node:
PP:0,PP:2
PP tasks will do (non-perturbed) short-ranged and most bonded interactions on the GPU
PP task will update and constrain coordinates on the CPU
Using 2 MPI threads
Using 10 OpenMP threads per tMPI thread
NOTE: Your choice of number of MPI ranks and amount of resources results in using 10 OpenMP threads per rank, which is most likely inefficient. The optimum is usually between 2 and 6 threads per rank.
NOTE: The number of threads is not equal to the number of (logical) cores
and the -pin option is set to auto: will not pin threads to cores.
This can lead to significant performance degradation.
Consider using -pin on (and -pinoffset in case you run multiple jobs).
System total charge: -0.000
Will do PME sum in reciprocal space for electrostatic interactions.
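
The pinning NOTE in that log boils down to a simple comparison: ranks times OpenMP threads against the logical cores available. A minimal sketch of that check is below. The function name and the 40-core figure in the demo are made up for illustration; in a real job you would pass "$(nproc)" as the last argument.

```shell
#!/usr/bin/env bash
# Sketch: compare ranks x OpenMP threads with the logical core count.
# Per the mdrun NOTE, -pin auto does not pin when the two differ.
check_pinning() {
  local ranks=$1 ntomp=$2 cores=$3
  local total=$(( ranks * ntomp ))
  if (( total == cores )); then
    echo "$total threads on $cores cores: counts match"
  else
    echo "$total threads on $cores cores: counts differ, consider -pin on"
  fi
}

# 2 ranks x 10 threads as in the log; the 40 cores are hypothetical
check_pinning 2 10 40
```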

Thanks Kevin, I have checked out the Gromacs performance section and tried literally every variation of commands. I think I understand how to offload PME etc. to other ranks, and I understand that this can really optimize a simulation that is already running well.

My major question is why the program runs so incredibly slowly when I use mpirun. Inefficient use of threads notwithstanding, any time I invoke mdrun with “mpirun” the performance crawls to 30 ns/day (down from ~220 ns/day without mpirun); a difference that large suggests something is drastically wrong. I can use gmx_mpi without mpirun, or gmx -ntmpi 1, and get good speed, but these commands are limited to a single GPU (whether that is advantageous or not).

I’m really interested in how gromacs communicates with the hardware and it seems very strange that invoking the openmpi command should cause such a decrease in performance.

Thanks so much for your help and advice!

       Drew

The answer is:

My cluster assigns GPUs with the NVIDIA compute mode set to “exclusive process”. This can be checked with nvidia-smi -q -i GPUID. Whenever mpirun starts more MPI processes than there are GPUs, the additional processes cannot access a GPU. I think the segfault with gmx -ntmpi >1 probably has something to do with exclusive process mode as well.
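
For anyone hitting the same thing, here is a small sketch of how the check could be scripted. It reads "index, compute_mode" lines, such as those produced by nvidia-smi --query-gpu=index,compute_mode --format=csv,noheader, and warns when more ranks are requested than there are GPUs while any GPU is in an exclusive mode. The function name and the sample lines are made up for illustration, and the exact mode strings may vary between driver versions.

```shell
#!/usr/bin/env bash
# Sketch: warn if the rank count exceeds the GPU count while GPUs are exclusive.
# Reads lines of the form "index, compute_mode" on stdin.
check_ranks_vs_gpus() {
  local nranks=$1 ngpus=0 nexcl=0 idx mode
  while IFS=, read -r idx mode; do
    [ -z "$idx" ] && continue          # skip empty lines
    ngpus=$(( ngpus + 1 ))
    case "$mode" in
      *Exclusive*) nexcl=$(( nexcl + 1 )) ;;
    esac
  done
  if (( nexcl > 0 && nranks > ngpus )); then
    echo "WARN: $nranks ranks but only $ngpus GPUs ($nexcl exclusive)"
  else
    echo "OK: $nranks ranks, $ngpus GPUs ($nexcl exclusive)"
  fi
}

# Real use (not run here):
#   nvidia-smi --query-gpu=index,compute_mode --format=csv,noheader | check_ranks_vs_gpus 4
# Demo with illustrative sample lines:
printf '0, Exclusive_Process\n1, Exclusive_Process\n' | check_ranks_vs_gpus 4
```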

Drew,

The segfault is certainly not expected, even if the GPUs are in process exclusive mode (and if you can reproduce it please file a report on https://gitlab.com/gromacs/gromacs/-/issues).

For a single-node run you do not need MPI, so you could test the default thread-MPI run (request 1 task from the batch system and run -ntmpi N).
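
A sketch of what such a thread-MPI launch might look like: the helper below only prints the command for N ranks, splitting the cores evenly between them. The function name is made up, and the 4 rank / 32 core figures are taken from the first K80 request in this thread purely for illustration; nothing here is tuned.

```shell
#!/usr/bin/env bash
# Sketch: print a single-node thread-MPI mdrun command (nothing is executed).
# Assumes a non-MPI build that provides the plain gmx binary.
print_tmpi_cmd() {
  local ntmpi=$1 ncores=$2
  echo "gmx mdrun -ntmpi $ntmpi -ntomp $(( ncores / ntmpi )) -pin on -v -deffnm md_0_1"
}

print_tmpi_cmd 4 32
```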

In addition, regarding your performance issues, as Kevin has already pointed out, you have some thread affinity issues, which can very well cause huge slowdowns. (Also, does your single-GPU MPI run at 37.5 ns/day use a K80 GPU, as the batch submission suggests?)

Szilárd