GROMACS scaling limit for a large system

GROMACS version:2018
GROMACS modification: Yes/No

Hi

I am trying to simulate a system of 500,000 atoms using the following job script:

#!/bin/bash
#SBATCH -N 4 --tasks-per-node=2
#SBATCH -t 01:00:00
#SBATCH -p GPU --gres=gpu:p100:2

# Set up the module command
set -x                    # echo commands as they run ("set echo" is csh syntax)
module load gromacs/2018_gpu
cd $SLURM_SUBMIT_DIR      # move to working directory

# this job assumes:
# - all input data is stored in this directory
# - all output should be stored in this directory
cd /pylon5/bio200035p/amnah/6PIJnew

# run GPU program
./mygpu

# Replace "em" below with the default name for your files:
mpirun -np 8 gmx_mpi mdrun -deffnm p100test -resethway -noconfout -nsteps 30000 -v -pin on -nb gpu -pme gpu -npme 1 -gputasks 01 -ntomp 12 -resetstep 10000 -s npttestnew.tpr -dlb yes

This is the best performance I got:

Using 8 MPI processes
Using 12 OpenMP threads per MPI process

On host gpu017.pvt.bridges.psc.edu 2 GPUs user-selected for this run.
Mapping of GPU IDs to the 2 GPU tasks in the 2 ranks on this node:
PP:0,PP:1

NOTE: Your choice of number of MPI ranks and amount of resources results in using 12 OpenMP threads per rank, which is most likely inefficient. The optimum is usually between 2 and 6 threads per rank.

               Core t (s)   Wall t (s)        (%)
       Time:     7860.125       81.876     9600.0
                 (ns/day)    (hour/ns)
Performance:       31.660        0.758
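
If I read that note correctly, a launch closer to what it suggests (fewer OpenMP threads per rank, one separate PME rank) would be something like the sketch below; the rank and thread counts are just my guess for a single 2-GPU, 32-core node, not something I have benchmarked:

#SBATCH -N 1 --tasks-per-node=4
#SBATCH -p GPU --gres=gpu:p100:2
module load gromacs/2018_gpu
# 3 PP ranks + 1 PME rank, 6 OpenMP threads each; GPU assignment left to mdrun
mpirun -np 4 gmx_mpi mdrun -deffnm p100test -s npttestnew.tpr \
       -nb gpu -pme gpu -npme 1 -ntomp 6 -pin on -dlb yes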

but I think ~31 ns/day is not good enough.
My question is about the scaling limit: is 31 ns/day the best performance I can expect for my system (500,000 atoms)?
If not, how can I improve it? I have been trying for a month and nothing has improved.

Thank you, I appreciate your help.

I’m running an 800k-atom system right now and getting 73 ns/day, so no, 31 ns/day is clearly not the limit. The difference is that I’m running CPU-only on 960 cores, and from what I’ve seen you don’t get much speedup from GROMACS beyond 1 GPU (I hope someone can confirm that).

Best,
MW

That is not universally true, especially not with 800k atoms, e.g. see Fig 12 of https://aip.scitation.org/doi/full/10.1063/5.0018516

Even if you do not have many fast CPU cores to run PME with decomposition (with GPUs, PME decomposition is not supported), you should still be able to scale, but of course it will depend on the hardware. In particular, on dense GPU nodes with limited CPU–GPU interconnect performance, scaling may simply not be possible due to the extreme imbalance between the speed of computation on a GPU and the speed of transferring data across the PCIe bus.

No, but performance will depend on your hardware and simulation settings. Please share details so we can advise. You might also want to wait a few weeks for the 2021 release, which improves the robustness of the GPU direct communication feature and removes some of its limitations.
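
When you do try the 2020/2021 releases: as far as I know the GPU direct communication path still has to be opted into with environment variables and only works with the thread-MPI build on a single node, so treat the sketch below as an assumption based on the 2020 release notes rather than a guaranteed interface:

# GROMACS 2020/2021, thread-MPI build, single node (opt-in development feature)
export GMX_GPU_DD_COMMS=true             # halo exchange directly between GPUs
export GMX_GPU_PME_PP_COMMS=true         # PP-PME coordinate/force transfers between GPUs
export GMX_FORCE_UPDATE_DEFAULT_GPU=true # run update and constraints on the GPU
gmx mdrun -s npttestnew.tpr -ntmpi 4 -npme 1 -nb gpu -pme gpu -bonded gpu -ntomp 6 -pin on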

Thank you so much for your reply.

These are the mdp parameters I used to run the test:

; Preprocessing

; 7.3.3 Run Control
integrator = md ; leap-frog integrator
dt = 0.002 ; [ps] time step for integration 2fs
nsteps = 500000 ; maximum number of steps to integrate, 0.002 * 500000 = 1000 ps (1 ns)
comm_mode = Linear ; remove center of mass translation
comm_grps = Protein Non-Protein ; group(s) for center of mass motion removal

; Output Control
nstxout = 5000 ; save coordinates every 10.0 ps
nstvout = 5000 ; save velocities every 10.0 ps
nstlog = 5000 ; update log file every 10.0 ps
nstfout = 5000 ; [steps] freq to write forces to trajectory
nstenergy = 5000 ; [steps] freq to write energies to energy file
nstxtcout = 5000 ; [steps] freq to write coordinates to xtc trajectory
xtc_precision = 5000 ; [real] precision to write xtc trajectory
xtc_grps = System ; group(s) to write to xtc trajectory
energygrps = System ; group(s) to write to energy f

; Neighbor Searching
cutoff-scheme = Verlet ; Method of managing neighbor lists
nstlist = 10 ; [steps] freq to update neighbor list
ns_type = grid ; method of updating neighbor list
pbc = xyz ; periodic boundary conditions in all directions
rlist = 0.8 ; [nm] cut-off distance for the short-range neighbor list

; Electrostatics
coulombtype = PME ; Particle-Mesh Ewald electrostatics
rcoulomb = 1.2 ; [nm] distance for Coulomb cut-off

; VdW
vdwtype = cut-off ;
;vdw-modifier = Force-switch ; Use Van der Waals force switching
rvdw = 1.2 ; Van der Waals cutoff (nm)
rvdw_switch = 0.8 ; Van der Waals force switching (nm)
DispCorr = EnerPres ; apply long range dispersion corrections

; Ewald
fourierspacing = 0.1 ; [nm] grid spacing for FFT grid when using PME
pme_order = 4 ; interpolation order for PME, 4 = cubic
ewald_rtol = 1e-5 ; relative strength of Ewald-shifted potential at rcoulomb

; Temperature coupling
tcoupl = nose-hoover ; Nose-Hoover extended-ensemble thermostat
tau_t = 1 1 1 1 ; [ps] time constant for coupling
tc-grps = Protein DNA RNA Water_and_ions ; groups to couple separately to temperature bath
ref_t = 300 300 300 300 ; [K] reference temperature for coupling

; Pressure coupling
Pcoupl = parrinello-rahman ; pressure coupling where box vectors are variable
Pcoupltype = isotropic ; pressure coupling in x-y-z directions
tau_p = 2.5 ; [ps] time constant for coupling
compressibility = 4.5e-5 ; [bar^-1] compressibility
ref_p = 1.0 ; [bar] reference pressure for coupling

; Velocity Generation
gen_vel = no ; do not generate velocities (continuation run)
gen_temp = 300 ; [K] temperature for Maxwell distribution
gen_seed = -1 ; [integer] used to initialize random generator for random velocities

; Bonds
constraints = h-bonds ; convert bonds involving hydrogen to constraints
constraint_algorithm = LINCS ; LINear Constraint Solver
continuation = yes ; do not apply constraints to the start configuration (continuing a previous run)
lincs_order = 4 ; highest order in the expansion of the constraint coupling matrix
lincs_iter = 1 ; number of iterations to correct for rotational lengthening
lincs_warnangle = 30 ; [degrees] maximum angle that a bond can rotate before LINCS will complain

I used P100 GPUs.
I have tried changing the number of nodes (1, 2, 3, 4); each node has 4 GPUs and 32 CPU cores.
I have tried different settings by changing the number of CPUs and MPI processes, and I also varied the OpenMP threads per MPI process from 2 to 6.
The maximum performance I got is 31 ns/day.

Thanks! That clears up some confusion for me, although it still looks like CPUs scale more robustly. Is that going to improve with FMM? Looking forward to the 2021 release and benchmarks; maybe this will be the one that does it for me in terms of GPU vs CPU.

Yes, that is true. It mainly boils down to the CPU–GPU bus increasingly becoming a severe bottleneck; tighter integration of CPUs and GPUs, while on the horizon and very important for MD codes, is not happening fast enough.
While the speed at which a single MD step can be computed has increased by up to 5x (for some computations even up to 10x, see Fig. 3 of https://arxiv.org/pdf/1903.05918.pdf), PCIe 3 has been around unchanged for 7+ years, and PCIe 4.0 really only brings a 2x improvement.
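
To put rough numbers on that (a back-of-the-envelope estimate, not from a profile):

500,000 atoms x 3 coordinates x 4 bytes ≈ 6 MB of positions per step, plus ~6 MB of forces back
~12 GB/s of practical PCIe 3.0 x16 bandwidth → roughly 0.5 ms per direction, ~1 ms of bus traffic per step
30 ns/day at a 2 fs step ≈ 5.8 ms per step; 100 ns/day ≈ 1.7 ms per step

So the host-device transfers alone can already be a sizeable fraction of the step times one is aiming for.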

At the same time, CPU-only machines do not have the added complexity and latency that GPU-accelerated machines have, so strong scaling is better on those machines.

Yes, FMM is significantly better suited for strong scaling, but it is not yet known where the crossover between PME and FMM will be in fully optimized implementations. We expect that at smaller scale PME will still win, but this is still an area of active research.

That is why we are still investing in PME, specifically in PME decomposition on GPUs, which, at least with fast HPC interconnects, has some room to improve on the current, limited scaling.

Admittedly, for your use case of 800k atoms a modern CPU-only machine will likely still have better peak performance (though certainly not better cost efficiency per ns simulated), given that even PME can still only run on a single GPU in the 2021 release. Your scaling profile will be similar to the benchmark I linked previously from our recent paper (~60 ns/day peak on V100s with NVLink for 1M atoms with the CHARMM FF).

What kind of interconnect do you have between CPU and GPU and between nodes? Please share some logs of your runs.
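
If you are not sure what the hardware is, a few standard commands run on a compute node will tell you most of it (just the usual Linux/NVIDIA tools; adjust to whatever is installed on Bridges):

nvidia-smi topo -m   # how the GPUs attach to the CPUs (PCIe host bridge vs switch, NUMA affinity)
lscpu                # socket / core / NUMA layout
ibstat               # InfiniBand adapter and link rate, if the nodes use InfiniBand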

GROMACS: gmx mdrun, version 2018
Executable: /opt/packages/gromacs-GPU-2018/bin/gmx_mpi
Data prefix: /opt/packages/gromacs-GPU-2018
Working dir: /pylon5/bio200035p/amnah/6PIJnew
Command line:
gmx_mpi mdrun -deffnm last -resethway -noconfout -nsteps 30000 -pin on -nb gpu -pme gpu -npme 1 -v -resetstep 10000 -s npttestnew.tpr -ntomp 5

GROMACS version: 2018
Precision: single
Memory model: 64 bit
MPI library: MPI
OpenMP support: enabled (GMX_OPENMP_MAX_THREADS = 256)
GPU support: CUDA
SIMD instructions: AVX_256
FFT library: fftw-3.3.3-sse2
RDTSCP usage: enabled
TNG support: enabled
Hwloc support: disabled
Tracing support: disabled
Built on: 2018-02-20 21:19:31
Built by: mmadrid@gpu048.pvt.bridges.psc.edu [CMAKE]
Build OS/arch: Linux 3.10.0-693.11.6.el7.x86_64 x86_64
Build CPU vendor: Intel
Build CPU brand: Intel® Xeon® CPU E5-2683 v4 @ 2.10GHz
Build CPU family: 6 Model: 79 Stepping: 1
Build CPU features: aes apic avx avx2 clfsh cmov cx8 cx16 f16c fma hle htt intel lahf mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd rdtscp rtm sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
C compiler: /usr/lib64/ccache/cc GNU 4.8.5
C compiler flags: -mavx -O3 -DNDEBUG -funroll-all-loops -fexcess-precision=fast
C++ compiler: /usr/lib64/ccache/c++ GNU 4.8.5
C++ compiler flags: -mavx -std=c++11 -O3 -DNDEBUG -funroll-all-loops -fexcess-precision=fast
CUDA compiler: /opt/packages/cuda/9.0RC/bin/nvcc nvcc: NVIDIA ® Cuda compiler driver;Copyright © 2005-2017 NVIDIA Corporation;Built on Mon_Jun_26_16:13:28_CDT_2017;Cuda compilation tools, release 9.0, V9.0.102
CUDA compiler flags:-gencode;arch=compute_30,code=sm_30;-gencode;arch=compute_35,code=sm_35;-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_52,code=sm_52;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_70,code=compute_70;-use_fast_math;;; ;-mavx;-std=c++11;-O3;-DNDEBUG;-funroll-all-loops;-fexcess-precision=fast;
CUDA driver: 10.20
CUDA runtime: 9.0

Running on 3 nodes with total 96 cores, 96 logical cores, 6 compatible GPUs
Cores per node: 32
Logical cores per node: 32
Compatible GPUs per node: 2
All nodes have identical type(s) of GPUs
Hardware detected on host gpu029.pvt.bridges.psc.edu (the node of MPI rank 0):
CPU info:
Vendor: Intel
Brand: Intel® Xeon® CPU E5-2683 v4 @ 2.10GHz
Family: 6 Model: 79 Stepping: 1
Features: aes apic avx avx2 clfsh cmov cx8 cx16 f16c fma hle htt intel lahf mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd rdtscp rtm sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
Hardware topology: Basic
Sockets, cores, and logical processors:
Socket 0: [ 0] [ 16] [ 1] [ 17] [ 2] [ 18] [ 3] [ 19] [ 4] [ 20] [ 5] [ 21] [ 6] [ 22] [ 7] [ 23]
Socket 1: [ 8] [ 24] [ 9] [ 25] [ 10] [ 26] [ 11] [ 27] [ 12] [ 28] [ 13] [ 29] [ 14] [ 30] [ 15] [ 31]
GPU info:
Number of GPUs detected: 2
#0: NVIDIA Tesla P100-PCIE-16GB, compute cap.: 6.0, ECC: yes, stat: compatible
#1: NVIDIA Tesla P100-PCIE-16GB, compute cap.: 6.0, ECC: yes, stat: compatible

Highest SIMD level requested by all nodes in run: AVX2_256
SIMD instructions selected at compile time: AVX_256
This program was compiled for different hardware than you are running on,
which could influence performance.

Input Parameters:
integrator = md
tinit = 0
dt = 0.002
nsteps = 500000
init-step = 0
simulation-part = 1
comm-mode = Linear
nstcomm = 100
bd-fric = 0
ld-seed = -723976216
emtol = 10
emstep = 0.01
niter = 20
fcstep = 0
nstcgsteep = 1000
nbfgscorr = 10
rtpi = 0.05
nstxout = 5000
nstvout = 5000
nstfout = 5000
nstlog = 5000
nstcalcenergy = 100
nstenergy = 5000
nstxout-compressed = 5000
compressed-x-precision = 5000
cutoff-scheme = Verlet
nstlist = 10
ns-type = Grid
pbc = xyz
periodic-molecules = false
verlet-buffer-tolerance = 0.005
rlist = 1.2
coulombtype = PME
coulomb-modifier = Potential-shift
rcoulomb-switch = 0
rcoulomb = 1.2
epsilon-r = 1
epsilon-rf = inf
vdw-type = Cut-off
vdw-modifier = Potential-shift
rvdw-switch = 0.8
rvdw = 1.2
DispCorr = EnerPres
table-extension = 1
fourierspacing = 0.1
fourier-nx = 160
fourier-ny = 192
fourier-nz = 208
pme-order = 4
ewald-rtol = 1e-05
ewald-rtol-lj = 0.001
lj-pme-comb-rule = Geometric
ewald-geometry = 0
epsilon-surface = 0
implicit-solvent = No
gb-algorithm = Still
nstgbradii = 1
rgbradii = 1
gb-epsilon-solvent = 80
gb-saltconc = 0
gb-obc-alpha = 1
gb-obc-beta = 0.8
gb-obc-gamma = 4.85
gb-dielectric-offset = 0.009
sa-algorithm = Ace-approximation
sa-surface-tension = 2.05016
tcoupl = Nose-Hoover
nsttcouple = 10
nh-chain-length = 1
print-nose-hoover-chain-variables = false
pcoupl = Parrinello-Rahman
pcoupltype = Isotropic
nstpcouple = 10
tau-p = 2.5
compressibility (3x3):
compressibility[ 0]={ 4.50000e-05, 0.00000e+00, 0.00000e+00}
compressibility[ 1]={ 0.00000e+00, 4.50000e-05, 0.00000e+00}
compressibility[ 2]={ 0.00000e+00, 0.00000e+00, 4.50000e-05}
ref-p (3x3):
ref-p[ 0]={ 1.00000e+00, 0.00000e+00, 0.00000e+00}
ref-p[ 1]={ 0.00000e+00, 1.00000e+00, 0.00000e+00}
ref-p[ 2]={ 0.00000e+00, 0.00000e+00, 1.00000e+00}
refcoord-scaling = No
posres-com (3):
posres-com[0]= 0.00000e+00
posres-com[1]= 0.00000e+00
posres-com[2]= 0.00000e+00
posres-comB (3):
posres-comB[0]= 0.00000e+00
posres-comB[1]= 0.00000e+00
posres-comB[2]= 0.00000e+00
QMMM = false
QMconstraints = 0
QMMMscheme = 0
MMChargeScaleFactor = 1
qm-opts:
ngQM = 0
constraint-algorithm = Lincs
continuation = true
Shake-SOR = false
shake-tol = 0.0001
lincs-order = 4
lincs-iter = 1
lincs-warnangle = 30
nwall = 0
wall-type = 9-3
wall-r-linpot = -1
wall-atomtype[0] = -1
wall-atomtype[1] = -1
wall-density[0] = 0
wall-density[1] = 0
wall-ewald-zfac = 3
pull = false
awh = false
rotation = false
interactiveMD = false
disre = No
disre-weighting = Conservative
disre-mixed = false
dr-fc = 1000
dr-tau = 0
nstdisreout = 100
orire-fc = 0
orire-tau = 0
nstorireout = 100
free-energy = no
cos-acceleration = 0
deform (3x3):
deform[ 0]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
deform[ 1]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
deform[ 2]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
simulated-tempering = false
swapcoords = no
userint1 = 0
userint2 = 0
userint3 = 0
userint4 = 0
userreal1 = 0
userreal2 = 0
userreal3 = 0
userreal4 = 0
applied-forces:
electric-field:
x:
E0 = 0
omega = 0
t0 = 0
sigma = 0
y:
E0 = 0
omega = 0
t0 = 0
sigma = 0
z:
E0 = 0
omega = 0
t0 = 0
sigma = 0
grpopts:
nrdf: 147287 3010.99 5098.98 959295
ref-t: 300 300 300 300
tau-t: 1 1 1 1
annealing: No No No No
annealing-npoints: 0 0 0 0
acc: 0 0 0
nfreeze: N N N
energygrp-flags[ 0]: 0

The -nsteps functionality is deprecated, and may be removed in a future version. Consider using gmx convert-tpr -nsteps or changing the appropriate .mdp file field.

Overriding nsteps with value passed on the command line: 30000 steps, 60 ps
Changing nstlist from 10 to 100, rlist from 1.2 to 1.337

Initializing Domain Decomposition on 12 ranks
Dynamic load balancing: locked

Initial maximum inter charge-group distances:
two-body bonded interactions: 0.446 nm, LJ-14, atoms 36965 36972
multi-body bonded interactions: 0.446 nm, Proper Dih., atoms 36965 36972
Minimum cell size due to bonded interactions: 0.490 nm
Maximum distance for 5 constraints, at 120 deg. angles, all-trans: 0.218 nm
Estimated maximum distance required for P-LINCS: 0.218 nm
Using 1 separate PME ranks
Scaling the initial minimum size with 1/0.8 (option -dds) = 1.25
Optimizing the DD grid for 11 cells with a minimum initial size of 0.613 nm
The maximum allowed number of cells is: X 24 Y 29 Z 33
Domain decomposition grid 1 x 1 x 11, separate PME ranks 1
PME domain decomposition: 1 x 1 x 1
Interleaving PP and PME ranks
This rank does only particle-particle work.

Domain decomposition rank 0, coordinates 0 0 0

The initial number of communication pulses is: Z 1
The initial domain decomposition cell size is: Z 1.88 nm

The maximum allowed distance for charge groups involved in interactions is:
non-bonded interactions 1.337 nm
(the following are initial values, they could change due to box deformation)
two-body bonded interactions (-rdd) 1.337 nm
multi-body bonded interactions (-rdd) 1.337 nm
atoms separated by up to 5 constraints (-rcon) 1.881 nm

When dynamic load balancing gets turned on, these settings will change to:
The maximum number of communication pulses is: Z 1
The minimum size for domain decomposition cells is 1.337 nm
The requested allowed shrink of DD cells (option -dds) is: 0.80
The allowed shrink of domain decomposition cells is: Z 0.71
The maximum allowed distance for charge groups involved in interactions is:
non-bonded interactions 1.337 nm
two-body bonded interactions (-rdd) 1.337 nm
multi-body bonded interactions (-rdd) 1.337 nm
atoms separated by up to 5 constraints (-rcon) 1.337 nm

Using two step summing over 3 groups of on average 3.7 ranks

Using 12 MPI processes
Using 5 OpenMP threads per MPI process

On host gpu029.pvt.bridges.psc.edu 2 GPUs auto-selected for this run.
Mapping of GPU IDs to the 4 GPU tasks in the 4 ranks on this node:
PP:0,PP:0,PP:1,PP:1

NOTE: GROMACS was configured without NVML support hence it can not exploit
application clocks of the detected Tesla P100-PCIE-16GB GPU to improve performance.
Recompile with the NVML library (compatible with the driver used) or set application clocks manually.

Overriding thread affinity set outside gmx mdrun

Pinning threads with an auto-selected logical core stride of 1

The -resetstep functionality is deprecated, and may be removed in a future version.
System total charge: 0.000
Will do PME sum in reciprocal space for electrostatic interactions.

++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
U. Essmann, L. Perera, M. L. Berkowitz, T. Darden, H. Lee and L. G. Pedersen
A smooth particle mesh Ewald method
J. Chem. Phys. 103 (1995) pp. 8577-8592
-------- -------- --- Thank You --- -------- --------

Using a Gaussian width (1/beta) of 0.384195 nm for Ewald
Potential shift: LJ r^-12: -1.122e-01 r^-6: -3.349e-01, Ewald -8.333e-06
Initialized non-bonded Ewald correction tables, spacing: 1.02e-03 size: 1176

Long Range LJ corr.: 3.3098e-04
Generated table with 1168 data points for Ewald.
Tabscale = 500 points/nm
Generated table with 1168 data points for LJ6.
Tabscale = 500 points/nm
Generated table with 1168 data points for LJ12.
Tabscale = 500 points/nm
Generated table with 1168 data points for 1-4 COUL.
Tabscale = 500 points/nm
Generated table with 1168 data points for 1-4 LJ6.
Tabscale = 500 points/nm
Generated table with 1168 data points for 1-4 LJ12.
Tabscale = 500 points/nm

Using GPU 8x8 nonbonded short-range kernels

Using a dual 8x4 pair-list setup updated with dynamic, rolling pruning:
outer list: updated every 100 steps, buffer 0.137 nm, rlist 1.337 nm
inner list: updated every 12 steps, buffer 0.002 nm, rlist 1.202 nm
At tolerance 0.005 kJ/mol/ps per atom, equivalent classical 1x1 list would be:
outer list: updated every 100 steps, buffer 0.290 nm, rlist 1.490 nm
inner list: updated every 12 steps, buffer 0.051 nm, rlist 1.251 nm

Using Lorentz-Berthelot Lennard-Jones combination rule

Initializing Parallel LINear Constraint Solver

++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
B. Hess
P-LINCS: A Parallel Linear Constraint Solver for molecular simulation
J. Chem. Theory Comput. 4 (2008) pp. 116-122
-------- -------- --- Thank You --- -------- --------

The number of constraints is 30135
There are inter charge-group constraints,
will communicate selected coordinates each lincs iteration

Linking all bonded interactions to atoms


The -noconfout functionality is deprecated, and may be removed in a future version.

The -resethway functionality is deprecated, and may be removed in a future version.

Intra-simulation communication will occur every 10 steps.
Center of mass motion removal mode is Linear
We have the following groups for center of mass motion removal:
  0:  Protein
  1:  non-Protein
There are: 541456 Atoms
Atom distribution over 11 domains: av 49223 stddev 498 min 48417 max 49944

NOTE: DLB will not turn on during the first phase of PME tuning

P P   -   P M E   L O A D   B A L A N C I N G

 PP/PME load balancing changed the cut-off and PME settings:
           particle-particle                    PME
            rcoulomb  rlist            grid      spacing   1/beta
   initial  1.200 nm  1.202 nm     160 192 208   0.099 nm  0.384 nm
   final    1.491 nm  1.493 nm     120 144 168   0.124 nm  0.477 nm
 cost-ratio           1.92             0.45
 (note that these numbers concern only part of the total PP and PME load)


	M E G A - F L O P S   A C C O U N T I N G

 NB=Group-cutoff nonbonded kernels    NxN=N-by-N cluster Verlet kernels
 RF=Reaction-Field  VdW=Van der Waals  QSTab=quadratic-spline table
 W3=SPC/TIP3p  W4=TIP4p (single or pairs)
 V&F=Potential and force  V=Potential only  F=Force only

 Computing:                               M-Number         M-Flops  % Flops
-----------------------------------------------------------------------------
 Pair Search distance check           16003.474720      144031.272     0.0
 NxN Ewald Elec. + LJ [F]          16840800.225216  1111492814.864    98.2
 NxN Ewald Elec. + LJ [V&F]          171246.806848    18323408.333     1.6
 1,4 nonbonded interactions            2438.517557      219466.580     0.0
 Reset In Box                            81.218400         243.655     0.0
 CG-CoM                                  81.759856         245.280     0.0
 Bonds                                  490.577703       28944.084     0.0
 Angles                                1698.413220      285333.421     0.0
 Propers                               3025.416681      692820.420     0.1
 Impropers                              191.247749       39779.532     0.0
 Virial                                 813.468451       14642.432     0.0
 Stop-CM                                 81.759856         817.599     0.0
 Calc-Ekin                             1624.909456       43872.555     0.0
 Lincs                                  473.329419       28399.765     0.0
 Lincs-Mat                             2397.363312        9589.453     0.0
 Constraint-V                          8469.672777       67757.382     0.0
 Constraint-Vir                         800.113998       19202.736     0.0
 Settle                                2507.671313      809977.834     0.1
-----------------------------------------------------------------------------
 Total                                              1132221347.198   100.0
-----------------------------------------------------------------------------


    D O M A I N   D E C O M P O S I T I O N   S T A T I S T I C S

 av. #atoms communicated per step for force:  2 x 472305.4
 av. #atoms communicated per step for LINCS:  2 x 24173.4


 Dynamic load balancing report:
 DLB was off during the run due to low measured imbalance.
 Average load imbalance: 10.1%.
 The balanceable part of the MD step is 68%, load imbalance is computed from this.
 Part of the total run time spent waiting due to load imbalance: 6.9%.
 Average PME mesh/force load: 0.951
 Part of the total run time spent waiting due to PP/PME imbalance: 0.2 %

NOTE: 6.9 % of the available CPU time was lost due to load imbalance
      in the domain decomposition.
      You might want to use dynamic load balancing (option -dlb.)


     R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G

On 11 MPI ranks doing PP, each using 5 OpenMP threads, and
on 1 MPI rank doing PME, using 5 OpenMP threads

 Computing:          Num   Num      Call    Wall time         Giga-Cycles
                     Ranks Threads  Count      (s)         total sum    %
-----------------------------------------------------------------------------
 Domain decomp.        11    5        151       2.420        278.808   2.5
 DD comm. load         11    5        151       0.017          1.940   0.0
 DD comm. bounds       11    5        110       0.021          2.395   0.0
 Send X to PME         11    5      15001       6.765        779.498   7.0
 Neighbor search       11    5        151       1.347        155.196   1.4
 Launch GPU ops.       11    5      30002       1.385        159.556   1.4
 Comm. coord.          11    5      14850       8.114        934.966   8.4
 Force                 11    5      15001       6.376        734.664   6.6
 Wait + Comm. F        11    5      15001      12.094       1393.551  12.5
 PME mesh *             1    5      15001      49.227        515.671   4.6
 PME wait for PP *                             39.161        410.219   3.7
 Wait + Recv. PME F    11    5      15001       9.825       1132.065  10.2
 Wait PME GPU gather   11    5      15001       4.320        497.754   4.5
 Wait GPU NB nonloc.   11    5      15001      25.697       2961.005  26.7
 Wait GPU NB local     11    5      15001       0.282         32.516   0.3
 NB X/F buffer ops.    11    5      59702       2.621        302.016   2.7
 Write traj.           11    5          4       0.230         26.492   0.2
 Update                11    5      15001       1.139        131.254   1.2
 Constraints           11    5      15001       7.946        915.586   8.2
 Comm. energies        11    5       1501       1.551        178.701   1.6
-----------------------------------------------------------------------------
 Total                                         88.387      11110.542 100.0
-----------------------------------------------------------------------------
(*) Note that with separate PME ranks, the walltime column actually sums to
    twice the total reported, but the cycle count total and % are correct.
-----------------------------------------------------------------------------
 Breakdown of PP computation
-----------------------------------------------------------------------------
 DD redist.            11    5        150       0.168         19.318   0.2
 DD NS grid + sort     11    5        151       0.358         41.214   0.4
 DD setup comm.        11    5        151       0.431         49.645   0.4
 DD make top.          11    5        151       0.367         42.246   0.4
 DD make constr.       11    5        151       0.419         48.283   0.4
 DD top. other         11    5        151       0.496         57.138   0.5
 NS grid non-loc.      11    5        151       0.130         15.028   0.1
 NS search local       11    5        151       0.535         61.642   0.6
 NS search non-loc.    11    5        151       0.594         68.487   0.6
 Bonded F              11    5      15001       5.263        606.444   5.5
 Listed buffer ops.    11    5      15001       0.298         34.381   0.3
 Launch NB GPU tasks   11    5      30002       1.226        141.286   1.3
 Launch PME GPU task   11    5      15001       0.031          3.543   0.0
 NB X buffer ops.      11    5      29700       0.976        112.428   1.0
 NB F buffer ops.      11    5      30002       1.640        188.980   1.7
-----------------------------------------------------------------------------

               Core t (s)   Wall t (s)        (%)
       Time:     5303.226       88.387     6000.0
                 (ns/day)    (hour/ns)
Performance:       29.327        0.818

Thank you very much for all your help. I really appreciate it.

Thanks again, good to know! Especially in times of quick COVID projects, there are certainly cases where peak performance matters more, even at the cost of poor cost efficiency.

I suggest starting with runs on fewer resources, as you may be beyond the point where the code scales efficiently; e.g. show the performance on 1, 2, 4 nodes, etc. With 11 PP ranks your single PME rank is likely the limiting factor, and you may get the best performance at a lower node count.
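
Something along these lines is what I mean, reusing your .tpr and only varying the node count (a sketch; on SLURM you would of course submit one job per node count rather than loop):

# benchmark the same system on 1, 2 and 4 nodes; 4 ranks per node, one of them the single PME rank
for nodes in 1 2 4; do
    mpirun -np $(( nodes * 4 )) gmx_mpi mdrun -s npttestnew.tpr -deffnm bench_${nodes}node \
           -nb gpu -pme gpu -npme 1 -ntomp 6 -pin on \
           -resethway -nsteps 30000 -noconfout
done
grep Performance bench_*node.log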

I am not sure whether the CPUs can keep up, but you could also try offloading only the nonbonded + bonded interactions to the GPUs and running PME on the CPU.
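
The flags I mean look roughly like this (GROMACS 2019 or newer, since -bonded does not exist in 2018; the rank counts below are placeholders that would need benchmarking, as PME on the CPU can then be decomposed over several PME ranks):

# sketch: short-range nonbonded + bonded work on the GPUs, PME on CPU ranks
mpirun -np 16 gmx_mpi mdrun -s npttestnew.tpr -deffnm pme_on_cpu_test \
       -nb gpu -bonded gpu -pme cpu -npme 4 -ntomp 4 -pin on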