Performance loss

GROMACS version: 2020.4 (modified build)
GROMACS modification: Yes

Hi
I am running on 2 nodes, each with 8 GPUs (V100), to simulate a system of ~540,000 atoms. The performance is 64 ns/day, but the log reports a significant performance loss (38%). How can I improve this? I don’t think 64 ns/day is good enough. The log is below.

               :-) GROMACS - gmx mdrun, 2020.4-MODIFIED (-:

GROMACS: gmx mdrun, version 2020.4-MODIFIED
Executable: /opt/packages/gromacs/2020.4/GPU/bin/gmx_mpi
Data prefix: /opt/packages/gromacs/2020.4/GPU
Working dir: /ocean/projects/bio200035p/amnah/6PIJ2new/6PIJ/PhD work
Process ID: 277868
Command line:
gmx_mpi mdrun -v -deffnm nvtmod3 -s nvtmod3.tpr -nsteps 90000 -nb gpu -pme gpu -npme 1 -pin on -nstlist 400 -resetstep 80000 -bonded gpu

GROMACS version: 2020.4-MODIFIED
This program has been built from source code that has been altered and does not match the code released as part of the official GROMACS version 2020.4-MODIFIED. If you did not intend to use an altered GROMACS version, make sure to download an intact source distribution and compile that before proceeding.
If you have modified the source code, you are strongly encouraged to set your custom version suffix (using -DGMX_VERSION_STRING_OF_FORK) which will can help later with scientific reproducibility but also when reporting bugs.
Release checksum: 79c2857291b034542c26e90512b92fd4b184a1c9d6fa59c55f2e24ccf14e7281
Computed checksum: 3b556199b4d94c25b21238aeeb6735ee9b3bcbdd25747066d25adb4d1a06e3f4
Precision: single
Memory model: 64 bit
MPI library: MPI
OpenMP support: enabled (GMX_OPENMP_MAX_THREADS = 64)
GPU support: CUDA
SIMD instructions: AVX_512
FFT library: fftw-3.3.5-sse2-avx
RDTSCP usage: enabled
TNG support: enabled
Hwloc support: hwloc-2.2.0
Tracing support: disabled
C compiler: /ocean/projects/pscstaff/mmadrid/spack/lib/spack/env/gcc/gcc GNU 10.2.0
C compiler flags: -mavx512f -mfma -pthread -fexcess-precision=fast -funroll-all-loops -O2 -g -DNDEBUG
C++ compiler: /ocean/projects/pscstaff/mmadrid/spack/lib/spack/env/gcc/g++ GNU 10.2.0
C++ compiler flags: -mavx512f -mfma -pthread -fexcess-precision=fast -funroll-all-loops -fopenmp -O2 -g -DNDEBUG
CUDA compiler: /ocean/projects/pscstaff/mmadrid/spack/opt/spack/linux-centos8-cascadelake/gcc-10.2.0/cuda-11.2.0-gsjevs3apaiprlyhci6xaex2v5jfyum4/bin/nvcc nvcc: NVIDIA ® Cuda compiler driver;Copyright © 2005-2020 NVIDIA Corporation;Built on Mon_Nov_30_19:08:53_PST_2020;Cuda compilation tools, release 11.2, V11.2.67;Build cuda_11.2.r11.2/compiler.29373293_0
CUDA compiler flags:-gencode;arch=compute_35,code=sm_35;-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_52,code=sm_52;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-Wno-deprecated-gpu-targets;-gencode;arch=compute_35,code=compute_35;-gencode;arch=compute_50,code=compute_50;-gencode;arch=compute_52,code=compute_52;-gencode;arch=compute_60,code=compute_60;-gencode;arch=compute_61,code=compute_61;-gencode;arch=compute_70,code=compute_70;-gencode;arch=compute_75,code=compute_75;-gencode;arch=compute_80,code=compute_80;-use_fast_math;-D_FORCE_INLINES;-mavx512f -mfma -pthread -fexcess-precision=fast -funroll-all-loops -fopenmp -O2 -g -DNDEBUG
CUDA driver: 11.20
CUDA runtime: 11.20

Running on 2 nodes with total 80 cores, 80 logical cores, 16 compatible GPUs
Cores per node: 40
Logical cores per node: 40
Compatible GPUs per node: 8
All nodes have identical type(s) of GPUs
Hardware detected on host v009.ib.bridges2.psc.edu (the node of MPI rank 0):
CPU info:
Vendor: Intel
Brand: Intel® Xeon® Gold 6248 CPU @ 2.50GHz
Family: 6 Model: 85 Stepping: 7
Features: aes apic avx avx2 avx512f avx512cd avx512bw avx512vl clfsh cmov cx8 cx16 f16c fma hle htt intel lahf mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd rdtscp rtm sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
Number of AVX-512 FMA units: 2
Hardware topology: Full, with devices
Sockets, cores, and logical processors:
Socket 0: [ 0] [ 1] [ 2] [ 3] [ 4] [ 5] [ 6] [ 7] [ 8] [ 9] [ 10] [ 11] [ 12] [ 13] [ 14] [ 15] [ 16] [ 17] [ 18] [ 19]
Socket 1: [ 20] [ 21] [ 22] [ 23] [ 24] [ 25] [ 26] [ 27] [ 28] [ 29] [ 30] [ 31] [ 32] [ 33] [ 34] [ 35] [ 36] [ 37] [ 38] [ 39]
Numa nodes:
Node 0 (270115745792 bytes mem): 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
Node 1 (270577811456 bytes mem): 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39
Latency:
0 1
0 1.00 2.10
1 2.10 1.00
Caches:
L1: 32768 bytes, linesize 64 bytes, assoc. 8, shared 1 ways
L2: 1048576 bytes, linesize 64 bytes, assoc. 16, shared 1 ways
L3: 28835840 bytes, linesize 64 bytes, assoc. 11, shared 20 ways
PCI devices:
0000:02:00.0 Id: 14e4:1657 Class: 0x0200 Numa: 0
0000:02:00.1 Id: 14e4:1657 Class: 0x0200 Numa: 0
0000:02:00.2 Id: 14e4:1657 Class: 0x0200 Numa: 0
0000:02:00.3 Id: 14e4:1657 Class: 0x0200 Numa: 0
0000:01:00.1 Id: 102b:0538 Class: 0x0300 Numa: 0
0000:14:00.0 Id: 15b3:101b Class: 0x0207 Numa: 0
0000:15:00.0 Id: 10de:1db5 Class: 0x0302 Numa: 0
0000:16:00.0 Id: 10de:1db5 Class: 0x0302 Numa: 0
0000:39:00.0 Id: 15b3:101b Class: 0x0207 Numa: 0
0000:3a:00.0 Id: 10de:1db5 Class: 0x0302 Numa: 0
0000:3b:00.0 Id: 10de:1db5 Class: 0x0302 Numa: 0
0000:88:00.0 Id: 15b3:101b Class: 0x0207 Numa: 1
0000:89:00.0 Id: 10de:1db5 Class: 0x0302 Numa: 1
0000:8a:00.0 Id: 10de:1db5 Class: 0x0302 Numa: 1
0000:b1:00.0 Id: 15b3:101b Class: 0x0207 Numa: 1
0000:b2:00.0 Id: 10de:1db5 Class: 0x0302 Numa: 1
0000:b3:00.0 Id: 10de:1db5 Class: 0x0302 Numa: 1
0000:d8:00.0 Id: 144d:a822 Class: 0x0108 Numa: 1
0000:d9:00.0 Id: 144d:a822 Class: 0x0108 Numa: 1
0000:da:00.0 Id: 144d:a822 Class: 0x0108 Numa: 1
0000:db:00.0 Id: 144d:a822 Class: 0x0108 Numa: 1
GPU info:
Number of GPUs detected: 8
#0: NVIDIA Tesla V100-SXM2-32GB, compute cap.: 7.0, ECC: yes, stat: compatible
#1: NVIDIA Tesla V100-SXM2-32GB, compute cap.: 7.0, ECC: yes, stat: compatible
#2: NVIDIA Tesla V100-SXM2-32GB, compute cap.: 7.0, ECC: yes, stat: compatible
#3: NVIDIA Tesla V100-SXM2-32GB, compute cap.: 7.0, ECC: yes, stat: compatible
#4: NVIDIA Tesla V100-SXM2-32GB, compute cap.: 7.0, ECC: yes, stat: compatible
#5: NVIDIA Tesla V100-SXM2-32GB, compute cap.: 7.0, ECC: yes, stat: compatible
#6: NVIDIA Tesla V100-SXM2-32GB, compute cap.: 7.0, ECC: yes, stat: compatible
#7: NVIDIA Tesla V100-SXM2-32GB, compute cap.: 7.0, ECC: yes, stat: compatible

Input Parameters:
integrator = md
tinit = 0
dt = 0.002
nsteps = 10000
init-step = 0
simulation-part = 1
comm-mode = Linear
nstcomm = 100
bd-fric = 0
ld-seed = -1767156976
emtol = 10
emstep = 0.01
niter = 20
fcstep = 0
nstcgsteep = 1000
nbfgscorr = 10
rtpi = 0.05
nstxout = 0
nstvout = 0
nstfout = 0
nstlog = 10000
nstcalcenergy = 100
nstenergy = 10000
nstxout-compressed = 1000
compressed-x-precision = 1000
cutoff-scheme = Verlet
nstlist = 10
pbc = xyz
periodic-molecules = false
verlet-buffer-tolerance = 0.005
rlist = 1.2
coulombtype = PME
coulomb-modifier = Potential-shift
rcoulomb-switch = 0
rcoulomb = 1.2
epsilon-r = 1
epsilon-rf = inf
vdw-type = Cut-off
vdw-modifier = Potential-shift
rvdw-switch = 1
rvdw = 1.2
DispCorr = EnerPres
table-extension = 1
fourierspacing = 0.16
fourier-nx = 96
fourier-ny = 112
fourier-nz = 144
pme-order = 4
ewald-rtol = 1e-05
ewald-rtol-lj = 0.001
lj-pme-comb-rule = Geometric
ewald-geometry = 0
epsilon-surface = 0
tcoupl = V-rescale
nsttcouple = 10
nh-chain-length = 0
print-nose-hoover-chain-variables = false
pcoupl = No
pcoupltype = Isotropic
nstpcouple = -1
tau-p = 1
compressibility (3x3):
compressibility[ 0]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
compressibility[ 1]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
compressibility[ 2]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
ref-p (3x3):
ref-p[ 0]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
ref-p[ 1]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
ref-p[ 2]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
refcoord-scaling = No
posres-com (3):
posres-com[0]= 0.00000e+00
posres-com[1]= 0.00000e+00
posres-com[2]= 0.00000e+00
posres-comB (3):
posres-comB[0]= 0.00000e+00
posres-comB[1]= 0.00000e+00
posres-comB[2]= 0.00000e+00
QMMM = false
QMconstraints = 0
QMMMscheme = 0
MMChargeScaleFactor = 1
qm-opts:
ngQM = 0
constraint-algorithm = Lincs
continuation = false
Shake-SOR = false
shake-tol = 0.0001
lincs-order = 4
lincs-iter = 1
lincs-warnangle = 30
nwall = 0
wall-type = 9-3
wall-r-linpot = -1
wall-atomtype[0] = -1
wall-atomtype[1] = -1
wall-density[0] = 0
wall-density[1] = 0
wall-ewald-zfac = 3
pull = false
awh = false
rotation = false
interactiveMD = false
disre = No
disre-weighting = Conservative
disre-mixed = false
dr-fc = 1000
dr-tau = 0
nstdisreout = 100
orire-fc = 0
orire-tau = 0
nstorireout = 100
free-energy = no
cos-acceleration = 0
deform (3x3):
deform[ 0]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
deform[ 1]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
deform[ 2]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
simulated-tempering = false
swapcoords = no
userint1 = 0
userint2 = 0
userint3 = 0
userint4 = 0
userreal1 = 0
userreal2 = 0
userreal3 = 0
userreal4 = 0
applied-forces:
electric-field:
x:
E0 = 0
omega = 0
t0 = 0
sigma = 0
y:
E0 = 0
omega = 0
t0 = 0
sigma = 0
z:
E0 = 0
omega = 0
t0 = 0
sigma = 0
density-guided-simulation:
active = false
group = protein
similarity-measure = inner-product
atom-spreading-weight = unity
force-constant = 1e+09
gaussian-transform-spreading-width = 0.2
gaussian-transform-spreading-range-in-multiples-of-width = 4
reference-density-filename = reference.mrc
nst = 1
normalize-densities = true
adaptive-force-scaling = false
adaptive-force-scaling-time-constant = 4
grpopts:
nrdf: 147290 967405
ref-t: 0 0
tau-t: 0.1 0.1
annealing: No No
annealing-npoints: 0 0
acc: 0 0 0
nfreeze: N N N
energygrp-flags[ 0]: 0

The -nsteps functionality is deprecated, and may be removed in a future version. Consider using gmx convert-tpr -nsteps or changing the appropriate .mdp file field.

Overriding nsteps with value passed on the command line: 90000 steps, 180 ps
Changing nstlist from 10 to 400, rlist from 1.2 to 1.2

Initializing Domain Decomposition on 16 ranks
Dynamic load balancing: auto
Using update groups, nr 191631, average size 2.8 atoms, max. radius 0.114 nm
Minimum cell size due to atom displacement: 0.000 nm
Initial maximum distances in bonded interactions:
two-body bonded interactions: 0.426 nm, LJ-14, atoms 13085 13093
multi-body bonded interactions: 0.426 nm, Proper Dih., atoms 13085 13093
Minimum cell size due to bonded interactions: 0.469 nm
Scaling the initial minimum size with 1/0.8 (option -dds) = 1.25
Using 1 separate PME ranks
Optimizing the DD grid for 15 cells with a minimum initial size of 0.586 nm
The maximum allowed number of cells is: X 25 Y 30 Z 35
Domain decomposition grid 1 x 3 x 5, separate PME ranks 1
PME domain decomposition: 1 x 1 x 1
Interleaving PP and PME ranks
This rank does only particle-particle work.
Domain decomposition rank 0, coordinates 0 0 0

The initial number of communication pulses is: Y 1 Z 1
The initial domain decomposition cell size is: Y 5.95 nm Z 4.14 nm

The maximum allowed distance for atom groups involved in interactions is:
non-bonded interactions 1.427 nm
two-body bonded interactions (-rdd) 1.427 nm
multi-body bonded interactions (-rdd) 1.427 nm

When dynamic load balancing gets turned on, these settings will change to:
The maximum number of communication pulses is: Y 1 Z 1
The minimum size for domain decomposition cells is 1.427 nm
The requested allowed shrink of DD cells (option -dds) is: 0.80
The allowed shrink of domain decomposition cells is: Y 0.24 Z 0.34
The maximum allowed distance for atom groups involved in interactions is:
non-bonded interactions 1.427 nm
two-body bonded interactions (-rdd) 1.427 nm
multi-body bonded interactions (-rdd) 1.427 nm

On host v009.ib.bridges2.psc.edu 8 GPUs selected for this run.
Mapping of GPU IDs to the 8 GPU tasks in the 8 ranks on this node:
PP:0,PP:1,PP:2,PP:3,PP:4,PP:5,PP:6,PP:7
PP tasks will do (non-perturbed) short-ranged and most bonded interactions on the GPU
PP task will update and constrain coordinates on the CPU
PME tasks will do all aspects on the GPU
Using two step summing over 2 groups of on average 7.5 ranks

Using 16 MPI processes
Using 5 OpenMP threads per MPI process

Overriding thread affinity set outside gmx mdrun

Pinning threads with an auto-selected logical core stride of 1

The -resetstep functionality is deprecated, and may be removed in a future version.
System total charge: 0.000
Will do PME sum in reciprocal space for electrostatic interactions.

++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
U. Essmann, L. Perera, M. L. Berkowitz, T. Darden, H. Lee and L. G. Pedersen
A smooth particle mesh Ewald method
J. Chem. Phys. 103 (1995) pp. 8577-8592
-------- -------- — Thank You — -------- --------

Using a Gaussian width (1/beta) of 0.384195 nm for Ewald
Potential shift: LJ r^-12: -1.122e-01 r^-6: -3.349e-01, Ewald -8.333e-06
Initialized non-bonded Ewald tables, spacing: 1.02e-03 size: 1176

Generated table with 1100 data points for 1-4 COUL.
Tabscale = 500 points/nm
Generated table with 1100 data points for 1-4 LJ6.
Tabscale = 500 points/nm
Generated table with 1100 data points for 1-4 LJ12.
Tabscale = 500 points/nm

Using GPU 8x8 nonbonded short-range kernels

Using a 8x8 pair-list setup:
updated every 400 steps, buffer 0.000 nm, rlist 1.200 nm
At tolerance 0.005 kJ/mol/ps per atom, equivalent classical 1x1 list would be:
updated every 400 steps, buffer 0.000 nm, rlist 1.200 nm

Using Lorentz-Berthelot Lennard-Jones combination rule

Long Range LJ corr.: 3.3092e-04

Removing pbc first time

Initializing LINear Constraint Solver

+++

Linking all bonded interactions to atoms

Intra-simulation communication will occur every 10 steps.

++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
G. Bussi, D. Donadio and M. Parrinello
Canonical sampling through velocity rescaling
J. Chem. Phys. 126 (2007) pp. 014101
-------- -------- — Thank You — -------- --------

There are: 541456 Atoms
Atom distribution over 15 domains: av 36097 stddev 477 min 35104 max 36650

NOTE: DLB will not turn on during the first phase of PME tuning

Constraining the starting coordinates (step 0)

Constraining the coordinates at t0-dt (step 0)
Center of mass motion removal mode is Linear
We have the following groups for center of mass motion removal:
0: rest
RMS relative constraint deviation after constraining: 1.28e-06
Initial temperature: 1.30305e-06 K

<======  ###############  ==>
<====  A V E R A G E S  ====>
<==  ###############  ======>

Statistics over 90001 steps using 901 frames

Energies (kJ/mol)
Bond Angle Proper Dih. Improper Dih. LJ-14
1.67230e+04 5.20580e+04 2.62656e+05 2.19609e+03 5.24358e+04
Coulomb-14 LJ (SR) Disper. corr. Coulomb (SR) Coul. recip.
5.71339e+05 1.27619e+06 -4.26316e+04 -1.08155e+07 9.84800e+03
Potential Kinetic En. Total Energy Conserved En. Temperature
-8.61468e+06 7.53843e+03 -8.60714e+06 -7.45070e+06 1.62675e+00
Pres. DC (bar) Pressure (bar) Constr. rmsd
-1.28506e+02 -4.20464e+03 0.00000e+00

Total Virial (kJ/mol)
6.97215e+05 -2.24173e+04 -2.28051e+03
-2.24340e+04 7.33034e+05 -1.16278e+04
-2.30271e+03 -1.15991e+04 6.71081e+05

Pressure (bar)
-4.18519e+03 1.33378e+02 1.42519e+01
1.33479e+02 -4.40088e+03 6.98783e+01
1.43857e+01 6.97054e+01 -4.02786e+03

  T-Protein  T-non-Protein
1.04590e+01    2.82019e-01


   P P   -   P M E   L O A D   B A L A N C I N G

PP/PME load balancing changed the cut-off and PME settings:
              particle-particle                  PME
               rcoulomb  rlist          grid     spacing   1/beta
   initial     1.200 nm  1.200 nm   96 112 144   0.159 nm  0.384 nm
   final       1.595 nm  1.595 nm   72  84 100   0.213 nm  0.511 nm
   cost-ratio            2.35            0.39
(note that these numbers concern only part of the total PP and PME load)

M E G A - F L O P S   A C C O U N T I N G

NB=Group-cutoff nonbonded kernels NxN=N-by-N cluster Verlet kernels
RF=Reaction-Field VdW=Van der Waals QSTab=quadratic-spline table
W3=SPC/TIP3p W4=TIP4p (single or pairs)
V&F=Potential and force V=Potential only F=Force only

Computing: M-Number M-Flops % Flops

Pair Search distance check 3214.041936 28926.377 0.0
NxN Ewald Elec. + LJ [F] 10268715.055104 677735193.637 98.3
NxN Ewald Elec. + LJ [V&F] 104761.387008 11209468.410 1.6
Reset In Box 14.077856 42.234 0.0
CG-CoM 14.077856 42.234 0.0
Virial 54.755231 985.594 0.0
Stop-CM 54.687056 546.871 0.0
Calc-Ekin 1083.453456 29253.243 0.0
Lincs 301.380135 18082.808 0.0
Lincs-Mat 1503.030288 6012.121 0.0
Constraint-V 5398.589805 43188.718 0.0
Constraint-Vir 51.476670 1235.440 0.0
Settle 1598.609845 516350.980 0.1

Total 689589328.667 100.0

D O M A I N   D E C O M P O S I T I O N   S T A T I S T I C S

av. #atoms communicated per step for force: 2 x 369798.8

Dynamic load balancing report:
DLB was off during the run due to low measured imbalance.
Average load imbalance: 4.0%.
The balanceable part of the MD step is 49%, load imbalance is computed from this.
Part of the total run time spent waiting due to load imbalance: 1.9%.
Average PME mesh/force load: 2.545
Part of the total run time spent waiting due to PP/PME imbalance: 38.8 %

NOTE: 38.8 % performance was lost because the PME ranks
had more work to do than the PP ranks.
You might want to increase the number of PME ranks
or increase the cut-off and the grid spacing.

 R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G

On 15 MPI ranks doing PP, each using 5 OpenMP threads, and
on 1 MPI rank doing PME, using 5 OpenMP threads

Computing: Num Num Call Wall time Giga-Cycles
Ranks Threads Count (s) total sum %

Domain decomp. 15 5 26 0.266 49.664 0.9
DD comm. load 15 5 26 0.000 0.067 0.0
Send X to PME 15 5 10001 2.562 479.257 9.0
Neighbor search 15 5 26 0.165 30.869 0.6
Launch GPU ops. 15 5 20002 0.538 100.627 1.9
Comm. coord. 15 5 9975 1.857 347.284 6.5
Force 15 5 10001 0.235 43.918 0.8
Wait + Comm. F 15 5 10001 2.031 379.836 7.1
PME mesh * 1 5 10001 17.336 216.192 4.1
PME wait for PP * 9.322 116.252 2.2
Wait + Recv. PME F 15 5 10001 10.989 2055.482 38.6
Wait PME GPU gather 15 5 10001 1.128 211.019 4.0
Wait Bonded GPU 15 5 101 0.000 0.030 0.0
Wait GPU NB nonloc. 15 5 10001 3.569 667.685 12.6
Wait GPU NB local 15 5 10001 0.030 5.594 0.1
NB X/F buffer ops. 15 5 39952 0.958 179.249 3.4
Write traj. 15 5 11 0.116 21.667 0.4
Update 15 5 10001 0.403 75.345 1.4
Constraints 15 5 10001 1.344 251.429 4.7
Comm. energies 15 5 1001 1.423 266.214 5.0

Total 26.659 5319.288 100.0

(*) Note that with separate PME ranks, the walltime column actually sums to
twice the total reported, but the cycle count total and % are correct.

NOTE: 5 % of the run time was spent communicating energies,
you might want to increase some nst* mdp options

           Core t (s)   Wall t (s)        (%)
   Time:     2132.604       26.659     7999.4
             (ns/day)    (hour/ns)

Performance: 64.824 0.370
Finished mdrun on rank 0 Fri Feb 12 05:28:11 2021

You’re spending almost 40% of the run time waiting for the PME rank to finish. Since GROMACS 2020 only allows a single separate PME GPU rank, you have a large PP-to-PME GPU imbalance (15:1 in this run), whereas the optimal arrangement is typically closer to 3:1, though it depends on the system.
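
As a rough sketch only (not tested on your system; the mpirun launcher, thread counts, and GPU IDs are assumptions about your cluster), you could restrict a run to 4 GPUs on one node so that 3 PP ranks feed 1 PME rank:

   # hypothetical single-node run: 4 MPI ranks on 4 GPUs (3 PP + 1 PME),
   # 10 OpenMP threads per rank to use all 40 cores of the node
   mpirun -np 4 gmx_mpi mdrun -deffnm nvtmod3 -s nvtmod3.tpr \
       -nb gpu -pme gpu -bonded gpu -npme 1 -ntomp 10 -pin on -gpu_id 0123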

Also, GROMACS 2020 generally doesn’t scale well across many GPUs for anything other than enormous systems. If possible, you should run multiple smaller simulations in parallel instead; for a ~500k-atom system you should almost certainly restrict each simulation to at most one node. I wrote a tool to auto-generate possible parallel breakdowns for single nodes; if you’re interested, it’s here
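
Purely as an illustration (the directory names and thread counts below are placeholders, not a tuned recommendation), GROMACS’s -multidir option can launch several independent simulations from a single mpirun, e.g. four runs on one node with 2 GPUs and 10 cores each:

   # hypothetical layout: 8 MPI ranks split over 4 directories = 2 ranks per
   # simulation (1 PP + 1 PME); each simulation gets 2 GPUs and 10 CPU cores
   mpirun -np 8 gmx_mpi mdrun -multidir sim1 sim2 sim3 sim4 \
       -deffnm nvtmod3 -nb gpu -pme gpu -bonded gpu -npme 1 -ntomp 5 -pin on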

Thank you very much, Kevin. I really appreciate your help! I will definitely try the tool!