Scaling of MD with domain decomposition on JUWELS Cluster

GROMACS version: 2024-dev-20240201-787d96c7a9-unknown
GROMACS modification: Yes
I’m conducting some performance tests on the JUWELS Cluster, trying to see what performance improvement domain decomposition (DD) can bring within a single node.

GROMACS build info
Here, GROMACS was built with GCC v11.4.0 and OpenMPI v4.1.5, uses OpenMP, and builds its own FFTW v3.3.8.
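
For reference, a build along those lines could be configured roughly as follows. This is only a sketch: the module names and install prefix are placeholders for whatever is used on JUWELS, but GMX_MPI, GMX_OPENMP and GMX_BUILD_OWN_FFTW are the standard GROMACS CMake options for an MPI + OpenMP build that downloads and builds its own FFTW.

# sketch of a comparable build (module names and paths are placeholders)
module load GCC/11.4.0 OpenMPI/4.1.5
cmake .. \
    -DGMX_MPI=ON \
    -DGMX_OPENMP=ON \
    -DGMX_BUILD_OWN_FFTW=ON \
    -DCMAKE_INSTALL_PREFIX=$HOME/gromacs-2024-dev
make -j 16 && make install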

JUWELS Cluster info
The standard compute node of the JUWELS Cluster has 2 sockets, each with a 24-core Intel(R) Xeon(R) Platinum 8168 CPU @ 2.70 GHz (48 cores per node), with 32 kB L1 and 1 MB L2 cache per core and 33 MB shared L3 cache per socket.

Pinning and affinity info
I’m using the following cell counts: 2, 3, 4, 6, 8, 12, 16 and 24, which correspond to thread counts per cell of 24, 16, 12, 8, 6, 4, 3 and 2, respectively (one MPI rank per DD cell on the 48 cores of a node). The threads are bound to cores using JSC’s native pinning tool for the JUWELS Cluster; the attached diagrams show the resulting thread pinning for 8, 16 and 24 cells. A sketch of how the sweep is driven is shown below.
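
The sweep over cell counts can be driven by a small wrapper like the one below. This is a sketch under the assumption that each cell count is submitted as its own job and that the job script reads TASK_COUNT from the environment rather than hard-coding it; job_node_sat.sh is a placeholder name.

# sketch: one job per DD cell count on a 48-core node
for TASK_COUNT in 2 3 4 6 8 12 16 24
do
    THREAD_COUNT=$(( 48 / TASK_COUNT ))
    sbatch --export=ALL,TASK_COUNT=$TASK_COUNT,THREAD_COUNT=$THREAD_COUNT job_node_sat.sh
done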

Result
Now, I’m getting the scaling shown in the attached plot.

Question: Does this plot look OK?

Additional info
I’ve attached the job script, err file and performance breakdown for the case with 8 cells.

Job script
Here is the job script:

#!/bin/bash
#SBATCH --account=my-account
#SBATCH --nodes=1
#SBATCH --threads-per-core=1
#SBATCH --output=node_sat_tpn8.log
#SBATCH --error=node_sat_tpn8.err
#SBATCH --time=00:10:00
#SBATCH --partition=devel

# parallelization
TASK_COUNT=8
THREAD_COUNT=$(( 48 / $TASK_COUNT ))

# create .mdp
cat > bench.mdp << EOF
; VARIOUS PREPROCESSING OPTIONS
; Preprocessor information: use cpp syntax.
; e.g.: -I/home/joe/doe -I/home/mary/roe
include                  = 
; e.g.: -DPOSRES -DFLEXIBLE (note these variable names are case sensitive)
define                   =

; RUN CONTROL PARAMETERS
integrator               = md
; Start time and timestep in ps
tinit                    = 0
dt                       = 0.0005
nsteps                   = 3000
; For exact run continuation or redoing part of a run
init_step                = 0
; Part index is updated automatically on checkpointing (keeps files separate)
simulation_part          = 1
; mode for center of mass motion removal
comm-mode                = none

; OUTPUT CONTROL OPTIONS
; Output frequency for coords (x), velocities (v) and forces (f)
nstxout                  = 0
nstvout                  = 0
nstfout                  = 0
; Output frequency for energies to log file and energy file
nstlog                   = 0
nstenergy                = 0

; NEIGHBORSEARCHING PARAMETERS
; cut-off scheme (Verlet: particle based cut-offs, group: using charge groups)
cutoff-scheme            = Verlet
; nblist update frequency
nstlist                  = 25
; Periodic boundary conditions: xyz, no, xy
pbc                      = xyz
periodic_molecules       = no
; Allowed energy error due to the Verlet buffer in kJ/mol/ps per atom,
; a value of -1 means: use rlist
verlet-buffer-tolerance  = 0.001
; nblist cut-off        
rlist                    = 0.9

; OPTIONS FOR ELECTROSTATICS AND VDW
; Method for doing electrostatics
coulombtype              = pme
coulomb-modifier         = Potential-shift-Verlet
rcoulomb-switch          = 0
rcoulomb                 = 0.9
; Relative dielectric constant for the medium and the reaction field
epsilon_r                = 1
epsilon_rf               = 1
; Method for doing Van der Waals
vdw-type                 = cut-off
vdw-modifier             = Potential-shift-Verlet
; cut-off lengths       
rvdw-switch              = 0
rvdw                     = 0.9
; Apply long range dispersion corrections for Energy and Pressure
DispCorr                 = no 
; Extension of the potential lookup tables beyond the cut-off
table-extension          = 1
; Separate tables between energy group pairs
energygrp-table          = 
; Spacing for the PME/PPPM FFT grid
fourierspacing           = 0.12
; FFT grid size, when a value is 0 fourierspacing will be used
fourier_nx               = 0
fourier_ny               = 0
fourier_nz               = 0
; EWALD/PME/PPPM parameters
pme_order                = 4
ewald_rtol               = 1e-05
ewald-rtol-lj            = 0.001
lj-pme-comb-rule         = Geometric
ewald_geometry           = 3d
epsilon_surface          = 0

; OPTIONS FOR WEAK COUPLING ALGORITHMS
; Temperature coupling  
Tcoupl                   = no
; Groups to couple separately
;tc-grps                  = System
; Time constant (ps) and reference temperature (K)
;tau_t                    = 0.9
;ref_t                    = 20.0
; pressure coupling     
Pcoupl                   = no
; Scaling of reference coordinates, No, All or COM
refcoord_scaling         = no

; GENERATE VELOCITIES FOR STARTUP RUN
gen_vel                  = no
gen_temp                 = 
gen_seed                 = -1

; OPTIONS FOR BONDS    
;constraints              = all-bonds
constraints              = h-bonds
; Type of constraint algorithm
constraint-algorithm     = Lincs
; Do not constrain the start configuration
continuation             = no
; Use successive overrelaxation to reduce the number of shake iterations
Shake-SOR                = no
; Relative tolerance of shake
shake-tol                = 0.0001
; Highest order in the expansion of the constraint coupling matrix
lincs-order              = 6
; Number of iterations in the final step of LINCS. 1 is fine for
; normal simulations, but use 2 to conserve energy in NVE runs.
; For energy minimization with constraints it should be 4 to 8.
lincs-iter               = 2
; Lincs will write a warning to the stderr if in one step a bond
; rotates over more degrees than
lincs-warnangle          = 30
; Convert harmonic bonds to morse potentials
morse                    = no
EOF

# do pre-proc
srun --ntasks=1 --cpus-per-task=$THREAD_COUNT --cpu-bind=threads --distribution=block:block:fcyclic gmx_mpi grompp -f bench.mdp -c em.gro -r em.gro -p lysozyme.top -o bench.tpr -maxwarn 1 && PREPROC_SUCCESS=1
if [[ $PREPROC_SUCCESS -ne 1 ]]
then
    echo "Pre-processing failed." 1>&2
    exit 1
fi
rm *.mdp \#* &> /dev/null

# run
export GMX_USE_MODULAR_SIMULATOR=ON
if [[ $TASK_COUNT -eq 1 ]]
then
    export GMX_DD_SINGLE_RANK=0 # to prevent reordering of atoms during pair-list building
fi
export OMP_PROC_BIND=true OMP_NUM_THREADS=$THREAD_COUNT
srun --ntasks=$TASK_COUNT --cpus-per-task=$THREAD_COUNT --cpu-bind=threads --distribution=block:block:fcyclic gmx_mpi mdrun -deffnm bench -ntomp $THREAD_COUNT -resetstep 2500 && MDRUN_SUCCESS=1
if [[ $MDRUN_SUCCESS -ne 1 ]]
then
    echo "MD run failed." 1>&2
    exit 1
fi
rm ms*/\#* &> /dev/null

Err file
Here is the error file.

       :-) GROMACS - gmx grompp, 2024-dev-20240201-a947d5995e-unknown (-:

Executable:   gromacs/build/bin/gmx_mpi
Data prefix:  gromacs (source tree)
Working dir:  gromacs/logs/ligand_binding/node_sat_stdmd/size0/tpn8
Command line:
  gmx_mpi grompp -f bench.mdp -c em.gro -r em.gro -p lysozyme.top -o bench.tpr -maxwarn 1

Generating 1-4 interactions: fudge = 0.5
Number of degrees of freedom in T-Coupling group rest is 103142.00
The integrator does not provide a ensemble temperature, there is no system ensemble temperature

NOTE 1 [file bench.mdp]:
  NVE simulation with an initial temperature of zero: will use a Verlet
  buffer of 10%. Check your energy drift!


WARNING 1 [file bench.mdp]:
  You are not using center of mass motion removal (mdp option comm-mode),
  numerical rounding errors can lead to build up of kinetic energy of the
  center of mass


There was 1 NOTE

There was 1 WARNING

GROMACS reminds you: "As an adolescent I aspired to lasting fame, I craved factual certainty, and I thirsted for a meaningful vision of human life -- so I became a scientist. This is like becoming an archbishop so you can meet girls." (Matt Cartmill)

       :-) GROMACS - gmx mdrun, 2024-dev-20240201-a947d5995e-unknown (-:

Executable:   gromacs/build/bin/gmx_mpi
Data prefix:  gromacs (source tree)
Working dir: gromacs/logs/ligand_binding/node_sat_stdmd/size0/tpn8
Command line:
  gmx_mpi mdrun -deffnm bench -ntomp 6 -resetstep 2500

Reading file bench.tpr, VERSION 2024-dev-20240201-a947d5995e-unknown (single precision)
Can not increase nstlist because an NVE ensemble is used
Using 8 MPI processes

Non-default thread affinity set, disabling internal thread affinity

Using 6 OpenMP threads per MPI process

starting mdrun 'Protein in water'
3000 steps,      1.5 ps.

step 2500: resetting all time and cycle counters

Writing final coordinates.

               Core t (s)   Wall t (s)        (%)
       Time:       52.237        1.151     4538.4
                 (ns/day)    (hour/ns)
Performance:       18.804        1.276

GROMACS reminds you: "Give someone a program, you frustrate them for a day; teach them how to program, you frustrate them for a lifetime." (David Leinweber)

Performance breakdown
Finally, here’s the performance breakdown:

      R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G

On 8 MPI ranks, each using 6 OpenMP threads

 Activity:              Num   Num      Call    Wall time         Giga-Cycles
                        Ranks Threads  Count      (s)         total sum    %
--------------------------------------------------------------------------------
 Domain decomp.            8    6         20       0.027          3.248   2.3
 Neighbor search           8    6         20       0.043          5.288   3.8
 Comm. coord.              8    6        480       0.028          3.425   2.4
 Force                     8    6        500       0.302         36.912  26.2
 Wait + Comm. F            8    6        500       0.024          2.992   2.1
 PME mesh                  8    6        500       0.638         78.009  55.4
 NB X/F buffer ops.        8    6       1460       0.020          2.450   1.7
 Write traj.               8    6          1       0.010          1.189   0.8
 Update                    8    6        500       0.004          0.549   0.4
 Constraints               8    6        500       0.014          1.689   1.2
 Comm. energies            8    6          5       0.000          0.036   0.0
 Rest                                              0.040          4.923   3.5
--------------------------------------------------------------------------------
 Total                                             1.151        140.712 100.0
--------------------------------------------------------------------------------
 Breakdown of PME mesh activities
--------------------------------------------------------------------------------
 PME redist. X/F           8    6       1000       0.269         32.898  23.4
 PME spread                8    6        500       0.107         13.059   9.3
 PME gather                8    6        500       0.073          8.968   6.4
 PME 3D-FFT                8    6       1000       0.060          7.280   5.2
 PME 3D-FFT Comm.          8    6       1000       0.124         15.180  10.8
 PME solve Elec            8    6        500       0.004          0.460   0.3
--------------------------------------------------------------------------------

               Core t (s)   Wall t (s)        (%)
       Time:       52.237        1.151     4538.4
                 (ns/day)    (hour/ns)
Performance:       18.804        1.276

Edit
Following @mabraham’s suggestion, here is the sampling rate plot of the above:

Thanks for the analysis.

Please try to report and reason about GROMACS performance in terms of ns/day (as reported in the log file), so that your observations are more readily transferable and can be interpreted in the light of others’ experience. It’s just the convention for MD!
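
(For reference, the two quantities in the log are reciprocals up to a factor of 24: ns/day = 24 / (hours/ns), so the 1.276 hours/ns above gives 24 / 1.276 ≈ 18.8 ns/day, matching the performance line in the log.)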

Your graphs are accurate. Performance is roughly linear with particle count at these sizes. 3 ranks should be terrible because one of the big OpenMP regions has its threads split across two sockets. Using many ranks performs best because the overhead from the many OpenMP regions used by GROMACS goes down; for CPU-only GROMACS, until you reach the strong-scaling limit for your system, anything other than -ntomp 1 is wasteful because you incur both MPI and OpenMP overheads.
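
On a 48-core JUWELS node that pure-MPI limit corresponds to something like the run below; this is a sketch that mirrors the srun flags from the job script above, not a prescribed command line.

# sketch: one MPI rank per core, no OpenMP parallelism
export OMP_NUM_THREADS=1
srun --ntasks=48 --cpus-per-task=1 --cpu-bind=threads --distribution=block:block:fcyclic \
    gmx_mpi mdrun -deffnm bench -ntomp 1 -resetstep 2500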

PME redist. X/F is relatively high (23%) in the log you showed, but the most relevant breakdown is the one for -ntomp 1. You will probably want to switch to using 1/4 to 1/3 of the ranks as PME-only ranks, which the log file for that run probably recommends.
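
As a sketch of that suggestion (the exact split is an assumption; the note mdrun prints in the -ntomp 1 log is the better guide), dedicating a quarter of the 48 ranks to PME would look like:

# sketch: 36 PP ranks + 12 PME-only ranks
srun --ntasks=48 --cpus-per-task=1 --cpu-bind=threads --distribution=block:block:fcyclic \
    gmx_mpi mdrun -deffnm bench -ntomp 1 -npme 12 -resetstep 2500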

Hi Mark, thanks for your reply and for confirming the graph. It’s interesting that this turns out to be an example of MPI outperforming OpenMP. I will give separate PME-only ranks a go.