PP computation

GROMACS version: 2021
GROMACS modification: No

Hi,

I am trying to compare the performance of GROMACS on two different high-performance computers, and I am seeing very large differences in computational time for different numbers of nodes (I am not using GPUs yet): around 15 times slower in every case.
For this comparison, I am using two versions of GROMACS 2021 (I am not allowed to compile my own version) with the same system and the same mdp file. Please note that in the “slow” case, a version with PLUMED is being used; however, I do not think that this is the source of such a big difference.
After looking carefully at the produced log files, the thing that puzzles me the most is that for the “fast” computer I get the “Breakdown of PP computation” part at the bottom of the log file, whereas for the other computer I do not.
My question is whether this is architecture dependent or whether I am doing something wrong when submitting my calculations.
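
For reference, this is roughly how I am extracting the timing tables from the two log files for comparison (the file names below are just placeholders; each run actually writes bench.log because of -deffnm bench):

# show the cycle/time accounting table at the end of each log
grep -A 40 "R E A L   C Y C L E" bench_fast.log
grep -A 40 "R E A L   C Y C L E" bench_slow.log

# the sub-table I am missing on the slow machine, when present, sits inside that section
grep -B 2 -A 20 "Breakdown of PP computation" bench_fast.log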

For clarity, I will show below the scripts that I am using to submit my jobs to the queues in both computers.

“Fast” computer:

#!/bin/bash
#SBATCH --time=1-00:00:00
#SBATCH --job-name=...   # Job name
#SBATCH --output=%x-%j.out        # output file
#SBATCH --error=%x-%j.err         # Error file
#SBATCH --ntasks-per-node=8       # 8 MPI ranks
#SBATCH --cpus-per-task=16        # 16 OpenMP threads per rank
#SBATCH --nodes=1                 # 1 node * 8 ranks * 16 threads = 128 cores
#SBATCH --partition=standard

module load LUMI/22.08
module load partition/L
module load GROMACS/2021.6-cpeCray-22.08-CPU

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export OMP_PLACES=cores

srun gmx_mpi grompp -f nvt.mdp -c 1A_ordered.gro -r 1A_ordered.gro -p A_box.top -o bench.tpr -maxwarn 3
srun gmx_mpi mdrun -v -deffnm bench -ntomp 16

“Slow” computer:

#! /bin/bash
#PBS -N 1N_8n_16c
#PBS -A DD-23-95
#PBS -l select=1:mpiprocs=8:ompthreads=16
#PBS -l walltime=24:00:00
#PBS -q qcpu
#PBS -o %n-%j.out
#PBS -e %n-%j.err

#export OMP_NUM_THREADS=$PBS_NUM_PPN
#export OMP_PLACES=cores

PBS_O_WORKDIR=...
cd $PBS_O_WORKDIR

module load GROMACS/2021.4-foss-2020b-PLUMED-2.7.3

mpirun gmx_mpi grompp -f nvt.mdp -c 1A_ordered.gro -r 1A_ordered.gro -p A_box.top -o bench.tpr -maxwarn 3
mpirun gmx_mpi mdrun -v -deffnm bench -ntomp 16

Many thanks!

Hi,

I suspect you are not pinning threads correctly, and that is the reason for the bad performance. You neither pass -pin on to mdrun nor set OMP_PLACES (which you do set in the LUMI submit script).
Note that without OMP_PROC_BIND you may also get suboptimal thread placement. Also note that 16 threads per rank is likely not the most efficient choice in CPU-only runs unless you have a peculiar setup.
I suggest trying mdrun -pin on, which will do the right thing and is most often faster than the alternatives.
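
For example, a minimal adjustment to the PBS script could look like this (a sketch only, keeping your 8 ranks x 16 threads layout for the sake of the comparison; any mpirun-level binding flags depend on your MPI library and are left out here):

# let mdrun pin its threads to cores instead of leaving placement to the OS
mpirun gmx_mpi mdrun -v -deffnm bench -ntomp 16 -pin on

# alternative: control placement through the OpenMP environment instead
# export OMP_NUM_THREADS=16
# export OMP_PLACES=cores
# export OMP_PROC_BIND=close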

Cheers,
Szilárd

Hi,

Many thanks for your suggestions! I will try that now.

About the 16 threads per rank: we assume this is not optimal at all; however, it is the combination for which we found the most surprising difference between the two computers, which is why we are investigating it.

Cheers,
Mario