PP computation

GROMACS version: 2021
GROMACS modification: No

Hi,

I am trying to compare the performance of GROMACS on two different high-performance computers, and I am seeing very large differences in computational time for different numbers of nodes (I am not using GPUs yet): around 15 times slower in every case.
For this comparison, I am using two versions of GROMACS 2021 (I am not allowed to compile my own version) with the same system and the same mdp file. Please note that in the “slow” case, a version with PLUMED is being used; however, I do not think that this is the source of such a big difference.
After looking carefully at the produced log files, the thing that puzzles me the most is that for the “fast” computer I get the “Breakdown of PP computation” part at the bottom of the log file, whereas for the other computer I do not.
My question is whether this is architecture dependent or whether I am doing something wrong when submitting my calculations.
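
For reference, this is roughly how I am extracting the timing tables from the two log files for comparison (the file names below are just placeholders; each run actually writes bench.log because of -deffnm bench):

# show the cycle/time accounting table at the end of each log
grep -A 40 "R E A L   C Y C L E" bench_fast.log
grep -A 40 "R E A L   C Y C L E" bench_slow.log

# the sub-table I am missing on the slow machine, when present, sits inside that section
grep -B 2 -A 20 "Breakdown of PP computation" bench_fast.log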

For clarity, I will show below the scripts that I am using to submit my jobs to the queues in both computers.

“Fast” computer:

#!/bin/bash
#SBATCH --time=1-00:00:00
#SBATCH --job-name=...   # Job name
#SBATCH --output=%x-%j.out        # output file
#SBATCH --error=%x-%j.err         # Error file
#SBATCH --ntasks-per-node=8       # 8 MPI ranks
#SBATCH --cpus-per-task=16        # 16 OpenMP threads per rank
#SBATCH --nodes=1                 # 1 node * 8 ranks * 16 threads = 128 cores
#SBATCH --partition=standard

module load LUMI/22.08
module load partition/L
module load GROMACS/2021.6-cpeCray-22.08-CPU

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export OMP_PLACES=cores

srun gmx_mpi grompp -f nvt.mdp -c 1A_ordered.gro -r 1A_ordered.gro -p A_box.top -o bench.tpr -maxwarn 3
srun gmx_mpi mdrun -v -deffnm bench -ntomp 16

“Slow” computer:

#! /bin/bash
#PBS -N 1N_8n_16c
#PBS -A DD-23-95
#PBS -l select=1:mpiprocs=8:ompthreads=16
#PBS -l walltime=24:00:00
#PBS -q qcpu
#PBS -o %n-%j.out
#PBS -e %n-%j.err

#export OMP_NUM_THREADS=$PBS_NUM_PPN
#export OMP_PLACES=cores

PBS_O_WORKDIR=...
cd $PBS_O_WORKDIR

module load GROMACS/2021.4-foss-2020b-PLUMED-2.7.3

mpirun gmx_mpi grompp -f nvt.mdp -c 1A_ordered.gro -r 1A_ordered.gro -p A_box.top -o bench.tpr -maxwarn 3
mpirun gmx_mpi mdrun -v -deffnm bench -ntomp 16

Many thanks!

Hi,

I suspect you are not pinning threads correctly, and that is the reason for the bad performance. You neither pass -pin on to mdrun nor set OMP_PLACES (which you do set in the LUMI submit script).
Note that without OMP_PROC_BIND you may also get suboptimal thread placement. Also note that 16 threads per rank is likely not the most efficient choice in CPU-only runs unless you have a peculiar setup.
I suggest trying mdrun -pin on, which will do the right thing and is most often faster than the alternatives.
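
For example, a minimal adjustment to the PBS script could look like this (a sketch only, keeping your 8 ranks x 16 threads layout for the sake of the comparison; any mpirun-level binding flags depend on your MPI library and are left out here):

# let mdrun pin its threads to cores instead of leaving placement to the OS
mpirun gmx_mpi mdrun -v -deffnm bench -ntomp 16 -pin on

# alternative: control placement through the OpenMP environment instead
# export OMP_NUM_THREADS=16
# export OMP_PLACES=cores
# export OMP_PROC_BIND=close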

Cheers,
Szilárd

Hi,

Many thanks for your suggestions! I will try that now.

About the 16 threads per rank: we assume this is not optimal at all; however, it is the combination for which we found the most surprising difference between the two computers, which is why we are investigating it.

Cheers,
Mario