GROMACS performance issues on POWER9/V100 node

This is in continuation of the following gmx-users thread:

On Mon, Apr 27, 2020 at 4:26 PM Jonathan D. Halverson <> wrote:

Hi Szilárd,

Our OS is RHEL 7.6.

Thank you for your test results. It’s nice to see consistent results on a POWER9 system.

Your suggestion of allocating the whole node was a good one. I did this in two ways. The first was to bypass the Slurm scheduler by ssh-ing to an empty node and running the benchmark. The second way was through Slurm using the --exclusive directive (which allocates the entire node independent of job size). In both cases, which used 32 hardware threads and one V100 GPU for ADH (PME, cubic, 40k steps), the performance was about 132 ns/day, which is significantly better than the 90 ns/day from before (without --exclusive).

Note that you are comparing 32 CPU cores + 1 GPU vs (presumably) 8 CPU cores + 1 GPU there; see below.

Links to the md.log files are below. Here is the Slurm script with --exclusive:

#SBATCH --job-name=gmx # create a short name for your job
#SBATCH --nodes=1 # node count
#SBATCH --ntasks=1 # total number of tasks across all nodes
#SBATCH --cpus-per-task=32 # cpu-cores per task (>1 if multi-threaded tasks)
#SBATCH --mem=8G # memory per node (4G per cpu-core is default)
#SBATCH --time=00:10:00 # total run time limit (HH:MM:SS)
#SBATCH --gres=gpu:1 # number of gpus per node

module purge
module load cudatoolkit/10.2

gmx grompp -f $BCH/pme_verlet.mdp -c $BCH/conf.gro -p $BCH/ -o bench.tpr
srun gmx mdrun -nsteps 40000 -pin on -ntmpi $SLURM_NTASKS -ntomp $SLURM_CPUS_PER_TASK -s bench.tpr

Here are the log files:

md.log with --exclusive:

md.log without --exclusive:

Szilárd, what is your reading of these two files?

There are two critical pieces of information in these files that I’ve highlighted before, and looking at them side by side makes them obvious again:

Line 59 (with --exclusive first, without --exclusive second):
Running on 1 node with total 32 cores, 128 logical cores, 1 compatible GPU
Running on 1 node with total 128 cores, 128 logical cores, 1 compatible GPU

Line 66 (same order):
Hardware topology: Full, with devices
Hardware topology: Only logical processor count

The log of the exclusive job then goes on to list the topology mapping. This suggests that the non-exclusive allocation applies some kernel-level (presumably job-isolation-related) settings that break the topology detection and make the node look like a “flat” 128-core node rather than a 32-core / 128-thread node.
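To put the two detection results side by side, the relevant lines can be pulled out of each log. This is a minimal sketch; the helper name and the log file names in the usage comment are placeholders, not taken from the thread.

```bash
# Print the hardware-detection lines that mdrun writes near the top of md.log.
# detected_hw is a hypothetical helper; the file names below are placeholders.
detected_hw() {
    grep -E 'Running on [0-9]+ node|Hardware topology:' "$1"
}

# Usage (hypothetical file names):
#   detected_hw md_exclusive.log
#   detected_hw md_shared.log
```

If the shared-node log reports "Only logical processor count", mdrun had no core/thread layout to work with when it chose its pinning.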

Furthermore, if you look at the “R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G” table, entries that correspond to CPU compute work (rather than communication or waiting for the GPU) take ~2x longer in the “without-exclusive” case. A few examples (the third-to-last column is the wall time of the task in seconds):

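With --exclusive:
Computing         Ranks  Threads  Call count  Wall time (s)  Giga-cycles     %
Neighbor search       1       32         401          1.460       23.918   2.8
Force                 1       32       40001          9.164      150.147  17.5
Update                1       32       40001          1.755       28.762   3.4

Without --exclusive:
Computing         Ranks  Threads  Call count  Wall time (s)  Giga-cycles     %
Neighbor search       1       32         401          3.107       50.904   4.7
Force                 1       32       40001         19.936      326.639  30.2
Update                1       32       40001          3.375       55.299   5.1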
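The "~2x" figure can be checked directly from the wall times quoted above. A small sketch (the slowdown helper is just for illustration):

```bash
# Ratio of two wall times (shared / exclusive), rounded to two decimals.
slowdown() {
    awk -v excl="$1" -v shared="$2" 'BEGIN { printf "%.2f\n", shared / excl }'
}

# Wall times (s) copied from the two listings above:
# first argument is the --exclusive run, second the shared run.
echo "Neighbor search: $(slowdown 1.460 3.107)x"
echo "Force:           $(slowdown 9.164 19.936)x"
echo "Update:          $(slowdown 1.755 3.375)x"
```

All three CPU-side tasks come out at roughly 1.9x to 2.2x slower in the shared run.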

To me it seems that this can only happen if the set of hardware threads is assigned incorrectly in the node-sharing case. Note, however, that your exclusive case uses all 32 cores with one hardware thread placed on each, as GROMACS assumes that you have full access to the node. This can be seen from:

“Pinning threads with an auto-selected logical core stride of 4”

which means that in the listing of the hardware thread topology (see “Sockets, cores, and logical processors:”) the mdrun threads get pinned to every fourth thread, i.e. to hw threads 0, 4, 8, 12, 16, etc., which you can and should verify.
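One way to do that check is to list the indices the stride implies and compare them with where the mdrun threads actually run. This is a rough sketch: it assumes a pin offset of 0 and that mdrun's logical core indices line up with the OS CPU numbers, which is worth confirming against the topology listing in md.log.

```bash
# Logical-core indices implied by a thread count and pin stride
# (pin offset 0 assumed). expected_pins is a hypothetical helper.
expected_pins() {
    seq -s ' ' 0 "$2" $(( ($1 - 1) * $2 ))
}

expected_pins 32 4   # 32 threads, stride 4: 0 4 8 ... 124

# While mdrun is running, show the CPU each of its threads last ran on (PSR).
# Helper threads that mdrun does not pin may show up on other CPUs.
#   ps -L -o tid,psr,comm -p "$(pgrep -n gmx)"
```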

The correct comparison, however, is to run mdrun on a quarter of the machine (I assume), i.e.:
mdrun -ntomp 8 -pinstride 4 -pin on (for 1 thread/core)
mdrun -ntomp 16 -pinstride 2 -pin on (for 2 threads/core)
mdrun -ntomp 32 -pinstride 1 -pin on (for 4 threads/core)

I suspect the second one (SMT2) will be fastest, but it may depend on the use-case.
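The three quarter-node variants above follow one pattern: thread count is cores times threads per core, and the stride is the SMT level divided by threads per core. A sketch that prints the corresponding command lines rather than running them; the 8-core / SMT4 values come from the thread, while the per-run log names are made up for illustration.

```bash
# mdrun arguments for a quarter of the node (8 cores of an SMT4 machine),
# using 1, 2 or 4 hardware threads per core.
quarter_node_args() {
    local tpc=$1            # threads per core: 1, 2 or 4
    local cores=8 smt=4     # quarter of a 32-core SMT4 node
    echo "-ntmpi 1 -ntomp $(( cores * tpc )) -pinstride $(( smt / tpc )) -pin on"
}

# Print one command per variant; smt1.log etc. are placeholder log names.
for tpc in 1 2 4; do
    echo "gmx mdrun $(quarter_node_args "$tpc") -nsteps 40000 -s bench.tpr -g smt${tpc}.log"
done
```

On a shared node the job may not own the first 8 cores, in which case -pinoffset would presumably also be needed to shift the pinning onto the cores the scheduler actually assigned.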

This is a shared cluster so I can’t use --exclusive for all jobs. Our nodes have four GPUs and 128 hardware threads (SMT4 so 32 cores over 2 sockets). Any thoughts on how to make a job behave like it is being run with --exclusive? The task affinities are apparently not being set properly in that case.

I’d also suggest making sure you know how your job scheduler partitions the node and which cores it assigns to jobs.
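A simple way to see what the scheduler handed out is to print the CPU set the job is allowed to run on from inside the job script. A minimal sketch for Linux; the count_cpus helper is illustrative only.

```bash
# Count the CPUs in a Linux CPU list such as "0-3,8-11".
count_cpus() {
    echo "$1" | tr ',' '\n' |
        awk -F- '{ n += (NF == 2 ? $2 - $1 + 1 : NF) } END { print n + 0 }'
}

# CPUs the current process (and so the job step) is allowed to run on.
allowed=$(awk '/^Cpus_allowed_list:/ { print $2 }' /proc/self/status 2>/dev/null)
echo "allowed CPUs: $allowed ($(count_cpus "$allowed") hardware threads)"

# Slurm's own view of the allocation can be compared with:
#   scontrol show job -d "$SLURM_JOB_ID"
```

If a 32-thread job on this SMT4 node reports a list like 0-31, it owns 8 full cores; a scattered list would explain threads ending up on shared cores.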

To solve this I tried experimenting with the --cpu-bind settings. When --exclusive is not used, I find a slight performance gain by using --cpu-bind=cores:
srun --cpu-bind=cores gmx mdrun -nsteps 40000 -pin on -ntmpi $SLURM_NTASKS -ntomp $SLURM_CPUS_PER_TASK -s bench.tpr

Binding using the job scheduler will help, but note that without binding each thread to a specific hardware thread (not sure if srun can do that; GOMP_CPU_AFFINITY can) the behavior will be different.
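For reference, GOMP_CPU_AFFINITY accepts ranges with a stride, so one thread per core on an SMT4 node can be written compactly. This sketch assumes GROMACS was built with GNU OpenMP (the variable is specific to libgomp) and that the job owns hardware threads 0-31; both are assumptions, not facts from the thread.

```bash
# Build a GOMP_CPU_AFFINITY range "first-last:stride" for n threads.
# gomp_range is a hypothetical helper.
gomp_range() {
    local first=$1 n=$2 stride=$3
    echo "${first}-$(( first + (n - 1) * stride )):${stride}"
}

# Assumption: the job owns hw threads 0-31, i.e. the first 8 cores.
export GOMP_CPU_AFFINITY=$(gomp_range 0 8 4)
echo "GOMP_CPU_AFFINITY=$GOMP_CPU_AFFINITY"

# Then let the OpenMP runtime do the pinning instead of mdrun:
#   gmx mdrun -pin off -ntmpi 1 -ntomp 8 -nsteps 40000 -s bench.tpr
```

With -pin off, mdrun leaves thread placement to the runtime, so the two mechanisms do not compete.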

Either way, I suggest first making sure you compare 8 core node-exclusive vs shared runs. To make the difference even more obvious, you can run mdrun on CPUs only (i.e. set -nb cpu or disable the GPU detection with GMX_DISABLE_GPU_DETECTION).
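To compare such runs quickly, the ns/day figure can be read from the end of each log. A small sketch; the log file names and the ns_per_day helper are placeholders.

```bash
# Extract ns/day from the "Performance:" line at the end of an md.log.
ns_per_day() {
    awk '/^Performance:/ { print $2 }' "$1"
}

# CPU-only 8-core runs, once node-exclusive and once shared
# (log names are hypothetical):
#   gmx mdrun -nb cpu -ntmpi 1 -ntomp 8 -pin on -pinstride 4 \
#       -nsteps 40000 -s bench.tpr -g cpu_only.log
#   ns_per_day cpu_only.log
#
# Alternative to -nb cpu: hide the GPU from mdrun altogether.
#   GMX_DISABLE_GPU_DETECTION=1 gmx mdrun -ntmpi 1 -ntomp 8 -pin on \
#       -pinstride 4 -nsteps 40000 -s bench.tpr -g cpu_only.log
```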