Hybrid MPI and OpenMP

GROMACS version: 2020
GROMACS modification: Yes

Hi Gromacs Developers,
On an HPC, I am trying to use one full node that has 128 cores. I have a small system (444 nm³), so I cannot assign every core to its own MPI process, i.e. 128 MPI processes with 1 OpenMP thread per process. At this point, I have 2 questions:

  1. This may seem very basic, but I still want to ask: I am using mpirun -np 32, wherein I get the following configuration:
    Using 32 MPI processes
    Non-default thread affinity set, disabling internal thread affinity
    Using 4 OpenMP threads per MPI process

Does this mean I am using the full 128 cores (32*4)? I just want to make sure I am not wasting any resources on the node, since I will be charged for the entire node.

  2. Is there a better way to do this on an HPC, especially when I have a small system and have to use 128 cores?

Thank you for your help.

Kind Regards,
Akash

Hi Akash,

I have limited experience with running GROMACS on HPC, but from what I have managed to learn I am fairly sure that you would end up using 128 hardware threads and not necessarily 128 physical cores (the number of cores depends on the number of hardware threads per core, which is typically 2 when SMT/hyperthreading is enabled).

The point is: is that good or bad? A priori, I would not say that using only part of the resources on the node is a waste: if benchmarking shows that 32 MPI processes are optimal for your system, then adding more would only make you pay more in terms of core hours, wouldn’t it?
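
For example, a quick way to compare decompositions is to run a short benchmark with each setting and compare the ns/day reported at the end of the log. A rough sketch (the binary name gmx_mpi, the topol.tpr file and the step count are assumptions, adjust them to your installation and system):

    # short benchmark runs; -resethway resets the counters halfway so startup cost is excluded
    mpirun -np 32 gmx_mpi mdrun -s topol.tpr -ntomp 4 -nsteps 20000 -resethway -noconfout -g bench_32x4.log
    mpirun -np 64 gmx_mpi mdrun -s topol.tpr -ntomp 2 -nsteps 20000 -resethway -noconfout -g bench_64x2.log
    # then compare the "Performance" (ns/day) lines at the end of bench_32x4.log and bench_64x2.log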

If your compute node has 128 hardware threads, i.e. 64 physical cores with SMT enabled, you can also try to run on only the 64 physical cores with

mpirun -np 16 gmx_mpi mdrun -ntomp 4, or
mpirun -np 32 gmx_mpi mdrun -ntomp 2, or
mpirun -np 8 gmx_mpi mdrun -ntomp 8

Especially for a small system this could give you a performance benefit.
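
If you are not sure whether those 128 "cores" are physical cores or hardware threads, you can check directly on a compute node, e.g. with lscpu (a generic Linux check; the exact field names may vary):

    lscpu | grep -E 'Thread\(s\) per core|Core\(s\) per socket|Socket\(s\)'
    # "Thread(s) per core: 2" means SMT is on, so 128 hardware threads = 64 physical cores
    # "Thread(s) per core: 1" means the 128 are real physical cores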

To make better use of the hardware in general, you can run multiple .tpr files at the same time using GROMACS’s built-in multidir functionality, as described here:

https://manual.gromacs.org/current/user-guide/mdrun-features.html
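
For example, a minimal sketch assuming you have prepared four independent runs in directories run1 … run4, each containing its own topol.tpr (the directory and file names are placeholders); the 32 ranks are then split evenly, 8 per simulation:

    mpirun -np 32 gmx_mpi mdrun -multidir run1 run2 run3 run4 -ntomp 4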

In addition to the above suggestions, note that per the log message above, mdrun detected externally set thread affinities and will honor them. However, if your job scheduler / MPI launcher did not set a correct process/thread affinity, you could end up with suboptimal performance. E.g. if you did not tell your job scheduler that each MPI task is intended to use 4 cores, you may end up with 32 MPI tasks each assigned a single core, but each of those cores will be oversubscribed running 4 threads and you’ll leave 3*32 = 96 cores idle.

Make sure your job launch is correct or use mdrun -pin on.
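
As an illustration, if your cluster uses Slurm, a job script along these lines would tell both the scheduler and GROMACS about the intended 32 x 4 layout (a sketch only; the binary name gmx_mpi and topol.tpr are placeholders for whatever your site and system use):

    #!/bin/bash
    #SBATCH --nodes=1
    #SBATCH --ntasks-per-node=32      # 32 MPI ranks on the node
    #SBATCH --cpus-per-task=4         # reserve 4 cores for each rank

    export OMP_NUM_THREADS=4

    # pass the per-task core count to srun explicitly so the affinity it sets
    # matches the 32 x 4 layout; -pin on lets mdrun pin threads itself as a fallback
    srun --cpus-per-task=4 gmx_mpi mdrun -ntomp 4 -pin on -s topol.tpr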