-multidir does not seem to work with multiple nodes per directory

GROMACS version: 2021.4
GROMACS modification: No

I ran mdrun on Fugaku with 60 nodes. Each directory is expected to run on 2 nodes with 8 MPI ranks, using the following command:
mpirun -np 240 gmx_mpi mdrun -multidir dir{1..30} -s topol -ntomp 12

However, the log file in each directory reports that only 1 MPI rank with 12 OpenMP threads is actually used, as follows:

This is simulation 0 out of 30 running as a composite GROMACS
multi-simulation job. Setup for this simulation:

Using 1 MPI process

Non-default thread affinity set, disabling internal thread affinity

Using 12 OpenMP threads

I checked the speed of each simulation, and the performance matches 1 MPI rank with 12 OpenMP threads exactly.

Do you have any recommendations for overcoming this?

I have never heard about issues with -multidir and using multiple MPI ranks per simulation.

Is there more information at the start of the log file about the number of MPI ranks?
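A quick way to pull the relevant lines out of every log at once is sketched below. This is only a sketch: `show_parallel_setup` is a hypothetical helper name, and it assumes each directory holds a log with the default name `md.log` and that the lines of interest start with "Running on" or "Using", as in the excerpts in this thread.

```shell
# Hypothetical helper: print the node count and the per-simulation
# rank/thread lines from each mdrun log given as an argument.
# Assumption: the log wording matches the excerpts quoted in this thread.
show_parallel_setup() {
    grep -H -E '^(Running on [0-9]+ node|Using [0-9]+ (MPI|OpenMP) )' "$@"
}

# Example call (assumes the default log name md.log in each directory):
#   show_parallel_setup dir*/md.log
```

If every directory shows one MPI process, the simulations really are running on a single rank each, and the next thing to check is how many ranks the launcher started.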

Hi @hess
I ran with 120 MPI ranks using:

mpirun -np 120 gmx_mpi mdrun -multidir <dirs> -ntomp 12

I copied some information from the log:

GROMACS:      gmx mdrun, version 2021.4-fugaku
Executable:   .../gmx_mpi
Data prefix:  ...
Working dir:  ...
Process ID:   102
Command line:
  gmx_mpi mdrun -multidir <dirs> -ntomp 12
GROMACS version:    2021.4-fugaku
Precision:          mixed
Memory model:       64 bit
MPI library:        MPI
OpenMP support:     enabled (GMX_OPENMP_MAX_THREADS = 64)
GPU support:        disabled
SIMD instructions:  ARM_SVE
FFT library:        fftw-3.3.9-sve
TNG support:        enabled
Hwloc support:      disabled
Tracing support:    disabled
C compiler:         /vol0004/apps/oss/spack-v0.17.0/lib/spack/env/fj/fcc FujitsuClang 7.1.0
C compiler flags:   -march=armv8.2-a+sve -pthread -Wno-missing-field-initializers -Xg -w -Ofast -DNDEBUG
C++ compiler:       /vol0004/apps/oss/spack-v0.17.0/lib/spack/env/fj/case-insensitive/FCC FujitsuClang 7.1.0
C++ compiler flags: -march=armv8.2-a+sve -pthread -Wno-missing-field-initializers -Xg -w -Weverything -Wno-c++98-compat -Wno-c++98-compat-pedantic -Wno-source-uses-openmp -Wno-c++17-extensions -Wno-documentation-unknown-command -Wno-covered-switch-default -Wno-switch-enum -Wno-extra-semi-stmt -Wno-disabled-macro-expansion -Wno-cast-align -Wno-reserved-id-macro -Wno-global-constructors -Wno-exit-time-destructors -Wno-unused-macros -Wno-weak-vtables -Wno-conditional-uninitialized -Wno-format-nonliteral -Wno-shadow -Wno-cast-qual -Wno-documentation -Wno-used-but-marked-unused -Wno-padded -Wno-float-equal -Wno-old-style-cast -Wno-conversion -Wno-double-promotion -fopenmp -Ofast -DNDEBUG

Running on 8 nodes with total 400 cores, 400 logical cores
  Cores per node:           50
  Logical cores per node:   50
Hardware detected on host e32-2000c (the node of MPI rank 0):
  CPU info:
    Vendor: ARM
    Brand:  Unknown CPU brand
    Family: 8   Model: 1   Stepping: 0
    Features: neon neon_asimd sve
  Hardware topology: Only logical processor count

(Some citation lines…)

The number of OpenMP threads was set by environment variable OMP_NUM_THREADS to 12 (and the command-line setting agreed with that)

Note: 60 CPUs configured, but only 50 were detected to be online.
Input Parameters:

(Followed by the simulation parameters.)

Changing nstlist from 10 to 50, rlist from 1 to 1.103

This is simulation 0 out of 30 running as a composite GROMACS
multi-simulation job. Setup for this simulation:

**Using 1 MPI process**

Non-default thread affinity set, disabling internal thread affinity

**Using 12 OpenMP threads**

System total charge: -0.000
Will do PME sum in reciprocal space for electrostatic interactions.

So the odd behavior shown above appears here as well.
I also checked manually on our in-house cluster, and the result is the same; the measured performance confirms it.

It seems like there is no information in the mdrun log file on the total number of ranks used by gmx. So I can’t say if this goes wrong in mdrun or somewhere in the software stack that launches the job.

mdrun will use all MPI ranks passed (or exit with a fatal error when the number of ranks is not a multiple of the number of simulations). So my only conclusion can be that gmx_mpi got passed 30 ranks and not 240.
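The arithmetic behind that statement can be sketched as follows. This only illustrates the division, it is not GROMACS code, and `ranks_per_sim` is a hypothetical helper name.

```shell
# Hypothetical helper: print the number of ranks each simulation would get,
# or fail when the total is not a multiple of the number of simulations.
ranks_per_sim() {
    local total_ranks=$1 n_sims=$2
    if [ "$n_sims" -le 0 ] || [ $((total_ranks % n_sims)) -ne 0 ]; then
        echo "error: $total_ranks ranks cannot be split evenly over $n_sims simulations" >&2
        return 1
    fi
    echo $((total_ranks / n_sims))
}

ranks_per_sim 240 30   # prints 8: what the original command line asks for
ranks_per_sim 30 30    # prints 1: what the log files actually report
```

With 30 simulations, seeing 1 MPI process per simulation is consistent with only 30 ranks having been started in total.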

Hi @hess
Thank you a lot for looking into the problem.
I am also digging into the code on GitLab to check the behavior of -multidir, but I currently cannot work out where I should look.
I found something that may be relevant in src/gromacs/mdrunutility/multisim.cpp.
Are there other related files I should look at?
Thank you a lot.

Hi @hess
I passed mpirun (mpiexec) a number of MPI processes that is an exact multiple (say N times, with N > 1) of the number of replicas, that is, the number of independent parallel simulations. However, the number of processes actually used by gmx equals the number of parallel simulations, which means 1 MPI process per simulation instead of N MPI processes per simulation.

Yes, I understand what your command line should do. But mdrun cannot leave ranks idle, so this must mean that mdrun only gets called on 30 ranks.

You say you are running on 60 nodes, but mdrun reports that it is running on 8 nodes. That also indicates that you are getting fewer resources than you expect.
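One way to check what the launcher really starts, independent of GROMACS, is to launch a trivial program with the same options and count the processes per host. The sketch below makes two assumptions: that the launcher on your system can start `hostname` in place of `gmx_mpi`, and that each started rank prints one line. `summarize_ranks` is a hypothetical helper name.

```shell
# Hypothetical helper: read one hostname per line on stdin (one line per
# started rank), print "<ranks> <hostname>" for each node, then the totals.
summarize_ranks() {
    sort | uniq -c | awk '
        { ranks += $1; nodes += 1; printf "%d %s\n", $1, $2 }
        END { printf "total: %d ranks on %d nodes\n", ranks, nodes }'
}

# Example call (assumption: same launcher options as the real job):
#   mpirun -np 240 hostname | summarize_ranks
```

If the totals are lower than the rank and node counts you requested, the problem lies in the job submission or launcher settings and not in mdrun.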

The multisim code is indeed in multisim.cpp in buildMultiSimulation(). You can run with -debug 1 -nsteps 0 to get the debug print in .debug output files, but I think this will show that there are 30 MPI ranks total.

Hi @hess
Thank you very much for this valuable information.
Let me check on my side whether anything is strange.
I’ll be back soon.