gmx_mpi uses only a small fraction of CPU/GPU resources across nodes

GROMACS version: 2021.2
GROMACS modification: No

I have a physical host with 2 NVIDIA GTX 1070 GPUs installed. I created 2 VMs on the physical host, each with 1 GPU and 8 vCPUs. My hostfile is below.

biovm slots=1
biovm1 slots=1

I run the following command, trying to leverage both VMs:

mpirun -np 2 -cpus-per-rank 8 -hostfile nodes -mca btl_tcp_if_include ens192 /usr/local/gromacs/bin/gmx_mpi mdrun -deffnm md -maxh 0.08333 -resethway -ntomp 8

The problem is that performance on this 2-node cluster is really poor, only 19 ns/day. When I run the command below on just 1 VM, it achieves ~300 ns/day.

gmx_mpi mdrun -deffnm md -maxh 0.08333 -resethway 

I noticed that when I run the mpirun command, CPU usage is <600% and GPU usage is in the single digits. However, when I run the command on 1 VM, it fully uses the CPU (800%) and GPU usage is ~60%.

Here is my log:

The following command line options and corresponding MCA parameter have
been deprecated and replaced as follows:

  Command line options:
    Deprecated:  --cpus-per-proc, -cpus-per-proc, --cpus-per-rank, -cpus-per-rank
    Replacement: --map-by <obj>:PE=N, default <obj>=NUMA

  Equivalent MCA parameter:
    Deprecated:  rmaps_base_cpus_per_proc
    Replacement: rmaps_base_mapping_policy=<obj>:PE=N, default <obj>=NUMA

The deprecated forms *will* disappear in a future version of Open MPI.
Please update to the new syntax.

GROMACS is free software; you can redistribute it and/or modify it
under the terms of the GNU Lesser General Public License
as published by the Free Software Foundation; either version 2.1
of the License, or (at your option) any later version.

GROMACS:      gmx mdrun, version 2021.2
Executable:   /usr/local/gromacs/bin/gmx_mpi
Data prefix:  /usr/local/gromacs
Working dir:  /data/fdai/gromacs
Command line:
  gmx_mpi mdrun -deffnm md -maxh 0.08333 -resethway -ntomp 8

Back Off! I just backed up md.log to ./#md.log.6#
Reading file md.tpr, VERSION 5.1.2 (single precision)
Note: file tpx version 103, software tpx version 122
Changing nstlist from 10 to 25, rlist from 0.606 to 0.673

On host biovm 1 GPU selected for this run.
Mapping of GPU IDs to the 1 GPU task in the 1 rank on this node:
PP tasks will do (non-perturbed) short-ranged and most bonded interactions on the GPU
PP task will update and constrain coordinates on the CPU
Using 2 MPI processes
Using 8 OpenMP threads per MPI process

Back Off! I just backed up md.xtc to ./#md.xtc.5#

Back Off! I just backed up md.trr to ./#md.trr.5#

Back Off! I just backed up md.edr to ./#md.edr.5#

NOTE: DLB will not turn on during the first phase of PME tuning
starting mdrun 'LYSOZYME in water'
10000000 steps,  20000.0 ps.
[biovm:84668] 1 more process has sent help message help-orte-rmaps-base.txt / deprecated
[biovm:84668] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

NOTE: DLB can now turn on, when beneficial

step 15891: resetting all time and cycle counters

Step 32475: Run time exceeded 0.082 hours, will terminate the run within 25 steps

Dynamic load balancing report:
 DLB was off during the run due to low measured imbalance.
 Average load imbalance: 0.8%.
 The balanceable part of the MD step is 45%, load imbalance is computed from this.
 Part of the total run time spent waiting due to load imbalance: 0.3%.

               Core t (s)   Wall t (s)        (%)
       Time:     2378.561      148.660     1600.0
                 (ns/day)    (hour/ns)
Performance:       19.307        1.243

By the way, I have to set -cpus-per-rank 8 to allow each gmx_mpi process to fully utilize the 8 vCPUs; without this option, each process uses only 1 vCPU. I also have to set -mca btl_tcp_if_include ens192, otherwise mpirun throws an error. I am not sure whether my configuration is correct or I missed a setting. Please help. Thank you.
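In case it matters, based on the deprecation notice in the log above, I think the non-deprecated equivalent of my command would use the --map-by syntax, something like the sketch below (PE=8 matching the 8 vCPUs per rank; I haven't verified whether this changes the behavior):

```shell
# Sketch: same run expressed with the replacement syntax from the
# deprecation notice (--map-by <obj>:PE=N instead of -cpus-per-rank).
# PE=8 binds 8 processing elements (vCPUs) to each of the 2 ranks.
mpirun -np 2 --map-by numa:PE=8 -hostfile nodes \
       -mca btl_tcp_if_include ens192 \
       /usr/local/gromacs/bin/gmx_mpi mdrun -deffnm md -maxh 0.08333 -resethway -ntomp 8
```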