GROMACS version: 2021.2
GROMACS modification: No
I have a physical host with two NVIDIA GTX 1070 GPUs installed. I created 2 VMs on the physical host, each with 1 GPU and 8 vCPUs. My hostfile is below.
biovm slots=1
biovm1 slots=1
I run the following command, trying to leverage both VMs:
mpirun -np 2 -cpus-per-rank 8 -hostfile nodes -mca btl_tcp_if_include ens192 /usr/local/gromacs/bin/gmx_mpi mdrun -deffnm md -maxh 0.08333 -resethway -ntomp 8
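If I read the deprecation notice in the log below correctly, the non-deprecated equivalent of -cpus-per-rank 8 would be --map-by numa:PE=8, so the command would look something like this (my own translation of the notice, not something I have verified):
# assumed new-style mapping: each rank bound to 8 processing elements (PEs), mapped by NUMA domain
mpirun -np 2 --map-by numa:PE=8 -hostfile nodes -mca btl_tcp_if_include ens192 /usr/local/gromacs/bin/gmx_mpi mdrun -deffnm md -maxh 0.08333 -resethway -ntomp 8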
The problem is that performance on this 2-node cluster is really poor, only 19 ns/day. When I run the command below on just 1 VM, it achieves ~300 ns/day.
gmx_mpi mdrun -deffnm md -maxh 0.08333 -resethway
I noticed that when I run the mpirun command, CPU usage is <600% and GPU usage is in the single digits. However, when I run the command on 1 VM, it fully uses the CPU (800%) and GPU usage is ~60%.
Here is my log:
--------------------------------------------------------------------------
The following command line options and corresponding MCA parameter have
been deprecated and replaced as follows:
Command line options:
Deprecated: --cpus-per-proc, -cpus-per-proc, --cpus-per-rank, -cpus-per-rank
Replacement: --map-by <obj>:PE=N, default <obj>=NUMA
Equivalent MCA parameter:
Deprecated: rmaps_base_cpus_per_proc
Replacement: rmaps_base_mapping_policy=<obj>:PE=N, default <obj>=NUMA
The deprecated forms *will* disappear in a future version of Open MPI.
Please update to the new syntax.
--------------------------------------------------------------------------
GROMACS is free software; you can redistribute it and/or modify it
under the terms of the GNU Lesser General Public License
as published by the Free Software Foundation; either version 2.1
of the License, or (at your option) any later version.
GROMACS: gmx mdrun, version 2021.2
Executable: /usr/local/gromacs/bin/gmx_mpi
Data prefix: /usr/local/gromacs
Working dir: /data/fdai/gromacs
Command line:
gmx_mpi mdrun -deffnm md -maxh 0.08333 -resethway -ntomp 8
Back Off! I just backed up md.log to ./#md.log.6#
Reading file md.tpr, VERSION 5.1.2 (single precision)
Note: file tpx version 103, software tpx version 122
Changing nstlist from 10 to 25, rlist from 0.606 to 0.673
On host biovm 1 GPU selected for this run.
Mapping of GPU IDs to the 1 GPU task in the 1 rank on this node:
PP:0
PP tasks will do (non-perturbed) short-ranged and most bonded interactions on the GPU
PP task will update and constrain coordinates on the CPU
Using 2 MPI processes
Using 8 OpenMP threads per MPI process
Back Off! I just backed up md.xtc to ./#md.xtc.5#
Back Off! I just backed up md.trr to ./#md.trr.5#
Back Off! I just backed up md.edr to ./#md.edr.5#
NOTE: DLB will not turn on during the first phase of PME tuning
starting mdrun 'LYSOZYME in water'
10000000 steps, 20000.0 ps.
[biovm:84668] 1 more process has sent help message help-orte-rmaps-base.txt / deprecated
[biovm:84668] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
NOTE: DLB can now turn on, when beneficial
step 15891: resetting all time and cycle counters
Step 32475: Run time exceeded 0.082 hours, will terminate the run within 25 steps
Dynamic load balancing report:
DLB was off during the run due to low measured imbalance.
Average load imbalance: 0.8%.
The balanceable part of the MD step is 45%, load imbalance is computed from this.
Part of the total run time spent waiting due to load imbalance: 0.3%.
               Core t (s)   Wall t (s)        (%)
       Time:     2378.561      148.660     1600.0
                 (ns/day)    (hour/ns)
Performance:       19.307        1.243
By the way, I have to set -cpus-per-rank 8 to allow one gmx_mpi process to fully utilize the 8 vCPUs; without this option it only uses 1 vCPU. I also have to set -mca btl_tcp_if_include ens192, otherwise it throws an error. I am not sure whether my configuration is correct or I missed a setting. Please help. Thank you.
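In case it is relevant: my understanding from the Open MPI documentation is that MCA parameters like btl_tcp_if_include can also be set once as an environment variable (OMPI_MCA_<param>) instead of being passed with -mca on every run. Sketching what I think that would look like (an assumption on my part, not tested here):
# assumed equivalent of '-mca btl_tcp_if_include ens192' via the OMPI_MCA_ environment-variable convention
export OMPI_MCA_btl_tcp_if_include=ens192
mpirun -np 2 --map-by numa:PE=8 -hostfile nodes /usr/local/gromacs/bin/gmx_mpi mdrun -deffnm md -maxh 0.08333 -resethway -ntomp 8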