Suggestions for optimal task splitting on 8 x RTX2080ti

GROMACS version: 2021.4
GROMACS modification: No

Dear All,

I’m trying to set up an MD simulation of a fairly large membrane protein (~300k atoms in the system) and would be grateful if somebody could provide suggestions on how to improve the performance (if at all possible).

I have a node with 8 x RTX2080ti, 64 cores (HT; 2.6GHz Xeon Gold 6142), and 384GB.

If I run this command:

gmx mdrun -v -deffnm ${istep} -ntmpi 8 -nb gpu -pme gpu -npme 1

I get the following report:

Dynamic load balancing report:
DLB was turned on during the run due to measured imbalance.
Average load imbalance: 11.6%.
The balanceable part of the MD step is 66%, load imbalance is computed from this.
Part of the total run time spent waiting due to load imbalance: 7.6%.
Steps where the load balancing was limited by -rdd, -rcon and/or -dds: Z 0 %
Average PME mesh/force load: 1.028
Part of the total run time spent waiting due to PP/PME imbalance: 1.3 %

NOTE: 7.6 % of the available CPU time was lost due to load imbalance
in the domain decomposition.
You can consider manually changing the decomposition (option -dd);
e.g. by using fewer domains along the box dimension in which there is
considerable inhomogeneity in the simulated system.

               Core t (s)   Wall t (s)        (%)
       Time:    91964.923     1436.961     6400.0
                 (ns/day)    (hour/ns)
Performance:       60.127        0.399

The tasks were split in the following way:

8 GPUs selected for this run.
Mapping of GPU IDs to the 8 GPU tasks in the 8 ranks on this node:
PP:0,PP:1,PP:2,PP:3,PP:4,PP:5,PP:6,PME:7
PP tasks will do (non-perturbed) short-ranged interactions on the GPU
PP task will update and constrain coordinates on the CPU
PME tasks will do all aspects on the GPU
Using 8 MPI threads
Using 8 OpenMP threads per tMPI thread

Is this the most I can squeeze out of this hardware? Thanks a bunch in advance for any suggestions.

With best wishes,
Andrija


Hi,

Your current command is likely to be very inefficient: it is missing GPU direct communication, and several parts of the calculation that could run on the GPU are assigned to the CPU. I also recommend updating to the latest 2023 version for best performance.
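
As a rough sketch (assuming GROMACS 2023; the thread counts and the -nstlist value are only starting points and should be tuned for your system and node), a more GPU-resident launch would look something like:

export GMX_ENABLE_DIRECT_GPU_COMM=1   # enables direct GPU-GPU communication (GROMACS 2022 and later)
gmx mdrun -v -deffnm ${istep} -ntmpi 8 -ntomp 8 -npme 1 \
    -nb gpu -pme gpu -bonded gpu -update gpu \
    -nstlist 300 -pin on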

Here are some NVIDIA resources with more info:

Blog articles:
Creating Faster Molecular Dynamics Simulations with GROMACS 2020
Maximizing GROMACS Throughput with Multiple Simulations per GPU Using MPS and MIG
Massively Improved Multi-node NVIDIA GPU Scalability with GROMACS
A Guide to CUDA Graphs in GROMACS 2023

Recent NVIDIA GTC presentation:
Cutting-Edge CUDA Technologies for Molecular Dynamics and Beyond

GROMACS NGC container and documentation (the documentation includes useful example commands, even if you are not using the container)

Best regards,

Alan Gray (NVIDIA)


Hi Andrija,

To add to Alan’s recommendation: since you do not have a high-performance interconnect between the GPUs (e.g. NVLink), even if you change your launch command to use direct GPU communication, you will likely not see much benefit beyond 3-4 GPUs.
That is because multi-GPU runs copy data across the relatively slow PCIe bus (and across the inter-socket bus, assuming you have two CPUs with 4 GPUs attached to each), so this slow communication becomes the performance limiter.

I would encourage you to always test scaling (i.e. run on 1,2,3,…8 GPUs) before launching production runs on many GPUs.
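
For example, a quick scaling check can be scripted along these lines (just a sketch; topol.tpr is a placeholder for your input, and the thread counts assume the 64 hardware threads of your node):

for ngpu in 1 2 4 8; do
    npme_opt=""
    [ ${ngpu} -gt 1 ] && npme_opt="-npme 1"    # a separate PME rank only makes sense with more than one rank
    gmx mdrun -s topol.tpr -deffnm scaling_${ngpu}gpu \
        -ntmpi ${ngpu} -ntomp $((64 / ngpu)) ${npme_opt} \
        -nb gpu -pme gpu -nsteps 20000 -resethway -noconfout -pin on
done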

Cheers,
Szilárd


Hi Alan and Szilárd,

Thank you so much for the comments - I really appreciate it!

I tested 1/4/8 GPUs and observed the following:

1 GPU: 18.624 ns/day
4 GPUs: 46.824 ns/day
8 GPUs: 60.127 ns/day

This is all using 1 rank per GPU and 8 OpenMP threads per rank.

I’ll run a grid search over -ntmpi, -npme, -ntomp, and -ntomp_pme, and include the options from one of Alan’s links, e.g.:

-update gpu -bonded gpu -dlb no -nstlist 300 -pin on

and will let you know what happens (a rough sketch of the loop I have in mind is below). In all my previous attempts setting thread affinity failed, but perhaps this will be resolved by the version update.
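
Something along these lines (just to illustrate what I mean; the value ranges are arbitrary, -npme is fixed at 1 since PME offloaded to a GPU normally runs on a single separate rank, and mdrun will reject any combination it does not support):

for ntmpi in 4 8; do
    ntomp=$((64 / ntmpi))                          # spread the 64 hardware threads over the PP ranks
    for ntomp_pme in $((ntomp / 2)) ${ntomp}; do   # try fewer/equal CPU threads on the single PME rank
        gmx mdrun -s topol.tpr -deffnm grid_mpi${ntmpi}_pmet${ntomp_pme} \
            -ntmpi ${ntmpi} -npme 1 -ntomp ${ntomp} -ntomp_pme ${ntomp_pme} \
            -nb gpu -pme gpu -bonded gpu -update gpu \
            -dlb no -nstlist 300 -pin on \
            -nsteps 20000 -resethway -noconfout
    done
done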

Thanks again for your help.

Best wishes,
Andrija

Hi Alan,

Is it possible to get access to the .mdp file you used for the cellulose case study in Creating Faster Molecular Dynamics Simulations with GROMACS 2020?

Thanks!

Best wishes,
Andrija

Hi, I’ve pasted it below. Also, for multi-GPU runs, please make sure that, in addition to the mdrun options you mentioned, you also set the environment variable that enables direct GPU communication:
export GMX_ENABLE_DIRECT_GPU_COMM=1
(which is recognized by GROMACS v2022 and later).

Cellulose mdp file:

integrator = md
nsteps = -1
nstlist = 10
nstfout = 0
nstxout = 0
nstvout = 0
nstxout-compressed = 0
nstlog = 0
nstenergy = 0
dt = 0.002
constraints = h-bonds
coulombtype = PME ; !autoset
rcoulomb = 0.8 ; !autoset
vdwtype = Cut-off
rvdw = 0.8 ; !autoset
tcoupl = v-rescale
tc_grps = system
tau_t = 0.1
ref_t = 300

freezegrps = ; !autoset
freezedim = ; !autoset

fourier_spacing = 0.1 ; !autoset
nstcalcenergy = 500 ; !autogen
cutoff-scheme = verlet ; !autogen

Note that with most force fields you cannot change the Lennard-Jones cut-off to 0.8 nm; you have to use what the force field has been parametrized with. (This cellulose test case is special.)
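
For example, a CHARMM36-based membrane system is typically run with non-bonded settings along these lines (illustrative only; always use the values your force field prescribes):

; typical CHARMM36-style non-bonded settings
cutoff-scheme = Verlet
coulombtype   = PME
rcoulomb      = 1.2
vdwtype       = Cut-off
vdw-modifier  = Force-switch
rvdw-switch   = 1.0
rvdw          = 1.2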