MD Performance dependency on PCIe bandwidth

GROMACS version: 2022.4
GROMACS modification: No

Hi, maybe I am asking a stupid question here. In our workstation we have one RTX 3090 Ti installed in a PCIe 5.0 x16 slot, which we use to run standard MD simulations with GROMACS.
Right now the GPU sits inside the workstation, and we would like to run two GPUs in parallel. However, due to the dimensions of the RTX 3090 Ti, we cannot use the other PCIe x16 slots, so we would need to place the second GPU outside the case (as in a mining-rig configuration).

There are PCIe x16<->x16 cables and risers on the market. However, in mining setups with many GPUs, the usual approach is to convert x16 to x1, and in that case the data transfer speed suffers a lot (an x1 link has only 1/16 of the bandwidth of an x16 link of the same PCIe generation).

I read one post that may be related to my issue here about bandwidth (1 GPU vs 4 GPUs per single node; performance). There, pszilard said that the data need to be moved GPU->CPU->GPU on each step. So I can imagine that converting x16 to x1 would cause a severe reduction in speed. Moreover, the GPU-resident loop has some limitations for my case, so that option is already excluded.

My question is:

My system is fairly simple: 8 riboflavin molecules in an 8x8x8 TIP4P water box. I run the non-MPI version of GROMACS like this:

gmx mdrun -v -nb gpu -pme gpu -bonded cpu -nt 8 -pin on -deffnm npt

After a 100 ps NPT test simulation, I got the following time accounting:

     R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G

On 1 MPI rank, each using 8 OpenMP threads

 Computing:          Num   Num      Call    Wall time         Giga-Cycles
                     Ranks Threads  Count      (s)         total sum    %
-----------------------------------------------------------------------------
 Vsite constr.          1    8     100001       2.282         62.401   3.8
 Neighbor search        1    8       1251       3.150         86.130   5.3
 Launch GPU ops.        1    8     100001       2.038         55.732   3.4
 Force                  1    8     100001       1.350         36.914   2.3
 Wait PME GPU gather    1    8     100001       8.358        228.497  14.1
 Reduce GPU PME F       1    8     100001       2.322         63.477   3.9
 Wait GPU NB local                              8.367        228.743  14.1
 NB X/F buffer ops.     1    8     198751      10.258        280.469  17.3
 Vsite spread           1    8     110002       1.082         29.574   1.8
 Write traj.            1    8        201       1.457         39.838   2.5
 Update                 1    8     100001       1.845         50.433   3.1
 Constraints            1    8     100001       2.527         69.087   4.3
 Rest                                          14.295        390.836  24.1
-----------------------------------------------------------------------------
 Total                                         59.331       1622.130 100.0
-----------------------------------------------------------------------------

               Core t (s)   Wall t (s)        (%)
       Time:      474.651       59.331      800.0
                 (ns/day)    (hour/ns)
Performance:      291.248        0.082
Finished mdrun on rank 0 Thu Feb 16 11:13:05 2023

As you can see, the lines "Wait PME GPU gather" (14.1%) and "Wait GPU NB local" (14.1%) are significant. Moreover, the Rest time, which I take to indicate inefficiency of the parallel execution, is also significant.

So in this case, I wonder why the CPU waits for the data from the GPU. Does it wait for the calculation on the GPU to finish, or is it rather a data transfer bottleneck caused by the PCIe connection? I cannot imagine the latter, because our PCIe x16 slot is version 5.0, which has a transfer speed of about 63 GB/s.

If the data transfer is not decisive, does that mean we do not need such high bandwidth between CPU and GPU (as in crypto mining)?

Many thanks in advance for your help.

John

Hi John,

You are running the force-offload mode; that is why you see the CPU waiting for GPU results, and it is not the most efficient mode for your hardware. Use the GPU-resident mode instead (see https://manual.gromacs.org/documentation/current/user-guide/mdrun-performance.html#types-of-gpu-tasks), which will reduce the CPU-GPU communication and will also reduce the performance dependence on it (although some activity, like frequent temperature coupling, can still come into play). In such single-GPU runs, which do not require much CPU-GPU data movement, I imagine that even PCIe x1 might not have a large impact on performance.
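To make that concrete, a minimal sketch (assuming your system has no features that are incompatible with -update gpu, such as virtual sites or constraint pulling) is to add -update gpu to your current command line:

# force-offload (what you ran): integration and constraints stay on the CPU
gmx mdrun -v -nb gpu -pme gpu -bonded cpu -nt 8 -pin on -deffnm npt
# GPU-resident: also offload update and constraints to the GPU
gmx mdrun -v -nb gpu -pme gpu -bonded cpu -update gpu -nt 8 -pin on -deffnm npt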

Multi-GPU is a separate question: if you intend to run a single simulation across multiple GPUs, PCIe x1 will surely not be sufficient.

Side note: always share full logs; that way, all the information contained in the log can help in diagnosing issues.

Cheers,
Szilárd

Hi Szilard,

thanks for your rapid reply. The problem with the GPU-resident mode is that I cannot use virtual sites or, for my specific problem, the pulling code and FEP.

-update

    Used to set where to execute update and constraints, when present. Can be set to "auto", "cpu", "gpu". Defaults to "auto", which currently always uses the CPU. Setting "gpu" requires that a compatible CUDA GPU is available and that the simulation uses a single rank.
    Update and constraints on a GPU is currently not supported with mass and constraints free-energy perturbation, domain decomposition, virtual sites, Ewald surface correction, replica exchange, constraint pulling, orientation restraints and computational electrophysiology.

  1. I suppose the reduction of CPU-GPU data movement can be realized by using the GPU-resident mode. This is what I understood from the manual:
GROMACS supports two major offload modes: force-offload and GPU-resident.
The former offloads some or all of the interaction calculations while keeping integration on the CPU (hence requiring per-step data movement).
In the GPU-resident mode, by also offloading integration and constraints (when used), less data movement is necessary.

So, in my case, where I used the force-offload mode, did the CPU wait for the non-bonded and PME results from the GPU because
a) it is "stuck" on or limited by the connection's bandwidth, or
b) the calculation of the PME and non-bonded parts on the GPU simply takes that long?

I have now attached a new log file from a longer simulation.

npt.log (671.7 KB)

  2. If I shift the PME calculation to the GPU, the efficiency increases, as one can see in the Rest time. The overall performance, however, is slightly lower. I assume this is due to the number of cores and the CPU type.

npt_2.log (672.6 KB)

  3. Of course, in the multi-GPU scenario I am considering NVLink between the two GPUs. Still, I assume that with the force-offload mode the CPU<->(GPU1<-NVLink->GPU2) data transfers can hardly benefit from the large bandwidth unless the GPU-resident mode is used. Is my assumption correct?

Many thanks,

John

Indeed, vsites are not supported with GPU-resident update.

Correct, the GPU-resident mode is what reduces the CPU-GPU data movement.

It waits for the forces to be transferred back to the CPU for integration; see Fig. 2 of https://aip.scitation.org/doi/full/10.1063/5.0018516, where the black arrows mark the points at which the CPU waits.

Even if you have fast PCIe transfers, the GPU will be idle during integration (and, in your case, during the vsite computation), which is wasteful, especially when you have a very fast GPU.

On modern workstations and servers with fast GPUs, it is better to keep the GPU busy at the cost of leaving the CPU idle; that is what the GPU-resident mode is designed to do. In an ideal setup, the GPU is busy with computation >=90% of the time (you can monitor that using nvidia-smi).
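For example, something like the following (standard nvidia-smi invocations, nothing GROMACS-specific) prints the GPU utilization once per second while mdrun is running:

# per-device utilization, sampled every second
nvidia-smi dmon -s u
# or a CSV stream of only the compute utilization
nvidia-smi --query-gpu=utilization.gpu --format=csv -l 1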

The Rest time does not reflect efficiency. I think you got it backward: your run with -pme gpu is faster, but since you have a fast CPU, not by a lot.

You will get some benefit from direct GPU communication alone, be it across NVLink or PCIe, but the benefits are maximized when combined with the GPU-resident mode (see Fig. 12 of the paper linked above). The benefit is not only bandwidth, but also that transfers are done more efficiently. Note, however, that your system is only ~50000 atoms, which is barely enough to saturate your 3090 Ti, so you will not get much benefit from using two cards in the same run, while running two independent simulations will give you nearly double the combined throughput.
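As a rough sketch of that last option (the directory names, thread counts, and pinning offsets are placeholders you would adapt to your machine), you can launch one mdrun per GPU and pin each to a disjoint set of CPU cores:

# simulation 1 on GPU 0, pinned to the first 8 cores
(cd sim1 && gmx mdrun -nb gpu -pme gpu -deffnm npt -nt 8 -pin on -pinoffset 0 -pinstride 1 -gpu_id 0) &
# simulation 2 on GPU 1, pinned to the next 8 cores
(cd sim2 && gmx mdrun -nb gpu -pme gpu -deffnm npt -nt 8 -pin on -pinoffset 8 -pinstride 1 -gpu_id 1) &

If you do later want a single simulation across both GPUs, direct GPU communication in the 2022 series can be enabled with the GMX_ENABLE_DIRECT_GPU_COMM environment variable (thread-MPI builds), but as noted above your system is too small to profit much from that.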

Cheers,
Szilárd