GROMACS version: 2022.4
GROMACS modification: No
Hi, maybe I am asking a stupid question here. Our workstation has one RTX 3090 Ti installed in a PCIe 5.0 x16 slot, and we use it to run standard MD simulations with GROMACS.
Right now the GPU sits inside the workstation, and we would like to run two GPUs in parallel. However, because of the dimensions of the RTX 3090 Ti, we cannot use the other PCIe x16 slots, so the second GPU would have to sit outside the case (as in a mining-rig configuration).
There are PCIe x16-to-x16 riser cables on the market, but in mining setups with many GPUs the slot is usually converted from x16 to x1, and in that scenario the data transfer speed suffers a lot.
I read one post that may be related to my bandwidth question (1 GPU vs 4 GPU per single node; performance). There pszilard said that the data needs to be moved GPU->CPU->GPU on every step, so I can imagine that converting x16 to x1 would cause a severe slowdown. Moreover, the GPU-resident loop has some limitations for my system, so that option is already excluded.
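(For reference, the GPU-resident path I am ruling out would be launched roughly like below. This is only a sketch; whether it actually runs depends on the system, e.g. on the constraint and virtual-site setup:
gmx mdrun -v -nb gpu -pme gpu -bonded gpu -update gpu -nt 8 -pin on -deffnm npt
With -update gpu the coordinates and forces can stay resident on the GPU between steps, so most of the per-step GPU->CPU->GPU traffic is avoided.)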
Before I get to my question, some context. My system is fairly simple: 8 riboflavin molecules in an 8x8x8 TIP4P water box, and I run a non-MPI build of GROMACS like this:
gmx mdrun -v -nb gpu -pme gpu -bonded cpu -nt 8 -pin on -deffnm npt
After a 100 ps NPT test simulation I got the following time accounting:
     R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G

On 1 MPI rank, each using 8 OpenMP threads

 Computing:             Num   Num      Call    Wall time     Giga-Cycles
                        Ranks Threads  Count      (s)       total sum    %
-----------------------------------------------------------------------------
 Vsite constr.             1    8     100001       2.282        62.401   3.8
 Neighbor search           1    8       1251       3.150        86.130   5.3
 Launch GPU ops.           1    8     100001       2.038        55.732   3.4
 Force                     1    8     100001       1.350        36.914   2.3
 Wait PME GPU gather       1    8     100001       8.358       228.497  14.1
 Reduce GPU PME F          1    8     100001       2.322        63.477   3.9
 Wait GPU NB local                                 8.367       228.743  14.1
 NB X/F buffer ops.        1    8     198751      10.258       280.469  17.3
 Vsite spread              1    8     110002       1.082        29.574   1.8
 Write traj.               1    8        201       1.457        39.838   2.5
 Update                    1    8     100001       1.845        50.433   3.1
 Constraints               1    8     100001       2.527        69.087   4.3
 Rest                                             14.295       390.836  24.1
-----------------------------------------------------------------------------
 Total                                            59.331      1622.130 100.0
-----------------------------------------------------------------------------

               Core t (s)   Wall t (s)        (%)
       Time:      474.651       59.331      800.0
                 (ns/day)    (hour/ns)
Performance:      291.248        0.082
Finished mdrun on rank 0 Thu Feb 16 11:13:05 2023
As you can see, the "Wait PME GPU gather" (14.1%) and "Wait GPU NB local" (14.1%) lines are significant. Moreover, the "Rest" time, which I read as an indicator of inefficient parallel execution, is also significant (24.1%).
So in this case, why does the CPU wait for data from the GPU? Is it waiting for the GPU calculation to finish, or is it a data-transfer bottleneck caused by the PCIe connection? I cannot imagine the latter, because our PCIe 5.0 x16 slot offers about 63 GB/s (and even though the RTX 3090 Ti itself is a PCIe 4.0 card, that still means roughly 32 GB/s in practice).
If data transfer is not the decisive factor, does that mean we do not need that much bandwidth between CPU and GPU (as in crypto-mining setups)?
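As a rough back-of-envelope check (assuming the box holds on the order of 70,000 atoms, which is only a guess for an 8x8x8 TIP4P box): one coordinate transfer is about 70,000 atoms x 3 x 4 bytes ≈ 0.8 MB, and with the nonbonded and PME forces coming back the per-step traffic is maybe 2-3 MB. At the ~1,700 steps/s of the run above (100,001 steps in 59.3 s), that is roughly 3-5 GB/s sustained; that is nothing for an x16 link, but already at or above what a PCIe 3.0 x1 riser (~1 GB/s) can deliver.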
Many thanks in advance for your help.
John