MD Performance dependency on PCIe bandwidth

Hi Szilard,

Thanks for your rapid reply. The problem with the GPU-resident mode is that I cannot use it with virtual sites or, for my specific problem, with the pulling code and FEP. From the manual:

-update

    Used to set where to execute update and constraints, when present. Can be set to “auto”, “cpu”, “gpu.” Defaults to “auto,” which currently always uses the CPU. Setting “gpu” requires that a compatible CUDA GPU is available, the simulation uses a single rank. 
Update and constraints on a GPU is currently not supported with mass and constraints free-energy perturbation, domain decomposition, virtual sites, Ewald surface correction, replica exchange, constraint pulling, orientation restraints and computational electrophysiology.

  1. I suppose the reduction of CPU-GPU data movement can be achieved by using the GPU-resident mode. This is what I understood from the manual:
GROMACS supports two major offload modes: force-offload and GPU-resident. 
The former involves offloading some of or all interaction calculations with integration on the CPU (hence requiring per-step data movement). 
In the GPU-resident mode by offloading integration and constraints (when used) less data movement is necessary.
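
To make the two modes concrete, here is a minimal sketch of how I understand the corresponding mdrun invocations. The flags `-nb`, `-pme` and `-update` are regular mdrun options; the run name `npt` is just taken from my log file name, and the helper `mdrun_command` is a hypothetical name used only for illustration. It shows the variant of force-offload in which both nonbonded and PME are offloaded.

```python
# Sketch: build the mdrun command line for the two offload modes.
# The run name "npt" is an assumption taken from the log file name.

def mdrun_command(mode, deffnm="npt"):
    """Return the gmx mdrun argument list for 'force-offload' or 'gpu-resident'."""
    if mode not in ("force-offload", "gpu-resident"):
        raise ValueError(f"unknown mode: {mode}")
    # Nonbonded and PME are offloaded to the GPU in both variants shown here.
    cmd = ["gmx", "mdrun", "-deffnm", deffnm, "-nb", "gpu", "-pme", "gpu"]
    # Only the location of update and constraints differs between the modes.
    cmd += ["-update", "gpu" if mode == "gpu-resident" else "cpu"]
    return cmd

for mode in ("force-offload", "gpu-resident"):
    print(mode, "->", " ".join(mdrun_command(mode)))
```

With `-update cpu`, coordinates and forces have to cross the CPU-GPU link every step; with `-update gpu` they can mostly stay on the device.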

So in my case, where I used force-offload mode, did the CPU wait for the nonbonded and PME results from the GPU because
a) the transfer is stuck on, or limited by, the bandwidth of the connection, or
b) the PME and nonbonded calculations on the GPU themselves take that long?
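
To put a number on option (a), here is a rough back-of-the-envelope sketch of the pure transfer time per step in force-offload mode. Everything in it is an assumption for illustration only: the atom count, the nominal link bandwidths, the fixed per-copy latency and the function name `per_step_transfer_seconds` are not taken from my log files, and the estimate ignores that transfers can overlap with computation.

```python
# Back-of-the-envelope estimate of per-step CPU<->GPU traffic in force-offload mode.
# All numbers are illustrative assumptions, not values from npt.log.

BYTES_PER_FLOAT = 4  # single (mixed) precision

def per_step_transfer_seconds(n_atoms, bandwidth_gb_per_s, latency_s=10e-6):
    """Time to copy coordinates to the GPU and forces back for one MD step."""
    bytes_one_way = n_atoms * 3 * BYTES_PER_FLOAT  # x, y, z per atom
    # Two copies per step: coordinates host->device, forces device->host.
    return 2 * (latency_s + bytes_one_way / (bandwidth_gb_per_s * 1e9))

n_atoms = 100_000  # assumed system size
# Nominal round-number bandwidths in GB/s (assumed, not measured).
for label, bandwidth in (("x8 link", 8.0), ("x16 link", 16.0)):
    t = per_step_transfer_seconds(n_atoms, bandwidth)
    print(f"{label}: {t * 1e6:.0f} us per step")
```

If the GPU wait time per step reported in the log is much larger than such an estimate, that would point towards (b) rather than (a).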

I have now attached a new log file from a longer simulation.

npt.log (671.7 KB)

  2. When I shifted the PME calculation to the GPU, the efficiency increased, as one can see in the rest time. The overall performance, however, is slightly lower. I assume that this is due to the number of cores and the CPU type.

npt_2.log (672.6 KB)

  3. Of course, in a multi-GPU scenario I would consider NVLink between the two GPUs. Still, I assume that in force-offload mode the CPU<->(GPU1<-NVLink->GPU2) data transfer can only benefit from a larger bandwidth, unless GPU-resident mode is used. Is my assumption correct?

Many thanks,

John