Status of GPU Direct Implementation

vdle · April 12, 2022, 2:55pm

GROMACS version: 2021.3
GROMACS modification: No

Hi,

The GPU Direct feature was first introduced in 2020:
https://developer.nvidia.com/blog/creating-faster-molecular-dynamics-simulations-with-gromacs-2020/
https://manual.gromacs.org/documentation/2020/release-notes/2020/major/performance.html

I would like to ask whether it has reached a mature state and can be used for production runs.

Regards.

al42and · May 18, 2022, 1:12pm

Hello!

It is pretty stable, but it is still not as thoroughly tested as CPU-staged communication (which is why we don’t enable it by default). This applies in particular to runs with various “features” like electric field or FEP, and to cases where domain decomposition corner cases are encountered (e.g., too many ranks).

To be on the safe side, you can try running short simulations both with GPU direct communication (set GMX_ENABLE_DIRECT_GPU_COMM=1 in GROMACS 2022, for both thread-MPI and “real” MPI) and without, and compare the results and the speed-up achieved.
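A minimal sketch of such a comparison, covering only the speed-up part. The input file name topol.tpr, the rank count, the step count, and the helper names are assumptions for illustration, not something prescribed by GROMACS; comparing the results themselves (e.g., energies) would be a separate step.

```shell
#!/usr/bin/env bash
# Sketch: short benchmark with and without direct GPU communication.
# Assumed (hypothetical) setup: topol.tpr exists, 4 thread-MPI ranks, 20000 steps.

# Print the ns/day figure from the "Performance:" line of an mdrun log file ($1).
ns_per_day() {
    awk '/^Performance:/ { print $2 }' "$1"
}

# Run one short benchmark.
#   $1 = output prefix (the log ends up in <prefix>.log)
#   $2 = 1 to enable direct GPU communication, anything else to use staged communication
run_case() {
    if [ "$2" = "1" ]; then
        export GMX_ENABLE_DIRECT_GPU_COMM=1
    else
        unset GMX_ENABLE_DIRECT_GPU_COMM
    fi
    gmx mdrun -s topol.tpr -deffnm "$1" -ntmpi 4 -nb gpu -pme gpu -npme 1 \
        -nsteps 20000 -resethway -noconfout
}

# Print the ratio direct/staged of the ns/day values.
#   $1 = log of the staged run, $2 = log of the direct run
speedup() {
    awk -v a="$(ns_per_day "$1")" -v b="$(ns_per_day "$2")" \
        'BEGIN { printf "%.2f\n", b / a }'
}
```

Usage would be along the lines of `run_case staged 0; run_case direct 1; speedup staged.log direct.log`. A value above 1 means the direct-communication run was faster.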

vdle · May 23, 2022, 5:42am

Thanks for your comments.

I am not a GROMACS user, so my concern is not from the user’s perspective but from a cluster resource management one.

Based on Dr. Páll’s publication, J. Chem. Phys. 153, 134110 (2020), we would like to provide a guideline to our users on how to effectively utilize our HGX-A100 servers. For clarification, we are considering the halo-exchange via GPUDirect:

export GMX_GPU_DD_COMMS=true
export GMX_GPU_PME_PP_COMMS=true
export GMX_FORCE_UPDATE_DEFAULT_GPU=true

According to our tests and NVIDIA’s published results on the STMV system, the GPUDirect implementation clearly outperforms the pre-2019 implementation.
80 GB per A100 is more than enough to accommodate most of our users’ MD workloads, so we are leaning toward recommending a single A100 in SLURM shared mode. Offloading most of the kernels to the GPU is thus the most efficient way to use an A100.
We are aware of the multi-node extension in 2022, but decided that it is too new to adopt at this early stage.

Based on your comments, we will provide both the standard and the GPUDirect implementations, with a warning regarding the latter.

If you have further suggestions, please do let us know.

pszilard · May 23, 2022, 7:20pm

The former two have been superseded by “export GMX_ENABLE_DIRECT_GPU_COMM=true” (for backward compatibility, in the current release either of the former two is equivalent to the new variable). The latter, GMX_FORCE_UPDATE_DEFAULT_GPU, remains. However, since it only changes a default value that users can change themselves on the command line, from a usability perspective it may be better to recommend that users pass the command-line flag -update gpu. The added benefit is that when the flag is given explicitly and the simulation setup is not compatible with the GPU-resident mode (triggered by offloading the update), mdrun will issue an error, so users will know that their request to offload the update cannot be fulfilled.
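As a rough sketch of what such a recommendation could look like in a job script: the wrapper name run_gpu_resident and the GMX override variable are hypothetical helpers introduced here for illustration, while the mdrun flags themselves are the regular ones.

```shell
#!/usr/bin/env bash
# Sketch: request GPU-resident mode explicitly instead of relying on
# GMX_FORCE_UPDATE_DEFAULT_GPU.

# Single variable that supersedes GMX_GPU_DD_COMMS / GMX_GPU_PME_PP_COMMS in 2022.
export GMX_ENABLE_DIRECT_GPU_COMM=true

# Hypothetical wrapper: run mdrun on the .tpr file given as $1 with nonbonded,
# PME and update all requested on the GPU. Because -update gpu is explicit,
# mdrun reports an error if the setup is incompatible with GPU-resident mode.
# Any further arguments are passed through to mdrun unchanged.
# GMX is an assumed override for the gmx binary name (e.g., gmx_mpi).
run_gpu_resident() {
    tpr=$1
    shift
    "${GMX:-gmx}" mdrun -s "$tpr" -nb gpu -pme gpu -update gpu "$@"
}
```

A call such as `run_gpu_resident topol.tpr -ntmpi 1` would then start a single-rank run; extra flags like `-bonded cpu` can be appended in the same way.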

That makes sense. On NVLink systems, direct GPU communication will always be faster than, or at least as fast as, staged communication, and even on PCIe systems it can often have a slight performance benefit.

Certainly, memory size is rarely if ever a concern. Note that if you have a decent number of free cores per GPU, -bonded cpu can have a slight performance benefit (see the cross-over of the yellow and green curves in Fig. 10 of “Heterogeneous parallelization and acceleration of molecular dynamics simulations in GROMACS”, J. Chem. Phys. 153, 134110 (2020)).

Without PME decomposition, multi-node scaling is unfortunately quite limited. However, we are working on an extension and thanks to the recently released distributed cuFFTmp we will improve multi-node scaling in the future.

Feel free to ask if you have further questions.

Cheers,

Szilard

vdle · May 27, 2022, 7:26am

When using only GMX_GPU_DD_COMMS and GMX_GPU_PME_PP_COMMS in 2022.1, the following warning was printed to stderr:

GPU-aware MPI detected, but by default GROMACS will not make use the direct GPU communication capabilities of MPI. 
For improved performance try enabling the feature by setting the GMX_ENABLE_DIRECT_GPU_COMM environment variable.

We were confused, since there was no performance difference with GMX_ENABLE_DIRECT_GPU_COMM=true. Thanks for clarifying that the two are actually equivalent for the sake of backward compatibility.

For 2022.1:

We observed less than 10% performance difference between Open MPI and thread-MPI.
We haven’t set up A100-PCIe yet, but according to NVIDIA’s published data, the difference w.r.t. NVLink is negligible up to 4 A100s. However, there is a noticeable 30% drop with 8 A100s. https://developer.nvidia.com/hpc-application-performance

We have been serving the 2016 and 2019 versions to our users for a long time. The following table was prepared as a guideline to the new features.

Version	-bonded	-nb	-pme	-update	GPUDirect	CUDA-aware MPI	Notes
2016	cpu	gpu	cpu	cpu	no	no
2018	cpu	gpu	gpu	cpu	no	no
2019	cpu/gpu	gpu	gpu	cpu	no	no
2020	cpu/gpu	gpu	gpu	gpu	yes	no	GMX_GPU_DD_COMMS=true GMX_GPU_PME_PP_COMMS=true; Thread-MPI only
2021	cpu/gpu	gpu	gpu	gpu	yes	no	Improved NB kernels on V100 and A100
2022	cpu/gpu	gpu	gpu	gpu	yes	yes	GMX_ENABLE_DIRECT_GPU_COMM=true; Thread-MPI/Open MPI; 12% improved NB kernels on A100

Bonded interactions are best computed on either the CPU or the GPU, depending on the setup (Fig. 10).
PME is still done on a single GPU rank, so multi-node scaling is limited (Fig. 12).
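For a module file or job prologue, the version-dependent variable names in the table could be handled roughly as follows. This is only a sketch: the function names are hypothetical, and treating 2021 like 2020 is an assumption, since the table lists no variables for that row. Versions not covered by the table are deliberately left alone.

```shell
#!/usr/bin/env bash
# Sketch: pick the direct-GPU-communication variables by GROMACS version.

# Print the environment variable names that enable direct GPU communication
# for the version string given as $1 (e.g., "2022.1"). Prints nothing for
# versions without the feature or not covered by the table above.
direct_comm_vars() {
    case "$1" in
        2020*|2021*) echo "GMX_GPU_DD_COMMS GMX_GPU_PME_PP_COMMS" ;;  # 2021: assumed same as 2020
        2022*)       echo "GMX_ENABLE_DIRECT_GPU_COMM" ;;
        *)           : ;;
    esac
}

# Export every variable reported by direct_comm_vars with the value "true".
enable_direct_comm() {
    for v in $(direct_comm_vars "$1"); do
        export "$v=true"
    done
}
```

For example, `enable_direct_comm 2022.1` exports GMX_ENABLE_DIRECT_GPU_COMM=true, while `enable_direct_comm 2019.6` changes nothing.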

Thanks very much for clarifications regarding your publication.
