It is pretty stable, but still not as thoroughly tested as CPU-staged communication (which is why we don’t enable it by default). In particular, it is less tested in runs using various “features” like electric fields or FEP, or when domain decomposition corner cases are encountered (e.g., too many ranks).
To be on the safe side, you can try running short simulations both with GPU direct communication (set GMX_ENABLE_DIRECT_GPU_COMM=1 in GROMACS 2022, for both thread-MPI and “real” MPI) and without, and compare the results and the speed-up achieved.
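A minimal A/B comparison could look like the sketch below (the .tpr name, step count, and log names are placeholders, and the gmx invocations are shown as comments since they require an actual GROMACS build and GPU hardware):

```shell
# Run 1: default CPU-staged communication (GROMACS 2022)
#   gmx mdrun -s topol.tpr -nsteps 10000 -g staged.log

# Run 2: direct GPU communication (works with both thread-MPI and real MPI)
export GMX_ENABLE_DIRECT_GPU_COMM=1
#   gmx mdrun -s topol.tpr -nsteps 10000 -g direct.log

# Then compare energies and ns/day between staged.log and direct.log
echo "GMX_ENABLE_DIRECT_GPU_COMM=$GMX_ENABLE_DIRECT_GPU_COMM"
```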
I am not a GROMACS user, so my concern is not from the user’s perspective but from a cluster resource management one.
Based on Dr. Páll’s publication, J. Chem. Phys. 153, 134110 (2020), we would like to provide a guideline to our users on how to effectively utilize our HGX-A100 servers. For clarification, we are considering the halo-exchange via GPUDirect:
According to our tests, and NVIDIA’s published results on the STMV system, the GPUDirect implementation clearly outperforms the pre-2019 one.
80 GB per A100 is more than enough to accommodate most of our users’ MD workloads, so we are leaning toward recommending a single A100 with SLURM shared mode. Thus, offloading most kernels to the GPU is the most efficient way to use an A100.
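For reference, a single-GPU job under such a policy might look like the sbatch sketch below (the --gres syntax, core count, and input file name are placeholders that depend on the site configuration):

```shell
#!/bin/bash
#SBATCH --gres=gpu:a100:1      # one A100 per job on a shared node
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16     # placeholder; match your cores-per-GPU ratio

# Offload all major force and update work to the single GPU
# (GPU-resident mode, triggered by -update gpu)
gmx mdrun -s topol.tpr -nb gpu -pme gpu -bonded gpu -update gpu \
          -ntmpi 1 -ntomp "${SLURM_CPUS_PER_TASK}"
```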
We are aware of the multi-node extension in 2022, but decided that it is too new to adopt at this early stage.
Based on your comments, we will provide both the standard and GPUDirect implementations, with a warning regarding the latter.
If you have further suggestions, please do let us know.
The former have been superseded by export GMX_ENABLE_DIRECT_GPU_COMM=true (for backward compatibility, in the current release either of the former two variables is equivalent to the new one). The latter, GMX_FORCE_UPDATE_DEFAULT_GPU, remains; however, since it only changes a default value that a user can also change themselves on the command line, from a usability perspective it would be better to recommend that users use the command-line flag -update gpu. The added benefit is that when this is used explicitly and the simulation setup is not compatible with the GPU-resident mode (triggered by offloading the update), mdrun will issue an error, so the user will know that their request to offload the update cannot be fulfilled.
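To illustrate the difference between the two routes (topol.tpr is a placeholder, and the gmx calls are commented out since they need a GROMACS install):

```shell
# Explicit request: mdrun aborts with a clear error if the setup
# cannot run GPU-resident, rather than silently staying on the CPU.
#   gmx mdrun -s topol.tpr -update gpu

# Environment-variable route: only changes the default, so an
# incompatible setup falls back without any hard error.
#   GMX_FORCE_UPDATE_DEFAULT_GPU=1 gmx mdrun -s topol.tpr
echo "prefer -update gpu over GMX_FORCE_UPDATE_DEFAULT_GPU"
```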
That makes sense. On NVLink systems, using direct GPU communication will always be faster than, or at least as fast as, staged communication, and even on PCIe systems it can often give a slight performance benefit.
Certainly, memory size is rarely, if ever, a concern. Note that if you have a decent number of free cores per GPU, -bonded cpu can give a slight performance benefit (see the cross-over of the yellow and green curves in Fig. 10 of https://doi.org/10.1063/5.0018516).
Without PME decomposition, multi-node scaling is unfortunately quite limited. However, we are working on an extension, and thanks to the recently released distributed cuFFTMp library we will improve multi-node scaling in the future.
When using only GMX_GPU_DD_COMM and GMX_GPU_PME_PP_COMMS in 2022.1, the following warning was printed to stderr:
GPU-aware MPI detected, but by default GROMACS will not make use the direct GPU communication capabilities of MPI.
For improved performance try enabling the feature by setting the GMX_ENABLE_DIRECT_GPU_COMM environment variable.
We were confused, since there was no performance difference with GMX_ENABLE_DIRECT_GPU_COMM=true. Thanks for the clarification that the two are actually equivalent for the sake of backward compatibility.
We observed less than a 10% performance difference between Open MPI and thread-MPI.