Status of GPU Direct Implementation

GROMACS version: 2021.3
GROMACS modification: No


GPU Direct feature was first introduced in 2020:

I would like to ask whether it has reached a mature state and can be used for production runs.



It is pretty stable, but it is still not as thoroughly tested as CPU-staged communication (which is why we don’t enable it by default). This applies in particular to runs with various “features” like electric field or FEP, or when domain decomposition corner cases are encountered (e.g., too many ranks).

To be on the safe side, you can try running short simulations both with GPU direct communication (set GMX_ENABLE_DIRECT_GPU_COMM=1 in GROMACS 2022, for both thread-MPI and “real” MPI) and without, and compare the results and the speed-up achieved.
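A minimal sketch of such a comparison follows. It assumes a thread-MPI build, an input file named topol.tpr, 4 ranks with one dedicated PME rank, and 20000 steps; all of these are illustrative choices, not recommendations. The helper functions `perf_ns_per_day` and `speedup` are hypothetical names that read the ns/day value from the "Performance:" line mdrun writes at the end of its log.

```shell
#!/bin/sh
# Sketch: run the same short simulation with staged and with direct GPU
# communication, then compare the ns/day reported in each log.
# Assumed for illustration: topol.tpr, 4 thread-MPI ranks, 20000 steps.

# Print the ns/day value from the "Performance:" line of an mdrun log.
perf_ns_per_day() {
    awk '/^Performance:/ { print $2 }' "$1"
}

# Print the ratio of two ns/day values (first / second), two decimals.
speedup() {
    awk -v d="$1" -v s="$2" 'BEGIN { printf "%.2f\n", d / s }'
}

# Only run when gmx and the input file are actually available.
if command -v gmx >/dev/null 2>&1 && [ -f topol.tpr ]; then
    # Staged communication (the default). Make sure no GPU-direct
    # variable is already set in the calling environment.
    gmx mdrun -s topol.tpr -deffnm staged -ntmpi 4 \
        -nb gpu -pme gpu -npme 1 -nsteps 20000

    # Direct GPU communication, enabled for this one command only.
    GMX_ENABLE_DIRECT_GPU_COMM=1 gmx mdrun -s topol.tpr -deffnm direct \
        -ntmpi 4 -nb gpu -pme gpu -npme 1 -nsteps 20000

    speedup "$(perf_ns_per_day direct.log)" "$(perf_ns_per_day staged.log)"
fi
```

The two energy files (staged.edr and direct.edr in this sketch) can then be inspected with gmx energy to check that the observables agree within the expected noise.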

Thanks for your comments.

I am not a GROMACS user, so my concern is not from the user’s perspective but from a cluster resource management one.

Based on Dr. Páll’s publication, J. Chem. Phys. 153, 134110 (2020), we would like to provide a guideline to our users on how to effectively utilize our HGX-A100 servers. For clarification, we are considering the halo-exchange via GPUDirect:

export GMX_GPU_DD_COMMS=true
export GMX_GPU_PME_PP_COMMS=true
  • According to our tests and NVIDIA’s published results on the STMV system, the GPUDirect implementation clearly outperforms the pre-2019 implementation.
  • 80 GB per A100 is more than enough to accommodate most of our users’ MD workloads, so we are leaning toward recommending a single A100 with SLURM share mode. Offloading most of the kernels to the GPU is thus the most efficient way to use an A100.
  • We are aware of the multi-node extension in 2022, but decided that it is too new to adopt at this early stage.
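As an illustration of the single-A100 recommendation, a job script could look roughly like the sketch below. The resource requests (one GPU, 8 cores), the file name md.tpr, and the helper name `mdrun_args` are assumptions made for the example; the mdrun flags simply request that nonbonded, PME, bonded and update work all run on the GPU, and -ntmpi applies to thread-MPI builds only.

```shell
#!/bin/sh
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=8

# Print mdrun arguments for one rank on one GPU with every offloadable
# task placed on the GPU. Falls back to 8 OpenMP threads (an assumed
# value) when SLURM_CPUS_PER_TASK is not set.
mdrun_args() {
    printf '%s\n' "-ntmpi 1 -ntomp ${SLURM_CPUS_PER_TASK:-8} -nb gpu -pme gpu -bonded gpu -update gpu"
}

# In the job script (assumes md.tpr exists in the working directory):
# gmx mdrun -deffnm md $(mdrun_args)
```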

Based on your comments, we will provide both standard and GPUDirect implementations, with a warning regarding the latter.

If you have further suggestions, please do let us know.

The former have been superseded by “export GMX_ENABLE_DIRECT_GPU_COMM=true” (for backward compatibility, in the current release either of the former two is equivalent to the new variable). The latter, GMX_FORCE_UPDATE_DEFAULT_GPU, remains. However, since it changes a default that users can change themselves on the command line, from a usability perspective it may be better to recommend the command line flag -update gpu. The added benefit is that if the flag is used explicitly and the simulation setup is not compatible with the GPU-resident mode (triggered by offloading update), mdrun will issue an error, so the user will know that their request to offload update cannot be fulfilled.
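To illustrate the explicit-flag approach, here is a small sketch of a wrapper that requests update offload with -update gpu and, if mdrun exits with an error, reports it and reruns with -update cpu. The function name `run_update_gpu` is hypothetical, and the wrapper cannot tell why mdrun failed, so any other fatal error also triggers the retry; treat it as a starting point only.

```shell
#!/bin/sh
# Run the given mdrun command with an explicit "-update gpu". If the
# command fails (for example because the setup is not compatible with
# GPU-resident mode), print a notice and rerun with "-update cpu".
# Note: any non-zero exit status triggers the retry, not only an
# incompatible update request.
run_update_gpu() {
    if "$@" -update gpu; then
        return 0
    fi
    echo "mdrun failed with -update gpu; retrying with -update cpu" >&2
    "$@" -update cpu
}

# Example (assumes md.tpr exists and GROMACS 2022):
# export GMX_ENABLE_DIRECT_GPU_COMM=true
# run_update_gpu gmx mdrun -deffnm md -nb gpu -pme gpu
```

Users who prefer to see the error and decide for themselves can of course just call mdrun with -update gpu directly.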

That makes sense. On NVLink systems, direct GPU communication will always be faster than or at least as fast as staged communication, and even on PCIe systems it can often have a slight performance benefit.

Certainly, memory size is rarely if ever a concern. Note that if you have a decent amount of free cores per GPU, -bonded cpu can have a slight performance benefit (see the cross-over of the yellow and green curves on Fig 10 of

Without PME decomposition, multi-node scaling is unfortunately quite limited. However, we are working on an extension and thanks to the recently released distributed cuFFTmp we will improve multi-node scaling in the future.

Feel free to ask if you have further questions.



When using only GMX_GPU_DD_COMMS and GMX_GPU_PME_PP_COMMS in 2022.1, the following warning was printed to stderr:

GPU-aware MPI detected, but by default GROMACS will not make use the direct GPU communication capabilities of MPI. 
For improved performance try enabling the feature by setting the GMX_ENABLE_DIRECT_GPU_COMM environment variable.

We were confused since there was no performance difference with GMX_ENABLE_DIRECT_GPU_COMM=true. Thanks for clarifying that the two are actually equivalent for the sake of backward compatibility.

For 2022.1,

  • We observed less than 10% performance difference between Open MPI and thread-MPI.
  • We haven’t set up A100-PCIe yet, but according to NVIDIA’s published data, the difference with respect to NVLink is negligible up to 4 A100s. However, there is a noticeable 30% drop with 8 A100s.

We have been serving the 2016 and 2019 versions to our users for a long time. The following table was prepared as a guideline to the new features.

Version  -bonded  -nb  -pme  -update  GPUDirect  CUDA_AWARE  Notes
2016     cpu      gpu  cpu   cpu      no
2018     cpu      gpu  gpu   cpu      no         no
2019     cpu/gpu  gpu  gpu   cpu      no         no
2020     cpu/gpu  gpu  gpu   gpu      yes        no          Thread-MPI only
2021     cpu/gpu  gpu  gpu   gpu      yes        no          Improved NB kernels on V100 and A100
2022     cpu/gpu  gpu  gpu   gpu      yes        yes         GMX_ENABLE_DIRECT_GPU_COMM=true; Thread-MPI/Open MPI; 12% improved NB kernels on A100
  • Bonded interactions can optimally be computed on either the CPU or the GPU (Fig. 10).
  • PME is still done on a single GPU rank, so multi-node scaling is limited (Fig. 12).
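For a module file or wrapper script that serves several versions, the GPUDirect column above could be encoded along the following lines. This is only a sketch: the function name `direct_comm_vars` is hypothetical, and the mapping is taken from this thread (the two legacy variables for 2020 and 2021, the single new variable for 2022, nothing for older versions).

```shell
#!/bin/sh
# Print the environment variable name(s) that enable direct GPU
# communication for a given GROMACS major version, one per word.
# Prints nothing for versions without the feature.
direct_comm_vars() {
    case "$1" in
        2020|2021) echo "GMX_GPU_DD_COMMS GMX_GPU_PME_PP_COMMS" ;;
        2022)      echo "GMX_ENABLE_DIRECT_GPU_COMM" ;;
        *)         ;;
    esac
}

# Example: export the matching variable(s) for the loaded version.
# for v in $(direct_comm_vars 2022); do export "$v=true"; done
```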

Thanks very much for the clarifications regarding your publication.