Error when enabling GPU-GPU direct communication across multiple nodes

GROMACS version: 2022.3
GROMACS modification: No

I am trying to run a simulation across two nodes with GPU-GPU direct communication enabled. However, it is not running as expected and throws the following error. Any help would be appreciated.
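For context, this is roughly how the job is launched. The Slurm directives and module names below are a sketch of my setup rather than the exact script (the resource requests are inferred from the log further down), but the mdrun command line is the one I actually use:

#!/bin/bash
# Sketch of the submission script; directives are approximate, inferred from the log
#SBATCH --nodes=2                      # two nodes
#SBATCH --gpus-per-node=4              # 4 GPUs per node
#SBATCH --ntasks-per-node=4            # one MPI rank per GPU, 8 ranks in total
#SBATCH --cpus-per-task=11             # matches -ntomp 11
module load gcc/9.3.0 cuda/11.4 openmpi/4.0.3 gromacs/2022.3   # module versions are approximate
export GMX_ENABLE_DIRECT_GPU_COMM=1    # enable direct GPU communication over GPU-aware MPI
mpirun -np 8 gmx_mpi mdrun -v -deffnm md -ntomp 11 -nb gpu -bonded gpu -update gpu -pme gpu -npme 1 -cpi md.cpt

GMX_ENABLE_DIRECT_GPU_COMM is exported before launch so that every rank picks it up, which is what produces the "GMX_ENABLE_DIRECT_GPU_COMM environment variable detected" line in the log below.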

                  :-) GROMACS - gmx mdrun, 2022.3 (-:

Executable: /cvmfs/soft.computecanada.ca/easybuild/software/2020/avx512/MPI/gcc9/cuda11.4/openmpi4/gromacs/2022.3/bin/gmx_mpi
Data prefix: /cvmfs/soft.computecanada.ca/easybuild/software/2020/avx512/MPI/gcc9/cuda11.4/openmpi4/gromacs/2022.3
Working dir: /scratch/adwaith/MD/Charmm_lugEF_ATP2/charmm-gui-7815573259/gromacs
Command line:
gmx_mpi mdrun -v -deffnm md -ntomp 11 -nb gpu -bonded gpu -update gpu -pme gpu -npme 1 -cpi md.cpt

Reading file md.tpr, VERSION 2022.3 (single precision)
GMX_ENABLE_DIRECT_GPU_COMM environment variable detected, enabling direct GPU communication using GPU-aware MPI.
Changing nstlist from 20 to 100, rlist from 1.213 to 1.329

On host gra1182 4 GPUs selected for this run.
Mapping of GPU IDs to the 4 GPU tasks in the 4 ranks on this node:
PP:0,PP:1,PP:2,PP:3
PP tasks will do (non-perturbed) short-ranged and most bonded interactions on the GPU
PP task will update and constrain coordinates on the GPU
PME tasks will do all aspects on the GPU
GPU direct communication will be used between MPI ranks.
Using 8 MPI processes

Non-default thread affinity set, disabling internal thread affinity

Using 11 OpenMP threads per MPI process

Note: Your choice of number of MPI ranks and amount of resources results in using 11 OpenMP threads per rank, which is most likely inefficient. The optimum is usually between 2 and 8 threads per rank.

WARNING: This run will generate roughly 125906 Mb of data

starting mdrun 'Title'
500000000 steps, 1000000.0 ps (continuing from step 21804900, 43609.8 ps).
[1678646954.282894] [gra1183:242837:0] cma_ep.c:61 UCX ERROR process_vm_readv(pid=242836 length=644712) returned -1: Bad address
[1678646954.283002] [gra1183:242837:0] ib_md.c:325 UCX ERROR ibv_reg_mr(address=0x2b7bb933ad40, length=659312, access=0xf) failed: Bad address
[1678646954.283018] [gra1183:242837:0] ucp_mm.c:131 UCX ERROR failed to register address 0x2b7bb933ad48 mem_type bit 0x1 length 659292 on md[7]=mlx5_0: Input/output error (md reg_mem_types 0x15)
[1678646954.283026] [gra1183:242837:0] ucp_request.c:267 UCX ERROR failed to register user buffer datatype 0x8 address 0x2b7bb933ad48 len 659292: Input/output error
[gra1183:242837:0:242837] rndv.c:449 Assertion `status == UCS_OK' failed
==== backtrace (tid: 242837) ====
0 0x000000000002027e ucs_debug_print_backtrace() /tmp/ebuser/avx512/UCX/1.8.0/gcccorecuda-2020.1.114/ucx-1.8.0/src/ucs/debug/debug.c:653
1 0x000000000003f37a ucp_rndv_progress_rma_get_zcopy() /tmp/ebuser/avx512/UCX/1.8.0/gcccorecuda-2020.1.114/ucx-1.8.0/src/ucp/tag/rndv.c:449
2 0x000000000003f7a1 ucp_request_try_send() /tmp/ebuser/avx512/UCX/1.8.0/gcccorecuda-2020.1.114/ucx-1.8.0/src/ucp/core/ucp_request.inl:171
3 0x000000000003f7a1 ucp_request_send() /tmp/ebuser/avx512/UCX/1.8.0/gcccorecuda-2020.1.114/ucx-1.8.0/src/ucp/core/ucp_request.inl:206
4 0x000000000003f7a1 ucp_rndv_req_send_rma_get() /tmp/ebuser/avx512/UCX/1.8.0/gcccorecuda-2020.1.114/ucx-1.8.0/src/ucp/tag/rndv.c:596
5 0x0000000000040826 ucp_rndv_matched() /tmp/ebuser/avx512/UCX/1.8.0/gcccorecuda-2020.1.114/ucx-1.8.0/src/ucp/tag/rndv.c:800
6 0x0000000000040af0 ucp_rndv_process_rts() /tmp/ebuser/avx512/UCX/1.8.0/gcccorecuda-2020.1.114/ucx-1.8.0/src/ucp/tag/rndv.c:840
7 0x0000000000040af0 ucp_rndv_process_rts() /tmp/ebuser/avx512/UCX/1.8.0/gcccorecuda-2020.1.114/ucx-1.8.0/src/ucp/tag/rndv.c:844
8 0x0000000000038b71 uct_iface_invoke_am() /tmp/ebuser/avx512/UCX/1.8.0/gcccorecuda-2020.1.114/ucx-1.8.0/src/uct/base/uct_iface.h:628
9 0x0000000000038b71 uct_rc_mlx5_iface_common_am_handler() /tmp/ebuser/avx512/UCX/1.8.0/gcccorecuda-2020.1.114/ucx-1.8.0/src/uct/ib/rc/accel/rc_mlx5.inl:397
10 0x0000000000038b71 uct_rc_mlx5_iface_common_poll_rx() /tmp/ebuser/avx512/UCX/1.8.0/gcccorecuda-2020.1.114/ucx-1.8.0/src/uct/ib/rc/accel/rc_mlx5.inl:1370
11 0x0000000000038b71 uct_rc_mlx5_iface_progress() /tmp/ebuser/avx512/UCX/1.8.0/gcccorecuda-2020.1.114/ucx-1.8.0/src/uct/ib/rc/accel/rc_mlx5_iface.c:130
12 0x0000000000026b0a ucs_callbackq_dispatch() /tmp/ebuser/avx512/UCX/1.8.0/gcccorecuda-2020.1.114/ucx-1.8.0/src/ucs/datastruct/callbackq.h:211
13 0x0000000000026b0a uct_worker_progress() /tmp/ebuser/avx512/UCX/1.8.0/gcccorecuda-2020.1.114/ucx-1.8.0/src/uct/api/uct.h:2221
14 0x0000000000026b0a ucp_worker_progress() /tmp/ebuser/avx512/UCX/1.8.0/gcccorecuda-2020.1.114/ucx-1.8.0/src/ucp/core/ucp_worker.c:1951
15 0x00000000000037b7 mca_pml_ucx_progress() ???:0
16 0x00000000000331eb opal_progress() ???:0
17 0x0000000000039805 ompi_sync_wait_mt() ???:0
18 0x000000000007c49f ompi_request_default_wait_any() ???:0
19 0x00000000000bbcfb MPI_Waitany() ???:0
20 0x0000000000d45f17 gmx::PmeCoordinateReceiverGpu::Impl::synchronizeOnCoordinatesFromPpRank() ???:0
21 0x0000000000d4d079 pme_gpu_spread() ???:0
22 0x0000000000bd1135 pme_gpu_launch_spread() ???:0
23 0x0000000000bb8ccc gmx_pmeonly() ???:0
24 0x0000000000c1d8c4 gmx::Mdrunner::mdrunner() ???:0
25 0x0000000000408a1c gmx::gmx_mdrun() ???:0
26 0x0000000000408b58 gmx::gmx_mdrun() ???:0
27 0x0000000000463af2 gmx::CommandLineModuleManager::run() ???:0
28 0x00000000004057fd main() ???:0
29 0x0000000000023e1b __libc_start_main() /cvmfs/soft.computecanada.ca/gentoo/2020/usr/src/debug/sys-libs/glibc-2.30-r8/glibc-2.30/csu/…/csu/libc-start.c:308
30 0x000000000040587a _start() ???:0

[gra1183:242837] *** Process received signal ***
[gra1183:242837] Signal: Aborted (6)
[gra1183:242837] Signal code: (-6)
[gra1183:242837] [ 0] /cvmfs/soft.computecanada.ca/gentoo/2020/lib64/libpthread.so.0(+0x130f0)[0x2b7b245ea0f0]
[gra1183:242837] [ 1] /cvmfs/soft.computecanada.ca/gentoo/2020/lib64/libc.so.6(gsignal+0x141)[0x2b7b2462f901]
[gra1183:242837] [ 2] /cvmfs/soft.computecanada.ca/gentoo/2020/lib64/libc.so.6(abort+0x127)[0x2b7b2461956b]
[gra1183:242837] [ 3] /cvmfs/soft.computecanada.ca/easybuild/software/2020/avx512/CUDA/cuda11.4/ucx/1.8.0/lib/libucs.so.0(+0x1ef65)[0x2b7b47cd1f65]
[gra1183:242837] [ 4] /cvmfs/soft.computecanada.ca/easybuild/software/2020/avx512/CUDA/cuda11.4/ucx/1.8.0/lib/libucs.so.0(ucs_fatal_error_format+0xde)[0x2b7b47cd204e]
[gra1183:242837] [ 5] /cvmfs/soft.computecanada.ca/easybuild/software/2020/avx512/CUDA/cuda11.4/ucx/1.8.0/lib/libucp.so.0(ucp_rndv_progress_rma_get_zcopy+0x44a)[0x2b7b47c2b37a]
[gra1183:242837] [ 6] /cvmfs/soft.computecanada.ca/easybuild/software/2020/avx512/CUDA/cuda11.4/ucx/1.8.0/lib/libucp.so.0(+0x3f7a1)[0x2b7b47c2b7a1]
[gra1183:242837] [ 7] /cvmfs/soft.computecanada.ca/easybuild/software/2020/avx512/CUDA/cuda11.4/ucx/1.8.0/lib/libucp.so.0(ucp_rndv_matched+0x4f6)[0x2b7b47c2c826]
[gra1183:242837] [ 8] /cvmfs/soft.computecanada.ca/easybuild/software/2020/avx512/CUDA/cuda11.4/ucx/1.8.0/lib/libucp.so.0(ucp_rndv_process_rts+0x1d0)[0x2b7b47c2caf0]
[gra1183:242837] [ 9] /cvmfs/soft.computecanada.ca/easybuild/software/2020/avx512/CUDA/cuda11.4/ucx/1.8.0/lib/ucx/libuct_ib.so.0(uct_rc_mlx5_iface_progress+0x121)[0x2b7b47ea2b71]
[gra1183:242837] [10] /cvmfs/soft.computecanada.ca/easybuild/software/2020/avx512/CUDA/cuda11.4/ucx/1.8.0/lib/libucp.so.0(ucp_worker_progress+0x6a)[0x2b7b47c12b0a]
[gra1183:242837] [11] /cvmfs/soft.computecanada.ca/easybuild/software/2020/avx512/CUDA/gcc9/cuda11.4/openmpi/4.0.3/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_progress+0x17)[0x2b7b4575b7b7]
[gra1183:242837] [12] /cvmfs/soft.computecanada.ca/easybuild/software/2020/avx512/CUDA/gcc9/cuda11.4/openmpi/4.0.3/lib/libopen-pal.so.40(opal_progress+0x2b)[0x2b7b3ac4e1eb]
[gra1183:242837] [13] /cvmfs/soft.computecanada.ca/easybuild/software/2020/avx512/CUDA/gcc9/cuda11.4/openmpi/4.0.3/lib/libopen-pal.so.40(ompi_sync_wait_mt+0xb5)[0x2b7b3ac54805]
[gra1183:242837] [14] /cvmfs/soft.computecanada.ca/easybuild/software/2020/avx512/CUDA/gcc9/cuda11.4/openmpi/4.0.3/lib/libmpi.so.40(ompi_request_default_wait_any+0x2df)[0x2b7b23f9249f]
[gra1183:242837] [15] /cvmfs/soft.computecanada.ca/easybuild/software/2020/avx512/CUDA/gcc9/cuda11.4/openmpi/4.0.3/lib/libmpi.so.40(MPI_Waitany+0xab)[0x2b7b23fd1cfb]
[gra1183:242837] [16] /cvmfs/soft.computecanada.ca/easybuild/software/2020/avx512/MPI/gcc9/cuda11.4/openmpi4/gromacs/2022.3/bin/…/lib/libgromacs_mpi.so.7(_ZN3gmx24PmeCoordinateReceiverGpu4Impl34synchronizeOnCoordinatesFromPpRankEiRK12DeviceStream+0x27)[0x2b7b226a2f17]
[gra1183:242837] [17] /cvmfs/soft.computecanada.ca/easybuild/software/2020/avx512/MPI/gcc9/cuda11.4/openmpi4/gromacs/2022.3/bin/…/lib/libgromacs_mpi.so.7(_Z14pme_gpu_spreadPK6PmeGpuP20GpuEventSynchronizerPPfPP18gmx_parallel_3dfftbbfbPN3gmx24PmeCoordinateReceiverGpuE+0x779)[0x2b7b226aa079]
[gra1183:242837] [18] /cvmfs/soft.computecanada.ca/easybuild/software/2020/avx512/MPI/gcc9/cuda11.4/openmpi4/gromacs/2022.3/bin/…/lib/libgromacs_mpi.so.7(_Z21pme_gpu_launch_spreadP9gmx_pme_tP20GpuEventSynchronizerP13gmx_wallcyclefbPN3gmx24PmeCoordinateReceiverGpuE+0xb5)[0x2b7b2252e135]
[gra1183:242837] [19] /cvmfs/soft.computecanada.ca/easybuild/software/2020/avx512/MPI/gcc9/cuda11.4/openmpi4/gromacs/2022.3/bin/…/lib/libgromacs_mpi.so.7(_Z11gmx_pmeonlyP9gmx_pme_tPK9t_commrecP6t_nrnbP13gmx_wallcycleP23gmx_walltime_accountingP10t_inputrec10PmeRunModebPKN3gmx19DeviceStreamManagerE+0x33fc)[0x2b7b22515ccc]
[gra1183:242837] [20] /cvmfs/soft.computecanada.ca/easybuild/software/2020/avx512/MPI/gcc9/cuda11.4/openmpi4/gromacs/2022.3/bin/…/lib/libgromacs_mpi.so.7(_ZN3gmx8Mdrunner8mdrunnerEv+0x4984)[0x2b7b2257a8c4]
[gra1183:242837] [21] gmx_mpi[0x408a1c]
[gra1183:242837] [22] gmx_mpi[0x408b58]
[gra1183:242837] [23] /cvmfs/soft.computecanada.ca/easybuild/software/2020/avx512/MPI/gcc9/cuda11.4/openmpi4/gromacs/2022.3/bin/…/lib/libgromacs_mpi.so.7(_ZN3gmx24CommandLineModuleManager3runEiPPc+0x252)[0x2b7b21dc0af2]
[gra1183:242837] [24] gmx_mpi[0x4057fd]
[gra1183:242837] [25] /cvmfs/soft.computecanada.ca/gentoo/2020/lib64/libc.so.6(__libc_start_main+0xeb)[0x2b7b2461ae1b]
[gra1183:242837] [26] gmx_mpi[0x40587a]
[gra1183:242837] *** End of error message ***

Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.


mpirun noticed that process rank 7 with PID 242837 on node gra1183 exited on signal 6 (Aborted).