Multinode GPU direct communication fails

GROMACS version: 2023.1, 2022.5
GROMACS modification: No

Hi,

A multi-GPU run across multiple nodes fails with the following error.

-------------------------------------------------------
Program:     gmx mdrun, version 2023.1
Source file: src/gromacs/gpu_utils/device_stream.cu (line 81)
Function:    DeviceStream::~DeviceStream()::<lambda()>
MPI rank:    7 (out of 8)

Assertion failed:
Condition: stat == cudaSuccess
Failed to release CUDA stream. CUDA error #700 (cudaErrorIllegalAddress): an
illegal memory access was encountered.

For more information and tips for troubleshooting, please check the GROMACS
website at http://www.gromacs.org/Documentation/Errors
-------------------------------------------------------
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 7 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------

This gmx_mpi build runs fine on a single node.
Across multiple nodes, the calculation also runs fine if I disable direct GPU-GPU communication.
The error occurs in both 2023.1 and 2022.5.
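
For illustration, direct GPU communication can be toggled roughly like this (GMX_ENABLE_DIRECT_GPU_COMM is the environment variable GROMACS 2022/2023 uses to opt in to direct GPU communication with a CUDA-aware MPI; the input name and mdrun flags below are placeholders, not my exact command line):

    # direct GPU-GPU communication enabled -> fails across nodes
    mpirun -np 8 -x GMX_ENABLE_DIRECT_GPU_COMM=1 gmx_mpi mdrun -deffnm topol -nb gpu -pme gpu -bonded gpu

    # variable unset -> halo and PME-PP exchange are staged through the CPU, runs fine
    mpirun -np 8 gmx_mpi mdrun -deffnm topol -nb gpu -pme gpu -bonded gpu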

Does anyone know what to look into to solve this problem?
Thanks in advance

log1.log (4.6 KB)
log2.log (22.7 KB)

I’m really in trouble with this.

I know this kind of problem depends heavily on the specific hardware/software configuration, but I would still appreciate your help.

I’m using an HPC cluster; each node has four V100 SXM2 cards, and the nodes are interconnected via InfiniBand EDR.

Here are the compilation steps. For the C/C++/Fortran compilers, I used GCC 11.3.0.

  1. CUDA 11.8.0
    Installed with GCC 11.3.0

  2. UCX 1.14.1
    Configure command:
    ../contrib/configure-release --prefix=/home/x09527a/apps/ucx1.14.1-gcc11.3.0-cuda11.8.0 --with-avx --with-cuda=/home/x09527a/apps/cuda11.8.0-gcc11.3.0 --enable-optimizations --enable-cma --enable-mt --with-java=no --with-verbs

  3. OpenMPI 4.1.5
    Configure command:
    ../configure --prefix=/home/x09527a/apps/openmpi4.1.5-gcc11.3.0-cuda11.8.0-ucx1.14.1 --with-ucx=/home/x09527a/apps/ucx1.14.1-gcc11.3.0-cuda11.8.0 --with-cuda=/home/x09527a/apps/cuda11.8.0-gcc11.3.0 --enable-orterun-prefix-by-default --enable-mca-no-build=btl-uct

  4. Gromacs 2023.1
    Configure command:
    cmake .. -DCMAKE_INSTALL_PREFIX=~/apps/gromacs/2023.1-gcc11.3.0-cuda11.8.0-ucx-1.14.1-mpi4.1.5 -DCMAKE_C_COMPILER=mpicc -DCMAKE_CXX_COMPILER=mpicxx -DGMX_BUILD_OWN_FFTW=ON -DGMX_GPU=CUDA -DGMX_MPI=ON -DGMX_SIMD=AVX2_256 -DGMX_HWLOC=ON -DGMX_PYTHON_PACKAGE=OFF
    The installed gmx_mpi fails as in the above post.

Hi,

We have not seen such errors before, so I’d need your help with trying to identify the source of the problem:

  • Since the simulation aborts during cleanup (possibly due to an unrelated error), we might be losing some information about the original error. We might be able to recover this if you could edit the file src/gromacs/gpu_utils/device_stream.cu and comment out or remove lines 81-82 (i.e. src/gromacs/gpu_utils/device_stream.cu · main · GROMACS / GROMACS · GitLab), then recompile GROMACS and run this binary (a rough rebuild sketch follows this list).

  • Could you try a different OpenMPI/UCX version, e.g. the one shipped in the NVHPC SDK?
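
For the first point, a minimal sketch of the edit-and-rebuild cycle, assuming an out-of-source build tree like the one used for the original install (all paths are placeholders):

    # after commenting out lines 81-82 of src/gromacs/gpu_utils/device_stream.cu
    cd ~/build/gromacs-2023.1/build     # placeholder build directory
    make -j 16
    make install

The idea, per the suggestion above, is that the run then surfaces the earlier CUDA error instead of only the failure in the stream destructor.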

Cheers,
Szilárd

Hi,

Thank you for your help.

Here are the log files from the build with those lines commented out.

A warning message showed up:
“WARNING: Could not free page-locked memory. An unhandled error from a previous CUDA operation was detected. CUDA error #700 (cudaErrorIllegalAddress): an illegal memory access was encountered.”

log1.log (2.8 KB)
log2.log (22.9 KB)

I will try with different MPI/UCX versions later.

Some more findings with the GROMACS build above:

  1. Using 2 nodes with 1 rank/node, the simulation ran fine.
  2. Using 2 nodes with 2 ranks/node, the simulation failed.
  3. Using 1 node with 4 ranks, the simulation ran fine.
  4. Using 4 nodes with 1 rank/node, the simulation failed.

To me it seems that simulations fail when running more than two ranks in total across multiple nodes?
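
For reference, a rough sketch of how those four layouts could be launched with Open MPI (hostnames are placeholders and the mdrun arguments are omitted):

    mpirun -np 2 -H nodeA:1,nodeB:1                 gmx_mpi mdrun ...   # 1. 2 nodes, 1 rank/node  -> OK
    mpirun -np 4 -H nodeA:2,nodeB:2                 gmx_mpi mdrun ...   # 2. 2 nodes, 2 ranks/node -> fails
    mpirun -np 4 -H nodeA:4                         gmx_mpi mdrun ...   # 3. 1 node,  4 ranks      -> OK
    mpirun -np 4 -H nodeA:1,nodeB:1,nodeC:1,nodeD:1 gmx_mpi mdrun ...   # 4. 4 nodes, 1 rank/node  -> fails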

edit: the problem below was encountered in a patched version, not an unpatched one - the cause was confirmed to be unrelated.

However, it produced the same error in DeviceStream::~DeviceStream(), because when an exception is thrown the destructors are called in a different order than in a normal run; in particular, forcerec->nbv is destroyed after the DeviceStreamManager (whereas normally it is the reverse, cf. runner.cpp line ~2122). Sometimes cudaStreamDestroy returns an error, sometimes it crashes internally.

That sounds peculiar; I have not seen such issues before. It would help to know what the exact issue is. Can you try the workaround I suggested? Alternatively, running compute-sanitizer might reveal something more (you’ll likely have to write a wrapper script around compute-sanitizer gmx_mpi mdrun ... and launch that with the MPI launcher; a sketch follows below).
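
For example, a minimal wrapper sketch (the script name, the --log-file option, and the OMPI_COMM_WORLD_RANK variable are assumptions for an Open MPI setup; adjust to your launcher):

    #!/bin/bash
    # sanitize-mdrun.sh: run each MPI rank's gmx_mpi under compute-sanitizer,
    # writing one sanitizer log per rank
    exec compute-sanitizer --log-file sanitizer.rank${OMPI_COMM_WORLD_RANK}.log \
        gmx_mpi mdrun "$@"

It would then be launched as, e.g., mpirun -np 4 ./sanitize-mdrun.sh -s topol.tpr ... (the tpr name is a placeholder).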

Here are the log files with compute-sanitizer.

fail_2node_2rankpernode_log1.log (450.9 KB)
fail_2node_2rankpernode_log2.log (22.8 KB)
success_2node_1rankpernode_log1.log (38.3 KB)
success_2node_1rankpernode_log2.log (24.2 KB)

I’m also trying the NVHPC SDK, but I’m having difficulties getting the MPI itself to work properly. I’ll continue to tackle this.

Note that the NVHPC SDK likely does not work out of the box as a compiler. If the MPI bundled in it does not work, perhaps try a different OpenMPI + UCX; what I have used myself is OpenMPI 4.1.2 + UCX 1.12.1 and OpenMPI 4.1.4 + UCX 1.13.1.

That suggests there may be a code issue. Can you try the following simple modification to test whether disabling a parallelization feature eliminates the issue: change the following line of code (src/gromacs/ewald/pme_gpu_internal.cpp · release-2023 · GROMACS / GROMACS · GitLab)
to
kernelParamsPtr->usePipeline = 0;
Then compile and run the resulting binary, and please report back whether the run still fails.

Thanks & cheers
Szilárd

I tried three different builds (① OpenMPI 4.1.5 + UCX 1.14.1 with the code modification, ② OpenMPI 4.1.4 + UCX 1.13.1, ③ OpenMPI 4.1.2 + UCX 1.12.1), but all ended with the same error as above.

OK, please try the workaround above to see whether it avoids the issue.

Yes, I tested that workaround; it resulted in the same error.

Thanks. Are you able to share your input files and, ideally, file an issue with the above information on Issues · GROMACS / GROMACS · GitLab?