Multinode GPU direct communication fails

GROMACS version: 2023.1, 2022.5
GROMACS modification: No

Hi,

A multi-GPU run across multiple nodes fails with the following error.

-------------------------------------------------------
Program:     gmx mdrun, version 2023.1
Source file: src/gromacs/gpu_utils/device_stream.cu (line 81)
Function:    DeviceStream::~DeviceStream()::<lambda()>
MPI rank:    7 (out of 8)

Assertion failed:
Condition: stat == cudaSuccess
Failed to release CUDA stream. CUDA error #700 (cudaErrorIllegalAddress): an
illegal memory access was encountered.

For more information and tips for troubleshooting, please check the GROMACS
website at http://www.gromacs.org/Documentation/Errors
-------------------------------------------------------
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 7 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------

This gmx_mpi build runs fine on a single node.
Across multiple nodes, the calculation also runs fine if I disable direct GPU-GPU communication.
The error occurs in both 2023.1 and 2022.5.
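
For illustration, direct GPU communication can be toggled roughly like this (GMX_ENABLE_DIRECT_GPU_COMM is the environment variable GROMACS 2022/2023 uses to opt in to direct GPU communication with a CUDA-aware MPI; the input name and mdrun flags below are placeholders, not my exact command line):

    # direct GPU-GPU communication enabled -> fails across nodes
    mpirun -np 8 -x GMX_ENABLE_DIRECT_GPU_COMM=1 gmx_mpi mdrun -deffnm topol -nb gpu -pme gpu -bonded gpu

    # variable unset -> halo and PME-PP exchange are staged through the CPU, runs fine
    mpirun -np 8 gmx_mpi mdrun -deffnm topol -nb gpu -pme gpu -bonded gpu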

Does anyone know what to look into to solve this problem?
Thanks in advance

log1.log (4.6 KB)
log2.log (22.7 KB)

I’m really in trouble with this.

I know this kind of problem depends heavily on the specific hardware/software configuration, but I would still appreciate your help.

I’m using an HPC cluster; each node has four V100 SXM2 cards, and the nodes are interconnected via InfiniBand EDR.

Here are the compilation steps. For the C/C++/Fortran compilers, I used GCC 11.3.0.

  1. CUDA 11.8.0
    Installed with GCC 11.3.0

  2. UCX 1.14.1
    Configure command:
    ../contrib/configure-release --prefix=/home/x09527a/apps/ucx1.14.1-gcc11.3.0-cuda11.8.0 --with-avx --with-cuda=/home/x09527a/apps/cuda11.8.0-gcc11.3.0 --enable-optimizations --enable-cma --enable-mt --with-java=no --with-verbs

  3. OpenMPI 4.1.5
    Configure command:
    ../configure --prefix=/home/x09527a/apps/openmpi4.1.5-gcc11.3.0-cuda11.8.0-ucx1.14.1 --with-ucx=/home/x09527a/apps/ucx1.14.1-gcc11.3.0-cuda11.8.0 --with-cuda=/home/x09527a/apps/cuda11.8.0-gcc11.3.0 --enable-orterun-prefix-by-default --enable-mca-no-build=btl-uct

  4. Gromacs 2023.1
    Configure command:
    cmake .. -DCMAKE_INSTALL_PREFIX=~/apps/gromacs/2023.1-gcc11.3.0-cuda11.8.0-ucx-1.14.1-mpi4.1.5 -DCMAKE_C_COMPILER=mpicc -DCMAKE_CXX_COMPILER=mpicxx -DGMX_BUILD_OWN_FFTW=ON -DGMX_GPU=CUDA -DGMX_MPI=ON -DGMX_SIMD=AVX2_256 -DGMX_HWLOC=ON -DGMX_PYTHON_PACKAGE=OFF
    The installed gmx_mpi fails as in the above post.

Hi,

We have not seen such errors before, so I’d need your help with trying to identify the source of the problem:

  • Since the simulation aborts during cleanup (possibly due to an unrelated error), we might be losing some information about the original error. We might be able to recover this if you could edit the file src/gromacs/gpu_utils/device_stream.cu and comment out or remove lines 81-82 (i.e. src/gromacs/gpu_utils/device_stream.cu · main · GROMACS / GROMACS · GitLab), then recompile GROMACS and run this binary (a rough rebuild sketch follows this list).

  • Could you try a different OpenMPI/UCX version, e.g. the one shipped in the NVHPC SDK?
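
For the first point, a minimal sketch of the edit-and-rebuild cycle, assuming an out-of-source build tree like the one used for the original install (all paths are placeholders):

    # after commenting out lines 81-82 of src/gromacs/gpu_utils/device_stream.cu
    cd ~/build/gromacs-2023.1/build     # placeholder build directory
    make -j 16
    make install

The idea, per the suggestion above, is that the run then surfaces the earlier CUDA error instead of only the failure in the stream destructor.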

Cheers,
Szilárd

Hi,

Thank you for your help.

Here are the log files from the build with those lines commented out.

A warning message showed up:
“WARNING: Could not free page-locked memory. An unhandled error from a previous CUDA operation was detected. CUDA error #700 (cudaErrorIllegalAddress): an illegal memory access was encountered.”

log1.log (2.8 KB)
log2.log (22.9 KB)

I will try with different MPI/UCX versions later.

Some more findings with the GROMACS build above:

  1. Using 2 nodes with 1 rank/node, the simulation ran fine.
  2. Using 2 nodes with 2 ranks/node, the simulation failed.
  3. Using 1 node with 4 ranks, the simulation ran fine.
  4. Using 4 nodes with 1 rank/node, the simulation failed.

To me it seems that simulations fail when running more than two ranks in total across multiple nodes?
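
For reference, a rough sketch of how those four layouts could be launched with Open MPI (hostnames are placeholders and the mdrun arguments are omitted):

    mpirun -np 2 -H nodeA:1,nodeB:1                 gmx_mpi mdrun ...   # 1. 2 nodes, 1 rank/node  -> OK
    mpirun -np 4 -H nodeA:2,nodeB:2                 gmx_mpi mdrun ...   # 2. 2 nodes, 2 ranks/node -> fails
    mpirun -np 4 -H nodeA:4                         gmx_mpi mdrun ...   # 3. 1 node,  4 ranks      -> OK
    mpirun -np 4 -H nodeA:1,nodeB:1,nodeC:1,nodeD:1 gmx_mpi mdrun ...   # 4. 4 nodes, 1 rank/node  -> fails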

edit: the problem below was encountered in a patched version, not an unpatched one - the cause was confirmed to be unrelated.

However, it produced the same error in DeviceStream::~DeviceStream(), because when an exception is thrown the destructors are called in a different order than in a normal run; in particular, forcerec->nbv is destroyed after the DeviceStreamManager (whereas normally it is the reverse, cf. runner.cpp line ~2122). Sometimes cudaStreamDestroy returns an error, sometimes it crashes internally.

That sounds peculiar; I have not seen such issues before. It would help to know what the exact issue is. Can you try the workaround I suggested? Alternatively, running compute-sanitizer might reveal something more (you’ll likely have to write a wrapper script around compute-sanitizer gmx_mpi mdrun ... and launch that with the MPI launcher; a sketch follows below).
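
For example, a minimal wrapper sketch (the script name, the --log-file option, and the OMPI_COMM_WORLD_RANK variable are assumptions for an Open MPI setup; adjust to your launcher):

    #!/bin/bash
    # sanitize-mdrun.sh: run each MPI rank's gmx_mpi under compute-sanitizer,
    # writing one sanitizer log per rank
    exec compute-sanitizer --log-file sanitizer.rank${OMPI_COMM_WORLD_RANK}.log \
        gmx_mpi mdrun "$@"

It would then be launched as, e.g., mpirun -np 4 ./sanitize-mdrun.sh -s topol.tpr ... (the tpr name is a placeholder).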

Here are the log files with compute-sanitizer.

fail_2node_2rankpernode_log1.log (450.9 KB)
fail_2node_2rankpernode_log2.log (22.8 KB)
success_2node_1rankpernode_log1.log (38.3 KB)
success_2node_1rankpernode_log2.log (24.2 KB)

I’m also trying the NVHPC SDK, but I’m having difficulties getting the MPI itself to work properly. I’ll continue to tackle this.

Note that the NVHPC SDK likely does not work out of the box as a compiler. If the MPI bundled in it does not work, perhaps try a different OpenMPI + UCX; what I have used myself is OpenMPI 4.1.2 + UCX 1.12.1 and OpenMPI 4.1.4 + UCX 1.13.1.

That suggests there may be a code issue. Can you try the following simple modification to test whether disabling a parallelization feature eliminates the issue: change the following line of code (src/gromacs/ewald/pme_gpu_internal.cpp · release-2023 · GROMACS / GROMACS · GitLab)
to
kernelParamsPtr->usePipeline = 0;
Then compile and run the resulting binary, and please report back whether the run still fails.

Thanks & cheers
Szilárd

I tried three different builds (① OpenMPI 4.1.5 + UCX 1.14.1 with the code modification, ② OpenMPI 4.1.4 + UCX 1.13.1, ③ OpenMPI 4.1.2 + UCX 1.12.1), but all ended with the same error as above.

OK, please try the workaround above to see whether it avoids the issue.

Yes, I tested that workaround; it resulted in the same error.

Thanks. Are you able to share your input files and, ideally, file an issue with the above information on Issues · GROMACS / GROMACS · GitLab?