hipSYCL error when running with multiple MPI ranks

GROMACS version: 2023.2
GROMACS modification: No

Hi,
I am running some simulations in GROMACS using both the GPUs and the CPUs. The system has 8 GPUs, and I use one node with 128 CPU cores. The simulations run with no problem when using a single MPI rank, but when I try to use multiple MPI ranks I get the following error. I don't know whether there is a limitation on the number of MPI ranks that can be used with the GPUs, or whether there is a problem with the installation of the software.

[hipSYCL Error] from /tmp/hellsvik/hipsycl/0.9.4/cpeGNU-22.06-rocm-5.3.3-llvm/hipSYCL-0.9.4/src/runtime/hip/hip_queue.cpp:355 @ submit_memset(): hip_queue: hipMemsetAsync() failed (error code = HIP:1)
============== hipSYCL error report ==============
hipSYCL has caught the following undhandled asynchronous errors:

  1. from /tmp/hellsvik/hipsycl/0.9.4/cpeGNU-22.06-rocm-5.3.3-llvm/hipSYCL-0.9.4/src/runtime/hip/hip_queue.cpp:355 @ submit_memset(): hip_queue: hipMemsetAsync() failed (error code = HIP:1)
The application will now be terminated.
terminate called without an active exception
srun: error: nid002892: task 39: Aborted (core dumped)
srun: launch/slurm: _step_signal: Terminating StepId=2789083.0
slurmstepd: error: *** STEP 2789083.0 ON nid002892 CANCELLED AT 2023-11-03T01:03:07 ***
srun: error: nid002892: tasks 0-38,40-63: Killed
srun: Force Terminated StepId=2789083.0

Bests,
Maryam Majd

Hi!

Could you please share exactly how you built hipSYCL and GROMACS, and what command you are using to launch the simulation?

I assume the system is Dardel, correct?

Hi,

Yes, the system is Dardel. I don't know how hipSYCL and GROMACS were built on the system, but I specify the number of tasks with:
#SBATCH --ntasks-per-node=64

and the commands I use for running the job are:
gmx_mpi mdrun -deffnm test -s …/test.tpr -cpi test -nsteps 100000

srun gmx_mpi mdrun -deffnm test -s …/test.tpr -cpi test -nsteps 100000 -nb gpu -bonded cpu -update cpu

srun gmx_mpi mdrun -deffnm test -s …/test.tpr -cpi test -nsteps 100000 -nb gpu -bonded gpu -update cpu

I remembered that I ran the grompp command locally on a machine with a CUDA GPU. Could that be the reason for the error?

Maybe you could ask Johan? Since it’s just a temporary installation on some node, it is very hard to comment on what could have gone wrong here.

No, it should not be a problem.

Are you using any additional flags to control the CPU and GPU assignment?

Currently, you are launching 64 processes on each node, 8 per GPU, which would be very inefficient for most purposes. It is better to use --ntasks-per-node=8 --nodes=N to run only 8 tasks on each node, one per GPU, and then use -ntomp 14 to use 14 CPU threads per GPU. The value of 14 depends strongly on the workload; good values can be anywhere between 5 and 14, depending on the ratio of CPU to GPU work (-bonded cpu -update cpu will benefit from higher values; -bonded gpu -update gpu would be better with smaller values, to leave more system resources for the GPU runtime to handle things in the background).
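As a rough sketch (assuming a Slurm batch script; the deffnm and mdrun options are just the ones from your commands above), that layout could look like:

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8

# One MPI rank per GPU, 14 OpenMP threads per rank (tune between ~5 and 14)
srun gmx_mpi mdrun -deffnm test -ntomp 14 -nb gpu -bonded cpu -update cpu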

It is also a very good idea to use the ROCR_VISIBLE_DEVICES variable and explicit CPU pinning to correctly group CPU and GPU tasks. Without them, multi-GPU performance is going to be very bad. LUMI has a good example of a script doing that. Or here is what I use on Dardel:

#!/bin/bash
  
# Usage: srun ~/this_script.sh gmx_mpi mdrun ...

HWLOC_BIND="$(which hwloc-bind)"
# First CPU core in the L3 domain
CORES0=(0 8 16 24 32 40 48 56)
# GPU for the L3 domain
GPUS=(4 5 2 3 6 7 0 1)
NGPU=8 # Number of GPUs

RANK_LOCAL=${SLURM_LOCALID}
RANK=$((${RANK_LOCAL} % ($NGPU)))

CORE0=${CORES0[$RANK]}
GPU=${GPUS[$RANK]}
CPUBIND="core:$((${CORE0}))-$((${CORE0}+7))"

export ROCR_VISIBLE_DEVICES=${GPU}
export OMP_PLACES=cores
export OMP_PROC_BIND=close
exec "${HWLOC_BIND}" --cpubind "${CPUBIND}" $*

Thanks a lot. Using the script for my simulations improved the performance a lot.

I assume the errors with hipMemsetAsync are gone too? Or do they still happen?

Also, in the recipe above, I haven't touched upon the topic of GPU-aware MPI. If you are running a single simulation across multiple GPUs, it is definitely a good idea to use it. You need to enable it both in the Cray MPI itself (export MPICH_GPU_SUPPORT_ENABLED=1) and in GROMACS (export GMX_ENABLE_DIRECT_GPU_COMM=1 GMX_FORCE_GPU_AWARE_MPI=1).
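For example, a minimal job-script excerpt (the mdrun options are just the ones from the commands above):

# Enable GPU-aware MPI in Cray MPICH and direct GPU communication in GROMACS
export MPICH_GPU_SUPPORT_ENABLED=1
export GMX_ENABLE_DIRECT_GPU_COMM=1
export GMX_FORCE_GPU_AWARE_MPI=1

srun gmx_mpi mdrun -deffnm test -nb gpu -bonded gpu -update cpu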

Hi,

No, the error still occurs for some simulations. Even if I run the same simulation with the same settings twice, I might get the error for one run while the other runs without a problem.

Bests,
Maryam

Interesting. I assumed that the problem was oversubscribing the GPU by launching 8 ranks per GCD, but alas, that's not the case.

  1. Can you try compiling with a newer version of hipSYCL? The project has been rebranded and is now called “AdaptiveCpp”, but it is the same thing. AdaptiveCpp 23.10.0 is the latest release and should work fine with GROMACS 2023.x.

  2. Independently of (1), could you try running with HIPSYCL_RT_MAX_CACHED_NODES=0 set? At the very least, that will make it easier to debug since there will be less asynchrony in the internal scheduling.

Hi again,

Thanks for the reply. I was wondering where I should add the HIPSYCL_RT_MAX_CACHED_NODES=0 setting and how.

Bests,
Maryam

It is an environment variable, so you can export it any time before srun is called from your script.
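For example, a minimal sketch of a job-script excerpt (the mdrun options are just the ones used earlier in this thread):

# Disable the hipSYCL node cache before launching the run
export HIPSYCL_RT_MAX_CACHED_NODES=0
srun gmx_mpi mdrun -deffnm test -nb gpu -bonded gpu -update cpu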

I was following this thread and found that I am getting a similar issue. Since maryamma's CMake info for the build wasn't listed above, I thought I would chime in with mine and see if anyone spots anything obvious that I am missing:

hipSYCL cmake:

cmake -DCMAKE_INSTALL_PREFIX=/home/hipSYCL_0.9.4_GNU-ROCM-5.4.3/ -DWITH_ACCELERATED_CPU=ON -DWITH_CPU_BACKEND=ON -DROCM_PATH=${ROCM_PATH} -DWITH_CUDA_BACKEND=OFF -DWITH_ROCM_BACKEND=on -DDEFAULT_GPU_ARCH=gfx90a -DROCM_CXX_FLAGS="--gcc-toolchain=${GCC_PATH}/snos/" -DLLVM_DIR=${ROCM_PATH}/llvm/lib/cmake/llvm …/

GROMACS cmake:

cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_CXX_COMPILER=/opt/rocm-5.4.3/llvm/bin/amdclang++ -DCMAKE_C_COMPLIER=/opt/rocm-5.4.3/llvm/bin/amdclang -DCMAKE_CXX_FLAGS="--gcc-toolchain=/opt/cray/pe/gcc/12.2.0/snos/ -L/opt/rocm-5.4.3/llvm/lib/" -DGMX_GPU=SYCL -DGMX_MPI=ON -DGMX_OPENMP=ON -DGMX_SYCL_HIPSYCL=ON -DGMX_GPU_FFT_LIBRARY=rocFFT -DGMX_BUILD_OWN_FFTW=OFF -DHIPSYCL_TARGETS="hip:gfx90a"

And for the record, $GCC_PATH points to a GCC 12.2.0 installation and ROCm is version 5.4.3. When building with MPI, I am on a compute system with a 'module'-based environment setup, where I load a Cray environment module that provides the MPI flags (I can pull those up if anyone thinks they may be relevant).

Note, I am using amdclang instead of the ROCm-provided clang, as I am trying to build with OpenMP, MPI, and GPU support simultaneously, and the user guide for the system I am on indicates that the ROCm clang should not be used in this case.

Using the script provided by al42and, I was able to run a test simulation that didn't crash immediately with a HIP error, but instead ran on 4 nodes with a substantial performance loss (less than 1 ns/day for a system of ~2M atoms on 4 nodes with 8 GCDs each, compared to when I run the same system on an NVIDIA compute cluster with a standard CUDA build of GROMACS).

For the record, I also tried to build with the latest AdaptiveCPP and I get the same issue.

EDIT: An example mdrun command that fails is below (excluding the script that was provided)

srun -N 4 -t 12:00 --ntasks-per-node=8 --gpus-per-node=8 --cpus-per-task=6 gmx_mpi mdrun -tunepme no -s FinalRelax_450K.tpr -o test.trr -e test.edr -g test.log -v -c test_out.gro -nb gpu -bonded gpu -update auto -pme auto -ntomp 6 -maxh 0.15

I also get the same error if I reduce the number of tasks per node and gpus per node.

srun -N 4 -t 12:00 --ntasks-per-node=4 --gpus-per-node=4 --cpus-per-task=6 gmx_mpi mdrun -tunepme no -s FinalRelax_450K.tpr -o test.trr -e test.edr -g test.log -v -c test_out.gro -nb gpu -bonded gpu -update auto -pme auto -ntomp 6 -maxh 0.15

Interestingly, if I use the second command, I get quite a bit of a performance boost compared to the 1ns/day I was getting, but the run still crashes after ~2000 steps. Looking at the dump of the tpr, I don't see anything really happening around step 2000 that I could point to as a potential triggering event either (I can provide it if anyone is interested).

Hi! Thanks for sharing your experiences.

Note, I am using amdclang instead of the ROCm-provided clang

Yes, that’s perfectly fine. Nothing wrong with your CMake commands as far as I can tell.

For the record, I also tried to build with the latest AdaptiveCPP and I get the same issue.

With recent AdaptiveCpp, it might be useful to try adding -DSYCL_CXX_FLAGS_EXTRA=-DHIPSYCL_ALLOW_INSTANT_SUBMISSION=1 to GROMACS CMake options. It can both improve the scaling and help pinpoint the bug (or even prevent it), since it simplifies the mechanism for submitting tasks to the GPU.
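As a sketch, assuming the rest of the configuration stays as in your command quoted above (and that cmake is invoked from a separate build directory next to the source), the extra option would just be appended:

cmake .. -DGMX_GPU=SYCL -DGMX_MPI=ON -DGMX_SYCL_HIPSYCL=ON -DHIPSYCL_TARGETS="hip:gfx90a" -DSYCL_CXX_FLAGS_EXTRA=-DHIPSYCL_ALLOW_INSTANT_SUBMISSION=1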

Please share your results if you try it!

Using the script provided by al42and, I was able to run a test simulation that didn't crash immediately with a HIP error, but instead ran on 4 nodes with a substantial performance loss (less than 1 ns/day for a system of ~2M atoms on 4 nodes with 8 GCDs each, compared to when I run the same system on an NVIDIA compute cluster with a standard CUDA build of GROMACS).

Note: My script is specific to our local cluster (e.g., no reserved cores, which is different from some other Cray EX235a machines). I assume that you adjusted those and ensured that GPU-aware MPI is used.

Worse scaling than NVIDIA is, unfortunately, expected at the moment, and is something we’re working on together with runtime developers. But still, 1ns/day for 2M atoms is abnormally bad.

srun -N 4 -t 12:00 --ntasks-per-node=8 --gpus-per-node=8 --cpus-per-task=6 gmx_mpi mdrun -tunepme no -s FinalRelax_450K.tpr -o test.trr -e test.edr -g test.log -v -c test_out.gro -nb gpu -bonded gpu -update auto -pme auto -ntomp 6 -maxh 0.15

With this command, you are running PME on the CPU, which is very sub-optimal. I suggest using -pme gpu -npme 1.
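For example, a sketch based on the command you posted, with only the PME options changed:

srun -N 4 -t 12:00 --ntasks-per-node=8 --gpus-per-node=8 --cpus-per-task=6 gmx_mpi mdrun -tunepme no -s FinalRelax_450K.tpr -o test.trr -e test.edr -g test.log -v -c test_out.gro -nb gpu -bonded gpu -update auto -pme gpu -npme 1 -ntomp 6 -maxh 0.15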

You can also try building GROMACS with HeFFTe to be able to utilize several GPUs for PME calculation; but it won’t be of any use until the current performance issues are resolved.

Interestingly, if I use the second command, I get quite a bit of a performance boost compared to the 1ns/day I was getting

This hints at CPU competition with the ROCm and AdaptiveCpp runtime threads. One thing to try would be setting the HSA_OVERRIDE_CPU_AFFINITY_DEBUG=0 environment variable to make sure the AMD runtime worker thread obeys the CPU pinning (it escapes it otherwise), and using fewer OpenMP threads for GROMACS, e.g., srun --cpus-per-task=7 ..... gmx_mpi -ntomp 5 ....
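As a hypothetical combination of these suggestions with the launch flags quoted earlier (everything else as in your original command):

export HSA_OVERRIDE_CPU_AFFINITY_DEBUG=0
srun -N 4 --ntasks-per-node=8 --gpus-per-node=8 --cpus-per-task=7 gmx_mpi mdrun -ntomp 5 -nb gpu -bonded gpu -pme gpu -npme 1 -s FinalRelax_450K.tpr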

Alternatively, the HIPSYCL_ALLOW_INSTANT_SUBMISSION option mentioned above should also help.