Performance optimization with PME GPU decomposition

GROMACS version: 2023.2
GROMACS modification: No

Dear Community,

I want to use PME GPU decomposition for MD simulations of a larger system (~3 million atoms). I have compiled GROMACS following the instructions in the manual and @alang's blog post.

On a single node (4 GPUs) I get 38 ns/day. This matches the performance of an installation without cuFFTMp. Given the performance figures provided in the NVIDIA blog post, I expected relatively good scaling up to 4-8 nodes. However, when I run my benchmark on 2 nodes with 2 dedicated PME ranks, performance increases only marginally to 39 ns/day, and at higher node counts performance deteriorates.
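
For context, the 2-node launch looks roughly like this (simplified sketch, one rank per GPU; Slurm options and paths omitted):

# Simplified sketch of the 2-node benchmark: 8 ranks (one per GPU), 2 of them dedicated PME ranks
export GMX_ENABLE_DIRECT_GPU_COMM=1
export GMX_GPU_PME_DECOMPOSITION=1
srun --mpi=pmix -n 8 gmx_mpi mdrun -ntomp 16 -nb gpu -pme gpu -npme 2 \
     -bonded gpu -update gpu -s topol_23_3.tpr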

A comparison of the log files suggests that the PME ranks struggle to keep up with the NB ranks:

Without PME Decomposition:
bench_cuFFTmp_1.log (30.5 KB)

With PME Decomposition:
bench_cuFFTmp_2.log (31.1 KB)
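
For the comparison I mainly looked at the final performance line and the PME/PP load estimate in each log, e.g. (exact wording may differ between GROMACS versions):

# Pull the final performance line and the PME/PP load balance estimate from both logs
grep -H "Performance:" bench_cuFFTmp_*.log
grep -H "Average PME mesh/force load" bench_cuFFTmp_*.log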

If anyone has experience with PME GPU decomposition or knows what the issues are, I’d appreciate the help.

Best Regards,
Florian

For reference, here are the GROMACS compile flags:

-DGMX_OPENMP=ON -DGMX_MPI=ON -DGMX_BUILD_OWN_FFTW=ON \
-DGMX_GPU=CUDA -DCMAKE_BUILD_TYPE=Release -DGMX_DOUBLE=off \
-DGMX_USE_CUFFTMP=ON -DcuFFTMp_ROOT=$HPCSDK_LIBDIR \
-DBUILD_TESTING=ON -DGMX_BUILD_UNITTESTS=ON \
-DGMX_DEVELOPER_BUILD=ON -DCMAKE_INSTALL_PREFIX=$INSTALL_PREFIX \
-DCMAKE_CXX_FLAGS=-mcpu=neoverse-v2 -DCMAKE_C_FLAGS=-mcpu=neoverse-v2 -DGMX_SIMD=ARM_NEON_ASIMD

and Library versions:

GCC: 12.3.0
OpenMPI: 4.1.6
CUDA: 12.4
HPCSDK: 24.3
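
Putting these together, the configure and build step looks roughly like this (run from a separate build directory; source path and job count are placeholders for my environment):

# Illustrative configure/build sequence using the flags listed above;
# $HPCSDK_LIBDIR and $INSTALL_PREFIX are set in my environment
cmake /path/to/gromacs-2023.2 \
    -DGMX_OPENMP=ON -DGMX_MPI=ON -DGMX_BUILD_OWN_FFTW=ON \
    -DGMX_GPU=CUDA -DCMAKE_BUILD_TYPE=Release -DGMX_DOUBLE=off \
    -DGMX_USE_CUFFTMP=ON -DcuFFTMp_ROOT=$HPCSDK_LIBDIR \
    -DBUILD_TESTING=ON -DGMX_BUILD_UNITTESTS=ON \
    -DGMX_DEVELOPER_BUILD=ON -DCMAKE_INSTALL_PREFIX=$INSTALL_PREFIX \
    -DCMAKE_CXX_FLAGS=-mcpu=neoverse-v2 -DCMAKE_C_FLAGS=-mcpu=neoverse-v2 \
    -DGMX_SIMD=ARM_NEON_ASIMD
make -j 16
make install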

P.S. I’m showing the compile flags for the ARM system I want to use for the simulations, but I get the same issue on an x86_64 system.

Scaling of PME to multiple GPUs is often very bad because of the amount of communication needed. You should try putting the PME ranks on the same node using -ddorder pp_pme. That should improve performance, but by how much depends on the bandwidth between the GPUs. NVLink is what you would like to have.
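
For example, something along these lines (rank counts here are just illustrative):

# 2 nodes x 4 GPUs = 8 ranks, 2 of them PME; -ddorder pp_pme places the PME ranks
# last in the rank order, so with node-filling placement they end up on the same node
mpirun -np 8 gmx_mpi mdrun -pme gpu -npme 2 -ddorder pp_pme -s topol.tpr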

Thank you for the suggestion!

The -ddorder pp_pme setting was something I had overlooked in the past. It improved performance up to 4 nodes. To go beyond that, I had to distribute PME across multiple nodes, which resulted in a significant loss of performance, suggesting that communication is indeed the limiting factor.

Up to 4 nodes I get:

1 node  : 39 ns/day
2 nodes : 51 ns/day
4 nodes : 67 ns/day

This is still far from ideal and I’m still a little puzzled by these numbers. The system is (to my limited knowledge) state of the art, with NVLink 4 and InfiniBand NDR200 (ConnectX-7).
I also noticed that the performance varies a lot. The 4-node performance reported here is an average of 5 runs, where the best run reached 73.3 ns/day and the worst 59.1 ns/day.

I have not seen such a large spread in my previous benchmarks, although those were on different computing systems and without PME GPU decomposition. I was the only person using those nodes at the time, so it’s not due to competing jobs.

You can see what is waiting for what in the timing table at the end of the log file. On one node the PP ranks are waiting for PME. On two nodes PME might already take more time than PP. Then you would need more PME ranks, but that also increases the communication, so the scaling deteriorates very quickly. In addition, all PP ranks need to communicate with the PME node at the same time.

Maybe 3 nodes with one node doing only PME is better?

Another option is running PME on the CPU.
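
For example (rank and node counts are just illustrative):

# Keep non-bonded and bonded work on the GPUs, run PME on dedicated CPU ranks
srun -n 16 gmx_mpi mdrun -nb gpu -bonded gpu -pme cpu -npme 4 -s topol.tpr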

@Florian_Leidner if you are able to share your tpr file, I can have a go running this on our internal DGX-H100 cluster and report back with results and recommended settings. If you prefer not to post it here, you can find me on LinkedIn for direct contact. Alan

Hey @alang, thanks for the offer, feedback would be much appreciated!

I have uploaded the tpr file here. It was compiled with GMX 23.3. I can run the preprocessor with another version or alternatively share the raw coordinates and topology.

Thanks for your help,
All the Best,
Florian

@Florian_Leidner I have benchmarked your case on our DGX H100 cluster (4x H100 per node, 1 PME GPU per node, GROMACS v2024.3) and get the following results:

Nodes    Performance
1        37.5 ns/day
2        48.4 ns/day
4        72.5 ns/day
8        102.5 ns/day
16       132.2 ns/day
32       158.6 ns/day

Here is my run script:

#!/bin/bash
#SBATCH --ntasks=128
#SBATCH --ntasks-per-node=4
#SBATCH --partition=batch
#SBATCH --time=00:30:00
#SBATCH -o gromacs_eos_florian_32node.out

NODES=$SLURM_JOB_NUM_NODES

# Set 4 GPUs per node
NGPU=$(($NODES*4))

# Set 1 PME task per node
PMETASKS=$NODES

# Specify Location of GROMACS binary
GMX=/lustre/fsw/coreai_devtech_all/alang/gromacs/gromacs-gitlab/gromacs/build_mpi/bin/gmx_mpi

# Set number of OpenMP threads per MPI task
# Here we are running 4 tasks on a 56-core CPU, so 14 cores per task
export OMP_NUM_THREADS=14

# Specify GPU direct communication should be used
export GMX_ENABLE_DIRECT_GPU_COMM=1

# Specify GPU PME decomposition should be used
export GMX_GPU_PME_DECOMPOSITION=1

# Create a wrapper to pin tasks to NICS, GPUs and CPU NUMA regions
cat << 'EOF' > wrapper.sh
#!/bin/bash

# Add NVSHMEM to library path
export LD_LIBRARY_PATH=/lustre/fsw/coreai_devtech_all/alang/packages/nvhpc/nvhpc_2024_247_Linux_x86_64_cuda_12.5-install/Linux_x86_64/24.7/comm_libs/12.5/nvshmem/lib:$LD_LIBRARY_PATH

case $(( ${SLURM_LOCALID} )) in
 0) UCX_NET_DEVICES=mlx5_0:1 CUDA_VISIBLE_DEVICES=0 numactl --cpunodebind=0 $* ;;
 1) UCX_NET_DEVICES=mlx5_3:1 CUDA_VISIBLE_DEVICES=1 numactl --cpunodebind=0 $* ;;
 2) UCX_NET_DEVICES=mlx5_4:1 CUDA_VISIBLE_DEVICES=2 numactl --cpunodebind=0 $* ;;
 3) UCX_NET_DEVICES=mlx5_5:1 CUDA_VISIBLE_DEVICES=3 numactl --cpunodebind=0 $* ;;
esac
EOF
chmod u+x wrapper.sh

# Set srun options
IMAGE=gitlab-master.nvidia.com/dtcomp-nv-internal/imgs-released/x86_64-ubuntu22.04-gcc-cuda12.2:20230710
SRUN_ADDITIONAL_OPTIONS="--mpi=pmix \
--container-image=$IMAGE \
--container-mounts=/lustre:/lustre \
--container-mount-home \
--container-workdir=$PWD"

# Run GROMACS
srun -n $NGPU $SRUN_ADDITIONAL_OPTIONS ./wrapper.sh  \
     $GMX mdrun -v -noconfout -dlb no -nsteps 50000 -resetstep 40000 \
     -pin off -ntomp $OMP_NUM_THREADS -pme gpu -npme $PMETASKS -bonded gpu \
     -update gpu -nstlist 300 -s topol_23_3.tpr

An important factor is the wrapper script that pins each task to its NIC, GPU and CPU NUMA region. For the Grace-Hopper cluster, I suggest adapting the wrapper as follows:

case $(( ${SLURM_LOCALID} )) in
 0) UCX_NET_DEVICES=mlx5_0:1 CUDA_VISIBLE_DEVICES=0 numactl --cpunodebind=0 $* ;;
 1) UCX_NET_DEVICES=mlx5_1:1 CUDA_VISIBLE_DEVICES=1 numactl --cpunodebind=1 $* ;;
 2) UCX_NET_DEVICES=mlx5_2:1 CUDA_VISIBLE_DEVICES=2 numactl --cpunodebind=2 $* ;;
 3) UCX_NET_DEVICES=mlx5_3:1 CUDA_VISIBLE_DEVICES=3 numactl --cpunodebind=3 $* ;;
esac

(to pin each task to a separate Grace-Hopper module plus its associated NIC), and use 64 OpenMP threads per MPI task (i.e. per Grace CPU). The performance may be a bit lower than the above because of the 200 Gb/s NICs, vs. the 400 Gb/s NICs on the DGX H100.
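
In the run script above that just means bumping the thread count, e.g.:

# One MPI task per Grace-Hopper module, so use a Grace CPU's worth of cores per task
export OMP_NUM_THREADS=64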

Thanks for the effort, these numbers are very informative!
It looks like I can match the performance numbers you are getting as long as I keep all PME ranks on the same node, i.e. use -ddorder pp_pme.
Unfortunately, once I have to distribute PME across multiple nodes, I still hit a brick wall.
I have also included the wrapper script you suggested in past runs, but it did not seem to have any measurable effect.

When I try to follow your batch file as closely as possible:

#!/bin/bash -e
#
#SBATCH -p all
#SBATCH -J cuFFTmp4
#SBATCH --ntasks=16
#SBATCH --ntasks-per-node=4
#SBATCH --gres=gpu:4
#SBATCH --cpus-per-task=16
#SBATCH --hint=nomultithread
#SBATCH --time=00:10:00

module load GCC/12.3.0
module load OpenMPI/4.1.6
module load CUDA/12
module load hwloc

NODES=$SLURM_JOB_NUM_NODES

# Set 4 GPUs per node
NGPU=$(($NODES*4))

# Set 1 PME task per node
PMETASKS=$NODES

# set 16 OpenMP threads per MPI task
export OMP_NUM_THREADS=16

# build the code with PME GPU decomposition with cuFFTMp enabled,
# in an environment with a CUDA-aware MPI installation
# (see https://manual.gromacs.org/current/install-guide/index.html)
HPCSDK=/p/software/jedi/stages/2024/software/NVHPC/24.3-CUDA-12

# set the location of the math_libs directory in the NVIDIA HPC installation
HPCSDK_LIBDIR=$HPCSDK/Linux_aarch64/2024/math_libs
NVSHMEM_HOME=$HPCSDK/Linux_aarch64/2024/comm_libs/nvshmem
export LD_LIBRARY_PATH=$NVSHMEM_HOME/lib:$LD_LIBRARY_PATH

# Create a wrapper to pin tasks to NICS, GPUs and CPU NUMA regions
cat << 'EOF' > wrapper.sh
#!/bin/bash

case $(( ${SLURM_LOCALID} )) in
0) UCX_NET_DEVICES=mlx5_0:1 CUDA_VISIBLE_DEVICES=0 numactl --cpunodebind=0 $* ;;
1) UCX_NET_DEVICES=mlx5_1:1 CUDA_VISIBLE_DEVICES=1 numactl --cpunodebind=1 $* ;;
2) UCX_NET_DEVICES=mlx5_2:1 CUDA_VISIBLE_DEVICES=2 numactl --cpunodebind=2 $* ;;
3) UCX_NET_DEVICES=mlx5_3:1 CUDA_VISIBLE_DEVICES=3 numactl --cpunodebind=3 $* ;;
esac

EOF

chmod u+x wrapper.sh

# Specify GPU direct communication should be used
export GMX_ENABLE_DIRECT_GPU_COMM=1
# Specify GPU PME decomposition should be used
export GMX_GPU_PME_DECOMPOSITION=1

source /p/project1/project/opt/gromacs/23.3_arm_cufftmp/bin/GMXRC

srun --mpi=pmix -n $NGPU ./wrapper.sh gmx_mpi mdrun -v -noconfout -dlb no -nsteps 50000 -resetstep 40000 \
-pin off -ntomp $OMP_NUM_THREADS -pme gpu -npme $PMETASKS -bonded gpu \
-update gpu -nstlist 300 -s topol_23_3.tpr

My performance drops to about 20 ns/day.

I’ve just reproduced this on the same machine, and I think there are issues with some parts of the cluster. Please can you try with

#SBATCH --nodelist=jpbot-001-[01-04]

to restrict to these specific nodes, which is working much better for me.

I’ll try and further isolate which parts are causing the issue and follow up with the system admins.

When I limit my jobs to adjacent nodes, i.e. --nodelist=jpbot-001-[01-04], the performance is indeed closer to what you get on the DGX H100 cluster and to what I got when I limited all PME ranks to a single node.

With these settings I get 58 ns/day on 4 nodes and 82 ns/day on 8 nodes. This seems more reasonable, and for the first time I’m able to scale beyond four nodes.