Performance optimization with PME GPU decomposition

GROMACS version: 2023.2
GROMACS modification: No

Dear Community,

I want to use PME GPU decomposition for MD simulations of a larger system (~3 million atoms). I have compiled GROMACS following the instructions in the manual and @alang's blog post.

On a single node (4 GPUs) I get 38 ns/day, which matches the performance of an installation without cuFFTMp. Given the performance figures in the NVIDIA blog post, I expected relatively good scaling up to 4-8 nodes. However, when I run my benchmark on 2 nodes with 2 dedicated PME ranks, performance increases only marginally to 39 ns/day, and at higher node counts it deteriorates further.
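For context, here is a sketch of the kind of launch line I mean (simplified; the rank mapping, thread counts, and file names are placeholders rather than my literal job script, and the GMX_* environment variables are the ones described for enabling the feature in GROMACS 2023):

# enable GPU-direct communication and PME GPU decomposition (GROMACS 2023)
export GMX_ENABLE_DIRECT_GPU_COMM=1
export GMX_GPU_PME_DECOMPOSITION=1

# 2 nodes x 4 GPUs = 8 MPI ranks, 2 of them dedicated PME ranks (illustrative)
mpirun -np 8 --map-by ppr:4:node gmx_mpi mdrun \
    -npme 2 -nb gpu -pme gpu -bonded gpu -update gpu \
    -ntomp 8 -deffnm bench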

A comparison of the log files suggests that the PME ranks struggle to keep up with the nonbonded (PP) ranks:

Without PME Decomposition:
bench_cuFFTmp_1.log (30.5 KB)

With PME Decomposition:
bench_cuFFTmp_2.log (31.1 KB)

If anyone has experience with PME GPU decomposition or knows what the issues are, I’d appreciate the help.

Best Regards,
Florian

For reference, here are the GROMACS compile flags:

-DGMX_OPENMP=ON -DGMX_MPI=ON -DGMX_BUILD_OWN_FFTW=ON \
-DGMX_GPU=CUDA -DCMAKE_BUILD_TYPE=Release -DGMX_DOUBLE=OFF \
-DGMX_USE_CUFFTMP=ON -DcuFFTMp_ROOT=$HPCSDK_LIBDIR \
-DBUILD_TESTING=ON -DGMX_BUILD_UNITTESTS=ON \
-DGMX_DEVELOPER_BUILD=ON -DCMAKE_INSTALL_PREFIX=$INSTALL_PREFIX \
-DCMAKE_CXX_FLAGS=-mcpu=neoverse-v2 -DCMAKE_C_FLAGS=-mcpu=neoverse-v2 -DGMX_SIMD=ARM_NEON_ASIMD

and Library versions:

GCC: 12.3.0
OpenMPI: 4.1.6
CUDA: 12.4
HPCSDK: 24.3

P.S. I’m showing the compile flags for the ARM system I want to use for the simulations, but I get the same issue on an x86_64 system.

Scaling of PME to multiple GPUs is often very bad because of the amount of communication needed. You should try putting the PME ranks on the same node using -ddorder pp_pme. That should improve performance, but by how much depends on the bandwidth between the GPUs. NVLink is what you would like to have.
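For example, something like the following should place the PP ranks first and both PME ranks last, so the PME ranks end up together on one node instead of being interleaved (illustrative; adjust rank counts and mapping to your setup):

# group the two PME ranks on the last node rather than interleaving them
mpirun -np 8 --map-by ppr:4:node gmx_mpi mdrun \
    -npme 2 -ddorder pp_pme -nb gpu -pme gpu -bonded gpu \
    -ntomp 8 -deffnm bench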

Thank you for the suggestion!

The -ddorder pp_pme setting was something I had overlooked in the past. It improved performance up to 4 nodes. To increase the node count further, I had to distribute the PME ranks across multiple nodes, which resulted in a significant loss of performance, suggesting that communication is indeed the limiting factor.

Up to 4 nodes I get:

1 node  : 39 ns/day
2 nodes : 51 ns/day
4 nodes : 67 ns/day

This is still far from ideal, and I’m still a little puzzled by these numbers. The system is (to my limited knowledge) state of the art, with NVLink 4 and InfiniBand NDR200 (ConnectX-7).
I also noticed that the performance varies a lot. The 4-node number reported here is an average of 5 runs, where the best run reached 73.3 ns/day and the worst 59.1 ns/day.

I have not seen such a large spread in my previous benchmarks, although those were on different computing systems and without PME GPU decomposition. I was the only person using the nodes at the time, so it is not due to competing jobs.

You can see what is waiting for what in the timing table at the end of the log file. On one node the PP ranks are waiting for PME. On two nodes PME might already take more time than PP. Then you would need more PME ranks, but that also increases the communication, so the scaling deteriorates very quickly. In addition, all PP ranks need to communicate with the PME node at the same time.
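A quick way to compare the two sides is to pull the PME and wait rows out of the cycle accounting table at the end of the log (md.log is a placeholder for your actual log file; the exact row names depend on the version and offload settings):

# show PME-related and wait rows from the cycle accounting table
grep -E "PME mesh|PME wait|Wait PME|Wait GPU" md.log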

Maybe 3 nodes with one node doing only PME is better?

Another option is running PME on the CPU.
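That is just a matter of the offload flags, e.g. (illustrative; rank and thread counts are placeholders):

# keep nonbonded work on the GPUs but run PME on the CPU cores
mpirun -np 8 gmx_mpi mdrun -nb gpu -pme cpu -ntomp 16 -deffnm bench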

@Florian_Leidner, if you are able to share your .tpr file, I can have a go at running this on our internal DGX-H100 cluster and report back with results and recommended settings. If you prefer not to post it here, you can find me on LinkedIn for direct contact. Alan