I want to use PME GPU decomposition for MD simulations of a large system (~3 million atoms). I have compiled GROMACS following the instructions in the manual and in @alang's blog post.
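For reference, the build was configured roughly along these lines (a sketch based on the install guide; the HPC SDK / cuFFTMp path is a placeholder, not my actual path):

```bash
# Sketch of a cuFFTMp-enabled build per the GROMACS install guide.
# The math_libs path is a placeholder for the local NVIDIA HPC SDK install.
cmake .. \
    -DGMX_MPI=ON \
    -DGMX_GPU=CUDA \
    -DGMX_USE_CUFFTMP=ON \
    -DcuFFTMp_ROOT=${NVHPC_ROOT}/math_libs
```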
On a single node (4 GPUs) I get 38 ns/day, which matches the performance of an installation without cuFFTMp. Given the performance figures in the NVIDIA blog post, I expected relatively good scaling up to 4-8 nodes. However, when I run my benchmark on 2 nodes with 2 dedicated PME ranks, performance increases only marginally, to 39 ns/day, and at higher node counts it deteriorates.
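The multi-node runs are launched roughly like this (a sketch: the Slurm options, thread count, and file name are illustrative rather than my exact job script; the two environment variables are the ones described for direct GPU communication and GPU PME decomposition):

```bash
# Enable direct GPU communication and GPU PME decomposition (cuFFTMp).
export GMX_ENABLE_DIRECT_GPU_COMM=1
export GMX_GPU_PME_DECOMPOSITION=1

# 2 nodes x 4 GPUs, one rank per GPU: 6 PP ranks + 2 dedicated PME ranks.
srun --nodes=2 --ntasks-per-node=4 --gpus-per-node=4 \
    gmx_mpi mdrun -deffnm benchmark \
        -nb gpu -pme gpu -bonded gpu -update gpu \
        -npme 2 -ntomp 8
```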
A comparison of the log files suggests that the PME ranks struggle to keep up with the nonbonded (PP) ranks.
Scaling PME across multiple GPUs is often very bad because of the amount of communication needed. You should try putting the PME ranks on the same node using -ddorder pp_pme. That should improve performance, but by how much depends on the bandwidth between the GPUs; NVLink is what you would like to have.
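Something along these lines (same 2-node layout as your run; apart from -npme and -ddorder the options are just a sketch):

```bash
# -ddorder pp_pme places all PP ranks first and all PME ranks last,
# so both PME ranks land on the same (last) node.
# The default, -ddorder interleave, spreads the PME ranks across nodes.
srun --nodes=2 --ntasks-per-node=4 --gpus-per-node=4 \
    gmx_mpi mdrun -deffnm benchmark \
        -nb gpu -pme gpu -npme 2 -ddorder pp_pme
```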
The -ddorder pp_pme setting was something I had overlooked in the past. It improved performance up to 4 nodes. To go beyond that, I had to distribute the PME ranks across multiple nodes, which caused a significant loss of performance, suggesting that communication is indeed the limiting factor.
This is still far from ideal, and I'm still a bit puzzled by these numbers. The hardware is (to my limited knowledge) state of the art, with NVLink 4 and InfiniBand NDR200 (ConnectX-7).
I also noticed that the performance varies a lot: the 4-node number reported here is an average over 5 runs, with the best run at 73.3 ns/day and the worst at 59.1 ns/day.
I have not seen such a large spread in my previous benchmarks, although those were run on different computing systems and without PME GPU decomposition. I was the only person using the nodes at the time, so the variance is not due to competing jobs.
You can see what is waiting for what in the timing table at the end of the log file. On one node the PP ranks are waiting for PME. On two nodes PME might already take more time than PP. Then you would need more PME ranks, but that also increases the communication, so the scaling deteriorates very quickly. In addition, all PP ranks need to communicate with the PME node at the same time.
Maybe 3 nodes with one node doing only PME is better?
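For example (a sketch, assuming one rank per GPU):

```bash
# 3 nodes x 4 GPUs: 8 PP ranks on the first two nodes and 4 PME ranks on the
# third node; with -ddorder pp_pme the PME ranks fill the last node.
srun --nodes=3 --ntasks-per-node=4 --gpus-per-node=4 \
    gmx_mpi mdrun -deffnm benchmark \
        -nb gpu -pme gpu -npme 4 -ddorder pp_pme
```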
@Florian_Leidner If you are able to share your .tpr file, I can have a go at running this on our internal DGX-H100 cluster and report back with results and recommended settings. If you prefer not to post it here, you can find me on LinkedIn for direct contact. Alan