Tweak -multidir performance on a dual-socket, 8 GPU server

GROMACS version: 2018 and 2020
GROMACS modification: No

Dear GROMACS users and developers,

I am currently benchmarking a dual-socket node with 8 GPUs to determine the optimal aggregate performance when running one replica per GPU. For that I use mdrun’s -multidir functionality. The node consists of 2x AMD Epyc 7302 (2x 16 cores, or 2x 32 hardware threads) with 8 NVIDIA RTX 2080Ti GPUs attached, 4 to each CPU. My MD system is an 80k atom membrane-in-water system.

I found that when using all hardware threads (8 ranks x 8 OpenMP threads = 64 threads):

mpirun -np 8 mdrun -ntomp 8 -nb gpu -pme gpu -pin on …

the performance is about 10% higher than when using only the physical cores (8 x 4 = 32 threads):

mpirun -np 8 mdrun -ntomp 4 -nb gpu -pme gpu -pin on …

However, with the following hardware topology

Sockets, cores, and logical processors:
Socket 0: [ 0 32] [ 1 33] [ 2 34] [ 3 35] [ 4 36] [ 5 37] [ 6 38] [ 7 39] [ 8 40] [ 9 41] [ 10 42] [ 11 43] [ 12 44] [ 13 45] [ 14 46] [ 15 47]
Socket 1: [ 16 48] [ 17 49] [ 18 50] [ 19 51] [ 20 52] [ 21 53] [ 22 54] [ 23 55] [ 24 56] [ 25 57] [ 26 58] [ 27 59] [ 28 60] [ 29 61] [ 30 62] [ 31 63]
Numa nodes:
Node 0 (67485523968 bytes mem): 0 32 1 33 2 34 3 35 4 36 5 37 6 38 7 39 8 40 9 41 10 42 11 43 12 44 13 45 14 46 15 47
Node 1 (67621851136 bytes mem): 16 48 17 49 18 50 19 51 20 52 21 53 22 54 23 55 24 56 25 57 26 58 27 59 28 60 29 61 30 62 31 63

I would think that the pinning of threads to cores would be:

replica0 -> cores 0-7 on socket 0
replica1 -> cores 8-15 on socket 0
replica2 -> cores 16-23 on socket 1
replica3 -> cores 24-31 on socket 1
replica4 -> cores 32-39 again on socket 0 … and so on.

So two replicas per socket first, then the next socket. Replicas 2 and 3 on socket 1 would use the GPUs with IDs 2 and 3, which are, however, attached to socket 0. In total, four of the eight replicas would unnecessarily use a GPU on the remote socket.

This would be circumvented by adding -gputasks 0011445522336677 to the command line, so that both the PP and the PME task of each replica would run on the ‘home’ GPU. And indeed, I get a +5 % performance benefit with the -gputasks string.
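For completeness, the full command with the manual mapping then looks roughly like this (the replica directory names are just placeholders for my actual directories):

mpirun -np 8 mdrun -multidir sim0 sim1 sim2 sim3 sim4 sim5 sim6 sim7 -ntomp 8 -nb gpu -pme gpu -pin on -gputasks 0011445522336677 …

i.e. two digits per rank (one for the PP and one for the PME task), listed in rank order.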

I have the following questions:
a) Are my assumptions correct regarding what happens under these circumstances?
b) Is -gputasks the only way to ensure optimal performance here, or is there an easier solution?
c) With the information about hardware topology present, couldn’t mdrun be enhanced to automatically assign the nearest GPU to each rank?

Thanks for clarification!
Best regards,
Carsten

Hi Carsten,

The pinning is done onto hardware threads remapped (if needed) into a contiguous index space (as opposed to the OS “CPU” indexing, which is typically strided). This mapping is indicated in the listing under “Sockets, cores, and logical processors”.

In your case, with 8 ranks on the 32 cores / 64 threads, the pinning should place the
8 threads of the first rank with a stride of 1 onto the “CPUs” [ 0 32] [ 1 33] [ 2 34] [ 3 35].
Therefore, unless the topology detection is simply wrong, you should get a thread pinning that maps ranks 0-3 onto the first socket and ranks 4-7 onto the second.
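Spelled out for all eight ranks, my understanding of the expected mapping (with -ntomp 8 and a pin stride of 1) would be:

rank 0 -> cores 0-3   (OS CPUs 0,32 1,33 2,34 3,35)
rank 1 -> cores 4-7
rank 2 -> cores 8-11
rank 3 -> cores 12-15
rank 4 -> cores 16-19  (socket 1)
rank 5 -> cores 20-23
rank 6 -> cores 24-27
rank 7 -> cores 28-31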

Where things might get mixed up is the GPU device order as exposed by the CUDA runtime, which may not match the PCI bus/slot order.
I suggest checking hwloc-ls / nvidia-smi topo -m, where the PCI topology and core/socket affinities should become clear.
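For example (assuming hwloc and the NVIDIA tools are available):

nvidia-smi topo -m   # GPU <-> CPU affinity matrix, shows which socket each GPU hangs off
hwloc-ls             # full socket / cache / PCI topology tree

and, to double-check where the threads of a running mdrun actually ended up, something along the lines of:

pid=$(pgrep -n mdrun)    # adjust the name if your binary is called differently
for t in /proc/$pid/task/*; do taskset -cp ${t##*/}; done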

We could detect the PCIe topology and check GPU affinity, or do locality-aware device assignment, but since the straightforward mapping is wrong quite rarely, we have not prioritized it.

Cheers,
Szilárd

Thank you Szilárd for the clarification. Indeed, the PCI topology and core/socket affinities all seem to be as they should. However, this leaves me even more puzzled about the 5% performance increase I see when applying a reshuffling of the GPUs - I would assume that, if anything, this should decrease performance. I will continue to investigate.

Best,
Carsten

Hi,

Can you check where the improvement comes from (e.g. in the cycle counters or with another profiler)? Are these uncoupled runs? I could imagine that if there is some local coupling between neighboring ranks, but none across ranks further away, there could be opportunities for higher core clocks or more CPU-GPU bandwidth if immediate neighbor ranks can better overlap their idle/compute times.

I suggest using release-2021, as the cycle counters were broken; I have just fixed this in the 2021 release branch (and will backport some of it for 2020.5).

Cheers,
Szilárd

Hi Szilárd,

These are uncoupled runs - I just use the -multidir functionality on 8 identical input files to conveniently determine the aggregate performance of the node without having to worry about correct pinning of the individual simulations.
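For reference, the setup is essentially just (file and directory names are only examples):

for i in $(seq 0 7); do mkdir -p sim$i && cp topol.tpr sim$i/; done
mpirun -np 8 mdrun -multidir sim0 sim1 sim2 sim3 sim4 sim5 sim6 sim7 -ntomp 8 -nb gpu -pme gpu -pin on …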

The effect seems to show up as longer timings for the NB X/F buffer ops, whereas all other timings look similar within statistical fluctuation. If you want, I can send two exemplary log files.
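(The timings are taken from the cycle accounting table at the end of each md.log; something like grep "NB X/F buffer ops" */md.log makes the comparison between the two runs easy.)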

Note that in benchmarks with a larger system (2M atoms) I did not see a beneficial effect of the -gputasks setting.

Best regards,
Carsten

Hi Carsten,

Are you using force offload rather than GPU-resident steps (GPU update), given that you see significant time in buffer ops (which are always offloaded with the latter)?
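(For reference, with a .tpr that supports it, the GPU-resident path is requested with something along the lines of

mpirun -np 8 mdrun -ntomp 8 -nb gpu -pme gpu -bonded gpu -update gpu -pin on …

though whether -update gpu is accepted depends on the features used in the input.)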

Can you share logs? I am quite intrigued by your observation and would like to understand where the differences come from.

At first I thought this might be related to cache traffic colliding with Infinity Fabric PCIe traffic (something that can be observed on Zen 1), but I don’t think that is the case here, as on AMD Rome all traffic goes through the IO die and each of your 4 ranks per socket should be using its own L3 slice (assuming they are pinned correctly).
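(If you want to confirm the L3/CCX layout and that each rank stays within its slice, lscpu --extended and the hwloc-ls output both show which cores share an L3 cache.)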

Cheers,
Szilárd

Hi Szilárd,

Yes, this is still an old benchmark input file for which GPU update is not possible. I’ll send a couple of logs by email.

Thanks,
Carsten