GROMACS version: 2018 and 2020
GROMACS modification: No
Dear GROMACS users and developers,
I am currently benchmarking a dual-socket node with 8 GPUs to determine the optimal aggregate performance when running one replica per GPU. For that I use mdrun's -multidir functionality. The node consists of 2x AMD Epyc 7302 CPUs (2x 16 cores, i.e. 2x 32 hardware threads) with 8 NVIDIA RTX 2080Ti GPUs attached, four to each CPU. My MD system is an 80k-atom membrane in water.
I found that when using all hardware threads:
mpirun -np 8 mdrun -ntomp 8 -nb gpu -pme gpu -pin on …
the aggregate performance is about 10% higher than when using only one thread per physical core:
mpirun -np 8 mdrun -ntomp 4 -nb gpu -pme gpu -pin on …
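(For context, the full command line, with hypothetical replica directory names sim0 … sim7 standing in for the elided arguments, would look roughly like:
mpirun -np 8 mdrun -multidir sim0 sim1 sim2 sim3 sim4 sim5 sim6 sim7 -ntomp 8 -nb gpu -pme gpu -pin on
)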
However, given the following hardware topology:
Sockets, cores, and logical processors:
Socket 0: [ 0 32] [ 1 33] [ 2 34] [ 3 35] [ 4 36] [ 5 37] [ 6 38] [ 7 39] [ 8 40] [ 9 41] [ 10 42] [ 11 43] [ 12 44] [ 13 45] [ 14 46] [ 15 47]
Socket 1: [ 16 48] [ 17 49] [ 18 50] [ 19 51] [ 20 52] [ 21 53] [ 22 54] [ 23 55] [ 24 56] [ 25 57] [ 26 58] [ 27 59] [ 28 60] [ 29 61] [ 30 62] [ 31 63]
Numa nodes:
Node 0 (67485523968 bytes mem): 0 32 1 33 2 34 3 35 4 36 5 37 6 38 7 39 8 40 9 41 10 42 11 43 12 44 13 45 14 46 15 47
Node 1 (67621851136 bytes mem): 16 48 17 49 18 50 19 51 20 52 21 53 22 54 23 55 24 56 25 57 26 58 27 59 28 60 29 61 30 62 31 63
I would think that the pinning of threads to cores would be:
replica0 -> cores 0-7 on socket 0
replica1 -> cores 8-15 on socket 0
replica2 -> cores 16-23 on socket 1
replica3 -> cores 24-31 on socket 1
replica4 -> cores 32-39, again on socket 0 … and so on.
So two replicas would fill the first socket, then the next. But with the default round-robin GPU assignment (replica i -> GPU i), replicas 2 and 3, pinned to socket 1, would use the GPUs with IDs 2 and 3, which are attached to socket 0, and likewise replicas 4 and 5 on socket 0 would use GPUs 4 and 5 on socket 1. In total, four of the eight replicas would unnecessarily use a GPU on the remote socket.
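(The GPU-to-socket attachment can be double-checked with
nvidia-smi topo -m
which prints the interconnect matrix and the CPU affinity of each GPU.)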
This can be circumvented by adding -gputasks 0011445522336677 to the command line, so that both the PP and the PME task of each replica run on its 'home' GPU. And indeed, I get a +5% performance benefit with this -gputasks string.
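To spell out that string: with both PP and PME offloaded, each replica owns two GPU tasks, so (if I read the assignment correctly) the 16 digits are consumed two per rank in rank order, giving
replica0 -> GPU 0, replica1 -> GPU 1 (pinned to socket 0, GPUs on socket 0)
replica2 -> GPU 4, replica3 -> GPU 5 (pinned to socket 1, GPUs on socket 1)
replica4 -> GPU 2, replica5 -> GPU 3 (pinned to socket 0, GPUs on socket 0)
replica6 -> GPU 6, replica7 -> GPU 7 (pinned to socket 1, GPUs on socket 1)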
I have the following questions:
a) Are my assumptions correct regarding what happens under these circumstances?
b) Is -gputasks the only way to ensure optimal performance here, or is there an easier solution?
c) With the information about hardware topology present, couldn’t mdrun be enhanced to automatically assign the nearest GPU to each rank?
Thanks for any clarification!
Best regards,
Carsten