GPU task assignment on a dual-socket node

GROMACS version: 2021.3
GROMACS modification: No

I’m trying to run GROMACS on a dual-socket node (24 cores, 48 threads per socket) with 4 GPUs. However, the logical core topology seems a little wonky; the even processor IDs correspond to physical ID 0, while the odds correspond to physical ID 1. I’m enabling GPU buffer ops, GPU halo exchange, and GPU PME-PP comms. The best performance I’ve been able to achieve is using the following mdrun flags:
-nb gpu -bonded gpu -pme gpu -npme 1 -ntmpi 7 -ntomp 7 -pin on -pinstride 2 -ntomp_pme 6 -nstlist 400 -gputasks 0011223

The problem is this excludes an entire socket; I cannot figure out how to assign the tasks properly so each GPU is only being used by one socket. Is there a way to properly assign tasks so I can use both sockets?
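For reference, the interleaved socket mapping can be confirmed with a generic Linux check (not GROMACS-specific):

```shell
# List each logical CPU alongside its socket ("physical id") to confirm
# that even processor IDs sit on socket 0 and odd IDs on socket 1.
grep -E '^(processor|physical id)' /proc/cpuinfo | paste - -
```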


Will Martin


Do you mean that the cores of one socket are not used? That is controlled by the thread count and affinity settings.

If you started 7 ranks (6 PP ranks × 7 threads + 1 PME rank × 6 threads = 48), you have assigned threads to all cores. However, you explicitly request a stride of 2, which will not work as you only have 48 threads in total (you’d need at least 2×48 for that to work).

That said, this may not be the most efficient assignment depending on your CPU topology. Also note that if your PME rank offloads everything to the GPU, it does not need (nor can it use right now) more than a single core.
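To make the single-core-PME suggestion concrete, here is a minimal sketch of the flag change; the PP rank and thread counts are carried over from the original command and are illustrative, not a tuned recommendation:

```
gmx mdrun -nb gpu -bonded gpu -pme gpu -npme 1 \
    -ntmpi 7 -ntomp 7 -ntomp_pme 1 \
    -pin on -gputasks 0011223
```

With full PME offload, shrinking the PME rank from 6 OpenMP threads to 1 frees the other 5 cores for pinning elsewhere.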


I understand why the way I’m assigning things doesn’t use both sockets, but I can’t figure out a way to make it use both sockets without a performance loss. If a rank’s threads use cores from both sockets it results in a performance loss; is there a way to force threads onto every other core while still using all cores? So still using a stride, but combined with an offset for the second “set” of threads? For a basic example:

-ntmpi 16 -ntomp 6 -gputasks 0000111122223333

But where the first 8 tMPI ranks only use even processor IDs and the second 8 use odd ones?

As for the PME, that’s good to know; I don’t have anything else for those other 5 cores to do in this case, but would it be better to just assign a single core to the PME rank here?
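One possible direction for the even/odd split asked about above, sketched here as an untested assumption: mdrun’s single -pinoffset/-pinstride pair cannot express an interleaved pin mask, but with a real-MPI build of GROMACS the MPI launcher can do socket-aware binding itself. Open MPI syntax is shown; the flag names are launcher-specific:

```
# Untested sketch (Open MPI): 8 ranks per socket, 6 cores each,
# with binding done by the launcher instead of mdrun's -pin options.
mpirun -np 16 --map-by ppr:8:socket:pe=6 \
    gmx_mpi mdrun -ntomp 6 -pin off -gputasks 0000111122223333
```

The launcher then keeps each rank’s threads on a single socket regardless of how the OS numbers the logical processors.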