Running simultaneous jobs on the same machine

GROMACS version: 2024.4
GROMACS modification: No

Hi all.

I have access to a workstation with a Ryzen 9 5900X CPU (12 cores, 24 threads) that now happens to have two GPUs (an RTX 3060 12 GB and an RTX 4060 Ti 8 GB).

I decided to use a system of two interacting proteins that I have at hand as a benchmark for the computer.

I always use the option “-pin on”.

Using only the CPU, it peaked at 14 ns/day.
The RTX 3060 12 GB peaked near 56 ns/day, whether all 24 threads or just 6 or 12 threads were requested.
And the RTX 4060 Ti 8 GB peaked at 95 ns/day, again for all three thread counts tested.

Then I tried running 4 simultaneous runs (so that each GPU would serve 2 processes at the same time, with 6 CPU threads dedicated to each). I used “-pinoffset” and limited the number of threads to ensure that no CPU thread would be requested twice. The RTX 3060 and 4060 Ti performances fell to 22 and 31 ns/day, respectively. When I enabled all the options to offload everything possible to the GPU, they rose slightly to 25 and 36 ns/day, respectively.
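
Roughly, the four launches followed a pattern like the one below (two jobs per GPU, 6 OpenMP threads each; the offsets and -deffnm names here are just illustrative, not my exact commands):

gmx_gpu mdrun -deffnm run1 -pin on -pinoffset 0 -ntmpi 1 -ntomp 6 -gpu_id 0
gmx_gpu mdrun -deffnm run2 -pin on -pinoffset 6 -ntmpi 1 -ntomp 6 -gpu_id 0
gmx_gpu mdrun -deffnm run3 -pin on -pinoffset 12 -ntmpi 1 -ntomp 6 -gpu_id 1
gmx_gpu mdrun -deffnm run4 -pin on -pinoffset 18 -ntmpi 1 -ntomp 6 -gpu_id 1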

After that, I decided to have just two simultaneous runs: each one would get half of the CPU threads and one of the GPUs. A typical run with no special options reached only 24 and 27 ns/day.

However, things got interesting when I once again enabled all the “run it on the GPU” options: with only two processes at once, the performance rose to 43 and 83 ns/day.

Is this to be expected, considering that the CPU threads are not being shared between processes? Is there a way to improve the performance of several simultaneous simulation processes?

Thanks a lot in advance!

I’m afraid there is something wrong with the CPU pinning that is meant to avoid using the same core twice.

I ran the two jobs below, one on each GPU:

__NV_PRIME_RENDER_OFFLOAD=1 __GLX_VENDOR_LIBRARY_NAME=nvidia gmx_gpu mdrun -s CoViD19+hACE2.014.gpu.tpr -o CoViD19+hACE2.014.gpu.trr -x CoViD19+hACE2.014.gpu.xtc -c CoViD19+hACE2.014.gpu.gro -e CoViD19+hACE2.014.gpu.edr -g CoViD19+hACE2.014.gpu.log -cpi CoViD19+hACE2.014.gpu.cpt -cpo CoViD19+hACE2.014.gpu.cpt -pin on -pinoffset 0 -ntmpi 1 -ntomp 12 -gpu_id "0" -nice 0 -v >& CoViD19+hACE2.014.gpu.out

__NV_PRIME_RENDER_OFFLOAD=1 __GLX_VENDOR_LIBRARY_NAME=nvidia gmx_gpu mdrun -s CoViD19+hACE2.014.gpu.tpr -o CoViD19+hACE2.014.gpu.trr -x CoViD19+hACE2.014.gpu.xtc -c CoViD19+hACE2.014.gpu.gro -e CoViD19+hACE2.014.gpu.edr -g CoViD19+hACE2.014.gpu.log -cpi CoViD19+hACE2.014.gpu.cpt -cpo CoViD19+hACE2.014.gpu.cpt -pin on -pinoffset 12 -ntmpi 1 -ntomp 12 -gpu_id "1" -nice 0 -v >& CoViD19+hACE2.014.gpu.out

As far as I know, that should ensure that no thread is used twice. However, running the “top” command shows me the following:

Tasks: 506 total,   5 running, 501 sleeping,   0 stopped,   0 zombie
%Cpu(s): 56.3 us,  0.4 sy,  0.0 ni, 43.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MB Mem : 64190.21+total, 6250.293 free, 6149.223 used, 52755.03+buff/cache
MB Swap: 65535.99+total, 65534.98+free,    1.012 used, 58040.99+avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
20971 johannes  20   0 10.058g 409656 203484 R 701.7 0.623   4:28.13 gmx_gpu
20436 johannes  20   0 10.029g 413456 204948 R 667.8 0.629  21:54.96 gmx_gpu
21255 johannes  20   0       0      0      0 R 2.326 0.000   0:00.07 nvidia-smi

This is odd to me, because both processes should be at around ~1200% (which is what actually happens when I issue only one of those two command lines), not ~600%, right?
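
One way to double-check where the threads actually land would be to query the affinity and the processor each thread is running on, using the PIDs from the top output above, for example:

taskset -acp 20436    # affinity list of every thread in the first job
taskset -acp 20971    # affinity list of every thread in the second job
ps -Lo pid,tid,psr,comm -p 20436,20971    # processor (psr) each thread is currently on

If threads from the two jobs report overlapping affinity lists or psr values, that would confirm they are being pinned onto the same hardware threads.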

Moreover, if I add the options “-nb gpu -pme gpu -pmefft gpu -bonded gpu -update gpu”, the CPU usage is about the same whether I run one command or both, and I get the good timings of 43 and 83 ns/day for the RTX 3060 12 GB and the RTX 4060 Ti 8 GB, respectively. Without them, to illustrate the strange issue above, the performance falls to 27 and 24 ns/day, respectively.
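
For completeness, the fully offloaded runs were essentially the two pinned commands above with those flags appended, i.e. something like:

gmx_gpu mdrun [...] -pin on -pinoffset 0 -ntmpi 1 -ntomp 12 -gpu_id "0" -nb gpu -pme gpu -pmefft gpu -bonded gpu -update gpu
gmx_gpu mdrun [...] -pin on -pinoffset 12 -ntmpi 1 -ntomp 12 -gpu_id "1" -nb gpu -pme gpu -pmefft gpu -bonded gpu -update gpu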

Can someone point out what I might be doing wrong here? Why do those processes seem to be going to the same cores/threads, even though the pinoffset values are different?

Thanks a lot in advance for any help, clue or suggestion.

Not my expertise, but maybe I have an idea. Take a look at the definitions of -pinoffset and -pinstride in the manual, and then at the second-to-last entry of the “Examples for mdrun on one node” section. You have twelve cores, so I guess you are using six of them for each simulation and each of those is being used at 100% (so you get 600%). If you want more, it will probably be by exploiting hyper-threading, which won’t necessarily boost performance (but maybe a bit). So I guess you will have to pin two threads per physical core; maybe try this:

gmx mdrun [...] -gpu_id 0 -nt 12 -pin on -pinoffset 0 -pinstride 1
gmx mdrun [...] -gpu_id 1 -nt 12 -pin on -pinoffset 12 -pinstride 1

Regarding the gain in performance when you offload everything to the GPU, I would say that depends on the relative strength of your CPU and GPUs and on the type of calculations you are doing. In your case, you clearly benefit from offloading most of the run to the GPUs.
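
If you want to double-check that both GPUs really stay busy in the two-job setup, polling them with plain nvidia-smi should be enough, for example:

nvidia-smi --query-gpu=index,name,utilization.gpu,memory.used --format=csv -l 5    # print utilization and memory every 5 s

A GPU that keeps dropping to low utilization while the other stays busy would hint that the bottleneck is on the CPU side (e.g. the pinning issue discussed above).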