Questions about using two GPUs on one node, parallelization across the node, and interpreting the outputs

GROMACS version: 2021
GROMACS modification: No

Hello all,

Happy to be posting again after another two or so weeks of using Gromacs. It’s powerful software.

So, I’ll start by saying that we have built a system with two NVIDIA RTX 3090s and a 48-core CPU. We have installed GROMACS in an Ubuntu environment and are running the gromax benchmarks (GitHub - scal444/gromax) across some systems (Páll et al. 2020). This is working, and we are getting the best-performing settings for certain system sizes.

One of the questions that we have, and the reason for this post, is: when running the commands below, are the outputs individual runs, or are they steps of one whole run?

COMMANDS RUN:

$gmx -bonded gpu -deffnm group_${group}_trial_${i}_component_1 -gputasks 00 -nb gpu -nsteps ${nsteps} -nstlist 80 -nt 12 -ntmpi 1 -ntomp 12 -pin on -pinoffset 0 -pinstride 1 -pme gpu -resetstep ${resetstep} -s ${tpr} -update gpu &
$gmx -bonded gpu -deffnm group_${group}_trial_${i}_component_2 -gputasks 00 -nb gpu -nsteps ${nsteps} -nstlist 80 -nt 12 -ntmpi 1 -ntomp 12 -pin on -pinoffset 12 -pinstride 1 -pme gpu -resetstep ${resetstep} -s ${tpr} -update gpu &
$gmx -bonded gpu -deffnm group_${group}_trial_${i}_component_3 -gputasks 11 -nb gpu -nsteps ${nsteps} -nstlist 80 -nt 12 -ntmpi 1 -ntomp 12 -pin on -pinoffset 24 -pinstride 1 -pme gpu -resetstep ${resetstep} -s ${tpr} -update gpu &
$gmx -bonded gpu -deffnm group_${group}_trial_${i}_component_4 -gputasks 11 -nb gpu -nsteps ${nsteps} -nstlist 80 -nt 12 -ntmpi 1 -ntomp 12 -pin on -pinoffset 36 -pinstride 1 -pme gpu -resetstep ${resetstep} -s ${tpr} -update gpu

This produces outputs like the following:

xxxx_component_1.xtc
xxxx_component_2.xtc
xxxx_component_3.xtc
xxxx_component_4.xtc

If they are all the same run, it seems we have found parameters that increase throughput but not the time to a single end result, since in this configuration hardware utilization is close to maxed out. So, have we run the same thing four separate times, or have we split the system into quarters and run the pieces in parallel? It seems to me like we have run the same system four times.

Looking forward to hearing from you!

Best,
Kirtley

Hi Kirtley,

I am not sure if I understand your question, but what your code shows is four commands launching four independent simulations, producing four separate trajectories (each using 12 cores, with two simulations sharing each GPU). If that is the setup you intend, you can simplify things slightly by using the -multidir functionality. If, however, you intended to run a single simulation across all of your CPU cores and both GPUs, you will need to launch mdrun with multiple ranks and assign work to them. Make sure to enable direct GPU communication using the GMX_ENABLE_DIRECT_GPU_COMM=1 environment variable.
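To make the two options concrete, here is a minimal sketch (not a drop-in replacement for your script: the sim1..sim4 directories, the single_run name, and the rank/thread split are placeholder assumptions, and the best layout is worth re-benchmarking on your hardware):

# Option A: four independent simulations launched together with -multidir.
# This needs an MPI-enabled build (gmx_mpi); each placeholder directory
# sim1..sim4 holds its own .tpr file, and mdrun spreads the runs over GPUs 0 and 1.
mpirun -np 4 gmx_mpi mdrun -multidir sim1 sim2 sim3 sim4 \
    -ntomp 12 -nb gpu -pme gpu -bonded gpu -update gpu \
    -pin on -gpu_id 01

# Option B: one simulation using all 48 cores and both GPUs via thread-MPI ranks,
# here 3 PP ranks plus 1 dedicated PME rank mapped onto GPUs 0,0,1,1.
# Whether this actually beats a single-rank run depends on the system size.
export GMX_ENABLE_DIRECT_GPU_COMM=1   # the direct GPU communication variable mentioned above
gmx mdrun -deffnm single_run -ntmpi 4 -ntomp 12 -npme 1 \
    -nb gpu -pme gpu -bonded gpu -gputasks 0011 -pin on -s ${tpr}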

Also, make sure to launch the CUDA MPS daemon, as it can improve performance in both ensemble-style multi-trajectory and multi-GPU scaling use cases.
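In case it is useful, a minimal sketch of driving MPS (the exact setup, e.g. pipe directories and permissions, varies between machines, so treat this only as an outline):

nvidia-cuda-mps-control -d             # start the MPS control daemon
# ... launch the mdrun commands while the daemon is running ...
echo quit | nvidia-cuda-mps-control    # shut the daemon down when the runs finish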

Cheers,
Szilárd