Very Low GPU utilization

GROMACS version: 2020.4
GROMACS modification: No

Question:

Hi, I ran GROMACS on my machine equipped with a Tesla P40 GPU. During NPT equilibration, the GPU utilization was only 4%–5%. Here is a snapshot of nvidia-smi: https://drive.google.com/file/d/15E3K52PP7LwawL6Mpbl2Y8aNuFeTqxv5/view?usp=sharing

My system contains about 30k atoms. Because I’m running a free energy perturbation task, the PME calculation is on the CPU.

Here is the full log.

Thanks for any suggestion!

This is a known limitation that will be addressed in the upcoming release.

Thanks, I installed the beta2 version; now the GPU utilization is ~20%.

That is still lower than expected. Which tasks did you offload? Try the different offload modes if you have not done so, including the GPU-resident mode with GPU update.

Hi @pszilard, thanks for your quick reply. I’ve tested it on my computer with 8 Tesla P40 GPUs and an Intel Xeon E5 CPU @ 2.2 GHz with 88 cores. The command I used was:

gmx mdrun -s Equilibration_NVT_34.tpr -deffnm Equilibration_NVT_34 -ntmpi 1 -ntomp 14 -gpu_id 2 -dlb yes

Here, 34 is the lambda index of one of the FEP calculations. I varied the -ntomp thread count and the corresponding speeds were as follows:

[image: simulation speed at different -ntomp thread counts]
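Such a scan can be scripted, roughly like this (only a sketch; the .tpr is the one from the command above, while the thread counts and the -nsteps/-resethway benchmarking flags are illustrative):

# run short benchmarks of the same system with different OpenMP thread counts on one GPU
for nt in 4 8 14 20 28; do
    gmx mdrun -s Equilibration_NVT_34.tpr -deffnm bench_ntomp_${nt} \
        -ntmpi 1 -ntomp ${nt} -gpu_id 2 -nsteps 10000 -resethway
done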

I think the offload was determined by GROMACS automatically. The task-assignment section of the log was:
[log excerpt: mdrun task assignment]

And here is the computation time breakdown for 14 threads:

Is the GPU-resident mode you referred to enabled by adding -update gpu?

Hi,

I didn’t notice this was FEP, nor that this was an 8-GPU machine – the hardware and the simulation setup are useful to know when advising on improvements.

Indeed. The output shows that the short-ranged nonbonded work and PME are offloaded; this implies that the bondeds and the integration+constraints are not. This can also be seen from the CPU timing breakdown you shared: 12.7% of the runtime is spent in CPU update and constraints, and 72% in “Force”, which is mostly FEP short-range nonbondeds but also includes FEP and non-FEP bonded work. The former (the bondeds) you should be able to offload by passing -bonded gpu.

Yes! The “GPU-resident mode” does the update on the GPU, and during the regular MD steps positions and forces are kept on the GPU, with the CPU in a “support” role where tasks can be carried out if performance or features require it; in this case the short-range FEP work needs to run on the CPU.
I suggest trying to offload everything (some of this is the default, but to be explicit, e.g. -nb gpu -pme gpu -bonded gpu -update gpu) and, alternatively, keeping the bondeds on the CPU if that proves to be faster.
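For example, building on the command you posted, a fully offloaded run could look roughly like this (only a sketch; the file names, thread count, and GPU id are reused from your earlier command):

gmx mdrun -s Equilibration_NVT_34.tpr -deffnm Equilibration_NVT_34 -ntmpi 1 -ntomp 14 -gpu_id 2 -nb gpu -pme gpu -bonded gpu -update gpu

The variant to compare against simply swaps -bonded gpu for -bonded cpu.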

If you want to maximize your full-node hardware utilization and the overall simulation throughput while running multiple lambda points at the same time, I suggest also checking the performance not only of individual runs, but also of 8 or 16 parallel runs on the node (1–2 runs per GPU). You will observe behavior similar to what we show in Fig. 11 of our recent paper: https://aip.scitation.org/doi/full/10.1063/5.0018516

Cheers,
Szilárd

Hi @pszilard, thank you for your kind suggestions.

I’ve tried setting -bonded gpu in mdrun, but the speed stayed the same.

For the -update gpu option, an error occurred:
Free energy perturbation for mass and constraints are not supported.

I did run multiple lambdas at the same time, but on different GPUs, not on the same GPU.
As I mentioned in my previous reply, I checked the efficiency of mdrun on one GPU with different numbers of threads. For one GPU, I need about 20 CPU cores to achieve maximum efficiency.
For my machine with 8 GPUs and 88 cores, I ran 4 mdrun processes at the same time on 4 different GPUs, each with 20 OpenMP threads. The remaining 4 GPUs were left unused because of the shortage of CPU cores. It’s indeed a waste of resources.

Do you think that, if I used all the GPUs, each with fewer OpenMP threads, for example:

  • 8 individual mdrun runs on 8 GPUs, each with 10 OpenMP threads
  • 16 individual mdrun runs on 8 GPUs, each with 5 OpenMP threads

I would get better overall performance compared to my current setup?

I may try it soon if I’ve understood it correctly; looking forward to your further suggestions.

Hi,

Unfortunately that is not supported, indeed.

That may partly be because you assigned many CPU cores to each GPU.

Yes. Do make sure that CPU and GPU affinities are set correctly (whether through your job scheduler or through mdrun pinning and GPU id assignment).
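As a sketch only (the directory layout, .tpr names, and thread counts are assumptions), a full-node launch with explicit pinning could look like this:

# 8 concurrent runs, one per GPU, 10 OpenMP threads each on the 88-core node,
# pinned to non-overlapping core ranges via -pinoffset/-pinstride
ntomp=10
for i in 0 1 2 3 4 5 6 7; do
    ( cd lambda_${i} && gmx mdrun -s topol.tpr -ntmpi 1 -ntomp ${ntomp} \
          -gpu_id ${i} -pin on -pinoffset $((i * ntomp)) -pinstride 1 ) &
done
wait   # block until all runs have finished

For two runs per GPU, launch 16 such runs with 5 threads each, with pairs of runs sharing a -gpu_id and again using non-overlapping -pinoffset ranges.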

As I suggested earlier, do benchmark the total throughput rather than first maximizing the performance of each run and then trying to fit those runs onto the machine – which in your case ends up leaving GPUs idle.

The fewer CPU cores you have per GPU, the more gain there can be from offloading additional tasks to the GPU. Case in point: in Fig. 11 of the previously linked paper, on the left (panel A) you have relatively more CPU resources per GPU (compared to the right, panel B), hence it is always fastest to leave some work on the CPU (yellow). That explains why you did not see an improvement from -bonded gpu.

Secondly, in the same figure the horizontal axis shows the number of simulations per GPU, and as you can observe, the more simulations you run per GPU, the higher the potential overall throughput (the topmost curve shows an increasing trend). In addition, the CPU resources can be used more effectively, and CPU-reliant runs benefit more from these setups; see e.g. the slope of the light blue curves (these correspond to your setup, where update & constraints run on the CPU).

I hope that helps, please let me know if you have further questions.

Cheers,
Szilárd