That is still lower than expected. Which tasks did you offload? Try the different offload modes if you have not done so, including the GPU-resident mode with GPU update.
Hi @pszilard, thanks for your quick reply. I’ve tested it on my computer with 8 Tesla P40 GPUs and an Intel Xeon E5 CPU @ 2.2 GHz with 88 cores. The command I used was:
I didn’t notice this was FEP nor that this was an 8-GPU machine – the hardware and the simulation setup are useful to know when advising on improvements.
Indeed. The output shows that the short-range nonbonded work and PME are offloaded; this implies that bondeds and integration+constraints are not. This can also be seen from the CPU timing breakdown you shared: 12.7% in CPU update and constraints, and 72% of the runtime in “Force”, which is mostly FEP short-range nonbondeds but also includes FEP and non-FEP bonded work. The bonded work you should be able to offload by passing -bonded gpu.
Yes! The “GPU-resident mode” does the update on the GPU, and during the regular MD steps positions and forces are kept on the GPU, with the CPU in a “support” role where tasks can be carried out if performance or features require it; in this case the short-range FEP work needs to run on the CPU.
I suggest trying to offload everything (some of this is the default, but to be explicit, e.g. -nb gpu -pme gpu -bonded gpu -update gpu) and, alternatively, keeping the bondeds on the CPU if that proves to be faster; see the command sketch below.
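For illustration, a minimal full-offload invocation could look like the following (the input file name is just a placeholder, adjust it to your run):

```
# Hypothetical full-offload launch; topol.tpr stands in for your actual input
gmx mdrun -s topol.tpr -nb gpu -pme gpu -bonded gpu -update gpu
# If the bondeds turn out to be faster on the CPU, swap in: -bonded cpu
```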
If you want to maximize your full-node hardware utilization and the overall simulation throughput while running multiple lambda points at the same time, I suggest also checking the performance not only of individual runs, but also of 8 or 16 parallel runs on the node (1–2 runs per GPU). You will observe behavior similar to what we show in Fig. 11 of our recent paper: https://aip.scitation.org/doi/full/10.1063/5.0018516
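If your GROMACS build has MPI support, one way to launch such an ensemble is the -multidir mechanism; this is only a sketch, and the directory names (lambda00 … lambda15) are placeholders for wherever your per-lambda inputs live, each directory containing its own topol.tpr:

```
# Hypothetical launch of 16 independent runs (2 per GPU) on one node with an MPI build
mpirun -np 16 gmx_mpi mdrun -multidir lambda{00..15} -ntomp 5 -pin on
# GPU assignment is normally handled automatically; check the md.log files to confirm
```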
Hi @pszilard, thank you for your kind suggestions.
I’ve tried setting -bonded gpu in mdrun, but the speed stayed the same.
For the -update gpu option, an error occurred: “Free energy perturbation for mass and constraints are not supported.”
I did run multiple lambda points at the same time, but on different GPUs, not on the same GPU.
As I mentioned in my previous reply, I checked the efficiency of mdrun on one GPU with multiple threads. For one GPU, I need to use 20 CPU cores to achieve maximum efficiency.
For my machine with 8 GPUs and 88 cores, I ran 4 mdrun processes at the same time on 4 different GPUs, each with 20 OpenMP threads. The remaining 4 GPUs were left unused because of the shortage of CPU cores. It is indeed a waste of resources.
Do you think that, if I used all the GPUs, each with fewer OpenMP threads, for example:
8 individual mdrun processes on 8 GPUs, each with 10 OpenMP threads
16 individual mdrun processes on 8 GPUs, each with 5 OpenMP threads
I would get better overall performance compared to my current setting?
I may try it soon if I have understood it correctly; looking forward to your further suggestions.
That may partly be because you assigned many CPU cores to each GPU.
Yes. Do make sure that CPU and GPU affinities are set correctly (whether through your job scheduler, or via mdrun pinning and -gpu_id assignment).
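As an illustration, a manual launch of 8 runs with 10 OpenMP threads each could look roughly like the bash sketch below; the run names (lambda0 … lambda7) are placeholders, and the pin offsets/stride should be adapted to your node’s actual core and hyperthread layout:

```
# Hypothetical launch: 8 independent runs, one per GPU, 10 OpenMP threads each
for i in $(seq 0 7); do
  gmx mdrun -deffnm lambda$i -nb gpu -pme gpu -bonded gpu \
            -ntmpi 1 -ntomp 10 -gpu_id $i \
            -pin on -pinoffset $((i*10)) -pinstride 1 &
done
wait
```

The pinoffset arithmetic here simply gives each run a disjoint block of 10 hardware threads so the runs do not compete for the same cores.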
As I suggested earlier, do benchmark the total throughput rather than first maximizing the performance of each run and then trying to fit those runs onto the machine – which in your case ends up leaving GPUs idle.
The fewer CPU cores you have, the more gain there can be from offloading additional tasks to GPUs. Case in point: in Fig. 11 of the previously linked paper, the left panel (A) has relatively more CPU resources per GPU (compared to the right panel, B), hence it is always fastest to leave some work for the CPU (yellow). That explains why you did not see an improvement from -bonded gpu.
Secondly, on the same figure the horizontal axis shows the number of simulations per GPU. As you can observe, the more simulations you run per GPU, the higher the potential overall throughput (the topmost curve shows an increasing trend). In addition, the CPU resources can be used more effectively, and CPU-requiring runs benefit more from these setups; see, e.g., the slope of the light blue curves (these correspond to your setup, where update & constraints run on the CPU).
I hope that helps, please let me know if you have further questions.