GPU usage in FEP calculation

GROMACS version:2023.1
GROMACS modification: No
Hi,
Is there a way to get more GPU utilization during an FEP simulation? The GPU utilization is only 18% when I run this type of simulation.

That is extremely low. Is your system very small?

Maybe running two or more FEP simulations simultaneously on the same GPU can help a bit.

My system has close to 60k atoms.

Submitting multiple FEP jobs slows it down further.

Do the two jobs together finish later than running them in sequence?

Yes, running two jobs in parallel finishes later than running them in sequence.

Hi,

The performance of such simulations depends strongly on the type of hardware and inputs. Please provide a complete log file and a description of the simulation input to help identify whether performance can be improved.

Cheers,
Szilárd

My workstation has 80 CPUs and 2 GPUs (A5000).
The numbering of the CPUs is not consistent, which makes it difficult to pin the cores.
For certain reasons I cannot share the log file or input file for the simulation.

You can share a link to a log uploaded to an external service (e.g., Google Drive). Without a log file it is hard to judge what is happening.

Not sure what you mean; how did you try to pin to cores? Did you run multiple simulations per GPU? If so, did you use -pinoffset?
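
For reference, running two simulations side by side on one GPU would look roughly like the following (core counts and file names here are only placeholders, not tuned values):

gmx mdrun -deffnm runA -ntmpi 1 -ntomp 20 -pin on -pinoffset 0 -gpu_id 0 &
gmx mdrun -deffnm runB -ntmpi 1 -ntomp 20 -pin on -pinoffset 20 -gpu_id 0 &

so that each run is pinned to its own set of cores via -pinoffset.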

My computer has 2 GPUs. When I run one job with -ntmpi 1 -ntomp 40 -pin on -pinoffset 0 -gpu_id 0, I get 45% GPU usage, but as I increase the number of CPUs the usage drops.
There are 2 sockets and one socket has 40 CPUs, so the command with 40 CPUs runs on one complete socket, but increasing the number to 60 does not distribute the job in a consistent manner.
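
Written out in full, that invocation is roughly the following (the -deffnm name is a placeholder):

gmx mdrun -deffnm prod -ntmpi 1 -ntomp 40 -pin on -pinoffset 0 -gpu_id 0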

So I also tried to run another job with a similar command on GPU id 1, but then both jobs only use their GPU at 25%.

I have tried various combinations, but the maximum usage I could get was 45%, and it does not scale up from there.

Below you can find the link to the input file and the production-run stats from the log file.

Using an external MPI library, both GPUs do get used (98% usage on both) with a high -np rank count, but due to load imbalance the overall job is still slower than the previous best with -ntmpi 1 -ntomp 40 and 1 GPU (45% usage).

This is not a complete log file; please do not select parts of it, share the entire file.
Based on the incomplete information, it seems that you are not using GPU-resident mode (i.e., -update gpu), but since the log file is incomplete it is not clear whether you have a compatible integrator.

Otherwise, most of the CPU computational cost is in PME, so your best bet to improve performance is to allow PME balancing to shift load to the GPU and perhaps further reduce PME load with a larger PME order.
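
As a rough sketch (the values are illustrative and whether they help depends on the system), that would mean keeping the default PP-PME load balancing enabled and coarsening the PME grid via the .mdp settings:

# keep PP-PME load balancing on (it is the default; just do not pass -notunepme)
# and consider pme-order = 6 instead of the default 4 in the .mdp file
gmx mdrun -deffnm prod -ntmpi 1 -ntomp 40 -pin on -gpu_id 0 -tunepme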

If you have multiple simulations to run, I suggest focusing on maximizing the overall throughput rather than the throughput of an individual simulation, by mapping 2-3 runs to a single GPU and splitting the available CPU cores across these simulations (e.g. using -multidir).
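
A sketch of such a setup, assuming an MPI-enabled build (directory names and thread counts are placeholders):

mpirun -np 4 gmx_mpi mdrun -multidir lambda0 lambda1 lambda2 lambda3 -ntomp 20 -pin on -gpu_id 01

which would map two simulations to each of the two GPUs and split the CPU cores across them.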

Cheers,
Szilárd

You can find the complete log file at the link below.

Also, I have tried the -update gpu option, but in this case it was not supported.

I will try -multidir and check whether it helps to run multiple simulations at the same speed.

@pszilard @hess @jalemkul

Hi, I also have the same problem. For a protein-protein system, GROMACS utilizes only ~15-18% of the GPU at most.

My system specifications are as below:

$lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              32
On-line CPU(s) list: 0-31
Thread(s) per core:  1
Core(s) per socket:  16
Socket(s):           2
NUMA node(s):        2
Vendor ID:           GenuineIntel
CPU family:          6
Model:               106
Model name:          Intel(R) Xeon(R) Gold 6326 CPU @ 2.90GHz
Stepping:            6
CPU MHz:             3300.000
CPU max MHz:         2901.0000
CPU min MHz:         800.0000
BogoMIPS:            5800.00
Virtualization:      VT-x
L1d cache:           48K
L1i cache:           32K
L2 cache:            1280K
L3 cache:            24576K
NUMA node0 CPU(s):   0-15
NUMA node1 CPU(s):   16-31

and the GPU configuration is as below:

$ lshw -C display
       product: TU102GL [Quadro RTX 6000/8000]
       vendor: NVIDIA Corporation
       physical id: 0
       bus info: pci@0000:4b:00.0
       version: a1
       width: 64 bits
       clock: 33MHz
       capabilities: vga_controller bus_master cap_list rom
       configuration: driver=nvidia latency=0

The command that I run for the protein-protein production run is as below:

gmx mdrun -v -noappend -deffnm pro1 -ntmpi 8 -ntomp 4

Using the above command, it utilizes only ~18% of the GPU.
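
A variant of the command with more work offloaded to the GPU would look something like the following (the rank/thread counts are placeholders, and whether every offload, in particular -update gpu, is supported for this input is an assumption):

gmx mdrun -v -noappend -deffnm pro1 -ntmpi 1 -ntomp 16 -nb gpu -pme gpu -bonded gpu -update gpu -pin on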

$ nvidia-smi
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2      |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Quadro RTX 6000                Off | 00000000:4B:00.0 Off |                  Off |
| 42%   66C    P2             93W / 260W  |    302MiB / 24576MiB |     18%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                             GPU Memory |
|        ID   ID                                                              Usage      |
|=========================================================================================|
|    0   N/A  N/A      2357      G   /usr/libexec/Xorg                             38MiB |
|    0   N/A  N/A      3204      G   /usr/bin/gnome-shell                          36MiB |
|    0   N/A  N/A    188285      C   gmx                                          222MiB |
+-----------------------------------------------------------------------------------------+

Please have a look at the above info and guide me: how can I maximize the GPU/CPU utilization?
Do I need to modify the mdrun command options, or is GROMACS not configured well for the GPU?

Thanks for your kind consideration.

Using the multidir option, 6 windows get completed in 24 to 28 hours.

So if someone has benchmarks for FEP simulations, please share them.

What does that mean? Please use a throughput measure, e.g. ns/day, otherwise it is hard to compare. What is the GPU utilization? Note that even if individual simulations get slower when you map fewer CPU cores to them and have them share GPUs, in the end you want to maximize your total throughput, so do compute the aggregate throughput rather than that of a single simulation.

For my system with 61,522 atoms, the total from a 6-window MD run (15 ns) reaches 94 ns/day.
I have 80 cores (with hyperthreading) and 2 GPUs (RTX A5000).
I used -np 6, -ntomp 12, and -gpu_id 01 in the multidir command.
The GPU utilization was close to 40% for both GPUs.
I am sharing one file with some more details.

Without comparing this to something else, one cannot say whether this is optimal or not. Why not run short benchmark runs with 1, 2, 3, 4, 5, or 8 runs per GPU (three runs per GPU may not even be ideal since you would be leaving some CPU cores idle)?
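
A quick way to run such comparisons (step count and directory names are only placeholders) could be something like:

mpirun -np 4 gmx_mpi mdrun -multidir win1 win2 win3 win4 -ntomp 20 -pin on -gpu_id 01 -nsteps 20000 -resethway

repeated with different numbers of directories per GPU, comparing the aggregate ns/day.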

I did run some short MD runs of 1 ns and found that earlier I had used 12 ranks for running 6 windows. So I tried 6 ranks for 6 windows using both GPUs, without defining the number of threads, and the total throughput was 109.17 ns/day (combined performance for all 6 windows). Then I increased and decreased the number of windows, giving the same number of ranks as the number of windows. The best I could get was running 8 windows with 8 ranks, which gave 114.54 ns/day (combined performance of all windows); above this number of windows it gives a CUDA memory allocation error.
Both GPUs were at 40% usage for the 8-window parallel run, and the usage never crossed this number in the other combinations either.

Given that you have 20 CPU cores per GPU, it sounds like you have a lot of CPU work, perhaps a lot of perturbed atoms? If so, there is not a lot more you can do because your run is CPU-bound.