GPU usage in FEP calculation

GROMACS version:2023.1
GROMACS modification: No
Hi,
Is there a way to get more GPU utilization during an FEP simulation? The GPU utilization is only 18% when I run this type of simulation.

That is extremely low. Is your system very small?

Maybe running two or more FEP simulations simultaneously on the same GPU can help a bit.

My system has close to 60k atoms.

Submitting multiple FEP jobs slows it down further.

Do the two jobs together finish later than running them in sequence?

Yes, running two jobs in parallel finishes later than running them in sequence.

Hi,

The performance of such simulations depends strongly on the type of hardware and inputs. Please provide a complete log file and a description of the simulation input to help identify whether performance can be improved.

Cheers,
Szilárd

My workstation has 80 CPUs and 2 GPUs (A5000).
The numbering of the CPUs is not consistent, which makes it difficult to pin the cores.
For certain reasons I cannot share the log file or input file for the simulation.

You can share a link to a log uploaded to an external service (e.g., Google Drive). Without a log file it is hard to judge what is happening.

Not sure what you mean; how did you try to pin to cores? Did you run multiple simulations per GPU? If so, did you use -pinoffset?
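
For reference, running two simulations side by side on one GPU would look roughly like the following (core counts and file names here are only placeholders, not tuned values):

gmx mdrun -deffnm runA -ntmpi 1 -ntomp 20 -pin on -pinoffset 0 -gpu_id 0 &
gmx mdrun -deffnm runB -ntmpi 1 -ntomp 20 -pin on -pinoffset 20 -gpu_id 0 &

so that each run is pinned to its own set of cores via -pinoffset.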

My computer has 2 GPUs. When I run one job with -ntmpi 1 -ntomp 40 -pin on -pinoffset 0 -gpu_id 0, I get 45% GPU usage, but as I increase the number of CPUs the usage drops.
There are 2 sockets and one socket has 40 CPUs, so the command with 40 CPUs runs on one complete socket, but increasing the number to 60 does not distribute the job in a consistent manner.
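
Written out in full, that invocation is roughly the following (the -deffnm name is a placeholder):

gmx mdrun -deffnm prod -ntmpi 1 -ntomp 40 -pin on -pinoffset 0 -gpu_id 0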

So I also tried to run another job with a similar command on GPU id 1, but then both jobs only use their GPU at 25%.

I have tried various combinations, but the maximum usage I could get was 45%, and it does not scale up from there.

Below you can find the link to the input file and the production-run stats from the log file.

Using an external MPI library, both GPUs do get used (98% usage on both) with a high -np rank count, but due to load imbalance the overall job is still slower than the previous best with -ntmpi 1 -ntomp 40 and 1 GPU (45% usage).

This is not a complete log file; please do not select parts of it, share the entire file.
Based on the incomplete information, it seems that you are not using GPU-resident mode (i.e., -update gpu), but since the log file is incomplete it is not clear whether you have a compatible integrator.

Otherwise, most of the CPU computational cost is in PME, so your best bet to improve performance is to allow PME balancing to shift load to the GPU and perhaps further reduce PME load with a larger PME order.
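
As a rough sketch (the values are illustrative and whether they help depends on the system), that would mean keeping the default PP-PME load balancing enabled and coarsening the PME grid via the .mdp settings:

# keep PP-PME load balancing on (it is the default; just do not pass -notunepme)
# and consider pme-order = 6 instead of the default 4 in the .mdp file
gmx mdrun -deffnm prod -ntmpi 1 -ntomp 40 -pin on -gpu_id 0 -tunepme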

If you have multiple simulations to run, I suggest focusing on maximizing the overall throughput rather than the throughput of an individual simulation, by mapping 2-3 runs to a single GPU and splitting the available CPU cores across these simulations (e.g. using -multidir).
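
A sketch of such a setup, assuming an MPI-enabled build (directory names and thread counts are placeholders):

mpirun -np 4 gmx_mpi mdrun -multidir lambda0 lambda1 lambda2 lambda3 -ntomp 20 -pin on -gpu_id 01

which would map two simulations to each of the two GPUs and split the CPU cores across them.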

Cheers,
Szilárd

You can find the complete log file at the link below.

Also, I have tried the -update gpu option, but in this case it was not supported.

I will try -multidir and check whether it helps to run multiple simulations at the same speed.

@pszilard @hess @jalemkul

Hi, I also have the same problem. For a protein-protein system, GROMACS utilizes only ~15-18% of the GPU at most.

My system specifications are as below:

$lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              32
On-line CPU(s) list: 0-31
Thread(s) per core:  1
Core(s) per socket:  16
Socket(s):           2
NUMA node(s):        2
Vendor ID:           GenuineIntel
CPU family:          6
Model:               106
Model name:          Intel(R) Xeon(R) Gold 6326 CPU @ 2.90GHz
Stepping:            6
CPU MHz:             3300.000
CPU max MHz:         2901.0000
CPU min MHz:         800.0000
BogoMIPS:            5800.00
Virtualization:      VT-x
L1d cache:           48K
L1i cache:           32K
L2 cache:            1280K
L3 cache:            24576K
NUMA node0 CPU(s):   0-15
NUMA node1 CPU(s):   16-31

and the GPU configuration is as below:

$ lshw -C display
       product: TU102GL [Quadro RTX 6000/8000]
       vendor: NVIDIA Corporation
       physical id: 0
       bus info: pci@0000:4b:00.0
       version: a1
       width: 64 bits
       clock: 33MHz
       capabilities: vga_controller bus_master cap_list rom
       configuration: driver=nvidia latency=0

The command that I run for the protein-protein production run is as below:

gmx mdrun -v -noappend -deffnm pro1 -ntmpi 8 -ntomp 4

Using the above command, it utilizes only ~18% of the GPU.
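
A variant of the command with more work offloaded to the GPU would look something like the following (the rank/thread counts are placeholders, and whether every offload, in particular -update gpu, is supported for this input is an assumption):

gmx mdrun -v -noappend -deffnm pro1 -ntmpi 1 -ntomp 16 -nb gpu -pme gpu -bonded gpu -update gpu -pin on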

$ nvidia-smi
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2      |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Quadro RTX 6000                Off | 00000000:4B:00.0 Off |                  Off |
| 42%   66C    P2             93W / 260W  |    302MiB / 24576MiB |     18%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                             GPU Memory |
|        ID   ID                                                              Usage      |
|=========================================================================================|
|    0   N/A  N/A      2357      G   /usr/libexec/Xorg                             38MiB |
|    0   N/A  N/A      3204      G   /usr/bin/gnome-shell                          36MiB |
|    0   N/A  N/A    188285      C   gmx                                          222MiB |
+-----------------------------------------------------------------------------------------+

Please have a look at the above info and guide me: how can I maximize the GPU/CPU utilization?
Do I need to modify the mdrun command options, or is GROMACS not configured well for the GPU?

Thanks for your kind consideration.

Using the multidir option, 6 windows get completed in 24 to 28 hours.

So if someone has benchmarks for FEP simulations, please share them.

What does that mean? Please use a throughput measure, e.g. ns/day, otherwise it is hard to compare. What is the GPU utilization? Note that even if individual simulations get slower when you map fewer CPU cores to them and have them share GPUs, in the end you want to maximize your total throughput, so do compute the aggregate throughput rather than that of a single simulation.

For my system with 61,522 atoms, the total from a 6-window MD run (15 ns) reaches 94 ns/day.
I have 80 cores (with hyperthreading) and 2 GPUs (RTX A5000).
I used -np 6, -ntomp 12, and -gpu_id 01 in the multidir command.
The GPU utilization was close to 40% for both GPUs.
I am sharing one file with some more details.

Without comparing this to something else, one cannot say whether this is optimal or not. Why not run short benchmark runs with 1, 2, 3, 4, 5, or 8 runs per GPU (three runs per GPU may not even be ideal since you would be leaving some CPU cores idle)?
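
A quick way to run such comparisons (step count and directory names are only placeholders) could be something like:

mpirun -np 4 gmx_mpi mdrun -multidir win1 win2 win3 win4 -ntomp 20 -pin on -gpu_id 01 -nsteps 20000 -resethway

repeated with different numbers of directories per GPU, comparing the aggregate ns/day.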

I did run some short MD runs of 1 ns and found that earlier I had used 12 ranks for running 6 windows. So I tried 6 ranks for 6 windows using both GPUs, without defining the number of threads, and the total throughput was 109.17 ns/day (combined performance for all 6 windows). Then I increased and decreased the number of windows, giving the same number of ranks as the number of windows. The best I could get was running 8 windows with 8 ranks, which gave 114.54 ns/day (combined performance of all windows); above this number of windows it gives a CUDA memory allocation error.
Both GPUs were at 40% usage for the 8-window parallel run, and the usage never crossed this number in the other combinations either.

Given that you have 20 CPU cores per GPU, it sounds like you have a lot of CPU work, perhaps a lot of perturbed atoms? If so, there is not a lot more you can do because your run is CPU-bound.