How to use GPU efficiently?

GROMACS version: 2020.4
GROMACS modification: Yes/No
Below is the nvidia-smi output; as it shows, GPU utilization is only 40%. How can we improve this?
Or, how can I reach 90% or higher GPU utilization when running GROMACS?
gmx mdrun -v -ntmpi 2 -ntomp 23 -deffnm md -gpu_id 01 -nb gpu -bonded gpu
Load imbalance is 1%.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 435.21       Driver Version: 435.21       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GT 740      Off  | 00000000:02:00.0 N/A |                  N/A |
|  40%   56C    P0    N/A / N/A |    351MiB /  4035MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 1070    Off  | 00000000:03:00.0 Off |                  N/A |
|  44%   67C    P2    80W / 151W |   101MiB /  8119MiB |      40%     Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 1070    Off  | 00000000:82:00.0 Off |                  N/A |
|  52%   80C    P2    82W / 151W |   101MiB /  8119MiB |      41%     Default |
+-------------------------------+----------------------+----------------------+

The simple answer is: use one GPU per simulation and enable GPU-resident steps by also offloading the update (integration) with -update gpu.

If you want to run on two GPUs, the best option will depend on your hardware and simulation settings (and, depending on those, the performance may scale very well).
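For example, a minimal single-GPU, fully offloaded command could look like the line below (this is only a sketch: -deffnm md matches your earlier command, while the GPU id and thread settings are assumptions you should adapt to your machine):

gmx mdrun -v -deffnm md -ntmpi 1 -nb gpu -pme gpu -bonded gpu -update gpu -gpu_id 1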

Thank you szilard,
Now I can see that the usage of one GPU has increased to 74%. I have offloaded nb and bonded to the GPU. Is it possible to push the performance higher than 74%?

^^^ that

http://manual.gromacs.org/documentation/current/user-guide/mdrun-performance.html#examples-for-mdrun-on-one-node

You can also run multiple simulations on the same GPU; that should max out the GPU and provide a modest aggregate performance boost.
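As a rough sketch, two independent runs sharing one GPU could be launched like this (the run names, thread counts, and pin offsets are assumptions; match them to your input files and core count):

gmx mdrun -deffnm run1 -nb gpu -pme gpu -bonded gpu -update gpu -gpu_id 1 -ntmpi 1 -ntomp 8 -pin on -pinoffset 0 &
gmx mdrun -deffnm run2 -nb gpu -pme gpu -bonded gpu -update gpu -gpu_id 1 -ntmpi 1 -ntomp 8 -pin on -pinoffset 8 &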

Good morning all,

I am not able to push my GPU utilization higher than 7% :(
I have tried a few configurations (performance reported as ns/day and hours/ns):

gmx mdrun -v #Performance: 34.852 0.689
gmx mdrun -v -nt 1 #Performance: 3.065 7.830
gmx mdrun -v -nt 4 #Performance: 10.182 2.357
gmx mdrun -v -ntmpi 1 -ntomp 1 #Performance: 3.068 7.822
gmx mdrun -v -ntmpi 1 -ntomp 4 #Performance: 10.272 2.336
gmx mdrun -v -ntmpi 4 -ntomp 1 #Performance: 6.352 3.778
gmx mdrun -v -ntmpi 1 -nb gpu #Performance: 33.951 0.707
gmx mdrun -v -ntmpi 1 -nb gpu -pme cpu #Performance: 18.239 1.316
gmx mdrun -v -ntmpi 1 -nb gpu -pme cpu -bonded gpu #Performance: 18.820 1.275
gmx mdrun -v -ntmpi 1 -nb gpu -pme gpu -bonded gpu -update gpu #Performance: 81.736 0.294
mpirun -np 1 gmx mdrun -v #Performance: 34.287 0.700

For the best option, nvidia-smi shows hardly any GPU use:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02    Driver Version: 450.80.02    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 108…     On  | 00000000:04:00.0 Off |                  N/A |
|  24%   45C    P2    70W / 250W |   181MiB / 11178MiB |      7%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108…     On  | 00000000:05:00.0 Off |                  N/A |
|  23%   26C    P8     9W / 250W |     0MiB / 11178MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 108…     On  | 00000000:08:00.0 Off |                  N/A |
|  23%   24C    P8     8W / 250W |     0MiB / 11178MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX 108…     On  | 00000000:09:00.0 Off |                  N/A |
|  23%   28C    P8     9W / 250W |     0MiB / 11178MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  GeForce GTX 108…     On  | 00000000:83:00.0 Off |                  N/A |
|  23%   26C    P8     9W / 250W |     0MiB / 11178MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   5  GeForce GTX 108…     On  | 00000000:84:00.0 Off |                  N/A |
|  23%   27C    P8     9W / 250W |     0MiB / 11178MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   6  GeForce GTX 108…     On  | 00000000:87:00.0 Off |                  N/A |
|  23%   24C    P8     8W / 250W |     0MiB / 11178MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   7  GeForce GTX 108…     On  | 00000000:88:00.0 Off |                  N/A |
|  23%   27C    P8     9W / 250W |     0MiB / 11178MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A    2467144      C   gmx                              179MiB |
+-----------------------------------------------------------------------------+

FYI: I am studying an ionic system with 1000 molecules
— BOF —
integrator = md
nsteps = 1000
dt = 0.002

nstxout = 100
nstvout = 100
nstenergy = 100
nstlog = 100

continuation = no
constraint_algorithm = lincs
constraints = h-bonds
lincs_iter = 1
lincs_order = 4

cutoff-scheme = Verlet

nstlist = 10
rcoulomb = 1.0
rvdw = 1.0
DispCorr = EnerPres

coulombtype = PME
pme_order = 4
fourierspacing = 0.16

tcoupl = V-rescale
tc-grps = LI TFS DIG
tau_t = 1.0000 1.0000 1.0000
ref_t = 200 200 200

pcoupl = no

pbc = xyz

gen_vel = yes
gen_temp = 300
gen_seed = 1234
—EOF—

Please find below some information from the log file regarding the detected hardware.

Precision: mixed
Memory model: 64 bit
MPI library: thread_mpi
OpenMP support: enabled (GMX_OPENMP_MAX_THREADS = 64)
GPU support: CUDA
SIMD instructions: AVX2_256
FFT library: fftw-3.3.8-sse2-avx-avx2-avx2_128
RDTSCP usage: enabled
TNG support: enabled
Hwloc support: disabled
Tracing support: disabled
C compiler: /usr/bin/gcc GNU 8.2.0
C compiler flags: -mavx2 -mfma -Wno-missing-field-initializers -fno-inline -pthread -g
C++ compiler: /usr/bin/g++ GNU 8.2.0
C++ compiler flags: -mavx2 -mfma -Wno-missing-field-initializers -fno-inline -pthread -fopenmp -g
CUDA compiler: /cm/shared/apps/cuda11.1/toolkit/11.1.1/bin/nvcc nvcc: NVIDIA (R) Cuda compiler driver;Copyright (c) 2005-2020 NVIDIA Corporation;Built on Mon_Oct_12_20:09:46_PDT_2020;Cuda compilation tools, release 11.1, V11.1.105;Build cuda_11.1.TC455_06.29190527_0
CUDA compiler flags:-std=c++17;-gencode;arch=compute_35,code=sm_35;-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_52,code=sm_52;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-Wno-deprecated-gpu-targets;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_35,code=compute_35;-gencode;arch=compute_53,code=compute_53;-gencode;arch=compute_80,code=compute_80;-use_fast_math;;-mavx2 -mfma -Wno-missing-field-initializers -fno-inline -pthread -fopenmp -g
CUDA driver: 11.0
CUDA runtime: 11.10

Running on 1 node with total 16 cores, 16 logical cores, 1 compatible GPU
Hardware detected:
CPU info:
Vendor: Intel
Brand: Intel(R) Xeon(R) CPU E5-2640 v3 @ 2.60GHz
Family: 6 Model: 63 Stepping: 2
Features: aes apic avx avx2 clfsh cmov cx8 cx16 f16c fma htt intel lahf mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd rdtscp sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
Hardware topology: Basic
Sockets, cores, and logical processors:
Socket 0: [ 0] [ 1] [ 2] [ 3] [ 4] [ 5] [ 6] [ 7]
Socket 1: [ 8] [ 9] [ 10] [ 11] [ 12] [ 13] [ 14] [ 15]
GPU info:
Number of GPUs detected: 1
#0: NVIDIA GeForce GTX 1080 Ti, compute cap.: 6.1, ECC: no, stat: compatible

Does anyone have a hint?
Thanks,
Marco

You won’t necessarily be able to fully utilize the GPU. If the system is relatively small, it’s only going to use the resources it needs. If you’ve found a way to maximize performance, that’s the best you’re going to do.

Hi Justin, I submitted an MD job and I got this message:
Multiple energy groups is not implemented for GPUs, falling back to the CPU. For better performance, run on the GPU without energy groups and then do gmx mdrun -rerun option on the trajectory with an energy group .tpr file.
Changing nstlist from 10 to 50, rlist from 1.1 to 1.206

What do you think should be done in this regard? What if I removed the tc-grps?

Remove any energygrps from the .mdp file, create a new .tpr file, and run again. There is no purpose in analyzing interaction energies during the actual simulation; do it afterwards as part of the analysis.
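A rough sketch of that rerun workflow, in case it helps (all file names here are hypothetical placeholders, and the offload flags are just one possible choice):

# production run from an .mdp without energygrps
gmx grompp -f md.mdp -c system.gro -p topol.top -o md.tpr
gmx mdrun -deffnm md -nb gpu -pme gpu -bonded gpu -update gpu
# rebuild a .tpr that does define energygrps, then rerun the saved trajectory against it
gmx grompp -f md_energygrps.mdp -c system.gro -p topol.top -o rerun.tpr
gmx mdrun -s rerun.tpr -rerun md.xtc -deffnm rerun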

Thank you so much, Justin!

I have two more questions.

  1. I got these notes after I used grompp:

NOTE 1 [file prMD_NVT_cont.mdp]:
Removing center of mass motion in the presence of position restraints
might cause artifacts. When you are using position restraints to
equilibrate a macro-molecule, the artifacts are usually negligible.

NOTE 2 [file prMD_NVT_cont.mdp]:
You are using geometric combination rules in LJ-PME, but your non-bonded
C6 parameters do not follow these rules. This will introduce very small
errors in the forces and energies in your simulations. Dispersion
correction will correct total energy and/or pressure for isotropic
systems, but not forces or surface tensions.

Do I need to do anything towards these notes?

  2. step 11: One or more water molecules can not be settled.
    Check for bad contacts and/or reduce the timestep if appropriate.

I got this during the energy minimization step using the steep algorithm. Is it okay to continue the minimization and equilibration without taking care of this, especially since it wrote out pdb files corresponding to these steps? Or should I change something in the mdp file?

Hi friends,
could anyone suggest how to modify the command line, or otherwise help increase the GPU memory usage to the maximum level (the full memory of the GPU, 48 GB in my case)?

Currently the 4 running jobs are getting 375 MiB, 370 MiB, 290 MiB, etc.; as shown in the screenshot, each job is allocated only about 1% (around 300 MiB) of the 48 GB of GPU memory.

I have used the command line:

gmx mdrun -s gromacs1.tpr -ntmpi 1 -ntomp 16 -maxh 100

(also)

gmx mdrun -s gromacs1.tpr -ntmpi 1 -ntomp 16 -maxh 100 -update gpu

So, please suggest how we can increase GPU memory usage to the maximum (48 GB) for a single job or for multiple tasks.

gmx does not use much of the memory; you are good here - the utilization is 97%.

Thank you scinikhil,

Yes, 97%, but that is the sum over all jobs; each job gets some share of it, and collectively they show 97 or 100%.

If we look at the GPU memory usage percentage, it is only around 5-10%, i.e. 1554 MiB out of 49140 MiB (48 GB).

So, we would like to make full use of the GPU's memory.

I am still looking for further suggestions to fix this, since we have plenty of GPU capacity available but usage is low and the MD is running very slowly.

Share your log file.
Again, gmx does not use much memory; performance depends on the CUDA cores in your GPU.
Run one tpr and measure the performance; running multiple jobs on a single GPU reduces per-job performance.
Share the last part of the log file along with the full command you used for running one job, so that we can suggest better options.

Also, offload all the calculations to the GPU:

https://manual.gromacs.org/5.1.1/user-guide/mdrun-performance.html

Hi,
please have a look at this:

Command line:
gmx mdrun -s gromacs1.tpr -ntmpi 1 -ntomp 16 -maxh 24 -gpu_id 0

Last part of md.log file: (screenshot attached)

How many atoms are there in your input, and what type of simulation system are you using? If you want to benchmark and get the best performance, try adding the -nb gpu, -bonded gpu, and -update gpu flags; refer to the link above for a better understanding.
You may also need to change the -ntmpi and -ntomp values to get the best performance.
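For example, starting from the command you posted, a fully offloaded variant could look like the line below (treat it as a sketch: the thread counts are simply the ones you already use and may need tuning):

gmx mdrun -s gromacs1.tpr -ntmpi 1 -ntomp 16 -maxh 24 -gpu_id 0 -nb gpu -pme gpu -bonded gpu -update gpu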

OK, thank you, I will come back after trying what you suggested.