How to use GPU efficiently?

GROMACS version: 2020.4
GROMACS modification: Yes/No
Below is the nvidia-smi output; as it shows, the GPU is only about 40% utilized. How can we improve that, or how can I reach 90% or higher GPU utilization when running GROMACS?

gmx mdrun -v -ntmpi 2 -ntomp 23 -deffnm md -gpu_id 01 -nb gpu -bonded gpu

The load imbalance is 1%.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 435.21       Driver Version: 435.21       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GT 740      Off  | 00000000:02:00.0 N/A |                  N/A |
| 40%   56C    P0    N/A /  N/A |    351MiB /  4035MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 1070    Off  | 00000000:03:00.0 Off |                  N/A |
| 44%   67C    P2    80W / 151W |    101MiB /  8119MiB |     40%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 1070    Off  | 00000000:82:00.0 Off |                  N/A |
| 52%   80C    P2    82W / 151W |    101MiB /  8119MiB |     41%      Default |
+-------------------------------+----------------------+----------------------+

The simple answer is: use one GPU per simulation and enable the GPU-resident mode by also offloading the update (-update gpu).
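As a concrete sketch for the hardware above, a GPU-resident run on one of the GTX 1070s alone (skipping the much weaker GT 740) might look like this; the script only echoes the launch line, and the thread counts are assumptions to tune to your CPU:

```shell
# Hedged sketch: a single-GPU, GPU-resident mdrun launch line (GROMACS 2020+).
# -gpu_id 1 selects the GTX 1070 and avoids the much slower GT 740;
# -update gpu offloads the integration step so the run stays GPU-resident.
# The -ntmpi/-ntomp values are assumptions; adjust them to your core count.
CMD="gmx mdrun -v -deffnm md -ntmpi 1 -ntomp 8 -gpu_id 1 -nb gpu -pme gpu -bonded gpu -update gpu"
echo "$CMD"
```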

If you want to run on two GPUs, the best option will depend on your hardware and simulation settings (and, depending on those, performance may scale very well).

Thank you, szilard.
One GPU's utilization has now increased to 74% after I offloaded the nonbonded and bonded interactions to the GPU. Is it possible to push performance beyond that 74%?

^^^ that

You can also run multiple simulations on the same GPU; that should max out utilization and provide a modest aggregate performance boost.
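A minimal sketch of that approach, assuming two prepared run directories (sim1/ and sim2/ are hypothetical names, each holding its own md.tpr): both runs share GPU 1, and disjoint -pinoffset values with -pin on keep them on separate CPU cores. The loop only prints the launch lines:

```shell
# Hedged sketch: print launch lines for two independent runs sharing GPU 1.
# sim1/ and sim2/ are hypothetical directories; -pin on with disjoint
# -pinoffset values pins each run to its own set of CPU cores.
for i in 0 1; do
  dir="sim$((i + 1))"
  offset=$((i * 4))
  echo "cd $dir && gmx mdrun -deffnm md -ntmpi 1 -ntomp 4 -gpu_id 1 -pin on -pinoffset $offset &"
done
```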

Good morning all,

I am not able to push my GPU utilization higher than 7% :(
I have tried a few configurations (the two numbers after each command are the reported performance in ns/day and hours/ns):
gmx mdrun -v                                                    # 34.852 ns/day, 0.689 h/ns
gmx mdrun -v -nt 1                                              #  3.065 ns/day, 7.830 h/ns
gmx mdrun -v -nt 4                                              # 10.182 ns/day, 2.357 h/ns
gmx mdrun -v -ntmpi 1 -ntomp 1                                  #  3.068 ns/day, 7.822 h/ns
gmx mdrun -v -ntmpi 1 -ntomp 4                                  # 10.272 ns/day, 2.336 h/ns
gmx mdrun -v -ntmpi 4 -ntomp 1                                  #  6.352 ns/day, 3.778 h/ns
gmx mdrun -v -ntmpi 1 -nb gpu                                   # 33.951 ns/day, 0.707 h/ns
gmx mdrun -v -ntmpi 1 -nb gpu -pme cpu                          # 18.239 ns/day, 1.316 h/ns
gmx mdrun -v -ntmpi 1 -nb gpu -pme cpu -bonded gpu              # 18.820 ns/day, 1.275 h/ns
gmx mdrun -v -ntmpi 1 -nb gpu -pme gpu -bonded gpu -update gpu  # 81.736 ns/day, 0.294 h/ns
mpirun -np 1 gmx mdrun -v                                       # 34.287 ns/day, 0.700 h/ns
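One caveat before comparing these numbers: with nsteps = 1000 each run is only 2 ps long, so startup overhead dominates the timings. A hedged way to re-time the fastest configuration is sketched below (the step count is an assumption; -nsteps and -resethway are standard mdrun options). The script only echoes the launch line:

```shell
# Hedged sketch: a longer, more reliable timing run of the fastest setup.
# -nsteps 50000 overrides the short step count in the .tpr, and -resethway
# resets the performance counters halfway through, excluding startup cost.
CMD="gmx mdrun -v -ntmpi 1 -nb gpu -pme gpu -bonded gpu -update gpu -nsteps 50000 -resethway"
echo "$CMD"
```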

For the fastest configuration, nvidia-smi reports the GPUs as mostly idle:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02    Driver Version: 450.80.02    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 108…     On  | 00000000:04:00.0 Off |                  N/A |
| 24%   45C    P2    70W / 250W |    181MiB / 11178MiB |      7%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108…     On  | 00000000:05:00.0 Off |                  N/A |
| 23%   26C    P8     9W / 250W |      0MiB / 11178MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 108…     On  | 00000000:08:00.0 Off |                  N/A |
| 23%   24C    P8     8W / 250W |      0MiB / 11178MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX 108…     On  | 00000000:09:00.0 Off |                  N/A |
| 23%   28C    P8     9W / 250W |      0MiB / 11178MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  GeForce GTX 108…     On  | 00000000:83:00.0 Off |                  N/A |
| 23%   26C    P8     9W / 250W |      0MiB / 11178MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   5  GeForce GTX 108…     On  | 00000000:84:00.0 Off |                  N/A |
| 23%   27C    P8     9W / 250W |      0MiB / 11178MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   6  GeForce GTX 108…     On  | 00000000:87:00.0 Off |                  N/A |
| 23%   24C    P8     8W / 250W |      0MiB / 11178MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   7  GeForce GTX 108…     On  | 00000000:88:00.0 Off |                  N/A |
| 23%   27C    P8     9W / 250W |      0MiB / 11178MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A    2467144      C   gmx                             179MiB  |
+-----------------------------------------------------------------------------+

FYI: I am studying an ionic system with 1000 molecules
— BOF —
integrator = md
nsteps = 1000
dt = 0.002

nstxout = 100
nstvout = 100
nstenergy = 100
nstlog = 100

continuation = no
constraint_algorithm = lincs
constraints = h-bonds
lincs_iter = 1
lincs_order = 4

cutoff-scheme = Verlet

nstlist = 10
rcoulomb = 1.0
rvdw = 1.0
DispCorr = EnerPres

coulombtype = PME
pme_order = 4
fourierspacing = 0.16

tcoupl = V-rescale
tc-grps = LI TFS DIG
tau_t = 1.0000 1.0000 1.0000
ref_t = 200 200 200

pcoupl = no

pbc = xyz

gen_vel = yes
gen_temp = 300
gen_seed = 1234
—EOF—
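Two .mdp settings above are worth noting in the context of GPU utilization: with dt = 0.002, setting nstxout/nstvout/nstenergy to 100 triggers output (and the associated GPU-to-CPU copies) every 0.2 ps, and nsteps = 1000 is only 2 ps of simulation, too short for meaningful throughput numbers. A hedged fragment for longer benchmark or production runs (the exact values are assumptions; pick what your analysis actually needs):

```
nsteps    = 500000    ; 1 ns, long enough to time reliably
nstxout   = 5000      ; coordinates every 10 ps instead of every 0.2 ps
nstvout   = 5000
nstenergy = 5000
nstlog    = 5000
```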

Please find below some information from the log file regarding the detected hardware.

Precision: mixed
Memory model: 64 bit
MPI library: thread_mpi
OpenMP support: enabled (GMX_OPENMP_MAX_THREADS = 64)
GPU support: CUDA
SIMD instructions: AVX2_256
FFT library: fftw-3.3.8-sse2-avx-avx2-avx2_128
RDTSCP usage: enabled
TNG support: enabled
Hwloc support: disabled
Tracing support: disabled
C compiler: /usr/bin/gcc GNU 8.2.0
C compiler flags: -mavx2 -mfma -Wno-missing-field-initializers -fno-inline -pthread -g
C++ compiler: /usr/bin/g++ GNU 8.2.0
C++ compiler flags: -mavx2 -mfma -Wno-missing-field-initializers -fno-inline -pthread -fopenmp -g
CUDA compiler: /cm/shared/apps/cuda11.1/toolkit/11.1.1/bin/nvcc nvcc: NVIDIA (R) Cuda compiler driver;Copyright (c) 2005-2020 NVIDIA Corporation;Built on Mon_Oct_12_20:09:46_PDT_2020;Cuda compilation tools, release 11.1, V11.1.105;Build cuda_11.1.TC455_06.29190527_0
CUDA compiler flags:-std=c++17;-gencode;arch=compute_35,code=sm_35;-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_52,code=sm_52;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-Wno-deprecated-gpu-targets;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_35,code=compute_35;-gencode;arch=compute_53,code=compute_53;-gencode;arch=compute_80,code=compute_80;-use_fast_math;;-mavx2 -mfma -Wno-missing-field-initializers -fno-inline -pthread -fopenmp -g
CUDA driver: 11.0
CUDA runtime: 11.10

Running on 1 node with total 16 cores, 16 logical cores, 1 compatible GPU
Hardware detected:
CPU info:
Vendor: Intel
Brand: Intel(R) Xeon(R) CPU E5-2640 v3 @ 2.60GHz
Family: 6 Model: 63 Stepping: 2
Features: aes apic avx avx2 clfsh cmov cx8 cx16 f16c fma htt intel lahf mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd rdtscp sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
Hardware topology: Basic
Sockets, cores, and logical processors:
Socket 0: [ 0] [ 1] [ 2] [ 3] [ 4] [ 5] [ 6] [ 7]
Socket 1: [ 8] [ 9] [ 10] [ 11] [ 12] [ 13] [ 14] [ 15]
GPU info:
Number of GPUs detected: 1
#0: NVIDIA GeForce GTX 1080 Ti, compute cap.: 6.1, ECC: no, stat: compatible

Does anyone have a hint?
Thanks,
Marco

You won’t necessarily be able to fully utilize the GPU. If the system is relatively small, it’s only going to use the resources it needs. If you’ve found a way to maximize performance, that’s the best you’re going to do.