Low Performance due to low utilisation of GPU

GROMACS version: 2023
GROMACS modification: No

My desktop has an Intel(R) Core™ i9-10900K CPU @ 3.70GHz and an Nvidia RTX 4090 GPU.

This is the GROMACS version installed on my system:
gmx --version

GROMACS version: 2023
Precision: mixed
Memory model: 64 bit
MPI library: thread_mpi
OpenMP support: enabled (GMX_OPENMP_MAX_THREADS = 128)
GPU support: CUDA
NB cluster size: 8
SIMD instructions: AVX2_256
CPU FFT library: fftw-3.3.8-sse2-avx-avx2-avx2_128
GPU FFT library: cuFFT
Multi-GPU FFT: none
RDTSCP usage: enabled
TNG support: enabled
Hwloc support: disabled
Tracing support: disabled
C compiler: /usr/bin/cc GNU 11.3.0
C compiler flags: -fexcess-precision=fast -funroll-all-loops -mavx2 -mfma -Wno-missing-field-initializers -O3 -DNDEBUG
C++ compiler: /usr/bin/c++ GNU 11.3.0
C++ compiler flags: -fexcess-precision=fast -funroll-all-loops -mavx2 -mfma -Wno-missing-field-initializers -Wno-cast-function-type-strict -fopenmp
BLAS library:
LAPACK library:
CUDA compiler: /usr/local/cuda-12.1/bin/nvcc nvcc: NVIDIA (R) Cuda compiler driver;Copyright (c) 2005-2023 NVIDIA Corporation;Built on Tue_Feb__7_19:32:13_PST_2023;Cuda compilation tools, release 12.1, V12.1.66;Build cuda_12.1.r12.1/compiler.32415258_0
CUDA compiler flags:-std=c++17;--generate-code=arch=compute_50,code=sm_50;--generate-code=arch=compute_52,code=sm_52;--generate-code=arch=compute_60,code=sm_60;--generate-code=arch=compute_61,code=sm_61;--generate-code=arch=compute_70,code=sm_70;--generate-code=arch=compute_75,code=sm_75;--generate-code=arch=compute_80,code=sm_80;--generate-code=arch=compute_86,code=sm_86;--generate-code=arch=compute_89,code=sm_89;--generate-code=arch=compute_90,code=sm_90;-Wno-deprecated-gpu-targets;--generate-code=arch=compute_53,code=sm_53;--generate-code=arch=compute_80,code=sm_80;-use_fast_math;-Xptxas;-warn-double-usage;-Xptxas;-Werror;-D_FORCE_INLINES;-fexcess-precision=fast -funroll-all-loops -mavx2 -mfma -Wno-missing-field-initializers -Wno-cast-function-type-strict -fopenmp
CUDA driver: 12.10
CUDA runtime: 12.10

I am running an MD simulation with everything offloaded to the GPU:
gmx mdrun -deffnm md -nb gpu -pme gpu -bonded gpu -update gpu

Still, my GPU utilisation is below 10% while my CPU utilisation is ~100%.

Here are the log file results.

How can I increase my GPU utilisation?

Thanks and Regards

Hi,

As you can see from the log output, the work on the CPU side that leads to the behavior you see is:

  • Pair search takes 35% of the wall time; it appears that you have nstlist=100, so I am not sure why it is that high.
  • “Rest” time is 37%; this is time that the mdrun timing does not explicitly account for. Perhaps you have some special algorithm enabled?
  • “Force” time is 25%, which is likely due to some bonded types needing the CPU kernels.

We should first rule out the possibility that something else running on the CPU is interfering.

Can you share a complete log file?

Cheers,
Szilárd

I can’t access that file without signing in. Please allow downloads without sign-in, or upload the file here.

Hello, I’ve encountered the same problem. GROMACS used to run very fast on my Ubuntu Linux system. Then I accidentally upgraded the kernel while updating CUDA and the drivers, and now, after downgrading the kernel (5.15.0-71-generic), performance has dropped from 130 ns/day to 3 ns/day. (Normally, CPU usage was 2400%; now it is only 1200%. Moreover, my GPU usage is below 10%.) Here is the configuration and kernel information of the server.


The server’s CPU is a 13th Gen Intel(R) Core™ i7-13700KF, the GPU is listed as 01:00.0 VGA compatible controller: NVIDIA Corporation Device 2782 (rev a1), and the graphics driver information is as follows:

The installed GROMACS version is 2023.5; here are the details:

I would appreciate your help in checking this. If you need any more information, please do not hesitate to ask.

Additionally, I have attached the NPT log file from before updating CUDA and its drivers (npt_win0_conf0.log) and the NPT log file from after the update (npt_win26_conf455.log).
npt_win0_conf0.log (59.8 KB)
npt_win26_conf455.log (61.7 KB)
Looking forward to your reply!

Hi!

Do you have any other compute-intensive workload running on your machine?

The following line indicates that the CPU performance is the bottleneck:

 Force                     1   24      50001    1904.652     156221.791  81.9

And your observation that CPU usage dropped from the usual 2400% to only 1200% (while still running 24 threads) suggests that all 24 threads are somehow now being placed on 12 (logical) cores.

Since you ask GROMACS to use all logical cores (-nt 24), it expects to have them fully available. If you have another application running, you should tell GROMACS to use only some of the cores. Since your CPU has both P- and E-cores, I’d suggest -ntmpi 1 -ntomp 8 -pin on to limit GROMACS to using only P-cores and with one thread per physical core.
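For anyone who wants to compare such rows across several logs, here is a minimal Python sketch that picks apart the cycle-accounting row quoted above. It assumes the usual column order of that table in the mdrun log (activity, ranks, threads, call count, wall time in seconds, giga-cycles, percent). The names `CycleRow` and `parse_cycle_row` are made up for this example and are not part of GROMACS.

```python
from typing import NamedTuple


class CycleRow(NamedTuple):
    activity: str
    ranks: int
    threads: int
    calls: int
    wall_s: float
    gcycles: float
    percent: float


def parse_cycle_row(line: str) -> CycleRow:
    """Parse one row of the cycle/time accounting table.

    Assumption: the row carries all six numeric columns. Rows such as
    "Rest" or "Total" have fewer columns and are not handled here.
    """
    # Activity names may contain spaces (e.g. "Neighbor search"),
    # so peel the six numeric columns off from the right.
    name, ranks, threads, calls, wall, gcycles, pct = line.strip().rsplit(None, 6)
    return CycleRow(name, int(ranks), int(threads), int(calls),
                    float(wall), float(gcycles), float(pct))


row = parse_cycle_row(
    " Force                     1   24      50001    1904.652     156221.791  81.9"
)
ms_per_call = 1000.0 * row.wall_s / row.calls
print(f"{row.activity}: {row.percent}% of run time on {row.threads} threads, "
      f"{ms_per_call:.1f} ms per call")
```

A healthy GPU-offloaded run would normally show a much smaller share for this CPU-side “Force” entry, which is why the 81.9% stands out.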

Hi al42and!
Thank you for your reply.
Following your suggestion, I changed the NPT run to use the following command:

gmx mdrun -v -deffnm npt_win27_conf467 -pin on -gpu_id 0 -ntmpi 1 -ntomp 8 -update gpu

npt_win27_conf467.log (61.4 KB)

For the same system (with 381731 atoms), the performance improved from 3.714 ns/day to 19.936 ns/day. The CPU utilization was 600%, and the GPU utilization was still low, at about 20-40%.

And when I use the command:

gmx mdrun -v -deffnm npt_win28_conf473 -pin on -gpu_id 0 -ntmpi 1 -ntomp 16 -update gpu

npt_win28_conf473.log (63.3 KB)

The CPU utilization was 960%, and a lot of PME-related output appeared on the screen. The GPU utilization was also very low, at about 30%. The simulation ran at 8.925 ns/day.

In addition, I used kill -STOP PID to suspend the application that was computing in the background before submitting the GROMACS runs. Thus, all of the above results were obtained with no other application computing in the background.

The log files of the two NPT runs above are attached. How can I speed up GROMACS and improve GPU and CPU utilization?

If you have any suggestions, please let me know. Thank you for your valuable advice.

In the top screenshot, gmx is using 600% of the CPU, but the “load average” is around 30 (the three numbers correspond to the last 1, 5 and 15 minutes). That means around 30 active threads are competing for the 24 (logical) cores, so some other processes are using the CPU.

A large “Rest” time (from your log) and the observation that a lower value of ntomp seems to give better performance also suggest that something else on your system is heavily using the CPU.
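The load-average reasoning above can be turned into a quick check to run before starting mdrun. This is only a rough Python sketch: `competing_threads` and `is_oversubscribed` are names invented here, and on Linux the load average also counts tasks in uninterruptible sleep, so treat the result as a hint rather than a measurement.

```python
import os


def competing_threads(load_avg: float, own_threads: float) -> float:
    """Estimate how many runnable threads do not belong to our own job."""
    return max(0.0, load_avg - own_threads)


def is_oversubscribed(load_avg: float, logical_cores: int) -> bool:
    """True if more threads want to run than there are logical cores."""
    return load_avg > logical_cores


def report(own_threads: float) -> None:
    # os.getloadavg() exists on Unix-like systems only.
    one_min, five_min, fifteen_min = os.getloadavg()
    cores = os.cpu_count() or 1
    print(f"load average (1/5/15 min): {one_min:.1f} {five_min:.1f} {fifteen_min:.1f}")
    print(f"logical cores: {cores}")
    print(f"threads from other processes (estimate): "
          f"{competing_threads(one_min, own_threads):.0f}")
    print(f"oversubscribed: {is_oversubscribed(one_min, cores)}")


if __name__ == "__main__" and hasattr(os, "getloadavg"):
    # 0 = nothing of ours is running yet, so any load comes from elsewhere.
    report(own_threads=0)
```

With the numbers from this thread (load average around 30, gmx at 600%, i.e. about 6 busy threads), `competing_threads(30, 6)` gives 24 threads that belong to something other than GROMACS.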

The PME-related output on your screen is just status reports from PME autotuning; there is no need to worry.

Offloading bonded interactions to the GPU (-bonded gpu) could help move more load from the CPU to the GPU. You can also increase the neighbor search interval (e.g., -nstlist 200) while keeping -update gpu; this will reduce the “Neighbor search” time.

If you do all this, there will be very little CPU work, so you can reduce -ntomp to 4 or 6, which will reduce interference with whatever else is using the CPU.

Ideally, you could pin each load to a separate set of CPU cores (in GROMACS, you can use -pin, -pinoffset, -pinstride options; for other applications, you can use, for example, the taskset utility).
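As a rough illustration of pinning two loads to separate core sets, the Python sketch below only builds the two command lines; it does not run anything. The helper names and `./other_job` are hypothetical, and the sketch assumes that the logical-core order used by mdrun matches the OS CPU numbering used by taskset. That is not guaranteed, so check the pinning reported in the mdrun log and the topology (e.g. with lscpu -e) before relying on it.

```python
import shlex


def split_cores(total: int, gmx_threads: int):
    """Split logical cores 0..total-1 into two disjoint, contiguous ranges."""
    if not 0 < gmx_threads < total:
        raise ValueError("gmx_threads must be between 1 and total - 1")
    return range(0, gmx_threads), range(gmx_threads, total)


def mdrun_cmd(deffnm: str, cores: range) -> list:
    """mdrun command pinned to `cores`, one thread per logical core."""
    return ["gmx", "mdrun", "-deffnm", deffnm,
            "-ntmpi", "1", "-ntomp", str(len(cores)),
            "-pin", "on",
            "-pinoffset", str(cores.start),  # first logical core to use
            "-pinstride", "1"]               # step between pinned threads


def taskset_cmd(cores: range, program: list) -> list:
    """Restrict another program to `cores` using taskset's CPU list syntax."""
    return ["taskset", "-c", f"{cores.start}-{cores.stop - 1}", *program]


gmx_cores, other_cores = split_cores(total=24, gmx_threads=6)
print(shlex.join(mdrun_cmd("md", gmx_cores)))
print(shlex.join(taskset_cmd(other_cores, ["./other_job"])))  # hypothetical program
```

Whether a stride of 1 or 2 is the better choice depends on how hardware threads are numbered on the machine, so the values here are placeholders rather than a recommendation.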

Dear al42and!
Thank you very much for your patient responses. After restarting my server, I ran htop and found that all CPUs were being utilized, and finally realized that my server had a virus. The problem caused by the virus has now been resolved. Currently, a system of 160,000 atoms runs at 150 ns/day using GROMACS 2023.5 (GPU: 4070Ti). Thank you for your help in getting my problem solved so quickly.

Thank you for your valuable suggestion. I am currently able to run GROMACS properly using my school’s LAN, and I will try the method you provided.