Optimizing GROMACS 2023.1 run for better GPU utilization

GROMACS version: 2023.1
GROMACS modification: No

I’m seeking assistance to optimize my GROMACS simulation run to fully utilize my hardware resources. Below are the details of my current setup and the issue I’m encountering.

Hardware Specifications:

  • CPU: AMD Ryzen 9 5950X (16 cores, 32 threads)
  • GPU: NVIDIA RTX 4080 SUPER, 16 GB VRAM
  • RAM: 64 GB DDR4
  • Operating System: WSL Ubuntu 22.04

GROMACS Configuration:

  • GROMACS Version: 2023.1
  • Compilation Settings:
GROMACS version:    2023.1
Precision:          mixed
Memory model:       64 bit
MPI library:        thread_mpi
OpenMP support:     enabled (GMX_OPENMP_MAX_THREADS = 128)
GPU support:        CUDA
NB cluster size:    8
SIMD instructions:  AVX2_256
CPU FFT library:    fftw-3.3.8-sse2-avx-avx2-avx2_128
GPU FFT library:    cuFFT
Multi-GPU FFT:      none
RDTSCP usage:       enabled
TNG support:        enabled
Hwloc support:      disabled
Tracing support:    disabled
C compiler:         /usr/bin/cc GNU 11.4.0
C compiler flags:   -fexcess-precision=fast -funroll-all-loops -mavx2 -mfma -Wno-missing-field-initializers -O3 -DNDEBUG
C++ compiler:       /usr/bin/c++ GNU 11.4.0
C++ compiler flags: -fexcess-precision=fast -funroll-all-loops -mavx2 -mfma -Wno-missing-field-initializers -Wno-cast-function-type-strict -fopenmp -O3 -DNDEBUG
CUDA compiler:      /usr/local/cuda/bin/nvcc
CUDA compiler flags: [As listed above]
CUDA driver:        12.70
CUDA runtime:       12.50

Simulation Details:

  • System: Protein-ligand complex
  • Number of Atoms: ~120,000
  • Simulation Type: [e.g., NVT, NPT]
  • Simulation Parameters: in attached md.mdp

Current Command:

gmx mdrun -s md.tpr -v -pme gpu -bonded gpu -update gpu -deffnm new_run -ntomp 14 -cpi new_run.cpt
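
To compare thread counts on an equal footing, a short benchmark sweep along the following lines might help. This is only a sketch: the thread counts, the 20000-step length, and the bench_ntomp<N> output names are assumptions chosen for illustration, and -cpi is dropped so that each run starts fresh from md.tpr. The script only prints the commands, so they can be reviewed before being run.

```shell
#!/usr/bin/env bash
# Print one short benchmark command per OpenMP thread count.
# Assumptions (not from the original post): md.tpr is in the current
# directory, 20000 steps is long enough for PME tuning to settle, and the
# bench_ntomp<N> output prefix is purely illustrative.
bench_cmd() {
    local ntomp="$1"
    echo "gmx mdrun -s md.tpr -deffnm bench_ntomp${ntomp}" \
         "-nb gpu -pme gpu -bonded gpu -update gpu" \
         "-ntmpi 1 -ntomp ${ntomp} -pin on" \
         "-nsteps 20000 -resethway -noconfout"
}

# Thread counts to try on a 16-core CPU (hypothetical selection).
for n in 4 8 12 16; do
    bench_cmd "$n"
done
```

Here -resethway resets the timers halfway through the run so that start-up and tuning costs are excluded from the reported ns/day, and -noconfout skips writing the final configuration.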

Issue Description:

  • GPU Utilization: I observe at most 50% utilization of the CUDA cores on the RTX 4080 SUPER, regardless of how many CPU threads I assign (e.g., 4 or 16). CPU load ranges from 20% to 100% with no significant difference in ns/day.

  • Performance Metrics: Achieving an average of 190 ns/day, which seems suboptimal given the hardware capabilities.
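
For reference, the ns/day figure can be read from the "Performance:" line that mdrun writes at the end of each log. A minimal sketch for pulling it out of one or more logs follows; the helper name ns_per_day is hypothetical, and it assumes each run finished normally so that the line is present.

```shell
#!/usr/bin/env bash
# Print the ns/day value from the last "Performance:" line of a GROMACS log.
# A log appended to by continued runs (-cpi) can contain several such lines,
# so only the final one is reported.
# Assumption: the run ended normally; otherwise nothing is printed.
ns_per_day() {
    awk '/^Performance:/ { v = $2 } END { if (v != "") print v }' "$1"
}

# Report "<log file><TAB><ns/day>" for every log given on the command line.
for f in "$@"; do
    printf '%s\t%s\n' "$f" "$(ns_per_day "$f")"
done
```

With -deffnm new_run the log is new_run.log, so saving the script under a name of your choice and passing new_run.log as its argument prints the value for the current run.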

Questions:

  1. Configuration Optimization: What settings or parameters should I adjust to enhance GPU utilization and overall simulation performance?
  2. MPI and OpenMP Balance: How should I balance the number of thread-MPI ranks (-ntmpi) and OpenMP threads (-ntomp) to make full use of my 16-core CPU and single GPU?
  3. GROMACS Compilation: Are there specific compilation flags or configurations that could improve performance on my hardware setup?
  4. System Specifics: Are there any additional system-specific optimizations (e.g., BIOS settings, OS configurations) that I should consider?

Attachment: md.mdp (2.7 KB)

Please share some log files!