Efficient Use of CPU and GPU Hybridization for Multiple GROMACS Jobs on a Single Machine

GROMACS version: 2022.2
GROMACS modification: No

Hello,

I am planning to run PMX simulations with GROMACS on an on-premises machine. My goal is to make efficient use of all the resources of this single machine across multiple concurrent GROMACS jobs.

I use a replication factor of 3 and consider both the unbound and bound states (2) as well as the forward and reverse transitions (2), giving 3 × 2 × 2 = 12 mdrun jobs in total.
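
For concreteness, the 12 jobs map onto a directory tree like the following (a sketch; the naming scheme is my own, not anything prescribed by pmx):

```bash
# Sketch only: one directory per job, 3 replicas x 2 states x 2 directions = 12.
for state in bound unbound; do
  for direction in forward reverse; do
    for rep in 1 2 3; do
      mkdir -p "${state}/${direction}/rep${rep}"   # each will hold one topol.tpr
    done
  done
done
```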

Hardware Specifications:

  • 2× AMD EPYC 7763 64-core processors (SMT enabled, 256 hardware threads in total)
  • 1× NVIDIA A100 80 GB PCIe GPU
  • 1 TB of RAM

Software Setup:

  • Slurm and OpenMPI installed
  • GROMACS built with MPI support in two flavors: a CPU-only build (gromacs_cpu_mpi) and a CUDA build (gromacs_gpu_mpi); the version output of each is shown below

CPU-only build:

gmx_mpi -version
                       :-) GROMACS - gmx_mpi, 2022.2 (-:

Executable:   /usr/local/gromacs-2022.2_mpi/build/bin/gmx_mpi
Data prefix:  /usr/local/gromacs
Working dir:  /usr/share/modules/modulefiles
Command line:
  gmx_mpi -version

GROMACS version:    2022.2
Precision:          mixed
Memory model:       64 bit
MPI library:        MPI
OpenMP support:     enabled (GMX_OPENMP_MAX_THREADS = 256)
GPU support:        disabled
SIMD instructions:  AVX2_128
CPU FFT library:    fftw-3.3.8-sse2-avx-avx2-avx2_128
GPU FFT library:    none
RDTSCP usage:       enabled
TNG support:        enabled
Hwloc support:      hwloc-2.9.1
Tracing support:    disabled
C compiler:         /usr/bin/cc GNU 7.5.0
C compiler flags:   -mavx2 -mfma -Wno-missing-field-initializers -fexcess-precision=fast -funroll-all-loops -pthread -O3 -DNDEBUG
C++ compiler:       /usr/bin/g++ GNU 7.5.0
C++ compiler flags: -mavx2 -mfma -pthread -Wno-missing-field-initializers -fexcess-precision=fast -funroll-all-loops -pthread SHELL:-fopenmp -O3 -DNDEBUG

CUDA build:

gmx_mpi --version
                       :-) GROMACS - gmx_mpi, 2022.2 (-:

Executable:   /usr/local/gromacs-2022.2/build/bin/./gmx_mpi
Data prefix:  /usr/local/gromacs-2022.2 (source tree)
Working dir:  /usr/local/gromacs-2022.2/build/bin
Command line:
  gmx_mpi --version

GROMACS version:    2022.2
Precision:          mixed
Memory model:       64 bit
MPI library:        MPI
OpenMP support:     enabled (GMX_OPENMP_MAX_THREADS = 256)
GPU support:        CUDA
SIMD instructions:  AVX2_128
CPU FFT library:    fftw-3.3.8-sse2-avx-avx2-avx2_128
GPU FFT library:    cuFFT
RDTSCP usage:       enabled
TNG support:        enabled
Hwloc support:      hwloc-2.9.1
Tracing support:    disabled
C compiler:         /usr/bin/cc GNU 7.5.0
C compiler flags:   -mavx2 -mfma -Wno-missing-field-initializers -fexcess-precision=fast -funroll-all-loops -pthread -O3 -DNDEBUG
C++ compiler:       /usr/bin/g++ GNU 7.5.0
C++ compiler flags: -mavx2 -mfma -pthread -Wno-missing-field-initializers -fexcess-precision=fast -funroll-all-loops -pthread SHELL:-fopenmp -O3 -DNDEBUG
CUDA compiler:      /usr/local/cuda-12.2/bin/nvcc nvcc: NVIDIA (R) Cuda compiler driver;Copyright (c) 2005-2023 NVIDIA Corporation;Built on Tue_Aug_15_22:02:13_PDT_2023;Cuda compilation tools, release 12.2, V12.2.140;Build cuda_12.2.r12.2/compiler.33191640_0
CUDA compiler flags:-std=c++14;-gencode;arch=compute_80,code=sm_80;-use_fast_math;-D_FORCE_INLINES;-mavx2 -mfma -pthread -Wno-missing-field-initializers -fexcess-precision=fast -funroll-all-loops -pthread SHELL:-fopenmp -O3 -DNDEBUG
CUDA driver:        12.20
CUDA runtime:       10.10

Questions:

  1. As far as I know, the NVIDIA A100 supports a maximum of 16 concurrent client threads. To execute the 12 jobs, should I rely solely on mdrun's -multidir option for running multiple simulations (see the first sketch after this list)? And is there a way to partition the A100 so that jobs run concurrently, even though I understand it may not support all 12 at once?
  2. When running multiple jobs on a single machine using only CPU resources with OpenMPI and Slurm, are any additional configurations needed to avoid overhead, similar to mdrun's thread pinning (see the second sketch after this list)? Should memory also be isolated and capped separately for each job?
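
For question 1, this is roughly how I imagine launching all 12 simulations in one go with -multidir (a sketch; the rank count, thread count, and directory names are my own untested choices):

```bash
# Sketch: one MPI rank per simulation, all 12 sharing the node and the one A100.
# 256 hardware threads / 12 jobs is about 21, rounded down to 20 OpenMP threads each.
mpirun -np 12 gmx_mpi mdrun \
    -multidir {bound,unbound}/{forward,reverse}/rep{1,2,3} \
    -ntomp 20 \
    -pin on \
    -gpu_id 0    # all simulations share the single A100
```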
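For question 2, this is how I would currently pin two separately launched CPU-only jobs onto disjoint cores using mdrun's own options (a sketch; the offsets and thread counts are illustrative only):

```bash
# Sketch: two independent CPU-only jobs pinned to disjoint hardware threads.
# -pinoffset is the first hardware thread a job may use;
# -pinstride 1 packs its threads onto consecutive hardware threads.
mpirun -np 1 gmx_mpi mdrun -deffnm job1 -ntomp 16 -pin on -pinoffset 0  -pinstride 1 &
mpirun -np 1 gmx_mpi mdrun -deffnm job2 -ntomp 16 -pin on -pinoffset 16 -pinstride 1 &
wait
```

I assume Slurm's CPU binding and memory limits (e.g. --cpu-bind and --mem-per-cpu with srun) could take over this role, but I am unsure how they interact with mdrun's internal pinning.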

I am looking forward to your insights and suggestions for optimizing my GROMACS setup.

Thank you.

Please see "Maximizing GROMACS Throughput with Multiple Simulations per GPU Using MPS and MIG" on the NVIDIA Technical Blog, which may be of use.
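
In case it helps others reading this thread, enabling MPS before launching the simulations looks roughly like this (a sketch of the standard MPS workflow, not something specific to GROMACS):

```bash
# Sketch: start the CUDA Multi-Process Service so that concurrent mdrun
# processes share the A100 more efficiently than plain time-slicing.
export CUDA_VISIBLE_DEVICES=0
nvidia-cuda-mps-control -d           # start the MPS control daemon

# ... launch the GROMACS jobs here ...

echo quit | nvidia-cuda-mps-control  # shut the daemon down afterwards
```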