GROMACS version: 2022.2
GROMACS modification: No
Hello,
I am planning to run PMX simulations with GROMACS on an on-premise machine, and my goal is to use all the resources of this single machine efficiently across multiple GROMACS jobs.
I use 3 replicas, both the unbound and bound states (2), and the forward and reverse processes (2), giving 3 × 2 × 2 = 12 mdrun jobs in total.
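For concreteness, I lay out one run directory per transition (the names below are my own placeholders, not pmx defaults), as sketched here:

```bash
# 3 replicas x 2 states x 2 directions = 12 run directories,
# each holding its own topol.tpr.
for rep in 1 2 3; do
  for state in bound unbound; do
    for dir in forward reverse; do
      mkdir -p rep${rep}/${state}/${dir}
    done
  done
done
```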
Hardware Specifications:
- 2× AMD EPYC 7763 64-core processors (128 cores; 256 hardware threads with SMT)
- 1× NVIDIA A100 80 GB PCIe GPU
- 1 TB of RAM
Software Setup:
- Slurm and OpenMPI installed
- Two GROMACS builds: CPU-only (gromacs_cpu_mpi) and CUDA (gromacs_gpu_mpi), both with MPI support
CPU build (gromacs_cpu_mpi):
gmx_mpi -version
:-) GROMACS - gmx_mpi, 2022.2 (-:
Executable: /usr/local/gromacs-2022.2_mpi/build/bin/gmx_mpi
Data prefix: /usr/local/gromacs
Working dir: /usr/share/modules/modulefiles
Command line:
gmx_mpi -version
GROMACS version: 2022.2
Precision: mixed
Memory model: 64 bit
MPI library: MPI
OpenMP support: enabled (GMX_OPENMP_MAX_THREADS = 256)
GPU support: disabled
SIMD instructions: AVX2_128
CPU FFT library: fftw-3.3.8-sse2-avx-avx2-avx2_128
GPU FFT library: none
RDTSCP usage: enabled
TNG support: enabled
Hwloc support: hwloc-2.9.1
Tracing support: disabled
C compiler: /usr/bin/cc GNU 7.5.0
C compiler flags: -mavx2 -mfma -Wno-missing-field-initializers -fexcess-precision=fast -funroll-all-loops -pthread -O3 -DNDEBUG
C++ compiler: /usr/bin/g++ GNU 7.5.0
C++ compiler flags: -mavx2 -mfma -pthread -Wno-missing-field-initializers -fexcess-precision=fast -funroll-all-loops -pthread SHELL:-fopenmp -O3 -DNDEBUG
GPU build (gromacs_gpu_mpi):
gmx_mpi --version
:-) GROMACS - gmx_mpi, 2022.2 (-:
Executable: /usr/local/gromacs-2022.2/build/bin/./gmx_mpi
Data prefix: /usr/local/gromacs-2022.2 (source tree)
Working dir: /usr/local/gromacs-2022.2/build/bin
Command line:
gmx_mpi --version
GROMACS version: 2022.2
Precision: mixed
Memory model: 64 bit
MPI library: MPI
OpenMP support: enabled (GMX_OPENMP_MAX_THREADS = 256)
GPU support: CUDA
SIMD instructions: AVX2_128
CPU FFT library: fftw-3.3.8-sse2-avx-avx2-avx2_128
GPU FFT library: cuFFT
RDTSCP usage: enabled
TNG support: enabled
Hwloc support: hwloc-2.9.1
Tracing support: disabled
C compiler: /usr/bin/cc GNU 7.5.0
C compiler flags: -mavx2 -mfma -Wno-missing-field-initializers -fexcess-precision=fast -funroll-all-loops -pthread -O3 -DNDEBUG
C++ compiler: /usr/bin/g++ GNU 7.5.0
C++ compiler flags: -mavx2 -mfma -pthread -Wno-missing-field-initializers -fexcess-precision=fast -funroll-all-loops -pthread SHELL:-fopenmp -O3 -DNDEBUG
CUDA compiler: /usr/local/cuda-12.2/bin/nvcc nvcc: NVIDIA (R) Cuda compiler driver;Copyright (c) 2005-2023 NVIDIA Corporation;Built on Tue_Aug_15_22:02:13_PDT_2023;Cuda compilation tools, release 12.2, V12.2.140;Build cuda_12.2.r12.2/compiler.33191640_0
CUDA compiler flags:-std=c++14;-gencode;arch=compute_80,code=sm_80;-use_fast_math;-D_FORCE_INLINES;-mavx2 -mfma -pthread -Wno-missing-field-initializers -fexcess-precision=fast -funroll-all-loops -pthread SHELL:-fopenmp -O3 -DNDEBUG
CUDA driver: 12.20
CUDA runtime: 10.10
Questions:
- From what I have read, the NVIDIA A100 supports a maximum of 16 concurrent threads. To run the 12 jobs, should I simply use mdrun's `-multidir` option for multiple simulations? Is there also a way to partition the A100 so jobs run truly simultaneously (though I understand it may not support all 12 jobs concurrently)? See the sketches after this list for what I am considering.
- When running multiple CPU-based jobs on a single machine with OpenMPI and Slurm, are any additional configurations needed to avoid overhead, similar to mdrun's thread pinning? Should memory also be isolated and set separately for each job?
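For concreteness, here is the kind of single-invocation setup I am considering for the `-multidir` route (a sketch only; the directory names match the placeholder layout above, and the thread count assumes the 128 physical cores):

```bash
# One MPI rank per simulation, all 12 simulations in a single mdrun call.
# 12 ranks x 10 OpenMP threads = 120 threads on 128 physical cores.
mpirun -np 12 gmx_mpi mdrun \
    -multidir rep{1,2,3}/{bound,unbound}/{forward,reverse} \
    -ntomp 10 -pin on \
    -nb gpu -pme gpu
```

For sharing or partitioning the A100, the two mechanisms I have found are CUDA MPS and MIG (profile IDs vary by driver, so I would check `nvidia-smi mig -lgip` first; note that MIG tops out at 7 instances, so it could not isolate all 12 jobs at once):

```bash
# Option A: CUDA MPS - many processes share the whole GPU cooperatively.
export CUDA_VISIBLE_DEVICES=0
nvidia-cuda-mps-control -d            # start the MPS control daemon
# ... run the mdrun jobs ...
echo quit | nvidia-cuda-mps-control   # stop the daemon when done

# Option B: MIG - hard-partitions the A100 into at most 7 isolated instances.
sudo nvidia-smi -i 0 -mig 1                       # enable MIG mode (needs a GPU reset)
sudo nvidia-smi mig -cgi 19,19,19,19,19,19,19 -C  # e.g. seven 1g.10gb slices
```

If I instead launch 12 separate mdrun processes (e.g. one Slurm step each), my understanding is that I would have to pin them to disjoint cores myself, along these lines (offsets and strides are in hardware threads; this is my reading of the docs, not a tested recipe):

```bash
# Hypothetical manual pinning: give job $i (i = 0..11) ten physical cores.
# With 2-way SMT, -pinstride 2 uses one hardware thread per core, so each
# job occupies a disjoint window of 20 hardware threads.
i=0   # set per job, e.g. from a Slurm array index
gmx_mpi mdrun -ntomp 10 -pin on -pinoffset $((i*20)) -pinstride 2 -deffnm md
```

Is something like this the right direction?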
I am looking forward to your insights and suggestions for optimizing my GROMACS setup.
Thank you.