GROMACS version: 2023.2
GROMACS modification: No (I think)
Dear GMX users,
I’m new to running GROMACS on HPC clusters, and I’m currently running test simulations with my system of interest to learn how to use the hardware efficiently. The issue I am encountering is this: my system is relatively big, and when I run the simulation on one node with 40 cores, I get a performance of 21.5 ns/day. When I run the same simulation on a node with the same number of cores plus a GPU, the performance drops to about 14 ns/day. I did another test with a small protein, and the difference is even more striking: 173 and 16 ns/day without and with the GPU, respectively.
The big system contains a tetrameric protein with ~400-residue subunits and a total of 260,272 atoms after solvation. The small system contains the classic lysozyme, with a total of 23,961 atoms.
Each node has two Intel(R) Xeon(R) Gold 6230 CPUs @ 2.10 GHz (2 × 20 cores) and 192 GB of RAM (I request 150 GB). The GPU node additionally has one NVIDIA Tesla V100 card. There is another node type available, with 20 cores, an NVIDIA A100 Tensor Core card, and 400 GB of RAM, but I have not tested that one yet.
My job script contains:
module load gcc/9.4.0
module load openmpi/gcc/64/1.10.2
module load cuda/toolkit/11.8.0
module load gromacs/2023.2
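# number of processors from the PBS node file (not actually used by the mpirun lines below)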
NPROCS=`wc -l < $PBS_NODEFILE`
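# one OpenMP thread per MPI rank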
export OMP_NUM_THREADS=1
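# ask mdrun to use direct GPU communication (only takes effect with a GPU-aware MPI library)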
export GMX_ENABLE_DIRECT_GPU_COMM=1
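# enable the experimental CUDA Graphs feature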
export GMX_CUDA_GRAPH=1
The following commands were used for the CPU-only and GPU-based simulations, respectively:
mpirun -np 36 gmx_mpi mdrun -deffnm md_0_1
mpirun -np 36 gmx_mpi mdrun -deffnm md_0_1 -nb gpu -bonded cpu -pme cpu -update gpu
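For reference, this is the GPU launch I was planning to try next, following the heterogeneous-parallelization section of the GROMACS documentation; the single rank, the 20 OpenMP threads, and the full offload (-pme gpu -bonded gpu -update gpu) are my own guesses for a one-GPU node, not something I have benchmarked yet:
export OMP_NUM_THREADS=20
mpirun -np 1 gmx_mpi mdrun -deffnm md_0_1 -nb gpu -pme gpu -bonded gpu -update gpu
My understanding is that with only one GPU, fewer MPI ranks and more offload should reduce the CPU–GPU traffic, but please correct me if that reasoning is wrong.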
I didn’t find any errors or warnings in the job logs of the GPU runs, but the MD log contains the following information:
Command line:
gmx_mpi mdrun -deffnm md_0_1 -nb gpu -bonded cpu -pme cpu -update gpu
GROMACS version: 2023.2
Precision: mixed
Memory model: 64 bit
MPI library: MPI
OpenMP support: enabled (GMX_OPENMP_MAX_THREADS = 128)
GPU support: CUDA
NB cluster size: 8
SIMD instructions: AVX_512
CPU FFT library: fftw-3.3.8-sse2-avx-avx2-avx2_128-avx512
GPU FFT library: cuFFT
Multi-GPU FFT: none
RDTSCP usage: enabled
TNG support: enabled
Hwloc support: disabled
Tracing support: disabled
C compiler: /cm/shared/apps/openmpi/gcc/64/1.10.2/bin/mpicc GNU 9.4.0
C compiler flags: -fexcess-precision=fast -funroll-all-loops -mavx512f -mfma -mavx512vl -mavx512dq -mavx512bw -Wno-missing-field-initializers -O3 -DNDEBUG
C++ compiler: /cm/shared/apps/openmpi/gcc/64/1.10.2/bin/mpicxx GNU 9.4.0
C++ compiler flags: -fexcess-precision=fast -funroll-all-loops -mavx512f -mfma -mavx512vl -mavx512dq -mavx512bw -Wno-missing-field-initializers -Wno-cast-function-type-strict -fopenmp -O3 -DNDEBUG
BLAS library:
LAPACK library:
CUDA compiler: /services/tools/cuda/toolkit/11.8.0/bin/nvcc nvcc: NVIDIA (R) Cuda compiler driver;Copyright (c) 2005-2022 NVIDIA Corporation;Built on Wed_Sep_21_10:33:58_PDT_2022;Cuda compilation tools, release 11.8, V11.8.89;Build cuda_11.8.r11.8/compiler.31833905_0
CUDA compiler flags:-std=c++17;--generate-code=arch=compute_35,code=sm_35;--generate-code=arch=compute_37,code=sm_37;--generate-code=arch=compute_50,code=sm_50;--generate-code=arch=compute_52,code=sm_52;--generate-code=arch=compute_60,code=sm_60;--generate-code=arch=compute_61,code=sm_61;--generate-code=arch=compute_70,code=sm_70;--generate-code=arch=compute_75,code=sm_75;--generate-code=arch=compute_80,code=sm_80;--generate-code=arch=compute_86,code=sm_86;--generate-code=arch=compute_89,code=sm_89;--generate-code=arch=compute_90,code=sm_90;-Wno-deprecated-gpu-targets;--generate-code=arch=compute_53,code=sm_53;--generate-code=arch=compute_80,code=sm_80;-use_fast_math;-Xptxas;-warn-double-usage;-Xptxas;-Werror;;-fexcess-precision=fast -funroll-all-loops -mavx512f -mfma -mavx512vl -mavx512dq -mavx512bw -Wno-missing-field-initializers -Wno-cast-function-type-strict -fopenmp -O3 -DNDEBUG
CUDA driver: 11.20
CUDA runtime: 11.80
Running on 1 node with total 40 cores, 40 processing units, 1 compatible GPU
Hardware detected on host g-12-g0029 (the node of MPI rank 0):
CPU info:
Vendor: Intel
Brand: Intel(R) Xeon(R) Gold 6230 CPU @ 2.10GHz
Family: 6 Model: 85 Stepping: 7
Features: aes apic avx avx2 avx512f avx512cd avx512bw avx512vl avx512secondFMA clfsh cmov cx8 cx16 f16c fma hle htt intel lahf mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd rdtscp rtm sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
Number of AVX-512 FMA units: 2
Hardware topology: Basic
Packages, cores, and logical processors:
[indices refer to OS logical processors]
Package 0: [ 0] [ 1] [ 2] [ 3] [ 4] [ 5] [ 6] [ 7] [ 8] [ 9] [ 10] [ 11] [ 12] [ 13] [ 14] [ 15] [ 16] [ 17] [ 18] [ 19]
Package 1: [ 20] [ 21] [ 22] [ 23] [ 24] [ 25] [ 26] [ 27] [ 28] [ 29] [ 30] [ 31] [ 32] [ 33] [ 34] [ 35] [ 36] [ 37] [ 38] [ 39]
CPU limit set by OS: -1 Recommended max number of threads: 40
GPU info:
Number of GPUs detected: 1
#0: NVIDIA Tesla V100-PCIE-16GB, compute cap.: 7.0, ECC: yes, stat: compatible
and
The number of OpenMP threads was set by environment variable OMP_NUM_THREADS to 1
GMX_CUDA_GRAPH environment variable is detected. The experimental CUDA Graphs feature will be used if run conditions allow.
GPU-aware MPI was not detected, will not use direct GPU communication. Check the GROMACS install guide for recommendations for GPU-aware support. If you are certain about GPU-aware support in your MPI library, you can force its use by setting the GMX_FORCE_GPU_AWARE_MPI environment variable.
and
Initializing Domain Decomposition on 36 ranks
Dynamic load balancing: auto
Using update groups, nr 8340, average size 2.9 atoms, max. radius 0.104 nm
Minimum cell size due to atom displacement: 0.646 nm
Initial maximum distances in bonded interactions:
two-body bonded interactions: 0.448 nm, LJ-14, atoms 1156 1405
multi-body bonded interactions: 0.448 nm, Proper Dih., atoms 1156 1405
Minimum cell size due to bonded interactions: 0.493 nm
Scaling the initial minimum size with 1/0.8 (option -dds) = 1.25
Using 0 separate PME ranks because: PME-only ranks are not automatically used when non-bonded interactions are computed on GPUs
Optimizing the DD grid for 36 cells with a minimum initial size of 0.808 nm
The maximum allowed number of cells is: X 7 Y 7 Z 6
Domain decomposition grid 6 x 6 x 1, separate PME ranks 0
PME domain decomposition: 6 x 6 x 1
Domain decomposition rank 0, coordinates 0 0 0
The initial number of communication pulses is: X 2 Y 2
The initial domain decomposition cell size is: X 0.94 nm Y 0.94 nm
The maximum allowed distance for atom groups involved in interactions is:
non-bonded interactions 1.373 nm
(the following are initial values, they could change due to box deformation)
two-body bonded interactions (-rdd) 1.373 nm
multi-body bonded interactions (-rdd) 0.944 nm
When dynamic load balancing gets turned on, these settings will change to:
The maximum number of communication pulses is: X 2 Y 2
The minimum size for domain decomposition cells is 0.719 nm
The requested allowed shrink of DD cells (option -dds) is: 0.80
The allowed shrink of domain decomposition cells is: X 0.76 Y 0.76
The maximum allowed distance for atom groups involved in interactions is:
non-bonded interactions 1.373 nm
two-body bonded interactions (-rdd) 1.373 nm
multi-body bonded interactions (-rdd) 0.719 nm
On host g-12-g0029 1 GPU selected for this run.
Mapping of GPU IDs to the 36 GPU tasks in the 36 ranks on this node:
PP:0,PP:0,PP:0,PP:0,PP:0,PP:0,PP:0,PP:0,PP:0,PP:0,PP:0,PP:0,PP:0,PP:0,PP:0,PP:0,PP:0,PP:0,PP:0,PP:0,PP:0,PP:0,PP:0,PP:0,PP:0,PP:0,PP:0,PP:0,PP:0,PP:0,PP:0,PP:0,PP:0,PP:0,PP:0,PP:0
PP tasks will do (non-perturbed) short-ranged interactions on the GPU
PP task will update and constrain coordinates on the GPU
Using 36 MPI processes
Non-default thread affinity set, disabling internal thread affinity
Using 1 OpenMP thread per MPI process
System total charge: 0.000
Will do PME sum in reciprocal space for electrostatic interactions.
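One thing that caught my attention in the log above is the note that GPU-aware MPI was not detected, so direct GPU communication was disabled even though I set GMX_ENABLE_DIRECT_GPU_COMM. Before forcing it with GMX_FORCE_GPU_AWARE_MPI, I intend to check whether the Open MPI module was built with CUDA support (command taken from the Open MPI FAQ; I have not run it on this cluster yet):
ompi_info --parsable --all | grep mpi_built_with_cuda_support:value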
I contacted the HPC cluster support about this, but they told me that “this is a usage question and must be addressed with gromacs documentation or maintainers”. Hence, here I am.
I appreciate any help you may provide.
Best regards,
Gustavo