Abysmal MD production performance on GPU node

GROMACS version: 2023.2
GROMACS modification: No (I think)

Dear GMX users,

I’m new to running GROMACS on HPC clusters and I’m currently running test simulations with my system of interest to learn how to run simulations efficiently on the hardware. The issue I’m encountering is this: my system is relatively big, and when I run the simulation on one node with 40 cores, I get a performance of 21.5 ns/day. When I run the same simulation on a node with the same number of cores plus a GPU, the performance drops to about 14 ns/day. I made another test with a small protein, and the difference is even more striking: 173 and 16 ns/day without and with the GPU, respectively.

The big system contains a tetrameric protein with ~400 aa-long subunits and a total of 260 272 atoms after solvation. The small system contains the classic lysozyme and a total of 23 961 atoms.

The nodes have an Intel(R) Xeon(R) Gold 6230 CPU @ 2.10GHz and 192 GB of RAM (I request 150 GB). The GPU node has one NVIDIA Tesla V100 card. There is another node type available, with 20 cores, an NVIDIA A100 Tensor Core GPU, and 400 GB of RAM, but I have never tested it.

My job script contains:

module load gcc/9.4.0
module load openmpi/gcc/64/1.10.2
module load cuda/toolkit/11.8.0
module load gromacs/2023.2

NPROCS=`wc -l < $PBS_NODEFILE`
export OMP_NUM_THREADS=1
export GMX_ENABLE_DIRECT_GPU_COMM=1
export GMX_CUDA_GRAPH=1

The following commands were used for the CPU-based and GPU-based simulations, respectively:

mpirun -np 36 gmx_mpi mdrun -deffnm md_0_1

mpirun -np 36 gmx_mpi mdrun -deffnm md_0_1 -nb gpu -bonded cpu -pme cpu -update gpu

I didn’t find any errors or warnings in the job logs of the GPU runs, but there is this information in the MD log:

Command line:
  gmx_mpi mdrun -deffnm md_0_1 -nb gpu -bonded cpu -pme cpu -update gpu

GROMACS version:    2023.2
Precision:          mixed
Memory model:       64 bit
MPI library:        MPI
OpenMP support:     enabled (GMX_OPENMP_MAX_THREADS = 128)
GPU support:        CUDA
NB cluster size:    8
SIMD instructions:  AVX_512
CPU FFT library:    fftw-3.3.8-sse2-avx-avx2-avx2_128-avx512
GPU FFT library:    cuFFT
Multi-GPU FFT:      none
RDTSCP usage:       enabled
TNG support:        enabled
Hwloc support:      disabled
Tracing support:    disabled
C compiler:         /cm/shared/apps/openmpi/gcc/64/1.10.2/bin/mpicc GNU 9.4.0
C compiler flags:   -fexcess-precision=fast -funroll-all-loops -mavx512f -mfma -mavx512vl -mavx512dq -mavx512bw -Wno-missing-field-initializers -O3 -DNDEBUG
C++ compiler:       /cm/shared/apps/openmpi/gcc/64/1.10.2/bin/mpicxx GNU 9.4.0
C++ compiler flags: -fexcess-precision=fast -funroll-all-loops -mavx512f -mfma -mavx512vl -mavx512dq -mavx512bw -Wno-missing-field-initializers -Wno-cast-function-type-strict -fopenmp -O3 -DNDEBUG
BLAS library:       
LAPACK library:     
CUDA compiler:      /services/tools/cuda/toolkit/11.8.0/bin/nvcc nvcc: NVIDIA (R) Cuda compiler driver;Copyright (c) 2005-2022 NVIDIA Corporation;Built on Wed_Sep_21_10:33:58_PDT_2022;Cuda compilation tools, release 11.8, V11.8.89;Build cuda_11.8.r11.8/compiler.31833905_0
CUDA compiler flags:-std=c++17;--generate-code=arch=compute_35,code=sm_35;--generate-code=arch=compute_37,code=sm_37;--generate-code=arch=compute_50,code=sm_50;--generate-code=arch=compute_52,code=sm_52;--generate-code=arch=compute_60,code=sm_60;--generate-code=arch=compute_61,code=sm_61;--generate-code=arch=compute_70,code=sm_70;--generate-code=arch=compute_75,code=sm_75;--generate-code=arch=compute_80,code=sm_80;--generate-code=arch=compute_86,code=sm_86;--generate-code=arch=compute_89,code=sm_89;--generate-code=arch=compute_90,code=sm_90;-Wno-deprecated-gpu-targets;--generate-code=arch=compute_53,code=sm_53;--generate-code=arch=compute_80,code=sm_80;-use_fast_math;-Xptxas;-warn-double-usage;-Xptxas;-Werror;;-fexcess-precision=fast -funroll-all-loops -mavx512f -mfma -mavx512vl -mavx512dq -mavx512bw -Wno-missing-field-initializers -Wno-cast-function-type-strict -fopenmp -O3 -DNDEBUG
CUDA driver:        11.20
CUDA runtime:       11.80


Running on 1 node with total 40 cores, 40 processing units, 1 compatible GPU
Hardware detected on host g-12-g0029 (the node of MPI rank 0):
  CPU info:
    Vendor: Intel
    Brand:  Intel(R) Xeon(R) Gold 6230 CPU @ 2.10GHz
    Family: 6   Model: 85   Stepping: 7
    Features: aes apic avx avx2 avx512f avx512cd avx512bw avx512vl avx512secondFMA clfsh cmov cx8 cx16 f16c fma hle htt intel lahf mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd rdtscp rtm sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
    Number of AVX-512 FMA units: 2
  Hardware topology: Basic
    Packages, cores, and logical processors:
    [indices refer to OS logical processors]
      Package  0: [   0] [   1] [   2] [   3] [   4] [   5] [   6] [   7] [   8] [   9] [  10] [  11] [  12] [  13] [  14] [  15] [  16] [  17] [  18] [  19]
      Package  1: [  20] [  21] [  22] [  23] [  24] [  25] [  26] [  27] [  28] [  29] [  30] [  31] [  32] [  33] [  34] [  35] [  36] [  37] [  38] [  39]
    CPU limit set by OS: -1   Recommended max number of threads: 40
  GPU info:
    Number of GPUs detected: 1
    #0: NVIDIA Tesla V100-PCIE-16GB, compute cap.: 7.0, ECC: yes, stat: compatible

and

The number of OpenMP threads was set by environment variable OMP_NUM_THREADS to 1

GMX_CUDA_GRAPH environment variable is detected. The experimental CUDA Graphs feature will be used if run conditions allow.

GPU-aware MPI was not detected, will not use direct GPU communication. Check the GROMACS install guide for recommendations for GPU-aware support. If you are certain about GPU-aware support in your MPI library, you can force its use by setting the GMX_FORCE_GPU_AWARE_MPI environment variable.

and

Initializing Domain Decomposition on 36 ranks
Dynamic load balancing: auto
Using update groups, nr 8340, average size 2.9 atoms, max. radius 0.104 nm
Minimum cell size due to atom displacement: 0.646 nm
Initial maximum distances in bonded interactions:
    two-body bonded interactions: 0.448 nm, LJ-14, atoms 1156 1405
  multi-body bonded interactions: 0.448 nm, Proper Dih., atoms 1156 1405
Minimum cell size due to bonded interactions: 0.493 nm
Scaling the initial minimum size with 1/0.8 (option -dds) = 1.25
Using 0 separate PME ranks because: PME-only ranks are not automatically used when non-bonded interactions are computed on GPUs
Optimizing the DD grid for 36 cells with a minimum initial size of 0.808 nm
The maximum allowed number of cells is: X 7 Y 7 Z 6
Domain decomposition grid 6 x 6 x 1, separate PME ranks 0
PME domain decomposition: 6 x 6 x 1
Domain decomposition rank 0, coordinates 0 0 0

The initial number of communication pulses is: X 2 Y 2
The initial domain decomposition cell size is: X 0.94 nm Y 0.94 nm

The maximum allowed distance for atom groups involved in interactions is:
                 non-bonded interactions           1.373 nm
(the following are initial values, they could change due to box deformation)
            two-body bonded interactions  (-rdd)   1.373 nm
          multi-body bonded interactions  (-rdd)   0.944 nm

When dynamic load balancing gets turned on, these settings will change to:
The maximum number of communication pulses is: X 2 Y 2
The minimum size for domain decomposition cells is 0.719 nm
The requested allowed shrink of DD cells (option -dds) is: 0.80
The allowed shrink of domain decomposition cells is: X 0.76 Y 0.76
The maximum allowed distance for atom groups involved in interactions is:
                 non-bonded interactions           1.373 nm
            two-body bonded interactions  (-rdd)   1.373 nm
          multi-body bonded interactions  (-rdd)   0.719 nm

On host g-12-g0029 1 GPU selected for this run.
Mapping of GPU IDs to the 36 GPU tasks in the 36 ranks on this node:
  PP:0,PP:0,PP:0,PP:0,PP:0,PP:0,PP:0,PP:0,PP:0,PP:0,PP:0,PP:0,PP:0,PP:0,PP:0,PP:0,PP:0,PP:0,PP:0,PP:0,PP:0,PP:0,PP:0,PP:0,PP:0,PP:0,PP:0,PP:0,PP:0,PP:0,PP:0,PP:0,PP:0,PP:0,PP:0,PP:0
PP tasks will do (non-perturbed) short-ranged interactions on the GPU
PP task will update and constrain coordinates on the GPU
Using 36 MPI processes

Non-default thread affinity set, disabling internal thread affinity

Using 1 OpenMP thread per MPI process

System total charge: 0.000
Will do PME sum in reciprocal space for electrostatic interactions.

I contacted the HPC cluster support about this, but they told me that “this is a usage question and must be addressed with gromacs documentation or maintainers”. Hence, here I am.

I appreciate any help you may provide.

Best regards,

Gustavo

Try running one or two ranks per GPU, i.e. mpirun -np 1 with OMP_NUM_THREADS=36. That usually helps, though it is of course not optimal if you have very many CPU cores.
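
For example:

export OMP_NUM_THREADS=36
mpirun -np 1 gmx_mpi mdrun -deffnm md_0_1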

You are running with PME on the CPU, which is extremely slow. To run on a single GPU, just use a single MPI rank and run with “-pme gpu”. You should also run with “-update gpu”. Depending on your system, “-bonded gpu” may also be faster (see the example command at the end of this reply). For more information on multi-GPU runs, please see
Creating Faster Molecular Dynamics Simulations with GROMACS 2020 | NVIDIA Technical Blog
and
Massively Improved Multi-node NVIDIA GPU Scalability with GROMACS | NVIDIA Technical Blog.

For more information on maximizing throughput for multiple simulations, see Maximizing GROMACS Throughput with Multiple Simulations per GPU Using MPS and MIG | NVIDIA Technical Blog
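
Putting the above together, a minimal single-GPU command would be something like this (a sketch, assuming the 40-core V100 node and the -deffnm naming from your script):

export OMP_NUM_THREADS=40
mpirun -np 1 gmx_mpi mdrun -deffnm md_0_1 -nb gpu -pme gpu -bonded gpu -update gpu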

Thanks! Running on one rank didn’t solve the issue at first, but in the end it turned out to be crucial for getting a significant improvement.

Thanks a lot!

By using only one rank and assigning all tasks (nb, bonded, and pme) to the GPU, the performance increased to 255 ns/day for the NPT equilibration step of the small protein. Unfortunately, an error occurred right at the beginning of the MD production:

Job error file

starting mdrun 'LYSOZYME in water'
500000 steps,   1000.0 ps.
[g-12-g0033:16517] *** Process received signal ***
[g-12-g0033:16517] Signal: Segmentation fault (11)
[g-12-g0033:16517] Signal code: Address not mapped (1)
[g-12-g0033:16517] Failing at address: 0xfffffffe00f308b0

Job output file
mpirun noticed that process rank 0 with PID 16517 on node g-12-g0033 exited on signal 11 (Segmentation fault).

I have no idea why this happened, but maybe it is just a matter of trying again.

In any case, despite the improvement, the current GPU run performance is still not worth it compared to the pure CPU run in terms of cost versus speed per node. However, can this be improved if I run multiple GPU nodes in parallel, particularly if I use the multi-GPU optimization and/or GPU PME decomposition and/or CUDA Graphs?

Best regards,

Gustavo

Try without that environment variable. It enables an experimental feature which has received a number of fixes in 2023.3 (and you seem to be using an older release).
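
That is, drop or comment out this line in the job script (or unset GMX_CUDA_GRAPH before launching mdrun):

export GMX_CUDA_GRAPH=1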

Thanks a lot!

Turning off CUDA Graphs indeed improved the simulations substantially. The small protein simulation went from 255 to 380 ns/day, and the big complex went from 22 to 55 ns/day. This appears to make the GPU node more cost-effective.

However, I noticed this in my md.log:

GPU-aware MPI was not detected, will not use direct GPU communication. Check the GROMACS install guide for recommendations for GPU-aware support. If you are certain about GPU-aware support in your MPI library, you can force its use by setting the GMX_FORCE_GPU_AWARE_MPI environment variable.

So it appears that direct GPU communication isn’t working. I’m not sure what effect it has on simulation speed, but maybe it’s something to look at in the future. For now, I’m happy with the improvement.

Best regards,

Gustavo

If you are not doing so already, please also make sure you set
export OMP_NUM_THREADS=40
to use all the CPU cores in your system (even with full GPU acceleration, some parts of the code still run on the CPU and rely on this for good performance), and
-update gpu
to run update and constraints on GPU.

If you are running on a single GPU with a single MPI rank, you can ignore the GPU-aware MPI message, since you are not performing any MPI communication. For multiple GPUs (controlled by multiple MPI ranks), you will either need a CUDA-aware MPI installation, or you can instead use the thread-MPI build of GROMACS, though thread-MPI only works within a single compute node.
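
For illustration, a single-node multi-GPU run with the thread-MPI build (binary typically named gmx rather than gmx_mpi) could look something like the following; this is only a sketch, assuming a hypothetical node with 2 GPUs and 20 cores:

export GMX_ENABLE_DIRECT_GPU_COMM=1
gmx mdrun -deffnm md_0_1 -ntmpi 2 -ntomp 10 -nb gpu -pme gpu -npme 1 -bonded gpu -update gpu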

Small cases will not scale well to multiple compute nodes; for larger cases, please see Massively Improved Multi-node NVIDIA GPU Scalability with GROMACS | NVIDIA Technical Blog.

Dear alang,

Thanks for your reply.

The reason why I’m not using all cores is that I’ve read more than once that it is recommended to leave one core free (though I’m not sure whether this applies to MPI runs, non-MPI runs, or both). If I remember correctly, 39, 38, and 37 cores have that high-prime-number issue, so I had to use 36. In any case, I’m testing all 40 cores now and we’ll see what happens.

Edit: I just got the result. Increasing the number of cores from 36 to 40 didn’t increase the performance. Why?

I’m doing so, more or less. I’ve read here that I can use export GMX_FORCE_UPDATE_DEFAULT_GPU=true instead, and I do so because this is the form that will be necessary if I ever decide to try multi-GPU (unlikely, though, because the HPC cluster I’m using doesn’t have nodes with multiple GPUs).

So, the PME GPU decomposition will only work (or I will only benefit from it) if I use multiple nodes with multiple GPUs each, right?

So everybody knows how I’m running now, here is the relevant part of my job script:

export OMP_NUM_THREADS=40
export GMX_ENABLE_DIRECT_GPU_COMM=1
export GMX_FORCE_UPDATE_DEFAULT_GPU=true

mpirun -np 1 gmx_mpi mdrun -deffnm md_0_1 -nb gpu -bonded gpu -pme gpu

I’m not using -update gpu for the reason mentioned above.

And I found an optimal nstlist of 250-300 with 36 cores (the big system is now running at 62 ns/day).
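
For reference, that is set by appending the flag to the mdrun line above, e.g.:

mpirun -np 1 gmx_mpi mdrun -deffnm md_0_1 -nb gpu -bonded gpu -pme gpu -nstlist 300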

Best regards,

Gustavo