GROMACS 2020.6 Performance Problem on a CentOS 7 based GPU/CPU hybrid node

GROMACS version: 2020.6
GROMACS modification: No

Hi everybody,

I am a PhD from the University of Naples, Italy, and I have run into some problems installing GROMACS 2020.6 on the GPU nodes of our local cluster. The installed OS is CentOS 7, and the architecture of the node is the following:

Running Kernel
Linux 3.10.0-957.el7.x86_64

CPU
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 64
On-line CPU(s) list: 0-63
Thread(s) per core: 2
Core(s) per socket: 16
Socket(s): 2
NUMA node(s): 8
Vendor ID: AuthenticAMD
CPU family: 23
Model: 1
Model name: AMD EPYC 7301 16-Core Processor
Stepping: 2
CPU MHz: 2195.730
BogoMIPS: 4391.46
Virtualization: AMD-V
L1d cache: 32K
L1i cache: 64K
L2 cache: 512K
L3 cache: 8192K

GPU
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE…    Off  | 00000000:41:00.0 Off |                    0 |
| N/A   57C    P0    87W / 250W |    443MiB / 16160MiB |     50%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

I tried to install both the Thread-MPI and the MPI-CUDA versions of GROMACS 2020.6, but in both cases I obtained very low performance. The two new GROMACS builds reached 7 ns/day and 12 ns/day on a 100k-atom system which I had previously simulated with GROMACS 2019.4 at 42 ns/day.
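To put these numbers in perspective, here is a quick calculation of the slowdown relative to the 2019.4 run. It is only a small sketch using the three ns/day figures quoted above; the helper name `slowdown` is made up for illustration.

```shell
#!/bin/sh
# slowdown BASELINE NEW: print how many times slower NEW is than BASELINE
# (both values in ns/day). Hypothetical helper, not part of GROMACS.
slowdown() {
    awk -v b="$1" -v n="$2" 'BEGIN { printf "%.1f", b / n }'
}

baseline=42            # GROMACS 2019.4 throughput quoted in this post
for new in 7 12; do    # the two GROMACS 2020.6 builds quoted in this post
    echo "${new} ns/day: $(slowdown "$baseline" "$new")x slower than ${baseline} ns/day"
done
```

So the new builds are roughly 3.5 to 6 times slower than the old one on the same system.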

I tried different combinations of C++ and NVCC compilers. In particular, I first tried the devtoolset-8 distribution of CentOS (so gcc 8) with cudatoolkit 10.2, and then devtoolset-9 (gcc 9) with cudatoolkit 11.2. However, neither combination gave any improvement.

Here are the two configure commands which I used:

MPI-CUDA

/data/apps/cmake/cmake-3.16.3/bin/cmake -DGMX_MPI=on -DGMX_GPU=on -DBUILD_SHARED_LIBS=ON -DGMX_PREFER_STATIC_LIBS=OFF -DCMAKE_INSTALL_PREFIX=/data/apps/gromacs/2020.6_MPI-cuda -DGMX_CYCLE_SUBCOUNTERS=ON -DGMX_FFT_LIBRARY=fftw3 -DFFTWF_LIBRARY="/data/lib/fftw/3.3.9/lib/libfftw3f.a" -DFFTWF_INCLUDE_DIR="/data/lib/fftw/3.3.9/include" …

Thread-MPI

/data/apps/cmake/cmake-3.16.3/bin/cmake -DGMX_GPU=on -DBUILD_SHARED_LIBS=ON -DGMX_PREFER_STATIC_LIBS=OFF -DCMAKE_INSTALL_PREFIX=/data/apps/gromacs/2020.6_TMPI-cuda -DGMX_CYCLE_SUBCOUNTERS=ON -DGMX_FFT_LIBRARY=fftw3 -DFFTWF_LIBRARY="/data/lib/fftw/3.3.9/lib/libfftw3f.a" -DFFTWF_INCLUDE_DIR="/data/lib/fftw/3.3.9/include" …

make -j 32
make install -j 32
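If it is useful for the diagnosis, the relevant lines of the build summary can be pulled out of `gmx --version` along these lines. This is only a sketch: it assumes the installed binary is on the PATH as `gmx` (for the MPI build it would be `gmx_mpi`), and the helper name `summarize` is made up for illustration.

```shell
#!/bin/sh
# summarize: keep only the build-summary lines most relevant to performance
# from the output of `gmx --version` read on stdin. Hypothetical helper.
summarize() {
    grep -E '^(GROMACS version|MPI library|GPU support|SIMD instructions|FFT library):'
}

# Assumption: the binary is called `gmx` and is on the PATH
# (use `gmx_mpi` for the MPI build).
if command -v gmx >/dev/null 2>&1; then
    gmx --version 2>&1 | summarize
else
    echo "gmx not found on PATH; source GMXRC from the install prefix first" >&2
fi
```

The same filter can be run against the 2019.4 installation to compare the two builds side by side.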

Do you have any ideas about how to solve this issue?

Thank you for your support,

Kindest regards,

Vincenzo D’Amore