GROMACS version: 2021.1-ROCmSoftwarePlatform-dev-20211215-c296fc66b-unknown
GIT SHA1 hash: c296fc66b8774e237928fda09a8aa28dc08f1790
Branched from: unknown
Memory model: 64 bit
MPI library: thread_mpi
OpenMP support: enabled (GMX_OPENMP_MAX_THREADS = 64)
GPU support: HIP
SIMD instructions: AVX2_256
FFT library: fftw-3.3.8-sse2-avx
RDTSCP usage: enabled
TNG support: enabled
Hwloc support: disabled
Tracing support: disabled
C compiler: /usr/bin/gcc GNU 9.3.0
C compiler flags: -mavx2 -mfma -Wall -Wno-unused -Wunused-value -Wunused-parameter -Wextra -Wno-sign-compare -Wpointer-arith -Wundef -Werror=stringop-truncation -Wno-missing-field-initializers -fexcess-precision=fast -funroll-all-loops -Wno-array-bounds -O3 -DNDEBUG
C++ compiler: /usr/bin/g++ GNU 9.3.0
C++ compiler flags: -mavx2 -mfma -Wall -Wextra -Wpointer-arith -Wmissing-declarations -Wundef -Wstringop-truncation -Wno-missing-field-initializers -fexcess-precision=fast -funroll-all-loops -Wno-array-bounds -fopenmp -O3 -DNDEBUG
HIP compiler: /opt/rocm/bin/hipcc 4.4.21401-bedc5f61
HIP compiler flags:
HIP driver: 40421.1
HIP runtime: 40421.1
I could not find the source code of this HIP version of GROMACS anywhere else.
Performance is underwhelming on MI210. I am not sure whether this build is reliable in terms of performance, or whether it has been validated for correctness.
Clarification from the GROMACS devs would be much appreciated.
This is a port of our code that the AMD developers have done themselves, without our input. This means that we don’t have access to the code and can’t help you with it, either in terms of getting better performance or checking correctness, sorry.
Since AMD does not publish GROMACS benchmark results, we are doing a performance survey on AMD devices to keep track of what works and what doesn’t.
We will test the HWE branch and see how its performance lines up against the HIP version.
If hipSYCL’s -munsafe-fp-atomics is used, there is a marginal performance gain of 2–3%, but MI100 performance is noisy, as you stated in 2643.
The hipSYCL macro was introduced with 2662 for MI200, since MI100 (gfx908) still generates a slow CAS loop regardless. Here, the original committer comments below that 2643 causes a hang, I assume also with MI200.
Manually calling the ‘fast’ function was tested on MI100 but not on MI200.
Above is what I have collected regarding AMD performance. To recap:
-munsafe-fp-atomics: noisy on MI100, hangs on MI200 (I am not 100% sure about this)
hipSYCL macro: too slow on MI100, good performance on MI200
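For reference, here is a minimal sketch of how the atomics flag can be passed when configuring a hipSYCL build of GROMACS. The -DGMX_GPU=SYCL and -DGMX_SYCL_HIPSYCL=ON options and the gfx90a target are assumptions based on 2022-era build documentation, not something stated in this thread:

```shell
# Sketch: configuring a hipSYCL build of GROMACS for an MI200-series GPU.
# Assumptions: GROMACS 2022-era CMake options and a gfx90a target.
cmake .. \
  -DCMAKE_CXX_COMPILER=/opt/rocm/llvm/bin/clang++ \
  -DGMX_GPU=SYCL \
  -DGMX_SYCL_HIPSYCL=ON \
  -DHIPSYCL_TARGETS="hip:gfx90a" \
  -DCMAKE_CXX_FLAGS="-munsafe-fp-atomics"  # the flag discussed above; 2-3% gain, noisy on MI100

make -j
```

On MI100 (gfx908) the recap above suggests this flag is not worthwhile, since the hardware still falls back to a slow CAS loop.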
From the log file, it looks like the GPUDirect feature has not been ported to SYCL. Do I understand correctly?
Yes, you are right.
We have in-development partial support for GPU-direct halo exchange with GPU-aware MPI (not thread-MPI) in a separate branch, aa-sycl-mpi-halo-exchange-v2, which will be part of the 2022-HWE release. If you are feeling adventurous, you can try building it. No special configure flags are required (besides -DGMX_MPI=ON), but at runtime you will have to set GMX_FORCE_GPU_AWARE_MPI=1 and GMX_ENABLE_DIRECT_GPU_COMM=1. If you try it, we would appreciate you sharing your observations.
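The steps above can be sketched as follows. The branch name, -DGMX_MPI=ON, and the two environment variables come from the post; the -DGMX_GPU=SYCL option, the mpirun invocation, and the rank count are illustrative assumptions:

```shell
# Sketch: building and running the experimental GPU-direct halo exchange branch.
# Requires a real (library) MPI, not thread-MPI.
git clone https://gitlab.com/gromacs/gromacs.git
cd gromacs && git checkout aa-sycl-mpi-halo-exchange-v2
mkdir build && cd build
cmake .. -DGMX_MPI=ON -DGMX_GPU=SYCL  # -DGMX_MPI=ON is the only required extra flag
make -j

# At runtime, opt in to GPU-direct communication over GPU-aware MPI.
export GMX_FORCE_GPU_AWARE_MPI=1
export GMX_ENABLE_DIRECT_GPU_COMM=1
mpirun -np 4 gmx_mpi mdrun -deffnm topol  # illustrative 4-rank run
```

Note that this requires the MPI library itself to be GPU-aware (e.g. ROCm-aware); GMX_FORCE_GPU_AWARE_MPI=1 only skips the detection check, it does not make a non-aware MPI work.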