HIP version of GROMACS

GROMACS version: 2021.1
GROMACS modification: No

Hello,

According to the installation guide for GROMACS 2022.1, hipSYCL and the ROCm runtime are required for CDNA GPU support, which can be enabled with:

  • -DGMX_GPU=SYCL
  • -DGMX_SYCL_HIPSYCL=on
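
As a rough sketch, the two flags above would be passed to cmake at configure time. The snippet only prints the command (a dry run). The source directory (`..`) and the `HIPSYCL_TARGETS` value are assumptions: `gfx90a` is the architecture name for MI200-series cards, and the exact compiler settings should be checked against the installation guide.

```shell
# Dry run: assemble the configure command and print it instead of running it.
# SRC_DIR and GPU_ARCH are placeholders (assumptions), not values from this thread.
SRC_DIR=".."
GPU_ARCH="gfx90a"   # assumed target for MI210; MI100 would be gfx908

CONFIGURE_CMD="cmake ${SRC_DIR} -DGMX_GPU=SYCL -DGMX_SYCL_HIPSYCL=on -DHIPSYCL_TARGETS=hip:${GPU_ARCH}"
echo "${CONFIGURE_CMD}"
```

Running the printed command from an empty build directory would perform the actual configuration.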

However, the GROMACS Docker image provided by AMD Infinity Hub (https://www.amd.com/en/technologies/infinity-hub/gromacs) was built with HIP:

GROMACS version:    2021.1-ROCmSoftwarePlatform-dev-20211215-c296fc66b-unknown
GIT SHA1 hash:      c296fc66b8774e237928fda09a8aa28dc08f1790
Branched from:      unknown
Precision:          mixed
Memory model:       64 bit
MPI library:        thread_mpi
OpenMP support:     enabled (GMX_OPENMP_MAX_THREADS = 64)
GPU support:        HIP
SIMD instructions:  AVX2_256
FFT library:        fftw-3.3.8-sse2-avx
RDTSCP usage:       enabled
TNG support:        enabled
Hwloc support:      disabled
Tracing support:    disabled
C compiler:         /usr/bin/gcc GNU 9.3.0
C compiler flags:   -mavx2 -mfma -Wall -Wno-unused -Wunused-value -Wunused-parameter -Wextra -Wno-sign-compare -Wpointer-arith -Wundef -Werror=stringop-truncation -Wno-missing-field-initializers -fexcess-precision=fast -funroll-all-loops -Wno-array-bounds -O3 -DNDEBUG
C++ compiler:       /usr/bin/g++ GNU 9.3.0
C++ compiler flags: -mavx2 -mfma -Wall -Wextra -Wpointer-arith -Wmissing-declarations -Wundef -Wstringop-truncation -Wno-missing-field-initializers -fexcess-precision=fast -funroll-all-loops -Wno-array-bounds -fopenmp -O3 -DNDEBUG
HIP compiler:      /opt/rocm/bin/hipcc 4.4.21401-bedc5f61
HIP compiler flags:
HIP driver:        40421.1
HIP runtime:       40421.1

I could not find the source code of the HIP version of GROMACS elsewhere.
The performance is underwhelming on MI210. I am not sure whether this build is reliable in terms of performance or whether it has been validated for correctness.

Clarification from GROMACS devs is much appreciated.

Thanks.

Hello,

this is a port of our code that the AMD developers have done themselves without our input. This means that we don’t have access to the code and can’t help you with it in terms of getting better performance or to check correctness, sorry.

Cheers

Paul

Thanks for your clarification.
Since this is an unofficial port by AMD, I will tread with caution.

Based on the most recent hipSYCL-related issue (4465) and the subsequent merge requests (2643, 2662):

  • MI100’s performance is unreliable
  • MI200 still hangs

Is the above understanding correct?
For testing MI200 / MI210, do you recommend the latest commit over the stable release 2022.1?

For testing, I would recommend using the latest commit from the hwe-release-2022 branch, where we collect changes needed to allow the 2022 release to run on hardware using hipSYCL.

This means that you can expect the rest of the features to be the same as for the official 2022 releases, with only changes to hardware support.
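
A minimal sketch of fetching that branch, printed as a dry run. The branch name comes from the recommendation above. The repository URL and the local directory name `gromacs-hwe` are assumptions.

```shell
# Dry run: print the clone command for the hardware-enablement branch.
# REPO_URL and the target directory are assumptions, not taken from this thread.
REPO_URL="https://gitlab.com/gromacs/gromacs.git"
BRANCH="hwe-release-2022"

CLONE_CMD="git clone --branch ${BRANCH} --single-branch ${REPO_URL} gromacs-hwe"
echo "${CLONE_CMD}"
```
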

Again, I would also recommend treading with caution there, as we are still trying to get things to work reliably.

Cheers

Paul

Can you please clarify what you mean by the above?

Thanks, we will certainly heed your advice.

Since AMD does not publish GROMACS benchmark results, we are doing a performance survey on AMD devices to keep track of what works and what doesn’t.
We will test the hwe branch and see how its performance compares with the HIP version.

  • If hipSYCL’s -munsafe-fp-atomics is used, there is a marginal performance gain of 2~3%, but MI100 performance is noisy, as you stated in 2643.
  • The hipSYCL macro was introduced with 2662 for MI200, since MI100 (gfx908) still generates a slow CAS loop regardless. There, the original committer comments that 2643 causes a hang, I assume also with MI200.
  • Manually calling the ‘fast’ function was tested with MI100 but not MI200.

Above is what I collected regarding AMD performance. To recap:

  • -munsafe-fp-atomics: noisy for MI100, hangs for MI200 (I am not 100% sure about this)
  • hipsycl macro: too slow for MI100, good performance with MI200

Please correct me if I misunderstood the issue.

Thanks.

That’s true, but it worked fine on our MI100 test stand, so YMMV.

And, as far as I know, no tests were done on MI200 for any of the solutions.

From the log file, it seems the GPUDirect feature has not been ported to SYCL. Do I understand correctly?

Yes, you are right.

We have in-development partial support for GPU Direct halo exchange with GPU-aware MPI (not threadMPI) in a separate branch, aa-sycl-mpi-halo-exchange-v2, which will be part of the 2022-HWE release. If you are feeling adventurous, you can try building it. No special configure flags are required (besides -DGMX_MPI=ON), but at runtime, you will have to set GMX_FORCE_GPU_AWARE_MPI=1 GMX_ENABLE_DIRECT_GPU_COMM=1. If you try it, we would appreciate you sharing your observations.
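
For illustration, the runtime settings above could be applied as follows. The two environment variables are the ones named in the reply. The launcher (`mpirun`), the rank count, the `gmx_mpi` binary name, and the input file `topol.tpr` are assumptions that depend on the local MPI installation and build. The run command is only printed, not executed.

```shell
# Environment variables named in the reply above.
export GMX_FORCE_GPU_AWARE_MPI=1
export GMX_ENABLE_DIRECT_GPU_COMM=1

# Placeholders (assumptions): adjust to the local MPI launcher and input file.
NRANKS=2
TPR_FILE="topol.tpr"

# Dry run: print the launch command instead of executing it.
RUN_CMD="mpirun -np ${NRANKS} gmx_mpi mdrun -s ${TPR_FILE}"
echo "${RUN_CMD}"
```

The md.log file from such a run should indicate whether direct GPU communication was actually enabled.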