First, thank you for such a thorough analysis! And for taking the time to translate it into English!
Have you tried setting the HIPSYCL_RT_MAX_CACHED_NODES=0 environment variable, as described in "GROMACS get stuck AMD GPU"?
We still don’t have a good understanding of the root cause, but we suspect it is related to the hipSYCL caching behavior: hipSYCL submits tasks to the GPU in bursts, which the AMD HSA runtime sometimes handles poorly. Setting HIPSYCL_RT_MAX_CACHED_NODES=0 forces immediate submission, avoiding this potential problem. With the latest hipSYCL, it almost always improves performance too, though mostly for small systems.
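For reference, a minimal way to try this in a bash-like shell. The mdrun command in the comment is only a placeholder; use whatever command line you normally run:

```shell
# Disable hipSYCL's node caching so that tasks are submitted to the GPU immediately.
export HIPSYCL_RT_MAX_CACHED_NODES=0

# Then start the simulation from the same shell so it inherits the variable, e.g.:
#   gmx mdrun -deffnm md      # "md" is a placeholder file prefix
```

The variable has to be set in the environment of the process that runs mdrun, so in a batch job it belongs in the job script rather than in the login shell.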
I don’t think there were many performance-related changes since then. They added a new backend (OpenCL) and a new programming model (C++ Standard Parallelism / stdpar), which are significant changes, but neither is used by GROMACS.
Speaking of the Intel A770: one can get slightly better performance by using the Double-batched FFT library instead of MKL. For the A770, Double-batched FFT should be compiled with -DCMAKE_CXX_COMPILER=icpx -DCMAKE_BUILD_TYPE=Release -DBUILD_SHARED_LIBS=YES -DNO_DOUBLE_PRECISION=ON, and then GROMACS should be told to use it.
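A rough sketch of the two configure steps, in case it helps. The source and install paths are placeholders. The GROMACS options (GMX_GPU=SYCL, GMX_GPU_FFT_LIBRARY=BBFFT, and CMAKE_PREFIX_PATH pointing at the library install) are given as I remember them from the GROMACS 2023 installation guide, so please check them against the guide for your version:

```shell
# Placeholder install location for Double-batched FFT (an assumption, pick your own).
BBFFT_PREFIX="$HOME/opt/bbfft"

# 1. Configure Double-batched FFT with the flags mentioned above
#    (run from an empty build directory; the source path is a placeholder).
cmake /path/to/double-batched-fft-library \
  -DCMAKE_CXX_COMPILER=icpx -DCMAKE_BUILD_TYPE=Release \
  -DBUILD_SHARED_LIBS=YES -DNO_DOUBLE_PRECISION=ON \
  -DCMAKE_INSTALL_PREFIX="$BBFFT_PREFIX"
# ...then build and install it before configuring GROMACS.

# 2. Configure GROMACS to use it for GPU FFTs instead of MKL
#    (again from a separate build directory; option names to be verified).
cmake /path/to/gromacs \
  -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx \
  -DGMX_GPU=SYCL -DGMX_GPU_FFT_LIBRARY=BBFFT \
  -DCMAKE_PREFIX_PATH="$BBFFT_PREFIX"
```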
In my tests, the speed-up is around 10% on STMV when using oneAPI 2023.2 (same as yours), so nothing dramatic. Just FYI.