I have a project that requires GROMACS to run specifically on the OpenCL backend. I am aware that OpenCL is deprecated and may be replaced in the future. Does anyone have an idea how much work it would take to get it working? Any tips on what exactly is wrong with OpenCL at the moment would also be appreciated. I have been trying to debug it for some days now and am wondering whether it is even doable.
OpenCL is deprecated, meaning it is not under active development, with our efforts (new features, optimizations for new hardware) focused on the SYCL and CUDA backends.
But the OpenCL build is not fundamentally broken; it works fine on some hardware. There are several known issues with it:
AppleSilicon GPUs are broken since GROMACS 2023.2. Here is an idea for the fix that we plan to get in at some point. AppleSilicon GPUs work fine with GROMACS 2023 and 2023.1.
NVIDIA GPUs since Volta (with independent thread scheduling) are not supported. We don’t have any work planned; you can use CUDA (or, if you want, SYCL) there. Older GPUs work fine with OpenCL.
AMD GPUs with RDNA* (specifically, Wave32) architecture (RX5000 - RX7000 series) are not supported. If I recall correctly, the problem is in the PME kernels, where some barrier is missing. We don’t have any work planned and recommend using SYCL for these devices, but fixes for OpenCL are welcome. Wave64 devices, such as GCN (older generations) or CDNA (datacenter devices), work fine; there is also a hack to force RDNA devices to run in Wave64 mode, which also works.
Would you mind sharing a bit more about your project? Unless you need OpenCL specifically, it is not a great choice for performance reasons.
Thanks for the quick and informative reply!
We are aiming to run GROMACS kernels on customized hardware and would really like to use OpenCL for it (for open-source reasons).
I think the problem is failing synchronisation between threads, so maybe additional barriers could fix it?
Do you know if anybody has tried this? It would be valuable to know where in the code the problem occurs (I am analysing a particular minimization kernel at the moment).
Which problem are you referring to, the NVIDIA issues?
For the nonbonded kernels, disabling CJ prefetching (the USE_CJ_PREFETCH macro) might fix the issue, since it eliminates the need for the manual syncwarp (see the CUDA kernel).
there is also a hack to force RDNA devices to run in Wave64 mode, which also works.
Looked it up; the hack in question is setting the GMX_OCL_FORCE_AMD_WAVEFRONT64 environment variable, which adds the poorly documented options -Wf,-mwavefrontsize64 to the AMD OpenCL JIT compiler invocation, forcing the device to run in Wave64 mode.
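In case it helps, here is a minimal sketch of launching mdrun with that variable set from a Python wrapper. The helper names (wave64_env, mdrun_command) and the "topol" run name are made up for illustration, and I am assuming the variable only needs to be present in the environment, so the value "1" is a guess rather than something the docs specify.

```python
import os
import subprocess


def wave64_env(base_env=None):
    """Return a copy of the environment with the Wave64 override set."""
    env = dict(os.environ if base_env is None else base_env)
    # Assumption: GROMACS only checks that the variable is set; "1" is arbitrary.
    env["GMX_OCL_FORCE_AMD_WAVEFRONT64"] = "1"
    return env


def mdrun_command(extra_args):
    """Build the mdrun command line (hypothetical helper, nothing is executed)."""
    return ["gmx", "mdrun", *extra_args]


def run_mdrun(extra_args):
    """Run mdrun with the override; requires an OpenCL build of gmx on PATH."""
    return subprocess.run(mdrun_command(extra_args), env=wave64_env(), check=True)


if __name__ == "__main__":
    # "topol" is a placeholder run name; nothing is launched here.
    print(mdrun_command(["-deffnm", "topol"]))
```

Calling run_mdrun(["-deffnm", "topol"]) would then start the run with the override applied only to that process, leaving the rest of the shell session untouched.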