Understanding the functionality of OpenCL kernels

GROMACS version: 2024.2
GROMACS modification: Yes/No

Hey,

I know there has recently been discussion about removing OpenCL support soon. However, I have a personal interest in OpenCL, so I hope I can get some clarification here.

As background, I am running GROMACS with the OpenCL backend on top of the PoCL CPU driver. Currently, I have focused only on the nonbonded interactions, which are calculated on the CPU through the OpenCL backend.

More specifically, I am stuck at the energy minimization step, which produces wrong results:

Energy minimization has stopped, but the forces have not converged to the
requested precision Fmax < 1000 (which may not be possible for your system).
It stopped because the algorithm tried to make a new step whose size was too
small, or there was no change in the energy since last step. Either way, we
regard the minimization as converged to within the available machine
precision, given your starting configuration and EM parameters.

To help the debugging, I have tried to minimize a small test system consisting of a single methanol molecule accompanied by 7 water molecules. Wrong forces are produced already at the first iteration, right after the OpenCL kernel is executed.

I have set:

GMX_OCL_NOOPT=1
GMX_OCL_NOFASTGEN=1
GMX_OCL_DISABLE_FAST_MATH=1
GMX_GPU_DISABLE_COMPATIBILITY_CHECK=1
GMX_OCL_FORCE_CPU=1

Also, I have disabled USE_CJ_PREFETCH manually.

My gromacs installation:

GROMACS version: 2024-dev-20240130-4d5683a176-dirty
GIT SHA1 hash: 4d5683a176e64fa46ec9da70755d263fa23c9cdf (dirty)
Precision: mixed
Memory model: 64 bit
MPI library: thread_mpi
OpenMP support: enabled (GMX_OPENMP_MAX_THREADS = 128)
GPU support: OpenCL
NBNxM GPU setup: super-cluster 2x2x2 / cluster 4
SIMD instructions: AVX2_128
CPU FFT library: fftw-3.3.8-sse2-avx-avx2-avx2_128
GPU FFT library: clFFT
Multi-GPU FFT: none
RDTSCP usage: enabled
TNG support: enabled
Hwloc support: disabled
Tracing support: disabled
C compiler: /usr/bin/cc GNU 12.3.0
C compiler flags: -Wno-array-bounds -fexcess-precision=fast -funroll-all-loops -mavx2 -mfma -Wall -Wno-unused -Wunused-value -Wunused-parameter -Wextra -Wno-sign-compare -Wpointer-arith -Wundef -Werror=stringop-truncation -Wno-missing-field-initializers -O3 -DNDEBUG
C++ compiler: /usr/bin/c++ GNU 12.3.0
C++ compiler flags: -Wno-array-bounds -fexcess-precision=fast -funroll-all-loops -mavx2 -mfma -Wall -Wextra -Wpointer-arith -Wmissing-declarations -Wundef -Wstringop-truncation -Wno-missing-field-initializers -Wno-cast-function-type-strict -fopenmp -O3 -DNDEBUG
BLAS library: External - detected on the system
LAPACK library: External - detected on the system
OpenCL include dir: /usr/include
OpenCL library: /usr/lib/x86_64-linux-gnu/libOpenCL.so
OpenCL version: 3.0

So, lately I have been trying to figure out what the kernel actually does (the specific kernel is nbnxn_kernel_ElecEw_VdwLJCombGeom_VF_prune_opencl). Could someone elaborate on this? The global work size seems to be 88x8 with a local work-group size of 8x8, making a total of 11 work groups, so 704 work items execute the kernel body per launch. What I don't understand is how this number maps to the 27 atoms that my system has. My idea was that there is a grid where the atoms lie and that each work item handles part of that grid, but the number of work items seems way too large for that.
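
For reference, this is the back-of-the-envelope arithmetic behind the numbers above (just my own sketch based on the launch sizes I observed, not anything taken from the GROMACS sources):

/* Back-of-the-envelope for the launch I observe: a global work size of 88x8
 * with a local work size of 8x8 gives 11 work groups and 704 work items,
 * i.e. 704 instances of the kernel body per launch. */
#include <stdio.h>

int main(void)
{
    const size_t global[2] = { 88, 8 }; /* global NDRange reported for the launch */
    const size_t local[2]  = { 8, 8 };  /* work-group size */

    const size_t numGroups = (global[0] / local[0]) * (global[1] / local[1]);
    const size_t numItems  = global[0] * global[1];

    printf("work groups: %zu, work items: %zu\n", numGroups, numItems);
    return 0;
}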

I feel like information on this is quite scarce and hard to find. Anything that would improve my understanding of the kernel's functionality would be much appreciated.

Thanks,

Tapio

As someone who worked on getting OpenCL running on Radeon via Mesa/LLVM around 2016-2017, I am curious: what is your end goal with this? There might be better approaches to getting there than debugging our code, which is scheduled for removal and written in a language that hasn't caught on that well in the broader ecosystem.

Hey,

the goal is to run the kernels on ASIPs through PoCL. The SYCL backend is also an option, as it can be run on top of PoCL via SPIR-V. However, OpenCL would be preferable if it is not too painful to accomplish. But yeah, you are correct that in the long run SYCL would be the better option. This is my master's thesis project, and as part of it I am also trying to learn some of the internals of GROMACS.

T

As a general recommendation, I would encourage comparing the trajectories (in terms of forces and velocities) from CPU-only runs without OpenCL to the CPU+PoCL results frame by frame, to make sure that the simulated system itself is not the cause of the issues. Assuming your system doesn't explode, this is far easier to do with a short MD run than with EM, of course.
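
For example, if you write the forces out to .xvg files for both runs (e.g. with gmx traj), something along the lines of the sketch below can report the largest per-value deviation. This is only a rough illustration, not part of GROMACS, and it assumes both files have identical layout:

/* Rough sketch, not GROMACS code: compare two whitespace-separated .xvg files
 * value by value (e.g. forces written out for a CPU-only run and a PoCL run)
 * and report the largest absolute difference. Assumes both files have the
 * same layout; '#' and '@' header/legend lines are skipped. */
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char* argv[])
{
    if (argc != 3)
    {
        fprintf(stderr, "usage: %s ref.xvg test.xvg\n", argv[0]);
        return 1;
    }
    FILE* fa = fopen(argv[1], "r");
    FILE* fb = fopen(argv[2], "r");
    if (fa == NULL || fb == NULL)
    {
        perror("fopen");
        return 1;
    }

    char   la[4096], lb[4096];
    double maxDiff   = 0.0;
    long   worstLine = 0;
    long   lineNo    = 0;
    while (fgets(la, sizeof la, fa) != NULL && fgets(lb, sizeof lb, fb) != NULL)
    {
        lineNo++;
        if (la[0] == '#' || la[0] == '@')
        {
            continue; /* skip xvg header/legend lines */
        }
        const char* pa = la;
        const char* pb = lb;
        while (1)
        {
            char*        ea;
            char*        eb;
            const double va = strtod(pa, &ea);
            const double vb = strtod(pb, &eb);
            if (ea == pa || eb == pb)
            {
                break; /* no more numbers on this line */
            }
            const double d = fabs(va - vb);
            if (d > maxDiff)
            {
                maxDiff   = d;
                worstLine = lineNo;
            }
            pa = ea;
            pb = eb;
        }
    }
    printf("largest absolute difference: %g (line %ld)\n", maxDiff, worstLine);
    fclose(fa);
    fclose(fb);
    return 0;
}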

Regarding the kernels, you might have more luck understanding their CUDA counterparts, which were the inspiration for the initial version of the OpenCL kernels contributed by Stream Computing (now Stream HPC). Of course, the CUDA kernels have changed since then, but you can always use Git to go back in time to the point where the two were better aligned.


Thanks, really appreciate this!

You can find a more detailed description of the algorithm here: [1306.1737] A flexible algorithm for calculating pair interactions on SIMD architectures


Thanks. I was able to make progress with these and got through the energy minimization with the PoCL CPU driver. Compared to the non-offloaded run, the calculated forces differ in the last few decimals, but this does not affect the coordinates in the EM step. However, it seems to be causing problems in the equilibration steps, and now I am trying to find the cause. It appears to be a data race, as the offloaded results are not the same across multiple runs but fluctuate a little bit.

Now, going through the kernel once again, I stumbled upon this part:

if (nb_sci.shift == c_centralShiftIndex
        && pl_cjPacked[cjPackedBegin].cj[0] == sci * c_nbnxnGpuNumClusterPerSupercluster)
    {  
        /* we have the diagonal: add the charge and LJ self interaction energy term */
        for (int i = 0; i < c_nbnxnGpuNumClusterPerSupercluster; i++)
        {
            const float qi = xqib[i * CL_SIZE + tidxi].w;
            barrier(CLK_LOCAL_MEM_FENCE);
            E_el += qi * qi;
...

This part is identical to the CUDA kernel, but I was wondering what is meant by the comment about the diagonal. The diagonal of what? Currently, I have a system of 27 atoms that executes in 12 4x4 work groups. It seems odd that the condition above is true for all work items of WG1 but not for any other work group.
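
Just to show how I currently read that check, here is my own simplified paraphrase in plain C (the constant is an assumption based on my super-cluster 2x2x2 setup, and the example values are made up; this is not GROMACS code):

/* My paraphrase of the check above, simplified to plain C (not GROMACS code).
 * With a 2x2x2 super-cluster I assume c_nbnxnGpuNumClusterPerSupercluster == 8. */
#include <stdbool.h>
#include <stdio.h>

#define CLUSTERS_PER_SUPERCLUSTER 8 /* assumed value of c_nbnxnGpuNumClusterPerSupercluster */

/* True when this pair-list entry has the central (zero) shift and its first
 * j-cluster is the first i-cluster of the super-cluster sci itself. */
static bool selfPairCheck(int shift, int centralShiftIndex, int firstJCluster, int sci)
{
    return shift == centralShiftIndex && firstJCluster == sci * CLUSTERS_PER_SUPERCLUSTER;
}

int main(void)
{
    /* Hypothetical values just to exercise the check: super-cluster 0 whose
     * first j-cluster is cluster 0 passes, super-cluster 1 paired with
     * j-cluster 0 does not. */
    printf("%d\n", selfPairCheck(0, 0, 0, 0)); /* prints 1 */
    printf("%d\n", selfPairCheck(0, 0, 0, 1)); /* prints 0 */
    return 0;
}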