GROMACS version: 2024.2
GROMACS modification: Yes/No
Hey,
I know there has recently been discussion about removing OpenCL support soon. However, I have a personal interest in OpenCL, so I hope I can get some clarification here.
As background, I am running GROMACS with OpenCL on top of the PoCL CPU driver. So far I have focused only on the nonbonded interactions, which are computed on the CPU through the OpenCL backend.
More specifically, I am stuck at the energy minimization step which produces wrong results:
Energy minimization has stopped, but the forces have not converged to the
requested precision Fmax < 1000 (which may not be possible for your system).
It stopped because the algorithm tried to make a new step whose size was too
small, or there was no change in the energy since last step. Either way, we
regard the minimization as converged to within the available machine
precision, given your starting configuration and EM parameters.)
To simplify debugging, I have reduced the simulated system to a single methanol molecule with 7 water molecules. The computed forces are already wrong in the first iteration, right after the OpenCL kernel is executed.
I have set:
GMX_OCL_NOOPT=1
GMX_OCL_NOFASTGEN=1
GMX_OCL_DISABLE_FAST_MATH=1
GMX_GPU_DISABLE_COMPATIBILITY_CHECK=1
GMX_OCL_FORCE_CPU=1
Also, I have disabled USE_CJ_PREFETCH manually.
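For anyone who wants to reproduce the setup, the environment variables above can be applied from a small Python wrapper. This is only a sketch: the helper name `build_env` and the run name `em` are placeholders of mine, and the command line is an assumption about how the minimization is launched, not my exact invocation.

```python
import os

# Debug settings listed above; each one is set to "1".
OCL_DEBUG_VARS = [
    "GMX_OCL_NOOPT",
    "GMX_OCL_NOFASTGEN",
    "GMX_OCL_DISABLE_FAST_MATH",
    "GMX_GPU_DISABLE_COMPATIBILITY_CHECK",
    "GMX_OCL_FORCE_CPU",
]


def build_env(base=None):
    """Return a copy of the environment with the OpenCL debug variables set."""
    env = dict(os.environ if base is None else base)
    for name in OCL_DEBUG_VARS:
        env[name] = "1"
    return env


env = build_env()

# Hypothetical command line: "em" is a placeholder run name.
cmd = ["gmx", "mdrun", "-deffnm", "em", "-nb", "gpu"]

# To actually launch the run (not executed here):
#   import subprocess
#   subprocess.run(cmd, env=env, check=True)
```

Copying the environment instead of modifying `os.environ` keeps the debug flags local to the one run being investigated.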
My GROMACS installation:
GROMACS version: 2024-dev-20240130-4d5683a176-dirty
GIT SHA1 hash: 4d5683a176e64fa46ec9da70755d263fa23c9cdf (dirty)
Precision: mixed
Memory model: 64 bit
MPI library: thread_mpi
OpenMP support: enabled (GMX_OPENMP_MAX_THREADS = 128)
GPU support: OpenCL
NBNxM GPU setup: super-cluster 2x2x2 / cluster 4
SIMD instructions: AVX2_128
CPU FFT library: fftw-3.3.8-sse2-avx-avx2-avx2_128
GPU FFT library: clFFT
Multi-GPU FFT: none
RDTSCP usage: enabled
TNG support: enabled
Hwloc support: disabled
Tracing support: disabled
C compiler: /usr/bin/cc GNU 12.3.0
C compiler flags: -Wno-array-bounds -fexcess-precision=fast -funroll-all-loops -mavx2 -mfma -Wall -Wno-unused -Wunused-value -Wunused-parameter -Wextra -Wno-sign-compare -Wpointer-arith -Wundef -Werror=stringop-truncation -Wno-missing-field-initializers -O3 -DNDEBUG
C++ compiler: /usr/bin/c++ GNU 12.3.0
C++ compiler flags: -Wno-array-bounds -fexcess-precision=fast -funroll-all-loops -mavx2 -mfma -Wall -Wextra -Wpointer-arith -Wmissing-declarations -Wundef -Wstringop-truncation -Wno-missing-field-initializers -Wno-cast-function-type-strict -fopenmp -O3 -DNDEBUG
BLAS library: External - detected on the system
LAPACK library: External - detected on the system
OpenCL include dir: /usr/include
OpenCL library: /usr/lib/x86_64-linux-gnu/libOpenCL.so
OpenCL version: 3.0
Lately I have been trying to figure out what the kernel actually does (the specific kernel is nbnxn_kernel_ElecEw_VdwLJCombGeom_VF_prune_opencl). Could someone elaborate on this? The global work size seems to be 88x8, which gives 11 work-groups of size 8x8 each, so 704 work-items execute the kernel per iteration. What I don't understand is how this number maps to the 27 atoms in my system. My idea was that the atoms lie on a grid and each work-item handles part of that grid, but the number of work-items seems far too large for this.
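To make the numbers concrete, here is the arithmetic as a short Python sketch. The sizes are the ones I observed. The variable names are mine, and the cluster count is only a lower bound that assumes atoms are packed densely into clusters of 4 (the cluster size in the build info below "NBNxM GPU setup"). I have not confirmed how GROMACS actually fills its clusters.

```python
import math

# Values reported in this post (observed, not derived from GROMACS source).
global_size = (88, 8)   # global work size, in work-items
local_size = (8, 8)     # work-group size, in work-items
n_atoms = 27            # atoms in the reduced test system
cluster_size = 4        # from "NBNxM GPU setup: super-cluster 2x2x2 / cluster 4"

# Total work-items and how they split into work-groups.
work_items = global_size[0] * global_size[1]
work_groups = (global_size[0] // local_size[0]) * (global_size[1] // local_size[1])
items_per_group = local_size[0] * local_size[1]

# Assumption: dense packing, so this is only a lower bound on the cluster count.
min_clusters = math.ceil(n_atoms / cluster_size)

print(work_items, work_groups, items_per_group, min_clusters)
# prints: 704 11 64 7
```

So even under this optimistic packing assumption, 7 clusters of atoms are facing 11 work-groups and 704 work-items, which is the mismatch I am asking about.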
I feel like information on this is quite scarce and hard to find. Anything that would improve my understanding of how the kernel works would be much appreciated.
Thanks,
Tapio