Question about SYCL kernel

Hi, I’m attempting to run Gromacs through SYCL → PoCL → CPU.

I’m getting a segfault that comes from this indexing in the SYCL code:

static inline float interpolateCoulombForceR(...)
{
    ...
    const float left = a_coulombTab[index];
    ...
}

This traces back to the coordinate values in the shared-memory array (sm_xq), which look as follows:

cache_index	x_qi[0]	x_qi[1]	x_qi[2]
0	0.017759	0.221908	6.882967
1	0.052037	0.148851	6.752980
2	0.064203	0.140132	6.916962
3	0.162037	0.158720	6.926071
4	0.276955	0.117952	6.811072
5	0.307072	0.213129	6.816928
6	0.560796	0.222230	6.732964
7	0.599214	0.137695	6.770084
8	0.697065	0.518162	6.774996
9	0.671004	0.463999	6.854915
10	0.636931	0.374839	6.825089
11	0.472990	0.241075	6.776952
12	0.356926	0.521089	6.777979
13	0.317221	0.546726	6.866105
14	0.283973	0.249920	6.906999
15	0.053192	0.558918	6.797868
16	0.105778	0.421836	6.987104
17	0.354853	0.488185	6.937916
18	-1000000.000000	-1000000.000000	-1000000.000000
19	-1000000.000000	-1000000.000000	-1000000.000000
20	-1000000.000000	-1000000.000000	-1000000.000000
21	-1000000.000000	-1000000.000000	-1000000.000000
22	-1000000.000000	-1000000.000000	-1000000.000000
23	-1000000.000000	-1000000.000000	-1000000.000000
24	-1000000.000000	-1000000.000000	-1000000.000000
25	-1000000.000000	-1000000.000000	-1000000.000000
26	-1000000.000000	-1000000.000000	-1000000.000000
27	-1000000.000000	-1000000.000000	-1000000.000000
28	-1000000.000000	-1000000.000000	-1000000.000000
29	-1000000.000000	-1000000.000000	-1000000.000000
30	-1000000.000000	-1000000.000000	-1000000.000000
31	-1000000.000000	-1000000.000000	-1000000.000000

The index value used to access a_coulombTab is derived from these coordinate values, and it ends up being huge for the large negative padding entries.
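For context, my understanding of the lookup is the usual scaled-distance table pattern, roughly along these lines (a sketch of my reading of the kernel, not the verbatim Gromacs source):

    const float normalized = coulombTabScale * r;       // r computed from the sm_xq coordinates
    const int   index      = static_cast<int>(normalized);
    const float left       = a_coulombTab[index];       // out of bounds when index is huge

With the -1000000 padding coordinates, r becomes enormous, and so does index.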

So, I was wondering: do the mask variables imask and maskJI play some role in making sure that only valid coordinate values are used? How should the mask variables be interpreted?

Does anyone have any insight on this?

Edit: I might add that I’m running with a work-group (WG) size of 16 and a sub-group (SG) size of 8.

Thanks,

Tapio

Yes. You can read the details of the algorithm here (sections 2.3 and 3.1). The short version: atoms are grouped into clusters, which are then (for GPUs) grouped into super-clusters. The sizes of both are fixed, so padding is necessary. The imask is used to avoid computing interactions for pairs of particles that are too far from each other (and that includes the padding “particles”). The pseudocode in Fig. 7 is a decent explainer, IMO.
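To make the bit layout concrete, here is a rough sketch of how the inner loop consumes imask (names are illustrative, not the exact Gromacs identifiers; compare the pseudocode in Fig. 7). Each bit of imask corresponds to one (j-cluster, i-cluster) pair, and maskJI walks over the i-cluster bits for the current j-cluster jm:

    unsigned int maskJI = 1U << (jm * clustersPerSupercluster); // first i-cluster bit for j-cluster jm
    for (int i = 0; i < clustersPerSupercluster; i++)
    {
        if (imask & maskJI) // cluster pair is within range and not pure padding
        {
            // load the i-cluster coordinates from sm_xq, compute distances and forces ...
        }
        maskJI <<= 1; // advance to the next i-cluster bit
    }

If the padding bits were cleared in imask as intended, the -1000000.0 coordinates would never reach the distance and table-index calculation.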

As you deduced, the masking is somehow not properly applied in your kernels, so the force is computed for an interaction over an unreasonably long distance. When this distance is used to look up a value from the table, you get an out-of-bounds access.

Properly handling the interaction masking would be best, since it avoids extra compute. If fixing the masking is not an option, you can try adding handling of out-of-bounds values to interpolateCoulombForceR (returning 0 would be sane for the physically relevant cases). Another alternative is setting the GMX_GPU_NB_ANA_EWALD=1 environment variable to use the analytical instead of the tabulated version of the kernels (no guarantee that it will work, but at least it avoids this specific failure mode).
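If you go the bounds-checking route, something along these lines inside interpolateCoulombForceR should do. This is only a sketch: the original function does not receive the table size, so you would need to pass it in (here as a hypothetical tableSize parameter):

    if (index < 0 || index + 1 >= tableSize)
    {
        return 0.0F; // at such distances the tabulated force is effectively zero anyway
    }
    const float left  = a_coulombTab[index];
    const float right = a_coulombTab[index + 1];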

In addition to masking (which is a performance feature), there are also exclusions (also implemented via a bitmask). Exclusions are necessary for correctness, e.g., to avoid computing non-bonded forces between covalently-linked atoms.
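Schematically (illustrative names only, not the exact kernel code), an exclusion enters the force as a 0/1 factor rather than a branch, so excluded pairs contribute exactly zero:

    // Bit pairBit of the exclusion mask excl is 1 for interacting pairs, 0 for excluded ones.
    const float exclusionFactor = static_cast<float>((excl >> pairBit) & 1U);
    fInvR *= exclusionFactor; // covalently-linked pairs get zero non-bonded force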

Thanks, this was very informative and useful!
We managed to track the issue down to the PoCL side.

Tapio
