Hi!
Sorry for not following up. I could not reproduce the issue with the software we had available at the time, but we only had CUDA 11.8 installed. I guess you managed to get it working? Did the new oneAPI version fix things?
Sub-groups are a hardware property, so they are always enabled :) We use sub-group-level functionality on all GPUs in both CUDA and SYCL, so this should not cause any performance discrepancy.
A very extensive performance review was recently shared by @Entropy_YU: A series of performance benchmarks for MD Apps, including GROMACS
Here’s the comparison with CUDA from their post:
The 80-85% figure is consistent with our internal measurements for systems of similar size (though we did not run a thorough comparison).