GPU detection deactivated

I believe I’ve resolved this issue, and my conclusion is that GROMACS could really help by being more explicit about what went wrong here. I know that’s not always possible, but in this case “GPU detection deactivated” was actually quite misleading.

Just to see what would happen, I compiled 2018.8 with the same options on the same hardware and ran it with the same runtime options. This is what I got:

CUDA compiler: /modules/apps/cuda/11.0.1/bin/nvcc nvcc: NVIDIA (R) Cuda compiler driver;Copyright (c) 2005-2020 NVIDIA Corporation;Built on Wed_May__6_19:09:25_PDT_2020;Cuda compilation tools, release 11.0, V11.0.167;Build cuda_11.0_bu.TC445_37.28358933_0

CUDA driver: 10.20
CUDA runtime: 32.47

NOTE: Detection of GPUs failed. The API reported:
GROMACS cannot run tasks on a GPU.

So this time the CUDA runtime was loaded and GPU detection was attempted, but it failed. A Google search for “gromacs detection of GPUs failed” led me to this answer from the old mailing list, “Re: [gmx-users] Gromacs 2018.5 with CUDA”, specifically:

CUDA compiler: /usr/local/cuda-9.2/bin/nvcc nvcc: NVIDIA (R) Cuda compiler
driver;Copyright (c) 2005-2018 NVIDIA Corporation;Built on
Wed_Apr_11_23:16:29_CDT_2018;Cuda compilation tools, release 9.2, V9.2.88

CUDA driver: 9.10
CUDA runtime: 32.64

You can not run a program compiled with CUDA 9.2 on a system with a driver
labeled “CUDA 9.1” compatible. Either use CUDA 9.1 or upgrade your NVIDIA
drivers.

After I compiled with CUDA 10.1 instead of 11.0, both the 2018.8 and 2020.2 versions were able to detect and use the GPUs.
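For anyone who wants to check this before rebuilding, the rule from the mailing list answer can be sketched in a few lines of Python. This is only a rough sketch, not anything GROMACS does: the function names are my own, and it assumes the log prints the driver's minor version times ten (so “10.20” means CUDA 10.2 and “9.10” means 9.1), which matches the two logs quoted above. It also assumes the simple rule that the toolkit must not be newer than what the driver supports.

```python
import re
from typing import Tuple


def parse_nvcc_release(nvcc_banner: str) -> Tuple[int, int]:
    """Extract (major, minor) from an nvcc banner, e.g. '... release 11.0, V11.0.167'."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_banner)
    if match is None:
        raise ValueError("no 'release X.Y' found in nvcc banner")
    return int(match.group(1)), int(match.group(2))


def parse_logged_driver(value: str) -> Tuple[int, int]:
    """Parse the value printed after 'CUDA driver:' in the log, e.g. '10.20'.

    ASSUMPTION: the log shows the minor version multiplied by ten,
    so '10.20' is CUDA 10.2 and '9.10' is CUDA 9.1.
    """
    major, minor = value.strip().split(".")
    return int(major), int(minor) // 10


def toolkit_newer_than_driver(nvcc_banner: str, logged_driver: str) -> bool:
    """True if the build toolkit is newer than the CUDA version the driver supports."""
    return parse_nvcc_release(nvcc_banner) > parse_logged_driver(logged_driver)


# Values from this thread: built with CUDA 11.0, driver reported as 10.20.
print(toolkit_newer_than_driver("Cuda compilation tools, release 11.0, V11.0.167", "10.20"))  # True
# Values from the mailing list post: built with CUDA 9.2, driver reported as 9.10.
print(toolkit_newer_than_driver("Cuda compilation tools, release 9.2, V9.2.88", "9.10"))  # True
```

A result of True means the build is likely to hit the problem described here; a toolkit older than or equal to the driver's version should return False.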

Since CMake was both supplied with the CUDA version:

-- Found CUDA: /modules/apps/cuda/11.0.1 (found suitable version "11.0", minimum required is "9.0")

and detected the GPUs:

-- Looking for NVIDIA GPUs present in the system
-- Number of NVIDIA GPUs detected: 2

it seems like the CUDA driver / CUDA toolkit incompatibility could have been detected at that stage? It certainly became known at run time, but unlike the 2018 version, the 2020 version appears to give up silently on loading the CUDA runtime and detecting GPUs:

CUDA driver: 10.20
CUDA runtime: N/A

Running on 1 node with total 16 cores, 32 logical cores (GPU detection deactivated)

which seems like the opposite of the desired behavior. Maybe the code is falling through a switch statement to a catch-all case that abandons GPU detection? I can’t imagine this behavior was intended. In any case, since the 2020 version already changes its loading / detection behavior in light of the mismatch, maybe that change could be turned into a notification?
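Until something like that exists, a user-side workaround is to scan the log for the symptoms seen in this thread. The sketch below is hypothetical and not part of GROMACS: the patterns are taken from the log excerpts quoted above, while the function name and the hint wording are my own.

```python
import re
from typing import List, Tuple

# Patterns come from the log excerpts in this thread; the hints are my own wording.
SYMPTOMS: List[Tuple[str, str]] = [
    (r"CUDA runtime:\s*N/A",
     "CUDA runtime was not loaded; check that the build toolkit is not newer than the driver supports."),
    (r"GPU detection deactivated",
     "GPU detection was skipped; compare the nvcc release with the 'CUDA driver' line."),
    (r"Detection of GPUs failed",
     "GPU detection ran but failed; a driver / toolkit mismatch is one known cause."),
]


def gpu_detection_hints(log_text: str) -> List[str]:
    """Return a hint for each known symptom pattern found in the log text."""
    return [hint for pattern, hint in SYMPTOMS if re.search(pattern, log_text)]


# Example with the 2020 log lines quoted above.
log_2020 = (
    "CUDA driver: 10.20\n"
    "CUDA runtime: N/A\n"
    "Running on 1 node with total 16 cores, 32 logical cores (GPU detection deactivated)\n"
)
for hint in gpu_detection_hints(log_2020):
    print(hint)
```

This only pattern-matches text, so it cannot confirm the cause; it just points at the driver / toolkit versions as the first thing to check.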