GROMACS 2020.4 compilation with GPU support on non-GPU nodes

GROMACS version: 2020.4
GROMACS modification: No

Hi,

I’m trying to compile GROMACS 2020.4 with GPU support on a cluster that has both GPU and CPU-only nodes. However, when I try to compile on the GPU nodes, I fail due to missing libraries that I don’t have permission to install.

So I tried on the non-GPU nodes, where the libraries are available, but CMake does not detect any GPUs during compilation, and hence my compiled program (even though it passes all checks) has no GPU support, as can be seen from the md.log of a test run. Furthermore, this test run does not proceed without GPU support but gets stuck at the first step (md.log remains at step 0).

GROMACS version: 2020.4
Verified release checksum is 79c2857291b034542c26e90512b92fd4b184a1c9d6fa59c55f2e24ccf14e7281
Precision: single
Memory model: 64 bit
MPI library: thread_mpi
OpenMP support: enabled (GMX_OPENMP_MAX_THREADS = 64)
GPU support: CUDA
SIMD instructions: AVX2_256
FFT library: fftw-3.3.8-sse2-avx-avx2-avx2_128
RDTSCP usage: enabled
TNG support: enabled
Hwloc support: hwloc-1.11.0
Tracing support: disabled
C compiler: /software/gcc/7.2.0/bin/gcc GNU 7.2.0
C compiler flags: -mavx2 -mfma -fexcess-precision=fast -funroll-all-loops -O3 -DNDEBUG
C++ compiler: /software/gcc/7.2.0/bin/g++ GNU 7.2.0
C++ compiler flags: -mavx2 -mfma -fexcess-precision=fast -funroll-all-loops -fopenmp -O3 -DNDEBUG
CUDA compiler: /software/cuda/10.0/bin/nvcc nvcc: NVIDIA ® Cuda compiler driver;Copyright © 2005-2018 NVIDIA Corporation;Built on Sat_Aug_25_21:08:01_CDT_2018;Cuda compilation tools, release 10.0, V10.0.130
CUDA compiler flags:-gencode;arch=compute_52,code=sm_52;-use_fast_math;;-mavx2 -mfma -fexcess-precision=fast -funroll-all-loops -fopenmp -O3 -DNDEBUG
CUDA driver: 10.10
CUDA runtime: N/A

Running on 1 node with total 48 cores, 48 logical cores (GPU detection deactivated)
Hardware detected:
CPU info:
Vendor: Intel
Brand: Intel® Xeon® CPU E5-2680 v3 @ 2.50GHz
Family: 6 Model: 63 Stepping: 2
Features: aes apic avx avx2 clfsh cmov cx8 cx16 f16c fma htt intel lahf mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd rdtscp sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
Hardware topology: Only logical processor count

Compilation options were:

cmake … -DCMAKE_INSTALL_PREFIX=/home/anhes/myapps/gromacs_cuda -DGMX_BUILD_OWN_FFTW=ON -DREGRESSIONTEST_DOWNLOAD=ON -DGMX_THREAD_MPI=ON -DGMX_GPU=ON -DCUDA_TOOLKIT_ROOT_DIR=/software/cuda/10.0 -DGMX_CUDA_TARGET_SM="52" >cmake.log&

My question now is whether there is a way to compile GROMACS with GPU support without the cards actually being present on the node I am compiling on. If so, does anyone spot what I am doing wrong in my settings?

Thanks in advance,

Julian

Hi Julian,

Compiling a GPU-enabled build does not require GPU hardware, so I suspect there are some software configuration issues here.

If library dependencies are not found at build time, then unless these are optional and you make sure not to link against them, you won’t be able to run binaries built elsewhere that link against these libraries.

Your GROMACS version header shows “GPU support: CUDA”, which means that it is built with CUDA support. Further below, “CUDA runtime: N/A” indicates that no (compatible) CUDA runtime is available, and that is why no GPUs are detected. Furthermore, the “(GPU detection deactivated)” message suggests that some error condition, e.g. an incompatible CUDA runtime, disabled the GPU detection. Have you made sure that the CUDA runtime is available and functional on the host where you are trying to run?


Szilárd

Hi Szilárd,

first of all, thanks for making clear that I don’t need to compile on
the compute nodes! Now I can go on looking for the error elsewhere.
I learned that on our system it is not the complete libraries that are
missing on the compute nodes, but only the headers, to save space. So
linking against these libraries should not be a problem; only the
compilation must be done on the login nodes, where the headers are available.

Regarding the functionality of the CUDA runtime on the nodes, I tried
the following: I compiled this code
(https://github.com/chathhorn/cuda-semantics/blob/master/examples/getVersion.cu)
using nvcc 10.0 on the login node and ran it on both the compute and
the login nodes, each with the respective CUDA modules loaded.

On the compute node this confirms what gromacs indicated: the runtime
version is not found

$ ./cudaGetDriverVersion
Driver Version: 10010
Runtime Version: 0

On the login node I get neither a driver nor a runtime, as there is no
GPU and hence no driver installed:

$ ./cudaGetDriverVersion
Driver Version: 0
Runtime Version: 0

Is this the usual way to check for the presence of cudart? If so, does that
indicate that something is wrong with the installed CUDA modules?

Thanks for your help and best wishes

Julian

That makes sense; the only resulting restriction is that compilation will only be possible on the login nodes.

That is the same mechanism GROMACS uses.

What you should also check is the return value of cudaRuntimeGetVersion; this can indicate the cause of the missing runtime.


Szilárd

I modified my code as follows to also include the return values of the CUDA API calls:

#include <cuda.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char** argv) {
    int driver_version = 0, runtime_version = 0;

    cudaError_t driver_return = cudaDriverGetVersion(&driver_version);
    cudaError_t runtime_return = cudaRuntimeGetVersion(&runtime_version);

    printf("Driver Version: %d\n", driver_version);
    printf("%s\n", cudaGetErrorString(driver_return));

    printf("Runtime Version: %d\n", runtime_version);
    printf("%s\n", cudaGetErrorString(runtime_return));

    return 0;
}

After execution on the compute-nodes, I get

$ ./cudaGetDriverVersion
Driver Version: 10010
no error
Runtime Version: 0
no error

This was independent of the CUDA version loaded on the node. To me this looks like the CUDA runtime is not installed at all; does this make sense?

Thanks for your help

Julian

It is more likely that an incompatible runtime is used. What does not make sense is that your initial report already shows CUDA driver: 10.10, while a runtime of 10.0 is suggested by the line “Cuda compilation tools, release 10.0, V10.0.130”. These should be compatible; I am not sure why they are not. The only thing I can think of is to try CUDA 10.1 (or an earlier version like 9.2).

Side note: the GROMACS build system will by default link statically against the CUDA runtime (see the value of the CUDA_cudart_static_LIBRARY CMake cache variable in CMakeCache.txt), and therefore a missing library cannot be the issue; in any case, that would prevent even launching gmx.

Cheers,
Szilárd