Make check failing when GPU enabled

mikewallis · March 26, 2021, 3:56pm

GROMACS version: 2021.1
GROMACS modification: No

Hi folks,
I’ve been trying to get a CUDA-enabled gmx to pass make check but it’s timing out on a lot of tests. They’re (mostly) passing when run CPU-only, however.

cmake -DCMAKE_INSTALL_PREFIX=${PWD}/…/gromacs_final -DREGRESSIONTEST_DOWNLOAD=ON -DCMAKE_C_COMPILER=gcc -DCMAKE_CXX_COMPILER=g++ …/ -DGMX_GPU=CUDA -DCUDA_TOOLKIT_ROOT_DIR=/exports/applications/apps/SL7/cuda/10.1.105 -DGMX_CUDA_TARGET_SM="35;37;60;61;62"

(the targets are explicitly set as the binary could run on either K80 or Titan-X GPUs)

gmx mdrun -version

GROMACS version: 2021.1
Verified release checksum is 8c24bff5d3f78b0a9afb16e880b5667e5affe9a686d462482bac20ce975492c6
Precision: mixed
Memory model: 64 bit
MPI library: thread_mpi
OpenMP support: enabled (GMX_OPENMP_MAX_THREADS = 64)
GPU support: CUDA
SIMD instructions: AVX2_256
FFT library: fftw-3.3.3-sse2
RDTSCP usage: enabled
TNG support: enabled
Hwloc support: disabled
Tracing support: disabled
C compiler: /exports/applications/apps/community/roslin/gcc/7.3.0/bin/gcc GNU 7.3.0
C compiler flags: -mavx2 -mfma -Wno-missing-field-initializers -fexcess-precision=fast -funroll-all-loops -pthread -O3 -DNDEBUG
C++ compiler: /exports/applications/apps/community/roslin/gcc/7.3.0/bin/g++ GNU 7.3.0
C++ compiler flags: -mavx2 -mfma -Wno-missing-field-initializers -fexcess-precision=fast -funroll-all-loops -pthread -fopenmp -O3 -DNDEBUG
CUDA compiler: /exports/applications/apps/SL7/cuda/10.1.105/bin/nvcc nvcc: NVIDIA (R) Cuda compiler driver;Copyright (c) 2005-2019 NVIDIA Corporation;Built on Fri_Feb__8_19:08:17_PST_2019;Cuda compilation tools, release 10.1, V10.1.105
CUDA compiler flags:-std=c++14;-gencode;arch=compute_35,code=sm_35;-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_62,code=sm_62;-use_fast_math;;-mavx2 -mfma -Wno-missing-field-initializers -fexcess-precision=fast -funroll-all-loops -pthread -fopenmp -O3 -DNDEBUG

When running make check on GPU:

The following tests FAILED:
	 12 - MdlibUnitTest (Timeout)
	 19 - DomDecMpiTests (Timeout)
	 20 - EwaldUnitTests (Timeout)
	 22 - GpuUtilsUnitTests (Timeout)
	 23 - HardwareUnitTests (Timeout)
	 53 - MdrunOutputTests (Timeout)
	 54 - MdrunModulesTests (Timeout)
	 55 - MdrunIOTests (Timeout)
	 56 - MdrunTests (Timeout)
	 57 - MdrunPmeTests (Timeout)
	 58 - MdrunNonIntegratorTests (Timeout)
	 59 - MdrunTpiTests (Timeout)
	 60 - MdrunMpiTests (Timeout)
	 61 - MdrunMpiPmeTests (Timeout)
	 62 - MdrunMpiCoordinationTestsOneRank (Timeout)
	 63 - MdrunMpiCoordinationTestsTwoRanks (Timeout)
	 64 - MdrunFEPTests (Timeout)
	 66 - GmxapiExternalInterfaceTests (Timeout)
	 67 - GmxapiInternalInterfaceTests (Timeout)
	 68 - regressiontests/complex (Timeout)
	 69 - regressiontests/freeenergy (Timeout)
	 70 - regressiontests/rotation (Timeout)
	 71 - regressiontests/essentialdynamics (Timeout)

With GMX_DISABLE_GPU_DETECTION=1 make check:

The following tests FAILED:
63 - MdrunMpiCoordinationTestsTwoRanks (Timeout)

I am running this on a shared (gridengine) HPC cluster so there will be other jobs running on the node at the same time, but I should be getting exclusive use of a core and a GPU. It almost looks as if the tests aren’t being passed onto the GPU.

Any ideas?

Thanks,
Mike

mikewallis · March 26, 2021, 4:08pm

Just for clarity, I’ve made sure that CUDA_VISIBLE_DEVICES is set but as it’s a shared machine with up to 8 GPUs it won’t necessarily be 0. In the example above it was 1.
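
For illustration, here is a minimal shell sketch of how the GPU mask can be inspected from inside the job. The helper name count_mask_entries is hypothetical and not part of GROMACS or CUDA. It relies only on the CUDA behaviour that the devices listed in CUDA_VISIBLE_DEVICES are renumbered from 0 inside the process, so a job handed physical GPU 1 should still see it as device 0.

```shell
# Hypothetical helper: count the entries in a CUDA_VISIBLE_DEVICES-style
# comma-separated list passed as the first argument.
# Prints 0 for an empty list.
count_mask_entries() {
    [ -n "$1" ] || { echo 0; return; }
    old_ifs=$IFS
    IFS=,
    # Split the list on commas into the positional parameters
    set -- $1
    IFS=$old_ifs
    echo "$#"
}

# An unset variable means no mask at all (every GPU on the node is visible),
# which is different from a mask with zero entries.
if [ -z "${CUDA_VISIBLE_DEVICES+x}" ]; then
    echo "CUDA_VISIBLE_DEVICES is unset: all GPUs on the node are visible"
else
    echo "mask=${CUDA_VISIBLE_DEVICES}: $(count_mask_entries "$CUDA_VISIBLE_DEVICES") device(s) visible"
fi
```

With the mask set to 1, as in the run above, this would report a single visible device.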

pszilard · March 29, 2021, 4:07pm

That the tests are timing out suggests that either the CPU or the GPU you are using to execute the tests is busy. Make sure that you are not using resources that are already busy. Note that, unless restricted, the tests may try to use all cores / GPUs they detect, but if you set up the job correctly, restricting it to the allocated resources, and your scheduler is set up correctly, this should not happen.
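
One way to check this from inside the job is sketched below. It is only a diagnostic and makes some assumptions: that Grid Engine exports NSLOTS with the number of granted slots, that nproc (GNU coreutils) is available to report the cores the process may run on, and that nvidia-smi is installed. The helper names allocated_cores and visible_cores are hypothetical.

```shell
# Hypothetical helper: cores granted to this job.
# Assumes Grid Engine sets NSLOTS inside a job; falls back to 1 if it is unset.
allocated_cores() {
    echo "${NSLOTS:-1}"
}

# Hypothetical helper: cores this process may actually run on.
# nproc honours the CPU affinity mask; fall back to 1 if it is missing.
visible_cores() {
    if command -v nproc >/dev/null 2>&1; then
        nproc
    else
        echo 1
    fi
}

alloc=$(allocated_cores)
seen=$(visible_cores)
echo "allocated cores: $alloc, visible cores: $seen, GPU mask: ${CUDA_VISIBLE_DEVICES:-unset}"

# If the job can see more cores than it was granted, it is not pinned and the
# tests may start more threads than the allocation can serve.
if [ "$seen" -gt "$alloc" ]; then
    echo "warning: job is not restricted to its allocated cores"
fi

# Show the current load on the node's GPUs, if nvidia-smi is installed.
# nvidia-smi lists every GPU on the node regardless of CUDA_VISIBLE_DEVICES.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=index,name,utilization.gpu,memory.used --format=csv || true
fi
```

If the warning appears, or the GPU assigned to the job already shows high utilization before the tests start, that would be consistent with the timeouts being caused by contention rather than by the build itself.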
