Making efficient use of CPU and GPU in an old workstation using gmx 5.1.4

GROMACS version:5.1.4
GROMACS modification: Yes/No
I want to run a simulation on an old workstation that has an NVIDIA Quadro K4000 and GROMACS 5.1.4. I am trying to use the CPU (bonded and PME) together with the GPU (short-range non-bonded) to get the most out of the workstation. GROMACS detects the GPU and automatically selects it for mdrun (details below), but GMX 5.1.4 doesn't have the -bonded and -pme flags that GMX 2020.4 has.
So, any suggestions on how to keep the bonded and PME work on the CPU while offloading the short-range non-bonded interactions to the GPU?
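As far as I understand, in the 5.x series the split is fixed anyway: when a compatible GPU is found, the short-range non-bonded kernels go to the GPU while bonded interactions and PME always stay on the CPU, so there is nothing like -bonded/-pme to set. The remaining knobs are the rank/thread layout and, optionally, dedicated PME ranks. A sketch (untested on this machine; the 16/4/2 numbers are assumptions to tune, not a recipe):

```shell
# Keep 12 PP ranks (which share GPU 0 for short-range non-bonded work)
# and dedicate 4 ranks to PME only; 2 OpenMP threads per rank.
mpirun -np 16 gmx_mpi mdrun -deffnm nvt__5.1.4 -v -s nvt__5.1.4.tpr \
    -nb gpu \
    -npme 4 \
    -ntomp 2 \
    -pin on
```

Whether separate PME ranks help at all depends on the system size; the log's PME tuning output is the place to check.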

I will also note a shortcoming:
i. I can't install a gmx_mpi build of 2020.4, because the GPU's compute capability is only 3.0.

The run and parameters are as follows. Thank you all!

gmx_mpi --version
GROMACS version: VERSION 5.1.4
Precision: single
Memory model: 64 bit
MPI library: MPI
OpenMP support: enabled (GMX_OPENMP_MAX_THREADS = 32)
GPU support: enabled
OpenCL support: disabled
invsqrt routine: gmx_software_invsqrt(x)
SIMD instructions: AVX_256
FFT library: fftw-3.3.4-sse2-avx
RDTSCP usage: enabled
C++11 compilation: disabled
TNG support: enabled
Tracing support: disabled
Built on: Thu Feb 25 14:58:26 IST 2021
Built by: root@user-X9DAX [CMAKE]
Build OS/arch: Linux 5.4.0-66-generic x86_64
Build CPU vendor: GenuineIntel
Build CPU brand: Intel(R) Xeon(R) CPU E5-2687W v2 @ 3.40GHz
Build CPU family: 6 Model: 62 Stepping: 4
Build CPU features: aes apic avx clfsh cmov cx8 cx16 f16c htt lahf_lm mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd rdtscp sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
C compiler: /usr/bin/mpicc GNU 7.5.0
C compiler flags: -mavx -Wextra -Wno-missing-field-initializers -Wno-sign-compare -Wpointer-arith -Wall -Wno-unused -Wunused-value -Wunused-parameter -O3 -DNDEBUG -funroll-all-loops -fexcess-precision=fast -Wno-array-bounds
C++ compiler: /usr/bin/mpicxx GNU 7.5.0
C++ compiler flags: -mavx -Wextra -Wno-missing-field-initializers -Wpointer-arith -Wall -Wno-unused-function -O3 -DNDEBUG -funroll-all-loops -fexcess-precision=fast -Wno-array-bounds
Boost version: 1.55.0 (internal)
CUDA compiler: /usr/local/cuda/bin/nvcc nvcc: NVIDIA (R) Cuda compiler driver;Copyright (c) 2005-2016 NVIDIA Corporation;Built on Tue_Jan_10_13:22:03_CST_2017;Cuda compilation tools, release 8.0, V8.0.61
CUDA compiler flags:-gencode;arch=compute_30,code=sm_30;-gencode;arch=compute_35,code=sm_35;-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_52,code=sm_52;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_60,code=compute_60;-gencode;arch=compute_61,code=compute_61;-use_fast_math;-D_FORCE_INLINES; ;-mavx;-Wextra;-Wno-missing-field-initializers;-Wpointer-arith;-Wall;-Wno-unused-function;-O3;-DNDEBUG;-funroll-all-loops;-fexcess-precision=fast;-Wno-array-bounds;
CUDA driver: 11.20
CUDA runtime: 8.0

Simulation Run

mpirun -np 16 gmx_mpi mdrun -deffnm nvt__5.1.4 -v -s nvt__5.1.4.tpr -nb gpu


GROMACS: gmx mdrun, VERSION 5.1.4
Executable: /usr/local/gromacs/bin/gmx_mpi
Data prefix: /usr/local/gromacs
Command line:
gmx_mpi mdrun -deffnm nvt__5.1.4 -v -s nvt__5.1.4.tpr -nb gpu

Number of logical cores detected (32) does not match the number reported by OpenMP (16).
Consider setting the launch configuration manually!

Running on 1 node with total 16 cores, 32 logical cores, 1 compatible GPU
Hardware detected on host user-X9DAX (the node of MPI rank 0):
CPU info:
Vendor: GenuineIntel
Brand: Intel(R) Xeon(R) CPU E5-2687W v2 @ 3.40GHz
SIMD instructions most likely to fit this hardware: AVX_256
SIMD instructions selected at GROMACS compile time: AVX_256
GPU info:
Number of GPUs detected: 1
#0: NVIDIA Quadro K4000, compute cap.: 3.0, ECC: no, stat: compatible

Reading file nvt__5.1.4.tpr, VERSION 5.1.4 (single precision)
Changing nstlist from 10 to 40, rlist from 1 to 1.088

Using 16 MPI processes
Using 2 OpenMP threads per MPI process

On host user-X9DAX 1 compatible GPU is present, with ID 0
On host user-X9DAX 1 GPU auto-selected for this run.
Mapping of GPU ID to the 16 PP ranks in this node: 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0

Non-default thread affinity set probably by the OpenMP library,
disabling internal thread affinity
NOTE: DLB will not turn on during the first phase of PME tuning

starting mdrun ‘Protein in water’
500000 steps, 1000.0 ps.
step 80: timed with pme grid 96 96 16, coulomb cutoff 1.000: 1436.8 M-cycles
step 160: timed with pme grid 84 84 14, coulomb cutoff 1.116: 1777.9 M-cycles
step 240: timed with pme grid 96 96 14, coulomb cutoff 1.085: 1736.7 M-cycles
step 320: timed with pme grid 96 96 16, coulomb cutoff 1.000: 1439.5 M-cycles
optimal pme grid 96 96 16, coulomb cutoff 1.000

NOTE: DLB can now turn on, when beneficial

NOTE: Turning on dynamic load balancing

vol 0.94 imb F 12% step 400, will finish Mon Oct 11 09:17:15 2021
step 500, will finish Mon Oct 11 09:14:00 2021
vol 0.71 imb F 10% step 600, will finish Mon Oct 11 09:11:55 2021
step 700, will finish Mon Oct 11 09:10:27 2021
vol 0.54 imb F 9% step 800, will finish Mon Oct 11 09:09:19 2021
step 900, will finish Mon Oct 11 09:08:27 2021
vol 0.41 imb F 8% step 1000, will finish Mon Oct 11 09:07:46 2021
step 1100, will finish Mon Oct 11 09:07:00 2021
vol 0.33! imb F 6% step 1200, will finish Mon Oct 11 09:06:28 2021
step 1300, will finish Mon Oct 11 09:06:07 2021
vol 0.29! imb F 5% step 1400, will finish Mon Oct 11 09:05:51 2021
step 1500, will finish Mon Oct 11 09:05:36 2021
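The two notes in the log (32 logical vs. 16 reported cores, and internal thread affinity being disabled) suggest setting the launch configuration manually, as the warning itself advises. A hedged sketch, assuming hyper-threading is better left unused so each of the 16 physical cores runs one thread:

```shell
# 8 MPI ranks x 2 OpenMP threads = 16 threads on 16 physical cores;
# -pinstride 2 skips the hyper-threaded siblings, -pin on re-enables
# mdrun's own affinity handling.
export OMP_NUM_THREADS=2
mpirun -np 8 gmx_mpi mdrun -deffnm nvt__5.1.4 -v -s nvt__5.1.4.tpr \
    -nb gpu -ntomp 2 -pin on -pinstride 2
```

Fewer, fatter ranks may or may not beat 16 thin ones here; it is worth timing both layouts.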

Why do you want to use such an old version of GROMACS? The newest version will be much faster.

Just due to the old GPU (about 10 years old) in place… 2020.4 doesn't sit well with CUDA runtime 8.0 or with compute capability < 6.0, and compute capability 3.0 is all this GPU supports. So the demands of 2020.4 are too much for this GPU, but 5.1.4 can manage it. That's why.
As my university's HPC cluster has fewer than 100 nodes and they are mostly busy, running long simulations is tough. So I am trying to optimize my performance with the available resources.

But you could use version 2019 then, couldn't you?

You can use CUDA 9.2 (with gcc 7) even with GROMACS 2021.
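A build sketch for that suggestion (untested; the paths and compiler names are placeholders, and whether a compute-capability-3.0 card is still accepted by 2021 would need to be verified against the release notes):

```shell
# Configure GROMACS 2021 against CUDA 9.2 and gcc 7.
cd gromacs-2021 && mkdir build && cd build
cmake .. -DGMX_BUILD_OWN_FFTW=ON \
         -DGMX_GPU=CUDA \
         -DCUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda-9.2 \
         -DCMAKE_C_COMPILER=gcc-7 \
         -DCMAKE_CXX_COMPILER=g++-7
make -j 16 && sudo make install
```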

I will try it.

Yes, but CUDA versions > 9 (with gcc 5) weren't compatible…