GROMACS version: 5.1.4
GROMACS modification: Yes/No
I wanted to run a simulation on an old workstation that has an NVIDIA Quadro K4000 and GROMACS 5.1.4. I am trying to use the CPU (bonded and PME) and the GPU (short-range non-bonded) to get the most out of the workstation. GROMACS detects the GPU and automatically selects it for the mdrun (details below). GMX 5.1.4 does not have the -bonded and -pme flags that GMX 2020.4 has.
So, any suggestions on how to keep bonded and PME on the CPU while offloading the short-range non-bonded interactions to the GPU?
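As far as I can tell from the 5.1.4 documentation, this split may already be the default behaviour: when a compatible GPU is detected, mdrun offloads only the short-range non-bonded kernels to it, and bonded forces and PME stay on the CPU (GPU offload for PME and bonded work only arrived in much later releases). Please correct me if I have that wrong. For reference, this is the kind of explicit launch I have in mind, with -npme dedicating some ranks to PME on the CPU (the value 4 is just a guess for this machine):

mpirun -np 16 gmx_mpi mdrun -deffnm nvt__5.1.4 -v -s nvt__5.1.4.tpr -nb gpu -npme 4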
I will also mention the main constraint:
i. I cannot install a gmx_mpi build of 2020.4, as the compute capability of the GPU is only 3.0.
The run and its parameters are as follows. Thank you all!
gmx_mpi --version
GROMACS version: VERSION 5.1.4
Precision: single
Memory model: 64 bit
MPI library: MPI
OpenMP support: enabled (GMX_OPENMP_MAX_THREADS = 32)
GPU support: enabled
OpenCL support: disabled
invsqrt routine: gmx_software_invsqrt(x)
SIMD instructions: AVX_256
FFT library: fftw-3.3.4-sse2-avx
RDTSCP usage: enabled
C++11 compilation: disabled
TNG support: enabled
Tracing support: disabled
Built on: Thu Feb 25 14:58:26 IST 2021
Built by: root@user-X9DAX [CMAKE]
Build OS/arch: Linux 5.4.0-66-generic x86_64
Build CPU vendor: GenuineIntel
Build CPU brand: Intel(R) Xeon(R) CPU E5-2687W v2 @ 3.40GHz
Build CPU family: 6 Model: 62 Stepping: 4
Build CPU features: aes apic avx clfsh cmov cx8 cx16 f16c htt lahf_lm mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd rdtscp sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
C compiler: /usr/bin/mpicc GNU 7.5.0
C compiler flags: -mavx -Wextra -Wno-missing-field-initializers -Wno-sign-compare -Wpointer-arith -Wall -Wno-unused -Wunused-value -Wunused-parameter -O3 -DNDEBUG -funroll-all-loops -fexcess-precision=fast -Wno-array-bounds
C++ compiler: /usr/bin/mpicxx GNU 7.5.0
C++ compiler flags: -mavx -Wextra -Wno-missing-field-initializers -Wpointer-arith -Wall -Wno-unused-function -O3 -DNDEBUG -funroll-all-loops -fexcess-precision=fast -Wno-array-bounds
Boost version: 1.55.0 (internal)
CUDA compiler: /usr/local/cuda/bin/nvcc nvcc: NVIDIA (R) Cuda compiler driver;Copyright (c) 2005-2016 NVIDIA Corporation;Built on Tue_Jan_10_13:22:03_CST_2017;Cuda compilation tools, release 8.0, V8.0.61
CUDA compiler flags:-gencode;arch=compute_30,code=sm_30;-gencode;arch=compute_35,code=sm_35;-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_52,code=sm_52;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_60,code=compute_60;-gencode;arch=compute_61,code=compute_61;-use_fast_math;-D_FORCE_INLINES; ;-mavx;-Wextra;-Wno-missing-field-initializers;-Wpointer-arith;-Wall;-Wno-unused-function;-O3;-DNDEBUG;-funroll-all-loops;-fexcess-precision=fast;-Wno-array-bounds;
CUDA driver: 11.20
CUDA runtime: 8.0
Simulation Run
mpirun -np 16 gmx_mpi mdrun -deffnm nvt__5.1.4 -v -s nvt__5.1.4.tpr -nb gpu
GROMACS: gmx mdrun, VERSION 5.1.4
Executable: /usr/local/gromacs/bin/gmx_mpi
Data prefix: /usr/local/gromacs
Command line:
gmx_mpi mdrun -deffnm nvt__5.1.4 -v -s nvt__5.1.4.tpr -nb gpu
Number of logical cores detected (32) does not match the number reported by OpenMP (16).
Consider setting the launch configuration manually!
Running on 1 node with total 16 cores, 32 logical cores, 1 compatible GPU
Hardware detected on host user-X9DAX (the node of MPI rank 0):
CPU info:
Vendor: GenuineIntel
Brand: Intel(R) Xeon(R) CPU E5-2687W v2 @ 3.40GHz
SIMD instructions most likely to fit this hardware: AVX_256
SIMD instructions selected at GROMACS compile time: AVX_256
GPU info:
Number of GPUs detected: 1
#0: NVIDIA Quadro K4000, compute cap.: 3.0, ECC: no, stat: compatible
Reading file nvt__5.1.4.tpr, VERSION 5.1.4 (single precision)
Changing nstlist from 10 to 40, rlist from 1 to 1.088
Using 16 MPI processes
Using 2 OpenMP threads per MPI process
On host user-X9DAX 1 compatible GPU is present, with ID 0
On host user-X9DAX 1 GPU auto-selected for this run.
Mapping of GPU ID to the 16 PP ranks in this node: 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Non-default thread affinity set probably by the OpenMP library,
disabling internal thread affinity
NOTE: DLB will not turn on during the first phase of PME tuning
starting mdrun 'Protein in water'
500000 steps, 1000.0 ps.
step 80: timed with pme grid 96 96 16, coulomb cutoff 1.000: 1436.8 M-cycles
step 160: timed with pme grid 84 84 14, coulomb cutoff 1.116: 1777.9 M-cycles
step 240: timed with pme grid 96 96 14, coulomb cutoff 1.085: 1736.7 M-cycles
step 320: timed with pme grid 96 96 16, coulomb cutoff 1.000: 1439.5 M-cycles
optimal pme grid 96 96 16, coulomb cutoff 1.000
NOTE: DLB can now turn on, when beneficial
NOTE: Turning on dynamic load balancing
vol 0.94 imb F 12% step 400, will finish Mon Oct 11 09:17:15 2021
step 500, will finish Mon Oct 11 09:14:00 2021
vol 0.71 imb F 10% step 600, will finish Mon Oct 11 09:11:55 2021
step 700, will finish Mon Oct 11 09:10:27 2021
vol 0.54 imb F 9% step 800, will finish Mon Oct 11 09:09:19 2021
step 900, will finish Mon Oct 11 09:08:27 2021
vol 0.41 imb F 8% step 1000, will finish Mon Oct 11 09:07:46 2021
step 1100, will finish Mon Oct 11 09:07:00 2021
vol 0.33! imb F 6% step 1200, will finish Mon Oct 11 09:06:28 2021
step 1300, will finish Mon Oct 11 09:06:07 2021
vol 0.29! imb F 5% step 1400, will finish Mon Oct 11 09:05:51 2021
step 1500, will finish Mon Oct 11 09:05:36 2021
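Since the log says to consider setting the launch configuration manually, I am also thinking of fixing the rank/thread layout and pinning explicitly, something like the following (2 OpenMP threads per rank to cover the 32 logical cores; -pin on to re-enable GROMACS' internal affinity handling, assuming the MPI library's own binding is switched off):

mpirun -np 16 gmx_mpi mdrun -deffnm nvt__5.1.4 -v -s nvt__5.1.4.tpr -nb gpu -ntomp 2 -pin on

Would that be the right way to address the affinity note, or is it better to leave pinning to the MPI library?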