GROMACS version: 2020.4
GROMACS modification: No
Hi,
I need some help regarding GROMACS performance.
I’m using an NVIDIA Jetson AGX Xavier: 8-core ARM v8.2 64-bit CPU, 512-core Volta GPU, and 32 GB of 256-bit LPDDR4x RAM (137 GB/s).
I’m compiling GROMACS with the following cmake parameters:
cmake … -DGMX_GPU=on -DCUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda-10.0 -DGMX_GPU_DETECTION_DONE=on -DGMX_BUILD_OWN_FFTW=on -DGMX_MPI=on -DBUILD_SHARED_LIBS=off -DCMAKE_C_COMPILER=mpicc -DCMAKE_CXX_COMPILER=mpicxx
-DGMX_GPU_DETECTION_DONE=on is set to work around a GPU detection issue. The compilation works fine, and I’m able to run the Lysozyme in Water tutorial http://www.mdtutorials.com/gmx/lysozyme/index.html without issues. However, during the run the tegrastats command shows that all but one of the CPU cores are below 40% utilization:
RAM 1926/31927MB (lfb 2729x4MB) SWAP 0/15963MB (cached 0MB) CPU [37%@2265,37%@2265,35%@2265,35%@2265,34%@2265,35%@2265,71%@2265,35%@2265] EMC_FREQ 0% GR3D_FREQ 82% AO@45.5C GPU@50C Tdiode@47C PMIC@100C AUX@42.5C CPU@47C thermal@46.25C Tboard@43C GPU 15994/16232 CPU 5073/4963 SOC 3996/3996 CV 0/0 VDDRQ 2459/2591 SYS5V 3234/3234
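To put numbers on that sample, here is a minimal Python sketch that pulls the per-core loads out of a tegrastats line. It assumes the `CPU [load%@MHz,...]` field layout seen in the line above; the function name `parse_cpu_loads` is just an illustrative helper, not part of any NVIDIA tool.

```python
import re

# One tegrastats sample, copied from the output above (truncated after GR3D_FREQ).
SAMPLE = (
    "RAM 1926/31927MB (lfb 2729x4MB) SWAP 0/15963MB (cached 0MB) "
    "CPU [37%@2265,37%@2265,35%@2265,35%@2265,34%@2265,35%@2265,71%@2265,35%@2265] "
    "EMC_FREQ 0% GR3D_FREQ 82%"
)


def parse_cpu_loads(line):
    """Return the per-core load percentages found in the CPU [...] field."""
    field = re.search(r"CPU \[([^\]]*)\]", line)
    if field is None:
        raise ValueError("no CPU field found in tegrastats line")
    loads = []
    for entry in field.group(1).split(","):
        # Each entry is expected to look like "37%@2265"; anything else is skipped.
        match = re.fullmatch(r"(\d+)%@(\d+)", entry.strip())
        if match:
            loads.append(int(match.group(1)))
    return loads


loads = parse_cpu_loads(SAMPLE)
mean_load = sum(loads) / len(loads)
print(f"cores: {len(loads)}, mean load: {mean_load:.1f}%, max load: {max(loads)}%")
```

For this sample the mean over the 8 cores is just under 40%, with a single core at 71%.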
htop shows similar CPU utilization. I’m using nvpmodel 0, which sets the power mode to MAXN:
SOC family:tegra194 Machine:Jetson-AGX
Online CPUs: 0-7
CPU Cluster Switching: Disabled
cpu0: Online=1 Governor=schedutil MinFreq=1190400 MaxFreq=2265600 CurrentFreq=2265600 IdleStates: C1=0 c6=0
cpu1: Online=1 Governor=schedutil MinFreq=1190400 MaxFreq=2265600 CurrentFreq=2265600 IdleStates: C1=0 c6=0
cpu2: Online=1 Governor=schedutil MinFreq=1190400 MaxFreq=2265600 CurrentFreq=2265600 IdleStates: C1=0 c6=0
cpu3: Online=1 Governor=schedutil MinFreq=1190400 MaxFreq=2265600 CurrentFreq=2265600 IdleStates: C1=0 c6=0
cpu4: Online=1 Governor=schedutil MinFreq=1190400 MaxFreq=2265600 CurrentFreq=2265600 IdleStates: C1=0 c6=0
cpu5: Online=1 Governor=schedutil MinFreq=1190400 MaxFreq=2265600 CurrentFreq=2265600 IdleStates: C1=0 c6=0
cpu6: Online=1 Governor=schedutil MinFreq=1190400 MaxFreq=2265600 CurrentFreq=2265600 IdleStates: C1=0 c6=0
cpu7: Online=1 Governor=schedutil MinFreq=1190400 MaxFreq=2265600 CurrentFreq=2265600 IdleStates: C1=0 c6=0
GPU MinFreq=318750000 MaxFreq=1377000000 CurrentFreq=1377000000
EMC MinFreq=204000000 MaxFreq=2133000000 CurrentFreq=2133000000 FreqOverride=1
Fan: speed=255
NV Power Mode: MAXN
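As a quick sanity check on the listing above, the following Python sketch confirms that every listed clock is running at its maximum frequency. It assumes the `Key=Value` layout shown above; only two of the eight identical CPU lines are reproduced to keep it short, and the helper names are hypothetical.

```python
import re

# Lines copied from the clock listing above (cpu1..cpu6 are identical to cpu0 and cpu7).
STATUS_LINES = [
    "cpu0: Online=1 Governor=schedutil MinFreq=1190400 MaxFreq=2265600 CurrentFreq=2265600 IdleStates: C1=0 c6=0",
    "cpu7: Online=1 Governor=schedutil MinFreq=1190400 MaxFreq=2265600 CurrentFreq=2265600 IdleStates: C1=0 c6=0",
    "GPU MinFreq=318750000 MaxFreq=1377000000 CurrentFreq=1377000000",
    "EMC MinFreq=204000000 MaxFreq=2133000000 CurrentFreq=2133000000 FreqOverride=1",
]


def parse_fields(line):
    """Collect all Key=Value pairs on a line into a dict of strings."""
    return dict(re.findall(r"(\w+)=(\w+)", line))


def is_at_max(line):
    """True when CurrentFreq equals MaxFreq on this line."""
    fields = parse_fields(line)
    return int(fields["CurrentFreq"]) == int(fields["MaxFreq"])


all_at_max = all(is_at_max(line) for line in STATUS_LINES)
print("all clocks at maximum:", all_at_max)
```

So the cores are not being throttled to a lower frequency in this listing; the low utilization is not a clock-speed issue as far as this output shows.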
For the tutorial I get the following stats:
…
Running on 1 node with total 8 cores, 8 logical cores, 1 compatible GPU
Hardware detected:
CPU info:
Vendor: ARM
Brand: ARMv8 Processor rev 0 (v8l)
Family: 8 Model: 0 Stepping: 0
Features: neon neon_asimd
Hardware topology: Full, with devices
Sockets, cores, and logical processors:
Socket 0: [ 0] [ 1]
Socket 1: [ 2] [ 3]
Socket 2: [ 4] [ 5]
Socket 3: [ 6] [ 7]
Numa nodes:
Node 0 (33477713920 bytes mem): 0 1 2 3 4 5 6 7
Latency:
0
0 1.00
Caches:
L1: 65536 bytes, linesize 64 bytes, assoc. 4, shared 1 ways
L2: 2097152 bytes, linesize 64 bytes, assoc. 16, shared 2 ways
L3: 4194304 bytes, linesize 64 bytes, assoc. 16, shared 8 ways
PCI devices:
0001:01:00.0 Id: 1b4b:9171 Class: 0x0106 Numa: 0
GPU info:
Number of GPUs detected: 1
#0: NVIDIA Xavier, compute cap.: 7.2, ECC: no, stat: compatible
…
M E G A - F L O P S A C C O U N T I N G
NB=Group-cutoff nonbonded kernels NxN=N-by-N cluster Verlet kernels
RF=Reaction-Field VdW=Van der Waals QSTab=quadratic-spline table
W3=SPC/TIP3p W4=TIP4p (single or pairs)
V&F=Potential and force V=Potential only F=Force only
Computing:                          M-Number          M-Flops  % Flops
Pair Search distance check      16235.718832       146121.469      0.0
NxN Ewald Elec. + LJ [F]     16555583.360448   1092668501.790     98.1
NxN Ewald Elec. + LJ [V&F]     167261.613184     17896992.611      1.6
1,4 nonbonded interactions       2553.005106       229770.460      0.0
Shift-X                           169.413876         1016.483      0.0
Bonds                             512.501025        30237.560      0.0
Angles                           1773.503547       297948.596      0.0
Propers                           213.000426        48777.098      0.0
RB-Dihedrals                     1975.003950       487825.976      0.0
Virial                           1696.083921        30529.511      0.0
Stop-CM                           169.413876         1694.139      0.0
Calc-Ekin                        3387.667752        91467.029      0.0
Lincs                             479.500959        28770.058      0.0
Lincs-Mat                        2388.004776         9552.019      0.0
Constraint-V                    16913.033826       135304.271      0.0
Constraint-Vir                   1643.382867        39441.189      0.0
Settle                           5318.010636      1717717.435      0.2
Total                                          1113861667.692    100.0
R E A L C Y C L E A N D T I M E A C C O U N T I N G
On 1 MPI rank, each using 8 OpenMP threads
Computing:              Num     Num     Call  Wall time  Giga-Cycles
                      Ranks Threads    Count        (s)    total sum      %
Neighbor search           1       8     5001     20.704        5.176    1.5
Launch GPU ops.           1       8   500001     99.104       24.775    7.1
Force                     1       8   500001     73.945       18.485    5.3
Wait PME GPU gather       1       8   500001    337.805       84.447   24.3
Reduce GPU PME F          1       8   500001     27.654        6.913    2.0
Wait GPU NB local                               337.060       84.260   24.3
NB X/F buffer ops.        1       8   995001     49.675       12.418    3.6
Write traj.               1       8      102      0.627        0.157    0.0
Update                    1       8   500001     30.765        7.691    2.2
Constraints               1       8   500001     71.627       17.906    5.2
Rest                                            338.411       84.598   24.4
Total                                          1387.377      346.826  100.0

               Core t (s)   Wall t (s)          (%)
       Time:    11099.009     1387.377        800.0
                 (ns/day)    (hour/ns)
Performance:       62.276        0.385
Finished mdrun on rank 0 Tue Dec 1 23:22:41 2020
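To make the timing table easier to read, here is a small Python sketch that recomputes a few figures from it: the share of wall time in the two "Wait ... GPU" rows, the core-to-wall ratio, and the hour/ns value. The numbers are copied from the log above; treating the two wait rows as time the CPU threads spend waiting for GPU results is my reading of the row names, so take that interpretation as an assumption.

```python
# Wall times in seconds, copied from the cycle and time accounting table above.
WALL_TIME_S = {
    "Neighbor search": 20.704,
    "Launch GPU ops.": 99.104,
    "Force": 73.945,
    "Wait PME GPU gather": 337.805,
    "Reduce GPU PME F": 27.654,
    "Wait GPU NB local": 337.060,
    "NB X/F buffer ops.": 49.675,
    "Write traj.": 0.627,
    "Update": 30.765,
    "Constraints": 71.627,
    "Rest": 338.411,
}
TOTAL_WALL_S = 1387.377   # "Total" row
CORE_TIME_S = 11099.009   # "Core t (s)"
NS_PER_DAY = 62.276       # "Performance" row

# Rows assumed to be CPU-side waiting on the GPU (interpretation, see lead-in).
GPU_WAIT_ROWS = ("Wait PME GPU gather", "Wait GPU NB local")


def share_of_wall(rows):
    """Percentage of total wall time spent in the given rows."""
    return 100.0 * sum(WALL_TIME_S[row] for row in rows) / TOTAL_WALL_S


rows_sum_s = sum(WALL_TIME_S.values())
gpu_wait_pct = share_of_wall(GPU_WAIT_ROWS)
core_to_wall_pct = 100.0 * CORE_TIME_S / TOTAL_WALL_S
hours_per_ns = 24.0 / NS_PER_DAY

print(f"sum of rows:      {rows_sum_s:.3f} s (table total {TOTAL_WALL_S} s)")
print(f"GPU wait share:   {gpu_wait_pct:.1f}% of wall time")
print(f"core t / wall t:  {core_to_wall_pct:.1f}%")
print(f"hours per ns:     {hours_per_ns:.3f}")
```

The two wait rows together account for roughly 48.6% of the wall time. As far as I understand, the 800% figure is core time over wall time for 8 threads, so by itself it does not show that all 8 cores were fully busy.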
Other programs are able to use 100% of the CPUs on this board: I tested other multithreaded software, including NVIDIA’s CUDA samples.
Any advice is welcome.
Thanks.