NVIDIA Jetson AGX Xavier CPU performance 40%

GROMACS version:2020.4
GROMACS modification: No
Hi,
I need some help regarding GROMACS performance.
I’m using an NVIDIA Jetson AGX Xavier: 8-core ARM v8.2 64-bit CPU, 512-core Volta GPU, 32 GB 256-bit LPDDR4x RAM (137 GB/s).

I’m compiling GROMACS with the following cmake parameters:

cmake … -DGMX_GPU=on -DCUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda-10.0 -DGMX_GPU_DETECTION_DONE=on -DGMX_BUILD_OWN_FFTW=on -DGMX_MPI=on -DBUILD_SHARED_LIBS=off -DCMAKE_C_COMPILER=mpicc -DCMAKE_CXX_COMPILER=mpicxx

-DGMX_GPU_DETECTION_DONE is set to work around a GPU detection issue. The compilation works fine, and I’m able to run the Lysozyme in Water tutorial (http://www.mdtutorials.com/gmx/lysozyme/index.html) without issues. However, the tegrastats command shows that the CPU is at less than 40% utilization:

RAM 1926/31927MB (lfb 2729x4MB) SWAP 0/15963MB (cached 0MB) CPU [37%@2265,37%@2265,35%@2265,35%@2265,34%@2265,35%@2265,71%@2265,35%@2265] EMC_FREQ 0% GR3D_FREQ 82% AO@45.5C GPU@50C Tdiode@47C PMIC@100C AUX@42.5C CPU@47C thermal@46.25C Tboard@43C GPU 15994/16232 CPU 5073/4963 SOC 3996/3996 CV 0/0 VDDRQ 2459/2591 SYS5V 3234/3234
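
For reference, the per-core loads in that tegrastats line can be averaged with a short script. This is only a minimal sketch: the function name `mean_cpu_utilization` is hypothetical, and the regular expression assumes the `CPU [load%@freq,...]` layout shown in the line above (the sample string below is shortened to the relevant fields).

```python
import re


def mean_cpu_utilization(tegrastats_line: str) -> float:
    """Return the mean per-core load in percent from one tegrastats line."""
    # Match only the bracketed per-core field, not "CPU@47C" or "CPU 5073/4963".
    match = re.search(r"CPU \[([^\]]*)\]", tegrastats_line)
    if match is None:
        raise ValueError("no 'CPU [...]' field found")
    # Each entry looks like "37%@2265": load in percent, then frequency in MHz.
    loads = [int(load) for load in re.findall(r"(\d+)%@\d+", match.group(1))]
    if not loads:
        raise ValueError("no per-core loads found")
    return sum(loads) / len(loads)


line = (
    "RAM 1926/31927MB SWAP 0/15963MB "
    "CPU [37%@2265,37%@2265,35%@2265,35%@2265,34%@2265,35%@2265,71%@2265,35%@2265] "
    "EMC_FREQ 0% GR3D_FREQ 82% CPU@47C"
)
print(f"{mean_cpu_utilization(line):.1f}%")  # prints 39.9%
```
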

htop shows similar CPU utilization. I’m using nvpmodel 0, which sets the power mode to MAXN:

SOC family:tegra194 Machine:Jetson-AGX
Online CPUs: 0-7
CPU Cluster Switching: Disabled
cpu0: Online=1 Governor=schedutil MinFreq=1190400 MaxFreq=2265600 CurrentFreq=2265600 IdleStates: C1=0 c6=0
cpu1: Online=1 Governor=schedutil MinFreq=1190400 MaxFreq=2265600 CurrentFreq=2265600 IdleStates: C1=0 c6=0
cpu2: Online=1 Governor=schedutil MinFreq=1190400 MaxFreq=2265600 CurrentFreq=2265600 IdleStates: C1=0 c6=0
cpu3: Online=1 Governor=schedutil MinFreq=1190400 MaxFreq=2265600 CurrentFreq=2265600 IdleStates: C1=0 c6=0
cpu4: Online=1 Governor=schedutil MinFreq=1190400 MaxFreq=2265600 CurrentFreq=2265600 IdleStates: C1=0 c6=0
cpu5: Online=1 Governor=schedutil MinFreq=1190400 MaxFreq=2265600 CurrentFreq=2265600 IdleStates: C1=0 c6=0
cpu6: Online=1 Governor=schedutil MinFreq=1190400 MaxFreq=2265600 CurrentFreq=2265600 IdleStates: C1=0 c6=0
cpu7: Online=1 Governor=schedutil MinFreq=1190400 MaxFreq=2265600 CurrentFreq=2265600 IdleStates: C1=0 c6=0
GPU MinFreq=318750000 MaxFreq=1377000000 CurrentFreq=1377000000
EMC MinFreq=204000000 MaxFreq=2133000000 CurrentFreq=2133000000 FreqOverride=1
Fan: speed=255
NV Power Mode: MAXN

For the tutorial I get the following stats:

Running on 1 node with total 8 cores, 8 logical cores, 1 compatible GPU
Hardware detected:
CPU info:
Vendor: ARM
Brand: ARMv8 Processor rev 0 (v8l)
Family: 8 Model: 0 Stepping: 0
Features: neon neon_asimd
Hardware topology: Full, with devices
Sockets, cores, and logical processors:
Socket 0: [ 0] [ 1]
Socket 1: [ 2] [ 3]
Socket 2: [ 4] [ 5]
Socket 3: [ 6] [ 7]
Numa nodes:
Node 0 (33477713920 bytes mem): 0 1 2 3 4 5 6 7
Latency:
0
0 1.00
Caches:
L1: 65536 bytes, linesize 64 bytes, assoc. 4, shared 1 ways
L2: 2097152 bytes, linesize 64 bytes, assoc. 16, shared 2 ways
L3: 4194304 bytes, linesize 64 bytes, assoc. 16, shared 8 ways
PCI devices:
0001:01:00.0 Id: 1b4b:9171 Class: 0x0106 Numa: 0
GPU info:
Number of GPUs detected: 1
#0: NVIDIA Xavier, compute cap.: 7.2, ECC: no, stat: compatible

M E G A - F L O P S   A C C O U N T I N G

NB=Group-cutoff nonbonded kernels NxN=N-by-N cluster Verlet kernels
RF=Reaction-Field VdW=Van der Waals QSTab=quadratic-spline table
W3=SPC/TIP3p W4=TIP4p (single or pairs)
V&F=Potential and force V=Potential only F=Force only

Computing:                           M-Number          M-Flops  % Flops

Pair Search distance check       16235.718832       146121.469      0.0
NxN Ewald Elec. + LJ [F]      16555583.360448   1092668501.790     98.1
NxN Ewald Elec. + LJ [V&F]      167261.613184     17896992.611      1.6
1,4 nonbonded interactions        2553.005106       229770.460      0.0
Shift-X                            169.413876         1016.483      0.0
Bonds                              512.501025        30237.560      0.0
Angles                            1773.503547       297948.596      0.0
Propers                            213.000426        48777.098      0.0
RB-Dihedrals                      1975.003950       487825.976      0.0
Virial                            1696.083921        30529.511      0.0
Stop-CM                            169.413876         1694.139      0.0
Calc-Ekin                         3387.667752        91467.029      0.0
Lincs                              479.500959        28770.058      0.0
Lincs-Mat                         2388.004776         9552.019      0.0
Constraint-V                     16913.033826       135304.271      0.0
Constraint-Vir                    1643.382867        39441.189      0.0
Settle                            5318.010636      1717717.435      0.2

Total                                           1113861667.692    100.0

 R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G

On 1 MPI rank, each using 8 OpenMP threads

Computing:               Num     Num     Call  Wall time Giga-Cycles
                       Ranks Threads    Count        (s)   total sum      %

Neighbor search            1       8     5001     20.704       5.176    1.5
Launch GPU ops.            1       8   500001     99.104      24.775    7.1
Force                      1       8   500001     73.945      18.485    5.3
Wait PME GPU gather        1       8   500001    337.805      84.447   24.3
Reduce GPU PME F           1       8   500001     27.654       6.913    2.0
Wait GPU NB local                                337.060      84.260   24.3
NB X/F buffer ops.         1       8   995001     49.675      12.418    3.6
Write traj.                1       8      102      0.627       0.157    0.0
Update                     1       8   500001     30.765       7.691    2.2
Constraints                1       8   500001     71.627      17.906    5.2
Rest                                             338.411      84.598   24.4

Total                                           1387.377     346.826  100.0

           Core t (s)   Wall t (s)        (%)
   Time:    11099.009     1387.377      800.0
                 (ns/day)    (hour/ns)
Performance:       62.276        0.385
Finished mdrun on rank 0 Tue Dec 1 23:22:41 2020

Other programs do use 100% of the CPUs: I tested other multi-threaded software, including NVIDIA’s CUDA samples.

Any advice is welcome.

Thanks.

Hi pgalaviz,

It seems to me that you offloaded almost all of the work to the GPU, so the CPU has little left to do. The recent GROMACS paper goes into more depth on the topic:

https://aip.scitation.org/doi/full/10.1063/5.0018516
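
To put a rough number on this, the wall-time column of the cycle accounting table in your log can be summed over the two "Wait" rows. The sketch below is only illustrative: the helper name `wait_share` is hypothetical, and reading the "Wait" rows as time the CPU spends idle while the GPU works is a simplifying assumption on my part, not something the log states directly.

```python
# Wall-time column (seconds) copied from the cycle accounting table in the log.
wall_time_s = {
    "Neighbor search": 20.704,
    "Launch GPU ops.": 99.104,
    "Force": 73.945,
    "Wait PME GPU gather": 337.805,
    "Reduce GPU PME F": 27.654,
    "Wait GPU NB local": 337.060,
    "NB X/F buffer ops.": 49.675,
    "Write traj.": 0.627,
    "Update": 30.765,
    "Constraints": 71.627,
    "Rest": 338.411,
}
total_wall_s = 1387.377  # "Total" row of the same table


def wait_share(rows, total):
    """Return (seconds, percent of total) for rows whose name starts with 'Wait'."""
    waited = sum(t for name, t in rows.items() if name.startswith("Wait"))
    return waited, 100.0 * waited / total


waited_s, waited_pct = wait_share(wall_time_s, total_wall_s)
print(f"waiting on GPU: {waited_s:.3f} s ({waited_pct:.1f}% of wall time)")
```

With the numbers from your log this prints `waiting on GPU: 674.865 s (48.6% of wall time)`, so under that assumption roughly half of the run is spent waiting for GPU results.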
