GROMACS version: 2023.1
GROMACS modification: Yes/No
Hi,
I’ve just set up GMX on an auto-scaling HPC cluster on Google Cloud. I launch my jobs via Slurm from the login node, and compute nodes (n1-standard-16, 8 CPU cores and 1 Tesla T4 GPU) are spun up to run the simulations. Each compute node runs a containerised version of GMX, and the container is started with the shared filesystem from the login node mounted onto the compute node, which is where the simulation data is written. My intuition is that this could be causing the problem, but it’s strange, because the degradation occurs only rarely (of the 20 jobs I submitted, only 1 ran incredibly slowly) and all the other jobs use the same containerised setup and shared-filesystem mount point. Furthermore, if I resubmit the job with the same input data, no performance loss is observed.
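For context, the job submission is roughly equivalent to the sketch below; this is illustrative rather than the exact script, and the container runtime, image name, mount paths and mdrun flags are placeholders:

#!/bin/bash
#SBATCH --job-name=fep-md
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --gres=gpu:1

# Run the containerised GROMACS build, bind-mounting the shared filesystem
# exported from the login node (paths and image name are placeholders)
apptainer exec --nv \
  --bind /shared:/shared \
  /shared/images/gromacs-2023.1.sif \
  gmx mdrun -deffnm prod -ntmpi 1 -ntomp 8 -nb gpu -pme gpu

All input and output lives under the bind-mounted shared path, so every job reads and writes through the same mount.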
I’ve noticed that occasionally a simulation will be prohibitively slow: energy minimisation and equilibration take forever. As detailed in the title, performance is around 2 ns/day, versus ~300 ns/day on a normal node for the same system (same protein, different ligand). I am running in free-energy mode.
Nothing in the logs obviously explains this. The hardware is detected correctly, but the accounting is slightly different (maybe within normal variation, I don’t know). Any suggestions as to what could be going on would be seriously appreciated!
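In case it helps to narrow things down, this is the sort of probe I could prepend to the job script to capture node state the next time a slow run appears (again just a sketch; the log path is a placeholder):

# Log GPU clocks, temperature and any active throttle reasons on the compute node
nvidia-smi --query-gpu=name,clocks.sm,clocks.mem,temperature.gpu,clocks_throttle_reasons.active \
  --format=csv >> "$SLURM_SUBMIT_DIR/node-check-$SLURM_JOB_ID.log"

# Crude write probe against the shared-filesystem mount to spot I/O stalls
dd if=/dev/zero of="$SLURM_SUBMIT_DIR/ddtest-$SLURM_JOB_ID" bs=1M count=256 oflag=direct \
  2>> "$SLURM_SUBMIT_DIR/node-check-$SLURM_JOB_ID.log"
rm -f "$SLURM_SUBMIT_DIR/ddtest-$SLURM_JOB_ID"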
GROMACS version: 2023.1
Precision: mixed
Memory model: 64 bit
MPI library: thread_mpi
OpenMP support: enabled (GMX_OPENMP_MAX_THREADS = 128)
GPU support: CUDA
NB cluster size: 8
SIMD instructions: AVX2_256
CPU FFT library: fftw-3.3.8-sse2-avx-avx2-avx2_128
GPU FFT library: cuFFT
Multi-GPU FFT: none
RDTSCP usage: enabled
TNG support: enabled
Hwloc support: disabled
Tracing support: disabled
C compiler: /usr/bin/cc GNU 9.4.0
C compiler flags: -fexcess-precision=fast -funroll-all-loops -mavx2 -mfma -Wno-missing-field-initializers -O3 -DNDEBUG
C++ compiler: /usr/bin/c++ GNU 9.4.0
C++ compiler flags: -fexcess-precision=fast -funroll-all-loops -mavx2 -mfma -Wno-missing-field-initializers -Wno-cast-function-type-strict -fopenmp -O3 -DNDEBUG
BLAS library:
LAPACK library:
CUDA compiler: /usr/local/cuda/bin/nvcc nvcc: NVIDIA (R) Cuda compiler driver;Copyright (c) 2005-2021 NVIDIA Corporation;Built on Mon_May__3_19:15:13_PDT_2021;Cuda compilation tools, release 11.3, V11.3.109;Build cuda_11.3.r11.3/compiler.29920130_0
CUDA compiler flags:-std=c++17;--generate-code=arch=compute_35,code=sm_35;--generate-code=arch=compute_37,code=sm_37;--generate-code=arch=compute_50,code=sm_50;--generate-code=arch=compute_52,code=sm_52;--generate-code=arch=compute_60,code=sm_60;--generate-code=arch=compute_61,code=sm_61;--generate-code=arch=compute_70,code=sm_70;--generate-code=arch=compute_75,code=sm_75;--generate-code=arch=compute_80,code=sm_80;--generate-code=arch=compute_86,code=sm_86;-Wno-deprecated-gpu-targets;--generate-code=arch=compute_53,code=sm_53;--generate-code=arch=compute_80,code=sm_80;-use_fast_math;-Xptxas;-warn-double-usage;-Xptxas;-Werror;-D_FORCE_INLINES;-fexcess-precision=fast -funroll-all-loops -mavx2 -mfma -Wno-missing-field-initializers -Wno-cast-function-type-strict -fopenmp -O3 -DNDEBUG
CUDA driver: 12.10
CUDA runtime: 11.30
Running on 1 node with total 8 cores, 8 processing units, 1 compatible GPU
Hardware detected on host a3236e3ce0fd:
CPU info:
Vendor: Intel
Brand: Intel(R) Xeon(R) CPU @ 2.30GHz
Family: 6 Model: 63 Stepping: 0
Features: aes apic avx avx2 clfsh cmov cx8 cx16 f16c fma htt intel lahf mmx msr nonstop_tsc pcid pclmuldq pdpe1gb popcnt pse rdrnd rdtscp sse2 sse3 sse4.1 sse4.2 ssse3 x2apic
Hardware topology: Basic
Packages, cores, and logical processors:
[indices refer to OS logical processors]
Package 0: [ 0] [ 1] [ 2] [ 3] [ 4] [ 5] [ 6] [ 7]
CPU limit set by OS: -1 Recommended max number of threads: 8
GPU info:
Number of GPUs detected: 1
#0: NVIDIA Tesla T4, compute cap.: 7.5, ECC: yes, stat: compatible
This is the accounting for the slow run:
M E G A - F L O P S A C C O U N T I N G
NB=Group-cutoff nonbonded kernels NxN=N-by-N cluster Verlet kernels
RF=Reaction-Field VdW=Van der Waals QSTab=quadratic-spline table
W3=SPC/TIP3p W4=TIP4p (single or pairs)
V&F=Potential and force V=Potential only F=Force only
Computing: M-Number M-Flops % Flops
-----------------------------------------------------------------------------
NB Free energy kernel 2548581.570540 2548581.571 2.3
Pair Search distance check 1566.270752 14096.437 0.0
NxN Ewald Elec. + LJ [F] 1127160.516736 104825928.056 96.1
NxN Ewald Elec. + LJ [V&F] 11396.798144 1447393.364 1.3
1,4 nonbonded interactions 37.613695 3385.233 0.0
Shift-X 11.428758 68.573 0.0
Bonds 9.880095 582.926 0.0
Angles 29.466950 4950.448 0.0
Propers 85.280820 19529.308 0.0
Impropers 10.746770 2235.328 0.0
Virial 9.228465 166.112 0.0
Update 914.168790 28339.232 0.0
Stop-CM 9.145116 91.451 0.0
Calc-Ekin 18.300780 494.121 0.0
Lincs 13.520130 811.208 0.0
Lincs-Mat 99.840960 399.364 0.0
Constraint-V 1821.057510 16389.518 0.0
Constraint-Vir 9.046290 217.111 0.0
Settle 598.005750 221262.127 0.2
-----------------------------------------------------------------------------
Total 109134921.487 100.0
-----------------------------------------------------------------------------
R E A L C Y C L E A N D T I M E A C C O U N T I N G
On 1 MPI rank, each using 8 OpenMP threads
Activity: Num Num Call Wall time Giga-Cycles
Ranks Threads Count (s) total sum %
--------------------------------------------------------------------------------
Neighbor search 1 8 2167 120.027 2208.273 0.6
Launch PP GPU ops. 1 8 173335 17.109 314.779 0.1
Force 1 8 173335 7635.319 140476.046 35.7
PME GPU mesh 1 8 173335 1399.060 25740.177 6.5
Wait GPU NB local 0.001 0.011 0.0
NB X/F buffer ops. 1 8 344503 2686.251 49422.162 12.6
Write traj. 1 8 32 2.167 39.869 0.0
Update 1 8 346670 4030.959 74162.351 18.9
Constraints 1 8 346670 5332.954 98116.690 25.0
Rest 143.884 2647.199 0.7
--------------------------------------------------------------------------------
Total 21367.731 393127.557 100.0
--------------------------------------------------------------------------------
Breakdown of PME mesh activities
--------------------------------------------------------------------------------
Wait PME GPU gather 1 8 173335 1.728 31.785 0.0
Reduce GPU PME F 1 8 173335 1375.799 25312.214 6.4
Launch PME GPU ops. 1 8 1906690 20.562 378.300 0.1
--------------------------------------------------------------------------------
Core t (s) Wall t (s) (%)
Time: 170941.809 21367.731 800.0
5h56:07
(ns/day) (hour/ns)
Performance: 2.103 11.414
For comparison’s sake, here is the accounting for the normal run:
M E G A - F L O P S A C C O U N T I N G
NB=Group-cutoff nonbonded kernels NxN=N-by-N cluster Verlet kernels
RF=Reaction-Field VdW=Van der Waals QSTab=quadratic-spline table
W3=SPC/TIP3p W4=TIP4p (single or pairs)
V&F=Potential and force V=Potential only F=Force only
Computing: M-Number M-Flops % Flops
-----------------------------------------------------------------------------
NB Free energy kernel 2564581.139310 2564581.139 2.4
Pair Search distance check 1597.526320 14377.737 0.0
NxN Ewald Elec. + LJ [F] 1125310.483712 104653874.985 96.0
NxN Ewald Elec. + LJ [V&F] 11379.764928 1445230.146 1.3
1,4 nonbonded interactions 37.613695 3385.233 0.0
Shift-X 11.428758 68.573 0.0
Bonds 9.880095 582.926 0.0
Angles 29.466950 4950.448 0.0
Propers 85.280820 19529.308 0.0
Impropers 10.746770 2235.328 0.0
Virial 9.228465 166.112 0.0
Update 914.168790 28339.232 0.0
Stop-CM 9.145116 91.451 0.0
Calc-Ekin 18.300780 494.121 0.0
Lincs 13.520130 811.208 0.0
Lincs-Mat 99.840960 399.364 0.0
Constraint-V 1821.057510 16389.518 0.0
Constraint-Vir 9.046290 217.111 0.0
Settle 598.005750 221262.127 0.2
-----------------------------------------------------------------------------
Total 108976986.066 100.0
-----------------------------------------------------------------------------
R E A L C Y C L E A N D T I M E A C C O U N T I N G
On 1 MPI rank, each using 8 OpenMP threads
Activity: Num Num Call Wall time Giga-Cycles
Ranks Threads Count (s) total sum %
--------------------------------------------------------------------------------
Neighbor search 1 8 2167 7.324 134.742 5.1
Launch PP GPU ops. 1 8 173335 6.806 125.219 4.7
Force 1 8 173335 85.362 1570.508 59.5
PME GPU mesh 1 8 173335 13.067 240.403 9.1
NB X/F buffer ops. 1 8 344503 3.904 71.826 2.7
Write traj. 1 8 9 0.034 0.631 0.0
Update 1 8 346670 10.627 195.516 7.4
Constraints 1 8 346670 13.288 244.474 9.3
Rest 3.124 57.479 2.2
--------------------------------------------------------------------------------
Total 143.535 2640.797 100.0
--------------------------------------------------------------------------------
Breakdown of PME mesh activities
--------------------------------------------------------------------------------
Wait PME GPU gather 1 8 173335 0.385 7.090 0.3
Reduce GPU PME F 1 8 173335 1.815 33.390 1.3
Launch PME GPU ops. 1 8 1906690 10.524 193.627 7.3
--------------------------------------------------------------------------------
Core t (s) Wall t (s) (%)
Time: 1148.251 143.535 800.0
(ns/day) (hour/ns)
Performance: 313.013 0.077
Finished mdrun on rank 0 Mon Aug 7 15:11:04 2023