mdrun randomly performs incredibly slowly (2 ns/day) when it usually achieves ~300 ns/day

GROMACS version: 2023.1
GROMACS modification: Yes/No

Hi,

I’ve just set up GMX on an auto-scaling HPC cluster on Google Cloud. I launch my jobs via Slurm on the login node, and compute nodes (n1-standard-16, 8 CPU cores and 1 Tesla T4 GPU) are spun up to run the simulations. Each compute node runs a containerised version of GMX, and the container is run with the login node’s shared file system mounted into it; this is where the simulation data is saved. My intuition is that this could be causing the problem; however, it’s strange, because the performance degradation only occurs rarely (only 1 of the 20 jobs I submitted ran incredibly slowly), and all jobs use the same containerised setup and the same shared-filesystem mount point. Furthermore, if I re-submit the job with the same input data, no performance loss is observed.
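
Roughly speaking, each job launches the simulation like this (a sketch only; the image name and paths are placeholders, not my actual setup):

docker run --rm --gpus all \
    -v /shared:/shared \        # login-node shared file system bind-mounted into the container
    gromacs_image \
    gmx mdrun -deffnm /shared/jobs/<job>/run -ntmpi 1 -ntomp 8 -pin on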

I’ve noticed that occasionally a simulation will be prohibitively slow: energy minimisation and equilibration take forever. As detailed in the title, performance is around 2 ns/day, versus ~300 ns/day on a normal node for the same system (same protein, different ligand). I am running in free-energy mode.
Nothing in the logs obviously appears to be causing this. The hardware is detected correctly, but the accounting is slightly different (maybe within the normal range of variation, I don’t know). Any suggestions as to what could be going on would be seriously appreciated!

GROMACS version:    2023.1
Precision:          mixed
Memory model:       64 bit
MPI library:        thread_mpi
OpenMP support:     enabled (GMX_OPENMP_MAX_THREADS = 128)
GPU support:        CUDA
NB cluster size:    8
SIMD instructions:  AVX2_256
CPU FFT library:    fftw-3.3.8-sse2-avx-avx2-avx2_128
GPU FFT library:    cuFFT
Multi-GPU FFT:      none
RDTSCP usage:       enabled
TNG support:        enabled
Hwloc support:      disabled
Tracing support:    disabled
C compiler:         /usr/bin/cc GNU 9.4.0
C compiler flags:   -fexcess-precision=fast -funroll-all-loops -mavx2 -mfma -Wno-missing-field-initializers -O3 -DNDEBUG
C++ compiler:       /usr/bin/c++ GNU 9.4.0
C++ compiler flags: -fexcess-precision=fast -funroll-all-loops -mavx2 -mfma -Wno-missing-field-initializers -Wno-cast-function-type-strict -fopenmp -O3 -DNDEBUG
BLAS library:
LAPACK library:
CUDA compiler:      /usr/local/cuda/bin/nvcc nvcc: NVIDIA (R) Cuda compiler driver;Copyright (c) 2005-2021 NVIDIA Corporation;Built on Mon_May__3_19:15:13_PDT_2021;Cuda compilation tools, release 11.3, V11.3.109;Build cuda_11.3.r11.3/compiler.29920130_0
CUDA compiler flags:-std=c++17;--generate-code=arch=compute_35,code=sm_35;--generate-code=arch=compute_37,code=sm_37;--generate-code=arch=compute_50,code=sm_50;--generate-code=arch=compute_52,code=sm_52;--generate-code=arch=compute_60,code=sm_60;--generate-code=arch=compute_61,code=sm_61;--generate-code=arch=compute_70,code=sm_70;--generate-code=arch=compute_75,code=sm_75;--generate-code=arch=compute_80,code=sm_80;--generate-code=arch=compute_86,code=sm_86;-Wno-deprecated-gpu-targets;--generate-code=arch=compute_53,code=sm_53;--generate-code=arch=compute_80,code=sm_80;-use_fast_math;-Xptxas;-warn-double-usage;-Xptxas;-Werror;-D_FORCE_INLINES;-fexcess-precision=fast -funroll-all-loops -mavx2 -mfma -Wno-missing-field-initializers -Wno-cast-function-type-strict -fopenmp -O3 -DNDEBUG
CUDA driver:        12.10
CUDA runtime:       11.30


Running on 1 node with total 8 cores, 8 processing units, 1 compatible GPU
Hardware detected on host a3236e3ce0fd:
  CPU info:
    Vendor: Intel
    Brand:  Intel(R) Xeon(R) CPU @ 2.30GHz
    Family: 6   Model: 63   Stepping: 0
    Features: aes apic avx avx2 clfsh cmov cx8 cx16 f16c fma htt intel lahf mmx msr nonstop_tsc pcid pclmuldq pdpe1gb popcnt pse rdrnd rdtscp sse2 sse3 sse4.1 sse4.2 ssse3 x2apic
  Hardware topology: Basic
    Packages, cores, and logical processors:
    [indices refer to OS logical processors]
      Package  0: [   0] [   1] [   2] [   3] [   4] [   5] [   6] [   7]
    CPU limit set by OS: -1   Recommended max number of threads: 8
  GPU info:
    Number of GPUs detected: 1
    #0: NVIDIA Tesla T4, compute cap.: 7.5, ECC: yes, stat: compatible

This is the accounting for the slow run:

        M E G A - F L O P S   A C C O U N T I N G

 NB=Group-cutoff nonbonded kernels    NxN=N-by-N cluster Verlet kernels
 RF=Reaction-Field  VdW=Van der Waals  QSTab=quadratic-spline table
 W3=SPC/TIP3p  W4=TIP4p (single or pairs)
 V&F=Potential and force  V=Potential only  F=Force only

 Computing:                               M-Number         M-Flops  % Flops
-----------------------------------------------------------------------------
 NB Free energy kernel              2548581.570540     2548581.571     2.3
 Pair Search distance check            1566.270752       14096.437     0.0
 NxN Ewald Elec. + LJ [F]           1127160.516736   104825928.056    96.1
 NxN Ewald Elec. + LJ [V&F]           11396.798144     1447393.364     1.3
 1,4 nonbonded interactions              37.613695        3385.233     0.0
 Shift-X                                 11.428758          68.573     0.0
 Bonds                                    9.880095         582.926     0.0
 Angles                                  29.466950        4950.448     0.0
 Propers                                 85.280820       19529.308     0.0
 Impropers                               10.746770        2235.328     0.0
 Virial                                   9.228465         166.112     0.0
 Update                                 914.168790       28339.232     0.0
 Stop-CM                                  9.145116          91.451     0.0
 Calc-Ekin                               18.300780         494.121     0.0
 Lincs                                   13.520130         811.208     0.0
 Lincs-Mat                               99.840960         399.364     0.0
 Constraint-V                          1821.057510       16389.518     0.0
 Constraint-Vir                           9.046290         217.111     0.0
 Settle                                 598.005750      221262.127     0.2
-----------------------------------------------------------------------------
 Total                                               109134921.487   100.0
-----------------------------------------------------------------------------


      R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G

On 1 MPI rank, each using 8 OpenMP threads

 Activity:              Num   Num      Call    Wall time         Giga-Cycles
                        Ranks Threads  Count      (s)         total sum    %
--------------------------------------------------------------------------------
 Neighbor search           1    8       2167     120.027       2208.273   0.6
 Launch PP GPU ops.        1    8     173335      17.109        314.779   0.1
 Force                     1    8     173335    7635.319     140476.046  35.7
 PME GPU mesh              1    8     173335    1399.060      25740.177   6.5
 Wait GPU NB local                                 0.001          0.011   0.0
 NB X/F buffer ops.        1    8     344503    2686.251      49422.162  12.6
 Write traj.               1    8         32       2.167         39.869   0.0
 Update                    1    8     346670    4030.959      74162.351  18.9
 Constraints               1    8     346670    5332.954      98116.690  25.0
 Rest                                            143.884       2647.199   0.7
--------------------------------------------------------------------------------
 Total                                         21367.731     393127.557 100.0
--------------------------------------------------------------------------------
 Breakdown of PME mesh activities
--------------------------------------------------------------------------------
 Wait PME GPU gather       1    8     173335       1.728         31.785   0.0
 Reduce GPU PME F          1    8     173335    1375.799      25312.214   6.4
 Launch PME GPU ops.       1    8    1906690      20.562        378.300   0.1
--------------------------------------------------------------------------------

               Core t (s)   Wall t (s)        (%)
       Time:   170941.809    21367.731      800.0
                         5h56:07
                 (ns/day)    (hour/ns)
Performance:        2.103       11.414

For comparison’s sake, here is the accounting for the normal run:

        M E G A - F L O P S   A C C O U N T I N G

 NB=Group-cutoff nonbonded kernels    NxN=N-by-N cluster Verlet kernels
 RF=Reaction-Field  VdW=Van der Waals  QSTab=quadratic-spline table
 W3=SPC/TIP3p  W4=TIP4p (single or pairs)
 V&F=Potential and force  V=Potential only  F=Force only

 Computing:                               M-Number         M-Flops  % Flops
-----------------------------------------------------------------------------
 NB Free energy kernel              2564581.139310     2564581.139     2.4
 Pair Search distance check            1597.526320       14377.737     0.0
 NxN Ewald Elec. + LJ [F]           1125310.483712   104653874.985    96.0
 NxN Ewald Elec. + LJ [V&F]           11379.764928     1445230.146     1.3
 1,4 nonbonded interactions              37.613695        3385.233     0.0
 Shift-X                                 11.428758          68.573     0.0
 Bonds                                    9.880095         582.926     0.0
 Angles                                  29.466950        4950.448     0.0
 Propers                                 85.280820       19529.308     0.0
 Impropers                               10.746770        2235.328     0.0
 Virial                                   9.228465         166.112     0.0
 Update                                 914.168790       28339.232     0.0
 Stop-CM                                  9.145116          91.451     0.0
 Calc-Ekin                               18.300780         494.121     0.0
 Lincs                                   13.520130         811.208     0.0
 Lincs-Mat                               99.840960         399.364     0.0
 Constraint-V                          1821.057510       16389.518     0.0
 Constraint-Vir                           9.046290         217.111     0.0
 Settle                                 598.005750      221262.127     0.2
-----------------------------------------------------------------------------
 Total                                               108976986.066   100.0
-----------------------------------------------------------------------------


      R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G

On 1 MPI rank, each using 8 OpenMP threads

 Activity:              Num   Num      Call    Wall time         Giga-Cycles
                        Ranks Threads  Count      (s)         total sum    %
--------------------------------------------------------------------------------
 Neighbor search           1    8       2167       7.324        134.742   5.1
 Launch PP GPU ops.        1    8     173335       6.806        125.219   4.7
 Force                     1    8     173335      85.362       1570.508  59.5
 PME GPU mesh              1    8     173335      13.067        240.403   9.1
 NB X/F buffer ops.        1    8     344503       3.904         71.826   2.7
 Write traj.               1    8          9       0.034          0.631   0.0
 Update                    1    8     346670      10.627        195.516   7.4
 Constraints               1    8     346670      13.288        244.474   9.3
 Rest                                              3.124         57.479   2.2
--------------------------------------------------------------------------------
 Total                                           143.535       2640.797 100.0
--------------------------------------------------------------------------------
 Breakdown of PME mesh activities
--------------------------------------------------------------------------------
 Wait PME GPU gather       1    8     173335       0.385          7.090   0.3
 Reduce GPU PME F          1    8     173335       1.815         33.390   1.3
 Launch PME GPU ops.       1    8    1906690      10.524        193.627   7.3
--------------------------------------------------------------------------------

               Core t (s)   Wall t (s)        (%)
       Time:     1148.251      143.535      800.0
                 (ns/day)    (hour/ns)
Performance:      313.013        0.077
Finished mdrun on rank 0 Mon Aug  7 15:11:04 2023

Hi,

Please share complete log files rather than snippets; important information from the log files is missing, e.g. whether thread affinities were set or not.

My guess is that you have affinity issues within the container or across multiple containers since CPU activities are significantly slower in your slow run.

Unrelated to that, are you intentionally not offloading more tasks to the GPU (e.g. -pme gpu, -bonded gpu, -update gpu)?
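
For example, something along these lines (a sketch; the file name is a placeholder):

gmx mdrun -deffnm run -nb gpu -pme gpu -bonded gpu -update gpu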

Cheers,
Szilárd

Hi Szilárd,

Thanks for the reply. Apologies for not sharing the whole file; I’d send it now, but unfortunately I have since lost it.
I had pinned the threads for mdrun, so hopefully that wasn’t the issue. Setting the working directory of mdrun to a non-shared volume seemed to solve the issue, thankfully.

As for not explicitly offloading more tasks to the GPU: this was because PME and bonded calculations appear to be offloaded to the GPU automatically, even when I don’t specify these flags?

I am unable to use -update gpu, since the simulation is a free-energy calculation and it appears that this type of simulation is not supported with GPU update.

Best,
Noah

Hi Noah,

Strange to hear that a file-system change solved the slowdowns, since the performance table indicated that the increase in wall time was in computation rather than in I/O operations (e.g. the “Write traj.” row).

As for not explicitly offloading more tasks to the GPU: this was because PME and bonded calculations appear to be offloaded to the GPU automatically, even when I don’t specify these flags?

As far as I recall, -pme gpu is default, but I’m not sure -bonded gpu is (at least until recent releases the heuristic was to offload bondeds by default if PME was not present).

I am unable to use -update gpu, since the simulation is a free-energy calculation and it appears that this type of simulation is not supported with GPU update.

If you are using integrator = sd, that limitation does still apply.

Cheers,
Szilárd

Yeah, that is strange re: the lack of any indication of I/O issues in the log file. If I encounter the issue again I’ll take more care to keep and inspect the logs!

I’m pretty sure I tried running the simulation with -update gpu and setting the integrator to md after seeing the error message about only the md integrator being supported. However, after updating the integrator I’m fairly certain I would still see the following message:

Free energy perturbation for mass and constraints are not supported.

I can have another go in case I failed to properly update the integrator and that is still the limiting factor…

Yeah, so I just checked and I can’t use -update gpu, according to the following message in the log file:

Update groups can not be used for this system because atoms that are (in)directly constrained together are interdispersed with other atoms

I imagine this is because I’m using a hybrid topology that merges the two small molecules into a single molecule, using lambda values to dictate the end states, in order to run FEP simulations?

The message means that the topology ordering prevents the “update groups” functionality from being enabled, which is a prerequisite for GPU update. I do not think FEP is related; please check whether you can reorder your topology to satisfy the requirement.

Hi Szilárd,

Thanks for the response. I didn’t think there was anything unusual about my topologies; what exactly should I be looking for in order to try and fix this?

The layout of my topology file is shown below. I have simplified it and added a few explanatory notes to hopefully make it easier to follow:

; Define heavy hydrogens
#define HEAVY_H
; Include forcefield parameters
#include "amber99sb-ildn.ff/forcefield.itp"
; Include MOL forcefield parameters  #This points to a small molecule .itp file
#include "hybrid_top.itp"

[ moleculetype ] #All protein chains merged into single mol so just contains Protein
[ atoms ]
[ bonds ]
[ pairs ]
[ angles ]
[ dihedrals ]

;Include water topology
#ifdef POSRES_WATER
; Include topology for ions
[ system ]
[ molecules ] #Contains mols in the order Protein, Mol, SOL, NA, CL

The small molecule .itp file directives look as follows:

; Include MOL forcefield parameters #This points to the mol atom types .itp file
#include "ff_hybrid.itp"

[ moleculetype ] #The hybrid small molecule (MOL)
[ atoms ]
[ bonds ]
[ pairs ]
[ angles ]
[ dihedrals ]

The atom types file (ff_hybrid.itp) just contains the [ atomtypes ] directive for the hybrid molecule.
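
For reference, a minimal sketch of what that file holds (the type name and parameters here are illustrative placeholders, not my actual values):

[ atomtypes ]
; name  at.num     mass   charge  ptype    sigma   epsilon
  ca         6   12.011    0.000      A  0.33997   0.35982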

Hi Szilárd,

I did some digging and came across this post, which suggested to me that the inability to do updates on the GPU may be due to the fact that I have constraints = h-bonds, and the ordering of my small-molecule PDB file doesn’t place hydrogens directly after the heavy atoms they are attached to. Is this thinking correct? If so, I can look into adjusting the ordering.
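
My understanding of the required ordering is that each hydrogen should be listed immediately after the heavy atom it is bonded to, e.g. (a sketch; atom names, types and charges are illustrative only):

[ atoms ]
;  nr type resnr  res atom cgnr  charge    mass
    1   c3     1  LIG   C1    1  -0.100  12.011
    2   hc     1  LIG  H11    2   0.050   1.008  ; hydrogens bonded to C1 ...
    3   hc     1  LIG  H12    3   0.050   1.008  ; ... directly follow it
    4   c3     1  LIG   C2    4  -0.080  12.011
    5   hc     1  LIG  H21    5   0.040   1.008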

I have also managed to reproduce the initial issue I was encountering (seemingly random performance loss). I realised this issue only arises when Slurm runs simulation jobs on compute nodes that haven’t completely shut down after running a previous job. I.e., if a simulation has just finished but the compute node is still running and a new job is submitted to that node, the simulation experiences insane performance degradation. However, if the node is allowed to shut down and a job is then submitted to a new node instance, this performance degradation isn’t exhibited. I’m a bit lost as to why this might be happening, and any advice would be greatly appreciated!

Thanks,
Noah

Correct, that should resolve the issue.

What do you mean by “nodes that haven’t completely shut down”?

Cheers,
Szilárd

I am running an auto-scaling cluster: when jobs are submitted to Slurm, each job leads to the creation of a node / virtual machine instance. The job (i.e. a set of simulations) is then run on the virtual machine instance… Once the job has finished running, the virtual machine instance shuts down and is deleted.

The window between a job finishing and the virtual machine instance being deleted is around 5 minutes or so. If I submit another job within this time, the job is directed to the already-running instance, and this is when I see the performance degradation. If a submitted job requires a fresh VM instance / node to be booted up, the performance degradation doesn’t occur.

It’s a strange issue indeed; I can’t see why this might be occurring. As you pointed out, it may be attributable to the CPU usage (according to the simulation log files). Something worth noting about these VM instances is that they use 16 vCPUs, which I believe corresponds to 8 possible threads. I am setting -pin on -pinoffset 0 -pinstride 0 -nt 8.
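
For reference, the full invocation looks roughly like this (the -deffnm name is a placeholder):

gmx mdrun -deffnm run -nt 8 -pin on -pinoffset 0 -pinstride 0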

Also, I rearranged the hydrogen atoms to conform to the required order but still get the following message:

Update task on the GPU was required,
but the following condition(s) were not satisfied:
Free energy perturbation for mass and constraints are not supported.

I have attached the mdrun log file below for further info

eq_nvt_anneal.log (14.7 KB)

Update on the mdrun performance bug: it appears to have been an issue with Docker. If I include a docker restart command at the start of my Slurm submission scripts (i.e. Docker is restarted at the beginning of every job that is submitted), the performance issue disappears. I’m quite surprised by this, since I run the container with the --rm flag, which removes the container after the command, and so the container should shut down when the job finishes… Strange.

Anyway, if you have any advice re: update groups failing even after reordering my coordinate files, that’d be great.

Thanks,
Noah

To be able to run update on the GPU, the masses of all particles and the lengths of all constraints have to be the same in the A and the B state.

So that sounds like you can’t use -update gpu with FEP calculations, since the A- and B-state ligands are going to have different masses? Perhaps I’m a little lost by your reply, apologies.

Yes, either different masses or different constraint lengths.

If it’s only the masses, this can be avoided by making the masses in the A and the B state the same. The contribution of the masses can be calculated analytically.
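
For illustration, in the perturbed molecule’s [ atoms ] section this would mean keeping the massB column equal to massA, e.g. (a sketch; types and charges are placeholders):

[ atoms ]
;  nr  type resnr residue atom cgnr  charge    mass typeB chargeB   massB
    1    c3     1     LIG   C1    1  -0.100  12.011    ca   0.050  12.011  ; massB kept equal to massA
    2    hc     1     LIG  H11    2   0.050   1.008    ha   0.030   1.008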