Normal and extremely slow simulation with the same input: what could be wrong?

GROMACS version: 2019.3 (on supercomputer)
GROMACS modification: No

I’ve used GROMACS for a while with no issues like this one. A strange issue suddenly started to appear in every test simulation I run: the same simulations, which usually finish in one day, are now predicted (in the .log file) to run for a few months!

I suspect the problem is the extremely high load imbalance (~100%) that suddenly started to appear in every test I run. I tried to fix it, but in the process I realized that the issue now appears even in simulations I had previously run without any problems, so I’m stuck…

Below are two md.log files (normal and slow simulation) and two slurm_hpc.log files (normal and slow simulation) from a typical benchmark test (a simple membrane bilayer with water and ions):

https://gofile.io/d/YO9G2u

Thanks in advance, any suggestions will be appreciated…

Summary of files:

  1. md_normal_load_imbalance.log
    https://gofile.io/d/8zJf6E
    Summary:

Dynamic load balancing report:
DLB was turned on during the run due to measured imbalance.
Average load imbalance: 2.8%.
The balanceable part of the MD step is 80%, load imbalance is computed from this.
Part of the total run time spent waiting due to load imbalance: 2.3%.
Steps where the load balancing was limited by -rdd, -rcon and/or -dds: X 0 % Y 0 % Z 0 %
Average PME mesh/force load: 0.791
Part of the total run time spent waiting due to PP/PME imbalance: 2.1 %

          Core t (s)   Wall t (s)        (%)
   Time:     3280.648       17.089    19197.7
             (ns/day)    (hour/ns)
   Performance:      101.129        0.237
  2. md_high_load_imbalance.log

Summary:

Dynamic load balancing report:
DLB was permanently on during the run per user request.
Average load imbalance: 96.0%.
The balanceable part of the MD step is 32%, load imbalance is computed from this.
Part of the total run time spent waiting due to load imbalance: 30.7%.
Steps where the load balancing was limited by -rdd, -rcon and/or -dds: X 0 % Y 0 % Z 1 %
Average PME mesh/force load: 0.423
Part of the total run time spent waiting due to PP/PME imbalance: 4.4 %

NOTE: 30.7 % of the available CPU time was lost due to load imbalance
in the domain decomposition.
You can consider manually changing the decomposition (option -dd);
e.g. by using fewer domains along the box dimension in which there is
considerable inhomogeneity in the simulated system.
NOTE: 9 % of the run time was spent communicating energies,
you might want to use the -gcom option of mdrun

           Core t (s)   Wall t (s)        (%)
   Time:   147961.692      770.636    19200.0
             (ns/day)    (hour/ns)
   Performance:        2.243       10.702
  3. slurm_normal_load_imbalance.log

Summary:

Dynamic load balancing report:
DLB was turned on during the run due to measured imbalance.
Average load imbalance: 2.8%.
The balanceable part of the MD step is 80%, load imbalance is computed from this.
Part of the total run time spent waiting due to load imbalance: 2.3%.
Steps where the load balancing was limited by -rdd, -rcon and/or -dds: X 0 % Y 0 % Z 0 %
Average PME mesh/force load: 0.791
Part of the total run time spent waiting due to PP/PME imbalance: 2.1 %

           Core t (s)   Wall t (s)        (%)
   Time:     3280.648       17.089    19197.7
             (ns/day)    (hour/ns)
   Performance:      101.129        0.237
  4. slurm_high_load_imbalance.log

Summary:

Dynamic load balancing report:
DLB was permanently on during the run per user request.
Average load imbalance: 96.0%.
The balanceable part of the MD step is 32%, load imbalance is computed from this.
Part of the total run time spent waiting due to load imbalance: 30.7%.
Steps where the load balancing was limited by -rdd, -rcon and/or -dds: X 0 % Y 0 % Z 1 %
Average PME mesh/force load: 0.423
Part of the total run time spent waiting due to PP/PME imbalance: 4.4 %

NOTE: 30.7 % of the available CPU time was lost due to load imbalance
in the domain decomposition.
You can consider manually changing the decomposition (option -dd);
e.g. by using fewer domains along the box dimension in which there is
considerable inhomogeneity in the simulated system.

NOTE: 9 % of the run time was spent communicating energies,
you might want to use the -gcom option of mdrun

           Core t (s)   Wall t (s)        (%)
   Time:   147961.692      770.636    19200.0
             (ns/day)    (hour/ns)
  Performance:        2.243       10.702

I noticed what may be an important difference in the md.log files:

  • "Dynamic load balancing: (null) " in slow simulation
  • “Dynamic load balancing: locked” in normal simulation

What could cause “(null)” load balancing?

I used the switch “-dlb yes” to force dynamic load balancing in the slow simulation, because otherwise dynamic load balancing was not turned on.
I read in the manual that DLB turns off even with “-dlb yes” if it cannot improve the load balance.
Maybe that is the reason for the “(null)” :/
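
For reference, the slow runs are launched roughly like this (a minimal sketch; the binary name, the -deffnm name, and the rest of the submit script are placeholders, only the -dlb flag is the point):

    # inside the SLURM job script (names other than -dlb are placeholders):
    srun gmx_mpi mdrun -deffnm md -dlb yes    # force DLB on (the default is -dlb auto)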

Here is an example of what I get without the “-dlb yes” option:

Dynamic load balancing report:
DLB was off during the run due to low measured imbalance.
Average load imbalance: 93.9%.
The balanceable part of the MD step is 16%, load imbalance is computed from this.
Part of the total run time spent waiting due to load imbalance: 15.2%.
Average PME mesh/force load: 1.745
Part of the total run time spent waiting due to PP/PME imbalance: 23.9 %

NOTE: 15.2 % of the available CPU time was lost due to load imbalance
in the domain decomposition.
You might want to use dynamic load balancing (option -dlb.)
You can also consider manually changing the decomposition (option -dd);
e.g. by using fewer domains along the box dimension in which there is
considerable inhomogeneity in the simulated system.
NOTE: 23.9 % performance was lost because the PME ranks
had more work to do than the PP ranks.
You might want to increase the number of PME ranks
or increase the cut-off and the grid spacing.

           Core t (s)   Wall t (s)        (%)
   Time:    59675.711      310.813    19199.9
             (ns/day)    (hour/ns)
   Performance:        5.560        4.316

It says DLB was off due to “low measured imbalance”.
How is that possible if the measured imbalance is extremely high?
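
In case it helps, this is roughly what I was planning to try next, following the NOTEs above about -dd, -gcom, and the PME ranks (all values below are placeholder guesses, not settings I have verified on this system):

    # Placeholder sketch of the log's suggestions (values are illustrative only):
    #   -npme 8    dedicate 8 separate ranks to PME (PME had more work than the PP ranks)
    #   -gcom 100  do the global communication (energy summation) only every 100 steps
    #   -dd 4 4 2  explicit 4x4x2 domain decomposition grid in x, y, z
    # The PP ranks (4*4*2 = 32) plus the 8 PME ranks must match the MPI ranks srun starts.
    srun -n 40 gmx_mpi mdrun -deffnm md -npme 8 -gcom 100 -dd 4 4 2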

Thanks

For some reason, it works fine again…

I don’t know what was wrong.
Yesterday, I emailed the HPC staff with a description of the problem, hoping the problem is on their side.
They haven’t answered (yet), and I don’t know whether they changed anything.
It just works fine now :/

Here are the md.log and slurm.log files:
https://gofile.io/d/9uNmR5

  1. md_normal_again.log

Dynamic load balancing report:
DLB was permanently on during the run per user request.
Average load imbalance: 3.1%.
The balanceable part of the MD step is 80%, load imbalance is computed from this.
Part of the total run time spent waiting due to load imbalance: 2.4%.
Steps where the load balancing was limited by -rdd, -rcon and/or -dds: X 0 % Y 0 % Z 0 %
Average PME mesh/force load: 0.791
Part of the total run time spent waiting due to PP/PME imbalance: 2.1 %

   Core t (s)   Wall t (s)        (%)
   Time:     3284.604       17.109    19198.1
             (ns/day)    (hour/ns)
   Performance:      101.010        0.238
    Finished mdrun on rank 0 Mon Jun 29 18:25:18 2020
  2. slurm_normal_again.log

Dynamic load balancing report:
DLB was permanently on during the run per user request.
Average load imbalance: 3.1%.
The balanceable part of the MD step is 80%, load imbalance is computed from this.
Part of the total run time spent waiting due to load imbalance: 2.4%.
Steps where the load balancing was limited by -rdd, -rcon and/or -dds: X 0 % Y 0 % Z 0 %
Average PME mesh/force load: 0.791
Part of the total run time spent waiting due to PP/PME imbalance: 2.1 %

           Core t (s)   Wall t (s)        (%)
   Time:     3284.604       17.109    19198.1
             (ns/day)    (hour/ns)
    Performance:      101.010        0.238