Normal and extremely slow simulation with the same input, what could be wrong?

GROMACS version: 2019.3 (on supercomputer)
GROMACS modification: No

I’ve used GROMACS for a while with no issues like this one. A strange issue has suddenly started to appear in every test simulation I’ve made: the same simulations, which usually finish in one day, now take a few months (as predicted in the .log file)!

I suppose the problem is the extremely high load imbalance (~ 100 %), which suddenly started to appear in every test I’ve made. I tried to fix it, but in the process I realized that the issue appears even in simulations I had already run with no problems, so I’m stuck…

Below are two md.log files (normal and slow simulation) and two slurm_hpc.log files (normal and slow simulation) from a typical test benchmark (a simple membrane bilayer with water and ions):

Thanks in advance, any suggestion will be appreciated…
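
For context, I submit the jobs through SLURM. Below is a minimal sketch of the kind of batch script I use; the node counts, module name and file prefix are placeholders, not my exact script:

   #!/bin/bash
   #SBATCH --job-name=membrane_test     # placeholder job name
   #SBATCH --nodes=8                    # placeholder node count
   #SBATCH --ntasks-per-node=24         # placeholder MPI ranks per node
   #SBATCH --time=24:00:00

   module load gromacs/2019.3           # placeholder module name on this cluster

   # run the MPI build of GROMACS; "md" stands in for the real .tpr prefix
   srun gmx_mpi mdrun -deffnm md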

Summary of files:

  1. md_normal_load_imbalannce.log

Summary:

Dynamic load balancing report:
DLB was turned on during the run due to measured imbalance.
Average load imbalance: 2.8%.
The balanceable part of the MD step is 80%, load imbalance is computed from this.
Part of the total run time spent waiting due to load imbalance: 2.3%.
Steps where the load balancing was limited by -rdd, -rcon and/or -dds: X 0 % Y 0 % Z 0 %
Average PME mesh/force load: 0.791
Part of the total run time spent waiting due to PP/PME imbalance: 2.1 %

          Core t (s)   Wall t (s)        (%)
   Time:     3280.648       17.089    19197.7
             (ns/day)    (hour/ns)
   Performance:      101.129        0.237
  2. md_high_load_imbalance.log

Summary:

Dynamic load balancing report:
DLB was permanently on during the run per user request.
Average load imbalance: 96.0%.
The balanceable part of the MD step is 32%, load imbalance is computed from this.
Part of the total run time spent waiting due to load imbalance: 30.7%.
Steps where the load balancing was limited by -rdd, -rcon and/or -dds: X 0 % Y 0 % Z 1 %
Average PME mesh/force load: 0.423
Part of the total run time spent waiting due to PP/PME imbalance: 4.4 %

NOTE: 30.7 % of the available CPU time was lost due to load imbalance
in the domain decomposition.
You can consider manually changing the decomposition (option -dd);
e.g. by using fewer domains along the box dimension in which there is
considerable inhomogeneity in the simulated system.
NOTE: 9 % of the run time was spent communicating energies,
you might want to use the -gcom option of mdrun

           Core t (s)   Wall t (s)        (%)
   Time:   147961.692      770.636    19200.0
             (ns/day)    (hour/ns)
   Performance:        2.243       10.702
  3. slurm_normal_load_imbalance.log

Summary:

Dynamic load balancing report:
DLB was turned on during the run due to measured imbalance.
Average load imbalance: 2.8%.
The balanceable part of the MD step is 80%, load imbalance is computed from this.
Part of the total run time spent waiting due to load imbalance: 2.3%.
Steps where the load balancing was limited by -rdd, -rcon and/or -dds: X 0 % Y 0 % Z 0 %
Average PME mesh/force load: 0.791
Part of the total run time spent waiting due to PP/PME imbalance: 2.1 %

           Core t (s)   Wall t (s)        (%)
   Time:     3280.648       17.089    19197.7
             (ns/day)    (hour/ns)
   Performance:      101.129        0.237
  4. slurm_high_load_imbalance.log

Summary:

Dynamic load balancing report:
DLB was permanently on during the run per user request.
Average load imbalance: 96.0%.
The balanceable part of the MD step is 32%, load imbalance is computed from this.
Part of the total run time spent waiting due to load imbalance: 30.7%.
Steps where the load balancing was limited by -rdd, -rcon and/or -dds: X 0 % Y 0 % Z 1 %
Average PME mesh/force load: 0.423
Part of the total run time spent waiting due to PP/PME imbalance: 4.4 %

NOTE: 30.7 % of the available CPU time was lost due to load imbalance
in the domain decomposition.
You can consider manually changing the decomposition (option -dd);
e.g. by using fewer domains along the box dimension in which there is
considerable inhomogeneity in the simulated system.

NOTE: 9 % of the run time was spent communicating energies,
you might want to use the -gcom option of mdrun

           Core t (s)   Wall t (s)        (%)
   Time:   147961.692      770.636    19200.0
             (ns/day)    (hour/ns)
  Performance:        2.243       10.702
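
Regarding the -dd and -gcom suggestions in the NOTEs above, this is the kind of manual adjustment I understand them to mean; the domain grid and the -gcom interval below are only illustrative values I have not tested (the grid has to multiply to the number of PP ranks):

   # sketch only: fewer domains along the membrane normal (Z) and
   # less frequent global energy communication
   srun gmx_mpi mdrun -deffnm md -dd 8 6 4 -gcom 100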

I noticed a possibly important difference between the md.log files:

  • "Dynamic load balancing: (null) " in slow simulation
  • “Dynamic load balancing: locked” in normal simulation

What could cause “(null)” load balancing?

I used the switch “-dlb yes” to force dynamic load balancing in the slow simulation, because otherwise dynamic load balancing was not turned on.
I read in the manual that DLB turns itself off even with “-dlb yes” if it cannot improve the load balance.
Maybe that is the reason for the “(null)” :/
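
For reference, the flag is added to the mdrun line like this (again, the file prefix is a placeholder):

   # force dynamic load balancing on instead of leaving it at the default "auto"
   srun gmx_mpi mdrun -deffnm md -dlb yes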

Here is an example of what I get without the “-dlb yes” option:

Dynamic load balancing report:
DLB was off during the run due to low measured imbalance.
Average load imbalance: 93.9%.
The balanceable part of the MD step is 16%, load imbalance is computed from this.
Part of the total run time spent waiting due to load imbalance: 15.2%.
Average PME mesh/force load: 1.745
Part of the total run time spent waiting due to PP/PME imbalance: 23.9 %

NOTE: 15.2 % of the available CPU time was lost due to load imbalance
in the domain decomposition.
You might want to use dynamic load balancing (option -dlb.)
You can also consider manually changing the decomposition (option -dd);
e.g. by using fewer domains along the box dimension in which there is
considerable inhomogeneity in the simulated system.
NOTE: 23.9 % performance was lost because the PME ranks
had more work to do than the PP ranks.
You might want to increase the number of PME ranks
or increase the cut-off and the grid spacing.

           Core t (s)   Wall t (s)        (%)
   Time:    59675.711      310.813    19199.9
             (ns/day)    (hour/ns)
   Performance:        5.560        4.316

It says “low measured imbalance”, so DLB was off.
How is that possible if the imbalance is extremely high?
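
On a side note, the PME NOTE above suggests giving more ranks to PME; I understand it to mean something like the line below, where the -npme value is just an illustrative guess, not something I have benchmarked:

   # sketch only: dedicate more ranks to PME so the PP ranks do not
   # sit waiting for the PME mesh work
   srun gmx_mpi mdrun -deffnm md -npme 16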

Thanks

For some reason, it works fine again…

I don’t know what was wrong.
Yesterday I emailed the HPC staff with a description of the problem, hoping that the problem was on their side.
They haven’t answered (yet), and I don’t know whether they did anything.
It just works fine now :/

Here are the md.log and slurm.log files:

  1. md_normal_again.log

Dynamic load balancing report:
DLB was permanently on during the run per user request.
Average load imbalance: 3.1%.
The balanceable part of the MD step is 80%, load imbalance is computed from this.
Part of the total run time spent waiting due to load imbalance: 2.4%.
Steps where the load balancing was limited by -rdd, -rcon and/or -dds: X 0 % Y 0 % Z 0 %
Average PME mesh/force load: 0.791
Part of the total run time spent waiting due to PP/PME imbalance: 2.1 %

   Core t (s)   Wall t (s)        (%)
   Time:     3284.604       17.109    19198.1
             (ns/day)    (hour/ns)
   Performance:      101.010        0.238
    Finished mdrun on rank 0 Mon Jun 29 18:25:18 2020
  2. slurm_normal_again.log

Dynamic load balancing report:
DLB was permanently on during the run per user request.
Average load imbalance: 3.1%.
The balanceable part of the MD step is 80%, load imbalance is computed from this.
Part of the total run time spent waiting due to load imbalance: 2.4%.
Steps where the load balancing was limited by -rdd, -rcon and/or -dds: X 0 % Y 0 % Z 0 %
Average PME mesh/force load: 0.791
Part of the total run time spent waiting due to PP/PME imbalance: 2.1 %

           Core t (s)   Wall t (s)        (%)
   Time:     3284.604       17.109    19198.1
             (ns/day)    (hour/ns)
    Performance:      101.010        0.238

Hi,
I’ve been encountering a similar issue to the one you describe, also using SLURM to submit jobs. Did this problem ever arise again for you? And if so, did you find out what was causing it?