Normal and extremely slow simulation with the same input, what could be wrong?

GROMACS version: 2019.3 (on supercomputer)
GROMACS modification: No

I’ve used GROMACS for a while with no issues like this one. A strange issue has suddenly started to appear in every test simulation I’ve made: the same simulations, which usually finish in one day, now take a few months (as predicted in the .log file)!

I suppose the problem is the extremely high load imbalance (~ 100 %), which suddenly started to appear in every test I’ve made. I tried to fix it, but in the process I realized that the issue appears even in simulations I had already run with no problems, so I’m stuck…

Below are two md.log files (normal and slow simulation) and two slurm_hpc.log files (normal and slow simulation) from a typical test benchmark (a simple membrane bilayer with water and ions):

Thanks in advance, any suggestion will be appreciated…
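
For context, I submit the jobs through SLURM. Below is a minimal sketch of the kind of batch script I use; the node counts, module name and file prefix are placeholders, not my exact script:

   #!/bin/bash
   #SBATCH --job-name=membrane_test     # placeholder job name
   #SBATCH --nodes=8                    # placeholder node count
   #SBATCH --ntasks-per-node=24         # placeholder MPI ranks per node
   #SBATCH --time=24:00:00

   module load gromacs/2019.3           # placeholder module name on this cluster

   # run the MPI build of GROMACS; "md" stands in for the real .tpr prefix
   srun gmx_mpi mdrun -deffnm md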

Summary of files:

  1. md_normal_load_imbalannce.log

Summary:

Dynamic load balancing report:
DLB was turned on during the run due to measured imbalance.
Average load imbalance: 2.8%.
The balanceable part of the MD step is 80%, load imbalance is computed from this.
Part of the total run time spent waiting due to load imbalance: 2.3%.
Steps where the load balancing was limited by -rdd, -rcon and/or -dds: X 0 % Y 0 % Z 0 %
Average PME mesh/force load: 0.791
Part of the total run time spent waiting due to PP/PME imbalance: 2.1 %

          Core t (s)   Wall t (s)        (%)
   Time:     3280.648       17.089    19197.7
             (ns/day)    (hour/ns)
   Performance:      101.129        0.237
  2. md_high_load_imbalance.log

Summary:

Dynamic load balancing report:
DLB was permanently on during the run per user request.
Average load imbalance: 96.0%.
The balanceable part of the MD step is 32%, load imbalance is computed from this.
Part of the total run time spent waiting due to load imbalance: 30.7%.
Steps where the load balancing was limited by -rdd, -rcon and/or -dds: X 0 % Y 0 % Z 1 %
Average PME mesh/force load: 0.423
Part of the total run time spent waiting due to PP/PME imbalance: 4.4 %

NOTE: 30.7 % of the available CPU time was lost due to load imbalance
in the domain decomposition.
You can consider manually changing the decomposition (option -dd);
e.g. by using fewer domains along the box dimension in which there is
considerable inhomogeneity in the simulated system.
NOTE: 9 % of the run time was spent communicating energies,
you might want to use the -gcom option of mdrun

           Core t (s)   Wall t (s)        (%)
   Time:   147961.692      770.636    19200.0
             (ns/day)    (hour/ns)
   Performance:        2.243       10.702
  3. slurm_normal_load_imbalance.log

Summary:

Dynamic load balancing report:
DLB was turned on during the run due to measured imbalance.
Average load imbalance: 2.8%.
The balanceable part of the MD step is 80%, load imbalance is computed from this.
Part of the total run time spent waiting due to load imbalance: 2.3%.
Steps where the load balancing was limited by -rdd, -rcon and/or -dds: X 0 % Y 0 % Z 0 %
Average PME mesh/force load: 0.791
Part of the total run time spent waiting due to PP/PME imbalance: 2.1 %

           Core t (s)   Wall t (s)        (%)
   Time:     3280.648       17.089    19197.7
             (ns/day)    (hour/ns)
   Performance:      101.129        0.237
  4. slurm_high_load_imbalance.log

Summary:

Dynamic load balancing report:
DLB was permanently on during the run per user request.
Average load imbalance: 96.0%.
The balanceable part of the MD step is 32%, load imbalance is computed from this.
Part of the total run time spent waiting due to load imbalance: 30.7%.
Steps where the load balancing was limited by -rdd, -rcon and/or -dds: X 0 % Y 0 % Z 1 %
Average PME mesh/force load: 0.423
Part of the total run time spent waiting due to PP/PME imbalance: 4.4 %

NOTE: 30.7 % of the available CPU time was lost due to load imbalance
in the domain decomposition.
You can consider manually changing the decomposition (option -dd);
e.g. by using fewer domains along the box dimension in which there is
considerable inhomogeneity in the simulated system.

NOTE: 9 % of the run time was spent communicating energies,
you might want to use the -gcom option of mdrun

           Core t (s)   Wall t (s)        (%)
   Time:   147961.692      770.636    19200.0
             (ns/day)    (hour/ns)
  Performance:        2.243       10.702
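
Regarding the -dd and -gcom suggestions in the NOTEs above, this is the kind of manual adjustment I understand them to mean; the domain grid and the -gcom interval below are only illustrative values I have not tested (the grid has to multiply to the number of PP ranks):

   # sketch only: fewer domains along the membrane normal (Z) and
   # less frequent global energy communication
   srun gmx_mpi mdrun -deffnm md -dd 8 6 4 -gcom 100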

I noticed a possibly important difference between the md.log files:

  • "Dynamic load balancing: (null) " in slow simulation
  • “Dynamic load balancing: locked” in normal simulation

What could cause “(null)” load balancing?

I used the switch “-dlb yes” to force dynamic load balancing in the slow simulation, because otherwise dynamic load balancing was not turned on.
I read in the manual that DLB turns itself off even with “-dlb yes” if it cannot improve the load balance.
Maybe that is the reason for the “(null)” :/
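
For reference, the flag is added to the mdrun line like this (again, the file prefix is a placeholder):

   # force dynamic load balancing on instead of leaving it at the default "auto"
   srun gmx_mpi mdrun -deffnm md -dlb yes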

Here is an example of what I get without the “-dlb yes” option:

Dynamic load balancing report:
DLB was off during the run due to low measured imbalance.
Average load imbalance: 93.9%.
The balanceable part of the MD step is 16%, load imbalance is computed from this.
Part of the total run time spent waiting due to load imbalance: 15.2%.
Average PME mesh/force load: 1.745
Part of the total run time spent waiting due to PP/PME imbalance: 23.9 %

NOTE: 15.2 % of the available CPU time was lost due to load imbalance
in the domain decomposition.
You might want to use dynamic load balancing (option -dlb.)
You can also consider manually changing the decomposition (option -dd);
e.g. by using fewer domains along the box dimension in which there is
considerable inhomogeneity in the simulated system.
NOTE: 23.9 % performance was lost because the PME ranks
had more work to do than the PP ranks.
You might want to increase the number of PME ranks
or increase the cut-off and the grid spacing.

           Core t (s)   Wall t (s)        (%)
   Time:    59675.711      310.813    19199.9
             (ns/day)    (hour/ns)
   Performance:        5.560        4.316

It says “low measured imbalance”, so DLB was off.
How is that possible if the imbalance is extremely high?
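
On a side note, the PME NOTE above suggests giving more ranks to PME; I understand it to mean something like the line below, where the -npme value is just an illustrative guess, not something I have benchmarked:

   # sketch only: dedicate more ranks to PME so the PP ranks do not
   # sit waiting for the PME mesh work
   srun gmx_mpi mdrun -deffnm md -npme 16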

Thanks

For some reason, it works fine again…

I don’t know what was wrong.
Yesterday I emailed the HPC staff with a description of the problem, hoping that the problem was on their side.
They haven’t answered (yet), and I don’t know whether they did anything.
It just works fine now :/

Here are the md.log and slurm.log files:

  1. md_normal_again.log

Dynamic load balancing report:
DLB was permanently on during the run per user request.
Average load imbalance: 3.1%.
The balanceable part of the MD step is 80%, load imbalance is computed from this.
Part of the total run time spent waiting due to load imbalance: 2.4%.
Steps where the load balancing was limited by -rdd, -rcon and/or -dds: X 0 % Y 0 % Z 0 %
Average PME mesh/force load: 0.791
Part of the total run time spent waiting due to PP/PME imbalance: 2.1 %

   Core t (s)   Wall t (s)        (%)
   Time:     3284.604       17.109    19198.1
             (ns/day)    (hour/ns)
   Performance:      101.010        0.238
    Finished mdrun on rank 0 Mon Jun 29 18:25:18 2020
  2. slurm_normal_again.log

Dynamic load balancing report:
DLB was permanently on during the run per user request.
Average load imbalance: 3.1%.
The balanceable part of the MD step is 80%, load imbalance is computed from this.
Part of the total run time spent waiting due to load imbalance: 2.4%.
Steps where the load balancing was limited by -rdd, -rcon and/or -dds: X 0 % Y 0 % Z 0 %
Average PME mesh/force load: 0.791
Part of the total run time spent waiting due to PP/PME imbalance: 2.1 %

           Core t (s)   Wall t (s)        (%)
   Time:     3284.604       17.109    19198.1
             (ns/day)    (hour/ns)
    Performance:      101.010        0.238

Hi,
I’ve been encountering a similar issue to the one you describe, also using SLURM to submit jobs. Did this problem ever arise again for you? And if so, did you find out what was causing it?