GROMACS version: 2019.3 (on supercomputer)
GROMACS modification: No
I’ve used GROMACS for a while with no issues like this one. A strange problem has suddenly started to appear in every test simulation I run: the same simulations that usually finish in one day are now predicted (in the .log file) to run for a few months!
I suspect the cause is the extremely high load imbalance (~100 %) that suddenly shows up in every test. I tried to fix it, but in the process I realized the issue now appears even in simulations that previously ran without problems, so I’m stuck…
Below are summaries of two md.log files (normal and slow run) and two slurm_hpc.log files (normal and slow run) from a typical benchmark: a simple membrane bilayer with water and ions. A quick arithmetic check and the mdrun options the log suggests are sketched after the file summaries.
Thanks in advance; any suggestions will be appreciated…
Summary of files:
- md_normal_load_imbalannce.log
Summary:
Dynamic load balancing report:
DLB was turned on during the run due to measured imbalance.
Average load imbalance: 2.8%.
The balanceable part of the MD step is 80%, load imbalance is computed from this.
Part of the total run time spent waiting due to load imbalance: 2.3%.
Steps where the load balancing was limited by -rdd, -rcon and/or -dds: X 0 % Y 0 % Z 0 %
Average PME mesh/force load: 0.791
Part of the total run time spent waiting due to PP/PME imbalance: 2.1 %
               Core t (s)   Wall t (s)        (%)
       Time:     3280.648       17.089    19197.7
                 (ns/day)    (hour/ns)
Performance:      101.129        0.237
- md_high_load_imbalance.log
Summary:
Dynamic load balancing report:
DLB was permanently on during the run per user request.
Average load imbalance: 96.0%.
The balanceable part of the MD step is 32%, load imbalance is computed from this.
Part of the total run time spent waiting due to load imbalance: 30.7%.
Steps where the load balancing was limited by -rdd, -rcon and/or -dds: X 0 % Y 0 % Z 1 %
Average PME mesh/force load: 0.423
Part of the total run time spent waiting due to PP/PME imbalance: 4.4 %

NOTE: 30.7 % of the available CPU time was lost due to load imbalance
in the domain decomposition.
You can consider manually changing the decomposition (option -dd);
e.g. by using fewer domains along the box dimension in which there is
considerable inhomogeneity in the simulated system.

NOTE: 9 % of the run time was spent communicating energies,
you might want to use the -gcom option of mdrun
               Core t (s)   Wall t (s)        (%)
       Time:   147961.692      770.636    19200.0
                 (ns/day)    (hour/ns)
Performance:        2.243       10.702
- slurm_normal_load_imbalance.log
Summary: identical to the md_normal_load_imbalannce.log summary above (same DLB report, same timings).
- slurm_high_load_imbalance.log
Summary: identical to the md_high_load_imbalance.log summary above (same DLB report, same timings).
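One sanity check (my own arithmetic, assuming the (%) column in the log is simply Core t divided by Wall t, times 100): both runs used the same allocation of about 192 cores,

$$
n_{\text{cores}} \approx \frac{\text{Core t}}{\text{Wall t}} = \frac{3280.648}{17.089} \approx \frac{147961.692}{770.636} \approx 192,
$$

so the roughly 45× drop in performance (101.129 → 2.243 ns/day) happens on identical resources; the time is being lost inside the run, not to a smaller allocation.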
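For completeness, this is how I understand the options suggested by the two NOTEs in the slow run’s log. This is only a sketch: the domain grid and the -gcom interval below are illustrative values I have not tested, and the -dd product must match the number of PP ranks in the job.

# Sketch only: illustrative values, not settings from my runs.
# -dlb auto : turn dynamic load balancing on only when imbalance is measured (the default)
# -dd 6 4 2 : explicit X Y Z domain decomposition grid; the product must equal the PP rank count
# -gcom 100 : do the global energy communication only every 100 steps
gmx mdrun -deffnm md -dlb auto -dd 6 4 2 -gcom 100

Note also that the slow run’s report says DLB was “permanently on during the run per user request” (i.e. -dlb yes), while the normal run used the auto behaviour, so that difference alone may matter.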