Efficient parallelization scheme

Hi,

You are trying to simulate a very small system (~20500 atoms), so you should not expect it to scale well beyond ~100 cores; with a fast network interconnect and possibly some careful parameter tuning, it may scale to 200 cores or so.

A few pointers:

  • What kind of network interconnect are you using? Make sure it is not Ethernet.
  • Make sure to set process or thread affinities, either with your MPI launcher or with mdrun -pin on.
  • Your runs have a major PP-PME imbalance, likely because PME does not seem to scale well; this can be due to either of the two points above (or possibly other reasons too). Rule out those two first, then focus on improving the PME time.
  • Consider using more PME ranks, or possibly PME order 5.
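
As a rough sketch of what the pinning and PME-rank suggestions could look like on the command line: the rank counts, the gmx_mpi binary name, the mpirun launcher, and the md file prefix below are placeholders I made up for illustration, not values taken from your runs, so adjust them to your cluster and inputs.

```shell
#!/bin/sh
# Hypothetical values for illustration only; tune them for your machine.
NRANKS=96     # total number of MPI ranks (assumed)
NPME=24       # ranks dedicated to PME (assumed; try a few values)
DEFFNM=md     # assumed base name of the run input, i.e. md.tpr

# -pin on asks mdrun to set thread affinities itself;
# -npme sets the number of separate PME ranks.
CMD="mpirun -np $NRANKS gmx_mpi mdrun -deffnm $DEFFNM -pin on -npme $NPME"

# Print the command as a dry run; uncomment the last line to launch it.
echo "$CMD"
# $CMD
```

PME order 5 is not an mdrun flag: it is set with pme-order = 5 in the .mdp file, so the .tpr has to be regenerated with grompp before rerunning. Comparing the PP-PME load balance reported at the end of md.log for a few NPME values should show which split works best.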

Cheers,
Szilárd