Hi,
You are trying to simulate a very small system (~20500 atoms), so you should not expect it to scale well beyond ~100 cores; with a fast network interconnect and possibly some careful parameter tuning, it may scale to 200 cores or so.
A few pointers:
- What kind of network interconnect are you using? Make sure it is not Ethernet.
- Make sure to set process or thread affinities either with your MPI launcher or with
mdrun -pin on
- Your runs show a major PP-PME load imbalance, likely because PME is not scaling well; this can be due to either of the above (or possibly other reasons too). Rule out the first two, then focus on improving the PME time.
- Consider using more PME ranks, or possibly PME order 5.
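
To make the pinning and PME-rank suggestions concrete, here is a rough sketch of a launch line. The rank counts, the run name and the binary name are placeholders I made up, not values taken from your logs, so adjust them to your setup. The script only prints the command so you can check it before submitting.

```shell
#!/bin/sh
# All values below are assumptions for illustration; adjust to your cluster.
NRANKS=96              # total number of MPI ranks (hypothetical)
NPME=24                # ranks dedicated to PME (hypothetical; try a few values)
DEFFNM=topol           # hypothetical run name
MDRUN="gmx_mpi mdrun"  # assumed binary name; older installs may use mdrun_mpi

# -npme sets the number of separate PME ranks, -pin on enables thread pinning.
CMD="mpirun -np $NRANKS $MDRUN -deffnm $DEFFNM -npme $NPME -pin on"
echo "$CMD"

# Uncomment to actually launch the run:
# $CMD
```

The PME order is not an mdrun option; it is set in the .mdp file (pme-order = 5), so changing it means regenerating the .tpr with grompp.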
Cheers,
Szilárd