Efficient parallelization scheme

GROMACS version: 2021.2
GROMACS modification: No

Hello all,

Thanks to the team who created this discussion forum. I have been using GROMACS for years and have had a lot of questions that took me a long time to resolve because I did not have anyone to ask. Again, thanks, this is a great idea.

My question is regarding parallelization. I have 4 nodes with the following configuration:
2 processor sockets per node, 16 cores per processor, and virtualization (hyper-threading) enabled = 64 logical CPUs per node. GROMACS is installed with MPI enabled and with statically linked libraries, in a path that is common to all nodes. I am using Intel MPI for parallelization of the programs across my nodes and SLURM to manage resources. Both SLURM and Intel MPI are working properly (tested), and GROMACS passes all of the tests at the “make check” step of the installation.
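
In case the build details matter, it was configured roughly along these lines (the install path is just a placeholder for our shared directory):

# MPI-enabled build, statically linked libraries, installed to a path visible from every node
cmake .. -DGMX_MPI=ON -DBUILD_SHARED_LIBS=OFF -DCMAKE_INSTALL_PREFIX=/shared/apps/gromacs-2021.2
make -j 16 && make check && make install
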
Now I want to parallelize my simulations across all the nodes. Short question: what is the best parallelization scheme for my nodes?
Long question: GROMACS runs efficiently when I run a simulation independently on one node. The “mdrun” program automatically detects all 64 CPUs and uses 1 MPI rank with 64 OpenMP threads to run the simulation. For my system, the performance is 131 ns/day when I run on just one node. When I run on multiple nodes, I am not able to go above 131 ns/day; the performance is either equal to the single-node performance or worse. How can I increase the performance in proportion to the number of nodes I use for my simulation? I am attaching to this thread my NVT equilibration output log from a run across multiple nodes. Please help. I really want to run long simulations.
The mdrun was launched using srun as follows:

salloc --nodes=2 --ntasks-per-node=16 --cpus-per-task=4
srun --nodes=2 --ntasks-per-node=16 --cpus-per-task=4 /path/to/gmx_mpi mdrun -s nvt.tpr -deffnm some_name -v -dlb yes -notunepme

If I remove the -notunepme argument, the performance is slightly lower.

I was not able to attach my log file to the thread, so I am hosting it somewhere else:

Hi,

You are trying to simulate a very small system (~20500 atoms), so you should not expect it to scale well to more than ~100 cores; with a fast network interconnect and possibly some careful parameter tuning, perhaps to 200 cores or so.

A few pointers:

  • What kind of network interconnect are you using? Make sure it is not Ethernet.
  • Make sure to set process or thread affinities, either with your MPI launcher or with mdrun -pin on.
  • Your runs have major PP-PME imbalance, likely because PME does not seem to scale well; this can be due to either of the above (or possibly other reasons too). Rule out the first two, then focus on improving the PME time.
  • Consider using more PME ranks, or possibly PME order 5 (see the sketch after this list).
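
For example, a minimal sketch based on your current launch line (the thread split and the -npme value are only placeholders to tune, not a recommendation):

# 2 nodes x 16 ranks x 4 OpenMP threads = 128 hardware threads; 8 of the 32 ranks dedicated to PME
# -pin on lets mdrun set the thread affinities itself
srun --nodes=2 --ntasks-per-node=16 --cpus-per-task=4 /path/to/gmx_mpi mdrun -s nvt.tpr -deffnm some_name -ntomp 4 -npme 8 -pin on

The PME order, in contrast, is an .mdp option (pme-order = 5), so changing it means re-running grompp to generate a new .tpr.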

Cheers,
Szilárd

Hi Szilárd,

Thank you for your reply and suggestions.

The network interconnect I have is 25 GbE SFP. Is that the same as Ethernet?

What is the correct way to set the -npme and -ntomp options so that I can increase the number of PME ranks? And what is a rank in the first place? I am new to MPI implementations and parallelization techniques, so I don’t understand the language used in the GROMACS documentation when it comes to parallel simulations. Do you have any links where I can read and learn about it?


Yes, see Gigabit Ethernet - Wikipedia

It is unlikely that you will be able to scale over Ethernet, especially with such a small system and its short iteration times. You can try, but your NIC, routers, and MPI stack have to be configured for the efficient collective communication that PME requires (and even then you may not get much better performance on 2-3 nodes than on a single node).

The total rank count times the number of threads per rank should be equal to the total hardware thread count (as listed in the log). -npme is generally adjusted so that the PP-PME load imbalance is minimized (see the load balancing report in the log).
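
As a worked example with your hardware (the particular split is only an illustration): 2 nodes x 64 hardware threads = 128, so 32 MPI ranks x 4 OpenMP threads per rank = 128 satisfies that rule. Adding -npme 8 then leaves 24 of the 32 ranks doing PP work and 8 doing PME, and you nudge -npme up or down until the load balancing report at the end of the log shows little imbalance.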

You can find some background information on this page:
https://manual.gromacs.org/current/user-guide/mdrun-performance.html

However, you may need to familiarize yourself a bit more with concepts such as MPI and threads (beyond what that brief GROMACS-specific content can offer) if you would like to better understand how to tune the parallelization settings of an HPC application like GROMACS.

Thank you, Szilárd. I will read more on how to optimize the PME load.

I was not at all aware of the limitations of GbE. Since we have already purchased it for the lab, we unfortunately have to stick with it.

I will look into my MPI implementation to make sure the nodes are communicating efficiently. If you have any suggestions/links on that front, please send them to me. Until now, I have only tried OpenMPI (terribly slow and full of errors) and Intel MPI (much better than OpenMPI, but not well documented).

Thanks again,