Poor performance of simulation with Colvars module in GROMACS 2024.1 beta

GROMACS version: 2024.1 beta
GROMACS modification: No

Dear Gromacs community,

I have the latest version of GROMACS (2024.1 beta) installed on the local supercomputer. When I perform a calculation with the Colvars options in the .mdp file (colvars-active = yes, colvars-configfile = file.colvars), I get very poor performance: 20 ns/day.

In “file.colvars” I specify a simple reaction coordinate that pulls two atom groups together.
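For reference, a minimal Colvars input of this kind looks roughly like the sketch below (the atom numbers, bias center, and force constant are placeholders, not my actual values):

```
colvar {
    name dist
    distance {
        group1 { atomNumbers 101 102 103 104 105 106 }
        group2 { atomNumbers 2001 2002 2003 2004 2005 2006 2007 2008 }
    }
}

harmonic {
    colvars       dist
    centers       1.0     # bias center, in the engine's length unit (nm for GROMACS)
    forceConstant 1000.0  # placeholder value
}
```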

When running unbiased MD (with exactly the same .mdp options, but without Colvars), the performance is ~245 ns/day.

Can you please let me know how the performance of the simulation with Colvars can be improved?

The system has 365,000 atoms, and the simulation is run on 16 nodes × 128 CPU cores/node = 2048 cores.

Thanks in advance.

Best regards,
Oleksii

Hi Oleksii, thanks for your feedback. The current GROMACS-Colvars interface has so far been designed to prioritize correctness and consistency with other GROMACS features. However, at this stage the Colvars computation in a GROMACS run is still carried out on a single MPI rank. See the following:
https://colvars.github.io/gromacs-2024/colvars-refman-gromacs.html#sec:colvar_atom_groups_scaling
https://manual.gromacs.org/documentation/2024-beta/user-guide/mdp-options.html#mdp-colvars

So a performance penalty is expected, and its magnitude depends on the number of atoms requested by Colvars. How many atoms are requested for each group in your case?

If you can manage to define the relevant distance with fewer atoms (e.g. for proteins, using just the alpha carbons instead of all atoms), the performance loss may drop to an acceptable level. Another option would be to run the Colvars computation less often, by setting timeStepFactor to a value higher than 1 (the default).
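As a sketch, the second option is set per colvar in the Colvars configuration file (the variable definition and atom numbers here are placeholders):

```
colvar {
    name dist
    timeStepFactor 2   # evaluate this variable (and apply its bias) every 2 MD steps
    distance {
        group1 { atomNumbers 1 2 3 }
        group2 { atomNumbers 4 5 6 }
    }
}
```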

Hope that helps. It would be very useful if you could confirm whether these approaches work for you (they did in our tests). If not, please provide more details about how you are launching GROMACS.

Thanks!
Giacomo

Dear Giacomo,

Thank you very much for your suggestions.

My collective variable is defined as the distance between two atom groups. The first group contains 6 atoms and the second 8 atoms. It describes pulling a ligand towards the redox cofactor in a protein.

I am launching GROMACS in a conventional fashion with mdrun:
srun gmx_mpi mdrun -s prod_Q1_pull -dlb yes

having previously specified the Colvars options in the .mdp file:
colvars-active = yes
colvars-configfile = distPull.colvars

For benchmarking, I launched shorter and longer simulations and noticed one unusual feature:
for a very short simulation (~50 ps), the performance with Colvars is good (127 ns/day); however, for much longer trajectories (~30 ns), it drops by a factor of ~6 to the reported 20 ns/day. Can you please explain the reason for this?

For these short simulations (~50 ps), I played around with timeStepFactor. Upon increasing it from 1 to 100, the performance rises from 127 ns/day to only 138 ns/day, which is not significant.
Moreover, for my setup it is not physically correct to increase timeStepFactor: in the umbrella sampling runs, I need to bias my reaction coordinate at every step of the simulation; if there are interruptions in the bias, the results will not be reliable.

For another test, I reduced the first atom group to 2 atoms and the second to 1 atom. However, this did not affect the calculation speed.

If you have some further suggestions on how to improve the performance, I would be glad to try them out.

Thanks a lot.

Best regards,
Oleksii

Thanks for the additional details.

With those group sizes, there should not be such a massive performance impact. They are also so small that they are probably fast-moving (i.e. timeStepFactor might not be appropriate) and cannot be reduced any further. The overhead here is almost certainly coming from the extra communication required by Colvars, because the extra computation for those two groups is almost negligible.

Here are my next suggestions:

  • You are using ~2,000 cores for ~360k atoms. This seems to work well for your combination of simulation and cluster, but you may be approaching the point where GROMACS no longer scales well. Adding Colvars may shift that point, so any existing scaling data should be re-measured with Colvars enabled. You might find a configuration with fewer nodes where the performance loss is not as bad.
  • Try -dlb auto or -dlb no, and the related mdrun flags.
  • Try explicitly setting the number of OpenMP threads (-ntomp flag) to reduce the amount of communication going through MPI on each node.
  • If you can log into the compute node, can you try monitoring the memory usage over time?
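
For example, the second and third suggestions could be combined in the launch command like this (a sketch only; the Slurm options and thread counts are assumptions that must be adapted to your cluster):

```shell
# 32 MPI ranks/node x 4 OpenMP threads/rank = 128 cores/node (hypothetical split)
srun --ntasks-per-node=32 --cpus-per-task=4 \
     gmx_mpi mdrun -s prod_Q1_pull -dlb auto -ntomp 4
```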

Thanks and again, hope that helps.

Giacomo

If your biasing setup is simple, you might be able to use the GROMACS pull code, potentially combined with AWH if you want to do enhanced sampling. Both the pull code and AWH likely have much lower communication overhead.

Another tip is to reduce the number of nodes. I would think that even without biasing you are close to the scaling limit and have lower parallel efficiency. Have you checked the scaling?


Dear Giacomo,

Thanks a lot for your recommendations. Now I have obtained a much better performance with Colvars.

As you suggested, I ran scaling tests. First, I took my setup (with dynamic load balancing) and launched short runs (2 min each) with different numbers of nodes (green line). Then I repeated the scaling test with the computation time increased to 1 hour; in this case, the performance dropped severalfold (blue line).

Finally, I switched off dynamic load balancing, and the performance shot up (red line).

I tested the same setup on longer runs (i.e. 24 hours on 8 nodes), and the performance dropped by around 10 ns/day (from ~145 ns/day to ~135 ns/day).

Do you think this drop can be alleviated by fine-tuning the -ntomp flag? By default, it is set to 1 thread per core, correct? Do you mean making the number of threads smaller than the number of cores?

You also asked:

If you can log into the compute node, can you try monitoring the memory usage over time?

I logged in to one of the 8 computing nodes during the calculation, and the memory usage was stable: 20 GB throughout the simulation.

Best regards,
Oleksii

Dear @hess,

Thanks for this suggestion.

As far as I understand the pull code, there is no way to choose a “free distance” as the reaction coordinate: GROMACS will pull the two atom groups not freely in space, but along a particular direction.

Actually, my umbrella sampling setup pulls a ligand along a long, curved protein cavity. Choosing the reaction coordinate as a single distance (as implemented in Colvars) allows the ligand to adapt to the shape of the cavity, whereas the pull code would induce clashes between the ligand and the protein.

Or is there a way to leave the pulling direction unrestrained in the pull code?

Thanks.

Hi @alzdorev, thanks for reporting those results. I’m pleased that you were able to regain much of the lost performance. Given the tiny groups you are using, the slowdown you saw was much larger than one could reasonably expect.

It also looks from your data as if some improvements may be needed in the GROMACS-Colvars interface, to avoid feeding inaccurate information to the load balancer.

There were two reasons why I suggested tuning -ntomp. One was to keep the same number of work units, but make some of them OpenMP threads instead of MPI tasks. How much that helps depends on which MPI library/configuration you are using. Ideally, on a well-configured cluster, MPI + OpenMP can be marginally better than MPI alone (i.e. -ntomp 1).

The other reason is to prevent GROMACS from trying to use hardware threads, which some cluster admins choose to expose as independent cores (even though they really are not). In those cases GROMACS may choose -ntomp 2 by default.
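As an illustration of both points, explicit thread counts and pinning can be requested like this (a sketch only; the per-node numbers are assumptions for a 128-core node and the Slurm options depend on your cluster):

```shell
# 64 MPI ranks/node x 2 OpenMP threads/rank, pinned to physical cores
srun --ntasks-per-node=64 --cpus-per-task=2 \
     gmx_mpi mdrun -s prod_Q1_pull -ntomp 2 -pin on
```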

To clarify, it is good practice to look into these last two issues for all types of GROMACS runs, as there is always something to gain. How much depends on the specific simulation, and as you have seen, it can be hard to predict.

Giacomo

Something goes very wrong here with the DLB; I don’t know what is causing it. @giacomo.fiorin, does Colvars do a lot of work on the main rank before the first MPI communication and/or after the last MPI communication within Colvars?
If so, then that could throw DLB off, as it will shrink the domain on the main rank to try to reduce the time spent there, but that does not help once the Colvars main-rank time becomes dominant.

Still, mdrun also has a mechanism to turn off DLB when the performance deteriorates compared to before DLB was turned on. I wonder why that mechanism doesn’t kick in.

Yes, the GROMACS pull code has distances and angles between groups as the basic units to work with. Paths are not supported.
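For reference, a free center-of-mass distance restraint with the pull code can be set up in the .mdp file roughly as follows (a sketch: the group names are placeholders for index groups, and the restraint parameters are arbitrary). With geometry = distance, only the COM-COM distance itself is restrained; no direction is fixed:

```
pull                  = yes
pull-ngroups          = 2
pull-ncoords          = 1
pull-group1-name      = Ligand      ; placeholder index group
pull-group2-name      = Cofactor    ; placeholder index group
pull-coord1-type      = umbrella
pull-coord1-geometry  = distance    ; free COM-COM distance, no fixed direction
pull-coord1-groups    = 1 2
pull-coord1-init      = 1.0         ; nm
pull-coord1-k         = 1000        ; kJ mol^-1 nm^-2
```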

Hi @hess, not really (as far as I understand what you mean).

From within ColvarsForceProvider::calculateForces(), nothing substantial happens before the coordinates are gathered with the GROMACS function communicate_group_positions(); then comes the sequential computation on MPI rank 0, followed by the forces being broadcast to all ranks through nblock_bc(), which is also a GROMACS function.

After the steps above, the only remaining computation (done by all ranks) is adding up the contributions of the local atoms to the virial, using the addVirialContribution() member function. @Hub, please comment if I’m missing something.

If there are additional steps required to properly inform the DLB, we should probably add them. Perhaps such a change could make it into the patch release 2024.1? I think it would qualify as a fix.

Giacomo

Forgot to add: there are no MPI calls from within the main loop inside the Colvars library itself.

Then it is strange that the load balancer gets confused. I thought we had the calls to algorithms with global communication in the right spot. They should also be close to where the pull code is called, which has rather similar behavior.