Query about Atom Index Ordering in GROMACS 2021.0 and 2023.3

Hello.

I am developing a constraint algorithm and primarily working with GROMACS 2021.0. In runs without domain decomposition and with constraints set to all-bonds, the atom indices appear to follow the same order as in the PDB file: for a protein with N atoms, the indices range from 0 to N-1.

However, in runs with domain decomposition, the indices of the protein atoms on each MPI rank are not contiguous, and the ordering is not clear to me. I have also observed that the latest GROMACS version (2023.3) applies a similar reordering of atom indices even in runs without domain decomposition.

I would like to understand the rationale behind this approach. Non-contiguous, unordered atom indices seem to negatively affect the algorithm P-LINCS uses to distribute constraints between tasks. Specifically, with domain decomposition the last task, the one executed sequentially, is assigned significantly more constraints than without domain decomposition.

Thank you for your assistance.

Yes, when DD atom ordering is enabled (haveDDAtomOrdering(*cr_) == true), the atoms are sorted by position (more specifically, based on the grid used for non-bonded interaction pair finding) instead of being kept in topology order. Under some conditions, this is also the case for single-rank simulations.
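As a rough illustration of what such a spatial reordering does (this is a minimal sketch, not the GROMACS implementation; all names here are hypothetical), consider binning atoms on a uniform grid and sorting them by cell index. The resulting local order follows space rather than the topology, which is why the indices no longer run 0..N-1 in PDB order:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical minimal sketch: assign each atom to a cell of a uniform
// nCells^3 grid spanning [0, boxLength) and sort atoms by cell index.
struct Atom {
    int    topologyIndex; // index as it appears in the topology/PDB
    double x, y, z;
};

// Map a position to a linearized cell index.
inline int cellIndex(const Atom& a, double boxLength, int nCells)
{
    const double invCell = nCells / boxLength;
    const int cx = std::min(nCells - 1, static_cast<int>(a.x * invCell));
    const int cy = std::min(nCells - 1, static_cast<int>(a.y * invCell));
    const int cz = std::min(nCells - 1, static_cast<int>(a.z * invCell));
    return (cx * nCells + cy) * nCells + cz;
}

// Return the atoms reordered by grid cell; atoms close in space end up
// close in memory, which is the locality benefit mentioned below.
std::vector<Atom> sortBySpatialGrid(std::vector<Atom> atoms, double boxLength, int nCells)
{
    std::stable_sort(atoms.begin(), atoms.end(),
                     [=](const Atom& a, const Atom& b) {
                         return cellIndex(a, boxLength, nCells)
                                < cellIndex(b, boxLength, nCells);
                     });
    return atoms;
}
```

After the sort, an atom's position in the local array carries no relation to its topology index; a separate mapping (see below) is needed to go between the two.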

IIRC this is done for memory locality when loading coordinates for the non-bonded interactions. A comment in the code mentions a 15% performance gain; I am not sure how up to date that figure is.

See domdec/partition.cpp around line 3100 for the relevant calls where the sorting occurs. Whether it occurs at all is decided in mdrun/runner.cpp, around line 1110 and above:

    const bool useDomainDecomposition =
            canUseDomainDecomposition
            && (PAR(cr)
                || (!useGpuForNonbonded && usingFullElectrostatics(inputrec->coulombtype)
                    && useDDWithSingleRank != 0)
                || useDDWithSingleRank == 1);

There is an environment variable to toggle this behavior (GMX_DD_SINGLE_RANK), though IIRC the sorted code path is intended to become the only code path at some indefinite point in the future.

There are also (of course) facilities to map sorted atom indices back to topology-order indices: see the class LocalAtomSet, and also search for ga2la.
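Conceptually, the two directions look something like the following sketch (the class and method names here are illustrative only, not the GROMACS API): local-to-global is a plain array lookup, while global-to-local ("ga2la") is a hash lookup that may fail for atoms not present on this rank.

```cpp
#include <cstddef>
#include <optional>
#include <unordered_map>
#include <vector>

// Hypothetical sketch of the two index mappings; names are illustrative.
class LocalAtomMapping {
public:
    explicit LocalAtomMapping(std::vector<int> localToGlobal)
        : localToGlobal_(std::move(localToGlobal))
    {
        for (std::size_t local = 0; local < localToGlobal_.size(); ++local) {
            globalToLocal_[localToGlobal_[local]] = static_cast<int>(local);
        }
    }

    // Local index -> global (topology) index: a plain array lookup.
    int globalIndex(int local) const { return localToGlobal_[local]; }

    // Global index -> local index, if the atom is present on this rank.
    std::optional<int> localIndex(int global) const
    {
        auto it = globalToLocal_.find(global);
        return it == globalToLocal_.end() ? std::nullopt : std::optional<int>(it->second);
    }

private:
    std::vector<int>             localToGlobal_;
    std::unordered_map<int, int> globalToLocal_;
};
```

The asymmetry matters: every local atom has a global index, but only a subset of global atoms is present on a given rank, hence the optional return in one direction.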

I cannot comment on how this affects constraint algorithm performance. Maybe @hess can, and also give more context on the reasons for this?


That’s all correct.

The performance issue with P-LINCS has little to do with the reordering. The main performance impact is due to constraints crossing domain boundaries, which requires MPI communication.

Often the order of atoms in the topology is near optimal for constraint algorithms. But spatial ordering of atoms, as with DD, is not much worse.
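To make the boundary-crossing point concrete, here is a small sketch (not GROMACS code; the names are hypothetical): with domain decomposition each atom is owned by some rank, and a constraint whose two atoms sit on different ranks crosses a domain boundary and therefore requires communication in P-LINCS.

```cpp
#include <array>
#include <vector>

// A constraint couples two atoms, identified here by global indices.
using Constraint = std::array<int, 2>;

// Count constraints whose two atoms are owned by different ranks;
// these are the ones that need MPI communication.
int countCrossingConstraints(const std::vector<Constraint>& constraints,
                             const std::vector<int>&        rankOfAtom)
{
    int crossing = 0;
    for (const Constraint& c : constraints) {
        if (rankOfAtom[c[0]] != rankOfAtom[c[1]]) {
            ++crossing;
        }
    }
    return crossing;
}
```

The fraction of crossing constraints grows with the number of domains, which is why the communication cost, rather than the index ordering itself, dominates.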

Thank you very much for your reply.

> There are also (of course) facilities to map sorted atom indexes to topology order indexes, see class LocalAtomSet, also search for ga2la.

That is precisely what I need. Does the cr->dd->globalAtomIndices array, accessible inside the Lincs class in GROMACS 2020, provide similar functionality?

> The performance issue with P-LINCS has little to do with the reordering. The main performance impact is due to constraints crossing domain boundaries, which requires MPI communication.
> Often the order of atoms in the topology is near optimal for constraint algorithms. But spatial ordering of atoms, as with DD, is not much worse.

Yes, this is probably a minor concern compared with the MPI communication. On our cluster (2 sockets x 24 cores per node), we see that almost 15% of the constraints are assigned to P-LINCS' sequential task with DD active, while less than 1% are assigned without DD. I see that P-LINCS' distribution algorithm differs in newer GROMACS versions, so this may already have been addressed.

ga2la translates from global atom indices to local atom indices, as the (too short) name implies. cr->dd->globalAtomIndices goes the other way around.

In the 2022 release I extracted a bit more parallelism from LINCS and the sequential task got smaller for most cases.

Thank you very much. This has been very helpful.