Curious large force problem in path MD implementation with DD

I’m working on an implementation of the path MD algorithm (sometimes called “MD in trajectory space”) in the modular simulator, which builds upon the multisim feature. To be exact, I’m trying to make use of the domain decomposition mechanism. Please see https://hackmd.io/oCnHzQwyQ3SSz2NW0-aZpQ for some context.

In the core implementation, the function do_force is called 3 times per step. One call is identical to the one made by standard MD in each step; the remaining two, which are used to compute a derivative numerically, are called with flags adapted_flags = flags & ~(GMX_FORCE_ENERGY | GMX_FORCE_VIRIAL | GMX_FORCE_DHDL) to reduce the amount of unnecessary computation. For a given system, the path MD simulation with domain decomposition support works fine for certain system sizes and time steps. But, for example, with a system of 2197 atoms, the simulation hangs consistently (with asynchronous or synchronous communication) at step 1919.

The cause of this hang is strongly suspected to be the fact that (only) one force value in forcesPlus is (-5808913203068928.000, 1301582188642304.000, -4627847662534656.000). What is puzzling is that the corresponding position coordinates in positionsPlus are (0.667, 0.716, 0.146) [note: the box size here is 2.0, 2.0, 2.0], i.e. seemingly normal, and the minimum distance between all atom pairs in positionsPlus is > 1E-4. The same was found to be the case with forcesMinus and positionsMinus. The actual hang happens during a neighbour-searching event, possibly due to the large force values.
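For concreteness, here is a minimal sketch of that flag masking. The bit values are placeholders standing in for the real GMX_FORCE_* defines of the legacy force-flags API; only the masking logic is the point.

```cpp
#include <cstdio>

// Placeholder bit values -- NOT the real GROMACS definitions.
constexpr int GMX_FORCE_ENERGY = 1 << 0;
constexpr int GMX_FORCE_VIRIAL = 1 << 1;
constexpr int GMX_FORCE_DHDL   = 1 << 2;
constexpr int GMX_FORCE_FORCES = 1 << 3;

int main()
{
    // Flags used for the regular do_force call of the step.
    const int flags = GMX_FORCE_FORCES | GMX_FORCE_ENERGY | GMX_FORCE_VIRIAL | GMX_FORCE_DHDL;

    // For the two extra calls that only feed the numerical derivative,
    // energies, the virial and dH/dl are not needed, so those bits are cleared.
    const int adaptedFlags = flags & ~(GMX_FORCE_ENERGY | GMX_FORCE_VIRIAL | GMX_FORCE_DHDL);

    std::printf("flags = 0x%x, adaptedFlags = 0x%x\n",
                static_cast<unsigned>(flags), static_cast<unsigned>(adaptedFlags));
    return 0;
}
```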

I’ve attached a log file.
debug_dump.log (325.1 KB)

Any help towards debugging this problem would be greatly appreciated, as solving it would bring a long-standing task to completion. I will gladly provide additional information when necessary. Thanks a lot in advance.

A general strategy could be to build GROMACS with debug symbols, for instance with the CMake argument -DCMAKE_BUILD_TYPE=RELWITHDEBINFO in addition to your other flags, then attach a debugger such as gdb to the hung process (e.g. if the path to the gmx binary is /path/to/gmx and the process ID is 5422, you can run gdb /path/to/gmx 5422).

You can then find out where every thread is at that moment with a command like thread apply all bt (for gdb), including which file/line corresponds to that point in the source code. Most of the threads are likely to be at a wait point (functions with names like omp_), and hopefully one of them is at the actual problematic code line.

Thanks for your speedy reply and debugging tip. However, I’m not trying to find the cause of the hanging per se; rather, I’m trying to find out why I’m getting such a large force value for such a normal input coordinate in the step immediately before the hang.

I would have two hypotheses for the source of the large force: either it is “legitimate”, linked to the system configuration or to how your algorithm computes this forcesPlus value, or it could be uninitialized or overwritten memory.

To rule the first case in or out: have you looked at the actual geometry of the system around the atom (not just this particular position)? Anything remarkable? Saving the position from the step before and comparing might also be helpful.

For the second case: is the force value reproducible? Is it ballpark always the same value? Is that also the case on a different machine?


A distance of 0.1 nm between LJ atoms would already get you such large forces, so the fact that all distances are larger than 1e-4 is not at all sufficient, I would say.
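For a rough sense of scale (a hedged estimate with hypothetical LJ parameters $\sigma = 0.3$ nm and $\epsilon = 1$ kJ/mol, not values from the actual topology), the LJ force magnitude at $r = 0.1$ nm already dwarfs typical atomic forces:

$$
F(r) = \frac{24\,\epsilon}{r}\left[2\left(\frac{\sigma}{r}\right)^{12} - \left(\frac{\sigma}{r}\right)^{6}\right],
\qquad
F(0.1\,\mathrm{nm}) = \frac{24\cdot 1}{0.1}\left[2\cdot 3^{12} - 3^{6}\right]
\approx 2.5\times 10^{8}\ \mathrm{kJ\,mol^{-1}\,nm^{-1}}.
$$

With these same hypothetical parameters, an overlap of a few hundredths of a nanometre pushes this into the $10^{15}$ range seen above.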

For normal MD mdrun has the -pforce option to print large forces. Maybe you can use the function used by that option to find the atom pairs that cause trouble.
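For illustration, the idea behind -pforce can be mimicked with a simple scan over the force array. This is a sketch of the idea only, not the internal GROMACS routine; the Rvec struct, the threshold and the example values are stand-ins.

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

struct Rvec { double x, y, z; };

// Report every atom whose force magnitude exceeds the threshold.
void printLargeForces(const std::vector<Rvec>& forces, double threshold, int step)
{
    for (std::size_t i = 0; i < forces.size(); ++i)
    {
        const Rvec&  f    = forces[i];
        const double norm = std::sqrt(f.x * f.x + f.y * f.y + f.z * f.z);
        if (norm > threshold)
        {
            std::printf("step %d: atom %zu |F| = %g (%g %g %g)\n",
                        step, i, norm, f.x, f.y, f.z);
        }
    }
}

int main()
{
    // Hypothetical data: one normal force and one suspiciously large one.
    const std::vector<Rvec> forces = { {1.0, 2.0, 3.0}, {-5.8e15, 1.3e15, -4.6e15} };
    printLargeForces(forces, 1e4, 1919);
    return 0;
}
```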

For the second case: is the force value reproducible? Is it ballpark always the same value?

On each machine, yes.

Is that also the case on a different machine?

No, the values are approximately equally large, but e.g. another machine gave me (-5929429985394688.000, 1976476572319744.000, -2591112699052032.000), which is a different value. Interestingly, it is the atom with the same global atom index on all machines.

Thanks for this suggestion. I guess this speaks for the uninitialized/overwritten memory possibility.

If you think it might be uninitialized or out-of-bounds memory access, valgrind is your friend.

I have made the assumption that, after each re-partitioning, all home-range atoms are conserved, i.e. if a home-range atom leaves one cell, it must reappear in another cell as a new home-range atom there. In reality, I’m observing that in 965 out of 5000 re-partitioning events, home-range atoms are not conserved. Any idea what could be the cause?

Your phrasing is a bit confusing. Each atom is always home/local on exactly one rank/domain. This assignment changes every domain decomposition step.

If I sum up the gained atom counts and subtract the lost atom counts over all domains, I must end up with zero. In my case, in 965/5000 re-partitionings this does not hold.
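The equivalent check, written as a minimal sketch assuming one MPI rank per domain: the per-rank home-atom counts must always sum to the global atom count after re-partitioning. The function and variable names (checkHomeAtomConservation, homeAtomCount) are hypothetical, not GROMACS internals.

```cpp
#include <mpi.h>
#include <cstdio>

// After each re-partitioning, the home atoms of all domains together must
// account for every atom exactly once.
void checkHomeAtomConservation(int homeAtomCount, int globalAtomCount, long step)
{
    int totalHomeAtoms = 0;
    MPI_Allreduce(&homeAtomCount, &totalHomeAtoms, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    if (totalHomeAtoms != globalAtomCount)
    {
        int rank = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0)
        {
            std::printf("step %ld: total home atoms %d != global atoms %d\n",
                        step, totalHomeAtoms, globalAtomCount);
        }
    }
}

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    // Hypothetical values; in the real code these come from the DD state.
    checkHomeAtomConservation(/*homeAtomCount=*/1099, /*globalAtomCount=*/2197, /*step=*/1919);
    MPI_Finalize();
    return 0;
}
```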

May do_force be called multiple times with different input coordinate vectors within the same step? Which flags should I unset when I want to only compute the forces for a numerical derivative?

Then you are doing something wrong.

You can call do_force multiple times per step. I assume you do domain decomposition for every new conformation along the path. You then only need the pair-search flag for the first call to do_force. But why do you need multiple calls to do_force? The gradient of the energy is already given by the force.

I assume that not using domain decomposition at all is going to be much more efficient. You should spread points along the trajectory over ranks, not the system. But you would anyhow need to take domain decomposition into account, as we will likely soon have this always active.

The gradient of the energy is already given by the force.

We need to compute the Hessian of the energy times another vector, say, eta. So we compute positionsPlus = positions + eta * (delta * eps) and positionsMinus = positions - eta * (delta * eps), and then, hessianUTimesEta = (forcesPlus - forcesMinus) / (delta * eps). It is for this that I want to use the proper force flags such that I do as little work as required.
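For reference, a minimal sketch of the finite-difference construction described above, using plain std::vector instead of GROMACS data types; forcesPlus and forcesMinus are assumed to come from the second and third do_force calls, h stands for delta * eps, and the prefactor is taken as written in the post.

```cpp
#include <vector>

// Displaced coordinates: positions +/- eta * h, with h = delta * eps.
std::vector<double> displace(const std::vector<double>& positions,
                             const std::vector<double>& eta,
                             double h, double sign)
{
    std::vector<double> out(positions.size());
    for (std::size_t i = 0; i < positions.size(); ++i)
    {
        out[i] = positions[i] + sign * h * eta[i];
    }
    return out;
}

// (forcesPlus - forcesMinus) / h, as in the post.
std::vector<double> hessianTimesEta(const std::vector<double>& forcesPlus,
                                    const std::vector<double>& forcesMinus,
                                    double h)
{
    std::vector<double> out(forcesPlus.size());
    for (std::size_t i = 0; i < forcesPlus.size(); ++i)
    {
        out[i] = (forcesPlus[i] - forcesMinus[i]) / h;
    }
    return out;
}

int main()
{
    const std::vector<double> x   = {0.5, 0.5, 0.5};
    const std::vector<double> eta = {1.0, 0.0, 0.0};
    const double h = 1e-4;  // stands for delta * eps
    auto xPlus  = displace(x, eta, h, +1.0);
    auto xMinus = displace(x, eta, h, -1.0);
    // forcesPlus / forcesMinus would come from the 2nd and 3rd do_force calls.
    (void)xPlus; (void)xMinus;
    return 0;
}
```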

I assume that not using domain decomposition at all is going to be much more efficient. You should spread points along the trajectory over ranks, not the system. But you would anyhow need to take domain decomposition into account, as we will likely soon have this always active.

In this algorithm, the physical system is replicated across the ranks, each of which corresponds to a simulation in the multisim terminology. And we hope to increase resource utilization for each simulation by using DD.

With this information, could you please tell me which force flags would be absolutely necessary for the 2nd and 3rd calls?

It seems you are using an old version of GROMACS. Then the flags you need are set by:
GMX_FORCE_ALLFORCES

The 2024 release and the current main branch use workload-flag structs instead.


Thanks very much, you have been very helpful so far. Would I also need to use GMX_FORCE_STATECHANGED | GMX_FORCE_NS?

That would only be needed at the first call when moving to a new point on the path. You would also need GMX_FORCE_DYNAMICBOX for this first call when the box might change.

In the second and third calls, we’re providing new, separate coordinate vectors as input parameters, so I thought we might need to use the above flags, and GMX_FORCE_DYNAMICBOX in case of activated DLB. But as of now, with just GMX_FORCE_ALLFORCES, it seems to be working quite well. I feel that a little more documentation about these flags somewhere would be helpful :)

Thanks @hess and @ebriand, both of your suggestions have helped me resolve the problem and find the bug!

If you only modify the coordinates, and with a delta that is smaller than the pairlist buffer, you only need to pass GMX_FORCE_ALLFORCES for the second and later calls.
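Putting the advice from this thread together, a minimal sketch of the flag selection. The bit values are placeholders standing in for the legacy GMX_FORCE_* defines, not the real GROMACS definitions, and the function name is hypothetical.

```cpp
// Placeholder bit values -- NOT the real GROMACS definitions.
constexpr int GMX_FORCE_STATECHANGED = 1 << 0;
constexpr int GMX_FORCE_NS           = 1 << 1;
constexpr int GMX_FORCE_DYNAMICBOX   = 1 << 2;
constexpr int GMX_FORCE_ALLFORCES    = 1 << 3;

// Pick flags for one do_force call of the path-MD step.
int selectForceFlags(bool firstCallAtNewPathPoint, bool boxMayChange)
{
    int flags = GMX_FORCE_ALLFORCES;
    if (firstCallAtNewPathPoint)
    {
        // First call at a new point on the path: new coordinates and a pair search.
        flags |= GMX_FORCE_STATECHANGED | GMX_FORCE_NS;
        if (boxMayChange)
        {
            flags |= GMX_FORCE_DYNAMICBOX;
        }
    }
    // Second and third (finite-difference) calls: forces only, provided the
    // displacement stays within the pairlist buffer.
    return flags;
}
```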


Have a look at the new flags in src/gromacs/mdtypes/simulation_workload.h. They have better documentation. Please report if anything there is unclear.