Curious large force problem in path MD implementation with DD

I’m working on an implementation of the path MD algorithm (sometimes called “MD in trajectory space”) in the modular simulator, which builds upon the multisim feature. To be exact, I’m trying to make use of the domain decomposition mechanism. Please see https://hackmd.io/oCnHzQwyQ3SSz2NW0-aZpQ for some context.

In the core implementation, the function do_force is called 3 times per step. One call is identical to the one made by standard MD in each step; the remaining two, which are used to compute a derivative numerically, are called with flags adapted_flags = flags & ~(GMX_FORCE_ENERGY | GMX_FORCE_VIRIAL | GMX_FORCE_DHDL) to reduce the amount of unnecessary computation. For a given system, the path MD simulation with domain decomposition support works fine for certain system sizes and time steps. But, for example, with a system of 2197 atoms, the simulation hangs consistently (with asynchronous or synchronous communication) at step 1919.

The cause of this hang is strongly suspected to be the fact that (only) one force value in forcesPlus is (-5808913203068928.000, 1301582188642304.000, -4627847662534656.000). What is puzzling is that the corresponding position coordinates in positionsPlus are (0.667, 0.716, 0.146) [note: the box size here is 2.0, 2.0, 2.0], i.e. seemingly normal, and the minimum distance between all atom pairs in positionsPlus is > 1E-4. The same was found to be the case with forcesMinus and positionsMinus. The actual hang happens during a neighbour-searching event, possibly due to the large force values.
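For concreteness, here is a minimal sketch of that flag masking. The bit values are placeholders standing in for the real GMX_FORCE_* defines of the legacy force-flags API; only the masking logic is the point.

```cpp
#include <cstdio>

// Placeholder bit values -- NOT the real GROMACS definitions.
constexpr int GMX_FORCE_ENERGY = 1 << 0;
constexpr int GMX_FORCE_VIRIAL = 1 << 1;
constexpr int GMX_FORCE_DHDL   = 1 << 2;
constexpr int GMX_FORCE_FORCES = 1 << 3;

int main()
{
    // Flags used for the regular do_force call of the step.
    const int flags = GMX_FORCE_FORCES | GMX_FORCE_ENERGY | GMX_FORCE_VIRIAL | GMX_FORCE_DHDL;

    // For the two extra calls that only feed the numerical derivative,
    // energies, the virial and dH/dl are not needed, so those bits are cleared.
    const int adaptedFlags = flags & ~(GMX_FORCE_ENERGY | GMX_FORCE_VIRIAL | GMX_FORCE_DHDL);

    std::printf("flags = 0x%x, adaptedFlags = 0x%x\n",
                static_cast<unsigned>(flags), static_cast<unsigned>(adaptedFlags));
    return 0;
}
```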

I’ve attached a log file.
debug_dump.log (325.1 KB)

Any help towards debugging this problem would be greatly appreciated, as solving it would bring a long-standing task to completion. I will gladly provide additional information when necessary. Thanks a lot in advance.

A general strategy could be to build GROMACS with debug symbols, for instance with the CMake argument -DCMAKE_BUILD_TYPE=RELWITHDEBINFO in addition to your other flags, then attach a debugger such as gdb to the hung process (e.g. if the path to the gmx binary is /path/to/gmx and the process ID is 5422, you can run gdb /path/to/gmx 5422).

You can then find out where every thread is at that moment with a command like thread apply all bt (for gdb), including which file/line corresponds to that point in the source code. Most of the threads are likely to be at a wait point (functions with names like omp_), and hopefully one of them is at the actual problematic code line.

Thanks for your speedy reply and debugging tip. However, I’m not trying to find the cause of the hanging per se; rather, I’m trying to find out why I’m getting such a large force value for such a normal input coordinate in the step immediately before the hang.

I would have two hypotheses for the source of the large force: either it is “legitimate”, linked to the system configuration or to how your algorithm computes this forcesPlus value, or it could be uninitialized or overwritten memory.

To rule the first case in or out: have you looked at the actual geometry of the system around the atom (not just this particular position)? Anything remarkable? Saving the position from the step before and comparing might also be helpful.

For the second case: is the force value reproducible? Is it ballpark always the same value? Is that also the case on a different machine?


A distance of 0.1 nm between LJ atoms would already get you such large forces, so the fact that all distances are larger than 1e-4 is not at all sufficient, I would say.
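For a rough sense of scale (a hedged estimate with hypothetical LJ parameters $\sigma = 0.3$ nm and $\epsilon = 1$ kJ/mol, not values from the actual topology), the LJ force magnitude at $r = 0.1$ nm already dwarfs typical atomic forces:

$$
F(r) = \frac{24\,\epsilon}{r}\left[2\left(\frac{\sigma}{r}\right)^{12} - \left(\frac{\sigma}{r}\right)^{6}\right],
\qquad
F(0.1\,\mathrm{nm}) = \frac{24\cdot 1}{0.1}\left[2\cdot 3^{12} - 3^{6}\right]
\approx 2.5\times 10^{8}\ \mathrm{kJ\,mol^{-1}\,nm^{-1}}.
$$

With these same hypothetical parameters, an overlap of a few hundredths of a nanometre pushes this into the $10^{15}$ range seen above.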

For normal MD mdrun has the -pforce option to print large forces. Maybe you can use the function used by that option to find the atom pairs that cause trouble.
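For illustration, the idea behind -pforce can be mimicked with a simple scan over the force array. This is a sketch of the idea only, not the internal GROMACS routine; the Rvec struct, the threshold and the example values are stand-ins.

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

struct Rvec { double x, y, z; };

// Report every atom whose force magnitude exceeds the threshold.
void printLargeForces(const std::vector<Rvec>& forces, double threshold, int step)
{
    for (std::size_t i = 0; i < forces.size(); ++i)
    {
        const Rvec&  f    = forces[i];
        const double norm = std::sqrt(f.x * f.x + f.y * f.y + f.z * f.z);
        if (norm > threshold)
        {
            std::printf("step %d: atom %zu |F| = %g (%g %g %g)\n",
                        step, i, norm, f.x, f.y, f.z);
        }
    }
}

int main()
{
    // Hypothetical data: one normal force and one suspiciously large one.
    const std::vector<Rvec> forces = { {1.0, 2.0, 3.0}, {-5.8e15, 1.3e15, -4.6e15} };
    printLargeForces(forces, 1e4, 1919);
    return 0;
}
```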

For the second case: is the force value reproducible? Is it ballpark always the same value?

On each machine, yes.

Is that also the case on a different machine?

No, the values are approximately equally large, but e.g. another machine gave me (-5929429985394688.000, 1976476572319744.000, -2591112699052032.000), which is a different value. Interestingly, it is the atom with the same global atom index on all machines.

Thanks for this suggestion. I guess this speaks for the uninitialized/overwritten memory possibility.

If you think it might be uninitialized or out-of-bounds memory access, valgrind is your friend.

I have made the assumption that, after each re-partitioning, all home-range atoms are conserved, i.e. if a home-range atom leaves one cell, it must reappear in another cell as a new home-range atom there. In reality, I’m observing that in 965 out of 5000 re-partitioning events, home-range atoms are not conserved. Any idea what could be the cause?

Your phrasing is a bit confusing. Each atom is always home/local on exactly one rank/domain. This assignment changes every domain decomposition step.

If I sum up the gained atom counts and subtract the lost atom counts over all domains, I must end up with zero. In my case, in 965/5000 re-partitionings this does not hold.
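The equivalent check, written as a minimal sketch assuming one MPI rank per domain: the per-rank home-atom counts must always sum to the global atom count after re-partitioning. The function and variable names (checkHomeAtomConservation, homeAtomCount) are hypothetical, not GROMACS internals.

```cpp
#include <mpi.h>
#include <cstdio>

// After each re-partitioning, the home atoms of all domains together must
// account for every atom exactly once.
void checkHomeAtomConservation(int homeAtomCount, int globalAtomCount, long step)
{
    int totalHomeAtoms = 0;
    MPI_Allreduce(&homeAtomCount, &totalHomeAtoms, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    if (totalHomeAtoms != globalAtomCount)
    {
        int rank = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0)
        {
            std::printf("step %ld: total home atoms %d != global atoms %d\n",
                        step, totalHomeAtoms, globalAtomCount);
        }
    }
}

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    // Hypothetical values; in the real code these come from the DD state.
    checkHomeAtomConservation(/*homeAtomCount=*/1099, /*globalAtomCount=*/2197, /*step=*/1919);
    MPI_Finalize();
    return 0;
}
```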

May do_force be called multiple times with different input coordinate vectors within the same step? Which flags should I unset when I want to only compute the forces for a numerical derivative?

Then you are doing something wrong.

You can call do_force multiple times per step. I assume you do domain decomposition for every new conformation along the path. You then only need the pair-search flag for the first call to do_force. But why do you need multiple calls to do_force? The gradient of the energy is already given by the force.

I assume that not using domain decomposition at all is going to be much more efficient. You should spread points along the trajectory over ranks, not the system. But you would anyhow need to take domain decomposition into account, as we will likely soon have this always active.

The gradient of the energy is already given by the force.

We need to compute the Hessian of the energy times another vector, say, eta. So we compute positionsPlus = positions + eta * (delta * eps) and positionsMinus = positions - eta * (delta * eps), and then, hessianUTimesEta = (forcesPlus - forcesMinus) / (delta * eps). It is for this that I want to use the proper force flags such that I do as little work as required.
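For reference, a minimal sketch of the finite-difference construction described above, using plain std::vector instead of GROMACS data types; forcesPlus and forcesMinus are assumed to come from the second and third do_force calls, h stands for delta * eps, and the prefactor is taken as written in the post.

```cpp
#include <vector>

// Displaced coordinates: positions +/- eta * h, with h = delta * eps.
std::vector<double> displace(const std::vector<double>& positions,
                             const std::vector<double>& eta,
                             double h, double sign)
{
    std::vector<double> out(positions.size());
    for (std::size_t i = 0; i < positions.size(); ++i)
    {
        out[i] = positions[i] + sign * h * eta[i];
    }
    return out;
}

// (forcesPlus - forcesMinus) / h, as in the post.
std::vector<double> hessianTimesEta(const std::vector<double>& forcesPlus,
                                    const std::vector<double>& forcesMinus,
                                    double h)
{
    std::vector<double> out(forcesPlus.size());
    for (std::size_t i = 0; i < forcesPlus.size(); ++i)
    {
        out[i] = (forcesPlus[i] - forcesMinus[i]) / h;
    }
    return out;
}

int main()
{
    const std::vector<double> x   = {0.5, 0.5, 0.5};
    const std::vector<double> eta = {1.0, 0.0, 0.0};
    const double h = 1e-4;  // stands for delta * eps
    auto xPlus  = displace(x, eta, h, +1.0);
    auto xMinus = displace(x, eta, h, -1.0);
    // forcesPlus / forcesMinus would come from the 2nd and 3rd do_force calls.
    (void)xPlus; (void)xMinus;
    return 0;
}
```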

I assume that not using domain decomposition at all is going to be much more efficient. You should spread points along the trajectory over ranks, not the system. But you would anyhow need to take domain decomposition into account, as we will likely soon have this always active.

In this algorithm, the physical system is replicated across the ranks, each of which corresponds to a simulation in the multisim terminology. And we hope to increase resource utilization for each simulation by using DD.

With this information, could you please tell me which force flags would be absolutely necessary for the 2nd and 3rd calls?

It seems you are using an old version of GROMACS. Then the flags you need are set by:
GMX_FORCE_ALLFORCES

The 2024 release and the current main branch use workload-flag structs instead.


Thanks very much, you have been very helpful so far. Would I also need to use GMX_FORCE_STATECHANGED | GMX_FORCE_NS?

That would only be needed at the first call when moving to a new point on the path. You would also need GMX_FORCE_DYNAMICBOX for this first call when the box might change.

In the second and third calls, we’re providing new, separate coordinate vectors as input parameters, so I thought we might need to use the above flags, and GMX_FORCE_DYNAMICBOX in case of activated DLB. But as of now, with just GMX_FORCE_ALLFORCES, it seems to be working quite well. I feel that a little more documentation about these flags somewhere would be helpful :)

Thanks @hess and @ebriand, both of your suggestions have helped me resolve the problem and find the bug!

If you only modify the coordinates, and with a delta that is smaller than the pairlist buffer, you only need to pass GMX_FORCE_ALLFORCES for the second and later calls.
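Putting the advice from this thread together, a minimal sketch of the flag selection. The bit values are placeholders standing in for the legacy GMX_FORCE_* defines, not the real GROMACS definitions, and the function name is hypothetical.

```cpp
// Placeholder bit values -- NOT the real GROMACS definitions.
constexpr int GMX_FORCE_STATECHANGED = 1 << 0;
constexpr int GMX_FORCE_NS           = 1 << 1;
constexpr int GMX_FORCE_DYNAMICBOX   = 1 << 2;
constexpr int GMX_FORCE_ALLFORCES    = 1 << 3;

// Pick flags for one do_force call of the path-MD step.
int selectForceFlags(bool firstCallAtNewPathPoint, bool boxMayChange)
{
    int flags = GMX_FORCE_ALLFORCES;
    if (firstCallAtNewPathPoint)
    {
        // First call at a new point on the path: new coordinates and a pair search.
        flags |= GMX_FORCE_STATECHANGED | GMX_FORCE_NS;
        if (boxMayChange)
        {
            flags |= GMX_FORCE_DYNAMICBOX;
        }
    }
    // Second and third (finite-difference) calls: forces only, provided the
    // displacement stays within the pairlist buffer.
    return flags;
}
```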


Have a look at the new flags in src/gromacs/mdtypes/simulation_workload.h. They have better documentation. Please report if anything there is unclear.