-ntmpi must be 1 on multi-threaded CPUs

GROMACS version: 2020.6
GROMACS modification: Yes

Hi everybody. Hope everyone is fine.

I’ve been facing an issue with my GROMACS builds for some time, but it was nothing to worry about too much until now.

Every single one of my GROMACS builds (including the 2020.6 that I’m using for this report), on any multi-threaded CPU I have or had access to, only runs in “pure OpenMP” mode: I have to manually add “-ntmpi 1” to the command line, effectively disabling thread-MPI, for it to work.
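
For reference, this is roughly how I have to launch mdrun (the “md” file name is just a placeholder for my actual run input):

gmx mdrun -deffnm md -ntmpi 1   # works: thread-MPI disabled, pure OpenMP
gmx mdrun -deffnm md            # fails with the kind of error shown below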

If I don’t set “-ntmpi 1”, regardless of the system used, I get errors like the following example:

#########

Changing nstlist from 10 to 50, rlist from 1 to 1.107

Using 32 MPI threads
Using 1 OpenMP thread per tMPI thread

Not all bonded interactions have been properly assigned to the domain decomposition cells
A list of missing interactions:
                Bond of   6583 missing    134
               Angle of  22820 missing    651
         Proper Dih. of  35458 missing   1539
       Improper Dih. of   2682 missing    107
               LJ-14 of  32910 missing    938
Molecule type 'Protein_chain_A'
the first 10 missing interactions, except for exclusions:
         Proper Dih. atoms   60   62   64   66 global    60    62    64    66
         Proper Dih. atoms   60   62   64   66 global    60    62    64    66
         Proper Dih. atoms   60   62   64   66 global    60    62    64    66
               LJ-14 atoms   60   66           global    60    66
               Angle atoms   62   64   66      global    62    64    66
         Proper Dih. atoms   62   64   66   67 global    62    64    66    67
         Proper Dih. atoms   62   64   66   68 global    62    64    66    68
         Proper Dih. atoms   62   64   66   69 global    62    64    66    69
               LJ-14 atoms   62   67           global    62    67
               LJ-14 atoms   62   68           global    62    68
Molecule type 'Protein_chain_E'
the first 10 missing interactions, except for exclusions:
         Proper Dih. atoms 2288 2294 2296 2298 global 11794 11800 11802 11804
               LJ-14 atoms 2288 2298           global 11794 11804
               Angle atoms 2294 2296 2298      global 11800 11802 11804
         Proper Dih. atoms 2294 2296 2298 2300 global 11800 11802 11804 11806
         Proper Dih. atoms 2294 2296 2298 2300 global 11800 11802 11804 11806
         Proper Dih. atoms 2294 2296 2298 2300 global 11800 11802 11804 11806
         Proper Dih. atoms 2294 2296 2298 2308 global 11800 11802 11804 11814
         Proper Dih. atoms 2294 2296 2298 2308 global 11800 11802 11804 11814
         Proper Dih. atoms 2294 2296 2298 2308 global 11800 11802 11804 11814
       Improper Dih. atoms 2294 2298 2296 2297 global 11800 11804 11802 11803



-------------------------------------------------------
Program:     gmx mdrun, version 2020.6
Source file: src/gromacs/domdec/domdec_topology.cpp (line 421)
MPI rank:    0 (out of 32)

Fatal error:
3369 of the 181683 bonded interactions could not be calculated because some
atoms involved moved further apart than the multi-body cut-off distance
(1.31543 nm) or the two-body cut-off distance (1.31543 nm), see option -rdd,
for pairs and tabulated bonds also see option -ddcheck

For more information and tips for troubleshooting, please check the GROMACS
website at http://www.gromacs.org/Documentation/Errors

#########

However, if I set “-ntmpi 1”, it runs as a “pure OpenMP” job and works fine, with no flaws whatsoever. :(

Does anybody have any idea of the possible causes of this error?

Additionally: I’ve never worried too much about this because the performance was good enough. However, we are now using a machine with an RTX 3060 GPU: on it the process starts to run but, after 1-2 minutes, the whole machine shuts down. I’m almost certain this is not related to the “-ntmpi 1” issue, but it doesn’t hurt to ask whether someone has seen this sort of “computer turning off during GPU calculations”.

Thanks a lot for any comments!

That is the most likely issue: some of your bonded interactions are probably very long range (e.g. in a CG system), and the default domain decomposition heuristics do not result in a stable simulation. Increasing the -rdd option will likely solve the issue.
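
Something along these lines (the 1.6 nm value is only an example; you may need a larger value depending on your longest bonded interactions):

gmx mdrun -deffnm md -rdd 1.6   # set the DD bonded-interaction cut-off explicitly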

My guess is faulty hardware or perhaps insufficient power supply or cooling.

Cheers,
Szilárd

Thanks for the suggestion. :)

However, I tried it and it did not work.

Including “-rdd” and increasing it up to 2.4 did not change the error message (at least not in any clearly noticeable way).

Moreover, from 2.5 upwards, the error message changes to the following:

##############

Program: gmx mdrun, version 2020.6
Source file: src/gromacs/domdec/domdec.cpp (line 2277)
MPI rank: 0 (out of 32)

Fatal error:
There is no domain decomposition for 24 ranks that is compatible with the
given box and a minimum cell size of 3.38553 nm
Change the number of ranks or mdrun option -rdd or -dds
Look in the log file for details on the domain decomposition

For more information and tips for troubleshooting, please check the GROMACS
website at http://www.gromacs.org/Documentation/Errors

##############

Finally, by “CG system” did you mean “coarse-grained”? That is not the case in any of my systems. This one is just an all-atom, solvated two-protein complex (and I have also seen this issue in cases as simple as pure all-atom ionic liquids).

I would also like to emphasize that the error only happens in “thread-MPI” simulations: if I choose to go purely with OpenMP it does not happen, and the calculation runs successfully to the end (on the CPU, at least).
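
Just to illustrate the kind of combinations I have been testing (the exact thread counts are only examples from one of our machines):

gmx mdrun -deffnm md -ntmpi 4 -ntomp 8    # any run with more than one thread-MPI rank fails
gmx mdrun -deffnm md -ntmpi 1 -ntomp 32   # pure OpenMP: runs to the end with no problems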

Back again.

I’m still facing the “Not all bonded interactions have been properly assigned to the domain decomposition cells” error that prevents me from using “-ntmpi” values other than “1” (as mentioned in the previous post, “-rdd” did not solve it). However, today there were some developments on the strange “turning off” issue.

It was not only an electrical issue: we tested this by moving the computer to a completely different electrical circuit, which helped but was not enough to solve it on its own.

We also observed something strange (on the new electrical circuit): if we did not add the “-ntmpi 1” setting, the calculation failed and stalled the computer, and it also failed when we did add it.

However, when we also added “-ntomp” and set it to the maximum number of available threads, the run went smoothly to the very end! (Only on the new electrical circuit, though: we tried it back on the original one and had to lower the “-ntomp” value. It seems we are also facing some electrical issue.)
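
For the record, the command line that now finishes on the GPU machine looks roughly like this (32 is just the thread count of that particular box; mdrun picks up the GPU automatically):

gmx mdrun -deffnm md -ntmpi 1 -ntomp 32   # thread-MPI disabled, OpenMP thread count set explicitly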

Does anybody have any clue why “-ntomp” needs to be set explicitly? And what, other than “-rdd”, could be a way to solve the strange “Not all bonded interactions have been properly assigned to the domain decomposition cells” issue (which does not happen with OpenMP, only with thread-MPI)?

When we nail down the electrical issue a bit more I’ll report back on this thread: maybe it will help someone else in the future…

Thanks a lot in advance for any help.