Wait GPU NB nonloc. % is too high

GROMACS version: 2020.1
GROMACS modification: No

Hello everybody,

A system of 345216 atoms is running on a machine with the following configuration:

Running 2 nodes with total 96 cores, 192 logical cores, 2 compatible GPUs
Cores per node: 48
Logical cores per node: 96
Compatible GPUs per node: 1

And the command below:
FLAGS='-nb gpu -pme gpu -bonded gpu'
gmx_mpi mdrun -ntomp 1 -npme 1 -s an.tpr -deffnm an -g an.log -tunepme yes -pin on -dlb yes -cpi an.cpt -append $FLAGS

With this I get only 13.747 ns/day (1.746 hour/ns). Apparently, more than 70% of the resources are being wasted waiting for “GPU NB nonloc” and “Comm. F.” (48.2% and 30.1%, respectively).

Would you please help me get better performance out of the resources I have?

For more information, I have shared the an.log file at the link below:



The waiting time that the performance table lists is not necessarily a bad thing or something that should be eliminated. GROMACS uses heterogeneous parallelization and makes use of both the CPU and the GPU. However, on many modern GPU-accelerated machines the CPU–GPU balance is such that it is often worth letting the GPU handle most of the compute-intensive work and using the CPU for the infrequent tasks (like domain decomposition and I/O) as well as for algorithms not explicitly ported to the GPU.

Therefore, between these less frequent tasks the CPU will often idle and we record time spent as waiting for GPU results – which is the case here (with some caveats, more on that below).

The machine you are running on is somewhat atypical in that it has a quite high number of CPU cores per GPU (2x 24 cores per Tesla P100). On this hardware it will be worth keeping some of the tasks on the CPU.

Looking at your log file there are a couple of issues I spotted:

  • you built GROMACS with SIMD acceleration disabled (see SIMD instructions selected at compile time: None in the log); you should use AVX_512 (and since the tasks you might keep off the GPU won’t benefit much from it, you can also try AVX2_256, as it might give some performance improvement);
  • you also have the low-level timing instructions disabled (see the log note on that); I suggest re-enabling those;
  • use fewer ranks and more OpenMP threads per rank (e.g. 4-8 ranks per node);
  • at this rank count, running without separate PME ranks will probably allow better performance;
  • your run also has a significant load imbalance (which leads to a high fraction of the runtime being spent in communication) – this should be reduced by using ~8-16-way domain decomposition instead of the current 90.
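
To make the rank and thread suggestions above concrete, here is a minimal sketch of how such a launch line could be assembled. It is an illustration only: the `mpirun -np` launcher, the choice of 8 ranks per node, and the switch to `-pme cpu -npme 0` are assumptions, not settings taken from the log, and how ranks are placed across the two nodes is left to your scheduler. The script prints the command instead of running it.

```shell
#!/bin/sh
# Hypothetical layout for the 2 x 48-core nodes described in the question.
NODES=2
CORES_PER_NODE=48
RANKS_PER_NODE=8    # assumption: upper end of the suggested 4-8 ranks per node

NTOMP=$((CORES_PER_NODE / RANKS_PER_NODE))   # OpenMP threads per rank
NRANKS=$((NODES * RANKS_PER_NODE))           # total MPI ranks

# -npme 0 asks for no separate PME ranks; here PME is kept on the CPU (-pme cpu).
# The remaining flags are copied from the original command.
CMD="mpirun -np $NRANKS gmx_mpi mdrun -ntomp $NTOMP -npme 0 -nb gpu -pme cpu -bonded gpu -s an.tpr -deffnm an -g an.log -tunepme yes -pin on -dlb yes -cpi an.cpt -append"

# Printed rather than executed, so the line can be checked before submitting.
echo "$CMD"
```

With these numbers the sketch yields 16 ranks with 6 threads each, which falls in the ~8-16-way decomposition range. Whether this beats `-pme gpu -npme 1` has to be benchmarked; comparing `-bonded cpu` against `-bonded gpu` is a cheap additional test on a CPU-heavy machine.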

I hope that helps; these changes should significantly improve the performance of your runs!


Hello Szilard,
Thank you very much for the informative and detailed response.

May I ask what the option for the low-level timing instructions is? Can it be enabled during compilation?

Regarding “no separate PME ranks”: if you mean the -npme option, then as you know the only option for -npme is 1 when PME runs on the GPU (-pme gpu). And if I put PME on the CPU in order to have no separate PME rank, the performance drops compared to the former case.


From the log you shared:

The current CPU can measure timings more accurately than the code in
gmx mdrun was configured to use. This might affect your simulation
speed as accurate timings are needed for load-balancing.
Please consider rebuilding gmx mdrun with the GMX_USE_RDTSCP=ON CMake option.

Not sure how you ended up with this option disabled; it is enabled by default and only gets disabled if the build host does not support this feature.
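
As a rough sketch, a reconfigure that sets both the SIMD level and the timing option explicitly could look like the following. The `-DGMX_MPI=ON` and `-DGMX_GPU=ON` flags are assumptions about how this 2020.1 build was set up, and `..` stands in for the source directory. The script only prints the configure line.

```shell
#!/bin/sh
# Assumed configure options for an MPI + GPU build of GROMACS 2020.
SIMD=AVX_512    # alternative worth benchmarking: AVX2_256

# GMX_USE_RDTSCP=ON is the option named in the log note above.
CMAKE_ARGS="-DGMX_MPI=ON -DGMX_GPU=ON -DGMX_SIMD=$SIMD -DGMX_USE_RDTSCP=ON"

# Printed rather than executed; run it from a fresh build directory.
echo "cmake .. $CMAKE_ARGS"
```

If the build host is a different CPU model than the compute nodes, setting these options explicitly avoids relying on what CMake detects on the build host.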

Have you tried that with a binary that’s not hampered by the disabled CPU optimizations? As I noted, you have a very CPU-heavy machine, so you may well be able to get best performance by keeping PME on the CPU.