Wait GPU NB nonloc. % is too high

Alex · October 19, 2020, 6:43pm

GROMACS version: 2020.1
GROMACS modification: No

Hello everybody,

A system of “345216 atoms” is running on a machine with the following configuration:

Running 2 nodes with total 96 cores, 192 logical cores, 2 compatible GPUs
Cores per node: 48
Logical cores per node: 96
Compatible GPUs per node: 1

And the command below:
FLAGS='-nb gpu -pme gpu -bonded gpu'
gmx_mpi mdrun -ntomp 1 -npme 1 -s an.tpr -deffnm an -g an.log -tunepme yes -pin on -dlb yes -cpi an.cpt -append $FLAGS

With this I get only 13.747 ns/day (1.746 hour/ns). Apparently, more than 70% of the run time is wasted waiting on “Wait GPU NB nonloc.” and “Wait + Comm. F.” (48.2% and 30.1%, respectively).
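As a quick sanity check of the figures quoted above, the conversion and the sum can be reproduced with a few lines of shell. This is only a sketch; the variable names are made up for illustration and the numbers are the ones from the post.

```shell
#!/bin/sh
# Figures quoted in the post (variable names are illustrative only).
NS_PER_DAY=13.747
WAIT_NB_NONLOC=48.2   # percent of run time
WAIT_COMM_F=30.1      # percent of run time

# 24 hours per day divided by ns/day gives hours per ns.
HOURS_PER_NS=$(awk -v r="$NS_PER_DAY" 'BEGIN { printf "%.3f", 24 / r }')

# Combined share of the two waiting counters.
WAIT_TOTAL=$(awk -v a="$WAIT_NB_NONLOC" -v b="$WAIT_COMM_F" 'BEGIN { printf "%.1f", a + b }')

echo "hours per ns: $HOURS_PER_NS"
echo "combined wait share: $WAIT_TOTAL %"
```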

Would you please help me get better performance out of the resources I have?

For more information, I have shared the an.log file at the link below:

Regards,
Alexander

pszilard · October 20, 2020, 6:21pm

Hi,

The waiting time that the performance table lists is not necessarily a bad thing or something that should be eliminated. GROMACS uses heterogeneous parallelization and makes use of both the CPU and the GPU. However, in many modern GPU-accelerated machines the CPU–GPU balance is such that it is often worth letting the GPU handle most of the compute-intensive work and using the CPU for the infrequent tasks (like domain decomposition, I/O) as well as to enable algorithms not explicitly ported to the GPU.

Therefore, between these less frequent tasks the CPU will often idle and we record time spent as waiting for GPU results – which is the case here (with some caveats, more on that below).

The machine you are running on is somewhat atypical in that it has a quite high number of CPU cores per GPU (2x 24 cores per Tesla P100). On this hardware it will be worth keeping some of the tasks on the CPU.

Looking at your log file there are a couple of issues I spotted:

- you built GROMACS with SIMD acceleration disabled (see "SIMD instructions selected at compile time: None" in the log); you should use AVX_512 (and given that the tasks you might not offload to the GPU won’t benefit much from that, you can also try AVX2_256 as it might allow some performance improvement);
- you also have the low-level timing instructions disabled (see the log note on that); I suggest re-enabling those;
- use fewer ranks and more OpenMP threads per rank (e.g. 4-8 ranks per node);
- at this rank count, running without separate PME ranks will probably give better performance;
- your run also has a significant load imbalance (which leads to a high fraction of runtime spent in communication); this should be reduced by using ~8-16-way domain decomposition instead of the current 90.
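To make the rank and thread suggestions concrete, here is a minimal sketch of how such a launch line could be put together for the hardware described in this thread (2 nodes, 48 cores and 1 GPU each). The `mpirun -np` launcher, the choice of 4 ranks per node, and running PME on the CPU with `-npme 0` are assumptions for illustration, not a tested recommendation; the script only prints the command.

```shell
#!/bin/sh
# Hardware from the original post; RANKS_PER_NODE is an assumed pick
# from the suggested 4-8 range.
NODES=2
CORES_PER_NODE=48
RANKS_PER_NODE=4

NTOMP=$((CORES_PER_NODE / RANKS_PER_NODE))   # OpenMP threads per rank
NRANKS=$((NODES * RANKS_PER_NODE))           # total MPI ranks (8-way decomposition)

# Assumption: ranks are started with mpirun; adapt to your scheduler.
# -npme 0 requests no separate PME ranks; PME then runs on the CPU ranks.
CMD="mpirun -np $NRANKS gmx_mpi mdrun -ntomp $NTOMP -npme 0 -nb gpu -pme cpu -s an.tpr -deffnm an -pin on -dlb yes -cpi an.cpt"

# Dry run: print the command rather than executing it.
echo "$CMD"
```

With these numbers the script prints a command using 8 ranks with 12 OpenMP threads each. Whether 4 or 8 ranks per node is faster, and whether bonded interactions are better left on the CPU, would need to be benchmarked on the actual machine.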

I hope that helps; these changes should significantly improve the performance of your runs!

Cheers,
Szilárd

Alex · October 23, 2020, 3:22pm

Hello Szilard,
Thank you very much for the informative and detailed response.

May I ask what the option for the low-level timing instructions is? Can it be enabled during compilation?

Regarding “no separate PME ranks”: if you mean the -npme option, then as you know the only possible value is -npme 1 when PME runs on the GPU (-pme gpu). And if I move PME to the CPU so that I can run without a separate PME rank, the performance would drop compared to the former case.

Regards,
Alex

pszilard · October 26, 2020, 2:09pm

From the log you shared:

The current CPU can measure timings more accurately than the code in
gmx mdrun was configured to use. This might affect your simulation
speed as accurate timings are needed for load-balancing.
Please consider rebuilding gmx mdrun with the GMX_USE_RDTSCP=ON CMake option.

Not sure how you ended up with this option disabled; it is enabled by default and only gets disabled if the build host does not support this feature.
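For reference, a rebuild that sets both the SIMD level and the timing option explicitly might be configured roughly as sketched below. `GMX_USE_RDTSCP` comes from the log note above and `GMX_SIMD=AVX_512` from the earlier reply; `GMX_MPI=ON` and `GMX_GPU=ON` are assumed from the `gmx_mpi` binary and the GPU flags in the original command, and the `..` source path assumes a build directory inside the source tree. The script only prints the configure command.

```shell
#!/bin/sh
# SIMD level suggested earlier in the thread; AVX2_256 is the alternative
# that was mentioned as worth trying.
SIMD=AVX_512

# GMX_MPI and GMX_GPU are assumptions based on the original command line.
CMAKE_OPTS="-DGMX_MPI=ON -DGMX_GPU=ON -DGMX_SIMD=$SIMD -DGMX_USE_RDTSCP=ON"

# Dry run: print the configure command rather than executing it.
echo "cmake .. $CMAKE_OPTS"
```

After rebuilding, the SIMD line near the top of the log should no longer report "None", and the note about inaccurate timings should disappear.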

Have you tried that (PME on the CPU) with a binary that’s not hampered by the disabled CPU optimizations? As I noted, you have a very CPU-heavy machine, so you may well be able to get the best performance by keeping PME on the CPU.
