GROMACS Drude version performance does not scale with increasing OpenMP threads

GROMACS version: 2016-dev-20220119-e35ae4e-unknown
GROMACS modification: No

Useful Links:

  1. MacKerell Lab
  2. https://onlinelibrary.wiley.com/doi/full/10.1002/jcc.23937

My SWM4-NDP water simulation (1024 water molecules in a box of approx. 31 angstroms) does not scale with an increasing number of OpenMP threads, whereas the performance reported in Figure 1 of the paper (LINK 2) increases almost linearly with the number of OpenMP threads. The command that I am using to run the simulation is:
gmx mdrun -s nvt.tpr -deffnm nvt -ntomp NO_OF_THREADS
where NO_OF_THREADS was varied from 1 to 16.

The following is the performance for a 10 ps NVT run using the extended Lagrangian method:

ntomp   ns/day
1       0.586
4       0.503
8       0.486
16      0.458
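
To put numbers on the lack of scaling, the table above can be converted into speedup and parallel efficiency relative to the single-thread run. This is only a minimal sketch: the values are copied from the table, and the function name `scaling` is made up for illustration.

```python
# Throughput in ns/day from the table above, keyed by OpenMP thread count.
perf = {1: 0.586, 4: 0.503, 8: 0.486, 16: 0.458}


def scaling(perf):
    """Return (threads, speedup, efficiency) rows relative to the 1-thread run."""
    base = perf[1]
    rows = []
    for n in sorted(perf):
        speedup = perf[n] / base      # > 1 means faster than one thread
        efficiency = speedup / n      # 1.0 would be ideal linear scaling
        rows.append((n, speedup, efficiency))
    return rows


print(f"{'ntomp':>5} {'speedup':>8} {'efficiency':>10}")
for n, speedup, efficiency in scaling(perf):
    print(f"{n:>5d} {speedup:>8.3f} {efficiency:>10.3f}")
```

The speedup is below 1 for every thread count above 1, i.e. adding threads makes the run slower rather than faster.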

I have two questions:

  1. Can anyone help me with how to speed up the simulation? Some information from the output file is given below (using 16 OpenMP threads), which might be useful.

  2. While running the NVT simulation, a large amount of data that I do not understand is printed to the terminal. Can anyone tell me why it is printed, what information it provides, and how I can stop it from being printed? A sample of the data follows:

Start: Data printed on the terminal during NVT simulation


DO FORCE: after move_f f[5115] = 32.558803 -96.048094 -35.766874
DO FORCE: after move_f f[5120] = 109.191799 129.462895 -1.605907
.
.
.
DO FORCE: after GPU use/emulate f[485] = 169.072343 125.064651 69.923393
DO FORCE: after GPU use/emulate f[490] = -62.061709 -70.705441 20.019443
.
.
.
DRUDE TFP: n = 4 final atom v[171]: 0.777366 0.136756 0.165941 drude v[175]: 0.736939 0.122672 0.083183
DRUDE TFP: n = 4 init atom v[176]: 0.475490 -0.146997 -0.235264 drude v[180] (ib = 179): 0.405482 -0.008488 -0.573738
.
.
.
VV VEL: v[3315(3315)] b4 update: -0.023910 -0.444889 0.339263
VV VEL: v[1379(1379)] after update: 0.000000 0.000000 0.000000
VV VEL: f[2673(2673)] b4 update: -459.965751 -851.022226 -705.858880
.
.
.
VV POS: x[4800(4800)] b4 update: 1.896389 0.080550 0.767094
VV POS: v[4800(4800)] b4 update: 0.010975 0.089460 0.522833
VV POS: x[4800(4800)] after update: 1.896400 0.080640 0.767617


End: Data printed on terminal during NVT simulation

Start: Information from the Output .log file


Using 1 MPI process
Using 16 OpenMP threads

NOTE: You requested 16 OpenMP threads, whereas we expect the optimum to be with more MPI ranks with 1 to 6 OpenMP threads.

Will do PME sum in reciprocal space for electrostatic interactions.

M E G A - F L O P S   A C C O U N T I N G

NB=Group-cutoff nonbonded kernels NxN=N-by-N cluster Verlet kernels
RF=Reaction-Field VdW=Van der Waals QSTab=quadratic-spline table
W3=SPC/TIP3p W4=TIP4p (single or pairs)
V&F=Potential and force V=Potential only F=Force only

Computing:                          M-Number         M-Flops  % Flops

Pair Search distance check       2570.456388       23134.107      0.9
NxN QSTab Elec. + LJ [F]        24407.034560     1659678.350     62.3
NxN QSTab Elec. + LJ [V&F]        500.565696       39544.690      1.5
NxN QSTab Elec. [F]             24406.435680      829818.813     31.1
NxN QSTab Elec. [V&F]             500.565696       20523.194      0.8
Calc Weights                      153.615360        5530.153      0.2
Spread Q Bspline                 3277.127680        6554.255      0.2
Gather F Bspline                 3277.127680       19662.766      0.7
3D-FFT                           6332.493186       50659.945      1.9
Solve PME                           7.840784         501.810      0.0
Shift-X                             5.125120          30.751      0.0
Bonds                              10.241024         604.220      0.0
Virial                              1.038165          18.687      0.0
Stop-CM                             0.522240           5.222      0.0
Calc-Ekin                          51.205120        1382.538      0.1
Constraint-V                       30.723072         245.785      0.0
Constraint-Vir                      0.617472          14.819      0.0
Settle                             20.484096        6616.363      0.2
Virtual Site 3                     10.446848         386.533      0.0

Total                                            2664913.003    100.0

 R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G

On 1 MPI rank, each using 16 OpenMP threads

Computing:             Num     Num      Call   Wall time   Giga-Cycles
                     Ranks Threads     Count         (s)     total sum       %

Vsite constr.            1      16     10001       0.326        11.743     0.0
Neighbor search          1      16      1001       2.309        83.131     0.1
Force                    1      16     10001      96.011      3456.394     5.1
PME mesh                 1      16     10001       4.332       155.958     0.2
NB X/F buffer ops.       1      16     19001       1.198        43.118     0.1
Vsite spread             1      16     10202       0.510        18.367     0.0
Write traj.              1      16        13       1.503        54.110     0.1
Update                   1      16     40004    1628.047     58609.655    86.4
Constraints              1      16     20002       1.549        55.754     0.1
Rest                                             149.617      5386.224     7.9

Total                                           1885.402     67874.453   100.0

Breakdown of PME mesh computation

PME spread/gather        1      16     20002       2.955       106.380     0.2
PME 3D-FFT               1      16     20002       1.244        44.780     0.1
PME solve Elec           1      16     10001       0.086         3.107     0.0


End: Information from the Output .log file
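
The wall-time column of the cycle accounting table can be turned into a ranked list to see where the run spends its time. The sketch below simply re-derives the percentages from the wall times copied out of the log above; the function name `wall_time_shares` is only illustrative.

```python
# Wall times in seconds, copied from the cycle and time accounting table
# of the 16-thread run.
wall = {
    "Vsite constr.": 0.326,
    "Neighbor search": 2.309,
    "Force": 96.011,
    "PME mesh": 4.332,
    "NB X/F buffer ops.": 1.198,
    "Vsite spread": 0.510,
    "Write traj.": 1.503,
    "Update": 1628.047,
    "Constraints": 1.549,
    "Rest": 149.617,
}


def wall_time_shares(wall):
    """Return (name, percent of total wall time), largest share first."""
    total = sum(wall.values())
    shares = [(name, 100.0 * t / total) for name, t in wall.items()]
    return sorted(shares, key=lambda item: item[1], reverse=True)


for name, pct in wall_time_shares(wall):
    print(f"{name:<20} {pct:6.1f} %")
```

In this log the "Update" entry accounts for roughly 86% of the wall time, while "Force" takes about 5%, so the nonbonded kernels that dominate the flop count are not what limits the run.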

Start: GROMACS and simulation details are:


The GROMACS drude version was downloaded using the link: LINK 1 (see above)
.mdp file: same as provided in the supporting information of the paper LINK 2 (see above)
coordinate file (.gro file): pre-equilibrated box of SWM4-NDP water downloaded from LINK 1 (see above)

Below is the SWM4-NDP topology file that I am using:

;
; Polarizable water: SWM4-NDP model
;
; G. Lamoureux, E. Harder, I. V. Vorobyov, B. Roux, and A. D. MacKerell, Jr. (2006)
; A polarizable model of water for molecular dynamics simulations of biomolecules.
; Chem. Phys. Lett. 418: 245-249.
;

[ defaults ]
; nbfunc comb-rule gen-pairs fudgeLJ fudgeQQ
1 2 no 0.5000 0.5000

[ atomtypes ]
;type atnum mass charge ptype sigma epsilon
ODW 8 15.599400 0.000 A 0.318394549320 0.88259
HDW 1 1.008000 0.000 A 0.000000000000 0.00000
LPDW 1 0.000000 0.000 V 0.000000000000 0.00000
DOH2 1 0.400000 0.000 S 0.000000000000 0.00000

[ moleculetype ]
; molname nrexcl
SOL 2

[ atoms ]
; id type resnr resname at name cg nr charge mass
1 ODW 1 SOL OH2 1 1.71636 15.5994
2 HDW 1 SOL H1 1 0.55733 1.0080
3 HDW 1 SOL H2 1 0.55733 1.0080
4 LPDW 1 SOL OM 1 -1.11466 0.0000
5 DOH2 1 SOL DOH2 1 -1.71636 0.4000

[ bonds ]
;; i j funct
1 5 1 0.00000000 418400.00

[ virtual_sites3 ]
; site from func a b
4 1 2 3 1 0.205109464 0.205109464

[ settles ]
1 1 0.09572 0.15139

[ exclusions ]
1 2 3 4 5
2 1 3 4 5
3 1 2 4 5
4 1 2 3 5

[ system ]
SOL

[ molecules ]
; Compound nmols
SOL 1024


End: GROMACS and simulation details are:

The code currently prints a bunch of debugging information I have been using for testing. It obliterates performance. Comment out those lines and recompile.

Please do not use this code for anything you intend to publish. It is not currently production quality and unfortunately I have very little time to work on it these days.
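
One possible way to locate the debugging statements before commenting them out is to search the source tree for the prefixes that show up in the terminal output (`DO FORCE:`, `DRUDE TFP:`, `VV VEL:`, `VV POS:`). The sketch below assumes these prefixes appear as string literals in the source, which has not been verified here; the directory path, the file extensions, and the function names are placeholders.

```python
from pathlib import Path

# Prefixes taken from the terminal output quoted in the question.
# Assumption: the debug print statements contain them as string literals.
MARKERS = ("DO FORCE:", "DRUDE TFP:", "VV VEL:", "VV POS:")


def find_marker_lines(lines, markers=MARKERS):
    """Return (line_number, text) for every line containing one of the markers."""
    hits = []
    for number, text in enumerate(lines, start=1):
        if any(marker in text for marker in markers):
            hits.append((number, text.rstrip("\n")))
    return hits


def scan_tree(root, suffixes=(".c", ".cpp", ".h")):
    """Yield (path, line_number, text) for marker lines in source files under root."""
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix in suffixes:
            with path.open(errors="replace") as handle:
                for number, text in find_marker_lines(handle):
                    yield path, number, text


# Example usage (the path is a placeholder for the local source checkout):
# for path, number, text in scan_tree("path/to/gromacs-drude/src"):
#     print(f"{path}:{number}: {text.strip()}")
```

Each reported line is a candidate to comment out before recompiling; the statements may span several source lines, so the surrounding code should be checked by hand.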

Thanks for your prompt reply, it is really helpful.