Too much time on PME mesh

GROMACS version: 2022
GROMACS modification: Yes
Dear all,
Hi, I’m using GROMACS 2022 to simulate a system of ~1 million atoms with 36 threads and 1 GPU (V100). I notice a very large wall time on the PME mesh. Could anyone give me some suggestions to improve the performance? The relevant part of the log file is below:

       P P   -   P M E   L O A D   B A L A N C I N G

 NOTE: The PP/PME load balancing was limited by the maximum allowed grid scaling,
       you might not have reached a good load balance.

 PP/PME load balancing changed the cut-off and PME settings:
           particle-particle                    PME
            rcoulomb  rlist            grid      spacing   1/beta
   initial  1.200 nm  1.204 nm     192 192 192   0.116 nm  0.384 nm
   final    1.858 nm  1.862 nm     120 120 120   0.186 nm  0.595 nm
 cost-ratio           3.70             0.24
 (note that these numbers concern only part of the total PP and PME load)


	M E G A - F L O P S   A C C O U N T I N G

 NB=Group-cutoff nonbonded kernels    NxN=N-by-N cluster Verlet kernels
 RF=Reaction-Field  VdW=Van der Waals  QSTab=quadratic-spline table
 W3=SPC/TIP3p  W4=TIP4p (single or pairs)
 V&F=Potential and force  V=Potential only  F=Force only

 Computing:                               M-Number         M-Flops  % Flops
-----------------------------------------------------------------------------
 Pair Search distance check          133893.255360     1205039.298     0.0
 NxN QSTab Elec. + LJ [F]         168582681.324288  8934882110.187    97.7
 NxN QSTab Elec. + LJ [V&F]         1706313.111552   138211362.036     1.5
 1,4 nonbonded interactions             865.267305       77874.057     0.0
 Calc Weights                        157298.545908     5662747.653     0.1
 Spread Q Bspline                   3355702.312704     6711404.625     0.1
 Gather F Bspline                   3355702.312704    20134213.876     0.2
 3D-FFT                             3799854.699444    30398837.596     0.3
 Solve PME                              741.121600       47431.782     0.0
 Reset In Box                           524.318000        1572.954     0.0
 CG-CoM                                 525.366636        1576.100     0.0
 Bonds                                  165.903318        9788.296     0.0
 Propers                                844.166883      193314.216     0.0
 Impropers                               56.151123       11679.434     0.0
 Virial                                5245.578906       94420.420     0.0
 Stop-CM                                525.366636        5253.666     0.0
 Calc-Ekin                            10488.457272      283188.346     0.0
 Lincs                                  163.153263        9789.196     0.0
 Lincs-Mat                              870.017400        3480.070     0.0
 Constraint-V                         52331.446608      470983.019     0.0
 Constraint-Vir                        5217.768345      125226.440     0.0
 Settle                               17335.046694     6413967.277     0.1
 CMAP                                    19.950399       33915.678     0.0
 Urey-Bradley                           596.411928      109143.383     0.0
-----------------------------------------------------------------------------
 Total                                              9145098319.606   100.0
-----------------------------------------------------------------------------


    D O M A I N   D E C O M P O S I T I O N   S T A T I S T I C S

 av. #atoms communicated per step for force:  2 x 533583.3


Dynamic load balancing report:
 DLB was off during the run due to low measured imbalance.
 Average load imbalance: 3.1%.
 The balanceable part of the MD step is 64%, load imbalance is computed from this.
 Part of the total run time spent waiting due to load imbalance: 2.0%.


     R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G

On 6 MPI ranks, each using 6 OpenMP threads

 Computing:          Num   Num      Call    Wall time         Giga-Cycles
                     Ranks Threads  Count      (s)         total sum    %
-----------------------------------------------------------------------------
 Domain decomp.         6    6        500      19.629       2114.758   1.7
 DD comm. load          6    6        478       0.892         96.145   0.1
 Neighbor search        6    6        501      24.364       2624.960   2.1
 Launch GPU ops.        6    6     100002      13.916       1499.286   1.2
 Comm. coord.           6    6      49500      70.864       7634.768   6.2
 Force                  6    6      50001      16.685       1797.662   1.5
 Wait + Comm. F         6    6      50001      51.074       5502.587   4.5
 PME mesh               6    6      50001     749.587      80759.333  65.9
 Wait Bonded GPU        6    6        501       0.002          0.229   0.0
 Wait GPU NB nonloc.    6    6      50001      13.022       1402.967   1.1
 Wait GPU NB local      6    6      50001      23.720       2555.509   2.1
 NB X/F buffer ops.     6    6     199002      55.330       5961.213   4.9
 Write traj.            6    6          3       0.559         60.264   0.0
 Update                 6    6      50001      37.599       4050.878   3.3
 Constraints            6    6      50001      27.766       2991.444   2.4
 Comm. energies         6    6       5001       9.232        994.592   0.8
 Rest                                          22.950       2472.617   2.0
-----------------------------------------------------------------------------
 Total                                       1137.191     122519.210 100.0
-----------------------------------------------------------------------------
 Breakdown of PME mesh computation
-----------------------------------------------------------------------------
 PME redist. X/F        6    6     100002     134.499      14490.691  11.8
 PME spread             6    6      50001     332.452      35817.851  29.2
 PME gather             6    6      50001     149.632      16121.109  13.2
 PME 3D-FFT             6    6     100002      88.660       9552.052   7.8
 PME 3D-FFT Comm.       6    6     100002      32.897       3544.313   2.9
 PME solve Elec         6    6      50001      11.014       1186.582   1.0
-----------------------------------------------------------------------------

               Core t (s)   Wall t (s)        (%)
       Time:    40938.880     1137.191     3600.0
                 (ns/day)    (hour/ns)
Performance:        7.598        3.159
Finished mdrun on rank 0 Wed Sep 28 19:30:45 2022

A large PME mesh time is expected and this might not even be an issue. If the time spent on the PME mesh fully overlaps with the calculations on the GPU, lowering the PME time will not improve performance. It is difficult to judge whether you get full overlap.
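
One way to get a rough idea of the overlap is to profile a short run with an external tool such as NVIDIA Nsight Systems and check whether the GPU timeline shows idle gaps while the CPU is in the PME mesh phase. A minimal sketch (not part of GROMACS itself; "topol.tpr" and the output name are stand-ins for your own files):

    # Profile ~2000 steps of a short single-rank run and inspect the GPU timeline for idle gaps
    nsys profile -o mdrun_overlap gmx mdrun -s topol.tpr -ntomp 36 -nsteps 2000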

The above suggests that the PP/PME load balancing is maxed out, that is, it shifted as much work as reasonable from the long-range to the short-range computation (from CPU to GPU), so you are likely running an inefficient setup.
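
If you want to see how much the automated tuning itself contributes, you can rerun a short segment with the tuning disabled and compare the cycle accounting. A sketch, with the rank/thread counts simply mirroring your current run:

    # Disable PP/PME load-balance tuning so rcoulomb and the PME grid stay at the .mdp values
    mpirun -np 6 gmx_mpi mdrun -ntomp 6 -notunepme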

If you want to use a single GPU, simply use a single rank with 36 threads (i.e. mpirun -np 1 gmx_mpi mdrun -ntomp 36). With up to 4-8 GPUs you can typically get good results by using a single separate PME GPU (-npme 1).
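
For reference, a minimal sketch of both setups (assuming an MPI build named gmx_mpi; the thread counts are examples and should match your hardware):

    # Single GPU: one rank with all 36 threads, nonbonded and PME offloaded to the GPU
    mpirun -np 1 gmx_mpi mdrun -ntomp 36 -nb gpu -pme gpu

    # Example with 4 GPUs: 4 ranks, one of them a dedicated PME rank on its own GPU
    mpirun -np 4 gmx_mpi mdrun -ntomp 9 -npme 1 -nb gpu -pme gpu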

Side note: the upcoming release will support running PME across multiple GPUs, though you will need a high-performance network to get any effective scaling.

Cheers,
Szilárd

Dear Szilárd, thank you for your reply. I tried the single-rank setup you mentioned and the performance did improve. Thank you.

Best,
Ding