Loss of performance in v. 2021

Hi all,

I have a small test system (~17K atoms) that contains nothing special: a solid membrane with a hole (membrane position-restrained at the edges), a string of periodic DNA, water, and some ions. The machine is 44-core E5 + 4 Titan XP GPUs.

mdrun -nt 24 -ntmpi 4 -npme 1 -pme gpu -gputasks 0123
version 2020.5: 295 ns/day (1fs timestep)
version 2021 (installed today; it built fine, but two tests failed somewhere at the end): 277 ns/day

Can post the complete mdp or link to the whole input package, if need be.
Any comments on the performance drop?

Many thanks!

Does this system really run faster using 4 GPUs instead of 1? I would think this would run much faster on a single GPU.

If you want to know where the difference in performance comes from, do a diff on the md.log files to see if there are differences in the task assignments. If not, have a look at the timings of the different components of mdrun at the end of the log file to see what got slower.
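
For example, something along these lines works from the shell (the file names are placeholders, and the exact wording of the assignment lines can vary between versions, so adjust the pattern as needed):

  # compare how the GPU tasks were assigned in the two runs
  grep -i -A 10 "gpu task" md_2020.log md_2021.log

  # compare the accounting tables at the end of the two logs
  diff <(tail -n 80 md_2020.log) <(tail -n 80 md_2021.log)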

Hm, that’s an interesting point – thanks! Let me check this before going into what is taking longer, because the workflow script probably inherited that 0123 thing from a bigger system. :)

I did some testing, and it seems whatever I was running was already close to the optimal config for the machine I’ve got here. A slightly better result (5-6%) was achieved with 4 threads per GPU task, each on a separate GPU (i.e., -nt 16 -gputasks 0123). Running multiple GPU tasks on the same card always results in a significant performance loss here.
Additional comments will be appreciated, of course.

Have you checked if the task assignment is the same?
And if not, which timings in the table at the end of the log file explain the difference?
You can also send two log files so I can have a look, although I think you can’t attach log files on this forum.

Hi Berk,

Yes, the task assignment is exactly the same. I upgraded to the latest version and here’s my attempt at a copy/paste:

************************* OLD *************************************************************


	M E G A - F L O P S   A C C O U N T I N G

 NB=Group-cutoff nonbonded kernels    NxN=N-by-N cluster Verlet kernels
 RF=Reaction-Field  VdW=Van der Waals  QSTab=quadratic-spline table
 W3=SPC/TIP3p  W4=TIP4p (single or pairs)
 V&F=Potential and force  V=Potential only  F=Force only

 Computing:                               M-Number         M-Flops  % Flops
-----------------------------------------------------------------------------
 Pair Search distance check         9963750.134848    89673751.214     0.0
 NxN Ewald Elec. + LJ [F]         6683855869.228032 441134487369.050    98.2
 NxN Ewald Elec. + LJ [V&F]        67513722.744832  7223968333.697     1.6
 1,4 nonbonded interactions          991500.003966    89235000.357     0.0
 Reset In Box                         42447.500000      127342.500     0.0
 CG-CoM                               42447.516979      127342.551     0.0
 Bonds                               378000.001512    22302000.089     0.0
 Angles                              825750.003303   138726000.555     0.0
 RB-Dihedrals                        400500.001602    98923500.396     0.0
 Pos. Restr.                          14000.000056      700000.003     0.0
 Virial                               42785.017114      770130.308     0.0
 Stop-CM                                 84.911979         849.120     0.0
 Calc-Ekin                           424475.033958    11460825.917     0.0
 Constraint-V                       2955000.011820    23640000.095     0.0
 Constraint-Vir                       29550.011820      709200.284     0.0
 Settle                              985000.003940   318155001.273     0.1
 Virtual Site 3                      994850.007880    36809450.292     0.0
-----------------------------------------------------------------------------
 Total                                             449189816097.698   100.0
-----------------------------------------------------------------------------


    D O M A I N   D E C O M P O S I T I O N   S T A T I S T I C S

 av. #atoms communicated per step for force:  2 x 7385.2


Dynamic load balancing report:
 DLB was off during the run due to low measured imbalance.
 Average load imbalance: 2.3%.
 The balanceable part of the MD step is 71%, load imbalance is computed from this.
 Part of the total run time spent waiting due to load imbalance: 1.7%.
 Average PME mesh/force load: 1.107
 Part of the total run time spent waiting due to PP/PME imbalance: 3.0 %


     R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G

On 3 MPI ranks doing PP, each using 6 OpenMP threads, and
on 1 MPI rank doing PME, using 6 OpenMP threads

 Computing:          Num   Num      Call    Wall time         Giga-Cycles
                     Ranks Threads  Count      (s)         total sum    %
-----------------------------------------------------------------------------
 Domain decomp.         3    6    2500000    2780.735     110116.978   2.9
 DD comm. load          3    6     250002       2.637        104.429   0.0
 Vsite constr.          3    6  250000001    1594.487      63141.590   1.6
 Send X to PME          3    6  250000001    7890.027     312444.658   8.1
 Neighbor search        3    6    2500001    2839.201     112432.214   2.9
 Launch GPU ops.        3    6  500000002   14449.480     572198.659  14.9
 Comm. coord.           3    6  247500000    6515.707     258021.649   6.7
 Force                  3    6  250000001    8311.497     329134.841   8.5
 Wait + Comm. F         3    6  250000001    6385.613     252869.936   6.6
 PME mesh *             1    6  250000001   30494.860     402531.618  10.5
 PME wait for PP *                          42454.411     560397.479  14.5
 Wait + Recv. PME F     3    6  250000001    4527.057     179271.209   4.7
 Wait PME GPU gather    3    6  250000001    5774.959     228688.087   5.9
 Wait GPU NB nonloc.    3    6  250000001    2991.101     118447.460   3.1
 Wait GPU NB local      3    6  250000001     516.722      20462.167   0.5
 NB X/F buffer ops.     3    6  995000002    6551.121     259424.037   6.7
 Vsite spread           3    6  252500002    1979.413      78384.642   2.0
 Write traj.            3    6      50081      61.881       2450.483   0.1
 Update                 3    6  250000001    2936.818     116297.857   3.0
 Constraints            3    6  250000001    2539.319     100556.880   2.6
 Comm. energies         3    6   12500001     324.485      12849.594   0.3
-----------------------------------------------------------------------------
 Total                                      72949.272    3851716.415 100.0
-----------------------------------------------------------------------------
(*) Note that with separate PME ranks, the walltime column actually sums to
    twice the total reported, but the cycle count total and % are correct.
-----------------------------------------------------------------------------

               Core t (s)   Wall t (s)        (%)
       Time:  1750782.450    72949.272     2400.0
                         20h15:49
                 (ns/day)    (hour/ns)
Performance:      592.192        0.041
Finished mdrun on rank 0 Sun Aug  1 17:55:16 2021


************************* NEW *************************************************************


	M E G A - F L O P S   A C C O U N T I N G

 NB=Group-cutoff nonbonded kernels    NxN=N-by-N cluster Verlet kernels
 RF=Reaction-Field  VdW=Van der Waals  QSTab=quadratic-spline table
 W3=SPC/TIP3p  W4=TIP4p (single or pairs)
 V&F=Potential and force  V=Potential only  F=Force only

 Computing:                               M-Number         M-Flops  % Flops
-----------------------------------------------------------------------------
 Pair Search distance check           99678.650160      897107.851     0.0
 NxN Ewald Elec. + LJ [F]          66744762.106176  4405154299.008    98.2
 NxN Ewald Elec. + LJ [V&F]          674216.459264    72141161.141     1.6
 1,4 nonbonded interactions            9915.003966      892350.357     0.0
 Reset In Box                           424.475000        1273.425     0.0
 CG-CoM                                 424.491979        1273.476     0.0
 Bonds                                 3780.001512      223020.089     0.0
 Angles                                8257.503303     1387260.555     0.0
 RB-Dihedrals                          4005.001602      989235.396     0.0
 Pos. Restr.                            140.000056        7000.003     0.0
 Virial                                 427.867114        7701.608     0.0
 Stop-CM                                  0.865929           8.659     0.0
 Calc-Ekin                             8489.533958      229217.417     0.0
 Constraint-V                         29550.011820      265950.106     0.0
 Constraint-Vir                         295.511820        7092.284     0.0
 Settle                                9850.003940     3644501.458     0.1
 Virtual Site 3                        9948.507880      368094.792     0.0
-----------------------------------------------------------------------------
 Total                                              4486216547.624   100.0
-----------------------------------------------------------------------------


    D O M A I N   D E C O M P O S I T I O N   S T A T I S T I C S

 av. #atoms communicated per step for force:  2 x 7360.2


Dynamic load balancing report:
 DLB was off during the run due to low measured imbalance.
 Average load imbalance: 1.5%.
 The balanceable part of the MD step is 70%, load imbalance is computed from this.
 Part of the total run time spent waiting due to load imbalance: 1.0%.
 Average PME mesh/force load: 1.165
 Part of the total run time spent waiting due to PP/PME imbalance: 4.5 %


     R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G

On 3 MPI ranks doing PP, each using 6 OpenMP threads, and
on 1 MPI rank doing PME, using 6 OpenMP threads

 Computing:          Num   Num      Call    Wall time         Giga-Cycles
                     Ranks Threads  Count      (s)         total sum    %
-----------------------------------------------------------------------------
 Domain decomp.         3    6      25000      30.213       1196.446   3.0
 DD comm. load          3    6       2502       0.011          0.449   0.0
 Vsite constr.          3    6    2500001      20.233        801.206   2.0
 Send X to PME          3    6    2500001      63.151       2500.764   6.4
 Neighbor search        3    6      25001      30.360       1202.243   3.1
 Launch GPU ops.        3    6    5000002     141.350       5597.448  14.3
 Comm. coord.           3    6    2475000      63.092       2498.435   6.4
 Force                  3    6    2500001      85.594       3389.525   8.6
 Wait + Comm. F         3    6    2500001      60.124       2380.913   6.1
 PME mesh *             1    6    2500001     316.233       4174.272  10.6
 PME wait for PP *                            427.651       5644.985  14.4
 Wait + Recv. PME F     3    6    2500001      54.001       2138.427   5.4
 Wait PME GPU gather    3    6    2500001      62.879       2489.993   6.3
 Wait GPU NB nonloc.    3    6    2500001      20.148        797.841   2.0
 Wait GPU NB local      3    6    2500001       4.894        193.805   0.5
 NB X/F buffer ops.     3    6    9950002      74.722       2958.973   7.5
 Vsite spread           3    6    2525002      20.239        801.469   2.0
 Write traj.            3    6        501       0.579         22.928   0.1
 Update                 3    6    2500001      35.212       1394.384   3.6
 Constraints            3    6    2500001      27.844       1102.605   2.8
 Comm. energies         3    6     250001       3.313        131.201   0.3
-----------------------------------------------------------------------------
 Total                                        743.884      39277.038 100.0
-----------------------------------------------------------------------------
(*) Note that with separate PME ranks, the walltime column actually sums to
    twice the total reported, but the cycle count total and % are correct.
-----------------------------------------------------------------------------

               Core t (s)   Wall t (s)        (%)
       Time:    17853.170      743.884     2400.0
                 (ns/day)    (hour/ns)
Performance:      580.736        0.041
Finished mdrun on rank 0 Sat Dec 25 16:52:13 2021

From this log output it seems like the PME computation runs slower. The PME load is higher than the PP load, so this task mainly determines the performance. I don’t know why this got slower. I don’t think there are significant changes there.
Do you use the same CUDA version for both runs?
Was PME tuning active and if so, do the tuned PME grids differ between the two runs?
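
If tuning was active, the final grid is reported in the log; something like this should show it for both runs (the exact phrasing may differ between versions):

  grep -i "optimal pme grid" md_2020.log md_2021.log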

PME tuning was on, both grids announced as optimal at 32 32 72.
CUDA version for the old run: 11.10 (runtime driver 10.20)
For the new run: 11.50 (11.50)

This could be due to CUDA compiler/runtime performance differences. I suggest comparing 2020 vs 2021 using the same CUDA version. Can you please share the full log files?

However, I wonder if you really get scaling across four GPUs with such a small system? I could imagine that using a separate PME GPU could improve performance, but decomposing 17k atoms across three GPUs will generally not scale.

Side note: in particular, consider trying the direct GPU communication (in the 2021 release you need the extra environment variables GMX_GPU_DD_COMMS / GMX_GPU_PME_PP_COMMS; in the upcoming 2022 release you won’t need those anymore).
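
For reference, a rough sketch of what that looks like with the command line you posted (2021 series only; as far as I remember the variables are simply ignored if the run setup cannot use them):

  # experimental direct GPU communication paths in the 2021 release
  export GMX_GPU_DD_COMMS=1
  export GMX_GPU_PME_PP_COMMS=1
  gmx mdrun -nt 24 -ntmpi 4 -npme 1 -pme gpu -gputasks 0123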

Hi Szilard,

I probably won’t be able to test a “new” build of the old gmx version, but yes, I absolutely get distinct scaling across GPUs. I am however happy to rebuild the latest version with the additional variables set. What should that do in terms of performance?
One other question I’ve been meaning to ask: does Gromacs still benefit from hyperthreading? The reason is that we have some other software that actually gets hurt when HT is enabled.

Thanks!

If there is a high computational load on the CPU, HT will help. If there is a high memory/cache load, HT might hurt. mdrun will automatically avoid using HT when all compute is done on the GPU.
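
If you want to measure this yourself, mdrun's pinning options let you compare the two cases without touching the BIOS; a rough sketch using your own command line (thread counts are just an example):

  # one thread per physical core (skip the second hardware thread of each core)
  gmx mdrun -nt 24 -ntmpi 4 -npme 1 -pme gpu -gputasks 0123 -pin on -pinstride 2

  # pack threads onto consecutive hardware threads (uses hyperthreading)
  gmx mdrun -nt 24 -ntmpi 4 -npme 1 -pme gpu -gputasks 0123 -pin on -pinstride 1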

Hi Sasha,

Compiling the latest GROMACS 2020 is not very difficult, and it would help rule out whether the CUDA version, rather than the GROMACS release, is the culprit.

Also / alternatively, are you able to compare single GPU performance of the inputs/builds for which you report a regression?

I think you see your runs scale to four GPUs because you’re only offloading the nonbonded calculation, but not PME, bonded interactions, or integration/constraints. That also means you will most likely get better performance if you offload those as well, although it’s also quite likely that you’d then get peak performance on 1-2 GPUs (and you’ll need the direct communication support from 2021 / 2022 for the best scaling).
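
In mdrun terms each of those has its own offload switch; very roughly (concrete rank/thread counts are discussed further down):

  # nonbonded, PME, bonded and integration/constraints all on the GPU
  gmx mdrun -nb gpu -pme gpu -bonded gpu -update gpu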

No-no, I didn’t mean that it’s hard to build Gromacs. I am happy to rebuild the latest version – I meant your other suggestion to rebuild the old version with the newest CUDA, which I am sure could cause minor differences in performance.

About offloading PME: in the starting post I gave the actual mdrun line, which contains “-npme 1 -pme gpu”. Is this incorrect?

I did mean compiling the latest 2020, that is 2020.6, with the latest CUDA because earlier you wrote:

Assuming by “old run” you meant GROMACS 2020, and by “new run” 2021, you were comparing different versions of the code using different versions of the CUDA compiler and runtime.

Yes, if you wanted to offload PME and use a separate rank for that, it is correct (-pme gpu does the former).

The earlier performance numbers you posted were about 280-300 ns/day, but now you show log output that indicates 580-590 ns/day. These runs are not the same, and the difference is also smaller (only ~2% in the latter runs).

Ah, so then I understood correctly.

Now I am confused… Yes as in it’s incorrect, or yes it’s correct? :)
In other words, if I want to properly offload PME to GPU, should I have ‘-pme gpu’ without setting the number of ranks to 1?

Sorry about this confusion. The first post was about NPT relax with 1fs timestep. When responding to Berk, I used production NVT performance with 2fs timestep, hence the jump in ns/day. The systems are identical and so are the mdrun lines.

If you intended to use a separate GPU for PME, it is correct. GROMACS will by design use a single GPU per MPI rank, so if you want to use multiple GPUs, you have two means to parallelize:

  • assign PME to a separate rank (or ranks) and dedicate a GPU to it (or to each)
  • use domain decomposition, which maps the computation of each domain to an MPI rank, each of which can be assigned a different GPU.

If you write:

  • -ntmpi 2 -pme gpu implies an automatic count for -npme, which is by default zero, hence you would be requesting PME decomposition across the 2 ranks. This is not supported (it will have partial support with lib-MPI in the 2022 release).
  • -ntmpi 2 -npme 1 -pme gpu implies 1 PP rank (which does the particle work and integration) and one separate PME rank; the sketch below maps this onto your four GPUs.
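
To map that onto the command line you posted (if I read your -gputasks string correctly, there is one GPU task per rank, assigned in rank order with the PME rank last):

  gmx mdrun -nt 24 -ntmpi 4 -npme 1 -pme gpu -gputasks 0123
  #   rank 0 (PP)  -> GPU 0
  #   rank 1 (PP)  -> GPU 1
  #   rank 2 (PP)  -> GPU 2
  #   rank 3 (PME) -> GPU 3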

I hope that clarifies things!

OK, but the performance gap between the two 1 fs runs is still greater than the one between the 2 fs runs, so I am not sure there is an actual regression, nor that it is unrelated to the different CUDA versions.

Yes, but not completely, because I have zero idea what would be an “optimal” number of PME ranks. For instance, let’s consider the system I already have. My mdrun command is currently:
mdrun -nt 24 -ntmpi 4 -npme 1 -pme gpu -gputasks 0123
and my understanding of your comment is that -npme could in general be N (between 1 and 3), in which case 4 - N ranks will be allotted to PP/integration. In other words, if I set
mdrun -nt 24 -ntmpi 4 -npme 2 -pme gpu
I will get two ranks for PP and two for PME (instead of 3 and 1 with the original line). Now, here is my question: off the bat, how would you set these numbers? Are there any hints for optimality? Basically, I have zero feel for the relative computational “weights” of PP and PME. Hence, completely blindly, knowing only the system size (say, 20K particles) and the hardware I described, what would be your first attempt at the mdrun line?

My first post was about v. 2021 (not 2021.x). My response to Berk with the 2 fs timestep was run on the latest version I could find & build at that moment (2021.4, I think). I think the initial question I posted is more or less moot, given the different CUDA/driver versions, etc.
However, the other things you are commenting on are of utmost importance to me! :)

Thanks as always!

That depends on the hardware and the simulation system. For GPU machines, the limitations posed by your use case make it relatively simple: i) the current mdrun capabilities (PME decomposition is not supported on GPUs until the next release), and ii) the extremely small input won’t scale anyway.

Let’s make things concrete: using the 2021 code, either of the following two is likely to be fastest:

  • gmx mdrun -ntmpi 1 -ntomp 24 -bonded gpu -pme gpu -update gpu
  • GMX_GPU_DD_COMMS=1 GMX_GPU_PME_PP_COMMS=1 gmx mdrun -ntmpi 2 -npme 1 -ntomp 23 -ntomp_pme 1 -bonded gpu -pme gpu -update gpu

These will run on one and two GPUs, respectively, and I don’t think you can get better performance on more GPUs with 20k atoms, but I’m curious to hear your feedback.

Cheers,
Szilard

Thank you. I am very happy to try and report here. So, the version is 2021.4 and I just tried

  • gmx mdrun -ntmpi 1 -ntomp 24 -bonded gpu -pme gpu -update gpu

This one fails with:
“Inconsistency in user input: Update task on the GPU was required, but the following conditions were not satisfied: Virtual sites are not supported.”

The vsites complaint is probably because I use TIP4P water. Any suggestions? Thanks!
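
In case it matters, the obvious fallback I can try in the meantime is to keep the update/constraints on the CPU and offload the rest, i.e.:

  # same as above, but integration/constraints stay on the CPU (vsites are fine there)
  gmx mdrun -ntmpi 1 -ntomp 24 -bonded gpu -pme gpu -update cpu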