Why is GPU state copied so often?

GROMACS version: 2020.2
GROMACS modification: No

So I’m trying to run a tiny simulation on a very fast GPU: 19,000 atoms on an NVIDIA A100. I’m using the NVIDIA NGC container from the NVIDIA registry, with the relevant environment flags enabled (overkill, actually, as this is running on only one GPU, not several).

Here’s what I believe is true:
The whole simulation step has been ported to the GPU, except for neighbor-list building. To me this means that the simulation can run for nstlist steps entirely in GPU memory, and the state is only transferred back to the CPU for the neighbor list to be rebuilt (or because a trajectory write-out or similar has been requested).

However, what I see is that over the 150,601 steps (hence the number of force cycles calculated) the neighbor list is built just 754 times (yep, that’s nstlist 200), but the state is copied (“Wait GPU state copy”) 50,455 times, i.e. roughly every 3 steps. This is a huge drag. Why is it doing this?

Also, if you have any tips on accelerating this (without bulking up the system size), they’d be much appreciated. I know GROMACS wasn’t designed for little systems, and because the GPU is so fast these overheads pile up, but still.

Flags:

ENV GMX_GPU_DD_COMMS true
ENV GMX_GPU_PME_PP_COMMS true
ENV GMX_FORCE_UPDATE_DEFAULT_GPU true

Command:

gmx mdrun -deffnm md_prod -ntmpi 1 -ntomp 4 -nb gpu -bonded gpu -pme gpu -nstlist 200 -pin on -update gpu

Note: This is a 19,000-atom system with amber99sb-ildn and TIP3P. It’s run here with just the options from Justin Lemkul’s classic tutorials.

title                   = OPLS Lysozyme NPT equilibration
; Run parameters
integrator              = md        ; leap-frog integrator
nsteps                  = 30000000    ; 30000000 * 2 fs = 60000 ps (60 ns)
dt                      = 0.002     ; 2 fs
; Output control
nstxout                 = 0         ; suppress bulky .trr file by specifying
nstvout                 = 0         ; 0 for output frequency of nstxout,
nstfout                 = 0         ; nstvout, and nstfout
nstenergy               = 2000      ; save energies every 4.0 ps
nstlog                  = 2000      ; update log file every 4.0 ps
nstxout-compressed      = 2000      ; save compressed coordinates every 4.0 ps
compressed-x-grps       = System    ; save the whole system
; Bond parameters
continuation            = yes       ; Restarting after NPT
constraint_algorithm    = lincs     ; holonomic constraints
constraints             = h-bonds   ; bonds involving H are constrained
lincs_iter              = 1         ; accuracy of LINCS
lincs_order             = 4         ; also related to accuracy
; Neighborsearching
cutoff-scheme           = Verlet    ; Buffered neighbor searching
ns_type                 = grid      ; search neighboring grid cells
nstlist                 = 10        ; 20 fs, largely irrelevant with Verlet scheme
rcoulomb                = 1.0       ; short-range electrostatic cutoff (in nm)
rvdw                    = 1.0       ; short-range van der Waals cutoff (in nm)
; Electrostatics
coulombtype             = PME       ; Particle Mesh Ewald for long-range electrostatics
pme_order               = 4         ; cubic interpolation
fourierspacing          = 0.16      ; grid spacing for FFT
; Temperature coupling is on
tcoupl                  = V-rescale             ; modified Berendsen thermostat
tc-grps                 = Protein Non-Protein   ; two coupling groups - more accurate
tau_t                   = 0.1     0.1           ; time constant, in ps
ref_t                   = 340     340           ; reference temperature, one for each group, in K
; Pressure coupling is on
pcoupl                  = Parrinello-Rahman     ; Pressure coupling on in NPT
pcoupltype              = isotropic             ; uniform scaling of box vectors
tau_p                   = 2.0                   ; time constant, in ps
ref_p                   = 1.0                   ; reference pressure, in bar
compressibility         = 4.5e-5                ; isothermal compressibility of water, bar^-1
; Periodic boundary conditions
pbc                     = xyz       ; 3-D PBC
; Dispersion correction
DispCorr                = EnerPres  ; account for cut-off vdW scheme
; Velocity generation
gen_vel                 = no        ; Velocity generation is off

Hi,

Some caveats first: mdrun has two modes:

  • force-offload mode: integration is done on the CPU, so x/f are transferred every step; and
  • GPU-resident step mode: integration is also offloaded, so x/f are only transferred to the CPU when needed (see the command-line sketch after this list).
    The important distinction is that the latter prioritizes keeping the GPU busy at the cost of letting the CPU idle (for more details see 10.1063/5.0018516).
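
A minimal sketch of how the two modes map onto the standard mdrun -update option (reusing the file name from the command above, all other options left at their defaults):

# force-offload mode: integration/constraints stay on the CPU, x/f are copied every step
gmx mdrun -deffnm md_prod -nb gpu -pme gpu -bonded gpu -update cpu

# GPU-resident mode: integration/constraints run on the GPU too, x/f are copied only when needed
gmx mdrun -deffnm md_prod -nb gpu -pme gpu -bonded gpu -update gpu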

GROMACS by design relies on heterogeneous parallelization, where tasks that are not offloaded to the GPU are computed on the CPU (concurrently with GPU execution when possible, in a way “back-offloading” from the GPU to the CPU) instead of simply refusing to run with a GPU if you enable a feature that has not been ported. Such tasks include some bonded interaction types, free energy computations, virial computation, and COM motion removal. Note, however, that the presence of such CPU computation does not mean that overall performance is greatly impacted. For instance, if some of the bonded interactions are computed on the CPU, as long as this completes before the GPU needs the results for integration, performance will not suffer.
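
As an illustration of that overlap (a sketch reusing the command above, not a tuning recommendation), you could deliberately keep the bonded interactions on the CPU with the regular -bonded option and let them run concurrently with the GPU work:

# bonded terms on the CPU, overlapping with nonbonded/PME/update on the GPU
gmx mdrun -deffnm md_prod -nb gpu -pme gpu -bonded cpu -update gpu

For a system this small, the CPU may well finish the bonded work before the GPU needs the forces for integration, in which case the runtime impact is negligible.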

As noted above, some tasks are still kept on the CPU, and these require the coordinates to be copied back. The likely candidate in your case is pressure coupling.

Compare your run with one that has no pressure coupling to see how much impact that has; you may see much less frequent “Wait GPU state copy”. If your pressure coupling setup allows it (and you see a significant improvement from the previous experiment), you can also increase nstpcouple slightly.
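
For example, the two experiments only need small .mdp changes (the nstpcouple value below is purely illustrative; pick what your ensemble and barostat tolerate):

; experiment 1: control run without pressure coupling
pcoupl                  = no

; experiment 2: keep NPT, but apply pressure coupling less frequently
pcoupl                  = Parrinello-Rahman
nstpcouple              = 50        ; steps between pressure coupling, illustrative value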

Also note that time spent in “Wait GPU state copy” does not mean overhead or inefficiency: if the CPU has nothing else to do, it launches several steps’ worth of work on the GPU and then waits for the coordinates to arrive for the next search.

That’s actually not the case; GROMACS has a fair amount of optimization for real-world system sizes, starting from thousands of atoms (see some data in 10.1063/5.0018516).

However, note that an NVIDIA A100 will need >50k atoms to reach peak throughput, so smaller systems will always run less efficiently.

Let us know if you have further questions!

Cheers,
Szilárd

PS: you only need those environment variables to enable the experimental direct GPU communication feature on multi-GPU runs, not for a single-GPU run (also, the last one has the same effect as -update gpu, which you already pass).
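
In other words, for this single-GPU run the command line alone (your exact command, just without the environment variables) should give the same behavior:

gmx mdrun -deffnm md_prod -ntmpi 1 -ntomp 4 -nb gpu -bonded gpu -pme gpu -nstlist 200 -pin on -update gpu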

Hi Szilárd,

Thank you very much for your thorough reply; it is very interesting. I also feel I sounded a bit forceful and negative. I think GROMACS is wonderful, both for its performance and its ease of use.

I would like to rescind my comment about GROMACS not being designed for little systems. Sorry.

I will check the pressure coupling, thank you; I hadn’t considered that part of the step.

With regard to the flags: yes, I’ve set flags that are only relevant for multi-GPU runs (it’s just a handy default) and forced the update onto the GPU twice, once on the command line and once via the environment variable, as I like to be doubly sure. I should have stripped more of these details from the run before posting.

Again thanks,
Oliver