GPU acceleration on Mac M1 mini

At the time of replying here, I was not aware that the MR proposes completely removing clFFT; I only realized this when I looked at the MR.

For better or worse, OpenCL is our most stable portability backend, so we have to keep it as stable as possible until SYCL is ready to replace it. Therefore, even if we were at the beginning of a release cycle, I’d proceed with caution and test extensively before removing clFFT.

At this stage, we simply do not have the time to test extensively on all four supported platforms, so it’s better to add the VkFFT backend, allow using it and switch it to the default wherever we are confident that it is the best choice.


How do you accomplish “GPU-resident” mode if GROMACS doesn’t use FP64 on GPUs? Does that automatically downgrade the precision from “mixed precision” to “single precision”? If so, I would rather stay in “mixed precision” so that the simulation is more numerically stable.

If we could run the entire simulation on GPU without ever communicating with the CPU except to report errors, we could reduce driver overhead by orders of magnitude. Even simulations with 100-1000 atoms might now run faster on the GPU than on the CPU, because we can encode commands ahead of time and submit massive chunks of time steps in a single command buffer.

How do you accomplish “GPU-resident” mode if GROMACS doesn’t use FP64 on GPUs? Does that automatically downgrade the precision from “mixed precision” to “single precision”? If so, I would rather stay in “mixed precision” so that the simulation is more numerically stable.

Some background on what “mixed” and “double” mean in GROMACS:
https://manual.gromacs.org/current/reference-manual/definitions.html#mixed-or-double-precision

In mixed-precision mode, most of the operations are FP32, while FP64 is used only for computing aggregate values, which typically only happens once every 10-100 steps. In the GPU-resident mode, we still have to use the CPU when we need to compute aggregate values since it’s not trivial to do it purely on the GPU, even without taking the precision into account. So, no loss of precision, we still use FP64 when needed.
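
To illustrate the idea with a minimal sketch (not GROMACS source, just the concept): per-step quantities stay in FP32, and only the occasional reduction into an aggregate value is carried out in FP64.

```cpp
#include <cstdio>
#include <vector>

// Conceptual sketch of a mixed-precision reduction: per-particle terms are
// FP32, the aggregate is accumulated in FP64 so that summing many small
// contributions does not lose accuracy.
double reduceEnergies(const std::vector<float>& perParticleEnergy)
{
    double total = 0.0; // FP64 accumulator
    for (float e : perParticleEnergy)
    {
        total += static_cast<double>(e);
    }
    return total;
}

int main()
{
    std::vector<float> energies(100000, 1.0e-3f);
    // This aggregate is only needed once every ~10-100 steps.
    std::printf("total energy: %.9f\n", reduceEnergies(energies));
}
```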

If we could run the entire simulation on GPU without ever communicating with the CPU except to report errors, we could reduce driver overhead by orders of magnitude.

Typically, in “GPU-resident” mode, we can schedule tens to hundreds of GPU-resident steps before we need to synchronize with the CPU. So, not quite the “entire simulation”, but close enough.
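
Roughly, the pattern looks like the following sketch (plain SYCL pseudo-MD for illustration only; the kernel body and the batch length are made up, this is not our actual scheduling code):

```cpp
#include <sycl/sycl.hpp>

int main()
{
    // In-order queue so the enqueued steps execute one after another.
    sycl::queue q{sycl::property::queue::in_order()};

    constexpr int nAtoms        = 3000;
    constexpr int stepsPerBatch = 100; // hypothetical; "tens to hundreds"
    constexpr int numBatches    = 10;

    float* coords = sycl::malloc_shared<float>(3 * nAtoms, q);
    float* forces = sycl::malloc_device<float>(3 * nAtoms, q);
    for (int i = 0; i < 3 * nAtoms; ++i)
    {
        coords[i] = 0.0f;
    }
    q.fill(forces, 0.0f, 3 * nAtoms).wait();

    for (int batch = 0; batch < numBatches; ++batch)
    {
        // Enqueue many MD steps back-to-back; the CPU does not wait in between.
        for (int step = 0; step < stepsPerBatch; ++step)
        {
            q.parallel_for(sycl::range<1>(nAtoms), [=](sycl::id<1> idx) {
                const size_t a = idx[0];
                // placeholder for force computation + integration
                coords[3 * a] += 1.0e-6f * forces[3 * a];
            });
        }
        // Synchronize with the CPU only here, e.g. to reduce aggregate
        // values in FP64 or to regenerate the pair list.
        q.wait();
    }

    sycl::free(coords, q);
    sycl::free(forces, q);
}
```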


What I’d add to the above is that:

  • GPU-resident mode only differs from the force-offload mode in that integration and constraining happen on the GPU (though this has ramifications for which compute unit is the “primary” one kept busy during regular MD steps and which one is left to idle when heterogeneous execution still requires data movement; see Fig. 2 of https://aip.scitation.org/doi/full/10.1063/5.0018516);
  • FP64 integration is not necessary for the energy conservation requirements of a typical MD simulation (with forces computed in FP32).

I ran some new calculations from the GROMACS benchmark data: Performance report on Apple-designed GPUs (#4615) · Issues · GROMACS / GROMACS · GitLab. We can compute wall-clock microseconds per time-step as (86400 * time-step in femtoseconds) / (throughput in ns/day); a small worked conversion is sketched after the questions below. My analysis explained why the M1 GPU was slower than the CPU for SNC, even though this was not the case on the reference system. Questions:

  1. In what year did you implement GPU-resident execution? Maybe 2020, but I’m not 100% sure. If it was 2018, the benchmarked system may already have benefited from the optimization.
  2. What is the maximum ns/day you have ever seen for GPU-hosted computation with a 2 fs time-step? The reference system (GTX 1080) achieved 900 ns/day, but smaller simulations might achieve something higher. Did this “world record” speed increase after you implemented GPU-resident execution, and by how much? You could also use a record with a different time-step (e.g. 20 fs) and divide the empirical ns/day by 10.
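
To make the arithmetic explicit, here is the conversion spelled out (the 900 ns/day figure is just the GTX 1080 reference number from question 2):

```cpp
#include <cstdio>

// Wall-clock time per MD step, derived from throughput:
//   us/step = 86400 [s/day] * dt [fs] / throughput [ns/day]
// (the fs->ns and s->us unit factors cancel out to 1).
double microsecondsPerStep(double timeStepFs, double nsPerDay)
{
    return 86400.0 * timeStepFs / nsPerDay;
}

int main()
{
    // Example: 2 fs steps at ~900 ns/day -> ~192 us of wall time per step.
    std::printf("%.1f us/step\n", microsecondsPerStep(2.0, 900.0));
}
```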

For (2), I was trying to judge the minimum CPU-GPU driver overhead, not GPU-side performance. Ideally, you would try a 2-atom simulation on any recent GPU. You might even be able to try this right now, if you’ve never run such a benchmark before.

My main issue is that for some small simulations, the GPU might be unreasonably slow compared to the CPU. When you get to the order of 10-100 atoms, ns/day is so ridiculously high that I’m not too bothered if the simulation achieves less ns/day than theoretically possible. I want to rely on the GPU for all simulations in an everyday workflow, either because the GPU is always faster than the CPU, or because the simulation is so small that switching to the CPU for speed would be overkill. I’d like to focus on debugging my nanotechnology design rather than on which processor would simulate it more quickly.

No need to decide now; the author of VkFFT just ported it to Metal! I’m going to help them achieve optimal performance.

Would you mind if I went ahead and implemented a Metal backend for VkFFT on Apple silicon for GROMACS? I would pre-allocate CPU memory using virtual memory, then hopefully that can be used as the backing buffer for both Metal and OpenCL.
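
Roughly what I have in mind (a sketch only; the OpenCL side below is standard API usage, while handing the same pointer to Metal via newBufferWithBytesNoCopy is the part I’d still need to verify):

```cpp
#include <stdlib.h>
#include <unistd.h>
#ifdef __APPLE__
#include <OpenCL/opencl.h>
#else
#include <CL/cl.h>
#endif

int main()
{
    // Page-aligned allocation: both Metal's newBufferWithBytesNoCopy and
    // OpenCL's CL_MEM_USE_HOST_PTR want (or strongly prefer) page-aligned
    // host memory.
    const size_t pageSize = static_cast<size_t>(sysconf(_SC_PAGESIZE));
    const size_t bytes    = 1 << 20; // 1 MiB, already a page multiple
    void*        hostPtr  = nullptr;
    if (posix_memalign(&hostPtr, pageSize, bytes) != 0)
    {
        return 1;
    }

    // Wrap the allocation as an OpenCL buffer that uses the host memory
    // directly instead of copying it.
    cl_platform_id platform;
    cl_device_id   device;
    clGetPlatformIDs(1, &platform, nullptr);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, nullptr);
    cl_int     err;
    cl_context ctx = clCreateContext(nullptr, 1, &device, nullptr, nullptr, &err);
    cl_mem     buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
                                    bytes, hostPtr, &err);

    // ... hand hostPtr to the Metal side as well (not shown here) ...

    clReleaseMemObject(buf);
    clReleaseContext(ctx);
    free(hostPtr);
}
```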

Also, do non-bonded force computations use threadgroup memory for the bulk of computation? I commented on the performance report thread about how M1 has very slow threadgroup memory, which may be the bottleneck.

Metal has two versions of each transcendental function + division: fast and precise (which is slower but IEEE compliant). I plan to expose both through SYCL, making precise the default unless you request otherwise. Is this relevant at all to GROMACS?
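
To make the question concrete, this is roughly what the choice would look like from the SYCL side; the sycl::native:: builtins are the ones I’d map to Metal’s fast variants (that mapping is my plan, not something GROMACS defines):

```cpp
#include <sycl/sycl.hpp>

#include <algorithm>
#include <cmath>
#include <cstdio>

int main()
{
    sycl::queue q;
    constexpr int n = 1024;
    float* x        = sycl::malloc_shared<float>(n, q);
    float* yPrecise = sycl::malloc_shared<float>(n, q);
    float* yFast    = sycl::malloc_shared<float>(n, q);
    for (int i = 0; i < n; ++i)
    {
        x[i] = 0.001f * i;
    }

    q.parallel_for(sycl::range<1>(n), [=](sycl::id<1> idx) {
        const size_t i = idx[0];
        // Default: precise builtins (would map to Metal's precise:: variants).
        yPrecise[i] = sycl::sin(x[i]) * sycl::rsqrt(x[i] + 1.0f);
        // Opt-in: reduced-accuracy builtins (would map to Metal's fast:: variants).
        yFast[i] = sycl::native::sin(x[i]) * sycl::native::rsqrt(x[i] + 1.0f);
    }).wait();

    float maxDiff = 0.0f;
    for (int i = 0; i < n; ++i)
    {
        maxDiff = std::max(maxDiff, std::abs(yPrecise[i] - yFast[i]));
    }
    std::printf("max |precise - fast| = %g\n", maxDiff);

    sycl::free(x, q);
    sycl::free(yPrecise, q);
    sycl::free(yFast, q);
}
```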

In the force kernels in GROMACS we usually don’t need full single precision. We therefore like to use faster math at the cost of some precision. So if there is a non-negligible improvement with fast, I would suggest using that as the default. We should check that energy conservation is not affected.

It was released with the 2020 version, but I’m not following how that matters; which benchmarks are you referring to?

The peak simulation throughput I’ve measured in the past was around 3-3.5 us/day, so a bit under 50 us/iteration on a single GPU with a small system of a few thousand atoms. This has probably not improved much on more recent hardware (in particular as clock speeds have not increased much, though consumer Lovelace GPUs may break that peak), but I have not tested such cases in recent times.

That could in principle happen, especially for tens of atoms, which have nowhere near enough parallelism for a large GPU (or even for a large CPU, e.g. with 64 cores). When you get to that regime, higher clocks and fewer cores will be best.

For ~1000 atoms, I don’t think a GPU-resident run will be slower than CPU-only, but it depends on the exact hardware comparison, simulation size, settings, etc.
I’m not sure about the 10-100 atoms regime; neither our code nor our algorithms are specifically optimized for such small inputs (e.g. if the box is too small you can’t increase -nstlist much and you can’t balance the pair search cost to maximize GPU utilization).

I could imagine that one of the very recent high-core-count, high-clock desktop CPUs like Zen 4 (at >5 GHz) could be quite competitive with GPUs for such small systems.

Also note that as long as the time it takes to enqueue work for a single iteration is shorter than the time the GPU computation itself takes, GPU launch overheads may throttle the execution slightly but won’t completely starve the GPU as we’re launching tens to hundreds of steps, so a large fraction of the launch overhead will overlap with GPU execution.
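
As a toy model of that argument (the 5 us launch overhead below is just an assumed number for illustration, not a measured GROMACS value): as long as the per-step enqueue time stays below the per-step GPU time, the effective step time is set by the GPU, not by the launches.

```cpp
#include <algorithm>
#include <cstdio>

// Toy pipelining model: the CPU enqueues `steps` kernels back-to-back while
// the GPU consumes them. If enqueueing a step is cheaper than executing it,
// the launch cost hides behind GPU execution; otherwise the GPU starves.
double effectiveUsPerStep(double launchUs, double gpuUs, int steps)
{
    const double cpuSide = steps * launchUs;          // time to enqueue everything
    const double gpuSide = launchUs + steps * gpuUs;  // first launch, then GPU-bound
    return std::max(cpuSide, gpuSide) / steps;
}

int main()
{
    // Assumed: 5 us to enqueue a step, 16 us of GPU work per step, 100 steps.
    std::printf("GPU-bound:    %.2f us/step\n", effectiveUsPerStep(5.0, 16.0, 100));
    // If the GPU work were only 3 us/step, launches would dominate instead:
    std::printf("launch-bound: %.2f us/step\n", effectiveUsPerStep(5.0, 3.0, 100));
}
```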

Cheers,
Szilárd


SOL is around 16 us/iteration in GPU runs (10-11 us/day); in CPU runs it is around 2.5 us/iteration (~70 us/day) using an EPYC Milan CPU, so not very high clocks.

With 1500-2000 atoms, which is more realistic for a very small input, I measured ~4 us/day with PME and ~5.2 us/day with RF.