GPU acceleration on Mac M1 mini

At the time of replying here, I was not aware that the MR proposes completely removing clFFT; I only realized this when I looked at the MR.

For better or worse, OpenCL is our most stable portability backend, so we have to keep it as stable as possible until SYCL is ready to replace it. Therefore, even if we were at the beginning of a release cycle, I’d proceed with caution and test extensively before removing clFFT.

At this stage, we simply do not have the time to test extensively on all four supported platforms, so it’s better to add the VkFFT backend, allow using it and switch it to the default wherever we are confident that it is the best choice.


How do you accomplish “GPU-resident” mode if GROMACS doesn’t use FP64 on GPUs? Does that automatically downgrade the precision from “mixed precision” to “single precision”? If so, I would rather stay in “mixed precision” so that the simulation is more numerically stable.

If we could run the entire simulation on GPU without ever communicating with the CPU except to report errors, we could reduce driver overhead by orders of magnitude. Even simulations with 100-1000 atoms might now run faster on the GPU than on the CPU, because we can encode commands ahead of time and submit massive chunks of time steps in a single command buffer.

How do you accomplish “GPU-resident” mode if GROMACS doesn’t use FP64 on GPUs? Does that automatically downgrade the precision from “mixed precision” to “single precision”? If so, I would rather stay in “mixed precision” so that the simulation is more numerically stable.

Some background on what “mixed” and “double” mean in GROMACS:
https://manual.gromacs.org/current/reference-manual/definitions.html#mixed-or-double-precision

In mixed-precision mode, most of the operations are FP32, while FP64 is used only for computing aggregate values, which typically only happens once every 10-100 steps. In the GPU-resident mode, we still have to use the CPU when we need to compute aggregate values since it’s not trivial to do it purely on the GPU, even without taking the precision into account. So, no loss of precision, we still use FP64 when needed.
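
To illustrate the idea with a minimal sketch (not GROMACS source, just the concept): per-step quantities stay in FP32, and only the occasional reduction into an aggregate value is carried out in FP64.

```cpp
#include <cstdio>
#include <vector>

// Conceptual sketch of a mixed-precision reduction: per-particle terms are
// FP32, the aggregate is accumulated in FP64 so that summing many small
// contributions does not lose accuracy.
double reduceEnergies(const std::vector<float>& perParticleEnergy)
{
    double total = 0.0; // FP64 accumulator
    for (float e : perParticleEnergy)
    {
        total += static_cast<double>(e);
    }
    return total;
}

int main()
{
    std::vector<float> energies(100000, 1.0e-3f);
    // This aggregate is only needed once every ~10-100 steps.
    std::printf("total energy: %.9f\n", reduceEnergies(energies));
}
```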

If we could run the entire simulation on GPU without ever communicating with the CPU except to report errors, we could reduce driver overhead by orders of magnitude.

Typically, in “GPU-resident” mode, we can schedule tens to hundreds of GPU-resident steps before we need to synchronize with the CPU. So, not quite the “entire simulation”, but close enough.
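
Roughly, the pattern looks like the following sketch (plain SYCL pseudo-MD for illustration only; the kernel body and the batch length are made up, this is not our actual scheduling code):

```cpp
#include <sycl/sycl.hpp>

int main()
{
    // In-order queue so the enqueued steps execute one after another.
    sycl::queue q{sycl::property::queue::in_order()};

    constexpr int nAtoms        = 3000;
    constexpr int stepsPerBatch = 100; // hypothetical; "tens to hundreds"
    constexpr int numBatches    = 10;

    float* coords = sycl::malloc_shared<float>(3 * nAtoms, q);
    float* forces = sycl::malloc_device<float>(3 * nAtoms, q);
    for (int i = 0; i < 3 * nAtoms; ++i)
    {
        coords[i] = 0.0f;
    }
    q.fill(forces, 0.0f, 3 * nAtoms).wait();

    for (int batch = 0; batch < numBatches; ++batch)
    {
        // Enqueue many MD steps back-to-back; the CPU does not wait in between.
        for (int step = 0; step < stepsPerBatch; ++step)
        {
            q.parallel_for(sycl::range<1>(nAtoms), [=](sycl::id<1> idx) {
                const size_t a = idx[0];
                // placeholder for force computation + integration
                coords[3 * a] += 1.0e-6f * forces[3 * a];
            });
        }
        // Synchronize with the CPU only here, e.g. to reduce aggregate
        // values in FP64 or to regenerate the pair list.
        q.wait();
    }

    sycl::free(coords, q);
    sycl::free(forces, q);
}
```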


What I’d add to the above is that:

  • GPU-resident mode only differs from the force-offload mode in that integration and constraining happen on the GPU (though this has ramifications for which compute unit is the “primary” one kept busy during regular MD steps and which one is left to idle when heterogeneous execution still requires data movement; see Fig. 2 of https://aip.scitation.org/doi/full/10.1063/5.0018516);
  • FP64 integration is not necessary for the energy conservation requirements of a typical MD simulation (with forces computed in FP32).

I ran some new calculations from the GROMACS benchmark data: Performance report on Apple-designed GPUs (#4615) · Issues · GROMACS / GROMACS · GitLab. We can compute wall-clock microseconds per time-step as (86400 * time-step in femtoseconds) / (throughput in ns/day); a small worked conversion is sketched after the questions below. My analysis explained why the M1 GPU was slower than the CPU for SNC, even though this was not the case on the reference system. Questions:

  1. In what year did you implement GPU-resident execution? Maybe 2020, but I’m not 100% sure. If it was 2018, the benchmarked system may already have benefited from the optimization.
  2. What is the maximum ns/day you have ever seen for GPU-hosted computation with a 2 fs time-step? The reference system (GTX 1080) achieved 900 ns/day, but smaller simulations might achieve something higher. Did this “world record” speed increase after you implemented GPU-resident execution, and by how much? You could also use a record with a different time-step (e.g. 20 fs) and divide the empirical ns/day by 10.
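
To make the arithmetic explicit, here is the conversion spelled out (the 900 ns/day figure is just the GTX 1080 reference number from question 2):

```cpp
#include <cstdio>

// Wall-clock time per MD step, derived from throughput:
//   us/step = 86400 [s/day] * dt [fs] / throughput [ns/day]
// (the fs->ns and s->us unit factors cancel out to 1).
double microsecondsPerStep(double timeStepFs, double nsPerDay)
{
    return 86400.0 * timeStepFs / nsPerDay;
}

int main()
{
    // Example: 2 fs steps at ~900 ns/day -> ~192 us of wall time per step.
    std::printf("%.1f us/step\n", microsecondsPerStep(2.0, 900.0));
}
```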

For (2), I was trying to judge the minimum CPU-GPU driver overhead, not GPU-side performance. Ideally, you would try a 2-atom simulation on any recent GPU. You might even be able to try this right now, if you’ve never run such a benchmark before.

My main issue is that for some small simulations, the GPU might be unreasonably slow compared to the CPU. When you get to the order of 10-100 atoms, ns/day is so ridiculously high that I’m not too bothered if the simulation achieves less ns/day than theoretically possible. I want to rely on the GPU for all simulations in an everyday workflow, either because the GPU is always faster than the CPU, or because the simulation is so small that switching to the CPU for speed would be overkill. I’d like to focus on debugging my nanotechnology design rather than on which processor would simulate it more quickly.

No need to decide now; the author of VkFFT just ported it to Metal! I’m going to help them achieve optimal performance.

Would you mind if I went ahead and implemented a Metal backend for VkFFT on Apple silicon for GROMACS? I would pre-allocate CPU memory using virtual memory, then hopefully that can be used as the backing buffer for both Metal and OpenCL.
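
Roughly what I have in mind (a sketch only; the OpenCL side below is standard API usage, while handing the same pointer to Metal via newBufferWithBytesNoCopy is the part I’d still need to verify):

```cpp
#include <stdlib.h>
#include <unistd.h>
#ifdef __APPLE__
#include <OpenCL/opencl.h>
#else
#include <CL/cl.h>
#endif

int main()
{
    // Page-aligned allocation: both Metal's newBufferWithBytesNoCopy and
    // OpenCL's CL_MEM_USE_HOST_PTR want (or strongly prefer) page-aligned
    // host memory.
    const size_t pageSize = static_cast<size_t>(sysconf(_SC_PAGESIZE));
    const size_t bytes    = 1 << 20; // 1 MiB, already a page multiple
    void*        hostPtr  = nullptr;
    if (posix_memalign(&hostPtr, pageSize, bytes) != 0)
    {
        return 1;
    }

    // Wrap the allocation as an OpenCL buffer that uses the host memory
    // directly instead of copying it.
    cl_platform_id platform;
    cl_device_id   device;
    clGetPlatformIDs(1, &platform, nullptr);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, nullptr);
    cl_int     err;
    cl_context ctx = clCreateContext(nullptr, 1, &device, nullptr, nullptr, &err);
    cl_mem     buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
                                    bytes, hostPtr, &err);

    // ... hand hostPtr to the Metal side as well (not shown here) ...

    clReleaseMemObject(buf);
    clReleaseContext(ctx);
    free(hostPtr);
}
```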

Also, do non-bonded force computations use threadgroup memory for the bulk of computation? I commented on the performance report thread about how M1 has very slow threadgroup memory, which may be the bottleneck.

Metal has two versions of each transcendental function + division: fast and precise (which is slower but IEEE compliant). I plan to expose both through SYCL, making precise the default unless you request otherwise. Is this relevant at all to GROMACS?
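
To make the question concrete, this is roughly what the choice would look like from the SYCL side; the sycl::native:: builtins are the ones I’d map to Metal’s fast variants (that mapping is my plan, not something GROMACS defines):

```cpp
#include <sycl/sycl.hpp>

#include <algorithm>
#include <cmath>
#include <cstdio>

int main()
{
    sycl::queue q;
    constexpr int n = 1024;
    float* x        = sycl::malloc_shared<float>(n, q);
    float* yPrecise = sycl::malloc_shared<float>(n, q);
    float* yFast    = sycl::malloc_shared<float>(n, q);
    for (int i = 0; i < n; ++i)
    {
        x[i] = 0.001f * i;
    }

    q.parallel_for(sycl::range<1>(n), [=](sycl::id<1> idx) {
        const size_t i = idx[0];
        // Default: precise builtins (would map to Metal's precise:: variants).
        yPrecise[i] = sycl::sin(x[i]) * sycl::rsqrt(x[i] + 1.0f);
        // Opt-in: reduced-accuracy builtins (would map to Metal's fast:: variants).
        yFast[i] = sycl::native::sin(x[i]) * sycl::native::rsqrt(x[i] + 1.0f);
    }).wait();

    float maxDiff = 0.0f;
    for (int i = 0; i < n; ++i)
    {
        maxDiff = std::max(maxDiff, std::abs(yPrecise[i] - yFast[i]));
    }
    std::printf("max |precise - fast| = %g\n", maxDiff);

    sycl::free(x, q);
    sycl::free(yPrecise, q);
    sycl::free(yFast, q);
}
```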

In the force kernels in GROMACS we usually don’t need full single precision. We therefore like to use faster math at the cost of some precision. So if there is a non-negligible improvement with fast, I would suggest using that as the default. We should check that energy conservation is not affected.

It was released with the 2020 version, but I’m not following how that matters; which benchmarks are you referring to?

The peak simulation throughput I’ve measured in the past was around 3-3.5 us/day, so a bit under 50 us/iteration on a single GPU with a small system of a few thousand atoms. This has probably not improved much on more recent hardware (in particular as clock speeds have not increased much, though consumer Lovelace GPUs may break that peak), but I have not tested such cases in recent times.

That could in principle happen, especially for tens of atoms, which have nowhere near enough parallelism for a large GPU (or even for a large CPU, e.g. with 64 cores). When you get to that regime, higher clocks and fewer cores will be best.

For ~1000 atoms, I don’t think a GPU-resident run will be slower than CPU-only, but it depends on the exact hardware comparison, simulation size, settings, etc.
I’m not sure about the 10-100 atoms regime; neither our code nor our algorithms are specifically optimized for such small inputs (e.g. if the box is too small you can’t increase -nstlist much and you can’t balance the pair search cost to maximize GPU utilization).

I could imagine that one of the very recent high-core-count, high-clock desktop CPUs like Zen 4 (at >5 GHz) could be quite competitive with GPUs for such small systems.

Also note that as long as the time it takes to enqueue work for a single iteration is shorter than the time the GPU computation itself takes, GPU launch overheads may throttle the execution slightly but won’t completely starve the GPU as we’re launching tens to hundreds of steps, so a large fraction of the launch overhead will overlap with GPU execution.
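
As a toy model of that argument (the 5 us launch overhead below is just an assumed number for illustration, not a measured GROMACS value): as long as the per-step enqueue time stays below the per-step GPU time, the effective step time is set by the GPU, not by the launches.

```cpp
#include <algorithm>
#include <cstdio>

// Toy pipelining model: the CPU enqueues `steps` kernels back-to-back while
// the GPU consumes them. If enqueueing a step is cheaper than executing it,
// the launch cost hides behind GPU execution; otherwise the GPU starves.
double effectiveUsPerStep(double launchUs, double gpuUs, int steps)
{
    const double cpuSide = steps * launchUs;          // time to enqueue everything
    const double gpuSide = launchUs + steps * gpuUs;  // first launch, then GPU-bound
    return std::max(cpuSide, gpuSide) / steps;
}

int main()
{
    // Assumed: 5 us to enqueue a step, 16 us of GPU work per step, 100 steps.
    std::printf("GPU-bound:    %.2f us/step\n", effectiveUsPerStep(5.0, 16.0, 100));
    // If the GPU work were only 3 us/step, launches would dominate instead:
    std::printf("launch-bound: %.2f us/step\n", effectiveUsPerStep(5.0, 3.0, 100));
}
```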

Cheers,
Szilárd


SOL is around 16 us/iteration in GPU runs (10-11 us/day); in CPU runs it is around 2.5 us/iteration (~70 us/day) using an EPYC Milan CPU, so not very high clocks.

With 1500-2000 atoms, which is more realistic for a very small input, I measured ~4 us/day with PME and ~5.2 us/day with RF.