GPU acceleration on Mac M1 mini

That’s exactly what we are considering:

SYCL C++ → LLVM IR → SPIR-V → MSL → AIR

So we lower from C++ to LLVM IR, back up to a C++ dialect (MSL), and then back down to a final LLVM dialect (AIR).

I’d like to update this thread - there’s been quite a bit of discussion on the hipSYCL GitHub issue. Going from LLVM IR → SPIR-V → MSL is harder than anticipated; key libraries don’t even compile. The build failures can be resolved, but more importantly, we learned that Metal.jl has reverse-engineered AIR, so the SPIR-V and MSL stages can be skipped entirely. The new pipeline:

SYCL C++ → LLVM IR → AIR (.metallib form)

To reproduce the compiler optimizations that normally happen when lowering MSL → AIR:

(AIR) → metal-objdump → (LLVM LL) → remove headers → metal-as → (AIR) → metal-optmetallib → (AIR)

Heidelberg University needs a few more weeks to finish the SYCL C++ → LLVM IR pipeline, and after that I can work on LLVM IR → AIR. In the meantime, I’ve picked up metal-cpp and am learning the C++ standard library.

Note that if you want the LLVM IR of SYCL kernels, you can get that from the DPC++ implementation of SYCL (GitHub - intel/llvm: Intel staging area for llvm.org contribution. Home for Intel LLVM-based projects.).
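For instance, something along these lines should dump the device-side IR with DPC++ (a sketch only; the exact flags vary between intel/llvm versions, and the source file name is just a placeholder):

# Assumed intel/llvm (DPC++) invocation: compile only the device code and emit textual LLVM IR
clang++ -fsycl -fsycl-device-only -S -emit-llvm my_sycl_kernel.cpp -o my_sycl_kernel.ll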

The problem is that the build failures I described came from intel/llvm (see: Need help compiling this repo · Issue #810 · illuhad/hipSYCL · GitHub). I can’t get that repository to compile.

I was running through a tutorial, and I was able to run GROMACS on my GPU! Very satisfied! I explicitly set the deprecated OCL environment variable to confirm that GROMACS still recognizes it and points to GMX_GPU_DISABLE_COMPATIBILITY_CHECK.
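For the record, here’s roughly what I ran (a sketch; I set the deprecated variable on purpose, which is why the warning below appears):

# deprecated name, set on purpose to trigger the warning
export GMX_OCL_DISABLE_COMPATIBILITY_CHECK=1
# the replacement GROMACS asks for would be:
# export GMX_GPU_DISABLE_COMPATIBILITY_CHECK=1
gmx mdrun -v -deffnm em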

Molecular Dynamics Simulation Output
(base) philipturner@M1-Max-MacBook-Pro Tutorial_1 % gmx mdrun -v -deffnm em                     
                    :-) GROMACS - gmx mdrun, 2022.3-dev (-:

Executable:   /usr/local/gromacs/bin/gmx
Data prefix:  /usr/local/gromacs
Working dir:  /Users/philipturner/Documents/GROMACS/Tutorial_1
Command line:
  gmx mdrun -v -deffnm em

Environment variable GMX_OCL_DISABLE_COMPATIBILITY_CHECK is deprecated and will be removed in release 2022. Please use GMX_GPU_DISABLE_COMPATIBILITY_CHECK instead.

Back Off! I just backed up em.log to ./#em.log.9#
Reading file em.tpr, VERSION 2022.3-Homebrew (single precision)
1 GPU selected for this run.
Mapping of GPU IDs to the 1 GPU task in the 1 rank on this node:
  PP:0
PP tasks will do (non-perturbed) short-ranged interactions on the GPU
PP task will update and constrain coordinates on the CPU
Using 1 MPI thread
Using 10 OpenMP threads 
...
`gmx --version`
(base) philipturner@M1-Max-MacBook-Pro Tutorial_1 % gmx --version
                       :-) GROMACS - gmx, 2022.3-dev (-:

Executable:   /usr/local/gromacs/bin/gmx
Data prefix:  /usr/local/gromacs
Working dir:  /Users/philipturner/Documents/GROMACS/Tutorial_1
Command line:
  gmx --version

GROMACS version:    2022.3-dev
Precision:          mixed
Memory model:       64 bit
MPI library:        thread_mpi
OpenMP support:     enabled (GMX_OPENMP_MAX_THREADS = 128)
GPU support:        OpenCL
SIMD instructions:  ARM_NEON_ASIMD
CPU FFT library:    fftw-3.3.8
GPU FFT library:    clFFT
TNG support:        enabled
Hwloc support:      disabled
Tracing support:    disabled
C compiler:         /Applications/Xcode-13.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/cc AppleClang 13.1.6.13160021
C compiler flags:   -Wall -Wno-unused -Wunused-value -Wunused-parameter -Wno-missing-field-initializers -fno-stack-check -fno-stack-check -O3 -DNDEBUG
C++ compiler:       /Applications/Xcode-13.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/c++ AppleClang 13.1.6.13160021
C++ compiler flags: -Wall -Wextra -Wpointer-arith -Wmissing-prototypes -Wdeprecated -Wno-unused-function -Wno-reserved-identifier -Wno-missing-field-initializers -fno-stack-check -fno-stack-check -Weverything -Wno-c++98-compat -Wno-c++98-compat-pedantic -Wno-source-uses-openmp -Wno-c++17-extensions -Wno-documentation-unknown-command -Wno-covered-switch-default -Wno-switch-enum -Wno-extra-semi-stmt -Wno-weak-vtables -Wno-shadow -Wno-padded -Wno-reserved-id-macro -Wno-double-promotion -Wno-exit-time-destructors -Wno-global-constructors -Wno-documentation -Wno-format-nonliteral -Wno-used-but-marked-unused -Wno-float-equal -Wno-conditional-uninitialized -Wno-conversion -Wno-disabled-macro-expansion -Wno-unused-macros -Xclang -fopenmp -O3 -DNDEBUG
OpenCL include dir: /Applications/Xcode-13.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX12.3.sdk/System/Library/Frameworks/OpenCL.framework
OpenCL library:     /Applications/Xcode-13.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX12.3.sdk/System/Library/Frameworks/OpenCL.framework
OpenCL version:     1.2

System specs:

CPU: 100 GB/s bandwidth single-core, 200 GB/s multi-core
GPU: 400 GB/s bandwidth
CPU and GPU: 64 MB shared L3 cache
CPU and GPU: 32 GB shared RAM

CPU: 3 GHz x 128-bit SIMD = 12 GFLOPS per core, 10 cores
CPU (system BLAS matrix multiplication): 200 GFLOPS per core, 2000 total
GPU: 10400 GFLOPS

However, I do have some questions. I timed it against CPU-only, and GPU offload decreased the simulation time from 14 to 11 seconds. Are the calculations in this tutorial mostly memory-bandwidth bound? If so, the CPU should be less than half as fast as the GPU (200 GB/s vs 400 GB/s). In Activity Monitor, I saw GPU usage for the GROMACS process reach 82%, so it should have been utilizing roughly 8500 GFLOPS of processing power. My Mac is also in high power mode, so there are no constraints on performance.

Also, I had to recompile GROMACS from scratch, because the Homebrew build does not compile the OpenCL backend. I don’t think that restriction is justified: GPU support is already gated behind the GMX_GPU_DISABLE_COMPATIBILITY_CHECK flag anyway, and compiling from source is extra effort for users. That said, compared to many other software packages, this was the smoothest from-source install I have ever completed - in fact, the only one. Great job on that front!
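For anyone else who wants to repeat this on an M1 Mac, the from-source build boiled down to roughly the following (a sketch from memory; GMX_GPU=OpenCL and REGRESSIONTEST_DOWNLOAD are documented GROMACS CMake options, and the install prefix matches the paths in the logs above):

git clone https://gitlab.com/gromacs/gromacs.git && cd gromacs
mkdir build && cd build
cmake .. -DGMX_GPU=OpenCL -DREGRESSIONTEST_DOWNLOAD=ON -DCMAKE_INSTALL_PREFIX=/usr/local/gromacs
make -j10
make check        # runs the unit and regression tests
sudo make install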

Thank you for the update, and great news that you got it working!

Concerning the tutorials, they don’t really run long enough for a comparison between CPU and GPU, but if you are interested you can check the wallcycle accounting that is printed at the end of the run to see where the most time was spent.

About the builds that are in Homebrew and elsewhere, I would assume that they are geared more toward providing something that will work regardless of the system it is running on, so GPU support being disabled there doesn’t surprise me. We generally recommend that people who want the best performance compile from source anyway, to make sure that they also get the optimal SIMD instructions for their system.

Cheers

Paul

I haven’t looked at the wallcycle accounting yet, but I already have some performance suggestions. It seems you started using cuFFT and clFFT fairly recently, some time between 2021 and 2022. Is that correct, or were you using them earlier? If so, you might want to replace the new infrastructure with VkFFT.

The VkFFT library is about 2x faster than cuFFT and rocFFT, and probably by an even larger margin than clFFT. Also, clFFT has been unmaintained since 2017. I suggest switching to VkFFT as a shared dependency for both the CUDA and OpenCL backends, rather than adopting it for CUDA alone; I don’t think it’s fair for CUDA to get faster while OpenCL is left at a disadvantage. Regardless, my GPU has so much processing power that the orders-of-magnitude difference in compute dwarfs an extra 2x performance boost.

Also, make sure you’re using the latest commits to OpenMM. They recently implemented an optimization that accumulates 64-bit integer forces through a series of 32-bit atomics, enabling force accumulation on lower-spec GPUs. This would make Apple silicon faster, because the architecture doesn’t have 64-bit atomics.

No, some of the computation is very much compute-bound. Note, however, that the algorithms typically need to be tuned for a specific architecture to be able to utilize a useful fraction of the peak FLOPS available. Secondly, the implementation itself often also needs to be optimized, and it is not uncommon for small tweaks to make a large difference (e.g. if the algorithm formulation or poor compiler heuristics lead to register spills).

For that reason, you can’t always assume that a code that runs well on one vendor’s GPU will also run well without any optimization on another architecture.

Also note that our CUDA and SYCL backends (in the main branch) are significantly more complete and support the so-called GPU-resident mode, which, at least on discrete GPUs, is generally faster.
On integrated GPUs, however (like Intel iGPUs), I have seen the offload mode work quite well.
You might also be able to get better performance if you let the SoC use more of the available power for the GPU, e.g. by under-loading the CPU (using fewer threads than cores).
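For example, something like this (illustrative only; -ntomp is the standard mdrun option for the OpenMP thread count):

# leave a few of the 10 cores idle so more of the power budget can go to the GPU
gmx mdrun -v -deffnm em -nb gpu -ntomp 6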

We have VkFFT support planned (see Implement VkFFT FFT backend for SYCL (#4502) · Issues · GROMACS / GROMACS · GitLab), currently aimed at the SYCL backend.
We don’t have plans to add it to OpenCL (also because OpenCL support is deprecated), but contributions are welcome.

Not sure what you are referring to, we don’t use OpenMM in any way.

As @pbauer noted, it might be worthwhile running longer simulations and looking not just at the walltime but also at the performance report at the end of the log. For short simulations, one-off things like GPU initialization can take a few seconds.

I’d also caution against translating any “Utilization” metric reported by monitoring tools into FLOPs. It’s usually more convoluted, although I don’t know how exactly Apple calculates it.

@pszilard has given some excellent suggestions above about performance tuning.

The GROMACS core team does not maintain Homebrew builds, so I don’t think we can comment on that. You can try submitting a PR there if you want.

On top of not using OpenMM, we’re not using any 64-bit atomics either :)


There is actually quite a big problem regarding clFFT and GPU performance. For my benchmark, it seems most of the work happens on the CPU, in PME. That work can be offloaded to the GPU with the OpenCL backend, but it is artificially force-disabled on Macs.

I lost the wallcycle data, but I saw that the GPU was assigned less than 1% of the total workload, and the run was mostly CPU-bound. I wanted to investigate whether running GPU-accelerated PME would improve performance.
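The experiment I had in mind is simply forcing PME onto the GPU (standard mdrun options; whether this actually works on macOS is exactly the problem described next):

gmx mdrun -v -deffnm em -nb gpu -pme gpu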

It’s not that GROMACS lacks the OpenCL code for the feature. There’s a bug in Apple’s OpenCL compiler that makes it crash silently when compiling clFFT! So four years ago, GROMACS stopped supporting PME on Mac GPUs. That crash still happens today: when I removed the restriction, a regression test failed and had to retry with the affected work (presumably the GPU portion) running exclusively on the CPU:

Crash log
86/88 Test #86: regressiontests/freeenergy .....................***Failed   16.05 sec
Re-running coulandvdwsequential_coul using only CPU-based non-bonded kernels
Mdrun cannot use the requested (or automatic) number of ranks, retrying with 8.

Abnormal return value for ' gmx mdrun        -notunepme -nb cpu >mdrun.out 2>&1' was 1
Retrying mdrun with better settings...
Re-running coulandvdwsequential_vdw using only CPU-based non-bonded kernels
Mdrun cannot use the requested (or automatic) number of ranks, retrying with 8.

Abnormal return value for ' gmx mdrun        -notunepme -nb cpu >mdrun.out 2>&1' was 1
Retrying mdrun with better settings...
Re-running coulandvdwtogether using only CPU-based non-bonded kernels
Mdrun cannot use the requested (or automatic) number of ranks, retrying with 8.

Abnormal return value for ' gmx mdrun        -notunepme -nb cpu >mdrun.out 2>&1' was 1
Retrying mdrun with better settings...
Re-running expanded using only CPU-based non-bonded kernels
Mdrun cannot use the requested (or automatic) number of ranks, retrying with 8.

Abnormal return value for ' gmx mdrun        -notunepme -nb cpu >mdrun.out 2>&1' was 1
Retrying mdrun with better settings...
Re-running relative using only CPU-based non-bonded kernels
Mdrun cannot use the requested (or automatic) number of ranks, retrying with 8.

Abnormal return value for ' gmx mdrun        -notunepme -nb cpu >mdrun.out 2>&1' was 1
Retrying mdrun with better settings...
Re-running relative-position-restraints using only CPU-based non-bonded kernels
Mdrun cannot use the requested (or automatic) number of ranks, retrying with 8.

Abnormal return value for ' gmx mdrun        -notunepme -nb cpu >mdrun.out 2>&1' was 1
Retrying mdrun with better settings...
Re-running restraints using only CPU-based non-bonded kernels
Mdrun cannot use the requested (or automatic) number of ranks, retrying with 8.

Abnormal return value for ' gmx mdrun        -notunepme -nb cpu >mdrun.out 2>&1' was 1
Retrying mdrun with better settings...
Re-running simtemp using only CPU-based non-bonded kernels
Re-running transformAtoB using only CPU-based non-bonded kernels
FAILED. Check checkforce.out (6 errors) file(s) in transformAtoB/nb-cpu for transformAtoB-nb-cpu
Re-running vdwalone using only CPU-based non-bonded kernels
Mdrun cannot use the requested (or automatic) number of ranks, retrying with 8.

Abnormal return value for ' gmx mdrun        -notunepme -nb cpu >mdrun.out 2>&1' was 1
Retrying mdrun with better settings...
1 out of 20 freeenergy tests FAILED

      Start 87: regressiontests/rotation
87/88 Test #87: regressiontests/rotation .......................   Passed    2.21 sec
      Start 88: regressiontests/essentialdynamics
88/88 Test #88: regressiontests/essentialdynamics ..............   Passed    4.08 sec

99% tests passed, 1 tests failed out of 88

Label Time Summary:
GTest              =  31.00 sec*proc (81 tests)
IntegrationTest    =  18.61 sec*proc (25 tests)
MpiTest            =   9.95 sec*proc (19 tests)
QuickGpuTest       =   8.50 sec*proc (17 tests)
SlowTest           =   5.65 sec*proc (13 tests)
UnitTest           =   6.74 sec*proc (43 tests)

Total Test time (real) = 114.32 sec

The following tests FAILED:
	 86 - regressiontests/freeenergy (Failed)
Errors while running CTest
make[3]: *** [CMakeFiles/run-ctest-nophys] Error 8
make[2]: *** [CMakeFiles/run-ctest-nophys.dir/all] Error 2
make[1]: *** [CMakeFiles/check.dir/rule] Error 2
make: *** [check] Error 2

I have talked with the owner of VkFFT before to discuss my attempt at a GPU-accelerated FFT library, MetalFFT. He tested VkFFT on a Mac and it worked fine through OpenCL, so I’ll take a shot at replacing the clFFT dependency with VkFFT. Even though OpenCL is officially deprecated, would GROMACS be willing to accept a pull request that enables PME on Mac GPUs? This OpenCL patch is just a stopgap, because hipSYCL’s custom LLVM backend is still in progress. Once it’s complete, we could support Macs through SYCL instead.
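For context, VkFFT is header-only and selects its backend with a compile-time define, so a stand-alone OpenCL smoke test on macOS looks roughly like this (a sketch; the test file and include path are placeholders, and VKFFT_BACKEND=3 is the OpenCL backend selector in VkFFT):

# hypothetical smoke test of VkFFT’s OpenCL backend against Apple’s OpenCL framework
clang++ -std=c++17 -DVKFFT_BACKEND=3 -I path/to/VkFFT/vkFFT \
    vkfft_smoke_test.cpp -framework OpenCL -o vkfft_smoke_test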

Another thing I noticed: the GPU-resident mode is mostly for performing updates entirely on the GPU. The docs said it doesn’t boost performance that much, except for reducing CPU-GPU communication. The M1 Max has a unified memory architecture, like Intel iGPUs, so it’s not that problematic if updates happen on CPU. If we can run SYCL in the future, then maybe Macs could do -update gpu.

Just got VkFFT working with OpenCL + GROMACS + macOS! I’ll send a pull request your way on GitLab.

Regression test results
84/88 Test #84: MdrunVirtualSiteTests ..........................   Passed    0.25 sec
      Start 85: regressiontests/complex
85/88 Test #85: regressiontests/complex ........................   Passed   15.35 sec
      Start 86: regressiontests/freeenergy
86/88 Test #86: regressiontests/freeenergy .....................   Passed    3.85 sec
      Start 87: regressiontests/rotation
87/88 Test #87: regressiontests/rotation .......................   Passed    2.00 sec
      Start 88: regressiontests/essentialdynamics
88/88 Test #88: regressiontests/essentialdynamics ..............   Passed    0.99 sec

100% tests passed, 0 tests failed out of 88

Label Time Summary:
GTest              =  17.59 sec*proc (81 tests)
IntegrationTest    =   7.31 sec*proc (25 tests)
MpiTest            =   2.95 sec*proc (19 tests)
QuickGpuTest       =   4.27 sec*proc (17 tests)
SlowTest           =   5.16 sec*proc (13 tests)
UnitTest           =   5.12 sec*proc (43 tests)

Total Test time (real) =  40.13 sec
[100%] Built target run-ctest-nophys
[100%] Built target check


Thanks for the investigation; admittedly, I forgot that I had disabled GPU PME for OpenCL on Apple.

I think we can accept such a change, but please open an issue on https://gitlab.com/gromacs/gromacs/-/issues.

It is avoiding the cost of data movement, rather than the speedup of the integration itself, that makes the GPU-resident offload significantly faster on modern machines with discrete-GPU hardware. However, as you correctly point out, on integrated GPUs with shared physical memory the advantage is smaller, often none. In my experience, on Intel iGPUs leaving some of the work on the CPU gives better performance (and leaving more power for the GPU also helps).

Sounds good. Can you please make sure that the VkFFT codepath works on other platforms too, so we can test it on our CI hardware (AMD and NVIDIA)?

I made commits to my branch, which let me run PME on the GPU. Someone recently merged the hipSYCL VkFFT backend into the main branch, so I’d need to rework the commits in my personal fork. At this point, reformatting it to meet good code QA standards will be very tedious, and you’re more experienced in judging which source code changes are appropriate. Would someone maintaining GROMACS be so kind as to implement the pull request for me?

Here’s the branch with a working VkFFT OpenCL implementation. Just ping me when you make the PR, and I can test it on my Mac: Files · vkfft-replacing-clfft-1 · Philip Turner / GROMACS · GitLab

Regarding performance, the results are somewhat dismal. GPU PME boosted performance by about 10%, and the highest overall speedup from GPU acceleration was 1.9x. I was expecting an order-of-magnitude speedup, something like 5-10x. I’ll make an issue on GitLab detailing my benchmarks and hypotheses about performance. I think the only real option is for me to implement the Metal hipSYCL backend and see whether that reduces the overhead of CPU <-> GPU communication.

I saw these kinds of mixed results with PyTorch’s Metal backend as well. I found a workaround that drastically improves real-world performance, and I can repurpose the ~1 year I invested in that workaround toward hipSYCL. There’s hope :)

Edit: Performance report on Apple-designed GPUs (#4615) · Issues · GROMACS / GROMACS · GitLab

Amazing work!

I cannot test it on an M1, but glancing through your code, it looks okay. It seems to be based on an earlier version of Bálint’s hipSYCL VkFFT MR, and some of the comments raised there still apply, but nothing major, as far as I can tell.

At this point, reformatting it to meet good code QA standards will be very tedious

Usually, installing clang-11 and running clang-format-11 -i on the modified files should be enough. Are you having any issues with that? There are a few cases where minor versions of clang-format-11 format code differently, but I would not worry about it too much now.
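For example, assuming your branch is based on main, something like this formats just the files you touched:

git diff --name-only main -- '*.h' '*.cpp' | xargs clang-format-11 -i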

Would someone maintaining GROMACS be as kind as to implement the pull request for me?

I think external users should be able to create MRs. I’m not on top of our GitLab settings, but I don’t think we decided to lock it down to maintainers only. Please share what error you are getting.

Regarding performance: you said on GitLab that sub-group level operations, such as shuffles, are unavailable, which would limit the kernel performance. There are still a few knobs you can try to tune in the kernels. E.g., USE_CJ_PREFETCH and IATYPE_SHMEM in NBNXM kernels.

But, given the modest performance benefits and a clear expectation that hipSYCL/Metal would offer better performance (both from lower host-side latency and more instructions supported on the device), perhaps the efforts would be better spent there and not on OpenCL, which both Apple and GROMACS deprecated? That’s my opinion, though; other team members might have different thoughts.

I think external users should be able to create MRs. I’m not on top of our GitLab settings, but I don’t think we decided to lock it down to maintainers only. Please share what error you are getting.

I wasn’t experiencing that kind of error, or any linting errors. I just thought that you might make several suggestions about how I should modify the MR before it gets accepted. For example, it adds new environment variables that you might not want added to the public API; you might request that I change their names or remove them to reflect the latest changes. So I thought it would be easier for a maintainer to create the PR, because they know which environment variable names are most appropriate.

perhaps the efforts would be better spent there and not on OpenCL, which both Apple and GROMACS deprecated?

I agree with you that the effort is better spent on hipSYCL. Perhaps the easiest solution is to just tell people on this thread:

If you want optimal GPU performance on macOS, compile the GROMACS fork I linked above.

However, I think it’s still a good idea for me to get acquainted with making contributions to GROMACS, what your test suites are like, etc. So I might make a small PR just for that purpose.

I did make a pull request to patch up OpenCL for macOS. I do have a concern about hipSYCL, though. To run GROMACS directly through Metal, I’d also have to make a direct Metal backend for VkFFT. It should be possible to translate the Vulkan shaders into Metal via SPIR-V and keep all the modern Metal features like sub-group reductions. However, I’m strongly considering just keeping the FFT on the CPU with -pme gpu -pmefft cpu to avoid porting VkFFT to Metal. Would that be possible in GPU-resident mode, and if so, what are the performance implications?
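Concretely, the run mode I have in mind is something like this (all standard mdrun options; whether -pmefft cpu can be combined with a GPU-resident run is what I’m asking about):

gmx mdrun -v -deffnm em -nb gpu -pme gpu -pmefft cpu -update gpu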

So, I tried this out on an M1 Mac as well, and it seems to work fine when it comes to passing tests. I’ll respond a bit more on the MR later with suggestions on how to proceed.

My main concern with the change is that we need to assess the performance impact of replacing clFFT on all supported OpenCL platforms, and I don’t think the team currently has the time for that.

Cheers

Paul

Great progress, thanks for sharing.

We can try to help with getting the code to meet the coding standards, but I’m not sure the core team will have the resources to pick up the code and get it upstreamed, especially not in the coming days/weeks so that it has a chance to make it into the upcoming 2023 alpha release. I suggest that you open the MR yourself and we can start from there – it’s much easier for the core team to help with fixing things one issue at a time than to take over the upstreaming. The rebasing should probably not be a major issue, but you can leave that for later and leave the base of your MR as is, i.e. 4b826f6c from Sept 27.

Depending on the Apple GPU architecture the kernels may simply not run optimally. Many of our kernels are rather complex and don’t map trivially to different architectures.

Great, thanks a lot!

It really depends on the performance of the relevant code on the CPU and GPU of interest. In principle, it could even be faster to run tasks that fit the CPU better than the GPU. What makes this more complex is that you’re running on an SoC with a power cap, so if you run things concurrently on both the CPU and the GPU, the question is not only where it is faster to run some task (and how much it costs to move data around) but also how running concurrently affects the power consumption and performance of the concurrent tasks.
For instance, on Intel iGPUs I have observed that under-utilizing the CPU, leaving more of the TDP to the GPU, is often beneficial to the overall performance; see slides 26-32 of this presentation: https://doi.org/10.6084/m9.figshare.8257058.v1

Cheers,
Szilárd

I do not think performance is relevant here; this is a feature that allows PME to run on the GPUs of Apple Silicon SoC-based computers. As long as it works correctly, it is a net benefit to anyone with such hardware, as previously clFFT could not be used at all.

Whether we should switch to VkFFT by default (instead of clFFT) on other platforms is a separate concern that we can assess later; for now, we should just make sure that the code works on other platforms.

Cheers,
Szilárd

I wasn’t concerned with performance on the M1, but rather with the performance implications of switching to VkFFT on all the other supported platforms. As you pointed out on the MR, keeping clFFT for those for now would be a good alternative.