A series of performance benchmarks for MD Apps, including GROMACS

GROMACS version: 2023.2
GROMACS modification: No

Title:
Performance benchmarks for mainstream molecular dynamics simulation Apps on consumer GPUs from AMD, NVIDIA and Intel - Switch to AMD [Part Ⅰ]

Link: http://bbs.keinsci.com/thread-39269-1-1.html

Author: Myself

I tried and failed to find a more suitable forum, so I still posted this article on the Chinese forum I normally read, but wrote it in English. You are welcome to leave your comments here!


Hi,

Interesting data and thorough comparison, thanks for sharing! In particular, the benchPEP CUDA vs SYCL performance is unexpected, I would like to understand what is happening there so I will look into it.

A few comments:

  • You state that “the performance of AMD GPUs and Intel GPUs in GROMACS falls considerably short of their theoretical capabilities, which may be explained by the fact that the current SYCL backend is still inefficient.” – which capabilities are you referring to? The SYCL performance is actually relatively close to that obtained with more mature runtimes/APIs; e.g. comparing to (the available) Intel OpenCL kernels or to the unofficial HIP fork from AMD.

  • You present STMV benchmarks with multiple MD codes which use very different settings (e.g. some use a 2 fs time-step and a 1.2 nm cut-off while others use a 4 fs time-step and a 0.9 nm cut-off, as illustrated in the fragment below). This makes the performance data not comparable, so to avoid confusion, I suggest making it clear in the text that the absolute STMV performance values cannot be directly compared (not even if the “ns/day” is scaled by 2x due to the different time-step, since different force fields and cut-offs are used).
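
To be concrete, in GROMACS .mdp terms the two kinds of setups differ roughly as follows (illustrative values based only on the numbers above, not taken from the actual input files):

    ; Setup A: conservative settings
    dt        = 0.002    ; 2 fs time step
    rvdw      = 1.2      ; 1.2 nm van der Waals cut-off
    rcoulomb  = 1.2      ; 1.2 nm short-range Coulomb cut-off

    ; Setup B: aggressive settings (a 4 fs step usually relies on hydrogen mass repartitioning)
    dt        = 0.004    ; 4 fs time step
    rvdw      = 0.9
    rcoulomb  = 0.9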

Cheers,
Szilárd

PS: I noticed that in the next section you note that the “efficiency” of different applications can not be compared. I guess you mean the performance, i.e. “ns/day”?

Thanks a lot for your comments!

By “theoretical capabilities” I mean the ranking of different GPUs in tests of other MD applications that directly utilize HIP code (e.g., Amber and OpenMM in my tests). AMD GPUs rank significantly lower in GROMACS than in other applications. If we look at some 3D graphics benchmarks (e.g., 3DMark), we’ll also see that AMD and Intel GPUs don’t rank the way they do in GROMACS. In my expectation, the 6900XT should have better performance than the 3080Ti, and the Intel ARC A770 should have performance closer to the 4060Ti. Of course, this is all my personal, subjective opinion.

Yes, I have noticed that some tests on the Internet (especially NVIDIA’s tests) are not comparable between different applications because many of the parameters vary too much. So I used the same parameters to compare the “efficiency” of different applications in last year’s 4090 test, which is also mentioned many times in today’s blog. You could look at the last picture in that post, where the 4 models (STMV-NPT, B, A-2 and A) have exactly the same parameters across applications. “Max perf. with 13900K” means the maximum value obtained after scanning the performance vs. core-count curve for the two bonded options (-bonded cpu & -bonded gpu), as sketched below, while “1-thread GPU-resident” means the GROMACS performance with only one CPU thread, just like other “pure-GPU” MD apps (e.g., Amber, NAMD3 and OpenMM).
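
For reference, the scan is essentially a loop like the following (a minimal sketch; the tpr file name, thread counts and step counts are placeholders, not the exact values from my benchmarks):

    # Scan mdrun performance over CPU thread counts for both bonded options.
    for bonded in cpu gpu; do
        for nt in 1 2 4 6 8 12 16; do
            gmx mdrun -s stmv_npt.tpr -ntmpi 1 -ntomp ${nt} \
                      -nb gpu -pme gpu -update gpu -bonded ${bonded} \
                      -nsteps 20000 -resetstep 10000 -noconfout \
                      -g scan_${bonded}_${nt}omp.log
        done
    done
    # The "1-thread GPU-resident" value corresponds to the nt=1, -bonded gpu case.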

A few points related to what you seem to be assuming:

  • (synthetic) graphics benchmarks do not necessarily reflect MD performance;
  • MD codes use different algorithms and implementations and are likely tuned to a different extent on different GPU architectures; an alternative explanation could be that some applications are less efficient on NVIDIA than on AMD; e.g. GROMACS is well tuned on NVIDIA and has been optimized on CDNA but not much effort has gone into RDNA optimizations;
  • “native HIP use equals better performance” is not a given; algorithms, optimizations, and “luck” with compilers often matter more. At least for our algorithms and implementation, there is currently at most 10-15% performance left on the table vs plain HIP (on a single GPU, for large system sizes), and that is on CDNA2 GPUs; I’m not sure about RDNA, but I guess that number is smaller.

Thanks for the pointer. The pictures don’t seem to load for me (or are very slow to load), will try again later. Do you have a PDF version of your post – that may be easier to load?

I would recommend using “application performance” or just “performance” instead of “efficiency” since the latter implies that you are using a relative metric (e.g. performance relative to the best achievable would be an “efficiency”).

Cheers,
Szilárd

Thanks a lot for your patient explanation, which is very helpful to my understanding of GROMACS. Now I have made some additions to the blog post.

Sure! Unfortunately, that blog post was only in Chinese because I didn’t realize the need for an English translation at that time. But this should not affect the understanding of the picture:

In addition, a PDF version of today’s blog post can be downloaded from the “dataset link” provided in the post: MD-Benchmark-Datasets-Aug-2023 - Google Drive

Thanks a lot for the advice; I’ve revised my post.

Update:

Compatibility notes and troubleshooting guides for mainstream molecular dynamics simulation Apps on AMD’s consumer GPUs - Switch to AMD [Part Ⅱ]

Link: http://bbs.keinsci.com/thread-39345-1-1.html

Author: Myself

Summary:

A PDF of these two articles/blog posts is available, along with the “benchmark dataset”: MD-Benchmark-Datasets-Aug-2023 - Google Drive


Interesting, thanks for sharing this. Just double-checking: can you confirm that the same simulation setup (cutoff lengths, cutoff treatment, PME settings, etc.) was used in these benchmarks?

Thanks!

That’s very valuable data, thanks for sharing! Can you please clarify what “unstable” means for gfx1100? If you encountered any errors, it would be good to know about them (so we can fix them)!

Thanks in advance!

Cheers,

Szilárd

Yes, I can totally confirm that.

This has been mentioned in the main text:

However, when it comes to the gfx1100 (RDNA 3) GPU, operation stability is a concern across all three versions of ROCm. Specifically, performance fluctuations and a high probability of mdrun getting stuck after running for a period of time have been observed (similar feedback was reported on the GROMACS forum in June of this year). Furthermore, GPU status information cannot be recognized by rocm-smi in this case.

I think the instability is due to the fact that AMD ROCm does not yet officially support RDNA3. Two months ago someone posted a similar issue on this forum: GROMACS get stuck AMD GPU
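
For reference, this is roughly how I watch the GPU while mdrun is running (a minimal sketch; the tpr file name and mdrun options are placeholders, and the fields reported by rocm-smi depend on the ROCm version):

    # Launch the benchmark in the background and poll the GPU every 10 seconds.
    # When the gfx1100 hang occurs, mdrun stops progressing and rocm-smi no
    # longer reports sane utilization/VRAM numbers for the card.
    gmx mdrun -s benchmark.tpr -ntmpi 1 -ntomp 8 -nb gpu -pme gpu -update gpu &
    mdrun_pid=$!
    while kill -0 ${mdrun_pid} 2>/dev/null; do
        rocm-smi --showuse --showmemuse
        sleep 10
    done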

Thanks for confirming.

Have you considered making your input and output (log) files public? Doing so would help reproducibility and ultimately improve MD users’ trust in your benchmark data, which is quite in-depth and valuable thanks to covering a wide range of hardware and MD applications.

Sharing the log files could also help developers of the codes (like me) find clues to issues (without having to ask separately) since the GROMACS logs have a detailed report of software and hardware information as well as runtime stats.

OK, I will make sure to read your last post.

Thanks, I missed that post.

One thing that might be of interest for you to test: we have worked closely with the hipSYCL/OpenSYCL team on some runtime performance optimizations which, in our tests, significantly improve performance with small inputs/fast iterations (for further details see Optimize submission process for eager submission case by illuhad · Pull Request #1054 · AdaptiveCpp/AdaptiveCpp · GitHub). These improvements are available in the develop branch of OpenSYCL and will be part of an upcoming OpenSYCL release.
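
In case it helps, a rough sketch of building the develop branch with the ROCm backend (the install prefix, ROCm path and exact CMake options may need adjusting for your system):

    git clone --branch develop https://github.com/AdaptiveCpp/AdaptiveCpp.git
    cd AdaptiveCpp
    cmake -B build -DCMAKE_INSTALL_PREFIX=$HOME/opensycl \
          -DWITH_ROCM_BACKEND=ON -DROCM_PATH=/opt/rocm
    cmake --build build --target install -j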

Overall, your testing results and performance data are both quite valuable, so please keep us informed when you have fresh data or in case you find any bugs or unexpected code behavior!

Cheers,
Szilárd

This is the log file of a case where GROMACS mdrun got stuck on gfx1100:
B.log (19.9 KB)

All input files (including tpr files, bash scripts, and input files for other MD Apps) can be downloaded from the aforementioned Google Drive link, which is called the “benchmark datasets” in my post. Anyone can use these datasets to run benchmarks.

I usually scan GROMACS mdrun performance over different CPU core counts and bonded options, as shown in Part Ⅰ - Section 2.1. However, this hardly works on the gfx1100: it gets stuck after running dozens of tests and then the OS crashes. I’ve used different gfx1100 (7900XTX) GPUs and other hardware but got the same results. Therefore, it took me a lot of time to measure GROMACS mdrun performance on gfx1100.

BTW, in last year’s RTX4090 test, the most important conclusion was that the single-core (or “per-core”) performance of today’s CPUs severely limits the RTX4090’s potential, so I have been encouraging my peers to choose CPUs with strong single-core performance (e.g., 7950X, 13900KF, Threadripper-WX, and overclockable Xeon-W), rather than server CPUs that are primarily focused on multi-core performance. I have even suggested that some manufacturers produce multi-GPU servers based on overclockable workstation platforms, and there has been some progress.

The final data in my post (Part Ⅰ - Section 2.3) is based on the OpenSYCL develop branch as of 12:31 AM GMT+8 on July 25, 2023 (after commit 485ea8089cfc051d1d5ed916f4cf3fd6800c6335), and in that version, the PR #1054 you mentioned has already been merged and verified (as shown in Commits · AdaptiveCpp/AdaptiveCpp · GitHub). Therefore, my tests already include these performance optimizations.
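
This is roughly how I confirmed it (a sketch; the merge commit of PR #1054 has to be looked up on GitHub first and is left as a placeholder here):

    cd AdaptiveCpp
    git log -1    # confirm HEAD is at/after 485ea8089cfc051d1d5ed916f4cf3fd6800c6335
    git merge-base --is-ancestor <pr-1054-merge-commit> HEAD \
        && echo "PR #1054 is included in this checkout"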

This is why I specifically labeled the OpenSYCL version as “develop 25Jul2023” in both blog posts.

However, I’ve noticed that a lot of new verified commits have been made after July 25, and I’m looking forward to the potential performance improvements!

Sure! I have always been a loyal user of GROMACS and will continue to learn, use and explore GROMACS.

First, thank you for such a thorough analysis! And for taking the time to translate it into English!

Have you tried setting the HIPSYCL_RT_MAX_CACHED_NODES=0 environment variable, as described in GROMACS get stuck AMD GPU?

We still don’t have a good understanding of the root cause of the problem, but we suspect that it might be caused by the hipSYCL caching behavior, where tasks are submitted to the GPU in bursts, which is sometimes handled poorly by the AMD HSA runtime. Setting HIPSYCL_RT_MAX_CACHED_NODES=0 forces immediate submission, avoiding this potential problem. With the latest hipSYCL, it almost always improves performance too, but mostly for small systems.
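
For example (the mdrun options below are just an illustrative configuration, not a recommendation):

    # Force hipSYCL to submit work to the GPU immediately instead of batching it.
    export HIPSYCL_RT_MAX_CACHED_NODES=0
    gmx mdrun -s benchmark.tpr -ntmpi 1 -ntomp 8 -nb gpu -pme gpu -update gpu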

I don’t think there have been many performance-related changes since then. They added a new backend (OpenCL) and a new programming model (C++ Standard Parallelism / stdpar), which are significant changes, but neither is used by GROMACS.


Speaking of the Intel A770: one can get slightly better performance when using the Double-batched FFT library instead of MKL. For the A770, the Double-batched FFT library should be compiled with -DCMAKE_CXX_COMPILER=icpx -DCMAKE_BUILD_TYPE=Release -DBUILD_SHARED_LIBS=YES -DNO_DOUBLE_PRECISION=ON, and then GROMACS should be told to use it.
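
Roughly like this (a sketch; paths are placeholders, and the spelling of the GROMACS-side option, GMX_GPU_FFT_LIBRARY=DBFFT, is best double-checked against the install guide for your GROMACS version):

    # In the Double-batched FFT library source tree:
    cmake -B build -DCMAKE_CXX_COMPILER=icpx -DCMAKE_BUILD_TYPE=Release \
          -DBUILD_SHARED_LIBS=YES -DNO_DOUBLE_PRECISION=ON \
          -DCMAKE_INSTALL_PREFIX=$HOME/bbfft
    cmake --build build --target install -j

    # Then configure GROMACS against it (add your usual SYCL/oneAPI options):
    cmake -B build-gmx -DGMX_GPU=SYCL -DGMX_GPU_FFT_LIBRARY=DBFFT \
          -DCMAKE_PREFIX_PATH=$HOME/bbfft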

In my tests, the speed-up is around 10% on STMV when using oneAPI 2023.2 (same as yours), so nothing dramatic. Just FYI.

@al42and Thank you very much for your explanation and suggestions! I will try the methods you mentioned in the near future. Most of the hardware I tested was not my own, so I will have to borrow it again, but that shouldn’t take too much time.

BTW, do you have any GROMACS 2023 benchmarks on the AMD Instinct GPU series or the Intel Data Center GPU Max series? There are few public benchmarks for these GPUs, and I can’t get access to them myself. The purpose of looking at these expensive GPUs is to compare them with cheap consumer GPUs, including the consumer-GPU-based high-performance clusters that I will try in the future.

We plan to publish some AMD MI250X data soon (think a month, not days).

Intel Data Center Max / PVC are more complicated due to Dev Cloud terms of service (and its current configuration).

There is some early published data:

Note that this is quite early data with relatively old ROCm (5.3) and hipSYCL (0.9.4), so the performance improvements you have already seen will come soon to these HPC systems too.

Can we expect improvements with AMD in the 2024 release, or is NVIDIA still the only practical option?