The latest published benchmark paper from Gromacs is a few years old and my group is in the process of trying to spec another set of nodes. The GPU market situation in the US is pretty bad, in that many cards are either out of stock or cost exorbitant amounts.
The target machines are along the lines of dual-socket Xeon E5 or something AMD-based + 4 GPUs. Could Szilárd and/or other developers provide their thoughts on what would be a cost-effective purchase?
Hi - not a full answer (I wish I had machines to give you benchmarks on), but one perspective: the prevailing wisdom for the last few years has been that you get the most efficiency from consumer-grade cards, since the extra features of the professional cards didn’t justify their much higher price tag.
That sort of analysis assumes you’re running 1-2 jobs per GPU and not scaling single simulations across multiple GPUs, where GPU-GPU communication tends to crush efficiency for all but the largest simulation systems. One of the main features added to Gromacs in the past few years is direct GPU-GPU communication support. If you absolutely need to run individual long simulations on multiple GPUs, you can get a node with NVLink connections, and Gromacs can now take advantage of them to scale better across GPUs. I don’t know if anyone has really extensive benchmarks, but Nvidia wrote a blog post about it.
Note that even with that kind of hardware, parallel efficiency will never be as good as when you restrict each simulation to a single GPU. Also, NVLink-capable nodes tend to come at a premium, so I think most users would still get the most out of running multiple parallel simulations on a cheaper node, but it’s an option to explore if your research is genuinely bottlenecked by long single simulations.
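For reference, that direct-communication path is opt-in in the current releases and is enabled via environment variables. Below is a minimal sketch of launching a 4-GPU run with it using the thread-MPI build; the topol.tpr input, the rank/thread counts, and the exact variable names reflect my understanding of the 2020/2021 releases rather than anything official, so double-check against the docs and the Nvidia blog post.

```python
import os
import subprocess

# Sketch: opt in to the (2020/2021-era) direct GPU-GPU communication path via
# environment variables, then launch a 4-GPU run with thread-MPI.
# "topol.tpr" is a placeholder input; tune rank/thread counts to your node.
env = dict(
    os.environ,
    GMX_GPU_DD_COMMS="true",              # direct halo exchange between PP GPUs
    GMX_GPU_PME_PP_COMMS="true",          # direct PME<->PP coordinate/force transfers
    GMX_FORCE_UPDATE_DEFAULT_GPU="true",  # default the update/constraints to the GPU
)

subprocess.run(
    ["gmx", "mdrun", "-s", "topol.tpr",
     "-ntmpi", "4", "-npme", "1",         # 3 PP ranks + 1 dedicated PME rank, one GPU each
     "-nb", "gpu", "-pme", "gpu", "-bonded", "gpu"],
    env=env, check=True,
)
```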
In terms of CPU-GPU ratios, work continues on supporting more and more features on the GPU. I know that for the simplest simulation setups there’s a GPU-resident loop where most steps need only a single CPU core. Many use cases can still benefit from multiple cores per GPU, but the dev team is plugging away at better GPU support for e.g. free energy calculations. Wishy-washy summary: it depends on your workload, but the trend continues towards GPU-heavy setups with minimal CPU power needed. Note I haven’t kept up on this super well, so maybe someone on the core team can provide more info here, particularly if you describe your typical workloads.
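For context, a fully offloaded single-GPU run that leaves the CPU mostly idle looks roughly like the sketch below; the input name and thread count are placeholders, and whether -update gpu is usable depends on your simulation settings.

```python
import subprocess

# Sketch: one simulation fully offloaded to a single GPU, so only a few CPU
# threads are kept busy. "topol.tpr" is a placeholder input file.
subprocess.run(
    ["gmx", "mdrun", "-s", "topol.tpr",
     "-ntmpi", "1", "-ntomp", "4",          # one rank, a handful of CPU threads
     "-nb", "gpu", "-pme", "gpu",
     "-bonded", "gpu", "-update", "gpu",    # nonbonded, PME, bonded, and update on the GPU
     "-gpu_id", "0"],
    check=True,
)
```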
One other note: don’t expect the GPU shortage to end any time soon; Nvidia has said it doesn’t expect to be able to meet demand at least through the end of the year.
A few important questions that you need to think about before making a choice:
typical simulations (size, settings, etc.);
optimize for throughput vs time-to-solution; as @kevinboyd noted, if you have a large-ish system and you want hundreds of ns/day (to get close to what you can achieve on clusters), you’ll need NVLink across multiple GPUs;
form factor (workstation, server, space / power / cooling constraints);
budget.
I’m familiar with the projections regarding the consumer GPU situation, but assuming not much will change, you have a few decent options among the professional cards, either Quadro or server cards (formerly known as “Tesla”): e.g. the previous-generation RTX 5000/6000 cards, or the A5000/A6000; if you also want some FP64 support for other workloads, the A30/A40 may be options. Unless the Turing cards are heavily discounted, the A5000 is likely the best perf/price. If you want to scale a single simulation across >=4 GPUs, you’ll need a machine with A100 SXM4 GPUs with NVLink (like the Supermicro AS-2124GQ-NART or similar). Otherwise, with workstation cards installed in a suitable server chassis, you can have pairs of GPUs interconnected with NVLink bridges and get decent scaling across these (see this NVLink compatibility chart).
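As a side note, if you go the NVLink-bridge route it’s worth checking that the bridge is actually detected before benchmarking. A quick sketch using nvidia-smi’s topology matrix (standard Nvidia tooling, nothing Gromacs-specific):

```python
import subprocess

# Sketch: print the GPU interconnect topology; NVLink-bridged pairs appear as
# NV# entries in the matrix, while PCIe-only links show up as PIX/PHB/SYS.
print(subprocess.run(["nvidia-smi", "topo", "-m"],
                     capture_output=True, text=True, check=True).stdout)
```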
When it comes to the choice of CPU, the number of cores per GPU needed is illustrated well in https://doi.org/10.1063/5.0018516, which has max-performance, throughput, and strong-scaling benchmarks.
When it comes to improvements to expect from the next release, many are still work in progress, but expect more algorithms supported for GPU offload (e.g. free energy perturbation with PME), improved multi-GPU scaling efficiency, and CUDA-aware MPI support (for multi-node scaling).
Gentlemen, thank you. In the past, with Szilárd’s help we built a box with a 44-core E5 + four Titan XP cards, and single-node/single job performance still beats the crap out of what my institution offers with their V100s, etc.
NVLink is a new development for us, and I will relay your comments.
I have a workstation with 4x Tesla T4 cards, and I changed my workflow to run two simulations in parallel because that seemed to be the best use of the 4 GPUs.
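Roughly, I split the node like the sketch below: two independent runs, each given two GPUs and half of the CPU threads. The directory names, the topol.tpr input, and the thread counts (which assume a 32-thread CPU) are just placeholders.

```python
import subprocess

# Sketch: two independent simulations sharing one 4-GPU node, each restricted
# to two GPUs and pinned to half of the CPU threads. Names are placeholders.
jobs = [
    # (working directory, GPU ids for this run, core pin offset)
    ("run_a", "01", 0),
    ("run_b", "23", 16),
]

procs = []
for workdir, gpu_ids, pin_offset in jobs:
    procs.append(subprocess.Popen(
        ["gmx", "mdrun", "-s", "topol.tpr",
         "-ntmpi", "2", "-ntomp", "8", "-npme", "1",   # 1 PP + 1 PME rank, 8 threads each
         "-nb", "gpu", "-pme", "gpu",
         "-gpu_id", gpu_ids,                           # restrict this run to its 2 GPUs
         "-pin", "on", "-pinoffset", str(pin_offset), "-pinstride", "1"],
        cwd=workdir,
    ))

for p in procs:
    p.wait()
```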
Perhaps this should be a new post, but it is closely related.
I’m working with 500k atoms (only 400 drug molecules; half the atoms are solvent) in a thin-film evaporation model for which the literature indicates 1-2 microsecond runs are typical. Our two RTX 3090s and a 32-core AMD CPU provide 30-40 ns/day. Clearly some speed increase would be welcome. We are planning our next build. Multi-node systems running MPI would most likely require enterprise hardware and peripheral system programming and could easily top $100k. For now it appears I am limited to 4 consumer GPUs configured as a workstation. Is there a configuration using more than 4 GPUs (8 or 10) in a single node with, say, 2 CPUs, without worrying too much about a CPU bottleneck? Or for a system at this level (4 to 10 GPUs), is it better to make the model a bit smaller, use 4x NVLink, and hope Nvidia comes out with 20k-core GPUs soon?
That’s a pretty large system; I would actually think adding two more GPUs should boost your performance quite a bit. More GPUs than that will probably be bottlenecked by the CPU. There was a whole discussion on how many CPU threads per rank, etc.; hopefully Szilárd sees your post.
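To make the threads-per-rank point concrete, here is a sketch of how a 4-GPU run on a 32-core CPU might be laid out (one rank per GPU, eight OpenMP threads each); the numbers and the topol.tpr input are purely illustrative, not a tuned recommendation.

```python
import subprocess

# Sketch: 32 cores / 4 GPUs -> 4 thread-MPI ranks with 8 OpenMP threads each,
# one of the ranks dedicated to PME on its own GPU. Illustrative numbers only.
n_cores, n_gpus = 32, 4
ntomp = n_cores // n_gpus

subprocess.run(
    ["gmx", "mdrun", "-s", "topol.tpr",
     "-ntmpi", str(n_gpus), "-ntomp", str(ntomp), "-npme", "1",
     "-nb", "gpu", "-pme", "gpu", "-bonded", "gpu"],
    check=True,
)
```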
If you score more of those RTX 3090s at reasonable prices, please let us all know where. :)