Optimizing Selection of RTX 40 Series GPUs: Evaluating Performance and Cost Efficiency

Hello GROMACS Users,

Does anyone know the “extent” to which the type and memory of RTX 40 series GPUs affect simulation performance? Currently, the price differences are substantial; for instance, the 4060Ti-8GB costs $450, the 4070Ti-12GB costs $800, while the 4090-24GB is priced at $1,900. I would like to know if the performance improvement justifies the price difference.

Moreover, I’m interested in determining which configuration offers optimal performance:
- Using four 4060Ti-8GB GPUs (totaling 4 x $450)
- Using two 4070Ti-12GB GPUs (totaling 2 x $800)
- Using one 4090-24GB GPU priced at $1,900

My simulations involve approximately 1 million atoms.

Thank you all in advance for your insights!

–Ra’ed

I expect GROMACS performance within the same GPU generation to be proportional to the 32-bit floating-point flop rate. Memory size is irrelevant. If you don’t need to run a single simulation over multiple GPUs, buy the configuration with the highest aggregate flop rate. If you want to run one simulation on the whole system, things get more complicated, as scaling to multiple GPUs incurs performance losses.
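
As a rough illustration, here is a minimal Python sketch that ranks the cards from the question by FP32 throughput per dollar. The TFLOPS values are approximate public boost-clock figures and should be treated as assumptions rather than measurements; the prices are the ones quoted above.

```python
# Rough FP32-throughput-per-dollar comparison for the cards quoted above.
# TFLOPS values are approximate public boost-clock figures; verify against
# the exact card models you are considering before deciding.

gpus = {
    # name: (approx. FP32 TFLOPS, price in USD from the question above)
    "RTX 4060 Ti 8GB":  (22.0, 450),
    "RTX 4070 Ti 12GB": (40.0, 800),
    "RTX 4090 24GB":    (82.6, 1900),
}

for name, (tflops, price) in gpus.items():
    # If GROMACS throughput scales ~linearly with FP32 rate within a
    # generation, TFLOPS per dollar is a first-order proxy for ns/day per dollar.
    print(f"{name:18s}  {tflops:5.1f} TFLOPS  ${price:5d}  "
          f"{1000 * tflops / price:.1f} GFLOPS/$")
```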

Thank you so much, Berk! Your feedback is very insightful, and I greatly appreciate it.

Let's consider a few situations. If I have a good budget, then I don't think much about it: I take the card with the highest CUDA core count (or flop rate) and use at least a 16-core dual processor. That many cores is not strictly required, but as I said, in this case the budget is unlimited.

Now suppose I have a limited budget and I'm trying to make a choice between price, performance, and my time.

I would go with a 4070 (~$500) and a 12-16 core processor, with 1-2 GPUs (performance is not 2x when you add a second GPU).
Then I would add more units with the above specification. You may also keep 1 or 2 units with the highest flop rate for more intense simulations, as per your plan of work or budget.
Sometimes, waiting a couple of months can get you the GPU that is about to be released.
Refer here and here.
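
To make the price-versus-performance trade-off a bit more concrete, here is a rough Python sketch comparing the three configurations from the original question, assuming one simulation is spread over every GPU in the box. The FP32 TFLOPS figures are the same approximate spec numbers as above, and SCALING_EFFICIENCY is a made-up placeholder for multi-GPU losses (real scaling depends on the system, the interconnect, and GROMACS settings), so treat the output as illustrative only.

```python
# Illustrative comparison of the three configurations from the question,
# assuming one simulation runs across all GPUs in the machine.
# SCALING_EFFICIENCY is a placeholder: multi-GPU runs lose performance,
# and the real loss depends on the system, interconnect and settings.

SCALING_EFFICIENCY = 0.7  # assumed fraction of ideal speedup per added GPU

configs = [
    # (label, number of GPUs, approx. FP32 TFLOPS per GPU, price per GPU in USD)
    ("4x RTX 4060 Ti 8GB",  4, 22.0, 450),
    ("2x RTX 4070 Ti 12GB", 2, 40.0, 800),
    ("1x RTX 4090 24GB",    1, 82.6, 1900),
]

for label, n, tflops, price in configs:
    # The first GPU counts fully; each additional GPU contributes only a fraction.
    effective_tflops = tflops * (1 + SCALING_EFFICIENCY * (n - 1))
    total_price = n * price
    print(f"{label:22s}  ~{effective_tflops:5.1f} effective TFLOPS  "
          f"${total_price:5d}  {1000 * effective_tflops / total_price:.1f} GFLOPS/$")
```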

As suggested by @hess and @scinikhil, for a single GPU-to-GPU comparison, simulation performance scales approximately with the FP32 performance of the GPU.

For multi-GPU systems, it depends on how many PCIe lanes reach each GPU. A GPU performs best in a 16x PCIe slot that actually runs at 16x, i.e. a slot wired with all 16 PCIe lanes. Things become tricky when you add a 2nd or 3rd GPU. For a system with 2 (or more) GPUs, there is no motherboard for consumer Intel or AMD Ryzen chips that lets you run both GPUs at 16x, because these processors usually have only 20-24 PCIe lanes going to the CPU. In addition, PCIe lanes are also used by USB, SSDs, etc., so the effective number of lanes available can decrease further. The configurations on motherboards supporting dual GPUs can be anything like 8x/8x (both GPUs at half speed, i.e. 8x), 16x/1x (one GPU at 16x, the second at 1x), 8x/4x, etc. Note that many motherboards have 3-5 physical 16x PCIe slots, but those will actually work at 1x, 4x or 8x when you insert multiple GPUs.
Many online discussion forums say that 16 versus 8 PCIe lanes makes no performance difference for GPUs; however, in my experience with GROMACS (and NAMD), there is a significant performance uplift at 16x. If you want multiple GPUs running at 8x or 16x, look for Intel Xeon or AMD Threadripper processors and their supporting motherboards.
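
If you want to check what link width and generation your GPUs are actually negotiating (as opposed to what the slot is labelled), you can query it with nvidia-smi. Below is a small Python wrapper around that query; it assumes nvidia-smi is on your PATH, and the current values can drop when a card is idle, so check while a job is running.

```python
# Query the PCIe link width/generation each NVIDIA GPU is actually using.
# Assumes nvidia-smi is installed and on PATH. Current values can be lower
# than the maximum when the card is idle, so check while a job is running.

import subprocess

FIELDS = ("index,name,"
          "pcie.link.gen.current,pcie.link.gen.max,"
          "pcie.link.width.current,pcie.link.width.max")

out = subprocess.run(
    ["nvidia-smi", f"--query-gpu={FIELDS}", "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
).stdout

for line in out.strip().splitlines():
    idx, name, gen_cur, gen_max, width_cur, width_max = \
        [field.strip() for field in line.split(",")]
    print(f"GPU {idx} ({name}): PCIe gen {gen_cur}/{gen_max}, "
          f"x{width_cur} of x{width_max} lanes")
```

You can cross-check the same information on Linux with `sudo lspci -vv` (look at the LnkSta lines) if you prefer.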

I hope that helps.

Regards,
Raman

Thank you so much, Nikhil! The two articles you provided are very useful!

Thank you so much, Raman! It is very interesting to understand the impact of the number of PCIe lanes on this.