Hi,
does anyone have a clue about how GROMACS performance on the new RTX 3080 GPUs will compare to, e.g., the RTX 2080 Ti? Both seem to have the same number of cores and a similar clock rate; however, with the new Ampere architecture, apparently 2x the FP32 throughput is expected. Does that mean we can also expect 2x higher GROMACS performance from a 3080 compared to a 2080 Ti (let’s assume a large enough MD system)?
I’ve run NVIDIA 1060, 1070, 1080, 1080 Ti and 2080 Ti cards with gmx over the years. After accounting for system speed, they all scaled directly with the number of cores. So, no, I’d expect only a modest change, probably not worth the effort if your runs are relatively short (i.e. hours as opposed to days). However, there is the RTX 3090 with ~10,000 cores… so you could cut your time almost in half.
I was one of the fortunate ones who managed to get their hands on an RTX 3080 and can give some benchmarks with the MEM benchmark (82k atoms) and GROMACS 2020.3.
CUDA 11 and GCC 10.2 were used. The processor was a Ryzen 7 2700X, with HDMI output driven from the 3080 card. These are all single-GPU runs over 50,000 timesteps, and each figure is the average of two runs.
One run on 16 threads: 130 ns/day
Two runs (8 threads each, using pin offsets): 94 ns/day each (188 ns/day aggregate performance)
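In case it is useful to anyone, here is a minimal sketch of how the two side-by-side runs can be pinned (the .tpr file names are placeholders; -pinoffset selects which hardware threads each run gets):

    # two independent single-GPU runs, each pinned to its own 8 hardware threads
    gmx mdrun -s run1.tpr -ntmpi 1 -ntomp 8 -pin on -pinoffset 0 -pinstride 1 &
    gmx mdrun -s run2.tpr -ntmpi 1 -ntomp 8 -pin on -pinoffset 8 -pinstride 1 &
    wait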
Haven’t been able to get MPI to work yet, but since it’s not essential to my workflow I don’t intend to spend too much time on it. Thought this information might be useful to you anyway.
The performance you observe is somewhat low (in the same ballpark as a 2080, see here); it would be good to better understand the reasons for that. However, that benchmark system uses vsites, which are not supported with the new full-step offload in 2020, hence these runs are prone to be limited on recent hardware by the PCIe transfers and the CPU.
Would you be able to run a test that better reflects the GPU performance in a case where everything is offloaded? E.g. the ADH benchmark (cubic, no vsites) with everything offloaded (i.e. -nb gpu -pme gpu -bonded gpu -update gpu)?
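Put together, the invocation would be something like the following (the .tpr file name here is only a placeholder for the ADH cubic input):

    # offload nonbonded, PME, bonded forces, and integration/constraints to the GPU
    gmx mdrun -s adh_cubic.tpr -nb gpu -pme gpu -bonded gpu -update gpu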
Thanks for sending those benchmarks over. The cubic no-vsites one only ran once I changed the constraints from all-bonds to h-bonds; before that it failed with the error message:

    Inconsistency in user input: Update task on the GPU was required, but the
    following condition(s) were not satisfied: The number of coupled constraints
    is higher than supported in the CUDA LINCS code.
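For anyone who hits the same error, the change amounts to one line in the .mdp file, followed by regenerating the .tpr with gmx grompp (a sketch using the standard mdp keyword):

    ; constrain only bonds to hydrogen, which GPU LINCS can handle
    constraints = h-bonds   ; was: all-bonds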
You were right, performance jumped with those command-line options. The average from 2x 100,000-nstep runs was 200 ns/day. I also tried an old simulation system of mine, a solvated heme protein in a dodecahedral box (60k atoms), and got 360 ns/day, whereas before we typically saw only 180 ns/day on a V100.
Seriously impressed with this card now, and at only £649!
If you need me to run any more benchmarks, or if these need to be rerun, or you have any other queries, then please do ask. I’m more than happy to try them out.
Indeed, those benchmarks had ancient mdp files with all-bonds constraints even with 2 fs timesteps (I have replaced the tarballs on the FTP server with updated settings).
Indeed, this looks much better; on a 2080 SUPER I get ~140 ns/day out of the box. What you measure is still behind what the raw FLOPS of the 3080 would suggest, but that is not too unexpected (and perhaps not even 100k atoms can saturate these cards).
Can you please try two more tests for me?
Using the nvprof profiler (located in the CUDA toolkit path, e.g. /usr/local/cuda/bin), run the following:

    nvprof --profile-from-start off -u ms --concurrent-kernels off --log-file profiler.log gmx mdrun -quiet -noconfout -ntmpi 1 -nsteps 10000 -resethway -nb gpu -pme gpu -bonded gpu -update gpu -nstlist 100 -notunepme
Run it both for the ADH cubic system and for a larger one, e.g. the 768k water box from this water bench tarball.
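For instance, something along these lines would profile both systems into separate logs (the .tpr base names are hypothetical; substitute the actual input files):

    # profile each system for 10000 steps, writing one log per system
    for sys in adh_cubic water_768k; do
        nvprof --profile-from-start off -u ms --concurrent-kernels off \
            --log-file ${sys}_profiler.log \
            gmx mdrun -quiet -noconfout -ntmpi 1 -nsteps 10000 -resethway \
                -s ${sys}.tpr -nb gpu -pme gpu -bonded gpu -update gpu \
                -nstlist 100 -notunepme
    done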
Please share the profiler.log files of the two runs. These will contain breakdowns of individual kernel runtimes, which can be compared to other GPUs (e.g. here’s my output from a 2080 SUPER: https://termbin.com/1rhc).