GROMACS version: 2021
GROMACS modification: No
I've recently built a system with a 32-core Threadripper and two RTX 3090s. Installation of GROMACS 2021 on Ubuntu Linux went well using
cmake … -DGMX_BUILD_OWN_FFTW=ON -DREGRESSIONTEST_DOWNLOAD=ON -DGMX_GPU=CUDA -DGMX_USE_OPENCL=off -DGMX_CUDA_TARGET_SM=75
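(For completeness, a sketch of the standard remaining build steps, assuming an out-of-source build directory and the default install prefix:
make -j 32
make check
sudo make install
source /usr/local/gromacs/bin/GMXRC )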
Comparing three systems, each with two GPUs (GTX 1080, RTX 2080 Ti, and RTX 3090), on models of 30,000, 300,000, and 3e6 atoms (1AKI in water), these scale as expected at roughly 1:2:3. The RTX 3090 machine runs 130,000 atoms and 3e6 atoms at 162 ns/d and 6 ns/d respectively. In each case the GPUs run at about 50-70% utilization; the CPU is always at 99%.
For academic work I have access to Schrödinger's Maestro. No matter what model is used in the simulation, Maestro runs the same systems at about twice the speed, using one CPU core and 99% of a single GPU. I understand GROMACS is built for large systems and the difference shrinks with larger atom counts (Maestro: 4M atoms, 7 ns/d), but it is still using a single GPU.
I've gone through Creating Faster Molecular Dynamics Simulations with GROMACS 2020 | NVIDIA Developer Blog and the "bang for your buck" paper, and benchmarked the 2M-atom ribosome model at 17 ns/d on the RTX 3090, but I am still perplexed.
A typical run command is: gmx mdrun -deffnm XXX.npt -bonded gpu -nb gpu -pme gpu -ntomp 4 -ntmpi 16 -npme 1. I've attached a log of a run.
So for now, I just want to manage my expectations. Are the speeds I see typical? Is Maestro from another world? Granted, Maestro is a tad more expensive, but I feel I must not be configuring the system anywhere close to optimum. A factor of two (really 4x: twice the speed with half the GPUs) for most models is at best puzzling.
Comments/suggestions?
Paul
SR.isop.npt.iso.log (39.9 KB)
Hi,
Scaling across GPUs is hard for highly optimized codes, especially on consumer hardware (e.g. no high-performance interconnects like NVLink) – that is why, unless I'm mistaken, Desmond does not even attempt to do so.
For that reason it is also important to pick reasonable mdrun launch settings for multi-GPU runs; otherwise, unlike CPU performance, which is generally less sensitive to the settings, you can end up far from optimal performance.
Please try the best-practices often discussed on the forums:
- observe performance on a single GPU
- try to use the GPU-resident mode, i.e. with -update gpu
- try using fewer ranks and separate PME ranks (see the sketch below)
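For example, minimal launch lines along those lines might look like this (the -deffnm name is just a placeholder and the rank/thread counts would need tuning for your hardware):
# single GPU, GPU-resident mode
gmx mdrun -deffnm XXX.npt -nb gpu -pme gpu -bonded gpu -update gpu
# two GPUs, fewer ranks with one dedicated PME rank
gmx mdrun -deffnm XXX.npt -nb gpu -pme gpu -bonded gpu -ntmpi 4 -ntomp 8 -npme 1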
Cheers,
Szilárd
Szilard,
Thank you very much for your comments.
You're correct that Maestro runs only on one GPU.
I will try as you suggest, but could you tell me whether the numbers I see with my current settings are acceptable, or just plain bad? I just want a reference point.
Best,
Paul
System: 2x RTX 3090, Threadripper 3970. Using the 2.1M-atom RIB benchmark model, I made the following changes:
step size = 4 fs ( as in the benchmark )
nstlist = 50
implementing the suggestions from https://developer.nvidia.com/blog/creating-faster-molecular-dynamics-simulations-with-gromacs-2020/ namely:
export GMX_GPU_DD_COMMS=true
export GMX_FORCE_UPDATE_DEFAULT_GPU=true
constraints = h-bonds
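Collected into .mdp form, the parameter changes above would look roughly like this (a sketch; only the changed settings are shown, everything else stays at the benchmark values):
dt          = 0.004     ; 4 fs time step, as in the benchmark
nstlist     = 50        ; pair-list update interval
constraints = h-bonds   ; constrain bonds involving hydrogen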
run command
gmx mdrun -deffnm nvt ( or npt) -bonded gpu -nb gpu -pme gpu -ntomp 8 -ntmpi 8 -npme 1
Before implementing these changes (except for nstlist and step size), the model ran at ~14 ns/day. With the changes, NVT ran at 40 ns/day and NPT at 34 ns/day. The two GPUs ran between 60-87% during tuning and 50-78% thereafter; the CPU ran at 4.2 GHz, with CPU/GPU temperatures at ~70 °C. The largest change resulted from using GMX_FORCE_UPDATE_DEFAULT_GPU.
These changes greatly impaired the 25k-atom cubic RNase benchmark, decreasing output from ~500 ns/d to ~70 ns/d. I'll report more on this later.
Please do not hesitate to ask for more information
Paul
Further to the discussion above, here are the results for the RNase cubic benchmark using GMX_GPU_DD_COMMS=true and GMX_FORCE_UPDATE_DEFAULT_GPU=true.
single RTX 3090 GPU
run command: gmx mdrun -deffnm verlet.nvt -nb gpu -bonded gpu -pme gpu
with neither environment variable set
Performance: 202.961 ns/d
with GMX_FORCE_UPDATE_DEFAULT_GPU=true
Performance: 1002.965 ns/d
=====================================
two RTX 3090 GPUs
run command: gmx mdrun -deffnm verlet.nvt -nb gpu -bonded gpu -pme gpu -ntomp 8 -ntmpi 8 -npme 1
with neither environment variable set
Performance: 202.961 ns/d
with GMX_FORCE_UPDATE_DEFAULT_GPU=true
Performance: 327.001 ns/d
with GMX_GPU_DD_COMMS=true and GMX_FORCE_UPDATE_DEFAULT_GPU=true
Performance: 201.142 ns/d
With this small system, single-GPU performance is clearly superior, unlike with the 3M-atom RIB model. The force-update environment variable had a huge effect in single-GPU use.
The reported reference value for a single GPU with no communication variables is 490 ns/day, not the 202.9 ns/day measured here.
This all seems to make sense; we reported overall similar performance patterns in our recent paper (see Fig. 10 of https://aip.scitation.org/doi/full/10.1063/5.0018516), with the main difference that we ran on one-generation-older hardware.
The small RNase system does not parallelize well; there is too little work in that computation to split efficiently across two large GPUs (in fact, it cannot even saturate a single large GPU). The 2M-atom RIB system should, however, parallelize across two GPUs.
How does this update your assessment regarding the earlier comparison against Desmond? Have you compared the same system on the same hardware?
Some updated comparisons: RTX 3090; 1AKI, 2.6M atoms; RNase, 24k atoms.
RNase, 24k atoms, 4 fs step, single RTX 3090
gmx mdrun -deffnm verlet.nvt -nb gpu -pme gpu -update gpu
Performance: 1970.429 ns/day (0.012 hour/ns)
Performance: 2001.585 ns/day (0.012 hour/ns)
gmx mdrun -deffnm verlet.nvt -nb gpu -pme gpu -update gpu -bonded gpu
Performance: 1895.346 ns/day (0.013 hour/ns)
Performance: 1897.612 ns/day (0.013 hour/ns)
gmx mdrun -deffnm verlet.nvt -nb gpu -pme gpu -bonded gpu (no -update gpu)
Performance: 967.6 ns/day (roughly equivalent to a 2 fs step with GPU update)
Maestro, RNase model
Performance: 1002 ns/d
1AKI, 2.6M atoms, 4 fs step, 2x RTX 3090 (this is an enlarged version of the GROMACS tutorial model)
gmx mdrun -deffnm npt -bonded gpu -nb gpu -pme gpu -ntomp 8 -ntmpi 8 -npme 1 -update gpu
Performance: 17.194 ns/day (1.396 hour/ns)
export GMX_GPU_DD_COMMS=true, 4 fs step, 2x RTX 3090
notes:
run has requested the 'GPU halo exchange' feature, enabled by the GMX_GPU_DD_COMMS environment variable.
gmx mdrun -deffnm npt -bonded gpu -nb gpu -pme gpu -ntomp 8 -ntmpi 8 -npme 1 -update gpu
Performance: 18.273 ns/day (1.313 hour/ns)
gmx mdrun -deffnm npt -bonded gpu -nb gpu -pme gpu -update gpu (single GPU selected)
notes:
run has requested the 'GPU halo exchange' feature, enabled by the GMX_GPU_DD_COMMS environment variable.
run will default to '-update gpu' as requested by the GMX_FORCE_UPDATE_DEFAULT_GPU environment variable.
Performance: 25.565 ns/day (0.939 hour/ns)
Maestro
Performance during NPT production at a 2 fs step = 10 ns/d (crashed at a 4 fs step)
Performance during equilibration at a 4 fs step = 16 ns/d
Any comments on the single- vs dual-GPU results for GROMACS?
Hi - in addition to Szilárd's comments, I wrote a tool to generate a bunch of different run configurations for GROMACS; if you're interested you can find it here. It doesn't yet handle the direct GPU communication options, but it might help you find optimal throughput for your system with 2 or 4 simulations at a time.
OK, thanks for reporting back!
To sum it up, if I understand your data correctly, mdrun performance is on par with Maestro for the 24k-atom system and significantly faster for the 2.6M-atom system.
~1000.5 ns/day if you leave the bonds on the CPU.
The two-GPU run would be more efficient if you used fewer MPI ranks (i.e. -ntmpi 2, or 4 at most).
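For example, keeping the other flags from your earlier two-GPU runs, something along these lines might be worth trying (the thread counts are only a guess for the 32-core machine):
gmx mdrun -deffnm npt -nb gpu -bonded gpu -pme gpu -ntmpi 2 -ntomp 16 -npme 1
gmx mdrun -deffnm npt -nb gpu -bonded gpu -pme gpu -ntmpi 4 -ntomp 8 -npme 1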
That's correct: with the smaller system and a single GPU, GROMACS and Maestro are essentially equivalent. Maestro, as a single-GPU instrument, cannot compete on the large (>400k atom) systems against a single GPU, and certainly not against a dual-GPU run.
I’ll try reducing the rank count.
Thank you for all your input.
Paul