GROMACS version: 2021
GROMACS modification: No
I've recently built a system with a 32-core Threadripper and two RTX 3090s. Installation of GROMACS 2021 on Ubuntu Linux went well using
cmake … -DGMX_BUILD_OWN_FFTW=ON -DREGRESSIONTEST_DOWNLOAD=ON -DGMX_GPU=CUDA -DGMX_USE_OPENCL=off -DGMX_CUDA_TARGET_SM=75
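(For completeness, a sketch of the standard remaining build steps, assuming an out-of-source build directory and the default install prefix:
make -j 32
make check
sudo make install
source /usr/local/gromacs/bin/GMXRC )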
Comparing three systems, each with two GPUs (GTX 1080, RTX 2080 Ti, and RTX 3090), on models of 30,000, 300,000, and 3e6 atoms (1AKI in water), these scale as expected at roughly 1:2:3. The RTX 3090 machine runs 130,000 atoms and 3e6 atoms at 162 ns/d and 6 ns/d respectively. In each case the GPUs run at about 50-70% utilization; the CPU is always at 99%.
For academic work I have access to Schrödinger's Maestro. No matter what model is used in the simulation, Maestro runs the same systems at about twice the speed, using one CPU core and 99% of a single GPU. I understand GROMACS is built for large systems and the difference shrinks with larger atom counts (Maestro: 4M atoms, 7 ns/d), but it is still using a single GPU.
I've gone through Creating Faster Molecular Dynamics Simulations with GROMACS 2020 | NVIDIA Developer Blog and the "bang for your buck" paper, and benchmarked the 2M-atom ribosome model at 17 ns/d on the RTX 3090, but I am still perplexed.
A typical run command is: gmx mdrun -deffnm XXX.npt -bonded gpu -nb gpu -pme gpu -ntomp 4 -ntmpi 16 -npme 1. I've attached a log of a run.
So for now, I just want to manage my expectations. Are the speeds I see typical? Is Maestro from another world? Granted, Maestro is a tad more expensive, but I feel I must not be configuring the system anywhere close to optimum. A factor of two (really 4x: twice the speed with half the GPUs) for most models is at best puzzling.
Comments/suggestions?
Paul
SR.isop.npt.iso.log (39.9 KB)
Hi,
Scaling across GPUs is hard for highly optimized codes, especially on consumer hardware (e.g. no high-performance interconnects like NVLink) – that is why, unless I'm mistaken, Desmond does not even attempt to do so.
For that reason it is also important to pick reasonable mdrun launch settings for multi-GPU runs; otherwise, unlike CPU performance, which is generally less sensitive to the settings, you can end up far from optimal performance.
Please try the best-practices often discussed on the forums:
- observe performance on a single GPU
- try to use the GPU-resident mode, i.e. with -update gpu
- try using fewer ranks and separate PME ranks (see the sketch below)
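For example, minimal launch lines along those lines might look like this (the -deffnm name is just a placeholder and the rank/thread counts would need tuning for your hardware):
# single GPU, GPU-resident mode
gmx mdrun -deffnm XXX.npt -nb gpu -pme gpu -bonded gpu -update gpu
# two GPUs, fewer ranks with one dedicated PME rank
gmx mdrun -deffnm XXX.npt -nb gpu -pme gpu -bonded gpu -ntmpi 4 -ntomp 8 -npme 1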
Cheers,
Szilárd
Szilard,
Thank you very much for your comments.
You're correct that Maestro runs only on one GPU.
I will try as you suggest, but could you tell me whether the numbers I see with my current settings are acceptable, or just plain bad? I just want a reference point.
Best,
Paul
System: 2x RTX 3090, Threadripper 3970. Using the 2.1M-atom RIB benchmark model, I made the following changes:
step size = 4 fs ( as in the benchmark )
nstlist = 50
implementing the suggestions from https://developer.nvidia.com/blog/creating-faster-molecular-dynamics-simulations-with-gromacs-2020/ namely:
export GMX_GPU_DD_COMMS=true
export GMX_FORCE_UPDATE_DEFAULT_GPU=true
constraints = h-bonds
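Collected into .mdp form, the parameter changes above would look roughly like this (a sketch; only the changed settings are shown, everything else stays at the benchmark values):
dt          = 0.004     ; 4 fs time step, as in the benchmark
nstlist     = 50        ; pair-list update interval
constraints = h-bonds   ; constrain bonds involving hydrogen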
run command
gmx mdrun -deffnm nvt ( or npt) -bonded gpu -nb gpu -pme gpu -ntomp 8 -ntmpi 8 -npme 1
Before implementing these changes (except for nstlist and step size), the model ran at ~14 ns/day. With the changes, NVT ran at 40 ns/day and NPT at 34 ns/day. The two GPUs ran between 60-87% during tuning and 50-78% thereafter; the CPU ran at 4.2 GHz, with CPU/GPU temperatures at ~70 °C. The largest change resulted from using GMX_FORCE_UPDATE_DEFAULT_GPU.
These changes greatly impaired the 25k-atom cubic RNase benchmark, decreasing output from ~500 ns/d to ~70 ns/d. I'll report more on this later.
Please do not hesitate to ask for more information
Paul
Further to the discussion above, here are the results for the RNase cubic benchmark using GMX_GPU_DD_COMMS=true and GMX_FORCE_UPDATE_DEFAULT_GPU=true.
single RTX 3090 GPU
run command: gmx mdrun -deffnm verlet.nvt -nb gpu -bonded gpu -pme gpu
with neither environment variable set
Performance: 202.961 ns/d
with GMX_FORCE_UPDATE_DEFAULT_GPU=true
Performance: 1002.965 ns/d
=====================================
two RTX 3090 GPUs
run command: gmx mdrun -deffnm verlet.nvt -nb gpu -bonded gpu -pme gpu -ntomp 8 -ntmpi 8 -npme 1
with neither environment variable set
Performance: 202.961 ns/d
with GMX_FORCE_UPDATE_DEFAULT_GPU=true
Performance: 327.001 ns/d
with GMX_GPU_DD_COMMS=true and GMX_FORCE_UPDATE_DEFAULT_GPU=true
Performance: 201.142 ns/d
With this small system, single-GPU performance is clearly superior, unlike with the 3M-atom RIB model. The force-update environment variable had a huge effect in single-GPU use.
The reported reference value for a single GPU with no communication variables is 490 ns/day, not the 202.9 ns/day measured here.
This all seems to make sense; we reported overall similar performance patterns in our recent paper (see Fig. 10 of https://aip.scitation.org/doi/full/10.1063/5.0018516), with the main difference that we ran on one-generation-older hardware.
The small RNase system does not parallelize well; there is too little work in that computation to split efficiently across two large GPUs (in fact, it cannot even saturate a single large GPU). The 2M-atom RIB system should, however, parallelize across two GPUs.
How does this update your assessment regarding the earlier comparison against Desmond? Have you compared the same system on the same hardware?
Some updated comparisons: RTX 3090; 1AKI, 2.6M atoms; RNase, 24k atoms.
RNase, 24k atoms, 4 fs step, single RTX 3090
gmx mdrun -deffnm verlet.nvt -nb gpu -pme gpu -update gpu
Performance: 1970.429 ns/day (0.012 hour/ns)
Performance: 2001.585 ns/day (0.012 hour/ns)
gmx mdrun -deffnm verlet.nvt -nb gpu -pme gpu -update gpu -bonded gpu
Performance: 1895.346 ns/day (0.013 hour/ns)
Performance: 1897.612 ns/day (0.013 hour/ns)
gmx mdrun -deffnm verlet.nvt -nb gpu -pme gpu -bonded gpu (no -update gpu)
Performance: 967.6 ns/day (roughly equivalent to a 2 fs step with GPU update)
Maestro, RNase model
Performance: 1002 ns/d
1AKI, 2.6M atoms, 4 fs step, 2x RTX 3090 (this is an enlarged version of the GROMACS tutorial model)
gmx mdrun -deffnm npt -bonded gpu -nb gpu -pme gpu -ntomp 8 -ntmpi 8 -npme 1 -update gpu
Performance: 17.194 ns/day (1.396 hour/ns)
export GMX_GPU_DD_COMMS=true, 4 fs step, 2x RTX 3090
notes:
run has requested the 'GPU halo exchange' feature, enabled by the GMX_GPU_DD_COMMS environment variable.
gmx mdrun -deffnm npt -bonded gpu -nb gpu -pme gpu -ntomp 8 -ntmpi 8 -npme 1 -update gpu
Performance: 18.273 ns/day (1.313 hour/ns)
gmx mdrun -deffnm npt -bonded gpu -nb gpu -pme gpu -update gpu (single GPU selected)
notes:
run has requested the 'GPU halo exchange' feature, enabled by the GMX_GPU_DD_COMMS environment variable.
run will default to '-update gpu' as requested by the GMX_FORCE_UPDATE_DEFAULT_GPU environment variable.
Performance: 25.565 ns/day (0.939 hour/ns)
Maestro
Performance during NPT production at a 2 fs step = 10 ns/d (crashed at a 4 fs step)
Performance during equilibration at a 4 fs step = 16 ns/d
Any comments on the single- vs dual-GPU results for GROMACS?
Hi - in addition to Szilárd's comments, I wrote a tool to generate a bunch of different run configurations for GROMACS; if you're interested you can find it here. It doesn't yet handle the direct GPU communication options, but it might help you find optimal throughput for your system with 2 or 4 simulations at a time.
OK, thanks for reporting back!
To sum it up, if I understand your data correctly, mdrun performance is on par with Maestro for the 24k-atom system and significantly faster for the 2.6M-atom system.
~1000.5 ns/day if you leave the bonds on the CPU.
The two-GPU run would be more efficient if you used fewer MPI ranks (i.e. -ntmpi 2, or 4 at most).
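For example, keeping the other flags from your earlier two-GPU runs, something along these lines might be worth trying (the thread counts are only a guess for the 32-core machine):
gmx mdrun -deffnm npt -nb gpu -bonded gpu -pme gpu -ntmpi 2 -ntomp 16 -npme 1
gmx mdrun -deffnm npt -nb gpu -bonded gpu -pme gpu -ntmpi 4 -ntomp 8 -npme 1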
That's correct: with the smaller system and a single GPU, GROMACS and Maestro are essentially equivalent. Maestro, as a single-GPU instrument, cannot compete on the large (>400k atom) systems against a single GPU, and certainly not against a dual-GPU run.
I’ll try reducing the rank count.
Thank you for all your input.
Paul