Using multiple GPUs on one machine

GROMACS version: 2023
GROMACS modification: No

I am building GROMACS to run on a single compute node with four GPUs, and I would like to offload PP work to two of the GPUs and PME work to the other two. From the installation guide, it’s not clear to me whether CUDA-aware MPI is necessary to allow direct GPU communication on a single node.

In the “MPI support” section it says that no user action is required to enable thread-MPI for parallel runs on a single node. However, in the “GPU-aware MPI support” section it says that a CUDA-aware, third-party MPI implementation is necessary for direct GPU communication, and it recommends OpenMPI for this. Finally, in the “Using cuFFTMp” section it says that decomposing PME work across multiple GPUs requires cuFFTMp. With all of that in mind, some questions:

  1. Does thread-MPI provide CUDA-aware communication, or does direct GPU communication require a third-party MPI library?
  2. If a third-party MPI library is required, does that mean direct GPU communication requires an MPI run even on a single node? If so, is there a performance penalty for using MPI rather than thread-MPI?
  3. Does decomposition of PME work across multiple GPUs also require GPU-aware MPI support, or just cuFFTMp? Or are these things orthogonal: PME decomposition requires cuFFTMp, and it can work either by staging communication through CPU memory (without CUDA-aware MPI) or via direct GPU communication (with CUDA-aware MPI)?
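
For context, the two build configurations I’m weighing look roughly like this – a default thread-MPI build, and a library-MPI build with cuFFTMp (cmake options as I understand them from the install guide; CUDA and cuFFTMp paths omitted):

cmake .. -DGMX_GPU=CUDA
cmake .. -DGMX_GPU=CUDA -DGMX_MPI=ON -DGMX_USE_CUFFTMP=ON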

Thank you!

Hi Roman,

When running on a single node, you can use thread-MPI with GPU direct communication (see example below). This uses CUDA directly for the inter-GPU communications. CUDA-aware MPI is required for GPU direct communications across multiple nodes.

It is unlikely that you will need PME decomposition when running on 4 GPUs, since that typically only pays off at larger scales. Assigning 1 GPU (or part of 1 GPU - again, see below) to PME, with the other 3 assigned to the more expensive short-range force calculations, usually gives good balance. But in any case, see the last link below for more on PME decomposition.
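
For illustration, a minimal sketch of that kind of split on 4 GPUs (a variant of the STMV command further below, with one thread-MPI rank per GPU so that one full GPU handles PME and the other three handle PP; tune the rank and thread counts to your hardware):

export GMX_ENABLE_DIRECT_GPU_COMM=1
gmx mdrun -ntmpi 4 -ntomp 16 -nb gpu -pme gpu -npme 1 -update gpu -bonded gpu -pin on -gpu_id 0123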

Here is an example with STMV (which I am pasting from some other documentation). Note that reference performance, on a range of systems, can be found at https://developer.nvidia.com/hpc-application-performance.

Download the benchmark:
wget https://zenodo.org/record/3893789/files/GROMACS_heterogeneous_parallelization_benchmark_info_and_systems_JCP.tar.gz
tar xf GROMACS_heterogeneous_parallelization_benchmark_info_and_systems_JCP.tar.gz
cd GROMACS_heterogeneous_parallelization_benchmark_info_and_systems_JCP/stmv

Run GROMACS using 4 GPUs (with IDs 0,1,2,3). Here we use 2 thread-MPI tasks per GPU (-ntmpi 8), which we find gives good performance. We set 16 OpenMP threads per thread-MPI task (assuming at least 128 CPU cores in the system). These can be adjusted to map to any specific hardware system, and experimented with for best performance…

export GMX_ENABLE_DIRECT_GPU_COMM=1
gmx mdrun -ntmpi 8 -ntomp 16 -nb gpu -pme gpu -npme 1 -update gpu -bonded gpu -nsteps 100000 -resetstep 90000 -noconfout -dlb no -nstlist 300 -pin on -v -gpu_id 0123

For more info, please see:

Creating Faster Molecular Dynamics Simulations with GROMACS 2020 | NVIDIA Technical Blog

Maximizing GROMACS Throughput with Multiple Simulations per GPU Using MPS and MIG | NVIDIA Technical Blog

Massively Improved Multi-node NVIDIA GPU Scalability with GROMACS | NVIDIA Technical Blog

Alan Gray (NVIDIA)

Hi Alan,

Thank you so much for your detailed answer – this is fantastic! I was very interested to see the reference performance for both GROMACS and AMBER, and I’m trying to wrap my head around how to compare the two on STMV, which seems to be the only system on which both were tested. On the face of it, AMBER seems to get a much bigger boost from GPU utilization and, as a result, to run much faster than GROMACS. (Maybe this GROMACS forum isn’t the best place to bring this up? :) However, it’s not clear from “PME-STMV_NPT_4fs” (AMBER) vs “STMV” (GROMACS) how the two setups might differ. I would guess that both use the NPT ensemble, but did GROMACS run with a 4fs time step using vsites? What were the system size, force field, and water model for each simulation? And what does PME-STMV mean, as opposed to STMV? I was under the impression that GROMACS at its best is faster than, or at least competitive with, AMBER, even with AMBER’s commercialized GPU code. Was I wrong?

Thank you again!

Best,
Roman

Hi,

I didn’t set up either of these benchmarks (and don’t have any involvement with AMBER), so I am perhaps not the best person to answer. However, it looks like they are not directly comparable, since AMBER uses a 4fs timestep (there seems to be more info here: AMBER GPU Benchmarks (ambermd.org)) while GROMACS uses a 2x smaller 2fs timestep (as you can see from the “dt = 0.002” line in the .mdp file that accompanies the .tpr input in the STMV archive).
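
If you want to double-check the time step recorded in a .tpr yourself, one way (a quick sketch; topol.tpr is a placeholder for the actual input file name) is to dump the run parameters and grep for dt:

gmx dump -s topol.tpr | grep -m1 ' dt '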

Best regards,

Alan

Roman,

Vsites are not compatible with GPU-resident mode. As far as I know, AMBER uses HMR to allow 4fs time steps, which can be set up in GROMACS as well.

Otherwise, I’m not sure what you are basing your observations on, so it would be best to clarify them.

On the face of it, AMBER seems to get a much bigger boost from GPU utilization and, as a result, run much faster than GROMACS.

You can turn off CPU optimizations in GROMACS and get more “boost” as well. Just set -DGMX_SIMD=None and perhaps also pass -O1 as a compiler optimization flag.

Jokes aside, relative speedup tells nothing about absolute performance. GROMACS runs fast on CPUs and scales very well (peaking at around 100 atoms per core, so about 10,000 cores for STMV!). That is why the performance on GPUs relative to CPUs is lower than for some other codes. Please do compare AMBER CPU performance and strong scaling to GROMACS and you’ll see what I mean ;)

(Maybe this GROMACS forum isn’t the best place to bring this up? :)

On the contrary, it is always good to clear up potential misunderstandings.

As explained above, the reason for the massive speedup you see is that AMBER CPU performance is far lower than that of GROMACS.

When it comes to actual absolute performance, you must have looked at the wrong numbers – can you check again?

The recent AMBER benchmarks posted here claim 60 ns/day for STMV on an A100. That’s about the same as what we have shown in our blog post. However, we also show decent scaling up to 16-32 GPUs.

Cheers,
Szilárd

PS: I’ve not done an exhaustive search, but here are some CPU benchmarks from Biowulf: https://hpc.nih.gov/apps/amber, which show about 10 ns/day for their FactorIX benchmark.
On CPUs similar to those in that benchmark, GROMACS will run a FactorIX-sized system at ~40 ns/day with a 2 fs time step, or ~80 ns/day with 4 fs.

Alan and Szilard, thank you for both of your responses!

I was specifically referring to the application performance summary above – it looks to me like on the “DC-STMV_NPT” test module on A100 GPUs AMBER gets 54, 107, 214, and 428 ns/day with 1, 2, 4, and 8 GPUs respectively, while on the “STMV” test module on A100s GROMACS gets 24, 44, 79, and 125 ns/day with the respective numbers of GPUs. I knew those modules could not be an apples-to-apples comparison, because there is no way AMBER runs that much faster on the same system and setup – I am quite confident in GROMACS performance :) As Alan pointed out, the shorter time step in the GROMACS simulation must explain at least some of the difference.

More importantly (at least for me) – I had no idea that vsites aren’t compatible with GPUs, unless I am misunderstanding exactly what GPU-resident mode is? I have run a number of GPU-accelerated simulations with vsites with the 2020 release, and never got any errors or other indications that anything wasn’t working. I can’t remember exactly which calculations I offloaded to GPUs at the time, but it was certainly both PP and PME work. I was counting on using a 4fs time step now as well. Should I switch to HMR instead? The page on removing the fastest degrees of freedom only discusses virtual sites. Any suggestions on how to set up HMR in GROMACS?

Thank you both again!
Roman

Hi Roman,

Vsites are compatible with GPUs for the main force calculations. GPU-resident mode refers to the whole simulation timestep being offloaded to the GPU, such that the data can stay resident on the GPU across multiple steps. In particular, this requires update and constraints on the GPU (-update gpu) – this is the part that is not supported for vsites.
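
So if you do keep vsites, you can still offload the force work and just leave update/constraints on the CPU – a rough sketch, reusing the flags from the STMV example above:

gmx mdrun -ntmpi 8 -ntomp 16 -nb gpu -pme gpu -npme 1 -bonded gpu -update cpu -pin on -gpu_id 0123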

Best regards,

Alan

Thank you Alan, that makes sense! And I see from this thread that setting up HMR simply requires passing -heavyh to pdb2gmx, which is easy enough.
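
For future reference (and please correct me if I have this wrong), I’m planning to set it up roughly like this, with placeholder file names and constraint/time-step values that I still need to validate for my system:

gmx pdb2gmx -f protein.pdb -o conf.gro -p topol.top -heavyh

and then in the .mdp:

dt          = 0.004
constraints = h-bonds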

One more question, if you two don’t mind. Is there a tradeoff between speed and accuracy with HMR + update on the GPU vs vsites + update on the CPU? In other words, is there an accuracy benefit to using vsites over HMR? I understand that this may be system-specific, but how do the two methods compare?

Thanks again!

Roman

Hi,

I’m afraid I don’t know the answer to that one.

Best regards,

Alan