How to improve performance on multiple GPUs

GROMACS version: 2018.8
GROMACS modification: No

Dear all GROMACS users,

I’m using an HPC cluster whose nodes consist of 2x E5-2670v2 CPUs and 2x V100 32 GB cards.

When I use only one V100 card, performance is 187 ns/day.

But when using two V100 cards, performance drops to 165 ns/day.

I found in the log file that “PME wait for PP” takes a large share of the run time, about 16.5%:
Domain decomp. 2.2
DD comm. load 0.0
DD comm. bounds 0.0
Vsite constr. 4.1
Send X to PME 2.6
Neighbor search 1.3
Launch GPU ops. 24.0
Comm. coord. 4.1
Force 4.7
Wait + Comm. F 3.8
PME mesh * 8.5
PME wait for PP * 16.5
Wait + Recv. PME F 2.5
Wait PME GPU gather 1.7
Wait GPU NB nonloc. 1.0
Wait GPU NB local 0.8
NB X/F buffer ops. 3.1
Vsite spread 6.3
Write traj. 0.0
Update 1.7
Constraints 18.2
Comm. energies 0.2

I tried multiple combinations of thread-MPI ranks and OpenMP threads with the dual V100 cards, but I could not exceed the performance of a single V100 card.

My system has 18240 atoms: 3264 TIP4P/Ice water molecules, 192 THF molecules, and 896 H2 molecules with virtual sites.

I ran the simulations with this .mdp file and command line
–mdp file–
integrator = md
dt = 0.001 ; 1 fs
nsteps = 1000000 ; 1 ns
nstenergy = 10000
nstlog = 10000
nstxout-compressed = 10000
gen-vel = yes
gen-temp = 260
constraint-algorithm = lincs
constraints = none
cutoff-scheme = Verlet
coulombtype = PME
rcoulomb = 0.95
lj-pme-comb-rule = Lorentz-Berthelot
vdwtype = Cut-off
rvdw = 0.95
DispCorr = EnerPres
tcoupl = Nose-Hoover
tc-grps = System
tau-t = 0.2
ref-t = 260
nhchainlength = 1
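As a quick sanity check, the simulated time implied by dt and nsteps can be computed directly (a minimal arithmetic sketch; note that times in a GROMACS .mdp file are in ps):

```python
# dt is given in ps in a GROMACS .mdp file
dt_ps = 0.001        # 0.001 ps = 1 fs per step
nsteps = 1000000     # number of MD steps

total_ps = nsteps * dt_ps        # total simulated time in ps
total_ns = total_ps / 1000.0     # convert ps -> ns

print(f"{total_ps:g} ps = {total_ns:g} ns")  # prints "1000 ps = 1 ns"
```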

command lines:
gmx mdrun -deffnm eql -nb gpu -ntomp 5 -dlb yes -ntmpi 4 -gputasks 0011 -pme gpu -npme 1

Is there any way to maximize performance in this situation with dual V100 cards?

Any recommendations and advice would be very helpful to me.

Thanks in advance


An 18240-atom system cannot even fully saturate a single V100 GPU, so running on multiple GPUs is unlikely to give a performance benefit. There may, however, be opportunities to slightly improve performance on a single GPU; please share the full log file so we can assess whether improvements can be made.

Dear pszilard,

Thanks for reply.

Right, my system is quite small, so a calculation with dual cards would be slow, as you said.

I want to share my log file, but it is too long to post here, and I don’t know how to upload it: when I select the file, the forum says “new users cannot upload attachments.”

If you let me know how to attach my file, I will do that.

Thanks again for your advice.


Upload your file somewhere online that you can share, such as Dropbox, Google Drive, or another file-sharing service.

Dear Dr_DBW,

Thanks for reply.

As you recommended, I uploaded my log file to Google drive.

Here is the link:

Thanks in advance.



Looks fine, but your run seems very PME-bound, so I expect 1 PP rank + 1 PME rank (i.e. the second GPU dedicated to doing only PME, -ntmpi 2 -npme 1) would run faster. Also, have you tried running on a single GPU? I’m still not sure that using two GPUs instead of one will be faster with only ~14k atoms.
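A minimal sketch of such a launch, assuming the same input files as in the original command; the -ntomp value of 10 is an assumption based on the 2x 10-core E5-2670v2 CPUs and should be tuned for the actual node:

```shell
# Sketch: one PP rank on GPU 0, one dedicated PME rank on GPU 1
# (-ntomp 10 is an assumed value for 2x 10-core CPUs; adjust as needed)
gmx mdrun -deffnm eql -nb gpu -pme gpu -ntmpi 2 -npme 1 -ntomp 10 -gputasks 01
```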

Dear pszilard,

Thanks for reply.

In fact, I was using HPC nodes with a single V100 card. But the node was upgraded to dual V100 cards, and it costs 1.5 times as much as the previous single-V100 node. So I need to increase my performance to compensate for the higher charge.

To your question: yes, I was running my simulations on a single V100 card, and the performance on that node was about 180 ns/day.

When I increased the number of atoms in my system, performance degraded, so I judged that my system saturated the V100. But with dual V100 cards, performance did not increase.

Anyway, I will try as you recommended.