1 GPU vs 4 GPUs on a single node: performance

GROMACS version: 2020.1
GROMACS modification: No

Dear all,

I am running a system of around 400k atoms on one node, which has 40 cores and 10 compatible GPUs.

When I select one GPU the performance is 31 ns/day, while selecting 4 GPUs only improves it to 43 ns/day, so the performance does not scale accordingly. Could you please let me know what I am doing wrong, or what I need to consider to get better performance?
For further details, please find below the shared log files of the two simulations. The middle parts have been removed to keep the log files small.
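
For reference, the two throughput figures above can be turned into a speedup and a parallel efficiency. This is a small sketch that uses only the numbers quoted in this post; the function name `scaling_summary` is made up for illustration.

```shell
# Strong-scaling summary from the two ns/day figures quoted in the post.
one_gpu=31      # ns/day on 1 GPU
four_gpu=43     # ns/day on 4 GPUs
n_gpus=4

scaling_summary() {
    # $1 = single-GPU ns/day, $2 = multi-GPU ns/day, $3 = number of GPUs
    awk -v base="$1" -v multi="$2" -v n="$3" 'BEGIN {
        speedup = multi / base
        printf "speedup %.2fx, parallel efficiency %.0f%%\n", speedup, 100 * speedup / n
    }'
}

scaling_summary "$one_gpu" "$four_gpu" "$n_gpus"
```

With these inputs the efficiency comes out at roughly a third of ideal, which is the gap the question is about.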

Regards,
Alex

@pszilard
Any comment would be highly appreciated!
Thank you.

Alex,

I think there is not a lot you can do to improve this. The two main reasons this does not scale are:

  • you are not using the GPU-resident parallelization, that is -update gpu, and therefore data needs to be moved GPU->CPU->GPU every step; in addition
  • the machine you are running on is not ideal for strong scaling (without the GPU-resident loop): it is a dense GPU machine with a slow interconnect; as it has only PCIe 3.0 and only 8 lanes per GPU (so max 6 GB/s), the data transfers are the main bottleneck.
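
To get a feel for the second point, here is a rough back-of-envelope sketch of the per-step transfer cost. Everything in it beyond the 400k atoms, the 6 GB/s and the 31 ns/day quoted in this thread is an assumption: single-precision 3-vectors (12 bytes per atom), one coordinate copy in and one force copy out per step, no overlap with compute, no latency, and a 2 fs time step (the thread does not state the time step). The function names are hypothetical helpers.

```shell
# Back-of-envelope estimate; every constant here is an assumption, not a measurement.
transfer_ms() {
    # $1 = number of atoms, $2 = usable PCIe bandwidth in GB/s
    awk -v atoms="$1" -v gbps="$2" 'BEGIN {
        bytes = atoms * 12 * 2   # 3 floats x 4 bytes, coordinates in + forces out
        printf "%.1f MB per step, %.2f ms at %g GB/s\n", bytes / 1e6, 1000 * bytes / (gbps * 1e9), gbps
    }'
}

ms_per_step() {
    # $1 = ns/day, $2 = time step in fs (2 fs is assumed; the thread does not state it)
    awk -v nsday="$1" -v dt="$2" 'BEGIN { printf "%.2f ms per step\n", 86.4 * dt / nsday }'
}

transfer_ms 400000 6    # whole system over one PCIe 3.0 x8 link
ms_per_step 31 2        # wall time per step in the single-GPU run
```

Under these assumptions the copies alone take a noticeable fraction of each step, before any GPU<->GPU halo exchange is counted. Real runs overlap some of this with computation, so treat the numbers as an order-of-magnitude guide only.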

Try using GPU update so data can reside for tens to hundreds of steps on the GPU. If your machine also relies on PCIe alone for GPU<->GPU communication (not just CPU<->GPU), your scaling will likely still be limited by communication. If it had NVLink, at least between the GPUs, it could scale better.
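
A launch line requesting GPU update might look like the sketch below. The file prefix `md`, the 4 ranks x 10 threads split and the GPU IDs are placeholders, and whether mdrun accepts -update gpu together with several ranks depends on the GROMACS version and build, so check the log for a fallback to the CPU. The snippet only prints the command rather than running it.

```shell
# Sketch of a launch that offloads non-bonded, PME, bonded and update work to the GPUs.
# "md" and the rank/thread split are placeholders, not values from the posted logs.
MDRUN_ARGS="-deffnm md -ntmpi 4 -ntomp 10 -nb gpu -pme gpu -npme 1 -bonded gpu -update gpu -gpu_id 0123"

# Print the command for inspection; remove the echo to actually run it.
echo "gmx mdrun $MDRUN_ARGS"
```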

You could try running 8 ranks, 2 per GPU to expose more computation and overlap more of the slow communication, but it is not guaranteed it will help.
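
The rank and thread counts for that layout follow from the node size. A minimal sketch, assuming all 40 cores are used, thread-MPI ranks, and GPUs 0 to 3; mdrun is left to distribute the ranks over the listed GPUs, and -gputasks could be used instead to pin the mapping explicitly. The file prefix `md` is a placeholder.

```shell
# Derive ranks and OpenMP threads for "2 ranks per GPU" on a 40-core node.
cores=40
gpus=4
ranks_per_gpu=2

ranks=$((gpus * ranks_per_gpu))     # 8 thread-MPI ranks
threads=$((cores / ranks))          # 5 OpenMP threads per rank

# Print the command for inspection; remove the echo to actually run it.
echo "gmx mdrun -deffnm md -ntmpi $ranks -ntomp $threads -nb gpu -pme gpu -npme 1 -gpu_id 0123"
```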

Cheers,
Szilard

Thank you Szilard for the informative response.
Indeed, I did request the update on the GPU by setting:
export GMX_FORCE_UPDATE_DEFAULT_GPU=true
export GMX_GPU_DD_COMMS=true
export GMX_GPU_PME_PP_COMMS=true

However, it switches to “update cpu” because of the conditions below, as reported in the log file.
%-----------
Update task on the GPU was required, by the GMX_FORCE_UPDATE_DEFAULT_GPU environment variable, but the following condition(s) were not satisfied:
Domain decomposition is only supported with constraints when update groups are used. This means constraining all bonds is not supported, except for small molecules, and box sizes close to half the pair-list cutoff are not supported.
Will use CPU version of update.
%---------------

If I am not mistaken, this requires “constraints = h-bonds” rather than all-bonds; however, I already have constraints = h-bonds in my mdp file.
I have a periodic molecule in the system (a slab), which I think might be what causes the requested “-update gpu” to fall back to “-update cpu”. Do you agree?
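
One quick way to double-check which constraints setting is actually in the input file is sketched below. The helper `mdp_value` is hypothetical and `md.mdp` is a placeholder file name. It only reads the mdp text, so it cannot tell whether the tpr in use was built from that file.

```shell
# Print the value of an mdp key, ignoring ';' comments and letter case.
mdp_value() {
    # $1 = mdp file, $2 = key (written with dashes, e.g. constraint-algorithm)
    awk -F'=' -v key="$2" '
        { sub(/;.*/, "") }                  # strip comments; fields are re-split
        NF >= 2 {
            k = $1; v = $2
            gsub(/[ \t]/, "", k)            # key without surrounding blanks
            gsub(/^[ \t]+|[ \t]+$/, "", v)  # trim the value
            gsub(/_/, "-", k)               # mdp accepts _ and - interchangeably
            if (tolower(k) == tolower(key)) val = v
        }
        END { if (val != "") print val }
    ' "$1"
}

# Only query the file if it exists (placeholder name).
if [ -f md.mdp ]; then mdp_value md.mdp constraints; fi
```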

Also, I see a bunch of “peer mapping resources exhausted” messages in the log file, as below. How can these be avoided?

GPU peer access not enabled between GPUs 0 and 9 due to unexpected return value from cudaDeviceEnablePeerAccess: peer mapping resources exhausted

GPU peer access not enabled between GPUs 1 and 9 due to unexpected return value from cudaDeviceEnablePeerAccess: peer mapping resources exhausted

Thank you

I think you are right, but I’ll check that. If update groups are used, you should see that reported in the log (i.e. "Using update groups, nr XXX, average size YYY atoms, max radius ZZZ nm").
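
Searching the log for that line is a quick check. A minimal sketch; the function name is made up and `md.log` is a placeholder for the actual log file.

```shell
# Report whether an mdrun log mentions update groups.
report_update_groups() {
    # $1 = mdrun log file
    if grep -i "update groups" "$1"; then
        return 0
    fi
    echo "no update-groups line found in $1"
    return 1
}

# Only run against the log if it exists (placeholder name).
if [ -f md.log ]; then report_update_groups md.log || true; fi
```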

I have never seen that error, so we will have to take a look into that. Can you please file an issue on Issues · GROMACS / GROMACS · GitLab and attach your log file?
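
In the meantime, one thing that may be worth trying is hiding the unused GPUs from the process. The CUDA programming guide describes a system-wide limit of eight peer connections per device on systems without NVSwitch, and with ten GPUs visible each device has nine potential peers. That this limit is what triggers the message here is an assumption, not something confirmed in this thread. The sketch below restricts the run to four devices with CUDA_VISIBLE_DEVICES and counts the messages in a log. The device IDs, the file prefix `md` and the helper name are placeholders.

```shell
# Expose only four GPUs to the process; inside the job they are renumbered 0-3.
export CUDA_VISIBLE_DEVICES=0,1,2,3

# Print the command for inspection; remove the echo to actually run it.
echo "gmx mdrun -deffnm md -ntmpi 4 -ntomp 10 -nb gpu -pme gpu -npme 1 -gpu_id 0123"

# Count how many peer-access failures a log contains (hypothetical helper).
count_peer_errors() {
    # $1 = mdrun log file
    grep -c "peer mapping resources exhausted" "$1"
}
```

Comparing the count before and after restricting the visible devices would show whether the workaround has any effect.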

Thanks,
Sz.

Indeed, I set to have the update on GPU…

Can we see the whole log file, please? Knowing why update groups weren’t chosen is probably useful for guiding you and/or for writing a better error message in the future.

Please find the whole log file shared below:

Thank you