Optimizing simulation speed for 2 million atoms

GROMACS version: 2024.1
GROMACS modification: Yes/No
Dear experts and GROMACS users,
I am trying to optimize the simulation speed for my biological system of about 2 million atoms. I am running on a single node with 3 GPUs. With 3 MPI ranks (one per GPU), 24 threads per rank, and NB, bonded (BB), PME (1 rank on 1 GPU) and update all offloaded to the GPUs, I get close to 9 ns/day. I have enabled direct communication between the GPUs. Is there anything I am missing, and is there any way to improve the speed further?
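
For reference, the setup described in this post corresponds roughly to the launch line built below. This is only a sketch: the `mpirun -np` launcher, the `gmx_mpi` binary name and the `md` file prefix are assumptions, and `mdrun_command` is a hypothetical helper. The `mdrun` flags and the `GMX_ENABLE_DIRECT_GPU_COMM` environment variable are the ones recent GROMACS versions use for task offload and direct GPU communication.

```python
import shlex


def mdrun_command(ranks=3, threads=24, pme_ranks=1, deffnm="md"):
    """Build an mdrun launch line for a library-MPI build (hypothetical helper)."""
    return [
        "mpirun", "-np", str(ranks),  # assumed launcher; site-specific
        "gmx_mpi", "mdrun",           # assumed name of the MPI-enabled binary
        "-deffnm", deffnm,            # assumed file prefix
        "-ntomp", str(threads),       # threads per rank
        "-npme", str(pme_ranks),      # one dedicated PME rank
        "-nb", "gpu",                 # nonbonded on GPU
        "-bonded", "gpu",             # bonded on GPU
        "-pme", "gpu",                # PME on GPU
        "-update", "gpu",             # update and constraints on GPU
    ]


# Direct GPU communication is requested through the environment.
env = {"GMX_ENABLE_DIRECT_GPU_COMM": "1"}

command = mdrun_command()
prefix = " ".join(f"{key}={value}" for key, value in env.items())
print(prefix, shlex.join(command))
```

With 3 ranks and `-npme 1`, two ranks handle particle-particle (PP) work and one handles PME, so each of the 3 GPUs gets one rank.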

What communication is used between the GPUs? Sometimes it’s quicker to just run on one GPU.

Thanks, @MagnusL, for the reply. My simulations use GPU direct communication with CUDA-aware MPI. Currently I get 9 ns/day with 3 MPI ranks (2 PP + 1 PME). I can only use 3 GPUs, hence this unusual rank distribution. Performance degrades severely when I increase the number of MPI ranks while keeping 3 GPUs. I have also tried 1 PP + 1 PME rank, but it is no better than the 2 + 1 layout. I am offloading NB, bonded (BB) and update to the GPUs.

Are the 3 GPUs in a single node? I found better performance on a single node with thread-MPI. Also, the best performance of all was on a single GPU.
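
One way to compare these layouts on equal footing is a short benchmark of each. The sketch below is illustrative only: it assumes a thread-MPI build invoked as `gmx`, a `bench` file prefix, a run length of 20000 steps and 24 threads per rank, and the helper names `layouts` and `ns_per_day` are hypothetical. It builds one `mdrun` command per layout and reads the ns/day figure from the `Performance:` line that `mdrun` writes at the end of its log.

```python
import re
import shlex

# Flags shared by every benchmark run; the offload settings follow the thread.
COMMON = [
    "-deffnm", "bench",          # assumed file prefix
    "-nsteps", "20000",          # assumed short run length
    "-resethway", "-noconfout",  # reset timers halfway, skip the final frame
    "-nb", "gpu", "-bonded", "gpu", "-pme", "gpu", "-update", "gpu",
]


def layouts(threads=24):
    """Return candidate thread-MPI layouts as argument lists."""
    t = str(threads)
    base = ["gmx", "mdrun"]
    return {
        "3 GPUs, 2 PP + 1 PME": base + ["-ntmpi", "3", "-ntomp", t, "-npme", "1", "-gpu_id", "012"] + COMMON,
        "2 GPUs, 1 PP + 1 PME": base + ["-ntmpi", "2", "-ntomp", t, "-npme", "1", "-gpu_id", "01"] + COMMON,
        "1 GPU, single rank": base + ["-ntmpi", "1", "-ntomp", t, "-gpu_id", "0"] + COMMON,
    }


def ns_per_day(log_text):
    """Extract ns/day from the 'Performance:' line of an mdrun log, or None."""
    match = re.search(r"^Performance:\s+([0-9.]+)", log_text, flags=re.MULTILINE)
    return float(match.group(1)) if match else None


# For the multi-rank layouts, set GMX_ENABLE_DIRECT_GPU_COMM=1 in the environment first.
for name, args in layouts().items():
    print(f"# {name}")
    print(shlex.join(args))
```

After each run, `ns_per_day(open("bench.log").read())` gives the number to compare across layouts.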