COM pull force and AWH

GROMACS version: 2020
GROMACS modification: Yes/No

Dear all ,

I am trying to accelerate sampling with the AWH method. I see a performance loss that comes mainly from the COM pull force (23.8% of the run time):

 Computing:            Num   Num      Call    Wall time   Giga-Cycles
                       Ranks Threads  Count      (s)       total sum    %

 Domain decomp.           7     5        34       0.675        58.952   1.7
 DD comm. load            7     5        34       0.001         0.112   0.0
 Send X to PME            7     5     10001       3.817       333.193   9.4
 Neighbor search          7     5        34       0.494        43.145   1.2
 Launch GPU ops.          7     5     20002       0.750        65.486   1.8
 Comm. coord.             7     5      9967       3.521       307.356   8.7
 Force                    7     5     10001       5.093       444.554  12.6
 Wait + Comm. F           7     5     10001       2.768       241.606   6.8
 PME mesh *               1     5     10001      18.072       225.363   6.4
 PME wait for PP *                               17.433       217.396   6.1
 Wait + Recv. PME F       7     5     10001       1.426       124.476   3.5
 Wait PME GPU gather      7     5     10001       2.500       218.248   6.2
 Wait GPU NB nonloc.      7     5     10001       0.048         4.222   0.1
 Wait GPU NB local        7     5     10001       0.037         3.218   0.1
 NB X/F buffer ops.       7     5     39936       2.245       195.950   5.5
 COM pull force           7     5     10001       9.666       843.764  23.8
 AWH                      7     5     10001       0.093         8.085   0.2
 Write traj.              7     5         1       0.158        13.766   0.4
 Update                   7     5     10001       0.984        85.913   2.4
 Constraints              7     5     10001       2.016       175.968   5.0
 Comm. energies           7     5      1001       1.103        96.303   2.7

 Total                                           35.505      3542.055 100.0

(*) Note that with separate PME ranks, the walltime column actually sums to
twice the total reported, but the cycle count total and % are correct.

                Core t (s)   Wall t (s)        (%)
        Time:     1420.104       35.505     3999.8
                  (ns/day)    (hour/ns)
 Performance:       48.675        0.493

Any suggestions or advice would be very much appreciated!

Thank you so much

Amnah

This high percentage for the COM pull force does not necessarily come from the COM pulling itself; it could also come from load imbalance in the force calculation before it.

But why are you running 7 MPI ranks with 5 threads each? That seems like a very sub-optimal setup. What hardware are you running on?

Thank you so much for your reply!
What parameters do I need to change in the mdp file to reduce the imbalance? Could you please help me with that?
awh.mdp (7.4 KB)

Regarding the hardware, this is the hardware description. The system I am simulating has 700K atoms; I thought 54-55 ns/day using 1 node (8 V100 GPUs) was optimal.

I really appreciate your help
Thank you so much!

Amnah

I don’t understand how you got to using 7 ranks and 5 threads. Did you use mpirun? If so, with how many ranks and on how many nodes? Did you specify the -nt, -ntmpi and/or -ntomp option?

Dear Berk,

I used 1 node (8 GPUs).
This is the command I am using:
mpirun -np $SLURM_NPROCS gmx_mpi mdrun -deffnm 2 -s AWH.tpr -v -nb gpu -pme gpu -npme 1 -nstlist 300

So $SLURM_NPROCS is 7 then, I suppose?
You would like to use 8 ranks with 8 GPUs, I would think. That should give you much better performance. Why does $SLURM_NPROCS get set to 7?

Hi Berk,

I have no idea why $SLURM_NPROCS gets set to 7. I just tried the following command (-np 8), but it changed to 7 according to the log file.

mpirun -np 8 gmx_mpi mdrun -deffnm 2 -s AWH.tpr -v -nb gpu -pme gpu -npme 1 -nstlist 200

I misunderstood what is going on. You ask for -npme 1, so you get 7 PP ranks and 1 PME rank.

But I expect that the performance is fully limited by the fact that only 1 GPU is used for PME. My guess would be that you get better performance using half of the node. Then you can run two runs on one node and get more than double the performance.
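For illustration only (the run names, tpr files, -ntomp value, and GPU id strings below are placeholders to adapt to your node, and you may also need -pinoffset so the two runs do not share cores), two independent runs on one node, each restricted to four GPUs, could be launched roughly like this:

mpirun -np 4 gmx_mpi mdrun -deffnm runA -s AWH_A.tpr -nb gpu -pme gpu -npme 1 -ntomp 5 -gpu_id 0123 &
mpirun -np 4 gmx_mpi mdrun -deffnm runB -s AWH_B.tpr -nb gpu -pme gpu -npme 1 -ntomp 5 -gpu_id 4567 &
wait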

Another option is to run PME on the CPU. But even then there are not many systems that scale to 8 V100 GPUs. You should try 4, 2 and 1 GPUs per simulation.
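As a rough sketch of that comparison (the rank counts are assumptions; match the thread counts to your actual cores):

mpirun -np 8 gmx_mpi mdrun -deffnm 2 -s AWH.tpr -v -nb gpu -pme cpu
mpirun -np 1 gmx_mpi mdrun -deffnm 2 -s AWH.tpr -v -nb gpu -pme gpu -gpu_id 0

The first line keeps all 8 GPUs for the short-range work and moves PME to the CPU cores; the second is a single-GPU baseline, so you can see how ns/day per GPU changes as you add GPUs.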