Command-line keywords for getting good performance in GROMACS

I am trying to run GROMACS 2023.2 with GPU acceleration on an HPC cluster, but I am getting very slow performance. For test purposes I ran the calculation on 8 cores for 10 minutes only; the projected number of steps was only about 15,000 in those 10 minutes, so unfortunately the performance is not up to the mark. The SLURM script and the last lines of the output file are shown below.
SLURM SCRIPT FILE
#!/bin/bash
#SBATCH -N 1
#SBATCH --ntasks-per-node=8
#SBATCH --gres=gpu:A100-SXM4:1
#SBATCH --partition=testp
#SBATCH --time=00:10:00
#SBATCH --error=error_test.%J.err
#SBATCH --output=output_test.%J.out

echo "Starting at $(date)"
echo "Running on hosts: $SLURM_NODELIST"
echo "Running on $SLURM_NNODES nodes."
echo "Running $SLURM_NTASKS tasks."
echo "Job id is $SLURM_JOBID"
echo "Job submission directory is: $SLURM_SUBMIT_DIR"
cd $SLURM_SUBMIT_DIR

source /opt/hpcx-v2.9.0-gcc-MLNX_OFED_LINUX-5.4-1.0.3.0-ubuntu20.04-x86_64/env.sh

source /nlsasfs/home/groupiiiv/sarthakt/software_cdac/gromacs-2023.2/install/bin/GMXRC

mpirun -mca pml ucx -x UCX_NET_DEVICES -np 8 /nlsasfs/home/groupiiiv/sarthakt/software_cdac/gromacs-2023.2/build/bin/gmx_mpi mdrun -ntomp 4 --deffnm md_0_10 -cpi md_0_10.cpt -noappend

OUTPUT FILE
Started mdrun on rank 0 Thu Oct 26 11:14:28 2023

           Step           Time
              0        0.00000

   Energies (kJ/mol)
           Bond            U-B    Proper Dih.  Improper Dih.      CMAP Dih.
    5.17908e+03    1.42265e+04    1.66364e+04    8.73544e+02   -8.25547e+02
          LJ-14     Coulomb-14        LJ (SR)   Coulomb (SR)   Coul. recip.
    5.15878e+03    7.27696e+04    8.95562e+04   -1.15064e+06    3.48482e+03
      Potential    Kinetic En.   Total Energy  Conserved En.    Temperature
   -9.43579e+05    1.87862e+05   -7.55718e+05   -7.55674e+05    3.01866e+02
 Pressure (bar)   Constr. rmsd
    2.79474e+02    2.98607e-06

DD step 99 load imb.: force 5.3%
step 600: timed with pme grid 56 56 56, coulomb cutoff 1.200: 65983.2 M-cycles
step 800: timed with pme grid 48 48 48, coulomb cutoff 1.400: 58358.5 M-cycles
step 1000: timed with pme grid 44 44 44, coulomb cutoff 1.527: 60890.5 M-cycles
step 1200: timed with pme grid 40 40 40, coulomb cutoff 1.680: 60729.1 M-cycles
step 1400: timed with pme grid 36 36 36, coulomb cutoff 1.866: 58230.7 M-cycles
step 1400: the maximum allowed grid scaling limits the PME load balancing to a coulomb cut-off of 1.866
step 1600: timed with pme grid 36 36 36, coulomb cutoff 1.866: 60691.8 M-cycles
step 1800: timed with pme grid 40 40 40, coulomb cutoff 1.680: 56804.9 M-cycles
step 2000: timed with pme grid 42 42 42, coulomb cutoff 1.600: 61399.7 M-cycles

Received the TERM signal, stopping within 100 steps

Thank you all in advance. Feel free to ask for more details so that I can start my calculations happily. :D
Can anyone suggest solutions to enhance performance?

If you are running on 8 cores of a single node, there is no reason to use MPI. You are also specifying 8 MPI tasks with 4 OpenMP threads per task, i.e. 32 threads on 8 cores, so you are probably just overloading the hardware. gmx mdrun -nt 8 is sufficient if using 8 cores.
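For example, a minimal revision of the job script along these lines avoids the oversubscription. This is only a sketch: it assumes a thread-MPI (non-MPI) gmx binary is available after sourcing GMXRC, and the paths, partition, and GPU resource lines are copied from the original script.

#!/bin/bash
#SBATCH -N 1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --gres=gpu:A100-SXM4:1
#SBATCH --partition=testp
#SBATCH --time=00:10:00

cd $SLURM_SUBMIT_DIR
source /nlsasfs/home/groupiiiv/sarthakt/software_cdac/gromacs-2023.2/install/bin/GMXRC

# One process with 8 threads on a single node; no mpirun needed.
gmx mdrun -nt 8 -deffnm md_0_10 -cpi md_0_10.cpt -noappend

# If only the MPI build (gmx_mpi) is installed, the equivalent is a single rank
# with 8 OpenMP threads:
# mpirun -np 1 gmx_mpi mdrun -ntomp 8 -deffnm md_0_10 -cpi md_0_10.cpt -noappend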

Also note that mdrun is still tuning the cut-offs at this point, so the performance report will not be accurate. You need to run for a larger number of steps and should use -resethway to reset the timing counters so that the tuning phase is excluded.
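For the benchmark run itself, something like the line below could be used once the simplified script is in place. The step count and the -maxh value are illustrative only: -resethway resets the performance counters halfway through the run so the PME tuning phase at the start is excluded from the reported timings, and -maxh 0.15 stops the run cleanly a little before the 10-minute wall-time limit.

# Longer test run: illustrative step count, counters reset at the halfway point.
gmx mdrun -nt 8 -deffnm md_0_10 -nsteps 50000 -resethway -maxh 0.15 -noappend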

Thank you sir for your immediate response. I will try your advice and update you soon. Thank you so much again.

Regards,
Sarthak