Getting the best performance in parallel

GROMACS version: 2020.4
GROMACS modification: Yes/No
Dear All,
I have a cluster with 10 nodes, each with a 48-core processor.
I am running on 2 and 5 nodes and getting at most 43 ns/day and 123 ns/day respectively, which seems slow for this hardware.
Below is the script I use for the run. Can anyone tell me how to achieve maximum performance here?
When I go to 10 nodes, what should I add to increase the performance?

#PJM -j
#PJM --rsc-list "node=5"
#PJM --mpi proc=48
#PJM --rsc-list "rscunit=rscunit_ft01"
#PJM --rsc-list "elapse=12:00:00"
#PJM --rsc-list "freq=2200"
#PJM --rsc-list "group_1"

export OMP_PROC_BIND=close
export KMP_AFFINITY=verbose,compact

. /vol0001/apps/oss/spack-v0.15.4/share/spack/
module load gromacs-2020.4-fj-4.2.1a-zexx6ov
mpiexec gmx_mpi mdrun -deffnm md
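
One way to reason about the launch line above is to fix the number of OpenMP threads per MPI rank and derive the rank count from it. The sketch below is only a dry run: it prints an mdrun command rather than launching it. The variable names and the value NTOMP=12 are assumptions for illustration, not taken from the original script, and `mpiexec -n` is assumed to be accepted by the MPI launcher on this cluster.

```shell
# Hypothetical layout helper; none of these variable names come from the original script.
NODES=5            # nodes requested from the scheduler
CORES_PER_NODE=48  # cores per node, as stated in the question
NTOMP=12           # OpenMP threads per MPI rank (assumed value to experiment with)

RANKS_PER_NODE=$((CORES_PER_NODE / NTOMP))
TOTAL_RANKS=$((NODES * RANKS_PER_NODE))

# Warn about layouts that would leave cores idle on each node
if [ $((RANKS_PER_NODE * NTOMP)) -ne "$CORES_PER_NODE" ]; then
    echo "NTOMP=$NTOMP does not divide $CORES_PER_NODE cores evenly" >&2
fi

export OMP_NUM_THREADS=$NTOMP
CMD="mpiexec -n $TOTAL_RANKS gmx_mpi mdrun -ntomp $NTOMP -deffnm md"
echo "$CMD"   # dry run: print the command instead of launching it
```

Setting NODES=10 would give 40 ranks with the same per-node layout. Whether that split is faster than another one has to be measured on the actual system.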

6 nodes with 4 MPI ranks (export OMP_NUM_THREADS=4) give 70 ns/day (288 cores).
3 nodes with 3 MPI ranks give 50 ns/day (144 cores).
In every case the execution command was:
mpiexec gmx_mpi mdrun -deffnm md

6 nodes with the -ntomp 6 option give 93 ns/day (288 cores).

-ntomp 12 on 12 nodes gives 131 ns/day (576 cores).
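
To put the numbers reported in this thread side by side, parallel efficiency can be estimated as the measured speedup divided by the ideal (linear) speedup. The helper below is a rough sketch: the function name `efficiency` is made up for illustration, and the compared runs also differ in their thread settings, so the result is only indicative.

```shell
# efficiency REF_CORES REF_NSDAY CORES NSDAY
# Prints the percentage of ideal linear scaling relative to the reference run.
efficiency() {
    awk -v c0="$1" -v p0="$2" -v c1="$3" -v p1="$4" \
        'BEGIN { printf "%.1f\n", 100 * (p1 / p0) / (c1 / c0) }'
}

# 6 nodes with -ntomp 6 (93 ns/day) versus 12 nodes with -ntomp 12 (131 ns/day)
efficiency 288 93 576 131
# 3 nodes (50 ns/day) versus 6 nodes (70 ns/day)
efficiency 144 50 288 70
```

Both comparisons come out at roughly 70% of linear scaling for a doubling of the core count.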
Requesting 100 OpenMP threads returns the following error (10000+ cores):
100 OpenMP threads were requested. Since the non-bonded force buffer reduction
is prohibitively slow with more than 64 threads, we do not allow this. Use 64
or less OpenMP threads.
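
Given the limit quoted in that error message, one simple approach is to consider only thread counts that stay at or below 64 and divide the cores of a node evenly. The loop below is a sketch under that assumption; the variable names are hypothetical, and the 48 cores per node are taken from the original question.

```shell
CORES_PER_NODE=48
MAX_OMP=64   # limit quoted in the mdrun error message

VALID=""
t=1
while [ "$t" -le "$CORES_PER_NODE" ] && [ "$t" -le "$MAX_OMP" ]; do
    # Keep only thread counts that split a node into whole ranks
    if [ $((CORES_PER_NODE % t)) -eq 0 ]; then
        VALID="$VALID $t"
    fi
    t=$((t + 1))
done
echo "candidate -ntomp values:$VALID"
```

Each candidate can then be benchmarked with a short run; the list says nothing about which value is fastest.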

Excuse me, I’m stuck on this question and need to know whether it can be solved.
I need to run a replica exchange (REMD) simulation with 36 replicas on one laptop that has an i7 processor (6 cores) and an RTX 2060. Is this suitable for running in parallel, or will I need a server or workstation? I would appreciate any answer.
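
As a rough way to size this, the arithmetic below assumes the replicas are run as a multi-simulation with gmx_mpi and one MPI rank per replica, which is the minimum such a run needs. The variable names are made up for illustration, and the script only prints the resulting oversubscription, it does not start anything.

```shell
REPLICAS=36
CORES=6
RANKS=$REPLICAS   # assumption: one MPI rank per replica, the minimum

# Ceiling division: how many ranks end up sharing one core
RANKS_PER_CORE=$(( (RANKS + CORES - 1) / CORES ))
echo "$RANKS ranks on $CORES cores: up to $RANKS_PER_CORE ranks share each core"
```

With these numbers every core would be shared by several replicas, and all of them would also share the single GPU, so the run would likely be far slower than on a machine with at least one core per replica.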