Why does performance decrease when I submit multiple jobs in parallel?

GROMACS version:2020.1
GROMACS modification: Yes/No
Dear GROMACS users,
I am benchmarking GROMACS 2020.1 on a simulation system of 38706 atoms. My server has 2 nodes with 256 cores each, so 512 cores in total.
I am running the benchmark on one node only (256 cores). When I use 16 of the 256 cores (with ntomp=1,2,4 and ntmpi=16,8,4 respectively), the performance is approximately 40 ns/day. Since the node has 256 cores, I submitted 16 identical jobs to occupy all of them. I used the -pin, -pinoffset and -pinstride options to make sure that each job runs on separate cores. I also checked this manually by submitting one job at a time and confirming that each job occupies different cores (e.g. job 1 runs on cores 1-8 and 129-136, job 2 runs on cores 17-24 and 137-144, and so on, because pinstride was 1). Despite all these checks, when I submit all 16 jobs together to occupy the 256 cores, the performance of each job drops drastically, from ~40 ns/day to ~17 ns/day.

I need your help to figure out why this is happening. What is the best way to submit 16 such jobs in parallel so that I get optimum performance?

Commands I am executing:
job 1:
mpirun -np 16 /apps/gromacs/gromacs-2020.1/build_mpi/bin/gmx_mpi mdrun -s …/…/…/…/218K-8mics_1000ps_6nm-2fs.tpr -v -deffnm test1 -nsteps 1000 -ntomp 1 -pin on -pinoffset 0 -pinstride 1

job 2:
mpirun -np 16 /apps/gromacs/gromacs-2020.1/build_mpi/bin/gmx_mpi mdrun -s …/…/…/…/218K-8mics_1000ps_6nm-2fs.tpr -v -deffnm test1 -nsteps 1000 -ntomp 1 -pin on -pinoffset 16 -pinstride 1

job 3:
mpirun -np 16 /apps/gromacs/gromacs-2020.1/build_mpi/bin/gmx_mpi mdrun -s …/…/…/…/218K-8mics_1000ps_6nm-2fs.tpr -v -deffnm test1 -nsteps 1000 -ntomp 1 -pin on -pinoffset 32 -pinstride 1
and so on
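
To spell out the "and so on": each job's -pinoffset is 16 times the number of jobs before it. Below is a rough sketch of that pattern, not the exact script I run. It only prints the command lines rather than launching them, the helper name pin_offset and the variable names are made up for illustration, and TPR is a placeholder because the real path is abbreviated above.

```shell
#!/bin/sh
# Sketch only: prints the 16 mdrun command lines instead of launching them.
# CORES_PER_JOB and NJOBS restate the numbers from the post.
CORES_PER_JOB=16
NJOBS=16
GMX=/apps/gromacs/gromacs-2020.1/build_mpi/bin/gmx_mpi
TPR=topol.tpr   # placeholder (assumption); the real .tpr path is abbreviated above

# Hypothetical helper: -pinoffset for job number $1 (jobs counted from 1).
pin_offset() {
    echo $(( ($1 - 1) * $CORES_PER_JOB ))
}

job=1
while [ "$job" -le "$NJOBS" ]; do
    # Same options as in the commands above; only -pinoffset changes per job.
    echo "mpirun -np $CORES_PER_JOB $GMX mdrun -s $TPR -v -deffnm test1 -nsteps 1000 -ntomp 1 -pin on -pinoffset $(pin_offset "$job") -pinstride 1"
    job=$((job + 1))
done
```

With these numbers the offsets run 0, 16, 32, ... up to 240 for job 16, so the 16 jobs together cover hardware threads 0-255.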