Gromacs 2023 installation issue, slower run with gmx_mpi on multiple nodes

GROMACS version: 2023
GROMACS modification: No

I am trying to install GROMACS 2023 on an HPC system and want to configure it correctly to get the highest performance possible. Each node has dual Intel Xeon Gold Skylake 6154 processors (3.0 GHz, 18 cores each) and dual NVIDIA Tesla V100 PCIe 16 GB accelerators. The installation succeeded, but performance drops from 656 ns/day (coarse-grained system) on one full node to 350 ns/day on two nodes. The command I used for the CPU-only run is:

srun gmx_mpi mdrun -v -deffnm md
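For completeness, the Slurm settings behind that srun call look roughly like this (partition and account lines omitted; the rank/thread split shown is one reasonable choice for 36-core nodes, not necessarily the one GROMACS chose automatically):

```shell
# Hypothetical batch-script sketch for 2 nodes of dual 18-core CPUs (36 cores/node).
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=9     # 9 MPI ranks per node (assumed split)
#SBATCH --cpus-per-task=4       # 4 OpenMP threads per rank

export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
# -ntomp, -pin and -dlb are standard mdrun options; explicit thread counts and
# pinning can help when diagnosing load imbalance across nodes.
srun gmx_mpi mdrun -v -deffnm md -ntomp ${SLURM_CPUS_PER_TASK} -pin on -dlb yes
```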

The correct numbers of MPI processes and OpenMP threads are used. One issue is a large load imbalance (65%) and long communication wait times on two nodes. The same system reaches about 1000 ns/day with GROMACS 2022 on two nodes, with very low communication wait times, so I suspect my installation has some issues. Before I go into the details of my installation process, here is the output of gmx_mpi --version.

(I changed some paths to hide the HPC system I am using)

GROMACS version:    2023
Precision:          mixed
Memory model:       64 bit
MPI library:        MPI
OpenMP support:     enabled (GMX_OPENMP_MAX_THREADS = 128)
GPU support:        CUDA
NB cluster size:    8
SIMD instructions:  AVX_512
CPU FFT library:    fftw-3.3.10-sse2-avx-avx2-avx2_128-avx512
GPU FFT library:    cuFFT
Multi-GPU FFT:      none
RDTSCP usage:       enabled
TNG support:        enabled
Hwloc support:      disabled
Tracing support:    disabled
C compiler:       /usr/local/spack/opt/spack/linux-centos7-x86_64/gcc-4.8.5/gcc-10.1.0-iw6p5hcjkqdddphuodu6abtqifbaqzu2/bin/gcc GNU 10.1.0
C compiler flags:   -fexcess-precision=fast -funroll-all-loops -mavx512f -mfma -mavx512vl -mavx512dq -mavx512bw -Wno-missing-field-initializers -O3 -DNDEBUG
C++ compiler:       /usr/local/spack/opt/spack/linux-centos7-x86_64/gcc-4.8.5/gcc-10.1.0-iw6p5hcjkqdddphuodu6abtqifbaqzu2/bin/g++ GNU 10.1.0
C++ compiler flags: -fexcess-precision=fast -funroll-all-loops -mavx512f -mfma -mavx512vl -mavx512dq -mavx512bw -Wno-missing-field-initializers -Wno-cast-function-type-strict -fopenmp -O3 -DNDEBUG
BLAS library:       External - detected on the system
LAPACK library:     External - detected on the system
CUDA compiler:      /usr/local/cuda/11.2/bin/nvcc nvcc: NVIDIA (R) Cuda compiler driver;Copyright (c) 2005-2020 NVIDIA Corporation;Built on Mon_Nov_30_19:08:53_PST_2020;Cuda compilation tools, release 11.2, V11.2.67;Build cuda_11.2.r11.2/compiler.29373293_0
CUDA compiler flags:-std=c++17;--generate-code=arch=compute_35,code=sm_35;--generate-code=arch=compute_37,code=sm_37;--generate-code=arch=compute_50,code=sm_50;--generate-code=arch=compute_52,code=sm_52;--generate-code=arch=compute_60,code=sm_60;--generate-code=arch=compute_61,code=sm_61;--generate-code=arch=compute_70,code=sm_70;--generate-code=arch=compute_75,code=sm_75;--generate-code=arch=compute_80,code=sm_80;--generate-code=arch=compute_86,code=sm_86;-Wno-deprecated-gpu-targets;--generate-code=arch=compute_53,code=sm_53;--generate-code=arch=compute_80,code=sm_80;-use_fast_math;-Xptxas;-warn-double-usage;-Xptxas;-Werror;;-fexcess-precision=fast -funroll-all-loops -mavx512f -mfma -mavx512vl -mavx512dq -mavx512bw -Wno-missing-field-initializers -Wno-cast-function-type-strict -fopenmp -O3 -DNDEBUG
CUDA driver:        0.0
CUDA runtime:       N/A

I do not have sudo access on the HPC, so I had to install several packages locally. I load the system-installed gcc-10.1.0 and cuda-11.2, and built cmake-3.19.5, gdrcopy-2.4, fftw-3.3.10, ucx-1.15.0, and openmpi-4.1.2 myself.
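Roughly, my build environment looks like this before compiling (the module names are paraphrased and system-specific):

```shell
# Sketch of the environment used for all builds below; exact module names differ per system.
module load gcc/10.1.0 cuda/11.2
export PATH=/home/soft/cmake-3.19.5/bin:/home/soft/openmpi/4.1.2/bin:$PATH
export LD_LIBRARY_PATH=/home/soft/openmpi/4.1.2/lib:/home/soft/ucx-1.15.0/lib:$LD_LIBRARY_PATH
```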

After installing cmake, I built gdrcopy-2.4:

make prefix=/home/soft/gdrcopy-2.4 CUDA=/usr/local/cuda/11.2 all install
./insmod.sh (requires sudo access, but I do not recall how I managed to run it locally; perhaps this is the issue)

Install ucx 1.15.0

./configure --prefix=/home/soft/ucx-1.15.0 --with-cuda=/usr/local/cuda/11.2 --with-gdrcopy=/usr --enable-mt
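One check I can add, in case it helps: my understanding is that ucx_info should list CUDA- and gdrcopy-related transports if UCX actually picked them up at configure time:

```shell
# List the transports this UCX build supports; cuda_copy/cuda_ipc and gdr_copy
# should appear if CUDA and gdrcopy were detected during configure.
/home/soft/ucx-1.15.0/bin/ucx_info -d | grep -i -E 'cuda|gdr'
```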
Install openmpi 4.1.2

./configure --prefix=/home/soft/openmpi/4.1.2 --with-cuda=/usr/local/cuda/11.2 --with-ucx=/home/soft/ucx-1.15.0
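To verify that this Open MPI build actually sees UCX and CUDA, I believe ompi_info can be queried like this (parameter names may vary slightly between 4.1.x releases):

```shell
# pml/ucx should be listed among the components if --with-ucx took effect.
/home/soft/openmpi/4.1.2/bin/ompi_info | grep -i ucx
# Reports whether this build was compiled with CUDA support.
/home/soft/openmpi/4.1.2/bin/ompi_info --parsable --all | grep mpi_built_with_cuda_support:value
```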

Install fftw

./configure --enable-sse --enable-sse2 --enable-avx --enable-avx2 --enable-avx512 --prefix=/home/soft/fftw-3.3.10 --enable-float --enable-threads
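Since GROMACS in mixed precision links against the single-precision FFTW library, a quick sanity check is that libfftw3f actually ended up at the path my cmake line points to (the path below is taken from that line, not independently verified):

```shell
# GROMACS mixed precision needs the single-precision (float) FFTW library.
ls /home/soft/fftw-3.3.10/usr/local/lib/libfftw3f*
```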

Install gromacs

cmake .. -DREGRESSIONTEST_DOWNLOAD=ON -DCMAKE_C_COMPILER=gcc -DCMAKE_CXX_COMPILER=g++ -DGMX_GPU=CUDA -DGMX_MPI=ON -DGMX_FFT_LIBRARY=fftw3 -DCMAKE_PREFIX_PATH=/home/soft/gromacs2023 -DFFTWF_INCLUDE_DIR=/home/soft/fftw-3.3.10/usr/local/include -DFFTWF_LIBRARY=/home/soft/fftw-3.3.10/usr/local/lib/libfftw3f.so

make
make check
make DESTDIR=/home/soft/gromacs2023 install -j4
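As an aside, I am not sure my cmake invocation uses the right variables: CMAKE_PREFIX_PATH is normally the search path for dependencies, while the install destination is usually given as CMAKE_INSTALL_PREFIX (rather than make DESTDIR=...). An untested sketch of what I believe the equivalent configure line would be:

```shell
# Sketch only: CMAKE_INSTALL_PREFIX sets the install location directly, and
# CMAKE_PREFIX_PATH points cmake at the dependency trees (fftw, openmpi).
cmake .. -DREGRESSIONTEST_DOWNLOAD=ON \
  -DCMAKE_C_COMPILER=gcc -DCMAKE_CXX_COMPILER=g++ \
  -DGMX_GPU=CUDA -DGMX_MPI=ON -DGMX_FFT_LIBRARY=fftw3 \
  -DCMAKE_INSTALL_PREFIX=/home/soft/gromacs2023 \
  -DCMAKE_PREFIX_PATH="/home/soft/fftw-3.3.10/usr/local;/home/soft/openmpi/4.1.2"
```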

Please let me know if you need any more data to help me debug this performance issue.