Mdrun speed is very low

GROMACS version: 2022.4
GROMACS modification: Yes/No
Dear GROMACS users and developers, below are my commands for running a simulation. The simulation is running very slowly. Can anyone please suggest the reason for this problem and a possible solution?

architecture of installation
gmx_mpi mdrun -version

GROMACS version: 2022.4
Precision: mixed
Memory model: 64 bit
MPI library: MPI (CUDA-aware)
OpenMP support: enabled (GMX_OPENMP_MAX_THREADS = 128)
GPU support: CUDA
SIMD instructions: AVX2_256
CPU FFT library: fftw-3.3.8-sse2-avx-avx2-avx2_128
GPU FFT library: cuFFT
RDTSCP usage: enabled
TNG support: enabled
Hwloc support: disabled
Tracing support: disabled
C compiler: /usr/bin/cc GNU 8.5.0
C compiler flags: -mavx2 -mfma -Wno-missing-field-initializers -fexcess-precision=fast -funroll-all-loops -pthread -O3 -DNDEBUG
C++ compiler: /usr/bin/c++ GNU 8.5.0
C++ compiler flags: -mavx2 -mfma -pthread -Wno-missing-field-initializers -fexcess-precision=fast -funroll-all-loops -pthread -fopenmp -O3 -DNDEBUG
CUDA compiler: /usr/local/cuda-11.6/bin/nvcc nvcc: NVIDIA (R) Cuda compiler driver;Copyright (c) 2005-2022 NVIDIA Corporation;Built on Tue_Mar__8_18:18:20_PST_2022;Cuda compilation tools, release 11.6, V11.6.124;Build cuda_11.6.r11.6/compiler.31057947_0
CUDA compiler flags: -std=c++17;--generate-code=arch=compute_35,code=sm_35;--generate-code=arch=compute_37,code=sm_37;--generate-code=arch=compute_50,code=sm_50;--generate-code=arch=compute_52,code=sm_52;--generate-code=arch=compute_60,code=sm_60;--generate-code=arch=compute_61,code=sm_61;--generate-code=arch=compute_70,code=sm_70;--generate-code=arch=compute_75,code=sm_75;--generate-code=arch=compute_80,code=sm_80;--generate-code=arch=compute_86,code=sm_86;-Wno-deprecated-gpu-targets;--generate-code=arch=compute_53,code=sm_53;--generate-code=arch=compute_80,code=sm_80;-use_fast_math;-D_FORCE_INLINES;-mavx2 -mfma -pthread -Wno-missing-field-initializers -fexcess-precision=fast -funroll-all-loops -pthread -fopenmp -O3 -DNDEBUG
CUDA driver: 11.60
CUDA runtime: 11.60

commands
gmx_mpi -quiet grompp -f md.mdp -c npt_confout.gro -n index.ndx -p npt_processed.top -t npt_state.cpt -po md_mdout.mdp -pp md_processed.top -o md.tpr -maxwarn 3
gmx_mpi -quiet mdrun -s md.tpr -mp md_processed.top -mn index.ndx -o md_traj.trr -x md_traj_comp.xtc -cpo md_state.cpt -c md_confout.gro -e md_ener.edr -g md_md.log -xvg xmgrace -nb gpu &>> all_output.txt
gmx_mpi -quiet check -f md_traj_comp.xtc -s1 md.tpr -c md_confout.gro -e md_ener.edr -n index.ndx -m doc.tex &>> check.txt

output

1 GPU selected for this run.
Mapping of GPU IDs to the 2 GPU tasks in the 1 rank on this node:
PP:0,PME:0
PP tasks will do (non-perturbed) short-ranged interactions on the GPU
PP task will update and constrain coordinates on the CPU
PME tasks will do all aspects on the GPU
Using 1 MPI process

Non-default thread affinity set, disabling internal thread affinity

Using 40 OpenMP threads

starting mdrun ‘Generic title’
5000000 steps, 10000.0 ps.

Writing final coordinates.

               Core t (s)   Wall t (s)        (%)
       Time: 2112165.704    52804.182     4000.0
                         14h40:04
                 (ns/day)    (hour/ns)
Performance:       16.362        1.467
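As a sanity check (assuming the 2 fs timestep implied by 5,000,000 steps covering 10,000 ps), the reported throughput follows directly from the wall time:

```shell
# 10 ns of simulated time over 52804.182 s of wall time:
# 10 ns * 86400 s/day / 52804.182 s, which matches the "Performance" line.
awk 'BEGIN { printf "%.3f ns/day\n", 10.0 * 86400 / 52804.182 }'
```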

Hi,

What kind of speed are you expecting? How many atoms are in your system, and what simulation settings are you using? 16 ns/day isn’t necessarily bad on a single node with a GPU; it depends on a lot of things.

You can check this page for some general advice on improving performance:
https://manual.gromacs.org/2022-current/user-guide/mdrun-performance.html

Petter

Thank you so much for your reply. The issue is that I ran the same complex system earlier on the same workstation with the same MD settings, and it completed the same 10 ns in about one and a half hours. Now it is taking almost a day for 10 ns. The main difference from the previous run is that earlier I did not use a checkpoint file as input. So is it possible that this roughly 25-fold reduction in speed comes from giving the checkpoint file as input, or from checkpoint creation/updating running on the CPU rather than the GPU?

I see, a 25x performance decrease is indeed massive (to the extent that I’d suspect the CPU/GPU are already busy during your second job, or some other hardware issue is at play). Using the checkpoint shouldn’t affect the performance at all, unless you’ve encountered some very esoteric bug. I’d be very surprised if putting the update on the CPU instead of the GPU made such a massive difference, but you can test this and other settings with short runs to see whether the performance changes.

The best way to test is to run mdrun with the flags -noconfout -resethway -nsteps 20000 plus any other settings you want to test. 20000 steps is enough to warm up the system, and -resethway resets the performance counters halfway through so you get steady-state values.
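A short benchmark along those lines might look like this (a sketch based on the tpr from the commands above; `bench.log` is just an illustrative log name):

```shell
# Short steady-state benchmark: -noconfout skips writing final coordinates,
# -resethway resets the performance counters halfway through the run.
gmx_mpi -quiet mdrun -s md.tpr -nsteps 20000 -noconfout -resethway -g bench.log
```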

You can change whether certain tasks are done on the CPU or GPU with the -update, -pme, -pmefft and -bonded flags. So, to check whether the update on CPU vs GPU changes the performance run one test with -update gpu and one with -update cpu, etc.
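For example, the update-placement comparison could be scripted as a small sweep (a sketch; the benchmark log names are made up):

```shell
# Run the same short benchmark with the coordinate update on the CPU and on
# the GPU, writing one log per setting, then compare the performance lines.
for upd in cpu gpu; do
    gmx_mpi -quiet mdrun -s md.tpr -nsteps 20000 -noconfout -resethway \
            -nb gpu -update "$upd" -g bench_update_"$upd".log
done
grep -H "Performance:" bench_update_cpu.log bench_update_gpu.log
```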

It could also be useful to check the details of your performance: see whether the accounting report (at the end of the md.log file after your runs) shows any significant changes between your fast and slow runs.
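One way to compare the two reports side by side (assuming the fast run's log was kept; `md_fast.log` is an illustrative name, and `md_md.log` is the log from the mdrun command above):

```shell
# The accounting table in md.log starts with this spaced-out header;
# print it plus the rows that follow, from both logs.
grep -A 30 "R E A L   C Y C L E" md_fast.log md_md.log
```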

Edit: In fact, if you could post your log files from the fast and slow runs, that would be very helpful (upload them to Pastebin or GitHub or whatnot).

Petter