Simulation time drastically slows down midway through

GROMACS version: 2023.3 with CUDA 12.3
GROMACS modification: No

Context:
I use GROMACS 2023.3 with CUDA 12.3 on my computer with a single RTX 4090.
I simulate bilayer assembly processes for a few hundred nanoseconds; the tasks are relatively computationally demanding.

Problem:
When I run the simulation (with the “gmx mdrun” command; see the example below), the initial estimated time to completion is about 1-2 days, which is reasonable: my machine completes roughly 1000 timesteps every 1-2 seconds. However, after about 8-16 hours of running, the run slows down dramatically, to only about 100 steps every 10 seconds, and the estimated time to completion grows to 7+ days. In short, my runs become roughly two orders of magnitude slower midway through, and this has happened for every long simulation I’ve run. Example command:

gmx mdrun -s test.tpr -v -x test.xtc -c test.gro -nb gpu -bonded gpu -pme gpu

Questions:
Is this normal in GROMACS, i.e., significantly reduced run speeds midway through? If it is abnormal, do you have any advice for diagnosing the problem? My first hypothesis was that my GPU is thermal throttling (at ~86C) and being clocked down midway through, though when I check the GPU temperature at the beginning of the run it sits at about 60-70C. Once the run slows down significantly, the GPU temperature drops to 30-40C, suggesting minimal GPU activity from that point on.

Any insight would be great, thanks!

For more context, I tracked my GPU temperature and power draw over time during a simulation. By ~4 hours into the simulation, the estimated time to completion was about 1.5 days; by ~20 hours in, it had grown to an additional 4 days, i.e., the run had become much slower.

As can be seen in the plots below, both my GPU’s temperature and its power draw drop off by about 10 hours into the simulation.
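For anyone wanting to collect the same kind of log, something like the following nvidia-smi command works (the 60-second interval and the log file name here are arbitrary choices for illustration, not necessarily what I used):

nvidia-smi --query-gpu=timestamp,temperature.gpu,power.draw,clocks.sm,utilization.gpu --format=csv -l 60 > gpu_log.csv

Running nvidia-smi -q -d PERFORMANCE while the run is slow also shows the driver’s reported throttle reasons, which helps distinguish thermal throttling from the GPU simply sitting idle.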

Has anyone seen something like this before?

Could this be related to my need to use the -update gpu option in my mdrun command? (I didn’t include it initially because the Nose-Hoover thermostat causes issues with the -update gpu flag.)

That is not a high temperature. You need to check various factors to find the reason for the reduced performance. Are there any other calculations or other workloads running on the machine?

I resolved this issue.

Here’s the solution:

I edited my mdrun command to include -update gpu. Thus, the full command is as follows:

gmx mdrun -s test.tpr -v -x test.xtc -c test.gro -nb gpu -bonded gpu -pme gpu -update gpu -nstlist 400

This aligns with the suggestions from this NVIDIA blog.

I initially did not include the -update gpu argument because I was using the Nose-Hoover thermostat, which does not work with -update gpu. Because I was not committed to Nose-Hoover for any particular reason besides aligning with prior studies, I have opted for the v-rescale thermostat in my .mdp file (as suggested here); v-rescale works with the -update gpu argument in mdrun.
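For completeness, the temperature-coupling block in my .mdp now looks something like this (the coupling group, time constant, and reference temperature below are illustrative placeholders, not necessarily my exact settings):

tcoupl   = V-rescale
tc-grps  = System
tau-t    = 1.0
ref-t    = 310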

The simulation now runs in the expected time, i.e., ~30 hours. Matter resolved.