GROMACS version: 2023.3 with CUDA 12.3
GROMACS modification: No
Context:
I use GROMACS 2023.3 with CUDA 12.3 on my computer with a single RTX 4090.
I simulate bilayer assembly processes for a few hundred nanoseconds; the tasks are relatively computationally demanding.
Problem:
When I run the simulation (with the “gmx mdrun” command; see example below), the initial estimated time to completion is about 1-2 days, which is reasonable. My computer completes about 1000 timesteps every 1-2 seconds. However, after about 8-16 hours of running, the run slows down significantly: only about 100 steps are processed every 10 seconds, and the estimated time to completion increases to 7+ days. In short, throughput falls from roughly 500-1000 steps per second to about 10 steps per second, i.e., my runs become 50-100 times slower midway through. This has happened in every long simulation I’ve run.
Example command:
Questions:
Is it normal in GROMACS for the run speed to drop significantly midway through a simulation? If it is abnormal, do you have any advice on how to diagnose the problem? My first hypothesis was that my GPU is thermal throttling (at 86 °C) and being shut down midway through. However, when I check the GPU temperature at the beginning of the run, it is about 60-70 °C, and once the run slows down significantly, the GPU temperature decreases to 30-40 °C, suggesting that the GPU is doing very little work by that point.
For more context, I tracked my GPU temperature and power draw over time during a simulation. By ~4 hours into the simulation, the estimated time to completion was about 1.5 days. By ~20 hours into the simulation, the estimated time to completion was an additional 4 days, i.e., much slower.
As can be seen in the plots below, both GPU temperature and power draw drop at about 10 hours into the simulation.
Has anyone seen something like this before?
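For anyone who wants to reproduce this kind of temperature and power tracking, here is a minimal Python sketch. It assumes the GPU was logged with nvidia-smi in CSV form (the command is given in the comment); the file layout, the function names parse_gpu_log and find_power_drop, and the 50% threshold are illustrative choices of mine, not something taken from the original run.

```python
import csv
from statistics import mean

# Assumed logging command, run in a second terminal while mdrun is active:
#   nvidia-smi --query-gpu=timestamp,temperature.gpu,power.draw,utilization.gpu \
#              --format=csv,noheader,nounits -l 60 > gpu_log.csv


def parse_gpu_log(lines):
    """Parse rows of 'timestamp, temp_C, power_W, util_pct' into dicts."""
    samples = []
    for row in csv.reader(lines, skipinitialspace=True):
        if len(row) != 4:
            continue  # skip blank or malformed lines
        try:
            samples.append({
                "time": row[0].strip(),
                "temp_c": float(row[1]),
                "power_w": float(row[2]),
                "util_pct": float(row[3]),
            })
        except ValueError:
            continue  # skip rows with non-numeric fields such as "[N/A]"
    return samples


def find_power_drop(samples, window=5, fraction=0.5):
    """Return the first sample that starts a run of `window` samples whose
    mean power is below `fraction` of the mean over the first `window`
    samples. Returns None if no such drop is found."""
    if len(samples) < 2 * window:
        return None
    baseline = mean(s["power_w"] for s in samples[:window])
    for i in range(window, len(samples) - window + 1):
        recent = mean(s["power_w"] for s in samples[i:i + window])
        if recent < fraction * baseline:
            return samples[i]
    return None
```

Typical use would be something like find_power_drop(parse_gpu_log(open("gpu_log.csv"))). The timestamp of the returned sample can then be compared against the step timestamps in the mdrun log to see whether the GPU went idle at the moment the step rate collapsed.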
Could this be related to my not using the -update gpu option in my mdrun command? (I didn’t use it initially because the Nose-Hoover thermostat causes issues with the -update flag.)
That is not a high temperature. You need to check several factors to find the reason for the drop in performance. Are any other calculations or other jobs running on the machine at the same time?
This aligns with the suggestions from this NVIDIA blog.
I initially did not include the -update gpu argument because I was using the Nose-Hoover thermostat, which for whatever reason does not work with -update gpu. Because I was not committed to Nose-Hoover for any particular reason besides aligning with prior studies, I have opted to use the v-rescale thermostat in my .mdp file (as suggested here). v-rescale works with the -update gpu argument in mdrun.
The simulation now runs in the expected time, i.e., ~30 hours, so the matter is resolved.
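As a small illustration of the thermostat check behind this fix, here is a hedged Python sketch that reads the tcoupl setting from .mdp text and flags the Nose-Hoover case reported above as incompatible with -update gpu. The helper names read_mdp_options and blocks_gpu_update are hypothetical, and the check encodes only what was observed in this thread, not the full list of GPU-update restrictions in GROMACS.

```python
def read_mdp_options(text):
    """Parse 'key = value' lines from .mdp text; ';' starts a comment."""
    options = {}
    for line in text.splitlines():
        line = line.split(";", 1)[0].strip()
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        # Option names are normalised so that tau_t and tau-t map to one key.
        options[key.strip().lower().replace("_", "-")] = value.strip()
    return options


def blocks_gpu_update(options):
    """Return True if tcoupl is Nose-Hoover, which this thread reports
    as not working with 'gmx mdrun -update gpu'."""
    # The value is normalised leniently (case, '_' vs '-'); this is an
    # assumption made for convenience here.
    tcoupl = options.get("tcoupl", "no").lower().replace("_", "-")
    return tcoupl == "nose-hoover"
```

With a file containing "tcoupl = nose-hoover" the check returns True; after changing the line to "tcoupl = v-rescale" it returns False, which corresponds to the setup that then ran with -update gpu.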