GROMACS version: 2022.5
GROMACS modification: No
I encountered an ‘OUT_OF_MEMORY’ error with the following job details for an MD simulation with ~250,000 particles:
Nodes: 16
Cores per node: 64
CPU Utilized: 310-17:42:30
CPU Efficiency: 97.49% of 318-17:50:56 core-walltime
Job Wall-clock time: 07:28:14
Memory Utilized: 5.31 TB (estimated maximum)
Memory Efficiency: 392.86% of 1.35 TB (86.43 GB/node)
I’m interested in understanding why the simulation required such an excessive amount of RAM and what steps I can take to optimize memory usage in future GROMACS simulations on this cluster.
The workflow (solvation, ions, energy minimization, …) up to the MD run is very similar to the GROMACS lysozyme-in-water tutorial.
Any insights or recommendations would be greatly appreciated.
We can’t say much without more information about your setup. But it is strange that you can run energy minimization but not MD. What changes did you make in the mdp parameters between EM and MD?
As the solvent model I used TIP4P in a dodecahedron box (-c -d 1.0 -bt dodecahedron).
EM.mdp is the same as in the lysozyme-in-water tutorial:
emtol = 1000.0
emstep = 0.01
nsteps = 50000
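For completeness, the minimization block then amounts to the following (the integrator line is the tutorial's steepest-descent setting and is only added here for context; the rest is as quoted above):

integrator = steep    ; steepest-descent energy minimization
emtol      = 1000.0   ; stop when the maximum force drops below 1000 kJ/mol/nm
emstep     = 0.01     ; initial step size in nm
nsteps     = 50000    ; upper limit on minimization steps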
MD.mdp changes:
nsteps = 500000000 (1 μs)
dt = 0.002
continuation = no
gen_vel = yes
gen_seed = -1
For the rest of the MD.mdp file I used the default values (the changed lines are collected below).
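Spelled out with comments (the integrator line is implied by this being a production MD run and is added here for context; everything not listed stays at the GROMACS defaults):

integrator   = md          ; leap-frog MD integrator
dt           = 0.002       ; 2 fs per step
nsteps       = 500000000   ; 500,000,000 steps × 0.002 ps = 1,000 ns = 1 μs
continuation = no          ; not continuing from a previous run part
gen_vel      = yes         ; generate initial velocities
gen_seed     = -1          ; pseudo-random seed chosen at grompp time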
CPU: Intel Xeon Gold 6130 (S2/C16/T2, i.e. 2 sockets × 16 cores × 2 threads = 64 hardware threads per node)
In our group we also had this problem with different simulations as soon as 16 nodes are requested.
With 8 nodes the simulations run, but the performance is not satisfactory.
I hope this information is useful.
Thanks for the help
So the memory usage goes from 21 GB to 5.3 TB when doubling the number of nodes? Or do I misunderstand something? Such a change is very unlikely to come from GROMACS. It could be some hidden bug that suddenly triggers orders of magnitude more memory usage when doubling the number of nodes.
What are the last few lines in the log file of the run that goes out of memory?
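(Something like tail -n 50 md.log on the failing run, assuming -deffnm md as in the commands below, would already show how far it got, e.g. whether it crashed while setting up the domain decomposition and PME or later during the actual steps.)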
Yes, exactly. The problem is that I don't have any idea where it is coming from.
I already talked to our cluster support, but their only answer was to double the memory capacity for the next run, which won't help if we are talking about terabytes.
I don't have the log files anymore, but I am currently waiting for my SLURM job to start with the adapted command line MagnusL provided.
If I encounter the same problem again I can provide the log file.
The simulation with your provided command is running smoothly at 90 ns/day.
Thank you for your advice. But I still do not really understand the problem: can you explain on what grounds you decided to change the values of -n, -c and -ntomp?
Unfortunately I don’t have any specific grounds for that recommendation. It’s just that in my personal experience, when running on more than 3 or 4 nodes it has been more efficient not to keep increasing the total number of MPI tasks, i.e. to lower the number of tasks per node. I haven’t had the reported RAM issues, though, so it was simply that 512 MPI tasks sounded a bit high to me.
Edit: It’s quite possible that srun -n 128 -c 8 gmx_mpi mdrun -ntomp 8 -deffnm md would be worth trying, perhaps also srun -n 256 -c 4 gmx_mpi mdrun -ntomp 4 -deffnm md. You might see a difference in performance.
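For reference, a minimal SLURM batch script for the first variant could look roughly like this (partition, module name and wall time are placeholders to adapt to your cluster):

#!/bin/bash
#SBATCH --nodes=16
#SBATCH --ntasks=128            # 8 MPI ranks per node
#SBATCH --cpus-per-task=8       # 8 OpenMP threads per rank
#SBATCH --time=24:00:00         # placeholder wall time

module load gromacs             # or however GROMACS 2022.5 is provided on your system
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}

srun -n 128 -c 8 gmx_mpi mdrun -ntomp 8 -deffnm md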