Memory issue with 2D AWH and angle

GROMACS version: 2022.3, 2022
GROMACS modification: No

I ran a 2D AWH simulation on Piz Daint and encountered this error:

slurmstepd: error: Detected 1 oom-kill event(s) in StepId=44077037.0 cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.

At first I thought this had to do with the recent Piz Daint update. It did not: I ran the same job on my local workstation and got this error:

Killed

It was killed with no warning and no segmentation fault, so I have nothing that I can use to diagnose the failure of this job myself.

To help you with this, here is the pull and AWH section of my mdp file.

pull-group1-name      = S349_A
pull-group2-name      = S349_C
pull-group3-name      = S349_B
pull-group4-name      = S349_D
pull_group5_name      = POX
pull-group6-name       = 361_C_A
pull-group7-name       = 362_N_A
pull-group8-name       = 362_CA_A
pull-group9-name       = 362_C_A
pull-group10-name      = 363_N_A

pull-group11-name      = 361_C_B
pull-group12-name      = 362_N_B
pull-group13-name      = 362_CA_B
pull-group14-name      = 362_C_B
pull-group15-name      = 363_N_B

pull-group16-name      = 361_C_C
pull-group17-name      = 362_N_C
pull-group18-name      = 362_CA_C
pull-group19-name      = 362_C_C
pull-group20-name      = 363_N_C

pull-group21-name      = 361_C_D
pull-group22-name      = 362_N_D
pull-group23-name      = 362_CA_D
pull-group24-name      = 362_C_D
pull-group25-name      = 363_N_D

pull-coord1-groups          = 1 2
pull-coord1-geometry        = distance
pull-coord2-groups          = 3 4
pull-coord2-geometry        = distance

pull-coord3-geometry        = dihedral
pull-coord3-groups          = 8 6 6 7 7 9

pull-coord4-geometry        = dihedral
pull-coord4-groups          = 9 7 7 8 8 10

pull-coord5-geometry        = dihedral
pull-coord5-groups          = 13 11 11 12 12 14

pull-coord6-geometry        = dihedral
pull-coord6-groups          = 14 12 12 13 13 15

pull-coord7-geometry        = dihedral
pull-coord7-groups          = 18 16 16 17 17 19

pull-coord8-geometry        = dihedral
pull-coord8-groups          = 19 17 17 18 18 20

pull-coord9-geometry        = dihedral
pull-coord9-groups          = 23 21 21 22 22 24

pull-coord10-geometry        = dihedral
pull-coord10-groups          = 24 22 22 23 23 25

pull-coord11-geometry      = transformation
pull-coord11-groups         = []
pull-coord11-type                = external-potential
pull-coord11-potential-provider  = AWH
pull-coord11-expression = 0.5*x1+0.5*x2
;
pull-coord12-geometry            = transformation
pull-coord12-groups              = []
pull-coord12-type                = external-potential
pull-coord12-potential-provider  = AWH
pull-coord12-expression          = (x3+x4+x5+x6+x7+x8+x9+x10)/8
;
awh                      = yes
awh-potential            = convolved
awh-share-multisim       = yes
awh-nbias                = 1
awh-nstout               = 50000
awh1-ndim                    = 2
awh1-equilibrate-histogram   = yes
awh1-target                  = constant
awh1-share-group             = 1
awh1-dim1-coord-index        = 11
awh1-dim2-coord-index        = 12
awh1-dim1-start              =  0.60
awh1-dim1-end                =  2.50
awh1-dim1-force-constant     =  20000
awh1-dim1-diffusion          =  0.0002
awh1-dim1-cover-diameter     =  0.1
awh1-dim2-start          = -180
awh1-dim2-end            = 180
awh1-dim2-diffusion      = 2e-4
awh1-dim2-force-constant  = 12800

This is the content of my index file:

[ S349_A ]
3936 3937 3938 3939 3940 3941 3942 3943 3944 3945 3946
[ S349_B ]
9809 9810 9811 9812 9813 9814 9815 9816 9817 9818 9819
[ S349_C ]
15682 15683 15684 15685 15686 15687 15688 15689 15690 15691 15692
[ S349_D ]
21555 21556 21557 21558 21559 21560 21561 21562 21563 21564 21565
[ r_361_&_C ]
4153 10026 15899 21772
[ r_362_&_N ]
4155 10028 15901 21774
[ r_362_&_CA ]
4157 10030 15903 21776
[ r_362_&_C ]
4175 10048 15921 21794
[ r_363_&_N ]
4177 10050 15923 21796
[ 361_C_A ]
4153
[ 361_C_B ]
10026
[ 361_C_C ]
15899
[ 361_C_D ]
21772
[ 362_N_A ]
4155
[ 362_N_B ]
10028
[ 362_N_C ]
15901
[ 362_N_D ]
21774
[ 362_CA_A ]
4157
[ 362_CA_B ]
10030
[ 362_CA_C ]
15903
[ 362_CA_D ]
21776
[ 362_C_A ]
4175
[ 362_C_B ]
10048
[ 362_C_C ]
15921
[ 362_C_D ]
21794
[ 363_N_A ]
4177
[ 363_N_B ]
10050
[ 363_N_C ]
15923
[ 363_N_D ]
21796

The grompp stage completes without problems. I have copied only the relevant part of the grompp output.

Pull group 1 'S349_A' has 11 atoms
Pull group 2 'S349_C' has 11 atoms
Pull group 3 'S349_B' has 11 atoms
Pull group 4 'S349_D' has 11 atoms
Pull group 5 'POX' has 1 atoms
Pull group 6 '361_C_A' has 1 atoms
Pull group 7 '362_N_A' has 1 atoms
Pull group 8 '362_CA_A' has 1 atoms
Pull group 9 '362_C_A' has 1 atoms
Pull group 10 '363_N_A' has 1 atoms
Pull group 11 '361_C_B' has 1 atoms
Pull group 12 '362_N_B' has 1 atoms
Pull group 13 '362_CA_B' has 1 atoms
Pull group 14 '362_C_B' has 1 atoms
Pull group 15 '363_N_B' has 1 atoms
Pull group 16 '361_C_C' has 1 atoms
Pull group 17 '362_N_C' has 1 atoms
Pull group 18 '362_CA_C' has 1 atoms
Pull group 19 '362_C_C' has 1 atoms
Pull group 20 '363_N_C' has 1 atoms
Pull group 21 '361_C_D' has 1 atoms
Pull group 22 '362_N_D' has 1 atoms
Pull group 23 '362_CA_D' has 1 atoms
Pull group 24 '362_C_D' has 1 atoms
Pull group 25 '363_N_D' has 1 atoms
Number of degrees of freedom in T-Coupling group PROT is 80758.73
Number of degrees of freedom in T-Coupling group MEMB is 111110.27
Number of degrees of freedom in T-Coupling group SOL_ION is 331359.00

Determining Verlet buffer for a tolerance of 0.005 kJ/mol/ps at 303.15 K

Calculated rlist for 1x1 atom pair-list as 1.285 nm, buffer size 0.085 nm

Set rlist, assuming 4x4 atom pair-list, to 1.210 nm, buffer size 0.010 nm

Note that mdrun will redetermine rlist based on the actual pair-list setup
Calculating fourier grid dimensions for X Y Z
Using a fourier grid of 80x80x112, spacing 0.149 0.149 0.148
Pull group  natoms  pbc atom  distance at start  reference at t=0
       1        11      3941
       2        11     15687       0.943 nm          0.000 nm
       3        11      9814
       4        11     21560       0.906 nm          0.000 nm
       8         1         0
       6         1         0     -35.520 deg          0.000 deg
       9         1         0
       7         1         0      22.639 deg          0.000 deg
      13         1         0
      11         1         0     -30.483 deg          0.000 deg
      14         1         0
      12         1         0      15.009 deg          0.000 deg
      18         1         0
      16         1         0     -35.782 deg          0.000 deg
      19         1         0
      17         1         0       9.406 deg          0.000 deg
      23         1         0
      21         1         0     -31.776 deg          0.000 deg
      24         1         0
      22         1         0      25.560 deg          0.000 deg
      24         1         0
      22         1         0       0.925 nm          0.000 nm
      24         1         0
      22         1         0      -0.133 nm          0.000 nm

Estimate for the relative computational load of the PME mesh part: 0.14

NOTE 6 [file awh-2d-new.mdp]:
  This run will generate roughly 5957 Mb of data
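
As a quick sanity check on the two transformation coordinates, the initial values in the pull table above can be recomputed by hand. The sketch below uses only numbers copied from this grompp output. It assumes that the last two rows of the table (0.925 and -0.133) belong to pull coordinates 11 and 12; that mapping, and the reading that the second value matches the mean dihedral expressed in radians, are inferences from the arithmetic and are not confirmed anywhere in this thread.

```python
import math

# Initial values copied from the grompp pull table above.
distances_nm = [0.943, 0.906]                      # pull coords 1 and 2
dihedrals_deg = [-35.520, 22.639, -30.483, 15.009,
                 -35.782, 9.406, -31.776, 25.560]  # pull coords 3 to 10

# pull-coord11-expression = 0.5*x1+0.5*x2
coord11 = 0.5 * distances_nm[0] + 0.5 * distances_nm[1]

# pull-coord12-expression = (x3+...+x10)/8, evaluated in degrees and in radians
coord12_deg = sum(dihedrals_deg) / 8
coord12_rad = math.radians(coord12_deg)

print(f"coord11 = {coord11:.4f}")
print(f"coord12 = {coord12_deg:.3f} deg = {coord12_rad:.3f} rad")
```

If that reading is right, it is worth double-checking which unit awh1-dim2-start and awh1-dim2-end are interpreted in for a transformation coordinate.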

What do I need to do to solve this issue?

Best wishes

Will

Hello,

I’ll check whether there is a memory leak in this kind of calculation. Could you also check whether the GROMACS process uses more and more physical memory over the course of the simulation?
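
One way to check this is to sample the resident memory of the running process at regular intervals. The following is a minimal sketch, assuming a Linux machine where /proc/<pid>/status is available; the helper names rss_mib and watch are made up for this example and are not part of GROMACS.

```python
import os
import time


def rss_mib(pid):
    """Return the resident set size of a process in MiB, or None if unavailable."""
    try:
        with open(f"/proc/{pid}/status") as status:
            for line in status:
                if line.startswith("VmRSS:"):
                    return int(line.split()[1]) / 1024.0  # reported in kB
    except FileNotFoundError:
        pass  # process has exited, or /proc is not available on this system
    return None


def watch(pid, interval_s=10.0, samples=6):
    """Sample the RSS of a process several times and return the readings."""
    readings = []
    for _ in range(samples):
        readings.append(rss_mib(pid))
        time.sleep(interval_s)
    return readings


if __name__ == "__main__":
    # Demo on this script's own PID; replace it with the PID of the mdrun process.
    print(watch(os.getpid(), interval_s=0.1, samples=3))
```

Readings that keep growing during the run point to a leak, while a single very large jump at start-up points to one big allocation.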

Also, does this happen as well with 2022.4?

Cheers

Paul

Also, to make things easier for me, do you have a small system that I could use to try and reproduce this?

Cheers

Paul

Yes - It does. By looking at %mem while the job is running, I could see that it rises to approx 45% before it crashes.

I haven’t tried 2022.4 yet. I only used 2022.3, as it is available on the module system. I could try that but I doubt that it will fix the issue.

Do you want me to solvate an alanine quadpeptide for you? That may work?

Will

Sure, that would do the job. I just need to run the whole thing under ASAN/valgrind to find any possible leaks, and the smaller the system the better :)

Cheers

Paul

Yes - It does. By looking at %mem while the job is running, I could see that it rises to approx 45% before it crashes.

Thanks for confirming this. It means that there is most likely a memory leak.

I’ll also open an issue after confirming.

Cheers

Paul

Hm, I now also see that you are using the transformation pull coordinates. Can you check if there is still a leak if you use a different potential type?

Cheers

Paul

You mean change from AWH to something else?

Here is the file - topol.tpr should be in the AWH folder

Will

Let’s wait for now; I first need to run some tests.

Ok, I can confirm the hard crash without any information on my setup as well.

Here is some information, so that others who read this don’t need to go through the issue on GitLab.

The crash was caused by a request to allocate a grid that is larger than the available memory. The grid is so large because of the (too) high force constants used for the two AWH dimensions.
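
To give a feel for how the force constant drives the grid size, here is a rough sketch. It assumes that the grid spacing along a dimension is proportional to sigma = sqrt(kT/k), the width of the umbrella for force constant k. The function estimate_points and its points_per_sigma factor are hypothetical illustrations, not GROMACS parameters, and the real grid setup in GROMACS may differ in the details. The range and force constant must be given in consistent units.

```python
import math

K_B = 0.0083144626  # Boltzmann constant in kJ/(mol K)


def estimate_points(start, end, force_constant, temperature_k, points_per_sigma=1.0):
    """Rough number of grid points along one AWH dimension.

    Assumes spacing = sigma / points_per_sigma with sigma = sqrt(kT / k).
    points_per_sigma is a hypothetical factor used only in this sketch.
    """
    sigma = math.sqrt(K_B * temperature_k / force_constant)
    return math.ceil((end - start) / sigma * points_per_sigma) + 1


T = 303.15  # K, reference temperature from the grompp output

# Dimension 1 from the mdp file above: 0.60 to 2.50 nm.
for k in (200, 2000, 20000):
    n = estimate_points(0.60, 2.50, k, T)
    print(f"k = {k:6d} kJ/mol/nm^2 -> about {n} points")

# In two dimensions the total is the product of the per-dimension counts,
# so it grows roughly with sqrt(k1 * k2) and with both coordinate ranges.
```

Under this assumption, lowering a force constant by a factor of 100 cuts the number of points along that dimension by roughly a factor of 10, and the saving multiplies across dimensions.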