Slow tpr generation and slow simulation speed for a 15-million-atom system

GROMACS version: 2022.5
GROMACS modification: No

1. The system has 15 million atoms and contains more than 1,800 different types of organic molecules (about 10,000 organic molecules in total). The grompp step takes more than an hour. However, a system with the same number of atoms (15 million) but far fewer molecule types (only protein and water) takes only tens of seconds. Is the grompp time related to the number of molecular species in the system? (The mdp file imposes position restraints on the heavy atoms of all organic molecules.)

2. During the mdrun step, this system spends several hours reading the contents of the tpr file. For the subsequent MD run we tried to assign CPU cores and an A100 GPU via -ntmpi 1 -ntomp 32, or directly -nt 32 (without GPU). In both cases only one CPU core is used while reading the tpr file and during the MD run (top shows 100% CPU occupancy on a single core). In addition, the GPU occupancy stays at 0% during the MD run, and the corresponding log files have not been updated in ten hours, so we have no idea whether the simulation is running normally. Is it normal for mdrun to spend several hours reading the tpr file? Is the subsequent CPU/GPU usage normal? If not, how can we fix it?

Your replies will be of great help to us ! ! ! If simulation files are needed, we can upload them later.

My guess is that grompp slows down due to the high number of atom types. The LJ interaction matrix has size numtypes^2. How many atom types do you have?
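To give a sense of scale: 30 atom types give a 30 x 30 = 900-entry LJ parameter matrix, whereas, say, 5,000 distinct types would already mean 25 million entries.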

Do you have different atom types for each molecule type? If so, it would help if you could reuse atom types across molecules.

I should add that we have run systems of up to 800 million atoms with grompp and mdrun start-up times on the order of minutes.

Many thanks for your reply ! ! !

Atom types are defined in forcefield_water_ions.itp in the [ atomtypes ] section. These atom types come from the GAFF force field; there are 30 atom types in total (in forcefield_water_ions.itp).

; modyfied by wangah 20230817
[ atomtypes ]
; name      at.num  mass     charge ptype  sigma      epsilon
OW           8      16.00    0.0000  A   3.15061e-01  6.36386e-01
HW           1       1.008   0.0000  A   0.00000e+00  0.00000e+00
Cl          17      35.45    0.0000  A   4.40104e-01  4.18400e-01
Na          11      22.99    0.0000  A   3.32840e-01  1.15897e-02
K           19      39.10    0.0000  A   4.73602e-01  1.37235e-03
c1       c1          0.00000  0.00000   A     3.39967e-01   8.78640e-01 ; 1.91  0.2100
c2       c2          0.00000  0.00000   A     3.39967e-01   3.59824e-01 ; 1.91  0.0860
c3       c3          0.00000  0.00000   A     3.39967e-01   4.57730e-01 ; 1.91  0.1094
ca       ca          0.00000  0.00000   A     3.39967e-01   3.59824e-01 ; 1.91  0.0860
c        c           0.00000  0.00000   A     3.39967e-01   3.59824e-01 ; 1.91  0.0860
...
...
...

Each molecule then has its own separate itp file (1,822 files in total) with the following format (e.g. A55.itp):

[ moleculetype ]
;name            nrexcl
 A55              3

[ atoms ]
;   nr  type  resi  res  atom  cgnr     charge      mass       ; qtot   bond_type
     1   ss     1   A55    S1    1    -0.428418     32.06000 ; qtot -0.428
     2   oh     1   A55    O1    2    -0.631884     16.00000 ; qtot -1.060
     3    o     1   A55    O2    3    -0.630106     16.00000 ; qtot -1.690
     4    o     1   A55    O3    4    -0.594925     16.00000 ; qtot -2.285
     5    o     1   A55    O4    5    -0.476977     16.00000 ; qtot -2.762
     6    o     1   A55    O5    6    -0.476977     16.00000 ; qtot -3.239
     7    o     1   A55    O6    7    -0.689444     16.00000 ; qtot -3.929
     8    o     1   A55    O7    8    -0.689444     16.00000 ; qtot -4.618
     9    o     1   A55    O8    9    -0.724522     16.00000 ; qtot -5.343
    10    o     1   A55    O9   10    -0.724522     16.00000 ; qtot -6.067

...
...
...

All itp files (1,822 in total) are then included in mols.itp:

#include "A00.itp"
#include "A01.itp"
#include "A02.itp"
#include "A03.itp"
#include "A04.itp"
#include "A05.itp"
#include "A06.itp"
#include "A07.itp"
#include "A08.itp"
#include "A09.itp"
#include "A10.itp"
#include "A11.itp"
...
...
...

When constructing the topology file of the system, the number of molecules of each type is specified in the [ molecules ] section (system.top):

[ defaults ]
; nbfunc        comb-rule       gen-pairs       fudgeLJ fudgeQQ
1               2               yes             0.5     0.8333333333

; Include MO1_GMX.itp topology
#include "gmx_top_all/forcefield_water_ions.itp"
#include "gmx_top_all/mols.itp"

;; Ligand position restraints
;#ifdef POSRES_LIG
;#include "posre_MO1.itp"
;#endif

[ system ]
System in water

[ molecules ]
; Compound        nmols
A00 1
A01 5
A02 5
A03 10
A04 5
A05 1
A06 10
A07 10
A08 5
A09 5
...
...
...

In addition, the perf profile of gmx mdrun during the MD run (see the attached screenshot) shows that the two functions energyDriftAtomPair and calcVerletBufferSize take a relatively large share of the time.

We provide this information for your reference. Could you please help us find the cause and tell us how to correct it?

While reading your post I already suspected that the Verlet buffer estimation might be the cost. You can avoid that part of the code by setting verlet-buffer-tolerance = -1 and setting nstlist and rlist manually. This will affect performance, as you will not be able to use the twin pair list. I then suggest that you use the nstlist and rlist values that a CPU run would give you.
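For example, a manual setup would look roughly like this (the values are placeholders, not a recommendation for your system; rlist must be at least as large as rcoulomb and rvdw, and the difference is the pair-list buffer):

cutoff-scheme            = Verlet
verlet-buffer-tolerance  = -1    ; disable automatic buffer estimation
nstlist                  = 20    ; pair-list update interval, as in a CPU run
rlist                    = 1.1   ; cut-off plus an explicit buffer, e.g. 0.1 nm
rcoulomb                 = 1.0
rvdw                     = 1.0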

I can look into whether that time can be reduced. If not, then we might want to provide a manual option to set the twin pair list.

How many atoms do you have summed over all moleculetypes? That is, how many atoms would there be in the system if you had exactly one molecule per type?

And could you report the number of atom types for the Verlet buffer calculation? You can get this by running:
gmx mdrun -debug 1
or
gmx grompp -debug 1
Then there will be a file called gmx.debug and the number is on the line with:
energy drift atom types:
If you are lucky, you get this line quickly, before the hour-long part of the calculation starts.
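For example, something along these lines should work (the file names are just placeholders for your own inputs):

gmx grompp -f nvt.mdp -c system.gro -p system.top -o nvt.tpr -debug 1
grep "energy drift atom types" gmx.debug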

One more question: does each atom have a different charge (apart from symmetric ones, as in your first post)? If so, then that will lead to an extremely large number of different atom types in the Verlet buffer calculation. Rounding the charges in that calculation would help in that case.

Many thanks for your help!!!

Question 1

While reading your post I already suspected that the Verlet buffer estimation might be the cost. You can avoid that part of the code by setting verlet-buffer-tolerance = -1 and setting nstlist and rlist manually.

Following your suggestion, we modified the mdp file with two different settings.

First setting (rlist = 1.0):

; Neighborsearching
cutoff-scheme      = Verlet
nstlist            = 20         ; 20 steps, largely irrelevant with Verlet scheme
rlist              = 1.0        ; short-range neighborlist cutoff (in nm)
verlet-buffer-tolerance  = -1

rcoulomb        = 1.0        ; short-range electrostatic cutoff (in nm)
rvdw            = 1.0        ; short-range van der Waals cutoff (in nm)

The tpr file is now generated in a short time (~15 min). However, we get the following error during the MD run on GPU (with gmx mdrun -v -deffnm nvt1 -gpu_id 0 -ntmpi 1 -ntomp 64 -nb gpu -pme gpu -bonded auto -update gpu):

Program:     gmx mdrun, version 2022.5
Source file: src/gromacs/gpu_utils/devicebuffer.cuh (line 93)
Function:    auto freeDeviceBuffer(float **)::(anonymous class)::operator()() const

Assertion failed:
Condition: stat == cudaSuccess
Freeing of the device buffer failed. CUDA error #700
(cudaErrorIllegalAddress): an illegal memory access was encountered.

If we run the MD simulation on CPU only, it runs successfully, but only one CPU core is used during the run. The perf profile of gmx mdrun during this run is shown in the attached screenshot.

Second setting (rlist = 1.2):

; Neighborsearching
cutoff-scheme      = Verlet
nstlist            = 20         ; 20 steps, largely irrelevant with Verlet scheme
rlist              = 1.2       ; short-range neighborlist cutoff (in nm)
verlet-buffer-tolerance  = -1

rcoulomb        = 1.0        ; short-range electrostatic cutoff (in nm)
rvdw            = 1.0        ; short-range van der Waals cutoff (in nm)

The tpr file is also generated in a short time (~17 min). However, we could not tell whether the MD simulation was running successfully (CPU occupancy stays at 100% on one core, GPU occupancy stays at 0% during the run, and the corresponding log file has not been updated in ten hours). The perf profile of gmx mdrun during the run is as follows:

Samples: 450K of event 'cycles', Event count (approx.): 323821671531                                                                           
Overhead  Command         Shared Object         Symbol                                                                                         
  99.58%  gmx             libgromacs.so.7.0.0   [.] Nbnxm::sort_atoms
   0.15%  gmx             [kernel.kallsyms]     [k] change_pte_range
   0.03%  gmx             [kernel.kallsyms]     [k] _raw_spin_lock_irqsave
   0.03%  gmx             [kernel.kallsyms]     [k] timekeeping_advance
   0.02%  gmx             [kernel.kallsyms]     [k] _raw_spin_lock
   0.01%  gmx             [kernel.kallsyms]     [k] account_user_time
   0.01%  gmx             [kernel.kallsyms]     [k] __run_hrtimer

Question 2

How many atoms do you have summed over all moleculetypes? That is, how many atoms would there be in the system if you had exactly one molecule per type?

Our system has 13,676,501 atoms in total: 1,822 different organic molecule types in varying copy numbers (328,976 atoms in total), 4,436,329 water molecules, and 38,538 ions. With exactly one molecule per type, the total number of atoms would be 61,119.
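(As a consistency check, assuming 3-site water: 328,976 + 3 x 4,436,329 + 38,538 = 13,676,501, which matches the total.)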

Question 3

And could you report the number of atom types for the Verlet buffer calculation? You can get this by running:
gmx mdrun -debug 1
or
gmx grompp -debug 1
Then there will be a file called gmx.debug and the number is on the line with:
energy drift atom types:

Following your suggestion, we ran gmx grompp -debug 1 (with verlet-buffer-tolerance set to 0.005 in the mdp file). The output is:
energy drift atom types: 48124
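For a sense of scale: if the drift estimate loops over all pairs of these types, that is roughly 48,124 x 48,124 / 2 ≈ 1.2 x 10^9 type pairs, which would presumably explain the long grompp time.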

Question 4

One more question: does each atom have a different charge (apart from symmetric ones, as in your first post)? If so, then that will lead to an extremely large number of different atom types in the Verlet buffer calculation. Rounding the charges in that calculation would help in that case.

As you can see, each atom has a different charge: the charges of all small molecules are RESP charges calculated with Gaussian and are assigned per atom in each molecule's itp file (as in the A55.itp excerpt above), while the [ atomtypes ] section only carries the LJ parameters.

; modyfied by wangah 20230817
[ atomtypes ]
; name      at.num  mass     charge ptype  sigma      epsilon
OW           8      16.00    0.0000  A   3.15061e-01  6.36386e-01
HW           1       1.008   0.0000  A   0.00000e+00  0.00000e+00
Cl          17      35.45    0.0000  A   4.40104e-01  4.18400e-01
Na          11      22.99    0.0000  A   3.32840e-01  1.15897e-02
K           19      39.10    0.0000  A   4.73602e-01  1.37235e-03
c1       c1          0.00000  0.00000   A     3.39967e-01   8.78640e-01 ; 1.91  0.2100
c2       c2          0.00000  0.00000   A     3.39967e-01   3.59824e-01 ; 1.91  0.0860
c3       c3          0.00000  0.00000   A     3.39967e-01   4.57730e-01 ; 1.91  0.1094
ca       ca          0.00000  0.00000   A     3.39967e-01   3.59824e-01 ; 1.91  0.0860
c        c           0.00000  0.00000   A     3.39967e-01   3.59824e-01 ; 1.91  0.0860
...
...
...

Thank you for all the detailed information.

I am working on reducing the time of the Verlet buffer estimation with many different atom types.

I don't understand the problems with your runs, neither on GPU nor on CPU. Could you please check whether you get the same issues with version 2023.2?

Many thanks for your help.

Following your suggestion, we tested the MD simulation with GROMACS 2023.2, and the result was the same as with 2022.5. Currently the 1,822 small-molecule species are all defined as separate moleculetypes. If we combined them into a single moleculetype, could that solve the problem? If you are interested in testing this system, we can provide the relevant input files.

The Verlet buffer issue is not solved in the 2023 release. My new code would be needed for that.

My question was if you get the same CUDA error. Is that the case?

I see now that with the CPU run you are using the reference non-bonded kernels, which are extremely slow. Why do you not have SIMD support enabled in GROMACS?
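You can check this in the "SIMD instructions:" line of the version header in md.log. Normally cmake auto-detects the best SIMD level for the build machine; as a minimal sketch, forcing it explicitly looks like this (the exact level, e.g. AVX2_256 or AVX_512, depends on your CPU):

cmake .. -DGMX_SIMD=AVX2_256 -DGMX_GPU=CUDA
make -j && make install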

Yes, it would be useful if I can get access to the input files needed to run grompp.

Many thanks for your reply

Question 1

My question was if you get the same CUDA error. Is that the case?

With the mdp file set to rlist = 1.0 and verlet-buffer-tolerance = -1:

; Neighborsearching
cutoff-scheme      = Verlet
nstlist            = 20         ; 20 steps, largely irrelevant with Verlet scheme
rlist              = 1.0        ; short-range neighborlist cutoff (in nm)
verlet-buffer-tolerance  = -1

rcoulomb        = 1.0        ; short-range electrostatic cutoff (in nm)
rvdw            = 1.0        ; short-range van der Waals cutoff (in nm)

We got the same type of error:

Program:     gmx mdrun, version 2023.2
Source file: src/gromacs/gpu_utils/device_stream.cu (line 101)
Function:    auto DeviceStream::synchronize()::(anonymous class)::operator()() const

Assertion failed:
Condition: stat == cudaSuccess
cudaStreamSynchronize failed. CUDA error #700 (cudaErrorIllegalAddress): an
illegal memory access was encountered.

Question 2

I see now that with the CPU run you are using the reference non-bonded kernels, which are extremely slow. Yes, it would be useful if I can get access to the input files needed to run grompp.

Thank you very much for your interest in testing this system. We provide the input files at the link below:
data_for_nvt.tar.gz - Google Drive

My Verlet buffer code improvements make the estimate run in seconds.

Running your system with a debug build resulted in an assertion failure. I filed an issue: CUDA PME assertion failure (#4890) · Issues · GROMACS / GROMACS · GitLab
This should be easy to fix. @pszilard could you have a look?

The CPU run is unstable: it gives LINCS warnings and then an error about particles moving too far. Your starting coordinates are not sufficiently minimized/equilibrated for running MD.

We should avoid cryptic CUDA errors though.

Many thanks for your reply!!!

Yes, our starting coordinates are not well relaxed and need further minimization. The first task for us is to make sure this system can run normally on GPU; we can test the MD run with a smaller step size. In any case, the production run needs to be run on GPU, so the CUDA error does need to be solved.

Thank you very much for your efforts on our project; they give us confidence that we can overcome the difficulties with this system.

By the way, is it still necessary for us to do the test we mentioned earlier, i.e. combining the 1,822 small-molecule types into a single moleculetype?

The CUDA error is also triggered by the unstable system. A stable system should run fine.

No, it doesn’t help to merge all moleculetypes. The long setup time is caused by the number of different atomtypes, not moleculetypes. I fixed this at Reduce algorithmic complexity of calcVerletBufferDrift() (!3856) · Merge requests · GROMACS / GROMACS · GitLab

Many thanks for your reply!!!

Following your suggestion, we installed the GROMACS version you provided (branch bh-calcverletbuf-map) and ran a test. With it, generating and reading the .tpr file became much faster. In addition, the NVT simulation runs normally on both CPU and GPU with a very small step (emstep = 0.00002).

However, we still could not run the simulation with a step of 0.002. Just as you said, the CUDA error is also triggered by the unstable system.

Our system contains ~1,800 species of small molecules (all of them were placed into a 50 nm box with the packmol program, using a packmol tolerance of 5.0). These solutes were then solvated in a cubic box with an edge length of 52 nm, and 0.15 M Na+ and Cl- were added to construct the initial system.

We then attempted energy minimization using both the steepest descent and conjugate gradient methods, and even reduced the step size to 0.001.

We got the following message:

Energy minimization has stopped, but the forces have not converged to the
requested precision Fmax < 10 (which may not be possible for your system). It
stopped because the algorithm tried to make a new step whose size was too
small, or there was no change in the energy since last step. Either way, we
regard the minimization as converged to within the available machine
precision, given your starting configuration and EM parameters.
You might need to increase your constraint accuracy, or turn
off constraints altogether (set constraints = none in mdp file)

Is the position-restraint force constant too strong during minimization? Here are the mdp files for minimization:

em_cg.mdp

; ions.mdp - used as input into grompp to generate em.tpr
; Parameters describing what to do, when to stop and what to save
define        = -DPOSRES_LIG -DPOSRES_FC=1000  ;include position_restraints on heavy atoms
integrator    = cg        ; Algorithm (cg = conjugate gradient minimization)
emtol         = 10.0      ; Stop minimization when the maximum force < 10.0 kJ/mol/nm
emstep        = 0.001      ; Energy step size
nsteps        = 10000     ; Maximum number of (minimization) steps to perform
nstcgsteep    = 1000      ; do a steepest descent step every 1000 steps of cg

; Neighborsearching
cutoff-scheme      = Verlet
ns_type            = grid      ; search neighboring grid cells
nstlist            = 20        ; 20 steps, largely irrelevant with Verlet scheme
rlist              = 1.0       ; short-range neighborlist cutoff (in nm)
; Electrostatics
coulombtype        = PME       ; Particle Mesh Ewald for long-range electrostatics
rcoulomb           = 1.0       ; short-range electrostatic cutoff (in nm)
; VDW
vdwtype            = Cut-off
rvdw               = 1.0       ; short-range van der Waals cutoff (in nm)
DispCorr           = EnerPres  ; account for cut-off vdW scheme

em_steep.mdp


; ions.mdp - used as input into grompp to generate em.tpr
; Parameters describing what to do, when to stop and what to save
define        = -DPOSRES_LIG -DPOSRES_FC=1000  ;include position_restraints on heavy atoms
integrator    = steep     ; Algorithm (steep = steepest descent minimization)
emtol         = 10.0      ; Stop minimization when the maximum force < 10.0 kJ/mol/nm
emstep        = 0.001      ; Energy step size
nsteps        = 10000     ; Maximum number of (minimization) steps to perform

; Neighborsearching
cutoff-scheme      = Verlet
ns_type            = grid      ; search neighboring grid cells
nstlist            = 20        ; 20 steps, largely irrelevant with Verlet scheme
rlist              = 1.0       ; short-range neighborlist cutoff (in nm)
; Electrostatics
coulombtype        = PME       ; Particle Mesh Ewald for long-range electrostatics
rcoulomb           = 1.0       ; short-range electrostatic cutoff (in nm)
; VDW
vdwtype            = Cut-off
rvdw               = 1.0       ; short-range van der Waals cutoff (in nm)
DispCorr           = EnerPres  ; account for cut-off vdW scheme

Could you give us some advice on equilibrating our system? Looking forward to your reply!!!

What are the energy and the largest force after energy minimization?

Note that a maximum force of 10 kJ/mol/nm is very small. 100 would also be fine.

Sometimes it can help to run energy minimization in double precision.
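In case it helps, a double-precision build and run would look roughly like this (a minimal sketch; paths and file names are placeholders, and the double-precision binary is installed as gmx_d by default):

cmake .. -DGMX_DOUBLE=ON
make -j && make install
gmx_d grompp -f em_steep.mdp -c system.gro -p system.top -o em.tpr
gmx_d mdrun -v -deffnm em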