MDRun Losing Atoms

GROMACS version: 2025.4
GROMACS modification: No

I’m running four large systems, each containing protein, DNA, KCl, MgCl2, and TIP3P water. Minimization, heating, and equilibration run fine; however, when I then try to produce the production tpr with:
gmx grompp -f tprod.mdp -c equil.gro -p topol.top -o tprod.tpr

And receive the error:
Structure or trajectory file has more atoms (16846949) than the topology (16729393)

What I can guarantee isn’t happening:

  • Job crash during write-out: slurm.out and equil.log are fine and show full completion
  • trjconv issues: I have tried with and without trjconv-ing PBCs and the error persists
  • File changes: The directory is not shared with any other researchers, and I do not have any daemons that could be swapping topol.top files around or updating them
  • Glitches during simulation: gmx check returns absolutely no warnings

I simply run mdrun with -deffnm equil; before the run the atom counts agree, and afterwards the final equil.gro has drifted considerably. I’ve found some information suggesting that virtual sites might be causing this, as I am using the amber14sb_parmbsc1 force fields, but I’m not certain this is the case. This error occurs in all four systems (ranging from ~6 million to ~16 million atoms).

Any and all input would be of great help. Thanks in advance.

Cheers

These are some questions I would ask myself:

  • Is there any information about losing atoms in the log file of the equilibration phase?
  • Can you restart from a checkpoint rather than equil.gro? (see the sketch after this list)
  • Is this an error or a warning that you might consider suppressing?
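
For reference, a minimal sketch of a checkpoint continuation, assuming the default file names produced by -deffnm equil (grompp accepts the checkpoint through -t):

gmx grompp -f tprod.mdp -c equil.gro -t equil.cpt -p topol.top -o tprod.tpr
gmx mdrun -deffnm tprod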

Thank you for the reply. To answer the questions, which I should have done in my original post:

  • The equil.log is totally clear
  • I have been unable to get either the cpt or the gro to work, nor any combination thereof (though I’ll admit I may simply not have the right combination of flags; I am much more familiar with NAMD and Amber)
  • There have been no errors in grompp or in the two logs (Slurm and GROMACS), and only one warning was raised during grompping: PME enabled alongside a non-zero net charge

While I concede that the warning is worrisome, it should not cause atom loss, only the possibility of non-physical ion migration, unless I’ve misunderstood something. Another point: if I don’t need to use topol.top (as in my other three systems, where the equil-to-prod conversion was successful and I could extend the tpr with gmx convert-tpr, as sketched below), everything is roses, no issues. However, if I try to grompp them into a new shape (e.g. switching from 2 fs steps in prod1 to 4 fs steps in prod2), everything blows up, which means it really is the topology file that is the crux of the difficulty here.
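
For reference, the route that works for me looks something like this (file names and the extension time, in ps, are illustrative), while regenerating the tpr against topol.top is what fails:

gmx convert-tpr -s prod1.tpr -extend 100000 -o prod1_ext.tpr   # works: extend existing tpr by 100 ns
gmx grompp -f prod2.mdp -c prod1.gro -p topol.top -o prod2.tpr # fails: rebuild against topol.top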

I hope that info is useful. Cheers.

Hi @agdukhan

Just some thoughts, hope they can help.

So, my understanding is that the tpr you used for the equilibration worked properly and produced the various outputs, but when you use it as a starting point for the production, it breaks the moment you couple it with the topology. The natural suspicion here is that you are mixing the topologies of different systems. Can you double-check that the topology you have in the top file is the same one present in the equil.gro file? For example, can you reproduce equil.tpr with that same topology by grompping the last frame of the thermalization, along the lines of the sketch below?
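
A minimal sketch, assuming the thermalization output is called heat.gro and the equilibration parameters are in equil.mdp (both names are placeholders for your own files):

gmx grompp -f equil.mdp -c heat.gro -p topol.top -o equil_check.tpr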

I simply run mdrun with -deffnm equil; before the run the atom counts agree, and afterwards the final equil.gro has drifted considerably.

What do you mean by “drifted considerably”? Given the error you report, you have 117556 more atoms in the trajectory than in the topology, so it seems to me the opposite of losing atoms (not that this is any better). If this is true, then the change in atom number should have happened during the equilibration run. How many atoms do you have at the end of the equilibration (i.e. in equil.gro)? And how many were contained in the tpr file that you used for the equilibration (you can check the content of the starting frame with something like gmx editconf -f equil.tpr -o starting.gro)? If these are the same, then there is no mismatch and the system was already different originally. A quick way to compare the counts is sketched below.
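
For instance, something along these lines (the atom count sits on the second line of a .gro file):

gmx editconf -f equil.tpr -o starting.gro
awk 'FNR==2 {print FILENAME, $1}' starting.gro equil.gro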

How large is the charge mismatch that you have to maxwarn past? If any, it should be << 1; if it is a large integer, there might be some problem with the ions added in the topology.
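
One quick way to see the exact value, assuming grompp prints its notes to standard error as usual:

gmx grompp -f tprod.mdp -c equil.gro -p topol.top -o tprod.tpr 2>&1 | grep -i charge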

Hi @obZehn, thank you for your thoughts.

I am confident the topology is not being mixed; I have four systems, but they are segregated at the path level, and all of my preparation is done manually, so I can be quite certain I haven’t accidentally shifted a topology file (not to mention they differ by several million atoms each, so the scale of the error is another tell). Just to be certain, I have rerun pdb2gmx and my full solvate/ionise/min/heat/equil series, and the issue persists (albeit with slightly different numbers and differences).

You’re right to point out that I read the numbers backwards; indeed, it does seem I’m somehow generating atoms, which is maybe even worse. After running the editconf command you supplied, starting.gro contains 16729393 atoms, which is in line with my topology, whereas equil.gro has the extra 100k-ish atoms, so mdrun has somehow generated these. Strangely, gmx check -f equil.xtc does not complain, and only lists the final (increased) atom count. As to the charge, it is (a) unchanging across min, heat, and equil, but (b) large. I had presumed (maybe incorrectly) that this wouldn’t affect a ‘proof of concept’ run, and that it would only produce wacky results while the system could still be integrated.

The reason for the net charge is that my PI has requested that I use the SPLIT method for calculating ion numbers: get the system to 50 mM MgCl2, then solvate the rest to 150 mM with KCl. I expect that I’m neutralising this incorrectly somehow; however, I had again presumed that this would not completely deny the ability to simulate, especially because, no matter how messy the input is, generating atoms (or indeed any change in atom number) seems like aberrant behaviour.

What do you think? Should I just assume this is a natural effect of the charge issue? Again, I really appreciate your input and your taking the time to address this. Cheers.

I honestly have no idea, as I would also argue that whatever you do, the number of atoms should not change, and in my experience (I am no general expert or guru, but I have a few thousand runs under my belt) I have never seen an error like this in any GROMACS version between 2018 and 2025. The fact that you reproduced the full iteration and still got errors, and with different numbers, in a way points towards a possible bug, but I’m no developer, so this is far out of my league.

My two cents would be, in general, not to run with a large net charge. I would be okay with a fraction (like 0.001 or so), which is the typical leftover from force field and ligand parametrizations and/or parameter transfer, but anything bigger than ±1 can and should be counteracted. The SPLIT method should still aim for a totally neutral charge. My suggestion would be to go back, make sure that you neutralize properly, even with the standard gmx genion if you prefer (a sketch is below), and then check whether you see this change in atom numbers again. Although the number of atoms should never change regardless, this is still good practice.
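
A minimal sketch of a neutralizing genion pass, assuming an ions.tpr prepared by grompp; the ion names and concentration are illustrative, and the Mg2+ count is a placeholder for whatever 50 mM works out to in your box:

gmx genion -s ions.tpr -o solv_ions.gro -p topol.top -pname K -nname CL -neutral -conc 0.15
gmx genion -s ions2.tpr -o solv_ions2.gro -p topol.top -pname MG -pq 2 -np 1234   # separate pass for Mg2+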

The other thing that you might want to check, to be 100% sure whether there is a problem at the software level, is to use equil.tpr to split the equil.xtc file into regular gro files. For example, if you have a 100 ns run, you can extract frames with trjconv every 5-10 ns and print them as gro files, as sketched below. Is the atom count reported on the second line consistent, or do you see a shift at a certain time interval? And what is the number of atoms included in the last checkpoint file, which ideally should be the same as in the last gro frame written (trjconv can read the checkpoint too, e.g. gmx trjconv -s equil.tpr -f equil.cpt -o last_cpt.gro)? And is the number of mismatching atoms similar to something else in your topology, like the number of positive/negative ions in your box?
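
For instance, something along these lines (times are in ps; -sep writes one numbered gro file per frame):

gmx trjconv -s equil.tpr -f equil.xtc -o frame.gro -sep -dt 5000
awk 'FNR==2 {print FILENAME, $1}' frame*.gro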

So, I debugged the charge issue (the system is now totally neutral, with no grompp warning) and it now fails even faster (right after the initial minimization).

I was going to write a long reply, but I am now almost certain it’s a bug. I reran the simulation on our normal GPU nodes (ppc64le, GROMACS 2023.1 via the hecbiosim package) and had no problems, meaning this issue only occurs on the GH200 nodes (aarch64, GROMACS 2025.4) and is thus either a compilation issue or a bug (same difference, as far as I can tell). I will write a bug report on GitLab and forward it to the HPC admins so both teams can take a look.

Thank you both @obZehn @luedsch very much for the help :)

You are welcome!

Ah, interesting. I will be following the issue, as I also use GH200 nodes extensively, although I have never encountered this error. Maybe it is related to very large systems being split across several nodes. Good luck with this!