Mdrun issue - well-known issue but still can't fix it

GROMACS version: 2023.3
GROMACS modification: Yes/No
I am currently running a simulation of a protein structure immersed in water, comprising approximately 200,000 atoms in a cubic box built with -d 3. The simulation is executed on a single-node server equipped with multiple CPUs. To optimize performance, I am running it on a single CPU with 4 threads via "mpirun -np 1 gmx mdrun -v -deffnm em -ntomp 4 -rdd 1.6".
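For completeness, the preparation steps were roughly along these lines (reconstructed from memory, so the file names and the exact force-field/ion steps shown here may differ slightly from what I actually typed):

gmx pdb2gmx -f protein.pdb -o processed.gro -p topol.top
gmx editconf -f processed.gro -o box.gro -c -d 3 -bt cubic
gmx solvate -cp box.gro -cs spc216.gro -o solv.gro -p topol.top
gmx grompp -f em.mdp -c solv.gro -p topol.top -o em.tpr
mpirun -np 1 gmx mdrun -v -deffnm em -ntomp 4 -rdd 1.6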

Regrettably, I keep encountering the same error even after increasing the -rdd parameter from 1 to 1.6. The specific error output is as follows: "Command line:
gmx mdrun -v -deffnm em -ntomp 4 -rdd 1.6

Back Off! I just backed up em.log to ./#em.log.5#
Reading file em.tpr, VERSION 2023.3 (single precision)
Using 32 MPI threads

Non-default thread affinity set, disabling internal thread affinity

Using 4 OpenMP threads per tMPI thread

Back Off! I just backed up em.trr to ./#em.trr.6#

Back Off! I just backed up em.edr to ./#em.edr.6#

Steepest Descents:
Tolerance (Fmax) = 1.00000e+03
Number of steps = 50000

WARNING: Listed nonbonded interaction between particles 4863 and 4868
at distance 4.758 which is larger than the table limit 2.000 nm.

This is likely either a 1,4 interaction, or a listed interaction inside
a smaller molecule you are decoupling during a free energy calculation.
Since interactions at distances beyond the table cannot be computed,
they are skipped until they are inside the table limit again. You will
only see this message once, even if it occurs for several interactions.

IMPORTANT: This should not happen in a stable simulation, so there is
probably something wrong with your system. Only change the table-extension
distance in the mdp file if you are really sure that is the reason.

Not all bonded interactions have been properly assigned to the domain decomposition cells
A list of missing interactions:
Bond of 18570 missing 2
Angle of 33605 missing 8
Proper Dih. of 3749 missing 4
Ryckaert-Bell. of 38422 missing 16
LJ-14 of 48322 missing 20
Molecule type 'Protein_chain_A'
the first 10 missing interactions, except for exclusions:
Ryckaert-Bell. atoms 2903 2905 2918 2920 global 2903 2905 2918 2920
LJ-14 atoms 2903 2920 global 2903 2920
Angle atoms 2905 2918 2920 global 2905 2918 2920
Proper Dih. atoms 2905 2920 2918 2919 global 2905 2920 2918 2919
Ryckaert-Bell. atoms 2905 2918 2920 2921 global 2905 2918 2920 2921
Ryckaert-Bell. atoms 2905 2918 2920 2922 global 2905 2918 2920 2922
LJ-14 atoms 2905 2921 global 2905 2921
LJ-14 atoms 2905 2922 global 2905 2922
LJ-14 atoms 2906 2920 global 2906 2920
Ryckaert-Bell. atoms 2907 2905 2918 2920 global 2907 2905 2918 2920

Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.


mpirun noticed that process rank 0 with PID 0 on node hermes exited on signal 11 (Segmentation fault)."

You have something very bizarre going on here, because this normally indicates a pair (1-4) interaction that is very distorted or incorrect. If you have only a protein in the system, did pdb2gmx warn about missing residues or long bonds? What are these atoms and how are they oriented in space? Likely your topology or coordinates are unsound in some way, so the calculation immediately fails.
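One quick (hedged) way to look at those atoms directly: assuming the particle numbers printed in the warning are 1-based and match the ordering in whichever structure file went into grompp (I'll call it solv.gro here), and remembering that atom i sits on line i+2 of a .gro file because of the two header lines, something like this pulls out their coordinates:

awk 'NR >= 4865 && NR <= 4870' solv.gro   # atoms 4863-4868 flagged in the table-limit warning
awk 'NR >= 2905 && NR <= 2924' solv.gro   # atoms 2903-2922 from the missing-interaction list

If any of those coordinates are very far apart, or pdb2gmx reported long bonds or missing residues involving them, that is where the topology and coordinates disagree. Loading the structure in a viewer and inspecting those residues works just as well.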

Hi Justin,

Thanks for your swift and precise response! It’s peculiar how smoothly the system runs on my Mac despite its limited capacity, yet throws up issues when running on the server. I’ve taken your advice and re-run the process from scratch since posing my initial question. Interestingly, my topology doesn’t seem to present any glaring issues, but upon redoing pdb2gmx, I did notice a few atoms with long bonds.

I absolutely agree with you on the importance of sanity-checking the structure at the pdb2gmx stage, even if pdb2gmx produces the gro file without any errors. It's those little details that can often slip through the cracks.

I'm also curious about one more thing. When we run GROMACS on a single node with multiple CPUs, does it automatically adjust the number of threads/cores to the system's configuration, or do we need to specify the number of threads/cores manually when executing mdrun?

Looking forward to your insights on this matter!

mdrun will use whatever it's told to use. If you don't tell it anything, it will try to use all the threads/cores the hardware has available. If you don't want that, you have to take control by specifying exactly what resources it should use.
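A few illustrative invocations (these are standard mdrun options, but adjust to your build; gmx_mpi is only the conventional name of an MPI-enabled binary):

gmx mdrun -v -deffnm em                             # no hints: mdrun detects and uses all cores on the node
gmx mdrun -v -deffnm em -ntmpi 1 -ntomp 4 -pin on   # one thread-MPI rank, 4 OpenMP threads, pinned to cores
mpirun -np 1 gmx_mpi mdrun -v -deffnm em -ntomp 4   # external MPI: rank count comes from mpirun, threads per rank from -ntomp

The first form is the "use everything" default; the other two are how you take explicit control.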

Good to know that, thanks, Justin.