Domain decomposition on multiple ranks

GROMACS version: 2020.4
GROMACS modification: No

Hi,

When running mdrun on my local workstation with more than one rank (e.g. with “gmx mdrun -ntmpi 4 -ntomp 4 -deffnm md_0_1”), the simulation crashes before even calculating the first step of the run with:

Fatal error:
75 particles communicated to PME rank 2 are more than 2/3 times the cut-off
out of the domain decomposition cell of their charge group in dimension x.
This usually means that your system is not well equilibrated.

When running the same system with “gmx mdrun -ntmpi 1 -ntomp 16 -deffnm md_0_1” everything works fine and the simulation looks okay on visual inspection. The same thing also happens when I test the lysozyme simulation protocol from Lysozyme in Water instead of my own system, so I assume this is not my simulation blowing up but rather a misconfiguration on the software side.

Logfile from a run with two ranks: md_0_1.log (18.5 KB)
Dumped pdb-files from the 2-rank run: MLU-Cloud
Logfile from run with one rank: md_0_1_1rank.log (84.7 KB)

The same error also occurs when I submit a tpr-file generated on said workstation to a cluster and run it there with more than one rank.

Gromacs on the workstation:

GROMACS version: 2020.4
Verified release checksum is 79c2857291b034542c26e90512b92fd4b184a1c9d6fa59c55f2e24ccf14e7281
Precision: single
Memory model: 64 bit
MPI library: thread_mpi
OpenMP support: enabled (GMX_OPENMP_MAX_THREADS = 64)
GPU support: CUDA
SIMD instructions: AVX2_256
FFT library: fftw-3.3.8-sse2-avx-avx2-avx2_128
RDTSCP usage: enabled
TNG support: enabled
Hwloc support: disabled
Tracing support: disabled
C compiler: /usr/bin/cc GNU 9.3.0
C compiler flags: -mavx2 -mfma -fexcess-precision=fast -funroll-all-loops -O3 -DNDEBUG
C++ compiler: /usr/bin/c++ GNU 9.3.0
C++ compiler flags: -mavx2 -mfma -fexcess-precision=fast -funroll-all-loops -fopenmp -O3 -DNDEBUG
CUDA compiler: /usr/local/cuda/bin/nvcc nvcc: NVIDIA ® Cuda compiler driver;Copyright © 2005-2020 NVIDIA Corporation;Built on Mon_Oct_12_20:09:46_PDT_2020;Cuda compilation tools, release 11.1, V11.1.105;Build cuda_11.1.TC455_06.29190527_0
CUDA compiler flags:-gencode;arch=compute_35,code=sm_35;-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_52,code=sm_52;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-Wno-deprecated-gpu-targets;-gencode;arch=compute_35,code=compute_35;-gencode;arch=compute_50,code=compute_50;-gencode;arch=compute_52,code=compute_52;-gencode;arch=compute_60,code=compute_60;-gencode;arch=compute_61,code=compute_61;-gencode;arch=compute_70,code=compute_70;-gencode;arch=compute_75,code=compute_75;-gencode;arch=compute_80,code=compute_80;-use_fast_math;-D_FORCE_INLINES;-mavx2 -mfma -fexcess-precision=fast -funroll-all-loops -fopenmp -O3 -DNDEBUG
CUDA driver: 11.20
CUDA runtime: 11.10

Any help on how to proceed with my observation would be much appreciated.

This indeed seems to indicate an issue in the code. But this is a rather common setup, so I would think we would have found it long before now. I tried to download the log file, but there is an issue with the server.

How many GPUs do you have in the machine?

I could now download your files. I see that there is only one GPU. I have no clue what could be wrong here.
@pszilard do you have an idea?

Thanks for looking into this.
I found out that omitting the cpt-file as input for grompp prevents this error from happening. So the following works:

gmx grompp -f md.mdp -c npt.gro -p topol.top -o md_0_1.tpr
gmx mdrun -ntmpi 4 -ntomp 4 -deffnm md_0_1

whereas this does not work:

gmx grompp -f md.mdp -c npt.gro -p topol.top -t npt.cpt -o md_0_1.tpr
gmx mdrun -ntmpi 4 -ntomp 4 -deffnm md_0_1

We speculated that velocities are assigned to the atoms differently in the single-rank cpt than would be required for the multi-rank run, and hence the simulation explodes in multi-rank mode. But this would have to be confirmed by someone more familiar with the code base.
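As a side note, a common alternative to passing the checkpoint through grompp is to let mdrun itself read it via -cpi. This is only a sketch of that workflow (assuming the md_0_1.tpr and md_0_1.cpt filenames from the commands above, and a matching topology/mdp); I have not verified that it avoids the error here:

```shell
# Generate the tpr without the checkpoint, as in the working case above
gmx grompp -f md.mdp -c npt.gro -p topol.top -o md_0_1.tpr
# Then continue from the checkpoint at the mdrun stage instead
gmx mdrun -ntmpi 4 -ntomp 4 -deffnm md_0_1 -cpi md_0_1.cpt
```

Whether mdrun -cpi reproduces or sidesteps the PME communication error would itself be a useful data point for narrowing down the bug.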

I cannot reproduce the errors, not even with the same lysozyme inputs. I tried continuation runs using checkpoint files generated by a 1-rank run, continued with a 4-rank run. Can you share your input files?

Yes sure, please find the files at MLU-Cloud