Domain decomposition on multiple ranks

GROMACS version: 2020.4
GROMACS modification: No

Hi,

When running mdrun on my local workstation with more than one rank (e.g. with “gmx mdrun -ntmpi 4 -ntomp 4 -deffnm md_0_1”), the simulation crashes before even calculating the first step of the run with:

Fatal error:
75 particles communicated to PME rank 2 are more than 2/3 times the cut-off
out of the domain decomposition cell of their charge group in dimension x.
This usually means that your system is not well equilibrated.

When running the same system with “gmx mdrun -ntmpi 1 -ntomp 16 -deffnm md_0_1” everything works fine and the simulation looks okay on visual inspection. The same thing also happens when I test the lysozyme simulation protocol from Lysozyme in Water instead of my own system, so I assume this is not my simulation blowing up but rather a misconfiguration on the software side.

Logfile from a run with two ranks: md_0_1.log (18.5 KB)
Dumped pdb-files from the 2-rank run: MLU-Cloud
Logfile from run with one rank: md_0_1_1rank.log (84.7 KB)

The same error also occurs when I submit a tpr-file generated on said workstation to a cluster and run it there with more than one rank.

Gromacs on the workstation:

GROMACS version: 2020.4
Verified release checksum is 79c2857291b034542c26e90512b92fd4b184a1c9d6fa59c55f2e24ccf14e7281
Precision: single
Memory model: 64 bit
MPI library: thread_mpi
OpenMP support: enabled (GMX_OPENMP_MAX_THREADS = 64)
GPU support: CUDA
SIMD instructions: AVX2_256
FFT library: fftw-3.3.8-sse2-avx-avx2-avx2_128
RDTSCP usage: enabled
TNG support: enabled
Hwloc support: disabled
Tracing support: disabled
C compiler: /usr/bin/cc GNU 9.3.0
C compiler flags: -mavx2 -mfma -fexcess-precision=fast -funroll-all-loops -O3 -DNDEBUG
C++ compiler: /usr/bin/c++ GNU 9.3.0
C++ compiler flags: -mavx2 -mfma -fexcess-precision=fast -funroll-all-loops -fopenmp -O3 -DNDEBUG
CUDA compiler: /usr/local/cuda/bin/nvcc nvcc: NVIDIA ® Cuda compiler driver;Copyright © 2005-2020 NVIDIA Corporation;Built on Mon_Oct_12_20:09:46_PDT_2020;Cuda compilation tools, release 11.1, V11.1.105;Build cuda_11.1.TC455_06.29190527_0
CUDA compiler flags:-gencode;arch=compute_35,code=sm_35;-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_52,code=sm_52;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-Wno-deprecated-gpu-targets;-gencode;arch=compute_35,code=compute_35;-gencode;arch=compute_50,code=compute_50;-gencode;arch=compute_52,code=compute_52;-gencode;arch=compute_60,code=compute_60;-gencode;arch=compute_61,code=compute_61;-gencode;arch=compute_70,code=compute_70;-gencode;arch=compute_75,code=compute_75;-gencode;arch=compute_80,code=compute_80;-use_fast_math;-D_FORCE_INLINES;-mavx2 -mfma -fexcess-precision=fast -funroll-all-loops -fopenmp -O3 -DNDEBUG
CUDA driver: 11.20
CUDA runtime: 11.10

Any help on how to proceed with my observation would be much appreciated.

This indeed seems to indicate an issue in the code. But this is a rather common setup, so I would think we would have found it long before now. I tried to download the log file, but there is an issue with the server.

How many GPUs do you have in the machine?

I could now download your files. I see that there is only one GPU. I have no clue what could be wrong here.
@pszilard do you have an idea?

Thanks for looking into this.
I found out that omitting the cpt-file as input for grompp prevents this error from happening. So the following works:

gmx grompp -f md.mdp -c npt.gro -p topol.top -o md_0_1.tpr
gmx mdrun -ntmpi 4 -ntomp 4 -deffnm md_0_1

whereas this does not work:

gmx grompp -f md.mdp -c npt.gro -p topol.top -t npt.cpt -o md_0_1.tpr
gmx mdrun -ntmpi 4 -ntomp 4 -deffnm md_0_1

We speculated that velocities are assigned to the atoms differently in the single-rank cpt than would be required for the multi-rank run, and hence the simulation explodes in multi-rank mode. But this would have to be confirmed by someone more familiar with the code base.
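As a side note, a common alternative to passing the checkpoint through grompp is to let mdrun itself read it via -cpi. This is only a sketch of that workflow (assuming the md_0_1.tpr and md_0_1.cpt filenames from the commands above, and a matching topology/mdp); I have not verified that it avoids the error here:

```shell
# Generate the tpr without the checkpoint, as in the working case above
gmx grompp -f md.mdp -c npt.gro -p topol.top -o md_0_1.tpr
# Then continue from the checkpoint at the mdrun stage instead
gmx mdrun -ntmpi 4 -ntomp 4 -deffnm md_0_1 -cpi md_0_1.cpt
```

Whether mdrun -cpi reproduces or sidesteps the PME communication error would itself be a useful data point for narrowing down the bug.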

I cannot reproduce the errors, not even with the same lysozyme inputs. I tried continuation runs using checkpoint files generated by a 1-rank run, continued with a 4-rank run. Can you share your input files?

Yes sure, please find the files at MLU-Cloud