LINCS error on newer but not older machines after installing gmx2020.3

As scientists we know and expect failure in our experiments daily. That’s why it’s called RE-search and not Search.

Be well.
Paul

Please try to use the quoting feature of the editor as custom markers in inline replies are hard to follow.

Yes, thread-MPI is enabled by default in all GROMACS builds unless GMX_MPI=ON is passed to cmake; for more details see the user guide.

No, thread-MPI is enabled by default.
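For reference, a minimal sketch of the corresponding CMake invocations (everything except the MPI flag is left at its default here; add whatever other options you normally use):

# default build: thread-MPI is enabled automatically
cmake ..

# build against an external MPI library instead (this disables thread-MPI)
cmake .. -DGMX_MPI=ON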

If you set them in GMXRC they will persist across all invocations of gmx. Depending on your shell you can set them in various ways (I suggest checking a general guide), or you can simply set them as a prefix to the command. The value does not matter; the variables just need to be set. E.g. in bash:

GMX_GPU_PME_PP_COMMS=1 GMX_GPU_DD_COMMS=1 gmx mdrun ...
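If you would rather set them once for the whole shell session instead of prefixing every command, you can export them; a bash sketch (the values are arbitrary, only the presence of the variables matters):

export GMX_GPU_PME_PP_COMMS=1
export GMX_GPU_DD_COMMS=1
gmx mdrun ...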

I just wanted to thank you as well for actually trying this out for a set of serious use cases. I think this will help us iron out the remaining bugs for 2021.

Cheers
Paul

Are you referring to the “Peer access enabled…” note in the log? If that is missing, it could indicate that the setup phase of GPU direct communication is where the problem lies (and in that case the issue may be external to GROMACS). Can you please share the full log? This would be useful information because (as far as I know) this part of the code has not changed in the 2021 code.
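Independently of GROMACS, you can also inspect the GPU interconnect with the NVIDIA driver tools to confirm that the two cards are actually connected by NVLink and can reach each other directly; something like:

nvidia-smi topo -m
nvidia-smi nvlink --status

(The first prints the topology matrix between the GPUs and CPUs, the second the per-link NVLink status.)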

The number of ranks (-ntmpi) in your launch is not ideal, so you could try fewer ranks, e.g. -ntmpi 2 and -ntmpi 4 (though in principle this should only affect performance); see the example below.
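For example, keeping the rest of your launch unchanged, a 4-rank run might look like the line below (the -ntomp value is only illustrative; choose it so that ranks times threads matches the cores you want to use):

GMX_GPU_PME_PP_COMMS=1 GMX_GPU_DD_COMMS=1 gmx mdrun -deffnm PE.sys.LB.nvt -nb gpu -pme gpu -ntmpi 4 -ntomp 16 -npme 1 -nsteps 100000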

Here you are…

Command sequence (the log follows; sorry, but I did not see a way to attach the log):

GROMACS:      gmx mdrun, version 2020.4
Executable:   /usr/local/gromacs/bin/gmx
Data prefix:  /usr/local/gromacs
Working dir:  /home/pb/Desktop/PE sys
Command line:
  gmx mdrun -deffnm PE.sys.LB.nvt -nb gpu -pme gpu -ntomp 4 -ntmpi 16 -npme 1 -nsteps 100000

Back Off! I just backed up PE.sys.LB.nvt.log to ./#PE.sys.LB.nvt.log.9#
Reading file PE.sys.LB.nvt.tpr, VERSION 2020.4 (single precision)
Enabling GPU buffer operations required by GMX_GPU_DD_COMMS (equivalent with GMX_USE_GPU_BUFFER_OPS=1).

This run uses the 'GPU halo exchange' feature, enabled by the GMX_GPU_DD_COMMS environment variable.

This run uses the 'GPU PME-PP communications' feature, enabled by the GMX_GPU_PME_PP_COMMS environment variable.

Overriding nsteps with value passed on the command line: 100000 steps, 100 ps
On host TR1 2 GPUs selected for this run.
Mapping of GPU IDs to the 16 GPU tasks in the 16 ranks on this node:
  PP:0,PP:0,PP:0,PP:0,PP:0,PP:0,PP:0,PP:0,PP:1,PP:1,PP:1,PP:1,PP:1,PP:1,PP:1,PME:1
PP tasks will do (non-perturbed) short-ranged interactions on the GPU
PP task will update and constrain coordinates on the CPU
PME tasks will do all aspects on the GPU
Using 16 MPI threads
Using 4 OpenMP threads per tMPI thread

NOTE: This run uses the 'GPU halo exchange' feature, enabled by the GMX_GPU_DD_COMMS environment variable.

Back Off! I just backed up PE.sys.LB.nvt.trr to ./#PE.sys.LB.nvt.trr.6#
Back Off! I just backed up PE.sys.LB.nvt.edr to ./#PE.sys.LB.nvt.edr.6#

NOTE: DLB will not turn on during the first phase of PME tuning
starting mdrun 'PE system TMPTA EGDMA IPA'
100000 steps,    100.0 ps.

[2]+  Stopped                 gmx mdrun -deffnm PE.sys.LB.nvt -nb gpu -pme gpu -ntomp 4 -ntmpi 16 -npme 1 -nsteps 100000

=============== log =================

GROMACS version:    2020.4
Verified release checksum is 79c2857291b034542c26e90512b92fd4b184a1c9d6fa59c55f2e24ccf14e7281
Precision:          single
Memory model:       64 bit
MPI library:        thread_mpi
OpenMP support:     enabled (GMX_OPENMP_MAX_THREADS = 64)
GPU support:        CUDA
SIMD instructions:  AVX2_128
FFT library:        fftw-3.3.8-sse2-avx-avx2-avx2_128
RDTSCP usage:       enabled
TNG support:        enabled
Hwloc support:      disabled
Tracing support:    disabled
C compiler:         /usr/bin/cc GNU 7.5.0
C compiler flags:   -mavx2 -mfma -fexcess-precision=fast -funroll-all-loops -O3 -DNDEBUG
C++ compiler:       /usr/bin/c++ GNU 7.5.0
C++ compiler flags: -mavx2 -mfma -fexcess-precision=fast -funroll-all-loops -fopenmp -O3 -DNDEBUG
CUDA compiler:      /usr/local/cuda/bin/nvcc nvcc: NVIDIA (R) Cuda compiler driver;Copyright (c) 2005-2020 NVIDIA Corporation;Built on Tue_Sep_15_19:10:02_PDT_2020;Cuda compilation tools, release 11.1, V11.1.74;Build cuda_11.1.TC455_06.29069683_0
CUDA compiler flags:-gencode;arch=compute_75,code=sm_75;-use_fast_math;-D_FORCE_INLINES;-mavx2 -mfma -fexcess-precision=fast -funroll-all-loops -fopenmp -O3 -DNDEBUG
CUDA driver:        11.10
CUDA runtime:       11.10


Running on 1 node with total 32 cores, 64 logical cores, 2 compatible GPUs
Hardware detected:
  CPU info:
    Vendor: AMD
    Brand:  AMD Ryzen Threadripper 2990WX 32-Core Processor
    Family: 23   Model: 8   Stepping: 2
    Features: aes amd apic avx avx2 clfsh cmov cx8 cx16 f16c fma htt lahf misalignsse mmx msr nonstop_tsc pclmuldq pdpe1gb popcnt pse rdrnd rdtscp sha sse2 sse3 sse4a sse4.1 sse4.2 ssse3
  Hardware topology: Basic
    Sockets, cores, and logical processors:
      Socket  0: [   0   1] [   2   3] [   4   5] [   6   7] [   8   9] [  10  11] [  12  13] [  14  15] [  32  33] [  34  35] [  36  37] [  38  39] [  40  41] [  42  43] [  44  45] [  46  47] [  16  17] [  18  19] [  20  21] [  22  23] [  24  25] [  26  27] [  28  29] [  30  31] [  48  49] [  50  51] [  52  53] [  54  55] [  56  57] [  58  59] [  60  61] [  62  63]
  GPU info:
    Number of GPUs detected: 2
    #0: NVIDIA GeForce RTX 2080 Ti, compute cap.: 7.5, ECC:  no, stat: compatible
    #1: NVIDIA GeForce RTX 2080 Ti, compute cap.: 7.5, ECC:  no, stat: compatible


++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
...
...

++++ PLEASE CITE THE DOI FOR THIS VERSION OF GROMACS ++++
https://doi.org/10.5281/zenodo.4054979
-------- -------- --- Thank You --- -------- --------

Enabling GPU buffer operations required by GMX_GPU_DD_COMMS (equivalent with GMX_USE_GPU_BUFFER_OPS=1).

This run uses the 'GPU halo exchange' feature, enabled by the GMX_GPU_DD_COMMS environment variable.

This run uses the 'GPU PME-PP communications' feature, enabled by the GMX_GPU_PME_PP_COMMS environment variable.
Input Parameters:
   integrator                     = sd
   tinit                          = 0
   dt                             = 0.001
   nsteps                         = 500000
   init-step                      = 0
   simulation-part                = 1
   comm-mode                      = Linear
   nstcomm                        = 100
   bd-fric                        = 0
   ld-seed                        = -436869696
   emtol                          = 10
   emstep                         = 0.01
   niter                          = 20
   fcstep                         = 0
   nstcgsteep                     = 1000
   nbfgscorr                      = 10
   rtpi                           = 0.05
   nstxout                        = 5000
   nstvout                        = 5000
   nstfout                        = 0
   nstlog                         = 5000
   nstcalcenergy                  = 100
   nstenergy                      = 5000
   nstxout-compressed             = 0
   compressed-x-precision         = 1000
   cutoff-scheme                  = Verlet
   nstlist                        = 100
   pbc                            = xyz
   periodic-molecules             = false
   verlet-buffer-tolerance        = 0.005
   rlist                          = 1
   coulombtype                    = PME
   coulomb-modifier               = Potential-shift
   rcoulomb-switch                = 0
   rcoulomb                       = 1
   epsilon-r                      = 1
   epsilon-rf                     = inf
   vdw-type                       = Cut-off
   vdw-modifier                   = Potential-shift
   rvdw-switch                    = 0
   rvdw                           = 1
   DispCorr                       = EnerPres
   table-extension                = 1
   fourierspacing                 = 0.416
   fourier-nx                     = 36
   fourier-ny                     = 72
   fourier-nz                     = 2560
   pme-order                      = 4
   ewald-rtol                     = 1e-05
   ewald-rtol-lj                  = 0.001
   lj-pme-comb-rule               = Geometric
   ewald-geometry                 = 0
   epsilon-surface                = 0
   tcoupl                         = No
   nsttcouple                     = -1
   nh-chain-length                = 0
   print-nose-hoover-chain-variables = false
   pcoupl                         = No
   pcoupltype                     = Isotropic
   nstpcouple                     = -1
   tau-p                          = 1
   compressibility (3x3):
      compressibility[    0]={ 0.00000e+00,  0.00000e+00,  0.00000e+00}
      compressibility[    1]={ 0.00000e+00,  0.00000e+00,  0.00000e+00}
      compressibility[    2]={ 0.00000e+00,  0.00000e+00,  0.00000e+00}
   ref-p (3x3):
      ref-p[    0]={ 0.00000e+00,  0.00000e+00,  0.00000e+00}
      ref-p[    1]={ 0.00000e+00,  0.00000e+00,  0.00000e+00}
      ref-p[    2]={ 0.00000e+00,  0.00000e+00,  0.00000e+00}
   refcoord-scaling               = No
   posres-com (3):
      posres-com[0]= 0.00000e+00
      posres-com[1]= 0.00000e+00
      posres-com[2]= 0.00000e+00
   posres-comB (3):
      posres-comB[0]= 0.00000e+00
      posres-comB[1]= 0.00000e+00
      posres-comB[2]= 0.00000e+00
   QMMM                           = false
   QMconstraints                  = 0
   QMMMscheme                     = 0
   MMChargeScaleFactor            = 1
qm-opts:
   ngQM                           = 0
   constraint-algorithm           = Lincs
   continuation                   = false
   Shake-SOR                      = false
   shake-tol                      = 0.0001
   lincs-order                    = 4
   lincs-iter                     = 1
   lincs-warnangle                = 30
   nwall                          = 0
   wall-type                      = 9-3
   wall-r-linpot                  = -1
   wall-atomtype[0]               = -1
   wall-atomtype[1]               = -1
   wall-density[0]                = 0
   wall-density[1]                = 0
   wall-ewald-zfac                = 3
   pull                           = false
   awh                            = false
   rotation                       = false
   interactiveMD                  = false
   disre                          = No
   disre-weighting                = Conservative
   disre-mixed                    = false
   dr-fc                          = 1000
   dr-tau                         = 0
   nstdisreout                    = 100
   orire-fc                       = 0
   orire-tau                      = 0
   nstorireout                    = 100
   free-energy                    = no
   cos-acceleration               = 0
   deform (3x3):
      deform[    0]={ 0.00000e+00,  0.00000e+00,  0.00000e+00}
      deform[    1]={ 0.00000e+00,  0.00000e+00,  0.00000e+00}
      deform[    2]={ 0.00000e+00,  0.00000e+00,  0.00000e+00}
   simulated-tempering            = false
   swapcoords                     = no
   userint1                       = 0
   userint2                       = 0
   userint3                       = 0
   userint4                       = 0
   userreal1                      = 0
   userreal2                      = 0
   userreal3                      = 0
   userreal4                      = 0
   applied-forces:
     electric-field:
       x:
         E0                       = 0
         omega                    = 0
         t0                       = 0
         sigma                    = 0
       y:
         E0                       = 0
         omega                    = 0
         t0                       = 0
         sigma                    = 0
       z:
         E0                       = 0
         omega                    = 0
         t0                       = 0
         sigma                    = 0
     density-guided-simulation:
       active                     = false
       group                      = protein
       similarity-measure         = inner-product
       atom-spreading-weight      = unity
       force-constant             = 1e+09
       gaussian-transform-spreading-width = 0.2
       gaussian-transform-spreading-range-in-multiples-of-width = 4
       reference-density-filename = reference.mrc
       nst                        = 1
       normalize-densities        = true
       adaptive-force-scaling     = false
       adaptive-force-scaling-time-constant = 4
grpopts:
   nrdf:       63203     19599.7     24399.6     27099.6       64999
   ref-t:         310         310         310         310         310  
   tau-t:         0.1         0.1         0.1         0.1         0.1
annealing:          No          No          No          No          No
annealing-npoints:           0           0           0           0           0
   acc:	           0           0           0
   nfreeze:           N           N           N
   energygrp-flags[  0]: 0


The -nsteps functionality is deprecated, and may be removed in a future version. Consider using gmx convert-tpr -nsteps or changing the appropriate .mdp file field.

Overriding nsteps with value passed on the command line: 100000 steps, 100 ps

Initializing Domain Decomposition on 16 ranks
Dynamic load balancing: auto
Minimum cell size due to atom displacement: 0.514 nm
Initial maximum distances in bonded interactions:
    two-body bonded interactions: 0.414 nm, LJ-14, atoms 50463 50466
  multi-body bonded interactions: 0.414 nm, Proper Dih., atoms 50463 50466
Minimum cell size due to bonded interactions: 0.455 nm
Maximum distance for 5 constraints, at 120 deg. angles, all-trans: 0.774 nm
Estimated maximum distance required for P-LINCS: 0.774 nm
This distance will limit the DD cell size, you can override this with -rcon
Scaling the initial minimum size with 1/0.8 (option -dds) = 1.25
Using 1 separate PME ranks
Optimizing the DD grid for 15 cells with a minimum initial size of 1.250 nm
The maximum allowed number of cells is: X 11 Y 22 Z 800
Domain decomposition grid 1 x 1 x 15, separate PME ranks 1
PME domain decomposition: 1 x 1 x 1
Interleaving PP and PME ranks
This rank does only particle-particle work.
Domain decomposition rank 0, coordinates 0 0 0

The initial number of communication pulses is: Z 1
The initial domain decomposition cell size is: Z 66.67 nm

The maximum allowed distance for atoms involved in interactions is:
                 non-bonded interactions           1.000 nm
            two-body bonded interactions  (-rdd)   1.000 nm
          multi-body bonded interactions  (-rdd)   1.000 nm
  atoms separated by up to 5 constraints  (-rcon) 14.200 nm

When dynamic load balancing gets turned on, these settings will change to:
The maximum number of communication pulses is: Z 1
The minimum size for domain decomposition cells is 1.000 nm
The requested allowed shrink of DD cells (option -dds) is: 0.80
The allowed shrink of domain decomposition cells is: Z 0.02
The maximum allowed distance for atoms involved in interactions is:
                 non-bonded interactions           1.000 nm
            two-body bonded interactions  (-rdd)   1.000 nm
          multi-body bonded interactions  (-rdd)   1.000 nm
  atoms separated by up to 5 constraints  (-rcon)  1.000 nm

On host TR1 2 GPUs selected for this run.
Mapping of GPU IDs to the 16 GPU tasks in the 16 ranks on this node:
  PP:0,PP:0,PP:0,PP:0,PP:0,PP:0,PP:0,PP:0,PP:1,PP:1,PP:1,PP:1,PP:1,PP:1,PP:1,PME:1
PP tasks will do (non-perturbed) short-ranged interactions on the GPU
PP task will update and constrain coordinates on the CPU
PME tasks will do all aspects on the GPU
Using 16 MPI threads
Using 4 OpenMP threads per tMPI thread


Note: Peer access enabled between the following GPU pairs in the node:
 0->1 1->0 

Pinning threads with an auto-selected logical core stride of 1
System total charge: 0.000
Will do PME sum in reciprocal space for electrostatic interactions.

++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
U. Essmann, L. Perera, M. L. Berkowitz, T. Darden, H. Lee and L. G. Pedersen 
A smooth particle mesh Ewald method
J. Chem. Phys. 103 (1995) pp. 8577-8592
-------- -------- --- Thank You --- -------- --------

Using a Gaussian width (1/beta) of 0.320163 nm for Ewald
Potential shift: LJ r^-12: -1.000e+00 r^-6: -1.000e+00, Ewald -1.000e-05
Initialized non-bonded Ewald tables, spacing: 9.33e-04 size: 1073

Generated table with 1000 data points for 1-4 COUL.
Tabscale = 500 points/nm
Generated table with 1000 data points for 1-4 LJ6.
Tabscale = 500 points/nm
Generated table with 1000 data points for 1-4 LJ12.
Tabscale = 500 points/nm

Using GPU 8x8 nonbonded short-range kernels

Using a 8x8 pair-list setup:
  updated every 100 steps, buffer 0.000 nm, rlist 1.000 nm
At tolerance 0.005 kJ/mol/ps per atom, equivalent classical 1x1 list would be:
  updated every 100 steps, buffer 0.000 nm, rlist 1.000 nm
Using full Lennard-Jones parameter combination matrix
Long Range LJ corr.: <C6> 4.0128e-03

NOTE: This run uses the 'GPU halo exchange' feature, enabled by the GMX_GPU_DD_COMMS environment variable.
Removing pbc first time

Initializing Parallel LINear Constraint Solver

++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
B. Hess
P-LINCS: A Parallel Linear Constraint Solver for molecular simulation
J. Chem. Theory Comput. 4 (2008) pp. 116-122
-------- -------- --- Thank You --- -------- --------

The number of constraints is 92788
There are constraints between atoms in different decomposition domains,
will communicate selected coordinates each lincs iteration
Linking all bonded interactions to atoms
Intra-simulation communication will occur every 100 steps.

++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
N. Goga and A. J. Rzepiela and A. H. de Vries and S. J. Marrink and H. J. C.
Berendsen
Efficient Algorithms for Langevin and DPD Dynamics
J. Chem. Theory Comput. 8 (2012) pp. 3637--3649
-------- -------- --- Thank You --- -------- --------

Our emails are crossing. I had earlier sent the log file which you asked for, and later saw that you asked for items in quotes. The log file must be a mess. I will gladly resend, but will wait for your response.

Re the nvlink environment variables, I simply exported them through bash.

Best,
Paul

Paul, can you please check whether the just released 2021-beta1 also hangs? Also, please try to use fewer ranks as suggested earlier.

I will most certainly give it a go with fewer ranks and let you know ASAP, most likely tomorrow.
Paul

I installed 2020.1 without issue (g++/gcc 7.5, cmake 3.18, CUDA 11.1).

The outcome of the NVT run without the nvlink variables is below; the log file for the failed nvlink attempt is attached.

Let me know if you would like some other file or whether I should make a change.

Best,

Paul

(Attachment PE.sys.LB.nvt.log is missing)

The log attachment for the nvlink attempt was rejected, so I renamed it to a .dat:

PE.sys.LB.nvt.log.nvlink.dat (18.2 KB)

It looks like this still hangs? Can you please open an issue on https://gitlab.com/gromacs/gromacs/-/issues and attach the log as well as the inputs.

Have you tried fewer ranks, e.g. 2, 4?