As scientists we daily know and expect failure in our experiments. That’s why it’s called RE-search and not Search.
Be well.
Paul
Please try to use the quoting feature of the editor, as custom markers in inline replies are hard to follow.
Yes, thread-MPI is enabled by default in all GROMACS builds unless -DGMX_MPI=ON is passed to cmake; for more details see the user guide.
No, thread-MPI is enabled by default.
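For reference, a minimal sketch of the two configurations (the source directory layout and any other options are up to you):
cmake ..                  # default: gmx built with the internal thread-MPI
cmake .. -DGMX_MPI=ON     # build against an external MPI library (typically produces gmx_mpi)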
If you set them in GMXRC they will persist over all invocations of gmx. Depending on your shell you can set them in various ways (I suggest checking a general shell guide); you can also set them simply as a prefix to the command. The value does not matter, they just need to be set. E.g. in bash:
GMX_GPU_PME_PP_COMMS=1 GMX_GPU_DD_COMMS=1 gmx mdrun ...
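Alternatively (a minimal sketch, assuming a bash-like shell), you can export the variables once so that they apply to every later gmx invocation in the same session:
export GMX_GPU_PME_PP_COMMS=1
export GMX_GPU_DD_COMMS=1
gmx mdrun ...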
I just wanted to thank you as well for actually trying this out for a
set of serious use cases. I think this will help us iron out the
remaining bugs for 2021.
Cheers
Paul
Are you referring to the “Peer access enabled…” note in the log? If that is missing, it could indicate that the setup phase of GPU direct communication is the problem (and in that case the issue may be external to GROMACS). Can you please share the full log? This would be useful information because, as far as I know, this part of the code has not changed in the 2021 code.
The number of ranks (-ntmpi) is not ideal in your launch, so you could try fewer ranks, e.g. -ntmpi 2 or -ntmpi 4 (though in principle this should only affect performance).
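For example, a possible launch with 4 ranks (the -ntomp value here is only an illustrative choice that keeps the total thread count at 64, not a tested recommendation):
gmx mdrun -deffnm PE.sys.LB.nvt -nb gpu -pme gpu -ntmpi 4 -ntomp 16 -npme 1 -nsteps 100000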
Here you are…
Command sequence (log follows; sorry, but I did not see a way to attach the log):
GROMACS: gmx mdrun, version 2020.4
Executable: /usr/local/gromacs/bin/gmx
Data prefix: /usr/local/gromacs
Working dir: /home/pb/Desktop/PE sys
Command line:
gmx mdrun -deffnm PE.sys.LB.nvt -nb gpu -pme gpu -ntomp 4 -ntmpi 16 -npme 1 -nsteps 100000
Back Off! I just backed up PE.sys.LB.nvt.log to ./#PE.sys.LB.nvt.log.9#
Reading file PE.sys.LB.nvt.tpr, VERSION 2020.4 (single precision)
Enabling GPU buffer operations required by GMX_GPU_DD_COMMS (equivalent with GMX_USE_GPU_BUFFER_OPS=1).
This run uses the 'GPU halo exchange' feature, enabled by the GMX_GPU_DD_COMMS environment variable.
This run uses the 'GPU PME-PP communications' feature, enabled by the GMX_GPU_PME_PP_COMMS environment variable.
Overriding nsteps with value passed on the command line: 100000 steps, 100 ps
On host TR1 2 GPUs selected for this run.
Mapping of GPU IDs to the 16 GPU tasks in the 16 ranks on this node:
PP:0,PP:0,PP:0,PP:0,PP:0,PP:0,PP:0,PP:0,PP:1,PP:1,PP:1,PP:1,PP:1,PP:1,PP:1,PME:1
PP tasks will do (non-perturbed) short-ranged interactions on the GPU
PP task will update and constrain coordinates on the CPU
PME tasks will do all aspects on the GPU
Using 16 MPI threads
Using 4 OpenMP threads per tMPI thread
NOTE: This run uses the 'GPU halo exchange' feature, enabled by the GMX_GPU_DD_COMMS environment variable.
Back Off! I just backed up PE.sys.LB.nvt.trr to ./#PE.sys.LB.nvt.trr.6#
Back Off! I just backed up PE.sys.LB.nvt.edr to ./#PE.sys.LB.nvt.edr.6#
NOTE: DLB will not turn on during the first phase of PME tuning
starting mdrun 'PE system TMPTA EGDMA IPA'
100000 steps, 100.0 ps.
[2]+ Stopped gmx mdrun -deffnm PE.sys.LB.nvt -nb gpu -pme gpu -ntomp 4 -ntmpi 16 -npme 1 -nsteps 100000
=============== log =================
GROMACS version: 2020.4
Verified release checksum is 79c2857291b034542c26e90512b92fd4b184a1c9d6fa59c55f2e24ccf14e7281
Precision: single
Memory model: 64 bit
MPI library: thread_mpi
OpenMP support: enabled (GMX_OPENMP_MAX_THREADS = 64)
GPU support: CUDA
SIMD instructions: AVX2_128
FFT library: fftw-3.3.8-sse2-avx-avx2-avx2_128
RDTSCP usage: enabled
TNG support: enabled
Hwloc support: disabled
Tracing support: disabled
C compiler: /usr/bin/cc GNU 7.5.0
C compiler flags: -mavx2 -mfma -fexcess-precision=fast -funroll-all-loops -O3 -DNDEBUG
C++ compiler: /usr/bin/c++ GNU 7.5.0
C++ compiler flags: -mavx2 -mfma -fexcess-precision=fast -funroll-all-loops -fopenmp -O3 -DNDEBUG
CUDA compiler: /usr/local/cuda/bin/nvcc nvcc: NVIDIA (R) Cuda compiler driver;Copyright (c) 2005-2020 NVIDIA Corporation;Built on Tue_Sep_15_19:10:02_PDT_2020;Cuda compilation tools, release 11.1, V11.1.74;Build cuda_11.1.TC455_06.29069683_0
CUDA compiler flags:-gencode;arch=compute_75,code=sm_75;-use_fast_math;-D_FORCE_INLINES;-mavx2 -mfma -fexcess-precision=fast -funroll-all-loops -fopenmp -O3 -DNDEBUG
CUDA driver: 11.10
CUDA runtime: 11.10
Running on 1 node with total 32 cores, 64 logical cores, 2 compatible GPUs
Hardware detected:
CPU info:
Vendor: AMD
Brand: AMD Ryzen Threadripper 2990WX 32-Core Processor
Family: 23 Model: 8 Stepping: 2
Features: aes amd apic avx avx2 clfsh cmov cx8 cx16 f16c fma htt lahf misalignsse mmx msr nonstop_tsc pclmuldq pdpe1gb popcnt pse rdrnd rdtscp sha sse2 sse3 sse4a sse4.1 sse4.2 ssse3
Hardware topology: Basic
Sockets, cores, and logical processors:
Socket 0: [ 0 1] [ 2 3] [ 4 5] [ 6 7] [ 8 9] [ 10 11] [ 12 13] [ 14 15] [ 32 33] [ 34 35] [ 36 37] [ 38 39] [ 40 41] [ 42 43] [ 44 45] [ 46 47] [ 16 17] [ 18 19] [ 20 21] [ 22 23] [ 24 25] [ 26 27] [ 28 29] [ 30 31] [ 48 49] [ 50 51] [ 52 53] [ 54 55] [ 56 57] [ 58 59] [ 60 61] [ 62 63]
GPU info:
Number of GPUs detected: 2
#0: NVIDIA GeForce RTX 2080 Ti, compute cap.: 7.5, ECC: no, stat: compatible
#1: NVIDIA GeForce RTX 2080 Ti, compute cap.: 7.5, ECC: no, stat: compatible
++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
...
...
++++ PLEASE CITE THE DOI FOR THIS VERSION OF GROMACS ++++
https://doi.org/10.5281/zenodo.4054979
-------- -------- --- Thank You --- -------- --------
Enabling GPU buffer operations required by GMX_GPU_DD_COMMS (equivalent with GMX_USE_GPU_BUFFER_OPS=1).
This run uses the 'GPU halo exchange' feature, enabled by the GMX_GPU_DD_COMMS environment variable.
This run uses the 'GPU PME-PP communications' feature, enabled by the GMX_GPU_PME_PP_COMMS environment variable.
Input Parameters:
integrator = sd
tinit = 0
dt = 0.001
nsteps = 500000
init-step = 0
simulation-part = 1
comm-mode = Linear
nstcomm = 100
bd-fric = 0
ld-seed = -436869696
emtol = 10
emstep = 0.01
niter = 20
fcstep = 0
nstcgsteep = 1000
nbfgscorr = 10
rtpi = 0.05
nstxout = 5000
nstvout = 5000
nstfout = 0
nstlog = 5000
nstcalcenergy = 100
nstenergy = 5000
nstxout-compressed = 0
compressed-x-precision = 1000
cutoff-scheme = Verlet
nstlist = 100
pbc = xyz
periodic-molecules = false
verlet-buffer-tolerance = 0.005
rlist = 1
coulombtype = PME
coulomb-modifier = Potential-shift
rcoulomb-switch = 0
rcoulomb = 1
epsilon-r = 1
epsilon-rf = inf
vdw-type = Cut-off
vdw-modifier = Potential-shift
rvdw-switch = 0
rvdw = 1
DispCorr = EnerPres
table-extension = 1
fourierspacing = 0.416
fourier-nx = 36
fourier-ny = 72
fourier-nz = 2560
pme-order = 4
ewald-rtol = 1e-05
ewald-rtol-lj = 0.001
lj-pme-comb-rule = Geometric
ewald-geometry = 0
epsilon-surface = 0
tcoupl = No
nsttcouple = -1
nh-chain-length = 0
print-nose-hoover-chain-variables = false
pcoupl = No
pcoupltype = Isotropic
nstpcouple = -1
tau-p = 1
compressibility (3x3):
compressibility[ 0]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
compressibility[ 1]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
compressibility[ 2]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
ref-p (3x3):
ref-p[ 0]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
ref-p[ 1]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
ref-p[ 2]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
refcoord-scaling = No
posres-com (3):
posres-com[0]= 0.00000e+00
posres-com[1]= 0.00000e+00
posres-com[2]= 0.00000e+00
posres-comB (3):
posres-comB[0]= 0.00000e+00
posres-comB[1]= 0.00000e+00
posres-comB[2]= 0.00000e+00
QMMM = false
QMconstraints = 0
QMMMscheme = 0
MMChargeScaleFactor = 1
qm-opts:
ngQM = 0
constraint-algorithm = Lincs
continuation = false
Shake-SOR = false
shake-tol = 0.0001
lincs-order = 4
lincs-iter = 1
lincs-warnangle = 30
nwall = 0
wall-type = 9-3
wall-r-linpot = -1
wall-atomtype[0] = -1
wall-atomtype[1] = -1
wall-density[0] = 0
wall-density[1] = 0
wall-ewald-zfac = 3
pull = false
awh = false
rotation = false
interactiveMD = false
disre = No
disre-weighting = Conservative
disre-mixed = false
dr-fc = 1000
dr-tau = 0
nstdisreout = 100
orire-fc = 0
orire-tau = 0
nstorireout = 100
free-energy = no
cos-acceleration = 0
deform (3x3):
deform[ 0]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
deform[ 1]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
deform[ 2]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
simulated-tempering = false
swapcoords = no
userint1 = 0
userint2 = 0
userint3 = 0
userint4 = 0
userreal1 = 0
userreal2 = 0
userreal3 = 0
userreal4 = 0
applied-forces:
electric-field:
x:
E0 = 0
omega = 0
t0 = 0
sigma = 0
y:
E0 = 0
omega = 0
t0 = 0
sigma = 0
z:
E0 = 0
omega = 0
t0 = 0
sigma = 0
density-guided-simulation:
active = false
group = protein
similarity-measure = inner-product
atom-spreading-weight = unity
force-constant = 1e+09
gaussian-transform-spreading-width = 0.2
gaussian-transform-spreading-range-in-multiples-of-width = 4
reference-density-filename = reference.mrc
nst = 1
normalize-densities = true
adaptive-force-scaling = false
adaptive-force-scaling-time-constant = 4
grpopts:
nrdf: 63203 19599.7 24399.6 27099.6 64999
ref-t: 310 310 310 310 310
tau-t: 0.1 0.1 0.1 0.1 0.1
annealing: No No No No No
annealing-npoints: 0 0 0 0 0
acc: 0 0 0
nfreeze: N N N
energygrp-flags[ 0]: 0
The -nsteps functionality is deprecated, and may be removed in a future version. Consider using gmx convert-tpr -nsteps or changing the appropriate .mdp file field.
Overriding nsteps with value passed on the command line: 100000 steps, 100 ps
Initializing Domain Decomposition on 16 ranks
Dynamic load balancing: auto
Minimum cell size due to atom displacement: 0.514 nm
Initial maximum distances in bonded interactions:
two-body bonded interactions: 0.414 nm, LJ-14, atoms 50463 50466
multi-body bonded interactions: 0.414 nm, Proper Dih., atoms 50463 50466
Minimum cell size due to bonded interactions: 0.455 nm
Maximum distance for 5 constraints, at 120 deg. angles, all-trans: 0.774 nm
Estimated maximum distance required for P-LINCS: 0.774 nm
This distance will limit the DD cell size, you can override this with -rcon
Scaling the initial minimum size with 1/0.8 (option -dds) = 1.25
Using 1 separate PME ranks
Optimizing the DD grid for 15 cells with a minimum initial size of 1.250 nm
The maximum allowed number of cells is: X 11 Y 22 Z 800
Domain decomposition grid 1 x 1 x 15, separate PME ranks 1
PME domain decomposition: 1 x 1 x 1
Interleaving PP and PME ranks
This rank does only particle-particle work.
Domain decomposition rank 0, coordinates 0 0 0
The initial number of communication pulses is: Z 1
The initial domain decomposition cell size is: Z 66.67 nm
The maximum allowed distance for atoms involved in interactions is:
non-bonded interactions 1.000 nm
two-body bonded interactions (-rdd) 1.000 nm
multi-body bonded interactions (-rdd) 1.000 nm
atoms separated by up to 5 constraints (-rcon) 14.200 nm
When dynamic load balancing gets turned on, these settings will change to:
The maximum number of communication pulses is: Z 1
The minimum size for domain decomposition cells is 1.000 nm
The requested allowed shrink of DD cells (option -dds) is: 0.80
The allowed shrink of domain decomposition cells is: Z 0.02
The maximum allowed distance for atoms involved in interactions is:
non-bonded interactions 1.000 nm
two-body bonded interactions (-rdd) 1.000 nm
multi-body bonded interactions (-rdd) 1.000 nm
atoms separated by up to 5 constraints (-rcon) 1.000 nm
On host TR1 2 GPUs selected for this run.
Mapping of GPU IDs to the 16 GPU tasks in the 16 ranks on this node:
PP:0,PP:0,PP:0,PP:0,PP:0,PP:0,PP:0,PP:0,PP:1,PP:1,PP:1,PP:1,PP:1,PP:1,PP:1,PME:1
PP tasks will do (non-perturbed) short-ranged interactions on the GPU
PP task will update and constrain coordinates on the CPU
PME tasks will do all aspects on the GPU
Using 16 MPI threads
Using 4 OpenMP threads per tMPI thread
Note: Peer access enabled between the following GPU pairs in the node:
0->1 1->0
Pinning threads with an auto-selected logical core stride of 1
System total charge: 0.000
Will do PME sum in reciprocal space for electrostatic interactions.
++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
U. Essmann, L. Perera, M. L. Berkowitz, T. Darden, H. Lee and L. G. Pedersen
A smooth particle mesh Ewald method
J. Chem. Phys. 103 (1995) pp. 8577-8592
-------- -------- --- Thank You --- -------- --------
Using a Gaussian width (1/beta) of 0.320163 nm for Ewald
Potential shift: LJ r^-12: -1.000e+00 r^-6: -1.000e+00, Ewald -1.000e-05
Initialized non-bonded Ewald tables, spacing: 9.33e-04 size: 1073
Generated table with 1000 data points for 1-4 COUL.
Tabscale = 500 points/nm
Generated table with 1000 data points for 1-4 LJ6.
Tabscale = 500 points/nm
Generated table with 1000 data points for 1-4 LJ12.
Tabscale = 500 points/nm
Using GPU 8x8 nonbonded short-range kernels
Using a 8x8 pair-list setup:
updated every 100 steps, buffer 0.000 nm, rlist 1.000 nm
At tolerance 0.005 kJ/mol/ps per atom, equivalent classical 1x1 list would be:
updated every 100 steps, buffer 0.000 nm, rlist 1.000 nm
Using full Lennard-Jones parameter combination matrix
Long Range LJ corr.: <C6> 4.0128e-03
NOTE: This run uses the 'GPU halo exchange' feature, enabled by the GMX_GPU_DD_COMMS environment variable.
Removing pbc first time
Initializing Parallel LINear Constraint Solver
++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
B. Hess
P-LINCS: A Parallel Linear Constraint Solver for molecular simulation
J. Chem. Theory Comput. 4 (2008) pp. 116-122
-------- -------- --- Thank You --- -------- --------
The number of constraints is 92788
There are constraints between atoms in different decomposition domains,
will communicate selected coordinates each lincs iteration
Linking all bonded interactions to atoms
Intra-simulation communication will occur every 100 steps.
++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
N. Goga and A. J. Rzepiela and A. H. de Vries and S. J. Marrink and H. J. C.
Berendsen
Efficient Algorithms for Langevin and DPD Dynamics
J. Chem. Theory Comput. 8 (2012) pp. 3637--3649
-------- -------- --- Thank You --- -------- --------
Our emails are crossing. I had earlier sent the log file which you had asked for, and later saw that you asked for items in quotes. The log file must be a mess. I will gladly resend, but will wait for your response.
Re the NVLink commands: I simply exported them through bash.
Best,
Paul
Paul, can you please check whether the just-released 2021-beta1 also hangs? Also, please try to use fewer ranks, as suggested earlier.
I will most certainly give it a go with fewer ranks and let you know ASAP, most likely tomorrow.
Paul
I installed 2020.1 without issue (g++/gcc 7.5, cmake 3.18, CUDA 11.1).
The outcome of the NVT ensemble run without trying NVLink is below; the log file for the failed NVLink attempt is attached.
Let me know if you like some other file or to make a change.
Best,
Paul
(Attachment PE.sys.LB.nvt.log is missing)
The log attachment for the NVLink attempt was rejected, so I renamed it to .dat:
PE.sys.LB.nvt.log.nvlink.dat (18.2 KB)
Looks like this still hangs? Can you please open an issue on https://gitlab.com/gromacs/gromacs/-/issues and attach the log as well as the inputs.
Have you tried fewer ranks, e.g. 2 or 4?