Hi Carsten,
Thanks for the suggestion. I have now tested it with three different GPUs: NVIDIA GeForce GTX 680, Titan Black, and RTX 2080 Ti. All gave the same result. I am attaching the debug file from mdrun; maybe it can shed some light on the issue.
Contents of gmx_mpi0.debug:
In gmx_physicalnode_id_hash: hash 2087638387
hw_opt: nt 0 ntmpi 0 ntomp 6 ntomp_pme 0 gpu_id '' gputasks ''
graph part nchanged=2, bMultiPart=false
graph part nchanged=0, bMultiPart=false
nr. of distance calculations in bondeds: C 0.0 SIMD 0.0
Average nr of pbc_dx calls per atom 0.00
nc 3 1 1 1 1 vol pp 0.0698 pbcdx 0.0000 pme 0.000e+00 tot 6.983e-02
Bonded atom communication beyond the cut-off: false
cellsize limit 1.000000
Domain decomposition rank 0, coordinates 0 0 0
The DD cut-off is 3.156348
Volume fraction for all DD zones: 0.356610
DD rank 0 neighbor ranks in dir 0 are + 1 - 2
Making load communicators
Finished making load communicators
In gmx_physicalnode_id_hash: hash 2087638387
In gmx_setup_nodecomm: splitting communicator of size 3
In gmx_setup_nodecomm: node ID 0 rank within node 0
In gmx_setup_nodecomm: groups 1, my group size 3
In gmx_setup_nodecomm: not unsing separate inter- and intra-node communicators.
Non-default affinity mask found
hw_opt: nt 0 ntmpi 0 ntomp 6 ntomp_pme 6 gpu_id '' gputasks ''
In gmx_physicalnode_id_hash: hash 2087638387
dd_setup_dd_dlb_gpu_sharing:
DD PP rank 0 physical node hash 2087638387 gpu_id 0
nrank_gpu_shared 1
There are 4 atom types in the system, adding one for nbnxn_atomdata_t
Combination rules: geometric false Lorentz-Berthelot true
Initialized CUDA data structures.
Neighbor-list balancing parameter: 2992 (auto-adjusted to the number of GPU multi-processors)
graph part nchanged=2, bMultiPart=false
graph part nchanged=0, bMultiPart=false
Installing signal handler for SIGTERM
Installing signal handler for SIGINT
Installing signal handler for SIGUSR1
wh =0.333333, rc = 0.075695, ra = 0.0390588
rb = 0.0195294, irc2 = 6.60546, dHH = 0.15139, dOH = 0.09572
The total size of the atom to interaction index is 5 integers
Home charge groups:
0 1 5 6 14 15 16 18 19 20
21 22 23 24 25 28 29 30 31 34
… I removed part of the list manually to make it shorter …
9438 9439 9440 9441 9442 9443 9444 9445 9446 9447
9448 9449 9450 9451 9452 9453 9454 9455 9456 9457
9458
Resizing state: currently 0, required 6561
Resizing state: currently 6561, required 6561
Changing the number of halo communication pulses along dim X from 0 to 1
cell_x[0] 0.000000 - 45.200001 skew_fac 1.000000
cell_x[1] 0.000000 - 135.600006 skew_fac 1.000000
cell_x[2] 0.000000 - 135.600006 skew_fac 1.000000
Set grid boundaries dim 0: 0.000000 45.200001
Set grid boundaries dim 1: 0.000000 135.600006
Set grid boundaries dim 2: 0.000000 135.600006
zone 0 0.000 - 45.200 0.000 - 135.600 0.000 - 135.600
zone 0 bb 0.000 - 45.200 0.000 - 135.600 0.000 - 135.600
natoms_local = 6561 atom_density = 0.0
ns na_sc 64 na_c 8 super-cells: 109 x 2 y 6 z 9.1 maxz 32
cell_offset 0 sorting columns 0 - 2, atoms 0 - 6561
cell_offset 0 sorting columns 4 - 6, atoms 0 - 6561
cell_offset 0 sorting columns 10 - 12, atoms 0 - 6561
cell_offset 0 sorting columns 2 - 4, atoms 0 - 6561
cell_offset 0 sorting columns 6 - 8, atoms 0 - 6561
cell_offset 0 sorting columns 8 - 10, atoms 0 - 6561
ns non-zero sub-cells: 825 average atoms 7.95
ns bb: grid 11.30 11.30 7.94 abs 1.00 1.02 1.33 rel 0.09 0.09 0.17
Step 0, sorting the 6561 home charge groups
Set the new home atom count to 6561
Resizing state: currently 6561, required 6561
Setting up DD communication
bBondComm false, r_bc 0.078174
Resizing state: currently 6561, required 6562
Finished setting up DD communication, zones: 6561 1
zone 1 45.200 - 48.356 0.000 - 135.600 0.000 - 135.600
zone 1 bb 45.200 - 48.356 0.000 - 135.600 0.000 - 135.600
Making local topology
Two-body bonded cut-off distance is 3.15635
dim 0 cellmin 45.200001 bonded rcheck[0] = 0, bRCheck2B = false
dim 1 cellmin 135.600006 bonded rcheck[1] = 0, bRCheck2B = false
dim 2 cellmin 135.600006 bonded rcheck[2] = 0, bRCheck2B = false
We have 19548 exclusions, check count 0
Resizing state: currently 6562, required 6562
Division of bondeds over threads:
wh =0.0559503, rc = 0.075695, ra = 0.00655606
rb = 0.0520322, irc2 = 6.60546, dHH = 0.15139, dOH = 0.09572
vcm: start=0, homenr=6561, end=6561
Summing 12 energies
cell_x[0] 0.000000 - 45.200001 skew_fac 1.000000
cell_x[1] 0.000000 - 135.600006 skew_fac 1.000000
cell_x[2] 0.000000 - 135.600006 skew_fac 1.000000
Sending ddim 0 dir 0: ncg 0 nat 0
Sending ddim 0 dir 1: ncg 0 nat 0
Resizing state: currently 6562, required 6561
Finished repartitioning: cgs moved out 0, new home 6561
Set grid boundaries dim 0: 0.000000 45.200001
Set grid boundaries dim 1: 0.000000 135.600006
Set grid boundaries dim 2: 0.000000 135.600006
zone 0 0.000 - 45.200 0.000 - 135.600 0.000 - 135.600
zone 0 bb 0.000 - 45.200 0.000 - 135.600 0.000 - 135.600
natoms_local = 6561 atom_density = 0.0
ns na_sc 64 na_c 8 super-cells: 109 x 2 y 6 z 9.1 maxz 32
cell_offset 0 sorting columns 0 - 2, atoms 0 - 6561
cell_offset 0 sorting columns 10 - 12, atoms 0 - 6561
cell_offset 0 sorting columns 6 - 8, atoms 0 - 6561
cell_offset 0 sorting columns 4 - 6, atoms 0 - 6561
cell_offset 0 sorting columns 8 - 10, atoms 0 - 6561
cell_offset 0 sorting columns 2 - 4, atoms 0 - 6561
ns non-zero sub-cells: 825 average atoms 7.95
ns bb: grid 11.30 11.30 7.94 abs 1.00 1.02 1.33 rel 0.09 0.09 0.17
Step 0, sorting the 6561 home charge groups
Set the new home atom count to 6561
Resizing state: currently 6561, required 6561
Setting up DD communication
bBondComm false, r_bc 0.078174
Resizing state: currently 6561, required 6562
Finished setting up DD communication, zones: 6561 1
zone 1 45.200 - 48.356 0.000 - 135.600 0.000 - 135.600
zone 1 bb 45.200 - 48.356 0.000 - 135.600 0.000 - 135.600
Making local topology
Two-body bonded cut-off distance is 3.15635
dim 0 cellmin 45.200001 bonded rcheck[0] = 0, bRCheck2B = false
dim 1 cellmin 135.600006 bonded rcheck[1] = 0, bRCheck2B = false
dim 2 cellmin 135.600006 bonded rcheck[2] = 0, bRCheck2B = false
We have 19548 exclusions, check count 0
Resizing state: currently 6562, required 6562
Division of bondeds over threads:
ns na_sc 64 na_c 8 super-cells: 1 x 2 y 2 z 0.2 maxz 1
cell_offset 109 sorting columns 2 - 2, atoms 6561 - 6562
cell_offset 109 sorting columns 1 - 2, atoms 6561 - 6562
cell_offset 109 sorting columns 3 - 4, atoms 6561 - 6562
cell_offset 109 sorting columns 2 - 3, atoms 6561 - 6562
cell_offset 109 sorting columns 0 - 0, atoms 6561 - 6562
cell_offset 109 sorting columns 0 - 1, atoms 6561 - 6562
ns non-zero sub-cells: 1 average atoms 1.00
ns bb: grid 1.58 67.80 9.47 abs 0.00 0.00 0.00 rel 0.00 0.00 0.00
ns making 6 nblists
nsp_est local 11550.0 non-local 205.1
nbl nsp estimate 11550.0, nsubpair_target 36
ns search grid 0 vs 0
nbl bounding box only distance 0.000000
nbl bounding box only distance 0.000000
nbl nc_i 109 col.av. 9.1 ci_block 4
nbl bounding box only distance 0.000000
nbl bounding box only distance 0.000000
nbl bounding box only distance 0.000000
nbl bounding box only distance 0.000000
nbl nc_i 109 col.av. 9.1 ci_block 4
nbl nc_i 109 col.av. 9.1 ci_block 4
nbl nc_i 109 col.av. 9.1 ci_block 4
nbl nc_i 109 col.av. 9.1 ci_block 4
nbl nc_i 109 col.av. 9.1 ci_block 4
number of distance checks 26224
nbl nsci 222 ncj4 239 nsi 6967 excl4 127
nbl na_c 8 rl 3 ncp 6967 per cell 8.4 atoms 67.6 ratio 150.44
nbl #cluster-pairs: av 31.4 stddev 3.7 max 36
nbl j-list #i-subcell 0 20 2.1
nbl j-list #i-subcell 1 16 1.7
nbl j-list #i-subcell 2 16 1.7
nbl j-list #i-subcell 3 16 1.7
nbl j-list #i-subcell 4 36 3.8
number of distance checks 24752
number of distance checks 41744
nbl nsci 266 ncj4 282 nsi 8361 excl4 142
nbl j-list #i-subcell 5 15 1.6
nbl j-list #i-subcell 6 15 1.6
nbl na_c 8 rl 3 ncp 8361 per cell 10.1 atoms 81.1 ratio 180.55
nbl j-list #i-subcell 7 14 1.5
nbl j-list #i-subcell 8 808 84.5
nbl #cluster-pairs: av 31.4 stddev 3.7 max 36
nbl j-list #i-subcell 0 26 2.3
nbl j-list #i-subcell 1 17 1.5
nbl j-list #i-subcell 2 16 1.4
nbl j-list #i-subcell 3 16 1.4
number of distance checks 31216
nbl j-list #i-subcell 4 16 1.4
nbl nsci 395 ncj4 410 nsi 12358 excl4 162
nbl j-list #i-subcell 5 16 1.4
nbl na_c 8 rl 3 ncp 12358 per cell 15.0 atoms 119.8 ratio 266.86
nbl nsci 338 ncj4 344 nsi 10365 excl4 119
number of distance checks 37680
nbl na_c 8 rl 3 ncp 10365 per cell 12.6 atoms 100.5 ratio 223.82
nbl nsci 488 ncj4 504 nsi 15339 excl4 160
nbl #cluster-pairs: av 31.3 stddev 4.0 max 36
nbl na_c 8 rl 3 ncp 15339 per cell 18.6 atoms 148.7 ratio 331.23
nbl j-list #i-subcell 6 16 1.4
nbl #cluster-pairs: av 30.7 stddev 4.8 max 36
nbl j-list #i-subcell 7 16 1.4
nbl j-list #i-subcell 0 26 1.9
nbl j-list #i-subcell 1 16 1.2
nbl j-list #i-subcell 2 15 1.1
nbl j-list #i-subcell 3 15 1.1
nbl #cluster-pairs: av 31.4 stddev 3.6 max 48
nbl j-list #i-subcell 4 16 1.2
nbl j-list #i-subcell 8 989 87.7
nbl j-list #i-subcell 0 30 1.5
nbl j-list #i-subcell 5 16 1.2
nbl j-list #i-subcell 6 15 1.1
nbl j-list #i-subcell 7 16 1.2
nbl j-list #i-subcell 1 20 1.0
nbl j-list #i-subcell 0 31 1.9
nbl j-list #i-subcell 1 20 1.2
nbl j-list #i-subcell 2 19 1.2
nbl j-list #i-subcell 2 20 1.0
nbl j-list #i-subcell 8 1241 90.2
nbl j-list #i-subcell 3 18 1.1
nbl j-list #i-subcell 4 17 1.0
nbl j-list #i-subcell 5 17 1.0
nbl j-list #i-subcell 6 17 1.0
nbl j-list #i-subcell 7 17 1.0
nbl j-list #i-subcell 8 1484 90.5
nbl j-list #i-subcell 3 20 1.0
nbl j-list #i-subcell 4 19 0.9
nbl j-list #i-subcell 5 19 0.9
nbl j-list #i-subcell 6 19 0.9
nbl j-list #i-subcell 7 18 0.9
nbl j-list #i-subcell 8 1851 91.8
number of distance checks 40176
nbl nsci 441 ncj4 480 nsi 14571 excl4 147
nbl na_c 8 rl 3 ncp 14571 per cell 17.7 atoms 141.3 ratio 314.64
nbl #cluster-pairs: av 33.0 stddev 6.6 max 64
nbl j-list #i-subcell 0 28 1.5
nbl j-list #i-subcell 1 20 1.0
nbl j-list #i-subcell 2 21 1.1
nbl j-list #i-subcell 3 20 1.0
nbl j-list #i-subcell 4 20 1.0
nbl j-list #i-subcell 5 20 1.0
nbl j-list #i-subcell 6 20 1.0
nbl j-list #i-subcell 7 19 1.0
nbl j-list #i-subcell 8 1752 91.2
nbl nsci 2150 ncj4 2259 nsi 67961 excl4 857
nbl na_c 8 rl 3 ncp 67961 per cell 82.4 atoms 659.0 ratio 1467.53
nbl #cluster-pairs: av 31.6 stddev 4.7 max 64
nbl j-list #i-subcell 0 161 1.8
nbl j-list #i-subcell 1 109 1.2
nbl j-list #i-subcell 2 107 1.2
nbl j-list #i-subcell 3 105 1.2
nbl j-list #i-subcell 4 124 1.4
nbl j-list #i-subcell 5 103 1.1
nbl j-list #i-subcell 6 102 1.1
nbl j-list #i-subcell 7 100 1.1
nbl j-list #i-subcell 8 8125 89.9
Non-bonded GPU launch configuration:
Thread block: 8x8x1
Grid: 2150x1
#Super-clusters/clusters: 17200/8 (8)
ShMem: 1568
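
As a quick sanity check on the pair-list part of the log, here is a minimal sketch (my own helper, not part of GROMACS) that parses the `nbl nsci ... ncj4 ... nsi ... excl4` summary lines and compares the six smaller lists against the last, larger one. The names `LOG`, `PATTERN` and `parse_nbl` are hypothetical. I am assuming, without having checked the source, that the last line is the combined list built from the six lists announced by "ns making 6 nblists".

```python
import re

# nbl summary lines copied from the debug output above, in log order
LOG = """\
nbl nsci 222 ncj4 239 nsi 6967 excl4 127
nbl nsci 266 ncj4 282 nsi 8361 excl4 142
nbl nsci 395 ncj4 410 nsi 12358 excl4 162
nbl nsci 338 ncj4 344 nsi 10365 excl4 119
nbl nsci 488 ncj4 504 nsi 15339 excl4 160
nbl nsci 441 ncj4 480 nsi 14571 excl4 147
nbl nsci 2150 ncj4 2259 nsi 67961 excl4 857
"""

PATTERN = re.compile(r"nbl nsci (\d+) ncj4 (\d+) nsi (\d+) excl4 (\d+)")


def parse_nbl(text):
    """Return one (nsci, ncj4, nsi, excl4) tuple per matching log line."""
    return [tuple(int(v) for v in m.groups()) for m in PATTERN.finditer(text)]


rows = parse_nbl(LOG)
# Assumption: the first six rows are the partial lists, the last is the combined one
per_thread, combined = rows[:-1], rows[-1]

# Sum each column over the six partial lists
totals = tuple(sum(col) for col in zip(*per_thread))

print("sum of partial lists:", totals)
print("combined list:       ", combined)
print("consistent:", totals == combined)
```

For this log the column sums match the last line, and the combined nsci value (2150) is the same number that appears in "Grid: 2150x1", so at least the list construction looks internally consistent here.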