FMM/GROMACS segmentation fault

GROMACS version: 2019, SPPEXA | Max Planck Institute for Multidisciplinary Sciences
GROMACS modification: Yes

Hi Everyone, I have tried the GROMACS/FMM (fast multipole method) installation from the webpage mentioned above, and it ended up with a segfault. The test system is the aerosol system provided on that page.

The fault occurs before the first MD step, and the debug output from mdrun shows no problem.
Also, the issue arises only when the FMM routine is switched on, i.e. mdrun from the same installation without FMM works fine.

Both GMX_USE_BUFFER_OPS and FMM_SPARSE were set to 1.
GROMACS is compiled with gcc 8.3.1 and cmake 3.14.3.
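For completeness, the run was launched roughly along these lines (the input file name and rank count are just placeholders, not my exact command):

export GMX_USE_BUFFER_OPS=1
export FMM_SPARSE=1
mpirun -np 3 gmx_mpi mdrun -deffnm aerosol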

The issue seems similar to the one in "Gromacs with fast multipole method? (building issues/stability of port)"; unfortunately, that post does not explain how to resolve it.
Thanks in advance for any feedback.

Sincerely,
Alex

Hi,

what GPU model are you running that on?

Best,
Carsten

Hi Carsten,
Thanks for the suggestion. I have now tested it with three different GPUs: NVIDIA GeForce GTX 680, Titan Black, and RTX 2080 Ti. All gave the same result. I am attaching the debug file from mdrun; maybe it can shed some light on the issue.

Content of gmx_mpi0.debug:

In gmx_physicalnode_id_hash: hash 2087638387
hw_opt: nt 0 ntmpi 0 ntomp 6 ntomp_pme 0 gpu_id ‘’ gputasks ‘’
graph part nchanged=2, bMultiPart=false
graph part nchanged=0, bMultiPart=false
nr. of distance calculations in bondeds: C 0.0 SIMD 0.0
Average nr of pbc_dx calls per atom 0.00
nc 3 1 1 1 1 vol pp 0.0698 pbcdx 0.0000 pme 0.000e+00 tot 6.983e-02
Bonded atom communication beyond the cut-off: false
cellsize limit 1.000000
Domain decomposition rank 0, coordinates 0 0 0

The DD cut-off is 3.156348
Volume fraction for all DD zones: 0.356610
DD rank 0 neighbor ranks in dir 0 are + 1 - 2
Making load communicators
Finished making load communicators
In gmx_physicalnode_id_hash: hash 2087638387
In gmx_setup_nodecomm: splitting communicator of size 3
In gmx_setup_nodecomm: node ID 0 rank within node 0
In gmx_setup_nodecomm: groups 1, my group size 3
In gmx_setup_nodecomm: not unsing separate inter- and intra-node communicators.
Non-default affinity mask found
hw_opt: nt 0 ntmpi 0 ntomp 6 ntomp_pme 6 gpu_id ‘’ gputasks ‘’
In gmx_physicalnode_id_hash: hash 2087638387
dd_setup_dd_dlb_gpu_sharing:
DD PP rank 0 physical node hash 2087638387 gpu_id 0
nrank_gpu_shared 1
There are 4 atom types in the system, adding one for nbnxn_atomdata_t
Combination rules: geometric false Lorentz-Berthelot true
Initialized CUDA data structures.
Neighbor-list balancing parameter: 2992 (auto-adjusted to the number of GPU multi-processors)
graph part nchanged=2, bMultiPart=false
graph part nchanged=0, bMultiPart=false
Installing signal handler for SIGTERM
Installing signal handler for SIGINT
Installing signal handler for SIGUSR1
wh =0.333333, rc = 0.075695, ra = 0.0390588
rb = 0.0195294, irc2 = 6.60546, dHH = 0.15139, dOH = 0.09572
The total size of the atom to interaction index is 5 integers
Home charge groups:
0 1 5 6 14 15 16 18 19 20
21 22 23 24 25 28 29 30 31 34
… I removed some part manually to make the list shorter …
9438 9439 9440 9441 9442 9443 9444 9445 9446 9447
9448 9449 9450 9451 9452 9453 9454 9455 9456 9457
9458
Resizing state: currently 0, required 6561
Resizing state: currently 6561, required 6561
Changing the number of halo communication pulses along dim X from 0 to 1
cell_x[0] 0.000000 - 45.200001 skew_fac 1.000000
cell_x[1] 0.000000 - 135.600006 skew_fac 1.000000
cell_x[2] 0.000000 - 135.600006 skew_fac 1.000000
Set grid boundaries dim 0: 0.000000 45.200001
Set grid boundaries dim 1: 0.000000 135.600006
Set grid boundaries dim 2: 0.000000 135.600006
zone 0 0.000 - 45.200 0.000 - 135.600 0.000 - 135.600
zone 0 bb 0.000 - 45.200 0.000 - 135.600 0.000 - 135.600
natoms_local = 6561 atom_density = 0.0
ns na_sc 64 na_c 8 super-cells: 109 x 2 y 6 z 9.1 maxz 32
cell_offset 0 sorting columns 0 - 2, atoms 0 - 6561
cell_offset 0 sorting columns 4 - 6, atoms 0 - 6561
cell_offset 0 sorting columns 10 - 12, atoms 0 - 6561
cell_offset 0 sorting columns 2 - 4, atoms 0 - 6561
cell_offset 0 sorting columns 6 - 8, atoms 0 - 6561
cell_offset 0 sorting columns 8 - 10, atoms 0 - 6561
ns non-zero sub-cells: 825 average atoms 7.95
ns bb: grid 11.30 11.30 7.94 abs 1.00 1.02 1.33 rel 0.09 0.09 0.17
Step 0, sorting the 6561 home charge groups
Set the new home atom count to 6561
Resizing state: currently 6561, required 6561
Setting up DD communication
bBondComm false, r_bc 0.078174
Resizing state: currently 6561, required 6562
Finished setting up DD communication, zones: 6561 1
zone 1 45.200 - 48.356 0.000 - 135.600 0.000 - 135.600
zone 1 bb 45.200 - 48.356 0.000 - 135.600 0.000 - 135.600
Making local topology
Two-body bonded cut-off distance is 3.15635
dim 0 cellmin 45.200001 bonded rcheck[0] = 0, bRCheck2B = false
dim 1 cellmin 135.600006 bonded rcheck[1] = 0, bRCheck2B = false
dim 2 cellmin 135.600006 bonded rcheck[2] = 0, bRCheck2B = false
We have 19548 exclusions, check count 0
Resizing state: currently 6562, required 6562
Division of bondeds over threads:
wh =0.0559503, rc = 0.075695, ra = 0.00655606
rb = 0.0520322, irc2 = 6.60546, dHH = 0.15139, dOH = 0.09572
vcm: start=0, homenr=6561, end=6561
Summing 12 energies
cell_x[0] 0.000000 - 45.200001 skew_fac 1.000000
cell_x[1] 0.000000 - 135.600006 skew_fac 1.000000
cell_x[2] 0.000000 - 135.600006 skew_fac 1.000000
Sending ddim 0 dir 0: ncg 0 nat 0
Sending ddim 0 dir 1: ncg 0 nat 0
Resizing state: currently 6562, required 6561
Finished repartitioning: cgs moved out 0, new home 6561
Set grid boundaries dim 0: 0.000000 45.200001
Set grid boundaries dim 1: 0.000000 135.600006
Set grid boundaries dim 2: 0.000000 135.600006
zone 0 0.000 - 45.200 0.000 - 135.600 0.000 - 135.600
zone 0 bb 0.000 - 45.200 0.000 - 135.600 0.000 - 135.600
natoms_local = 6561 atom_density = 0.0
ns na_sc 64 na_c 8 super-cells: 109 x 2 y 6 z 9.1 maxz 32
cell_offset 0 sorting columns 0 - 2, atoms 0 - 6561
cell_offset 0 sorting columns 10 - 12, atoms 0 - 6561
cell_offset 0 sorting columns 6 - 8, atoms 0 - 6561
cell_offset 0 sorting columns 4 - 6, atoms 0 - 6561
cell_offset 0 sorting columns 8 - 10, atoms 0 - 6561
cell_offset 0 sorting columns 2 - 4, atoms 0 - 6561
ns non-zero sub-cells: 825 average atoms 7.95
ns bb: grid 11.30 11.30 7.94 abs 1.00 1.02 1.33 rel 0.09 0.09 0.17
Step 0, sorting the 6561 home charge groups
Set the new home atom count to 6561
Resizing state: currently 6561, required 6561
Setting up DD communication
bBondComm false, r_bc 0.078174
Resizing state: currently 6561, required 6562
Finished setting up DD communication, zones: 6561 1
zone 1 45.200 - 48.356 0.000 - 135.600 0.000 - 135.600
zone 1 bb 45.200 - 48.356 0.000 - 135.600 0.000 - 135.600
Making local topology
Two-body bonded cut-off distance is 3.15635
dim 0 cellmin 45.200001 bonded rcheck[0] = 0, bRCheck2B = false
dim 1 cellmin 135.600006 bonded rcheck[1] = 0, bRCheck2B = false
dim 2 cellmin 135.600006 bonded rcheck[2] = 0, bRCheck2B = false
We have 19548 exclusions, check count 0
Resizing state: currently 6562, required 6562
Division of bondeds over threads:
ns na_sc 64 na_c 8 super-cells: 1 x 2 y 2 z 0.2 maxz 1
cell_offset 109 sorting columns 2 - 2, atoms 6561 - 6562
cell_offset 109 sorting columns 1 - 2, atoms 6561 - 6562
cell_offset 109 sorting columns 3 - 4, atoms 6561 - 6562
cell_offset 109 sorting columns 2 - 3, atoms 6561 - 6562
cell_offset 109 sorting columns 0 - 0, atoms 6561 - 6562
cell_offset 109 sorting columns 0 - 1, atoms 6561 - 6562
ns non-zero sub-cells: 1 average atoms 1.00
ns bb: grid 1.58 67.80 9.47 abs 0.00 0.00 0.00 rel 0.00 0.00 0.00
ns making 6 nblists
nsp_est local 11550.0 non-local 205.1
nbl nsp estimate 11550.0, nsubpair_target 36
ns search grid 0 vs 0
nbl bounding box only distance 0.000000
nbl bounding box only distance 0.000000
nbl nc_i 109 col.av. 9.1 ci_block 4
nbl bounding box only distance 0.000000
nbl bounding box only distance 0.000000
nbl bounding box only distance 0.000000
nbl bounding box only distance 0.000000
nbl nc_i 109 col.av. 9.1 ci_block 4
nbl nc_i 109 col.av. 9.1 ci_block 4
nbl nc_i 109 col.av. 9.1 ci_block 4
nbl nc_i 109 col.av. 9.1 ci_block 4
nbl nc_i 109 col.av. 9.1 ci_block 4
number of distance checks 26224
nbl nsci 222 ncj4 239 nsi 6967 excl4 127
nbl na_c 8 rl 3 ncp 6967 per cell 8.4 atoms 67.6 ratio 150.44
nbl #cluster-pairs: av 31.4 stddev 3.7 max 36
nbl j-list #i-subcell 0 20 2.1
nbl j-list #i-subcell 1 16 1.7
nbl j-list #i-subcell 2 16 1.7
nbl j-list #i-subcell 3 16 1.7
nbl j-list #i-subcell 4 36 3.8
number of distance checks 24752
number of distance checks 41744
nbl nsci 266 ncj4 282 nsi 8361 excl4 142
nbl j-list #i-subcell 5 15 1.6
nbl j-list #i-subcell 6 15 1.6
nbl na_c 8 rl 3 ncp 8361 per cell 10.1 atoms 81.1 ratio 180.55
nbl j-list #i-subcell 7 14 1.5
nbl j-list #i-subcell 8 808 84.5
nbl #cluster-pairs: av 31.4 stddev 3.7 max 36
nbl j-list #i-subcell 0 26 2.3
nbl j-list #i-subcell 1 17 1.5
nbl j-list #i-subcell 2 16 1.4
nbl j-list #i-subcell 3 16 1.4
number of distance checks 31216
nbl j-list #i-subcell 4 16 1.4
nbl nsci 395 ncj4 410 nsi 12358 excl4 162
nbl j-list #i-subcell 5 16 1.4
nbl na_c 8 rl 3 ncp 12358 per cell 15.0 atoms 119.8 ratio 266.86
nbl nsci 338 ncj4 344 nsi 10365 excl4 119
number of distance checks 37680
nbl na_c 8 rl 3 ncp 10365 per cell 12.6 atoms 100.5 ratio 223.82
nbl nsci 488 ncj4 504 nsi 15339 excl4 160
nbl #cluster-pairs: av 31.3 stddev 4.0 max 36
nbl na_c 8 rl 3 ncp 15339 per cell 18.6 atoms 148.7 ratio 331.23
nbl j-list #i-subcell 6 16 1.4
nbl #cluster-pairs: av 30.7 stddev 4.8 max 36
nbl j-list #i-subcell 7 16 1.4
nbl j-list #i-subcell 0 26 1.9
nbl j-list #i-subcell 1 16 1.2
nbl j-list #i-subcell 2 15 1.1
nbl j-list #i-subcell 3 15 1.1
nbl #cluster-pairs: av 31.4 stddev 3.6 max 48
nbl j-list #i-subcell 4 16 1.2
nbl j-list #i-subcell 8 989 87.7
nbl j-list #i-subcell 0 30 1.5
nbl j-list #i-subcell 5 16 1.2
nbl j-list #i-subcell 6 15 1.1
nbl j-list #i-subcell 7 16 1.2
nbl j-list #i-subcell 1 20 1.0
nbl j-list #i-subcell 0 31 1.9
nbl j-list #i-subcell 1 20 1.2
nbl j-list #i-subcell 2 19 1.2
nbl j-list #i-subcell 2 20 1.0
nbl j-list #i-subcell 8 1241 90.2
nbl j-list #i-subcell 3 18 1.1
nbl j-list #i-subcell 4 17 1.0
nbl j-list #i-subcell 5 17 1.0
nbl j-list #i-subcell 6 17 1.0
nbl j-list #i-subcell 7 17 1.0
nbl j-list #i-subcell 8 1484 90.5
nbl j-list #i-subcell 3 20 1.0
nbl j-list #i-subcell 4 19 0.9
nbl j-list #i-subcell 5 19 0.9
nbl j-list #i-subcell 6 19 0.9
nbl j-list #i-subcell 7 18 0.9
nbl j-list #i-subcell 8 1851 91.8
number of distance checks 40176
nbl nsci 441 ncj4 480 nsi 14571 excl4 147
nbl na_c 8 rl 3 ncp 14571 per cell 17.7 atoms 141.3 ratio 314.64
nbl #cluster-pairs: av 33.0 stddev 6.6 max 64
nbl j-list #i-subcell 0 28 1.5
nbl j-list #i-subcell 1 20 1.0
nbl j-list #i-subcell 2 21 1.1
nbl j-list #i-subcell 3 20 1.0
nbl j-list #i-subcell 4 20 1.0
nbl j-list #i-subcell 5 20 1.0
nbl j-list #i-subcell 6 20 1.0
nbl j-list #i-subcell 7 19 1.0
nbl j-list #i-subcell 8 1752 91.2
nbl nsci 2150 ncj4 2259 nsi 67961 excl4 857
nbl na_c 8 rl 3 ncp 67961 per cell 82.4 atoms 659.0 ratio 1467.53
nbl #cluster-pairs: av 31.6 stddev 4.7 max 64
nbl j-list #i-subcell 0 161 1.8
nbl j-list #i-subcell 1 109 1.2
nbl j-list #i-subcell 2 107 1.2
nbl j-list #i-subcell 3 105 1.2
nbl j-list #i-subcell 4 124 1.4
nbl j-list #i-subcell 5 103 1.1
nbl j-list #i-subcell 6 102 1.1
nbl j-list #i-subcell 7 100 1.1
nbl j-list #i-subcell 8 8125 89.9
Non-bonded GPU launch configuration:
Thread block: 8x8x1
Grid: 2150x1
#Super-clusters/clusters: 17200/8 (8)
ShMem: 1568

Hi Alex,

are you running multiple ranks? That is not yet implemented (and we should have a check for that). If it’s not that, it might be related to the CUDA version - which one do you use?
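If unsure, the installed toolkit version can be checked with:

nvcc --version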

Best,
Carsten

Thank you, Carsten!
Of course I do run multiple ranks :-) Now I understand the full depth of my misunderstanding. With a single rank it works pretty well. An unrelated question: are there any plans/ongoing work/a timeframe to adapt the FMM method to multiple ranks?
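For reference, a single-rank launch roughly like the following runs fine (the input name is again just a placeholder):

mpirun -np 1 gmx_mpi mdrun -deffnm aerosol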
Thanks for sorting out the issue once again.

Sincerely,
Alex

Hi Alex,

yes, we are actively working on the FMM implementation, and further parallelization (multi-GPU and multi-rank) is at the top of our list!

BTW I have put a slightly updated version of the FMM code on our download page, which explicitly disallows running multiple ranks, plus it does not need the GMX_USE_GPU_BUFFER_OPS environment variable to be set explicitly. This should make it a bit more comfortable to use :)

Best,
Carsten