Performance with MPI support

GROMACS version: 2019.6
GROMACS modification: No
Dear all,
I had to install GROMACS with MPI support on Ubuntu 18.04 LTS in order to use PLUMED. Compared to GROMACS with the built-in thread-MPI, is it normal to observe a loss in performance (from about 200 ns/day down to about 150 ns/day for a ~50K-atom system) even when running without any PLUMED calculation, with the same number of ranks, OpenMP threads, and GPUs?
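For reference, this is roughly how I launch the two builds (the thread-MPI command line is the one from my log; the mpirun invocation for the MPI build is written from memory, so treat it as a sketch):

# thread-MPI build: ranks are spawned internally by mdrun
gmx mdrun -deffnm ERR_path_equilibration -nb gpu -pme gpu -ntmpi 8 -ntomp 8 -npme 1 -gputasks 00001111

# MPI build: ranks are started by the MPI launcher
mpirun -np 8 gmx_mpi mdrun -deffnm ERR_path_metad -nb gpu -pme gpu -ntomp 8 -npme 1 -gputasks 00001111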
Thanks in advance
Stefano

Hi,

A drop in performance of that magnitude is unexpected. Please post the logs of both runs, including the information about the build (SIMD usage, CUDA, etc.; you can also get this with gmx --version), and the performance breakdown at the bottom of the log.
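For example, the build information can be printed directly, and the breakdown sits at the end of the mdrun .log file (the log file name below is a placeholder):

gmx --version          # thread-MPI build
gmx_mpi --version      # MPI build

tail -n 80 your_run.log   # the performance tables are printed at the end of the log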

Hi,
Unfortunately I deleted some of the files. I can post the log file for a simulation run with 1 fs/step using the thread-MPI version and the log file for a simulation run with 2 fs/step using the MPI version, both for the same system.

THREAD-MPI VERSION:

                  :-) GROMACS - gmx mdrun, 2019.6 (-:

GROMACS: gmx mdrun, version 2019.6
Executable: /home/stefano/gromacs2019/bin/gmx
Data prefix: /home/stefano/gromacs2019
Working dir: /home/stefano/ERR
Process ID: 17194
Command line:
gmx mdrun -deffnm ERR_path_equilibration -nb gpu -pme gpu -ntmpi 8 -ntomp 8 -npme 1 -gputasks 00001111

GROMACS version: 2019.6
Precision: single
Memory model: 64 bit
MPI library: thread_mpi
OpenMP support: enabled (GMX_OPENMP_MAX_THREADS = 64)
GPU support: CUDA
SIMD instructions: AVX2_128
FFT library: fftw-3.3.8-sse2-avx-avx2-avx2_128-avx512
RDTSCP usage: enabled
TNG support: enabled
Hwloc support: disabled
Tracing support: disabled
C compiler: /usr/bin/cc GNU 7.5.0
C compiler flags: -mavx2 -mfma -O3 -DNDEBUG -funroll-all-loops -fexcess-precision=fast
C++ compiler: /usr/bin/c++ GNU 7.5.0
C++ compiler flags: -mavx2 -mfma -std=c++11 -O3 -DNDEBUG -funroll-all-loops -fexcess-precision=fast
CUDA compiler: /usr/local/cuda/bin/nvcc nvcc: NVIDIA ® Cuda compiler driver;Copyright © 2005-2019 NVIDIA Corporation;Built on Wed_Oct_23_19:24:38_PDT_2019;Cuda compilation tools, release 10.2, V10.2.89
CUDA compiler flags:-gencode;arch=compute_30,code=sm_30;-gencode;arch=compute_35,code=sm_35;-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_52,code=sm_52;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=compute_75;-use_fast_math;-D_FORCE_INLINES;; ;-mavx2;-mfma;-std=c++11;-O3;-DNDEBUG;-funroll-all-loops;-fexcess-precision=fast;
CUDA driver: 10.20
CUDA runtime: 10.20

Running on 1 node with total 32 cores, 64 logical cores, 2 compatible GPUs
Hardware detected:
CPU info:
Vendor: AMD
Brand: AMD Ryzen Threadripper 2990WX 32-Core Processor
Family: 23 Model: 8 Stepping: 2
Features: aes amd apic avx avx2 clfsh cmov cx8 cx16 f16c fma htt lahf misalignsse mmx msr nonstop_tsc pclmuldq pdpe1gb popcnt pse rdrnd rdtscp sha sse2 sse3 sse4a sse4.1 sse4.2 ssse3
Hardware topology: Basic
Sockets, cores, and logical processors:
Socket 0: [ 0 32] [ 1 33] [ 2 34] [ 3 35] [ 4 36] [ 5 37] [ 6 38] [ 7 39] [ 16 48] [ 17 49] [ 18 50] [ 19 51] [ 20 52] [ 21 53] [ 22 54] [ 23 55] [ 8 40] [ 9 41] [ 10 42] [ 11 43] [ 12 44] [ 13 45] [ 14 46] [ 15 47] [ 24 56] [ 25 57] [ 26 58] [ 27 59] [ 28 60] [ 29 61] [ 30 62] [ 31 63]
GPU info:
Number of GPUs detected: 2
#0: NVIDIA GeForce RTX 2080 Ti, compute cap.: 7.5, ECC: no, stat: compatible
#1: NVIDIA GeForce RTX 2080 Ti, compute cap.: 7.5, ECC: no, stat: compatible

Changing nstlist from 50 to 100, rlist from 1.046 to 1.106

Initializing Domain Decomposition on 8 ranks
Dynamic load balancing: locked
Using update groups, nr 19306, average size 2.9 atoms, max. radius 0.104 nm
Minimum cell size due to atom displacement: 0.400 nm
Initial maximum distances in bonded interactions:
two-body bonded interactions: 0.421 nm, LJ-14, atoms 3333 3341
multi-body bonded interactions: 0.487 nm, CMAP Dih., atoms 1516 1530
Minimum cell size due to bonded interactions: 0.536 nm
Using 1 separate PME ranks
Scaling the initial minimum size with 1/0.8 (option -dds) = 1.25
Optimizing the DD grid for 7 cells with a minimum initial size of 0.670 nm
The maximum allowed number of cells is: X 12 Y 12 Z 12
Domain decomposition grid 7 x 1 x 1, separate PME ranks 1
PME domain decomposition: 1 x 1 x 1
Interleaving PP and PME ranks
This rank does only particle-particle work.
Domain decomposition rank 0, coordinates 0 0 0

The initial number of communication pulses is: X 2
The initial domain decomposition cell size is: X 1.20 nm

The maximum allowed distance for atom groups involved in interactions is:
non-bonded interactions 1.314 nm
two-body bonded interactions (-rdd) 1.314 nm
multi-body bonded interactions (-rdd) 1.200 nm

When dynamic load balancing gets turned on, these settings will change to:
The maximum number of communication pulses is: X 2
The minimum size for domain decomposition cells is 0.844 nm
The requested allowed shrink of DD cells (option -dds) is: 0.80
The allowed shrink of domain decomposition cells is: X 0.70
The maximum allowed distance for atom groups involved in interactions is:
non-bonded interactions 1.314 nm
two-body bonded interactions (-rdd) 1.314 nm
multi-body bonded interactions (-rdd) 0.844 nm

Using 8 MPI threads
Using 8 OpenMP threads per tMPI thread

On host pcPharm018 2 GPUs selected for this run.
Mapping of GPU IDs to the 8 GPU tasks in the 8 ranks on this node:
PP:0,PP:0,PP:0,PP:0,PP:1,PP:1,PP:1,PME:1
PP tasks will do (non-perturbed) short-ranged interactions on the GPU
PME tasks will do all aspects on the GPU

NOTE: You assigned the same GPU ID(s) to multiple ranks, which is a good idea if you have measured the performance of alternatives.

Pinning threads with an auto-selected logical core stride of 1
System total charge: 0.000
Will do PME sum in reciprocal space for electrostatic interactions.

Using a Gaussian width (1/beta) of 0.320163 nm for Ewald
Potential shift: LJ r^-12: -1.000e+00 r^-6: -1.000e+00, Ewald -1.000e-05
Initialized non-bonded Ewald correction tables, spacing: 9.33e-04 size: 1073

Long Range LJ corr.: 3.0433e-04
Generated table with 1053 data points for Ewald.
Tabscale = 500 points/nm
Generated table with 1053 data points for LJ6.
Tabscale = 500 points/nm
Generated table with 1053 data points for LJ12.
Tabscale = 500 points/nm
Generated table with 1053 data points for 1-4 COUL.
Tabscale = 500 points/nm
Generated table with 1053 data points for 1-4 LJ6.
Tabscale = 500 points/nm
Generated table with 1053 data points for 1-4 LJ12.
Tabscale = 500 points/nm

Using GPU 8x8 nonbonded short-range kernels

Using a dual 8x4 pair-list setup updated with dynamic, rolling pruning:
outer list: updated every 100 steps, buffer 0.106 nm, rlist 1.106 nm
inner list: updated every 22 steps, buffer 0.003 nm, rlist 1.003 nm
At tolerance 0.005 kJ/mol/ps per atom, equivalent classical 1x1 list would be:
outer list: updated every 100 steps, buffer 0.233 nm, rlist 1.233 nm
inner list: updated every 22 steps, buffer 0.048 nm, rlist 1.048 nm

Using Lorentz-Berthelot Lennard-Jones combination rule

Removing pbc first time

Initializing LINear Constraint Solver

Linking all bonded interactions to atoms

Intra-simulation communication will occur every 50 steps.
Center of mass motion removal mode is Linear
We have the following groups for center of mass motion removal:
0: System

There are: 56204 Atoms
Atom distribution over 7 domains: av 8029 stddev 142 min 7854 max 8264

NOTE: DLB will not turn on during the first phase of PME tuning

Constraining the starting coordinates (step 0)

Constraining the coordinates at t0-dt (step 0)
RMS relative constraint deviation after constraining: 0.00e+00
Initial temperature: 302.765 K

Started mdrun on rank 0 Mon Dec 14 08:34:19 2020

DD step 99 load imb.: force 4.6% pme mesh/force 0.934

DD step 999 load imb.: force 1.5% pme mesh/force 1.195

   P P   -   P M E   L O A D   B A L A N C I N G

PP/PME load balancing changed the cut-off and PME settings:
           particle-particle        PME
            rcoulomb  rlist       grid      spacing   1/beta
   initial  1.000 nm  1.003 nm   72 72 72   0.117 nm  0.320 nm
   final    1.094 nm  1.097 nm   64 64 64   0.131 nm  0.350 nm
 cost-ratio           1.31       0.70
(note that these numbers concern only part of the total PP and PME load)

M E G A - F L O P S   A C C O U N T I N G

NB=Group-cutoff nonbonded kernels NxN=N-by-N cluster Verlet kernels
RF=Reaction-Field VdW=Van der Waals QSTab=quadratic-spline table
W3=SPC/TIP3p W4=TIP4p (single or pairs)
V&F=Potential and force V=Potential only F=Force only

Computing: M-Number M-Flops % Flops

Pair Search distance check 255648.661376 2300837.952 0.0
NxN Ewald Elec. + LJ [F] 272544519.601728 17987938293.714 98.1
NxN Ewald Elec. + LJ [V&F] 2753030.658560 294574280.466 1.6
1,4 nonbonded interactions 46455.009291 4180950.836 0.0
Reset In Box 2810.256204 8430.769 0.0
CG-CoM 2810.312408 8430.937 0.0
Bonds 8825.001765 520675.104 0.0
Angles 32320.006464 5429761.086 0.0
Propers 41980.008396 9613421.923 0.1
Impropers 3270.000654 680160.136 0.0
Virial 2826.006519 50868.117 0.0
Stop-CM 2810.312408 28103.124 0.0
Calc-Ekin 11240.912408 303504.635 0.0
Lincs 8920.005352 535200.321 0.0
Lincs-Mat 51780.031068 207120.124 0.0
Constraint-V 281195.112478 2249560.900 0.0
Constraint-Vir 2722.804455 65347.307 0.0
Settle 87785.052671 28354572.013 0.2
CMAP 1125.000225 1912500.383 0.0

Total 18338962019.847 100.0

D O M A I N   D E C O M P O S I T I O N   S T A T I S T I C S

av. #atoms communicated per step for force: 2 x 51308.5

Dynamic load balancing report:
DLB was turned on during the run due to measured imbalance.
Average load imbalance: 6.8%.
The balanceable part of the MD step is 76%, load imbalance is computed from this.
Part of the total run time spent waiting due to load imbalance: 5.1%.
Steps where the load balancing was limited by -rdd, -rcon and/or -dds: X 0 %
Average PME mesh/force load: 1.092
Part of the total run time spent waiting due to PP/PME imbalance: 3.5 %

NOTE: 5.1 % of the available CPU time was lost due to load imbalance
in the domain decomposition.
You can consider manually changing the decomposition (option -dd);
e.g. by using fewer domains along the box dimension in which there is
considerable inhomogeneity in the simulated system.

 R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G

On 7 MPI ranks doing PP, each using 8 OpenMP threads, and
on 1 MPI rank doing PME, using 8 OpenMP threads

Computing: Num Num Call Wall time Giga-Cycles
Ranks Threads Count (s) total sum %

Domain decomp. 7 8 50001 105.219 18818.654 2.6
DD comm. load 7 8 21201 1.754 313.740 0.0
DD comm. bounds 7 8 18001 0.595 106.450 0.0
Send X to PME 7 8 5000001 315.419 56413.354 7.9
Neighbor search 7 8 50001 43.252 7735.762 1.1
Launch GPU ops. 7 8 10000002 1149.906 205663.077 28.8
Comm. coord. 7 8 4950000 506.670 90618.990 12.7
Force 7 8 5000001 188.546 33721.805 4.7
Wait + Comm. F 7 8 5000001 658.835 117834.092 16.5
PME mesh * 1 8 5000001 1797.720 45932.295 6.4
PME wait for PP * 1697.932 43382.676 6.1
Wait + Recv. PME F 7 8 5000001 83.541 14941.475 2.1
Wait PME GPU gather 7 8 5000001 145.914 26097.075 3.7
Wait GPU NB nonloc. 7 8 5000001 47.621 8517.077 1.2
Wait GPU NB local 7 8 5000001 23.452 4194.516 0.6
NB X/F buffer ops. 7 8 19900002 192.927 34505.372 4.8
Write traj. 7 8 1004 3.678 657.830 0.1
Update 7 8 5000001 120.828 21610.431 3.0
Constraints 7 8 5000003 113.125 20232.569 2.8
Comm. energies 7 8 100001 25.109 4490.775 0.6

Total 3495.676 714524.636 100.0

(*) Note that with separate PME ranks, the walltime column actually sums to
twice the total reported, but the cycle count total and % are correct.

           Core t (s)   Wall t (s)        (%)
   Time:   223723.172     3495.676     6400.0
                     58:15
                 (ns/day)    (hour/ns)
Performance:      123.581        0.194

MPI VERSION:

                  :-) GROMACS - gmx mdrun, 2019.6 (-:

GROMACS: gmx mdrun, version 2019.6
Executable: /home/stefano/gromacs2019/bin/gmx_mpi
Data prefix: /home/stefano/gromacs2019
Working dir: /home/stefano/ERR/path_metad
Process ID: 69537
Command line:
gmx_mpi mdrun -deffnm ERR_path_metad -nb gpu -pme gpu -ntomp 8 -npme 1 -gputasks 00001111

GROMACS version: 2019.6
Precision: single
Memory model: 64 bit
MPI library: MPI
OpenMP support: enabled (GMX_OPENMP_MAX_THREADS = 64)
GPU support: CUDA
SIMD instructions: AVX2_128
FFT library: fftw-3.3.8-sse2-avx-avx2-avx2_128-avx512
RDTSCP usage: enabled
TNG support: enabled
Hwloc support: disabled
Tracing support: disabled
C compiler: /usr/bin/cc GNU 7.5.0
C compiler flags: -mavx2 -mfma -O3 -DNDEBUG -funroll-all-loops -fexcess-precision=fast
C++ compiler: /usr/bin/c++ GNU 7.5.0
C++ compiler flags: -mavx2 -mfma -std=c++11 -O3 -DNDEBUG -funroll-all-loops -fexcess-precision=fast
CUDA compiler: /usr/local/cuda/bin/nvcc nvcc: NVIDIA ® Cuda compiler driver;Copyright © 2005-2019 NVIDIA Corporation;Built on Wed_Oct_23_19:24:38_PDT_2019;Cuda compilation tools, release 10.2, V10.2.89
CUDA compiler flags:-gencode;arch=compute_30,code=sm_30;-gencode;arch=compute_35,code=sm_35;-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_52,code=sm_52;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=compute_75;-use_fast_math;-D_FORCE_INLINES;; ;-mavx2;-mfma;-std=c++11;-O3;-DNDEBUG;-funroll-all-loops;-fexcess-precision=fast;
CUDA driver: 10.20
CUDA runtime: 10.20

Changing nstlist from 50 to 100, rlist from 1.105 to 1.155

Initializing Domain Decomposition on 8 ranks
Dynamic load balancing: locked
Using update groups, nr 19306, average size 2.9 atoms, max. radius 0.104 nm
Minimum cell size due to atom displacement: 0.658 nm
Initial maximum distances in bonded interactions:
two-body bonded interactions: 0.430 nm, LJ-14, atoms 3333 3341
multi-body bonded interactions: 0.495 nm, CMAP Dih., atoms 2693 2702
Minimum cell size due to bonded interactions: 0.545 nm
Using 1 separate PME ranks
Scaling the initial minimum size with 1/0.8 (option -dds) = 1.25
Optimizing the DD grid for 7 cells with a minimum initial size of 0.823 nm
The maximum allowed number of cells is: X 10 Y 10 Z 10
Domain decomposition grid 7 x 1 x 1, separate PME ranks 1
PME domain decomposition: 1 x 1 x 1
Interleaving PP and PME ranks
This rank does only particle-particle work.
Domain decomposition rank 0, coordinates 0 0 0

The initial number of communication pulses is: X 2
The initial domain decomposition cell size is: X 1.20 nm

The maximum allowed distance for atom groups involved in interactions is:
non-bonded interactions 1.363 nm
(the following are initial values, they could change due to box deformation)
two-body bonded interactions (-rdd) 1.363 nm
multi-body bonded interactions (-rdd) 1.200 nm

When dynamic load balancing gets turned on, these settings will change to:
The maximum number of communication pulses is: X 2
The minimum size for domain decomposition cells is 0.848 nm
The requested allowed shrink of DD cells (option -dds) is: 0.80
The allowed shrink of domain decomposition cells is: X 0.71
The maximum allowed distance for atom groups involved in interactions is:
non-bonded interactions 1.363 nm
two-body bonded interactions (-rdd) 1.363 nm
multi-body bonded interactions (-rdd) 0.848 nm

Using 8 MPI processes
Using 8 OpenMP threads per MPI process

On host pcPharm018 2 GPUs selected for this run.
Mapping of GPU IDs to the 8 GPU tasks in the 8 ranks on this node:
PP:0,PP:0,PP:0,PP:0,PP:1,PP:1,PP:1,PME:1
PP tasks will do (non-perturbed) short-ranged interactions on the GPU
PME tasks will do all aspects on the GPU

NOTE: You assigned the same GPU ID(s) to multiple ranks, which is a good idea if you have measured the performance of alternatives.

Pinning threads with an auto-selected logical core stride of 1
System total charge: 0.000
Will do PME sum in reciprocal space for electrostatic interactions.

Using a Gaussian width (1/beta) of 0.320163 nm for Ewald
Potential shift: LJ r^-12: -1.000e+00 r^-6: -1.000e+00, Ewald -1.000e-05
Initialized non-bonded Ewald correction tables, spacing: 9.33e-04 size: 1073

Long Range LJ corr.: 3.0433e-04
Generated table with 1077 data points for Ewald.
Tabscale = 500 points/nm
Generated table with 1077 data points for LJ6.
Tabscale = 500 points/nm
Generated table with 1077 data points for LJ12.
Tabscale = 500 points/nm
Generated table with 1077 data points for 1-4 COUL.
Tabscale = 500 points/nm
Generated table with 1077 data points for 1-4 LJ6.
Tabscale = 500 points/nm
Generated table with 1077 data points for 1-4 LJ12.
Tabscale = 500 points/nm

Using GPU 8x8 nonbonded short-range kernels

Using a dual 8x4 pair-list setup updated with dynamic, rolling pruning:
outer list: updated every 100 steps, buffer 0.155 nm, rlist 1.155 nm
inner list: updated every 12 steps, buffer 0.005 nm, rlist 1.005 nm
At tolerance 0.005 kJ/mol/ps per atom, equivalent classical 1x1 list would be:
outer list: updated every 100 steps, buffer 0.304 nm, rlist 1.304 nm
inner list: updated every 12 steps, buffer 0.054 nm, rlist 1.054 nm

Using Lorentz-Berthelot Lennard-Jones combination rule

Initializing LINear Constraint Solver

The number of constraints is 1784
Linking all bonded interactions to atoms

Intra-simulation communication will occur every 50 steps.
Center of mass motion removal mode is Linear
We have the following groups for center of mass motion removal:
0: System

There are: 56204 Atoms
Atom distribution over 7 domains: av 8029 stddev 110 min 7892 max 8223

NOTE: DLB will not turn on during the first phase of PME tuning

M E G A - F L O P S   A C C O U N T I N G

NB=Group-cutoff nonbonded kernels NxN=N-by-N cluster Verlet kernels
RF=Reaction-Field VdW=Van der Waals QSTab=quadratic-spline table
W3=SPC/TIP3p W4=TIP4p (single or pairs)
V&F=Potential and force V=Potential only F=Force only

Computing: M-Number M-Flops % Flops

Pair Search distance check 3755.357424 33798.217 0.0
NxN Ewald Elec. + LJ [F] 3620881.851840 238978202.221 98.1
NxN Ewald Elec. + LJ [V&F] 36628.826880 3919284.476 1.6
1,4 nonbonded interactions 623.435391 56109.185 0.0
Reset In Box 37.712884 113.139 0.0
CG-CoM 37.769088 113.307 0.0
Bonds 118.433265 6987.563 0.0
Angles 433.740864 72868.465 0.0
Propers 563.379996 129014.019 0.1
Impropers 43.884054 9127.883 0.0
Virial 75.905017 1366.290 0.0
Stop-CM 37.769088 377.691 0.0
Calc-Ekin 150.963944 4076.026 0.0
Lincs 119.708184 7182.491 0.0
Lincs-Mat 694.897956 2779.592 0.0
Constraint-V 3773.693139 30189.545 0.0
Constraint-Vir 73.133065 1755.194 0.0
Settle 1178.092257 380523.799 0.2
CMAP 15.097725 25666.132 0.0

Total 243659535.236 100.0

D O M A I N   D E C O M P O S I T I O N   S T A T I S T I C S

av. #atoms communicated per step for force: 2 x 59892.1

Dynamic load balancing report:
DLB was off during the run due to low measured imbalance.
Average load imbalance: 3.8%.
The balanceable part of the MD step is 44%, load imbalance is computed from this.
Part of the total run time spent waiting due to load imbalance: 1.6%.
Average PME mesh/force load: 0.899
Part of the total run time spent waiting due to PP/PME imbalance: 0.5 %

 R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G

On 7 MPI ranks doing PP, each using 8 OpenMP threads, and
on 1 MPI rank doing PME, using 8 OpenMP threads

Computing: Num Num Call Wall time Giga-Cycles
Ranks Threads Count (s) total sum %

Domain decomp. 7 8 671 1.123 200.934 1.3
DD comm. load 7 8 69 0.013 2.302 0.0
Send X to PME 7 8 67101 10.220 1827.922 11.6
Neighbor search 7 8 672 0.591 105.747 0.7
Launch GPU ops. 7 8 134202 4.125 737.682 4.7
Comm. coord. 7 8 66429 6.845 1224.177 7.8
Force 7 8 67101 1.990 355.915 2.3
Wait + Comm. F 7 8 67101 11.363 2032.359 12.9
PME mesh * 1 8 67101 47.016 1201.259 7.6
PME wait for PP * 24.408 623.619 4.0
Wait + Recv. PME F 7 8 67101 2.706 483.882 3.1
Wait PME GPU gather 7 8 67101 6.126 1095.688 7.0
Wait GPU NB nonloc. 7 8 67101 33.996 6080.231 38.6
Wait GPU NB local 7 8 67101 0.177 31.743 0.2
NB X/F buffer ops. 7 8 267060 2.575 460.553 2.9
Write traj. 7 8 69 0.051 9.174 0.1
Update 7 8 67101 0.541 96.716 0.6
Constraints 7 8 67101 1.519 271.613 1.7
Comm. energies 7 8 1343 0.343 61.418 0.4

Total 77.117 15762.743 100.0

(*) Note that with separate PME ranks, the walltime column actually sums to
twice the total reported, but the cycle count total and % are correct.

           Core t (s)   Wall t (s)        (%)
   Time:     4935.430       77.117     6400.0
                 (ns/day)    (hour/ns)
Performance:      150.357        0.160

The build configuration looks fine. There are some differences in the domain decomposition output, but those come from comparing a 1 fs run against a 2 fs run, and I don't think that should throw off the results too much, provided you have seen the same difference in an apples-to-apples comparison (sketch below). The big difference in wall time is in "Wait GPU NB nonloc.", which accounts for 38.6% of the runtime in the slower (MPI) run.
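If you can rerun both builds on the same input, a minimal apples-to-apples sketch would be something like the following (the .tpr name and step count are placeholders; -resethway makes mdrun reset its cycle counters halfway through the run, so startup and load-balancing transients are excluded from the reported performance):

# thread-MPI build
gmx mdrun -s topol.tpr -ntmpi 8 -ntomp 8 -npme 1 -nb gpu -pme gpu -gputasks 00001111 -nsteps 50000 -resethway -g tmpi.log

# MPI build, identical rank/thread/GPU layout
mpirun -np 8 gmx_mpi mdrun -s topol.tpr -ntomp 8 -npme 1 -nb gpu -pme gpu -gputasks 00001111 -nsteps 50000 -resethway -g mpi.log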

@pszilard, do you know if that’s atypical or just an indicator that the GPU is maxed out?