Random GROMACS Crashes with CUDA error #717

GROMACS version: 2025.0
GROMACS modification: No

Hello GROMACS community,

Background:
I am running 48 GROMACS simulations in parallel (with identical system topology) in GPU-resident mode to maximize throughput for a weighted-ensemble strategy. The calculation runs on an HPC compute node with eight H100 GPUs and 192 logical CPU cores. The GPUs are partitioned using NVIDIA's Multi-Process Service (MPS). Each GPU is assigned six independent GROMACS simulations, and each simulation gets 4 unique CPU cores from the same NUMA node as its GPU. Empirically, six parallel simulations per GPU gives the highest overall throughput.
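
To make the layout concrete, the per-replicate launch logic is roughly the following (simplified sketch; the actual script also handles the NUMA-aware core selection, per-GPU MPS settings, and the weighted-ensemble bookkeeping):

# Start the MPS control daemon once per node.
nvidia-cuda-mps-control -d

# 48 replicates: six per GPU, four unique logical cores each.
for i in $(seq 0 47); do
    gpu=$(( i / 6 ))        # GPUs 0..7
    offset=$(( i * 4 ))     # e.g. replicate 32 gets -pinoffset 128, as in the log below
    CUDA_VISIBLE_DEVICES=$gpu \
    gmx mdrun -ntmpi 1 -nt 4 -pin on -pinoffset $offset -pinstride 1 \
        -update gpu -nb gpu -pme gpu -pmefft gpu -bonded cpu \
        -deffnm seg -noconfout &
done
wait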

The simulated systems are protein-protein binding in water with 0.15 M NaCl. I use the CHARMM36m force field for the proteins, mTIP3P for water, and JC parameters for the ions. All replicas contain 210,385 atoms and run at 303 K and 1 atm using the v-rescale thermostat and C-rescale barostat. I employ hydrogen mass repartitioning (mass-repartition-factor 3) with a 4 fs time step, which is standard for biomolecules under HMR. The initial states for production are well equilibrated via minimization followed by 50 ps of NVT (1 fs time step) and 50 ps of NPT (2 fs time step); energies, pressures, and temperatures are relaxed by the end of these equilibration stages.

The production simulations run normally most of the time, with very impressive throughput, but very rarely anywhere from one to six of the parallel GROMACS replicates crash simultaneously with CUDA error #717 (cudaErrorInvalidAddressSpace): operation not supported on global/shared address space. With GROMACS 2024.4 I saw similar CUDA crashes, but those logged CUDA error #700 (cudaErrorIllegalAddress) rather than #717. The replicates that crash together are always assigned to the same GPU.

I have observed these crashes across multiple simulated systems on these H100 compute nodes, and I am fairly certain the cause is not physical instability of the systems. I have even re-run a failed trajectory from its previous checkpoint (15,000 steps back) hundreds of times, and none of the retries crash or log warnings.
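
That retry test was essentially the following (sketch; the checkpoint and output names are illustrative):

# Re-run the same segment from its previous checkpoint many times.
for n in $(seq 1 200); do
    gmx mdrun -ntmpi 1 -nt 4 -pin on -pinoffset 128 -pinstride 1 \
        -update gpu -nb gpu -pme gpu -pmefft gpu -bonded cpu \
        -s seg.tpr -cpi seg_prev.cpt -deffnm retry_$n -noappend -noconfout \
        || echo "retry $n crashed"
done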

I have confirmed with NVIDIA support that this is an application-level crash, not something like a device-driver crash.

The crashes produce a core dump, which I have examined with gdb; the output (including the backtrace) is pasted below. I am looking for suggestions/advice on debugging this error further. Any help is greatly appreciated!

Thanks,
Hayden

gdb output on core dump:

warning: Can't open file /dev/zero (deleted) during file-backed mapping note processing

warning: Can't open file /dev/shm/cuda.shm.0.166.1 during file-backed mapping note processing
[New LWP 408967]
[New LWP 410242]
[New LWP 410418]
[New LWP 409015]
[New LWP 410419]
[New LWP 410420]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/usr/local/gromacs/avx2_256/bin/gmx mdrun -ntmpi 1 -nt 4 -pin on -pinoffset 128'.
Program terminated with signal SIGABRT, Aborted.
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=134377962000384) at ./nptl/pthread_kill.c:44
44      ./nptl/pthread_kill.c: No such file or directory.
[Current thread is 1 (Thread 0x7a374eb0d000 (LWP 408967))]
(gdb) bt
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=134377962000384) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=134377962000384) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=134377962000384, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3  0x00007a3760c23476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4  0x00007a3760c097f3 in __GI_abort () at ./stdlib/abort.c:79
#5  0x00007a3760eccb9e in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007a3760ed820c in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#7  0x00007a3760ed71e9 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#8  0x00007a3760ed7959 in __gxx_personality_v0 () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#9  0x00007a3760e20884 in ?? () from /usr/lib/x86_64-linux-gnu/libgcc_s.so.1
#10 0x00007a3760e20f41 in _Unwind_RaiseException () from /usr/lib/x86_64-linux-gnu/libgcc_s.so.1
#11 0x00007a3760ed84cb in __cxa_throw () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#12 0x00007a3761fb5ad5 in gmx::(anonymous namespace)::checkDeviceError(cudaError, std::basic_string_view<char, std::char_traits<char> >) ()
   from /usr/local/gromacs/avx2_256/lib/libgromacs.so.10
#13 0x00007a3761fb51be in gmx::UpdateConstrainGpu::Impl::~Impl() () from /usr/local/gromacs/avx2_256/lib/libgromacs.so.10
#14 0x00007a3761fb5275 in gmx::UpdateConstrainGpu::~UpdateConstrainGpu() () from /usr/local/gromacs/avx2_256/lib/libgromacs.so.10
#15 0x00007a3761373513 in gmx::LegacySimulator::do_md() [clone .cold] () from /usr/local/gromacs/avx2_256/lib/libgromacs.so.10
#16 0x00007a37621391c0 in gmx::Mdrunner::mdrunner() () from /usr/local/gromacs/avx2_256/lib/libgromacs.so.10
#17 0x000057aa41c739ec in gmx::gmx_mdrun(tmpi_comm_*, gmx_hw_info_t const&, int, char**) ()
#18 0x000057aa41c73b8a in gmx::gmx_mdrun(int, char**) ()
#19 0x00007a376188853c in gmx::CommandLineModuleManager::run(int, char**) () from /usr/local/gromacs/avx2_256/lib/libgromacs.so.10
#20 0x000057aa41c6fec0 in main ()

mdrun console log:

...
GROMACS:      gmx mdrun, version 2025.0
Executable:   /usr/local/gromacs/avx2_256/bin/gmx
Data prefix:  /usr/local/gromacs/avx2_256
Working dir:  /dev/shm/unbinding/traj_segs/000093/000933
Process ID:   408967
Command line:
  gmx mdrun -ntmpi 1 -nt 4 -pin on -pinoffset 128 -pinstride 1 -update gpu -nb gpu -pme gpu -pmefft gpu -bonded cpu -deffnm seg -cpt -1 -nocpnum -cpo /dev/shm/null -noconfout

GROMACS version:     2025.0
Precision:           mixed
Memory model:        64 bit
MPI library:         thread_mpi
OpenMP support:      enabled (GMX_OPENMP_MAX_THREADS = 128)
GPU support:         CUDA
NBNxM GPU setup:     super-cluster 2x2x2 / cluster 8 (cluster-pair splitting on)
SIMD instructions:   AVX2_256
CPU FFT library:     fftw-3.3.10-sse2-avx-avx2-avx2_128
GPU FFT library:     cuFFT
Multi-GPU FFT:       none
RDTSCP usage:        enabled
TNG support:         enabled
Hwloc support:       disabled
Tracing support:     disabled
C compiler:          /usr/bin/gcc GNU 11.4.0
C compiler flags:    -fexcess-precision=fast -funroll-all-loops -mavx2 -mfma -Wno-missing-field-initializers -march=haswell -mtune=haswell -O3 -pipe -DNDEBUG
C++ compiler:        /usr/bin/g++ GNU 11.4.0
C++ compiler flags:  -fexcess-precision=fast -funroll-all-loops -mavx2 -mfma -Wno-missing-field-initializers -Wno-cast-function-type-strict SHELL:-fopenmp -march=haswell -mtune=haswell -O3 -pipe -DNDEBUG
BLAS library:        External - detected on the system
LAPACK library:      External - detected on the system
CUDA compiler:       /usr/local/cuda/bin/nvcc nvcc: NVIDIA (R) Cuda compiler driver;Copyright (c) 2005-2024 NVIDIA Corporation;Built on Thu_Sep_12_02:18:05_PDT_2024;Cuda compilation tools, release 12.6, V12.6.77;Build cuda_12.6.r12.6/compiler.34841621_0
CUDA compiler flags: -march=haswell -mtune=haswell -O3 -pipe -DNDEBUG
CUDA driver:         12.60
CUDA runtime:        12.60


Running on 1 node with total 96 cores, 192 processing units, 1 compatible GPU
Hardware detected on host hscheibe-set5-0-6:
  CPU info:
    Vendor: AMD
    Brand:  AMD EPYC 7R13 Processor
    Family: 25   Model: 1   Stepping: 1
    Features: aes amd apic avx avx2 clfsh cmov cx8 cx16 f16c fma htt lahf misalignsse mmx msr nonstop_tsc pcid pclmuldq pdpe1gb popcnt pse rdrnd rdtscp sha sse2 sse3 sse4a sse4.1 sse4.2 ssse3 x2apic
  Hardware topology: Basic
    Packages, cores, and logical processors:
    [indices refer to OS logical processors]
      Package  0: [   0  96] [   1  97] [   2  98] [   3  99] [   4 100] [   5 101] [   6 102] [   7 103] [   8 104] [   9 105] [  10 106] [  11 107] [  12 108] [  13 109] [  14 110] [  15 111] [  16 112] [  17 113] [  18 114] [  19 115] [  20 116] [  21 117] [  22 118] [  23 119] [  24 120] [  25 121] [  26 122] [  27 123] [  28 124] [  29 125] [  30 126] [  31 127] [  32 128] [  33 129] [  34 130] [  35 131] [  36 132] [  37 133] [  38 134] [  39 135] [  40 136] [  41 137] [  42 138] [  43 139] [  44 140] [  45 141] [  46 142] [  47 143]
      Package  1: [  48 144] [  49 145] [  50 146] [  51 147] [  52 148] [  53 149] [  54 150] [  55 151] [  56 152] [  57 153] [  58 154] [  59 155] [  60 156] [  61 157] [  62 158] [  63 159] [  64 160] [  65 161] [  66 162] [  67 163] [  68 164] [  69 165] [  70 166] [  71 167] [  72 168] [  73 169] [  74 170] [  75 171] [  76 172] [  77 173] [  78 174] [  79 175] [  80 176] [  81 177] [  82 178] [  83 179] [  84 180] [  85 181] [  86 182] [  87 183] [  88 184] [  89 185] [  90 186] [  91 187] [  92 188] [  93 189] [  94 190] [  95 191]
    CPU limit set by OS: -1   Recommended max number of threads: 192
  GPU info:
    Number of GPUs detected: 1
    #0: NVIDIA NVIDIA H100 80GB HBM3, compute cap.: 9.0, ECC: yes, stat: compatible


...


The number of OpenMP threads was set by environment variable OMP_NUM_THREADS to 4

Input Parameters:
   integrator                     = md
   tinit                          = 0
   dt                             = 0.004
   nsteps                         = 12500
   init-step                      = 0
   simulation-part                = 1
   mts                            = false
   mass-repartition-factor        = 1
   comm-mode                      = Linear
   nstcomm                        = 500
   bd-fric                        = 0
   ld-seed                        = 10267
   emtol                          = 10
   emstep                         = 0.01
   niter                          = 20
   fcstep                         = 0
   nstcgsteep                     = 1000
   nbfgscorr                      = 10
   rtpi                           = 0.05
   nstxout                        = 12500
   nstvout                        = 12500
   nstfout                        = 0
   nstlog                         = 1250
   nstcalcenergy                  = 50
   nstenergy                      = 1250
   nstxout-compressed             = 1250
   compressed-x-precision         = 1000
   cutoff-scheme                  = Verlet
   nstlist                        = 10
   pbc                            = xyz
   periodic-molecules             = false
   verlet-buffer-tolerance        = 0.005
   verlet-buffer-pressure-tolerance = 0.5
   rlist                          = 1.219
   coulombtype                    = PME
   coulomb-modifier               = Potential-shift
   rcoulomb-switch                = 0
   rcoulomb                       = 1.2
   epsilon-r                      = 1
   epsilon-rf                     = inf
   vdw-type                       = Cut-off
   vdw-modifier                   = Force-switch
   rvdw-switch                    = 1
   rvdw                           = 1.2
   DispCorr                       = No
   table-extension                = 1
   fourierspacing                 = 0.12
   fourier-nx                     = 120
   fourier-ny                     = 120
   fourier-nz                     = 120
   pme-order                      = 4
   ewald-rtol                     = 1e-05
   ewald-rtol-lj                  = 0.001
   lj-pme-comb-rule               = Geometric
   ewald-geometry                 = 3d
   epsilon-surface                = 0
   ensemble-temperature-setting   = constant
   ensemble-temperature           = 303.15
   tcoupl                         = V-rescale
   nsttcouple                     = 5
   nh-chain-length                = 0
   print-nose-hoover-chain-variables = false
   pcoupl                         = C-rescale
   pcoupltype                     = Isotropic
   nstpcouple                     = 100
   tau-p                          = 2
   compressibility (3x3):
      compressibility[    0]={ 4.50000e-05,  0.00000e+00,  0.00000e+00}
      compressibility[    1]={ 0.00000e+00,  4.50000e-05,  0.00000e+00}
      compressibility[    2]={ 0.00000e+00,  0.00000e+00,  4.50000e-05}
   ref-p (3x3):
      ref-p[    0]={ 1.00000e+00,  0.00000e+00,  0.00000e+00}
      ref-p[    1]={ 0.00000e+00,  1.00000e+00,  0.00000e+00}
      ref-p[    2]={ 0.00000e+00,  0.00000e+00,  1.00000e+00}
   refcoord-scaling               = COM
   posres-com: not available
   posres-comB: not available
   QMMM                           = false
qm-opts:
   ngQM                           = 0
   constraint-algorithm           = Lincs
   continuation                   = true
   Shake-SOR                      = false
   shake-tol                      = 0.0001
   lincs-order                    = 4
   lincs-iter                     = 1
   lincs-warnangle                = 30
   nwall                          = 0
   wall-type                      = 9-3
   wall-r-linpot                  = -1
   wall-atomtype[0]               = -1
   wall-atomtype[1]               = -1
   wall-density[0]                = 0
   wall-density[1]                = 0
   wall-ewald-zfac                = 3
   pull                           = false
   awh                            = false
   rotation                       = false
   interactiveMD                  = false
   disre                          = No
   disre-weighting                = Conservative
   disre-mixed                    = false
   dr-fc                          = 1000
   dr-tau                         = 0
   nstdisreout                    = 100
   orire-fc                       = 0
   orire-tau                      = 0
   nstorireout                    = 100
   free-energy                    = no
   cos-acceleration               = 0
   deform (3x3):
      deform[    0]={ 0.00000e+00,  0.00000e+00,  0.00000e+00}
      deform[    1]={ 0.00000e+00,  0.00000e+00,  0.00000e+00}
      deform[    2]={ 0.00000e+00,  0.00000e+00,  0.00000e+00}
   simulated-tempering            = false
   swapcoords                     = no
   userint1                       = 0
   userint2                       = 0
   userint3                       = 0
   userint4                       = 0
   userreal1                      = 0
   userreal2                      = 0
   userreal3                      = 0
   userreal4                      = 0
   applied-forces:
     electric-field:
       x:
         E0                       = 0
         omega                    = 0
         t0                       = 0
         sigma                    = 0
       y:
         E0                       = 0
         omega                    = 0
         t0                       = 0
         sigma                    = 0
       z:
         E0                       = 0
         omega                    = 0
         t0                       = 0
         sigma                    = 0
     density-guided-simulation:
       active                     = false
       group                      = protein
       similarity-measure         = inner-product
       atom-spreading-weight      = unity
       force-constant             = 1e+09
       gaussian-transform-spreading-width = 0.2
       gaussian-transform-spreading-range-in-multiples-of-width = 4
       reference-density-filename = reference.mrc
       nst                        = 1
       normalize-densities        = true
       adaptive-force-scaling     = false
       adaptive-force-scaling-time-constant = 4
       shift-vector               = 
       transformation-matrix      = 
     qmmm-cp2k:
       active                     = false
       qmgroup                    = System
       qmmethod                   = PBE
       qmfilenames                = 
       qmcharge                   = 0
       qmmultiplicity             = 1
     colvars:
       active                     = false
       configfile                 = 
       seed                       = -1
     nnpot:
       active                     = false
       modelfile                  = model.pt
       input-group                = System
       model-input1               = 
       model-input2               = 
       model-input3               = 
       model-input4               = 
grpopts:
   nrdf:      423663
   ref-t:      303.15
   tau-t:         0.1
annealing:          No
annealing-npoints:           0
   acc:	           0           0           0
   nfreeze:           N           N           N
   energygrp-flags[  0]: 0


Changing rlist from 1.219 to 1.22 for non-bonded 8x8 atom kernels

Changing nstlist from 10 to 50, rlist from 1.22 to 1.347

When checking whether update groups are usable:
  Domain decomposition is not active, so there is no need for update groups

Please note that for thread-MPI builds, only PP ranks use GPU direct communication

Local state does not use filler particles

1 GPU selected for this run.
Mapping of GPU IDs to the 2 GPU tasks in the 1 rank on this node:
  PP:0,PME:0
PP tasks will do (non-perturbed) short-ranged interactions on the GPU
PP task will update and constrain coordinates on the GPU
PME tasks will do all aspects on the GPU
Using 1 MPI thread
Using 4 OpenMP threads 

System total charge: 0.000
Will do PME sum in reciprocal space for electrostatic interactions.

Using a Gaussian width (1/beta) of 0.384195 nm for Ewald
Potential shift: LJ r^-12: -2.648e-01 r^-6: -5.349e-01, Ewald -8.333e-06
Initialized non-bonded Coulomb Ewald tables, spacing: 1.02e-03 size: 1176

Generated table with 1173 data points for 1-4 COUL.
Tabscale = 500 points/nm
Generated table with 1173 data points for 1-4 LJ6.
Tabscale = 500 points/nm
Generated table with 1173 data points for 1-4 LJ12.
Tabscale = 500 points/nm


Using GPU 8x4 nonbonded short-range kernels

NBNxM GPU setup: super-cluster 2x2x2

Using a dual 8x4 pair-list setup updated with dynamic, rolling pruning:
  outer list: updated every 50 steps, buffer 0.147 nm, rlist 1.347 nm
  inner list: updated every  6 steps, buffer 0.005 nm, rlist 1.205 nm
At tolerance 0.005 kJ/mol/ps per atom, equivalent classical 1x1 list would be:
  outer list: updated every 50 steps, buffer 0.300 nm, rlist 1.500 nm
  inner list: updated every  6 steps, buffer 0.055 nm, rlist 1.255 nm

The average pressure is off by at most 0.26 bar due to missing LJ interactions

Overriding thread affinity set outside gmx mdrun

Applying core pinning offset 128
Pinning threads with a user-specified logical cpu stride of 1

Initializing LINear Constraint Solver

The number of constraints is 2370

The -noconfout functionality is deprecated, and may be removed in a future version.

There are: 210385 Atoms

Updating coordinates and applying constraints on the GPU.
Center of mass motion removal mode is Linear
We have the following groups for center of mass motion removal:
  0:  System

Started mdrun on rank 0 Sun Feb 23 19:24:34 2025

           Step           Time
              0        0.00000

   Energies (kJ/mol)
           Bond            U-B    Proper Dih.  Improper Dih.      CMAP Dih.
    4.66521e+03    1.14448e+04    1.31713e+04    7.45423e+02   -9.12172e+02
          LJ-14     Coulomb-14        LJ (SR)   Coulomb (SR)   Coul. recip.
    4.43424e+03    5.26820e+04    3.07223e+05   -3.37263e+06    9.52401e+03
      Potential    Kinetic En.   Total Energy  Conserved En.    Temperature
   -2.96965e+06    5.33720e+05   -2.43593e+06   -2.43610e+06    3.03032e+02
 Pressure (bar)   Constr. rmsd
    8.04383e+01    0.00000e+00

step  650: timed with pme grid 120 120 120, coulomb cutoff 1.200: 474.6 M-cycles
step  750: timed with pme grid 108 108 108, coulomb cutoff 1.326: 824.2 M-cycles
step  850: timed with pme grid 112 112 112, coulomb cutoff 1.279: 433.4 M-cycles
step  950: timed with pme grid 120 120 120, coulomb cutoff 1.200: 427.8 M-cycles
step 1050: timed with pme grid 112 112 112, coulomb cutoff 1.279: 520.8 M-cycles
step 1150: timed with pme grid 120 120 120, coulomb cutoff 1.200: 426.7 M-cycles
              optimal pme grid 120 120 120, coulomb cutoff 1.200
           Step           Time
           1250        5.00000

   Energies (kJ/mol)
           Bond            U-B    Proper Dih.  Improper Dih.      CMAP Dih.
    4.62781e+03    1.15377e+04    1.32760e+04    7.57956e+02   -8.43420e+02
          LJ-14     Coulomb-14        LJ (SR)   Coulomb (SR)   Coul. recip.
    4.39888e+03    5.30850e+04    3.03912e+05   -3.36666e+06    9.93560e+03
      Potential    Kinetic En.   Total Energy  Conserved En.    Temperature
   -2.96597e+06    5.36569e+05   -2.42940e+06   -2.43713e+06    3.04650e+02
 Pressure (bar)   Constr. rmsd
   -7.99465e+01    0.00000e+00

           Step           Time
           2500       10.00000

   Energies (kJ/mol)
           Bond            U-B    Proper Dih.  Improper Dih.      CMAP Dih.
    4.52426e+03    1.15839e+04    1.31890e+04    6.82653e+02   -7.57863e+02
          LJ-14     Coulomb-14        LJ (SR)   Coulomb (SR)   Coul. recip.
    4.37433e+03    5.31289e+04    3.04483e+05   -3.37403e+06    9.91455e+03
      Potential    Kinetic En.   Total Energy  Conserved En.    Temperature
   -2.97291e+06    5.32777e+05   -2.44013e+06   -2.43538e+06    3.02497e+02
 Pressure (bar)   Constr. rmsd
   -4.47560e+01    0.00000e+00

           Step           Time
           3750       15.00000

   Energies (kJ/mol)
           Bond            U-B    Proper Dih.  Improper Dih.      CMAP Dih.
    4.81530e+03    1.13377e+04    1.33536e+04    7.01424e+02   -8.71726e+02
          LJ-14     Coulomb-14        LJ (SR)   Coulomb (SR)   Coul. recip.
    4.41416e+03    5.27515e+04    3.10521e+05   -3.37835e+06    9.75775e+03
      Potential    Kinetic En.   Total Energy  Conserved En.    Temperature
   -2.97157e+06    5.32022e+05   -2.43955e+06   -2.43455e+06    3.02068e+02
 Pressure (bar)   Constr. rmsd
    1.21204e+02    0.00000e+00

           Step           Time
           5000       20.00000

   Energies (kJ/mol)
           Bond            U-B    Proper Dih.  Improper Dih.      CMAP Dih.
    4.75789e+03    1.13901e+04    1.31921e+04    6.94501e+02   -8.93060e+02
          LJ-14     Coulomb-14        LJ (SR)   Coulomb (SR)   Coul. recip.
    4.42497e+03    5.28399e+04    3.04265e+05   -3.36767e+06    9.82345e+03
      Potential    Kinetic En.   Total Energy  Conserved En.    Temperature
   -2.96718e+06    5.34198e+05   -2.43298e+06   -2.43454e+06    3.03304e+02
 Pressure (bar)   Constr. rmsd
   -4.93898e+01    0.00000e+00

           Step           Time
           6250       25.00000

   Energies (kJ/mol)
           Bond            U-B    Proper Dih.  Improper Dih.      CMAP Dih.
    4.57695e+03    1.17273e+04    1.32145e+04    7.13992e+02   -8.99832e+02
          LJ-14     Coulomb-14        LJ (SR)   Coulomb (SR)   Coul. recip.
    4.47821e+03    5.29900e+04    3.04806e+05   -3.36971e+06    9.90996e+03
      Potential    Kinetic En.   Total Energy  Conserved En.    Temperature
   -2.96819e+06    5.33983e+05   -2.43421e+06   -2.43430e+06    3.03181e+02
 Pressure (bar)   Constr. rmsd
   -2.79513e+01    0.00000e+00

           Step           Time
           7500       30.00000

   Energies (kJ/mol)
           Bond            U-B    Proper Dih.  Improper Dih.      CMAP Dih.
    4.87142e+03    1.16783e+04    1.34093e+04    7.13340e+02   -9.64884e+02
          LJ-14     Coulomb-14        LJ (SR)   Coulomb (SR)   Coul. recip.
    4.34392e+03    5.28699e+04    3.05445e+05   -3.36761e+06    9.99079e+03
      Potential    Kinetic En.   Total Energy  Conserved En.    Temperature
   -2.96526e+06    5.33365e+05   -2.43189e+06   -2.43376e+06    3.02831e+02
 Pressure (bar)   Constr. rmsd
   -6.39932e+00    0.00000e+00

           Step           Time
           8750       35.00000

   Energies (kJ/mol)
           Bond            U-B    Proper Dih.  Improper Dih.      CMAP Dih.
    4.65676e+03    1.15315e+04    1.32701e+04    6.83755e+02   -8.20625e+02
          LJ-14     Coulomb-14        LJ (SR)   Coulomb (SR)   Coul. recip.
    4.46724e+03    5.31731e+04    3.05204e+05   -3.37183e+06    9.96656e+03
      Potential    Kinetic En.   Total Energy  Conserved En.    Temperature
   -2.96969e+06    5.36749e+05   -2.43295e+06   -2.43211e+06    3.04752e+02
 Pressure (bar)   Constr. rmsd
    3.67423e+01    0.00000e+00

           Step           Time
          10000       40.00000

   Energies (kJ/mol)
           Bond            U-B    Proper Dih.  Improper Dih.      CMAP Dih.
    4.65391e+03    1.15818e+04    1.32497e+04    7.31425e+02   -8.54568e+02
          LJ-14     Coulomb-14        LJ (SR)   Coulomb (SR)   Coul. recip.
    4.47262e+03    5.30003e+04    3.07434e+05   -3.37639e+06    9.86487e+03
      Potential    Kinetic En.   Total Energy  Conserved En.    Temperature
   -2.97226e+06    5.33111e+05   -2.43915e+06   -2.43193e+06    3.02687e+02
 Pressure (bar)   Constr. rmsd
    7.53251e+01    0.00000e+00

           Step           Time
          11250       45.00000

   Energies (kJ/mol)
           Bond            U-B    Proper Dih.  Improper Dih.      CMAP Dih.
    4.72351e+03    1.16880e+04    1.31416e+04    6.64549e+02   -8.62839e+02
          LJ-14     Coulomb-14        LJ (SR)   Coulomb (SR)   Coul. recip.
    4.35162e+03    5.29298e+04    3.05876e+05   -3.37132e+06    9.90809e+03
      Potential    Kinetic En.   Total Energy  Conserved En.    Temperature
   -2.96890e+06    5.34509e+05   -2.43439e+06   -2.43166e+06    3.03480e+02
 Pressure (bar)   Constr. rmsd
    1.96575e+00    0.00000e+00

mdrun error log:

                      :-) GROMACS - gmx mdrun, 2025.0 (-:

Executable:   /usr/local/gromacs/avx2_256/bin/gmx
Data prefix:  /usr/local/gromacs/avx2_256
Working dir:  /dev/shm/unbinding/traj_segs/000093/000933
Command line:
  gmx mdrun -ntmpi 1 -nt 4 -pin on -pinoffset 128 -pinstride 1 -update gpu -nb gpu -pme gpu -pmefft gpu -bonded cpu -deffnm seg -cpt -1 -nocpnum -cpo /dev/shm/null -noconfout

Reading file seg.tpr, VERSION 2025.0 (single precision)
Changing nstlist from 10 to 50, rlist from 1.22 to 1.347

1 GPU selected for this run.
Mapping of GPU IDs to the 2 GPU tasks in the 1 rank on this node:
  PP:0,PME:0
PP tasks will do (non-perturbed) short-ranged interactions on the GPU
PP task will update and constrain coordinates on the GPU
PME tasks will do all aspects on the GPU
Using 1 MPI thread
Using 4 OpenMP threads 


Overriding thread affinity set outside gmx mdrun

Applying core pinning offset 128
starting mdrun 'Title in water'
12500 steps,     50.0 ps.
terminate called after throwing an instance of 'gmx::InternalError'
  what():  Freeing of the device buffer failed. CUDA error #717 (cudaErrorInvalidAddressSpace): operation not supported on global/shared address space.
(redacted)/runseg.sh: line 115: 408967 Aborted                 (core dumped)

I'll also note that I'm running GROMACS 2025.0 inside a Docker container built on top of the NVIDIA PyTorch Release 24.10 container, which ships with CUDA 12.6.2.

The relevant installation code is reproduced below:

FROM nvcr.io/nvidia/pytorch:24.10-py3 AS builder

############################
# Build FFTW
############################
ARG FFTW_VERSION=3.3.10
WORKDIR /tmp/fftw
RUN wget --no-check-certificate ftp://ftp.fftw.org/pub/fftw/fftw-${FFTW_VERSION}.tar.gz && \
    tar -xf /tmp/fftw/fftw-${FFTW_VERSION}.tar.gz && \
    cd /tmp/fftw/fftw-${FFTW_VERSION} && \
    CC=gcc CFLAGS='-march=haswell -mtune=haswell -O3 -pipe' \
    CXX=g++ CXXFLAGS='-march=haswell -mtune=haswell -O3 -pipe' \
    FFLAGS='-march=haswell -mtune=haswell -O3 -pipe' \
    LDFLAGS=-Wl,--as-needed \
    ./configure --prefix=/usr/local/fftw --enable-avx --enable-avx2 \
        --enable-float --enable-shared --enable-sse2 --enable-threads && \
    make -j"$(nproc)" && \
    make -j"$(nproc)" install && \
    cd / && rm -rf /tmp/fftw

############################
# Build GROMACS
############################
ARG GMX_VERSION=2025.0
WORKDIR /tmp/gromacs
RUN wget --no-check-certificate ftp://ftp.gromacs.org/gromacs/gromacs-${GMX_VERSION}.tar.gz && \
    tar -xf gromacs-${GMX_VERSION}.tar.gz && \
    cd gromacs-${GMX_VERSION} && mkdir build && cd build && \
    CC=gcc CFLAGS='-march=haswell -mtune=haswell -O3 -pipe' \
    CXX=g++ CXXFLAGS='-march=haswell -mtune=haswell -O3 -pipe' \
    FFLAGS='-march=haswell -mtune=haswell -O3 -pipe' \
    LDFLAGS=-Wl,--as-needed \
    cmake \
      -DCMAKE_INSTALL_PREFIX=/usr/local/gromacs/avx2_256 \
      -DGMX_SIMD=AVX2_256 \
      -DCMAKE_BUILD_TYPE=Release \
      -DCMAKE_C_FLAGS_RELEASE='-march=haswell -mtune=haswell -O3 -pipe -DNDEBUG' \
      -DCMAKE_CXX_FLAGS_RELEASE='-march=haswell -mtune=haswell -O3 -pipe -DNDEBUG' \
      -DREGRESSIONTEST_DOWNLOAD=OFF \
      -DBUILD_SHARED_LIBS=ON \
      -DGMX_GPU=CUDA \
      -DGMX_OPENMP=True \
      -DGMX_FFT_LIBRARY=fftw3 \
      -DFFTWF_LIBRARY=/usr/local/fftw/lib/libfftw3f.so \
      -DFFTWF_INCLUDE_DIR=/usr/local/fftw/include \
      -DGMX_BUILD_OWN_FFTW=OFF \
      -DGMX_BUILD_OWN_BLAS=ON \
      -DGMX_BUILD_OWN_LAPACK=ON \
      -DGMX_DOUBLE=OFF \
      -DGMX_X11=OFF \
      -DGMX_THREAD_MPI=ON \
      -DGMXAPI=OFF \
      -DGMX_CUDA_TARGET_SM='80;86;90' \
      ../ && \
    cmake --build . --target all -- -j$(nproc) && \
    cmake --build . --target install -- -j$(nproc) && \
    cd / && rm -rf /tmp/gromacs
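
For completeness, the image is built and run roughly like this (sketch; the tag and script path are illustrative). I use the host IPC namespace since both the working directories and the MPS files live in /dev/shm:

docker build -t gmx-h100:2025.0 .
docker run --rm --gpus all --ipc=host gmx-h100:2025.0 /path/to/runseg.sh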

I should also mention that I have run very similar simulations on HPC compute nodes with eight A100 GPUs, and I never observed these crashes on the older A100 nodes.

Based on these observations, I suspect a race condition in the GPU update/constraints path. I am not familiar with this code, but the fact that the crash appears during teardown (specifically in the destructor of UpdateConstrainGpu::Impl) suggests that the buffer being freed might still be referenced by an asynchronous kernel that had not completed due to a synchronization issue. This could be aggravated by the more aggressive scheduling on H100 hardware.
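
In case it is useful, here is what I am planning to try next (sketch; I would run a single replicate, without MPS, to avoid interference from the other clients):

# Make kernel launches synchronous so the failing CUDA call is reported
# at its call site instead of at a later API call (e.g. the destructor):
CUDA_LAUNCH_BLOCKING=1 gmx mdrun -ntmpi 1 -nt 4 -update gpu -nb gpu -pme gpu -deffnm seg

# Check for invalid device memory accesses with NVIDIA's sanitizer:
compute-sanitizer --tool memcheck gmx mdrun -ntmpi 1 -nt 4 -update gpu -nb gpu -pme gpu -deffnm seg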