Core dumped error during production run

Hi, while running my production simulation I keep hitting this "core dumped" error again and again, which is delaying completion. I tried to find the source of the error but could not find anything in the log file. This has now happened three times on the same simulation. Can anyone please help me find the cause and fix it?

:-) GROMACS - gmx mdrun, 2025.2 (-:

Executable: /usr/local/gromacs/bin/gmx
Data prefix: /usr/local/gromacs
Working dir: /mnt/c/Users/USER/Desktop/project/paper/md2
Command line:
gmx mdrun -v -deffnm md -ntmpi 1 -ntomp 14 -pin on -cpi md.cpt

Reading file md.tpr, VERSION 2025.2 (single precision)
Changing nstlist from 20 to 100, rlist from 1.223 to 1.371

Update groups can not be used for this system because atoms that are (in)directly constrained together are interdispersed with other atoms

1 GPU selected for this run.
Mapping of GPU IDs to the 2 GPU tasks in the 1 rank on this node:
PP:0,PME:0
PP tasks will do (non-perturbed) short-ranged interactions on the GPU
PP task will update and constrain coordinates on the GPU
PME tasks will do all aspects on the GPU
Using 1 MPI thread
Using 14 OpenMP threads

WARNING: This run will generate roughly 2916 Mb of data

starting mdrun 'Protein in water'
50000000 steps, 100000.0 ps (continuing from step 9065900, 18131.8 ps).
step 9069700: timed with pme grid 64 64 64, coulomb cutoff 1.200: 450.2 M-cycles
step 9069900: timed with pme grid 56 56 56, coulomb cutoff 1.355: 680.6 M-cycles
step 9070100: timed with pme grid 60 60 60, coulomb cutoff 1.265: 475.9 M-cycles
step 9070300: timed with pme grid 64 64 64, coulomb cutoff 1.200: 403.5 M-cycles
step 9070500: timed with pme grid 64 64 64, coulomb cutoff 1.200: 431.1 M-cycles
optimal pme grid 64 64 64, coulomb cutoff 1.200
step 17850400, will finish Sun Sep 21 14:34:18 2025^C^C^C^CAborted (core dumped)
pooja@Pooja:/mnt/c/Users/USER/Desktop/project/paper/md2$ gmx mdrun -v -deffnm md -ntmpi 1 -ntomp 14 -pin on -cpi md.cpt
:-) GROMACS - gmx mdrun, 2025.2 (-:

Executable: /usr/local/gromacs/bin/gmx
Data prefix: /usr/local/gromacs
Working dir: /mnt/c/Users/USER/Desktop/project/paper/md2
Command line:
gmx mdrun -v -deffnm md -ntmpi 1 -ntomp 14 -pin on -cpi md.cpt

Reading file md.tpr, VERSION 2025.2 (single precision)
Changing nstlist from 20 to 100, rlist from 1.222 to 1.371

Update groups can not be used for this system because atoms that are (in)directly constrained together are interdispersed with other atoms

1 GPU selected for this run.
Mapping of GPU IDs to the 2 GPU tasks in the 1 rank on this node:
PP:0,PME:0
PP tasks will do (non-perturbed) short-ranged interactions on the GPU
PP task will update and constrain coordinates on the GPU
PME tasks will do all aspects on the GPU
Using 1 MPI thread
Using 14 OpenMP threads

WARNING: This run will generate roughly 2318 Mb of data

starting mdrun 'Protein in water'
50000000 steps, 100000.0 ps (continuing from step 17481900, 34963.8 ps).
step 17485800: timed with pme grid 64 64 64, coulomb cutoff 1.200: 410.8 M-cycles
step 17486000: timed with pme grid 56 56 56, coulomb cutoff 1.358: 517.9 M-cycles
step 17486200: timed with pme grid 60 60 60, coulomb cutoff 1.267: 451.2 M-cycles
step 17486400: timed with pme grid 64 64 64, coulomb cutoff 1.200: 405.0 M-cycles
optimal pme grid 64 64 64, coulomb cutoff 1.200
step 28968400, will finish Sun Sep 21 14:54:51 2025^C^C^C^C^C^CAborted (core dumped)

Are there no warnings just before the core dump?

You can do:

gdb /usr/local/gromacs/bin/gmx <core file name>

to find out where in the code the issue occurred. You need to install gdb if you don’t already have it. Please report the output here.
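Note that under WSL/Ubuntu, core files are often disabled or redirected by default, so there may be no core file to feed to gdb. A minimal sketch for checking and enabling them in the shell that launches mdrun (assuming bash; the core pattern path is the standard Linux one):

```shell
# Core dumps may be disabled by default; raise the soft limit in the
# shell that will launch mdrun.
ulimit -c unlimited
ulimit -c            # should now print "unlimited"

# Check where the kernel writes core files (it may be piped to a
# handler rather than written to the working directory).
cat /proc/sys/kernel/core_pattern
```

If `core_pattern` points at a pipe handler, the core file may land somewhere other than the working directory.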

The process froze again, this time at step 506200, and CPU and GPU usage dropped to zero.

pooja@Pooja:/mnt/c/Users/USER/Desktop/project/chicken/paper/crash_report$ gdb --batch -ex "run" -ex "bt full" -ex "quit" --args /usr/local/gromacs/bin/gmx mdrun -v -deffnm md -ntmpi 1 -ntomp 14 -pin on
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
warning: could not find '.gnu_debugaltlink' file for /lib/x86_64-linux-gnu/libblas.so.3
warning: could not find '.gnu_debugaltlink' file for /lib/x86_64-linux-gnu/liblapack.so.3
warning: could not find '.gnu_debugaltlink' file for /lib/x86_64-linux-gnu/libopenblas.so.0
[New Thread 0x7fffd9fff000 (LWP 1229)]
[New Thread 0x7fffd17fe000 (LWP 1230)]
[New Thread 0x7fffc8ffd000 (LWP 1231)]
[New Thread 0x7fffc07fc000 (LWP 1232)]
[New Thread 0x7fffb7ffb000 (LWP 1233)]
[New Thread 0x7fffaf7fa000 (LWP 1234)]
[New Thread 0x7fffaeff9000 (LWP 1235)]
[New Thread 0x7fff9e7f8000 (LWP 1236)]
[New Thread 0x7fff95ff7000 (LWP 1237)]
[New Thread 0x7fff8d7f6000 (LWP 1238)]
[New Thread 0x7fff8cff5000 (LWP 1239)]
[New Thread 0x7fff847f4000 (LWP 1240)]
[New Thread 0x7fff7bff3000 (LWP 1241)]
[New Thread 0x7fff6b7f2000 (LWP 1242)]
[New Thread 0x7fff6aff1000 (LWP 1243)]
:-) GROMACS - gmx mdrun, 2025.2 (-:

Executable: /usr/local/gromacs/bin/gmx
Data prefix: /usr/local/gromacs
Working dir: /mnt/c/Users/USER/Desktop/project/chicken/paper/crash_report
Command line:
gmx mdrun -v -deffnm md -ntmpi 1 -ntomp 14 -pin on

[New Thread 0x7fff585ff000 (LWP 1244)]
[New Thread 0x7fff579ff000 (LWP 1245)]
[Thread 0x7fff579ff000 (LWP 1245) exited]
[New Thread 0x7fff579ff000 (LWP 1246)]
[Thread 0x7fff579ff000 (LWP 1246) exited]

Back Off! I just backed up md.log to ./#md.log.1#
Reading file md.tpr, VERSION 2025.2 (single precision)
Changing nstlist from 20 to 100, rlist from 1.223 to 1.371

Update groups can not be used for this system because atoms that are (in)directly constrained together are interdispersed with other atoms

1 GPU selected for this run.
Mapping of GPU IDs to the 2 GPU tasks in the 1 rank on this node:
PP:0,PME:0
PP tasks will do (non-perturbed) short-ranged interactions on the GPU
PP task will update and constrain coordinates on the GPU
PME tasks will do all aspects on the GPU
[New Thread 0x7fff579ff000 (LWP 1247)]
Using 1 MPI thread
Using 14 OpenMP threads

[New Thread 0x7fff491ff000 (LWP 1248)]
[New Thread 0x7fff489fe000 (LWP 1249)]
[New Thread 0x7fff43fff000 (LWP 1250)]
[New Thread 0x7fff437fe000 (LWP 1251)]
[New Thread 0x7fff42ffd000 (LWP 1252)]
[New Thread 0x7fff427fc000 (LWP 1253)]
[New Thread 0x7fff41ffb000 (LWP 1254)]
[New Thread 0x7fff417fa000 (LWP 1255)]
[New Thread 0x7fff40ff9000 (LWP 1256)]
[New Thread 0x7fff407f8000 (LWP 1257)]
[New Thread 0x7fff3fff7000 (LWP 1258)]
[New Thread 0x7fff3f7f6000 (LWP 1259)]
[New Thread 0x7fff3eff5000 (LWP 1260)]

WARNING: This run will generate roughly 3561 Mb of data

starting mdrun 'Protein in water'
50000000 steps, 100000.0 ps.
step 3000: timed with pme grid 64 64 64, coulomb cutoff 1.200: 562.9 M-cycles
step 3200: timed with pme grid 56 56 56, coulomb cutoff 1.356: 622.2 M-cycles
step 3400: timed with pme grid 48 48 48, coulomb cutoff 1.582: 875.1 M-cycles
step 3600: timed with pme grid 52 52 52, coulomb cutoff 1.461: 715.6 M-cycles
step 3800: timed with pme grid 56 56 56, coulomb cutoff 1.356: 699.8 M-cycles
step 4000: timed with pme grid 60 60 60, coulomb cutoff 1.266: 500.9 M-cycles
optimal pme grid 60 60 60, coulomb cutoff 1.266
step 506200, will finish Sun Sep 28 21:19:00 2025
Thread 1 "gmx" received signal SIGSTOP, Stopped (signal).
0x00007fff58bdbb7a in ?? () from /usr/lib/wsl/drivers/nvcvsi.inf_amd64_5313096c5a0237cd/libcuda.so.1.1
#0 0x00007fff58bdbb7a in ?? () from /usr/lib/wsl/drivers/nvcvsi.inf_amd64_5313096c5a0237cd/libcuda.so.1.1
No symbol table info available.
#1 0x00007fff58854dee in ?? () from /usr/lib/wsl/drivers/nvcvsi.inf_amd64_5313096c5a0237cd/libcuda.so.1.1
No symbol table info available.
#2 0x00007fff588f597c in ?? () from /usr/lib/wsl/drivers/nvcvsi.inf_amd64_5313096c5a0237cd/libcuda.so.1.1
No symbol table info available.
#3 0x00007fff588a8efb in cuEventSynchronize () from /usr/lib/wsl/drivers/nvcvsi.inf_amd64_5313096c5a0237cd/libcuda.so.1.1
No symbol table info available.
#4 0x00007ffff0ece602 in libcudart_static_daedcea17177362c37c34d10632b001f03c10dcf () from /usr/local/gromacs/lib/libgromacs.so.10
No symbol table info available.
#5 0x00007ffff0f10e48 in cudaEventSynchronize () from /usr/local/gromacs/lib/libgromacs.so.10
No symbol table info available.
#6 0x00007ffff07bbd6d in gmx::StatePropagatorDataGpu::Impl::waitVelocitiesReadyOnHost(gmx::AtomLocality) () from /usr/local/gromacs/lib/libgromacs.so.10
No symbol table info available.
#7 0x00007ffff07624dd in gmx::LegacySimulator::do_md() () from /usr/local/gromacs/lib/libgromacs.so.10
No symbol table info available.
#8 0x00007ffff07972a8 in gmx::Mdrunner::mdrunner() () from /usr/local/gromacs/lib/libgromacs.so.10
No symbol table info available.
#9 0x000055555555f73c in gmx::gmx_mdrun(tmpi_comm_*, gmx_hw_info_t const&, int, char**) ()
No symbol table info available.
#10 0x000055555555f877 in gmx::gmx_mdrun(int, char**) ()
No symbol table info available.
#11 0x00007fffefe67a53 in gmx::CommandLineModuleManager::run(int, char**) () from /usr/local/gromacs/lib/libgromacs.so.10
No symbol table info available.
#12 0x000055555555bf10 in main ()
No symbol table info available.
A debugging session is active.

    Inferior 1 [process 1226] will be killed.

Quit anyway? (y or n) [answered Y; input not from terminal]

Thanks for the stack dump. I have not seen a similar issue before.

@al42and or @pszilard Do you have any ideas?

Looks like the GPU hung. I'm inclined to blame WSL / drivers, but perhaps we're doing something bad there.

@Ashutosh, can you try running with -update cpu? Did you notice any behaviour signalling a GPU reset (screen flickering etc. around the time the simulation froze)?
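For reference, a restart from the checkpoint with GPU update disabled might look like this (assuming the same file names and thread counts as in the runs above):

```shell
# Continue from the checkpoint, but do the coordinate update and
# constraints on the CPU (-update cpu) instead of the GPU; all other
# flags as in the original run.
gmx mdrun -v -deffnm md -ntmpi 1 -ntomp 14 -pin on -cpi md.cpt -update cpu
```

This keeps the short-ranged and PME work on the GPU while moving only the integration step off it, which should narrow down whether the hang is in the GPU update path.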

We also do some manual GPU queue flushing with CUDA on bare Windows. Not sure if it is needed with WSL, but it might be worth trying to enable the same workaround. Replace if (GMX_NATIVE_WINDOWS) with if (true) in src/gromacs/nbnxm/cuda/nbnxm_cuda.cu (two occurrences there) and rebuild the code.
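A possible one-liner for making that edit from the top of the GROMACS source tree before rebuilding (assuming the text appears exactly as written above):

```shell
# Replace both occurrences of the Windows-only guard so the queue-flush
# workaround is always enabled, then count the changed sites to verify.
sed -i 's/if (GMX_NATIVE_WINDOWS)/if (true)/g' src/gromacs/nbnxm/cuda/nbnxm_cuda.cu
grep -c 'if (true)' src/gromacs/nbnxm/cuda/nbnxm_cuda.cu
```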

Sure, I will clean the build and rebuild with the latest driver, CUDA, and GROMACS updates, with src/gromacs/nbnxm/cuda/nbnxm_cuda.cu modified as suggested. I didn't notice any signs of a GPU reset, but I did find something in the Windows Event Viewer: this nvlddmkm error.

The description for Event ID 153 from source nvlddmkm cannot be found. Either the component that raises this event is not installed on your local computer or the installation is corrupted. You can install or repair the component on the local computer.

If the event originated on another computer, the display information had to be saved with the event.

The following information was included with the event:

\Device\Video16
Error occurred on GPUID: 100

The message resource is present but the message was not found in the message table