I tried with extra debug logging and found something strange.
When I run
ACPP_DEBUG_LEVEL=3 AMD_LOG_LEVEL=4 gmx mdrun -s testsystem.tpr -deffnm test2 -pin on -pinstride 1 -ntmpi 1 -ntomp 8 -v -bonded gpu &> debug_log.txt
the simulation crashes with a core dump. The last lines of the output are:
[AdaptiveCpp Info] inorder_executor: Dispatching to lane 0x3886c2c0: Memcpy: CPU-Device0 #1 {0, 0, 0}+{1, 1, 49152}-->ROCm-Device0 #1 {0, 0, 0}+{1, 1, 49152}{1, 1, 49152}
:3:hip_memory.cpp :1543: 2052086538d us: hipMemcpyAsync ( 0x152cdc3ef000, 0x3a962350, 49152, hipMemcpyHostToDevice, stream:0x389cacf0 )
:4:command.cpp :352 : 2052086549d us: Command (CopyHostToDevice) enqueued: 0x382f0a50
:3:rocvirtual.cpp :168 : 2052086555d us: Signal = (0x152fca7fa680), Translated start/end = 2052086386999 / 2052086389719, Elapsed = 2720 ns, ticks start/end = 207224658597 / 207224658869, Ticks elapsed = 272
:4:command.cpp :167 : 2052086020d us: Command 0x382ef510 complete (Wall: 4811369, CPU: 0, GPU: 281 us)
/usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/bits/stl_vector.h:1282: const_reference std::vector<amd::roc::ProfilingSignal *>::operator[](size_type) const [_Tp = amd::roc::ProfilingSignal *, _Alloc = std::allocator<amd::roc::ProfilingSignal *>]: Assertion '__n < this->size()' failed.
:4:rocblit.cpp :822 : 2052086568d us: HSA Async Copy staged H2D dst=0x152cdc3ef000, src=0x152ec0200000, size=49152, completion_signal=0x152fca7fa600
:4:commandqueue.cpp :151 : 2052086608d us: Marker queued to ensure finish
On the other hand, if I run the same command but pipe the output through tee:
ACPP_DEBUG_LEVEL=3 AMD_LOG_LEVEL=4 gmx mdrun -s testsystem.tpr -deffnm test2 -pin on -pinstride 1 -ntmpi 1 -ntomp 8 -v -bonded gpu 2>&1 | tee debug_log2.txt
the run continues for a while without crashing (I stopped it manually after the log file reached ~20 GB).
If I run with -pme gpu -bonded gpu and the extra debug logging, I get the same assertion failure.
I’ll try to reinstall AdaptiveCpp from the stable release archive and rebuild GROMACS, then report back.
Thanks