Mdrun crashes with some MD parameters during NPT (gmx compiled for ROCm, AdaptiveCpp)

GROMACS version: 2025.2
GROMACS modification: Yes/No

Hi,
I’m doing some testing with different cutoffs for the same system and 5 different sets of initial velocities. I’m unable to finish the NPT equilibration for several trials (the ones with a missing npt.gro):

  • rcoulomb(=rvdw)=0.8:
    ./cutoff08_1/npt.gro
    missing
    ./cutoff08_3/npt.gro
    ./cutoff08_4/npt.gro
    ./cutoff08_5/npt.gro

  • rcoulomb(=rvdw)=1.0:
    missing
    missing
    missing
    missing
    ./cutoff10_5/npt.gro

  • rcoulomb(=rvdw)=1.2:
    ./cutoff12_1/npt.gro
    ./cutoff12_2/npt.gro
    ./cutoff12_3/npt.gro
    ./cutoff12_4/npt.gro
    missing

  • rcoulomb(=rvdw)=1.4:
    ./cutoff14_1/npt.gro
    missing
    ./cutoff14_3/npt.gro
    ./cutoff14_4/npt.gro
    ./cutoff14_5/npt.gro

When npt.gro is not generated, mdrun crashes with this error:

/usr/include/c++/15.1.1/bits/stl_vector.h:1263: std::vector<_Tp, _Alloc>::reference std::vector<_Tp, _Alloc>::operator[](size_type) [with _Tp = long unsigned int; _Alloc = std::alloc
ator<long unsigned int>; reference = long unsigned int&; size_type = long unsigned int]: Assertion '__n < this->size()' failed.                                                       
[mycomputer:172703] *** Process received signal ***                                                                                                                                        
[mycomputer:172703] Signal: Aborted (6)                                                                                                                                                    
[mycomputer:172703] Signal code:  (-6)                                                                                                                                                     
[mycomputer:172703] [ 0] /usr/lib/libc.so.6(+0x3def0) [0x7fcd62e4def0]                                                                                                                     
[mycomputer:172703] [ 1] /usr/lib/libc.so.6(+0x9774c) [0x7fcd62ea774c]                                                                                                                     
[mycomputer:172703] [ 2] /usr/lib/libc.so.6(gsignal+0x20) [0x7fcd62e4ddc0]                                                                                                                 
[mycomputer:172703] [ 3] /usr/lib/libc.so.6(abort+0x26) [0x7fcd62e3557a]
[mycomputer:172703] [ 4] /usr/lib/libstdc++.so.6(+0x9a421) [0x7fcd6309a421]
[mycomputer:172703] [ 5] /opt/rocm/lib/libhsa-runtime64.so.1(+0xf924) [0x7fcd59e0f924]
[mycomputer:172703] [ 6] /opt/rocm/lib/libhsa-runtime64.so.1(+0x3baec) [0x7fcd59e3baec]
[mycomputer:172703] [ 7] /opt/rocm/lib/libhsa-runtime64.so.1(+0x4cdb0) [0x7fcd59e4cdb0]
[mycomputer:172703] [ 8] /opt/rocm/lib/libhsa-runtime64.so.1(+0x4ce5d) [0x7fcd59e4ce5d]
[mycomputer:172703] [ 9] /opt/rocm/lib/libamdhip64.so.6(+0x39047f) [0x7fcd5af9047f]
[mycomputer:172703] [10] /opt/rocm/lib/libamdhip64.so.6(+0x390799) [0x7fcd5af90799]
[mycomputer:172703] [11] /opt/rocm/lib/libamdhip64.so.6(+0x3975e4) [0x7fcd5af975e4]
[mycomputer:172703] [12] /opt/rocm/lib/libamdhip64.so.6(+0x359dd1) [0x7fcd5af59dd1]
[mycomputer:172703] [13] /opt/rocm/lib/libamdhip64.so.6(+0x361704) [0x7fcd5af61704]
[mycomputer:172703] [14] /opt/rocm/lib/libamdhip64.so.6(+0x335ef5) [0x7fcd5af35ef5]
[mycomputer:172703] [15] /opt/rocm/lib/libamdhip64.so.6(+0x17c542) [0x7fcd5ad7c542]
[mycomputer:172703] [16] /opt/rocm/lib/libamdhip64.so.6(+0x191538) [0x7fcd5ad91538]
[mycomputer:172703] [17] /opt/rocm/lib/libamdhip64.so.6(+0x1a37ab) [0x7fcd5ada37ab]
[mycomputer:172703] [18] /home/myuser/AdaptiveCPP-25.02/lib/hipSYCL/librt-backend-hip.so(_ZN7hipsycl2rt9hip_queue13submit_memcpyERNS0_16memcpy_operationERKSt10shared_ptrINS0_8dag_nodeEE
+0x2fa) [0x7fcd58028e7a]                     
[mycomputer:172703] [19] /home/myuser/AdaptiveCPP-25.02/lib/libacpp-rt.so(+0x35bae) [0x7fcd639c5bae]
[mycomputer:172703] [20] /home/myuser/AdaptiveCPP-25.02/lib/libacpp-rt.so(+0x324a4) [0x7fcd639c24a4]
[mycomputer:172703] [21] /home/myuser/AdaptiveCPP-25.02/lib/libacpp-rt.so(_ZN7hipsycl2rt16inorder_executor15submit_directlyERKSt10shared_ptrINS0_8dag_nodeEEPNS0_9operationERKN3sbo12smal
l_vectorIS4_Lm8EEE+0xb60) [0x7fcd639c54d0]   
[mycomputer:172703] [22] /home/myuser/gromacs.single/gromacs-2025.2_gpu/bin/../lib/libgromacs_mpi.so.10(_ZN7hipsycl4sycl7handler11create_taskESt10unique_ptrINS_2rt9operationESt14default
_deleteIS4_EERKNS3_15execution_hintsERKNS3_17requirements_listE+0x292) [0x7fcd642e7cc2]
[mycomputer:172703] [23] /home/myuser/gromacs.single/gromacs-2025.2_gpu/bin/../lib/libgromacs_mpi.so.10(_ZN7hipsycl4sycl7handler6memcpyEPvPKvm+0x3de) [0x7fcd648b049e]
[mycomputer:172703] [24] /home/myuser/gromacs.single/gromacs-2025.2_gpu/bin/../lib/libgromacs_mpi.so.10(_ZN7hipsycl4sycl5queue18execute_submissionIZ20copyFromDeviceBufferIN3gmx11BasicVe
ctorIfEEEvPT_P12DeviceBufferIS7_EmmRK12DeviceStream18GpuApiCallBehaviorPPvEUlRNS0_7handlerEE0_EESt10shared_ptrINS_2rt8dag_nodeEES7_SJ_+0x1c3) [0x7fcd648b9823]
[mycomputer:172703] [25] /home/myuser/gromacs.single/gromacs-2025.2_gpu/bin/../lib/libgromacs_mpi.so.10(_ZN7hipsycl4sycl5queue6submitIZ20copyFromDeviceBufferIN3gmx11BasicVectorIfEEEvPT_
P12DeviceBufferIS7_EmmRK12DeviceStream18GpuApiCallBehaviorPPvEUlRNS0_7handlerEE0_EENS0_5eventERKNS0_13property_listES7_+0x892) [0x7fcd648b9102]
[mycomputer:172703] [26] /home/myuser/gromacs.single/gromacs-2025.2_gpu/bin/../lib/libgromacs_mpi.so.10(_Z20copyFromDeviceBufferIN3gmx11BasicVectorIfEEEvPT_P12DeviceBufferIS3_EmmRK12Dev
iceStream18GpuApiCallBehaviorPPv+0x221) [0x7fcd648b6ff1]                                   
[mycomputer:172703] [27] /home/myuser/gromacs.single/gromacs-2025.2_gpu/bin/../lib/libgromacs_mpi.so.10(_ZN3gmx22StatePropagatorDataGpu4Impl17copyForcesFromGpuENS_8ArrayRefINS_11BasicVe
ctorIfEEEENS_12AtomLocalityE+0x11e) [0x7fcd6511ed7e]                                       
[mycomputer:172703] [28] /home/myuser/gromacs.single/gromacs-2025.2_gpu/bin/../lib/libgromacs_mpi.so.10(_ZN3gmx8do_forceEP8_IO_FILEPK9t_commrecPK14gmx_multisim_tRK10t_inputrecRKNS_18MDM
odulesNotifiersEPNS_3AwhEP10gmx_enfrotPNS_10ImdSessionEP6pull_tlP6t_nrnbP13gmx_wallcyclePK14gmx_localtop_tPA3_KfNS_19ArrayRefWithPaddingINS_11BasicVectorIfEEEENS_8ArrayRefISY_EEPK9hi
story_tPNS_16ForceBuffersViewEPA3_fPK9t_mdatomsP14gmx_enerdata_tNS10_IST_EEP10t_forcerecRKNS_21MdrunScheduleWorkloadEPNS_19VirtualSitesHandlerEPfdP9gmx_edsamP24CpuPpLongRangeNonbonde
dsRK22DDBalanceRegionHandler+0x4d8e) [0x7fcd64d3377e]                                      
[mycomputer:172703] [29] /home/myuser/gromacs.single/gromacs-2025.2_gpu/bin/../lib/libgromacs_mpi.so.10(_ZN3gmx15LegacySimulator5do_mdEv+0x3536) [0x7fcd650e0446]
[mycomputer:172703] *** End of error message ***

Any suggestions?

Best Regards

Hi!

Could you share the output of gmx -version, your hardware specs (CPU, GPU), and which ROCm version you are using?

Just to clarify: are you saying that equilibration sometimes works and sometimes crashes with the same input?

Hi,

The gmx command I’m using for non-mdrun tasks is:

gmx -version

GROMACS version:     2025.0-dev
Precision:           mixed
Memory model:        64 bit
MPI library:         thread_mpi
OpenMP support:      enabled (GMX_OPENMP_MAX_THREADS = 128)
GPU support:         disabled
SIMD instructions:   AVX2_256
CPU FFT library:     fftw-3.3.10-sse2-avx
GPU FFT library:     none
Multi-GPU FFT:       none
RDTSCP usage:        enabled
TNG support:         enabled
Hwloc support:       hwloc-2.12.0
Tracing support:     disabled
C compiler:          /usr/bin/gcc-13 GNU 13.3.1
C compiler flags:    -Wno-array-bounds -fexcess-precision=fast -funroll-all-loops -mavx2 -mfma -Wall -Wno-unused -Wunused-value -Wunused-parameter -Wextra -Wno-sign-compare -Wpointer-arith -Wpedantic -Wundef -Werror=stringop-truncation -Wno-missing-field-initializers -O3 -DNDEBUG
C++ compiler:        /usr/bin/g++-13 GNU 13.3.1
C++ compiler flags:  -Wno-array-bounds -fexcess-precision=fast -funroll-all-loops -mavx2 -mfma -Wall -Wextra -Wpointer-arith -Wmissing-declarations -Wpedantic -Wundef -Wstringop-truncation -Wno-missing-field-initializers -Wno-cast-function-type-strict SHELL:-fopenmp -O3 -DNDEBUG
BLAS library:        External - detected on the system
LAPACK library:      External - detected on the system

The gmx I’m using for mdrun is:

alias gmx_acpp="mpirun --np 2 --display-allocation --mca accelerator rocm /home/myuser/gromacs.single/gromacs-2025.2_gpu/bin/gmx_mpi"
gmx_mpi -version

GROMACS version:     2025.2
Precision:           mixed
Memory model:        64 bit
MPI library:         MPI (GPU-aware: HIP)
MPI library version: Open MPI v5.0.7, package: Open MPI builduser@buildhost Distribution, ident: 5.0.7, repo rev: v5.0.7, Feb 14, 2025
OpenMP support:      enabled (GMX_OPENMP_MAX_THREADS = 128)
GPU support:         SYCL (AdaptiveCpp)
NBNxM GPU setup:     super-cluster 2x2x2 / cluster 8 (cluster-pair splitting on)
SIMD instructions:   AVX2_256
CPU FFT library:     fftw-3.3.10-sse2-avx
GPU FFT library:     VkFFT internal (1.3.1) with HIP backend
Multi-GPU FFT:       none
RDTSCP usage:        enabled
TNG support:         enabled
Hwloc support:       disabled
Tracing support:     disabled
C compiler:          /usr/bin/clang Clang 19.1.7
C compiler flags:    -mavx2 -mfma -Wno-missing-field-initializers -O3 -DNDEBUG
C++ compiler:        /usr/bin/clang++ Clang 19.1.7
C++ compiler flags:  -mavx2 -mfma -Wno-reserved-identifier -Wno-missing-field-initializers -Weverything -Wno-c++98-compat -Wno-c++98-compat-pedantic -Wno-source-uses-openmp -Wno-c++17-extensions -Wno-documentation-unknown-command -Wno-covered-switch-default -Wno-switch-enum -Wno-switch-default -Wno-extra-semi-stmt -Wno-weak-vtables -Wno-shadow -Wno-padded -Wno-reserved-id-macro -Wno-double-promotion -Wno-exit-time-destructors -Wno-global-constructors -Wno-documentation -Wno-format-nonliteral -Wno-used-but-marked-unused -Wno-float-equal -Wno-cuda-compat -Wno-conditional-uninitialized -Wno-conversion -Wno-disabled-macro-expansion -Wno-unused-macros -Wno-unsafe-buffer-usage -Wno-unused-parameter -Wno-unused-variable -Wno-newline-eof -Wno-old-style-cast -Wno-zero-as-null-pointer-constant -Wno-unused-but-set-variable -Wno-sign-compare -Wno-unused-result -Wno-old-style-cast -Wno-cast-qual -Wno-suggest-override -Wno-suggest-destructor-override -Wno-zero-as-null-pointer-constant -Wno-cast-function-type-strict SHELL:-fopenmp=libomp -O3 -DNDEBUG
BLAS library:        External - detected on the system
LAPACK library:      External - detected on the system
SYCL version:        AdaptiveCpp 25.02.0+git.883b0e11.20250509.branch.HEAD
SYCL compiler:       /home/myuser/AdaptiveCPP-25.02/lib/cmake/AdaptiveCpp/syclcc-launcher
SYCL compiler flags: -Wno-unknown-cuda-version -Wno-unknown-attributes  --acpp-targets="hip:gfx1032" --acpp-clang=/usr/bin/clang++
SYCL GPU flags:      -ffast-math -DHIPSYCL_ALLOW_INSTANT_SUBMISSION=1 -DACPP_ALLOW_INSTANT_SUBMISSION=1 -fgpu-inline-threshold=99999 -Wno-deprecated-declarations
SYCL targets:        hip:gfx1032

I’m running gromacs on a PC:

GPU: Advanced Micro Devices, Inc. [AMD/ATI] Navi 23 [Radeon RX 6650 XT / 6700S / 6800S] (rev c1)
CPU: 12th Gen Intel(R) Core(TM) i9-12900
Rocm version: 6.4.0

It crashes with the same .mdp input but different nvt.gro input files.

Hi!

Thanks for sharing the details.

First, I see that you have one GPU. In such a case, it is very unlikely that running mpirun --np 2 will be beneficial; one rank per GPU usually gives better performance.

Running a single rank would also make it less likely to hit any driver/MPI issue, and the error trace rather looks like that’s the case. Your GPU is not officially supported by ROCm, and while recent GPUs tend to work okay despite being unsupported, it does mean that the AMD compute stack is less tested on them.

If the error persists with a single rank, what if you do export HSA_ENABLE_SDMA=0 before running the simulation? It’s a shot in the dark, though.
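For example, a single-rank run with that setting could look like this (just a sketch, reusing the install path and -deffnm from your earlier posts):

export HSA_ENABLE_SDMA=0
mpirun --np 1 /home/myuser/gromacs.single/gromacs-2025.2_gpu/bin/gmx_mpi mdrun -ntomp 24 -pin on -deffnm npt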

If it fails every time with the same gro input, it might be that the initial configuration is simply unstable. Does it crash immediately or after some time? Could you try running one of the failed cases CPU-only, to see if the problem is GPU-specific?
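For the CPU-only check, forcing all tasks off the GPU should do it (again, a sketch; these are the standard mdrun task-assignment options):

mpirun --np 1 /home/myuser/gromacs.single/gromacs-2025.2_gpu/bin/gmx_mpi mdrun -nb cpu -pme cpu -bonded cpu -update cpu -deffnm npt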

Thank you for your reply.

For one of the crashed NPT simulations the npt.trr file has 752 frames, for another 1488 frames. I can complete the NPT equilibration with mdrun on CPU, but I also found that on GPU mdrun no longer crashes. I’m using a bash script to run the equilibration for multiple systems, and at the end of the script execution I noticed that some npt.gro files (generated at the end of NPT) were missing; when I ran the mdrun command again outside the script it also crashed at first, but now the problem seems solved.

I’ve done some trials with “--np 1”, but mdrun stalls at step 2600:

Command line:
  gmx_mpi mdrun -ntomp 24 -pin on -v -nb auto -pme auto -pmefft auto -bonded auto -update auto -deffnm npt


Back Off! I just backed up npt.log to ./#npt.log.15#
Reading file npt.tpr, VERSION 2025.0-dev (single precision)
Changing nstlist from 25 to 100, rlist from 1.005 to 1.106

1 GPU selected for this run.
Mapping of GPU IDs to the 1 GPU task in the 1 rank on this node:
  PP:0
PP tasks will do (non-perturbed) short-ranged and most bonded interactions on the GPU
PP task will update and constrain coordinates on the GPU
Using 1 MPI process
Using 24 OpenMP threads 


Overriding thread affinity set outside gmx mdrun

Back Off! I just backed up npt.trr to ./#npt.trr.11#

Back Off! I just backed up npt.edr to ./#npt.edr.11#
starting mdrun 'cluster1-rigid_m1'
1000000 steps,   1000.0 ps.
step  600: timed with pme grid 60 52 100, coulomb cutoff 1.000: 1387.0 M-cycles
step  800: timed with pme grid 52 48 84, coulomb cutoff 1.150: 476.1 M-cycles
step 1000: timed with pme grid 48 42 80, coulomb cutoff 1.222: 1059.5 M-cycles
step 1200: timed with pme grid 52 42 80, coulomb cutoff 1.218: 1058.7 M-cycles
step 1400: timed with pme grid 52 44 80, coulomb cutoff 1.208: 530.5 M-cycles
step 1600: timed with pme grid 52 44 84, coulomb cutoff 1.163: 557.3 M-cycles
step 1800: timed with pme grid 52 48 84, coulomb cutoff 1.150: 639.4 M-cycles
step 2000: timed with pme grid 52 48 96, coulomb cutoff 1.128: 561.8 M-cycles
step 2200: timed with pme grid 56 48 96, coulomb cutoff 1.066: 422.3 M-cycles
step 2400: timed with pme grid 56 52 96, coulomb cutoff 1.048: 610.2 M-cycles
step 2600: timed with pme grid 60 52 96, coulomb cutoff 1.007: 545.6 M-cycles
              optimal pme grid 56 48 96, coulomb cutoff 1.066

For a given system, the performance at the end of mdrun is 9.5 ns/day on CPU and 32 ns/day on GPU (with “--np 2”).

Weird. But congrats on getting rid of the problem! :)

This looks suspicious (besides the whole “it hangs” problem). You seem to have PME in your system, but for some reason it’s not getting offloaded to the GPU. That’s expected (but not always optimal) with two ranks, but with one rank it’s almost always detrimental unless you have a very low-end GPU (not your case).

If you’re happy with the performance and how things work, then great. But it looks like more performance can be squeezed out; if you’re interested in digging deeper, please share the full npt.log file (and the .mdp).
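As a quick check (a sketch, assuming the same -deffnm setup), you can also force the task assignment; if PME cannot run on the GPU, mdrun will stop with a message explaining why:

gmx_mpi mdrun -ntomp 24 -pin on -nb gpu -pme gpu -update gpu -deffnm npt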

Indeed, I had another crash (near the end of the simulation). It also seems that performance is quite random, ranging from more than 6 hours to less than 30 minutes for exactly the same input (maybe related to the kernel error message “amdgpu 0000:03:00.0: amdgpu: Runlist is getting oversubscribed due to too many queues.. Expect reduced ROCm performance.”?).

For this .mdp file:

;title                   = npt equilibration complex cluster1-m1_rigid
define = -DPOSRES_NA -DPOSRES_DNA -DPOSRES_HCY -DFLEXIBLE; position restrain
; Run parameters
integrator              = md        ; 
nsteps                  = 1000000     ; 1 * 1000000 = 1000 ps
dt                      = 0.001     ; 1 fs
; Output control
nstxout                 = 500       ; save coordinates every 0.5 ps
nstvout                 = 500       ; save velocities every 0.5 ps
nstenergy               = 500       ; save energies every 0.5 ps
nstlog                  = 500       ; update log file every 0.5 ps
; Bond parameters
continuation            = yes       ; Restarting after NVT 
constraint_algorithm    = lincs     ; holonomic constraints 
constraints             = h-bonds   ; bonds involving H are constrained
lincs_iter              = 1         ; accuracy of LINCS
lincs_order             = 4         ; also related to accuracy
; Nonbonded settings 
cutoff-scheme           = Verlet    ; Buffered neighbor searching
ns_type                 = grid      ; search neighboring grid cells
nstlist = 25; 25 fs, largely irrelevant with Verlet scheme
rcoulomb = 1.2; short-range electrostatic cutoff (in nm)
rvdw = 1.2; short-range van der Waals cutoff (in nm)
DispCorr                = EnerPres  ; account for cut-off vdW scheme
; Electrostatics
coulombtype = PME       ; Particle Mesh Ewald for long-range electrostatics
vdwtype = PME       ; Particle Mesh Ewald for long-range van der Waals (LJ-PME)
pme_order               = 4         ; cubic interpolation
fourierspacing = 0.145; grid spacing for FFT
; Temperature coupling is on
tcoupl                  = V-rescale             ; modified Berendsen thermostat
tc-grps = DNA_HCY Na+_Water
tau_t                   = 0.1     0.1           ; time constant, in ps
ref_t                   = 300     300           ; reference temperature, one for each group, in K
; Pressure coupling is on
pcoupl                  = C-rescale; Pressure coupling on in NPT
pcoupltype              = isotropic             ; uniform scaling of box vectors
tau_p                   = 2.0                   ; time constant, in ps
ref_p                   = 1.0                   ; reference pressure, in bar
compressibility         = 4.5e-5                ; isothermal compressibility of water, bar^-1
refcoord_scaling        = com;not needed without restraints??
; Periodic boundary conditions
pbc                     = xyz       ; 3-D PBC
; Velocity generation
gen_vel                 = no        ; Velocity generation is off 

using the mdrun command:

gmx_acpp_np1 mdrun -ntomp 24 -pin on -v -nb auto -pme auto -pmefft auto -bonded auto -update auto -deffnm npt 2>&1 | tee mdrun.log

with

alias gmx_acpp_np1="mpirun --np 1 --display-allocation --mca accelerator rocm /home/myuser/gromacs.single/gromacs-2025.2_gpu/bin/gmx_mpi"

After 5 minutes I manually stopped the command, as it seemed stalled, with this output:

======================   ALLOCATED NODES   ======================                                                                                                                     
    mycomputer: slots=1 max_slots=0 slots_inuse=0 state=UP                                                                                                                                 
        Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED                                                                                                                                      
        aliases: mylocalhost,mycomputer                                                                                                                                                  
=================================================================                                                                                                                     
                                                                                                                                                                                      
======================   ALLOCATED NODES   ======================                                                                                                                     
    mycomputer: slots=16 max_slots=0 slots_inuse=0 state=UP                                                                                                                                
        Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN                                                                                                                          
        aliases: mylocalhost,mycomputer                                                                                                                                                  
=================================================================                                                                                                                     
                                                                                                                                                                                      
======================   ALLOCATED NODES   ======================
    mycomputer: slots=16 max_slots=0 slots_inuse=0 state=UP
        Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
        aliases: mylocalhost,mycomputer
=================================================================
                      :-) GROMACS - gmx mdrun, 2025.2 (-:

Executable:   /home/myuser/gromacs.single/gromacs-2025.2_gpu/bin/gmx_mpi
Data prefix:  /home/myuser/gromacs.single/gromacs-2025.2_gpu
Working dir:  /mnt/scidata/lettieri/aptamer/2Dlmmd/3D-biophys.hust.edu.cn/MFE/pred1/OL21-ref-50ns/cluster1-complex/rigid-m1/NPT/cutoff12_1/test_singlerank
Command line:
  gmx_mpi mdrun -ntomp 24 -pin on -v -nb auto -pme auto -pmefft auto -bonded auto -update auto -deffnm ../npt


Back Off! I just backed up ../npt.log to ../#npt.log.1#
Reading file ../npt.tpr, VERSION 2025.0-dev (single precision)
Changing nstlist from 25 to 100, rlist from 1.202 to 1.293

1 GPU selected for this run.
Mapping of GPU IDs to the 1 GPU task in the 1 rank on this node:
  PP:0
PP tasks will do (non-perturbed) short-ranged and most bonded interactions on the GPU
PP task will update and constrain coordinates on the GPU
Using 1 MPI process
Using 24 OpenMP threads 


Overriding thread affinity set outside gmx mdrun

Back Off! I just backed up ../npt.trr to ../#npt.trr.1#

Back Off! I just backed up ../npt.edr to ../#npt.edr.1#
starting mdrun 'cluster1-rigid_m1'
1000000 steps,   1000.0 ps.
step 3200: timed with pme grid 52 44 80, coulomb cutoff 1.200: 354.5 M-cycles
step 3400: timed with pme grid 44 40 72, coulomb cutoff 1.333: 432.0 M-cycles
step 3600: timed with pme grid 48 40 80, coulomb cutoff 1.270: 4463.3 M-cycles
step 3800: timed with pme grid 48 42 80, coulomb cutoff 1.214: 4526.0 M-cycles
step 4000: timed with pme grid 52 42 80, coulomb cutoff 1.210: 1450.0 M-cycles
step 4200: timed with pme grid 52 44 80, coulomb cutoff 1.200: 389.3 M-cycles
              optimal pme grid 52 44 80, coulomb cutoff 1.200
^C

I’ve also tried with “vdwtype = cutoff”, but with almost the same output.

Well, maybe it is more reliable to use the CPU-only version of mdrun inside a script.

Update: since the same system now takes 3 days, I’m wondering if switching the Clang compiler from the one shipped with ROCm to the system’s one could have done some damage.

Good idea to check the kernel log. export GPU_MAX_HW_QUEUES=2 could be a workaround. Consumer AMD GPUs are not well suited to handling multiple processes: they have only 8 hardware queues available for compute applications (see [Issue]: performance panelty caused by separated HSA queues in HIP and OpenMP implementations · Issue #2705 · ROCm/ROCm · GitHub), and GROMACS uses more than 4 per process (and the way GPU_MAX_HW_QUEUES works is not straightforward). Running one process per GPU would also solve this issue, if my understanding is correct.
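For reference, checking the kernel log and applying the workaround would look roughly like this (a sketch; GPU_MAX_HW_QUEUES is a HIP runtime variable):

sudo dmesg | grep -i amdgpu        # look for the "Runlist is getting oversubscribed" message
export GPU_MAX_HW_QUEUES=2         # limit the number of HIP hardware queues per process
gmx_acpp_np1 mdrun -deffnm npt     # then launch as before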

  1. What if you build GROMACS with thread-MPI (-DGMX_MPI=OFF -DGMX_THREAD_MPI=ON when calling cmake) and then just launch it without mpirun? I see that you’re on a local machine, so that should not be a problem (a minimal cmake sketch is included below).
  2. Less critical, but it could be better to use -ntomp 16 -pin on. Your CPU has a mix of P- and E-cores, and it’s usually better to use only the P-cores (which is hard to detect automatically). This is unlikely to help with the hangs; just an observation.
  3. If you’re absolutely, 100% sure you cannot run one process per GPU and you need two, use gmx_mpi mdrun -npme 1 together with vdwtype=cutoff (as long as that works for you scientifically!).

Using vdwtype=cutoff should also solve the PME-on-CPU issue, so I suggest doing further attempts with it (as noted above, only if it makes physical sense).
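For point 1, a minimal configure sketch could look like the following. It reuses the compilers and AdaptiveCpp install from your current build; the install prefix is just a placeholder, and the SYCL-related option names are best double-checked against the install guide for your GROMACS version:

cmake .. -DGMX_MPI=OFF -DGMX_THREAD_MPI=ON \
         -DCMAKE_C_COMPILER=/usr/bin/clang -DCMAKE_CXX_COMPILER=/usr/bin/clang++ \
         -DGMX_GPU=SYCL -DGMX_SYCL=ACPP \
         -DCMAKE_PREFIX_PATH=/home/myuser/AdaptiveCPP-25.02 \
         -DHIPSYCL_TARGETS="hip:gfx1032" \
         -DCMAKE_INSTALL_PREFIX=/home/myuser/gromacs.single/gromacs-2025.2_tmpi
make -j 16 install
# the thread-MPI binary is plain gmx; launch it directly, without mpirun:
gmx mdrun -ntomp 16 -pin on -deffnm npt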

It won’t hurt to check, but I don’t think that is the case: Clang 19 + ROCm 6.4.0 should be compatible. I’d first try the other suggestions, like using a thread-MPI build without mpirun.

Thank you for this suggestion: with thread-MPI the simulations are quite fast.

I’m using the system’s Clang instead of ROCm’s Clang because with the latter I get build errors.
