Simulation freezes on AMD GPU [AMD/ATI] Vega 20 (rev 02)

GROMACS version: 2022
GROMACS modification: No

One of our groups is encountering simulation freezes on a cluster
running [AMD/ATI] Vega 20 (rev 02) GPUs.

I must admit I do not know much about GROMACS and nothing about molecular dynamics in general, but I wondered whether you have encountered such freezes as well and could spare a tip on what we could try to fix them.

The version of ROCm we use is: rocm-4.2.0
The hardware info I have:

$ lshw -C display
  *-display
       description: Display controller
       product: Vega 20
       vendor: Advanced Micro Devices, Inc. [AMD/ATI]
       physical id: 0
       bus info: pci@0000:03:00.0
       version: 02
       width: 64 bits
       clock: 33MHz
       capabilities: bus_master cap_list rom
       configuration: driver=amdgpu latency=0
       resources: iomemory:37c0-37bf iomemory:3800-37ff irq:242 memory:37c00000000-37fffffffff memory:38000000000-380001fffff memory:f6200000-f627ffff memory:f6280000-f629ffff
  *-display
       description: Display controller
       product: Vega 20
       vendor: Advanced Micro Devices, Inc. [AMD/ATI]
       physical id: 0
       bus info: pci@0000:27:00.0
       version: 02
       width: 64 bits
       clock: 33MHz
       capabilities: bus_master cap_list rom
       configuration: driver=amdgpu latency=0
       resources: iomemory:27c0-27bf iomemory:2800-27ff irq:241 memory:27c00000000-27fffffffff memory:28000000000-280001fffff memory:c4300000-c437ffff memory:c4380000-c439ffff
  *-display
       description: Display controller
       product: Vega 20
       vendor: Advanced Micro Devices, Inc. [AMD/ATI]
       physical id: 0
       bus info: pci@0000:43:00.0
       version: 02
       width: 64 bits
       clock: 33MHz
       capabilities: bus_master cap_list rom
       configuration: driver=amdgpu latency=0
       resources: iomemory:1fc0-1fbf iomemory:2000-1fff irq:240 memory:1fc00000000-1ffffffffff memory:20000000000-200001fffff memory:f0000000-f007ffff memory:f0080000-f009ffff
  *-display
       description: Display controller
       product: Vega 20
       vendor: Advanced Micro Devices, Inc. [AMD/ATI]
       physical id: 0
       bus info: pci@0000:63:00.0
       version: 02
       width: 64 bits
       clock: 33MHz
       capabilities: bus_master cap_list rom
       configuration: driver=amdgpu latency=0
       resources: iomemory:17c0-17bf iomemory:1800-17ff irq:239 memory:17c00000000-17fffffffff memory:18000000000-180001fffff memory:cb200000-cb27ffff memory:cb280000-cb29ffff
  *-display
       description: VGA compatible controller
       product: ASPEED Graphics Family
       vendor: ASPEED Technology, Inc.
       physical id: 0
       bus info: pci@0000:65:00.0
       version: 41
       width: 32 bits
       clock: 33MHz
       capabilities: vga_controller cap_list
       configuration: driver=ast latency=0
       resources: irq:174 memory:ca000000-caffffff memory:cb000000-cb01ffff ioport:1000(size=128)
  *-display
       description: Display controller
       product: Vega 20
       vendor: Advanced Micro Devices, Inc. [AMD/ATI]
       physical id: 0
       bus info: pci@0000:83:00.0
       version: 02
       width: 64 bits
       clock: 33MHz
       capabilities: bus_master cap_list rom
       configuration: driver=amdgpu latency=0
       resources: iomemory:57c0-57bf iomemory:5800-57ff irq:246 memory:57c00000000-57fffffffff memory:58000000000-580001fffff memory:b0200000-b027ffff memory:b0280000-b029ffff
  *-display
       description: Display controller
       product: Vega 20
       vendor: Advanced Micro Devices, Inc. [AMD/ATI]
       physical id: 0
       bus info: pci@0000:a3:00.0
       version: 02
       width: 64 bits
       clock: 33MHz
       capabilities: bus_master cap_list rom
       configuration: driver=amdgpu latency=0
       resources: iomemory:4fc0-4fbf iomemory:5000-4fff irq:245 memory:4fc00000000-4ffffffffff memory:50000000000-500001fffff memory:b6300000-b637ffff memory:b6380000-b639ffff
  *-display
       description: Display controller
       product: Vega 20
       vendor: Advanced Micro Devices, Inc. [AMD/ATI]
       physical id: 0
       bus info: pci@0000:c3:00.0
       version: 02
       width: 64 bits
       clock: 33MHz
       capabilities: bus_master cap_list rom
       configuration: driver=amdgpu latency=0
       resources: iomemory:47c0-47bf iomemory:4800-47ff irq:244 memory:47c00000000-47fffffffff memory:48000000000-480001fffff memory:ba000000-ba07ffff memory:ba080000-ba09ffff
  *-display
       description: Display controller
       product: Vega 20
       vendor: Advanced Micro Devices, Inc. [AMD/ATI]
       physical id: 0
       bus info: pci@0000:e3:00.0
       version: 02
       width: 64 bits
       clock: 33MHz
       capabilities: bus_master cap_list rom
       configuration: driver=amdgpu latency=0
       resources: iomemory:3fc0-3fbf iomemory:4000-3fff irq:243 memory:3fc00000000-3ffffffffff memory:40000000000-400001fffff memory:c0200000-c027ffff memory:c0280000-c029ffff



$ lspci |grep -i Display
03:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Vega 20 (rev 02)
27:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Vega 20 (rev 02)
43:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Vega 20 (rev 02)
63:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Vega 20 (rev 02)
83:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Vega 20 (rev 02)
a3:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Vega 20 (rev 02)
c3:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Vega 20 (rev 02)
e3:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Vega 20 (rev 02)

$ lspci -v -s 03:00.0
03:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Vega 20 (rev 02)
        Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Device 0834
        Flags: bus master, fast devsel, latency 0, IRQ 242, NUMA node 0
        Memory at 37c00000000 (64-bit, prefetchable) [size=16G]
        Memory at 38000000000 (64-bit, prefetchable) [size=2M]
        Memory at f6200000 (32-bit, non-prefetchable) [size=512K]
        Expansion ROM at f6280000 [disabled] [size=128K]
        Capabilities: <access denied>
        Kernel driver in use: amdgpu
        Kernel modules: amdgpu

The CPU is:
 AMD EPYC 7452 32-Core Processor

Hello!

We have not encountered such issues with Vega 20. Trying a newer ROCm version is worth a shot as a first step, as 4.2.0 is over a year old. ROCm can be installed as a user, so even on a shared cluster that should be doable. I suggest either 4.5.2 (the one we tested most internally) or 5.2.3 (the most recent one, so it probably has more bugs fixed).

Are you using an OpenCL or a SYCL build of GROMACS? Do the freezes only occur when using all 8 GPUs? Do they happen early on, or at random points in the middle of the simulation?
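
In case it helps with answering the first two questions, here is a rough sketch of how one might check them from the command line. It assumes that `gmx` is on your PATH and that `gmx --version` prints a line starting with "GPU support:" naming the backend. The helper name `gpu_backend` and the run name `test` are made up for illustration.

```shell
# Hypothetical helper: read `gmx --version` output on stdin and print
# the value of the "GPU support:" line (assumed to name the backend).
gpu_backend() {
    awk -F': *' '/^GPU support:/ { print $2; exit }'
}

# Usage on the cluster (assumes `gmx` is on PATH):
#   gmx --version | gpu_backend
#
# To see whether the freeze also happens on a single GPU, restrict mdrun
# to device 0 ("test" is a placeholder for your own run name):
#   gmx mdrun -deffnm test -gpu_id 0
```

If the single-GPU run also freezes, that would suggest the problem is not specific to using all 8 devices at once.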