GROMACS version: 2022
GROMACS modification: No
One of our groups is encountering simulation freezes on a cluster running [AMD/ATI] Vega 20 (rev 02) GPUs.
I must admit I do not know much about GROMACS, and nothing about molecular dynamics in general, but I wondered whether you have encountered such freezes as well and could spare a tip on what we could try to fix them.
The version of ROCm we use is rocm-4.2.0.
The hardware info I have:
$ lshw -C display
*-display
description: Display controller
product: Vega 20
vendor: Advanced Micro Devices, Inc. [AMD/ATI]
physical id: 0
bus info: pci@0000:03:00.0
version: 02
width: 64 bits
clock: 33MHz
capabilities: bus_master cap_list rom
configuration: driver=amdgpu latency=0
resources: iomemory:37c0-37bf iomemory:3800-37ff irq:242 memory:37c00000000-37fffffffff memory:38000000000-380001fffff memory:f6200000-f627ffff memory:f6280000-f629ffff
*-display
description: Display controller
product: Vega 20
vendor: Advanced Micro Devices, Inc. [AMD/ATI]
physical id: 0
bus info: pci@0000:27:00.0
version: 02
width: 64 bits
clock: 33MHz
capabilities: bus_master cap_list rom
configuration: driver=amdgpu latency=0
resources: iomemory:27c0-27bf iomemory:2800-27ff irq:241 memory:27c00000000-27fffffffff memory:28000000000-280001fffff memory:c4300000-c437ffff memory:c4380000-c439ffff
*-display
description: Display controller
product: Vega 20
vendor: Advanced Micro Devices, Inc. [AMD/ATI]
physical id: 0
bus info: pci@0000:43:00.0
version: 02
width: 64 bits
clock: 33MHz
capabilities: bus_master cap_list rom
configuration: driver=amdgpu latency=0
resources: iomemory:1fc0-1fbf iomemory:2000-1fff irq:240 memory:1fc00000000-1ffffffffff memory:20000000000-200001fffff memory:f0000000-f007ffff memory:f0080000-f009ffff
*-display
description: Display controller
product: Vega 20
vendor: Advanced Micro Devices, Inc. [AMD/ATI]
physical id: 0
bus info: pci@0000:63:00.0
version: 02
width: 64 bits
clock: 33MHz
capabilities: bus_master cap_list rom
configuration: driver=amdgpu latency=0
resources: iomemory:17c0-17bf iomemory:1800-17ff irq:239 memory:17c00000000-17fffffffff memory:18000000000-180001fffff memory:cb200000-cb27ffff memory:cb280000-cb29ffff
*-display
description: VGA compatible controller
product: ASPEED Graphics Family
vendor: ASPEED Technology, Inc.
physical id: 0
bus info: pci@0000:65:00.0
version: 41
width: 32 bits
clock: 33MHz
capabilities: vga_controller cap_list
configuration: driver=ast latency=0
resources: irq:174 memory:ca000000-caffffff memory:cb000000-cb01ffff ioport:1000(size=128)
*-display
description: Display controller
product: Vega 20
vendor: Advanced Micro Devices, Inc. [AMD/ATI]
physical id: 0
bus info: pci@0000:83:00.0
version: 02
width: 64 bits
clock: 33MHz
capabilities: bus_master cap_list rom
configuration: driver=amdgpu latency=0
resources: iomemory:57c0-57bf iomemory:5800-57ff irq:246 memory:57c00000000-57fffffffff memory:58000000000-580001fffff memory:b0200000-b027ffff memory:b0280000-b029ffff
*-display
description: Display controller
product: Vega 20
vendor: Advanced Micro Devices, Inc. [AMD/ATI]
physical id: 0
bus info: pci@0000:a3:00.0
version: 02
width: 64 bits
clock: 33MHz
capabilities: bus_master cap_list rom
configuration: driver=amdgpu latency=0
resources: iomemory:4fc0-4fbf iomemory:5000-4fff irq:245 memory:4fc00000000-4ffffffffff memory:50000000000-500001fffff memory:b6300000-b637ffff memory:b6380000-b639ffff
*-display
description: Display controller
product: Vega 20
vendor: Advanced Micro Devices, Inc. [AMD/ATI]
physical id: 0
bus info: pci@0000:c3:00.0
version: 02
width: 64 bits
clock: 33MHz
capabilities: bus_master cap_list rom
configuration: driver=amdgpu latency=0
resources: iomemory:47c0-47bf iomemory:4800-47ff irq:244 memory:47c00000000-47fffffffff memory:48000000000-480001fffff memory:ba000000-ba07ffff memory:ba080000-ba09ffff
*-display
description: Display controller
product: Vega 20
vendor: Advanced Micro Devices, Inc. [AMD/ATI]
physical id: 0
bus info: pci@0000:e3:00.0
version: 02
width: 64 bits
clock: 33MHz
capabilities: bus_master cap_list rom
configuration: driver=amdgpu latency=0
resources: iomemory:3fc0-3fbf iomemory:4000-3fff irq:243 memory:3fc00000000-3ffffffffff memory:40000000000-400001fffff memory:c0200000-c027ffff memory:c0280000-c029ffff
$ lspci |grep -i Display
03:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Vega 20 (rev 02)
27:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Vega 20 (rev 02)
43:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Vega 20 (rev 02)
63:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Vega 20 (rev 02)
83:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Vega 20 (rev 02)
a3:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Vega 20 (rev 02)
c3:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Vega 20 (rev 02)
e3:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Vega 20 (rev 02)
$ lspci -v -s 03:00.0
03:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Vega 20 (rev 02)
Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Device 0834
Flags: bus master, fast devsel, latency 0, IRQ 242, NUMA node 0
Memory at 37c00000000 (64-bit, prefetchable) [size=16G]
Memory at 38000000000 (64-bit, prefetchable) [size=2M]
Memory at f6200000 (32-bit, non-prefetchable) [size=512K]
Expansion ROM at f6280000 [disabled] [size=128K]
Capabilities: <access denied>
Kernel driver in use: amdgpu
Kernel modules: amdgpu
The CPU is an AMD EPYC 7452 32-Core Processor.