GROMACS Randomly Stuck in DGX A100

GROMACS version: 2024.2
GROMACS modification: No

Dear all, I am one of the system administrators for an HPC cluster at my university. We have some DGX A100 machines running in our small SLURM cluster, with BeeGFS 7.4.5 as our scratch storage, backed by NVIDIA GDS.

We noticed that GROMACS processes on all our DGX A100 nodes randomly become stuck in the “Z” state after some time. The jobs are either terminated by SLURM due to timeout after 3 days or terminated by an administrator once we notice the hang. We could not really identify which part of the execution causes the issue, as it happens randomly after a varying amount of time, anywhere from a few hours to a few days. Check the graph below to get an overview:

We used the exact same input files in a clean state (all scratch and output deleted) for all the test runs. We tried different versions of GROMACS (v2024.4, v2024.2, v2024.1, v2023.2), and the result is the same: GROMACS either randomly gets stuck or completes without issue.

We tried running on the local NVMe of the node, and there the random hangs do not seem to occur. While we suspect this might have something to do with our BeeGFS storage, the behaviour does not seem to affect other jobs running different applications such as TeraChem, Amber or LAMMPS.

We have also tried the container from the NVIDIA GPU Cloud, and the result is the same, so I believe the GROMACS compilation/installation should be okay?

Since I am personally not a GROMACS user, I figured it might be better to create a post on the forum so that I could get some help here, as our users at the university are mostly not proficient enough in GROMACS. I have uploaded all the input files, output files and the submission script we used for testing to the following link. There are files from multiple runs in the directory, sorry for the mess!

I would appreciate it if someone could provide some insight into the problem we encountered, as we have tried various ways of troubleshooting but still could not identify the cause on our side. Here are the steps we have tried:

  1. Running the calculation with different versions of GROMACS (2023.2, 2024.1, 2024.2, 2024.4, 2023.2 from NGC)
  2. Cleaning up all files in the directory before rerunning the same calculation
  3. Recompiling GROMACS with OpenMPI (previously compiled with HPCX)
  4. Setting GMX_ENABLE_DIRECT_GPU_COMM=0
  5. Upgrading the BeeGFS stack with newer NVIDIA GDS (1.11.1.6) and NVIDIA FS (2.17)
  6. Upgrading DGX OS and most libraries to the latest compatible versions
  7. Running the calculation with and without other jobs on the same node (jobs do not share the same set of resources)
  8. Running the GROMACS calculation directly, without going through the SLURM scheduler
  9. Running on other storage (local NVMe and CephFS), which does not seem to get stuck
  10. Running on older GPU nodes without InfiniBand (10G Ethernet), which seems to be okay (v2022.1)

Sorry for the long post, and thank you in advance for helping.

Hi!

I agree with your suspicion that it very much looks like something fails in I/O. The last line in a couple of md.log files I checked is about writing a checkpoint. Of course, it could be that the checkpoint was written successfully and something went wrong afterwards, but still. When checkpointing, GROMACS does a bunch of buffer flushes, file renames, etc. at that moment, so perhaps it just happens that this sequence of operations trips up the filesystem somehow?

I also see that you have tried running GROMACS both with and without the -cpnum flag; did you see any difference in behavior?

There are cases with poorly-equilibrated systems where GROMACS enters an infinite loop (although that should be mostly fixed by now), but even then you would not get 0% CPU usage.

Setting GMX_ENABLE_DIRECT_GPU_COMM to any value (even 0 or empty string) is the same as setting it to 1. You should have this variable totally unset. We have an issue about it and plan to improve this behavior, but for now it is what it is.
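
For example, one could make sure the variable is completely removed from the job environment rather than set to zero (a minimal sketch, assuming the variable is exported somewhere in your submission script):

unset GMX_ENABLE_DIRECT_GPU_COMM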

However, you are running a single rank (at least in run42.sh), so this should not matter: there’s only one GPU process.

Hello,

We have checked the system logs and the BeeGFS logs and did not notice any relevant warnings or errors. Other application processes are still performing read/write operations on the BeeGFS storage while the GROMACS processes are in the “Z” state.

I am not sure if this has something to do with NVIDIA GDS in BeeGFS, since only nodes whose BeeGFS client is compiled with NVIDIA GDS have the hang issue. I will need to confirm this by recompiling the storage client on the node and rerunning the calculation multiple times.

We do have cases where the BeeGFS client on a node crashes from time to time, but it does not happen too frequently, perhaps once every month or two.

All the calculation inputs and scripts were taken from the users’ directories when we noticed the 0% CPU utilisation. I do not think I made any changes to the -cpnum flag; perhaps it was set by the different GROMACS versions? What I did was just reuse the same inputs and commands in the submission scripts for multiple runs.

Sure, I was not aware of the difference. I will remove the variable in future runs.

So in our case, what would you recommend we do to narrow down the cause of the problem?

Thank you.

It could be that some I/O operation fails and GROMACS gets killed (we might be missing the handling of some edge-case I/O error); it does not necessarily mean the whole filesystem crashes.

If I understand correctly, the same code with the same input and options works ok on local storage or CephFS but crashes on BeeGFS. You did a very comprehensive test of other things in the environment, so that’s a pretty strong signal, I think.

Here’s what I would do:

  • Following the “I/O” hypothesis:
    • Add the -cpo /path/to/local/disk/state.cpt and -o /path/to/local/disk/traj.trr flags to the gmx_mpi mdrun call to store checkpoints and trajectories on another filesystem while keeping everything else the same.
    • Try setting -cpt 1 (to write a checkpoint every minute) to see if it changes the likelihood of the freeze (the default is 15 minutes); a sketch of the combined command is given after the gdb script below.
  • Trying to gather more data generally:
    • Run GROMACS under gdb, and try to capture the backtrace on exceptions and error handling, like gdb -x script.gdb --args gmx_mpi mdrun ..., where script.gdb is (adjust as you see fit for your environment):
# allow breakpoints on symbols that are not loaded yet
set breakpoint pending on
# catchpoints 1-4: C++ exceptions and fatal signals
catch throw
catch signal SIGABRT
catch signal SIGSEGV
catch signal SIGKILL
# breakpoints 5-7: exit and error-handling paths in GROMACS
b _exit
b gmx_fatal
b gmx::internal::assertHandler
# for catchpoints/breakpoints 1-7: print a backtrace, then keep going
commands 1-7
bt
continue
end
run
quit
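
Putting the I/O-related suggestions together: a rough sketch of what the mdrun call could look like (the paths below are placeholders pointing at a local disk; keep whatever other options you already have in run42.sh):

gmx_mpi mdrun [your existing options] \
    -cpt 1 \
    -cpo /local/nvme/scratch/state.cpt \
    -o /local/nvme/scratch/traj.trr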

Also, I assume you checked the logs for any errors, but just in case: is there a way to download the files in bulk? The “gromacs_troubleshoot” link you shared has a “Zip” button, but it fails with “size too large”.

Hi, sorry for the delayed response.

I followed some of your suggestions for running GROMACS on the DGX machines, including writing a checkpoint every minute with -cpt 1 and redirecting the checkpoint and trajectory to another filesystem using -cpo and -o. Unfortunately, the freeze with zero utilisation and the “Z” state still happens randomly after some time.

One thing we observed in the syslog is that a kernel panic is recorded whenever the freeze happens, and it seems to be related to an fsync operation in BeeGFS on the affected DGX machines. We do not see such a kernel panic in runs without the freeze.

------------[ cut here ]------------
kernel BUG at lib/iov_iter.c:1498!
invalid opcode: 0000 [#5] SMP NOPTI
CPU: 21 PID: 2941610 Comm: gmx_mpi Tainted: P      D    OE     5.15.0-1045-nvidia #45-Ubuntu
Hardware name: NVIDIA DGXA100 920-23687-2531-001/DGXA100, BIOS 1.21 03/09/2023
RIP: 0010:iov_iter_get_pages+0x3b1/0x3c0
Code: 8d 7e ff 83 e6 01 48 0f 45 d7 f0 ff 42 34 83 c1 01 89 4d cc 4d 39 ce 75 c4 e9 56 fd ff ff 31 c0 e9 66 ff ff ff e8 9f f0 76 00 <0f> 0b 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 55 48 89 e5 41 57 41
RSP: 0018:ffffb83debddfb38 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000000
RDX: 0000000000000001 RSI: ffffb83debddfb90 RDI: 0000000000000000
RBP: ffffb83debddfb80 R08: ffffb83debddfb98 R09: ffff89071c9c6000
R10: ffffffffc1590300 R11: ffff8a4754a37adc R12: ffffb83debddfba0
R13: 0000000000001000 R14: ffffb83debddfb90 R15: ffffb83debddfb98
FS:  0000147820300000(0000) GS:ffff8941cf740000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR4: 0000000000350ee0
Call Trace:
 <TASK>
 ? show_trace_log_lvl+0x1d6/0x2ea
 ? show_trace_log_lvl+0x1d6/0x2ea
 ? RdmaInfo_detectNVFSRequest+0x74/0x130 [beegfs]
 ? show_regs.part.0+0x23/0x29
 ? __die_body.cold+0x8/0xd
 ? __die+0x2b/0x37
 ? die+0x30/0x60
 ? do_trap+0xbe/0x100
 ? do_error_trap+0x6f/0xb0
 ? iov_iter_get_pages+0x3b1/0x3c0
 ? exc_invalid_op+0x53/0x70
 ? iov_iter_get_pages+0x3b1/0x3c0
 ? asm_exc_invalid_op+0x1b/0x20
 ? iov_iter_get_pages+0x3b1/0x3c0
 RdmaInfo_detectNVFSRequest+0x74/0x130 [beegfs]
 FhgfsOpsCommkit_communicate+0x339/0x1230 [beegfs]
 ? __wake_up_common_lock+0x8a/0xc0
 FhgfsOpsCommKit_fsyncCommunicate+0x1c/0x30 [beegfs]
 FhgfsOpsRemoting_fsyncfile+0x1a2/0x240 [beegfs]
 __FhgfsOps_flush+0x14a/0x420 [beegfs]
 ? __rseq_handle_notify_resume+0x2d/0xc0
 ? syscall_exit_to_user_mode+0x35/0x50
 FhgfsOps_fsync+0x80/0xf0 [beegfs]
 vfs_fsync_range+0x49/0x90
 ? __fget_light+0x39/0x90
 __x64_sys_fsync+0x38/0x70
 do_syscall_64+0x5c/0xc0
 ? do_syscall_64+0x69/0xc0
 entry_SYSCALL_64_after_hwframe+0x62/0xcc
RIP: 0033:0x147834b588ab
Code: 4a 00 00 00 0f 05 48 3d 00 f0 ff ff 77 41 c3 48 83 ec 18 89 7c 24 0c e8 53 51 f7 ff 8b 7c 24 0c 41 89 c0 b8 4a 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 35 44 89 c7 89 44 24 0c e8 a1 51 f7 ff 8b 44
RSP: 002b:00007ffed6f185e0 EFLAGS: 00000293 ORIG_RAX: 000000000000004a
RAX: ffffffffffffffda RBX: 0000558322a3a220 RCX: 0000147834b588ab
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000000000000aa
RBP: 0000558322a3a220 R08: 0000000000000000 R09: 0000000000000000
R10: 80fe03f80fe03f01 R11: 0000000000000293 R12: 0000000000000000
R13: 000014783a019940 R14: 00005583229d91d0 R15: 000055831eb6b080
 </TASK>
Modules linked in: xt_nat xt_tcpudp veth beegfs(OE) ceph libceph mst_pciconf(OE) xt_conntrack nft_chain_nat xt_MASQUERADE nf_nat nf_conntrack_netlink nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xfrm_user xfrm_algo xt_addrtype
 br_netfilter bridge stp llc cachefiles fscache netfs nvme_fabrics nft_counter nft_compat nf_tables cuse overlay nfnetlink rdma_ucm(OE) rdma_cm(OE) iw_cm(OE) kvm_amd kvm ib_ipoib(OE) ib_cm(OE) ib_umad(OE) ipmi_ssif bonding binfmt_misc nls_iso8859_1 mlx5_ib(OE) ib_uverbs(OE) joydev input_leds ib_core(OE) ccp acpi_ipmi ipmi_si nvidia_uvm(POE) sch_fq_codel dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua nvidia_fs(OE) knem(OE) ipmi_devintf ipmi_msghandler msr efi_pstore auth_rpcgss sunrpc ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c multipath linear nvidia_drm(POE) nvidia_modeset(POE) drm_vram_helper drm_ttm_helper hid_generic crct10dif_pclmul crc32_pclmul ghash_clmulni_intel uas nvidia(POE) usb_storage ttm usbhid aesni_intel mlx5_core(OE) hid pci_hyperv_intf crypto_simd drm_kms_helper raid1 raid0 syscopyarea sysfillrect cryptd sysimgblt mlxdevm(OE) fb_sys_fops igb mlx_compat(OE) dca cec tls mpt3sas i2c_algo_bit rc_core nvme raid_class mlxfw(OE) xhci_pci scsi_transport_sas drm xhci_pci_renesas psample nvme_core [last unloaded: beegfs]
---[ end trace 52549c2692b09a22 ]---
RIP: 0010:iov_iter_get_pages+0x8d/0x3c0
Code: 3c 01 0f 87 ef d2 6f 00 89 d1 49 8b 7c 24 20 80 c9 80 a8 01 0f 45 d1 49 8b 4c 24 08 48 85 ff 0f 84 29 03 00 00 49 8b 74 24 18 <48> 8b 06 4c 8b 66 08 48 01 c8 49 29 cc 0f 84 2a 01 00 00 4d 39 e5
RSP: 0018:ffffb83d9bcb7aa8 EFLAGS: 00010206
RAX: 0000000000000000 RBX: 0000000000000001 RCX: 732e736c6f637472
RDX: 0000000000000001 RSI: 0000000400000003 RDI: 0000000001241a71
RBP: ffffb83d9bcb7af0 R08: ffffb83d9bcb7b08 R09: ffff88c461436000
R10: ffffffffc1590300 R11: 0000000000000000 R12: ffffb83d9bcb7b10
R13: 0000000000001000 R14: ffffb83d9bcb7b00 R15: ffffb83d9bcb7b08
FS:  0000147820300000(0000) GS:ffff8941cf740000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000558322a09bd8 CR3: 000001827dc7e000 CR4: 0000000000350ee0

We tried disabling one of the fsync settings in BeeGFS (Client Node Tuning — BeeGFS Documentation 7.4.5) and reran the calculation multiple times with the default script, the 1-minute checkpoint, and the redirection. Surprisingly, the freeze did not reoccur. We cannot be 100% sure that this has resolved our issue, but so far it is still working.

One thing we suspect is that the problem might be related to the kernel used in DGX OS, as none of our other nodes run an Ubuntu-based OS; they all run RHEL-based OSes (CentOS and Rocky Linux). This might not be a GROMACS problem (or it might?), as it only occurs on our only Ubuntu-based DGX OS. I wonder if any other users have reported similar issues on Ubuntu.

I have not managed to run under the gdb debugger, as no debugging symbols are available in our current GROMACS build. Perhaps I would need to recompile a new build with debugging enabled.
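
If I understand correctly, something like the following should give a build with debug information (a rough sketch only; the MPI/GPU flags would need to match whatever our current installation uses):

cmake .. -DCMAKE_BUILD_TYPE=RelWithDebInfo -DGMX_MPI=ON -DGMX_GPU=CUDA
make -j 16 && make install

RelWithDebInfo keeps optimisations while adding debug symbols, so the runs should still behave like the production build.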

I believe it has something to do with the cloud storage settings, which limit the maximum download size; I do not think I have any control over that.

GROMACS indeed calls fsync, but in a pretty straightforward way, so I doubt there’s anything illegal or unusual there. From the application standpoint, the worst thing it can do with fsync is to pass a wrong file descriptor to it, and that should cause fsync to return an error, not trigger a kernel bug. So, something is wrong with BeeGFS here.
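
If you want to poke at this outside GROMACS, one crude way to exercise a similar write-plus-fsync pattern on the BeeGFS mount would be something like this (the path is a placeholder; this is just a generic stress loop, not what GROMACS does internally):

while true; do
    # write a file and fsync it on the BeeGFS mount, repeatedly
    dd if=/dev/zero of=/beegfs/scratch/fsync_test bs=1M count=64 conv=fsync status=none
done

dd with conv=fsync calls fsync on the output file after writing, which is roughly the system call that shows up in your kernel trace.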

I could not find anything in the public forums and we don’t keep any special secret bug database, so I think you’re the first :)

If the bug is in a syscall, then I don’t think debugging GROMACS will be of much value, but perhaps it can help pinpoint things.

In case there are others struggling with the same problem but being silent so far: did you enable or disable tuneRemoteFSync?

I think it has something to do with GROMACS waiting for the BeeGFS client to complete the fsync operation when tuneRemoteFSync is enabled in the BeeGFS client, but I am not sure why the operation only randomly gets stuck on the DGX nodes and not on the other RHEL-based nodes.

I think I might need to recompile the BeeGFS client against another kernel version or disable some of its features to test whether the problem comes back. I am very confused as to why this behaviour never happened with other applications in the past few months.

We disabled tuneRemoteFSync in the BeeGFS client on the affected nodes by setting tuneRemoteFSync=false, which has been working well for us so far, but your mileage may vary. I will try to raise an issue on the BeeGFS side to see if anyone else is having the same problem.
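
For anyone finding this later, the change on our side was just this one line in the client config (assuming the standard /etc/beegfs/beegfs-client.conf location; the client needs a restart/remount afterwards):

# /etc/beegfs/beegfs-client.conf
tuneRemoteFSync = false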
