GROMACS Issue on EPYC 7B12 Workstation

GROMACS version: 2023
GROMACS modification: No

Very recently, I built a DIY workstation with the following hardware:
- Supermicro H11DSi
- EPYC 7B12 × 2 (128C / 256T in total)
- Samsung RECC 3200 32 GB × 16 = 512 GB
- Samsung 980 Pro 2 TB

On this machine, I prepared the following environment:
- CentOS 7 with kernel ver. 6.2.8-1.el7.elrepo.x86_64
- GLIBC updated to 2.28
- GCC & G++ 11 installed via "$ yum install devtoolset-11-gcc*" and enabled with "source xxx/enable"
- make 4.3 & cmake 3.25.3
- AOCC 4.0 & AOCL 4.0 installed

I downloaded the GROMACS 2023 .tar.gz from the official site and installed it following the installation guide.

— Trial I —
I first compiled gmx with gcc/g++:
$ cmake … -DCMAKE_INSTALL_PREFIX=/usr/local/gromacs-2023-gcc
-DCMAKE_C_COMPILER=gcc -DCMAKE_CXX_COMPILER=g++
-DGMX_FFT_LIBRARY=fftw3 -DGMX_BUILD_OWN_FFTW=ON
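
For reference, the surrounding steps followed the standard install-guide sequence, roughly as below (a sketch; the archive name, -j value and use of sudo are illustrative):

$ tar xfz gromacs-2023.tar.gz && cd gromacs-2023
$ mkdir build && cd build
# ... the cmake command quoted above goes here ...
$ make -j 64           # build
$ make check           # run the regression tests
$ sudo make install    # install into the prefix given to cmake
$ source /usr/local/gromacs-2023-gcc/bin/GMXRC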

Everything went smoothly, and I started running a system [3500 water + 256 methane + 128 ethane] with the purpose of studying molecular behavior at the interface.
However, the simulation froze/hung at approx. 10 million steps. No further steps were produced (it was just stuck there, as if frozen) and the information on screen no longer changed. When I pressed Ctrl+C, it reported a core dump. When I checked the log file, there was no additional information beyond the normal per-step simulation data.

I repeated the run and ended up in the same situation. The only difference was that the simulation got stuck at a different step; it seems to be random, or something along those lines.

— Trial II —
I searched on Google and found some related reports, but NO clear solution was given. I then re-installed GROMACS following the official AMD benchmark guide:
https://www.amd.com/system/files/documents/EPYC-7002-Gromacs-Molecular-Dynamics-Simulation.pdf

It was mentioned that their gmx was compiled with AOCC 2.0, AOCL 2.0 and Open MPI 4.0.0. I did not turn on -DGMX_MPI on my workstation since it is just a single node.
I then compiled as follows:
$ cmake … -DCMAKE_INSTALL_PREFIX=/usr/local/gromacs-2023-aocc
-DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++
-DGMX_FFT_LIBRARY=fftw3 -DCMAKE_PREFIX_PATH=/PATH/TO/FFTW3
-DGMX_OPENMP=ON
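
(For context, making the AOCC clang/clang++ and AMD-FFTW visible before the cmake call typically looks something like the lines below; the AOCC install path and the name of its environment script are illustrative and may differ between installs.)

$ source /opt/AMD/aocc-compiler-4.0.0/setenv_AOCC.sh   # put AOCC clang/clang++ on PATH
$ clang --version                                      # should report AOCC 4.0.0
$ ls /PATH/TO/FFTW3/lib/libfftw3f*                     # sanity-check that single-precision FFTW is present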

The installation itself went smoothly (AGAIN), and the simulation froze/hung (AGAIN). I repeated the simulation and ran into the same dilemma.

Please see the attached screenshot below for a straightforward view.

— Questions —

  1. Is this related to my EPYC 7B12 CPUs?
  2. Or, is it related to my gmx installation (e.g., did I miss some key FLAG for compiling on AMD EPYC CPUs)?
  3. Or, is it related to my gmx mdrun command (-nt 125 -dd 5 5 5)? I used these options since my box is 666 nm.

Anyway, I suspect this is not an isolated case in the community.
Please kindly help.
Any suggestion, comment, or hint would be more than welcome.
Please save my new workstation and my research from this morass : (

Regards
Pim

To rule out the former (a hardware issue), I suggest doing some long stress-testing with some other application; e.g. you can try:

  • running the stress tool
  • running HPL or similar compute-intensive code

You are using a peculiar run setup (which is disadvantageous for PME, as the note on the console suggests); why not just run with -ntmpi 128, using all cores and the default decomposition?
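
For example, something along these lines (the .tpr name is a placeholder):

$ gmx mdrun -v -s topol.tpr -ntmpi 128    # let mdrun pick the domain decomposition and PME rank split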

Cheers,
Szilard

Dear Szilard,

Thanks a lot for your advice.

When I ran mdrun directly, it said something like the grid is too small and suggested using -dd…, so I simply followed that suggestion.

I ran stress, but not for as long as gmx runs before freezing (~10 h).
I'll give stress a longer try today. Thanks again!

Just use -ntmpi 64 -ntomp 2. However, if your box is indeed 666 nm in any dimension, there is something peculiar that prevents decomposition.

Sorry, maybe I wasn't clear here.
The box is 6×6×6 nm, not 666, lol. 666 nm might be a bit too large ╭(°A°`)╮

I'm doing some stress tests now. Thanks again for your kind advice.

That should be a few tens of thousands of atoms, so it should run comfortably on a 64-core CPU. Try 2 or 4 threads per rank.
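
For instance, keeping ntmpi × ntomp equal to the 128 physical cores (a sketch; the .tpr name is a placeholder):

$ gmx mdrun -v -s topol.tpr -ntmpi 64 -ntomp 2    # 64 ranks × 2 OpenMP threads each
$ gmx mdrun -v -s topol.tpr -ntmpi 32 -ntomp 4    # 32 ranks × 4 OpenMP threads each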

I just finished a stress test of about 14 h with "$ stress --cpu 128 --timeout 50000" (50000 s = 833 min = 13.9 h). It seems all right; it reported "stress: info: [118673] successful run completed in 50000s".
The temperature stayed at approx. 60-65 °C, which seems quite normal to me.

Here I have two questions:
Q1.
I tried to use 128 cores with "-ntmpi 32 -ntomp 4", and gmx said:
“There is no domain decomposition for 18 ranks that is compatible with the
given box and a minimum cell size of 1.185 nm
Change the number of ranks or mdrun option -rdd or -dds
Look in the log file for details on the domain decomposition”

So I then tried 120 cores. If I'm understanding correctly, is this ("try 2 or 4 threads per rank") equivalent to the command "-ntmpi 60 -ntomp 2", i.e. 2 threads per rank and 60 (thread-)MPI ranks in total, which will make use of 60 × 2 = 120 cores?

Q2.
Physically, I have 64 cores per CPU and two CPUs in total in this workstation.
I measured the speed with mdrun -v and found that
[-ntmpi 125 -ntomp 1] > [-ntmpi 60 -ntomp 2] > [-ntmpi 125 -ntomp 2]

For the abovementioned system [3500 water + 256 methane + 128 ethane], the estimated completion times were about 24 h (125/1), 48 h (60/2) and 120 h (125/2)…
Very weird. Which one should be the more reasonable setting?

Thanks a lot, Szilárd!

Correct.

You can also use a larger number of threads per rank; I just suggested 2 or 4 because I thought that should be sufficient to get things running on 128 cores. Please post a complete log file; that might help us give better suggestions rather than just guessing.

Measure performance first by doing a shorter run and using -resetstep N; e.g. -nsteps 20000 -resetstep 15000 will measure performance only over the steps after step 15000 (to exclude load balancing). Decide which setup to run after you've determined which one is better.
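
Concretely, such a short benchmark run could look something like this (a sketch; the file name and rank/thread counts are placeholders):

$ gmx mdrun -v -s topol.tpr -ntmpi 64 -ntomp 2 -nsteps 20000 -resetstep 15000    # cycle counters reset at step 15000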

"-ntmpi 125 -ntomp 2" will use 250 threads, which oversubscribes your 2 × 64 = 128 physical cores, so that won't be efficient.

For further hints, as mentioned above, the complete log file is needed.

Cheers,
Szilárd

Hi Szilárd and community,

I ran another series of tests over the past days, and all of them failed in the same way: the gmx simulation froze/hung at several million steps, with no further steps progressing. When I pressed Ctrl+C, it just said core dumped.

Yesterday I did a fresh install of Rocky Linux 9, installed cmake via yum, and compiled GROMACS directly on this brand-new system, yet I got the same freezing behavior : (

In the simulation log there is nothing special, just the normal information, ending with the step data of the step where it got stuck. Please see the attached screenshot [I'm new here, so I'm not able to upload files].

As for the info in the log file header, please see the following:

I saw another report of a similar situation, but I'm not sure whether it was solved. Topic:

Now, AGAIN, I’m stuck here…
Please kindly help.

Thanks with Regards,
Pim

BTW, in /var/log/messages, after I press Ctrl+C, I see some messages like the ones in the red box:

Hi Pim,

Unfortunately, the symptoms alone don't really help in explaining why you are getting these hangs.

Core dumped just means that the program has crashed and the operating system saved the state of the program at the point of the crash. The core files can be useful to inspect where the crash happened; if you have any of those around, you could load one in a debugger using the gdb /path/to/gmx core command, then type "bt" and share the backtrace printed.
There is a chance that this may help, but it could also be insufficient information.
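
As a sketch, a non-interactive way to collect backtraces from all threads of such a core file would be something like (paths are placeholders):

$ gdb -batch -ex "thread apply all bt" /path/to/gmx /path/to/core > backtrace.txt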

Another thing you could do is to install an MPI-build (-DGMX_MPI=on https://manual.gromacs.org/current/install-guide/index.html#id1) and see whether that hangs.

Cheers,
Szilárd

Are we sure this is an issue with the hardware/software and not an instability in the simulated system, which could cause e.g. an extremely long loop due to atoms flying far away?

Note that we do our best to prevent the latter situation from occurring, but we can't detect 100% of the cases before things go wrong.

Dear Hess,

I think the latter case can be excluded since:

  1. the same files run smoothly on another EPYC 7742 (2nd-gen EPYC) platform, and
  2. when it gets stuck, I press Ctrl+C and re-run from the .cpt. The MD can then continue for another 10-20 million steps before it gets stuck again. In one recent trial it first hung at 28 million steps; I pressed Ctrl+C and continued, and it ran on to 51 million steps and hung again.

— Update —
Dear Hess,

Below is a screenshot of today's run, captured at the last frame.
The system consists of 3500 water, 256 methane and 128 ethane molecules, ~16000 atoms in total. Seemingly, the system itself is OK…

Dear Szilárd,

This morning I did a fresh install of Rocky Linux 9.1, installed cmake via yum, and then installed AOCC 4.0, AOCL 4.0 and AMD-FFTW. I compiled GROMACS in this environment.

After another round of struggling, I am now in a hang/freeze state again.
As suggested, I typed gdb /usr/local/gromacs-2023-aocc/bin/gmx and then bt, but it says "No stack".

Is there any mistake or mis-operation on my side here?

Unless you have reason to believe that your operating system is not correctly installed, I suggest looking elsewhere, e.g. at potential hardware issues.

You need to pass a core file to gdb as well, e.g. gdb /usr/local/gromacs-2023-aocc/bin/gmx core.dump, if the simulation has crashed. If it has not crashed, you can attach to the running process by finding its process identifier, also called PID (e.g. in top or ps output), and running gdb -p PID.

I suggested the core file option since your previous post indicated that systemd-coredump logged core files being generated. In the presence of that service, generated core files will be placed in a central location; on some systems that is e.g. /var/lib/systemd/coredump/.
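
If systemd-coredump is indeed handling the dumps, something like the following may help locate and open them (a sketch; coredumpctl availability depends on the distribution):

$ coredumpctl list gmx    # list recorded gmx crashes
$ coredumpctl gdb gmx     # open the most recent gmx core in gdb, then type "bt"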

Dear Szilárd,

When it got stuck today, I opened another terminal window, typed gdb -p 7003 [the PID of gmx] and then bt. The messages are copied below:

==========
Attaching to process 7003
[New LWP 7004]
[New LWP 7005]
[... analogous "New LWP" lines for LWPs 7006 through 7126 ...]
[New LWP 7127]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib64/libthread_db.so.1".
Missing separate debuginfos, use: dnf debuginfo-install glibc-2.34-40.el9.x86_64 libstdc++-11.3.1-2.1.el9.x86_64
--Type <RET> for more, q to quit, c to continue without paging--c
0x0000000000ece273 in tMPI_Event_wait(tMPI_Event_t*) ()
(gdb) bt
#0 0x0000000000ece273 in tMPI_Event_wait(tMPI_Event_t*) ()
#1 0x0000000000ecafad in tMPI_Wait_process_incoming(tmpi_thread*) ()
#2 0x0000000000ecb3ac in tMPI_Wait_single(tmpi_thread*, tmpi_req_*) ()
#3 0x0000000000eca00f in tMPI_Recv(void*, int, tmpi_datatype_*, int, int, tmpi_comm_*, tmpi_status_*) ()
#4 0x0000000000d6ac31 in gmx_pme_receive_f(gmx::PmePpCommGpu*, t_commrec const*, gmx::ForceWithVirial*, float*, float*, float*, float*, bool, bool, float*) ()
#5 0x0000000000df44a4 in pme_receive_force_ener(t_forcerec*, t_commrec const*, gmx::ForceWithVirial*, gmx_enerdata_t*, bool, bool, gmx_wallcycle*) ()
#6 0x0000000000df3834 in do_force(_IO_FILE*, t_commrec const*, gmx_multisim_t const*, t_inputrec const&, gmx::Awh*, gmx_enfrot*, gmx::ImdSession*, pull_t*, long, t_nrnb*, gmx_wallcycle*, gmx_localtop_t const*, float const (*) [3], gmx::ArrayRefWithPadding<gmx::BasicVector >, history_t const*, gmx::ForceBuffersView*, float (*) [3], t_mdatoms const*, gmx_enerdata_t*, gmx::ArrayRef, t_forcerec*, gmx::MdrunScheduleWorkload*, gmx::VirtualSitesHandler*, float*, double, gmx_edsam*, CpuPpLongRangeNonbondeds*, int, DDBalanceRegionHandler const&) ()
#7 0x0000000000e033fc in gmx::LegacySimulator::do_md() ()
#8 0x0000000000dd4b1a in gmx::Mdrunner::mdrunner() ()
#9 0x00000000004cfdca in gmx::gmx_mdrun(tmpi_comm_*, gmx_hw_info_t const&, int, char**) ()
#10 0x00000000004cfac0 in gmx::gmx_mdrun(int, char**) ()
#11 0x000000000086f1eb in gmx::CommandLineModuleManager::run(int, char**) ()
--Type <RET> for more, q to quit, c to continue without paging--c
#12 0x00000000004ce5ba in main ()
==========

Is it stuck somewhere in tMPI or a related routine?

Thanks again
Pim

Hi,

Thanks, that is a step forward. This is indeed a hang in the thread-MPI library while receiving forces from the PME ranks.

Could you please try the same with a lib-MPI build (-DGMX_MPI=ON)? It would be good to know whether that eliminates the hangs; that would be a sign of a thread-MPI issue.
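
As a rough sketch of what such a build and run could look like (the compiler wrappers, install prefix and rank counts below are placeholders, not a prescription):

$ cmake .. -DGMX_MPI=ON -DCMAKE_C_COMPILER=mpicc -DCMAKE_CXX_COMPILER=mpicxx -DCMAKE_INSTALL_PREFIX=/usr/local/gromacs-2023-mpi
$ make -j 64 && sudo make install
$ mpirun -np 64 /usr/local/gromacs-2023-mpi/bin/gmx_mpi mdrun -v -s topol.tpr -ntomp 2    # an MPI build installs gmx_mpi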

Thanks,
Szilárd

Dear Szilárd,

Thanks a lot for the confirmation.

During the past week, I ran systematic tests on the hardware and found that, when testing the CPUs one by one, the system cannot even boot to login with one of them. I then swapped the dual 7B12s for dual 7742s.
Very unluckily, the same gmx freeze happened on the 7742s [Rocky Linux 9.1, with gcc/g++ 11, cmake 3.25 and make 4.x].
However, I successfully finished my gmx runs on another 7742-based supercomputer, so I guess there is some trick involved here. At the same time, I have had several similar failures on another 7B12 workstation and on a 7542 workstation.

I'll set the tMPI testing aside for now and give -DGMX_MPI=ON a try. Once that's done, I guess I can report a bug or something for GROMACS on the EPYC Rome platform?

Thank you so much again.
Pim

I'd say yes, but we'll have a very hard time reproducing it, so we may need further help from you.

Let us know whether using MPI eliminates the issue or not.