Gromacs Issue on EPYC 7B12 Workstation

Dear Szilárd,

My original problem occurred on a dual 7B12 platform; however, the 7B12 CPUs turned out to have a hardware quality issue, so I will not discuss that platform further.

After switching to a dual 7742 platform, I ran more trials and found a temporary mitigation.
I tried Rocky Linux 9.1 [with gcc 11 as the default] and compiled gromacs 2023, but the run hung at ~28 ns.
I then moved to Ubuntu 20.04 [with gcc 9.4 as the default], and the simulation completed quite smoothly.
So I installed CentOS 7.9, upgraded gcc to different versions, and tried again (a build sketch follows the list below). The results were:
- gcc 8: not compatible with gromacs 2023
- gcc 9: mdrun done
- gcc 10: mdrun done
- gcc 11: mdrun hang
- aocc 4.0.0 [with amd-fftw]: mdrun hang
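
For reference, each build on CentOS 7.9 followed roughly the pattern below. This is only a sketch: the devtoolset line assumes gcc is switched via Software Collections, and the paths, install prefix, and FFTW choice are illustrative rather than my exact commands.

  # Select a gcc toolchain (here gcc 9); repeat with other devtoolset versions or aocc as needed.
  source /opt/rh/devtoolset-9/enable
  tar xf gromacs-2023.tar.gz && cd gromacs-2023
  mkdir build && cd build
  # Let GROMACS build its own FFTW; with aocc one would point CMake at amd-fftw instead.
  cmake .. \
    -DCMAKE_C_COMPILER=gcc \
    -DCMAKE_CXX_COMPILER=g++ \
    -DGMX_BUILD_OWN_FFTW=ON \
    -DCMAKE_INSTALL_PREFIX=$HOME/gromacs-2023-gcc9
  make -j $(nproc)
  make check -j $(nproc)
  make install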

I think we can now summarize the issue: recent gcc/aocc compiler versions appear to have a problem on the 2nd-gen EPYC platform, which can cause a tMPI hang during MD simulation.
For reference, AMD's official test of 2nd-gen EPYC with gromacs mentions several details:
- aocc 2.0 & aocl 2.0
- gromacs 2019.3 [which I think does NOT auto-detect 2nd-gen EPYC CPUs]
- MPI version of gromacs with OpenMPI 4.0.0
https://www.amd.com/system/files/documents/EPYC-7002-Gromacs-Molecular-Dynamics-Simulation.pdf
The notable points are that 1) they used an older compiler at the time and 2) the MPI build of gmx was used for the tests.

It should be mentioned that during ‘make check -j’, gcc 9 and 10 each had one failed ParseCommonArgsTest, but gmx itself seems to be fine. For the time being, my simulation has completed two 100 ns runs [gcc 9.3.1 + gromacs 2023]. The speed [for a ~16000-atom system] reaches 140 ns/day on 125 threads/cores. Good enough.
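
The 100 ns runs were launched with a plain thread-MPI command along these lines (a sketch; the file name and pinning option are illustrative, and the rank/thread split was left to mdrun):

  # Thread-MPI build; -nt sets the total thread count, mdrun picks the rank/thread split.
  gmx mdrun -deffnm md_100ns -nt 125 -pin on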

These are my updates on this issue.
If anything further is needed (for the purpose of fixing conflicts between the latest compilers and gromacs), please kindly let me know. I'll be happy to run hardware tests for your investigation of this tMPI bug.

Best Regards,
Pim

Hi,

Thanks for the summary.

It seems that a newer gcc or standard C++ library may be the issue. One thing remains unclear: does the same happen with MPI, e.g. OpenMPI + gcc 11?

I can’t comment on those beyond noting that vendors like to use their own compilers, but I’ve not seen evidence that aocc has practical benefits.

I suggest opening an issue on gitlab.gromacs.org and providing the description above, preferably together with the reproducer inputs and the command lines used.

Thanks,
Szilárd

Dear Szilárd,

Sorry for the delay.

In a gcc 10 environment, I:
a) compiled openmpi-4.1.5
b) compiled gmx_mpi with -DGMX_MPI=ON and -DGMX_OPENMP=OFF (see the build sketch after this list)
c) ran ‘make check -j’; many tests failed, but gmx_mpi itself seems to run well
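
The MPI build itself was roughly as follows (a sketch; the OpenMPI install path and install prefix are illustrative):

  # Use the freshly built OpenMPI 4.1.5 compiler wrappers.
  export PATH=$HOME/openmpi-4.1.5/bin:$PATH
  cd gromacs-2023 && mkdir build_mpi && cd build_mpi
  cmake .. \
    -DCMAKE_C_COMPILER=mpicc \
    -DCMAKE_CXX_COMPILER=mpicxx \
    -DGMX_MPI=ON \
    -DGMX_OPENMP=OFF \
    -DGMX_BUILD_OWN_FFTW=ON \
    -DCMAKE_INSTALL_PREFIX=$HOME/gromacs-2023-mpi
  make -j $(nproc)
  make check -j $(nproc)
  make install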
I then ran the MD; it proceeded smoothly and completed as expected.

What’s more, to the best of my knowledge, the OpenMP build usually performs better than the MPI build on a single node, yet to my surprise the MPI build reaches ~170 ns/day [OpenMP build: 150 ns/day].
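
The MPI run was launched roughly like this (a sketch; rank count, file name, and pinning are illustrative). With OpenMP disabled, each rank is single-threaded, i.e. one rank per core:

  # Real-MPI build (gmx_mpi): one single-threaded rank per core.
  mpirun -np 125 gmx_mpi mdrun -deffnm md_100ns -pin on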

I will collect some info and data for a bug report, as you suggested.
Many thanks for your kind support!

Best Regards,
Pim