2022 Regression Test Timeouts

mwd · February 23, 2022, 3:48am

GROMACS version: 2022
GROMACS modification: No

I’m troubleshooting repeated regression test failures using the new 2022 release. This is a CPU-only Zen2 system with 128 cores. The failing tests are…

65:MdrunIOTests
69:MdrunNonIntegratorTests
83:MdrunFEPTests

This is my compiling procedure:

cmake -DCMAKE_INSTALL_PREFIX="$PREFIX" \
-DGMX_FFT_LIBRARY=fftw3 -DGMX_BUILD_OWN_FFTW=ON -DGMX_HWLOC=ON -DGMX_MPI=ON \
-DGMX_SIMD=AVX2_128 -DGMX_BUILD_OWN_FFTW_URL="$DISTDIR/fftw-3.3.8.tar.gz" \
-DREGRESSIONTEST_PATH="$WORKDIR/regressiontests-2022" \
-DGMX_EXTERNAL_BLAS=ON -DGMX_EXTERNAL_LAPACK=ON ..
make -j128
make VERBOSE=1 check

I’ve also tried AVX2_256 without any significant change. No other significant processes are competing for resources on this machine.
I’ve included my LastTest.log. Can anyone explain these timeouts? Thanks in advance!

LastTest.log (4.0 MB)

UPDATE: I can also confirm that building GROMACS without external MPI support produces a build that completes the regression tests successfully.

pbauer · February 23, 2022, 11:07am

Hello,

This is likely an issue with the tests trying to run on all 128 cores and being slowed down drastically as a result. Please try running with a lower number of OMP threads.

Cheers

Paul

mwd · February 23, 2022, 8:40pm

I gave this a shot:

export OMP_NUM_THREADS=16
make VERBOSE=1 check

but it caused a ton of failures with this repeated message:

Environment variable OMP_NUM_THREADS (16) and the number of threads requested
on the command line (2) have different values. Either omit one, or set them
both to the same value.
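The rule stated in that message can be sketched as a small shell check. This is only an illustration: `omp_settings_status` is a hypothetical helper, not part of GROMACS, and the value 2 stands in for the thread count the test requested on the command line.

```shell
#!/bin/sh
# Hypothetical helper, not part of GROMACS: prints "ok" when OMP_NUM_THREADS
# is unset or equals the requested thread count, and "conflict" otherwise.
omp_settings_status() {
    requested="$1"
    if [ -z "${OMP_NUM_THREADS:-}" ] || [ "$OMP_NUM_THREADS" = "$requested" ]; then
        echo "ok"
    else
        echo "conflict"
    fi
}

# The situation from the error message: the environment says 16,
# while the test asks for 2 threads on the command line.
OMP_NUM_THREADS=16
omp_settings_status 2    # prints "conflict"

# Omitting the environment variable removes the mismatch.
unset OMP_NUM_THREADS
omp_settings_status 2    # prints "ok"
```

In other words, a global OMP_NUM_THREADS only works for tests that request that same thread count, so leaving it unset is the safer choice for `make check`.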

When I used our job scheduler to allocate only 16 cores, I got the same failures as before:

The following tests FAILED:
	 65 - MdrunIOTests (Timeout)
	 69 - MdrunNonIntegratorTests (Timeout)
	 83 - MdrunFEPTests (Timeout)

Is 16 threads still too many?

mwd · February 28, 2022, 8:29pm

This is still an issue, and it is blocking the deployment of an updated GROMACS build on our system.

pbauer · March 1, 2022, 9:18am

I tried reproducing this locally (but only on 16 cores), and the tests don’t time out when building the same way as you, but there are a few NB-LIB test failures instead.

Going through the log, it looks like the task setup is all done correctly and the tests are running on the number of ranks that they should. They just seem to take a long time to do so.

One issue I found in the attached log is this, but you already said that this shouldn’t be an issue.

Highest SIMD level supported by all nodes in run: AVX2_256
SIMD instructions selected at compile time:       AVX2_128
Compiled SIMD newer than supported; program might crash

mwd · March 1, 2022, 3:13pm

I did the same, using a Ryzen 2700X processor. My results were similar to yours, and did not match the results on the HPC system.

Standard GROMACS runs provide a detailed explanation of where time is being spent. Is such an explanation available for the test cases?

pbauer · March 2, 2022, 8:18am

The unit tests don’t write this out by default, sorry.
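Even without mdrun's timing breakdown, ctest reports the wall-clock time of each test it runs, so re-running only the three failing tests can help narrow things down. The following is a minimal sketch, assuming it is run from the CMake build directory; it only prints the command, and the variable names are made up for illustration.

```shell
#!/bin/sh
# Sketch: build a ctest command that re-runs only the tests that timed out.
# Assumption: the current directory is the CMake build directory.

failed_tests="MdrunIOTests MdrunNonIntegratorTests MdrunFEPTests"

# Join the test names with "|" to form a regular expression for ctest -R.
test_regex=$(printf '%s' "$failed_tests" | tr ' ' '|')

# --output-on-failure prints the output of any test that fails.
ctest_cmd="ctest -R '$test_regex' --output-on-failure"

# Print the command for review.
echo "$ctest_cmd"

# Uncomment to actually run it from the build directory:
# eval "$ctest_cmd"
```

Comparing the per-test times that ctest prints on the HPC node and on a desktop machine would at least show how large the slowdown is for each test.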

swong · June 19, 2023, 11:05pm

I did the following and it helped:

export OMP_NUM_THREADS=1,2,4,6,8
