2022 Regression Test Timeouts

GROMACS version: 2022
GROMACS modification: No

I’m troubleshooting repeated regression test failures with the new 2022 release. This is a CPU-only Zen 2 system with 128 cores. The failing tests are…

65:MdrunIOTests
69:MdrunNonIntegratorTests
83:MdrunFEPTests

This is my compiling procedure:

cmake -DCMAKE_INSTALL_PREFIX="$PREFIX" \
-DGMX_FFT_LIBRARY=fftw3 -DGMX_BUILD_OWN_FFTW=ON -DGMX_HWLOC=ON -DGMX_MPI=ON \
-DGMX_SIMD=AVX2_128 -DGMX_BUILD_OWN_FFTW_URL="$DISTDIR/fftw-3.3.8.tar.gz" \
-DREGRESSIONTEST_PATH="$WORKDIR/regressiontests-2022" \
-DGMX_EXTERNAL_BLAS=ON -DGMX_EXTERNAL_LAPACK=ON ..
make -j128
make VERBOSE=1 check

I’ve also tried AVX2_256 without any significant change. No other significant processes are competing for resources on this machine.
I’ve included my LastTest.log. Can anyone explain these timeouts? Thanks in advance!

LastTest.log (4.0 MB)

UPDATE: I can also confirm that compiling GROMACS without external MPI support produces a build that completes the regression tests successfully.

Hello,

this is likely caused by the tests trying to run on all 128 cores and being slowed down drastically as a result. Please try running with a lower number of OpenMP threads.

Cheers

Paul

I gave this a shot:

export OMP_NUM_THREADS=16
make VERBOSE=1 check

but it caused a ton of failures with this repeated message:

Environment variable OMP_NUM_THREADS (16) and the number of threads requested
on the command line (2) have different values. Either omit one, or set them
both to the same value.
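The message suggests that the tests already pass a thread count on the command line (2 here), so an exported OMP_NUM_THREADS with any other value is rejected. A rough sketch of that consistency check is below. The function name `omp_setting_conflicts` is hypothetical and written only for illustration; it is not GROMACS code.

```shell
# Hypothetical helper (not part of GROMACS): reports whether an exported
# OMP_NUM_THREADS would clash with a thread count given on the command line.
omp_setting_conflicts() {
    requested=$1                      # thread count passed on the command line
    env_value=${OMP_NUM_THREADS:-}    # empty when the variable is unset
    if [ -n "$env_value" ] && [ "$env_value" != "$requested" ]; then
        echo "conflict: OMP_NUM_THREADS=$env_value, command line=$requested"
        return 1
    fi
    echo "ok"
}
```

With OMP_NUM_THREADS=16 exported, `omp_setting_conflicts 2` reports a conflict. After `unset OMP_NUM_THREADS` it reports ok, which is presumably the state the test suite expects.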

When I used our job scheduler to allocate only 16 cores, I got the same failures as before:

The following tests FAILED:
	 65 - MdrunIOTests (Timeout)
	 69 - MdrunNonIntegratorTests (Timeout)
	 83 - MdrunFEPTests (Timeout)

Is 16 threads still too many?
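To shorten the turnaround while investigating, it may help to rerun only the three failing tests rather than the whole suite. The sketch below assumes it is run from the CMake build directory and that the CTest test names match the names in the failure list. The variable names are illustrative.

```shell
# Test names taken from the failure list above.
failed_tests="MdrunIOTests MdrunNonIntegratorTests MdrunFEPTests"

# Join the names into a single alternation regex for ctest -R.
test_regex=$(printf '%s\n' $failed_tests | paste -sd '|' -)
echo "$test_regex"

# Run only the matching tests; skipped when this is not a CTest build
# directory or ctest is not installed.
if [ -f CTestTestfile.cmake ] && command -v ctest >/dev/null 2>&1; then
    ctest -R "^($test_regex)\$" --output-on-failure
fi
```

The `--output-on-failure` option prints the captured output of any test that fails, so the log of a timed-out test appears directly in the terminal.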

This is still an issue, and it is blocking the deployment of an updated GROMACS build on our system.

I tried reproducing this locally (but only on 16 cores). The tests don’t time out when I build the same way you did, but there are a few NB-LIB test failures instead.

Going through the log, it looks like the task assignment is all done correctly and the tests are running on the number of ranks that they should. They just seem to take a long time to do so.

One thing I did notice in the attached log is the following SIMD mismatch, although you already said that switching to AVX2_256 made no significant difference:

Highest SIMD level supported by all nodes in run: AVX2_256
SIMD instructions selected at compile time:       AVX2_128
Compiled SIMD newer than supported; program might crash

I did the same, using a Ryzen 2700X processor. My results were similar to yours, and did not match the results on the HPC system.

Standard GROMACS runs provide a detailed explanation of where time is being spent. Is such an explanation available for the test cases?

The unit tests don’t write this out by default, sorry.
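That said, the GROMACS tests are built on GoogleTest, which prints a wall time for each finished test case, and CTest captures that output in LastTest.log. A rough per-case breakdown can therefore be recovered from the log. The sketch below assumes the standard GoogleTest result-line format, and `slowest_tests` is a hypothetical helper name.

```shell
# Print the slowest GoogleTest cases recorded in a CTest log, longest first.
# Usage: slowest_tests LOGFILE [COUNT]   (COUNT defaults to 10)
slowest_tests() {
    log=$1
    count=${2:-10}
    # Result lines look like: "[       OK ] Suite.Case (1234 ms)"
    grep -E '^\[ +(OK|FAILED) +\] .+ \([0-9]+ ms\)$' "$log" |
        sed -E 's/^\[ +(OK|FAILED) +\] (.+) \(([0-9]+) ms\)$/\3 \2/' |
        sort -rn |
        head -n "$count"
}
```

For example, `slowest_tests Testing/Temporary/LastTest.log 20` lists the 20 longest cases with their times in milliseconds (the path is where CTest normally writes the log inside the build directory). A case that was killed by the timeout has no result line, so the last `[ RUN      ]` entry without a matching result shows what was running at that point.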