I’m troubleshooting repeated regression test failures using the new 2022 release. This is a CPU-only Zen2 system with 128 cores. The failing tests are…
I’ve also tried AVX2_256 without any significant change. No other significant processes are competing for resources on this machine.
I’ve included my LastTest.log. Can anyone explain these timeouts? Thanks in advance!
This is likely caused by the tests trying to run on all 128 cores and being slowed down severely as a result. Please try running with a lower number of OMP threads.
but it caused a ton of failures with this repeated message:
Environment variable OMP_NUM_THREADS (16) and the number of threads requested
on the command line (2) have different values. Either omit one, or set them
both to the same value.
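For anyone hitting the same message: it says the test binaries request their own thread count on the command line (2 here), so an exported OMP_NUM_THREADS=16 conflicts with it. Below is a minimal Python sketch of that kind of consistency check. It is not GROMACS's actual code, and the function names `thread_setting_conflict` and `env_without_omp` are made up for illustration.

```python
def thread_setting_conflict(env, ntomp_cli):
    """Return an error string if OMP_NUM_THREADS and the command-line
    thread count are both set and disagree; otherwise return None.

    Illustrative only: this mimics the behaviour described by the error
    message, not the real implementation.
    """
    raw = env.get("OMP_NUM_THREADS")
    if raw is None or ntomp_cli is None:
        # Only one of the two is set, so there is nothing to conflict with.
        return None
    if int(raw) != ntomp_cli:
        return (
            f"Environment variable OMP_NUM_THREADS ({int(raw)}) and the number "
            f"of threads requested on the command line ({ntomp_cli}) have "
            "different values."
        )
    return None


def env_without_omp(env):
    """Return a copy of env with OMP_NUM_THREADS removed, so that only the
    command-line value applies."""
    clean = dict(env)
    clean.pop("OMP_NUM_THREADS", None)
    return clean


if __name__ == "__main__":
    env = {"OMP_NUM_THREADS": "16"}
    print(thread_setting_conflict(env, 2))                   # conflict message
    print(thread_setting_conflict(env_without_omp(env), 2))  # None
```

In practice this suggests either leaving OMP_NUM_THREADS unset when running the test suite or setting it to the value the tests request.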
When I used our job scheduler to allocate only 16 cores, I got the same failures as before:
The following tests FAILED:
65 - MdrunIOTests (Timeout)
69 - MdrunNonIntegratorTests (Timeout)
83 - MdrunFEPTests (Timeout)
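When comparing runs with different core counts, it can help to extract the failed tests and their reasons from this summary automatically. A small Python sketch follows, assuming the summary lines have the `number - name (reason)` shape shown above; `parse_failed_tests` is a hypothetical helper, not part of CTest or GROMACS.

```python
import re

# Matches lines such as "65 - MdrunIOTests (Timeout)".
FAILED_LINE = re.compile(r"^\s*(\d+)\s+-\s+(\S+)\s+\((.+)\)\s*$")


def parse_failed_tests(summary):
    """Return a list of (test_number, test_name, reason) tuples found in
    the text of a failure summary. Lines that do not match are skipped."""
    results = []
    for line in summary.splitlines():
        match = FAILED_LINE.match(line)
        if match:
            results.append((int(match.group(1)), match.group(2), match.group(3)))
    return results


if __name__ == "__main__":
    summary = """The following tests FAILED:
    65 - MdrunIOTests (Timeout)
    69 - MdrunNonIntegratorTests (Timeout)
    83 - MdrunFEPTests (Timeout)
    """
    for number, name, reason in parse_failed_tests(summary):
        print(number, name, reason)
```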
I tried reproducing this locally (but only on 16 cores), and the tests don’t time out when building the same way as you, but there are a few NB-LIB test failures instead.
Going through the log, it looks like the task setting is all done correctly and the tests are running on the number of ranks that they should. They just seem to take a long time to do so.
One issue I found in the attached log is this, but you already said that this shouldn’t be an issue.
Highest SIMD level supported by all nodes in run: AVX2_256
SIMD instructions selected at compile time: AVX2_128
Compiled SIMD newer than supported; program might crash