Again: Floating point exception (8) Floating point divide-by-zero (3)

GROMACS version: 2020.4
GROMACS modification: No

Hello All,

Almost a year ago I reported a similar issue (Redmine bug #3234). I think that bug was ‘accepted’, so I expected not to see this error in newer versions, but I am getting the same error with v2020.4. Restarting the job (coarse-grained; killed after ~4000 ns) from the cpt file always stops at the same step (I am simulating a new system).
I am not sure why I am getting the same error again.

*** Process received signal ***
Signal: Floating point exception (8)
Signal code: Floating point divide-by-zero (3)
Failing at address: 0x7ff923a404d8
[ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x46210)[0x7ff931f42210]
[ 1] /lib/x86_64-linux-gnu/libm.so.6(+0x7e4d8)[0x7ff923a404d8]
[ 2] /usr/local/gromacs/gromacs2020_4/lib/libgromacs_mpi.so.5(_ZSt3logf+0x1d)[0x7ff9337dad19]
[ 3] /usr/local/gromacs/gromacs2020_4/lib/libgromacs_mpi.so.5(_ZN3gmx17GammaDistributionIfEclINS_12ThreeFry2x64ILj64EEEEEfRT_RKNS1_10param_typeE+0x21b)[0x7ff933c3b8d5]
[ 4] /usr/local/gromacs/gromacs2020_4/lib/libgromacs_mpi.so.5(_ZN3gmx17GammaDistributionIfEclINS_12ThreeFry2x64ILj64EEEEEfRT_+0x2b)[0x7ff933c3b299]
[ 5] /usr/local/gromacs/gromacs2020_4/lib/libgromacs_mpi.so.5(+0x1819ef5)[0x7ff933c39ef5]
[ 6] /usr/local/gromacs/gromacs2020_4/lib/libgromacs_mpi.so.5(_Z20vrescale_resamplekinffffll+0x11f)[0x7ff933c3a038]
[ 7] /usr/local/gromacs/gromacs2020_4/lib/libgromacs_mpi.so.5(_Z15vrescale_tcouplPK10t_inputreclP14gmx_ekindata_tfPd+0x1d8)[0x7ff933c3a2c7]
[ 8] /usr/local/gromacs/gromacs2020_4/lib/libgromacs_mpi.so.5(_Z14update_tcouplelPK10t_inputrecP7t_stateP14gmx_ekindata_tPK9t_extmassPK9t_mdatoms+0x227)[0x7ff933cb3d18]
[ 9] /usr/local/gromacs/gromacs2020_4/lib/libgromacs_mpi.so.5(_ZN3gmx15LegacySimulator5do_mdEv+0x5544)[0x7ff933dc09c6]
[10] /usr/local/gromacs/gromacs2020_4/lib/libgromacs_mpi.so.5(_ZN3gmx15LegacySimulator3runEv+0x1f5)[0x7ff933dba649]
[11] /usr/local/gromacs/gromacs2020_4/lib/libgromacs_mpi.so.5(_ZN3gmx8Mdrunner8mdrunnerEv+0x3a4a)[0x7ff933de7bd0]
[12] gmx_mpi(+0x17668)[0x557c9be9f668]
[13] /usr/local/gromacs/gromacs2020_4/lib/libgromacs_mpi.so.5(+0x1337eeb)[0x7ff933757eeb]
[14] /usr/local/gromacs/gromacs2020_4/lib/libgromacs_mpi.so.5(_ZN3gmx24CommandLineModuleManager3runEiPPc+0x3eb)[0x7ff933759c61]
[15] gmx_mpi(+0x14ab4)[0x557c9be9cab4]
[16] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7ff931f230b3]
[17] gmx_mpi(+0x1494e)[0x557c9be9c94e]

Thanks.
Dave

Hi Dave,

Thanks for pointing this out again. We worked on fixing the issue, but as far as I can see it has not been solved.

We moved our complete infrastructure from Gerrit to GitLab in the last year, and this issue was not given a milestone. I have now assigned a milestone to the issue, which will bring it up again, and added a link to this post.

Best,
Christian

Hi Christian,

Many thanks for considering the bug. It will be really helpful. For many jobs I need to run for multiple microseconds, and it’s a big computational loss when my simulation crashes with this error :-/ and I can’t continue it.
Request: if it gets fixed, would it be possible to get the code before it is officially released? It would be an ‘immense’ help.

Thanks,
D

Hello,

if we find the bug and can fix it, you can get the patch from the merge request on GitLab.
I would just need some input files to try to reproduce the bug, if this is possible.

Cheers

Paul

Hi Paul,

Thanks.
I noticed your message on GitLab, so I just reactivated the old links. You should be able to access the files I shared last year (2019.4 version). If you need the files that I am using with the 2020.4 version, please let me know (but in private, if that is OK?).

Hello,

using the 2019.4 files should be fine. How quickly are you running into the exception? Does it fire immediately after the checkpoint or a bit later?

Cheers

Paul

Hi Paul,

I am sorry, I don’t get your question.
a) Are you asking whether there is any delay in launching the job from the cpt file? Then no, it launches quickly, as expected.
or
b) Are you asking whether the job gets killed a few ns after this cpt state? Then yes, it should get killed after a few ns, and if you repeat it, it will always be killed at the same step (using the cpt file).
Please let me know if you have more questions.

D

Hello Dave,

I was asking about b), how long it takes to trigger the FPE.

I uploaded a patch for GROMACS 2020 (https://gitlab.com/gromacs/gromacs/-/merge_requests/932), can you check if the errors still happen after building a version of GROMACS with the fix?

Cheers

Paul

Hi Paul,

I don’t remember exactly how long it takes, but it should be around 10 minutes or so depending on the GPU/CPU resources.
I will build and test the version with the fix. I am a bit lost: should I download 2020.4 from http://manual.gromacs.org, or just apply a patch? Sorry, I am slow here: how can I download and apply the required patch from the link you provided? I am not very familiar with GitLab.

Hello Dave,

no problem. You can check out the branch using git, that should be the easiest approach, and will also make sure that there are no other unrelated issues with the code. If you need help with that, please let me know.

For the checkout, just use this

git fetch origin
git checkout -b "harden-random-gammadistribution" "origin/harden-random-gammadistribution"

Cheers

Paul

Hi Paul,
I am dumb here. Where should I run the “git fetch origin” command? It seems this command should be run in the directory of the repository. How do I create a repository?

Hi Dave,

git clone git@gitlab.com:gromacs/gromacs.git

Then the commands Paul suggested.

Thanks Christian, I was about to reply with this :)

Hi Christian, Hi Paul,

I think there are permission issues, or maybe I am missing something:

git clone git@gitlab.com:gromacs/gromacs.git
Cloning into ‘gromacs’…
git@gitlab.com: Permission denied (publickey,keyboard-interactive).
fatal: Could not read from remote repository.
Please make sure you have the correct access rights and the repository exists.

Hm, I think you need to use

git clone https://gitlab.com/gromacs/gromacs.git

instead, to avoid the authentication issue
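
For anyone who already has SSH-style addresses in notes or scripts, here is a minimal sketch of the same rewrite done mechanically. The helper name `to_https_url` is made up for illustration, and it only assumes the usual `git@host:group/repo.git` shape; anything else is passed through unchanged.

```shell
#!/bin/sh
# Hypothetical helper (not part of git or GROMACS): rewrite an SSH-style
# remote such as git@host:group/repo.git into its HTTPS form.
# Input that does not start with "git@" is printed unchanged.
to_https_url() {
    printf '%s\n' "$1" | sed -E 's#^git@([^:]+):#https://\1/#'
}

# Example with the address from this thread
to_https_url "git@gitlab.com:gromacs/gromacs.git"
```

If a clone was already made over SSH somewhere, the existing remote could probably be switched with `git remote set-url origin "$(to_https_url "$(git remote get-url origin)")"`, which avoids needing an SSH key registered on GitLab.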

Hi Paul, Hi Christian,

Many thanks. It does seem to be working now. I tested two different cpt files (from runs that were killed with the same FPE error) and both worked fine. I am continuing one case for longer to test it. The version I have now is:

2020.5-dev-20201210-c995be6239-unknown

My GROMACS job is running, but while building this GROMACS version these tests failed:

The following tests FAILED:
         38 - TrajectoryAnalysisUnitTests (SEGFAULT)
         55 - regressiontests/freeenergy (Failed)
Errors while running CTest
make[3]: *** [CMakeFiles/run-ctest-nophys.dir/build.make:58: CMakeFiles/run-ctest-nophys] Error 8
make[2]: *** [CMakeFiles/Makefile2:2454: CMakeFiles/run-ctest-nophys.dir/all] Error 2
make[1]: *** [CMakeFiles/Makefile2:2433: CMakeFiles/check.dir/rule] Error 2
make: *** [Makefile:249: check] Error 2

I checked all md.log files in the regressiontests/freeenergy folders and all seemed to have finished fine. I am not sure why I am getting TrajectoryAnalysisUnitTests (SEGFAULT). I didn’t notice this error while building 2020.4.

########################
Just for future users, I am repeating the steps here. An example of how to apply a GROMACS patch:

git clone https://gitlab.com/gromacs/gromacs.git
cd gromacs
git fetch origin
git checkout -b "harden-random-gammadistribution" "origin/harden-random-gammadistribution"
... Now build your gromacs .....

#################
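
The "now build your gromacs" step is not spelled out in the thread, so the following is only a sketch of what it might look like for an MPI build like the one in the stack trace above. The CMake options are the standard ones from the GROMACS installation guide; the helper names `run` and `build_gromacs`, the job count and the paths are assumptions chosen for illustration. With `DRY_RUN=1` the commands are only printed, which makes it easy to review them first.

```shell
#!/bin/sh
# Hypothetical helper: execute a command, or just print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "+ $*"
    else
        "$@"
    fi
}

# Hypothetical helper: configure, build, test and install the checked-out tree.
#   $1 = path to the cloned gromacs source directory
#   $2 = install prefix (an assumption, pick your own location)
# The body runs in a subshell so the caller's working directory is untouched.
build_gromacs() (
    src_dir=$1
    prefix=$2
    run mkdir -p "$src_dir/build" &&
    run cd "$src_dir/build" &&
    run cmake .. -DGMX_MPI=ON -DGMX_BUILD_OWN_FFTW=ON \
        -DREGRESSIONTEST_DOWNLOAD=ON -DCMAKE_INSTALL_PREFIX="$prefix" &&
    run make -j 4 &&
    run make check &&
    run make install
)

# Usage (prints the commands without running them):
#   DRY_RUN=1 build_gromacs "$HOME/gromacs" "$HOME/gromacs-patched"
```

Because the steps are chained with `&&`, a failing `make check` stops the script before `make install`. To install despite a known test failure, `make install` would have to be run by hand in the build directory.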

Hello Dave,

thanks for the feedback! Concerning the issue with the segmentation fault, we can’t reproduce this in our CI (runs are here: https://gitlab.com/gromacs/gromacs/-/pipelines/227794494). Still, can you give me some more information about the failures?

Cheers

Paul

Hi Paul,

Yes, sure. The link you shared does not seem to exist. Please let me know if there is any specific folder you would like me to check. I will do a new build.

D

Sorry, I forgot that the pipelines are only visible to internal developers.
Can you just give me the output from running the failing tests? E.g. run only the test binary that fails and then post whatever gets printed to the terminal.

Cheers

Paul
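
One way to do this, sketched here under the assumption that the build directory from `make check` is still around, is to let CTest re-run a single test by name and keep a copy of everything it prints. The helper name `rerun_test` and the default log file name are made up for illustration; `-R` and `--output-on-failure` are regular `ctest` options.

```shell
#!/bin/sh
# Hypothetical helper: re-run one CTest test and save its output.
#   $1 = GROMACS build directory (where "make check" was run)
#   $2 = test name as shown in the CTest summary (used as a regex by -R)
#   $3 = log file to write (optional, defaults to ctest_rerun.log)
rerun_test() {
    build_dir=$1
    test_name=$2
    log_file=${3:-ctest_rerun.log}
    # Run in a subshell so the caller stays in the current directory;
    # tee shows the output on the terminal and stores it for posting.
    ( cd "$build_dir" && ctest -R "$test_name" --output-on-failure ) 2>&1 | tee "$log_file"
}

# Usage, with a test name from this thread:
#   rerun_test "$HOME/gromacs/build" TrajectoryAnalysisUnitTests unit_test.log
```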

Hi Paul,

So this time I did everything from scratch, starting with updating the Ubuntu machine (on AWS cloud). It seems the SEGFAULT error is gone, but a new error appeared this time. Only one test failed, shown below (with additional information after the error message).

54/57 Test #54: regressiontests/complex .............   Passed  222.88 sec
      Start 55: regressiontests/freeenergy
55/57 Test #55: regressiontests/freeenergy ..........***Failed   54.22 sec
Will test on 8 MPI ranks (if possible)
Will test using executable suffix _mpi

Abnormal return value for '/usr/bin/mpiexec -np 8 -wdir /home/ubuntu/softwares/gromacs/build/tests/gromacs-regressiontests-release-2020/freeenergy/transformAtoB gmx_mpi mdrun        -notunepme >mdrun.out 2>&1' was -1
FAILED. Check mdrun.out, md.log file(s) in transformAtoB for transformAtoB
1 out of 10 freeenergy tests FAILED

      Start 56: regressiontests/rotation
56/57 Test #56: regressiontests/rotation ............   Passed   46.97 sec
      Start 57: regressiontests/essentialdynamics
57/57 Test #57: regressiontests/essentialdynamics ...   Passed   23.30 sec

98% tests passed, 1 tests failed out of 57

Label Time Summary:
GTest              = 183.04 sec*proc (53 tests)
IntegrationTest    = 137.37 sec*proc (12 tests)
MpiTest            =  39.64 sec*proc (6 tests)
UnitTest           =  45.67 sec*proc (41 tests)

Total Test time (real) = 532.28 sec

The following tests FAILED:
         55 - regressiontests/freeenergy (Failed)
Errors while running CTest
make[3]: *** [CMakeFiles/run-ctest-nophys.dir/build.make:58: CMakeFiles/run-ctest-nophys] Error 8
make[2]: *** [CMakeFiles/Makefile2:2454: CMakeFiles/run-ctest-nophys.dir/all] Error 2
make[1]: *** [CMakeFiles/Makefile2:2433: CMakeFiles/check.dir/rule] Error 2
make: *** [Makefile:249: check] Error 2

I checked the transformAtoB folder: md.log and mdrun.out (both attached) suggest the run did not finish, and mdrun.out shows a floating point error. (I renamed mdrun.out to mdrun_OUT.log, otherwise the upload is not allowed, and I replaced the IP addresses in the files with *** signs.)
Please let me know if you need more information.

mdrun_OUT.log (26.8 KB) md.log (23.3 KB)
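
For anyone else who needs to hide IP addresses in log files before uploading them, the replacement described above could be done roughly like this. It is only a sketch: the helper name `mask_ips` is made up, and the pattern simply matches four dot-separated groups of 1 to 3 digits, so it does not validate addresses and will not catch hostnames that embed an IP with dashes (such as `ip-10-0-0-1`).

```shell
#!/bin/sh
# Hypothetical helper: replace anything that looks like a dotted IPv4
# address with "***". Reads the files given as arguments, or stdin if none.
mask_ips() {
    sed -E 's/[0-9]{1,3}(\.[0-9]{1,3}){3}/***/g' "$@"
}

# Usage: write a masked copy under an upload-friendly name
#   mask_ips mdrun.out > mdrun_OUT.log
```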