Almost a year ago I reported a similar issue (Redmine bug #3234). I think that bug was ‘accepted’, so I expected not to see this error in newer versions, but I am getting the same error with v2020.4. Restarting the job (killed after ~4000 ns; coarse-grained) from the .cpt file always stops at the same step (I am simulating a new system).
I am not sure why I am getting the same error again.
*** Process received signal ***
Signal: Floating point exception (8)
Signal code: Floating point divide-by-zero (3)
Failing at address: 0x7ff923a404d8
[ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x46210)[0x7ff931f42210]
[ 1] /lib/x86_64-linux-gnu/libm.so.6(+0x7e4d8)[0x7ff923a404d8]
[ 2] /usr/local/gromacs/gromacs2020_4/lib/libgromacs_mpi.so.5(_ZSt3logf+0x1d)[0x7ff9337dad19]
[ 3] /usr/local/gromacs/gromacs2020_4/lib/libgromacs_mpi.so.5(_ZN3gmx17GammaDistributionIfEclINS_12ThreeFry2x64ILj64EEEEEfRT_RKNS1_10param_typeE+0x21b)[0x7ff933c3b8d5]
[ 4] /usr/local/gromacs/gromacs2020_4/lib/libgromacs_mpi.so.5(ZN3gmx17GammaDistributionIfEclINS_12ThreeFry2x64ILj64EEEEEfRT+0x2b)[0x7ff933c3b299]
[ 5] /usr/local/gromacs/gromacs2020_4/lib/libgromacs_mpi.so.5(+0x1819ef5)[0x7ff933c39ef5]
[ 6] /usr/local/gromacs/gromacs2020_4/lib/libgromacs_mpi.so.5(_Z20vrescale_resamplekinffffll+0x11f)[0x7ff933c3a038]
[ 7] /usr/local/gromacs/gromacs2020_4/lib/libgromacs_mpi.so.5(_Z15vrescale_tcouplPK10t_inputreclP14gmx_ekindata_tfPd+0x1d8)[0x7ff933c3a2c7]
[ 8] /usr/local/gromacs/gromacs2020_4/lib/libgromacs_mpi.so.5(_Z14update_tcouplelPK10t_inputrecP7t_stateP14gmx_ekindata_tPK9t_extmassPK9t_mdatoms+0x227)[0x7ff933cb3d18]
[ 9] /usr/local/gromacs/gromacs2020_4/lib/libgromacs_mpi.so.5(_ZN3gmx15LegacySimulator5do_mdEv+0x5544)[0x7ff933dc09c6]
[10] /usr/local/gromacs/gromacs2020_4/lib/libgromacs_mpi.so.5(_ZN3gmx15LegacySimulator3runEv+0x1f5)[0x7ff933dba649]
[11] /usr/local/gromacs/gromacs2020_4/lib/libgromacs_mpi.so.5(_ZN3gmx8Mdrunner8mdrunnerEv+0x3a4a)[0x7ff933de7bd0]
[12] gmx_mpi(+0x17668)[0x557c9be9f668]
[13] /usr/local/gromacs/gromacs2020_4/lib/libgromacs_mpi.so.5(+0x1337eeb)[0x7ff933757eeb]
[14] /usr/local/gromacs/gromacs2020_4/lib/libgromacs_mpi.so.5(_ZN3gmx24CommandLineModuleManager3runEiPPc+0x3eb)[0x7ff933759c61]
[15] gmx_mpi(+0x14ab4)[0x557c9be9cab4]
[16] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7ff931f230b3]
[17] gmx_mpi(+0x1494e)[0x557c9be9c94e]
Thanks for pointing this out again. We worked on fixing the issue, but as far as I can see it has not been solved:
We moved our complete infrastructure from Gerrit to GitLab in the last year, and this issue was not given a milestone. I have now assigned a milestone to the issue, which will bring it up again, and added a link to this post.
Many thanks for looking into the bug; it will be really helpful. Many of my jobs need to run for multiple microseconds, and it is a big computational loss when a simulation crashes with this error :-/ and I cannot continue it.
Request: if it gets fixed, would it be possible to get the code before it is officially released? It would be an ‘immense’ help.
If we find the bug and can fix it, you can get the patch from the merge request on GitLab.
I would just need some input files to try to reproduce the bug, if that is possible.
Thanks.
I noticed your message on GitLab, so I just activated the old links. You should be able to access the files I shared last year (2019.4 version). If you need the files I am using with the 2020.4 version, please let me know (in private, if that is OK?).
I am sorry, I do not quite understand your question.
a) Are you asking whether there is any delay launching the job from the .cpt file? Then no, it launches quickly as expected.
or
b) Are you asking whether the job gets killed a few ns after restarting from this .cpt state? Then yes, it does get killed after a few ns, and if you repeat the restart, it is always killed at the same step (using the .cpt file).
Please let me know if you have more questions.
I don’t remember exactly how long it takes, but it should be around 10 minutes or so depending on the GPU/CPU resources.
I will build and test the version with the fix. I am a bit lost: should I download 2020.4 from http://manual.gromacs.org, or just apply a patch? Sorry, I am slow here; how can I download and apply the required patch from the link you provided? I am not very familiar with GitLab.
No problem. You can check out the branch using git; that should be the easiest approach, and it will also make sure that there are no other unrelated issues with the code. If you need help with that, please let me know.
Hi Paul,
I am a bit lost here. Where should I run the “git fetch origin” command? It seems this command should be run inside a repository directory. How do I create the repository?
I think there are permission issues, or maybe I am missing something:
git clone git@gitlab.com:gromacs/gromacs.git
Cloning into ‘gromacs’… git@gitlab.com: Permission denied (publickey,keyboard-interactive).
fatal: Could not read from remote repository.
Please make sure you have the correct access rights and the repository exists.
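For anyone hitting the same “Permission denied (publickey)” message: the `git@gitlab.com:` form is an SSH URL and requires an SSH key registered with a GitLab account. An anonymous, read-only clone works over HTTPS instead (the second command is optional and just makes git rewrite SSH GitLab URLs to HTTPS automatically):

```shell
# Anonymous read-only clone over HTTPS (no SSH key needed):
git clone https://gitlab.com/gromacs/gromacs.git

# Optional: rewrite SSH GitLab URLs to HTTPS automatically.
git config --global url."https://gitlab.com/".insteadOf "git@gitlab.com:"
```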
Many thanks. It does seem to be working now. I tested two different .cpt files (both of which had been killed with the same FPE error) and both worked fine. I am continuing one case for longer to test it. The version I have now is:
2020.5-dev-20201210-c995be6239-unknown
My GROMACS job is running, but while building this version I got some failed tests:
I checked all the md.log files in the regressiontests/freeenergy folders and all seemed to have finished fine. I am not sure why I am getting the TrajectoryAnalysisUnitTests failure (SEGFAULT); I did not notice this error while building 2020.4.
########################
Just for future users, I am repeating the steps here. An example of how to apply a GROMACS patch:
git clone https://gitlab.com/gromacs/gromacs.git
cd gromacs
git fetch origin
git checkout -b "harden-random-gammadistribution" "origin/harden-random-gammadistribution"
... Now build your gromacs .....
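To fill in the build step above: this follows the standard GROMACS out-of-source build procedure. The cmake flags below are my assumptions for an MPI build like the one in this thread, not the poster's exact configuration; adjust the GPU, FFTW, and install-prefix options to your machine:

```shell
# Standard out-of-source GROMACS build (flags are an assumed example):
mkdir build && cd build
cmake .. -DGMX_MPI=ON \
         -DGMX_BUILD_OWN_FFTW=ON \
         -DREGRESSIONTEST_DOWNLOAD=ON \
         -DCMAKE_INSTALL_PREFIX=/usr/local/gromacs
make -j 8
make check          # runs the unit and regression tests
sudo make install
```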
Thanks for the feedback! Concerning the issue with the segmentation fault, we cannot reproduce it in our CI (the runs are here: https://gitlab.com/gromacs/gromacs/-/pipelines/227794494). Still, can you give me some more information about the failures?
Sorry, I forgot that the pipelines are only visible to internal developers.
Can you just give me the output from running the failing tests? E.g. run only the test binary that fails and then post whatever gets printed to the terminal?
So this time I did everything from scratch, starting with updating the Ubuntu machine (on the AWS cloud). The SEGFAULT error seems to be gone, but a new error appeared this time. The only failed test is shown below (with additional information below the error message).
54/57 Test #54: regressiontests/complex ............. Passed 222.88 sec
Start 55: regressiontests/freeenergy
55/57 Test #55: regressiontests/freeenergy ..........***Failed 54.22 sec
Will test on 8 MPI ranks (if possible)
Will test using executable suffix _mpi
Abnormal return value for '/usr/bin/mpiexec -np 8 -wdir /home/ubuntu/softwares/gromacs/build/tests/gromacs-regressiontests-release-2020/freeenergy/transformAtoB gmx_mpi mdrun -notunepme >mdrun.out 2>&1' was -1
FAILED. Check mdrun.out, md.log file(s) in transformAtoB for transformAtoB
1 out of 10 freeenergy tests FAILED
Start 56: regressiontests/rotation
56/57 Test #56: regressiontests/rotation ............ Passed 46.97 sec
Start 57: regressiontests/essentialdynamics
57/57 Test #57: regressiontests/essentialdynamics ... Passed 23.30 sec
98% tests passed, 1 tests failed out of 57
Label Time Summary:
GTest = 183.04 sec*proc (53 tests)
IntegrationTest = 137.37 sec*proc (12 tests)
MpiTest = 39.64 sec*proc (6 tests)
UnitTest = 45.67 sec*proc (41 tests)
Total Test time (real) = 532.28 sec
The following tests FAILED:
55 - regressiontests/freeenergy (Failed)
Errors while running CTest
make[3]: *** [CMakeFiles/run-ctest-nophys.dir/build.make:58: CMakeFiles/run-ctest-nophys] Error 8
make[2]: *** [CMakeFiles/Makefile2:2454: CMakeFiles/run-ctest-nophys.dir/all] Error 2
make[1]: *** [CMakeFiles/Makefile2:2433: CMakeFiles/check.dir/rule] Error 2
make: *** [Makefile:249: check] Error 2
I checked the transformAtoB folder; the md.log and mdrun.out files (both attached) suggest the run did not finish, and mdrun.out shows a floating point error. {I renamed mdrun.out to mdrun_OUT.log, since otherwise the upload is not allowed, and I replaced the IP addresses in the files with *** signs.}
Please let me know if you need more information.