Bug report

GROMACS version: 2021 - MODIFIED
GROMACS modification: Yes - note as modified on LUMI

I got the following error from GROMACS. The message says it is a bug. Any suggestions?

starting mdrun 'Title'
500000000000 steps, 1000000000.0 ps (continuing from step 69799500, 139599.0 ps).
step 69799500
-------------------------------------------------------
Program:     gmx mdrun, version 2021-MODIFIED
MPI rank:    0 (out of 640)

Standard library logic error (bug):
(exception type: St12out_of_range)
basic_string::erase: __pos (which is 18446744073709551615) > this->size()
(which is 1024)

For more information and tips for troubleshooting, please check the GROMACS
website at http://www.gromacs.org/Documentation/Errors
-------------------------------------------------------
MPICH Notice [Rank 0] [job id 929543.0] [Fri Mar 18 17:47:34 2022] [nid001458] - Abort(1) (rank 0 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0

aborting job:
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
srun: error: nid001458: task 0: Exited with exit code 255
srun: launch/slurm: _step_signal: Terminating StepId=929543.0
slurmstepd: error: *** STEP 929543.0 ON nid001458 CANCELLED AT 2022-03-18T17:47:34 ***

Hello,

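This is an internal C++ standard library error (an uncaught std::out_of_range exception), likely due to some integer value overflowing and causing trouble. The position reported in the message, 18446744073709551615, is 2^64 - 1, the largest value of an unsigned 64-bit integer.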

Can you share the TPR and checkpoint file so I can try to reproduce this? This of course shouldn’t happen during a normal run.

Cheers

Paul

Hi Paul,

I literally just copied the file from LUMI to Dardel and ran it there, and it works perfectly fine.

I guess this is not a GROMACS issue but a cluster issue? Should I just let them know?

Best

Will

Hello Will,

Did you use the same Slurm settings, number of ranks, and so on? If so, it might be an issue with the LUMI version. Otherwise, I would first try to fully reproduce it to make sure we are not doing something bad during restarts.

Cheers

Paul