GROMACS version: 2022.3
GROMACS modification: No
Hi,
I am having trouble submitting jobs through SLURM. I am using a cluster for my simulations, and while GROMACS runs without problems on the head node, the batch scripts I prepare and submit to the queue for the compute nodes keep failing with the following error:
/var/spool/slurm//job80121/slurm_script: line 16: 1108414 Illegal instruction (core dumped) gmx mdrun -deffnm nvt
and similarly for gmx_mpi:
/var/spool/slurm//job76105/slurm_script: line 15: 3217846 Illegal instruction (core dumped) gmx_mpi mdrun -deffnm nvt
GROMACS has been compiled with both GPU and MPI support, using cuda/11.6, openmpi/4.1.0 and gcc/11.2.0.
The cluster is running the latest Red Hat version.
The SLURM scripts I am submitting have the following form:
#!/bin/bash
#SBATCH --partition=priority
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=16G
“Illegal instruction” typically means the binary was not compiled for the hardware it runs on. That is, compiling on the head/login node may produce an executable that is not suitable for the compute nodes.
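A quick way to check this (the partition name is taken from your script; the install path and thread count are just placeholders) is to compare the CPU instruction sets of the head node and a compute node, and, if they differ, rebuild GROMACS in an interactive job on a compute node so that CMake detects the correct SIMD level:

# Compare CPU capabilities (look for differences in avx, avx2, avx512, etc.)
lscpu | grep -i flags
srun --partition=priority lscpu | grep -i flags

# If they differ, build on a compute node instead of the head node
srun --partition=priority --ntasks=1 --cpus-per-task=8 --pty bash
# then, from a build directory inside the unpacked GROMACS source:
cmake .. -DGMX_MPI=ON -DGMX_GPU=CUDA -DCMAKE_INSTALL_PREFIX=$HOME/gromacs-2022.3
make -j 8 && make install

GROMACS auto-detects the SIMD level of the machine it is built on, which is why a head-node build can crash with “Illegal instruction” on compute nodes with older CPUs.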
Firstly, I’d like to thank you for dedicating some of your time to answer : )
If it’s not too much to ask, would you perhaps have any tips on how to tackle this problem? I tried GROMACS in two ways: once by compiling it myself, and once by asking IT to provide it as a module on the cluster. Unfortunately, both failed in the SLURM queue.
Or perhaps there are questions I should ask IT in order to get more information about the cluster and track down the root cause?
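In the meantime, one check that might narrow it down (assuming the module version of GROMACS is on your PATH) is to compare the SIMD level the installed binary was built for with what the compute nodes actually support:

# SIMD level the installed GROMACS was built for
gmx_mpi --version | grep -i simd

# Does the same binary even start on a compute node?
srun --partition=priority gmx_mpi --version

If the reported SIMD instructions (e.g. AVX_512) are newer than what the compute-node CPUs provide, that would explain the “Illegal instruction” crash, and IT would need to rebuild the module for the oldest CPU generation in the cluster.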
This time the executable was recognised properly and the simulation appeared to start, but another error appeared:
Command line:
gmx_mpi mdrun -ntomp 2 -deffnm nvt -v
Back Off! I just backed up nvt.log to ./#nvt.log.2#
Reading file nvt.tpr, VERSION 2020.5-dev-UNCHECKED (single precision)
Program: gmx mdrun, version 2022.5
Standard library logic error (bug):
(exception type: St12length_error)
vector::_M_default_append
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[64689,1],0]
Exit code: 1
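One detail in that output that may matter: the run input was generated by a different, development-flagged GROMACS version (“Reading file nvt.tpr, VERSION 2020.5-dev-UNCHECKED”) than the mdrun executing it (2022.5). It would be worth ruling that out by regenerating the .tpr with the same GROMACS build that will run it; a minimal sketch, where the input file names behind -deffnm nvt are assumptions:

# Rebuild the run input with the installed GROMACS (file names are assumptions)
gmx_mpi grompp -f nvt.mdp -c em.gro -p topol.top -o nvt.tpr
gmx_mpi mdrun -ntomp 2 -deffnm nvt -v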
Additionally, when I typed gmx_mpi on the head node to check that the program at least starts, I got the following error:
gmx_mpi: error while loading shared libraries: libcufft.so.10: cannot open shared object file: No such file or directory
Is this error solely due to the dependency that GROMACS could not find, or is there something else I am missing?
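That message just means the dynamic loader cannot find the cuFFT library, which ships with the CUDA toolkit. A minimal sketch, assuming the cluster uses environment modules and the same toolchain GROMACS was built with:

# Load the toolchain used at build time so libcufft.so.10 ends up on LD_LIBRARY_PATH
module load gcc/11.2.0 openmpi/4.1.0 cuda/11.6

# Check whether any shared libraries are still unresolved
ldd $(which gmx_mpi) | grep "not found"

It is usually worth putting the same module load lines in the SLURM script as well, so the environment on the compute nodes matches the one the binary was built with.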