GROMACS version: 2022
GROMACS modification: No
I have been benchmarking GROMACS on JUWELS Booster. The problem occurs with one specific benchmark system:
benchRIB, from "A free GROMACS benchmark set" (Max Planck Institute for Multidisciplinary Sciences).
When I use 16 GPUs with GMX_ENABLE_DIRECT_GPU_COMM=true, the simulation fails with this error:
Fatal error:
Step 600: The total potential energy is nan, which is not finite.
If I run without setting the GMX_ENABLE_DIRECT_GPU_COMM environment variable, the simulation completes successfully.
Can you post your command line for mdrun?
Tagging @pszilard on this one.
This is the run command:
srun gmx_mpi mdrun -s benchRIB.tpr -nsteps 10000 -v -ntomp 12 -nb gpu -pme gpu -bonded gpu -npme 1 -notunepme
With these Slurm settings:
#SBATCH --nodes=4
#SBATCH --ntasks=16
#SBATCH --hint=nomultithread
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=12
#SBATCH --partition=booster
#SBATCH --gres=gpu:4
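Put together, the failing job would look roughly like the script below. This is only a sketch assembled from the settings and command above: the bash shebang and the placement of the export line are assumptions, and any module loading or environment setup specific to JUWELS Booster is left out.

```shell
#!/bin/bash
#SBATCH --nodes=4
#SBATCH --ntasks=16
#SBATCH --hint=nomultithread
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=12
#SBATCH --partition=booster
#SBATCH --gres=gpu:4

# Assumed placement: set the variable before launching mdrun.
# Comment this line out to get the run that completes successfully.
export GMX_ENABLE_DIRECT_GPU_COMM=true

# 16 ranks in total, one of them a dedicated PME rank (-npme 1),
# with nonbonded, PME and bonded work all offloaded to the GPUs.
srun gmx_mpi mdrun -s benchRIB.tpr -nsteps 10000 -v -ntomp 12 \
    -nb gpu -pme gpu -bonded gpu -npme 1 -notunepme
```

Submitting the same script once with and once without the export line gives the two log files to compare.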
That’s definitely not expected. Please file an issue at https://gitlab.com/gromacs/gromacs/-/issues, preferably with your input, job script, and log file.