I am trying to set up a run using several nodes. Each node contains 24 cores, divided into 2 sockets (2x12 cores). As a control, I have a run using 1 node that takes 8 days. This is the main part of the script for that run:
## SLURM options ##
#SBATCH --job-name=test
#SBATCH --partition=compute24
#SBATCH --nodes=1
#SBATCH --tasks-per-node=24
SCRATCH=/scratch/$USER/$SLURM_JOB_ID # $SCRATCH is the output dir for the calculation
module load mpi/openmpi/4.0.1/gcc
source $HOME/Softwares/gromacs-2020.1/gromacs-2020.1_built/bin/GMXRC
cd $SLURM_SUBMIT_DIR
# Creating dir for the output -> $SCRATCH
mkdir -p $SCRATCH
ROOT_NAME=out_test
# Copying the necessary inputs to $SCRATCH
cp $ROOT_NAME.tpr $SCRATCH
## RUNNING...
# Launching GROMACS
cd $SCRATCH
gmx_mpi mdrun -v -deffnm $ROOT_NAME -ntomp 24
# Copying the output back
cp $ROOT_NAME.* $SLURM_SUBMIT_DIR
# Moving back and removing $SCRATCH
cd $SLURM_SUBMIT_DIR
rm -rf $SCRATCH
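For reference, when gmx_mpi is started without mpirun as above, it runs as a single MPI rank driving 24 OpenMP threads. An equivalent single-node layout with one MPI rank per core (just a sketch, assuming the OpenMPI module loaded above) would be:
# Sketch: 24 MPI ranks with 1 OpenMP thread each on the same node
mpirun -np 24 gmx_mpi mdrun -v -deffnm $ROOT_NAME -ntomp 1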
I’ve tried that before, but GROMACS cannot find the .tpr, even if I explicitly pass it with the -s flag. The file is definitely in the work directory and is copied to the calculation $SCRATCH directory:
#SBATCH --job-name=multinode_test
#SBATCH --partition=compute24
#SBATCH --nodes=2
#SBATCH --tasks-per-node=1
#SBATCH --cpus-per-task=24
cd $SLURM_SUBMIT_DIR
(...)
# Creating dir for the output -> $SCRATCH
mkdir -p $SCRATCH
ROOT_NAME=md_MULTINODE_test
# Copying the necessary inputs to $SCRATCH
cp $ROOT_NAME.tpr $SCRATCH
## RUNNING...
# Launching GROMACS
cd $SCRATCH
mpirun -np 2 gmx_mpi mdrun -v -deffnm $ROOT_NAME -ntomp $SLURM_CPUS_PER_TASK -s $ROOT_NAME.tpr
(...)
slurm.out:
Program: gmx mdrun, version 2020.1
Source file: src/gromacs/commandline/cmdlineparser.cpp (line 275)
Function: void gmx::CommandLineParser::parse(int*, char**)
MPI rank: 1 (out of 2)
Error in user input:
Invalid command-line options
In command-line option -s
File 'md_MULTINODE_test.tpr' does not exist or is not accessible.
The file could not be opened.
Reason: No such file or directory
(call to fopen() returned error code 2)
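A plausible cause (an assumption here: that /scratch is node-local storage rather than a shared filesystem) is that the cp in the batch script only runs on the first node, so the rank launched on the second node cannot see the copied .tpr. A minimal sketch of staging the input on every allocated node with srun:
# Sketch, assuming node-local /scratch: create $SCRATCH and copy the .tpr on each node
srun --ntasks=$SLURM_NNODES --ntasks-per-node=1 mkdir -p $SCRATCH
srun --ntasks=$SLURM_NNODES --ntasks-per-node=1 cp $SLURM_SUBMIT_DIR/$ROOT_NAME.tpr $SCRATCH/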
I solved the issue with the .tpr file by eliminating the step of copying to the $SCRATCH directory, and it runs. However, it does not scale to 2 nodes:
-------------------------------------------------------
Program: gmx mdrun, version 2020.1
Source file: src/gromacs/domdec/domdec.cpp (line 2277)
MPI rank: 0 (out of 48)
Fatal error:
There is no domain decomposition for 36 ranks that is compatible with the
given box and a minimum cell size of 1.83 nm
Change the number of ranks or mdrun option -rdd or -dds
Look in the log file for details on the domain decomposition
For more information and tips for troubleshooting, please check the GROMACS
website at http://www.gromacs.org/Documentation/Errors
-------------------------------------------------------
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[34868,1],18]
Exit code: 1
--------------------------------------------------------------------------
It didn’t work with other values of $OMP_NUM_THREADS (2, 4) or with -np 24 either.
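If the goal is still to keep all 48 cores busy, one possible layout (a sketch only; whether a system this small actually scales across two nodes is a separate question) is fewer MPI ranks with more OpenMP threads each, so the domain decomposition has to build fewer, larger cells:
#SBATCH --nodes=2
#SBATCH --tasks-per-node=4
#SBATCH --cpus-per-task=6
(...)
# Sketch: 8 MPI ranks x 6 OpenMP threads = 48 cores over 2 nodes
mpirun -np 8 gmx_mpi mdrun -v -deffnm $ROOT_NAME -ntomp $SLURM_CPUS_PER_TASK -s $ROOT_NAME.tpr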
Your box is probably too small for that many MPI tasks. For the sake of a scalability test of the machine/GROMACS build, try running a bigger system that can accommodate 48 tasks. By the way, how many atoms does your system have?
My system has 15277 atoms (a DNA system with a small ligand, in water and KCl, AMBER99bsc1 FF) and the box is cubic with 5.36494 nm sides.
But I am a bit confused. The error reports 36 MPI ranks (I asked for 48) for that box size. Why 36? Is that the maximum number of ranks for that size? I ran the system on a single node with 32 CPUs and it worked well.
A full .log file would be informative. Some ranks may be assigned to PME rather than PP. Regardless, your system is quite small, so you likely won’t benefit from using as many processors as you’re asking for. I would not expect much performance enhancement beyond 16 or 24 processors.
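For the specific numbers in the error message, a back-of-the-envelope check (assuming the missing ranks were indeed set aside for PME):
48 requested ranks - 36 PP ranks = 12 ranks reserved for PME
floor(5.36494 nm / 1.83 nm)      = 2 cells per box dimension at most
2 x 2 x 2                        = 8 PP domains at most for this box
so 36 is simply what was left for particle-particle work after the PME split, and even that is far more domains than a 5.36 nm box can be cut into with a 1.83 nm minimum cell size.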