GROMACS version: 2021.4
GROMACS modification: No
Hello everyone!
I am running some test jobs to learn how to use GROMACS efficiently on a cluster.
I am requesting a small amount of resources through the SLURM scheduler (i.e. via an sbatch script):
- 2 nodes
- 2 tasks per node (i.e. MPI ranks)
- 4 OpenMP threads per task
- 1 GPU per node
The loaded modules on the cluster are: StdEnv/2020, gcc/9.3.0, cuda/11.0, openmpi/4.0.3 and gromacs/2021.4.
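For reference, a minimal sbatch header matching that request would look roughly like this (the time limit is a placeholder, and whether --gres or --gpus-per-node is the right syntax depends on the cluster):

#!/bin/bash
#SBATCH --nodes=2                  # 2 nodes
#SBATCH --ntasks-per-node=2        # 2 MPI ranks per node
#SBATCH --cpus-per-task=4          # 4 OpenMP threads per rank
#SBATCH --gres=gpu:1               # 1 GPU per node
#SBATCH --time=01:00:00            # placeholder time limit

module load StdEnv/2020 gcc/9.3.0 cuda/11.0 openmpi/4.0.3 gromacs/2021.4
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}   # match OpenMP threads to the allocation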
Inside separate sbatch scripts (at first I thought it was an srun problem), I have tried these two commands:
- srun (with and without further arguments) gmx_mpi mdrun -ntmpi 2 -ntomp 4 -gputasks 00 -deffnm md -bonded cpu -nb gpu -pme gpu -npme 1 -s file.tpr
- mpiexec (with or without -np) gmx_mpi mdrun -ntmpi 2 -ntomp 4 -gputasks 00 -deffnm md -bonded cpu -nb gpu -pme gpu -npme 1 -s file.tpr
In short: I am trying to create 2 tasks for each simulation, one for the PP calculations and one for the PME calculations, both offloaded to the GPU (hence -gputasks 00).
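To spell out the layout I am trying to express with those flags (this is my reading of them, so please correct me if any of it is off):

# intended per-simulation layout (my understanding of the flags above):
#   -ntmpi 2 -ntomp 4   -> 2 ranks, each running 4 OpenMP threads
#   -npme 1 -pme gpu    -> one rank dedicated to PME, offloaded to the GPU
#   -nb gpu -bonded cpu -> the PP rank runs short-range nonbonded on the GPU, bonded terms on the CPU
#   -gputasks 00        -> both GPU tasks on the node mapped to device 0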
But I keep getting this error, even though it does not seem to be true:
Fatal error:
Setting the number of thread-MPI ranks is only supported with thread-MPI and
GROMACS was compiled without thread-MPI
I have already tried using just 1 node, with no need for the srun/mpiexec commands (just gmx_mpi), and everything worked, even creating 4 tasks with 4 GPUs on a single node, which shows that GROMACS is actually compiled with thread-MPI!
Alternatively, using only 1 task per node (i.e. -ntmpi 1), it worked. But I want to create more tasks across more nodes (I want to try using 8 GPUs, with 4 tasks on each node).
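Roughly, what I have in mind is something like the sketch below (just a sketch: whether the rank count should come from the scheduler rather than -ntmpi, and the exact --gres syntax, are part of what I am unsure about):

#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4        # 4 MPI ranks per node
#SBATCH --cpus-per-task=4          # 4 OpenMP threads per rank
#SBATCH --gres=gpu:4               # 4 GPUs per node, 8 in total

srun gmx_mpi mdrun -ntomp 4 -gputasks 0123 -deffnm md -bonded cpu -nb gpu -pme gpu -npme 1 -s file.tpr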
I also want to ask why, when creating more tasks on fewer GPUs, the performance is significantly lower than with fewer tasks on more GPUs. I thought that running several tasks on a single GPU would still be efficient, as long as the GPU's memory was not fully occupied (i.e. when I log into the GPU node and run nvidia-smi, I see several processes running simultaneously, so I expected that 4 tasks on 1 GPU would not be very different from 1 task per GPU using 4 of them).
I observed (simulating a small system for 100 fs of molecular dynamics) that the former (4 tasks / 1 GPU) achieved about 210 ns/day on average, while the latter (1 task per GPU, 4 GPUs in total) reached over 300 ns/day. That is a significant difference, and I don't understand why it happens this way.
Thank you and sorry for the long message.