Fatal error: Setting the number of thread-MPI ranks is only supported with thread-MPI and GROMACS was compiled without thread-MPI

GROMACS version: 2021.4
GROMACS modification: No

Hello everyone!

I am trying to parallelize some test runs to learn how to use GROMACS efficiently on a cluster.
I requested a small amount of resources via the SLURM scheduler (i.e. writing an sbatch script.sh, sketched below):

  • 2 nodes
  • 2 tasks per node (i.e. MPI ranks)
  • 4 OpenMP threads per task
  • 1 GPU per node
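
For reference, the resource request at the top of the sbatch script looks roughly like this (I show the GPU request as --gres=gpu:1; the exact directive may differ between clusters, and the account/time lines are omitted):

  #!/bin/bash
  #SBATCH --nodes=2              # 2 nodes
  #SBATCH --ntasks-per-node=2    # 2 tasks (MPI ranks) per node
  #SBATCH --cpus-per-task=4      # 4 OpenMP threads per task
  #SBATCH --gres=gpu:1           # 1 GPU per node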

The loaded modules on the cluster are: StdEnv/2020, gcc/9.3.0, cuda/11.0, openmpi/4.0.3 and gromacs/2021.4.
In separate sbatch scripts (at first I thought it was an srun problem), I have used these two commands:

  1. srun (with and without further arguments) gmx_mpi mdrun -ntmpi 2 -ntomp 4 -gputasks 00 -deffnm md -bonded cpu -nb gpu -pme gpu -npme 1 -s file.tpr

  2. mpiexec (with or without -np) gmx_mpi mdrun -ntmpi 2 -ntomp 4 -gputasks 00 -deffnm md -bonded cpu -nb gpu -pme gpu -npme 1 -s file.tpr

In short: I am trying to create 2 tasks per simulation, one for the PP calculations and the other for the PME calculations, both running on the GPU (that’s why the -gputasks 00).
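
Put together, the srun variant inside the script looks roughly like this (the extra srun flags I tried varied between attempts, so I leave them out here; the backslashes are just line continuations):

  srun gmx_mpi mdrun -ntmpi 2 -ntomp 4 -gputasks 00 \
       -deffnm md -bonded cpu -nb gpu -pme gpu -npme 1 \
       -s file.tpr

The mpiexec variant is the same gmx_mpi mdrun line, launched with mpiexec (with or without -np).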

But I keep getting this error, even though it does not seem to be true:

Fatal error:
Setting the number of thread-MPI ranks is only supported with thread-MPI and
GROMACS was compiled without thread-MPI

I have already tried using just 1 node, so there was no need for the srun/mpiexec commands (just gmx_mpi), and everything worked, even when creating 4 tasks with 4 GPUs on a single node, showing that GROMACS is actually compiled with thread-MPI!!
Alternatively, using only 1 task per node (i.e. -ntmpi 1), it worked. But I want to create more tasks across more nodes (I want to try using 8 GPUs, and I need 4 tasks on each node).
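
For comparison, the single-node run that worked was launched directly, without srun/mpiexec, roughly like this (the -gputasks string is reconstructed from memory, so treat it as illustrative):

  # 1 node, 4 tasks, 4 GPUs: no fatal error
  gmx_mpi mdrun -ntmpi 4 -ntomp 4 -gputasks 0123 \
      -deffnm md -bonded cpu -nb gpu -pme gpu -npme 1 \
      -s file.tpr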

I also want to ask why creating more tasks on fewer GPUs gives significantly lower performance than using fewer tasks per GPU and more GPUs. I thought that running several tasks on a single GPU would still be efficient, as long as they do not fill up the GPU's memory (i.e. when I log into the GPU node and run nvidia-smi, I see several processes running simultaneously, so I expected that creating 4 tasks on 1 GPU would not be very different from creating 1 task on each of 4 GPUs).

I observed (simulating a small system for 100 fs of molecular dynamics) that the former setup (4 tasks / 1 GPU) achieved on average 210 ns/day, while the latter (1 task per GPU, 4 GPUs in total) reached over 300 ns/day. That is a significant difference, but I don’t understand why it happens this way.

Thank you and sorry for the long message.