Error when running REUS on multiple nodes

GROMACS version: gromacs-2020.4+plumed-2.6.2.sh
GROMACS modification: Yes
I’m running a replica-exchange umbrella sampling (REUS) simulation with 24 replicas on 48 nodes. Each node has 128 physical cores (256 logical cores).

The command to run the job is:

srun gmx_mpi mdrun -v -multidir 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 -deffnm md_0_1 -replex 1000 -plumed …/cv.dat

If I set --ntasks=6144 (128 × 48), the job runs and I get a performance of 60 ps/h. If I instead set --ntasks=12288 (256 × 48) to use all the logical cores, the job is killed and I get the following message:

    srun gmx_mpi mdrun -v -multidir 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 -deffnm md_0_1 -replex 1000 -plumed …/cv.dat
    kvsprovider[8844]: Timeout: Not all clients called pmi_init(): init=1024 left=11264 round=1
    kvsprovider[8844]: Timeout: Not all clients called pmi_init(): init=2048 left=10240 round=2
    kvsprovider[8844]: Timeout: Not all clients called pmi_init(): init=2998 left=9290 round=3
    kvsprovider[8844]: Timeout: Not all clients called pmi_init(): init=3969 left=8319 round=4
    kvsprovider[8844]: Timeout: Not all clients called pmi_init(): init=4943 left=7345 round=5
    kvsprovider[8844]: Timeout: Not all clients called pmi_init(): init=5926 left=6362 round=6
    kvsprovider[8844]: Timeout: Not all clients called pmi_init(): init=6912 left=5376 round=7
    kvsprovider[8844]: Timeout: Not all clients called pmi_init(): init=7936 left=4352 round=8
    srun: error: timeout waiting for task launch, started 8704 of 12288 tasks
    srun: launch/slurm: launch_p_step_launch: StepId=12006718.0 aborted before step completely launched.
    srun: Job step aborted: Waiting up to 6 seconds for job step to finish.
    readFromPMIClient: lost connection to the PMI client
    srun: error: task 9216 launch failed: Unspecified error
    srun: error: task 9217 launch failed: Unspecified error
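
For context, the jobs are submitted with a Slurm batch script roughly along these lines; this is only a sketch, and the job name, wall time, and module line are placeholders rather than the actual values:

    #!/bin/bash
    #SBATCH --nodes=48
    #SBATCH --ntasks-per-node=128     # 128 × 48 = 6144 MPI ranks (the working case)
    ##SBATCH --ntasks-per-node=256    # 256 × 48 = 12288 ranks (the failing case)
    #SBATCH --cpus-per-task=1
    #SBATCH --time=24:00:00           # placeholder wall time

    # Environment setup is cluster-specific; this module name is a placeholder.
    # module load gromacs/2020.4-plumed-2.6.2

    # {1..24} expands to the directory list 1 2 ... 24
    srun gmx_mpi mdrun -v -multidir {1..24} -deffnm md_0_1 -replex 1000 -plumed …/cv.dat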

Do you have any idea why this occurs?

You are using a PLUMED-modified version of GROMACS, not an upstream official one. Can you please try running with an unmodified GROMACS and report back whether you still see the failures? (You might also see a performance improvement.)
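
For a quick launch/performance check you could even drop -replex and -plumed and just run the 24 simulations independently with a stock build, e.g. something along these lines (exact binary and module names will of course depend on your system):

    srun gmx_mpi mdrun -v -multidir 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 -deffnm md_0_1

That would tell you whether the 12288-task launch itself works, independently of the PLUMED patch and the replica-exchange setup.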

Cheers,
Szilárd