GROMACS version: gromacs-2020.4+plumed-2.6.2.sh
GROMACS modification: Yes
I’m running a replica-exchange umbrella sampling simulation with 24 replicas on 48 nodes. Each node has 128 physical cores and 256 logical cores.
The command to run the job is:
srun gmx_mpi mdrun -v -multidir 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 -deffnm md_0_1 -replex 1000 -plumed …/cv.dat
If I set --ntasks=128x48=6144, the job runs and I get a performance of 60 ps/h. If I instead set --ntasks=256x48=12288 to use all the logical cores, the job is killed and I get the following message:
- srun gmx_mpi mdrun -v -multidir 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 -deffnm md_0_1 -replex 1000 -plumed …/cv.dat
kvsprovider[8844]: Timeout: Not all clients called pmi_init(): init=1024 left=11264 round=1
kvsprovider[8844]: Timeout: Not all clients called pmi_init(): init=2048 left=10240 round=2
kvsprovider[8844]: Timeout: Not all clients called pmi_init(): init=2998 left=9290 round=3
kvsprovider[8844]: Timeout: Not all clients called pmi_init(): init=3969 left=8319 round=4
kvsprovider[8844]: Timeout: Not all clients called pmi_init(): init=4943 left=7345 round=5
kvsprovider[8844]: Timeout: Not all clients called pmi_init(): init=5926 left=6362 round=6
kvsprovider[8844]: Timeout: Not all clients called pmi_init(): init=6912 left=5376 round=7
kvsprovider[8844]: Timeout: Not all clients called pmi_init(): init=7936 left=4352 round=8
srun: error: timeout waiting for task launch, started 8704 of 12288 tasks
srun: launch/slurm: launch_p_step_launch: StepId=12006718.0 aborted before step completely launched.
srun: Job step aborted: Waiting up to 6 seconds for job step to finish.
readFromPMIClient: lost connection to the PMI client
srun: error: task 9216 launch failed: Unspecified error
srun: error: task 9217 launch failed: Unspecified error
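For reference, here is a quick sketch of the arithmetic behind the two task counts. The variable names are just mine for illustration, and the ranks-per-replica figures assume that -multidir splits the MPI ranks evenly across the 24 directories:

```shell
#!/bin/sh
# Values taken from the description above; variable names are illustrative only.
nodes=48
physical_cores_per_node=128
logical_cores_per_node=256
replicas=24

# Total MPI tasks when using one task per physical or per logical core.
tasks_physical=$((nodes * physical_cores_per_node))
tasks_logical=$((nodes * logical_cores_per_node))

# Assumption: -multidir divides the ranks evenly among the replicas.
ranks_per_replica_physical=$((tasks_physical / replicas))
ranks_per_replica_logical=$((tasks_logical / replicas))

echo "physical cores: ntasks=$tasks_physical, ranks per replica=$ranks_per_replica_physical"
echo "logical cores:  ntasks=$tasks_logical, ranks per replica=$ranks_per_replica_logical"
```

So the working run uses 6144 tasks and the failing one asks srun to launch 12288, which matches the init + left totals in the kvsprovider lines.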
Do you have any idea why this occurs?