GROMACS version: 2024.4
GROMACS modification: Yes
I’ve been trying to run two replicates of a simulation on a single node of a cluster, but I’m getting a large slowdown compared to running them one at a time, even though each job is allocated identical resources through sbatch.
Each job is given 20 of the node’s 40 CPUs and 1 of its 2 GPUs. The only significant difference I can find when monitoring the node is that GPU utilization drops by half when both jobs are running, but they are assigned different GPUs and both stay well below the temperature at which they would start being throttled.
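For reference, each job is submitted with a script along these lines (a rough sketch of the request, not the exact script; the module and file names are placeholders):

```bash
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=20          # half of the node's 40 CPUs
#SBATCH --gres=gpu:1                # one of the node's 2 GPUs
#SBATCH --time=24:00:00             # placeholder walltime

module load gromacs                 # placeholder module name

# single rank, using all 20 allocated cores as OpenMP threads
gmx mdrun -deffnm replicate1 -ntomp ${SLURM_CPUS_PER_TASK}
```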
The only difference I can find in the log files is that the same number of flops takes twice as long, due to an across-the-board increase in GPU wait time. I’ve included the log output from two identical simulations, one running alone on the node and one running while a second job was present.
Since both jobs are scheduled separately via Slurm, I’m at a bit of a loss as to how the two runs are interfering with each other to cause such a slowdown. If this is a known issue, or if there are additional troubleshooting steps you’d recommend, I’d really appreciate the help!
Are you sure the jobs are not sharing CPU or GPU resources? If Slurm is set up properly it should keep them separate, but maybe it is not. It looks as though both jobs could be running on the same GPU.
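You can check what Slurm actually allocated to each job with something like this (a rough sketch; the job IDs are placeholders):

```bash
# detailed allocation per job: CPU_IDs shows which cores, GRES shows which GPU
scontrol show job -d <jobid_run1> | grep -E 'CPU_IDs|GRES'
scontrol show job -d <jobid_run2> | grep -E 'CPU_IDs|GRES'
```

If the CPU_IDs ranges overlap, or both jobs were handed the same GPU index, that would explain the slowdown.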
That was my initial thought as well, but when I ssh into the node and check, it looks like both GPUs are being used at half capacity.
Here’s the output from nvidia-smi:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 565.57.01              Driver Version: 565.57.01      CUDA Version: 12.7     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla V100-PCIE-32GB           On  |   00000000:86:00.0 Off |                  Off |
| N/A   48C    P0            115W /  150W |     374MiB /  32768MiB |     50%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  Tesla V100-PCIE-32GB           On  |   00000000:D8:00.0 Off |                  Off |
| N/A   45C    P0             80W /  150W |     374MiB /  32768MiB |     47%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A         3606907      C   gmx                                     370MiB |
|    1   N/A  N/A         3612507      C   gmx                                     370MiB |
+-----------------------------------------------------------------------------------------+
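Those are the two mdrun processes, one per replicate. To rule out the CPU side as well, their core affinity can be checked directly (a sketch, using the PIDs from the output above):

```bash
taskset -cp 3606907   # gmx process on GPU 0
taskset -cp 3612507   # gmx process on GPU 1
```

If the two affinity lists overlap, the runs are competing for the same cores even though they sit on different GPUs.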