GROMACS version: 2024.4
GROMACS modification: Yes
I’ve been trying to run two replicates of a simulation on a single node of a cluster, but I’m getting a large slowdown compared to running them one at a time, even though each job is allocated identical resources through sbatch.
Each job is given 20 of the node’s 40 CPUs and 1 of its 2 GPUs. The only significant difference I can find when monitoring the node is that GPU utilization drops by half when both jobs are running, but they are assigned different GPUs and both stay well below the temperature at which they would start being throttled.
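For reference, each job is submitted with a script along these lines (a rough sketch of the request, not the exact script; the module and file names are placeholders):

```bash
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=20          # half of the node's 40 CPUs
#SBATCH --gres=gpu:1                # one of the node's 2 GPUs
#SBATCH --time=24:00:00             # placeholder walltime

module load gromacs                 # placeholder module name

# single rank, using all 20 allocated cores as OpenMP threads
gmx mdrun -deffnm replicate1 -ntomp ${SLURM_CPUS_PER_TASK}
```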
The only difference I can find in the log files is that the same number of flops takes twice as long, due to an across-the-board increase in GPU wait time. I’ve included the log output from two identical simulations, one running alone on the node and one running while a second job was present.
Since both jobs are scheduled separately via Slurm, I’m at a bit of a loss as to how the two runs are interfering with each other to cause such a slowdown. If this is a known issue, or if there are additional troubleshooting steps you’d recommend, I’d really appreciate the help!
Are you sure the jobs are not sharing CPU or GPU resources? If Slurm is set up properly it should keep them separate, but maybe it is not. It looks as though both jobs could be running on the same GPU.
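You can check what Slurm actually allocated to each job with something like this (a rough sketch; the job IDs are placeholders):

```bash
# detailed allocation per job: CPU_IDs shows which cores, GRES shows which GPU
scontrol show job -d <jobid_run1> | grep -E 'CPU_IDs|GRES'
scontrol show job -d <jobid_run2> | grep -E 'CPU_IDs|GRES'
```

If the CPU_IDs ranges overlap, or both jobs were handed the same GPU index, that would explain the slowdown.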
That was my initial thought as well, but when I ssh into the node and check, it looks like both GPUs are being used at half capacity.
Here’s the output from nvidia-smi:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 565.57.01              Driver Version: 565.57.01      CUDA Version: 12.7     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla V100-PCIE-32GB           On  |   00000000:86:00.0 Off |                  Off |
| N/A   48C    P0            115W /  150W |     374MiB /  32768MiB |     50%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  Tesla V100-PCIE-32GB           On  |   00000000:D8:00.0 Off |                  Off |
| N/A   45C    P0             80W /  150W |     374MiB /  32768MiB |     47%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A         3606907      C   gmx                                     370MiB |
|    1   N/A  N/A         3612507      C   gmx                                     370MiB |
+-----------------------------------------------------------------------------------------+
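Those are the two mdrun processes, one per replicate. To rule out the CPU side as well, their core affinity can be checked directly (a sketch, using the PIDs from the output above):

```bash
taskset -cp 3606907   # gmx process on GPU 0
taskset -cp 3612507   # gmx process on GPU 1
```

If the two affinity lists overlap, the runs are competing for the same cores even though they sit on different GPUs.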