Multiple gmx processes on GPU(s) ran too slowly

GROMACS version: 2026.0 (Anaconda distribution, build nompi_cuda_h39c90b0_1, channel conda-forge)
GROMACS modification: No
Hello, I need to simulate several systems almost identical in terms of the number of atoms. All have the same protein receptor, the same water model, the same box size. The only difference is the ligands. My command was:

gmx mdrun -deffnm md -v

When I run one system at a time on a GPU, it takes around 6 hours to complete. nvidia-smi reported around 80% GPU utilization during the run.

Theoretically, if I run 2 systems at a time on the same GPU, it will take approximately 6*2 = 12 hours to complete, right? But mdrun estimated that the runs would finish 3 days later!? And nvidia-smi reported just 2-3% GPU utilization.

I then ran these 2 systems on 2 different GPUs of the same model, with each GPU handling 1 system. mdrun again estimated completion 3 days later, and each GPU showed only 2-3% utilization.

I don’t know what causes this problem. Any help is appreciated.

Best regards,

Hoa

Hi,

“Theoretically, if I run 2 systems at a time on the same GPU, it will take approximately 6*2 = 12 hours to complete, right?”

Right. Could even be faster, depending on things.

To properly diagnose the issue, it would be necessary to look at the log files. You can run shorter simulations (~5 minutes wall-time) and look at the performance counter tables at the end of each log. Some background for understanding the report: “Getting good performance from mdrun” in the GROMACS 2026.2 documentation.

However, your issue is very likely CPU contention. When you run gmx mdrun without resource usage flags, GROMACS assumes it can use the whole machine. If you run the two simulations on the same machine this way, they will fight for the same CPU cores and, as a result, fail to keep the GPU(s) supplied with work, waste time switching CPU contexts, etc.

Add -ntmpi 1 -ntomp X flags to both gmx mdrun commands, where X is half the number of CPU cores on your machine, to explicitly tell each simulation how many resources to use. That would hopefully resolve your issue.

You can further add -pinoffset 0 -pin on to the first simulation and -pinoffset X -pin on to the second (with the same X) to explicitly assign cores [0; X) to the first gmx and cores [X; 2*X) to the second gmx, which would help things a bit more.
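
A minimal sketch of the command lines this advice leads to, written as a small Python helper so the arithmetic is visible. The helper name mdrun_commands and its parameters are made up for illustration; only the mdrun flags come from the posts above. It assumes each run is started from its own directory containing md.tpr, and it does not model how -pinoffset values map onto OS processor numbers, which depends on the machine's topology.

```python
def mdrun_commands(n_cores, n_runs=2, offset_step=None, deffnm="md"):
    """Build one `gmx mdrun` command line per simultaneous run.

    n_cores     : the core count you decide to divide up (an input,
                  this sketch does not detect hardware).
    offset_step : distance between the pin offsets of consecutive runs.
                  Defaults to X, i.e. the [0; X), [X; 2*X) layout
                  suggested above (hypothetical parameter).
    """
    x = n_cores // n_runs  # OpenMP threads per run
    if x < 1:
        raise ValueError("more runs than cores")
    if offset_step is None:
        offset_step = x
    return [
        f"gmx mdrun -deffnm {deffnm} -v -ntmpi 1 -ntomp {x} "
        f"-pin on -pinoffset {i * offset_step}"
        for i in range(n_runs)
    ]


if __name__ == "__main__":
    for cmd in mdrun_commands(64):
        print(cmd)
```

On a machine with two hardware threads per core, it is worth checking whether the offset is counted in cores or in hardware threads; if it is the latter, offset_step would need to differ from X.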

Hi,

I have used the flags “-ntmpi 1 -ntomp X -pinoffset 0 -pin on” and “-ntmpi 1 -ntomp X -pinoffset X -pin on”, but the runs are still slow and GPU utilization is still 2-3%. Could the problem stem from the Anaconda distribution?

“Could the problem stem from the Anaconda distribution?”

I would not expect Conda to be a problem, but without data it’s just a guess.

As I mentioned, to properly diagnose things it’s better to run short simulations (~5 minutes wall-time; you can use -nsteps flag to change the number of steps without re-generating your TPR) in “good” and “bad” conditions (i.e., a stand-alone run, and two runs side-by-side), and see the performance counter tables in the log files. You can just attach the three log files here if you want me to take a look at them.
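
To pick a value for -nsteps that gives roughly 5 minutes of wall-time, a rough estimate can be made from the expected throughput. The sketch below is only that: the function names are hypothetical, the 2 fs time step is an assumption (check dt in your .mdp file), and the 100 ns/day figure is a placeholder rather than a number from this thread.

```python
import math


def steps_for_wall_time(ns_per_day, wall_seconds=300.0, dt_fs=2.0):
    """Estimate how many MD steps fit into `wall_seconds` of wall time.

    ns_per_day : expected throughput of the run being benchmarked.
    dt_fs      : integration time step in femtoseconds (assumed 2 fs).
    """
    simulated_ns = ns_per_day * wall_seconds / 86400.0  # 86400 s per day
    steps = simulated_ns * 1.0e6 / dt_fs                # 1 ns = 1e6 fs
    return max(1, math.ceil(steps))


def short_run_command(nsteps, deffnm="md"):
    """Command line for a short benchmark run that reuses the existing TPR."""
    return f"gmx mdrun -deffnm {deffnm} -v -nsteps {nsteps}"


if __name__ == "__main__":
    n = steps_for_wall_time(ns_per_day=100.0)  # placeholder throughput
    print(short_run_command(n))
```

For the slow side-by-side case the same step count will take much longer, so it may be more practical to size the run from the slow throughput instead.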

Hi,

My workstation has 128 CPU cores. If I do this:

gmx mdrun -deffnm md -v -gpu_id 0 -ntmpi 1 -ntomp 64 -pinoffset 64 -pin on

At pinoffset = 64 and one gmx process at a time, the CPU cores to run this process range from 32-63 and 96-127.

At pinoffset = 0 and one gmx process at a time, the CPU cores to run gmx will be from 0-63.

When I run two gmx processes with pinoffset=0 and 64 at the same time, the CPU cores to run range from 0-63 and 96-127. Cores 64-95 don’t run. Performance is 6.212 ns/day, which is far slower than the cases that will be mentioned later in this reply.

Clearly, pinoffset=X does not pin the process to the CPU cores in the range [X; 2X). The selected CPU cores may overlap when both gmx processes run at the same time with pinoffset 0 and 64, so the two processes must have been fighting for the overlapping cores. If I remove the -pinoffset flag but keep -pin on, the same thing happens.

If I run two gmx processes at the same time using this setting:

gmx mdrun -deffnm md -v -gpu_id 0 -ntmpi 1 -ntomp 64

Then all CPU cores are utilized, and I get 300.857 ns/day. If I run one process per GPU (same command, except for a different -gpu_id value for each process), I get a higher speed, 356.302 ns/day. But one process per GPU is still not as fast as running only one process at a time on one GPU (655.006 ns/day, with the same command and also with 64 cores).
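
To compare these cases it helps to look at the combined throughput of both runs rather than the per-run figure. The short sketch below does that arithmetic with the three numbers quoted above; it assumes each ns/day value is the per-process figure from that process's own log.

```python
SINGLE_RUN = 655.006  # ns/day, one process on one GPU

# ns/day reported per process when two processes run at once
TWO_RUNS = {
    "two runs, one GPU": 300.857,
    "two runs, two GPUs": 356.302,
}


def combined_throughput(per_run_ns_day, n_runs=2):
    """Total ns/day delivered by n_runs identical simultaneous runs."""
    return per_run_ns_day * n_runs


if __name__ == "__main__":
    for label, per_run in TWO_RUNS.items():
        total = combined_throughput(per_run)
        print(f"{label}: {total:.1f} ns/day combined, "
              f"{total / SINGLE_RUN:.2f}x a single run")
```

With ideal scaling, two runs on two GPUs would deliver twice the single-run throughput, so the ratio printed here shows how far each case is from that.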

I also attached some log files. At the top of each log file is the description for the running case to which each log file belongs.

md_1_process.log (27.4 KB)

md_2_processes_2_gpus.log (27.5 KB)

md_2_processes_1_gpu.log (27.4 KB)

Hi,

Not quite. You have 64 physical cores, each having two threads / virtual cores / logical cores. From the log:

Running on 1 node with total 64 cores, 128 processing units, 2 compatible GPUs
....
    Brand:  AMD Ryzen Threadripper PRO 9985WX 64-Cores     
...
    Packages, cores, and logical processors:
    [indices refer to OS logical processors]
      Package  0: [   0  64] [   1  65] [   2  66] [   3  67] [   4  68] [   5  69] [   6  70] [   7  71] [   8  72] [   9  73] [  10  74] [  11  75] [  12  76] [  13  77] [  14  78] [  15  79] [  16  80] [  17  81] [  18  82] [  19  83] [  20  84] [  21  85] [  22  86] [  23  87] [  24  88] [  25  89] [  26  90] [  27  91] [  28  92] [  29  93] [  30  94] [  31  95] [  32  96] [  33  97] [  34  98] [  35  99] [  36 100] [  37 101] [  38 102] [  39 103] [  40 104] [  41 105] [  42 106] [  43 107] [  44 108] [  45 109] [  46 110] [  47 111] [  48 112] [  49 113] [  50 114] [  51 115] [  52 116] [  53 117] [  54 118] [  55 119] [  56 120] [  57 121] [  58 122] [  59 123] [  60 124] [  61 125] [  62 126] [  63 127]

Thus, using -ntomp 64 with two processes still oversubscribes your CPU (and is probably what GROMACS does by default anyway). Two processes run with pinoffset=0 and pinoffset=64 end up on the same physical cores; furthermore, with -pin on, the main threads doing coordination and GPU submission for both simulations are pinned to the same core 0/64, which is the worst possible situation.

So, with two processes, you should use X=32.
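
The core/thread pairing can be read out of the log line mechanically. Below is a small sketch that parses a "Package" line of the format quoted above and reports which physical cores two sets of OS logical processors have in common. The function names are hypothetical, and the sample string is just the first four entries of the line from the log.

```python
import re


def parse_core_groups(package_line):
    """Return one tuple of OS logical processor indices per physical core,
    parsed from a 'Package N: [ a b] [ c d] ...' line of an mdrun log."""
    return [
        tuple(int(n) for n in group.split())
        for group in re.findall(r"\[([\d\s]+)\]", package_line)
    ]


def shared_cores(groups, procs_a, procs_b):
    """Positions of the physical cores used by both sets of OS processors."""
    a, b = set(procs_a), set(procs_b)
    return [i for i, g in enumerate(groups) if a & set(g) and b & set(g)]


if __name__ == "__main__":
    sample = "Package  0: [   0  64] [   1  65] [   2  66] [   3  67]"
    groups = parse_core_groups(sample)
    # Processors 0-1 and 64-65 look disjoint but sit on the same two cores.
    print(shared_cores(groups, [0, 1], [64, 65]))
```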

Aside: in your logs, the runs have a wall-time of ~1 second. It’s best to run for somewhat longer (e.g., the 5 minutes I mentioned, or 1 minute if you’re in a hurry) to get a more reliable timing breakdown. For example, >10% of the time is spent writing coordinates to disk twice in 2000 steps, whereas with nstxout set to 5000 a longer run writes only once per 5000 steps, so the effect would be ~5x smaller.
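
The ~5x figure can be checked with a back-of-the-envelope calculation. The sketch below expresses the cost of one coordinate write in units of one MD step's wall-time and picks that cost so that the 2000-step run spends exactly 10% of its time writing; the 10% is taken as an illustrative value for the ">10%" above, and the function name is hypothetical.

```python
def write_fraction(n_steps, n_writes, write_cost_in_steps):
    """Fraction of wall time spent writing output, with the cost of one
    write expressed in units of one MD step's wall time."""
    write_time = n_writes * write_cost_in_steps
    return write_time / (n_steps + write_time)


# Solve 2w / (2000 + 2w) = 0.10 for the per-write cost w.
SHORT_FRACTION = 0.10
WRITE_COST = SHORT_FRACTION * 2000 / (2 * (1 - SHORT_FRACTION))

if __name__ == "__main__":
    short_run = write_fraction(2000, 2, WRITE_COST)  # two writes in 2000 steps
    long_run = write_fraction(5000, 1, WRITE_COST)   # one write per 5000 steps
    print(f"short run: {short_run:.1%}, long run: {long_run:.1%}, "
          f"ratio: {short_run / long_run:.1f}x")
```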

Hi,

Thanks for still helping me this far. Unfortunately, -pinoffset X=32 doesn’t fix the problem because it will utilize CPU threads [16;48) and [80;112). [16;48) overlaps with [0;64) of X=0.

I notice that, with -ntomp 64:
X=1 pins threads [1;33) and [64;96).
X=2 pins [1;33) and [65;97).
X=3 pins [2;34) and [65;97).
X=4 pins [2;34) and [66;98).

So X roughly pins [X/2; X/2+ntomp/2) and [X/2+ntomp; X/2+ntomp*3/2).

This roughly matches the “Packages, cores, and logical processors” you found in the log:

Packages, cores, and logical processors:
[indices refer to OS logical processors]
Package 0: [ 0 64] [ 1 65] [ 2 66] [ 3 67] [ 4 68] [ 5 69] [ 6 70] [ 7 71] [ 8 72] [ 9 73] [ 10 74] [ 11 75] [ 12 76] [ 13 77] [ 14 78] [ 15 79] [ 16 80] [ 17 81] [ 18 82] [ 19 83] [ 20 84] [ 21 85] [ 22 86] [ 23 87] [ 24 88] [ 25 89] [ 26 90] [ 27 91] [ 28 92] [ 29 93] [ 30 94] [ 31 95] [ 32 96] [ 33 97] [ 34 98] [ 35 99] [ 36 100] [ 37 101] [ 38 102] [ 39 103] [ 40 104] [ 41 105] [ 42 106] [ 43 107] [ 44 108] [ 45 109] [ 46 110] [ 47 111] [ 48 112] [ 49 113] [ 50 114] [ 51 115] [ 52 116] [ 53 117] [ 54 118] [ 55 119] [ 56 120] [ 57 121] [ 58 122] [ 59 123] [ 60 124] [ 61 125] [ 62 126] [ 63 127]

However, X=0 doesn’t match this pattern at all. Instead of pinning [0;32) and [64;96), it goes [0;64).

With -ntomp 32:
X=0 pins [0;32), i.e. [X; X+ntomp).
X=1 pins [64;96), i.e. [64X; 64X+ntomp).
X=2 pins [1;33), i.e. [X/2; X/2+ntomp).
X=32 pins [16;48), i.e. [X/2; X/2+ntomp).
X=64 pins [32;64), i.e. [X-ntomp; X).
X=96 pins [48;64) and [112;128), i.e. [X/2; X/2+ntomp/2) and [X+ntomp/2; X+ntomp).
These don’t follow any pattern.
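
One possible reading of these numbers, offered as a guess rather than as a description of what mdrun actually does: the offset counts hardware threads core by core (both threads of core 0, then both threads of core 1, and so on), and the spacing between pinned threads is 2 when one thread per core still fits on the machine and 1 otherwise. The sketch below encodes that model. The function names and the stride rule are assumptions inferred only from the values listed in this thread; the 64-core, 2-threads-per-core layout is the one shown in the log.

```python
def pinned_os_processors(pinoffset, ntomp, n_cores=64, threads_per_core=2):
    """OS logical processors occupied under the inferred (assumed) model.

    Hardware-thread index g means thread g % threads_per_core of core
    g // threads_per_core; thread t of core c is OS processor
    c + t * n_cores, as in the [0 64] [1 65] ... table from the log.
    """
    n_hw = n_cores * threads_per_core
    # Assumed rule: one thread per core if that fits, else pack densely.
    fits = pinoffset + ntomp * threads_per_core <= n_hw
    stride = threads_per_core if fits else 1
    procs = []
    for i in range(ntomp):
        g = pinoffset + i * stride
        if g >= n_hw:
            raise ValueError("layout does not fit on this machine")
        core, t = divmod(g, threads_per_core)
        procs.append(core + t * n_cores)
    return sorted(procs)


def as_ranges(procs):
    """Collapse a sorted list into half-open (start, stop) ranges."""
    ranges = []
    for p in procs:
        if ranges and ranges[-1][1] == p:
            ranges[-1][1] = p + 1
        else:
            ranges.append([p, p + 1])
    return [tuple(r) for r in ranges]


if __name__ == "__main__":
    for ntomp, offsets in ((64, (0, 1, 2, 3, 4, 32, 64)),
                           (32, (0, 1, 2, 32, 64, 96))):
        for x in offsets:
            print(ntomp, x, as_ranges(pinned_os_processors(x, ntomp)))
```

Under this reading, -ntomp 32 with offsets 0 and 64 would land on [0;32) and [32;64), which is also what the list above reports for those two values, so that pair may be worth trying for two simultaneous runs.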

Could you please list the exact commands you are running, and how you check which cores are being used?

Hi,
I use htop to check which threads are being used.
These are my commands:

# with 64 threads per gmx process
gmx mdrun -deffnm md -v -gpu_id 1 -ntmpi 1 -ntomp 64 -pinoffset X -pin on
# with 32 threads per gmx process
gmx mdrun -deffnm md -v -gpu_id 1 -ntmpi 1 -ntomp 32 -pinoffset X -pin on

The X and the corresponding used threads follow the description in my previous reply.
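
As a cross-check on the htop reading, the affinity mask of every thread of a running gmx process can be read directly from /proc. htop shows where load appears, while this shows where each thread is allowed to run. The sketch is Linux-only, the function names are hypothetical, and the PID has to be looked up separately (for example with pgrep).

```python
import os


def parse_cpu_list(text):
    """Expand a Linux CPU list such as '0-3,64' into [0, 1, 2, 3, 64]."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue
        lo, _, hi = part.partition("-")
        cpus.extend(range(int(lo), int(hi or lo) + 1))
    return cpus


def thread_affinities(pid):
    """Map thread id -> list of allowed CPUs for every thread of `pid`,
    read from the Cpus_allowed_list field of /proc/<pid>/task/*/status."""
    result = {}
    task_dir = f"/proc/{pid}/task"
    for tid in os.listdir(task_dir):
        with open(f"{task_dir}/{tid}/status") as fh:
            for line in fh:
                if line.startswith("Cpus_allowed_list:"):
                    result[int(tid)] = parse_cpu_list(line.split(":", 1)[1])
    return result


if __name__ == "__main__":
    # Inspect this Python process as a stand-in for a gmx PID.
    for tid, cpus in sorted(thread_affinities(os.getpid()).items()):
        print(tid, cpus)
```

With -pin on, each OpenMP thread would be expected to show a single CPU here; without pinning, the list would typically span all processors.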

Could it be that the GROMACS version distributed via Anaconda cannot identify the IDs of the CPU threads correctly?