Scaling problem for very large systems

GROMACS version: 2024.2
GROMACS modification: No
I’m currently benchmarking GROMACS against my own code (which operates in a significantly different way) for large-scale particle simulation. I’m comparing an atomistic representation, a coarse-grained representation, and my method for different-sized systems of small carbon nanotubes randomly distributed at the same concentration. GROMACS appears to scale well, approximately linearly, until I reach systems with several million atoms, where I see a drastic slowdown. For example, when I increase the number of atoms from 480,000 to 4,800,000 in the atomistic model, the run takes ~250x longer rather than the expected ~10x, and I’m not sure why.

I’m running the systems with nonbonded interactions on an RX6800XT, and while the reported utilization is 100%, the power draw drops considerably for these bigger systems. I initially suspected neighbor searching to be the problem, but looking at CPU activity this does not appear to be the case, as the bursts of activity on the CPU are relatively short.
Currently my only hypothesis involves the 128 MB ‘Infinity Cache’ on this GPU: the larger systems somehow may not fit it well. However, I would not expect such a huge performance degradation, since the cache is only marketed as an ‘up to 3.25x’ bandwidth improvement.

Is this a known issue? What could be the cause, and are there any solutions other than running on different hardware?

Hi!

The behavior you’re describing is indeed strange (I have an RX6400 at hand, which cannot fit 6M atoms, but in the 96k–3M atom range the performance behaves as expected). Could you share the md.log files for the 480k and 4800k systems you mentioned as having the ~250x performance difference?

Instead of looking at the reported CPU/GPU utilization, it is usually much better to check the performance counters reported near the end of the md.log file. For example, the log directly reports how much time is spent in Neighbor search (which is rightfully a suspect), so there is no need to eyeball it from htop :) The logs will also allow checking whether there’s anything suspicious or suboptimal in how GROMACS is built and run, or in the hardware detection.
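If it helps, here is a rough sketch (in Python) of pulling one counter’s wall-time percentage out of that table. The row/column layout is my assumption and can differ between GROMACS versions, so treat it as illustrative only:

```python
def counter_percent(log_text, name):
    """Return the last numeric field (assumed to be the '%' column) of the
    first timing-table row starting with `name`. The layout is an
    assumption; adjust the parsing to your version's md.log."""
    for line in log_text.splitlines():
        if line.strip().startswith(name):
            fields = line.split()
            try:
                return float(fields[-1])
            except ValueError:
                pass
    return None

# Toy rows mimicking the table near the end of md.log
sample = """\
 Neighbor search        1   10    101     1.234     12.3    2.5
 Wait GPU state copy    1   10   1001    46.789    467.8   94.7
"""
print(counter_percent(sample, "Wait GPU state copy"))  # → 94.7
```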

It could be that cache hits go down massively. Could you run with the environment variable GMX_DD_SINGLE_RANK set to force particle reordering and report if that changes the performance?


It appears the same, although I’m pretty sure I was running on one rank before too.

Here’s the md.log file of one of the offending simulations.
md.log (20.8 KB)
Clearly the item Wait GPU state copy is the culprit, taking 94.7% of wall time.


These are the scaling issues I observe.

All time in wait on GPU simply reflects that the CPU does nothing most of the time and is waiting for the GPU (we can’t report timings on the GPU). This is what you want.

My guess is that the performance goes down as the memory footprint gets larger and pushes data further away from the GPU.

As noted by Berk, the high GPU wait time is the desired behavior here. And you were right, the neighbor search is not the problem.

But I see some curious stuff in your logs:

  1. All the energies, pressures, and temperatures in the log are exactly zero. That is quite unusual. Is that expected for your forcefield?
  2. If you run the smaller simulations for 1000 steps too, then, at 1 ms/step, that corresponds to ~1 second of total run time, and things get much slower for runs > 10 seconds. There could be thermal throttling for the longer runs. That would not fully explain the 250x slowdown (unless the card’s cooling is woefully bad), but it fits your power-use observations. Have you looked at the temperature and clocks (SCLK, MCLK in rocm-smi) of the card during these longer runs?

The LLC / “Infinity Cache” on the card is, as Gusten said, 128MB. That is around the total size for F + XYZQ arrays for a 4M particle system. Not much left for the neighborlist, but the zeroes in energies as well as low time in Neighbor search despite having nstlist = 10 suggest that it won’t need a lot of space. So, it is in the ballpark.
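A quick back-of-the-envelope check of that ballpark (single precision, 3 floats/atom for F and 4 for XYZQ; these per-atom sizes are my assumption, not numbers read out of GROMACS):

```python
# Back-of-the-envelope estimate, single precision (4 bytes/float):
# forces F = 3 floats/atom, packed coordinates+charge XYZQ = 4 floats/atom
n_atoms = 4_000_000
footprint_mb = n_atoms * (3 + 4) * 4 / 1024**2
print(round(footprint_mb, 1))  # → 106.8 (vs. the 128 MB LLC)
```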

But the 2 seconds/step we see in the log file seems too much for an 11M-atom system. An RX6400 with 16 MB of LLC does 80 ms/step on a 1.5M system (a grappa box, so very different composition; larger systems don’t fit in the GPU memory, but either way we’re far from fitting into the cache), and the RX6800 has 5x more compute units and 4x the GDDR bandwidth.
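For what it’s worth, a naive linear extrapolation from that RX6400 data point (ignoring cache effects and composition differences entirely) suggests something on the order of a hundred ms/step, not two seconds:

```python
# Naive compute-bound extrapolation from the RX6400 measurement
rx6400_ms_per_step = 80.0       # measured: 1.5M-atom grappa box on RX6400
atoms_ratio = 11e6 / 1.5e6      # 11M-atom target vs 1.5M reference
cu_ratio = 5.0                  # RX6800 has ~5x the compute units
expected_ms = rx6400_ms_per_step * atoms_ratio / cu_ratio
print(round(expected_ms))  # → 117
```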

The strange energies are intentional: all the interactions are zero because I want exactly the same coordinates at every step when comparing methods.

There aren’t any major throttling issues here, I’ve checked with longer runs for the smaller systems, and the thermals in my machine are excellent.

Perhaps the systems I use cause an issue somehow, but it appears I can’t attach coordinate files (.gro). The systems are relatively sparse, with each particle represented by a relatively dense cluster of atoms. Here are the files I can attach:

test.mdp (576 Bytes)
cg.1000000.top (75 Bytes)

Edit: Is there any way I could send the other relevant files to you (coordinates, forcefield, etc.)? I assume it would be a lot easier to analyze if you could run it yourselves.

The top file doesn’t say anything.

Note that if all non-bonded parameters are zero the pairlist will only contain small diagonal blocks.
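To illustrate the idea (a conceptual sketch, not GROMACS’s actual cluster-pair code): if no atom pair can interact, only the diagonal blocks, which are kept to handle exclusions, survive pruning:

```python
def prune_pairs(n_clusters, interacts):
    """Keep a cluster pair (i, j) only if i == j (diagonal blocks are
    retained for exclusion handling) or some interaction is possible."""
    return [(i, j) for i in range(n_clusters)
                   for j in range(i, n_clusters)
                   if i == j or interacts(i, j)]

all_zero = lambda i, j: False           # every nonbonded parameter is zero
print(len(prune_pairs(4, all_zero)))    # → 4 (diagonal blocks only)
```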

So the pair list is filtered based on the interaction parameters as well? Is there any way to get around this for testing? I just want to benchmark nonbonded interaction performance for a given particle configuration and repeat it a number of times to get decent statistics. My first idea was to set the timestep to 0, but GROMACS doesn’t allow that.

You can freeze all the particles.
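For reference, a minimal sketch of the relevant .mdp lines, assuming the built-in System group covers all your particles:

```
freezegrps = System
freezedim  = Y Y Y
```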

Ah, of course. It’s obvious now that you mention it. I find that GROMACS is a bit tricky to benchmark like this because I don’t know what might be optimized away.
I will test it and see.

Freezing the atoms forces GROMACS to do the update on the CPU, which should slow things down, but for whatever reason the performance scaling seems more reasonable now. The largest system is about 10x faster now.

Yes, frozen atoms are not supported with GPU update :( But good to know that the issues with the force calculation are now gone.

Perhaps you can just set a very small timestep and a very small force constant? The particles would not stay exactly where they are, but it’s highly unlikely that them moving in, say, 1/1000 of a normal timestep (whatever it is for your atomistic/CG forcefield) would change the configuration enough to alter the performance.
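As a rough sanity check on that (with an assumed thermal speed for a heavy bead), the per-step displacement at 1/1000 of a typical 2 fs timestep is vanishingly small:

```python
v_thermal = 0.5                 # nm/ps, rough thermal speed (assumption)
dt = 0.002 / 1000               # ps: 1/1000 of a typical 2 fs timestep
disp = v_thermal * dt           # nm moved per step, order of magnitude
print(f"{disp:.1e} nm")  # → 1.0e-06 nm
```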

If it’s any consolation, when benchmarking during development we don’t usually bother with such things, and just use “production” settings for the forcefield, timestep, etc.

If update on the CPU makes things much faster, that could be because less memory is required on the GPU when masses, velocities and likely an extra coordinate vector are not needed.

If you want to update on the GPU, you could also set the masses to 1e12.
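For example, a hypothetical [ atoms ] line in the topology with the mass bumped up (the exact columns and names depend on your forcefield files):

```
[ atoms ]
;  nr  type  resnr  residue  atom  cgnr  charge    mass
    1   C      1     CNT      C1    1    0.000    1.0e12
```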

PS: getting the pairlist to not skip non-interacting atoms is a single-line change, if you need that.