Reducing "Wait GPU state copy" for single GPU runs

Nothing abnormal here. Since you offload all force computation from the CPU, the CPU has no useful work to do: it enqueues GPU work for a sequence of steps and then, once it needs results (e.g. for pair search or I/O), it waits for the GPU to complete that work. That wait is the wall-time measured in the above counter.

That is unfortunately a limitation of the current GPU-resident parallelization. The only thing you could do is to consider switching to a supported thermostat.

Based on your log, there is not a lot of performance left on the table, but you could try a few tweaks:

  • increase nstlist to reduce the search time
  • move the bonded interactions back to the CPU (-bonded cpu option); the 16 CPU cores may be fast enough to give a slight benefit
  • if you care about throughput, run two simulations on the same GPU
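
The tweaks above could be tried with something like the following. This is only a rough sketch: the helper mdrun_cmd, the run names (md, sim_a, sim_b) and the nstlist value of 200 are made up for illustration, and the 8 threads per run with pin offsets 0 and 8 assume 16 hardware threads without SMT. The flags themselves (-nstlist, -bonded, -gpu_id, -ntmpi, -ntomp, -pin, -pinoffset) are regular gmx mdrun options; add them on top of whatever offload flags you already use.

```python
import shlex


def mdrun_cmd(deffnm, nstlist=None, bonded=None, gpu_id=None,
              ntomp=None, pinoffset=None):
    """Build a gmx mdrun argument list (hypothetical helper).

    Only the options that are explicitly set are appended, so this can be
    combined with whatever offload flags the run already uses.
    """
    cmd = ["gmx", "mdrun", "-deffnm", deffnm]
    if nstlist is not None:
        cmd += ["-nstlist", str(nstlist)]   # longer pair-search interval
    if bonded is not None:
        cmd += ["-bonded", bonded]          # "cpu" moves bondeds off the GPU
    if gpu_id is not None:
        cmd += ["-gpu_id", str(gpu_id)]     # which GPU to use
    if ntomp is not None:
        cmd += ["-ntmpi", "1", "-ntomp", str(ntomp)]
    if pinoffset is not None:
        cmd += ["-pin", "on", "-pinoffset", str(pinoffset)]
    return cmd


# Single run: larger nstlist (200 is an illustrative value) and bondeds on the CPU.
single = mdrun_cmd("md", nstlist=200, bonded="cpu")

# Two runs sharing GPU 0, each pinned to its own half of the cores
# (assumes 16 hardware threads; adjust the offsets if SMT is enabled).
pair = [
    mdrun_cmd("sim_a", gpu_id=0, ntomp=8, pinoffset=0),
    mdrun_cmd("sim_b", gpu_id=0, ntomp=8, pinoffset=8),
]

if __name__ == "__main__":
    for cmd in [single] + pair:
        print(shlex.join(cmd))
```

For the two-simulation case, start both commands at the same time from separate directories and compare the combined ns/day against the single-run number. It is worth testing the nstlist and -bonded cpu changes one at a time, since either may or may not help on your hardware.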