Nothing abnormal here: since you offload all force computation from the CPU, the CPU has no useful work to do. It enqueues GPU work for a sequence of steps, up to the point where the CPU needs results (e.g. for pair search or I/O), and then waits for the GPU to complete that work. That wait is the wall-time measured in the above counter.
That is unfortunately a limitation of the current GPU-resident parallelization. The only thing you could do is to consider switching to a supported thermostat.
Based on your log there is not a lot of performance left on the table, but you could try a few tweaks:
- increase `nstlist` to reduce the search time
- move the bonded interactions back to the CPU (`-bonded cpu` option); the 16 CPU cores may be fast enough to give a slight benefit
- if you care about throughput, run two simulations on the same GPU
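
As a rough sketch of how those tweaks could look on the command line, the script below only builds and prints `gmx mdrun` invocations; it does not run anything. The file prefixes (`md`, `md_A`, `md_B`), the `nstlist` value of 200, the even 8/8 thread split, and GPU id 0 are assumptions for illustration, not values taken from your log. If hyperthreading is enabled, the pin offsets will likely need adjusting.

```shell
#!/bin/sh
# Dry run: builds and prints the mdrun command lines; nothing is executed.
NSTLIST=200   # assumed trial value; benchmark a few values on your system
THREADS=8     # assumed: 16 cores split evenly between two runs

# Tweaks 1 and 2: larger nstlist, bonded interactions on the CPU.
SINGLE="gmx mdrun -deffnm md -nstlist $NSTLIST -bonded cpu"

# Tweak 3: two independent runs sharing GPU 0, pinned to disjoint cores.
RUN_A="gmx mdrun -deffnm md_A -ntmpi 1 -ntomp $THREADS -gpu_id 0 -pin on -pinoffset 0"
RUN_B="gmx mdrun -deffnm md_B -ntmpi 1 -ntomp $THREADS -gpu_id 0 -pin on -pinoffset $THREADS"

# Print the commands; the two runs would be started in the background.
printf '%s\n' "$SINGLE" "$RUN_A &" "$RUN_B &" "wait"
```

When running two simulations side by side, compare the combined ns/day of both runs against the single-run number; per-run performance will drop, but the aggregate is what matters for throughput.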