Reducing "Wait GPU state copy" for single GPU runs

GROMACS version: 2020.2-dev-20200430-5e78835-unknown
GROMACS modification: https://catalog.ngc.nvidia.com/orgs/hpc/containers/gromacs

I’m performing protein–bilayer simulations of ~360k atoms on a cluster, running a Docker image of GROMACS 2020.2 from nvcr.io.
Since GPUs are my most limited resource, I’m running on a single NVIDIA Quadro RTX 6000 GPU with 16 cores of an Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz. My performance is currently ~30.7 ns/day (timed here over 500 ps).
This is the command I’m using:

‘gmx mdrun -ntmpi 1 -ntomp 16 -nb gpu -bonded gpu -pme gpu -deffnm run’

I have enabled GMX_GPU_DD_COMMS, GMX_GPU_PME_PP_COMMS and GMX_FORCE_UPDATE_DEFAULT_GPU
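For completeness, these are just environment variables exported before calling mdrun; in my job script they are set roughly like this:

export GMX_GPU_DD_COMMS=true
export GMX_GPU_PME_PP_COMMS=true
export GMX_FORCE_UPDATE_DEFAULT_GPU=true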

Two things in particular stand out to me in the log file:

‘Wait GPU state copy 1 16 237500 1027.031 41081.436 73.2’

For some reason the ‘-update gpu’ option does not support the Nose-Hoover thermostat:

‘Nose-Hoover temperature coupling is not supported.’
‘Will use CPU version of update.’

Included is a shortened version of my log file.
I’d be thrilled if someone could point me in the right direction for teasing better performance out of my resources. I’ve tried multiple combinations of thread-MPI and OpenMP threads, and thus far 1 : 16 gave the best performance.
Thank you in advance!


GROMACS:      gmx mdrun, version 2020.2-dev-20200430-5e78835-unknown
Executable:   /usr/local/gromacs/sm70/bin/gmx
Data prefix:  /usr/local/gromacs/sm70
Command line:
  gmx mdrun -ntmpi 1 -ntomp 16 -nb gpu -bonded gpu -pme gpu -deffnm run

GROMACS version:    2020.2-dev-20200430-5e78835-unknown
GIT SHA1 hash:      5e788350ad75c15ba91d2ba02779f1f8200f61ee
Branched from:      unknown
Precision:          single
Memory model:       64 bit
MPI library:        thread_mpi
OpenMP support:     enabled (GMX_OPENMP_MAX_THREADS = 64)
GPU support:        CUDA
SIMD instructions:  AVX2_256
FFT library:        fftw-3.3.8
RDTSCP usage:       enabled
TNG support:        enabled
Hwloc support:      disabled
Tracing support:    disabled
C compiler:         /usr/bin/gcc GNU 8.4.0
C compiler flags:   -mavx2 -mfma -Wall -Wno-unused -Wunused-value -Wunused-parameter -Wextra -Wno-missing-field-initializers -Wno-sign-compare -Wpointer-arith -Wundef -Werror=stringop-truncation -fexcess-precision=fast -funroll-all-loops -Wno-array-bounds -mtune=generic -march=x86-64 -O2 -pipe -mavx -DNDEBUG
C++ compiler:       /usr/bin/g++ GNU 8.4.0
C++ compiler flags: -mavx2 -mfma -Wall -Wextra -Wno-missing-field-initializers -Wpointer-arith -Wmissing-declarations -Wundef -Wstringop-truncation -fexcess-precision=fast -funroll-all-loops -Wno-array-bounds -fopenmp -mtune=generic -march=x86-64 -O2 -pipe -mavx -DNDEBUG
CUDA compiler:      /usr/local/cuda/bin/nvcc nvcc: NVIDIA (R) Cuda compiler driver;Copyright (c) 2005-2020 NVIDIA Corporation;Built on Wed_May__6_19:09:25_PDT_2020;Cuda compilation tools, release 11.0, V11.0.167;Build cuda_11.0_bu.TC445_37.28358933_0
CUDA compiler flags:-std=c++14;-gencode;arch=compute_70,code=sm_70;-use_fast_math;-D_FORCE_INLINES;-mavx2 -mfma -Wall -Wextra -Wno-missing-field-initializers -Wpointer-arith -Wmissing-declarations -Wundef -Wstringop-truncation -fexcess-precision=fast -funroll-all-loops -Wno-array-bounds -fopenmp -mtune=generic -march=x86-64 -O2 -pipe -mavx -DNDEBUG
CUDA driver:        11.20
CUDA runtime:       11.0


Running on 1 node with total 40 cores, 80 logical cores, 1 compatible GPU
Hardware detected:
  CPU info:
    Vendor: Intel
    Brand:  Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz
    Family: 6   Model: 85   Stepping: 7
    Features: aes apic avx avx2 avx512f avx512cd avx512bw avx512vl clfsh cmov cx8 cx16 f16c fma htt intel lahf mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd rdtscp sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
    Number of AVX-512 FMA units: 2
  Hardware topology: Basic
    Sockets, cores, and logical processors:
      Socket  0: [   0  40] [   1  41] [   2  42] [   3  43] [   4  44] [   5  45] [   6  46] [   7  47] [   8  48] [   9  49] [  10  50] [  11  51] [  12  52] [  13  53] [  14  54] [  15  55] [  16  56] [  17  57] [  18  58] [  19  59]
      Socket  1: [  20  60] [  21  61] [  22  62] [  23  63] [  24  64] [  25  65] [  26  66] [  27  67] [  28  68] [  29  69] [  30  70] [  31  71] [  32  72] [  33  73] [  34  74] [  35  75] [  36  76] [  37  77] [  38  78] [  39  79]
  GPU info:
    Number of GPUs detected: 1
    #0: NVIDIA Quadro RTX 6000, compute cap.: 7.5, ECC:  no, stat: compatible

Highest SIMD level requested by all nodes in run: AVX_512
SIMD instructions selected at compile time:       AVX2_256
This program was compiled for different hardware than you are running on,
which could influence performance. This build might have been configured on a
login node with only a single AVX-512 FMA unit (in which case AVX2 is faster),
while the node you are running on has dual AVX-512 FMA units.


This run will default to '-update gpu' as requested by the GMX_FORCE_UPDATE_DEFAULT_GPU environment variable. GPU update with domain decomposition lacks substantial testing and should be used with caution.

Enabling GPU buffer operations required by GMX_GPU_DD_COMMS (equivalent with GMX_USE_GPU_BUFFER_OPS=1).

This run uses the 'GPU halo exchange' feature, enabled by the GMX_GPU_DD_COMMS environment variable.

This run uses the 'GPU PME-PP communications' feature, enabled by the GMX_GPU_PME_PP_COMMS environment variable.
Input Parameters:
   integrator                     = md
   tinit                          = 0
   dt                             = 0.002
   nsteps                         = 250000
   init-step                      = 0
   simulation-part                = 1
   comm-mode                      = Linear
   nstcomm                        = 100
   bd-fric                        = 0
   ld-seed                        = 375672286
   emtol                          = 10
   emstep                         = 0.01
   niter                          = 20
   fcstep                         = 0
   nstcgsteep                     = 1000
   nbfgscorr                      = 10
   rtpi                           = 0.05
   nstxout                        = 0
   nstvout                        = 0
   nstfout                        = 0
   nstlog                         = 5000
   nstcalcenergy                  = 100
   nstenergy                      = 5000
   nstxout-compressed             = 5000
   compressed-x-precision         = 1000
   cutoff-scheme                  = Verlet
   nstlist                        = 20
   pbc                            = xyz
   periodic-molecules             = false
   verlet-buffer-tolerance        = 0.005
   rlist                          = 1.212
   coulombtype                    = PME
   coulomb-modifier               = Potential-shift
   rcoulomb-switch                = 0
   rcoulomb                       = 1.2
   epsilon-r                      = 1
   epsilon-rf                     = inf
   vdw-type                       = Cut-off
   vdw-modifier                   = Force-switch
   rvdw-switch                    = 1
   rvdw                           = 1.2
   DispCorr                       = No
   table-extension                = 1
   fourierspacing                 = 0.12
   fourier-nx                     = 128
   fourier-ny                     = 128
   fourier-nz                     = 144
   pme-order                      = 4
   ewald-rtol                     = 1e-05
   ewald-rtol-lj                  = 0.001
   lj-pme-comb-rule               = Geometric
   ewald-geometry                 = 0
   epsilon-surface                = 0
   tcoupl                         = Nose-Hoover
   nsttcouple                     = 20
   nh-chain-length                = 1
   print-nose-hoover-chain-variables = false
   pcoupl                         = Parrinello-Rahman
   pcoupltype                     = Semiisotropic
   nstpcouple                     = 20
   tau-p                          = 5
   compressibility (3x3):
      compressibility[    0]={ 4.50000e-05,  0.00000e+00,  0.00000e+00}
      compressibility[    1]={ 0.00000e+00,  4.50000e-05,  0.00000e+00}
      compressibility[    2]={ 0.00000e+00,  0.00000e+00,  4.50000e-05}
   ref-p (3x3):
      ref-p[    0]={ 1.00000e+00,  0.00000e+00,  0.00000e+00}
      ref-p[    1]={ 0.00000e+00,  1.00000e+00,  0.00000e+00}
      ref-p[    2]={ 0.00000e+00,  0.00000e+00,  1.00000e+00}
   refcoord-scaling               = No
   posres-com (3):
      posres-com[0]= 0.00000e+00
      posres-com[1]= 0.00000e+00
      posres-com[2]= 0.00000e+00
   posres-comB (3):
      posres-comB[0]= 0.00000e+00
      posres-comB[1]= 0.00000e+00
      posres-comB[2]= 0.00000e+00
   QMMM                           = false
   QMconstraints                  = 0
   QMMMscheme                     = 0
   MMChargeScaleFactor            = 1
qm-opts:
   ngQM                           = 0
   constraint-algorithm           = Lincs
   continuation                   = true
   Shake-SOR                      = false
   shake-tol                      = 0.0001
   lincs-order                    = 4
   lincs-iter                     = 1
   lincs-warnangle                = 30
   nwall                          = 0
   wall-type                      = 9-3
   wall-r-linpot                  = -1
   wall-atomtype[0]               = -1
   wall-atomtype[1]               = -1
   wall-density[0]                = 0
   wall-density[1]                = 0
   wall-ewald-zfac                = 3
   pull                           = false
   awh                            = false
   rotation                       = false
   interactiveMD                  = false
   disre                          = No
   disre-weighting                = Conservative
   disre-mixed                    = false
   dr-fc                          = 1000
   dr-tau                         = 0
   nstdisreout                    = 100
   orire-fc                       = 0
   orire-tau                      = 0
   nstorireout                    = 100
   free-energy                    = no
   cos-acceleration               = 0
   deform (3x3):
      deform[    0]={ 0.00000e+00,  0.00000e+00,  0.00000e+00}
      deform[    1]={ 0.00000e+00,  0.00000e+00,  0.00000e+00}
      deform[    2]={ 0.00000e+00,  0.00000e+00,  0.00000e+00}
   simulated-tempering            = false
   swapcoords                     = no
   userint1                       = 0
   userint2                       = 0
   userint3                       = 0
   userint4                       = 0
   userreal1                      = 0
   userreal2                      = 0
   userreal3                      = 0
   userreal4                      = 0
   applied-forces:
     electric-field:
       x:
         E0                       = 0
         omega                    = 0
         t0                       = 0
         sigma                    = 0
       y:
         E0                       = 0
         omega                    = 0
         t0                       = 0
         sigma                    = 0
       z:
         E0                       = 0
         omega                    = 0
         t0                       = 0
         sigma                    = 0
     density-guided-simulation:
       active                     = false
       group                      = protein
       similarity-measure         = inner-product
       atom-spreading-weight      = unity
       force-constant             = 1e+09
       gaussian-transform-spreading-width = 0.2
       gaussian-transform-spreading-range-in-multiples-of-width = 4
       reference-density-filename = reference.mrc
       nst                        = 1
       normalize-densities        = true
       adaptive-force-scaling     = false
       adaptive-force-scaling-time-constant = 4
grpopts:
   nrdf:     62495.2      179918      515571
   ref-t:         300         300         300
   tau-t:           1           1           1
annealing:          No          No          No
annealing-npoints:           0           0           0
   acc:	           0           0           0
   nfreeze:           N           N           N
   energygrp-flags[  0]: 0

Changing nstlist from 20 to 100, rlist from 1.212 to 1.327


Update task on the GPU was required, by the GMX_FORCE_UPDATE_DEFAULT_GPU environment variable, but the following condition(s) were not satisfied:

Nose-Hoover temperature coupling is not supported.

Will use CPU version of update.

1 GPU selected for this run.
Mapping of GPU IDs to the 2 GPU tasks in the 1 rank on this node:
  PP:0,PME:0
PP tasks will do (non-perturbed) short-ranged and most bonded interactions on the GPU
PP task will update and constrain coordinates on the CPU
PME tasks will do all aspects on the GPU
Using 1 MPI thread

Non-default thread affinity set, disabling internal thread affinity

Using 16 OpenMP threads 

System total charge: 0.000
Will do PME sum in reciprocal space for electrostatic interactions.

Using a Gaussian width (1/beta) of 0.384195 nm for Ewald
Potential shift: LJ r^-12: -2.648e-01 r^-6: -5.349e-01, Ewald -8.333e-06
Initialized non-bonded Ewald tables, spacing: 1.02e-03 size: 1176

Generated table with 1163 data points for 1-4 COUL.
Tabscale = 500 points/nm
Generated table with 1163 data points for 1-4 LJ6.
Tabscale = 500 points/nm
Generated table with 1163 data points for 1-4 LJ12.
Tabscale = 500 points/nm

Using GPU 8x8 nonbonded short-range kernels

Using a dual 8x8 pair-list setup updated with dynamic, rolling pruning:
  outer list: updated every 100 steps, buffer 0.127 nm, rlist 1.327 nm
  inner list: updated every  14 steps, buffer 0.001 nm, rlist 1.201 nm
At tolerance 0.005 kJ/mol/ps per atom, equivalent classical 1x1 list would be:
  outer list: updated every 100 steps, buffer 0.284 nm, rlist 1.484 nm
  inner list: updated every  14 steps, buffer 0.058 nm, rlist 1.258 nm

Initializing LINear Constraint Solver

There are: 357992 Atoms
Center of mass motion removal mode is Linear
We have the following groups for center of mass motion removal:
  0:  SOLU_MEMB
  1:  SOLV

Started mdrun on rank 0 Wed Jan 26 14:17:01 2022

           Step           Time
              0        0.00000

   Energies (kJ/mol)
           Bond            U-B    Proper Dih.  Improper Dih.      CMAP Dih.
    5.57838e+04    2.48719e+05    1.84803e+05    4.28059e+03   -2.28378e+03
          LJ-14     Coulomb-14        LJ (SR)   Coulomb (SR)   Coul. recip.
    4.20522e+04    1.34794e+04    2.15573e+05   -4.59834e+06    1.67067e+04
      Potential    Kinetic En.   Total Energy  Conserved En.    Temperature
   -3.81922e+06    9.59816e+05   -2.85941e+06   -2.85920e+06    3.04596e+02
 Pressure (bar)   Constr. rmsd
    3.99005e+02    4.59983e-06

step 1000: timed with pme grid 128 128 144, coulomb cutoff 1.200: 1385.7 M-cycles
step 1200: timed with pme grid 120 120 128, coulomb cutoff 1.340: 1544.8 M-cycles
step 1400: timed with pme grid 108 108 120, coulomb cutoff 1.429: 1688.2 M-cycles
step 1600: timed with pme grid 108 108 128, coulomb cutoff 1.416: 1719.0 M-cycles
step 1800: timed with pme grid 112 112 128, coulomb cutoff 1.365: 1571.8 M-cycles
step 2000: timed with pme grid 120 120 128, coulomb cutoff 1.340: 1541.0 M-cycles
step 2200: timed with pme grid 120 120 144, coulomb cutoff 1.274: 1445.4 M-cycles
step 2400: timed with pme grid 128 128 144, coulomb cutoff 1.200: 1346.2 M-cycles
step 2600: timed with pme grid 120 120 144, coulomb cutoff 1.274: 1444.9 M-cycles
step 2800: timed with pme grid 128 128 144, coulomb cutoff 1.200: 1351.0 M-cycles
              optimal pme grid 128 128 144, coulomb cutoff 1.200
           Step           Time
           5000       10.00000

	<======  ###############  ==>
	<====  A V E R A G E S  ====>
	<==  ###############  ======>

	Statistics over 250001 steps using 2501 frames

   Energies (kJ/mol)
           Bond            U-B    Proper Dih.  Improper Dih.      CMAP Dih.
    5.55737e+04    2.48513e+05    1.83543e+05    4.43680e+03   -2.23988e+03
          LJ-14     Coulomb-14        LJ (SR)   Coulomb (SR)   Coul. recip.
    4.20071e+04    1.50332e+04    2.17391e+05   -4.60268e+06    1.67342e+04
      Potential    Kinetic En.   Total Energy  Conserved En.    Temperature
   -3.82169e+06    9.45369e+05   -2.87632e+06   -2.82994e+06    3.00011e+02
 Pressure (bar)   Constr. rmsd
    1.32668e+00    0.00000e+00

          Box-X          Box-Y          Box-Z
    1.50846e+01    1.30636e+01    1.75994e+01

   Total Virial (kJ/mol)
    3.12620e+05    2.94396e+02   -1.20361e+02
    2.91342e+02    3.12180e+05    2.18304e+02
   -1.25560e+02    2.17851e+02    3.20159e+05

   Pressure (bar)
   -1.65116e+00   -2.03722e+00   -2.01050e+00
   -2.00800e+00    4.38258e+00   -1.36357e+00
   -1.96069e+00   -1.35918e+00    1.24861e+00

         T-SOLU         T-MEMB         T-SOLV
    3.00017e+02    3.00015e+02    3.00009e+02


	M E G A - F L O P S   A C C O U N T I N G

 NB=Group-cutoff nonbonded kernels    NxN=N-by-N cluster Verlet kernels
 RF=Reaction-Field  VdW=Van der Waals  QSTab=quadratic-spline table
 W3=SPC/TIP3p  W4=TIP4p (single or pairs)
 V&F=Potential and force  V=Potential only  F=Force only

 Computing:                               M-Number         M-Flops  % Flops
-----------------------------------------------------------------------------
 Pair Search distance check          110154.048272      991386.434     0.0
 NxN Ewald Elec. + LJ [F]         112862084.144832  8803242563.297    97.9
 NxN Ewald Elec. + LJ [V&F]         1140476.197760   147121429.511     1.6
 1,4 nonbonded interactions           68948.525793     6205367.321     0.1
 Shift-X                                895.337992        5372.028     0.0
 Bonds                                10477.541910      618174.973     0.0
 Propers                              78973.315892    18084889.339     0.2
 Impropers                             1222.004888      254177.017     0.0
 Virial                                4475.820537       80564.770     0.0
 Stop-CM                                895.337992        8953.380     0.0
 Calc-Ekin                             8950.515984      241663.932     0.0
 Lincs                                14729.058916      883743.535     0.0
 Lincs-Mat                            96528.386112      386113.544     0.0
 Constraint-V                         93725.874902      749806.999     0.0
 Constraint-Vir                        3950.140986       94803.384     0.0
 Settle                               21422.585690     6919495.178     0.1
 CMAP                                   386.251545      656627.626     0.0
 Urey-Bradley                         48227.692910     8825667.803     0.1
-----------------------------------------------------------------------------
 Total                                              8995370800.071   100.0
-----------------------------------------------------------------------------


     R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G

On 1 MPI rank, each using 16 OpenMP threads

 Computing:          Num   Num      Call    Wall time         Giga-Cycles
                     Ranks Threads  Count      (s)         total sum    %
-----------------------------------------------------------------------------
 Neighbor search        1   16       2501      53.643       2145.720   3.8
 Launch GPU ops.        1   16     250001      28.859       1154.360   2.1
 Force                  1   16     250001      27.565       1102.619   2.0
 Wait PME GPU gather    1   16     250001       7.569        302.758   0.5
 Wait Bonded GPU        1   16       2501       0.004          0.142   0.0
 Reduce GPU PME F       1   16     250001       1.405         56.206   0.1
 Wait GPU NB local      1   16     237500       7.637        305.477   0.5
 Wait GPU state copy    1   16     237500    1027.031      41081.436  73.2
 NB X/F buffer ops.     1   16     500002       7.404        296.176   0.5
 Write traj.            1   16         52       1.582         63.262   0.1
 Update                 1   16     250001      60.261       2410.469   4.3
 Constraints            1   16     250001     101.647       4065.882   7.2
 Rest                                          78.359       3134.361   5.6
-----------------------------------------------------------------------------
 Total                                       1402.965      56118.869 100.0
-----------------------------------------------------------------------------

               Core t (s)   Wall t (s)        (%)
       Time:    22447.426     1402.965     1600.0
                 (ns/day)    (hour/ns)
Performance:       30.792        0.779
Finished mdrun on rank 0 Wed Jan 26 14:40:25 2022

Nothing abnormal here: since you offload all force computation to the GPU, the CPU has no useful work to do. After enqueuing GPU work for a sequence of steps, it simply waits for the GPU to complete that work until it needs results back (e.g. for pair search or I/O). That waiting is the wall time measured in the counter above.

That is unfortunately a limitation of the current GPU-resident parallelization. The only thing you can do is consider switching to a supported thermostat.

Based on your log there is not a lot of performance left on the table, but you could try a few tweaks (see the example command after the list):

  • increase nstlist to reduce the search time
  • move the bonded interactions back to the CPU (the -bonded cpu option); the 16 CPU cores may be fast enough to give a slight benefit
  • if you care about throughput, run two simulations on the same GPU
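For example (just a sketch; the exact nstlist value and whether -bonded cpu actually helps will need testing on your node), something along these lines:

gmx mdrun -ntmpi 1 -ntomp 16 -nb gpu -bonded cpu -pme gpu -nstlist 200 -deffnm run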

Thank you for the pointers! I did indeed get some minor improvements from increasing ‘nstlist’ to 400, and I was also able to reduce the number of CPUs used per run without much impact on performance.

I tested the v-rescale thermostat, and that alone raises my performance to ~52 ns/day!
I guess the big question is what implications using the v-rescale thermostat would have for my system. I’m using the CHARMM36 force field, and almost all papers I’ve seen prefer Nose-Hoover with CHARMM36. There are, however, several publications using the GROMOS 54A7 force field with the v-rescale thermostat.

  • Does the choice of thermostat depend on the force field? If so, why?

In his tutorial, Justin Lemkul mentions the greater (membrane) fluctuations allowed by Nose-Hoover. Do you think I could get away with using the v-rescale thermostat, given that I’m primarily interested in the dynamics of my protein–ligand system?

There is no specific association between thermostats and force fields or contents of systems. My membrane tutorial was written before V-rescale came out, and the choices were Berendsen (which is demonstrably incorrect) and Nosé-Hoover. The latter was the only appropriate choice at the time.


Thank you for giving that context!
I surmise then that the continued preference for the Nose-Hoover thermostat over v-rescale in some publications is mostly for historical reasons?
In that case, since the v-rescale thermostat produces a correct canonical ensemble, I will gladly switch to it for the massive increase in performance!
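For anyone finding this thread later, the change in my mdp is just the tcoupl line; the groups, tau-t, and ref-t below are simply what I already had (and the best tau-t for V-rescale may still need checking):

tcoupl  = V-rescale
tc-grps = SOLU MEMB SOLV
tau-t   = 1.0  1.0  1.0
ref-t   = 300  300  300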

People often follow established protocols simply because there is proof they work. If there are technical limitations like in this case, making the switch is reasonable.

To add to the above, there are aspects of Nose-Hoover that may warrant switching to v-rescale; see e.g. Section 4.4 of Best Practices for Foundations in Molecular Simulations [Article v1.0].

One more thing to note related to performance/efficiency: make sure that computations like energy calculation, temperature coupling, and pressure coupling happen as frequently as necessary but not more often, i.e. make sure you don’t have something like nstenergy=10 left over from some legacy mdp file, as its cost can be much higher in GPU-resident runs. Also make sure you don’t have leftover nstlist values in your mdp file; nstlist has long been a free parameter (automatically bumped at runtime), but IIRC some other nst* parameters may inherit their values from an initial nstlist entry in the mdp.
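As a rough illustration of what I mean (the values are only an example, not a recommendation tuned for your system):

nstenergy     = 5000   ; write energies only as often as you actually need them
nstcalcenergy = 100    ; the default; don't make this smaller than necessary
nstlist       = 100    ; or omit it entirely; mdrun bumps it at runtime anyway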


I read the paper, but which velocity rescaling are they talking about? If it's simple velocity rescaling, it's not recommended; if it's the Bussi scheme, it seems OK. So which one is implemented in GROMACS for GPU?

That's a well-structured, well-written article! Thanks.
I’m currently writing output to my .xtc (coordinates), .edr, and .log files every 5000 steps / 10 ps. I don’t really want to save my output any less often than that.
Should I change nstcalcenergy from its default value of 100 to match nstenergy?

I set nstlist to 400. Would it be better not to define it at all? I was under the impression that nstlist defines a minimum.
Which parameters does GROMACS scale dynamically if they are not set explicitly in the .mdp file?
Does, for example, setting nstcomm = 100 (its default value) change anything?

https://manual.gromacs.org/documentation/2020/reference-manual/algorithms/molecular-dynamics.html#temperature-coupling
In GROMACS, v-rescale refers to the Bussi (2007) stochastic velocity-rescaling algorithm.

Thanks for that. It's really good to revisit all these basic fundamentals.

Dear sir, can you tell us what the best thermostat with GPU support is for production MD runs? Is V-rescale good? I have mostly seen Nose-Hoover being used for production MD runs.