I have a small test system (~17K atoms) that contains nothing special: a solid membrane with a hole (membrane position-restrained at the edges), a string of periodic DNA, water, and some ions. The machine is 44-core E5 + 4 Titan XP GPUs.
mdrun -nt 24 -ntmpi 4 -npme 1 -pme gpu -gputasks 0123
version 2020.5: 295 ns/day (1 fs timestep)
version 2021 (installed today; it built fine, but failed two of the tests at the end): 277 ns/day
Can post the complete mdp or link to the whole input package, if need be.
Any comments on the performance drop?
Does this system really run faster using 4 GPUs instead of 1? I would think this would run much faster on a single GPU.
If you want to know where the difference in performance comes from, do a diff on the md.log files to see if there are differences in the task assignments. If not, have a look at the timings of the different components of mdrun at the end of the log file to see what got slower.
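For example (with hypothetical file names):
diff md_2020.log md_2021.log
The task assignment is printed near the top of each log, and the per-component timings are in the "R E A L  C Y C L E  A N D  T I M E  A C C O U N T I N G" table near the end.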
Hm, that’s an interesting point – thanks! Let me check this before going into what is taking longer, because the workflow script probably inherited that 0123 thing from a bigger system. :)
I did some testing, and it seems whatever I was running had a close-to-optimal config for the machine I've got here. A slightly better result (5-6%) was achieved with 4 threads per GPU task, each on a separate GPU (i.e., -nt 16 -gputasks 0123). Running multiple GPU tasks on the same card always results in a significant performance loss here.
Additional comments will be appreciated, of course.
Have you checked if the task assignment is the same?
And if not, which timings in the table at the end of the log file explain the difference?
You can also send two log files so I can have a look, although I think you can’t attach log files on this forum.
Yes, the task assignment is exactly the same. I upgraded to the latest version and here’s my attempt at a copy/paste:
************************* OLD *************************************************************
M E G A - F L O P S A C C O U N T I N G
NB=Group-cutoff nonbonded kernels NxN=N-by-N cluster Verlet kernels
RF=Reaction-Field VdW=Van der Waals QSTab=quadratic-spline table
W3=SPC/TIP3p W4=TIP4p (single or pairs)
V&F=Potential and force V=Potential only F=Force only
Computing: M-Number M-Flops % Flops
-----------------------------------------------------------------------------
Pair Search distance check 9963750.134848 89673751.214 0.0
NxN Ewald Elec. + LJ [F] 6683855869.228032 441134487369.050 98.2
NxN Ewald Elec. + LJ [V&F] 67513722.744832 7223968333.697 1.6
1,4 nonbonded interactions 991500.003966 89235000.357 0.0
Reset In Box 42447.500000 127342.500 0.0
CG-CoM 42447.516979 127342.551 0.0
Bonds 378000.001512 22302000.089 0.0
Angles 825750.003303 138726000.555 0.0
RB-Dihedrals 400500.001602 98923500.396 0.0
Pos. Restr. 14000.000056 700000.003 0.0
Virial 42785.017114 770130.308 0.0
Stop-CM 84.911979 849.120 0.0
Calc-Ekin 424475.033958 11460825.917 0.0
Constraint-V 2955000.011820 23640000.095 0.0
Constraint-Vir 29550.011820 709200.284 0.0
Settle 985000.003940 318155001.273 0.1
Virtual Site 3 994850.007880 36809450.292 0.0
-----------------------------------------------------------------------------
Total 449189816097.698 100.0
-----------------------------------------------------------------------------
D O M A I N D E C O M P O S I T I O N S T A T I S T I C S
av. #atoms communicated per step for force: 2 x 7385.2
Dynamic load balancing report:
DLB was off during the run due to low measured imbalance.
Average load imbalance: 2.3%.
The balanceable part of the MD step is 71%, load imbalance is computed from this.
Part of the total run time spent waiting due to load imbalance: 1.7%.
Average PME mesh/force load: 1.107
Part of the total run time spent waiting due to PP/PME imbalance: 3.0 %
R E A L C Y C L E A N D T I M E A C C O U N T I N G
On 3 MPI ranks doing PP, each using 6 OpenMP threads, and
on 1 MPI rank doing PME, using 6 OpenMP threads
Computing: Num Num Call Wall time Giga-Cycles
Ranks Threads Count (s) total sum %
-----------------------------------------------------------------------------
Domain decomp. 3 6 2500000 2780.735 110116.978 2.9
DD comm. load 3 6 250002 2.637 104.429 0.0
Vsite constr. 3 6 250000001 1594.487 63141.590 1.6
Send X to PME 3 6 250000001 7890.027 312444.658 8.1
Neighbor search 3 6 2500001 2839.201 112432.214 2.9
Launch GPU ops. 3 6 500000002 14449.480 572198.659 14.9
Comm. coord. 3 6 247500000 6515.707 258021.649 6.7
Force 3 6 250000001 8311.497 329134.841 8.5
Wait + Comm. F 3 6 250000001 6385.613 252869.936 6.6
PME mesh * 1 6 250000001 30494.860 402531.618 10.5
PME wait for PP * 42454.411 560397.479 14.5
Wait + Recv. PME F 3 6 250000001 4527.057 179271.209 4.7
Wait PME GPU gather 3 6 250000001 5774.959 228688.087 5.9
Wait GPU NB nonloc. 3 6 250000001 2991.101 118447.460 3.1
Wait GPU NB local 3 6 250000001 516.722 20462.167 0.5
NB X/F buffer ops. 3 6 995000002 6551.121 259424.037 6.7
Vsite spread 3 6 252500002 1979.413 78384.642 2.0
Write traj. 3 6 50081 61.881 2450.483 0.1
Update 3 6 250000001 2936.818 116297.857 3.0
Constraints 3 6 250000001 2539.319 100556.880 2.6
Comm. energies 3 6 12500001 324.485 12849.594 0.3
-----------------------------------------------------------------------------
Total 72949.272 3851716.415 100.0
-----------------------------------------------------------------------------
(*) Note that with separate PME ranks, the walltime column actually sums to
twice the total reported, but the cycle count total and % are correct.
-----------------------------------------------------------------------------
Core t (s) Wall t (s) (%)
Time: 1750782.450 72949.272 2400.0
20h15:49
(ns/day) (hour/ns)
Performance: 592.192 0.041
Finished mdrun on rank 0 Sun Aug 1 17:55:16 2021
************************* NEW *************************************************************
M E G A - F L O P S A C C O U N T I N G
NB=Group-cutoff nonbonded kernels NxN=N-by-N cluster Verlet kernels
RF=Reaction-Field VdW=Van der Waals QSTab=quadratic-spline table
W3=SPC/TIP3p W4=TIP4p (single or pairs)
V&F=Potential and force V=Potential only F=Force only
Computing: M-Number M-Flops % Flops
-----------------------------------------------------------------------------
Pair Search distance check 99678.650160 897107.851 0.0
NxN Ewald Elec. + LJ [F] 66744762.106176 4405154299.008 98.2
NxN Ewald Elec. + LJ [V&F] 674216.459264 72141161.141 1.6
1,4 nonbonded interactions 9915.003966 892350.357 0.0
Reset In Box 424.475000 1273.425 0.0
CG-CoM 424.491979 1273.476 0.0
Bonds 3780.001512 223020.089 0.0
Angles 8257.503303 1387260.555 0.0
RB-Dihedrals 4005.001602 989235.396 0.0
Pos. Restr. 140.000056 7000.003 0.0
Virial 427.867114 7701.608 0.0
Stop-CM 0.865929 8.659 0.0
Calc-Ekin 8489.533958 229217.417 0.0
Constraint-V 29550.011820 265950.106 0.0
Constraint-Vir 295.511820 7092.284 0.0
Settle 9850.003940 3644501.458 0.1
Virtual Site 3 9948.507880 368094.792 0.0
-----------------------------------------------------------------------------
Total 4486216547.624 100.0
-----------------------------------------------------------------------------
D O M A I N D E C O M P O S I T I O N S T A T I S T I C S
av. #atoms communicated per step for force: 2 x 7360.2
Dynamic load balancing report:
DLB was off during the run due to low measured imbalance.
Average load imbalance: 1.5%.
The balanceable part of the MD step is 70%, load imbalance is computed from this.
Part of the total run time spent waiting due to load imbalance: 1.0%.
Average PME mesh/force load: 1.165
Part of the total run time spent waiting due to PP/PME imbalance: 4.5 %
R E A L C Y C L E A N D T I M E A C C O U N T I N G
On 3 MPI ranks doing PP, each using 6 OpenMP threads, and
on 1 MPI rank doing PME, using 6 OpenMP threads
Computing: Num Num Call Wall time Giga-Cycles
Ranks Threads Count (s) total sum %
-----------------------------------------------------------------------------
Domain decomp. 3 6 25000 30.213 1196.446 3.0
DD comm. load 3 6 2502 0.011 0.449 0.0
Vsite constr. 3 6 2500001 20.233 801.206 2.0
Send X to PME 3 6 2500001 63.151 2500.764 6.4
Neighbor search 3 6 25001 30.360 1202.243 3.1
Launch GPU ops. 3 6 5000002 141.350 5597.448 14.3
Comm. coord. 3 6 2475000 63.092 2498.435 6.4
Force 3 6 2500001 85.594 3389.525 8.6
Wait + Comm. F 3 6 2500001 60.124 2380.913 6.1
PME mesh * 1 6 2500001 316.233 4174.272 10.6
PME wait for PP * 427.651 5644.985 14.4
Wait + Recv. PME F 3 6 2500001 54.001 2138.427 5.4
Wait PME GPU gather 3 6 2500001 62.879 2489.993 6.3
Wait GPU NB nonloc. 3 6 2500001 20.148 797.841 2.0
Wait GPU NB local 3 6 2500001 4.894 193.805 0.5
NB X/F buffer ops. 3 6 9950002 74.722 2958.973 7.5
Vsite spread 3 6 2525002 20.239 801.469 2.0
Write traj. 3 6 501 0.579 22.928 0.1
Update 3 6 2500001 35.212 1394.384 3.6
Constraints 3 6 2500001 27.844 1102.605 2.8
Comm. energies 3 6 250001 3.313 131.201 0.3
-----------------------------------------------------------------------------
Total 743.884 39277.038 100.0
-----------------------------------------------------------------------------
(*) Note that with separate PME ranks, the walltime column actually sums to
twice the total reported, but the cycle count total and % are correct.
-----------------------------------------------------------------------------
Core t (s) Wall t (s) (%)
Time: 17853.170 743.884 2400.0
(ns/day) (hour/ns)
Performance: 580.736 0.041
Finished mdrun on rank 0 Sat Dec 25 16:52:13 2021
From this log output it seems like the PME computation runs slower. The PME load is higher than the PP load, so this task mainly determines the performance. I don’t know why this got slower. I don’t think there are significant changes there.
Do you use the same CUDA version for both runs?
Was PME tuning active and if so, do the tuned PME grids differ between the two runs?
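(A quick way to check, assuming the default log name: grep -i 'pme grid' md.log should show the "optimal pme grid" line with the tuned dimensions.)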
PME tuning was on, both grids announced as optimal at 32 32 72.
CUDA version for the old run: 11.10 (runtime driver 10.20)
For the new run: 11.50 (11.50)
These could be due to CUDA compiler/runtime performance differences. I suggest comparing 2020 vs 2021 using the same CUDA version. Can you please share the full log files?
However, I wonder if you really get scaling across four GPUs with such a small system? I could imagine that using a separate PME GPU could improve performance, but decomposing 17k atoms across three GPUs will generally not scale.
Side note: consider trying the direct GPU communication in particular (in the 2021 release you need some extra environment variables, GMX_GPU_DD_COMMS / GMX_GPU_PME_PP_COMMS; in the upcoming 2022 release you won't need those anymore).
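For example, something like this (these are runtime environment variables, not build options; the mdrun flags below are just your original line):
GMX_GPU_DD_COMMS=true GMX_GPU_PME_PP_COMMS=true gmx mdrun -nt 24 -ntmpi 4 -npme 1 -pme gpu -gputasks 0123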
I probably won’t be able to test a “new” build of the old gmx version, but yes, I absolutely get distinct scaling across GPUs. I am however happy to rebuild the latest version with the additional variables set. What should that do in terms of performance?
One other question I've been meaning to ask: does Gromacs still benefit from hyperthreading? The reason is that we have some other software that actually gets hurt when HT is enabled.
If there is a high computational load on the CPU, HT will help. If there is a high memory/cache load, HT might be worse. mdrun will automatically turn off HT when all compute is done on the GPU.
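(To check whether HT is enabled on a node, something like lscpu | grep -i thread works: "Thread(s) per core: 2" means HT is on, 1 means it is off.)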
Compiling the latest GROMACS 2020 is not very difficult, and it would help rule out whether the CUDA version is the culprit rather than the GROMACS release.
Also / alternatively, are you able to compare single GPU performance of the inputs/builds for which you report a regression?
I think you see your runs scale to four GPUs because you're only offloading the nonbonded calculation, but not PME, bonded interactions, or integration/constraints. That also means you will most likely get better performance if you offload those as well, although it's also quite likely that you'd get peak performance on 1-2 GPUs if you offload more (and you'll need the direct-communication support from 2021/2022 for the best scaling).
No, no, I didn't mean that it's hard to build Gromacs. I am happy to rebuild the latest version; I meant your other suggestion to rebuild the old version with the newest CUDA, which I am sure could cause minor differences in performance.
About offloading PME: in the starting post I gave the actual mdrun line, which contains “-npme 1 -pme gpu”. Is this incorrect?
I did mean compiling the latest 2020, that is 2020.6, with the latest CUDA because earlier you wrote:
Assuming by "old run" you meant GROMACS 2020, and by "new run" 2021, you were comparing different versions of the code using different versions of the CUDA compiler and runtime.
Yes: if you wanted to offload PME and use a separate rank for that, it is correct (-pme gpu does the former).
The earlier performance numbers you posted were about 280-300 ns/day, but now you show log output that indicates 580-590 ns/day. These runs are not the same, and the differences are also smaller (only ~2% in the latter runs).
Now I am confused… Yes as in it's incorrect, or yes as in it's correct? :)
In other words, if I want to properly offload PME to GPU, should I have ‘-pme gpu’ without setting the number of ranks to 1?
Sorry about this confusion. The first post was about an NPT relaxation with a 1 fs timestep. When responding to Berk, I used the production NVT performance with a 2 fs timestep, hence the jump in ns/day. The systems are identical and so are the mdrun lines.
If you intended to use a separate GPU for PME, it is correct. GROMACS will by design use a single GPU per MPI rank, so if you want to use multiple GPUs, you have two means to parallelize:
assign PME to a separate rank (or ranks) and dedicate a GPU to it (or to each);
use domain decomposition, which maps the computation in each domain to an MPI rank, each of which can be assigned a different GPU.
If you write:
-ntmpi 2 -pme gpu implies an automatic count for -npme, which is by default zero, hence you would be requesting PME decomposition across the 2 ranks. This is not supported (it will have partial support with lib-MPI in the 2022 release).
-ntmpi 2 -npme 1 -pme gpu implies 1 PP rank (which does the particle work and integration) and one separate PME rank.
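For instance, your original line (mdrun -nt 24 -ntmpi 4 -npme 1 -pme gpu -gputasks 0123) follows the second pattern combined with a separate PME rank: three PP ranks doing domain decomposition plus one PME rank, each mapped to its own GPU.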
I hope that clarifies things!
OK, but the performance gap between the two 1 fs runs is still greater than between the 2 fs runs, so I am not sure there is an actual regression, nor that it is unrelated to the different CUDA versions.
Yes, but not completely, because I have zero idea what an "optimal" number of PME ranks would be. For instance, let's consider the system I already have. My mdrun command is currently: mdrun -nt 24 -ntmpi 4 -npme 1 -pme gpu -gputasks 0123
and my understanding of your comment is that -npme could generally be N (between 1 and 3), in which case 4 - N ranks will be allotted to PP/integration. In other words, if I set mdrun -nt 24 -ntmpi 4 -npme 2 -pme gpu
I will get two ranks for PP and two for PME (instead of 3 and 1 with the original line). Now, here is my question: off the bat, how would you set the numbers in this case? Are there any hints for optimality? Basically, I have zero feel for what the relative computational "weights" of PP and PME are. Hence, completely blindly, knowing only the system size (say, 20K particles) and the hardware I described, what would be your first attempt at the mdrun line?
My first post was about v. 2021 (not 2021.x). My response to Berk with the 2 fs timestep was on the latest version I could find and build at that moment (2021.4, I think). I think the initial question I posted is more or less moot, given the CUDA driver versions, etc.
However, the other things you are commenting on are of utmost importance to me! :)
It depends on the hardware and the simulation system. For GPU machines, the limitations posed by your use case make it relatively simple: i) the current mdrun capabilities (PME decomposition is not supported on GPUs until the next release); ii) the extremely small input won't scale anyway.
Let's make things concrete: using the 2021 code, one of the following two will likely be fastest:
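Something along these lines (a sketch; -bonded gpu and -update gpu move the remaining work onto the GPU, and the exact flags may need adjusting for your box):
gmx mdrun -ntmpi 1 -nb gpu -pme gpu -bonded gpu -update gpu
gmx mdrun -ntmpi 2 -npme 1 -nb gpu -pme gpu -bonded gpu -update gpu -gputasks 01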
These will run on one and two GPUs, respectively, and I think you can't get better performance on more GPUs with 20k atoms, but I'm curious to hear your feedback.
This one fails with:
“Inconsistency in user input: Update task on the GPU was required, but the following conditions were not satisfied: Virtual sites are not supported.”
The vsites complaint is probably because I use TIP4P water. Any suggestions? Thanks!