Questions about RMSD calculations using the medoid of the largest cluster as the reference

GROMACS version: 2020.4
GROMACS modification: No
I’m trying to use gmx cluster to find the medoid of the largest cluster and use it as the reference structure for RMSD calculations. I want to see whether the RMSD still shows a transition near the end of the trajectory or has plateaued, in order to justify the length of a long simulation.
Below are the steps of my method:

  • Use gmx cluster to perform the clustering analysis. For my protein, with a cutoff distance of 0.2 nm, 31 clusters were identified. The largest cluster had 5782 members (72% of the total number of frames) and its medoid was the configuration at 1412.25 ns (as shown in cluster.log).
  • Copy the first PDB frame from clusters.pdb and save it as protein_cluster_medoid.pdb. I assumed that the first PDB frame should correspond to the medoid of the biggest cluster (please let me know if this is wrong).
  • Run gmx rms with protein_cluster_medoid.pdb as the reference structure to calculate the RMSD over the trajectory.

As a result, the RMSD showed no transition, but it reached values of up to 3.5 nm, which is quite large. I assumed the large values arose because a fair number of structures belong to other clusters, although 3.5 nm still seems too large to me. I then plotted the distribution of the RMSD values and became confused: I expected the distribution to be concentrated at low values (piled up on the left), with at least 72% of the data below 0.2 nm, but the figure shows that this is clearly not the case. Is there something about this method that I have misunderstood?
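
The expectation that at least 72% of the frames lie below 0.2 nm can be checked directly from the gmx rms output rather than read off a histogram. The following is a minimal sketch, assuming the output is a standard two-column .xvg file (time, then RMSD in nm). The function names `read_xvg` and `fraction_below` are hypothetical helpers, and the sample numbers are made up for illustration, not taken from my simulation.

```python
import io


def read_xvg(handle):
    """Return (times, values) from an .xvg stream, skipping header lines."""
    times, values = [], []
    for line in handle:
        line = line.strip()
        # '#' lines are comments, '@' lines are plot directives,
        # '&' separates data sets; none of them carry data.
        if not line or line[0] in "#@&":
            continue
        fields = line.split()
        times.append(float(fields[0]))
        values.append(float(fields[1]))
    return times, values


def fraction_below(values, cutoff):
    """Fraction of values strictly below the cutoff."""
    if not values:
        raise ValueError("no data points found")
    return sum(v < cutoff for v in values) / len(values)


# Made-up numbers for illustration only (assumption, not real data).
sample = io.StringIO(
    "# illustrative file\n"
    '@    title "RMSD"\n'
    "0.00 0.15\n"
    "0.25 0.18\n"
    "0.50 0.35\n"
    "0.75 2.10\n"
)
times, rmsd = read_xvg(sample)
frac = fraction_below(rmsd, cutoff=0.2)
print(f"{frac:.0%} of frames are below 0.2 nm")
```

For a real file, the in-memory sample would be replaced by something like `with open("rmsd.xvg") as fh: times, rmsd = read_xvg(fh)`, where `rmsd.xvg` is an assumed output file name. The comparison with the 72% cluster population is only meaningful if gmx rms and gmx cluster used the same atom selection for fitting and for the RMSD calculation.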

Specifically, my questions can be summarized as follows:

  • What is the reason for having such large RMSD values when using the medoid of the largest cluster as the reference?
  • Why does the histogram show that the majority of the samples have an RMSD above 2.0 nm? I expected at least 72% of the data to be below 0.2 nm.
  • Is there another way to justify the simulation length besides looking at the RMSD?