Questions about RMSD calculations using the medoid of the largest cluster as the reference

wehs7661 · August 3, 2021, 8:41am

GROMACS version: 2020.4
GROMACS modification: No
I’m trying to use gmx cluster to find the medoid of the biggest cluster and use it as the reference structure for RMSD calculations. My goal is to check whether the RMSD still shows a transition near the end of the simulation, i.e., whether it has reached a plateau, in order to justify the length of a long simulation.
Below are the steps of my method:

1. Use gmx cluster to perform clustering analysis. For my protein, with a cutoff distance of 0.2 nm, 31 clusters were identified. The biggest cluster had 5782 members (72% of the total number of frames) and its medoid was the configuration at 1412.25 ns (as shown in cluster.log).
2. Copy the first PDB frame from clusters.pdb and save it as protein_cluster_medoid.pdb. I assumed that the first PDB frame corresponds to the medoid of the biggest cluster (please let me know if this is wrong).
3. Use gmx rms with protein_cluster_medoid.pdb as the reference structure to calculate the RMSD.
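
Step 2 above can be sketched in Python rather than done by hand. This is only a minimal sketch: it assumes that clusters.pdb separates its frames with ENDMDL records, and the helper name `first_model` is hypothetical. Whether the first frame really is the medoid of the biggest cluster remains the assumption stated in the step.

```python
def first_model(pdb_text: str) -> str:
    """Return everything up to and including the first ENDMDL record.

    Header lines that precede the first MODEL (e.g. TITLE, CRYST1) are kept.
    If the text has no ENDMDL record, the whole text is returned.
    """
    kept = []
    for line in pdb_text.splitlines():
        kept.append(line)
        if line.startswith("ENDMDL"):
            break
    return "\n".join(kept) + "\n"


# Hypothetical usage with the file names from the steps above:
# with open("clusters.pdb") as src, open("protein_cluster_medoid.pdb", "w") as dst:
#     dst.write(first_model(src.read()))
```

Comparing the time stamp in the extracted frame's title line (if present) with the 1412.25 ns reported in cluster.log would be one way to confirm that the right frame was copied.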
As a result, the RMSD showed no transition, but its values reached up to 3.5 nm, which is quite large. I assumed the large values arose because a fair number of structures belong to other clusters, although 3.5 nm still seems too large to me. I then plotted the distribution of the RMSD values and became confused: I expected the distribution to be concentrated at low values, with at least 72% of the data below 0.2 nm, but the figure shows that this is clearly not the case. Is there something about this method that I misunderstood?
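
The 72% expectation can be checked directly against the gmx rms output instead of being read off a histogram. The following is a minimal sketch, assuming the output is a standard .xvg text file with time in the first column and RMSD (in nm) in the second; the function names and the file name rmsd.xvg are hypothetical.

```python
def read_xvg_column(text: str, column: int = 1) -> list[float]:
    """Parse one numeric column from .xvg text, skipping comment/meta lines."""
    values = []
    for line in text.splitlines():
        line = line.strip()
        # '#' lines are comments, '@' lines are plot directives, '&' ends a data set
        if not line or line[0] in "#@&":
            continue
        values.append(float(line.split()[column]))
    return values


def fraction_below(values: list[float], cutoff: float) -> float:
    """Fraction of values strictly below the cutoff."""
    if not values:
        raise ValueError("no data points found")
    return sum(v < cutoff for v in values) / len(values)


# Hypothetical usage:
# with open("rmsd.xvg") as fh:
#     rmsd = read_xvg_column(fh.read())
# print(fraction_below(rmsd, 0.2))
```

If the printed fraction is far below 0.72, the reference structure, the fitting group, or the clustering criterion is probably not what was assumed in the steps above.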

[Figure: 4eyd_rmsd]

Specifically, my questions can be summarized as follows:

1. What is the reason for such large RMSD values when the medoid of the largest cluster is used as the reference?
2. Why did the histogram show that the majority of the samples had an RMSD of more than 2.0 nm? I expected at least 72% of the data to be below 0.2 nm.
3. Is there another way to justify the simulation length other than looking at the RMSD?
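
One possibility worth ruling out for the first two questions is the clustering criterion itself. The post does not state which -method was passed to gmx cluster; if it was single linkage, a structure joins a cluster when it is within the cutoff of any member, not of the medoid, so members can chain far away from the medoid. The toy sketch below illustrates this with points on a line; the functions are illustrative stand-ins and not the gmx cluster implementation.

```python
def single_linkage_clusters(points: list[float], cutoff: float) -> list[int]:
    """Label 1-D points so that two points share a cluster whenever a chain of
    neighbours, each closer than `cutoff`, connects them."""
    n = len(points)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and abs(points[i] - points[j]) < cutoff:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels


def medoid(points: list[float]) -> int:
    """Index of the point with the smallest summed distance to all points."""
    return min(range(len(points)),
               key=lambda i: sum(abs(points[i] - p) for p in points))


# Toy coordinates (nm): neighbours are 0.15 apart, below the 0.2 cutoff
points = [0.15 * k for k in range(11)]
labels = single_linkage_clusters(points, cutoff=0.2)
center = medoid(points)
max_dist = max(abs(points[center] - p) for p in points)
print(set(labels), center, max_dist)
```

In this toy case all points end up in one cluster even though the farthest member lies well beyond the cutoff from the medoid. A criterion that requires every member to be within the cutoff of the central structure would not allow this, so the method recorded in cluster.log is worth checking before interpreting the histogram.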
