Support of PDBx/mmCIF format

The PDB (Protein Data Bank) was established in 1971. Since then, it has been growing at an increasing rate. As of December 2021, there are 202,467 structures in the PDB archive.

PDB IDs are composed of 4 alphanumeric characters in the format [1-9]([0-9A-Z]){3}. For example, 3HHB and 4HHB are two different deposited entries for hemoglobin.
The PDB file format consists of fixed-column-width record lines.
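As a minimal sketch of the ID format above, the pattern can be checked with a regular expression. The function name is_classic_pdb_id is hypothetical, and the check assumes upper-case input exactly as the pattern is written:

```python
import re

# Pattern quoted in the post: one digit 1-9 followed by three characters 0-9 or A-Z.
PDB_ID_PATTERN = re.compile(r"[1-9][0-9A-Z]{3}")


def is_classic_pdb_id(text):
    """Return True if text is exactly one 4-character PDB ID (hypothetical helper)."""
    return PDB_ID_PATTERN.fullmatch(text) is not None


if __name__ == "__main__":
    for candidate in ("3HHB", "4HHB", "0HHB", "3HH"):
        print(candidate, is_classic_pdb_id(candidate))
```

With 9 choices for the first character and 36 for each of the other three, this pattern allows at most 9 * 36^3 = 419,904 IDs, which is why the ID space is finite.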

There are 2 problems with current deposited structures in the PDB:

  1. The 4-character alphanumeric PDB IDs are running out.
  2. The PDB format has many limitations:
  • Atom serial numbers are limited to 5 digits (systems with more than 99,999 atoms cannot be represented)
  • Chain identifiers are limited to 1 character
  • Coordinates are in (8.3) format, so values below -999.999 or above 9999.999 do not fit
  • The 3-letter Chemical Component Dictionary (CCD) names are running out as well.
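The field-width limits listed above can be illustrated with a short sketch. The helper names are hypothetical, and the widths are the ones stated in the list (5-character atom serial, 1-character chain identifier, 8.3 coordinates); this is not GROMACS code:

```python
def fits_atom_serial(serial):
    """True if the atom serial number fits in a 5-character integer field."""
    return len("%5d" % serial) <= 5


def fits_chain_id(chain):
    """True if the chain identifier is a single character."""
    return len(chain) == 1


def fits_coordinate(value):
    """True if the coordinate fits in an 8-character field with 3 decimals."""
    return len("%8.3f" % value) <= 8


if __name__ == "__main__":
    print(fits_atom_serial(99999), fits_atom_serial(100000))
    print(fits_chain_id("A"), fits_chain_id("AA"))
    print(fits_coordinate(-999.999), fits_coordinate(-1000.0))
```

Checking the formatted length rather than comparing against numeric bounds also catches values that only overflow after rounding to 3 decimals.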

Therefore, the wwPDB adopted this plan:

  1. Designing the extensible mmCIF format (also known as the PDBx format), which is based on the concept of a dictionary, to overcome the limitations of the now-legacy PDB format. More details on the format can be found here and here.
  2. Releasing structures that violate any of the PDB format limitations as mmCIF files only.
  3. Once all 4-character PDB IDs are exhausted, new extended 8-character IDs will be used.
  4. All new structures released after that point will be available in mmCIF format only.
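To make step 3 concrete, here is a hedged sketch of mapping a classic ID to an extended one. It assumes the layout I recall from the wwPDB announcement: a "pdb_" prefix followed by the lower-cased ID left-padded with zeros to 8 characters. The prefix, the padding, and the function name to_extended_id are assumptions, so please check them against the announcement before relying on them:

```python
import re

# Classic 4-character ID pattern quoted earlier in this post.
CLASSIC_ID = re.compile(r"[1-9][0-9A-Z]{3}")


def to_extended_id(classic_id):
    """Map a classic PDB ID to an assumed extended form (hypothetical helper)."""
    pdb_id = classic_id.upper()
    if not CLASSIC_ID.fullmatch(pdb_id):
        raise ValueError("not a classic 4-character PDB ID: %r" % classic_id)
    # Assumption: "pdb_" prefix plus the ID zero-padded on the left to 8 characters.
    return "pdb_" + pdb_id.lower().rjust(8, "0")


if __name__ == "__main__":
    print(to_extended_id("3HHB"))
```

Any code that stores PDB IDs in a fixed 4-character buffer would need to be reviewed for this change.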

Please refer to this announcement for more details. The exhaustion of the 4-character IDs is expected to occur very soon, in 2023.

Implications for Gromacs
Gromacs does not currently support mmCIF format. Therefore, it will have:

  • Limited ability to read newly released PDB entries
  • Limited ability to write simulation output correctly, or to convert .xtc files into formats that other applications support, whenever the output violates one of the PDB format limitations. For example, gmx trjconv -drop will not be able to generate proper PDB files with more than 99,999 atoms. I believe this limitation has always been there but was overlooked.

Course of action

  • We need to support the mmCIF format soon. It might be wise to create an mmcif2gmx module instead of modifying the current pdb2gmx module.
  • However, we need to start by analyzing whether supporting the new format, with its relaxed limitations, will affect other parts of Gromacs.
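To give a feel for what reading mmCIF involves, here is a minimal sketch of extracting the atom records from an _atom_site loop. It is an illustration only, not GROMACS code and not a full mmCIF parser: the function read_atom_site and the sample data are hypothetical, and it assumes one row per line with plain whitespace-separated values (no quoted or multi-line values):

```python
# Made-up sample: serial numbers, chain IDs and coordinates that the PDB format cannot hold.
EXAMPLE = """\
data_demo
loop_
_atom_site.group_PDB
_atom_site.id
_atom_site.label_atom_id
_atom_site.label_asym_id
_atom_site.Cartn_x
_atom_site.Cartn_y
_atom_site.Cartn_z
ATOM 100000 CA AAA -1234.567 0.000 10000.125
ATOM 100001 CB AAA -1234.000 1.500 10001.000
#
"""


def read_atom_site(text):
    """Return the rows of the first _atom_site loop as a list of dicts."""
    columns = []       # column names taken from the loop header
    rows = []
    in_header = False  # True while reading the lines right after "loop_"
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if not columns and not in_header:
            # Skip everything until a loop starts.
            in_header = line == "loop_"
            continue
        if in_header:
            if line.startswith("_atom_site."):
                columns.append(line.split(".", 1)[1])
                continue
            if not columns:
                # This loop belongs to another category; keep searching.
                in_header = line == "loop_"
                continue
            in_header = False  # header is complete, this line is data
        if line.startswith(("#", "_", "loop_", "data_")):
            break  # end of the _atom_site loop
        rows.append(dict(zip(columns, line.split())))
    return rows


if __name__ == "__main__":
    for atom in read_atom_site(EXAMPLE):
        print(atom["id"], atom["label_asym_id"], atom["Cartn_x"])
```

Because values are named by column and separated by whitespace, nothing in the format itself limits the number of atoms, the chain identifier length, or the coordinate range. A real implementation should rely on an existing mmCIF library rather than hand-written parsing like this.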

I have experience enabling BioJava to support extended PDB IDs using the ciftools-java library from RCSB, and I believe I can bring that experience to Gromacs.

In a preliminary search, I found some mmCIF support libraries for C++ (cpp-cif-file, cpp-cif-file-util, and cpp-dict-pack) on their GitHub repo.

However, a change analysis should start before any implementation, and a schedule of support is important for proper delivery.

Amr

Thanks for looking into this. I created an issue on GitLab: Support of PDBx/mmCIF format (#4753) · Issues · GROMACS / GROMACS · GitLab