PDB identification code

From Proteopedia

Every molecular model (atomic coordinate file) in the Protein Data Bank (PDB) has a unique accession or identification code. These codes are always 4 characters long. The first character is a numeral in the range 1-9, while each of the last three characters can be either a numeral (0-9) or a letter of the Latin alphabet (A-Z). Plans for an expanded identification code system that can handle more entries have been announced.
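The format rule above can be sketched as a regular expression. This is only an illustration of the rule as stated in this article; the names `PDB_CODE_RE` and `is_classic_pdb_code` are hypothetical, and accepting both upper and lower case letters is an assumption.

```python
import re

# One numeral 1-9 followed by three characters that are each a numeral
# or a Latin letter. Accepting either case is an assumption.
PDB_CODE_RE = re.compile(r"[1-9][0-9A-Za-z]{3}")


def is_classic_pdb_code(text: str) -> bool:
    """Return True if text has the shape of a 4-character PDB code."""
    return PDB_CODE_RE.fullmatch(text) is not None


print(is_classic_pdb_code("1mbn"))   # True
print(is_classic_pdb_code("0abc"))   # False: first character must be 1-9
print(is_classic_pdb_code("1mbnx"))  # False: five characters
```

A match only means the text has the right shape, not that an entry with that code exists in the PDB.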

Lower vs. Upper Case

PDB codes are often written in upper case. However, lower case helps to avoid confusing zero (0) with the letter "O": for example, 1o1o is clearer than 1O1O, and 2ou0 is clearer than 2OU0. (Also, links to upper case codes in Proteopedia don't work! For example, 1O1O.) Depending on the font, the numeral 1 can also be confused with capital "I" or lower case "L". So 1imo is clearer than 1IMO, but 1X9L is clearer than 1x9l.

PDB codes in Proteopedia

Every released entry in the PDB has an automatically generated page in Proteopedia. To find it, simply enter the PDB code in the search box at the left of this (and every) page in Proteopedia. Proteopedia is updated once a week, shortly after the weekly release of new entries at the PDB. To link to a page in Proteopedia that is titled with a PDB code, put double square brackets around the code in the wikitext box. For example, typing [[1vot]] when editing a Proteopedia article generates the link 1vot.
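As a small illustration of the linking rule, the sketch below builds the wikitext for a link to a PDB code-titled page. The function name `proteopedia_wikilink` is hypothetical; the code is lowercased because, as noted above, links to upper case codes in Proteopedia don't work.

```python
def proteopedia_wikilink(code: str) -> str:
    """Return wikitext linking to the Proteopedia page for a PDB code.

    The code is lowercased because upper case links do not resolve.
    """
    return f"[[{code.strip().lower()}]]"


print(proteopedia_wikilink("1VOT"))  # [[1vot]]
```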

Examples of PDB Codes

  • 1mbn - a 1973 model of myoglobin, the first protein structure solved.
  • 1tna - a 1975 model of yeast phenylalanine transfer RNA, the first RNA structure solved.
  • 1bna - the first full turn of a B-form DNA double helix solved by crystallography. Solved in 1980, this confirmed, 27 years later, the 1953 theoretical model of Watson & Crick. In the intervening years, methods were developed for macromolecular crystallography, and for producing short segments of DNA of defined sequences.
  • 2hhd - human hemoglobin, deoxy.
  • 9ins - insulin.

Newer PDB Codes are Sequential

For many years, depositors of models could request an available PDB code that served as an acronym for the molecule. All the above examples are such cases. With the increase in the number of new entries each week, the PDB no longer permits this option. In recent years, all PDB codes have been assigned by the PDB from the pool of available codes, in sequential ascending order, without reference to the name of the molecule.

PDB codes have been permanently associated with a single structure

Once a PDB code is assigned to a given structure, the assignment is permanent, even when the structure is withdrawn (retired from the database), like 3luw, or superseded by a newer or corrected structure, like 1ace. If you request the page for a superseded structure, like 1aak, Proteopedia will automatically display the newest structure, 2aak. Look for the explanation in the 'Structural Highlights' section of each page.

In May, 2017, the PDB announced plans for a versioning system. This went into effect in July, 2019[1]. It allows multiple versions of the same entry to keep a single PDB code. See below.

Limited Number of 4-Character PDB Codes

There are 419,904 possible 4-character PDB identification codes[2]. This would increase to 466,560 if the numeral "0" were allowed as the first character[3]. Thus, the ~170,000 entries in mid-2017 (plus withdrawn and superseded entries) had used up nearly half of the available codes. After approximately 2027[4], a scheme that can accommodate more entries will be required. This will in turn require revising the macromolecular visualization and modeling programs that obtain data online, all of which currently expect 4-character PDB codes. See plans for an expanded system in the following section.
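The counts in this section can be checked with a few lines of arithmetic. The variable names are illustrative, and the entry count is the approximate mid-2017 figure quoted above; it excludes withdrawn and superseded entries.

```python
ALPHANUMERIC = 10 + 26  # ten numerals plus 26 letters

# First character 1-9, then three alphanumeric characters.
codes_first_1_to_9 = 9 * ALPHANUMERIC**3
# If "0" were also allowed as the first character.
codes_first_0_to_9 = 10 * ALPHANUMERIC**3

entries_mid_2017 = 170_000  # approximate figure quoted in the text
fraction_used = entries_mid_2017 / codes_first_1_to_9

print(codes_first_1_to_9)      # 419904
print(codes_first_0_to_9)      # 466560
print(f"{fraction_used:.0%}")  # 40%
```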

Future Plans for Expanded PDB Codes

In May, 2017, the Protein Data Bank announced plans to introduce, later in 2017, an expanded PDB accession code with versioning[5]. The new codes will have the format

pdb_00001abc

where the 5 characters "00001" may each be a numeral, 0-9, and the 3 trailing characters "abc" may each be a numeral or a letter. In addition to increasing the number of possible accession codes from ~4 x 10^5 to >10^9, this will facilitate "text mining detection of PDB entries in the published literature"[5]. The PDB also promises "For as long as practicable, the wwPDB will continue assigning PDB codes that can be truncated losslessly to the current four-character style."[5] When 4-character codes are exhausted, new entries will be available in mmCIF format only, since the legacy PDB format will not accommodate 12-character IDs.
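The relationship between the two styles can be sketched as follows. This follows the format as described in this article (the prefix "pdb_", five numerals, three alphanumeric characters); the function names are hypothetical, and treating "lossless truncation" as dropping the prefix and four leading zeros is an assumption based on the example pdb_00001abc.

```python
import re
from typing import Optional

# Format as described above: "pdb_" + 5 numerals + 3 alphanumerics.
EXTENDED_RE = re.compile(r"pdb_([0-9]{5}[0-9a-z]{3})")
CLASSIC_RE = re.compile(r"[1-9][0-9a-z]{3}")

# Number of codes the described format allows.
EXTENDED_ID_COUNT = 10**5 * 36**3


def to_extended(code: str) -> str:
    """Pad a 4-character code to the extended style, e.g. pdb_00001abc."""
    code = code.lower()
    if CLASSIC_RE.fullmatch(code) is None:
        raise ValueError(f"not a 4-character PDB code: {code!r}")
    return "pdb_0000" + code


def to_classic(extended: str) -> Optional[str]:
    """Truncate an extended code to 4 characters, or None if impossible."""
    match = EXTENDED_RE.fullmatch(extended.lower())
    if match is None:
        raise ValueError(f"not an extended PDB code: {extended!r}")
    body = match.group(1)
    if body.startswith("0000") and body[4] != "0":
        return body[4:]
    return None  # cannot be truncated without losing information


print(to_extended("1ABC"))         # pdb_00001abc
print(to_classic("pdb_00001abc"))  # 1abc
print(to_classic("pdb_12345abc"))  # None
print(EXTENDED_ID_COUNT)           # 4665600000
```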

As of 2024, the wwPDB planned to make a beta 12-character ID archive available in 2026[4], and estimated that the 4-character IDs would be used up in 2029[4].

Versioning

Along with the expanded accession codes, a versioning system was introduced in mid-2019[5][1].

"At present, revised atomic coordinates for an existing released PDB entry are assigned a new accession code, and the prior entry is obsoleted. This long-standing wwPDB policy had the unintended consequence of breaking connections with publications and usage of the prior set of atomic coordinates ...."[5]

The version of an accession will be included in its filename thus:

pdb_00001abc_xyz_v1-2.cif.gz

where "v1" designates a major version, and "-2" a minor version.[5] "xyz" is a constant that signifies an atomic coordinate file. Other types of data files might use the same PDB accession codes in future.
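A versioned filename of this shape can be split into its parts with a regular expression. This is a minimal sketch based only on the single example filename above; the pattern, the `VersionedFile` type and the function name are assumptions, not a wwPDB specification.

```python
import re
from typing import NamedTuple


class VersionedFile(NamedTuple):
    accession: str     # e.g. "pdb_00001abc"
    content_type: str  # e.g. "xyz" for atomic coordinates
    major: int
    minor: int


# Pattern inferred from the example pdb_00001abc_xyz_v1-2.cif.gz.
FILENAME_RE = re.compile(
    r"(?P<accession>pdb_[0-9a-z]{8})"
    r"_(?P<content>[a-z]+)"
    r"_v(?P<major>[0-9]+)-(?P<minor>[0-9]+)\.cif\.gz"
)


def parse_versioned_filename(name: str) -> VersionedFile:
    """Split a versioned filename into accession, type and version."""
    match = FILENAME_RE.fullmatch(name)
    if match is None:
        raise ValueError(f"unrecognized filename: {name!r}")
    return VersionedFile(
        accession=match["accession"],
        content_type=match["content"],
        major=int(match["major"]),
        minor=int(match["minor"]),
    )


info = parse_versioned_filename("pdb_00001abc_xyz_v1-2.cif.gz")
print(info.accession, info.content_type, info.major, info.minor)
# pdb_00001abc xyz 1 2
```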

Digital Object Identifiers (DOI) for PDB Entries

Each PDB entry is accessible through a DOI. For example, 6ef8 is accessible as doi.org/10.2210/pdb6ef8/pdb.
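The DOI pattern in the example can be generated from a PDB code as sketched below. The function names are hypothetical, and the "https://" scheme and the lowercasing of the code are assumptions; only the pattern doi.org/10.2210/pdb6ef8/pdb comes from the text above.

```python
def pdb_doi(code: str) -> str:
    """Return the DOI for a PDB entry, following the 6ef8 example."""
    return f"10.2210/pdb{code.lower()}/pdb"


def pdb_doi_url(code: str) -> str:
    """Return a resolvable URL for the entry's DOI (https is assumed)."""
    return "https://doi.org/" + pdb_doi(code)


print(pdb_doi_url("6ef8"))  # https://doi.org/10.2210/pdb6ef8/pdb
```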

References

  1. PDB News July 31, 2019: Improve your previously released coordinates AND keep your original PDB ID with OneDep
  2. Ten numerals plus 26 letters = 36. The first character is 1-9. 9 x 36^3 = 419,904.
  3. In April, 2013, according to Rachel Kramer Green of the RCSB in Rutgers, NJ, there were no plans to use PDB codes beginning with 0. However, in July, 2017, the wwPDB FAQ stated "The four-letter PDB identifier currently consists of a number (0-9) followed by 3 letters or numbers."
  4. Resources for Supporting the Extended PDB ID Format (pdb_00001abc), Spring 2024 Issue of the RCSB PDB Newsletter.
  5. PDB News May 17, 2017: Revise Your Structure Without Changing the PDB Accession Code and Related Changes to the FTP Archive.

Proteopedia Page Contributors and Editors

Eric Martz, Jaime Prilusky, Wayne Decatur
