How to find a protein's best structure
From Proteopedia
Revision as of 17:42, 23 October 2024
This article is under construction. This notice will be removed when it is completed. Eric Martz 14:19, 21 October 2024 (UTC)
Here is a general guide to finding a structure for a protein molecule of interest. This procedure is one of many possible; it is the one favored by User:Eric Martz. Once you find a structure you want, the instructions below also explain how to load it into FirstGlance in Jmol, which is the easiest place to learn about and explore your structure.
Empirical Models
Empirical models are structures determined empirically (experimentally) by X-ray crystallography, cryo-electron microscopy, solution NMR, or rarely by other methods. Empirical models are usually the most accurate and reliable, especially when they have good resolution. All published, empirically-determined, atomic-resolution, macromolecular 3D structures are available in the Worldwide Protein Data Bank (the "PDB").
Each model in the PDB has a unique 4-character identification code (PDB ID) that begins with a numeral, and has letters or numerals for the last 3 characters. Examples are 1d66, 4mdh, 9ins.
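The ID format described above can be checked mechanically. The following is a minimal sketch, assuming only the rule stated here (one numeral followed by three letters or numerals); the names `PDB_ID_PATTERN` and `is_pdb_id` are illustrative and not part of any PDB tool.

```python
import re

# 4-character PDB ID as described above: one numeral, then three
# letters or numerals. Both upper and lower case letters are accepted.
PDB_ID_PATTERN = re.compile(r"[0-9][A-Za-z0-9]{3}")


def is_pdb_id(text: str) -> bool:
    """Return True if text has the shape of a 4-character PDB ID."""
    return PDB_ID_PATTERN.fullmatch(text.strip()) is not None


# The three examples from the text, plus a UniProt accession for contrast.
for candidate in ["1d66", "4mdh", "9ins", "P22303"]:
    print(candidate, is_pdb_id(candidate))
```

A check like this only confirms the shape of the code; it does not confirm that an entry with that ID exists in the PDB.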
Below are two methods for finding out if your query amino acid sequence, or parts of it, have empirically-determined 3D structures in the PDB.
Easy UniProt search for empirical models
At UniProt.Org, find your protein of interest.
- Example: search for human acetylcholinesterase, then click on P22303.
- Click on Structure in the column at left, and wait for this section to load.
- If there is a table with PDB in the first column of each row, followed by a PDB ID in each row in the IDENTIFIER column, these are empirical structures for your protein.
- If there is no PDB list, go to the section below, Structure Predicted by AlphaFold.
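The same lookup can be sketched in code for readers who prefer scripting to clicking. This is a rough sketch under stated assumptions, not a description of how the UniProt website works: it assumes that UniProt's REST service returns an entry as JSON at the URL pattern shown, and that PDB cross-references appear under a `uniProtKBCrossReferences` key with `Method`, `Resolution` and `Chains` properties. Check these against the current UniProt documentation before relying on them. The function names are illustrative.

```python
import json
import urllib.request


def fetch_uniprot_entry(accession: str) -> dict:
    """Download one UniProt entry as JSON (the URL pattern is an assumption)."""
    url = f"https://rest.uniprot.org/uniprotkb/{accession}.json"
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.load(response)


def pdb_cross_references(entry: dict) -> list[dict]:
    """Collect the PDB rows of an entry: ID, method, resolution, positions.

    The key names used here are assumptions about the JSON layout.
    An empty list means no empirical structures are listed.
    """
    rows = []
    for xref in entry.get("uniProtKBCrossReferences", []):
        if xref.get("database") != "PDB":
            continue  # skip cross-references to other databases
        props = {p["key"]: p["value"] for p in xref.get("properties", [])}
        rows.append({
            "pdb_id": xref["id"].lower(),
            "method": props.get("Method"),
            "resolution": props.get("Resolution"),
            "positions": props.get("Chains"),
        })
    return rows
```

If the assumptions hold, `pdb_cross_references(fetch_uniprot_entry("P22303"))` would return one row per entry in the Structure table, and an empty list would correspond to the "no PDB list" case above.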
Choosing a model
If there are multiple PDB structures, you need to pick one (or a few) that best meet your needs. Because of its medical and pharmaceutical interest, P22303 has more than 50 empirical PDB structures.
- Models with the best resolution (2.5 Å or less) will be the most reliable.
- You will likely prefer models that cover all or most of the complete sequence.
- Click Sequence (in the column at left) and note the length (in amino acids). For P22303, the length is 614.
- Click PTM/Processing (in the column at left) and note the length of the signal sequence, if given. The mature protein will start after the signal sequence. For P22303, the signal sequence is 1-31. Therefore, the mature protein will start at position 32.
- The POSITIONS column gives the sequence range for the protein used for structure determination. For P22303, 1vzj includes only a short C-terminal segment, 575-614, a tetramerization domain[1]. But most of the models span 33-574, omitting the tetramerization domain.
- You may prefer a model that includes a specific ligand, such as an inhibitor. Here are two ways to evaluate ligands.
Proteopedia
- Go to Proteopedia.Org.
- Enter the 4-character PDB ID from the IDENTIFIER column into the Proteopedia search slot at the left.
- At the Proteopedia page titled with your PDB ID:
- The title of the model often mentions the key ligand. For 4ey5, it is huperzine A.
- Abbreviations for all ligands present are listed in a blue/green bar. Clicking on one highlights it in the 3D view, and shows its full name in red at the bottom.
FirstGlance in Jmol
- Go to FirstGlance.
- Enter the 4-character PDB ID from the IDENTIFIER column into the FirstGlance slot.
- After the model displays, in the upper left panel scroll down to Ligands+ & Non-Standard Residues. There you will find a clickable list of all ligands with their full names.
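The resolution and coverage criteria under Choosing a model can be sketched as a small ranking routine. This is one possible way to combine them, not a rule prescribed by this article: it keeps models at or below a resolution cutoff (2.5 Å by default), then sorts by the fraction of the mature protein covered, breaking ties by resolution. The positions and lengths in the example come from the P22303 discussion above; the model names and resolution values are invented for illustration.

```python
def parse_positions(positions: str) -> tuple[int, int]:
    """Parse a range such as '33-574' or 'A/B=33-574' into (start, end).

    Simplification: if several comma-separated ranges are given,
    only the first one is used.
    """
    first = positions.split(",")[0].strip()
    span = first.split("=")[-1]
    start, end = span.split("-")
    return int(start), int(end)


def coverage(positions: str, mature_start: int, full_length: int) -> float:
    """Fraction of the mature protein (mature_start..full_length) covered."""
    start, end = parse_positions(positions)
    start = max(start, mature_start)
    end = min(end, full_length)
    covered = max(0, end - start + 1)
    return covered / (full_length - mature_start + 1)


def rank_models(models, mature_start, full_length, max_resolution=2.5):
    """Keep models at or below the cutoff; best coverage first, then best resolution.

    Models without a resolution value (for example NMR models) are dropped
    here, which is a simplification.
    """
    kept = [m for m in models
            if m["resolution"] is not None and m["resolution"] <= max_resolution]
    return sorted(
        kept,
        key=lambda m: (-coverage(m["positions"], mature_start, full_length),
                       m["resolution"]),
    )


# Hypothetical models: names and resolutions are invented for illustration.
example_models = [
    {"name": "model_a", "resolution": 2.1, "positions": "33-574"},
    {"name": "model_b", "resolution": 1.9, "positions": "575-614"},
    {"name": "model_c", "resolution": 3.2, "positions": "33-574"},
]
# P22303: full length 614, mature protein starts at position 32.
for model in rank_models(example_models, mature_start=32, full_length=614):
    print(model["name"], round(coverage(model["positions"], 32, 614), 2))
```

Ranking coverage ahead of resolution is a design choice; a reader interested mainly in a bound ligand might instead filter on the ligand first and then apply these criteria.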
Structure Predicted by AlphaFold
If there are no empirical models for your sequence, the Structure section in UniProt usually offers a structure predicted by AlphaFold. You can explore the AlphaFold-predicted model (instructions just below), or you can search for empirical models with sequences closely related to your query sequence (next section below).
Empirical models are the most reliable, but if none are available, AlphaFold has an impressive track record of correctly predicting structures from sequence. If there is no AlphaFold model in UniProt for your sequence, you can submit the sequence and get a prediction: How to predict structures with AlphaFold. Another model prediction service with a good track record is RoseTTAFold. Submit your sequence there, making sure to check RoseTTAFold as the method.
- Download the predicted PDB file (a file ending .pdb). In UniProt, use the download button in the AlphaFold line. Example: spider acetylcholinesterase.
- Go to FirstGlance.
- Upload the PDB file to FirstGlance.
FirstGlance automatically colors predicted models by reliability.
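To see where such reliability values come from, note that AlphaFold PDB files store a per-residue confidence score (pLDDT, 0 to 100) in the B-factor column. The sketch below reads that column from the alpha-carbon atoms and groups residues into the four bands used by the AlphaFold database. It illustrates the underlying data only; it is not FirstGlance's code, and FirstGlance's own color scheme may differ.

```python
from collections import Counter


def plddt_per_residue(pdb_text: str) -> dict:
    """Map (chain, residue number) to pLDDT, read from alpha-carbon atoms.

    Uses the fixed columns of PDB ATOM records: atom name in columns 13-16,
    chain in 22, residue number in 23-26, B-factor (here pLDDT) in 61-66.
    """
    scores = {}
    for line in pdb_text.splitlines():
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            chain = line[21]
            residue_number = int(line[22:26])
            scores[(chain, residue_number)] = float(line[60:66])
    return scores


def confidence_band(plddt: float) -> str:
    """Classify a pLDDT value into the AlphaFold database bands."""
    if plddt > 90:
        return "very high"
    if plddt > 70:
        return "confident"
    if plddt > 50:
        return "low"
    return "very low"


def band_counts(pdb_text: str) -> Counter:
    """Count residues in each confidence band."""
    return Counter(confidence_band(s) for s in plddt_per_residue(pdb_text).values())
```

For a downloaded prediction, reading the file's text and passing it to `band_counts` gives a quick summary of how much of the model is predicted with high confidence.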
Sequence-Related Empirical Models
This method finds empirical structures that have sequence similarity to the query.
For example, if your query is calmodulin from the lancelet (Q9UB37, CALM2_BRALA), zero empirical structures are listed at UniProt. However, the query is 97% identical in sequence to human calmodulin (P62158, CALM_HUMAN) and to calmodulins from other taxa, for which there are numerous full-length empirical structures. A very high quality homology model can be constructed.
- At UniProt.Org, in the Sequence section, note the length of your sequence.
- At UniProt, click Download. This displays the sequence in FASTA format.
- Copy the FASTA-formatted sequence, excluding the identifier line at the top that begins '>'.
- At RCSB.org (the USA branch of the PDB), click on the Advanced Search link just below the slot at the top.
- Click on Sequence Similarity under 'Advanced Search Query Builder', which opens a slot for your sequence.
- Paste your query sequence into the slot.
- Change Return in the bottom line from Structures to Polymer Entities.
- Push the search button at the lower right to run the search.
- Scroll down to see the list of hits.
- The best hits will be listed first. Notice that each hit starts with a large, bold PDB ID.
For each hit, notice the Sequence Identity % above the sequence alignment box.
Also notice the Region range, which tells you the range of residues in the PDB ID that align with your query sequence. Compare this to the full length of your query sequence.
To view a hit in FirstGlance, just enter the PDB ID.
If you click the Download button in the list of hits, you will get the CIF file. If you need PDB file format, click on the PDB ID code and open the Download menu on that single entry page to get all format options.
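Two small steps in the procedure above lend themselves to a sketch: removing the identifier line from a FASTA record before pasting it, and comparing a hit's Region range to the length of the query. The helper names are illustrative, and the sequence and numbers in the example are made up. The fraction computed is only a rough guide, since it takes the Region range at face value and ignores gaps in the alignment.

```python
def sequence_from_fasta(fasta_text: str) -> str:
    """Return the bare sequence, dropping the identifier line that begins '>'."""
    lines = [line.strip() for line in fasta_text.splitlines()]
    return "".join(line for line in lines if line and not line.startswith(">"))


def aligned_fraction(region: str, query_length: int) -> float:
    """Length of a Region range such as '8-529' divided by the query length."""
    start, end = (int(part) for part in region.split("-"))
    return (end - start + 1) / query_length


# Made-up FASTA record, for illustration only.
example_fasta = """>example identifier line (not a real entry)
MADQLTEEQI
AEFKEAFSLF
"""
query = sequence_from_fasta(example_fasta)
print(query, len(query))

# Hypothetical hit whose Region spans residues 3-18 of a 20-residue query.
print(aligned_fraction("3-18", len(query)))
```

A hit with high sequence identity but a small aligned fraction covers only part of the query, so both numbers are worth checking before choosing a hit.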
Example: searching UniProt for trapdoor spider acetylcholinesterase finds W4VSJ0 which has no empirical models.
- The top sequence similarity hit at RCSB.Org has 41% sequence identity over Region 8-529 of
