How to find a protein's best structure

From Proteopedia


Revision as of 20:13, 23 October 2024

This article is under construction. This notice will be removed when it is completed. Eric Martz 14:19, 21 October 2024 (UTC)

Here is a general guide to finding a structure for a protein molecule of interest. This procedure is one of many possible; it is the one favored by User:Eric Martz. Once you find a structure you want, the instructions below also explain how to load it into FirstGlance in Jmol, which is the easiest place to learn about and explore your structure.

1 Empirical Models
- 1.1 Easy UniProt search for empirical models
  - 1.1.1 Choosing a model
    - 1.1.1.1 Proteopedia
    - 1.1.1.2 FirstGlance in Jmol
2 Structure Predicted by AlphaFold
3 Sequence-Related Empirical Models
4 Structure Superposition
5 Notes & References

Empirical Models

Empirical models are structures determined empirically (experimentally) by X-ray crystallography, cryo-electron microscopy, solution NMR, or rarely by other methods. Empirical models are usually the most accurate and reliable, especially when they have good resolution. All published, empirically-determined, atomic-resolution, macromolecular 3D structures are available in the Worldwide Protein Data Bank (the "PDB").

Each model in the PDB has a unique 4-character identification code (PDB ID) that begins with a numeral, and has letters or numerals for the last 3 characters. Examples are 1d66, 4mdh, 9ins.
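
A quick format check can catch typos in an ID before you paste it into a search slot. The sketch below encodes only the rule stated above (a numeral followed by three letters or numerals). The function name `looks_like_pdb_id` is made up for this illustration, and passing the check does not mean that the entry exists in the PDB.

```python
import re

# One numeral followed by three letters or numerals, as described above.
_PDB_ID = re.compile(r"[0-9][A-Za-z0-9]{3}")


def looks_like_pdb_id(text: str) -> bool:
    """Return True if text has the 4-character PDB ID format."""
    return _PDB_ID.fullmatch(text.strip()) is not None


# The three example IDs from the text, plus a UniProt accession for contrast.
for candidate in ("1d66", "4mdh", "9ins", "P22303"):
    print(candidate, looks_like_pdb_id(candidate))
```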

Below are two methods for finding out if your query amino acid sequence, or parts of it, have empirically-determined 3D structures in the PDB.

Easy UniProt search for empirical models

At UniProt.Org, find your protein of interest.

Example: search for human acetylcholinesterase, then click on P22303.
Click on Structure in the column at left, and wait for this section to load.
If there is a table with PDB in the first column of each row, followed by a PDB ID in each row in the IDENTIFIER column, these are empirical structures for your protein.
If there is no PDB list, go to Structure Predicted by AlphaFold below.
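
The steps above can also be scripted. The sketch below is only an illustration: the URL pattern `https://rest.uniprot.org/uniprotkb/<accession>.json` and the JSON field names (`uniProtKBCrossReferences`, `database`, `id`, `properties` with keys `Method`, `Resolution`, `Chains`) are assumptions about UniProt's REST output that you should check against UniProt's own API documentation. The function names are hypothetical.

```python
import json
from urllib.request import urlopen


def fetch_uniprot_entry(accession: str) -> dict:
    """Download one UniProtKB entry as JSON (the URL pattern is an assumption)."""
    url = f"https://rest.uniprot.org/uniprotkb/{accession}.json"
    with urlopen(url, timeout=30) as response:
        return json.load(response)


def pdb_cross_references(entry: dict) -> list:
    """Collect the PDB rows (ID, method, resolution, positions) from an entry."""
    rows = []
    for xref in entry.get("uniProtKBCrossReferences", []):
        if xref.get("database") != "PDB":
            continue  # skip cross-references to other databases
        # Assumed layout: a list of {"key": ..., "value": ...} pairs.
        props = {p.get("key"): p.get("value") for p in xref.get("properties", [])}
        rows.append({
            "pdb_id": xref.get("id", "").lower(),
            "method": props.get("Method"),
            "resolution": props.get("Resolution"),
            "positions": props.get("Chains"),
        })
    return rows
```

If the assumptions hold, `pdb_cross_references(fetch_uniprot_entry("P22303"))` should return one row per PDB entry, and an empty list corresponds to the "no PDB list" case above.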

Choosing a model

If there are multiple PDB structures, you need to pick one (or a few) that best meet your needs. Because of its medical and pharmaceutical interest, P22303 has more than 50 empirical PDB structures.

Models with good resolution (2.5 Å or less; smaller values mean finer detail) will be the most reliable.

You will likely prefer models that cover all or most of the complete sequence.
- Click Sequence (in the column at left) and note the length (in amino acids). For P22303, the length is 614.
- Click PTM/Processing (in the column at left) and note the length of the signal sequence, if given. The mature protein will start after the signal sequence. For P22303, the signal sequence is 1-31. Therefore, the mature protein will start at position 32.
- The POSITIONS column gives the sequence range for the protein used for structure determination. For P22303, 1vzj includes only a short C-terminal segment, 575-614, a tetramerization domain^[1]. But most of the models span 33-574, omitting the tetramerization domain.
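
The coverage comparison in the list above can be made concrete with a little arithmetic. The sketch below uses only numbers quoted in the text for P22303 (length 614, signal sequence 1-31, model ranges 33-574 and 575-614); the function names are hypothetical.

```python
def overlap_length(start_a: int, end_a: int, start_b: int, end_b: int) -> int:
    """Number of residues shared by two inclusive sequence ranges."""
    return max(0, min(end_a, end_b) - max(start_a, start_b) + 1)


def coverage_of_mature(model_start: int, model_end: int,
                       mature_start: int, mature_end: int) -> float:
    """Fraction of the mature protein that falls inside the model's range."""
    mature_length = mature_end - mature_start + 1
    shared = overlap_length(model_start, model_end, mature_start, mature_end)
    return shared / mature_length


# P22303: 614 residues, signal sequence 1-31, so the mature protein is 32-614.
MATURE = (32, 614)
for label, positions in (("33-574 models", (33, 574)), ("1vzj", (575, 614))):
    fraction = coverage_of_mature(*positions, *MATURE)
    print(f"{label}: {fraction:.1%} of the mature protein")
```

The 33-574 models cover 542 of the 583 mature residues (about 93%), while 1vzj covers only 40 (about 7%).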

You may prefer a model that includes a specific ligand, such as an inhibitor. Here are two ways to evaluate ligands.

Proteopedia

Go to Proteopedia.Org.
Enter the 4-character PDB ID from the UniProt IDENTIFIER column into the Proteopedia search slot at the left.
At the Proteopedia page titled with your PDB ID:
The title of the model often mentions the key ligand. For 4ey5, it is huperzine A.
Abbreviations for all ligands present are listed in a blue/green bar. Clicking on one highlights it in the 3D view, and shows its full name in red at the bottom.

FirstGlance in Jmol

Go to FirstGlance.
Enter the 4-character PDB ID from the UniProt IDENTIFIER column into the FirstGlance slot.
After the model displays, in the upper left panel scroll down to Ligands+ & Non-Standard Residues. There you will find a clickable list of all ligands with their full names.
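
If you have already downloaded a PDB-format file, a rough ligand list can also be pulled from its HETATM records. This is a minimal sketch, not how FirstGlance builds its list. It assumes the fixed-column PDB format (residue name in columns 18-20) and that waters are named HOH; the function name is hypothetical. It reports only the short residue names, not the full ligand names.

```python
def hetero_residue_names(pdb_text: str, skip=("HOH",)) -> list:
    """Sorted, unique residue names found in HETATM records, minus `skip`."""
    names = set()
    for line in pdb_text.splitlines():
        # HETATM records hold ligands, ions, waters and non-standard residues.
        if line.startswith("HETATM") and len(line) >= 20:
            name = line[17:20].strip()  # columns 18-20 (1-based) = residue name
            if name and name not in skip:
                names.add(name)
    return sorted(names)
```

For a file saved as, say, `model.pdb` (a placeholder name), you would call `hetero_residue_names(open("model.pdb").read())`.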

Structure Predicted by AlphaFold

If there are no empirical models for your sequence, the Structure section in UniProt usually offers a structure predicted by AlphaFold. Empirical models are the most reliable, but if none are available, AlphaFold has an impressive track record of correctly predicting structures from sequence. If there is no AlphaFold model in UniProt for your sequence, you can submit the sequence and get a prediction: How to predict structures with AlphaFold. Another model prediction service with a good track record is RoseTTAFold. Submit your sequence there, making sure to check RoseTTAFold as the method.

Download the predicted PDB file (a file ending .pdb). In UniProt, use the download button (a down arrow) in the AlphaFold line. Example: spider acetylcholinesterase.
Go to FirstGlance.
Upload the PDB file to FirstGlance.

FirstGlance automatically colors predicted models by reliability.
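
The reliability information comes from the file itself: AlphaFold PDB files store a per-residue confidence score (pLDDT, on a 0-100 scale) in the B-factor column. The sketch below tallies residues by confidence band. The cutoffs (90, 70, 50) are the bands commonly shown with AlphaFold models and are an assumption here; this page does not say which cutoffs FirstGlance uses. The function names are hypothetical.

```python
from collections import Counter


def plddt_per_residue(pdb_text: str) -> list:
    """pLDDT of each residue, read from the B-factor field of its CA atom."""
    scores = []
    for line in pdb_text.splitlines():
        # Columns 13-16 hold the atom name, columns 61-66 the B-factor.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            scores.append(float(line[60:66]))
    return scores


def confidence_band(plddt: float) -> str:
    """Map a pLDDT value to a band (cutoffs are an assumption, see above)."""
    if plddt > 90:
        return "very high"
    if plddt > 70:
        return "high"
    if plddt > 50:
        return "low"
    return "very low"


def band_counts(pdb_text: str) -> Counter:
    """Count residues in each confidence band."""
    return Counter(confidence_band(s) for s in plddt_per_residue(pdb_text))
```

A model whose residues fall mostly in the two upper bands is one where the prediction is expected to be reliable over most of its length.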

Sequence-Related Empirical Models

This method finds empirical structures that have sequence similarity to the query. Their structures can be compared to the AlphaFold model.

For example, if your query is calmodulin from the lancelet fish (Q9UB37, CALM2_BRALA), zero empirical structures are listed at UniProt. However, the query is 97% sequence identical to human calmodulin (P62158 CALM_HUMAN) and calmodulins from other taxa, for which there are numerous full-length empirical structures. Comparing these with the AlphaFold-predicted model gives it further credence.

Another example: searching UniProt for trapdoor spider acetylcholinesterase finds W4VSJ0 which has no empirical models.

At UniProt.Org, in the Sequence section, note the length of your sequence.
At UniProt, in the Sequence section, click Download. This displays the sequence in FASTA format.
Copy the FASTA-formatted sequence, excluding the identifier line at the top that begins '>'.
At RCSB.org (the USA branch of the PDB), click on the Advanced Search link just below the slot at the top.
Click on Sequence Similarity under 'Advanced Search Query Builder', which opens a slot for your sequence.
Paste your query sequence into the slot.
Change Return in the bottom line from Structures to Polymer Entities.
Push the button at the lower right to run the search.
Scroll down to see the list of hits.
The best hits will be listed first. Notice that each hit starts with a large, bold PDB ID.
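
The first steps above (dropping the '>' line, then searching by sequence) can be sketched in code. The FASTA handling is straightforward. The query layout, however, is an assumption: it follows my understanding of the JSON accepted by the RCSB Search API (posted to `https://search.rcsb.org/rcsbsearch/v2/query`), and the field names and the default cutoff values are placeholders to check against RCSB's documentation. The function names are hypothetical.

```python
import json


def sequence_from_fasta(fasta_text: str) -> str:
    """Join the sequence lines of a FASTA record, dropping the '>' header."""
    lines = [ln.strip() for ln in fasta_text.splitlines()]
    return "".join(ln for ln in lines if ln and not ln.startswith(">"))


def sequence_search_query(sequence: str,
                          identity_cutoff: float = 0.3,
                          evalue_cutoff: float = 0.1) -> dict:
    """Build a sequence-similarity query (field names are assumptions)."""
    return {
        "query": {
            "type": "terminal",
            "service": "sequence",
            "parameters": {
                "sequence_type": "protein",
                "value": sequence,
                "identity_cutoff": identity_cutoff,  # placeholder default
                "evalue_cutoff": evalue_cutoff,      # placeholder default
            },
        },
        # Mirrors changing Return from Structures to Polymer Entities.
        "return_type": "polymer_entity",
    }


def query_as_json(sequence: str) -> str:
    """Serialize the query so it can be sent as the body of a POST request."""
    return json.dumps(sequence_search_query(sequence))
```

Building the query as a plain dictionary keeps it easy to inspect before anything is sent over the network.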

For each hit, notice the Sequence Identity % above the sequence alignment box. The top sequence similarity hit for W4VSJ0 at RCSB.Org, 6emi, has 41% sequence identity.

Also notice the Region range, which tells you the range of residues in the PDB ID that align with your query sequence. Compare this to the full length of your query sequence. The length of W4VSJ0 is 559. The top hit sequence alignment region is 8-529 for 6emi. That range aligns with 31-557 of the query sequence (with a few small gaps), so the coverage is nearly complete.^[2]
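
The two numbers discussed above, percent identity and coverage, are simple ratios. The sketch below shows one common way to compute them; RCSB may define the identity denominator differently, so treat `percent_identity` as an illustration rather than a reproduction of the value RCSB reports. The function names are hypothetical. The coverage example uses the W4VSJ0 numbers from the text.

```python
def percent_identity(aligned_query: str, aligned_hit: str, gap: str = "-") -> float:
    """Percent of gap-free alignment columns in which the residues match."""
    if len(aligned_query) != len(aligned_hit):
        raise ValueError("aligned sequences must have the same length")
    pairs = [(q, h) for q, h in zip(aligned_query, aligned_hit)
             if q != gap and h != gap]
    if not pairs:
        return 0.0
    matches = sum(q == h for q, h in pairs)
    return 100.0 * matches / len(pairs)


def query_coverage(first_aligned: int, last_aligned: int, query_length: int) -> float:
    """Percent of the query spanned by the aligned range (gaps not subtracted)."""
    return 100.0 * (last_aligned - first_aligned + 1) / query_length


# W4VSJ0 is 559 residues long; residues 31-557 align with 6emi.
print(f"{query_coverage(31, 557, 559):.1f}% of the query is spanned")
```

The aligned range 31-557 spans 527 of 559 residues, or about 94% of the query.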

To explore the structure of a hit in FirstGlance, just enter the PDB ID.

If you click the Download button in the list of hits, you will get the CIF file. If you need PDB file format, click on the PDB ID code and open the Download menu on that single entry page to get all format options. A downloaded PDB file can be uploaded to FirstGlance.

Structure Superposition

Superposing ("aligning"^[3]) two structures tells how similar or different they are from each other. Similarity between an AlphaFold-predicted structure and an empirical structure for a sequence similar to the query sequence supports confidence in the AlphaFold prediction.

Structure is more conserved than sequence^[4]. This conclusion is supported by many examples of proteins that have similar structures, yet no discernible sequence identity. An example is the FtsZ cell division protein in bacteria, which shares structure with mammalian tubulin despite only 12-15% sequence identity^[5]. The customary interpretation is that modern proteins with very similar structures have a common ancestor, and that their sequences diverged while maintaining the ancestral 3D fold structure.

An easy and powerful tool for superposing two structures is FATCAT (https://fatcat.godziklab.org/; see Structure superposition tools).
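
To show what a superposition computes, here is a minimal sketch of rigid-body least-squares superposition (the Kabsch method) followed by an RMSD calculation. This is not FATCAT's algorithm, which also allows flexible alignments and finds the residue correspondence for you. The sketch assumes you already have two equal-length lists of matched alpha-carbon coordinates and that NumPy is installed; the function name is hypothetical.

```python
import numpy as np


def kabsch_rmsd(coords_a, coords_b) -> float:
    """RMSD between matched points after optimally superposing A onto B."""
    a = np.asarray(coords_a, dtype=float)
    b = np.asarray(coords_b, dtype=float)
    if a.shape != b.shape or a.ndim != 2 or a.shape[1] != 3:
        raise ValueError("expected two (N, 3) coordinate arrays of equal shape")

    # Remove translation by centering both point sets on their centroids.
    a_centered = a - a.mean(axis=0)
    b_centered = b - b.mean(axis=0)

    # The SVD of the covariance matrix yields the optimal rotation.
    covariance = a_centered.T @ b_centered
    u, _, vt = np.linalg.svd(covariance)

    # Guard against an improper rotation (a reflection).
    sign = 1.0 if np.linalg.det(vt.T @ u.T) >= 0 else -1.0
    rotation = vt.T @ np.diag([1.0, 1.0, sign]) @ u.T

    a_rotated = a_centered @ rotation.T
    squared = ((a_rotated - b_centered) ** 2).sum()
    return float(np.sqrt(squared / len(a)))
```

A value near zero means the two sets of atoms coincide after superposition; larger values mean larger average deviations between matched alpha carbons.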

Notes & References

1. ↑ Dvir H, Harel M, Bon S, Liu WQ, Vidal M, Garbay C, Sussman JL, Massoulie J, Silman I. The synaptic acetylcholinesterase tetramer assembles around a polyproline II helix. EMBO J. 2004 Nov 10;23(22):4394-405. Epub 2004 Nov 4. PMID:15526038 doi:7600425
2. ↑ In the sequence alignment graphic at RCSB below 6emi, touch any part of the graphic and enlarge it with your mouse wheel. When sufficiently enlarged, you can see the first (or last) aligned residue of the query. Touching that residue reports its sequence number above the graphic.
3. ↑ Structure superposition is often called "structure alignment", but "alignment" is easily confused with sequence alignment. Some structure superposition methods are guided by the sequence alignment, while others are independent of sequence. See Structure superposition tools.
4. ↑ Chothia C, Lesk AM. The relation between the divergence of sequence and structure in proteins. EMBO J. 1986 Apr;5(4):823-6. PMID:3709526
5. ↑ A 3D structure similarity search gives tubulin as one of the closest matches to ftsZ, with an RMSD (alpha carbons) of <2.6 Å.

Proteopedia Page Contributors and Editors

Eric Martz


