Bioinformatics exercise from Ind State Carbondale

LIST OF SUPPLEMENTARY MATERIALS
These materials are included on the following pages and are also posted on JKI’s website:
carbon.indstate.edu/inlow/bioinformatics/exercises.htm
Background on bioinformatics and blast
PowerPoint, Word, and PDF versions on website
Exercise 1 detailed procedure
Word and PDF versions on website
Sequence of mouse protein for Exercise 1
>protein_of_unknown_function_from_Mus_musculus
MVHLTDAEKAAVSGLWGKVNADEVGGEALGRLLVVYPWTQRYFDSFGDLSSASAIMGNAKVKAHGKKVIT
AFNDGLNHLDSLKGTFASLSELHCDKLHVDPENFRLLGNMIVIVLGHHLGKDFTPAAQAAFQKVVAGVAA
ALAHKYH
Globin multiple sequence alignment handout for Exercise 1
Word and PDF versions on website
Exercise 2 detailed procedure
Word and PDF versions on website
Chymotrypsin structure handout for Exercise 2
PowerPoint and PDF versions on website
BIOINFORMATICS IN BIOCHEMISTRY
Bioinformatics– a field at the interface of molecular biology, computer science, and mathematics
Bioinformatics focuses on the analysis of molecular sequences (DNA, RNA, and proteins)
The National Institutes of Health (NIH) definition of bioinformatics: “research, development, or
application of computational tools and approaches for expanding the use of biological, medical,
behavioral or health data, including those to acquire, store, organize, analyze, or visualize such data.”
How is bioinformatics important to biochemistry?
The tools of bioinformatics include algorithms and computer programs for analysis of molecular
sequences that reveal the structure and function of macromolecules.
Bioinformatics analysis gives valuable information that can guide experimental work.
AMINO ACID SEQUENCE ALIGNMENT
A way to compare 2 or more sequences;
The sequences are lined up (“aligned”), one above the other, so that each residue of one sequence can
be compared to the corresponding residue of the other sequence;
Sometimes one sequence must be “cut,” and a gap introduced, in order to make this sequence align in
the optimal way with the other sequence.
An example of a pairwise amino acid sequence alignment (2 sequences):
sequence_1   1 MLFMCHQRVMKKEAEEKLKAEELRRARAAADIPIIWILGGPGCGKGTQCA 50
                            .|||||..:           ||:::||||.||||||.
sequence_2   1              MEEKLKKTK-----------IIFVVGGPGSGKGTQCE 26
All the residues that are identical in the two sequences are indicated with the “|” symbol between
them; residues that are chemically similar are indicated with the “:” or “.” symbol, such as W and F
(both have aromatic side chains). Note that a gap (----- region) was introduced into sequence_2 in
order to make it align optimally with sequence_1.
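The marking scheme described above can be sketched in a few lines of Python. This is only an illustration: the similarity groups below are a simplified, hypothetical stand-in for the substitution matrix that alignment programs actually use, and the sketch emits only "|" and ":" (it does not distinguish ":" from ".").

```python
# Hypothetical similarity groups, chosen only for illustration.
# Real alignment programs score similarity with a substitution matrix.
SIMILAR_GROUPS = [
    set("FWY"),   # aromatic side chains
    set("ILVM"),  # aliphatic / hydrophobic
    set("KRH"),   # basic
    set("DE"),    # acidic
    set("ST"),    # small hydroxyl
    set("NQ"),    # amides
]


def match_line(row_a, row_b):
    """Build the middle line for two already-aligned rows of equal length."""
    marks = []
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            marks.append(" ")   # gap column: nothing to compare
        elif a == b:
            marks.append("|")   # identical residues
        elif any(a in group and b in group for group in SIMILAR_GROUPS):
            marks.append(":")   # chemically similar residues
        else:
            marks.append(" ")   # dissimilar residues
    return "".join(marks)


# A short stretch taken from the example alignment above.
print("IIWILGG")
print(match_line("IIWILGG", "IIFVVGG"))
print("IIFVVGG")
```

Here W and F fall in the same (assumed) aromatic group, so that column is marked as similar rather than identical.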
BLAST– Basic Local Alignment Search Tool
A bioinformatics tool that allows users to compare a protein or DNA sequence to databases of other
protein or DNA sequences from many organisms.
A web-based version is available free of charge at the National Center for Biotechnology Information
(NCBI) website:
http://www.ncbi.nlm.nih.gov/BLAST/
The output from a “BLAST search” is a series of sequence alignments.
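The "local alignment" in the name BLAST means finding the best-matching stretch shared by two sequences rather than forcing them to align end to end. A minimal sketch of local alignment scoring (the Smith-Waterman dynamic-programming method) is shown below. BLAST itself uses a faster heuristic and a substitution matrix; the match, mismatch, and gap values here are hypothetical numbers chosen only to keep the example small.

```python
def local_alignment_score(seq_a, seq_b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between two sequences.

    Scoring values are illustrative assumptions, not BLAST's parameters.
    """
    previous = [0] * (len(seq_b) + 1)
    best = 0
    for i in range(1, len(seq_a) + 1):
        current = [0] * (len(seq_b) + 1)
        for j in range(1, len(seq_b) + 1):
            pair_score = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            current[j] = max(
                0,                            # start a new local alignment here
                previous[j - 1] + pair_score, # align the two residues
                previous[j] + gap,            # gap in seq_b
                current[j - 1] + gap,         # gap in seq_a
            )
            best = max(best, current[j])
        previous = current
    return best


# Two sequences that share only the short stretch "GGPG".
print(local_alignment_score("XXGGPGXX", "YYGGPGYY"))
```

Because a cell is never allowed to drop below zero, a poorly matching region simply ends the local alignment instead of dragging down the score of a good region elsewhere.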
EXAMPLE OF A BLAST SEARCH
Suppose you have the sequence of a human protein and want to know if there is a homologous protein in
the fruit fly Drosophila melanogaster. The amino acid sequence of the human protein will be the “query”
for the BLAST search.
The BLAST algorithm compares the query sequence to all proteins in the Drosophila genome.
The BLAST output will show a list of the Drosophila proteins that have statistically significant sequence similarity
to the human query protein. These Drosophila proteins can be referred to as “BLAST hits.” Below this
list of BLAST hits, there will be a series of sequence alignments between the human query protein and
each Drosophila protein that is in the list of BLAST hits. The first alignment will be between the query
and the Drosophila protein that is most similar in sequence; the second alignment will be between the
query and the Drosophila protein that is the second best match in terms of sequence similarity… and so
on.
The next slide shows just one of these alignments from a BLAST search. The last 2 slides explain some
of the features of the alignment.
Query = a human protein
Subject (sbjct) = the Drosophila protein that is most similar to this human protein
Sample from BLAST output (see explanation on next 2 slides):
>gi|24663208|ref|NP_729792.1| Adenylate kinase-1, [Drosophila melanogaster]
Length = 229
Score = 179 bits (453), Expect = 1e-45
Identities = 96/205 (47%), Positives = 131/205 (64%), Gaps = 15/205 (7%)
Query: 2   EEKLKKTK-----------IIFVVGGPGSGKGTQCEKIVQKYGYTHLSTGDLLRSEVSSG 50
           EEKLK  +           II+++GGPG GKGTQC KIV+KYG+THLS+GDLLR+EV+SG
Sbjct: 15  EEKLKAEELRRARAAADIPIIWILGGPGCGKGTQCAKIVEKYGFTHLSSGDLLRNEVASG 74

Query: 51  SARGKKLSEIMEKGQLVPLETVLDMLRDAMVAKVNTSKGFLIDGYPREVQQGEEFERRIG 110
           S +G++L  +M  G LV  + VL +L DA+     +SKGFLIDGYPR+  QG EFE RI
Sbjct: 75  SDKGRQLQAVMASGGLVSNDEVLSLLNDAITRAKGSSKGFLIDGYPRQKNQGIEFEARIA 134

Query: 111 QPTLLLYVDAGPETMTQRLLKRGETSG--RVDDNEETIKKRLETYYKATEPVIAFYEKRG 168
              L LY +   +TM QR++ R   S   R DDNE+TI+ RL T+ + T  ++  YE +
Sbjct: 135 PADLALYFECSEDTMVQRIMARAAASAVKRDDDNEKTIRARLLTFKQNTNAILELYEPKT 194

Query: 169 IVRKVNAEGSVDSVFSQVCTHLDAL 193
           +   +NAE  VD +F +V   +D +
Sbjct: 195 LT--INAERDVDDIFLEVVQAIDCV 217
First you will see sequence identification information for the subject (Drosophila) protein in the
alignment. This protein is called “Adenylate kinase-1”:
>gi|24663208|ref|NP_729792.1| Adenylate kinase-1, [Drosophila melanogaster]
Next you will see the total length of the subject protein, 229 amino acid residues:
Length = 229
Looking at the sequence alignment itself, you will see that it wraps around, taking up 3 ½ “rows.” One
“row” is shown at the bottom of this slide. Residues 2 to 193 of the query protein are aligned with
residues 15 to 217 of the Drosophila protein (see the numbers on the right and left sides of the
previous slide). The “middle” line of each row (the line between the query and subject lines) is called
the “consensus sequence.” Whenever a residue is identical in the query protein and the subject
protein, its one-letter code is shown in this middle line. Whenever the two residues are chemically
similar (a conservative substitution), the position is marked with a ‘+’ symbol. If
one of the sequences must be “cut” in order to align it with the other, this is indicated with a “-”
symbol. This is referred to as a “gap” in the alignment.
Query: 2   EEKLKKTK-----------IIFVVGGPGSGKGTQCEKIVQKYGYTHLSTGDLLRSEVSSG 50
           EEKLK  +           II+++GGPG GKGTQC KIV+KYG+THLS+GDLLR+EV+SG
Sbjct: 15  EEKLKAEELRRARAAADIPIIWILGGPGCGKGTQCAKIVEKYGFTHLSSGDLLRNEVASG 74
Just above the sequence alignment itself you will see statistical information for the alignment
(essentially telling you “how similar” the two sequences are):
Score = 179 bits (453), Expect = 1e-45
Identities = 96/205 (47%), Positives = 131/205 (64%), Gaps = 15/205 (7%)
This tells you that the alignment spans 205 positions (columns), of which 96 hold residues that are
identical in the query protein and the subject protein. Of the 205 aligned positions, 131 are either
identical OR similar (have “+” symbol). 15 positions are gaps introduced into the sequences (have “-” symbol).
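The counts on the "Identities" and "Gaps" lines can be reproduced directly from the aligned rows. The sketch below does this for the first row of the sample alignment and then checks the percentages that BLAST reports for the whole alignment. The function names are illustrative; counting "Positives" is left out because it would require the substitution matrix.

```python
def alignment_counts(query_row, sbjct_row):
    """Return (identities, gap_positions, columns) for two aligned rows."""
    identities = sum(
        1 for q, s in zip(query_row, sbjct_row) if q == s and q != "-"
    )
    gap_positions = sum(
        1 for q, s in zip(query_row, sbjct_row) if q == "-" or s == "-"
    )
    return identities, gap_positions, len(query_row)


def percent(count, total):
    """Percentage rounded to a whole number, as shown in the BLAST report."""
    return round(100 * count / total)


# First row of the sample alignment (query residues 2-50, subject 15-74).
QUERY_ROW = "EEKLKKTK" + "-" * 11 + "IIFVVGGPGSGKGTQCEKIVQKYGYTHLSTGDLLRSEVSSG"
SBJCT_ROW = "EEKLKAEELRRARAAADIPIIWILGGPGCGKGTQCAKIVEKYGFTHLSSGDLLRNEVASG"

print(alignment_counts(QUERY_ROW, SBJCT_ROW))

# Whole-alignment numbers taken from the report above.
print(percent(96, 205), percent(131, 205), percent(15, 205))
```

Summing the per-row counts over all rows of the alignment gives the totals printed in the report.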
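The expected-value, or E-value (1 x 10^-45 in this case; a very small number!), is the number of alignments
with a score at least this good that would be expected to occur by chance in a search of a database of the
size that was searched. For very small values it is close to the probability of such a chance match.
The bottom line: the smaller the expected-value, the less likely it is that the similarity between the two
sequences arose by chance.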
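The relationship between the bit score and the expected-value can be written as E = m x n x 2^(-S'), where S' is the bit score, m is the query length, and n is the total length of the database. The sketch below applies that formula. The lengths used in the example are hypothetical, and real BLAST uses corrected "effective" lengths, so the result is only an order-of-magnitude illustration.

```python
def expect_value(bit_score, query_length, database_length):
    """Expected number of chance alignments scoring at least bit_score.

    Uses E = m * n * 2**(-S'), with raw (uncorrected) lengths.
    """
    return query_length * database_length * 2.0 ** (-bit_score)


# Hypothetical search: a 200-residue query against 10 million residues.
for score in (20, 40, 179):
    print(score, expect_value(score, 200, 10_000_000))
```

Each additional bit of score halves the expected-value, and searching a database twice as large doubles it, which is why the same alignment can be significant in a small database and not in a large one.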
Exercise 1: Protein of Unknown Function from Mus musculus
1. Obtain the “Sequence of a protein of unknown function from Mus musculus” in electronic form
from your instructor.
Mus musculus is the genus and species name for the common house mouse, a model organism studied by
many researchers in the biological sciences.
2. Go to the National Center for Biotechnology Information website:
http://www.ncbi.nlm.nih.gov/
Click on “BLAST”, then click on “Protein-protein BLAST (blastp)”. Copy the mouse protein sequence
and paste it into the “Search” box.
From the dropdown menu “Choose database,” select “refseq.” From the dropdown menu under “Options
for advanced blasting,” select “Homo sapiens” so that you will be searching for human proteins similar
to the mouse protein. Be sure that the “Do CD-Search” box is checked. When this box is checked,
a search for conserved protein domains will be conducted. Leave all other settings and parameters the
same and click on the BLAST! button.
3. In the next window that appears, there will be a colored, diagrammatic representation of the mouse
protein sequence showing the locations of any functional/structural domains that are present in the
protein. Click on the domain(s) to find out what they are. Take some time to explore the various links
describing the domain(s). For instance, try clicking the [+] symbols on the left. This should give you
some clues about the identity of the mouse protein.
4. In the original window showing the diagram of domains in the mouse protein, click the “Format!”
button to see the BLAST results for the mouse protein. The BLAST results will appear in a new
window; print this BLAST report. If you scroll down through the BLAST results, you will see many
sequence alignments one after the other. Each alignment is a sequence comparison between the mouse
protein and a human protein that is similar in sequence to it. The first alignment compares the mouse
protein and a human protein that is the best match to the mouse protein; the second alignment
compares the mouse protein and a human protein that is the second best match to the mouse protein;
and so on…
To infer the function of a protein about which little is known, one can compare the sequence of the
“unknown protein” to other proteins of known function. If the unknown protein is very similar in
sequence to a protein of known function, then there is a good chance that the unknown protein has the
same function as the known protein.
Use the space provided to answer the following discussion questions:
A. Is there a conserved domain in the mouse protein? Name the domain and discuss what you learned
about it.
B. Which human proteins are similar in sequence to the mouse protein?
C. Speculate about the identity and function of the mouse protein.
Think back to what you learned about hemoglobin in class; name at least two amino acid residues that
are important to the structure and/or function of hemoglobin and explain why they are important. (You
may use your textbook or notes.)
________________________________________________________________
________________________________________________________________
________________________________________________________________
Obtain the handout showing a sequence alignment for human α-globin, β-globin, and myoglobin from the
instructor in order to do the question below. A number of residues that are important to the structure
and/or function of hemoglobin are highlighted on this alignment. Compare the alignment on the handout
to the first alignment on your BLAST results.
Use the space provided to answer the following discussion questions:
A. Does the mouse protein have all of the highlighted residues that you know to be important to the
structure and/or function of human hemoglobin? Mark/highlight the locations of these residues in the
mouse protein on the printout of your BLAST results.
B. What can you conclude about the importance of these residues to the structure and/or function of
hemoglobin in both organisms?
C. What might happen if one of these residues was replaced by some other residue?
D. Identify a few residues in the mouse protein that are different from those at the corresponding
position in the human globins. Why do you think these residues differ, and what can you conclude about
the importance of these residues to the structure and function of this protein in both organisms?
5. Repeat step 2, but this time select “nr” from the dropdown menu “Choose database.” From the
dropdown menu under “Options for advanced blasting,” select “Takifugu rubripes,” the pufferfish.
Click on “Format!” to see the BLAST results. Compare your BLAST results here to those from the first
BLAST search for human proteins. Print the pufferfish BLAST results.
Use the space provided to answer the following discussion questions:
A. Is the mouse protein more similar to human β-globin or to pufferfish β-globin? Is this what you
might have expected before doing the BLAST search? Why?
B. Does the pufferfish β-globin have all of the residues highlighted on the handout that you know to be
important to the structure and/or function of human β-globin? Mark/highlight the locations of these
residues in the pufferfish protein on the printout of your BLAST results. If any of these residues
differ from those in human β-globin, are the substitutions conservative?
6. Repeat step 2, but this time select “nr” from the dropdown menu “Choose database.” From the
dropdown menu under “Options for advanced blasting,” select “Arabidopsis thaliana.” Arabidopsis
thaliana is a plant and is a commonly studied model organism. Click “Format!” to see the BLAST results.
Print the Arabidopsis BLAST results.
Use the space provided to answer the following discussion questions:
A. Is it surprising that an Arabidopsis β-globin is not among the BLAST hits from this BLAST search?
Why or why not?
B. Name the Arabidopsis protein that is the top hit from the BLAST search. Based on its degree of
similarity to the mouse protein, in terms of total protein sequence length and percent amino acid
identity, do you think it is likely to have the same structure and function as the mouse protein? Explain
your reasoning.
When finished, turn in your three BLAST reports and answers to the discussion questions.
Sequence of a protein of unknown function from Mus musculus:
>protein_of_unknown_function_from_Mus_musculus
MVHLTDAEKAAVSGLWGKVNADEVGGEALGRLLVVYPWTQRYFDSFGDLSSASAIMGNAKVKAHGKKVITA
FNDGLNHLDSLKGTFASLSELHCDKLHVDPENFRLLGNMIVIVLGHHLGKDFTPAAQAAFQKVVAGVAA
ALAHKYH
Globin Multiple Sequence Alignment:
Human α-globin (Hbα), β-globin (Hbβ), and myoglobin (Mb)
The first and last residues of helices A-H are indicated.
Helix boundary positions marked in this block: NA2, A1, A16, B1, B16, C1, C7, D1, D7, E1
Hbα  V-LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLSH-----GSAQVKGH
Hbβ  VHLTPEEKSAVTALWGKV--NVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAH
Mb   V-LSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKTEAEMKASEDLKKH

Helix boundary positions marked in this block: E19, F1, F9, G1, G19, H1
Hbα  GKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHA
Hbβ  GKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQA
Mb   GVTVLTALGAILKKKGHHEAELKPLAQSHATKHKIPIKYLEFISEAIIHVLHSRHPGDFGADAQG

Helix boundary positions marked in this block: H21, HC3
Hbα  SLDKFLASVSTVLTSKYR-----
Hbβ  AYQKVVAGVANALAHKYH-----
Mb   AMNKALELFRKDIAAKYKELGYQG
Residues that are important to the structure and function of hemoglobin:
His E7 (Hbα, Hbβ, Mb)-- distal histidine
His F8 (Hbα, Hbβ, Mb)-- proximal histidine
Lys C5 (Hbα)-- involved in ion pair that stabilizes T state
Asp FG1 (Hbβ)-- involved in ion pair that stabilizes T state
His HC3 (Hbβ)-- involved in ion pair that stabilizes T state and contributes to Bohr effect
His NA2 (Hbβ)-- involved in BPG binding
Lys EF7 (Hbβ)-- involved in BPG binding
His H21 (Hbβ)-- involved in BPG binding
Exercise 2: Chymotrypsin—Active Site and Specificity
1. Obtain the amino acid sequences of bovine chymotrypsin and trypsin as follows. Go to the National
Center for Biotechnology Information website: http://www.ncbi.nlm.nih.gov/. Select “Protein” from the
dropdown menu and enter the identification numbers for the two proteins (“gi” numbers) in the search
box:
“576117, 60593450”. Click “Go.”
The entries for the two proteins should appear. From the “Display” dropdown menu, select “FASTA.”
You will then see the amino acid sequences of the proteins in FASTA format. (This is a format in which
the “>” symbol is followed by identification information and a carriage return; the amino acid sequence,
using the one-letter codes, begins after that.)
Keep this window open so that you can copy the sequences during step 2.
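If you would rather handle the FASTA text with a script than by copy and paste, the format described above is simple to read. The sketch below is a minimal, assumed parser (not a tool provided with this exercise) that splits FASTA text into identification lines and sequences; it is demonstrated on the mouse protein from Exercise 1.

```python
def parse_fasta(text):
    """Return a list of (identification, sequence) tuples from FASTA text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith(">"):
            if header is not None:        # finish the previous record
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)           # sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


EXAMPLE = """>protein_of_unknown_function_from_Mus_musculus
MVHLTDAEKAAVSGLWGKVNADEVGGEALGRLLVVYPWTQRYFDSFGDLSSASAIMGNAKVKAHGKKVIT
AFNDGLNHLDSLKGTFASLSELHCDKLHVDPENFRLLGNMIVIVLGHHLGKDFTPAAQAAFQKVVAGVAA
ALAHKYH
"""

records = parse_fasta(EXAMPLE)
for identification, sequence in records:
    print(identification, len(sequence))
```

Printing the sequence length is a quick way to check that a record is a full-length sequence rather than a fragment.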
Chymotrypsin and trypsin are both serine proteases. Name the three active site residues of serine
proteases that constitute the catalytic triad.
________________________________________________________________
What chemical reaction do these enzymes catalyze?
________________________________________________________________
Describe the differing specificities of chymotrypsin and trypsin for this reaction.
________________________________________________________________
2. Open a new browser window and go to the EMBL-EBI Toolbox pairwise alignment site:
http://www.ebi.ac.uk/emboss/align/index.html. Copy and paste the sequences in FASTA format into
the two textboxes (include the “>” and identification information). Leave all alignment parameters the
same and click “Run.”
When your output appears, click the “Needle output” link to obtain a printer-friendly version of the
alignment. Print the alignment (you will turn in all alignments at the end).
Examine the sequence alignment, paying close attention to the residues of the catalytic triad. You may
wish to highlight these residues. Determine whether or not this is a good alignment.
Click the “back” button on the browser window twice to go back to the pairwise alignment site. Change
the “Gap Open” and “Gap Extend” parameters and generate another alignment. Try at least two
different combinations and print the alignment each time. Carefully compare all the alignments and
determine which one is best and label it “BEST” (you may decide that some alignments are roughly
equal).
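To see why changing "Gap Open" and "Gap Extend" changes the alignment, it helps to compute what a given set of gaps costs. The sketch below assumes one common convention for an affine gap penalty: the first position of a gap costs the gap-open value and each further position costs the gap-extend value. The default values of 10 and 0.5 are assumed here; check the values shown on the alignment form you are using. For simplicity, every run of "-" is charged, including runs at the ends of a row, which a real program may treat differently.

```python
import re


def gap_lengths(aligned_row):
    """Lengths of each run of '-' characters in one aligned row."""
    return [len(run) for run in re.findall(r"-+", aligned_row)]


def total_gap_penalty(aligned_rows, gap_open=10.0, gap_extend=0.5):
    """Sum of affine gap penalties over all gaps in the aligned rows.

    Assumed convention: cost = gap_open + (length - 1) * gap_extend.
    """
    total = 0.0
    for row in aligned_rows:
        for length in gap_lengths(row):
            total += gap_open + (length - 1) * gap_extend
    return total


row = "AB---CD--E"  # hypothetical aligned row with gaps of length 3 and 2
print(total_gap_penalty([row]))                              # assumed defaults
print(total_gap_penalty([row], gap_open=5, gap_extend=1))    # cheaper to open
```

A high gap-open value favors a few long gaps, while a low value lets the program scatter many short gaps, which can raise the number of matched residues without making the alignment more meaningful.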
Use the space provided to answer the following discussion questions:
A. What constitutes a “good” alignment?
B. Why did you choose this particular alignment as the best relative to other alignments you
generated?
3. Chymotrypsin residues 189, 190, and 228 are three of the residues lining the specificity pocket of
the enzyme (where a side chain of the substrate binds). Use your best sequence alignment to
determine the identity of these residues for chymotrypsin, and then determine the identity of the
corresponding residues for trypsin. (The residue numbers will be different for trypsin.)
Residue    Chymotrypsin    Trypsin
189        ____________    ____________
190        ____________    ____________
228        ____________    ____________
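Finding the "corresponding residue" means walking along the alignment column by column while keeping a separate residue count for each sequence, since gaps make the numbering drift apart. The sketch below shows the idea on two short, hypothetical aligned rows; it numbers residues from the start of each row, so the numbers may need an offset if the numbering used in the exercise starts elsewhere.

```python
def corresponding_residue(aligned_a, aligned_b, position_a):
    """Map a 1-based residue number in sequence A onto sequence B.

    Returns (residue_a, position_b, residue_b); the last two are None
    when the residue in A is aligned with a gap in B.
    """
    count_a = count_b = 0
    for col_a, col_b in zip(aligned_a, aligned_b):
        if col_a != "-":
            count_a += 1
        if col_b != "-":
            count_b += 1
        if col_a != "-" and count_a == position_a:
            if col_b == "-":
                return col_a, None, None
            return col_a, count_b, col_b
    raise ValueError("position_a is beyond the end of sequence A")


# Hypothetical aligned rows (not the real protease sequences).
row_a = "MKTAYS"
row_b = "--TAYS"
print(corresponding_residue(row_a, row_b, 4))
print(corresponding_residue(row_a, row_b, 2))
```

In this toy case residue 4 of the first sequence lines up with residue 2 of the second, which is the same kind of numbering shift you should expect between chymotrypsin and trypsin.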
Obtain the handout from the instructor which shows the structure of chymotrypsin.
Use the space provided to answer the following discussion question:
Although other features of these two enzymes are essential for their differing specificities, the
residues lining the specificity pocket make an important contribution to their specificities. Residue 189
of chymotrypsin and the corresponding residue of trypsin are each at the “base” of the specificity
pocket, as you can see on the handout. Reconcile the identity of this residue in chymotrypsin and
trypsin with the differing specificities of the two enzymes.
4. You have been working with bovine serine proteases so far. Choose an organism that is distantly
related to the cow (Bos taurus) and search the NCBI protein database for a chymotrypsin or trypsin
sequence from this organism. (Go to http://www.ncbi.nlm.nih.gov/ and choose “Protein” from the
dropdown menu.) When you find a sequence that looks interesting, click on the link to view the full
entry. Read the entry carefully to be sure that you have found a full-length sequence. (For instance,
50 amino acids is only a partial sequence.) Write down the gi number for the sequence so that you will
have it for step 5.
Use the space provided to propose some hypotheses before examining the sequence further:
A. Predict whether or not the enzyme you chose will contain the residues forming the catalytic triad.
Explain your reasoning.
B. Predict the identities of the three residues lining the specificity pocket which you examined in step
3 for bovine chymotrypsin and trypsin. Explain your reasoning.
5. You will now do a multiple sequence alignment for bovine chymotrypsin, trypsin, and the serine
protease you chose in step 4. You will need to obtain these three sequences in FASTA format. If you
already closed the browser windows where these sequences were displayed, just go back to the NCBI
website and use the gi numbers to find them.
Open a new browser window and go to the EMBL-EBI Toolbox ClustalW multiple sequence alignment
site: http://www.ebi.ac.uk/clustalw/index.html. Copy and paste the three FASTA-format sequences
into the textbox (paste the sequences one after the other with a carriage return between them;
include the > and identification information for each sequence). Leave all alignment parameters the
same and click “Run.”
When your output appears, click the “Alignment file” link to obtain a printer-friendly version of the
alignment. Print the alignment (you will turn in all alignments at the end).
Examine the alignment, paying careful attention to the residues of the catalytic triad and the three
residues lining the specificity pocket. If you are not satisfied with the alignment, try it again changing
the “Gap open,” “End gaps,” “Gap extension,” or “Gap distances” parameters.
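When examining a multiple sequence alignment, it can help to mark the columns in which every sequence has the same residue. The sketch below is an assumed helper, not part of the alignment website; it produces a marker line with "*" under each fully conserved column. It is demonstrated on the first ten columns of the second block of the globin handout alignment.

```python
def conservation_line(rows):
    """Return a string with '*' under columns identical in every row."""
    if len({len(row) for row in rows}) != 1:
        raise ValueError("all aligned rows must have the same length")
    marks = []
    for column in zip(*rows):
        first = column[0]
        conserved = first != "-" and all(res == first for res in column)
        marks.append("*" if conserved else " ")
    return "".join(marks)


rows = [
    "GKKVADALTN",  # Hbα
    "GKKVLGAFSD",  # Hbβ
    "GVTVLTALGA",  # Mb
]
for row in rows:
    print(row)
print(conservation_line(rows))
```

Stretches with many "*" marks are the highly conserved regions that the discussion questions below ask about.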
Use the space provided to answer the following discussion questions:
A. Were your hypotheses from step 4 correct? On your alignment, highlight the residues of the
catalytic triad and the three residues lining the specificity pocket. Name any residues that turned out
to be different from your predictions and discuss the implications in terms of enzymatic activity and
specificity.
B. Based on amino acid sequence alone, do you expect the serine protease from the organism you chose
to have chymotrypsin-like activity or trypsin-like activity? Justify your answer.
C. Compare your multiple sequence alignment to that of two other groups who have chosen a serine
protease from a different organism; focus on the catalytic triad and the three residues lining the
specificity pocket. Were the identities of the three residues lining the specificity pocket for the
serine proteases chosen by the other groups different from those of the serine protease you chose?
Discuss any differences.
D. Looking at your multiple sequence alignment, are there any regions that are highly conserved (many
identical amino acids for all three sequences)? Are there any regions that are not well conserved (few
identical amino acids for all three sequences)? Do any of these regions surround the residues of the
catalytic triad or the three residues lining the specificity pocket? How might the locations of any
highly conserved regions relate to the structure and function of the three enzymes?
When finished, turn in your sequence alignments and answers to the discussion questions.
The structure of bovine chymotrypsin (1GCD)
Close-up view looking into the specificity pocket
The three active-site residues of the catalytic triad are
shown in stick mode (C = gray; N = blue; O = red).
The oxygen atom of the Ser189 side chain (red) is visible
at the base of the specificity pocket.