LIST OF SUPPLEMENTARY MATERIALS

These materials are included on the following pages and are also posted on JKI’s website: carbon.indstate.edu/inlow/bioinformatics/exercises.htm

Background on bioinformatics and BLAST
    PowerPoint, Word, and PDF versions on website

Exercise 1 detailed procedure
    Word and PDF versions on website

Sequence of mouse protein for Exercise 1
    >protein_of_unknown_function_from_Mus_musculus
    MVHLTDAEKAAVSGLWGKVNADEVGGEALGRLLVVYPWTQRYFDSFGDLSSASAIMGNAKVKAHGKKVIT
    AFNDGLNHLDSLKGTFASLSELHCDKLHVDPENFRLLGNMIVIVLGHHLGKDFTPAAQAAFQKVVAGVAA
    ALAHKYH

Globin multiple sequence alignment handout for Exercise 1
    Word and PDF versions on website

Exercise 2 detailed procedure
    Word and PDF versions on website

Chymotrypsin structure handout for Exercise 2
    PowerPoint and PDF versions on website

BIOINFORMATICS IN BIOCHEMISTRY

Bioinformatics: a field at the interface of molecular biology, computer science, and mathematics.

Bioinformatics focuses on the analysis of molecular sequences (DNA, RNA, and proteins).

The National Institutes of Health (NIH) definition of bioinformatics: “research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, analyze, or visualize such data.”

How is bioinformatics important to biochemistry?

The tools of bioinformatics include algorithms and computer programs for analysis of molecular sequences that reveal the structure and function of macromolecules.

Bioinformatics analysis gives valuable information that can guide experimental work.

AMINO ACID SEQUENCE ALIGNMENT

A way to compare 2 or more sequences.

The sequences are lined up (“aligned”), one above the other, so that each residue of one sequence can be compared to the corresponding residue of the other sequence.

Sometimes one sequence must be “cut,” and a gap introduced, in order to make this sequence align in the optimal way with the other sequence.
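The idea of lining up two sequences and inserting gaps where needed can be sketched in a few lines of code. The example below is a minimal illustration of global alignment by dynamic programming (the Needleman-Wunsch approach), assuming a simple scoring scheme of +1 for a match, -1 for a mismatch, and -1 per gap position. The function name `global_align` and these scores are assumptions made for illustration only; alignment programs such as BLAST and EMBOSS Needle use amino acid substitution matrices and separate gap-open and gap-extend penalties instead.

```python
def global_align(seq_a, seq_b, match=1, mismatch=-1, gap=-1):
    """Globally align two sequences with a linear gap penalty.

    Returns (aligned_a, aligned_b, score). The scoring values are
    illustrative assumptions, not the defaults of any real tool.
    """
    rows, cols = len(seq_a) + 1, len(seq_b) + 1

    # score[i][j] = best score for aligning seq_a[:i] with seq_b[:j]
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align the two residues
                score[i - 1][j] + gap,       # gap in seq_b
                score[i][j - 1] + gap,       # gap in seq_a
            )

    # Trace back from the bottom-right corner to recover one best alignment.
    aligned_a, aligned_b = [], []
    i, j = len(seq_a), len(seq_b)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            pair = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + pair:
                aligned_a.append(seq_a[i - 1])
                aligned_b.append(seq_b[j - 1])
                i -= 1
                j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            aligned_a.append(seq_a[i - 1])
            aligned_b.append("-")
            i -= 1
        else:
            aligned_a.append("-")
            aligned_b.append(seq_b[j - 1])
            j -= 1

    return "".join(reversed(aligned_a)), "".join(reversed(aligned_b)), score[-1][-1]


if __name__ == "__main__":
    top, bottom, total = global_align("ACGT", "ACT")
    print(top)
    print(bottom)
    print("score:", total)
```

Changing the `gap` value changes how readily the sketch introduces gaps, which is the same trade-off explored later in Exercise 2 when the “Gap Open” and “Gap Extend” parameters are varied.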
An example of a pairwise amino acid sequence alignment (2 sequences):

sequence_1  1 MLFMCHQRVMKKEAEEKLKAEELRRARAAADIPIIWILGGPGCGKGTQCA 50
                           .|||||..:           ||:::||||.||||||.
sequence_2  1              MEEKLKKTK-----------IIFVVGGPGSGKGTQCE 26

All the residues that are identical in the two sequences are indicated with the “|” symbol between them; residues that are chemically similar are indicated with the “:” or “.” symbol, such as W and F (both have aromatic side chains).

Note that a gap (----- region) was introduced into sequence_2 in order to make it align optimally with sequence_1.

BLAST: Basic Local Alignment Search Tool

A bioinformatics tool that allows users to compare a protein or DNA sequence to databases of other protein or DNA sequences from many organisms.

A web-based version is available free of charge at the National Center for Biotechnology Information (NCBI) website: http://www.ncbi.nlm.nih.gov/BLAST/

The output from a “BLAST search” is a series of sequence alignments.

EXAMPLE OF A BLAST SEARCH

Suppose you have the sequence of a human protein and want to know if there is a homologous protein in the fruit fly Drosophila melanogaster.

The amino acid sequence of the human protein will be the “query” for the BLAST search.

The BLAST algorithm compares the query sequence to all proteins in the Drosophila genome.

The BLAST output will show a list of the Drosophila proteins that have statistically significant sequence similarity to the human query protein. These Drosophila proteins can be referred to as “BLAST hits.”

Below this list of BLAST hits, there will be a series of sequence alignments between the human query protein and each Drosophila protein that is in the list of BLAST hits.

The first alignment will be between the query and the Drosophila protein that is most similar in sequence; the second alignment will be between the query and the Drosophila protein that is the second best match in terms of sequence similarity, and so on.
The next slide shows just one of these alignments from a BLAST search. The last 2 slides explain some of the features of the alignment.

Query = a human protein
Subject (sbjct) = the Drosophila protein that is most similar to this human protein

Sample from BLAST output (see explanation on next 2 slides):

>gi|24663208|ref|NP_729792.1| Adenylate kinase-1, [Drosophila melanogaster]
          Length = 229

 Score = 179 bits (453), Expect = 1e-45
 Identities = 96/205 (47%), Positives = 131/205 (64%), Gaps = 15/205 (7%)

Query: 2   EEKLKKTK-----------IIFVVGGPGSGKGTQCEKIVQKYGYTHLSTGDLLRSEVSSG 50
           EEKLK  +           II+++GGPG GKGTQC KIV+KYG+THLS+GDLLR+EV+SG
Sbjct: 15  EEKLKAEELRRARAAADIPIIWILGGPGCGKGTQCAKIVEKYGFTHLSSGDLLRNEVASG 74

Query: 51  SARGKKLSEIMEKGQLVPLETVLDMLRDAMVAKVNTSKGFLIDGYPREVQQGEEFERRIG 110
           S +G++L  +M  G LV  + VL +L DA+     +SKGFLIDGYPR+  QG EFE RI
Sbjct: 75  SDKGRQLQAVMASGGLVSNDEVLSLLNDAITRAKGSSKGFLIDGYPRQKNQGIEFEARIA 134

Query: 111 QPTLLLYVDAGPETMTQRLLKRGETSG--RVDDNEETIKKRLETYYKATEPVIAFYEKRG 168
              L LY +   +TM QR++ R   S   R DDNE+TI+ RL T+ + T  ++  YE +
Sbjct: 135 PADLALYFECSEDTMVQRIMARAAASAVKRDDDNEKTIRARLLTFKQNTNAILELYEPKT 194

Query: 169 IVRKVNAEGSVDSVFSQVCTHLDAL 193
           +   +NAE  VD +F +V   +D +
Sbjct: 195 LT--INAERDVDDIFLEVVQAIDCV 217

First you will see sequence identification information for the subject (Drosophila) protein in the alignment. This protein is called “Adenylate kinase-1”:

>gi|24663208|ref|NP_729792.1| Adenylate kinase-1, [Drosophila melanogaster]

Next you will see the total length of the subject protein, 229 amino acid residues:

Length = 229

Looking at the sequence alignment itself, you will see that it wraps around, taking up 3 ½ “rows.” One “row” is shown at the bottom of this slide. Residues 2 to 193 of the query protein are aligned with residues 15 to 217 of the Drosophila protein (see the numbers on the right and left sides of the previous slide).
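The Identities, Positives, and Gaps figures in the sample output above are simple counts over the aligned columns, and they can be recomputed directly from the alignment rows. The sketch below is an illustration only: the function name `alignment_stats` is hypothetical, and it assumes the middle line shows a letter for each identical pair, a “+” for each similar pair, and a space otherwise. The query, middle, and subject rows are typed in from the sample output with the wrapped rows joined end to end.

```python
def alignment_stats(query, midline, sbjct):
    """Return (aligned_length, identities, positives, gaps) for one alignment."""
    if len(query) != len(sbjct):
        raise ValueError("aligned query and subject must have the same length")
    length = len(query)
    # Identical columns: same residue in both rows (a gap never counts).
    identities = sum(1 for q, s in zip(query, sbjct) if q == s and q != "-")
    # Positives: every marked column in the middle line (letters and '+').
    positives = sum(1 for c in midline if c != " ")
    # Gap positions: every '-' in either row.
    gaps = query.count("-") + sbjct.count("-")
    return length, identities, positives, gaps


# Rows copied from the sample BLAST output (human query vs. Drosophila subject).
query = (
    "EEKLKKTK-----------IIFVVGGPGSGKGTQCEKIVQKYGYTHLSTGDLLRSEVSSG"
    "SARGKKLSEIMEKGQLVPLETVLDMLRDAMVAKVNTSKGFLIDGYPREVQQGEEFERRIG"
    "QPTLLLYVDAGPETMTQRLLKRGETSG--RVDDNEETIKKRLETYYKATEPVIAFYEKRG"
    "IVRKVNAEGSVDSVFSQVCTHLDAL"
)
midline = (
    "EEKLK  +           II+++GGPG GKGTQC KIV+KYG+THLS+GDLLR+EV+SG"
    "S +G++L  +M  G LV  + VL +L DA+     +SKGFLIDGYPR+  QG EFE RI "
    "   L LY +   +TM QR++ R   S   R DDNE+TI+ RL T+ + T  ++  YE + "
    "+   +NAE  VD +F +V   +D +"
)
# The third subject row is entered as the 60 residues numbered 135 to 194.
sbjct = (
    "EEKLKAEELRRARAAADIPIIWILGGPGCGKGTQCAKIVEKYGFTHLSSGDLLRNEVASG"
    "SDKGRQLQAVMASGGLVSNDEVLSLLNDAITRAKGSSKGFLIDGYPRQKNQGIEFEARIA"
    "PADLALYFECSEDTMVQRIMARAAASAVKRDDDNEKTIRARLLTFKQNTNAILELYEPKT"
    "LT--INAERDVDDIFLEVVQAIDCV"
)

if __name__ == "__main__":
    length, identities, positives, gaps = alignment_stats(query, midline, sbjct)
    for label, count in (("Identities", identities),
                         ("Positives", positives),
                         ("Gaps", gaps)):
        print(f"{label} = {count}/{length} ({round(100 * count / length)}%)")
```

Run as a script, this prints the same three ratios that BLAST reports for this alignment: 96/205 (47%), 131/205 (64%), and 15/205 (7%).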
The “middle” line of each row (the line between the query and subject lines) is called the “consensus sequence.” Whenever there is a residue that is identical for the query protein and the subject protein, its one-letter code is shown in this middle line. Whenever there is a residue that is chemically similar (a conservative substitution) for the query and the subject, it is marked with a ‘+’ symbol. If one of the sequences must be “cut” in order to align it with the other, this is indicated with a “-” symbol in that sequence. This is referred to as a “gap” in the alignment.

Query: 2   EEKLKKTK-----------IIFVVGGPGSGKGTQCEKIVQKYGYTHLSTGDLLRSEVSSG 50
           EEKLK  +           II+++GGPG GKGTQC KIV+KYG+THLS+GDLLR+EV+SG
Sbjct: 15  EEKLKAEELRRARAAADIPIIWILGGPGCGKGTQCAKIVEKYGFTHLSSGDLLRNEVASG 74

Just above the sequence alignment itself you will see statistical information for the alignment (essentially telling you “how similar” the two sequences are):

 Score = 179 bits (453), Expect = 1e-45
 Identities = 96/205 (47%), Positives = 131/205 (64%), Gaps = 15/205 (7%)

This tells you that of the 205 aligned positions, 96 are identical between the query protein and the subject protein.

Of the 205 aligned positions, 131 are either identical OR similar (have a “+” symbol).

15 gap positions were introduced into the sequences (have a “-” symbol).

The expected value, or E-value (1 x 10^-45 in this case; a very small number!), is the number of alignments with a score at least this high that would be expected to occur by chance between unrelated sequences in a database of the size that was searched. For very small E-values, this number is approximately equal to the probability of seeing such an alignment by chance.

The bottom line: the smaller the E-value, the less likely it is that the similarity between the two sequences arose by chance, and the more significant the match.

Exercise 1: Protein of Unknown Function from Mus musculus

1. Obtain the “Sequence of a protein of unknown function from Mus musculus” in electronic form from your instructor. Mus musculus is the genus and species name for the common house mouse, a model organism studied by many researchers in the biological sciences.

2.
Go to the National Center for Biotechnology Information website: http://www.ncbi.nlm.nih.gov/

Click on “BLAST”, then click on “Protein-protein BLAST (blastp)”. Copy the mouse protein sequence and paste it into the “Search” box. From the dropdown menu “Choose database,” select “refseq.” From the dropdown menu under “Options for advanced blasting,” select “Homo sapiens” so that you will be searching for human proteins similar to the mouse protein. Be sure that the “Do CD-Search” box is checked. When this box is checked, a search for conserved protein domains will be conducted. Leave all other settings and parameters the same and click on the BLAST! button.

3. In the next window that appears, there will be a colored, diagrammatic representation of the mouse protein sequence showing the locations of any functional/structural domains that are present in the protein. Click on the domain(s) to find out what they are. Take some time to explore the various links describing the domain(s). For instance, try clicking the [+] symbols on the left. This should give you some clues about the identity of the mouse protein.

4. In the original window showing the diagram of domains in the mouse protein, click the “Format!” button to see the BLAST results for the mouse protein. The BLAST results will appear in a new window; print this BLAST report. If you scroll down through the BLAST results, you will see many sequence alignments one after the other. Each alignment is a sequence comparison between the mouse protein and a human protein that is similar in sequence to it. The first alignment compares the mouse protein and the human protein that is the best match to the mouse protein; the second alignment compares the mouse protein and the human protein that is the second best match to the mouse protein; and so on.

To infer the function of a protein about which little is known, one can compare the sequence of the “unknown protein” to other proteins of known function.
If the unknown protein is very similar in sequence to a protein of known function, then there is a good chance that the unknown protein has the same function as the known protein.

Use the space provided to answer the following discussion questions:

A. Is there a conserved domain in the mouse protein? Name the domain and discuss what you learned about it.

B. Which human proteins are similar in sequence to the mouse protein?

C. Speculate about the identity and function of the mouse protein.

Think back to what you learned about hemoglobin in class; name at least two amino acid residues that are important to the structure and/or function of hemoglobin and explain why they are important. (You may use your textbook or notes.)

________________________________________________________________
________________________________________________________________
________________________________________________________________

Obtain the handout showing a sequence alignment for human α-globin, β-globin, and myoglobin from the instructor in order to answer the questions below. A number of residues that are important to the structure and/or function of hemoglobin are highlighted on this alignment. Compare the alignment on the handout to the first alignment on your BLAST results.

Use the space provided to answer the following discussion questions:

A. Does the mouse protein have all of the highlighted residues that you know to be important to the structure and/or function of human hemoglobin? Mark/highlight the locations of these residues in the mouse protein on the printout of your BLAST results.

B. What can you conclude about the importance of these residues to the structure and/or function of hemoglobin in both organisms?

C. What might happen if one of these residues was replaced by some other residue?

D. Identify a few residues in the mouse protein that are different from those at the corresponding position in the human globins.
Why do you think these residues differ, and what can you conclude about the importance of these residues to the structure and function of this protein in both organisms?

5. Repeat step 2, but this time select “nr” from the dropdown menu “Choose database.” From the dropdown menu under “Options for advanced blasting,” select “Takifugu rubripes,” the pufferfish. Click on “Format!” to see the BLAST results. Compare your BLAST results here to those from the first BLAST search for human proteins. Print the pufferfish BLAST results.

Use the space provided to answer the following discussion questions:

A. Is the mouse protein more similar to human β-globin or to pufferfish β-globin? Is this what you might have expected before doing the BLAST search? Why?

B. Does the pufferfish β-globin have all of the residues highlighted on the handout that you know to be important to the structure and/or function of human β-globin? Mark/highlight the locations of these residues in the pufferfish protein on the printout of your BLAST results. If any of these residues differ from those in human β-globin, are the substitutions conservative?

6. Repeat step 2, but this time select “nr” from the dropdown menu “Choose database.” From the dropdown menu under “Options for advanced blasting,” select “Arabidopsis thaliana.” Arabidopsis thaliana is a plant and is a commonly studied model organism. Click “Format!” to see the BLAST results. Print the Arabidopsis BLAST results.

Use the space provided to answer the following discussion questions:

A. Is it surprising that an Arabidopsis β-globin is not among the BLAST hits from this BLAST search? Why or why not?

B. Name the Arabidopsis protein that is the top hit from the BLAST search. Based on its degree of similarity to the mouse protein, in terms of total protein sequence length and percent amino acid identity, do you think it is likely to have the same structure and function as the mouse protein? Explain your reasoning.

When finished, turn in your three BLAST reports and answers to the discussion questions.
Sequence of a protein of unknown function from Mus musculus:

>protein_of_unknown_function_from_Mus_musculus
MVHLTDAEKAAVSGLWGKVNADEVGGEALGRLLVVYPWTQRYFDSFGDLSSASAIMGNAKVKAHGKKVITA
FNDGLNHLDSLKGTFASLSELHCDKLHVDPENFRLLGNMIVIVLGHHLGKDFTPAAQAAFQKVVAGVAA
ALAHKYH

Globin Multiple Sequence Alignment: Human α-globin (Hbα), β-globin (Hbβ), and myoglobin (Mb)

The first and last residues of helices A-H are indicated.

Marked positions: NA2 A1 A16 B1 B16 C1 C7 D1 D7 E1
Hbα  V-LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLSH-----GSAQVKGH
Hbβ  VHLTPEEKSAVTALWGKV--NVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAH
Mb   V-LSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKTEAEMKASEDLKKH

Marked positions: E19 F1 F9 G1 G19 H1
Hbα  GKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHA
Hbβ  GKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQA
Mb   GVTVLTALGAILKKKGHHEAELKPLAQSHATKHKIPIKYLEFISEAIIHVLHSRHPGDFGADAQG

Marked positions: H21 HC3
Hbα  SLDKFLASVSTVLTSKYR-----
Hbβ  AYQKVVAGVANALAHKYH-----
Mb   AMNKALELFRKDIAAKYKELGYQG

Residues that are important to the structure and function of hemoglobin:

His E7 (Hbα, Hbβ, Mb): distal histidine
His F8 (Hbα, Hbβ, Mb): proximal histidine
Lys C5 (Hbα): involved in ion pair that stabilizes T state
Asp FG1 (Hbβ): involved in ion pair that stabilizes T state
His HC3 (Hbβ): involved in ion pair that stabilizes T state and contributes to Bohr effect
His NA2 (Hbβ): involved in BPG binding
Lys EF7 (Hbβ): involved in BPG binding
His H21 (Hbβ): involved in BPG binding

Exercise 2: Chymotrypsin - Active Site and Specificity

1. Obtain the amino acid sequences of bovine chymotrypsin and trypsin as follows. Go to the National Center for Biotechnology Information website: http://www.ncbi.nlm.nih.gov/. Select “Protein” from the dropdown menu and enter the identification numbers for the two proteins (“gi” numbers) in the search box: “576117, 60593450”. Click “Go.” The entries for the two proteins should appear.
From the “Display” dropdown menu, select “FASTA.” You will then see the amino acid sequences of the proteins in FASTA format. (This is a format in which the “>” symbol is followed by identification information and a carriage return; the amino acid sequence, using the one-letter codes, begins after that.) Keep this window open so that you can copy the sequences during step 2.

Chymotrypsin and trypsin are both serine proteases. Name the three active site residues of serine proteases that constitute the catalytic triad.

________________________________________________________________

What chemical reaction do these enzymes catalyze?

________________________________________________________________

Describe the differing specificities of chymotrypsin and trypsin for this reaction.

________________________________________________________________

2. Open a new browser window and go to the EMBL-EBI Toolbox pairwise alignment site: http://www.ebi.ac.uk/emboss/align/index.html. Copy and paste the sequences in FASTA format into the two textboxes (include the “>” and identification information). Leave all alignment parameters the same and click “Run.” When your output appears, click the “Needle output” link to obtain a printer-friendly version of the alignment. Print the alignment (you will turn in all alignments at the end).

Examine the sequence alignment, paying close attention to the residues of the catalytic triad. You may wish to highlight these residues. Determine whether or not this is a good alignment.

Click the “back” button on the browser window twice to go back to the pairwise alignment site. Change the “Gap Open” and “Gap Extend” parameters and generate another alignment. Try at least two different combinations and print the alignment each time. Carefully compare all the alignments, determine which one is best, and label it “BEST” (you may decide that some alignments are roughly equal).

Use the space provided to answer the following discussion questions:

A.
What constitutes a “good” alignment?

B. Why did you choose this particular alignment as the best relative to other alignments you generated?

3. Chymotrypsin residues 189, 190, and 228 are three of the residues lining the specificity pocket of the enzyme (where a side chain of the substrate binds). Use your best sequence alignment to determine the identity of these residues for chymotrypsin, and then determine the identity of the corresponding residues for trypsin. (The residue numbers will be different for trypsin.)

         Chymotrypsin     Trypsin
189      ____________     ____________
190      ____________     ____________
228      ____________     ____________

Obtain the handout from the instructor which shows the structure of chymotrypsin.

Use the space provided to answer the following discussion question:

Although other features of these two enzymes are essential for their differing specificities, the residues lining the specificity pocket make an important contribution to their specificities. Residue 189 of chymotrypsin and the corresponding residue of trypsin are each at the “base” of the specificity pocket, as you can see on the handout. Reconcile the identity of this residue in chymotrypsin and trypsin with the differing specificities of the two enzymes.

4. You have been working with bovine serine proteases so far. Choose an organism that is distantly related to the cow (Bos taurus) and search the NCBI protein database for a chymotrypsin or trypsin sequence from this organism. (Go to http://www.ncbi.nlm.nih.gov/ and choose “Protein” from the dropdown menu.) When you find a sequence that looks interesting, click on the link to view the full entry. Read the entry carefully to be sure that you have found a full-length sequence. (For instance, 50 amino acids is only a partial sequence.) Write down the gi number for the sequence so that you will have it for step 5.

Use the space provided to propose some hypotheses before examining the sequence further:

A.
Predict whether or not the enzyme you chose will contain the residues forming the catalytic triad. Explain your reasoning.

B. Predict the identities of the three residues lining the specificity pocket which you examined in step 3 for bovine chymotrypsin and trypsin. Explain your reasoning.

5. You will now do a multiple sequence alignment for bovine chymotrypsin, trypsin, and the serine protease you chose in step 4. You will need to obtain these three sequences in FASTA format. If you already closed the browser windows where these sequences were displayed, just go back to the NCBI website and use the gi numbers to find them.

Open a new browser window and go to the EMBL-EBI Toolbox ClustalW multiple sequence alignment site: http://www.ebi.ac.uk/clustalw/index.html. Copy and paste the three FASTA-format sequences into the textbox (paste the sequences one after the other with a carriage return between them; include the > and identification information for each sequence). Leave all alignment parameters the same and click “Run.” When your output appears, click the “Alignment file” link to obtain a printer-friendly version of the alignment. Print the alignment (you will turn in all alignments at the end).

Examine the alignment, paying careful attention to the residues of the catalytic triad and the three residues lining the specificity pocket. If you are not satisfied with the alignment, try it again changing the “Gap open,” “End gaps,” “Gap extension,” or “Gap distances” parameters.

Use the space provided to answer the following discussion questions:

A. Were your hypotheses from step 4 correct? On your alignment, highlight the residues of the catalytic triad and the three residues lining the specificity pocket. Name any residues that turned out to be different from your predictions and discuss the implications in terms of enzymatic activity and specificity.

B.
Based on amino acid sequence alone, do you expect the serine protease from the organism you chose to have chymotrypsin-like activity or trypsin-like activity? Justify your answer.

C. Compare your multiple sequence alignment to that of two other groups who have chosen a serine protease from a different organism; focus on the catalytic triad and the three residues lining the specificity pocket. Were the identities of the three residues lining the specificity pocket for the serine proteases chosen by the other groups different from those of the serine protease you chose? Discuss any differences.

D. Looking at your multiple sequence alignment, are there any regions that are highly conserved (many identical amino acids for all three sequences)? Are there any regions that are not well conserved (few identical amino acids for all three sequences)? Do any of these regions surround the residues of the catalytic triad or the three residues lining the specificity pocket? How might the locations of any highly conserved regions relate to the structure and function of the three enzymes?

When finished, turn in your sequence alignments and answers to the discussion questions.

The structure of bovine chymotrypsin (1GCD)

Close-up view looking into the specificity pocket

The three active-site residues of the catalytic triad are shown in stick mode (C = gray; N = blue; O = red). The oxygen atom of the Ser189 side chain (red) is visible at the base of the specificity pocket.
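Step 3 of Exercise 2 asks for the trypsin residues that correspond to given chymotrypsin residues in an alignment. Doing this by eye means counting residues in one row while skipping gaps, then reading across to the other row. That bookkeeping can be sketched as below. The function name `corresponding_residue` and the two short aligned strings are hypothetical and are not real chymotrypsin or trypsin sequences. Positions are counted from the first residue of whatever sequence was pasted into the alignment tool, and it is an assumption to check whether that count matches the residue numbering used on the handout.

```python
def corresponding_residue(aligned_a, aligned_b, position_a):
    """Find what residue `position_a` of sequence A is aligned to in sequence B.

    `aligned_a` and `aligned_b` are the two rows of a pairwise alignment
    (same length, '-' for gaps). `position_a` is 1-based and counts only
    residues of A, not gaps. Returns (residue_a, position_b, residue_b);
    position_b and residue_b are None when A's residue faces a gap in B.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("alignment rows must have the same length")

    index_a = index_b = 0  # residues seen so far in each sequence
    for char_a, char_b in zip(aligned_a, aligned_b):
        if char_a != "-":
            index_a += 1
        if char_b != "-":
            index_b += 1
        if char_a != "-" and index_a == position_a:
            if char_b == "-":
                return char_a, None, None
            return char_a, index_b, char_b

    raise ValueError("position_a is beyond the end of sequence A")


if __name__ == "__main__":
    # Hypothetical toy alignment, for illustration only.
    row_a = "MKS-GDSGGP"
    row_b = "M-SAGDSGGP"
    for pos in (2, 3, 4):
        print(pos, corresponding_residue(row_a, row_b, pos))
```

In the toy alignment, residue 3 of the first sequence lines up with residue 2 of the second, which illustrates why the residue numbers differ between the two enzymes once gaps are present.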