BCB 444/544 Fall 06 Sep 14 Lab 4 p. 1

BCB 444/544 Lab 5 (Sept 27)
Function Prediction
Due Mon Oct 1 by 5 PM - email to jdsander@iastate.edu

Objectives
1. Predict genes in a DNA sequence (ORF Finder @ NCBI, GeneMark).
2. Use web-based tools to predict protein family/function and understand how they work.
3. Refresh skills we have developed in previous labs.

Introduction
In this lab you will use sequence-based approaches to predict genes and protein function. Gene prediction works best with an ensemble of techniques, including ORF scanning, intron identification, transcription factor binding site identification, and codon bias. There are multiple approaches to the function prediction problem, including sequence alignments, motif identification, weight matrices, and Hidden Markov Models. After completing this lab you should understand each of these methods and be familiar with online tools that implement them. We assume knowledge of activities from previous labs, such as sequence retrieval and sequence alignment via BLAST.

Exercises

Gene Prediction

Part I – ORF Finder
We’ll start our gene prediction journey with the simple task of scanning a genome for open reading frames. An open reading frame (ORF) is the DNA sequence located between a start codon and a stop codon. Open reading frames represent potential genes. Since each codon consists of three bases, each ORF must have a length divisible by three. By the same token, there are three possible reading frames on a strand of DNA; a fourth reading frame would be in line with the first. Since DNA is double stranded, we have a total of 6 reading frames (3 on each strand).

For example, in a sequence beginning ATGGTA, the three possible reading frames start with the codons ATG, TGG, and GGT. Note: GTA is in the same reading frame as ATG.

Let’s use ORF Finder at NCBI (http://www.ncbi.nlm.nih.gov/gorf/gorf.html) to locate all the possible open reading frames.
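To make the six-frame idea concrete before using the web tool, here is a minimal Python sketch of an ORF scan. It is only an illustration under simple assumptions (ORFs start at ATG, end at the first in-frame stop codon, and are reported on the strand being read); it is not the algorithm ORF Finder itself uses, and the function names and the frame labels (+1 to +3, -1 to -3) are our own choices for this example.

```python
# Illustrative six-frame ORF scan (a sketch, not NCBI ORF Finder's method).
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_len=100):
    """Find ATG...stop ORFs in all six reading frames.

    Returns a list of (frame, start, end, length) tuples. Coordinates are
    0-based positions on the strand being read, and the length includes
    the stop codon, so it is always divisible by three.
    """
    seq = seq.upper()
    orfs = []
    for sign, strand in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):  # three reading frames per strand
            start = None
            for i in range(offset, len(strand) - 2, 3):
                codon = strand[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first start codon opens the ORF
                elif start is not None and codon in STOP_CODONS:
                    end = i + 3
                    if end - start >= min_len:
                        orfs.append((f"{sign}{offset + 1}", start, end, end - start))
                    start = None  # look for the next ORF in this frame
    return orfs


if __name__ == "__main__":
    # Toy sequence: ATG GTA TAA sits in the third frame of the forward strand.
    for orf in find_orfs("CCATGGTATAAGG", min_len=6):
        print(orf)
```

The `min_len` parameter plays the same role as the minimum-length drop-down box in ORF Finder: lowering it reports many more short ORFs, most of which are unlikely to be real genes.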
We’re going to locate all the open reading frames in the Human Immunodeficiency Virus (HIV). Its GenBank accession number is NC_001802. We can either enter the accession number or find the sequence ourselves and enter it directly. For now, enter the accession and click the “OrfFind” button. The default settings show all open reading frames greater than 100 bases in length.

1.) What is the longest open reading frame detected in HIV?
2.) Assuming this is a valid ORF with no introns, how long would the protein produced by this ORF be?

Change the minimum displayed ORF length by changing the drop-down box from 100 to 50 and then clicking the “Redraw” button.

3.) Do you think most of these new ORFs are translated into proteins? Why or why not?
4.) Take note of the frames that these genes are found in. What do you think the + and the – stand for?

Let’s assume we think the longest ORF is a real gene. Click either on the highlighted ORF in the 2D chromosomal picture or on the colored box next to the ORF in the ORF list. We can either accept this ORF as an actual gene or select the “Alternative Initiation Codons” button. Although we learned that open reading frames start with ATG, biology has another rule that takes precedence over all other rules: “There is an exception to every rule except this one.”

5.) Select the “Alternative Initiation Codons” button. With what earlier codon does the gene now start?

Part II – GeneMark
ORFs go a long way toward helping us identify potential genes. However, not all ORFs have the appropriate transcription factor binding sites needed for expression. Also, prediction gets much more challenging when we consider alternative start codons, introns, and alternative splicing. Thankfully, there are plenty more tricks we can pull out of our bag. Nearly all eukaryotic introns begin with GT and end with AG; this is known as the GT-AG rule. We can also take advantage of something known as codon bias.
There are 20 amino acids and 64 possible codons (4^3). This means that many amino acids are encoded by more than one codon. However, different species use certain codons preferentially because of their specific cell machinery. If we know how often a given triplet is used in known genes for an organism, we can assign a probability to each codon occurring. We can then combine these probabilities over an entire candidate coding sequence to assess whether it is likely to be a gene in this organism. We can also expand our search to scan for transcription factor binding sites upstream of the ORF. By combining all of these methods and several other techniques, we can do a much better job of predicting genes than by simply looking at all ORFs.

There are many programs available that take advantage of different combinations of this information. One such program, GeneSeqer, was developed by Volker Brendel, an Iowa State professor, and is accessible through his site at http://www.plantgdb.org/ . Feel free to go through their tutorial after you have completed the lab.

In this lab, we will be using the GeneMark program. Go to http://www.ncbi.nlm.nih.gov and click on the “Genomic Biology” link on the left side of the screen. Here we can access whole genome sequences. On the right side of the screen, under Organism-Specific, select the G for Genome Resources next to Human. There should be a figure with the human chromosomes in the upper left-hand corner. Click on chromosome 1 to go to a map of that chromosome. Click on the “Download/View Sequence/Evidence” link in the upper right corner. The chromosome is very large and is broken into several files. We just want ~50,000 bases of the chromosome for this lab. Change the range values to display the sequence from 50000 to 100000, then click the “Change Region/Strand” button. A new view with the desired region should appear below. Click display and select FASTA format.
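As an aside before running GeneMark, the codon bias idea described earlier in this part can be sketched in a few lines of Python. This is a simplified illustration under our own assumptions (add-one smoothing, and a per-codon log-odds score against uniform codon usage); it is not how GeneMark scores sequences, and the function names are hypothetical.

```python
import math
from collections import Counter

BASES = "ACGT"
ALL_CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]  # 4^3 = 64


def codon_frequencies(known_genes):
    """Estimate codon usage from known coding sequences of one organism.

    Uses add-one smoothing so that codons never seen in the training
    genes still get a small non-zero probability (an assumption made
    for this sketch).
    """
    counts = Counter()
    for gene in known_genes:
        gene = gene.upper()
        for i in range(0, len(gene) - 2, 3):
            codon = gene[i:i + 3]
            if codon in ALL_CODONS:  # skip codons with ambiguous bases
                counts[codon] += 1
    total = sum(counts.values())
    return {c: (counts[c] + 1) / (total + 64) for c in ALL_CODONS}


def codon_bias_score(orf, freqs):
    """Average log-odds per codon versus uniform usage (1/64 per codon).

    Positive scores mean the ORF uses codons the organism prefers;
    negative scores mean it uses codons that are rare in known genes.
    """
    orf = orf.upper()
    codons = [orf[i:i + 3] for i in range(0, len(orf) - 2, 3)]
    codons = [c for c in codons if c in freqs]
    if not codons:
        return 0.0
    return sum(math.log(freqs[c] * 64) for c in codons) / len(codons)


if __name__ == "__main__":
    # Toy training set: this "organism" strongly prefers GCT.
    freqs = codon_frequencies(["ATGGCTGCTGCTTAA"])
    print(codon_bias_score("GCTGCTGCT", freqs))  # preferred codons
    print(codon_bias_score("CCCCCCCCC", freqs))  # unseen codons
```

In practice the training set would be a large collection of known genes for the organism, and a score like this would be only one signal combined with the others listed above.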
Now copy this shortened FASTA sequence, go to http://opal.biology.gatech.edu/GeneMark/genemark24.cgi , and paste in the copied sequence. Under running options, change the species to H. sapiens. Under output options, check “Generate PDF graphics (screen)”. Click the “Start GeneMark” button, then click on “View PDF Graphical Output”.

6.) In which regions is the most probable gene?
7.) In what reading frame is the most probable gene?

Function Prediction

Part III – BLAST – Alignments
Last lab we used sequence alignments to find proteins with sequence composition similar to our own. It is a common assumption that proteins of similar sequence share similar function. Protein sequence determines protein structure, and the combination of structure and sequence determines function. Your first instinct may be to infer directly that sequence determines function, and this is largely true. So why not drop “sequence infers structure” and skip directly to “sequence infers function”? It is worded this way because multiple diverse sequences may result in the same structure. In this case, proteins with vastly different sequence compositions may share similar functions. We state that sequence and structure infer function because structure by itself is not enough to produce an actively interacting protein. The interacting regions must have the proper biochemical properties as well; therefore the sequence at these interacting regions is important.

We will start today’s lab with the basic assumption that sequence infers function. Let’s return to NCBI and retrieve the protein sequence NP_031939. Now BLAST the protein sequence against all Fungi.

8.) How many sequences do you find?

Many of these sequences will be unnamed or hypothetical, meaning their function or actual existence as a protein is unknown. Several hits with good E-values exist and provide a known function.

9.) Do you think this protein shares similar function with the other proteins? Why?
(Hint: look at the picture of the aligned regions near the top of the page.)
10.) Looking at the best match, what positions in the query sequence align to the target sequence?

At the top of the screen, click on the “Show Conserved Domains” link in the upper right corner.

11.) Do these domains appear to correlate with the conserved region identified in the BLAST search?

Now repeat the BLAST; however, this time BLAST against the entire NR database (not just Fungi). There should be many hits this time. Look at which regions match in this BLAST search.

12.) Based solely on the description of the protein hits from this BLAST, what would you predict the function of this entire protein to be?
13.) How does this contradict what we know to be the case? (Hint: Look at the answers to questions 10 and 11.)
14.) What potential problems could you see in trying to identify protein function strictly by sequence alignment in the following scenario: A second protein exists with sequence related to only the first 200 residues of this protein. The BLAST results give this protein as the top hit.

Now imagine this incorrect label propagating to several other related proteins. A user could BLAST a sequence and get a number of hits, all of which have incorrect labels stemming from one original error. Unfortunately, this scenario does happen. We must be careful when assessing function via alignments.

Part IV – PROSITE – Motifs & Weight Matrices
In lecture we have already been introduced to protein motifs: recurring patterns of amino acids that denote a characteristic such as a structure or function. Identifying protein function by scanning for known protein motifs can prove more reliable than sequence alignments. Go to the PROSITE website at http://ca.expasy.org/prosite/. Read the PROSITE User Manual Introduction under Documents. Under the Tools for PROSITE section, submit the NP_031939 sequence we explored above in BLAST.

15.)
How many hits were found for this sequence?
16.) What conditions are met for the profile identified?
17.) Which is better evidence of function: a profile or a pattern? What is the difference between the two?

Part V – SuperFam – Hidden Markov Models (HMM)
Read “A concrete Example” of an HMM at http://en.wikipedia.org/wiki/Hidden_Markov_Models . Then go to the SUPERFAMILY website at http://supfam.org/SUPERFAMILY/hmm.html . This website uses pre-trained HMMs to determine which family a protein falls into. Again paste the NP_031939 FASTA sequence into the sequence window. Change the notification to Browser and click submit. When “YOUR RUN IS COMPLETE” appears, click “here” to see the output.

18.) What domains does SUPERFAMILY identify?

There are alternative programs that predict superfamily based on HMMs. Each program may give different results depending on how it was trained.

Part VI – JAFA – Metaserver
Using one method of function prediction may be adequate. However, it is often better to use multiple methods at once and combine the results. JAFA (Joint Assembly of Function Annotations) is one such metaserver. Go to the JAFA meta server at http://jafa.burnham.org/ and use the NP_031939 FASTA sequence again.

19.) Do you receive any new information?
20.) Based on all the function prediction results, what can you say about our mystery protein? (What do we know and what do we not know?)

Due Mon Sept 18 by 5 PM - email to jdsander@iastate.edu