PLSC 411/611 - Genomics
Gene Discovery and Annotation
Data Analysis Assignment

The purpose of this assignment is for you to better understand the gene identification and annotation process. You will use a number of WWW sites for this project. As you visit these sites, you should try to learn as much as possible about them. This will certainly help the learning process.

What to turn in. Each group will turn in a written report. As you go through the assignment, you will see a series of 24 questions. Your report will consist of answers to these questions in addition to a table that is described below. Some of these questions will require a short answer; others will require full, lengthy explanations. All answers must be in the form of complete sentences. Preface each answer with the question itself. It will make grading easier.

Every group will be analyzing a BAC clone from soybean. These are Phase 3 BAC clones. What does Phase 3 mean? The following table from NCBI provides a good explanation.

Phase   Definition
0       One-to-few pass reads of a single clone (not contigs).
1       Unfinished, may be unordered, unoriented contigs, with gaps.
2       Unfinished, ordered, oriented contigs, with or without gaps.
3       Finished, no gaps (with or without annotations).
from: http://www.ncbi.nlm.nih.gov/HTGS/

In the table, “Unfinished” refers to the status of the sequencing of the BAC. Note that only Phase 3 BACs are considered to be finished. For hierarchical shotgun sequencing, all BACs are sequenced to Phase 3.

Okay, the first step is to download your data.

1. Go to the Project page at the class WWW site. The address is: http://www.ndsu.nodak.edu/instruct/mcclean/plsc411/project.htm

2. Find your group, and download the file containing the BAC DNA sequence.

There are a couple of things to note. First, this is a .txt (or flat) file without any of the hidden information that you would have with a word processing file. A good software program to use is called TextPad.
I would recommend that you download the program and use it whenever you are working with this file. Here is the link:
TextPad: http://www.textpad.com/

Next, it is important to become familiar with the data format. All DNA sequence files are in what is called “FASTA” format. The first line begins with the “>” symbol followed by descriptor information. The second and all following lines contain the DNA (or protein) sequence.

Now, you will start the analysis of the DNA. You will use the WWW version of the FGENESH software. FGENESH is commercially available gene discovery software that uses the Hidden Markov Model approach. It is considered to be the fastest and most reliable of all of the gene prediction software currently available. The software is sold by a company called Softberry, Inc. Use the following link to connect to the Softberry WWW site.
Softberry: http://www.softberry.com/berry.phtml

3. Running the FGENESH analysis.

To run FGENESH, you need to click on the “Run Programs Online” drop-down menu in the upper right of the Softberry homepage and then select the “GENE FINDING in Eukaryota” tab. On this page select the “FGENESH” link. This will load a page where you will insert your data.

As we discussed in class, gene prediction is a procedure based upon extrinsic (previously published data) and intrinsic (gene structure features) factors. FGENESH uses intrinsic data that is found in your sequence. The procedures used to discover the genes in the sequence data are based on a “training set”. A training set is a group of genes from which the software sets specific parameters that will be used to predict the presence of a gene. Because different species have slightly different parameters, it is important to select the appropriate training set.

4. Comparing the results of different training sets.

The first part of the assignment is to create a table that reports the following information for your BAC clone.

Table 1. Gene prediction for soybean BAC clone XXXXXXXX based on different species training sets

Training set                 Human   Dicots (Arabidopsis)   Glycine max   Monocot plants (Corn, Rice, Wheat, Barley)
Number of predicted genes
Number in + chain
Number in - chain

To do this, paste your BAC sequence into the “Paste nucleotide sequence here:” box. Then select the appropriate training set from those in the “Organism” section. The header sections in the table contain the information that you need to fill in.

5. Comparing the FGENESH output generated by the “Dicot plants (Arabidopsis)” and “Glycine max” training sets.

You will now compare, in detail, the output from the “Dicot plants (Arabidopsis)” and “Glycine max” training sets. To do this you need to copy the complete output from these analyses. I found this method to be a good way to store and view the output.

a. Open MS Word, create a new document, set all margins at “0.3”, and set the orientation to “Landscape”.
b. Create a table that consists of two columns and one row.
c. Copy the output of the “Dicot” training set into the left column, and copy the “Glycine max” training set output into the right column.
d. Change the font to 8 point, Courier New.

This formatting will allow you to view the data side-by-side. You will need to do that for the comparison. Here are the first questions that you are to consider and answer for the report.

1. What do the abbreviations “G”, “Str”, “Feature”, “Start”, “End”, “Orf”, “Len”, “TSS”, “CDSf”, “CDSi”, “CDSl”, and “PolA” represent?
2. Which training set generated the most gene models?
3. What differences did you observe between the two gene models that are built from the same region of the BAC clone? You should also consider the orientation and the positions of the transcription start and polyadenylation sites of the models when answering this question. What is unique about the models supported by only one of the two training sets?
4. What percentage of exons predicted by one training set were also predicted by the other training set? You need to calculate this for both training sets. (For example, you can state it like this: X% of the exons predicted by the Glycine max training set were also predicted by the Arabidopsis training set.)
5. What percentage of the complete genes predicted by one training set was also predicted by the other training set? You need to calculate this for both training sets.
6. What percentage of the complete genes predicted by one training set had the same structure predicted by the other training set? You need to calculate this for both training sets.

Note: I found using MS Excel was a good way to visualize the structure information that is in the FGENESH output. The structure information is found at the top of the output.

6. Now you need to determine if these gene models are really genes.

In class we talked about extrinsic data that can help predict and/or confirm gene models. Do you remember one source of extrinsic data? Talk among your group before proceeding. Use the class notes if you have to.

DON’T LOOK AHEAD!!! TRY TO FIGURE IT OUT!!!

ESTs represent expressed genes. If a predicted gene model is homologous to an EST, then it is confirmation that the gene model is most likely correct. So, where is a good source of EST information to perform this analysis? Also, what analytic tool would you use?

AGAIN, DON’T LOOK AHEAD!!! TRY TO FIGURE IT OUT!!!

PlantGDB makes available a tremendous amount of EST data. They have collated this into PlantGDB EST Assemblies.
These assemblies are a collection of Plant Unique Transcripts (PUTs) that are derived by analyzing available EST sequence data and collapsing overlapping sequences into a single PUT. We discussed these in class. Those ESTs that cannot be overlapped with another to form a PUT are called singletons. For any one species, the collection of PUTs and singletons is called the Transcript Assembly. (I know the terminology is a bit confusing, but you will get used to it.) As you are probably now aware, the BAC clone you are working with is from soybean. Therefore, it is appropriate that you use the Soybean Gene Index.

The second question regards the analytical tool to use. We discussed the Blast algorithms in class. Do you remember these? Blast lets you compare a sequence with a database and determine if the database contains a homolog. It will also put a statistical value (the E-value) on the similarity between the query and the hit.

You will now perform a Blast analysis on the soybean PUT Assemblies at PlantGDB to test the validity of some of the gene models predicted by the two training sets. First, go to the “PlantGDB EST Assemblies - Overview” page at PlantGDB. The URL is: http://www.plantgdb.org/prj/ESTCluster/. The EST Assemblies are built from individual EST sequences. Sequences that overlap are built into a longer sequence. These assemblies were the earliest representation of gene sequences for many species. Read the details on the Overview page to find out more about this data collection. Then select “BLAST Server” under the “Related Links” section.

Now you are to perform a Blast search with five protein sequences that are predicted by your FGENESH HMM analysis. Pick at least one unique gene from each training set. You will perform the Blast analysis against the soybean Plant TAs.

a. First, you need to get the amino acid sequence. You can copy this from the bottom of the FGENESH output. Only take the amino acid sequence; don’t take the header information.
For some reason, the header layout as exported by FGENESH is not compatible with the Blast software.
b. On the “Plant GDB BLAST” page type “Glycine max” into the space under “Individual Species”. Click on the term “Glycine Max” and check the “PUT” box.
c. Now paste the sequence into the box that says “Paste query sequence(s) in FASTA format”.
d. Next, select tblastn as the “Program”.
e. And finally, click on the “Run BLAST” button.

For your report answer these questions.

7. What criteria are you using to determine if the gene model is supported by PUT EST assembly data?
8. Of the five models you analyzed, how many were supported by PUT EST data?
9. If the model is not supported by the PUT EST data, is it necessarily true that the model is incorrect? Please explain your answer.
10. Is protein data the only form the query might take? Is another form better? Why? Why do you think the protein sequence rather than the nucleotide sequence is used?
11. Perform the analysis with three corresponding nucleotide sequences. (Note: You need to use the blastn algorithm for this analysis.) Do you observe any differences with your results? If so, why do you think this is the case?
12. Which sequence would you believe is more accurate, the PUT derived from the ESTs or the BAC sequencing? Why?
13. Of those that were supported, can they be annotated?
14. What name would you give to the genes that can be annotated?

7. Are the soybean BAC genes found in model plant species?

Next, it is important to determine if your gene is also present in model species. The two model plant species for which we have full genomic information are Arabidopsis and rice. To determine if the soybean genes you just analyzed have a homolog in another species, you will use two other WWW sites. The first is The Arabidopsis Information Resource (TAIR: http://www.arabidopsis.org/index.jsp) and the second is the Rice Genome Annotation Project (http://rice.plantbiology.msu.edu/).
At each of these sites, here is what you are to do: run a blastp analysis to determine if a homolog to each of the five soybean BAC genes is present in the genomes of the model plant species Arabidopsis and rice. I will let you figure out how to set up the Blast search, but remember to select the appropriate program, select the appropriate database, and turn off the filter.

For this assignment, please answer these questions.

15. Were homologs discovered for the five soybean genes in these two other species?
16. What criteria did you use to determine that these were indeed homologs?
17. Use the information here to annotate the soybean genes.
18. Are the annotations the same as what you used from the TIGR Soybean Gene Index information?
19. If not, why might they have been different?

8. Are the soybean BAC genes found in other plant species?

The last aspect of the assignment is to determine if the five gene models represent conserved genes that are present in other plant or non-plant species. The home for all gene and protein sequences is GenBank. This information is stored at the National Center for Biotechnology Information (NCBI). The URL for NCBI is: http://www.ncbi.nlm.nih.gov/. If you have not visited this site before, I suggest that you spend some time here to see what is available.

Again you are going to analyze the five BAC models you have worked with so far. Here are detailed instructions on how to perform a Blast analysis at NCBI.

a. Go to the NCBI homepage and click on the “BLAST” link.
b. You will be performing a blastp analysis. From this page, select the “protein blast” link. This will link you to a page where you will select the blastp algorithm. (You should study the Blast page for future reference.)
c. Copy your protein sequence (again without the header) into the “Enter accession number, gi, or FASTA sequence” box.
d. We will want to search against all plant species present in GenBank. In the “Organism” box enter “Viridiplantae (taxid:33090)”.
e. To view the Blast parameters, click on “Algorithm parameters”.
f. Now hit the “BLAST” button. This will start the analysis and produce your results.

Please answer these final questions for each gene.

20. What is the range of E-values for the top 50 hits?
21. What species does this analysis suggest have homologs to your soybean query?
22. Study the annotation for these hits. Now compare the annotation here with your annotation based on the TIGR TCs and the model species (Arabidopsis and rice) analysis.
23. Based on all of the searches you have performed, provide a name for each of the five genes you are working with here.
24. What evidence was most instructive for you during the naming process?
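
For question 20, the E-value range can be read from a saved hit table rather than tallied by hand. The Python sketch below is only an illustration and is not a required part of the assignment. It assumes that the hits were saved as tab-separated text in the common 12-column BLAST tabular layout, with the E-value in the 11th column, and that the rows are already ordered best hit first. The function name evalue_range and the sample rows are made up for this example; they are not real results.

```python
def evalue_range(lines, top_n=50):
    """Return (smallest, largest) E-value among the first top_n hits.

    Assumption: each hit is a tab-separated row with the E-value in
    column 11 (index 10). Blank lines and lines starting with '#'
    are skipped.
    """
    evalues = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        evalues.append(float(fields[10]))
        if len(evalues) == top_n:
            break
    if not evalues:
        raise ValueError("no hits found")
    return min(evalues), max(evalues)


# Made-up hit rows for illustration only (not real BLAST results).
sample = [
    "# hypothetical hit table",
    "query1\thitA\t98.5\t200\t3\t0\t1\t200\t1\t200\t1e-120\t430",
    "query1\thitB\t75.0\t180\t45\t2\t5\t184\t10\t189\t3e-60\t220",
    "query1\thitC\t40.2\t150\t80\t5\t20\t169\t33\t182\t2e-05\t48.1",
]

low, high = evalue_range(sample)
print(f"E-values range from {low:g} to {high:g}")
```

To use this on a saved file, pass an open file handle instead of the sample list, for example evalue_range(open("hits.tsv")), where hits.tsv is a placeholder file name.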