GEP Gene Model Checker User Guide Author Wilson Leung wleung@wustl.edu Document History Initial Draft First Revision Second Revision Third Revision Fourth Revision Fifth Revision Sixth Revision Current Version 06/04/2007 01/13/2009 01/20/2009 08/27/2010 08/14/2011 08/05/2013 08/21/2015 12/28/2015 Table of Contents Author .................................................................................................................................... 1 Document History ................................................................................................................... 1 Introduction............................................................................................................................ 2 Acknowledgements .................................................................................................................. 2 Questions about the Gene Model Checker ............................................................................... 2 Availability .............................................................................................................................. 2 General Overview .................................................................................................................... 2 Configuring the Gene Model ................................................................................................... 3 Getting Help ..................................................................................................................................... 3 Available Fields in the Gene Model Checker Configuration Panel ...................................................... 3 Gene Model Checker Results ................................................................................................... 5 Checklist ........................................................................................................................................... 6 Dot Plot ............................................................................................................................................ 7 Transcribed Sequence ........................................................................................................................ 7 Peptide Sequence .............................................................................................................................. 7 Extracted Coding Exons..................................................................................................................... 7 Downloads ........................................................................................................................................ 7 Using the Gene Model Checker to Identify Annotation Errors................................................. 8 Submitting our initial gene model ..................................................................................................... 8 Examine the Checklist ....................................................................................................................... 9 1 Examine the Checklist to identify problems in our gene model .......................................................... 9 Checking the dot plot and the global protein alignment ................................................................... 14 Download the files generated by the Gene Model Checker ............................................................... 16 Verifying Gene Models with Consensus Errors ....................................................................... 16 Checking the Thd1 Gene Model with Known Errors in the Consensus ............................................. 17 Checking the Thd1 Gene Model with the Modified Consensus Sequence .......................................... 18 Introduction Before you can submit an annotation project to the Genomics Education Partnership (GEP), you must first validate the gene models and generate three additional data files for all the gene models in your project (i.e. a file in the General Feature Format (GFF), a file containing all the transcribed sequences, and a file containing all the predicted peptide sequences). Based on the recommendations from GEP faculty members, we have created a gene model validation tool that may assist GEP students in their annotation efforts. This user guide contains an overview of the program and some examples on how to use this program in practice. Acknowledgements The Gene Model Checker is developed by Wilson Leung at Washington University in St. Louis as part of the Genomics Education Partnership (GEP) project. Questions about the Gene Model Checker Please contact Wilson at wleung@wustl.edu if you have any questions or encounter any problems with the Gene Model Checker. Availability The Gene Model Checker is available in the “Annotation Resources” section under the “Projects” menu at the GEP web site (http://gep.wustl.edu). General Overview The Gene Model Checker interface is divided into two main panels: the left panel allows you to specify the characteristics of the gene model you would like to verify while the right panel displays the results generated by the Gene Model Checker. In order to verify a gene model, you must provide the Gene Model Checker with the project sequence file in FASTA format, the name of the protein ortholog in D. melanogaster, the coordinates of the coding exons in a comma-separated list, the project group and the project name (Figure 1). 2 Figure 1 The graphical interface for the Gene Model Checker Configuring the Gene Model Getting Help A context-sensitive tooltip will appear when you select each field. These tooltips provide additional information (such as the required format) on each field. As you enter values into each field, the Gene Model Checker will validate the value and an exclamation point icon will appear on the right side of each field that failed validation. Available Fields in the Gene Model Checker Configuration Panel Field Contig Sequence File Description A file containing your unmasked sequence in FASTA format. For GEP annotation projects, the sequence file is included in the annotation package (inside the src directory). You can also create the file using a text editor and the “get DNA” functionality of the UCSC Genome Browser (available under the “View” menu item on the main navigation bar). Errors in Consensus Sequence Select “Yes” if there are errors in the project consensus sequence. Otherwise, select “No”. Files with Changes to the Consensus Sequence (Optional) If the project contains errors in the consensus sequence, provide a Variant Call Format (VCF) file that describes the changes to the consensus sequence. You can create this file using the Sequence Updater tool available on the GEP web site. Ortholog in D. melanogaster The name of the D. melanogaster protein isoform that is the putative ortholog of your gene model. 3 Coding Exons Coordinates A comma-delimited list of coordinates that corresponds to the translated exons in your gene model. The complete set of coordinates should encompass the region that begins with the start codon and ends just before the stop codon. For example the entry for a gene with two CDS will appear as follows: 16515-17504, 18473-19744 While you could enter the coordinates in any order, we recommend that you list the coordinates in the order in which the exons are translated (i.e. from 5’ to 3’). Annotated Untranslated Regions Select “Yes” if you have annotated untranslated exons (UTRs) in addition to the coding regions. Otherwise, select “No”. Transcribed Exon Coordinates (Optional) If you have annotated UTRs and selected “Yes” in the previous field, a new text box will appear where you can enter a commadelimited list of coordinates that correspond to the transcribed exons in your gene model. The complete set of coordinates should encompass the region that begins with the transcription start site and ends with the transcription end site. Orientation of Gene Relative The orientation of the gene relative to your project sequence to Query Sequence (Plus or Minus). Stop Codon Coordinates If this field is empty when you select this field, the Gene Model Checker will infer the stop codon coordinates (for complete genes or genes with only the 5’ end missing) based on the coding exon coordinates. You can change the coordinates by specifying the stop codon coordinates in the following format (start-end). For example, the entry for a stop codon on the positive strand will appear as follows: 19745-19747. Note that there should be no overlap between the stop codon coordinates and the coding exon coordinates. 4 Completeness of Gene Model Translation Complete The coordinates in the Coding Exons Coordinates field span from the start codon to just before the stop codon. Partial The coordinates in the Coding Exons Coordinates field only encompass part of the coding region of the gene. Region Missing Missing 5’ End of Translated Region Partial gene without the start codon but contains the stop codon. Missing 3’ End of Translated Region Partial gene without the stop codon but contains the start codon. Missing both 5’ and 3’ ends of Translated Region: Partial gene with neither the start nor the stop codons. Start Phase Number of bases (0, 1, or 2) before the first complete codon in your first CDS. This field only applies to partial genes where the 5’ end is missing. Project Group The region of the genome you are annotating. This value corresponds to the “genome” and “assembly” fields on the GEP UCSC Genome Browser mirror gateway page. Project Name The name of your annotation project. This value corresponds to the “search term” field on the GEP UCSC Genome Browser mirror gateway page. Examples: contig10, fosmid10, fosmid_1131J07 Gene Model Checker Results The results panel is divided into six sections (Figure 2). You can click on the tabs to switch to each section. For sections that contain a grid (i.e. Checklist and the Extracted Coding Exons sections), you can use the buttons labeled “Expand All” and “Collapse All” to control the visibility of all the rows within the grid. The expanded sections will provide you with additional information about a feature (for example, the sequences surrounding a splice site). Some of the grids have a magnifying glass icon next to each row. Clicking on this icon will bring up the corresponding region in the GEP UCSC Genome Browser. (You can also right click on the magnifying glass icon and then open the Genome Browser in a new web browser tab or window.) 5 Figure 2 The Gene Model Checker results panel Checklist The checklist contains the list of simple checks that were used to verify your gene model. For complete genes, the program checks for the presence of the start and stop codons, canonical donor and acceptor splice sites (GT, AG), and the presence of stop codons in the translated peptide sequence. For your reference, the five bases immediately before and after the start codon, splice donor sites, splice acceptor sites, and stop codons are provided in the expanded output of each row. The feature is highlighted in red and the surrounding sequences are highlighted in blue (Figure 3). This expanded section is useful when you are trying to correct minor errors in the exon or CDS coordinates (e.g., correct off-by-one errors). Figure 3 Splice donor site is highlighted in red and the surrounding five bases are highlighted in blue in the checklist Each checklist item has four possible values: Status Explanation Pass The submitted model passes the criteria. Warn The submitted model contains unusual features (e.g., non-canonical splice donor site) that require additional explanations in the GEP Annotation Report. Fail The submitted model fails the criteria. You should provide a detailed explanation as to why the proposed gene model fails the checklist item in the GEP Annotation Report. Skip The Gene Model Checker did not check this criterion. For example, the Gene Model Checker will skip the check for the acceptor site in the first coding exon if it already checks for the presence of the start codon. 6 Note: There might be valid reasons for failing some of the tests on the checklist (e.g., noncanonical splice sites, stop codon read through). Please provide an explanation for any failed tests or warnings in the GEP Annotation Report. Dot Plot This section contains a graphical dot plot of the peptide sequence from your annotated gene model compared against the orthologous peptide from D. melanogaster. It contains a link to the Dot Plot Viewer where you can adjust the parameters (e.g., scoring matrix, word size) used to create the dot plot. It also contains a link to the global alignment between the protein sequences derived from the submitted gene model and the orthologous protein from D. melanogaster. Transcribed Sequence Sequences corresponding to the exon coordinates specified by the user are extracted and concatenated together. This is one of the sequences required for annotation project submission. Sequences in the reverse strand relative to the project sequence are reverse complemented. Peptide Sequence A conceptual peptide translation based on the CDS coordinates specified by the user. Sequences corresponding to the set of coding exons are extracted and concatenated together. The concatenated sequence is translated using the standard genetic code. This is one of the sequences required for annotation project submission. Genes in the reverse strand relative to the input sequence are reverse complemented prior to translation. Extracted Coding Exons This section contains a list of all the coding exons and their translations based on the Coding Exon Coordinates provided by the user. This section is particularly useful for identifying coding exons with in-frame stop codons that are usually caused by a pair of incompatible donor and acceptor splice sites. Downloads This section allows you to download the files generated by the Gene Model Checker. For each project, you will need to save all three output files for each isoform you have annotated. In preparation for final project submission, you could use the Annotation Files Merger tool (available on the GEP web site) to concatenate all the GFF files into a single project GFF file, all the transcript files into a project transcript file, and all the peptide sequence files into a project peptide sequence file. 7 Using the Gene Model Checker to Identify Annotation Errors This section illustrates how you can use the Gene Model Checker to assist you in your gene annotation efforts. Submitting our initial gene model A common application of the Gene Model Checker is to use it to confirm splice donor and acceptor sites. In this example, we will use the Gene Model Checker to help us identify and correct problems in a gene model. Specifically, we will verify the coding region of a complete gene model Dyrk3-PA in contig1 of a D. erecta dot chromosome project. Navigate to Gene Model Checker at the GEP Web Framework (GEP Home Page è Projects è Annotation Resources è Gene Model Checker) Enter the following into the Gene Model Checker form (Figure 4): Field Contig Sequence File Errors in Consensus Sequence? Ortholog in D. melanogaster Coding Exons Coordinates Annotated Untranslated Regions? Orientation of Gene Relative to Query Sequence Completeness of Gene Model Translation Stop Codon Coordinates Project Group Project Name Value contig1.fasta No Dyrk3-PA 6776-7187, 7255-8115, 10127-10640, 10780-11130, 11191-11474 No Plus Complete 11475-11477 D. erecta Dot contig1 Figure 4 Enter our initial gene model into the Gene Model Checker Click on the “Verify Gene Model” button to check our gene model. 8 Examine the Checklist The checklist reports multiple problems in our gene model: it found in-frame stop codons in the middle of our translated peptide sequence, it cannot locate the consensus splice acceptor sites for CDS 2 and 3, and it cannot locate the consensus splice donor site for CDS 3 (Figure 5). We will tackle each of these problems in order to create a valid gene model. Figure 5 The Gene Model Checker identified multiple problems in our initial gene model Click on the “Peptide Sequence” tab to open the section. Notice that there are multiple stop codons (‘*’ characters) throughout the translated sequence (Figure 6). In-frame stop codons in the translation can often be attributed to incompatible donor and acceptor splice sites (because it introduces a frame shift). We can use the checklist to help us identify the splice sites that are incompatible with each other. Figure 6 Multiple stop codons in the conceptual peptide translation of our gene model Examine the Checklist to identify problems in our gene model Click on the “Checklist” tab to return to the checklist section. Then click the “Expand All” button to examine each of the items in the checklist in detail. In CDS 1, we notice the CDS 9 begins with an ATG (methionine), as we would expect. The two bases after the end of CDS 1 (highlighted in red) is “GT”, the canonical donor splice site. Examining the items further down the checklist, we find that the acceptor site for CDS 2 has the non-canonical splice site CA instead of AG. In the expanded section for the acceptor of CDS 2, we notice that while the red bases at 7253-7254 have the sequence CA, the bases at 7254-7255 have the sequence AG (Figure 7). Therefore, it is likely that we have made a mistake when we record the coordinates and CDS 2 actually begins at 7256 instead of 7255. Figure 7 Using the expanded checklist section to identify incorrect splice acceptor site for CDS 2 An additional hint of the cause of the problem with the CDS annotation can be found in the “Extracted Coding Exons” section. After expanding the section for CDS 2 and examining the corresponding translation, we find that there are multiple stop codons in the translation. The fact that CDS 1 is in the correct reading frame while CDS 2 is in the wrong reading frame suggests that the donor site for CDS 1 is incompatible with the acceptor site for CDS 2 (Figure 8). Figure 8 CDS translation for CDS 2 contains multiple stop codons 10 Based on the results of our analysis, we decide to shift the starting coordinates of CDS 2 by one base (from 7255 to 7256). To make this change, go back to the “Configure Gene Model” panel and change the “Coding Exon Coordinates” from 7255-8115 to 7256-8115 (Figure 9). Figure 9 Change the starting coordinate of the second CDS from 7255-8115 to 7256-8115 Click on the “Verify Gene Model” button to check our new gene model. Looking at the updated checklist (Figure 10), we see that the problem with CDS 2 has been resolved. However, the problem with CDS 3 remains. Consequently, we will examine coordinates for CDS 3 next. From the checklist criteria “Check for in-frame stop codons in CDS”, we see that CDS 3 is the only coding exon that contains in-frame stop codons. The checklist also indicates that CDS 3 uses non-canonical splice donor and acceptor sites. Figure 10 The Gene Model Checker Checklist after correcting the splice site boundaries for CDS 2 shows our gene model contains additional errors 11 Instead of using the surrounding sequence feature to diagnose the issue, we will use the custom track feature to identify the problems with the donor and acceptor splice sites for CDS 3. Click on the icon next to “Acceptor for CDS 3”. A new window will open showing the GEP UCSC Genome Browser view of this region (Figure 11). Figure 11 Genome Browser view of the putative acceptor site for CDS 3 We notice from the Genome Browser view that our gene model (shown under the track title “Custom Gene Model”) differs from the blastx track and the other gene predictors (e.g., Genscan, Twinscan, Geneid) by a single nucleotide. There is also a high quality splice acceptor site at 1012610127 (1bp away from our annotated splice site). Furthermore, we notice that the CDS 3 in our model has an “AA” splice acceptor site instead of the canonical “AG” splice acceptor site. Based on the evidence we have collected, we can conclude that we have made a mistake when annotating the acceptor site for CDS 3: CDS 3 should begin at position 10128 instead of 10127. Now that we have identified the problem with the acceptor for CDS 3, we will examine the donor for exon 3, which also fails the Gene Model Checker Checklist. Close the window and click on the icon next to “Donor for CDS 3”. A new window should open that shows the GEP UCSC Genome Browser view of this region (Figure 12). Figure 12 Evidence from the blastx alignment and other gene predictors shows an open reading frame that extends beyond our annotated CDS 12 The Genome Browser view shows that there is a potential problem with our gene model in this region. The blastx D. melanogaster protein alignment and results from multiple gene predictors indicate that CDS 3 should be substantially longer than our model. Consequently, we need to go back to our bl2seq alignments and gene prediction results to verify that we have correctly identify the coordinates for this CDS. After running bl2seq of the corresponding CDS from Dyrk3 in D. melanogaster (3_2380_0) against contig1, we see from the alignment that CDS 3 should end at around 10727 (Figure 13). Figure 13 bl2seq (blastx) alignment between D. melanogaster CDS Dyrk3:3_2380_0 and contig1 shows that CDS 3 should end at around 10,727 Examination of the region surrounding 10727 using the genome browser allows us to conclude that the CDS should actually end at 10728 (Figure 14). Figure 14 CDS 3 should end at 10,728 instead of 10,640 Our analysis suggests we should revise the coordinates for CDS 3 to 10128-10728. Click on the “Verify Gene Model” model to check our new gene model. We now have a gene model that passes all the validation tests in the checklist (Figure 15). 13 Figure 15 After fixing the problems with CDS 2 and CDS 3, the final gene model passes all the items on the Checklist Note: We could also have built the gene model by adding each exon incrementally instead of submitting the complete gene model. Simply change the “Completeness of Gene Model Translation” section to “Missing 3’ End of Translated Region” until we have constructed the entire gene model. Checking the dot plot and the global protein alignment In addition to the checklist, you can also use the dot plot and the protein alignment to verify your gene model. Click on the “Dot Plot” tab to examine the dot plot between the protein translation for your gene model and the putative D. melanogaster ortholog Dyrk3-PA (Figure 16). Figure 16 Dot plot alignment between the D. melanogaster protein Dyrk3-PA (x-axis) and the translation from the submitted model (y-axis). Each dot in the dot plot represents regions of similarity between the two sequences. 14 Looking at the dot plot, we see that there is a long diagonal line with a slope of approximately 1. This suggests that the two sequences have approximately the same lengths and that they are mostly similar to each other. The alternating colors in the dot plot represent the approximate location of each exon. We also see that there is weaker sequence homology at the beginning of the protein compared to the rest of the protein. Note: For gene models that show weak sequence similarity with the D. melanogaster ortholog, you can click on the “View dot plot in the Dot Plot Viewer” link to view the dot plot in the Dot Plot Viewer. The Dot Plot Viewer allows you to control the dot plot alignment parameters (e.g., scoring matrix, word size, and neighborhood) in order to alter the sensitivity and the specificity of the dot plot (Figure 17). Figure 17 Change the parameters in the Dot Plot Viewer to control the sensitivity of the dot plot. Go back to the Gene Model Checker page. Click on the link “View protein alignment” to see the global alignment between the two protein sequences. Our observation of the weaker sequence homology at the beginning of the protein on the dot plot is consistent with the global alignment. In particular, there is a large gap at the beginning of the alignment (Figure 18). Figure 18 Large gap at the beginning of the protein alignment between Dyrk3-PA and the submitted gene model. 15 However, after checking the blastx alignment of CDS 1_2380_0 against contig1, we see that the large gap is in an internal part of the first CDS. Hence the gap likely reflects a genuine difference between the two species and it is not caused by an error in our annotation (Figure 19). Figure 19 bl2seq alignment between the first CDS (1_2380_0) from the D. melanogaster Dyrk3-PA (subject) and the contig1 sequence (query) shows that the D. erecta CDS has a few additional residues in its first CDS compared to its D. melanogaster ortholog. Download the files generated by the Gene Model Checker Now that we have a completed gene model, we should save the files generated by the Gene Model Checker in preparation for project submission. Click on the “Downloads” tab to see links to the three files generated by the Gene Model Checker (Figure 20). Figure 20 Right-click on the links and choose “Save Link As...” or “Download Linked File As…” to save the files generated by the Gene Model Checker Right-click on the links and choose “Save Link As…” or “Download Linked File As…” to save the files onto your computer. When preparing files for project submission, you will need to concatenate the output from all the genes you have annotated and generate three files: one file containing all the GFF entries, another file with all the transcript sequences, and the third file with all the peptide sequences. Please refer to the “Submit Annotation Projects” document (available on the GEP web site under “Help” è “Documentations” è “Web Framework”) for more information on how to prepare the files for project submission. Verifying Gene Models with Consensus Errors The Baylor College of Medicine Human Genome Sequencing Center (BCM-HGSC) has sequenced eight Drosophila species as part of the modENCODE project. These assemblies have not 16 been manually improved and we have found many potential errors in the consensus sequence. Because of the known weaknesses of the 454 sequencing technology in resolving homopolymer runs, many of these potential errors are base insertions or deletions. These base insertions and deletions would often introduce spurious frame shifts into the gene annotations. To address this issue, the GEP has developed the Sequence Updater tool to enable students to document errors in the consensus sequence. The Sequence Updater will generate a file in the Variant Call Format (VCF) that describes changes to the original project sequence in a standardized format. Additional information on the Sequence Updater is available through the Sequence Updater User Guide on the GEP web site (available under “Help” è “Documentations” è “Web Framework”). The Gene Model Checker has been updated so that it will take the sequence changes specified in the VCF file into account when it validates a gene model. Checking the Thd1 Gene Model with Known Errors in the Consensus The Sequence Updater User Guide contains an example of two base deletions (at 21,412 and at 21,431) in a region of the D. biarmipes contig32 sequence that contains the putative ortholog of the D. melanogaster gene Thd1. In this tutorial, we will incorporate the VCF file generated by the Sequence Updater into our annotation of the D. biarmipes Thd1 ortholog. If we were to check the gene model for Thd1-PA using the original contig32 sequence, we find that there are multiple stop codons in the last coding exon because errors in the project consensus sequence would introduce frame shifts into the conceptual translation (Figure 21). Furthermore, the amino acid sequences at the C-terminus would not be conserved with the D. melanogaster gene model (Figure 22). Figure 21 Gene Model Checker reports multiple in-frame stop codons and incorrect length of the translated region because of errors in the D. biarmipes contig32 project sequence 17 Figure 22 Dot plot of the D. melanogaster Thd1-PA protein (x-axis) against the submitted D. biarmipes gene model (y-axis) show a lack of sequence similarity at the C-terminus because of errors in the contig32 sequence Checking the Thd1 Gene Model with the Modified Consensus Sequence In contrast, if we include the corrections to the contig32 sequence (specified by the VCF file contig32_changes.vcf.txt), the Gene Model Checker will automatically adjust the Coding Exons Coordinates so that it is relative to the revised sequence and verify the gene model using these revised coordinates. This allows us to specify the “Coding Exons Coordinates”, the “Transcribed Exons Coordinates” and the “Stop Codon Coordinates” relative to the original project sequence. In this case, we will enter the following into the Gene Model Checker form: Field Contig Sequence File Errors in Consensus Sequence? File with Changes to the Consensus Sequence Ortholog in D. melanogaster Coding Exons Coordinates Annotated Untranslated Regions? Orientation of Gene Relative to Query Sequence Completeness of Gene Model Translation Stop Codon Coordinates Project Group Project Name Value contig32.fasta Yes contig32_changes.vcf.txt Thd1-PA 31271-29686, 26745-26113, 25565-25400, 2530224940, 22112-19559 No Minus Complete 19558-19556 D. biarmipes Dot (Aug. 2012 (deprecated)) contig32 18 Figure 23 Verify the Thd1-PA gene model with sequence changes described in the file contig32_changes.vcf.txt The Gene Model Checker will issue a warning on the Checklist with the message “Modified Consensus Sequence” that indicates the original sequence has changed based on the VCF file (Figure 23). Using the revised sequence, the protein alignment and the dot plot show the last CDS of Thd1-PA is conserved between the D. melanogaster and its D. biarmipes ortholog (Figure 24). Figure 24 Dot plot of the D. melanogaster Thd1-PA protein (x-axis) against the submitted D. biarmipes gene model (y-axis) show sequence similarity at the C-terminus after applying the VCF file to the contig32 project sequence 19 In order to maintain the ability for the Gene Model Checker to be used as a tool for diagnosing errors in the submitted gene model, the Gene Model Checker will report two different sets of coordinates when you provide a VCF file. The GFF and custom tracks will show the coordinates relative to the original project sequence so that they are consistent with the rest of the evidence tracks on the GEP UCSC Genome Browser. In contrast, the expanded section of the Checklist and the Extracted Coding Exons will report the coordinates relative to the revised sequence. 20