GEP Gene Model Checker User Guide

advertisement
GEP Gene Model Checker User Guide
Author
Wilson Leung
wleung@wustl.edu
Document History
Initial Draft
First Revision
Second Revision
Third Revision
Fourth Revision
Fifth Revision
Sixth Revision
Current Version
06/04/2007
01/13/2009
01/20/2009
08/27/2010
08/14/2011
08/05/2013
08/21/2015
12/28/2015
Table of Contents
Author .................................................................................................................................... 1
Document History ................................................................................................................... 1
Introduction............................................................................................................................ 2
Acknowledgements .................................................................................................................. 2
Questions about the Gene Model Checker ............................................................................... 2
Availability .............................................................................................................................. 2
General Overview .................................................................................................................... 2
Configuring the Gene Model ................................................................................................... 3
Getting Help ..................................................................................................................................... 3
Available Fields in the Gene Model Checker Configuration Panel ...................................................... 3
Gene Model Checker Results ................................................................................................... 5
Checklist ........................................................................................................................................... 6
Dot Plot ............................................................................................................................................ 7
Transcribed Sequence ........................................................................................................................ 7
Peptide Sequence .............................................................................................................................. 7
Extracted Coding Exons..................................................................................................................... 7
Downloads ........................................................................................................................................ 7
Using the Gene Model Checker to Identify Annotation Errors................................................. 8
Submitting our initial gene model ..................................................................................................... 8
Examine the Checklist ....................................................................................................................... 9
1
Examine the Checklist to identify problems in our gene model .......................................................... 9
Checking the dot plot and the global protein alignment ................................................................... 14
Download the files generated by the Gene Model Checker ............................................................... 16
Verifying Gene Models with Consensus Errors ....................................................................... 16
Checking the Thd1 Gene Model with Known Errors in the Consensus ............................................. 17
Checking the Thd1 Gene Model with the Modified Consensus Sequence .......................................... 18
Introduction
Before you can submit an annotation project to the Genomics Education Partnership (GEP), you
must first validate the gene models and generate three additional data files for all the gene models
in your project (i.e. a file in the General Feature Format (GFF), a file containing all the transcribed
sequences, and a file containing all the predicted peptide sequences). Based on the
recommendations from GEP faculty members, we have created a gene model validation tool that
may assist GEP students in their annotation efforts. This user guide contains an overview of the
program and some examples on how to use this program in practice.
Acknowledgements
The Gene Model Checker is developed by Wilson Leung at Washington University in St. Louis
as part of the Genomics Education Partnership (GEP) project.
Questions about the Gene Model Checker
Please contact Wilson at wleung@wustl.edu if you have any questions or encounter any problems
with the Gene Model Checker.
Availability
The Gene Model Checker is available in the “Annotation Resources” section under the
“Projects” menu at the GEP web site (http://gep.wustl.edu).
General Overview
The Gene Model Checker interface is divided into two main panels: the left panel allows you to
specify the characteristics of the gene model you would like to verify while the right panel displays
the results generated by the Gene Model Checker.
In order to verify a gene model, you must provide the Gene Model Checker with the project
sequence file in FASTA format, the name of the protein ortholog in D. melanogaster, the
coordinates of the coding exons in a comma-separated list, the project group and the project name
(Figure 1).
2
Figure 1 The graphical interface for the Gene Model Checker
Configuring the Gene Model
Getting Help
A context-sensitive tooltip will appear when you select each field. These tooltips provide
additional information (such as the required format) on each field. As you enter values into each
field, the Gene Model Checker will validate the value and an exclamation point icon will appear
on the right side of each field that failed validation.
Available Fields in the Gene Model Checker Configuration Panel
Field
Contig Sequence File
Description
A file containing your unmasked sequence in FASTA format.
For GEP annotation projects, the sequence file is included in the
annotation package (inside the src directory). You can also create
the file using a text editor and the “get DNA” functionality of the
UCSC Genome Browser (available under the “View” menu item
on the main navigation bar).
Errors in Consensus
Sequence
Select “Yes” if there are errors in the project consensus sequence.
Otherwise, select “No”.
Files with Changes to the
Consensus Sequence
(Optional)
If the project contains errors in the consensus sequence, provide
a Variant Call Format (VCF) file that describes the changes to
the consensus sequence. You can create this file using the
Sequence Updater tool available on the GEP web site.
Ortholog in D. melanogaster
The name of the D. melanogaster protein isoform that is the
putative ortholog of your gene model.
3
Coding Exons Coordinates
A comma-delimited list of coordinates that corresponds to the
translated exons in your gene model. The complete set of
coordinates should encompass the region that begins with the
start codon and ends just before the stop codon.
For example the entry for a gene with two CDS will appear as
follows: 16515-17504, 18473-19744
While you could enter the coordinates in any order, we
recommend that you list the coordinates in the order in which
the exons are translated (i.e. from 5’ to 3’).
Annotated Untranslated
Regions
Select “Yes” if you have annotated untranslated exons (UTRs) in
addition to the coding regions. Otherwise, select “No”.
Transcribed Exon
Coordinates (Optional)
If you have annotated UTRs and selected “Yes” in the previous
field, a new text box will appear where you can enter a commadelimited list of coordinates that correspond to the transcribed
exons in your gene model.
The complete set of coordinates should encompass the region
that begins with the transcription start site and ends with the
transcription end site.
Orientation of Gene Relative The orientation of the gene relative to your project sequence
to Query Sequence
(Plus or Minus).
Stop Codon Coordinates
If this field is empty when you select this field, the Gene Model
Checker will infer the stop codon coordinates (for complete
genes or genes with only the 5’ end missing) based on the coding
exon coordinates.
You can change the coordinates by specifying the stop codon
coordinates in the following format (start-end). For example, the
entry for a stop codon on the positive strand will appear as
follows: 19745-19747.
Note that there should be no overlap between the stop codon
coordinates and the coding exon coordinates.
4
Completeness of Gene
Model Translation
Complete
The coordinates in the Coding Exons Coordinates field span
from the start codon to just before the stop codon.
Partial
The coordinates in the Coding Exons Coordinates field only
encompass part of the coding region of the gene.
Region Missing
Missing 5’ End of Translated Region
Partial gene without the start codon but contains the stop codon.
Missing 3’ End of Translated Region
Partial gene without the stop codon but contains the start codon.
Missing both 5’ and 3’ ends of Translated Region:
Partial gene with neither the start nor the stop codons.
Start Phase
Number of bases (0, 1, or 2) before the first complete codon in
your first CDS. This field only applies to partial genes where the
5’ end is missing.
Project Group
The region of the genome you are annotating. This value
corresponds to the “genome” and “assembly” fields on the GEP
UCSC Genome Browser mirror gateway page.
Project Name
The name of your annotation project. This value corresponds to
the “search term” field on the GEP UCSC Genome Browser
mirror gateway page.
Examples:
contig10, fosmid10, fosmid_1131J07
Gene Model Checker Results
The results panel is divided into six sections (Figure 2). You can click on the tabs to switch to
each section. For sections that contain a grid (i.e. Checklist and the Extracted Coding Exons
sections), you can use the buttons labeled “Expand All” and “Collapse All” to control the visibility
of all the rows within the grid. The expanded sections will provide you with additional
information about a feature (for example, the sequences surrounding a splice site). Some of the
grids have a magnifying glass icon next to each row. Clicking on this icon will bring up the
corresponding region in the GEP UCSC Genome Browser. (You can also right click on the
magnifying glass icon and then open the Genome Browser in a new web browser tab or window.)
5
Figure 2 The Gene Model Checker results panel
Checklist
The checklist contains the list of simple checks that were used to verify your gene model. For
complete genes, the program checks for the presence of the start and stop codons, canonical donor
and acceptor splice sites (GT, AG), and the presence of stop codons in the translated peptide
sequence. For your reference, the five bases immediately before and after the start codon, splice
donor sites, splice acceptor sites, and stop codons are provided in the expanded output of each
row. The feature is highlighted in red and the surrounding sequences are highlighted in blue
(Figure 3). This expanded section is useful when you are trying to correct minor errors in the exon
or CDS coordinates (e.g., correct off-by-one errors).
Figure 3 Splice donor site is highlighted in red and the surrounding five bases are highlighted in blue in the checklist
Each checklist item has four possible values:
Status Explanation
Pass
The submitted model passes the criteria.
Warn
The submitted model contains unusual features (e.g., non-canonical
splice donor site) that require additional explanations in the GEP
Annotation Report.
Fail
The submitted model fails the criteria. You should provide a detailed
explanation as to why the proposed gene model fails the checklist item
in the GEP Annotation Report.
Skip
The Gene Model Checker did not check this criterion. For example, the
Gene Model Checker will skip the check for the acceptor site in the first
coding exon if it already checks for the presence of the start codon.
6
Note: There might be valid reasons for failing some of the tests on the checklist (e.g., noncanonical splice sites, stop codon read through). Please provide an explanation for any failed
tests or warnings in the GEP Annotation Report.
Dot Plot
This section contains a graphical dot plot of the peptide sequence from your annotated gene
model compared against the orthologous peptide from D. melanogaster. It contains a link to the
Dot Plot Viewer where you can adjust the parameters (e.g., scoring matrix, word size) used to create
the dot plot. It also contains a link to the global alignment between the protein sequences derived
from the submitted gene model and the orthologous protein from D. melanogaster.
Transcribed Sequence
Sequences corresponding to the exon coordinates specified by the user are extracted and
concatenated together. This is one of the sequences required for annotation project submission.
Sequences in the reverse strand relative to the project sequence are reverse complemented.
Peptide Sequence
A conceptual peptide translation based on the CDS coordinates specified by the user. Sequences
corresponding to the set of coding exons are extracted and concatenated together. The
concatenated sequence is translated using the standard genetic code. This is one of the sequences
required for annotation project submission. Genes in the reverse strand relative to the input
sequence are reverse complemented prior to translation.
Extracted Coding Exons
This section contains a list of all the coding exons and their translations based on the Coding
Exon Coordinates provided by the user. This section is particularly useful for identifying coding
exons with in-frame stop codons that are usually caused by a pair of incompatible donor and
acceptor splice sites.
Downloads
This section allows you to download the files generated by the Gene Model Checker. For each
project, you will need to save all three output files for each isoform you have annotated. In
preparation for final project submission, you could use the Annotation Files Merger tool (available
on the GEP web site) to concatenate all the GFF files into a single project GFF file, all the
transcript files into a project transcript file, and all the peptide sequence files into a project peptide
sequence file.
7
Using the Gene Model Checker to Identify Annotation Errors
This section illustrates how you can use the Gene Model Checker to assist you in your gene
annotation efforts.
Submitting our initial gene model
A common application of the Gene Model Checker is to use it to confirm splice donor and
acceptor sites. In this example, we will use the Gene Model Checker to help us identify and
correct problems in a gene model. Specifically, we will verify the coding region of a complete gene
model Dyrk3-PA in contig1 of a D. erecta dot chromosome project.
Navigate to Gene Model Checker at the GEP Web Framework (GEP Home Page è Projects è
Annotation Resources è Gene Model Checker)
Enter the following into the Gene Model Checker form (Figure 4):
Field
Contig Sequence File
Errors in Consensus Sequence?
Ortholog in D. melanogaster
Coding Exons Coordinates
Annotated Untranslated Regions?
Orientation of Gene Relative to Query Sequence
Completeness of Gene Model Translation
Stop Codon Coordinates
Project Group
Project Name
Value
contig1.fasta
No
Dyrk3-PA
6776-7187, 7255-8115, 10127-10640, 10780-11130,
11191-11474
No
Plus
Complete
11475-11477
D. erecta Dot
contig1
Figure 4 Enter our initial gene model into the Gene Model Checker
Click on the “Verify Gene Model” button to check our gene model.
8
Examine the Checklist
The checklist reports multiple problems in our gene model: it found in-frame stop codons in the
middle of our translated peptide sequence, it cannot locate the consensus splice acceptor sites for
CDS 2 and 3, and it cannot locate the consensus splice donor site for CDS 3 (Figure 5). We will
tackle each of these problems in order to create a valid gene model.
Figure 5 The Gene Model Checker identified multiple problems in our initial gene model
Click on the “Peptide Sequence” tab to open the section. Notice that there are multiple stop
codons (‘*’ characters) throughout the translated sequence (Figure 6). In-frame stop codons in the
translation can often be attributed to incompatible donor and acceptor splice sites (because it
introduces a frame shift). We can use the checklist to help us identify the splice sites that are
incompatible with each other.
Figure 6 Multiple stop codons in the conceptual peptide translation of our gene model
Examine the Checklist to identify problems in our gene model
Click on the “Checklist” tab to return to the checklist section. Then click the “Expand All”
button to examine each of the items in the checklist in detail. In CDS 1, we notice the CDS
9
begins with an ATG (methionine), as we would expect. The two bases after the end of CDS 1
(highlighted in red) is “GT”, the canonical donor splice site. Examining the items further down
the checklist, we find that the acceptor site for CDS 2 has the non-canonical splice site CA instead
of AG. In the expanded section for the acceptor of CDS 2, we notice that while the red bases at
7253-7254 have the sequence CA, the bases at 7254-7255 have the sequence AG (Figure 7).
Therefore, it is likely that we have made a mistake when we record the coordinates and CDS 2
actually begins at 7256 instead of 7255.
Figure 7 Using the expanded checklist section to identify incorrect splice acceptor site for CDS 2
An additional hint of the cause of the problem with the CDS annotation can be found in the
“Extracted Coding Exons” section. After expanding the section for CDS 2 and examining the
corresponding translation, we find that there are multiple stop codons in the translation. The fact
that CDS 1 is in the correct reading frame while CDS 2 is in the wrong reading frame suggests that
the donor site for CDS 1 is incompatible with the acceptor site for CDS 2 (Figure 8).
Figure 8 CDS translation for CDS 2 contains multiple stop codons
10
Based on the results of our analysis, we decide to shift the starting coordinates of CDS 2 by one
base (from 7255 to 7256). To make this change, go back to the “Configure Gene Model” panel
and change the “Coding Exon Coordinates” from 7255-8115 to 7256-8115 (Figure 9).
Figure 9 Change the starting coordinate of the second CDS from 7255-8115 to 7256-8115
Click on the “Verify Gene Model” button to check our new gene model.
Looking at the updated checklist (Figure 10), we see that the problem with CDS 2 has been
resolved. However, the problem with CDS 3 remains. Consequently, we will examine coordinates
for CDS 3 next.
From the checklist criteria “Check for in-frame stop codons in CDS”, we see that CDS 3 is the
only coding exon that contains in-frame stop codons. The checklist also indicates that CDS 3 uses
non-canonical splice donor and acceptor sites.
Figure 10 The Gene Model Checker Checklist after correcting the splice site boundaries for CDS 2 shows our gene model
contains additional errors
11
Instead of using the surrounding sequence feature to diagnose the issue, we will use the custom
track feature to identify the problems with the donor and acceptor splice sites for CDS 3. Click
on the
icon next to “Acceptor for CDS 3”. A new window will open showing the GEP UCSC
Genome Browser view of this region (Figure 11).
Figure 11 Genome Browser view of the putative acceptor site for CDS 3
We notice from the Genome Browser view that our gene model (shown under the track title
“Custom Gene Model”) differs from the blastx track and the other gene predictors (e.g., Genscan,
Twinscan, Geneid) by a single nucleotide. There is also a high quality splice acceptor site at 1012610127 (1bp away from our annotated splice site). Furthermore, we notice that the CDS 3 in our
model has an “AA” splice acceptor site instead of the canonical “AG” splice acceptor site. Based
on the evidence we have collected, we can conclude that we have made a mistake when annotating
the acceptor site for CDS 3: CDS 3 should begin at position 10128 instead of 10127.
Now that we have identified the problem with the acceptor for CDS 3, we will examine the donor
for exon 3, which also fails the Gene Model Checker Checklist. Close the window and click on
the
icon next to “Donor for CDS 3”. A new window should open that shows the GEP UCSC
Genome Browser view of this region (Figure 12).
Figure 12 Evidence from the blastx alignment and other gene predictors shows an open reading frame that extends
beyond our annotated CDS
12
The Genome Browser view shows that there is a potential problem with our gene model in this
region. The blastx D. melanogaster protein alignment and results from multiple gene predictors
indicate that CDS 3 should be substantially longer than our model. Consequently, we need to go
back to our bl2seq alignments and gene prediction results to verify that we have correctly identify
the coordinates for this CDS.
After running bl2seq of the corresponding CDS from Dyrk3 in D. melanogaster (3_2380_0) against
contig1, we see from the alignment that CDS 3 should end at around 10727 (Figure 13).
Figure 13 bl2seq (blastx) alignment between D. melanogaster CDS Dyrk3:3_2380_0 and contig1 shows that CDS 3 should
end at around 10,727
Examination of the region surrounding 10727 using the genome browser allows us to conclude
that the CDS should actually end at 10728 (Figure 14).
Figure 14 CDS 3 should end at 10,728 instead of 10,640
Our analysis suggests we should revise the coordinates for CDS 3 to 10128-10728. Click on the
“Verify Gene Model” model to check our new gene model. We now have a gene model that passes
all the validation tests in the checklist (Figure 15).
13
Figure 15 After fixing the problems with CDS 2 and CDS 3, the final gene model passes all the items on the Checklist
Note: We could also have built the gene model by adding each exon incrementally instead of
submitting the complete gene model. Simply change the “Completeness of Gene Model
Translation” section to “Missing 3’ End of Translated Region” until we have constructed the
entire gene model.
Checking the dot plot and the global protein alignment
In addition to the checklist, you can also use the dot plot and the protein alignment to verify your
gene model. Click on the “Dot Plot” tab to examine the dot plot between the protein translation
for your gene model and the putative D. melanogaster ortholog Dyrk3-PA (Figure 16).
Figure 16 Dot plot alignment between the D. melanogaster protein Dyrk3-PA (x-axis) and the translation from the
submitted model (y-axis). Each dot in the dot plot represents regions of similarity between the two sequences.
14
Looking at the dot plot, we see that there is a long diagonal line with a slope of approximately 1.
This suggests that the two sequences have approximately the same lengths and that they are mostly
similar to each other. The alternating colors in the dot plot represent the approximate location of
each exon. We also see that there is weaker sequence homology at the beginning of the protein
compared to the rest of the protein.
Note: For gene models that show weak sequence similarity with the D. melanogaster ortholog,
you can click on the “View dot plot in the Dot Plot Viewer” link to view the dot plot in the
Dot Plot Viewer. The Dot Plot Viewer allows you to control the dot plot alignment parameters
(e.g., scoring matrix, word size, and neighborhood) in order to alter the sensitivity and the
specificity of the dot plot (Figure 17).
Figure 17 Change the parameters in the Dot Plot Viewer to control the sensitivity of the dot plot.
Go back to the Gene Model Checker page. Click on the link “View protein alignment” to see the
global alignment between the two protein sequences. Our observation of the weaker sequence
homology at the beginning of the protein on the dot plot is consistent with the global alignment.
In particular, there is a large gap at the beginning of the alignment (Figure 18).
Figure 18 Large gap at the beginning of the protein alignment between Dyrk3-PA and the submitted gene model.
15
However, after checking the blastx alignment of CDS 1_2380_0 against contig1, we see that the
large gap is in an internal part of the first CDS. Hence the gap likely reflects a genuine difference
between the two species and it is not caused by an error in our annotation (Figure 19).
Figure 19 bl2seq alignment between the first CDS (1_2380_0) from the D. melanogaster Dyrk3-PA (subject) and the
contig1 sequence (query) shows that the D. erecta CDS has a few additional residues in its first CDS compared to its
D. melanogaster ortholog.
Download the files generated by the Gene Model Checker
Now that we have a completed gene model, we should save the files generated by the Gene Model
Checker in preparation for project submission. Click on the “Downloads” tab to see links to the
three files generated by the Gene Model Checker (Figure 20).
Figure 20 Right-click on the links and choose “Save Link As...” or “Download Linked File As…” to save the files
generated by the Gene Model Checker
Right-click on the links and choose “Save Link As…” or “Download Linked File As…” to save the
files onto your computer. When preparing files for project submission, you will need to
concatenate the output from all the genes you have annotated and generate three files: one file
containing all the GFF entries, another file with all the transcript sequences, and the third file
with all the peptide sequences. Please refer to the “Submit Annotation Projects” document
(available on the GEP web site under “Help” è “Documentations” è “Web Framework”) for
more information on how to prepare the files for project submission.
Verifying Gene Models with Consensus Errors
The Baylor College of Medicine Human Genome Sequencing Center (BCM-HGSC) has
sequenced eight Drosophila species as part of the modENCODE project. These assemblies have not
16
been manually improved and we have found many potential errors in the consensus sequence.
Because of the known weaknesses of the 454 sequencing technology in resolving homopolymer
runs, many of these potential errors are base insertions or deletions. These base insertions and
deletions would often introduce spurious frame shifts into the gene annotations.
To address this issue, the GEP has developed the Sequence Updater tool to enable students to
document errors in the consensus sequence. The Sequence Updater will generate a file in the
Variant Call Format (VCF) that describes changes to the original project sequence in a
standardized format. Additional information on the Sequence Updater is available through the
Sequence Updater User Guide on the GEP web site (available under “Help” è “Documentations”
è “Web Framework”). The Gene Model Checker has been updated so that it will take the
sequence changes specified in the VCF file into account when it validates a gene model.
Checking the Thd1 Gene Model with Known Errors in the Consensus
The Sequence Updater User Guide contains an example of two base deletions (at 21,412 and at
21,431) in a region of the D. biarmipes contig32 sequence that contains the putative ortholog of the
D. melanogaster gene Thd1. In this tutorial, we will incorporate the VCF file generated by the
Sequence Updater into our annotation of the D. biarmipes Thd1 ortholog.
If we were to check the gene model for Thd1-PA using the original contig32 sequence, we find that
there are multiple stop codons in the last coding exon because errors in the project consensus
sequence would introduce frame shifts into the conceptual translation (Figure 21). Furthermore,
the amino acid sequences at the C-terminus would not be conserved with the D. melanogaster gene
model (Figure 22).
Figure 21 Gene Model Checker reports multiple in-frame stop codons and incorrect length of the translated region
because of errors in the D. biarmipes contig32 project sequence
17
Figure 22 Dot plot of the D. melanogaster Thd1-PA protein (x-axis) against the submitted D. biarmipes gene model (y-axis)
show a lack of sequence similarity at the C-terminus because of errors in the contig32 sequence
Checking the Thd1 Gene Model with the Modified Consensus Sequence
In contrast, if we include the corrections to the contig32 sequence (specified by the VCF file
contig32_changes.vcf.txt), the Gene Model Checker will automatically adjust the Coding Exons
Coordinates so that it is relative to the revised sequence and verify the gene model using these
revised coordinates. This allows us to specify the “Coding Exons Coordinates”, the “Transcribed
Exons Coordinates” and the “Stop Codon Coordinates” relative to the original project sequence.
In this case, we will enter the following into the Gene Model Checker form:
Field
Contig Sequence File
Errors in Consensus Sequence?
File with Changes to the Consensus Sequence
Ortholog in D. melanogaster
Coding Exons Coordinates
Annotated Untranslated Regions?
Orientation of Gene Relative to Query Sequence
Completeness of Gene Model Translation
Stop Codon Coordinates
Project Group
Project Name
Value
contig32.fasta
Yes
contig32_changes.vcf.txt
Thd1-PA
31271-29686, 26745-26113, 25565-25400, 2530224940, 22112-19559
No
Minus
Complete
19558-19556
D. biarmipes Dot (Aug. 2012 (deprecated))
contig32
18
Figure 23 Verify the Thd1-PA gene model with sequence changes described in the file contig32_changes.vcf.txt
The Gene Model Checker will issue a warning on the Checklist with the message “Modified
Consensus Sequence” that indicates the original sequence has changed based on the VCF file
(Figure 23). Using the revised sequence, the protein alignment and the dot plot show the last CDS
of Thd1-PA is conserved between the D. melanogaster and its D. biarmipes ortholog (Figure 24).
Figure 24 Dot plot of the D. melanogaster Thd1-PA protein (x-axis) against the submitted D. biarmipes gene model (y-axis)
show sequence similarity at the C-terminus after applying the VCF file to the contig32 project sequence
19
In order to maintain the ability for the Gene Model Checker to be used as a tool for diagnosing
errors in the submitted gene model, the Gene Model Checker will report two different sets of
coordinates when you provide a VCF file. The GFF and custom tracks will show the coordinates
relative to the original project sequence so that they are consistent with the rest of the evidence
tracks on the GEP UCSC Genome Browser. In contrast, the expanded section of the Checklist
and the Extracted Coding Exons will report the coordinates relative to the revised sequence.
20
Download