Lecture 7 - Pitt CPATH Project

Genomics and Personalized
Care in Health Systems
Lecture 7 Gene Finding (Part 2)
Ab initio and Evidence-Based Gene Finding
Leming Zhou, PhD
School of Health and Rehabilitation Sciences
Department of Health Information Management
Annotation Goals of this Class
• Create an evidence based set of gene annotations for the
dot chromosome and a similarly sized “control” region of
a long chromosome arm.
• This gene annotation set includes all putative isoforms,
coding sequence only.
Ab initio Gene Prediction
• Gene prediction using only the genomic DNA sequence
– Search for “signals” of protein coding regions
– Typically uses a probabilistic model
• Hidden Markov Model (HMM)
• “Putative models”: external evidence is needed to verify
predictions (e.g. mRNA, ESTs)
• We will use multiple gene finders (e.g. Genscan, N-SCAN,
SNAP) for our Drosophila annotations
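As a rough illustration of how an HMM-based gene finder labels a sequence, the sketch below runs the Viterbi algorithm on a two-state toy model (coding vs. noncoding). The states, all probabilities, and the function name are illustrative assumptions; real gene finders such as Genscan use far richer models with states for exons, introns, and splice signals.

```python
import math

# Toy two-state HMM. All probabilities are made-up illustrative values,
# not parameters from any real gene finder.
STATES = ("C", "N")  # C = coding, N = noncoding
START = {"C": 0.5, "N": 0.5}
TRANS = {"C": {"C": 0.9, "N": 0.1}, "N": {"C": 0.1, "N": 0.9}}
EMIT = {
    "C": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "N": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
}


def viterbi(seq):
    """Return the most probable state path for seq as a string of C/N."""
    if not seq:
        return ""
    # score[s] = best log-probability of any path ending in state s
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # back[i][s] = best previous state for state s at position i + 1
    for base in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = (score[prev] + math.log(TRANS[prev][s])
                            + math.log(EMIT[s][base]))
            pointers[s] = prev
        score = new_score
        back.append(pointers)
    # Trace back from the best final state
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))


print(viterbi("AAAAAGGGGG"))
```

With these toy parameters the AT-rich half is labeled noncoding and the GC-rich half coding, which is the kind of “signal” an ab initio program looks for.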
Performance of Gene Finders
• Most gene finders can predict prokaryotic genes
accurately (>98%)
• However, gene-finders do a poor job of predicting genes
in eukaryotes
– Not as much is known about the general properties of eukaryotic
genes
– Splice site recognition, different isoforms
Common Problems
• Common problems with gene finders
– Fusing neighboring genes
– Splitting a single gene
– Missing exons or entire genes
– Overpredicting exons or genes
• Other challenges
– Nested genes
– Noncanonical splice sites
– Pseudogenes
– Different isoforms of the same gene
How to Improve Predictions?
• Improve algorithms
– Identify conserved sequences from multiple sequence
alignments
– Consensus model based on results from multiple gene finders
– Computational predictions that incorporate biological evidence
(e.g. ESTs, cDNAs)
• Manual annotation
– Collect all the available evidence from multiple biological and
computational sources to create the best gene models
Generate Consensus Gene Model
• Each gene prediction program has different strengths
and weaknesses.
• Create consensus gene set by combining results from
multiple gene finders to generate the best reference set
• For the 12 Drosophila species that have been sequenced,
reference gene sets are available from FlyBase
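One simple way to combine gene finder output is a per-base majority vote. The sketch below is only an illustration of that idea, not how FlyBase builds its reference sets; the function name, the vote threshold, and the example coordinates are all assumptions.

```python
from collections import Counter


def consensus_intervals(predictions, min_votes=2):
    """Merge bases predicted as exon by at least min_votes gene finders.

    predictions maps a gene finder name to a list of (start, end) exon
    intervals, 1-based and inclusive. Returns merged (start, end) intervals.
    """
    votes = Counter()
    for exons in predictions.values():
        covered = set()
        for start, end in exons:
            covered.update(range(start, end + 1))
        votes.update(covered)  # each program votes at most once per base
    supported = sorted(pos for pos, n in votes.items() if n >= min_votes)

    merged = []
    for pos in supported:
        if merged and pos == merged[-1][1] + 1:
            merged[-1][1] = pos  # extend the current interval
        else:
            merged.append([pos, pos])  # start a new interval
    return [tuple(interval) for interval in merged]


# Hypothetical exon predictions for one region (coordinates are invented)
example = {
    "Genscan": [(100, 200)],
    "N-SCAN": [(150, 250)],
    "SNAP": [(120, 180)],
}
print(consensus_intervals(example))
```

A vote like this smooths over the individual weaknesses of each program, but it cannot replace checking the result against expression and conservation evidence.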
Manual Annotation
• Evidence for genes:
– EST or other expression data
– Conservation to known genes
– Computational predictions
• Goals for this class
– Use web resources to collect evidence
– Practice collecting and analyzing data
Web Databases and Tools
• For this class we will use the following sites:
– UCSC Genome Browser for Drosophila: http://gander.wustl.edu
– FlyBase: http://flybase.org
– NCBI: http://ncbi.nlm.nih.gov
– GEP Gene Record Finder:
http://gander.wustl.edu/~wilson/dmelgenerecord/index.html
– GEP Model Checker:
http://gander.wustl.edu/~wilson/genechecker/index.html
Phylogenetic Tree for Drosophila
UCSC Genome Browser
• GEP version, parts of genomes, GEP data, used for
annotation of Drosophila species
– http://gander.wustl.edu
• Strengths
– Genome Browser: graphical views of genomic regions
– BLAT: BLAST-Like Alignment Tool
– Table Browser: access underlying data used to generate the
graphical genome browser
• Weaknesses
– Constraints on ability to display data
• Uses
– View and organize available computational results
– Follow along to view a section of the Drosophila genome
Adjusting Display Options
• Adjust following tracks to “pack”
– Under “Gene and Gene Prediction Tracks”
• FlyBase Genes, RefSeq Genes, N-SCAN, CONTRAST
• Adjust following tracks to “dense”
– Under “mRNA and EST Tracks”
• D. melanogaster mRNAs, Spliced ESTs
– Under “Comparative Genomics”
• Conservation
• Adjust following tracks to “full”
– Under “Mapping and Sequencing Tracks”
• Base Position
– Under “Variation and Repeats”
• RepeatMasker
FlyBase
• http://www.flybase.org
• Strengths
– Lots of ancillary data for each gene
– Curation of literature for each gene
• Weaknesses
– Difficult to find data if you don’t already know where to look
• Uses
– Species-specific BLAST searches
– Genome browsers and data sets for all the sequenced Drosophila
species
Basic Strategy for Annotation
• Use Ab Initio prediction to focus attention on genomic
features (areas) of interest
• Add as much other evidence as you can to refine the gene
model and support your conclusion
• What other evidence is there?
– Basic gene structure
– Motif information
– BLAST homologies: nr, protein, EST
– Other species or other proteins
Eukaryotic Genome Annotation
• BLAST homology: nr, protein, EST
– Homology to known proteins argues against false positive
– Consider length, percent identity when examining alignments
– Without good EST evidence you can never be sure; make your
best guess and be able to defend it
• Other species or other proteins
– For any similarity hit, look for even better hits elsewhere in the
genome
– If you are convinced you have a gene and it is a member of a
multi-gene family, be sure to pick the right ortholog
Drosophila Sequence
Annotation
GEP projects
• http://gep.wustl.edu
• D. erecta
– Close to D. melanogaster
– easier to annotate
• D. grimshawi, D. virilis and D. mojavensis
– Further from D. melanogaster
– More difficult to annotate
• We will use an example from D. virilis, but the basic technique is the
same
The Drosophila genomes
• Average gene size will be smaller than in mammals
• Very low density of pseudogenes
• Almost all genes will have the same basic structure as their D.
melanogaster orthologs; mapping exon by exon works
well for most genes
• Genes only rarely move to a different chromosomal
element
Goals
• Identify genes
– For each gene identify and precisely map (accurate to the base
pair) all exons
– Do this for ALL isoforms
Evidence Based Annotation
• Human curated analysis
• Much better outcome than standard ab initio gene
finders
• Goal: Collect, analyze and synthesize all evidence
available and come up with most likely gene model
• Evidence for Gene Models
– Basic Biology
– Expression Data
– Conservation
– Computational
Basic Biology
• Known biological properties
– Coding regions start with a methionine
– Coding regions end with a stop codon
– A gene lies on only one strand of DNA
– Exons appear in order along the DNA
– Introns start with GT (or rarely GC)
– Introns end with AG
• CCTAGAGTACCA….CAGATAGCTAGGAG
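The intron rules above can be checked mechanically once exon coordinates are proposed. The following is a minimal sketch for a plus-strand gene; the function name and the toy sequence are assumptions made for illustration, and a minus-strand gene would need the sequence reverse-complemented first.

```python
def check_intron_signals(genome, exons):
    """Report introns that lack GT (or rarer GC) at the start or AG at the end.

    genome: forward-strand DNA string.
    exons: list of (start, end) pairs, 1-based inclusive, in order along
    the DNA for a plus-strand gene.
    Returns a list of problem descriptions (empty if none found).
    """
    genome = genome.upper()
    problems = []
    for (_, prev_end), (next_start, _) in zip(exons, exons[1:]):
        if prev_end + 1 >= next_start:
            problems.append(f"exons out of order or touching near {prev_end}")
            continue
        donor = genome[prev_end:prev_end + 2]             # first 2 intron bases
        acceptor = genome[next_start - 3:next_start - 1]  # last 2 intron bases
        if donor not in ("GT", "GC"):
            problems.append(f"intron after {prev_end} starts with {donor}")
        if acceptor != "AG":
            problems.append(f"intron before {next_start} ends with {acceptor}")
    return problems


# Toy sequence: exon 1 (ATGGCC), intron (GTAAAAAG), exon 2 (GCCTAA)
toy = "ATGGCC" + "GTAAAAAG" + "GCCTAA"
print(check_intron_signals(toy, [(1, 6), (15, 20)]))
```

An empty list means every intron in the model has the expected donor and acceptor dinucleotides; it does not by itself show that the model is correct.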
Expression: EST, mRNA, Other
• Protein coding genes must be expressed
• Positive result very helpful
• Negative result less informative
– Did not find message because looking in wrong tissue or
developmental stage
– Transcription rate so low, messages remain undetected
• Drosophilids: only 20,000 ESTs from each species; only
helpful in rare cases
Conservation
• Coding sequences evolve slowly
• Similarity to known genes suggests new genes
• D. melanogaster is very well annotated
• 12 other Drosophila genomes are now available
• This will be your most important source of evidence
Computational Evidence
• Assumption: there are recognizable signals in the DNA
sequence that the cell uses; it should be possible to detect
these algorithmically
• Many programs designed to detect these signals
• These programs work to a certain extent: the
information they provide is better than nothing, but
error rates are high
How to Proceed
• You will need to investigate all features of interest:
1. Ab initio gene finder results
• Watch out for ends: fused or split genes
2. Regions of high similarity with D. melanogaster proteins and
ESTs, identified by BLAST but not overlapping with 1 above
• Overlapping genes are usually on opposite strands
• Be vigilant for partial genes at fosmid ends
3. Regions with high similarity to known proteins (i.e. BLAST to nr)
but not found by 1 or 2 above
Practical Example in D. virilis
• The following example gives a general outline of how
we recommend you proceed; it works in many cases
• Goal: come up with the best gene model that integrates
all the evidence in a manner that maximizes likely
outcomes and minimizes contradictions
Basic Annotation Procedure
For each feature (exon in this case) of interest:
1. Identify the likely ortholog in D. mel.
2. Use the D. mel. database to find the gene model of the ortholog and identify
the protein sequence for each exon
3. Use BLASTX to locate exons; search one by one, find conservation,
note position and frame
4. Based on the locations and frames of conservation, as well as other
evidence, create a gene model; identify the exact base location (start
and stop) of each CDS (coding exon) for each isoform
5. Confirm your model using the Gene Model Checker and genome browser.
Basic Procedure (graphically)
[Figure: a contig containing a predicted feature]
BLASTX of the predicted gene to D. melanogaster proteins suggests this
region is orthologous to a Dm gene with 1 isoform and 5 exons.
BLASTX of each exon locates the region of similarity and shows which
translated frame encodes the similar amino acids.
[Figure: alignments of the five exons, in frames 1, 3, 3, 2, 1]
Basic Procedure (graphically)
Zoom in on the ends of exons and find the first Met, matching intron donor (GT) and
acceptor (AG) sites, and the final stop codon, making sure to keep the frame intact.
[Figure: exon ends labeled with frames 1 and 3, Met, GT, AG, GT]
Once these have been identified, write down the exact location of the first base
and last base of each exon. Use these numbers to check your gene model and, if
correct, generate and save the GFF file.
[Figure: exon coordinates 1121-1187, 1402-1591, 1754-1939, 2122-2434, 2601-2789]
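To show what the saved GFF file holds, here is a minimal sketch that turns exon coordinates into tab-delimited CDS lines. It reads the coordinate column above as five start/end pairs, which is an assumption, and the sequence name "contig", the source "manual", and the Parent attribute value are placeholder names. In this class the file itself comes from the GEP tools, not from a script like this.

```python
def cds_to_gff(seqid, exons, strand="+", source="manual", gene_id="gene1"):
    """Format plus-strand CDS exons as tab-delimited GFF lines.

    exons: list of (start, end), 1-based inclusive, in transcript order.
    The phase column follows the GFF3 convention: the number of bases to
    skip at the start of the feature to reach the next complete codon.
    """
    lines = []
    bases_so_far = 0
    for start, end in exons:
        phase = (3 - bases_so_far % 3) % 3
        fields = [seqid, source, "CDS", str(start), str(end), ".", strand,
                  str(phase), f"Parent={gene_id}"]
        lines.append("\t".join(fields))
        bases_so_far += end - start + 1
    return lines


# Coordinates read from the slide as five exons (assumed pairing)
exons = [(1121, 1187), (1402, 1591), (1754, 1939), (2122, 2434), (2601, 2789)]
for line in cds_to_gff("contig", exons):
    print(line)
```

The five exons sum to 945 bases, a whole number of codons, which is a quick sign that the frame was kept intact across the splice junctions.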
Example Annotation
• Open 4 tabs
– gander.wustl.edu
• Genome browser
• D. virilis - Mar 2005 - chr10
– flybase.org
• Tools → BLAST
– The Gene Record Finder
• Info from most recent D. mel genome
• The search term is case sensitive (Fts is different from fts)
– www.ncbi.nih.gov
• BLAST page → BLASTX → click the checkbox:
Example Annotation from Drosophila virilis
• Settings are: Insect; D. virilis; Mar. 2005; chr10 (chr10
is a fosmid from 2005)
• Click submit
Example Annotation
• Seven predicted Genscan genes
• Each one would be investigated
Investigate 10.4
• All putative genes will need to be analyzed; we will focus
on 10.4 in this example
• To zoom in on this gene enter:
– chr10:15000-21000 in position box
– Then click jump button
Step 1: Find Ortholog
• If this is a real gene it will probably have at least some
homology to a D. melanogaster protein
• Step one: do a BLAST search with the predicted protein
sequence of 10.4 to all proteins in D. melanogaster
– In Genome Browser (http://gander.wustl.edu):
• Click on one of the exons in gene 10.4
• On the Genscan report page click on Predicted Protein
• Select and copy the sequence
– On the flybase blast page (http://flybase.org/blast):
• Do a blastp search of the predicted sequence to the D. melanogaster
“Annotated Proteins” (Database)
Step 1: Find Ortholog
• The FlyBase blastp results show a significant hit to the
“A”, “B” and “C” isoforms of the gene “mav”
• Note the large step in E value from the mav isoforms to the
next best hit, gbb; this is good evidence for orthology
Step 1: Results of Ortholog Search
• The alignment looks right for virilis vs. melanogaster: regions of
high similarity interspersed with regions of
little or no similarity
• Same chromosome supports orthology
• We have a probable ortholog: “mav”
Step 2: Gene Structure Model
• We need information on “mav”; What is mav? What does
its gene look like?
• If this is the ortholog we will also need the protein
sequence of each exon
• Use the Gene Record Finder
– Available off the “Projects” menu at http://gep.wustl.edu
• Projects Annotation Resources  Gene Record Finder
• http://gander.wustl.edu/~wilson/dmelgenerecord/index.html
Step 2: Gene Structure Model
• Search for “mav” (search box)
• There is only one mav, so select it and click the button
Step 2: Gene Structure Model
• Sequence of exons in the mav gene in D. mel
Step 2: Gene Structure Model
• We now have a gene model (three isoforms each with 2
CDS).
• In this case, we will annotate isoform A only.
• For your project report you would annotate ALL
isoforms for all genes you identify in your fosmid/contig.
• Isoforms B and C differ only in non-coding regions.
Simply make the B model, then duplicate and rename it for
submission
Step 3: Investigate Exons
• The last section of the Gene record includes the exons:
Step 3: Investigate Exons
Step 3: Investigate Exons
• Use each exon to search the fosmid
– Where in this fosmid are the sequences that code for similar
amino acids?
• Best to search the entire fosmid DNA sequence (easier
to keep track of positions) with the amino acid sequences
of exons
• In the genome browser tab, go to the browser “chr10” of
D. virilis; click the DNA button, then click “get DNA”
• In the Gene Record Finder tab, make sure you have the
peptide sequence for the melanogaster mav gene exons
• These two tabs now have the two sequences you are
going to compare
Step 3: Investigate Exons
• Copy and paste the fosmid genomic sequence obtained
from genome browser tab into the top box 1 of blastx
• Copy and paste the peptide sequence of exon 1 from
“gene record finder” tab into bottom box 2 of blastx
• Open the “algorithm parameters” section: turn off the
filter; leave other values at default
• Click “BLAST” button to run the comparison
BLASTX Comparison
Step 3: Investigate Exons
• Sometimes you will not find any similarity
• If so, change the expect value to 1000 or even larger and
click “BLAST”; keep raising the expect value until you
get hits, then evaluate the hits by position
• This will show very weak similarity, which can be better
than nothing
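The “keep raising the expect value” loop can be written down as a small helper. This is a sketch only: `run_search` is a hypothetical caller-supplied function standing in for however you run BLASTX, and the starting value, multiplier, and upper limit are assumptions rather than recommended settings.

```python
def search_with_relaxed_evalue(run_search, start=10.0, factor=10.0, limit=1e6):
    """Raise the expect threshold until the search returns hits.

    run_search: caller-supplied function taking an expect value and
    returning a list of hits (hypothetical; wire it to however you run
    BLASTX). Returns (expect_value, hits); hits is empty if the limit
    is reached without finding anything.
    """
    evalue = start
    while limit >= evalue:
        hits = run_search(evalue)
        if hits:
            return evalue, hits
        evalue *= factor
    return limit, []


# Stand-in for a real search: pretends the only hit has E = 350
def fake_search(evalue):
    return ["weak hit"] if evalue >= 350 else []


print(search_with_relaxed_evalue(fake_search))
```

Hits found this way are weak by construction, so they still need to be judged by position relative to the neighboring exons.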
Step 3: Investigate Exons
• We have a weak alignment (50 identities and 94
similarities), but we have seen worse when comparing
single exons from these two species
• Notice the location of the hit (bases 16866 to 17504), the
frame (+3), and the 92 missing amino acids
Step 3: Investigate Exons
• A similar search with the exon 2 sequence gives a location
of chr10:18476-19744 and frame +2, with one amino
acid missing
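The frame reported for each hit follows directly from where the alignment starts on the query DNA. A minimal sketch, assuming a plus-strand hit and that the whole fosmid was used as the query so that positions are fosmid coordinates (the function name is illustrative):

```python
def blastx_frame(start):
    """Forward reading frame (+1, +2 or +3) of a codon starting at `start`.

    `start` is a 1-based position on the query DNA sequence.
    """
    return (start - 1) % 3 + 1


print(blastx_frame(16866))  # exon 1 hit
print(blastx_frame(18476))  # exon 2 hit
```

These calls give 3 and 2, matching the +3 and +2 frames reported for the two mav exons.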
Step 3: Investigate Exons
• For larger genes continue with each exon, searching with
BLASTX for similarity (adjusting the E-value cutoff if needed) and
noting location, frame and any missing amino acids
• Very small exons may be undetectable; move on and
come back later
• Draw these out if this will help
– Note unaligned amino acids as well
[Figure: exon 1 aligns amino acids 93-271 to bases 16,866-17,504 in frame +3; exon 2 aligns amino acids 2-430 to bases 18,476-19,744 in frame +2]
Step 4: Create a Gene Model
• Pick the ATG (Met) at the start of the gene: the first Met in frame with the
coding region of similarity (+3)
• For each putative intron/exon boundary, compare the
location of the BLASTX result to locate the exact first and last
base of the exon such that the conserved amino acids are
linked together in a single long open reading frame
• Exons: 16515-17504; 18473-19744
• Intron GT and AG present
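The arithmetic behind these coordinates can be sketched as follows. The helper names are illustrative, and the start estimate assumes the unaligned amino acids are present with no insertions or deletions, which often fails at the first exon where the true start is the first in-frame Met.

```python
def expected_exon_start(hit_start, first_aligned_aa):
    """Estimate where an exon begins from a BLASTX hit on the plus strand.

    hit_start: first base of the alignment on the genomic sequence.
    first_aligned_aa: position in the exon peptide where the alignment
    begins (1 means no amino acids are missing at the start).
    Assumes the unaligned amino acids are still present with no indels.
    """
    return hit_start - 3 * (first_aligned_aa - 1)


def same_frame(pos_a, pos_b):
    """True if codons starting at the two positions share a reading frame."""
    return (pos_a - pos_b) % 3 == 0


exons = [(16515, 17504), (18473, 19744)]  # final model from the slide
cds_length = sum(end - start + 1 for start, end in exons)

print(expected_exon_start(18476, 2))  # exon 2: one amino acid missing
print(same_frame(16515, 16866))       # chosen ATG vs. exon 1 hit
print(cds_length % 3 == 0)            # whole codons only
```

For exon 2 the estimate lands on 18473, the start used in the model. The chosen ATG at 16515 is in the same frame as the exon 1 hit, and the two exons together make up a whole number of codons.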
Step 4: Create a Gene Model
• For many genes the locations of donor and acceptor sites
will be easily identified based on the locations of the
alignments of the individual exons.
• However when amino acid conservation is not sufficient
to allow for the identification of intron/exon boundaries
then other evidence must be considered. See the handout
“Annotation Instruction sheet” for more help.
Splicing
• The splice of life: Alternative splicing
and neurological disease, B. Kate
Dredge, Alexandros D. Polydorides
& Robert B. Darnell, Nature Reviews
Neuroscience 2, 43-50 (January
2001)
http://www.humgen.nl/lab-aartsma-rus/frameset.php?frame=introduction_bestanden/splicing.htm
Alternative Splicing
• Splicing mutations can arise by alteration of
conserved splice donor and splice acceptor
sequences or by activation of cryptic splice sites
• (A) Mutations at conserved splice donor (SD) or
splice acceptor (SA) sequences result in intron
retention, where there is failure of splicing and an
intron sequence is not excised; or in exon
skipping, where the spliceosome brings together
the splice donor and splice acceptor sites of non-neighboring exons.
• (B) Sequences that are very similar to the splice
donor or splice acceptor sequences may
coincidentally exist in introns and exons (sd and
sa). These sequences are not normally used in
splicing and so are known as cryptic splice sites.
A mutation can activate a cryptic splice site by
making the sequence more like the consensus
splice donor or acceptor sequence, and the cryptic
splice site can now be recognized and used by the
spliceosome (activation of the cryptic splice site).
• From “Human Molecular Genetics”, Tom
Strachan and Andrew Read
Alternative Splicing
• http://en.wikipedia.org/wiki/File:Splicing_overview.jpg
Revisit “Phase”
• Introns and internal exons are divided according to
“phase”, which is closely related to the reading frame.
– An intron which falls between codons is considered phase 0
– An intron after the first base of a codon, phase 1
– An intron after the second base of a codon, phase 2
– Internal exons are similarly divided according to the phase of the
previous intron (which determines the codon position of the first
base-pair of the exon, hence the reading frame).
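Intron phase falls out of a running count of coding bases. A minimal sketch of that calculation, with an illustrative function name, using the mav exon coordinates from Step 4 as input:

```python
def intron_phases(exon_lengths):
    """Phase of each intron, given CDS exon lengths in transcript order.

    Phase 0: intron falls between codons; phase 1: after the first base
    of a codon; phase 2: after the second base.
    """
    phases = []
    coding_bases = 0
    for length in exon_lengths[:-1]:  # no intron follows the last exon
        coding_bases += length
        phases.append(coding_bases % 3)
    return phases


# mav model from Step 4: exons 16515-17504 and 18473-19744
print(intron_phases([17504 - 16515 + 1, 19744 - 18473 + 1]))
```

Exon 1 is 990 bases long, so the single mav intron is phase 0 and exon 2 begins with a complete codon.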
Step 5: Confirm Gene Model
• We will do this later once you have a real model of your
own to check
• As a final check:
– Enter coordinates into the “Gene Model Checker” (gep.wustl.edu,
Projects → Annotation Resources → Gene Model Checker) to
confirm it is a valid model
– Use custom tracks to view model and double check that the final
model agrees with all your evidence
– Examine dot plot to discover possible errors
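Some of the checks a model must pass can be tried locally before submitting coordinates. The sketch below is not the GEP Gene Model Checker and makes no claim about how that tool works; it assumes a plus-strand gene with the stop codon included in the last exon, and the function name and toy sequence are illustrative.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def check_cds(genome, exons):
    """Run basic sanity checks on a plus-strand coding model.

    genome: forward-strand DNA string; exons: (start, end) pairs, 1-based
    inclusive, with the stop codon included in the last exon.
    Returns a list of failed checks (empty if all pass).
    """
    cds = "".join(genome[start - 1:end] for start, end in exons).upper()
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    failures = []
    if len(cds) % 3 != 0:
        failures.append("CDS length is not a multiple of 3")
    if not cds.startswith("ATG"):
        failures.append("CDS does not start with ATG")
    if not codons or codons[-1] not in STOP_CODONS:
        failures.append("CDS does not end with a stop codon")
    if any(codon in STOP_CODONS for codon in codons[:-1]):
        failures.append("in-frame stop codon before the end")
    return failures


# Toy sequence: exon 1 (ATGGCC), intron (GTAAAAAG), exon 2 (GCCTAA)
toy = "ATGGCC" + "GTAAAAAG" + "GCCTAA"
print(check_cds(toy, [(1, 6), (15, 20)]))
```

An empty list only means the model is internally consistent; agreement with the BLASTX evidence still has to be checked in the browser.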
Considerations
• Some exons are very hard to find (small or non-conserved); keep increasing the E value to find any hits, and
restrict the location based on flanking exons
• Donor “GC” seen on rare occasions
• We have seen a couple of examples where the only
reasonable interpretation was that the basic gene
structure had changed (i.e. intron was gained or lost)
• Use the techniques described in the Handout as well as
discussions with your peers when things get difficult.
Research is almost always a collaborative effort.
Complex Genes with Many Isoforms
• Some genes will have many isoforms and many exons.
The technique described for mav does not scale well to
these large complex genes.
• The recommended protocol in these cases:
1. Identify all unique exons and which isoforms have which exons
(Gene Record Finder)
2. Map each unique exon
3. Build gene models of each isoform based on which exons it
contains; re-use previously found exons
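The three-step protocol above amounts to mapping each unique exon once and then looking coordinates up per isoform. A minimal sketch, in which the isoform names, exon IDs, and coordinates are all invented for illustration:

```python
def build_isoform_models(isoform_exons, exon_coords):
    """Assemble each isoform from exons that were mapped once.

    isoform_exons: isoform name -> list of exon IDs in order.
    exon_coords: exon ID -> (start, end) found by mapping that exon.
    """
    return {
        isoform: [exon_coords[exon_id] for exon_id in exon_ids]
        for isoform, exon_ids in isoform_exons.items()
    }


# Hypothetical gene with three isoforms sharing exons (all values invented)
isoform_exons = {
    "A": ["e1", "e2", "e4"],
    "B": ["e1", "e3", "e4"],
    "C": ["e1", "e2", "e3", "e4"],
}
# Step 1: the unique exons, each of which is mapped only once in step 2
unique_exons = sorted({e for ids in isoform_exons.values() for e in ids})
print(unique_exons)

# Step 3: reuse the mapped coordinates for every isoform
exon_coords = {"e1": (100, 250), "e2": (400, 520),
               "e3": (700, 810), "e4": (950, 1200)}
print(build_isoform_models(isoform_exons, exon_coords)["B"])
```

Here ten exon uses across three isoforms reduce to four BLASTX searches, which is why the approach scales better for complex genes.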
Homework 6
• Read the “Annotation Instruction Sheet”
• Go through “A Simple Annotation Problem” and do the exercises
• Get familiar with the “Annotation Report”
• Read the requirements for your annotation project report
• Work on an annotation project (details are given in the next slide)
• Read “Genes and Disease” (ebook) on the NCBI website, select one
genetic disease and prepare roughly 10 slides for your presentation
in the next lecture (3/12/2012; provide the slides to me before
class)
Annotation Procedure
• You are provided a zip file named “derecta_2nd3Lcontrol_Nov2011_fosmid22.zip”; unzip it and you
will get a folder named “derecta_2nd3Lcontrol_Nov2011_fosmid22”
• Go into the provided folder, go to subfolder “src”, and get the sequence named “fosmid22.fasta.masked”.
Use only this masked genomic sequence when the genomic sequence is needed. This sequence is
produced by RepeatMasker and the corresponding report can be found in subfolder “analysis” →
“Repeats”
• Go to http://gander.wustl.edu, select “D. erecta” for genome and “Nov.2011” for assembly, type
“fosmid22” in the “position or search term” box, then click submit
• In the genome browser, click one predicted gene from Genscan and obtain the predicted protein
sequence on the next page
• Use blastp on the FlyBase website (use the “Annotated Protein” database) to search this predicted
protein against annotated D. mel proteins. Find the best match and determine the gene name in
D. mel. Adjust BLAST parameters when needed
• Search for this gene in the GEP Gene Record Finder and find the D. mel gene details (exon amino acid
sequences, gene structure, etc.). http://gander.wustl.edu/~wilson/dmelgenerecord/index.html
• Search the masked genomic sequence against the amino acid sequence of each exon in each isoform of
D. mel using blastx on the NCBI website (the genomic sequence as query and each exon as
subject). Record the matched regions (in the exon and in the genomic sequence) and the frame
• Determine the precise exon boundaries by using signals (ATG, GT, AG, TAA,…), phase, conservation
and frame information in the genome browser
• Fill out the project report form
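Before pasting the masked sequence into BLASTX it can help to confirm what was loaded. The sketch below reads FASTA text and reports how much of the sequence is masked; it assumes masked bases appear as N or as lowercase letters, and the function names and demo record are illustrative rather than taken from the course files.

```python
def read_fasta(lines):
    """Parse FASTA text lines into a {header: sequence} dictionary."""
    records, header, chunks = {}, None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                records[header] = "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        records[header] = "".join(chunks)
    return records


def masked_fraction(seq):
    """Fraction of bases masked as N or written in lowercase."""
    if not seq:
        return 0.0
    masked = sum(1 for base in seq if base in "Nn" or base.islower())
    return masked / len(seq)


# With the real file: records = read_fasta(open("fosmid22.fasta.masked"))
demo = [">demo", "ACGTNNNN", "ACGTacgt"]
seq = read_fasta(demo)["demo"]
print(len(seq), masked_fraction(seq))
```

Knowing the sequence length also gives a quick check that exon coordinates recorded later fall inside the fosmid.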