Evolutionary signatures for protein-coding genes

Evolutionary signatures for protein-coding genes
Studying evolution across species revealed the following:
 ▪ Evolutionary signatures: Reading Frame Conservation, Codon Substitution Biases, Ka/Ks ratio
 ▪ Scaling to multiple species: increased separation between genes and non-coding regions
 ▪ Combining these signatures: Classification.
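
As a rough illustration of the reading frame conservation signature listed above, the sketch below scores a pairwise alignment by the fraction of aligned bases whose codon phase agrees between the two species. The function name, the "-" gap character, and the choice to take the best of the three frame offsets are assumptions made for illustration, not the scoring actually used in this work.

```python
def reading_frame_conservation(target: str, informant: str) -> float:
    """Fraction of aligned (ungapped) columns sharing a codon phase.

    Both arguments are rows of a pairwise alignment of equal length,
    with '-' marking gaps (an assumed convention). The score is taken
    for the best of the three possible phase offsets.
    """
    if len(target) != len(informant):
        raise ValueError("aligned rows must have equal length")
    t_pos = i_pos = 0          # ungapped positions consumed so far in each row
    agree = [0, 0, 0]          # aligned columns seen at each phase offset
    aligned = 0
    for t, i in zip(target, informant):
        t_gap, i_gap = t == "-", i == "-"
        if not t_gap and not i_gap:
            aligned += 1
            agree[(i_pos - t_pos) % 3] += 1
        if not t_gap:
            t_pos += 1
        if not i_gap:
            i_pos += 1
    return max(agree) / aligned if aligned else 0.0


# A 3-nt gap keeps the frame; a 1-nt gap shifts it for the rest of the region.
in_frame = reading_frame_conservation("ATGAAACCCGGG", "ATG---CCCGGG")
shifted = reading_frame_conservation("ATGAAACCCGGG", "ATGA-ACCCGGG")
```

Gaps whose length is a multiple of three leave the score at 1.0, while a single-base gap lowers it, which is the behaviour expected of coding sequence under selection.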
Revisiting the D. melanogaster protein-coding gene catalog
The purpose here is not to study gene evolution (see the Eisen paper), but rather to leverage the 12 genomes to
improve gene annotation quality.





People involved:
o MIT (Manolis Kellis, Mike Lin)
o FlyBase (Bill Gelbart, Peili Zhang, and FlyBase-Harvard curators)
o BDGP (Sue Celniker, Joe Carlson)
Identification of novel protein-coding sequence
o Method
 ▪ CONGO: similar in effect to Siepel's Exoniphy, but uses more flexible discriminative
algorithms
o Predictions
o Experimental results/validations
o BDGP August iPCR runs - new cDNAs in GenBank
o BDGP heterochromatin cDNAs (unpublished data)
o FB4.2 vs 4.3
o FB human curation (~400 annotation changes)
o Intersection with Affy transfrags (so-so enrichment)
o Mass spec (ETH Zurich/Sandra Loevenich, Erich Brunner, Konrad Basler, Ernst Hafen)
o To Do:
 ▪ Which genes are these (Dscam etc.), and how many exons did they have before this (alternative splicing!)?
 ▪ Is the exon length a multiple of 3, or does it otherwise fit with splicing requirements?
 ▪ What fraction of novel exons is expected to fall within an intron (1/3?)
Dubious genes
o Method: identify genes for which we cannot find any comparative evidence that they are real,
by multiple metrics and in multiple alignment sets
o Properties of set
 ▪ length/RFC distribution
 ▪ lack of GO terms, lack of names
 ▪ lack of cDNA/EST evidence
 ▪ many single exon/single ORF
o For some of those that are transcribed, we predict conserved noncoding elements in the
transcripts (RNA genes, microRNA genes)
"Confirmed" hypothetical genes
o Method
 ▪ Identify genes without cDNA/EST evidence, but with strong evolutionary evidence in
syntenic alignments
 ▪ Limited definition of "confirmed" (we can be confident the gene is protein-coding, but cannot
confirm gene structure or identify alt. splicing variation)
Corrections and adjustments to existing annotations
o Translation start adjustments: evolutionary evidence suggests translation starts at a
downstream ATG
o Transcript model corrections: detected from frameshifts near an intron
o ORF corrections: wrong ORF currently annotated
Summary: revised gene catalog
o Incorporation in FlyBase
Discovery of non-canonical genic phenomena

People involved:
o MIT (Manolis Kellis, Mike Lin)
o Harvard (Bill Gelbart, Andy Schroeder, others from FlyBase-Harvard)
o UCSC (Jakob Pedersen on RNA involvement in recoding)

Translational readthrough: observe protein-coding signatures continuing straight past stop codon
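
To make the readthrough observation concrete, the sketch below locates the stretch between an annotated stop codon and the next in-frame stop, which is the region where continued protein-coding signatures would be looked for. The function name and the 0-based coordinate convention are assumptions for illustration; the sketch only delimits the region and does not score conservation.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def readthrough_region(transcript: str, stop_start: int):
    """Return (start, end) of the candidate readthrough region.

    stop_start is the 0-based index of the first base of the annotated
    stop codon. The region runs from just after that stop up to (not
    including) the next in-frame stop codon. Returns None if no further
    in-frame stop exists in the transcript.
    """
    seq = transcript.upper()
    if seq[stop_start:stop_start + 3] not in STOP_CODONS:
        raise ValueError("stop_start does not point at a stop codon")
    region_start = stop_start + 3
    for pos in range(region_start, len(seq) - 2, 3):
        if seq[pos:pos + 3] in STOP_CODONS:
            return (region_start, pos)
    return None


# ATG AAA TAG | GGC CCC | TAA GG : two codons lie between the two stops.
region = readthrough_region("ATGAAATAGGGCCCCTAAGG", 6)
```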

Frameshifts: observe adjacent windows conserved in different frames (not near an intron)
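
One way to picture the frameshift screen is as a scan over consecutive windows, each labelled with the frame in which it is best conserved, flagging boundaries where the label changes and no intron is nearby. The sketch below assumes those per-window frame labels already exist; the function name, the window layout, and the 30-nt distance cutoff are hypothetical choices, not parameters from this work.

```python
def candidate_frameshifts(window_frames, window_size, intron_positions=(), min_distance=30):
    """Flag boundaries between adjacent windows conserved in different frames.

    window_frames: best-conserved frame (0, 1, 2) for each consecutive,
    non-overlapping window, or None where no frame is supported.
    Boundaries closer than min_distance to any intron position are
    skipped, since those are more likely splice-site annotation errors.
    Returns a list of (boundary_position, previous_frame, next_frame).
    """
    hits = []
    for idx in range(1, len(window_frames)):
        prev, cur = window_frames[idx - 1], window_frames[idx]
        if prev is None or cur is None or prev == cur:
            continue
        boundary = idx * window_size
        if all(abs(boundary - p) >= min_distance for p in intron_positions):
            hits.append((boundary, prev, cur))
    return hits


frames = [0, 0, 2, 2, None, 1]
far_from_intron = candidate_frameshifts(frames, window_size=30)
near_intron = candidate_frameshifts(frames, window_size=30, intron_positions=[65])
```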

Polycistronics/uORFs: observe well-conserved disjoint ORFs in known transcript models
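
As a minimal sketch of the first step in a uORF search, the code below enumerates ORFs that start in the 5' UTR of a transcript and terminate before the annotated CDS start, so that they are disjoint from the main ORF. The function name, the `min_codons` threshold, and the 0-based coordinates are assumptions for illustration; assessing whether such ORFs are well conserved is a separate step not shown here.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def upstream_orfs(transcript: str, cds_start: int, min_codons: int = 2):
    """List (start, end) of ORFs lying entirely upstream of the main CDS.

    cds_start is the 0-based index of the annotated start codon. An ORF
    runs from an ATG to the first in-frame stop; end is exclusive and
    includes the stop codon. min_codons counts codons before the stop.
    """
    seq = transcript.upper()
    orfs = []
    for start in range(0, cds_start - 2):
        if seq[start:start + 3] != "ATG":
            continue
        for pos in range(start + 3, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOP_CODONS:
                # Keep only ORFs that end before the main CDS begins.
                if pos + 3 <= cds_start and (pos - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3))
                break
    return orfs


# uORF ATG GCT TAA at 2..10, main CDS ATG AAA CCC TGA starting at 13.
found = upstream_orfs("CCATGGCTTAACCATGAAACCCTGA", cds_start=13)
```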
Conserved non-coding regions

People involved:
o MIT (Manolis Kellis, Mike Lin, Huy, Alex Stark, Pouya, Leo)
o Harvard (Bill Gelbart)
o CSHL (Greg Hannon, Julius Brennecke)
o Whitehead (Dave Bartel, Graham Ruby)
o UCSC (Jakob Pedersen)

“ultraconserved” elements
o 1851 elements > 60 nt, 100% conserved in at least 11 species
o Enriched in intron/exon boundaries and intergenic regions
o Intron CNEs enriched for transcription factor genes
o Intron/exon CNEs enriched in nervous system proteins/channels
o Overlap with known enhancer elements?
o BLASTs/BLAST pipeline (other species) is in place (2 days)
o BLAST to Dmel for other family members: data is there (2 days)
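
The definition of an "ultraconserved" element given above (longer than 60 nt, 100% conserved in at least 11 species) can be sketched as a scan over a multiple alignment for runs of identical, ungapped columns. The function names, the dictionary input format, and the exhaustive loop over species subsets are assumptions for illustration; a run found in all species is also reported for each subset that contains it, so overlapping intervals may appear.

```python
from itertools import combinations


def identical_runs(rows, min_len):
    """Half-open (start, end) column runs longer than min_len where
    every row carries the same non-gap base."""
    runs, start, n = [], None, len(rows[0])
    for col in range(n + 1):
        ok = (col < n and rows[0][col] != "-"
              and all(r[col] == rows[0][col] for r in rows))
        if ok and start is None:
            start = col
        elif not ok and start is not None:
            if col - start > min_len:
                runs.append((start, col))
            start = None
    return runs


def ultraconserved(alignment, min_len=60, min_species=11):
    """alignment maps species name -> aligned row ('-' for gaps).
    Returns sorted runs perfectly conserved in at least min_species rows."""
    names = sorted(alignment)
    found = set()
    for k in range(min_species, len(names) + 1):
        for subset in combinations(names, k):
            found.update(identical_runs([alignment[s] for s in subset], min_len))
    return sorted(found)


toy = {"a": "ACGTACGT", "b": "ACGTTCGT", "c": "ACGTAC-T"}
toy_runs = ultraconserved(toy, min_len=3, min_species=2)
```

With 12 species and a threshold of 11, the subset loop covers only 13 combinations, so the exhaustive approach stays cheap at that scale.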

RNA genes
o tRNAs
o snoRNAs, snRNAs, rRNAs
o New types of RNA genes
o Secondary structure properties of mRNAs

microRNAs: conservation-based identification of Drosophila miRNAs
o prediction selects against exons, transposons, and repeat sequence
o top predictions have miRNA-like features not used for prediction
o top novel hairpins are validated by library cloning (Bartel, Hannon) with 90% accuracy
o 28 novel miRNAs validated; 9 previously predicted miRNAs confirmed by cloning; 6 corrected
o new miRNA family members, new miRNA families
o targets for new miRNAs
o miRNAs in the introns of msi, kis, E2F, cdc2D
o mature miRNA 5' ends can be predicted with high accuracy; exceptions highlight the importance of
the star sequence (though this is not a general trend!)
o prediction accuracy scales with branch length; prediction of clade-specific miRNAs with high
accuracy is currently impossible
o estimated number of (conserved) Drosophila miRNAs: < 150
Gene regulation

People involved:
o MIT (Manolis Kellis, Alex Stark)

Promoter motifs
o Properties of known regulatory motifs
o Signatures for motif discovery
o Computational validation (against known motifs, tissue-specific expression, GO, positional
bias, no strand bias)
3’ UTR motifs
o Role in miRNA regulation
o Role in identification of new miRNA genes
o Other elements: Pumilio (PUF) binding sites (incl. nanos)
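
The outline does not say how 3' UTR motifs are linked to miRNAs. As a hedged illustration, the sketch below assumes the widely used seed-match convention, in which a site is the reverse complement of miRNA nucleotides 2-8; the function name and the example UTR are hypothetical, and the let-7 sequence serves only as sample input.

```python
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def seed_sites(mirna: str, utr: str):
    """Return 0-based start positions of 7-mer seed matches in a 3' UTR.

    Assumes the site is the reverse complement of miRNA nucleotides 2-8
    (the seed-match convention; an assumption, not a method stated in
    the outline). DNA input is converted to RNA letters first.
    """
    mirna = mirna.upper().replace("T", "U")
    utr = utr.upper().replace("T", "U")
    seed = mirna[1:8]                         # nucleotides 2-8 of the miRNA
    site = seed.translate(COMPLEMENT)[::-1]   # reverse complement of the seed
    return [i for i in range(len(utr) - len(site) + 1)
            if utr[i:i + len(site)] == site]


# let-7 seed GAGGUAG pairs with the UTR site CUACCUC.
positions = seed_sites("UGAGGUAGUAGGUUGUAUAGUU", "AAACUACCUCAAA")
```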
Identification of gene targets
o Transcriptional regulation
o Targets of miRNA genes
Motif combinations and grammars
Towards motif-based gene regulatory networks
Discussion: Assessing power to identify functional elements with 12 genomes

Evolutionary signatures
o Protein-coding genes
o miRNA genes
o motifs
o CNEs