Evolutionary signatures for protein-coding genes
  By studying evolution, we found this:
    Evolutionary signatures: Reading Frame Conservation, Codon Substitution Biases, Ka/Ks rate
    Scaling to multiple species: increased separation between genes and non-coding regions
    Combining these signatures: Classification

Revisiting the D. melanogaster protein-coding gene catalog
  Purpose here is not to study gene evolution (see Eisen paper), but rather to leverage the 12 genomes to improve gene annotation quality.
  People involved:
    o MIT (Manolis, Mike Lin)
    o FlyBase (Bill Gelbart, Peili Zhang, and FlyBase-Harvard curators)
    o BDGP (Sue Celniker, Joe Carlson)
  Identification of novel protein-coding sequence
    o Method
        CONGO similar in effect to Siepel's Exoniphy, but uses more flexible discriminative algorithms
    o Predictions
    o Experimental results/validations
    o BDGP August iPCR runs - new cDNAs in GenBank
    o BDGP heterochromatin cDNAs (unpublished data)
    o FB4.2 vs 4.3
    o FB human curation (~400 annotation changes)
    o Intersection with Affy transfrags (so-so enrichment)
    o Mass spec (ETH Zurich/Sandra Lovenich, Erich Bruggner, Konrad Basler, Ernst Hafen)
    o To Do:
        who are those genes (Dscam etc.), how many exons prior to this (alternative splicing!)
        is length of exon mult. of 3, or does it otherwise fit with splicing requirements
        What is the expected fraction of novel exons to be in an intron (1/3 ?)
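The Reading Frame Conservation (RFC) signature listed above can be illustrated with a small sketch. This is not the CONGO method, and the outline does not specify how RFC is computed; the version below assumes a simplified pairwise definition: label every non-gap base of the target and the informant with its codon position, try the three possible informant frame offsets, and report the best fraction of aligned base pairs whose labels agree. The function names `frame_labels` and `rfc_score` are hypothetical.

```python
def frame_labels(aligned_seq, offset=0):
    """Label each non-gap base with its codon position (0, 1, 2); gaps get None."""
    labels = []
    position = offset
    for base in aligned_seq:
        if base == "-":
            labels.append(None)
        else:
            labels.append(position % 3)
            position += 1
    return labels


def rfc_score(target, informant):
    """Simplified RFC (an assumption, not the source's definition).

    Returns the best fraction, over the three informant frame offsets,
    of aligned (non-gap) base pairs that share the same codon position.
    """
    if len(target) != len(informant):
        raise ValueError("aligned sequences must have equal length")
    target_labels = frame_labels(target)
    best = 0.0
    for offset in range(3):
        informant_labels = frame_labels(informant, offset)
        pairs = [
            (t, i)
            for t, i in zip(target_labels, informant_labels)
            if t is not None and i is not None
        ]
        if not pairs:
            return 0.0
        agreeing = sum(1 for t, i in pairs if t == i)
        best = max(best, agreeing / len(pairs))
    return best


# A 3-nt gap keeps the frame; a 1-nt gap shifts everything downstream of it.
in_frame = rfc_score("ATGAAACCC", "ATG---CCC")
shifted = rfc_score("ATGAAACCCGGG", "ATGA-ACCCGGG")
```

Under this definition, gaps whose lengths are multiples of three leave the score high, while a frame-shifting gap lowers it, which is the behaviour expected to separate coding from non-coding alignments.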
  Dubious genes
    o Method: identify genes for which we can't find any comparative evidence that they are real, by multiple metrics and in multiple alignment sets
    o Properties of set
        length/RFC distribution
        lack of GO terms, lack of names
        lack of cDNA/EST evidence
        many single exon/single ORF
    o For some of those that are transcribed, we predict conserved noncoding elements in the transcripts (RNA genes, microRNA genes)
  "Confirmed" hypothetical genes
    o Method
        Identify genes without cDNA/EST evidence, but with strong evolutionary evidence in syntenic alignments
        Limited definition of "confirmed" (can be sure a gene is protein-coding, but can't really confirm gene structure or identify alt. splicing variation)
  Corrections and adjustments to existing annotations
    o Translation start adjustments: evolutionary evidence suggests translation starts at a downstream ATG
    o Transcript model corrections: by detecting frameshifts near an intron
    o ORF corrections: wrong ORF currently annotated
  Summary: revised gene catalog
    o Incorporation in FlyBase

Discovery of non-canonical genic phenomena
  People involved:
    o MIT (Manolis Kellis, Mike Lin)
    o Harvard (Bill Gelbart, Andy Schroeder, others from FlyBase-Harvard)
    o UCSC (Jakob Pedersen on RNA involvement in recoding)
  Translational readthrough: observe protein-coding signatures continuing straight past the stop codon
  Frameshifts: observe adjacent windows conserved in different frames (not near an intron)
  Polycistronics/uORFs: observe well-conserved disjoint ORFs in known transcript models

Conserved non-coding regions
  People involved:
    o MIT (Manolis Kellis, Mike Lin, Huy, Alex Stark, Pouya, Leo)
    o Harvard (Bill Gelbart)
    o CSHL (Greg Hannon, Julius Brennecke)
    o Whitehead (Dave Bartel, Graham Ruby)
    o UCSC (Jakob Pedersen)
  “ultraconserved” elements
    o 1851 elements > 60 nt, 100% conserved in at least 11 species
    o Enriched in intron/exon boundaries and intergenic regions
    o Intron CNEs enriched for transcription factor genes
    o Intron/exon CNEs enriched in nervous system proteins/channels
    o Overlap with known enhancer elements?
    o Blasts/Blast pipeline (other species) is there (2 days)
    o Blast to Dmel for other family members (data there, 2 days)
  RNA genes
    o tRNAs
    o snoRNAs, snRNAs, rRNAs
    o New types of RNA genes
    o Secondary structure properties of mRNAs
  microRNAs: conservation-based identification of Drosophila miRNAs
    o prediction selects against exons, transposons and repeat sequence
    o top predictions have miRNA-like features not used for prediction
    o top novel hairpins are validated by library cloning (Bartel, Hannon) with 90% accuracy
    o 28 novel miRNAs validated; 9 previously predicted miRNAs confirmed by cloning; 6 corrected
    o new miRNA family members, new miRNA families
    o targets for new miRNAs
    o miRNAs in the introns of msi, kis, E2F, cdc2D
    o mature miRNA 5' ends can be predicted with high accuracy; exceptions highlight the importance of the star sequence (though this is not a general trend!)
    o prediction accuracy scales with branch length; prediction of clade-specific miRNAs with high accuracy is currently impossible
    o estimate of (conserved) Drosophila miRNAs < 150

Gene regulation
  People involved:
    o MIT (Manolis Kellis, Alex Stark)
  Promoter motifs
    o Properties of known regulatory motifs
    o Signatures for motif discovery
    o Computational validation (against known motifs, tissue-specific expression, GO, positional bias, no strand bias)
  3’ UTR motifs
    o Role in miRNA regulation
    o Role in identification of new miRNA genes
    o Other elements
        Pumilio (PUF) binding sites (incl. nanos)
  Identification of gene targets
    o Transcriptional regulation
    o Targets of miRNA genes
  Motif combinations and grammars
  Towards motif-based gene regulatory networks

Discussion: Assessing power to identify functional elements with 12 genomes
  Evolutionary signatures
    o Protein-coding genes
    o miRNA genes
    o motifs
    o CNEs
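
The “ultraconserved” element criterion listed under conserved non-coding regions (> 60 nt, 100% conserved in at least 11 species) can be sketched as a scan over a multiple alignment. The outline does not say how the elements were actually called, so the following is only a minimal illustration under stated assumptions: the alignment is a dict of equal-length gapped strings, a column counts as conserved when at least `min_species` sequences (including the reference) carry the reference base, and the species set is allowed to differ from column to column. The function name `conserved_runs` and all parameter names are hypothetical.

```python
def conserved_runs(alignment, reference, min_species=11, min_length=61):
    """Return half-open (start, end) alignment-column intervals of conserved runs.

    Assumption: conservation is judged per column, so different columns of a
    run may be supported by different species. A gap in the reference ends a run.
    """
    ref_seq = alignment[reference]
    if any(len(seq) != len(ref_seq) for seq in alignment.values()):
        raise ValueError("all aligned sequences must have equal length")

    runs = []
    run_start = None
    # Iterate one column past the end so a run reaching the last column is closed.
    for col in range(len(ref_seq) + 1):
        conserved = False
        if col < len(ref_seq) and ref_seq[col] != "-":
            matches = sum(1 for seq in alignment.values() if seq[col] == ref_seq[col])
            conserved = matches >= min_species
        if conserved and run_start is None:
            run_start = col
        elif not conserved and run_start is not None:
            if col - run_start >= min_length:
                runs.append((run_start, col))
            run_start = None
    return runs


# Toy alignment with three species; thresholds are scaled down for illustration.
toy = {
    "dmel": "ACGTACGTAC",
    "dsim": "ACGTTCGTAC",
    "dyak": "ACGTGCGTAC",
}
toy_runs = conserved_runs(toy, "dmel", min_species=3, min_length=4)
```

A stricter reading of the criterion would require the same 11 species to support every column of an element; that variant would track the intersection of supporting species along each run instead of a per-column count.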