Genome analysis and Gene identification: Genome Annotation, Gene Prediction, and DNA motifs

Objectives:
- learn what is meant by “genome annotation” and why this is important
- understand why some genomic features are relatively easy to annotate, and others are not
- learn the major ways of representing transcription factor binding sites and the advantages and disadvantages of each

Assigned reading:
Stein, L. 2001. Genome annotation: from sequence to biology. Nat Rev Genet 2: 493-503.
D'Haeseleer, P. 2006. What are DNA sequence motifs? Nat Biotechnol 24: 423-425.

Genome Sequencing

As of:                              6/25/04   7/25/05   7/20/07   6/29/08
genome projects                        1128      1496      2680      3825
complete                                199       274       579       827
  (includes eukaryotes)                  28        36        49        94
prokaryotic genomes in progress         508       728      1285      1932
eukaryotic genomes in progress          421       494       721       936

Genome sizes (small to large):
Nanoarchaeum equitans (archaebacterium)        500 kb
Bacillus anthracis (anthrax)                 5,228 kb
S. cerevisiae (yeast)                       12,069 kb
Arabidopsis thaliana                       115,428 kb
Drosophila melanogaster (fruit fly)        137,000 kb
Anopheles gambiae (malaria mosquito)       278,000 kb
Oryza sativa (rice)                        420,000 kb
Mus musculus (mouse)                     2,493,000 kb
Homo sapiens (human)                     2,900,000 kb

http://www.genomesonline.org/

so what?

Genome sequencing helps in:
• identifying new genes (“gene discovery”)
• looking at chromosome organization and structure
• finding gene regulatory sequences
• comparative genomics

These in turn lead to advances in:
• medicine
• agriculture
• biotechnology
• understanding evolution and other basic science questions

Because of the vast amounts of data that are generated, we need new approaches:
• high throughput assays
• robotics
• high speed computing
• statistics
• bioinformatics

Understanding the genome

We know the sequence, but can we understand it?

Anna Pavlovna's drawing room was gradually filling. The highest Petersburg society was assembled there: people differing widely in age and character but alike in the social circle to which they belonged.
Prince Vasili's daughter, the beautiful Helene, came to take her father to the ambassador's entertainment; she wore a ball dress and her badge as maid of honor. The youthful little Princess Bolkonskaya, known as la femme la plus seduisante de Petersbourg, was also there. She had been married during the previous winter, and being pregnant did not go to any large gatherings, but only to small receptions. Prince Vasili's son, Hippolyte, had come with Mortemart, whom he introduced. The Abbe Morio and many others had also come. To each new arrival Anna Pavlovna said, "You have not yet seen my aunt," or "You do not know my aunt?" and very gravely conducted him or her to a little old lady, wearing large bows of ribbon in her cap, who had come sailing in from another room as soon as the guests began to arrive; and slowly turning her eyes from the visitor to her aunt, Anna Pavlovna mentioned each one's name and then left them.
-- Tolstoy, War and Peace

Understanding the genome

We don’t know the language:

Гостиная Анны Павловны начала понемногу наполняться. Приехала высшая знать Петербурга, люди самые разнородные по возрастам и характерам, но одинаковые по обществу, в каком все жили; приехала дочь князя Василия, красавица Элен, заехавшая за отцом, чтобы с ним вместе ехать на праздник посланника. Она была в шифре и бальном платье. Приехала и известная, как la femme la plus séduisante de Pétersbourg, молодая, маленькая княгиня Болконская, прошлую зиму вышедшая замуж и теперь не выезжавшая в большой свет по причине своей беременности, но ездившая еще на небольшие вечера. Приехал князь Ипполит, сын князя Василия, с Мортемаром, которого он представил; приехал и аббат Морио и многие другие.
— Вы не видали еще, — или: — вы не знакомы с ma tante?
— говорила Анна Павловна приезжавшим гостям и весьма серьезно подводила их к маленькой старушке в высоких бантах, выплывшей из другой комнаты, как скоро стали приезжать гости, называла их по имени, медленно переводя глаза с гостя на ma tante, и потом отходила. Все гости совершали обряд приветствования никому не известной, никому не интересной и не нужной тетушки. Анна Павловна с грустным, торжественным участием следила за их приветствиями, молчаливо одобряя их. Ma tante каждому говорила в одних и тех же выражениях о его здоровье, о своем здоровье и о здоровье ее величества, которое нынче было, слава Богу, лучше. Все подходившие, из приличия не выказывая поспешности, с чувством облегчения исполненной тяжелой обязанности отходили от старушки, чтоб уж весь вечер ни

Understanding the genome

Even if we did, we don’t know the grammar, or punctuation:

annapavlovnasdrawingroomwasgraduallyfillingthehighestpetersburgsocietywasassembledt
herepeopledifferingwidelyinageandcharacterbutalikeinthesocialcircletowhichtheybelonged
princevasilisdaughterthebeautifulhelenecametotakeherfathertotheambassadorsentertainmen
tsheworeaballdressandherbadgeasmaidofhonortheyouthfullittleprincessbolkonskayaknown
aslafemmelaplusseduisantedepetersbourgwasalsothereshehadbeenmarriedduringthepreviou
swinterandbeingpregnantdidnotgotoanylargegatheringsbutonlytosmallreceptionsprincevasil
issonhippolytehadcomewithmortemartwhomheintroducedtheabbemorioandmanyothershada
lsocometoeachnewarrivalannapavlovnasaidyouhavenotyetseenmyauntoryoudonotknowmya
untandverygravelyconductedhimorhertoalittleoldladywearinglargebowsofribboninhercapw
hohadcomesailinginfromanotherroomassoonastheguestsbegantoarriveandslowlyturninghere
yesfromthevisitortoherauntannapavlovnamentionedeachonesnameandthenleftthemeachvisit
orperformedtheceremonyofgreetingthisoldauntwhomnotoneofthemknewnotoneofthemwant
edtoknowandnotoneofthemcaredaboutannapavlovnaobservedthesegreetingswithmournfula
ndsolemninterestandsilentapprovaltheauntspoketoeachoftheminthesamewordsabouttheirhea
lthandherownandthehealthofhermajestywhothankgodwasbettertodayandeachvisitorthoughp
olitenesspreventedhisshowingimpatiencelefttheoldwomanwithasenseofreliefathavingperfor
medavexatiousdutyanddidnotreturntoherthewholeeveningtheyoungprincessbolkonskayahad
broughtsomeworkinagold
-- Tolstoy, War and Peace

In order to make use of the genome sequence, we need to understand all of its components. Assigning identities and functions to sequences within the genome is called genome annotation.

“With the complete human genome sequence now in hand, we face the enormous challenge of interpreting it and learning how to use that information to understand the biology of human health and disease. The ENCyclopedia Of DNA Elements (ENCODE) Project is predicated on the belief that a comprehensive catalog of the structural and functional components encoded in the human genome sequence will be critical for understanding human biology well enough to address those fundamental aims of biomedical research. Such a complete catalog, or "parts list," would include protein-coding genes, non-protein-coding genes, transcriptional regulatory elements, and sequences that mediate chromosome structure and dynamics; undoubtedly, additional, yet-to-be-defined types of functional sequences will also need to be included.”

What’s in a genome?

Genes (i.e., protein coding)
But... only <2% of the human genome encodes proteins

Other than protein coding genes, what is there?
• genes for noncoding RNAs (rRNA, tRNA, miRNAs, etc.)
• structural sequences (scaffold attachment regions)
• regulatory sequences
• non-functional “junk”?

It’s still uncertain/controversial how much of the genome is composed of any of these classes. The answers will come from experimentation and bioinformatics.
Current human genome annotations can be viewed using the UCSC Genome Browser.

The ENCyclopedia Of DNA Elements (ENCODE) Project aims to identify all functional elements in the human genome sequence.
• pilot phase focused on 30 Mb (~1%) of the genome
• international consortium of computational and laboratory-based scientists working to develop and apply high-throughput approaches for detecting all sequence elements that confer biological function
• now in its second phase, extending study to entire human genome

[Figure: Functional genomic elements being identified by the ENCODE pilot phase. The ENCODE Project Consortium, Science 306, 636-640 (2004). Published by AAAS.]

protein-coding genes, non-protein-coding genes
• easier to find than other functional elements
• why?
  • genes are transcribed, which means that we can identify them by looking at RNA
  • traditionally this has been done by cDNA or EST sequencing, more recently by microarray, SAGE, MPSS, etc.

protein-coding genes, non-protein-coding genes
• we can also find genes ab initio using computational methods
• this is most suited to protein-coding genes
• why?
  • protein-coding genes have recognizable features
    • open reading frames (ORFs)
    • codon bias
    • known transcription and translational start and stop motifs (promoters, 3’ poly-A sites)
    • splice consensus sequences at intron-exon boundaries

ab initio gene discovery
• Protein-coding genes have recognizable features
• We can design software to scan the genome and identify these features
• Some of these programs work quite well, especially in bacteria and simpler eukaryotes with smaller and more compact genomes
• It’s a lot harder for the higher eukaryotes, where there are a lot of long introns, genes can be found within introns of other genes, etc.
• We tend to do OK finding protein-coding regions, but miss a lot of non-coding 5’ exons and the like

ab initio gene discovery: validating predictions and refining gene models
• Standard types of evidence for validation of predictions include:
  • match to previously annotated cDNA
  • match to EST from same organism
  • similarity of nucleotide or conceptually translated protein sequence to sequences in GenBank (translation works better; why?)
  • protein structure prediction match to a Pfam domain
  • association with recognized promoter sequences, e.g., TATA box, CpG island
  • known phenotype from mutation of the locus

Finding non-protein-coding genes
• e.g., tRNA, rRNA, snoRNA, miRNA, various other ncRNAs
• Harder to find than protein-coding genes
• Why?
  • often not poly-A tailed, so they don’t end up in cDNA libraries
  • no ORF
  • constraint on sequence divergence is at the nucleotide level, not the protein level, so homology is harder to detect
• So, how do we find these?

Finding non-protein-coding genes
• secondary structure
• homology, especially alignment of related species
• experimentally
  • isolation through non-polyA dependent cloning methods
  • microarrays

ab initio gene discovery: approaches
Most gene-discovery programs make use of some form of machine learning algorithm. A machine learning algorithm requires a training set of input data that the computer uses to “learn” how to find a pattern. Two common machine learning approaches used in gene discovery (and many other bioinformatics applications) are artificial neural networks (ANNs) and hidden Markov models (HMMs).

ab initio gene discovery: HMMs
An example state diagram for an HMM for gene discovery is this simplified version of one used by Genscan:

[Figure: HMM state diagram. States: begin gene region, 5’ UTR, start translation, initial exon, single exon, exon, intron, donor splice site, acceptor splice site, final exon, stop translation, 3’ UTR, end gene region. Emitted symbols: A, T, G, C.]

Each box and arrow has associated transition probabilities, and emission probabilities for emission of nucleotides (dotted arrow).
These are learned from examples of known gene models and provide the probability that a stretch of sequence is a gene.
(adapted from Gibson and Muse, A Primer of Genome Science)

What about other genomic features?

Other than protein coding genes, what is there?
• genes for noncoding RNAs (rRNA, tRNA, miRNAs, etc.)
• structural sequences (scaffold attachment regions)
• regulatory sequences
• non-functional “junk”?

We can begin to annotate regulatory sequences such as transcription factor binding sites and cis-regulatory modules.

Remember from Unit 2-2:
Control of Gene Expression: Transcription Factors
Transcription factors (TFs) are proteins that bind to the DNA and help to control gene expression. We call the sequences to which they bind transcription factor binding sites (TFBSs), which are a type of cis-regulatory sequence.

Transcription factors bind to specific DNA sequences
(Isalan et al. Biochemistry 37:12026)
Usually, binding sites are first determined empirically. Most transcription factors can bind to a range of similar sequences. We can represent these in either of two ways, as a consensus sequence, or as a position weight matrix (PWM). Once we know the binding site, we can search the genome to find all of the (predicted) binding sites.

Control of Gene Expression: Transcription Factors
Most transcription factors can bind to a range of similar sequences. We call this a binding “motif.”
(Wasserman, W. W. and A. Sandelin (2004). Nat Rev Genet 5(4): 276-287.)
We can represent these motifs either as a consensus sequence or as a frequency (or weight) matrix.
Binding site (motif) representations

7 characterized binding sites for a certain transcription factor:
TCCGGAAGC
TCCGGATGC
TCCGGATCT
CATGGATGC
CCAGGAAGT
GGTGGATGC
ACCGGATGC

consensus sequence: [T/C]C[C/T]GGATGC

Frequency matrix and its graphical depiction, a sequence logo:

position   1  2  3  4  5  6  7  8  9
A          1  1  1  0  0  7  2  0  0
T          3  0  2  0  0  0  5  0  2
G          1  1  0  7  7  0  0  6  0
C          2  5  4  0  0  0  0  1  5

[Figure: sequence logo]

Binding site (motif) representations
A consensus sequence is a one-line description of the TFBS, based on a column-by-column alignment of the individual known binding sites. The usual rule is: A single base is shown if it occurs in more than half the sites and at least twice as often as the second most frequent base. Otherwise, a double degenerate symbol (e.g., G/C = S) is used if two bases occur in more than 75% of the sites, or a triple degenerate symbol when one base does not occur at all.
A frequency matrix shows the actual frequencies of each base in each column. This can be easily converted to a position weight matrix (PWM), which is a normalized version of the frequency matrix that is therefore not dependent on the number of sites in the alignment.

Finding binding sites in the genome
consensus: [T/C]C[C/T]GGATGC
Consensus sequences make searching easy: it’s a simple text search that can even be done using a word processor, or very simply programmed in a computer language such as Perl:

    while (<SEQUENCE>) {
        if ($_ =~ /[TC]C[TC]GGATGC/) {
            # do something with the matching line
        }
    }

All positions in the motif are treated the same.

Identifying transcription factor binding sites
But PWMs are generally more useful:
• they allow us to assign more importance to more invariant positions
• they are related to the binding energy of the DNA-protein interaction
• we can compare PWMs and we can score PWMs
Scores are based on the probability of a given nucleotide being in a given position.
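The conversion from aligned sites to a frequency matrix and then to a PWM can be sketched in a few lines of Python. This is only an illustrative sketch, not the method of any particular tool: the log2-odds formulation, the pseudocount of 0.25, the uniform background of 0.25 per base, and the function names are all assumptions made here. The three candidate 9-mers scored at the end are chosen for illustration.

```python
import math

# The seven characterized binding sites from the alignment above.
SITES = [
    "TCCGGAAGC", "TCCGGATGC", "TCCGGATCT", "CATGGATGC",
    "CCAGGAAGT", "GGTGGATGC", "ACCGGATGC",
]
BASES = "ATGC"


def count_matrix(sites):
    """Count how often each base occurs in each alignment column."""
    length = len(sites[0])
    counts = {base: [0] * length for base in BASES}
    for site in sites:
        for pos, base in enumerate(site):
            counts[base][pos] += 1
    return counts


def weight_matrix(counts, n_sites, pseudocount=0.25, background=0.25):
    """Turn counts into log2-odds weights.

    The pseudocount (which avoids log(0) for unseen bases) and the
    uniform background are assumptions of this sketch.
    """
    total = n_sites + 4 * pseudocount
    return {
        base: [math.log2(((c + pseudocount) / total) / background) for c in row]
        for base, row in counts.items()
    }


def score(pwm, seq):
    """Sum the per-position weights for a sequence of motif length."""
    return sum(pwm[base][pos] for pos, base in enumerate(seq))


COUNTS = count_matrix(SITES)
PWM = weight_matrix(COUNTS, len(SITES))

# Score a few candidate sites; a higher score means a better match.
for candidate in ("TCCGGAAGC", "TCCGGAACT", "TCCGGAAAA"):
    print(candidate, round(score(PWM, candidate), 2))
```

Because each column is divided by the (pseudocount-adjusted) number of sites, the weights do not depend on how many sites were aligned, which is the point of normalizing the frequency matrix.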
Identifying transcription factor binding sites

position   consensus   A  T  G  C
1          T/C         1  3  1  2
2          C           1  0  1  5
3          C/T         1  2  0  4
4          G           0  0  7  0
5          G           0  0  7  0
6          A           7  0  0  0
7          T           2  5  0  0
8          G           0  0  6  1
9          C           0  2  0  5

Example 1: TCCGGAAGC scores higher than TCCGGAACT, which scores higher than TCCGGAAAA, because GC > CT > AA in the last two positions. Note that the latter two sequences would score the same if using only the consensus representation.

Identifying transcription factor binding sites

consensus: [T/C] C [C/T] G G A T G C

position   1  2  3  4  5  6  7  8  9
A          1  1  1  0  0  7  2  0  0
T          3  0  2  0  0  0  5  0  2
G          1  1  0  7  7  0  0  6  0
C          2  5  4  0  0  0  0  1  5

Example 2: TCAGGATGC and TCCAGATGC both have a single mismatch compared to the consensus. But the first is a much better binding site when scored using a PWM, due to the strong conservation of the G in position 4 (present in all 7 sites) versus the weak requirement for the C or T in position 3 (A occurs there in 1 of the 7 sites).

Issues with finding binding sites in the genome
But it’s important to use caution: just because a sequence in the genome is a reasonable match to a known TFBS, this doesn’t necessarily mean that the TF is binding there in vivo.

By crude calculation:
The probability of finding a specific 7 bp motif at a given position is 4^-7 = 1/16,384, i.e., expect only about 1 motif every 16 kb.
So in the human genome (~3 × 10^9 bp), this sequence should be present over 183,000 times! (>7x per gene!)
Even in a 10 Mb genome, the sequence would occur over 600 times.
And this calculation does not even take into account motif degeneracy!

So we need to consider additional factors in deciding which predicted binding sites are important, such as how regulatory regions are organized.

Empirical methods, such as ChIP-chip (see Unit 2-3), are a good alternative for looking at in vivo binding; bioinformatics methods can be combined with this to determine the transcription factor binding motifs.

Genome Annotation: Transcription Factors
Because of the difficulty in accurately predicting bona fide, functional TFBSs, most current genome annotation focuses on empirically determined sites. Several databases curate these data, e.g.
the Open Regulatory Annotation database (ORegAnno) and the Regulatory Element Database for Drosophila (REDfly). Tracks displaying these data can be found in the UCSC Genome Browser. These databases also curate cis-regulatory module (CRM) sequences, which at present can only reliably be determined by empirical methods.

Genome Annotation: much work remains
Despite good progress in identifying both protein-coding and non-protein-coding genes, much work remains to be done before even the best-studied genomes are fully annotated. For the higher eukaryotes, only a tiny percentage of features such as TFBSs, CRMs, and other non-gene features have so far been identified.
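
The crude calculation from "Issues with finding binding sites in the genome" is easy to reproduce, which helps show why so few predicted TFBSs can be taken at face value. The sketch below rests on several assumptions: independent, equally likely bases (0.25 each), a single strand, a round human genome size of 3 × 10^9 bp (the figure that yields the lecture's "over 183,000"), and a hypothetical helper name, expected_hits.

```python
def expected_hits(genome_bp, motif_len, p_base=0.25):
    """Expected number of exact matches to one fixed motif on one strand.

    Assumes every base is independent and occurs with probability p_base;
    real genomes violate both assumptions, so this is only a rough guide.
    """
    positions = genome_bp - motif_len + 1  # number of possible start sites
    return positions * p_base ** motif_len


MOTIF_LEN = 7
spacing = 4 ** MOTIF_LEN                          # one expected match per this many bp
human = expected_hits(3_000_000_000, MOTIF_LEN)   # assumed round genome size
small = expected_hits(10_000_000, MOTIF_LEN)      # a 10 Mb genome

print(f"one match expected every {spacing:,} bp")
print(f"human-sized genome: about {human:,.0f} matches")
print(f"10 Mb genome: about {small:,.0f} matches")
```

Counting both strands, or allowing degenerate positions in the motif, would only increase these numbers.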