How do we get an overview?

Annotating the sequence

Chromosome 16: one of the smallest

Finding genes
  What are we looking for?
    Proteins encoded in mRNA
    Non-coding RNA (ncRNA) genes
  Where are we looking?
    Prokaryotes
    Eukaryotes (genes often contain introns)

Classes of RNA
  fRNA:   Functional RNA (essentially synonymous with non-coding RNA)
  mRNA:   Messenger RNA (codes for proteins)
  miRNA:  MicroRNA (putative translational regulatory gene family)
  ncRNA:  Non-coding RNA (all RNAs other than mRNA)
  rRNA:   Ribosomal RNA
  siRNA:  Small interfering RNA (active molecules in RNA interference)
  snRNA:  Small nuclear RNA (includes spliceosomal RNAs)
  snmRNA: Small non-mRNA (essentially synonymous with small ncRNAs)
  snoRNA: Small nucleolar RNA (usually involved in rRNA modification)
  stRNA:  Small temporal RNA (e.g. lin-4 and let-7 in C. elegans)
  tRNA:   Transfer RNA
  Source: Eddy SR (2001) Nature Reviews Genetics

Information in the sequence that can be used to find genes
  "Signals" in the sequence: splice signals, promoters, termination signals, poly(A) signals, CpG islands (gene search by signal)
  The "content" of the sequence: ORFs, codon statistics, etc. (gene search by content)
  Similarity to known genes (gene search by similarity)

From gene to protein: so easy for the cell, so hard for us

Simple protein finding
  Examine all 6 possible reading frames
    3 frames on the forward strand
    3 frames on the reverse strand
  Plot the positions of
    Initiation (start) codon (methionine): ATG
    Termination (stop) codons: TAA, TAG, TGA
  Look for long stretches without stop codons after a start codon
  Source: http://cwx.prenhall.com/horton/medialib/media_portfolio/

Standard Genetic Code
  The standard genetic code is used in most organisms
  A different code is used in mitochondria and in some organisms
  Overview of genetic codes in various organisms: http://www.ncbi.nlm.nih.gov/htbinpost/Taxonomy/wprintgc?mode=c

Start and stop codon distribution
  Distribution of start codons (short lines) and stop codons (long lines) in the six reading frames along a genomic sequence (lacZ operon in E. coli). There is an open reading frame (lacZ) in frame +3 from position 1284 to 4355. Created by DNA STRIDER.

Prokaryotic promoter regions
  Source: http://cwx.prenhall.com/horton/medialib/media_portfolio/

Transcription termination

Shine-Dalgarno (SD) sequence
  The ribosome binding site on the mRNA, which base-pairs with the 3' end of the 16S rRNA

Transcription and translation
  [Figure: genomic DNA (promoter, exon 1, intron 1, exon 2, intron 2, exon 3, terminator); primary transcript with introns delimited by GU…AG; spliced mRNA with cap, 5' UTR, start codon AUG (M), stop codon (TAA/TAG/TGA), 3' UTR and poly(A) tail (AAAA…); protein]

Gene, exon and intron number for whole ExInt and subdivisions

                              Gene number   Exon number   Intron number
  Whole ExInt                      94 615       518 169         525 870
  Non-redundant ExInt              15 271       113 457         128 065
  Rattus norvegicus                   835         4 889           7 191
  Homo sapiens                      8 287        60 499          43 127
  Mus musculus                      3 044        18 920          15 407
  Drosophila melanogaster          15 220        64 271          89 969
  Caenorhabditis elegans           18 924       121 708         108 803
  Arabidopsis thaliana             25 216       158 629         127 386
  Saccharomyces cerevisiae            589         1 695           1 438

Distribution of exon sizes in ExInt

Distribution of intron sizes in ExInt

Intron phase: exon/intron junctions between codons or within them
  (phase 0: between two codons; phase 1: after the first base of a codon; phase 2: after the second base)

  Intron phase                  0                1                2
  All ExInt                     257 713 (49%)    147 625 (28%)    120 532 (23%)
  Non-redundant                  60 979 (48%)     35 438 (28%)     31 608 (24%)
  Rattus norvegicus               2 842 (39%)      2 365 (33%)      1 384 (28%)
  Mus musculus                    6 703 (44%)      5 921 (38%)      2 783 (18%)
  Caenorhabditis elegans         51 251 (47%)     28 553 (26%)     28 999 (27%)
  Homo sapiens                   19 102 (44%)     15 423 (36%)      8 602 (20%)
  Arabidopsis thaliana           71 958 (56%)     28 178 (22%)     27 250 (22%)
  Drosophila melanogaster        38 101 (42%)     28 896 (32%)     22 972 (26%)
  Saccharomyces cerevisiae          641 (45%)        428 (30%)        369 (25%)

How do we find splice signals and exons?
  Weight matrices: how are nucleotides distributed around splice sites?
  "Weight array matrices", which take neighboring nucleotides into account
  "Maximal dependence decomposition": correlations with non-adjacent nucleotides
  Hidden Markov models
  Neural networks: a pattern-recognition technique that "learns"

How a weight matrix is built

And how it is used

Consensus sequences for exon/intron junctions

Different classes of exons that must be detected in different ways
  Initial exons: begin with a start codon and end with a splice donor site
  Internal exons: begin with an acceptor site and end with a donor site
  Terminal exons: begin with an acceptor site and end with a stop codon
  Single-exon genes: begin with a start codon and end with a stop codon

Integrated gene finding: what follows what?

Neural networks: an example
  The Grail II system for finding exons in eukaryotic genes (Uberbacher and Mural 1991; Uberbacher et al. 1996). The method uses a neural network to identify patterns characteristic of coding sequences. The network has three layers: an input layer that receives data from a candidate exon sequence, a hidden layer for discerning relationships among the input data, and an output layer comprising one neuron that indicates whether or not the region is likely to be an exon. Each neuron receives information from a set of neurons in the layer above, some with a positive value and others with a negative value; sums these values; and then converts them to an output of approximately 0 or 1. The system is trained using a set of known coding sequences, and as each sequence is utilized, the strengths and types of connections (positive or negative) between the neurons are adjusted, decreasing or increasing the signal to the next neuron in a manner that produces the correct output. The major difference between neural networks for exon prediction and for secondary structure prediction is that exon prediction uses sequence pattern information as input, whereas secondary structure prediction uses a window of amino acid sequence in the protein.
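The neuron behaviour described in the caption (positive and negative weighted inputs are summed and then squashed to an output near 0 or 1) can be illustrated with a minimal sketch. The feature values, weights, layer sizes, and function names below are hypothetical and hand-picked for illustration only; they are not Grail II's actual inputs or trained parameters, and the training step is not shown.

```python
import math


def sigmoid(x):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def neuron(inputs, weights, bias):
    """Sum the weighted inputs (weights may be positive or negative), then squash."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(total)


def forward(features, hidden_layer, output_neuron):
    """Input layer -> hidden layer -> single output neuron."""
    hidden = [neuron(features, weights, bias) for weights, bias in hidden_layer]
    out_weights, out_bias = output_neuron
    return neuron(hidden, out_weights, out_bias)


# Hypothetical weights (assumption): two hidden neurons, three input features.
HIDDEN = [([6.0, -6.0, 0.0], -1.0), ([0.0, 6.0, -6.0], 1.0)]
OUTPUT = ([8.0, -8.0], 0.0)

# Hypothetical pattern-frequency features for two candidate regions.
exon_like = forward([0.9, 0.1, 0.8], HIDDEN, OUTPUT)
non_exon_like = forward([0.1, 0.9, 0.2], HIDDEN, OUTPUT)
print(f"exon-like score: {exon_like:.3f}, non-exon-like score: {non_exon_like:.3f}")
```

With these made-up weights the first candidate scores close to 1 and the second close to 0; in a trained network the weights would instead be adjusted from known coding sequences.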
  In Grail II, a candidate sequence is evaluated by calculating pattern frequencies in the sequence and applying these values to the neural network. If the output is close to a value of 1, then the region is predicted to be an exon.

Sequence "content": differences between the true reading frame and the other two
  Frame 1 is the true one and contains codons that encode a protein of average amino acid composition

Codon usage in the three reading frames

Base distribution at the three codon positions

Distinguishing coding from non-coding sequence by the base composition at the three codon positions
  N_ij = number of times base i occurs at codon position j (j = 1, 2, 3) within the window
  E_ij = (N_i1 + N_i2 + N_i3) / 3, the expected count of base i at each of the three codon positions
  Divergence: D = Σ |E_ij - N_ij|, summed over all bases i and codon positions j
  Window: 67 codons
  EMBL database, 1984

Codon usage in the E. coli genome
  Escherichia coli [gbbct]: 11865 CDS's (3662594 codons)
  fields: [triplet] [amino acid] [fraction] [frequency: per thousand] ([number])

  UUU F 0.58 22.1 ( 80995)  UCU S 0.17 10.4 ( 38027)  UAU Y 0.59 17.5 ( 63937)  UGU C 0.46  5.2 ( 19138)
  UUC F 0.42 16.0 ( 58774)  UCC S 0.15  9.1 ( 33430)  UAC Y 0.41 12.2 ( 44631)  UGC C 0.54  6.1 ( 22188)
  UUA L 0.14 14.3 ( 52382)  UCA S 0.14  8.9 ( 32715)  UAA * 0.61  2.0 (  7356)  UGA * 0.30  1.0 (  3623)
  UUG L 0.13 13.0 ( 47500)  UCG S 0.14  8.5 ( 31146)  UAG * 0.08  0.3 (   989)  UGG W 1.00 13.9 ( 50991)

  CUU L 0.12 11.9 ( 43449)  CCU P 0.18  7.5 ( 27340)  CAU H 0.57 12.5 ( 45879)  CGU R 0.36 20.0 ( 73197)
  CUC L 0.10 10.2 ( 37347)  CCC P 0.13  5.4 ( 19666)  CAC H 0.43  9.3 ( 34078)  CGC R 0.36 19.7 ( 72212)
  CUA L 0.04  4.2 ( 15409)  CCA P 0.20  8.6 ( 31534)  CAA Q 0.34 14.6 ( 53394)  CGA R 0.07  3.8 ( 13844)
  CUG L 0.47 48.4 (177210)  CCG P 0.49 20.9 ( 76644)  CAG Q 0.66 28.4 (104171)  CGG R 0.11  5.9 ( 21552)

  AUU I 0.49 29.8 (109072)  ACU T 0.19 10.3 ( 37842)  AAU N 0.49 20.6 ( 75436)  AGU S 0.16  9.9 ( 36097)
  AUC I 0.39 23.7 ( 86796)  ACC T 0.40 22.0 ( 80547)  AAC N 0.51 21.4 ( 78443)  AGC S 0.24 15.2 ( 55551)
  AUA I 0.11  6.8 ( 24984)  ACA T 0.17  9.3 ( 33910)  AAA K 0.74 35.3 (129137)  AGA R 0.07  3.6 ( 13152)
  AUG M 1.00 26.4 ( 96695)  ACG T 0.25 13.7 ( 50269)  AAG K 0.26 12.4 ( 45459)  AGG R 0.04  2.1 (  7607)

  GUU V 0.28 19.8 ( 72584)  GCU A 0.18 17.1 ( 62479)  GAU D 0.63 32.7 (119939)  GGU G 0.35 25.5 ( 93325)
  GUC V 0.20 14.3 ( 52439)  GCC A 0.26 24.2 ( 88721)  GAC D 0.37 19.2 ( 70394)  GGC G 0.37 27.1 ( 99390)
  GUA V 0.17 11.6 ( 42420)  GCA A 0.23 21.2 ( 77547)  GAA E 0.68 39.1 (143353)  GGA G 0.13  9.5 ( 34799)
  GUG V 0.35 24.4 ( 89265)  GCG A 0.33 30.1 (110308)  GAG E 0.32 18.7 ( 68609)  GGG G 0.15 11.3 ( 41277)

  Coding GC 50.58%   1st letter GC 57.71%   2nd letter GC 40.68%   3rd letter GC 53.36%
  Genetic code 1: Standard
  Source: http://www.kazusa.or.jp/codon/

Codon usage in the human genome
  Homo sapiens [gbpri]: 44580 CDS's (19894411 codons)
  fields: [triplet] [amino acid] [fraction] [frequency: per thousand] ([number])

  UUU F 0.45 16.9 (336562)  UCU S 0.18 14.6 (291040)  UAU Y 0.44 12.0 (239268)  UGU C 0.45  9.9 (197293)
  UUC F 0.55 20.4 (406571)  UCC S 0.22 17.4 (346943)  UAC Y 0.56 15.6 (310695)  UGC C 0.55 12.2 (243685)
  UUA L 0.07  7.2 (143715)  UCA S 0.15 11.7 (233110)  UAA * 0.28  0.7 ( 14322)  UGA * 0.50  1.3 ( 25383)
  UUG L 0.13 12.6 (249879)  UCG S 0.06  4.5 ( 89429)  UAG * 0.22  0.5 ( 10915)  UGG W 1.00 12.8 (255512)

  CUU L 0.13 12.8 (253795)  CCU P 0.28 17.3 (343793)  CAU H 0.41 10.4 (207826)  CGU R 0.08  4.7 ( 93458)
  CUC L 0.20 19.4 (386182)  CCC P 0.33 20.0 (397790)  CAC H 0.59 14.9 (297048)  CGC R 0.19 10.9 (217130)
  CUA L 0.07  6.9 (138154)  CCA P 0.27 16.7 (331944)  CAA Q 0.25 11.8 (234785)  CGA R 0.11  6.3 (126113)
  CUG L 0.41 40.3 (800774)  CCG P 0.11  7.0 (139414)  CAG Q 0.75 34.6 (688316)  CGG R 0.21 11.9 (235938)

  AUU I 0.36 15.7 (313225)  ACU T 0.24 12.8 (255582)  AAU N 0.46 16.7 (331714)  AGU S 0.15 11.9 (237404)
  AUC I 0.48 21.4 (426570)  ACC T 0.36 19.2 (382050)  AAC N 0.54 19.5 (387148)  AGC S 0.24 19.4 (385113)
  AUA I 0.16  7.1 (140652)  ACA T 0.28 14.8 (294223)  AAA K 0.42 24.0 (476554)  AGA R 0.20 11.5 (228151)
  AUG M 1.00 22.3 (443795)  ACG T 0.12  6.2 (123533)  AAG K 0.58 32.9 (654280)  AGG R 0.20 11.4 (227281)

  GUU V 0.18 10.9 (216818)  GCU A 0.26 18.6 (370873)  GAU D 0.46 22.3 (443369)  GGU G 0.16 10.8 (215544)
  GUC V 0.24 14.6 (290874)  GCC A 0.40 28.5 (567930)  GAC D 0.54 26.0 (517579)  GGC G 0.34 22.8 (453917)
  GUA V 0.11  7.0 (139156)  GCA A 0.23 16.0 (317338)  GAA E 0.42 29.0 (577846)  GGA G 0.25 16.3 (325243)
  GUG V 0.47 28.9 (575438)  GCG A 0.11  7.6 (150708)  GAG E 0.58 40.8 (810842)  GGG G 0.25 16.4 (326879)

  Coding GC 52.65%   1st letter GC 56.26%   2nd letter GC 42.37%   3rd letter GC 59.31%
  Genetic code 1: Standard
  Source: http://www.kazusa.or.jp/codon/

Codon usage diagram
  Usage of various codons along the sequence of lacZ
    O: Optimal codon usage
    S: Suboptimal codon usage
    R: Rare codon usage

Comparative genomics methods
  Gene finding by comparing the genomic sequence with sequences known to be transcribed or translated
  Compare the genomic sequence to sequence databases
    Proteins
    mRNA sequences
    EST sequences (mRNA)
  Both exact matches and approximate matches are interesting
  Conserved sequences between species
  Program: Procrustes

An example of output from the gene-finding program Genscan

Gene finders on the web
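
The simple protein-finding procedure outlined earlier (examine all six reading frames, mark ATG start codons and TAA/TAG/TGA stop codons, and look for long stop-free stretches after a start) can be sketched roughly as follows. This is only an illustrative sketch: the function names, the `min_codons` threshold, and the coordinate convention are assumptions, not part of the slides, and real gene finders combine such a scan with signal and content statistics.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def orfs_in_frame(seq, offset, min_codons):
    """Yield (start, end) of ORFs in one frame; 0-based, end exclusive, stop codon included."""
    start = None
    for pos in range(offset, len(seq) - 2, 3):
        codon = seq[pos:pos + 3]
        if start is None and codon == START_CODON:
            start = pos                      # first ATG opens a candidate ORF
        elif start is not None and codon in STOP_CODONS:
            if (pos - start) // 3 >= min_codons:
                yield start, pos + 3         # long enough: report it
            start = None                     # a stop always closes the candidate


def find_orfs(dna, min_codons=2):
    """Scan all six frames; return a list of (frame, start, end).

    Frames +1..+3 refer to the given strand, -1..-3 to its reverse complement;
    coordinates for negative frames are positions on the reverse-complement string.
    """
    dna = dna.upper()
    hits = []
    for sign, seq in ((1, dna), (-1, reverse_complement(dna))):
        for offset in range(3):
            for start, end in orfs_in_frame(seq, offset, min_codons):
                hits.append((sign * (offset + 1), start, end))
    return hits


example = "CCATGAAATTTTAGGG"  # toy sequence, made up for illustration
for frame, start, end in find_orfs(example):
    print(frame, start, end, example[start:end] if frame > 0 else "")
```

For a real genomic sequence, `min_codons` would be set much higher (an assumption here; the slides only say "long stretches"), since short stop-free stretches occur by chance in every frame.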