Introduction to Genomics and the Tree of Life
Chapter 13

Extra reading
• Next generation sequencer: What can next-generation sequencers do for genetics/genomics research?
• Compar_genomics: What can we learn from comparative genomics?

Outline of today’s lecture
Introduction: 5 perspectives, history of life
Genome-sequencing projects: chronology
Genome analysis: criteria, resequencing, metagenomics
DNA sequencing technologies: Sanger, 454, Solexa
Process of genome sequencing: centers, repositories
Genome annotation: features, prokaryotes, eukaryotes

Five approaches to genomics
As we survey the tree of life, consider these perspectives:
Approach I: cataloguing genomic information
Genome size; number of chromosomes; GC content; isochores; number of genes; repetitive DNA; unique features of each genome
Approach II: cataloguing comparative genomic information
Orthologs and paralogs; COGs; lateral gene transfer
Approach III: function; biological principles; evolution
How genome size is regulated; polyploidization; birth and death of genes; neutral theory of evolution; positive and negative selection; speciation
Approach IV: Human disease relevance
Approach V: Bioinformatics aspects
Algorithms, databases, websites
Page 519

Introduction
Lessons learned from comparative genomics
What have we learned about genes by comparing genomic sequences?
What have we learned about regulation?
About 5% of the human genome is under purifying selection
Positively regulated regions
Mechanisms and history of mammalian evolution
Nonuniformity of neutral evolutionary rates within species
Nonuniformity of evolution along the branches of phylogeny
Learning more from existing data
Choice of species
Choice of tools
Future of comparative genomics

Levels of analysis in genomics
level: DNA; RNA; protein; complexes; pathways; organelles; organs; individuals; species; genus; phylum; kingdom
topics: genes, chromosomes; ESTs, ncRNA; ORFs, composition; binary, multimeric; variation and disease; speciation
databases: GenBank; UniGene, GEO; UniProt; BIND; COGs, KEGG; HapMap; TaxBrowser; SGD; JAX mouse; FishBase; TOL

Definitions of terms
Genomics is the study of genomes (the DNA comprising an organism) using the tools of bioinformatics.
Bioinformatics is the study of proteins, genes, and genomes using computer algorithms and databases.
Systematics is the scientific study of the kinds and diversity of organisms and of any and all relationships among them.
Classification is the ordering of organisms into groups on the basis of their relationships. The relationships may be evolutionary (phylogenetic) or may refer to similarities of phenotype (phenetic).
Taxonomy is the theory and practice of classifying organisms.

Pace (2001) described a tree of life based on small subunit rRNA sequences. This tree shows the three main branches described by Woese and colleagues.
Fig. 13.1 Page 521

Molecular sequences as basis of trees
Historically, trees were generated primarily using characters provided by morphological data. Molecular sequence data are now commonly used, including sequences (such as small-subunit RNAs) that are highly conserved.
Visit the European Small Subunit Ribosomal RNA database for 20,000 SSU rRNA sequences.
Page 523

Tree of life from David Hillis’ lab (based on ~3000 rRNAs)
[Figure: circular tree with labeled groups animals ("you are here"), plants, protists, fungi, bacteria, archaea]
http://www.zo.utexas.edu/faculty/antisense/Download.html

Ribosomal RNA Database
Ribosomal Database Project
http://rdp.cme.msu.edu/index.jsp

Santos, S. R. and Ochman, H. Identification and phylogenetic sorting of bacterial lineages with universally conserved genes and proteins. Environmental Microbiology. 2004 Jul;6(7):754-9.
►Download fusA (translation elongation factor 2 [EF-2])
►Obtain DNA in the fasta format
►Align by ClustalW in MEGA
►Create a neighbor-joining tree
Page 524

European Small Subunit Ribosomal RNA database (http://www.psb.ugent.be/rRNA/ssu/)

Neighbor-joining tree of ~150 fusA (GTPase) DNA sequences
[Figure: neighbor-joining tree; labeled groups include Clostridium, Yersinia pestis, Aquifex aeolicus, Rickettsia, Wolbachia, Bacillus anthracis, Mycoplasma, Mycobacterium, and Treponema]

History of life on earth
4.55 BYA: formation of earth (violent 100 MY period)
4.4-3.8 BYA: last ocean-evaporating impacts
3.9 BYA: oldest dated rocks
3.8 BYA: sun brightened to 70% of today’s luminosity
Ammonia, methane, or carbon dioxide atmosphere.
Earliest life: RNA, protein
Source: Schopf J.W. (ed.), Life’s Origins (U. Calif.
Press, 2002)
Page 521

Millions of years ago (MYA), 1000 to 0 (Proterozoic eon to Phanerozoic eon)
Events shown: deuterostome/protostome; echinoderm/chordate; Cambrian explosion; land plants; insects; Age of Reptiles ends
Page 522

Millions of years ago (MYA), 100 to 0
Events shown: mass extinction; dinosaurs extinct; mammalian radiation; human/chimp divergence
Page 522

Millions of years ago (MYA), 10 to 0
Events shown: Homo sapiens/chimp divergence; Australopithecus (Lucy); earliest stone tools; emergence of Homo erectus
Page 522

Years ago, 1,000,000 to 0
Events shown: Homo erectus emerges in Africa; Mitochondrial Eve
Page 523

Years ago, 100,000 to 0
Events shown: emergence of anatomically modern H. sapiens; Neanderthal and Homo erectus disappear
Page 523

Years ago, 10,000 to 0
Events shown: “Ice Man” from Alps; earliest pyramids; Aristotle
Page 523

Years ago, 1,000 to 0
Events shown: algebra; Gutenberg; calculus; Darwin, Mendel
Page 523

Chronology of genome sequencing projects
We will next summarize the major achievements in genome sequencing projects from a chronological perspective.
Page 525

1976: first viral genome
Fiers et al. sequence bacteriophage MS2 (3,569 base pairs, Accession NC_001417).
1977: Sanger et al. sequence bacteriophage φX174. This virus is 5,386 base pairs (encoding 11 genes). See accession J02482; NC_001422.
Page 527

1981: Human mitochondrial genome
16,500 base pairs (encodes 13 proteins, 2 rRNA, 22 tRNA)
As of 10/09, over 1800 mitochondrial genomes have been sequenced
1986: Chloroplast genome
156,000 base pairs (most are 120 kb to 200 kb)
Page 527

[Figure labels: mitochondrion; chloroplast; lack mitochondria (?)]
Entrez Genomes organelle resource at NCBI
http://www.ncbi.nlm.nih.gov/genomes/ORGANELLES/organelles.html
There are >2100 eukaryotic organelle genomes (10/09)

GOBASE: resource for organelle genomes
http://megasun.bch.umontreal.ca/gobase/

MitoDat: resource for organelle genomes
“This database is dedicated to the nuclear genes specifying the enzymes, structural proteins, and other proteins, many still not identified, involved in mitochondrial biogenesis and function. MitoDat highlights predominantly human nuclear-encoded mitochondrial proteins.”
Not updated recently.
http://www-lecb.ncifcrf.gov/mitoDat/

MitoMap: resource for organelle genomes
http://www.mitomap.org/
It is possible to map mutations in human mitochondrial DNA that are responsible for disease.

Chronology of genome sequencing projects
1995: first genome of a free-living organism, the bacterium Haemophilus influenzae
Page 530

1996: first eukaryotic genome
The complete genome sequence of the budding yeast Saccharomyces cerevisiae was reported. We will describe this genome soon.
Also in 1996, TIGR reported the sequence of the first archaeal genome, Methanococcus jannaschii.
Page 532

1997: More bacteria and archaea
Escherichia coli: 4.6 megabases, 4200 proteins (38% of unknown function)
1998: first multicellular organism
Nematode Caenorhabditis elegans: 97 Mb; 19,000 genes.
1999: first human chromosome
Chromosome 22 (49 Mb, 673 genes)
Page 532

2000:
Fruitfly Drosophila melanogaster (13,000 genes)
Plant Arabidopsis thaliana
Human chromosome 21
2001: draft sequence of the human genome (public consortium and Celera Genomics)
Page 534

Overview of genome analysis
• Selection of genomes for sequencing
• Sequence one individual genome, or several?
• How big are genomes?
• Genome sequencing centers
• Sequencing genomes: strategies
• When has a genome been fully sequenced?
• Repository for genome sequence data
• Genome annotation
Page 537

Applications of Genome Sequencing
(Purpose, then Template: Example)
De novo sequencing
  Genome sequencing: sequencing genomes (>1000 influenza)
  Ancient DNA: extinct Neanderthal genome
  Metagenomics: human gut
Resequencing
  Whole genomes: individual humans
  Genomic regions: assessment of genomic rearrangements or disease-associated regions
  Somatic mutations: sequencing mutations in cancer
Transcriptome
  Full-length transcripts; Serial Analysis of Gene Expression (SAGE): defining regulated messenger RNA transcripts
  Noncoding RNAs: identifying and quantifying microRNAs in samples
Epigenetics
  Methylation changes: measuring methylation changes in cancer
Table 13.15 p.538

Overview of genome analysis
Fig. 13.8 p.539

Criteria for selecting genomes for sequencing
Criteria include:
• genome size (some plant genomes are far larger than the human genome)
• cost
• relevance to human disease (or other disease)
• relevance to basic biological questions
• relevance to agriculture
Page 538

Recent projects: Chicken; Chimpanzee; Cow; Dog; Fungi (many); Honey bee; Sea urchin; Rhesus macaque
Page 540

Selection criteria
Selection of genomes for sequencing is based on specific criteria. For an overview, see a series of white papers posted on the National Human Genome Research Institute (NHGRI) website:
http://www.genome.gov/10002154
For a description of NHGRI selection criteria, visit:
http://www.genome.gov/10001495
Page 540

Criteria for selecting genomes for sequencing
Sequence one individual genome, or several?
Try one…
--Each genome center may study one chromosome from an organism
--It is necessary to measure polymorphisms (e.g.
SNPs) in large populations
For viruses, thousands of isolates may be sequenced.
For the human genome, cost is the impediment.
Page 540

Diversity of genome sizes
How big are genomes?
Viral genomes: 1 kb to 350 kb (Mimivirus: 1181 kb)
Bacterial genomes: 0.5 Mb to 13 Mb
Eukaryotic genomes: 8 Mb to 686 Gb (human: ~3 Gb)
Page 540

Genome sizes in nucleotide base pairs
[Figure: genome size ranges on a log scale from 10^4 to 10^11 bp for plasmids, viruses, bacteria, fungi, plants, algae, insects, mollusks, bony fish, amphibians, reptiles, birds, and mammals]
The size of the human genome is ~3 × 10^9 bp; almost all of its complexity is in single-copy DNA.
The human genome is thought to contain ~30,000-40,000 genes.
http://www3.kumc.edu/jcalvet/PowerPoint/bioc801b.ppt

16 eukaryotic genome projects > 1000 megabases
Genus, species            Subgroup        Size (Mb)  #chr  Common name
Macropus eugenii          Mammals         3800       8     tammar wallaby
Oryctolagus cuniculus     Mammals         3500       22    rabbit
Cavia porcellus           Mammals         3400       31    guinea pig
Pan troglodytes           Mammals         3100       24    chimpanzee
Homo sapiens              Mammals         3038       23    human
Bos taurus                Mammals         3000       30    cow
Dasypus novemcinctus      Mammals         3000       32    nine-banded armadillo
Loxodonta africana        Mammals         3000       28    African savanna elephant
Sorex araneus             Mammals         3000             European shrew
Rattus norvegicus         Mammals         2750       21    rat
Canis familiaris          Mammals         2400       39    dog
Zea mays                  Land Plants     2365       10    corn
Aplysia californica       Other Animals   1800       17    California sea hare
Danio rerio               Fishes          1700       25    zebrafish
Gallus gallus             Birds           1200       40    chicken
Triphysaria versicolor    Land Plants     1200             plant parasite

Ancient DNA projects
Special challenges:
• Ancient DNA is degraded by nucleases
• The majority of DNA in samples derives from unrelated organisms such as bacteria that invaded after death
• Samples are often contaminated by modern human DNA
• Determination of authenticity requires special controls, and analysis of multiple independent extracts
Page 542

Metagenomics projects
Two broad areas:
• Environmental (ecological) e.g.
hot spring, ocean, sludge, soil
• Organismal e.g. human gut, feces, lung
Page 543

Outline of today’s lecture
Introduction: 5 perspectives, history of life: time lines
Genome-sequencing projects: chronology
Genome analysis: criteria, resequencing, metagenomics
DNA sequencing technologies: Sanger, 454, Solexa
Process of genome sequencing: centers, repositories
Genome annotation: features, prokaryotes, eukaryotes

Overview of genome analysis
Twenty genome sequencing centers contributed to the public sequencing of the human genome. Many of these are listed at the Entrez genomes site. (Or see Table 19.3, page 803.)
Page 548

Two approaches to genome sequencing
Whole genome shotgun sequencing (Celera)
Hierarchical shotgun sequencing (public consortium)

Two approaches to genome sequencing
Whole Genome Shotgun (from the NCBI website)
An approach used to decode an organism's genome by shredding it into smaller fragments of DNA which can be sequenced individually. The sequences of these fragments are then ordered, based on overlaps in the genetic code, and finally reassembled into the complete sequence. The 'whole genome shotgun' (WGS) method is applied to the entire genome all at once, while the 'hierarchical shotgun' method is applied to large, overlapping DNA fragments of known location in the genome.
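
The idea of ordering fragments by their overlaps and reassembling them can be illustrated with a toy example. The sketch below is a minimal greedy overlap merger in Python, not the algorithm used by Celera or the public consortium; the function names (overlap, greedy_assemble), the min_len parameter, and the example reads are all hypothetical, and real assemblers must also handle sequencing errors, reverse complements, and repetitive DNA.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if the longest such match is shorter than min_len.
    """
    max_len = min(len(a), len(b))
    for k in range(max_len, min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len: int = 3):
    """Repeatedly merge the pair of fragments with the largest overlap.

    Toy assumption: error-free reads, all from the same strand.
    Returns the list of contigs left when no overlaps remain.
    """
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    # Drop reads fully contained in another read.
    reads = [r for r in reads if not any(r != s and r in s for s in reads)]
    while len(reads) > 1:
        best = (0, None, None)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        if k == 0:
            break  # no overlaps left: the remaining pieces stay separate
        merged = reads[i] + reads[j][k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (i, j)]
        reads.append(merged)
    return reads


# Three overlapping fragments of a made-up 20-bp "genome".
fragments = ["ATGGCGTGCA", "GTGCAATGCC", "ATGCCTAGGA"]
print(greedy_assemble(fragments))  # prints ['ATGGCGTGCAATGCCTAGGA']
```

With repetitive DNA, several fragments can overlap equally well in more than one place, which is one reason positional information from mapped clones is helpful.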
Page 548

Human genome project: strategies
Whole genome shotgun sequencing (Celera)
-- given the computational capacity, this approach is far faster than hierarchical shotgun sequencing
-- the approach was validated using Drosophila

Two approaches to genome sequencing
Hierarchical shotgun method
Assemble contigs from various chromosomes, then sequence and assemble them.
A contig is a set of overlapping clones or sequences from which a sequence can be obtained. The sequence may be draft or finished.
A contig map is thus a chromosome map showing the locations of those regions of a chromosome where contiguous DNA segments overlap. Contig maps are important because they provide the ability to study a complete, and often large, segment of the genome by examining a series of overlapping clones, which then provide an unbroken succession of information about that region.
Page 548

Two approaches to genome sequencing
Hierarchical shotgun sequencing (public consortium)
-- 29,000 BAC clones
-- 4.3 billion base pairs
-- it is helpful to assign chromosomal loci to sequenced fragments, especially in light of the large amount of repetitive DNA in the genome
-- individual chromosomes assigned to centers
Source: IHGSC (2001)

Sequenced-clone contigs are merged to form scaffolds of known order and orientation
Source: IHGSC (2001)
Fig. 19.8 Page 804

When has a genome been fully sequenced?
A typical goal is to obtain five- to ten-fold coverage.
Finished sequence: a clone insert is contiguously sequenced to a high-quality standard (error rate of 0.01%). There are usually no gaps in the sequence.
Draft sequence: clone sequences may contain several regions separated by gaps. The true order and orientation of the pieces may not be known.
Page 549

When has a genome been fully sequenced?
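
The relationship between fold coverage and percent sequenced shown on this slide is close to what a simple Poisson model predicts. As a sketch, assuming reads land uniformly at random along the genome, the expected fraction of bases covered at least once at fold coverage c is 1 - e^(-c). The function name fraction_sequenced is hypothetical, and the slides do not state which model the table was derived from; the model reproduces most of the listed values (for example 63% at 1-fold and 95% at 3-fold) but gives about 86.5% at 2-fold, where the table lists 87.5.

```python
import math


def fraction_sequenced(fold_coverage: float) -> float:
    """Expected fraction of the genome covered by at least one read.

    Assumes reads are placed uniformly at random (Poisson model), so the
    probability that a given base is missed is exp(-fold_coverage).
    """
    if fold_coverage < 0:
        raise ValueError("fold coverage must be non-negative")
    return 1.0 - math.exp(-fold_coverage)


# Print the model's estimate for a few of the coverage levels in the table.
for c in (0.25, 0.5, 0.75, 1, 2, 3, 5, 10):
    print(f"{c:>5}-fold: {100 * fraction_sequenced(c):.3f}% sequenced")
```

Under this model the gain from each additional fold of coverage shrinks quickly, which is consistent with five- to ten-fold coverage being a typical goal.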
Fold coverage    % sequenced
0.25             22
0.5              39
0.75             53
1                63
2                87.5
3                95
4                98.2
5                99.4
6                99.75
7                99.91
8                99.97
9                99.99
10               99.995
Page 551

Trace repository for genome sequence data
Raw data from many genome sequencing projects are stored at the trace archive at NCBI or EBI (main NCBI page, bottom right).
Also visit: http://trace.ensembl.org/
As of October 2008, the Trace Archive had ~2 billion traces. As of October 2009 it has ~2,108,000,000 traces.
Page 552
Fig. 13.12 Page 553

http://www.jgi.doe.gov/education/
http://www.youtube.com/watch?v=RLsb0pMx_oU&feature=channel_page
A Howard Hughes Medical Institute (HHMI) video production describing the Whole Genome Shotgun Sequencing process at the JGI. This video is viewable on YouTube in three parts: Part 1 (chapters 1-5), Part 2 (chapters 6-8), Part 3 (chapters 9-14).

Role of comparative genomics
Phylogenetic footprinting
Phylogenetic shadowing
Population shadowing
Page 552
Fig. 13.13 Page 554
Fig. 13.14 Page 555

Genome annotation
Information content in genomic DNA includes:
-- nucleotide composition (GC content)
-- repetitive DNA elements
-- protein-coding genes, other genes
Page 555

GC content varies across genomes
[Figure: histograms of the number of species in each GC class for bacteria, plants, invertebrates, and vertebrates; x axis: GC content (%), 20 to 80]
Fig. 13.15 Page 556

Gene prediction tools
• http://bioinformatics.ca/links_directory/?subcategory_id=39
• http://www.geneprediction.org/
Common tools
GenScan: http://genes.mit.edu/GENSCAN.html
HMMgene: http://www.cbs.dtu.dk/services/HMMgene/
Microbial: http://www.ncbi.nlm.nih.gov/genomes/MICROBES/glimmer_3.cgi
Fungal: http://www.cbcb.umd.edu/software/GlimmerHMM/
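
Nucleotide composition, the first annotation feature listed above, is simple to compute directly from a sequence. The following is a minimal sketch in Python; the function names gc_content and windowed_gc are hypothetical, and the choice to ignore ambiguous bases such as N is an assumption rather than a convention taken from the lecture.

```python
def gc_content(seq: str) -> float:
    """Percent G+C among the unambiguous bases (A, C, G, T) of seq."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    if gc + at == 0:
        raise ValueError("sequence contains no A, C, G, or T")
    return 100.0 * gc / (gc + at)


def windowed_gc(seq: str, window: int, step: int):
    """GC content (%) for each window of the given size along seq.

    Useful for spotting regional variation in composition.
    A window with no A, C, G, or T raises ValueError via gc_content.
    """
    return [
        gc_content(seq[start:start + window])
        for start in range(0, len(seq) - window + 1, step)
    ]


print(gc_content("ATGCGCGCTA"))         # whole-sequence GC percentage
print(windowed_gc("AAAAGGGG", 4, 4))    # one value per 4-bp window
```

Computing GC content in windows rather than once per genome is one simple way to look for regional differences in composition.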