Genomics, Bioinformatics and the Revolution in Biology
Jonathan Pevsner, Ph.D.
Kennedy Krieger Institute / Johns Hopkins School of Medicine

Outline
Three views of bioinformatics and genomics
    Informatics
    From small to large
    From genotype to phenotype
The chromosomes
SNPs, HapMap, and the 1000 Genomes project

Definitions of bioinformatics and genomics
• Bioinformatics is the interface of biology and computers. It is the analysis of proteins, genes and genomes using computer algorithms and databases.
• Genomics is the analysis of genomes, including the nature of genetic elements on chromosomes. The tools of bioinformatics are used to make sense of the billions of base pairs of DNA that are sequenced by genomics projects.
• Genetics is the study of the origin and expression of individual uniqueness.

Three views of bioinformatics and genomics
1. The field of informatics
2. From small to large
3. From genotype to phenotype

[Figure labels: bioinformatics, medical informatics, genomics, public health informatics; tool-users; tool-makers; algorithms, databases, infrastructure]

[Figure labels: DNA, RNA, protein, phenotype]

Rapid growth of DNA sequences
[Figure: total number of DNA base pairs in GenBank/WGS, 1982 to 2008. Axes: year; base pairs (billions); sequences (millions).]

[Figure labels: time of development; body region, physiology, pharmacology, pathology]

The Origin of Species (1859)
It is interesting to contemplate a tangled bank, clothed with many plants of many kinds, with birds singing on the bushes, with various insects flitting about, and with worms crawling through the damp earth, and to reflect that these elaborately constructed forms, so different from each other, and dependent upon each other in so complex a manner, have all been produced by laws acting around us.
Source: Origin of Species, Chapter 15

Eukaryotes (Baldauf et al.
2000)
[Tree labels: fungi, animals, slime mold, plants, Paramecium, Plasmodium, Trypanosoma, Giardia, Trichomonas]

Wolfe et al. (1999)
8 chromosomes (5,000 genes) → 16 chromosomes (10,000 genes) → 16 chromosomes (6,000 genes)

Paramecium tetraurelia: a ciliate with two nuclei, 40,000 genes, and three whole-genome duplications

Phylogenetic footprinting
Phylogenetic shadowing
Population shadowing

From small to large: DNA, RNA, protein, pathway, cell, organism, population

From genotype to phenotype
We see 500 inpatients and 13,000 outpatients per year at the Kennedy Krieger Institute. Why do children engage in self-injurious behavior? In many cases, there are chromosomal insults.
[Figure labels: cell, cellular phenotype; organism, clinical phenotype]

Central dogma of molecular biology: DNA is transcribed into RNA, which is translated into protein.
Central dogma of bioinformatics/genomics: the genome is transcribed into the transcriptome, which is translated into the proteome.

Over 200 billion base pairs of DNA have now been sequenced, from >165,000 organisms.

Scope of bioinformatics
Sequence analysis
    Pairwise alignment
    Multiple sequence alignment
    Phylogeny
    Database searching (e.g. BLAST)
Functional genomics
    RNA studies; gene expression profiling
    Proteomics; protein structure
    Gene function

Pairwise alignments in the 1950s
β-corticotropin (sheep)    ala gly glu asp asp glu
Corticotropin A (pig)      asp gly ala glu asp glu
Oxytocin                   CYIQNCPLG
Vasopressin                CYFQNCPRG

globins: α-, β-, myoglobin

Early example of sequence alignment: globins (1961)
H.C.
Watson and J.C. Kendrew, “Comparison Between the Amino-Acid Sequences of Sperm Whale Myoglobin and of Human Hæmoglobin.” Nature 190:670-672, 1961.

LAGAN

Multiple sequence alignment of five globins (2e Fig. 5.21): ClustalW, Praline, MUSCLE, Probcons, TCoffee

Four bases: A, G, C, T arranged in base pairs along a double helix (1953).
Human genome project: sequencing all ~3 billion base pairs (2003).

1995: first genome sequence of a free-living organism (a bacterium)
2000: fruit fly and plant genomes
2003: human genome
2008:
    two individual human genomes finished
    1,000 human genomes project (launched)
    SNPs used to study chromosomes

[Figure labels: DNA, genotype; population, phenotype]

Eukaryotic genomes are organized into chromosomes
Genomic DNA is organized in chromosomes. The diploid number of chromosomes is constant in each species (e.g. 46 in human). Chromosomes are distinguished by a centromere and telomeres.
The chromosomes are routinely visualized by karyotyping (imaging the chromosomes during metaphase, when each chromosome is a pair of sister chromatids).
Fig.
16.19, page 565

Human chromosome 21 at NCBI (labels: nucleolar organizing center, centromere)
Human chromosome 21 at www.ensembl.org (labels: nucleolar organizing center, centromere)
Human chromosome 21 at UCSC Genome Browser (label: centromere)

First P.G. mitosis in polar view. Tradescantia virginiana, Commelinaceae, n = 9 (from aberrant plant with 22 chromosomes). 2 BE CV smears. x 1200. Printed on multigrade paper. Darlington.

First P.G. mitosis in Paris quadrifolia, Liliaceae, showing all stages from prophase to telophase. n = 10 (cf. Darlington 1937, 1941). 2 BE – CV smear, 8 mm objective. x 800. Darlington.

Root tip squashes showing anaphase separation. Fritillaria pudica, 3x = 39, spiral structure of chromatids revealed by pressure after cold treatment. 2 BD – Feulgen; x 3000. Darlington.

Cleavage mitosis in the morula of the teleostean fish, Coregonus clupeoides, in the middle of anaphase. Spindle structure revealed by slow fixation. Section cut at 10 µ. x 4000. Strong Flemming, haematoxylin. Prep. and photo by P.C. Koller. Darlington.

The eukaryotic chromosome: Robertsonian fusion creates one metacentric by fusion of two acrocentrics
ordinary male house mouse (Mus musculus, 2n = 40)
male tobacco mouse (Mus poschiavinus, 2n = 26)
Ohno (1970) Plate II

The spectrum of variation

Category of variation          Size             Type
Single base pair changes       1 bp             SNPs, point mutations
Small insertions/deletions     1 – 50 bp
Short tandem repeats           1 – 500 bp       microsatellites
Fine-scale structural var.     50 bp – 5 kb     del, dup, inv, tandem repeats
Retroelement insertions        0.3 – 10 kb      SINEs, LINEs, LTRs, ERVs
Intermediate-scale struct.     5 kb – 50 kb     del, dup, inv, tandem repeats
Large-scale structural var.    50 kb – 5 Mb     del, dup, inv, large tandem repeats
Chromosomal variation          >>5 Mb           aneuploidy

Adapted from Sharp AJ et al.
(2006) Annu Rev Genomics Hum Genet 7:407-42

Single nucleotide polymorphisms (SNPs) to investigate chromosomes

Across the genome, there are four possible SNP calls:
[1] homozygous (AA)
[2] homozygous (BB)
[3] heterozygous (AB)
[4] no call

In a deleted region, there are three possible SNP calls:
[1] A (interpreted as AA)
[2] B (interpreted as BB)
[3] no call

A case of 7p deletion
[Figure: genotype clusters AA, AB, BB across the genome; A and B only in the deleted region]

• Deletions (and duplications) such as these are called copy number variants (CNVs).
• CNVs commonly occur in normal individuals.
• When a CNV is found in an individual with disease, comparison to the parents’ genotypes shows whether it is inherited (likely to be benign) or occurred de novo (more likely to be disease-associated).
• Recent papers report many CNVs in disease.

A case of trisomy 21 (Down syndrome)
[Figure: genotype clusters AAA, AAB, ABB, BBB]

Three cases of 10q deletion
Deafness gene?
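The SNP call patterns above (A and B in a deletion; AA, AB, BB in a normal diploid region; AAA, AAB, ABB, BBB in trisomy) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the "NC" code for a no call, and the example call lists are hypothetical and not taken from any genotyping software. A region with no heterozygous calls is consistent with a deletion but does not prove one, since a run of homozygosity without copy number change looks the same in genotype calls alone.

```python
def possible_genotypes(copy_number):
    """Biallelic genotype classes for a region present in `copy_number` copies.

    1 copy -> A, B; 2 copies -> AA, AB, BB; 3 copies -> AAA, AAB, ABB, BBB.
    """
    return ["A" * (copy_number - k) + "B" * k for k in range(copy_number + 1)]


def heterozygous_fraction(calls):
    """Fraction of informative calls that are heterozygous (AB).

    'NC' (hypothetical code for "no call") is ignored.
    """
    informative = [c for c in calls if c != "NC"]
    if not informative:
        return 0.0
    return sum(c == "AB" for c in informative) / len(informative)


# Made-up call lists for illustration only.
normal_region = ["AA", "AB", "BB", "AB", "NC", "AA"]
# In a deleted region a single A or B is reported as AA or BB, so AB is absent.
deleted_region = ["AA", "BB", "BB", "NC", "AA", "AA"]

for n in (1, 2, 3):
    print(n, possible_genotypes(n))
print(heterozygous_fraction(normal_region))
print(heterozygous_fraction(deleted_region))
```

In practice, signal intensity is examined alongside the genotype calls to separate a deletion from a copy-neutral stretch of homozygosity.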
The International HapMap Project
► A catalog of common genetic variants that occur in humans
► The project’s goal is to compare the genetic sequences of different individuals to identify chromosomal regions where genetic variants are shared
► An initial focus has been on four groups (n=270):
    CEU        European ancestry (30 trios)         Utah residents
    YRI        African ancestry (30 trios)          Yoruba in Ibadan, Nigeria
    JPT/CHB    Asian ancestry (90 individuals)      Japanese in Tokyo, Japan; Han Chinese in Beijing, China
► Phase I (2005): > 1 million SNPs
    Phase II (2007): added 2.1 million SNPs

The International HapMap Project
► In addition to CEU, YRI, and JPT/CHB, additional populations have been genotyped, including:
    Maasai in Kinyawa, Kenya
    Luhya in Webuye, Kenya
    Gujarati Indians in Houston, TX
    Toscani in Italy
    Mexican ancestry in Los Angeles
    African ancestry in southwestern US

The ENCODE project
► The ENCyclopedia Of DNA Elements (ENCODE) project was launched in 2003.
► Pilot phase: devise and test high-throughput approaches to identify functional elements. Efforts center on 44 DNA targets. These cover about 1 percent of the human genome, or about 30 million base pairs.
► Second phase: technology development.
► Third phase: production. Expand the ENCODE project to analyze the remaining 99 percent of the human genome.

The ENCODE project
Goal of ENCODE: build a list of all sequence-based functional elements in human DNA. This includes:
► protein-coding genes
► non-protein-coding genes
► regulatory elements involved in the control of gene transcription
► DNA sequences that mediate chromosomal structure and dynamics.
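The pilot-phase figures above (44 targets covering about 1 percent of the genome, or about 30 million base pairs) follow from the ~3 billion base pair genome size quoted earlier for the Human Genome Project. A minimal sketch of that arithmetic, with hypothetical variable names; the mean target size is only an average and says nothing about how large individual targets actually are:

```python
GENOME_SIZE_BP = 3_000_000_000  # approximate human genome size (~3 billion bp)
PILOT_PERCENT = 1               # pilot phase covers about 1 percent
PILOT_TARGETS = 44              # number of DNA targets in the pilot phase

# Integer arithmetic keeps the base-pair counts exact.
pilot_bp = GENOME_SIZE_BP * PILOT_PERCENT // 100
remaining_bp = GENOME_SIZE_BP - pilot_bp
mean_target_bp = pilot_bp / PILOT_TARGETS

print(f"pilot phase: {pilot_bp:,} bp")
print(f"remaining:   {remaining_bp:,} bp")
print(f"mean target: {mean_target_bp:,.0f} bp")
```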
ENCODE data at the UCSC Genome Browser: beta globin (50,000 base pairs including HBB, HBD, HBG1, HBG2, HBE1)

ENCODE tracks available at the UCSC Genome Browser

EGASP: the human ENCODE Genome Annotation Assessment Project
EGASP goals:
[1] Assess the accuracy of computational methods to predict protein-coding genes. 18 groups competed to make gene predictions, blind; these were evaluated relative to reference annotations generated by the GENCODE project.
[2] Assess the completeness of the current human genome annotations as represented in the ENCODE regions.

UCSC: tracks for Gencode and for various gene prediction algorithms (focus on 50 kb encompassing five globin genes)
[Track labels: Gencode, JIGSAW]

On bioinformatics
“Science is about building causal relations between natural phenomena (for instance, between a mutation in a gene and a disease). The development of instruments to increase our capacity to observe natural phenomena has, therefore, played a crucial role in the development of science - the microscope being the paradigmatic example in biology. With the human genome, the natural world takes an unprecedented turn: it is better described as a sequence of symbols. Besides high-throughput machines such as sequencers and DNA chip readers, the computer and the associated software becomes the instrument to observe it, and the discipline of bioinformatics flourishes.”

“However, as the separation between us (the observers) and the phenomena observed increases (from organism to cell to genome, for instance), instruments may capture phenomena only indirectly, through the footprints they leave. Instruments therefore need to be calibrated: the distance between the reality and the observation (through the instrument) needs to be accounted for.
This issue of Genome Biology is about calibrating instruments to observe gene sequences; more specifically, computer programs to identify human genes in the sequence of the human genome.”
Martin Reese and Roderic Guigó, Genome Biology 2006 7(Suppl I):S1, introducing EGASP, the Encyclopedia of DNA Elements (ENCODE) Genome Annotation Assessment Project

The 1000 Genomes Project
Goal: To create a deep catalog of human genetic variation in multiple populations.
[1] Discover variants (SNPs, copy number variants, insertions/deletions). Include nearly all variants with allele frequencies >1% across the genome (and >0.1-0.5% in gene regions).
[2] Estimate the frequencies of variant alleles.

Secondary goals:
• Characterize SNPs
• Improve the human reference sequence
• Study regions under selection
• Study variation across populations
• Study mutation and recombination

Current approaches include sequencing two HapMap trios (one from YRI, one from CEU; father/mother/child) at 20X depth using next generation sequencing technology.
    For one individual, 20X depth = 60 gigabases
    For one trio, 20X depth = 180 gigabases
In another approach, many individuals (n=1000) from the extended HapMap collection are sequenced at lighter coverage.

Conclusions
We briefly surveyed the fields of bioinformatics and genomics. Bioinformatics serves biology, and genomics depends on the tools of bioinformatics.
There are rapid advances in available technologies, such as next generation sequencing, that allow us to address fundamental biological questions at unprecedented resolution. These questions include the nature of variation within and between genomes of individuals, groups (gender, ethnicity, disease status), and across species. Other questions, posed decades ago, concern biological processes such as development, metabolism, adaptation, and function.
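The 1000 Genomes depth figures quoted above (60 gigabases per individual and 180 gigabases per trio at 20X) follow from multiplying sequencing depth by genome size. A minimal sketch, assuming a genome of ~3 billion base pairs as stated for the Human Genome Project; the function name and its parameters are hypothetical:

```python
def bases_needed(depth, genome_size_bp=3_000_000_000, individuals=1):
    """Total bases to sequence for `individuals` genomes at `depth`-fold coverage."""
    return depth * genome_size_bp * individuals


GIGABASE = 1_000_000_000

one_individual = bases_needed(20)            # one genome at 20X
one_trio = bases_needed(20, individuals=3)   # father, mother, child at 20X

print(one_individual // GIGABASE, "gigabases per individual")
print(one_trio // GIGABASE, "gigabases per trio")
```

The same function shows why lighter coverage is attractive for large cohorts: total sequencing scales linearly with both depth and the number of individuals.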