5 Genomics, Proteomics, and Systems Biology

• Genomes and Transcriptomes
• Proteomics
• Systems Biology

Introduction

Genome sequencing projects introduced large-scale experimental approaches, which generate vast amounts of data, to the study of biological systems. Complete genome sequences can now be determined, and all the RNAs and proteins expressed in a cell can be analyzed on a large scale.

These global experimental approaches form the basis of the new field of systems biology, which seeks a quantitative understanding of the integrated behavior of complex biological systems.

Genomes and Transcriptomes

The Human Genome Project: the effort to sequence the entire human genome (3 billion base pairs), published in 2004. The genome sequences of many other species have also been determined, and advances in sequencing technology now allow rapid sequencing of individual genomes.

The first complete genome, that of the bacterium Haemophilus influenzae, was reported in 1995. It contains 1.8 × 10^6 base pairs. Protein-coding regions were identified by computer analysis to detect open reading frames: long stretches that do not contain any stop codons.

Figure 5.1 The genome of Haemophilus influenzae

In bacteria, most of the DNA encodes proteins. The E. coli genome, at 4.6 × 10^6 base pairs (about 4,000 genes), is more than twice the size of the H. influenzae genome. Nearly 90% of the DNA is protein-coding. More than 2,000 bacterial genomes have now been sequenced.

The yeast Saccharomyces cerevisiae has the simplest eukaryotic genome, making it a useful model for eukaryotic cells. Yeast has about 6,000 genes; about 70% of the genome codes for proteins.

The genomes of multicellular organisms (C. elegans, Drosophila, and Arabidopsis) were sequenced next. These genomes are about 10 times larger than the yeast genome but have fewer genes than expected for more complex organisms.
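The open-reading-frame scan described above for H. influenzae can be illustrated with a minimal Python sketch. It follows the definition given in the text (a stretch with no stop codons) and makes several simplifying assumptions: only the forward strand is scanned, no start codon is required, and the `min_codons` threshold is a hypothetical parameter set far lower than any real gene finder would use.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=3):
    """Return (start, end, frame) for each stop-free stretch on the forward strand.

    start is inclusive and end is exclusive, both as 0-based positions in seq.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        run_start = frame  # first base of the current stop-free stretch
        pos = frame
        while pos + 3 <= len(seq):
            if seq[pos:pos + 3] in STOP_CODONS:
                # A stop codon closes the current stretch.
                if (pos - run_start) // 3 >= min_codons:
                    orfs.append((run_start, pos, frame))
                run_start = pos + 3
            pos += 3
        # Keep a stretch that runs to the end of the sequence.
        if (pos - run_start) // 3 >= min_codons:
            orfs.append((run_start, pos, frame))
    return orfs

if __name__ == "__main__":
    for start, end, frame in find_orfs("ATGAAACCCGGGTAAATG"):
        print(f"frame {frame}: positions {start}-{end}")
```

In a real genome the threshold would be on the order of hundreds of codons, since short stop-free stretches occur frequently by chance.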
In these multicellular genomes, much less of the DNA is protein-coding than in bacteria and yeast.

Table 5.1 Representative Genomes

Drosophila has fewer genes than C. elegans. Sequencing revealed that biological complexity does not depend simply on the number of genes.

The genome of Arabidopsis thaliana was sequenced in 2000 and found to have about 26,000 genes. Other plant genomes contain even more genes (e.g., 57,000 in apples).

The human genome has about 3 × 10^9 base pairs. Draft sequences were published in 2001 by two different groups using different approaches. The complete sequence was published in 2004.

The International Human Genome Sequencing Consortium sequenced DNA fragments derived from BAC (bacterial artificial chromosome) clones that had been previously mapped to human chromosomes.

Key Experiment, Ch. 5, p. 161 (3)

A team led by Craig Venter of Celera Genomics used a shotgun approach: Small DNA fragments were cloned and sequenced; overlaps between sequences were then used to assemble the sequence of the genome.

A major surprise from the human genome sequence was that there are only about 21,000 protein-coding genes, whose coding sequences make up about 1% of the total genome.

40% of human proteins are related to proteins in simpler eukaryotes; most function in basic cellular processes. Most proteins that are unique to humans are made up of domains that are also found in other organisms but are arranged in novel combinations.

The genomes of many other vertebrates have now been sequenced. This allows comparisons to the human genome and helps identify functional sequences. Comparison of human, mouse, chicken, and zebrafish genomes shows that about half of protein-coding genes are common to all vertebrates.
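The shotgun strategy described above, in which overlaps between sequenced fragments are used to assemble a longer sequence, can be sketched as a greedy merge. This is a toy illustration only, not the assembler used by either sequencing group: it assumes error-free reads from one strand, no repeats, and a hypothetical `min_len` overlap threshold.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if below min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the longest overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # no remaining overlaps; leave the contigs separate
        merged = reads[best_i] + reads[best_j][best_n:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads

if __name__ == "__main__":
    print(greedy_assemble(["ATGGCGT", "GCGTACC", "TACCTGA"]))
```

Repeated sequences make real genomes much harder to assemble than this sketch suggests, because a read can overlap equally well with fragments from several locations.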
Figure 5.2 Evolution of sequenced vertebrates

Figure 5.3 Comparison of vertebrate genomes

Mice, rats, and humans have 90% of their genes in common. Mouse and rat genome sequences provide essential databases for research in mammalian genetics and human physiology and medicine.

The dog genome sequence has become important in understanding the genetic basis of morphology, behavior, and a variety of diseases. Characteristics of the many dog breeds are highly specific, which facilitates identification of the responsible genes.

Many diseases, including cancer, are common in some breeds, and understanding the genetic basis will benefit both veterinary and human medicine.

Genome sequences of other primates may help pinpoint unique features that distinguish humans. Human and chimpanzee genomes are nearly 99% identical. However, the sequence differences often alter coding sequences, so most proteins differ in amino acid sequence between the two species.

Neandertals and modern humans diverged 300,000 to 400,000 years ago, and their genomes are about 99.9% identical. The differences alter coding sequences of only 90 genes that are conserved in modern humans.

The Human Genome Project used the dideoxynucleotide technique first described by Fred Sanger in 1977. Even with automation, this approach is slow and expensive. Next-generation sequencing: new techniques that increased speed and lowered costs.

Figure 5.4 Progress in DNA sequencing

Next-generation, or massively parallel, sequencing refers to methods in which millions of templates are sequenced simultaneously.

Figure 5.5 Next-generation sequencing

The first individual human genomes to be sequenced were those of Craig Venter and James Watson (2007 and 2008).
Since then, thousands of individual genomes have been sequenced. Personal sequences will allow therapies to be specifically tailored to the needs of individual patients.

In the future, genome sequencing may be important in disease prevention by identifying genes that confer susceptibility to particular diseases.

Transcriptome: all the RNAs that are transcribed in a cell. Complete genome sequences allow study of gene expression for the whole genome, instead of one gene at a time. One method used is hybridization to DNA microarrays.

Oligonucleotides are printed by a robotic system onto glass or silicon chips. Each spot on the array consists of a single oligonucleotide. DNA microarrays can be used to compare gene expression between two cell types.

Figure 5.6 DNA microarrays

cDNAs are synthesized from mRNAs by reverse transcription, labeled with fluorescent dyes and hybridized to DNA microarrays. The relative level of expression of each gene is indicated by intensity of fluorescence at each position on the microarray.

RNA-seq reveals the sequences of all mRNAs in a cell. Cellular mRNAs are reverse transcribed to cDNAs, which are analyzed by next-generation sequencing. The frequency of mRNAs found also indicates their abundance in the cell.

Figure 5.7 RNA-seq

Proteomics

To understand cell function, it is necessary to know what proteins are expressed and how they function within the cell. The large-scale analysis of cell proteins is called proteomics. The goal is to identify and quantify all proteins expressed in a given cell (the proteome).

The number of proteins expressed in a cell is greater than the number of genes. Many genes can be expressed to yield several distinct mRNAs, which encode different polypeptides as a result of alternative splicing. Proteins can also be modified in various ways.
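Returning to RNA-seq as described above, the idea that read frequency indicates mRNA abundance can be sketched in a few lines of Python. The sketch assumes each sequenced read has already been assigned to a gene (the alignment step is not shown), and the gene names are hypothetical. Counts are scaled to reads per million so that samples of different sequencing depth can be compared; transcript length is ignored.

```python
from collections import Counter

def counts_per_million(read_assignments):
    """Convert a list of per-read gene assignments into reads per million."""
    counts = Counter(read_assignments)
    total = sum(counts.values())
    return {gene: 1_000_000 * n / total for gene, n in counts.items()}

if __name__ == "__main__":
    # Hypothetical assignments: three reads from geneA, one from geneB.
    reads = ["geneA", "geneA", "geneB", "geneA"]
    for gene, cpm in counts_per_million(reads).items():
        print(gene, cpm)
```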
The first technology to separate proteins was two-dimensional gel electrophoresis. Proteins are separated based on charge and then size. This technique is biased toward the most abundant proteins.

Figure 5.8 Two-dimensional gel electrophoresis

The main tool currently used is mass spectrometry. A protease cleaves the protein into small peptides. These are ionized and analyzed in a mass spectrometer, which determines the mass-to-charge ratio of each peptide. The mass spectrum is compared to a database of known spectra.

Figure 5.9 Identification of proteins by mass spectrometry

A "shotgun" approach eliminates the gel electrophoresis step. Cell proteins are digested with a protease, and the whole mixture is sequenced by tandem mass spectrometry.

Figure 5.10 Tandem mass spectrometry

Determining the locations of proteins in cells and organelles is also important. Organelles are isolated by subcellular fractionation, and the proteins are analyzed by mass spectrometry. The proteomes of a variety of organelles and structures have been characterized.

Table 5.2 Protein composition of cellular structures

Proteins function by interacting with other proteins in protein complexes and networks. The systematic analysis of these complexes and interactions has become an important goal of proteomics.

Proteins can be isolated from cells under gentle conditions so that protein complexes are not disrupted. Typically, an antibody against a protein of interest is used to isolate the protein from a cell extract by immunoprecipitation.

Figure 5.11 Immunoprecipitation

Immunoprecipitated protein complexes can then be analyzed by mass spectrometry. The protein against which the antibody was directed can be identified, along with other proteins it was associated with in the cell extract.
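The mass spectrometry workflow described above (protease digestion, then measurement of each peptide's mass-to-charge ratio) can be sketched as follows. The text does not name the protease; this sketch assumes trypsin, which cleaves after lysine (K) or arginine (R) except when the next residue is proline (P). The masses are standard monoisotopic residue masses, the protein sequence is hypothetical, and modifications and missed cleavages are ignored.

```python
# Monoisotopic residue masses in daltons.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # added once per peptide for the free N- and C-termini
PROTON = 1.00728   # mass of each charge-carrying proton

def tryptic_digest(protein):
    """Split a protein after K or R, unless the following residue is P."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        next_aa = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and next_aa != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides

def peptide_mass(peptide):
    """Neutral monoisotopic mass of a peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER

def mass_to_charge(peptide, charge=1):
    """m/z of a peptide carrying the given number of protons."""
    return (peptide_mass(peptide) + charge * PROTON) / charge

if __name__ == "__main__":
    for pep in tryptic_digest("MKRPGAKLLEDR"):
        print(pep, round(mass_to_charge(pep, charge=2), 4))
```

A database search compares measured values like these against the values predicted for every protein encoded in the genome.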
Figure 5.12 Analysis of protein complexes

Alternative approaches include screens for protein interactions in vitro and screens that detect interactions between pairs of proteins introduced into yeast cells.

In the yeast two-hybrid system, two different cDNAs (e.g., from human cells) are joined to two distinct domains of a protein that stimulates expression of a target gene in yeast.

Figure 5.13 The yeast two-hybrid system

Screens have identified thousands of protein–protein interactions, which can be presented as maps that depict a network of interacting proteins within a cell.

Figure 5.14 A protein interaction map of Drosophila

Bioinformatics and Systems Biology

Genome sequencing, proteomics, and other large-scale experiments have yielded vast amounts of data. Bioinformatics, at the interface between biology and computer science, uses computational methods to analyze and extract biological information from all this data.

These large-scale experimental approaches form the basis of the new field of systems biology. The goal: A quantitative understanding of the integrated dynamic behavior of complex biological systems and processes.

Figure 5.15 Systems biology

Systematic screens of gene function: One approach to studying gene function is to inactivate (knock out) each gene. Collections of strains with mutations in all known genes are available for E. coli, yeast, Drosophila, C. elegans, and Arabidopsis thaliana.

A large-scale international project to systematically knock out all genes in the mouse is also under way. Targeted mutagenesis has determined functions of more than 7,000 mouse genes.

Other large-scale screening projects are based on RNA interference (RNAi). Double-stranded RNAs are used to induce degradation of homologous mRNAs in cells.
Figure 4.38 RNA Interference

With the availability of complete genome sequences, libraries of double-stranded RNAs can be designed and used in genome-wide screens to identify all of the genes involved in any biological process.

Figure 5.16 Genome-wide RNAi screen for cell growth and viability

Regulation of gene expression: Understanding the mechanisms that control gene expression is a central undertaking in cell and molecular biology. It is far more difficult to identify gene regulatory sequences than protein-coding sequences.

Most regulatory elements are short sequences, typically only about ten base pairs. Consequently, sequences resembling regulatory elements occur frequently by chance in genomic DNA. Identifying regulatory sequences is a major challenge in systems biology.

Global studies of gene expression, using microarrays or RNA-seq, can reveal overall changes in gene regulation associated with discrete cell behaviors, such as the response of cells to a particular hormone. Changes in expression of multiple genes can help pinpoint shared regulatory elements.

Computational approaches are also used to characterize regulatory elements. Comparative analysis of genome sequences of related organisms assumes that functionally important sequences are conserved in evolution and that nonfunctional segments diverge more rapidly.

Computational analysis to identify noncoding sequences that are conserved between the mouse, rat, dog, and human genomes has helped identify sequences that control gene transcription.

Figure 5.17 Conservation of functional gene regulatory elements

Methods for genome-wide analysis of the binding sites of regulatory proteins have also been developed.
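A rough calculation shows why short regulatory elements, as noted above, turn up frequently by chance. It assumes that the four bases are equally frequent and independent of one another, which real genomes only approximate, and it counts matches on a single strand. Under those assumptions, the expected number of chance matches E to one specific sequence of length k in a genome of N base pairs is about N divided by 4^k. With the values used in the text (k = 10, N = 3 × 10^9 for the human genome):

```latex
E \approx \frac{N}{4^{k}}
  = \frac{3 \times 10^{9}}{4^{10}}
  = \frac{3 \times 10^{9}}{1{,}048{,}576}
  \approx 2.9 \times 10^{3}
```

So a given ten-base-pair sequence would be expected to occur thousands of times in the human genome whether or not it has a regulatory function, which is why sequence matching alone cannot identify functional elements.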
Genome-wide analysis of the sites of histone modifications can also help identify gene regulatory sequences.

ENCODE (Encyclopedia of DNA Elements) utilized RNA-seq to characterize all transcribed RNAs, plus global methods to determine gene regulatory sequences in 147 different types of human cells. One result: Many transcribed noncoding sequences play important roles in gene regulation.

Networks: Classical experimental biology focuses on single genes and proteins, which often act sequentially to catalyze reactions in a metabolic pathway. Signaling pathways act similarly to transmit information from the environment, such as the presence of a hormone, to targets within the cell.

Figure 5.18 Example of a signaling pathway

But metabolic and signaling pathways do not operate in isolation. There is extensive crosstalk between pathways, so that multiple pathways interact with one another to form networks. Computational modeling of networks is currently a major challenge in systems biology.

Many pathways are controlled by feedback loops (e.g., feedback inhibition of metabolic pathways, a form of negative feedback). Feedforward relays: activity of one component of a pathway stimulates a distant downstream component.

Crosstalk: interaction of one pathway with another; can be positive (one pathway stimulates the other) or negative (one pathway inhibits the other).

Figure 5.19 Elements of signaling networks

In this view of the cell as an integrated system, a full understanding of cell signaling will require development of network models. A model of a gene regulatory network controlling development of an embryonic cell lineage in sea urchins has recently been developed.
Figure 5.20 A gene regulatory network

Synthetic biology: The goal is to design and create new (unnatural or synthetic) systems, both to create useful products and to better understand how the behavior of existing cells is controlled.

Synthetic biologists can synthesize new molecules with biological properties, such as RNA, or engineer new systems using components of existing cells. The ability to engineer a novel biological system tests and expands our understanding of how natural systems function.

Genetic circuits in E. coli were first engineered in 2000. A genetic toggle switch was designed to confer stability and memory on a network regulating gene expression. The key feature is that two repressors control expression of each other as well as a reporter gene.

Figure 5.21 A genetic toggle switch

Similar genetic circuits have since been engineered in eukaryotic models. This has substantially advanced our understanding of how a regulatory circuit can alternate between two stable states, a common feature of networks involved in many aspects of cell signaling and regulation of cell proliferation.

Practical applications of synthetic biology, treating malaria: Malaria is a serious parasitic disease, caused by the protozoan Plasmodium and transmitted by mosquitoes. Research on vaccine development is underway, but no vaccine is currently available.

Molecular Medicine, Ch. 5, p. 180

The most effective antimalarial drug right now is artemisinin, a compound produced by a plant that takes 8 months to mature. The supply of artemisinin from these plants is limited and the price fluctuates.
Figure 5.22 Structure of artemisinin

Synthetic biologists have developed strains of yeast engineered to produce a precursor to artemisinin, which is then used for commercial production of this important drug.

The first cell with a completely synthetic genome was created in 2010. Venter et al. synthesized overlapping oligonucleotides corresponding to the complete genome sequence of Mycoplasma mycoides.

The synthetic genome was then introduced into a different mycoplasma subspecies, M. capricolum. These cells grew normally and showed the morphology of normal M. mycoides. Because the cell proteins are specified by the synthetic genome, these cells represent the first synthetic cells.

Figure 5.23 First cell with a synthetic genome
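Returning to the genetic toggle switch described earlier in this section, the way two mutually repressing genes produce two stable states can be illustrated with a small simulation. This is a sketch under stated assumptions, not the published circuit: it uses a generic two-repressor model with Hill-type repression, the parameter values (`alpha`, `n`) and the function name are hypothetical, and the equations are integrated with a simple Euler step.

```python
def simulate_toggle(u0, v0, alpha=10.0, n=2.0, dt=0.01, steps=5000):
    """Integrate two mutually repressing genes and return their final levels.

    u and v are the concentrations of the two repressors. Each is produced
    at a rate that falls as the other rises, and each decays at unit rate.
    """
    u, v = u0, v0
    for _ in range(steps):
        du = alpha / (1.0 + v ** n) - u   # v represses synthesis of u
        dv = alpha / (1.0 + u ** n) - v   # u represses synthesis of v
        u, v = u + dt * du, v + dt * dv
    return u, v

if __name__ == "__main__":
    # The same circuit settles into opposite states depending on its history.
    print(simulate_toggle(5.0, 0.1))  # starts with more u
    print(simulate_toggle(0.1, 5.0))  # starts with more v
```

Whichever repressor starts ahead shuts off the other and stays on, which is the sense in which the circuit has memory.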