16 Microbial Genomics

CHAPTER OVERVIEW

This chapter introduces genomics, a revolutionary new discipline in the biological sciences. Techniques important to the study of genomes are discussed. Bioinformatics, functional genomics, and comparative genomics are detailed. Proteomics theory and techniques are discussed. The chapter then gives numerous examples of the types of patterns already being discerned in the analysis of the microbial genomes sequenced thus far. Finally, metagenomic analysis of environmental communities is introduced.

CHAPTER OBJECTIVES

After reading this chapter you should be able to:

• define genomics and bioinformatics
• compare and contrast structural genomics, functional genomics, and comparative genomics
• describe methods of sequencing DNA and the whole-genome shotgun method for sequencing a genome
• describe the types of analyses done for functional genomics and proteomics
• discuss some of the insights gained thus far by the analysis of microbial genomes
• discuss metagenomics

CHAPTER OUTLINE

I. Introduction
   A. Genomics is the study of the molecular organization of genomes, their information content, and the gene products they encode
   B. Genomics will enable scientists to get a holistic view of microbial genetics, gene expression patterns, microbial communities, and evolutionary relationships
II. Determining DNA Sequences
   A. Sanger DNA sequencing
      1. Uses dideoxynucleoside triphosphates (ddNTPs) in DNA synthesis; these lack a 3′-hydroxyl and terminate DNA synthesis
      2. Single strands of DNA are mixed with a primer, DNA polymerase I, four deoxynucleoside triphosphates (one is labeled), and a small amount of one of the ddNTPs; DNA synthesis begins at the primer but terminates each time a ddNTP is added to the chain
      3. Four reactions are run, each with a different ddNTP; these reactions generate DNA fragments of different lengths because the site at which the ddNTP is inserted is random
      4. Newly synthesized DNA fragments are separated electrophoretically on a polyacrylamide gel or by capillary electrophoresis, often using an automated system; the gel can be autoradiographed if radioactive ddNTPs were used or monitored with a laser if fluorescent ddNTPs were used; the sequence is then read from the autoradiogram or chromatographic trace
   B. Post-Sanger DNA sequencing
      1. Newer sequencing technologies do not require the construction of genomic clone libraries; these methods attach DNA to solid substrates, PCR-amplify sequences, and separate DNA fragments
      2. Three approaches are available: pyrosequencing (454 Life Sciences), SOLEXA, and SOLiD technology (sequencing by ligation)
III. Genome Sequencing
   A. Sequencing a genome by the whole-genome shotgun approach is a multi-step process
      1. Library construction: chromosomes are broken into gene-sized fragments, inserted into plasmids, and transformed into special E. coli strains
      2. Random sequencing: the cloned fragments are sequenced, typically several times to ensure full coverage
      3. Fragment alignment and gap closure: DNA fragments are clustered and assembled into longer stretches of sequence by comparing nucleotide sequence overlaps between fragments, producing contigs (contiguous sequences); the contigs are aligned in the proper order to form the completed genome sequence; gaps in the sequence are filled
      4. Editing: the sequence is proofread to resolve any ambiguities
   B. Single-cell genomic sequencing uses DNA polymerase from bacteriophage phi29 to randomly amplify many genomic DNA fragments using a multiple displacement amplification (MDA) scheme
IV. Bioinformatics
   A. The field concerned with the management and analysis of biological data using computers
   B. Genome annotation is done once the sequence is obtained; annotation involves identifying open reading frames (ORFs), determining potential amino acid sequences, and comparing them to known protein and DNA sequences (using alignments and BLAST)
   C. These comparisons allow tentative assignment of gene function as well as identification of transposable elements, operons, and repeat sequences, and the detection of various metabolic pathways
   D. Two or more genes in the genome of a single organism that arise through duplication of a common ancestral gene are called paralogues; homologous genes found in the genomes of different organisms are called orthologues
V. Functional Genomics
   A. Functional genomics is focused on how genes and genomes operate; physical maps of genomes are useful in annotation
   B. Metabolic pathways and physiological features can be modeled using annotated genomes where potential functional proteins have been defined
   C. Microarray analysis
      1. DNA microarrays: solid supports (e.g., glass) that have DNA attached in highly organized arrays of spots; in commercial chips, the array may consist of many expressed sequence tags (ESTs; short sequences derived from cDNA, each representing an expressed gene) covering every ORF of an organism
      2. The mRNA (transcriptome) or cDNA to be analyzed (target mixture) is isolated, labeled with fluorescent reporter groups, and incubated with the DNA chip; fluorescence at an address on the chip indicates that the DNA probe on the chip is bound to an mRNA or cDNA in the target mixture; analysis of the hybridization pattern shows which genes are being transcribed
      3. Using this procedure, the characteristic expression of whole sets of genes during differentiation or in response to environmental changes can be observed; patterns of gene expression can be detected using hierarchical cluster analysis, and functions can be tentatively assigned based on expression
VI. Proteomics
   A. Study of genome function at the level of translation
      1. Proteome: the entire collection of proteins that an organism produces; proteomics is the study of the proteome
      2. Functional proteomics determines the function of proteins, how they interact with each other, and how they are regulated
         a. Two-dimensional electrophoresis is used to resolve thousands of proteins in a mixture; proteins are first separated by charge (isoelectric focusing) and then by size
         b. Mass spectrometry is used to tentatively identify the proteins isolated by two-dimensional electrophoresis; N-terminal amino acid sequencing can be used to determine ORFs when the genome sequence is available
      3. Structural proteomics attempts to directly determine the three-dimensional structures of many proteins and then uses that information to predict the structures of other proteins and protein complexes based on their amino acid sequence (protein modeling)
   B. Similar studies can be performed using lipidomics (lipid profiles), glycomics (carbohydrate profiles), and metabolomics (small molecule profiles)
   C. DNA-protein interactions are important for gene regulation and in understanding transcription and replication
      1. Electrophoretic mobility shift assays examine DNA-protein interactions by observing changes in the migration of DNA fragments when bound to target proteins
      2. Chromatin immunoprecipitation (ChIP) assays examine DNA-protein complexes that are fixed in vivo and then isolated by antibody precipitation; the captured DNA molecules can be detected using microarray analysis (ChIP-chip)
VII. Systems Biology seeks to integrate the molecular interactions among the many chemical components of a cell into a theoretical framework that broadly describes living systems
VIII. Comparative Genomics
   A. Comparisons of genomes and their functional genes lead to new insights in microbial biology and to the development of vaccines (reverse vaccinology)
   B. Genome sizes vary among domains and among organisms with varied ecological roles
   C. The core genome (essential backbone of genes) is the set of genes that all organisms within a monophyletic group share; the pan-genome (flexible gene pool) is the collection of all genes within a given group
   D. Horizontal gene transfer (HGT) is important for the exchange of genetic material between organisms; mobile elements integrated into the genome (genomic islands) can confer virulence (pathogenicity islands)
   E. Synteny is used to compare the order in which genes appear in different phylogenetic groups
IX. Metagenomics
   A. Environmental genomics, or metagenomics, is being used to study microbial diversity in natural systems; fewer than 1% of the microbes in the environment can grow in the laboratory, so genetic techniques are used to directly detect and enumerate microbial populations
   B. The genomes of entire microbial communities can be sequenced and assembled, giving a picture of their species composition and functionality; new species (phylotypes) are detected, unique genes catalogued, and new functions ascribed to taxa

TERMS AND DEFINITIONS

____ 1. The study of the molecular organization of genomes, their information content, and the gene products they encode
____ 2. The study of the physical nature of genomes
____ 3. The study of the way genomes function
____ 4. The comparison of genomes from different organisms to discern patterns of gene function and regulation and microbial evolution
____ 5. Identification and localization of genes in a genome, and the determination of their function by comparison to gene sequences in databases
____ 6. A reading frame sequence that is not interrupted by a stop codon; if larger than 100 codons, it is thought to encode a protein
____ 7. The field concerned with the management and analysis of biological data using computers
____ 8. The entire collection of proteins that an organism produces
____ 9. The study of the array of proteins an organism can produce
____ 10. The study of the function of different proteins, how they interact with each other, and how they are regulated
____ 11. The process of determining the structure of various proteins and then using that information to predict the structure of other proteins and protein complexes based on their amino acid sequence
____ 12. A technique used to evaluate gene expression where DNA is attached to a solid support
____ 13. The flexible genome that is a collection of all genes within a given group
____ 14. The totality of all the mRNA in an organism
____ 15. A taxon that is characterized only by its nucleic acid sequence
____ 16. Essential set of genes present in all organisms of a monophyletic group
____ 17. A technique used to compare the order of genes in the genomes of different organisms
____ 18. A section of the genome containing genes involved in virulence

a. annotation
b. bioinformatics
c. core genome
d. comparative genomics
e. DNA microarray
f. functional genomics
g. functional proteomics
h. genomics
i. open reading frame (ORF)
j. pathogenicity island
k. pan-genome
l. phylotype
m. proteome
n. proteomics
o. structural genomics
p. structural proteomics
q. synteny
r. transcriptome

FILL IN THE BLANK

1. The most widely used sequencing technique was developed by Frederick Sanger. It uses dideoxynucleotides and is called the ___________ DNA sequencing method. Post-Sanger sequencing techniques include 454 sequencing, also called ___________.
2. Analysis of vast amounts of genome data requires sophisticated computers and computer software; these analytical procedures are part of the field of ___________.
3. The study of the way a genome functions is called ___________. It begins with ___________ of the genome, which identifies genes and tentatively assigns functions to them. One important aspect of understanding the function of a genome is to determine under what conditions each gene in the genome is expressed. One of the best ways to evaluate gene expression is through the use of ___________, which are highly organized arrays of DNA on a solid support (e.g., glass or silicon).
   Commercially made arrays often use short sequences (~25 base pairs in length) that are unique to a gene, rather than the entire gene sequence. These short sequences are called ___________, and they are derived from cDNA molecules.
4. The proteome is often analyzed by ___________, followed in many cases by mass spectrometry. Antibodies are often used in ___________ assays to determine protein-DNA interactions.
5. The use of genomics to study microbial diversity in natural systems is called ___________. The genomes of entire communities can be ___________ and then ___________ to give a picture of the functionality of the entire community.
6. The essential set of genes in a taxon is called the ___________ and this is supplemented by a wider flexible set of genes, called the ___________, that are specific to individual members of that group.

MULTIPLE CHOICE

For each of the questions below select the one best answer.

1. Which of the following is NOT a step in whole-genome shotgun sequencing?
   a. library construction
   b. sequencing of randomly produced fragments
   c. fragment alignment and gap closure
   d. editing
   e. All of the above are steps in whole-genome shotgun sequencing.
2. Which of the following is a general pattern of genome organization discerned by comparisons of genomes?
   a. There is very little variation in genome organization in bacteria and archaea.
   b. There has been considerable horizontal gene transfer, especially of housekeeping or operational genes.
   c. Most parasitic organisms have more genes than do free-living organisms.
   d. All of the above patterns have been observed.
3. A complex mixture of proteins can be separated using two-dimensional electrophoresis. What is the basis for the separation?
   a. charge differences (isoelectric focusing)
   b. size differences
   c. both (a) and (b)
   d. neither (a) nor (b)
4. Which type of genomic analysis provides information about microbial evolution?
   a. structural genomics
   b. functional genomics
   c. comparative genomics
   d. none of the above
5. Translated amino acid sequences can be analyzed for motifs. What do these represent?
   a. functional units
   b. transcriptional controls
   c. paralogues
   d. orthologues
6. Microarray analysis is NOT appropriate for which of the following?
   a. monitoring individual gene expression
   b. tentatively assigning gene functions
   c. observing patterns of gene expression
   d. determining phylogenetic relationships
7. What percentage of environmental microbes grow in the laboratory?
   a. 1%
   b. 20%
   c. 60%
   d. nearly 100%

TRUE/FALSE

____ 1. The genome of M. genitalium is one of the smallest of any free-living organism.
____ 2. There has been a great deal of horizontal gene transfer between genomes in both Bacteria and Archaea.
____ 3. One of the ultimate goals of genomic analysis is to model a cell on a computer and make predictions about how it would respond to environmental changes.
____ 4. It is unlikely that genomic analysis will provide any information useful for understanding pathogenicity or for developing treatments for infectious disease.
____ 5. Vaccine development can only be done using killed or weakened viruses.
____ 6. Open reading frames are known to be functional genes.

CRITICAL THINKING

1. In order for computers to identify open reading frames (ORFs) and other features of a genome, they must be programmed to do so. What features of a nucleotide sequence would be important for identifying ORFs? Explain your choices. Would the features be the same for both eukaryotic and prokaryotic organisms? Explain.
2. Molecular microbial ecology uses genetic techniques to describe microbial communities in the environment. If you were asked to describe the diversity of the microbes in a lake rich in Epsom salts, what research plan would you pursue? Would you include a cultivation campaign? Why or why not? Which molecular techniques would you apply and what might be their limitations?

ANSWER KEY

Terms and Definitions
1. h, 2. o, 3. f, 4. d, 5. a, 6. i, 7. b, 8. m, 9. n, 10. g, 11. p, 12. e, 13. k, 14. r, 15. l, 16. c, 17. q, 18. j

Fill in the Blank
1. chain-termination; pyrosequencing
2. bioinformatics
3. functional genomics; annotation; DNA microarrays (chips); expressed sequence tags
4. two-dimensional electrophoresis; chromatin immunoprecipitation (ChIP)
5. metagenomics; sequenced; assembled
6. core genome; pan-genome

Multiple Choice
1. e, 2. b, 3. c, 4. c, 5. a, 6. d, 7. a

True/False
1. T, 2. T, 3. T, 4. F, 5. F, 6. F
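
SUPPLEMENTARY EXAMPLE: SCANNING FOR OPEN READING FRAMES

The ORF definition used in this chapter (a reading frame not interrupted by a stop codon, thought to encode a protein when longer than 100 codons) can be illustrated with a short program. What follows is a minimal sketch under stated assumptions, not a real annotation tool. The function names find_orfs and reverse_complement are hypothetical. The sketch assumes the standard stop codons (TAA, TAG, TGA), treats ATG as the only start codon, and assumes an uninterrupted coding sequence with no introns.

```python
# Minimal ORF scan over all six reading frames (a sketch, not a gene finder).
# Assumptions: standard stop codons, ATG as the only start codon, no introns.

STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    complement = {"A": "T", "T": "A", "G": "C", "C": "G", "N": "N"}
    return "".join(complement[base] for base in reversed(seq.upper()))


def find_orfs(seq, min_codons=100):
    """Return a list of (strand, frame, start, end, n_codons) tuples.

    An ORF here runs from an ATG to the first in-frame stop codon.
    n_codons counts the start codon but not the stop codon.
    Coordinates are 0-based and, for the "-" strand, are relative to
    the reverse-complemented sequence.
    """
    seq = seq.upper()
    orfs = []
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None  # position of the current candidate start codon
            for pos in range(frame, len(strand_seq) - 2, 3):
                codon = strand_seq[pos:pos + 3]
                if start is None and codon == START_CODON:
                    start = pos
                elif start is not None and codon in STOP_CODONS:
                    n_codons = (pos - start) // 3
                    if n_codons >= min_codons:
                        orfs.append((strand, frame, start, pos + 3, n_codons))
                    start = None  # resume searching after this stop codon
    return orfs


if __name__ == "__main__":
    # Toy sequence: one short ORF (ATG + 5 codons + stop) in frame 2.
    toy = "CC" + "ATG" + "GCT" * 5 + "TAA" + "GG"
    print(find_orfs(toy, min_codons=3))
```

The 100-codon default mirrors the threshold given in the chapter; the toy call lowers it only so that a short sequence yields a hit. Because the scan requires one uninterrupted frame, it suits prokaryote-style genes; a eukaryotic gene split by introns would not be recovered by this approach alone.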