Microbial genome analysis and comparisons
Dave Baumler
Genome Center of Wisconsin, UW-Madison
dbaumler@wisc.edu

Today’s session overview:
Introduction
Module #1) Microbial genomes at NCBI (http://www.ncbi.nlm.nih.gov/Class/minicourses/)
-familiarize yourself with the tools and options using the NCBI tutorial “Microbial genomes Quickstart”, and learn how to download genome (.gbk) files
Module #2) Conduct genome alignments of phage genomes
-use Mauve to conduct whole genome alignments, and familiarize yourself with Mauve
Module #3) Compare genomes from 3 outbreaks of E. coli O157:H7
-identify genomic islands using Mauve, and assess conservation of virulence factors
Module #4) Compare genomes from 5 strains of Yersinia pestis
-identify genomic islands, assess conservation of virulence factors, analyze mutations with phenotypic consequences due to insertion and/or deletion events and single nucleotide polymorphisms (SNPs), and explore paleomicrobiology
Conclusion

Choose one of the two Problems:
#1 Escherichia coli O157:H7 strain Sakai
#2 Rickettsia prowazekii strain Madrid E

Lists of all complete and in progress microbial genomes
Download full genome sequence (.gbk) files

Downloading Microbial Genome Files
#1) Look for the largest .gbk file, which is the main genome; smaller .gbk files are plasmids
#2) Double click on the file
#3) From the File pull-down choose “Save page as” and give the file a name with .gbk at the end

Links to other E. coli databases and/or resources
Brief information about the organism
Overview with links to assorted tools
Search on page for words using the Edit>>Find in this page pull-down
Entrez protein view

Clusters of Orthologous Groups of proteins (COGs) were delineated by comparing protein sequences encoded in complete genomes, representing major phylogenetic lineages. Each COG consists of individual proteins or groups of paralogs from at least 3 lineages and thus corresponds to an ancient conserved domain.
COG link

GenePlot
Entrez Genome offers a new pairwise comparison tool called GenePlot to visualize similarities among bacterial genomes. Support for fungal genomic comparisons is also planned. To construct a GenePlot, genes are numbered sequentially along the genomic sequences of two organisms and the two corresponding sets of predicted proteins are compared using BLAST. For every case in which a pair of proteins, one from each genome, are mutual best matches, a point is plotted using the indices of the equivalent gene in the two genomes as the X and Y coordinates. Use the GenePlot link from an organism’s genome record to see a GenePlot against the organism with which it shares the highest number of reciprocal best hits. Comparisons between other organisms can be made using pull-down menus.

TaxMap
Comparisons of COG groups between various organisms

The ERIC database houses all of the available genomes of the members of family Enterobacteriaceae
Boxes represent organisms with at least one genome sequenced
(Figure: genera of Enterobacteriaceae grouped into four categories: Human Pathogens; Insect Pathogens/Endosymbionts; Environmental/Animals/Industrial; Phytopathogens/Plant-associated. The assignment of genera to categories was lost in extraction.)
Genera shown: Calymmatobacterium, Moellerella, Cedecea, Morganella, Citrobacter, Plesiomonas, Edwardsiella, Proteus, Enterobacter, Providencia, Brenneria, Arsenophonus, Escherichia, Rahnella, Dickeya, Buchnera, Ewingella, Salmonella, Alterococcus, Erwinia, Sodalis, Hafnia, Serratia, Budvicia, Pantoea, Klebsiella, Shigella, Buttiauxella, Pectobacterium, Kluyvera, Tatumella, Obesumbacterium, Phlomobacter, Leclercia, Yersinia, Pragia, Saccharobacter, Leminorella, Yokenella, Trabulsiella, Samsonia, Wigglesworthia, Xenorhabdus

Orthologs
If at least two of these criteria are met for the pair of genes in question, they are typically assigned as orthologs.
•Percentage identity and alignment percentage are in the typical range
•Local genome context: the conserved gene is part of an operon with other genes that are already considered orthologs.
•Larger scale conservation of genomic context, the conserved gene is in the same general genomic context as other orthologs. •Functional conservation, the conserved gene is predicted or known to perform the same function as the potential ortholog in another genome. Reciprocal Best Blast hits BlastP X >60% Y BlastP Y X >60% Enterobacteria cont. Generated from 180 orthologs ERIC-Enteropathogen Resource Integration Center Genomes Tools & Annotations Genome Views and Comparisons Part of a genome sequence TCAGCGAAGATGAGATAGTTTTTAAAGGTGGGATTTCCCCACCTTTAAAAAGCGAGAAGTCCCGGTTTTAA AGAGGAGTAAAATCCTCTTTTTCTAGCCCACTCAGGTGGTTTTTTTGGTTTTCGCTCCTTGCCGCATCTTC TGTGCCTTTGATGGCGGCTGGTTGGGGTGAAAGGCTGCATATTCCAGAATTTCAGACAGTAGATTGTTTTT GAAATCTTCCGTTTTATCGTTGACGAACTTAACCATCCTGTTGAAATCATCTTCCTTTGATACACCTTCAG GAAATGCCTTAGGAACTGATGTTTGGCTATCCAAGGCATCTTGCAATATCTGCACGATCTCCGAATTCATT GATCGCCCATTGGCCTTTGCTCTGGCGGCAACTGCGTCACGCATACCGTCAGGCATCCTAACTGTAAATCT CTCAATGAAAGCTGGATCTTCTTTTTCAGTCATCATCTTAAACCATAAAAATTTATACAAAACACACTAGC ATCATATTGACATTACCCACAATGACATCATAATGGTGTCAGGCATCAAAATGATGTCATCATGACAAGGG GAAAGTAAATGCAAGATGTTCTCTATACAGGTCGTAAGAACGACAGCTTTCAGCTTCGTCTGCCTGAGCGA ATGAAAGAAGAGATCCGTCGCATGGCAGAGATGGACGGCATTTCGATTAATTCTGCAATCGTGCAGCGCCT TGCTAAAAGCTTGCGTGAGGAAAGAGTTAATGGGCAGTAAAAACAGCGAAGCCCGGAAGTGTGGGGACACT AACCGGGCTTCTAATGTCAGTTACCTAGCGGGAAACCAACAATGACCAGTATAGCAATCTTTGAAGCAGTA AACACTATCTCTCTTCCATTCCACGGACAGAAGATCATAACTGCGATGGTGGCGGGTGTGGCGTATGTGGC AATGAAGCCCATCGTGGAAAACATCGGTTTAGACTGGAAGAGCCAGTATGCCAAGCTCGTTAGTCAGCGTG AAAAGTTCGGGTGTGGTGATATCACCATACCTACCAAAGGTGGTGTTCAGCAGATGCTTTGCATCCCTTTG AAGAAACTGAATGGATGGCTCTTCAGCATTAACCCAGCAAAAGTACGTGATGCAGTTCGTGAAGGTTTAAT TCGCTATCAAGAAGAGTGTTTTACAGCTTTGCACGATTACTGGAGCAAAGGTGTTGCAACGAATCCCCGGA CACCGAAGAAACAGGAAGACAAAAAGTCACGCTATCACGTTCGCGTTATTGTCTATGACAACCTGTTTGGT GGATGCGTTGAATTTCAGGGGCGTGCGGATACGTTTCGGGGGATTGCATCGGGTGTAGCAACCGATATGGG ATTTAAGCCAACAGGATTTATCGAGCAGCCTTACGCTGTTGAAAAAATGAGGAAGGTCTACTGATTGGCGT 
ATTGGAAGGCGCAAAAAGAAAAGCCAGCAGATGGGCTGCTGGCATTCATTGGGTATATGAACTTTCGGAGA
ACATATGAAGTCAATTATCAAGCATTTTGAGTTTAAGTCAAGTGAAGGGCATGTAGTGAGCCTTGAGGCTG
CAAGCTTTAAAGGCAAGCCAGTTTTTTTAGCAATTGATTTGGCTAAGGCTCTCGGGTACTCAAATCCGTCA

What exactly are gene annotations?
Genome annotation is the process of attaching biological information to sequences. It consists of two main steps:
1. identifying elements on the genome, a process called “structural annotation” or “gene finding”
2. attaching information to these elements, such as their molecular and biological functions.

Annotation step #1: Structural Annotation
Example of a gene - the start codon is green and the stop codon is red
Structural annotation consists of the identification of genomic elements (e.g. genes).
•Open Reading Frames (ORFs), also called coding sequences (CDSs), must have a start codon and a stop codon
•Location of regulatory motifs (such as promoters and ribosome binding sites)
•This step is typically automated using gene prediction software (automation only finds ~50-90% of the genes)

Annotation step #1: Structural Annotation (cont.)
using Genemark.hmm, a statistical model

Annotation step #2: Functional annotation
Functional annotation consists of attaching biological information to genomic elements.
•biochemical function
•regulation and interactions involved
•expression
•cellular location

Three examples of annotations for one gene:
•Name/synonym: a short “word” used to refer to the gene (Ex. ureC)
•Product: a descriptive protein name (Ex. Urease alpha subunit)
•Function: describes what the protein does (Ex. Catalyzes the hydrolysis of urea to form ammonia and carbon dioxide)

Module #2 Conduct genome alignments of phage genomes
-this module was developed to teach how to use Mauve, using enterobacteria phage
-phage genomes can be aligned using Mauve in a matter of minutes
-applicable as a teaching tool to decipher the mosaicism of phage genomes.
-comparative studies of 30 mycobacteriophage genomes reveal new insights into their diverse architecture and into gene exchange (Hatfull et al. PLoS Genetics 2006)
-using Mauve, you could align EVERY mycobacteriophage genome available
-How diverse are enterobacteriophage? (the following series of slides are Mauve alignments of phage isolated from E. coli, Salmonella spp., Yersinia spp., and Shigella spp.); all alignments are also provided for further inquiry
-we will run alignments with 3 phage genomes from E. coli O157:H7

Mauve: Multiple Genome Aligner
• Able to identify and align collinear regions of multiple genomes even in the presence of rearrangements
• Find and extend seed matches
• Group into locally collinear blocks
• Align intervening regions
(Darling et al. Genome Res. 2004 Jul;14(7):1394-403.)

Module #2 Understanding phage, the viruses that infect microorganisms, via genome alignments
We recently aligned 56 enterobacterial phage; phage genomes are an ideal training tool for teaching how to set up Mauve alignments

Why Phage?
Genomics timeline (figure; years marked: 1977, 1982, 1995, 1996, 1997, 1998, 2000, 2001, 2008)

Step #1 Copy the folder called 3 phage genomes for alignment exercise, and paste it on the hard drive of your computer (C: drive)
Step #2 From the Start menu, in Programs, select Mauve 2.1.1
Step #3 Under the File pull-down select Align with progressiveMauve
This new window will appear
#4 Click here to choose where to send the output file, find the folder (from Step #1), and double click on the folder
#5 Type in a file name, and click on Save
Next add the sequences to align
Click on Add sequence
Select the first phage genome and click on Open, then continue with the 2nd and 3rd phage genomes.
Then click on Align to start the genome alignment

When viewing the LCBs, Mauve displays regions that are highly conserved/identical in full color.
Areas that are unique/variable to one genome appear in white, and represent unique islands

Your toolbar is at the top left; the tools you will use are in the View pull-down, and also on the buttons:
-Home: returns the viewer back to the home view
-Move left or right: you will find this useful to center a region of interest in the middle of the screen prior to zooming in
-Zoom in/out: you can also hold down the Ctrl key and use the arrows on the keyboard
-Search for features

Other useful commands in Mauve
Function                               Key
Zoom in                                Ctrl+Up
Zoom out                               Ctrl+Down
Scroll Left                            Ctrl+Left
Scroll Right                           Ctrl+Right
Export the current view as an image    Ctrl+E

Module #3) Dissecting virulence of E. coli O157:H7 using genome alignments

The first E. coli genome sequenced was the nonpathogenic E. coli K-12 genome MG1655
-determination of the complete E. coli sequence required almost 6 years
-E. coli is the preferred model in biochemical genetics, molecular biology, and biotechnology and its genomic characterization will undoubtedly further research toward a more complete understanding of this important experimental, medical, and industrial organism
(Blattner et al. Science 1997)

The first pathogenic E. coli genome sequence was enterohaemorrhagic (EHEC) Escherichia coli O157:H7 strain 933 EDL
-In 1982 Escherichia coli O157:H7 was recognized as a human pathogen
-Also known as EDL933, from the Michigan outbreak in 1982 linked to ground beef
-Shiga toxin-producing (STEC)
(Perna et al. Nature 2001)

The completion of the 2nd E. coli O157:H7 (EHEC) sequence, strain Sakai
-In July 1996, an outbreak of Escherichia coli O157:H7 infection occurred among schoolchildren in Sakai City, Osaka, Japan.
-8,938 schoolchildren sickened, 3 deaths
-We are starting to ask: What genomic differences determine differences in virulence, epidemiology, and fatality?
(Hayashi et al. DNA Res 2001)

In 2006 E.
coli O157:H7 outbreak from bagged spinach (from CDC)
-multistate outbreak, 205 people sickened, 3 deaths

Currently there are 13 E. coli O157:H7 genomes sequenced; we will have you focus on three that are all in the Enteropathogen Resource Integration Center (ERIC) database (www.ericbrc.org)
The three strains you will focus on are:
Escherichia coli EDL933 (EHEC)
Escherichia coli Sakai (EHEC), also called RIMD
Escherichia coli EC4042 (EHEC)

In your Start menu under Programs go to Mauve 2.1.1 and start up Mauve. Notice there is a user's guide in PDF form in this folder; it contains useful information and commands for navigation.
Note: your computer may need to update Java, since Mauve uses a Java platform for the alignment.
You should see a window for Mauve appear

Next double click on the uncompressed 3 O157H7 folder. It should contain the following 19 files; take the first one (3 O157 alignment), and drag and drop it into the Mauve window
It should start to say reading sequences here, and in a few seconds the alignment will appear. Note that computers with less than 512MB RAM may not be able to open the file

Your alignment should look like this
Organism name: notice the first is EDL933, the second is RIMD (Sakai), and the third is EC4042 (spinach)
Using the up or down arrows, you can switch the position of the genomes
Top strand
Bottom strand

The colored blocks are called locally collinear blocks (LCBs), and represent regions of the genome that Mauve has identified as conserved. The lines connect the LCBs; notice that some are in different positions in the other genomes, and some are inverted and appear on the bottom strand of the double-stranded genome

When you move your mouse over a region of one genome it will show a black box and also show the corresponding region (boxes) in the other two genomes; try scrolling left to right on one genome
Notice that when you scroll (slowly) over a white region (island) the black boxes pause in the other genomes, then come back once you have
passed over the island and back into conserved regions

If you would like to look at all three LCBs, even though one is in a different position, scroll over one LCB and click the mouse button

Let's use the zoom function; press the home button to restore the alignment to the original view
Now click on the white island in the top genome, and using the right button bring it to the center of the screen, then zoom in multiple times
You will start to see the genes. Scroll over one and pause, and a window will pop up with the product annotation, so here you can view which genes are present in this EDL933 island and not in the other two

Now place your mouse over one of the genes; in my example I have iha (irgA homolog adhesin)
Click your mouse once on the gene, and a window will pop up; scroll down and select View CDS iha in ERICdb
This will open the page in the ERIC database for that gene, containing all of the annotations; you can look to see if it is involved in virulence

Let's use the search feature
#1) Click on the search feature
#2) Choose a genome (EDL933)
#3) Type in a gene name (stx2A)
#4) Click on search
Notice that it has found the stx2A gene (highlighted in blue) in EDL933, and also in the RIMD strain.
Just because it isn't aligned in the EC4042 strain does not mean it isn't there; if you look to the right in the EC4042 genome, you will find it
Stx2A

One last feature you can use in Mauve
To find an island that is in 2 out of 3 strains you will use the backbone view
Press the home button first
Then go to the View pull-down, select color scheme, then backbone color

Your alignment should look like this in backbone color. Regions present in all three genomes appear in light purple; regions in other colors correspond to 2 out of 3 genomes (you may have to zoom in a bit to see these regions)
Regions in only EDL933 and RIMD appear olive green
Regions in only EDL933 and EC4042 appear maroon
Regions in only RIMD and EC4042 appear tan/brown
This is how you identify islands unique to 2/3 strains

Using genomics to track the dissemination of Yersinia pestis strains
Courtesy of www.cdc.gov
Deng et al. 2002 J. Bacteriol. 184(16):4601-4611

Transmission cycle of Plague

Historic 3 pandemics of plague
-pandemic: an epidemic that spreads throughout the human population across a large region such as a continent or worldwide
-1st pandemic ~550 A.D., confined mainly to Africa and some parts of the Middle East
-2nd pandemic originated in Central Asia and spread via trading routes into Europe (killed ~30% of Europe's population)
Courtesy of edsitement.neh.gov
-3rd pandemic started in the 1850s in China’s Yunnan province, confined mainly to Asia

The first two genomes of Yersinia pestis: CO92 & KIM
Parkhill et al. 2001 Nature 413, 523-527
Deng et al. 2002 J. Bacteriol. 184(16):4601-4611
Comparison of 2 genomes was not interactive initially

As of 04/2008 there are 7 complete and 14 Y.
pestis draft genomes

Traditionally the strains are classified as serovars (Antiqua, Mediaevalis, Orientalis, and other) based on the following phenotypic characteristics:
-Antiqua = East Africa (glycerol positive, arabinose positive, and nitrate positive)
-Mediaevalis = Central Asia (glycerol positive, arabinose positive, and nitrate negative)
-Orientalis = Central Asia (glycerol negative, arabinose positive, and nitrate positive)
-other (i.e. Microtus, Pestoides): not consistent for these phenotypes

Paleomicrobiology
Partial view of the grave in Dreux investigated in this work, which illustrates anthropologic features of a mass grave suitable for paleomicrobiology research. (courtesy of www.cdc.gov)
-the prefix paleo comes from the Greek word palaios meaning “ancient”
-bacterial colonization of dental pulp can occur during bacteremia
-Bacteremia (also known as plague septicaemia with Y. pestis) is the presence of bacteria in the blood
Courtesy of www.nidcr.nih.gov

Extraction of bacterial DNA from dental pulp
-Some historians believed that a flu-like virus and not Y. pestis was responsible for the 1st and 2nd pandemics
-DNA detected in dental pulp confirms that Y. pestis was the cause
-Which serovar(s) are most similar to the Y. pestis strain(s) from the dental pulp from the corpses?

Figure 1 The original protocol developed in our study allows recovering the dental pulp and minimizes the risk of laboratory-acquired contamination of the specimen. The tooth was encasted into sterile resin (1a); the apex was sterily sectioned (1b) to give access to the canal system (1c); solutions were injected (1d); after incubation, the tooth was put upside down into sterile tube (1e) and centrifuged (1f).
Tran-Hung et al. PLoS ONE v.2(10); 2007

Use of genomic tools to study Y.
pestis

Concepts in this module that you will address:
#1) mutations that affect the production of a fully functional gene product and have phenotypic consequences (insertions, deletions, single nucleotide polymorphisms [SNPs]), studied in the genes glpD, napA, and araC
#2) paleomicrobiology investigation: determine which serovar(s) have the most similar matching genes compared to the amplified sequence from the dental pulp of 3 corpses
#3) use of genome alignments: determine an island that is unique to the 4 genomes that infect humans and is absent in Y. pestis strain 91001
#4) determine the conservation of a virulence factor in the 5 strains in the genome alignment, and determine whether it is a fully functional product in strain 91001

Next double click on the uncompressed Yersinia pestis alignment 5 genome folder. It should contain the following 29 files; take the one named yersinia_pestis_alignment_5genomes, and drag and drop it into the Mauve window
It should start to say reading sequences here, and in a few seconds the alignment will appear. Note that computers with less than 512MB RAM may not be able to open the file

Your alignment should look like this
Organism name: notice the first is CO92, the second is KIM, the third is 91001, the fourth is Antiqua, and the fifth is Nepal516
Using the up or down arrows, you can switch the position of the genomes
You may find it easier to view the 5 genome alignment without the connecting lines: on your keyboard press Shift+L (pressing this again makes them reappear)

Now place your mouse over one of the genes
Click your mouse once on a gene, and a window will pop up; scroll down and select View CDS in ERICdb
This will open the page in the ERIC database for that gene, containing all of the annotations; you can look to see what is known about it and/or if it is involved in virulence (note: you may be taken to a log-in screen; click on the button that says “Enter ASAP”)

Let's use the search feature to find the genes glpD, napA, and araC
#1) Click
on the search feature
#2) Choose a genome or search all of the genomes
#3) Type in a gene name (glpD)
#4) Click on search
Notice that it has found the glpD gene (highlighted in blue), and also a corresponding gene in each genome.

You need to determine which of the five CDSs produce the full-length functional protein
Method #1: click on each gene and go to View CDS in ERICdb; look at the length and check whether any are labeled as pseudogenes. If so, look for a note that describes why it is thought to be a pseudogene

Identifying mutations in glpD, napA, and araC cont.
Method #2: from the feature page in ERIC
Scroll down to the feature context part of the page
This is a list of all features that neighbor your gene in the genome; notice some are upstream, downstream, or contained within
Notice that contained within your glpD gene there are polymorphic sites (otherwise known as SNPs)

For SNP analysis, you will use a new tool called “Snippy”
In a new tab or web browser window go to http://asap.ahabs.wisc.edu/~cabot/aep/snippy.php
It should look like this:
Highlight and copy all feature IDs for polymorphic sites from glpD, paste them into here, and click submit feature IDs

In your SNP analysis, you want to look for SNPs that cause a change in the encoded amino acid. In some cases the change results in a premature stop codon, which may generate a truncated non-functional protein
#1) Note that Snippy shows you if the SNP variation results in an amino acid change, in this case A (Alanine) to T (Threonine)
#2) In this second SNP, the change resulted in a stop codon
In the middle of each region you will see the polymorphic site (in this case capital G’s) and the corresponding base in each genome; note you are interested in variations in YPKIM, YPCO92, YP91001, YPNepal, and YpAntiqua.
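The kind of call Snippy reports for each polymorphic site (no amino acid change, an amino acid change, or a premature stop codon) can be illustrated with a minimal Python sketch. This is not Snippy's code; the helper name `classify_snp` and the example codons are hypothetical and chosen only to mirror the Alanine-to-Threonine and stop-codon cases described above. It assumes the standard genetic code and a codon that is already in the correct reading frame.

```python
from itertools import product

# Standard genetic code, listed in TCAG order; "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_snp(codon, position, new_base):
    """Hypothetical helper: effect of changing one base inside a codon.

    codon    -- three-letter reference codon, already in frame
    position -- 0, 1, or 2: which base of the codon the SNP hits
    new_base -- the variant base observed in another genome
    Returns (reference amino acid, variant amino acid, effect label).
    """
    codon = codon.upper()
    variant = codon[:position] + new_base.upper() + codon[position + 1:]
    ref_aa = CODON_TABLE[codon]
    alt_aa = CODON_TABLE[variant]
    if alt_aa == ref_aa:
        effect = "synonymous"   # same amino acid, protein unchanged
    elif alt_aa == "*":
        effect = "nonsense"     # premature stop, protein truncated
    elif ref_aa == "*":
        effect = "stop-lost"    # stop codon replaced by an amino acid
    else:
        effect = "missense"     # different amino acid
    return ref_aa, alt_aa, effect


# Illustrative codons only (not taken from the Y. pestis genomes).
print(classify_snp("GCA", 0, "A"))  # Alanine codon to Threonine codon
print(classify_snp("CAG", 0, "T"))  # Glutamine codon to stop codon
print(classify_snp("GCA", 2, "G"))  # third-base change, still Alanine
```

A "nonsense" result is the case to look for when deciding whether a CDS can still produce the full-length functional protein.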
-in this case there is no difference among these 5 genomes; scroll down and check the remaining polymorphic sites to see whether any of them differ among the 5 genomes. If not, the mutation is probably a larger deletion or insertion event

Using the DNA sequence obtained from the dental pulp from three corpses (found in the file called Ypestis corpse and CA88-4125YPE genes.doc), conduct a BlastN search within the ERIC database with each sequence against the 91001, Nepal, KIM, Antiqua, and CO92 genomes. For each of the three corpses, which serovar is most similar to the strains that caused the 1st and 2nd pandemics?

From the ERIC home page you can select to run a Blast search here (http://www.ericbrc.org/)
Paste the first nucleotide sequence from corpse #1
Select entire genomes
Select the genomes to query: hold down the Ctrl key and select Y. pestis genomes 91001, Antiqua, CO92, KIM, and Nepal
Finally click on the Submit Query button, then repeat with the other two corpses' sequences

Next repeat the BlastN process using the gene sequences from a known North American ancestor (Y. pestis CA88-4125/YPE) for glpD, napA, and araC. Of the 5 genomes (91001, Antiqua, CO92, KIM, and Nepal) representing the three serovars, which is most similar to the known North American ancestor?
Based on your analysis, did Y. pestis arrive in North America via shipping routes over the Atlantic or Pacific?
Atlantic? (Serovar Antiqua of African origin)
Pacific? (Serovar Orientalis or Mediaevalis of Asian origin)
Courtesy of education.usgs.gov

Your alignment should look like this in backbone color. Regions present in all five genomes appear in light purple; regions in other colors correspond to 2, 3, or 4 out of 5 genomes (you may have to zoom in a bit to see these regions)
Look for a region in the lightest blue color that is present in CO92, KIM, Antiqua, and Nepal, but absent in the 91001 strain.
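The same search can also be sketched outside the viewer. The Python example below is only an illustration under stated assumptions: it assumes a tab-delimited backbone table of the kind Mauve writes next to an alignment, with one header row and a left-end/right-end column pair per genome, where 0 in both columns means the segment is absent from that genome. The function name `segments_missing_from`, the genome order, and all coordinates are hypothetical; check the actual file in your alignment folder before relying on this layout.

```python
import csv
import io

# Assumed genome order, matching the order shown in the alignment window.
GENOMES = ["CO92", "KIM", "91001", "Antiqua", "Nepal516"]


def segments_missing_from(backbone_text, genomes, missing):
    """Hypothetical helper: find segments present in every genome
    except `missing`.

    backbone_text -- tab-delimited text: a header row, then one row per
                     segment with (left end, right end) for each genome;
                     0 and 0 are assumed to mean "absent".
    Returns a list of dicts mapping genome name -> (left, right).
    """
    reader = csv.reader(io.StringIO(backbone_text), delimiter="\t")
    next(reader)  # skip the header row
    wanted = set(genomes) - {missing}
    hits = []
    for row in reader:
        if not row:
            continue
        values = [int(v) for v in row]
        spans = {
            name: (values[2 * i], values[2 * i + 1])
            for i, name in enumerate(genomes)
        }
        present = {name for name, span in spans.items() if span != (0, 0)}
        if present == wanted:
            hits.append(spans)
    return hits


# Made-up coordinates for illustration only.
header = [f"seq{i}_{end}" for i in range(5) for end in ("leftend", "rightend")]
rows = [
    [1, 5000, 1, 5000, 1, 5000, 1, 5000, 1, 5000],               # in all five
    [5001, 9000, 5001, 9000, 0, 0, 5001, 9000, 5001, 9000],      # absent in 91001
    [9001, 9500, 0, 0, 0, 0, 0, 0, 0, 0],                        # only in CO92
]
example = "\n".join("\t".join(str(v) for v in row) for row in [header] + rows)

for segment in segments_missing_from(example, GENOMES, "91001"):
    print(segment)
```

Segments returned this way are candidates for the island shared by the four human-infecting strains; their gene content still has to be inspected in the viewer or in ERIC.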
Analyze the contents and determine if any of the genes may contribute to human infection by Y. pestis.

Conclusion
If you are interested in using some or all of these modules in your class, please sign up and provide your email, institution, and course(s)
-In the last two weeks of August 2008 I will be leading multiple WebX training sessions to refresh and field Q&A; you need a telephone and an internet-ready computer

Thanks for your time

Collaborators:
Dr. Kai F. (Billy) Hung (UW-Madison/Assistant Prof. at Eastern Illinois University, Fall 2008)
Dr. Amy C. Wong (UW-Madison)
Dr. Lois Banta (Williams College)

Mentors:
Dr. Nicole Perna (UW-Madison)
Dr. Charles Kaspar (UW-Madison)
Dr. Jeffrey Byrd (St. Mary’s College)
Dr. Bob Kadner and the ASM Summer Institute

Thank you: everyone on the ERIC database team (especially Guy Plunkett III for setting up module #1 & Eric Cabot for making Snippy) and all of the members of the Perna Genome Evolution Laboratory

Funding: This project has been funded with Federal funds from the National Institute of Allergy and Infectious Diseases, National Institutes of Health, Department of Health and Human Services, under contract No. HHSN266200400040C