Exploring a fatal outbreak of Escherichia coli using PATRIC

On May 19, 2011, the Robert Koch Institute, Germany's national-level public health authority, was informed about a cluster of three cases of the hemolytic–uremic syndrome in children admitted on the same day to the Hamburg university hospital. As the number of affected children began to rise, the authorities realized that they had a problem on their hands. Adults were also being sickened, and their number was rising as well. What was now considered an epidemic began to spread throughout Europe.

The hemolytic–uremic syndrome associated with the epidemic has been characterized by the triad of acute renal failure (an abrupt loss of kidney function that develops within 7 days), hemolytic anemia (a condition in which red blood cells are destroyed and removed from the bloodstream) and thrombocytopenia (low platelet count). Diarrhea-associated hemolytic–uremic syndrome occurs primarily in children, and a precipitating infection with Shiga-toxin–producing Escherichia coli, mainly of serotype O157:H7, is usually the primary cause. In adults, the hemolytic–uremic syndrome with prodromal diarrhea, indicating an infectious cause, is a rare event.

The serotype of the E. coli outbreak strain was determined to be O104:H4. A comparative genomic examination showed that the pathogen possessed genes typical of enteroaggregative E. coli, such as attA, aggR, aap, aggA, and aggC, located on a virulence plasmid. In addition, the strain carried the gene for a Shiga-toxin 2 variant (stx2a). Other typical Shiga-toxin–producing E. coli genes such as stx1, eae, and ehx were missing.[1]

Using the genomes isolated from this outbreak, we will use PATRIC tools to examine the presence or absence of specific genes, and compare the outbreak genomes to other similar genomes to see whether they show the same patterns of genes.

Creating genome groups

1. Log in to the PATRIC website so that you can use your workspace in the downstream analysis.
2. On the PATRIC homepage (patricbrc.org), open the Organisms tab at the top of the page.
3. When the tab opens to reveal the box listing the names of pathogens, click on Escherichia.
4. This will take you to the landing page for Escherichia, which summarizes all the information that PATRIC has about the genus, including the number of genomes, experiments associated with it, publications on it, and tools that can analyze the available data sorted at that taxonomic level.
5. Find the tab across the top that is labeled “Genome List” and click on it.
6. This will take you to the Genome List for the genus Escherichia. On the left you will see a dynamic filter, and on the right a table that lists the genomes.
7. At the top of the filter on the left-hand side you can see a text box. Enter the word “Germany” in that box and then hit return.
8. This will filter the table on the right-hand side to show all the genomes that were either isolated in Germany, or had that word mentioned in the information that was submitted when the genome became public. Other information about these genomes can be seen in the columns, such as the host that the bacterium was isolated from.
9. One of the columns to the right of this table is titled “Collection Date”. Click on those words to sort the table by the year in which the bacteria were collected.
10. One click will sort the table from the earliest collection date.
11. A second click shows the most recently collected genomes first.
12. Check the box next to the genome name for each organism that was collected in 2011.
13. Click on “Add Genomes” next to the folder icon in the Workspace header.
14. This will open a pop-up window that allows you to save the group.
15. Select the “Create New Group” option.
16. Name the group and click “Save to Workspace”.
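The filter, sort, and save steps above can be thought of as operations on a table of genome metadata. The following is a minimal sketch of that logic in Python, not PATRIC's own implementation: the record fields, the example strain names, and the function names are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GenomeRecord:
    """One row of a genome list (field names are assumptions)."""
    name: str
    isolation_country: str
    collection_year: int
    host: str


# Hypothetical records for illustration only; not real PATRIC data.
GENOMES = [
    GenomeRecord("Example strain A", "Germany", 2011, "Human"),
    GenomeRecord("Example strain B", "Germany", 2001, "Human"),
    GenomeRecord("Example strain C", "United States", 2011, "Cow"),
    GenomeRecord("Example strain D", "Germany", 2011, "Human"),
]


def keyword_filter(records, keyword):
    """Keep records whose text fields mention the keyword (case-insensitive)."""
    kw = keyword.lower()
    return [
        r for r in records
        if any(kw in field.lower()
               for field in (r.name, r.isolation_country, r.host))
    ]


def sort_by_collection_year(records, newest_first=False):
    """Mimic clicking the Collection Date column once (oldest) or twice (newest)."""
    return sorted(records, key=lambda r: r.collection_year, reverse=newest_first)


def select_year(records, year):
    """Mimic ticking the boxes of all genomes collected in a given year."""
    return [r.name for r in records if r.collection_year == year]


germany = keyword_filter(GENOMES, "Germany")
newest_first = sort_by_collection_year(germany, newest_first=True)

# A "workspace" here is just a dictionary from group name to genome names.
workspace = {"Germany 2011": select_year(newest_first, 2011)}
print(workspace)
```

The same three functions, called with different keywords and years, cover the other groups requested in the assignment that follows.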
Now that the data is saved, you can use a number of tools to explore it.

Assignment

Create genome groups for the three categories below. Use the dynamic filter on the Genome List page, and remember that you can use the text box at the top to filter on specific terms (hint: like O104). You can also use the filters underneath the text box to further refine your search (hint: Isolation Country and Collection Date). When you complete your assignment, you will have four different groups, including the one we just created.

- Create a group that contains all the O104 genomes collected in Europe in 2011, excluding those from Germany.
- Create a group that contains all the E. coli genomes collected in 2011 in the United States.
- Create a group that contains all the O104 genomes available in the PATRIC database, excluding those collected in 2011.

Comparing genome groups in PATRIC using the Protein Family Sorter tool

1. To look for presence or absence of the protein families within a genome group that you have created, click on the Tools tab and, under Comparative Genomics, select the Protein Family Sorter tool.
2. This will take you to the landing page for that tool.
3. Scroll down in the Select Organism box until you see the genome groups you created. Select the boxes for the Germany 2011 group, the O104 group from 2011 that does not include Germany, and the O104 group that contains genomes isolated in years other than 2011.
4. Hit the Select button under the keyword search box.
5. This takes you to the Protein Family Sorter landing page. On the right you will see a dynamic filter, and on the left a table that lists all the protein families.
6. One way you can examine differences in your genome groups is to visualize the data. To do this, click on the Heatmap tab at the top of the table (next to the Table tab).
7. This will take you to the heatmap view, where absence (black cells) and presence (yellow, mustard and orange cells) can be seen across all genomes.
The genomes are on the y-axis, and the protein families on the x-axis.
8. You can order the protein families by the way the genes occur in a given genome. This is a good way to check for genomic islands, which are parts of a genome that were not directly inherited but were obtained from different bacteria by what is described as horizontal transfer. To do this using the Protein Family Sorter, click on the down arrow in the text box next to the words Advanced Clustering.
9. This will open a list of the genomes that are included in the groups. Scroll down until you find one of the German genomes (Escherichia coli O104:H4 str. Ty-2482). Click on that name.
10. This will arrange all the protein families in the order in which the genes occur in the Ty-2482 strain. You’ll notice that several of the genomes appear to have long black boxes associated with them. This means that these genomes could be missing a long section of the genome that is present in the reference strain, which is an indication of a genomic island.
11. To explore a particular section, use your mouse to draw a box around the area of a genome that is next to a black box.
12. This generates a pop-up window that offers choices for what to do with the selected data. Click the Show Proteins button at the bottom of the pop-up window.
13. This will open a new window that shows the genes found in the section of the heatmap view that you selected.
14. To see the order the genes occur in, first resize the table by changing the number at the bottom of it to include all the genes, and hit return.
15. Then, at the top of the table, click once on the column head that reads “Alternative Locus Tag” to reorder the genes from first to last.
16. You can see that the majority of these genes are sequential (each of the locus tags increases numerically by one). Moreover, many of the names of these genes include the word “phage”.
This word is derived from “bacteriophage”; bacteriophages are viruses that infect bacteria. They are often associated with horizontal transfer of DNA, the transfer of genes between organisms in a manner other than traditional reproduction.

Assignment

Use the Protein Family Sorter and the groups you created to answer the following questions. Compare the group of all the genomes collected in Germany in 2011 with the O104 genomes that were not isolated in 2011. Go to the heatmap view and choose Escherichia_coli_O104-H4_str_01-0959 (isolated in 2001) as the reference. If you scroll down the heatmap (use the slider at the bottom of the view), you will see a large black box in strain E112/10. Use your mouse to select the proteins found in another genome that occur where the E112/10 genome is missing them. Many of these are metabolic proteins. From the other classes you have had, can you determine which pathways would be impacted in the E112/10 strain by not having these genes?

Comparing genomes in PATRIC using the Protein Family Sorter tool to look for specific genes

1. To look for presence or absence of the protein families within a genome group that you have created, click on the Tools tab and, under Comparative Genomics, select the Protein Family Sorter tool.
2. This will take you to the landing page for that tool.
3. Scroll down in the Select Organism box until you see the genome group you created that contains the genomes from the Germany outbreak. Check the box in front of that group.
4. We are going to see if these genomes have the Shiga toxin genes described. Enter the word “Shiga” in the keyword search box and click on the Search button below the box.
5. This returns a table that has a filter on the right, and the results on the left. You can see that a single protein family has been found in these genomes.
6. If you look carefully at the name under the product description, it says “Shiga-like toxin II subunit B precursor”. The name is a hyperlink. Click on it.
7. This will take you to the summary information for all the genes in your genome group that are in that particular protein family. This information includes the names of the genomes, the various locus tags that identify the genes, and the length of the proteins.
8. To find out more about any of the genes, click on any locus tag in the column called PATRIC ID.
9. This will take you to the landing page for that gene, where all the information available for it in PATRIC is summarized, including its different gene identifiers, tools and resources that can be used to examine this gene, and any publications that might have been written about it.
10. If you remember the story from above, the gene that was associated with the outbreak was a Shiga-toxin 2 variant (stx2a). The “a” in that name denotes the toxin subtype rather than a subunit, but the toxin itself is made of A and B subunits, and we are looking only at the B subunit here. What happened to A? Fortunately, these genes generally travel in pairs, so let’s look at the genes around this one to see if we can find A. To do this, in the tabs along the top of the page, click on the one named “Genome Browser”.
11. This will open a tool that shows you the gene you are looking at and the genes surrounding it. The Shiga toxin subunit B gene that we were looking at has the locus tag fig|1048256.3.peg.1439.
12. Mousing over the gene immediately upstream reveals the A subunit.
13. If you click on that gene in the genome browser, a pop-up box shows you specific information about it. Double-click on the first line under Feature Details.
14. This takes you to the landing page for this gene.
15. So there is a Shiga toxin subunit A in this genome. Why didn’t we see it in the tool? Now you’re exposed to some of the problems research biologists have. The gene is present, but we don’t see it because it has not yet been assigned to a protein family. Look down under Functional Properties. You’ll see that next to FIGFam Assignments, there is nothing assigned.
This means that it is not assigned to a protein family. Below I’ve provided a comparison of the A and B subunits. You can see that the Shiga toxin B subunit has a FIGFam assignment, but A does not. That’s why only the B subunit is seen in the Protein Family Sorter.

Figure panels: Shiga toxin A subunit; Shiga toxin B subunit

Searching for specific genes in PATRIC

Scientists studying the 2011 outbreak found that genomes isolated from the E. coli bacteria associated with the epidemic carried certain genes that had previously been associated with virulence (attA, aggR, aap, aggA, and aggC). In addition, these strains carried the gene for a Shiga-toxin 2 variant (stx2a). In contrast, these same genomes were found to be missing other typical Shiga-toxin–producing E. coli genes (stx1, eae, and ehx).

In this part of the exercise, we are about to embark on one of the most frustrating aspects of searching for information that research biologists encounter. In an age where there is an abundance of information about organisms, their genomes and genes, and how those genes are expressed, scientists are often unable to find the information that could help their research. Sometimes the data is located in different repositories, and each of these places calls the genes by different names or by different IDs. Scientists often rely on older publications that identify their gene of interest by a certain name, and that name may no longer exist in any resource. And sometimes, the annotation pipeline that is used to call the genes on a genome and name them may not recognize that a specific gene is there. Part of this exercise will be to map whatever data we can from the outbreak genomes in PATRIC and find the discrepancies in the available information.

1. In the search box at the top of the page, enter stx2a and coli. This will narrow the search to the E. coli genomes. Hit return.
2. This will take you to the Search Results page.
This page always has the same structure: the genes with the best hits to your search term appear at the top, followed by genomes. The search results also include taxonomy (if you’re looking for a species, genus, family or higher) and experiments that match your search term. The result sections appear in this order: Genes, Genomes, Taxonomy, Experiments.
3. Look at the Features at the top of the results. These are the genes that match your search; here, 85 genes match the search terms. Each result shows the gene name, the genome name, and the RefSeq locus tag. A symbol next to a result means that the gene is a RefSeq annotation, and may or may not have a PATRIC annotation.
4. As there are 85 features that match this search, let’s be more specific and refine it. In the search box, enter stx2a and O104 and hit return.
5. The results table shows fewer genes.
6. Click on the name of the first gene in the list. This will take you to the landing page for that gene.

Assignment

Use the landing page to fill out the table below, and then search for the other genes in PATRIC. You will not be able to find all of them, and to locate some of them you may have to broaden your scope (hint: start with the O104 genomes, and then change to “coli” if necessary).

Gene Name | PATRIC locus tag | E. coli strain | FigFam number | Product Description in PATRIC
attA | | | |
aggR | | | |
aap | | | |
aggA | | | |
aggC | | | |
stx2a | fig|1090928.3.peg.1113 | O104:H4 str. E112/10 | None | Shiga-like toxin II subunit A precursor (EC 3.2.2.22)
stx1 | | | |
eae | | | |
ehx | | | |

In a previous exercise, you learned how to use the FIGFam IDs in the Protein Family Sorter tool to see the presence or absence of certain genes across various genome groups. Use this technique to examine the genomes from the 2011 German outbreak. Which genes do the genomes share, and which are they lacking? Expand to the other outbreak genomes outside of Germany. Do they have a similar pattern? Look carefully at the O104 genomes that were not part of the 2011 epidemic. Do any of those genomes have the same pattern as you see in the German genomes?
What are the differences?

References

1. Frank, C., et al., Epidemic profile of Shiga-toxin-producing Escherichia coli O104:H4 outbreak in Germany. N Engl J Med, 2011. 365(19): p. 1771-80.
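Supplementary sketch

The presence/absence comparison asked for in the final assignment can be written down as a small set computation. The sketch below is only an illustration of the idea behind the heatmap, not how the Protein Family Sorter is implemented: the genome names, the family IDs (FAM_A and so on), and the function names are hypothetical placeholders, not real FIGFam assignments.

```python
# Hypothetical annotations: genome name -> set of protein family IDs.
# These are placeholders for illustration, not real PATRIC data.
ANNOTATIONS = {
    "outbreak_1": {"FAM_A", "FAM_B", "FAM_C"},
    "outbreak_2": {"FAM_A", "FAM_B", "FAM_C"},
    "older_1": {"FAM_A", "FAM_D"},
    "older_2": {"FAM_A", "FAM_B", "FAM_D"},
}

# Hypothetical genome groups, mirroring groups saved in a workspace.
GROUPS = {
    "Germany 2011": ["outbreak_1", "outbreak_2"],
    "O104 not 2011": ["older_1", "older_2"],
}


def presence_matrix(annotations):
    """Return (families, rows): one 0/1 row per genome, one column per family."""
    families = sorted(set().union(*annotations.values()))
    rows = {
        genome: [1 if fam in fams else 0 for fam in families]
        for genome, fams in annotations.items()
    }
    return families, rows


def shared_by_all(annotations, genomes):
    """Families present in every genome of the list."""
    return set.intersection(*(annotations[g] for g in genomes))


def present_in_any(annotations, genomes):
    """Families present in at least one genome of the list."""
    return set().union(*(annotations[g] for g in genomes))


def group_specific(annotations, group, other):
    """Families found in every genome of `group` and in no genome of `other`."""
    return shared_by_all(annotations, group) - present_in_any(annotations, other)


families, rows = presence_matrix(ANNOTATIONS)
for genome, row in rows.items():
    print(genome, row)

outbreak_only = group_specific(
    ANNOTATIONS, GROUPS["Germany 2011"], GROUPS["O104 not 2011"]
)
print(sorted(outbreak_only))
```

A row of zeros and ones corresponds to one genome's row of black and colored cells in the heatmap. A gene that has no protein family assignment, like the Shiga toxin A subunit discussed earlier, would never appear in such a matrix even though it is present in the genome.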