Examining evolution in bacteria: Mycobacterium tuberculosis

Tuberculosis (TB) is a major global health problem and the second leading cause of death from an infectious disease, after HIV[1]. It is an ancient disease that has plagued people for thousands of years, and our relationship with its causative agent, Mycobacterium tuberculosis, is so intimate that the pathogen and the host appear to have coevolved[2, 3]. M. tuberculosis is spread through the inhalation of infectious aerosols, and newly infected individuals progress to one of two states: a symptomatic and potentially infectious state, known as active TB, or an asymptomatic, noninfectious state called latent TB infection (LTBI). Approximately one-third of the global human population has LTBI[1], and although LTBI does not produce any clinical symptoms, it carries the risk of developing into active TB disease. In 2013, an estimated 9 million people developed TB and 1.5 million died from the disease[4]. Despite evidence that TB is slowly declining, the emergence and spread of multidrug-resistant strains of Mycobacterium tuberculosis (MDR-TB) represents a major challenge to the global control of the disease[4].

Looking at how bacteria respond to antibiotics is a good way to study evolution. Many scientific studies have found that drug resistance can be associated with a single nucleotide change that results in a different amino acid being encoded at one position. A simple change like this can have drastic consequences for a human patient. We will examine genes that are associated with antibiotic resistance in Mycobacterium tuberculosis. The list of genes comes from the Tuberculosis Drug Resistance Mutation Database (TBDream, https://tbdreamdb.ki.se/Info/)[5] and is provided in the text file that I have also sent.

Creating a group of genomes
1. Enter the word “Belarus” into the global search box at PATRIC and click the search icon.
2. This will return 145 genomes in the Search Results page.
These genomes are all associated with the word “Belarus,” either in the comment section provided for each genome or in the isolation country metadata.
3. To create the genome group, click on the number 146 in the parentheses.
4. This will return a table that lists the first 20 of 146 genomes.
5. To create a group with all 146 of the genomes, you will need to resize the table. Scroll down to the bottom of the table and change the number in the text box from 20 to 146, then hit return.
6. To create the genome group, scroll up to the top of the table and click in the checkbox next to the text that says “Select all (145) displayed genome(s).”
7. Above that checkbox, click on the Workspace Folder that says “Add Genome(s)”.
8. This will generate a pop-up window that asks you to log in. You must log in to continue.
9. You will be asked if you want to name the group.
10. Clicking on the down arrow next to “None” gives you the option of creating a new group.
11. You can name the group by typing the name you want (I used Belarus) in the text box. Click on the “Save to Workspace” button at the bottom of the pop-up window, and this will save the 145 genomes into your workspace.

Finding the Protein Families that include the TBDream genes
1. The Rv locus tags belong to the Mycobacterium tuberculosis H37Rv genome that was sequenced by the Wellcome Trust Sanger Institute in 1998. Eight different laboratories have now sequenced this same genome, and all of those sequences are in PATRIC. We need to find the original genome, which will map to the locus tags from TBDream in PATRIC. In the global search, enter the text H37Rv and Sanger.
2. On the landing page that returns, click on the Mycobacterium tuberculosis H37Rv name.
3. This will take you to the landing page for that genome.
4. On the left hand side under the tabs are the Search Tools. Click on Protein Family Sorter, which is the fourth one down.
5.
This will take you to the landing page for that tool, which is preselected to search the Wellcome Sanger H37Rv genome.
6. Enter all 39 of the TBDream locus tags into the text box. To get exact matches, each locus tag must be enclosed in quotation marks (e.g., “Rv0006”). Click the Search button.
7. This will return a protein family table that includes a lot of information about the genes that were entered. A filter is on the left side, but it won’t be used in this example.
8. We need the protein family identifiers, which are found in the first column and begin with FIG. In the toolbar immediately above the table, find the Download box and click on the down arrow next to the table to choose the download format for the data.
9. To get just the information in this table, select Excel file (.xlsx). This will give you just the summary information for the protein families. If you want all the information across all 8 M. tuberculosis H37Rv genomes, including all the locus tags across those genomes that belong to each family, select Family Details: Excel file (.xlsx).
10. From the Excel file, you need to select the column that has all the protein family ids. Each of these ids needs to be enclosed in quotation marks (e.g., “FIG00000080”) to find an exact match in the next step.
11. There were 39 Rv locus tags that were originally used to find families, but only 37 protein families were returned. Using the original list, plus the protein family list with all the locus tags (see step 9 above, Family Details: Excel file (.xlsx)), you can highlight duplicate values (the locus tags) in Excel and see that protein families for two genes were not returned. These are Rv2427A and Rv3126. To get the protein families, enter each of these genes individually in the global search box. To get an exact match, put the search term in quotation marks.
12. When you enter “Rv3126” into the global search, the results page shows two possible hits.
You will need to add the protein family id (“FIG00823168”) for this family to your list.
13. The last locus tag (Rv2427A) returns no results at PATRIC. The “A” next to the locus number is non-standard. If you remove the “A” and enter “Rv2427” in global search, four results are returned.
14. Looking at the original list from TBDream, Rv2427A had a gene name of oxyR. On the search results page, the first hit is to Rv2427Ac, which is identified as a pseudogene. This is why this gene does not have a protein family in PATRIC.
15. If you’re curious about Rv2427A, you can click on the hyperlink for the first hit (725571..2726087oxyR') and this will take you to the landing page for that feature.
16. If you click on the genome browser tab, you can see how the RefSeq annotation compares to the PATRIC annotation.

Do the Belarus genomes have all of the protein families that contain the TBDream genes?
1. To look for presence or absence of the protein families within a genome group that you have created, click on the Tools tab and, under Comparative Genomics, select the Protein Family Sorter tool.
2. This will take you to the landing page for that tool. If you are a registered user and are logged in, you will see groups you have previously created and saved in the Select Groups box.
3. Scroll down and select the group you want to examine. In this case, I selected the “Belarus Mycobacterium” group.
4. In the text box on the Protein Family Sorter tool page, enter all the protein family ids, each one inside quotation marks. This will give you an exact match. Then click the Search button.
5. This will return the results page for the tool. Thirty-seven families were found. The protein family table shows a lot of information about these families, including those that have many more (or fewer) proteins than the 144 genomes examined.
6. To see presence and absence of genes within these protein families across the selected genomes, click on the Heatmap tab.
7.
This will take you to the heatmap view, where absence (black cells) and presence (yellow, mustard and orange cells) can be seen across all 144 genomes. The genomes are on the y-axis, and the 37 protein families on the x-axis.

Examining an alignment to look for amino acid changes associated with antibiotic resistance
1. The data at TBDream describe a specific amino acid change in the Rv0006 gene, at position 90, that is associated with a change in antibiotic resistance: an alanine (A) is replaced by a valine (V) at that location. We can look across the Belarus genome group to see how many of the 144 genomes have a valine at position 90. Rv0006 is DNA gyrase subunit A, and reading the names of the genes across the top of the heatmap shows that the first column contains DNA gyrase subunit A. Clicking on the name in the column will select all the genes within the column.
2. This generates a pop-up window that gives you choices about what to do with the selected data. Click the Show Proteins button at the bottom of the pop-up window.
3. This will open a new window that shows the genes found across the 144 Belarus genomes.
4. Resize the table so that you can see all the genes by entering 144 and hitting return.
5. To generate a multiple sequence alignment, first select the checkbox at the top of the first column, Genome Name. This will select all of the genes.
6. Next, in the toolbar heading for the table, go to the Tools section and click on the Multiple Sequence Alignment icon.
7. This will open up the Protein Alignment page, which has a gene tree for the selected proteins on the left hand side and the actual alignment on the right hand side. You can scroll along the alignment using the slider at the bottom of the page.
8. To see the alignment in a more traditional manner, click on the Printable Alignment button that is above the alignment on the right hand side of the page.
9.
This will open up a window that has the alignment.
10. Scroll down the alignment, find position 90, and then continue scrolling down. Towards the end of the 144 genomes you can see that 25 of the Belarus genomes have a valine in that specific position.

How broadly are these protein families shared across all the Mycobacterium tuberculosis genomes?
1. Click on the Organism tab and then click on Mycobacterium.
2. This will take you to the Mycobacterium landing page.
3. Click on the Taxonomy tab.
4. Scroll down the list until you reach the Mycobacterium tuberculosis complex.
5. Open the folder for the Mycobacterium tuberculosis complex, find Mycobacterium tuberculosis, and click on the first icon on the left hand side.
6. This takes you to the landing page for M. tuberculosis. There are 1952 M. tuberculosis genomes currently available in PATRIC.
7. On the left hand side under the tabs are the Search Tools. Click on Protein Family Sorter, which is the fourth one down.
8. This will take you to the landing page for that tool, which is preselected to search across all 1952 M. tuberculosis genomes.
9. In the text box on the Protein Family Sorter tool page, enter all the protein family ids, each one inside quotation marks. This will give you an exact match. Then click the Search button.
10. This will return the results page for the tool. Thirty-eight families were found. The protein family table shows a lot of information about these families, including those that have many more (or fewer) proteins than the 1952 genomes examined. The filter on the left hand side does not have a list of genomes, as that filter can only display 500 genomes.
11. To see presence and absence of genes within these protein families across the selected genomes, click on the Heatmap tab.
12. This will take you to the heatmap view, where absence (black cells) and presence (yellow, mustard and orange cells) can be seen across all 1833 genomes.
The genomes are on the y-axis, and the 38 protein families on the x-axis.

Assignment: Answer the following questions using the PATRIC website.

Rv0005 is a gene that encodes DNA gyrase subunit B (EC 5.99.1.3).
a. What protein family does this gene belong to (FIG….?)?
b. In how many of the Belarus genomes is this gene 675 aa long?
c. A change in amino acid sequence from an Arginine (Arg, R) to a Cysteine (Cys, C) around position 457 has resulted in resistance to the drug ofloxacin. How many of the Belarus genomes have the “C” in that approximate position that might make them resistant to this antibiotic? You will find that some of the proteins have different lengths, differing by as much as 20 aa. You will have to take this into account while looking for this specific change.
d. In this same example, scientists have found antibiotic resistance associated with a change from Serine (Ser, S) to Phenylalanine (Phe, F) at position 458. How many of the Belarus genomes have this specific change?
e. A study in Thailand showed that a change from Aspartic Acid (Asp, D) to Asparagine (Asn, N) is associated with resistance to multiple drugs. How many of the Belarus genomes have this specific change?
f. Finally, if susceptibility to these antibiotics depended on having none of the three changes mentioned above, how many of the Belarus genomes could be said to be susceptible?

Bonus Question. Scientists have found that resistance to ethambutol has been associated with a single amino acid change in tuberculosis genomes isolated from India. This change, found in Rv3797, was found at position 270 and involved a change from Isoleucine (Ile, I) to Threonine (Thr, T).
A. How many genomes from India show a change from I > T?
B. How many genomes from India have an Isoleucine in that approximate position?
C. Considering when these specific genomes were collected, what is the earliest time that this particular mutation (T instead of I in the position around 270) could have first appeared in India?
D.
What is the specific nucleotide change that is responsible for this change? (Hint: Use the genome browser.)

References
1. Wlodarska, M., et al., A microbiological revolution meets an ancient disease: improving the management of tuberculosis with genomics. Clin Microbiol Rev, 2015. 28(2): p. 523-39.
2. Brosch, R., et al., A new evolutionary scenario for the Mycobacterium tuberculosis complex. Proc Natl Acad Sci U S A, 2002. 99(6): p. 3684-9.
3. Bos, K.I., et al., Pre-Columbian mycobacterial genomes reveal seals as a source of New World human tuberculosis. Nature, 2014. 514(7523): p. 494-7.
4. Fonseca, J.D., G.M. Knight, and T.D. McHugh, The complex evolution of antibiotic resistance in Mycobacterium tuberculosis. Int J Infect Dis, 2015. 32: p. 94-100.
5. Sandgren, A., et al., Tuberculosis drug resistance mutation database. PLoS Med, 2009. 6(2): p. e2.
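
Supplementary sketch: counting residues at an alignment position

Scanning the printable alignment by eye for one position (position 90 in the walkthrough, or the approximate position 457 in the assignment) gets harder when the proteins differ in length, because gaps shift the column in which a given residue appears. The Python sketch below shows one possible way to do the tally outside PATRIC. It is not part of the PATRIC interface. It assumes you have already saved the alignment and loaded it into a dictionary mapping a name to a gapped sequence, that gaps are written as "-", and that positions are numbered along a reference sequence you choose. The function names and the toy sequences are made up for illustration.

```python
from collections import Counter


def column_for_position(aligned_ref: str, position: int, gap: str = "-") -> int:
    """Return the 0-based alignment column that holds the given 1-based
    residue position of the reference sequence, skipping gap characters."""
    seen = 0
    for col, char in enumerate(aligned_ref):
        if char != gap:
            seen += 1
            if seen == position:
                return col
    raise ValueError(
        f"reference has only {seen} residues; position {position} is out of range"
    )


def tally_residues(alignment: dict[str, str], ref_name: str, position: int) -> Counter:
    """Count which residue each aligned sequence carries in the column that
    corresponds to `position` in the reference sequence."""
    col = column_for_position(alignment[ref_name], position)
    return Counter(seq[col] for seq in alignment.values())


# Toy alignment with made-up sequences (not real gyrase data).
toy = {
    "ref":      "MT-DAVKL",
    "genome_1": "MTADVVKL",
    "genome_2": "-T-DAVKL",
}

# Position 4 of the reference is the "A" in alignment column 4 (0-based),
# because the gap in column 2 is not counted as a residue.
counts = tally_residues(toy, "ref", 4)
print(counts)
```

In the toy data the reference and genome_2 carry an A at reference position 4 and genome_1 carries a V, so the tally separates the two variants even though the sequences contain gaps. For the real exercise, the H37Rv protein would be a natural choice of reference, since the TBDream positions are given relative to the Rv locus tags.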