Spring 2015 BIOL 312: Microbiology A Town on Fire Metagenomic Analysis of Bacterial Communities in Soils Overlying the Centralia, Pennsylvania Mine Fire Instructor: Dr. Tammy Tobin Susquehanna University E-Mail: tobinjan@susqu.edu Team Application Activity #1: An Introduction to LINUX and QIIME An Introduction to MacQIIME Quantitative Insights into Microbial Ecology (QIIME) is an open source pipeline that runs in a LINUX environment. It can be used to process next generation sequencing data in a variety of ways that range from making sure that all of your sequences are of high enough quality to be used (quality trimming), to performing a whole suite of phylogenetic and statistical analyses on the quality trimmed data. We will be utilizing many of these functions in this case study, but first you must get used to working in the LINUX environment using the Mac Terminal, which is part of the operating systems on all Macs. It has been pre-loaded with MacQIIME, so it should be ready to go. The Linux and QIIME tutorials that follow are largely the work of Dr. Regina Lamandella at Juniata College. I have tweaked them a bit to be appropriate for our operating system and case study. Unix/Linux Tutorial Linux is an open-source Unix-like operating system. It allows the user considerable flexibility and control over the computer by command line interaction. Many bioinformatics pipelines are built for Unix/Linux environment; therefore it is a good idea to become familiar with Linux basics before beginning bioinformatics. Every desktop computer uses an operating system. The most popular operating systems in use today are Windows, Mac OS, and UNIX. Linux is an operating system very much like UNIX, and it has become very popular over the last several years. Operating systems are computer programs. An operating system is the first piece of software that the computer executes when you turn the machine on. The operating system loads itself into memory and begins managing the resources available on the computer. It then provides those resources to other applications that the user wants to execute. The shell- The shell acts as an interface between the user and the kernel. When a user logs in, the login program checks the username and password, and then starts another program called the shell. The shell is a command line interpreter (CLI). It interprets the commands the user types in and arranges for them to be carried out. The commands are themselves programs: when they terminate, the shell gives the user another prompt ($ on our systems). Filename Completion - By typing part of the name of a command, filename or directory and pressing the [Tab] key, the shell will complete the rest of the name automatically. If the shell finds more than one name beginning with those letters you have typed, it will pause, prompting you to type a few more letters before pressing the tab key again. History - The shell keeps a list of the commands you have typed in. If you need to repeat a command, use the cursor keys to scroll up and down the list or type “history” for a list of previous commands. Files and Processes Everything in UNIX is either a file or a process. A process is an executing program identified by a unique process identifier. A file is a collection of data. They are created Metagenomic Analysis of Bacterial Communities in Soils Overlying the Centralia, Pennsylvania Mine Fire 1 by users using text editors, running compilers etc. Examples of files: A document (report, essay etc.) The text of a program written in some high-level programming language Instructions comprehensible directly to the machine and incomprehensible to a casual user, for example, a collection of binary digits (an executable or binary file); A directory, containing information about its contents, which may be a mixture of other directories (subdirectories) and ordinary files. It is not required to have a Linux operating system to use QIIME. We will be running the Linux environment through the Mac Terminal. So let’s proceed on to the team activity! Metagenomic Analysis of Bacterial Communities in Soils Overlying the Centralia, Pennsylvania Mine Fire 2 Team Application Activity # 1: Practicing with LINUX and MacQIIME Names of Team Members: Part One: Practicing with the LINUX environment Instructions: Follow each of the steps below, and answer the questions in the spaces provided. Please note that spelling, spaces, cases, etc. are absolutely critical in LINUX. You can always type his (for history) to see what you typed if you get an error message….. 1. Double click on the Terminal program icon (located in Applications – Utilities) to open it. You should see your user name followed by a dollar sign (username $). Every time you see a $, it is a prompt that is telling you that it is ready for the next programming command. 2. Tell the computer that you want to run the MacQIIME shell by typing MacQIIME exactly as it is written on this page and hitting return. IF you have typed the command correctly, you will now see the following prompt in the command line: MacQIIME your username:~$ 3. Understanding the file structure and knowing how to use some basic Linux commands are essential for using QIIME effectively. Below is a simplified version of the file structure we see in our distribution of Linux 4. The file structure is important when we use the command line, since we need to tell the shell where to find certain files, or where to output the results. The full path to qiime in this example would be /home/qiime. Metagenomic Analysis of Bacterial Communities in Soils Overlying the Centralia, Pennsylvania Mine Fire 3 5. In your worksheet, answer the following question: If you want to work in the Shared_Folder directory, what is the full path that you would type to get there? 6. You can always determine which directory you are working in by typing pwd. Type it now. What do you see? That is your home directory. 7. If you want to list the contents of a directory, you use the command ls (list). Which files and directories do you see in your home directory? 8. In order to change directories, you will use the command cd. Navigate to the Desktop directory (it should have shown up when you typed ls) by typing in cd Desktop. The command line should now indicate that you are in the Desktop directory: MacQIIME username:Desktop $ 9. List a few of the contents of your Desktop. 10. What command did you just use to get that information? 11. Go to the Centralia_Case_Study directory. What command did you use to do that? 12. List the files and folders that you see in the Centralia_Case_Study directory: Metagenomic Analysis of Bacterial Communities in Soils Overlying the Centralia, Pennsylvania Mine Fire 4 13. You can go up one level in your directories by typing cd .. (note there is a space between the cd and the two periods). Go ahead and do that. Which directory are you in now? 14. cd back to the Centralia_Case_Study directory. You will need to be in that directory in order to perform your QIIME analyses. Practicing with QIIME: THE QIIME Workflow 1. In the the Centralia_Case_Study folder, you should have seen three files: catfna, catqual and mapping_centralia.txt. The mapping file contains a chart (shown below) that contains a whole bunch of information about each metagenomic sample. 2. Sample ID: A name that I have given to each sample, in this case S1, S2 and S3 Barcode Sequence: A short DNA sequence that is added to the 5’end of every PCR fragment in a sample. So, every PCR fragment from S1 will start with the sequence ACGAGTGCGTA. This will let the QIIME program sort the sequences based on which borehole they came from in later analyses. Linker Primer Sequence/Primer: This is the sequence of the PCR primer that was used to sequence the 16S rRNA genes. In each sequence, it will be found immediately after the barcode sequence. 8F is the name of the primer used. Sulfate, Sulfur, Ammonia, Nitrate, pH, Temp…: Chemical analyses were performed on each of the borehole samples, and these are the results, in parts per million (ppm). If you look at sample S1, you will see that it has 250 ppm sulfate and came from the 52°C borehole. File name and description: Sample identifiers that are needed by QIIME, but are not relevant to your analyses. Catfna is a file that contains all of the metagenomic sequence data from all three study sites (that is why the barcodes are needed!) in a format called “FASTA”. Go ahead and double click on that file in your desktop folder and take a look. If the computer does not open the file right away, choose ‘Open with’ and Text Edit (found in Applications). You should see thousands of sequences that look like this: >HD4AU5D04IK2EL#AGACGCACTCA AGACGCACTCAGAGTTTGATCATGGGCTCAGAATCAAACGCTGGCGGCGCGCTT AACACATGC The first line contains a unique sequence sample ID (assigned by the sequencer), followed by the sample barcode. In this case the barcode is AGACGCACTCA, which lets QIIME know that this sample came from S2 (the 37°C borehole – see the chart above). The second line is the sequence of the metagenomic sample itself, starting with the barcode, and then the 8F primer, and then the 16SrRNA sequence itself. As you can see, it is quite short. The length of these sequences can vary quite a bit, ranging from just a few bases to a few hundred. What is the barcode sequence for the 60°C sample? What is its ammonia content in ppm? Metagenomic Analysis of Bacterial Communities in Soils Overlying the Centralia, Pennsylvania Mine Fire 5 3. Catqual is a file that contains a numerical quality score, called the Phred score (Q) for each of the bases in the sequence. The formula for Q is shown below (P is the probability of a base calling error): So, a Phred score of 10 means that there is a 1/10 chance of an incorrect base call (90% accuracy) and a Phred score of 40 means that there is a 1/10,000 chance of an incorrect base call. If a sequence does not have adequate quality scores (we will use 20, or 99% accuracy, as the cut-off), QIIME will filter it out of the downstream analyses. 4. Open desktop catqual file with Text Edit. Compare the third sequence (HD4AU5D04IXVDY) with the fourth sequence (HD4AU5D04IXVNZ). Based on their Phred scores, which sequence do you think is the most reliable? Which is the least reliable? Justify your answers. 5. If you were to go ahead to use sequences with low Phred scores, how do you think that would impact your final phylogenetic analysis? 6. Go back to the terminal and make sure that you are in the Centralia_Case_Study directory. We will now ask QIIME to quality trim the sequence files (those below 20 will be discarded at this step) and to sort the sequences, which are all mixed together at this point, into groups by sample location (S1, S2 or S3, in this case). The command for this is: split_libraries.py. However, in order for QIIME to perform this analysis, you will need to provide it with other information, as well. Specifically, you need to tell it what your input file is (catfna), what your mapping file is (mapping_centralia.txt), what your quality file is (catqual), how many bases your barcodes have, if they are different from the default value of 12 (we have 11 bases), and, finally, what to call your output file. We will use split_library_centralia_output. The overall command goes as follows: split_libraries.py -m mapping_centralia.txt -f catfna -q catqual -b 11 -o split_library_centralia_output -m designates a mapping file -f designates the input FASTA file -q designates the input quality file (Phred scores) -b defines the length of the barcode sequence in base pairs -o designates the output file Type the command above exactly as it is written…making sure that spaces, underscores, etc. are all correct, then hit return. When QIIME has finished its work, you will see the $ prompt again. 7. List the contents of the Centralia_Case_Study directory again. If everything worked, you should see a Metagenomic Analysis of Bacterial Communities in Soils Overlying the Centralia, Pennsylvania Mine Fire 6 new folder. What is the name of that folder? 8. cd to that folder and list its contents…what three files are there? 9. Open the seqs.fna file from the desktop using Text Edit. You should see FASTA files that are now identified and organized by sample location. For example, the sequence shown below came from S2, the 37°C borehole. >S2_1 HD4AU5D04JD7UD#AGACGCACTCA orig_bc=AGACGCACTCA new_bc=AGACGCACTCA bc_diffs=0 AGAATCAAACGCTGGCGGCGCGCTTAACACATGCAAGTCGAGCGAGAAAGGGGAGCAATCCCTGAGTACAGCG GCGTACGGGTGAGTAACACGTGGGTAATCCACCTTCTAGTGGGGAATAACCCTGGGAAACCGGGGCTAATACCG CATAAGCCCGTGAGGGGAAAGCCGAAAGGCGCTGGAAGAGGAGCCCGCGGCCGATTAGCTAGTTGGTGAGGTA 10. On the next page you will see a QIIME workflow. Congratulations: you have just completed the first step! During the next class you will do the rest of the analyses. Don’t worry…it is simpler than it looks! Metagenomic Analysis of Bacterial Communities in Soils Overlying the Centralia, Pennsylvania Mine Fire 7 Metagenomic Analysis of Bacterial Communities in Soils Overlying the Centralia, Pennsylvania Mine Fire 8