Part 1 - Susquehanna University

advertisement
Spring 2015
BIOL 312: Microbiology
A Town on Fire
Metagenomic Analysis of Bacterial
Communities in Soils Overlying the
Centralia, Pennsylvania Mine Fire
Instructor: Dr. Tammy Tobin
Susquehanna University
E-Mail: tobinjan@susqu.edu
Team Application Activity #1: An Introduction to LINUX and QIIME
An Introduction to MacQIIME
Quantitative Insights into Microbial Ecology (QIIME) is an open source pipeline that runs in a LINUX environment. It
can be used to process next generation sequencing data in a variety of ways that range from making sure that all of your
sequences are of high enough quality to be used (quality trimming), to performing a whole suite of phylogenetic and
statistical analyses on the quality trimmed data. We will be utilizing many of these functions in this case study, but first
you must get used to working in the LINUX environment using the Mac Terminal, which is part of the operating systems
on all Macs. It has been pre-loaded with MacQIIME, so it should be ready to go. The Linux and QIIME tutorials that
follow are largely the work of Dr. Regina Lamandella at Juniata College. I have tweaked them a bit to be appropriate for
our operating system and case study.
Unix/Linux Tutorial
Linux is an open-source Unix-like operating system. It allows the user considerable flexibility and control over the
computer by command line interaction. Many bioinformatics pipelines are built for Unix/Linux environment; therefore it
is a good idea to become familiar with Linux basics before beginning bioinformatics.
Every desktop computer uses an operating system. The most popular operating systems in use today are Windows, Mac
OS, and UNIX. Linux is an operating system very much like UNIX, and it has become very popular over the last several
years. Operating systems are computer programs. An operating system is the first piece of software that the computer
executes when you turn the machine on. The operating system loads itself into memory and begins managing the
resources available on the computer. It then provides those resources to other applications that the user wants to execute.
The shell- The shell acts as an interface between the user and the kernel. When a user logs in, the login program checks the
username and password, and then starts another program called the shell. The shell is a command line interpreter (CLI). It
interprets the commands the user types in and arranges for them to be carried out. The commands are themselves
programs: when they terminate, the shell gives the user another prompt ($ on our systems).
Filename Completion - By typing part of the name of a command, filename or directory and pressing the [Tab] key, the
shell will complete the rest of the name automatically. If the shell finds more than one name beginning with those letters
you have typed, it will pause, prompting you to type a few more letters before pressing the tab key again.
History - The shell keeps a list of the commands you have typed in. If you need to repeat a command, use the cursor keys
to scroll up and down the list or type “history” for a list of previous commands.
Files and Processes
Everything in UNIX is either a file or a process.
A process is an executing program identified by a unique process identifier. A file is a collection of data. They are created
Metagenomic Analysis of Bacterial Communities in Soils Overlying the Centralia,
Pennsylvania Mine Fire
1
by users using text editors, running compilers etc.
Examples of files:
 A document (report, essay etc.)
 The text of a program written in some high-level programming language
 Instructions comprehensible directly to the machine and incomprehensible to a casual user, for example, a
collection of binary digits (an executable or binary file);
 A directory, containing information about its contents, which may be a mixture of other directories
(subdirectories) and ordinary files.
It is not required to have a Linux operating system to use QIIME. We will be running the Linux environment through the
Mac Terminal. So let’s proceed on to the team activity!
Metagenomic Analysis of Bacterial Communities in Soils Overlying the Centralia,
Pennsylvania Mine Fire
2
Team Application Activity # 1: Practicing with LINUX and MacQIIME
Names of Team Members:
Part One: Practicing with the LINUX environment
Instructions: Follow each of the steps below, and answer the questions in the spaces provided.
Please note that spelling, spaces, cases, etc. are absolutely critical in LINUX. You can always
type his (for history) to see what you typed if you get an error message…..
1.
Double click on the Terminal program icon (located in Applications – Utilities) to open it. You should see
your user name followed by a dollar sign (username $). Every time you see a $, it is a prompt that is telling
you that it is ready for the next programming command.
2.
Tell the computer that you want to run the MacQIIME shell by typing MacQIIME exactly as it is written on
this page and hitting return. IF you have typed the command correctly, you will now see the following
prompt in the command line:
MacQIIME your username:~$
3.
Understanding the file structure and knowing how to use some basic Linux commands are essential for using
QIIME effectively. Below is a simplified version of the file structure we see in our distribution of Linux
4.
The file structure is important when we use the command line, since we need to tell the shell where to find
certain files, or where to output the results. The full path to qiime in this example would be /home/qiime.
Metagenomic Analysis of Bacterial Communities in Soils Overlying the Centralia,
Pennsylvania Mine Fire
3
5.
In your worksheet, answer the following question: If you want to work in the Shared_Folder directory, what
is the full path that you would type to get there?
6.
You can always determine which directory you are working in by typing pwd. Type it now. What do you
see? That is your home directory.
7.
If you want to list the contents of a directory, you use the command ls (list). Which files and directories do
you see in your home directory?
8.
In order to change directories, you will use the command cd. Navigate to the Desktop directory (it should
have shown up when you typed ls) by typing in cd Desktop. The command line should now indicate that
you are in the Desktop directory:
MacQIIME username:Desktop $
9.
List a few of the contents of your Desktop.
10. What command did you just use to get that information?
11. Go to the Centralia_Case_Study directory. What command did you use to do that?
12. List the files and folders that you see in the Centralia_Case_Study directory:
Metagenomic Analysis of Bacterial Communities in Soils Overlying the Centralia,
Pennsylvania Mine Fire
4
13. You can go up one level in your directories by typing cd .. (note there is a space between the cd and the two
periods). Go ahead and do that. Which directory are you in now?
14. cd back to the Centralia_Case_Study directory. You will need to be in that directory in order to perform your
QIIME analyses.
Practicing with QIIME: THE QIIME Workflow
1.
In the the Centralia_Case_Study folder, you should have seen three files: catfna, catqual and
mapping_centralia.txt. The mapping file contains a chart (shown below) that contains a whole
bunch of information about each metagenomic sample.





2.
Sample ID: A name that I have given to each sample, in this case S1, S2 and S3
Barcode Sequence: A short DNA sequence that is added to the 5’end of every PCR
fragment in a sample. So, every PCR fragment from S1 will start with the sequence
ACGAGTGCGTA. This will let the QIIME program sort the sequences based on which
borehole they came from in later analyses.
Linker Primer Sequence/Primer: This is the sequence of the PCR primer that was used to
sequence the 16S rRNA genes. In each sequence, it will be found immediately after the barcode
sequence. 8F is the name of the primer used.
Sulfate, Sulfur, Ammonia, Nitrate, pH, Temp…: Chemical analyses were performed on each
of the borehole samples, and these are the results, in parts per million (ppm). If you look at
sample S1, you will see that it has 250 ppm sulfate and came from the 52°C borehole.
File name and description: Sample identifiers that are needed by QIIME, but are not relevant
to your analyses.
Catfna is a file that contains all of the metagenomic sequence data from all three study sites (that
is why the barcodes are needed!) in a format called “FASTA”. Go ahead and double click on that
file in your desktop folder and take a look. If the computer does not open the file right away, choose
‘Open with’ and Text Edit (found in Applications). You should see thousands of sequences that look
like this:
>HD4AU5D04IK2EL#AGACGCACTCA
AGACGCACTCAGAGTTTGATCATGGGCTCAGAATCAAACGCTGGCGGCGCGCTT
AACACATGC
The first line contains a unique sequence sample ID (assigned by the sequencer), followed by the
sample barcode. In this case the barcode is AGACGCACTCA, which lets QIIME know that this
sample came from S2 (the 37°C borehole – see the chart above). The second line is the sequence
of the metagenomic sample itself, starting with the barcode, and then the 8F primer, and then the
16SrRNA sequence itself. As you can see, it is quite short. The length of these sequences can
vary quite a bit, ranging from just a few bases to a few hundred.
What is the barcode sequence for the 60°C sample? What is its ammonia content in ppm?
Metagenomic Analysis of Bacterial Communities in Soils Overlying the Centralia,
Pennsylvania Mine Fire
5
3.
Catqual is a file that contains a numerical quality score, called the Phred score (Q) for each of the
bases in the sequence. The formula for Q is shown below (P is the probability of a base calling error):
So, a Phred score of 10 means that there is a 1/10 chance of an incorrect base call (90% accuracy) and
a Phred score of 40 means that there is a 1/10,000 chance of an incorrect base call. If a sequence does
not have adequate quality scores (we will use 20, or 99% accuracy, as the cut-off), QIIME will filter it
out of the downstream analyses.
4.
Open desktop catqual file with Text Edit. Compare the third sequence (HD4AU5D04IXVDY) with
the fourth sequence (HD4AU5D04IXVNZ). Based on their Phred scores, which sequence do you
think is the most reliable? Which is the least reliable? Justify your answers.
5.
If you were to go ahead to use sequences with low Phred scores, how do you think that would impact
your final phylogenetic analysis?
6.
Go back to the terminal and make sure that you are in the Centralia_Case_Study directory. We will
now ask QIIME to quality trim the sequence files (those below 20 will be discarded at this step) and
to sort the sequences, which are all mixed together at this point, into groups by sample location (S1,
S2 or S3, in this case). The command for this is:
split_libraries.py.
However, in order for QIIME to perform this analysis, you will need to provide it with other
information, as well. Specifically, you need to tell it what your input file is (catfna), what your
mapping file is (mapping_centralia.txt), what your quality file is (catqual), how many bases your
barcodes have, if they are different from the default value of 12 (we have 11 bases), and, finally, what
to call your output file. We will use split_library_centralia_output.
The overall command goes as follows: split_libraries.py -m mapping_centralia.txt -f catfna -q catqual
-b 11 -o split_library_centralia_output
-m designates a mapping file
-f designates the input FASTA file
-q designates the input quality file (Phred scores)
-b defines the length of the barcode sequence in base pairs
-o designates the output file
Type the command above exactly as it is written…making sure that spaces, underscores, etc. are all
correct, then hit return. When QIIME has finished its work, you will see the $ prompt again.
7.
List the contents of the Centralia_Case_Study directory again. If everything worked, you should see a
Metagenomic Analysis of Bacterial Communities in Soils Overlying the Centralia,
Pennsylvania Mine Fire
6
new folder. What is the name of that folder?
8.
cd to that folder and list its contents…what three files are there?
9.
Open the seqs.fna file from the desktop using Text Edit. You should see FASTA files that are now
identified and organized by sample location. For example, the sequence shown below came from S2,
the 37°C borehole.
>S2_1 HD4AU5D04JD7UD#AGACGCACTCA orig_bc=AGACGCACTCA new_bc=AGACGCACTCA bc_diffs=0
AGAATCAAACGCTGGCGGCGCGCTTAACACATGCAAGTCGAGCGAGAAAGGGGAGCAATCCCTGAGTACAGCG
GCGTACGGGTGAGTAACACGTGGGTAATCCACCTTCTAGTGGGGAATAACCCTGGGAAACCGGGGCTAATACCG
CATAAGCCCGTGAGGGGAAAGCCGAAAGGCGCTGGAAGAGGAGCCCGCGGCCGATTAGCTAGTTGGTGAGGTA
10. On the next page you will see a QIIME workflow. Congratulations: you have just completed the
first step! During the next class you will do the rest of the analyses. Don’t worry…it is simpler
than it looks!
Metagenomic Analysis of Bacterial Communities in Soils Overlying the Centralia,
Pennsylvania Mine Fire
7
Metagenomic Analysis of Bacterial Communities in Soils Overlying the Centralia,
Pennsylvania Mine Fire
8
Download