Creation and Critical Evaluation of Two Methods for Taxonomic Classification of Metagenomic Sequences in Perl on Data with Known GC Content and on Fully Sequenced Genomes

Suruchi Anand, Manoshi Datta, Ana Daniela Guajardo, Nina Revko

Introduction

While genomics is the study of the genome sequences of individual organisms (which are usually cultivable and easily accessible), metagenomics studies the genetic information of whole communities of species. Metagenomic sequencing produces huge amounts of data recovered directly from the environment; this genetic material undergoes further processing, starting with the assembly of sequence reads, followed by gene prediction and functional annotation. Metagenomics therefore relies heavily on the accuracy of protein databases that were created through genomic studies, and it is unknown how their internal bias (they were assembled from information derived from common organisms) influences the analysis of real data.

The presence of multiple organisms in a sample creates the need for an additional step that predicts the phylogenetic origins of the processed data (binning). Analysis of metagenomic sequences without taxonomic assignment will always give biased results; accurate binning is therefore an essential pre-processing step that reduces the final error. The phylogenetic origins of test sequences are predicted mainly from compositional statistics and from sequence similarity.

In this project, we developed two binning methods: binning by GC content and binning with Markov models. We used benchmarking to critically evaluate and compare the methods' ability to classify sequence fragments accurately into their corresponding species populations. The tests were performed on real sequences coming from different backgrounds but whose GC content is known, and on short reads derived from recently completed genomes from the same environmental niche.

During a metagenomics experiment, thousands of individual DNA fragments are amassed from a particular environmental niche. To make this vast amount of information more tractable, the fragments must first undergo a process known as "binning," in which they are categorized by the organisms from which they originated. Several binning algorithms have been employed, the majority of which attempt to distinguish the origins of fragments based upon their DNA "signatures," or characteristic compositional patterns.

Methods

1. Choosing sequences.

Two sets of genomic data were used for assessment during development of the binning methods. In the first trials, 10 mitochondrial DNA sequences with lengths ranging from ~6,000 to ~30,000 nt were chosen from different kingdoms. To better assess the GC-content analysis, the sequences all differed from one another by at least 3% in GC content, according to the values stated by NCBI. For a more realistic test, UBA Acid Mine Drainage Biofilm Metagenome WGS sequences, obtained from CAMERA Calit2 (Acid Mine Drainage Metagenome), were used subsequently. From these shotgun sequences, 43 were semi-randomly picked so as to obtain sequences of varying sizes for the different tests. From these 43 sequences, two groups were formed: one containing 2 sequences of lengths ~37,000 nt and ~97,700 nt, and a second containing 4 sequences varying from ~8,600 nt to ~22,000 nt. These two groups were used for the main comparison between the GC-content binning method and the 2nd order Markov model binning method.
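Both binning methods below start from base composition: the GC content of a sequence is simply the fraction of its bases that are G or C. A minimal Perl sketch of that calculation (the subroutine name is illustrative; this is not the project's code):

    use strict;
    use warnings;

    # Fraction of G/C bases in a DNA string (case-insensitive).
    sub gc_content {
        my ($seq) = @_;
        my $gc  = ($seq =~ tr/GCgc//);   # tr/// in scalar context counts matches
        my $len = length $seq;
        return $len ? $gc / $len : 0;
    }

    printf "GC content: %.3f\n", gc_content("ATGCGCGTATTACGCG");   # 9 of 16 bases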
2. First approach: Binning by GC content

Although all DNA molecules are composed of the four basic nucleotides – adenine (A), thymine (T), cytosine (C), and guanine (G) – the distribution of these bases within a genome is far from uniform. Indeed, classes of organisms can be grouped by the so-called "GC content" of their genomes, a property that is highly variable between species. In general, the GC content of a genome increases with the taxonomic complexity of the species, a trend that can be attributed to differences in selective pressure, biases in mutation, and other environmental factors.

In this study, a binning algorithm was developed to match DNA fragments with the genomes from which they most likely originated. Using Bayesian inference, the probability that a given DNA fragment arose from a particular genome sequence was calculated, and the fragment was binned with the genome sequence that yielded the highest probability value.

Detailed Probability Calculations

Note: In these calculations, the fragment to be binned is denoted by the capital letter "S", and a candidate genome sequence by the letter "C".

(a) Manipulation of Bayes' Theorem: In general, the probability that a particular DNA fragment is associated with a given genome in the training data set can be calculated with Bayes' Theorem:

    P(C1|S) = P(S|C1) P(C1) / P(S)

Here, P(C1|S) is the posterior probability that the DNA fragment (S) is binned with a given training genome sequence (C1), while P(C1) represents the prior probability. To simplify the analysis, all prior probabilities were assumed to be equal (i.e., P(C1) = P(C2) = ... = P(Cn)). This assumption is valid for the data set since, a priori, the DNA fragment has an equal probability of being binned with any one of the genome sequences in the training set. The equation can therefore be rewritten as a simple fractional likelihood calculation:

    P(C1|S) = P(S|C1) / [P(S|C1) + P(S|C2) + ... + P(S|Cn)]

(b) Calculation of likelihood values from the binomial probability distribution: To solve for the posterior probability, the likelihood that a given fragment would be associated with each genome in the training set was determined. These likelihoods were assumed to follow a binomial distribution, since each individual nucleotide can be classified as either a G/C ("success") or an A/T ("failure") event. The binomial probability is

    P(S|C) = (n choose k) p^k (1 - p)^(n - k)

where n is the length of the fragment, k is the number of G/C bases in the test fragment, and p is the GC content of the training genome of interest.

(c) Binning: Once these likelihood values were obtained, the posterior probability of the test fragment under each training genome sequence was calculated with Bayes' Theorem. The test fragment was binned with the genome sequence that yielded the maximum posterior probability.
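A minimal Perl sketch of this decision rule, assuming equal priors (the names gc_log_likelihood, bin_fragment, and %gc_of_genome are illustrative, not the project's code). Because the (n choose k) term is identical for every candidate genome, it cancels out of the comparison, and working with logarithms avoids numerical underflow on long fragments:

    use strict;
    use warnings;

    # Log likelihood of observing k G/C bases in a fragment of length n, given
    # a genome with GC fraction p. The (n choose k) term is omitted because it
    # is the same for every candidate genome and cannot change the argmax.
    sub gc_log_likelihood {
        my ($n, $k, $p) = @_;
        $p = 1e-6     if $p < 1e-6;        # keep log() defined at the extremes
        $p = 1 - 1e-6 if $p > 1 - 1e-6;
        return $k * log($p) + ($n - $k) * log(1 - $p);
    }

    # Bin a fragment with the genome that maximizes the equal-prior posterior.
    sub bin_fragment {
        my ($frag, %gc_of_genome) = @_;    # genome name => training GC fraction
        my $n = length $frag;
        my $k = ($frag =~ tr/GCgc//);      # G/C "success" count
        my ($best_genome, $best_ll);
        for my $genome (sort keys %gc_of_genome) {
            my $ll = gc_log_likelihood($n, $k, $gc_of_genome{$genome});
            ($best_genome, $best_ll) = ($genome, $ll)
                if !defined $best_ll or $ll > $best_ll;
        }
        return $best_genome;
    }

    # Hypothetical training GC fractions for three bins:
    my %gc = (bin0 => 0.38, bin1 => 0.52, bin2 => 0.61);
    print bin_fragment("ATGCGC" x 40, %gc), "\n";   # prints "bin2"

Both simplifications (dropping the shared coefficient and taking logarithms) are monotonic with respect to the comparison, so the winning genome is unchanged.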
3. Second approach: Binning by N-Markov Method

In addition to binning by G/C content, a 2nd order Markov approximation was used to find the most probable genome sequence for each unknown test fragment.

Preparing Probability Tables From Training Data: For each training sequence (representing a genome), the probability of each nucleotide (A, T, C, and G) appearing in the sequence was found. The probability P(X), where X is the nucleotide in question, was calculated by counting all occurrences of X in the sequence and dividing by the total number of elements. Next, the conditional probability of each nucleotide, given that its previous neighbor is known, was obtained for each training sequence. This was done by walking through the sequence, finding all 16 pairs (AA, AT, AC, AG, TA, ... etc.), and calculating the probability of each pair's appearance by dividing its total count by the total number of nucleotides. Similarly, the probabilities of all 64 possible triplets (ATC, AAA, ACG, ... etc.) were obtained from the training data. These values were stored in a hash for future reference and calculation.

(a) Markov Approximation for Test Data: Each test sequence was run through the program to find its best Markov approximation using the following equation:

    P(X1 X2 ... Xk) = P(X1) P(X2|X1) P(X3|X1 X2) P(X4|X2 X3) ... P(Xk|X(k-2) X(k-1))

where P(Xi|X(i-2) X(i-1)) is the probability of obtaining Xi as the ith term in a sequence given that the previous two terms are X(i-2) X(i-1):

    P(Xi|X(i-2) X(i-1)) = P(X(i-2) X(i-1) Xi) / P(X(i-2) X(i-1) *)

Essentially, the conditional probability is found by dividing the probability of the desired triplet, as obtained from the training sequence, by the probability that the first two characters precede any third nucleotide.

(b) Binning Test Sequences: After obtaining the Markov approximation of each test sequence under the data from each training sequence, the resulting probabilities were compared, and the test sequence was assigned to the genome sequence whose data gave the test sequence the highest probability.
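A minimal Perl sketch of this procedure, storing pair and triplet counts in hashes (the names train_model, score_sequence, %bi, and %tri are illustrative, not the project's code). Because P(triplet) and P(pair) are both counts divided by the same sequence length, the conditional probability reduces to count(triplet) / count(pair). The sketch works in log space and adds a pseudocount of 1 so that k-mers absent from the training data do not produce log(0); the boundary terms P(X1) and P(X2|X1) are omitted for brevity:

    use strict;
    use warnings;

    # Count all overlapping pairs and triplets in a training sequence.
    sub train_model {
        my ($seq) = @_;
        my (%bi, %tri);
        for my $i (0 .. length($seq) - 2) {
            $bi{ substr($seq, $i, 2) }++;
            $tri{ substr($seq, $i, 3) }++ if $i <= length($seq) - 3;
        }
        return { bi => \%bi, tri => \%tri };
    }

    # Log probability of a test sequence under a trained 2nd order model,
    # i.e. the sum of log P(Xi | X(i-2) X(i-1)) over the sequence.
    sub score_sequence {
        my ($seq, $model) = @_;
        my $logp = 0;
        for my $i (2 .. length($seq) - 1) {
            my $pair = substr($seq, $i - 2, 2);
            my $trip = substr($seq, $i - 2, 3);
            my $num  = ($model->{tri}{$trip} || 0) + 1;   # pseudocount smoothing
            my $den  = ($model->{bi}{$pair}  || 0) + 4;   # 4 possible next bases
            $logp += log($num / $den);
        }
        return $logp;
    }

    # Assign a read to the training genome that gives it the highest score.
    my %models = map { $_->[0] => train_model($_->[1]) }
                 ( [ bin0 => "ATATATATTA" x 50 ],
                   [ bin1 => "GCGCCGGCGC" x 50 ] );
    my $read = "GCGCGGCCGC" x 3;
    my ($best) = sort { score_sequence($read, $models{$b})
                        <=> score_sequence($read, $models{$a}) } keys %models;
    print "binned with $best\n";   # prints "binned with bin1"

The pseudocount is one simple smoothing choice; the project text does not specify how unseen triplets were handled.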
4. Benchmarking

For benchmarking, the first data set, with its varying GC contents, was used as a trial while the program was being developed. The program takes a directory with any number of FASTA files, each containing one genome sequence (a training sequence). It then creates a FASTA file for each training sequence called 'benchmarkbin#.txt' (the # represents the sequence). Reads are created from each training sequence by choosing randomly among 4 options for the length, varying from 80 to 864 nt, with a 20-nt overlap between consecutive reads. The lengths were chosen following Chatterji et al.; this approach is known to be feasible for k < 6 (k being the oligomer length for the N-Markov model). The reads are numbered in order of creation and appended to the file created for the training sequence from which they came. A file containing only the reads, entitled 'reads.txt', and a file containing all the training sequences, entitled 'dataset.txt', are also created by the program. These files were used to assess the binning methods.

Results

GC Content Results

The ability of the method to bin DNA fragments by common GC content was assessed with two benchmark datasets, in which the correct binning pattern was known for a set of test DNA fragments and training genome sequences. This known pattern was compared to the assignments made by the binning algorithm; the results are shown below.

Figure 1: An assessment of the GC-content-based algorithm (DATASET 1).

Figure 2: An assessment of the GC-content-based algorithm (DATASET 2).

As illustrated by Figure 1, the GC-content-based binning algorithm correctly assigned just over 50% of the test fragments in the first benchmark dataset to the genome sequences from which they were derived. Figure 2 indicates that the accuracy of the method was higher on the second benchmark dataset, where roughly 69% of the sequences were binned correctly.

Although the overall binning accuracy offers a rough idea of the efficacy of the algorithm, the accuracy of binning for each training sequence offers a more informative picture. As shown in Figures 1 and 2, the binning accuracy varies quite widely depending upon the training sequence with which the test fragments are to be associated. This trend is particularly evident in Figure 1, in which nearly 70% of the sequences in Bin 0 were binned accurately, while a mere 22% of the sequences in Bin 1 were placed with the correct training sequence.

2-Markov Results

Figure 3: An assessment of the 2nd order Markov algorithm (DATASET 1). [Bar chart: percentage of test fragments correctly binned for Bin 0 through Bin 3 and in total; per-bin values 0.913, 0.857, 0.731, and 0.667, with a total of 0.878.]

Figure 4: An assessment of the 2nd order Markov algorithm (DATASET 2). [Bar chart: percentage of test fragments correctly binned; Bin 0 = 0.88, Bin 1 = 0.731, total = 0.749.]

In the first dataset, with 4 bins, the 2nd order Markov approximation had an accuracy of 87.7%. In the second dataset, the accuracy decreased slightly, to approximately 74%. The disparity in accuracy could result from the fact that fewer sequences were binned in the first run. The time required for the program to run on the second dataset was on the order of minutes, much longer than for the first run.

A 3rd order Markov model was designed and tested in pursuit of better performance. However, the Markov approximations at that level were too small to be compared, and the program rounded them all off to zero.
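The zeros observed for the 3rd order model are the signature of floating-point underflow: a product of hundreds of per-base conditional probabilities, each less than 1, quickly falls below the smallest value a double can represent. A standard remedy, not described in the original program but sketched here in Perl, is to sum the logarithms of the probabilities instead of multiplying them; since log is monotonic, the genome with the highest log sum is also the one with the highest product, so binning decisions are unchanged:

    use strict;
    use warnings;

    # Multiplying many small probabilities underflows to exactly 0.0 ...
    my $product = 1;
    $product *= 0.25 for 1 .. 1000;      # ~1e-602, far below ~1e-308
    print "product: $product\n";         # prints 0

    # ... whereas the equivalent sum of logs stays comfortably in range.
    my $log_sum = 0;
    $log_sum += log(0.25) for 1 .. 1000;
    printf "log sum: %.2f\n", $log_sum;  # about -1386.29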
Analysis

GC Content

The correlation between the relative GC contents of the training sequences and the rate of false-positive results within a bin suggests that there is an optimal set of training sequences for the metagenomic data collected from a particular environmental niche. Such a set would be able to provide a coarse differentiation of DNA fragments obtained from organisms with diverse GC content values.

2-Markov

The overall accuracy was higher on dataset 1, which had more training sequences. The percentage accuracies within each bin of each dataset were roughly uniform. The major difference between the accuracies calculated for datasets 1 and 2 could simply be due to the randomization of the test sequences. If multiple trials were run on both training datasets, each with differently randomized test sequences, the data from the two sets could be compared more reliably.

Comparison

Figure 5: G/C and Markov accuracy comparison for DATASET 1. [Bar chart: GC content (red) vs. 2-Markov (blue) accuracy for Bin 0 through Bin 3 and in total.]

Figure 6: G/C and Markov accuracy comparison for DATASET 2. [Bar chart: GC content (red) vs. 2-Markov (blue) accuracy for Bin 0, Bin 1, and in total.]

The Markov model performed with higher accuracy on both datasets. Bin 0 and Bin 1 in dataset 1, whose GC contents were quite similar, were binned with markedly greater accuracy by the Markov model. This can be attributed to the fact that the Markov algorithm takes into account the probability of every nucleotide in the training and test data, whereas the G/C-content algorithm reduces both the training and the test data to an average G/C content. Since G/C content varies widely within a genome (it changes between coding and non-coding regions, for instance), the calculated probability depends strongly on which region the test sequence represents. Overall, therefore, the Markov model is more representative of the entire genome being compared.

Conclusion

The accuracy and time complexity of two binning methods, GC content and a 2nd order Markov model, were assessed on a low-complexity community: a sample of the microbes in the acid mine drainage of the Richmond Mine. The 2nd order Markov model binning method proved more accurate for this type of community. Regarding time complexity, the GC-content method was slower, which could be due to various factors, such as the general programming approach.

Some future modifications to consider for a more accurate and generally more efficient binning method are to lower the time complexity of the GC-content method, and probably of the 2nd order Markov method as well. A threshold on the GC-content divergence that maximizes binning accuracy, as well as a threshold on the minimum read length required for accurate binning, could also be implemented. In the case of the N-Markov model, higher orders could be assessed and the best order for the type of data under study could be selected. In this study, the coding and non-coding regions of the genomes were not differentiated; this distinction could affect GC-content binning, given that GC content varies between the two types of regions. After the two binning methods have been revised separately, they could be combined: GC content could first be used to make broad distinctions among groups, and the N-Markov binning method could then be applied within each group, which would likely increase accuracy.

WORK BREAKDOWN

Programming:
Version 1 – Ana, Nina, and Suruchi
Data input, data output, benchmarking – Nina & Ana
Version 2 – Manoshi
Data for benchmarking (test and training files) – Ana & Nina
Debugging – All (all night)

Minireport:
Binning by GC content – Manoshi
Binning by 2-Markov – Suruchi

WORKS CITED

Acid Mine Drainage Metagenome. Publication data. CAMERA Calit2. Accessed Dec. 1, 2008. <http://web.camera.calit2.net/cameraweb/gwt/org.jcvi.camera.web.gwt.download.DownloadByPubPage/DownloadByPubPage.oa?projectSymbol=CAM_PROJ_AcidMine>

Chatterji, Sourav, Ichitaro Yamazaki, Zhaojun Bai, and Jonathan A. Eisen. "CompostBin: A DNA Composition-Based Algorithm for Binning Environmental Shotgun Reads." University of California, Davis, and the Joint Genome Institute, Walnut Creek, CA. 2008.

"Genome." NCBI. Revised Mar. 31, 2008. Accessed Dec. 5, 2008. <http://www.ncbi.nlm.nih.gov/sites/entrez>

Huson, Daniel H. "Computational Aspects of Metagenome Analysis." Universitat Tubingen. 2008. Accessed Dec. 14, 2008. <http://www-ab.informatik.uni-tuebingen.de/talks/pdfs/ComputationalMetagenomicSanDiego2008.pdf>

Joint Genome Institute. "DOE JGI sets 'gold standard' for metagenomic data analysis." Department of Energy. May 14, 2007.

Nasser, Sara, Adrienne Breland, Frederick C. Harris, Jr., and Monica Nicolescu. "A fuzzy classifier to taxonomically group DNA fragments within a metagenome." University of Nevada, Reno. 2008.

Mavromatis, K., et al. "Use of simulated data sets to evaluate the fidelity of metagenomic processing methods." Nature Methods. 2007.