Creation and Critical Evaluation of Two Methods for Taxonomic Classification of Metagenomic
Sequences in Perl on the Data with Known GC Content and on Fully-Sequenced Genome
Suruchi Anand, Manoshi Datta, Ana Daniela Guajardo, Nina Revko
While genomics is the study of the genome sequences of individual organisms (which are usually
cultivable and easily accessible), metagenomics studies the genetic information of whole
communities of species. Metagenomic sequencing yields huge amounts of data recovered directly
from the environment; that genetic material undergoes further processing, starting with the assembly
of sequence reads and followed by gene prediction and functional annotation. Metagenomics therefore
relies heavily on the accuracy of the protein databases, which were built from genomic studies,
and it is unknown how their internal bias (they were assembled from information about common
organisms) influences the analysis of real data. The presence of multiple organisms in a sample
creates the need for an additional step: predicting the phylogenetic origins of the processed data
(binning). Analysis of metagenomic sequences without taxonomic assignment will always give biased
results, which is why accurate binning is an essential pre-processing step that reduces the final error.
The phylogenetic origins of the test sequences are predicted mainly based on compositional statistics
and sequence similarity. In this project, we developed two methods: binning by GC content and
binning with Markov models. We used benchmarking to critically evaluate and compare the ability of
our methods to classify sequence fragments accurately into their corresponding species populations.
The tests were performed on real sequences from different backgrounds whose GC content is known,
and on short reads derived from recently completed genomes from the same environmental niche.
During the course of metagenomics experiments, thousands of individual DNA fragments are
amassed from within a particular environmental niche. In order to make this vast amount of
information more tractable, these fragments must first undergo a process known as “binning,” in
which they are categorized based upon the organisms from which they originated. Several different
binning algorithms have been employed, the majority of which attempt to distinguish the origins of
various fragments based upon their DNA “signatures,” or characteristic compositional patterns.
1. Choosing sequences.
During the development of the binning methods, two sets of genomic data were used for assessment.
In the first trials, 10 mitochondrial DNA sequences with lengths ranging from ~6,000 to ~30,000 nt
were chosen from different kingdoms. To better assess the GC content analysis, the sequences all had
a minimum variation of 3% in GC content, as stated by NCBI. For a more realistic approach, UBA Acid
Mine Drainage Biofilm Metagenome WGS sequences, obtained from CAMERA Calit2 [1], were subsequently
used to test the binning methods. From these shotgun sequences, 43 were semi-randomly picked, with
the aim of having sequences of varying sizes so that different tests could be carried out. Out of
these 43 sequences, two groups were made: one containing 2 sequences of lengths ~37,000 nt and
~97,700 nt, and a second containing 4 sequences varying from ~8,600 nt to ~22,000 nt. These two
groups were used to carry out the main comparison between the GC content binning method and the
2-Markov model binning method.
2. First approach: Binning by GC content
Although all DNA molecules are composed of the four basic nucleotides – adenine (A), thymine
(T), cytosine (C), and guanine (G) – the distribution of these nucleotides within a genome is far
from uniform. Indeed, classes of organisms can be grouped based upon the so-called “GC content” of
their genomes, a property that is highly variable between species. These differences in GC content
between taxonomic groups can be attributed to differences in selective pressure, biases in mutation,
and other environmental factors.
In this study, a binning algorithm was developed to correlate DNA fragments with the genomes from
which they most likely originated. Using Bayesian inference methods, the probability that a given
DNA fragment arose from a particular genome sequence was calculated. The fragment was binned
with the genome sequence that yielded the highest calculated probability value.
Detailed Probability Calculations
Note: In these calculations, the fragment that is to be binned is denoted by the capital letter “S”. The
genome sequence with which it is to be categorized is represented by the letter “C”.
(a) Manipulation of Bayes’ Theorem:
In general, the probability that a particular DNA fragment is associated with a given genome in the
training data set can be calculated with Bayes’ Theorem:
P(C1|S) = P(S|C1) P(C1) / P(S)
Here, P(C1|S) is the posterior probability that the DNA fragment (S) is binned with a given
training genome sequence (C1), P(S|C1) is the likelihood of the fragment given that genome, and
P(C1) is the prior probability. In order to simplify the analysis, it was assumed that all prior
probabilities were equal (i.e., P(C1) = P(C2) = … = P(Cn)). This assumption is valid for the data
set, since, a priori, the DNA fragment has an equal probability of being binned with any one of the
genome sequences in the training set. The priors therefore cancel, and the probability equation can
be rewritten as a simple fractional likelihood calculation:
P(C1|S) = P(S|C1) / [P(S|C1) + P(S|C2) + … + P(S|Cn)]
(b) Calculation of likelihood values from binomial probability distribution
In order to solve for the posterior probability, the likelihood that a given fragment would be
associated with a genome in the training sequence was determined. These probabilities were assumed
to follow a binomial probability distribution, since each individual nucleotide can either be classified
as a G/C (success) or an A/T (failure) event. The binomial probability is given by:
P(S|C) = [n! / (k! (n-k)!)] p^k (1-p)^(n-k)
In this equation, n = length of the fragment, k = number of G/C events in the test fragment, and p =
G/C content of the training genome of interest.
(c) Bin the test fragment with the genome that yields the highest posterior probability.
Once these likelihood calculations were made, the posterior probability values for the test fragment
of interest with each training genome sequence were calculated with Bayes’ Theorem. The maximum
posterior probability value was found, and the test fragment was binned with the genome sequence
that yielded this value.
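Steps (a) to (c) can be sketched as follows. This is a minimal illustration in Python, not the Perl program written for this project; the function names (gc_fraction, log_binomial_likelihood, bin_by_gc) and the toy sequences are hypothetical. Two choices here are assumptions that the text above does not state: the likelihoods are computed as logarithms, and p is clamped away from 0 and 1 so that the logarithm is always defined.

```python
import math


def gc_fraction(seq):
    """Fraction of G/C among the A/C/G/T characters of a sequence."""
    seq = seq.upper()
    gc = sum(1 for base in seq if base in "GC")
    acgt = sum(1 for base in seq if base in "ACGT")
    return gc / acgt if acgt else 0.0


def log_binomial_likelihood(n, k, p):
    """log of C(n, k) * p^k * (1 - p)^(n - k)."""
    # Assumption: clamp p so that log(0) never occurs.
    p = min(max(p, 1e-9), 1.0 - 1e-9)
    log_coeff = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_coeff + k * math.log(p) + (n - k) * math.log(1.0 - p)


def bin_by_gc(fragment, training):
    """Return (best genome name, posterior per genome) assuming equal priors."""
    fragment = fragment.upper()
    n = sum(1 for base in fragment if base in "ACGT")  # fragment length
    k = sum(1 for base in fragment if base in "GC")    # number of G/C events
    log_liks = {
        name: log_binomial_likelihood(n, k, gc_fraction(genome))
        for name, genome in training.items()
    }
    # Equal priors cancel, so the posterior is likelihood / sum of likelihoods.
    # Subtracting the maximum first keeps the exponentials in range.
    top = max(log_liks.values())
    weights = {name: math.exp(v - top) for name, v in log_liks.items()}
    total = sum(weights.values())
    posteriors = {name: w / total for name, w in weights.items()}
    best = max(posteriors, key=posteriors.get)
    return best, posteriors


# Toy training genomes with 20% and 80% GC content (made-up data).
training = {"low": "ATATATATGC" * 10, "high": "GCGCGCGCAT" * 10}
best, posteriors = bin_by_gc("GCGGCCGCAT", training)
```

Because the binomial coefficient depends only on n and k, it is the same for every training genome and does not change which genome wins; it is kept here only to mirror the equation.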
3. Second approach: Binning by N-Markov Method
In addition to binning by G/C content, a 2nd Order Markov Approximation was used to find the most
probable genome sequence to which the unknown test fragments were associated.
Preparing Probability Tables From Training Data:
For all the training data (sequences representing the genomes), the probability of each nucleotide (A,
T, C and G) appearing in the sequence was found. The probability P(X) where X was the
representative nucleotide was calculated by counting all the X's present in the sequence and dividing
by the total number of elements.
Next, the conditional probability of each nucleotide given its preceding neighbor was obtained
for each training sequence. This was done by scanning the sequence for all 16 pairs (AT, AA, AC,
AG, TA, etc.) and estimating the probability of each pair by dividing its count by the total number
of nucleotides. Similarly, the probabilities of each triplet (ATC, AAA, ACG, etc.) were obtained
from the training data for a total
of 64 different combinations. This data was represented in a hash for future reference and
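The table-building step can be sketched as follows. This is an illustrative Python version, not the project's Perl code; the names kmer_frequencies and build_tables are hypothetical, and a Python dict stands in for the Perl hash. Following the description above, each count is divided by the total number of nucleotides in the sequence.

```python
from collections import Counter

ALPHABET = set("ACGT")


def kmer_frequencies(seq, k):
    """Frequency of every k-mer, as count / total number of nucleotides."""
    seq = seq.upper()
    counts = Counter(
        seq[i:i + k]
        for i in range(len(seq) - k + 1)
        if set(seq[i:i + k]) <= ALPHABET  # skip windows with ambiguous bases
    )
    total = len(seq)
    return {kmer: count / total for kmer, count in counts.items()}


def build_tables(seq):
    """Single-nucleotide, pair and triplet tables for one training sequence."""
    return {k: kmer_frequencies(seq, k) for k in (1, 2, 3)}


tables = build_tables("AACG")  # toy sequence, for illustration only
```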
(a) Markov Approximation for Test Data:
Each test sequence was run through the program in order to find its best Markov approximation using
the following equation:
P(X1X2...Xk) = P(X1) P(X2|X1) P(X3|X1X2) P(X4|X2X3) ... P(Xk|X(k-2)X(k-1))
where P(Xi|X(i-2)X(i-1)) is the probability of obtaining Xi as the ith term in a sequence given that
the previous two terms are X(i-2)X(i-1):
P(Xi|X(i-2)X(i-1)) = P(X(i-2)X(i-1)Xi) / P(X(i-2)X(i-1)*)
Essentially, this conditional probability is found by dividing the probability of the desired
triplet, as estimated from the training sequence, by the probability that its first two characters
are followed by any third nucleotide.
(b) Binning Test Sequences:
After obtaining the Markov approximation of each test sequence using the data from each training
sequence, the resulting probabilities were compared and the test sequence was assigned to the genome
sequence whose data gave the test sequence the highest probability.
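The scoring and binning steps can be sketched as below, again in Python rather than the project's Perl, with hypothetical function names and toy sequences. Two details are assumptions that the report does not describe: the probabilities are summed as logarithms instead of multiplied, and a pseudocount of 1 is added to every count so that a triplet absent from the training sequence does not yield a probability of zero.

```python
import math
from collections import Counter

BASES = "ACGT"


def train(genome):
    """Counts of all 1-, 2- and 3-mers in a training sequence."""
    genome = genome.upper()
    return {
        k: Counter(genome[i:i + k] for i in range(len(genome) - k + 1))
        for k in (1, 2, 3)
    }


def cond_log_prob(counts, context, base, pseudo=1.0):
    """log P(base | context), i.e. count(context+base) / count(context+*)."""
    k = len(context) + 1
    numerator = counts[k][context + base] + pseudo
    denominator = sum(counts[k][context + b] for b in BASES) + pseudo * len(BASES)
    return math.log(numerator / denominator)


def log_score(fragment, counts):
    """log of P(X1) P(X2|X1) P(X3|X1X2) ... P(Xk|X(k-2)X(k-1))."""
    fragment = fragment.upper()
    score = 0.0
    for i, base in enumerate(fragment):
        context = fragment[max(0, i - 2):i]  # up to two preceding nucleotides
        score += cond_log_prob(counts, context, base)
    return score


def bin_fragment(fragment, training):
    """Assign the fragment to the training genome with the highest score."""
    scores = {name: log_score(fragment, train(g)) for name, g in training.items()}
    return max(scores, key=scores.get), scores


# Toy training sequences (made-up data).
training = {
    "at_rich": "ATATATTATAATATAT" * 20,
    "gc_rich": "GCGCGGCGCCGCGCGC" * 20,
}
best_gc, scores_gc = bin_fragment("GCGCCGCG", training)
best_at, scores_at = bin_fragment("TATATAAT", training)
```

Comparing log scores gives the same ranking as comparing the probabilities themselves, because the logarithm is monotonic.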
4. Benchmarking
For benchmarking, the first data set, created with varying GC contents, was used as a trial during
development of the program. The program takes a directory containing any number of FASTA files, each
containing one genome sequence (training sequence). It then creates a FASTA file for each training
sequence called ‘benchmarkbin#.txt’ (the # represents the sequence number). Reads are created from
each training sequence by choosing at random among 4 options for the length, varying from 80 to
864 nt, with a 20 nt overlap between reads. The lengths were chosen using the 42 notation as stated
by Chatterji et al.; this method is known to be feasible for k < 6 (k being the oligomer length for
N-Markov). The reads are numbered in order of creation and appended to the file created for the
training sequence they came from. A file containing solely the reads, entitled ‘reads.txt’, and a
file containing all the training sequences, entitled ‘dataset.txt’, are also created by the program.
These files were used to assess the binning methods.
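The read-generation step might look like the sketch below. It is an illustration in Python, not the project's Perl program. Only the 80 nt and 864 nt limits and the 20 nt overlap come from the text; the two intermediate lengths (200 and 400), the fixed random seed, and the function name make_reads are assumptions made for the example. File output is omitted.

```python
import random


def make_reads(genome, lengths=(80, 200, 400, 864), overlap=20, seed=0):
    """Cut a genome into consecutive reads that overlap by `overlap` nt."""
    rng = random.Random(seed)  # fixed seed so the example is repeatable
    reads = []
    start = 0
    while start < len(genome):
        length = rng.choice(lengths)          # one of the four length options
        reads.append(genome[start:start + length])
        if start + length >= len(genome):     # reached the end of the genome
            break
        start += length - overlap             # next read re-uses the last 20 nt
    return reads


# Toy genome of random nucleotides (made-up data).
toy_rng = random.Random(1)
genome = "".join(toy_rng.choice("ACGT") for _ in range(3000))
reads = make_reads(genome)
```

The final read may be shorter than the shortest length option, since it simply runs to the end of the sequence.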
GC Content Results
The ability of the method to bin DNA fragments based upon common GC content was assessed with
two different benchmark datasets, in which the correct binning pattern was known for a set of test
DNA fragments and training genome sequences. This known pattern was compared to those assigned
by the binning algorithm, and these results are shown below:
Figure 1: An assessment of the GC-content-based algorithm
Figure 2: An assessment of the GC-content-based algorithm
As illustrated by Figure 1, the GC-content-based binning algorithm was able to correctly assign just
over 50% of the test fragments in the first benchmark dataset to the correct genome sequences from
which they were derived. Figure 2 indicates that the accuracy of the method was higher for the
second benchmark dataset on which the method was tested, and roughly 69% of the sequences were
binned correctly. Although the overall binning accuracy offers a rough idea as to the efficacy of the
algorithm, a look at the accuracy of binning for each of the training sequences offers a more
informative picture. As shown in Figures 1 and 2, the binning accuracy varies quite widely,
depending upon the training sequence with which the test fragments are to be associated. This trend
is particularly evident in Figure 1, in which nearly 70% of the sequences in Bin 0 were binned
accurately, while a mere 22% of the sequences in Bin 1 were placed with the correct training
sequence.
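The overall and per-bin accuracies discussed here can be computed from the known and assigned bins as in the following sketch. It is written in Python for illustration; the function name per_bin_accuracy and the toy labels are hypothetical and are not the benchmark data.

```python
from collections import Counter


def per_bin_accuracy(true_bins, predicted_bins):
    """Return (overall accuracy, accuracy per true bin)."""
    totals = Counter(true_bins)
    correct = Counter(t for t, p in zip(true_bins, predicted_bins) if t == p)
    per_bin = {b: correct[b] / totals[b] for b in totals}
    overall = sum(correct.values()) / len(true_bins)
    return overall, per_bin


# Made-up labels: three fragments from Bin 0 and two from Bin 1.
overall, per_bin = per_bin_accuracy([0, 0, 0, 1, 1], [0, 0, 1, 1, 0])
```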
2-Markov Results:
Figure 3: An assessment of the 2nd Order Markov algorithm (percentage of test fragments correctly binned using 2nd Order Markov; Bins 0 to 3)
Figure 4: An assessment of the 2nd Order Markov algorithm (percentage of test fragments correctly binned using 2nd Order Markov; Bins 0 and 1)
In the first dataset with 4 bins, the 2nd order Markov approximation had an accuracy of 87.7%. In
the second dataset, the accuracy decreased slightly to approximately 74%. The disparity in the
accuracy could have resulted from the fact that fewer sequences were binned in the first run. The
time required for the program to run on the second dataset was on the order of minutes, much
longer than for the first run.
A 3rd Order Markov model was designed and tested for greater efficiency of the program. However,
the probabilities computed at that order were too small to be represented, and the program rounded
them all to zero, so they could not be compared.
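The rounding to zero is consistent with floating-point underflow when many small probabilities are multiplied. One common remedy, which was not used in this project, is to add logarithms instead. The Python sketch below illustrates the effect with made-up per-nucleotide probabilities; the function names are hypothetical.

```python
import math


def naive_product(probs):
    """Multiply the probabilities directly, as in the Markov equation."""
    product = 1.0
    for p in probs:
        product *= p
    return product


def log_likelihood(probs):
    """Sum of log-probabilities: same ranking, no underflow."""
    return sum(math.log(p) for p in probs)


# Two hypothetical models scoring the same 1000-nt read.
model_a = [0.25] * 1000
model_b = [0.30] * 1000

product_a = naive_product(model_a)   # underflows to 0.0
product_b = naive_product(model_b)   # also 0.0, so the two cannot be compared
log_a = log_likelihood(model_a)
log_b = log_likelihood(model_b)      # larger than log_a, so model_b wins
```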
GC Content
The correlation between the relative GC contents of the training sequences and the rate of false
positive results within a bin suggests that there is an optimal set of training sequences for the
metagenomics data collected from a particular environmental niche. Such a set would be able to
provide a coarse differentiation of DNA fragments obtained from organisms with diverse GC content.
The overall accuracy was higher using dataset 1, which had more training sequences. The
percentage accuracy within each bin of each dataset was roughly uniform. The major difference
in accuracy between datasets 1 and 2 could simply be due to the randomization of the
test sequences. If multiple trials were done using both training datasets but with different
randomized test sequences, the data from the two sets could be better compared.
Figure 5: G/C and Markov Accuracy Comparison for DATASET 1 (G/C content in red, 2-Markov in blue; Bins 0 to 3)
Figure 6: G/C and Markov Accuracy Comparison for DATASET 2 (G/C content in red, 2-Markov in blue; Bins 0 and 1)
The Markov model performed with higher accuracy on both datasets. Bin 0 and Bin 1 in dataset 1,
whose GC contents were quite similar, were binned with markedly greater accuracy using the
Markov model. This can be attributed to the fact that the Markov algorithm takes into account the
probability of every nucleotide in the training and test data, whereas the G/C content algorithm
uses only the average G/C content of the training data and test data. Since the G/C content varies
widely within a genome (for example between coding and non-coding regions), the probability
calculated by the G/C method depends heavily on which region the test sequence represents. Overall,
therefore, the Markov model is more representative of the entire genome being compared.
The accuracy and time complexity of two binning methods, GC content and the 2nd order Markov
model, were assessed on a low-complexity community: a sample of the microbes in the acid mine
drainage of the Richmond Mine. The 2nd order Markov model binning method was found to be more
accurate for this type of community. Regarding time complexity, the GC-content method was slower,
which could have several causes, such as the way the program was written.
Future modifications to consider for a more accurate and efficient binning method include
lowering the time complexity of the GC-content method and possibly of the 2nd order Markov method.
A threshold on GC content divergence that maximizes binning accuracy, as well as a threshold on
the minimum read length required for accurate binning, could also be implemented. For the N-Markov
model, higher orders could be assessed and the best one for the type of data under study selected.
In this study the coding and non-coding regions of the genomes were not differentiated; this
difference can affect GC content binning, because GC content varies between these two types of
regions. After the two binning methods have been revised separately, they could be combined: GC
content would first be used to form broad groups, and the N-Markov model binning method would then
be applied within each group, which may increase accuracy.
Version1 – Ana, Nina, and Suruchi
Data Input, Data Output, Benchmarking –
Nina & Ana
Version2 – Manoshi
Data for Benchmarking
Test and Training Files – Ana & Nina
Debugging – All (All night)
Binning by GC Content – Manoshi
Binning by 2-Markov – Suruchi
