Neural Network Based Phylogenetic Analysis
Thilini Halgaswaththa1, Dr. Ajantha Athukorala1, Dr. Jagath Weerasena2 & Dr. Mahen Jayawardena1
1. University of Colombo School of Computing, Sri Lanka; 2. Institute of Biochemistry, Molecular Biology and
Biotechnology, University of Colombo, Sri Lanka
thilini.hal@gmail.com, aja@ucsc.cmb.ac.lk, jagath@ibmbb.cmb.ac.lk, mcj@ucsc.cmb.ac.lk
If someone finds an unknown bone fragment, it is
important to understand which category that fragment
belongs to. The standard way would be to extract a common
gene from it, construct a phylogenetic tree, and then
infer the category from the tree. Phylogeny is thus one of
the fundamental tools used to understand the evolutionary
relationships between taxonomic groups. But this method
involves a series of tasks, such as multiple sequence
alignment, and is thus complex.
Phylogeny has become even more important with the rapid
growth of DNA and protein sequence data, along with the
visual impact of phylogenetic trees. It helps in
understanding the evolutionary relationships among
sequences and in assigning those sequences to categories
using a phylogenetic tree. Although phylogenetic trees are
very important, they have some limitations: the quality of
a tree depends on the quality of the sequence data and of
the multiple sequence alignment; an appropriate model of
nucleotide substitution must be chosen; and it is difficult
to choose an optimal method for generating the tree. Given
an unknown sequence, it is difficult to predict its category
without performing multiple sequence alignment and building
a phylogenetic tree, and this requires considerable
computational time.
In our research we have implemented a neural network as an
alternative to the standard phylogenetic tree approach. Our
goal is to predict the category of a DNA sequence using a
neural network based on features extracted from DNA
sequences.
The approach we have used is to first identify the main
categories of the sequences using a phylogenetic tree and
then train the neural network to recognise these target
categories. The trained neural network can then identify the
category of an unknown DNA sequence. We used transferrin
sequences as the primary data set. Although, as a rule of
thumb, the number of training examples for a neural network
should be about 10 times the number of features, it was
difficult to find such a large number of sequences for our
system. We therefore used the maximum number of sequences
available and used mitochondrial DNA with 400 sequences as
the secondary data set for the experiment. We also applied
the method to the minimal data set (the transferrin
sequences) and found that it works there as well. We divided
the sequences into testing and training data for the neural
network.
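
The sizing rule of thumb and the training/testing division described above can be sketched as follows. This is a minimal illustration only: the function names, the 20% test fraction, and the stratified (per-category) split are assumptions, since the abstract does not state how the data were divided.

```python
import random
from collections import defaultdict


def min_training_size(n_features, factor=10):
    """Rule of thumb from the text: about `factor` training examples per feature."""
    return n_features * factor


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split example indices into (train, test), category by category.

    The 20% default and the per-category split are assumptions made for
    this sketch; they are not taken from the abstract.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    train, test = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        # Keep at least one test example for every category.
        n_test = max(1, round(len(indices) * test_fraction))
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


print(min_training_size(64))  # 640 for a 64-dimensional input vector
print(min_training_size(20))  # 200 for a 20-dimensional input vector
```

Under this rule a 64-feature input would call for about 640 training sequences and a 20-feature input for about 200, which is why a 400-sequence data set falls short for the larger vector.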
We used the tri-gram method as the sequence-encoding scheme
and used various codon-searching mechanisms to obtain the
codon content of the whole sequence; this is the feature
extraction step that prepares the input vector for the
neural network. We chose the tri-gram method because it has
biological meaning: proteins are built from amino acids,
each encoded by three nucleotide bases. There are 20
different amino acids, and each amino acid is encoded by a
codon, a series of three adjacent bases in a DNA molecule.
With four bases there are 4^3 = 64 different codon
combinations. We take the frequency of each codon and use it
to generate a 64-dimensional input vector for the neural
network. We also used a second feature extraction method,
which takes the content of each of the 20 amino acids and
converts it to probabilities, giving a 20-dimensional
vector. We used both probabilistic neural networks and
feed-forward neural networks as supervised neural networks.
According to our results, the probabilistic neural network
was the most suitable supervised network type for this kind
of analysis.
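
The two feature vectors described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the function names, the choice between overlapping and in-frame tri-grams, the skipping of tri-grams that contain ambiguous bases, and the use of the standard genetic code are all assumptions. Mitochondrial genomes generally use variant codes, so a different table would be needed for them.

```python
from itertools import product

# All 4**3 = 64 tri-grams over the four DNA bases, in a fixed order.
CODONS = ["".join(p) for p in product("ACGT", repeat=3)]
CODON_INDEX = {codon: i for i, codon in enumerate(CODONS)}

# Standard genetic code, codons ordered with bases T, C, A, G ('*' = stop).
# Assumption: the abstract does not say which translation table was used.
_TCAG_CODONS = ["".join(p) for p in product("TCAG", repeat=3)]
_STANDARD_TABLE = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TO_AA = dict(zip(_TCAG_CODONS, _STANDARD_TABLE))
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 amino acids, one-letter codes


def codon_frequency_vector(seq, overlapping=True):
    """Return a 64-dimensional vector of relative tri-gram frequencies.

    overlapping=True slides the window one base at a time; False reads
    non-overlapping codons from position 0. Tri-grams containing anything
    other than A, C, G, T are skipped.
    """
    seq = seq.upper()
    step = 1 if overlapping else 3
    counts = [0] * len(CODONS)
    for i in range(0, len(seq) - 2, step):
        index = CODON_INDEX.get(seq[i:i + 3])
        if index is not None:
            counts[index] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(CODONS)


def amino_acid_vector(seq):
    """Return a 20-dimensional vector of amino acid proportions.

    The sequence is read as non-overlapping codons from position 0 and
    translated with the standard code; stop codons are ignored.
    """
    seq = seq.upper()
    counts = dict.fromkeys(AMINO_ACIDS, 0)
    for i in range(0, len(seq) - 2, 3):
        amino_acid = CODON_TO_AA.get(seq[i:i + 3])
        if amino_acid in counts:
            counts[amino_acid] += 1
    total = sum(counts.values())
    return [counts[a] / total if total else 0.0 for a in AMINO_ACIDS]
```

Dividing counts by their total makes sequences of different lengths comparable, which matters when the same network sees both short and long genes.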
Our results suggest that the above experiment can be carried
out with a neural network, without multiple sequence
alignment or a phylogenetic tree, and that this approach is
faster and achieves high accuracy with a suitable
sequence-encoding scheme. When we gave unknown DNA sequences
to the trained neural network, it returned the same
categories that we obtained from the phylogenetic analysis.
We can therefore assume that, with this approach, DNA
sequences can be assigned to the main target categories, and
that the most suitable category for an unknown sequence
related to the training sequences can be found in less time
than with phylogenetic analysis.
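
The classification of an unknown sequence by a trained probabilistic neural network can be sketched as follows. This is a generic textbook formulation (a Gaussian Parzen-window estimate per category, with the highest-scoring category chosen), not the authors' implementation; the class name, method names, and the smoothing parameter `sigma` are assumptions, as the abstract gives no architectural details.

```python
import math


class ProbabilisticNeuralNetwork:
    """Minimal probabilistic neural network (Parzen-window classifier).

    Hypothetical sketch: `sigma` is the Gaussian kernel width and would
    need tuning for real feature vectors.
    """

    def __init__(self, sigma=0.1):
        self.sigma = sigma
        self.patterns = {}  # category label -> list of training vectors

    def fit(self, vectors, labels):
        # "Training" only stores each vector under its category.
        self.patterns = {}
        for vector, label in zip(vectors, labels):
            self.patterns.setdefault(label, []).append(list(vector))
        return self

    def class_scores(self, x):
        # Summation layer: average Gaussian kernel response per category.
        scores = {}
        for label, stored in self.patterns.items():
            total = 0.0
            for pattern in stored:
                squared_distance = sum((a - b) ** 2 for a, b in zip(x, pattern))
                total += math.exp(-squared_distance / (2 * self.sigma ** 2))
            scores[label] = total / len(stored)
        return scores

    def predict(self, x):
        # Decision layer: return the category with the largest score.
        scores = self.class_scores(x)
        return max(scores, key=scores.get)
```

In use, each training sequence would first be converted to its feature vector, the network fitted on those vectors with the categories read from the phylogenetic tree as labels, and `predict` called on the feature vector of the unknown sequence.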
According to our results, the most suitable feature
extraction method is the tri-gram sequence-encoding scheme.
We also tried to understand why the neural network approach
misclassified some sequences. This method has some
disadvantages, however: it makes the relationships among
species difficult to understand, and it requires more data
than a phylogenetic tree approach.
In our analysis we used a supervised neural network because
we obtained the target vector for the network from the
phylogenetic tree approach. As an extension of this
research, the network could instead be built as an
unsupervised neural network that clusters the input
sequences. We also used only DNA sequences in the
experiments. The work could be extended to protein sequences
to test whether this type of experiment is feasible for
them. Protein sequences are built from 20 different amino
acids, so the task becomes much more complicated than with
DNA sequences. As further study, a suitable
sequence-encoding scheme is needed to extract features from
protein sequences. When selecting the training data set, the
number of training sequences should be about 10 times the
number of features. As a further extension, to achieve more
accurate results, suitable sequences could be selected to
increase the amount of training data and the experiment
repeated.