Neural Network Based Phylogenetic Analysis
Thilini Halgaswaththa¹, Dr. Ajantha Athukorala¹, Dr. Jagath Weerasena² & Dr. Mahen Jayawardena¹
1. University of Colombo School of Computing, Sri Lanka
2. Institute of Biochemistry, Molecular Biology and Biotechnology, University of Colombo, Sri Lanka
thilini.hal@gmail.com, aja@ucsc.cmb.ac.lk, jagath@ibmbb.cmb.ac.lk, mcj@ucsc.cmb.ac.lk

If someone finds an unknown bone fragment, it is important to determine which category that fragment belongs to. The standard way would be to extract a common gene from it, build a phylogenetic tree, and read the category from the tree. Phylogeny is thus one of the fundamental tools for understanding the evolutionary relationships between taxonomic groups. But this method involves a series of tasks, such as multiple sequence alignment, and is thus complex.

Phylogeny has become even more important with the rapid growth of DNA and protein sequence data, along with the visual impact of phylogenetic trees. It helps in understanding the evolutionary relationships among sequences and in assigning those sequences to categories. Although phylogenetic trees are very important, they have limitations: the quality of the tree depends on the quality of the sequence data and of the multiple sequence alignment, an appropriate nucleotide substitution model must be chosen, and it is difficult to choose an optimal method for generating the optimal tree. Given an unknown sequence, it is difficult to predict its category without performing multiple sequence alignment and building a phylogenetic tree, and this requires considerable computational time. In our research we have implemented a neural network as an alternative method that does not use the standard phylogenetic tree approach.
Our goal is to predict the category of a DNA sequence using a neural network, based on features extracted from the sequence. Our approach is to first identify the main category of each data sequence using the phylogenetic tree, and then train the neural network to recognize those target categories. The trained neural network can then identify the category of an unknown DNA sequence. We used the Transferring sequences as the primary data set. Although, as a rule of thumb, the number of training samples for a neural network should be about ten times the number of features, it was difficult to find that many sequences for our system. We therefore used as many sequences as we could obtain, and used a mitochondrial DNA set of 400 sequences as the secondary data set for the experiment. We also applied the method to the minimal data set (the Transferring sequences) and found that it works there as well. We divided the sequences into appropriate training and testing sets for the neural network.

We used the tri-gram method as the sequence-encoding schema, and used various codon searching mechanisms to obtain the codon content of the whole sequence as the feature extraction step that prepares the input vector for the neural network. We chose the tri-gram method because it has biological meaning: proteins are built from amino acids, each of which is encoded by three nucleotide bases. There are 20 different amino acids, and each is encoded by a codon, a series of three adjacent bases in a DNA molecule. With four bases there are 4^3 = 64 different codon combinations. We take the frequency of each codon and use it to generate a 64-dimensional vector as the input vector of the neural network. We also used a second feature extraction method, which computes the content of each of the 20 amino acids and converts it to probabilities. This method gives a 20-dimensional vector.
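
The two feature extraction methods described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the choice between a sliding window and a fixed reading frame (`overlapping`) stands in for the unspecified "codon searching mechanisms", codon counts are normalized to relative frequencies, and the standard nuclear genetic code is assumed for translating codons to amino acids (mitochondrial sequences would need their own table).

```python
from itertools import product

BASES = "TCAG"  # base order used by the standard genetic code table below
CODONS = ["".join(p) for p in product(BASES, repeat=3)]  # 4^3 = 64 codons
CODON_INDEX = {codon: i for i, codon in enumerate(CODONS)}

# Assumption: standard nuclear genetic code, one letter per codon in
# CODONS order ('*' marks a stop codon).
AMINO_FOR_CODON = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
                   "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 amino acids


def codon_counts(seq, overlapping=True):
    """Count each of the 64 tri-grams in a DNA sequence."""
    seq = seq.upper()
    step = 1 if overlapping else 3  # sliding window vs. fixed reading frame
    counts = [0] * len(CODONS)
    for i in range(0, len(seq) - 2, step):
        idx = CODON_INDEX.get(seq[i:i + 3])  # None for tri-grams with N, gaps
        if idx is not None:
            counts[idx] += 1
    return counts


def codon_frequency_vector(seq, overlapping=True):
    """64-dimensional vector of relative codon frequencies."""
    counts = codon_counts(seq, overlapping)
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(CODONS)


def amino_acid_vector(seq):
    """20-dimensional vector of amino acid probabilities (stops ignored)."""
    counts = codon_counts(seq, overlapping=False)
    per_amino = dict.fromkeys(AMINO_ACIDS, 0)
    for idx, n in enumerate(counts):
        amino = AMINO_FOR_CODON[idx]
        if amino != "*":
            per_amino[amino] += n
    total = sum(per_amino.values())
    return [per_amino[a] / total if total else 0.0 for a in AMINO_ACIDS]
```

Either vector can be fed to the classifier directly. Normalizing by the total count keeps sequences of different lengths comparable, which is one reasonable reading of "frequency" in the description above.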
We used both probabilistic neural networks and feed-forward neural networks as supervised neural networks. According to our results, the probabilistic neural network was the most suitable type for this kind of analysis. Our results suggest that the experiment above can be done with a neural network, without multiple sequence alignment or a phylogenetic tree, and that this approach is faster and achieves high accuracy when a relevant sequence-encoding schema is used. When we input unknown DNA sequences to the trained neural network, we obtained the same categories that the phylogenetic analysis gave for those sequences. We therefore expect that this approach can assign DNA sequences to the main target categories, and can identify the most suitable category for an unknown sequence related to the training sequences, in less time than phylogenetic analysis. According to our results, the most suitable feature extraction method is the tri-gram sequence-encoding schema. We also tried to understand why some sequences were mismatched by the neural network. The method does have disadvantages: it is difficult to understand the relationships among species, and it requires more data than a phylogenetic tree approach. In our analysis we used a supervised neural network because we obtained the target vector of the neural network from the phylogenetic tree approach. As an extension of this research, an unsupervised neural network could be used instead, by clustering the input sequences. We also used only DNA sequences in the experiments. The work can be extended to protein sequences to see whether this type of experiment is possible with them. Protein sequences are built from 20 different amino acids.
Therefore the task becomes much more complicated than with DNA sequences. As a further study, we need to find a relevant sequence-encoding schema to extract features from protein sequences for this analysis. When selecting the training data set, the number of training sequences should be about ten times the number of features. Therefore, as a further extension, we can select suitable sequences, increase the amount of training data, and repeat the experiment to see whether the results become more accurate.
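
To illustrate the probabilistic neural network used as the classifier above, here is a minimal sketch in its common Parzen-window form: one Gaussian kernel per training vector, averaged per category, with the highest-scoring category returned. The class name, the method names and the smoothing parameter `sigma` are assumptions made for this example, and the abstract does not specify the authors' network configuration.

```python
import math


class ProbabilisticNeuralNetwork:
    """Parzen-window classifier with one Gaussian kernel per training vector."""

    def __init__(self, sigma=0.1):
        self.sigma = sigma  # kernel width (assumed value, needs tuning)
        self.patterns = {}  # category label -> list of training vectors

    def fit(self, vectors, labels):
        """Store each training vector under its category (pattern layer)."""
        self.patterns = {}
        for vec, label in zip(vectors, labels):
            self.patterns.setdefault(label, []).append(list(vec))
        return self

    def class_scores(self, x):
        """Average kernel response per category (summation layer)."""
        scores = {}
        for label, vecs in self.patterns.items():
            total = 0.0
            for v in vecs:
                sq_dist = sum((a - b) ** 2 for a, b in zip(x, v))
                total += math.exp(-sq_dist / (2 * self.sigma ** 2))
            scores[label] = total / len(vecs)
        return scores

    def predict(self, x):
        """Return the category with the highest score (decision layer)."""
        scores = self.class_scores(x)
        return max(scores, key=scores.get)
```

In use, `fit` would receive the 64-dimensional codon frequency vectors of the training sequences together with the categories read from the phylogenetic tree, and `predict` would be called on the vector of an unknown sequence. Training only stores the vectors, which is one reason this kind of network is quick to set up compared with an iteratively trained feed-forward network.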