Phylogenetic Tree Generation (August 2012)

Brandon J. Andrews

Manuscript received August 16, 2012.

Abstract: To visualize differences between nucleotide sequences, it is often useful to represent them in tree form. The algorithms for generating such trees are referred to as phylogenetic tree generators; they can generate trees by calculating the difference between sequences and then clustering them. The distance-based algorithms create generally accurate evolutionary paths showing where one genetic sequence might have branched from another. The Fitch-Margoliash method is commonly used to solve this problem.

Index Terms: phylogenetic, bioinformatics, fitch

I. INTRODUCTION

Phylogenetic trees visualize the similarities and dissimilarities between genetic sequences and plot the possible evolutionary paths along which the sequences might have originated. Tree-generation algorithms allow researchers to find information that would not be obvious from viewing only the raw distances or other data calculated about the sequences [1].

Some algorithms take the evolutionary rate into consideration, so that the farther a node is from the root, the longer it took for those changes to appear. Knowing the evolutionary rate then allows researchers to attach a time value to the tree. A tree can be either unrooted, indicating there is no start or primary parent in the sequence, or rooted, indicating a possible beginning to the sequence mutations.

There are many ways to create a phylogenetic tree. One of the most common uses distance information between each pair of sequences. The distances can be calculated in many ways depending on the input data, but they always represent the amount of dissimilarity between the sequences. It is often critical that the distance is linear in the amount of dissimilarity, and some algorithms require that criterion to produce accurate trees.

Two distance-based algorithms are usually discussed: the Unweighted Pair Group Method with Arithmetic Mean (UPGMA) and the Fitch-Margoliash method. The former is not used here; it can produce inaccurate trees because it assumes each child has equal weighting, while the Fitch-Margoliash algorithm uses weighting to closely preserve the distances between nodes in the tree. The Fitch-Margoliash algorithm is the focus of the following sections and of the implementation.

II. RESEARCH AND DESIGN

A. Research

The research consisted of looking for examples and explanations of the algorithm and its proper implementation. I found a slide containing a brief example of the algorithm, which was enough to program the whole algorithm [2]. The other research was to gather basic information about phylogenetic trees; however, it turns out there is a lot of misinformation about how they are defined. Some papers, for instance, erroneously assume that they must be binary trees, when in fact a branch with a distance of zero indicates multiple branches [1].

B. Design

Phylogenetic trees can be drawn linearly in two dimensions, or they can be drawn using polar coordinates, which creates a radial diagram. I chose a radial drawing for aesthetic reasons. The core algorithm design was ad hoc, based on an example in a PowerPoint presentation [2].

III. IMPLEMENTATION

The implementation used JavaScript for computing and the HTML5 Canvas for rendering. The 2D context API of Canvas makes it easy to draw lines, arcs, and text, which are all that is required to draw a phylogenetic tree from the nodes output by the Fitch-Margoliash algorithm.

Since the implementation can take sequence data as input, the distance matrix often has to be computed first. I used a very simple Hamming distance algorithm that counts the number of mismatches between each pair of sequences and stores that count as the pair's entry in the distance matrix.
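As a minimal sketch of the distance computation just described, the following JavaScript counts mismatched positions between two sequences and fills a distance matrix for every pair. The function names `hammingDistance` and `buildDistanceMatrix` and the sample sequences are hypothetical and are not taken from the original implementation; the sketch also assumes that all sequences have the same length.

```javascript
// Hypothetical helper names; this is an illustration, not the original code.
// Assumes both sequences are strings of equal length.
function hammingDistance(a, b) {
  if (a.length !== b.length) {
    throw new Error("Hamming distance requires sequences of equal length");
  }
  let mismatches = 0;
  for (let i = 0; i < a.length; i++) {
    if (a[i] !== b[i]) {
      mismatches++;
    }
  }
  return mismatches;
}

// Builds a symmetric matrix where matrix[i][j] is the distance between
// sequences[i] and sequences[j]; the diagonal stays zero.
function buildDistanceMatrix(sequences) {
  const n = sequences.length;
  const matrix = Array.from({ length: n }, () => new Array(n).fill(0));
  for (let i = 0; i < n; i++) {
    for (let j = 0; j < i; j++) {
      const d = hammingDistance(sequences[i], sequences[j]);
      matrix[i][j] = d;
      matrix[j][i] = d;
    }
  }
  return matrix;
}

// Example with made-up sequences.
const matrix = buildDistanceMatrix(["ACGT", "ACGA", "TCGA"]);
console.log(matrix);
```

Filling both triangles of the matrix is a convenience chosen for this sketch; storing only the lower triangle carries the same information, since the distance between two sequences does not depend on their order.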
The Hamming distance works, but it is not a very good measure of the dissimilarity of sequences. If a single nucleotide is missing, every position after it is shifted and the distances are skewed, so in practice a better dissimilarity algorithm would be required.

IV. TESTING AND SAMPLE DATA

One set of test data kept appearing in the material I found; it is listed in Fig. 1 [2]. Since the input to the Fitch-Margoliash method is just a distance matrix, the matrix is either generated from the sequence data or supplied separately in precomputed form. The example in Fig. 1 was given precomputed.

      A    B    C    D    E
A     0    0    0    0    0
B    22    0    0    0    0
C    39   41    0    0    0
D    39   41   18    0    0
E    41   43   20   10    0

Fig. 1. Test data found in multiple presentations (only the lower triangle of the distance matrix is filled in).

V. RESULTS AND ANALYSIS

Although there might be minor errors, the Fitch-Margoliash method generally creates the expected phylogenetic tree. The length of each branch also correctly corresponds to the dissimilarity, allowing easy visualization of which sequences are similar and which are completely different.

To analyze how accurately the tree reflects the distance matrix, one only has to recompute the distance between each pair of nodes in the tree by summing the branch lengths along the path between them. An error measure can then be computed from the differences to judge the accuracy of the tree.

VI. CONCLUSION

The Fitch-Margoliash distance-based method creates very “accurate” trees that preserve most of the distances in the input distance matrix. However, it is important to realize that the phylogenetic tree might not relate to any real evolutionary mutations without further analysis, especially on the scale of whole organisms [1].

REFERENCES

[1] K. Louhisuo. (2004, May 4). Constructing phylogenetic trees with UPGMA and Fitch-Margoliash. [Online]. Available: http://www.niksula.cs.hut.fi/~klouhisu/Bioinfo/phyltree.pdf

[2] J. Bacardit and N. Krasnogor. Phylogenetic Trees [PPT]. Available: http://www.cs.nott.ac.uk/~jqb/G53BIO/Slides/Phylogenetic%20Trees.ppt

Brandon J. Andrews lives in Kalamazoo, Michigan, and is a computer science graduate. He received his undergraduate computer science degree from Western Michigan University (WMU), Kalamazoo, Michigan, in 2010. He works for the Office of Information Technology at WMU, maintaining printer and database servers, while also serving as a part-time teaching assistant.