Selecting Restriction Enzymes for Terminal Restriction Fragment Length Polymorphism based on Phylogenetic Distance

K. J. Mei1 and S. H. Chang2
1 National Cheng Kung University, mq2kr3j92@hotmail.com
2 National Cheng Kung University, waterman2000.tw@gmail.com

ABSTRACT
An algorithm is proposed for choosing, from a set of sample vectors, the vector that best matches a given set of clusters. An application based on the algorithm has been developed for choosing restriction enzymes according to the clusters.

Keywords: pattern identity, T-RFLP

1 INTRODUCTION
Terminal restriction fragment length polymorphism (T-RFLP) analysis is one of the most common techniques for analyzing bacterial communities (Liu, W. T. et al. 1997). The analysis rests on two facts: 1) different bacteria usually have different DNA sequences, and 2) a restriction enzyme recognizes a specific DNA sequence and cuts wherever that sequence occurs. Briefly, DNA fragments are generated from the bacteria experimentally and tagged fluorescently at one end. A restriction enzyme is then applied to cut the DNA fragments; because only one end of each fragment is fluorescent, only the terminal fragments are detectable. Since different bacteria have different DNA sequences, the restriction enzyme cuts at the same recognition sequence but at different positions, thus generating terminal fragments of different lengths for different bacteria.

For example, suppose bacterium A has the DNA sequence “GGCCGATCGCCCT” and bacterium B has the DNA sequence “GGCGGATCGGCCA”. If a restriction enzyme that cuts in the middle of “GGCC” is applied, it generates “GG” for bacterium A and “GGCGGATCGG” for bacterium B, so the two bacteria are distinguishable. However, if a restriction enzyme that cuts in the middle of “GATC” is applied, it generates the different fragments “GGCCGA” and “GGCGGA”, but their lengths are the same and the two bacteria are not distinguishable. This problem is inherent to the technique. Phylogenetically distantly related bacteria, i.e.
bacteria that differ more in evolutionary terms, sometimes share the same terminal restriction fragment length (T-RFL), while phylogenetically closely related bacteria sometimes have different T-RFLs. This may confuse researchers, since they may observe only one signal for several different bacteria or, conversely, many signals for similar bacteria (Engebretson, J. J. and Moyer, C. L. 2003). The restriction enzyme is therefore one of the key factors in a successful T-RFLP analysis (Kitts, C. L. 2001; Dickie, I. A. and FitzJohn, R. G. 2007).

Ideally, researchers look for a restriction enzyme that generates different lengths for as many bacteria as possible. However, a clear guideline is currently lacking. Usually, researchers choose restriction enzymes by referring to previous studies (Wells, G. F. et al. 2009), by testing different restriction enzymes on samples experimentally (Zhang, R. et al. 2008), or by simulating digestion against a database of bacterial DNA sequences (Kent, A. D. et al. 2003; Alvarado, P. and Manjon, J. L. 2009).

To provide a systematic method for choosing suitable restriction enzymes, an algorithm is developed. Briefly, we match bacterial cluster information against the sample vectors of restriction enzymes. The cluster information tells whether two bacteria are considered similar or different. It can be user-defined or converted, by clustering analysis, from a bacterial phylogenetic distance matrix (the matrix of distances between every pair of bacterial DNA sequences). In sum, the better the sample vector of a restriction enzyme matches the cluster information, the closer that restriction enzyme comes to the ideal of “same T-RFL for similar bacteria/same cluster, different T-RFLs for different bacteria/different clusters”.

2 ALGORITHM
An algorithm is proposed for choosing, from a set of sample vectors, the vector that best matches the given clusters.
A program based on the algorithm has been developed for choosing restriction enzymes according to clusters. Given cluster information and sample vectors of restriction enzymes, the user obtains a score for each sample vector. The cluster information and the sample vectors of restriction enzymes are first transformed into distinct matrices (matrices for distinction). The clusters can be user-defined or derived from a phylogenetic distance matrix. The distinct matrix of the clusters is then compared with the distinct matrix of each restriction enzyme. The score of each sample vector is obtained by matching and weighting between the distinct matrices. The weighting can follow the phylogenetic distance matrix or an artificial weight. The overall procedure is shown in Fig. 1.

[Fig. 1 flowchart: the clusters (by phylogenetic distance matrix or user-defined) and the sample vectors of restriction enzymes are each converted into distinct matrices; these are matched and weighted (by phylogenetic distance matrix or by artificial weight) to give the score.]
Fig. 1: Overview of selecting restriction enzymes.

2.1 Distinct Matrix Establishment for Clusters
The clusters can be established from a user definition or from a phylogenetic distance matrix. The phylogenetic distance matrix is an n×n upper triangular matrix whose values all lie within 0~1. Given a cutoff value in the same range, each value in the phylogenetic distance matrix is classified as SAME or DIFFERENT: if the value is higher than the cutoff, the classified value is DIFFERENT; otherwise, it is SAME. The n×n upper triangular boolean matrix for distinction is built according to the cutoff, as shown in Fig. 2.

[Fig. 2 flowchart: each element of the phylogenetic distance matrix (n×n upper triangular matrix, values between 0~1) is compared with the distance cutoff (value between 0~1) to give the distinct matrix for clusters (n×n upper triangular boolean matrix, values within {SAME, DIFFERENT}).]
Fig. 2: Construction of distinct matrix for phylogenetic distance matrix.
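The cutoff classification of Section 2.1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' program: the function name `cluster_distinct_matrix`, the encoding of SAME/DIFFERENT as `False`/`True`, and the example distances are all hypothetical.

```python
from typing import List

# Assumed encoding for this sketch: SAME -> False, DIFFERENT -> True.
SAME, DIFFERENT = False, True


def cluster_distinct_matrix(dist: List[List[float]], cutoff: float) -> List[List[bool]]:
    """Classify each pair (i < j) as SAME or DIFFERENT using a distance cutoff.

    Only the strict upper triangle is filled; all other entries stay SAME.
    """
    n = len(dist)
    distinct = [[SAME] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # A distance above the cutoff marks the pair as DIFFERENT.
            distinct[i][j] = DIFFERENT if dist[i][j] > cutoff else SAME
    return distinct


# Hypothetical distances for four species (only the upper triangle is read).
example_dist = [
    [0.0, 0.01, 0.20, 0.25],
    [0.0, 0.0, 0.22, 0.24],
    [0.0, 0.0, 0.0, 0.02],
    [0.0, 0.0, 0.0, 0.0],
]
example_distinct = cluster_distinct_matrix(example_dist, cutoff=0.03)
```

With the 3% cutoff used here, species 0 and 1 fall on the SAME side, as do species 2 and 3, while every pair across those two groups is DIFFERENT.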
The user-defined clusters can likewise be expanded into an n×n upper triangular boolean matrix. The value of an element is SAME when its two indexes belong to the same cluster; otherwise, the value is DIFFERENT. A sample for building the distinct matrix is shown in Fig. 3. There are five species, with three user-defined clusters and two clusters obtained from the phylogenetic distance matrix with a 3% cutoff; each clustering is built into an n×n upper triangular boolean matrix.

Fig. 3: A sample for building distinct matrix for clusters.

2.2 Distinct Matrix Establishment for Sample Vector of Restriction Enzymes
The sample vector of a restriction enzyme is an integer vector of length n whose values are T-RFLs. Assume there are m sample vectors of restriction enzymes. Sometimes one restriction enzyme is not enough to match the clusters, so a combination of restriction enzymes is necessary. Through permutation of the sample vectors, m* sample vectors of combined restriction enzymes are obtained. Because only the terminal fragment at the labeled end is detected, the T-RFL of a combined digestion is set by the cut nearest that end, so the combination can be done by choosing the smallest value for each element across the vectors. After combination, a difference matrix is built for each sample vector by pairwise differencing within that vector. Each difference matrix has dimension n×n with integer elements, giving m* matrices in total. The matrix for distinction is obtained by comparing every element of the difference matrix with the T-RFLP cutoff: if the value in the difference matrix is higher than the cutoff, the value in the distinct matrix is DIFFERENT; otherwise, the value is SAME. The distinct matrices of restriction enzymes are m* n×n upper triangular boolean matrices, as shown in Fig. 4.
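The construction in Section 2.2 can be sketched as follows. This is a minimal illustration, not the authors' implementation, and it makes several assumptions: the pairwise difference is taken as an absolute value, combined enzymes are enumerated as unordered subsets (the element-wise minimum does not depend on order), and all function names, enzyme names and T-RFL values are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def combine_trfls(vectors: List[List[int]]) -> List[int]:
    """Combined digestion: for each species, keep the smallest T-RFL."""
    return [min(values) for values in zip(*vectors)]


def enzyme_distinct_matrix(trfl: List[int], cutoff: int) -> List[List[bool]]:
    """True (DIFFERENT) where two T-RFLs differ by more than the cutoff."""
    n = len(trfl)
    distinct = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Assumption: the pairwise difference is an absolute difference.
            distinct[i][j] = abs(trfl[i] - trfl[j]) > cutoff
    return distinct


def all_combinations(
    enzymes: Dict[str, List[int]], degree: int
) -> Dict[Tuple[str, ...], List[int]]:
    """T-RFL vectors for every enzyme subset of size 1..degree."""
    result = {}
    names = sorted(enzymes)
    for k in range(1, degree + 1):
        for subset in combinations(names, k):
            result[subset] = combine_trfls([enzymes[name] for name in subset])
    return result


# Hypothetical T-RFLs (bp) of two enzymes for four species.
example_enzymes = {"EnzA": [120, 120, 300, 310], "EnzB": [200, 90, 250, 400]}
example_combined = all_combinations(example_enzymes, degree=2)
example_distinct = {
    key: enzyme_distinct_matrix(vec, cutoff=0)
    for key, vec in example_combined.items()
}
```

In this made-up example, EnzA alone cannot separate species 0 and 1 (both 120 bp), whereas the combined digestion with EnzB yields distinct lengths for all four species.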
[Fig. 4 flowchart: the sample vectors of restriction enzymes (m vectors of length n, all values integer) are expanded by permutation, according to the degree of combination, into the sample vectors of combined restriction enzymes (m* vectors of length n, all values integer); pairwise differencing within each sample vector gives the difference matrices of restriction enzymes (m* upper triangular matrices of dimension n×n, all values integer); comparing every element of every matrix with the T-RFLP cutoff (integer) gives the distinct matrices of restriction enzymes (m* upper triangular boolean matrices of dimension n×n, all values within {SAME, DIFFERENT}).]
Fig. 4: Construction of distinct matrices for restriction enzymes.

A sample for building the distinct matrix for a restriction enzyme is shown in Fig. 5. The sample uses a T-RFLP cutoff of 0 bp and a restriction enzyme with five T-RFLs for five species. By pairwise comparison, the vector of T-RFLs is transformed into a difference matrix. With the T-RFLP cutoff, the difference matrix is transformed into a boolean matrix for distinction.

Fig. 5: A sample for building distinct matrix for restriction enzyme.

2.3 Match and Weight between Distinct Matrices for Cluster and Restriction Enzyme
There are two ways in this paper to weight the match between the distinct matrices for cluster and restriction enzyme: by phylogenetic distance matrix or by artificial weight matrix. The phylogenetic distance matrix must have the same dimension as the distinct matrix: it is an n×n float matrix with all values within 0~1. The artificial weight matrix is a 2x2 float matrix, shown in Tab. 1. When elements are matched with the artificial (user-defined) weight matrix, the resulting score is the weight value if both values are SAME, 0 if the values do not match, and 1 if both values are DIFFERENT. The distinct matrix for clusters is matched and weighted in this way against each distinct matrix for restriction enzymes.
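The artificial-weight matching rule just described can be sketched as below. This is only an illustrative sketch: the function name `artificial_score_matrix`, the `False`/`True` encoding of SAME/DIFFERENT, the restriction to the strict upper triangle and the three-species example are assumptions made for illustration.

```python
from typing import List


def artificial_score_matrix(
    cluster: List[List[bool]], enzyme: List[List[bool]], weight: float
) -> List[List[float]]:
    """Score each pair (i < j) with the 2x2 artificial weight rule.

    Encoding assumed here: False = SAME, True = DIFFERENT.
    """
    n = len(cluster)
    score = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if cluster[i][j] != enzyme[i][j]:
                score[i][j] = 0.0      # values do not match
            elif cluster[i][j]:
                score[i][j] = 1.0      # both DIFFERENT
            else:
                score[i][j] = weight   # both SAME
    return score


# Hypothetical three-species example.
example_cluster = [
    [False, False, True],
    [False, False, True],
    [False, False, False],
]
example_enzyme = [
    [False, False, True],
    [False, False, False],
    [False, False, False],
]
example_scores = artificial_score_matrix(example_cluster, example_enzyme, weight=0.1)
```

Here the pair (0, 1) is SAME in both matrices and receives the weight, the pair (0, 2) is DIFFERENT in both and receives 1, and the pair (1, 2) disagrees and receives 0.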
Matching and weighting yields a score matrix for each restriction enzyme: m* n×n upper triangular float matrices in total, as shown in Fig. 6.a. The phylogenetic distance matrix takes effect where the elements of the two distinct matrices are equivalent: the score matrix is made by multiplying the phylogenetic distance matrix element-wise with the matched matrix, as shown in Fig. 6.b.

Restriction Enzyme \ Cluster    SAME      DIFFERENT
SAME                            Weight    0
DIFFERENT                       0         1
Tab. 1: Artificial weight table for matching distinct matrices of cluster and restriction enzyme.

[Fig. 6 flowcharts: (a) the distinct matrix for clusters and the distinct matrices for restriction enzymes are matched and weighted by the artificial weight matrix (2x2 float matrix) to give the score matrices (m* n×n upper triangular float matrices); (b) the distinct matrix for clusters and the distinct matrices for restriction enzymes are matched and multiplied by the phylogenetic distance matrix to give the score matrices (m* n×n upper triangular float matrices).]
Fig. 6: Match and weight between distinct matrices for clusters and restriction enzymes: (a) by artificial weight matrix, (b) by phylogenetic distance matrix.

The score of each restriction enzyme is obtained by summing all elements of its score matrix and dividing the sum by the sum of the perfectly matched matrix. After sorting the scores, the best restriction enzymes can be found, as shown in Fig. 7.

[Fig. 7 flowchart: all elements of each score matrix are summed; each sum is divided by the sum of the perfectly matched matrix to give the scores; the scores are sorted and the restriction enzymes with the highest scores are reported as the best restriction enzymes.]
Fig. 7: Find the best restriction enzymes through the score matrices.

A corresponding sample of match and weight for a restriction enzyme, continuing from Fig. 3 and Fig. 5, is shown in Fig. 8. The matched matrix is built by marking an element S when both elements in the distinct matrices for cluster and restriction enzyme are SAME, D when both are DIFFERENT, and N when they are not equivalent. Only S and D get weight.
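The phylogenetic weighting and the normalized score described above can be sketched as follows. This is a hedged illustration rather than the authors' code. It assumes that "dotting" means an element-wise product over the strict upper triangle, and that the perfectly matched matrix is the score matrix obtained when an enzyme's distinct matrix equals the cluster distinct matrix. The function names and the three-species data are hypothetical.

```python
from typing import Dict, List, Tuple

BoolMatrix = List[List[bool]]


def upper_sum(matrix: List[List[float]]) -> float:
    """Sum of the strict upper triangle (i < j)."""
    n = len(matrix)
    return sum(matrix[i][j] for i in range(n) for j in range(i + 1, n))


def phylo_score_matrix(
    cluster: BoolMatrix, enzyme: BoolMatrix, dist: List[List[float]]
) -> List[List[float]]:
    """Matched pairs (S or D) receive their phylogenetic distance, others 0."""
    n = len(cluster)
    return [
        [dist[i][j] if j > i and cluster[i][j] == enzyme[i][j] else 0.0
         for j in range(n)]
        for i in range(n)
    ]


def rank_enzymes(
    cluster: BoolMatrix, enzymes: Dict[str, BoolMatrix], dist: List[List[float]]
) -> List[Tuple[str, float]]:
    """Normalize each enzyme's sum by the perfect-match sum; best first."""
    perfect = upper_sum(phylo_score_matrix(cluster, cluster, dist))
    if perfect == 0:
        raise ValueError("perfectly matched matrix sums to zero")
    scores = {
        name: upper_sum(phylo_score_matrix(cluster, distinct, dist)) / perfect
        for name, distinct in enzymes.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical data: three species, False = SAME, True = DIFFERENT.
example_dist = [[0.0, 0.02, 0.30], [0.0, 0.0, 0.28], [0.0, 0.0, 0.0]]
example_cluster = [[False, False, True], [False, False, True], [False, False, False]]
example_enzymes = {
    "EnzX": [[False, False, True], [False, False, True], [False, False, False]],
    "EnzY": [[False, True, True], [False, False, True], [False, False, False]],
}
example_ranking = rank_enzymes(example_cluster, example_enzymes, example_dist)
```

In this made-up case EnzX reproduces the clusters exactly and scores 1, while EnzY splits the closely related pair (0, 1) and loses only that pair's small distance, which shows how distance weighting penalizes within-cluster mismatches lightly.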
In the sample of Fig. 8, when weighting by the phylogenetic distance matrix, S and D are not distinguished; both get their weight from the phylogenetic distance matrix. The artificial weight here is set to 0.1: S gets the score of the given weight value and D gets the score 1. In the end there are four kinds of result: (1) cluster and weight both by phylogenetic distance matrix, (2) cluster by user-defined cluster and weight by phylogenetic distance matrix, (3) cluster by phylogenetic distance matrix and weight by artificial weight, (4) cluster by user-defined cluster and weight by artificial weight.

Fig. 8: A sample for getting score by match and weight for four different ways.

3 PROGRAM APPLICATION
A program implementing the above algorithm has been developed. The user interface is separated into four parts: cluster operation console, restriction enzymes operation console, weight method selection console and score console, as indicated by the labels at the top of Fig. 9.

The cluster operation console is shown in Fig. 9. The user can upload a phylogenetic distance matrix and set the corresponding cutoff value, or upload a file of user-defined clusters. The restriction enzymes operation console is shown in Fig. 10. It can upload the T-RFL vectors of restriction enzymes in matrix form, set the cutoff value for T-RFL and the degree of combination, and export the combination result. The weight method selection console is shown in Fig. 11. The user can choose to weight by the phylogenetic distance matrix or by an artificial weight. The score console presents the corresponding T-RFLs for the selected restriction enzyme. As shown in Fig. 12, the row labels are species; the upper figure shows the clusters, where red dashed lines on the same horizontal line indicate the same cluster; the column labels of the lower figure show the T-RFLs. The title above is the name of the selected restriction enzyme. The graphs here are not perfectly matched.

Fig. 9: Cluster operation console.
Fig. 10: Restriction enzymes operation console.
Fig. 11: Weight method selection console.
Fig. 12: Score console presents the corresponding T-RFLs for the selected restriction enzyme.

4 DISCUSSION
The algorithm resembles choosing the best “mask” for the sample vectors of restriction enzymes. The mask can be chosen and weighted from clusters derived from the phylogenetic distance matrix or from user-defined clusters. Under the “natural weight”, which is based on the phylogenetic distance matrix, differences within a cluster are smaller than differences between clusters, so distinguishability carries most of the weight. Users are usually more interested in differences between classes than in differences within a class. So the “natural weight” for identical elements is reasonable, whether the clusters are based on the phylogenetic distance matrix or are user-defined.

The artificial weight matrix is proposed as in Tab. 1; the weight value occupies the position where both values are “SAME” and is usually set within 0~1. If the weight value is set to 0, distinguishability is the only concern. The result can then distinguish most species but not clusters; it is still meaningful, since the user knows how the species are clustered, and if the user can identify species, clustering poses no problem. If the weight value is set higher, close to 1, the distinct matrix of the chosen restriction enzyme will be closer to the distinct matrix of the clusters. So the user can first set the weight value to 1 and, if the result is not good enough, try a smaller weight value. The difference is that the user must then do the clustering after applying the mask.

In the artificial weight matrix, the entry for (Cluster, Restriction Enzyme)=(SAME, DIFFERENT) is 0. If a weight is given at this position, the best restriction enzyme will be chosen as in Fig. 13, where all the elements are distinguishable.

Fig. 13: An example for the corresponding T-RFLs of the best restriction enzyme for weighting at (Cluster, Restriction Enzyme)=(SAME, DIFFERENT) in artificial weighting matrix.
5 CONCLUSION
The algorithm for choosing, from a set of sample vectors, the vector that best matches the given clusters can be applied to choosing restriction enzymes according to the clusters. The clusters can be user-defined or derived from a phylogenetic distance matrix. The weighting can follow the phylogenetic distance matrix or an artificial weight. The application provides a flexible environment for manipulating clusters and selection methods.

REFERENCES
Alvarado, P. and Manjon, J. L. (2009). "Selection of enzymes for terminal restriction fragment length polymorphism analysis of fungal internally transcribed spacer sequences." Appl Environ Microbiol 75(14): 4747-4752.
Dickie, I. A. and FitzJohn, R. G. (2007). "Using terminal restriction fragment length polymorphism (TRFLP) to identify mycorrhizal fungi: a methods review." Mycorrhiza 17(4): 259-270.
Engebretson, J. J. and Moyer, C. L. (2003). "Fidelity of select restriction endonucleases in determining microbial diversity by terminal-restriction fragment length polymorphism." Appl Environ Microbiol 69(8): 4823-4829.
Kent, A. D., Smith, D. J., Benson, B. J. and Triplett, E. W. (2003). "Web-based phylogenetic assignment tool for analysis of terminal restriction fragment length polymorphism profiles of microbial communities." Appl Environ Microbiol 69(11): 6768-6776.
Kitts, C. L. (2001). "Terminal restriction fragment patterns: a tool for comparing microbial communities and assessing community dynamics." Curr Issues Intest Microbiol 2(1): 17-25.
Liu, W. T., Marsh, T. L., Cheng, H. and Forney, L. J. (1997). "Characterization of microbial diversity by determining terminal restriction fragment length polymorphisms of genes encoding 16S rRNA." Appl Environ Microbiol 63(11): 4516-4522.
Wells, G. F., Park, H. D., Yeung, C. H., Eggleston, B., Francis, C. A. and Criddle, C. S. (2009). "Ammonia-oxidizing communities in a highly aerated full-scale activated sludge bioreactor: betaproteobacterial dynamics and low relative abundance of Crenarchaea." Environ Microbiol.
Zhang, R., Thiyagarajan, V. and Qian, P. Y. (2008). "Evaluation of terminal-restriction fragment length polymorphism analysis in contrasting marine environments." FEMS Microbiol Ecol 65(1): 169-178.