Characteristic Restriction Endonuclease Cut Order for Classification and Analysis of DNA Sequences

Rajib SenGupta
College of Information Science and Technology, University of Nebraska at Omaha
Omaha, NE 68102-0116, USA

Problem Statement
The motivation for this project is the old holy grail of bioinformatics: sequence identification and classification.

Current Approaches
1. Computational approach – pairwise local and multiple sequence alignment
2. Laboratory method – RFLP, Southern blotting

Existing Methods - Limitations
Pairwise or Multiple Alignment
1. Alignment is a ‘fine-grained’ approach
2. Computationally intensive; optimal multiple sequence alignment is NP-hard, so it scales poorly to large datasets
3. Introduces gaps; gaps are interpreted as evolutionary events in molecular phylogeny, and misaligned sequences carry no useful biological information
4. Heuristics such as BLAST are employed instead
Laboratory Methods (RFLP)
1. Only feasible for a few sequences
2. Prone to human and procedural error
3. In-silico RFLP methods (TRFLP program) require alignment as the second step for sequence identification

Ideation
Utilize the ‘coarse-grained’ features of RFLP/restriction enzymes in silico, as opposed to the ‘fine-grained’ features of computational alignment.

Restriction Endonuclease
Proteins that recognize a particular nucleotide sequence (called the restriction site, generally 4 to 8 bases long) and cut the double-stranded DNA molecule at that site.

RFLP
Restriction Fragment Length Polymorphism (RFLP)
A widely used laboratory method in molecular identification and phylogenetic studies. The sequences are cut into several fragments with restriction endonucleases. Variation in the position of the restriction sites along the DNA, among the sequences being analyzed, leads to digested products of varying lengths. Following high-resolution gel electrophoresis of the digested products, the fragment patterns are compared visually to determine the similarity between the sequences.
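The digestion step described above can be imitated in silico. The following is a minimal sketch, not the method used in this work: it assumes a linear sequence, looks only at the top strand, and uses two enzymes with well-known palindromic sites (AluI, AG^CT; HaeIII, GG^CC). The names `ENZYMES`, `cut_positions`, and `fragment_lengths` are hypothetical and chosen for illustration.

```python
import re

# Hypothetical enzyme table: name -> (recognition site, cut offset).
# The offset is the number of bases from the start of the site to the
# cut on the top strand (AluI cuts AG^CT, HaeIII cuts GG^CC).
ENZYMES = {
    "AluI": ("AGCT", 2),
    "HaeIII": ("GGCC", 2),
}


def cut_positions(seq, site, offset):
    """Return the 0-based top-strand cut positions of one enzyme."""
    # A lookahead pattern also finds overlapping occurrences of the site.
    return [m.start() + offset for m in re.finditer(f"(?={site})", seq.upper())]


def fragment_lengths(seq, enzymes=ENZYMES):
    """Digest seq with all enzymes together and return fragment lengths in order."""
    cuts = sorted(
        {p for site, off in enzymes.values() for p in cut_positions(seq, site, off)}
    )
    bounds = [0] + cuts + [len(seq)]
    return [b - a for a, b in zip(bounds, bounds[1:])]


if __name__ == "__main__":
    # Toy sequence, not real data: one AluI site and one HaeIII site.
    print(fragment_lengths("AAAGCTTTTGGCCAA"))
```

Two sequences whose sites sit at different positions yield different fragment-length lists, which is the pattern an RFLP gel makes visible.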

Proposed Concept
New idea: use Enzyme Cut Order (ECO) information from DNA for evaluation.
Definition:
– The ECO of a DNA sequence (S) for a particular set of restriction enzymes {Ez} is a string (array) of enzyme names (represented as numeric ids) in the order in which each enzyme (ez ∈ Ez) cuts the sequence. The ECO may also include the position of the nucleotide, counted from the start of the sequence, at which each cut occurs.
– Thus, an ECO is a string (array) of tuples, each consisting of an enzyme id and a cut position.
– Example:

GenBank Classification
ORGANISM Lirula macrospora
CLASSIFICATION: Eukaryota; Fungi; Dikarya; Ascomycota; Pezizomycotina; Leotiomycetes; Rhytismatales; Rhytismataceae; Lirula
ORGANISM Nectria haematococca
CLASSIFICATION: Eukaryota; Fungi; Dikarya; Ascomycota; Pezizomycotina; Sordariomycetes; Hypocreomycetidae; Hypocreales; Nectriaceae; Nectria; Nectria haematococca complex.
ORGANISM Nectria mauritiicola
CLASSIFICATION: Eukaryota; Fungi; Dikarya; Ascomycota; Pezizomycotina; Sordariomycetes; Hypocreomycetidae; Hypocreales; Nectriaceae; Nectria.
ORGANISM Oligoporus placentus
CLASSIFICATION: Eukaryota; Fungi; Dikarya; Basidiomycota; Agaricomycotina; Agaricomycetes; Polyporales; Oligoporus

Concept Contd.
Closely related organisms have similar Enzyme Cut Orders.
Table 1: The ECO for ‘ITS’ sequences from closely and distantly related fungi. The closely related Nectria species (Nectria haematococca and Nectria mauritiicola) show a high level of ECO similarity.

Quantifying ECO
Enzyme Cut Order (ECO) Similarity Score
– The similarity score between two ECOs reflects the number of enzymes they share and the order in which these enzymes cut the sequence.
1. The similarity score is higher when a larger number of shared enzymes appear in the same order in the two Enzyme Cut Orders.
2. This similarity score is based on the Longest Common Subsequence (LCS) of two strings, where the strings are the ECOs.
3. The length of the LCS between two ECOs (E1 and E2) of two corresponding sequences (S1 and S2) is taken as the Enzyme Cut Order Similarity Score between E1 and E2.

Hypothesis
Organisms closer to each other in the phylogenetic tree have highly similar Enzyme Cut Orders. Similarity is defined as the Enzyme Cut Order Similarity Score, which is the length of the LCS of the corresponding Enzyme Cut Orders of the organisms' DNA sequences.

Preliminary Result

Summary
– Enzyme Cut Order is a distinguishing characteristic of DNA sequences.
– The similarity between two sequences can be defined by the Enzyme Cut Order Similarity Score.
– The ECO similarity score can be measured as the length of the LCS of the corresponding Enzyme Cut Orders of the organisms' DNA sequences.

Overall Method Diagram
[Diagram components: SEQUENCE DB, RES ENZ DB, Enzyme Cut Order, Array of Enzyme cut orders, Similarity Score Algorithm, Similarity Matrix, Clustering Algorithm, TAXON DB, CLUSTER DB, Analysis of Clusters, Report, Graph, Genetic Algorithms, Optimal Enzyme Set]

Step 1 – Sequence Data Collection and Curation
– Create a local database of GenBank sequences obtained in FASTA or XML format
– Reference these sequences against the taxon database
– Create a curated taxonomy database for these sequences using user-defined taxonomical rules
Fungi ITS sequences from GenBank
– “Organism description” of the GenBank entries (or OrgName_Lineage in XML format)
– Classification categories included Kingdom, Division, Class, Order, Family, Genus, Species
– Use a simple suffix rule and the position in the lineage to decide the category of each name

Step 2 – Enzyme Data Collection
– Create a database of restriction enzymes obtained from REBASE
– Add more relevant information about these restriction enzymes (isoschizomers, commercial availability, reverse cut site) for later use
– Recognition sequences containing bases other than A, T, G and C were interpreted as per the IUB ambiguity code (Eur. J. Biochem. 150: 1-5, 1985).

Step 3 – Enzyme Cut Order DB Build
– Obtain the enzyme cut order using a user-defined set of restriction enzymes {Ez}. The enzyme cut order is obtained for every test sequence and every enzyme in {Ez}.
– Evaluate the effect of the size and type of the restriction endonuclease set
Different sets {Ez} were chosen with the following properties:
1. Enzymes that cut at least one of the sequences in the given sequence data
2. Enzymes that cut 50% of the sequences in the given sequence data
3. Enzymes that cut all the sequences at least once
4. Random enzyme set (consisting of a mixture from the sets listed previously)
5. Restriction enzymes commonly used in a biology laboratory working with the RFLP of fungi
6. Restriction enzyme set obtained by using a genetic algorithm

Step 4 – Similarity Matrix based on LCS score
– Create a similarity matrix, or equivalently a complete weighted graph, for each enzyme set {Ez}. Each node represents the enzyme cut order of a sequence, and the weight of the edge between two nodes is the similarity score (SS = LCS length) between the two corresponding enzyme cut orders.
– (G)Ez = (Kn)Ez = (v ∈ V, e ∈ E), where v is the enzyme cut order of a sequence and |e| = SS

Step 5 – Clustering
The similarity matrix is clustered and the clusters are analyzed for their phylogenetic accuracy.
Clustering algorithms employed:
– Maximum gap based exclusive clustering
– Hierarchical clustering
– Similarity clustering

Step 5 – Clustering
Sensitivity and positive predictive value were the two important evaluation parameters for cluster analysis. They are defined as follows.
For a particular taxon in a group X:
S = Sensitivity = TP / (TP + FN), where
TP = True Positive = count of the taxon's sequences in X
FN = False Negative = count of the taxon's sequences in DB1, excluding those in X
TP + FN = total count of the taxon's sequences in the entire DB1
(S)tax,X = count of the taxon in X / total count of the taxon in the database
Similarly, for a particular taxon in a group X:
PP = Positive Predictive Value = TP / (TP + FP), where
TP = True Positive = count of the taxon's sequences in X
FP = False Positive = count of sequences in X that belong to other taxa
TP + FP = total count of sequences in the group X
(PP)tax,X = count of the taxon in X / total count of sequences in X

Step 6 – Genetic Algorithm
– Find the optimal enzyme set for a particular dataset using a genetic algorithm.
– The optimal enzyme set is defined as the minimal-size enzyme set that shows the highest phylogenetic resolution.
– The fitness function is based on the expected and actual count of an organism in the cluster. The score is determined quantitatively in terms of sensitivity and positive predictive value.
– Selection is roulette-wheel selection, tournament selection, or random selection.
– Uniform, single-point, or two-point crossover is used, along with a user-specified crossover rate.

Experiment 1
Sequence (Set 1)
– Type = Internal Transcribed Spacer
– Size = 7
– Taxonomy
  Ascomycota = 5
  – Nectria sp. = 3
  – Lirula sp. = 2
  Basidiomycota = 2
  – Oligoporus sp. = 2
Enzyme (Set 1: TaqI, HaeIII, HinfI, AluI, RsaI, MspI)
– Size = 6
– Property = Frequent cutter

Result 1
– All sequences are perfectly clustered
– The similarity gap is narrow, as reflected in the highlighted samples

Sample Test Set 1 – Enz Set 2
1. Used 57 enzymes on the same test set 1
2. Obtained a better similarity matrix (larger similarity gap)
3. A larger enzyme set may give a better clustering result
4. All sequences are perfectly clustered

Result 2
21 of 26 species are perfectly clustered with 65 enzymes.

Experiment 3 (Find Optimal Enzyme Set using GA)
Sequence (AspCan)
– Type = Internal Transcribed Spacer
– Size = 78
– Taxonomy – Aspergillus and Candida
Sequence (All9Genus)
– Type = Internal Transcribed Spacer
– Size = 97
– Taxonomy – 9 genera

Result 3

Conclusion
– Restriction enzyme data can be modeled and used for computational analysis.
– Introduced a new property of DNA sequences, the Enzyme Cut Order, based on the order in which multiple restriction enzymes cut the sequence.
– This property can be quantified as a similarity score: the length of the Longest Common Subsequence between two enzyme cut orders.
– The resulting similarity matrix shows high phylogenetic resolution when clustered.
– ECO can be considered an alternative ‘coarse-grained’ method for sequence identification and classification, compared with computationally intensive alignment methods.
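
The Enzyme Cut Order and its LCS-based similarity score, as summarized above, can be sketched in a few lines. This is an illustrative sketch under stated assumptions rather than the implementation behind the reported results: the numeric ids and the function names (`enzyme_cut_order`, `lcs_length`, `eco_similarity`, `similarity_matrix`) are hypothetical, the recorded position is the start of the recognition site rather than the exact cut point, only the top strand is scanned, and the only ambiguity code handled is N (any base).

```python
import re

# Hypothetical table for the six frequent cutters of Set 1:
# numeric id -> (name, recognition site). N matches any base.
ENZYMES = {
    1: ("TaqI", "TCGA"),
    2: ("HaeIII", "GGCC"),
    3: ("HinfI", "GANTC"),
    4: ("AluI", "AGCT"),
    5: ("RsaI", "GTAC"),
    6: ("MspI", "CCGG"),
}


def enzyme_cut_order(seq, enzymes=ENZYMES):
    """Return the ECO as (enzyme id, site position) tuples sorted by position."""
    hits = []
    for ez_id, (_name, site) in enzymes.items():
        # Lookahead finds overlapping sites; N is expanded to any base.
        pattern = "(?=" + site.replace("N", "[ACGT]") + ")"
        hits.extend((m.start(), ez_id) for m in re.finditer(pattern, seq.upper()))
    return [(ez_id, pos) for pos, ez_id in sorted(hits)]


def lcs_length(a, b):
    """Length of the longest common subsequence (row-by-row dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def eco_similarity(seq1, seq2, enzymes=ENZYMES):
    """ECO similarity score: LCS length of the two enzyme-id strings."""
    ids1 = [ez for ez, _pos in enzyme_cut_order(seq1, enzymes)]
    ids2 = [ez for ez, _pos in enzyme_cut_order(seq2, enzymes)]
    return lcs_length(ids1, ids2)


def similarity_matrix(seqs, enzymes=ENZYMES):
    """Pairwise ECO similarity scores for a list of sequences."""
    ecos = [[ez for ez, _pos in enzyme_cut_order(s, enzymes)] for s in seqs]
    return [[lcs_length(a, b) for b in ecos] for a in ecos]


if __name__ == "__main__":
    # Toy sequences, not real ITS data.
    s1 = "AATCGAAGGCCTTAGCTAA"  # cut order: TaqI, HaeIII, AluI
    s2 = "GGCCAATCGATTAGCT"     # cut order: HaeIII, TaqI, AluI
    print(enzyme_cut_order(s1))
    print(eco_similarity(s1, s2))
```

The score compares enzyme ids only, so two sequences with the same cut order at shifted positions score as highly as identical ones. That insensitivity to exact position is what makes the measure coarse-grained.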