Report on “Solving the Restriction Mapping Problem Using the Branch and Bound Algorithm”

Abstract

Restriction enzymes, which are produced by bacteria, are used to cut DNA at specific sites called ‘restriction sites’. Each enzyme has its own specific recognition and cut site. These sites are usually 4-6 base pairs long. A DNA molecule with n restriction sites produces n+1 fragments when it is cut at every site (the complete digest set). If the DNA sample is exposed to the restriction enzyme for only a limited amount of time, it is not cut at all restriction sites, and the experiment generates the set of all possible restriction fragments between every two cuts. This set of fragments, called the ‘partial digest set’, is used to determine the positions of the restriction sites in the DNA sequence. Before DNA sequencing was invented, calculating the complete digest set from the partial digest set was the only way to determine the restriction sites in DNA and, through them, the sequence of base pairs.

This project implements a program based on the branch and bound algorithm. Given a partial digest set, the program outputs the complete digest set. The one exception is that the same partial digest set can correspond to two different complete digest sets, called homometric sets. The program behaves like a recursive selection sort, with a running time of O(n^2) in the average case.

Introduction

Restriction mapping is the process of obtaining structural information about a DNA sequence with the help of restriction enzymes. Restriction enzymes are endonucleases that recognize specific base sequences in DNA, called ‘restriction sites’, and cleave the DNA there. Because DNA sequencing technology was developed only relatively recently, restriction maps had earlier become powerful research tools in molecular biology, helping to locate genetic markers. The distance between two restriction sites can be measured experimentally by gel electrophoresis.
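To make the relationship between restriction-site positions and measured fragment lengths concrete, the following minimal Java sketch computes the multiset of pairwise distances (the partial digest set) for a known set of site positions. The class name, method name, and example positions are illustrative assumptions and are not part of the project's program.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class PairwiseDistances {
    // Returns the sorted multiset of distances between every pair of positions.
    // For n positions this yields n(n-1)/2 lengths, repeats included.
    static List<Integer> partialDigest(int[] sites) {
        List<Integer> lengths = new ArrayList<>();
        for (int i = 0; i < sites.length; i++) {
            for (int j = i + 1; j < sites.length; j++) {
                lengths.add(Math.abs(sites[j] - sites[i]));
            }
        }
        Collections.sort(lengths);
        return lengths;
    }

    public static void main(String[] args) {
        // Hypothetical example: both ends of the molecule plus three cut sites.
        int[] sites = {0, 2, 4, 7, 10};
        System.out.println(partialDigest(sites));
        // prints [2, 2, 3, 3, 4, 5, 6, 7, 8, 10]
    }
}
```

The restriction mapping problem is the reverse direction: given only the list of lengths, recover the positions.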
A DNA molecule can therefore be either fully digested by restriction enzymes or partially digested, in which case each site is cut with a probability less than 1, thereby producing all possible fragments between every pair of sites. The objective of this project is to obtain the complete digest set from a given partial digest set.

The restriction digest problem can be solved with various algorithms:

1. Brute Force Algorithm: an exhaustive search that always finds the solution but is very slow and needs memory at each step.
2. Another Brute Force Algorithm: more efficient and faster than Brute Force, but still slow.
3. Branch and Bound Algorithm: the most efficient of the three and faster than the two above.

Branch and Bound Algorithm

Here Δ(y, X) denotes the multiset of distances between y and each element of X.

PARTIALDIGEST(L)
    width ← maximum element in L
    DELETE(width, L)
    X ← {0, width}
    PLACE(L, X)

PLACE(L, X)
    if L is empty
        output X
        return
    y ← maximum element in L
    if Δ(y, X) ⊆ L
        add y to X and remove lengths Δ(y, X) from L
        PLACE(L, X)
        remove y from X and add lengths Δ(y, X) to L
    if Δ(width − y, X) ⊆ L
        add width − y to X and remove lengths Δ(width − y, X) from L
        PLACE(L, X)
        remove width − y from X and add lengths Δ(width − y, X) to L
    return

The above algorithm is implemented in Java. The methods involved in solving the problem are described below: one main method and ten other class methods. The application's process flow is shown in Fig. 1.

main: This is the main method. It first calls the assignData method, which validates and accepts the user's input as the partial digest set. It then places 0 and the last (largest) element in the complete digest set and removes that element from the partial digest set with the deletePD method. Finally it calls placeL and prints the array.

placeL: This is the central method, which calls itself recursively and calls the other methods. It first takes one element at a time from the right end of the partial digest array.
It builds the distance array for that number with ‘distanceArray’, then calls the ‘ifDExistInL’ method to check whether all of those distances exist in the partial digest (PD) set. If so, the number is added to the complete digest (CD) set and removed from the partial digest set. If not, the number is subtracted from the largest number in the complete digest set, the distance array is rebuilt for this new number, and the check against the PD set is repeated. If the check passes, the new number is included in the CD set; otherwise the method moves on to the next number. The method prints every iteration, which makes the process easier to follow.

ifDExistInL: A boolean method that compares each element of the distance array with the PD set and returns true or false accordingly.

assignData: Another important method, which first validates the number of PD elements given by the user, using the formula

    numOfCD = (int)(Math.sqrt(numOfPD * 2.0) + 1)

Since numOfPD = numOfCD(numOfCD - 1) / 2, the method then rechecks whether the calculated numOfCD is valid. It then takes the PD numbers, separated by commas, and puts them in an array using string functions.

distanceArray: Calculates the distance array for any number passed as a parameter.

deletePD: Called by the placeL method. Once a number has been validated, it deletes every element of the distance array that is present in the PD set. If a number is repeated in the PD set, only one occurrence is deleted, since the other occurrence may be generated by some other number.

addCD: If the number validation in placeL succeeds, the number is added to the array holding the complete digest.

sortPD: Sorts the elements after the user input and after every deletion from the partial digest set.

sortCD: Sorts the elements after every call to addCD.

printArray: Prints the given array enclosed in curly brackets ({}).
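The method descriptions above can be sketched compactly as follows. This is a minimal illustration under stated assumptions, not the project's actual code: it keeps the partial digest as a TreeMap-based multiset instead of the sorted arrays described above, and the names PartialDigestSketch, solve, place, tryPosition, containsAll, removeOne, and siteCount are introduced here for illustration. It assumes a non-empty input of positive integer lengths.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

class PartialDigestSketch {
    // Remaining lengths as a multiset: length -> number of occurrences.
    private final TreeMap<Integer, Integer> remaining = new TreeMap<>();
    private final TreeSet<Integer> placed = new TreeSet<>();
    private final List<List<Integer>> solutions = new ArrayList<>();
    private int width;

    // Validation formula from the text; returns -1 if numOfPD is not n(n-1)/2.
    static int siteCount(int numOfPD) {
        int numOfCD = (int) (Math.sqrt(numOfPD * 2.0) + 1);
        return numOfCD * (numOfCD - 1) / 2 == numOfPD ? numOfCD : -1;
    }

    static List<List<Integer>> solve(int[] lengths) {
        PartialDigestSketch s = new PartialDigestSketch();
        for (int len : lengths) s.remaining.merge(len, 1, Integer::sum);
        s.width = s.remaining.lastKey();   // largest length spans the whole molecule
        s.removeOne(s.width);
        s.placed.add(0);
        s.placed.add(s.width);
        s.place();
        return s.solutions;
    }

    private void place() {
        if (remaining.isEmpty()) {         // every length explained: record X
            solutions.add(new ArrayList<>(placed));
            return;
        }
        int y = remaining.lastKey();
        tryPosition(y);                    // measure y from the left end
        if (width - y != y) tryPosition(width - y); // or from the right end
    }

    private void tryPosition(int p) {
        List<Integer> dist = new ArrayList<>();
        for (int x : placed) dist.add(Math.abs(p - x));
        if (!containsAll(dist)) return;    // bound: prune this branch
        for (int len : dist) removeOne(len);
        placed.add(p);
        place();
        placed.remove(p);                  // backtrack: restore X and L
        for (int len : dist) remaining.merge(len, 1, Integer::sum);
    }

    // Multiset containment: each distance must occur often enough in L.
    private boolean containsAll(List<Integer> dist) {
        Map<Integer, Integer> need = new TreeMap<>();
        for (int len : dist) need.merge(len, 1, Integer::sum);
        for (Map.Entry<Integer, Integer> e : need.entrySet()) {
            if (remaining.getOrDefault(e.getKey(), 0) < e.getValue()) return false;
        }
        return true;
    }

    // Removes a single occurrence, so repeated lengths are deleted only once.
    private void removeOne(int len) {
        remaining.computeIfPresent(len, (k, c) -> c == 1 ? null : c - 1);
    }

    public static void main(String[] args) {
        int[] pd = {2, 2, 3, 3, 4, 5, 6, 7, 8, 10};
        System.out.println(siteCount(pd.length));  // prints 5
        System.out.println(solve(pd));
        // prints [[0, 3, 6, 8, 10], [0, 2, 4, 7, 10]]
    }
}
```

For this input the sketch reports a solution together with its mirror image, because it explores both the left and the right placement whenever both pass the distance check.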
[Fig 1: Application Process Flow. Flowchart steps: Start; Read Data (read comma-separated data, split to array, calculate number of complete digest elements); Validate Data (Pass / Fail); Evaluate next element; Compute Distance Array; If Dist Array exists in PD (Yes / No); Delete from PD & Store in CD; If PD = NULL (Yes); Show Result in CD.]

Result

The Java program gives the final result as the complete digest set. The branch and bound algorithm works very well on high-quality PDP (partial digest problem) data. Unlike the corresponding PDP algorithms BruteForce and AnotherBruteForce, branch and bound is very fast and efficient. At each point we examine two alternatives, left or right, and rule out the obviously incorrect decision. The algorithm is very fast for most instances of PDP because only one of the two alternatives is viable at any step.

Let T(n) be the maximum time PartialDigest takes to find the solution for an n-point instance of the PDP. If there is only one viable alternative at every step, the program steadily reduces the size of the problem by one and calls itself recursively, which gives the same recurrence as RECURSIVE SELECTION SORT:

    T(n) = T(n-1) + n,  T(1) = 1

Therefore,

    T(n) = n + T(n-1)
         = n + (n-1) + T(n-2)
         = n + (n-1) + (n-2) + ... + 3 + 2 + T(1)
         = O(n^2)

So in this case the upper bound of the algorithm is quadratic time, O(n^2). However, if there are two viable alternatives at every step, then

    T(n) = 2T(n-1) + n ≥ 2^(n-1)

so in this case the algorithm takes exponential time.

Exception

One exception for this program arises when the same partial digest set corresponds to two possible complete digest sets. Such sets are called homometric sets. In this case the partial digest set alone cannot determine which complete digest set is the correct one.

Future Works

The approach used for the partial digest problem can be applied to the following two problems.

Motif finding problem: Given a set of DNA sequences, find a set of motifs, one from each sequence, that maximizes the consensus score.

Median String problem: Given a set of DNA sequences, find a median string.
Both of the above problems can be solved with a branch and bound algorithm. In these applications it is faster than the exhaustive BruteForce search, but unfortunately its running time is still exponential.

Conclusion

The partial digest problem can be solved by various algorithms, such as BruteForce, AnotherBruteForce, and branch and bound. The last of these is significantly faster than the other two on a good PDP data set. The problem has an exception, however, when the data set is homometric, that is, when two complete digest sets are possible for the same partial digest set. The running time is usually O(n^2) in the best or average case, but it becomes exponential in the worst case.

Reference

Neil C. Jones and Pavel A. Pevzner, An Introduction to Bioinformatics Algorithms.
K. N. King, An Introduction to Java Programming.