COMP 482: Design and Analysis of Algorithms, Spring 2012
Lecture 16
Prof. Swarat Chaudhuri

6.3 Segmented Least Squares

Least squares. Foundational problem in statistics and numerical analysis. Given n points in the plane (x1, y1), (x2, y2), ..., (xn, yn), find a line y = ax + b that minimizes the sum of the squared errors:

    SSE = Σ_{i=1..n} (yi - a·xi - b)²

Solution. Calculus ⇒ the minimum error is achieved when

    a = [ n Σi xi·yi - (Σi xi)(Σi yi) ] / [ n Σi xi² - (Σi xi)² ],    b = [ Σi yi - a Σi xi ] / n

Segmented least squares. Points lie roughly on a sequence of several line segments. Given n points in the plane (x1, y1), (x2, y2), ..., (xn, yn) with x1 < x2 < ... < xn, find a sequence of lines that minimizes f(x).

Q. What's a reasonable choice for f(x) to balance accuracy (goodness of fit) and parsimony (number of lines)?

A. Find a sequence of lines that minimizes:
- the sum of the sums of the squared errors E in each segment
- the number of lines L
Tradeoff function: E + cL, for some constant c > 0.

Dynamic Programming: Multiway Choice

Notation.
- OPT(j) = minimum cost for points p1, p2, ..., pj.
- e(i, j) = minimum sum of squares for points pi, pi+1, ..., pj.

To compute OPT(j): the last segment uses points pi, pi+1, ..., pj for some i, at cost e(i, j) + c + OPT(i-1). Hence:

    OPT(j) = 0                                            if j = 0
    OPT(j) = min_{1 ≤ i ≤ j} { e(i, j) + c + OPT(i-1) }   otherwise

Segmented Least Squares: Algorithm

    INPUT: n, p1, ..., pn, c

    Segmented-Least-Squares() {
        M[0] = 0
        for j = 1 to n
            for i = 1 to j
                compute the least square error e(i, j) for the segment pi, ..., pj
        for j = 1 to n
            M[j] = min_{1 ≤ i ≤ j} ( e(i, j) + c + M[i-1] )
        return M[n]
    }

Running time. O(n³). Bottleneck = computing e(i, j) for O(n²) pairs, O(n) per pair using the formula above. This can be improved to O(n²) by pre-computing various statistics.
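To make the recurrence concrete, here is a minimal Python sketch of the algorithm above (function names are my own, not from the lecture). Each e(i, j) is computed directly from the closed-form formula, i.e. the O(n³) version, not the O(n²) variant that pre-computes running statistics. The distinct x-coordinates guaranteed by the problem statement keep the denominator nonzero.

    def least_square_error(points):
        # SSE of the best-fit line through points (assumes distinct x's)
        n = len(points)
        if n <= 2:
            return 0.0  # one or two points can always be fit exactly
        sx = sum(x for x, _ in points)
        sy = sum(y for _, y in points)
        sxy = sum(x * y for x, y in points)
        sxx = sum(x * x for x, _ in points)
        a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
        b = (sy - a * sx) / n
        return sum((y - a * x - b) ** 2 for x, y in points)

    def segmented_least_squares(points, c):
        # points sorted by x; returns the minimum total cost E + c*L
        n = len(points)
        # e[i][j] = least-squares error of one segment through points i..j (0-indexed)
        e = [[least_square_error(points[i:j + 1]) for j in range(n)]
             for i in range(n)]
        M = [0.0] * (n + 1)  # M[j] = OPT(j), optimum over the first j points
        for j in range(1, n + 1):
            M[j] = min(e[i - 1][j - 1] + c + M[i - 1] for i in range(1, j + 1))
        return M[n]

As a sanity check: with c = 0 the optimum is 0 (every point gets its own segment), while a very large c forces a single line over all the points.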
6.6 Sequence Alignment

String Similarity

How similar are two strings? For example, ocurrance vs. occurrence:

    o c u r r a n c e -
    o c c u r r e n c e      6 mismatches, 1 gap

    o c - u r r a n c e
    o c c u r r e n c e      1 mismatch, 1 gap

    o c - u r r - a n c e
    o c c u r r e - n c e    0 mismatches, 3 gaps

Edit Distance

Applications. Basis for Unix diff. Speech recognition. Computational biology.

Edit distance. [Levenshtein 1966, Needleman-Wunsch 1970] Gap penalty δ; mismatch penalty α_pq. Cost = sum of gap and mismatch penalties.

    C T G A C C T A C C T
    C C T G A C T A C A T        cost = α_TC + α_GT + α_AG + 2·α_CA

    - C T G A C C T A C C T
    C C T G A C - T A C A T      cost = 2δ + α_CA

Sequence Alignment

Goal. Given two strings X = x1 x2 ... xm and Y = y1 y2 ... yn, find an alignment of minimum cost.

Def. An alignment M is a set of ordered pairs xi-yj such that each item occurs in at most one pair and there are no crossings.

Def. The pairs xi-yj and xi'-yj' cross if i < i', but j > j'.

    cost(M) =  Σ_{(xi, yj) ∈ M} α_{xi yj}                       (mismatch)
             + Σ_{i: xi unmatched} δ + Σ_{j: yj unmatched} δ    (gap)

Ex: CTACCG vs. TACATG. Sol: M = x2-y1, x3-y2, x4-y3, x5-y4, x6-y6.

    x1 x2 x3 x4 x5    x6
    C  T  A  C  C  -  G
    -  T  A  C  A  T  G
       y1 y2 y3 y4 y5 y6

Sequence Alignment: Problem Structure

Def. OPT(i, j) = min cost of aligning strings x1 x2 ... xi and y1 y2 ... yj.
- Case 1: OPT matches xi-yj. Pay mismatch for xi-yj + min cost of aligning x1 x2 ... xi-1 and y1 y2 ... yj-1.
- Case 2a: OPT leaves xi unmatched. Pay gap for xi + min cost of aligning x1 x2 ... xi-1 and y1 y2 ... yj.
- Case 2b: OPT leaves yj unmatched. Pay gap for yj + min cost of aligning x1 x2 ... xi and y1 y2 ... yj-1.

    OPT(i, j) = j·δ                                 if i = 0
    OPT(i, j) = i·δ                                 if j = 0
    OPT(i, j) = min { α_{xi yj} + OPT(i-1, j-1),
                      δ + OPT(i-1, j),
                      δ + OPT(i, j-1) }             otherwise

Sequence Alignment: Algorithm

    Sequence-Alignment(m, n, x1x2...xm, y1y2...yn, δ, α) {
        for i = 0 to m
            M[i, 0] = i·δ
        for j = 0 to n
            M[0, j] = j·δ
        for i = 1 to m
            for j = 1 to n
                M[i, j] = min(α[xi, yj] + M[i-1, j-1],
                              δ + M[i-1, j],
                              δ + M[i, j-1])
        return M[m, n]
    }

Analysis. Θ(mn) time and space. For English words or sentences, m, n ≤ 10. For computational biology, m = n = 100,000: 10 billion ops is OK, but a 10 GB array? (Solution: sequence alignment in linear space.)

Q1: Least-cost concatenation

You have a DNA sequence A of length n; you also have a "library" of shorter strings, each of length m < n. Your goal is to generate a concatenation C of strings B1, ..., Bk from the library such that the cost of aligning C to A is as low as possible. You can assume a gap cost δ and a mismatch cost α_pq for determining the cost of alignment.

Answer

Let A[x:y] denote the substring of A consisting of its symbols from position x to position y. Let c(x, y) be the cost of optimally aligning A[x:y] to any string in the library. Let OPT(j) be the alignment cost of the optimal solution on the string A[1:j]. Then

    OPT(0) = 0
    OPT(j) = min_{1 ≤ t ≤ j} { c(t, j) + OPT(t-1) }   for j ≥ 1

[Essentially, t is a "breakpoint" where you choose to start a new library component.]

6.4 Knapsack Problem

Q2: Knapsack Problem

Knapsack problem. Given n objects and a "knapsack." Item i weighs wi > 0 kilograms and has value vi > 0. The knapsack has a capacity of W kilograms. Goal: fill the knapsack so as to maximize the total value.

Ex: W = 11.

    Item   Value   Weight
      1       1        1
      2       6        2
      3      18        5
      4      22        6
      5      28        7

Ex: { 3, 4 } has value 40.

Greedy: repeatedly add the item with maximum ratio vi / wi. Ex: { 5, 2, 1 } achieves only value = 35 ⇒ greedy is not optimal.

Dynamic Programming: False Start

Def. OPT(i) = max profit subset of items 1, ..., i.
- Case 1: OPT does not select item i. OPT selects best of { 1, 2, ..., i-1 }.
- Case 2: OPT selects item i. Accepting item i does not immediately imply that we will have to reject other items; without knowing what other items were selected before i, we don't even know if we have enough room for i.

Conclusion. Need more sub-problems!

Dynamic Programming: Adding a New Variable

Def. OPT(i, w) = max profit subset of items 1, ..., i with weight limit w.
- Case 1: OPT does not select item i. OPT selects best of { 1, 2, ..., i-1 } using weight limit w.
- Case 2: OPT selects item i. New weight limit = w - wi; OPT selects best of { 1, 2, ..., i-1 } using this new weight limit.

    OPT(i, w) = 0                                    if i = 0
    OPT(i, w) = OPT(i-1, w)                          if wi > w
    OPT(i, w) = max { OPT(i-1, w),
                      vi + OPT(i-1, w - wi) }        otherwise

Knapsack Problem: Bottom-Up

Knapsack. Fill up an n-by-W array.

    Input: n, W, w1, ..., wn, v1, ..., vn

    for w = 0 to W
        M[0, w] = 0
    for i = 1 to n
        for w = 1 to W
            if (wi > w)  M[i, w] = M[i-1, w]
            else         M[i, w] = max { M[i-1, w], vi + M[i-1, w - wi] }
    return M[n, W]

Knapsack Algorithm

W = 11, with the items from the table above:

                         w = 0   1   2   3   4   5   6   7   8   9  10  11
    ∅                        0   0   0   0   0   0   0   0   0   0   0   0
    { 1 }                    0   1   1   1   1   1   1   1   1   1   1   1
    { 1, 2 }                 0   1   6   7   7   7   7   7   7   7   7   7
    { 1, 2, 3 }              0   1   6   7   7  18  19  24  25  25  25  25
    { 1, 2, 3, 4 }           0   1   6   7   7  18  22  24  28  29  29  40
    { 1, 2, 3, 4, 5 }        0   1   6   7   7  18  22  28  29  34  35  40

OPT: { 4, 3 }, value = 22 + 18 = 40.

Knapsack Problem: Running Time

Running time. Θ(nW). Not polynomial in the input size! "Pseudo-polynomial." The decision version of Knapsack is NP-complete. [Chapter 8]
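As a sanity check on the table above, here is a minimal Python sketch of the bottom-up algorithm (function and variable names are my own). On the instance from the table it returns 40, matching OPT = { 3, 4 }.

    def knapsack(values, weights, W):
        # M[i][w] = OPT(i, w): best value using items 1..i with weight limit w
        n = len(values)
        M = [[0] * (W + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):
            vi, wi = values[i - 1], weights[i - 1]
            for w in range(1, W + 1):
                if wi > w:
                    M[i][w] = M[i - 1][w]
                else:
                    M[i][w] = max(M[i - 1][w], vi + M[i - 1][w - wi])
        return M[n][W]

    # The instance from the table above: items 1..5, W = 11
    values  = [1, 6, 18, 22, 28]
    weights = [1, 2, 5, 6, 7]
    print(knapsack(values, weights, 11))  # prints 40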
Knapsack approximation algorithm. There exists a polynomial-time algorithm that produces a feasible solution with value within 0.01% of the optimum. [Section 11.8]

Q3: Longest palindromic subsequence

Give an algorithm to find the longest subsequence of a given string A that is a palindrome. For example:

    "amantwocamelsacrazyplanacanalpanama"

Q3-a: Palindromes (contd.)

Every string can be decomposed into a sequence of palindromes. Give an efficient algorithm to compute the smallest number of palindromes that make up a given string.
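One possible starting point for Q3, not taken from the lecture: a standard interval dynamic program over substrings, sketched below in Python (names are my own). L(i, j) is the length of the longest palindromic subsequence of s[i..j]; there are O(n²) subproblems, each solved in O(1) time.

    from functools import lru_cache

    def longest_palindromic_subsequence(s):
        # L(i, j) = 2 + L(i+1, j-1)            if s[i] == s[j]
        # L(i, j) = max(L(i+1, j), L(i, j-1))  otherwise
        @lru_cache(maxsize=None)
        def L(i, j):
            if i > j:
                return 0
            if i == j:
                return 1
            if s[i] == s[j]:
                return 2 + L(i + 1, j - 1)
            return max(L(i + 1, j), L(i, j - 1))
        return L(0, len(s) - 1) if s else 0

For Q3-a, one natural approach mirrors the breakpoint recurrence from Q1's answer: let OPT(j) be the fewest palindromes covering A[1:j], and take the minimum of 1 + OPT(t-1) over all t ≤ j such that A[t:j] is a palindrome.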