15-853 Final Exam
Yangdong Deng

1. Solution:
(A). In addition to the compressed data (we assume the total code length information is included), we also need to send the symbol list (i.e. the vocabulary) in its original order, as well as the probability model, so that the data can be uncompressed.
(B). With the move-to-front heuristic, the repeated message ABCAB is converted to {0, 1, 2, 2, 2, 1, 1, 2, 2, 2, 1, 1, 2, 2, 2, …}, i.e. the encoded message is the sequence {1, 1, 2, 2, 2} repeated (we ignore the initial irregular part). In this sequence the numbers {1, 2} asymptotically have probabilities {0.4, 0.6}, so their self-informations are {1.32, 0.74} bits. If we use arithmetic coding and the sequence {1, 1, 2, 2, 2} is repeated N times, the asymptotic cost is
$$\frac{2N \times 1.32 + 3N \times 0.74}{5N} \approx 0.972 \text{ bits/symbol}$$
(C). With the move-to-front heuristic, the message ABCDE is converted to {0, 1, 2, 3, 4, 3, 3, 4, 4, 4, 4, 4, 4, 4, …}. In this sequence the number 4 asymptotically has probability 1, so its self-information is 0. If we use arithmetic coding, the asymptotic cost is
$$\frac{5N \times 0}{5N} = 0 \text{ bits/symbol}$$
(D). When the file type changes, move-to-front will, after a sufficiently long time, rearrange the symbol list to suit the new data. Thus this heuristic adapts well.
(E). In English, some character combinations, e.g. "er", "th", etc., appear significantly more frequently than others. Thus if the symbol pairs are carefully ordered, better compression is expected. In addition, intuitively we can obtain a shorter coded message because we handle two symbols at a time.

2. Solution:
(A). To decode with McEliece's method in polynomial time, we first compute $c' = cP^{-1} = mSG + eP^{-1}$, where c is the encoded message. Note that $eP^{-1}$ has the same weight as e. Next we use the decoding algorithm for Goppa codes to get $m' = mS$, because the Hamming distance between $m'G$ and $c'$ is less than or equal to t. Finally we recover the original message by computing $m = m'S^{-1}$.
(B).
Since $c = mG' + e = mSGP + e$ and G' is the public key, we can choose a known message m and obtain c by setting e = 0. Next we try every possible permutation matrix P' (n! possibilities), i.e. we compute
$$c(P')^{-1} = mSGP(P')^{-1}$$
If $c(P')^{-1}$ is a legal codeword of the Goppa code, we have possibly found P. Next, we study $SG = G'P^{-1}$. The brute-force method is to try all combinations of S and G. Because S is a binary k × k matrix, there are at most $2^{k^2}$ possible choices of S. In addition, the number of possible matrices for the Goppa code is bounded by $O(2^{mt/2})$, where n is the final code length and $n = 2^m$.
(C). If we need to correct up to 10 bit-errors, we have to reduce t (= (101-1)/2 = 50 in the original code) in the encryption process to 50 - 10 = 40. Note that $c = mG' + e$ where $G' = SGP$. If we randomly pick k = 524 columns of G' and the corresponding k components of both c and e, we have
$$c_k = mG'_k + e_k$$
If we are lucky and the components of $e_k$ we picked are all 0, we simply get
$$c_k = mG'_k$$
Because G' is the public key, we can easily derive m by
$$m = c_k (G'_k)^{-1}$$
Because we now add an e with only 40 ones instead of 50 ones to mG', the number of zeros in e increases, which means the probability of picking an all-zero $e_k$ is larger, provided no additional errors are introduced into the Goppa code (in other words, though our code can correct up to 10 bit-errors, we do not necessarily have 10 errors). Hence, the security is impaired. In fact, the probability mentioned above is
$$\binom{n-t}{k} \Big/ \binom{n}{k}$$
so the selection of t is a little more intricate. However, for the numbers in this problem, the probability of picking an all-zero $e_k$ does become larger.
(D). The advantage of McEliece's method over RSA is its much faster speed. The disadvantages are: an enormous public key, a cipher-text that is twice as large as the message, and greater exposure to attack (many attack techniques exist for a method that has never been used extensively).

3. Solution:
(A).
Note that matrix A has the following form:

              edge (i, j)   edge (k, i)
  vertex i        -1            +1
  vertex j        +1             0
  vertex k         0            -1

Each column corresponds to an edge and has only two non-zero entries, +1 and -1. Also note that an edge (i, j) is directed from i to j.
When the simplex method is used to solve the minimum-cost network flow problem, at each step we choose a basis corresponding to a set of edges. The column vectors of A corresponding to the basis must be linearly independent. Now suppose the edges corresponding to these columns form a cycle. Without loss of generality, we assume the cycle consists of the edges [i1, i2], [i2, i3], …, [im, i1], where m is the number of edges in the cycle (with [i, j] we denote an edge that may be directed either from i to j or from j to i).
First, assume the edges in the cycle all point in the same direction, i.e. (i1, i2), (i2, i3), …, (im, i1). In the entries of A corresponding to this set of edges, a vertex i contributes a -1 to edge (i, j), a +1 to edge (j, i), and 0 to all other edges. Thus, if we add the columns of A corresponding to these edges, the sum is the zero column vector, which shows that the columns in this basis are linearly dependent.
If the edges in the cycle do not all point in the same direction, we multiply by -1 the column vectors of those edges that are reversed with respect to the sequence (i1, i2), (i2, i3), …, (im, i1) discussed above, and again add all the columns. We again get the zero vector, which shows that the columns in this basis are linearly dependent.
This contradicts the fact that all columns in a basis must be linearly independent. Thus there is no cycle among the edges corresponding to a basis.
(B). In any intermediate solution during the simplex method, there cannot be two different paths with nonzero flow between a source and a sink.
Note that nonzero flow has to be carried on paths contained in the basis, because the flow variables corresponding to non-basis edges must equal zero. Thus if there were two different such paths, we would have a cycle in the basis, which is impossible as shown in part (A).
(C). As shown in parts (A) and (B), there is no cycle among the edges corresponding to a basis. Hence each basic solution corresponds to a spanning tree of the graph. Accordingly, each simplex step finds another spanning tree by adding an edge to the current tree and then deleting another edge from the resulting cycle. The final optimum is then the minimum spanning tree.

4. Solution:
Let us first solve a simpler problem: do K colors (K ≤ |V|, where |V| is the number of vertices in the graph) suffice to color the graph? We can formulate this as finding a feasible solution to the following problem.
For a vertex i and a color k (1 ≤ k ≤ K) we define a variable $x_{ik}$:
$$x_{ik} = \begin{cases} 1, & \text{vertex } i \text{ is labeled with color } k \\ 0, & \text{otherwise} \end{cases}$$
For an edge (i, j) and a color k, we have
$$x_{ik} + x_{jk} \le 1$$
which means vertices i and j cannot be given the same color. There are K|E| such inequalities over all edges and all colors.
For a vertex i:
$$\sum_{k=1}^{K} x_{ik} = 1$$
There are |V| such equations.
Thus we have K|V| variables and K|E| + |V| constraints. To find the minimum number of colors needed to color the graph, we can begin with K = |V| and do a binary search. The number of searches is bounded by O(log|V|).

5. Solution:
We can model this problem using a Hidden Markov Model, where
Alphabet $\Sigma$ = {A, C, G, T}.
State set Q = {A+, C+, G+, T+, A-, C-, G-, T-}.
State transition probability matrix $A = (a_{ij})$, as given in the problem.
Each state K+ or K- emits the symbol K, where K ∈ {A, C, G, T}.
Thus a state path $V = v_1 v_2 \dots v_n$ is a sequence of states. The probability that a sequence $X = x_1 x_2 \dots x_n$ is generated by the path V is
$$P(X \mid V) = \prod_{i=1}^{n} P(v_i \mid v_{i-1}) = \prod_{i=1}^{n} a_{v_{i-1} v_i}$$
where $v_0$ denotes the start state. The problem is then to find an optimum path V for a given sequence X of nucleotides, such that P(X | V) is maximized.
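As a minimal sketch of the product formula above, the following Python fragment evaluates log P(X | V) for one candidate path. The transition values are placeholders and not the matrix given in the problem: a uniform, hypothetical matrix is used only so that the code runs. The names `STATES`, `uniform_transitions` and `path_log_prob`, and the start probability of 0.5, are illustrative assumptions.

```python
import math

# Eight states: each nucleotide inside (+) or outside (-) a CG island.
STATES = [base + sign for sign in "+-" for base in "ACGT"]


def uniform_transitions():
    """Hypothetical placeholder for the matrix a_kl given in the problem."""
    p = 1.0 / len(STATES)
    return {k: {l: p for l in STATES} for k in STATES}


def path_log_prob(x, path, a, start=0.5):
    """Return log P(X | V), or -inf if the path cannot generate x."""
    if len(x) != len(path) or not x:
        raise ValueError("x and path must be non-empty and equally long")
    # State 'K+' or 'K-' emits exactly the symbol K.
    if any(state[0] != sym for state, sym in zip(path, x)):
        return -math.inf
    logp = math.log(start)  # assumed probability of entering the first state
    for prev, cur in zip(path, path[1:]):
        if a[prev][cur] == 0.0:
            return -math.inf
        logp += math.log(a[prev][cur])  # one factor a_{v_{i-1} v_i} per step
    return logp
```

Working with logarithms keeps the value representable even for long sequences, where the raw product of probabilities would become extremely small.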
This problem can be solved by dynamic programming. Let $V_l(i)$ be the probability of the most probable path that generates $x_1 x_2 \dots x_i$ and ends in state l. Then
$$V_l(i+1) = \max_k \{ V_k(i)\, a_{kl} \}$$
for every state l that emits $x_{i+1}$, and $V_l(i+1) = 0$ for every other state.
Thus the dynamic program is:
Initialization: if the first base is K, K ∈ {A, C, G, T}, then $V_{K+}(1) = V_{K-}(1) = 0.5$ and $V_l(1) = 0$ for all other states l.
Recursion: for i = 1, …, n-1,
$$V_l(i+1) = \max_k \{ V_k(i)\, a_{kl} \}$$
Termination: we pick the path $V^*$ with
$$P(X \mid V^*) = \max_k \{ V_k(n) \}$$
After the above procedure, a simple backtracking pass recovers the optimum path, i.e. the one with maximum probability of generating the input string. By checking every state on the path we can tell whether each symbol was generated inside a CG island or not. The dynamic program runs in O(n) time, since the number of states is a constant.
A possible numerical problem with the above procedure is that repeated multiplication by the $a_{ij}$ may underflow on a computer with limited numerical precision (note that $0 \le a_{ij} \le 1$). To avoid this, we can carry out the recursive step in logarithmic form, i.e.
$$\log V_l(i+1) = \max_k \{ \log V_k(i) + \log a_{kl} \}$$

6. Solution:
To align two strings S and T with lengths m and n, respectively, let w(i, j) be the best score for S[1…i] and T[1…j]. We have
w(i, j) = max{a(i, j), b(i, j), c(i, j)}
where
a(i, j) is the best alignment of S[1…i] and T[1…j] that aligns S[i] and T[j];
b(i, j) is the best alignment of S[1…i] and T[1…j] that aligns S[i] and a gap, '-';
c(i, j) is the best alignment of S[1…i] and T[1…j] that aligns a gap, '-', and T[j].
To pay the gap penalty only when a gap is completed, the recurrence relations should be:
$$a(i,j) = \max \{\, a(i-1,j-1) + \sigma(S[i],T[j]);\;\; b(i-1,j-1) + \sigma(S[i],T[j]) + \alpha;\;\; c(i-1,j-1) + \sigma(S[i],T[j]) + \alpha \,\}$$
$$b(i,j) = \max \{\, a(i-1,j) + \sigma(S[i],-) + \beta;\;\; b(i-1,j) + \sigma(S[i],-) + \beta \,\}$$
$$c(i,j) = \max \{\, a(i,j-1) + \sigma(-,T[j]) + \beta;\;\; c(i,j-1) + \sigma(-,T[j]) + \beta \,\}$$
where $\sigma(x, y)$ is the penalty for aligning x and y, $\beta$ is the gap term added at each gap position, and $\alpha$ is the penalty paid once per gap. Hence, the penalty $\alpha$ is added when a gap is completed.
Initially we need to set (with $\beta$ the gap term of the recurrence):
$$a(i,0) = i\beta, \quad a(0,j) = j\beta, \quad b(0,j) = -\infty, \quad c(i,0) = -\infty$$
The above recurrence can be evaluated by filling in the tables row by row and column by column, so the complexity is O(mn).

7. Solution:
(A). We need to generate the link incidence matrix, or document-to-document matrix, A. In matrix A, $a_{ij}$ is 1 if document i has a link pointing to document j; otherwise it is 0. Then we compute the SVD of A, i.e. $A = U \Sigma V^T$. After this we pick the k largest singular values of A in $\Sigma$ and the corresponding $U_k$ and $V_k$.
(B). For a query q, we convert it to a k-dimensional vector $q' = q^T U_k \Sigma_k^{-1}$. Then we can compare the distances between q' and the vector of each indexed document in the k-dimensional space. Vectors within a given distance of q' are regarded as matched documents.
(C). Note that q' is a k-dimensional vector. Thus the distance between the query and a document can be calculated in O(k). If we have n documents in total, a query takes O(nk).
(D). We can use the same preprocessing. Note that we need to transpose the incidence matrix, so that an entry of 1 in the document-to-document matrix means that a document is pointed to by another document instead of pointing to another document. Accordingly, $A^T = (U \Sigma V^T)^T = V \Sigma U^T$, so V, $\Sigma$, and U can all be reused.
(E). We have to pre-process the text-document matrix with the SVD method to derive a similarity measure between every pair of documents with respect to text terms. These measures can be stored in a matrix. Then we can use a single SVD to calculate the similarities with respect to outgoing links and incoming links, respectively. Finally we combine these three parts according to the fixed linear combination.
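The query step of parts (B) and (C) can be sketched as follows. This is a minimal pure-Python illustration, assuming the rank-k factors have already been computed by some SVD routine. The tiny matrices `U_k`, `sigma_k` and `doc_coords` hold made-up values, and the helper names `fold_in` and `match` are hypothetical; none of them come from the exam.

```python
import math


def fold_in(q, U_k, sigma_k):
    """Map a query vector q (length n) to k dimensions: q' = q^T U_k Sigma_k^{-1}."""
    k = len(sigma_k)
    return [sum(q[i] * U_k[i][j] for i in range(len(q))) / sigma_k[j]
            for j in range(k)]


def match(q_prime, doc_coords, radius):
    """Return indices of documents within `radius` of q'.

    Each distance costs O(k), so scanning n documents costs O(nk).
    """
    hits = []
    for idx, d in enumerate(doc_coords):
        if math.dist(q_prime, d) <= radius:
            hits.append(idx)
    return hits


# Hypothetical rank-2 factors for 3 documents (illustrative values only,
# not the result of an actual SVD).
U_k = [[1.0, 0.0],
       [0.0, 1.0],
       [0.0, 0.0]]
sigma_k = [2.0, 1.0]
doc_coords = [[0.5, 0.0], [0.0, 1.0], [0.4, 0.1]]

q_prime = fold_in([1, 0, 0], U_k, sigma_k)
nearby = match(q_prime, doc_coords, radius=0.2)
```

The linear scan in `match` mirrors the O(nk) bound of part (C); a distance threshold is used here because part (B) describes matching by distance range, though a cosine measure would be an equally plausible choice.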