15-853 Final Exam
Yangdong Deng
1. Solution:
(A). In addition to the compressed data (we assume the total code length information is included), we also
need to send the symbol list (i.e. vocabulary) in the original order as well as the probability model so that
the data can be uncompressed.
(B). After applying the move-to-front heuristic, the repeated message ABCAB is converted to {0, 1, 2, 2, 2, 1,
1, 2, 2, 2, 1, 1, 2, 2, 2, …}, i.e. the final encoded message is a repeated sequence of {1, 1, 2, 2, 2} (we
ignore the initial irregular part).
In this sequence, asymptotically the probabilities of the numbers {1, 2} are {0.4, 0.6}. Thus their
self-information values are {1.32, 0.74} bits. If we use arithmetic coding and assume the {1, 1, 2, 2, 2}
sequence is repeated N times, asymptotically the bits-per-symbol is
$$\frac{2N \times 1.32 + 3N \times 0.74}{5N} \approx 0.972 \text{ bits/symbol}$$
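The computation in part (B) can be checked with a short sketch. It assumes the initial symbol list is A, B, C in that order (which matches the leading 0, 1, 2 of the encoded sequence); the function names `move_to_front` and `entropy_bits` are illustrative, not taken from the exam.

```python
import math
from collections import Counter

def move_to_front(message, alphabet):
    """Encode message as list positions, moving each used symbol to the front."""
    table = list(alphabet)          # assumed initial order of the symbol list
    codes = []
    for symbol in message:
        index = table.index(symbol)
        codes.append(index)
        table.insert(0, table.pop(index))
    return codes

def entropy_bits(sequence):
    """Zeroth-order entropy of a sequence, in bits per symbol."""
    counts = Counter(sequence)
    total = len(sequence)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

codes = move_to_front("ABCAB" * 1000, "ABC")
steady_state = codes[5:]            # drop the initial irregular part
print(codes[:15])
print(entropy_bits(steady_state))
```

The steady-state entropy comes out near 0.971 bits/symbol; the hand computation gives 0.972 because the self-information values were rounded to two decimals first.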
(C). After applying the move-to-front heuristic, the repeated message ABCDE is converted to {0, 1, 2, 3, 4, 3, 3, 4, 4, 4, 4,
4, 4, 4, 4, …}.
In this sequence, asymptotically the probability of the number 4 is 1. Thus its self-information is 0. If we use
arithmetic coding, asymptotically the bits-per-symbol is
$$\frac{5N \times 0}{5N} = 0 \text{ bits/symbol}$$
(D). When the file type changes, move-to-front will, after a sufficiently long time, rearrange the symbol list
so that the symbols that are frequent in the new data sit near the front. Thus this heuristic adapts well.
(E). In English, some character combinations, e.g. “er”, “th”, etc., appear significantly more frequently
than others. Thus, if the symbol pairs are carefully ordered, better compression is expected. In addition,
intuitively we can obtain a shorter coded message because we handle two symbols at a time.
2. Solution:
(A). To decode using McEliece’s method in polynomial time, we first compute $c' = cP^{-1} = mSG + eP^{-1}$,
where c is the encoded message. Note that $eP^{-1}$ has the same weight as e. Next we can use the decoding
algorithm for the Goppa code to get $m' = mS$, because the Hamming distance between $m'G$ and $c'$ is less than or equal
to t. Finally we get the original message by computing $m = m'S^{-1}$.
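The three decoding steps can be illustrated with a toy round trip. This is only a sketch under strong assumptions: the Goppa code is swapped for the Hamming(7,4) code (so t = 1), and the matrices `G`, `H`, `S` and the permutation `PERM` are arbitrary illustrative choices, not parameters from the exam.

```python
def matmul(X, Y):
    """Matrix product over GF(2)."""
    return [[sum(a * b for a, b in zip(row, col)) % 2 for col in zip(*Y)]
            for row in X]

def transpose(M):
    return [list(col) for col in zip(*M)]

# Systematic Hamming(7,4) generator and parity-check matrices (stand-in for Goppa).
G = [[1, 0, 0, 0, 1, 1, 0],
     [0, 1, 0, 0, 1, 0, 1],
     [0, 0, 1, 0, 0, 1, 1],
     [0, 0, 0, 1, 1, 1, 1]]
H = [[1, 1, 0, 1, 1, 0, 0],
     [1, 0, 1, 1, 0, 1, 0],
     [0, 1, 1, 1, 0, 0, 1]]

S = [[1, 1, 0, 0],
     [0, 1, 0, 0],
     [0, 0, 1, 1],
     [0, 0, 0, 1]]
S_INV = S                        # this particular S is its own inverse over GF(2)
PERM = [2, 0, 3, 6, 1, 5, 4]     # hypothetical secret permutation
P = [[1 if j == PERM[i] else 0 for j in range(7)] for i in range(7)]
P_INV = transpose(P)             # inverse of a permutation matrix is its transpose

G_PUB = matmul(matmul(S, G), P)  # public key G' = SGP

def encrypt(m, e):
    """c = mG' + e, where e has weight at most t = 1."""
    c = matmul([m], G_PUB)[0]
    return [(ci + ei) % 2 for ci, ei in zip(c, e)]

def decrypt(c):
    c1 = matmul([c], P_INV)[0]                # c' = cP^-1 = mSG + eP^-1
    syndrome = matmul([c1], transpose(H))[0]  # H c'^T
    if any(syndrome):
        j = transpose(H).index(syndrome)      # column of H equal to the syndrome
        c1[j] ^= 1                            # correct the single error
    m1 = c1[:4]                               # G is systematic, so m' = mS
    return matmul([m1], S_INV)[0]             # m = m'S^-1

message = [1, 0, 1, 1]
error = [0, 0, 0, 0, 0, 1, 0]
print(decrypt(encrypt(message, error)))
```

Because the permutation only moves the single error bit, the syndrome decoder of the underlying code still sees at most t errors, which is the point made in the text.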
(B). Since $c = mG' + e = mSGP + e$, and G’ is the public key, we can choose a known message m and find c by
setting e = 0. Next we try every possible permutation matrix P’ (n! possibilities), i.e.
$c(P')^{-1} = mSGP(P')^{-1}$
If $c(P')^{-1}$ is a legal Goppa codeword, we have possibly found P.
Next, we study $SG = G'P^{-1}$. The brute-force method is to try all combinations of S and G. Obviously,
because S is a binary k × k matrix, there are at most $2^{k^2}$ possible S’s. In addition, the number of possible matrices for
the Goppa code is bounded by $O(2^{mt/2})$, where n is the final code length and $n = 2^m$.
(C). If we need to correct up to 10 bit errors, we have to reduce t (= (101-1)/2 = 50 in the original code) in the
encryption process to 50 - 10 = 40. Note that
$c = mG' + e$
where $G' = SGP$.
If we randomly pick k = 524 columns from G’ and the corresponding k components from both c and e, we have
$c_k = mG'_k + e_k$
If, luckily, the components of $e_k$ that we picked are all 0’s, we simply get
$c_k = mG'_k$
Because G’ is the public key, we can easily derive m by:
$m = c_k (G'_k)^{-1}$
Because we now add an e with only 40 1’s instead of 50 1’s to mG’, the number of 0’s in e increases, which
means the probability of picking an all-zero $e_k$ will be larger if we don’t get additional errors
in the Goppa code (in other words, although our code can correct up to 10 bit errors, we don’t necessarily
have 10 errors). Hence, the security is impaired. Actually, the above-mentioned probability is
$$\binom{n-t}{k} \Big/ \binom{n}{k}$$
Thus the selection of t is a little more intricate. However, for the numbers in this
problem, the probability of picking an all-zero $e_k$ does become larger.
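A quick numerical check of this probability for t = 50 and t = 40 is sketched below. The value n = 1024 is an assumption inferred from n = 2^m with m = 10 and k = n - mt = 524; the helper name `all_zero_probability` is illustrative.

```python
from fractions import Fraction
from math import comb

def all_zero_probability(n, k, t):
    """Probability that k randomly chosen positions avoid all t error positions."""
    return Fraction(comb(n - t, k), comb(n, k))

n, k = 1024, 524                      # n is assumed, see the note above
p_t50 = all_zero_probability(n, k, 50)
p_t40 = all_zero_probability(n, k, 40)

print(float(p_t50))
print(float(p_t40))
print(float(p_t40 / p_t50))           # factor by which the attack gets easier
```

Exact rational arithmetic is used because the binomial coefficients are far too large for floating point; only the final ratios are converted to floats.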
(D). The advantage of McEliece’s method over RSA is its much faster speed. The disadvantages are an
enormous public key, a ciphertext roughly twice as large as the message, and greater vulnerability to attack (or
at least a greater tendency to attract attacks; obviously there are many attack techniques against a method
that has never been used extensively).
3. Solution:
(A). Note that matrix A has the following form:
               edge (i,j)   edge (k,i)
  ith vertex       -1           +1
  jth vertex       +1            0
  kth vertex        0           -1
In each column, which corresponds to an edge, there are only two non-zero entries, +1 and -1.
Also note that an edge (i, j) is directed from i to j. When using the simplex method to solve the
minimum-cost network flow problem, at each step we choose a basis corresponding to a set of edges.
The column vectors in A corresponding to the basis must be linearly independent.
Now suppose the edges corresponding to these columns form a cycle. Without loss of generality, we
assume the cycle consists of edges [i1, i2], [i2, i3], …, [im, i1] (note that [i, j] represents an edge that
can be directed either from i to j or from j to i), where m is the number of edges in the cycle.
First, let’s assume the edges in the cycle are all directed the same way, i.e. (i1, i2), (i2, i3), …, (im, i1). In
the entries of A corresponding to this set of edges, a vertex i contributes a -1 to edge (i, j), a +1 to edge (j,
i), and 0s to all other edges. Thus, if we add together the columns of A corresponding to these edges, the sum
is a zero column vector, which indicates that the columns in this basis are linearly dependent.
If the edges in the cycle are not all directed the same way, then for those edges whose direction is the reverse of
the (i1, i2), (i2, i3), …, (im, i1) sequence discussed above, we can multiply the column vectors by -1 and
again add all of the columns together. Evidently, we again get a zero sum vector, which indicates that the columns in
this basis are linearly dependent.
The above conclusion contradicts the fact that all columns in a basis must be linearly independent. Thus
there is no cycle in the edges corresponding to a basis.
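The dependence argument can be checked on a small example. The sketch below builds incidence columns with -1 at the tail and +1 at the head of each edge, then forms the signed sum around a cycle whose edges point in mixed directions; the graph and the function names are hypothetical.

```python
def edge_column(n_vertices, tail, head):
    """Column of A for a directed edge (tail, head): -1 at tail, +1 at head."""
    column = [0] * n_vertices
    column[tail] = -1
    column[head] = +1
    return column

def signed_cycle_sum(n_vertices, cycle_vertices, cycle_edges):
    """Add the edge columns around a cycle, negating edges that run backwards."""
    total = [0] * n_vertices
    m = len(cycle_vertices)
    for idx, (tail, head) in enumerate(cycle_edges):
        u, v = cycle_vertices[idx], cycle_vertices[(idx + 1) % m]
        sign = 1 if (tail, head) == (u, v) else -1   # -1 for reversed edges
        column = edge_column(n_vertices, tail, head)
        total = [t + sign * c for t, c in zip(total, column)]
    return total

# Cycle 0 -> 1 -> 2 -> 3 -> 0, where edges (2,1) and (0,3) point against it.
vertices = [0, 1, 2, 3]
edges = [(0, 1), (2, 1), (2, 3), (0, 3)]
print(signed_cycle_sum(4, vertices, edges))
```

A nontrivial combination with coefficients +1 and -1 that sums to the zero vector is exactly a linear dependence among the cycle's columns.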
(B). In any intermediate solution produced by a step of the simplex method, there cannot be two
different paths with nonzero flow between a source and a sink. Note that nonzero flow has to be carried
by edges contained in the basis, because the flow variables corresponding to non-basis edges must equal
zero. Thus, if there were two different such paths, we would have a cycle in the basis, which is impossible as
shown in part (A).
(C). As shown in parts (A) and (B), there is no cycle in the edges corresponding to a basis. Hence each
basic solution corresponds to a spanning tree of the graph. Accordingly, each step of simplex finds another
spanning tree by adding an edge to the current tree and then deleting another edge from the resulting cycle. The
final optimum is the spanning tree whose flow has minimum cost.
4. Solutions:
Let’s first solve a simpler problem: do K colors (K ≤ |V|, where |V| is the number of vertices in the graph)
suffice to color the graph? We can formulate this as finding a feasible solution to the following
problem:
For a vertex i and a color k (1 ≤ k ≤ K) we define a variable $x_{ik}$:
$$x_{ik} = \begin{cases} 1, & \text{vertex } i \text{ is labeled with color } k \\ 0, & \text{otherwise} \end{cases}$$
For an edge (i, j) and a color k, we have:
$$x_{ik} + x_{jk} \le 1$$
which means vertices i and j cannot be given the same color. There are K|E| such inequalities over all
edges and all colors.
For a vertex i:
$$\sum_{k=1}^{K} x_{ik} = 1$$
There are |V| such equations.
Thus we have K|V| variables and K|E|+|V| constraints.
To find the minimum number of colors needed to color the graph, we can begin with K = |V| and do a binary
search on K. The number of searches is bounded by O(log|V|).
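The outer binary search can be sketched as follows. As an assumption for illustration, the 0-1 feasibility problem is replaced by a simple backtracking search (`colorable`), which enforces the same two constraint families: one color per vertex, and no shared color across an edge. An integer-programming solver would take its place in the formulation above.

```python
def colorable(n_vertices, edges, K):
    """Feasibility check: can vertices 0..n-1 be colored with K colors?"""
    neighbors = [[] for _ in range(n_vertices)]
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)
    color = [None] * n_vertices

    def assign(v):
        if v == n_vertices:
            return True
        for k in range(K):
            # Edge constraint: no neighbor may already hold color k.
            if all(color[w] != k for w in neighbors[v]):
                color[v] = k          # vertex constraint: exactly one color
                if assign(v + 1):
                    return True
        color[v] = None
        return False

    return assign(0)

def min_colors(n_vertices, edges):
    """Binary search on K; feasibility is monotone in K and K = |V| always works."""
    low, high = 1, n_vertices
    while low < high:
        mid = (low + high) // 2
        if colorable(n_vertices, edges, mid):
            high = mid
        else:
            low = mid + 1
    return low

triangle = [(0, 1), (1, 2), (2, 0)]
square = [(0, 1), (1, 2), (2, 3), (3, 0)]
print(min_colors(3, triangle), min_colors(4, square))
```

The search is valid because any coloring with K colors is also a coloring with K + 1 colors, so the feasible values of K form an upper interval ending at |V|.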
5. Solution:
We can model this problem using a Hidden Markov Model, where
- Alphabet = {A, C, G, T}.
- State set Q = {A+, C+, G+, T+, A-, C-, G-, T-}.
- State transition probability matrix A = ($a_{ij}$) as given in the problem.
- Each state K+ or K- emits the symbol K, where K ∈ {A, C, G, T}.
Thus a state path $V = v_1 v_2 … v_n$ is a sequence of states. The probability that a sequence $X = x_1 x_2 … x_n$ is generated
by the path V is:
$$P(X \mid V) = \prod_{i=1}^{n} P(v_i \mid v_{i-1}) = \prod_{i=1}^{n} a_{v_{i-1} v_i}$$
Then the problem is to find an optimum path V for a given sequence X of nucleotides, such that P(X|V) is
maximized. This problem can be solved by dynamic programming. Let $V_l(i)$ be the probability
of the most probable path that generates $x_1 x_2 … x_i$ and ends in state l. Then
$$V_l(i+1) = \max_k \{ V_k(i) \cdot a_{kl} \}$$
Thus the dynamic programming is:
- Initially, if the first base is K, K ∈ {A, C, G, T}, then $V_{K+}(1) = V_{K-}(1) = 0.5$ and $V_{\text{others}}(1) = 0$.
- Recursion for i = 1, …, n-1: $V_l(i+1) = \max_k \{ V_k(i) \cdot a_{kl} \}$ for each state l that emits $x_{i+1}$, and $V_l(i+1) = 0$ for all other states.
- Finally we pick the path V* with $P(X \mid V^*) = \max_k \{ V_k(n) \}$.
After the above procedure, a simple backtrack procedure will discover the optimum path with maximum
probability to generate the input string. By checking every state in the path we can find whether a symbol is
generated in the CG island or not. Evidently, the above dynamic programming can be efficiently computed
in O(n) time.
A possible numerical problem with the above procedure is that the long product of transition probabilities, $\prod a_{ij}$, may
underflow on a computer with limited numerical precision (note that $0 \le a_{ij} \le 1$). To avoid this, we
can do the recursive step in the logarithmic domain, i.e.
$$\log V_l(i+1) = \max_k \{ \log V_k(i) + \log a_{kl} \}$$
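The log-domain recursion with backtracking can be sketched as below. The transition probabilities here are invented for illustration only (the exam's matrix is not reproduced in this solution): `STAY` is a hypothetical probability of remaining inside or outside an island, and `NEXT_BASE` is a hypothetical next-base distribution for each region.

```python
import math

BASES = "ACGT"
STATES = [base + sign for sign in "+-" for base in BASES]

# Hypothetical parameters, NOT the matrix given in the exam.
STAY = 0.9
NEXT_BASE = {"+": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
             "-": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}}

def log_a(k, l):
    """log of the transition probability from state k to state l."""
    region = STAY if k[1] == l[1] else 1.0 - STAY
    return math.log(region * NEXT_BASE[l[1]][l[0]])

def viterbi(x):
    """Return (most probable state path, its log probability) for sequence x."""
    neg_inf = float("-inf")
    # Initialization: the two states emitting the first base get 0.5 each.
    score = {s: (math.log(0.5) if s[0] == x[0] else neg_inf) for s in STATES}
    backpointers = []
    for base in x[1:]:
        new_score, pointer = {}, {}
        for l in STATES:
            if l[0] != base:              # state l cannot emit this base
                new_score[l] = neg_inf
                continue
            best = max(STATES, key=lambda k: score[k] + log_a(k, l))
            new_score[l] = score[best] + log_a(best, l)
            pointer[l] = best
        backpointers.append(pointer)
        score = new_score
    last = max(STATES, key=lambda s: score[s])
    path = [last]
    for pointer in reversed(backpointers):  # backtrack from the end
        path.append(pointer[path[-1]])
    path.reverse()
    return path, score[last]

path, log_prob = viterbi("ATATATCGCGCGCGCGATATAT")
print("".join(state[1] for state in path))
```

Positions labeled "+" are those the most probable path places inside an island. Since there are only 8 states, each step costs constant time and the whole run is O(n).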
6. Solution:
To align two strings S and T with lengths m and n, respectively, let w(i, j) be the best score for aligning
S[1…i] with T[1…j]. We have
w(i, j) = max{a(i, j), b(i, j), c(i, j)}
where
- a(i, j) is the score of the best alignment of S[1…i] and T[1…j] that aligns S[i] and T[j];
- b(i, j) is the score of the best alignment of S[1…i] and T[1…j] that aligns a gap, ‘-’, and T[j];
- c(i, j) is the score of the best alignment of S[1…i] and T[1…j] that aligns S[i] and a gap, ‘-’.
To pay an + penalty when a gap is completed, the recurrence relation should be:
a(i, j) = max a(i-1, j-1) + (S[i], T[j]);
b(i-1, j) + (S[i], T[j]) +;
c(i, j-1) + (S[i], T[j]) +;
b(i, j) = max a(i-1, j) + (S[i], -)+ ;
b(i-1, j) + (S[i], -) +.
c(i, j) = max a(i, j-1) + (-, T[j])+ ;
c(i, j-1) + (-, T[j]) +.
where (x, y) is the penalty for aligning x and y.
Hence, the  penalty is added when a gap is completed.
Initially we need to set:
a(i, 0)=i
a(0, j)=j
b(0, j)=
c(i, 0)=
Evidently, the above recurrence can be evaluated by walking through every row and column of the tables, and thus
the complexity is O(mn).
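For reference, a three-table affine-gap alignment can be sketched as below. This is the common textbook variant, not a transcription of the recurrence above: it charges the per-gap cost when a gap is opened rather than when it is completed, which gives every alignment the same total score. The tables follow the definitions of a, b and c, and all score values (`match`, `mismatch`, `gap_open`, `gap_extend`) are assumed for illustration.

```python
def affine_align_score(s, t, match=1, mismatch=-1, gap_open=-2, gap_extend=-1):
    """Best global alignment score; a gap of length L costs gap_open + L*gap_extend."""
    neg_inf = float("-inf")
    m, n = len(s), len(t)
    # a: S[i] aligned with T[j]; b: gap aligned with T[j]; c: S[i] aligned with gap.
    a = [[neg_inf] * (n + 1) for _ in range(m + 1)]
    b = [[neg_inf] * (n + 1) for _ in range(m + 1)]
    c = [[neg_inf] * (n + 1) for _ in range(m + 1)]
    a[0][0] = 0
    for j in range(1, n + 1):
        b[0][j] = gap_open + j * gap_extend    # T[1..j] against one leading gap
    for i in range(1, m + 1):
        c[i][0] = gap_open + i * gap_extend    # S[1..i] against one leading gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            pair = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = max(a[i - 1][j - 1], b[i - 1][j - 1], c[i - 1][j - 1]) + pair
            # Either open a new gap after an aligned pair or extend the current gap.
            b[i][j] = max(a[i][j - 1] + gap_open + gap_extend,
                          b[i][j - 1] + gap_extend)
            c[i][j] = max(a[i - 1][j] + gap_open + gap_extend,
                          c[i - 1][j] + gap_extend)
    return max(a[m][n], b[m][n], c[m][n])

print(affine_align_score("ACGT", "AGT"))
```

Each of the (m+1)(n+1) cells of the three tables is filled in constant time, which is where the O(mn) bound comes from.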
7. Solution:
(A). We need to generate the link incidence matrix, or document-to-document matrix A. In matrix A, $a_{ij}$ is 1
if document i has a link pointing to document j; otherwise it is 0. Then we compute the SVD of A, i.e. $A =
U \Sigma V^T$. After this we pick the k largest singular values of A in $\Sigma$ and the corresponding $U_k$ and $V_k$.
(B). For a query q, we convert it to a k-dimensional vector $q' = q^T U_k \Sigma_k^{-1}$. Then we can compare the
distances between q’ and the vector of each indexed document in the k-dimensional space.
Consequently, documents whose vectors lie within a given distance of q’ can be seen as matches.
(C). Note that q’ is a k-dimensional vector. Thus distance between the query and a document can be
calculated in O(k). If we have n documents in total, a query will take O(nk).
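The O(nk) scan can be sketched as a single pass over the reduced document vectors. The sketch assumes the query has already been projected to k dimensions and uses Euclidean distance; the function name, the `radius` threshold and the sample vectors are illustrative.

```python
import math

def documents_within(query_vector, document_vectors, radius):
    """Return (index, distance) for every document within radius of the query."""
    hits = []
    for index, doc in enumerate(document_vectors):       # n documents
        distance = math.sqrt(sum((q - d) ** 2            # O(k) per document
                                 for q, d in zip(query_vector, doc)))
        if distance <= radius:
            hits.append((index, distance))
    return hits

docs = [[0.0, 0.0], [3.0, 4.0], [1.0, 0.0]]   # n = 3 documents, k = 2
print(documents_within([0.0, 0.0], docs, radius=1.5))
```

Sorting the hits by distance would add an O(n log n) term in the worst case, so the sketch returns them in index order to stay within O(nk).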
(D). We can use the same preprocessing. Note that we need to transpose the incidence matrix so that an
entry of ‘1’ in the document-to-document matrix A means a document is pointed to by another document
instead of pointing to another document. Accordingly, $A^T = (U \Sigma V^T)^T = V \Sigma U^T$. Obviously, V, $\Sigma$, and U can
all be reused.
(E). We have to pre-process the text-document matrix using the SVD method to derive a similarity measure
between every pair of documents with respect to text terms. These measures can be stored in a matrix. Then
we can use a single SVD of the link matrix to calculate the similarities with respect to outgoing links and incoming links,
respectively. Finally we combine these three parts according to the fixed linear combination.