Ph.D. Qualifying Exam in CSE, Fall 2006, October 30th, 2006, 9am-5pm

This exam is composed of four sections: numerical methods, discrete algorithms, modeling and simulation, and high-performance computing. Each section has five questions. You are expected to answer six questions from the two subareas you chose on the CSE Qualifying Exam Form (three from each subarea, or four from one and two from the other). This is an open-book, open-note exam, but you are not allowed to ask others for help. To save time, you need NOT type your solutions using a computer, especially for questions with complex equations. Please return your finished exam to Barbara Binder by 5pm.

Good luck!

1 Numerical methods

1. Consider the linear system $Ax = b$, where $A$ is an $n \times n$ matrix with $A(i, i) = 1$ for all $i$, $A(1, j) = j$ for all $j$, and all other elements are zeros, and $b$ is an $n \times 1$ vector where $b(1) = n$ and $b(i) = 1$ for all $i \ge 2$.

(a) (25%) What is the condition number of $A$ in the $L_1$ norm?

(b) (30%) Consider the $n \times 1$ vector $\tilde{x}$ whose components are all 1's as an approximate solution to the above system. Calculate the residual norm $\|b - A\tilde{x}\|_1$. Use the condition number and this residual norm to give an upper bound for the norm $\|x - \tilde{x}\|_1 / \|x\|_1$, where $x$ is the exact solution of the system. You may not actually compute the solution $x$ to the above linear system in answering this question.

(c) (45%) Suppose we have the exact solution $x$ to the above system. Now, we have a new linear system $Cx = b$ where $b$ is the same as above and $C$ is an $n \times n$ matrix with $C(i, i) = 1$ for all $i$, $C(1, j) = j$ for $1 \le j \le n - 1$, $C(j, n) = 1$ for all $j$, and all other elements are zeros. Present a fast algorithm for computing the solution of the new system $Cx = b$. What is the computational complexity of your algorithm? The faster your algorithm is, the better.
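
For readers who want to experiment with the setup of Question 1, the following is a minimal Python sketch that builds $A$ and $b$ as defined in the question for a small $n$ and evaluates the residual 1-norm for the all-ones vector. It is a numerical sanity check only, not a solution; the helper names `build_system` and `residual_norm1` are illustrative and do not come from the exam.

```python
def build_system(n):
    """Return (A, b) with A(i,i) = 1, A(1,j) = j, zeros elsewhere;
    b(1) = n and b(i) = 1 for i >= 2 (1-based indices as in the question)."""
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        A[i][i] = 1.0
    for j in range(n):
        A[0][j] = float(j + 1)  # first row holds 1, 2, ..., n
    b = [float(n)] + [1.0] * (n - 1)
    return A, b


def residual_norm1(A, b, x):
    """Compute ||b - A x||_1 with plain loops."""
    return sum(
        abs(bi - sum(aij * xj for aij, xj in zip(row, x)))
        for row, bi in zip(A, b)
    )


A, b = build_system(4)
x_tilde = [1.0] * 4
r = residual_norm1(A, b, x_tilde)  # residual of the all-ones guess for n = 4
```

Trying several values of `n` gives a pattern to compare against the closed-form residual derived by hand.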
2. Let

\[ A = \begin{pmatrix} a^T \\ \hat{A} \end{pmatrix} \in \mathbb{R}^{m \times n}, \quad m > n, \quad \mathrm{rank}(A) = n. \]

Assume that we have the reduced QR decomposition $A = QR$, where

\[ Q = \begin{pmatrix} q_1^T \\ Q_1 \end{pmatrix} \in \mathbb{R}^{m \times n} \]

has orthonormal columns and $R \in \mathbb{R}^{n \times n}$ is upper triangular. We want to compute the reduced QR decomposition of $\hat{A}$ as efficiently as possible using the $Q$ and $R$ factors of $A$ that we already have. We can do this by first orthogonalizing the unit vector

\[ e_1 = \begin{pmatrix} 1 \\ 0 \end{pmatrix} \]

to the columns of $Q$.

(a) (20%) Suppose the reduced QR decomposition of $( Q \;\; e_1 )$ is

\[ \begin{pmatrix} q_1^T & 1 \\ Q_1 & 0 \end{pmatrix} = \begin{pmatrix} q_1^T & \gamma \\ Q_1 & h \end{pmatrix} \begin{pmatrix} I & x \\ 0 & \alpha \end{pmatrix}. \tag{1} \]

Express $x$ and $\alpha$ in terms of $Q_1$, $q_1$, $\gamma$, or $h$.

(b) (25%) How can the QR decomposition (1) be obtained most efficiently?

(c) (40%) After the QR decomposition of $( Q \;\; e_1 )$ is obtained, how would you proceed to obtain the QR decomposition of $\hat{A}$? Present your algorithm and computational complexity. The less computational complexity your algorithm requires, the better it is.

(d) (15%) Discuss what happens to your algorithm if the matrix $A$ does not have full rank.

3. Given an $n$-by-$n$ matrix $W = [w_{ij}]$, $i, j = 1, \ldots, n$, with $w_{ij} \ge 0$. For $x = [x_1, \ldots, x_n]^T$, define

\[ f(x) = \sum_{i,j=1}^{n} w_{ij} (x_i - x_j)^2. \]

Consider the following optimization problem,

\[ \min \left\{ f(x) \;\middle|\; \sum_{i=1}^{n} x_i = 0, \;\; \sum_{i=1}^{n} x_i^2 = 1 \right\}. \]

(a) (50%) Show that the above optimization problem is equivalent to finding the eigenvector corresponding to the second smallest eigenvalue of the following matrix

\[ L = D - W, \]

where $D = \mathrm{diag}(d_1, \ldots, d_n)$ is a diagonal matrix with

\[ d_i = \sum_{j=1}^{n} w_{ij}. \]

Hint: Use direct computation to rewrite $f(x)$ in terms of $L$ and $x$.

(b) (50%) Consider a graph $G$ with $n$ vertices $v_1, \ldots, v_n$, and there is an edge between $v_i$ and $v_j$ if and only if $w_{ij} \ne 0$. Show that the second smallest eigenvalue of $L$ is positive if and only if $G$ is connected.
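
The hint in Question 3(a) can be explored numerically before doing the algebra. The sketch below assumes $W$ is symmetric (the question does not state this explicitly, so treat it as an assumption) and compares $f(x)$ with the quadratic form $x^T L x$ on a small example. The function names are illustrative.

```python
def laplacian(W):
    """Build L = D - W, where D is diagonal with the row sums of W."""
    n = len(W)
    return [
        [(sum(W[i]) if i == j else 0.0) - W[i][j] for j in range(n)]
        for i in range(n)
    ]


def f(W, x):
    """f(x) = sum over all ordered pairs (i, j) of w_ij * (x_i - x_j)^2."""
    n = len(W)
    return sum(W[i][j] * (x[i] - x[j]) ** 2 for i in range(n) for j in range(n))


def quad_form(L, x):
    """Evaluate x^T L x with plain loops."""
    n = len(L)
    return sum(x[i] * L[i][j] * x[j] for i in range(n) for j in range(n))


# Assumed example data: a symmetric, nonnegative weight matrix.
W = [[0.0, 1.0, 2.0],
     [1.0, 0.0, 0.5],
     [2.0, 0.5, 0.0]]
x = [1.0, -2.0, 3.0]
L = laplacian(W)
lhs = f(W, x)
rhs = 2.0 * quad_form(L, x)  # for symmetric W the two quantities agree
```

The factor of 2 arises because the double sum counts each unordered pair twice; it does not affect which vector minimizes $f$.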
4. Polynomial interpolation is equivalent to solving a linear system of equations. For example, if we use monomials, we end up with a Vandermonde coefficient matrix for the linear system, and if we use Lagrange polynomials we end up with an identity coefficient matrix.

(a) (25%) Show that if we use the Newton polynomials, the coefficient matrix is lower triangular.

(b) (75%) Show that the divided difference method for computing Newton's interpolation formula is a special way of solving the lower triangular system referred to in (a).
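
To make the objects in Question 4 concrete, here is a short Python sketch of the standard in-place divided-difference table together with the Newton-basis coefficient matrix whose entries are $N_{ij} = \prod_{k<j} (x_i - x_k)$. It illustrates the two constructions on sample data and is not a proof; the function names and the sample nodes are assumptions for illustration.

```python
def divided_differences(xs, ys):
    """Return the Newton coefficients c[0..n-1] using the in-place table."""
    c = list(ys)
    n = len(xs)
    for level in range(1, n):
        # Sweep from the bottom so lower-order entries are still available.
        for i in range(n - 1, level - 1, -1):
            c[i] = (c[i] - c[i - 1]) / (xs[i] - xs[i - level])
    return c


def newton_matrix(xs):
    """N[i][j] = prod_{k<j} (xs[i] - xs[k]); entries with j > i vanish."""
    n = len(xs)
    N = [[0.0] * n for _ in range(n)]
    for i in range(n):
        prod = 1.0
        for j in range(n):
            N[i][j] = prod
            prod *= xs[i] - xs[j]  # becomes 0 once j reaches i
    return N


xs = [0.0, 1.0, 2.0]
ys = [1.0, 3.0, 7.0]  # samples of 1 + x + x^2
c = divided_differences(xs, ys)
N = newton_matrix(xs)
reconstructed = [sum(N[i][j] * c[j] for j in range(3)) for i in range(3)]
```

Multiplying `N` by `c` reproduces `ys`, which is the linear system the question refers to.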
5. The differential equation

\[ y'(t) = \sqrt{y(t)} \]

with initial condition $y(0) = 0$ has the solution

\[ y(t) = \tfrac{1}{4} t^2. \]

The Euler scheme for this equation,

\[ Y_{k+1} - Y_k = h \sqrt{Y_k}, \]

with initial condition $Y_0 = 0$ has the solution $Y_k = 0$ for all $k$. Discuss why the solution of the finite difference scheme does not converge to the given solution of the differential equation. Mention appropriate theorems that give convergence results for finite difference schemes applied to differential equations.
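
The behavior described in Question 5 is easy to observe numerically. The following sketch applies the explicit Euler update $Y_{k+1} = Y_k + h\sqrt{Y_k}$ from $Y_0 = 0$ and, for contrast, from a tiny positive start value; the perturbed start is an assumption added for illustration and is not part of the question.

```python
import math


def euler_sqrt(y0, h, steps):
    """Explicit Euler for y' = sqrt(y); returns [Y_0, Y_1, ..., Y_steps]."""
    y = y0
    values = [y]
    for _ in range(steps):
        y = y + h * math.sqrt(y)
        values.append(y)
    return values


zero_start = euler_sqrt(0.0, 0.1, 10)    # stays at zero for every step
perturbed = euler_sqrt(1e-12, 0.1, 10)   # assumed perturbation, leaves zero
exact_at_1 = 0.25 * 1.0 ** 2             # y(1) for the solution y = t^2 / 4
```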

2 Discrete algorithms

1. In computational biology, DNA can be represented as a sequence of characters drawn from an alphabet of four letters, A, C, T, and G, representing the four nucleotides. Given two sequences $S_1$ and $S_2$ of $m$ and $n$ characters, respectively, describe what is meant by a local alignment. Given a similarity score of +2, a mismatch penalty of -1, and a gap score of 0, give an efficient sequential algorithm to compute the score of the best local alignment between $S_1$ and $S_2$. What is the asymptotic complexity of your algorithm? What are the space requirements? Suppose now that you are given a multi-core processor with $p$ cores (with $1 < p < \min(n, m)$); design and analyze a multicore algorithm for the sequence similarity problem using local alignments that scales with the number of cores.
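
As one possible starting point for the sequential part of Question 1, the sketch below uses the classic Smith-Waterman style dynamic program with the scores given in the question (match +2, mismatch -1, gap 0). It keeps only two rows of the table, so it uses $O(mn)$ time and $O(n)$ extra space. The function name and default parameters are illustrative, and this is a sketch rather than the expected answer.

```python
def local_alignment_score(s1, s2, match=2, mismatch=-1, gap=0):
    """Best local alignment score, computed with two rolling DP rows."""
    prev = [0] * (len(s2) + 1)
    best = 0
    for a in s1:
        curr = [0] * (len(s2) + 1)
        for j, b in enumerate(s2, start=1):
            diag = prev[j - 1] + (match if a == b else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best
```

Cells on the same anti-diagonal of the table do not depend on each other, which is one common way to expose parallelism for the multi-core part.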
2. The problem of multiple sequence alignment on DNA sequences is that of finding the optimal alignment of a set of three or more sequences under the sum of pairs (SP) score scheme. Assume that the score is a metric (i.e., it obeys the triangle inequality) and prove that the multiple sequence alignment problem is NP-complete.
3. A problem of significant importance to a network designer is finding the edges in the network whose removal causes the performance of network applications to degrade the most. A most vital edge is an edge which, if removed, causes the maximum change (increase) in the cost of the minimum spanning tree (MST(G)) of the graph $G$. Let $G = (V, E)$ be a weighted undirected graph with $n$ vertices and $m$ edges; each edge $e$ has a weight $w(e)$ assigned to it. Let $f(G)$ be the weight of a minimum spanning tree of $G$ if $G$ is connected; otherwise $f(G) = \infty$. The most precious edge of $G$ is an edge $e$ such that $f(G - e) \ge f(G - e')$ for every other edge $e'$ of $G$. Give the best known sequential algorithm for solving this problem, and analyze its running time in terms of the problem size.
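
To make the definition of $f(G - e)$ concrete, here is a naive baseline in Python: it runs Kruskal's algorithm once to find a minimum spanning tree, then recomputes the MST weight with each tree edge removed in turn. Removing a non-tree edge leaves the tree intact, so only tree edges need to be tried. This is deliberately a brute-force sketch, not the best known algorithm the question asks for; it assumes $G$ is connected with at least two vertices, and the edge format `(u, v, w)` is an assumption.

```python
def mst_weight(n, edges, skip=None):
    """Kruskal's algorithm. Returns (weight, tree_edge_indices),
    or (inf, []) if the graph without edge `skip` is disconnected."""
    parent = list(range(n))

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    total, used = 0.0, []
    for idx in sorted(range(len(edges)), key=lambda i: edges[i][2]):
        if idx == skip:
            continue
        u, v, w = edges[idx]
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            total += w
            used.append(idx)
    if len(used) != n - 1:
        return float("inf"), []
    return total, used


def most_vital_edge(n, edges):
    """Try removing each MST edge; return (edge_index, f(G - e))."""
    _, tree = mst_weight(n, edges)
    return max(
        ((i, mst_weight(n, edges, skip=i)[0]) for i in tree),
        key=lambda pair: pair[1],
    )
```

Because it re-sorts and rebuilds the tree for each of the $n - 1$ candidates, this baseline costs $O(nm \log m)$ time and is mainly useful for checking faster algorithms on small inputs.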
4. Give an $O(\log n)$ time parallel algorithm for solving the most vital edge problem on the concurrent read, exclusive write PRAM model.
5. Although merge sort runs in $\Theta(n \lg n)$ worst-case time and insertion sort runs in $\Theta(n^2)$ worst-case time, the constant factors in insertion sort make it faster for small $n$. Thus, it makes sense to use insertion sort within merge sort when subproblems become sufficiently small. Consider a modification to merge sort in which $n/k$ sublists of length $k$ are sorted using insertion sort and then merged using the standard merging mechanism, where $k$ is a value to be determined.

(a) In the first part of the modified algorithm, the $n/k$ sublists, each of length $k$, can be sorted by insertion sort. Analyze the worst-case running time for this step.

(b) In the second part of the modified algorithm, the sublists can be merged together. Analyze the worst-case running time for this step.

(c) What is the largest asymptotic ($\Theta$-notation) value of $k$ as a function of $n$ for which the modified algorithm has the same asymptotic running time as standard merge sort?

(d) How should $k$ be chosen in practice?
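
The modified algorithm in Question 5 can be sketched as follows. This is a minimal top-down variant in which recursion stops and insertion sort takes over once a subproblem has at most `k` elements; the default `k=16` is an arbitrary placeholder, not a recommended value, and `k` is assumed to be at least 1.

```python
def insertion_sort(a, lo, hi):
    """Sort the slice a[lo:hi] in place."""
    for i in range(lo + 1, hi):
        key = a[i]
        j = i - 1
        while j >= lo and a[j] > key:
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = key


def merge(a, lo, mid, hi):
    """Merge the sorted runs a[lo:mid] and a[mid:hi] back into a[lo:hi]."""
    left, right = a[lo:mid], a[mid:hi]
    i = j = 0
    for pos in range(lo, hi):
        if j >= len(right) or (i < len(left) and left[i] <= right[j]):
            a[pos] = left[i]
            i += 1
        else:
            a[pos] = right[j]
            j += 1


def hybrid_merge_sort(a, k=16, lo=0, hi=None):
    """Merge sort that hands sublists of length <= k (k >= 1) to insertion sort."""
    if hi is None:
        hi = len(a)
    if hi - lo <= k:
        insertion_sort(a, lo, hi)
        return
    mid = (lo + hi) // 2
    hybrid_merge_sort(a, k, lo, mid)
    hybrid_merge_sort(a, k, mid, hi)
    merge(a, lo, mid, hi)
```

Timing this function for several values of `k` on the target machine is one practical way to approach part (d).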

3 Modeling and simulation

1. Several algorithms have been proposed in the parallel discrete event simulation literature to relax message ordering in order to improve performance. One approach is to use time intervals rather than precise time stamps on events, to indicate the event could happen any time within the specified interval.

(a) Describe precisely the partial ordering that would apply using only time intervals to order events.

(b) Modify the Time Warp algorithm to work with time intervals, and specify the local control algorithm that would be used.

(c) Define global virtual time using time intervals, and give an algorithm for computing its value.
2. Design a synchronization protocol where the topology of logical processes is always organized as a tree, and events (messages) are sent only down the tree, i.e., from a parent to a child node. State all assumptions in your protocol. Assume the tree can have arbitrary fanout (i.e., a node can have any number of child nodes). Your solution to this problem must exploit the fact that the topology is a tree, i.e., you will receive zero credit if you simply use one of the standard conservative synchronization algorithms that works for arbitrary topologies.

Suppose the topology of logical processes is acyclic, but not necessarily a tree. Does your algorithm still work? Explain why or why not.
3. Suppose you are given a trace of a parallel discrete event simulation program that indicates what events are executed on each logical process, and what events are scheduled by each event. How would you determine the minimum execution time of this program assuming an unlimited number of processors, and zero time overheads for communication, synchronization, etc.? Assume conservative synchronization is used. Write an algorithm for computing this lower bound.
4. Suppose that the altitude of the trajectory of a projectile is described by the second-order ODE

\[ u'' = -4. \]

Suppose that the projectile is fired from position $t = 0$ and height $u(0) = 1$ and is to strike a target at position $t = 1$, also of height $u(1) = 1$.

(a) Solve this problem by the shooting method. To determine the initial slope at $t = 0$ required to hit the desired target at $t = 1$, use the trapezoid rule with step size $h = 1$ to derive a system of two equations for the unknown initial slope $s_0 = u'(0)$ and final slope $s_1 = u'(1)$. What are the resulting values for the initial and final slope?

(b) Solve the same BVP again using collocation at the time $t = 0.5$, together with the boundary values, to determine a quadratic polynomial $u(t)$ approximating the solution. What is the resulting approximate height of the projectile at the point $t = 0.5$?

(c) Comment on the advantages and disadvantages of the two different methods.
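
A small Python sketch can be used to check hand calculations for part (a) of Question 4. It rewrites $u'' = -4$ as the first-order system $u' = s$, $s' = -4$, takes one trapezoid-rule step of size $h$, and finds the initial slope by a single secant step. One step suffices here because the endpoint height depends affinely on the initial slope. The function names are illustrative.

```python
def trapezoid_step(u0, s0, h=1.0, accel=-4.0):
    """One trapezoid-rule step for u' = s, s' = accel (constant)."""
    s1 = s0 + 0.5 * h * (accel + accel)
    u1 = u0 + 0.5 * h * (s0 + s1)
    return u1, s1


def shoot(u0, target, h=1.0):
    """Find s0 so that the trapezoid step lands on `target`.
    The miss distance is affine in s0, so one secant step is exact."""
    r_a = trapezoid_step(u0, 0.0, h)[0] - target  # miss with trial slope 0
    r_b = trapezoid_step(u0, 1.0, h)[0] - target  # miss with trial slope 1
    s0 = 0.0 - r_a * (1.0 - 0.0) / (r_b - r_a)
    s1 = trapezoid_step(u0, s0, h)[1]
    return s0, s1


s0, s1 = shoot(u0=1.0, target=1.0)
```

The same `shoot` routine would need an iterative root finder if the ODE were nonlinear, which is one point worth raising in part (c).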
5. The diffusion equation describes the change of density in a material undergoing diffusion. A particular form is the heat equation, generally written as

\[ \frac{\partial \phi}{\partial t} = D \nabla^2 \phi(\vec{x}, t), \]

where $\nabla^2$ denotes the Laplace operator, and $D$ denotes the diffusion coefficient.

(a) Describe the discretization of the heat equation in two spatial dimensions using the Crank-Nicolson method, which uses centered differences in space and in time (at time $t + \tfrac{1}{2}\Delta t$, where $\Delta t$ denotes the time step).

(b) What is the order of accuracy of this discretization in time and space? Show the derivation of its accuracy in time.

(c) Under what time step is this method stable? Outline a brief argument in three or four sentences to explain how this stability limit can be proven (you do not need to give the full proof).

4 High-performance computing

1. Given an array $A$ of $n$ elements, we define the $n$ prefix-sums $B$ of $A$ as follows. Let $b_0 = a_0$. Let $b_i = b_{i-1} + a_i$, for $1 \le i < n$.

(a) Give an optimal RAM sequential algorithm to compute $B$ and analyze its asymptotic time complexity.

(b) Explain the difference in performance one would observe running this algorithm on two hypothetical computers, one with a reasonable-sized cache and the other without any caches. Assume that $A$ and $B$ fit within main memory.

(c) Explain the principle that caches are designed to exploit, and whether or not this code exhibits this principle.

(d) Assume now that $A$ exceeds the capacity of main memory and resides on tertiary storage (e.g., disk). Give an optimal external memory algorithm for computing the prefix-sums $B$. Estimate the ratio in performance between the in-memory and external memory approaches.
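
The recurrence that defines the prefix-sums in Question 1 translates directly into a single left-to-right pass. A minimal Python sketch, with an illustrative function name, is:

```python
def prefix_sums(a):
    """Return B with b[0] = a[0] and b[i] = b[i-1] + a[i]."""
    b = []
    running = 0
    for value in a:
        running += value  # carries b[i-1] into b[i]
        b.append(running)
    return b
```

Each element of `a` is read once and each element of `b` is written once, in increasing index order, which is the access pattern parts (b) through (d) ask about.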
2. The design of microprocessors has abruptly switched to multi-core designs where two or more independent processing cores are packaged together on the same chip. We expect the number of cores per chip to grow to counts of 64 or more.

(a) Give a computational model that may be suitable for multi-core chips. Argue why this model is reasonable. Describe what critical aspects of design it encourages, and articulate any major issues it ignores.

(b) Design and analyze a scalable multi-core algorithm for computing prefix-sums. Is this algorithm optimal in your model? Why or why not?
3. Distributed memory machines, such as cluster computers, are often used to solve large-scale problems that require high-performance computing. In this problem, we are using a cluster of $p$ nodes, where communication between two nodes takes $t + lw$ time to send a point-to-point message of length $l$ between any pair of processors. Assume that the network has sufficient bandwidth such that no congestion occurs.

We are given an array $A$ of size $n$ where $p$ evenly divides $n$. The array $A$ is stored in a block layout on the cluster such that the first $n/p$ elements of $A$ are on node $P_0$, the second $n/p$ elements of $A$ are on node $P_1$, and so on. Let $n > p^2$, and assume each node has $O(n/p)$ memory. You may assume that $n$ is a power of two.

Design a message-passing algorithm to compute the prefix-sums of $A$. Analyze the time complexity in terms of computation and communication. Derive the speedup compared with the best sequential algorithm. For what values of $n$ and $p$ can the maximum speedup be achieved?
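
One common structure for block-distributed prefix-sums has three phases: local prefix-sums on each node, a scan over the $p$ block totals, and a local offset addition. The Python sketch below simulates these phases in a single process to show the data flow only. The scan over block totals is done with a plain loop here, standing in for whatever message-passing pattern an actual solution would use; no real communication library is involved, and the function name is illustrative. It assumes $n \ge p \ge 1$.

```python
def block_prefix_sums(a, p):
    """Simulate p nodes, each holding a contiguous block of n/p elements."""
    n = len(a)
    assert n % p == 0, "p must evenly divide n"
    size = n // p
    blocks = [a[r * size:(r + 1) * size] for r in range(p)]

    # Phase 1: every node computes prefix-sums of its own block.
    local = []
    for block in blocks:
        running, out = 0, []
        for v in block:
            running += v
            out.append(running)
        local.append(out)

    # Phase 2: exclusive scan over the block totals. In a real cluster
    # this is the only phase that needs messages between nodes.
    offsets, running = [], 0
    for out in local:
        offsets.append(running)
        running += out[-1]

    # Phase 3: every node adds its offset to its local results.
    return [v + offsets[r] for r in range(p) for v in local[r]]
```

Phases 1 and 3 cost $O(n/p)$ computation per node with no communication, so the analysis asked for in the question centers on how phase 2 is carried out under the $t + lw$ message cost.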
4. Superlinear speedup is defined empirically as cases where a problem runs more than $p$ times faster on $p$ processors than it did on a single processor. Can a parallel algorithm have superlinear absolute speedup? Why or why not? Describe at least two common causes for reports of superlinear speedup in the literature.
5. Answer the following questions related to high-performance computer architecture.
(a) Identify the principal difference between a superscalar architecture and a VLIW architecture.
(b) Why does the addition of multithreading to a processor help that processor to tolerate memory
and communication latencies?
(c) How does predication improve processor performance?
(d) Describe two redundancy schemes used to provide higher reliability in disk arrays.
(e) Compare and contrast a 128-processor cluster versus a 128-processor symmetric multiprocessor
in terms of power, price, performance, applications, reliability, etc.