UML CS
Analysis of Algorithms 91.404 (section 201)
Spring, 2008
Homework Set #8
Assigned: Monday, 4/7
Due: Friday, 4/18 (start of lecture)
This assignment covers textbook material in Chapters 10-11.
Note: Partial credit for wrong answers is only given if work is shown.
1. (52 points) Chapter 10: DNA Alternative Splicing. Biologists now know that a single gene
can yield multiple proteins via a process called alternative splicing. The input to your
alternative splicing algorithm is:
- An array Y of n DNA characters, where each character is either A, C, T, or G. A
character represents a base: adenine, cytosine, thymine or guanine.
- A doubly-linked list S with a head pointer. In addition to next and prev pointers, each
node has:
- a character attribute c that is either I (for intron) or E (for exon). Some biologists
think that an exon contributes to protein synthesis but an intron does not.
- 2 positive integer attributes s and t representing a range of genomic sequence
positions in the array Y. For a node whose pointer is current, the following
statements are satisfied:
- Intra-node ordering: s[current] ≤ t[current].
- Inter-node ordering: if next[current] exists, then
t[current] + 1 ≤ s[next[current]];
if prev[current] exists, then
t[prev[current]] + 1 ≤ s[current].
- A probability Pintron.
- A probability Pexon.
Your algorithm should create from S one doubly-linked list S’ representing a “spliced”
sequence with the following properties:
- Each intron node is randomly determined to be present or absent, with the probability
of being present = Pintron.
- Any introns that exist in S’ should be in the same order as they appear in S.
- Each exon node is randomly determined to be present or absent, with the probability
of being present = Pexon.
- Any exons that exist in S’ should be in the same order as they appear in S.
Your algorithm should also traverse the new list S’ and print out all the associated elements
of Y in the proper order.
Provide the following for your algorithm: pseudo-code, upper bound on the worst-case
asymptotic running time, and correctness justification.
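For orientation only, the input and output described above can be sketched in Python. This is one possible reading, not a model answer. It assumes that positions s and t are 1-based and inclusive, that S’ is built from copies of the nodes of S (leaving S unchanged), and that a node is kept when a uniform random draw falls below its probability. The names Node, build_list, splice and spliced_bases are hypothetical, and the bases are returned as a string rather than printed so that the result is easy to check.

```python
import random


class Node:
    """One list node: c is 'I' or 'E'; s..t is a range of positions in Y."""

    def __init__(self, c, s, t):
        self.c = c
        self.s = s          # assumed 1-based, inclusive
        self.t = t          # assumed 1-based, inclusive
        self.prev = None
        self.next = None


def build_list(segments):
    """Hypothetical helper: build a doubly-linked list from (c, s, t) tuples."""
    head = tail = None
    for c, s, t in segments:
        node = Node(c, s, t)
        if tail is None:
            head = node
        else:
            tail.next = node
            node.prev = tail
        tail = node
    return head


def splice(head, p_intron, p_exon, rng=random):
    """Copy each node of S into S' with probability p_intron or p_exon."""
    new_head = new_tail = None
    cur = head
    while cur is not None:
        p = p_intron if cur.c == "I" else p_exon
        if rng.random() < p:            # random() is in [0, 1)
            copy = Node(cur.c, cur.s, cur.t)
            copy.prev = new_tail
            if new_tail is None:
                new_head = copy
            else:
                new_tail.next = copy
            new_tail = copy             # appending preserves the order of S
        cur = cur.next
    return new_head


def spliced_bases(head, Y):
    """Traverse S' and collect the elements of Y it covers, in order."""
    out = []
    cur = head
    while cur is not None:
        out.extend(Y[cur.s - 1:cur.t])  # 1-based inclusive -> Python slice
        cur = cur.next
    return "".join(out)
```

Under these assumptions, splice visits each node of S once and spliced_bases touches each covered position of Y once. The pseudo-code, running-time bound and correctness argument requested above still have to be worked out in the textbook's notation.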
2. (24 points) Chapter 11: Consider inserting keys 3,4,2,5,1 in the order given into a hash table
of length m = 5 using hash function h(k) = k^2 mod m.
a) Using h(k) as the primary hash function, illustrate the result of inserting these keys using
open addressing with linear probing.
b) Using h(k) as the primary hash function, illustrate the result of inserting these keys using
open addressing with quadratic probing, where c1=1 and c2=2.
c) Using h(k) as the hash function, illustrate the result of inserting these keys using chaining.
Compute the load factor α for the hash table resulting from the insertions.
d) What different values can the hash function h(k) = k^2 mod m produce when m = 11?
Carefully justify your answer in detail.
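The mechanics behind parts (a) through (d) can be sketched as follows. This is a minimal illustration that assumes the probe sequence (h(k) + c1*i + c2*i^2) mod m for i = 0, 1, ..., m-1, with linear probing as the special case c1 = 1, c2 = 0, and insertion at the head of the chain for chaining. The function names are hypothetical, and the usage at the end deliberately uses a different table size and different keys from the exercise.

```python
def h(k, m):
    """Primary hash function from the problem: k^2 mod m."""
    return (k * k) % m


def insert_open_addressing(table, k, c1=1, c2=0):
    """Insert k by probing (h(k) + c1*i + c2*i^2) mod m for i = 0..m-1.

    Returns the slot used, or None if every probe hits an occupied slot.
    """
    m = len(table)
    for i in range(m):
        slot = (h(k, m) + c1 * i + c2 * i * i) % m
        if table[slot] is None:
            table[slot] = k
            return slot
    return None


def insert_chaining(table, k):
    """Insert k at the head of the chain stored in slot h(k)."""
    table[h(k, len(table))].insert(0, k)


def hash_values(m):
    """Distinct values of k^2 mod m; it suffices to try k = 0..m-1."""
    return sorted({h(k, m) for k in range(m)})


# Usage with keys that all hash to slot 1 in a table of length 7.
demo = [None] * 7
for key in (1, 6, 8):
    insert_open_addressing(demo, key)   # linear probing
print(demo)   # [None, 1, 6, 8, None, None, None]
```

Note that a quadratic probe sequence need not visit every slot, so insert_open_addressing can return None even when the table still has empty slots.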
3. (24 points) This task is to pick the most efficient representation in a variety of cases. In all
cases assume that the number of inputs n is large. Assume that the algorithms in the
textbook are used for the associated data structures. Justify your choices.
a) A grocery store wants to determine how many cashiers they need in order to service n
customers. They simulate the operation of the cashiers by dividing time into slices.
During each time slice, customers can join cashier lines and each cashier can service
some number of customers. Select a representation for a cashier’s line of customers.
b) Suppose that we want to build a sequence of URLs representing a “good” path towards
an objective (in the form of a question) as follows. From our question we initially select a
small set of keywords, pass them to a search engine and then randomly pick one of the
resulting URLs that looks promising and follow that link. If no link looks promising, we
discard a randomly chosen number of the most recent links that we have followed. At
the next phase we either pass a refined list of keywords to the search engine or follow a
promising link on the current web page that we are at. Repeat this process until we
either reach an upper bound on the number of links examined (assume that this bound is
an input to the algorithm) or we reach a URL that comes sufficiently close to answering
our original question (assume that this tolerance is also an input to the algorithm).
c) Malware is malicious code intended to disrupt a system’s behavior. One example of
malware is a computer virus. In order to defend against malware, it is helpful to be able
to compare pieces of malware and decide which are similar. One technique is to apply a
function to a number that represents the binary executable version of the malware and
then calculate the result modulo some modulus. The resulting numbers are then
compared with each other. Select a representation to hold an entry for each malware
executable based on the number that is associated with it.
d) A discrete-event simulation executes events one at a time. To select the next event to
process, the simulation chooses the event with the smallest timestamp that has not yet
been processed. Select a representation to store the timestamps of the unprocessed
events.
       STACK    QUEUE    LINKED-LIST    HEAP    HASH-TABLE
(a)
(b)
(c)
(d)
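As a reminder of what each candidate representation offers, the sketch below shows the core operation of each one. It uses Python standard-library types as stand-ins, which is an assumption: the problem asks you to reason about the textbook's algorithms, whose costs should be taken from the textbook. The variable names and sample values are illustrative only.

```python
import heapq
from collections import deque

# STACK: last in, first out.
stack = []
stack.append("a")
stack.append("b")
top = stack.pop()                 # removes the most recently pushed item

# QUEUE: first in, first out.
queue = deque()
queue.append("a")
queue.append("b")
front = queue.popleft()           # removes the earliest enqueued item


# LINKED-LIST: constant-time insert at the head, linear-time search.
class ListNode:
    def __init__(self, key, next=None):
        self.key = key
        self.next = next


head = None
for key in ("a", "b"):
    head = ListNode(key, head)    # new node becomes the head
node = head
while node is not None and node.key != "a":
    node = node.next              # walk the list until the key is found

# HEAP: here a min-heap, so the smallest key comes out first.
heap = []
for value in (7, 3, 5):
    heapq.heappush(heap, value)
smallest = heapq.heappop(heap)

# HASH-TABLE: lookup by key.
table = {42: "entry for key 42"}
entry = table.get(42)
```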
Extra Credit: (due Monday, 4/28) Note: you may do as many parts below as you like. Each
part is worth 10 points. You will need to decide how large n = (number of inputs) needs to be in
order to have meaningful results.
1) Pick one of the sorting algorithms that we have studied in our course. Implement our
textbook’s pseudocode for this algorithm in either C, C++ or Java. Design an
experiment to try to determine the constant coefficient of the leading (largest) term in the
“big O” expression for the upper bound on the worst-case running time of the sorting
algorithm. Graphically show how this function of n changes across different values of n.
2) Perform (1) for another sorting algorithm.
3) Graphically compare the running times that you obtained from (1) and (2) to show how
they vary across different values of n. Are there any “cross-over” point(s)? If so, identify
them.
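One possible shape for the experiment in part (1) is sketched below, using insertion sort as the example. Python is used here only to illustrate the experiment design, since the assignment itself requires C, C++ or Java. The sketch assumes that reverse-sorted input is the worst case, that the leading term is proportional to n^2, and that the best of a few timed trials is an acceptable measurement. The function names and the values of n are hypothetical choices.

```python
import time


def insertion_sort(a):
    """Sort list a in place; return the number of element shifts made."""
    shifts = 0
    for j in range(1, len(a)):
        key = a[j]
        i = j - 1
        while i >= 0 and a[i] > key:
            a[i + 1] = a[i]       # shift the larger element one slot right
            i -= 1
            shifts += 1
        a[i + 1] = key
    return shifts


def estimate_coefficient(n, trials=3):
    """Time reverse-sorted input of size n; return best time divided by n^2."""
    best = float("inf")
    for _ in range(trials):
        data = list(range(n, 0, -1))          # assumed worst case
        start = time.perf_counter()
        insertion_sort(data)
        best = min(best, time.perf_counter() - start)
    return best / (n * n)


if __name__ == "__main__":
    for n in (250, 500, 1000):
        print(n, estimate_coefficient(n))
```

If the ratio settles toward a roughly constant value as n grows, that value is an estimate of the coefficient in seconds on the machine used. Plotting the ratio against n gives the graph requested in part (1).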