UML CS Analysis of Algorithms 91.404 (section 201) Spring, 2008

Homework Set #8

Assigned: Monday, 4/7
Due: Friday, 4/18 (start of lecture)

This assignment covers textbook material in Chapters 10-11.

Note: Partial credit for wrong answers is only given if work is shown.

1. (52 points) Chapter 10: DNA Alternative Splicing. Biologists now know that a single gene can yield multiple proteins via a process called alternative splicing.

The input to your alternative splicing algorithm is:
- An array Y of n DNA characters, where each character is either A, C, T, or G. A character represents a base: adenine, cytosine, thymine or guanine.
- A doubly-linked list S with a head pointer. In addition to next and prev pointers, each node has:
  - a character attribute c that is either I (for intron) or E (for exon). Some biologists think that an exon contributes to protein synthesis but an intron does not.
  - 2 positive integer attributes s and t representing a range of genomic sequence positions in the array Y. For a node whose pointer is current, the following statements are satisfied:
    - Intra-node ordering: s[current] ≤ t[current].
    - Inter-node ordering: if next[current] exists, then t[current] + 1 = s[next[current]]; if prev[current] exists, then t[prev[current]] + 1 = s[current].
- A probability Pintron.
- A probability Pexon.

Your algorithm should create from S one doubly-linked list S’ representing a “spliced” sequence with the following properties:
- Each intron node is randomly determined to be present or absent, with the probability of being present = Pintron.
- Any introns that exist in S’ should be in the same order as they appear in S.
- Each exon node is randomly determined to be present or absent, with the probability of being present = Pexon.
- Any exons that exist in S’ should be in the same order as they appear in S.
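
The input list S described above can be pictured with a small sketch. This is only an illustration in Python, not part of the assignment and not a solution to it: the names `Node`, `build_list` and `check_ordering` are hypothetical, positions are assumed to be 1-based, and the inter-node check tests only the weaker condition that ranges appear in increasing order without overlapping.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(eq=False)
class Node:
    """One node of the doubly-linked list S (hypothetical Python rendering)."""
    c: str                          # 'I' for intron, 'E' for exon
    s: int                          # first position of the range in Y
    t: int                          # last position of the range in Y
    prev: Optional["Node"] = None
    next: Optional["Node"] = None


def build_list(specs):
    """Link (c, s, t) triples into a doubly-linked list and return its head."""
    head = tail = None
    for c, s, t in specs:
        node = Node(c, s, t, prev=tail)
        if tail is None:
            head = node
        else:
            tail.next = node
        tail = node
    return head


def check_ordering(head):
    """Return True if every node has s <= t and ranges increase along the list."""
    current = head
    while current is not None:
        if current.s > current.t:                    # intra-node ordering
            return False
        nxt = current.next
        if nxt is not None and current.t >= nxt.s:   # assumed: no overlap
            return False
        current = nxt
    return True


# Example input: Y has n = 10 bases and S covers positions 1..10 (assumed 1-based).
Y = list("ACGTTGCAAC")
S = build_list([("E", 1, 3), ("I", 4, 6), ("E", 7, 10)])
print(check_ordering(S))  # prints True
```

A check like this can be useful when generating test inputs, since the ordering properties are preconditions of the algorithm rather than something it has to establish.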
Your algorithm should also traverse the new list S’ and print out all the associated elements of Y in the proper order.

Provide the following for your algorithm: pseudo-code, upper bound on the worst-case asymptotic running time, and correctness justification.

2. (24 points) Chapter 11: Consider inserting keys 3, 4, 2, 5, 1 in the order given into a hash table of length m = 5 using hash function h(k) = k^2 mod m.
a) Using h(k) as the primary hash function, illustrate the result of inserting these keys using open addressing with linear probing.
b) Using h(k) as the primary hash function, illustrate the result of inserting these keys using open addressing with quadratic probing, where c1 = 1 and c2 = 2.
c) Using h(k) as the hash function, illustrate the result of inserting these keys using chaining. Compute the load factor for the hash table resulting from the insertions.
d) What different values can the hash function h(k) = k^2 mod m produce when m = 11? Carefully justify your answer in detail.

3. (24 points) This task is to pick the most efficient representation in a variety of cases. In all cases assume that the number of inputs n is large. Assume that the algorithms in the textbook are used for the associated data structures. Justify your choices.
a) A grocery store wants to determine how many cashiers they need in order to service n customers. They simulate the operation of the cashiers by dividing time into slices. During each time slice, customers can join cashier lines and each cashier can service some number of customers. Select a representation for a cashier’s line of customers.
b) Suppose that we want to build a sequence of URLs representing a “good” path towards an objective (in the form of a question) as follows. From our question we initially select a small set of keywords, pass them to a search engine and then randomly pick one of the resulting URLs that looks promising and follow that link.
If no link looks promising, we discard a randomly chosen number of the most recent links that we have followed. At the next phase we either pass a refined list of keywords to the search engine or follow a promising link on the web page that we are currently on. Repeat this process until we either reach an upper bound on the number of links examined (assume that this bound is an input to the algorithm) or we reach a URL that comes sufficiently close to answering our original question (assume that this tolerance is also an input to the algorithm).
c) Malware is malicious code intended to disrupt a system’s behavior. One example of malware is a computer virus. In order to defend against malware, it is helpful to be able to compare pieces of malware and decide which are similar. One technique is to apply a function to a number that represents the binary executable version of the malware and then calculate the result modulo some modulus. The resulting numbers are then compared with each other. Select a representation to hold an entry for each malware executable based on the number that is associated with it.
d) A discrete-event simulation executes events one at a time. To select the next event to process, the simulation chooses the event with the smallest timestamp that has not yet been processed. Select a representation to store the timestamps of the unprocessed events.

               (a)    (b)    (c)    (d)
STACK
QUEUE
LINKED-LIST
HEAP
HASH-TABLE

Extra Credit: (due Monday, 4/28)
Note: you may do as many parts below as you like. Each part is worth 10 points. You will need to decide how large n = (number of inputs) needs to be in order to have meaningful results.
1) Pick one of the sorting algorithms that we have studied in our course. Implement our textbook’s pseudocode for this algorithm in either C, C++ or Java.
Design an experiment to try to determine the constant coefficient of the leading (largest) term in the “big O” expression for the upper bound on the worst-case running time of the sorting algorithm. Graphically show how this function of n changes across different values of n.
2) Perform (1) for another sorting algorithm.
3) Graphically compare the running times that you obtained from (1) and (2) to show how they vary across different values of n. Are there any “cross-over” point(s)? If so, identify them.
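
One possible shape for the experiment in Extra Credit part (1) is sketched below. It is an illustration under stated assumptions, not the required deliverable: it is written in Python rather than C, C++ or Java, it uses insertion sort as the example algorithm, and it counts key comparisons instead of measuring wall-clock time so that the result is deterministic. The function names `insertion_sort_count` and `estimate_coefficient` are hypothetical.

```python
def insertion_sort_count(a):
    """Sort list a in place; return the number of key comparisons made."""
    comparisons = 0
    for j in range(1, len(a)):
        key = a[j]
        i = j - 1
        while i >= 0:
            comparisons += 1          # one comparison of a[i] with key
            if a[i] <= key:
                break
            a[i + 1] = a[i]           # shift the larger element right
            i -= 1
        a[i + 1] = key
    return comparisons


def estimate_coefficient(n):
    """Comparisons on a reverse-sorted input of size n, divided by n^2."""
    worst_case = list(range(n, 0, -1))   # assumed worst case: reverse order
    return insertion_sort_count(worst_case) / (n * n)


# The ratio count / n^2 should level off as n grows if the leading term is c * n^2.
for n in (10, 100, 1000):
    print(n, estimate_coefficient(n))
```

On a reverse-sorted input this version makes n(n-1)/2 comparisons, so the printed ratio approaches 1/2 as n grows. A timing-based experiment would follow the same pattern, dividing measured running time by the leading term and looking for the value at which the ratio stabilizes.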