Algorithms and Data Structures 2010/11

Coursework 2

Issue date: Wednesday, 10th November 2010

The deadline for this coursework is NOON on Thursday 25th November, 2010. Please submit your solutions electronically via submit. This is worth 50% of the coursework for A&DS.

In this coursework, we consider the "knapsack counting" problem. Your mission is to understand, implement, and experiment with a suite of algorithms for this problem.

In the knapsack counting problem, we are given as input a list of non-negative integer weights w_1, w_2, ..., w_n ∈ N, and an upper bound B ∈ N. We say that some specific set S ⊆ {1, ..., n} represents a feasible knapsack solution (wrt w_1, ..., w_n, B) if and only if

    sum_{i ∈ S} w_i ≤ B.

The total number of feasible knapsack solutions (which we wish to count) is

    count(n, B) = |{ S ⊆ {1, 2, ..., n} : sum_{i ∈ S} w_i ≤ B }|.

Note we may assume wlog that w_i ≤ B for all 1 ≤ i ≤ n (we can delete any w_i > B).

§1 describes how to compute count(n, B) exactly via a recursive (exponential-time) algorithm. §2 describes a dynamic programming algorithm which computes count(n, B) exactly in Θ(nB) time. §3 introduces a very basic approximation algorithm, which runs in Θ(n^3) time on a "rounded" version of the input, and returns an "approximate count" within a factor of (n+1) of count(n, B) (this is an (n+1)-approximation algorithm). §3 then shows we can improve the quality of this approximation (to within a factor of 1.25, say, or closer), by drawing uniform samples from the feasible solutions of the "rounded" problem, and checking whether or not they are also solutions to the original problem.^1

Your task in this coursework is to implement all three algorithms, prove some relevant facts (see §3.1), and write a short report on experimental results.

1 Exact Algorithm via Recursion

Given the input w_1, ..., w_n, B ∈ N, we define for every 0 ≤ k ≤ n and every 0 ≤ b ≤ B,

    count(k, b) = |{ S ⊆ {1, 2, ..., k} : sum_{i ∈ S} w_i ≤ b }|.    (1)
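For very small n, count(k, b) can be checked directly from the definition (1) by enumerating all 2^n subsets. The sketch below is only a sanity-checking aid, not one of the coursework algorithms; the class and method names are our own, not part of the coursework template.

```java
public class BruteForceCount {
    // count(n, B) straight from the definition: enumerate all 2^n subsets S,
    // encoded as bitmasks, and count those with total weight at most B.
    public static int countBrute(int[] w, int B) {
        int n = w.length, count = 0;
        for (int mask = 0; mask < (1 << n); mask++) {
            long total = 0;
            for (int i = 0; i < n; i++)
                if ((mask & (1 << i)) != 0) total += w[i];
            if (total <= B) count++;
        }
        return count;
    }
}
```

This is only usable for n up to roughly 25 or so; the whole point of the coursework is to do better.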
Clearly count(0, b) = 1 for all 0 ≤ b ≤ B (the only feasible solution is to take S = ∅). Also, count(k, 0) = 1 for all 0 ≤ k ≤ n (again, the only feasible solution is to take S = ∅). In all other cases, when k ≥ 1 and b ≥ 1, we can partition the set of feasible solutions into solutions with k ∈ S, and solutions with k ∉ S. This gives the following recurrence, which can easily be coded up into a natural recursive algorithm:

    count(k, b) = 1                                       if k = 0
                  count(k−1, b)                           if k > 0, but w_k > b     (2)
                  count(k−1, b) + count(k−1, b − w_k)     if k > 0 and w_k ≤ b

To justify the recurrence, note that S with k ∈ S is a solution for w_1, ..., w_k with bound b if and only if S \ {k} is a solution for w_1, ..., w_{k−1} with bound b − w_k (there are count(k−1, b − w_k) such solutions in total). Also, S with k ∉ S is a solution for w_1, ..., w_k with bound b if and only if S is a solution for w_1, ..., w_{k−1} with the same bound b (there are count(k−1, b) of these solutions). Hence count(k, b) = count(k−1, b) + count(k−1, b − w_k). If w_k > b, then count(k−1, b − w_k) = 0.

The recursive algorithm for counting knapsack solutions is an exact algorithm - it always returns the exact value of count(n, B). However, it can take exponential time, even for relatively small values of B (eg B ≤ n^2). You will see this when you run experiments.

The first task of this coursework is to implement (2) directly as the Java method

    public static int countKnapsackRecurse(int[] w, int B)

^1 Both the initial (n+1)-approximation algorithm we get from "rounding" w_i → a_i, and the better approximation algorithm we get by repeatedly sampling from this set, are due to Martin Dyer [1].

2 Exact Algorithm via Dynamic Programming

In presenting a DP algorithm we have three issues to address:

1. Describe the expanded set of subproblems and the recurrence which relates them;
2. Describe the table where we will store the solutions to these subproblems;
3.
Finally, we must describe the order in which we fill in this table, by giving our algorithm.

In our DP algorithm, we will ask for the solution to count(k, b) (defined in (1)) for all 0 ≤ k ≤ n and all 0 ≤ b ≤ B, and we will work with the recurrence (2). We will store all these solutions in a table (an integer array) C of dimensions (n+1) × (B+1). For every 0 ≤ k ≤ n, 0 ≤ b ≤ B, the cell C[k, b] will hold the value of count(k, b) once it has been computed by the DP algorithm. The algorithm we use to build the table is given below.

Algorithm countKnapsackDP(w, B)
1.  n ← length(w)
2.  define array C of dimensions (n+1) × (B+1)
3.  for b ← 0 to B do C[0, b] ← 1 od
4.  for k ← 1 to n do
5.    for b ← 0 to B do
6.      if w_k > b then C[k, b] ← C[k−1, b]
7.      else C[k, b] ← C[k−1, b] + C[k−1, b − w_k] fi
8.    od
9.  od
10. return C[n, B]

In practice (ie, when you implement this algorithm), the w_i values will be stored in an array, and instead of referring to w_i you will refer to w[i−1] (as Java arrays begin at 0).

Your second task of this coursework is to implement countKnapsackDP in Java for general values of w_1, ..., w_n, B ∈ N, as the method

    public static int countKnapsackDP(int[] w, int B)

In doing this, always check the w_k > b case, so that you do not try to access undefined cells of the table. When you run experiments on your implementation, you will find that it runs much, much faster than countKnapsackRecurse (from §1) if B is not too large. For large values of B, countKnapsackDP may still require quite a large amount of time/space.

3 Approximation algorithm via Dynamic Programming

3.1 Initial (n+1)-approximation algorithm

Now we present an algorithm which for any input w_1, ..., w_n, B (regardless of the size of B, or the values of the w_i), will, in Θ(n^3) time, compute an estimate ĉount(n, n^2) such that

    count(n, B) ≤ ĉount(n, n^2) ≤ (n+1) · count(n, B).    (3)
It will be your responsibility to prove (3), which then implies we have an (n+1)-approximation algorithm (even better, we only have one-sided error). To compute our estimate, we construct a "rounded" version a_1, ..., a_n of our input weights, where

    a_i =def floor(n^2 · w_i / B),

and take n^2 as our new upper bound. We write ĉount(n, n^2) to denote the number of knapsack solutions for the rounded instance a_1, ..., a_n, n^2 - the "hat" on ĉount(n, n^2) is to indicate we are working with rounded values. We compute ĉount(n, n^2) as follows:

1. n ← length(w)
2. for i ← 1 to n do
3.   a_i ← floor(n^2 · w_i / B)
   od
4. ĉount(n, n^2) ← countKnapsackDP(a, n^2)
5. return ĉount(n, n^2)

Your third task is to prove this is a Θ(n^3)-time (n+1)-approximation algorithm:

• Step 1: count(n, B) ≤ ĉount(n, n^2)
  (show that every element of count(n, B)'s set - this is {y ∈ {0,1}^n : sum_{j=1}^{n} y_j w_j ≤ B} - also lies in ĉount(n, n^2)'s set {x ∈ {0,1}^n : sum_{j=1}^{n} x_j a_j ≤ n^2}).

• Step 2: ĉount(n, n^2) ≤ (n+1) · count(n, B)
  (define a function f from ĉount(n, n^2)'s set {x ∈ {0,1}^n : sum_{j=1}^{n} x_j a_j ≤ n^2} into count(n, B)'s set {y ∈ {0,1}^n : sum_{j=1}^{n} y_j w_j ≤ B}, and show that no element in count(n, B)'s set is the image f(x) for more than (n+1) different x).

• Step 3: Justify the Θ(n^3) running time.

3.2 Refining the Approximation via Sampling

The basic approximation procedure described in §3.1 is easy to implement and very useful for reducing the size of the DP table (in the case when B is large), but the estimate returned can be (n+1) times greater than the true value. We now show how to use this rough approximation, plus the DP table (plus a bit of randomness), to come up with an estimate which with high probability will lie within a factor of (1 ± 0.25) of count(n, B).

We define the sets of feasible solutions, for the initial instance and the rounded instance:

    K =def { S ⊆ {1, 2, ..., n} : sum_{i ∈ S} w_i ≤ B },
    K̂ =def { S ⊆ {1, 2, ..., n} : sum_{i ∈ S} a_i ≤ n^2 }.
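In Java, the rounding step of §3.1 needs one small care: the product n^2 · w_i can overflow int even when each a_i ≤ n^2 fits comfortably, so it is safest to do the multiplication in long. A minimal sketch (the class and method names are our own):

```java
public class RoundingSketch {
    // a_i = floor(n^2 * w_i / B), the rounded weights of §3.1.
    public static int[] roundWeights(int[] w, int B) {
        int n = w.length;
        int[] a = new int[n];
        for (int i = 0; i < n; i++)
            a[i] = (int) (((long) n * n * w[i]) / B);  // long arithmetic avoids overflow in n^2 * w_i
        return a;
    }
}
```

ĉount(n, n^2) is then just countKnapsackDP(roundWeights(w, B), n * n), using the method from §2.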
Observe that by definition, count(n, B) = |K| and that ĉount(n, n^2) = |K̂|. Also recall from §3.1 that we have a simple procedure (defined in terms of countKnapsackDP) to compute ĉount(n, n^2) exactly in O(n^3) time. Clearly,

    count(n, B) = |K| = (|K|/|K̂|) · |K̂|,

but unfortunately we do not know the value of |K|/|K̂| either! However, we can estimate |K|/|K̂| via a procedure known as random sampling. The idea is as follows: we know, from Step 1 of the third task, that K ⊆ K̂ (the solutions to the original instance are all solutions of the rounded version). Suppose we had an oracle drawRandomSample which we could ask to supply us with a random element S ∈ K̂, where S would be chosen uniformly at random^2 from the set K̂. We could run our magic procedure m times to obtain m samples S_1, S_2, ..., S_m. Next, we could test each S_j in turn to determine whether it is also an element of K or not (this just involves checking sum_{i ∈ S_j} w_i against B), and hence evaluate

    p = |{ S_j : 1 ≤ j ≤ m, S_j ∈ K }| / m.

Observe that if m is large (ie, we take a large number of samples from K̂), then we should have a reasonably good approximation of |K|/|K̂|. More specifically, it is possible to show using Chernoff bounds that if we take about m = Ω(|K̂|/|K|) samples, we have

    (1 − 0.25) · |K|/|K̂| ≤ p ≤ (1 + 0.25) · |K|/|K̂|.

Step 2 of the third task of §3.1 is crucial here, because it implies that |K̂|/|K| ≤ (n+1). If we take about 10n samples from K̂ and evaluate p with these, then w.h.p. p lies within (1 ± 0.25) of |K|/|K̂|.^3 A direct consequence is that we can then estimate count(n, B) = |K| to within a factor of 1 ± 0.25 (w.h.p.) by taking p · ĉount(n, n^2).

^2 Uniformly at random means that all elements of the set K̂ have the same chance of being taken.
^3 Hardly any instances of w_1, ..., w_n, B have |K̂|/|K| ∼ n, so 10n samples from K̂ is usually more than we need.

Here is the pseudocode:
Algorithm countKnapsackApprox(w, B)
1.  n ← length(w)
2.  m ← 10n
3.  ℓ ← 0
4.  for i ← 1 to n do
5.    a_i ← floor(n^2 · w_i / B)
    od
6.  ĉount(n, n^2) ← countKnapsackDP(a, n^2)
7.  for k ← 1 to m do
8.    S ← drawRandomSample(a, n^2)
9.    if (S is a feasible solution for w_1, ..., w_n, B)
10.     then ℓ ← ℓ + 1 fi
11. od
12. p ← ℓ/m
13. return floor(p · ĉount(n, n^2))

Observe that the time taken by this code-fragment is Θ(n^3) + O(m · (n + T_drawRandomSample(n))), where the Θ(n^3) is from line 6 and the O(m · (n + T_drawRandomSample(n))) comes from lines 7-11. We can achieve T_drawRandomSample(n) = O(n), hence countKnapsackApprox is Θ(n^3).

Note: When we mention "with high probability", this probability is taken over the random choices made by our algorithm; for any fixed input w_1, ..., w_n, B ∈ N, the claim holds with high probability. In most cases of w_1, ..., w_n, B, we can get away with far fewer samples than 10n. It may be interesting to experiment, by taking m = 10√n, say. In your experiments, also try to include a case where |K̂|/|K| ∼ n (for reasonably large n).

3.3 Generating a random sample from the set K̂

Our refined approximation of §3.2 is highly dependent on the existence of an oracle for drawing samples from the set K̂ of rounded feasible solutions - we refer to this oracle as drawRandomSample(a, n^2) in the code-fragment of §3.2.

I now claim that there is a simple sampling algorithm which allows us to generate a uniform random sample from K̂ in O(n) time, using the DP table that we have already built for a_1, ..., a_n, n^2. Recall that K̂ is the set of feasible solutions to our rounded instance a_1, ..., a_n, n^2. Also recall that the feasible solutions we consider are represented as subsets S of the index set {1, ..., n}.

It is your job to come up with the pseudocode for generating a uniform random sample from K̂ in O(n) time (and your job to justify your algorithm in your report).
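To make the overall shape of countKnapsackApprox concrete before you tackle the sampler, here is a hedged Java sketch of the §3.2 driver. The Sampler interface and the extra parameter are devices of this sketch only (the coursework method takes just (int[] w, int B)); the oracle is deliberately left abstract, since implementing it is part of Task 4. The DP method from §2 is repeated so the sketch is self-contained.

```java
public class ApproxSketch {
    // Abstract stand-in for the drawRandomSample oracle of §3.2.
    // A sample S is represented as a membership vector over {1, ..., n}.
    interface Sampler {
        boolean[] draw(int[] a, int bound);
    }

    // Driver loop of countKnapsackApprox (§3.2 pseudocode, lines 1-13).
    public static int countKnapsackApprox(int[] w, int B, Sampler oracle) {
        int n = w.length;
        int m = 10 * n;                                // number of samples
        int[] a = new int[n];
        for (int i = 0; i < n; i++)
            a[i] = (int) (((long) n * n * w[i]) / B);  // a_i = floor(n^2 * w_i / B)
        int countHat = countKnapsackDP(a, n * n);      // chount(n, n^2)
        int ell = 0;
        for (int k = 0; k < m; k++) {
            boolean[] S = oracle.draw(a, n * n);
            long total = 0;
            for (int i = 0; i < n; i++)
                if (S[i]) total += w[i];
            if (total <= B) ell++;                     // is S also feasible for the original instance?
        }
        // floor(p * chount) with p = ell/m, in exact integer arithmetic
        return (int) ((long) ell * countHat / m);
    }

    // The Theta(nB) DP of §2, repeated so this sketch is self-contained.
    static int countKnapsackDP(int[] w, int B) {
        int n = w.length;
        int[][] C = new int[n + 1][B + 1];
        for (int b = 0; b <= B; b++) C[0][b] = 1;
        for (int k = 1; k <= n; k++)
            for (int b = 0; b <= B; b++)
                C[k][b] = (w[k - 1] > b) ? C[k - 1][b]
                                         : C[k - 1][b] + C[k - 1][b - w[k - 1]];
        return C[n][B];
    }
}
```

Computing (long) ell * countHat / m gives floor(p · ĉount(n, n^2)) exactly, with no floating point.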
The sampling procedure does not have to be coded as a separate method drawRandomSample - it may be better to include it as a code fragment within countKnapsackApprox, so you can re-use the DP table for many samples.

Hint: You will need to make use of the DP table you will have built for a_1, ..., a_n, n^2. The recurrence (2) is also a key component in generating a sample. As an extra hint, I have drawn the a_1, ..., a_n, n^2 table below, showing the location of count(n, n^2) (this is the value |K̂|), and its two component values count(n−1, n^2 − a_n) and count(n−1, n^2). Note however, you will need to exploit the recurrence recursively.

[DP table for a_1, ..., a_n, n^2: rows 0, ..., n and columns 0, ..., n^2. The cell in row n, column n^2 holds count(n, n^2); in the row above, column n^2 − a_n holds count(n−1, n^2 − a_n) and column n^2 holds count(n−1, n^2).]

4 Testing

For your implementations of countKnapsackRecurse and countKnapsackDP, you could start by testing some examples. Suppose that the input is

    w_1 = 3, w_2 = 4, w_3 = 5, w_4 = 6, w_5 = 8, w_6 = 9, B = 19.

Answers: count(5, B) = 27, count(5, B − 9) = 11, count(6, B) = 38.

5 Your tasks

Download the file CountKnapsack.java from the course webpage. This file contains declarations for the methods you are required to write.

1. Write a method which implements the recursive algorithm countKnapsackRecurse to evaluate count(n, B), described in §1. [5 marks]

       public static int countKnapsackRecurse(int[] w, int B)

2. Write a method which implements the Θ(nB) dynamic programming algorithm discussed in §2. [10 marks]

       public static int countKnapsackDP(int[] w, int B)

   Note: In preparation for Task 4, it might help to write a method buildKnapsackTable which returns the entire DP table, rather than just count(n, B). countKnapsackDP could be written as a 'wrapper' around this.
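That wrapper arrangement might look like the following sketch (the class name is ours; buildKnapsackTable is the method name suggested in the note above, and the w_k → w[k−1] index shift from §2 applies):

```java
public class TableSketch {
    // Builds the full (n+1) x (B+1) table of §2, with C[k][b] = count(k, b).
    // Returning the whole table lets Task 4 re-use it for drawing many samples.
    public static int[][] buildKnapsackTable(int[] w, int B) {
        int n = w.length;
        int[][] C = new int[n + 1][B + 1];
        for (int b = 0; b <= B; b++)
            C[0][b] = 1;                               // count(0, b) = 1
        for (int k = 1; k <= n; k++)
            for (int b = 0; b <= B; b++)
                C[k][b] = (w[k - 1] > b)               // w_k is w[k-1] in Java
                        ? C[k - 1][b]
                        : C[k - 1][b] + C[k - 1][b - w[k - 1]];
        return C;
    }

    // countKnapsackDP as a thin wrapper around the table-builder.
    public static int countKnapsackDP(int[] w, int B) {
        return buildKnapsackTable(w, B)[w.length][B];
    }
}
```

On the test instance of §4, the table gives the intermediate values for free: C[5][19] is count(5, B) and C[5][10] is count(5, B − 9).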
   Or you could just duplicate your table-building code into the body of countKnapsackApprox - not a big deal.

3. Prove the inequality (3) mentioned in §3.1 by showing each of Step 1 (2 marks) and Step 2 (4 marks). Justify the Θ(n^3) running-time of the algorithm of §3.1 (4 marks). These proofs should be given in a file named task3.txt or task3.pdf. [10 marks]

4. Write a method to implement countKnapsackApprox, described in §3. Your main challenge will be to write code to draw a uniform random sample from the set of solutions for a_1, ..., a_n, n^2. This is the only tricky bit. If you can't work out pseudocode for sampling, please ask someone (but they should only help with pseudocode, not the coding)! Also, make sure to credit the person who helps in your submission. [10 marks]

       public static int countKnapsackApprox(int[] w, int B)

5. Write a short report of about 2-3 pages. This must include a justification of the sampling algorithm used in countKnapsackApprox, plus a discussion of your experimental results. You should run tests on at least 100 instances (and preferably more) of the knapsack problem, varying the number of weights n and also the sizes of the w_i and of B. It may be helpful to generate test examples randomly. Issues to be addressed: [15 marks]

   • Justification of the correctness and O(n) running time of the sampling code you write for countKnapsackApprox (5 marks). If you had to ask someone for help with this, that is fine, but credit them here!

   • Experimental comparison of the speed of countKnapsackRecurse against countKnapsackDP. The algorithms give identical answers, but their running times vary wildly (the running time of countKnapsackRecurse will grow exponentially with n, unless Java's compiler is better than I think!). (5 marks)

   • Experimental comparison of your implementation of countKnapsackApprox (5 marks). Many possibilities here.
     Certain things I would like to see explored:

     (i) For examples where B (and hence the w_i) is not much greater than n^2, a comparison of count(n, B) (computed exactly by countKnapsackDP) against the value returned by countKnapsackApprox.

     (ii) The same test as (i), but using a variant of countKnapsackApprox which takes only 10√n (or even 10) samples.

     (iii) Tests with a particular example where |K̂|/|K| ∼ n.

     (iv) For examples where B is very large, we cannot compute count(n, B) exactly, but it would be nice to test how the answer given by countKnapsackApprox varies depending on the number of samples used.

Implement all of your methods within CountKnapsack.java, available from the course webpage. Write your report in a file called report.pdf or report.txt. Then submit as follows:

    submit cs3 ads cw2 CountKnapsack.java task3.??? report.???

(If you have extra files, please also include them.)

The DEADLINE is NOON, Thursday, November 25, 2010.

Warning: Before submitting, please do "more" on the files-to-be-submitted ("acroread" on the report) from your current directory, to check that you have the right versions to hand (the rule is "what is submitted is what is marked").

References

[1] Martin Dyer. Approximate counting by dynamic programming. In Proceedings of the 35th Annual ACM Symposium on Theory of Computing (STOC 2003), pages 693-699.

Mary Cryan