Project in Advanced Programming B (236512) Winter 2009-2010

Assignment 1

The code and documentation must be submitted (via e-mail) by Friday 6/11/09.

1) An (80,20) distribution with depth k over M items is defined as follows. The total probability of the 80% less probable items is 0.2. All these items have the same probability, 0.2/(0.8M), and they form the long tail of the distribution. The total probability of the remaining 20% is 0.8, and it is defined recursively in the same manner. That is, the total probability of the 80% less probable items out of these 20% is 0.2*0.8, and they all have the same probability, while the total probability of the remaining 20% is 0.8*0.8 and is again defined in the same manner.

The depth of the distribution, k, determines how many different probability values there are. When k=1, there is only one level of calculation, and the probabilities are 0.2/(0.8M) for 0.8M of the items and 0.8/(0.2M) for 0.2M of the items. That is, two different values. In general, a distribution of depth k has k+1 different probability values. Note that the recursive definition applies only to the more probable 20% of the items.

1. Implement an (80,20) distribution. Specifically, write a function that returns, for a known parameter M, a number in 1,2,…,M such that:
1. The probability for output i is pi, 1 ≤ i ≤ M, where p1 ≥ p2 ≥ … ≥ pM. That is, the items are indexed in non-increasing order of probability.
2. The depth k is determined such that the set of most probable items includes at most 20 items. Find an analytic formula to calculate k.
3. Distinguish between a preprocessing step (performed once) and the code for producing the output in each single call. Try to make the latter as efficient as possible. Note that the parameter M is globally known.

Note: The number of items in each probability category must be an integer; whenever fractions are involved they should be rounded (you need to decide how).
Make sure that the total number of values is M and that their overall probability is 1. This function will be used in later stages of the project.

2) Our Media-on-Demand system consists of 50 disks (servers) organized in four clusters. One of these clusters is the root; the other 3 are called pop-clusters. The root consists of 20 servers, and each of the 3 pops consists of 10 servers. The storage capacity of each server is 660,000Mb (approx. 2/3 Terabytes). The system has a library of M=12500 movies.

The system supports transmission in two different qualities of service (QoS). For the high QoS (type-A transmission), the movie file is stored in a format requiring 880Mb. For the low QoS (type-B transmission), the movie file is stored in a format requiring α⋅880Mb, where α is a parameter (typical values are in the range 0.05-0.15). For simplicity, we assume that all files have the same storage requirements. If a movie is stored on some server, then both formats must be stored on it. In other words, storing a movie on a server requires (1+α)⋅880Mb.

In this step you will decide how many copies of each movie will be stored and on which servers. Assume that the movie popularities obey the 80-20 distribution explained in (1). The output should be an integral vector F of length 12500, where Fi denotes how many copies (in both formats) of movie i are stored. It must hold that Σi Fi = 50*660,000/(880(1+α)), and for all i, 50 ≥ Fi > 0, i.e., at least one copy of each movie in the whole system, and at most a single copy of each movie on each disk. You should suggest and implement a method for calculating the vector F in a way that reflects the popularities and fulfills the above conditions.

Remark: In general, this problem is called the apportionment problem, and it is known for its importance in determining the number of seats each party should get in a parliament with x seats, given the number of votes it received in the elections.
There are many known algorithms for this problem. Of course, in the general apportionment problem there is no need to guarantee that Fi>0. If you would like to read more about it, see http://math.arizona.edu/~voting-theory/chapter01.doc (not required for the project).

3) The static assignment: In this step you are going to determine which movies are stored on which disk, based on the output of step (2). The assignment is done according to the following algorithm:
• Store one copy of each movie on the root cluster.
• For all movies i with Fi >1, store additional copies of these movies, starting from the most popular one and concluding with the least popular. For each movie i, use round-robin over the clusters; in each cluster store one copy of the movie on a least loaded disk (that does not hold a copy of this movie yet). If all the disks of some cluster are full, move to the next cluster. It may be the case that more than one disk from a cluster holds a copy of the movie. Note that the root cluster participates in this stage as a regular cluster and may get additional copies of i. When moving to the next file (i+1), the round robin algorithm moves to the next cluster.
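
As an illustration for (1), the following Python sketch shows one possible way to separate the preprocessing step from the per-call step. It is not a reference solution. The names depth, preprocess and sample, the choice of Python, and the decision to round each 20% group to the nearest integer are all assumptions made here. The depth is taken as the smallest k with M/5^k ≤ 20, i.e. k = ⌈log5(M/20)⌉, and is computed with integers to avoid floating-point rounding.

```python
import bisect
import random
from fractions import Fraction


def depth(m, top=20):
    """Smallest k with m / 5**k <= top, i.e. k = ceil(log5(m / top))."""
    k = 0
    while m > top * 5 ** k:
        k += 1
    return k


def preprocess(m):
    """Build the lookup table once.

    Returns (starts, sizes, cumulative), one entry per probability group,
    with the most probable group first. Rounding rule (an assumption):
    each 20% head is rounded to the nearest integer, the tail gets the rest.
    """
    groups = []                                # (size, total probability)
    remaining, mass = m, Fraction(1)
    for _ in range(depth(m)):
        head = max(1, (remaining + 2) // 5)    # remaining / 5, rounded to nearest
        groups.append((remaining - head, mass / 5))   # 80% tail holds 0.2 of the mass
        remaining, mass = head, mass * 4 / 5          # recurse on the 20% head
    groups.append((remaining, mass))           # the last head keeps what is left
    groups.reverse()                           # most probable items get index 1, 2, ...

    starts, sizes, cumulative = [], [], []
    first, acc = 1, Fraction(0)
    for size, total in groups:
        starts.append(first)
        sizes.append(size)
        acc += total
        cumulative.append(float(acc))          # exact sum, converted once
        first += size
    return starts, sizes, cumulative


def sample(table, rng=random):
    """Per-call step: pick a group by its total mass, then an item uniformly."""
    starts, sizes, cumulative = table
    g = bisect.bisect_right(cumulative, rng.random())
    return starts[g] + rng.randrange(sizes[g])
```

Under these assumptions, M=12500 gives k=4 and five groups of 20, 80, 400, 2000 and 10000 items, so each call costs one binary search over k+1 entries plus one uniform draw.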
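
For the vector F in (2), one of the many possible apportionment rules is sketched below in Python. It is only an example of a method that satisfies the stated conditions, not the required one. The function name apportion and its parameters are hypothetical. The rule used is a highest-averages (D'Hondt-style) rule, adapted here by giving every movie one copy up front and by capping each movie at the number of disks.

```python
import heapq


def apportion(popularity, total, cap):
    """Split `total` copies among items so that 1 <= copies[i] <= cap.

    Every item starts with one copy. Each further copy goes to the item
    with the largest popularity / (copies + 1); ties go to the lower index.
    """
    n = len(popularity)
    if not n <= total <= n * cap:
        raise ValueError("total must lie between n and n * cap")
    copies = [1] * n
    # heapq is a min-heap, so priorities are stored negated.
    heap = [(-p / 2, i) for i, p in enumerate(popularity)] if cap > 1 else []
    heapq.heapify(heap)
    for _ in range(total - n):
        _, i = heapq.heappop(heap)
        copies[i] += 1
        if copies[i] < cap:                    # capped items leave the heap
            heapq.heappush(heap, (-popularity[i] / (copies[i] + 1), i))
    return copies
```

For example, with α=0.1 and the total rounded down (the rounding direction is an assumption), 50*660,000/(880*1.1) gives 34090 copies, and the call would be apportion(p, 34090, 50) with p the list of movie probabilities from (1).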
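
The static assignment in (3) can be sketched in Python as follows. This is one reading of the algorithm, under several assumptions that the text leaves open: disk capacity is counted in whole movies, the root copy goes to a least loaded root disk, a cluster is also skipped when every disk with free space already holds the movie, and an error is raised when no cluster can take a copy. The name static_assignment and its parameters are hypothetical.

```python
def static_assignment(copies, cluster_sizes, capacity):
    """Return disks[c][d] = set of movie indices; cluster 0 is the root.

    copies[i] is the number of copies of movie i, with movies indexed from
    the most popular (0) to the least popular. capacity is the number of
    movies that fit on one disk.
    """
    disks = [[set() for _ in range(n)] for n in cluster_sizes]

    def place(movie, c):
        """Store movie on a least loaded eligible disk of cluster c."""
        eligible = [d for d in disks[c]
                    if len(d) < capacity and movie not in d]
        if not eligible:
            return False
        min(eligible, key=len).add(movie)
        return True

    # Step 1: one copy of every movie on the root cluster.
    for movie in range(len(copies)):
        if not place(movie, 0):
            raise ValueError("root cluster cannot hold one copy of every movie")

    # Step 2: additional copies, round-robin over all clusters (root included).
    nxt = 0                                    # cluster that gets the next copy
    for movie, f in enumerate(copies):
        for _ in range(f - 1):
            for step in range(len(disks)):
                c = (nxt + step) % len(disks)
                if place(movie, c):
                    nxt = (c + 1) % len(disks)   # carry on with the next cluster
                    break
            else:
                raise ValueError(f"no room for another copy of movie {movie}")
    return disks
```

For the system described above, the call would use cluster_sizes = [20, 10, 10, 10]; the round-robin pointer is kept across movies, so movie i+1 starts at the cluster after the one that received the last copy of movie i.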