Winter 2009-2010 Assignment 1

Project in Advanced Programming B (236512)
The code + documentation need to be submitted (via e-mail) by Friday 6/11/09.
1) A (80,20) distribution with depth k over M items is defined as follows. The total
probability of the 80% least probable items is 0.2. All these items have the same
probability, 0.2/(0.8M), and they form the long tail of the distribution.
The total probability of the remaining 20% is 0.8, and is defined recursively in the same
manner. That is, the total probability of the 80% less probable items out of these 20% is
0.2*0.8, and they all have the same probability, while the total probability of the
remaining 20% is 0.8*0.8 and is defined in the same manner.
The depth of the distribution, k, determines how many different probability values there are.
When k=1, there is only one level of calculation, and the probabilities are 0.2/(0.8M) for
0.8M of the items and 0.8/(0.2M) for the remaining 0.2M items. That is, two different values.
In general, a distribution of depth k has k+1 different probability values. Note that the
recursive definition applies only to the more probable 20% of the items.
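The recursive definition above can be sketched as a short routine that lists, for given M and k, how many items fall in each probability level and what each item's probability is. This is only an illustration: the function name `levels_80_20` is hypothetical, and rounding the "20%" down with integer division is an assumption (the assignment leaves the rounding rule open). Exact fractions are used so that the total probability can be checked to be exactly 1.

```python
from fractions import Fraction

def levels_80_20(M, k):
    """Return [(item_count, per_item_probability), ...], most probable level first.

    Assumption: the "20%" head of each level is rounded down (remaining // 5);
    the tail receives whatever is left, so the counts always sum to M.
    """
    groups = []
    remaining = M          # items not yet assigned to a level
    mass = Fraction(1)     # total probability of those remaining items
    for _ in range(k):
        head = remaining // 5
        if head == 0:
            raise ValueError("depth k is too large for M items")
        tail = remaining - head
        # The tail shares 20% of the current mass equally among its items.
        groups.append((tail, mass * Fraction(1, 5) / tail))
        # The head keeps 80% of the mass and is split again on the next pass.
        remaining, mass = head, mass * Fraction(4, 5)
    # The innermost head is not split further: all its items are equally likely.
    groups.append((remaining, mass / remaining))
    groups.reverse()
    return groups
```

For example, `levels_80_20(100, 2)` yields three levels (k+1 = 3 values): 4 items with probability 4/25 each, 16 items with probability 1/100 each, and 80 items with probability 1/400 each.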
1. Implement a (80, 20) distribution. Specifically, write a function that returns, for a
known parameter M, a number in 1,2,…,M such that:
1. The probability of output i is p_i, 1 ≤ i ≤ M, where p_1 ≥ p_2 ≥ … ≥ p_M. That is, the items
are indexed in non-increasing order of probability.
2. The depth k is determined such that the set of most probable items includes at most
20 items. Find an analytic formula to calculate k.
3. Distinguish between a preprocessing step (performed once), and the code for
producing the output in each single call. Try to make the latter as efficient as possible.
Note that the parameter M is globally known.
Note: The number of items in each probability category must be an integer; whenever
fractions are involved they should be rounded (you need to decide how). Make sure that
the total number of items is M and that their overall probability is 1.
This function will be used in later stages of the project.
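One possible way to separate preprocessing from the per-call code is sketched below. It is a sketch under stated assumptions, not a reference solution: the class name `EightyTwentySampler`, its attributes, and the round-down rule for the 20% head are all hypothetical choices. Ignoring rounding, the smallest depth with at most 20 top items satisfies M·0.2^k ≤ 20, i.e. k = ⌈log_5(M/20)⌉ (k = 4 for M = 12500); the sketch computes k with an integer loop instead, to avoid floating-point edge cases when M/20 is an exact power of 5.

```python
import bisect
import random
from itertools import accumulate

class EightyTwentySampler:
    """Hypothetical helper: draws items 1..M from an (80,20) distribution."""

    def __init__(self, M, max_top=20, rng=None):
        # Preprocessing, performed once.
        if M < 1 or max_top < 4:
            raise ValueError("need M >= 1 and max_top >= 4")
        self.rng = rng or random.Random()
        counts, masses = [], []
        remaining, mass = M, 1.0
        while remaining > max_top:        # each pass adds one level of depth
            head = remaining // 5         # assumed rounding: floor of 20%
            counts.append(remaining - head)
            masses.append(0.2 * mass)
            remaining, mass = head, 0.8 * mass
        counts.append(remaining)          # the most probable items
        masses.append(mass)
        counts.reverse()
        masses.reverse()
        self.k = len(counts) - 1          # depth: k+1 probability values
        self.counts = counts              # items per level, most probable first
        # First (1-based) item index of each level.
        self.starts = [1 + s for s in accumulate([0] + counts[:-1])]
        # Cumulative probability mass of the levels.
        self.cum = list(accumulate(masses))

    def sample(self):
        # Per-call work: pick a level by its mass, then an item uniformly in it.
        u = self.rng.random() * self.cum[-1]
        g = min(bisect.bisect_right(self.cum, u), len(self.cum) - 1)
        return self.starts[g] + self.rng.randrange(self.counts[g])
```

With M = 12500 this gives levels of 20, 80, 400, 2000 and 10000 items. Each call costs one binary search over k+1 entries plus one uniform draw, so the per-call cost does not depend on M.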
2) Our Media-on-Demand system consists of 50 disks (servers) organized in four clusters.
One of these clusters is the root. The other 3 are called pop-clusters. The root consists of
20 servers, and each of the 3 pops consists of 10 servers. The storage capacity of each
server is 660,000Mb (approx 2/3 Terabytes). The system has a library of M=12500
movies.
The system supports transmission in two different qualities of service (QoS). For the high
QoS (type-A transmission), the movie file is stored in a format requiring 880Mb. For the
low QoS (type-B transmission), the movie file is stored in a format requiring α⋅880Mb. α
is a parameter (typical values are in the range 0.05-0.15). For simplicity, we assume that
all movies have the same storage requirements. If a movie is stored on some server, then both
formats must be stored on it. In other words, storing a movie on a server requires
(1+α)⋅880Mb.
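The storage arithmetic above can be checked with a few lines. The function names are hypothetical, and rounding the per-server capacity down to whole movies is an assumption.

```python
import math

SERVER_CAPACITY_MB = 660_000   # storage capacity of one server
HIGH_QOS_MB = 880              # size of the type-A (high QoS) file

def movie_footprint_mb(alpha):
    """Storage needed for one movie: both formats, (1 + alpha) * 880."""
    return (1 + alpha) * HIGH_QOS_MB

def movies_per_server(alpha):
    """Whole movies that fit on one server (assumed: rounded down)."""
    return math.floor(SERVER_CAPACITY_MB / movie_footprint_mb(alpha))
```

For α = 0.1, a movie occupies about 968Mb and a single server holds 681 movies.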
In this step you will decide how many copies of each movie will be stored (step 3 decides
on which servers). Assume that the movie popularities obey the 80-20 distribution explained
in (1). The output should be an integral vector F of length 12500, where F_i denotes how
many copies (in both formats) of movie i are stored.
It must hold that Σ_i F_i = 50*660,000/(880(1+α)), rounded down to an integer, and for all i,
50 ≥ F_i > 0, i.e., at least one copy of each movie in the whole system, and at most a single
copy of each movie on each disk. You should suggest and implement a method for calculating
the vector F in a way that reflects the popularities and fulfills the above conditions.
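As one illustration of how such a vector could be computed, the sketch below uses a greedy highest-averages rule (in the style of the Jefferson/D'Hondt method) with a floor of one copy and a cap of 50 copies per movie. This is only one of many possible apportionment methods and is not prescribed by the assignment; the function name `apportion` and its parameters are hypothetical.

```python
import heapq

def apportion(popularity, total_copies, max_copies=50):
    """Return F with sum(F) == total_copies and 1 <= F[i] <= max_copies.

    Every movie starts with one copy; each further copy goes to the movie
    with the largest quotient popularity[i] / (F[i] + 1). Ties go to the
    lower index, i.e. to the more popular movie when the input is sorted.
    """
    n = len(popularity)
    if not n <= total_copies <= n * max_copies:
        raise ValueError("total_copies must lie between n and n * max_copies")
    F = [1] * n
    # heapq is a min-heap, so quotients are stored negated.
    heap = [(-p / 2, i) for i, p in enumerate(popularity)]
    heapq.heapify(heap)
    for _ in range(total_copies - n):
        _, i = heapq.heappop(heap)
        F[i] += 1
        if F[i] < max_copies:   # a movie at the cap leaves the heap for good
            heapq.heappush(heap, (-popularity[i] / (F[i] + 1), i))
    return F
```

For instance, `apportion([0.5, 0.3, 0.2], 6)` returns `[3, 2, 1]`. In the setting of this step, `popularity` would hold the 12500 probabilities p_i and `total_copies` the required total number of copies.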
Remark: In general, this problem is called the apportionment problem, and it is known
for its importance in determining the number of seats each party should get in a
parliament with x seats, given the number of votes it received in the elections. There are
many known algorithms for this problem. Of course, in the general apportionment problem
there is no need to guarantee that F_i > 0. If you would like to read more about it, see
http://math.arizona.edu/~voting-theory/chapter01.doc (not required for the project).
3) The static assignment: In this step you are going to determine which movies are stored on
which disk, based on the output of step (2). The assignment is done according to the
following algorithm:
• Store one copy of each movie on the root cluster.
• For all movies i with Fi >1, store additional copies of these movies starting from the most
popular one, concluding with the least popular. For each movie i, use round-robin over
the clusters; in each cluster store one copy of the movie on a least loaded disk (that does
not hold a copy of this movie yet). If all the disks of some cluster are full, move to the next
cluster. It may be the case that more than one disk from a cluster holds a copy of the movie.
Note that the root cluster participates in this stage as a regular cluster and may get additional
copies of i. When moving to the next movie (i+1), the round-robin continues with the next
cluster.
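The two bullets above can be sketched as follows. Several details are assumptions rather than part of the assignment: movies are indexed from 0 in non-increasing order of popularity, the round-robin starts at the root cluster (cluster 0), each disk's capacity is given as a number of movies, and the sketch raises an error when a copy cannot be placed anywhere. The function name `static_assignment` is hypothetical.

```python
def static_assignment(F, cluster_sizes=(20, 10, 10, 10), disk_capacity=681):
    """Return disks[c][d]: the set of movies on disk d of cluster c.

    Cluster 0 is the root. F[i] is the number of copies of movie i.
    """
    disks = [[set() for _ in range(n)] for n in cluster_sizes]

    def place(movie, c):
        # Disks of cluster c that have room and do not hold this movie yet.
        eligible = [d for d in disks[c]
                    if len(d) < disk_capacity and movie not in d]
        if not eligible:
            return False
        min(eligible, key=len).add(movie)   # a least loaded eligible disk
        return True

    # Stage 1: one copy of every movie on the root cluster.
    for movie in range(len(F)):
        if not place(movie, 0):
            raise RuntimeError("root cluster cannot hold every movie")

    # Stage 2: additional copies, round-robin over all clusters.
    nxt = 0   # cluster at which the round-robin resumes (assumed start: root)
    for movie, copies in enumerate(F):
        for _ in range(copies - 1):
            for step in range(len(disks)):
                c = (nxt + step) % len(disks)
                if place(movie, c):          # skip clusters with no room
                    nxt = (c + 1) % len(disks)
                    break
            else:
                raise RuntimeError("no disk can take movie %d" % movie)
    return disks
```

On a toy instance, `static_assignment([3, 2, 1], cluster_sizes=(2, 1, 1), disk_capacity=3)` places movie 0 on both root disks and on the first pop cluster, and the extra copy of movie 1 on the second pop cluster, because the round-robin pointer carries over from one movie to the next.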