
Assignment 2
Trouble with Bitmaps
Due: Monday, October 15th, 9:00 am

One interesting alternative to traditional tree-based and hash-based indexes is the Bitmap Index. It has an interesting history. A company called Nucleus, founded by Ted Glaser, patented the idea and developed a DBMS in which the bitmap was both the index structure and the data representation. The company failed in the late 1980s, but the idea has resurfaced and has recently been incorporated into a number of major commercial databases.

How does it work?

The Bitmap is a collection of 1’s and 0’s representing data. Any such bit-vector can be broken into runs, where each run consists of i 0’s followed by a single 1. Thus the bit-vector 000101 consists of two runs, of lengths 3 and 1, respectively. So we could represent this vector as 3 1, but we need to store both 3 and 1 in binary. In binary, 3 is 11 and 1 is 1, but storing 111 creates ambiguity: the bit-vector 010001 (runs of lengths 1 and 3) is also encoded as 111, and so is the vector 010101 (three runs of length 1).

So, how do we Compress Bitmaps to make them usable in applications?

A known fact about bitmaps is that the larger they are, the rarer the 1 bits become. If we use a bitmap index on key K of a file with n records, and there are m different values for key K, then the total number of bits in all bit-vectors for this index is mn. Each record contributes exactly one 1 bit, so only n of those mn bits are 1, and the probability that any given bit is 1 is 1/m, which is small when m is large. One common encoding scheme relies on this fact.

Example 1: Run-Length encoding of bit-vectors.

Consider the bit-vector for value 25: it is 100000001000. It has two runs, each a sequence of 0’s followed by a single 1 bit, with lengths (0,7); note that trailing 0’s are simply omitted. Let us now encode (0,7) using a special scheme that avoids the ambiguity above:

i=0: encode as 00
i=1: encode as 01
for all other i’s: encode i as (j-1) 1’s, followed by a single 0, followed by i in binary, where j-1=floor(log2(i)), so that j is the number of bits in the binary representation of i.
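The rules above can be sketched in a few lines of Python. This is only an illustrative sketch, not a reference solution: the function names `run_lengths`, `encode_run`, and `encode_bitvector` are hypothetical, and bit-vectors are assumed to be given as strings of the characters 0 and 1.

```python
def run_lengths(bits):
    """Split a bit-vector string into run lengths (zeros before each 1).

    Trailing 0's after the last 1 are dropped, as in Example 1.
    """
    runs = []
    zeros = 0
    for b in bits:
        if b == "0":
            zeros += 1
        elif b == "1":
            runs.append(zeros)
            zeros = 0
        else:
            raise ValueError("bit-vector may contain only 0 and 1")
    return runs


def encode_run(i):
    """Encode one run length i as (j-1) 1's, a single 0, then i in j bits."""
    if i < 0:
        raise ValueError("run length must be non-negative")
    if i == 0:
        return "00"              # special case from the scheme
    binary = format(i, "b")      # i in binary, without leading zeros
    j = len(binary)              # j = floor(log2(i)) + 1
    # For i = 1 this yields "0" + "1" = "01", matching the second rule.
    return "1" * (j - 1) + "0" + binary


def encode_bitvector(bits):
    """Concatenate the codes of all runs in the bit-vector."""
    return "".join(encode_run(i) for i in run_lengths(bits))
```

Under these assumptions, `run_lengths("100000001000")` gives the pair (0,7) from Example 1, and `encode_run(7)` reproduces the worked example that follows.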
For example, i=7 gives j = floor(log2(7)) + 1 = 3, so we write (3-1) 1’s, followed by a single 0, followed by 7 in binary (111). Combined, we get 11 0 111, or 110111. With this scheme, a run of length i is represented in 2j bits, or about 2log2(i) bits, so no run in a bit-vector of length n needs more than about 2log2(n) bits.

Your goal is to perform the following manipulations on bits and explain some results. No program implementation is required.

Question 1 [6p]: Using the given scheme, please encode the following bit-vectors [3p]:
a) 000000010000
b) 010000000100
c) 0001000000000010000010000
Please provide an explanation of each step and give the values of i and j [3p].

Question 2 [6p]: With the above encoding, decoding is now unique. To check that, try to DECODE the sequence 11101101. Starting at the beginning, scan forward to identify (j-1), the number of 1’s that precede the first 0, and then read i as the binary number in the next j bits. In this example, (j-1) is 3, so j=4, and the next 4 bits, 1101, are the integer 13 in binary. From this we can reconstruct the original bit-vector: a run of 13 0’s followed by a 1, or 00000000000001.

Using the above approach, DECODE the following bit sequences to integer values [3p]:
a) 11101101001011
b) 01110111
c) 00110111
Finally, reconstruct the original bit-vectors for the same values above [3p].

Question 3 [4p]: In your opinion, is it possible to provide a MORE OPTIMAL scheme to encode bit-vectors? Right now the number of bits taken to encode the number i is about 2log2(i). Can it be closer to log2(i)? If no, justify; if yes, provide an IDEA of how this can be achieved.

What to submit

Submit your answers to the above questions to your TA, according to the TA’s requirements. The course late assignment policy allows submission up to 2 days late, based on the date and time your TA receives it, with a penalty of 10% of your mark for each late day.

Bonuses for extra credit [up to 4p]:
1. Implement the encoder as per Question 1 [2p]
2. Implement the decoder as per Question 2 [2p]

Collaboration

The assignment must be done individually, so everything that you hand in must be your original work, except for code copied from the text. When someone else’s code is used, you must acknowledge the source explicitly. Copying another student’s work is academic misconduct.
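As an illustration of the decoding scan described in Question 2, the following Python sketch counts the leading 1’s, skips the single 0, and reads the next j bits as i. It is a minimal sketch under stated assumptions rather than a reference solution: the names `decode_runs` and `rebuild_bitvector` are hypothetical, and the input is assumed to be a string of the characters 0 and 1.

```python
def decode_runs(code):
    """Decode an encoded string into the list of run lengths i."""
    runs = []
    pos = 0
    while pos < len(code):
        # Count the (j-1) leading 1's of this code word.
        ones = 0
        while pos < len(code) and code[pos] == "1":
            ones += 1
            pos += 1
        # The next symbol must be the single separating 0.
        if pos >= len(code) or code[pos] != "0":
            raise ValueError("malformed code: missing separating 0")
        pos += 1
        # Read i from the next j bits.
        j = ones + 1
        field = code[pos:pos + j]
        if len(field) < j:
            raise ValueError("malformed code: truncated binary field")
        runs.append(int(field, 2))
        pos += j
    return runs


def rebuild_bitvector(runs):
    """Rebuild the bit-vector: each run is i 0's followed by a single 1.

    Trailing 0's omitted by the encoding cannot be recovered here.
    """
    return "".join("0" * i + "1" for i in runs)
```

For the sequence worked through in Question 2, `decode_runs("11101101")` yields the single run length 13, and `rebuild_bitvector` turns that back into 13 0’s followed by a 1.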