Assignment 2
Trouble with Bitmaps
Due: Monday, October 15th, 9:00 am
One interesting alternative to traditional tree- and hash-based indexes is the bitmap index. It has an
interesting history. A company called Nucleus, founded by Ted Glaser, patented the idea and
developed a DBMS in which the bitmap was both the index structure and the data representation. The
company failed in the late 1980s, but the idea has resurfaced and has recently been incorporated
into a number of major commercial databases.
How does it work?
A bitmap is a collection of 1’s and 0’s representing data. Such a bit-vector can be described as a
sequence of runs, where each run is i 0’s followed by a single 1. Thus the bit-vector 000101
consists of two runs, of lengths 3 and 1, respectively. So, we could represent this vector as 3, 1,
but we need to store both 3 and 1 in binary. In binary, 3 is 11 and 1 is 1, but storing the
concatenation 111 creates ambiguity: the bit-vector 010001 (runs of lengths 1 and 3) is also
encoded as 111, and so is the vector 010101 (three runs of length 1).
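The ambiguity described above can be checked with a minimal Python sketch. The helper name naive_concat is hypothetical (it is not part of the assignment); it simply writes each run length in plain binary and joins the results with no separator.

```python
def naive_concat(run_lengths):
    """Join run lengths written in plain binary, with no separator or length prefix."""
    return "".join(format(i, "b") for i in run_lengths)

# Three different bit-vectors, given by their run lengths, collapse to the same code.
print(naive_concat([3, 1]))     # runs of 000101 -> 111
print(naive_concat([1, 3]))     # runs of 010001 -> 111
print(naive_concat([1, 1, 1]))  # runs of 010101 -> 111
```

Because all three inputs yield 111, a decoder cannot tell where one run length ends and the next begins.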
So, how do we compress bitmaps to make them usable in applications?
A known fact about bitmaps is that the larger they are, the rarer 1 bits become. If we use a
bitmap index on key K of a file with n records, and there are m different values for key K, then
the number of bits in all bit-vectors for this index is mn. Each record sets exactly one bit, so
only n of those bits are 1, and the probability of any given bit being 1 is 1/m, which is small
when m is large.
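The 1/m density can be illustrated with a small Python sketch. The column values and the function name build_bitmap_index are made up for illustration; the sketch assumes every record has exactly one value for key K.

```python
from collections import defaultdict

def build_bitmap_index(column):
    """Map each distinct value to a bit-vector (list of 0/1) with one bit per record."""
    index = defaultdict(lambda: [0] * len(column))
    for pos, value in enumerate(column):
        index[value][pos] = 1
    return dict(index)

# Hypothetical key column: n = 12 records, m = 5 distinct values.
column = [25, 30, 45, 30, 25, 50, 45, 60, 25, 30, 60, 50]
index = build_bitmap_index(column)

n, m = len(column), len(index)
total_bits = sum(len(vec) for vec in index.values())
ones = sum(sum(vec) for vec in index.values())

print(total_bits == m * n)   # True: the index holds mn bits in total
print(ones == n)             # True: one 1 bit per record
print(ones / total_bits)     # fraction of 1 bits, equal to 1/m
```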
One common encoding scheme relies on this fact.
Example 1: Run-length encoding of bit-vectors. Consider the bit-vector for value 25: it is
100000001000, which has two runs of 0’s, each followed by a single 1 bit, with lengths (0,7). Note
that trailing 0’s are simply omitted.
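As a sketch of how the run lengths in Example 1 can be read off a bit-vector, the following Python function counts the 0's before each 1. The name run_lengths and the choice to pass the bit-vector as a string are assumptions made for illustration.

```python
def run_lengths(bits):
    """Return the number of 0's preceding each 1 in a bit-vector given as a string."""
    runs, zeros = [], 0
    for b in bits:
        if b == "1":
            runs.append(zeros)   # a 1 closes the current run
            zeros = 0
        elif b == "0":
            zeros += 1
        else:
            raise ValueError("bit-vector must contain only 0 and 1")
    return runs                  # 0's after the last 1 are dropped

print(run_lengths("100000001000"))  # [0, 7]
```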
Let us now encode (0,7) using a special scheme that avoids the ambiguity above.
i=0 encode as 00
i=1 encode as 01
for all other i’s:
encode it as (j-1) 1’s, followed by a single 0, followed by i in binary, where
j-1=floor(log2(i)), i.e. j is the number of bits in the binary representation of i.
For example, i=7 gives j = floor(log2(7)) + 1 = 3, so we write (3-1) 1’s, followed by a single 0,
followed by 7 in binary (111). Combined together we get 11 0 111, or 110111.
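The rule for a single run length can be sketched in Python as follows. This is only one possible reading of the scheme, and the function name encode_run is hypothetical. It takes j to be the number of binary digits of i, which also reproduces the special cases i=0 and i=1.

```python
def encode_run(i):
    """Encode one run length i as (j-1) 1's, a single 0, then i in j binary digits."""
    if i < 0:
        raise ValueError("run length must be non-negative")
    binary = format(i, "b")   # i in binary; "0" when i is 0
    j = len(binary)           # equals floor(log2(i)) + 1 for i >= 1
    return "1" * (j - 1) + "0" + binary

print(encode_run(0))  # 00
print(encode_run(1))  # 01
print(encode_run(7))  # 110111
```

The prefix of 1's tells a reader how many bits of i follow, which is what removes the ambiguity of plain concatenation.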
This compression represents a run of length i in 2j bits, so a run in an n-bit vector takes at
most about 2log2(n) bits.
Your goal is to perform the following manipulations on bits and explain some results. No
program implementation is required.
Question 1 [6p]:
Using the given scheme, please encode the following bit-vectors [3p]
a) 000000010000
b) 010000000100
c) 0001000000000010000010000
Please explain each step and give the values of i and j [3p]
Question 2 [6p]
With the above encoding, decoding is now unique. To check that, please try to DECODE the
sequence 11101101. Starting at the beginning, scan forward and count the 1’s up to the first 0;
that count is (j-1). Then read the next j bits as the binary number i. In this
example, (j-1) is 3, so j=4, and the next 4 bits, 1101, are the integer 13 in binary. From this we
can reconstruct the original bit-vector: a run of 13 0’s followed by a 1, or 00000000000001.
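The scan described above can be sketched in Python as follows. The function names decode_runs and rebuild_vector are hypothetical, and the sketch assumes the input is a well-formed concatenation of encoded run lengths.

```python
def decode_runs(code):
    """Decode a string of concatenated run codes into a list of run lengths."""
    runs, pos = [], 0
    while pos < len(code):
        j = 1
        while pos < len(code) and code[pos] == "1":  # count the (j-1) leading 1's
            j += 1
            pos += 1
        if pos >= len(code):
            raise ValueError("code ends before the 0 separator")
        pos += 1                                     # skip the single 0
        field = code[pos:pos + j]                    # next j bits hold i in binary
        if len(field) < j:
            raise ValueError("code ends inside a run length")
        runs.append(int(field, 2))
        pos += j
    return runs

def rebuild_vector(runs):
    """Rebuild the bit-vector (without trailing 0's) from its run lengths."""
    return "".join("0" * i + "1" for i in runs)

print(decode_runs("11101101"))   # [13]
print(rebuild_vector([13]))      # 00000000000001
```

Trailing 0's cannot be recovered this way, since the encoding omits them; the original vector length would have to be known separately.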
Using the above approach, DECODE the following bit sequences to integer values [3p]:
a) 11101101001011
b) 01110111
c) 00110111
Finally, reconstruct the original bit-vectors from the values decoded above [3p]
Question 3 [4p] In your opinion, is it possible to provide a MORE OPTIMAL scheme to encode
bit-vectors? Right now the number of bits taken to encode the number i is about 2log2(i). Can it be
closer to log2(i)? If no, justify; if yes, provide an IDEA of how this can be achieved.
What to submit
Submit answers to the above questions to your TA, according to TA requirements. The course
late-assignment policy allows submission up to 2 days late, based on the date and time it is
received by your TA, with a penalty of 10% of your mark for each late day.
Bonuses for extra credit [up to 4p]:
1. Implement an encoder as per Question 1 [2p]
2. Implement a decoder as per Question 2 [2p]
Collaboration
The assignment must be done individually, so everything that you hand in must be your original work,
except for code copied from the text. When someone else's code is used, you must acknowledge the
source explicitly. Copying another student's work is academic misconduct.