Bitmap Indices for Data Warehouse Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY Star Schema Vs. Multi-dimensional Range Queries product prodId name price p1 bolt 10 p2 nut 5 sale oderId o100 o102 o105 date custId 1/7/97 53 2/7/97 53 3/8/97 111 SUM (qty * amt) WHERE ProdId in [p1.. p10] AND custId < 200 store prodId p1 p2 p1 customer storeId c1 c1 c3 custId 53 81 111 qty 1 2 5 storeId c1 c2 c3 city nyc sfo la amt 12 11 50 name joe fred sally address 10 main 12 main 80 willow city sfo sfo la Characteristics of Multi-Dimensional Range Queries in Data Warehouse Ad-Hoc Give N dimensions (attributes), every combination is possible: 2N combinations A Data Cube equals to 2N GROUP-Bys High Dimensions ( > 20) Large Number of Records Multi-Dimensional Index Fails! R-Trees or KD-Trees Effective only for moderate number of dimensions Efficient only for queries involving all indexed dimensions. For Ad-hoc Rang Queries, Projection Index is usually better, and Bitmap Index is even better. Projection Index Fix the order of the records in the base table Project records along some dimension Store i.e, A single Column Keeping the record order Keeping the duplicates Like “array” in C language base table store storeId c1 c2 c3 storeId c1 c2 c3 city nyc sfo la Projection Index Multi-dimensional Range Queries : A General Idea Build an index for each dimension (attribute); A Projection Index A B-Tree 1 Primary B-Tree, N -1 Secondary B-Trees For each involved dimension, use the index on that dimension to select records; “AND” the records to get the final answer set. How to make the “AND” operation fast? Projection Index (B-Tree is similar) Scan each involved dimension, And return a set of RIDs. Intersection the RID sets Sets have different lengths We can use Sort and Merge to do the Intersection Life is easier when all the sets have the same length and in the same order Use 1/0 to record the membership of each record General Ideas of Bitmap Index Fix the order of records in the base table Suppose the base table has m records For each dimension For each distinct dimension value (as the KEY) Build a bitmap with m bits (as the POSITIONS) A bitmap is like an Inverted Index “AND”, “OR” operations realized by bitwise logical operations Well supported by hardware Basic Bitmap Index P. O’Neil, Model 204,1987 Size of Bitmap Indices Number of Bitmap (Indices) How to build bitmap indices for dimensions with large distinct values Temperature dimension Size (i.e., Length) of a Single Bitmap Three Solutions Encoding Binning Reduce the Number of Bitmaps Reduce the Number of Bitmaps Compression Reduce the Size of a Single Bitmap Encoding Strategies Equality-encoded Bit-sliced index Assume dimension A has c distinct values, use log2c bitmap indices to represent each record (its value) Range-encoded Good for equality queries,such as “temperature == 100” Basic Bitmap Index Good for one-sided range queries, such as “Pressure < 56.7” Interval-encoded Good for two-sided range queries, such as“35.8 < Pressure < 56.7” Binning Encoding mainly considers discrete dimension values Basic Ideas of Binning Usually integers Build a bitmap index for a bin instead of for a distinct value The Number of Bitmaps has nothing to do with the number of distinct values in a dimension. Pros and Cons Pros:control the number of bitmap via controling the number of bins. Cons:need to check original dimension values to decide if the records really satisfy query conditions. A Binning Example: Values of Dimension A lie in [0, 100] Compression Strategies General-purpose compression methods Software packages are widely available Tradeoff between query processing and compression ratio De-compress data first Specific methods BBC (Byte-aligned Bitmap Code ), Antoshenkov,1994,1996. Adopted since Oracle 7.3 WAH(Word-aligned Hybrid Bitmap code ), Wu et al 2004, 2006. Used in Lawrence Berkeley Lab for high-energy physics WAH(Word-aligned Hybrid Bitmap code ) Based on run-length encoding Use machine WORD as the unit for compression For consecutive 0s or 1s in a bit sequence (part of a bitmap) Instead of BYTE in BBC Design Goal: reduce the overhead of de-compression, in order to speedup query response. Run-length encoding Bit sequence B: 11111111110001110000111111110001001 fill:a set of consecutive identical bits (all 0s or all 1s) The first 10 bits in B fill = count “+” bit value 1111111111=10 “+” 1 tail: a set of mixed 0s and 1s The last 8 bits in B Run: Run = fill + tail Basic Ideas of WAH Define fill and tail appropriately so that they can be stored in WORDs. Word-aligned Hybrid Bitmap code: 32-bit WORD Characteristics of Industrial Products Model 204. (Pat O’Neil,1987) Oracle (1995) The first that adopted bitmap index Basic Bitmap Index, No binning, No compression Now owned by Computer Corporation of America Adopted compressed bitmap index since 7.3 Probably use BBC for compression, Equality-encoded, No binning. Sybase IQ bit-sliced index(Pat O’Neil et al,1997) No binning, No compression For dimension with small number of distinct values, use Basic Bitmap Index. References Kurt Stockinger, Kesheng Wu, Bitmap Indices for Data Warehouses, In Wrembel R., Koncilia Ch.: Data Warehouses and OLAP: Concepts, Architectures and Solutions. Idea Group, Inc. 2006.