Data Warehousing/Mining Comp 150 Data Warehousing Design (not in book) Instructor: Dan Hebert Data Warehousing/Mining 1 Warehouse Design What to materialize in the warehouse – Which source data? – Which summary tables? – Which indices? Influenced by both querying and maintenance Trade storage space and update time for query speed Data Warehousing/Mining 2 Designing a Data Warehouse Data models designed to support DW require optimization strategies for DSS Design option – Relational model in DW - ROLAP Servers for analysis – Special-purpose multi-dimensional data model in DW (MDDB) - MOLAP Servers for analysis Data Warehousing/Mining 3 Why is DW Design Different? DSS: few transactions, each accessing a large number of records Typical ER designs tend to be complex and difficult to navigate Topic/Function Data Content Data Organization Nature of Data Data Structure, Format Random Access Probability Data Update Usage Response Time Data Warehousing/Mining Operational Current values Application by application Dynamic Complex: suitable for operational computation High Updated on a field-by-field basis Highly structured repeptive processing Sub-second to 2-3 seconds Decision Support Archival data Subject areas across enterprises Static until refreshed Simple: suitable for business analysis Moderate to low Accessed and read: no direct update Highly unstructured analytical processing Seconds to minutes, hours 4 Multi-Dimensional Data Measures - numerical data being tracked Dimensions - business parameters that define a transaction Example: Analyst may want to view sales data (measure) by geography, by time, and by product (dimensions) Dimensional modeling is a technique for structuring data around the business concepts ER models describe “entities” and “relationships” Dimensional models describe “measures” and “dimensions” Data Warehousing/Mining 5 Dimensional Modeling Using Relational DBMS Special schema design: star, snowflake Special indexes: bitmap, multi-table join Special tuning: maximize query throughput Proven technology (relational model, DBMS), tend to outperform specialized MDDB especially on large data sets Products – IBM DB2, Oracle, Sybase IQ, RedBrick, Informix Data Warehousing/Mining 6 Dimensional Modeling Using Special-Purpose Model (MDDB) Facts stored in multi-dimensional arrays Dimensions used to index array Sometimes on top of relational DB Products – Pilot, Arbor Essbase, Gentia Data Warehousing/Mining 7 Example “Sales by product line over the past six months” “Sales by account between 1990 and 1995” Account Info Key columns joining fact table Numerical Measures to dimension tables Prod Code Time Code Acct Code Qty Fact table for measures Product Info Dimension tables Sales Time Info ... Data Warehousing/Mining 8 Dimensional Modeling Dimensions are organized into hierarchies – E.g., Time dimension: days weeks quarters – E.g., Product dimension: product product line brand Dimensions have attributes Physical architecture describe by Star Schema Data Warehousing/Mining 9 Example Cont’d Time Time Code Quarter Code Quarter Name Week Code Day Code Day name Account Account Code Key Account Code Account Name Account Type Account Market Data Warehousing/Mining Sales Geography Code Time Code Account Code Product Code Dollar Amount Units Geography Geography Code Region Code Region Mgr City Code City Name Product Product Code Product Name Brand Mgr Brand Code Prod. Line Code Prod. Line Name Prod. Name ... 10 Dimensional Modeling Cont’d Fact tables are fully normalized Dimension tables are denormalized – Repetitively stored for sake of simplicity and performance Product_Code Product_Name Widget Gadget Snicket Graplit 101 102 103 104 Product_Color Blue Blue Orange Orange Brand_Code XYZ ABC Product_Code 101 102 103 104 105 Product_Name Widget Gadget Gadget Snicket Graplit Data Warehousing/Mining Product_Color Blue Blue Green Orange Orange Brand_Code XYZ XYZ ABC ABC Brand_Mgr J. Smith T. Jones Brand_Code XYZ XYZ XYZ ABC ABC Brand_Mgr J. Smith J. Smith J. Smith T. Jones T. Jones 11 Extending Dimensional Modeling Some instances when star schema is not ideal – Denormalized schema may require too much storage – Very large dimension tables are affecting performance negatively “Snowflake schema” – Normalized dimensions Data Warehousing/Mining 12 Advantages of Dimensional Modeling Define complex, multi-dimensional data with simple model Reduces the number of phycial joins a query has to process Allows the data warehouse to evolve with rel. low maintenance HOWEVER! Star schema and rel. DBMS are not the magic solution – Query optimization is still problematic Data Warehousing/Mining 13 Index Structures Traditional access methods – B-trees, hash tables, grid files, etc. Popular in warehouses – Inverted indexes (lists) – Bit map indexes – Join indexes Data Warehousing/Mining 14 Inverted Index Index for every keyword Query: – “Get people with age =20 and name =‘Fred’” (1) Use age index and retrieve ids: r4,r18,r34,r35 (2) Use name index and retrieve ids: r18,r52 (3) Answer is intersection: r18 Data Warehousing/Mining 15 Bit Map Index Developed for Model 204 DBMS in 1987 18 19 Id 1 2 3 4 5 6 7 Name Joe Fred Sally Nancy Tom Pat Dave Age 20 20 21 20 20 25 21 18 20 23 20 21 22 23 25 26 Data Warehousing/Mining 1 1 0 1 1 0 0 0 0 1 0 0 0 0 16 Using Bit Maps Query: – “Get people with age=20 and name =‘Fred’” (1) Bit map for age =20: 1101100 (2) Bit map for name=‘Fred’: 0100000 (3) Answer is intersection: 0100000 Good if domain cardinality is small Bit vectors can be compressed Data Warehousing/Mining 17 Join Index Index on one table for a quantity that involves a column value of a different table Product Sale Id P1 P2 rid r1 r2 r3 r4 r5 r6 Data Warehousing/Mining Name Bolt Nut Prodid P1 P2 P1 P2 P1 P1 Price 10 5 Storeid C1 C1 C3 C2 C1 C2 jindex r1,r3,r5,r6 r2,r4 Date 1 1 1 1 2 2 Amt 12 11 50 8 44 4 18 Aggregation Process by which low-level data is summarized in advanced and placed into intermediate tables Speeds up query processing, less ad-hoc – “Show me total US sales for 1990” How much to aggregate? Data cube data model – All possible aggregations along all dimensions – Cells contain aggregated values – How much of the cells in cube should be pre-computed? Data Warehousing/Mining 19 Aggregation Cont’d Special operators to navigate the hierarchies – – – – – Roll-up: remove a dimension element e.g., Roll-up products to brands Drill-down (opposite of roll-up), Slice (defines a subcube) Various visualization ops (e.g., pivot) Data Warehousing/Mining 20 Example roll-up to region NY SF roll-up to brand Product LA Juice Milk Coke Cream Soap Bread 10 34 56 32 12 56 roll-up to week M T W Th F S S Dimensions: Time, Product, Geography Attributes: Product (upc, price, …) Geography … … Hierarchies: Product Brand … Day Week Quarter City Region Country Time 56 units of bread sold in LA on M Data Warehousing/Mining 21 Warehouse DBMS—Buzzwords Used primarily for decision support (DSS) – A.K.A. On-Line Analytical Processing (OLAP) – Complex queries, substantial aggregation – TPC-D benchmark Multidimensional data model – Can be implemented either using rel. model or proprietary data model – Multi-dimensional database (MDDB) Aggregation: Data Cube – All possible groupings and aggregations Data Warehousing/Mining 22 Warehouse DBMS — Buzzwords (2) ROLAP vs. MOLAP Special purpose OLAP servers that directly implement multidimensional data and operations – Roll-up = aggregate on some dimension – Drill-down = deaggregate on some dimension ROLAP: Oracle, Sybase IQ, RedBrick MOLAP: Pilot, Essbase, Gentia Data Warehousing/Mining 23 Warehouse DBMS - Buzzwords (3) Clients: – Query and reporting tools – Analysis tools – Data mining: discovering patterns of various forms Poses many new research issues in: – Query processing and optimization – Database design – View management Data Warehousing/Mining 24 Data Warehouse Physical Design Data Warehousing/Mining 25 Common Design Activities – OLTP Schema design (base tables) – Normalization (3NF, BCNF, …) Schema design (views) – Mostly for convenience, security – Usually NOT for performance – Exception: View indexing [Roussopolous 1982] Materialize pointers to tuples instead of tuples themselves Index selection – In practice, use rules of thumb – Tool: DBDSGN [IBM Almaden], RDT for System R Data Warehousing/Mining 26 Relational Views Part of the ANSI/SPARC architecture Derived, virtual table View definition is an SQL query statement View update problem Good for logical data independence, security How to implement a view for querying – Query modification: modify view query into a query on the underlying base tables – View materialization: physically implementing view as table Data Warehousing/Mining 27 View Indexing ... In general, no need for materialized views in OLTP systems – Increase in performance through indexing – Secondary storage space used to be expensive New idea (N. Roussopolous 1982) - view index Store index whose elements point to tuples which comprise view View selection problem: Find a subset of views, which, when indexed, minimizes the total cost of answering all queries as well as cost of maintaining the view structures Data Warehousing/Mining 28 … View Indexing … Assume N views to consider, 2N subsets Can’t do simple enumeration (cost to answer all queries in a given subset) – NP-complete problem Solution uses search algorithm to approximate the optimal view selection – Potential exponential worst case – Only subset of views needs to be considered Cost function which computes for each state (set of views + remaining storage) (1) Cost to compute queries, maintenance of current index set + (2) Estimate of incremental cost that must be incurred in extending view set (upper bound on actual cost) Data Warehousing/Mining 29 … View Indexing But ... Algorithm does not consider index selection on views (view indexes) – Indexes have impact on which view indexes to choose Very simple cost model (maintenance cost ~ size of view) – Problem: Cost of maintaining view is a complex query optimization problem – Cannot be estimated without knowing which subview indexes are chosen Good first treatment of subject Data Warehousing/Mining 30 Indexing ... Which type of index structure, which attribute(s) to index on – “Access path selection” -> DBA – Many choices, depend on many factors – Space-time trade-off Index selection problem: Which ordering rule for stored records and which non-clustered indices Database practitioners use rules/guidelines (e.g., SYBASE manual) Design tools available – Support dba during creation and maintenance of database, i.e., solve the index-selection problem Data Warehousing/Mining 31 Factors that Influence Index Selection Maintenance Storage cost Global solution depends on index selection of all tables combined Data Warehousing/Mining 32 Example ORDERS (OrderNo, SuppNo, PartNo, Date, Qty) PARTS (PartNo, Descrip, SuppNo, QtyOnhand, Color, …) Query: SELECT O.SuppNo FROM PARTS P, ORDERS O WHERE O.PartNo = P.PartNo AND O.SuppNo = 15 AND P.QtyOnHand BETWEEN 100 AND 150 Situation 1: Assume PARTS clustered on Descrip and non-clustered index on PartNo Then: Best clustered index for ORDERS SuppNo Situation 2: Assume PARTS clustered on PartNo Then: Best clustered index for ORDERS PartNo Data Warehousing/Mining 33 Data Warehouse Design Schema design (base tables) – Star schema (dimensions, measures) Schema design (view/index selection) – Mostly for performance enhancement Physical warehouse design. Balance three costs: (1) The cost of answering queries using warehouse relations and additional structures (2) Cost of maintaining additional structures (3) Cost of secondary storage Data Warehousing/Mining 34 WH Schema Design Tables must map efficiently to the operational requests OLTP: maximize concurrency, optimize insert/update/delete performance OLAP: Queries large, complex, ad-hoc, data-intensive, no updates Query centric view -> Star schema (facts, dimensions) – Widely accepted, intuitive, easy to navigate (query formulation) Problem: Poor performance on OLTP db engines – Join processing (pair-wise join problem) – Number of pair-wise joins for N tables = N! e.g., 7 tables -> 5,040 combinations, 5 different join algorithms -> 25,200 combinations Data Warehousing/Mining 35 Star Schema Join Problem Heuristic: “pick directly related tables” doesn’t work in star schema Options: – Join unrelated tables (Cartesian product) – Parallelism (speed-up, scale-up) – New join techniques (e.g., bit vector star joins) in combination with new indexing schemes (e.g., bit maps, variant indexes) Data Warehousing/Mining 36 Warehouse Access Path (Physical) Design Problem Materialize user queries as views (reduces cost 1) How to reduce cost 2 and 3? “View Index Selection Problem VIS” – Choose a set of supporting views and a set of indexes to materialize such that the total maintenance cost for the warehouse is minimized (cost 2 & 3) Data Warehousing/Mining 37 Solutions - Relational DB Design Practices Rel. DB design algorithms must be adapted – View index approach has no index selection, simple cost model (cannot achieve global solution by locally optimizing each materialized subview) – Index selection approach can be extended - but trouble ahead Algorithms require queries and frequencies as input Data Warehousing/Mining 38 Solutions - Rule Condition Maintenance Work on rule condition evaluation – How to evaluate trigger conditions for rules efficiently ( ~ view maintenance problem: rule is triggered whenever view that satisfies its condition becomes non-empty) – Discrimination networks for each rule (view) RETE model materializes selection and join nodes TREAT materializes only selection nodes – Incremental evaluation techniques Recommendations not generally applicable Data Warehousing/Mining 39