Data Warehouses for Decision Support
Dr. Denise Ecklund
16 October 2002
©2002 Denise Ecklund

What and Why of Data Warehousing
[Figure: independent source systems (databases, datastores) feed a Data Extraction and Load step into the Data Warehouse System; DSS app workstations send queries to the warehouse and its datacubes.]
• What: A very large database containing materialized views of multiple, independent source databases. The views generally contain aggregation data (aka datacubes).
• Why: The data warehouse (DW) supports read-only queries for new applications, e.g., DSS, OLAP & data mining.

Data Warehouse (DW) Life Cycle
• The Life Cycle: Global Schema Definition, Data Extraction and Load, Query Processing, Data Update
• General Problems:
  – Heavy user demand
  – Problems with source data
    • ownership, format, heterogeneity
  – Underestimating complexity & resources for all phases
• Boeing Computing Services – DW for DSS in airplane repair
  – DW size: 2-3 terabytes
  – Online query services: 24×7 service
  – Data life cycle: retain data for 70+ years (until the airplane is retired)
  – Data update: No “nighttime”; concurrent refresh is required
  – Access paths: Support new and old methods for 70+ years

Global Schema Design – Base Tables
• Fact Table
  – Stores basic facts from the source databases (often denormalized)
  – Data about past events (e.g., sales, deliveries, factory outputs, ...)
  – Facts have a time (or time period) associated with them
  – Data is very unlikely to change at a data source; no updates
  – Very large tables (up to 1 TB)
• Dimension Table
  – Attributes of one dimension of a fact table (typically denormalized)
  – A chain of dimension tables to describe attributes on other dimension tables (normalized or denormalized)
  – Data can change at a data source; updates executed occasionally
[Figure: example schema. Fact Table (ProductID, SupplierID, PurchaseDate, DeliveryDate, CustYrs); Dimension Tables (ProductID, ProdName, ProdDesc, ProdStyle, ManufSite) and (SupplierID, SuppName, SuppAddr, SuppPhone, Date1stOrder); further dimension-table attributes shown: TimeID, Quarter, Year, AuditID, AuditName, AuditComp, Addr, AcctName, Phone, ContractYr.]

Schema Design Patterns
• Star Schema: fact table F with dimension tables D1, D2, D3; D1, D2, D3 are denormalized
• Snowflake Schema: fact table F with dimension chains D1.1/D1.2, D2.1/D2.2, D3.1/D3.2; D1, D2, D3 are normalized
• Starflake Schema: fact table F with a mix of single and chained dimension tables; D3 may be normalized or denormalized
• Constellation Schema: fact tables F1 and F2 with dimension tables D1, D2; D1 stores attributes about a relationship between F1 and F2

Summary Tables
• aka datacubes or multidimensional tables
• Store precomputed query results for likely queries
  – Reduce on-the-fly join operations
  – Reduce on-the-fly aggregation functions, e.g., sum, avg
• Store denormalized data
• Aggregate data from one or more fact tables and/or one or more dimension tables
• Discard and compute new summaries as the set of likely queries changes
[Figure: a Summary Table derived from two Fact Tables, Dim Table#1 and Dim Table#2.]

Summary Tables = Datacubes
[Figure: datacube “Total Expenses for Parts by Product, Supplier and Quarter” with axes Supplier (S1, S2, All-S), Fiscal Quarter (Q1, Q2, Q3, Q4, All-Q) and Product (P11, P14, P19, P27, P33, All-P). One visible column of cells reads Q1 0.2M £, Q2 0.4M £, Q3 0.6M £, Q4 1.0M £, All-Q 2.2M £. Cells called out in the figure:]
  – Average Price paid to all suppliers of parts for Product P11 in the 1st quarter (GROUP BY product, quarter)
  – Total Expenses paid to all suppliers of parts for Product P19 in the 1st quarter (GROUP BY product, quarter)
  – Total Expenses paid to Supplier S1 for parts for Product P33 in the 2nd quarter (GROUP BY supplier, product, quarter)
  – Total Expenses paid to Supplier S2 for parts for all products in all quarters (GROUP BY supplier)
• Typical pre-computed measures are:
  – Sum, percentage, average, std deviation, count, min-value, max-value, percentile

Too Many Summary Tables
• The Schema Design Problem:
  – Given a finite amount of disk storage, what views (summaries) will you pre-compute in the data warehouse?
• Factors to be considered:
  – What queries must the DW support?
  – What source data are available?
  – What is the time-space trade-off to store versus re-compute joins and measures?
  – Cost to acquire and update the data?
• An NP-complete optimization problem
  – Use heuristics and approximation algorithms; each uses a derivation lattice to analyze possible materialized views
    • Benefit Per Unit Space (BPUS)
    • Pick By Size (PBS)
    • Pick By Size–Use (PBS-U)
[Figure: derivation lattice of materialized views with nodes A, B, C, D, E, F, G, H and ALL/None.]

A Lattice of Summary Tables
• Derivation Lattice
  – Nodes: The set of attributes that would appear in the “group by” clause to construct this view
  – Edges: Connect view V2 to view V1 if V1 can be used to answer queries over V2
  – MetaData: estimated # of records in each view
• Determine cost and benefit of each view
• Select a subset of the possible views
• Typical simplifying assumptions:
  – Query cost ≈ # of records scanned to answer the query
  – I/O costs are much larger than CPU cost to compute measures
  – Ignore cost reductions due to using indexes to access records
  – All queries are equally likely to occur
[Figure: derivation lattice for parts, suppliers & customers with nodes PSC; PC, PS, SC; P, S, C; ALL/None, annotated with record counts (e.g., 6M for PSC, 0.1M for C).]

Benefit Per Unit Space (BPUS)
[Figure: derivation lattice of materialized views with nodes A, B, C, D, E, F, G, H and ALL/None.]

    View   #MRecs      View   #MRecs
    A      100         E      30
    B      50          F      40
    C      75          G      1
    D      20          H      10

• S is the set of views we will materialize
• bf(u,v,S) = max(0, min(#w | w∈S and w covers u) – #v), i.e., the records saved when answering u from v instead of from the cheapest view already in S; it is 0 when v is not cheaper
• Benefit(v, S) = SUM (bf(u,v,S) | u=v or v covers u)
• Selection sequence: S = {A}; S = S U {B}; S = S U {F}; S = S U {D}

    View   Benefit round #1   Benefit round #2   Benefit round #3
    B      50 * 5 = 250       (selected)         (selected)
    C      25 * 5 = 125       25 * 2 = 50        25 * 1 = 25
    D      80 * 2 = 160       30 * 2 = 60        30 * 2 = 60
    E      70 * 3 = 210       20 * 3 = 60        20 * 2 + 10 = 50
    F      60 * 2 = 120       60 + 10 = 70       (selected)
    G      99 * 1 = 99        49 * 1 = 49        49 * 1 = 49
    H      90 * 1 = 90        40 * 1 = 40        30 * 1 = 30

• Savings: read 420M records, not 800M

Pick By Size (PBS)
[Figure: derivation lattice with nodes PSC; PC, PS, SC; P, S, C; ALL/None.]

    Table sizes (in millions of records)
    Parts+Suppls+Custs   6M
    Parts+Custs          6M
    Parts+Suppls         0.8M
    Suppls+Custs         6M
    Parts                0.2M
    Suppls               0.01M
    Custs                0.1M

    S = {PSC}
    While (space > 0 and views is not empty) Do
        v = smallest view in views
        If (space - |v| > 0) Then
            space = space - |v|
            S = S U {v}
            views = views – {v}
        Else
            space = 0

• Storage Savings: Reduced from 19.2M records to 7.2M records
• Query Savings: Read 19.11M records, not 42M

Pick By Size-Use (PBS-U)
• Extends the Pick By Size algorithm to consider the frequency of queries on each possible view

    Table sizes (#Mrecs) & query frequency (probabilities)
    Parts+Suppls+Custs   6M      0.05
    Parts+Custs          6M      0.3
    Parts+Suppls         0.8M    0.3
    Suppls+Custs         6M      0.05
    Parts                0.2M    0.1
    Suppls               0.01M   0.1
    Custs                0.1M    0.1

    While (space > 0 and views is not empty) Do
        v = the view in views with the smallest |v| / prob(v)
        If (space - |v| > 0) Then
            space = space - |v|
            S = S U {v}
            views = views – {v}
        Else
            space = 0

[Figure: derivation lattice with nodes PSC; PC, PS, SC; P, S, C; ALL/None.]
• In this example the query frequencies did not change the selected views → same savings

Comparing Schema Design Algorithms
• All Proposed Algorithms
  – Produce only a near-optimal
    solution
    • Best known is within (0.63 – f) of optimal, where f is the fraction of space consumed by the largest table
  – Make (unrealistic) assumptions
    • e.g., ignore indexed data access
  – Rely heavily on having good metadata
    • e.g., table size and query frequency
• Algorithmic Performance
  – Benefit Per Unit Space (BPUS): O(n^3) runtime
  – Pick By Size (PBS): O(n log n) runtime
• Limited applicability for PBS?
  – Finds the near-optimal solution only for SR-Hypercube lattices
  – A lattice forms an SR-Hypercube when, for each v in the lattice except v = DBT,
    |v| ≥ ((# of direct children of v) * (# of records in the child of v))

Data Extraction and Load
Step 1: Extract and clean data from all sources
  – Select source, remove data inconsistencies, add default values
Step 2: Materialize the views and measures
  – Reformat data, recalculate data, merge data from multiple sources, add time elements to the data, compute measures
Step 3: Store data in the DW
  – Create metadata and access path data, such as indexes
• Major Issue: Failure during extraction and load
• Approaches:
  – UNDO/REDO logging
    • Too expensive in time and space
  – Incremental Checkpointing
    • When to checkpoint? Modularize and divide the long-running tasks
    • Must use UNDO/REDO logs also; need high-performance logging

Materializing Summary Tables
• Scenario: CompsiQ has factories in 7 cities. Each factory manufactures several of CompsiQ’s 30 hardware products. Each factory has 3 types of manufacturing lines: robotic, hand-assembly, and mixed-line.
• Schema for source data from Factory-A:
  – YieldInfo (ProductCode, RoboticYield, Hand-AssemYield, MixedLineYield, Week, Year)
  – ProductInfo (ProductCode, ProductName, ProductType, FCS-Date, EstProductLife)
• Target summary query: What is last year’s yield from Factory-A by product type?
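To make the target query concrete, here is a minimal Python sketch of the aggregation it asks for. It uses the Factory-A sample rows from the “Aggregation Over Irregular Blocks” slide; the function name `yield_by_product_type` and the in-memory layout are assumptions made for illustration and are not part of the lecture. The point it shows is that the three yield columns must be added up per row before grouping by ProductType.

```python
from collections import defaultdict

# Sample YieldInfo rows for Factory-A:
# (ProductCode, RoboticYield, Hand-AssemYield, MixedLineYield, Week, Year)
YIELD_INFO = [
    ("P11", 17, 12, 5, 45, "01"),
    ("P12", 9, 11, 12, 45, "01"),
    ("P13", 5, 10, 3, 45, "01"),
    ("P14", 22, 8, 7, 45, "01"),
    ("P11", 20, 15, 0, 46, "01"),
    ("P12", 8, 9, 10, 46, "01"),
    ("P13", 31, 0, 0, 46, "01"),
    ("P14", 15, 15, 20, 46, "01"),
]

# ProductInfo reduced to the one attribute the query needs (ProductType).
PRODUCT_TYPE = {"P11": "Net", "P12": "Video", "P13": "Net",
                "P14": "Video", "P15": "Audio"}


def yield_by_product_type(yield_rows, product_type, year):
    """Sum all three yield columns per product type for the given year."""
    totals = defaultdict(int)
    for code, robotic, hand, mixed, _week, row_year in yield_rows:
        if row_year == year:
            # Join on ProductCode, then aggregate across the yield columns.
            totals[product_type[code]] += robotic + hand + mixed
    return dict(totals)


summary = yield_by_product_type(YIELD_INFO, PRODUCT_TYPE, year="01")
print(summary)
```

Products with no yield rows, such as P15 here, do not appear in the result, which matches the inner-join semantics of the target query.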
Materialization using SchemaSQL
What is last year’s yield from Factory-A by product type?

    select p.ProductType, sum(y.lt)
    from Factory-A::YieldInfo→ lt,
         Factory-A::YieldInfo y,
         Factory-A::ProductInfo p
    where lt <> ”ProductCode” and
          lt <> ”Week” and
          lt <> ”Year” and
          y.ProductCode = p.ProductCode and
          y.Year = 01
    group by p.ProductType

• At execution time, lt ranges over the attribute names in relation YieldInfo
• Schema: YieldInfo (ProductCode, RoboticYield, Hand-AssemYield, MixedLineYield, Week, Year); ProductInfo (ProductCode, ProductName, ProductType, FCS-Date, EstProductLife)

Aggregation Over Irregular Blocks

    YieldInfo
    ProductCode  RoboticYield  Hand-AssemYield  MixedLineYield  Week  Year
    P11          17            12               5               45    01
    P12          9             11               12              45    01
    P13          5             10               3               45    01
    P14          22            8                7               45    01
    ...
    P11          20            15               0               46    01
    P12          8             9                10              46    01
    P13          31            0                0               46    01
    P14          15            15               20              46    01
    ...

    ProductInfo
    ProductCode  ProductName  ProductType  FCS-Date  EstProductLife
    P11          ATMCard      Net          3-8-99    36
    P12          SMILCard     Video        1-02-98   18
    P13          ATMHub       Net          1-11-99   36
    P14          MPEGCard     Video        24-3-00   24
    P15          MP3          Audio        17-1-01   36

User Queries
• Retrieve pre-computed data or formulate new measures not materialized in the DW.
• New user operations on logical datacubes:
  – Roll-up, Drill-down, Pivot/Rotate
  – Slicing and Dicing with a “data blade”
  – Sorting
  – Selection
  – Derived Attributes
[Figure: datacube with axes Supplier (S1, S2, All-S), Fiscal Quarter (Q1, Q2, Q3, Q4, All-Q) and Product (P11, P14, P19, P27, P33, All-P).]

Query Processing
• Traditional query transformations
• Index intersection and union
• Advanced join algorithms
• Piggy-backed scans
  – Multiple queries with different selection criteria
• SQL extensions => new operators
  – Red Brick Systems has proposed 8 extensions, including:
    • MovingSum and MovingAvg
    • Rank … When
    • RatioToReport
    • Tertiles
    • Create Macro

Data Update
• Data sources change over time
• Must “refresh” the DW
  – Adds new historical data to the fact tables
  – Updates descriptive attributes in the dimension tables
  – Forces recalculation of measures in summary tables
• Issues:
  1. Monitoring/tracking changes at the data sources
  2. Recalculation of aggregated measures
  3. Refresh typically forces a shutdown for DW query processing

Monitoring Data Sources
Approaches:
1. Value-deltas – Capture before and after values of all tuples changed by normal DB operations and store them in differential relations.
  • Issues: must take the DW offline to install the modified values
2. Operation-deltas – Capture SQL updates from the transaction log of each data source and build a new log of all transactions that affect data in the DW.
  • Advantages: DW can remain online for query processing while executing data updates (using traditional concurrency control)
3. Hybrid – use value-deltas and operation-deltas for different data sources or a subset of the relations from a data source.
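The difference between the first two approaches can be sketched in a few lines of Python. This is an illustration under assumptions, not a description of any product: the in-memory `source` table, the helper `apply_update`, the changed value 48, and both log formats are hypothetical. The value-delta log receives one before/after pair per changed tuple, while the operation-delta log receives the update statement once.

```python
def apply_update(table, predicate, changes, value_deltas, op_deltas, statement):
    """Apply an update to `table` (key -> row dict) and record both kinds of delta."""
    # Operation-delta: log the operation itself, once per statement.
    op_deltas.append(statement)
    for key, row in table.items():
        if predicate(row):
            before = dict(row)          # copy of the tuple before the change
            row.update(changes)
            # Value-delta: log before and after values for every changed tuple.
            value_deltas.append((key, before, dict(row)))


# Hypothetical source relation, loosely modelled on ProductInfo.
source = {
    "P11": {"ProductType": "Net", "EstProductLife": 36},
    "P12": {"ProductType": "Video", "EstProductLife": 18},
    "P13": {"ProductType": "Net", "EstProductLife": 36},
}

value_deltas, op_deltas = [], []
apply_update(
    source,
    predicate=lambda row: row["ProductType"] == "Net",
    changes={"EstProductLife": 48},
    value_deltas=value_deltas,
    op_deltas=op_deltas,
    statement="UPDATE ProductInfo SET EstProductLife = 48 WHERE ProductType = 'Net'",
)

print(len(op_deltas), "operation-delta entry,", len(value_deltas), "value-delta entries")
```

With a selective predicate the two logs are close in size; the gap grows with the number of tuples a single statement touches.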
Creating a Differential Relation
Approaches at the Data Source:
1. Execute the update query 3 times
  • (1) Select and record the before values; (2) Execute the update; (3) Select and record the after values
  • Issues: High cost in time & space; reduces autonomy of the data sources
2. Define and insert DB triggers
  • Triggers fire on “insert”, “delete”, and “update” operations; log the before and after values
  • Issues: Not all data sources support triggers; reduces autonomy of the data sources

Creating Operation-Deltas
• The process:
  – Scan the transaction log at each data source
  – Select pertinent transactions and delta-log them
• Advantage:
  – Op-delta is much smaller than the value-delta
• Issues:
  – Must transform the update operation on the data source schema into an update operation on the DW schema. This is not always possible, so op-deltas cannot be used in all cases.

Recalculating Aggregated Measures
• Delta Tables
  – Assume we have differential relations for the base facts in the data sources (i.e., value deltas)
  – Two processing phases (Propagation & Refresh):
1) Propagation – pre-compute all new tuples and all replacement tuples and store them in a delta table
[Figure: Differential Relations and the Global DW Schema feed the Propagation Process, which produces Delta Tables.]
2) Refresh – Scan the DW tuples, replace existing tuples with the pre-computed tuple values, insert new tuples from the delta tables
[Figure: Delta Tables and DW Tables feed the Refresh Process, which produces Updated DW Tables.]
Issue: Cannot pre-compute a Delta Table for non-commutative measures
Ex: average (without #records), percentiles
Must compute these during the refresh phase.
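The two phases can be sketched in Python as follows. This is a simplified illustration, assuming insert-only value deltas and a summary table keyed by product type; the function names `propagate`, `refresh` and `average`, the (sum, count) layout, and the sample numbers are assumptions. As a further simplification, the delta table here holds per-group increments rather than finished replacement tuples. Keeping the record count next to the sum is what allows the average to be maintained from a delta table; without the count it would have to be recomputed during refresh, as the slide notes.

```python
def propagate(new_facts):
    """Propagation: pre-aggregate new fact tuples into a delta table.

    The delta table maps group -> (sum, count) and is built without
    touching the warehouse summary table.
    """
    delta = {}
    for group, amount in new_facts:
        total, count = delta.get(group, (0, 0))
        delta[group] = (total + amount, count + 1)
    return delta


def refresh(summary, delta):
    """Refresh: update existing summary tuples and insert new groups."""
    for group, (d_total, d_count) in delta.items():
        total, count = summary.get(group, (0, 0))
        summary[group] = (total + d_total, count + d_count)


def average(summary, group):
    """Average derived from the stored sum and record count."""
    total, count = summary[group]
    return total / count


# Hypothetical summary table: group -> (sum of yields, number of records).
summary = {"Net": (118, 4), "Video": (146, 4)}

delta = propagate([("Net", 42), ("Audio", 10), ("Audio", 20)])
refresh(summary, delta)

print(summary)
print(average(summary, "Net"), average(summary, "Audio"))
```

The same split does not work for percentiles, because a percentile cannot be rebuilt from a fixed-size partial result per group.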
Data Marting
• What: Stores a second copy of a subset of a DW
[Figure: the Data Warehouse System feeds one or more Data Mart Systems through Data Extraction and Load; DSS app workstations send queries to the datacubes in each data mart.]
• Why build a data mart?
  – A user group with special needs (dept.)
  – Better performance accessing fewer records
  – To support a “different” user access tool
  – To enforce access control over different subsets
  – To segment data over different hardware platforms

Costs and Benefits of Data Marting
• System costs:
  – More hardware (servers and networks)
  – Define a subset of the global data model
  – More software to:
    • Extract data from the warehouse
    • Load data into the mart
    • Update the mart (after the warehouse is updated)
• User benefits:
  – Define new measures not stored in the DW
  – Better performance (mart users and DW users)
  – Support a more appropriate user interface
    • Ex: a browser with forms versus SQL queries
  – Company achieves more reliable access control

Commercial DW Products
• Short list of companies with DW products:
  – Informix/Red Brick Systems
  – Oracle
  – Prism Solutions
  – Software AG
• Typical Products and Tools
  – Specially tuned DB Server
  – DW Developer Tools: data extraction, incremental update, index builder
  – User Tools: ad hoc query and spreadsheet tools for DSS and post-processing (creating graphs, pie-charts, etc.)
  – Application Developer Tools (toolkits for OLAP and DSS): spreadsheet components, statistics packages, trend analysis and forecasting components

Ongoing Research Problems
• How to incorporate domain and business rules into DW creation and maintenance
• Replacing manual tasks with intelligent agents
  – Data acquisition, data cleaning, schema design, DW access paths analysis and index construction
• Separate (but related) research areas:
  – Tools for data mining and OLAP
  – Providing active database services in the DW
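Worked example for the Benefit Per Unit Space (BPUS) slide: the greedy view selection can be checked with a short Python sketch. The view sizes are the ones given on that slide. The lattice edges in `CHILDREN` are an assumption, chosen so that the number of views each candidate can answer matches the multipliers in the benefit rounds (for example, B answers 5 views); the function names are likewise illustrative. Following the rounds on the slide, each round picks the view with the largest benefit, and a view contributes nothing for queries it cannot answer more cheaply than the views already selected.

```python
# View sizes in millions of records (from the BPUS slide).
SIZE = {"A": 100, "B": 50, "C": 75, "D": 20, "E": 30, "F": 40, "G": 1, "H": 10}

# Assumed direct edges: view -> views that can be derived directly from it.
CHILDREN = {"A": ["B", "C"], "B": ["D", "E"], "C": ["E", "F"], "D": ["G"],
            "E": ["G", "H"], "F": ["H"], "G": [], "H": []}


def answers(v):
    """All views answerable from v: v itself plus everything below it."""
    seen, stack = set(), [v]
    while stack:
        u = stack.pop()
        if u not in seen:
            seen.add(u)
            stack.extend(CHILDREN[u])
    return seen


def cost(u, selected):
    """Records read to answer u from the cheapest selected view covering it."""
    return min(SIZE[w] for w in selected if u in answers(w))


def benefit(v, selected):
    """Total records saved over all views v can answer, never negative per view."""
    return sum(max(0, cost(u, selected) - SIZE[v]) for u in answers(v))


def greedy(rounds):
    """Start from the top view A and add the best view in each round."""
    selected, trace = ["A"], []
    for _ in range(rounds):
        candidates = [v for v in SIZE if v not in selected]
        best = max(candidates, key=lambda v: benefit(v, selected))
        trace.append((best, benefit(best, selected)))
        selected.append(best)
    return selected, trace


selected, trace = greedy(3)
total_cost = sum(cost(u, selected) for u in SIZE)
baseline = sum(cost(u, ["A"]) for u in SIZE)
print(trace, total_cost, baseline)
```

Under these assumed edges the sketch selects B, F and D in that order and reproduces the slide's savings of 420M records read instead of 800M.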