Data Warehousing Overview
CS 245 Notes 11
Hector Garcia-Molina, Stanford University

Outline
- What is a data warehouse?
- Why a warehouse?
- Models & operations
- Implementing a warehouse

What is a Warehouse?
- Collection of diverse data
  - subject oriented
  - aimed at executives and decision makers
  - often a copy of operational data, with value-added data (e.g., summaries, history)
  - integrated
  - time-varying
  - non-volatile
- Collection of tools
  - gathering data
  - integrating, cleansing, ...
  - querying, reporting, analysis
  - data mining
  - monitoring, administering the warehouse

Motivating Examples
- Forecasting
- Comparing performance of units
- Monitoring, detecting fraud
- Visualization

Warehouse Architecture
- Clients run query & analysis tools against the warehouse
- The warehouse stores integrated data plus metadata
- An integration layer collects data from multiple sources

Alternative to Warehousing: the Query-Driven Approach
- Two approaches to integration:
  - Query-Driven (Lazy): clients query a mediator, which forwards queries to the sources at query time
  - Warehouse (Eager): data is extracted from the sources and copied into the warehouse in advance
Query-Driven Approach
- Client queries go to a mediator, which sends sub-queries to wrappers over the sources
- Wrappers translate between the mediator's model and each source

Advantages of Warehousing
- High query performance
- Queries not visible outside the warehouse
- Local processing at sources unaffected
- Can operate when sources are unavailable
- Can query data not stored in a DBMS
- Extra information at the warehouse
  - modify, summarize (store aggregates)
  - add historical information

Advantages of Query-Driven
- No need to copy data
  - less storage needed
  - no need to purchase data
- More up-to-date data
- Query needs can be unknown
- Only a query interface is needed at the sources
- May be less draining on the sources

Warehouse Models & Operators
- Data models: relational (star schemas), cubes
- Operators

Star Data Model (relational)

product:
  prodId  name  price
  p1      bolt  10
  p2      nut   5

store:
  storeId  city
  c1       nyc
  c2       sfo
  c3       la

customer:
  custId  name   address    city
  53      joe    10 main    sfo
  81      fred   12 main    sfo
  111     sally  80 willow  la

sale:
  orderId  date    custId  prodId  storeId  qty  amt
  o100     1/7/97  53      p1      c1       1    12
  o102     2/7/97  53      p2      c1       2    11
  o105     3/8/97  111     p1      c3       5    50

Star Schema Terms
- Fact table: sale(orderId, date, custId, prodId, storeId, qty, amt)
- Dimension tables: product(prodId, name, price), customer(custId, name, address, city), store(storeId, city)
- Measures: e.g., qty, amt

Dimension Hierarchies
- Example: stores roll up by type, city, and region

store:
  storeId  cityId  tId  mgr
  s5       sfo     t1   joe
  s7       sfo     t2   fred
  s9       la      t1   nancy

sType:
  tId  size   location
  t1   small  downtown
  t2   large  suburbs

city:
  cityId  pop  regId
  sfo     1M   north
  la      5M   south

region:
  regId  name
  north  cold region
  south  warm region

- Variants: snowflake schema, constellations

Cube
- Fact table view: sale(prodId, storeId, amt)
  (p1, c1, 12), (p2, c1, 11), (p1, c3, 50), (p2, c2, 8)
- Multi-dimensional cube (dimensions = 2):
        c1  c2  c3
  p1    12      50
  p2    11  8
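To make the star-schema join concrete, here is a minimal sketch using Python's sqlite3 with the sample customer and sale rows from the slides. The table and column names follow the notes; the join query itself (total sales per customer city) is my own illustration, not from the slides.

```python
import sqlite3

# Build the dimension table (customer) and fact table (sale) from the slides.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (custId INTEGER PRIMARY KEY, name TEXT, address TEXT, city TEXT);
CREATE TABLE sale (orderId TEXT, date TEXT, custId INTEGER, prodId TEXT,
                   storeId TEXT, qty INTEGER, amt INTEGER);
INSERT INTO customer VALUES (53, 'joe', '10 main', 'sfo'),
                            (81, 'fred', '12 main', 'sfo'),
                            (111, 'sally', '80 willow', 'la');
INSERT INTO sale VALUES ('o100', '1/7/97', 53, 'p1', 'c1', 1, 12),
                        ('o102', '2/7/97', 53, 'p2', 'c1', 2, 11),
                        ('o105', '3/8/97', 111, 'p1', 'c3', 5, 50);
""")

# A typical star-schema query: join the fact table to a dimension table
# and aggregate a measure (amt) by a dimension attribute (city).
result = conn.execute("""
    SELECT c.city, SUM(s.amt)
    FROM sale s JOIN customer c ON s.custId = c.custId
    GROUP BY c.city""").fetchall()
print(result)  # la: 50 (sally), sfo: 23 (joe's two orders)
```

The fact table carries the measures; the dimension tables carry the descriptive attributes used for grouping and filtering.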
3-D Cube
- Fact table view: sale(prodId, storeId, date, amt)
  (p1, c1, 1, 12), (p2, c1, 1, 11), (p1, c3, 1, 50), (p2, c2, 1, 8), (p1, c1, 2, 44), (p1, c2, 2, 4)
- Multi-dimensional cube (dimensions = 3):
  day 1:      c1  c2  c3
        p1    12      50
        p2    11  8
  day 2:      c1  c2  c3
        p1    44  4
        p2

Cube Analysis Operators
- Select relevant data, find data trends, aggregate, ...

Aggregates
- Add up amounts for day 1. In SQL:
    SELECT sum(amt) FROM sale WHERE date = 1;
  Answer: 81
- Add up amounts by day. In SQL:
    SELECT date, sum(amt) FROM sale GROUP BY date;
  Answer: date 1 -> 81, date 2 -> 48

Another Example
- Add up amounts by day and product. In SQL (prodId must also appear in the SELECT list to match the answer shown):
    SELECT date, prodId, sum(amt) FROM sale GROUP BY date, prodId;
  Answer: (1, p1, 62), (1, p2, 19), (2, p1, 48)

Aggregation Operators
- sum, count, max, min, median, avg
- "Having" clause
- Using the dimension hierarchy:
  - average by region (within store)
  - maximum by month (within date)

Cube Aggregation
- Example: computing sums
  day 1:      c1  c2  c3        day 2:      c1  c2  c3
        p1    12      50              p1    44  4
        p2    11  8                   p2
  summed over days:
        c1  c2  c3
  p1    56  4   50
  p2    11  8
  summed over products: c1 -> 67, c2 -> 12, c3 -> 50
  summed over stores and days: p1 -> 110, p2 -> 19
  grand total: 129
- rollup: move up to coarser aggregates
- drill-down: move back down to finer detail
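The aggregation queries on these slides can be run directly. A small sketch using Python's sqlite3 with the six sample sale rows; the sums match the answers given on the slides.

```python
import sqlite3

# Build the sample fact table: sale(prodId, storeId, date, amt)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sale (prodId TEXT, storeId TEXT, date INTEGER, amt INTEGER)")
rows = [("p1", "c1", 1, 12), ("p2", "c1", 1, 11), ("p1", "c3", 1, 50),
        ("p2", "c2", 1, 8), ("p1", "c1", 2, 44), ("p1", "c2", 2, 4)]
conn.executemany("INSERT INTO sale VALUES (?, ?, ?, ?)", rows)

# Add up amounts for day 1
total_day1 = conn.execute("SELECT SUM(amt) FROM sale WHERE date = 1").fetchone()[0]
print(total_day1)  # 81

# Add up amounts by day
by_day = conn.execute("SELECT date, SUM(amt) FROM sale GROUP BY date").fetchall()
print(by_day)  # day 1 -> 81, day 2 -> 48

# Add up amounts by day and product
by_day_prod = conn.execute(
    "SELECT date, prodId, SUM(amt) FROM sale GROUP BY date, prodId").fetchall()
print(by_day_prod)  # (1, p1, 62), (1, p2, 19), (2, p1, 48)
```

Each GROUP BY corresponds to one way of collapsing the cube: by date alone, or by (date, product) cells.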
Cube Operators
- Notation: sale(c1, *, *) = sum over all products and days for store c1;
  sale(*, p2, *) = sum for product p2; sale(*, *, *) = grand total
- From the cube above: sale(c2, p2, *) = 8; sale(c1, *, *) = 67; sale(*, p2, *) = 19; sale(*, *, *) = 129

Extended Cube
- Add a "*" value along each dimension to store the aggregates:
  day 1:      c1  c2  c3  *
        p1    12      50  62
        p2    11  8       19
        *     23  8   50  81
  day 2:      c1  c2  c3  *
        p1    44  4       48
        p2
        *     44  4       48
  day *:      c1  c2  c3  *
        p1    56  4   50  110
        p2    11  8       19
        *     67  12  50  129

Aggregation Using Hierarchies
- Roll stores up to regions (customer c1 in Region A; customers c2, c3 in Region B):
  day 1: p1 -> A: 12, B: 50; p2 -> A: 11, B: 8
  day 2: p1 -> A: 44, B: 4
  summed over days: p1 -> A: 56, B: 54; p2 -> A: 11, B: 8

Data Analysis
- Decision Trees
- Clustering
- Association Rules

Decision Trees
- Example: conducted a survey to see which customers were interested in a new model car
- Want to select customers for an advertising campaign

Training set sale:
  custId  car     age  city  newCar
  c1      taurus  27   sf    yes
  c2      van     35   la    yes
  c3      van     40   sf    yes
  c4      taurus  22   sf    yes
  c5      merc    50   la    no
  c6      taurus  25   la    no

One Possibility
- age < 30?
  - Y: city = sf? -> Y: likely, N: unlikely
  - N: car = van? -> Y: likely, N: unlikely

Another Possibility
- car = taurus?
  - Y: city = sf? -> Y: likely, N: unlikely
  - N: age < 45? -> Y: likely, N: unlikely

Issues
- A decision tree cannot be "too deep": there would not be statistically significant amounts of data for the lower decisions
- Need to select the tree that most reliably predicts outcomes

Clustering
- Group records that are close in attribute space (e.g., income, education, age)

Another Example: Text
- Each document is a vector, e.g., <100110...> contains words 1, 4, 5, ...
- Clusters contain "similar" documents (e.g., sports, international news, business)
- Useful for understanding and searching documents

Clustering Issues
- Given: the desired number of clusters?
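The first tree ("One Possibility") amounts to two nested tests. This sketch is my encoding of that tree, checked against the six training records from the slide; it classifies all of them correctly, as the slide implies.

```python
# Decision tree "One Possibility" from the slides:
# root test on age, second-level tests on city (young) or car (older).
def likely_buyer(age, city, car):
    if age < 30:
        return city == "sf"   # under 30: likely only if in sf
    else:
        return car == "van"   # 30 or older: likely only if they drive a van

training = [  # (custId, car, age, city, newCar)
    ("c1", "taurus", 27, "sf", True),
    ("c2", "van",    35, "la", True),
    ("c3", "van",    40, "sf", True),
    ("c4", "taurus", 22, "sf", True),
    ("c5", "merc",   50, "la", False),
    ("c6", "taurus", 25, "la", False),
]

# The tree is consistent with every record in the training set.
for cust, car, age, city, bought in training:
    assert likely_buyer(age, city, car) == bought
print("tree consistent with training set")
```

"Another Possibility" (rooted on car = taurus) fits the same six records equally well, which is exactly the model-selection issue the slides raise next.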
- Finding the "best" clusters
- Are clusters semantically meaningful? (e.g., a "yuppies" cluster in sales records?)
- Using clusters for disk storage

Association Rule Mining
- Market-basket data:
  tran1  cust33  p2, p5, p8
  tran2  cust45  p5, p8, p11
  tran3  cust12  p1, p9
  tran4  cust40  p5, p8, p11
  tran5  cust12  p2, p9
  tran6  cust12  p9
- Trend: products p5, p8 often bought together
- Trend: customer 12 likes product p9

Association Rules
- Rule: {p1, p3, p8}
- Support: number of baskets in which these products all appear
- High-support set: one whose support meets a threshold s
- Problem: find all high-support sets

Implementation Issues
- ETL (Extraction, transformation, loading): getting data to the warehouse
- Entity Resolution
- What to materialize?
- Efficient analysis: association rule mining, ...

ETL: Monitoring Techniques
- Periodic snapshots
- Database triggers
- Log shipping
- Data shipping (replication service)
- Transaction shipping
- Polling (queries to source)
- Screen scraping
- Application-level monitoring
- Each has advantages & disadvantages!

ETL: Data Cleaning
- Migration (e.g., yen -> dollars)
- Scrubbing: use domain-specific knowledge (e.g., social security numbers)
- Fusion (e.g., mailing lists, customer merging): customer1(Joe) from the billing DB and customer2(Joe) from the service DB fuse into merged_customer(Joe)
- Auditing: discover rules & relationships (like data mining)

More Details: Entity Resolution
- Applications: comparison shopping, mailing lists, classified ads, customer files, counter-terrorism
- Example: records e1 (N: a, A: b, CC#: c, Ph: e) and e2 (N: a, Exp: d, Ph: e) share name and phone; resolving them yields one record with all the fields

Taxonomy: Pairwise vs Global
- Decide if r, s match only by looking at r, s? Or need to consider more (all) records?
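Support counting on the market-basket data above can be sketched in a few lines; the basket contents follow my reading of the flattened table on the slide.

```python
# Market-basket data from the slides: transaction -> set of products.
baskets = {
    "tran1": {"p2", "p5", "p8"},
    "tran2": {"p5", "p8", "p11"},
    "tran3": {"p1", "p9"},
    "tran4": {"p5", "p8", "p11"},
    "tran5": {"p2", "p9"},
    "tran6": {"p9"},
}

# Support of an item set = number of baskets containing all of its items.
def support(items):
    return sum(1 for b in baskets.values() if items <= b)

print(support({"p5", "p8"}))  # 3: p5 and p8 are often bought together
print(support({"p9"}))        # 3: all of customer 12's transactions
```

A set is "high-support" when this count reaches the threshold s; the mining problem is to find all such sets without enumerating every candidate.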
Why is ER Challenging?
- Huge data sets
- No unique identifiers
- Lots of uncertainty
- Many ways to skin the cat
- Example: does (Nm: Pat Smith, Ad: 123 Main St, Ph: (650) 555-1212) refer to
  (Nm: Patrick Smith, Ad: 132 Main St, Ph: (650) 555-1212), or to
  (Nm: Patricia Smith, Ad: 123 Main St, Ph: (650) 777-1111)?

Taxonomy: Pairwise vs Global
- Global matching complicates things a lot! e.g., we may change a decision as new records arrive:
  a new record (Nm: Patricia Smith, Ad: 132 Main St, Ph: (650) 777-1111, Hair: Black) can tip the
  Pat Smith match toward Patricia, yielding
  (Nm: Patricia Smith, Ad: 123 Main St, Ph: (650) 555-1212 / (650) 777-1111, Hair: Black)

Taxonomy: Outcome
- A partition of the records, or
- Merged records, built by iterating. Example, after merging:
  (Nm: Tom, Ad: 123 Main, BD: Jan 1, 85, Wk: IBM)
  + (Nm: Thomas, Ad: 123 Main, Oc: lawyer)
  -> (Nm: Tom, Ad: 123 Main, BD: Jan 1, 85, Wk: IBM, Oc: lawyer)
  + (Nm: Tom, Wk: IBM, Oc: lawyer, Sal: 500K)
  -> (Nm: Tom, Ad: 123 Main, BD: Jan 1, 85, Wk: IBM, Oc: lawyer, Sal: 500K)

Taxonomy: Record Reuse
- Can one record relate to multiple entities?
  (Ph: (650) 555-1212, Ad: 123 Main St) matches both (Nm: Pat Smith Sr., Ph: (650) 555-1212)
  and (Nm: Pat Smith Jr., Ph: (650) 555-1212), so the address is attached to both
- Partitions and merges with reuse: given records r, s, t, record s may merge with r (giving rs)
  and also with t (giving st)
- Record reuse is complex and expensive!

Taxonomy: Multiple Entity Types
- Papers and authors: author records a1...a5 linked to papers p1, p2, p5, p7; same author??
- Persons and organizations: person 1 and person 2 are brothers; person 1 is a member of
  Organization A, person 2 of Organization B; A does business with B

Taxonomy: Exact vs Approximate
- Exact: split products into cameras, CDs, books, ...; run ER within each category and resolve it fully
- Approximate: e.g., terrorists: sort records by age, then match "B Cooper, 30" only against records with ages 25-35
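Merging records as the union of their fields, as in the Tom/Thomas example, can be sketched as below. Keeping the first value seen for a conflicting field (e.g., Nm: Tom vs Thomas) is my assumption; real merge functions resolve conflicts more carefully.

```python
# Sketch of "outcome = merged records": union the fields of matched records,
# keeping the first value seen when a field conflicts (assumption, see above).
def merge(*records):
    out = {}
    for r in records:
        for field, value in r.items():
            out.setdefault(field, value)
    return out

t1 = {"Nm": "Tom", "Ad": "123 Main", "BD": "Jan 1, 85", "Wk": "IBM"}
t2 = {"Nm": "Thomas", "Ad": "123 Main", "Oc": "lawyer"}
t3 = {"Nm": "Tom", "Wk": "IBM", "Oc": "lawyer", "Sal": "500K"}

merged = merge(t1, t2, t3)
print(merged)
# {'Nm': 'Tom', 'Ad': '123 Main', 'BD': 'Jan 1, 85',
#  'Wk': 'IBM', 'Oc': 'lawyer', 'Sal': '500K'}
```

Iteration matters: the merged record may now match further records that neither original matched on its own.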
Implementation Issues (agenda)
- ETL; Entity Resolution; -> What to materialize?; Efficient analysis

What to Materialize?
- Store in the warehouse results useful for common queries
- Example: total sales. Materialize the day-summed view of the cube:
  day 1:      c1  c2  c3        day 2:      c1  c2  c3
        p1    12      50              p1    44  4
        p2    11  8                   p2
  materialize:
        c1  c2  c3
  p1    56  4   50
  p2    11  8

Cube Aggregates Lattice
- all
- city | product | date
- city, date | city, product | product, date
- city, product, date
- e.g., the "city" view gives c1 -> 67, c2 -> 12, c3 -> 50; the "all" view gives 129
- Use a greedy algorithm to decide what to materialize

Materialization Factors
- Type/frequency of queries
- Query response time
- Storage cost
- Update cost

Dimension Hierarchies
- Cities roll up to states:
  city  state
  c1    CA
  c2    NY
- The lattice grows accordingly: all; state; city; product; date; state, date;
  state, product; city, date; city, product; product, date; state, product, date;
  city, product, date (not all arcs shown...)

Interesting Hierarchy: time
- all -> years -> quarters -> months -> days, with weeks -> days as a separate path
  (weeks do not nest inside months)
- Conceptual dimension table:
  day  week  month  quarter  year
  1    1     1      1        2000
  2    1     1      1        2000
  ...
  7    1     1      1        2000
  8    2     1      1        2000

Implementation Issues (agenda)
- ETL; Entity Resolution; What to materialize?; -> Efficient analysis: association rule mining, ...

Finding High-Support Pairs
- Baskets(basket, item)
    SELECT I.item, J.item, COUNT(I.basket)
    FROM Baskets I, Baskets J
    WHERE I.basket = J.basket AND I.item < J.item
    GROUP BY I.item, J.item
    HAVING COUNT(I.basket) >= s;
- Why I.item < J.item? It keeps each pair once and avoids pairing an item with itself
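Computing every view in the cube aggregates lattice from the base facts is one group-by per subset of the dimensions. A sketch over the sample sale data; the loop structure is my illustration of the lattice, not an algorithm from the notes.

```python
from collections import defaultdict
from itertools import combinations

# Base facts: (product, city, date, amt), as in the slides' 3-D cube.
facts = [
    ("p1", "c1", 1, 12), ("p2", "c1", 1, 11), ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8), ("p1", "c1", 2, 44), ("p1", "c2", 2, 4),
]
dims = ("product", "city", "date")

# One aggregate view per subset of the dimensions (2^3 = 8 views).
views = {}
for k in range(len(dims) + 1):
    for group in combinations(range(len(dims)), k):
        view = defaultdict(int)
        for row in facts:
            key = tuple(row[i] for i in group)
            view[key] += row[3]  # sum the amt measure
        views[tuple(dims[i] for i in group)] = dict(view)

print(views[()])           # the "all" view: grand total 129
print(views[("city",)])    # c1 -> 67, c3 -> 50, c2 -> 12
print(views[("product", "city")])  # e.g., (p1, c1) -> 56
```

Materializing all 2^n views is rarely affordable, which is why the slides point to a greedy algorithm for choosing a subset of the lattice.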
Example
- Baskets:
    basket  item
    t1      p2
    t1      p5
    t1      p8
    t2      p5
    t2      p8
    t2      p11
    ...
- The self-join produces candidate pairs:
    basket  item1  item2
    t1      p2     p5
    t1      p2     p8
    t1      p5     p8
    t2      p5     p8
    t2      p5     p11
    t2      p8     p11
    ...
- Then check if each pair's count >= s

Issues
- Performance for size-2 rules: the pair table is big
- Performance for size-k rules: even bigger!
- How do we perform rule mining efficiently?

Association Rules
- Observation: if set X has support t, then each subset of X must have support at least t
- For 2-sets: if we need support s for {i, j}, then each of i and j must appear in at least s baskets

Algorithm for 2-Sets
(1) Find OK products: those appearing in s or more baskets
(2) Find high-support pairs using only OK products
- In SQL (the slide's version groups by item while selecting basket; the subquery form below keeps the whole rows for frequent items):
    INSERT INTO okBaskets(basket, item)
    SELECT basket, item FROM Baskets
    WHERE item IN (SELECT item FROM Baskets
                   GROUP BY item
                   HAVING COUNT(basket) >= s);
- Then perform pair mining on okBaskets
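The two-phase algorithm can be checked end to end with sqlite3. The okBaskets filter is written with a subquery so that whole (basket, item) rows for frequent items are kept; the small Baskets instance is my own.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Baskets (basket TEXT, item TEXT)")
data = [("t1", "p2"), ("t1", "p5"), ("t1", "p8"),
        ("t2", "p5"), ("t2", "p8"), ("t2", "p11"),
        ("t3", "p5"), ("t3", "p8")]
conn.executemany("INSERT INTO Baskets VALUES (?, ?)", data)
s = 3  # support threshold

# Phase 1: keep only rows whose item appears in >= s baskets ("OK products").
conn.execute("CREATE TABLE okBaskets (basket TEXT, item TEXT)")
conn.execute("""
    INSERT INTO okBaskets
    SELECT basket, item FROM Baskets
    WHERE item IN (SELECT item FROM Baskets
                   GROUP BY item HAVING COUNT(basket) >= ?)""", (s,))

# Phase 2: self-join over the pruned table to find high-support pairs.
pairs = conn.execute("""
    SELECT I.item, J.item, COUNT(I.basket)
    FROM okBaskets I, okBaskets J
    WHERE I.basket = J.basket AND I.item < J.item
    GROUP BY I.item, J.item
    HAVING COUNT(I.basket) >= ?""", (s,)).fetchall()
print(pairs)  # only (p5, p8) survives, with count 3
```

Phase 1 is exactly the subset observation at work: p2 and p11 each appear in fewer than s baskets, so no pair containing them can be frequent, and they never enter the expensive self-join.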
Counting Efficiently (threshold = 3)
- In SQL, the pair query runs over okBaskets:
    SELECT I.item, J.item, COUNT(I.basket)
    FROM okBaskets I, okBaskets J
    WHERE I.basket = J.basket AND I.item < J.item
    GROUP BY I.item, J.item
    HAVING COUNT(I.basket) >= s;
- Candidate pairs:
    basket  I.item  J.item
    t1      p5      p8
    t2      p5      p8
    t2      p8      p11
    t3      p2      p3
    t3      p5      p8
    t3      p2      p8
    ...
- One way: sort the pairs, then count & remove in one pass:
    sorted:                      high-support pairs:
    t3  p2  p3                   count  I.item  J.item
    t3  p2  p8                   3      p5      p8
    t1  p5  p8                   5      p12     p18
    t2  p5  p8                   ...
    t3  p5  p8
    t2  p8  p11
    ...
- Another way: scan & count, keeping a counter array in memory:
    count  I.item  J.item
    1      p2      p3
    2      p2      p8
    3      p5      p8
    5      p12     p18
    1      p21     p22
    2      p21     p23
    ...
  then remove pairs below the threshold:
    count  I.item  J.item
    3      p5      p8
    5      p12     p18
    ...
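The "counter array in memory" scheme is a single scan that increments a per-pair counter; a sketch with illustrative basket contents.

```python
from collections import Counter
from itertools import combinations

# Scan & count: one pass over the baskets, with an in-memory counter per pair.
baskets = {
    "t1": ["p5", "p8"],
    "t2": ["p5", "p8", "p11"],
    "t3": ["p2", "p3", "p5", "p8"],
}
threshold = 3

counts = Counter()
for items in baskets.values():
    # sorted() fixes a canonical order, mirroring the I.item < J.item condition
    for pair in combinations(sorted(items), 2):
        counts[pair] += 1

# Remove pairs below the threshold.
high_support = {pair: c for pair, c in counts.items() if c >= threshold}
print(high_support)  # {('p5', 'p8'): 3}
```

This avoids the sort, but the counter table must fit in memory, which is exactly the limitation the hashing scheme on the next slide addresses.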
Yet Another Way (hashing), threshold = 3
(1) Scan & hash & count: hash each candidate pair to a bucket of an in-memory hash table,
    incrementing a per-bucket count:
    bucket  count
    A       1
    B       5
    C       2
    D       1
    E       8
    F       1
(2) Scan & remove: drop pairs that hash to buckets with count < threshold; what survives may
    still contain false positives (low-support pairs sharing a bucket with high-support ones):
    basket  I.item  J.item
    t1      p5      p8
    t2      p5      p8
    t2      p8      p11
    t3      p5      p8
    t5      p12     p18
    t8      p12     p18
    ...
(3) Scan & count the surviving pairs exactly, with in-memory counters:
    count  I.item  J.item
    3      p5      p8
    1      p8      p11
    5      p12     p18
    ...
(4) Remove pairs below the threshold:
    count  I.item  J.item
    3      p5      p8
    5      p12     p18
    ...
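This hashing scheme is the idea behind the PCY algorithm: a cheap first pass over per-bucket counts prunes pairs before an exact second pass. A sketch; the bucket count and data are illustrative, and Python's built-in hash stands in for the slide's hash function.

```python
from collections import Counter
from itertools import combinations

baskets = [["p5", "p8"], ["p5", "p8", "p11"], ["p2", "p3"],
           ["p5", "p8"], ["p12", "p18"], ["p12", "p18"]]
threshold = 2
NUM_BUCKETS = 7  # tiny, for illustration

def bucket(pair):
    return hash(pair) % NUM_BUCKETS

# Pass 1: scan & hash & count per-bucket totals (fixed, small memory).
bucket_counts = Counter()
for items in baskets:
    for pair in combinations(sorted(items), 2):
        bucket_counts[bucket(pair)] += 1

# Pass 2: count exactly, but only pairs in frequent buckets. A frequent
# pair's bucket count is at least its own count, so nothing is lost;
# collisions only let some false positives through, filtered at the end.
pair_counts = Counter()
for items in baskets:
    for pair in combinations(sorted(items), 2):
        if bucket_counts[bucket(pair)] >= threshold:
            pair_counts[pair] += 1

result = {p: c for p, c in pair_counts.items() if c >= threshold}
print(result)  # (p5, p8) with count 3 and (p12, p18) with count 2
```

With many low-support pairs, most candidates die at the bucket test, so the exact counters in pass 2 stay small.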
Discussion
- Hashing scheme: 2 (or 3) scans of the data
- Sorting scheme: requires a sort!
- Hashing works well if there are few high-support pairs and many low-support ones:
  in the frequency plot of item pairs ranked by frequency, only a small tip sits above
  the threshold ("iceberg queries")

Implementation Issues (agenda)
- ETL; Entity Resolution; What to materialize?; -> Efficient analysis: association rule mining (iceberg queries), ...

Extra: Data Mining in the InfoLab
Recommendations in CourseRank
- Course ratings by user and quarter:
  user  q1    q2    q3    q4
  u1    a: 5  b: 5  d: 5  f: 3
  u2    a: 1  e: 2  d: 4  f: 3
  u3    g: 4  h: 2  e: 3
  u4    b: 2  g: 4  h: 4  e: 4
  u     a: 5  g: 4  e: 4
- u3 and u4 are similar to u -> recommend h
- Alternatively, u1 and u2 also took a -> recommend d (and f, h)

Sequence Mining
- Given a set of transcripts, use Pr[x|a] to predict whether x is a good recommendation
  given that the user has taken a. Two issues...

Pr[x|a] Not Quite Right
- Example: transcripts 1-5 contain, respectively: nothing, a, x, a -> x, x -> a
- Target user's transcript: [ ... a ... || unknown ]; recommend x?
- Pr[x|a] = 2/3, but Pr[x | a, ~x] = 1/2: we should condition on the user not having taken x yet

User Has Taken >= 1 Course
- User has taken T = {a, b, c}; need Pr[x | T, ~x]
- Approximate as Pr[x | a~x, b~x, c~x]; expensive to compute, so...

CourseRank User Study
- (Bar chart: percentage of ratings that were "good & expected" vs "good & unexpected")
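The Pr[x|a] vs Pr[x|a,~x] distinction can be checked on the five example transcripts; the list encoding below follows my reading of the slide's figure.

```python
# The five example transcripts: nothing, a, x, a -> x, x -> a (in order).
transcripts = [[], ["a"], ["x"], ["a", "x"], ["x", "a"]]

def pr_x_given_a(ts):
    # Naive estimate: among transcripts containing a, how many contain x?
    with_a = [t for t in ts if "a" in t]
    return sum("x" in t for t in with_a) / len(with_a)

def pr_x_given_a_not_x(ts):
    # Condition on the recommendation setting: a was taken with no x before it.
    eligible = [t for t in ts
                if "a" in t and ("x" not in t or t.index("x") > t.index("a"))]
    return sum("x" in t for t in eligible) / len(eligible)

print(pr_x_given_a(transcripts))        # 2/3: counts the x -> a transcript too
print(pr_x_given_a_not_x(transcripts))  # 1/2: the estimate that fits the target user
```

The naive estimate is inflated by the x -> a transcript, where x clearly cannot be a future recommendation; conditioning on "not yet taken x" removes it.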