DIMENSIONAL MODELING
MIS2502 Data Analytics

So we know…
• Relational databases are good for storing transactional data
• But bad for analytical data
• What we can do is design an analytical data store (ADS) based on the operational data store
• That architecture gives us the advantages of both
  • Relational database for operational use
  • Analytical database for analysis (Online Analytical Processing)

Why have a separate ADS?
• Issue 1: Performance
  • The structure is built to handle analysis
  • You keep the load off the operational data store
• Issue 2: Usability
  • We can structure the data in an intuitive way
  • You keep the load off of your IT department

Some terminology
Data Warehouse
• Takes many forms
• Really is just a repository for data
Data Mart
• More focused
• Specially designed for analysis
Data Cube
• Organization of data as a “multidimensional matrix”
• Implementation of a Data Mart

How they all relate
The data in the operational database… …is put into a data warehouse… …which feeds the data mart… …and is analyzed as a cube. We’ll start here.

The Data Cube
• Core component of Online Analytical Processing and Multidimensional Data Analysis
• Made up of “facts” and “dimensions”
[Figure: data cube with dimensions Product (M&Ms, Diet Coke, Doritos, Famous Amos), Store (Ardmore, PA; Temple Main; Cherry Hill, NJ; King of Prussia, PA), and month (Jan. 2011, Feb. 2011, Mar. 2011); each cell holds quantity & total price]
Quantity sold and total price are measured facts. Why isn’t product price a measured fact?

The Data Cube
A single summary record representing a business event (monthly sales).
The highlighted element represents all the M&Ms sold in Ardmore, PA in January 2011.
[Figure: Product × Store × month data cube (quantity & total price per cell) with one cell highlighted]

The Data Cube
This is called “slicing the data.”
The highlighted elements represent Famous Amos cookies sold on Temple’s Main campus from January to March 2011.
[Figure: the same cube with the highlighted Famous Amos / Temple Main cells for Jan. 2011 to Mar. 2011]

The Data Cube
What do the blue highlighted elements represent?
What do the orange highlighted elements represent?
[Figure: the same cube with blue and orange highlighted cells]

The n-dimensional cube
• Could you have a data mart with five dimensions?
  • If so, give an example
• Then why does our cube example (and most others you will see) only have three?

Designing the Cube: The Star Schema
• We can’t reasonably store the original data as a single table
  • Summarization would be too slow
  • A lot of redundancy
• So it is stored as a star schema

[Figure: star schema]
Fact: Sales (Sales_ID, Product_ID, Store_ID, Time_ID, Quantity Sold, Total Price)
Dimension: Product (Product_ID, Product_Name, Product_Price, Product_Weight)
Dimension: Store (Store_ID, Store_Address, Store_City, Store_State, Store_Type)
Dimension: Time (Time_ID, Day, Month, Year)

Revisiting Usability: Why a Cube?
• So you have your star schema, now what?
• Non-IT folks probably won’t understand data normalization and table relations in MySQL
  • They won’t know how to do table JOINs
  • They probably cannot work with raw table data
• Two options to produce usable data (i.e., a cube)
  • #1: Perform one big table JOIN from the star schema
  • #2: Calculate meaningful values, store in a data cube

Option 1: One Big JOIN
Storing the entire join would generate many, many rows!
[Table: one joined row per sale, with sample rows for Sales ID 1000, 1001, 1002]
Sales Fact: Sales ID, Qty. Sold, Total Price
Product Dimension: Prod. ID, Prod. Name, Prod. Price, Prod. Weight
Store Dimension: Store ID, Store Address, Store City, Store State, Store Type
Time Dimension: Time ID, Day, Month, Year
It adds up fast… see Sales Person ex.
1000 products × 300 stores × 365 days × 100 daily product purchases = 10,950,000,000 records per year!

Option 2: Cube of Summary Stats
• Summarize the data and store it in the cube
• Retrieve only the summary, not the raw data.
• Much more efficient, but we are “locked in”
[Figure: Product × Store × month data cube (M&Ms, Diet Coke, Doritos, Famous Amos; Ardmore, PA; Temple Main; Cherry Hill, NJ; King of Prussia, PA; Jan. 2011 to Mar. 2011); each cell holds quantity & total price]

Demo – Foodmart
• A pre-created data cube that can be read in Excel
• Summaries are already created
Dimensions: Gender, Houseowner, Marital status, Media type, Product name, Sales region, Store name, Total children, Yearly income
Measured Facts: Cost, Sales

Updating the cube
• Data marts are non-volatile (i.e., they can’t be changed)
  • Logically: It’s a record of what has happened
  • Practically: Would require constant re-computation of the cube
• The cube is refreshed periodically from the transactional database
  • Overnight
  • Daily
  • Weekly

Designing the Star Schema
• Ralph Kimball’s Four Step Process for Data Cube Design (Kimball et al., 2008)
  • Choose the business process
  • Decide on the level of granularity
  • Identify the dimensions
  • Identify the fact
• From where do you get the data?
  • Can be in house, but might come from elsewhere too…

Choose the business process
• What your data cube is “about”
• Determined by the questions you want to answer about your organization

Question                                              Business Process
Who is my best customer?                              Sales
What are my highest selling products?                 Sales
Which teachers have the best student performance?     Standardized testing
Which supplier is offering us the best deals?         Purchasing

Note that a “business process” is not always about business.

Decide on the level of granularity
• Level of detail for each event (row in the table)
• Will determine the data in the dimensions
• Example: Who is my best customer?
  • The “event” is a sales transaction
  • Choices for time: yearly, quarterly, monthly, daily
  • Choices for store: store, city, state
How would you select the right granularity?

Identify the dimensions
• Determined by the business process
• Refined by the level of granularity
• The key elements of the process needed to answer the question
• Example: Sales transaction
  • Our example schema defines a “sale” as taking place for a particular product, in a particular store, at a particular time
• Could this data mart tell you
  • The best selling product?
  • The best customer?
Try it for the “student performance” example.

Identify the fact
• The data associated with the business event
Keys
• Unique identifier for each row
• Unique identifiers from dimensions
• Associates a combination of the dimensions to a unique business event
• Example: Sales has Product_ID, Store_ID, and Time_ID
Measured, numeric data
• Quantifiable information for each business event
• Does not describe any particular dimension
• Describes a particular combination of dimensional data
• Example: Sales has quantity_sold and total_price.
Try it for the “student performance” example.

Data cube caveats
• You have to choose your aggregations in advance
  • So choose wisely!
• Consider a sales data cube with product, store, time, salesperson
  • If quantity_sold and total_price are the facts, you can’t figure out the average number of people working in a store
  • All people might not have sold all products and therefore wouldn’t be in the joined table
• Granularity is also an issue
  • Can’t track daily sales if “date” is monthly (pre-aggregated)
• So why not include every single sale and do no aggregation beforehand?
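
Sketch: star schema, cube, slice, and the granularity caveat
The ideas above can be tried out in a few lines. The sketch below is only an illustration, not the course's implementation: it uses Python's built-in sqlite3 module instead of MySQL, a trimmed version of the Sales star schema, and made-up rows (the IDs, prices, and quantities are all hypothetical). It builds a monthly cube with one JOIN plus GROUP BY, reads one cell, takes a slice, and then shows why a monthly cube cannot answer a daily question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Trimmed star schema: one fact table (Sales) and three dimensions.
conn.executescript("""
CREATE TABLE Product (Product_ID INTEGER PRIMARY KEY, Product_Name TEXT, Product_Price REAL);
CREATE TABLE Store   (Store_ID INTEGER PRIMARY KEY, Store_City TEXT, Store_State TEXT);
CREATE TABLE Time    (Time_ID INTEGER PRIMARY KEY, Day INTEGER, Month INTEGER, Year INTEGER);
CREATE TABLE Sales   (Sales_ID INTEGER PRIMARY KEY, Product_ID INTEGER, Store_ID INTEGER,
                      Time_ID INTEGER, Quantity_Sold INTEGER, Total_Price REAL);
""")

# Hypothetical sample data, invented for this sketch.
conn.executemany("INSERT INTO Product VALUES (?, ?, ?)",
                 [(1, "M&Ms", 1.00), (2, "Famous Amos", 2.00)])
conn.executemany("INSERT INTO Store VALUES (?, ?, ?)",
                 [(1, "Ardmore", "PA"), (2, "Temple Main", "PA")])
conn.executemany("INSERT INTO Time VALUES (?, ?, ?, ?)",
                 [(1, 5, 1, 2011), (2, 20, 1, 2011), (3, 3, 2, 2011)])
conn.executemany("INSERT INTO Sales VALUES (?, ?, ?, ?, ?, ?)", [
    (1000, 1, 1, 1, 10, 10.0),
    (1001, 1, 1, 2, 5, 5.0),
    (1002, 2, 2, 1, 3, 6.0),
    (1003, 2, 2, 3, 4, 8.0),
    (1004, 1, 1, 3, 2, 2.0),
])

# Option 2 from the slides: summarize once and store only the summary.
# Granularity chosen here: product x store x month.
conn.execute("""
CREATE TABLE Cube AS
SELECT p.Product_Name, s.Store_City, t.Year, t.Month,
       SUM(f.Quantity_Sold) AS Quantity_Sold,
       SUM(f.Total_Price)   AS Total_Price
FROM Sales f
JOIN Product p ON f.Product_ID = p.Product_ID
JOIN Store   s ON f.Store_ID   = s.Store_ID
JOIN Time    t ON f.Time_ID    = t.Time_ID
GROUP BY p.Product_Name, s.Store_City, t.Year, t.Month
""")

# One cell of the cube: M&Ms sold in Ardmore in January 2011.
m_and_ms_cell = conn.execute("""
SELECT Quantity_Sold, Total_Price FROM Cube
WHERE Product_Name = 'M&Ms' AND Store_City = 'Ardmore' AND Year = 2011 AND Month = 1
""").fetchone()

# A slice: Famous Amos at Temple Main across all stored months.
famous_amos_slice = conn.execute("""
SELECT Month, Quantity_Sold, Total_Price FROM Cube
WHERE Product_Name = 'Famous Amos' AND Store_City = 'Temple Main'
ORDER BY Month
""").fetchall()

# The caveat: the cube has no Day column, so daily sales are unrecoverable.
cube_columns = [row[1] for row in conn.execute("PRAGMA table_info(Cube)")]

print(m_and_ms_cell)
print(famous_amos_slice)
print("Day" in cube_columns)
```

The two January M&Ms sales in Ardmore collapse into a single cell, which is exactly the "locked in" trade-off: the cube is small and fast to read, but the day-level detail only survives in the Sales fact table.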
OGE Energy (Oklahoma) Example
• Trying to reduce peak power demand
• A few strategies…
  • Variable Pricing;
  • Smart Meters;
  • Customer Notifications (Text, e-Mail, Twitter);
  • Customer Rewards
sources: OGE overview, http://energy.gov/, http://www.ogepet.com

OGE Energy (Oklahoma) Example
• Business question
  • “How can we reduce peak power demand?”
• What are the relevant facts (performance measures)?
  • Power consumption (kWh)
  • Power outages
• What are relevant dimensions?
  • Time (hours or minutes), location, weather_emergency, price, smart_meter, communications, rebates

OGE Energy Star Schema
Fact: PowerDraw (PowerDrawID, Customer_id, Comms_id, Time_id, Location_id, Consumption, Outages)
Dimension: Customer (customer_ID, Attribute 1, Attribute 2, Attribute 3)
Dimension: Communications (comms_ID, Num_texts, Num_tweets, Num_emails)
Dimension: Location (_ID, Attribute 1, Attribute 2, Attribute 3)
Dimension: Time (Time_ID, Day, Month, Year, Hour)
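
Sketch: querying the OGE star schema
As a rough illustration of how the PowerDraw schema could be used to approach the peak-demand question, the sketch below builds part of it with Python's built-in sqlite3 module. Everything in it beyond the table and column names is an assumption: the Customer and Location dimensions are left out because their attributes are unspecified, and all rows are invented for the example, not OGE data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Part of the OGE star schema: the PowerDraw fact plus two of its dimensions.
conn.executescript("""
CREATE TABLE Time (Time_ID INTEGER PRIMARY KEY, Day INTEGER, Month INTEGER,
                   Year INTEGER, Hour INTEGER);
CREATE TABLE Communications (Comms_ID INTEGER PRIMARY KEY, Num_texts INTEGER,
                             Num_tweets INTEGER, Num_emails INTEGER);
CREATE TABLE PowerDraw (PowerDrawID INTEGER PRIMARY KEY, Customer_ID INTEGER,
                        Comms_ID INTEGER, Time_ID INTEGER, Location_ID INTEGER,
                        Consumption REAL, Outages INTEGER);
""")

# Hypothetical rows: three hours of one day, two customers,
# one who received no notifications and one who did.
conn.executemany("INSERT INTO Time VALUES (?, ?, ?, ?, ?)",
                 [(1, 1, 7, 2011, 14), (2, 1, 7, 2011, 17), (3, 1, 7, 2011, 3)])
conn.executemany("INSERT INTO Communications VALUES (?, ?, ?, ?)",
                 [(1, 0, 0, 0), (2, 2, 1, 1)])
conn.executemany("INSERT INTO PowerDraw VALUES (?, ?, ?, ?, ?, ?, ?)", [
    (1, 100, 1, 1, 10, 3.0, 0),
    (2, 100, 1, 2, 10, 6.0, 0),
    (3, 100, 1, 3, 10, 1.0, 0),
    (4, 200, 2, 1, 10, 2.0, 0),
    (5, 200, 2, 2, 10, 4.0, 1),
    (6, 200, 2, 3, 10, 1.0, 0),
])

# Which hour has the highest total consumption (the "peak")?
peak = conn.execute("""
SELECT t.Hour, SUM(f.Consumption) AS Total_Consumption
FROM PowerDraw f
JOIN Time t ON f.Time_ID = t.Time_ID
GROUP BY t.Hour
ORDER BY Total_Consumption DESC
LIMIT 1
""").fetchone()
peak_hour = peak[0]

# Within the peak hour, compare average consumption for customers
# who received at least one text (Notified = 1) versus none (Notified = 0).
by_notification = conn.execute("""
SELECT c.Num_texts > 0 AS Notified, AVG(f.Consumption) AS Avg_Consumption
FROM PowerDraw f
JOIN Communications c ON f.Comms_ID = c.Comms_ID
JOIN Time t ON f.Time_ID = t.Time_ID
WHERE t.Hour = ?
GROUP BY Notified
ORDER BY Notified
""", (peak_hour,)).fetchall()

print(peak)
print(by_notification)
```

Because Hour is part of the Time dimension, the peak can be located at hourly granularity; had the fact table been pre-aggregated by day, this question could not be asked. The comparison by notification is descriptive only and does not by itself show that notifications reduce demand.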