Data Warehouse and Business Intelligence
Dr. Minder Chen
Professor of MIS
Martin V. Smith School of Business and Economics
CSU Channel Islands
Minder.Chen@CSUCI.EDU
© Minder Chen, 2004-2014

BI
Business Intelligence (BI) is the process of gathering meaningful information to answer questions and identify significant trends or patterns, giving key stakeholders the ability to make better business decisions.
“The key in business is to know something that nobody else knows.” -- Aristotle Onassis
“To understand is to perceive patterns.” -- Sir Isaiah Berlin
"The manager asks how and when, the leader asks what and why." -- “On Becoming a Leader” by Warren Bennis

BI Questions
• What happened?
  – What were our total sales this month?
• What’s happening?
  – Are our sales going up or down? (trend analysis)
• Why?
  – Why have sales gone down?
• What will happen?
  – Forecasting & “What If” Analysis
• What do I want to happen?
  – Planning & Targets
Source: Bill Baker, Microsoft

Business Valuation Models for BI

Performance Dashboards for Information Delivery

Scorecards for Information Delivery: Balanced Scorecard

Inmon's Definition of Data Warehouse – Data View
• A warehouse is a
  – subject-oriented,
  – integrated,
  – time-variant, and
  – non-volatile
  collection of data in support of management's decision-making process.
  -- Bill Inmon in 1990
Source: http://www.intranetjournal.com/features/datawarehousing.html

Inmon's Definition Explained
• Subject-oriented: Data warehouses are organized around major subjects such as customer, supplier, product, and sales. They focus on modeling and analysis to support planning and management decisions rather than on operations and transaction processing.
• Integrated: Data warehouses integrate sources such as relational databases, flat files, and online transaction records. Processes such as data cleansing and data scrubbing achieve consistency in naming conventions, encoding structures, and attribute measures.
• Time-variant: Data in the warehouse provide information from a historical perspective.
• Nonvolatile: Data in the warehouse are kept physically separate from data in the operational environment, so day-to-day transactions do not modify them.

Business Intelligence
Layers, from top to bottom, with increasing potential to support business decisions (MIS) toward the top:
• Making Decisions
• Data Presentation: Visualization Techniques
• Data Mining: Information Discovery
• Data Exploration: OLAP, MDA, Statistical Analysis, Querying and Reporting
• Data Warehouses / Data Marts
• Data Sources: Paper, Files, Information Providers, Database Systems, OLTP
Roles, from top to bottom: End User, Business Analyst, Data Analyst, DBA

Where is Business Intelligence applied?
Operational Efficiency
• ERP Reporting
• KPI Tracking
• Product Profitability
• Risk Management
• Balanced Scorecard
• Activity Based Costing
• Global Sourcing
• Logistics
Customer Interaction
• Sales Analysis
• Sales Forecasting
• Segmentation
• Cross-selling
• CRM Analytics
• Campaign Planning
• Customer Profitability

OLTP Versus Business Intelligence: Who asks what?
OLTP Questions
• When did that order ship?
• How many units are in inventory?
• Does this customer have unpaid bills?
• Are any of customer X’s line items on backorder?
Analysis Questions
• What factors affect order processing time?
• How did each product line (or product) contribute to profit last quarter?
• Which products have the lowest gross margin?
• What is the value of items on backorder, and is it trending up or down over time?
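The contrast between the two question types above can be sketched as two queries against the same data. This is only an illustration: the `order_lines` table, its columns, and its rows are invented for the example and do not come from the slides, and SQLite stands in for whatever OLTP or warehouse database is actually used.

```python
import sqlite3

# Hypothetical order-line data; table name, columns, and rows are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE order_lines (
    order_id INTEGER, customer TEXT, product_line TEXT,
    quarter TEXT, revenue REAL, cost REAL, backordered INTEGER
);
INSERT INTO order_lines VALUES
    (1, 'X', 'Gadgets', '2013-Q3', 100.0,  60.0, 0),
    (2, 'X', 'Widgets', '2013-Q4', 200.0, 150.0, 1),
    (3, 'Y', 'Gadgets', '2013-Q4', 300.0, 180.0, 0),
    (4, 'Y', 'Widgets', '2013-Q4',  50.0,  45.0, 0);
""")

# OLTP-style question: are any of customer X's line items on backorder?
# A selective lookup that touches a handful of rows.
backordered = conn.execute(
    "SELECT order_id FROM order_lines WHERE customer = ? AND backordered = 1",
    ("X",),
).fetchall()

# Analysis-style question: how did each product line contribute to profit
# in a given quarter? An aggregation that scans many rows.
profit_by_line = conn.execute(
    """
    SELECT product_line, SUM(revenue - cost) AS profit
    FROM order_lines
    WHERE quarter = ?
    GROUP BY product_line
    ORDER BY profit DESC
    """,
    ("2013-Q4",),
).fetchall()

print(backordered)
print(profit_by_line)
```

With the sample rows, the first query returns the single backordered order for customer X, while the second returns one profit total per product line. The first shape is what an operational system is tuned for; the second is what a warehouse is tuned for.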
The Data Warehouse/BI Architecture & Process
ETL: Extract, Transform, and Load
Components: Source Systems, ETL, Data Warehouse, Data Marts, OLAP Cubes, Clients (Query Tools, Reporting, Analysis, Data Mining)
Steps:
1 Design the Data Warehouse
2 Populate Data Warehouse
3 Create OLAP Cubes
4 Query Data

Normalized Database for OLTP

OLTP vs. OLAP
OLTP System: Online Transaction Processing (Operational System)
OLAP System: Online Analytical Processing (Data Warehouse)
• Source of data
  – OLTP: Operational data; OLTPs are the original source of the data.
  – OLAP: Consolidation data; OLAP data comes from the various OLTP databases.
• Purpose of data
  – OLTP: To control and run fundamental business tasks.
  – OLAP: To help with planning, problem solving, and decision support.
• What the data reveals
  – OLTP: A snapshot of ongoing business processes.
  – OLAP: Multi-dimensional views of various kinds of business activities.
• Inserts and updates
  – OLTP: Short and fast inserts and updates initiated by end users.
  – OLAP: Periodic long-running batch jobs refresh the data.
• Queries
  – OLTP: Relatively standardized and simple queries returning relatively few records.
  – OLAP: Often complex queries involving aggregations.
• Processing speed
  – OLTP: Typically very fast.
  – OLAP: Depends on the amount of data involved; batch data refreshes and complex queries may take many hours; query speed can be improved by creating indexes.
• Space requirements
  – OLTP: Can be relatively small if historical data is archived.
  – OLAP: Larger due to the existence of aggregation structures and history data; requires more indexes than OLTP.
• Database design
  – OLTP: Highly normalized with many tables.
  – OLAP: Typically de-normalized with fewer tables; use of star and/or snowflake schemas.
• Backup and recovery
  – OLTP: Backup religiously; operational data is critical to run the business, and data loss is likely to entail significant monetary loss and legal liability.
  – OLAP: Instead of regular backups, some environments may consider simply reloading the OLTP data as a recovery method.
Source:
http://datawarehouse4u.info/OLTP-vs-OLAP.html

Measuring Performance
• Real estate consumer services and analysis firm Trulia reports that Oct. 2013 saw only a 0.6% rise in home asking prices compared with Sept. 2013.
• However, the average home asking price rose by 11.7% from Oct. 2012 to Oct. 2013.
• The year-over-year figure is the largest jump since the housing bubble popped back in 2007-08.
Source: http://www.thestreet.com/story/12100873/1/home-sellers-price-hikes-coming-unsustainably-fast.html

Comparison with Last Period vs. Year-over-Year Comparison
• A time series is a sequence of data points, typically measured at successive points in time spaced at uniform intervals. An example is the daily closing value of the Dow Jones Industrial Average. (Wikipedia, “Time series”)
• Year-over-year data compare a time period (e.g., a month or a quarter) against the same period of the previous year.
• You can also compare a performance indicator with its value from the last period (quarter, month, week, or day).
• One advantage of year-over-year comparisons is that they cancel out the effect of seasonality, which often makes them a more effective way of looking at performance.

Identifying Measures and Dimensions
Information for Decision Making
• Measures (What?): Performance Measures for KPI
• Dimensions (Why?): Performance Drivers
Attribute Type?
• The attribute (column) varies continuously (a measure): Units Sold, Cost, Sales, Balance
• The attribute is perceived as a constant or discrete value (a dimension): Name/Description, Location, Color, Size

Star Schema
Multi-dimensional data model
• Fact table with measures: Sales
• Dimension tables: Products, Channels, Dates, Promotions, Customers

Snowflake Schema
• Star schema core: Sales fact table with the dimension tables Products, Channels, Dates, Promotions, Customers
• Normalized dimension tables: Brands, Customer type
Source: http://www.diffen.com/difference/Snowflake_Schema_vs_Star_Schema

Designing Data Warehouse: Dimensional Design Process
Inputs: Business Requirements and Data Realities
• Select the business process to model.
• Declare the grain of the business process/data in the fact table. (The grain represents the most atomic level by which the facts may be defined. The grain of a SALES fact table might be stated as "Sales volume by Day by Product by Store".)
• Identify the numeric facts/measures that will populate each fact table row.
• Choose the dimensions that apply to each fact table row.
Ref: http://en.wikipedia.org/wiki/Fact_table

Select a business process to model
• Not business departments or business functions
• Cross-functional business processes
• Business events
• Examples:
  – Raw materials purchasing
  – Order fulfillment process
  – Shipments
  – Invoicing
  – Inventory
  – General ledger
  – Insurance claims
  – Class enrollment
  – Airline ticket sales

Fact Table
Measurements of business events.
• Dimensions (keys): DateID, ProductID, CustomerID
• Measures: Units, Dollars
The fact table contains keys and units of measure.

Fact Tables
Fact tables have the following characteristics:
• They contain numeric measures (metrics) of the business.
• They may contain summarized (aggregated) data.
• They almost always contain date-stamped data.
• Their measures are typically additive.
• Their key is typically a concatenated key composed of the primary keys of the dimensions.
• They are joined to dimension tables through foreign keys that reference primary keys in the dimension tables.
• They are narrow (few attributes) but hold many records.

A Dimensional Model for a Grocery Store’s Sales

Creating Dimensional Model
• Identify fact tables
• Translate business measures into fact tables
• Analyze information from source systems for additional measures
• Identify base and derived measures
• Document additivity of measures (e.g., non-additive [price], semi-additive [quantity-on-hand is not additive over time], or additive [quantity])
• Identify dimension tables
• Link fact tables to the dimension tables
• Create views for users

Transaction Level Order Item Fact Table

Inside a Dimension Table
• Dimension table key: Uniquely identifies each row. Use a surrogate key (integer).
• Table is wide: A table may have many attributes (columns).
• Textual attributes: Descriptive attributes in string format. No numerical values for calculation.
• Attributes not directly related: E.g., product color and product package size. No transitive dependency.
• Not normalized (star schema).
• Drilling down and rolling up along a dimension.
• One or more hierarchies within a dimension.
• Fewer records than the fact table.
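The fact-table and dimension-table characteristics above can be sketched as a tiny star schema. Everything named here (`dim_product`, `dim_date`, `fact_sales`, the helper `sales_by`, and all rows) is a hypothetical illustration rather than the grocery model from the slides, with SQLite used only because it ships with Python.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Wide, denormalized dimension tables with integer surrogate keys.
CREATE TABLE dim_product (
    product_key INTEGER PRIMARY KEY,
    product_name TEXT, category TEXT, department TEXT
);
CREATE TABLE dim_date (
    date_key INTEGER PRIMARY KEY, full_date TEXT, month TEXT
);
-- Narrow fact table: foreign keys plus additive measures.
-- Grain (assumed): units and dollars by day by product.
CREATE TABLE fact_sales (
    date_key INTEGER REFERENCES dim_date(date_key),
    product_key INTEGER REFERENCES dim_product(product_key),
    units INTEGER, dollars REAL,
    PRIMARY KEY (date_key, product_key)
);
INSERT INTO dim_product VALUES
    (1, 'Cola',  'Soft drinks', 'Beverages'),
    (2, 'Juice', 'Juices',      'Beverages'),
    (3, 'Milk',  'Milk',        'Dairy');
INSERT INTO dim_date VALUES
    (20140301, '2014-03-01', '2014-03'),
    (20140302, '2014-03-02', '2014-03');
INSERT INTO fact_sales VALUES
    (20140301, 1, 10, 15.0),
    (20140301, 2,  4, 12.0),
    (20140302, 1,  6,  9.0),
    (20140302, 3,  5, 10.0);
""")

# Attributes a user may group by, mapped to the dimension column they live in.
LEVELS = {
    "department": "p.department",
    "category": "p.category",
    "product_name": "p.product_name",
    "month": "d.month",
}

def sales_by(level, department=None):
    """Sum the additive measures at one level, optionally within a department."""
    column = LEVELS[level]  # raises KeyError for an unknown level
    sql = (
        f"SELECT {column}, SUM(f.units), SUM(f.dollars) "
        "FROM fact_sales f "
        "JOIN dim_product p ON p.product_key = f.product_key "
        "JOIN dim_date d ON d.date_key = f.date_key"
    )
    params = ()
    if department is not None:
        sql += " WHERE p.department = ?"
        params = (department,)
    sql += f" GROUP BY {column} ORDER BY {column}"
    return conn.execute(sql, params).fetchall()

print(sales_by("department"))             # rolled up to the top of the hierarchy
print(sales_by("category", "Beverages"))  # drilled down within one department
```

Rolling up and drilling down are the same query grouped at a different level of the product hierarchy. This works here because units and dollars are additive; a semi-additive measure such as quantity-on-hand could not simply be summed across dates this way.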
OLAP Solutions
• Data Warehouse
• Data Mart
• Cubes
• Dimensions
• Measures
• Cells
A cube is a collection of data that has been aggregated to allow queries to return data quickly.
OLAP Server (e.g., Oracle Essbase & SQL Server’s Analysis Services)
Example cube with dimensions Region (Europe, Asia, US), Product, and Quarter; the visible face shows:

            Q1   Q2   Q3   Q4
Gadgets    130  135  140  142
Gizmos     205  390  350  475
Thingies   175  230  190  250
Widgets    310  340  410  450

Hierarchy

A Hierarchy in the Product Dimension
• SKU: Stock Keeping Unit
• Hierarchy:
  – Department → Category → Subcategory → Brand → Product

Multidimensional Query Techniques
• Performance measures answer "What?"; performance drivers answer "Why?"
• Dimensions: Product, Time, Geography
• Techniques: slicing, dicing, drilling down
• Drill down moves from aggregated data to detail data; roll up moves from detail data to aggregated data.

Roll-Up and Drill-Down
Source: http://www.tutorialspoint.com/dwh/dwh_olap.htm

Slice and Dice
Source: http://www.tutorialspoint.com/dwh/dwh_olap.htm

A Visual Operation: Pivot (Rotate)
Example axes: Product (Juice, Cola, Milk, Cream) and Date (3/1, 3/2, 3/3, 3/4). Sample cell values shown: 10, 47, 30, 12.

Operations in Multidimensional Data Model
• Aggregation (roll-up)
  – dimension reduction: e.g., total sales by city
  – summarization over aggregate hierarchy: e.g., total sales by city and year → total sales by region and by year
• Navigation to detailed data (drill-down)
  – e.g., (sales - expense) by city, top 3% of cities by average income
• Selection (slice or dice) defines a subcube
  – e.g., sales where city = Palo Alto and date = 1/15/96
• Visualization operations (e.g., Pivot)

Pivot Table in Excel

Date Dimension of the Retail Sales Model

Store Dimension
• It is not uncommon to represent multiple hierarchies in a dimension table.
Ideally, the attribute names and values should be unique across the multiple hierarchies.

ETL
ETL = Extract, Transform, Load. The ETL cycle includes:
• Build reference data (e.g., currency codes)
• Extract (from sources)
• Validate
• Transform (clean, apply business rules, check for data integrity, create aggregates)
• Stage (load into staging tables, if used)
• Audit (report on compliance with business rules)
• Publish/load (to target tables in the data warehouse)
• Clean up

Data Quality Issues
• No common time basis
• Different calculation algorithms
• Different levels of extraction
• Different levels of granularity
• Different data field names
• Different data field meanings
• Missing information
• No data correction rules
• No drill-down capability

Building The Warehouse: Transforming Data

The Anomalies Nightmare

CUST #    NAME                  ADDRESS                           TYPE
90328575  Digital Equipment     187 N. PARK St. Salem NH 01458    OEM
90328575  DEC                   187 N. Pk. St. Salem NH 01458     OEM
90238475  Digital               187 N. Park St Salem NH 01458     $#%
90233479  Digital Corp          187 N. Park Ave. Salem NH 01458   Comp
90233489  Digital Consulting    15 Main Street Andover MA 02341   Consult
90234889  Digital Info Service  PO Box 9 Boston MA 02210          Mail List
90345672  Digital Integration   Park Blvd. Boston MA 04106        SYS INT

Anomalies: No Unique Key, No Standardization, Spelling, Noise in Blank Fields
How does one correctly identify and consolidate anomalies from millions of records?

Data Mining & Knowledge Discovery in Databases (KDD) Process
• Data mining is the practice of searching through large amounts of computerized data to find useful patterns or trends.
• Data mining is the analysis step of the "Knowledge Discovery in Databases" (KDD) process, involving methods such as artificial intelligence, machine learning, statistics, and database systems.
Source: http://www2.cs.uregina.ca/~dbd/cs831/notes/kdd/1_kdd.html

Knowledge Discovery
• Knowledge discovery in databases is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.
  – Data: A set of facts.
  – Pattern: An association, dependence, clusters, etc. among facts (items) in the data set.
  – Process: KDD is a multi-step process involving data preparation, pattern searching, knowledge evaluation, and refinement with iteration after modification.
  – Valid: Discovered patterns should be true on new data with some degree of certainty. Generalize to the future (other data).
  – Novel: Patterns must be novel (should not be previously known).
  – Useful: Actionable; patterns should potentially lead to some useful actions.
  – Understandable: The process should lead to human insight. Patterns must be made understandable in order to facilitate a better understanding of the underlying data.
Source: http://www2.cs.uregina.ca/~dbd/cs831/notes/kdd/1_kdd.html

Cross Industry Standard Process for Data Mining
Figure labels: Action, Decision
Source: http://en.wikipedia.org/wiki/Cross_Industry_Standard_Process_for_Data_Mining

Data Mining Tasks and Examples
• Classification - Customer profiling into predefined categories via supervised learning using a decision tree or neural network.
• Clustering - Grouping a set of objects in such a way that objects in the same group (cluster) are more similar to each other than to those in other groups (clusters), e.g., market segmentation.
• Summarization - Credit scoring and risk analysis using Bayesian inference. It is considered a structured prediction technique.
• Association - What is the likelihood that a customer will buy a product next month, if he buys a related item today?
(sequence association)
Source: http://www2.cs.uregina.ca/~dbd/cs831/notes/kdd/2_tasks.html

OLAP and Data Mining Address Different Types of Questions
While reporting and OLAP are informative about past facts, only data mining can help you predict the future of your business.
• OLAP: What was the response rate to our mailing?
  – Data mining: What is the profile of people who are likely to respond to future mailings?
• OLAP: How many units of our new product did we sell to our existing customers?
  – Data mining: Which existing customers are likely to buy our next new product?
• OLAP: Who were my 10 best customers last year?
  – Data mining: Which 10 customers offer me the greatest profit potential?
• OLAP: Which customers didn't renew their policies last month?
  – Data mining: Which customers are likely to switch to the competition in the next six months?
• OLAP: Which customers defaulted on their loans?
  – Data mining: Is this customer likely to be a good credit risk?
• OLAP: What were sales by region last quarter?
  – Data mining: What are expected sales by region next year?
• OLAP: What percentage of the parts we produced yesterday are defective?
  – Data mining: What can I do to improve throughput and reduce scrap?
Source: http://www.dmreview.com/editorial/dmreview/print_action.cfm?articleId=2367

Shopping Basket Analysis
• Which items are purchased in a retail store at the same time?
• Amazon uses collaborative filtering, which draws on shopping basket (sales) data, to make recommendations when you select an item.
Ref: http://en.wikipedia.org/wiki/Collaborative_filtering

Issues on Interpreting Modeling Results
• Housing price: Use factors such as location, number of bedrooms, and square footage to determine the market value of a property.
• Beer and Diaper
Source: http://dssresources.com/newsletters/66.php

The Four V's of Big Data
• Volume: Scale of data
• Velocity: Analysis of streaming data
• Variety: Different forms of data
• Veracity: Uncertainty of data
Source:
• http://www.ibmbigdatahub.com/infographic/four-vs-big-data
• http://whatsthebigdata.com/2013/07/25/big-data-3-vs-volume-variety-velocity-infographic/

CRISP-DM Methodology
Cross Industry Standard Process for Data Mining Methodology
Source: http://lyle.smu.edu/~mhd/8331f03/crisp.pdf

Data Mining Contexts
Source: http://lyle.smu.edu/~mhd/8331f03/crisp.pdf

Phases and Tasks
Source: http://lyle.smu.edu/~mhd/8331f03/crisp.pdf

• Backup Slides

Key Concepts in BI Development Lifecycle

OLTP Normalized Design
Diagram labels: Warehouse Ordering Process Chain Retailer Store Retailer Payments Retailer Returns Product POS Process Retail Promo Brand GL Account Retail Cust Cash Register Clerk