TEACHING THE DATA WAREHOUSE COURSE Paul Gray ISECON 2001 1 Outline Overview of what is Data Warehousing The 5 and 10 week courses The Indiana University of Pennsylvania course 2 ORIGINS Data warehouses are the results of two software solutions needing and finding one another: Data base firms developed data warehouses and were looking for applications EIS and DSS software developers and vendors needed to deal with ever-increasing data bases About 10 years ago, the two groups started interacting with the results described here. 3 ORIGINS Database developers long understood that their software was required for both transactional and analytic processing However, their principal developments were directed to ever-larger transactional data bases. This process occurred even through operational and analytic data are separate with different requirements and different user communities. 4 ORIGINS Once these differences were understood, new data bases were created specifically for analysis use. Today, data warehouses have 3 major applications On-line analytic processing for business intelligence Data Mining Customer Relationship Management 5 WHAT IS DATA WAREHOUSING A data warehouse is typically a dedicated data base system for decision making that is separate from the production data base(s) used operationally. It differs from production system in that: it covers a much longer time horizon than transaction systems it includes multiple data bases that have been processed so that the warehouse’s data are defined uniformly (i.e., ‘clean’ data) it is optimized for answering complex queries from managers and analysts. 6 WHAT IS DATA WAREHOUSING? In the last 5 years, data warehousing has become a major industry within computing which has brought together the ideas of data bases and decision support. It has also been the foundation for efforts in data mining and in CRM Data mining refers to finding answers about an organization from the information in the data warehouse that the executive or the analyst had not thought to ask. Data mining is made possible by the very presence of large databases in the data warehouse. It provides techniques that allow managers to obtain managerial information from their legacy systems. Its objective is to identify valid, novel, potentially useful, and understandable patterns in data. 7 WHAT IS DATA WAREHOUSING? The objective of a data warehouse is to create a “single truth” Data warehousing is a major new application area. It rates extremely high salaries (up to $100,000 for specialists, $300,000 for consultants). 8 DEFINITION A data warehouse is a: Subject oriented Integrated Time-variant Non-volatile Collection of data in support of management decision processes 9 NOTE: Data warehouse is physically separated from operational systems and operational data bases Data warehouses hold both aggregated and detailed data for management separate from the databases used for On-Line Transaction Processing (OLTP) 10 CHARACTERISTICS Subject oriented Data are organized by how users refer to it Integrated Non-volatile Time series Inconsistencies are removed in both nomenclature and conflicting information; (i.e. data are ‘clean’) Read-only data. Data do not change over time. Data are time series, not current status 11 CHARACTERISTICS Summarized Non normalized Operational data are mapped into decision usable form Time series implies much more data is retained Data can be redundant Metadata =Data about data Input Unintegrated, operational environment (‘legacy systems’) Larger 12 SUBJECT ORIENTATION Data is organized around major subjects of the enterprise Example: OPERATIONAL •Loans •Savings •Bank card •Trust An application orientation DATA WAREHOUSE Customer Vendor Product Activity A subject orientation 13 USING THE WAREHOUSE The higher the level of summarization, the more the data is used The more summarized the data, the quicker it is to retrieve However, the higher the level of summarization, the lower the level of detail 14 DATA MARTS Data Mart: A scaled-down version of the data warehouse A data mart is a small warehouse designed for the SBU or department level. It is often a way to gain entry and provide an opportunity to learn Major problem: if they differ from department to department, they can be difficult to integrate enterprise-wide 15 COST Data warehouses are not cheap Median cost to create (does not include operating cost) = $2.2M Multimillion dollar costs are common Their design and implementation is still an art and they require considerable time to create 16 SIZE Being designed for the enterprise so that everyone has a common data set, they are large and increase in size with time. Typical storage sizes run from 50 Gigabytes to several Terabytes 17 SIZE OF INDUSTRY Data warehouses are a major industry within information systems. 6B$/year Estimates vary but it is clear that many more than 90% of Fortune 1000 have data warehouse projects Major players include: Oracle IBM+Informix Sybase NCR + BI companies (Brio, Cognos, Pilot,….) 18 MARKET COMPONENTS High end business intelligence (OLAP) Low end query tools Data cleansing Data marts Data mining Customer relationship management (CRM) 19 APPLICATION – OLAP/BI OLAP = On Line analytic processing Basic idea of OLAP: managers should be able to manipulate enterprise data models across many dimensions to understand changes that are occurring Vendors claim they are OLAP compliant even if they are not 20 APPLICATION - DATA MINING Also known as Knowledge Data Discovery (KDD) Mining terminology refers to finding answers about a business from the data warehouse that the executive or analyst had not thought to ask 21 APPLICATION - DATA MINING KDD applies techniques mostly from artificial intelligence and statistics to discover new information. It is designed to find information that queries and reports don’t reveal effectively KDD uses AI and statistics to find pattern in data and to infer rules. 22 APPLICATION - DATA MINING Some successes: People who buy scuba gear take Australian vacations Fraud detection, consumer loan analysis Optimizing production lines IBM’s SCOUT (apocryphal?) Men who buy diapers buy beer 23 APPLICATION - DATA MINING Associations Sequences Classifications Clusters Forecasting Things done together (buy groceries) Events over time (house.refrigerator) Pattern recognition (rules) Define new groups Predictions from time series 24 CUSTOMER RELATIONSHIP MANAGEMENT (CRM) Successor to data base marketing Implies marketing to customers on a 1:1 basis Requires data granularity at the level of the individual customer large amounts of data Data warehouse is only a part of the CRM concept. 25 History of Course I First given as PhD Seminar on Management of Information Systems 1996 Overview lecture by instructor Students present topics each week – typically 2 or 3 Great source of material! 26 History of Course II Attend Data Warehouse Institute conference in San Diego. Agree to write book with Watson. January 1997 Complete book in late 1997. Desk-top published. Used sabbatical at UCI to do the writing. Move course to regular MS course in Spring 1997. 27 Two Versions: 14 week for Claremont Graduate University 5 Week for UC Irvine Taught 1/year for last 5 years WHY DO STUDENTS TAKE COURSE? High salaries (100K specialist, $300K for consultant) 28 Both Versions Text: Decision Support in the Data Warehouse by P. Gray and H.J. Watson Prentice Hall 1998 PLUS readings PLUS hands-on exercises PLUS term paper 29 5 WEEK VERSION Wk Part 1 1 Overview of DW 2 Demo Startracker software 3 Data sources, cleansing,metadata 4 Data marts,ODS 5 Mining, BI Part 2 OLAP, applications Strategic use, Framework Planning and operating the DW Industry, economics CRM, Developments 30 Date 14-WEEK COURSE 9-9 Introduction to DW 9-16 Architecture, Data Input 9-23 Metadata, Multiple Dimensions 9-30 Interface, Strategic Uses 10-7 Design Methods; Data Marts 10-14 Progress Report; ODS; Maintaining the DW 10-21 Midterm 10-28 Economics of DW; Personnel 11-7 Querying, Data Mining 11-14 Progress Report; CRM 11-21 Business Intelligence ; Future Developments 11-28 Student Reports 12-2 Student Reports 12-9 Final Exam 31 Using Software—Star Tracker Simple DW The Data Warehouse Toolkit : Practical Techniques for Building Dimensional Data Warehouses by Ralph Kimball (Wiley 1996) Commercial version “Synchrony” from If… no longer available 32 Using Software – Commercial Red Brick (now part of Informix which is now owned by IBM) Gift from Red Brick Mostly used with student projects 33 Data Sources Journal of Data Warehousing DM Review On-line (dmreview.com) Hard copy Lots and lots on Internet e.g., www.Dw-institute.com Pwp.starnetinc.com/larryg www.datawarehouse.org 34 Assignments Find 10 articles on Internet teaches search techniques, nomenclature, rapid course immersion Exercise with software The DW industry 35 Term Projects for DW Next chart shows 22 of the 30 suggested topics. Last topic allows student to choose topic subject to instructor approval. 36 DW Architecture Data sources, loading, cleaning, summarizing, granularity Metadata in a DW Data Modeling and the DW Multiple Dimensions in DWs via ROLAP and MOLAP Indexing in DWs DW Interfaces –conventional, window, browser Design principles for DWs Strategic use of the DW Justification of the DW including C/B analysis DW Security Data Marts Operational Data Stores Maintaining the DW Economics of DWs Querying and reporting in a DW OLAP and the DW Decision Support Systems and the DW Database marketing and the DW Data Mining and the DW Personnel considerations for building the DW and for maintaining the DW Organizing the DW Project (Project management etc.) 37 Elizabeth Pierce’s Course at Indiana University of Penn. Paper published in Communications of AIS (CAIS) Sept. 1999 (Vol. 2 Article 16) Developing and Delivering a Data Warehousing and Mining Course “Introduces students to the strategies, technologies, and techniques” 38 Elizabeth Pierce’s Course at Indiana University of Penn. “Students learn what is involved in planning, designing, building, using, and managing a data warehouse. Students also learn about how a data warehouse must fit into an overall corporate data architecture that may include legacy systems, operational data stores, enterprise data warehouses, and data marts. In addition, students are exposed to the different data mining techniques used by organizations to derive information from the data warehouse for strategic and long-term business decision making.” (abstract) 39 U. Of Indiana Course 14 weeks –7 weeks on DW Challenges: Evolving nature of the subject Hands on experience for students Textbook selection—mostly tradebooks 40 Useful Tables in Pierce List of Internet sources Available software demos Some on-line data warehouses with public access Schools that offered data mining courses in 1999 41 Conclusions Students come because it is a job source A good 2nd course for data base (can be taken w/o database course) Allows covering the new topics of the 90’s and 00’s—OLAP,CRM,Mining, in more depth than Intro to IS course. It’s a fun topic and you can even do research on it. 42