Introduction to Data Warehousing
  Rob Meredith
  DSS Lab, Monash University

Overview
  What is a data warehouse?
  What makes it so different?
    – Managers as clients
    – Architecture
  Dimensional Modelling
    – Compared to traditional data modelling
    – Facts and dimensions
    – OLAP

What is a data warehouse?
  “Subject oriented, integrated, time variant, non-volatile collection of data in support of management decision making”
    Inmon
  “The basic data warehouse architecture interposes between end-user desktops and production data sources a warehouse that we usually think of as a single, large system maintaining an approximation of an enterprise data model.”
    Demarest

Data Warehouses
  A set of databases created to provide information to decision makers
  Supports the access, understanding and analysis of data by decision makers
  Provides the “data infrastructure” for management support systems (e.g. DSS and EIS)
  Most of the effort is in data extraction, transformation and load activities

Another view...
  “Data warehousing is a process not a product. ... The data warehousing process can be broken down into 4 phases:
    Assemble data systematically
    Transform the data, correct errors and form a consistent view
    Distribute the data where needed
    Furnish high speed tools of choice
  Data warehousing provides a means for the useful storage of historical information allowing the user wider scope on which to base decision support information.”
    The Butler Group

What’s so different about data warehouses?
  Compared to operational systems (OLTP):
    – Managers as clients
      • What managers supposedly do
      • The reality
    – Architecture

Managers as clients
  Discretionary and demanding clients
  Chauffeured
  Fragmentation, brevity and variety
  Uncertain tasks
  Urgency
  Organisationally powerful

What’s in it for managers?
  Fast access to data
  Views of the organisation they have never had before
  Exception reports (data mining agents)
  Infrastructure for EIS
  Infrastructure for DSS

What’s really in it for managers?
  Beware “technocratic utopianism”
  Maybe nothing at all!
  Ackoff (1967) revisited:
    – MIS are based on the following false assumptions:
      • More information is better
      • Managers don’t have the information they need
      • Managers need the information they want
      • Managers don’t have to understand a system to use it
  http://images.lib.monash.edu.au/ims3001/04103275.pdf

Operational Systems Environment
  OLTP Systems tend to be:
    – Unintegrated
    – Unsynchronised
    – Complex
    – Update-Oriented
    – Dirty data

Data Warehouse Environment
  OLAP Systems (explained later) tend to be:
    – Subject oriented
    – Integrated
    – Time Variant
    – Non-Volatile
  (Inmon & Hackathorn, 1994)

Goals of data warehouse architecture
  Architectural goals (Demarest, 1994):
    – To protect production systems from query drain
    – To provide a traditional, highly manageable data oriented environment for DSS
      • To separate data management and query processing issues from end-user access issues
    – To enable data from different systems to be brought together in a logical unified fashion

Data Warehouse Architecture
  [Figure: architecture diagram. Components shown: Internal Legacy Systems, External Data Sources, Special Purpose Data, Data Warehouse, Query System, Executive Information System, Decision Support System, EIS Client (x2), DSS Client]

Research at Monash - What We Know (The Benefits)
  The major benefits of data warehousing we have noted are:
    – Better data management
    – Better access to data
    – Better decision making
    – A reduction in the cost associated with the production of ad hoc reports
  IT professionals involved consider the investment to be very worthwhile.

What we know: architecture
  Organisations are using existing technologies for their data warehouse
    – As a result the traditional vendors have a strong presence in the market (e.g. IBM, Sun, Oracle, etc.)
  Client / Server architectures are dominant
    – However many organisations are running their data warehouse on the same platform as their OLTP systems.

What we know: project scale
  The majority of projects are not enterprise-wide in scale (data marts rather than data warehouses)
  A small number of systems cost many millions of dollars but around $500,000 is typical (but proportional to the authority of the sponsor!)
  A small number of users (~10) is common
  The development team usually consists of 2-4 people

Where is the technology heading?
  Architecture
    – Web enablement
  Project scale
    – More large projects
    – More users

Issues facing developers
  Shortage of skilled people
  Vendor support in Australia
  Increasing expectations of users
  Internet
  Evolutionary development
  Data quality (!!)

Some fundamentals (Ackoff again)
  Don’t ask what people want
  Managers don’t need more information
  Find out what people need
  Use the [warehouse] to provide better information
    Ackoff - 1967

Evolutionary development
  Users’ understanding of the business is shaped by the information they have
  System is developed to suit their understanding of the business
  System provides better information
  Users’ understanding of the business is changed
  System must change, ...

Data Warehouse Modelling
  Aims
    – Easily understood
    – Extendable
    – Stable
    – Good performance for queries and reports
  ER or Star Schema or both?

ER Schema (Simple)
  [Figure: ER diagram. Entities: Customer Type, Customer, Product Type, Product, Sale, Store, Region, Period. Relationship labels: groups, within, contains, makes, in, located at]
  (based on Kimball (1996), p29, and Simsion-Bowles (1996), p2)

Traditional ER Approach to design
  Entities and relationships
  Rules of normalisation
    – 3NF is typical
    – Protection of integrity of database by avoiding anomalies
    – Every logical thing is represented only once
  Separate consideration of logical and physical design

Traditional Database Design
  Large numbers of tables
    – Oracle Financials - 1,800; SAP 7 up to 8,000
  Commonly used
    – Feels natural once you get used to it
  Research shows that they are not easily understood by IT people
    – Especially concepts like abstraction, generalisation, sub-types, etc.

Multi-Dimensional Models
  It is possible to conceptualise data as multidimensional
  Difficult to design
  Resulting reports are easy to use

So what is this dimensional stuff anyway?
  An approach to database design that provides a database that is easy to understand and navigate
    – The aim is to encourage understanding, exploration and learning
  Each number has a set of associated attributes
    – What it measures, what point of time it was created, what location it’s from, what product it’s associated with, what promotion, etc.

Multi-dimensionality
  We usually talk about information spaces as cubes, hyper-cubes or n-cubes
  Each attribute associated with each number represents a dimension
    – Measure, time, location, product, etc.
  Resulting views are easy to navigate and move around
    – Slice and dice
    – Report template

From Traditional Relational to Multi-dimensional
  Typical relational database
    From Pilot Software OLAP White Paper
  Same data displayed in two dimensions
  Easy! (The key is to identify the continuous and discrete variables in the flat file.)

From a Spreadsheet to a Multidimensional report
  Typical spreadsheet model
  Two Dimensional?

Lurking Dimensions
  What about 1997?
  What about other states?
  Other dimensions are implicit.
  Year and State?

Spot the design choices!
  (Time and Region)

What is OLAP?
  On-Line Analytical Processing
  Term was popularised by Codd in 1993
    – 12 OLAP rules defining a standard by which to assess products
    – Nothing new - most products already complied
  OLAP Council
  Client/Server
  Multi-dimensional view of data

OLAP and ROLAP
  Many OLAP tools have their own way of storing data (MDDB)
  Some make it look like the data is in a cube but actually query a relational database (ROLAP)
    – ‘How?’ you might ask!
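
As a rough illustration of the move from a flat relational table to a two-dimensional view, and of "slice and dice", the sketch below pivots a handful of flat rows into a year-by-state grid and then slices it by product. It is a minimal sketch only: the rows, the figures, the dimension names and the `pivot` helper are all invented for this example and do not come from the Pilot Software paper or from any OLAP product.

```python
from collections import defaultdict

# Hypothetical flat "relational" rows: (year, state, product, dollar_sales).
# All values are invented purely for illustration.
ROWS = [
    (1996, "VIC", "Widget", 100),
    (1996, "NSW", "Widget", 150),
    (1997, "VIC", "Widget", 120),
    (1997, "NSW", "Gadget", 80),
    (1997, "VIC", "Gadget", 60),
]

# The discrete variables become dimensions; the last column is the measure.
DIMENSIONS = ("year", "state", "product")


def pivot(rows, row_dim, col_dim, **fixed):
    """Sum the measure into a two-dimensional view.

    row_dim and col_dim name the dimensions placed on the two axes.
    fixed holds "slice" constraints on dimensions, e.g. product="Widget".
    """
    r = DIMENSIONS.index(row_dim)
    c = DIMENSIONS.index(col_dim)
    view = defaultdict(lambda: defaultdict(int))
    for row in rows:
        # Skip rows that fall outside the requested slice.
        if any(row[DIMENSIONS.index(d)] != v for d, v in fixed.items()):
            continue
        view[row[r]][row[c]] += row[-1]
    return {key: dict(cols) for key, cols in view.items()}


# Year down the side, state across the top, all products summed together.
by_year_state = pivot(ROWS, "year", "state")

# The same view, sliced to a single member of the "lurking" product dimension.
widgets_only = pivot(ROWS, "year", "state", product="Widget")

print(by_year_state)
print(widgets_only)
```

Swapping the arguments (for example `pivot(ROWS, "product", "year")`) gives a different face of the same cube, which is the sense in which the dimensions not on the axes are still "lurking" in the data.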

Star Schema
  Used to implement dimensional analysis using relational database technology
  Very common in data warehouses
    – Many variations
  Fact table
    – additive and non-additive facts
  Dimension tables
    – become constraints (WHERE part of SQL)

Star schema (with attributes)
  Customer: Customer key, Name, Customer type
  Sale: Time key, Store key, Customer key, Product key, Dollar sales, Unit sales
  Product: Product key, Product type, Weight
  Time: Time key, Day, Month
  Store: Store key, Address, Region

Snowflake schema
  [Figure: snowflake schema diagram. Tables: Customer Type, Customer, Product Type, Product, Sale, Time, Store, Region]

Conversion from ER to Star
  “Event remembered” or “transaction” entity types become fact tables
    – SALE
    – SHIPMENT
    – CLAIM
  “Master” entity types become dimension tables
    – CUSTOMER
    – PRODUCT
    – LOCATION

Uses of ER and Star Schemas
  ER schemas are useful for data mapping to legacy systems and for integration of the data warehouse
  Star schemas are useful for the design of warehouse databases as they are efficient and easy to understand and use
    – Allow relational databases to support multidimensional data cubes

Dimensions
  A star schema might (typically) have 10-15 dimensions
  Individual user views of the warehouse might include 6-7 of these
  Typical systems (e.g. an EIS) might have 20 different views and 4-5 different base fact tables
  Dimension tables can be related to a large number of facts

Steps in the design process
  1. Choose a business process
  2. Choose the grain of the fact table
    – Too fine -> oversized database
    – Too large -> loss of meaningful information
  3. Choose the dimensions
  4. Choose the measured facts (usually numeric, additive quantities)
  5. Complete the dimension tables
  Kimball (1996)

Extra steps in the design process
  6. Determine strategy for slowly changing dimensions
  7. Create aggregations and other physical storage components
  8. Determine the historical duration of the database
  9. Determine the urgency with which the data is to be extracted and loaded into the data warehouse.
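
The star schema with attributes described above can be sketched with Python's standard `sqlite3` module. This is a minimal sketch under stated assumptions: the table names (`sale_fact`, `customer_dim` and so on), the column types, the sample rows and the `sales_by_region_and_month` helper are all invented here for illustration; only the attribute lists follow the slide. It shows the two roles named earlier: additive facts are summed, and dimension attributes appear in the WHERE and GROUP BY parts of the SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One fact table (sale_fact) surrounded by four dimension tables.
# Table names and column types are assumptions made for this sketch.
conn.executescript("""
CREATE TABLE customer_dim (customer_key INTEGER PRIMARY KEY, name TEXT, customer_type TEXT);
CREATE TABLE product_dim  (product_key  INTEGER PRIMARY KEY, product_type TEXT, weight REAL);
CREATE TABLE time_dim     (time_key     INTEGER PRIMARY KEY, day INTEGER, month TEXT);
CREATE TABLE store_dim    (store_key    INTEGER PRIMARY KEY, address TEXT, region TEXT);
CREATE TABLE sale_fact (
    time_key     INTEGER REFERENCES time_dim(time_key),
    store_key    INTEGER REFERENCES store_dim(store_key),
    customer_key INTEGER REFERENCES customer_dim(customer_key),
    product_key  INTEGER REFERENCES product_dim(product_key),
    dollar_sales REAL,     -- additive fact
    unit_sales   INTEGER   -- additive fact
);
""")

# Invented sample data, purely for illustration.
conn.executemany("INSERT INTO customer_dim VALUES (?, ?, ?)",
                 [(1, "A. Smith", "Retail")])
conn.executemany("INSERT INTO product_dim VALUES (?, ?, ?)",
                 [(1, "Beverage", 0.5), (2, "Snack", 0.2)])
conn.executemany("INSERT INTO time_dim VALUES (?, ?, ?)",
                 [(1, 1, "Jan"), (2, 1, "Feb")])
conn.executemany("INSERT INTO store_dim VALUES (?, ?, ?)",
                 [(1, "1 Main St", "North"), (2, "9 High St", "South")])
conn.executemany("INSERT INTO sale_fact VALUES (?, ?, ?, ?, ?, ?)", [
    (1, 1, 1, 1, 10.0, 2),
    (1, 1, 1, 1, 5.0, 1),
    (1, 2, 1, 1, 7.0, 1),
    (2, 1, 1, 1, 4.0, 1),
    (1, 1, 1, 2, 3.0, 3),
])


def sales_by_region_and_month(connection, product_type):
    """Join the fact table to its dimensions and sum the additive facts.

    The product dimension supplies the WHERE constraint; the store and
    time dimensions supply the grouping (the two axes of the report).
    """
    query = """
        SELECT s.region, t.month, SUM(f.dollar_sales), SUM(f.unit_sales)
        FROM sale_fact AS f
        JOIN store_dim   AS s ON s.store_key   = f.store_key
        JOIN time_dim    AS t ON t.time_key    = f.time_key
        JOIN product_dim AS p ON p.product_key = f.product_key
        WHERE p.product_type = ?
        GROUP BY s.region, t.month
        ORDER BY s.region, t.month
    """
    return connection.execute(query, (product_type,)).fetchall()


rows = sales_by_region_and_month(conn, "Beverage")
for row in rows:
    print(row)
```

A query like this is one plausible shape for what a ROLAP tool generates when a user asks for sales by region and month for one product type. Choosing a coarser grain, such as one fact row per store per month, would shrink `sale_fact` but would make day-level questions unanswerable.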
  Kimball (1996)

That’s it from me!
  Check the web
  Useful links:
    – www.sims.monash.edu.au/dsslab
    – www.rkimball.com
    – www.dwassit.com
    – www.olap.org
  Stuff to read
    – Anything by Ralph Kimball, Bill Inmon, lots of others

Thinking in Data Warehousing
  Hard vs Soft ??
  Perspective
    – Objective vs Subjective
    – Nature of the organisation

Evaluation of Data Warehousing
                Problem oriented                  Product oriented
  Conceptual    Structured analysis               Structured design
                Entity relationship modelling     Object oriented design
                Logical construction of systems
                Modern structured analysis
                Object oriented analysis
  Formal        PSL/PSA                           Levels of abstraction
                JSD                               Stepwise refinement
                VDM                               Proof of correctness
                                                  Data abstraction
                                                  JSP
                                                  Object oriented programming

Data Warehousing’s view of ISD
  [Figure: elements of information systems development. Labels shown: Objectives, Development group, Object system, Change process, Environment]
  Hirschheim et al, see reading list