Data Warehouse, Data Mining and OLAP

Core of Business Intelligence technology
Data warehouse, data mining and on-line analytical processing (OLAP)

Business Intelligence and Analytics for Decision Support
• The diagram shows the role played by the data warehouse, data mining and OLAP in the overall business decision-making process.
• Business intelligence and analytics require a strong database foundation, a set of analytic tools, and an involved management team that can ask intelligent questions and analyze data. (Laudon and Laudon, Chapter 10)

The Data Warehouse
• “A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of ‘all’ an organisation’s data in support of management’s decision making process.”
• Why data warehouses developed, e.g.:
– If you want to ask “How much does this customer owe?” then the sales database is probably the one to use.
– However, if you want to ask “Was this ad campaign more successful than that one?”, you require data from more disparate sources, e.g. production, marketing etc.

Characteristics of a Data Warehouse
• Subject oriented – based around business processes, e.g. sale of products
• Integrated – inconsistencies removed
• Nonvolatile – stored in read-only format
• Time variant – data is “static” and updated periodically
• Summarized – in decision-usable format, e.g. monthly averages
• Large volume – data sets are quite large; all the pertinent data of an organisation
• Non normalized – often redundant: star or snowflake schema (it has dimension tables and fact tables)

The Atomic Schema
• Customer: Customer ID, Status Date, Cust Addr State, Cust ZIP Code, Customer Type, Customer Status, ...
• Cust Purchases: Customer ID, Activity Date, Product Code, Product Name, Sales Rep ID, Qty Purchased, Total Dollars, Promotion Flag
• Product Ref: Product Code, ProdRef Eff. Date, ProdRef End Date, Product Name, Unit Price, Product Category, Product Type, Product Sub Type
• Cust Averages: Customer ID, Cust Average Date, Cust Avg. End Date, Cust Avg. Rev., Cust Longevity
• Outlet Reference: Store ID, Store Name, Store Location, Distribution Channel
• Sales Rep Ref: Sales Rep ID, Sales Person Name, Store ID

For Example: (star schema; Purchases 1 is the fact table)
• Selling Responsibility: Sales Rep ID, Sales Rep Name, Store ID, Store Name, Store Location, Sales Channel
• Product: Product Code, Product Name, Prod. Category, Product Type, Prod Sub Type
• Customer Location: Cust ZIP Code, City, State/Province, Country
• Customer Type: Customer Type, Cust Type Desc
• Date Information: Week Ending Date, Month, Quarter, Year
• Purchases 1: Sales Rep ID, Product Code, Cust ZIP Code, Customer Type, Week Ending Date, Days of Activity, Unit Price, Total Quantity, Total Dollars, Returned Qty, Returned Dollars, Promotion Qty

Elements of Building a Data Warehousing Infrastructure
[Figure: A data warehouse process model. Labels: Operational Database(s), External Data, ETL Routine (Extract/Transform/Load), Data Warehouse, Extract/Summarize Data, Dependent Data Mart, Independent Data Mart, Decision Support System, Report.]

Meta Data
• A key concept behind the data warehouse (DW) is meta data.
– Meta data is data about the data (which has come from the data sources). It shows what data is contained in the DW, where it came from, and what changes have been made to it.
• The metadata are essential ingredients in the transformation of raw data into knowledge. They are the “keys” that allow us to handle the raw data.
– For example, a line in a sales database may contain: 1023 K596 111.21
– This is mostly meaningless until we consult the metadata (in the data directory), which tells us it was store number 1023, product K596 and sales of $111.21.

Meta Data Answers Questions for Users of the Data Warehouse
• How do I find the data I need?
• What is the original source of the data?
• How was this summarization created?
• What queries are available to access the data?
• How have business definitions and terms changed over time?
• How do product lines vary across organizations?
• What business assumptions have been made?
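As a rough illustration of the sales-line example above, the sketch below uses a small metadata dictionary to turn the raw record 1023 K596 111.21 into labelled fields. The field names, the METADATA layout and the decode function are assumptions made for illustration; they are not taken from any particular data warehouse product.

```python
from decimal import Decimal

# Hypothetical metadata (data dictionary) for one sales record layout.
# Each entry is (field name, description, converter).
METADATA = [
    ("store_number", "Store number", int),
    ("product_code", "Product code", str),
    ("sales_amount", "Sales in dollars", Decimal),
]

def decode(raw_line):
    """Pair each whitespace-separated value with its metadata entry."""
    values = raw_line.split()
    if len(values) != len(METADATA):
        raise ValueError("record does not match the metadata layout")
    return {name: convert(value)
            for (name, _description, convert), value in zip(METADATA, values)}

record = decode("1023 K596 111.21")
print(record)
```

Without METADATA the three values are just tokens; with it, each value gets a name and a type that downstream queries can rely on.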
Dependent Data Marts
• A data mart is a data store that is subsidiary to a data warehouse of integrated data.
• The data mart is directed at a partition of data (a subject area) that is created for the use of a dedicated group of users; it is sometimes termed a “subject warehouse”.
• The data mart might be a set of denormalised, summarised or aggregated data that can be placed on the data warehouse database or, more often, on a separate physical store.
• Data marts are “dependent data marts” when the data is sourced from the data warehouse.
• Independent data marts represent fragmented solutions to a range of business problems in the enterprise. Such a concept should not be deployed, as it lacks the data integration that is associated with data warehouses.

Independent Data Marts
• However, such marts are not necessarily all bad.
• They are often a valid solution to a pressing business problem:
– Extremely urgent user requirements
– The absence of a budget for a full data warehouse
– The decentralisation of business units

Data Warehousing Architecture
• Access tools
– The principal purpose of the data warehouse is to provide information for strategic decision making.
– The main decision tools used to achieve this objective are:
  • Data mining tools
  • On-line analytical processing tools
  • Decision support systems / executive information system tools

Data Warehousing Topology
• Central data warehouse: the DW is at a single location.
• Distributed data warehouse: the collection of data is replicated around multiple locations, so users have a local copy of the data warehouse. This can improve query run-times and reduce communications overheads.
(Note: the principles associated with distributed databases apply equally to distributed data warehouses; however, the static nature of the data needs to be factored into the design process.)
Data Warehouse Construction Tips
• Accept that your first try will require revision
• Examine the data: what formats and specific data are needed to support your application?
• Clean up the data before using it in the warehouse
• Build a prototype mini-data warehouse as a learning experience and revise strategies as necessary
• Plan on more users than anticipated wanting to use the warehouse
• Keep storage requirements constantly in mind

Sample type question
• Discuss how a data warehouse can play a key role in strategic decision making.

Data Mining
• The process of extracting valid, previously unknown, comprehensible, and actionable information from large databases and using it to make crucial business decisions.
• Involves the analysis of data and the use of software techniques for finding hidden and unexpected patterns and relationships in sets of data.

Data Mining
• Data mining tools use, e.g., AI techniques to help:
– predict future trends
– segment datasets
– find “product” associations
• This allows businesses to make proactive, knowledge-driven decisions.

Data mining: A.I. techniques
• The most commonly used A.I. techniques in data mining are:
– Decision trees: tree-shaped structures that represent sets of decisions. These decisions generate rules for the classification of a dataset.
– Nearest neighbour method: a technique that classifies each record in a dataset based on a combination of the classes of the k record(s) most similar to it in a historical dataset. Sometimes called the k-nearest neighbour technique; a classification technique.
– Rule induction: the extraction of useful if-then rules from data based on statistical significance.
– Artificial neural networks: predictive models that learn through training and resemble biological neural networks in structure.
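The nearest neighbour method listed above can be sketched in a few lines. The example classifies a new record by majority vote among the k historical records closest to it. The two numeric features (age, income in thousands), the sample records and the name knn_classify are all hypothetical.

```python
from collections import Counter
import math

# Hypothetical historical records: ((age, income in thousands), class label).
HISTORY = [
    ((25, 22), "No"),
    ((27, 25), "No"),
    ((30, 28), "No"),
    ((38, 55), "Yes"),
    ((40, 52), "Yes"),
    ((43, 58), "Yes"),
]

def knn_classify(record, history, k=3):
    """Return the majority class among the k records closest to `record`."""
    by_distance = sorted(history, key=lambda item: math.dist(item[0], record))
    votes = Counter(label for _features, label in by_distance[:k])
    return votes.most_common(1)[0][0]

older_high_income = knn_classify((39, 50), HISTORY)
younger_low_income = knn_classify((26, 24), HISTORY)
print(older_high_income, younger_low_income)
```

In practice the features would normally be rescaled first so that no single attribute dominates the distance; that step is left out here to keep the sketch short.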
How Data Mining Works
• For example, say that you are the director of marketing for an insurance company and you'd like to acquire some new customers.
– You could just randomly go out and mail coupons to the general population. However, you would not achieve the required result.
– Alternatively, as the marketing director you have access to a lot of information about all of your customers: their age, sex, income range and credit card insurance.

How Data Mining Works
                                                       Customers   Prospects
General information (e.g. demographic data)            Known       Known
Proprietary information (e.g. customer transactions)   Known       Target
• The goal in prospecting is to make some decisions about the information in the lower right-hand quadrant, based on the model that we build going from Customer General Information to Customer Proprietary Information.

An Algorithm for Building Decision Trees
The following is a decision tree algorithm:
1. Let T be the set of training instances.
2. Choose an attribute that best differentiates the instances in T.
3. Create a tree node whose value is the chosen attribute.
 – Create child links from this node, where each link represents a unique value for the chosen attribute.
 – Use the child link values to further subdivide the instances into subclasses.
4. For each subclass created in step 3:
 – If the instances in the subclass satisfy predefined criteria, or if the set of remaining attribute choices for this path is null, specify the classification for new instances following this decision path.
 – If the subclass does not satisfy the criteria and there is at least one attribute to further subdivide the path of the tree, let T be the current set of subclass instances and return to step 2.
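The four steps above can be sketched as a short recursive function, run here on the Credit Card Promotion Database (Table 3.1 in this section). Several choices are assumptions, since the slides do not fix them: "best differentiates" is measured by the lowest weighted entropy, the "predefined criteria" are taken to be a pure subclass or fewer than min_size instances, and age is reduced to a two-valued attribute split at 43, the threshold used in the example tree in these slides.

```python
import math
from collections import Counter

COLUMNS = ("income", "life_ins", "cc_ins", "sex", "age")
ROWS = [
    ("40-50K", "No", "No", "Male", 45), ("30-40K", "Yes", "No", "Female", 40),
    ("40-50K", "No", "No", "Male", 42), ("30-40K", "Yes", "Yes", "Male", 43),
    ("50-60K", "Yes", "No", "Female", 38), ("20-30K", "No", "No", "Female", 55),
    ("30-40K", "Yes", "Yes", "Male", 35), ("20-30K", "No", "No", "Male", 27),
    ("30-40K", "No", "No", "Male", 43), ("30-40K", "Yes", "No", "Female", 41),
    ("40-50K", "Yes", "No", "Female", 43), ("20-30K", "Yes", "No", "Male", 29),
    ("50-60K", "Yes", "No", "Female", 39), ("40-50K", "No", "No", "Male", 55),
    ("20-30K", "Yes", "Yes", "Female", 19),
]
TARGET = "life_ins"
# Step 1: T is the set of training instances (age binarised at 43, an assumption).
T = [dict(zip(COLUMNS, r), age="<=43" if r[4] <= 43 else ">43") for r in ROWS]

def entropy(instances):
    counts = Counter(row[TARGET] for row in instances)
    return -sum(c / len(instances) * math.log2(c / len(instances))
                for c in counts.values())

def split(instances, attribute):
    groups = {}
    for row in instances:
        groups.setdefault(row[attribute], []).append(row)
    return groups

def weighted_entropy(instances, attribute):
    return sum(len(g) / len(instances) * entropy(g)
               for g in split(instances, attribute).values())

def build_tree(instances, attributes, min_size=5):
    classes = Counter(row[TARGET] for row in instances)
    # Step 4: stop on a pure subclass, a small subclass, or no attributes left.
    if len(classes) == 1 or len(instances) < min_size or not attributes:
        return classes.most_common(1)[0][0]
    # Step 2: choose the attribute that best differentiates the instances.
    best = min(attributes, key=lambda a: weighted_entropy(instances, a))
    remaining = [a for a in attributes if a != best]
    # Step 3: one child link per attribute value, then recurse on each subclass.
    return {best: {value: build_tree(group, remaining, min_size)
                   for value, group in split(instances, best).items()}}

tree = build_tree(T, ["income", "cc_ins", "sex", "age"])
print(tree)
```

With these particular choices the root split is on age, followed by sex and then credit card insurance. A different goodness measure or stopping rule could produce a different tree from the same data.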
Table 3.1 • The Credit Card Promotion Database

Income Range   Life Insurance Promotion   Credit Card Insurance   Sex      Age
40–50K         No                         No                      Male     45
30–40K         Yes                        No                      Female   40
40–50K         No                         No                      Male     42
30–40K         Yes                        Yes                     Male     43
50–60K         Yes                        No                      Female   38
20–30K         No                         No                      Female   55
30–40K         Yes                        Yes                     Male     35
20–30K         No                         No                      Male     27
30–40K         No                         No                      Male     43
30–40K         Yes                        No                      Female   41
40–50K         Yes                        No                      Female   43
20–30K         Yes                        No                      Male     29
50–60K         Yes                        No                      Female   39
40–50K         No                         No                      Male     55
20–30K         Yes                        Yes                     Female   19

Life insurance promotion by income range:

Income Range   Yes   No
20–30K         2     2
30–40K         4     1
40–50K         1     3
50–60K         2     0

How Data Mining Works
• For instance, a simple model for an insurance company might be:
– Customers who earn between 50K and 60K have a life insurance policy.
• This model could then be applied to the general population to target those for the life insurance promotion.
• The tree can be more complex, e.g.:

Age
  <= 43: Sex
     Female: Yes (6/0)
     Male: Credit Card Insurance
        No: No (4/1)
        Yes: Yes (2/0)
  > 43: No (3/0)

Data Mining Operations
• Data mining operations include:
– Predictive modelling: decision trees, regression analysis…
– Database segmentation: clustering techniques
– Link analysis: decision trees, association rules

Predictive Modelling
• Applications of predictive modelling include direct marketing; it uses techniques like decision trees.
• It uses observations to form a model of the important characteristics of some phenomenon, e.g. those traits associated with those who will buy property.
[Figure: Simple decision tree example]

Database Segmentation
• The aim is to partition a database into an unknown number of segments, or clusters, of similar records.
• Uses clustering techniques in order to group data.
• Applications of database segmentation include credit card fraud detection.
[Figure: Database segmentation using a scatterplot]

Link Analysis
• Aims to establish links between records, or sets of records, in a database; one such example would be association discovery.
• Applications include product affinity analysis.
• Finds items that imply the presence of other items in the same event.

Link Analysis – Associations Discovery
• Affinities between items are represented by association discovery.
– e.g. ‘When a customer rents property for more than 2 years and is more than 25 years old, in 40% of cases the customer will buy a property. This association happens in 35% of all customers who rent properties.’

Examples of Applications of Data Mining
• Retail / Marketing
– Predicting response to mailing campaigns
– Market basket analysis
• Banking
– Detecting patterns of fraudulent credit card use
• Insurance
– Claims analysis
• Medicine
– Identifying successful medical therapies for different illnesses

Data mining in conclusion
• Two critical factors for success with data mining are:
– a large, well-integrated data warehouse, and
– a well-defined understanding of the business process within which data mining is to be applied (e.g. customer prospecting (target marketing), retention, campaign management etc.).

Sample type question
• Discuss, using suitable examples, how data mining can contribute to companies making proactive, knowledge-driven decisions which could help with the formulation of a company's strategy.

What is OLAP
• OLAP stands for “On-Line Analytical Processing”.
• OLTP stands for “On-Line Transaction Processing”.
• OLAP describes a class of technologies that are designed for live ad hoc data access and analysis.
• OLTP generally relies solely on relational databases.
• OLAP has become synonymous with multidimensional views of business data supported by multidimensional databases.
• Relational databases were never intended to provide data synthesis, analysis and consolidation functionality.

What is OLAP
• OLTP databases are optimised for transaction updating. OLAP applications, however, are used by managers and analysts for a higher-level aggregate view of the data, and are thus designed for analysis.
• Many problems that people try to solve using relational databases, e.g. summaries, are handled much more efficiently by an OLAP server than by an RDBMS.

Key OLAP Features
OLAP applications are found in widely divergent functional areas, as illustrated in the table opposite. They all have the following key features:
1. Multi-dimensional views of data (MD databases via star schema)
2. Support for complex calculations
3. Time intelligence

Star Schema: basis of MD view
A star schema for credit card purchases

Purchase Dimension
Purchase Key   Category
1              Supermarket
2              Travel & Entertainment
3              Auto & Vehicle
4              Retail
5              Restaurant
6              Miscellaneous
. . .          . . .

Time Dimension
Time Key   Month   Day   Quarter   Year
10         Jan     5     1         2002
. . .      . . .   . . . . . .     . . .

Cardholder Dimension
Cardholder Key   Name         Gender   Income Range
1                John Doe     Male     50 - 70,000
2                Sara Smith   Female   70 - 90,000
. . .            . . .        . . .    . . .

Location Dimension
Location Key   Street          City         State   Region
10             425 Church St   Charleston   SC      3
. . .          . . .           . . .        . . .   . . .

Fact Table
Cardholder Key   Purchase Key   Location Key   Time Key   Amount
1                2              1              10         14.50
15               4              5              11         8.25
1                2              3              10         22.40
. . .            . . .          . . .          . . .      . . .

Multi-dimensional view as a cube (can also be represented as a four-column table)
[Figure: Multidimensional cube for credit card purchases. Axes: Month (Jan. to Dec.), Category (Supermarket, Travel, Vehicle, Retail, Restaurant, Miscellaneous), Region (One, Two, Three, Four). Highlighted cell: Month = Dec., Category = Vehicle, Region = Two, Amount = 6,720, Count = 110.]
• Example of a three-dimensional query: what is the total amount and number of purchases for vehicles in region 2 for December?

Why Multidimensional Data
• Queries requiring only a single number to be retrieved need not use multidimensional databases.
• Queries that involve retrieving multiple numbers and aggregating them over large databases can become intolerably slow, as relational databases can scan only a few hundred records per second.
• However, multidimensional databases can add up 10,000 or more numbers in rows and columns per second.
• Thus for such queries multidimensional databases have an enormous performance advantage.

Multi-dimensional Operations
• Slice – a single-dimension operation
• Dice – a multidimensional operation
• Roll-up – a higher level of generalization
• Drill-down – a greater level of detail
• Rotation – view data from a new perspective

Simple Hierarchies: Roll up
• With hierarchical dimensions the database knows not to combine members of the dimension that are at different levels of the hierarchy; this is referred to as roll-up.
• It allows the user to view queries at any level, e.g. at street level, city level, state level and region level (refer to the star schema example above).
• Such hierarchies facilitate drill-down to successive levels of detail: state level, city level, street level.

Multiple hierarchies: roll up
• With multiple hierarchies, product sales, for example, can roll up by region, by type, by brand name and so forth. Without this capability an extra dimension would have to be created for each.
• Another use of multiple hierarchies is for geographical dimensions.

Drill down to core database
• Most organisations now utilise relational databases as standard for their data warehouses.
• Often there is no need to replicate all the data in the relational database into an MD database for OLAP.
• Summary-level data can be kept in the MD database and detailed data in the relational database.
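The slice and roll-up operations described above can be illustrated with plain Python. The sketch below aggregates a handful of purchase rows along the region, state and city hierarchy, then slices on one month. The rows, the DIMENSIONS tuple and the function names are hypothetical; a real OLAP server would precompute and store such aggregates rather than scan detail rows.

```python
from collections import defaultdict

# Hypothetical detail rows: (region, state, city, category, month, amount).
PURCHASES = [
    ("Three", "SC", "Charleston", "Vehicle", "Dec", 120.0),
    ("Three", "SC", "Charleston", "Retail", "Dec", 40.0),
    ("Three", "SC", "Columbia", "Vehicle", "Dec", 80.0),
    ("Three", "GA", "Atlanta", "Vehicle", "Nov", 60.0),
    ("One", "NY", "Albany", "Vehicle", "Dec", 200.0),
]
DIMENSIONS = ("region", "state", "city", "category", "month")

def roll_up(rows, *levels):
    """Total the amount and count the purchases for each combination of levels."""
    index = [DIMENSIONS.index(level) for level in levels]
    totals = defaultdict(lambda: [0.0, 0])
    for row in rows:
        key = tuple(row[i] for i in index)
        totals[key][0] += row[-1]
        totals[key][1] += 1
    return {key: tuple(value) for key, value in totals.items()}

def slice_(rows, dimension, member):
    """Keep only the rows where one dimension equals the given member."""
    i = DIMENSIONS.index(dimension)
    return [row for row in rows if row[i] == member]

by_city = roll_up(PURCHASES, "region", "state", "city")  # most detailed level
by_state = roll_up(PURCHASES, "region", "state")         # rolled up one level
by_region = roll_up(PURCHASES, "region")                 # rolled up again
december = roll_up(slice_(PURCHASES, "month", "Dec"), "region", "category")
print(by_region)
print(december)
```

Reading the three roll-ups from by_region back to by_city is the drill-down direction: the same totals, broken out at successively finer levels of the hierarchy.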
Drilling to relational data
• To get a single number from an MD database takes the same time as it does from a relational database.
• Thus it would be futile to load individual customers into an MD database. But for summarised data an MD database is superior.
• Thus ideally you should be able to drill down through the MD database into the relational database.
• Such an approach is useful as most of the data volume will reside at the detailed level and will thus not hinder queries at the higher levels.

Support for complex calculations
• Important computational features of OLAP servers include:
– Independently dimensioned variables (IDV)
– Statistical calculations
– Consolidation speed
– Vector arithmetic

OLAP calculations: Variables
• Variables are numeric measures (facts) such as sales, cost, price…; dimensions include region, customer type, product…; i.e. fact table and dimension tables.
• OLAP servers can treat variables as a special dimension, so one can select only the relevant dimensions for each variable (IDV). See next slide.
• OLAP servers must provide a range of powerful computational and statistical methods, such as those required by sales forecasting: regression analysis, projection, correlations…
• They can also incorporate various rules for consolidation.

[Figure: Star schema for property sales of DreamHome]

Vector Arithmetic
• Data held in 2-D arrays (matrices) can be more easily manipulated than data stored in a relational table.
• Thus a 2-D plane for actual can easily be subtracted from a plane for budget to give a plane for variance.
• Such arithmetic allows entire planes of the database to be combined quickly.

Time Series Data Types
• Users want to look at trends in all aspects of their business, e.g. sales trends, market trends etc.
• A series of numbers representing a particular variable over time is called a time series, e.g. 52 weekly sales numbers is a time series.
• Utilising a time-series data type allows you to store an entire string of numbers representing daily, weekly or monthly data.
• Thus an OLAP server that supports a time-series data type allows one to store historical data without having to specify a separate dimension for time.
• Unlike other dimensions, time has special attributes and rules.

Time-series data type
• Time series always have a particular periodicity.
• Time-series data must include rules to convert one periodicity to another.
• In the absence of a time-series data type, a new dimension must be declared and labelled explicitly.
• A time-series data cell contains a great deal of information compared with a single cell or even a full record.

Time-Series Data types
• Consider the following example for a time-series data type of sales:
– Start date = 1/1/2000
– Periodicity = Daily, business days only
– Conversion = Summation
– Long description = Variable=Sales, Product=Nuts, Region=East
– Data type = Numeric, single precision
– Sparsity = Non-sparse
– Calendar = 445 Fiscal year
– Data points = 708, 800, 821, 743, 779, 856, 878, 902, 799, ...

Time-series data types
• Start date is the date of the first data point.
• Periodicity can be daily, weekly etc., with calendar years, fiscal periods, business weeks etc. being understood.
• Data type can be single precision, double precision, text strings or dates.
• Sparse data is used where the same number is used over and over again, e.g. price. Defining it as sparse would cause the database to store the dates on which the price changed and the corresponding new values.
• Data points can store very long time series, e.g. 10 years of daily data.

Sparse Data
• When less than 10% of the cells contain data, the database is said to be sparsely populated, or sparse.
• Sparsity can also occur if there are many cells that contain the same number, e.g. the price of a product every day.
• This situation can also be represented by storing the number once, along with the number of days that the number is repeated.
• While a relational database would fill up the database with duplicate data, an OLAP server that understands sparse data can skip over zeros, missing data and duplicate data.

Conclusion
• In essence OLAP technology is a fast, flexible data summarisation and analysis tool.
• The data analysis requires the ability to summarise data in many ways and view trends.
• It should have 3 main characteristics: MD views, ability to perform complex calculations, time intelligence.

Alternative Database Topology: The Star Schema (D.W., O.L.A.P., data mining)

The Star Schema
[Figure: a central Fact Table (Dimension Key 1 to Dimension Key 4, Fact 1 to Fact n) linked to Dimension Tables 1 to 4. Each dimension table holds its Dimension Key, a Description, and aggregation levels (Aggregatn Lvl x.1, x.2, ... x.n).]

Dimension Table
• Describes the data that has been organized in the Fact Table
• Key should either be the most detailed aggregation level necessary (e.g. country vs. county), if possible, or...
• Surrogate keys may be necessary, but will decrease the natural value of the key
• Manageable number of aggregation levels

Fact Table
• Quantifies the data that has been described by the Dimension Tables
• Key made up of unique combination of values of dimension keys
– ALWAYS contains date or date dimension
• Fact values should be additive
– Aggregations of quantities or amounts from atomic level
– No percentages or ratios
– May be non-additive, time-variant data

Star Schema Query
A query against the example star schema (fact table Purchases 1 with the Selling Responsibility, Product, Customer Type and Date Information dimensions):

    Select E.Month, B.Customer_Type, C.Product_Type, D.Store_Location,
           sum(A.Total_Quantity)
    From   Purchases_1 A, Customer_Type B, Product C,
           Selling_Responsibility D, Date_Information E
    Where  B.Customer_Type = A.Customer_Type
      and  C.Product_Code = A.Product_Code
      and  D.Sales_Rep_ID = A.Sales_Rep_ID
      and  E.Week_Ending_Date = A.Week_Ending_Date
      and  E.Year = '1996'
      and  C.Product_Category = 'V'
    Group by E.Month, B.Customer_Type, C.Product_Type, D.Store_Location;

Distinct Time Period Fact Tables
[Figure: a Weekly fact table and a Monthly fact table, each linked to dimensions D1 to D4 and Date.]
• Create separate fact tables to account for different time periods
• Date still part of each fact table key
• Same dimension tables used by both fact tables
• Improves overall performance (loading and accessing) for each time period
• Will not increase amount of managed redundancy

Question
• Business decisions require the delivery of critical information in a timely, suitable format. Explain, using appropriate examples, how OLAP can facilitate the business decision-making process.

Question
• A data warehouse, a data mining system and OLAP are three important technologies used in facilitating business decision making. Using suitable examples:
– The star schema is a database schema that can be utilised by all three technologies. Describe, using a simple example, the essential elements of this schema. (10 marks)
– Explain how any two of the technologies could be used to provide information to formulate or derive simple business strategies. (20 marks)
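As a rough way to try out the star schema query from this section, the sketch below builds a cut-down version of the example schema in an in-memory SQLite database and runs that join with standard SQL quoting. Only the columns the query touches are created, and every data row (product codes, store location, dates, quantities) is invented for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Customer_Type (Customer_Type TEXT PRIMARY KEY, Cust_Type_Desc TEXT);
CREATE TABLE Product (Product_Code TEXT PRIMARY KEY, Product_Category TEXT,
                      Product_Type TEXT);
CREATE TABLE Selling_Responsibility (Sales_Rep_ID INTEGER PRIMARY KEY,
                                     Store_Location TEXT);
CREATE TABLE Date_Information (Week_Ending_Date TEXT PRIMARY KEY, Month TEXT,
                               Year TEXT);
-- Fact table: one foreign key per dimension plus an additive measure.
CREATE TABLE Purchases_1 (Sales_Rep_ID INTEGER, Product_Code TEXT,
                          Customer_Type TEXT, Week_Ending_Date TEXT,
                          Total_Quantity INTEGER);

INSERT INTO Customer_Type VALUES ('R', 'Retail');
INSERT INTO Product VALUES ('P1', 'V', 'Type A'), ('P2', 'X', 'Type B');
INSERT INTO Selling_Responsibility VALUES (7, 'Dublin');
INSERT INTO Date_Information VALUES ('1996-01-07', 'Jan', '1996'),
                                    ('1997-01-05', 'Jan', '1997');
INSERT INTO Purchases_1 VALUES (7, 'P1', 'R', '1996-01-07', 10),
                               (7, 'P1', 'R', '1996-01-07', 5),
                               (7, 'P2', 'R', '1996-01-07', 99),
                               (7, 'P1', 'R', '1997-01-05', 3);
""")

QUERY = """
SELECT E.Month, B.Customer_Type, C.Product_Type, D.Store_Location,
       SUM(A.Total_Quantity)
FROM   Purchases_1 A, Customer_Type B, Product C,
       Selling_Responsibility D, Date_Information E
WHERE  B.Customer_Type = A.Customer_Type
  AND  C.Product_Code = A.Product_Code
  AND  D.Sales_Rep_ID = A.Sales_Rep_ID
  AND  E.Week_Ending_Date = A.Week_Ending_Date
  AND  E.Year = '1996'
  AND  C.Product_Category = 'V'
GROUP BY E.Month, B.Customer_Type, C.Product_Type, D.Store_Location
"""

rows = con.execute(QUERY).fetchall()
print(rows)
```

Only the two 1996 purchases of the category 'V' product survive the filters, so the fact rows collapse into a single summed group. This is the pattern the star schema is designed for: filter on dimension attributes, then aggregate an additive fact.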
