Introduction to Data Warehousing
Data Warehousing and OLAP
Prof. 楊立偉, Department of Business Administration, National Taiwan University

Agenda
1. Introduction
2. Data Warehouse Theory
3. System Features
4. Demo
5. Discussions

1. Introduction

1.1 Introduction
• A data warehouse is a subject-oriented, integrated, time-variant, nonvolatile collection of data in support of management decisions.

1.1 Introduction (cont'd)
How are organizations using data warehouses?
1. Increasing customer focus, which includes the analysis of customer buying patterns.
2. Repositioning products and managing product portfolios by comparing sales performance by time or region, in order to fine-tune production strategies.
3. Analyzing operations and looking for sources of profit.
4. Managing the customer relationship, making environmental corrections, and managing the cost of corporate assets.

1.2 Data Warehouse Characteristics
• It is a database designed for analytical tasks, using data from multiple applications.
• It supports a relatively small number of users with relatively long interactions.
• Its usage is read-intensive.
• Its content is periodically updated.

1.2 Data Warehouse Characteristics (cont'd)
• It contains current and historical data to provide a historical perspective of information.
• It contains a few large tables.
• Each query frequently returns a large result set and involves frequent full table scans and multi-table joins.

1.3 Data Warehousing
• The process of constructing and using data warehouses
  – Constructing the data warehouse: heterogeneous data sources → data cleaning → data integration and consolidation
  – Using the data warehouse: interactive analysis → making strategic decisions

1.4 Three-tier System Architecture
[Figure: three tiers, from top to bottom: OLAP tools, data warehouse server, operational DBMS. Users shown: executives or decision-making staff; IT or data warehouse administrators.]

2. Data Warehouse Theory

2.1 Data Warehouse Theory
• Why not query the operational databases directly?
  – The query-driven approach is inefficient.
  – It is potentially expensive for frequent queries.
• Use a data warehouse instead
  – The update-driven approach, which integrates the data in advance, is enough for making strategic decisions.
  – It keeps the analysis workload separate from the operational DBMS, which stays dedicated to daily and critical operations.

2.2 Data Cube
• A multidimensional, logical view of the data
• Concept hierarchy
  – Multiple data granularity
  – Data summarization
  – Data generalization

2.2 Data Cube (figures)
• A 3-dimensional data cube
• Drill-down on time data for Q1
• Roll-up on address
• Adding a dimension: supplier

2.3 Efficient Data Cube Computation
• The challenge: 2^N combinations (cuboids) for N dimensions
  – Concept hierarchies and aggregations make it more complicated!
• Materialization of the data cube (how to implement it)
  – Materialize every cuboid, none, or some?
  – Algorithms for selection
    • Based on size
    • Based on sharing
    • Based on access frequency
[Figure: lattice of cuboids for the dimensions Address, Time, Item]
  ALL
  Address | Time | Item
  Address, Time | Address, Item | Time, Item
  Address, Time, Item

2.4 On-Line Analytical Processing (OLAP)
• Fast on-line processing of data cubes or multidimensional databases
• OLAP operations:
  – Drilling
  – Pivoting
  – Slicing and dicing
  – Filtering, etc.

2.4 On-Line Analytical Processing (cont'd)
• A multidimensional, logical view of the data.
• Interactive analysis of the data (drill, pivot, slice and dice, filter) and quick response to OLAP queries.
• Summarization and aggregation at every dimension intersection.
• Retrieval and display of data in 2-D or 3-D cross-tabs, charts, and graphs, with easy pivoting of the axes.
• Analytical modeling: deriving ratios, variance, etc., involving data across many dimensions.
• Forecasting, trend analysis, and statistical analysis.

3. System Features

3.1 Data sources supported
• ODBC-compatible DBMS
  – Oracle, Microsoft SQL Server, MySQL, IBM DB2, etc.
• Files
  – MS Access, MS Excel, etc.
  – Text files (CSV format)

3.2 Data Cleansing
• Database schema translation
  – Field selection and mapping
  – Field renaming
  – Field aggregating and deriving
• Data filtering
• Data value conversion
  – Data value mapping
  – Data value function
  – Date value conversion and decomposition

3.3 Building of Data Cube
• Support for multidimensional data
• Support for concept hierarchy

3.5 Interactive Front-end Tools
• User-defined multiple dimensions
• User-defined dimension hierarchy
• User-defined data granularity
• Real-time graph capabilities
  – Bar chart
  – Pie chart
  – Line chart

3.6 Other features
• Web-based OLAP GUI
  – Easy to access from the Internet
• Easy to integrate with other systems
  – Import / export capability

4. Demo

5. Discussions

5.1 Roadmap
• Integration with data mining
  – Major Group / Sales Analysis (core customer segments)
  – Prospects Analysis and Forecast
  – Association of Customers and Sales
  – Market Segment Recommendation
  – Other Business Intelligence applications
• Integration with e-Marketing
  – 1-to-1 Personalization & Recommendation
  – Target marketing
  – Loyalty program
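
Supplementary sketch for Sections 2.2 to 2.4: the data cube ideas (the 2^N cuboids, roll-up, and slicing) can be illustrated in a few lines of Python. This is a minimal sketch under stated assumptions, not the demonstrated system's implementation: the fact rows, the measure (units sold), and the helper names `cuboids`, `aggregate`, and `slice_cube` are all hypothetical and chosen only for illustration. The three dimensions (address, time, item) follow the example cube in the slides.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact rows: (address, time, item, units_sold). Illustrative data only.
FACTS = [
    ("Taipei", "Q1", "TV", 10),
    ("Taipei", "Q1", "PC", 5),
    ("Taipei", "Q2", "TV", 7),
    ("Tainan", "Q1", "TV", 3),
    ("Tainan", "Q2", "PC", 8),
]
DIMENSIONS = ("address", "time", "item")


def cuboids(dims):
    """Enumerate every group-by subset of dims: 2^N cuboids,
    from ALL (the empty tuple) up to the base cuboid (all dimensions)."""
    return [combo for k in range(len(dims) + 1) for combo in combinations(dims, k)]


def aggregate(facts, group_by):
    """Sum the measure per combination of the chosen dimensions,
    which rolls up (aggregates away) every dimension not listed."""
    idx = [DIMENSIONS.index(d) for d in group_by]
    totals = defaultdict(int)
    for row in facts:
        totals[tuple(row[i] for i in idx)] += row[-1]
    return dict(totals)


def slice_cube(facts, dim, value):
    """Slice: keep only the facts where one dimension has a fixed value."""
    i = DIMENSIONS.index(dim)
    return [row for row in facts if row[i] == value]


all_cuboids = cuboids(DIMENSIONS)                        # 2^3 = 8 group-bys
by_address_time = aggregate(FACTS, ("address", "time"))  # finer granularity
by_address = aggregate(FACTS, ("address",))              # roll-up over time and item
q1_by_item = aggregate(slice_cube(FACTS, "time", "Q1"), ("item",))  # slice on Q1
grand_total = aggregate(FACTS, ())                       # the ALL cuboid

print(len(all_cuboids))   # 8
print(by_address)         # {('Taipei',): 22, ('Tainan',): 11}
print(q1_by_item)         # {('TV',): 13, ('PC',): 5}
```

Going from `by_address` to `by_address_time` corresponds to a drill-down on time, and the reverse direction is a roll-up. Materializing every entry of `all_cuboids` in advance is the "materialize every" option from Section 2.3; computing each one on demand, as this sketch does, is the "none" option.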