Data Warehousing

Definition
• Data Warehouse:
  – A subject-oriented, integrated, time-variant, nonupdatable collection of data used in support of management decision-making processes
  – Subject-oriented: e.g. customers, patients, students, products
  – Integrated: consistent naming conventions, formats, and encoding structures; drawn from multiple data sources
  – Time-variant: contains a time dimension so that it can be used to study trends and changes
  – Nonupdatable: read-only, periodically refreshed
• Data Mart:
  – A data warehouse that is limited in scope

Need for Data Warehousing
• Integrated, company-wide view of high-quality information (from disparate databases)
• Separation of operational and informational (decision support) systems and data (for improved performance)

Data Warehouse Architectures
• Generic Two-Level Architecture
• Independent Data Mart
All involve some form of extraction, transformation and loading (ETL)

Figure 11-2: Generic two-level data warehousing architecture
  – One, companywide warehouse
  – Periodic extraction: data is not completely current in the warehouse

Figure 11-3: Independent data mart data warehousing architecture
  – Data marts: mini-warehouses, limited in scope
  – Separate ETL for each independent data mart
  – Data access complexity due to multiple data marts

The ETL Process
• Capture/Extract
• Scrub (data cleansing)
• Transform:
  – Convert data from the format of the source to the format of the data warehouse
• Load and Index
ETL = extract, transform, and load
Load/Index = place transformed data into the warehouse and create indexes

Figure 11-10: Steps in data reconciliation (cont.)
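The four ETL steps listed above (capture/extract, scrub, transform, load and index) can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not the method from the figures: the source rows, the `sales_fact` table, the `idx_sales_date` index, the gender encodings and the two accepted date formats are all hypothetical, and an in-memory SQLite database stands in for the warehouse.

```python
import sqlite3
from datetime import datetime

# Hypothetical rows captured from operational systems with inconsistent
# formats; every value here is made up for illustration.
SOURCE_ROWS = [
    {"cust": " alice ", "gender": "F", "sale_date": "2024-01-05", "amount": "19.90"},
    {"cust": "BOB", "gender": "male", "sale_date": "01/06/2024", "amount": "5"},
    {"cust": "", "gender": "M", "sale_date": "2024-01-07", "amount": "7.50"},
]

# Assumed mapping from source encodings to one warehouse encoding.
GENDER_CODES = {"f": "F", "female": "F", "m": "M", "male": "M"}


def extract():
    """Capture/extract: take a snapshot of the source rows."""
    return [dict(row) for row in SOURCE_ROWS]


def scrub(rows):
    """Scrub: trim whitespace and reject rows with no customer name."""
    cleaned = []
    for row in rows:
        row = {key: value.strip() for key, value in row.items()}
        if row["cust"]:
            cleaned.append(row)
    return cleaned


def to_iso_date(text):
    """Accept ISO or MM/DD/YYYY input (an assumption) and return ISO."""
    for fmt in ("%Y-%m-%d", "%m/%d/%Y"):
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {text!r}")


def transform(rows):
    """Transform: convert source formats to the warehouse format."""
    return [
        (
            row["cust"].title(),
            GENDER_CODES[row["gender"].lower()],
            to_iso_date(row["sale_date"]),
            float(row["amount"]),
        )
        for row in rows
    ]


def load_and_index(conn, rows):
    """Load/index: place transformed rows in the warehouse, then index."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales_fact "
        "(customer TEXT, gender TEXT, sale_date TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?)", rows)
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_sales_date ON sales_fact (sale_date)"
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
load_and_index(conn, transform(scrub(extract())))
loaded = conn.execute(
    "SELECT customer, gender, sale_date, amount FROM sales_fact ORDER BY sale_date"
).fetchall()
```

The row with the empty customer name is rejected during scrubbing, so only two rows reach the warehouse, both with the same name casing, gender code and date format.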
Refresh mode: bulk rewriting of target data at periodic intervals
Update mode: only changes in source data are written to the data warehouse

Index
• Bitmap index
• Join index

Figure 6-8: Bitmap index organization
  – Bitmap saves on space requirements
  – Rows: possible values of the attribute
  – Columns: table rows
  – Each bit indicates whether the attribute of that table row has that value

Figure 6-9: Join indexes speed up join operations

Star Schema for Data Warehouse
• Objectives
  – Ease of use for decision support applications
  – Fast response to predefined user queries
  – Customized data for particular target audiences
Also called the "dimensional model"
• Dimension:
  – A dimension is any category used in analyzing data, such as time, geography, and product line.

Figure 11-13: Components of a star schema
  – Fact tables contain factual or quantitative data
  – 1:N relationship between dimension tables and fact tables
  – Dimension tables are denormalized to maximize performance
  – Dimension tables contain descriptions about the subjects of the business
  – Excellent for ad-hoc queries, but bad for online transaction processing

Figure 11-14: Star schema example
  – Fact table provides statistics for sales broken down by product, period and store dimensions

Figure 11-15: Star schema with sample data

On-Line Analytical Processing (OLAP) Tools
• The use of a set of graphical tools that provides users with multidimensional views of their data and allows them to analyze the data using simple windowing techniques
• Relational OLAP (ROLAP)
  – Traditional relational representation
• Multidimensional OLAP (MOLAP)
  – Cube structure
• OLAP Operations
  – Cube slicing: produce a 2-D view of the data
  – Drill-down: going from summary to more detailed views

Figure 11-23: Slicing a data cube

Figure 11-24: Example of drill-down
  – Starting with summary data, users can obtain details for particular cells
  – Summary report
  – Drill-down with color added

Data Mining and Visualization
• Knowledge discovery using a blend of statistical, AI, and computer graphics techniques
• Goals:
  – Explain observed events or conditions
  – Confirm hypotheses
  – Explore data for new or unexpected relationships
• Techniques
  – Statistical regression
  – Decision tree induction
  – Clustering and signal processing
  – Affinity
  – Sequence association
  – Case-based reasoning
  – Rule discovery
  – Neural nets
  – Fractals
• Data visualization: representing data in graphical/multimedia formats for analysis

Pivot Table
• Excel:
  – Drill Down, Roll Up
• Access CrossTab query

SQL GROUPING SETS
• GROUPING SETS
    SELECT CITY, RATING, COUNT(CID) FROM HCUSTOMERS
    GROUP BY GROUPING SETS(CITY, RATING, (CITY, RATING), ())
    ORDER BY CITY;
• Note: () indicates that an overall total is desired.

SQL CUBE
• Performs aggregations for all possible combinations of the columns indicated.
    SELECT CITY, RATING, COUNT(CID) FROM HCUSTOMERS
    GROUP BY CUBE(CITY, RATING)
    ORDER BY CITY, RATING;

SQL ROLLUP
• The ROLLUP extension causes cumulative subtotals to be calculated for the columns indicated. If multiple columns are indicated, subtotals are produced for each of the columns except the far-right column, plus a grand total. ROLLUP(CITY, RATING) therefore returns counts per (CITY, RATING), a subtotal per CITY, and one overall total.
    SELECT CITY, RATING, COUNT(CID) FROM HCUSTOMERS
    GROUP BY ROLLUP(CITY, RATING)
    ORDER BY CITY, RATING;
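
To see which groupings each of the three extensions above produces, the following Python sketch emulates them over a tiny in-memory table. It is an illustration only, not how a database engine implements these clauses: the `HCUSTOMERS` rows are made up, and the helper names (`cube_sets`, `rollup_sets`, `count_by_grouping_sets`) are hypothetical. Columns left out of a grouping set are reported as `None`, mirroring the NULL placeholders in the SQL result.

```python
from collections import Counter
from itertools import combinations

# Hypothetical HCUSTOMERS rows as (CID, CITY, RATING); values are made up.
HCUSTOMERS = [
    ("C1", "Chicago", "A"),
    ("C2", "Chicago", "B"),
    ("C3", "Chicago", "A"),
    ("C4", "Denver", "A"),
]
COLUMNS = {"CITY": 1, "RATING": 2}  # column name -> position in the row tuple


def cube_sets(cols):
    """CUBE: every subset of the columns, including the empty set ()."""
    return [
        combo
        for size in range(len(cols), -1, -1)
        for combo in combinations(cols, size)
    ]


def rollup_sets(cols):
    """ROLLUP: each left-to-right prefix of the columns, down to ()."""
    return [tuple(cols[:size]) for size in range(len(cols), -1, -1)]


def count_by_grouping_sets(rows, grouping_sets):
    """Return {grouping set: Counter of (CITY, RATING) key -> row count}.

    Simplification: for an empty table SQL still returns one grand-total
    row for (), whereas this sketch returns an empty Counter.
    """
    return {
        grouping: Counter(
            tuple(
                row[COLUMNS[name]] if name in grouping else None
                for name in COLUMNS
            )
            for row in rows
        )
        for grouping in grouping_sets
    }


# GROUPING SETS(CITY, RATING, (CITY, RATING), ()) written out explicitly.
explicit_sets = [("CITY",), ("RATING",), ("CITY", "RATING"), ()]

by_sets = count_by_grouping_sets(HCUSTOMERS, explicit_sets)
by_cube = count_by_grouping_sets(HCUSTOMERS, cube_sets(("CITY", "RATING")))
by_rollup = count_by_grouping_sets(HCUSTOMERS, rollup_sets(("CITY", "RATING")))

for grouping, counts in by_rollup.items():
    print(grouping, dict(counts))
```

With two columns, the explicit GROUPING SETS list in the first query names the same four grouping sets that CUBE(CITY, RATING) generates, so the two queries aggregate identically. ROLLUP(CITY, RATING) omits the RATING-only set, which is why it yields per-city subtotals and a grand total but no per-rating subtotals.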