DESIGNING THE DATA WAREHOUSE
DATA MARTS
Basic principles

WAREHOUSE COMPONENTS
[Figure: warehouse components in three columns.
 Any Source: operational data; external data.
 Any Data: relational / multidimensional; text, image; spatial; web; audio, video.
 Any Access: relational tools; OLAP tools; applications / web.
 Also labelled: Oracle Medi...]

INTELLIGENCE TOOLS
[Figure: IS develops users' views. Business users (current, tactical) work with Reports; analysts (strategic) work with Discoverer.]

DATA MART SUITE
[Figure: OLTP databases and OLTP engines feed the warehousing engines and the data mart database.
 Stages: Data Modeling; Data Extraction; Data Management; Data Access & Analysis.
 Tools: Data Mart Designer; Data Mart Builder; Enterprise Manager; Discoverer & Reports.]

DATA MART
A collection of subject areas organized for decision support, based on the needs of a given department.
Characteristics include:
• A star-join structure that is optimal for the needs of the users found in the department.
• A dependent data mart is one whose source is a data warehouse.
• An independent data mart is one whose source is the legacy applications environment.

DATA WAREHOUSES VERSUS DATA MARTS
Property              Data Warehouse     Data Mart
Scope                 Enterprise         Department
Subject               Multiple           Single-subject
Data source           Many               Few
Size (typical)        100 GB to >1 TB    <100 GB
Implementation time   Months to years    Months

DEPENDENT DATA MART
[Figure: operational systems, flat files, and external data feed the data warehouse (Marketing, Sales, Finance, Human Resources), which in turn feeds the data marts (Marketing, Sales, Finance).]

INDEPENDENT DATA MART
[Figure: operational systems, flat files, and external data feed a Sales or Marketing data mart directly.]

REASONS FOR CREATING A DATA MART
• To give users more flexible access to the data they need to analyse most often.
• To provide data in a form that matches the collective view of a group of users.
• To improve end-user response time.
• Potential users of a data mart are clearly defined and can be targeted for support.
• Building a data mart is simpler than establishing a corporate data warehouse.
• The cost of implementing a data mart is far less than that required to establish a data warehouse.

EXAMPLE OF DW TOOL: OLAP
• Rotate and drill down to successive levels of detail.
• Create and examine calculated data interactively on large volumes of data.
• Determine comparative or relative differences.
• Perform exception and trend analysis.
• Perform advanced analytical functions, for example forecasting, modeling, and regression analysis.

ORIGINAL OLAP RULES
1. Multidimensional conceptual view
2. Transparency
3. Accessibility
4. Consistent reporting performance
5. Client-server architecture
6. Multi-user support
7. Unrestricted cross-dimensional operations
8. Intuitive data manipulation
9. Flexible reporting
10. Unlimited dimensions and aggregation levels

RELATIONAL DATABASE MODEL
         Attribute 1   Attribute 2   Attribute 3   Attribute 4
         Name          Age           Gender        Emp No.
Row 1    Anderson      31            F             1001
Row 2    Green         42            M             1007
Row 3    Lee           22            M             1010
Row 4    Ramos         32            F             1020
The table above illustrates the employee relation.

MULTIDIMENSIONAL DATABASE MODEL
The data is found at the intersection of dimensions.
[Figure: SALES shown first with two dimensions and then with three dimensions (Store, Time, Product).]

ROLAP SERVER
• The warehouse stores atomic data.
• The application layer generates SQL for the three-dimensional view.
• The presentation layer provides the multidimensional view.
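The ROLAP idea above, where an application layer turns a requested set of dimensions into SQL over atomic warehouse rows, can be sketched as follows. This is a minimal illustration and not the behaviour of any particular ROLAP product: the `sales` table, its columns, the sample rows, and the `build_rollup_sql` helper are all assumptions made for the example.

```python
import sqlite3

# Hypothetical atomic rows: (store, month, product, amount).
ROWS = [
    ("S1", "Jan", "TV", 100.0),
    ("S1", "Jan", "PC", 250.0),
    ("S1", "Feb", "TV", 120.0),
    ("S2", "Jan", "TV", 80.0),
    ("S2", "Feb", "PC", 300.0),
]

# Only these column names may be interpolated into generated SQL.
ALLOWED_DIMENSIONS = ("store", "month", "product")


def build_rollup_sql(dimensions):
    """Generate the GROUP BY query for the requested dimensions."""
    for dim in dimensions:
        if dim not in ALLOWED_DIMENSIONS:
            raise ValueError(f"unknown dimension: {dim}")
    if not dimensions:
        # No dimensions requested: a single grand total.
        return "SELECT SUM(amount) FROM sales"
    cols = ", ".join(dimensions)
    return (
        f"SELECT {cols}, SUM(amount) FROM sales "
        f"GROUP BY {cols} ORDER BY {cols}"
    )


def run_view(conn, dimensions):
    """Execute the generated SQL and return the aggregated rows."""
    return conn.execute(build_rollup_sql(dimensions)).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (store TEXT, month TEXT, product TEXT, amount REAL)"
)
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", ROWS)

by_store_product = run_view(conn, ["store", "product"])
grand_total = run_view(conn, [])
```

The warehouse keeps only the atomic rows; each multidimensional view is produced on demand by a different generated query, which is why ROLAP query cost depends on the SQL engine rather than on a precomputed cube.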
[Figure: ROLAP server architecture. A DSS client sends a query to the ROLAP engine (application layer), which issues multiple SQL statements to the warehouse server; results come from the ROLAP data cache or from a live fetch. Labels: Warehouse, Data, Express Server, Express user.]

MOLAP SERVER
• The application layer stores data in a multidimensional structure.
• The presentation layer provides the multidimensional view.
• Efficient storage and processing
• Complexity hidden from the user
• Analysis using preaggregated summaries and precalculated measures
[Figure: MOLAP server architecture. A DSS client sends a query to the MOLAP engine (application layer), which reads the MOLAP MDDB; the MDDB is filled from the warehouse by a periodic load. Labels: Warehouse, Data, Express Server, Express user.]

CHOOSING A REPORTING ARCHITECTURE
• Business needs
• Potential for growth
• Interface
• Enterprise architecture
• Network architecture
• Speed of access
• Openness
[Figure: query performance (OK to Good) plotted against analysis (Simple to Complex), positioning MOLAP and ROLAP.]

DATA ACQUISITION
• Identify, extract, transform, and transport source data.
• Consider internal and external data.
• Perform gap analysis between source data and target database objects.
• Plan the move of data between sources and target.
• Define first-time load and refresh strategy.
• Define tool requirements.
• Build, test, and execute data acquisition modules.

MODELING
• Warehouses differ from operational structures:
  • Analytical requirements
  • Subject orientation
• Data must map to subject-oriented information:
  • Identify business subjects
  • Define relationships between subjects
  • Name the attributes of each subject
• Modeling is iterative.
• Modeling tools are available.

MODELING THE DATA WAREHOUSE
1. Defining the business model
2. Creating the dimensional model
3. Modeling summaries
4. Creating the physical model
[Figure: the four phases, labelled 1 (select a business process), 2, 3, and 4 (physical model).]

FROM TABLES AND SPREADSHEETS TO DATA CUBES
• A data warehouse is based on a multidimensional data model, which views data in the form of a data cube.
• A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions.
• Dimension tables, such as item (item_name, brand, type) or time (day, week, month, quarter, year).
• A fact table contains measures (such as dollars_sold) and keys to each of the
  related dimension tables.
• In data warehousing literature, an n-D base cube is called a base cuboid. The topmost 0-D cuboid, which holds the highest level of summarization, is called the apex cuboid.

CUBE: A LATTICE OF CUBOIDS
0-D (apex) cuboid:  all
1-D cuboids:        time; item; location; supplier
2-D cuboids:        time,item; time,location; time,supplier; item,location; item,supplier; location,supplier
3-D cuboids:        time,item,location; time,item,supplier; time,location,supplier; item,location,supplier
4-D (base) cuboid:  time, item, location, supplier

IDENTIFYING BUSINESS RULES
• Location: geographic proximity (0 - 1 miles, 1 - 5 miles, > 5 miles)
• Time: Month > Quarter > Year
• Product:
  • Type: PC, Server
  • Monitor: 15 inch, 17 inch, 19 inch, None
  • Status: New, Rebuilt, Custom
• Store: Store > District > Region

CREATING THE DIMENSIONAL MODEL
• Identify fact tables:
  • Translate business measures into fact tables
  • Analyze source system information for additional measures
  • Identify base and derived measures
  • Document additivity of measures
• Identify dimension tables
• Link fact tables to the dimension tables
• Create views for users

DIMENSION TABLES
Dimension tables have the following characteristics:
• Contain textual information that represents the attributes of the business
• Contain relatively static data
• Are joined to a fact table through a foreign key reference
[Figure: a Facts (units, price) table surrounded by the Product, Channel, Customer, and Time dimensions.]

A CONCEPT HIERARCHY: DIMENSION (LOCATION)
[Figure: tree with one branch per level.
 all:      all
 region:   Europe, ..., North_America
 country:  Germany, ..., Spain; Canada, ..., Mexico
 city:     Frankfurt, ...; Vancouver, ..., Toronto
 office:   L. Chan, ..., M. Wind]
(Data Mining: Concepts and Techniques)

FACT TABLES
Fact tables have the following characteristics:
• Contain numeric measures (metrics) of the business
• May contain summarized (aggregated) data
• May contain date-stamped data
• Have a key that is typically a concatenated key composed of the primary keys of the dimensions
• Are joined to dimension tables through foreign keys that reference primary keys in the dimension tables
[Figure: a Facts (units, price) table surrounded by the Product, Channel, Customer, and Time dimensions.]

CONCEPTUAL MODELING OF DATA WAREHOUSES
Modeling data warehouses: dimensions & measures
• Star schema: a fact table in the middle connected to a set of dimension tables.
• Snowflake schema: a refinement of the star schema in which some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to a snowflake.
• Fact constellations: multiple fact tables share dimension tables. The result can be viewed as a collection of stars and is therefore called a galaxy schema or fact constellation.

DIMENSIONAL MODEL (STAR SCHEMA)
[Figure: the fact table, Facts (units, price), at the center, with the dimension tables Product, Channel, Customer, and Time around it.]

STAR SCHEMA MODEL
• Central fact table
• Radiating dimensions
• Denormalized model
Sales Fact Table:  Product_id, Store_id, Item_id, Day_id, Sales_dollars, Sales_units, ...
Product Table:     Product_id, Product_desc, …
Time Table:        Day_id, Month_id, Period_id, Year_id
Store Table:       Store_id, District_id, ...
Item Table:        Item_id, Item_desc, ...
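The star schema above can be sketched as a small runnable example: one fact table whose concatenated key is made of foreign keys, and a star-join query that groups by dimension attributes. This is only an illustration under stated assumptions. The column names follow the slide, but the `_dim` table-name suffix, the reduced column set, and every data value are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Dimension tables: descriptive, relatively static attributes.
    CREATE TABLE product_dim (product_id INTEGER PRIMARY KEY, product_desc TEXT);
    CREATE TABLE store_dim   (store_id   INTEGER PRIMARY KEY, district_id INTEGER);
    CREATE TABLE time_dim    (day_id     INTEGER PRIMARY KEY, month_id INTEGER,
                              year_id    INTEGER);

    -- Fact table: numeric measures plus a concatenated key of foreign keys.
    CREATE TABLE sales_fact (
        product_id    INTEGER REFERENCES product_dim(product_id),
        store_id      INTEGER REFERENCES store_dim(store_id),
        day_id        INTEGER REFERENCES time_dim(day_id),
        sales_dollars REAL,
        sales_units   INTEGER,
        PRIMARY KEY (product_id, store_id, day_id)
    );
    """
)

# Hypothetical sample data.
conn.executemany("INSERT INTO product_dim VALUES (?, ?)",
                 [(1, "Widget"), (2, "Gadget")])
conn.executemany("INSERT INTO store_dim VALUES (?, ?)",
                 [(10, 100), (11, 100)])
conn.executemany("INSERT INTO time_dim VALUES (?, ?, ?)",
                 [(20240101, 202401, 2024), (20240201, 202402, 2024)])
conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?, ?)", [
    (1, 10, 20240101, 50.0, 5),
    (1, 11, 20240101, 30.0, 3),
    (2, 10, 20240201, 70.0, 7),
    (1, 10, 20240201, 20.0, 2),
])

# Star join: the fact table joins directly to each dimension it needs.
STAR_JOIN = """
    SELECT p.product_desc, t.month_id,
           SUM(f.sales_dollars), SUM(f.sales_units)
    FROM sales_fact AS f
    JOIN product_dim AS p ON p.product_id = f.product_id
    JOIN time_dim    AS t ON t.day_id     = f.day_id
    GROUP BY p.product_desc, t.month_id
    ORDER BY p.product_desc, t.month_id
"""
sales_by_product_month = conn.execute(STAR_JOIN).fetchall()
```

Every dimension is one join away from the fact table, which is what keeps star-schema queries short. A snowflake variant would add further joins, for example from the store table to a separate district table.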
STAR SCHEMA MODEL
• Easy for users to understand
• Fast response to queries
• Simple metadata
• Supported by many front-end tools
• Less robust to change

SNOWFLAKE SCHEMA MODEL
• Direct use by some tools
• More flexible to change
• Provides for speedier data loading
• May become large and unmanageable
• Degrades query performance
• More complex metadata
Sales Fact Table:  Item_id, Store_id, Sales_dollars, Sales_units
Product Table:     Product_id, Product_desc
Store Table:       Store_id, Store_desc, District_id
District Table:    District_id, District_desc
Time Table:        Week_id, Period_id, Year_id
Item Table:        Item_id, Item_desc, Dept_id
Dept Table:        Dept_id, Dept_desc, Mgr_id
Mgr Table:         Dept_id, Mgr_id, Mgr_name

EXAMPLE OF FACT CONSTELLATION
Sales Fact Table:     time_key, item_key, branch_key, location_key
                      Measures: units_sold, dollars_sold, avg_sales
Shipping Fact Table:  time_key, item_key, shipper_key, from_location, to_location
                      Measures: dollars_cost, units_shipped
time:      time_key, day, day_of_the_week, month, quarter, year
item:      item_key, item_name, brand, type, supplier_type
branch:    branch_key, branch_name, branch_type
location:  location_key, street, city, province_or_street, country
shipper:   shipper_key, shipper_name, location_key, shipper_type

USING SUMMARY DATA
• Provides fast access to precomputed data
• Reduces use of I/O, CPU, and memory
• Usually exists in summary fact tables
[Figure label: Average?]

MEASURES: THREE CATEGORIES
• distributive: if the result derived by applying the function to n aggregate values is the same as that derived by applying the function on all the data without partitioning.
  E.g., count(), sum(), min(), max().
• algebraic: if it can be computed by an algebraic function with M arguments (where M is a bounded integer), each of which is obtained by applying a distributive aggregate function.
  E.g., avg(), min_N(), standard_deviation().
• holistic: if there is no constant bound on the storage size needed to describe a subaggregate.
  E.g., median(), mode(), rank().
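The three measure categories can be checked on a small partitioned data set. The sketch below is illustrative only: the partition values are made up, and the variable names are assumptions. It shows that sum() and max() combine correctly from per-partition results, that avg() can be rebuilt from two distributive results, and that median() cannot be rebuilt from per-partition medians.

```python
from statistics import median

# Hypothetical data, already split into three partitions.
partitions = [[4, 8, 15], [16, 23], [42]]
all_values = [v for part in partitions for v in part]

# Distributive: aggregate each partition, then aggregate the partial results.
partial_sums = [sum(part) for part in partitions]
partial_counts = [len(part) for part in partitions]
partial_maxes = [max(part) for part in partitions]
total_sum = sum(partial_sums)
total_count = sum(partial_counts)
overall_max = max(partial_maxes)

# Algebraic: avg() is a function of two distributive results (sum, count).
average = total_sum / total_count

# Holistic: per-partition medians are not enough to recover the true median.
median_of_medians = median(median(part) for part in partitions)
true_median = median(all_values)
```

Here `median_of_medians` and `true_median` disagree, which is why a summary table can store sums and counts for later roll-up but generally has to go back to the detail rows for a median.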
DESIGNING SUMMARY TABLES
[Figure: a summary table of Units and Sales (€) by Store, with totals for Product A, Product B, and Product C, illustrating the aggregates Average, Total, Maximum, and Percentage.]

SUMMARY TABLES EXAMPLE
SALES FACTS
Sales     Region   Month
10,000    North    Jan 99
12,000    South    Feb 99
11,000    North    Jan 99
15,000    West     Mar 99
18,000    South    Feb 99
20,000    North    Jan 99
10,000    East     Jan 99
 2,000    West     Mar 99

SALES BY MONTH/REGION
Month     Region   Tot_Sales$
Jan 99    North    41,000
Jan 99    East     10,000
Feb 99    South    30,000
Mar 99    West     17,000

SALES BY MONTH
Month     Tot_Sales
Jan 99    51,000
Feb 99    30,000
Mar 99    17,000

SUMMARY MANAGEMENT
[Figure: a Sales summary built over Sales, with Region, State, City, Product, and Time.]
Summary advisor:
• Summary usage
• Summary recommendations
• Space requirements

THE TIME DIMENSION
• Time is critical to the data warehouse.
• A consistent representation of time is required for extensibility.
• How and where should it be stored?
[Figure: the Sales fact table linked to the Time dimension.]

MULTIDIMENSIONAL DATA
Sales volume as a function of product, month, and region.
Dimensions: Product, Location, Time
Hierarchical summarization paths:
• Product: Industry > Category > Product
• Location: Region > Country > City > Office
• Time: Year > Quarter > Month or Week > Day

A SAMPLE DATA CUBE
[Figure: a cube with axes Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), Product (TV, PC, VCR, sum), and Country (U.S.A, Canada, Mexico, sum). The highlighted cell is the total annual sales of TV in U.S.A.]
CUBOIDS CORRESPONDING TO THE CUBE
0-D (apex) cuboid:  all
1-D cuboids:        product; date; country
2-D cuboids:        product,date; product,country; date,country
3-D (base) cuboid:  product, date, country
September 22, 2012

BROWSING A DATA CUBE
• Visualization
• OLAP capabilities
• Interactive manipulation

TYPICAL OLAP OPERATIONS
• Roll up (drill-up): summarize data by climbing up a hierarchy or by dimension reduction.
• Drill down (roll down): the reverse of roll-up; go from a higher-level summary to a lower-level summary or detailed data, or introduce new dimensions.
• Slice and dice: project and select.
• Pivot (rotate): reorient the cube; visualization; 3D to a series of 2D planes.
• Other operations:
  • drill across: involving (across) more than one fact table
  • drill through: through the bottom level of the cube to its back-end relational tables (using SQL)

[Figures: Slicing a data cube; Summary report; Example of drill-down; Drill-down with color added.]

DATA WAREHOUSE BACK-END TOOLS AND UTILITIES
• Data extraction: get data from multiple, heterogeneous, and external sources.
• Data cleaning: detect errors in the data and rectify them when possible.
• Data transformation: convert data from legacy or host format to warehouse format.
• Load: sort, summarize, consolidate, compute views, check integrity, and build indices and partitions.
• Refresh: propagate the updates from the data sources to the warehouse.
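The five back-end steps can be sketched end to end on toy data. Everything here is hypothetical: the inline CSV "legacy" source, the external row, the dictionary used as a warehouse, and the function names are assumptions that only illustrate the order of operations, not any real tool.

```python
import csv
import io

# Hypothetical heterogeneous sources: a legacy CSV export and an external feed.
LEGACY_CSV = """id,region,amount
1,north,100
2,south,abc
3,west,250
"""
EXTERNAL_ROWS = [{"id": "4", "region": "EAST", "amount": "75"}]


def extract():
    """Extraction: gather rows from every source."""
    return list(csv.DictReader(io.StringIO(LEGACY_CSV))) + EXTERNAL_ROWS


def clean(rows):
    """Cleaning: drop rows whose amount is not numeric."""
    good = []
    for row in rows:
        try:
            float(row["amount"])
        except ValueError:
            continue
        good.append(row)
    return good


def transform(rows):
    """Transformation: convert source strings to warehouse types."""
    return [
        {"id": int(r["id"]), "region": r["region"].title(),
         "amount": float(r["amount"])}
        for r in rows
    ]


def load(warehouse, rows):
    """Load: store facts by key, check integrity, rebuild the summary."""
    for row in rows:
        warehouse["facts"][row["id"]] = row  # same id overwrites the old fact
    assert all(f["amount"] >= 0 for f in warehouse["facts"].values())
    totals = {}
    for fact in sorted(warehouse["facts"].values(), key=lambda f: f["id"]):
        totals[fact["region"]] = totals.get(fact["region"], 0.0) + fact["amount"]
    warehouse["sales_by_region"] = totals
    return warehouse


def refresh(warehouse, changed_rows):
    """Refresh: push changed source rows through the same pipeline."""
    return load(warehouse, transform(clean(changed_rows)))


warehouse = load({"facts": {}}, transform(clean(extract())))
refresh(warehouse, [{"id": "3", "region": "west", "amount": "300"}])
```

The row with the non-numeric amount is rejected during cleaning, and the refresh replaces the fact with id 3 and recomputes the regional summary, so the summary never drifts from the stored facts.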