BI & DM for CRM Lecture 2 Data Warehouse and OLAP Steps in Knowledge Discovery Steps in KD cont... Data Cleaning: To remove noise and inconsistent data / check data validity & correctness – PLAY WİTH DATA (good step to know your data) Data İntegration: To combine multiple sources of data / integrate various data sources on various platforms (DBMS/flatfiles/spreedsheets etc..) Data Selection: Retrieve relevent data from data repository for analysis Data Transformation: To transform or consolidate data into different forms to deploy in DM operations. (If Data Warehouses were used, the process would be before the data selection.) Data Mining: To apply intelligent models for extracting patterns from data. Pattern Evaluation: To identify interesting patterns, result of DM process, on some measures. Knowledge presentation: To visualize the representation of knowledge, driven by DM process and pattern evaluationz DM System Architecture So, what is Data Warehouse? Loosely speaking; • A decision support database that is maintained separately from the organization’s operational database • Support information processing by providing a solid platform of consolidated, historical data for analysis. W.H. Inmon- ‘ A data warehouse is a subject-oriented, integrated, time-varient and nonvolatile collection of data in support of management’s decision making proces.’ DW—Subject-Oriented Organized around major subjects, such as customer, product, sales. Focusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processing. Provide a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process. 6 DW—Integrated Constructed by integrating multiple, heterogeneous data sources relational databases, flat files, on-line transaction records Data cleaning and data integration techniques are applied. Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources E.g., Hotel price: currency, tax, breakfast covered, etc. When data is moved to the warehouse, it is converted. 7 DW—Time Variant The time horizon for the data warehouse is significantly longer than that of operational systems. Operational database: current value data. Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years) Every key structure in the data warehouse Contains an element of time, explicitly or implicitly But the key of operational data may or may not contain “time element”. 8 DW—Non-Volatile A physically separate store of data transformed from the operational environment. Operational update of data does not occur in the data warehouse environment. Does not require transaction processing, recovery, and concurrency control mechanisms Requires only two operations in data accessing: • initial loading of data and access of data. 9 On-line transaction processing vs On-line Analytical Processing OLTP OLAP users clerk, IT professional knowledge worker function day to day operations decision support DB design application-oriented subject-oriented data current, up-to-date detailed, flat relational isolated repetitive historical, summarized, multidimensional integrated, consolidated ad-hoc lots of scans unit of work read/write index/hash on prim. key short, simple transaction # records accessed tens millions #users thousands hundreds DB size 100MB-GB 100GB-TB metric transaction throughput query throughput, response usage access complex query 10 Why seperate Database? Main reason is high performance for both systems – processing OLAP queries in transactional DBs would substantially degrade the performance of operational tasks Unlike OLTP, OLAP usually needs read only access from the database. So, for OLAP operations DBs does not require concurrent control and recovery. If applied for OLAP this may jeopardize the execution of OLTP operations Systems are mainly used for different operations... Data Model- Multidimentional DW & OLAP are based on multidementional data models... Actually... It is all about multidimentional data models... 2D view of data Facebook Data... POSTING Time Video Photo Write Up W1 200 500 1000 W2 250 520 1023 W3 310 782 1051 3D view of Data cont MALE FEMALE N/A POSTING POSTING POSTING Time Video Photo Write Video Photo Write Up Up W1 50 W2 W3 120 400 80 150 500 Video Photo Write Up 70 230 100 LATTICE OF CUBOIDS ALL 1-D Cuboids Time Time, Posting 2-D Cuboids Time, Posting, Gender 3-D Cuboids Posting Time, Gender Time, Education 0 D – Apex cuboid Gender Education Posting, Time, Gender Education Time, Posting, Time, Gender, Education Education Time, Posting, Gender, Education Gender, Education Posting, Gender, Education 4-D Cuboids Latice & Cuboids Any n-D data as a series of (n-1)-D “cubes” In data warehousing literature, A data cube is referred to as a cuboid The lattice of cuboids forms a data cube. The cuboid holding the lowest level of summarization is called a the 4-D cuboid is the base cuboid for the given four dimensions • base cuboid. The top most 0-D cuboid, which holds the highest-level of summarization, is called the apex cuboid. Here this is the total posting typically denoted by all Conceptual Modelling of DW Modeling data warehouses: dimensions & measures – Star schema: A fact table in the middle connected to a set of dimension tables – Snowflake schema: A refinement of star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to snowflake – Fact constellations: Multiple fact tables share dimension tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation STAR SCHEMA Time Time_key Day Day_of_week Post Post Type Time_key Post_key Post_key Post _name Education_key Month Gender_key Quarter Total Post Year Gender Education Education_key Education_name Edu_type_key Gender_key Gender_name Snow Flake Schema Time Time_key Day Day_of_week FACT TABLE Post Post Type Time_key Post_key Post_key Post _name Education_key Month Gender_key Quarter Total Post Year Gender Gender_key Education Gender_name Education_key Education_name Edu_type_key Edu Type Edu_type_key Edu_type_name Fact Constellation Time Post Post Type Time_key Time_key Post_key Day Post_key Post _name Day_of_week Education_key Chat Month Gender_key Time_key Quarter Total Post Education_key Year Gender_key Total_chat_time Education Education_key Education_name Edu_type_key Gender Gender_key Gender_name Total_online_time Categorization & Computation How are measures computed? Data Cube measure is a numarical function that can be evaluated at each point in the data cube space. Distributive- sum(), min(), max() How many posts are from USA? Sum (Washington)+Sum(Nevada).... Algebraic: sum(),count(), standard_deviation()... Holisatic: median(), mode() Concept Hierarchies What is it? Flow of a concept (could be dimension) from lower levels to higher levels. More specific More General Example for concept hierarchy Location all all Europe region country city office Germany Frankfurt ... ... ... Spain North_America Canada Vancouver ... L. Chan ... ... Mexico Toronto M. Wind 23 Hierarchical vs lattice Structures (Concept Hierarchy) year country quarter Province or state city Street week month day We also have user defined concept hierarcies like Fiscal year or Academic Year How can we use Concept Hierarchies useful in OLAP? In multi dimensional model, data are organized in multi dimensions and each dimension contains multi level of abstraction defined by concept hierarchies This type of organization provides user to view the data from various perspectives Basically, OLAP provides user friendly environment for interactive data analysis. Roll-up (drill-up) Drill-down Slice and Dice: The slice operation performs a selection on one dimension of the given cube, resulting in a subcube. The operation defines a subcube by performing a selection on two or more dimensions. Pivot: visualization operation that rotates the data axes in view in order to provide an alternative presentation of the data. Roll-up (drill-up):Performs aggregation on a data cube, either by climbing up a concept hierarchy for a dimension or dimension reduction. PC Printer location by city Istanbul Ankara Berlin Münih 20 30 50 40 15 5 10 20 Hierarchy Roll Up Dimension Roll Up Location by country Location TR GR PC 140 PC 50 90 Printer 50 Printer 20 30 Drill Down:Drill down is reverse of roll-up. It navigates from general hierarchy to more specific hierarchy. Adding new dimension to data Year 2009 PC 100 Printer 150 Quarter Drill Down Q1 Q2 Q3 Q4 PC 20 30 40 10 Printer 30 60 20 40 The slice operation performs a selection on one dimension of the given cube, resulting in a subcube. The operation defines a subcube by performing a selection on two or more dimensions. Measure Quarter Q1 PC 20 Q2 30 40 Q1 Q4 10 Dice Printer 30 60 Time 2010 Slice Q3 20 Sales PC 20 Printer 30 40 Time 2010 Q1 PC 20 drill across: involving (across) more than one fact table drill through: through the bottom level of the cube to its back-end relational tables (using SQL) ranking the top N or bottom N items in lists moving averages growth rates interests Radial lines from a central point each line represents a concept hierarchy for a dimension each abstraction level is called a footprint granularities available for use by OLAP four radial lines for concept hierarchies location,customer,item,time time line has 4 footprints: day,month,quarter,year A Star-Net Query Model Customer Orders Shipping Method Customer CONTRACTS AIR-EXPRESS ORDER TRUCK PRODUCT LINE Time ANNUALY QTRLY DAILY CITY PRODUCT ITEM Product PRODUCT GROUP SALES PERSON COUNTRY DISTRICT REGION Location DIVISION Each circle is called a footprint Promotion Organization Design of a Data Warehouse: A Business Analysis Framework Four views regarding the design of a data warehouse Top-down view Data source view exposes the information being captured, stored, and managed by operational systems Data warehouse view allows selection of the relevant information necessary for the data warehouse consists of fact tables and dimension tables Business query view sees the perspectives of data in the warehouse from the view of end-user • • • Top-down, bottom-up approaches or a combination of both Top-down: Starts with overall design and planning (mature) Bottom-up: Starts with experiments and prototypes (rapid) From software engineering point of view Waterfall: structured and systematic analysis at each step before proceeding to the next Spiral: rapid generation of increasingly functional systems, short turn around time, quick turn around Typical data warehouse design process Choose a business process to model, e.g., orders, invoices, etc. Choose the grain (atomic level of data) of the business process Choose the dimensions that will apply to each fact table record Choose the measure that will populate each fact table record Enterprise Warehouse Collects all information about subject spanning of the entire organization Data Mart A subset of corporate-wide data that is valuable to specific groups of users. Such as marketing Virtual Warehouse A set of views over operational databases Only some of the possible summary views may be materialized Cuboids are referred as aggregations One factor affecting storage requirements Sparsity: the amount of empty cells in a cube The base cuboid is likely to contain many empty cells it is a spares cube or array the 0 or lower dimensional cuboids are less spares than the higher dimensional ones it is not likely that they contain empty cells Moving along higher levels for the dimension hierarchy the cuboids becomes less spares or more dense PC 01.10.2003 DV Two dimensional sparse cuboid 10 02.10.2003 03.10.2003 Prt CD 1 4 04.10.2003 2 2 items One dimensional dense cuboid 01.10.2003 10 02.10.2003 1 03.10.2003 6 04.10.2003 2 ROLAP – Relational OLAP MOLAP – Multidimensional OLAP HOLAP – Hybrid OLAP Use relational or extended-relational DBMS to store and manage warehouse data and OLAP middle ware to support missing pieces query response is generally slower low storage requirement Include optimization of DBMS backend, implementation of aggregation navigation logic, and additional tools and services greater scalability appropriate for large data sets that are infrequently queried historical data from less recent previous years Array-based multidimensional storage engine (sparse matrix techniques) fast indexing to pre-computed summarized data a two-level storage representation dense subcubes are stored as array structures spars subcubes are stored by compression techniques appropriate for cubes with frequent use and rapid query response combines ROLAP and MOLAP benefiting from greater scalability of ROLAP faster computation of MOLAP Large volumes of data base cuboid is stored in a relational database aggregations are stored as arrays appropriate for for cubes that requre rapid query response for summaries based on a large amount of base data Data cubes can be viewed as lattice of cuboids The bottom cuboid is the base cuboid The top is the apex cuboid What is the number of cuboids for N dimensional data cube? 2N OLAP computes at least some of the cuboids For fast response For avoiding redundant calculation all 0-D(apex) cuboid product product,date date country product,country 1-D cuboids date, country 2-D cuboids product, date, country 3-D(base) cuboid Three Types Materialize every cuboids Huge amounts of memory space Non materialization – Zero cube calculation Show processing of queries Some (partial) materialization Trade off between storage space and response time Selection of which cuboids to materialize Based on size, sharing, access frequency and etc Complete load of the cube all dimension and fact table data is read and all specified aggregations cuboids are calculated process a cube when its structure is new or its dimensions or measures have been edited Incrementally updating a cube new data is added but existing data not changed and cube structure si the same Refreshing data cleared and reloaded its aggregations recalculated faster then processing:no design of aggregation tables Dimension member or measure whose value is computed at run time using an expression Only the definitions are stored but values exists only in memory upon a query do not increase in cube size Ex: if sales and cost are included in the base fact table a profit measure can be a calculated member profit = sales – cost Average_sales = sales/#_items_sold Combination of multiple cubes in one logical cube can be based on a single cube to expose only selected subsets of measures and dimensions Require no physical space store only the dimensions information not actual data provide a valuable security function limiting the access of some users Attribute of a dimension member provides additional information about the member a column in the same dimension table as the associated members used in queries provide users more options when analysing cube data A typical time table: (time_id,day,month,quarter,year,business day,leap,day of the week) dimension levels day<month<quarter<year member properties for day: weekend or business day:0 or 1 day of the week:1,2,3,...,7 a member property for year is whether it is leap year or not:0 or 1 Logical dimension based on a member property of a level in a physical dimension enables users to analyse cube data based on the member properties of dimension levels add a virtual dimension to a cube only if the dimension that supplies its member property is also included in the cube adding a virtual dimension does not increase cube size not affect cube processing time calculated in memory when needed query processing time is slower The business day column was a member property for day level of the time dimension the user may want to investigate sales by type of day (business or weekend) makes business day member property as a virtual dimension of the sale cube Based on two dimension table columns that together define the lieage relationships among the members of the dimension Member key column:identifies each member Parent key column: identifies the parent of each member Example: A HR Department Poal West James Smith John Grande Amy Joens Jo Brown Jill Kelly Emp Name Emp_id Manager_id James Smith 1 3 Amy Jones 2 3 Paul West 3 3 Jim Kelly 4 3 John Grande 5 1 Jo Brown 6 1 Emp_id represnts each employee Manager_id represents parent dimention Q? How can you represnt the same hierarchy in the traditional concept hierarchy? Three kinds of data warehouse applications Information processing supports querying, basic statistical analysis, and reporting using cross-tabs, tables, charts and graphs Analytical processing multidimensional analysis of data warehouse data supports basic OLAP operations, slice-dice, drilling, pivoting Data mining knowledge discovery from hidden patterns supports associations, constructing analytical models, performing classification and prediction, and presenting the mining results using visualization tools. Differences among the three tasks Why online analytical mining? High quality of data in data warehouses – DW contains integrated, consistent, cleaned data Available information processing structure surrounding data warehouses – ODBC, OLEDB, Web accessing, service facilities, reporting and OLAP tools OLAP-based exploratory data analysis (mining with drilling, dicing, pivoting, etc.) On-line selection of data mining functions – integration and swapping of multiple mining functions, algorithms, and tasks. Q&A ???