Course: M0614 / Data Mining & OLAP
Year: Feb 2010
Bina Nusantara

Data Cube Computation and Data Generalization
Session 04

Learning Outcomes
By the end of this session, students are expected to be able to:
• Design data cubes, the computation of data cubes, and data generalization. (C5)
• Demonstrate how to explore multidimensional databases further. (C3)

Acknowledgments
These slides have been adapted from Han, J., Kamber, M., & Pei, J., Data Mining: Concepts and Techniques, and Tan, P.-N., Steinbach, M., & Kumar, V., Introduction to Data Mining.

Outline
• Efficient computation of data cubes
• Exploration and discovery in multidimensional databases
• Discovery-driven exploration of data cubes
• Sampling cube
• Summary

OLAP
• On-Line Analytical Processing (OLAP) was proposed by E. F. Codd, the father of the relational database.
• Relational databases put data into tables, while OLAP uses a multidimensional array representation.
  – Such representations of data previously existed in statistics and other fields.
• A number of data analysis and data exploration operations are easier with such a data representation.

Creating a Multidimensional Array
• Converting tabular data into a multidimensional array takes two key steps.
  – First, identify which attributes are to be the dimensions and which attribute is to be the target attribute, whose values appear as entries in the multidimensional array.
    • The attributes used as dimensions must have discrete values.
    • The target value is typically a count or a continuous value, e.g., the cost of an item.
    • There may be no target variable at all, other than the count of objects that have the same set of attribute values.
  – Second, find the value of each entry in the multidimensional array by summing the values of the target attribute, or counting the objects, that have the attribute values corresponding to that entry.
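The two steps above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `build_array`, the record layout, and the sample sales rows are all made up for this example and do not come from the slides. The array is stored sparsely as a dictionary keyed by dimension values, so any cell that is not listed is implicitly 0.

```python
from collections import defaultdict

def build_array(rows, dims, target=None):
    """Map each combination of dimension values to the summed target.

    When no target attribute is given, each cell holds the count of
    objects that share that combination of dimension values.
    """
    cells = defaultdict(float) if target else defaultdict(int)
    for row in rows:
        key = tuple(row[d] for d in dims)          # step 1: pick the dimensions
        cells[key] += row[target] if target else 1  # step 2: sum or count
    return dict(cells)

# Hypothetical records, invented for illustration only.
rows = [
    {"product": "pen", "store": "north", "cost": 2.0},
    {"product": "pen", "store": "north", "cost": 3.0},
    {"product": "pen", "store": "south", "cost": 2.5},
    {"product": "ink", "store": "south", "cost": 7.0},
]

totals = build_array(rows, dims=("product", "store"), target="cost")
counts = build_array(rows, dims=("product", "store"))
```

Reading a cell with `totals.get(key, 0)` treats every unspecified combination as 0, which matches the convention used for the Iris example later in the deck.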
Iris Sample Data Set
• Many of the exploratory data techniques are illustrated with the Iris Plant data set.
  – From the statistician R. A. Fisher
  – Three flower types (classes):
    • Setosa
    • Virginica
    • Versicolour
  – Four (non-class) attributes
    • Sepal width and length
    • Petal width and length
[Image: Virginica. Robert H. Mohlenbrock. USDA NRCS. 1995. Northeast wetland flora: Field office guide to plant species. Northeast National Technical Center, Chester, PA. Courtesy of USDA NRCS Wetland Science Institute.]

Example: Iris data
• We show how the attributes petal length, petal width, and species type can be converted to a multidimensional array.
  – First, we discretize the petal width and length to have categorical values: low, medium, and high.
  – We get the following table; note the count attribute.

Example: Iris data (continued)
• Each unique tuple of petal width, petal length, and species type identifies one element of the array.
• This element is assigned the corresponding count value.
• The figure illustrates the result.
• All non-specified tuples are 0.

Example: Iris data (continued)
• Slices of the multidimensional array are shown by the following cross-tabulations.
• What do these tables tell us?

OLAP Operations: Data Cube
• The key operation of OLAP is the formation of a data cube.
• A data cube is a multidimensional representation of data, together with all possible aggregates.
• By all possible aggregates, we mean the aggregates that result from selecting a proper subset of the dimensions and summing over all remaining dimensions.
• For example, if we choose the species type dimension of the Iris data and sum over all other dimensions, the result will be a one-dimensional array with three entries, each of which gives the number of flowers of that type.

Data Cube Example
• Consider a data set that records the sales of products at a number of company stores at various dates.
• This data can be represented as a 3-dimensional array.
• There are 3 two-dimensional aggregates (3 choose 2), 3 one-dimensional aggregates, and 1 zero-dimensional aggregate (the overall total).

Data Cube Example (continued)
• The following table shows one of the two-dimensional aggregates, along with two of the one-dimensional aggregates, and the overall total.

OLAP Operations: Slicing and Dicing
• Slicing is selecting a group of cells from the entire multidimensional array by specifying a specific value for one or more dimensions.
• Dicing involves selecting a subset of cells by specifying a range of attribute values.
  – This is equivalent to defining a subarray from the complete array.
• In practice, both operations can also be accompanied by aggregation over some dimensions.

OLAP Operations: Roll-up and Drill-down
• Attribute values often have a hierarchical structure.
  – Each date is associated with a year, month, and week.
  – A location is associated with a continent, country, state (province, etc.), and city.
  – Products can be divided into various categories, such as clothing, electronics, and furniture.
• Note that these categories often nest and form a tree or lattice.
  – A year contains months, which contain days.
  – A country contains states, which contain cities.

OLAP Operations: Roll-up and Drill-down (continued)
• This hierarchical structure gives rise to the roll-up and drill-down operations.
  – For sales data, we can aggregate (roll up) the sales across all the dates in a month.
  – Conversely, given a view of the data where the time dimension is broken into months, we could split the monthly sales totals (drill down) into daily sales totals.
  – Likewise, we can drill down or roll up on the location or product ID attributes.
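The cube, slicing, and roll-up operations described above can be sketched on a small sparse array. This is a minimal illustration, not an efficient cube algorithm: the helper names (`data_cube`, `slice_cells`, `roll_up`), the `"*"` marker for a summed-out dimension, and the sales figures are all assumptions made for this example.

```python
from collections import defaultdict
from itertools import combinations

ALL = "*"  # marker for a dimension that has been summed out

def data_cube(cells, n_dims):
    """Return {kept dimension indices: {cell key: total}} for every subset of dimensions."""
    cube = {}
    for k in range(n_dims + 1):
        for kept in combinations(range(n_dims), k):
            agg = defaultdict(int)
            for key, value in cells.items():
                rolled = tuple(key[i] if i in kept else ALL for i in range(n_dims))
                agg[rolled] += value
            cube[kept] = dict(agg)
    return cube

def slice_cells(cells, dim, value):
    """Slicing: keep only the cells with a specific value on one dimension."""
    return {k: v for k, v in cells.items() if k[dim] == value}

def roll_up(cells, dim, parent_of):
    """Roll-up: replace each value on one dimension by its parent and re-aggregate."""
    agg = defaultdict(int)
    for key, value in cells.items():
        new_key = key[:dim] + (parent_of(key[dim]),) + key[dim + 1:]
        agg[new_key] += value
    return dict(agg)

# Hypothetical (product, store, date) -> units sold, invented for illustration.
sales = {
    ("pen", "north", "2010-01-05"): 3,
    ("pen", "north", "2010-01-20"): 4,
    ("pen", "south", "2010-02-01"): 5,
    ("ink", "south", "2010-02-01"): 2,
}

cube = data_cube(sales, 3)                        # base array plus all aggregates
south = slice_cells(sales, 1, "south")            # slice on store = "south"
monthly = roll_up(sales, 2, lambda d: d[:7])      # roll dates up to months
```

For three dimensions, `cube` holds eight groupings: the base array itself, the 3 two-dimensional aggregates, the 3 one-dimensional aggregates, and the overall total under the empty key `()`. Drill-down is the reverse of `roll_up` and requires going back to the finer-grained cells, which is why they are kept.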
Efficient Computation of Data Cubes: Multi-Way Array Aggregation
• Array-based “bottom-up” algorithm
• Using multi-dimensional chunks
• No direct tuple comparisons
• Simultaneous aggregation on multiple dimensions
• Intermediate aggregate values are reused for computing ancestor cuboids
• Cannot do Apriori pruning: no iceberg optimization
[Figure: cuboid lattice for dimensions A, B, C: all; A, B, C; AB, AC, BC; ABC]

Multi-way Array Aggregation for Cube Computation (MOLAP)
• Partition arrays into chunks (a chunk is a small subcube that fits in memory).
• Compressed sparse array addressing: (chunk_id, offset)
• Compute aggregates in “multiway” fashion by visiting cube cells in the order that minimizes the number of times each cell is visited, and that reduces memory access and storage cost.
• What is the best traversal order for multi-way aggregation?
[Figure: 3-D array with dimensions A (a0 to a3), B (b0 to b3), and C (c0 to c3), partitioned into 64 chunks numbered 1 to 64]

Multi-Way Array Aggregation for Cube Computation (Cont.)
• Method: the planes should be sorted and computed according to their size in ascending order.
  – Idea: keep the smallest plane in main memory; fetch and compute only one chunk at a time for the largest plane.
• Limitation of the method: it performs well only for a small number of dimensions.
  – If there are a large number of dimensions, “top-down” computation and iceberg cube computation methods can be explored.

H-Cubing: Using H-Tree Structure
• Bottom-up computation
• Exploring an H-tree structure
• If the current computation of an H-tree cannot pass min_sup, do not proceed further (pruning)
• No simultaneous aggregation
[Figure: cuboid lattice for dimensions A, B, C, D: all; A, B, C, D; AB, AC, AD, BC, BD, CD; ABC, ABD, ACD, BCD; ABCD]

H-tree: A Prefix Hyper-tree
Base table:
Month  City  Cust_grp  Prod     Cost  Price
Jan    Tor   Edu       Printer  500   485
Jan    Tor   Hhd       TV       800   1200
Jan    Tor   Edu       Camera   1160  1280
Feb    Mon   Bus       Laptop   1500  2500
Mar    Van   Edu       HD       540   520
…      …     …         …        …     …

Header table (each entry also carries a side-link into the tree):
Attr. Val.  Quant-Info
Edu         Sum: 2285 …
Hhd         …
Bus         …
…           …
Jan         …
Feb         …
…           …
Tor         …
Van         …
Mon         …
…           …
[Figure: H-tree built from the base table. The root has children edu, hhd, and bus; month nodes (Jan, Mar, Feb) sit below them, and city nodes (Tor, Van, Mon) are the leaves, each holding quant-info (Q.I.). The leaf edu, Jan, Tor holds Sum: 1765, Cnt: 2, and bins.]

Computing Cells Involving “City”
• From (*, *, Tor) to (*, Jan, Tor)
[Figure: the H-tree and its header table, together with a local header table H_Tor (entries Edu, Hhd, Bus, …, Jan, Feb, …, each with quant-info) reached through the side-links of Tor]

Computing Cells Involving Month But No City
1. Roll up quant-info
2. Compute cells involving month but no city
• Top-k OK mark: if the Q.I. in a child passes the top-k avg threshold, so do its parents. No binning is needed!
[Figure: the H-tree and header table after quant-info has been rolled up from the city leaves to the month nodes]

Computing Cells Involving Only Cust_grp
• Check the header table directly
[Figure: the H-tree with its header table (Attr. Val., Quant-Info, Side-link); the Edu entry shows Sum: 2285]

Discovery-Driven Exploration of Data Cubes
• Hypothesis-driven
  – Exploration by the user; huge search space
• Discovery-driven
  – Effective navigation of large OLAP data cubes
  – Pre-compute measures indicating exceptions; guide the user in the data analysis at all levels of aggregation
  – Exception: a value significantly different from the value anticipated, based on a statistical model
  – Visual cues such as background color are used to reflect the degree of exception of each cell

Kinds of Exceptions and their Computation
• Parameters
  – SelfExp: surprise of a cell relative to other cells at the same level of aggregation
  – InExp: surprise beneath the cell
  – PathExp: surprise beneath the cell for each drill-down path
• Computation of exception indicators (model fitting and computing SelfExp, InExp, and PathExp values) can be overlapped with cube construction
• Exceptions themselves can be stored, indexed, and retrieved like precomputed aggregates

Examples: Discovery-Driven Data Cubes

Complex Aggregation at Multiple Granularities: Multi-Feature Cubes
• Multi-feature cubes: compute complex queries involving multiple dependent aggregates at multiple granularities
• Ex.
Grouping by all subsets of {item, region, month}, find the maximum price in 1997 for each group, and the total sales among all maximum price tuples:
    SELECT item, region, month, max(price), sum(R.sales)
    FROM purchases
    WHERE year = 1997
    CUBE BY item, region, month: R
    SUCH THAT R.price = max(price)
• Continuing the last example: among the max price tuples, find the min and max shelf life, and find the fraction of the total sales due to tuples that have the min shelf life within the set of all max price tuples.

Sampling Cube

Statistical Surveys and OLAP
• Statistical survey: a popular tool for collecting information about a population based on a sample
  – Ex.: TV ratings, US Census, election polls
• A common tool in politics, health, market research, science, and many more
• An efficient way of collecting information (data collection is expensive)
• Many statistical tools are available to determine validity
  – Confidence intervals
  – Hypothesis tests
• OLAP (multidimensional analysis) on survey data is highly desirable, but can it be done well?

Surveys: Sample vs. Whole Population
• Data is only a sample of the population
[Table: Age (18, 19, 20, …) by Education (High-school, College, Graduate)]

Problems for Drilling in Multidim. Space
• Data is only a sample of the population, but samples can be small when drilling into certain multidimensional spaces
[Table: Age (18, 19, 20, …) by Education (High-school, College, Graduate)]

OLAP on Survey (i.e., Sampling) Data
• The semantics of the query are unchanged
• The input data has changed
[Table: Age (18, 19, 20, …) by Education (High-school, College, Graduate)]

OLAP with Sampled Data
• Where is the missing link?
  – OLAP runs over sample data, but the analysis target is still the population
• Idea: integrate sampling and statistical knowledge with traditional OLAP tools

Input Data   Analysis Target   Analysis Tool
Population   Population        Traditional OLAP
Sample       Population        Not available

Challenges for OLAP on Sampling Data
• Computing confidence intervals in the OLAP context
• No data?
  – Not exactly.
  – No data in subspaces of the cube
  – Sparse data
  – Causes include sampling bias and query selection bias
• Curse of dimensionality
  – Survey data can be high dimensional
  – Over 600 dimensions in a real-world example
  – Impossible to fully materialize

Example 1: Confidence Interval
• What is the average income of 19-year-old high-school students?
• Return not only the query result but also a confidence interval
[Table: Age (18, 19, 20, …) by Education (High-school, College, Graduate)]

Confidence Interval
• Confidence interval at \(\bar{x}\): \(\bar{x} \pm t_c \hat{\sigma}_{\bar{x}}\)
  – \(x\) is a sample of the data set; \(\bar{x}\) is the mean of the sample
  – \(t_c\) is the critical t-value, calculated by a look-up
  – \(\hat{\sigma}_{\bar{x}}\) is the estimated standard error of the mean
• Example: $50,000 ± $3,000 with 95% confidence
  – Treat the points in a cube cell as samples
  – Compute the confidence interval as for a traditional sample set
• Return the answer in the form of a confidence interval
  – Indicates the quality of the query answer
  – The user selects the desired confidence level

Efficient Computation of Confidence Interval Measures
• Efficient computation in all cells of the data cube
  – Both the mean and the confidence interval are algebraic
  – Why is the confidence interval measure algebraic? \(\hat{\sigma}_{\bar{x}} = s / \sqrt{l}\) is algebraic, where both \(s\) (the sample standard deviation) and \(l\) (the count) are algebraic
• Thus one can calculate cells efficiently at more general cuboids without having to start at the base cuboid each time

Example 2: Query Expansion
• What is the average income of 19-year-old college students?
[Table: Age (18, 19, 20, …) by Education (High-school, College, Graduate)]

Boosting Confidence by Query Expansion
• From the example: the queried cell “19-year-old college students” contains only 2 samples
• The confidence interval is large (i.e., low confidence). Why?
  – Small sample size
  – High standard deviation within the samples
• Small sample sizes can occur at relatively low-dimensional selections
  – Collect more data? Expensive!
  – Use data in other cells? Maybe, but one has to be careful

Intra-Cuboid Expansion: Choice 1
• Expand the query to include 18- and 20-year-olds?
[Table: Age (18, 19, 20, …) by Education (High-school, College, Graduate)]

Intra-Cuboid Expansion: Choice 2
• Expand the query to include high-school and graduate students?
[Table: Age (18, 19, 20, …) by Education (High-school, College, Graduate)]

Intra-Cuboid Expansion
• If other cells in the same cuboid satisfy both of the following:
  1. Similar semantic meaning
  2. Similar cube value
• Then the other cells’ data can be combined with the queried cell’s own data to “boost” confidence
  – Only use if necessary
  – A bigger sample size narrows the confidence interval

Intra-Cuboid Expansion (2)
• Cell segment similarity
  – Some dimensions are clear: Age
  – Some are fuzzy: Occupation
  – May need domain knowledge
• Cell value similarity
  – How to determine whether two cells’ samples come from the same population?
  – Two-sample t-test (confidence-based)
• Example: [figure]

Inter-Cuboid Expansion
• If a query dimension is
  – Not correlated with the cube value
  – But is causing a small sample size by drilling down too much
• Then remove the dimension (i.e., generalize it to *) and move to a more general cuboid
• A two-sample t-test can be used to determine the similarity between two cells across cuboids
• A different method, shown later, can also be used

Query Expansion

Query Expansion Experiments (2)
• Real-world sample data: 600 dimensions and 750,000 tuples
• 0.05% is used to simulate a “sample” (allows error checking)

Query Expansion Experiments (3)

Chapter Summary
• Efficient algorithms for computing data cubes
  – Multiway array aggregation
  – H-cubing
• Further development of data cube technology
  – Discovery-driven cube
  – Sampling cubes

Continued in Session 05: Mining Frequent Patterns, Association, and Correlations
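As a supplementary sketch for the sampling cube slides, the confidence interval measure can be computed from a small per-cell summary, which is what makes it algebraic and lets query expansion merge cells without rescanning raw tuples. Everything named here is an assumption for illustration: the helpers `summarize`, `merge`, and `confidence_interval`, the summary layout (count, sum, sum of squares), and the income values. The critical t-values are the standard two-sided 95% table entries for 1 to 5 degrees of freedom; a real implementation would use a full table or a statistics library.

```python
import math

# Two-sided 95% critical t-values by degrees of freedom (first rows of the standard table).
T_95 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571}

def summarize(values):
    """Per-cell summary: (count l, sum, sum of squares)."""
    return (len(values), sum(values), sum(v * v for v in values))

def merge(a, b):
    """Combine two cell summaries component-wise, without the raw tuples."""
    return tuple(x + y for x, y in zip(a, b))

def confidence_interval(summary, t_table=T_95):
    """Return (mean, half_width) for x_bar +/- t_c * s / sqrt(l)."""
    l, total, total_sq = summary
    mean = total / l
    variance = (total_sq - l * mean * mean) / (l - 1)  # sample variance s^2
    std_error = math.sqrt(variance / l)                # s / sqrt(l)
    return mean, t_table[l - 1] * std_error            # t_c looked up at l - 1 d.o.f.

# Hypothetical incomes (in $1000s) for two cells of the same cuboid.
cell_a = summarize([48.0, 52.0])          # the sparsely populated queried cell
cell_b = summarize([47.0, 50.0, 53.0])    # a neighbouring cell judged similar

mean_a, half_a = confidence_interval(cell_a)
mean_ab, half_ab = confidence_interval(merge(cell_a, cell_b))
```

With only two samples the interval around `mean_a` is very wide; after merging the neighbouring cell the half-width `half_ab` shrinks, which is the effect query expansion relies on. The sum-of-squares form of the variance is chosen here for brevity and can lose precision on large values.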