Document 15062997

advertisement
Matakuliah : M0614 / Data Mining & OLAP
Tahun
: Feb - 2010
Data Cube Computation and Data
Generalization
Pertemuan 04
Learning Outcomes
Pada akhir pertemuan ini, diharapkan mahasiswa
akan mampu :
• Mahasiswa dapat mendesain data cubes, computation of
data cubes, dan data generalization. (C5)
• Mahasiswa dapat menunjukkan cara eksplorasi lebih
lanjut terhadap multidimensional database. (C3)
3
Bina Nusantara
Acknowledgments
These slides have been adapted from Han, J.,
Kamber, M., & Pei, Y. Data Mining: Concepts and
Technique and Tan, P.-N., Steinbach, M., & Kumar, V.
Introduction to Data Mining.
Bina Nusantara
Outline Materi
• Efficient computation of data cubes
• Exploration and discovery in multidimensional
databases
• Discovery-driven exploration of data cubes
• Sampling cube
• Summary
5
Bina Nusantara
OLAP
•
•
•
On-Line Analytical Processing (OLAP) was proposed by E. F. Codd, the
father of the relational database.
Relational databases put data into tables, while OLAP uses a
multidimensional array representation.
– Such representations of data previously existed in statistics and other
fields
There are a number of data analysis and data exploration operations that
are easier with such a data representation.
Creating a Multidimensional Array
• Two key steps in converting tabular data into a multidimensional array.
– First, identify which attributes are to be the dimensions and which
attribute is to be the target attribute whose values appear as entries in
the multidimensional array.
• The attributes used as dimensions must have discrete values
• The target value is typically a count or continuous value, e.g., the
cost of an item
• Can have no target variable at all except the count of objects that
have the same set of attribute values
– Second, find the value of each entry in the multidimensional array by
summing the values (of the target attribute) or count of all objects that
have the attribute values corresponding to that entry.
Iris Sample Data Set
• Many of the exploratory data techniques are illustrated with the Iris
Plant data set.
– From the statistician Douglas Fisher
– Three flower types (classes):
• Setosa
• Virginica
• Versicolour
– Four (non-class) attributes
• Sepal width and length
• Petal width and length
Virginica. Robert H. Mohlenbrock. USDA
NRCS. 1995. Northeast wetland flora: Field
office guide to plant species. Northeast National
Technical Center, Chester, PA. Courtesy of
USDA NRCS Wetland Science Institute.
Example: Iris data
• We show how the attributes, petal length, petal width, and species
type can be converted to a multidimensional array
– First, we discretized the petal width and length to have categorical
values: low, medium, and high
– We get the following table - note the count attribute
Example: Iris data (continued)
• Each unique tuple of petal width, petal length, and species type
identifies one element of the array.
• This element is assigned the corresponding count value.
• The figure illustrates
the result.
• All non-specified
tuples are 0.
Example: Iris data (continued)
• Slices of the multidimensional array are shown by the following crosstabulations
• What do these tables tell us?
OLAP Operations: Data Cube
• The key operation of a OLAP is the formation of a data cube
• A data cube is a multidimensional representation of data, together with
all possible aggregates.
• By all possible aggregates, we mean the aggregates that result by
selecting a proper subset of the dimensions and summing over all
remaining dimensions.
• For example, if we choose the species type dimension of the Iris data
and sum over all other dimensions, the result will be a onedimensional entry with three entries, each of which gives the number
of flowers of each type.
Data Cube Example
• Consider a data set that records the sales of products at a number of
company stores at various dates.
• This data can be represented
as a 3 dimensional array
• There are 3 two-dimensional
aggregates (3 choose 2 ),
3 one-dimensional aggregates,
and 1 zero-dimensional
aggregate (the overall total)
Data Cube Example (continued)
• The following figure table shows one of the two dimensional
aggregates, along with two of the one-dimensional aggregates, and
the overall total
OLAP Operations: Slicing and Dicing
• Slicing is selecting a group of cells from the entire multidimensional
array by specifying a specific value for one or more dimensions.
• Dicing involves selecting a subset of cells by specifying a range of
attribute values.
– This is equivalent to defining a subarray from the complete array.
• In practice, both operations can also be accompanied by aggregation
over some dimensions.
OLAP Operations: Roll-up and Drill-down
• Attribute values often have a hierarchical structure.
– Each date is associated with a year, month, and week.
– A location is associated with a continent, country, state (province,
etc.), and city.
– Products can be divided into various categories, such as clothing,
electronics, and furniture.
• Note that these categories often nest and form a tree or lattice
– A year contains months which contains day
– A country contains a state which contains a city
OLAP Operations: Roll-up and Drill-down
• This hierarchical structure gives rise to the roll-up and drill-down
operations.
– For sales data, we can aggregate (roll up) the sales across all the
dates in a month.
– Conversely, given a view of the data where the time dimension is
broken into months, we could split the monthly sales totals (drill
down) into daily sales totals.
– Likewise, we can drill down or roll up on the location or product ID
attributes.
Efficient Computation of Data Cubes:
Multi-Way Array Aggregation
• Array-based “bottom-up” algorithm
all
• Using multi-dimensional chunks
• No direct tuple comparisons
A
B
C
• Simultaneous aggregation on multiple
dimensions
• Intermediate aggregate values are reused for computing ancestor cuboids
• Cannot do Apriori pruning: No iceberg
optimization
AB
AC
ABC
BC
Multi-way Array Aggregation for Cube Computation
(MOLAP)
•
Partition arrays into chunks (a small subcube which fits in memory).
•
Compressed sparse array addressing: (chunk_id, offset)
•
B
Compute aggregates in “multiway” by visiting cube cells in the order which
minimizes the # of times to visit each cell, and reduces memory access and
storage cost.
62
63
64
C c2 c3 61
45
46
47
48
c1 29
30
31
32
What is the best
c0
60
traversing order
14
15
16
b3 B13
44
to do multi-way
28 56
b2 9
40
aggregation?
24 52
b1 5
36
20
1
2
3
4
b0
a0
a1
A
a2
a3
Multi-way Array Aggregation for Cube Computation
C
c3 61
62
63
64
c2 45
46
47
48
c1 29
30
31
32
c0
B
b3
B13
b2
9
14
15
60
16
44
28
24
b1
5
b0
1
2
3
4
a0
a1
a2
a3
56
40
36
A
20
52
Multi-way Array Aggregation for Cube Computation
C
c3 61
62
63
64
c2 45
46
47
48
c1 29
30
31
32
c0
B
b3
B13
b2
9
14
15
60
16
44
28
24
b1
5
b0
1
2
3
4
a0
a1
a2
a3
56
40
36
A
20
52
Multi-Way Array Aggregation for Cube
Computation (Cont.)
• Method: the planes should be sorted and computed according to
their size in ascending order
– Idea: keep the smallest plane in the main memory, fetch and
compute only one chunk at a time for the largest plane
• Limitation of the method: computing well only for a small number of
dimensions
– If there are a large number of dimensions, “top-down”
computation and iceberg cube computation methods can be
explored
H-Cubing: Using H-Tree Structure
all
• Bottom-up computation
• Exploring an H-tree structure
• If the current computation of
an H-tree cannot pass
min_sup, do not proceed
further (pruning)
• No simultaneous aggregation
A
AB
ABC
AC
ABD
ABCD
B
AD
ACD
C
BC
D
BD
BCD
CD
H-tree: A Prefix Hyper-tree
Header
table
Attr. Val.
Edu
Hhd
Bus
…
Jan
Feb
…
Tor
Van
Mon
…
Quant-Info
Sum:2285 …
…
…
…
…
…
…
…
…
…
…
Side-link
root
bus
hhd
edu
Jan
Mar
Tor
Van
Tor
Mon
Q.I.
Q.I.
Q.I.
Month
City
Cust_grp
Prod
Cost
Price
Jan
Tor
Edu
Printer
500
485
Jan
Tor
Hhd
TV
800
1200
Jan
Tor
Edu
Camera
1160
1280
Feb
Mon
Bus
Laptop
1500
2500
Sum: 1765
Cnt: 2
Mar
Van
Edu
HD
540
520
bins
…
…
…
…
…
…
Quant-Info
Jan
Feb
Computing Cells Involving “City”
Header
Table
HTor
Attr. Val.
Edu
Hhd
Bus
…
Jan
Feb
…
Tor
Van
Mon
…
Attr. Val.
Edu
Hhd
Bus
…
Jan
Feb
…
Quant-Info
Sum:2285 …
…
…
…
…
…
…
…
…
…
…
Q.I.
…
…
…
…
…
…
…
Side-link
From (*, *, Tor) to (*, Jan, Tor)
root
Hhd.
Edu.
Jan.
Side-link
Tor.
Quant-Info
Sum: 1765
Cnt: 2
bins
Mar.
Jan.
Bus.
Feb.
Van.
Tor.
Mon.
Q.I.
Q.I.
Q.I.
Computing Cells Involving Month But No City
1. Roll up quant-info
2. Compute cells involving
month but no city
Attr. Val.
Quant-Info
Edu.
Sum:2285 …
Hhd.
…
Bus.
…
…
…
Jan.
…
Feb.
…
Mar.
…
…
…
Tor.
…
Van.
…
Mont.
…
…
…
root
Hhd.
Edu.
Bus.
Side-link
Jan.
Q.I.
Tor.
Mar.
Jan.
Q.I.
Q.I.
Van.
Tor.
Feb.
Q.I.
Mont.
Top-k OK mark: if Q.I. in a child passes
top-k avg threshold, so does its parents.
No binning is needed!
Computing Cells Involving Only Cust_grp
root
Check header table directly
Attr. Val.
Edu
Hhd
Bus
…
Jan
Feb
Mar
…
Tor
Van
Mon
…
Quant-Info
Sum:2285 …
…
…
…
…
…
…
…
…
…
…
…
hhd
edu
Side-link
Tor
bus
Jan
Mar
Jan
Feb
Q.I.
Q.I.
Q.I.
Q.I.
Van
Tor
Mon
Discovery-Driven Exploration of Data Cubes
• Hypothesis-driven
– exploration by user, huge search space
• Discovery-driven
– Effective navigation of large OLAP data cubes
– pre-compute measures indicating exceptions, guide user in the
data analysis, at all levels of aggregation
– Exception: significantly different from the value anticipated,
based on a statistical model
– Visual cues such as background color are used to reflect the
degree of exception of each cell
Kinds of Exceptions and their Computation
• Parameters
– SelfExp: surprise of cell relative to other cells at same level of
aggregation
– InExp: surprise beneath the cell
– PathExp: surprise beneath cell for each drill-down path
• Computation of exception indicator (modeling fitting and computing
SelfExp, InExp, and PathExp values) can be overlapped with cube
construction
• Exception themselves can be stored, indexed and retrieved like
precomputed aggregates
Examples: Discovery-Driven Data Cubes
Complex Aggregation at Multiple Granularities:
Multi-Feature Cubes
•
Multi-feature cubes: Compute complex queries involving multiple dependent
aggregates at multiple granularities
•
Ex. Grouping by all subsets of {item, region, month}, find the maximum price
in 1997 for each group, and the total sales among all maximum price tuples
SELECT item, region, month, max(price), sum(R.sales)
FROM purchases
WHERE year = 1997
CUBE BY item, region, month: R
SUCH THAT R.price = max(price)
•
Continuing the last example, among the max price tuples, find the min and
max shelf live, and find the fraction of the total sales due to tuple that have
min shelf life within the set of all max price tuples
Sampling Cube
Statistical Surveys and OLAP
 Statistical survey: A popular tool to collect information about a
population based on a sample
Ex.: TV ratings, US Census, election polls
 A common tool in politics, health, market research, science, and many
more
 An efficient way of collecting information (Data collection is expensive)
 Many statistical tools available, to determine validity
Confidence intervals
Hypothesis tests
 OLAP (multidimensional analysis) on survey data
highly desirable but can it be done well?
Surveys: Sample vs. Whole Population
Data is only a sample of population
Age\Education
18
19
20
…
High-school
College
Graduate
Problems for Drilling in Multidim. Space
Data is only a sample of population but samples could be small
when drilling to certain multidimensional space
Age\Education
18
19
20
…
High-school
College
Graduate
OLAP on Survey (i.e., Sampling) Data


Semantics of query is unchanged
Input data has changed
Age/Education
18
19
20
…
High-school
College
Graduate
OLAP with Sampled Data
• Where is the missing link?
– OLAP over sampling data but our analysis target would still like to be on
population
• Idea: Integrate sampling and statistical knowledge with traditional
OLAP tools
Input Data
Analysis Target
Analysis Tool
Population
Population
Traditional OLAP
Sample
Population
Not Available
Challenges for OLAP on Sampling Data



Computing confidence intervals in OLAP context
No data?
- Not exactly. No data in subspaces in cube
- Sparse data
- Causes include sampling bias and query selection bias
Curse of dimensionality
- Survey data can be high dimensional
- Over 600 dimensions in real world example
- Impossible to fully materialize
Example 1: Confidence Interval
What is the average income of 19-year-old high-school students?
Return not only query result but also confidence interval
Age/Education
18
19
20
…
High-school
College
Graduate
Confidence Interval
 Confidence interval at
:
 x is a sample of data set;
is the mean of sample
 tc is the critical t-value, calculated by a look-up
•
is the estimated standard error of the mean
 Example: $50,000 ± $3,000 with 95% confidence
 Treat points in cube cell as samples
 Compute confidence interval as traditional sample set
 Return answer in the form of confidence interval
 Indicates quality of query answer
 User selects desired confidence interval
Efficient Computing Confidence Interval Measures
 Efficient computation in all cells in data cube
 Both mean and confidence interval are algebraic
 Why confidence interval measure is algebraic?
is algebraic
where both s and l (count) are algebraic
 Thus one can calculate cells efficiently at more general cuboids
without having to start at the base cuboid each time
Example 2: Query Expansion
What is the average income of 19-year-old college students?
Age/Education
18
19
20
…
High-school
College
Graduate
Boosting Confidence by Query Expansion



From the example: The queried cell “19-year-old college students”
contains only 2 samples
Confidence interval is large (i.e., low confidence). why?
- Small sample size
- High standard deviation with samples
Small sample sizes can occur at relatively low dimensional selections
- Collect more data?― expensive!
- Use data in other cells? Maybe, but have to be careful
Intra-Cuboid Expansion: Choice 1
Expand query to include 18 and 20 year olds?
Age/Education
18
19
20
…
High-school
College
Graduate
Intra-Cuboid Expansion: Choice 2
Expand query to include high-school and graduate students?
Age/Education
18
19
20
…
High-school
College
Graduate
Intra-Cuboid Expansion

If other cells in the same cuboid satisfy both the following
1.Similar semantic meaning
2.Similar cube value

Then can combine other cells’ data into own to “boost” confidence
- Only use if necessary
- Bigger sample size will decrease confidence interval
Intra-Cuboid Expansion (2)
•
•
•
Cell segment similarity
– Some dimensions are clear: Age
– Some are fuzzy: Occupation
– May need domain knowledge
Cell value similarity
– How to determine if two cells’ samples come from the same population?
– Two-sample t-test (confidence-based)
Example:
Inter-Cuboid Expansion

If a query dimension is
-
Not correlated with cube value
-
But is causing small sample size by drilling down too much

Remove dimension (i.e., generalize to *) and move to a more
general cuboid

Can use two-sample t-test to determine similarity between two cells
across cuboids

Can also use a different method to be shown later
Query Expansion
Query Expansion Experiments (2)
 Real world sample data: 600 dimensions and 750,000 tuples
 0.05% to simulate “sample” (allows error checking)
Query Expansion Experiments (3)
Chapter Summary
• Efficient algorithms for computing data cubes
– Multiway array aggregation
– H-cubing
• Further development of data cube technology
– Discovery-drive cube
– Sampling cubes
Dilanjutkan ke pert. 05
Mining Frequent Patterns, Association, and
Correlations
Bina Nusantara
Download