Data Mining: Concepts and Techniques

Data Mining:
Concepts and Techniques
These slides have been adapted from Han, J.,
Kamber, M., & Pei, J., Data Mining:
Concepts and Techniques.
6/28/2016
Data Mining: Concepts and Techniques
1
Chapter 4: Data Cube Technology

- Efficient Computation of Data Cubes
- Exploration and Discovery in Multidimensional Databases
- Summary
Efficient Computation of Data Cubes

- General heuristics
- Multi-way array aggregation
- BUC
- H-cubing
- Star-Cubing
- High-Dimensional OLAP
Cube: A Lattice of Cuboids
[Figure: the lattice of cuboids for dimensions time, item, location, supplier:
0-D (apex) cuboid: all
1-D cuboids: time; item; location; supplier
2-D cuboids: time,item; time,location; time,supplier; item,location; item,supplier; location,supplier
3-D cuboids: time,item,location; time,item,supplier; time,location,supplier; item,location,supplier
4-D (base) cuboid: time, item, location, supplier]
Preliminary Tricks

- Sorting, hashing, and grouping operations are applied to the dimension attributes in order to reorder and cluster related tuples
- Aggregates may be computed from previously computed aggregates, rather than from the base fact table
  - Smallest-child: computing a cuboid from the smallest previously computed cuboid
  - Cache-results: caching results of a cuboid from which other cuboids are computed, to reduce disk I/Os
  - Amortize-scans: computing as many cuboids as possible at the same time to amortize disk reads
  - Share-sorts: sharing sorting costs across multiple cuboids when a sort-based method is used
  - Share-partitions: sharing the partitioning cost across multiple cuboids when hash-based algorithms are used
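As an illustration of computing aggregates from previously computed aggregates (the smallest-child trick), a minimal Python sketch; the dimension values and measures are made up:

```python
from collections import defaultdict

# Base fact table: (time, item, location) -> sales (toy data)
base = {("Q1", "tv", "tor"): 10, ("Q1", "tv", "van"): 5,
        ("Q2", "tv", "tor"): 7, ("Q1", "dvd", "tor"): 3}

def aggregate(cells, keep):
    """Aggregate a cuboid down to the dimensions listed in `keep`
    (indices into the key tuple), summing the measure."""
    out = defaultdict(int)
    for key, measure in cells.items():
        out[tuple(key[i] for i in keep)] += measure
    return dict(out)

# The (time, item) cuboid is computed once from the base table ...
time_item = aggregate(base, keep=[0, 1])
# ... and the (time) cuboid from its small, previously computed
# parent (time, item) instead of rescanning the base table.
time_only = aggregate(time_item, keep=[0])
```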
Efficient Computation of Data Cubes

- General heuristics
- Multi-way array aggregation
- BUC
- H-cubing
- Star-Cubing
- High-Dimensional OLAP
Multi-Way Array Aggregation

- Array-based "bottom-up" algorithm
- Uses multi-dimensional chunks
- No direct tuple comparisons
- Simultaneous aggregation on multiple dimensions
- Intermediate aggregate values are re-used for computing ancestor cuboids
- Cannot do Apriori pruning: no iceberg optimization

[Figure: lattice all; A, B, C; AB, AC, BC; ABC]
Multi-way Array Aggregation for Cube Computation (MOLAP)

- Partition arrays into chunks (a small subcube that fits in memory)
- Compressed sparse array addressing: (chunk_id, offset)
- Compute aggregates in "multiway" by visiting cube cells in an order that minimizes the number of times each cell is visited, reducing memory access and storage cost

[Figure: a 3-D array over A (a0–a3), B (b0–b3), C (c0–c3), partitioned into 64 chunks numbered 1–64]

What is the best traversal order to do multi-way aggregation?
Multi-way Array Aggregation for Cube
Computation
[Figure: the same 64-chunk array, highlighting the order in which chunks 1, 2, 3, 4, … are scanned for multi-way aggregation]
Multi-Way Array Aggregation for Cube
Computation (Cont.)

- Method: the planes should be sorted and computed according to their size in ascending order
  - Idea: keep the smallest plane in main memory; fetch and compute only one chunk at a time for the largest plane
- Limitation of the method: it computes well only for a small number of dimensions
  - If there are a large number of dimensions, "top-down" computation and iceberg cube computation methods can be explored
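A sketch of the simultaneous ("multiway") aggregation idea, using an in-memory toy array; the array and chunk sizes are illustrative, not from the slides:

```python
import itertools
import numpy as np

# Toy 3-D array over dimensions A, B, C (sizes are illustrative).
rng = np.random.default_rng(0)
cube = rng.integers(0, 10, size=(4, 8, 16))  # |A|=4, |B|=8, |C|=16

# The three 2-D planes, aggregated simultaneously in a single scan.
ab = np.zeros((4, 8), dtype=int)
ac = np.zeros((4, 16), dtype=int)
bc = np.zeros((8, 16), dtype=int)

# Each dimension is split into 2 chunks; only one chunk is
# memory-resident at a time, yet all three planes accumulate.
for a0, b0, c0 in itertools.product(range(0, 4, 2),
                                    range(0, 8, 4),
                                    range(0, 16, 8)):
    blk = cube[a0:a0+2, b0:b0+4, c0:c0+8]    # one chunk
    ab[a0:a0+2, b0:b0+4] += blk.sum(axis=2)  # aggregate C away
    ac[a0:a0+2, c0:c0+8] += blk.sum(axis=1)  # aggregate B away
    bc[b0:b0+4, c0:c0+8] += blk.sum(axis=0)  # aggregate A away
```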
Efficient Computation of Data Cubes

- General heuristics
- Multi-way array aggregation
- BUC
- H-cubing
- Star-Cubing
- High-Dimensional OLAP
Bottom-Up Computation (BUC)

- Bottom-up cube computation (note: top-down in our view!)
- Divides dimensions into partitions and facilitates iceberg pruning
  - If a partition does not satisfy min_sup, its descendants can be pruned
  - If minsup = 1, compute the full CUBE!
- No simultaneous aggregation

[Figure: BUC processing order over the cuboid lattice: 1 all; 2 A; 3 AB; 4 ABC; 5 ABCD; 6 ABD; 7 AC; 8 ACD; 9 AD; 10 B; 11 BC; 12 BCD; 13 BD; 14 C; 15 CD; 16 D]
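A simplified sketch of BUC's recursive partition-and-prune loop (count measure only; the data and the (dimension, value) cell encoding are made up for illustration):

```python
def buc(tuples, dims, min_sup, prefix=(), out=None):
    """Simplified BUC: recursively partition `tuples` on each remaining
    dimension index in `dims`; partitions below min_sup are pruned,
    so none of their descendant group-bys is ever computed."""
    if out is None:
        out = {}
    out[prefix] = len(tuples)          # output the current group-by cell
    for i, d in enumerate(dims):
        parts = {}                     # partition on dimension d
        for t in tuples:
            parts.setdefault(t[d], []).append(t)
        for val, part in parts.items():
            if len(part) >= min_sup:   # iceberg (Apriori-style) pruning
                buc(part, dims[i+1:], min_sup, prefix + ((d, val),), out)
    return out

rows = [("a1", "b1", "c1"), ("a1", "b1", "c2"),
        ("a1", "b2", "c3"), ("a2", "b1", "c1")]
cells = buc(rows, dims=(0, 1, 2), min_sup=2)
```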
BUC: Partitioning

- Usually, the entire data set can't fit in main memory
- Sort distinct values, partition into blocks that fit
- Continue processing
- Optimizations
  - Partitioning: external sorting, hashing, counting sort
  - Ordering dimensions to encourage pruning: cardinality, skew, correlation
  - Collapsing duplicates: can't do holistic aggregates anymore!
Efficient Computation of Data Cubes

- General heuristics
- Multi-way array aggregation
- BUC
- H-cubing
- Star-Cubing
- High-Dimensional OLAP
H-Cubing: Using H-Tree Structure

- Bottom-up computation
- Explores an H-tree structure
- If the current computation of an H-tree cannot pass min_sup, do not proceed further (pruning)
- No simultaneous aggregation

[Figure: cuboid lattice from all down through A, B, C, D; AB, AC, AD, BC, BD, CD; ABC, ABD, ACD, BCD; to ABCD]
H-tree: A Prefix Hyper-tree

A header table lists each attribute value (Edu, Hhd, Bus, …; Jan, Feb, …; Tor, Van, Mon, …) with its quant-info (e.g., Sum: 2285) and a side-link into the tree.

Base table:

Month  City  Cust_grp  Prod     Cost  Price
Jan    Tor   Edu       Printer   500    485
Jan    Tor   Hhd       TV        800   1200
Jan    Tor   Edu       Camera   1160   1280
Feb    Mon   Bus       Laptop   1500   2500
Mar    Van   Edu       HD        540    520

[Figure: H-tree rooted at root, with one path per distinct (month, city, cust_grp) prefix; leaf nodes carry quant-info (e.g., Sum: 1765, Cnt: 2 for the Jan–Tor–Edu node) and bins]
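A minimal sketch of building an H-tree with a side-link header table, using (month, city, cust_grp) tuples like those in the base table above; quant-info is reduced to plain counts here:

```python
class HNode:
    """One H-tree node: an attribute value, a count, and children."""
    def __init__(self, val):
        self.val, self.count, self.children = val, 0, {}

def insert(root, tup, header):
    """Insert one tuple along the attribute order; new nodes are
    chained into the header table as side-links for their value."""
    node = root
    node.count += 1
    for v in tup:
        if v not in node.children:
            node.children[v] = HNode(v)
            header.setdefault(v, []).append(node.children[v])  # side-link
        node = node.children[v]
        node.count += 1
    return root

root, header = HNode("root"), {}
for t in [("Jan", "Tor", "Edu"), ("Jan", "Tor", "Hhd"),
          ("Jan", "Tor", "Edu"), ("Feb", "Mon", "Bus"),
          ("Mar", "Van", "Edu")]:
    insert(root, t, header)

# Total count for the cell (*, *, Edu) via side-links, as H-Cubing would.
edu_total = sum(n.count for n in header["Edu"])
```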
Computing Cells Involving "City"

From (*, *, Tor) to (*, Jan, Tor)

[Figure: the header table, a local header table H_Tor restricted to the Tor side-link chain, and the H-tree with side-links from the Tor nodes]
Computing Cells Involving Month But No City

1. Roll up quant-info
2. Compute cells involving month but no city

Top-k OK mark: if the Q.I. in a child passes the top-k avg threshold, so does its parent. No binning is needed!

[Figure: the header table and the H-tree after rolling city-level quant-info up to the month nodes]
Computing Cells Involving Only Cust_grp

Check the header table directly.

[Figure: the header table with side-links grouped by cust_grp (edu, hhd, bus) into the H-tree]
Efficient Computation of Data Cubes

- General heuristics
- Multi-way array aggregation
- BUC
- H-cubing
- Star-Cubing
- High-Dimensional OLAP
Star-Cubing: An Integrating Method

- D. Xin, J. Han, X. Li, B. W. Wah, Star-Cubing: Computing Iceberg Cubes by Top-Down and Bottom-Up Integration, VLDB'03
- Explore shared dimensions
  - E.g., dimension A is the shared dimension of ACD and AD
  - ABD/AB means cuboid ABD has shared dimensions AB
- Allows for shared computations
  - E.g., cuboid AB is computed simultaneously as ABD
- Aggregate in a top-down manner, but with a bottom-up sub-layer underneath which allows Apriori pruning
- Shared dimensions grow in bottom-up fashion

[Figure: lattice with shared dimensions: ABCD/all; ABC/ABC, ABD/AB, ACD/A, BCD; AB/AB, AC/AC, AD/A, BC/BC, BD/B, CD; C/C, D]
Iceberg Pruning in Shared Dimensions

- Anti-monotonic property of shared dimensions
  - If the measure is anti-monotonic, and the aggregate value on a shared dimension does not satisfy the iceberg condition, then all the cells extended from this shared dimension cannot satisfy the condition either
- Intuition: if we can compute the shared dimensions before the actual cuboid, we can use them to do Apriori pruning
- Problem: how to prune while still aggregating simultaneously on multiple dimensions?
Cell Trees

- Use a tree structure similar to the H-tree to represent cuboids
- Collapse common prefixes to save memory
- Keep a count at each node
- Traverse the tree to retrieve a particular tuple
Star Attributes and Star Nodes

- Intuition: if a single-dimensional aggregate on an attribute value p does not satisfy the iceberg condition, it is useless to distinguish such values during the iceberg computation
  - E.g., b2, b3, b4, c1, c2, c4, d1, d2, d3
- Solution: replace such attributes by a *. Such attributes are star attributes, and the corresponding nodes in the cell tree are star nodes

A   B   C   D   Count
a1  b1  c1  d1  1
a1  b1  c4  d3  1
a1  b2  c2  d2  1
a2  b3  c3  d4  1
a2  b4  c3  d4  1
Example: Star Reduction

- Suppose minsup = 2
- Perform one-dimensional aggregation. Replace attribute values whose count < 2 with *, and collapse all *'s together
- The resulting table has all such attributes replaced with the star attribute
- With regard to the iceberg computation, this new table is a lossless compression of the original table

After replacement:

A   B   C   D   Count
a1  b1  *   *   1
a1  b1  *   *   1
a1  *   *   *   1
a2  *   c3  d4  1
a2  *   c3  d4  1

After collapsing:

A   B   C   D   Count
a1  b1  *   *   2
a1  *   *   *   1
a2  *   c3  d4  2
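The star-reduction step can be sketched as follows (count measure; the rows are the slide's example table):

```python
from collections import Counter

def star_reduce(rows, min_sup):
    """Replace attribute values whose 1-D count < min_sup with '*',
    then collapse identical rows, summing counts. Lossless with
    respect to the iceberg computation."""
    ncols = len(rows[0])
    counts = [Counter(r[i] for r in rows) for i in range(ncols)]
    collapsed = Counter()
    for r in rows:
        starred = tuple(v if counts[i][v] >= min_sup else "*"
                        for i, v in enumerate(r))
        collapsed[starred] += 1
    return dict(collapsed)

rows = [("a1", "b1", "c1", "d1"), ("a1", "b1", "c4", "d3"),
        ("a1", "b2", "c2", "d2"), ("a2", "b3", "c3", "d4"),
        ("a2", "b4", "c3", "d4")]
table = star_reduce(rows, min_sup=2)
```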
Star Tree

- Given the new compressed table, it is possible to construct the corresponding cell tree—called a star tree
- Keep a star table at the side for easy lookup of star attributes
- The star tree is a lossless compression of the original cell tree

A   B   C   D   Count
a1  b1  *   *   2
a1  *   *   *   1
a2  *   c3  d4  2
Star-Cubing Algorithm—DFS on Lattice Tree

[Figure: DFS over the cuboid lattice (all; A/A, B/B, C/C, D/D; AB/AB, AC/AC, AD/A, BC/BC, BD/B, CD; ABC/ABC, ABD/AB, ACD/A, BCD; ABCD), alongside the star tree: root: 5; a1: 3 (b*: 1, c*: 1, d*: 1; b1: 2, c*: 2, d*: 2), a2: 2 (b*: 2, c3: 2, d4: 2); the BCD tree is built during the DFS]
Multi-Way Aggregation

[Figure: the trees BCD, ACD/A, ABD/AB, and ABC/ABC are created while traversing ABCD]
Star-Cubing Algorithm—DFS on Star-Tree
Multi-Way Star-Tree Aggregation

- Start a depth-first search at the root of the base star tree
- At each new node in the DFS, create the corresponding star trees that are descendants of the current tree, according to the integrated traversal ordering
  - E.g., in the base tree, when the DFS reaches a1, the ACD/A tree is created
  - When the DFS reaches b*, the ABD/AB tree is created
- The counts in the base tree are carried over to the new trees
Multi-Way Aggregation (2)

- When the DFS reaches a leaf node (e.g., d*), start backtracking
- On every backtracking branch, the counts in the corresponding trees are output, the tree is destroyed, and the node in the base tree is destroyed
- Example
  - When traversing from d* back to c*, the a1b*c*/a1b*c* tree is output and destroyed
  - When traversing from c* back to b*, the a1b*D/a1b* tree is output and destroyed
  - When at b*, jump to b1 and repeat a similar process
Efficient Computation of Data Cubes

- General heuristics
- Multi-way array aggregation
- BUC
- H-cubing
- Star-Cubing
- High-Dimensional OLAP
The Curse of Dimensionality

- None of the previous cubing methods can handle high dimensionality!
- A database of 600k tuples, where each dimension has a cardinality of 100 and a Zipf factor of 2.
Motivation of High-D OLAP

- Challenge to current cubing methods:
  - The "curse of dimensionality" problem
  - Iceberg cubes and compressed cubes: only delay the inevitable explosion
  - Full materialization: still significant overhead in accessing results on disk
- High-D OLAP is needed in applications
  - Science and engineering analysis
  - Bio-data analysis: thousands of genes
  - Statistical surveys: hundreds of variables
Fast High-D OLAP with Minimal Cubing

- Observation: OLAP occurs only on a small subset of dimensions at a time
- Semi-Online Computational Model
  1. Partition the set of dimensions into shell fragments
  2. Compute data cubes for each shell fragment while retaining inverted indices or value-list indices
  3. Given the pre-computed fragment cubes, dynamically compute cube cells of the high-dimensional data cube online
Properties of Proposed Method

- Partitions the data vertically
- Reduces a high-dimensional cube into a set of lower-dimensional cubes
- Online reconstruction of the original high-dimensional space
- Lossless reduction
- Offers tradeoffs between the amount of pre-processing and the speed of online computation
Example Computation

- Let the cube aggregation function be count

tid  A   B   C   D   E
1    a1  b1  c1  d1  e1
2    a1  b2  c1  d2  e1
3    a1  b2  c1  d1  e2
4    a2  b1  c1  d1  e2
5    a2  b1  c1  d1  e3

- Divide the 5 dimensions into 2 shell fragments: (A, B, C) and (D, E)
1-D Inverted Indices

- Build a traditional inverted index or RID list

Attribute Value  TID List   List Size
a1               1 2 3      3
a2               4 5        2
b1               1 4 5      3
b2               2 3        2
c1               1 2 3 4 5  5
d1               1 3 4 5    4
d2               2          1
e1               1 2        2
e2               3 4        2
e3               5          1
Shell Fragment Cubes: Ideas

- Generalize the 1-D inverted indices to multi-dimensional ones in the data cube sense
- Compute all cuboids for data cubes ABC and DE while retaining the inverted indices
- For example, shell fragment cube ABC contains 7 cuboids:
  - A, B, C
  - AB, AC, BC
  - ABC
- This completes the offline computation stage

Cell    Intersection         TID List  List Size
a1 b1   {1 2 3} ∩ {1 4 5}    1         1
a1 b2   {1 2 3} ∩ {2 3}      2 3       2
a2 b1   {4 5} ∩ {1 4 5}      4 5       2
a2 b2   {4 5} ∩ {2 3}        ∅         0
Shell Fragment Cubes: Size and Design

- Given a database of T tuples, D dimensions, and shell fragment size F, the fragment cubes' space requirement is: O(T * ceil(D/F) * (2^F - 1))
- For F < 5, the growth is sub-linear
- Shell fragments do not have to be disjoint
- Fragment groupings can be arbitrary to allow for maximum online performance
  - Known common combinations (e.g., <city, state>) should be grouped together
- Shell fragment sizes can be adjusted for an optimal balance between offline and online computation
ID_Measure Table

- If measures other than count are present, store them in an ID_measure table separate from the shell fragments

tid  count  sum
1    5      70
2    3      10
3    8      20
4    5      40
5    2      30
The Frag-Shells Algorithm

1. Partition the set of dimensions (A1, …, An) into a set of k fragments (P1, …, Pk)
2. Scan the base table once and do the following:
3.   insert <tid, measure> into the ID_measure table
4.   for each attribute value ai of each dimension Ai
5.     build the inverted index entry <ai, tidlist>
6. For each fragment partition Pi
7.   build the local fragment cube Si by intersecting the tid-lists in a bottom-up fashion
Frag-Shells (2)

[Figure: dimensions A B C D E F … are split into fragments; each fragment (e.g., ABC, DEF) gets its own cube of cuboids (D, DE, EF, …) with tuple-ID lists, e.g. for the DE cuboid:]

Cell    Tuple-ID List
d1 e1   {1, 3, 8, 9}
d1 e2   {2, 4, 6, 7}
d2 e1   {5, 10}
…       …
Online Query Computation: Query

- A query has the general form ⟨a1, a2, …, an⟩ : M
- Each ai has 3 possible values:
  1. Instantiated value
  2. Aggregate * function
  3. Inquire ? function
- For example, ⟨3 ? ? * 1⟩ : count returns a 2-D data cube
Online Query Computation: Method

- Given the fragment cubes, process a query as follows:
  1. Divide the query into fragments, the same as the shell
  2. Fetch the corresponding TID list for each fragment from the fragment cube
  3. Intersect the TID lists from each fragment to construct the instantiated base table
  4. Compute the data cube using the base table with any cubing algorithm
Online Query Computation: Sketch

[Figure: dimensions A B C D E F G H I J K L M N …; the instantiated dimensions' TID lists are intersected to form the instantiated base table, from which the online cube is computed]
Experiment: Size vs. Dimensionality (50 and 100 cardinality)

- (50-C): 10^6 tuples, 0 skew, 50 cardinality, fragment size 3
- (100-C): 10^6 tuples, 2 skew, 100 cardinality, fragment size 2
Experiment: Size vs. Shell-Fragment Size

- (50-D): 10^6 tuples, 50 dimensions, 0 skew, 50 cardinality
- (100-D): 10^6 tuples, 100 dimensions, 2 skew, 25 cardinality
Experiment: Run-time vs. Shell-Fragment Size

- 10^6 tuples, 20 dimensions, 10 cardinality, skew 1, fragment size 3, 3 instantiated dimensions
Experiments on Real World Data

- UCI Forest CoverType data set
  - 54 dimensions, 581K tuples
  - Shell fragments of size 2 took 33 seconds and 325 MB to compute
  - 3-D subquery with 1 instantiated dimension: 85 ms ~ 1.4 sec
- Longitudinal Study of Vocational Rehab. Data
  - 24 dimensions, 8818 tuples
  - Shell fragments of size 3 took 0.9 seconds and 60 MB to compute
  - 5-D query with 0 instantiated dimensions: 227 ms ~ 2.6 sec
High-D OLAP: Further Implementation Considerations

- Incremental update:
  - Append more TIDs to the inverted lists
  - Add <tid: measure> to the ID_measure table
- Incrementally adding new dimensions
  - Form new inverted lists and add new fragments
- Bitmap indexing
  - May further improve space usage and speed
- Inverted index compression
  - Store as d-gaps
  - Explore more IR compression methods
Chapter 4: Data Cube Technology

- Efficient Computation of Data Cubes
- Exploration and Discovery in Multidimensional Databases
  - Discovery-Driven Exploration of Data Cubes
  - Sampling Cube
  - Prediction Cube
  - Regression Cube
- Summary
Discovery-Driven Exploration of Data Cubes

- Hypothesis-driven
  - Exploration by user; huge search space
- Discovery-driven (Sarawagi, et al.'98)
  - Effective navigation of large OLAP data cubes
  - Pre-compute measures indicating exceptions to guide the user in the data analysis, at all levels of aggregation
  - Exception: significantly different from the value anticipated, based on a statistical model
  - Visual cues such as background color are used to reflect the degree of exception of each cell
Kinds of Exceptions and their Computation

- Parameters
  - SelfExp: surprise of a cell relative to other cells at the same level of aggregation
  - InExp: surprise beneath the cell
  - PathExp: surprise beneath the cell for each drill-down path
- Computation of the exception indicators (model fitting and computing the SelfExp, InExp, and PathExp values) can be overlapped with cube construction
- Exceptions themselves can be stored, indexed, and retrieved like precomputed aggregates
Examples: Discovery-Driven Data Cubes
Complex Aggregation at Multiple Granularities: Multi-Feature Cubes

- Multi-feature cubes (Ross, et al. 1998): compute complex queries involving multiple dependent aggregates at multiple granularities
- Ex. Grouping by all subsets of {item, region, month}, find the maximum price in 1997 for each group, and the total sales among all maximum-price tuples:

    select item, region, month, max(price), sum(R.sales)
    from purchases
    where year = 1997
    cube by item, region, month: R
    such that R.price = max(price)

- Continuing the last example: among the max-price tuples, find the min and max shelf life, and find the fraction of the total sales due to tuples that have min shelf life within the set of all max-price tuples
Chapter 4: Data Cube Technology

- Efficient Computation of Data Cubes
- Exploration and Discovery in Multidimensional Databases
  - Discovery-Driven Exploration of Data Cubes
  - Sampling Cube
    - X. Li, J. Han, Z. Yin, J.-G. Lee, Y. Sun, "Sampling Cube: A Framework for Statistical OLAP over Sampling Data", SIGMOD'08
  - Prediction Cube
  - Regression Cube
Statistical Surveys and OLAP

- Statistical survey: a popular tool to collect information about a population based on a sample
  - Ex.: TV ratings, US Census, election polls
- A common tool in politics, health, market research, science, and many more
- An efficient way of collecting information (data collection is expensive)
- Many statistical tools are available to determine validity
  - Confidence intervals
  - Hypothesis tests
- OLAP (multidimensional analysis) on survey data
  - Highly desirable, but can it be done well?
Surveys: Sample vs. Whole Population

The data is only a sample of the population.

[Table: Age (18, 19, 20, …) × Education (High-school, College, Graduate)]
Problems for Drilling in Multidim. Space

The data is only a sample of the population, but samples can be small when drilling down to certain multidimensional subspaces.

[Table: Age (18, 19, 20, …) × Education (High-school, College, Graduate)]
Traditional OLAP Data Analysis Model

[Figure: a full data warehouse (Age, Education, Income) feeding a data cube]

- Assumption: the data is the entire population
- Query semantics is population-based, e.g., "What is the average income of 19-year-old college graduates?"
OLAP on Survey (i.e., Sampling) Data

- Semantics of the query is unchanged
- The input data has changed

[Table: Age (18, 19, 20, …) × Education (High-school, College, Graduate)]
OLAP with Sampled Data

- Where is the missing link?
  - OLAP is run over sampling data, but our analysis target is still the population
- Idea: integrate sampling and statistical knowledge with traditional OLAP tools

Input Data   Analysis Target   Analysis Tool
Population   Population        Traditional OLAP
Sample       Population        Not Available
Challenges for OLAP on Sampling Data

- Computing confidence intervals in the OLAP context
- No data?
  - Not exactly: no data in some subspaces of the cube
  - Sparse data
  - Causes include sampling bias and query selection bias
- Curse of dimensionality
  - Survey data can be high dimensional
  - Over 600 dimensions in a real-world example
  - Impossible to fully materialize
Example 1: Confidence Interval

What is the average income of 19-year-old high-school students?
Return not only the query result but also a confidence interval.

[Table: Age (18, 19, 20, …) × Education (High-school, College, Graduate)]
Confidence Interval

- Confidence interval for the mean: x̄ ± tc · σ̂x̄
  - x is a sample of the data set; x̄ is the mean of the sample
  - tc is the critical t-value, calculated by a look-up
  - σ̂x̄ = s / √l is the estimated standard error of the mean (s: sample standard deviation, l: sample size)
- Example: $50,000 ± $3,000 with 95% confidence
- Treat the points in a cube cell as samples
  - Compute the confidence interval as for a traditional sample set
- Return the answer in the form of a confidence interval
  - Indicates the quality of the query answer
  - The user selects the desired confidence level
Efficiently Computing Confidence Interval Measures

- Efficient computation in all cells of the data cube
  - Both the mean and the confidence interval are algebraic
- Why is the confidence interval measure algebraic?
  - σ̂x̄ = s / √l is algebraic, where both s and l (count) are algebraic
- Thus one can calculate cells efficiently at more general cuboids without having to start at the base cuboid each time
Example 2: Query Expansion

What is the average income of 19-year-old college students?

[Table: Age (18, 19, 20, …) × Education (High-school, College, Graduate)]
Boosting Confidence by Query Expansion

- From the example: the queried cell "19-year-old college students" contains only 2 samples
- The confidence interval is large (i.e., low confidence). Why?
  - Small sample size
  - High standard deviation within the samples
- Small sample sizes can occur even at relatively low-dimensional selections
- Collect more data? Expensive!
- Use data in other cells? Maybe, but one has to be careful
Intra-Cuboid Expansion: Choice 1

Expand the query to include 18- and 20-year-olds?

[Table: Age (18, 19, 20, …) × Education (High-school, College, Graduate)]
Intra-Cuboid Expansion: Choice 2

Expand the query to include high-school and graduate students?

[Table: Age (18, 19, 20, …) × Education (High-school, College, Graduate)]
Intra-Cuboid Expansion

- If other cells in the same cuboid satisfy both of the following:
  1. Similar semantic meaning
  2. Similar cube value
  then their data can be combined into the cell's own to "boost" confidence
- Only use if necessary
- A bigger sample size will decrease the confidence interval
Intra-Cuboid Expansion (2)

- Cell segment similarity
  - Some dimensions are clear: Age
  - Some are fuzzy: Occupation
  - May need domain knowledge
- Cell value similarity
  - How to determine if two cells' samples come from the same population?
  - Two-sample t-test (confidence-based)
Inter-Cuboid Expansion

- If a query dimension is
  - not correlated with the cube value
  - but is causing a small sample size by drilling down too much
  then remove the dimension (i.e., generalize it to *) and move to a more general cuboid
- Can use the two-sample t-test to determine similarity between two cells across cuboids
- Can also use a different method, shown later
Query Expansion
Query Expansion Experiments (2)

- Real-world sample data: 600 dimensions and 750,000 tuples
- 0.05% of the data is sampled to simulate a "sample" (this allows error checking)
Query Expansion Experiments (3)
Sampling Cube Shell: Handling the "Curse of Dimensionality"

- Real-world data may have > 600 dimensions
- Materializing the full sampling cube is unrealistic
- Solution: only compute a "shell" around the full sampling cube
- Method: selectively compute the best cuboids to include in the shell
  - Chosen cuboids will be low dimensional
  - The size and depth of the shell depend on user preference (space vs. accuracy tradeoff)
  - Use the cuboids in the shell to answer queries
Sampling Cube Shell Construction

- Top-down, iterative, greedy algorithm
  1. Top-down ― start at the apex cuboid and slowly expand to higher dimensions, following the cuboid lattice structure
  2. Iterative ― add one cuboid per iteration
  3. Greedy ― at each iteration, choose the best cuboid to add to the shell
- Stop when either the size limit is reached or it is no longer beneficial to add another cuboid
Sampling Cube Shell Construction (2)

- How to measure the quality of a cuboid?
- Cuboid Standard Deviation (CSD)
  - s(ci): the sample standard deviation of the samples in cell ci
  - n(ci): the # of samples in ci; n(B): the total # of samples in B
  - small(ci): returns 1 if n(ci) ≥ min_sup and 0 otherwise
  - Measures the amount of variance in the cells of a cuboid
  - Low CSD indicates high correlation with the cube value → good
  - High CSD indicates little correlation → bad
Sampling Cube Shell Construction (3)

- Goal (Cuboid Standard Deviation Reduction: CSDR)
  - Reduce CSD ― increase query information
  - Find the cuboids with the least amount of CSD
- Overall algorithm
  - Start with the apex cuboid and compute its CSD
  - Choose the next cuboid to build that reduces the CSD the most from the apex
  - Iteratively choose more cuboids that reduce the CSD of their parent cuboids the most (greedy selection ― at each iteration, choose the cuboid with the largest CSDR to add to the shell)
Computing the Sampling Cube Shell
Query Processing

- If the query matches a cuboid in the shell, use it
- If the query does not match:
  1. A more specific cuboid exists ― aggregate to the query dimensions' level and answer the query
  2. A more general cuboid exists ― use the value in the cell
  3. Multiple more general cuboids exist ― use the most confident value
Query Accuracy
Sampling Cube: Sampling-Aware OLAP

1. Confidence intervals in query processing
   - Integration with OLAP queries
   - Efficient algebraic query processing
2. Expand queries to increase confidence
   - Solves the sparse data problem
   - Inter/intra-cuboid query expansion
3. Cube materialization with limited space
   - Sampling cube shell
Chapter Summary

- Efficient algorithms for computing data cubes
  - Multiway array aggregation
  - BUC
  - H-cubing
  - Star-cubing
  - High-D OLAP by minimal cubing
- Further development of data cube technology
  - Discovery-driven cubes
  - Multi-feature cubes
  - Sampling cubes
Explanation of Multi-way Array Aggregation: the "a0b0" chunk

[Figure: the first chunk a0b0 (cells a0b0c0…c3, marked "xxxx") is scanned; its contributions ("x") land in the c0–c3 columns of the AC plane (row a0) and of the BC plane (row b0)]
The "a0b1" chunk

[Figure: chunk a0b1 ("yyyy") is scanned; done with a0b0. The AC plane's a0 row now holds "xy" sums, and the BC plane has row b0 = "x" and row b1 = "y"]
The "a0b2" chunk

[Figure: chunk a0b2 ("zzzz") is scanned; done with a0b1. The AC plane's a0 row now holds "xyz" sums, and the BC plane rows b0, b1, b2 hold "x", "y", "z"]
Table Visualization: the "a0b3" chunk

[Figure: chunk a0b3 ("uuuu") is scanned; done with a0b2. The AC plane's a0 row now holds the complete "xyzu" sums, and the BC plane rows b0–b3 hold "x", "y", "z", "u"]
Table Visualization: the "a1b0" chunk

[Figure: chunk a1b0 ("xxxx") is scanned; done with a0b3 and with a0c*, so the a0 row of the AC plane is complete and can be flushed. The BC plane's b0 row accumulates to "xx"]
The "a3b3" chunk (the last one)

[Figure: the final chunk a3b3 ("uuuu") is scanned; the last rows of the AC and BC planes complete, and b*c* is done. Finish]
Memory Used

- A: 40 distinct values; B: 400 distinct values; C: 4000 distinct values (each dimension is split into 4 chunks, so the chunk sides are 10, 100, and 1000)
- ABC sort order:
  - Plane AB: needs 1 chunk (10 * 100 * 1)
  - Plane AC: needs 4 chunks (10 * 1000 * 4)
  - Plane BC: needs 16 chunks (100 * 1000 * 16)
- Total memory: 1,641,000 cells
Memory Used

- Same dimensions (A: 40, B: 400, C: 4,000 distinct values), but scanning in CBA sort order:
  - Plane CB: needs 1 chunk: 1,000 * 100 * 1 = 100,000
  - Plane CA: needs 4 chunks: 1,000 * 10 * 4 = 40,000
  - Plane BA: needs 16 chunks: 100 * 10 * 16 = 16,000
- Total memory: 156,000. This order keeps only the plane of the two smallest dimensions fully resident, which is why it is far cheaper.
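The arithmetic on the two Memory Used slides is easy to reproduce. A minimal sketch (variable names are ours):

```python
# Memory needed for the 2-D plane cuboids under the two scan orders,
# reproducing the slides' arithmetic.
A, B, C = 40, 400, 4000
ca, cb, cc = A // 4, B // 4, C // 4   # 4 chunks per dimension: 10, 100, 1000

# ABC order: the plane of the two last dimensions (BC) is held whole,
# the middle plane (AC) needs one row of chunks, the first (AB) one chunk.
abc = (ca * cb * 1) + (ca * cc * 4) + (cb * cc * 16)

# CBA order: the roles reverse, so the big BC plane is never held whole.
cba = (cc * cb * 1) + (cc * ca * 4) + (cb * ca * 16)

print(abc, cba)   # 1641000 156000
```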
H-Cubing Example

Base relation (4 tuples, read off the H-tree): (a1, b1, c1), (a1, b1, c2), (a1, b2, c3), (a2, b1, c1).

H-table counts: a1: 3, a2: 1, b1: 3, b2: 1, c1: 2, c2: 1, c3: 1.

H-Tree 1 (condition: ???):

    root: 4
    |- a1: 3
    |   |- b1: 2 -- c1: 1, c2: 1
    |   |- b2: 1 -- c3: 1
    |- a2: 1
        |- b1: 1 -- c1: 1

Output so far: (*,*,*): 4

Side-links connect equal-valued nodes so that projections can traverse a virtual tree; for clarity, the walkthrough prints the resulting virtual tree rather than the real tree.
Project on c1: ??c1

H-Tree 1.1 (condition: ??c1): following the side-links of c1 selects the tuples (a1, b1, c1) and (a2, b1, c1); the projected header counts are a1: 1, a2: 1, b1: 2. New output: (*,*,c1): 2.

Project on b1, c1: ?b1c1

H-Tree 1.1.1 (condition: ?b1c1): header counts a1: 1, a2: 1. New output: (*,b1,c1): 2.

Backtrack and aggregate: ?*c1

H-Tree 1.1.2 re-aggregates the c1-conditioned tree over b (header counts a1: 1, a2: 1); no new output is produced here.

Aggregate: ??*

H-Tree 1.2 returns to the unconditioned tree aggregated over c (header counts a1: 3, a2: 1, b1: 3, b2: 1). Output so far: (*,*,*): 4, (*,*,c1): 2, (*,b1,c1): 2.
Project on b1: ?b1*

H-Tree 1.2.1: header counts a1: 2, a2: 1. New outputs: (*,b1,*): 3 and (a1,b1,*): 2. After this the walkthrough also projects on a1b1*.

Aggregate: ?**

H-Tree 1.2.2 aggregates over b as well: header counts a1: 3, a2: 1. New output: (a1,*,*): 3 (the walkthrough also projects on a1 here). Finish.

Final output list: (*,*,*): 4, (*,*,c1): 2, (*,b1,c1): 2, (*,b1,*): 3, (a1,b1,*): 2, (a1,*,*): 3.
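The counts derived step by step above can be double-checked by brute force. This sketch is not the H-tree data structure itself; it simply enumerates every cuboid cell over the four base tuples of the example and verifies the cells the walkthrough prints:

```python
from itertools import product
from collections import Counter

# The four base tuples encoded by the H-tree in the walkthrough
tuples = [("a1", "b1", "c1"), ("a1", "b1", "c2"),
          ("a1", "b2", "c3"), ("a2", "b1", "c1")]

# Brute force: each dimension is either kept or generalized to "*"
counts = Counter()
for t in tuples:
    for mask in product([True, False], repeat=3):
        cell = tuple(v if keep else "*" for v, keep in zip(t, mask))
        counts[cell] += 1

# The cells printed during the H-cubing walkthrough
assert counts[("*", "*", "*")] == 4
assert counts[("*", "*", "c1")] == 2
assert counts[("*", "b1", "c1")] == 2
assert counts[("*", "b1", "*")] == 3
assert counts[("a1", "b1", "*")] == 2
assert counts[("a1", "*", "*")] == 3
```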
3. Explanation of Star-Cubing

The base ABCD-tree (star-tree) for the 5-tuple example; starred nodes (b*, c*, d*) stand for infrequent values:

    root: 5
    |- a1: 3
    |   |- b*: 1 -- c*: 1 -- d*: 1
    |   |- b1: 2 -- c*: 2 -- d*: 2
    |- a2: 2
        |- b*: 2 -- c3: 2 -- d4: 2

The ABCD-cuboid trees are built and mined during one depth-first traversal of this tree.
Depth-first traversal, Steps 1-5

- Step 1: start at root: 5; create the shared child cuboid tree BCD with root BCD: 5.
- Step 2: visit a1: 3; output (a1,*,*,*): 3 and create the cuboid tree a1CD/a1: 3.
- Step 3: visit b*: 1; create the cuboid tree a1b*D/a1b*: 1.
- Step 4: visit c*: 1; create the cuboid tree a1b*c*/a1b*c*: 1 and record c*: 1 under a1CD.
- Step 5: visit the leaf d*: 1; the d value is recorded in each cuboid tree on the current path (BCD, a1CD, a1b*D).
Mining and pruning the finished subtrees

- The cuboid tree a1b*c*/a1b*c*: 1 is mined first; there is nothing to output, so it is removed.
- Likewise the cuboid tree a1b*D/a1b*: 1 is mined, yields nothing, and is removed.
Steps 6-8

- Step 6: backtrack and visit b1: 2; output (a1,b1,*,*): 2 and create the cuboid tree a1b1D/a1b1: 2.
- Step 7: visit c*: 2; create the cuboid tree a1b1c*/a1b1c*: 2; the c* count under a1CD grows to 3.
- Step 8: visit the leaf d*: 2; the d* counts under BCD, a1CD, and a1b1D grow to 3.
Mining and pruning after the a1 branch

- The cuboid tree a1b1c*/a1b1c*: 2 is mined; nothing to output, so it is removed.
- The cuboid tree a1b1D/a1b1: 2 is mined; nothing to output (all interior nodes are starred), so it is removed.
- The cuboid tree a1CD/a1: 3 is mined; again all interior nodes are starred, so it too is removed, and the traversal of the a1 branch is complete.
Steps 9-12: the a2 branch

- Step 9: visit a2: 2; output (a2,*,*,*): 2 and create the cuboid tree a2CD/a2: 2.
- Step 10: visit b*: 2; the b* count under BCD grows to 3; create the cuboid tree a2b*D/a2b*: 2.
- Step 11: visit c3: 2; create the cuboid tree a2b*c3/a2b*c3: 2; c3: 2 is recorded under BCD and a2CD.
- Step 12: visit the leaf d4: 2; d4: 2 is recorded under the cuboid trees on the current path.
Mining and pruning after the a2 branch

- The cuboid tree a2b*c3/a2b*c3: 2 is mined; nothing to output, so it is removed.
- The cuboid tree a2b*D/a2b*: 2 is mined; nothing to output, so it is removed.
- The cuboid tree a2CD/a2: 2 has non-starred children (c3: 2 -- d4: 2), so it is mined recursively for the AC/AC and AD/A cuboids before being removed.
Recursive mining of a2CD/a2

- Step 1: traverse the ACD/A tree rooted at a2CD/a2: 2; create the child cuboid tree a2D/a2: 2.
- Step 2: visit c3: 2; output (a2,*,c3,*): 2 and create the cuboid tree a2c3/a2c3: 2.
- Backtrack: visit the leaf d4: 2, which is recorded under a2D and a2c3; mining these child trees on backtracking outputs (a2,*,c3,d4): 2. As before, child trees are mined recursively while backtracking.
Mining the BCD tree and finishing

- Finally the shared BCD tree (root BCD: 5) is mined for the BC/BC, BD/B, and CD cuboids and then removed. The walkthrough skips this step; you may do it as an exercise.
- Finish. Output so far: (a1,*,*,*): 3, (a1,b1,*,*): 2, (a2,*,*,*): 2, (a2,*,c3,*): 2, (a2,*,c3,d4): 2, plus the patterns from the BCD tree.
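The cells emitted during the traversal can be verified by brute force. The starred nodes of the star-tree stand for infrequent values, so the concrete values below (b2, b3, b4, c2, c4, c5, d1, d5, d6) are made up, chosen only to be distinct singletons consistent with the tree; the check confirms the five output cells all have the stated counts under min_sup = 2:

```python
from itertools import product
from collections import Counter

# Five base tuples consistent with the walkthrough's star-tree.
# All values except a1, a2, b1, c3, d4 are hypothetical singletons.
tuples = [("a1", "b2", "c2", "d1"),
          ("a1", "b1", "c4", "d5"), ("a1", "b1", "c5", "d6"),
          ("a2", "b3", "c3", "d4"), ("a2", "b4", "c3", "d4")]

counts = Counter()
for t in tuples:
    for mask in product([True, False], repeat=4):
        counts[tuple(v if keep else "*" for v, keep in zip(t, mask))] += 1

# The 4-D cells emitted during the traversal
expected = [(("a1", "*", "*", "*"), 3), (("a1", "b1", "*", "*"), 2),
            (("a2", "*", "*", "*"), 2), (("a2", "*", "c3", "*"), 2),
            (("a2", "*", "c3", "d4"), 2)]
for cell, n in expected:
    assert counts[cell] == n
```

Cells with a starred A value (the BCD-tree patterns the walkthrough skips) are not listed here, which is why the expected set contains only cells whose A value is specified.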
Efficient Computation of Data Cubes

- General heuristics
- Multi-way array aggregation
- BUC
- H-cubing
- Star-Cubing
- High-Dimensional OLAP
- Computing non-monotonic measures
- Compressed and closed cubes
Computing Cubes with Non-Anti-Monotonic Iceberg Conditions

- J. Han, J. Pei, G. Dong, K. Wang. Efficient Computation of Iceberg Cubes with Complex Measures. SIGMOD'01
- Most cubing algorithms cannot efficiently compute cubes whose iceberg condition is not anti-monotonic
- Example:

  CREATE CUBE Sales_Iceberg AS
  SELECT month, city, cust_grp, AVG(price), COUNT(*)
  FROM Sales_Infor
  CUBEBY month, city, cust_grp
  HAVING AVG(price) >= 800 AND COUNT(*) >= 50

- How can such a constraint be pushed into the iceberg cube computation?
Non-Anti-Monotonic Iceberg Condition

- Anti-monotonic: once a cell fails the condition, every further specialization (descendant cell) also fails
- An iceberg query with AVG is not anti-monotonic:
  - (Mar, *, *, 600, 1800) fails the HAVING clause
  - but its descendant (Mar, *, Bus, 1300, 360) passes it

| Month | City | Cust_grp | Prod    | Cost | Price |
|-------|------|----------|---------|------|-------|
| Jan   | Tor  | Edu      | Printer | 500  | 485   |
| Jan   | Tor  | Hld      | TV      | 800  | 1200  |
| Jan   | Tor  | Edu      | Camera  | 1160 | 1280  |
| Feb   | Mon  | Bus      | Laptop  | 1500 | 2500  |
| Mar   | Van  | Edu      | HD      | 540  | 520   |
| …     | …    | …        | …       | …    | …     |

  CREATE CUBE Sales_Iceberg AS
  SELECT month, city, cust_grp, AVG(price), COUNT(*)
  FROM Sales_Infor
  CUBEBY month, city, cust_grp
  HAVING AVG(price) >= 800 AND COUNT(*) >= 50
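A tiny numeric illustration of the failure of anti-monotonicity (the prices below are made up; only the shape of the example matters): a parent cell fails AVG(price) >= 800 while one of its descendants passes, so failing cells cannot simply be pruned.

```python
# AVG is not anti-monotonic: a cell can fail AVG(price) >= 800 while
# one of its descendant cells passes. Prices are hypothetical.
march = {"Edu": [100, 200, 300], "Bus": [1300, 1100, 1500]}

avg = lambda xs: sum(xs) / len(xs)
all_prices = [p for grp in march.values() for p in grp]

parent_passes = avg(all_prices) >= 800      # cell (Mar, *): avg 750, fails
child_passes = avg(march["Bus"]) >= 800     # cell (Mar, Bus): avg 1300, passes

assert not parent_passes and child_passes
```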
From Average to Top-k Average

- Let (*, Van, *) cover 1,000 records
  - Avg(price) is the average price of those 1,000 sales
  - Avg50(price) is the average price of the top-50 sales (top 50 by sales price)
- Top-k average is anti-monotonic
  - If the top-50 sales in Van. have avg(price) <= 800, then the top-50 sales in Van. during Feb. must also have avg(price) <= 800
Binning for Top-k Average

- Computing the top-k average exactly is costly for large k
- Binning idea, for checking Avg50(c) >= 800:
  - Large-value collapsing: keep one sum and one count for all records with measure >= 800
    - If that count >= 50, there is no need to examine the "small" records
  - Small-value binning: a group of bins
    - Each bin covers a range, e.g., 600~800, 400~600, etc.
    - Register a sum and a count for each bin
Computing the Approximate Top-k Average

Suppose for (*, Van, *) we have:

| Range    | Sum    | Count |
|----------|--------|-------|
| Over 800 | 28,000 | 20    |
| 600~800  | 10,600 | 15    |
| 400~600  | 15,200 | 30    |
| …        | …      | …     |

The first two bins cover 35 of the top 50 records; the remaining 15 come from the 400~600 bin and are bounded above by 600 each:

Approximate avg50() = (28,000 + 10,600 + 600 * 15) / 50 = 952

Since 952 >= 800, the cell may pass the HAVING clause.
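The bin computation generalizes to any k: take whole bins from the top until k records are covered, then bound the partially used bin by its upper end. A minimal sketch using the slide's bins (variable names are ours):

```python
# Approximate top-50 average for (*, Van, *) from the slide's bins.
# Each entry is (sum, count, upper_bound); the top bin has no finite bound
# but is always taken whole.
bins = [(28000, 20, None), (10600, 15, 800), (15200, 30, 600)]
k = 50

total, taken = 0.0, 0
for s, n, upper in bins:
    if taken + n <= k:
        total += s            # take the whole bin
        taken += n
    else:
        # partially used bin: bound each remaining record by the bin's
        # upper end (here 15 records at <= 600 each)
        total += (k - taken) * upper
        taken = k
        break

approx_avg50 = total / k
print(approx_avg50)   # 952.0
```

Because every partial record is replaced by an upper bound, this approximation never underestimates the true avg50, so pruning on it is safe.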
Weakened Conditions Facilitate Pushing

- Accumulate quant-info per cell to compute average iceberg cubes efficiently
  - Three pieces: sum, count, and the top-k bins
  - Use the top-k bins to estimate/prune descendants
  - Use sum and count to consolidate the current cell
- The three conditions, from weakest to strongest:
  - Approximate avg50(): anti-monotonic and can be computed efficiently
  - Real avg50(): anti-monotonic but computationally costly
  - avg(): not anti-monotonic
Computing Iceberg Cubes with Other Complex Measures

- Key point: find a weaker function that still guarantees a form of anti-monotonicity
- Examples
  - Avg() <= v: use avgk(c) <= v (bottom-k average)
  - Avg() >= v only (no count constraint): use max(price) >= v
  - Sum(profit), where profit can be negative: use p_sum(c) >= v if p_count(c) >= k, and otherwise sumk(c) >= v
  - Others: conjunctions of multiple conditions
Efficient Computation of Data Cubes

- General heuristics
- Multi-way array aggregation
- BUC
- H-cubing
- Star-Cubing
- High-Dimensional OLAP
- Computing non-monotonic measures
- Compressed and closed cubes
Compressed Cubes: Condensed or Closed Cubes

- This is challenging even for an iceberg cube. Suppose 100 dimensions and only 1 base cell with count = 10: how many aggregate cells are there with count >= 10?
- Condensed cube
  - Only one cell, (a1, a2, …, a100, 10), needs to be stored; it represents all the corresponding aggregate cells
  - Efficient computation of the minimal condensed cube
- Closed cube
  - Dong Xin, Jiawei Han, Zheng Shao, and Hongyan Liu, "C-Cubing: Efficient Computation of Closed Cubes by Aggregation-Based Checking", ICDE'06
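The count asked for above has a closed form: every ancestor of the single base cell keeps or rolls up each of the 100 dimensions independently, and all of those ancestors inherit count 10.

```python
# 100 dimensions, one base cell with count 10, condition count >= 10:
# each dimension is either kept or rolled up to "*", so every one of the
# 2**100 ancestor cells (base cell and apex included) qualifies.
dims = 100
ancestors = 2 ** dims            # includes the base cell and the apex
aggregate_cells = ancestors - 1  # excluding the base cell itself

print(aggregate_cells)           # about 1.27e30 cells for a single tuple
```

This explosion is exactly why storing the one condensed cell (a1, a2, …, a100, 10) is so attractive.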
Exploration and Discovery in Multidimensional Databases

- Discovery-Driven Exploration of Data Cubes
- Complex Aggregation at Multiple Granularities: Multi-Feature Cubes
- Cube-Gradient Analysis
Cube-Gradient (Cubegrade)

- Analysis of changes of sophisticated measures in multi-dimensional spaces
  - Query: how did the average house price in Vancouver change in '00 compared with '99?
  - Answer: apartments in West went down 20%, houses in Metrotown went up 10%
- The cubegrade problem, posed by Imielinski et al.
  - Changes in dimensions lead to changes in measures
  - Three kinds of cell pairs: drill-down, roll-up, and mutation
From Cubegrade to Multi-dimensional Constrained Gradients in Data Cubes

- Significantly more expressive than association rules
  - Captures trends in user-specified measures
- Serious challenges
  - Many trivial cells in a cube: a "significance constraint" prunes trivial cells
  - Too many pairs of cells to enumerate: a "probe constraint" selects a subset of cells to examine
  - Only interesting changes are wanted: a "gradient constraint" captures significant changes
MD Constrained Gradient Mining

- Significance constraint Csig: (cnt >= 100)
- Probe constraint Cprb: (city = "Van", cust_grp = "Busi", prod_grp = "*")
- Gradient constraint Cgrad(cg, cp): (avg_price(cg) / avg_price(cp) >= 1.3)
- A probe cell satisfies Cprb; gradient cells cg are compared against it as ancestors, descendants, or siblings

Dimensions and measures:

| cid | Yr | City | Cst_grp | Prd_grp | Cnt   | Avg_price |
|-----|----|------|---------|---------|-------|-----------|
| c1  | 00 | Van  | Busi    | PC      | 300   | 2100      |
| c2  | *  | Van  | Busi    | PC      | 2800  | 1800      |
| c3  | *  | Tor  | Busi    | PC      | 7900  | 2350      |
| c4  | *  | *    | busi    | PC      | 58600 | 2250      |

Here c1 is a base cell and c2-c4 are aggregated cells; c3 is a sibling and c4 an ancestor of the probe cell c2. The pair (c3, c2) satisfies Cgrad: 2350 / 1800 ≈ 1.31 >= 1.3.
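A quick check of the gradient ratios against the probe cell c2, using the avg_price values from the table (the dict below keeps only cell ids and prices; the other dimensions are omitted for brevity):

```python
# Gradient check against the probe cell c2 = (*, Van, Busi, PC).
cells = {"c1": 2100, "c2": 1800, "c3": 2350, "c4": 2250}  # avg_price per cell
probe = "c2"

# Which cells satisfy Cgrad: avg_price(cg) / avg_price(cp) >= 1.3?
passes = {cid: price / cells[probe] >= 1.3
          for cid, price in cells.items() if cid != probe}
print(passes)
```

Only c3 meets the 1.3 threshold (2350 / 1800 ≈ 1.31); c4 falls just short at 2250 / 1800 = 1.25.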
Efficient Computation of Cube-Gradients

- Compute the probe cells using Csig and Cprb
  - The set P of probe cells is often very small
- Use the probe cells P and the constraints to find gradients
  - Push selections deeply
  - Set-oriented processing for probe cells
  - Iceberg growing from low to high dimensionalities
  - Dynamic pruning of probe cells during growth
  - Incorporate an efficient iceberg cubing method