Efficient Methods for Data Cube
Computation and Data Generalization
Chapter 4 (4.1)
March 14, 2016
Data Generalization


Data generalization is the process of abstracting concept-level
knowledge from a large set of task-relevant data.
Two types of analysis:
Descriptive data mining: describes data in a concise manner,
highlighting interesting general properties. Supports interest.
Predictive data mining: constructs a model and attempts to
predict the behavior of new data (classification, regression, …).
A Data Cube (MOLAP)
• On-line analytical processing is fastest when the
aggregates for all the cuboids are precomputed.
• Precomputing the full cube requires an excessive amount of
memory; the amount depends on the number of dimensions and on the
cardinality of each dimension.
• For many cells in a cuboid the measure value is zero, and such cells are
of little or no interest. Cuboids are often sparse.
Partial Materialization


Precomputation of some of the cuboids in advance leads to
fast response time and avoids redundant computations
during on-line analytical processing.
Data cube materialization/precomputation
No materialization: Don’t precompute any of the non-base
cuboids. This leads to multidimensional aggregation on the fly, which is
slow.
Full materialization: Precompute all the cuboids. Queries
run very fast, but this requires a huge amount of memory.
Partial materialization: Selectively compute a proper subset
of the cuboids, or a subset of the cube that contains only those cells
that satisfy some user-specified criterion.
Outline

Types of cells : Base cell, aggregate cell, cell relationship

Types of Cubes : Full cube, Iceberg Cube, Closed Cube, Shell Cube

Efficient Computation of Data Cubes

Multiway Array Aggregation

BUC

Star Cubing
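A Data Cube: sales
Branch      I1  I2  I3  I4  I5  I6  All
New York    10  11  12   3  10   1   47
Chicago     11   9   6   9   6   7   48
Toronto     12   9   8   5   7   3   44
Vancouver   13   8  10   5   6   3   45
All         46  37  36  22  29  14  184
Columns I1 to I6 are products.
Base cuboid: the (branch, product) cells in the body of the table; each is a base cell.
Branch cuboid: the All column (one aggregate cell per branch).
Product cuboid: the All row (one aggregate cell per product).
Apex cuboid: the grand total, 184.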
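The roll-ups in the sales cube can be checked with a short sketch. It assumes the row assignment read off the slide (the third and fourth data rows belong to Toronto and Vancouver); the helper name `roll_up` is hypothetical and only illustrates how aggregate cells are derived from base cells.

```python
# Base cuboid from the sales slide: (branch, product) -> measure.
# Assumption: rows three and four are Toronto and Vancouver.
products = ["I1", "I2", "I3", "I4", "I5", "I6"]
rows = {
    "New York":  [10, 11, 12, 3, 10, 1],
    "Chicago":   [11, 9, 6, 9, 6, 7],
    "Toronto":   [12, 9, 8, 5, 7, 3],
    "Vancouver": [13, 8, 10, 5, 6, 3],
}
base = {(b, p): v for b, vals in rows.items() for p, v in zip(products, vals)}

def roll_up(base, keep_branch, keep_product):
    """Aggregate the base cuboid; a dropped dimension becomes '*'."""
    out = {}
    for (b, p), v in base.items():
        key = (b if keep_branch else "*", p if keep_product else "*")
        out[key] = out.get(key, 0) + v
    return out

branch_cuboid = roll_up(base, True, False)    # the "All" column
product_cuboid = roll_up(base, False, True)   # the "All" row
apex_cuboid = roll_up(base, False, False)     # the grand total

print(branch_cuboid[("New York", "*")])  # 47
print(product_cuboid[("*", "I1")])       # 46
print(apex_cuboid[("*", "*")])           # 184
```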
- Types of cells

Base cell: a cell which belongs to a base cuboid
Aggregate cell: a cell which belongs to a non-base cuboid
Each aggregate dimension is indicated by a “*”




Ancestor-descendant relationship between cells:
dimensions are (branch, product, year)
1-D cell c1 = (New York, *, *, 2000) is an ancestor of a
2-D cell c2 = (New York, I1, *, 400) and a
3-D cell c3 = (New York, I1, 2013, 111).
c3 is a descendant of c1 and c2.

In an n-D data cube, an
i-D cell a = (a1, a2, …, an, measure_a) is an ancestor of a
j-D cell b = (b1, b2, …, bn, measure_b) if
1) i < j, and
2) for 1 ≤ m ≤ n, am = bm whenever am ≠ *.
3) If j = i + 1, a is called a parent of b, and b is a child of a.
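A minimal sketch of the ancestor test, assuming cells are represented as tuples of dimension values only (the measure is left out) and that "*" marks an aggregated dimension. The function names are hypothetical.

```python
STAR = "*"

def num_dims(cell):
    """Number of non-aggregated dimensions (an i-D cell has i of them)."""
    return sum(1 for v in cell if v != STAR)

def is_ancestor(a, b):
    """True if a is an ancestor of b: i < j and a_m == b_m wherever a_m != '*'."""
    if len(a) != len(b) or num_dims(a) >= num_dims(b):
        return False
    return all(x == y for x, y in zip(a, b) if x != STAR)

def is_parent(a, b):
    """A parent is an ancestor exactly one dimension coarser (j = i + 1)."""
    return is_ancestor(a, b) and num_dims(b) == num_dims(a) + 1

# The three cells from the slide, without their measures.
c1 = ("New York", STAR, STAR)
c2 = ("New York", "I1", STAR)
c3 = ("New York", "I1", 2013)

print(is_ancestor(c1, c3))  # True
print(is_parent(c1, c3))    # False
```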
- Types of cubes

Full cube: All cells and cuboids are materialized, covering all possible
combinations of dimensions and values. An n-dimensional cube has
$2^n$ cuboids, or $\prod_{i=1}^{n} (L_i + 1)$ cuboids when dimension i
has $L_i$ hierarchy levels.
Iceberg cube: Partial materialization. Materialize only the cells
in a cuboid whose measure value is at or above a minimum
threshold, e.g. the iceberg condition count(*) >= min support.
Closed cube: No ancestor cell is created if its measure is equal
to that of its descendant cell.
Shell cube: Only cuboids with a limited number of dimensions are
created.
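As a rough illustration of the cuboid count and the iceberg condition, the sketch below assumes `levels[i]` holds the number of hierarchy levels of dimension i (not counting the virtual top level "all") and that a cuboid is a plain dictionary from cell to count. Both function names are hypothetical.

```python
from math import prod

def num_cuboids(levels):
    """Total cuboids: product of (L_i + 1) over all dimensions."""
    return prod(l + 1 for l in levels)

def iceberg(cells, min_sup):
    """Keep only the cells that satisfy count(*) >= min_sup."""
    return {cell: cnt for cell, cnt in cells.items() if cnt >= min_sup}

# With one level per dimension the formula reduces to 2^n.
print(num_cuboids([1, 1, 1]))  # 8

# Toy cuboid (made-up counts) filtered with min_sup = 10.
print(iceberg({("NY", "*"): 47, ("NY", "I6"): 1}, 10))  # {('NY', '*'): 47}
```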
Two base cells {(a1, a2, a3, …, a100):10, (a1, a2, b3, …, b100):10}
How many sub-patterns (aggregate cells) does the first base cell have?
The total number of aggregate cells is $2^{101} - 6$.
Ignore all of the aggregate cells that can be obtained by replacing
some constants by “*” while keeping the same measure value.
Only 3 cells really offer new information:
{(a1, a2, a3, …, a100):10, (a1, a2, b3, …, b100):10, (a1, a2, *, …, *):20}
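The counting argument can be checked by brute force on a scaled-down version of the example. The sketch below assumes n = 6 dimensions instead of 100 (so that enumeration is feasible); by the same argument the count of aggregate cells should then be 2^(n+1) - 6, and three cells should remain closed. All function names are hypothetical.

```python
from itertools import product

def generalizations(cell):
    """All cells obtained by replacing any subset of values with '*'."""
    return set(product(*[(v, "*") for v in cell]))

def cube_cells(base):
    """Map every non-empty cell of the cube to its aggregated count."""
    counts = {}
    for cell, cnt in base.items():
        for g in generalizations(cell):
            counts[g] = counts.get(g, 0) + cnt
    return counts

def is_descendant(d, c):
    """d is a descendant of c: d agrees with c wherever c is not '*'."""
    return d != c and all(x == "*" or x == y for x, y in zip(c, d))

def closed_cells(counts):
    """Cells with no descendant that has the same measure value."""
    return {c: n for c, n in counts.items()
            if not any(m == n and is_descendant(d, c) for d, m in counts.items())}

n = 6  # assumption: 6 dimensions instead of the slide's 100
base = {
    tuple(f"a{i}" for i in range(1, n + 1)): 10,
    ("a1", "a2") + tuple(f"b{i}" for i in range(3, n + 1)): 10,
}
counts = cube_cells(base)
aggregate = len(counts) - len(base)
closed = closed_cells(counts)

print(aggregate)    # 122, i.e. 2**(n + 1) - 6
print(len(closed))  # 3
```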
Example

Which are the closed cells?
can be derived from
the closed cell

Similarly we can also get
Iceberg Cube, Closed Cube & Cube Shell

Is iceberg cube good enough?
2 base cells: {(a1, a2, a3, …, a100):10, (a1, a2, b3, …, b100):10}
How many cells will the iceberg cube have with count(*) >= 10?
Hint: A huge but tricky number!
Closed cube:
Closed cell c: there exists no cell d such that d is a descendant of c
and d has the same measure value as c.
Closed cube: a cube consisting of only closed cells.
What is the closed cube of the above base cuboid? Hint: only 3
cells.
Cube Shell
Precompute only the cuboids involving a small number of dimensions,
e.g., 3.
For (A1, A2, …, A10), how many combinations to compute?
Dimension combinations beyond these need to be computed on the fly.
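One way to answer the cube shell question, assuming "a small number of dimensions, e.g., 3" means every cuboid with 1, 2, or 3 dimensions and that the apex cuboid is not counted. Under the alternative reading (exactly 3 dimensions) only the last term of the sum would apply. The function name is hypothetical.

```python
from math import comb

def shell_cuboids(num_dims, max_dims):
    """Number of cuboids that involve between 1 and max_dims dimensions."""
    return sum(comb(num_dims, k) for k in range(1, max_dims + 1))

# 10 one-dimensional + 45 two-dimensional + 120 three-dimensional cuboids.
print(shell_cuboids(10, 3))  # 175
```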
- Efficient Computation of Data Cubes

Preliminary cube computation tricks

Computing full/iceberg cubes: 2 methodologies

Top-Down:
Multi-Way array aggregation

Bottom-Up:
Bottom-up computation: BUC

Star-Cubing: Integrates top-down and bottom-up
-- Preliminary Cube Computation Tricks


Sorting, hashing, and grouping operations are applied to the dimension
attributes in order to reorder and cluster related tuples (ROLAP).
Aggregates may be computed from previously computed aggregates,
rather than from the base fact table:
Cache-results: cache the results of already computed cuboids to
reduce disk I/O. Higher-level aggregates are computed from lower-level
aggregates rather than from base facts.
Smallest-child: compute a cuboid from the smallest previously
computed cuboid. For example, C{branch} can be computed from either
C{branch, year} or C{branch, item}; the smaller of the two is the cheaper source.
Amortize-scans: compute as many cuboids as possible at the same
time to amortize disk reads.
Share-sorts: share sorting costs across multiple cuboids when a sort-based
method is used.
Share-partitions: share the partitioning cost across multiple cuboids
when hash-based algorithms are used.
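The smallest-child trick can be sketched as follows. The cuboid contents are invented toy data, and `smallest_child` and `aggregate` are hypothetical helper names; the point is only that the branch cuboid is derived from whichever previously computed cuboid has fewer cells.

```python
def aggregate(cuboid, dims, keep):
    """Roll a cuboid defined over `dims` up to the dimensions in `keep`."""
    idx = [dims.index(d) for d in keep]
    out = {}
    for key, v in cuboid.items():
        k = tuple(key[i] for i in idx)
        out[k] = out.get(k, 0) + v
    return out

def smallest_child(target, computed):
    """Pick the smallest computed cuboid that contains all target dimensions."""
    candidates = [dims for dims in computed if set(target) <= set(dims)]
    return min(candidates, key=lambda dims: len(computed[dims]))

# Toy data (made up): two previously computed cuboids.
computed = {
    ("branch", "year"): {("NY", 2015): 5, ("NY", 2016): 7, ("TO", 2016): 4},
    ("branch", "item"): {("NY", "I1"): 3, ("NY", "I2"): 4, ("NY", "I3"): 5,
                         ("TO", "I1"): 1, ("TO", "I2"): 3},
}
source = smallest_child(("branch",), computed)
c_branch = aggregate(computed[source], source, ("branch",))

print(source)    # ('branch', 'year')
print(c_branch)  # {('NY',): 12, ('TO',): 4}
```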
-- Multi-Way Array Aggregation …

Used for MOLAP and full cube computation

Array-based “bottom-up” algorithm
Using multi-dimensional chunks
Simultaneous aggregation on multiple dimensions
Intermediate aggregate values are re-used for computing ancestor cuboids
Cannot do Apriori pruning: no iceberg optimization
[Figure: cuboid lattice over all, A, B, C, AB, AC, BC, ABC]
… -- Multi-way Array Aggregation …

Partition arrays into chunks (a small subcube which fits in memory).

Compressed sparse array addressing: (chunk_id, offset)

Compute aggregates in “multiway” by visiting cube cells in the order which
minimizes the # of times to visit each cell, and reduces memory access and
storage cost.
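A simplified sketch of the one-pass idea, assuming a sparse 3-D array stored as a dictionary and chunks visited with A varying fastest, then B, then C. It shows only the simultaneous update of the AB, AC, and BC planes; it does not model compressed chunk addressing or the release of finished plane chunks from memory. The function name and toy data are hypothetical.

```python
from itertools import product

def multiway_aggregate(cells, shape, chunk):
    """One pass over a chunked 3-D array, updating AB, AC and BC together.

    cells: sparse dict (a, b, c) -> measure; shape and chunk give sizes per dimension.
    """
    (na, nb, nc), (ca, cb, cc) = shape, chunk
    ab, ac, bc = {}, {}, {}
    # The last range varies fastest, so chunks are visited A-first, then B, then C.
    for c0, b0, a0 in product(range(0, nc, cc), range(0, nb, cb), range(0, na, ca)):
        for a, b, c in product(range(a0, min(a0 + ca, na)),
                               range(b0, min(b0 + cb, nb)),
                               range(c0, min(c0 + cc, nc))):
            v = cells.get((a, b, c))
            if v is None:
                continue  # empty cell in the sparse array
            # Each visited cell contributes to all three 2-D planes at once.
            ab[a, b] = ab.get((a, b), 0) + v
            ac[a, c] = ac.get((a, c), 0) + v
            bc[b, c] = bc.get((b, c), 0) + v
    return ab, ac, bc

cells = {(0, 0, 0): 2, (0, 0, 3): 5, (1, 2, 3): 1, (3, 2, 3): 4}
ab, ac, bc = multiway_aggregate(cells, shape=(4, 4, 4), chunk=(2, 2, 2))
print(ab[0, 0], ac[0, 3], bc[2, 3])  # 7 5 5
```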
[Figure: a 3-D array over dimensions A (a0 to a3), B (b0 to b3), and C (c0 to c3), partitioned into 64 chunks numbered 1 to 64]
What is the best traversal order for multi-way aggregation?
… -- Multi-way Array Aggregation …
[Figure: the same 3-D array of 64 chunks over dimensions A, B, and C]
AB requires the longest scan, i.e., scanning up to the 49th chunk.
… -- Multi-way Array Aggregation …




Assume the sizes of dimensions A, B, and C are 40, 400, and 4000,
respectively.
Therefore AB is the smallest and BC is the largest of the 2-D planes.
If chunks are scanned as 1, 2, 3, …, then 156,000 memory units
are needed (40*400 + 40*1000 + 100*1000).
If chunks are scanned as 1, 17, 33, 49, 5, 21, 37, …, then 1,641,000
memory units are needed (aggregation ordering AB-AC-BC):
400*4000 + 40*1000 + 10*100.
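The two memory figures can be reproduced with a small calculation. The sketch assumes each dimension is split into four equal partitions (chunk sizes 10, 100, and 1000, which match the terms in the arithmetic above) and that the scan order is given from the fastest-varying dimension to the slowest. The function name is hypothetical.

```python
def multiway_memory(sizes, chunks, scan_order):
    """Minimum memory units for the three 2-D planes of a 3-D array.

    scan_order lists the dimensions from fastest-varying to slowest.
    """
    fast, mid, slow = scan_order
    whole_plane = sizes[fast] * sizes[mid]   # plane that must stay fully in memory
    one_row = sizes[fast] * chunks[slow]     # one row of chunks of a second plane
    one_chunk = chunks[mid] * chunks[slow]   # a single chunk of the third plane
    return whole_plane + one_row + one_chunk

sizes = {"A": 40, "B": 400, "C": 4000}
chunks = {"A": 10, "B": 100, "C": 1000}  # assumption: four partitions per dimension

print(multiway_memory(sizes, chunks, ("A", "B", "C")))  # 156000  (order 1, 2, 3, ...)
print(multiway_memory(sizes, chunks, ("C", "B", "A")))  # 1641000 (order 1, 17, 33, 49, ...)
```

The order that keeps the smallest plane (AB) whole in memory is the cheaper one by roughly a factor of ten.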
… -- Multi-way Array Aggregation …
[Figure: two cuboid lattices (all; A, B, C; AB, AC, BC; ABC) showing the two aggregation orderings. One needs 156,000 memory units; the other needs 1,641,000 memory units.]
… -- Multi-way Array Aggregation

Method: the planes should be sorted and computed according to
their size in ascending order.
Idea: keep the smallest plane in main memory; fetch and
compute only one chunk at a time for the largest plane.
Limitation of the method: it performs well only for a small number
of dimensions.
If there are a large number of dimensions, “top-down”
computation and iceberg cube computation methods can be
explored.
-- Bottom-Up Computation (BUC) …
Bottom-up cube computation
(Note: top-down in our view!)
Divides dimensions into partitions
and facilitates iceberg pruning
If a partition does not satisfy
min_sup, its descendants can be
pruned
If minsup = 1 ⇒ compute full
CUBE!
No simultaneous aggregation
[Figure: lattice of the 4-D cube: all; A, B, C, D; AB, AC, AD, BC, BD, CD; ABC, ABD, ACD, BCD; ABCD]
[Figure: BUC processing order: 1 all, 2 A, 3 AB, 4 ABC, 5 ABCD, 6 ABD, 7 AC, 8 ACD, 9 AD, 10 B, 11 BC, 12 BCD, 13 BD, 14 C, 15 CD, 16 D]
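The numbering on the BUC lattice corresponds to a depth-first expansion of the dimensions. A minimal sketch that reproduces this visiting order for dimensions A to D (the function name is hypothetical, and no data or pruning is involved here):

```python
def buc_order(dims, prefix=(), start=0, out=None):
    """List cuboids in the depth-first order in which BUC visits them."""
    if out is None:
        out = []
    out.append("".join(prefix) or "all")
    # Extend the current cuboid only with dimensions that come later in `dims`.
    for i in range(start, len(dims)):
        buc_order(dims, prefix + (dims[i],), i + 1, out)
    return out

order = buc_order("ABCD")
for position, cuboid in enumerate(order, start=1):
    print(position, cuboid)
```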
BUC: Partitioning




Usually, the entire data set can’t fit in main memory
Sort distinct values, partition into blocks that fit
Continue processing
Optimizations
• Partitioning: external sorting, hashing, counting sort
• Ordering dimensions to encourage pruning: cardinality, skew, correlation
  The higher the cardinality, the smaller the partitions and the greater the pruning opportunity
• Collapsing duplicates: can’t do holistic aggregates anymore!
Ideally, the most discriminating dimension, with
higher cardinality and less skew, is
processed first.
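As a rough illustration of the ordering heuristic, the sketch below sorts dimensions by decreasing cardinality only. It deliberately ignores skew and correlation, which the slide also lists, and the rows are invented toy data; `order_dimensions` is a hypothetical name.

```python
def order_dimensions(rows, dims):
    """Sort dimension names by decreasing number of distinct values."""
    cardinality = {d: len({row[d] for row in rows}) for d in dims}
    return sorted(dims, key=lambda d: -cardinality[d])

# Toy fact table (made up): three products, two quarters, one branch.
rows = [
    {"branch": "Toronto", "product": "I1", "quarter": "Q1"},
    {"branch": "Toronto", "product": "I2", "quarter": "Q1"},
    {"branch": "Toronto", "product": "I3", "quarter": "Q2"},
]
print(order_dimensions(rows, ["branch", "product", "quarter"]))
# ['product', 'quarter', 'branch']
```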
--- BUC: Example (Having count(*) > 5) …
New York
      Q1  Q2  Q3  Q4
I1     5   1   0   1
I2     1   0   8   9
I3     1   1   1   8

Toronto
      Q1  Q2  Q3  Q4
I1     3   1   1   2
I2     8   1   2   1
I3     2   1  11   8
… --- BUC: Example (Having count(*) > 5)
[Figure: BUC processing tree over the cuboids All, Q, (Q,P), (B,P,Q), (Q,B), P, (P,B), and B, annotated with processing-order numbers]
All: 77
Q cuboid
Q1  Q2  Q3  Q4
20   5  23  29
(Q,P) cuboid
      Q1  Q2  Q3  Q4
I1     8   2   1   3
I2     9   1  10  10
I3     3   2  12  16
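The example can be replayed with a compact BUC sketch. It assumes the per-branch tables as read from the slide, treats each table entry as a pre-counted (quarter, product, branch) cell, processes dimensions in the order Q, P, B, and expresses count(*) > 5 as a minimum support of 6. This is a simplified in-memory version without external partitioning; the function name `buc` is hypothetical.

```python
def buc(cells, num_dims, min_sup, start=0, cell=None, out=None):
    """Iceberg cube via BUC over pre-counted cells (tuple -> count)."""
    if cell is None:
        cell, out = ("*",) * num_dims, {}
    total = sum(cells.values())
    if total < min_sup:
        return out  # prune: no descendant of this partition can reach min_sup
    out[cell] = total
    for d in range(start, num_dims):
        # Partition the current input on dimension d.
        partitions = {}
        for key, cnt in cells.items():
            partitions.setdefault(key[d], {})[key] = cnt
        for value, part in partitions.items():
            child = cell[:d] + (value,) + cell[d + 1:]
            buc(part, num_dims, min_sup, d + 1, child, out)
    return out

quarters = ["Q1", "Q2", "Q3", "Q4"]
tables = {
    "New York": {"I1": [5, 1, 0, 1], "I2": [1, 0, 8, 9], "I3": [1, 1, 1, 8]},
    "Toronto":  {"I1": [3, 1, 1, 2], "I2": [8, 1, 2, 1], "I3": [2, 1, 11, 8]},
}
cells = {(q, p, b): n
         for b, prods in tables.items()
         for p, row in prods.items()
         for q, n in zip(quarters, row) if n > 0}

cube = buc(cells, 3, min_sup=6)
print(cube[("*", "*", "*")])      # 77
print(("Q2", "*", "*") in cube)   # False: Q2 totals 5, so its partition is pruned
print(cube[("Q1", "I1", "*")])    # 8
```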
Till Now


Multiway array aggregation:
Aggregates simultaneously on multiple dimensions. Multiple cuboids can be
computed simultaneously in one pass. Dynamic structure with simultaneous
aggregation.
BUC:
Facilitates a priori pruning. During partitioning, each partition’s count is
compared with min sup. The recursion stops if the count does not satisfy
min sup.
Summary

Data Cube Materialization
Full Materialization
Partial Materialization: iceberg cubes, shell fragments
Data Cube Computation Methods
Multiway array aggregation
BUC for computing iceberg cubes
Next Class
Star Cubing
Shell Fragments for Fast High-Dimensional OLAP
Exploration and Discovery in Multidimensional Databases