Notes 8

Multi-way Algorithm for Cube Computation
CPS 196.03
First Programming Project

Individual project, 15 Points in final grade

Sales(customer_id, item_id, item_group, item_price, purchase_date)
- Will be provided as a file during demo and for generating performance numbers for project report
Task 1: 5 Points

Interface to enter MIN_SUPPORT (% of customers)

Find frequent itemsets using Apriori (set of item_id’s)
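
A minimal sketch of Task 1, assuming each customer's purchases are treated as one basket and MIN_SUPPORT is passed as a fraction of customers (0.5 for 50%). The function name `apriori` and the basket dictionary are illustrative choices, not part of the project specification.

```python
from itertools import combinations

def apriori(baskets, min_support):
    """baskets: dict customer_id -> set of item_ids.
    min_support: fraction of customers in [0, 1] (assumed convention)."""
    threshold = min_support * len(baskets)

    def support(itemset):
        # Number of customers whose basket contains every item in itemset.
        return sum(1 for items in baskets.values() if itemset <= items)

    all_items = {i for items in baskets.values() for i in items}
    level = {frozenset([i]) for i in all_items
             if support(frozenset([i])) >= threshold}
    frequent = {}
    k = 1
    while level:
        for itemset in level:
            frequent[itemset] = support(itemset)
        k += 1
        # Join step: unions of two frequent (k-1)-itemsets that have size k.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset must already be frequent.
        level = {c for c in candidates
                 if all(frozenset(s) in frequent for s in combinations(c, k - 1))
                 and support(c) >= threshold}
    return frequent

# Hypothetical toy baskets, not taken from the project datasets.
toy = {1: {1, 2, 3}, 2: {1, 2}, 3: {1, 3}, 4: {2, 3}}
result = apriori(toy, 0.5)
print(sorted((sorted(s), n) for s, n in result.items()))
```

Recounting support with a full scan per candidate keeps the sketch short; a real submission would likely count all candidates of one level in a single pass.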
Task 2: 5 Points (Section 5.5 in the textbook)

Interface to enter two constraint types (e.g., SUM(item_price) op const)

Use the constraints in Apriori as effectively as possible; study and demonstrate the performance improvement
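
One way such a constraint can be pushed into candidate generation, sketched under two assumptions: every item has a single non-negative price, and the constraint has the form SUM(item_price) op const. With an upper bound (`<=` or `<`), an itemset that violates the constraint cannot have a satisfying superset, so it can be dropped before support counting. With a lower bound, that pruning is not safe and the check has to wait until output. The names `OPS`, `satisfies`, and `prune_candidates` are hypothetical.

```python
import operator

OPS = {"<=": operator.le, "<": operator.lt, ">=": operator.ge, ">": operator.gt}

def satisfies(itemset, price, op, const):
    """True if SUM(item_price) over the itemset stands in relation op to const."""
    return OPS[op](sum(price[i] for i in itemset), const)

def prune_candidates(candidates, price, op, const):
    """Drop candidates early only when that is safe (upper-bound constraints,
    assuming non-negative prices); otherwise keep them all."""
    if op in ("<=", "<"):
        return [c for c in candidates if satisfies(c, price, op, const)]
    return list(candidates)

# Prices for items 123 and 12 as they appear in the sample file lines.
price = {123: 54, 12: 101}
cands = [frozenset({123}), frozenset({12}), frozenset({123, 12})]
kept_upper = prune_candidates(cands, price, "<=", 100)
kept_lower = prune_candidates(cands, price, ">=", 100)
print(kept_upper, len(kept_lower))
```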
Task 3: 5 Points

Extension of your choice. Examples include (i) association rules, (ii) complex constraints, (iii) sequential patterns, (iv) variants of Apriori, (v) FP-growth
File Format

10,123,3,54,4/4/2008

10,12,4,101,4/5/2008

14,123,3,54,8/4/2008

…

Caveats:
- Customer Vs. Item
- Three datasets: Toy, Medium, and Large
- Comma-separated file, one purchase per line in file, no header in file
- Integers for simplicity
- Note date format
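
A small sketch of reading this format, assuming the column order of the Sales schema (customer_id, item_id, item_group, item_price, purchase_date) and grouping purchases into one basket per customer. The grouping is an assumption that follows from support being defined over customers; the date column is left unparsed, and the helper name `load_baskets` is hypothetical.

```python
import csv
import io
from collections import defaultdict

def load_baskets(lines):
    """Read comma-separated purchases (no header) into customer_id -> set of item_ids."""
    baskets = defaultdict(set)
    for row in csv.reader(lines):
        if not row:  # tolerate blank lines
            continue
        customer_id, item_id = int(row[0]), int(row[1])
        baskets[customer_id].add(item_id)
    return dict(baskets)

# The three sample lines from the slide.
SAMPLE = "10,123,3,54,4/4/2008\n10,12,4,101,4/5/2008\n14,123,3,54,8/4/2008\n"
baskets = load_baskets(io.StringIO(SAMPLE))
print(baskets)
```

For a real file, `load_baskets(open(path, newline=""))` would work the same way, with `path` supplied by the caller.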
First Programming Project: Milestones

Feb 3: Project announced

Feb 17: Mid-project report due
- Describe progress and planned extensions
- Describe detailed algorithms for all three tasks

Feb 17: Sample data file will be provided for generating performance results for project report

March 2: Submit code, README file to run code, code documentation, and final project report

March 2-4: Project demos (random assignment)

March 6: Spring break. Second project announced
Finalized Grading Criteria for Class

Homeworks: 15 points

Programming projects: 40 points

Midterm: 20 points
- Note: Midterm is on Feb 19 (Thu) in class

Final: 25 points
ROLAP Server

Relational OLAP Server

Table "sale":
prodId  date  sum
p1      1     62
p2      1     19
p1      2     48

[Figure: tools and utilities sit on top of a ROLAP server, which sits on a relational DBMS. Annotation: special indices, tuning; schema is “denormalized”.]
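
To make the relational side concrete, here is a sketch that loads the three "sale" rows into an in-memory SQLite table and rolls them up by product with GROUP BY. SQLite stands in for the relational DBMS purely for illustration, and the measure column is renamed `amount` (an assumption) to avoid confusion with the SQL `SUM` function.

```python
import sqlite3

def rollup_by_product(rows):
    """Aggregate the measure per prodId with a plain SQL GROUP BY."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sale (prodId TEXT, date INTEGER, amount INTEGER)")
    conn.executemany("INSERT INTO sale VALUES (?, ?, ?)", rows)
    result = conn.execute(
        "SELECT prodId, SUM(amount) FROM sale GROUP BY prodId ORDER BY prodId"
    ).fetchall()
    conn.close()
    return result

# Rows of the "sale" table from the slide.
SALE = [("p1", 1, 62), ("p2", 1, 19), ("p1", 2, 48)]
totals = rollup_by_product(SALE)
print(totals)
```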
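MOLAP Server
Multi-Dimensional OLAP Server

[Figure: M.D. tools and utilities sit on top of a multidimensional server, which could also sit on a relational DBMS. The Sales data is drawn as an array; labels visible on it: Product; Date; A, B; milk, soda, eggs, soap; 1, 2, 3, 4.]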
MOLAP

[Figure: data cube with a Date axis (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), a Country axis (U.S.A, Canada, Mexico, sum), and a third axis (TV, PC, VCR, sum). Highlighted cell: total annual sales of TV in U.S.A.]
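MOLAP

[Figure: 3-D array with dimensions A (a0 to a3), B (b0 to b3), and C (c0 to c3), split into 64 chunks numbered 1 to 64. Chunks 1 to 4 run along A at b0, c0; chunks 13 to 16 run along A at b3, c0; chunks 29 to 32, 45 to 48, and 61 to 64 are the b3 rows for c1, c2, and c3.]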
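
The chunk numbers in the figure are consistent with A varying fastest, then B, then C. A short sketch of that mapping, assuming 4 chunks per dimension and 0-based chunk coordinates (the function names are illustrative):

```python
def chunk_number(a, b, c, n=4):
    """1-based chunk number when A varies fastest, then B, then C."""
    return 1 + a + n * b + n * n * c

def chunk_coords(number, n=4):
    """Inverse mapping: 1-based chunk number -> (a, b, c)."""
    index = number - 1
    return index % n, (index // n) % n, index // (n * n)

print(chunk_number(0, 3, 1))  # chunk at a0, b3, c1
print(chunk_coords(20))       # coordinates of chunk 20
```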
Challenges in MOLAP

Storing large arrays for efficient access
- Row-major, column-major
- Chunking
- Compressing sparse arrays
Creating array data from data in tables
- Efficient techniques for Cube computation

Topics are discussed in the paper for reading
ROLAP Vs. MOLAP
What do the authors say?
- What can you do in MOLAP that you cannot do in ROLAP?
- Can the algorithm in this paper be used in ROLAP?
Array Storage
Chunks
- Compression
- Chunk-offset compression Vs. LZW
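
A sketch of the idea behind chunk-offset compression as it is usually described: a sparse chunk stores only its non-empty cells, each as an (offset within chunk, value) pair. The flat-list chunk layout and the use of 0 as the "empty" marker are assumptions made for this example.

```python
def compress_chunk(dense, empty=0):
    """Keep only non-empty cells as (offset, value) pairs."""
    return [(offset, value) for offset, value in enumerate(dense) if value != empty]

def decompress_chunk(pairs, size, empty=0):
    """Rebuild the dense chunk from its (offset, value) pairs."""
    dense = [empty] * size
    for offset, value in pairs:
        dense[offset] = value
    return dense

# Hypothetical 8-cell chunk with two valid cells.
chunk = [0, 0, 7, 0, 0, 0, 3, 0]
pairs = compress_chunk(chunk)
print(pairs)
```

Unlike a general-purpose scheme such as LZW, the pairs can be aggregated directly without first expanding the chunk.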
Loading Arrays from Tables
The easy case: array fits in memory
- Else: Partitions
Loading Arrays from Tables
Suppose there are 1000 chunks. 10 chunks can fit in memory. The partition size is 10 chunks.

[Figure: the table is split into 100 partitions of 10 chunks each.]
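
The partitioning step can be sketched as follows, assuming chunks are numbered from 0 and that a caller-supplied function `chunk_of` (hypothetical) maps a tuple to its chunk number. Tuples that fall into the same group of 10 consecutive chunks land in the same partition, so each partition can later be loaded and turned into chunks entirely in memory.

```python
from collections import defaultdict

def partition_of(chunk_id, chunks_per_partition=10):
    """Partition index for a 0-based chunk number."""
    return chunk_id // chunks_per_partition

def split_into_partitions(tuples, chunk_of, chunks_per_partition=10):
    """Route every tuple to the partition that holds its chunk."""
    partitions = defaultdict(list)
    for t in tuples:
        partitions[partition_of(chunk_of(t), chunks_per_partition)].append(t)
    return dict(partitions)

# With 1000 chunks and 10 chunks per partition there are 100 partitions.
n_partitions = len({partition_of(c) for c in range(1000)})
print(n_partitions)

# Hypothetical tuples whose first field is already the chunk number.
demo = split_into_partitions([(0, "x"), (9, "y"), (10, "z")], chunk_of=lambda t: t[0])
print(demo)
```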
Basic Array Cubing Algo

First find minimum spanning tree
- Hierarchy of aggregates
Compute each (k-1)-dimensional aggregate from its best k-dimensional aggregate
- One pass through the array in the right order
Let us look at some basics first
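
A sketch of picking the "best" parent for every aggregate, assuming "best" means the smallest k-dimensional aggregate, with size estimated as the product of the dimension cardinalities. The cardinalities used below (A: 40, B: 400, C: 4000) are the ones from the Memory Used slides; the function name is illustrative.

```python
from itertools import combinations
from math import prod

def minimum_size_tree(cardinality):
    """Map each group-by (a sorted tuple of dimensions) to its smallest parent."""
    dims = sorted(cardinality)
    parent = {}
    for k in range(len(dims) - 1, -1, -1):
        for child in combinations(dims, k):
            # Parents are the group-bys with exactly one extra dimension.
            candidates = [tuple(sorted(child + (d,))) for d in dims if d not in child]
            parent[child] = min(candidates,
                                key=lambda g: prod(cardinality[d] for d in g))
    return parent

tree = minimum_size_tree({"A": 40, "B": 400, "C": 4000})
for child, par in sorted(tree.items(), key=lambda kv: (-len(kv[0]), kv[0])):
    print(child, "<-", par)
```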
Chunked 3D Array

[Figure: the 4 x 4 x 4 chunked array, chunks numbered 1 to 64, with the axes relabelled so that chunks 1 to 4 run along C at a0, b0 and chunks 61 to 64 run along C at a3, b3.]
Dimension order CBA
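“a0b0” chunk
Chunk read order: a0b0c0, c1, c2, c3; a0b1c0, c1, c2, c3; a0b2c0, c1, c2, c3; a0b3c0, c1, c2, c3; …
State after reading the four a0b0 chunks (each contribution marked x):
AB plane: a0b0 = xxxx
AC plane: a0c0 = x, a0c1 = x, a0c2 = x, a0c3 = x
BC plane: b0c0 = x, b0c1 = x, b0c2 = x, b0c3 = x; rows b1, b2, b3 still empty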
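a0b1 chunk
Chunk read order: a0b0c0, c1, c2, c3; a0b1c0, c1, c2, c3; a0b2c0, c1, c2, c3; a0b3c0, c1, c2, c3; …
State after reading the four a0b1 chunks (each contribution marked y):
AB plane: a0b1 = yyyy
AC plane: a0c0 = xy, a0c1 = xy, a0c2 = xy, a0c3 = xy
BC plane: row b0 = x, x, x, x; row b1 = y, y, y, y; rows b2, b3 still empty
Done with a0b0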
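a0b2 chunk
Chunk read order: a0b0c0, c1, c2, c3; a0b1c0, c1, c2, c3; a0b2c0, c1, c2, c3; a0b3c0, c1, c2, c3; …
State after reading the four a0b2 chunks (each contribution marked z):
AB plane: a0b2 = zzzz
AC plane: a0c0 = xyz, a0c1 = xyz, a0c2 = xyz, a0c3 = xyz
BC plane: row b0 = x, x, x, x; row b1 = y, y, y, y; row b2 = z, z, z, z; row b3 still empty
Done with a0b1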
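Table Visualization
Chunk read order: a0b0c0, c1, c2, c3; a0b1c0, c1, c2, c3; a0b2c0, c1, c2, c3; a0b3c0, c1, c2, c3
State after reading the four a0b3 chunks (each contribution marked u):
AB plane: a0b3 = uuuu
AC plane: a0c0 = xyzu, a0c1 = xyzu, a0c2 = xyzu, a0c3 = xyzu
BC plane: row b0 = x, x, x, x; row b1 = y, y, y, y; row b2 = z, z, z, z; row b3 = u, u, u, u
Done with a0b2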
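Table Visualization
Chunk read order: …; a1b0c0, c1, c2, c3; a1b1c0, c1, c2, c3; a1b2c0, c1, c2, c3; a1b3c0, c1, c2, c3; …
State after reading the four a1b0 chunks (each contribution marked x):
AB plane: a1b0 = xxxx
AC plane: a1c0 = x, a1c1 = x, a1c2 = x, a1c3 = x
BC plane: row b0 = xx, xx, xx, xx; row b1 = y, y, y, y; row b2 = z, z, z, z; row b3 = u, u, u, u
Done with a0b3
Done with a0c*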
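a3b3 chunk (last one)
Chunk read order: …; a3b0c0, c1, c2, c3; a3b1c0, c1, c2, c3; a3b2c0, c1, c2, c3; a3b3c0, c1, c2, c3
State after reading the four a3b3 chunks (each contribution marked u):
AB plane: a3b3 = uuuu
AC plane: a3c0 = xyzu, a3c1 = xyzu, a3c2 = xyzu, a3c3 = xyzu
BC plane: row b0 = xxxx, xxxx, xxxx, xxxx; row b1 = yyyy, yyyy, yyyy, yyyy; row b2 = zzzz, zzzz, zzzz, zzzz; row b3 = uuuu, uuuu, uuuu, uuuu
Done with a0b3
Done with a0c*
Done with b*c*
Finish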
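
The walk-through above can be replayed in a few lines. This sketch assumes a 4 x 4 x 4 grid of chunks read with C varying fastest, then B, then A, and counts each chunk as one contribution (the x, y, z, u marks). It tracks how many cells of each plane are being accumulated at once and when each cell is finished. All names are illustrative.

```python
def multiway_pass(n=4):
    """One pass over the chunks in CBA order, aggregating AB, AC and BC together."""
    live = {"AB": {}, "AC": {}, "BC": {}}   # cells currently being accumulated
    peak = {"AB": 0, "AC": 0, "BC": 0}      # most cells held at once, per plane
    flushed = []                            # (plane, cell, contributions) when done
    for a in range(n):
        for b in range(n):
            for c in range(n):              # C varies fastest
                for plane, key in (("AB", (a, b)), ("AC", (a, c)), ("BC", (b, c))):
                    live[plane][key] = live[plane].get(key, 0) + 1
                    peak[plane] = max(peak[plane], len(live[plane]))
            # All c chunks for (a, b) have been read: this AB cell is complete.
            flushed.append(("AB", (a, b), live["AB"].pop((a, b))))
        # All b and c chunks for this a have been read: its AC row is complete.
        for c in range(n):
            flushed.append(("AC", (a, c), live["AC"].pop((a, c))))
    # The BC plane is only complete after the very last chunk.
    for key in sorted(live["BC"]):
        flushed.append(("BC", key, live["BC"].pop(key)))
    return peak, flushed

peak, flushed = multiway_pass()
print(peak)
```

The peak counts correspond to the chunk requirements quoted for dimension order CBA on the Memory Used slide: 1 for AB, 4 for AC, 16 for BC.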
Memory Used



A: 40 distinct values
B: 400 distinct values
C: 4000 distinct values

CBA: Dimension Order
Plane AB: Need 1 chunk (10 * 100 * 1)
Plane AC: Need 4 chunks (10 * 1000 * 4)
Plane BC: Need 16 chunks (100 * 1000 * 16)

Total memory: 1,641,000



Memory Used



A: 40 distinct values
B: 400 distinct values
C: 4000 distinct values

ABC: Dimension Order
Plane BC: Need 1 chunk (1000 * 100 * 1)
Plane AC: Need 4 chunks (1000 * 10 * 4)
Plane AB: Need 16 chunks (100 * 10 * 16)

Total memory: 156,000
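
Both totals can be reproduced with a short calculation. The sketch assumes chunk side lengths of 10, 100, and 1000 for A, B, and C (inferred from the products on the slides) and an assumed prefix rule: for each plane, dimensions that match a prefix of the read order (fastest first) contribute their full cardinality, and the remaining dimensions contribute one chunk length. The function name is illustrative.

```python
from math import prod

def plane_memory(order, cardinality, chunk_len):
    """Memory (in cells) needed per 2-D plane; order lists dimensions fastest first."""
    needs = {}
    for dropped in order:
        kept = [d for d in order if d != dropped]
        # Length of the prefix of the read order that the plane shares.
        p = 0
        while p < len(kept) and kept[p] == order[p]:
            p += 1
        full = prod(cardinality[d] for d in kept[:p])
        partial = prod(chunk_len[d] for d in kept[p:])
        needs["".join(sorted(kept))] = full * partial
    return needs

CARD = {"A": 40, "B": 400, "C": 4000}
CHUNK = {"A": 10, "B": 100, "C": 1000}  # inferred from the slide's products

cba = plane_memory("CBA", CARD, CHUNK)
abc = plane_memory("ABC", CARD, CHUNK)
print(cba, sum(cba.values()))
print(abc, sum(abc.values()))
```

Reading the array with the smallest dimension varying fastest keeps the large BC plane down to a single chunk, which is where the roughly tenfold saving comes from.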



Basic Array Cubing Algo

First find minimum spanning tree
- Hierarchy of aggregates
Compute each (k-1)-dimensional aggregate from its best k-dimensional aggregate
- One pass through the array in the right order
What are the advantages and disadvantages of this algorithm?
Multi-way Array Cubing Algo
What is the main idea?
- Rule 1 on Page 163
- Minimum memory spanning tree
- Figure 2
- Figures 3 and 4
Theorem 1
- Basic idea of multi-pass algorithm
- Tradeoff between memory usage and number of passes
D1   D2   D3   M
A3   B2   C1   20
A7   B2   C1   10
A13  B1   C12  30
A2   B2   C1   10
A3   B7   C12  40
A15  B7   C1   20
A6   B1   C12  10
A13  B2   C1   20
A1   B11  C1   100
A1   B11  C1   50
A13  B2   C1   30
A3   B11  C12  10
A13  B7   C1   40
A10  B1   C1   50
A3   B1   C12  10
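
The slide gives no instructions for this table, so the following is only one possible use of it: a naive reference cube that sums M for every subset of (D1, D2, D3) directly from the rows. It is not the multi-way array algorithm, but it can serve to check the output of one. The function name and the tuple layout are assumptions.

```python
from collections import defaultdict
from itertools import combinations

# Rows of the table above: (D1, D2, D3, M).
ROWS = [
    ("A3", "B2", "C1", 20), ("A7", "B2", "C1", 10), ("A13", "B1", "C12", 30),
    ("A2", "B2", "C1", 10), ("A3", "B7", "C12", 40), ("A15", "B7", "C1", 20),
    ("A6", "B1", "C12", 10), ("A13", "B2", "C1", 20), ("A1", "B11", "C1", 100),
    ("A1", "B11", "C1", 50), ("A13", "B2", "C1", 30), ("A3", "B11", "C12", 10),
    ("A13", "B7", "C1", 40), ("A10", "B1", "C1", 50), ("A3", "B1", "C12", 10),
]

def naive_cube(rows, n_dims=3):
    """Sum the measure for every group-by; keys are tuples of kept column indexes."""
    result = {}
    for k in range(n_dims + 1):
        for keep in combinations(range(n_dims), k):
            agg = defaultdict(int)
            for row in rows:
                agg[tuple(row[i] for i in keep)] += row[-1]
            result[keep] = dict(agg)
    return result

cube = naive_cube(ROWS)
print(cube[()])      # grand total
print(cube[(2,)])    # totals per D3 value
```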