Data Warehousing Overview
CS245 Notes 11
Hector Garcia-Molina
Stanford University

Outline
- What is a data warehouse?
- Why a warehouse?
- Models & operations
- Implementing a warehouse
What is a Warehouse?
- Collection of diverse data
  - subject oriented
  - aimed at executive, decision maker
  - often a copy of operational data
  - with value-added data (e.g., summaries, history)
  - integrated
  - time-varying
  - non-volatile
- Collection of tools
  - gathering data
  - cleansing, integrating, ...
  - querying, reporting, analysis
  - data mining
  - monitoring, administering warehouse
Motivating Examples
- Forecasting
- Comparing performance of units
- Monitoring, detecting fraud
- Visualization

Warehouse Architecture
[diagram: Client, Client → Query & Analysis → Warehouse (with Metadata) → Integration → Source, Source, Source]
Query-Driven Approach
An alternative to warehousing. Two approaches:
- Query-driven (lazy)
- Warehouse (eager)

[diagram: Client, Client → Mediator → Wrapper, Wrapper, Wrapper → Source, Source, Source]
Advantages of Warehousing
- High query performance
- Queries not visible outside warehouse
- Local processing at sources unaffected
- Can operate when sources unavailable
- Can query data not stored in a DBMS
- Extra information at warehouse
  - Modify, summarize (store aggregates)
  - Add historical information

Advantages of Query-Driven
- No need to copy data
  - less storage
  - no need to purchase data
- More up-to-date data
- Query needs can be unknown
- Only query interface needed at sources
- May be less draining on sources

Warehouse Models & Operators
Data Models
- relational
- cubes

Operators

Example relational tables:

product(prodId, name, price):
  p1, bolt, 10
  p2, nut, 5

store(storeId, city):
  c1, nyc
  c2, sfo
  c3, la

customer(custId, name, address, city):
  53, joe, 10 main, sfo
  81, fred, 12 main, sfo
  111, sally, 80 willow, la

sale(orderId, date, custId, prodId, storeId, qty, amt):
  o100, 1/7/97, 53, p1, c1, 1, 12
  o102, 2/7/97, 53, p2, c1, 2, 11
  o105, 3/8/97, 111, p1, c3, 5, 50
Star Schema
Terms:
- Fact table
- Dimension tables
- Measures

Fact table:
  sale(orderId, date, custId, prodId, storeId, qty, amt)
Dimension tables:
  product(prodId, name, price)
  customer(custId, name, address, city)
  store(storeId, city)
Dimension Hierarchies

store(storeId, cityId, tId, mgr):
  s5, sfo, t1, joe
  s7, sfo, t2, fred
  s9, la, t1, nancy

sType(tId, size, location):
  t1, small, downtown
  t2, large, suburbs

city(cityId, pop, regId):
  sfo, 1M, north
  la, 5M, south

region(regId, name):
  north, cold region
  south, warm region

Cube

Fact table view:
  sale(prodId, storeId, amt):
    p1, c1, 12
    p2, c1, 11
    p1, c3, 50
    p2, c2, 8

Multi-dimensional cube (dimensions = 2):

        c1   c2   c3
  p1    12        50
  p2    11    8
3-D Cube

Fact table view:
  sale(prodId, storeId, date, amt):
    p1, c1, 1, 12
    p2, c1, 1, 11
    p1, c3, 1, 50
    p2, c2, 1, 8
    p1, c1, 2, 44
    p1, c2, 2, 4

Multi-dimensional cube (dimensions = 3):

  day 1:     c1   c2   c3
        p1   12        50
        p2   11    8
  day 2:     c1   c2   c3
        p1   44    4

Other schemas:
- snowflake schema
- constellations

Data Analysis
- Traditional aggregation
- selection
- clean data
- find trends
- ...
Aggregates
- Add up amounts for day 1
- In SQL: SELECT sum(amt) FROM SALE WHERE date = 1

sale(prodId, storeId, date, amt):
  p1, c1, 1, 12
  p2, c1, 1, 11
  p1, c3, 1, 50
  p2, c2, 1, 8
  p1, c1, 2, 44
  p1, c2, 2, 4

Answer: 81

Aggregates
- Add up amounts by day
- In SQL: SELECT date, sum(amt) FROM SALE GROUP BY date

ans(date, sum):
  1, 81
  2, 48

Another Example
- Add up amounts by day, product
- In SQL: SELECT date, prodId, sum(amt) FROM SALE GROUP BY date, prodId

ans(prodId, date, amt):
  p1, 1, 62
  p2, 1, 19
  p1, 2, 48

Aggregates
- Operators: sum, count, max, min, median, ave
- "Having" clause
- Using dimension hierarchy
  - average by region (within store)
  - maximum by month (within date)
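The aggregate queries above can be run as written; a minimal sketch using Python's sqlite3 with the sale table from the slides (table and column names follow the slides):

```python
import sqlite3

# Build the sale table from the slides and run the two aggregate queries.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sale (prodId TEXT, storeId TEXT, date INT, amt INT)")
rows = [("p1", "c1", 1, 12), ("p2", "c1", 1, 11), ("p1", "c3", 1, 50),
        ("p2", "c2", 1, 8),  ("p1", "c1", 2, 44), ("p1", "c2", 2, 4)]
conn.executemany("INSERT INTO sale VALUES (?,?,?,?)", rows)

# Total for day 1
(day1_total,) = conn.execute(
    "SELECT sum(amt) FROM sale WHERE date = 1").fetchone()
print(day1_total)  # 81

# Totals by day
by_day = dict(conn.execute(
    "SELECT date, sum(amt) FROM sale GROUP BY date").fetchall())
print(by_day)  # {1: 81, 2: 48}
```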
Cube Aggregation
Example: computing sums (rollup); drill-down goes the other way.

Base cube:
  day 1:     c1   c2   c3
        p1   12        50
        p2   11    8
  day 2:     c1   c2
        p1   44    4

Summed over days:
        c1   c2   c3
  p1    56    4   50
  p2    11    8

Summed over days and products:
  sum   c1   c2   c3
        67   12   50

Cube Operators
  sale(c1,*,*) = 67
  sale(*,p2,*) = 19
  sale(*,*,*) = 129
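A rollup is just a sum over the dropped dimensions; a minimal sketch in Python over the slide's (prodId, storeId, day) cells, with a hypothetical `rollup` helper:

```python
from collections import defaultdict

# (prodId, storeId, day) -> amt, taken from the slides
cells = {("p1", "c1", 1): 12, ("p2", "c1", 1): 11, ("p1", "c3", 1): 50,
         ("p2", "c2", 1): 8,  ("p1", "c1", 2): 44, ("p1", "c2", 2): 4}

def rollup(cells, keep):
    """Sum out every dimension not listed in `keep` (indices into the key)."""
    out = defaultdict(int)
    for key, amt in cells.items():
        out[tuple(key[i] for i in keep)] += amt
    return dict(out)

by_store = rollup(cells, keep=[1])   # sale(c,*,*) for each store c
total = rollup(cells, keep=[])       # sale(*,*,*)
print(by_store)   # {('c1',): 67, ('c3',): 50, ('c2',): 12}
print(total[()])  # 129
```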
Extended Cube
Add "*" entries that hold the aggregates along each dimension:

  day 1:     c1   c2   c3    *
        p1   12        50   62
        p2   11    8        19
        *    23    8   50   81
  day 2:     c1   c2   c3    *
        p1   44    4        48
        *    44    4        48
  * (all days):
             c1   c2   c3    *
        p1   56    4   50  110
        p2   11    8        19
        *    67   12   50  129

Aggregation Using Hierarchies
Hierarchy: city → region → country
(customer c1 in Region A; customers c2, c3 in Region B)

By region:
        region A   region B
  p1       56         54
  p2       11          8
Data Analysis
- Decision Trees
- Clustering
- Association Rules

Decision Trees
Example:
- Conducted a survey to see which customers were interested in a new model car
- Want to select customers for an advertising campaign

Training set — sale(custId, age, city, car, newCar):
  c1, 27, sf, taurus, yes
  c2, 35, la, van, yes
  c3, 40, sf, van, yes
  c4, 22, sf, taurus, yes
  c5, 50, la, merc, no
  c6, 25, la, taurus, no

One Possibility
  age < 30?
    Y: city = sf?  Y → likely,  N → unlikely
    N: age < 45?   Y → likely,  N → unlikely

Another Possibility
  car = taurus?
    Y: city = sf?   Y → likely,  N → unlikely
    N: car = van?   Y → likely,  N → unlikely
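The first tree can be checked against the training set; a minimal sketch, with the tree hand-encoded as a `predict` function (an illustration of applying the tree, not of the induction algorithm):

```python
# Training set from the slides: (custId, age, city, car, newCar)
training = [("c1", 27, "sf", "taurus", "yes"), ("c2", 35, "la", "van", "yes"),
            ("c3", 40, "sf", "van", "yes"),    ("c4", 22, "sf", "taurus", "yes"),
            ("c5", 50, "la", "merc", "no"),    ("c6", 25, "la", "taurus", "no")]

def predict(age, city):
    """'One Possibility' tree: age<30 splits on city=sf, else on age<45."""
    if age < 30:
        return "likely" if city == "sf" else "unlikely"
    return "likely" if age < 45 else "unlikely"

# The tree classifies every training row correctly ("likely" <-> yes)
results = [predict(age, city) == ("likely" if newCar == "yes" else "unlikely")
           for _, age, city, _, newCar in training]
print(all(results))  # True
```

The second tree fits the training set equally well, which is exactly the model-selection issue raised next.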
Issues
- Decision tree cannot be "too deep"
  - would not have statistically significant amounts of data for lower decisions
- Need to select the tree that most reliably predicts outcomes

Clustering
[figure: points in (income, education, age) space grouped into clusters]

Another Example: Text
- Each document is a vector
  - e.g., <100110...> contains words 1, 4, 5, ...
- Clusters contain "similar" documents
- Useful for understanding, searching documents
[figure: document clusters labeled sports, international news, business]
Issues
- Given desired number of clusters?
- Finding "best" clusters
- Are clusters semantically meaningful? (e.g., a "yuppies" cluster?)
- Using clusters for disk storage

Association Rule Mining
Market-basket data — sales records:
  tran1, cust33, {p2, p5, p8}
  tran2, cust45, {p5, p8, p11}
  tran3, cust12, {p1, p9}
  tran4, cust40, {p5, p8, p11}
  tran5, cust12, {p2, p9}
  tran6, cust12, {p9}

- Trend: products p5, p8 often bought together
- Trend: customer 12 likes product p9
Association Rule
- Rule: {p1, p3, p8}
- Support: number of baskets where these products all appear
- High-support set: support >= threshold s
- Problem: find all high-support sets

Implementation Issues
- ETL (extraction, transformation, loading)
  - Getting data to the warehouse
  - Entity resolution
- What to materialize?
- Efficient analysis
  - Association rule mining
  - ...
ETL: Monitoring Techniques
- Periodic snapshots
- Database triggers
- Log shipping
- Data shipping (replication service)
- Transaction shipping
- Polling (queries to source)
- Screen scraping
- Application-level monitoring

Each has its advantages & disadvantages!!
ETL: Data Cleaning
- Migration (e.g., yen → dollars)
- Scrubbing: use domain-specific knowledge (e.g., social security numbers)
- Fusion (e.g., mail list, customer merging)
    billing DB:  customer1(Joe)  →
                                    merged_customer(Joe)
    service DB:  customer2(Joe)  →
- Auditing: discover rules & relationships (like data mining)

More details: Entity Resolution
Applications
- comparison shopping
- mailing lists
- classified ads
- customer files
- counter-terrorism

Example: two records that refer to the same entity
  e1: {N: a, A: b, Ph: e}
  e2: {N: a, CC#: c, Exp: d, Ph: e}
merged: {N: a, A: b, CC#: c, Exp: d, Ph: e}
Why is ER Challenging?
- Huge data sets
- No unique identifiers
- Lots of uncertainty
- Many ways to skin the cat

Taxonomy: Pairwise vs Global
- Decide if r, s match only by looking at r, s?
- Or need to consider more (all) records?

  Nm: Pat Smith          Nm: Patrick Smith       Nm: Patricia Smith
  Ad: 123 Main St   vs   Ad: 132 Main St    or   Ad: 123 Main St
  Ph: (650) 555-1212     Ph: (650) 555-1212      Ph: (650) 777-1111
Taxonomy: Pairwise vs Global
- Global matching complicates things a lot!
  - e.g., may change a decision as new records arrive

Taxonomy: Outcome
- Partition of records
- Merged records
Merged records combine the attributes of their sources, e.g., after merging:

  {Nm: Patricia Smith, Ad: 132 Main St, Ph: (650) 777-1111, Hair: Black}
  + {Nm: Patricia Smith, Ad: 123 Main St, Ph: (650) 555-1212}
  → {Nm: Patricia Smith, Ad: 123 Main St,
     Ph: (650) 555-1212, (650) 777-1111, Hair: Black}

Merging can iterate:

  {Nm: Tom, Ad: 123 Main, BD: Jan 1, 85, Wk: IBM}
  + {Nm: Thomas, Ad: 123 Maim, Oc: lawyer}
  + {Nm: Tom, Wk: IBM, Oc: laywer, Sal: 500K}
  → {Nm: Tom, Ad: 123 Main, BD: Jan 1, 85, Wk: IBM, Oc: lawyer, Sal: 500K}

Taxonomy: Record Reuse
- One record related to multiple entities?
  e.g., the record {Ph: (650) 555-1212, Ad: 123 Main St} may belong to both
  Nm: Pat Smith Sr. and Nm: Pat Smith Jr.
Taxonomy: Record Reuse
- Partitions: records r, s, t may end up in groups {r, s} and {s, t},
  with s reused in both
- Merges: r and s merge into rs, while s and t merge into st
- Record reuse is complex and expensive!

Taxonomy: Multiple Entity Types
- papers and authors: author records a1-a5 linked to papers p1, p2, p5, p7 — same??
- person 1 and person 2 (brothers), each a member of Organization A /
  Organization B (business) — same??
Taxonomy: Exact vs Approximate
- Exact: partition products into cameras, CDs, books, ...; run ER within
  each category, so each category ends up fully resolved
- Approximate: e.g., sort terrorists by age; match "B Cooper, 30" only
  against records with ages 25-35
Implementation Issues
- ETL (extraction, transformation, loading): getting data to the warehouse;
  entity resolution
- What to materialize?
- Efficient analysis: association rule mining, ...

What to Materialize?
- Store in the warehouse results useful for common queries
- Example: total sales

Materialize the day-summed cube:
        c1   c2   c3
  p1    56    4   50
  p2    11    8
so that per-product totals (p1: 110, p2: 19) and the grand total (129)
can be computed from it instead of from the base data.
Cube Aggregates Lattice

                      all
        city        product        date
  city, product   city, date   product, date
            city, product, date

e.g., city = (c1: 67, c2: 12, c3: 50); all = 129;
city, product, date = the base cube.

Use a greedy algorithm to decide what to materialize.

Materialization Factors
- Type/frequency of queries
- Query response time
- Storage cost
- Update cost

Dimension Hierarchies
  all → state → city
  city(c1, c2) with state(CA, NY)

Dimension hierarchies enlarge the lattice:

                      all
        state       product        date
   city   state, product   state, date   product, date
  city, product   city, date   state, product, date
            city, product, date
  (not all arcs shown...)
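The greedy heuristic mentioned above (at each step, materialize the view that reduces total query cost the most) can be sketched; the lattice, view sizes, and cost model below are illustrative assumptions, not numbers from the notes:

```python
# Greedy view selection over a tiny cube lattice (toy sizes, assumed):
# cost of answering a query = size of the materialized view it reads.
size = {"cpd": 6, "cp": 4, "cd": 5, "pd": 4, "c": 3, "p": 2, "d": 2, "all": 1}
# queries each view can answer (itself plus everything derivable from it)
answers = {
    "cpd": set(size),
    "cp": {"cp", "c", "p", "all"}, "cd": {"cd", "c", "d", "all"},
    "pd": {"pd", "p", "d", "all"},
    "c": {"c", "all"}, "p": {"p", "all"}, "d": {"d", "all"},
    "all": {"all"},
}

def greedy(k):
    """Materialize the base view plus k more, each time picking the view
    whose materialization saves the most total query cost."""
    chosen = {"cpd"}
    cost = {q: size["cpd"] for q in size}  # current cost per query
    for _ in range(k):
        def benefit(v):
            return sum(max(cost[q] - size[v], 0) for q in answers[v])
        best = max((v for v in size if v not in chosen), key=benefit)
        chosen.add(best)
        for q in answers[best]:
            cost[q] = min(cost[q], size[best])
    return chosen

print(greedy(2))  # picks 'cp' first (largest benefit), then 'd'
```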
Interesting Hierarchy

time hierarchy:
  all ← years ← quarters ← months ← days
  all ← weeks ← days
(days roll up to both weeks and months; weeks do not roll up to months)

Conceptual dimension table:
  day  week  month  quarter  year
  1    1     1      1        2000
  2    1     1      1        2000
  ...
  7    1     1      1        2000
  8    2     1      1        2000

Implementation Issues
- ETL (extraction, transformation, loading): getting data to the warehouse;
  entity resolution
- What to materialize?
- Efficient analysis: association rule mining, ...
Finding High-Support Pairs

Baskets(basket, item)

SELECT I.item, J.item, COUNT(I.basket)
FROM Baskets I, Baskets J
WHERE I.basket = J.basket AND
      I.item < J.item          -- WHY? avoids self-pairs (i,i) and
GROUP BY I.item, J.item        --      counts each pair only once
HAVING COUNT(I.basket) >= s;

Example

Baskets:                Self-join result:
basket item             basket item1 item2
t1     p2               t1     p2    p5
t1     p5               t1     p2    p8
t1     p8               t1     p5    p8
t2     p5               t2     p5    p8
t2     p8               t2     p5    p11
t2     p11              t2     p8    p11
...                     ...

Then, per (item1, item2) group, check if count >= s.
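The self-join can be run as-is; a minimal sketch with Python's sqlite3, using the six Baskets rows from the example and s = 2:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Baskets (basket TEXT, item TEXT)")
conn.executemany("INSERT INTO Baskets VALUES (?,?)",
                 [("t1", "p2"), ("t1", "p5"), ("t1", "p8"),
                  ("t2", "p5"), ("t2", "p8"), ("t2", "p11")])

s = 2  # support threshold
pairs = conn.execute("""
    SELECT I.item, J.item, COUNT(I.basket)
    FROM Baskets I, Baskets J
    WHERE I.basket = J.basket AND I.item < J.item
    GROUP BY I.item, J.item
    HAVING COUNT(I.basket) >= ?""", (s,)).fetchall()
print(pairs)  # [('p5', 'p8', 2)] -- only {p5, p8} appears in two baskets
```

(Note that `I.item < J.item` compares item names as strings here, so 'p11' sorts before 'p5'; the surviving pair is the same either way.)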
Issues
- Performance for size-2 rules: the pairs table is big
- Performance for size-k rules: even bigger!
- How do we perform rule mining efficiently?
Association Rules
- How do we perform rule mining efficiently?
- Observation: if set X has support t, then each subset of X must have
  support at least t
- For 2-sets:
  - if we need support s for {i, j}
  - then each of i and j must appear in at least s baskets

Algorithm for 2-Sets
(1) Find OK products: those appearing in s or more baskets
(2) Find high-support pairs using only OK products

In SQL:

INSERT INTO okBaskets(basket, item)
SELECT basket, item
FROM Baskets
WHERE item IN (SELECT item
               FROM Baskets
               GROUP BY item
               HAVING COUNT(basket) >= s);
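The two-step filtering can also be done without SQL; a minimal Python sketch over the market-basket data from the earlier slides, with s = 2:

```python
from collections import Counter
from itertools import combinations

# Market-basket data from the slides (transaction -> items)
baskets = {"tran1": ["p2", "p5", "p8"], "tran2": ["p5", "p8", "p11"],
           "tran3": ["p1", "p9"],       "tran4": ["p5", "p8", "p11"],
           "tran5": ["p2", "p9"],       "tran6": ["p9"]}
s = 2  # support threshold

# (1) Find OK products: those appearing in s or more baskets
item_support = Counter(item for items in baskets.values() for item in items)
ok = {item for item, n in item_support.items() if n >= s}

# (2) Count pairs using only OK products
pair_support = Counter()
for items in baskets.values():
    for pair in combinations(sorted(i for i in items if i in ok), 2):
        pair_support[pair] += 1

high = {p: n for p, n in pair_support.items() if n >= s}
print(high)  # {('p5', 'p8'): 3, ('p11', 'p5'): 2, ('p11', 'p8'): 2}
```

Step (1) only prunes p1 here, but on realistic data it removes most of the long tail of rare items before the expensive pair counting.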
Algorithm for 2-Sets
- Compute okBaskets as above
- Perform mining on okBaskets:

SELECT I.item, J.item, COUNT(I.basket)
FROM okBaskets I, okBaskets J
WHERE I.basket = J.basket AND
      I.item < J.item
GROUP BY I.item, J.item
HAVING COUNT(I.basket) >= s;

Counting Efficiently
One way: sorting (threshold = 3)

  pairs from okBaskets:        sorted:                      count & remove:
  basket I.item J.item         basket I.item J.item         count I.item J.item
  t1     p5     p8             t3     p2     p3             3     p5     p8
  t2     p5     p8             t3     p2     p8             5     p12    p18
  t2     p8     p11            t1     p5     p8             ...
  t3     p2     p3             t2     p5     p8
  t3     p5     p8             t3     p5     p8
  t3     p2     p8             t2     p8     p11
  ...                          ...

Counting Efficiently
Another way: scan & count, keeping a counter array in memory
(threshold = 3), then remove low counts:

  count I.item J.item          count I.item J.item
  1     p2     p3              3     p5     p8
  2     p2     p8       →      5     p12    p18
  3     p5     p8              ...
  5     p12    p18
  1     p21    p22
  2     p21    p23
  ...
Counting Efficiently
Yet another way: hashing (threshold = 3)

(1) Scan & hash & count: hash each pair to a bucket of an in-memory hash
    table and increment that bucket's counter:

      bucket: A  B  C  D  E  F  ...
      count:  1  5  2  1  8  1  ...

(2) Scan & remove: drop pairs whose bucket count is below the threshold.
    A pair kept only because other pairs share its bucket is a false
    positive (e.g., {p8, p11} can survive this step with true count 1).

(3) Scan & count the surviving pairs exactly, with in-memory counters:

      count I.item J.item
      3     p5     p8
      1     p8     p11
      5     p12    p18
      ...

(4) Remove the remaining false positives (count < threshold):

      count I.item J.item
      3     p5     p8
      5     p12    p18
      ...
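The hash-filtered counting above (close in spirit to the PCY algorithm) can be sketched in Python; the baskets, bucket count, and hash function here are illustrative assumptions:

```python
from collections import Counter
from itertools import combinations

baskets = {"t1": ["p2", "p5", "p8"], "t2": ["p5", "p8", "p11"],
           "t3": ["p2", "p5", "p8"]}
s = 3            # support threshold
n_buckets = 8    # assumed tiny table so buckets collide, as in the slides

def bucket(pair):
    return hash(pair) % n_buckets

# (1) scan & hash & count: one counter per bucket, not per pair
bucket_counts = Counter()
for items in baskets.values():
    for pair in combinations(sorted(items), 2):
        bucket_counts[bucket(pair)] += 1

# (2)+(3) second scan: count exactly only pairs in frequent buckets
pair_counts = Counter()
for items in baskets.values():
    for pair in combinations(sorted(items), 2):
        if bucket_counts[bucket(pair)] >= s:  # survivors may be false positives
            pair_counts[pair] += 1

# (4) remove false positives
high_support = {p: c for p, c in pair_counts.items() if c >= s}
print(high_support)  # {('p5', 'p8'): 3}
```

Any truly frequent pair always lands in a bucket whose count is at least its own support, so step (2) never loses an answer; it only lets some false positives through for step (4) to remove.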
Discussion
- Hashing scheme: 2 (or 3) scans of the data
- Sorting scheme: requires a sort!
- Hashing works well if there are few high-support pairs and many
  low-support ones ("iceberg queries")

[figure: item-pairs ranked by frequency, with the threshold line cutting
off a long low-frequency tail]

Implementation Issues
- ETL (extraction, transformation, loading): getting data to the warehouse;
  entity resolution
- What to materialize?
- Efficient analysis: association rule mining, ...
Extra: Data Mining in the InfoLab
Recommendations in CourseRank

Ratings (course: rating) by user and quarter:

  user  q1    q2    q3    q4
  u1    a: 5  b: 5  d: 5  f: 3
  u2    a: 1  e: 2  d: 4  f: 3
  u3    g: 4  h: 2  e: 3
  u4    b: 2  g: 4  h: 4  e: 4
  u     a: 5  g: 4  e: 4

- u3 and u4 are similar to u → Recommend h
- Recommend d (and f, h)
Extra: Data Mining in the InfoLab
Sequence Mining
- Given a set of transcripts, use Pr[x|a] to predict if x is a good
  recommendation given the user has taken a
- Two issues...

Pr[x|a] Not Quite Right
Five transcripts, containing: (1) neither course, (2) a only, (3) x only,
(4) a -> x, (5) x -> a

Target user's transcript: [ ... a ... || unknown ] — recommend x?
  Pr[x|a] = 2/3      (of the 3 transcripts with a, 2 also have x)
  Pr[x|a~x] = 1/2    (counting only x taken after a)

User Has Taken >= 1 Course
- User has taken T = {a, b, c}
- Need Pr[x|T~x]
- Approximate as Pr[x|a~x b~x c~x]
- Expensive to compute, so...
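The two estimates can be computed from the five transcripts on the slide; a minimal sketch, encoding each transcript as an ordered course list (reading the slide's "a~x" as "a taken before x" is an assumption):

```python
# The five transcripts from the slide: neither course, a only, x only,
# a then x, x then a.
transcripts = [[], ["a"], ["x"], ["a", "x"], ["x", "a"]]

def pr_x_given_a(ts):
    """Fraction of transcripts containing a that also contain x."""
    with_a = [t for t in ts if "a" in t]
    return sum("x" in t for t in with_a) / len(with_a)

def pr_x_after_a(ts):
    """Like above, but only count x when it comes after a (Pr[x|a~x])."""
    with_a = [t for t in ts
              if "a" in t and not ("x" in t and t.index("x") < t.index("a"))]
    return sum("x" in t for t in with_a) / len(with_a)

print(pr_x_given_a(transcripts))   # 2/3
print(pr_x_after_a(transcripts))   # 1/2
```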
CourseRank User Study
[bar chart: percentage of ratings (0-25%), comparing "good, expected"
vs "good, unexpected" across two series]