Apriori Algorithm

Big Data Analysis
Technology
University of Paderborn
L.079.08013 Seminar: Cloud Computing and Big
Data Analysis (in English)
Summer semester 2013
June 12, 2013
Tobias Hardes (6687549) –
Tobias.Hardes@gmail.com
Table of contents
- Introduction
- Definitions
- Background
- Example
- Related Work
- Research
- Main Approaches
  - Association Rule Mining
  - MapReduce Framework
- Conclusion
4 Big keywords
Big Data vs. Business Intelligence
Big Data:
- How can we predict cancer early enough to treat it successfully?
- How can I make significant profit on the stock market next month?

Business Intelligence:
- Which is the most profitable branch of our supermarket?
  - In a specific country?
  - During a specific period of time?
Background
June 12, 2013
5
Big Science – The LHC
 600 million times per second, particles collide within the Large
Hadron Collider (LHC)
 Each collision generate new particles
 Particles decay in complex way
 Each collision is detected
 The CERN Data Center reconstruct this collision event
 15 petabytes of data stored every year
 Worldwide LHC Computing Grid (WLCG) is
used to crunch all of the data
home.web.cern.ch
Data Stream Analysis
- Just-in-time analysis of data
- Sensor networks
- Analysis over a certain time window (e.g., the last 30 seconds)
Complex event processing (CEP)
- Provides queries for streams
- Uses "Event Processing Languages" (EPL), e.g.:
  select avg(price) from StockTickEvent.win:time(30 sec)
  (see the sketch below)
[Figure: a tumbling window moves by the full window size (slide = window size); a sliding window moves by less than the window size (slide < window size)]
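
The EPL query above can be approximated in plain Python. The following is a minimal sketch, assuming a stream of (timestamp, price) events; it is not the Esper API, just an illustration of a time-based sliding window:

    import time
    from collections import deque

    class SlidingWindowAvg:
        # Keeps events from the last window_sec seconds; a plain-Python sketch
        # of "select avg(price) from StockTickEvent.win:time(30 sec)".
        def __init__(self, window_sec=30.0):
            self.window_sec = window_sec
            self.events = deque()  # (timestamp, price), oldest first

        def on_event(self, price, now=None):
            now = time.time() if now is None else now
            self.events.append((now, price))
            # Evict events that slid out of the time window.
            while now - self.events[0][0] > self.window_sec:
                self.events.popleft()
            return sum(p for _, p in self.events) / len(self.events)

    w = SlidingWindowAvg(30.0)
    print(w.on_event(10.0, now=0.0))    # 10.0
    print(w.on_event(20.0, now=15.0))   # 15.0
    print(w.on_event(30.0, now=40.0))   # 25.0 - the tick from t=0 has left the window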
Complex Event Processing – Areas of application
- Just-in-time analysis → the complexity of the algorithms matters
- CEP is used with Twitter:
  - Identify emotional states of users
  - Sarcasm?
Related Work
Big Data in companies
Principles
- Statistics
- Probability theory
- Machine learning
→ Data Mining:
  - Association rule learning
  - Cluster analysis
  - Classification
Association Rule Mining – Cluster analysis

Association Rule Mining
- Example: Is soda purchased together with bananas?
- Relationships between items
- Find associations, correlations, or causal structures
- Algorithms: Apriori, Frequent Pattern (FP)-Growth
Cluster analysis – Classification

Cluster Analysis
- Classification of similar objects into classes
- Classes are defined during the clustering
- Algorithms: k-Means, k-Means++
Research and future work
- Performance, performance, performance…
  - Passes over the data source
  - Parallelization
  - NP-hard problems
  - …
- Accuracy
- Optimized solutions
Example
- Apriori algorithm: n+1 database scans
- FP-Growth algorithm: 2 database scans
Distributed computing – Motivation
- Complex computational tasks
- Several terabytes of data
- Limited hardware resources
→ Google's MapReduce framework
Main approaches
Structure
- Association rule mining
- Apriori algorithm
- FP-Growth algorithm
- Google's MapReduce
Association rule mining
- Identify items that are related to other items
- Example: analysis of baskets in an online shop or in a supermarket
Terminology
- A stream or a database with n elements: S
- Item set: I = {i₁, i₂, …, iₙ}
- Frequency of occurrence of an item set A: Φ(A)
- Association rule A → B: A ⊂ I and B ⊂ I, A ∩ B = ∅
- Support: sup(A) = Φ(A) / |S|
- Confidence: conf(A → B) = sup(A ∪ B) / sup(A)
  (see the sketch below)
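
A minimal Python sketch of these two definitions, with transactions modeled as sets; the basket data is a toy example, not from the slides:

    def support(itemset, transactions):
        # sup(A) = Φ(A) / |S|: share of transactions containing all items of A.
        phi = sum(1 for t in transactions if itemset <= t)
        return phi / len(transactions)

    def confidence(antecedent, consequent, transactions):
        # conf(A → B) = sup(A ∪ B) / sup(A)
        return (support(antecedent | consequent, transactions)
                / support(antecedent, transactions))

    # Toy baskets (illustrative only):
    S = [{"cheese", "chocolate", "bread"}, {"cheese", "chocolate"}, {"milk", "bread"}]
    print(support({"cheese", "chocolate"}, S))                # 2/3
    print(confidence({"cheese", "chocolate"}, {"bread"}, S))  # 0.5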
Example
- Rule: "If a basket contains cheese and chocolate, then it also contains bread"
  {Cheese, chocolate} → {Bread}
- 6 of 60 transactions contain cheese and chocolate
  → sup(A) = Φ(A) / |S| = 6/60 = 10%
- 3 of those 6 transactions contain bread
  → conf(A → B) = sup(A ∪ B) / sup(A) = 3/6 = 50%
Common approach
- Split the problem into two tasks:
  1. Generation of frequent item sets
     - Find item sets that satisfy a minimum support value:
       sup(A) = Φ(A) / |S| ≥ min_sup
  2. Generation of rules
     - Find high-confidence rules using the frequent item sets:
       conf(A → B) = sup(A ∪ B) / sup(A) ≥ min_conf
Apriori algorithm – Frequent item sets

Input:
- Minimum support: min_sup
- Data source: S
Apriori – Frequent item sets (I)
Generation of frequent item sets, min_sup = 2

TID  Transaction
1    (B, C)
2    (B, C)
3    (A, C, D)
4    (A, B, C, D)
5    (B, D)

[Figure: item set lattice rooted at {}, with first-level occurrence counts A: 2, B: 4, C: 4, D: 3]
Apriori – Frequent item sets (II)
Generation of frequent item sets, min_sup = 2

TID  Transaction
1    (B, C)
2    (B, C)
3    (A, C, D)
4    (A, B, C, D)
5    (B, D)

Candidates L1: A: 2, B: 4, C: 4, D: 3
Candidates L2: AB: 1, AC: 2, AD: 2, BC: 3, BD: 2, CD: 2
Candidates L3: ACD: 2, BCD: 1
→ AB and BCD fall below min_sup: L1 = {A, B, C, D}, L2 = {AC, AD, BC, BD, CD}, L3 = {ACD} (counting sketch below)
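
A brute-force counting sketch in Python that reproduces the candidate counts above by enumerating every k-subset of each transaction (the real Apriori algorithm would instead generate the k-candidates from Lk−1, as in the pseudocode in the appendix):

    from itertools import combinations

    transactions = [{"B", "C"}, {"B", "C"}, {"A", "C", "D"},
                    {"A", "B", "C", "D"}, {"B", "D"}]
    min_sup = 2

    def count_k_itemsets(transactions, k):
        # Count every k-subset of every transaction.
        counts = {}
        for t in transactions:
            for combo in combinations(sorted(t), k):
                counts[combo] = counts.get(combo, 0) + 1
        return counts

    for k in (1, 2, 3):
        counts = count_k_itemsets(transactions, k)
        print(f"L{k}:", {c: n for c, n in counts.items() if n >= min_sup})
    # L1: A, B, C, D   L2: AC, AD, BC, BD, CD (AB falls out)   L3: ACD (BCD falls out)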
Apriori Algorithm – Rule generation
- Uses frequent item sets to extract high-confidence rules
- Based on the same principle as the item set generation
- Done for every frequent item set Lk
Example: Rule generation
TID  Items
T1   {Coffee, Pasta, Milk}
T2   {Pasta, Milk}
T3   {Bread, Butter}
T4   {Coffee, Milk, Butter}
T5   {Milk, Bread, Butter}

sup(Butter → Milk) = Φ(Butter ∪ Milk) / |S| = 2/5 = 40%
conf(Butter → Milk) = sup(Butter ∪ Milk) / sup(Butter) = 40% / 60% ≈ 66%
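
A sketch of the rule-generation step for the table above: enumerate every split of a frequent item set into antecedent and consequent and keep the rules that reach min_conf. Function names and the min_conf value are illustrative:

    from itertools import combinations

    transactions = [
        {"Coffee", "Pasta", "Milk"},
        {"Pasta", "Milk"},
        {"Bread", "Butter"},
        {"Coffee", "Milk", "Butter"},
        {"Milk", "Bread", "Butter"},
    ]

    def sup(itemset):
        return sum(1 for t in transactions if itemset <= t) / len(transactions)

    def rules_from(itemset, min_conf):
        # Try every split of the frequent item set into antecedent → consequent.
        for r in range(1, len(itemset)):
            for antecedent in combinations(sorted(itemset), r):
                a = set(antecedent)
                conf = sup(itemset) / sup(a)
                if conf >= min_conf:
                    yield a, itemset - a, conf

    for a, b, conf in rules_from({"Butter", "Milk"}, min_conf=0.6):
        print(a, "->", b, f"conf = {conf:.0%}")
    # {'Butter'} -> {'Milk'} conf = 67%  (sup 40% / sup 60%, rounded)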
Summary Apriori algorithm
- n+1 scans of the database
- Expensive generation of the candidate item sets
- Implements level-wise search using the frequent item set property
- Easy to implement
- Some opportunities for specialized optimizations
FP-Growth algorithm
- Used for databases
- Features:
  1. Requires only 2 scans of the database
  2. Uses a special data structure: the FP-Tree
- Build the FP-Tree, then extract the frequent item sets from it
- Compresses the database
- Divides the database and applies data mining to the parts
Construct FP-Tree
TID  Items
1    {a, b}
2    {b, c, d}
3    {a, c, d, e}
4    {a, d, e}
5    {a, b, c}
6    {a, b, c, d}
7    {a}
8    {a, b, c}
9    {a, b, d}
10   {b, c, e}

[Figure: the FP-Tree built from these transactions]
Extract frequent itemsets (I)
- Bottom-up strategy
- Start with node "e"
- Then look for "de"
- Each path is processed recursively
- Solutions are merged
Extract frequent itemsets (II)
- Is e frequent?
  - Is de frequent?
  - Is ce frequent?
  - Is be frequent?
  - Is ae frequent?
  - …
- Using subproblems to identify frequent itemsets
- Φ(e) = 3; assume the minimum support was set to 2
Extract frequent itemsets (III)
1. Update the support counts along the prefix paths
2. Remove node e
3. Check the frequency of the paths
→ Find item sets ending with de, ce, ae, or be (see the sketch below)
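
A sketch of this suffix-extension step. The prefix paths for e are written out by hand from the ten-transaction example above (transactions 3, 4, and 10 contain e); summing the counts per item tells us which extensions of e reach min_sup:

    def conditional_counts(prefix_paths, min_sup):
        # prefix_paths: list of (path_items, count) pairs for paths ending in
        # the suffix item (here 'e'). Returns the items frequent within this
        # conditional pattern base, i.e. the viable extensions of the suffix.
        counts = {}
        for items, count in prefix_paths:
            for item in items:
                counts[item] = counts.get(item, 0) + count
        return {i: c for i, c in counts.items() if c >= min_sup}

    # Prefix paths for suffix e, derived from transactions 3, 4, and 10:
    paths_e = [(("a", "c", "d"), 1), (("a", "d"), 1), (("b", "c"), 1)]
    print(conditional_counts(paths_e, min_sup=2))
    # {'a': 2, 'c': 2, 'd': 2} -> ae, ce, de are frequent; be is not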
Apriori vs. FP-Growth
- FP-Growth has some advantages:
  - Two scans of the database
  - No expensive computation of candidates
  - Compressed data structure
  - Easier to parallelize

W. Zhang, H. Liao, and N. Zhao, "Research on the FP-Growth algorithm about association rule mining"
MapReduce
- Map and Reduce functions are written by the developer
- map(key, val)
  - Emits new key-value pairs
- reduce(key, values)
  - Emits an arbitrary output, usually a key with one value
(sketch below)
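
A single-process sketch of this contract for word count; a real framework such as Hadoop distributes the phases across many workers, and the grouping dict here stands in for the shuffle:

    from collections import defaultdict

    def map_fn(key, value):
        # key: document name, value: document text -> emits (word, 1) pairs.
        for word in value.split():
            yield word, 1

    def reduce_fn(key, values):
        # key: word, values: all counts emitted for it -> emits (word, total).
        yield key, sum(values)

    def run(documents):
        groups = defaultdict(list)
        for name, text in documents.items():      # map phase
            for k, v in map_fn(name, text):
                groups[k].append(v)                # shuffle: group by key
        return dict(kv for k, vs in groups.items() for kv in reduce_fn(k, vs))

    print(run({"d1": "big data big analysis"}))
    # {'big': 2, 'data': 1, 'analysis': 1}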
MapReduce – Word count
[Figure: MapReduce execution overview – (1) the user program forks a master and workers; (2) the master assigns map and reduce tasks; (3) map workers read the input files; (4) intermediate results are written to local disk; (5) reduce workers fetch them via RPC, one reduce worker per key partition (blue, red, yellow keys); after the shuffle, the output files are written and (7) control returns to the user program. Phases: input files → map phase → intermediate files → shuffle → reduce phase → output files]
Conclusion: MapReduce (I)
- MapReduce is designed as a batch processing framework
  - Not suited for ad-hoc analysis
- Used for very large data sets
- Used for time-intensive computations
- Open-source implementation: Apache Hadoop (http://hadoop.apache.org/)
Conclusion
Conclusion (I)
- Big Data is important for research and in daily business
- Different approaches:
  - Data stream analysis
  - Complex event processing
- Rule mining:
  - Apriori algorithm
  - FP-Growth algorithm
Conclusion (II)
- Clustering
  - k-Means
  - k-Means++
- Distributed computing
  - MapReduce
- Performance / Runtime
  - Multiple minutes
  - Hours
  - Days…
→ Online analytical processing for Big Data?
Thank you
for your attention
Appendix
Big Data definitions

"Every day, we create 2.5 quintillion bytes of data […]. This data comes from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals to name a few. This data is big data." (IBM Corporate)

"Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making." (Gartner Inc.)

"'Big data' refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze." (McKinsey & Company)
Complex Event Processing – Windows
Tumbling Window (slide = window size):
- Moves by the full window size

Sliding Window (slide < window size):
- Slides in time
- Buffers the last x elements
MapReduce vs. BigQuery
Apriori Algorithm (Pseudocode)
- L1 ← {frequent 1-itemsets}
- for (k = 2; Lk−1 ≠ ∅; k++) do
-     Ck = aprioriGen(Lk−1)
-     for each transaction I ∈ S do
-         Cr ← subset(Ck, I)
-         for each c ∈ Cr do
-             c.count++
-         end for
-     end for
-     Lk = {c ∈ Ck | c.count ≥ min_sup}
- end for
- return ⋃k Lk
Distributed computing of Big Data
- CERN's Worldwide LHC Computing Grid (WLCG) launched in 2002
- Stores, distributes, and analyses the 15 petabytes of data per year
- 140 computing centres across 35 countries
Apriori Algorithm – aprioriGen → Join
- Do not generate too many candidate item sets, but make sure not to lose any that turn out to be large
- Assume that the items are ordered (alphabetically)
- If {a1, a2, …, ak−1} = {b1, b2, …, bk−1} and ak < bk, then {a1, a2, …, ak−1, ak, bk} is a candidate (k+1)-itemset (sketch below)
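
A sketch of just this join step in Python, with k-itemsets represented as sorted tuples; the prune step that removes candidates containing an infrequent k-subset is omitted:

    def apriori_gen_join(Lk):
        # Merge two frequent k-itemsets that share their first k-1 items
        # into one (k+1)-candidate.
        candidates = []
        for a in Lk:
            for b in Lk:
                if a[:-1] == b[:-1] and a[-1] < b[-1]:
                    candidates.append(a + (b[-1],))
        return candidates

    print(apriori_gen_join([("A", "C"), ("A", "D"), ("C", "D")]))
    # [('A', 'C', 'D')]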
Big Data vs. Business Intelligence
Big Data:
- Large and complex data sets
- Temporal, historical, …
- Difficult to process and to analyse
- Used for deep analysis and reporting:
  - How can we predict cancer early enough to treat it successfully?
  - How can I make significant profit on the stock market next month?

Business Intelligence:
- Transformed data
- Historical view
- Easy to process and to analyse
- Used for reporting:
  - Which is the most profitable branch of our supermarket?
  - Which postcodes suffered the most dropped calls in July?
Improvement approaches
- Selection of startup parameters for algorithms
- Reducing the number of passes over the database
- Sampling the database
- Adding extra constraints for patterns
- Parallelization
Improvement approaches – Examples
Example: FA-DMFI
- An algorithm for discovering frequent item sets
- Reads the database once and compresses it into a matrix
- Frequent item sets are generated by cover relations
→ Further costly computations are avoided
K-Means algorithm
1. Select k entities as the initial centroids.
2. (Re)assign all entities to their closest centroids.
3. Recompute the centroid of each newly assembled cluster.
4. Repeat steps 2 and 3 until the centroids do not change or until the maximum number of iterations is reached (sketch below).
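
A plain Python sketch of these four steps on 2D points, assuming random seeding (k-Means++ would replace step 1 with a smarter seed selection):

    import random

    def dist2(p, q):
        # Squared Euclidean distance.
        return sum((a - b) ** 2 for a, b in zip(p, q))

    def mean(cluster):
        n = len(cluster)
        return tuple(sum(p[d] for p in cluster) / n for d in range(len(cluster[0])))

    def kmeans(points, k, max_iter=100):
        centroids = random.sample(points, k)            # 1. initial centroids
        clusters = []
        for _ in range(max_iter):                       # 4. repeat ...
            clusters = [[] for _ in range(k)]
            for p in points:                            # 2. assign to closest
                i = min(range(k), key=lambda c: dist2(p, centroids[c]))
                clusters[i].append(p)
            # 3. recompute the centroid of each newly assembled cluster
            new = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
            if new == centroids:                        # ... until stable
                break
            centroids = new
        return centroids, clusters

    pts = [(0, 0), (0, 1), (10, 10), (10, 11)]
    centroids, clusters = kmeans(pts, k=2)
    print(centroids)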
Solving approaches
- k-Means clustering is NP-hard
- Optimization methods are used to handle NP-hard problems such as k-Means clustering
Examples
- Apriori algorithm: n+1 database scans
- FP-Growth algorithm: 2 database scans
- k-Means: exponential runtime
- k-Means++: improved startup parameters
Google‘s BigQuery
1. Upload: upload the data set to Google Storage
2. Process: import the data into tables
3. Analyse: run queries
The Apriori algorithm
- The best-known algorithm for rule mining
- Based on a simple principle:
  "If an item set is frequent, then all subsets of this item set are also frequent"
- Input:
  - Minimum confidence: min_conf
  - Minimum support: min_sup
  - Data source: S
Apriori Algorithm – aprioriGen
- Generates the candidate item sets for Lk+1 (item sets one larger)
- Join: generation of the item sets
- Prune: elimination of item sets with support(Ij) < min_sup
Apriori Algorithm – Rule generation -- Example
- {Butter, milk, bread} → {cheese}
- {Butter, meat, bread} → {cola}
→ {Butter, bread} → {cheese, cola}
How to improve the Apriori algorithm
- Hash-based itemset counting: a k-itemset whose corresponding hash bucket count is below the threshold cannot be frequent (sketch below)
- Sampling: mining on a subset of the given data
- Dynamic itemset counting: start counting new candidate itemsets at checkpoints during a scan to reduce the number of full passes
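
A sketch of the hash-based counting idea (in the style of the PCY algorithm) for 2-itemsets; the bucket count and hash function are illustrative choices:

    from itertools import combinations

    def hash_bucket_counts(transactions, n_buckets=11):
        # Hash every 2-itemset of every transaction into a bucket.
        buckets = [0] * n_buckets
        for t in transactions:
            for pair in combinations(sorted(t), 2):
                buckets[hash(pair) % n_buckets] += 1
        return buckets

    def may_be_frequent(pair, buckets, min_sup):
        # A 2-itemset can only be frequent if its bucket count reaches
        # min_sup, so pairs hashing into light buckets are pruned before
        # the expensive exact counting pass.
        return buckets[hash(tuple(sorted(pair))) % len(buckets)] >= min_sup

    baskets = [{"B", "C"}, {"B", "C"}, {"A", "C", "D"}]
    b = hash_bucket_counts(baskets)
    print(may_be_frequent(("B", "C"), b, min_sup=2))   # True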
Construction of FP-Tree
- Compressed representation of the database
- First scan:
  - Get the support of every item and sort the items by decreasing support count
- Second scan:
  - Each transaction is mapped to a path
  - Overlapping paths are compressed by sharing prefixes
  - Links are generated between nodes carrying the same item
  - Each node has a counter → the number of transactions mapped through it
(insertion sketch below)
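
A minimal sketch of the second scan: inserting one transaction into the FP-Tree, incrementing counts on overlapping prefixes and creating new nodes otherwise. The header-table links between nodes of the same item are omitted, and for brevity the example passes the raw transactions rather than re-sorting them by support:

    class FPNode:
        def __init__(self, item, parent=None):
            self.item, self.parent = item, parent
            self.count = 1           # transactions mapped through this node
            self.children = {}       # item -> FPNode

    def insert_transaction(root, items):
        # items should already be sorted by decreasing global support (first scan).
        node = root
        for item in items:
            if item in node.children:               # overlapping prefix: count up
                node.children[item].count += 1
            else:                                   # new suffix: create a node
                node.children[item] = FPNode(item, parent=node)
            node = node.children[item]

    root = FPNode(None)
    root.count = 0
    # First three transactions of the example table:
    for t in [["a", "b"], ["b", "c", "d"], ["a", "c", "d", "e"]]:
        insert_transaction(root, t)
    print(root.children["a"].count)   # 2: paths a-b and a-c-d-e share the prefix a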
FP-Growth algorithm
[Flowchart, rendered as steps:]
1. Calculate the support count of each item in S.
2. Sort the items by decreasing support count.
3. While there is a next transaction, read transaction t:
   - If no overlapping prefix is found: create new nodes labeled with the items in t and set their frequency counts to 1.
   - If an overlapping prefix is found: increment the frequency count of each overlapped item, create new nodes for the non-overlapped items, and create additional paths to common items.
4. Return the finished FP-Tree.