COP5725
Advanced Database Systems
Spring 2016
Data Mining
Tallahassee, Florida, 2016
Why Data Mining?
• The Explosive Growth of Data: from terabytes to
petabytes
– Data collection and data availability
• Automated data collection tools, database systems, Web,
computerized society
– Major sources of abundant data
• Business: Web, e-commerce, transactions, stocks, …
• Science: Remote sensing, bioinformatics, scientific simulation, …
• Society and everyone: news, digital cameras, YouTube
• We are drowning in data, but starving for knowledge!
1
What is Data Mining?
• Data mining (knowledge discovery from data)
– Extraction of interesting (non-trivial, implicit, previously unknown and
potentially useful) patterns or knowledge from huge amounts of data
– Data mining: a misnomer?
• Alternative names
– Knowledge discovery (mining) in databases (KDD), knowledge extraction,
data/pattern analysis, data archeology, data dredging, information harvesting,
business intelligence, etc.
• Watch out: is everything “data mining”?
– Simple search and query processing
– (Deductive) expert systems
2
Knowledge Discovery (KDD) Process
• This is a view from typical database systems
and data warehousing communities
• Data mining plays an essential role in the
knowledge discovery process
[KDD pipeline: Databases → Data Cleaning / Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]
3
KDD Process:
A Typical View from ML and Statistics
[KDD pipeline, ML/statistics view:
Input Data → Data Pre-Processing (data integration, normalization, feature selection, dimension reduction) → Data Mining (pattern discovery, association & correlation, classification, clustering, outlier analysis, …) → Post-Processing (pattern evaluation, pattern selection, pattern interpretation, pattern visualization)]
4
Data Mining in Business Intelligence
[Pyramid, bottom to top, with increasing potential to support business decisions:
• Data Sources: paper, files, Web documents, scientific experiments, database systems (role: DBA)
• Data Exploration: statistical summary, querying, and reporting, on top of data preprocessing/integration and data warehouses (role: Data Analyst)
• Data Mining: information discovery (role: Data Analyst)
• Data Presentation: visualization techniques (role: Business Analyst)
• Decision Making (role: End User)]
5
Data Mining:
Confluence of Multiple Disciplines
Data mining lies at the confluence of: machine learning, pattern recognition, statistics, database technology, algorithms, visualization, high-performance computing, and applications.
6
Data Mining: On What Kinds of Data?
• Database-oriented data sets and applications
– Relational database, data warehouse, transactional database
• Advanced data sets and advanced applications
– Data streams and sensor data
– Time-series data, temporal data, sequence data
– Structured data, graphs and social networks
– Spatial data and spatiotemporal data
– Multimedia database
– Text databases
– The World-Wide Web
7
Association and Correlation Analysis
• Frequent pattern (or frequent itemsets) mining
– What items are frequently purchased together in your Walmart
shopping cart?
• Association, correlation vs. causality
– E.g., Diaper → Beer [0.5%, 75%] (support, confidence)
– Are strongly associated items also strongly correlated?
• How to mine such patterns and rules efficiently in large
datasets?
• How to use such patterns for classification, clustering,
and other applications?
8
Example
9
Classification
• Classification and label prediction
– Construct models (functions) based on some training examples
– Describe and distinguish classes or concepts for future prediction
• E.g., classify countries based on (climate), or classify cars based
on (gas mileage)
– Predict some unknown class labels
• Typical methods
– Decision trees, naïve Bayesian classification, SVM, neural
networks, rule-based classification, pattern-based classification,
logistic regression, …
• Typical applications:
– Credit card fraud detection, direct marketing, classifying stars,
diseases, web-pages, …
10
Example
11
Example
Deep Blue beat Kasparov at chess in 1997.
Watson beat the brightest trivia minds at Jeopardy in 2011.
Can you tell Fido from Mittens in 2013?
https://www.kaggle.com/c/dogs-vs-cats
12
Example
[Images labeled Cat, Dog, Cat, Dog]
What about this?
13
Example
14
Clustering & Outlier Analysis
• Unsupervised learning (i.e., Class label is unknown)
– Group data to form new categories (i.e., clusters), e.g., cluster
houses to find distribution patterns
– Principle: maximizing intra-class similarity & minimizing inter-class similarity
• Outlier Analysis
– A data object that does not comply with the general behavior of
the data
– Noise or exception? ― One person’s garbage could be another
person’s treasure
15
Example
16
Structure and Network Analysis
• Graph mining
– Finding frequent subgraphs (e.g., chemical compounds), trees
(XML), substructures (web fragments)
• Network analysis
– Social networks: actors (objects, nodes) and relationships (edges)
• e.g., author networks in CS, terrorist networks
– Multiple heterogeneous networks
• A person could be in multiple information networks: friends, family, classmates, …
– Links carry a lot of semantic information: Link mining
• Web mining
– Web is a big information network: from PageRank to Google
• Web community discovery, opinion mining, usage mining, …
17
Example
18
Top 10 Data Mining Algorithms
1. C4.5: Decision-tree based classification
2. K-Means: Clustering
3. SVM: Classification and regression
4. Apriori: Frequent pattern mining
5. EM: MLE/MAP estimation, parameter estimation
6. PageRank: Link analysis and ranking
7. AdaBoost: Classification
8. kNN: Classification and regression
9. Naive Bayes: Classification
10. CART: Classification and regression
19
What Is Frequent Pattern Analysis?
• Frequent pattern: a pattern (a set of items, subsequences,
substructures, etc.) that occurs frequently in a data set
• Motivation: Finding inherent regularities in data
– What products were often purchased together?— Beer and diapers?!
– What are the subsequent purchases after buying a PC?
– What kinds of DNA are sensitive to this new drug?
– Can we automatically classify web documents?
• Applications
– Basket data analysis, cross-marketing, catalog design, sale campaign analysis, Web log (click stream) analysis, and DNA sequence analysis
20
Basic Concepts: Frequent Patterns
• Itemset: a set of one or more items
• k-itemset: X = {x1, …, xk}
• (Absolute) support, or support count, of X: the frequency or number of occurrences of itemset X
• (Relative) support, s: the fraction of transactions that contain X (i.e., the probability that a transaction contains X)
• An itemset X is frequent if X's support is no less than a minsup threshold
[Venn diagram: customers who buy beer, customers who buy diapers, and customers who buy both]
21
Basic Concepts: Association Rules
Tid | Items bought
10  | Beer, Nuts, Diaper
20  | Beer, Coffee, Diaper
30  | Beer, Diaper, Eggs
40  | Nuts, Eggs, Milk
50  | Nuts, Coffee, Diaper, Eggs, Milk

• Find all the rules X → Y with minimum support and confidence
– support, s: probability that a transaction contains X ∪ Y
– confidence, c: conditional probability that a transaction containing X also contains Y
• Let minsup = 50%, minconf = 50%
– Frequent patterns: Beer:3, Nuts:3, Diaper:4, Eggs:3, {Beer, Diaper}:3
– Association rules (many more exist!):
• Beer → Diaper (60%, 100%)
• Diaper → Beer (60%, 75%)
[Venn diagram: customers who buy beer, customers who buy diapers, and customers who buy both]
22
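A minimal Python sketch (not from the original slides) that reproduces the support and confidence numbers above; the transaction list is copied from the Tid table on this slide.

transactions = [
    {"Beer", "Nuts", "Diaper"},                     # Tid 10
    {"Beer", "Coffee", "Diaper"},                   # Tid 20
    {"Beer", "Diaper", "Eggs"},                     # Tid 30
    {"Nuts", "Eggs", "Milk"},                       # Tid 40
    {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},   # Tid 50
]

def support(itemset):
    # relative support: fraction of transactions containing every item in itemset
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    # confidence of lhs -> rhs: sup(lhs ∪ rhs) / sup(lhs)
    return support(set(lhs) | set(rhs)) / support(lhs)

print(support({"Beer", "Diaper"}))       # 0.6  (3 of 5 transactions)
print(confidence({"Beer"}, {"Diaper"}))  # 1.0  -> Beer → Diaper (60%, 100%)
print(confidence({"Diaper"}, {"Beer"}))  # 0.75 -> Diaper → Beer (60%, 75%)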
Association Rule Mining
• Given a set of transactions T, the goal of association
rule mining is to find all rules having
– support ≥ minsup threshold
– confidence ≥ minconf threshold
• Brute-force approach:
– List all possible association rules
– Compute the support and confidence for each rule
– Prune rules that fail the minsup and minconf thresholds
⇒ Computationally prohibitive!
23
Mining Association Rules
• Observations
– All the association rules are binary partitions of the same itemset:
{Milk, Diaper, Beer}
– Rules originating from the same itemset have identical support but
can have different confidence
– Thus, we may decouple the support and confidence requirements
TID | Items
1   | Bread, Milk
2   | Bread, Diaper, Beer, Eggs
3   | Milk, Diaper, Beer, Coke
4   | Bread, Milk, Diaper, Beer
5   | Bread, Milk, Diaper, Coke

Example rules:
{Milk, Diaper} → {Beer} (s=0.4, c=0.67)
{Milk, Beer} → {Diaper} (s=0.4, c=1.0)
{Diaper, Beer} → {Milk} (s=0.4, c=0.67)
{Beer} → {Milk, Diaper} (s=0.4, c=0.67)
{Diaper} → {Milk, Beer} (s=0.4, c=0.5)
{Milk} → {Diaper, Beer} (s=0.4, c=0.5)
24
Frequent Itemset Mining
Given d items, there are 2^d possible candidate itemsets
[Figure: the itemset lattice over {A, B, C, D, E}, from the null set at the top, through all 1-, 2-, 3-, and 4-itemsets, down to ABCDE at the bottom]
25
Frequent Itemset Mining
• Brute-force approach
1. Each itemset in the lattice is a candidate frequent itemset
2. Count the support of each candidate by scanning the database, matching each transaction against every candidate
– Time complexity: O(N · M · w), where N is the number of transactions, M is the number of candidates, and w is the maximum transaction width ⇒ expensive, since M = 2^d
[Figure: the N transactions (the TID/Items table above) are matched against the list of M candidates]
26
Computational Complexity of Association Rule Mining
• Given d unique items:
– Total number of itemsets = 2^d
– Total number of possible association rules:
R = Σ_{k=1}^{d−1} [ C(d, k) × Σ_{j=1}^{d−k} C(d−k, j) ] = 3^d − 2^{d+1} + 1
– If d = 6, R = 602 rules
27
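A short Python check (added here, not part of the slide) that the double sum and the closed form 3^d − 2^(d+1) + 1 agree for d = 6.

from math import comb

def num_rules(d):
    # rules X -> Y with X, Y non-empty and disjoint, drawn from d items
    return sum(comb(d, k) * sum(comb(d - k, j) for j in range(1, d - k + 1))
               for k in range(1, d))

d = 6
print(num_rules(d))           # 602
print(3**d - 2**(d + 1) + 1)  # 602, the closed form above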
Closed Patterns and Max-Patterns
• A long pattern contains a combinatorial number of sub-patterns
– e.g., {a1, …, a100} contains C(100,1) + C(100,2) + … + C(100,100) = 2^100 − 1 ≈ 1.27 × 10^30 sub-patterns!
• Solution: mine closed patterns and max-patterns instead
• An itemset X is closed
– if X is frequent and there exists no super-pattern Y ⊃ X with the same support as X
• An itemset X is a max-pattern
– if X is frequent and there exists no frequent super-pattern Y ⊃ X
• Closed patterns are a lossless compression of frequent patterns
– Reducing the # of patterns and rules
28
Closed Patterns and Max-Patterns
• Exercise. DB = {<a1, …, a100>, <a1, …, a50>}
– Min_sup = 1
• What is the set of closed itemsets?
– <a1, …, a100>: 1
– <a1, …, a50>: 2
• What is the set of max-patterns?
– <a1, …, a100>: 1
• What is the set of all frequent patterns?
– Every non-empty subset of {a1, …, a100}: 2^100 − 1 of them, far too many to enumerate!
29
Maximal Frequent Itemset
• An itemset is maximal frequent if none of its immediate
supersets is frequent
[Figure: the itemset lattice over {A, B, C, D, E} with a border separating frequent from infrequent itemsets; the maximal frequent itemsets are the frequent itemsets lying just inside the border]
30
Closed Itemset
• An itemset is closed if none of its immediate supersets
has the same support as the itemset
TID | Items
1   | {A,B}
2   | {B,C,D}
3   | {A,B,C,D}
4   | {A,B,D}
5   | {A,B,C,D}

Itemset | Support
{A}     | 4
{B}     | 5
{C}     | 3
{D}     | 4
{A,B}   | 4
{A,C}   | 2
{A,D}   | 3
{B,C}   | 3
{B,D}   | 4
{C,D}   | 3

Itemset   | Support
{A,B,C}   | 2
{A,B,D}   | 3
{A,C,D}   | 2
{B,C,D}   | 3
{A,B,C,D} | 2
31
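A small Python sketch (assumption: minsup = 2, the threshold used two slides later) that recomputes the support table above and lists the closed and maximal itemsets for this toy database.

from itertools import combinations

db = [{"A","B"}, {"B","C","D"}, {"A","B","C","D"}, {"A","B","D"}, {"A","B","C","D"}]
items = sorted(set().union(*db))

def sup(x):
    return sum(set(x) <= t for t in db)

# support of every non-empty itemset that occurs at least once
support = {frozenset(c): sup(c)
           for r in range(1, len(items) + 1)
           for c in combinations(items, r) if sup(c) > 0}

# closed: no proper superset has the same support
closed = [x for x in support
          if not any(x < y and support[y] == support[x] for y in support)]

# maximal: frequent (support >= minsup) with no frequent proper superset
minsup = 2
frequent = [x for x in support if support[x] >= minsup]
maximal = [x for x in frequent if not any(x < y for y in frequent)]

print(sorted(map(sorted, closed)))
print(sorted(map(sorted, maximal)))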
Maximal vs. Closed Itemsets
TID | Items
1   | ABC
2   | ABCD
3   | BCE
4   | ACDE
5   | DE

[Figure: the itemset lattice over {A, B, C, D, E}, each node annotated with the ids of the transactions that contain it (e.g., A: 1,2,4; B: 1,2,3; C: 1,2,3,4; D: 2,4,5; E: 3,4,5); itemsets not supported by any transaction are marked]
32
Maximal vs Closed Frequent Itemsets
Minimum support = 2
[Figure: the same transaction-annotated lattice, with each frequent itemset marked as either closed but not maximal, or closed and maximal]
# Closed = 9
# Maximal = 4
33
Maximal vs Closed Itemsets
Maximal Frequent Itemsets ⊆ Closed Frequent Itemsets ⊆ Frequent Itemsets
34
Mining Frequent Itemsets
• The downward closure property of frequent patterns
– Any subset of a frequent itemset must be frequent
• If {beer, diaper, nuts} is frequent, so is {beer, diaper}
• i.e., every transaction having {beer, diaper, nuts} also contains
{beer, diaper}
• Apriori pruning principle: If there is any itemset which is
infrequent, its superset should not be generated/tested!
– A Candidate Generation & Test Approach
1. Initially, scan DB once to get frequent 1-itemset
2. Generate length (k+1) candidate itemsets from length k frequent
itemsets
3. Test the candidates against DB
4. Terminate when no frequent or candidate set can be generated
35
Illustrating Apriori Principle
[Figure: the itemset lattice over {A, B, C, D, E}; once an itemset is found to be infrequent, all of its supersets are pruned from the search]
36
The Apriori Algorithm—An Example
Supmin = 2
[Figure: a 1st scan of the database yields candidate 1-itemsets C1 and frequent 1-itemsets L1; C2 is generated from L1 and counted in a 2nd scan to give L2; C3 is generated from L2 and counted in a 3rd scan to give L3]
37
The Apriori Algorithm
Ck: candidate itemsets of size k
Lk: frequent itemsets of size k

L1 = {frequent items};
for (k = 1; Lk ≠ ∅; k++) do begin
    Ck+1 = candidates generated from Lk;            // candidate generation
    for each transaction t in database do           // frequency counting
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;
38
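A runnable Python sketch of the pseudocode above, covering both the level-wise loop and the self-join/prune candidate generation described on the next slide; the data and the absolute threshold min_support = 3 are taken from the earlier Beer/Diaper example, and the names are my own.

from itertools import combinations

def apriori(transactions, min_support):
    # return all frequent itemsets (frozensets) with absolute support >= min_support
    transactions = [frozenset(t) for t in transactions]
    items = {i for t in transactions for i in t}
    L = {frozenset([i]) for i in items
         if sum(i in t for t in transactions) >= min_support}    # L1
    frequent = set(L)
    k = 1
    while L:
        # candidate generation: self-join Lk, then prune candidates with an infrequent k-subset
        C = {a | b for a in L for b in L if len(a | b) == k + 1}
        C = {c for c in C if all(frozenset(s) in L for s in combinations(c, k))}
        # frequency counting: one scan of the database per level
        counts = {c: sum(c <= t for t in transactions) for c in C}
        L = {c for c, n in counts.items() if n >= min_support}
        frequent |= L
        k += 1
    return frequent

db = [{"Beer", "Nuts", "Diaper"}, {"Beer", "Coffee", "Diaper"},
      {"Beer", "Diaper", "Eggs"}, {"Nuts", "Eggs", "Milk"},
      {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"}]
print(sorted(map(sorted, apriori(db, min_support=3))))
# [['Beer'], ['Beer', 'Diaper'], ['Diaper'], ['Eggs'], ['Nuts']]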
Implementation of Apriori
• How to generate candidates?
– Step 1: self-joining Lk
– Step 2: pruning
• Example of Candidate-generation
– L3={abc, abd, acd, ace, bcd}
– Self-joining: L3*L3
• abcd from abc and abd
• acde from acd and ace
– Pruning:
• acde is removed because ade is not in L3
– C4 = {abcd}
39
Candidate Generation: SQL
• SQL Implementation of candidate generation
– Suppose the items in Lk-1 are listed in an order
– Step 1: self-joining Lk-1
insert into Ck
select p.item1, p.item2, …, p.itemk-1, q.itemk-1
from Lk-1 p, Lk-1 q
where p.item1 = q.item1 and … and p.itemk-2 = q.itemk-2 and p.itemk-1 < q.itemk-1
– Step 2: pruning
forall itemsets c in Ck do
forall (k-1)-subsets s of c do
if (s is not in Lk-1) then delete c from Ck
40
Factors Affecting Complexity
• Choice of minimum support threshold
– lowering support threshold results in more frequent itemsets
– this may increase number of candidates and max length of frequent itemsets
• Dimensionality (number of items) of the data set
– more space is needed to store support count of each item
– if number of frequent items also increases, both computation and I/O costs
may also increase
• Size of database
– since Apriori makes multiple passes, run time of algorithm may increase with
number of transactions
• Average transaction width
– transaction width increases with denser data sets
– this may increase max length of frequent itemsets, and the number of subsets
in a transaction increases with its width
41
Further Improvement of Apriori
• Major computational challenges
– Multiple scans of transaction database
– Huge number of candidates
– Tedious workload of support counting for candidates
• Improving Apriori: general ideas
– Reduce passes of transaction database scans
– Shrink number of candidates
– Facilitate support counting of candidates
42
Partition: Scan Database Only Twice
• Any itemset that is potentially frequent in DB must be
frequent in at least one of the partitions of DB
– Scan 1: partition database and find local frequent patterns
– Scan 2: consolidate global frequent patterns
DB1 + DB2 + … + DBk = DB
If sup1(i) < σ·|DB1|, sup2(i) < σ·|DB2|, …, and supk(i) < σ·|DBk|, then sup(i) < σ·|DB|
43
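A sketch of the two-scan partition approach, reusing the apriori() function from the earlier sketch; sigma is the relative minsup, and the partition count k_parts is an arbitrary assumption.

from math import ceil

def partition_mining(transactions, sigma, k_parts=2):
    # scan 1: mine each partition locally with absolute threshold ceil(sigma * |DBi|)
    n = len(transactions)
    size = ceil(n / k_parts)
    parts = [transactions[i:i + size] for i in range(0, n, size)]
    candidates = set()
    for p in parts:
        candidates |= apriori(p, ceil(sigma * len(p)))   # apriori() from the previous sketch
    # scan 2: count the union of the local frequent itemsets over the whole database
    frozen = [frozenset(t) for t in transactions]
    return {c for c in candidates
            if sum(c <= t for t in frozen) >= ceil(sigma * n)}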
Generating Association Rules
• How to efficiently generate rules from frequent
itemsets?
– In general, confidence does not have an anti-monotone property:
c(ABC → D) can be larger or smaller than c(AB → D)
– But confidence of rules generated from the same itemset has an anti-monotone property
• e.g., for L = {A,B,C,D}: c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD)
– Confidence is anti-monotone w.r.t. the number of items on the RHS of the rule
44
Example
• If {A,B,C,D} is a frequent itemset, candidate rules:
A BCD,
B ACD,
C ABD,
D ABC
AB CD,
AC  BD,
AD  BC,
BC AD,
BD AC,
CD AB,
ABC D,
ABD C,
ACD B,
BCD A,
• If |L| = k, then there are 2k – 2 candidate association
rules (ignoring L   and   L)
45
Generating Association Rules
• Given a frequent itemset Z, we look at all proper subsets X ⊂ Z to compute rules of the form X → Y, where Y = Z \ X
• The rule must be frequent
– s = sup(X ∪ Y) = sup(Z) ≥ minsup
• We compute the confidence as
– c = sup(X ∪ Y) / sup(X) = sup(Z) / sup(X)
– If c ≥ minconf, this rule is a strong association rule
– Otherwise, conf(W → Z\W) ≤ c < minconf for all subsets W ⊂ X, because sup(W) ≥ sup(X); we can thus avoid checking the subsets of X
46
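A Python sketch of this rule-generation scheme (names mine); it assumes a dict sup mapping each frequent itemset, as a frozenset, to its support count, and it only expands the subsets of antecedents that passed, per the pruning argument above.

from itertools import combinations

def gen_rules(Z, sup, minconf):
    # generate strong rules X -> Z\X from a frequent itemset Z (largest X first)
    Z = frozenset(Z)
    rules = []
    level = [frozenset(c) for c in combinations(Z, len(Z) - 1)]   # largest proper subsets of Z
    while level:
        next_level = set()
        for X in level:
            c = sup[Z] / sup[X]
            if c >= minconf:
                rules.append((set(X), set(Z - X), c))
                # only antecedents that passed are shrunk further; failed ones prune their subsets
                next_level |= {frozenset(s) for s in combinations(X, len(X) - 1) if s}
        level = list(next_level)
    return rules

# toy usage with the absolute supports from the Beer/Diaper example
sup = {frozenset(["Beer"]): 3, frozenset(["Diaper"]): 4, frozenset(["Beer", "Diaper"]): 3}
print(gen_rules({"Beer", "Diaper"}, sup, minconf=0.8))
# [({'Beer'}, {'Diaper'}, 1.0)]   (Diaper -> Beer has confidence 0.75 and fails)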
Example
[Figure: the lattice of rules generated from {A,B,C,D}, from ABCD ⇒ {} at the top down to single-item antecedents; if BCD ⇒ A is found to be a low-confidence rule, every rule below it whose antecedent is a subset of BCD (CD ⇒ AB, BD ⇒ AC, BC ⇒ AD, D ⇒ ABC, C ⇒ ABD, B ⇒ ACD) is pruned]
47
Association Rule Mining Algorithm
[Algorithm figure: for each frequent itemset Z in F (the set of frequent itemsets), let A be all proper subsets of Z; starting from the largest subset X, output X → Z\X as a strong association rule if its confidence meets minconf; if X fails, all of its subsets fail as well and are skipped]
48
Finding Similar Patterns
• Many data mining problems can be expressed as finding "similar" patterns:
1. Web pages with similar words, e.g., for classification by topic
2. NetFlix users with similar tastes in movies, for recommendation systems
• Dual: movies with similar sets of fans
3. Images of related things
• The best techniques depend on whether you are looking for items that are somewhat similar
– Shingling & Minhashing
• or very similar
– Locality-sensitive hashing
49
Example
• Problem: comparing documents
– Goal: common text, not common topic
– Special case: identical documents, or one document contained
character-by-character in another
– General case: many small pieces of one doc appear out of order
in another
– Applications:
• Mirror sites, or approximate mirrors: Don’t want to show both in a
search
• Plagiarism, including large quotations
• Similar news articles at many news sites: cluster articles by “same
story.”
50
Three Essential Techniques
for Similar Pattern Discovery
• Shingling : convert documents, emails, etc., to sets
• Minhashing : convert large sets to short signatures,
while preserving similarity
• Locality-sensitive hashing: focus on pairs of signatures
likely to be similar
[Pipeline: Document → (shingling) → the set of strings of length k that appear in the document → (minhashing) → signatures: short integer vectors that represent the sets and reflect their similarity → (locality-sensitive hashing) → candidate pairs: the pairs of signatures that we need to test for similarity]
51
Shingles
• A k-shingle (or k-gram) for a document is a sequence of
k characters that appears in the document
– Example: k=2; doc = abcab. Set of 2-shingles = {ab, bc, ca}
– Option: regard shingles as a bag, and count ab twice
– Represent a doc by its set of k-shingles
• Assumption
– Documents that have lots of shingles in common have similar
text, even if the text appears in different order
• You must pick k large enough, or most documents will
have most shingles
– k = 5 is OK for short documents; k = 10 is better for long
documents
52
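A tiny Python sketch of k-shingling that reproduces the example above.

def shingles(doc, k):
    # the set of all k-character substrings (k-shingles / k-grams) of doc
    return {doc[i:i + k] for i in range(len(doc) - k + 1)}

print(shingles("abcab", 2))   # {'ab', 'bc', 'ca'}, as in the slide's example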
Jaccard Similarity
• The Jaccard similarity of two sets is the size of their
intersection divided by the size of their union.
– Sim(C1, C2) = |C1 ∩ C2| / |C1 ∪ C2|
• Boolean Matrices
– Rows = elements of the universal set
– Columns = sets
– 1 in row e and column S if and only if e is a member of S
– Column similarity is the Jaccard similarity of the sets of their rows
with 1
53
Example
Element | C1 | C2
A       | 0  | 1
B       | 1  | 0
C       | 1  | 1 *
D       | 0  | 0
E       | 1  | 1 *
F       | 0  | 1

Sim(C1, C2) = 2/5 = 0.4 (2 rows where both columns are 1, out of 5 rows where at least one is 1)
54
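A small Python sketch computing the column similarity above directly from the 0/1 columns.

def jaccard_columns(c1, c2):
    # Jaccard similarity of two 0/1 columns of the boolean matrix
    both = sum(1 for a, b in zip(c1, c2) if a and b)
    either = sum(1 for a, b in zip(c1, c2) if a or b)
    return both / either

C1 = [0, 1, 1, 0, 1, 0]   # rows A..F from the slide
C2 = [1, 0, 1, 0, 1, 1]
print(jaccard_columns(C1, C2))   # 0.4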
Signatures
• Problem:
1. The sets may be so large, or so many, that they cannot fit in main memory
2. There may be so many sets that comparing all pairs of sets takes too much time
• Key idea: "hash" each column C to a small signature Sig(C), such that:
1. Sig(C) is small enough that we can fit a signature in main memory for each column
2. Sim(C1, C2) is the same as the "similarity" of Sig(C1) and Sig(C2)
55
MinHashing
1. Imagine the rows permuted randomly
2. Define “hash” function h (C ) = the number of the first
(in the permuted order) row in which column C has 1
3. Use several (e.g., 100) independent hash functions to
create a signature
• Surprising Property
– The probability (over ALL permutations of the rows) that h (C1)
= h (C2) is the same as Sim (C1, C2)
56
Example
Input matrix (rows R1–R5, columns C1–C3):
      C1  C2  C3
R1     1   0   1
R2     0   1   1
R3     1   0   0
R4     1   0   1
R5     0   1   0

Signature matrix (one row per permutation):
                        S1  S2  S3
Perm 1 = (1,2,3,4,5):    1   2   1
Perm 2 = (5,4,3,2,1):    4   5   4
Perm 3 = (3,4,5,1,2):    3   5   4

Similarities:
           1-2    1-3    2-3
Col-Col     0     0.5    0.25
Sig-Sig     0     0.67   0
57
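A Python sketch (names mine) that reproduces the signature matrix above; each permutation is given as the scan order of the rows, and h(C) reports the original (1-based) number of the first scanned row in which column C has a 1.

def minhash_signature(matrix, permutations):
    # matrix: list of rows, each a list of 0/1 values per column
    n_cols = len(matrix[0])
    sig = []
    for perm in permutations:                 # perm: 0-based row indices in scan order
        sig.append([next(r + 1 for r in perm if matrix[r][c] == 1)
                    for c in range(n_cols)])
    return sig

M = [[1, 0, 1], [0, 1, 1], [1, 0, 0], [1, 0, 1], [0, 1, 0]]   # rows R1..R5
perms = [[0, 1, 2, 3, 4], [4, 3, 2, 1, 0], [2, 3, 4, 0, 1]]   # the three permutations above
print(minhash_signature(M, perms))   # [[1, 2, 1], [4, 5, 4], [3, 5, 4]]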
Locality Sensitive Hashing
• Checking All Pairs is Hard
– While the signatures of all columns may fit in main memory,
comparing the signatures of all pairs of columns is quadratic in
the number of columns
• General idea
– Use a function f(x,y) that tells whether or not x and y are a candidate pair
– For minhash matrices: hash columns to many buckets, and make
elements of the same bucket candidate pairs
58
Partition Into Bands
[Figure: the signature matrix M is divided into b bands of r rows each; each column of M is one signature]
59
Partition into Bands
• Divide matrix M into b bands of r rows
– For each band, hash its portion of each column to a hash table
with k buckets
• Make k as large as possible
• Candidate column pairs are those that hash to the same
bucket for ≥ 1 band
– Tune b and r to catch most similar pairs, but few non-similar pairs
60
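A minimal Python sketch of the banding trick; using the band's tuple of values directly as the dictionary key stands in for hashing the band to one of k buckets.

from collections import defaultdict

def lsh_candidates(signatures, b, r):
    # signatures: dict column_id -> list of b*r minhash values
    candidates = set()
    for band in range(b):
        buckets = defaultdict(list)
        for col, sig in signatures.items():
            buckets[tuple(sig[band * r:(band + 1) * r])].append(col)
        for cols in buckets.values():          # columns sharing a bucket in this band
            for i in range(len(cols)):
                for j in range(i + 1, len(cols)):
                    candidates.add((cols[i], cols[j]))
    return candidates

sigs = {"C1": [1, 4, 3], "C2": [2, 5, 5], "C3": [1, 4, 4]}   # signatures from the earlier example
print(lsh_candidates(sigs, b=3, r=1))   # {('C1', 'C3')}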
Example
[Figure: each band (r rows) of the signature matrix M is hashed to buckets; columns 2 and 6 land in the same bucket, so they are probably identical in that band; columns 6 and 7 land in different buckets, so they are surely different in that band]
61
Example
• Suppose 100,000 columns and signatures of 100 integers; the signatures then take 40 MB. We want all 80%-similar pairs
– 5,000,000,000 pairs of signatures can take a while to compare
– If we choose 20 bands of 5 integers/band, and suppose C1, C2 are 80% similar
• Probability C1, C2 are identical in one particular band: (0.8)^5 ≈ 0.328
• Probability C1, C2 are not identical in any of the 20 bands: (1 − 0.328)^20 ≈ 0.00035
• i.e., only about 1/3000 of the 80%-similar column pairs are false negatives
62
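A few lines of Python verifying the numbers above, plus the threshold formula used on the next two slides.

b, r, s = 20, 5, 0.8
p_band = s ** r              # probability one particular band is identical: ~0.328
p_miss = (1 - p_band) ** b   # probability no band is identical: ~0.00035
print(p_band, p_miss)
print((1 / b) ** (1 / r))    # threshold where the S-curve rises: ~0.5493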
Parameter Setting
Probability of sharing a bucket, as a function of the similarity s of two sets: 1 − (1 − s^r)^b
– A pair becomes a candidate when at least one band is identical (all r rows of that band are equal); with no identical band, the pair is missed
– The probability rises sharply around the threshold s ≈ (1/b)^(1/r)
[Figure: S-curve of the probability of sharing a bucket vs. the similarity s of two sets]
63
Example: b = 20, r = 5 → threshold s = (1/b)^(1/r) ≈ 0.5493
64