
Machine Learning & Data Mining
Part 1: The Basics
Jaime Carbonell (with contributions from Tom
Mitchell, Sebastian Thrun and Yiming Yang)
Carnegie Mellon University
jgc@cs.cmu.edu
December, 2008
© 2008, Jaime G. Carbonell
Some Definitions (KBS vs ML)
- Knowledge-Based Systems
  - Rules, procedures, semantic nets, Horn clauses
  - Inference: matching, inheritance, resolution
  - Acquisition: manual, from human experts
- Machine Learning
  - Data: tables, relations, attribute lists, …
  - Inference: rules, trees, decision functions, …
  - Acquisition: automated, from data
- Data Mining
  - Machine learning applied to large, real problems
  - May be augmented with KBS
Ingredients for Machine Learning
- “Historical” data (e.g., DB tables)
  - e.g., products (features, marketing, support, …)
  - e.g., competition (products, pricing, customers)
  - e.g., customers (demographics, purchases, …)
- Objective function (to be predicted or optimized)
  - e.g., maximize revenue per customer
  - e.g., minimize manufacturing defects
- Scalable machine learning method(s)
  - e.g., decision-tree induction, logistic regression
  - e.g., “active” learning, clustering
Sample ML/DM Applications I
- Credit Scoring
  - Training: past applicant profiles, how much credit was given, payback or default
  - Input: applicant profile (income, debts, …)
  - Objective: credit score + maximum credit amount
- Fraud Detection (e.g., credit-card transactions)
  - Training: past known legitimate & fraudulent transactions
  - Input: proposed transaction (location, customer, amount, …)
  - Objective: approve/block decision
Sample ML/DM Applications II
- Demographic Segmentation
  - Training: past customer profiles (age, gender, education, income, …) + product preferences
  - Input: new product description (features)
  - Objective: predict market-segment affinity
- Marketing/Advertisement Effectiveness
  - Training: past advertisement campaigns, demographic targets, product categories
  - Input: proposed advertisement campaign
  - Objective: projected effectiveness (sales increase modulated by marketing cost)
Sample ML/DM Applications III
- Product (or Part) Reliability
  - Training: past products/parts + specs at manufacturing + customer usage + maintenance records
  - Input: new part + expected usage
  - Objective: mean time to failure (replacement)
- Manufacturing Tolerances
  - Training: past product/part manufacturing process, tolerances, inspections, …
  - Input: new part + expected usage
  - Objective: optimal manufacturing precision (minimize combined cost of failures and manufacturing)
Sample ML/DM Applications IV
- Mechanical Diagnosis
  - Training: past observed symptoms at (or prior to) breakdown + underlying cause
  - Input: current symptoms
  - Objective: predict the cause of failure
- Mechanical Repair
  - Training: cause of failure + product usage + repair (or preventive-maintenance) effectiveness
  - Input: new failure cause + product usage
  - Objective: recommended repair (or preventive-maintenance operation)
Sample ML/DM Applications V
- Billeting (job assignments)
  - Training: employee profiles, position profiles, employee performance in assigned position
  - Input: new employee or new position profile
  - Objective: predict performance in the position
- Text Mining & Routing (e.g., customer centers)
  - Training: electronic problem reports, customer requests + who should handle them
  - Input: new incoming texts
  - Objective: assign a category + route or reply
Preparing Historical Data
- Extract a DB table with all the needed information
  - Select, join, project, aggregate, …
  - Filter out rows with significant missing data
- Determine the predictor attributes (columns)
  - Ask a domain expert for the relevant attributes, or
  - Start with all attributes and automatically select the most predictive ones (feature selection)
- Determine the to-be-predicted attribute (column)
  - The objective of the DM task (a number, a decision, …), as sketched below
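Below is a minimal Python sketch of these preparation steps on a toy in-memory table; the column names, values, and the list-of-dicts representation are illustrative assumptions, not the course's actual data or tooling.

```python
# A minimal sketch: drop rows with missing values, then split columns into
# predictor attributes and the to-be-predicted (objective) attribute.
rows = [
    {"income": 85, "job_now": "Y", "delinq_accts": 1, "good_cust": "Y"},
    {"income": None, "job_now": "N", "delinq_accts": 0, "good_cust": "N"},  # missing income
    {"income": 60, "job_now": "Y", "delinq_accts": 3, "good_cust": "N"},
]

objective = "good_cust"
predictors = [c for c in rows[0] if c != objective]

# Filter out rows with missing data in any needed column
clean = [r for r in rows if all(r[c] is not None for c in predictors + [objective])]

X = [[r[c] for c in predictors] for r in clean]   # predictor attributes
y = [r[objective] for r in clean]                 # objective attribute
print(X, y)
```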
Sample DB Table
[predictor attributes]                                        [objective]

Acct.   Income    Job    Tot Num   Max Num   Owns    Credit  |  Good
numb.   in K/yr   Now?   Delinq    Delinq    home?   years   |  cust.?
                         accts     cycles                    |
-------------------------------------------------------------+---------
1001    85        Y      1         1         N       2       |  Y
1002    60        Y      3         2         Y       5       |  N
1003    ?         N      0         0         N       2       |  N
1004    95        Y      1         2         N       9       |  Y
1005    110       Y      1         6         Y       3       |  Y
1006    29        Y      2         1         Y       1       |  N
1007    88        Y      6         4         Y       8       |  N
1008    80        Y      0         0         Y       0       |  Y
1009    31        Y      1         1         N       1       |  Y
1011    ?         Y      ?         0         ?       7       |  Y
1012    75        ?      2         4         N       2       |  N
1013    20        N      1         1         N       3       |  N
1014    65        Y      1         3         Y       1       |  Y
1015    65        N      1         2         N       8       |  Y
1016    20        N      0         0         N       0       |  N
1017    75        Y      1         3         N       2       |  N
1018    40        N      0         0         Y       1       |  Y
Supervised Learning on DB Table
- Given: a DB table
  - with identified predictor attributes x1, x2, …, xn
  - and an objective attribute y
- Find: a prediction function
  $F_k : x_1, \ldots, x_n \rightarrow y$, where $F_k \in \{F_1, F_2, \ldots, F_m\}$
- Subject to: error minimization on the data table M
  $f_{best} = \arg\min_{f_k \in \{f_1, \ldots, f_m\}} \; \sum_{i \in \mathrm{Rows}(M)} \left( y_i - f_k(x_i) \right)^2$
  - Least-squares error, or L1 norm, or L∞ norm, …
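As a concrete (if toy) illustration of the error-minimization criterion, the sketch below picks, from a few made-up candidate functions, the one with the lowest squared error over a small hypothetical table M; the data and candidate functions are assumptions for illustration only.

```python
# A toy table M of (x_i, y_i) rows and three made-up candidate functions.
M = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 8.1)]

candidates = {
    "f1: y = 2x":     lambda x: 2.0 * x,
    "f2: y = x + 1":  lambda x: x + 1.0,
    "f3: y = 3x - 1": lambda x: 3.0 * x - 1.0,
}

def squared_error(f, rows):
    """Sum of (y_i - f(x_i))^2 over the table rows -- the least-squares objective."""
    return sum((y - f(x)) ** 2 for x, y in rows)

# f_best = the candidate with the smallest training error on M
f_best = min(candidates, key=lambda name: squared_error(candidates[name], M))
print(f_best, squared_error(candidates[f_best], M))   # "f1: y = 2x", ~0.07
```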
Popular Predictor Functions
- Linear discriminators (next slides)
- k-Nearest Neighbors (lecture #2)
- Decision trees (lecture #5)
- Linear & logistic regression (lecture #4)
- Probabilistic methods (lecture #3)
- Neural networks
  - 2-layer ≈ logistic regression
  - Multi-layer → difficult to scale up
- Classification rule induction (in a few slides)
Linear Discriminator Functions
(figure: a two-class problem; instances of the two classes, y = {●, ○}, plotted in the x1–x2 plane)
Linear Discriminator Functions
(figure: the same two-class data in the x1–x2 plane, now with a separating line drawn between the classes)
The discriminator is a linear function of the attributes:
$y = \sum_{i=0}^{n} a_i x_i$
Linear Discriminator Functions
(figure: the same two-class data and linear discriminator, with a new, unlabeled point classified according to which side of the line it falls on)
$y = \sum_{i=0}^{n} a_i x_i$
Issues with Linear Discriminators
- What is the “best” placement of the discriminator?
  - Maximize the margin
  - In general → Support Vector Machines
- What if there are K classes (K > 2)?
  - Must learn K different discriminators
  - Each discriminates class k_i vs. all other classes k_{j≠i}
- What if the classes are not linearly separable?
  - Minimal-error (L1 or L2) placement (regression)
  - Give up on linear discriminators (→ other f_k's)
Maximizing the Margin
(figure: the two-class data with the maximum-margin linear separator; the margin is the distance from the separator to the closest points of either class)
Nearly-Separable Classes
(figure: a two-class problem in the x1–x2 plane where the classes are nearly, but not perfectly, linearly separable; any linear separator misclassifies a few points)
Minimizing Training Error
- Optimal placement of the maximum-margin separator
  - Quadratic programming (Support Vector Machines)
  - Slack variables to accommodate training errors
- Minimizing error metrics
  - Number of errors:
    $L_0(f, X, y) = \frac{1}{n} \sum_{i=1\ldots n} I(f(x_i) \neq y_i)$
  - Magnitude of error:
    $L_1(f, X, y) = \sum_{i=1\ldots n} |f(x_i) - y_i| \; I(f(x_i) \neq y_i)$
  - Squared error:
    $L_2(f, X, y) = \frac{1}{n} \sum_{i=1\ldots n} (f(x_i) - y_i)^2 \; I(f(x_i) \neq y_i)$
  - Chebyshev (L∞) norm:
    $L_\infty(f, X, y) = \max_{i=1\ldots n} |f(x_i) - y_i| \; I(f(x_i) \neq y_i)$
Symbolic Rule Induction
General idea:
- Labeled instances are DB tuples
- Rules are generalized tuples
- Generalization occurs at terms in tuples
- Generalize on new E+ not correctly predicted
- Specialize on new E- not correctly predicted
- Ignore correctly predicted E+ or E- (error-driven learning)
Symbolic Rule Induction (2)
Example term generalizations
- Constant ⇒ disjunction
  - e.g., if a small portion of the value set has been seen
- Constant ⇒ least-common-generalizer class
  - e.g., if a large portion of the value set has been seen
- Number (or ordinal) ⇒ range
  - e.g., if densely, sequentially sampled
Symbolic Rule Induction Example (1)
Age   Gender   Temp   b-cult   c-cult   loc   Skin     disease
65    M        101    +        .23      USA   normal   strep
25    M        102    +        .00      CAN   normal   strep
65    M        102    -        .78      BRA   rash     dengue
36    F        99     -        .19      USA   normal   *none*
11    F        103    +        .23      USA   flush    strep
88    F        98     +        .21      CAN   normal   *none*
39    F        100    +        .10      BRA   normal   strep
12    M        101    +        .00      BRA   normal   strep
15    F        101    +        .66      BRA   flush    dengue
20    F        98     +        .00      USA   rash     *none*
81    M        98     -        .99      BRA   rash     ec-12
87    F        100    -        .89      USA   rash     ec-12
12    F        102    +        ??       CAN   normal   strep
14    F        101    +        .33      USA   normal   ?
67    M        102    +        .77      BRA   rash     ?
Symbolic Rule Induction Example (2)
Candidate Rules:

IF   age = [12,65]
     gender = *any*
     temp = [100,103]
     b-cult = +
     c-cult = [.00,.23]
     loc = *any*
     skin = (normal, flush)
THEN strep

IF   age = (15,65)
     gender = *any*
     temp = [101,102]
     b-cult = *any*
     c-cult = [.66,.78]
     loc = BRA
     skin = rash
THEN dengue

Disclaimer: These are not real medical records or rules.
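To make the rule representation concrete, here is a hedged sketch of how such a generalized-tuple rule might be encoded and matched in Python; the dictionary-of-tests encoding and the sample patient are assumptions for illustration, not part of the original slides.

```python
# A sketch of the "strep" rule above as a dict of per-attribute tests.
# Attributes set to *any* in the rule simply impose no test.
strep_rule = {
    "age":    lambda v: 12 <= v <= 65,
    "temp":   lambda v: 100 <= v <= 103,
    "b-cult": lambda v: v == "+",
    "c-cult": lambda v: 0.00 <= v <= 0.23,
    "skin":   lambda v: v in ("normal", "flush"),
}

def rule_matches(rule, record):
    """A rule fires if every constrained attribute satisfies its test."""
    return all(test(record[attr]) for attr, test in rule.items())

# A hypothetical new patient record (not one of the rows above).
new_patient = {"age": 14, "gender": "F", "temp": 101, "b-cult": "+",
               "c-cult": 0.10, "loc": "USA", "skin": "normal"}
print(rule_matches(strep_rule, new_patient))   # True
```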
Types of Data Mining
- “Supervised” methods (this DM course)
  - Training data has both predictor attributes & objective (to-be-predicted) attributes
  - Predict discrete classes → classification
  - Predict continuous values → regression
  - Duality: classification ↔ regression
- “Unsupervised” methods
  - Training data without objective attributes
  - Goal: find novel & interesting patterns
  - Cutting-edge research, fewer success stories
  - Semi-supervised methods: market-basket analysis, …
Machine Learning Application
Process in a Nutshell
- Choose a problem where
  - Prediction is valuable and non-trivial
  - Sufficient historical data is available
  - The objective is measurable (including in the past data)
- Prepare the data
  - Tabular form, cleaned, divided into training & test sets
- Select a machine learning algorithm
  - Human-readable decision function → rules, trees, …
  - Robust with noisy data → kNN, logistic regression, …
Machine Learning Application
Process in a Nutshell (2)
- Train the ML algorithm on the training data set
  - Each ML method has a different training process
  - Training uses both predictor & objective attributes
- Run the trained ML algorithm on the test data set
  - Testing uses only the predictor attributes and outputs predictions for the objective attributes
  - Compare predictions vs. actual objective attributes (see lecture 2 for evaluation metrics)
- If accuracy ≥ threshold, done
  - Else, try a different ML algorithm, different parameter settings, more training data, …
Sample DB Table (same)
[predictor attributes]                                        [objective]

Acct.   Income    Job    Tot Num   Max Num   Owns    Credit  |  Good
numb.   in K/yr   Now?   Delinq    Delinq    home?   years   |  cust.?
                         accts     cycles                    |
-------------------------------------------------------------+---------
1001    85        Y      1         1         N       2       |  Y
1002    60        Y      3         2         Y       5       |  N
1003    ?         N      0         0         N       2       |  N
1004    95        Y      1         2         N       9       |  Y
1005    100       Y      1         6         Y       3       |  Y
1006    29        Y      2         1         Y       1       |  N
1007    88        Y      6         4         Y       8       |  N
1008    80        Y      0         0         Y       0       |  Y
1009    31        Y      1         1         N       1       |  Y
1011    ?         Y      ?         0         ?       7       |  Y
1012    75        ?      2         4         N       2       |  N
1013    20        N      1         1         N       3       |  N
1014    65        Y      1         3         Y       1       |  Y
1015    65        N      1         2         N       8       |  Y
1016    20        N      0         0         N       0       |  N
1017    75        Y      1         3         N       2       |  N
1018    40        N      0         0         Y       10      |  Y
Feature Vector Representation
- Predictor-attribute rows in DB tables can be represented as vectors. For instance, the 2nd & 4th rows of predictor attributes in our DB table are:

  R2 = [60   Y   3   2   Y   5]
  R4 = [95   Y   1   2   N   9]

  Converting to numbers (Y = 1, N = 0), we get:

  R2 = [60   1   3   2   1   5]
  R4 = [95   1   1   2   0   9]
Vector Similarity
- Suppose we have a new credit applicant:

  R-new = [65   1   1   2   0   10]

- To which of R2 or R4 is she closer?

  R2 = [60   1   3   2   1   5]
  R4 = [95   1   1   2   0   9]

- What should we use as a SIMILARITY METRIC?
- Should we first NORMALIZE the vectors?
  - If not, the largest component will dominate
Normalizing Vector Attributes
- Linear normalization (often sufficient)
  - Find the max & min values of each attribute
  - Normalize each attribute by:
    $A_{norm} = \frac{A_{actual} - A_{min}}{A_{max} - A_{min}}$
  - Apply to all vectors (historical + new), normalizing each attribute, e.g.:
    $A_{R2,1} = (60 - 20) / (100 - 20) = 0.5$
Normalizing Full Vectors
- Normalizing the new applicant vector:
  R-new = [65   1   1   2   0   10]  →  [.56   1   .17   .33   0   1]
- And normalizing the two past customer vectors:
  R2 = [60   1   3   2   1   5]  →  [.50   1   .50   .33   1   .50]
  R4 = [95   1   1   2   0   9]  →  [.94   1   .17   .33   0   .90]
- What if some attributes are known to be more important, say salary (A1) & delinquencies (A3)?
  - Weight accordingly, e.g., ×2 for each
  - e.g., R-new-weighted: [1.12   1   .34   .33   0   1]
Similarity Functions (inverse dist)
- Now that we have weighted, normalized vectors, how do we tell exactly their degree of similarity?
- Inverse sum of differences (L1):
  $\mathrm{sim}_{inv\text{-}diff}(\vec{a}, \vec{b}) = \frac{1}{\sum_{i=1\ldots n} |a_i - b_i|}$
- Inverse Euclidean distance (L2):
  $\mathrm{sim}_{Euclid}(\vec{a}, \vec{b}) = \frac{1}{\sqrt{\sum_{i=1\ldots n} (a_i - b_i)^2}}$
Similarity Functions (direct)
- Dot-product similarity:
  $\mathrm{sim}_{dot}(\vec{a}, \vec{b}) = \vec{a} \cdot \vec{b} = \sum_{i=1,\ldots,n} a_i b_i$
- Cosine similarity (dot product of unit vectors):
  $\mathrm{sim}_{\cos}(\vec{a}, \vec{b}) = \frac{\vec{a} \cdot \vec{b}}{\|\vec{a}\|\,\|\vec{b}\|} = \frac{\sum_{i=1,\ldots,n} a_i b_i}{\sqrt{\sum_{i=1,\ldots,n} a_i^2}\;\sqrt{\sum_{i=1,\ldots,n} b_i^2}}$
Alternative: Similarity Matrix for Non-Numeric Attributes

           tiny   little   small   medium   large   huge
tiny       1.0
little     0.8    1.0
small      0.7    0.9      1.0
medium     0.5    0.7      0.7     1.0
large      0.2    0.3      0.3     0.5     1.0
huge       0.0    0.1      0.2     0.3     0.8     1.0

- The diagonal must be 1.0
- The monotonicity property must hold
- The triangle inequality must hold
- The transitive property must hold
- Additivity/compositionality need not hold
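One plausible way to use such a matrix in code is a symmetric dictionary lookup, sketched below; storing only the lower triangle is an implementation choice, not something prescribed by the slides.

```python
# Lower triangle of the "size" similarity matrix above.
size_sim = {
    ("tiny", "tiny"): 1.0,
    ("little", "tiny"): 0.8, ("little", "little"): 1.0,
    ("small", "tiny"): 0.7, ("small", "little"): 0.9, ("small", "small"): 1.0,
    ("medium", "tiny"): 0.5, ("medium", "little"): 0.7, ("medium", "small"): 0.7,
    ("medium", "medium"): 1.0,
    ("large", "tiny"): 0.2, ("large", "little"): 0.3, ("large", "small"): 0.3,
    ("large", "medium"): 0.5, ("large", "large"): 1.0,
    ("huge", "tiny"): 0.0, ("huge", "little"): 0.1, ("huge", "small"): 0.2,
    ("huge", "medium"): 0.3, ("huge", "large"): 0.8, ("huge", "huge"): 1.0,
}

def nominal_sim(a, b):
    """Symmetric lookup into the similarity matrix."""
    return size_sim.get((a, b), size_sim.get((b, a)))

print(nominal_sim("small", "medium"), nominal_sim("huge", "tiny"))   # 0.7, 0.0
```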
k-Nearest Neighbors Method
- No explicit “training” phase
- When a new case arrives (a vector of predictor attributes):
  - Find the k nearest neighbors (maximum similarity) among previous cases (row vectors in the DB table)
  - The k neighbors vote for the objective attribute
    - Unweighted majority vote, or
    - Similarity-weighted vote
- Works for both discrete and continuous objective attributes
Similarity-Weighted Voting in kNN
- If the objective attribute is discrete:
  $\mathrm{Value}_{obj}(y) = \arg\max_{C_i \in \mathrm{Range}(\mathrm{Value}_{obj})} \;\; \sum_{[x_j \in kNN(y)] \wedge [\mathrm{value}_{obj}(x_j) = C_i]} \mathrm{sim}(x_j, y)$
- If the objective attribute is continuous:
  $\mathrm{Value}_{obj}(y) = \frac{\sum_{x_j \in kNN(y)} \mathrm{value}_{obj}(x_j)\,\mathrm{sim}(x_j, y)}{\sum_{x_j \in kNN(y)} \mathrm{sim}(x_j, y)}$
Applying kNN to Real Problems 1
- How does one choose the vector representation?
  - Easy: vector = predictor attributes
  - What if attributes are not numerical?
    - Convert them (e.g., High = 2, Med = 1, Low = 0), or
    - Use a similarity function over nominal values, e.g., equality or edit distance on strings
- How does one choose a distance function?
  - Hard: no magic recipe; try simpler ones first
  - This implies a need for systematic testing (discussed in the coming slides)
Applying kNN to Real Problems 2
- How does one determine whether the data should be normalized?
  - Normalization is usually a good idea
  - One can try kNN both ways to make sure
- How does one determine “k” in kNN?
  - k is often determined empirically
  - A good starting point is:
    $k \approx \left\lceil \log_2(\mathrm{size}(DB)) \right\rceil$
Evaluating Machine Learning
- Accuracy = correct predictions / total predictions
  - Simplest & most popular metric
  - But misleading for very-rare-event prediction
- Precision, recall & F1
  - Borrowed from Information Retrieval
  - Applicable to very-rare-event prediction
- Correlation (between predicted & actual values) for continuous objective attributes
  - R², kappa coefficient, …
Sample Confusion Matrix
                      Predicted diagnosis
True diagnosis        Shorted     Loose       Burnt      Not
                      power sup   connect's   resistor   plugged in
Shorted power sup     50          0           10         0
Loose connect's       1           120         0          12
Burnt resistor        12          0           60         0
Not plugged in        0           8           5          110
Measuring Accuracy
- Accuracy = correct / total
- Error = incorrect / total
- Hence: accuracy = 1 − error
  $A = \frac{\mathrm{Trace}(C)}{\mathrm{Full}(C)} = \frac{\sum_{i=1,\ldots,n} c_{i,i}}{\sum_{i=1,\ldots,n} \sum_{j=1,\ldots,n} c_{i,j}}$
- For the diagnosis example:
  - A = 340/386 = 0.88, E = 1 − A = 0.12
What About Rare Events?
                      Predicted diagnosis
True diagnosis        Shorted     Loose       Burnt      Not
                      power sup   connect's   resistor   plugged in
Shorted power sup     0           0           10         0
Loose connect's       1           120         0          12
Burnt resistor        12          0           60         0
Not plugged in        0           8           5          160
Rare Event Evaluation
- Accuracy for this example = 0.88
  - …but there are NO correct predictions for “shorted power supply”, 1 of the 4 diagnoses
- Alternative: per-diagnosis (per-class) accuracy:
  $A(\mathrm{class}_i) = \frac{c_{i,i}}{\left(\sum_{j=1,\ldots,n} c_{j,i}\right) + \left(\sum_{j=1,\ldots,n,\; j \neq i} c_{i,j}\right)}$
  - A(“shorted PS”) = 0/22 = 0
  - A(“not plugged in”) = 160/184 = 0.87
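The sketch below computes overall accuracy (trace over total) and the per-class accuracy above from the rare-event confusion matrix; the row/column layout follows the slides (rows = true class, columns = predicted class).

```python
# The "rare event" confusion matrix from the slide above.
C = [[0,   0, 10,   0],    # shorted power supply (true)
     [1, 120,  0,  12],    # loose connections
     [12,  0, 60,   0],    # burnt resistor
     [0,   8,  5, 160]]    # not plugged in

n = len(C)
total = sum(sum(row) for row in C)
trace = sum(C[i][i] for i in range(n))
print(round(trace / total, 2))     # ~0.88 despite never predicting the rare class

def per_class_accuracy(C, i):
    col = sum(C[j][i] for j in range(n))                    # everything predicted as class i
    row_others = sum(C[i][j] for j in range(n) if j != i)   # class-i cases predicted as something else
    return C[i][i] / (col + row_others)

print(per_class_accuracy(C, 0))    # 0.0 -- "shorted power supply" is never predicted correctly
```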
ROC Curves (ROC = Receiver Operating Characteristic)
- Sensitivity = TP / (TP + FN)
- Specificity = TN / (TN + FP)
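A small sketch of sensitivity, specificity, and one common way ROC points are generated (sweeping a decision threshold over classifier scores) follows; the scores and labels are made up, and the threshold-sweeping construction is an assumption, since the slides only define the two rates.

```python
def sens_spec(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)

scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]   # classifier confidence for "positive"
labels = [1,   1,   0,   1,   0,   0]

roc_points = []
for thr in sorted(set(scores), reverse=True):
    preds = [1 if s >= thr else 0 for s in scores]
    sens, spec = sens_spec(labels, preds)
    roc_points.append((1 - spec, sens))   # ROC plots sensitivity vs. 1 - specificity
print(roc_points)
```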
If Plenty of Data, Evaluate with a Holdout Set
(figure: the data is split into a training portion and a held-out evaluation portion; train on the first, then evaluate on the holdout to measure error)
- Often also used for parameter optimization
Finite Cross-Validation Set
- True error (true risk):
  $e_D = \int_D \| y - f(x, \alpha) \| \; p(x, y) \; dx\, dy$
- Test error (empirical risk):
  $\hat{e}_S = \frac{1}{m} \sum_{\langle x, y \rangle \in S} \| y - f(x, \alpha) \|$
  - D = all data, S = test data, m = # of test samples
Confidence Intervals
If
- S contains m examples, drawn independently
- m ≥ 30
Then
- With approximately 95% probability, the true error $e_D$ lies in the interval
  $\hat{e}_S \pm 1.96 \sqrt{\frac{\hat{e}_S (1 - \hat{e}_S)}{m}}$
Example:
- The hypothesis misclassifies 12 out of 40 examples in the cross-validation set S.
- Q: What will the “true” error be on future examples?
- A: With 95% confidence, the true error will lie in the interval [0.16, 0.44]:
  $\hat{e}_S = \frac{12}{40} = 0.3, \quad m = 40$
  $1.96 \sqrt{\frac{\hat{e}_S (1 - \hat{e}_S)}{m}} \approx 0.14$
  $\hat{e}_S \pm 1.96 \sqrt{\frac{\hat{e}_S (1 - \hat{e}_S)}{m}} = [0.16,\; 0.44]$
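The worked example can be checked directly; the sketch below computes the same 95% interval from the error count, sample size, and z = 1.96.

```python
import math

def error_confidence_interval(errors, m, z=1.96):
    """Approximate confidence interval for the true error, given test errors."""
    e_hat = errors / m
    half_width = z * math.sqrt(e_hat * (1 - e_hat) / m)
    return e_hat - half_width, e_hat + half_width

lo, hi = error_confidence_interval(12, 40)
print(round(lo, 2), round(hi, 2))   # ~0.16, ~0.44
```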
Confidence Intervals
If
- S contains m examples, drawn independently
- m ≥ 30
Then
- With approximately N% probability, the true error $e_D$ lies in the interval
  $\hat{e}_S \pm z_N \sqrt{\frac{\hat{e}_S (1 - \hat{e}_S)}{m}}$

  N%:   50%    68%    80%    90%    95%    98%    99%
  z_N:  0.67   1.00   1.28   1.64   1.96   2.33   2.58
Finite Cross-Validation Set
- True error:
  $e_D = \int_D \| y - f(x, \alpha) \| \; p(x, y) \; dx\, dy$
- Test error:
  $\hat{e}_S = \frac{1}{m} \sum_{\langle x, y \rangle \in S} \| y - f(x, \alpha) \|$
- The number of test errors, $k = \sum_{\langle x, y \rangle \in S} \| y - f(x, \alpha) \|$, is binomially distributed:
  $p(k) = \frac{m!}{k!\,(m-k)!} \; (e_D)^k \, (1 - e_D)^{m-k}$
k-fold Cross Validation
(figure: the data is split k ways; for each of the k folds in turn, train on the other k−1 parts and evaluate on the held-out part, yielding error_1, …, error_k)
error = Σ_i error_i / k
Cross Validation Procedure
- Purpose: evaluate DM accuracy on the training data
- Experiment: try different similarity functions, etc.
- Process (sketched in code below):
  - Divide the training data into k equal pieces (each piece is called a “fold”)
  - Train the classifier on all but the k-th fold
  - Test for accuracy on the k-th fold
  - Repeat with the (k−1)-th fold held out for testing, then the (k−2)-th, until every fold has been tested on
  - Report the average accuracy across the folds
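Here is a hedged Python sketch of that procedure; `train_fn` and `accuracy_fn` are placeholder hooks for whatever learner and metric are actually being evaluated, and the toy majority-class learner exists only to make the example runnable.

```python
def k_fold_cross_validation(rows, k, train_fn, accuracy_fn):
    folds = [rows[i::k] for i in range(k)]          # k roughly equal pieces
    accuracies = []
    for held_out in range(k):
        train = [r for i, f in enumerate(folds) if i != held_out for r in f]
        test = folds[held_out]
        model = train_fn(train)                     # train on all but the held-out fold
        accuracies.append(accuracy_fn(model, test)) # test on the held-out fold
    return sum(accuracies) / k                      # average accuracy across folds

# Toy usage: a "majority class" learner on labeled rows of the form (x, y).
labels = ["Y", "Y", "Y", "Y", "Y", "Y", "N", "N", "N"]
rows = list(enumerate(labels))
train_fn = lambda train: max(set(y for _, y in train), key=[y for _, y in train].count)
accuracy_fn = lambda model, test: sum(1 for _, y in test if y == model) / len(test)
print(k_fold_cross_validation(rows, 3, train_fn, accuracy_fn))   # 2/3 with this toy data
```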
The Jackknife
(figure: resampling the data, leaving out one part at a time)
Comparing Different Hypotheses: Paired t Test
- True difference:
  $d = e_D(h_1) - e_D(h_2)$
- For each partition k (test error on partition k):
  $\hat{d}_k = \hat{e}_{S,k}(h_1) - \hat{e}_{S,k}(h_2)$
- Average:
  $\hat{d} = \frac{1}{k} \sum_{i=1}^{k} \hat{d}_i$
- N% confidence interval (k − 1 degrees of freedom, N = confidence level):
  $\hat{d} \pm t_{N, k-1} \sqrt{\frac{1}{k(k-1)} \sum_{i=1}^{k} (\hat{d}_i - \hat{d})^2}$
Version Spaces (Mitchell, 1980)
(figure: the version-space lattice, from “Anything” at the top, through the G boundary, down to the S boundary and the specific instances at the bottom; the “target” concept lies between the S and G boundaries; b and N are the lattice parameters behind the O(b^N) worst case on the next slide)
Original & Seeded Version Spaces
- Version spaces (Mitchell, 1980)
  - Symbolic multivariate learning
  - The S & G sets define the lattice boundaries
  - Exponential worst case: O(b^N)
- Seeded Version Spaces (Carbonell, 2002)
  - Generality level + hypothesis seed
  - S & G subsets → effective lattice
  - Polynomial worst case: O(b^(k/2)), k = 3, 4
Seeded Version Spaces (Carbonell, 2002)
(figure: a version-space lattice for translation rules, from the G boundary “Xn → Ym” down to the S boundary, the specific instance “The big book” → “el libro grande”; near the “target” concept sits a hypothesis such as “Det Adj N → Det N Adj” with the agreement constraints (Y2 num) = (Y3 num), (Y2 gen) = (Y3 gen), (X3 num) = (Y2 num))
Seeded Version Spaces
(figure: the same lattice, with a seed hypothesis (the best guess) placed within distance k of the “target” concept, between the G boundary “Xn → Ym” and the S boundary “The big book” → “el libro grande”)
Naïve Bayes Classification
Some notation:
- Training instance index: i = 1, 2, …, I
- Term index: j = 1, 2, …, J
- Category index: k = 1, 2, …, K
- Training data: D(k) = ((x_i, y_i^(k)))
- Instance feature vector: x_i = (1, n_i1, n_i2, …, n_iJ)
- Output labels: y_i = (y_i^(1), y_i^(2), …, y_i^(K)), with y_i^(k) = 1 or 0
Bayes Classifier
- Assigning the most probable category to x:
  $\hat{c} = \arg\max_k P(c_k \mid x)$
  $\;\;\;= \arg\max_k \frac{P(c_k)\,P(x \mid c_k)}{P(x)}$   (Bayes rule)
  $\;\;\;= \arg\max_k P(c_k)\,P(x \mid c_k)$
  $\;\;\;= \arg\max_k \left[ \log P(c_k) + \log P(x \mid c_k) \right]$
- $\hat{P}(c_k) = \frac{\text{\# of training instances in } c_k}{I}$   (MLE)
- $\hat{P}(x_i \mid c_k) = \hat{P}(n_{i1}, \ldots, n_{iJ} \mid c_k) = \;?$   (multinomial distribution)
Maximum Likelihood Estimate (MLE)
n: # of objects in a random sample from a population
m: # of instances of a category among the n-object sample
p: true probability of any object belonging to the category

The likelihood of observing the data given model p is defined as:
$L(D_n \mid p) = P(D_n \mid p) = P(Y_1, \ldots, Y_n \mid p), \quad Y_i \in \{0,1\}, \; Y_i \sim \mathrm{Ber}(p)$
$\;\;\;= \prod_{i=1}^{n} P(Y_i \mid p) = p^m (1-p)^{n-m}$   (assuming i.i.d.)
$\log\left[ p^m (1-p)^{n-m} \right] = m \log p + (n-m) \log(1-p) = f(p)$

Setting the derivative of f(p) to zero yields:
$0 = \frac{d}{dp} f(p) = \frac{m}{p} - \frac{n-m}{1-p}, \quad (1-p)\,m = (n-m)\,p, \quad p = \frac{m}{n}$
Binomial Distribution
Consider a coin toss as a Bernoulli process, X ~ Ber(p):
$P(\mathrm{Head}) = p, \quad P(\mathrm{Tail}) = 1 - p = q$

What is the probability of seeing 2 heads out of 5 tosses?
$P(\text{\# of heads is } 2 \mid n = 5) = \binom{5}{2} p^2 q^3 = \frac{5!}{2!\,3!} p^2 q^3$

Observing k heads in n tosses follows a binomial distribution:
$Y = \sum_{i=1}^{n} X_i, \quad Y \sim \mathrm{Bin}(n, p), \quad P(Y = k) = \binom{n}{k} p^k (1-p)^{n-k}$
Multinomial Distribution
- Consider tossing a 6-faced die n times with probabilities p1, p2, …, p6, where the probabilities sum to 1.
- Treating the count of each face as a random variable, we have a multinomial process:
  $(X_1, X_2, \ldots, X_6) \sim \mathrm{Mul}(n, p_1, p_2, \ldots, p_6), \quad 0 \le X_j \le n, \quad \sum_{j=1}^{6} X_j = n$
  $P(X_1 = n_1, \ldots, X_6 = n_6) = \binom{n}{n_1\, n_2\, \ldots\, n_6} \prod_{k=1}^{6} p_k^{n_k}$
Multinomial NB
- The conditional probability is
  $P(x \mid c) = P(n_x \mid c) \; \frac{n_x!}{n_{x1}!\, n_{x2}! \cdots n_{xJ}!} \; \prod_{j=1}^{J} P(t_j \mid c)^{n_{xj}}$   ($t_j$ is a term)
- We can remove the first terms (which do not depend on c) from the objective function:
  $P(x \mid c) \propto \prod_j P(t_j \mid c)^{n_{xj}} \;\;\rightarrow\;\; \sum_{j=1}^{J} n_{xj} \log P(t_j \mid c)$
Smoothing Methods
- Laplace smoothing (common):
  $\tilde{P}(t \mid c) = \frac{1 + n_{t|c}}{|V| + \sum_{t' \in V} n_{t'|c}}$
- Two-state Hidden Markov Model (BBN, or Jelinek-Mercer interpolation):
  $\tilde{P}(t \mid c) = \lambda P(t \mid c) + (1 - \lambda) P(t)$
- Hierarchical smoothing (McCallum, ICML ’98):
  $\tilde{P}(t \mid c) = \lambda_1 P(t \mid c) + \lambda_2 P(t \mid c_2) + \ldots + \lambda_h P(t \mid c_h)$
  - The λ's (summing to 1) are the mixture weights, obtained by running an EM algorithm on a validation set.
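Putting the pieces together, here is a hedged sketch of a multinomial Naive Bayes classifier with Laplace smoothing on a few made-up toy documents; the whitespace tokenization and the data are assumptions for illustration.

```python
import math
from collections import Counter, defaultdict

train = [("buy cheap pills cheap", "spam"),
         ("cheap pills now", "spam"),
         ("meeting agenda for monday", "ham"),
         ("monday project meeting", "ham")]

class_docs = Counter(c for _, c in train)
term_counts = defaultdict(Counter)            # term_counts[c][t] = n_{t|c}
for doc, c in train:
    term_counts[c].update(doc.split())

vocab = {t for counts in term_counts.values() for t in counts}
V = len(vocab)

def log_p_term(t, c):
    """Laplace-smoothed log P(t | c) = log (1 + n_{t|c}) / (|V| + sum_t' n_{t'|c})."""
    return math.log((1 + term_counts[c][t]) / (V + sum(term_counts[c].values())))

def classify(doc):
    scores = {}
    for c in class_docs:
        score = math.log(class_docs[c] / len(train))          # log P(c), MLE
        score += sum(log_p_term(t, c) for t in doc.split())   # sum_j n_xj log P(t_j | c)
        scores[c] = score
    return max(scores, key=scores.get)

print(classify("cheap pills"))    # 'spam'
```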
Basic Assumptions
- Term independence:
  $P(x_i \mid c_k) = P(t_1 \mid c_k)^{n_{i1}} \cdot P(t_2 \mid c_k)^{n_{i2}} \cdots$
- Exactly one objective attribute y per instance:
  $\sum_k P(c_k) = 1$
- Continuity of instances in the same class (one mode per class):
  $\arg\max_k P(d \mid c_k) = \arg\max_k \left[ P(n_d \mid c_k) \; \frac{n_d!}{n_{d,1}!\, n_{d,2}! \cdots n_{d,|V|}!} \; \prod_{t \in V} P(t \mid c_k)^{n_{d,t}} \right]$
NB and Cross Entropy
- Entropy
  - Measures uncertainty: lower entropy means easier predictions
  - Minimum coding length if the distribution p is known
  $H(p) = -\sum_k p_k \log p_k, \quad p = (p_1, \ldots, p_K), \quad \sum_k p_k = 1$
- Cross entropy
  - Measures the coding length (in # of bits) based on distribution q when the true distribution is p
  $H(p \,\|\, q) = -\sum_k p_k \log q_k$
  $\;\;\;= -\sum_k p_k \log p_k - \sum_k p_k \log \frac{q_k}{p_k}$
  $\;\;\;= H(p) + D(p \,\|\, q)$
NB and Cross Entropy (cont’d)
- Kullback-Leibler (KL) divergence:
  $D(p \,\|\, q) = -\sum_k p_k \log \frac{q_k}{p_k} = \sum_k p_k \log \frac{p_k}{q_k}$
  - Also called “relative entropy”
  - Measures the difference between two distributions
  - Zero-valued iff p = q
  - Not symmetric: D(p ‖ q) ≠ D(q ‖ p)
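A short sketch of these three quantities follows, using natural logarithms (substitute log base 2 for lengths in bits); the distributions are made up, and the last two lines simply check the identity H(p‖q) = H(p) + D(p‖q) and the asymmetry of D.

```python
import math

def entropy(p):
    return -sum(pk * math.log(pk) for pk in p if pk > 0)

def cross_entropy(p, q):
    return -sum(pk * math.log(qk) for pk, qk in zip(p, q) if pk > 0)

def kl_divergence(p, q):
    return sum(pk * math.log(pk / qk) for pk, qk in zip(p, q) if pk > 0)

p = [0.7, 0.2, 0.1]
q = [0.5, 0.3, 0.2]
print(entropy(p), cross_entropy(p, q), kl_divergence(p, q))
print(abs(cross_entropy(p, q) - (entropy(p) + kl_divergence(p, q))) < 1e-12)  # H(p||q) = H(p) + D(p||q)
print(kl_divergence(p, q) != kl_divergence(q, p))                             # not symmetric
```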
NB & Cross Entropy (cont’d)
$k^* = \arg\max_k \left[ \log P(c_k) + \sum_{t_j \in x_i} n_{ij} \log P(t_j \mid c_k) \right]$
$\;\;\;= \arg\max_k \left[ \frac{\log P(c_k)}{n_i} + \sum_{t_j \in x_i} \frac{n_{ij}}{n_i} \log P(t_j \mid c_k) \right]$
$\;\;\;= \arg\max_k \left[ \frac{\log P(c_k)}{n_i} + \sum_{t} \hat{P}(t_j \mid x_i) \log P(t_j \mid c_k) \right]$
$\;\;\;= \arg\max_k \left[ \frac{\log P(c_k)}{n_i} - H(p_{x_i} \,\|\, q_{c_k}) \right]$
$\;\;\;= \arg\min_k \left[ -\frac{\log P(c_k)}{n_i} + H(p_{x_i} \,\|\, q_{c_k}) \right]$

→ a Minimum Description Length (MDL) classifier
Concluding Remarks on NB
Pros
- Explicit probabilistic reasoning
- Relatively effective; fast online response (as an eager learner)
Cons
- The scoring function (logarithms of term probabilities) can be overly sensitive to measurement errors on rare features
- The one-class-per-instance assumption imposes both theoretical and practical limitations
- Empirically weak when dealing with rare categories and large feature sets
Statistical Decision Theory
- Random input X in R^J
- Random output Y in {1, 2, …, K}
- Prediction f(X) in {1, 2, …, K}
- Loss function (0-1 loss for classification):
  - L(y(x), f(x)) = 0 iff f(x) = y(x)
  - L(y(x), f(x)) = 1 otherwise
- Expected prediction error (EPE):
  $EPE = E_X \sum_{k=1}^{K} L\!\left(Y^{(k)}, f(X)\right) P\!\left(Y^{(k)} \mid X\right)$
- Minimizing EPE pointwise:
  $\hat{f}(x) = \arg\min_{f(x) \in \{1,\ldots,K\}} \sum_{k=1}^{K} L(k, f(x)) \, P(k \mid x)$
  $\;\;\;= \arg\min_{k \in \{1,\ldots,K\}} \left\{ 1 - P(k \mid x) \right\} = \arg\max_{k \in \{1,\ldots,K\}} P(k \mid x)$
Selection of ML Algorithm (I)
Method           Training Data    Random Noise   Scalability
                 Requirements     Tolerance      (atts + data)
Rule Induction   Sparse           None           Good
Decision Trees   Sparse-Dense     Some           Excellent
Naïve Bayes      Medium-Dense     Some-Good      Medium
Regression       Medium-Dense     Some-Good      Good
kNN              Sparse-Dense     Some-Good      Good-Excellent
SVM              Medium-Dense     Some-Good      Good-Excellent
Neural Nets      Dense            Poor-Medium    Good
Selection of ML Algorithm (II)
Method           Quality of       Explanatory    Popularity of
                 Prediction       Power          Usage
Rule Induction   Good, brittle    Very clear     Med, declining
Decision Trees   Good/category    Very clear     High, stable
Naïve Bayes      Medium/cat       Partial        Med, declining
Regression       Good/both        Partial-Poor   High, stable
kNN              Good/both        Partial-Good   Med, increasing
SVM              Very good/cat    Poor           Med, increasing
Neural Nets      Good/cat         Poor           High, declining