with slide ideas from, among others, Hendrik Blockeel, Pedro Domingos, David Page,
Tom Dietterich and Eamon Keogh
Nearest neighbor methods
– Similarity
– Problems:
• dimensionality of data, efficiency, etc.
– Solutions:
• weighting, edited NN, kD-trees, etc.
Naïve Bayes
– Including an introduction to Bayesian ML methods
Imagine the world’s music collection represented in some space
When you like a song, other songs residing close to it should also be interesting
…
Picture from Oracle
1. Store all the examples ⟨x_i, y_i⟩
2. Classify a new example x by finding the stored example x_k that most resembles it, and predicting that example's class y_k
[Figure: positive (+) and negative (−) training examples in a 2-D space, with a new query point marked "?"]
• Learning is very fast
(although we come back to this later)
• No information is lost
• Hypothesis space
– variable size
– complexity of the hypothesis rises with the number of stored examples
Voronoi diagram
[Figure: Voronoi diagram — the space is partitioned into cells, one per stored example; each cell is the region closer to that example than to any other]
Boundaries are not computed!
Advantage : no details lost
Disadvantage : "details" may be noise
To improve robustness against noisy learning examples, use a set of k nearest neighbors
For classification: use voting among the k neighbors (see the sketch below)
For regression: use the mean of the neighbors' values
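A minimal sketch of k-NN prediction as just described (assuming Euclidean distance; majority vote for classification, mean for regression — the data below is made up for illustration):

```python
import math
from collections import Counter

def knn_predict(x_query, examples, k=5, regression=False):
    """examples: list of (x, y) pairs, with x a tuple of numbers.
    Classification: majority vote of the k nearest stored examples.
    Regression: mean of their target values."""
    neighbours = sorted(examples, key=lambda ex: math.dist(x_query, ex[0]))[:k]
    targets = [y for _, y in neighbours]
    if regression:
        return sum(targets) / len(targets)
    return Counter(targets).most_common(1)[0][0]

data = [((0, 0), "+"), ((1, 0), "+"), ((0, 1), "+"), ((5, 5), "-"), ((6, 5), "-")]
print(knn_predict((0.5, 0.5), data, k=3))  # '+'
```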
kNN doesn’t do anything until it needs to make a prediction = lazy learner
– Learning is fast!
– Predictions require work and can be slow
Eager learners start computing as soon as they receive data
Decision tree algorithms, neural networks, …
– Learning can be slow
– Predictions are usually fast!
Distance metrics: measure of dis-similarity
E.g. Manhattan, Euclidean or L_n-norm for numerical attributes
Hamming distance for nominal attributes:
d(x,y) = Σ_{i=1}^{n} δ(x_i, y_i)   where δ(x_i, y_i) = 0 if x_i = y_i, and 1 if x_i ≠ y_i
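A one-line sketch of this Hamming distance for vectors of nominal attributes:

```python
def hamming(x, y):
    """Number of attribute positions where x and y disagree."""
    return sum(xi != yi for xi, yi in zip(x, y))

print(hamming(("red", "small", "round"), ("red", "large", "round")))  # 1
```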
E.g. comparing humans on height and age:
With heights in metres:  1. (1.85 m, 37 yrs)  2. (1.83 m, 35 yrs)  3. (1.65 m, 37 yrs)
  d(1,2) ≈ 2.0001   d(1,3) = 0.2   d(2,3) ≈ 2.0081
With heights in centimetres:  1. (185 cm, 37 yrs)  2. (183 cm, 35 yrs)  3. (165 cm, 37 yrs)
  d(1,2) ≈ 2.8284   d(1,3) ≈ 20.0997   d(2,3) ≈ 18.1107
The choice of unit determines which examples count as "nearest"
Rescale all dimensions such that the range is equal, e.g. [-1,1] or [0,1]
For [0,1] range: x_i' = (x_i − m_i) / (M_i − m_i), with m_i the minimum and M_i the maximum value for attribute i
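A short sketch of this [0,1] rescaling, applied per attribute (assumes the attribute is not constant, i.e. M_i > m_i):

```python
def rescale(column):
    """Map all values of one attribute to [0, 1] using x' = (x - m) / (M - m)."""
    m, M = min(column), max(column)
    return [(x - m) / (M - m) for x in column]

print(rescale([1.85, 1.83, 1.65]))  # [1.0, 0.9, 0.0]
```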
Assume a uniformly distributed set of 5000 examples
To capture 5 nearest neighbors we need:
– in 1 dim: 0.1% of the range
– in 2 dim: √0.001 ≈ 3.1% of the range
– in n dim: 0.001^(1/n) of the range
With 5000 points in 10 dimensions, we therefore need roughly 50% of each dimension's range just to find 5 neighbors
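A quick check of these fractions (a small sketch; 0.1% is simply 5/5000):

```python
n_points, k = 5000, 5
frac = k / n_points                 # fraction of points we want as neighbours: 0.001
for dim in (1, 2, 10):
    side = frac ** (1.0 / dim)      # edge length of the hypercube capturing that fraction
    print(f"{dim:2d} dims: {side:.1%} of each dimension's range")
# prints ~0.1%, ~3.2% and ~50.1%
```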
Irrelevant features destroy the metric’s meaningfulness
Consider a 1-dim problem where the query x is at the origin, the nearest neighbor x_1 is at 0.1 and the second neighbor x_2 at 0.5 (after normalization)
Now add a uniformly random feature. What is the probability that x_2 becomes the closest neighbor?
approx. 15% !!
[Figure: location of x_1 vs. x_2 on the informative dimension]
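One way to sanity-check the 15% figure is a small Monte-Carlo simulation (a sketch; it assumes the value of the new random feature is drawn uniformly from [0,1] for the query and for both neighbors):

```python
import random

def prob_x2_wins(trials=100_000):
    """Estimate P(x2 ends up closer than x1 to the query) after adding
    one uniformly random, irrelevant feature to every point."""
    wins = 0
    for _ in range(trials):
        q, u1, u2 = random.random(), random.random(), random.random()
        d1 = 0.1**2 + (u1 - q)**2   # squared distance query -> x1
        d2 = 0.5**2 + (u2 - q)**2   # squared distance query -> x2
        wins += d2 < d1
    return wins / trials

print(prob_x2_wins())  # typically around 0.15
```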
Solution: Give each attribute a different weight in the distance computation
Several options:
– Experimentally find out which weights work well
(cross-validation)
– Other solutions, e.g. (Langley,1996)
1. Normalize attributes (to scale 0-1)
2. Then, for each class and each attribute, set the weight according to the average attribute similarity between the examples of that class
Strings
– Levenshtein distance/edit distance
= minimal number of changes needed to change one word into the other
Allowed edits/changes:
1. delete character
2. insert character
3. change character
(not used by some other edit-distances)
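A compact dynamic-programming sketch of this edit distance (unit cost for each of the three edits):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimal number of deletions, insertions and substitutions
    needed to turn string a into string b."""
    prev = list(range(len(b) + 1))           # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                           # deleting the first i characters of a
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1      # "change character" costs 1
            curr.append(min(prev[j] + 1,     # delete ca
                            curr[j - 1] + 1, # insert cb
                            prev[j - 1] + cost))
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```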
Given two time series:
Q = q_1 … q_n
C = c_1 … c_n
the Euclidean distance is
D(Q,C) = √( Σ_{i=1}^{n} (q_i − c_i)² )
Start and end times are critical!
[Figure: a query sequence Q compared to two candidate sequences C and R, with distances D(Q,C) and D(Q,R)]
Dynamic Time Warping
Fixed time axis: sequences are aligned “one to one”.
“Warped” time axis: nonlinear alignments are possible.
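A minimal dynamic-programming sketch of DTW (assuming squared pointwise differences and unit-step moves):

```python
import math

def dtw(q, c):
    """Dynamic Time Warping distance between numeric sequences q and c."""
    n, m = len(q), len(c)
    INF = float("inf")
    # D[i][j] = cost of the best warping path aligning q[:i] with c[:j]
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (q[i - 1] - c[j - 1]) ** 2
            D[i][j] = cost + min(D[i - 1][j],      # stretch c[j-1]
                                 D[i][j - 1],      # stretch q[i-1]
                                 D[i - 1][j - 1])  # advance both
    return math.sqrt(D[n][m])

print(dtw([1, 2, 3, 4], [1, 1, 2, 3, 4]))  # 0.0: the repeated 1 is absorbed by warping
```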
k places an arbitrary border on example relevance
– Idea: give higher weight to closer instances
Can now use all training instances instead of only k
(“Shepard’s method”):
f̂(x_q) = ( Σ_{i=1}^{k} w_i f(x_i) ) / ( Σ_{i=1}^{k} w_i ),   with   w_i = 1 / d(x_q, x_i)²
! In high-dimensional spaces, a function of d that “goes to zero fast enough” is needed. (Again “curse of dimensionality”.)
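A sketch of this distance-weighted prediction (using Euclidean distance and all stored instances, as Shepard's method allows):

```python
import math

def weighted_prediction(x_q, examples):
    """examples: list of (x, f_x) pairs; x and x_q are numeric tuples.
    Returns the distance-weighted average of the stored target values."""
    num, den = 0.0, 0.0
    for x, f_x in examples:
        d = math.dist(x_q, x)
        if d == 0.0:                 # exact match: return its value directly
            return f_x
        w = 1.0 / d ** 2             # w_i = 1 / d(x_q, x_i)^2
        num += w * f_x
        den += w
    return num / den

data = [((0.0, 0.0), 1.0), ((1.0, 0.0), 2.0), ((0.0, 1.0), 3.0)]
print(weighted_prediction((0.2, 0.2), data))
```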
Efficiency
– For each prediction, kNN needs to compute the distance (i.e. compare all attributes) for ALL stored examples
– Prediction time = linear in the size of the data-set
For large training sets and/or complex distances, this can be too slow to be practical
Use only part of the training data
✔ Less storage
✗ Order dependent
✗ Sensitive to noisy data
More advanced alternatives exist (e.g., IB3)
Reduce time spent on far-away examples by using more efficient distance-estimates first
– Eliminate most examples using rough distance approximations
– Compute more precise distances for examples in the neighborhood
Use a clever data structure to eliminate the need to compute all distances. kD-trees are similar to decision trees, except:
– splits are made on the median/mean value of dimension with highest variance
– each node stores one data point, leaves can be empty
Use a form of A* search using the minimum distance to a node as an underestimate of the true closest distance
Finds closest neighbor in logarithmic (depth of tree) time
Building a good kD-tree may take some time
– Learning time is no longer 0
– Incremental learning is no longer trivial
• kD-tree will no longer be balanced
• re-building the tree is recommended when the maxdepth becomes larger than 2* the minimal required depth (= log(N) with N training examples)
Cover trees are more advanced, more complex, and more efficient!!
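For experimentation, a kD-tree lookup is available off the shelf, e.g. in SciPy (a usage sketch, not the construction described above):

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
X = rng.random((5000, 10))           # 5000 stored examples, 10 numeric attributes
tree = cKDTree(X)                    # built once, up front

query = rng.random(10)
dists, idx = tree.query(query, k=5)  # 5 nearest neighbours without scanning all points
print(idx, dists)
```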
The rough decision surfaces of nearest neighbor can sometimes be considered a disadvantage
– Solve two problems at once by using prototypes
= Representative for a whole group of instances
Prototypes can be:
– Single instance, replacing a group
– Other structure (e.g., rectangle, rule, ...)
-> in this case: need to define distance
Movie                  Alice (1)   Bob (2)   Carol (3)   Dave (4)   x1 (romance)   x2 (action)
Love at last               ?          4          0           ?          0.9            0
Romance forever            0          0          5           4          1.0            0.01
Cute puppies of love       5          5          0           0          0.99           0
Nonstop car chases         5          ?          ?           0          0.1            1.0
Swords vs. karate          0          0          5           ?          0              0.9
Predict ratings for films users have not yet seen (or rated).
Predict through instance-based regression (a sketch follows below):
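The slide leaves the exact formula open; one natural instance-based choice (an illustrative sketch, not necessarily the lecturer's) is distance-weighted regression over the movie feature vectors (x1, x2), using the ratings the user has already given:

```python
import math

# (x1 romance, x2 action) of rated movies plus the ratings one hypothetical user gave;
# the feature values follow the table above, the ratings are made up for illustration.
rated = [((0.9, 0.0), 4), ((0.99, 0.0), 5), ((0.1, 1.0), 0), ((0.0, 0.9), 0)]
unseen = (1.0, 0.01)  # features of "Romance forever"

num = den = 0.0
for features, rating in rated:
    w = 1.0 / (math.dist(unseen, features) ** 2 + 1e-9)  # closer movies weigh more
    num += w * rating
    den += w
print(num / den)  # predicted rating, close to 5 because the nearest movies are romances
```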
Positive
• Easy to implement
• Good “baseline” algorithm / experimental control
• Incremental learning easy
• Psychologically plausible model of human memory
Negative
• Led astray by irrelevant features
• No insight into domain (no explicit model)
• Choice of distance function is problematic
• Doesn’t exploit/notice structure in examples
• Generalities of instance based learning
– Basic idea, (dis)advantages, Voronoi diagrams, lazy vs. eager learning
• Various instantiations
– kNN, distance-weighted methods, ...
– Rescaling attributes
– Use of prototypes
This is going to be very introductory
• Describing (results of) learning processes
– MAP and ML hypotheses
• Developing practical learning algorithms
– Naïve Bayes learner
• application: learning to classify texts
– Learning Bayesian belief networks
Several roles for probability theory in machine learning:
– describing existing learners
• e.g. compare them with “optimal” probabilistic learner
– developing practical learning algorithms
• e.g. “Naïve Bayes” learner
Bayes’ theorem plays a central role
• P(A): probability that A happens
• P(A|B): probability that A happens, given that
B happens (“conditional probability”)
• Some rules:
– complement: P(not A) = 1 - P(A)
– disjunction: P(A or B) = P(A)+P(B)-P(A and B)
– conjunction: P(A and B) = P(A) P(B|A)
  = P(A) P(B) if A and B are independent
– total probability: P(A) = Σ_i P(A|B_i) P(B_i), with the B_i mutually exclusive (and jointly exhaustive)
P(A|B) = P(B|A) P(A) / P(B)
Mainly 2 ways of using Bayes’ theorem:
– Applied to learning a hypothesis h from data D:
P(h|D) = P(D|h) P(h) / P(D) ~ P(D|h)P(h)
– P(h): a priori probability that h is correct
– P(h|D): a posteriori probability that h is correct
– P(D): probability of obtaining data D
– P(D|h): probability of obtaining data D if h is correct
– Applied to classification of a single example e:
P(class|e) = P(e|class)P(class)/P(e)
Example:
– assume some lab test for a disease has 98% chance of giving positive result if disease is present, and 97% chance of giving negative result if disease is absent
– assume furthermore 0.8% of population has this disease
– given a positive result, what is the probability that the disease is present?
P(Dis|Pos) = P(Pos|Dis)P(Dis) / P(Pos) = 0.98·0.008 / (0.98·0.008 + 0.03·0.992) = 0.00784 / 0.0376 ≈ 0.21
So even after a positive test, the disease is present with a probability of only about 21%
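A three-line check of this calculation (a sketch using the numbers from the example):

```python
p_pos_dis, p_dis, p_pos_nodis = 0.98, 0.008, 0.03        # sensitivity, prior, false-positive rate
p_pos = p_pos_dis * p_dis + p_pos_nodis * (1 - p_dis)    # total probability of a positive test
print(p_pos_dis * p_dis / p_pos)                         # ≈ 0.2085
```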
Task: Given the current data D and some hypothesis space H, return the hypothesis h in H
that is most likely to be correct.
Note: this h is optimal in a certain sense
– no method can exist that finds with higher probability the correct h
Given some data D and a hypothesis space H, find the hypothesis h ∈ H that has the highest probability of being correct, i.e., for which P(h|D) is maximal
This hypothesis is called the maximum a posteriori hypothesis h_MAP:
h_MAP = argmax_{h ∈ H} P(h|D)
      = argmax_{h ∈ H} P(D|h)P(h)/P(D)
      = argmax_{h ∈ H} P(D|h)P(h)
• the last equality holds because P(D) is constant
So: we need P(D|h) and P(h) for all h ∈ H to compute h_MAP
P(h): a priori probability that h is correct
What if no preferences for one h over another?
• Then assume P(h) = P(h’) for all h, h’ ∈ H
• Under this assumption, h_MAP is called the maximum likelihood hypothesis h_ML:
h_ML = argmax_{h ∈ H} P(D|h)   (because P(h) is constant)
How to find h_MAP or h_ML?
– brute force method: compute P(D|h) and P(h) for all h ∈ H
– usually not feasible
Simple & popular classification method
• Based on Bayes’ rule + assumption of conditional independence
– assumption often violated in practice
– even then, it usually works well
Example application: classification of text documents
Given the attribute values, what is the most probable value of the target variable?
v_MAP = argmax_{v_j ∈ V} P(v_j | a_1, a_2, …, a_n)
      = argmax_{v_j ∈ V} P(a_1, a_2, …, a_n | v_j) P(v_j) / P(a_1, a_2, …, a_n)
      = argmax_{v_j ∈ V} P(a_1, a_2, …, a_n | v_j) P(v_j)
Problem: too much data needed to estimate P(a_1, …, a_n | v_j)
Naïve Bayes assumption: attributes are independent, given the class:
P(a_1, …, a_n | v_j) = P(a_1|v_j) P(a_2|v_j) … P(a_n|v_j)
– also called conditional independence (given the class)
• Under that assumption, v_MAP becomes
v_NB = argmax_{v_j ∈ V} P(v_j) Π_i P(a_i | v_j)
To learn such a classifier: just estimate P(v_j) and P(a_i|v_j) from the data:
v_NB = argmax_{v_j ∈ V} P̂(v_j) Π_i P̂(a_i | v_j)
How to estimate?
– simplest: standard estimate from statistics
• estimate probability from sample proportion
• e.g., estimate P(A|B) as count(A and B) / count(B)
– in practice, something more complicated needed…
Problem:
– What if attribute value a_i is never observed for class v_j?
– Estimate P(a_i|v_j) = 0 because count(a_i and v_j) = 0?
• Effect is too strong: this 0 makes the whole product 0!
Solution: use m-estimate
– interpolates between the observed value n_c/n and an a priori estimate p → the estimate may get close to 0 but is never exactly 0
• m is the weight given to the a priori estimate
P̂(a_i | v_j) = (n_c + m·p) / (n + m)
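A tiny sketch of the m-estimate (here p is taken as a uniform prior over the attribute's values, one common choice):

```python
def m_estimate(n_c, n, p, m):
    """m-estimate of P(a_i | v_j): n_c co-occurrences out of n class examples,
    a priori estimate p, equivalent sample size m."""
    return (n_c + m * p) / (n + m)

# attribute with 4 possible values, never seen with this class (n_c = 0, n = 25):
print(m_estimate(0, 25, p=1/4, m=4))  # small but non-zero (~0.034) instead of 0
```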
Example application:
– given text of newsgroup article, guess which newsgroup it is taken from
– Naïve Bayes turns out to work well on this application
– How to apply NB?
– Key issue : how do we represent examples? what are the attributes?
Binary classification (+/-) or multiple classes possible
Attributes = word frequencies
– Vocabulary = all words that occur in learning task
– # attributes = size of vocabulary
– Attribute value = word count or frequency in the text (using m-estimate)
= “Bag of Words” representation
procedure learn_naïve_bayes_text(E: set of articles, V: set of classes)
   Voc = all words and tokens occurring in E
   estimate P(v_j) and P(w_k|v_j) for all w_k in Voc and all v_j in V:
      N   = number of articles in E
      N_j = number of articles of class j
      P(v_j) = N_j / N
      n_j  = number of words in the articles of class j (counting doubles)
      n_kj = number of times word w_k occurs in the text of class j
      P(w_k|v_j) = (n_kj + 1) / (n_j + |Voc|)

procedure classify_naïve_bayes_text(A: article)
   remove from A all words/tokens that are not in Voc
   return argmax_{v_j ∈ V} P(v_j) Π_i P(a_i|v_j)
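A runnable sketch of the same procedure (log-probabilities are used to avoid underflow; the toy corpus is made up for illustration):

```python
import math
from collections import Counter, defaultdict

def learn(examples):
    """examples: list of (text, label). Returns priors, word probabilities, vocabulary."""
    docs_per_class = Counter(label for _, label in examples)
    words_per_class = defaultdict(Counter)
    for text, label in examples:
        words_per_class[label].update(text.lower().split())
    voc = {w for counts in words_per_class.values() for w in counts}
    priors = {c: n / len(examples) for c, n in docs_per_class.items()}
    # P(w|c) = (n_kj + 1) / (n_j + |Voc|), i.e. Laplace-smoothed word counts
    word_prob = {c: {w: (cnt[w] + 1) / (sum(cnt.values()) + len(voc)) for w in voc}
                 for c, cnt in words_per_class.items()}
    return priors, word_prob, voc

def classify(text, priors, word_prob, voc):
    scores = {c: math.log(p) + sum(math.log(word_prob[c][w])
                                   for w in text.lower().split() if w in voc)
              for c, p in priors.items()}
    return max(scores, key=scores.get)

train = [("cheap pills buy now", "spam"), ("meeting agenda attached", "ham"),
         ("buy cheap watches", "spam"), ("lunch meeting tomorrow", "ham")]
model = learn(train)
print(classify("buy now", *model))  # spam
```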
Some (old) experimental results:
– 1000 articles taken from 20 newsgroups
– guess correct newsgroup for unseen documents
– 89% classification accuracy with previous approach
• Note: more recent approaches based on SVMs,
… have been reported to work better
– But Naïve Bayes still used in practice, e.g., for spam detection
Consider two extremes of spectrum:
– guessing joint probability distribution
• would yield optimal classifier
• but infeasible in practice (too much data needed)
– Naïve Bayes
• much more feasible
• but strong assumptions of conditional independence
• Is there something in between?
– make some independence assumptions, but only where reasonable
Bayesian belief network consists of
1: graph
• intuitively: indicates which variables “directly influence” which other variables
– arrow from A to B: A has direct effect on B
– parents(X) = set of all nodes directly influencing X
• formally: each node is conditionally independent of each of its non-descendants, given its parents
– conditional independence: cf. Naïve Bayes
– X conditionally independent of Y given Z iff P(X|Y,Z) = P(X|Z)
2: conditional probability tables
• for each node X : P(X|parents(X)) is given
• Burglary or earthquake may cause alarm to go off
• Alarm going off may cause one of neighbours to call
Burglary:     B 0.05    -B 0.95
Earthquake:   E 0.01    -E 0.99

Alarm:            B,E    B,-E   -B,E   -B,-E
           A      0.9    0.8    0.4    0.01
          -A      0.1    0.2    0.6    0.99

John calls:        A     -A
           J      0.8    0.1
          -J      0.2    0.9

Mary calls:        A     -A
           M      0.9    0.2
          -M      0.1    0.8
Network topology usually reflects direct causal influences
– other structure also possible
– but may render network more complex
[Figures: the alarm network with its causal structure (Burglary, Earthquake → Alarm → John calls, Mary calls), and an alternative structure over the same five variables]
Graph + conditional probability tables allow us to construct the joint probability distribution of all variables:
– P(X_1, X_2, …, X_n) = Π_i P(X_i | parents(X_i))
– In other words: a Bayesian belief network carries full information on the joint probability distribution
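For the alarm network above, this factorization turns any joint probability into a product of five table lookups (a small sketch using the CPT numbers from the earlier slide):

```python
# P(B, E, A, J, M) = P(B) * P(E) * P(A|B,E) * P(J|A) * P(M|A)
p_b, p_e = 0.05, 0.01
p_a = {(True, True): 0.9, (True, False): 0.8, (False, True): 0.4, (False, False): 0.01}
p_j = {True: 0.8, False: 0.1}   # P(John calls | Alarm)
p_m = {True: 0.9, False: 0.2}   # P(Mary calls | Alarm)

def joint(b, e, a, j, m):
    p = (p_b if b else 1 - p_b) * (p_e if e else 1 - p_e)
    p *= p_a[(b, e)] if a else 1 - p_a[(b, e)]
    p *= (p_j[a] if j else 1 - p_j[a]) * (p_m[a] if m else 1 - p_m[a])
    return p

# burglary, no earthquake, alarm goes off, both neighbours call:
print(joint(True, False, True, True, True))  # 0.05*0.99*0.8*0.8*0.9 ≈ 0.0285
```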
Given values for certain nodes, infer probability distribution for values of other nodes
• General algorithm quite complicated
– See, e.g., Russell & Norvig (1995), Artificial Intelligence: A Modern Approach
[Figure: a network with nodes marked as evidence (observed), to be predicted, and unobserved]
In general: inference is NP-complete
– approximating methods, e.g. Monte-Carlo
• Assume structure of network given:
– only conditional probability tables to be learnt
– training examples may include values for all variables, or just for some of them
– when all variables observable:
• estimating probabilities as easy as for Naïve Bayes
• e.g. estimate P(A|B,C) as count(A,B,C)/count(B,C)
– when not all variables observable:
• methods based on gradient descent or EM
• When structure of network not given:
– search for structure + tables
• e.g. propose structure, learn tables
• propose change to structure, relearn, see whether better results
– active research topic
• Importance of Bayes’ theorem
• MAP, ML, MDL
– definitions, characterising learners from this perspective, relationship MDL-MAP
• Bayes optimal classifier, Gibbs classifier
• Naïve Bayes: how it works, assumptions made, application to text classification
• Bayesian networks: representation, inference, learning