Classification - Tamara L Berg

Machine Learning Overview
Tamara Berg
Recognizing People, Objects, and
Actions
Today
• Schedule has been adjusted a little bit due to Monday’s
cancellation
– Today – Overview of machine learning algorithms (other
than deep learning)
– We will cover a quick intro to deep learning on day 2 of the
object recognition topic
• The Topic Presentation groups have been posted to the
class webpage
– Group 1, Feb 15/17, should meet with me early next week
to go over presentation outline and proposed paper list
(Adam, Zherong, Jae-Sung, Cheng-Yang)
For next class
• Read assigned object recognition papers (posted
later today)
• Before class, turn in a hard-copy ½-page summary
for each assigned paper outlining:
1) the goal of the paper,
2) the approach,
3) what was novel,
4) what you thought of the paper.
(summary template on the class webpage)
To Do – prepping for projects
– Install your favorite machine learning tool (e.g. CNNs,
SVMs, etc)
– Download your favorite image dataset (imagenet
subset, LFW face dataset, Zappos shoe dataset….)
– Run some simple experiment on image classification –
split your dataset into training/testing sets, train
classifier to recognize images from each category (may
or may not require extracting features)
Useful code/data/etc: https://github.com/jbhuang0604/awesome-computervision
Deep Learning: http://caffe.berkeleyvision.org/,
http://torch.ch/docs/cvpr15.html, https://www.tensorflow.org/
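A minimal sketch of this starter experiment, assuming scikit-learn and that images have already been turned into fixed-length feature vectors X with integer category labels y (dataset loading and feature extraction are up to you):

```python
# Starter image-classification experiment: train/test split + a simple classifier.
# Assumes X is an (N, d) feature matrix and y an (N,) array of category labels.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

def run_baseline(X, y, test_frac=0.2, seed=0):
    # Split the dataset into training/testing sets
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=test_frac, random_state=seed, stratify=y)
    # Train a classifier to recognize images from each category
    clf = LinearSVC(C=1.0)
    clf.fit(X_tr, y_tr)
    # Report accuracy on the held-out test set
    return accuracy_score(y_te, clf.predict(X_te))
```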
Types of ML algorithms
• Unsupervised
– Algorithms operate on unlabeled examples
• Supervised
– Algorithms operate on labeled examples
• Semi/Partially-supervised
– Algorithms combine both labeled and unlabeled
examples
Slide 5 of 113
Unsupervised Learning,
e.g. clustering
Slide 6 of 113
Slide 7 of 113
K-means clustering
• Want to minimize the sum of squared Euclidean distances between points xi and their nearest cluster centers mk:

D(X, M) = Σ_{clusters k} Σ_{points i in cluster k} (xi − mk)²
Algorithm:
• Randomly initialize K cluster centers
• Iterate until convergence:
• Assign each data point to the nearest center
• Recompute each cluster center as the mean of all points assigned
to it
Slide 8 of 113
source: Svetlana Lazebnik
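A minimal NumPy sketch of the algorithm above (random initialization, assignment to the nearest center, recomputing centers as means); X is an (N, d) array and K and the iteration cap are free parameters:

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=K, replace=False)]   # randomly initialize K centers
    for _ in range(n_iters):
        # Assign each data point to the nearest center (squared Euclidean distance)
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = d2.argmin(axis=1)
        # Recompute each cluster center as the mean of all points assigned to it
        new_centers = np.array([X[assign == k].mean(axis=0) if np.any(assign == k)
                                else centers[k] for k in range(K)])
        if np.allclose(new_centers, centers):   # converged
            break
        centers = new_centers
    return centers, assign
```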
Supervised Learning,
e.g. nearest neighbor, decision trees,
SVMs, boosting
Slide 9 of 113
Slide 10 of 113
Slide from Dan Klein
Slide 11 of 113
Slide from Dan Klein
Slide 12 of 113
Slide from Dan Klein
Slide 13 of 113
Slide from Dan Klein
Example: Image classification
[Figure: example input images with desired output labels – apple, pear, tomato; cow, dog, horse]
Slide 14 of 113
Slide credit: Svetlana Lazebnik
http://yann.lecun.com/exdb/mnist/index.html
Slide 15 of 113
Slide from Dan Klein
Example: Seismic data
[Figure: scatter of body wave magnitude vs. surface wave magnitude, separating earthquakes from nuclear explosions]
Slide 16 of 113
Slide credit: Svetlana Lazebnik
Slide 17 of 113
Slide from Dan Klein
The basic classification framework
y = f(x)    (x: input, f: classification function, y: output)
• Learning: given a training set of labeled examples {(x1,y1),
…, (xN,yN)}, estimate the parameters of the prediction
function f
• Inference: apply f to a never before seen test example x and
output the predicted value y = f(x)
Slide 18 of 113
Slide credit: Svetlana Lazebnik
Some ML classification methods
• Nearest neighbor (scales to 10⁶ examples): Shakhnarovich, Viola, Darrell 2003; Berg, Berg, Malik 2005; …
• Neural networks: LeCun, Bottou, Bengio, Haffner 1998; Rowley, Baluja, Kanade 1998; …
• Support Vector Machines and Kernels: Guyon, Vapnik; Heisele, Serre, Poggio 2001; …
• Conditional Random Fields: McCallum, Freitag, Pereira 2000; Kumar, Hebert 2003; …
19
Slide credit: Antonio Torralba
Example: Training and testing
Training set (labels known)
Test set (labels unknown)
• Key challenge: generalization to unseen examples
Slide 20 of 113
Slide credit: Svetlana Lazebnik
Slide 21 of 113
Slide credit: Dan Klein
Classification by Nearest Neighbor
Word vector document classification – here the vector space is illustrated as having 2 dimensions. How many dimensions would the data actually live in?
Slide 22 of 113
Classification by Nearest Neighbor
Slide 23 of 113
Classification by Nearest Neighbor
Classify the test document as the class of the document “nearest” to the query document (use vector similarity to find the most similar doc)
Slide 24 of 113
Classification by kNN
Classify the test document as the majority class of the k nearest documents
Slide 25 of 113
Classification by kNN
What are the features? What’s the training data? Testing data? Parameters?
Slide 26 of 113
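A small sketch of kNN document classification, assuming bag-of-words count vectors as the features, labeled documents as the training data, cosine similarity as the distance, and k as the parameter:

```python
import numpy as np
from collections import Counter

def cosine_sim(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def knn_classify(query_vec, train_vecs, train_labels, k=3):
    # Rank training documents by similarity to the query document
    sims = [cosine_sim(query_vec, v) for v in train_vecs]
    top_k = np.argsort(sims)[::-1][:k]
    # Classify as the majority class among the k most similar documents
    return Counter(train_labels[i] for i in top_k).most_common(1)[0][0]
```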
Slide 27 of 113
Slide from Min-Yen Kan
Slide 28 of 113
Slide from Min-Yen Kan
Slide 29 of 113
Slide from Min-Yen Kan
Slide 30 of 113
Slide from Min-Yen Kan
Slide 31 of 113
Slide from Min-Yen Kan
Classification by kNN
What are the features? What’s the training data? Testing data? Parameters?
Slide 32 of 113
NN
(examples from computer vision)
33
NN for pose estimation
Fast Pose Estimation with Parameter Sensitive Hashing
Shakhnarovich, Viola, Darrell
34
The algorithm flow
[Figure: algorithm flow – input query → processed query / representation → search database of examples → output match]
35
NN for vision
J. Hays and A. Efros, IM2GPS: estimating geographic information from a single image,
CVPR 2008
Where?
What can you say about where these photos were taken?
37
How?
Collect a large collection of geo-tagged photos
6.5 million images with both GPS coordinates and geographic keywords,
removing images with keywords like birthday, concert, abstract, …
Test set – 400 randomly sampled images from this collection. Manually
removed abstract photos and photos with recognizable people – 237 test
photos.
38
Nearest Neighbor Matching
For each input image compute features (color, texture,
shape)
Compute distance in feature space to all 6 million
images in the database (each feature contributes
equally).
Label the image with GPS coordinates of:
1 nearest neighbor
k=120 nearest neighbors – probability map over
entire globe.
39
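A simplified sketch of this kind of nearest-neighbor geolocation (not the authors' exact features or code), assuming precomputed feature vectors and GPS coordinates for the database images:

```python
import numpy as np

def geolocate(query_feat, db_feats, db_gps, k=120):
    # Distance in feature space to every database image (each feature weighted equally)
    dists = np.linalg.norm(db_feats - query_feat, axis=1)
    order = np.argsort(dists)
    first_nn = db_gps[order[0]]        # GPS coordinates of the single nearest neighbor
    knn_coords = db_gps[order[:k]]     # k nearest neighbors -> density map over the globe
    return first_nn, knn_coords
```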
Results
40
Results
41
Results
42
Decision tree classifier
Example problem: decide whether to wait for a table at a
restaurant, based on the following attributes:
1. Alternate: is there an alternative restaurant nearby?
2. Bar: is there a comfortable bar area to wait in?
3. Fri/Sat: is today Friday or Saturday?
4. Hungry: are we hungry?
5. Patrons: number of people in the restaurant (None, Some, Full)
6. Price: price range ($, $$, $$$)
7. Raining: is it raining outside?
8. Reservation: have we made a reservation?
9. Type: kind of restaurant (French, Italian, Thai, Burger)
10. WaitEstimate: estimated waiting time (0-10, 10-30, 30-60, >60)
43
Decision tree classifier
44
Decision tree classifier
45
Shall I play tennis today?
46
47
Leaf nodes
Choose next attribute for splitting
How do we choose the best attribute?
48
Criterion for attribute selection
• Which is the best attribute?
– The one which will result in the smallest tree
– Heuristic: choose the attribute that produces the
“purest” nodes
• Need a good measure of purity!
49
Information Gain
Which test is more informative?
[Figure: two candidate splits – Humidity (≤75% vs. >75%) and Wind (≤20 vs. >20)]
50
Information Gain
Impurity/Entropy (informal)
– Measures the level of impurity in a group of
examples
51
Impurity
[Figure: three groups of points – very impure group, less impure group, minimum impurity]
52
Entropy: a common way to measure impurity
• Entropy = −Σᵢ pᵢ log₂ pᵢ
where pᵢ is the probability of class i, computed as the proportion of class i in the set.
53
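The same entropy computed in code, with pᵢ taken as the proportion of class i in the set:

```python
import numpy as np
from collections import Counter

def entropy(labels):
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()                 # proportion of each class in the set
    return float(-(p * np.log2(p)).sum())

# entropy(['+'] * 10)            -> 0.0   (pure group, minimum impurity)
# entropy(['+'] * 5 + ['-'] * 5) -> 1.0   (50/50 group, maximum impurity)
```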
2-Class Cases:
• What is the entropy of a group in which all examples belong to the same class? (minimum impurity)
entropy = −1 log₂ 1 = 0
• What is the entropy of a group with 50% in either class? (maximum impurity)
entropy = −0.5 log₂ 0.5 − 0.5 log₂ 0.5 = 1
54
Information Gain
• We want to determine which attribute in a given set
of training feature vectors is most useful for
discriminating between the classes to be learned.
• Information gain tells us how useful a given attribute
of the feature vectors is.
• We can use it to decide the ordering of attributes in
the nodes of a decision tree.
55
Calculating Information Gain
Information Gain = entropy(parent) – [weighted average entropy(children)]
Entire population (30 instances):
parent entropy = −(14/30) log₂(14/30) − (16/30) log₂(16/30) = 0.996
Child 1 (17 instances):
child entropy = −(13/17) log₂(13/17) − (4/17) log₂(4/17) = 0.787
Child 2 (13 instances):
child entropy = −(1/13) log₂(1/13) − (12/13) log₂(12/13) = 0.391
(Weighted) average entropy of children = (17/30)·0.787 + (13/30)·0.391 = 0.615
Information Gain = 0.996 − 0.615 = 0.38
56
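The worked example above, reproduced in a few lines (values match up to rounding):

```python
import math

def H(pos, neg):
    # entropy of a two-class set with the given counts
    total = pos + neg
    ps = [c / total for c in (pos, neg) if c > 0]
    return -sum(p * math.log2(p) for p in ps)

parent = H(14, 16)                                    # ~0.996
child1, child2 = H(13, 4), H(1, 12)                   # ~0.787, ~0.391
weighted = (17 / 30) * child1 + (13 / 30) * child2    # ~0.615
info_gain = parent - weighted                         # ~0.38
```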
e.g. based on information gain
57
Linear classifier
• Find a linear function to separate the classes
f(x) = sgn(w1x1 + w2x2 + … + wDxD) = sgn(w · x)
Slide 58 of 113
Slide credit: Svetlana Lazebnik
Discriminant Function
• It can be an arbitrary function of x, such as:
– Nearest Neighbor
– Decision Tree
– Linear Functions: g(x) = wᵀx + b
Slide 59 of 113
Slide credit: Jinwei Gu
Linear Discriminant Function
• g(x) is a linear function: g(x) = wᵀx + b
• A hyperplane in the feature space
[Figure: 2D feature space (x1, x2) split by the hyperplane; wᵀx + b > 0 on the +1 side, wᵀx + b < 0 on the −1 side]
Slide 60 of 113
Slide credit: Jinwei Gu
Linear Discriminant Function
• How would you classify these points (+1 vs. −1) using a linear discriminant function in order to minimize the error rate?
• Infinite number of answers!
Slide 61 of 113
Slide credit: Jinwei Gu
Linear Discriminant Function
• How would you classify these points (+1 vs. −1) using a linear discriminant function in order to minimize the error rate?
• Infinite number of answers!
• Which one is the best?
Slide 64 of 113
Slide credit: Jinwei Gu
Large Margin Linear Classifier
• The linear discriminant function (classifier) with the maximum margin is the best.
• Margin is defined as the width that the boundary could be increased by before hitting a data point.
• Why is it the best? Strong generalization ability (a “safe zone” around the boundary).
Linear SVM
Slide 65 of 113
Slide credit: Jinwei Gu
Large Margin Linear Classifier
[Figure: margin around the decision boundary; the points x+ and x− lying on the margin are the support vectors]
Slide 66 of 113
Slide credit: Jinwei Gu
Discriminating between classes
• The linear discriminant function is:
g(x) = wᵀx + b = Σ_{i∈SV} αᵢ xᵢᵀ x + b
• Notice it relies on a dot product between the test point x and the support vectors xᵢ
69
Linear separability
70
Non-linear SVMs: Feature Space

General idea: the original input space can be mapped to
some higher-dimensional feature space where the training set
is separable:
Φ: x → φ(x)
Slide courtesy of www.iro.umontreal.ca/~pift6080/documents/papers/svm_tutorial.ppt
71
Nonlinear SVMs: The Kernel Trick
• With this mapping, our discriminant function becomes:
g(x) = wᵀφ(x) + b = Σ_{i∈SV} αᵢ φ(xᵢ)ᵀ φ(x) + b
• No need to know this mapping explicitly, because we only use the dot product of feature vectors in both training and testing.
• A kernel function is defined as a function that corresponds to a dot product of two feature vectors in some expanded feature space:
K(xᵢ, xⱼ) = φ(xᵢ)ᵀ φ(xⱼ)
72
Nonlinear SVMs: The Kernel Trick
• Examples of commonly-used kernel functions:
– Linear kernel: K(xᵢ, xⱼ) = xᵢᵀ xⱼ
– Polynomial kernel: K(xᵢ, xⱼ) = (1 + xᵢᵀ xⱼ)ᵖ
– Gaussian (Radial Basis Function, RBF) kernel: K(xᵢ, xⱼ) = exp(−‖xᵢ − xⱼ‖² / 2σ²)
– Sigmoid kernel: K(xᵢ, xⱼ) = tanh(β₀ xᵢᵀ xⱼ + β₁)
73
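The same kernels written out directly; p, σ, β₀, and β₁ are the kernels' free parameters:

```python
import numpy as np

def linear_kernel(xi, xj):
    return xi @ xj

def polynomial_kernel(xi, xj, p=2):
    return (1 + xi @ xj) ** p

def rbf_kernel(xi, xj, sigma=1.0):
    return np.exp(-np.linalg.norm(xi - xj) ** 2 / (2 * sigma ** 2))

def sigmoid_kernel(xi, xj, beta0=1.0, beta1=0.0):
    return np.tanh(beta0 * (xi @ xj) + beta1)
```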
Support Vector Machine: Algorithm
1. Choose a kernel function
2. Choose a value for C and any other parameters (e.g. σ)
3. Solve the quadratic programming problem (many software
packages available)
4. Classify held out validation instances using the learned
model
5. Select the best learned model based on validation accuracy
6. Classify test instances using the final selected model
74
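A minimal sketch of this recipe using scikit-learn's SVC (which solves the quadratic program internally); the candidate values of C and the RBF width gamma are arbitrary choices for illustration:

```python
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

def train_svm(X_train, y_train, X_val, y_val, X_test, y_test):
    best_model, best_acc = None, -1.0
    for C in (0.1, 1, 10):                  # step 2: candidate values of C
        for gamma in (0.01, 0.1, 1):        # RBF kernel width parameter
            model = SVC(kernel='rbf', C=C, gamma=gamma)            # steps 1 & 3
            model.fit(X_train, y_train)
            acc = accuracy_score(y_val, model.predict(X_val))      # step 4
            if acc > best_acc:                                     # step 5
                best_model, best_acc = model, acc
    return accuracy_score(y_test, best_model.predict(X_test))      # step 6
```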
SVMs in Computer Vision
76
Detection
[Figure: x → features → classify F(x) → y ∈ {+1 pos, −1 neg}]
• We slide a window over the image
• Extract features for each window
• Classify each window into pos/neg
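A schematic sliding-window detector; extract_features and classifier stand in for whatever feature extractor and trained classifier are used, and the window size and stride are arbitrary:

```python
def sliding_window_detect(image, classifier, extract_features,
                          win=(64, 64), stride=16):
    H, W = image.shape[:2]
    detections = []
    for top in range(0, H - win[0] + 1, stride):       # slide a window over the image
        for left in range(0, W - win[1] + 1, stride):
            patch = image[top:top + win[0], left:left + win[1]]
            x = extract_features(patch)                # features for this window
            if classifier.predict([x])[0] == 1:        # classify window as pos/neg
                detections.append((top, left, win[0], win[1]))
    return detections
```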
Sliding Window Detection
78
Representation
79
80
Example Results
81
Example Results
82
Summary: Support Vector Machine
1. Large Margin Classifier
– Better generalization ability & less over-fitting
2. The Kernel Trick
– Map data points to higher dimensional space in order to
make them linearly separable.
– Since only dot product is needed, we do not need to
represent the mapping explicitly.
83
Model Ensembles
Random Forests
A variant of bagging proposed by Breiman.
The classifier consists of a collection of decision-tree-structured classifiers.
Each tree casts a vote for the class of input x.
88
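A minimal sketch with scikit-learn's RandomForestClassifier, which builds such a collection of trees and combines their votes:

```python
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(n_estimators=100, random_state=0)
# forest.fit(X_train, y_train)       # each tree is trained on a bootstrap sample
# y_pred = forest.predict(X_test)    # prediction = majority vote over the trees
```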
Boosting
• A simple algorithm for learning robust classifiers
– Freund & Schapire, 1995
– Friedman, Hastie, Tibshirani, 1998
• Provides efficient algorithm for sparse visual feature
selection
– Tieu & Viola, 2000
– Viola & Jones, 2003
• Easy to implement, doesn’t require external
optimization tools. Used for many real problems in
AI.
89
Boosting
• Defines a classifier using an additive model:
F(x) = α₁ f₁(x) + α₂ f₂(x) + α₃ f₃(x) + …
(F: strong classifier, fₜ: weak classifiers, αₜ: weights, x: input feature vector)
90
Boosting
• Defines a classifier using an additive model:
F(x) = α₁ f₁(x) + α₂ f₂(x) + α₃ f₃(x) + …
(F: strong classifier, fₜ: weak classifiers, αₜ: weights, x: input feature vector)
• We need to define a family of weak classifiers from which each fₜ(x) is selected
91
Adaboost
Input: training samples
Initialize weights on samples
For T iterations:
Select best weak classifier
based on weighted error
Update sample weights
Output: final strong classifier
(combination of selected weak
classifier predictions)
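A compact sketch of discrete AdaBoost for labels y ∈ {−1, +1}, following the loop above; weak_learner_factory can produce any classifier that accepts per-sample weights (for example a depth-1 decision tree):

```python
import numpy as np

def adaboost(X, y, weak_learner_factory, T=50):
    # e.g. weak_learner_factory = lambda: DecisionTreeClassifier(max_depth=1)
    X, y = np.asarray(X), np.asarray(y)              # y in {-1, +1}
    w = np.full(len(y), 1.0 / len(y))                # initialize weights on samples
    alphas, learners = [], []
    for _ in range(T):
        h = weak_learner_factory()
        h.fit(X, y, sample_weight=w)                 # fit weak classifier on weighted data
        pred = h.predict(X)
        err = np.clip(w[pred != y].sum(), 1e-10, 1 - 1e-10)   # weighted error
        alpha = 0.5 * np.log((1 - err) / err)
        w *= np.exp(-alpha * y * pred)               # update sample weights
        w /= w.sum()
        alphas.append(alpha)
        learners.append(h)

    def strong_classifier(X_query):
        # final strong classifier: sign of the weighted sum of weak predictions
        X_query = np.asarray(X_query)
        return np.sign(sum(a * h.predict(X_query) for a, h in zip(alphas, learners)))

    return strong_classifier
```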
Boosting
• It is a sequential procedure:
[Figure: training rounds t = 1, 2, …]
Each data point has a class label yt ∈ {+1, −1} and a weight wt = 1.
93
Toy example
Weak learners from the family of lines
Each data point has a class label yt ∈ {+1, −1} and a weight wt = 1.
A weak learner h with p(error) = 0.5 is at chance.
94
Toy example
Each data point has a class label yt ∈ {+1, −1} and a weight wt = 1.
This one seems to be the best
This is a ‘weak classifier’: It performs slightly better than chance.
95
Toy example
Each data point has a class label yt ∈ {+1, −1} and a weight wt.
After each round, we update the weights: wt ← wt exp{−yt Ht}
96–99
Toy example
[Figure: final classifier combining weak classifiers f1, f2, f3, f4]
The strong (non-linear) classifier is built as the combination of all the weak (linear) classifiers.
100
Adaboost
Input: training samples
Initialize weights on samples
For T iterations:
Select best weak classifier
based on weighted error
Update sample weights
Output: final strong classifier
(combination of selected weak
classifier predictions)
Boosting for Face Detection
102
Face detection
[Figure: x → features → classify F(x) → y ∈ {+1 face, −1 not face}]
• We slide a window over the image
• Extract features for each window
• Classify each window into face/non-face
What is a face?
• Eyes are dark (eyebrows+shadows)
• Cheeks and forehead are bright.
• Nose is bright
Paul Viola, Michael Jones, Robust Real-time Object Detection, IJCV 04
Basic feature extraction
• Information type: intensity
• Sum over: gray and white rectangles
• Output: gray − white
• Separate output value for:
– Each type
– Each scale
– Each position in the window
• FEX(im) = x = [x1, x2, …, xn]
[Figure: example rectangle features x120, x357, x629, x834 overlaid on a face window]
Paul Viola, Michael Jones, Robust Real-time Object Detection, IJCV 04
Decision trees
• Stump:
– 1 root
– 2 leaves
• If xi > a then positive, else negative
• Very simple
• “Weak classifier”
[Figure: example features x120, x357, x629, x834 on a face window]
Paul Viola, Michael Jones, Robust Real-time Object Detection, IJCV 04
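A minimal decision-stump weak classifier in the spirit of this slide; it thresholds a single feature xᵢ at a value a (the feature index and threshold in the usage comment are hypothetical):

```python
import numpy as np

class Stump:
    """Decision stump: 1 root, 2 leaves."""
    def __init__(self, feature_index, threshold):
        self.i, self.a = feature_index, threshold

    def predict(self, X):
        # if x_i > a then positive (+1) else negative (-1)
        return np.where(np.asarray(X)[:, self.i] > self.a, 1, -1)

# hypothetical values for illustration:
# stump = Stump(feature_index=120, threshold=0.5)
```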
Summary: Face detection
• Use decision stumps
as weak classifiers
• Use boosting to build
a strong classifier
• Use sliding window to
detect the face
[Figure: decision stump testing x234 > 1.3 – No → +1 (Face), Yes → −1 (Non-face) – with example features x120, x357, x629, x834 on the detection window]
Semi-Supervised Learning
Slide 108 of 113
Supervised learning has many successes
• recognize speech
• steer a car
• classify documents
• classify proteins
• recognizing faces, objects in images
• ...
Slide Credit: Avrim Blum
Slide 109 of 113
However, for many problems, labeled data
can be rare or expensive.
Need to pay someone to do it, requires special testing,…
Unlabeled data is much cheaper: speech, customer modeling, images, protein sequences, medical outcomes, web pages, …
[From Jerry Zhu]
Can we make use of cheap unlabeled data?
Slide Credit: Avrim Blum
110–113
Semi-Supervised Learning
Can we use unlabeled data to augment a small
labeled sample to improve learning?
But unlabeled data is missing the most important info!!
But maybe it still has useful regularities that we can use.
But… But… But…
Slide Credit: Avrim Blum
Slide 114 of 113
Method 1:
EM
115
How to use unlabeled data
• One way is to use the EM algorithm
– EM: Expectation Maximization
• The EM algorithm is a popular iterative algorithm for
maximum likelihood estimation in problems with missing
data.
• The EM algorithm consists of two steps:
– Expectation step: fill in the missing data
– Maximization step: calculate a new maximum a posteriori estimate for the parameters.
Slide 116 of 113
Example Algorithm
1. Train a classifier with only the labeled
documents.
2. Use it to probabilistically classify the
unlabeled documents.
3. Use ALL the documents to train a new
classifier.
4. Iterate steps 2 and 3 to convergence.
Slide 117 of 113
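A rough sketch of this loop, using scikit-learn's MultinomialNB as the probabilistic text classifier and hard pseudo-labels in step 3 for brevity (a fully probabilistic EM would weight each unlabeled document by its class probabilities); dense count matrices are assumed:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def em_self_training(X_lab, y_lab, X_unlab, n_iters=10):
    clf = MultinomialNB().fit(X_lab, y_lab)           # 1. train on labeled docs only
    for _ in range(n_iters):                          # 4. iterate to convergence
        probs = clf.predict_proba(X_unlab)            # 2. probabilistically classify unlabeled docs
        pseudo = clf.classes_[probs.argmax(axis=1)]
        X_all = np.vstack([X_lab, X_unlab])           # 3. retrain on ALL documents
        y_all = np.concatenate([y_lab, pseudo])
        clf = MultinomialNB().fit(X_all, y_all)
    return clf
```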
Method 2:
Co-Training
118
Co-training
[Blum&Mitchell’98]
Many problems have two different sources of info
(“features/views”) you can use to determine label.
E.g., classifying faculty webpages: can use words on page or words on links pointing to the page.
[Figure: a faculty webpage (“Prof. Avrim Blum”, “My Advisor”) – left: x = link info & text info together; right: split into x1 = link info, x2 = text info]
Slide 119 of 113
Co-training
Idea: Use small labeled sample to learn initial rules.
– E.g., “my advisor” pointing to a page is a good indicator
it is a faculty home page.
– E.g., “I am teaching” on a page is a good indicator it is a
faculty home page.
my advisor
Slide Credit: Avrim Blum
Slide 120 of 113
Co-training
Idea: Use small labeled sample to learn initial rules.
– E.g., “my advisor” pointing to a page is a good indicator
it is a faculty home page.
– E.g., “I am teaching” on a page is a good indicator it is a
faculty home page.
Then look for unlabeled examples where one view is
confident and the other is not. Have it label the example
for the other.
⟨x1,x2⟩  ⟨x1,x2⟩  ⟨x1,x2⟩  ⟨x1,x2⟩  ⟨x1,x2⟩  ⟨x1,x2⟩
Train 2 classifiers, one on each type of info, using each to help train the other.
Slide Credit: Avrim Blum
Slide 121 of 113
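A rough co-training sketch with two Naive Bayes classifiers, one per view (x1 = link info, x2 = text info); the per-round budget and the use of MultinomialNB are assumptions, not Blum & Mitchell's exact procedure:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def cotrain(X1_lab, X2_lab, y_lab, X1_unlab, X2_unlab, rounds=10, per_round=5):
    L1, L2, y = list(X1_lab), list(X2_lab), list(y_lab)   # small labeled sample (both views)
    U1, U2 = list(X1_unlab), list(X2_unlab)               # unlabeled pool
    c1 = c2 = None
    for _ in range(rounds):
        # train one classifier per view on the current labeled set
        c1 = MultinomialNB().fit(np.array(L1), y)
        c2 = MultinomialNB().fit(np.array(L2), y)
        for clf, view in ((c1, U1), (c2, U2)):
            if not U1:
                break
            probs = clf.predict_proba(np.array(view))
            conf = probs.max(axis=1)
            # examples this view is confident about get labeled for the other view
            picks = sorted(np.argsort(conf)[-per_round:], reverse=True)
            for i in picks:
                L1.append(U1[i]); L2.append(U2[i])
                y.append(clf.classes_[probs[i].argmax()])
                del U1[i]; del U2[i]
    return c1, c2
```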
Co-training vs. EM
• Co-training splits features, EM does not.
• Co-training incrementally uses the unlabeled
data.
• EM probabilistically labels all the data at each
round; EM iteratively uses the unlabeled data.
Slide 122 of 113