Neural networks

MORE CLASSIFIERS
AGENDA

- Key concepts for all classifiers
  - Precision vs. recall
  - Biased sample sets
- Linear classifiers
- Intro to neural networks

RECAP: DECISION BOUNDARIES

With continuous attributes, a decision boundary is the surface in example space that splits positive from negative examples.

[Figure: a decision tree testing x1>=20, x2>=10, and x2>=15, shown alongside the corresponding axis-aligned decision boundary in the (x1, x2) plane.]
BEYOND ERROR RATE

- Predicting security risk
  - Predicting “low risk” for a terrorist is far worse than predicting “high risk” for an innocent bystander (but maybe not 5 million of them)
- Searching for images
  - Returning irrelevant images is worse than omitting relevant ones
BIASED SAMPLE SETS
- Often there are orders of magnitude more negative examples than positive
  - E.g., all images of Kris on Facebook
  - If I classify all images as “not Kris” I’ll have >99.99% accuracy
- Examples of Kris should count much more than non-Kris!
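As a quick check of that claim, here is a minimal sketch (the counts are made up for illustration) of the accuracy of an always-negative classifier on a heavily imbalanced dataset:

```python
# Hypothetical counts: 20 images of Kris among 1,000,000 images.
n_total = 1_000_000
n_kris = 20

# A classifier that labels every image "not Kris" is wrong only on the Kris images.
accuracy = (n_total - n_kris) / n_total
print(f"Accuracy of the 'always not Kris' classifier: {accuracy:.4%}")  # 99.9980%
```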
FALSE POSITIVES

[Figure: true vs. learned decision boundary in the (x1, x2) plane, with an example incorrectly predicted to be positive.]
FALSE POSITIVES

[Figure: true vs. learned decision boundary in the (x1, x2) plane, with a new query point.]
FALSE NEGATIVES

[Figure: true vs. learned decision boundary in the (x1, x2) plane, with a new query that is an example incorrectly predicted to be negative.]
PRECISION VS. RECALL

- Precision
  - # of relevant documents retrieved / # of total documents retrieved
- Recall
  - # of relevant documents retrieved / # of total relevant documents
- Numbers between 0 and 1
PRECISION VS. RECALL

- Precision
  - # of true positives / (# true positives + # false positives)
- Recall
  - # of true positives / (# true positives + # false negatives)
- A precise classifier is selective
- A classifier with high recall is inclusive
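As a minimal sketch of these definitions (the labels below are made up; 1 marks a positive, 0 a negative):

```python
def precision_recall(y_true, y_pred):
    """Compute precision and recall for binary labels (1 = positive, 0 = negative)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 1.0  # selective: few false positives
    recall = tp / (tp + fn) if tp + fn else 1.0     # inclusive: few false negatives
    return precision, recall

# 3 true positives, 1 false positive, 2 false negatives
y_true = [1, 1, 1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1, 0, 0]
print(precision_recall(y_true, y_pred))  # (0.75, 0.6)
```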
REDUCING FALSE POSITIVE RATE

[Figure: true vs. learned decision boundary in the (x1, x2) plane; the learned boundary is shifted to reduce false positives.]
REDUCING FALSE NEGATIVE RATE

[Figure: true vs. learned decision boundary in the (x1, x2) plane; the learned boundary is shifted to reduce false negatives.]
PRECISION-RECALL CURVES

Measure precision vs. recall as the decision boundary is tuned.

[Figure: precision-recall plot comparing a perfect classifier to actual performance.]
PRECISION-RECALL CURVES

[Figure: points along the precision-recall curve, ranging from penalizing false negatives, to equal weight, to penalizing false positives.]
PRECISION-RECALL CURVES

[Figure: two precision-recall curves, with the higher curve corresponding to better learning performance.]
OPTION 1: CLASSIFICATION THRESHOLDS

- Many learning algorithms (e.g., probabilistic models, linear models) give real-valued output v(x) that needs thresholding for classification:
  - v(x) > t => positive label given to x
  - v(x) < t => negative label given to x
- May want to tune the threshold to get fewer false positives or false negatives
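A minimal sketch of this idea (the scores and labels below are invented): sweeping the threshold t over a classifier's real-valued outputs v(x) yields one precision-recall point per setting.

```python
def pr_points(scores, y_true):
    """For each threshold t, classify v(x) > t as positive and report
    (threshold, precision, recall)."""
    points = []
    for t in sorted(set(scores)):
        y_pred = [1 if v > t else 0 for v in scores]
        tp = sum(1 for p, y in zip(y_pred, y_true) if p == 1 and y == 1)
        fp = sum(1 for p, y in zip(y_pred, y_true) if p == 1 and y == 0)
        fn = sum(1 for p, y in zip(y_pred, y_true) if p == 0 and y == 1)
        precision = tp / (tp + fp) if tp + fp else 1.0
        recall = tp / (tp + fn) if tp + fn else 1.0
        points.append((t, precision, recall))
    return points

# Invented classifier scores v(x) and true labels
scores = [0.9, 0.8, 0.7, 0.55, 0.4, 0.3, 0.2]
labels = [1,   1,   0,   1,    0,   1,   0]
for t, p, r in pr_points(scores, labels):
    print(f"t={t:.2f}  precision={p:.2f}  recall={r:.2f}")
```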
OPTION 2: WEIGHTED DATASETS

- Weighted datasets: attach a weight w to each example to indicate how important it is
  - Instead of counting “# of errors”, count “sum of weights of errors” (see the sketch below)
  - Or construct a resampled dataset D’ where each example is duplicated proportionally to its w
- As the relative weight of positive vs. negative examples is tuned from 0 to 1, the precision-recall curve is traced out
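Below is a minimal sketch of the weighted-error count (weights and labels are illustrative): misclassified positives and negatives are charged different amounts, so tuning the positive weight changes which errors dominate.

```python
def weighted_error(y_true, y_pred, pos_weight):
    """Sum of the weights of misclassified examples, charging pos_weight for each
    false negative and (1 - pos_weight) for each false positive."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        if y != p:
            total += pos_weight if y == 1 else (1.0 - pos_weight)
    return total

y_true = [1, 1, 0, 0, 0, 0]
y_pred = [0, 0, 1, 0, 0, 0]   # two false negatives, one false positive
print(round(weighted_error(y_true, y_pred, pos_weight=0.9), 2))  # 1.9: false negatives dominate
print(round(weighted_error(y_true, y_pred, pos_weight=0.5), 2))  # 1.5: equal weight
print(round(weighted_error(y_true, y_pred, pos_weight=0.1), 2))  # 1.1: false positives dominate
```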
LINEAR CLASSIFIERS: MOTIVATION

- Decision trees produce axis-aligned decision boundaries
- Can we accurately classify data like this?

[Figure: two-class data in the (x1, x2) plane that no axis-aligned boundary separates well.]
PLANE GEOMETRY

- Any line in 2D can be expressed as the set of solutions (x,y) to the equation ax+by+c=0 (an implicit surface)
  - ax+by+c > 0 is one side of the line
  - ax+by+c < 0 is the other
  - ax+by+c = 0 is the line itself

[Figure: a line in the (x,y) plane, with the coefficients (a,b) giving its normal direction.]
PLANE GEOMETRY

- In 3D, a plane can be expressed as the set of solutions (x,y,z) to the equation ax+by+cz+d=0
  - ax+by+cz+d > 0 is one side of the plane
  - ax+by+cz+d < 0 is the other side
  - ax+by+cz+d = 0 is the plane itself

[Figure: a plane in (x,y,z) space, with the coefficients (a,b,c) giving its normal direction.]
LINEAR CLASSIFIER

- In d dimensions, c0 + c1*x1 + … + cd*xd = 0 is a hyperplane.
- Idea:
  - Use c0 + c1*x1 + … + cd*xd > 0 to denote positive classifications
  - Use c0 + c1*x1 + … + cd*xd < 0 to denote negative classifications
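A minimal sketch of this rule (the coefficient vector below is arbitrary):

```python
def linear_classify(c, x):
    """Classify a d-dimensional point x by the sign of c0 + c1*x1 + ... + cd*xd.
    `c` has length d+1 (c[0] is the constant term)."""
    value = c[0] + sum(ci * xi for ci, xi in zip(c[1:], x))
    return "positive" if value > 0 else "negative"

# Hyperplane 1 - x1 + 2*x2 = 0 in two dimensions (illustrative coefficients)
c = [1.0, -1.0, 2.0]
print(linear_classify(c, [4.0, 1.0]))   # 1 - 4 + 2 = -1  -> negative
print(linear_classify(c, [1.0, 1.0]))   # 1 - 1 + 2 =  2  -> positive
```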

PERCEPTRON

y = f(x,w) = g(Σi=1,…,n wi xi)

[Figure: a perceptron unit with inputs x1, …, xn, weights wi, a summation node Σ, and threshold function g producing output y; alongside, two-class data in the (x1, x2) plane separated by the line w1 x1 + w2 x2 = 0, and a plot of the step function g(u).]
A SINGLE PERCEPTRON CAN LEARN

- A disjunction of boolean literals x1 ∨ x2 ∨ x3
- The majority function
- XOR?
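A minimal sketch of these claims (the weights are hand-picked for illustration): a single threshold unit can compute a disjunction and the majority function, but no setting of its weights reproduces XOR on all four inputs.

```python
def perceptron(weights, bias, x):
    """Threshold unit: output 1 if bias + sum_i w_i x_i > 0, else 0."""
    return 1 if bias + sum(w * xi for w, xi in zip(weights, x)) > 0 else 0

# OR of three inputs: fire if any input is 1
or3 = lambda x: perceptron([1, 1, 1], -0.5, x)
# Majority of three inputs: fire if at least two inputs are 1
maj3 = lambda x: perceptron([1, 1, 1], -1.5, x)

print(or3((0, 0, 1)), maj3((0, 1, 1)), maj3((0, 0, 1)))  # 1 1 0

# XOR: no single threshold unit matches all four cases (try any weights/bias)
xor_cases = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}
w, b = [1, 1], -0.5   # this unit computes OR, so it fails on (1, 1)
print(all(perceptron(w, b, x) == y for x, y in xor_cases.items()))  # False
```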
PERCEPTRON LEARNING RULE

θ ← θ + α x(i) (y(i) − g(θᵀ x(i)))

- (g outputs either 0 or 1; y is either 0 or 1)
- If the output is correct, the weights are unchanged
- If g is 0 but y is 1, the weight on each attribute i is increased in proportion to x(i)
- If g is 1 but y is 0, the weight on each attribute i is decreased in proportion to x(i)
- Converges if the data is linearly separable, but oscillates otherwise (a training-loop sketch follows)
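A minimal sketch of this update rule (the dataset, learning rate α, and epoch count are made up; each x(i) starts with a constant 1 so that θ0 acts as a bias):

```python
def train_perceptron(examples, alpha=0.5, epochs=20):
    """Perceptron learning rule: theta <- theta + alpha * x * (y - g(theta . x)),
    where g is a 0/1 step function."""
    d = len(examples[0][0])
    theta = [0.0] * d
    g = lambda u: 1 if u > 0 else 0
    for _ in range(epochs):
        for x, y in examples:
            pred = g(sum(t * xi for t, xi in zip(theta, x)))
            err = y - pred                      # 0 if correct, +1 or -1 otherwise
            theta = [t + alpha * err * xi for t, xi in zip(theta, x)]
    return theta

# Linearly separable toy data: y = x1 AND x2 (the leading 1 is the constant feature)
data = [((1, 0, 0), 0), ((1, 1, 0), 0), ((1, 0, 1), 0), ((1, 1, 1), 1)]
print(train_perceptron(data))  # [-1.0, 0.5, 1.0]: a separating hyperplane for AND
```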
PERCEPTRON

y = f(x,w) = g(Σi=1,…,n wi xi)

[Figure: the same perceptron unit and step function g(u), now with two-class data in the (x1, x2) plane and a query point marked “?”.]
UNIT (NEURON)

y = g(Σi=1,…,n wi xi)
g(u) = 1/[1 + exp(-u)]

[Figure: a single unit with inputs x1, …, xn, weights wi, a summation node Σ, and sigmoid activation g producing output y.]
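A minimal sketch of a single sigmoid unit (the weights below are arbitrary):

```python
import math

def unit(weights, x):
    """Sigmoid unit: y = g(sum_i w_i x_i), with g(u) = 1 / (1 + exp(-u))."""
    u = sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-u))

print(unit([0.5, -1.0, 2.0], [1.0, 1.0, 1.0]))  # g(1.5) ≈ 0.82
```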
NEURAL NETWORK

- Network of interconnected neurons
- Acyclic (feed-forward) vs. recurrent networks

[Figure: several interconnected units, each computing y = g(Σi wi xi) over its inputs.]
TWO-LAYER FEED-FORWARD NEURAL NETWORK

[Figure: inputs connected by weights w1j to a hidden layer, whose outputs are connected by weights w2k to an output layer.]
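A minimal sketch of a forward pass through such a network (layer sizes and weights are arbitrary, sigmoid units throughout, bias terms omitted for brevity):

```python
import math

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def layer(weights, inputs):
    """One layer of sigmoid units; each row of `weights` is one unit's weight vector."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs))) for row in weights]

def two_layer_net(w1, w2, x):
    """Feed-forward pass: inputs -> hidden layer (weights w1) -> output layer (weights w2)."""
    hidden = layer(w1, x)
    return layer(w2, hidden)

# 2 inputs, 3 hidden units, 1 output unit (illustrative weights)
w1 = [[0.5, -0.3], [0.8, 0.2], [-0.6, 0.9]]
w2 = [[1.0, -1.5, 0.7]]
print(two_layer_net(w1, w2, [1.0, 2.0]))  # ≈ [0.47]
```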
NETWORKS WITH HIDDEN LAYERS

- Can represent XORs and other nonlinear functions
- Common neuron types:
  - Soft perceptron (sigmoid), radial basis functions, linear, …
- As the number of hidden units increases, so does the network’s capacity to learn functions with more nonlinear features
- How to train hidden layers?
BACKPROPAGATION (PRINCIPLE)

- Treat the problem as one of minimizing errors between the example label and the network output, given the example and network weights as input:
  - Error(xi, yi, w) = (yi − f(xi, w))²
- Sum this error term over all examples:
  - E(w) = Σi Error(xi, yi, w) = Σi (yi − f(xi, w))²
- Minimize errors using an optimization algorithm
  - Stochastic gradient descent is typically used
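A minimal sketch of this objective (the predictor f below is a simple linear model standing in for the network f(x, w), and the dataset is invented):

```python
def error(x, y, w, f):
    """Squared error on one example: Error(x, y, w) = (y - f(x, w))**2."""
    return (y - f(x, w)) ** 2

def total_error(examples, w, f):
    """E(w) = sum over examples of (y_i - f(x_i, w))**2."""
    return sum(error(x, y, w, f) for x, y in examples)

# Illustrative predictor f(x, w) = w0 + w1*x and a tiny dataset
f = lambda x, w: w[0] + w[1] * x
data = [(0.0, 0.1), (1.0, 0.9), (2.0, 2.1)]
print(round(total_error(data, [0.0, 1.0], f), 3))  # 0.01 + 0.01 + 0.01 = 0.03
```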
- The gradient direction ∇E is orthogonal to the level sets (contours) of E and points in the direction of steepest increase
- Gradient descent: iteratively move in the direction −∇E
STOCHASTIC GRADIENT DESCENT

- For each example (xi, yi), take a gradient descent step to reduce the error for (xi, yi) only
- Objective function values (measured over all examples) settle into a local minimum over time
- Step size must be reduced over time, e.g., O(1/t)
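A minimal sketch of SGD on the squared-error objective above, reusing the same illustrative linear predictor (its per-example gradient is written out by hand) with a step size that decays like O(1/t):

```python
import random

def sgd(examples, w, alpha0=0.1, epochs=200):
    """One gradient step per example on Error = (y - (w[0] + w[1]*x))**2,
    with the step size reduced over time."""
    for t in range(1, epochs + 1):
        alpha = alpha0 / t                      # O(1/t) step size schedule
        random.shuffle(examples)
        for x, y in examples:
            residual = y - (w[0] + w[1] * x)
            # Gradient of the squared error w.r.t. (w0, w1) is (-2*residual, -2*residual*x);
            # step in the opposite direction.
            w[0] += alpha * 2 * residual
            w[1] += alpha * 2 * residual * x
    return w

data = [(0.0, 0.1), (1.0, 0.9), (2.0, 2.1)]
print(sgd(data, [0.0, 0.0]))  # gradually approaches the least-squares fit, about [0.03, 1.0]
```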
NEURAL NETWORKS: PROS AND CONS

- Pros
  - Bioinspiration is nifty
  - Can represent a wide variety of decision boundaries
  - Complexity is easily tunable (number of hidden nodes, topology)
  - Easily extendable to regression tasks
- Cons
  - Haven’t gotten close to unlocking the power of the human (or cat) brain
  - Complex boundaries need lots of data
  - Slow training
  - Mostly lukewarm feelings in mainstream ML (although the “deep learning” variant is en vogue now)
NEXT CLASS

Another guest lecture