MORE CLASSIFIERS

AGENDA
Key concepts for all classifiers: precision vs. recall, biased sample sets
Linear classifiers
Intro to neural networks

RECAP: DECISION BOUNDARIES
With continuous attributes, a decision boundary is the surface in example space that separates positive from negative examples.
[Figure: a decision tree splitting on x1 >= 20, x2 >= 10, and x2 >= 15, alongside the axis-aligned decision boundary it induces in the (x1, x2) plane]

BEYOND ERROR RATE
Not all errors are equally costly.
Predicting security risk: predicting "low risk" for a terrorist is far worse than predicting "high risk" for an innocent bystander (but maybe not for 5 million of them).
Searching for images: returning irrelevant images is worse than omitting relevant ones.

BIASED SAMPLE SETS
Often there are orders of magnitude more negative examples than positive ones.
E.g., consider all images of Kris on Facebook: if I classify every image as "not Kris," I'll have > 99.99% accuracy.
Examples of Kris should count much more than non-Kris examples!

FALSE POSITIVES
An example incorrectly predicted to be positive.
[Figure: true vs. learned decision boundary, with a new query misclassified as positive]

FALSE NEGATIVES
An example incorrectly predicted to be negative.
[Figure: true vs. learned decision boundary, with a new query misclassified as negative]

PRECISION VS. RECALL
Precision = # of relevant documents retrieved / # of total documents retrieved
Recall = # of relevant documents retrieved / # of total relevant documents
Both are numbers between 0 and 1.
In classification terms:
Precision = # true positives / (# true positives + # false positives)
Recall = # true positives / (# true positives + # false negatives)
A precise classifier is selective; a classifier with high recall is inclusive.

REDUCING FALSE POSITIVE / FALSE NEGATIVE RATES
Shifting the learned decision boundary trades one type of error for the other: shrinking the region predicted positive reduces false positives, while enlarging it reduces false negatives.
[Figures: the learned boundary shifted in each direction relative to the true boundary]

PRECISION-RECALL CURVES
Measure precision vs. recall as the decision boundary is tuned.
[Figure: precision-recall curves. A perfect classifier achieves precision = recall = 1; actual performance traces a curve below that point. Moving along the curve trades off penalizing false positives against penalizing false negatives, with an equal-weight point in between. Better learning performance pushes the curve toward the perfect classifier.]

OPTION 1: CLASSIFICATION THRESHOLDS
Many learning algorithms (e.g., probabilistic models, linear models) give a real-valued output v(x) that must be thresholded for classification:
v(x) > t => x is given a positive label
v(x) < t => x is given a negative label
The threshold t may be tuned to get fewer false positives or fewer false negatives.

OPTION 2: WEIGHTED DATASETS
Attach a weight w to each example to indicate how important it is.
Instead of counting the number of errors, count the sum of the weights of the misclassified examples.
Or construct a resampled dataset D' in which each example is duplicated in proportion to its weight w.
As the relative weight of positive vs. negative examples is tuned from 0 to 1, the precision-recall curve is traced out.
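As a concrete illustration of the precision/recall definitions and the threshold tuning of Option 1, here is a minimal Python sketch; the toy scores, labels, and function name are illustrative assumptions, not material from the lecture.

```python
# Minimal sketch: precision and recall from thresholded real-valued outputs v(x).
# Toy data and names are illustrative, not from the lecture.

def precision_recall(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 1.0  # convention when nothing is labeled positive
    recall = tp / (tp + fn) if tp + fn else 1.0     # convention when there are no positives
    return precision, recall

scores = [0.10, 0.40, 0.35, 0.80, 0.65, 0.20]  # real-valued outputs v(x)
labels = [0, 0, 1, 1, 1, 0]                    # true classes y

# Sweeping the threshold t gives one (precision, recall) point per value,
# tracing out a precision-recall curve as the decision boundary is tuned.
for t in (0.0, 0.3, 0.5, 0.7, 0.9):
    preds = [1 if v > t else 0 for v in scores]
    p, r = precision_recall(labels, preds)
    print(f"t={t:.1f}  precision={p:.2f}  recall={r:.2f}")
```

Raising t makes the classifier more selective (higher precision, lower recall); lowering it makes it more inclusive, which is exactly the trade-off a precision-recall curve plots.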
LINEAR CLASSIFIERS: MOTIVATION
A decision tree produces axis-aligned decision boundaries. Can we accurately classify data like this?
[Figure: data in the (x1, x2) plane that no axis-aligned boundary separates cleanly]

PLANE GEOMETRY
Any line in 2D can be expressed as the set of solutions (x, y) to the equation ax + by + c = 0 (an implicit surface).
ax + by + c > 0 is one side of the line; ax + by + c < 0 is the other; ax + by + c = 0 is the line itself.
In 3D, a plane can be expressed as the set of solutions (x, y, z) to the equation ax + by + cz + d = 0.
ax + by + cz + d > 0 is one side of the plane; ax + by + cz + d < 0 is the other; ax + by + cz + d = 0 is the plane itself.

LINEAR CLASSIFIER
In d dimensions, c0 + c1*x1 + … + cd*xd = 0 is a hyperplane.
Idea: use c0 + c1*x1 + … + cd*xd > 0 to denote positive classifications and c0 + c1*x1 + … + cd*xd < 0 to denote negative classifications.

PERCEPTRON
y = f(x, w) = g(Σ_{i=1,…,n} w_i x_i)
[Figure: inputs x1, …, xn weighted by w_i, summed, and passed through a threshold function g(u); in 2D the separating line is w1*x1 + w2*x2 = 0]

A SINGLE PERCEPTRON CAN LEARN
A disjunction of boolean literals, e.g., x1 ∨ x2 ∨ x3
The majority function
XOR? (No: XOR is not linearly separable, so no single perceptron can represent it.)

PERCEPTRON LEARNING RULE
θ ← θ + x^(i) (y^(i) − g(θ^T x^(i)))
(g outputs either 0 or 1; y is either 0 or 1)
If the output is correct, the weights are unchanged.
If g outputs 0 but y is 1, θ is increased by x^(i), raising θ^T x on this example and pushing the output toward 1.
If g outputs 1 but y is 0, θ is decreased by x^(i), lowering θ^T x on this example and pushing the output toward 0.
Converges if the data is linearly separable, but oscillates otherwise.
[Figure: the perceptron unit applied to data that is not linearly separable]

UNIT (NEURON)
y = g(Σ_{i=1,…,n} w_i x_i), with g(u) = 1/[1 + exp(−u)] (a sigmoid in place of a hard threshold).

NEURAL NETWORK
A network of interconnected units.
Acyclic (feed-forward) vs. recurrent networks.

TWO-LAYER FEED-FORWARD NEURAL NETWORK
[Figure: inputs connected to a hidden layer by weights w1j, and the hidden layer connected to the output layer by weights w2k]

NETWORKS WITH HIDDEN LAYERS
Can represent XOR and other nonlinear functions.
Common unit types: soft perceptron (sigmoid), radial basis functions, linear, …
As the number of hidden units increases, so does the network's capacity to learn functions with more nonlinear features.
How do we train the hidden layers?

BACKPROPAGATION (PRINCIPLE)
Treat the problem as minimizing the error between each example's label and the network output, given the example and the network weights as input:
Error(x_i, y_i, w) = (y_i − f(x_i, w))^2
Sum this error term over all examples:
E(w) = Σ_i Error(x_i, y_i, w) = Σ_i (y_i − f(x_i, w))^2
Minimize the error using an optimization algorithm; stochastic gradient descent is typically used.

GRADIENT DESCENT
The gradient direction ∇E is orthogonal to the level sets (contours) of E and points in the direction of steepest increase.
Gradient descent: iteratively move in the direction −∇E.

STOCHASTIC GRADIENT DESCENT
For each example (x_i, y_i), take a gradient descent step that reduces the error for (x_i, y_i) only.
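The sketch below makes the per-example update concrete for a single sigmoid unit minimizing the squared error E(w) above. The toy data, the decay constant in the step-size schedule, and the function names are illustrative assumptions, not material from the lecture.

```python
import math
import random

# Minimal sketch: stochastic gradient descent for a single sigmoid unit
# y = g(w . x) with g(u) = 1 / (1 + exp(-u)), minimizing the squared error
# E(w) = sum_i (y_i - g(w . x_i))^2 one example at a time.
# Toy data and hyperparameters are illustrative, not from the lecture.

def g(u):
    return 1.0 / (1.0 + math.exp(-u))

def sgd(examples, dim, epochs=200, eta0=0.5):
    examples = list(examples)            # work on a copy so shuffling is harmless
    w = [0.0] * dim
    t = 0
    for _ in range(epochs):
        random.shuffle(examples)
        for x, y in examples:
            t += 1
            eta = eta0 / (1.0 + 0.01 * t)   # step size shrinks roughly as O(1/t)
            out = g(sum(wi * xi for wi, xi in zip(w, x)))
            # d/dw (y - out)^2 = -2 (y - out) out (1 - out) x, so stepping
            # against the gradient on this single example gives:
            scale = 2.0 * (y - out) * out * (1.0 - out)
            w = [wi + eta * scale * xi for wi, xi in zip(w, x)]
    return w

# Toy data: a constant feature x[0] = 1 plays the role of c0, and the label
# equals the second feature x[1], so the data is linearly separable.
data = [([1, 0, 0], 0), ([1, 0, 1], 0), ([1, 1, 0], 1), ([1, 1, 1], 1)]
w = sgd(data, dim=3)
print([round(g(sum(wi * xi for wi, xi in zip(w, x))), 2) for x, _ in data])
```

The perceptron learning rule from earlier has the same per-example shape, with a hard threshold in place of the sigmoid; the gradient-based version simply scales the update by the learning rate and the sigmoid's slope.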
STOCHASTIC GRADIENT DESCENT
Objective function values (measured over all examples) settle into a local minimum over time.
The step size must be reduced over time, e.g., as O(1/t).

NEURAL NETWORKS: PROS AND CONS
Pros:
Bioinspiration is nifty.
Can represent a wide variety of decision boundaries.
Complexity is easily tunable (number of hidden nodes, topology).
Easily extended to regression tasks.
Cons:
We haven't come close to unlocking the power of the human (or cat) brain.
Complex boundaries need lots of data.
Training is slow.
Mostly lukewarm feelings in mainstream ML (although the "deep learning" variant is en vogue now).

NEXT CLASS
Another guest lecture