A classification learning example
Predicting when Russell will wait for a table
--similar to book preferences, predicting credit card fraud, predicting when people are likely to respond to junk mail

Uses different biases in predicting Russell's waiting habits
K-nearest neighbors
Decision Trees
--Examples are used to
  --Learn topology
  --Order of questions
Association rules
  If patrons=full and day=Friday then wait (0.3/0.7)
  If wait>60 and Reservation=no then wait (0.4/0.9)
--Examples are used to
  --Learn support and confidence of association rules
SVMs
Neural Nets
--Examples are used to
  --Learn topology
  --Learn edge weights
Naïve Bayes (bayesnet learning)
--Examples are used to
  --Learn topology
  --Learn CPTs
  Example CPT, conditioned on Russell waits (RW):
  RW   None   Some   Full
  T    0.3    0.2    0.5
  F    0.4    0.3    0.3
[Figure node labels: Wait time? Patrons? Friday?]

Inductive Learning (Classification Learning)
• Given a set of labeled examples, and a space of hypotheses
  – Find the rule that underlies the labeling
    • (so you can use it to predict future unlabeled examples)
  – Tabula rasa, fully supervised
• Idea:
  – Loop through all hypotheses
    • Rank each hypothesis in terms of its match to data
    • Pick the best hypothesis

The main problem is that the space of hypotheses is too large
Given examples described in terms of n boolean variables, there are $2^{2^n}$ different hypotheses
For 6 features, there are 18,446,744,073,709,551,616 hypotheses

A good hypothesis will have the fewest false positives ($F_h^+$) and the fewest false negatives ($F_h^-$)
[Ideally, we want them to be zero]
On training or testing data??
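The "loop through all hypotheses, rank, pick the best" idea above can be sketched by brute force for a very small n. This is only an illustrative sketch: the function names and the two-feature toy data are assumptions made here, not part of the slides, and the ranking simply adds false positives and false negatives.

```python
from itertools import product

def all_hypotheses(n):
    """Yield every boolean function of n boolean inputs as a truth table (dict)."""
    inputs = list(product([False, True], repeat=n))
    for outputs in product([False, True], repeat=len(inputs)):
        yield dict(zip(inputs, outputs))

def errors(h, examples):
    """Return (false positives, false negatives) of hypothesis h on labeled examples."""
    fp = sum(1 for x, label in examples if h[x] and not label)
    fn = sum(1 for x, label in examples if not h[x] and label)
    return fp, fn

def best_hypothesis(n, examples):
    """Exhaustively rank all hypotheses by fp + fn and return one with the fewest errors."""
    return min(all_hypotheses(n), key=lambda h: sum(errors(h, examples)))

# Hypothetical toy data: the label is True exactly when both features are True.
examples = [((False, False), False), ((False, True), False),
            ((True, False), False), ((True, True), True)]

h = best_hypothesis(2, examples)
print(errors(h, examples))                 # (0, 0)
print(sum(1 for _ in all_hypotheses(2)))   # 16 hypotheses for n = 2
print(2 ** (2 ** 6))                       # 18446744073709551616 for n = 6
```

With 2 features the loop visits 16 truth tables; with 6 features it would have to visit $2^{64}$ of them, which is why exhaustive ranking is not a practical learner.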
False +ve: The learner classifies the example as +ve, but it is actually -ve
False -ve: The learner classifies the example as -ve, but it is actually +ve
Error: fraction incorrectly classified

Rank(h) = f($F_h^+$, $F_h^-$) (loss function)
--f depends on the domain
--by default f=Sum; but we can give different weights to different errors (Cost-based learning)
Medical domain
  --Higher cost for F-
  --But also high cost for F+
Spam Mailer
  --Very low cost for F+
  --Higher cost for F-
Terrorist/Criminal Identification
  --High cost for F+ (for the individual)
  --High cost for F- (for the society)

Training error vs. test (prediction) error

Ranking hypotheses
H1: Russell waits only in Italian restaurants
  False +ves: X10
  False –ves: X1, X3, X4, X8, X12
H2: Russell waits only in cheap French restaurants
  False +ves: (none listed)
  False –ves: X1, X3, X4, X6, X8, X12

Complexity is measured in the number of samples required to PAC-learn

What is a reasonable goal in designing a learner?
• (Idea) Learner must classify all new instances (test cases) correctly always
• Any test cases?
  – Test cases drawn from the same distribution as the training cases
• Always?
  – Maybe the training samples are not completely representative of the test samples
  – So, we go with "probably"
• Correctly?
  – May be impossible if the training data has noise (the teacher may make mistakes too)
  – So, we go with "approximately"
• The goal of a learner then is to produce a probably approximately correct (PAC) hypothesis, for a given approximation (error rate) ε and probability δ.
• When is a learner A better than learner B?
  – For the same ε, δ bounds, A needs fewer training samples than B to reach PAC.

Learning Curves

Inductive Learning (Classification Learning): main variations
• Bias: what "sort" of rule are you looking for?
  – If you are looking only for conjunctive hypotheses, there are just $3^n$
• Search:
  – Greedy search: decision tree learner
  – Systematic search: version space learner
  – Iterative search: neural net learner

Learning Decision Trees---How?
Basic Idea:
--Pick an attribute
--Split examples in terms of that attribute
--If all examples are +ve, label Yes. Terminate
--If all examples are –ve, label No. Terminate
--If some are +ve and some are –ve, continue splitting recursively
(Special case: Decision Stumps. If you don't feel like splitting any further, return the majority label)

Which one to pick?
Depending on the order in which we pick attributes, we can get smaller or bigger trees
Which tree is better? Why do you think so??
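The basic idea above, and the effect of attribute order on tree size, can be sketched as follows. This is a minimal illustration, not the slides' own code: the function names, the tuple-based tree representation, and the two-attribute toy data (where the label depends only on attribute A) are assumptions. Attributes are tried in the order given, so the sketch deliberately leaves out any heuristic for choosing among them.

```python
from collections import Counter

def majority(examples):
    """Most common label among the examples."""
    return Counter(label for _, label in examples).most_common(1)[0][0]

def learn_tree(examples, attributes):
    """examples: list of (attribute dict, label); attributes: tried in the given order."""
    labels = {label for _, label in examples}
    if len(labels) == 1:            # all +ve or all -ve: make a leaf
        return labels.pop()
    if not attributes:              # nothing left to split on: majority label
        return majority(examples)
    attr, rest = attributes[0], attributes[1:]
    branches = {}
    for value in sorted({x[attr] for x, _ in examples}):
        subset = [(x, y) for x, y in examples if x[attr] == value]
        branches[value] = learn_tree(subset, rest)
    return (attr, branches)

def size(tree):
    """Number of nodes (internal nodes plus leaves)."""
    if not isinstance(tree, tuple):
        return 1
    return 1 + sum(size(sub) for sub in tree[1].values())

# Hypothetical toy data: the label is "Yes" exactly when A is True; B is irrelevant.
examples = [({"A": True,  "B": True},  "Yes"), ({"A": True,  "B": False}, "Yes"),
            ({"A": False, "B": True},  "No"),  ({"A": False, "B": False}, "No")]

small = learn_tree(examples, ["A", "B"])
big = learn_tree(examples, ["B", "A"])
print(size(small), size(big))   # 3 7
```

Both trees classify the four examples correctly, but asking about the relevant attribute first gives a tree less than half the size.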
Decision Trees & Sample Complexity
• Decision trees can represent any boolean function
• ..So PAC-learning decision trees should be exponentially hard (since there are $2^{2^n}$ hypotheses)
• ..however, decision tree learning algorithms use greedy approaches for learning a good (rather than the optimal) decision tree
  – Thus, using greedy rather than exhaustive search of the hypothesis space is another way of keeping complexity low (at the expense of losing PAC guarantees)

Basic Idea:
--Pick an attribute
--Split examples in terms of that attribute
--If all examples are +ve, label Yes. Terminate
--If all examples are –ve, label No. Terminate
--If some are +ve and some are –ve, continue splitting recursively
--If no attributes are left to split on? (label with the majority element)

Would you split on Patrons or Type?

The Information Gain Computation
Before the split there are $N_+$ positive and $N_-$ negative examples:
  $P_+ = N_+ / (N_+ + N_-)$
  $P_- = N_- / (N_+ + N_-)$
  $I(P_+, P_-) = -P_+ \log_2 P_+ - P_- \log_2 P_-$
  (# expected comparisons needed to tell whether a given example is +ve or -ve)
Splitting on feature $f_k$ gives one subset per value, with counts and information
  $(N_{1+}, N_{1-})$: $I(P_{1+}, P_{1-})$
  $(N_{2+}, N_{2-})$: $I(P_{2+}, P_{2-})$
  ...
  $(N_{k+}, N_{k-})$: $I(P_{k+}, P_{k-})$
The difference between the information before the split and the residual information after it is the information gain
So, pick the feature with the largest Info Gain, i.e. the smallest residual info

Given k mutually exclusive and exhaustive events $E_1 \ldots E_k$ whose probabilities are $p_1 \ldots p_k$, the "information" content (entropy) is defined as $\sum_i -p_i \log_2 p_i$
A split is good if it reduces the entropy..
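The information-gain computation above can be sketched in a few lines. The function names are assumptions made for illustration; the data is the four-example "hates exam" table from the simple example in these notes, and the logs are base 2 so that $I(1/2, 1/2) = 1$.

```python
from collections import Counter
from math import log2

def information(labels):
    """Entropy in bits of a list of labels: sum_i -p_i * log2(p_i)."""
    total = len(labels)
    return sum(-(c / total) * log2(c / total) for c in Counter(labels).values())

def residual(examples, attr):
    """Weighted entropy remaining after splitting the examples on attr."""
    groups = {}
    for x, label in examples:
        groups.setdefault(x[attr], []).append(label)
    total = len(examples)
    return sum(len(g) / total * information(g) for g in groups.values())

def gain(examples, attr):
    """Information before the split minus the residual information after it."""
    return information([label for _, label in examples]) - residual(examples, attr)

# The "hates exam" table: (attribute values, label).
examples = [
    ({"Masochistic": False, "Anxious": True,  "Nerdy": False}, "Y"),
    ({"Masochistic": False, "Anxious": False, "Nerdy": True},  "N"),
    ({"Masochistic": True,  "Anxious": False, "Nerdy": False}, "N"),
    ({"Masochistic": True,  "Anxious": True,  "Nerdy": True},  "Y"),
]

attributes = ["Masochistic", "Anxious", "Nerdy"]
best = max(attributes, key=lambda a: gain(examples, a))
print(best)   # Anxious
```

Picking the attribute with the largest gain is the same as picking the one with the smallest residual, since the information before the split does not depend on the attribute.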
Residual information after splitting on $f_k$:
  $\sum_{i=1}^{k} \frac{N_{i+} + N_{i-}}{N_+ + N_-} \, I(P_{i+}, P_{i-})$

A simple example
  Ex   Masochistic   Anxious   Nerdy   HATES EXAM
  1    F             T         F       Y
  2    F             F         T       N
  3    T             F         F       N
  4    T             T         T       Y
V(M) = 2/4 * I(1/2,1/2) + 2/4 * I(1/2,1/2) = 1
V(A) = 2/4 * I(1,0) + 2/4 * I(0,1) = 0
V(N) = 2/4 * I(1/2,1/2) + 2/4 * I(1/2,1/2) = 1
So Anxious is the best attribute to split on
Once you split on Anxious, the problem is solved

m-fold cross-validation
Split N examples into m equal sized parts
for i=1..m
  train with all parts except the ith
  test with the ith part

Evaluating the Decision Trees
[Figure panels: Russell Domain; "Majority" function (say yes if majority of attributes are yes)]
Lesson: Every bias makes some concepts easier to learn and others harder to learn…

Learning curves…
Given N examples, partition them into Ntr (the training set) and Ntest (the test instances)
Loop for i=1 to |Ntr|
  Loop for Ns in subsets of Ntr of size i
    Train the learner over Ns
    Test the learned pattern over Ntest and compute the accuracy (%correct)

Decision Trees vs. Naïve Bayes for the Russell Restaurant Scenario
Decision trees are better if there is a "succinct" explanation in terms of a few features.
NBC is better if all features wind up playing a role, e.g. spam mails

Problems with Info. Gain Heuristics
• Feature correlation: We are splitting on one feature at a time
• The Costanza party problem
  – No obvious easy solution…
• Overfitting: We may look too hard for patterns where there are none
  – E.g. coin tosses classified by the day of the week, the shirt I was wearing, the time of the day etc.
  – Solution: Don't consider splitting if the information gain given by the best feature is below a minimum threshold
    • Can use the χ² test for statistical significance
  – Will also help when we have noisy samples…
• We may prefer features with very high branching
  – e.g. branch on the "universal time string" for the Russell restaurant example
  – Branch on social security number to look for patterns on who will get an A
  – Solution: "gain ratio" --ratio of the information gain with the attribute A to the information content of answering the question "What is the value of A?"
    • The denominator is smaller for attributes with smaller domains.

Decision Stumps
• Decision stumps are decision trees where the leaf nodes do not necessarily have all +ve or all –ve training examples
  – Could happen either because examples are noisy and misclassified or because you want to stop before reaching pure leaves
• When you reach that node, you return the majority label as the decision.
• (We can associate a confidence with that decision using the P+ and P-)
[Figure: $N_+$, $N_-$ examples split on feature $f_k$ into leaves with counts $(N_{1+}, N_{1-}), (N_{2+}, N_{2-}), \ldots, (N_{k+}, N_{k-})$; at the first leaf, $P_+ = N_{1+} / (N_{1+} + N_{1-})$]
Sometimes, the best decision tree for a problem could be a decision stump (see coin toss example next)

Decision Surface Learning (aka Neural Network Learning)
• Idea: Since classification is really a question of finding a surface to separate the +ve examples from the -ve examples, why not directly search in the space of possible surfaces?
• Mathematically, a surface is a function
  – Need a way of learning functions
  – "Threshold units"

The "Brain" Connection
A Threshold Unit …is sort of like a neuron
Threshold Functions [Figure label: differentiable]

Perceptron Networks
What happened to the "Threshold"?
--Can model it as an extra weight with a static input
[Figure: a unit with inputs I1, I2, weights w1, w2 and threshold t=k is equivalent to a unit with an extra input I0=-1 of weight w0=k and threshold t=0]

Perceptron Learning as Gradient Descent Search in the weight-space
Optimal perceptron has the lowest error on the training data
  $E = \frac{1}{2} \sum_i (T - O)^2$
  $E(W) = \frac{1}{2} \sum_i \left( T - g\left( \sum_j W_j I_j \right) \right)^2$
  For one example: $\frac{\partial E}{\partial W_j} = -I_j \, (T - O) \, g'\left( \sum_j W_j I_j \right)$
  $g(x) = \frac{1}{1 + e^{-x}}$ (sigmoid fn), $g'(x) = g(x)(1 - g(x))$
Often a constant learning rate parameter is used instead
  $W_j \leftarrow W_j + \alpha \, I_j \, (T - O) \, g'\left( \sum_j W_j I_j \right)$

Perceptron Learning
• Perceptron learning algorithm
Loop through training examples
  – If the activation level of the output unit is 1 when it should be 0, reduce the weight on the link to the jth input unit by α*Ij, where Ij is the jth input value and α is a learning rate
    • So, we are assuming g'(.) is a constant.. which it is really not..
  – If the activation level of the output unit is 0 when it should be 1, increase the weight on the link to the jth input unit by α*Ij
  – Otherwise, do nothing
Until "convergence"
A nice applet at: http://neuron.eng.wayne.edu/java/Perceptron/New38.html

Comparing Perceptrons and Decision Trees in Majority Function and Russell Domain
Majority function: the majority function is linearly separable..
Russell Domain: the Russell domain is apparently not....
Encoding: one input unit per attribute. The unit takes as many distinct real values as the size of the attribute domain

Can Perceptrons Learn All Boolean Functions?
--Are all boolean functions linearly separable?

Max-Margin Classification & Support Vector Machines
• Any line that separates the +ve & –ve examples is a solution
• And perceptron learning finds one of them
  – But could we have a preference among these?
  – We may want to get the line that provides the maximum margin (equidistant from the nearest +ve/-ve)
    • The nearest +ve and –ve examples holding up the line are called support vectors
• This changes the optimization objective
  – Quadratic Programming can be used to directly find such a line
[Slide label: Lagrangian Dual]

Two ways to learn non-linear decision surfaces
• First transform the data into a higher dimensional space
  • Find a linear surface
    – Which is guaranteed to exist
  • Transform it back to the original space
  • TRICK is to do this without explicitly doing a transformation
• Learn non-linear surfaces directly (as multi-layer neural nets)
  • Trick is to do training efficiently
    – Back Propagation to the rescue..

"Neural Net" is a collection of threshold units with interconnections
[Figure: threshold unit with inputs I1, I2, weights w1, w2 and threshold t=k; output = 1 if w1*I1 + w2*I2 > k, = 0 otherwise; label: differentiable]
Feed Forward: uni-directional connections
  Single Layer: any linear decision surface can be represented by a single layer neural net
  Multi-Layer: any "continuous" decision surface (function) can be approximated to any degree of accuracy by some 2-layer neural net
Recurrent: bi-directional connections
  Can act as associative memory

Linear Separability in High Dimensions
"Kernels" allow us to consider separating surfaces in high-D without first converting all points to high-D

Kernelized Support Vector Machines
• Turns out that it is not always necessary to first map the data into high-D, and then do linear separation
• The quadratic programming formulation for SVM winds up using only the pair-wise dot product of training vectors
• Dot product is a form of similarity metric between points
• If you replace that dot product by any non-linear function, you will, in essence, be transforming data into some high-dimensional space and then finding the max-margin linear classifier in that space
  – Which will correspond to some wiggly surface in the original dimension
• The trick is to find the RIGHT similarity function
  – Which is a form of prior knowledge
Polynomial Kernel: $K(A, A') = \left( \left( \frac{A}{100} - 1 \right) \left( \frac{A'}{100} - 1 \right) - 0.5 \right)^6$

Those who ignore easily available domain knowledge are doomed to re-learn it… (Santayana's brother)

Domain-knowledge & Learning
• Classification learning is a problem addressed by people from both AI (machine learning) and Statistics
• Statistics folks tend to "distrust" domain-specific bias.
  – Let the data speak for itself…
  – ..but this is often futile. The very act of "describing" the data points introduces bias (in terms of the features you decided to use to describe them..)
• …but much human learning occurs because of strong domain-specific bias..
• Machine learning is torn by these competing influences..
  – In most current state of the art algorithms, domain knowledge is allowed to influence learning only through relatively narrow avenues/formats (e.g. through "kernels")
    • Okay in domains where there is very little (if any) prior knowledge (e.g. what part of proteins are doing what cellular function)
    • ..restrictive in domains where there already exists human expertise..

Multi-layer Neural Nets
How come back-prop doesn't get stuck in local minima?
One answer: It is actually hard for local minima to form in high-D, as the "trough" has to be closed in all dimensions

Multi-Network Learning can learn Russell Domains
[Figure panel: Russell Domain]
…but does it slowly…

Practical Issues in Multi-layer network learning
• For multi-layer networks, we need to learn both the weights and the network topology
  – Topology is fixed for perceptrons
• If we go with too many layers and connections, we can get over-fitting as well as sloooow convergence
  – Optimal brain damage
    • Start with more hidden layers and connections than needed; after a network is learned, remove the nodes and connections that have very low weights; retrain

Humans make 0.2%
Neumans (postmen) make 2%
Other impressive applications:
--no-hands across america
--learning to speak

K-nearest-neighbor
The test example's class is determined by the class of the majority of its k nearest neighbors
Need to define an appropriate distance measure
--sort of easy for real valued vectors
--harder for categorical attributes
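A minimal sketch of the k-nearest-neighbor rule described above. The function names, the toy points, and the choice of distances are assumptions for illustration: Euclidean distance (`math.dist`, Python 3.8+) for real-valued vectors, and a simple count of mismatching attributes (Hamming distance) as one possible, by no means the only, measure for categorical attributes.

```python
from collections import Counter
from math import dist

def hamming(a, b):
    """Number of positions at which two attribute tuples differ."""
    return sum(x != y for x, y in zip(a, b))

def knn_classify(train, query, k=3, distance=dist):
    """Return the majority label among the k training examples nearest to query."""
    nearest = sorted(train, key=lambda ex: distance(ex[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Hypothetical real-valued data: two well separated clusters.
points = [((0, 0), "no"), ((0, 1), "no"), ((1, 0), "no"),
          ((5, 5), "yes"), ((5, 6), "yes"), ((6, 5), "yes")]
print(knn_classify(points, (0.5, 0.5)))   # no
print(knn_classify(points, (5.5, 5.5)))   # yes

# Hypothetical categorical data: (patrons, day) -> decision.
visits = [(("full", "fri"), "wait"), (("full", "sat"), "wait"),
          (("none", "mon"), "leave")]
print(knn_classify(visits, ("full", "mon"), k=3, distance=hamming))   # wait
```

The learner itself is trivial; all of the bias sits in the distance measure and in k, which is why choosing the distance for categorical attributes is the hard part.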