Learning
R&N: ch. 19, ch. 20

Types of Learning
• Supervised Learning – classification, prediction
• Unsupervised Learning – clustering, segmentation, pattern discovery
• Reinforcement Learning – learning MDPs, online learning

Supervised Learning
• Someone gives you a bunch of examples, telling you what each one is
• Eventually, you figure out the mapping from properties (features) of the examples to their type (classification)

Supervised Learning
• Logic-based approaches: learn a function f(X) → {true, false}
• Statistical/probabilistic approaches: learn a probability distribution p(Y|X)

Outline
• Logic-based approaches
  – Decision trees (lecture 24)
  – Version spaces (today)
  – PAC learning (we won’t cover)
• Instance-based approaches
• Statistical approaches

Predicate-Learning Methods
• Version space
• Need to provide H with some “structure”: an explicit representation of the hypothesis space H

Putting Things Together
[Diagram: an object set is turned into an example set X via observable predicates; a learning procedure L, guided by the hypothesis space H, the goal predicate, and a bias, induces a hypothesis h from the training set; h is then evaluated on a test set (yes/no).]

Version Spaces
The “version space” is the set of all hypotheses that are consistent with the training instances processed so far. An algorithm:
• V := H  ;; the version space V is ALL hypotheses H
• For each example e:
  – Eliminate any member of V that disagrees with e
  – If V is empty, FAIL
• Return V as the set of consistent hypotheses

Version Spaces: The Problem
PROBLEM: V is huge!!
• Suppose you have N attributes, each with k possible values
• Suppose you allow a hypothesis to be any disjunction of instances
• There are k^N possible instances, so |H| = 2^(k^N)
• If N=5 and k=2, |H| = 2^32!!

Version Spaces: The Tricks
• First trick: Don’t allow arbitrary disjunctions
  – Organize the feature values into a hierarchy of allowed disjunctions, e.g.:
      any-color
        dark: black, blue
        pale: white, yellow
  – Now there are only 7 “abstract values” instead of 16 disjunctive combinations (e.g., “black or white” isn’t allowed)
• Second trick: Define a partial ordering on H (“general to specific”) and only keep track of the upper bound and lower bound of the version space
• RESULT: An incremental, possibly efficient algorithm!
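The list-then-eliminate algorithm above is easy to state directly in code. Below is a minimal sketch; the toy hypothesis space (integer thresholds) and all names are mine, chosen only to make the snippet self-contained.

```python
# A minimal sketch of the list-then-eliminate version-space algorithm.

def version_space(H, examples, consistent):
    """H: hypotheses; examples: (instance, label) pairs;
    consistent(h, x, y): True iff hypothesis h labels x as y."""
    V = set(H)                                     # V starts as ALL of H
    for x, y in examples:
        V = {h for h in V if consistent(h, x, y)}  # drop disagreeing members
        if not V:
            raise ValueError("FAIL: no consistent hypothesis")
    return V

# Toy usage: hypothesis t means "classify x as positive iff x >= t".
H = range(5)
consistent = lambda t, x, y: (x >= t) == y
print(version_space(H, [(3, True), (1, False)], consistent))  # {2, 3}
```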
Rewarded Card Example
• ANY-RANK(r) ≡ (r=1) ∨ … ∨ (r=10) ∨ (r=J) ∨ (r=Q) ∨ (r=K)
• NUM(r) ≡ (r=1) ∨ … ∨ (r=10)
• FACE(r) ≡ (r=J) ∨ (r=Q) ∨ (r=K)
• ANY-SUIT(s) ≡ (s=♠) ∨ (s=♥) ∨ (s=♦) ∨ (s=♣)
• BLACK(s) ≡ (s=♠) ∨ (s=♣)
• RED(s) ≡ (s=♥) ∨ (s=♦)
A hypothesis is any sentence of the form:
  R(r) ∧ S(s) ⇒ IN-CLASS([r,s])
where:
• R(r) is ANY-RANK(r), NUM(r), FACE(r), or (r=j)
• S(s) is ANY-SUIT(s), BLACK(s), RED(s), or (s=k)

Simplified Representation
For simplicity, we represent a concept by rs, with:
• r ∈ {a, n, f, 1, …, 10, j, q, k}
• s ∈ {a, b, r, ♠, ♥, ♦, ♣}
For example:
• n♠ represents: NUM(r) ∧ (s=♠) ⇒ IN-CLASS([r,s])
• aa represents: ANY-RANK(r) ∧ ANY-SUIT(s) ⇒ IN-CLASS([r,s])

Extension of a Hypothesis
The extension of a hypothesis h is the set of objects that satisfy h
Examples:
• The extension of f♥ is: {j♥, q♥, k♥}
• The extension of aa is the set of all cards

More General/Specific Relation
Let h1 and h2 be two hypotheses in H
h1 is more general than h2 iff the extension of h1 is a proper superset of the extension of h2
Examples:
• aa is more general than f♥
• f♥ is more general than q♥
• fr and nr are not comparable
The inverse of the “more general” relation is the “more specific” relation
The “more general” relation defines a partial ordering on the hypotheses in H

Example: Subset of Partial Order
[Diagram: the hypotheses that cover 4♣, from most general (top) to most specific (bottom): aa; then na, ab; then 4a, nb, a♣; then 4b, n♣; then 4♣.]

Construction of Ordering Relation
[Diagram: the two value hierarchies used to order hypotheses. Ranks: a above n and f, with 1 … 10 below n and j, q, k below f. Suits: a above b and r, with ♠, ♣ below b and ♥, ♦ below r.]

G-Boundary / S-Boundary of V
• A hypothesis in V is most general iff no hypothesis in V is more general
  – G-boundary G of V: the set of most general hypotheses in V
• A hypothesis in V is most specific iff no hypothesis in V is more specific
  – S-boundary S of V: the set of most specific hypotheses in V

Example: G-/S-Boundaries of V
Initially, G = {aa} and S = {1♠, …, 4♣, …, k♦} (all single-card hypotheses). Now suppose that 4♣ is given as a positive example. We replace every hypothesis in S whose extension does not contain 4♣ by its generalization set.
[Diagram: after this update, G = {aa} and S = {4♣}.]

Example: G-/S-Boundaries of V
Here, G and S both have size 1. This is not the case in general!
[Diagram: the partial order between aa and 4♣.]

Example: G-/S-Boundaries of V
The generalization set of a hypothesis h is the set of hypotheses that are immediately more general than h. Let 7♣ be the next (positive) example.
[Diagram: the generalization set of 4♣ is {4b, n♣}.]

Example: G-/S-Boundaries of V
[Diagram: after the positive example 7♣, S is generalized from {4♣} to {n♣}; G is still {aa}.]

Example: G-/S-Boundaries of V
Let 5♥ be the next (negative) example.
[Diagram: the specialization set of aa; to exclude 5♥, G = {aa} is specialized to {ab}.]

Example: G-/S-Boundaries of V
G = {ab} and S = {n♣}; G and S, and all hypotheses in between (nb and a♣), form exactly the version space.

Example: G-/S-Boundaries of V
At this stage, do 8♥, 6♣, j♠ satisfy CONCEPT?
• 8♥: No (it is in the extension of no hypothesis in the version space)
• 6♣: Yes (it is in the extension of every hypothesis in the version space)
• j♠: Maybe (it is in the extension of some, but not all)
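The extension-based ordering above can be made concrete. The sketch below encodes the card hypotheses in the slide’s simplified representation; the container names (RANK_SETS, SUIT_SETS) and helper functions are mine, not from the lecture.

```python
# A minimal sketch of the card hypothesis space: a hypothesis is a pair
# (rank symbol, suit symbol); "more general" is proper superset of extensions.

RANKS = list(range(1, 11)) + ["j", "q", "k"]
SUITS = ["spade", "heart", "diamond", "club"]

RANK_SETS = {"a": set(RANKS), "n": set(range(1, 11)), "f": {"j", "q", "k"}}
SUIT_SETS = {"a": set(SUITS), "b": {"spade", "club"}, "r": {"heart", "diamond"}}

def extension(h):
    """The set of cards covered by hypothesis h = (rank_sym, suit_sym)."""
    r, s = h
    ranks = RANK_SETS.get(r, {r})   # a single rank like 4 covers only itself
    suits = SUIT_SETS.get(s, {s})
    return {(rank, suit) for rank in ranks for suit in suits}

def more_general(h1, h2):
    """h1 is more general than h2 iff ext(h1) is a PROPER superset of ext(h2)."""
    return extension(h1) > extension(h2)

print(more_general(("a", "a"), ("f", "heart")))   # True:  aa is more general
print(more_general(("f", "r"), ("n", "r")))       # False: fr, nr not comparable
```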
Example: G-/S-Boundaries of V
Let 2♠ be the next (positive) example.
[Diagram: the version space {ab, a♣, nb, n♣}; the positive example 2♠ generalizes S from {n♣} to {nb}.]

Example: G-/S-Boundaries of V
Let j♠ be the next (negative) example.
[Diagram: G = {ab} and S = {nb}; the negative example j♠ specializes G to {nb}.]

Example: G-/S-Boundaries of V
+ 4♣, 7♣, 2♠   – 5♥, j♠
The version space has converged to the single hypothesis nb:
  NUM(r) ∧ BLACK(s) ⇒ IN-CLASS([r,s])

Example: G-/S-Boundaries of V
Let us return to the version space (G = {ab}, S = {nb}) … and let 8♠ be the next (negative) example. The only most specific hypothesis disagrees with this example, so no hypothesis in H agrees with all examples.

Example: G-/S-Boundaries of V
Let us return to the version space … and let j♥ be the next (positive) example. The only most general hypothesis disagrees with this example, so no hypothesis in H agrees with all examples.

Version Space Update
1. x ← new example
2. If x is positive then (G,S) ← POSITIVE-UPDATE(G,S,x)
3. Else (G,S) ← NEGATIVE-UPDATE(G,S,x)
4. If G or S is empty then return failure

POSITIVE-UPDATE(G,S,x)
1. Eliminate all hypotheses in G that do not agree with x
2. Minimally generalize all hypotheses in S until they are consistent with x
   (using the generalization sets of the hypotheses)
3. Remove from S every hypothesis that is neither more specific than nor equal to a hypothesis in G
   (this step was not needed in the card example)
4. Remove from S every hypothesis that is more general than another hypothesis in S
5. Return (G,S)

NEGATIVE-UPDATE(G,S,x)
1. Eliminate all hypotheses in S that do not agree with x
2. Minimally specialize all hypotheses in G until they are consistent with x
3. Remove from G every hypothesis that is neither more general than nor equal to a hypothesis in S
4. Remove from G every hypothesis that is more specific than another hypothesis in G
5. Return (G,S)

Example-Selection Strategy (aka Active Learning)
• Suppose that at each step the learning procedure has the possibility to select the object (card) of the next example
• Let it pick the object such that, whether the example is positive or not, it will eliminate one-half of the remaining hypotheses
• Then a single hypothesis will be isolated in O(log |H|) steps

Example
[Diagram: the current version space and three candidate query cards (a 9 and two jacks of different suits); each query would split the version space differently.]
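For a hypothesis space this small, the G- and S-boundaries can be recovered by brute force rather than via generalization/specialization sets. Below is a sketch that reuses extension() and more_general() from the previous snippet and runs the (reconstructed) card examples; it keeps the whole version space V and reads the boundaries off it, which is fine at this scale.

```python
# A brute-force sketch of the candidate-elimination example above.
from itertools import product

# All 112 hypotheses: 16 rank symbols x 7 suit symbols.
H = list(product(list(RANK_SETS) + RANKS, list(SUIT_SETS) + SUITS))

def boundaries(V):
    """G: members of V with no strictly more general member; S: dually."""
    G = [h for h in V if not any(more_general(g, h) for g in V)]
    S = [h for h in V if not any(more_general(h, g) for g in V)]
    return G, S

def update(V, card, positive):
    # Keep exactly the hypotheses whose extension agrees with the label.
    return [h for h in V if (card in extension(h)) == positive]

V = H
for card, positive in [((4, "club"), True), ((7, "club"), True),
                       ((5, "heart"), False), ((2, "spade"), True),
                       (("j", "spade"), False)]:
    V = update(V, card, positive)

print(boundaries(V))   # G = S = [('n', 'b')], i.e., NUM(r) AND BLACK(s)
```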
Example-Selection Strategy (cont.)
But picking the object that eliminates half the version space may be expensive

Noise
• If some examples are misclassified, the version space may collapse
• Possible solution: maintain several G- and S-boundaries, e.g., consistent with all examples, all examples but one, etc.

Version Spaces
• Useful for understanding logical (consistency-based) approaches to learning
• Practical if there is a strong hypothesis bias (concept hierarchies, conjunctive hypotheses)
• Don’t handle noise well

VSL vs. DTL
• Decision tree learning (DTL) is more efficient if all examples are given in advance; else, it may produce successive hypotheses, each poorly related to the previous one
• Version space learning (VSL) is incremental
• DTL can produce simplified hypotheses that do not agree with all examples
• DTL has been more widely used in practice

Statistical Approaches
• Instance-based learning (20.4)
• Statistical learning (20.1)
• Neural networks (20.5)

Nearest Neighbor Methods
To classify a new input vector x, examine the k closest training data points to x and assign the object to the most frequently occurring class
[Diagram: neighborhoods of a query point x for k=1 and k=6.]

Issues
• Distance measure
  – Most common: Euclidean
  – Better distance measures: normalize each variable by its standard deviation
• Choosing k
  – Increasing k reduces variance, increases bias
• In high-dimensional spaces, the nearest neighbor may not be very close at all!
• Memory-based technique: must make a pass through the data for each classification. This can be prohibitive for large data sets. Indexing the data can help, for example with k-d trees.

Bayesian Learning
Example: Candy Bags
• Candy comes in two flavors: cherry and lime
• Candy is wrapped, so you can’t tell the flavor until it is opened
• There are 5 kinds of bags of candy:
  – h1: 100% cherry
  – h2: 75% cherry, 25% lime
  – h3: 50% cherry, 50% lime
  – h4: 25% cherry, 75% lime
  – h5: 100% lime
• Given a new bag of candy, predict H
• Observations: d1, d2, d3, …

Bayesian Learning
Calculate the probability of each hypothesis given the data, and make predictions weighted by these probabilities (i.e., use all the hypotheses, not just the single best one):
  P(h_i | d) = P(d | h_i) P(h_i) / P(d) ∝ P(d | h_i) P(h_i)
Now, if we want to predict some unknown quantity X:
  P(X | d) = Σ_i P(X | h_i) P(h_i | d)

Bayesian Learning cont.
Calculating P(h | d):
  P(h_i | d) ∝ P(d | h_i) P(h_i)   (likelihood × prior)
Assume the observations are i.i.d. (independent and identically distributed):
  P(d | h_i) = Π_j P(d_j | h_i)

Example
• Hypothesis prior over h1, …, h5 is {0.1, 0.2, 0.4, 0.2, 0.1}
• Data: a sequence of opened candies
• Q1: After seeing d1, what is P(h_i | d1)?
• Q2: After seeing d1, what is P(d2 = lime | d1)?
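Q1 and Q2 are direct applications of the formulas above. Here is a small sketch, assuming (as in R&N’s running version of this example) that the first opened candy d1 is lime; the variable names are mine.

```python
# Bayesian update for the candy-bag example.

priors = [0.1, 0.2, 0.4, 0.2, 0.1]     # P(h1) ... P(h5)
p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]   # P(lime | h_i)

def posterior(priors, likelihoods):
    """P(h_i | d) is proportional to P(d | h_i) P(h_i); normalize to sum to 1."""
    joint = [p * l for p, l in zip(priors, likelihoods)]
    z = sum(joint)
    return [j / z for j in joint]

# Q1: P(h_i | d1 = lime)
post = posterior(priors, p_lime)
print(post)            # [0.0, 0.1, 0.4, 0.3, 0.2] (up to float rounding)

# Q2: P(d2 = lime | d1) = sum_i P(lime | h_i) P(h_i | d1)
print(sum(l * p for l, p in zip(p_lime, post)))   # 0.65
```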
Statistical Learning Approaches
• Bayesian
• MAP – maximum a posteriori
• MLE – assume a uniform prior over H
• Naïve Bayes
  – aka Idiot Bayes
  – a particularly simple BN
  – makes overly strong independence assumptions
  – but works surprisingly well in practice…

Bayesian Diagnosis
• Suppose we want to make a diagnosis D and there are n possible mutually exclusive diagnoses d1, …, dn
• Suppose there are m boolean symptoms, E1, …, Em
  P(d_i | e_1, …, e_m) = P(d_i) P(e_1, …, e_m | d_i) / P(e_1, …, e_m)
• How do we make a diagnosis? We need:
  – P(d_i)
  – P(e_1, …, e_m | d_i)

Naïve Bayes Assumption
Assume each piece of evidence (symptom) is independent given the diagnosis. Then:
  P(e_1, …, e_m | d_i) = Π_{k=1}^m P(e_k | d_i)
What is the structure of the corresponding BN?

Naïve Bayes Example
Possible diagnoses: Well, Cold, and Allergy
Possible symptoms: Sneeze, Cough, and Fever

                 Well    Cold    Allergy
  P(d)           0.9     0.05    0.05
  P(sneeze | d)  0.1     0.9     0.9
  P(cough | d)   0.1     0.8     0.7
  P(fever | d)   0.01    0.7     0.4

My symptoms are sneeze & cough: what is the diagnosis?

Learning the Probabilities
aka parameter estimation. We need:
• P(d_i) – prior
• P(e_k | d_i) – conditional probability
Use training data to estimate these.

Maximum Likelihood Estimate (MLE)
Use frequencies in the training set to estimate:
  p(d_i) = n_i / N
  p(e_k | d_i) = n_ik / n_i
where n_x is shorthand for the counts of events in the training set.
What is: P(Allergy)? P(Sneeze | Allergy)? P(Cough | Allergy)?

Example:
  D        Sneeze  Cough  Fever
  Allergy  yes     no     no
  Well     yes     no     no
  Allergy  yes     no     yes
  Allergy  yes     no     no
  Cold     yes     yes    yes
  Allergy  yes     no     no
  Well     no      no     no
  Well     no      no     no
  Allergy  no      no     no
  Allergy  yes     no     no

Laplace Estimate (smoothing)
Use smoothing to eliminate zeros:
  p(d_i) = (n_i + 1) / (N + n)
  p(e_k | d_i) = (n_ik + 1) / (n_i + 2)
where n is the number of possible values for d, and e is assumed to have 2 possible values. There are many other smoothing schemes…
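The MLE and Laplace formulas can be checked against the 10-row training set above. A minimal sketch; the encoding of the table and the helper name estimate are mine.

```python
# MLE and Laplace parameter estimation for the naive Bayes example.

data = [  # (diagnosis, sneeze, cough, fever), from the table above
    ("Allergy", 1, 0, 0), ("Well", 1, 0, 0), ("Allergy", 1, 0, 1),
    ("Allergy", 1, 0, 0), ("Cold", 1, 1, 1), ("Allergy", 1, 0, 0),
    ("Well", 0, 0, 0), ("Well", 0, 0, 0), ("Allergy", 0, 0, 0),
    ("Allergy", 1, 0, 0),
]
N, n_values = len(data), 3                      # 3 possible diagnoses

def estimate(d, symptom_idx, laplace=False):
    """p(e_k | d_i): frequency among rows with diagnosis d, optionally smoothed."""
    rows = [row for row in data if row[0] == d]
    n_i = len(rows)
    n_ik = sum(row[symptom_idx] for row in rows)
    if laplace:
        return (n_ik + 1) / (n_i + 2)           # (n_ik + 1) / (n_i + 2)
    return n_ik / n_i                           # MLE: n_ik / n_i

print(6 / 10)                                   # MLE  P(Allergy) = 0.6
print(estimate("Allergy", 1))                   # MLE  P(Sneeze|Allergy) = 5/6
print(estimate("Allergy", 2))                   # MLE  P(Cough|Allergy) = 0.0
print(estimate("Allergy", 2, laplace=True))     # Laplace: 1/8, no more zeros
print((6 + 1) / (N + n_values))                 # Laplace P(Allergy) = 7/13
```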
Comments
• Generally works well despite the blanket assumption of independence
• Experiments show it is competitive with decision trees on some well-known test sets (UCI)
• Handles noisy data

Learning More Complex Bayesian Networks
Two subproblems:
• Learning structure: combinatorial search over the space of networks
• Learning parameter values: easy if all of the variables are observed in the training set; harder if there are “hidden variables”

Neural Networks

Neural Function
• Brain function (thought) occurs as the result of the firing of neurons
• Neurons connect to each other through synapses, which propagate action potential (electrical impulses) by releasing neurotransmitters
• Synapses can be excitatory (potential-increasing) or inhibitory (potential-decreasing), and have varying activation thresholds
• Learning occurs as a result of the synapses’ plasticity: they exhibit long-term changes in connection strength
• There are about 10^11 neurons and about 10^14 synapses in the human brain

Biology of a Neuron
[Diagram: anatomy of a biological neuron.]

Brain Structure
• Different areas of the brain have different functions
  – Some areas seem to have the same function in all humans (e.g., Broca’s region); the overall layout is generally consistent
  – Some areas are more plastic, and vary in their function; also, the lower-level structure and function vary greatly
• We don’t know how different functions are “assigned” or acquired
  – Partly the result of the physical layout / connection to inputs (sensors) and outputs (effectors)
  – Partly the result of experience (learning)
• We really don’t understand how this neural structure leads to what we perceive as “consciousness” or “thought”
• Our neural networks are not nearly as complex or intricate as the actual brain structure

Comparison of Computing Power
• Computers are way faster than neurons…
• But there are a lot more neurons than we can reasonably model in modern digital computers, and they all fire in parallel
• Neural networks are designed to be massively parallel
• The brain is effectively a billion times faster

Neural Networks
• Neural networks are made up of nodes or units, connected by links
• Each link has an associated weight and activation level
• Each node has an input function (typically summing over weighted inputs), an activation function, and an output

Neural Unit
[Diagram: model of a single neural unit.]

Model Neuron
• A neuron is modeled as a unit j
• Weights on the input from unit i to unit j: w_ji
• Net input to unit j: net_j = Σ_i w_ji o_i
• Threshold: T_j
• Output: o_j is 1 if net_j > T_j, and 0 otherwise

Neural Computation
• McCulloch and Pitts (1943) showed how such linear threshold units (LTUs) can be used to compute logical functions
  – AND? OR? NOT?
• Two layers of LTUs can represent any boolean function

Learning Rules
• Rosenblatt (1959) suggested that if a target output value is provided for a single neuron with fixed inputs, we can incrementally change the weights to learn to produce these outputs using the perceptron learning rule
  – assumes binary-valued inputs/outputs
  – assumes a single linear threshold unit

Perceptron Learning Rule
If the target output for unit j is t_j:
  w_ji ← w_ji + η (t_j − o_j) o_i
This is equivalent to the intuitive rules:
• If the output is correct, don’t change the weights
• If the output is low (o_j = 0, t_j = 1), increment the weights for all inputs which are 1
• If the output is high (o_j = 1, t_j = 0), decrement the weights for all inputs which are 1
We must also adjust the threshold, or equivalently assume there is a weight w_j0 for an extra input unit that has o_0 = 1

Perceptron Learning Algorithm
Repeatedly iterate through the examples, adjusting the weights according to the perceptron learning rule, until all outputs are correct:
• Initialize the weights to all zero (or random values)
• Until the outputs for all training examples are correct:
  – for each training example e:
    · compute the current output o_j
    · compare it to the target t_j and update the weights
• Each execution of the outer loop is an epoch
• For multiple-category problems, learn a separate perceptron for each category and assign to the class whose perceptron most exceeds its threshold
Q: when will the algorithm terminate?

Representation Limitations of a Perceptron
Perceptrons can only represent linear threshold functions, and can therefore only learn functions which linearly separate the data, i.e., where the positive and negative examples are separable by a hyperplane in n-dimensional space

Perceptron Learnability
• Perceptron Convergence Theorem: if there is a set of weights that is consistent with the training data (i.e., the data is linearly separable), the perceptron learning algorithm will converge
• Unfortunately, many functions (like parity) cannot be represented by an LTU (Minsky & Papert, 1969)

Layered Feed-Forward Network
[Diagram: input units at the bottom, hidden units in the middle, output units at the top.]

“Executing” Neural Networks
• Input units are set by some exterior function (think of these as sensors), which causes their output links to be activated at the specified level
• Working forward through the network, the input function of each unit is applied to compute the input value
  – Usually this is just the weighted sum of the activations on the links feeding into this node
• The activation function transforms this input function into a final value
  – Typically this is a nonlinear function, often a sigmoid function corresponding to the “threshold” of that node
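Tying the perceptron slides together, here is a minimal sketch of the perceptron learning algorithm, learning the AND function from the McCulloch-Pitts slide. The learning rate eta is my choice, and the threshold is folded into a bias weight w0 on a constant input o0 = 1, as suggested above.

```python
# Perceptron learning of AND. Since AND is linearly separable, the
# convergence theorem guarantees this loop terminates.

examples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]  # AND
w = [0.0, 0.0, 0.0]   # w0 (bias, i.e., negative threshold), w1, w2
eta = 0.1             # learning rate (assumed; not specified on the slide)

def output(w, x):
    net = w[0] + w[1] * x[0] + w[2] * x[1]   # weighted sum incl. bias
    return 1 if net > 0 else 0               # linear threshold unit

epoch = 0
while True:                                  # one pass per epoch
    epoch += 1
    errors = 0
    for x, t in examples:
        o = output(w, x)
        if o != t:                           # update only on mistakes
            errors += 1
            delta = eta * (t - o)            # perceptron learning rule
            w = [w[0] + delta, w[1] + delta * x[0], w[2] + delta * x[1]]
    if errors == 0:                          # all outputs correct: stop
        break

print(epoch, w)   # converges after a handful of epochs
```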