Announcements
• CS Ice Cream Social
  • 9/5, 3:30-4:30, ECCR 265
  • Includes poster session and student group presentations

Concept Learning
• Examples
  • Word meanings
  • Edible foods
  • Abstract structures (e.g., irony)
• [Figure: example items labeled "glorch" / "not glorch"]

Supervised Approach To Concept Learning
• Both positive and negative examples provided
• Typical models (both in ML and Cog Sci) circa 2000 required both positive and negative examples

Contrast With Human Learning Abilities
• Learning from positive examples only
• Learning from a small number of examples
  • E.g., word meanings
  • E.g., learning appropriate social behavior
  • E.g., instruction on some skill
• What would it mean to learn from a small number of positive examples?
• [Figure: a few positive examples ("+") in a feature space]

Tenenbaum (1999)
• Two-dimensional continuous feature space
• Concepts defined by axis-parallel rectangles
  • e.g., feature dimensions: cholesterol level, insulin level
  • e.g., concept: healthy

Learning Problem
• Given a set of n examples, X = {x1, x2, x3, …, xn}, which are instances of the concept…
• Will some unknown example y also be an instance of the concept?
• Problem of generalization
• [Figure: positive examples ("+") with candidate rectangles labeled 1, 2, 3]

Hypothesis (Model) Space
• H: all rectangles on the plane, parameterized by (l1, l2, s1, s2)
• h: one particular hypothesis
• Note: |H| = ∞
• Consider all hypotheses in parallel
  • In contrast to the non-Bayesian approach of maintaining only the best hypothesis at any point in time

Prediction Via Model Averaging
• Will some unknown input y be in the concept, given examples X = {x1, x2, x3, …, xn}?
• Q: y is a positive example of the concept (T, F)
• P(Q | X) = ∫ P(Q & h | X) dh                      [marginalization]
• P(Q & h | X) = P(Q | h, X) P(h | X)               [chain rule]
• P(Q | h, X) = P(Q | h) = 1 if y is in h, 0 otherwise   [conditional independence; deterministic concepts]
• P(h | X) ∝ P(X | h) P(h)                          [Bayes rule: likelihood × prior]
• (A numerical sketch of this computation appears after the Generalization Gradients slide below.)

Priors and Likelihood Functions
• Priors, p(h)
  • Location invariant
  • Uninformative prior (prior depends only on the area of the rectangle)
  • Expected-size prior
• Likelihood function, p(X | h)
  • X = set of n examples
  • Size principle: each example is sampled uniformly from the concept, so p(X | h) = (1/|h|)^n if all examples fall within h, and 0 otherwise

Generalization Gradients
• MIN: smallest hypothesis consistent with the data
• Weak Bayes: instead of using the size principle, assumes examples are produced by a process independent of the true class
• [Figure: generalization gradients for each scheme; dark line = 50% probability]
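The model-averaging computation above can be made concrete with a small numerical sketch. The code below discretizes the hypothesis space of rectangles on a grid rather than integrating analytically, uses the size-principle likelihood, and assumes an uninformative prior proportional to 1/area; the function names, grid, and prior choice are illustrative, not taken from Tenenbaum (1999).

import numpy as np

def generalization_prob(X, y, grid=np.linspace(0, 24, 25)):
    """Approximate P(y in concept | X) for the axis-parallel rectangle model
    by summing over a discrete grid of hypotheses.
    X: (n, 2) array of positive examples; y: length-2 query point.
    Assumes the size-principle likelihood p(X|h) = (1/area)^n and an
    uninformative prior p(h) proportional to 1/area (illustrative choice)."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    num = den = 0.0
    # Enumerate rectangles [x1, x2] x [v1, v2] with corners on the grid.
    for x1 in grid:
        for x2 in grid[grid > x1]:
            for v1 in grid:
                for v2 in grid[grid > v1]:
                    # Size-principle likelihood is zero unless h contains all examples.
                    contains_X = ((X[:, 0] >= x1) & (X[:, 0] <= x2) &
                                  (X[:, 1] >= v1) & (X[:, 1] <= v2)).all()
                    if not contains_X:
                        continue
                    area = (x2 - x1) * (v2 - v1)
                    weight = (1.0 / area) * (1.0 / area) ** n  # prior * likelihood
                    den += weight
                    if x1 <= y[0] <= x2 and v1 <= y[1] <= v2:
                        num += weight  # P(Q | h) = 1 when y falls inside h
    return num / den

# Example: three tightly clustered positives.
print(generalization_prob(X=[[10, 10], [11, 12], [12, 11]], y=[12, 12]))
print(generalization_prob(X=[[10, 10], [11, 12], [12, 11]], y=[20, 20]))

With three tightly clustered examples, a query point near the cluster receives a much higher generalization probability than a distant point, reflecting the tight generalization gradient the size principle predicts.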
Experimental Design
• Subjects shown n dots on screen that are “randomly chosen examples from some rectangle of healthy levels”
  • n drawn from {2, 3, 4, 6, 10, 50}
• Dots varied in horizontal and vertical range
  • r drawn from {.25, .5, 1, 2, 4, 8} units in a 24-unit window
• Task: draw the ‘true’ rectangle around the dots

Experimental Results
• [Figure: experimental results]

Number Game
• Experimenter picks an integer arithmetic concept C
  • E.g., prime number
  • E.g., number between 10 and 20
  • E.g., multiple of 5
• Experimenter presents positive examples drawn at random from C, say, in the range [1, 100]
• Participant asked whether some new test case belongs in C
• (A small computational sketch of this setup appears after the summary below.)

Empirical Predictive Distributions
• [Figure: human generalization probabilities over 1–100 for the example sets {16}, {60}, {16, 8, 2, 64}, {16, 23, 19, 20}, and {60, 80, 10, 30}]

Hypothesis Space
• Even numbers
• Odd numbers
• Squares
• Multiples of n
• Ends in n
• Powers of n
• All numbers
• Intervals [n, m] for n > 0, m < 101
• Powers of 2, plus 37
• Powers of 2, except for 32

Observation = 16
• Likelihood function: size principle
• Prior: intuition
• [Figure: prior, likelihood, and posterior over the hypotheses above, given data = 16]

Observation = 16 8 2 64
• Likelihood function: size principle
• Prior: intuition
• [Figure: prior, likelihood, and posterior over the hypotheses above, given data = 16 8 2 64]

Posterior Distribution After Observing 16
• [Figure: posterior probability assigned to each hypothesis after observing 16]

Model Vs. Human Data
• [Figure: model predictions vs. human generalization over 1–100 for the example sets {16}, {60}, {16, 8, 2, 64}, {16, 23, 19, 20}, {60, 80, 10, 30}, {60, 52, 57, 55}, {81, 25, 4, 36}, and {81, 98, 86, 93}]

Summary of Tenenbaum (1999)
• Method
  • Pick a prior distribution (includes the hypothesis space)
  • Pick a likelihood function (size principle)
  • Leads to predictions for generalization as a function of r (range) and n (number of examples)
• Claims
  • People generalize optimally given assumptions about priors and likelihood
  • Bayesian approach provides the best description of how people generalize on the rectangle task
  • Explains how people can learn from a small number of examples, and only positive examples
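To illustrate the number-game computation concretely, here is a minimal sketch over a simplified hypothesis space. Only a handful of hypotheses are included, and the prior weights are rough illustrative stand-ins for the intuition-based prior; the likelihood is the size principle and prediction is by model averaging, as in the slides.

import numpy as np

# A simplified hypothesis space over the integers 1..100. Each hypothesis is
# (name, prior weight, extension). Prior weights are illustrative guesses.
def extension(pred):
    return {n for n in range(1, 101) if pred(n)}

hypotheses = [
    ("even",        5.0, extension(lambda n: n % 2 == 0)),
    ("squares",     2.0, extension(lambda n: int(n ** 0.5) ** 2 == n)),
    ("powers of 2", 2.0, {2 ** k for k in range(1, 7)}),
    ("mult of 4",   3.0, extension(lambda n: n % 4 == 0)),
    ("all numbers", 5.0, set(range(1, 101))),
]

def posterior(data):
    """p(h | X) proportional to p(X | h) p(h), with the size-principle
    likelihood p(X | h) = (1/|h|)^n if every example is in h, else 0."""
    scores = []
    for name, prior, ext in hypotheses:
        lik = (1.0 / len(ext)) ** len(data) if all(x in ext for x in data) else 0.0
        scores.append(prior * lik)
    scores = np.array(scores)
    return scores / scores.sum()

def predictive(data, y):
    """P(y in C | X): average the hypotheses' yes/no answers, weighted by the posterior."""
    post = posterior(data)
    return sum(p for p, (_, _, ext) in zip(post, hypotheses) if y in ext)

for y in (8, 20, 87):
    print(f"P({y} in C | X={{16}})           = {predictive([16], y):.2f}")
    print(f"P({y} in C | X={{16, 8, 2, 64}}) = {predictive([16, 8, 2, 64], y):.2f}")

With X = {16} the predictive probability is spread across several consistent hypotheses, but after X = {16, 8, 2, 64} the "powers of 2" hypothesis dominates and generalization sharpens, mirroring the empirical predictive distributions above.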
Important Ideas in Bayesian Models
• Generative models
  • Likelihood function
• Consideration of multiple models in parallel
  • Potentially infinite model space
• Inference
  • Prediction via model averaging
  • Role of priors diminishes with the amount of evidence
• Learning
  • Trade-off between model simplicity and fit to data
  • Bayesian Occam’s razor

Ockham's Razor
• William of Ockham: medieval philosopher and monk
• Razor: tool for cutting (metaphorical)
• If two hypotheses are equally consistent with the data, prefer the simpler one.

Simplicity
• Can accommodate fewer observations
• Smoother
• Fewer parameters
• Restricts predictions more (“sharper” predictions)
• Examples
  • 1st- vs. 4th-order polynomial
  • Small rectangle vs. large rectangle in the Tenenbaum model
• [Figure: predictions of a simple hypothesis H0 vs. a complex hypothesis H1]

Motivating Ockham's Razor
• Aesthetic considerations
  • A theory with mathematical beauty is more likely to be right (or believed) than an ugly one, given that both fit the same data.
• Past empirical success of the principle
• Coherent inference, as embodied by Bayesian reasoning, automatically incorporates Ockham's razor
  • Given two theories H1 and H2, the razor can enter through the PRIORS or through the LIKELIHOODS

Ockham's Razor with Priors
• Jeffreys (1939) probability text: more complex hypotheses should have lower priors
• Requires a numerical rule for assessing complexity
  • e.g., number of free parameters
  • e.g., Vapnik-Chervonenkis (VC) dimension

Subjective vs. Objective Priors
• Subjective or informative prior
  • Specific, definite information about a random variable
• Objective or uninformative prior
  • Vague, general information
• Philosophical arguments for certain priors as uninformative
  • Maximum entropy / least commitment
    • e.g., interval [a, b]: uniform
    • e.g., interval [0, ∞) with mean 1/λ: exponential distribution
    • e.g., mean μ and std deviation σ: Gaussian
  • Independence of measurement scale
    • e.g., Jeffreys prior 1/(θ(1−θ)) for θ in [0, 1] expresses the same belief whether we talk about θ or log θ

Ockham’s Razor Via Likelihoods
• Coin flipping example
  • H1: coin has two heads
  • H2: coin has a head and a tail
• Consider 5 flips producing HHHHH
  • H1 could produce only this sequence
  • H2 could produce HHHHH, but also HHHHT, HHHTH, …, TTTTT
  • P(HHHHH | H1) = 1, P(HHHHH | H2) = 1/32
• H2 pays the price of a lower likelihood because it can accommodate a greater range of observations
• H1 is more readily rejected by observations
• (A numerical sketch of this example, together with the Bayes factor and MDL code lengths, appears at the end of these notes.)

Simple and Complex Hypotheses
• [Figure: likelihood spread over possible data sets for a simple hypothesis H1 vs. a complex hypothesis H2]

Bayes Factor
• Ratio of marginal likelihoods, P(D | H1) / P(D | H2)
• A.k.a. likelihood ratio
• BIC is an approximation to the Bayes factor

Hypothesis Classes Varying In Complexity
• E.g., 1st-, 2nd-, and 3rd-order polynomials
• Hypothesis class is parameterized by w

Rissanen (1976) Minimum Description Length
• Prefer models that can communicate the data in the smallest number of bits.
• The preferred hypothesis H for explaining data D minimizes:
  (1) the length of the description of the hypothesis
  (2) the length of the description of the data with the help of the chosen theory
• L: length

MDL & Bayes
• L: some measure of length (complexity)
• MDL: prefer the hypothesis that minimizes L(H) + L(D | H)
• Bayes rule implies the MDL principle:
  P(H | D) = P(D | H) P(H) / P(D)
  –log P(H | D) = –log P(D | H) – log P(H) + log P(D)
               = L(D | H) + L(H) + const

Relativity Example
• Explain the deviation in Mercury's orbit at perihelion with respect to the prevailing theory
  • E: Einstein's theory
  • F: fudged Newtonian theory
• α = true deviation, a = observed deviation

Relativity Example (Continued)
• Subjective Ockham's razor: the result depends on one's belief about P(α | F)
• Objective Ockham's razor: for the Mercury example, the RHS is 15.04
• Applies to the generic situation
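The coin-flipping example, the Bayes factor, and the MDL view can be tied together in a short numerical sketch. The code below reproduces the likelihoods quoted in the slides, forms the Bayes factor, and restates the likelihoods as code lengths in bits; the variable names are illustrative.

import math

# Two hypotheses about a coin, as in the lecture's example:
#   H1: two-headed coin -> P(heads) = 1
#   H2: coin with a head and a tail -> P(heads) = 1/2
data = "HHHHH"  # 5 observed flips

def likelihood(p_heads, flips):
    """P(flips | hypothesis) for independent flips with the given P(heads)."""
    prob = 1.0
    for f in flips:
        prob *= p_heads if f == "H" else (1.0 - p_heads)
    return prob

p_d_h1 = likelihood(1.0, data)   # = 1: H1 can produce only HHHHH
p_d_h2 = likelihood(0.5, data)   # = 1/32: H2 spreads probability over 2^5 sequences

print(f"P(D | H1) = {p_d_h1}")                                # 1.0
print(f"P(D | H2) = {p_d_h2}")                                # 0.03125
print(f"Bayes factor P(D|H1)/P(D|H2) = {p_d_h1 / p_d_h2}")    # 32.0

# MDL view: code length L(D | H) = -log2 P(D | H) in bits.
# H1 describes this data for free; H2 needs 5 bits (1 bit per flip).
print(f"L(D | H1) = {-math.log2(p_d_h1)} bits")   # 0.0
print(f"L(D | H2) = {-math.log2(p_d_h2)} bits")   # 5.0

The Bayes factor of 32 in favor of H1 is exactly the price H2 pays for spreading its likelihood over all 32 possible sequences, and in MDL terms it corresponds to the 5 extra bits H2 needs to describe the data.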