“Ideal” learning of language and categories
Nick Chater, Department of Psychology, University of Warwick
Paul Vitányi, Centrum voor Wiskunde en Informatica, Amsterdam

OVERVIEW
I. Learning from experience: The problem
II. Learning to predict
III. Learning to identify
IV. A methodology for assessing learnability
V. Where next?

I. Learning from experience: The problem

Learning: How few assumptions will work?
- Model fitting: assume M(x), optimize x. Easy, but needs prior knowledge.
- No assumptions: learning is impossible ("no free lunch").
- Can a more minimal model of learning still work?

Learning from +/- vs. + data
[Figure: target language/category vs. the learner's guess; + data fall in the overlap, - data outside it; the guess may be under-general or over-general.]
- But how about learning from + data only? Categorization? Language acquisition?

Learning from positive data seems to raise in-principle problems
- In categorization, it rules out:
  - Almost all learning experiments in psychology
  - Exemplar models
  - Prototype models
  - NNs, SVMs…
- In language acquisition:
  - It is assumed that children have access only to positive evidence
  - Sometimes viewed as ruling out learning models entirely
  - The "logical" problem of language acquisition (e.g., Hornstein & Lightfoot, 1981; Pinker, 1979)

Must be solvable: A parallel with science
- Science only has access to positive data
- Yet it seems to be possible
- So overgeneral theories must be eliminated, somehow
- e.g., "Anything goes" seems a bad theory
- Theories must capture regularities, not just fit data

Absence as implicit negative evidence?
- Overgeneral grammars may predict lots of missing sentences
- And their absence is a systematic clue that the theory is probably wrong
- This idea only seems convincing if it can be proved that convergence works well, statistically...

So what do we need to assume?

Modest assumption: Computability constraint
- Assume that the data are generated by random factors and computable factors, i.e., nothing uncomputable
- Chance: …HHTTTHTTHTTHT…
- Computable process: "monkeys typing into a programming language"
- A modest assumption!
[Figure: a grammar (S, NP, V, …) generating sentences: …The cat sat on the mat. The dog…]

Learning by simplicity
- Find the explanation of the "input" that is as simple as possible
- An 'explanation' reconstructs the input
- Simplicity measured in code length
- Long history in perception: Mach, Koffka, Hochberg, Attneave, Leeuwenberg, van der Helm
- Mimicry theorem with Bayesian analysis: e.g., Li & Vitányi (2000); Chater (1996); Chater & Vitányi (ms.)
- Relation to Bayesian inference
- Widely used in statistics and machine learning

Consider "ideal" learning
- Given the data, what is the shortest code?
- How well does the shortest code work? Prediction; identification
- Ignore the question of search
  - Makes general results feasible
  - But search won't go away…!
  - Fundamental question: when is learning data-limited, and when is it search-limited?

Three kinds of induction
- Prediction: converge on correct predictions
- Identification: identify the generating category/distribution in the limit
- Learning causal mechanisms?? Inferring counterfactuals, i.e., the effects of intervention (cf. Pearl: from probability to causes)

II. Learning to predict

Prediction by simplicity
- Find the shortest 'program/explanation' for the current data
- Predict using that program
- Strictly, use a 'weighted sum' of explanations, weighted by brevity…
- Equivalent to Bayes with (roughly) a 2^(-K(x)) prior, where K(x) is the length of the shortest program generating x
- Summed error has a finite bound (Solomonoff, 1978): Σ_{j=1..∞} s_j ≤ (K(μ)/2)·log_e 2, where s_j is the expected (squared) prediction error on the j-th item and μ is the computable generating distribution
- So prediction converges [faster than 1/(n·log n), for corpus size n]
- Inductive inference is possible!
- No independence or stationarity assumptions; just computability of the generating mechanism (a toy numerical sketch follows)
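To make this concrete, here is a minimal toy sketch rather than the ideal learner itself (K is uncomputable): Bayesian sequence prediction over a small, hand-picked hypothesis class, with prior weight 2^(-description length) standing in for the 2^(-K(x)) ideal. The hypotheses, their description lengths, and the data are invented purely for illustration.

```python
# Toy illustration of "prediction by simplicity": Bayesian sequence prediction
# with a prior proportional to 2^(-description length). A stand-in for the
# ideal (uncomputable) predictor; hypotheses, lengths, and data are invented.

# Each "hypothesis" maps a position i to P(bit_i = 1), paired with a rough
# description length in bits (shorter = simpler = higher prior weight).
HYPOTHESES = {
    "all-zeros":   (lambda i: 0.01, 5),
    "all-ones":    (lambda i: 0.99, 5),
    "alternating": (lambda i: 0.99 if i % 2 else 0.01, 8),
    "fair-coin":   (lambda i: 0.5, 10),
}

def predict(seq):
    """Posterior-weighted probability that the next bit is 1."""
    weights = {}
    for name, (p1, length) in HYPOTHESES.items():
        w = 2.0 ** (-length)                  # simplicity prior ~ 2^(-length)
        for i, bit in enumerate(seq):         # likelihood of the data so far
            p = p1(i)
            w *= p if bit == 1 else (1 - p)
        weights[name] = w
    total = sum(weights.values())
    return sum(w * HYPOTHESES[n][0](len(seq)) for n, w in weights.items()) / total

# Data actually generated by the "alternating" process: prediction improves
# rapidly as the posterior concentrates on the simple hypothesis that fits.
data = [i % 2 for i in range(40)]
for n in (1, 5, 10, 20, 40):
    print(n, round(predict(data[:n]), 3))
```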
Applications
- Language: A. Grammaticality judgements; B. Language production; C. Form-meaning mappings
- Categorization
- Learning from positive examples

A: Grammaticality judgments
- We want a grammar that doesn't over- or undergeneralize (much) w.r.t. the 'true' grammar, on sentences that are statistically likely to occur
- NB. No guarantees for…
  - Colorless green ideas sleep furiously (Chomsky)
  - Bulldogs bulldogs bulldogs fight fight fight (Fodor)

Converging on a grammar
- Fixing undergeneralization is easy (such grammars get 'falsified')
- Overgeneralization is the hard problem
- Need to use absence as evidence
- But the language is infinite; any corpus is finite
- So almost all grammatical sentences are also absent
- The logical problem of language acquisition; Baker's paradox
- The claimed impossibility of 'mere' learning from positive evidence

Overgeneralization theorem
- Suppose the learner has probability ε_j of erroneously guessing an ungrammatical j-th word; then Σ_{j=1..∞} ε_j ≤ K(μ)·log_e 2, for a computable generating distribution μ
- Intuitive explanation: overgeneralization means assigning smaller-than-needed probabilities to grammatical sentences, and hence excessive code lengths

B: Language production
- Simplicity allows 'mimicry' of any computable statistical method of generating a corpus
- For an arbitrary computable probability μ and the simplicity-based probability λ: λ(y | x)/μ(y | x) → 1 as the conditioning corpus x grows (Li & Vitányi, 1997)

C: Learning form-meaning mappings
- So far we have ignored semantics
- Suppose the language input consists of form-meaning pairs (cf. Pinker)
- Assume only that the form → meaning and meaning → form mappings are computable (they don't have to be deterministic)…

A theorem
It follows that:
- total errors in mapping forms to (sets of) meanings (with probabilities), and
- total errors in mapping meanings to (sets of) forms (with probabilities)
…have a finite bound (and hence average errors per sentence tend to 0)

Categorization
- Sample n items from category C (assume all items are equally likely)
- Guess, by choosing the D that provides the shortest code for the data
- General proof method:
  1. Overgeneralization: D must be the basis for a shorter code than C (or you wouldn't prefer it)
  2. Undergeneralization: typical data from category C will have no code shorter than n·log|C|

1. Fighting overgeneralization
- D can't be much bigger than C, or it'll have a longer code length: K(D) + n·log|D| ≤ K(C) + n·log|C|
- As n grows, the constraint is that |D|/|C| ≤ 1 + O(1/n)

2. Fighting undergeneralization
- But the guess must cover most of the correct category, or it'd provide a "suspiciously" short code for the data
- Typicality: K(D|C) + n·log|C∩D| ≥ n·log|C|
- As n grows, the constraint is that |C∩D|/|C| ≥ 1 − O(1/n)
[Figure: Venn diagrams of C and D for the overgeneral and undergeneral cases.]

Implication
- |D| converges to near |C|
- Accuracy bounded by O(1/n), with n samples
- i.i.d. assumptions
- The actual rate depends on the structure of the category, which is crucial
- Language: need lots of examples (but how many?)
- Some categories may need only a few (one?) examples (Tenenbaum, Feldman)

III. Learning to identify

Hypothesis identification
- Induction of the 'true' hypothesis, or category, or language
- In philosophy of science, typically viewed as a hard problem…
- Needs stronger assumptions than prediction

Identification in the limit: The problem
- Assume endless data
- Goal: specify an algorithm that, at each point, picks a hypothesis
- And eventually locks in on the correct hypothesis
- Though it can never announce it, as there may always be an additional low-frequency item that's yet to be encountered
- Gold, Osherson et al. have studied this extensively
- Sometimes viewed as showing that identification is not possible (but really a mix of positive and negative results)
- But i.i.d. sampling and computability allow a general positive result, via the enumerate-and-dovetail strategy sketched below and spelled out on the next slides
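A minimal toy sketch of that strategy: a small hand-written list of candidate "programs" (with made-up description lengths) stands in for a true enumeration of all programs, and each candidate receives a share of computation steps proportional to 2^(-length). One candidate deliberately runs for ever, which is why dovetailing, rather than running each program to completion, is needed.

```python
# Toy sketch of identification by enumeration plus dovetailing.
# Candidates, description lengths, and data are invented for illustration.

def constant(value):
    """A 'program' that prints the same symbol for ever."""
    def run():
        while True:
            yield value
    return run

def looper():
    """A 'program' that computes for ever without producing usable output."""
    while True:
        yield None

# (name, description length in bits, program); shorter programs get more steps
CANDIDATES = [
    ("loop",     4, looper),
    ("always-0", 5, constant(0)),
    ("always-1", 6, constant(1)),
]

def identify(data, rounds=500):
    """Dovetail the candidates; 'pocket' the shortest one that has generated the data."""
    longest = max(length for _, length, _ in CANDIDATES)
    runs = {name: (program(), []) for name, _, program in CANDIDATES}
    pocketed = None
    for _ in range(rounds):
        for name, length, _ in CANDIDATES:
            generator, output = runs[name]
            for _ in range(2 ** (longest - length)):   # this candidate's share of steps
                symbol = next(generator)
                if symbol is not None:
                    output.append(symbol)
            # flip to the shortest program so far that has generated the data
            if output[:len(data)] == data and (pocketed is None or length < pocketed[1]):
                pocketed = (name, length)
    return pocketed

print(identify([1, 1, 1, 1]))   # -> ('always-1', 6); 'loop' is shorter but never produces the data
```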
Algorithm
- The candidate programs have two parts: a program that specifies the distribution Pr, and a sample from Pr, coded at an average of H(Pr) bits per data point
- Pick a specific set of data (which needs to be 'long enough'); we won't necessarily know what is long enough, which is an extra assumption
- Specify an enumeration of the programs for Pr, e.g., in order of length
- Run them, dovetailing
- Initialize with any Pr
- Flip to the Pr that corresponds to the shortest program so far that has generated the data

Dovetailing
[Diagram: dovetailing the runs of prog1 through prog4, interleaving their steps (1, 2, 4, 7, 3, 5, 8, 6, 9, 10, …); the process runs for ever.]
- Run the programs in order, dovetailing, where each program gets a share of the steps proportional to 2^(−length)
- This process runs for ever (looping programs)
- The shortest program so far is "pocketed"…
- This will always finish on the "true" program

Overwhelmingly likely to work...
- (As n → ∞, Prob(correct identification) → 1)
- For a large enough stream of n typical data points, no alternative model does better
- Coding data generated by Pr with a code based on Pr′ rather than Pr wastes about n·D(Pr‖Pr′) bits in expectation (the Kullback-Leibler divergence)
- D(Pr‖Pr′) > 0, so this swamps the initial code length, for large enough n
[Figure: initial code lengths K(Pr) and K(Pr′); by around n = 8 data points, Pr wins.]

IV. A methodology for assessing learnability

Assessing learnability in cognition?
- Constraint c is learnable if a code which (1) "invests" l(c) bits to encode c can (2) recoup its investment, i.e., save more than l(c) bits in encoding the data; then c is acquired
- Nativism? Not enough data to recoup the investment (e.g., little/no relevant data)
- Viability of empiricism? An ample supply of data to recoup l(c)
- Cf. Tenenbaum, Feldman…

Language acquisition: Poverty of the stimulus, quantified
- Consider a linguistic constraint (e.g., noun-verb agreement, subjacency, phonological constraints)
- Cost assessed by the length of its formulation (the length of the linguistic rules)
- Saving: the reduction in the cost of coding data (perceptual, linguistic)

Easy example: learning singular-plural
- With the agreement constraint, "John loves tennis" costs x bits and "They love_ tennis" costs y bits
- Without it, the grammar also admits *John love_ tennis and *They loves tennis, so the verb form must be coded separately: the same sentences cost x+1 and y+1 bits
- If the constraint applies to a proportion p of n sentences, the constraint saves p·n bits

Visual structure: ample data?
- Depth from stereo
  - Invest: an algorithm for correspondence
  - Recoup: almost a whole image (that's a lot!)
  - Perhaps stereo could be inferred from a single stereo image?
- Object/texture models (Yuille)
  - Investment in building the model
  - But recouped in compression, over the "raw" image description
  - Presumably few images needed?

A harder linguistic case: Baker's paradox (with Luca Onnis and Matthew Roberts)
- Quasi-regular structures are ubiquitous in language, e.g., alternations:
  - It is likely that John will come / It is possible that John will come
  - John is likely to come / *John is possible to come (Baker, 1979; see also Culicover)
  - Strong winds / High winds; Strong currents / *High currents
  - I love going to Italy! / I enjoy going to Italy!; I love to go to Italy! / *I enjoy to go to Italy!

Baker's paradox (Baker, 1979)
- Selectional restrictions: "holes" in the space of possible sentences allowed by a given grammar…
- How does the learner avoid falling into the holes?
- i.e., how does the learner distinguish genuine 'holes' from the infinite number of unheard grammatical constructions? (The next slides answer this with the invest/recoup arithmetic; a numerical warm-up follows.)
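As a warm-up for that answer, here is the invest/recoup arithmetic from the singular-plural example above as a minimal sketch. The constraint length l_c, the proportion p of sentences the constraint applies to, and the corpus sizes are invented numbers used only for illustration.

```python
# Back-of-envelope version of the invest/recoup arithmetic: a constraint is
# acquired once the bits it saves on the corpus exceed the bits "invested"
# in stating it. l_c, p, and the corpus sizes are hypothetical.

def bits_saved(n_sentences, p):
    """A constraint that fixes one binary choice in a proportion p of the
    sentences saves roughly 1 bit per affected sentence: p * n bits."""
    return p * n_sentences

def learnable(l_c, n_sentences, p):
    """True once the saving recoups the l_c bits invested in the constraint."""
    return bits_saved(n_sentences, p) > l_c

l_c = 200   # hypothetical cost, in bits, of stating the agreement rule
p = 0.3     # hypothetical proportion of sentences the rule applies to

for n in (100, 500, 1000, 10000):
    print(f"n = {n:6d}: saving {bits_saved(n, p):7.0f} bits, learnable: {learnable(l_c, n, p)}")
# Break-even corpus size is n > l_c / p, here about 667 sentences.
```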
Our abstract theory tells us something
- The theorem on grammaticality judgments shows that the paradox is solvable, in the asymptote, and with no computational restrictions
- But can this be scaled down…
  - to learning specific 'alternation' patterns
  - with the corpus the child hears?

Argument by information investment
- To encode an exception which appears to have probability x requires log_2(1/x) bits
- But eliminating this probability mass makes all other sentences 1/(1−x) times more likely, saving n·log_2(1/(1−x)) bits over a corpus of n sentences
- Does the saving outweigh the investment?

An example: Recovery from overgeneralisations
- The rabbit hid / You hid the rabbit!
- The rabbit disappeared / *You disappeared the rabbit!
- The return on the 'investment' over 5M words from the CHILDES database is easily sufficient
- But this methodology can be applied much more widely (and aimed at fitting the time-course of U-shaped generalization, and at when overgeneralizations do or do not arise)

V. Where next?

Can we learn causal structure from observation?
- What happens if we move the left-hand stick?
- The output of perception provides a description in terms of causality:
  - Liftability, breakability, edibility
  - What is attached to what; what is resting on what
- Without this, perception is fairly useless as an input for action

Inferring causality from observation: The hard problem of induction
[Figure: the generative process behind the sensory input.]

Formal question
- Suppose a modular computer program generates a stream of data of indefinite length…
- Under what conditions can modularity be recovered?
- How might "interventions"/experiments help?
- (Key technical idea: the Kolmogorov sufficient statistic; see the note at the end)

Fairly uncharted territory
- If the data are generated by independent processes
- Then one model of the data will involve a recapitulation of those processes
- But will there be other, alternative modular programs? Which might be shorter? Hopefully not!
- Completely open field…
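Note on the Kolmogorov sufficient statistic mentioned above, stated here only as background, in the standard sense used in the algorithmic information theory literature (e.g., Li & Vitányi): a finite set S containing the data x is a sufficient statistic for x when the two-part code via S is about as short as the best one-part code,

    K(S) + log_2|S| ≈ K(x),   with x a typical (incompressible) element of S.

Here K(S) captures the 'structural' part of the description and log_2|S| the residual 'random' part; the open question above is whether, for data produced by independent modular processes, the structural part of the shortest such description recapitulates those processes, or whether shorter non-modular alternatives exist.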