Formal Learning Theory
Michele Friend (Philosophy) and Valentina Harizanov (Mathematics)

Example: guessing strings over the alphabet {a, b}
a                                   Guess?
a, aa                               Guess?
a, aa, aaa                          Guess?
a, aa, aaa, b                       Guess?
a, aa, aaa, b, bb                   Guess?
a, aa, aaa, b, bb, bbb              Guess?
a, aa, aaa, b, bb, bbb, aba         Guess?
…
This infinite process is called identification in the limit.

Learning paradigm
- the language, or class of languages, to be learned
- the learner
- the environment in which a language is presented to the learner
- the hypotheses that occur to the learner about the language to be learned, on the basis of the environment

Formal language L
Given a finite alphabet, say T = {a, b}, a sentence w is a string of symbols over T: ε, a, aa, ab, ba, bba, … (ε = the empty string).
A language L = {w0, w1, w2, …} is a set of correct sentences, say
L1 = {a, aa, aaa, aaaa, …}
L2 = {a, b, bab, aba, babab, …}
The sequence w0, w1, w2, … is a text for L (order and repetition do not matter); w2, w1, w2, w1, w3, … is another text for L.

Chomsky grammar
G = (T, V, S, P), with terminals T = {a, b}, variables V = {S}, start symbol S, and production rules P.

L1 = {a, aa, aaa, …}
P1:
1. S → aS
2. S → a
L(G1) = L1, for example S ⇒ aS ⇒ aaS ⇒ aaaS ⇒ aaaa.
Regular grammar; finite state automaton.

L2 = {a, b, bab, aba, babab, aabaa, …}
P2:
1. S → aSa
2. S → bSb
3. S → a
4. S → b
L(G2) = L2.
Context-free grammar; push-down automaton.

Coding Chomsky languages
Chomsky languages = computably enumerable languages.
Gödel coding assigns numbers to finite sequences of syntactic objects.
The code of L is e: L = Le.
There is an algorithmic enumeration of all (unrestricted) Chomsky languages: L0, L1, L2, …, Le, …

Decidable (computable) language
A language is decidable if there is an algorithm that distinguishes correct from incorrect sentences.
A Chomsky language is decidable exactly when its incorrect sentences also form a Chomsky language.
Not every Chomsky language is decidable.
There is no algorithmic enumeration of all decidable languages.

Learning from text
An algorithmic learner is a Turing machine that is fed a text for the language L to be learned, sentence by sentence. At each step the learner guesses a code for the language being fed:
w0;  e0
w0, w1;  e1
w0, w1, w2;  e2
…
Learning is successful if the sequence e0, e1, e2, e3, … converges to the "description" of L.

Syntactic convergence: EX-learning (EX = explanatory)
For some n we have e0, e1, …, en, en, en, en, … and L = L_{e_n}.
The set of all finite languages is EX-learnable from text (a sketch of such a learner appears at the end of this part).

Semantic convergence: BC-learning (BC = behaviorally correct)
For some n we have e0, e1, …, en, en+1, en+2, en+3, … and L = L_{e_n} = L_{e_{n+1}} = L_{e_{n+2}} = ⋯
There are classes of languages that are BC-learnable but not EX-learnable.

Learning from an informant
L = {w0, w1, w2, w3, …}; its complement (not L) = {u0, u1, u2, u3, …}, the incorrect sentences in the proper vocabulary.
Learning steps:
w0;  e0
w0, u0;  e1
w0, u0, w1;  e2
w0, u0, w1, u1;  e3
…

Locking sequence for an EX-learner (Blum and Blum)
If a learner can learn a language L, then there is a finite sequence σ of sentences in L, called a locking sequence for L, on which the learner "locks" its correct hypothesis; that is, after that sequence the hypothesis does not change: the learner outputs the same hypothesis on σ and on σ followed by τ, for any finite sequence τ of sentences from L.

Angluin's criterion (the maximum finite fragment property)
Consider a class of Chomsky languages. The class is EX-learnable from text exactly when, for every language L in the class, there is a finite fragment D of L (D ⊂ L) such that no fragment U of L with D ⊆ U ⊂ L can belong to the class.

Problem
How can we formally define and study particular learning strategies, that is, constraints on hypotheses such as consistency, confidence, reliability, etc.?
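As a concrete illustration of EX-learning from text, here is a minimal Python sketch (my own illustration, not part of the original notes) of the claim above that the class of all finite languages is EX-learnable: the learner conjectures exactly the set of sentences it has seen so far, and on any text for a finite language its guesses eventually stabilize. The function names finite_language_learner and run_on_text are illustrative choices, and hypotheses are represented directly as sets of sentences rather than as Gödel codes.

```python
# Minimal sketch: EX-learning the class of all finite languages from text.
# Hypotheses are frozensets of sentences instead of codes (a simplification).

def finite_language_learner(data_so_far):
    """Conjecture the finite language consisting of exactly the data seen."""
    return frozenset(data_so_far)

def run_on_text(learner, text):
    """Feed the text sentence by sentence and record the hypothesis stream."""
    seen, hypotheses = [], []
    for sentence in text:
        seen.append(sentence)
        hypotheses.append(learner(seen))
    return hypotheses

# A text for the finite language L = {a, aa, aaa}: every sentence of L occurs;
# order and repetition do not matter.
text = ["a", "aa", "a", "aaa", "aa", "a", "aaa", "a"]
for hypothesis in run_on_text(finite_language_learner, text):
    print(sorted(hypothesis))
# Once all of L has appeared in the text, the hypothesis never changes again:
# the learner has EX-identified L (syntactic convergence).
```

Note that this learner is also consistent in the sense defined next: its guess at every step contains all the data it has seen.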
Consistent learning
A learner is consistent on a language L if at every step it guesses a language that includes all the data given to it up to that point.
The class of all finite languages can be identified consistently.
If a language is consistently EX-learnable by an algorithmic learner, then it must be a decidable language.

Popperian learning of total functions
A (total) computable function f can be tested against finite sequences of data given to the learner.
A learner is Popperian on f if on any sequence of positive data for f it guesses a computable function.
A learner is Popperian if it is Popperian on every computable function.
Not every algorithmically EX-learnable class of functions is Popperian-learnable.

Confident learning
A learner is confident if it is guaranteed to converge to some hypothesis, even when it is given a text for a language that does not belong to the class to be learned.
It must also be accurate on the languages in the class.
There is a class that is learnable by an algorithmic EX-learner and is learnable by (another) confident EX-learner, but cannot be learned by a learner that is both algorithmic and confident.

Reliable learning
A learner is reliable if it is not allowed to converge incorrectly (although it might never converge on a text for a language not in the class).
Reliable EX-learnability from text implies that every language in the class must be finite.

Decisive learning
A learner is decisive if, once it has put out a revised hypothesis for a new language, replacing an earlier hypothesized language, it never returns to the old language again (a small code sketch of this constraint appears at the end of this part).
Decisive EX-learning from text is not restrictive for general learners, nor for algorithmic learners of computable functions.
Decisiveness does, however, reduce the power of algorithmic learning of languages.

U-shaped learning (Baliga, Case, Merkle, Stephan, and Wiehagen)
A variant of non-decisive learning that mimics the learning-unlearning-relearning pattern documented in Overregularization in Language Acquisition, a monograph by Marcus, Pinker, Ullman, Hollander, Rosen, and Xu.

Problem
How can we develop algorithmic learning theory for languages more complicated than Chomsky languages, in particular ones closer to natural language?
(Case and Royer) Correction grammars: L1 − L2, where G1 is a Chomsky (unrestricted, type-0; or context-free, type-2) grammar generating the language L1 and G2 is the grammar generating the editing (corrections) L2.
Burgin, "Grammars with prohibition and human-computer interaction," 2005.
Ershov's difference hierarchy in computability theory for limit computable languages.

Problem
What is the significance of negative versus positive information in the learning process?
Learning from switching the type of information (Jain and Stephan): the learner can request positive or negative information about L; if, after finitely many switches, it keeps requesting information of the same type, it receives all of that information (in the limit).

Harizanov-Stephan result
Consider a class of Chomsky languages. Assume that there is a language L in the family such that for every finite set of sentences D there are languages U and U′ in the family with
U ⊂ L ⊂ U′ and D ∩ U = D ∩ U′.
Then the family cannot even be BC-learned from switching.
Here U approximates L from below and U′ from above, while U and U′ coincide on D.

Problem
What are good formal frameworks that unify deduction and induction?
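To make the decisiveness constraint and the U-shaped pattern above concrete, here is a small Python sketch (my own illustration, with hypothetical helper names; hypotheses are again represented as sets of sentences rather than codes). It checks whether a finite stream of hypothesized languages is decisive, i.e., never returns to a previously abandoned language, and exhibits a U-shaped stream echoing the overregularization example from language acquisition.

```python
# Sketch: detecting a U-shaped (non-decisive) hypothesis stream.

def is_decisive(hypotheses):
    """Return True if no abandoned language is ever hypothesized again."""
    abandoned = []
    for previous, current in zip(hypotheses, hypotheses[1:]):
        if current != previous:          # the learner revised its hypothesis,
            abandoned.append(previous)   # so `previous` is now abandoned
        if current in abandoned:         # returning to an abandoned language
            return False
    return True

# Toy hypothesis stream echoing overregularization: the correct form, then an
# overregularized guess, then the correct form again (learn-unlearn-relearn).
went = frozenset({"went"})
goed = frozenset({"goed"})
u_shaped_stream = [went, went, goed, goed, went, went]

print(is_decisive(u_shaped_stream))      # False: the old hypothesis returns
print(is_decisive([goed, went, went]))   # True: one revision, never revisited
```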
Martin, Sharma, and Stephan use parametric logic (5 parameters: vocabulary, structures, language, data sentences, assumption sentences).
This is a model-theoretic approach, based on the Tarskian "truth-based" notion of logical consequence.
The difference between deductive and inductive consequences lies in the process of deriving a consequence from the premises.

Deduction vs induction
A sentence s is a deductive consequence of a theory T if s can be inferred from T with absolute certainty.
A sentence s is an inductive consequence of a theory T if s can be correctly (though only hypothetically) inferred from T, but can also be incorrectly inferred from other theories T′ that have enough in common with T to provisionally force the inference of s.
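The contrast can be illustrated with a toy propositional example (my own sketch; it is not the parametric-logic framework of Martin, Sharma, and Stephan, and the names models and deductive_consequence are illustrative). Deductive consequence is checked with certainty by verifying truth in every model of the theory; an inductive guess is only provisionally supported by the finitely many observations made so far and may later be withdrawn.

```python
# Toy contrast between deductive and inductive consequence (illustration only).
from itertools import product

ATOMS = ["p", "q", "r"]

def models(theory):
    """Yield all truth assignments over ATOMS satisfying every formula in `theory`.
    Formulas are Python functions from an assignment dict to bool."""
    for values in product([True, False], repeat=len(ATOMS)):
        assignment = dict(zip(ATOMS, values))
        if all(formula(assignment) for formula in theory):
            yield assignment

def deductive_consequence(theory, sentence):
    """True iff `sentence` holds in every model of `theory` (absolute certainty)."""
    return all(sentence(m) for m in models(theory))

# Theory T: p, and p -> q.
T = [lambda m: m["p"], lambda m: (not m["p"]) or m["q"]]

print(deductive_consequence(T, lambda m: m["q"]))   # True: q follows with certainty
print(deductive_consequence(T, lambda m: m["r"]))   # False: r does not follow

# An "inductive" guess: after observing only models of T in which r happens to
# be true, a learner may provisionally conjecture r, even though r is not a
# deductive consequence and a later observation can refute the conjecture.
observed = [m for m in models(T) if m["r"]][:2]
print(all(m["r"] for m in observed))                # True on the data seen so far
```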