Formal Learning Theory
Michele Friend (Philosophy) and Valentina Harizanov (Mathematics)
Example
Guessing strings over alphabet {a, b}
a Guess?
a, aa Guess?
a, aa, aaa Guess?
a, aa, aaa, b Guess?
a, aa, aaa, b, bb Guess?
a, aa, aaa, b, bb, bbb Guess?
a, aa, aaa, b, bb, bbb, aba Guess?
…
Infinite process called identification in the limit
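To make the guessing process concrete, here is a minimal Python sketch (the toy learner `guess` and the sample presentation are illustrative assumptions, not part of the slides); only the limit of the sequence of guesses matters.

```python
# Toy sketch of identification in the limit (assumed example, not from the slides):
# the learner is fed one string per step and outputs a guess each time;
# success only requires that the guesses eventually stabilize on a correct one.

def guess(sample):
    """Toy learner: guess 'strings of a's only' until a b appears,
    then switch to 'all nonempty strings over {a, b}'."""
    if all(set(w) <= {"a"} for w in sample):
        return "{a, aa, aaa, ...}"
    return "all nonempty strings over {a, b}"

presentation = ["a", "aa", "aaa", "b", "bb", "bbb", "aba"]
sample = []
for w in presentation:
    sample.append(w)
    print(sample, "->", guess(sample))
```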
Learning paradigm
Language or class of languages to be learned
Learner
Environment in which a language is presented to the learner
Hypotheses that occur to the learner about the language to be learned, on the basis of the environment
Formal language L
Given a finite alphabet, say T = {a, b}
A sentence w is a string of symbols over the alphabet T: λ, a, aa, ab, ba, bba, …
(λ = the empty string)
A language L={w0, w1, w2,…} is a set of correct
sentences, say
L1={a, aa, aaa, aaaa, …}
L2={a, b, bab, aba, babab, …}
w0, w1, w2, … is a text for L (order and repetition do not matter)
w2, w1, w2, w1, w3, … is another text for L
Chomsky grammar G=(T, V, S, P)
T={a,b}, V={S}, P=production rules
L1 = {a, aa, aaa, …}
P1:
1. S → aS
2. S → a
L(G1) = L1
S ⇒ aS ⇒ aaS ⇒ aaaS ⇒ aaaa
Regular grammar
Finite state automaton
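As an illustration, the two-state finite automaton for L1 can be simulated directly; the function below is an assumed sketch, not taken from the slides.

```python
# Sketch of a finite state automaton recognizing L1 = {a, aa, aaa, ...}
# (assumed illustration). States: 'start' (nothing read yet) and
# 'accept' (one or more a's read); any other symbol rejects.

def in_L1(w):
    state = "start"
    for symbol in w:
        if symbol == "a":
            state = "accept"
        else:
            return False
    return state == "accept"

assert in_L1("aaa") and not in_L1("") and not in_L1("ab")
```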
L2 = {a, b, bab, aba, babab, aabaa, …}
P2:
1. S → aSa
2. S → bSb
3. S → a
4. S → b
L(G2) = L2
Context-free grammar
Push-down automaton
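A membership test for L2 can mirror the productions of G2 directly; the recursive sketch below is an assumed illustration rather than a simulation of the push-down automaton.

```python
# Membership test for L2 mirroring the productions of G2 (assumed sketch):
#   S -> aSa | bSb | a | b
def in_L2(w):
    if w in ("a", "b"):                          # rules 3 and 4
        return True
    if len(w) >= 3 and w[0] in ("a", "b") and w[0] == w[-1]:
        return in_L2(w[1:-1])                    # rules 1 and 2
    return False

assert in_L2("aba") and in_L2("babab") and not in_L2("ab")
```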
Coding Chomsky languages
Chomsky languages = computably enumerable languages
Gödel coding by numbers of finite sequences of syntactic objects
Code of L is e: L = Le
Algorithmic enumeration of all (unrestricted) Chomsky languages:
L0, L1, L2, …, Le, …
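One concrete (assumed) choice of such a coding for sentences: interpret strings over {a, b} as numbers in bijective base 2, which gives a computable bijection between sentences and natural numbers; an algorithmic enumeration of grammars then yields the listing L0, L1, L2, … of languages.

```python
# One possible coding of sentences over {a, b} by natural numbers
# (bijective base 2; an assumed illustration of Goedel-style coding).
DIGIT = {"a": 1, "b": 2}

def code(w):
    n = 0
    for symbol in w:
        n = 2 * n + DIGIT[symbol]
    return n

def decode(n):
    w = ""
    while n > 0:
        n, r = divmod(n - 1, 2)
        w = "ab"[r] + w
    return w

assert all(decode(code(w)) == w for w in ["", "a", "b", "aba", "bbb"])
```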
Decidable (computable) language
A language is decidable if there is an algorithm that distinguishes correct from incorrect sentences.
A Chomsky language is decidable exactly when its incorrect sentences also form a Chomsky language.
Not every Chomsky language is decidable.
There is no algorithmic enumeration of all
decidable languages.
Learning from text
An algorithmic learner is a Turing machine that is fed a text for the language L to be learned, sentence by sentence.
At each step the learner guesses the code for
the language being fed:
w0 ; e0
w0, w1; e1
w0, w1, w2; e2
…
Learning is successful if the sequence
e0, e1, e2, e3, …
converges to the “description” of L.
Syntactic convergence
EX-learning
EX=explanatory
For some n, have
e0, e1, …, en, en, en, en, …
and L = Len
The set of all finite languages is
EX-learnable from text.
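A sketch of why the finite languages are EX-learnable (assumed illustration; a frozenset stands in for an actual grammar code e): the learner always conjectures exactly the sample seen so far, so once every sentence of a finite L has appeared, the hypothesis never changes.

```python
# Sketch: an EX-learner for the class of all finite languages (assumed
# illustration). The hypothesis at each step is (a code for) the finite set
# of sentences seen so far; a frozenset stands in for a grammar index e.

def finite_learner(sample):
    return frozenset(sample)

text = ["a", "b", "a", "bab", "b", "a", "bab"]   # a text for {a, b, bab}
seen, hypotheses = [], []
for w in text:
    seen.append(w)
    hypotheses.append(finite_learner(seen))
print(hypotheses[-1])   # stabilizes once all of {a, b, bab} has appeared
```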
Semantic convergence
BC-learning
BC = behaviorally correct
For some n, have
e0, e1, …, en, en+1, en+2, en+3, …
and L = Len = Len+1 = Len+2 = …
There are classes of languages that are
BC-learnable, but not EX-learnable
Learning from an informant
L = {w0, w1, w2, w3, … }
not-L = {u0, u1, u2, u3, …} = the incorrect sentences over the same vocabulary
Learning steps
w0 ; e0
w0, u0; e1
w0, u0, w1; e2
w0, u0, w1, u1; e3
…
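A toy sketch of the extra power of an informant (the candidate deciders and the labeled data below are assumptions for illustration; the two candidates correspond to the L1 and L2 of the earlier slides): the learner keeps any candidate language whose decider agrees with all labels seen so far.

```python
# Toy sketch of learning from an informant (assumed example): the learner
# returns the first candidate language whose decider agrees with every
# labeled sentence seen so far.

candidates = {
    "L1": lambda w: len(w) > 0 and set(w) == {"a"},              # a, aa, aaa, ...
    "L2": lambda w: len(w) % 2 == 1 and w == w[::-1]
                    and set(w) <= {"a", "b"},                    # odd palindromes
}

def informant_learner(labeled):
    for name, decides in candidates.items():
        if all(decides(w) == label for w, label in labeled):
            return name
    return "?"

informant = [("a", True), ("ab", False), ("aa", True), ("aba", False)]
data = []
for item in informant:
    data.append(item)
    print(item, "->", informant_learner(data))
```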
Locking sequence for EX-learner
(Blum-Blum) If a learner can learn a language L, then there is a finite sequence σ of sentences in L, called a locking sequence for L, on which the learner "locks" its correct hypothesis; that is, after that sequence the hypothesis does not change.
Same hypothesis on σ and on (σ, τ), for any finite sequence τ of sentences from L
Angluin criterion
Maximum finite fragment property
Consider a class of Chomsky languages.
The class is EX-learnable from text exactly when every language L in the class has a finite fragment D (D ⊆ L) such that no language U in the class satisfies
D ⊆ U ⊂ L.
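For a small class of explicitly given finite languages, the criterion can be checked by brute force; the sketch below is an assumed illustration, with subsets of L playing the role of the finite fragments D.

```python
# Sketch: brute-force check of the maximum-finite-fragment condition for a
# small class of explicitly given finite languages (assumed toy example).
from itertools import chain, combinations

def subsets(s):
    s = list(s)
    return (frozenset(c)
            for c in chain.from_iterable(combinations(s, r)
                                         for r in range(len(s) + 1)))

def has_telltale(L, family):
    """Is there a finite D with D contained in L such that no U in the
    family satisfies D <= U < L (proper subset of L)?"""
    return any(not any(D <= U < L for U in family) for D in subsets(L))

family = [frozenset({"a"}),
          frozenset({"a", "aa"}),
          frozenset({"a", "aa", "aaa"})]
print(all(has_telltale(L, family) for L in family))   # True: EX-learnable
```

By contrast, if the infinite language L1 = {a, aa, aaa, …} and all of its finite subsets were in the class, no such D would exist for L1: any finite D ⊆ L1 lies inside a larger finite subset of L1 that is itself in the class, which is why that classical class is not EX-learnable from text.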
Problem
How can we formally define and study
certain learning strategies?
Constraints on hypotheses: consistency,
confidence, reliability, etc.
Consistent learning
A learner is consistent on a language L if
at every step, the learner guesses a
language which includes all the data given
to the learner up to that point.
The class of all finite languages can be
identified consistently.
If a language is consistently EX-learnable by an algorithmic learner, then it must be a decidable language.
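A small sketch of the consistency requirement (assumed illustration): model a hypothesis as a membership predicate and check that every hypothesis accepts all of the data fed so far.

```python
# Sketch of the consistency requirement (assumed illustration): a hypothesis
# is modeled as a membership predicate, and the learner is consistent on a
# text if every hypothesis accepts all of the data seen so far.

def is_consistent(learner, text):
    seen = []
    for w in text:
        seen.append(w)
        hypothesis = learner(seen)            # predicate for the guessed language
        if not all(hypothesis(u) for u in seen):
            return False
    return True

# The learner that guesses exactly the data seen so far is consistent:
finite_learner = lambda sample: (lambda w, s=frozenset(sample): w in s)
print(is_consistent(finite_learner, ["a", "b", "bab", "a"]))   # True
```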
Popperian learning of total functions
A (total) computable function f can be tested against the finite sequences of data given to the learner.
A learner is Popperian on f if, on any sequence of positive data for f, the learner always guesses a total computable function.
A learner is Popperian if it is Popperian on every
computable function.
Not every algorithmically EX-learnable class of
functions is Popperian.
Confident learning
Learner is confident when it is guaranteed to
converge to some hypothesis, even if it is given
text for a language that does not belong to the
class to be learned.
It must also be accurate on the languages in the class.
There is a class that is learnable by an algorithmic EX-learner, and is learnable by (another) confident EX-learner, but cannot be learned by an algorithmic and confident EX-learner.
Reliable learning
A learner is reliable if it is not allowed to converge incorrectly (although it might never converge on a text for a language not in the class).
Reliable EX-learnability from text implies that every language in the class must be finite.
Decisive learning
A decisive learner, once it has replaced an earlier hypothesized language with a revised hypothesis for a new language, never returns to the old language again.
Decisive EX-learning from text is not restrictive for general learners, nor for algorithmic learners of computable functions.
Decisiveness reduces the power of algorithmic
learning for languages.
U-shaped learning
(Baliga, Case, Merkle, Stephan, and Wiehagen)
Variant of non-decisive learning
Mimics learning-unlearning-relearning
pattern
Overregularization in Language Acquisition, monograph by Marcus, Pinker, Ullman, Hollander, Rosen, and Xu.
Problem
How can we develop algorithmic learning theory for
languages more complicated than Chomsky languages,
in particular, ones closer to natural language?
(Case and Royer) Correction grammars: L1 – L2, where G1 is a Chomsky (unrestricted, type-0; or context-free, type-2) grammar generating the language L1 and G2 is the grammar generating the editing (corrections) L2 (see the sketch below)
Burgin: “Grammars with prohibition and human-computer
interaction,” 2005.
Ershov’s difference hierarchy in computability theory for
limit computable languages
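A minimal sketch of membership in a correction language L1 – L2, with toy stand-ins for the two generated languages (these particular L1 and L2 are assumptions for illustration, different from the L1 and L2 used earlier in the slides).

```python
# Sketch of membership in a correction language L1 - L2 (assumed toy pair):
# G1 generates L1, G2 generates the corrections L2, and the correction
# language consists of the sentences of L1 that are not in L2.

def generated_by_G1(w):
    """Toy L1: one or more a's followed by any number of b's."""
    i = 0
    while i < len(w) and w[i] == "a":
        i += 1
    return i >= 1 and all(c == "b" for c in w[i:])

def generated_by_G2(w):
    """Toy corrections L2: strings containing a doubled b."""
    return "bb" in w

def in_correction_language(w):
    return generated_by_G1(w) and not generated_by_G2(w)

assert in_correction_language("aab") and not in_correction_language("aabb")
```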
Problem
What is the significance of negative versus
positive information in the learning process?
Learning from switching type of information
(Jain and Stephan):
The learner can request positive or negative information about L; if, after finitely many switches, it requests only information of the same type, it receives all of that information (in the limit).
Harizanov-Stephan’s result
Consider a class of Chomsky languages.
Assume that there is a language L in the family
such that for every finite set of sentences D,
there are languages U and U′ in the family with
U ⊂ L ⊂ U′
and
D ∩ U = D ∩ U′
Then the family cannot even be BC-learned from switching.
U approximates L from below; U′ from above
U and U′ coincide on D
Problem
What are good formal frameworks that unify
deduction and induction?
Martin, Sharma and Stephan: use parametric
logic (5 parameters: vocabulary, structures,
language, data sentences, assumption
sentences)
A model-theoretic approach, based on the
Tarskian “truth-based” notion of logical
consequence.
The difference between deductive and
inductive consequences lies in the process of
deriving a consequence from the premises.
Deduction vs induction
A sentence s is a deductive consequence of a
theory T if s can be inferred from T with absolute
certainty.
A sentence s is an inductive consequence of a
theory T if s can be correctly (only
hypothetically) inferred from T, but can also be
incorrectly inferred from other theories T′ that
have enough in common with T to provisionally
force the inference of s.