Class 2: Computational Models - CUNY

Introduction to Language Acquisition Theory
Janet Dean Fodor
St. Petersburg, July 2013
Class 2. From computer science (then) to psycholinguistics (now)
Syntax acquisition as parameter setting
 Like playing '20 questions'. The learner's task is to detect
the correct settings of the finite number of parameters.
 Headedness parameter: Are syntactic phrases head-initial
(e.g., in VP, the verb precedes its object) or head-final (the
verb follows the object)?
 Wh-movement parameter: Does a Wh-phrase move to the
top of a clause or does it remain in situ?
 Parameter values are 'triggered' by the learner's encountering
a distinctive, revealing property of an input sentence.
 This Principles-and-Parameters approach has been
retained through many subsequent changes in TG theory.
 It greatly reduces a learner's workload of data-processing.
 It helps address the Poverty of Stimulus problem.
2
Parameter setting as flipping switches
 Chomsky never provided a specific implementation of
parametric triggering. He often employed a metaphor of
setting switches. (Chomsky 1981/1986)
 The metaphor suggests that parameter setting is:
› Automatic, instantaneous, effortless: no linguistic reasoning
is required of the learner. (Unlike hypothesis-formation models.)
› Input-guided (no trial-and-error process).
› A universal mechanism, but leading reliably to language-specific parameter settings.
› Non-interacting parameters: Each can be set separately.
› Each has unambiguous triggers recognizable regardless of
what else the learner does or doesn’t know about the language.
› Deterministic learning: fully accurate, so no revision is ever
needed.
 A wonderful advance if true – if psychologically feasible!
3
But computational linguists couldn’t implement it
(parameters yes; triggering no)
 Syntacticians largely embraced this neat picture.
 But as a mechanism, triggering was never implemented.
Computational linguists deemed it infeasible, due to the ambiguity
and opacity of would-be triggers in the natural-language domain
(Clark 1989).
Examples on the next slide.
 Only the concept of parameterization was retained: Language
acquisition is selection of a grammar from a finite set, which is
defined by UG (innate principles + innate parametric choices).
 The learning process was modeled as a trial-and-error search
through the domain of all possible grammars, applying familiar
domain-general learning algorithms from computer science.
 No input guidance toward correct grammar. Input serves only
as feedback on hypotheses selected partly at random.
4
Why doesn’t instant triggering work?
 Input ambiguity: e.g., Exceptional Case Marking (Clark 1989).
We consider him to be clever. Does ECM or the infinitive assign Acc case?
I consider myself to be clever. Or long-distance anaphora?
 Derivational opacity: e.g., Adv P (not Verb Subj) entails -NullSubj.
Why?! Because a P with no object must be due to obj-topicalization,
then topic-drop, and +NullTop entails -NS.
 Conclusion: It’s impossible or impractical to recognize the
parameter-values from the surface sentence.
 Learners have to guess. (Counter-argument in Classes 6 & 7.)
 Also, classic triggering mis-predicts child data (Yang 2002):
children’s grammar changes are gradual; they must be
contemplating two or more (many?) grammars simultaneously.
5
Trial-and-error domain search methods:
under-powered or over-resourced
 Genetic algorithm. Clark & Roberts (1993)
Test many grammars each on many sentences, rank them,
breed them, repeat, repeat. (Over-resourced)
 Triggering Learning Algorithm. Gibson & Wexler (1994)
Test one grammar at a time, on one sentence. If it fails,
change one P at random. (Under-powered; fails often, slow.
A code sketch follows this slide.)
 Variational Model. Yang (2000)
Give TLA a memory for the success-rate of each parameter
value. Test one grammar, but sample the whole domain. (Next slide →)
 Bayesian Learner. Perfors, Tenenbaum & Regier (2006)
Test all grammars on total input sample. Adopt the one
with best mix of simplicity & good fit. (Over-resourced)
6
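To make the contrast concrete, here is a minimal Python sketch of one TLA step. It assumes a parser oracle can_parse(grammar, sentence) that merely reports whether a grammar licenses the sentence; all names are illustrative, not Gibson & Wexler's code. The comments note the Single Value and Greediness constraints of the published algorithm.

```python
import random

def tla_step(grammar, sentence, can_parse):
    """One step of a TLA-style learner (after Gibson & Wexler 1994).
    `grammar` is a tuple of binary parameter values; `can_parse` is a
    stand-in oracle for the learner's parser."""
    if can_parse(grammar, sentence):
        return grammar                       # success: keep the current grammar
    i = random.randrange(len(grammar))       # Single Value Constraint: flip just one parameter
    flipped = tuple(1 - v if j == i else v for j, v in enumerate(grammar))
    if can_parse(flipped, sentence):         # Greediness: adopt the flip only if it now parses
        return flipped
    return grammar                           # otherwise stay put and wait for more input
```

Note how little the learner extracts from each sentence: a single parse/fail bit per trial, which is why progress is slow.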
Variational Model’s memory for how
well each P-value has performed
[Diagram: for each parameter (Null subject, Head-direction, WH-movement, etc.), a pointer on a scale between its two values (0 and 1) records how well each value has performed so far.]
 Test one grammar at a time. If it succeeds, nudge the pointer for
each parameter toward the successful P-value. If the grammar fails,
nudge the pointers away from those P-values.
 Select a grammar to test next, with probability based on the
weights of its P-values. (An update-rule sketch follows this slide.)
7
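A minimal sketch of this nudging scheme, assuming binary parameters, a stand-in parser oracle can_parse, and an illustrative learning rate GAMMA; Yang's model uses a linear reward-penalty update of roughly this shape, but the exact constants are not given in the slides.

```python
import random

GAMMA = 0.02  # learning rate: how far each nudge moves a weight (illustrative value)

def vm_step(weights, sentence, can_parse):
    """One Variational-Model-style update (after Yang 2000).
    `weights[i]` is the current probability of choosing value 1 for
    parameter i; `can_parse` is a stand-in parser oracle."""
    # Select a grammar to test, with probability based on the weights.
    grammar = [1 if random.random() < w else 0 for w in weights]
    success = can_parse(grammar, sentence)
    for i, value in enumerate(grammar):
        p = weights[i] if value == 1 else 1 - weights[i]  # prob. of the value just tried
        if success:
            p += GAMMA * (1 - p)     # nudge toward the successful P-value
        else:
            p *= 1 - GAMMA           # nudge away from the failed P-value
        weights[i] = p if value == 1 else 1 - p
    return grammar, success
```

Note that every parameter is nudged on every sentence, relevant or not; this is the property criticized a few slides below.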
Varieties of domain-search, illustrated
 Think Easter egg hunt. Eggs are the parameter values, to
be found. Search domain is the park.
 Genetic Algorithm: Send out hordes of searchers, compare
notes.
 Triggering Learning Algorithm: A lone searcher, following
own nose, small steps: “getting warmer”.
 Variational Model: Mark findings/failures on a rough map to
focus search; occasionally dash to another spot to see
what’s there.
 Compare these with decoding: First consult the sentence!
Read a clue, decipher its meaning, go where it says; the
egg is there.
8
Varieties of domain-search, illustrated
 GA: Send out hordes of
searchers, compare notes.
(Vast effort)
 TLA: A lone searcher,
following own nose, small
steps: “getting warmer”.
(Slow progress)
 VM: Mark findings/failures
on a rough map;
occasionally dash to another
spot to see what’s there.
(Still a needle in a haystack)
9
Yang’s VM: the best current search model
 Can learn from every input sentence.
 Choice of a grammar to try is based on its track record.
 But no decoding, so it extracts little info per sentence:
only can/cannot parse, not why, or what would help.
 Can't recognize unambiguity.
 Non-deterministic. Parameters may swing back and forth
between the two values repeatedly.
 Inefficiency increases with the size of the domain, perhaps
exponentially (especially if the domain is not 'smooth').
 Yang's simulations and ours agree: VM consumes an
order of magnitude more input than decoding models.
10
Is VM plausible as psychology?
 VM improves on TLA, achieving more effective search
with modest resources. And it avoids getting permanently
trapped in a wrong corner of the domain. (‘local minimum’)
 But it has some strange un-human-like(?) properties:
 Irrelevant parameter values are rewarded / punished,
e.g., prep-stranding in a sentence with no preps.
Without decoding, VM can’t know which parameters are
relevant to the input sentence.
 To explore, it tests some grammars that are NOT highly
valued at present → the child will often fail to parse a
sentence, even if her currently best grammar can parse it!
Exploring fights normal language use.
11
What’s more psychologically realistic?
 A crucial aspect of the VM is that even low-valued grammars
are occasionally tried out on input sentences.
 But is this what children do?
 When a toddler hears an utterance, what goes on in her brain?
Specifically: what grammar does she try to process the sentence with?
 Surely, she'd apply her currently 'highest-valued' grammar?
Why would she use one that she believes to be wrong?
 A low-valued grammar would often fail to deliver a successful
parse of the sentence. When it fails, the child doesn't
(linguistically) understand the sentence – even if it's one she
understood yesterday and it is generated by her current 'best'
grammar!
12
CUNY’s alternative: Learning by parsing
 This is a brief preview. We’ll go into more detail in Class 7.
 A child’s aim is to understand what people are saying.
 So, just like adults, children try to parse the sentences they
hear. (Assign structure to word string; semantic composition.)
 When the child’s grammar licenses an input, her parsing
routines function just as in adult sentence comprehension.
 When the sentence lies beyond her current grammar, the
parsing mechanism can process parts of the sentence but not
all. It seeks a way to complete the parse tree. (Not just yes/no.)
 To do so, it draws on the additional parameter-values that UG
makes available, seeking one that can solve the problem.
 If a parameter-value succeeds in rescuing the parse, that
means it's useful, so it is adopted into the grammar.
(A schematic sketch follows this slide.)
13
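A schematic rendering of that loop, assuming that parameter values are treelets (next slide), that treelet inventories are sets, and that try_parse stands in for the child's parsing routines (returning a tree or None); every name here is illustrative, not the CUNY implementation.

```python
def parse_and_maybe_learn(sentence, grammar_treelets, ug_treelets, try_parse):
    """Sketch of learning-by-parsing: parse with the current grammar;
    if the parse cannot be completed, look for a UG treelet that rescues
    it and adopt that treelet into the grammar."""
    tree = try_parse(sentence, grammar_treelets)
    if tree is not None:
        return tree, grammar_treelets              # ordinary comprehension; nothing to learn
    # The current grammar cannot complete the parse: reach out to UG.
    for treelet in ug_treelets - grammar_treelets:
        tree = try_parse(sentence, grammar_treelets | {treelet})
        if tree is not None:
            # The treelet rescued the parse, so it is adopted for future use.
            return tree, grammar_treelets | {treelet}
    return None, grammar_treelets                  # no rescue found; no learning this time
```

Unlike the search models above, the learner here always parses with its single current grammar and consults UG only at the exact point where the parse breaks down.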
So a parameter value must be
something the parser can use
 What a parser (adult or child) really needs is a way to
connect an incoming word into the tree structure being built.
Some linkage of syntactic nodes and branches.
 At CUNY we take parameter values to be UG-specified
'treelets' that the parser can use. (Not switch-settings.)
 A treelet is a sub-structure of larger sentential trees (typically
underspecified in some respects).
 Example treelet: a PP node immediately dominating a
preposition and a nominal trace. Indicates a positive value
for the preposition-stranding parameter (English Who are you
talking with now? vs. French *Qui parles-tu avec maintenant?).
(An illustrative encoding follows this slide.)
14
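One way such a treelet might be encoded, purely for illustration; the Treelet class and the category labels are assumptions of this sketch, not a CUNY data structure.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Treelet:
    """A small, possibly underspecified piece of tree structure:
    a parent node immediately dominating a sequence of daughters."""
    parent: str
    daughters: tuple  # daughter category labels, left to right

# The slide's example: a PP immediately dominating a preposition and a
# nominal trace, i.e. the positive value of the preposition-stranding
# parameter ('trace-NP' marks the phonologically null, wh-linked complement).
P_STRANDING = Treelet(parent="PP", daughters=("P", "trace-NP"))
```

Being immutable and hashable, such treelets could populate the UG pool in the sketch after slide 13 and be adopted into the grammar one at a time.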
Children do what adults do: Example
 E.g., Which rock can you jump to from here? has a stranded
preposition to, with no overt complement. That becomes
evident at the word from.
 For an adult English speaker, the parsing mechanism has
access to a possible piece of tree structure (a ‘treelet’) which
inserts a phonologically null complement to the preposition,
and links it to a fronted wh-phrase. See the tree diagram on the next slide.
 Now consider a child who already knows wh-movement but
not yet preposition stranding (maybe not realistic!). The child’s
parser would do exactly the same as the adult’s, up to the
word from.
 The child’s current grammar offers no means of continuing the
parse. It has no treelet that fits between to and from. So it
must look and see whether UG can provide one.
15
In English, a preposition may have a null complement.
Learners will discover this as they parse.
[Tree diagram: the fronted wh-phrase is coindexed with the phonologically null complement (+null_i) of the stranded preposition.]
16
Children must reach out to UG
 The child’s parser must search for a treelet in the wider
pool of candidates made available by UG, to identify one
that will fill that gap in the parse tree.
 Once found, that treelet would become part of the learner’s
grammar, for future use in understanding and producing
sentences with stranded prepositions.
 Summary: In the treelet model, the learner’s innate parsing
mechanism works with the learner’s single currently best
grammar hypothesis, and upgrades it on-line just if and
where it finds that a new treelet is needed in order to parse
an incoming sentence.
 A child's processing of sentences differs from an adult's
only in the need to reach out to UG for new treelets.
17
Compared with domain search systems
 In this way, the specific properties of input sentences
provide a word-by-word guide to the adoption of relevant
parameter values, in a narrowly channeled process.
E.g., What to do if you encounter a sentence containing a
prep without an overt object.
 This input-guidance gets the maximum benefit from the
information the input contains.
 It requires no specifically-evolved learning mechanism for
language. (But it does need access to UG.)
 It makes use of the sentence parsing mechanism, which is
needed in any case – and which is generally regarded as
being innate, ready to function as soon as the child knows
some words.
18
Please read before Friday (Class 3)
 The 2-page article “Positive and negative evidence in
language acquisition”, by Grimshaw & Pinker.
 On the availability and utility of negative data.
 The key questions: Does negative evidence exist?
Do language learners use it?
Do language learners need to?
19