The mental representation of sentences
Tree structures or state vectors?
Stefan Frank
S.L.Frank@uva.nl
With help from
Vera Demberg
Rens Bod
Victor Kuperman
Brian Roark
Understanding a sentence
The very general picture
[Figure: a word sequence (e.g., “the cat is on the mat”) goes through “comprehension”, yielding its “meaning”]
Theories of mental representation
[Figure: proposals for representing sentence meaning (a perceptual simulation; a conceptual network linking cat, on, mat, …; a logical form such as ∃x,y: cat(x) ∧ mat(y) ∧ on(x,y)) and for sentence “structure” (a tree structure versus a state vector or activation pattern); which of the latter two is the open question addressed here]
Grammar-based vs. connectionist models
The debate (or battle) between the two camps focuses on particular (psycho-)linguistic phenomena, e.g.:
• Grammars account for the productivity and systematicity of language
(Fodor & Pylyshyn, 1988; Marcus, 1998)
• Connectionism can explain why there is no (unlimited) productivity and (pure) systematicity
(Christiansen & Chater, 1999)
From theories to models
• Implemented computational models can be evaluated and compared more thoroughly than ‘mere’ theories
• Take a common grammar-based model and a common connectionist model
• Compare their ability to predict empirical data (measurements of word-reading time)
Simple Recurrent Network (SRN)
versus
Probabilistic Context-Free Grammar (PCFG)
Probabilistic Context-Free Grammar
• A context-free grammar with a probability for each production rule (conditional on the rule’s left-hand side)
• The probability of a tree structure is the product of the probabilities of the rules involved in its construction.
• The probability of a sentence is the sum of the probabilities of all its grammatical tree structures.
• Rules and their probabilities can be induced from a large corpus of syntactically annotated sentences (a treebank); see the sketch below.
• Wall Street Journal treebank: approx. 50,000 sentences from WSJ newspaper articles (1988–1989)
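The rule-probability and tree-probability definitions above can be made concrete with a minimal Python sketch. The toy treebank, the tuple encoding of trees, and all names below are illustrative assumptions, not the WSJ induction pipeline used in the talk; lexical rules and smoothing are ignored.

from collections import defaultdict
from math import prod

# A toy treebank: each tree is a nested tuple (label, child, child, ...);
# pre-terminals dominate a single word string.
toy_treebank = [
    ("S",
     ("NP", ("PRP", "it")),
     ("VP", ("VBZ", "has"),
            ("NP", ("DT", "no"), ("NN", "bearing")))),
]

def productions(tree):
    """Yield the (lhs, rhs) phrase-structure rules used in a tree."""
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return                                   # pre-terminal; lexical rules are skipped here
    yield (label, tuple(child[0] for child in children))
    for child in children:
        yield from productions(child)

# Relative-frequency estimation: Pr(lhs -> rhs) = count(lhs -> rhs) / count(lhs)
rule_count = defaultdict(int)
lhs_count = defaultdict(int)
for tree in toy_treebank:
    for lhs, rhs in productions(tree):
        rule_count[(lhs, rhs)] += 1
        lhs_count[lhs] += 1

def rule_prob(lhs, rhs):
    return rule_count[(lhs, rhs)] / lhs_count[lhs]

def tree_prob(tree):
    """Probability of a tree = product of the probabilities of its rules.
    (The probability of a sentence would be the sum of tree_prob over all of
    its grammatical parses, which requires a parser and is omitted here.)"""
    return prod(rule_prob(lhs, rhs) for lhs, rhs in productions(tree))

print(tree_prob(toy_treebank[0]))                # 0.25: NP expands in two different ways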
Inducing a PCFG
[Figure: parse tree of the WSJ sentence “It has no bearing on our work force today.”, from which productions such as S → NP VP ., NP → PRP, and VP → VBZ NP are read off and counted]
Simple Recurrent Network
Elman (1990)
• Feedforward neural network with recurrent connections
• Processes sentences, word by word
• Usually trained to predict the upcoming word (i.e., the input at t+1); see the sketch below
[Figure: SRN architecture: the word input at t and a copy of the hidden activation at t−1 feed into the hidden layer, whose state vector represents the sentence up to word t; the output layer gives estimated probabilities for the words at t+1]
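A minimal NumPy sketch of one SRN time step, assuming random (untrained) weights and illustrative layer sizes; training by backpropagation to predict the word at t+1 is omitted.

import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden_size = 50, 20                 # illustrative sizes

W_ih = rng.normal(0, 0.1, (hidden_size, vocab_size))    # word input   -> hidden
W_hh = rng.normal(0, 0.1, (hidden_size, hidden_size))   # context copy -> hidden
W_ho = rng.normal(0, 0.1, (vocab_size, hidden_size))    # hidden       -> output

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def srn_step(word_index, h_prev):
    """One time step: the word at t plus a copy of the hidden activation
    at t-1 give the new state vector; the output estimates Pr(word at t+1)."""
    x = np.zeros(vocab_size)
    x[word_index] = 1.0                          # one-hot encoding of the input word
    h = np.tanh(W_ih @ x + W_hh @ h_prev)        # state vector for the sentence up to t
    p_next = softmax(W_ho @ h)                   # estimated probabilities for words at t+1
    return p_next, h

# Process a (toy) sentence word by word:
h = np.zeros(hidden_size)
for w in [3, 17, 42]:                            # hypothetical word indices
    p_next, h = srn_step(w, h)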
Word probability and reading times
Hale (2001), Levy (2008)
• Surprisal theory: the more unexpected the occurrence of a word, the more time needed to process it. Formally:
• A sentence is a sequence of words: w1, w2, w3, …
• The time needed to read word wt is logarithmically related to its probability in the ‘context’:
RT(wt) ~ −log Pr(wt | context)   (the surprisal of wt)
• If nothing else changes, the context is just the sequence of previous words:
RT(wt) ~ −log Pr(wt | w1, …, wt−1)   (the surprisal of wt)
• Both PCFGs and SRNs can estimate Pr(wt | w1, …, wt−1)
• So can they predict word-reading times? (A surprisal-computation sketch follows.)
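Given any estimate of Pr(wt | w1, …, wt−1), surprisal is just its negative logarithm. A minimal sketch: the unigram ‘model’ below is only a hypothetical stand-in for a PCFG parser or an SRN, and the base of the logarithm only changes a constant factor.

import math

def surprisals(sentence, cond_prob):
    """Per-word surprisal, -log2 Pr(w_t | w_1..w_{t-1}), given any conditional
    word-probability function (e.g., one backed by a PCFG parser or an SRN)."""
    return [-math.log2(cond_prob(word, sentence[:t]))
            for t, word in enumerate(sentence)]

# Example with a hypothetical context-insensitive 'model' (unigram probabilities):
unigram = {"the": 0.07, "cat": 0.001, "is": 0.02, "on": 0.03, "mat": 0.0005}
print(surprisals(["the", "cat", "is", "on", "the", "mat"],
                 lambda w, context: unigram[w]))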
Testing surprisal theory
Demberg & Keller (2008)
• Reading-time data:
− Dundee corpus: approx. 2,400 sentences from The Independent newspaper editorials
− Read by 10 subjects
− Eye-movement registration
− First-pass RTs: fixation time on a word before any fixation on later words
• Computation of surprisal:
− PCFG induced from WSJ treebank
− Applied to Dundee corpus sentences
− Using Brian Roark’s incremental PCFG parser
Testing surprisal theory
Demberg & Keller (2008)
Result: No significant effect of word surprisal on RT, apart from the effects of Pr(wt) and Pr(wt|wt−1)
But accurate word prediction is difficult because of
• required world knowledge
• differences between WSJ and The Independent:
− 1988–’89 versus 2002
− general WSJ articles versus Independent editorials
− American English versus British English
− only major similarity: both are in English
Test for a purely ‘structural’ (i.e., non-semantic) effect by ignoring the actual words
[Figure: the same parse tree, now with only the part-of-speech (pos) tags (PRP, VBZ, DT, NN, IN, PRP$, …) at the leaves instead of the words]
Testing surprisal theory
Demberg & Keller (2008)
• ‘Unlexicalized’ (or ‘structural’) surprisal
− PCFG induced from WSJ trees with the words removed
− surprisal estimated by parsing sequences of pos-tags (instead of words) of the Dundee corpus texts
− independent of semantics, so more accurate estimation is possible
− but probably a weaker relation with reading times
• Is a word’s RT related to the predictability of its part-of-speech?
• Result: Yes, a statistically significant (but very small) effect of pos-surprisal on word-RT
Caveats
• Statistical analysis:
− The analysis assumes independent measurements
− Surprisal theory is based on dependencies between words
− So the analysis is inconsistent with the theory
• Implicit assumptions:
− The PCFG forms an accurate language model (i.e., it gives high probability to the parts-of-speech that actually occur)
− An accurate language model is also an accurate psycholinguistic model (i.e., it predicts reading times)
Solutions
• Sentence-level (instead of word-level) analysis (see the sketch below)
− Both the PCFG and the statistical analysis assume independence between sentences
− Surprisal averaged over the pos-tags in the sentence
− Total sentence RT divided by sentence length (# letters)
• Measure accuracy
a) of the language model:
lower average surprisal → more accurate language model
b) of the psycholinguistic model:
RT and surprisal correlate more strongly → more accurate psycholinguistic model
• If a) and b) increase together, accurate language models are also accurate psycholinguistic models
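A sketch of these sentence-level measures and the two accuracy criteria, assuming a hypothetical per-sentence data structure (the field names are illustrative, not from the talk).

import numpy as np
from scipy.stats import pearsonr

def sentence_measures(sentences):
    """sentences: list of dicts with keys 'surprisals' (one value per pos-tag),
    'total_rt' (total sentence reading time) and 'n_letters'."""
    avg_surprisal = np.array([np.mean(s["surprisals"]) for s in sentences])
    rt_per_letter = np.array([s["total_rt"] / s["n_letters"] for s in sentences])
    return avg_surprisal, rt_per_letter

def evaluate(sentences):
    avg_surprisal, rt_per_letter = sentence_measures(sentences)
    lm_inaccuracy = avg_surprisal.mean()             # lower = more accurate language model
    r, p = pearsonr(avg_surprisal, rt_per_letter)    # larger r = more accurate psycholinguistic model
    return lm_inaccuracy, r, p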
Comparing PCFG and SRN
• PCFG (just like Demberg & Keller)
− Train on WSJ treebank (unlexicalized)
− Parse pos-tag sequences from the Dundee corpus
− Obtain a range of surprisal estimates by varying the ‘beam-width’ parameter, which controls parser accuracy
• SRN
− Train on sequences of pos-tags (not the trees) from the WSJ
− During training (at regular intervals), process the pos-tags from the Dundee corpus, obtaining a range of surprisal estimates
• Evaluation
− Language model: average surprisal measures inaccuracy (and estimates language entropy)
− Psycholinguistic model: correlation between surprisals and RTs
Results
[Figures: surprisal-based evaluation results for the PCFG and for the SRN]
Preliminary conclusions
• Both models account for a statistically significant fraction of the variance in the reading-time data.
• The human sentence-processing system seems to be using an accurate language model.
• The SRN is the more accurate psycholinguistic model.
But the PCFG and SRN together might form an even better psycholinguistic model
Improved analysis
• Linear mixed-effects regression model (to take into account random effects of subject and item); a minimal sketch follows
• Compare regression models that include:
− surprisal estimates by the PCFG with the largest beam width
− surprisal estimates by the fully trained SRN
− both
• Also include: sentence length, word frequency, forward and backward transitional probabilities, and all significant two-way interactions between these
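A minimal sketch of such a regression in Python with statsmodels, using hypothetical column names; for brevity it fits only a by-subject random intercept rather than the crossed subject-and-item random effects described above (which would typically be fitted with lme4 in R or via variance components).

import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data file and column names; one row per sentence x subject.
df = pd.read_csv("reading_times.csv")

model = smf.mixedlm(
    "rt_per_letter ~ surprisal_pcfg + surprisal_srn + sent_length + word_freq",
    data=df,
    groups=df["subject"],            # random intercept per subject only (simplification)
)
result = model.fit()
print(result.summary())              # beta-coefficients and p-values, as in the table that follows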
Results
Estimated β-coefficients (and associated p-values)

Effect of surprisal      Regression model includes…
according to…            PCFG            SRN             both
PCFG                     0.45 (p<.02)    –               −0.46 (p>.2)
SRN                      –               0.64 (p<.001)   1.02 (p<.01)
Conclusions
• Both the PCFG and the SRN do account for the reading-time data to some extent
• But the PCFG does not improve on the SRN’s predictions
No evidence for tree structures in the mental representation of sentences
Qualitative comparison
• Why does the SRN fit the data better than the PCFG?
• Is it more accurate on a particular group of data points, or does it perform better overall?
• Take the regression analyses’ residuals (i.e., the differences between predicted and measured reading times):
δ_i = |resid_i(PCFG)| − |resid_i(SRN)|
• δ_i is the extent to which data point i is predicted better by the SRN than by the PCFG.
• Is there a group of data points for which δ is larger than might be expected?
• Look at the distribution of the δs.
Possible distributions of δ
[Figure: three possible shapes of the δ distribution]
− symmetrical, mean is 0: no difference between SRN and PCFG (only random noise)
− right-shifted: overall better predictions by the SRN
− right-skewed: a particular subset of the data is predicted better by the SRN
Test for symmetry
• If the distribution of δ is asymmetric, particular data points are predicted better by the SRN than by the PCFG; see the sketch below.
• The distribution is not significantly asymmetric (two-sample Kolmogorov-Smirnov test, p>.17)
• The SRN seems to be a more accurate psycholinguistic model overall.
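A sketch of the δ computation and one possible version of the symmetry test. The slides do not specify the reference point around which δ is mirrored, so centering on the median is an assumption, and the residual vectors are hypothetical inputs.

import numpy as np
from scipy.stats import ks_2samp

def delta(resid_pcfg, resid_srn):
    """delta_i = |resid_i(PCFG)| - |resid_i(SRN)|: how much better the SRN
    predicts data point i than the PCFG does."""
    return np.abs(resid_pcfg) - np.abs(resid_srn)

def symmetry_test(d):
    """Mirror delta around its median and compare the two samples with a
    two-sample Kolmogorov-Smirnov test; asymmetry -> small p-value.
    (Centering on the median is an assumption made here.)"""
    centered = d - np.median(d)
    return ks_2samp(centered, -centered)

# Usage (hypothetical residual vectors from the two regression models):
# d = delta(resid_pcfg, resid_srn)
# stat, p = symmetry_test(d)        # large p -> no evidence of asymmetry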
Questions I cannot (yet) answer
(but would like to)
• Why does the SRN fit the data better than the PCFG?
• Perhaps people…
− are bad at dealing with long-distance dependencies in a sentence?
− store information about the frequency of multi-word sequences?
Questions I cannot (yet) answer
(but would like to)
In general:
• Is this SRN a more accurate psychological model than this PCFG?
• Are SRNs more accurate psychological models than PCFGs?
• Does connectionism make for more accurate psychological models than grammar-based theories?
• What kind of representations are used in human sentence processing?
Surprisal-based model evaluation may provide some answers