A Multi-Strategy Approach to
Parsing of Grammatical Relations
in Child Language Transcripts
Kenji Sagae
Language Technologies Institute
Carnegie Mellon University
Thesis Committee:
Alon Lavie, co-chair
Brian MacWhinney, co-chair
Lori Levin
Jaime Carbonell
John Carroll, University of Sussex
Natural Language Parsing:
Sentence → Syntactic Structure
• One of the core problems in NLP
Input: The boy ate the cheese sandwich
Output:
(ROOT (predicate eat) (surface ate) (tense past) (category V)
  (SUBJ (surface boy) (category N) (agreement 3s)
    (DET (surface the) (category Det)))
  (OBJ (predicate sandwich) (surface sandwich) (category N) (definite +)
    (DET (surface the) (category Det))
    (MOD (predicate cheese) (surface cheese) (category N))))

Grammatical Relations (GRs)
• Subject, object, adjunct, etc.
1 2 The       DET
2 3 boy       SUBJ
3 0 ate       ROOT
4 6 the       DET
5 6 cheese    MOD
6 3 sandwich  OBJ
2
Using Natural Language Processing
in Child Language Research
• CHILDES Database (MacWhinney, 2000)
– 200 megabytes of child-parent dialog transcripts
– Part-of-speech and morphology analysis
• Tools available
• Not enough for many research questions
– No syntactic analysis
• Can we use NLP to analyze CHILDES
transcripts?
– Parsing
– Many decisions: representation, approach, etc.
3
Parsing CHILDES:
Specific and General Motivation
• Specific task: automatic analysis of syntax in
CHILDES corpora
– Theoretical importance (study of child language
development)
– Practical importance (measurement of syntactic
competence)
• In general: Develop techniques for syntactic
analysis, advance parsing technologies
– Can we develop new techniques that perform better
than current approaches?
• Rule-based
• Data-driven
4
Research Objectives
• Identify a suitable syntactic representation for
CHILDES transcripts
– Must address the needs of child language research
• Develop a high accuracy approach for syntactic
analysis of spoken language transcripts
– parents and children at different stages of language
acquisition
• The plan: a multi-strategy approach
– ML: ensemble methods
– Parsing: several approaches possible, but
combination is an underdeveloped area
5
Research Objectives
• Develop methods for combining analyses from
different parsers and obtain improved accuracy
– Combining rule-based and data-driven approaches
• Evaluate the accuracy of developed systems
• Validate the utility of the resulting systems to the
child language community
– Task-based evaluation: Automatic measurement of
grammatical complexity in child language
6
Overview of the Multi-Strategy Approach
for Syntactic Analysis
Parser A
Parser B
Transcripts
Parser
Combination
Parser C
Parser D
SYNTACTIC
STRUCTURES
Parser E
7
Thesis Statement
• The development of a novel multi-strategy approach for
syntactic parsing allows for identification of Grammatical
Relations in transcripts of parent-child dialogs at a higher
level of accuracy than previously possible
• Through the combination of different NLP techniques
(rule-based or data-driven), the multi-strategy approach
can outperform each strategy in isolation, and produce
significantly improved accuracy
• The resulting syntactic analyses are at a level of accuracy
that makes them useful to child language research
8
Outline
• The CHILDES GR scheme
• GR Parsing of CHILDES transcripts
• Combining different strategies
• Automated measurement of syntactic
development in child language
• Related work
• Conclusion
9
CHILDES GR Scheme
(Sagae, MacWhinney and Lavie, 2004)
• Grammatical Relations (GRs)
– Subject, object, adjunct, etc.
– Labeled dependencies
• Addresses needs of child language researchers
– Informative and intuitive, basis for DSS and IPSyn
[Diagram: a labeled dependency arc pointing from the Dependent to its Head, annotated with the Dependency Label]
10
CHILDES GR Scheme Includes Important
GRs for Child Language Study
11
Outline
• The CHILDES GR scheme
• GR Parsing of CHILDES transcripts
• Evaluation
• Data
• Rule-based GR parsing
• Data-driven GR parsing
• Combining different strategies
• Automated measurement of syntactic
development in child language
• Related work
• Conclusion
12
The Task:
Sentence → GRs
• Input:
We eat the cheese sandwich
• Output:
13
Evaluation of GR Parsing
• Dependency accuracy
• Precision/Recall of GRs
14
Evaluation:
Calculating Dependency Accuracy
1 2 We       SUBJ
2 0 eat      ROOT
3 5 the      DET
4 5 cheese   MOD
5 2 sandwich OBJ

(columns: dependent index, head index, word, GR label; head 0 is the root)
15
Evaluation:
Calculating Dependency Accuracy
GOLD                      PARSED
1 2 We       SUBJ         1 2 We       SUBJ
2 0 eat      ROOT         2 0 eat      ROOT
3 5 the      DET          3 4 the      DET
4 5 cheese   MOD          4 2 cheese   OBJ
5 2 sandwich OBJ          5 2 sandwich PRED

Accuracy = number of correct dependencies / total number of dependencies
         = 2 / 5 = 0.40 = 40%
16
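A minimal sketch (not from the original slides) of how the accuracy above can be computed, assuming each analysis is a list of (index, head, word, label) tuples as in the example:

    def dependency_accuracy(gold, parsed):
        """Fraction of words whose head and GR label both match the gold analysis."""
        gold_by_index = {idx: (head, label) for idx, head, word, label in gold}
        correct = sum(
            1 for idx, head, word, label in parsed
            if gold_by_index.get(idx) == (head, label)
        )
        return correct / len(gold)

    gold   = [(1, 2, "We", "SUBJ"), (2, 0, "eat", "ROOT"), (3, 5, "the", "DET"),
              (4, 5, "cheese", "MOD"), (5, 2, "sandwich", "OBJ")]
    parsed = [(1, 2, "We", "SUBJ"), (2, 0, "eat", "ROOT"), (3, 4, "the", "DET"),
              (4, 2, "cheese", "OBJ"), (5, 2, "sandwich", "PRED")]
    print(dependency_accuracy(gold, parsed))   # 0.4, as in the worked example above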
Evaluation:
Precision and Recall of GRs
• Precision and recall are calculated separately for each GR type
• Calculated on aggregate counts over the entire test corpus
• Example: SUBJ

Precision = # SUBJ matches between PARSED and GOLD / total # of SUBJs in PARSED
Recall    = # SUBJ matches between PARSED and GOLD / total # of SUBJs in GOLD
F-score   = 2 (Precision × Recall) / (Precision + Recall)
17
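As a rough illustration, assuming the same (index, head, word, label) tuples, per-GR precision, recall, and F-score over a test corpus could be computed like this:

    def gr_precision_recall(gold, parsed, gr):
        """Precision/recall/F-score for one GR type, aggregated over a corpus.
        Indices are assumed unique across the corpus, e.g. (sentence_id, word_id)."""
        gold_deps = {(i, h) for i, h, w, l in gold if l == gr}
        parsed_deps = {(i, h) for i, h, w, l in parsed if l == gr}
        matches = len(gold_deps & parsed_deps)
        precision = matches / len(parsed_deps) if parsed_deps else 0.0
        recall = matches / len(gold_deps) if gold_deps else 0.0
        f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        return precision, recall, f

    # The SUBJ example on the next slide: 1 match, 2 SUBJs in PARSED, 1 SUBJ in GOLD
    # gives precision 0.5, recall 1.0, F-score 0.67.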
Evaluation:
Precision and Recall of GRs
GOLD                      PARSED
1 2 We       SUBJ         1 2 We       SUBJ
2 0 eat      ROOT         2 0 eat      ROOT
3 5 the      DET          3 4 the      DET
4 5 cheese   MOD          4 2 cheese   OBJ
5 2 sandwich OBJ          5 2 sandwich SUBJ

Precision = # SUBJ matches between PARSED and GOLD / total # of SUBJs in PARSED
          = 1 / 2 = 50%
Recall    = # SUBJ matches between PARSED and GOLD / total # of SUBJs in GOLD
          = 1 / 1 = 100%
F-score   = 2 (50 × 100) / (50 + 100) = 66.67
18
Outline
• The CHILDES GR scheme
• GR Parsing of CHILDES transcripts
• Evaluation
• Data
• Rule-based GR parsing
• Data-driven GR parsing
• Combining different strategies
• Automated measurement of syntactic
development in child language
19
CHILDES Data: the Eve Corpus
(Brown, 1973)
• A corpus from CHILDES
– Manually annotated with GRs
• Training: ~ 5,000 words (adult)
• Development: ~ 1,000 words
– 600 adult, 400 child
• Test: ~ 2,000 words
– 1,200 adult, 800 child
20
Not All Child Utterances Have GRs
• Utterances in training and test sets are well-formed
I need tapioca in the bowl.
That’s a hat.
In a minute.
• What about
* Warm puppy happiness a blanket.
* There briefcase.
? I drinking milk.
? I want Fraser hat.
• Separate Eve-child test set (700 words)
21
The WSJ Corpus (Penn Treebank)
• 1 million words
• Widely used
– Sections 02-21: training
– Section 22: development
– Section 23: evaluation
• Large corpus with syntactic annotation
– Out-of-domain
• Constituent structures
– Convert to unlabeled dependencies using a head-percolation table
22
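A sketch of the constituent-to-dependency conversion idea. The head rules below are illustrative placeholders, not the actual head-percolation table used for the Penn Treebank, and word indices would be needed in practice to distinguish repeated words:

    # Hypothetical head rules: for each constituent label, which child categories
    # to prefer as the head, searched in order.
    HEAD_RULES = {
        "S":  ["VP"],
        "VP": ["VBD", "VBZ", "VB", "VP"],
        "NP": ["NN", "NNS", "NP"],
    }

    def head_word(tree):
        """tree: ("label", [children]) for constituents, ("POS", "word") for leaves."""
        label, children = tree
        if isinstance(children, str):             # leaf: (POS, word)
            return children
        for cat in HEAD_RULES.get(label, []):
            for child in children:
                if child[0] == cat:
                    return head_word(child)
        return head_word(children[-1])            # fallback: rightmost child

    def dependencies(tree, deps=None):
        """The head of every non-head child depends on the head of the constituent."""
        if deps is None:
            deps = []
        label, children = tree
        if isinstance(children, str):
            return deps
        h = head_word(tree)
        for child in children:
            ch = head_word(child)
            if ch != h:
                deps.append((ch, h))              # (dependent word, head word)
            dependencies(child, deps)
        return deps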
Outline
• The CHILDES GR scheme
• GR Parsing of CHILDES transcripts
• Evaluation
• Data
• Rule-based GR parsing
• Data-driven GR parsing
• Combining different strategies
• Automated measurement of syntactic
development in child language
23
Rule-Based Parsing
• The parser’s knowledge is encoded in manually
written rules
– Grammar, lexicon, etc.
• Only analyses that fit the rules are possible
• Accurate in specific domains, difficult to achieve
wide coverage in open domain
– Coverage, ambiguity, domain knowledge
24
Rule-Based Parsing of CHILDES data
(Sagae, Lavie & MacWhinney, 2001, 2004)
LCFlex (Rosé and Lavie, 2001)
• Rules: CFG backbone augmented with unification constraints
  – Manually written, 153 rules
• Robustness
  – Limited insertions: [Do] [you] want to go outside?
  – Limited skipping: No um maybe later.
• PCFG disambiguation model
  – Trained on 2,000 words
25
High Precision from a Small Grammar
• Eve test corpus
– 2,000 words
• 31% of the words can be parsed
• Accuracy (over all 2,000 words): 29%
• Precision: 94%
• High precision, low recall
• Improve recall using parser’s robustness
– Insertions, skipping
– Multi-pass approach
26
Robustness and Multi-Pass Parsing
• No insertions, no skipping
31% parsed, 29% recall, 94% precision
• Insertion of NP and/or auxiliary
38% parsed, 35% recall, 92% precision
• Skipping of 1 word
52% parsed, 47% recall, 90% precision
• Skipping of 1 word, insertion of NP, aux
63% parsed, 55% recall, 88% precision
27
Use Robustness to Improve Recall
[Chart: precision, recall, and f-score (0-100) under four robustness settings: none, insert NP/aux, skip 1 word, insert/skip]
28
Outline
• The CHILDES GR scheme
• GR Parsing of CHILDES transcripts
• Evaluation
• Data
• Rule-based GR parsing
• Data-driven GR parsing
• Combining different strategies
• Automated measurement of syntactic
development in child language
29
Data-driven Parsing
• Parser learns from a corpus of annotated
examples
• Data-driven parsers are robust
• Two approaches
– Existing statistical parser
– Classifier-based parsing
30
Accurate GR Parsing with
Existing Resources (Mostly)
• Large training corpus: Penn Treebank (Marcus
et al., 1993)
– Head-table converts constituents into dependencies
• Use an existing parser (trained on the Penn
Treebank)
– Charniak (2000)
• Convert output to unlabeled dependencies
• Use a classifier for dependency labeling
31
Unlabeled Dependency Identification
We eat the cheese sandwich
[Diagram: unlabeled dependency arcs over the sentence "We eat the cheese sandwich"]
32
Domain Issues
• Parser training data is in a very different domain
– WSJ vs Parent-child dialogs
• Domain specific training data would likely be
better
• Performance is acceptable
– Shorter, simpler sentences
– Unlabeled dependency accuracy
• WSJ test data: 92%
• Eve test data: 90%
33
Dependency Labeling
• Training data is required
– Eve training set (5,000 words)
• Labeling dependencies is easier than finding
unlabeled dependencies
• Use a classifier
– TiMBL (Daelemans et al., 2004)
– Extract features from unlabeled dependency structure
– GR labels are target classes
34
Dependency Labeling
35
Features Used for GR Labeling
• Head and dependent words
– Also their POS tags
• Whether the dependent comes before or after the head
• How far the dependent is from the head
• The label of the lowest node in the constituent tree that
includes both the head and dependent
36
Features Used for GR Labeling
Consider the words “we” and “eat”
Features: we, pro, eat, v, before, 1, S
Class: SUBJ
37
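A rough sketch of extracting the labeling features listed above for one unlabeled dependency (the function and names are illustrative, not the thesis code):

    def labeling_features(dep_idx, head_idx, words, pos, lowest_common_label):
        """Features for classifying the GR label of one unlabeled dependency.
        words, pos: lists indexed from 1; lowest_common_label is the label of the
        lowest constituent node covering both head and dependent (e.g. "S")."""
        return {
            "dep_word":  words[dep_idx],
            "dep_pos":   pos[dep_idx],
            "head_word": words[head_idx],
            "head_pos":  pos[head_idx],
            "direction": "before" if dep_idx < head_idx else "after",
            "distance":  abs(head_idx - dep_idx),
            "node":      lowest_common_label,
        }

    # For "we" depending on "eat" in "We eat the cheese sandwich":
    words = [None, "we", "eat", "the", "cheese", "sandwich"]
    pos   = [None, "pro", "v", "det", "n", "n"]
    print(labeling_features(1, 2, words, pos, "S"))
    # features we, pro, eat, v, before, 1, S -> target class SUBJ, as on the slide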
Good GR Labeling Results
with Small Training Set
• Eve training set
– 5,000 words for training
• Eve test set
– 2,000 words for testing
• Accuracy of dependency labeling (on perfect
dependencies): 91.4%
• Overall accuracy (Charniak parser +
dependency labeling): 86.9%
38
Some GRs Are Easier Than Others
• Overall accuracy: 86.9%
• Easily identifiable GRs
– DET, POBJ, INF, NEG: Precision and recall above
98%
• Difficult GRs
– COMP, XCOMP: below 65%
• I think that Mary saw a movie (COMP)
• She tried to see a movie (XCOMP)
39
Precision and Recall of Specific GRs
GR      Precision  Recall  F-score
SUBJ    0.94       0.93    0.93
OBJ     0.83       0.91    0.87
COORD   0.68       0.85    0.75
JCT     0.91       0.82    0.86
MOD     0.79       0.92    0.85
PRED    0.80       0.83    0.81
ROOT    0.91       0.92    0.91
COMP    0.60       0.50    0.54
XCOMP   0.58       0.64    0.61
40
Parsing with Domain-Specific Data
• Good results with a system based on the
Charniak parser
• Why domain-specific data?
– No Penn Treebank
– Handle dependencies natively
– Multi-strategy approach
41
Classifier-Based Parsing
(Sagae & Lavie, 2005)
• Deterministic parsing
– Single path, no backtracking
– Greedy
– Linear run-time
• Simple shift-reduce algorithm
– Single pass over the input string
• Variety: Left-to-right, right-to-left (order matters)
• Classifier makes parser decisions
– Classifier not tied to parsing algorithm
• Variety: Different types of classifiers can be used
42
A Simple, Fast and Accurate Approach
• Classifier-based parsing with constituents
– Trained and evaluated on WSJ data: 87.5%
– Very fast, competitive accuracy
• Simple adaptation to labeled dependency
parsing
– Similar to Malt parser (Nivre, 2004)
– Handles CHILDES GRs directly
43
GR Analysis with Classifier-Based Parsing
• Stack S
– Items may be POS-tagged words or
dependency trees
– Initialization: empty
• Queue W
– Items are POS-tagged words
– Initialization: Insert each word of the input
sentence in order (first word is in front)
44
Shift and Reduce Actions
• Shift
– Remove (shift) the word in front of queue W
– Insert shifted item on top of stack S
• Reduce
– Pop two topmost item from stack S
– Push new item onto stack S
• New item forms new dependency
• Choose LEFT or RIGHT
• Choose Dependency Label
45
Parser Decisions
• Shift vs. Reduce
• If Reduce
– RIGHT or LEFT
– Dependency label
• We use a classifier to make these
decisions
46
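A condensed sketch of the shift-reduce loop driven by a classifier. The classify function stands in for the trained classifier and is assumed rather than defined, and the LEFT/RIGHT convention (LEFT makes the left item the dependent) is an assumption for illustration:

    def parse(words, classify):
        """words: list of (word, POS) pairs for one sentence.
        classify(stack, queue) -> 'SHIFT', 'LEFT-<GR>' or 'RIGHT-<GR>',
        decided from features of the current configuration."""
        stack = []                        # POS-tagged words or partial dependency trees
        queue = list(words)               # first word of the sentence in front
        deps = []                         # (dependent item, head item, GR label)
        while queue or len(stack) > 1:
            action = classify(stack, queue) if len(stack) >= 2 else "SHIFT"
            if action == "SHIFT":
                if not queue:
                    break                 # sketch guard: nothing left to shift
                stack.append(queue.pop(0))
            else:
                direction, label = action.split("-", 1)
                right = stack.pop()       # topmost item
                left = stack.pop()        # second topmost item
                if direction == "LEFT":   # left item becomes dependent of right item
                    deps.append((left, right, label))
                    stack.append(right)
                else:                     # RIGHT: right item becomes dependent of left
                    deps.append((right, left, label))
                    stack.append(left)
        return deps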
Classes and Features
• Classes
– SHIFT
– LEFT-SUBJ
– LEFT-JCT
– RIGHT-OBJ
– RIGHT-JCT
– …
• Features: derived from parser configuration
– Crucially: two topmost items in S, first item in W
– Additionally: other features that describe the current
configuration (look-ahead, etc)
47
Parsing CHILDES
with a Classifier-Based Parser
• Parser uses SVM
• Trained on Eve training set (5,000 words)
• Tested on Eve test set (2,000 words)
• Labeled dependency accuracy: 87.3%
– Uses only domain-specific data
– Same level of accuracy as GR system based on
Charniak parser
48
Precision and Recall of Specific GRs
GR      Precision  Recall  F-score
SUBJ    0.97       0.98    0.98
OBJ     0.89       0.94    0.92
COORD   0.71       0.76    0.74
JCT     0.78       0.88    0.83
MOD     0.94       0.87    0.91
PRED    0.80       0.83    0.82
ROOT    0.95       0.94    0.94
COMP    0.70       0.78    0.74
XCOMP   0.93       0.82    0.87
49
Precision and Recall of Specific GRs
GR      Precision  Recall  F-score  F-score (Charniak-based system)
SUBJ    0.97       0.98    0.98     0.93
OBJ     0.89       0.94    0.92     0.87
COORD   0.71       0.76    0.74     0.75
JCT     0.78       0.88    0.83     0.86
MOD     0.94       0.87    0.91     0.85
PRED    0.80       0.83    0.82     0.81
ROOT    0.95       0.94    0.94     0.91
COMP    0.70       0.78    0.74     0.54
XCOMP   0.93       0.82    0.87     0.61
50
Outline
• The CHILDES GR scheme
• GR Parsing of CHILDES transcripts
• Combining different strategies
• Weighted voting
• Combination as parsing
• Handling young child utterances
• Automated measurement of syntactic
development in child language
• Related Work
• Conclusion
51
Combine Different Parsers
to Get More Accurate Results
• Rule-based
• Statistical parsing + dependency labeling
• Classifier-based parsing
– Obtain even more variety
• SVM vs MBL
• Left-to-right vs right-to-left
52
Simple (Unweighted) Voting
• Each parser votes for each dependency
• Word-by-word
• Every vote has the same weight
53
Simple (Unweighted) Voting
He eats cake
Parser A              Parser B              Parser C
1 2 He   SUBJ         1 2 He   SUBJ         1 3 He   SUBJ
2 0 eats CMOD         2 0 eats ROOT         2 0 eats ROOT
3 1 cake OBJ          3 1 cake OBJ          3 2 cake OBJ

GOLD
1 2 He   SUBJ
2 0 eats ROOT
3 2 cake OBJ
54
Simple (Unweighted) Voting
He eats cake
Parser A              Parser B              Parser C
1 2 He   SUBJ         1 2 He   SUBJ         1 3 He   SUBJ
2 0 eats CMOD         2 0 eats ROOT         2 0 eats ROOT
3 1 cake OBJ          3 1 cake OBJ          3 2 cake OBJ

GOLD                  VOTED
1 2 He   SUBJ         1 2 He   SUBJ
2 0 eats ROOT         2 0 eats ROOT
3 2 cake OBJ          3 1 cake SUBJ
55
Simple (Unweighted) Voting
He eats cake
Parser A              Parser B              Parser C
1 2 He   SUBJ         1 2 He   SUBJ         1 3 He   SUBJ
2 0 eats CMOD         2 0 eats ROOT         2 0 eats ROOT
3 1 cake OBJ          3 1 cake OBJ          3 2 cake OBJ

GOLD                  VOTED
1 2 He   SUBJ         1 2 He   SUBJ
2 0 eats ROOT         2 0 eats ROOT
3 2 cake OBJ          3 1 cake OBJ
56
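A small sketch of one reasonable reading of word-by-word unweighted voting, with each parser's analysis given as a list of (index, head, word, label) tuples as in the example:

    from collections import Counter

    def simple_voting(analyses):
        """For each word, the (head, label) pair with the most votes wins."""
        voted = []
        words = {idx: word for idx, head, word, label in analyses[0]}
        for idx in sorted(words):
            votes = Counter(
                (head, label)
                for analysis in analyses
                for i, head, word, label in analysis
                if i == idx
            )
            (head, label), _ = votes.most_common(1)[0]
            voted.append((idx, head, words[idx], label))
        return voted

    parser_a = [(1, 2, "He", "SUBJ"), (2, 0, "eats", "CMOD"), (3, 1, "cake", "OBJ")]
    parser_b = [(1, 2, "He", "SUBJ"), (2, 0, "eats", "ROOT"), (3, 1, "cake", "OBJ")]
    parser_c = [(1, 3, "He", "SUBJ"), (2, 0, "eats", "ROOT"), (3, 2, "cake", "OBJ")]
    print(simple_voting([parser_a, parser_b, parser_c]))
    # [(1, 2, 'He', 'SUBJ'), (2, 0, 'eats', 'ROOT'), (3, 1, 'cake', 'OBJ')]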
Weighted Voting
• Each parser has a weight
– Reflects confidence in parser’s GR
identification
• Instead of adding number of votes,
add the weight of votes
• Takes into account that some parsers are
better than others
57
Weighted Voting
He eats cake
Parser A (0.4)        Parser B (0.3)        Parser C (0.8)
1 2 He   SUBJ         1 2 He   SUBJ         1 3 He   SUBJ
2 0 eats CMOD         2 0 eats ROOT         2 0 eats ROOT
3 1 cake OBJ          3 1 cake OBJ          3 2 cake OBJ

GOLD                  VOTED
1 2 He   SUBJ         1 3 He   SUBJ
2 0 eats ROOT         2 0 eats ROOT
3 2 cake OBJ          3 2 cake OBJ
58
Label-Weighted Voting
• Not just one weight per parser, but
one weight for each GR for each parser
• Takes into account specific strengths of
each parser
59
Label-Weighted Voting
He eats cake
Parser A                  Parser B                  Parser C
1 2 He   SUBJ (0.7)       1 2 He   SUBJ (0.8)       1 3 He   SUBJ (0.6)
2 0 eats CMOD (0.3)       2 0 eats ROOT (0.9)       2 0 eats ROOT (0.7)
3 1 cake OBJ  (0.5)       3 1 cake OBJ  (0.3)       3 2 cake OBJ  (0.9)

GOLD                      VOTED
1 2 He   SUBJ             1 2 He   SUBJ
2 0 eats ROOT             2 0 eats ROOT
3 2 cake OBJ              3 2 cake OBJ
60
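A sketch of label-weighted voting; plain weighted voting is the special case where every GR of a parser gets the same weight. The weight(parser, label) lookup of per-parser, per-GR confidences is assumed (e.g. estimated from held-out F-scores):

    from collections import defaultdict

    def label_weighted_voting(analyses, weight):
        """analyses: list of (parser_name, dependency_list) pairs, where each
        dependency list holds (index, head, word, label) tuples for one sentence.
        weight(parser_name, label) -> confidence in that parser for that GR."""
        scores = defaultdict(float)          # (index, head, label) -> summed weight
        words = {}
        for parser, deps in analyses:
            for idx, head, word, label in deps:
                scores[(idx, head, label)] += weight(parser, label)
                words[idx] = word
        voted = []
        for idx in sorted(words):
            head, label = max(
                ((h, l) for (i, h, l) in scores if i == idx),
                key=lambda hl: scores[(idx, hl[0], hl[1])],
            )
            voted.append((idx, head, words[idx], label))
        return voted

On the example above this picks (2, SUBJ) for "He" (0.7 + 0.8 = 1.5 versus 0.6) and (2, OBJ) for "cake" (0.9 versus 0.5 + 0.3 = 0.8), matching the VOTED analysis.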
Voting Produces Very Accurate Results
• Parsers
– Rule-based
– Statistical based on Charniak parser
– Classifier-based
• Left-to-right SVM
• Right-to-left SVM
• Left-to-right MBL
• Simple Voting: 88.0%
• Weighted Voting: 89.1%
• Label-weighted Voting: 92.1%
61
Precision and Recall of Specific GRs
GR      Precision  Recall  F-score
SUBJ    0.98       0.98    0.98
OBJ     0.94       0.94    0.94
COORD   0.94       0.91    0.92
JCT     0.87       0.90    0.88
MOD     0.97       0.91    0.94
PRED    0.86       0.89    0.87
ROOT    0.97       0.96    0.96
COMP    0.75       0.67    0.71
XCOMP   0.90       0.88    0.89
62
Outline
• The CHILDES GR scheme
• GR Parsing of CHILDES transcripts
• Combining different strategies
• Weighted voting
• Combination as parsing
• Handling young child utterances
• Automated measurement of syntactic
development in child language
63
Voting May Not Produce
a Well-Formed Dependency Tree
• Voting on a word-by-word basis
• No guarantee of well-formedness
• Resulting set of dependencies may form a
graph with cycles, or may not even be fully
connected
– Technically not fully compliant with CHILDES
GR annotation scheme
64
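A small sketch of checking whether a voted set of dependencies actually forms a single tree rooted at the artificial root (index 0), i.e. connected and acyclic:

    def is_well_formed(deps, n):
        """deps: (index, head, word, label) tuples for words 1..n.
        True iff every word has exactly one head and every word reaches root 0
        without hitting a cycle."""
        heads = {idx: head for idx, head, word, label in deps}
        if len(deps) != n or sorted(heads) != list(range(1, n + 1)):
            return False                  # missing or duplicate word indices
        for idx in heads:
            seen = set()
            node = idx
            while node != 0:              # climb toward the root
                if node in seen or node not in heads:
                    return False          # cycle, or head outside the sentence
                seen.add(node)
                node = heads[node]
        return True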
Parser Combination as Reparsing
• Once several parsers have analyzed a
sentence, use their output to guide the
process of reparsing the sentence
• Two reparsing approaches
– Maximum spanning tree
– CYK (dynamic programming)
65
Dependency Parsing as Search for
Maximum Spanning Tree
• First, build a graph
– Each word in input sentence is a node
– Each dependency proposed by any of the parsers is
a weighted edge
– If multiple parsers propose the same dependency,
add weights into a single edge
• Then, simply find the MST
– Maximizes the votes
– Structure guaranteed to be a dependency tree
– May have crossing branches
66
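A sketch of the graph-building step: parser votes are accumulated as edge weights on a dependent-to-head graph, and a directed maximum spanning tree algorithm (e.g. Chu-Liu/Edmonds, not shown here) is then run over this graph:

    from collections import defaultdict

    def build_vote_graph(analyses, weight=None):
        """analyses: list of (parser_name, dependency_list) pairs; each dependency
        is (index, head, word, label). Returns edge weights and the best label
        seen for each (dependent, head) edge.
        weight(parser, label) is optional; with no weights every vote counts 1."""
        edge_weight = defaultdict(float)      # (dependent, head) -> summed votes
        label_weight = defaultdict(float)     # (dependent, head, label) -> votes
        for parser, deps in analyses:
            for idx, head, word, label in deps:
                w = weight(parser, label) if weight else 1.0
                edge_weight[(idx, head)] += w
                label_weight[(idx, head, label)] += w
        best_label = {}
        for (idx, head, label), w in label_weight.items():
            key = (idx, head)
            if key not in best_label or w > label_weight[(idx, head, best_label[key])]:
                best_label[key] = label
        return edge_weight, best_label

    # The MST over edge_weight (rooted at 0) maximizes the total vote weight;
    # best_label then supplies the GR for each chosen dependency.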
Parser Combination with the CYK Algorithm
• The CYK algorithm uses dynamic programming
to find all parses for a sentence given a CFG
– Probabilistic version finds most probable parse
• Build a graph, as with MST
• Parse the sentence using CYK
– Instead of a grammar, consult the graph to determine
how to fill new cells in the CYK table
– Instead of probabilities, we use the weights from the
graph
67
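A naive sketch of the CYK-style reparsing idea: a chart over word spans, where adjacent spans are combined by adding a dependency between their head words, weighted by the vote graph (the edge_weight built above, with 0 as the artificial root). This is an illustrative O(n^5) version rather than the thesis implementation, and unlike the MST approach it yields a projective tree:

    def reparse_cyk(n, edge_weight):
        """edge_weight: dict mapping (dependent, head) -> summed parser votes,
        for words 1..n and artificial root 0. Returns (score, heads), where
        heads[i] is the chosen head of word i."""
        # best[(i, j, h)] = (score, partial head assignment) for the best tree
        # over words i..j whose head is word h
        best = {(i, i, i): (0.0, {}) for i in range(1, n + 1)}
        for length in range(2, n + 1):
            for i in range(1, n - length + 2):
                j = i + length - 1
                for k in range(i, j):                 # left = i..k, right = k+1..j
                    for h1 in range(i, k + 1):
                        for h2 in range(k + 1, j + 1):
                            ls, ld = best[(i, k, h1)]
                            rs, rd = best[(k + 1, j, h2)]
                            # attach one span's head under the other's
                            for head, dep in ((h1, h2), (h2, h1)):
                                s = ls + rs + edge_weight.get((dep, head), 0.0)
                                if (i, j, head) not in best or s > best[(i, j, head)][0]:
                                    best[(i, j, head)] = (s, {**ld, **rd, dep: head})
        score, root = max(
            (best[(1, n, h)][0] + edge_weight.get((h, 0), 0.0), h)
            for h in range(1, n + 1)
        )
        heads = dict(best[(1, n, root)][1])
        heads[root] = 0
        return score, heads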
Precision and Recall of Specific GRs
GR      Precision  Recall  F-score
SUBJ    0.98       0.98    0.98
OBJ     0.94       0.94    0.94
COORD   0.94       0.91    0.92
JCT     0.87       0.90    0.88
MOD     0.97       0.91    0.94
PRED    0.86       0.89    0.87
ROOT    0.97       0.97    0.97
COMP    0.73       0.89    0.80
XCOMP   0.88       0.88    0.88
68
Outline
• The CHILDES GR scheme
• GR Parsing of CHILDES transcripts
• Combining different strategies
• Weighted voting
• Combination as parsing
• Handling young child utterances
• Automated measurement of syntactic
development in child language
69
Handling Young Child Utterances with
Rule-Based and Data-Driven Parsing
• Eve-child test set:
I need tapioca in the bowl.
That’s a hat.
In a minute.
* Warm puppy happiness a blanket.
* There briefcase.
? I drinking milk.
? I want Fraser hat.
70
Three Types of Sentences in One Corpus
• No problem
– High accuracy
• No GRs
– But data-driven systems will output GRs
• Missing words, agreement errors, etc.
– GRs are fine, but a challenge for data-driven
systems trained on fully grammatical
utterances
71
To Analyze or Not To Analyze:
Ask the Rule-Based Parser
• Utterances with no GRs are annotated in
test corpus as such
• Rule-based parser set to high precision
– Same grammar as before
• If a sentence cannot be parsed with the rule-based system, output No GR.
– 88% Precision, 89% Recall
– Sentences are fairly simple
72
The Rule-Based Parser also
Identifies Missing Words
• If the sentence can be analyzed with the
rule-based system, check if any insertions
were necessary
– If inserted be or possessive marker ’s, insert
the appropriate lexical item in the sentence
• Parse the sentence with data-driven
systems, run combination
73
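A sketch of the decision flow described on the last two slides; the callables are placeholders for the high-precision rule-based parser and for the data-driven combination system:

    def analyze_child_utterance(sentence, rule_based_parse, data_driven_combination):
        """rule_based_parse(sentence) -> None if the grammar cannot parse the
        utterance (treated as having no GRs), otherwise the sentence with any
        words the rule-based parser had to insert (e.g. a missing 'is' or
        possessive 's) made explicit.
        data_driven_combination(sentence) -> GR analysis from the combined parsers."""
        repaired = rule_based_parse(sentence)
        if repaired is None:
            return "No GR"                # utterance judged to have no GRs
        return data_driven_combination(repaired)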
High Accuracy Analysis of
Challenging Utterances
• Eve-child test
– No rule-based first pass: 62.9% accuracy
• Many errors caused by GR analysis of words with
no GRs
– With rule-based pass: 88.0% accuracy
• 700 words from Naomi corpus
– No rule-based: 67.4%
– Rule-based, then combo: 86.8%
74
Outline
• The CHILDES GR scheme
• GR Parsing of CHILDES transcripts
• Combining different strategies
• Automated measurement of syntactic
development in child language
• Related work
• Conclusion
75
Index of Productive Syntax (IPSyn)
(Scarborough, 1990)
• A measure of child language development
• Assigns a numerical score for grammatical complexity
(from 0 to 112 points)
• Used in hundreds of studies
76
IPSyn Measures Syntactic Development
• IPSyn: Designed for investigating differences in
language acquisition
– Differences in groups (for example: bilingual children)
– Individual differences (for example: delayed language
development)
– Focus on syntax
• Addresses weaknesses of Mean Length of Utterance
(MLU)
– MLU surprisingly useful until age 3, then reaches ceiling (or
becomes unreliable)
• IPSyn is very time-consuming to compute
77
Computing IPSyn (manually)
• Corpus of 100 transcribed utterances
– Consecutive, no repetitions
• Identify 56 specific language structures (IPSyn Items)
– Examples:
• Presence of auxiliaries or modals
• Inverted auxiliary in a wh-question
• Conjoined clauses
• Fronted or center-embedded subordinate clauses
– Count occurrences (zero, one, two or more)
• Add counts
78
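For illustration: since each of the 56 items is counted as zero, one, or two-or-more occurrences, the total score (0 to 112 points) can be computed from per-item occurrence counts roughly as follows:

    def ipsyn_score(item_counts):
        """item_counts: occurrence count for each of the 56 IPSyn items,
        e.g. {'auxiliary_or_modal': 3, 'inverted_aux_wh_question': 1, ...}.
        Each item contributes at most 2 points, so the score ranges 0-112."""
        return sum(min(count, 2) for count in item_counts.values())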
Automating IPSyn
• Existing state of manual computation
– Spreadsheets
– Search each sentence for language structures
– Use part-of-speech tagging to narrow down the
number of sentences for certain structures
• For example: Verb + Noun, Determiner + Adjective + Noun
• Automatic computation is possible with accurate
GR analysis
– Use GRs to search for IPSyn items
79
Some IPSyn Items Require Syntactic Analysis for
Reliable Recognition
(and some don’t)
• Determiner + Adjective + Noun
• Auxiliary verb
• Adverb modifying adjective or nominal
• Subject + Verb + Object
• Sentence with 3 clauses
• Conjoined sentences
• Wh-question with inverted auxiliary/modal/copula
• Relative clauses
• Propositional complements
• Fronted subordinate clauses
• Center-embedded clauses
80
Automating IPSyn with
Grammatical Relation Analyses
• Search for language structures using patterns that
involve POS tags and GRs (labeled dependencies)
• Examples
– Wh-embedded clauses: search for wh-words whose head (or
transitive head) is a dependent in a GR of types [XC]SUBJ,
[XC]PRED, [XC]JCT, [XC]MOD, COMP or XCOMP
– Relative clauses: search for a CMOD where the dependent is to
the right of the head
81
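A sketch of the kind of GR pattern used for the relative-clause item, with dependency tuples as elsewhere in these slides:

    def has_relative_clause(deps):
        """deps: (index, head, word, label) tuples for one utterance.
        Relative-clause pattern: a CMOD whose dependent is to the right of its head."""
        return any(label == "CMOD" and idx > head for idx, head, word, label in deps)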
Evaluation Data
• Two sets of transcripts with IPSyn scoring from two different child
language research groups
• Set A
– Scored fully manually
– 20 transcripts
– Ages: about 3 yrs.
• Set B
– Scored with CP first, then manually corrected
– 25 transcripts
– Ages: about 8 yrs.
(Two transcripts in each set were held out for development and debugging)
82
Evaluation Metrics:
Point Difference
• Point difference
– The absolute point difference between the scores
provided by our system, and the scores computed
manually
– Simple, and shows how close the automatic scores
are to the manual scores
– Acceptable range
• Smaller for older children
83
Evaluation Metrics:
Point-to-Point Accuracy
• Point-to-point accuracy
– Reflects overall reliability over each scoring decision
made in the computation of IPSyn scores
– Scoring decisions: presence or absence of language
structures in the transcript
Point-to-Point Acc = C(Correct Decisions)
C(Total Decisions)
– Commonly used for assessing inter-rater reliability
among human scorers (for IPSyn, about 94%).
84
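A small sketch of the two metrics, assuming paired per-transcript scores and parallel per-decision (item present/absent) judgments:

    def point_difference(auto_scores, manual_scores):
        """Mean absolute difference between automatic and manual IPSyn scores,
        paired by transcript."""
        return sum(abs(a - m) for a, m in zip(auto_scores, manual_scores)) / len(auto_scores)

    def point_to_point_accuracy(auto_decisions, manual_decisions):
        """auto_decisions, manual_decisions: parallel lists of presence/absence
        decisions for each IPSyn item in each transcript."""
        correct = sum(a == m for a, m in zip(auto_decisions, manual_decisions))
        return correct / len(manual_decisions)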
Results
• IPSyn scores from
– Our GR-based system (GR)
– Manual scoring (HUMAN)
– Computerized Profiling (CP)
• Long, Fey and Channell, 2004
85
GR-based IPSyn Is Quite Accurate
System      Avg. Point Difference to HUMAN   Point-to-point Reliability (%)
GR (total)  3.3                              92.8
CP (total)  8.3                              85.4
GR (set A)  3.7                              92.5
CP (set A)  6.2                              86.2
GR (set B)  2.9                              93.0
CP (set B)  10.2                             84.8
86
GR-Based IPSyn Close to Human Scoring
• Automatic scores very reliable
• Validates usefulness of
– GR annotation scheme
– Automatic GR analysis
• Validates analysis over a large set of
children of different ages
87
Outline
• The CHILDES GR scheme
• GR Parsing of CHILDES transcripts
• Combining different strategies
• Automated measurement of syntactic
development in child language
• Related work
• Conclusion
88
Related Work
• GR schemes, GR evaluation:
– Carroll, Briscoe & Sanfilippo,
1998
– Lin, 1998
– Yeh, 2000
– Preiss, 2003
• Rule-based robust parsing
– Heeman & Allen, 2001
– Lavie, 1996
– Rosé & Lavie, 2001
• Parsing
– Carroll & Briscoe, 2002
– Briscoe & Carroll, 2002
– Buchholz, 2002
– Tomita, 1987
– Magerman, 1995
– Ratnaparkhi, 1997
– Collins, 1997
– Charniak, 2000
• Deterministic parsing
– Yamada & Matsumoto,
2003
– Nivre & Scholz, 2004
• Parser Combination
– Henderson & Brill, 1999
– Brill & Wu, 1998
– Yeh, 2000
– Sarkar, 2001
• Automatic measurement of
grammatical complexity
– Long, Fey & Channell, 2004
89
Outline
• The CHILDES GR scheme
• GR Parsing of CHILDES transcripts
• Combining different strategies
• Automated measurement of syntactic
development in child language
• Related work
• Conclusion
90
Major Contributions
• An annotation scheme based on GRs for
syntactic structure in CHILDES transcripts
• A linear-time classifier-based parser for
constituent structures
• The development of rule-based and data-driven
approaches to GR analysis
– Precision/recall trade-off using insertions and skipping
– Data-driven GR analysis using existing resources
• Charniak parser, Penn Treebank
– Parser variety in classifier-based dependency parsing
91
Major Contributions (2)
• The use of different voting schemes for
combining dependency analyses
– Surpasses state-of-the-art in WSJ dependency
parsing
– Vastly outperforms individual parsing approaches
• A novel reparsing combination scheme
– Maximum spanning trees, CYK
• An accurate automated tool for measurement of
syntactic development in child language
– Validates annotation scheme and quality of GR
analyses
92
Possible Future Directions
• Classifier-based parsing
– Beam search keeping linear time
– Tree classification (Kudo & Matsumoto, 2004)
• Parser combination
– Parser variety, reparsing combination with constituent
trees
• Automated measurement of grammatical
complexity
– Take precision/recall into account
– A data-driven approach to replace search rules
• Other languages
93
More on Dependency Voting
• On WSJ data: 93.9% unlabeled accuracy
• On Eve data
– No RB: 91.1%
• COMP: 50%
– No charn, No RB: 89.1%
• COMP: 50%, COORD: 84%, ROOT: 95%
– No charn: 90.5%
• COMP: 67%
– No RL, no MBL: 91.8%
98
Full GR Results
GR (gold count / parsed count) : Precision  Recall  F-score
XJCT ( 2 / 2) : 1.00 1.00 1.00
OBJ ( 90 / 91) : 0.95 0.96 0.95
NEG ( 26 / 25) : 1.00 0.96 0.98
SUBJ ( 180 / 181) : 0.98 0.98 0.98
INF ( 19 / 19) : 1.00 1.00 1.00
POBJ ( 48 / 51) : 0.92 0.98 0.95
XCOMP ( 23 / 23) : 0.88 0.88 0.88
QUANT ( 4 / 4) : 1.00 1.00 1.00
VOC ( 2 / 2) : 1.00 1.00 1.00
TAG ( 1 / 1) : 1.00 1.00 1.00
CPZR ( 10 / 9) : 1.00 0.90 0.95
PTL ( 6 / 6) : 0.83 0.83 0.83
COORD ( 33 / 33) : 0.91 0.91 0.91
COMP ( 18 / 18) : 0.71 0.89 0.80
AUX ( 74 / 78) : 0.94 0.99 0.96
CJCT ( 6 / 5) : 1.00 0.83 0.91
PRED ( 54 / 55) : 0.87 0.89 0.88
DET ( 45 / 47) : 0.96 1.00 0.98
MOD ( 94 / 89) : 0.97 0.91 0.94
ROOT ( 239 / 238) : 0.97 0.96 0.96
PUNCT ( 286 / 286) : 1.00 1.00 1.00
COM ( 45 / 44) : 0.93 0.91 0.92
ESUBJ ( 2 / 2) : 1.00 1.00 1.00
CMOD ( 3 / 3) : 1.00 1.00 1.00
JCT ( 78 / 84) : 0.85 0.91 0.88
99
Weighted Voting
Parser A (0.4)        Parser B (0.3)        Parser C (0.8)
1 2 He   SUBJ         1 2 He   SUBJ         1 3 He   SUBJ
2 0 eats CMOD         2 0 eats ROOT         2 0 eats ROOT
3 1 cake OBJ          3 1 cake OBJ          3 2 cake OBJ

(He: head 2 gets weight 0.4 + 0.3 = 0.7; head 3 gets weight 0.8)

GOLD                  VOTED
1 2 He   SUBJ         1 3 He   SUBJ
2 0 eats ROOT         2 0 eats ROOT
3 2 cake OBJ          3 2 cake OBJ
100
Weighted Voting
Parser A                  Parser B                  Parser C
1 2 He   SUBJ (0.7)       1 2 He   SUBJ (0.8)       1 3 He   SUBJ (0.6)
2 0 eats CMOD (0.3)       2 0 eats ROOT (0.9)       2 0 eats ROOT (0.7)
3 1 cake OBJ  (0.5)       3 1 cake OBJ  (0.3)       3 2 cake OBJ  (0.9)

(He: head 2 gets SUBJ weight 0.7 + 0.8 = 1.5; head 3 gets SUBJ weight 0.6)

GOLD                      VOTED
1 2 He   SUBJ             1 2 He   SUBJ
2 0 eats ROOT             2 0 eats ROOT
3 2 cake OBJ              3 2 cake OBJ
101