Treebanks, parsing, etc. - Department of Linguistics and English

Treebanks, parsing, etc.
Syntax and computers
• Parsing: input is sentence, output is tree (or
equivalent representation)
• Browsing:
– Finding particular syntactic structures within a
corpus of sentences
– Finding sentences that match a particular
syntactic construction
• Information retrieval, machine translation,
speech recognition, etc.
Why parsing is difficult:
Newspaper headlines
• Iraqi Head Seeks Arms
• Juvenile Court to Try Shooting Defendant
• Teacher Strikes Idle Kids
• Stolen Painting Found by Tree
• Local High School Dropouts Cut in Half
• Red Tape Holds Up New Bridges
• Clinton Wins on Budget, but More Lies Ahead
• Hospitals Are Sued by 7 Foot Doctors
• Kids Make Nutritious Snacks
Ambiguous headlines
POLICE BEGIN CAMPAIGN TO RUN DOWN JAYWALKERS
SAFETY EXPERTS SAY SCHOOL BUS PASSENGERS SHOULD BE BELTED
DRUNK GETS NINE MONTHS IN VIOLIN CASE
FARMER BILL DIES IN HOUSE
IRAQI HEAD SEEKS ARMS
PROSTITUTES APPEAL TO POPE
BRITISH LEFT WAFFLES ON FALKLAND ISLANDS
LUNG CANCER IN WOMEN MUSHROOMS
TEACHER STRIKES IDLE KIDS
ENRAGED COW INJURES FARMER WITH AXE
JUVENILE COURT TO TRY SHOOTING DEFENDANT
TWO SOVIET SHIPS COLLIDE, ONE DIES
WordNet subcat frames
1 Something ----s
2 Somebody ----s
3 It is ----ing
4 Something is ----ing PP
5 Something ----s something Adjective/Noun
6 Something ----s Adjective/Noun
7 Somebody ----s Adjective
8 Somebody ----s something
9 Somebody ----s somebody
10 Something ----s somebody
11 Something ----s something
12 Something ----s to somebody
13 Somebody ----s on something
14 Somebody ----s somebody something
15 Somebody ----s something to somebody
16 Somebody ----s something from somebody
17 Somebody ----s somebody with something
18 Somebody ----s somebody of something
19 Somebody ----s something on somebody
20 Somebody ----s somebody PP
21 Somebody ----s something PP
22 Somebody ----s PP
23 Somebody's (body part) ----s
24 Somebody ----s somebody to INFINITIVE
25 Somebody ----s somebody INFINITIVE
26 Somebody ----s that CLAUSE
27 Somebody ----s to somebody
28 Somebody ----s to INFINITIVE
29 Somebody ----s whether INFINITIVE
30 Somebody ----s somebody into V-ing something
31 Somebody ----s something with something
32 Somebody ----s INFINITIVE
33 Somebody ----s VERB-ing
34 It ----s that CLAUSE
35 Something ----s INFINITIVE
Soar 2003 Tutorial
English LCS lexicon
• Theta-grid information for verbs
• Derive ucat features
– used to build syntactic structure
• Co-referenced with WordNet2.0
– theta-grids are aligned with ucat features and
word sense information
English LCS lexicon data
10.6.a#1#_ag_th,mod-poss(of)#exonerate#exonerate#exonerate#exonerate+ed#
(2.0,00874318_exonerate%2:32:00::)
"10.6.a" :NAME "Verbs of Possessional Deprivation: Cheat Verbs / -of"
WORDS (absolve acquit balk bereave bilk bleed burgle cheat cleanse con cull cure
defraud denude deplete depopulate deprive despoil disabuse disarm
disencumber dispossess divest drain ease exonerate fleece free gull milk
mulct pardon plunder purge purify ransack relieve render rid rifle rob sap
strip swindle unburden void wean)
THETA_ROLES ((1 "_ag_th,mod-poss()") (1 "_ag_th,mod-poss(from)")
(1 "_ag_th,mod-poss(of)"))
SENTENCES "He !!+ed the people (of their rights); He !!+ed him of his sins"
Doing syntax with computers
• To do this you need a grammar
• So where do grammars come from?
– Grammar engineering
• Lovingly hand-crafted decades-long efforts by humans to
write grammars (typically in some particular grammar
formalism of interest to the linguists developing the
grammar).
– TreeBanks
• Semi-automatically generated sets of parse trees for the
sentences in some corpus. Typically in a generic lowest
common denominator formalism (of no particular
interest to any modern linguist).
TreeBanks
• TreeBanks provide a grammar (of a sort).
• Hence they provide the training data for various computer
applications that use syntax
• But they can also provide useful data for more purely
linguistic pursuits.
– You might have a theory about whether or not something can
happen in a particular language.
– Or a theory about the contexts in which something can happen.
– TreeBanks can give you the means to explore those theories, if you
can formulate the questions in the right way and get the data you
need.
A Penn Treebank sentence
( (S
    (NP-SBJ (DT The) (NN move))
    (VP (VBD followed)
      (NP
        (NP (DT a) (NN round))
        (PP (IN of)
          (NP
            (NP (JJ similar) (NNS increases))
            (PP (IN by)
              (NP (JJ other) (NNS lenders)))
            (PP (IN against)
              (NP (NNP Arizona) (JJ real) (NN estate) (NNS loans))))))
      (, ,)
      (S-ADV
        (NP-SBJ (-NONE- *))
        (VP (VBG reflecting)
          (NP
            (NP (DT a) (VBG continuing) (NN decline))
            (PP-LOC (IN in)
              (NP (DT that) (NN market)))))))
    (. .)))
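The bracketing above is easy to read into a data structure. The following is a minimal sketch, not the Treebank's own tooling: the function names `parse_bracketed` and `tagged_words` are made up for illustration, and the tree is stored as nested `(label, children)` tuples with words as plain strings.

```python
import re

SENTENCE = """
( (S
    (NP-SBJ (DT The) (NN move))
    (VP (VBD followed)
      (NP
        (NP (DT a) (NN round))
        (PP (IN of)
          (NP
            (NP (JJ similar) (NNS increases))
            (PP (IN by) (NP (JJ other) (NNS lenders)))
            (PP (IN against)
              (NP (NNP Arizona) (JJ real) (NN estate) (NNS loans))))))
      (, ,)
      (S-ADV
        (NP-SBJ (-NONE- *))
        (VP (VBG reflecting)
          (NP
            (NP (DT a) (VBG continuing) (NN decline))
            (PP-LOC (IN in) (NP (DT that) (NN market)))))))
    (. .)))
"""

def parse_bracketed(text):
    """Parse a bracketed tree into nested (label, children) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read_node():
        nonlocal pos
        pos += 1  # consume "("
        if tokens[pos] == "(":
            label = ""  # the outermost wrapper bracket has no label
        else:
            label = tokens[pos]
            pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())
            else:
                children.append(tokens[pos])  # a word
                pos += 1
        pos += 1  # consume ")"
        return (label, children)

    return read_node()

def tagged_words(node):
    """Yield (word, part-of-speech tag) pairs in sentence order."""
    label, children = node
    for child in children:
        if isinstance(child, str):
            yield (child, label)
        else:
            yield from tagged_words(child)

tree = parse_bracketed(SENTENCE)
# Drop empty elements such as the trace "*" under -NONE-.
words = [w for w, tag in tagged_words(tree) if tag != "-NONE-"]
print(" ".join(words))
```

Filtering on the `-NONE-` tag removes the empty subject of the adverbial clause, so the printed string is the sentence as it appeared in the newspaper text.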
Equivalent representations
• PS tree (phrase-markers)
• Bracketed labeling
• Automaton
• F-structure
Bracketed labeling
[IP [NP [Det The] [N dog]] [VP [V barked] [PP [P at] [NP [Det the] [N boy]]]]]
An automaton
F-structure
Time to be flexible!
• We have learned a way to diagram parse
trees; it involves certain assumptions
• Not everybody agrees with all of these
assumptions
• In fact, very few people agree on very many
specifics at all
• Syntax resources reflect this diversity
• Hence the need to be flexible
Classical grammar engineering
• Write rules with associated lexicon
– Rules:
S → NP VP
NP → (DT) NN
NP → NN NNS
NP → NNP
VP → V NP
– Lexicon:
NN → interest
NNS → rates
NNS → raises
VBP → interest
VBZ → rates
– Simple 10 rule grammar: 592 parses for some
ambiguous sentences
– Real-size broad-coverage grammar: millions of
parses for a complicated sentence
A simple grammar
Rule                 Probability
S  → NP VP           1.0
VP → V NP            0.7
VP → VP PP           0.3
PP → P NP            1.0
P  → with            1.0
V  → saw             1.0
NP → NP PP           0.4
NP → astronomers     0.1
NP → ears            0.18
NP → saw             0.04
NP → stars           0.18
NP → telescope       0.1
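With probabilities attached to rules, the probability of a parse tree is the product of the probabilities of the rules used to build it. The sketch below applies this to two parses of "astronomers saw stars with ears"; that sentence is an assumed example built from the grammar's vocabulary, not one given on the slide, and the names `RULES` and `tree_prob` are illustrative.

```python
# Rule probabilities from the simple grammar, keyed by (left-hand side, right-hand side).
RULES = {
    ("S", ("NP", "VP")): 1.0,
    ("VP", ("V", "NP")): 0.7,
    ("VP", ("VP", "PP")): 0.3,
    ("PP", ("P", "NP")): 1.0,
    ("P", ("with",)): 1.0,
    ("V", ("saw",)): 1.0,
    ("NP", ("NP", "PP")): 0.4,
    ("NP", ("astronomers",)): 0.1,
    ("NP", ("ears",)): 0.18,
    ("NP", ("saw",)): 0.04,
    ("NP", ("stars",)): 0.18,
    ("NP", ("telescope",)): 0.1,
}

def tree_prob(tree):
    """Multiply the probabilities of all rules used in a (label, children) tree."""
    label, children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = RULES[(label, rhs)]
    for child in children:
        if not isinstance(child, str):
            p *= tree_prob(child)
    return p

subject = ("NP", ["astronomers"])
with_ears = ("PP", [("P", ["with"]), ("NP", ["ears"])])

# Reading 1: the PP modifies the noun ("stars with ears").
np_attach = ("S", [subject,
                   ("VP", [("V", ["saw"]),
                           ("NP", [("NP", ["stars"]), with_ears])])])

# Reading 2: the PP modifies the verb phrase ("saw ... with ears").
vp_attach = ("S", [subject,
                   ("VP", [("VP", [("V", ["saw"]), ("NP", ["stars"])]),
                           with_ears])])

print("NP attachment:", tree_prob(np_attach))
print("VP attachment:", tree_prob(vp_attach))
```

The two trees share every rule except the attachment step, so the comparison reduces to NP → NP PP (0.4) with one use of VP → V NP, against VP → VP PP (0.3) on top of VP → V NP. Under these numbers the noun attachment comes out more probable.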
Ambiguity
• Tree for: Fed raises interest rates 0.5% in effort
to control inflation (NYT headline 5/17/00)
Local V/N ambiguities
Ambiguity
• Local ambiguity means that we have to deal
with multiple plausible choices during the
parsing process.
• Global ambiguity means that the grammar
can’t tell us which of several (many?) possible
parses is the correct one.
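Global ambiguity can be made concrete by counting how many complete parses a grammar assigns to a sentence. The following is a minimal sketch of a CKY-style chart that counts parses rather than building them. It assumes the rules of the simple grammar (without probabilities), and the example sentences and the name `count_parses` are illustrative choices, not material from the slides.

```python
from collections import defaultdict

# Binary rules, indexed by their right-hand side: (B, C) -> [A] for A → B C.
BINARY = {
    ("NP", "VP"): ["S"],
    ("V", "NP"): ["VP"],
    ("VP", "PP"): ["VP"],
    ("P", "NP"): ["PP"],
    ("NP", "PP"): ["NP"],
}

# Lexical rules: word -> categories. "saw" is locally ambiguous (V or NP).
LEXICON = {
    "astronomers": ["NP"],
    "ears": ["NP"],
    "saw": ["V", "NP"],
    "stars": ["NP"],
    "telescope": ["NP"],
    "with": ["P"],
}

def count_parses(words, root="S"):
    """Count the distinct parse trees rooted in `root` that cover all words."""
    n = len(words)
    # chart[(i, j)][A] = number of ways category A can span words i..j-1
    chart = defaultdict(lambda: defaultdict(int))
    for i, word in enumerate(words):
        for cat in LEXICON.get(word, []):
            chart[(i, i + 1)][cat] += 1
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):  # split point
                for b, nb in list(chart[(i, k)].items()):
                    for c, nc in list(chart[(k, j)].items()):
                        for a in BINARY.get((b, c), []):
                            chart[(i, j)][a] += nb * nc
    return chart[(0, n)][root]

print(count_parses("astronomers saw stars with ears".split()))                  # 2
print(count_parses("astronomers saw stars with ears with telescope".split()))   # 5
```

The noun reading of "saw" is entered in the chart but never ends up in a complete parse: that is local ambiguity. The two and five surviving parses are global ambiguity, and each extra prepositional phrase makes the count grow quickly.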
Two possible PP attachments
Sample treebank parse
Sample treebank sentence
Sample NP rules
Example
11/2/2011
CSCI 5832 Spring 2006
How many rules?
A sample parsed sentence
Not just newswire…
PP attachment ambiguity (German)
PP attachment in Chinese
Sample trees
Searching treebank corpora
• Online
– The Treebank Tool Suite
– The VISL website
– The NCLT website
• Offline
– Treebank corpus
– Search utilities: tgrep, tregex, etc.
tgrep
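Tools in the tgrep family search a treebank for nodes that stand in a given structural relation; a pattern such as `NP < PP` asks for NP nodes that immediately dominate a PP. The sketch below imitates that one relation in plain Python on a fragment of the Penn Treebank sentence. It is an illustration of the idea only, not tgrep itself, and the helper names (`parse`, `subtrees`, `immediately_dominates`) are made up for this example.

```python
import re

FRAGMENT = """
(NP
  (NP (DT a) (NN round))
  (PP (IN of)
    (NP
      (NP (JJ similar) (NNS increases))
      (PP (IN by) (NP (JJ other) (NNS lenders)))
      (PP (IN against)
        (NP (NNP Arizona) (JJ real) (NN estate) (NNS loans))))))
"""

def parse(text):
    """Read a bracketed tree into (label, children) tuples under a TOP node.
    Assumes every "(" is directly followed by a label."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    stack = [["TOP", []]]
    i = 0
    while i < len(tokens):
        if tokens[i] == "(":
            stack.append([tokens[i + 1], []])
            i += 2
        elif tokens[i] == ")":
            label, children = stack.pop()
            stack[-1][1].append((label, children))
            i += 1
        else:
            stack[-1][1].append(tokens[i])  # a word
            i += 1
    return ("TOP", stack[0][1])

def subtrees(node):
    """Yield every node of the tree, parents before children."""
    yield node
    for child in node[1]:
        if not isinstance(child, str):
            yield from subtrees(child)

def words(node):
    """Return the words covered by a node, in order."""
    out = []
    for child in node[1]:
        out.extend([child] if isinstance(child, str) else words(child))
    return out

def base(label):
    """Strip function tags: "NP-SBJ" becomes "NP"."""
    return label.split("-")[0]

def immediately_dominates(tree, parent, child):
    """Nodes labelled `parent` with at least one daughter labelled `child`."""
    return [n for n in subtrees(tree)
            if base(n[0]) == parent
            and any(not isinstance(c, str) and base(c[0]) == child for c in n[1])]

matches = immediately_dominates(parse(FRAGMENT), "NP", "PP")
for node in matches:
    print(" ".join(words(node)))
```

Run over a whole corpus instead of one fragment, the same loop returns every sentence containing the construction, which is the "browsing" use of treebanks described earlier.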
TiGer Treebank
[Figure: TiGer annotation of the sentence "Im nächsten Jahr will die Regierung ihre Reformpläne umsetzen." ("Next year the government wants to implement its reform plans.")]
• Node labels: phrase categories (S, VP, PP, NP)
• Edge labels: syntactic functions (HD, SB, OC, MO, OA, AC, NK)
• Crossing branches for discontinuous constituency types
• Annotation on word level: part-of-speech, morphology, lemmata

Word          POS       Lemma
Im            APPRART   in
nächsten      ADJA      nahe
Jahr          NN        Jahr
will          VMFIN     wollen
die           ART       die
Regierung     NN        Regierung
ihre          PPOSAT    ihr
Reformpläne   NN        Plan
umsetzen      VVINF     umsetzen
.             $.
Parallel treebanks
• Translation training and studies
• Machine translation (MT) research &
development
Aligning parses