[Noun Phrase Chunking] in [Hebrew] – [Influence] of [Lexical and Morphological Features]
[Yoav Goldberg], [Meni Adler] and [Michael Elhadad]
[Ben Gurion University of the Negev]
NP Chunking – Task Definition
• Identifying simple noun phrases in natural
language text.
[Pierre Vinken], [61 years] old, will join [the
board] as a [nonexecutive director]
Chunking as Tagging
• BIO tags – classify each word as
Begin/Inside/Outside
[Pierre/B Vinken/I] ,/O [61/B years/I] old/O
,/O will/O join/O [the/B board/I] as/O a/O
[nonexecutive/B director/I]
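The tagging view can be turned back into chunks mechanically. The following is a minimal sketch, assuming plain B/I/O tags with no chunk type suffix; the function name `bio_to_chunks` is illustrative and not part of any toolkit mentioned in these slides.

```python
def bio_to_chunks(tokens, tags):
    """Group tokens into chunks from parallel B/I/O tag sequences."""
    chunks, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B" or (tag == "I" and not current):
            # "B" always opens a new chunk; a stray "I" is treated like "B".
            if current:
                chunks.append(current)
            current = [token]
        elif tag == "I":
            current.append(token)
        else:
            # "O" closes any chunk that is still open.
            if current:
                chunks.append(current)
            current = []
    if current:
        chunks.append(current)
    return [" ".join(chunk) for chunk in chunks]


tagged = ("Pierre/B Vinken/I ,/O 61/B years/I old/O ,/O will/O join/O "
          "the/B board/I as/O a/O nonexecutive/B director/I")
tokens, tags = zip(*(item.rsplit("/", 1) for item in tagged.split()))
chunks = bio_to_chunks(tokens, tags)
print(chunks)
```

Because every token gets exactly one tag, any per-token classifier can be used for chunking, and the chunks are recovered in a single left-to-right pass.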
What is a Simple NP?
• In English – any non-recursive noun phrase
(these are called Base NPs)
[[A team] of [[researchers] from [[the National
Cancer Institute] and [[the medical schools]
of [[Harvard University] and [Boston
University]]]]]]
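In a bracketed structure like the one above, the non-recursive NPs are the brackets that contain no other bracket. A minimal sketch of that extraction, assuming well-balanced square brackets (the function name `innermost_brackets` is hypothetical):

```python
def innermost_brackets(text):
    """Return the bracketed spans that contain no nested bracket."""
    spans = []
    starts = []      # positions of currently open "["
    has_child = []   # parallel flags: did this bracket contain another one?
    for i, ch in enumerate(text):
        if ch == "[":
            if has_child:
                has_child[-1] = True   # the enclosing bracket is recursive
            starts.append(i)
            has_child.append(False)
        elif ch == "]":
            start = starts.pop()
            if not has_child.pop():
                # Normalize internal whitespace of the innermost span.
                spans.append(" ".join(text[start + 1:i].split()))
    return spans


nested = ("[[A team] of [[researchers] from [[the National Cancer Institute] "
          "and [[the medical schools] of [[Harvard University] and "
          "[Boston University]]]]]]")
base_nps = innermost_brackets(nested)
print(base_nps)
```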
The Case of Hebrew
• Hebrew Treebank -- ~5,000 sentences,
manually tagged for full morphological and
syntactic structure.
• From there, we extracted all non-recursive
NPs.
Hebrew Base NPs
• 99% Precision and Recall – but…
• Avg. number of words in each NP: 1.39
[תופעה] זו התבררה אתמול בוועדת [ה עבודה] ו [ה רווחה] של [ה כנסת], שדנה בנושא העסקת [עובדים זרים]
This [issue] was highlighted yesterday in [the
Knesset]’s committee-of [Work] and [Welfare]
which discussed the issue-of employment-of
[foreign workers]
Some Hebrew Syntax
• Adjectives
– Appear after the noun
– Agreement on gender and number
– Definite article marker appears on both noun
and modifier
‫התפוח הירוק‬
the-apple(sing,m) the-green(sing,m)
The Green Apple
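The three adjective constraints can be written as a simple check. This is only an illustrative sketch: the `Token` structure, its field names, and the transliterations are assumptions made for the example, not the representation used in the treebank.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    form: str       # transliterated surface form (illustrative)
    pos: str        # "NOUN" or "ADJ"
    gender: str     # "m" or "f"
    number: str     # "sing" or "plur"
    definite: bool  # True if the token carries the definite article


def adjective_modifies(noun: Token, adj: Token) -> bool:
    """True if adj, placed right after noun, agrees with it on
    gender, number and definiteness."""
    return (noun.pos == "NOUN" and adj.pos == "ADJ"
            and noun.gender == adj.gender
            and noun.number == adj.number
            and noun.definite == adj.definite)


apple = Token("ha-tapuax", "NOUN", "m", "sing", True)
green = Token("ha-yarok", "ADJ", "m", "sing", True)
green_fem = Token("ha-yeruka", "ADJ", "f", "sing", True)

print(adjective_modifies(apple, green))      # agreeing pair
print(adjective_modifies(apple, green_fem))  # gender mismatch
```

The argument order encodes the word order constraint: the noun comes first and the adjective follows it.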
Some Hebrew Syntax
• Word order
– Can be either SVO or VSO
• Definite direct objects are marked with the
special preposition את (‘et’), which has no
analog in English.
Some Hebrew Syntax
• Construct State (smixut)
– Hebrew’s genitive case
– <N1 NP2> construct: NP2 modifies N1
– N1 is morphologically marked with the
‘construct’ feature.
– The construct form often has no overt marking.
• Smixut occurs in ~40% of NPs.
‫בית ספר‬
House-of Book
School
Construct State
• The definite article is placed on NP2
‫מחוסרי הבית‬
people-without-of the-House
The Homeless people
Construct State
• Construct compounds may be nested.
• The innermost NP2 receives the explicit
definite marker
‫שומר בית הספר‬
Guard-of house-of the-book
The school’s guard
Construct State
• NP2 can be a coordination
‫וועדת העבודה והרווחה‬
Committee-of the-work and-the-welfare
The work and welfare committee
Construct State
• Unlike English NN compounds, Hebrew
construct makes any inner N a potential NP.
• For all <N NP> constructs, N cannot be part
of any non-recursive chunk.
Possessive
Four possible expressions of the generalized
possessive in NPs:
• The של (‘shel’) preposition
• Possessive pronoun suffix
• Smixut
• Double possessive
Possessive: the Shel preposition
Unlike the English ‘of’, the של (‘shel’) possessor:
• expresses only possession
• appears only in the context of NPs (no PP
attachment problem).
→ Shel never introduces attachment
ambiguity
Double Possessive
• The construct state in possessive usage
frequently alternates with a possessive suffix
on the head noun followed by a של possessor
(double possessive)
‫משרד ראש הממשלה‬
Office-of head-of the-government
‫משרדו של ראש הממשלה‬
Office-his of head-of the-government
The Prime Minister’s Office
Defining Hebrew Simple Chunks
• Need to be automatically extracted from the
treebank.
• Do not require non-local decisions or create
structural ambiguity.
• Non-recursive NPs are far too uninformative.
Defining Hebrew Simple Chunks
• Instead of defining what is inside a simple
NP, we decide what is definitely out:
Some Resulting Examples
‫• מזכיר תנועת ה מושבים‬
• The secretary of the Moshav movement
‫• דוברת שירות ה תעסוקה‬
• The employment office spokeswoman
‫• מאמרו של תום שגב‬
• Tom Segev’s article
Some Longer Examples
• ה בחירות ה מוקדמות של ה מפלגה ה דמוקרטית
• The Democrat party primary elections
• ה התרוצצות ה בזבזנית של עסקני ה ספורט
• The wasteful agitation of the sport gamblers
Longer Examples
• ה טקסים מרובי ה שירים ה עצובים ו קטעי ה קריאה ו הורדת ה דגל
• The-ceremonies with-the-many sad songs
and reading passages and lowering of the
flag
Comparison with English
Average number of tokens:
• In a Hebrew Simple NP: 2.49
• In English Base NPs: 2.17
The Hebrew Task is Harder
SVM Based Learning
• Handle very large feature sets
• Generalize well
• Have been successfully applied to chunking
– Best results for English
– Work well for Arabic
• YamCha toolkit
Basic SVM Experiment
Polynomial kernel of degree 2, C=1, pair-wise voting
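To make the two less familiar settings concrete, here is a minimal sketch of a degree-2 polynomial kernel over binary features and of pair-wise voting among B/I/O classifiers. The kernel form (1 + x·y)^d is a common choice and is assumed here; the slides state only the degree. The `decide` function stands in for trained binary SVMs and is purely hypothetical.

```python
from collections import Counter
from itertools import combinations


def poly_kernel(x, y, degree=2):
    """(1 + x.y)^degree, where x and y are sets of active binary features,
    so the dot product is the size of their intersection."""
    return (1 + len(x & y)) ** degree


def pairwise_vote(classes, decide):
    """One classifier per pair of classes; each casts one vote and the
    class with the most votes wins. decide(a, b) returns a or b."""
    votes = Counter(decide(a, b) for a, b in combinations(classes, 2))
    return votes.most_common(1)[0][0]


# Two tokens sharing two of their three active features.
x = {"w0=board", "p0=NN", "p-1=DT"}
y = {"w0=director", "p0=NN", "p-1=DT"}
similarity = poly_kernel(x, y)

# Toy stand-in for the three pair-wise SVMs (B-vs-I, B-vs-O, I-vs-O).
scores = {"B": 0.2, "I": 0.7, "O": 0.1}
winner = pairwise_vote(["B", "I", "O"],
                       lambda a, b: a if scores[a] >= scores[b] else b)
print(similarity, winner)
```

A degree-2 kernel implicitly scores all pairs of features, which is how combinations such as "previous POS and current word" are taken into account without being listed explicitly.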
Basic SVM Results
Exp                        Tag Accuracy  Chunk Precision  Chunk Recall  F
WP, Clean                  97.49         92.54            92.35         92.44
WP, Noisy                  94.87         89.14            87.69         88.41
Baseline, Clean (EDP+TBL)  -             84.70            86.70         86.20
Basic SVM Results
Exp                                                  Accuracy  Precision  Recall  F
WP, Clean                                            97.49     92.54      92.35   92.44
WP, Noisy                                            94.87     89.14      87.69   88.41
ENGLISH (using weighted voting, Kudo and Matsumoto)  -         94.15      94.29   94.22
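The precision, recall and F figures are chunk-level scores. A minimal sketch of how such scores are typically computed, assuming that a predicted chunk counts as correct only if both of its boundaries match a gold chunk exactly, and that F is the harmonic mean of precision and recall (function names are illustrative):

```python
def chunk_spans(tags):
    """Return the set of (start, end) spans encoded by a B/I/O sequence."""
    spans, start = set(), None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes last chunk
        if tag in ("B", "O") and start is not None:
            spans.add((start, i))
            start = None
        if tag == "B" or (tag == "I" and start is None):
            start = i
    return spans


def f_measure(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if p + r else 0.0


def precision_recall_f(gold_tags, pred_tags):
    gold, pred = chunk_spans(gold_tags), chunk_spans(pred_tags)
    correct = len(gold & pred)
    p = correct / len(pred) if pred else 0.0
    r = correct / len(gold) if gold else 0.0
    return p, r, f_measure(p, r)


gold = ["B", "I", "O", "B", "I", "O"]
pred = ["B", "I", "O", "B", "O", "O"]   # second chunk is cut short
scores = precision_recall_f(gold, pred)
print(scores)

# Sanity check against the reported WP, Clean row.
print(round(f_measure(92.54, 92.35), 2))
```

Under this reading, a chunk with one wrong boundary costs both precision and recall, which is why chunk scores sit well below tag accuracy.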
Exploiting Linguistic Knowledge
• Hebrew is a morphologically rich language
• Construct is important for NPs
• Agreement on Gender and Number
• We have a morphological disambiguator
→ Let’s add morphological features
Enriching our Feature Set
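A rough sketch of what an enriched feature set could look like for one token. Reading the experiment names as W = word, P = part of speech, G = gender, N = number, C = construct is an assumption based on the abbreviations; the window size, feature naming, and transliterated tokens are likewise illustrative and not taken from the slides.

```python
# Hypothetical mapping from experiment name to the token fields it uses.
FEATURE_SETS = {
    "WP": ("word", "pos"),
    "WPG": ("word", "pos", "gender"),
    "WPN": ("word", "pos", "number"),
    "WPC": ("word", "pos", "construct"),
    "WPNC": ("word", "pos", "number", "construct"),
}


def window_features(sentence, i, fields=("word", "pos"), window=2):
    """Features for token i, drawn from tokens i-window .. i+window."""
    feats = {}
    for offset in range(-window, window + 1):
        j = i + offset
        for field in fields:
            inside = 0 <= j < len(sentence)
            feats[f"{field}[{offset:+d}]"] = sentence[j][field] if inside else "<PAD>"
    return feats


# "Guard-of house-of the-book", transliterated.
sentence = [
    {"word": "shomer", "pos": "NN", "gender": "m", "number": "sing", "construct": True},
    {"word": "beit", "pos": "NN", "gender": "m", "number": "sing", "construct": True},
    {"word": "ha-sefer", "pos": "NN", "gender": "m", "number": "sing", "construct": False},
]
feats = window_features(sentence, 1, FEATURE_SETS["WPNC"], window=1)
for name in sorted(feats):
    print(name, feats[name])
```

Each added field multiplies the number of features per token, which may help explain why using every feature at once does not have to be the best configuration.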
Results (clean data)
Exp    Accuracy  Precision  Recall  F
WP     97.49     92.54      92.35   92.44
WPG    97.44     92.45      92.23   92.34
WPC    97.60     92.87      93.32   93.1
WPN    97.5      92.7       92.4    92.55
WPNC   97.61     92.99      93.41   93.20
ALL    96.68     90.21      90.60   90.40
Results (noisy data)
Exp    Accuracy  Precision  Recall  F
WP     94.87     89.14      87.69   88.41
WPNC   96.99     91.49      91.32   91.4
Delta  +2.12     +2.35      +3.6    +2.99
Error Analysis – Types of Errors
• Conjunction related
– [work and welfare committee] → [work] and [welfare committee]
– [soldiers] and [police officers] → [soldiers and police officers]
• Split / Merge
– [community work] → [community] [work]
– Unlike [some people] [others]… → Unlike [some people others]…
• Short / Long
– [a] b → [a b]
– a [b] → [a b]
– a [b] c → [a b c]
– red [car], etc.
• Whole Chunk
– A chunk was not caught at all, or an extra chunk was caught
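One possible way to assign these labels automatically is to compare gold and predicted chunk spans. The sketch below is an assumption about how the categories could be operationalized, not the procedure used for the analysis; it does not try to detect the conjunction-related category, which would need access to the tokens. Spans are (start, end) token offsets with an exclusive end.

```python
def overlaps(a, b):
    return a[0] < b[1] and b[0] < a[1]


def classify_errors(gold, pred):
    """Label each wrong predicted span and each entirely missed gold span."""
    errors = []
    for p in sorted(pred - gold):
        hits = [g for g in gold if overlaps(p, g)]
        if not hits:
            errors.append((p, "extra chunk"))
        elif len(hits) > 1:
            errors.append((p, "merge"))          # covers several gold chunks
        else:
            g = hits[0]
            siblings = [q for q in pred if overlaps(q, g)]
            if g[0] <= p[0] and p[1] <= g[1]:
                # Inside the gold chunk: one piece of a split, or too short.
                errors.append((p, "split" if len(siblings) > 1 else "short"))
            elif p[0] <= g[0] and g[1] <= p[1]:
                errors.append((p, "long"))
            else:
                errors.append((p, "boundary shift"))
    for g in sorted(gold - pred):
        if not any(overlaps(g, p) for p in pred):
            errors.append((g, "missed chunk"))
    return errors


gold = {(0, 2), (3, 4), (5, 6), (8, 9)}
pred = {(0, 1), (1, 2), (3, 6), (10, 11)}
errors = classify_errors(gold, pred)
for span, label in errors:
    print(span, label)
```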
Error Analysis – Effect of Number and Construct Info
Conclusions
• Non-recursive NPs aren’t suitable for the
Hebrew case.
• Hebrew simple NPs are more complex than
English base NPs
• SVM works well for Hebrew simple NPs
• Morphological features improve chunking
(but need to pick the features with care)
Future Work
• Using word lemmas as features
• Weighted voting over several SVM systems
that use different chunk tag sets and
different morphological features
• Handle Conjunctions
• Chunk VPs and APs.
Thank [You]