[Noun Phrase Chunking] in [Hebrew] – [Influence] of [Lexical and Morphological Features]
[Yoav Goldberg], [Meni Adler] and [Michael Elhadad]
[Ben Gurion University of the Negev]

NP Chunking – Task Definition
• Identifying simple noun phrases in natural language text.
[Pierre Vinken], [61 years] old, will join [the board] as a [nonexecutive director]

Chunking as Tagging
• BIO tags – classify each word as Begin/Inside/Outside
[Pierre/B Vinken/I] ,/O [61/B years/I] old/O ,/O will/O join/O [the/B board/I] as/O a/O [nonexecutive/B director/I]

What is a Simple NP?
• In English – any non-recursive noun phrase (these are called Base NPs)
[[A team] of [[researchers] from [[the National Cancer Institute] and [[the medical schools] of [[Harvard University] and [Boston University]]]]]]

The Case of Hebrew
• Hebrew Treebank: ~5,000 sentences, manually tagged for full morphological and syntactic structure.
• From there, we extracted all non-recursive NPs.

Hebrew Base NPs
• 99% Precision and Recall – but…
• Avg. number of words in each NP: 1.39
[תופעה] זו התבררה אתמול בוועדת [ה עבודה] ו [ה רווחה] של [ה כנסת], שדנה בנושא העסקת [עובדים זרים]
This [issue] was highlighted yesterday in [the Knesset]’s committee-of [Work] and [Welfare] which discussed the issue-of employment-of [foreign workers]

Some Hebrew Syntax
• Adjectives
– Appear after the noun
– Agreement on gender and number
– Definite article marker appears on both noun and modifier
התפוח הירוק
the-apple(sing,m) the-green(sing,m)
The Green Apple

Some Hebrew Syntax
• Word Order
• Can be either SVO or VSO
• Definite direct objects are marked with the special preposition את (‘et’), which has no analog in English.
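
The BIO encoding from the "Chunking as Tagging" slide can be sketched in a few lines of Python. This is only an illustration, not the authors' tooling: the function name `chunks_to_bio` and the representation of chunks as (start, end) token index pairs are assumptions made for the example. The chunk boundaries follow the bracketing shown on the slide.

```python
def chunks_to_bio(tokens, chunks):
    """Turn chunk spans into one B/I/O tag per token.

    `chunks` is a list of (start, end) token indices, end exclusive.
    This span format is an assumption made for this sketch.
    """
    tags = ["O"] * len(tokens)          # every token starts Outside
    for start, end in chunks:
        tags[start] = "B"               # first token of the chunk
        for i in range(start + 1, end):
            tags[i] = "I"               # remaining tokens of the chunk
    return tags


tokens = ["Pierre", "Vinken", ",", "61", "years", "old", ",", "will",
          "join", "the", "board", "as", "a", "nonexecutive", "director"]
# Spans for the bracketed chunks in the slide's example sentence.
chunks = [(0, 2), (3, 5), (9, 11), (13, 15)]

tags = chunks_to_bio(tokens, chunks)
tagged = " ".join(f"{word}/{tag}" for word, tag in zip(tokens, tags))
print(tagged)
```

Once every token carries a tag, chunking becomes a per-token classification problem, which is what allows a standard classifier to be applied to it.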
Some Hebrew Syntax
• Construct State (smixut) – Hebrew’s genitive case
– <N1 NP2> construct: NP2 modifies N1
– N1 is morphologically marked with the ‘construct’ feature.
– The construct feature often has no overt marking on the word form.
• Smixut occurs in ~40% of NPs.
בית ספר
House-of Book
School

Construct State
• The definite article is placed on NP2
מחוסרי הבית
people-without-of the-House
The Homeless people

Construct State
• Construct compounds may be nested.
• The innermost NP2 receives the explicit definite marker
שומר בית הספר
Guard-of house-of the-book
The school’s guard

Construct State
• NP2 can be a coordination
וועדת העבודה והרווחה
Committee-of the-work and-the-welfare
The work and welfare committee

Construct State
• Unlike English NN compounds, Hebrew construct makes any inner N a potential NP.
• For all <N NP> constructs, N cannot be part of any non-recursive chunk.

Possessive
4 possible expressions of generalized possessive in NPs:
• ‘של’ (shel) preposition
• Possessive pronoun suffix
• Smixut
• Double possessive

Possessive: the Shel preposition
Unlike the English ‘of’, the ‘של’ possessor:
• expresses only possession
• appears only in the context of NPs (no PP attachment problem).
Shel never introduces attachment ambiguity

Double Possessive
• The construct state in possessive usage frequently alternates with the possessive case + של possessor (double possessive)
משרד ראש הממשלה
Office-of head-of the-government
משרדו של ראש הממשלה
Office-his of head-of the-government
The Prime Minister’s Office

Defining Hebrew Simple Chunks
• Need to be automatically extracted from the treebank.
• Do not require non-local decisions or create structural ambiguity
• Non-recursive NPs are far too uninformative

Defining Hebrew Simple Chunks
• Instead of defining what is inside a simple NP, we decide what is definitely out:

Some Resulting Examples
• מזכיר תנועת ה מושבים
– The secretary of the Moshav movement
• דוברת שירות ה תעסוקה
– The employment office spokeswoman
• מאמרו של תום שגב
– Tom Segev’s article

Some Longer Examples
• ה בחירות ה מוקדמות של ה מפלגה ה דמוקרטית
– The Democratic party primary elections
• ה התרוצצות ה בזבזנית של עסקני ה ספורט
– The wasteful agitation of the sport gamblers

Longer Examples
• ה טקסים מרובי ה שירים ה עצובים ו קטעי ה קריאה ו הורדת ה דגל
– The-ceremonies with-the-many sad songs and reading passages and lowering of the flag

Comparison with English
Average number of tokens:
• In a Hebrew Simple NP: 2.49
• In English Base NPs: 2.17
The Hebrew Task is Harder

SVM Based Learning
• Handles very large feature sets
• Good at generalizing
• Has been successfully applied to chunking
– Best results for English
– Works well for Arabic
• YAMCHA toolkit

Basic SVM Experiment
Polynomial kernel of degree 2, C=1, pair-wise voting

Basic SVM Results
Exp                          Tag Accuracy   Chunk Precision   Chunk Recall   F
WP, Clean                    97.49          92.54             92.35          92.44
WP, Noisy                    94.87          89.14             87.69          88.41
Baseline, Clean (EDP+TBL)    -              84.70             86.70          86.20

Basic SVM Results
Exp                                                    Accuracy   Precision   Recall   F
WP, Clean                                              97.49      92.54       92.35    92.44
WP, Noisy                                              94.87      89.14       87.69    88.41
ENGLISH (using weighted voting, Kudo and Matsumoto)    -          94.15       94.29    94.22

Exploiting Linguistic Knowledge
• Hebrew is a morphologically rich language
• Construct is important for NPs
• Agreement on Gender and Number
• We have a morphological disambiguator
Let’s add morphological features

Enriching our Feature Set

Results (clean data)
Exp     Accuracy   Precision   Recall   F
WP      97.49      92.54       92.35    92.44
WPG     97.44      92.45       92.23    92.34
WPC     97.60      92.87       93.32    93.1
WPN     97.5       92.7        92.4     92.55
WPNC    97.61      92.99       93.41    93.20
ALL     96.68      90.21       90.60    90.40

Results (noisy data)
Exp     Accuracy   Precision   Recall   F
WP      94.87      89.14       87.69    88.41
WPNC    96.99      91.49       91.32    91.4
Delta   +2.12      +2.35       +3.6     +2.99

Error Analysis – Types of Errors
• Conjunction related
– [work and welfare committee] vs. [work] and [welfare committee]
– [soldiers] and [police officers] vs. [soldiers and police officers]
• Split / Merge
– [community work] vs. [community] [work]
– Unlike [some people][others]… vs. Unlike [some people others]…
• Short / Long
– [a] b vs. [a b] (e.g. red [car])
– a [b] vs. [a b]
– a [b] c vs. [a b c]
• Whole Chunk
– a chunk was not caught at all, or an extra chunk was caught

Error Analysis – Effect of Number and Construct Info

Conclusions
• Non-recursive NPs aren’t suitable for the Hebrew case.
• Hebrew simple NPs are more complex than English base NPs
• SVM works well for Hebrew simple NPs
• Morphological features improve chunking (but need to pick the features with care)

Future Work
• Using word lemmas as features
• Weighted Voting for several SVM systems, using different chunk tags, as well as different morphological features
• Handle Conjunctions
• Chunk VPs and APs.

Thank [You]
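
Supplementary sketch: the results slides report chunk precision, recall and F. Below is a minimal Python illustration of how such chunk-level scores are commonly computed from BIO tag sequences, assuming that a predicted chunk counts as correct only when both of its boundaries match a gold chunk exactly. The function names (`bio_to_chunks`, `f_score`, `chunk_prf`) are hypothetical, and the slides do not state which scoring script was actually used.

```python
def bio_to_chunks(tags):
    """Collect (start, end) spans, end exclusive, from a B/I/O tag list."""
    chunks = set()
    start = None
    for i, tag in enumerate(tags):
        if tag in ("B", "O") and start is not None:
            chunks.add((start, i))      # a B or O closes the open chunk
            start = None
        if tag == "B":
            start = i                   # open a new chunk
        elif tag == "I" and start is None:
            start = i                   # assumption: a stray I opens a chunk
    if start is not None:
        chunks.add((start, len(tags)))  # close a chunk that runs to the end
    return chunks


def f_score(precision, recall):
    """Harmonic mean of precision and recall (balanced F)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def chunk_prf(gold_tags, pred_tags):
    """Chunk-level precision, recall and F under exact boundary match."""
    gold = bio_to_chunks(gold_tags)
    pred = bio_to_chunks(pred_tags)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall, f_score(precision, recall)


gold = ["B", "I", "O", "B", "I", "O", "B"]   # three gold chunks
pred = ["B", "I", "O", "B", "O", "O", "O"]   # two predicted, one exact match
precision, recall, f = chunk_prf(gold, pred)
print(precision, recall, f)
```

Under this definition a chunk that is one token too short earns no credit, which is why the "Short / Long" errors in the error analysis lower both precision and recall. The reported WP, Clean figures are consistent with the harmonic mean: `f_score(92.54, 92.35)` rounds to 92.44.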