COMP 791A: Statistical Language Processing
Corpus-Based Work (Chap. 4)

Using a Corpus
- Statistical NLP seeks to automatically learn lexical and structural preferences from corpora.
- To approximate the probability distribution of language events, we use a training corpus.

Corpus
- Large database of text & speech
- Many types of text corpora exist: plain text, domain-specific, tagged, parsed, parallel bilingual, ...
- Major suppliers:
  - Linguistic Data Consortium (LDC) -- www.ldc.upenn.edu
  - European Language Resources Association (ELRA) -- www.icp.grnet.fr/ELRA
- To derive the needed probabilities, a corpus needs to be:
  - large
  - a representative sample of the population of interest

Low-Level Formatting Issues
- Junk formatting & content
  - Removal of typesetter codes (ex. HTML tags), diagrams, tables, foreign words, etc.
  - Other problems if the data was retrieved through OCR (unrecognized words)
- Uppercase and lowercase
  - Should we keep the case or not?
  - "the", "The" and "THE" should all be treated the same
  - But in "George Brown" and "brown dog", "brown" should be treated separately...

Finding Tokens and Sentences
- Tokenization
  - Divide the input text into units (called tokens)
  - Each token is either a word or something else (ex. a number or a punctuation mark)
- Marking sentence boundaries
  - Most sentences end with '.', '?' or '!'
  - Can be confused by abbreviations

Tokenization -- What is a word?
- Graphic word (Kučera and Francis, 1967): "A string of contiguous alphanumeric characters with white spaces on either side; may include hyphens and apostrophes, but no other punctuation marks"
- But what about: "$22.50", "C++", ":-)"?
- Main problems:
  - Periods
    - Abbreviation or end of sentence? "etc.", "Calif.", "Wash."
    - Is the period part of the word or not? For "Wash." (Washington) we need to keep the period to distinguish it from "wash" (the verb)
  - Single apostrophes
    - Part of the word or not? "Peter's sick" --> 1 word or 2 words?
    - If 1 word, then problems in parsing... (S --> NP VP)
    - If 2 words, then should "Peter's house" also be considered 2 words?

Tokenization -- What is a word? (con't)
- Hyphens
  - Line-breaks inserted to improve justification of text
  - Ex: "e-mail", "pro-life", "data-base" / "database" / "data base"
- Diacritics
  - Remove them or not?
- Homographs
  - Should we distinguish 2 words that have the same spelling but unrelated senses?
  - "bow": part of a ship / a knot of ribbon
  - "saw": instrument / past tense of "to see"
- Word segmentation in other languages
  - Some languages have no whitespace! Ex: East Asian languages
  - In German: "life insurance company employee" = "Lebensversicherungsgesellschaftsangestellter"

Tokenization -- What is a word? (con't)
- Whitespace does not always indicate a word break
  - Ex: do we really want to separate the phrases "in spite of", "as a matter of fact", "work out"?
  - If not, then what do we do with non-adjacent phrasal verbs? "I could not work the answer out"
- Variant forms of some semantic types
  - Ex: telephone numbers: (514) 848-3074, +1 514 848 3074, +1 (514) 848 3074
- Speech corpora
  - More contractions and fillers (ex. "um", "well", "euh")
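To make the tokenization issues above concrete, here is a minimal sketch of a regex-based tokenizer. The token pattern and the abbreviation list are illustrative assumptions, not a complete solution to the problems listed.

```python
# Minimal sketch of a regex-based tokenizer; the pattern and the
# abbreviation list below are illustrative assumptions only.
import re

ABBREVIATIONS = {"etc.", "Calif.", "Wash.", "Mr.", "Dr."}  # illustrative list

# Order matters: more specific alternatives are tried first.
TOKEN_RE = re.compile(r"""
      \(\d{3}\)\s\d{3}-\d{4}        # phone numbers like (514) 848-3074
    | [A-Za-z]+\.(?=\s|$)           # word followed by a period (maybe an abbreviation)
    | \$?\d+(?:\.\d+)?              # numbers and amounts like $22.50
    | [A-Za-z]+(?:['’][A-Za-z]+)?   # words, possibly with an internal apostrophe
    | \S                            # any other single non-space character
""", re.VERBOSE)

def tokenize(text):
    tokens = []
    for match in TOKEN_RE.finditer(text):
        tok = match.group()
        # Keep the period attached only for known abbreviations ("Wash." vs "wash").
        if tok.endswith(".") and tok not in ABBREVIATIONS:
            tokens.extend([tok[:-1], "."])
        else:
            tokens.append(tok)
    return tokens

print(tokenize("Peter's sick. Call (514) 848-3074 or pay $22.50 in Wash. etc."))
# ["Peter's", 'sick', '.', 'Call', '(514) 848-3074', 'or', 'pay', '$22.50', 'in', 'Wash.', 'etc.']
```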
Tokenization -- Lemmatizer
- What about morphological variants? Should "give", "gives", "given", "giver", ... be considered different words?
- Goal: "normalize" similar words
- Two main approaches: stemming and morphological analysis

Stemming
- Very "dumb" rules work well (for English and Romance languages)
- Ex: the Porter stemmer
  - Strips off affixes and leaves the stem: give --> give, gives --> give + s, given --> give + en, ...
  - Uses simple rules, where the first applicable rule is applied:
    - IF word ends with "ies" but not "eies" or "aies" THEN replace "ies" by "y"
    - IF word ends with "es" but not "aes", "ees" or "oes" THEN replace "es" by "e"
    - IF word ends with "s" but not "us" or "ss" THEN remove "s"
- Advantage: fast
- Disadvantages:
  - Rules depend on the language
  - Unreadable results, ex: "computers", "computation", "computational" --> "comput"
  - May reduce different words to the same stem although they are actually distinct:
    - stocks --> stock, stockings --> stock
    - arm --> arm, army --> arm
    - organization --> organ, university --> universe

Morphological Analyzer
- Apply morphological rules, ex: (XXXes,V) --> (XXXe,V), (XXXes,N) --> (XXXe,N), so files --> (file,N) (file,V)
- Check that (file,N) and (file,V) are in the dictionary
- Advantages:
  - Identifies the root, which is an actual word
  - Fewer errors than stemming
- Disadvantage: more complex and slower

Sentences: What is a sentence?
- Something ending with a '.', '?' or '!'
  - True in about 90% of the cases
  - But sentences can be split up by other punctuation marks or quotes
  - Ex: nested phrases: "You remind me," she remarked, "of your mother."
- We usually use heuristic methods, but these are hand-coded heuristics...
- Some effort to use statistical methods for sentence-boundary detection
  - A typical classification problem: classify a period as an end-of-sentence marker or not
  - Use features such as case, length, and POS tags of the words preceding the period, ...
  - Use decision trees, neural networks, ...
  - Some techniques reach 98-99% correct classification
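As a concrete illustration, here is a minimal heuristic sketch that decides whether a period is an end-of-sentence marker using two of the cues mentioned above (an abbreviation list and the case of the following word). The abbreviation list and the rules are illustrative assumptions, not a trained classifier.

```python
# Minimal heuristic sketch of period classification for sentence-boundary
# detection; the abbreviation list and rules are illustrative assumptions.
import re

ABBREVIATIONS = {"etc", "calif", "wash", "mr", "dr", "jan"}  # illustrative, lowercased

def is_sentence_boundary(text, i):
    """Heuristically decide whether the period at text[i] ends a sentence."""
    assert text[i] == "."
    before = re.findall(r"\w+", text[:i])
    after = re.findall(r"\w+", text[i + 1:])
    prev_word = before[-1].lower() if before else ""
    next_word = after[0] if after else ""
    if prev_word in ABBREVIATIONS:
        return False            # period after a known abbreviation ("Calif.", "etc.")
    if next_word and next_word[0].islower():
        return False            # the following word is not capitalized
    return True                 # default: treat the period as a sentence boundary

text = "He moved to Calif. last year. Then he left."
for i, ch in enumerate(text):
    if ch == ".":
        print(text[:i + 1], "->", is_sentence_boundary(text, i))
```

A statistical approach would replace the two hand-coded rules with a decision tree or neural network trained on such features.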
Marked-Up Data: Mark-up Schemes
- Schemes developed to mark up the structure of text
- Different mark-up schemes:
  - COCOA format: older, and rather ad hoc
  - SGML and related encodings: HTML, XML

Example
Input text:

Bogota, 9 jan 90 (EFE) -- [text] Ricardo Alfonso Castellar, mayor of Achi, in the northern department of Bolivar, who was kidnapped on 5 January, apparently by army of national liberation (ELN) guerrillas, was found dead today, according to authorities. Castellar was kidnapped on 5 january on the outskirts of Achi, about 850 km north of Bogota, by a group of armed men, who forced him to accompany them to an undisclosed location. Police sources in Cartagena reported that Castellar's body showed signs of torture and several bullet wounds. Castellar was attacked by ELN guerrillas while he was traveling in a boat down the Cauca river to the tenche area, a region within his jurisdiction. In Cartagena it was reported that Castellar faced a "revolutionary trial" by the ELN and that he was found guilty and executed.

Example (con't)
Text with named entity tags:

<ENAMEX TYPE="LOCATION">Bogota</ENAMEX>, <TIMEX TYPE="DATE">9 jan 90</TIMEX> (<ENAMEX TYPE="ORGANIZATION">EFE</ENAMEX>) -- [text] <ENAMEX TYPE="PERSON">Ricardo Alfonso Castellar</ENAMEX>, mayor of <ENAMEX TYPE="LOCATION">Achi</ENAMEX>, in the northern department of <ENAMEX TYPE="LOCATION">Bolivar</ENAMEX>, who was kidnapped on <TIMEX TYPE="DATE">5 January</TIMEX>, apparently by <ENAMEX TYPE="ORGANIZATION">army of national liberation (ELN)</ENAMEX> guerrillas, was found dead today, according to authorities. <ENAMEX TYPE="PERSON">Castellar</ENAMEX> was kidnapped on <TIMEX TYPE="DATE">5 january</TIMEX> on the outskirts of <ENAMEX TYPE="LOCATION">Achi</ENAMEX>, about 850 km north of <ENAMEX TYPE="LOCATION">Bogota</ENAMEX>, by a group of armed men, who forced him to accompany them to an undisclosed location. ...

Example (con't)
Text with coreference tags:

Bogota, 9 jan 90 (EFE) -- [text] <COREF ID="1" MIN="Ricardo Alfonso Castellar">Ricardo Alfonso Castellar</COREF>, <COREF ID="2" MIN="mayor" REF="1" TYPE="IDENT">mayor of Achi</COREF>, in the northern department of Bolivar, who was kidnapped on 5 January, apparently by <COREF ID="3" MIN="army">army of national liberation (ELN) guerrillas</COREF>, was found dead today, according to authorities. <COREF ID="4" MIN="Castellar" REF="1" TYPE="IDENT">Castellar</COREF> was kidnapped on 5 january on the outskirts of Achi, about 850 km north of Bogota, by <COREF ID="5" MIN="group" REF="3" TYPE="IDENT">a group of armed men</COREF>, who forced <COREF ID="6" MIN="him" REF="1" TYPE="IDENT">him</COREF>...

Example (con't)
Interpretation of the coreference tags: the REF attributes chain "Ricardo Alfonso Castellar", "mayor of Achi", "Castellar" and "him" to the same entity (ID 1), and "a group of armed men" to "army of national liberation (ELN) guerrillas" (ID 3).

Marked-Up Data: Grammatical Coding
- Tags indicate the various parts of speech of the tokens
- Different tag sets have been used:
  - Brown tag set: 87/179 tags
  - Penn Treebank (most used): 45 tags
  - London-Lund: 197 tags
  - CLAWS1: 132 tags
  - CLAWS2: 166 tags
  - CLAWS c5: 62 tags

The Design of a Tag Set
- Target feature (classification): tags are used to tell (the user) useful information about the grammatical class of a word
- Predictive feature: tags are used (by the system) to predict the behavior of other words in the context
- Example, in the Brown tag set:
  - VBG: verb, present participle
  - But VBG words can be used as a gerund or as a noun:
    - Gerund: "While purchasing/VBG a gift, I noticed that I was out of money."
    - Noun: "Concordia's purchasing/VBG department is closed."
- Two conflicting goals: splitting a tag improves prediction but makes classification harder
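Returning to the marked-up examples above, a small script can recover the annotations from such SGML-style tags. This is a minimal sketch that extracts (tag, type, text) triples for ENAMEX and TIMEX elements; the regular expression is an illustrative assumption, not a general SGML parser.

```python
# Minimal sketch of pulling (tag, type, text) triples out of SGML-style
# ENAMEX/TIMEX markup; the pattern below is an illustrative assumption.
import re

TAG_RE = re.compile(
    r'<(ENAMEX|TIMEX)\s+TYPE=\s*"?(\w+)"?\s*>\s*(.*?)\s*</\1>',
    re.DOTALL)

def extract_entities(markup):
    """Return a list of (tag, type, text) triples found in the markup."""
    return [(m.group(1), m.group(2), m.group(3)) for m in TAG_RE.finditer(markup)]

sample = ('<ENAMEX TYPE="LOCATION">Bogota</ENAMEX>, '
          '<TIMEX TYPE="DATE">9 jan 90</TIMEX> '
          '(<ENAMEX TYPE="ORGANIZATION">EFE</ENAMEX>)')
print(extract_entities(sample))
# [('ENAMEX', 'LOCATION', 'Bogota'), ('TIMEX', 'DATE', '9 jan 90'),
#  ('ENAMEX', 'ORGANIZATION', 'EFE')]
```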