791-03-Corpus

COMP 791A: Statistical Language Processing
Corpus-Based Work
Chap. 4
1
Using a Corpus


To approximate the probability distribution of language events, we use a training corpus.
Statistical NLP seeks to automatically learn lexical and structural preferences from corpora.
2
Corpus


Large database of text & speech
Many types of text corpora exist
  plain text, domain specific, tagged, parsed, parallel bilingual, …
Major suppliers:
  Linguistic Data Consortium (LDC) -- www.ldc.upenn.edu
  European Language Resources Association (ELRA) -- www.icp.grnet.fr/ELRA
To derive the needed probabilities, a corpus needs to be:
  large
  a representative sample of the population of interest
3
Low-Level Formatting Issues

Junk formatting & content
  Removal of typesetter codes (ex. HTML tags), diagrams, tables, foreign words etc.
  Also other problems if data was retrieved through OCR (unrecognized words)
Uppercase and Lowercase
  Should we keep the case or not?
  Should “the”, “The” and “THE” all be treated the same?
  But in “George Brown” and “brown dog”, “brown” should be treated separately…
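The two clean-up steps above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the helper names `strip_tags` and `fold_sentence_initial` are made up for this example, the tag pattern is naive, and lowercasing only the first token of a sentence is one possible heuristic among many (it wrongly lowercases a sentence-initial proper name).

```python
import re

# Naive pattern: anything between "<" and ">" is treated as a typesetter code.
TAG_RE = re.compile(r"<[^>]+>")

def strip_tags(text):
    """Remove tag-like codes and collapse the leftover whitespace."""
    return " ".join(TAG_RE.sub(" ", text).split())

def fold_sentence_initial(tokens):
    """Lowercase only the first token of a sentence (hypothetical heuristic).

    Capitals inside the sentence are kept, so "Brown" in "George Brown"
    stays distinct from "brown" in "brown dog".
    """
    if not tokens:
        return []
    return [tokens[0].lower()] + list(tokens[1:])

clean = strip_tags("<p>The <b>brown</b> dog saw George Brown</p>")
print(fold_sentence_initial(clean.split()))
```

Folding everything to lowercase would be simpler, but it would merge “Brown” and “brown”, which is exactly the case the slide warns about.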
4
Finding Tokens and Sentences

Tokenization
  divide the input text into units (called tokens)
  each token is either a word or something else (ex. a number or a punctuation mark)
Mark sentence boundaries
  Most sentences end with ‘.’, ‘?’ or ‘!’
  Can be confused by abbreviations
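A minimal tokenizer along these lines might look as follows. It is a sketch, not a reference implementation: the pattern `TOKEN_RE` and the function name `tokenize` are assumptions made for this example, and the pattern deliberately leaves the abbreviation problem unsolved (the period after “Dr” comes out as a separate token).

```python
import re

# Three kinds of token, tried in this order:
#   1. a number, optionally with a decimal part (22.50)
#   2. a word, optionally with internal hyphens or apostrophes (didn't, e-mail)
#   3. any single character that is neither a word character nor whitespace
TOKEN_RE = re.compile(r"\d+(?:\.\d+)?|\w+(?:[-'’]\w+)*|[^\w\s]")

def tokenize(text):
    """Split text into word, number and punctuation tokens."""
    return TOKEN_RE.findall(text)

print(tokenize("Dr. Smith paid 22.50 dollars, didn't he?"))
```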
5
Tokenization -- What is a word?

Graphic word (Kučera and Francis, 1967):
  “A string of contiguous alphanumeric characters with white
  spaces on either side; may include hyphens and apostrophes, but
  no other punctuation marks”
But what about: “$22.50” “C++” “:-)”
Main problems:
  Periods
    Abbreviation or end of sentence?
    “etc.” “Calif.” “Wash.”
    Is the period part of the word or not?
    “Wash.” (Washington) needs to keep the period to distinguish it from “wash” (the verb)
  Single apostrophes
    Part of the word or not?
    “Peter’s sick” --> 1 word? or 2 words?
    If 1 word, then problems in parsing… S --> NP VP
    If 2 words, then should “Peter’s house” be considered 2 words?
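The graphic-word definition quoted above translates almost directly into a regular expression, which makes its blind spots easy to see. The sketch below is one reading of the definition, with assumed names (`GRAPHIC_WORD_RE`, `is_graphic_word`, `graphic_words`); it additionally requires at least one alphanumeric character so that a bare “--” does not count as a word.

```python
import re

# Alphanumerics ([^\W_]) plus hyphens and apostrophes, nothing else.
# The lookahead demands at least one alphanumeric character.
GRAPHIC_WORD_RE = re.compile(r"(?=.*[^\W_])(?:[^\W_]|['’-])+")

def is_graphic_word(chunk):
    """True if a whitespace-delimited chunk fits the graphic-word definition."""
    return GRAPHIC_WORD_RE.fullmatch(chunk) is not None

def graphic_words(text):
    """Keep only the whitespace-delimited chunks that are graphic words."""
    return [chunk for chunk in text.split() if is_graphic_word(chunk)]

# "C++", "$22.50" and ":-)" are rejected by the definition.
print(graphic_words("C++ costs $22.50 at Peter's e-mail shop :-)"))
```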
6
Tokenization -- What is a word? (cont'd)

Hyphens
  Line-breaks to improve justification of text or not?
  Ex: “e-mail” “pro-life” “data-base”/“database”/“data base”
Diacritics
  Remove them?
Homographs
  Should we distinguish 2 words that have the same spelling but unrelated senses?
  “Bow”: part of a ship / a knot of ribbon
  “Saw”: instrument / past tense of “to see”
Word Segmentation in other languages:
  Some languages have no whitespace!
  Ex: East-Asian languages
  In German: “life insurance company employee” =
  “Lebensversicherungsgesellschaftsangestellter”
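For the diacritics question above, one common way to remove accents is Unicode decomposition followed by dropping the combining marks. The sketch below uses Python's standard `unicodedata` module; the function name `strip_diacritics` is an assumption for this example.

```python
import unicodedata

def strip_diacritics(text):
    """Decompose characters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_diacritics("Kučera"))   # the accent on "č" is dropped
print(strip_diacritics("résumé"))   # now spelled like the verb "resume"
```

The second call shows the cost of the normalization: removing diacritics can create new homographs.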
7
Tokenization -- What is a word? (cont'd)

Whitespace does not always indicate a word break
  Ex: Do we really want to separate the phrases
    “in spite of”
    “as a matter of fact”
    “work out”
  If no, then what do we do with non-adjacent phrasal verbs?
    “I could not work the answer out”
Variant forms of some semantic types
  Ex. Telephone numbers
    (514) 848-3074
    +1 514 848 3074
    +1 (514) 848 3074
Speech corpora
  More contractions, fillers (ex. “Um” “well” “euh”)
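The telephone-number variants above can be mapped to one canonical form so that they count as the same token. This is a rough sketch: the function name `normalize_phone` and the `default_country` parameter are assumptions for this example, and treating every 10-digit number as a North American number is a simplification.

```python
import re

def normalize_phone(raw, default_country="1"):
    """Reduce a phone number to '+' followed by its digits.

    Assumption: a 10-digit number has no country code, so the
    default country code is prepended.
    """
    digits = re.sub(r"\D", "", raw)   # keep digits only
    if len(digits) == 10:
        digits = default_country + digits
    return "+" + digits

for variant in ["(514) 848-3074", "+1 514 848 3074", "+1 (514) 848 3074"]:
    print(normalize_phone(variant))
```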
8
Tokenization -- Lemmatizer

What about morphological variants?
  Should “give”, “gives”, “given”, “giver”… be considered different words?
Goal: “normalize” similar words
Two main approaches:
  Stemming
  Morphological Analysis
9
Stemming



Very “dumb” rules work well (for English and Romance languages)
Ex: the Porter stemmer
Strips off affixes and leaves the stem
  give --> give, gives --> give + s, given --> give + en, …
Uses simple rules:
  IF word ends with “ies” but not with “eies” or “aies” THEN replace “ies” by “y”
  IF word ends with “es” but not “aes”, “ees” and “oes” THEN replace “es” by “e”
  IF word ends with “s” but not “us” or “ss” THEN remove “s”
  The first applicable rule is applied
Advantage: Fast
Disadvantages:
  Rules depend on the language
  Unreadable results:
    Ex: “computers”, “computation”, “computational” --> “comput”
  May reduce different words to the same stem although they are actually distinct
    stocks --> stock
    stockings --> stock
    arm --> arm
    army --> arm
    organization --> organ
    university --> universe
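The three IF/THEN rules listed above are simple enough to run directly. The sketch below implements only those three plural-stripping rules, in the order given, and stops at the first one that applies; it is not an implementation of the full Porter stemmer, and the function name `strip_plural` is an assumption for this example.

```python
def strip_plural(word):
    """Apply the first applicable of the three suffix rules from the slide."""
    w = word.lower()
    # Rule 1: "ies" -> "y", unless the word ends in "eies" or "aies"
    if w.endswith("ies") and not w.endswith(("eies", "aies")):
        return w[:-3] + "y"
    # Rule 2: "es" -> "e", unless the word ends in "aes", "ees" or "oes"
    if w.endswith("es") and not w.endswith(("aes", "ees", "oes")):
        return w[:-2] + "e"
    # Rule 3: drop a final "s", unless the word ends in "us" or "ss"
    if w.endswith("s") and not w.endswith(("us", "ss")):
        return w[:-1]
    return w

for word in ["ponies", "files", "gives", "stocks", "corpus", "class"]:
    print(word, "-->", strip_plural(word))
```

Note that “trees” is excluded from rule 2 by the “ees” exception and is then handled by rule 3, which yields “tree”.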
10
Morphological Analyzer

Apply morphological rules
  (XXXes,V) --> (XXXe,V)
  (XXXes,N) --> (XXXe,N)
  files --> (file,N) (file,V)
Check that (file,N) (file,V) is in dictionary
Advantages:
  Identifies the root, which is an actual word
  Fewer errors than stemming
Disadvantage:
  More complex and slower than stemming
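The rule-plus-dictionary idea above can be sketched with a toy lexicon. Everything here is illustrative: `LEXICON` holds a handful of made-up entries, `RULES` encodes only the two “XXXes” rules shown on the slide, and `analyze` is a hypothetical function name. A real analyzer would need a full dictionary and many more rules.

```python
# Toy dictionary of (root, part of speech) pairs; an assumption for this example.
LEXICON = {("file", "N"), ("file", "V"), ("give", "V")}

# (suffix to strip, replacement, part of speech): the two slide rules
#   (XXXes,N) --> (XXXe,N)   and   (XXXes,V) --> (XXXe,V)
RULES = [("es", "e", "N"), ("es", "e", "V")]

def analyze(word):
    """Return every (root, POS) produced by a rule AND found in the lexicon."""
    analyses = []
    for suffix, replacement, pos in RULES:
        if word.endswith(suffix):
            root = word[: len(word) - len(suffix)] + replacement
            if (root, pos) in LEXICON:
                analyses.append((root, pos))
    return analyses

print(analyze("files"))    # both the noun and the verb reading survive
print(analyze("gives"))    # only the verb reading is in the lexicon
print(analyze("species"))  # no candidate root is in the lexicon
```

The dictionary check is what separates this from stemming: a candidate root is kept only if it is an actual word with the right part of speech.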
11
Sentences: What is a sentence?

Something ending with a ‘.’, ‘?’ or ‘!’
  True in 90% of the cases
But, sentences can be split up by other punctuation marks or quotes
  Ex: nested phrases:
  “You remind me,” she remarked, “of your mother.”
We usually use heuristic methods
  But hand-coded heuristics…
Some effort to use statistical methods for sentence-boundary detection
  Typical classification problem…
  Classify a period as an end-of-sentence marker or not
  Use features such as case, length, POS tag of words preceding the period,…
  Use decision trees, neural networks…
  Some techniques go up to 98-99% correct classification
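A hand-coded heuristic of the kind mentioned above might look like the following sketch. The abbreviation list is a toy assumption, as are the names `ABBREVIATIONS` and `split_sentences`. The rule used is: a token ending in ‘.’, ‘?’ or ‘!’ closes a sentence unless it is a known abbreviation or the next token starts with a lowercase letter.

```python
# Toy abbreviation list; a real system would need a much larger one.
ABBREVIATIONS = {"etc.", "calif.", "wash.", "dr.", "mr.", "mrs."}

def split_sentences(text):
    """Split whitespace-separated text into sentences with simple heuristics."""
    tokens = text.split()
    sentences, current = [], []
    for i, token in enumerate(tokens):
        current.append(token)
        core = token.rstrip("\"'”’)")          # ignore closing quotes/brackets
        if not core.endswith((".", "?", "!")):
            continue
        if core.lower() in ABBREVIATIONS:       # "Calif." is not a boundary
            continue
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        if nxt is not None and nxt.lstrip("\"'“‘(")[:1].islower():
            continue                            # next word is lowercase
        sentences.append(" ".join(current))
        current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

print(split_sentences("He moved to Calif. in May. She stayed in Wash. with Dr. Brown! Why?"))
```

The heuristic fails when an abbreviation really does end a sentence, which is one reason the statistical classifiers listed above can do better.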
12
Marked-Up Data: Mark-up Schemes


Schemes developed to mark up the structure of text
Different Mark-up schemes:
  COCOA format
    older, and rather ad-hoc
  SGML
    And other related encodings: HTML, XML
13
Example

Input text:
Bogota, 9 jan 90 (EFE) -- [text] Ricardo Alfonso Castellar, mayor of Achi,
in the northern department of Bolivar, who was kidnapped on 5 January,
apparently by army of national liberation (ELN) guerrillas, was found dead
today, according to authorities.
Castellar was kidnapped on 5 january on the outskirts of Achi, about 850
km north of Bogota, by a group of armed men, who forced him to
accompany them to an undisclosed location.
Police sources in Cartagena reported that Castellar's body showed signs
of torture and several bullet wounds.
Castellar was attacked by ELN guerrillas while he was traveling in a boat
down the Cauca river to the tenche area, a region within his jurisdiction.
In Cartagena it was reported that Castellar faced a “revolutionary trial”
by the ELN and that he was found guilty and executed.
14
Example (cont'd)

Text with named entity tags:
<ENAMEX TYPE= LOCATION> Bogota </ENAMEX>, <TIMEX TYPE= DATE>
9 jan 90 </TIMEX> ( <ENAMEX TYPE= ORGANIZATION> EFE
</ENAMEX>) -- [text] <ENAMEX TYPE= PERSON> Ricardo Alfonso
Castellar </ENAMEX>, mayor of <ENAMEX TYPE= LOCATION> Achi
</ENAMEX>, in the northern department of <ENAMEX TYPE=
LOCATION> Bolivar </ENAMEX>, who was kidnapped on <TIMEX TYPE=
DATE>5 January</TIMEX>, apparently by <ENAMEX TYPE=
ORGANIZATION> army of national liberation (ELN) </ENAMEX>
guerrillas, was found dead today, according to authorities.
<ENAMEX TYPE= PERSON> Castellar </ENAMEX> was kidnapped on
<TIMEX TYPE= DATE> 5 january </TIMEX>on the outskirts of <ENAMEX
TYPE= LOCATION> Achi </ENAMEX>, about 850 km north of <ENAMEX
TYPE= LOCATION> Bogota </ENAMEX>, by a group of armed men, who
forced him to accompany them to an undisclosed location.
...
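Once text carries tags like these, the entities can be pulled back out mechanically. The sketch below is a minimal illustration that assumes the tags are never nested and that the TYPE value may or may not be quoted, as in the excerpt above; `ENTITY_RE` and `extract_entities` are names made up for this example.

```python
import re

# Matches <ENAMEX TYPE= X> ... </ENAMEX> and <TIMEX TYPE= X> ... </TIMEX>.
# \s* tolerates the line breaks that appear between "TYPE=" and its value.
ENTITY_RE = re.compile(
    r'<(ENAMEX|TIMEX)\s+TYPE=\s*"?(\w+)"?\s*>(.*?)</\1>', re.DOTALL
)

def extract_entities(marked_up):
    """Return (tag, type, text) triples, with whitespace in the text normalized."""
    return [
        (tag, entity_type, " ".join(body.split()))
        for tag, entity_type, body in ENTITY_RE.findall(marked_up)
    ]

sample = (
    "<ENAMEX TYPE= LOCATION> Bogota </ENAMEX>, <TIMEX TYPE= DATE>\n"
    "9 jan 90 </TIMEX> ( <ENAMEX TYPE= ORGANIZATION> EFE\n"
    "</ENAMEX>)"
)
for triple in extract_entities(sample):
    print(triple)
```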
15
Example (cont'd)

Text with coreference tags:
Bogota, 9 jan 90 (EFE) -- [text] <COREF ID="1" MIN="Ricardo Alfonso
Castellar "> Ricardo Alfonso Castellar </COREF>, <COREF ID= "2" MIN="
mayor" REF="1" TYPE="IDENT"> mayor of Achi </COREF>, in the northern
department of Bolivar, who was kidnapped on 5 January, apparently by
<COREF ID= "3" MIN="army "> army of national liberation (ELN) guerrillas
</COREF>, was found dead today, according to authorities.
<COREF ID="4" MIN="Castellar" REF="1" TYPE="IDENT"> Castellar
</COREF> was kidnapped on 5 january on the outskirts of Achi, about 850
km north of Bogota, by <COREF ID="5" MIN= "group" REF="3"
TYPE="IDENT"> a group of armed men </COREF>, who forced <COREF
ID="6" MIN="him" REF="1" TYPE="IDENT"> him </COREF>...
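The ID and REF attributes above define chains of mentions that refer to the same entity. The sketch below groups mentions into such chains by following each REF back to a mention that has none. It assumes non-nested COREF tags with double-quoted attribute values, as in the excerpt; the names `COREF_RE`, `ATTR_RE` and `coreference_chains` are made up for this example.

```python
import re
from collections import defaultdict

COREF_RE = re.compile(r"<COREF\s+([^>]*)>(.*?)</COREF>", re.DOTALL)
ATTR_RE = re.compile(r'(\w+)\s*=\s*"([^"]*)"')

def coreference_chains(marked_up):
    """Map the ID of each chain's first mention to the texts of its mentions."""
    mentions, refers_to = {}, {}
    for attr_text, body in COREF_RE.findall(marked_up):
        attrs = dict(ATTR_RE.findall(attr_text))
        mentions[attrs["ID"]] = " ".join(body.split())
        if "REF" in attrs:
            refers_to[attrs["ID"]] = attrs["REF"]

    chains = defaultdict(list)
    for mention_id, text in mentions.items():
        root, seen = mention_id, set()
        # Follow REF links to the start of the chain; 'seen' guards against cycles.
        while root in refers_to and root not in seen:
            seen.add(root)
            root = refers_to[root]
        chains[root].append(text)
    return dict(chains)

sample = (
    '<COREF ID="1" MIN="Ricardo Alfonso Castellar"> Ricardo Alfonso Castellar </COREF>, '
    '<COREF ID= "2" MIN="mayor" REF="1" TYPE="IDENT"> mayor of Achi </COREF>, was found dead. '
    '<COREF ID="4" MIN="Castellar" REF="1" TYPE="IDENT"> Castellar </COREF> was kidnapped.'
)
print(coreference_chains(sample))
```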
16
Example (cont'd)

Interpretation of coreference tags
Bogota, 9 jan 90 (EFE) -- [text] Ricardo Alfonso Castellar, mayor of
Achi, in the northern department of Bolivar, who was kidnapped on 5
January, apparently by army of national liberation (ELN) guerrillas, was
found dead today, according to authorities.
Castellar was kidnapped on 5 january on the outskirts of Achi, about 850
km north of Bogota, by a group of armed men, who forced him to
accompany them to an undisclosed location.
Police sources in Cartagena reported that Castellar's body showed signs
of torture and several bullet wounds.
Castellar was attacked by ELN guerrillas while he was traveling in a boat
down the Cauca river to the tenche area, a region within his jurisdiction.
In Cartagena it was reported that Castellar faced a “revolutionary trial”
by the ELN and that he was found guilty and executed.
17
Marked-Up Data: Grammatical Coding


To indicate the various parts of speech of tokens
Different Tag Sets have been used
  Brown Tag Set: 87/179 tags
  Penn Treebank (most used): 45 tags
  London-Lund: 197 tags
  CLAWS1: 132 tags
  CLAWS2: 166 tags
  CLAWS c5: 62 tags
18
The design of a Tag Set




Target feature (classification):
  Tags are used to tell (the user) useful information about the grammatical class of a word
Predictive feature:
  Tags are used (by the system) to predict the behavior of other words in the context
Example, in the Brown tag set:
  VBG: verb, present participle
  But they can be used as Gerund or as Noun
    Gerund: “While purchasing/VBG a gift, I noticed that I was out of money.”
    Noun: “Concordia’s purchasing/VBG? Department is closed.”
2 conflicting goals:
  splitting a tag improves prediction but makes classification harder
19