From Textual Information to Numerical Vectors
Chapters 2.7-2.13
Presented by Aaron Hagan

Text Mining
• Supplements the human reader with automatic systems undeterred by the text explosion. It involves analyzing a large collection of documents to discover previously unknown information.
• The information might be relationships or patterns that are buried in the document collection and that would otherwise be extremely difficult, if not impossible, to discover.

What is Covered
• Part-of-speech tagging classifies words into categories such as noun, verb, or adjective.
• Word sense disambiguation identifies the meaning of a word, given its usage, from among the multiple meanings that the word may have.
• Parsing performs a grammatical analysis of a sentence. Shallow parsers identify only the main grammatical elements in a sentence, such as noun phrases and verb phrases, whereas deep parsers generate a complete representation of the grammatical structure of a sentence.

Motivation
• Up until now we have been dealing with individual words and simple-minded (though useful) notions of which sequences of words are likely.
• Now we turn to the study of how words
  – Are clustered into classes
  – Group with their neighbors to form phrases and sentences
  – Depend on other words
• Interesting notions:
  – Word order
  – Constituency
  – Grammatical relations
• Today: syntactic word classes (part-of-speech tagging)

Part-Of-Speech Tagging
• At this step the text has already been broken into tokens and sentences.
• If no linguistic analysis is necessary, one might proceed directly to feature generation, in which the “features” are obtained from the tokens.
• If the goal is more specific, such as recognizing names of people, places, and organizations, it is usually desirable to perform additional linguistic analyses of the text to extract more sophisticated features.
• Find the POS for each token.
• Words are organized into grammatical classes, or parts of speech.
• English: nouns, verbs, adjectives, adverbs, prepositions, conjunctions.

History of POS Tagging
• Research on part-of-speech tagging has been closely tied to corpus linguistics. The first major corpus of English for computer analysis was the Brown Corpus, developed at Brown University by Henry Kucera and W. Nelson Francis in the mid-1960s.
• It consists of about 1,000,000 words of running English prose text, made up of 500 samples from randomly chosen publications. Each sample is about 2,000 words.
• In the mid-1980s, researchers in Europe began to use hidden Markov models (HMMs) to disambiguate parts of speech when working to tag the Lancaster-Oslo-Bergen Corpus of British English. HMMs involve counting cases (such as from the Brown Corpus) and making a table of the probabilities of certain sequences.

CORPUS
• Corpus of Contemporary American English (COCA)
• The first large, balanced corpus of contemporary American English.
• The corpus contains more than 385 million words of text, including 20 million words each year from 1990-2008, and it is equally divided among spoken, fiction, popular magazines, newspapers, and academic texts.
• The interface allows you to search for exact words or phrases, wildcards, lemmas, part of speech, or any combination of these. You can search for surrounding words (collocates) within a ten-word window (e.g. all nouns somewhere near chain, all adjectives near woman, or all verbs near key).
• The corpus also allows you to easily limit searches by frequency and compare the frequency of words, phrases, and grammatical constructions, in at least two main ways:
  – By genre: comparisons between spoken, fiction, popular magazines, newspapers, and academic, or even between sub-genres (or domains), such as movie scripts, sports magazines, newspaper editorials, or scientific journals
  – Over time: compare different years from 1990 to the present time

Penn Treebank Tag Set
 1. CC    Coordinating conjunction
 2. CD    Cardinal number
 3. DT    Determiner
 4. EX    Existential there
 5. FW    Foreign word
 6. IN    Preposition or subordinating conjunction
 7. JJ    Adjective
 8. JJR   Adjective, comparative
 9. JJS   Adjective, superlative
10. LS    List item marker
11. MD    Modal
12. NN    Noun, singular or mass
13. NNS   Noun, plural
14. NP    Proper noun, singular
15. NPS   Proper noun, plural
16. PDT   Predeterminer
17. POS   Possessive ending
18. PP    Personal pronoun
19. PP$   Possessive pronoun
20. RB    Adverb
21. RBR   Adverb, comparative
22. RBS   Adverb, superlative
23. RP    Particle
24. SYM   Symbol
25. TO    to
26. UH    Interjection
27. VB    Verb, base form
28. VBD   Verb, past tense
29. VBG   Verb, gerund or present participle
30. VBN   Verb, past participle
31. VBP   Verb, non-3rd person singular present
32. VBZ   Verb, 3rd person singular present
33. WDT   Wh-determiner
34. WP    Wh-pronoun
35. WP$   Possessive wh-pronoun
36. WRB   Wh-adverb
http://www.ims.uni-stuttgart.de/projekte/CorpusWorkbench/CQP-HTMLDemo/PennTreebankTS.html

Assigning POS to Tokens
• It is possible to tag POS manually, but ideally an automated system identifies the POS.
• The most successful systems are those generated automatically by machine-learning algorithms from annotated corpora.
  – Example: a system trained on Wall Street Journal text is well suited to that type of data, but may not be ideal for something like email messages.
• There has been a lot of military funding for tasks such as processing voluminous news sources.
• There is not much support for generating large training corpora in other domains.

Part-Of-Speech Dictionaries
• Dictionaries showing word-POS correspondence can be useful.
• This is difficult because several parts of speech can be tied to one word.
  – Example:
    • Bore – noun – a tiresome person
    • Bore – verb – to pierce with a turning or twisting movement of a tool
  – Example:
    • Book/VB that/DT flight/NN
• Tagging is a type of disambiguation
  – “Book” can be NN or VB
  – Can I read a book on this flight?
  – “That” can be a DT or a complementizer
  – My travel agent said that there would be a meal on this flight
• The goal of POS tagging is to determine which of these possibilities is realized in a particular text instance.

Approaches to POS Tagging
• Rule-based approach
  – Uses handcrafted sets of rules to tag input sentences
• Statistical approaches
  – Use a training corpus to compute the probability of a tag in a context
• Hybrid systems (e.g. Brill’s transformation-based learning)

ENGTWOL (ENGlish TWO Level analysis) Rule-Based Tagger
A two-stage architecture:
• Use a lexicon FST (dictionary) to tag each word with all possible POS.
• Apply hand-written rules to eliminate tags.
• The rules eliminate tags that are inconsistent with the context, and should reduce the list of POS tags to a single POS per word.

ENGTWOL Adverbial-that Rule
Given input “that”:
• If the next word is an adjective, adverb, or quantifier, and following that is a sentence boundary, and the previous word is not a verb like “consider” that allows adjectives as object complements,
• then eliminate non-ADV tags,
• else eliminate the ADV tag.
Examples:
• I consider that odd. (that is NOT ADV)
• It isn’t that strange. (that is an ADV)

Det-Noun Rule
• If an ambiguous word follows a determiner, tag it as a noun.

Does it work?
• This approach does work and produces accurate results.
• What are the drawbacks?
  – Extremely labor-intensive

Statistical Tagging
• Statistical (or stochastic) taggers use a training corpus to compute the probability of a tag in a context.
• For a given word sequence, Hidden Markov Model (HMM) taggers choose the tag sequence that maximizes P(word | tag) * P(tag | previous n tags).
• An HMM tagger chooses the tag t_i for word w_i that is most probable given the previous tag t_{i-1}:
  t_i = argmax_j P(t_j | t_{i-1}, w_i)

HMM Example
• For example, once you've seen an article such as 'the', perhaps the next word is a noun 40% of the time, an adjective 40%, and a number 20%.
  – Knowing this, a program can decide that "can" in "the can" is far more likely to be a noun than a verb or a modal. The same method can of course be used to benefit from knowledge about following words.
• More advanced ("higher order") HMMs learn the probabilities not only of pairs, but of triples or even larger sequences. So, for example, if you've just seen an article and a verb, the next item may very likely be a preposition, article, or noun, but is much less likely to be another verb.

Statistical POS Tagging (Example)
• Use probability theory for POS tagging.
• Suppose, with no context, we just want to know, given the word “flies”, whether it should be tagged as a noun or as a verb.
• We use conditional probability for this: we want to know which is greater, PROB(N | flies) or PROB(V | flies).
• Note the definition of conditional probability: PROB(a | b) = PROB(a & b) / PROB(b)
  – Where PROB(a & b) is the probability of the two events a and b occurring simultaneously

Calculating POS for “flies”
We need to know which is greater:
• PROB(N | flies) = PROB(flies & N) / PROB(flies)
• PROB(V | flies) = PROB(flies & V) / PROB(flies)
• Use a corpus as the reference for estimating the probabilities.

Corpus to Estimate
• 1,273,000 words; 1,000 uses of flies; 400 flies in N sense; 600 flies in V sense
  PROB(flies) ≈ 1000/1,273,000 = .0008
  PROB(flies & N) ≈ 400/1,273,000 = .0003
  PROB(flies & V) ≈ 600/1,273,000 = .0005
• Our best guess is that flies is a V:
  PROB(V | flies) = PROB(V & flies) / PROB(flies) = .0005/.0008 = .625
  (The .625 comes from dividing the rounded values; the unrounded ratio is 600/1000 = .6.)

Phrase Recognition
• Once tokens have been assigned POS tags, the next step is to group individual tokens into units called phrases.
• This is useful for creating a “partial parse” of a sentence and as a step in identifying the “named entities” occurring in a sentence.
• Text parsing systems are supposed to scan a text and mark the beginning and end of phrases.
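The “flies” estimate from the Corpus to Estimate slide can be reproduced in a few lines. This is only a sketch under the slide's own hypothetical counts; the function name `tag_posteriors` and the variable names are made up for illustration. Working from the raw counts gives 0.6 for the verb reading, whereas the slide's .625 results from dividing the rounded figures .0005 and .0008.

```python
from collections import Counter

# Counts taken from the slide's hypothetical corpus (not real corpus data).
CORPUS_SIZE = 1_273_000
tag_counts_for_flies = Counter({"N": 400, "V": 600})


def tag_posteriors(tag_counts, corpus_size):
    """Return PROB(tag | word) for each tag as PROB(word & tag) / PROB(word)."""
    word_total = sum(tag_counts.values())      # all uses of the word
    prob_word = word_total / corpus_size       # PROB(word)
    return {
        tag: (count / corpus_size) / prob_word  # PROB(word & tag) / PROB(word)
        for tag, count in tag_counts.items()
    }


posteriors = tag_posteriors(tag_counts_for_flies, CORPUS_SIZE)
best_tag = max(posteriors, key=posteriors.get)  # tag with the highest posterior
print(posteriors, best_tag)
```

Because the corpus size cancels out, the result depends only on the ratio of the sense counts, which is why a context-free tagger of this kind simply picks the most frequent tag for each word.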
Phrase Recognition
• There are a number of conventions for marking, but the most common is:
  – Mark a word inside a phrase with I-
    • Can be extended with a code for the phrase type: I-NP, I-VP, etc.
  – Mark a word at the beginning of a phrase adjacent to another phrase with B-
    • Can be extended with a code for the phrase type: B-NP, B-VP, etc.
  – Mark a word outside any phrase with O
• Looking for a particular sequence of words that occurs frequently enough in the corpora.
• Simple statistical approach that looks at multiword tokens.

Named Entity Recognition
• A specialization of phrase finding, in particular of noun phrase finding: the recognition of particular types of proper noun phrases, specifically persons, organizations, and locations.
• These recognizers are important for intelligence applications.
• (More on this in chapter 6.)

Parsing into Phrases
• The most sophisticated kind of text processing usually performs a full parse of each sentence.
• A full parse relates each word to the other words in the sentence and identifies its main function (subject, object, etc.) in the sentence.
• There are many different kinds of parses, each associated with a linguistic theory of the language.

Context-Free Parses
• A tree of nodes in which the leaf nodes are the words of a sentence, the phrases into which the words are grouped are internal nodes, and there is one top node at the root of the tree, which has the label S.
• There are a number of algorithms for producing such a tree from the words of a sentence, and considerable research on constructing parsers from a statistical analysis of treebanks of sentences parsed by hand.
• Provides information that phrase identification or partial parsing cannot provide.

Parse Tree Example
• Sentence: Johnson was replaced at XYZ Corp by Smith.
• From the linear order of phrases in a partial parse, one might conclude that Johnson replaced Smith; the full parse shows that Smith did the replacing.
[Figure: parse tree rooted at S. Node labels: S, NP, N, VP, PP, AUX, PPART, PREP, PNOUN. Leaves: Johnson (N), was (AUX), replaced (PPART), at (PREP), XYZ (PNOUN), Corp (PNOUN), by (PREP), Smith (PNOUN).]

Feature Generation
• The reason for the linguistic processing is to identify features that can be useful for text mining.
• Features that might be useful in identifying the POS include: whether the first letter is capitalized (indicating a proper noun), whether all the characters are digits, periods, or commas (marking a number), and whether the characters alternate case (usually an abbreviation).
• A dictionary lists the possible parts of speech for a token.

Feature Vector
• The feature vector for a document is assigned a set of classes.
• Feature vector examples:
  – Classifying periods as End-Of-Sentence.
  – Identifying tokens as instances of titles, such as “Doctor” or “President”

Summary
• Part-of-Speech Tagging
  – is an important step in Natural Language Analysis.
  – is robust and fast.
  – works with 95-97% accuracy.
• Parsing (= full syntax analysis)
  – is more error-prone than PoS-Tagging.
  – is important to get to the meaning of a sentence.

References / Applications
• http://www.cis.upenn.edu/~treebank/
  – The Penn Treebank Project annotates naturally occurring text for linguistic structure. Most notably, we produce skeletal parses showing rough syntactic and semantic information: a bank of linguistic trees.
• http://www.americancorpus.org/
• http://ucrel.lancs.ac.uk/claws/
• Stanford Natural Language Processing Group http://nlp.stanford.edu/software/tagger.shtm
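
The token-level features listed on the Feature Generation slide can be sketched as a small function. This is an illustrative sketch, not the method used in the chapter: the function name `token_features`, the feature names, and the rule that "alternating case" means at least two upper/lower switches between adjacent letters are all assumptions made here.

```python
import re


def token_features(token):
    """Compute a few surface features for one token (illustrative only)."""
    letters = [ch for ch in token if ch.isalpha()]
    # Count switches between upper and lower case among adjacent letters.
    case_changes = sum(
        1 for a, b in zip(letters, letters[1:]) if a.isupper() != b.isupper()
    )
    return {
        # First letter capitalized: hints at a proper noun.
        "first_capitalized": token[:1].isupper(),
        # Only digits, periods, and commas, with at least one digit: a number.
        "numeric": bool(re.fullmatch(r"[\d.,]*\d[\d.,]*", token)),
        # Assumed threshold: two or more case switches, as in "PhD".
        "mixed_case": case_changes >= 2,
    }


for tok in ["Smith", "1,273,000", "PhD", "flies"]:
    print(tok, token_features(tok))
```

Each token's dictionary can then be turned into one row of a numerical feature vector, for example by mapping True and False to 1 and 0.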