Computational Intelligence in Biomedical and Health Care Informatics
HCA 590 (Topics in Health Sciences)
Rohit Kate

Natural Language Processing: Words and Parses

Reading
• Chapter 8, Biomedical Informatics: Computer Applications in Health Care and Biomedicine, Edward H. Shortliffe and James J. Cimino (Editors), Springer, 2006.

Linguistics Essentials

Basic Steps of Natural Language Processing
Sound waves → [Phonetics] → Words → [Syntactic processing] → Parses → [Semantic processing] → Meaning → [Pragmatic processing] → Meaning in context
• We will skip phonetics and phonology.

Words: Morphology
• Study of the internal structure of words
– carried → carry + ed (past tense)
– independently → in + (depend + ent) + ly
• English has relatively simple morphology; some other languages, such as German or Finnish, have complex word structures
• Very accurate morphological analyzers are available for most languages; considered a solved problem
• Biomedical domains have rich morphology:
– hydroxynitrodihydrothymine => hydroxy-nitro-di-hydro-thym-ine
– hepaticocholangiojejunostomy => hepatico-cholangio-jejuno-stom-y
• Identifying morphological structure also helps in dealing with new words

Words: Parts of Speech
• Linguists group the words of a language into categories that occur in similar places in a sentence and have similar types of meaning, e.g. nouns, verbs, adjectives; these categories are called parts of speech (POS)
• A basic test of whether words belong to the same category is the substitution test
– This is a good [dog/chair/pencil].
– This is a [good/bad/green/tall] chair.
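The decompositions on the Words: Morphology slide can be sketched in code. The following is a minimal illustration, assuming a tiny hand-written morpheme list (hypothetical, covering only the two biomedical examples from the slide) and a longest-prefix-first search with backtracking. Real morphological analyzers rely on much larger lexicons and spelling rules.

```python
from typing import List, Optional

# Hypothetical toy morpheme inventory: just enough for the two slide examples.
MORPHEMES = {
    "hydroxy", "hydro", "nitro", "di", "thym", "ine",
    "hepatico", "cholangio", "jejuno", "stom", "y",
}


def segment(word: str, lexicon=MORPHEMES) -> Optional[List[str]]:
    """Split `word` into known morphemes, or return None if no split exists.

    Longer prefixes are tried first; if the remainder cannot be segmented,
    the search backtracks to a shorter prefix.
    """
    if not word:
        return []
    for end in range(len(word), 0, -1):
        prefix = word[:end]
        if prefix in lexicon:
            rest = segment(word[end:], lexicon)
            if rest is not None:
                return [prefix] + rest
    return None


if __name__ == "__main__":
    for term in ("hydroxynitrodihydrothymine", "hepaticocholangiojejunostomy"):
        print(term, "=>", "-".join(segment(term)))
```

Because the search only knows the morphemes it is given, a new word is segmented only when all of its parts are already in the list, which is one reason morphological structure helps with unseen terms built from familiar pieces.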
Parts of Speech
• Nouns: Typically refer to entities and their names, such as people, animals, and things
– John, Mary, boy, girl, dog, cats, mug, table, idea
– Can be further divided into proper, singular, and plural
• Pronouns: Variables or placeholders for nouns
– Nominative: I, you, he, she, we, they, it
– Accusative: me, you, him, her, us, them, it
– Possessive: my, your, his, her, our, their, its
– 2nd Possessive: mine, yours, his, hers, ours, theirs, its
– Reflexive: myself, yourself, himself, herself, ourselves, themselves, itself
• Determiners: Describe the particular reference of a noun
– Articles: a, an, the
– Demonstratives: this, that, these, those
• Adjectives: Describe properties of nouns
– good, bad, green, tall
• Verbs: Describe actions
– talk, sleep, eat, throw
– Categorized by tense, person, and number (singular/plural)
• Adverbs: Modify verbs by specifying space, time, manner, or degree
– often, slowly, very
• Prepositions: Small words that express spatial relations and other attributes
– in, on, over, of, about, to, with
– They introduce prepositional phrases, which are a common source of ambiguity in a sentence
• I saw a man on the hill with a telescope.
– Prepositional phrase attachment: another important NLP problem
• Particles: Subclass of prepositions that bond with verbs to form phrasal verbs
– take off, air out, ran up

POS Tagging
• Automatic POS tagging is often the first step in analyzing a sentence
John/NOUN saw/VERB the/DT saw/NOUN and/CONJ decided/VERB to/TO take/VERB it/PRP to/PREP the/DT table/NOUN ./.
• Why is this a non-trivial task?
– The same word can have different POS tags in different sentences:
• His position was near the tree. (Noun)
• Position him near the tree. (Verb)

Phrase Structure
• Most languages have a word order
• Words are organized into phrases: groups of words that act as a single unit, or constituent
– [The dog] [chased] [the cat].
– [The fat dog] [chased] [the thin cat].
– [The fat dog with a red collar] [chased] [the thin old cat].
– [The fat dog with a red collar named Tom] [suddenly chased] [the thin old white cat].

Phrases
• Noun phrase: A syntactic unit of a sentence that acts like a noun and in which a noun, called its head, is usually embedded
– An optional determiner followed by zero or more adjectives, a noun head, and zero or more prepositional phrases
• Prepositional phrase: Headed by a preposition; expresses spatial, temporal, or other attributes
• Verb phrase: The part of the sentence that depends on the verb; headed by the verb
• Adjective phrase: Acts like an adjective

An Important NLP Task: Phrase Chunking
• Find all non-recursive noun phrases (NPs) and verb phrases (VPs) in a sentence
– [NP I] [VP ate] [NP the spaghetti] [PP with] [NP meatballs].
– [NP He ] [VP reckons ] [NP the current account deficit ] [VP will narrow ] [PP to ] [NP only # 1.8 billion ] [PP in ] [NP September ]
• Some applications need all the noun phrases in a sentence

Phrase Structure Grammars
• Syntax is the study of word order and phrase structure
• Syntactic analysis tells us how to determine the meaning of a sentence from the meanings of its words
– The dog bit the man.
– The man bit the dog.
• A basic question in linguistics: What forms a legal sentence in a language?
• Syntax helps to answer that question
– *Bit the the man dog.
• Conventionally, ‘*’ indicates an ungrammatical sentence
– Colorless green ideas sleep furiously.
• Meaningless but grammatical

Phrase Structure Grammars
• Linguists have developed many grammar formalisms to capture the syntax of languages; phrase structure grammar is one of them and is very commonly used
• A context-free grammar (CFG) generates sentences
– Context-free: only one symbol on the left side of each production
• Productions of a small example grammar:
S → NP VP
VP → Verb
VP → Verb NP
NP → Article Noun
Verb → [slept|ate|made|bit]
Noun → [girl|cake|dog|man]
Article → [A|The]

Phrase Structure Grammars
• The parse of a sentence is typically shown as a tree
The girl ate the cake.
(S (NP (Article The) (Noun girl)) (VP (Verb ate) (NP (Article the) (Noun cake))))
– Non-terminals: S, NP, VP, Article, Noun, Verb
– Terminals: The, girl, ate, the, cake
A syntactic derivation or a parse tree

Phrase Structure Grammars
• Some productions can be recursive (one inside another, like NP → NP PP) and can then expand several times
(S (NP (PRP I)) (VP (VBD saw) (NP (NP (DT the) (NN man)) (PP (IN on) (NP (NP (DT the) (NN hill)) (PP (IN with) (NP (DT the) (NN telescope))))))))
• Because of recursion in the grammar, there is a potentially infinite number of sentences in a language

Syntactic Ambiguity
• Typically a grammar can lead to several parses of a sentence; this is called syntactic ambiguity
– Parse 1: "with the telescope" attaches to "the hill"
(S (NP (PRP I)) (VP (VBD saw) (NP (NP (DT the) (NN man)) (PP (IN on) (NP (NP (DT the) (NN hill)) (PP (IN with) (NP (DT the) (NN telescope))))))))
– Parse 2: both prepositional phrases attach to "the man"
(S (NP (PRP I)) (VP (VBD saw) (NP (NP (DT the) (NN man)) (PP (IN on) (NP (DT the) (NN hill))) (PP (IN with) (NP (DT the) (NN telescope))))))
– Parse 3: both prepositional phrases attach to the verb phrase
(S (NP (PRP I)) (VP (VBD saw) (NP (DT the) (NN man)) (PP (IN on) (NP (DT the) (NN hill))) (PP (IN with) (NP (DT the) (NN telescope)))))

Syntactic Parsing: A Very Important NLP Task
• It is not uncommon for a sentence to have hundreds of parses
• Syntactic parsing is the task of finding the best parse for a sentence
• Earlier rule-based approaches were brittle and did not work well
• Statistical methods for syntactic parsing have been more successful and are currently being used

Statistical Syntactic Parsing
• Statistical syntactic parsing uses a probabilistic model of syntax to assign a probability to each parse tree
• Provides a principled approach to resolving syntactic ambiguity
– The more likely parse will have the higher probability
– Includes POS tagging
• Probabilities are typically learned from annotated parses of thousands of sentences, called a treebank
– Penn Treebank (http://www.cis.upenn.edu/~treebank/)
• The best-known treebank
• Contains annotated parse trees of a few thousand Wall Street Journal articles
• Sparked progress in automated syntactic parsing methods

Probabilistic Context Free Grammar (PCFG)
• A PCFG is a probabilistic version of a CFG in which each production has a probability
• The probabilities of all productions rewriting a given non-terminal must add to 1, defining a distribution for each non-terminal

Simple PCFG for Air Travel Domain

Grammar                      Prob
S → NP VP                    0.8
S → Aux NP VP                0.1
S → VP                       0.1   (S productions sum to 1.0)
NP → Pronoun                 0.2
NP → Proper-Noun             0.2
NP → Det Nominal             0.6   (NP productions sum to 1.0)
Nominal → Noun               0.3
Nominal → Nominal Noun       0.2
Nominal → Nominal PP         0.5   (Nominal productions sum to 1.0)
VP → Verb                    0.2
VP → Verb NP                 0.5
VP → VP PP                   0.3   (VP productions sum to 1.0)
PP → Prep NP                 1.0   (PP productions sum to 1.0)

Lexicon
Det → the (0.6) | a (0.2) | that (0.1) | this (0.1)
Noun → book (0.1) | flight (0.5) | meal (0.2) | money (0.2)
Verb → book (0.5) | include (0.2) | prefer (0.3)
Pronoun → I (0.5) | he (0.1) | she (0.1) | me (0.3)
Proper-Noun → Houston (0.8) | NWA (0.2)
Aux → does (1.0)
Prep → from (0.25) | to (0.25) | on (0.1) | near (0.2) | through (0.2)

Sentence Probability
• Assume the productions for each node are chosen independently.
• The probability of a derivation is the product of the probabilities of its productions.
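As a minimal sketch of this product rule (not part of the original slides), the Python below encodes the air-travel PCFG above and scores two candidate parse trees of "book the flight through Houston". The nested-tuple tree encoding and the names RULES, LEXICON, check_normalized, and tree_prob are illustrative assumptions, not a standard API.

```python
from math import isclose

# Production probabilities copied from the air-travel PCFG; keys are (lhs, rhs).
RULES = {
    ("S", ("NP", "VP")): 0.8,
    ("S", ("Aux", "NP", "VP")): 0.1,
    ("S", ("VP",)): 0.1,
    ("NP", ("Pronoun",)): 0.2,
    ("NP", ("Proper-Noun",)): 0.2,
    ("NP", ("Det", "Nominal")): 0.6,
    ("Nominal", ("Noun",)): 0.3,
    ("Nominal", ("Nominal", "Noun")): 0.2,
    ("Nominal", ("Nominal", "PP")): 0.5,
    ("VP", ("Verb",)): 0.2,
    ("VP", ("Verb", "NP")): 0.5,
    ("VP", ("VP", "PP")): 0.3,
    ("PP", ("Prep", "NP")): 1.0,
}

LEXICON = {
    "Det": {"the": 0.6, "a": 0.2, "that": 0.1, "this": 0.1},
    "Noun": {"book": 0.1, "flight": 0.5, "meal": 0.2, "money": 0.2},
    "Verb": {"book": 0.5, "include": 0.2, "prefer": 0.3},
    "Pronoun": {"I": 0.5, "he": 0.1, "she": 0.1, "me": 0.3},
    "Proper-Noun": {"Houston": 0.8, "NWA": 0.2},
    "Aux": {"does": 1.0},
    "Prep": {"from": 0.25, "to": 0.25, "on": 0.1, "near": 0.2, "through": 0.2},
}

# Treat each lexicon entry as a production POS -> word.
for pos, words in LEXICON.items():
    for word, p in words.items():
        RULES[(pos, (word,))] = p


def check_normalized(rules):
    """Return True if the productions for every left-hand side sum to 1."""
    totals = {}
    for (lhs, _), p in rules.items():
        totals[lhs] = totals.get(lhs, 0.0) + p
    return all(isclose(t, 1.0) for t in totals.values())


def tree_prob(tree):
    """Probability of a derivation: the product of its production probabilities.

    A tree is a tuple (label, child, ...); a child is a word (str) or a subtree.
    """
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = RULES[(label, rhs)]
    for child in children:
        if not isinstance(child, str):
            p *= tree_prob(child)
    return p


# PP attached to the Nominal ("the flight through Houston").
D1 = ("S", ("VP", ("Verb", "book"),
            ("NP", ("Det", "the"),
             ("Nominal", ("Nominal", ("Noun", "flight")),
              ("PP", ("Prep", "through"),
               ("NP", ("Proper-Noun", "Houston")))))))

# PP attached to the VP ("book ... through Houston").
D2 = ("S", ("VP",
            ("VP", ("Verb", "book"),
             ("NP", ("Det", "the"), ("Nominal", ("Noun", "flight")))),
            ("PP", ("Prep", "through"),
             ("NP", ("Proper-Noun", "Houston")))))

if __name__ == "__main__":
    print("normalized:", check_normalized(RULES))
    print("P(D1) =", tree_prob(D1))
    print("P(D2) =", tree_prob(D2))
```

Under these assumptions, tree_prob(D1) comes to about 0.0000216 and tree_prob(D2) to about 0.00001296, so the tree that attaches the prepositional phrase to the Nominal scores higher.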
Derivation D1 for "book the flight through Houston" (production probabilities are taken from the PCFG):
(S (VP (Verb book) (NP (Det the) (Nominal (Nominal (Noun flight)) (PP (Prep through) (NP (Proper-Noun Houston)))))))
P(D1) = 0.1 × 0.5 × 0.5 × 0.6 × 0.6 × 0.5 × 0.3 × 1.0 × 0.2 × 0.2 × 0.5 × 0.8 = 0.0000216

Syntactic Disambiguation
• Resolve ambiguity by picking the most probable parse tree.
Derivation D2:
(S (VP (VP (Verb book) (NP (Det the) (Nominal (Noun flight)))) (PP (Prep through) (NP (Proper-Noun Houston)))))
P(D2) = 0.1 × 0.3 × 0.5 × 0.6 × 0.5 × 0.6 × 0.3 × 1.0 × 0.5 × 0.2 × 0.2 × 0.8 = 0.00001296
• D1 has the higher probability, so it is the more likely parse according to the PCFG.

Syntactic Parsing
• State-of-the-art syntactic parsers also use the words themselves to influence the probabilities of productions
– VP → Verb NP with “sneeze” as the verb will have a different probability than VP → Verb NP with “eat” as the verb
• You don’t sneeze something, but you eat something
• Try the online version of the Stanford Parser:
– http://nlp.stanford.edu:8080/parser/

Syntax of Biomedical Languages
• Clinical language often relaxes many syntactic constraints in order to be highly compact
– The cough worsened
– Cough worsened
– Cough
– Increased tenderness.
• Because these forms are widely used, they are not considered ungrammatical but are treated as part of a sublanguage
• There is a wide variety of sublanguages in the biomedical domains, each exhibiting specialized content and linguistic forms
• Parsers trained in one domain typically do not work well in another domain; adaptation is required