Information extraction Lecture 8: Information Extraction from unstructured sources Semantic IE Reasoning You are here Fact Extraction Is-A Extraction singer Entity Disambiguation singer Elvis Entity Recognition Source Selection and Preparation Overview • Fact extraction from text • Difficulties • POS Tagging • Dependency grammars Reminder: Fact Extraction Fact extraction is the extraction of facts about entities from a corpus. Coyote chases Roadrunner. chases(Coyote, Roadrunner) Def: Extraction Pattern An extraction pattern for a binary relation r is a string that contains two place-holders X and Y, indicating that two entities stand in relation r. Extraction patterns for chases(X,Y): • “X chases Y” • “X runs after Y” • “X’s chase for Y” • “Y is chased by X” task> Task: Extraction Pattern Define extraction patterns for • wasBornInCity(X,Y) • isDating(X,Y) Def: Extraction Pattern Match A match for an extraction pattern p in a corpus c is a substring s of c such that σ there exists a substitution from X and Y to words such that σ(p) = s “X’s chase for Y” Coyote’s chase for RR rarely succeeds. Def: Extraction Pattern Match A match for an extraction pattern p in a corpus c is a substring s of c such that σ there exists a substitution from X and Y to words such that σ(p) = s “X’s chase for Y” σ = {X → Coyote, Y → RR} Coyote’s chase for RR rarely succeeds. Def: Extraction Pattern Match A match for an extraction pattern p in a corpus c is a substring s of c such that σ there exists a substitution from X and Y to words such that σ(p) = s “X’s chase for Y” σ = {X → Coyote, Y → RR} match Coyote’s chase for RR rarely succeeds. Def: Pattern Application Given a pattern p for a relation r, and a corpus, pattern application produces, for every match of p with substitution , σ the fact r(σ(X), σ(Y )) “X’s chase for Y” is pattern for chases() σ = {X → Coyote, Y → RR} Coyote’s chase for RR rarely succeeds. Def: Pattern Application Given a pattern p for a relation r, and a corpus, pattern application produces, for every match of p with substitution , σ the fact r(σ(X), σ(Y )) “X’s chase for Y” is pattern for chases() σ = {X → Coyote, Y → RR} Coyote’s chase for RR rarely succeeds. chases(Coyote,RR) Disambiguation is required! Coyote’s chase for RR rarely succeeds. chases(Coyote, RR) ? Strictly speaking, we produce the fact r(d(σ(X)), d(σ(Y ))) where d is a disambiguation function. Where do we get the patterns? • Option 1: Manually compile patterns. Where do we get the patterns? • Option 2: Manually annotate patterns in text While Coyote chases Roadrunner, he experiences certain inconveniences. “X chases Y” Where do we get the patterns? • Option 3: Pattern Deduction Retrieve patterns by help of known facts. Known facts: • chases(Obama,Osama) Obama chases Osama since 2009. Pattern for chases(X,Y): “X chases Y” Def: Pattern Deduction Given a corpus, and given a KB, pattern deduction is the process of finding extraction patterns that produce facts of the KB when applied to the corpus. KB chases Corpus + Obama chases Osama. Def: Pattern Deduction Given a corpus, and given a KB, pattern deduction is the process of finding extraction patterns that produce facts of the KB when applied to the corpus. KB chases Corpus + Obama chases Osama. => “X chases Y” is a pattern for chases(X,Y) Def: Pattern iteration/DIPRE Pattern iteration (also: DIPRE) is the process of repeatedly applying pattern deduction and pattern application, thus continuously augmenting the KB. Known facts Patterns New facts Patterns Example: DIPRE Obama chases Osama. Coyote runs after RR. Coyote chases RR. chases Example: DIPRE Obama chases Osama. Coyote runs after RR. Coyote chases RR. chases “X chases Y” Example: DIPRE Obama chases Osama. Coyote runs after RR. Coyote chases RR. chases chases “X chases Y” Example: DIPRE Obama chases Osama. Coyote runs after RR. Coyote chases RR. chases “X chases Y” chases “X runs after Y” ... Task: DIPRE Given this KB, wantsToEat apply DIPRE to this corpus: Obama likes Michelle. Coyote wants to eat Roadrunner. Harry wants to eat cornflakes. Harry likes cornflakes for breakfast. Overview • Fact extraction from text • Difficulties • POS Tagging • Dependency grammars We use labels to find patterns KB label chases “Obama” label “Osama” Corpus + Obama chases Osama. => “X chases Y” is a pattern for chases(X,Y) Ambiguity is a problem KB label chases “Obama” label loves label “Osama” Corpus + Obama chases Osama. Ambiguity is a problem KB label chases “Obama” label loves label “Osama” Corpus + Obama chases Osama. => “X chases Y” is a pattern for loves(X,Y) or for chases(X,Y)? Disambiguation helps KB chases label “Obama” loves label label “Osama” Corpus + Obama chases Osama. => “X chases Y” is a pattern for chases(X,Y) Phrase structure can be a problem KB label chases “Obama” label “Osama” Corpus + Obama hat Osama gejagt. Phrase structure can be a problem KB label chases “Obama” label “Osama” Corpus + Obama hat Osama gejagt. => “X hat Y” is a pattern for chases(X,Y)? Multiple links are a problem KB label “Obama” chases shot label “Osama” Corpus + Obama chases Osama. Multiple links are a problem KB label “Obama” chases shot label “Osama” Corpus + Obama chases Osama. => “X chases Y” is a pattern for chases(X,Y) or for shot(X,Y)? Confidence of a pattern The confidence of an extraction pattern is the number of matches that produce known facts divided by the total number of matches. Pattern produces mostly new facts => risky Pattern produces mostly known facts => safe Simple word match is not enough Coyote invents a wonderful machine. + “X invents a Y” = invents(Coyote,wonderful) Patterns may be too specific Coyote invents a wonderful machine. + “X invents a great Y” = invents(Coyote,machine) Overview • Fact extraction from text • Difficulties • POS Tagging • Dependency grammars Def: POS A Part-of-Speech (also: POS, POS-tag, word class, lexical class, lexical category) is a set of words with the same grammatical role. • Proper nouns (PN): Wile, Elvis, Obama... • Nouns (N): desert, sand, ... • Adjectives (ADJ): fast, funny, ... • Adverbs (ADV): fast, well, ... • Verbs (V): run, jump, dance, ... Def: POS (2) • Pronouns (PRON): he, she, it, this, ... (what can replace a noun) • Determiners (DET): the, a, these, your,... (what goes before a noun) • Prepositions (PREP): in, with, on, ... (what goes before determiner + noun) • Subordinators (SUB): who, whose, that, which, because... (what introduces a sub-sentence) Def: POS-Tagging POS-Tagging is the process of assigning to each word in a sentence its POS. new machine. Wile tries a PN V DET ADJ N task> Task: POS-Tagging POS-Tag the following sentence: The coyote falls into a canyon. • N, V, PN, ADJ, ADV, PRON • Determiners (DET): the, a, these, your,... (what goes before a noun) • Prepositions (PREP): in, with, on, ... (what goes before determiner + noun) • Subordinators (SUB): who, whose, that,... (what introduces a sub-sentence) 1 2 3 4 Probabilistic POS-Tagging Introduce random variables for words and for POS-tags: (visible) X1 Wile X2 runs Wile runs Wile ... runs (hidden) Y1 Y2 PN V ADJ V Prob. PN PREP ... P (w3) = 0.1 P (w1) = 0.7 P (w2) = 0.1 probabilities> Independence Assumptions Every tag depends just on its predecessor P (Yi|Y1, ..., Yi−1) = P (Yi|Yi−1) Every word depends just on its tag: P (Xi|X1, ..., Xi−1, Y1, ..., Yi) = P (Xi|Yi) (visible) X1 Wile X2 runs Wile runs Wile ... runs (hidden) Y1 Y2 PN V ADJ V Prob. PN PREP ... P (w3) = 0.1 P (w1) = 0.7 P (w2) = 0.1 Homogeneity Assumption 1 P (Yi|Y1, ..., Yi−1) = P (Yi|Yi−1) = P (Yk |Yk−1) ∀ i, k Homogeneity Assumption 1 P (Yi|Y1, ..., Yi−1) = P (Yi|Yi−1) = P (Yk |Yk−1) ∀ i, k Example: The probability that a verb comes after a noun is the same at position 3 and at position 9: P (Y3 = V erb|Y2 = N oun) = P (Y9 = V erb|Y8 = N oun Homogeneity Assumption 1 P (Yi|Y1, ..., Yi−1) = P (Yi|Yi−1) = P (Yk |Yk−1) ∀ i, k Example: The probability that a verb comes after a noun is the same at position 3 and at position 9: P (Y3 = V erb|Y2 = N oun) = P (Y9 = V erb|Y8 = N oun Let’s denote this probability by P (V erb|N oun) Homogeneity Assumption 1 P (Yi|Y1, ..., Yi−1) = P (Yi|Yi−1) = P (Yk |Yk−1) ∀ i, k Example: The probability that a verb comes after a noun is the same at position 3 and at position 9: P (Y3 = V erb|Y2 = N oun) = P (Y9 = V erb|Y8 = N oun Let’s denote this probability by P (V erb|N oun) “Transition probability” P (v|w) := P (Yi = v|Yi−1 = w) (f or any i) Homogeneity Assumption 2 P (Xi|X1, ..., Xi−1, Y1, ..., Yi) = P (Xi|Yi) = P (Xk |Yk ) ∀ i, k Homogeneity Assumption 2 P (Xi|X1, ..., Xi−1, Y1, ..., Yi) = P (Xi|Yi) = P (Xk |Yk ) ∀ i, k Example: The probability that a PN is “Wile” is the same at position 3 and 9: P (X3 = W ile|Y3 = P N ) = P (X9 = W ile|Y9 = P N ) Homogeneity Assumption 2 P (Xi|X1, ..., Xi−1, Y1, ..., Yi) = P (Xi|Yi) = P (Xk |Yk ) ∀ i, k Example: The probability that a PN is “Wile” is the same at position 3 and 9: P (X3 = W ile|Y3 = P N ) = P (X9 = W ile|Y9 = P N ) Let’s denote this probability by P (W ile|P N ) Homogeneity Assumption 2 P (Xi|X1, ..., Xi−1, Y1, ..., Yi) = P (Xi|Yi) = P (Xk |Yk ) ∀ i, k Example: The probability that a PN is “Wile” is the same at position 3 and 9: P (X3 = W ile|Y3 = P N ) = P (X9 = W ile|Y9 = P N ) Let’s denote this probability by P (W ile|P N ) “Emission Probability” P (v|w) := P (Xi = v|Yi = w) (f or any i, word v) Def: HMM A (homogeneous) Hidden Markov Model (also: HMM) is a sequence of random (We use random variables variables, such that P (< x1, ..., xn, y1, ..., yn >) = Words of the sentence POS-tags ...where y0 is fixed as X1,...,Xn,Y1,...,Yn that span the possible worlds) Q i P (xi|yi) × P (yi|yi−1) Emission probabilities Transition probabilities y0 := Start HMMs as graphs P (< x1, ..., xn, y1, ..., yn >) = Q i P (xi|yi) × P (yi|yi−1) Transition probabilities P (P N |Start) PN Start P (DET |Start) P (V |P N ) P (V |N ) DET P (N |DET ) N V P (EN D|V ) END HMMs as graphs P (< x1, ..., xn, y1, ..., yn >) = Q i P (xi|yi) × P (yi|yi−1) Emission probabilities P (P N |Start) PN P (V |P N ) Start P (DET |Start) Transition probabilities P (V |N ) DET P (N |DET ) N P (a|DET ) P (the|DET ) the a V P (EN D|V ) END HMMs as graphs P (< x1, ..., xn, y1, ..., yn >) = Q i P (xi|yi) × P (yi|yi−1) Emission probabilities P (P N |Start) PN P (V |P N ) Start P (DET |Start) Transition probabilities P (V |N ) V P (EN D|V ) DET P (N |DET ) N P (a|DET ) P (W ile|N ) END . P (coyote|N ) P (the|DET ) P (RR|N ) Wile the a RoadRunner P (bird|N ) coyote bird eats sleeps Example: HMMs as graphs P (< x1, ..., xn, y1, ..., yn >) = Q i P (xi|yi) × P (yi|yi−1) P (< the, bird, sleeps, ., , EN D >) =0.2 × 0.5 × 1 × 0.7 × 1 × 0.9 × 1× 1 DET, N , V Start 1 PN 0.8 0.2 DET 1 1 V 1 END .1 N 0.1 0.5 0.4 0.2 0.9 0.5 0.7 0.6 Wile the a RoadRunner coyote bird eats sleeps task> Task: HMMs Write the product of values for P (< W ile, sleeps, ., P N, V, EN D >) P (< a, coyote, eats, ., DET, N, V, EN D >) Start 1 PN 0.8 0.2 DET 1 1 V 1 END .1 N 0.1 0.5 0.4 0.2 0.9 0.5 0.7 0.6 Wile the a RoadRunner coyote bird eats sleeps All other probabilities are 0! P (< W ile, bird, P N, V >) = 0 The graph is just one way to visualize P (< W ile, sleeps, V, V >) = 0 the emisssion and transistion probabilities! 0 Start 1 PN 0.8 0.2 DET 1 0.5 0.4 1 1 V N END .1 0 0.1 0.2 0.9 0.5 0.7 0.6 Wile the a RoadRunner coyote bird eats sleeps Estimate probabilities Start with manually annotated corpus The laws of gravity apply selectively DET N PREP N V ADV P (tagi|tagj ) = # cases where tagi f ollows tagj # appearances of tagj P (ADV |V ) = P (wordi|tagj ) = 1 1 P (V |N ) = 1 2 ... # cases where wordi has tagj # appearances of tagj P (the|DET ) = 1 1 P (laws|N ) = 1 2 ... task> Task: Estimate probabilities Estimate the transition and emission probs: Wile runs many runs . START PN V ADJ N END P (P N |Start) = P (W ile|P N ) = P (V |P N ) = P (runs|V ) = ... ... Draw the HMM as a graph. Task: Compute probabilities Using the probabilities from the last task, compute the following probabilities: P (< W ile, runs, ., P N, V, EN D >) P (< W ile, runs, ., P N, N, EN D >) Def: Probabilistic POS Tagging Given a sentence and transition and emission probabilities, Probabilistic POS Tagging computes the sequence of tags that has maximal probability (in an HMM). ~ = Wile runs. X P (W ile, runs, P N, N ) = 0.01 P (W ile, runs, V, N ) = 0.01 P (W ile, runs, P N, V ) = 0.1 ... winner POS Tagging is not trivial Several cases are difficult to tag: • the word “blue” has 4 letters. • pre- and post-secondary • look (a word) up • The Duchess was entertaining last night. Wikipedia/POS tagging POS Taggers There are several free POS taggers: • Stanford POS tagger • hunpos • MBT: Memory-based Tagger • TreeTagger • ACOPOST (formerly ICOPOST) • YamCha • ... Stanford NLP/Resources Try it out! POS taggers achieve very good precision. POS Tagging helps pattern IE We can choose to match the placeholders with only nouns or proper nouns: “X invents a Y” Coyote invents a catapult invents(Coyote,catapult) Coyote invents a really cool thing. invents(Coyote,really) POS-tags can generalize patterns “X invents a ADJ X” Coyote invents a cool catapult Coyote invents a great catapult Coyote invents a super-duper catapult. match match match Phrase structure is a problem “X invents a ADJ X” Coyote invents a very great catapult. Coyote, who is very hungry, invents a great catapult. no match no match We will see next time how to solve it. Semantic IE Reasoning You are here Fact Extraction Is-A Extraction singer Entity Disambiguation singer Elvis Entity Recognition Source Selection and Preparation References Brin: Extracting Patterns and Relations from the WWW Agichtein: Snowball Ramage: HMM Fundamentals Web data mining class