November 9, 2000
Language and Information
Handout #4
(C) 2000, The University of Michigan

Course Information
• Instructor: Dragomir R. Radev (radev@si.umich.edu)
• Office: 305A, West Hall
• Phone: (734) 615-5225
• Office hours: TTh 3-4
• Course page: http://www.si.umich.edu/~radev/760
• Class meets on Thursdays, 5-8 PM in 311 West Hall

Readings
• Textbook:
– Oakes Ch. 3: 95-96, 110-120
– Oakes Ch. 4: 149-150, 158-166, 182-189
– Oakes Ch. 5: 199-212, 221-223, 236-247
• Additional readings:
– Knight, "Statistical Machine Translation Workbook" (http://www.clsp.jhu.edu/ws99/)
– McKeown & Radev, "Collocations"
– Optional: M&S chapters 4, 5, 6, 13, 14

Statistical Machine Translation and Language Modeling

The Noisy Channel Model
• Source-channel model of communication
• Parametric probabilistic models of language and translation
• Training such models

Statistics
• Given f, guess e
• [Diagram: e → encoder (E→F) → f → decoder (F→E) → e′]
• e′ = argmax_e P(e|f) = argmax_e P(f|e) · P(e), where P(f|e) is the translation model and P(e) is the language model

Parametric probabilistic models
• Language model (LM):
P(e) = P(e1, e2, …, eL) = P(e1) · P(e2|e1) · … · P(eL|e1 … eL-1)
• In practice the history is truncated, e.g. to a trigram model: P(eL|e1 … eL-1) ≈ P(eL|eL-2, eL-1)
• Deleted interpolation: smooth the trigram estimate by mixing it with lower-order (bigram and unigram) estimates (see the sketch below)
• Translation model (TM):
Alignment: P(f,a|e)

IBM's EM-trained models
1. Word translation
2. Local alignment
3. Fertilities
4. Class-based alignment
5. Non-deficient algorithm (avoids overlaps, overflow)

Lexical Semantics and WordNet

Meanings of words
• Lexemes, lexicon, sense(s)
• Examples:
– Red, n: the color of blood or a ruby
– Blood, n: the red liquid that circulates in the heart, arteries and veins of animals
– Right, adj: located nearer the right hand esp. being on the right when facing the same direction as the observer
• Do dictionaries give us definitions??

Relations among words
• Homonymy:
– Instead, a bank can hold the investments in a custodial account in the client's name.
– But as agriculture burgeons on the east bank, the river will shrink even more.
• Other examples: be/bee?, wood/would?
• Homophones
• Homographs
• Applications: spelling correction, speech recognition, text-to-speech
• Example: Un ver vert va vers un verre vert. ("A green worm goes toward a green glass" — five homophones)

Polysemy
• They rarely serve red meat, preferring to prepare seafood, poultry, or game birds.
• He served as U.S. ambassador to Norway in 1976 and 1977.
• He might have served his time, come out and led an upstanding life.
• Homonymy: distinct and unrelated meanings, possibly with different etymology (multiple lexemes).
• Polysemy: a single lexeme with two related meanings.
• Example: an "idea bank"

Synonymy
• Principle of substitutability
• How big is this plane?
• Would I be flying on a large or small plane?
• Miss Nelson, for instance, became a kind of big sister to Mrs. Van Tassel's son, Benjamin.
• ?? Miss Nelson, for instance, became a kind of large sister to Mrs. Van Tassel's son, Benjamin.
• What is the cheapest first class fare?
• ?? What is the cheapest first class cost?
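Returning briefly to the language-modeling slides: a minimal sketch of deleted interpolation, added here for illustration (it is not part of the original handout). The trigram estimate is mixed with bigram and unigram estimates; the toy corpus and the λ values are assumptions, since in practice the weights are tuned on held-out ("deleted") data via EM.

```python
from collections import Counter

def train_ngram_counts(tokens):
    """Collect unigram, bigram, and trigram counts from a token list."""
    uni = Counter(tokens)
    bi = Counter(zip(tokens, tokens[1:]))
    tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
    return uni, bi, tri

def interpolated_trigram(w1, w2, w3, uni, bi, tri, total,
                         lambdas=(0.1, 0.3, 0.6)):
    """P(w3|w1,w2) ~ l1*P(w3) + l2*P(w3|w2) + l3*P(w3|w1,w2).

    The lambdas would normally be estimated on held-out data;
    fixed illustrative values are used here for brevity.
    """
    l1, l2, l3 = lambdas
    p_uni = uni[w3] / total if total else 0.0
    p_bi = bi[(w2, w3)] / uni[w2] if uni[w2] else 0.0
    p_tri = tri[(w1, w2, w3)] / bi[(w1, w2)] if bi[(w1, w2)] else 0.0
    return l1 * p_uni + l2 * p_bi + l3 * p_tri

tokens = "the cat sat on the mat the cat ate".split()
uni, bi, tri = train_ngram_counts(tokens)
print(interpolated_trigram("the", "cat", "sat", uni, bi, tri, len(tokens)))
```

Even when the trigram count is zero, the interpolated estimate stays nonzero as long as the word has been seen at all, which is the point of the smoothing.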
Semantic Networks
• Used to represent relationships between words
• Example: WordNet, created by George Miller's team at Princeton (http://www.cogsci.princeton.edu/~wn)
• Based on synsets (sets of interchangeable synonyms) and lexical matrices

Lexical matrix

Word Meanings   F1     F2     F3    …    Fn      (Word Forms)
M1              E1,1   E1,2
M2                     E2,2
…                                   …
Mm                                       Em,n

Synsets
• Disambiguation
– {board, plank}
– {board, committee}
• Synonyms
– substitution
– weak substitution
– synonyms must be of the same part of speech

$ ./wn board -hypen
Synonyms/Hypernyms (Ordered by Frequency) of noun board
9 senses of board

Sense 1
board
   => committee, commission
      => administrative unit
         => unit, social unit
            => organization, organisation
               => social group
                  => group, grouping

Sense 2
board
   => sheet, flat solid
      => artifact, artefact
         => object, physical object
            => entity, something

Sense 3
board, plank
   => lumber, timber
      => building material
         => artifact, artefact
            => object, physical object
               => entity, something

Sense 4
display panel, display board, board
   => display
      => electronic device
         => device
            => instrumentality, instrumentation
               => artifact, artefact
                  => object, physical object
                     => entity, something

Sense 5
board, gameboard
   => surface
      => artifact, artefact
         => object, physical object
            => entity, something

Sense 6
board, table
   => fare
      => food, nutrient
         => substance, matter
            => object, physical object
               => entity, something

Sense 7
control panel, instrument panel, control board, board, panel
   => electrical device
      => device
         => instrumentality, instrumentation
            => artifact, artefact
               => object, physical object
                  => entity, something

Sense 8
circuit board, circuit card, board, card
   => printed circuit
      => computer circuit
         => circuit, electrical circuit, electric circuit
            => electrical device
               => device
                  => instrumentality, instrumentation
                     => artifact, artefact
                        => object, physical object
                           => entity, something

Sense 9
dining table, board
   => table
      => furniture, piece of furniture, article of furniture
         => furnishings
            => instrumentality, instrumentation
               => artifact, artefact
                  => object, physical object
                     => entity, something

Antonymy
• "x" vs. "not-x"
• "rich" vs. "poor"?
• {rise, ascend} vs. {fall, descend}

Other relations
• Meronymy: X is a meronym of Y when native speakers of English accept sentences similar to "X is a part of Y", "X is a member of Y".
• Hyponymy: {tree} is a hyponym of {plant}.
• Hierarchical structure based on hyponymy (and hypernymy).
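The hypernym chains above can be reproduced with any WordNet interface. A small sketch using NLTK's WordNet corpus reader (an assumption added here — NLTK is not mentioned in the handout, and sense numbering can differ slightly across WordNet versions):

```python
from nltk.corpus import wordnet as wn  # requires: nltk.download('wordnet')

# Walk the hypernym chain for each noun sense of "board",
# mirroring the `wn board -hypen` output shown above.
for i, synset in enumerate(wn.synsets("board", pos=wn.NOUN), start=1):
    print(f"Sense {i}: {', '.join(synset.lemma_names())}")
    chain = synset.hypernyms()
    depth = 1
    while chain:
        hyper = chain[0]  # follow the first (most frequent) hypernym
        print("   " * depth + "=> " + ", ".join(hyper.lemma_names()))
        chain = hyper.hypernyms()
        depth += 1
```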
Other features of WordNet
• Index of familiarity
• Polysemy

Familiarity and polysemy
board used as a noun is familiar (polysemy count = 9)
bird used as a noun is common (polysemy count = 5)
cat used as a noun is common (polysemy count = 7)
house used as a noun is familiar (polysemy count = 11)
information used as a noun is common (polysemy count = 5)
retrieval used as a noun is uncommon (polysemy count = 3)
serendipity used as a noun is very rare (polysemy count = 1)

Compound nouns
advisory board, appeals board, backboard, backgammon board, baseboard, basketball backboard, big board, billboard, binder's board, binder board, blackboard, board game, board measure, board meeting, board member, board of appeals, board of directors, board of education, board of regents, board of trustees

Overview of senses
1. board -- (a committee having supervisory powers; "the board has seven members")
2. board -- (a flat piece of material designed for a special purpose; "he nailed boards across the windows")
3. board, plank -- (a stout length of sawn timber; made in a wide variety of sizes and used for many purposes)
4. display panel, display board, board -- (a board on which information can be displayed to public view)
5. board, gameboard -- (a flat portable surface (usually rectangular) designed for board games; "he got out the board and set up the pieces")
6. board, table -- (food or meals in general; "she sets a fine table"; "room and board")
7. control panel, instrument panel, control board, board, panel -- (an insulated panel containing switches and dials and meters for controlling electrical devices; "he checked the instrument panel"; "suddenly the board lit up like a Christmas tree")
8. circuit board, circuit card, board, card -- (a printed circuit that can be inserted into expansion slots in a computer to increase the computer's capabilities)
9. dining table, board -- (a table at which meals are served; "he helped her clear the dining table"; "a feast was spread upon the board")

Top-level concepts
{act, action, activity}, {animal, fauna}, {artifact}, {attribute, property}, {body, corpus}, {cognition, knowledge}, {communication}, {event, happening}, {feeling, emotion}, {food}, {group, collection}, {location, place}, {motive}, {natural object}, {natural phenomenon}, {person, human being}, {plant, flora}, {possession}, {process}, {quantity, amount}, {relation}, {shape}, {state, condition}, {substance}, {time}

Information Extraction

Types of Information Extraction
• Template filling
• Language reuse
• Biographical information
• Question answering

MUC-4 Example
On October 30, 1989, one civilian was killed in a reported FMLN attack in El Salvador.
INCIDENT: DATE                 30 OCT 89
INCIDENT: LOCATION             EL SALVADOR
INCIDENT: TYPE                 ATTACK
INCIDENT: STAGE OF EXECUTION   ACCOMPLISHED
INCIDENT: INSTRUMENT ID
INCIDENT: INSTRUMENT TYPE
PERP: INCIDENT CATEGORY        TERRORIST ACT
PERP: INDIVIDUAL ID            "TERRORIST"
PERP: ORGANIZATION ID          "THE FMLN"
PERP: ORG. CONFIDENCE          REPORTED: "THE FMLN"
PHYS TGT: ID
PHYS TGT: TYPE
PHYS TGT: NUMBER
PHYS TGT: FOREIGN NATION
PHYS TGT: EFFECT OF INCIDENT
PHYS TGT: TOTAL NUMBER
HUM TGT: NAME
HUM TGT: DESCRIPTION           "1 CIVILIAN"
HUM TGT: TYPE                  CIVILIAN: "1 CIVILIAN"
HUM TGT: NUMBER                1: "1 CIVILIAN"
HUM TGT: FOREIGN NATION
HUM TGT: EFFECT OF INCIDENT    DEATH: "1 CIVILIAN"
HUM TGT: TOTAL NUMBER

Language reuse
[Diagram: an NP such as "Yugoslav President Slobodan Milosevic" is split into a [description] ("Yugoslav President") and an [entity] ("Slobodan Milosevic"); the phrase can then be reused.]

Example
[Diagram: NP "Andrija Hebrang" [entity], followed by a comma and the appositive NP "The Croatian Defense Minister" [description].]

Issues involved
• Text generation depends on lexical resources
• Lexical choice
• Corpus processing vs. manual compilation
• Deliberate decisions by writers
• Difficult to encode by hand
• Dynamically updated (Scott O'Grady)
• No full semantic representation

Named entities
Richard Butler met Tareq Aziz Monday after rejecting Iraqi attempts to set deadlines for finishing his work.
Yitzhak Mordechai will meet Mahmoud Abbas at 7 p.m. (1600 GMT) in Tel Aviv after a 16-month-long impasse in peacemaking.
Sinn Fein deferred a vote on Northern Ireland's peace deal Sunday.
Hundreds of troops patrolled Dili on Friday during the anniversary of Indonesia's 1976 annexation of the territory.

Entities + Descriptions
Chief U.N. arms inspector Richard Butler met Iraq's Deputy Prime Minister Tareq Aziz Monday after rejecting Iraqi attempts to set deadlines for finishing his work.
Israel's Defense Minister Yitzhak Mordechai will meet senior Palestinian negotiator Mahmoud Abbas at 7 p.m. (1600 GMT) in Tel Aviv after a 16-month-long impasse in peacemaking.
Sinn Fein, the political wing of the Irish Republican Army, deferred a vote on Northern Ireland's peace deal Sunday.
Hundreds of troops patrolled Dili, the Timorese capital, on Friday during the anniversary of Indonesia's 1976 annexation of the territory.

Building a database of descriptions
• Size of database: 59,333 entities and 193,228 descriptions as of 08/01/98
• Text processed: 494 MB (ClariNet, Reuters, UPI)
• Length: 1-15 lexical items
• Accuracy: precision 94%, recall 55%

Multiple descriptions per entity
Profile for Ung Huot:
• A senior member
• Cambodia's
• Cambodian foreign minister
• Co-premier
• First prime minister
• Foreign minister
• His excellency
• Mr.
• New co-premier
• New first prime minister
• Newly-appointed first prime minister
• Premier

Language reuse and regeneration
CONCEPTS + CONSTRAINTS = CONSTRUCTS
• Corpus analysis: determining constraints
• Text generation: applying constraints

Language reuse and regeneration
• Understanding: full parsing is expensive
• Generation: expensive to use full parses
• Bypassing certain stages (e.g., syntax)
• Not(!) template-based: still requires extraction, analysis, context identification, modification, and generation
• Factual sentences, sentence fragments
• Reusability of a phrase

Context-dependent solution
Redefining the relation:
DescriptionOf(E,C) = {Di,c | Di,c is a description of E in context C}
If named entity E appears in text and the context is C: insert DescriptionOf(E,C) in the text.
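A toy sketch of the DescriptionOf idea, added for illustration only (the actual system, described on the following slides, learns Ripper rules over much richer features): from an entity's stored descriptions, pick the one whose associated context words best overlap the current context. The profile data below is loosely modeled on the Bill Clinton example that follows, but the context word sets are invented.

```python
# Toy profile: each stored description is paired with context words
# previously seen near it (invented data; the real database held
# ~193K descriptions for ~59K entities).
PROFILES = {
    "Bill Clinton": [
        ("U.S. President", {"foreign", "treaty", "summit"}),
        ("President", {"national", "congress", "domestic"}),
        ("An Arkansas native", {"arkansas", "bomb", "alert"}),
        ("Democratic presidential candidate", {"election", "campaign"}),
    ],
}

def description_of(entity, context_words):
    """Return the stored description whose recorded context
    best overlaps the words around the current mention."""
    candidates = PROFILES.get(entity, [])
    if not candidates:
        return None
    return max(candidates, key=lambda dc: len(dc[1] & context_words))[0]

print(description_of("Bill Clinton", {"election", "polls", "campaign"}))
# -> 'Democratic presidential candidate'
```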
Multiple descriptions per entity
Profile for Bill Clinton:
• U.S. President
• President
• An Arkansas native
• Democratic presidential candidate

Choosing the right description

Bill Clinton                             CONTEXT
U.S. President ....................... foreign relations
President ............................ national affairs
An Arkansas native ................... false bomb alert in AR
Democratic presidential candidate .... elections

Pragmatic and semantic constraints on lexical choice.

Semantic information from WordNet
• All words contribute to the semantic representation
• Only the first sense is used
• What is a synset?

WordNet synset hierarchy
{00001740} entity, something
  {00002086} life form, organism, being, living thing
    {00004123} person, individual, someone, somebody, human
      {06950891} leader
        {07311393} head, chief, top dog
          {07063507} administrator, decision maker
            {07063762} director, manager, managing director

Lexico-semantic matrix (Profile for Ung Huot)
[Table: each description in Ung Huot's profile (A senior member; Cambodia's; Cambodian foreign minister; Co-premier; First prime minister; Foreign minister; His excellency; Mr.; New co-premier; New first prime minister; Newly-appointed first prime minister; Premier; Prime minister) is marked with an X under the word synsets it activates — e.g., {07147929} premier, {07009772} Kampuchean, {07412658} minister, {07087841} associate — and under the corresponding parent synsets.]

Choosing the right description
• Topic approximation by context: words that appear near the entity in the text (bag)
• Name of the entity (set)
• Length of article (continuous)
• Profile: set of all descriptions for that entity (bag), with parent synset offsets for all words wi
• Semantic information: WordNet synset offsets (bag)

Choosing the right description
Ripper feature vector [Cohen 1996]:
(Context, Entity, Description, Length, Profile, Parent) → Classes

Example (training)

T#  Context                                               Description                          Len
1   Election, promised, said, carry, party …              Veteran opposition leader            949
2   Introduced, responsible, running, should, bringing …  South Korea's opposition candidate   629
3   Attend, during, party, time, traditionally …          A front-runner                       535
4   Discuss, making, party, statement, said …             A front-runner                       1114
5   New, party, politics, in, it …                        South Korea's president-elect        449

All five tuples share Entity = Kim Dae-Jung, Profile = {Candidate, chief, policy maker, Korean, …}, Parent = {person, leader, Asian, important person, …}, and Classes = {07136302} {07486519} {07311393} {06950891} {07486079}.
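A sketch of how one training tuple above might be encoded and matched against a learned rule. The exact attribute encoding and the profile-to-synset mapping are not given in the handout, so both are assumptions here; the rule mimics the form of the learned rules shown on the next slide.

```python
# One training example: (Context, Entity, Description, Length,
# Profile, Parent) -> Classes, following the table above.
example = {
    "context": {"election", "promised", "said", "carry", "party"},
    "entity": "Kim Dae-Jung",
    "description": "Veteran opposition leader",
    "length": 949,
    "profile": {"candidate", "chief", "policy maker", "korean"},
    "parent": {"person", "leader", "asian", "important person"},
}

# A rule in the spirit of "{07136302} IF PROFILE ~ P{07136302}
# CONTEXT ~ election LENGTH <= 1000": the profile test, the context
# word, and the length bound are all invented for illustration.
def rule_07136302(ex):
    return ("candidate" in ex["profile"]      # stand-in for PROFILE ~ P{...}
            and "election" in ex["context"]   # CONTEXT ~ election
            and ex["length"] <= 1000)         # LENGTH <= 1000

print(rule_07136302(example))  # True -> add {07136302} to the classes
```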
Sample rules
{07136302} IF PROFILE ~ P{07136302} LENGTH <= 603 LENGTH >= 361 .
{07136302} IF PROFILE ~ P{07136302} CONTEXT ~ presidential LENGTH <= 412 .
{07136302} IF PROFILE ~ P{07136302} CONTEXT ~ nominee CONTEXT ~ during .
{07136302} IF PROFILE ~ P{07136302} CONTEXT ~ case .
{07136302} IF PROFILE ~ P{07136302} LENGTH <= 603 LENGTH >= 390 LENGTH <= 412 .
{07136302} IF PROFILE ~ P{07136302} CONTEXT ~ nominee CONTEXT ~ and .
Total number of rules: 4,085 for 100,000 inputs

Evaluation
• 35,206 tuples; 11,504 distinct entities; 3.06 distinct descriptions per entity (DDPE)
• Training: 90% of corpus (10,353 entities)
• Test: 10% of corpus (1,151 entities)

Evaluation
• Rule format (each matching rule adds constraints):
X → [A] (evidence of A)
Y → [B] (evidence of B)
X Y → [A] [B] (evidence of A and B)
• Classes are in 2^W (the powerset of WordNet nodes)
• P&R are computed on the constraints selected by the system

Definition of precision and recall

Model         System          P        R
[B] [D]       [A] [B] [C]     33.3%    50.0%
[A] [B] [C]   [A] [B] [D]     66.7%    66.7%

(P = fraction of system constraints also in the model; R = fraction of model constraints recovered by the system.)

Precision and recall

                Word nodes only         Word and parent nodes
Training set    Precision   Recall      Precision   Recall
500             64.29%      2.86%       78.57%      2.86%
1000            71.43%      2.86%       85.71%      2.86%
2000            42.86%      40.71%      67.86%      62.14%
5000            59.33%      48.40%      64.67%      53.73%
10000           69.72%      45.04%      74.44%      59.32%
15000           76.24%      44.02%      73.39%      53.17%
20000           76.25%      49.91%      79.08%      58.70%
25000           83.37%      52.26%      82.39%      57.49%
30000           80.14%      50.55%      82.77%      57.66%
50000           83.13%      58.53%      88.87%      63.39%
100000          85.42%      62.81%      89.70%      64.64%
150000          87.07%      63.17%
200000          85.73%      62.86%
250000          87.15%      63.85%

Question Answering

Question answering
Q: When did Nelson Mandela become president of South Africa?
A: 10 May 1994
Q: How tall is the Matterhorn?
A: The institute revised the Matterhorn's height to 14,776 feet 9 inches
Q: How tall is the replica of the Matterhorn at Disneyland?
A: In fact he has climbed the 147-foot Matterhorn at Disneyland every weekend for the last 3 1/2 years
Q: If Iraq attacks a neighboring country, what should the US do?
A: ??

Q: Why did David Koresh ask the FBI for a word processor?
Q: Name the designer of the shoe that spawned millions of plastic imitations, known as "jellies".
Q: What is the brightest star visible from Earth?
Q: What are the Valdez Principles?
Q: Name a film that has won the Golden Bear in the Berlin Film Festival?
Q: Name a country that is developing a magnetic levitation railway system?
Q: Name the first private citizen to fly in space.
Q: What did Shostakovich write for Rostropovich?
Q: What is the term for the sum of all genetic material in a given organism?
Q: What is considered the costliest disaster the insurance industry has ever faced?
Q: What is Head Start?
Q: What was Agent Orange used for during the Vietnam War?
Q: What did John Hinckley do to impress Jodie Foster?
Q: What was the first Gilbert and Sullivan opera?
Q: What did Richard Feynman say upon hearing he would receive the Nobel Prize in Physics?
Q: How did Socrates die?
Q: Why are electric cars less efficient in the north-east than in California?

The TREC evaluation
• Document retrieval
• Eight years
• Information retrieval?
• Corpus: texts and questions

[System architecture: documents are processed by Textract/Resporator and an Indexer into an index; a question goes through Query Processing into the GuruQA search engine, which returns a ranked hit list; AnSel/Werlect then performs answer selection over the hit list. Prager et al. 2000 (SIGIR); Radev et al. 2000 (ANLP/NAACL).]
QA-Token    Question type          Example
PLACE$      Where                  In the Rocky Mountains
COUNTRY$    Where/What country     United Kingdom
STATE$      Where/What state       Massachusetts
PERSON$     Who                    Albert Einstein
ROLE$       Who                    Doctor
NAME$       Who/What/Which         The Shakespeare Festival
ORG$        Who/What               The US Post Office
DURATION$   How long               For 5 centuries
AGE$        How old                30 years old
YEAR$       When/What year         1999
TIME$       When                   In the afternoon
DATE$       When/What date         July 4th, 1776
VOLUME$     How big                3 gallons
AREA$       How big                4 square inches
LENGTH$     How big/long/high      3 miles
WEIGHT$     How big/heavy          25 tons
NUMBER$     How many               1,234.5
METHOD$     How                    By rubbing
RATE$       How much               50 per cent
MONEY$      How much               4 million dollars

<p><NUMBER>1</NUMBER></p>
<p><QUERY>Who is the author of the book, "The Iron Lady: A Biography of Margaret Thatcher"?</QUERY></p>
<p><PROCESSED_QUERY>@excwin(*dynamic* @weight(200 *Iron_Lady) @weight(200 Biography_of_Margaret_Thatcher) @weight(200 Margaret) @weight(100 author) @weight(100 book) @weight(100 iron) @weight(100 lady) @weight(100 :) @weight(100 biography) @weight(100 thatcher) @weight(400 @syn(PERSON$ NAME$)) )</PROCESSED_QUERY></p>
<p><DOC>LA090290-0118</DOC></p>
<p><SCORE>1020.8114</SCORE></p>
<TEXT><p>THE IRON LADY; A <span class="NAME">Biography of Margaret Thatcher</span> by <span class="PERSON">Hugo Young</span> (<span class="ORG">Farrar , Straus & Giroux</span>) The central riddle revealed here is why, as a woman <span class="PLACEDEF">in a man</span>'s world, <span class="PERSON">Margaret Thatcher</span> evinces such an exclusionary attitude toward women.</p></TEXT>

SYN-set                              N     Score   Score/N
PERSON NAME                          30    16.5    55.0%
PLACE COUNTRY STATE NAME PLACEDEF    21    7.08    33.7%
NAME                                 18    3.67    20.4%
DATE YEAR                            18    5.31    29.5%
PERSON ORG NAME ROLE                 19    4.62    24.3%
undefined                            19    11.45   60.3%
NUMBER                               18    8.00    44.4%
PLACE NAME PLACEDEF                  14    10.00   71.4%
PERSON ORG                           10    3.03    30.3%
PLACE NAME PLACEDEF MONEY RATE       6     1.50    25%
ORG NAME                             4     1.25    31.2%
SIZE1                                4     2.50    62.5%
SIZE1 DURATION                       3     0.83    27.7%
STATE COUNTRY                        3     2.00    66.7%
YEAR                                 3     1.33    44.3%
RATE                                 2     1.00    50.0%
TIME DURATION                        2     1.50    75.0%
SIZE1 SIZE2                          1     0.00    0.0%
DURATION                             1     0.00    0.0%
TIME                                 1     0.33    33.3%
DATE                                 1     0       0.00%
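A toy sketch of query construction in the spirit of the processed query shown above: content words are weighted, and the expected answer types (QA-Tokens) are attached as a weighted @syn group. The question-word patterns, stopword list, and weights below are simplified assumptions, not the actual Resporator/GuruQA rules.

```python
import re

# Simplified question-word -> QA-Token mapping, loosely following the
# QA-Token table above (the real rules are much richer).
TYPE_PATTERNS = [
    (r"^who\b", ["PERSON$", "NAME$"]),
    (r"^when\b", ["DATE$", "YEAR$"]),
    (r"^how old\b", ["AGE$"]),
    (r"^how many\b", ["NUMBER$"]),
    (r"^where\b", ["PLACE$", "COUNTRY$", "STATE$"]),
]

def build_query(question):
    """Build a weighted query string in the @weight/@syn style above."""
    q = question.lower().rstrip("?")
    syn = next((toks for pat, toks in TYPE_PATTERNS if re.search(pat, q)),
               ["NAME$"])
    words = [w for w in re.findall(r"[a-z]+", q)
             if w not in {"who", "when", "where", "what", "how",
                          "is", "the", "of", "a", "an"}]
    terms = " ".join(f"@weight(100 {w})" for w in words)
    return f"@excwin(*dynamic* {terms} @weight(400 @syn({' '.join(syn)})))"

print(build_query("Who is the author of the book?"))
# -> @excwin(*dynamic* @weight(100 author) @weight(100 book)
#            @weight(400 @syn(PERSON$ NAME$)))
```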
[Table: the candidate spans extracted for the sample question — including Ollie Matson, Lou Vasquez, Tim O'Donohue, Athletic Director Dave Cowen, Johnny Ceballos, Johnny Hodges, Derric Evans, Johnny Majors, Woodbridge High School, Gary Edwards, O.J. Simpson, South Lake Tahoe, Washington High, Johnny Mathis, and role spans such as "assistant" and "coach" — each labeled with its type (PERSON, ORG, NAME, or ROLE) and the feature values Number, Rspanno, Count, Notinq, Type, Avgdst, and Sscore defined on the next slides, plus the final TOTAL score, which ranges from -7.53 for the top-ranked span down to -259.67 for the lowest-ranked role spans.]

Features (1)
• Number: position of the span among all spans returned. Example: "Lou Vasquez" was the first span returned by GuruQA on the sample question.
• Rspanno: position of the span among all spans returned within the current passage.
• Count: number of spans of any span class retrieved within the current passage.
• Notinq: the number of words in the span that do not appear in the query. Example: Notinq("Woodbridge high school") = 1, because both "high" and "school" appear in the query while "Woodbridge" does not. It is set to -100 when the actual value is 0.

Features (2)
• Type: the position of the span type in the list of potential span types. Example: Type("Lou Vasquez") = 1, because the span type of "Lou Vasquez", namely "PERSON", appears first in the SYN-set "PERSON ORG NAME ROLE".
• Avgdst: the average distance in words between the beginning of the span and the words in the query that also appear in the passage. Example: given the passage "Tim O'Donohue, Woodbridge High School's varsity baseball coach, resigned Monday and will be replaced by assistant Johnny Ceballos, Athletic Director Dave Cowen said." and the span "Tim O'Donohue", the value of Avgdst is 8.
• Sscore: passage relevance as computed by GuruQA.

Combining evidence
TOTAL(span) = -0.3 * number - 0.5 * rspanno + 3.0 * count + 2.0 * notinq - 15.0 * types - 1.0 * avgdst + 1.5 * sscore

Extracted text

Document ID     Score   Extract
LA0531890069    892.5   of O.J. Simpson , Ollie Matson and Johnny Mathis
LA0531890069    890.1   Lou Vasquez , track coach of O.J. Simpson , Ollie
LA0608890181    887.4   Tim O'Donohue , Woodbridge High School 's varsity
LA0608890181    884.1   nny Ceballos , Athletic Director Dave Cowen said.
LA0608890181    880.9   aced by assistant Johnny Ceballos , Athletic Direc
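The TOTAL scoring function above translates directly into code. A short sketch using the weights from the "Combining evidence" slide; the feature values passed in are illustrative, not taken from the table:

```python
def total_score(number, rspanno, count, notinq, types, avgdst, sscore):
    """AnSel's linear combination of span features
    (weights from the 'Combining evidence' slide)."""
    return (-0.3 * number - 0.5 * rspanno + 3.0 * count + 2.0 * notinq
            - 15.0 * types - 1.0 * avgdst + 1.5 * sscore)

# Illustrative feature values for a well-placed PERSON span:
print(total_score(number=1, rspanno=1, count=6, notinq=2,
                  types=1, avgdst=16, sscore=0.025))
# -> about -9.76; candidate spans are then ranked by descending TOTAL.
```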
Results

50 bytes     # cases   Points
First        49        49.00
Second       15        7.50
Third        11        3.67
Fourth       9         2.25
Fifth        4         0.80
TOTAL        88        63.22

250 bytes    # cases   Points
First        71        71.00
Second       16        8.00
Third        11        3.67
Fourth       6         1.50
Fifth        5         1.00
TOTAL        109       85.17

Style and Authorship Analysis

Style and authorship analysis
• Use of nouns, verbs…
• Use of rare words
• Positional and contextual distribution
• Use of alternatives: "and/also", "since/because", "scarcely/hardly"

Sample problem
• 15th-century Latin work "De Imitatione Christi"
• Was it written by Thomas a Kempis or Jean Charlier de Gerson?
• Answer: by Kempis
• Why?

Yule's K characteristic
• Vocabulary richness: a measure of the probability that any randomly selected pair of words will be identical
• K = 10,000 x (M2 - M1) / (M1 x M1)
• M1, M2 - distribution moments
• M1 - total number of usages (words including repetitions)
• M2 - sum over all frequency groups, from 1 to the maximum word frequency, of the number of vocabulary words in each group multiplied by the square of the frequency

Example
• Text consisting of 12 words, where two of the words occur once, two occur twice, and two occur three times.
• M0 = 6 (vocabulary size)
• M1 = 12
• M2 = (2 x 1²) + (2 x 2²) + (2 x 3²) = 28
• K increases as the diversity of the vocabulary decreases.

Example (cont'd)
• Criteria used:
– total vocabulary size
– frequency distribution of the different words
– Yule's K
– the mean frequency of the words in the sample
– the number of nouns unique to a particular sample
• Pearson's coefficient used

Federalist papers
• Published in 1787-1788 to persuade the population of New York state to ratify the new American constitution
• Published under the pseudonym Publius; the three authors were James Madison, John Jay, and Alexander Hamilton.
• Before dying in a duel, Hamilton claimed some portion of the essays.
• It was agreed that Jay wrote 5 essays, Hamilton 43, and Madison 14. Three others were jointly written by Hamilton and Madison, and 12 were disputed.

Method
• Mosteller and Wallace (1963) used Bayesian statistics to determine which papers were written by whom.
• The authors had tried to imitate each other, so sentence length and other easily imitated features are not useful.
• Madison and Hamilton were found to vary in their use of "by" (H) vs. "to" (M), and "enough" (H) vs. "whilst" (M).
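A small sketch, added for illustration, that computes Yule's K directly from the definition above and checks it against the slide's 12-word example:

```python
from collections import Counter

def yules_k(tokens):
    """Yule's K = 10,000 * (M2 - M1) / M1^2, where M1 is the total
    token count and M2 = sum over frequencies f of (V_f * f^2),
    V_f being the number of distinct words occurring f times."""
    freqs = Counter(tokens)               # word -> frequency
    m1 = sum(freqs.values())              # total usages
    groups = Counter(freqs.values())      # frequency f -> V_f
    m2 = sum(v * f * f for f, v in groups.items())
    return 10_000 * (m2 - m1) / (m1 * m1)

# The slide's example: two words once, two twice, two three times.
text = "a b c c d d e e e f f f".split()
print(yules_k(text))  # M1 = 12, M2 = 28 -> K = 10,000 * 16/144 ~ 1111.1
```

Lower diversity (more repetition) raises M2 and therefore K, matching the slide's remark that K increases as vocabulary diversity decreases.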
Cluster Analysis

Clustering
• Idea: find similar objects and group them together
• Examples:
– all news stories on the same topic
– all documents from the same genre or language
• Types of clustering: classification (tracking) and categorization (detection)

Non-hierarchical clustering
• Concept of a centroid
• Document/centroid similarity
• Other parameters:
– number of clusters
– maximum and minimum size for each cluster
– vigilance parameter
– overlap between clusters

Hierarchical clustering
• Similarity matrix (expensive: the SIM matrix needs to be updated after every iteration)
• Average linkage method
• Dendrograms

Introduction
• Abundance of newswire on the Web
• Multiple sources reporting on the same event
• Multiple modalities (speech, text)
• Summarization and filtering

Introduction
• TDT participation: topic detection and tracking
– CIDR
• Multi-document summarization
– statistical, domain-dependent
– knowledge-based (SUMMONS)

Topics and events
• Topic = event (single act) or activity (ongoing action)
• Defined by content, time, and place of occurrence [Allan et al. 1998, Yang et al. 1998]
• Examples:
– Marine fighter pilot's plane cuts cable in Italian Alps (February 3, 1998)
– Eduard Shevardnadze assassination attempt (February 9, 1998)
– Jonesboro shooting (March 24, 1998)

TDT overview
• Event detection: monitoring a continuous stream of news articles and identifying new salient events
• Event tracking: identifying stories that belong to predefined event topics
• [Story segmentation: identifying topic boundaries]

The TDT-2 corpus
• Corpus described in [Doddington et al. 1999, Cieri et al. 1999]
• One hundred topics, 54K stories, 6 sources
• Two newswire sources (AP, NYT); 2 TV stations (ABC, CNN-HN); 2 radio stations (PRI, VOA)
• 11 participants (4 industrial sites, 7 universities)

Detection conditions
• Default:
– Newswire + audio (automatic transcription)
– Deferral period of 10 source files
– Given boundaries for ASR

Description of the system
• Single-pass clustering algorithm
• Normalized, tf*idf-modified, cosine-based similarity between document and centroid
• Detection only, standard evaluation conditions, no deferral

Research problems
• Focus on speedup
• Search space of five experimental parameters
• Tradeoffs between parallelization and accuracy

Vector-based representation
[Figure: documents and a centroid represented as vectors in term space (axes Term 1, Term 2, Term 3).]

Vector-based matching
• The cosine measure:
sim(D,C) = Σ_k (d_k · c_k · idf(k)) / (√(Σ_k d_k²) · √(Σ_k c_k²))
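A compact sketch of the two pieces just described: the idf-weighted cosine measure and single-pass document-to-centroid assignment. This is an illustration under simplified assumptions (plain dictionaries for vectors, a fixed threshold), not the CIDR implementation:

```python
import math

def cosine(doc, cen, idf):
    """idf-weighted cosine between a document and a centroid,
    both given as {term: weight} dictionaries."""
    num = sum(doc[t] * cen[t] * idf.get(t, 1.0) for t in doc if t in cen)
    den = (math.sqrt(sum(w * w for w in doc.values()))
           * math.sqrt(sum(w * w for w in cen.values())))
    return num / den if den else 0.0

def single_pass(docs, idf, threshold=0.1):
    """Assign each document to the best-matching centroid if
    sim > threshold; otherwise start a new cluster."""
    clusters = []  # list of (centroid, member documents)
    for doc in docs:
        best, best_sim = None, threshold
        for cen, members in clusters:
            s = cosine(doc, cen, idf)
            if s > best_sim:
                best, best_sim = (cen, members), s
        if best is None:
            clusters.append((dict(doc), [doc]))   # sim < T: new cluster
        else:
            cen, members = best                   # sim > T: join cluster
            members.append(doc)
            n = len(members)
            for t in set(cen) | set(doc):         # running-mean centroid
                cen[t] = cen.get(t, 0.0) * (n - 1) / n + doc.get(t, 0.0) / n
    return clusters
```

The real system additionally prunes each centroid to its highest-weighted terms (the KEEP/KEEPI parameters below) and restricts similarity to the first DECAY words of each document.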
Description of the system
[Diagrams: each incoming document is compared against the existing centroids; if sim > T the document joins the closest cluster, and if sim < T it starts a new cluster.]

Centroid size
[Sample centroids, as term: weight lists:]
C10062 (N=161): microsoft 3.24, justice 0.93, department 0.88, windows 0.98, corp 0.61, software 0.57, ellison 0.07, hatch 0.06, netscape 0.04, metcalfe 0.02
C00008 (N=113): space 1.98, shuttle 1.17, station 0.75, nasa 0.51, columbia 0.37, mission 0.33, mir 0.30, astronauts 0.14, steering 0.11, safely 0.07
C10007 (N=11): crashes 1.00, safety 0.55, transportation 0.55, drivers 0.45, board 0.36, flight 0.27, buckle 0.27, pittsburgh 0.18, graduating 0.18, automobile 0.18
C00022 (N=44): diana 1.93, princess 1.52
C00026 (N=10): universe 1.50, expansion 1.00, bang 0.90
C00025 (N=19): albanians 3.00
C00035 (N=22): airlines 1.45, finnair 0.45
C00031 (N=34): el 1.85, nino 1.56

Parameter space
• Similarity
– DECAY: number of words at the beginning of the document considered in computing vector similarities (50-1000)
– IDF: minimum idf value for a word to be considered (1-10)
– SIM: similarity threshold (0.01-0.25)
• Centroids
– KEEPI: keep all words whose tf*idf scores are above a certain threshold (1-10)
– KEEP: keep at least that many words in the centroid (1-50)

Parameter selection (dev-test)
[Figure: parameter selection results on the development-test set; omitted.]

Cluster stability
[Two centroids compared after clustering 10,000 vs. 22,443 documents:]

Suharto cluster
10000 docs: suharto 2.48, jakarta 0.58, habibie 0.47, students 0.45, student 0.22, protesters 0.20, asean 0.11, campuses 0.05, geertz 0.04, medan 0.04
22443 docs: suharto 2.61, jakarta 0.58, habibie 0.53, students 0.43, student 0.21, protesters 0.19, asean 0.10, campuses 0.04, geertz 0.04, medan 0.04

Microsoft cluster
10000 docs: microsoft 3.31, justice 1.06, department 1.01, windows 0.90, corp 0.60, software 0.51, ellison 0.09, hatch 0.06, netscape 0.05, metcalfe 0.03
22443 docs: microsoft 3.24, justice 0.93, department 0.88, windows 0.98, corp 0.61, software 0.57, ellison 0.07, hatch 0.06, netscape 0.04, metcalfe 0.03

Parallelization
[Figures: the document stream is split across parallel clustering processes C(P); diagrams of the parallel architecture omitted.]

Evaluation principles
CDet(R,H) = CMiss · PMiss(R,H) · Ptopic + CFalseAlarm · PFalseAlarm(R,H) · (1 - Ptopic)
CMiss = 1
CFalseAlarm = 1
PMiss(R,H) = NMiss(R,H) / |R|
PFalseAlarm(R,H) = NFalseAlarm(R,H) / |S - R|
Ptopic = 0.02 (a priori probability)
R - set of stories in a reference (target) topic
H - set of stories in a system-defined topic
S - set of stories to be scored in the evaluation corpus
Task: determine H(R) = argmin_H { CDet(R,H) }
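The detection cost above translates directly into code. A sketch with the official cost constants; the story sets are invented for illustration:

```python
def c_det(R, H, S, c_miss=1.0, c_fa=1.0, p_topic=0.02):
    """TDT detection cost: weighted miss and false-alarm rates.
    R: reference topic stories, H: system topic stories,
    S: all scored stories (R and H are subsets of S)."""
    p_miss = len(R - H) / len(R)    # on-topic stories the system missed
    p_fa = len(H - R) / len(S - R)  # off-topic stories wrongly included
    return c_miss * p_miss * p_topic + c_fa * p_fa * (1 - p_topic)

S = set(range(100))        # 100 stories (illustrative)
R = set(range(10))         # 10 on-topic stories
H = {0, 1, 2, 3, 4, 50}    # system cluster: 5 hits, 1 false alarm
print(c_det(R, H, S))      # 0.02*(5/10) + 0.98*(1/90) ~ 0.0209
```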
Official results
[Figure: official TDT-2 detection results; omitted.]

Results

                                      Story-weighted              Topic-weighted
#  Parallel  Sim  Decay  Idf  Keep   P(miss)  P(fa)   Cdet      P(miss)  P(fa)   Cdet
1  yes       .1   100    3    10     0.3861   0.0018  0.0095    0.3309   0.0018  0.0084
2  no        .1   100    3    10     0.3164   0.0014  0.0077    0.3139   0.0014  0.0077
3  no        .1   100    2    10     0.3178   0.0014  0.0077    0.2905   0.0014  0.0072
4  no        .1   50     3    10     0.5045   0.0014  0.0114    0.3201   0.0014  0.0077

Novelty detection

<DOCID> reute960109.0101 </DOCID>
<HEADER> reute 01-09 0057 </HEADER>
...
German court convicts Vogel of extortion
BERLIN, Jan 9 (Reuter) - A German court on Tuesday convicted Wolfgang Vogel, the East Berlin lawyer famous for organising Cold War spy swaps, on charges that he extorted money from would-be East German emigrants. The Berlin court gave him a two-year suspended jail sentence and a fine -- less than the 3 3/8 years prosecutors had sought.

<DOCID> reute960109.0201 </DOCID>
<HEADER> reute 01-09 0582 </HEADER>
...
East German spy-swap lawyer convicted of extortion
BERLIN (Reuter) - The East Berlin lawyer who became famous for engineering Cold War spy swaps, Wolfgang Vogel, was convicted by a German court Tuesday of extorting money from East German emigrants eager to flee to the West. Vogel, a close confidant of former East German leader Erich Honecker and one of the Soviet bloc's rare millionaires, was found guilty of perjury, four counts of blackmail and five counts of falsifying documents. The Berlin court gave him the two-year suspended sentence and a $63,500 fine. Prosecutors had pressed for a jail sentence of 3 3/8 years and a $215,000 penalty...
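The two stories above report the same event with different amounts of detail. A minimal sketch of a word-overlap novelty test, in the single-pass spirit of the clustering slides; the tokenization and Jaccard threshold are simplifications invented for illustration:

```python
import re

def content_words(text):
    """Crude content-word extraction: lowercase words of 4+ letters."""
    return set(re.findall(r"[a-z]{4,}", text.lower()))

def is_novel(story, seen_stories, threshold=0.3):
    """Flag a story as a new event if its word overlap with every
    previously seen story stays below the threshold (Jaccard)."""
    words = content_words(story)
    for old in seen_stories:
        old_words = content_words(old)
        jaccard = len(words & old_words) / len(words | old_words)
        if jaccard >= threshold:
            return False        # close enough to a known event
    return True                 # first story on a new event

story1 = "A German court convicted Wolfgang Vogel of extortion."
story2 = "Spy-swap lawyer Wolfgang Vogel was convicted of extortion."
print(is_novel(story1, []))          # True  -> new event
print(is_novel(story2, [story1]))    # False -> same event as story1
```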