Language and Information
Handout #4
November 9, 2000
(C) 2000, The University of Michigan
Course Information

• Instructor: Dragomir R. Radev (radev@si.umich.edu)
• Office: 305A, West Hall
• Phone: (734) 615-5225
• Office hours: TTh 3-4
• Course page: http://www.si.umich.edu/~radev/760
• Class meets on Thursdays, 5-8 PM in 311 West Hall
Readings
• Textbook:
– Oakes Ch.3: 95-96, 110-120
– Oakes Ch.4: 149-150, 158-166, 182-189
– Oakes Ch.5: 199-212, 221-223, 236-247
• Additional readings
– Knight “Statistical Machine Translation Workbook”
(http://www.clsp.jhu.edu/ws99/)
– McKeown & Radev “Collocations”
– Optional: M&S chapters 4, 5, 6, 13, 14
Statistical Machine Translation
and Language Modeling
The Noisy Channel Model
• Source-channel model of communication
• Parametric probabilistic models of language
and translation
• Training such models
Statistics

• Given f, guess e

  e → [E→F encoder] → f → [F→E decoder] → e'

  e' = argmax_e P(e|f) = argmax_e P(f|e) P(e)

  where P(f|e) is the translation model and P(e) is the language model.
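The decision rule above is directly implementable. Below is a minimal Python sketch; the translation-model and language-model tables are toy values invented purely for illustration:

# Noisy-channel decoding: pick the e maximizing P(f|e) * P(e).
# The probability tables here are hypothetical toy values.
tm = {("la", "the"): 0.7, ("la", "it"): 0.3}   # translation model P(f|e)
lm = {"the": 0.6, "it": 0.4}                   # language model P(e)

def decode(f, candidates):
    """Return e' = argmax_e P(f|e) * P(e)."""
    return max(candidates, key=lambda e: tm.get((f, e), 0.0) * lm.get(e, 0.0))

print(decode("la", ["the", "it"]))   # -> the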
Parametric probabilistic models

• Language model (LM)

  P(e) = P(e1, e2, …, eL) = P(e1) P(e2|e1) … P(eL|e1 … eL-1)

• Deleted interpolation

  P(eL|e1 … eL-1) ≈ P(eL|eL-2, eL-1)

• Translation model (TM)

  Alignment: P(f,a|e)
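A hedged sketch of how interpolation of this kind can be realized for a trigram LM. The lambda weights would normally be estimated (e.g., by EM) on held-out data; fixed constants are assumed here for illustration:

# Interpolated trigram probability: a weighted mix of trigram, bigram,
# and unigram relative-frequency estimates (tri, bi, uni are dicts).
def interpolated(w, u, v, tri, bi, uni, lambdas=(0.6, 0.3, 0.1)):
    """P(w | u, v) ~ l3*P(w|u,v) + l2*P(w|v) + l1*P(w)."""
    l3, l2, l1 = lambdas
    return (l3 * tri.get((u, v, w), 0.0)
            + l2 * bi.get((v, w), 0.0)
            + l1 * uni.get(w, 0.0))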
IBM's EM-trained models

1. Word translation
2. Local alignment
3. Fertilities
4. Class-based alignment
5. Non-deficient algorithm (avoids overlaps and overflow)
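As one concrete instance, the word-translation model (Model 1) can be trained with EM in a few lines. This is a minimal sketch on an invented two-sentence corpus, not IBM's full implementation; the later models add alignment, fertility, and class-based components on top of it:

from collections import defaultdict

corpus = [(["das", "haus"], ["the", "house"]),
          (["das", "buch"], ["the", "book"])]

e_vocab = {e for _, es in corpus for e in es}
t = defaultdict(lambda: 1.0 / len(e_vocab))   # uniform init of t(f|e)

for _ in range(10):                  # EM iterations
    count = defaultdict(float)       # expected counts c(f,e)
    total = defaultdict(float)       # marginals c(e)
    for fs, es in corpus:
        for f in fs:
            z = sum(t[(f, e)] for e in es)   # normalize over alignments
            for e in es:
                p = t[(f, e)] / z
                count[(f, e)] += p
                total[e] += p
    for (f, e), c in count.items():  # M-step: re-estimate t(f|e)
        t[(f, e)] = c / total[e]

print(round(t[("das", "the")], 3))   # "das" aligns to "the" in both pairs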
Lexical Semantics
and WordNet
Meanings of words
• Lexemes, lexicon, sense(s)
• Examples:
– Red, n: the color of blood or a ruby
– Blood, n: the red liquid that circulates in the heart, arteries
and veins of animals
– Right, adj: located nearer the right hand esp. being on the
right when facing the same direction as the observer
• Do dictionaries give us definitions?
Relations among words

• Homonymy:
  – Instead, a bank can hold the investments in a custodial account in the client's name.
  – But as agriculture burgeons on the east bank, the river will shrink even more.
• Other examples: be/bee?, wood/would?
• Homophones
• Homographs
• Applications: spelling correction, speech recognition, text-to-speech
• Example: Un ver vert va vers un verre vert. ("A green worm goes toward a green glass.")
Polysemy
• They rarely serve red meat, preferring to prepare seafood,
poultry, or game birds.
• He served as U.S. ambassador to Norway in 1976 and 1977.
• He might have served his time, come out and led an
upstanding life.
• Homonymy: distinct and unrelated meanings, possibly with different etymology (multiple lexemes).
• Polysemy: a single lexeme with multiple related meanings.
• Example: an "idea bank"
Synonymy

• Principle of substitutability
• How big is this plane?
• Would I be flying on a large or small plane?
• Miss Nelson, for instance, became a kind of big sister to Mrs. Van Tassel's son, Benjamin.
• ?? Miss Nelson, for instance, became a kind of large sister to Mrs. Van Tassel's son, Benjamin.
• What is the cheapest first class fare?
• ?? What is the cheapest first class cost?
Semantic Networks
• Used to represent relationships between
words
• Example: WordNet - created by George
Miller’s team at Princeton
(http://www.cogsci.princeton.edu/~wn)
• Based on synsets (synonyms,
interchangeable words) and lexical matrices
Lexical matrix

                  Word Forms
Word Meanings   F1     F2     F3     …      Fn
M1              E1,1   E1,2
M2                     E2,2
…
Mm                                          Em,n
Synsets
• Disambiguation
– {board, plank}
– {board, committee}
• Synonyms
– substitution
– weak substitution
– synonyms must be of the same part of speech
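WordNet ships with a command-line browser (the `wn` session below); the same synsets can also be inspected programmatically. A small sketch using NLTK's WordNet interface, assuming nltk is installed and the wordnet data has been downloaded via nltk.download('wordnet'):

from nltk.corpus import wordnet as wn

# Print each noun synset of "board" with its gloss and direct hypernyms.
for s in wn.synsets('board', pos=wn.NOUN):
    print(s.name(), '-', s.definition())
    print('   =>', [h.name() for h in s.hypernyms()])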
$ ./wn board -hypen
Synonyms/Hypernyms (Ordered by Frequency) of noun board
9 senses of board
Sense 1
board
=> committee, commission
=> administrative unit
=> unit, social unit
=> organization, organisation
=> social group
=> group, grouping
Sense 2
board
=> sheet, flat solid
=> artifact, artefact
=> object, physical object
=> entity, something
Sense 3
board, plank
=> lumber, timber
=> building material
=> artifact, artefact
=> object, physical object
=> entity, something
Sense 4
display panel, display board, board
=> display
=> electronic device
=> device
=> instrumentality, instrumentation
=> artifact, artefact
=> object, physical object
=> entity, something
Sense 5
board, gameboard
=> surface
=> artifact, artefact
=> object, physical object
=> entity, something
Sense 6
board, table
=> fare
=> food, nutrient
=> substance, matter
=> object, physical object
=> entity, something
Sense 7
control panel, instrument panel, control board, board, panel
=> electrical device
=> device
=> instrumentality, instrumentation
=> artifact, artefact
=> object, physical object
=> entity, something
Sense 8
circuit board, circuit card, board, card
=> printed circuit
=> computer circuit
=> circuit, electrical circuit, electric circuit
=> electrical device
=> device
=> instrumentality, instrumentation
=> artifact, artefact
=> object, physical object
=> entity, something
Sense 9
dining table, board
=> table
=> furniture, piece of furniture, article of furniture
=> furnishings
=> instrumentality, instrumentation
=> artifact, artefact
=> object, physical object
=> entity, something
Antonymy
• “x” vs. “not-x”
• “rich” vs. “poor”?
• {rise, ascend} vs. {fall, descend}
Other relations
• Meronymy: X is a meronym of Y when
native speakers of English accept sentences
similar to “X is a part of Y”, “X is a
member of Y”.
• Hyponymy: {tree} is a hyponym of {plant}.
• Hierarchical structure based on hyponymy
(and hypernymy).
Other features of WordNet
• Index of familiarity
• Polysemy
Familiarity and polysemy
board used as a noun is familiar (polysemy count = 9)
bird used as a noun is common (polysemy count = 5)
cat used as a noun is common (polysemy count = 7)
house used as a noun is familiar (polysemy count = 11)
information used as a noun is common (polysemy count = 5)
retrieval used as a noun is uncommon (polysemy count = 3)
serendipity used as a noun is very rare (polysemy count = 1)
Compound nouns

advisory board
appeals board
backboard
backgammon board
baseboard
basketball backboard
big board
billboard
binder's board
binder board
blackboard
board game
board measure
board meeting
board member
board of appeals
board of directors
board of education
board of regents
board of trustees
Overview of senses
1. board -- (a committee having supervisory powers; "the board has seven members")
2. board -- (a flat piece of material designed for a special purpose; "he nailed boards across the
windows")
3. board, plank -- (a stout length of sawn timber; made in a wide variety of sizes and used for
many purposes)
4. display panel, display board, board -- (a board on which information can be displayed to public
view)
5. board, gameboard -- (a flat portable surface (usually rectangular) designed for board games; "he
got out the board and set up the pieces")
6. board, table -- (food or meals in general; "she sets a fine table"; "room and board")
7. control panel, instrument panel, control board, board, panel -- (an insulated panel containing
switches and dials and meters for controlling electrical devices; "he checked the instrument
panel"; "suddenly the board lit up like a Christmas tree")
8. circuit board, circuit card, board, card -- (a printed circuit that can be inserted into expansion
slots in a computer to increase the computer's capabilities)
9. dining table, board -- (a table at which meals are served; "he helped her clear the dining table";
"a feast was spread upon the board")
Top-level concepts

{act, action, activity}
{animal, fauna}
{artifact}
{attribute, property}
{body, corpus}
{cognition, knowledge}
{communication}
{event, happening}
{feeling, emotion}
{food}
{group, collection}
{location, place}
{motive}
{natural object}
{natural phenomenon}
{person, human being}
{plant, flora}
{possession}
{process}
{quantity, amount}
{relation}
{shape}
{state, condition}
{substance}
{time}
Information Extraction
Types of Information Extraction

• Template filling
• Language reuse
• Biographical information
• Question answering
MUC-4 Example

On October 30, 1989, one civilian was killed in a
reported FMLN attack in El Salvador.

INCIDENT: DATE                 30 OCT 89
INCIDENT: LOCATION             EL SALVADOR
INCIDENT: TYPE                 ATTACK
INCIDENT: STAGE OF EXECUTION   ACCOMPLISHED
INCIDENT: INSTRUMENT ID        -
INCIDENT: INSTRUMENT TYPE      -
PERP: INCIDENT CATEGORY        TERRORIST ACT
PERP: INDIVIDUAL ID            "TERRORIST"
PERP: ORGANIZATION ID          "THE FMLN"
PERP: ORG. CONFIDENCE          REPORTED: "THE FMLN"
PHYS TGT: ID                   -
PHYS TGT: TYPE                 -
PHYS TGT: NUMBER               -
PHYS TGT: FOREIGN NATION       -
PHYS TGT: EFFECT OF INCIDENT   -
PHYS TGT: TOTAL NUMBER         -
HUM TGT: NAME                  -
HUM TGT: DESCRIPTION           "1 CIVILIAN"
HUM TGT: TYPE                  CIVILIAN: "1 CIVILIAN"
HUM TGT: NUMBER                1: "1 CIVILIAN"
HUM TGT: FOREIGN NATION        -
HUM TGT: EFFECT OF INCIDENT    DEATH: "1 CIVILIAN"
HUM TGT: TOTAL NUMBER          -
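A toy illustration of template filling: two hand-written regular expressions pulling the date, perpetrator, and location out of the example sentence. Real MUC systems used far richer cascades of finite-state patterns; this is only a sketch:

import re

text = ("On October 30, 1989, one civilian was killed in a "
        "reported FMLN attack in El Salvador.")

template = {}
m = re.search(r"On (\w+ \d{1,2}, \d{4})", text)
if m:
    template["INCIDENT: DATE"] = m.group(1)
m = re.search(r"reported (\w+) attack in ([\w ]+)\.", text)
if m:
    template["PERP: ORGANIZATION ID"] = m.group(1)
    template["INCIDENT: LOCATION"] = m.group(2)
print(template)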
Language reuse

[Figure: an NP, "Yugoslav President Slobodan Milosevic", analyzed as
[description: Yugoslav President] [entity: Slobodan Milosevic], the
phrase to be reused]
Example

[Figure: NP -> NP Punc NP, as in "Andrija Hebrang [entity] , the
Croatian Defense Minister [description]"]
Issues involved

• Text generation depends on lexical resources
• Lexical choice
• Corpus processing vs. manual compilation
• Deliberate decisions by writers
• Difficult to encode by hand
• Dynamically updated (Scott O'Grady)
• No full semantic representation
Named entities
Richard Butler met Tareq Aziz Monday after rejecting Iraqi
attempts to set deadlines for finishing his work.
Yitzhak Mordechai will meet Mahmoud Abbas at 7 p.m.
(1600 GMT) in Tel Aviv after a 16-month-long impasse in
peacemaking.
Sinn Fein deferred a vote on Northern Ireland's peace deal
Sunday.
Hundreds of troops patrolled Dili on Friday during the
anniversary of Indonesia's 1976 annexation of the territory.
Entities + Descriptions
Chief U.N. arms inspector Richard Butler met Iraq’s Deputy
Prime Minister Tareq Aziz Monday after rejecting Iraqi attempts
to set deadlines for finishing his work.
Israel's Defense Minister Yitzhak Mordechai will meet senior
Palestinian negotiator Mahmoud Abbas at 7 p.m. (1600 GMT)
in Tel Aviv after a 16-month-long impasse in peacemaking.
Sinn Fein, the political wing of the Irish Republican Army,
deferred a vote on Northern Ireland's peace deal Sunday.
Hundreds of troops patrolled Dili, the Timorese capital, on
Friday during the anniversary of Indonesia's 1976 annexation of
the territory.
Building a database of descriptions
• Size of database: 59,333 entities and
193,228 descriptions as of 08/01/98
• Text processed: 494 MB (ClariNet,
Reuters, UPI)
• Length: 1-15 lexical items
• Accuracy: (precision 94%, recall 55%)
Multiple descriptions per entity
Ung Huot
A senior member
Cambodia’s
Cambodian foreign minister
Co-premier
First prime minister
Foreign minister
His excellency
Mr.
New co-premier
New first prime minister
Newly-appointed first prime minister
Premier
Profile for Ung Huot
Language reuse and regeneration

CONCEPTS + CONSTRAINTS = CONSTRUCTS

Corpus analysis: determining constraints
Text generation: applying constraints
Language reuse and regeneration

• Understanding: full parsing is expensive
• Generation: expensive to use full parses
• Bypassing certain stages (e.g., syntax)
• Not(!) template-based: still requires extraction, analysis, context identification, modification, and generation
• Factual sentences, sentence fragments
• Reusability of a phrase
Context-dependent solution

Redefining the relation:

  DescriptionOf(E,C) = {Di,C | Di,C is a description of E in context C}

If named entity E appears in text and the context is C:
  insert DescriptionOf(E,C) in text.
Multiple descriptions per entity
Bill Clinton
U.S. President
President
An Arkansas native
Democratic presidential candidate
Profile for Bill Clinton
Choosing the right description
Bill Clinton
CONTEXT
U.S. President …………………………..foreign relations
President ………………………………… national affairs
An Arkansas native ……………....false bomb alert in AR
Democratic presidential candidate …………….. elections
Pragmatic and semantic constraints on lexical choice.
Semantic information from
WordNet
• All words contribute to the semantic representation
• Only the first sense is used
• What is a synset?
WordNet synset hierarchy

{00001740} entity, something
  {00002086} life form, organism, being, living thing
    {00004123} person, individual, someone, somebody, human
      {06950891} leader
        {07311393} head, chief, top dog
          {07063507} administrator, decision maker
            {07063762} director, manager, managing director
Lexico-semantic matrix

[Table: the rows are the descriptions in the profile for Ung Huot (A senior
member; Cambodia's; Cambodian foreign minister; Co-premier; First prime
minister; Foreign minister; His excellency; Mr.; New co-premier; New first
prime minister; Newly-appointed first prime minister; Premier; Prime
minister); the columns are word synsets and their parent synsets, e.g.
{07147929} premier, {07009772} Kampuchean, {07412658} minister, {07087841}
associate. An X marks each synset evoked by a word of the description,
e.g. "First prime minister" is marked under both premier and minister.]
Choosing the right description

• Topic approximation by context: words that appear near the entity in the text (bag)
• Name of the entity (set)
• Length of article (continuous)
• Profile: set of all descriptions for that entity (bag) - parent synset offsets for all words wi
• Semantic information: WordNet synset offsets (bag)
Choosing the right description

Ripper feature vector [Cohen 1996]:

  (Context, Entity, Description, Length, Profile, Parent) → Classes
Example (training)

All five training tuples are for the entity Kim Dae-Jung and share the same
Profile (Candidate, chief, policy maker, Korean ...), Parent (person, leader,
Asian, important person ...), and Classes ({07136302} {07486519} {07311393}
{06950891} {07486079}).

T#  Context                                          Description                     Len
1   Election, promised, said, carry, party ...       Veteran opposition leader       949
2   Introduced, responsible, running, should,        South Korea's opposition        629
    bringing ...                                     candidate
3   Attend, during, party, time, traditionally ...   A front-runner                  535
4   Discuss, making, party, statement, said ...      A front-runner                  1114
5   New, party, politics, in, it ...                 South Korea's president-elect   449
Sample rules
{07136302} IF PROFILE ~ P{07136302} LENGTH <= 603 LENGTH >= 361 .
{07136302} IF PROFILE ~ P{07136302} CONTEXT ~ presidential LENGTH <= 412 .
{07136302} IF PROFILE ~ P{07136302} CONTEXT ~ nominee CONTEXT ~ during .
{07136302} IF PROFILE ~ P{07136302} CONTEXT ~ case .
{07136302} IF PROFILE ~ P{07136302} LENGTH <= 603 LENGTH >= 390 LENGTH <= 412 .
{07136302} IF PROFILE ~ P{07136302} CONTEXT ~ nominee CONTEXT ~ and .

Total number of rules: 4085 for 100,000 inputs
Evaluation

• 35,206 tuples; 11,504 distinct entities; 3.06 DDPE (descriptions per entity)
• Training: 90% of corpus (10,353 entities)
• Test: 10% of corpus (1,151 entities)
Evaluation

• Rule format (each matching rule adds constraints):

    X    → [A]        (evidence of A)
    Y    → [B]        (evidence of B)
    X Y  → [A] [B]    (evidence of A and B)

• Classes are in 2^W (the powerset of WordNet nodes)
• P&R on the constraints selected by the system
Definition of precision and recall

Model         System        P       R
[B] [D]       [A] [B] [C]   33.3%   50.0%
[A] [B] [C]   [A] [B] [D]   66.7%   66.7%
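The two rows above follow from the usual set-based definitions. A minimal sketch that reproduces the first row:

# Precision = correct/|system|, recall = correct/|model|, on constraint sets.
def precision_recall(model, system):
    correct = len(model & system)
    return correct / len(system), correct / len(model)

p, r = precision_recall({"B", "D"}, {"A", "B", "C"})
print(f"P = {p:.1%}, R = {r:.1%}")   # P = 33.3%, R = 50.0%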
Precision and recall

                Word nodes only         Word and parent nodes
Training set    Precision   Recall      Precision   Recall
500             64.29%      2.86%       78.57%      2.86%
1000            71.43%      2.86%       85.71%      2.86%
2000            42.86%      40.71%      67.86%      62.14%
5000            59.33%      48.40%      64.67%      53.73%
10000           69.72%      45.04%      74.44%      59.32%
15000           76.24%      44.02%      73.39%      53.17%
20000           76.25%      49.91%      79.08%      58.70%
25000           83.37%      52.26%      82.39%      57.49%
30000           80.14%      50.55%      82.77%      57.66%
50000           83.13%      58.53%      88.87%      63.39%
100000          85.42%      62.81%      89.70%      64.64%
150000          87.07%      63.17%
200000          85.73%      62.86%
250000          87.15%      63.85%
Question Answering
Question answering
Q: When did Nelson Mandela become president of South Africa?
A: 10 May 1994
Q: How tall is the Matterhorn?
A: The institute revised the Matterhorn 's height to 14,776 feet 9 inches
Q: How tall is the replica of the Matterhorn at Disneyland?
A: In fact he has climbed the 147-foot Matterhorn at Disneyland every week
end for the last 3 1/2 years
Q: If Iraq attacks a neighboring country, what should the US do?
A: ??
Q: Why did David Koresh ask the FBI for a word processor?
Q: Name the designer of the shoe that spawned millions of plastic imitations, known as "jellies".
Q: What is the brightest star visible from Earth?
Q: What are the Valdez Principles?
Q: Name a film that has won the Golden Bear in the Berlin Film Festival?
Q: Name a country that is developing a magnetic levitation railway system?
Q: Name the first private citizen to fly in space.
Q: What did Shostakovich write for Rostropovich?
Q: What is the term for the sum of all genetic material in a given organism?
Q: What is considered the costliest disaster the insurance industry has ever faced?
Q: What is Head Start?
Q: What was Agent Orange used for during the Vietnam War?
Q: What did John Hinckley do to impress Jodie Foster?
Q: What was the first Gilbert and Sullivan opera?
Q: What did Richard Feynman say upon hearing he would receive the Nobel Prize in Physics?
Q: How did Socrates die?
Q: Why are electric cars less efficient in the north-east than in California?
The TREC evaluation

• Document retrieval
• Eight years
• Information retrieval?
• Corpus: texts and questions
Textract

[Figure: QA system pipeline. Documents pass through Textract/Resporator
and the Indexer into an Index; a question goes through Query Processing
to produce a search query for GuruQA, which returns a ranked hit list;
AnSel/Werlect then performs answer selection over the hit list.]

Prager et al. 2000 (SIGIR); Radev et al. 2000 (ANLP/NAACL)
QA-Token    Question type        Example
PLACE$      Where                In the Rocky Mountains
COUNTRY$    Where/What country   United Kingdom
STATE$      Where/What state     Massachusetts
PERSON$     Who                  Albert Einstein
ROLE$       Who                  Doctor
NAME$       Who/What/Which       The Shakespeare Festival
ORG$        Who/What             The US Post Office
DURATION$   How long             For 5 centuries
AGE$        How old              30 years old
YEAR$       When/What year       1999
TIME$       When                 In the afternoon
DATE$       When/What date       July 4th, 1776
VOLUME$     How big              3 gallons
AREA$       How big              4 square inches
LENGTH$     How big/long/high    3 miles
WEIGHT$     How big/heavy        25 tons
NUMBER$     How many             1,234.5
METHOD$     How                  By rubbing
RATE$       How much             50 per cent
MONEY$      How much             4 million dollars
<p><NUMBER>1</NUMBER></p>
<p><QUERY>Who is the author of the book, "The Iron Lady: A
Biography of Margaret Thatcher"?</QUERY></p>
<p><PROCESSED_QUERY>@excwin(*dynamic* @weight(200
*Iron_Lady) @weight(200 Biography_of_Margaret_Thatcher)
@weight(200 Margaret) @weight(100 author) @weight(100
book) @weight(100 iron) @weight(100 lady) @weight(100 :)
@weight(100 biography) @weight(100 thatcher) @weight(400
@syn(PERSON$ NAME$)) )</PROCESSED_QUERY></p>
<p><DOC>LA090290-0118</DOC></p>
<p><SCORE>1020.8114</SCORE></p>
<TEXT><p>THE IRON LADY; A <span class="NAME">Biography of
Margaret Thatcher</span> by <span class="PERSON">Hugo
Young</span> (<span class="ORG">Farrar , Straus &
Giroux</span>) The central riddle revealed here is why, as
a woman <span class="PLACEDEF">in a man</span>'s world,
<span class="PERSON">Margaret Thatcher</span> evinces such
an exclusionary attitude toward women.</p></TEXT>
SYN-set

SYN-set                             N    Score   Score/N
PERSON NAME                         30   16.5    55.0%
PLACE COUNTRY STATE NAME PLACEDEF   21   7.08    33.7%
NAME                                18   3.67    20.4%
DATE YEAR                           18   5.31    29.5%
PERSON ORG NAME ROLE                19   4.62    24.3%
undefined                           19   11.45   60.3%
NUMBER                              18   8.00    44.4%
PLACE NAME PLACEDEF                 14   10.00   71.4%
PERSON ORG PLACE NAME PLACEDEF      10   3.03    30.3%
MONEY RATE                          6    1.50    25.0%
ORG NAME                            4    1.25    31.2%
SIZE1                               4    2.50    62.5%
SIZE1 DURATION                      3    0.83    27.7%
STATE                               3    2.00    66.7%
COUNTRY                             3    1.33    44.3%
YEAR                                2    1.00    50.0%
RATE                                2    1.50    75.0%
TIME DURATION                       1    0.00    0.0%
SIZE1 SIZE2                         1    0.00    0.0%
DURATION TIME                       1    0.33    33.3%
DATE                                1    0.00    0.0%
Span                                 Type    Number  Rspanno  Count  Notinq  Type  Avgdst  Sscore   TOTAL
Ollie Matson                         PERSON  3       3        6      2       1     12      0.02507  -7.53
Lou Vasquez                          PERSON  1       1        6      2       1     16      0.02507  -9.93
Tim O'Donohue                        PERSON  17      1        4      2       1     8       0.02257  -12.57
Athletic Director Dave Cowen         PERSON  23      6        4      4       1     11      0.02257  -15.87
Johnny Ceballos                      PERSON  22      5        4      1       1     9       0.02257  -19.07
Civic Center Director Martin Durham  PERSON  13      1        2      5       1     16      0.02505  -19.36
Johnny Hodges                        PERSON  25      2        4      1       1     15      0.02256  -25.22
Derric Evans                         PERSON  33      4        4      2       1     14      0.02256  -25.37
NEWSWIRE Johnny Majors               PERSON  30      1        4      2       1     17      0.02256  -25.47
Woodbridge High School               ORG     18      2        4      1       2     6       0.02257  -28.37
Evan                                 PERSON  37      6        4      1       1     14      0.02256  -29.57
Gary Edwards                         PERSON  38      7        4      2       1     17      0.02256  -30.87
O.J. Simpson                         NAME    2       2        6      2       3     12      0.02507  -37.40
South Lake Tahoe                     NAME    7       5        6      3       3     14      0.02507  -40.06
Washington High                      NAME    10      6        6      1       3     18      0.02507  -49.80
Morgan                               NAME    26      3        4      1       3     12      0.02256  -52.52
Tennessee football                   NAME    31      2        4      1       3     15      0.02256  -56.27
Ellington                            NAME    24      1        4      1       3     20      0.02256  -59.42
assistant                            ROLE    21      4        4      1       4     8       0.02257  -62.77
the Volunteers                       ROLE    34      5        4      2       4     14      0.02256  -71.17
Johnny Mathis                        PERSON  4       4        6      -100    1     11      0.02507  -211.33
Mathis                               NAME    14      2        2      -100    3     10      0.02505  -254.16
coach                                ROLE    19      3        4      -100    4     4       0.02257  -259.67
Features (1)
• Number: position of the span among all spans returned. Example:
“Lou Vasquez” was the first span returned by GuruQA on the sample
question.
• Rspanno: position of the span among all spans returned within the
current passage.
• Count: number of spans of any span class retrieved within the current
passage.
• Notinq: the number of words in the span that do not appear in the
query. Example: Notinq (“Woodbridge high school”) = 1, because both
“high” and “school” appear in the query while “Woodbridge” does not.
It is set to –100 when the actual value is 0.
Features (2)
• Type: the position of the span type in the list of potential span types.
Example: Type (“Lou Vasquez”) = 1, because the span type of “Lou
Vasquez”, namely “PERSON” appears first in the SYN-set, “PERSON
ORG NAME ROLE”.
• Avgdst: the average distance in words between the beginning of the
span and the words in the query that also appear in the passage.
Example: given the passage “Tim O'Donohue, Woodbridge High
School's varsity baseball coach, resigned Monday and will be replaced
by assistant Johnny Ceballos, Athletic Director Dave Cowen said.”
and the span “Tim O’Donohue”, the value of avgdst is equal to 8.
• Sscore: passage relevance as computed by GuruQA.
Combining evidence

• TOTAL(span) = -0.3 * number - 0.5 * rspanno + 3.0 * count + 2.0 * notinq
                - 15.0 * types - 1.0 * avgdst + 1.5 * sscore
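The linear combination is straightforward to transcribe. Plugging in the "Lou Vasquez" row from the span table gives roughly -9.8, close to the -9.93 shown there (the published coefficients are presumably rounded):

# Weighted sum from the slide; larger is better.
def total_score(number, rspanno, count, notinq, types, avgdst, sscore):
    return (-0.3 * number - 0.5 * rspanno + 3.0 * count + 2.0 * notinq
            - 15.0 * types - 1.0 * avgdst + 1.5 * sscore)

print(total_score(1, 1, 6, 2, 1, 16, 0.02507))   # ~ -9.76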
Extracted text

Document ID    Score   Extract
LA0531890069   892.5   of O.J. Simpson , Ollie Matson and Johnny Mathis
LA0531890069   890.1   Lou Vasquez , track coach of O.J. Simpson , Ollie
LA0608890181   887.4   Tim O'Donohue , Woodbridge High School 's varsity
LA0608890181   884.1   nny Ceballos , Athletic Director Dave Cowen said.
LA0608890181   880.9   aced by assistant Johnny Ceballos , Athletic Direc
Results

50 bytes     First   Second   Third   Fourth   Fifth   TOTAL
# cases      49      15       11      9        4       88
Points       49.00   7.50     3.67    2.25     0.80    63.22

250 bytes    First   Second   Third   Fourth   Fifth   TOTAL
# cases      71      16       11      6        5       109
Points       71.00   8.00     3.67    1.50     1.00    85.17
Style and Authorship Analysis
Style and authorship analysis

• Use of nouns, verbs…
• Use of rare words
• Positional and contextual distribution
• Use of alternatives: "and/also", "since/because", "scarcely/hardly"
Sample problem

• The 15th-century Latin work "De Imitatione Christi"
• Was it written by Thomas a Kempis or Jean Charlier de Gerson?
• Answer: by Kempis
• Why?
Yule's K characteristic

• Vocabulary richness: a measure of the probability that any randomly selected pair of words will be identical

  K = 10,000 × (M2 - M1) / (M1 × M1)

• M1, M2 - distribution moments
• M1 - total number of usages (words including repetitions)
• M2 - sum, over frequency groups from 1 to the maximum word frequency, of the number of vocabulary words in each group multiplied by the square of the frequency
Example

• Text consisting of 12 words, where two of the words occur once, two occur twice, and two occur three times.
• M0 = 6 (vocabulary size)
• M1 = 12
• M2 = (2 × 1²) + (2 × 2²) + (2 × 3²) = 28
• K = 10,000 × (28 - 12) / (12 × 12) ≈ 1111
• K increases as the diversity of the vocabulary decreases.
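The same computation in a few lines of Python, reproducing the worked example (a sketch; summing f² per vocabulary word is equivalent to the per-frequency-group sum above):

from collections import Counter

def yules_k(words):
    freqs = Counter(words)              # word -> frequency
    m1 = sum(freqs.values())            # total tokens
    m2 = sum(f * f for f in freqs.values())   # sum of squared frequencies
    return 10000 * (m2 - m1) / (m1 * m1)

# two words once, two twice, two three times -> M1 = 12, M2 = 28
text = "a b c c d d e e e f f f".split()
print(yules_k(text))   # 10000 * (28 - 12) / 144 ~ 1111.1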
Example (cont'd)

• Five criteria used:
  – total vocabulary size
  – frequency distribution of the different words
  – Yule's K
  – the mean frequency of the words in the sample
  – the number of nouns unique to a particular sample
• Pearson's coefficient used
Federalist papers
• Published in 1787-1788 to persuade the population of New
York state to ratify the new American constitution
• Published under the pseudonym Publius, the three authors
were James Madison, John Jay, and Alexander Hamilton.
• Before dying in a duel, Hamilton claimed some portion of
the essays.
• It was agreed that Jay wrote 5 essays, Hamilton - 43,
Madison - 14. Three others were jointly written by
Hamilton and Madison, and 12 were disputed.
Method

• Mosteller and Wallace (1963) used Bayesian statistics to determine which papers were written by whom.
• The authors had tried to imitate each other, so sentence length and other easily imitated features are not useful.
• Madison and Hamilton were found to vary in their use of "by" (H) and "to" (M), and of "enough" (H) and "whilst" (M).
Cluster Analysis
Clustering
• Idea: find similar objects and group them
together
• Examples:
– all news stories on the same topic
– all documents from the same genre or language
• Types of clustering: classification (tracking)
and categorization (detection)
Non-hierarchical clustering

• Concept of a centroid
• Document/centroid similarity
• Other parameters:
  – number of clusters
  – maximum and minimum size for each cluster
  – vigilance parameter
  – overlap between clusters
Hierarchical clustering

• Similarity matrix (expensive: the SIM matrix needs to be updated after every iteration)
• Average linkage method
• Dendrograms
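A minimal sketch of average-linkage clustering using SciPy, on invented 2-D "document" vectors (assumes scipy is available, and matplotlib for drawing the dendrogram):

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
Z = linkage(X, method='average', metric='cosine')   # average linkage
dendrogram(Z)   # renders the dendrogram when matplotlib is available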
Introduction
• Abundance of newswire on the Web
• Multiple sources reporting on the same
event
• Multiple modalities (speech, text)
• Summarization and filtering
Introduction

• TDT participation: topic detection and tracking
  – CIDR
• Multi-document summarization
  – statistical, domain-dependent
  – knowledge-based (SUMMONS)
Topics and events

• Topic = event (single act) or activity (ongoing action)
• Defined by content, time, and place of occurrence [Allan et al. 1998, Yang et al. 1998]
• Examples:
  – Marine fighter pilot's plane cuts cable in Italian Alps (February 3, 1998)
  – Eduard Shevardnadze assassination attempt (February 9, 1998)
  – Jonesboro shooting (March 24, 1998)
TDT overview
• Event detection: monitoring a continuous stream
of news articles and identifying new salient events
• Event tracking: identifying stories that belong to
predefined event topics
• [Story segmentation: identifying topic
boundaries]
The TDT-2 corpus
• Corpus described in [Doddington et al. 1999, Cieri
et al. 1999]
• One hundred topics, 54K stories, 6 sources
• Two newswire sources (AP, NYT); 2 TV
stations (ABC, CNN-HN); 2 radio stations
(PRI, VOA)
• 11 participants (4 industrial sites, 7
universities)
Detection conditions
• Default:
– Newswire + Audio - automatic transcription
– Deferral period of 10 source files
– Given boundaries for ASR
Description of the system
• Single-pass clustering algorithm
• Normalized, tf*idf-modified, cosine-based
similarity between document and centroid
• detection only, standard evaluation
conditions, no deferral
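A hedged sketch of what such a single-pass clusterer can look like; `sim_fn` stands in for the normalized, idf-weighted cosine defined on the "Vector-based matching" slide below, and documents/centroids are term -> weight dicts:

def single_pass(docs, sim_fn, T=0.1):
    # Each incoming document joins the closest centroid if similarity
    # exceeds the threshold T; otherwise it seeds a new cluster.
    clusters = []
    for doc in docs:
        best, best_sim = None, 0.0
        for cl in clusters:
            s = sim_fn(doc, cl["centroid"])
            if s > best_sim:
                best, best_sim = cl, s
        if best is not None and best_sim > T:
            best["members"].append(doc)
            n = len(best["members"])
            c = best["centroid"]
            for k, v in doc.items():   # fold doc into running centroid
                c[k] = c.get(k, 0.0) + (v - c.get(k, 0.0)) / n
        else:
            clusters.append({"centroid": dict(doc), "members": [doc]})
    return clusters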
Research problems
• focus on speedup
• search space of five experimental
parameters
• tradeoffs between parallelization and
accuracy
Vector-based representation

[Figure: a document vector and a centroid vector in term space, with axes
Term 1, Term 2, and Term 3, separated by an angle a]
Vector-based matching

• The cosine measure:

  sim(D,C) = Σ_k (d_k · c_k · idf(k)) / sqrt(Σ_k d_k² · Σ_k c_k²)
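Transcribed directly into Python (documents and centroids as term -> weight dicts; `idf` is assumed to be a dict of idf values):

import math

def cosine(d, c, idf):
    # sim(D,C) = sum_k d_k * c_k * idf(k) / sqrt(sum_k d_k^2 * sum_k c_k^2)
    num = sum(d[k] * c[k] * idf.get(k, 0.0) for k in d if k in c)
    den = math.sqrt(sum(v * v for v in d.values())
                    * sum(v * v for v in c.values()))
    return num / den if den else 0.0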
Description of the system

[Figure: the incoming document is compared with each centroid; if
sim > T it is added to the most similar cluster, and if sim < T it
forms a new cluster]
Centroid size

C10062 (N=161)        C00008 (N=113)       C10007 (N=11)
microsoft   3.24      space       1.98     crashes        1.00
justice     0.93      shuttle     1.17     safety         0.55
department  0.88      station     0.75     transportation 0.55
windows     0.98      nasa        0.51     drivers        0.45
corp        0.61      columbia    0.37     board          0.36
software    0.57      mission     0.33     flight         0.27
ellison     0.07      mir         0.30     buckle         0.27
hatch       0.06      astronauts  0.14     pittsburgh     0.18
netscape    0.04      steering    0.11     graduating     0.18
metcalfe    0.02      safely      0.07     automobile     0.18
Centroid size

C00022 (N=44)       C00026 (N=10)       C00025 (N=19)
diana      1.93     universe   1.50     albanians  3.00
princess   1.52     expansion  1.00
                    bang       0.90

C00035 (N=22)       C00031 (N=34)
airlines   1.45     el         1.85
finnair    0.45     nino       1.56
Parameter space
• Similarity
– DECAY: Number of words at beginning of document that will be
considered in computing vector similarities (50 - 1000)
– IDF: Minimum value for idf so that a word is considered (1 - 10)
– SIM: Similarity threshold (0.01 - 0.25)
• Centroids
– KEEPI: Keep all words whose tf*idf scores are above a certain
threshold (1-10)
– KEEP: Keep at least that many words in centroid (1-50)
Parameter selection (dev-test)

[Figure: parameter-selection results on the development-test data]
Cluster stability

Suharto cluster    10000 docs   22443 docs
suharto            2.48         2.61
jakarta            0.58         0.58
habibie            0.47         0.53
students           0.45         0.43
student            0.22         0.21
protesters         0.20         0.19
asean              0.11         0.10
campuses           0.05         0.04
geertz             0.04         0.04
medan              0.04         0.04

Microsoft cluster  10000 docs   22443 docs
microsoft          3.31         3.24
justice            1.06         0.93
department         1.01         0.88
windows            0.90         0.98
corp               0.60         0.61
software           0.51         0.57
ellison            0.09         0.07
hatch              0.06         0.06
netscape           0.05         0.04
metcalfe           0.03         0.03
Parallelization

[Figure: parallelization scheme, C(P)]
Evaluation principles

CDet(R,H) = CMiss · PMiss(R,H) · Ptopic + CFalseAlarm · PFalseAlarm(R,H) · (1 - Ptopic)

CMiss = 1
CFalseAlarm = 1
PMiss(R,H) = NMiss(R,H) / |R|
PFalseAlarm(R,H) = NFalseAlarm(R,H) / |S-R|
Ptopic = 0.02 (a priori probability)

R - set of stories in a reference target topic
H - set of stories in a system-defined topic
S - set of stories to be scored in the eval corpus

Task: determine H(R) = argmin_H {CDet(R,H)}
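With topics represented as story-ID sets, the cost is a one-liner. A small sketch with invented sets R, H, and S:

# TDT detection cost with the constants from the slide.
def c_det(R, H, S, p_topic=0.02, c_miss=1.0, c_fa=1.0):
    """R: reference topic; H: system topic; S: all scored stories."""
    p_miss = len(R - H) / len(R)
    p_fa = len(H - R) / len(S - R)
    return c_miss * p_miss * p_topic + c_fa * p_fa * (1 - p_topic)

R = {1, 2, 3, 4}; H = {2, 3, 4, 5}; S = set(range(1, 101))
print(c_det(R, H, S))   # 0.02*0.25 + 0.98*(1/96) ~ 0.0152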
Official results

[Figure: official TDT evaluation results]
Results

                                    Story Weighted              Topic Weighted
#  Parallel  Sim  Decay  Idf  Keep  P(miss)  P(fa)   Cdet       P(miss)  P(fa)   Cdet
1  yes       .1   100    3    10    0.3861   0.0018  0.0095     0.3309   0.0018  0.0084
2  no        .1   100    3    10    0.3164   0.0014  0.0077     0.3139   0.0014  0.0077
3  no        .1   100    2    10    0.3178   0.0014  0.0077     0.2905   0.0014  0.0072
4  no        .1   50     3    10    0.5045   0.0014  0.0114     0.3201   0.0014  0.0077
Novelty detection
<DOCID> reute960109.0101 </DOCID>
<HEADER> reute 01-09 0057 </HEADER>
...
German court convicts Vogel of extortion
BERLIN, Jan 9 (Reuter) - A German court
on Tuesday convicted Wolfgang Vogel, the East
Berlin lawyer famous for organising Cold War
spy swaps, on charges that he extorted money
from would-be East German emigrants.
The Berlin court gave him a two-year
suspended jail sentence and a fine -- less than
the 3 3/8 years prosecutors had sought.
<DOCID> reute960109.0201 </DOCID>
<HEADER> reute 01-09 0582 </HEADER>
...
East German spy-swap lawyer convicted of
extortion
BERLIN (Reuter) - The East Berlin lawyer
who became famous for engineering Cold War
spy swaps, Wolfgang Vogel, was convicted by a
German court Tuesday of extorting money
from East German emigrants eager to flee to
the West.
Vogel, a close confidant of former East
German leader Erich Honecker and one of the
Soviet bloc's rare millionaires, was found guilty
of perjury, four counts of blackmail and five
counts of falsifying documents.
The Berlin court gave him the two-year
suspended sentence and a $63,500 fine.
Prosecutors had pressed for a jail sentence of 3
3/8 years and a $215,000 penalty...