Novelty Detection Based on Sentence Level Patterns

Sentence Level Information Patterns
for Novelty Detection
Xiaoyan Li
PhD in Computer Science
UMass Amherst
Visiting Assistant Professor
Department of Computer Science
Mount Holyoke College
xli@mtholyoke.edu
1
Research Backgrounds and Interests
● Information Retrieval (IR)
● Novelty Detection (ND)
  − MIS at Tsinghua, and now at MHC
● Bioinformatics
  − [HLT01], [SIGIR’03]
● Database Systems & Data Mining
  − [IPM’07], [CIKM’05], [CIKM’06]
● Question Answering (QA)
  − [CIKM’03], [IPM’07], [ECIR’08]
  − Now at MHC
● The Intersection of IR, QA, MIS and Data Mining
  − Organize/access data in Relational Databases and indexed free text
  − Answer users’ questions instead of matching query words
2
Outline
● What is Novelty Detection?
● Related Work
● Novelty and Information Patterns
  − New definition of “novelty”
  − Information patterns and analysis
● ip-BAND: An Information Pattern-Based Approach
  − Query analysis
  − Relevant sentence retrieval
  − Novel sentence detection
● Experiments and Results
● Conclusions and Future Work
3
What Is Novelty Detection?
[Motivating example: a user asks “Any car bomb events recently? When and where did they happen?” and searches http://news.google.com/ for “car bomb”. The search returns roughly 16,400 results, yet only a few distinct answers matter, e.g., “Car Bomb, Baghdad, Monday June 14th” and “Car Bomb, Gaza Strip, Tuesday June 15th.”]
6
What Is Novelty Detection?
● Novelty Detection at the Event Level
  − Document is relevant to the query
  − Document discusses a new event
● Novelty Detection at the Sentence Level
  − A sentence is relevant to the query
  − A sentence has new information about an old event, or discusses a new event
7
What Is Novelty Detection?
● Task of a Novelty Detection System (NDS)
  − Given a query, an NDS retrieves a list of sentences, each of which is relevant to the query and contains new information not covered by previously delivered sentences.
  − Relevance judgment: independent of other sentences.
  − Novelty judgment: depends on previously delivered sentences.
● Goal of a Novelty Detection System
  − Let the user get useful information without going through redundant or non-relevant sentences
8
Related Work: Novelty Detection At Different Levels
● Novelty Detection at the Event Level
  − New event detection from Topic Detection and Tracking (TDT) research
  − Most techniques are based on:
    − Bag-of-words representations
    − Clustering algorithms
9
Related Work: Novelty Detection At Different Levels
● Novelty Detection at the Event Level
● Novelty Detection at the Sentence Level
  − TREC novelty tracks (2002-2004)
  − New words appearing in a sentence contribute to its novelty score
10
Related Work: Novelty Detection At Different Levels
● Novelty Detection at the Event Level
● Novelty Detection at the Sentence Level
● Novelty Detection in Other Applications
  − “The Use of MMR, Diversity-based Reranking for Reordering Documents and Producing Summaries”, Carbonell and Goldstein (SIGIR 1998)
  − “Novelty and redundancy detection in document filtering”, Zhang, Callan and Minka (SIGIR 2002)
  − “Beyond Independent Relevance: Methods and Evaluation Metrics for Subtopic Retrieval”, Zhai, Cohen and Lafferty (SIGIR 2003)
11
Related Work: Novelty Detection At Different Levels
● Novelty Detection at the Sentence Level
  − Similarity functions from IR
  − New words contribute to novelty scores
  − High Sim(query, S) -> increase the relevance rank
  − High Sim(S, previous sentences) -> decrease the novelty rank
12
Novelty Detection at the Sentence Level
[Diagram: given a query, the set of sentences splits into non-relevant and relevant sentences; the relevant sentences split further into redundant and novel sentences.]
13
Related Work -- Limitations
● Query 306 (TREC novelty track 2002)
  − <title> “African Civilian Deaths”
  − <description> “How many civilian non-combatants have been killed in the various civil wars in Africa?”
  − <narrative> A relevant document will contain specific casualty information for a given area, country, or region. It will cite numbers of civilian deaths caused directly or indirectly by armed conflict.
14
Related Work -- Limitations
● Four Sentences:
  − Sentence 1 (Relevant): “It could not verify Somali claims of more than 100 civilian deaths”
  − Sentence 2 (Relevant): “Natal's death toll includes another massacre of 11 ANC [African National Congress] supporters”
  − Sentence 3 (Non-relevant): “Once the slaughter began, following the death of President Juvenal Habyarimana in an air crash on April 6, hand grenades were thrown into schools and churches that had given refuge to Tutsi civilians.”
  − Sentence 4 (Non-relevant): “A Ghana News Agency correspondent with the West African force said that rebels loyal to Charles Taylor began attacking the civilians shortly after the peace force arrived in Monrovia last Saturday to try to end the eight-month-old civil war.”
15
Related Work -- Motivations
● A Deeper Query Understanding
  − A query representation beyond keywords
  − The type of information required in addition to topical relevance (“number” for query 306)
● Determination of Novel Sentences
  − Topical relevance + the right type of information
  − New Words != New Information
16
Novelty and Information Patterns
● What is Novelty or New Information?
  − Novelty or new information means new answers to the potential questions representing a user’s request or information need
  − Two aspects:
    − Query -> question(s)
    − New answers -> novel or new information
17
Information Patterns
• Information Patterns in Sentences
  - Indicators of answers to users’ questions
• Understanding Information Patterns
  - Sentence Lengths (SLs)
  - Named Entities (NEs)
  - Opinion Patterns (OPs)
18
Sentence Lengths (SLs)

Types of          TREC 2002 (49 topics)       TREC 2003 (50 topics)
Sentences (S.)    # of S.     Avg. Length     # of S.     Avg. Length
Relevant          1365        15.58           15557       13.1
Novel             1241        15.64           10226       13.3
Non-relevant      55862       9.55            24263       8.5
• SL Observations:
  - Relevant sentences on average have more words than non-relevant sentences
  - The differences in SLs between novel and non-relevant sentences are slightly larger
19
Named Entities (NEs)

TREC 2002 Novelty Track (Total S# = 57227, Rel# = 1365, Non-Rel# = 55862):
NEs             Rel # (%)         Non-Rel # (%)
PERSON          381 (27.91%)      13101 (23.45%)
ORGANIZATION    532 (38.97%)      17196 (30.78%)
LOCATION        536 (39.27%)      11598 (20.76%)
DATE            382 (27.99%)      6860 (12.28%)
NUMBER          444 (32.53%)      14035 (25.12%)
ENERGY          0 (0.00%)         5 (0.01%)
MASS            31 (2.27%)        1455 (2.60%)

TREC 2003 Novelty Track (Total S# = 39820, Rel# = 15557, Non-Rel# = 24263):
NEs             Rel # (%)         Non-Rel # (%)
PERSON          6633 (42.64%)     7211 (29.72%)
ORGANIZATION    6572 (42.24%)     9211 (37.96%)
LOCATION        5052 (32.47%)     5168 (21.30%)
DATE            3926 (25.24%)     4236 (17.46%)
NUMBER          4141 (26.62%)     6573 (27.09%)
ENERGY          0 (0.00%)         0 (0.00%)
MASS            34 (0.22%)        19 (0.08%)
• NE Observation 1:
  - The five most frequent types (>25%) of NEs are PERSON, ORGANIZATION, LOCATION, DATE and NUMBER.
20
Named Entities (NEs) in Opinion/Event Topics
Statistics of named entities in event and opinion topics (TREC 2003 Novelty Track)

Event Topics (Total S# = 18705, Rel# = 7802, Non-Rel# = 10903):
NEs          Rel # (%)         Non-Rel # (%)
PERSON       3833 (49.13%)     3228 (29.61%)
LOCATION     3100 (39.73%)     2567 (23.54%)
DATE         2342 (30.02%)     1980 (18.16%)

Opinion Topics (Total S# = 21115, Rel# = 7755, Non-Rel# = 13360):
NEs          Rel # (%)         Non-Rel # (%)
PERSON       2800 (36.11%)     3983 (29.81%)
LOCATION     1952 (25.17%)     2601 (19.47%)
DATE         1584 (20.43%)     2256 (16.89%)

• NE Observations 2 & 3:
  - PERSON, LOCATION and DATE are more important than NUMBER and ORGANIZATION for relevance
  - PERSON, LOCATION and DATE play a more important role in event topics than in opinion topics
21
Named Entities (NEs) in Novelty
Previously unseen NEs and novelty/redundancy (TREC 2002 & UMass data)

Types of        Total # of    # of Sentences      # of
Sentences       Sentences     w/ New NEs (%)      Queries
Novel S.        4170          2801 (67.2%)        101
Redundant S.    777           355 (45.7%)         75

• NE Observations 4 & 5:
  - Novel sentences have more new named entities than relevant but redundant sentences
  - PERSON, ORGANIZATION, LOCATION and DATE (POLD) NEs are more important for novelty
22
Opinion Patterns (OPs)
Opinion patterns for 22 opinion topics (TREC 2003)

Sentences (S.)   Total # of S.   # of Opinion S. (%)
Relevant         7755            3733 (48.1%)
Novel            5374            2609 (48.6%)
Non-relevant     13360           3788 (28.4%)

• OP Observation:
  - There are more opinion sentences among relevant (and novel) sentences than among non-relevant sentences
23
Opinion Patterns (OPs)
• An opinion pattern is detected in a sentence if the sentence includes quotation marks or one or more opinion expressions indicating that it states an opinion:
  Quotation marks “ ”, said, say, according to, add, addressed, agree, affirmed, reaffirmed, argue, believe, believes, claim, concern, consider, disagreed, expressed, finds that, found that, fear that, idea that, insist, maintains that, predicted, reported, report, state that, stated that, states that, show that, showed that, shows that, think, wrote, etc.
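A minimal sketch of such a check, assuming simple substring matching against an abridged version of the cue list above; the exact matching rules and the full lexicon used by ip-BAND are not shown on the slide, so this is illustrative only.

```python
import re

# Abridged opinion cue list from the slide; the full ip-BAND lexicon is longer.
OPINION_CUES = [
    "said", "say", "according to", "agree", "affirmed", "argue", "believe",
    "claim", "concern", "consider", "expressed", "found that", "fear that",
    "insist", "predicted", "reported", "stated that", "showed that", "think", "wrote",
]

def has_opinion_pattern(sentence: str) -> bool:
    """Return True if the sentence contains quotation marks or an opinion cue word."""
    if re.search(r'["\u201c\u201d]', sentence):          # straight or curly quotation marks
        return True
    lowered = sentence.lower()
    return any(cue in lowered for cue in OPINION_CUES)

# Example: has_opinion_pattern('Officials said the death toll could rise.')  -> True
#          has_opinion_pattern('The convoy reached Monrovia on Saturday.')   -> False
```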
24
information-pattern-BAsed Novelty Detection (ip-BAND)

Query (Topic)

Step 1. Query Analysis
(1) Question Generation
    a. One or more specific questions, or
    b. A general question
(2) Information Pattern (IP) Determination
    a. For each specific question, the IP includes (a specific NE type, sentence length (SL))
    b. For a general question, the IP includes (POLD NE types, SL, and opinion/event type)
-> Information Patterns

Step 2. Relevant Sentence Retrieval
- Sentence relevance ranking using TFIDF scores
- Information pattern detection in sentences
- Sentence re-ranking using information patterns
-> Relevant Sentences

Step 3. Novel Sentence Detection
- For specific topics: new and specific NE detection
- For general topics: new NE and new word detection
-> Novel Sentences
25
ip-BAND: Query Analysis
● Classify topics
  - Specific topics: multiple NE questions
  - General topics: opinion, event, and others
● Determine the possible query-related information patterns for specific topics:
  - Query words + expected answer types
26
An Example Query
● Query 306 (TREC novelty track 2002)
  − <title> “African Civilian Deaths”
  − <description> “How many civilian non-combatants have been killed in the various civil wars in Africa?”
  − <narrative> A relevant document will contain specific casualty information for a given area, country, or region. It will cite numbers of civilian deaths caused directly or indirectly by armed conflict.
27
ip-BAND: Query Analysis
● Query Analysis
  − Determine the possible query-related patterns
    -- Query words + expected answer type
    -- African Civilian Death + NUMBER (query 306)
    -- Civilian Death + NUMBER (query 306)
    -- …
    -- Expanded query words + NUMBER
28
ip-BAND: Query Analysis
● Query Analysis: determine expected answer types
  Word patterns for the five types of NE questions:

Answer type     Word patterns
Person          who, individual, person, people, participant, candidate, customer, victim, leader, member, player, name
Organization    who, company, companies, organization, agency, agencies, name, participant
Location        where, location, nation, country, countries, city, cities, town, area, region
Number          how many, how much, length, number, polls, death tolls, injuries, how long
Date            when, date, time, which year, which month, which day
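The table above can be read as a keyword-to-answer-type map applied to the query text. Below is a minimal sketch under that assumption; the pattern lists are abridged from the table, and matching directly on the raw title/description text is my simplification rather than the authors' exact procedure.

```python
# Abridged word patterns from the table above, mapping cue words to expected answer (NE) types.
ANSWER_TYPE_PATTERNS = {
    "PERSON":       ["who", "person", "people", "victim", "leader", "member"],
    "ORGANIZATION": ["company", "companies", "organization", "agency", "agencies"],
    "LOCATION":     ["where", "nation", "country", "countries", "city", "region"],
    "NUMBER":       ["how many", "how much", "number", "death tolls", "injuries", "how long"],
    "DATE":         ["when", "date", "which year", "which month", "which day"],
}

def expected_answer_types(query_text: str) -> list:
    """Return the NE types whose cue words appear in the query title/description."""
    text = query_text.lower()
    return [ne_type for ne_type, cues in ANSWER_TYPE_PATTERNS.items()
            if any(cue in text for cue in cues)]

# Query 306: "How many civilian non-combatants have been killed ..." -> ["NUMBER"]
```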
29
ip-BAND: Relevant Sentence Retrieval
• Retrieve sentences indicating “possible answers”
• TFIDF ranking:
  S0 = Σ_{i=0..n} [ tf_s(t_i) × tf_q(t_i) × idf(t_i)² ]
• Sentence re-ranking
  - SL, NE, OP
• Filter out sentences without “answers” for specific topics
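A minimal sketch of the TFIDF scoring above over tokenized sentences; computing idf from the sentence collection itself and the smoothing used here are assumptions, not details from the slide.

```python
import math
from collections import Counter

def idf(term, sentences):
    """Inverse sentence frequency of a term over the collection (smoothing is an assumption)."""
    sf = sum(1 for s in sentences if term in s)
    return math.log((len(sentences) + 1) / (sf + 1)) + 1

def tfidf_score(query_tokens, sentence_tokens, sentences):
    """S0 = sum_i tf_s(t_i) * tf_q(t_i) * idf(t_i)^2 over terms the sentence shares with the query."""
    tf_q, tf_s = Counter(query_tokens), Counter(sentence_tokens)
    return sum(tf_s[t] * tf_q[t] * idf(t, sentences) ** 2 for t in tf_q if t in tf_s)

# Rank: sorted(sentences, key=lambda s: tfidf_score(query, s, sentences), reverse=True)
```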
30
ip-BAND: Relevant Sentence Retrieval
• Sentence re-ranking
  - Sentence length adjustment: S1 = S0 × (L / L̄), where L is the sentence length and L̄ the average sentence length
  - Named entity adjustment: S2 = S1 × [1 + α × (F_person + F_location + F_date)]
  - Opinion adjustment (general opinion topics only): S3 = S2 × [1 + F_opinion]
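A sketch of how the three adjustments compose, assuming the F_* terms are 0/1 indicators for the presence of the corresponding pattern and that α is a small tuning constant; the direction of the length ratio and the constant values are assumptions, not values from the slides.

```python
def rerank_score(s0, length, avg_length,
                 has_person, has_location, has_date,
                 has_opinion, opinion_topic=False, alpha=0.5):
    """Apply the SL, NE and opinion adjustments to an initial TFIDF score s0.

    F_* terms are treated as 0/1 indicators; alpha=0.5 and the direction of the
    length ratio are illustrative assumptions.
    """
    s1 = s0 * (length / avg_length)                                  # sentence length adjustment
    s2 = s1 * (1 + alpha * (has_person + has_location + has_date))   # named entity adjustment
    s3 = s2 * (1 + has_opinion) if opinion_topic else s2             # opinion adjustment (general opinion topics only)
    return s3
```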
31
ip-BAND: Novel Sentence Detection
• Identify sentences with “new answers”
• Novelty score: Sn = w × N_w + g × N_ne
  - N_w: number of new words, N_ne: number of new named entities
  - Novel if Sn > T
• For specific topics/questions
  - IPs: new answer NEs
  - w = 0, g = 1, and T = 1
• For general topics/questions
  - IPs: new words and new NEs
  - w = 1, g = 1, and T = 4
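A minimal sketch of this scoring rule; tracking "new" items against everything already delivered, and restricting the NE set to the expected answer types for specific topics, are bookkeeping assumptions on my part.

```python
def detect_novel(ranked_sentences, w=1.0, g=1.0, threshold=4.0):
    """Keep sentences whose novelty score Sn = w*N_w + g*N_ne exceeds the threshold.

    Each item in ranked_sentences is (words, named_entities), in relevance order.
    Specific topics: w=0, g=1, threshold=1 (only new answer-type NEs count).
    General topics:  w=1, g=1, threshold=4 (new words and new NEs count).
    """
    seen_words, seen_nes, novel = set(), set(), []
    for words, nes in ranked_sentences:
        n_w = len(set(words) - seen_words)   # previously unseen words
        n_ne = len(set(nes) - seen_nes)      # previously unseen named entities
        if w * n_w + g * n_ne > threshold:
            novel.append((words, nes))
        seen_words.update(words)
        seen_nes.update(nes)
    return novel
```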
32
Experiments and Results
● Data from TREC Novelty Tracks (2002-2004)
  − 49 queries (2002), 50 queries (2003) and 50 queries (2004)
  − For each query, up to 25 relevant documents (2002, 2003); 25 relevant documents + additional non-relevant documents (2004)
  − Documents were pre-segmented into sentences.
  − Redundancy: 2002 (9.1%), 2003 (34.3%), 2004 (58.6%)
33
Experiments and Results
● Baseline Approaches
  − B-NN: Initial Retrieval Ranking
    − No novelty detection performed
  − B-NW: New Word Detection
    − Any new word in a sentence indicates novelty
  − B-NWT: New Word Detection with Threshold T
    − At least T new words in a sentence indicate novelty
  − B-MMR: Maximal Marginal Relevance
    − Carbonell and Goldstein (1998)
    − Reported to work well in non-redundant text summarization, novelty detection in document filtering, and subtopic retrieval
    − MMR may incorporate various novelty measures
34
Experiments and Results
● Baseline Approaches
  − MMR score:
    MMR = argmax_{S_i ∈ R\N} [ λ × Sim1(S_i, Q) - (1 - λ) × max_{S_j ∈ N} Sim2(S_i, S_j) ]
  − B-MMR: Maximal Marginal Relevance
    − (1) Start with a sentence relevance ranking; select the first sentence (it is always novel)
    − (2) Calculate the MMR score for the remaining sentences
    − (3) Pick the one with the maximum MMR score and go to (2) until the last sentence is selected
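A minimal sketch of this greedy loop, with cosine similarity over term-count vectors standing in for both Sim1 and Sim2; the actual similarity and novelty measures plugged into B-MMR are not specified here, so those choices are assumptions.

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def mmr_rank(query_tokens, sentences, lam=0.7):
    """Greedy MMR re-ranking of tokenized sentences: trade off relevance to the
    query against redundancy with already-selected sentences."""
    q = Counter(query_tokens)
    vecs = [Counter(s) for s in sentences]
    remaining, selected = list(range(len(sentences))), []
    while remaining:
        def mmr_score(i):
            redundancy = max((cosine(vecs[i], vecs[j]) for j in selected), default=0.0)
            return lam * cosine(vecs[i], q) - (1 - lam) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return [sentences[i] for i in selected]
```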
35
Experiments and Results
● Three Sets of Experiments
  − (1) Performance of identifying novel sentences for queries transformed into multiple specific questions
  − (2) Performance of identifying novel sentences for queries transformed into a general question
  − (3) Performance of finding relevant sentences for all queries
36
[Results figures: Performance of Novelty Detection for Specific Topics; Performance of Novelty Detection for General Topics; The Overall Performance of Novelty Detection]
Experiments and Results
● The proposed approach outperforms all baselines at the top (5, 10, 15, 20, 30) ranks
● The proposed approach beats the baseline approaches across different data collections
● All approaches achieve better performance on specific topics than on general topics
40
Conclusions and Future Work
● New definition of novelty
  - New answers to potential questions from a query
● Analysis of information patterns
  - Sentence lengths, named entities, opinion patterns
● ip-BAND
  - information-pattern-BAsed Novelty Detection approach
41
Conclusions and Future Work (cont.)
● Combine with other IR approaches
  − Information patterns + language modeling
● Improve query analysis
  − Other types of questions in addition to NE questions
  − Why, what, how …
● Extend to other novelty-based applications
  − New Event Detection
  − Multi-document Summarization
42
Thank You! Questions?

Research Backgrounds and Interests
● Information Retrieval (IR)
  − Robust High Performance IR
● Novelty Detection (ND)
● Question Answering (QA)
● Database Systems & Data Mining
● Bioinformatics
● The Intersection of IR, QA, MIS and Data Mining
  − Organize and access data in Relational Databases and indexed files of free text
  − Answer users’ questions instead of matching query words
43