Connecting the Dots Between News Articles
KDD'10
Advisor: Jia Ling, Koh
Speaker: Yu Cheng, Hsieh
Outline
• Introduction
• Scoring a chain
• Formalize story coherence
• Measuring influence
• Finding a good chain
• Evaluation
• Interaction Model
Introduction
• Users constantly struggle to keep up with the large amount of content that is published every day.
• With this much data, it is easy to miss the big picture.
• Goal: investigate methods for automatically connecting the dots.
• Example: connecting the mortgage crisis to healthcare
• The chain should be coherent
• The user should gain a better understanding of the progression of the story
Scoring a chain
Formalizing story coherence
• Advantages:
- Positions similar documents next to each other
- Rewards long stretches of words
• Disadvantages:
- Overlooks the importance of individual words
- Does not account for missing words
- Overlooks weak links
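The "weak links" drawback can be illustrated with a small sketch. It assumes, as a guess rather than a statement of the authors' formula, that the measure being assessed simply counts the words shared by adjacent articles; the function name `shared_word_coherence` and the toy chain are hypothetical.

```python
from typing import List, Set


def shared_word_coherence(chain: List[Set[str]]) -> int:
    """Sum, over adjacent article pairs, of the number of shared words.

    Assumption: this is only a stand-in for a simple word-overlap measure.
    """
    return sum(len(a & b) for a, b in zip(chain, chain[1:]))


# Hypothetical chain: one strong transition followed by a broken one.
chain = [
    {"mortgage", "bank", "crisis"},
    {"mortgage", "bank", "crisis"},
    {"healthcare"},
]
total = shared_word_coherence(chain)
# Score of the weakest single transition in the same chain.
weakest = min(len(a & b) for a, b in zip(chain, chain[1:]))
```

Here the summed score is driven entirely by the first transition, while the second transition shares no words at all; a sum hides that weak link.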
Formalizing story coherence
• Jitteriness: topics that appear and disappear throughout the chain
- Only consider the longest continuous stretch of each word.
- This way, going back and forth between two topics provides no utility after the first topic switch.
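The "longest continuous stretch" idea can be sketched as follows. This is a minimal illustration, assuming each article is reduced to a set of words; the function name `longest_stretches` and the toy chain are hypothetical and not taken from the talk.

```python
from typing import Dict, List, Set


def longest_stretches(chain: List[Set[str]]) -> Dict[str, int]:
    """Map each word to the length of its longest run of consecutive articles."""
    best: Dict[str, int] = {}
    current: Dict[str, int] = {}
    for words in chain:
        # A run continues only for words present in this article;
        # every other word's run is dropped (reset to zero).
        current = {w: current.get(w, 0) + 1 for w in words}
        for w, run in current.items():
            best[w] = max(best.get(w, 0), run)
    return best


# Hypothetical chain in which "bank" disappears and later reappears.
chain = [
    {"mortgage", "bank"},
    {"mortgage", "health"},
    {"bank", "health"},
    {"health"},
]
stretches = longest_stretches(chain)
```

In this toy chain "bank" appears twice but never in adjacent articles, so only a stretch of length 1 counts for it; its reappearance adds nothing.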
Measuring influence
Finding a good chain
• Linear Programming
- Chain Restriction
- Smoothness
- Activation Restriction
- Minmax Objective
Linear Programming
• Minmax Objective
- Minedge is the minimum of all active edge scores
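A minimal sketch of the minmax idea: score each chain by its weakest edge and prefer the chain whose weakest edge is strongest. This is a brute-force comparison over a handful of given chains, not the linear program itself; the names `min_edge` and `best_chain` and the edge scores are hypothetical.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str]


def min_edge(chain: List[str], edge_score: Dict[Edge, float]) -> float:
    """Return the minimum score over the edges used by the chain."""
    edges = list(zip(chain, chain[1:]))
    if not edges:
        raise ValueError("a chain needs at least two articles")
    return min(edge_score[e] for e in edges)


def best_chain(candidates: List[List[str]],
               edge_score: Dict[Edge, float]) -> List[str]:
    """Pick the candidate chain whose weakest edge is strongest."""
    return max(candidates, key=lambda c: min_edge(c, edge_score))


# Hypothetical edge scores between a source s, a target t, and two middle articles.
scores = {("s", "a"): 0.9, ("a", "t"): 0.1, ("s", "b"): 0.4, ("b", "t"): 0.5}
chosen = best_chain([["s", "a", "t"], ["s", "b", "t"]], scores)
```

The chain through `a` has the single best edge but also a very weak one, so the chain through `b` is chosen.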
Evaluation
• More than half a million real news articles were used.
• Major news stories of recent years were considered.
• For each story, an initial subset of 500 to 10,000 candidate articles was selected by keyword search.
• Named entities and noun phrases were extracted from each article (infrequent named entities and non-informative noun phrases were removed).
Evaluation
• Stories linking technique
- Connecting-Dots
- Shortest-path
- Google News Timeline(GNT)
- Event threading(TDT)
Evaluation
• Shortest path
- Constructs a graph by connecting each document with its nearest neighbor, based on cosine similarity
• Google News Timeline (GNT)
- Uses a query string to retrieve articles
- A query string is constructed for each story, based on s and t
- K equally spaced documents are picked between the dates of the original query articles
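The graph construction for the shortest-path baseline can be sketched as below. It assumes documents are represented as bag-of-words counts, which the slides do not specify; `cosine`, `nearest_neighbor_edges`, and the sample documents are hypothetical.

```python
import math
from collections import Counter
from typing import Dict


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def nearest_neighbor_edges(docs: Dict[str, Counter]) -> Dict[str, str]:
    """Link every document to its most similar other document.

    Requires at least two documents.
    """
    edges: Dict[str, str] = {}
    for name, vec in docs.items():
        others = [o for o in docs if o != name]
        edges[name] = max(others, key=lambda o: cosine(vec, docs[o]))
    return edges


# Hypothetical documents, tokenized by whitespace.
docs = {
    "d1": Counter("mortgage crisis bank".split()),
    "d2": Counter("mortgage bank bailout".split()),
    "d3": Counter("health insurance reform".split()),
}
edges = nearest_neighbor_edges(docs)
```

A shortest-path search between the source and target articles would then run over the resulting edges.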
Evaluation
• 18 users were each given a pair of source and target articles
• Users' familiarity with those articles was gauged
• Users were asked whether they believed they knew a coherent story linking them together (on a scale of 1 to 5)
• Users were asked to indicate
- Relevance
- Coherence
- Non-redundancy
Evaluation
Interaction Models
• Refinement:
- Users might be especially interested in a
specific part of the chain
- A refinement may consist of adding a
new article, or replacing an article
Evaluation
• Refinement
- Two chains were returned, obtained from the original chain by
(1) our local search
(2) adding an article chosen randomly from a subset of candidate articles
- Users preferred the local-search chains 72% of the time
Evaluation
• User interests
- Two chains were shown to users, one obtained from the other by increasing the importance of 2-3 words
- Users were then shown a list of ten words containing
(1) the words whose importance we increased
(2) randomly chosen words
- Users were asked which words they would pick in order to obtain the second chain from the first; the goal was to see whether users could identify at least some of the words
- Users identified at least one word 63.3% of the time
Conclusion & Future Work
• Described the problem of connecting the dots
• Explored different desired properties of a good story and formalized the problem as a linear program
• Provided an efficient algorithm to connect two articles
• Future work: allowing more complex tasks