overview presentation - SUNY - Stony Brook

advertisement
`All I know is what I read
in the newspaper’:
news/blog analysis with
Lydia
Steven Skiena
Dept. of Computer Science
Stony Brook University
http://www.cs.sunysb.edu/~skiena
Large-Scale Text Analysis



Our Lydia text analysis system does a daily
analysis of over 1000+ online English and
foreign-language newspapers, plus blogs, RSS
feeds, historical archives, patents, and other
news sources.
We currently track over 20,000,000 news
entities in over one terabyte of text, providing
spatial, temporal, relational and sentiment
analysis.
We believe our data/analysis to be of great
interest in social sciences and other domains.
Edison Chen and the Computer Technician
News Sentiment Time Series
International News Interest
International News Sentiment
Runner-up to Obama in Hong Kong’s 2008 “Man of the Year” poll!
www.textmap.com
http://www.textmap.com
The textmap.com
website
TextMap Access
Outline of Talk
Introduction
 Lydia system architecture
 Spatial, temporal, and network analysis
 Sentiment analysis
 Applications
 Future directions

System Architecture
Spidering – text is retrieved from a given site on a
daily basis using semi-custom spidering agents.
Normalization – clean text is extracted with semicustom parsers and formatted for our pipeline
Text Markup – annotates parts of the source text
for storage and analysis.
Back Office Operations – we aggregate entity
frequency and relational data for a variety of
statistical analyses.
Levon Lloyd, Dimitrios Kechagias, and Steven Skiena. Lydia: A
System for Large-Scale News Analysis.
In String Processing and Information Retrieval: 12th International Conference (SPIRE 2005).
Text Markup




We apply natural language processing (NLP)
techniques to annotate interesting features of
the document.
Full parsing techniques are too slow to keep up
with our volume of text, so we employ shallow
parsing instead.
We can markup approximately 2000
newspaper-days of text per CPU.
Analysis phases include…
Input
Dr. Judith Rodin, the former president of the University of
Pennsylvania, will become president of the Rockefeller Foundation
next year, the foundation announced yesterday in New York.
She will take over in March 2005, succeeding Gordon Conway, the
foundation's first non-American president. Mr. Conway announced
last year that he would retire at 66 in December and return to Britain,
where his children and grandchildren live.
Sentence and Paragraph
Identification
<p>
Dr. Judith Rodin, the former president of the University of
Pennsylvania, will become president of the Rockefeller Foundation
next year, the foundation announced yesterday in New York.
</p>
<p>
She will take over in March 2005, succeeding Gordon Conway, the
foundation's first non-American president.
Mr. Conway announced last year that he would retire at 66 in
December and return to Britain, where his children and grandchildren
live.
</p>
Part Of Speech Tagging
<p>
Dr./NNP Judith/NNP Rodin/NNP ,/, the/DT former/JJ president/NN
of/IN the/DT University/NNP of/IN Pennsylvania/NNP ,/, will/MD
become/VB president/NN of/IN the/DT Rockefeller/NNP
Foundation/NN next/JJ year/NN ,/, the/DT foundation/NN
announced/VBD yesterday/RB in/IN New/NNP York/NNP./.
</p>
<p>
She/PRP will/MD take/VB over/IN in/IN March/NNP 2005/CD ,/,
succeeding/VBG Gordon/NNP Conway/NNP ,/, the/DT
foundation/NN 's/POS first/JJ non-American/JJ president/NN ./.
Mr./NNP Conway/NNP announced/VBD last/JJ year/NN that/IN
he/PRP would/MD retire/VB at/IN 66/CD in/IN December/NNP
and/CC return/NN to/TO Britain/NNP ,/, where/WRB his/PRP$
children/NNS and/CC grandchildren/NNS live/VBP ./.
</p>
Proper Noun Extraction
<p>
<pn> Dr./NNP Judith/NNP Rodin/NNP </pn> ,/, the/DT former/JJ president/NN
of/IN the/DT <pn> University/NNP </pn> of/IN <pn> Pennsylvania/NNP </pn> ,/,
will/MD become/VB president/NN of/IN the/DT <pn> Rockefeller/NNP </pn>
Foundation/NN next/JJ year/NN ,/, the/DT foundation/NN announced/VBD
yesterday/RB in/IN <pn> New/NNP York/NNP </pn> ./.
</p>
<p>
She/PRP will/MD take/VB over/IN in/IN March/NNP 2005/CD ,/, succeeding/VBG
<pn> Gordon/NNP Conway/NNP </pn> ,/, the/DT foundation/NN 's/POS first/JJ
non-American/JJ president/NN ./.
<pn> Mr./NNP Conway/NNP </pn> announced/VBD last/JJ year/NN that/IN
he/PRP would/MD retire/VB at/IN 66/CD in/IN December/NNP and/CC return/NN
to/TO <pn> Britain/NNP </pn> ,/, where/WRB his/PRP$ children/NNS and/CC
grandchildren/NNS live/VBP ./.
</p>
Actor Classification
<p>
<pn category = "PERSON"> Dr./NNP Judith/NNP Rodin/NNP </pn> ,/, the/DT
former/JJ president/NN of/IN the/DT <pn category = "UNKNOWN">
University/NNP </pn> of/IN <pn category = "STATE"> Pennsylvania/NNP </pn> ,/,
will/MD become/VB president/NN of/IN the/DT <pn category = "UNKNOWN">
Rockefeller/NNP </pn> Foundation/NN next/JJ year/NN ,/, the/DT foundation/NN
announced/VBD yesterday/RB in/IN <pn category = “CITY”> New/NNP
York/NNP </pn> ./.
</p>
<p>
She/PRP will/MD take/VB over/IN in/IN <embedded_date> March/NNP 2005/CD
</embedded_date> ,/, succeeding/VBG <pn category = "PERSON"> Gordon/NNP
Conway/NNP </pn> ,/, the/DT foundation/NN 's/POS <num type = "ORDINAL">
first/JJ </num> non-American/JJ president/NN ./.
<pn category = "PERSON"> Mr./NNP Conway/NNP </pn> announced/VBD last/JJ
year/NN that/IN he/PRP would/MD retire/VB at/IN <num type = "CARDINAL">
66/CD </num> in/IN <embedded_date> December/NNP </embedded_date>
and/CC return/NN to/TO <pn category = "COUNTRY"> Britain/NNP </pn> ,/,
where/WRB his/PRP$ children/NNS and/CC grandchildren/NNS live/VBP ./.
</p>
Rewrite Rules
<p>
<appellation> Dr. </appellation> <pn category = "PERSON"> Judith Rodin </pn> ,
the former president of the <pn category = "UNIVERSITY"> University of
Pennsylvania </pn> , will become president of the <pn category = "UNKNOWN">
Rockefeller Foundation </pn> next year , the foundation announced yesterday in
<pn category = “CITY”> New York </pn> .
</p>
<p>
She will take over in <embedded_date> March 2005 </embedded_date> ,
succeeding <pn category = "PERSON"> Gordon Conway </pn> , the foundation 's
<num type = "ORDINAL"> first </num> non-American president .
<appellation> Mr. </appellation> <pn category = "PERSON"> Conway </pn>
announced last year that he would retire at <num type = "CARDINAL"> 66 </num>
in <embedded_date> December </embedded_date> and return to <pn category =
"COUNTRY"> Britain </pn> , where his children and grandchildren live .
</p>
Alias Expansion
<p>
<appellation> Dr. </appellation> <pn category = "PERSON"> Judith Rodin </pn> ,
the former president of the <pn category = "UNIVERSITY"> University of
Pennsylvania </pn> , will become president of the <pn category = "UNKNOWN">
Rockefeller Foundation </pn> next year , the foundation announced yesterday in
<pn category = “CITY”> New York </pn> .
</p>
<p>
She will take over in <embedded_date> March 2005 </embedded_date> ,
succeeding <pn category = "PERSON"> Gordon Conway </pn> , the foundation 's
<num type = "ORDINAL"> first </num> non-American president .
<appellation> Mr. </appellation> <pn category = "PERSON"> Gordon Conway
</pn> announced last year that he would retire at <num type = "CARDINAL"> 66
</num> in <embedded_date> December </embedded_date> and return to <pn
category = "COUNTRY"> Britain </pn> , where his children and grandchildren live.
</p>
Geography Normalization
<p>
<appellation> Dr. </appellation> <pn category = "PERSON"> Judith Rodin </pn> ,
the former president of the <pn category = "UNIVERSITY"> University of
Pennsylvania </pn> , will become president of the <pn category = "UNKNOWN">
Rockefeller Foundation </pn> next year , the foundation announced yesterday in
<pn category = “CITY, STATE, COUNTRY”> New York City, New York, USA
</pn> .
</p>
<p>
She will take over in <embedded_date> March 2005 </embedded_date> ,
succeeding <pn category = "PERSON"> Gordon Conway </pn> , the foundation 's
<num type = "ORDINAL"> first </num> non-American president .
<appellation> Mr. </appellation> <pn category = "PERSON"> Gordon Conway
</pn> announced last year that he would retire at <num type = "CARDINAL"> 66
</num> in <embedded_date> December </embedded_date> and return to <pn
category = "COUNTRY"> Britain </pn> , where his children and grandchildren live.
</p>
Back Office Operations



The most interesting analysis occurs after
markup, using our Hadoop-based statistical
analysis of all occurrences of interesting entities.
Each day’s worth of analysis yields about 10
million occurrences of about 1 million different
entities, so efficiency matters...
Linkage of each occurrence to source and time
facilitates a variety of interesting analysis.
M. Bautin, C. Ward, and S. Skiena, Lydia News/Blog Analysis, 2nd Hadoop Summit, Santa
Clara CA, 2009.
Freedonia Architecture!
Eliminating the Database
Bottleneck
Separating NLP analysis from backend
analysis eliminates daily processing cycles.
 Distributed Map-Reduce processing with
Hadoop aggregates years of counts in
under an hour on our cluster!
 Such throughput facilitates large-scale
evaluation to improve NLP/sentiment
 Server architecture now oriented around
time series vs. daily processing

Map-Reduce Processing

Distributed processing using the Hadoop
implementation of Google’s MapReduce model
Map tasks
Input
Files
Reduce tasks
Map()
Reduce()
Map()
Reduce()
Map()
Reduce()
Outputs
(key,
value)*
Hashing
by key Recieves
(key,
value*)
Output
Files
Synonym Sets
 JFK, John Kennedy, John F. Kennedy, and John
Fitzgerald Kennedy all refer to the same person.
 We need a mechanism to link multiple entities that
have slightly different names but refer to the same
thing.
 We say that two actors belong in the same synonym
set if:
 There names are morphologically compatible.
 If the sets of entities that they are related to are
similar.
➔Levon Lloyd, Andrew Mehler, and Steven Skiena. Identifying Co-referential Names Across
Large Corpra. In Proc. Combinatorial Pattern Matching (CPM 2006)
Duplicate Article Elimination
Supreme Court Justice David Souter suffered minor injuries
when a group of young men assaulted him as he jogged on a
city street, a court spokeswoman and Metropolitan Police said
Saturday.
Supreme Court Justice David Souter suffered minor injuries
when a group of young men assaulted him as he jogged on a
city street, a court spokeswoman and Metropolitan Police said.
Hashing techniques can efficiently identify
duplicate and near-duplicate articles appearing
in different news sources.
Outline of Talk
Introduction
 Lydia system architecture
 Spatial, temporal, and network analysis
 Sentiment analysis
 Applications
 Future directions

Heatmaps
Where are people are talking about
particular topics?
Newspapers have a sphere of influence
based on:
Power of the source – circulation, website popularity
 Population density of surrounding cities


The heat a given entity generates in a
particular location is a function of the
frequency it is mentioned in local sources
➔A. Mehler, Y. Bao, X. Li, Y. Wang, and S. Skiena. Spatial analysis of News Sources, IEEE Trans.
Visualization (2006)
Who is talked about where?
Arnold Schwarzenegger
Alabama
Donde Esta Mexico?
Which candidate got more press?
Historical News Analysis: Dual
Presidencies in U.S. History
Juxtaposition Analysis
We compute the significance of the cooccurrences count between two entities in
terms of their individual frequencies.
 The number of raw co-occurrences biases
towards relationships with popular entities.

Obama and his Friends

Top Juxtapositions for Barack Obama

Juxtapositions between Barack Obama and John McCain
George Bush’s Social Network
Entity Networks
Statistically-significant entity juxtapositions
naturally define networks of news entities
 Our experiments show that these networks
are scale-free, i.e. observe power law
distributions on both entity reference
frequency and vertex degree.
 These networks are useful for
visualization, traversal, and more.

Discovering Entity Communities
Identifying natural communities of news
entities facilitates interesting analysis.
 E.g: Democrats, technologists, Italians,
actors, baseball players, New Yorkers.
 Curated rosters of such communities are
usually unobtainable, incomplete, and/or
ambiguous

A. Mehler and S. Skiena, Expanding Entity Communities from Representative Examples, ACM Trans.
Knowledge and Data Discovery. 3 (2009)
Expanding Communities from
Seeds
Our approach: grow a large community
from a small seed set of known members.
 Selected representatives of any distinct
community are easy to identify (e.g.
Wikipedia)
 We incrementally expand the network to
include the entity with the most significant
in-group neighborhood.

Identifying Baseball Players
But when do we stop?
This incremental process eventually starts
adding members from outside the
community, destroying the result.
 We partition our known members into seed
and validation sets.
 The gap between insertions of validation
members grows substantially soon as we
leave the community.

When is a baseball player not a
baseball player?
Outline of Talk
Introduction
 Lydia system architecture
 Spatial, temporal, and network analysis
 Sentiment analysis
 Applications
 Future directions

Sentiment Analysis

Sentiment analysis lets us to measure how
positively/negatively an entity is regarded,
not just how much it is talked about.
Enron Sentiment Index, 1996-2005 (New York Times)
Blog Analysis with Lydia
Blogs represent a different view of the world
than newspapers.


Less objective
Greater diversity of topics
We adapted Lydia to process Livejournal
blogs, and compared blog content to that of
newspapers.
➔Levon Lloyd, Prachi Kaulgud, and Steven Skiena. News vs. Blogs: Who Gets the Scoop?.
In AAAI Spring Symposium: Computational Approaches to Analyzing Weblogs. 2006
Most Positive Actors in News
and Blogs

News: Felicity Huffman, Fenando Alonso, Dan Rather,
Warren Buffett, Joe Paterno, Ray Charles, Bill Frist, Ben
Wallace, John Negroponte, George Clooney, Alicia Keys,
Roy Moore, Jay Leno, Roger Federer

Blogs: Joe Paterno, Phil Mickelson, Tom Brokow, Sasha
Cohen, Ted Stevens, Rafael Nadal, Felicity Huffman,
Warren Buffett, Fernando Alonso, Chauncey Billups, Maria
Sharapova, Earl Woods, Kasey Kahne, Tom Brady
Most Negative Actors in News
and Blogs

News: Slobodan Milosevic, John Ashcroft, Zacarias Moussaoui,
John Allen Muhammad, Lionel Tate, Charles Taylor, George
Ryan, Al Sharpton, Peter Jennings, Saddam Hussein, Jose
Padilla, Abdul Rahman, Adolf Hitler, Harriet Miers, King
Gyanendra

Blogs: John Allen Muhammad, Sammy Sosa, George Ryan,
Lionel Tate, Esteban Loaiza, Slobodan Milosevic, Charles
Schumer, Scott Peterson, Zacarias Moussaoui, William
Jefferson, King Gyanemdra, Ricky Williams, Ernie Fletcher,
Edward Kennedy, John Gotti
How Do We Do it?




We use large-scale statistical analysis instead of
careful NLP of individual reviews.
We expand small seed lists of +/- terms into
large vocabularies using Wordnet and pathcounting algorithms.
We correct for modifiers and negation.
Statistical methods turn these counts into
indicies.
➔N. Godbole, M. Srinivasaiah, and S. Skiena. Large-Scale Sentiment Analysis for News and
Blogs. Int. Conf. Weblogs and Social Media, 2007
Good to Bad in Three Hops

Paths of WordNet synonyms can lead to
contradictory results, requiring careful path
selection.
What Does it Mean?

Our scores corrolate very well with financial,
political, and sporting events.
Decreasing fear of AIDS in the
scientific and popular press
Multilingual Sentiment Analysis
Implementing language-specific sentiment analysis takes:
1. Language-specific NLP software (e.g. POS tagger, NER )
2. Language-specific linguistic resources (e.g. WordNet)
Machine translation is used in our English sentiment analysis
Foreign text
Translator
English
sentiment
analyzer
M. Bautin, L. Venu-reddy, S. Skiena, International Sentiment analysis for news and
blogs, IWCSM 2008
But what does it mean?
The key question here is assessment: Do
meaningful sentiment signals survive the
limitations of mechanical translation?
 Motivations for multilingual sentiment
analysis include

 Foreign
market research
 Public opinion analysis
Approaches to Evaluation



No gold standard exists for international
sentiment scoring (subjective, requires broad
knowledge of international issues).
But we can correlate with real-world data such
as stock indexes or baseball team results.
We can also assess whether we detect similar
sentiment on contemporary news feels across
languages and translators – uncorrelated
results would mean no signal.
Summary of Results
 We
see consistent entity sentiment between
languages, measured through correlation.
43 out of 45 pairs correlated with p < 0.05.
 Also consistent across different translators.
Significant to p < 0.001.
Outline of Talk
Introduction
 Lydia system architecture
 Spatial, temporal, and network analysis
 Sentiment analysis
 Applications
 Future directions

Lydia Applications in the Social
Sciences
Ethnic group analysis.
 News sentiment and financial asset
pricing.
 The influence of media in politics
 Consumer attitudes and market research.

Name Ethnicity Classification
Family and given names convey
considerable information about ethnicity.
 We developed a HMM / decision tree
classifier to assign people in the news to
the most likely cultural/ethnic group.
 Group trends become clear when applied
to the 20 million people in our news data.

A. Ambekar, C. Ward, J. Mohammed, S. Male, and S. Skiena, Name-Ethnicity Classification from
Open Sources, KDD 2009
Ethnicity HMM / Decision Tree
Accuracy by CEL Group
Aggregated Group Analysis


Muslim CEL Group References, 1987-2004 (New York Times)
Though name ethnicity analysis and M/R
group aggregation, we can monitor news
coverage by nationality or other groups.
Identifying Differences in News Coverage Between Cultural/Ethnic Groups
(with C. Ward and M. Bautin), News Analysis Workshop of IEEE/ACM Web Intelligence (WI 2009)
Lydia Analysis vs. Census Data

Frequency of coverage of
entities with Hispanic
names in the U.S. news,
2004-2008

Percentage of population
self-reporting as Hispanic
in the 2000 census.
Courtesy of Wikipedia.
CEL U.S. Frequency Maps
Nordic
French
Italian
Hispanic Sentiment in the
United States
We discovered a negative sentiment against Hispanics in regions
where they are heavily represented, unlike other groups in our study.
Temporal References by CEL
Geographic Associations
Top geographic juxtapositions
for two world leaders.
National classification results
for heads of state and
government.
Geographic Associations/CELs






(a) African
(b) Hispanic
(c) East Asian
(d) Indian
(e) Eastern
European
(f) Muslim
CEL Group Associations

Tracking the number of co-references
between different CEL groups provides
insight into how strongly groups interact.
Movie Gross Modeling / Forecasting
Investors, movie studios, distributors, and
exhibitors are all interested in forecasting
movies’ grosses before their release.
 Hollywood Stock Exchange http://www.hsx.com
 Does coupling news analysis with
traditional variables lead to better
forecasting models?

W. Zhang and S. Skiena, Improving Movie Gross Prediction through News Analysis,
IEEE/ACM Web Intelligence (WI 2009)
Frequency Count Example
Night at the Museum 1
Night at the Museum 2
Correlation Analysis
Logged News References vs. Logged Gross (or budget)
Correlation Analysis
Logged Sentiment Counts (Pre-release) vs. Logged Gross
Want to improve movie gross?
Add extra violence!
Performance Comparison
(Regression Model)
Performance Comparison
(k-NN Models)
Stock Price Forecasting
Despite the Efficient Market Hypothesis,
there is some evidence that prices do not
optimally react to changes in news..
 If so, sentiment/frequency data should be
factored into pricing/trading models.

W. Zhang and S. Skiena, Trading Strategies to Exploit News Sentiment, in preparation.
News Frequencies vs. Stock
Trading Volume
A Market-Neutral News-Based
Trading Agent


Method: Rank companies by their reported
sentiment each day, then go long (short) on
equal amounts of positive (negative) sentiment
stocks
Four key tunable parameters:
 n:
The number of selected stocks
 s: The number of historical days used for sentiment
calculation.
 h: holding days
 Cl and Cu: The lower bound and upper bound of firms’
market capital.
Backtesting Results
Market Strategy Comparison
Our strategy shows positive returns with low volatility!
Almost very impressive: the timing resolution of our news data makes it
unconfirmable that all news comes from before market opening…
Media Influence in Political
Science
Chicken and egg problem: does the public
influence media coverage or visa versa?
 Lydia provided high-resolution media
analysis for the 2008 National Annenberg
Election Survey.
 Elite opinion and the Iraq War (HuddyJohnston-Lebo): Republicans reacted
differently to war news than Democrats.

Swings in Sentiment:
Obama vs. Clinton
General Sentiment
Technology licensed from Stony Brook
 SBIR and investor funding
 Eight employees and counting..

Understanding Popular Sentiment:
Hummer vs. Gas Prices
Outline of Talk
Introduction
 Lydia system architecture
 Spatial, temporal, and network analysis
 Sentiment analysis
 Applications
 Future directions

Future Directions
 Larger scale / improved data access, to support…
 Historical corpus / blog analysis
 Foreign-language news analysis
 Group-aggregated frequency analysis
 Improved sentiment analysis
 Meaningful collaboration with social scientists: e.g.
the sociology of cumulative advantage
Possible Areas for Collaboration
Visualization of news/blog analysis
 AI/Machine learning applied to Lydia data.
 Integrating/constructing entity ontologies.
 Identifying entity photoimages
 Cloud computing / MapReduce
 Domain-specific projects…

The Lydia Team
… and the Lydia Cluster
Sentiment Count Examples
Performance Comparison
Variations in Topic Coverage
Lydia classifies each article into news,
business, sports, entertainment.
 Thus we can study CEL group penetration
into these different spheres.

Stock Return & Polarity
Distribution
Cross-Language News Stream
Analysis


For each pair of languages, sentiment score
correlation is calculated for the common
entities in these languages
Polarities are correlated much more strongly
than frequencies or subjectivities
Bold correlations are significant with p < 0.05
Time Series News Analysis
•
New York Yankees References, 1987-2004 (New York Times)
•
Eliot Spitzer Sentiment Rank, Feb. 2008 - Mar. 2008
(US Newspapers)
Download