2005_KA_Lancaster - Department of Computing

The quality of social interaction:
Towards an automatic analysis of sentiments in
informative and persuasive texts.
Khurshid Ahmad,
Department of Computing, University of Surrey
Department of Computer Science, Trinity College, Dublin, Ireland
Workshop on Information Management and e-Science, Lancaster
e-Science Centre, Lancaster University, 5th October 2005
Motivation
Newly emergent subjects and e-Science:
Behavioural Economics; Investor Psychology;
Social Studies of Finance; Economic Sociology
‘The number of items of quantitative and qualitative
information available to well-equipped actor is, in effect,
infinite, yet the capacity of any agencement [humans, machines,
algorithms, location,..] to apprehend and to interpret that data
is finite’ (Hardie and Mackenzie 2005).
‘The economies of calculation’ (Mackenzie 2003, 2004, 2005)
1
Motivation
Newly emergent subjects and e-Science:
“I remember ’29 very well,” Steinbeck writes (2002:
17), “We had it made…I remember the drugged and
happy faces of people who built paper fortunes in
stocks they couldn’t possibly have paid for…Their
eyes had the look you see around the roulette table.”
Then, however, “came panic, and panic changed to
dull shock…People remembered their little bank
balances, the only certainties in a treacherous world.
They rushed to draw the money out. There were
fights and riots and lines of policemen. Some banks
failed; rumors began to fly”
2
Motivation


Of all the contested boundaries that define the discipline
of sociology, none is more crucial than the divide
between sociology and economics […] Talcott Parsons,
for all [his] synthesizing ambitions, solidified the divide.
“Basically,” […] “Parsons made a pact ... you, economists,
study value; we, the sociologists, will study values.”
If the financial markets are the core of many high-modern economies, so at their core is arbitrage: the
exploitation of discrepancies in the prices of identical or
similar assets.
MacKenzie, Donald. 2000b. “Long-Term Capital Management: a Sociological Essay.” In Ökonomie
und Gesellschaft, (Eds.) Herbert Kalthoff, Richard Rottenburg and Hans-Jürgen Wagener. Marburg: Metropolis. Pp
277-287.
3
Motivation

Social studies of finance repopulates abstracted financial
markets with human
• traders and speculators, who have particular and complex
relations to what they understand to be the market;
• inventors of market models and formulas, that prove to
be contested and fallible interpretations of economic reality
rather than unproblematic representations;
• designers of technology and risk assessment models,
which have normative choices and criteria at their hearts; and
• journalists who do not just write impassive financial
news, but play important roles in marketing financial products
and creating space for speculation in everyday life.
de Goede, Marieke (2005). "Resocialising and Repoliticising Financial Markets:
Contours of Social Studies of Finance". Economic Sociology.Vol. 6, No. 3 - July
2005
4
Motivation
Newly emergent subjects and e-Science:
Criminology: Crime Perception,
Detection and Prevention;
Anthropology: Ethnic and Cultural
Identity
‘The number of items of quantitative and qualitative
information available to well-equipped actor is, in
effect, infinite, yet the capacity of any agencement
[humans, machines, algorithms, location,..] to
apprehend and to interpret that data is finite’ (Hardie
and Mackenzie 2005)
5
Motivation: Bounded Rationality
Herbert Simon
•Mechanisms of Bounded Rationality –
rationality is bounded when it falls short of
omniscience – largely due to failures of knowing
all of the alternatives, uncertainty about relevant
exogenous events, and inability to calculate
consequences (pp 356)
•Human behaviour, even rational human
behaviour, is not to be accounted for by a
handful of invariants (pp 367)
6
Motivation
Sentiment Analysis?



In the 1960’s and 1970’s “The unpredictability of inflation was
a primary cause of business cycles”.
Friedman: “the level of inflation was not a problem; it was the
uncertainty about future costs and prices that would prevent
entrepreneurs from investing and lead to a recession” (Milton
Friedman 1977).
Friedman’s conjecture “could only be plausible if the
uncertainty were changing over time so this was my goal.
Econometricians call this heteroskedasticity.” (Robert Engle
2003)
Friedman, M. (1977), "Nobel Lecture: Inflation and Unemployment," Journal
of Political Economy, 85, 451-472.
Engle, Robert (2003). Risk and Volatility: Econometric Models
and Financial Practice. Nobel Lecture, December 8, 2003
7
Motivation: Sentiment Analysis?

Two strands of literature imply asymmetry in the
response of exchange rates to news.



First Strand: bad news in “good times” should have an
unusually large impact
Second Strand: “bad news should have unusually
large effects”
Robert Engle shared the 2003 Nobel Prize
in Economic Sciences for his work on formulating the impact
of ‘news’ on economic and financial variables.
‘News’ was code for the ‘announcement of key
economic indices by various agencies’.
Torben G. Andersen, Tim Bollerslev, Francis X. Diebold & Clara Vega (2002). Micro Effects of
Macro Announcements: Real-Time Price Discovery in Foreign Exchange. Working
Paper 8959. Cambridge, MA: National Bureau of Economic Research.
http://www.nber.org/papers/w8959
8
Motivation: Bounded Rationality
Daniel Kahneman
•Maps of Bounded Rationality – Two generic modes of cognitive
function: an intuitive mode, where judgements and decisions
are made automatically and rapidly, and a controlled mode
which is deliberate and slower (pp 449)
•Kahneman and Tversky found that intuitive judgements occupy a
position […] between automatic operation of perception and the
deliberate operations of reasoning (e.g. discrepancy between
statistical judgement and statistical knowledge).
(pp 450)
•Highly accessible features will influence decisions, while
features of low accessibility will be largely ignored. (pp459)
•Abrupt transition from risk aversion to risk seeking could not
be plausibly explained by a utility function for wealth (pp 461)
9
Motivation: Bounded Rationality
Japanese yen/US dollar exchange rate (decreasing solid line); US
consumer price index (increasing solid line); Japanese consumer price
index (increasing dashed line), 1970:1 − 2003:5, monthly observations
Why is it that Japanese consumer price index is
following the same trend as the US CPI?
10
Motivation:
I wrote therefore I existed; I may write and
change the world
The real world | Genre
News Reports; Regulatory Body Reports | Informative
Commentaries; Letters to the Editors; Rumour-laden e-mails | Appellative
Semi-structured interviews; Confidence Surveys | Expressive
++ Language and text are constitutive (and not merely representational)
-- ‘society is not reducible to language and linguistic analysis (Hodgson 2000:62).
-- Discourses are broader than language, being constituted not just in texts, but
also in definite institutional and organizational practices’ (Jackson 2004).
++ But text is all we have after the event, the interview, the survey, the news, the
review – a trace of the sentiment.
11
The quality of social interaction
or the world according to Khurshid Ahmad
Any analysis of the interaction
between the members of a well
defined social group, where each
is engaged in optimising return on
his or her economic and social
investment, should involve an
analysis of the 'sentiments' of the
group members
12
The quality of social interaction
or the world according to Khurshid Ahmad
The sentiment is expressed in the
news and views that emanate for and
on behalf of the members in free
natural language writing and speech
excerpts.
The quantifiable aspects of the
exchange of objects abstract (power)
and concrete (money, goods, and
services) have to be assessed in the
context of how the news and views
may impact on the exchange.
13
The quality of social interaction
or the world according to other folk
More importantly the sentiment may be
expressed through action:
(a) panic buying and selling of financial
instruments by the investors and traders,
and
(b) the sometimes complacent attitude of
the regulators, are good examples of
economic, social and political action by
individuals and groups.
Simon, H.A. (1978). “Rational Decision-Making in Business Organizations”. Nobel Lectures, Economics
1969-1980, (Editor) Assar Lindbeck, World Scientific Publishing Co.: Singapore, 1992.
http://www.nobel.se/economics/laureates/1978/simon-lecture.html.
Kahneman, D. (2002). “Maps of Bounded Rationality: A perspective on Intuitive Judgement and Choice”, Les
Prix Nobel 2002. (Editor) Professor Tore Frangsmyr. http://www.nobel.se/economics/laureates/2002/kahneman-lecture.html.
Mackenzie, Donald. (2000). ‘Fear in the Markets’. London Review of Books. Vol 22 (No. 8).
14
The quality of social interaction
or the world according to other folk
Actions motivated by panic can
equally well be seen in mass
hysteria related to national/ethnic
identity that, in turn, can
motivate concerns related to
security and safety (Jackson
2004).
Jackson, Richard (2004). ‘The Social Construction of Internal War’ In (Ed.)
Richard Jackson. (Re)Constructing Cultures of Violence and Peace. Rodopi:
Amsterdam/New York.
15
e-Science and social interaction?
• The UK e-Science programme is moving towards successful completion.
• Major contributions have been made to UK science and technology:
  • Bioinformatics, psychiatry, chemistry and engineering (Discovery Net and myGrid);
  • New ways of doing chemistry (CombeChem);
  • Visualisation of complex systems (RealityGrid);
  • Novel design (GEODISE);
  • Safer aircraft (DAME).
16
e-Science and social interaction?
• Crime, conflict, and economy are deeply interrelated and highly interactive.
• However, data and methods in each area sit in a mono-disciplinary silo, referred to by some as data tombs, where access by others requires significant mediation.
• Data required in each case includes quantitative data, textual data, and historical data.
17
e-Science and social interaction?
• Social sciences and the so-called hard sciences increasingly use complementary methodologies, and a century or more of discussion of methodology, statistical methods and structural models is witness to this.
• e-Science offers the potential for convergence of scientific methods through provision of a common underlying structure, or "grid", of computational methods, database technologies and conceptual models.
18
e-Science and social interaction?

Social scientists often want to develop
evidence based substantive theory. They
want to know “what determines what”, e.g. long
term unemployment and social exclusion

And social scientists want to explore the
consequences of policy changes on
individual behaviour, e.g. encouragement to
stay on at school on educational attainment,
truancy, and social exclusion

Social science data sets may be small (<10GB
(some exceptions)) but they are complex
(Imitation is the sincerest form of flattery – Rob)
19
e-Science and social interaction?
Financial
Economics
Sociology of
Crime; Crime
Science
Social
Anthropology
Macro-micro Economic Indicators; Census Statistics;
Survey of Social Attitudes;
Life-style and Well-being Statistics;
Market Movement
Crime
Statistics
Ethnicity-related
data
Political News – Reports, Editorials, Letters to the Editor;
Political and Social Opinion Polls;
Consumer Confidence Survey;
Investor/Trader Confidence
Surveys; Regulatory Body
Output;
Financial News;
Citizen Confidence Surveys;
Police Forces/Home Office
Reports;
Crime Reports;
Ethnic Minority Surveys;
Police Forces/Home Office
Reports;
Crime Reports;
20
The Surrey Society Grid
Demonstrator

• Was developed under the aegis of the ESRC e-Social Science Programme (FINGRID).
• Demonstrated how Grid technologies could support novel research activities in financial economics that involve:
  • the rapid processing of large volumes of time-varying qualitative and quantitative data (Monte Carlo simulation, wavelet analysis, fuzzy logic and neural network based simulations);
  • fusing/visualising of such qualitative and quantitative data (qualitative data: news, e-mails; quantitative data: non-stationary and heteroskedastic data collated at different frequencies and in different units).
21
The Society Grid Demonstrator
• Globus Toolkit 3.0 (based on Open Grid Services Architecture (OGSA))
• Java CoG Kit (Java Commodity Grid) for resource management and system integration
• Languages for development:
  • Java for the implementation of the application
  • Reuters SSL Developer’s Kit (Java) for the connection with the Reuters streaming data
• Other technologies:
  • XML (NewsML) for the news information
  • JMatLink (adapted to the Linux environment) for communication with the Matlab environment
  • CGI for communication of the Java applet with the server side
22
The Society Grid Demonstrator

Live financial data: news, historical time series data
and tick data provided by Reuters, (Reuters SSL SDK).

Time series analysis: a FORTRAN bootstrap algorithm,
and the MATLAB toolkit for Wavelet Analysis (via
JMatLink)

News/Sentiment analysis: System Quirk components
for terminology extraction, ontology learning and local
grammar analysis.

Visualisation and fusion: System Quirk components for
corpus visualisation, financial charting, and data fusion.
23
Design and Performance of the
Society Grid
[Figure: time in ms (log scale) against number of CPUs (0 to 80), with series for preparation time, GridFTP upload time, processing time and GridFTP download time.]
24
The new (e-) Social Sciences?
• Social sciences deal with collectives, or agencements, comprising human beings, technical devices, algorithms, workplaces and so on (Callon 1998), such that the number of items of quantitative and qualitative information available to a well-equipped economic actor, or agencement, ‘is, in effect, infinite, yet the capacity of any agencement to apprehend and to interpret that data is finite’ (Hardie and MacKenzie 2005)
Callon, Michel. (1998). The Laws of the Markets. Oxford: Blackwell.
Hardie, Iain & MacKenzie, Donald. (July 2005). An Economy of
Calculation: Agencement and Distributed Cognition in a Hedge Fund
(http://www.sps.ed.ac.uk/staff/An%20Economy%20of%20Calculation.pdf)
25
The new (e-) Social Sciences?

The number of data items available to an agencement in a market place –
financial instruments, commodity markets, e-Bay (?) – is potentially
infinite, but at any given time only a fraction of that data can be processed.
The market place is a fickle place, and the information derived from
historical data can be outdated so quickly that there is a need ‘in any agencement for a
selective, socially distributed, technologically-mediated ‘economy of
calculation’’.

“The economies of calculation and the agencements that underpin them
stretch beyond individual firms: the sifting of information often takes
place in networks of interacting participants. The features of processes
involved – for instance, where agency lies, the types of information that
are deemed relevant or irrelevant, how that information is processed –
are consequential. They affect, for example, the possibility of a ‘global’
market and help shape how ‘markets’ and ‘politics’ interact.” (Hardie &
MacKenzie 2005).
Hardie, Iain & MacKenzie, Donald. (July 2005). An Economy of
Calculation: Agencement and Distributed Cognition in a Hedge Fund
(available from D.MacKenzie@ed.ac.uk)
26
The new (e-) Social Sciences?

Sentiments and the sociology of financial
markets
• Mackenzie has focused on how a mathematical-economics theory is used to create a new instrument – especially in arbitrage (Mackenzie 2003) and options markets (Mackenzie and Millo 2003, Mackenzie 2004) – and how the theory is then used to explain and monitor the workings of the instrument.
• Mackenzie, Knorr-Cetina and others are studying the rise of electronic markets – where people in distant geographical locations can be ‘interactionally present’
Mackenzie, Donald. (2003). ‘Long-Term Capital Management and the sociology of
arbitrage’. Economy and Society Vol. 32 (No. 3). pp 349-380.
27
The new (e-) Social Sciences?
Sentiments and the sociology of financial markets

Mackenzie used interviewing techniques to understand the collapse of a
large arbitrage firm (Long-Term Capital Management, LTCM), a firm that
pioneered trading of financial instruments that sought to profit from
price discrepancies; the 24/7 watch on price discrepancies requires a
distributed computational infrastructure.

Mackenzie (2003) has looked at the change in the value of the
instruments and has conducted just under 70 interviews with partners
and employees of the failed firm, including a Nobel Laureate who was a
partner, and with other experts; he has also examined documents that were found
to have precipitated or hastened the demise of LTCM. The sentiment
about LTCM as expressed in the interviews, and in some of the key
documents, formed the basis of an analysis of a set of time series and the
computation of key parameters of the time series.
Mackenzie, Donald. (2003). ‘Long-Term Capital Management and the sociology of
arbitrage’. Economy and Society Vol. 32 (No. 3). pp 349-380.
28
The new (e-) Social Sciences?
Sentiments and the sociology of financial markets
• Mackenzie found that he was working with a community of people who had organized themselves and knew each other. There was evidence that imitation by others of the business model and practices adopted by the firm played a major role in the demise of the firm. Most importantly for us, Mackenzie cites the existence of a fax sent by one of the principals of the firm that asked investors to make more investment as problems had started to arise: this fax was posted on the Internet within five minutes of its dispatch and contributed to the demise of the firm. The sentiments expressed by the principal were misconstrued by the recipients and, despite the fairly sound reasons expressed in the fax, albeit in a febrile atmosphere, the bounded rationality of the recipients came into play.
Mackenzie, Donald. (2003). ‘Long-Term Capital Management and the sociology of
arbitrage’. Economy and Society Vol. 32 (No. 3). pp 349-380.
29
The new (e-) Social Sciences?

Sentiments and the sociology of financial markets

Knorr-Cetina and Bruegger (2002) have looked at the emergence of
electronic markets and focused on the virtual societies being formed in
the financial markets through the infrastructure that supports electronic
trading.
The trading room operative is in a disembodied world, dealing with an on-screen reality that ‘lacks an off-screen counterpart’ – a form of
representation (appresentation) of markets. The operative is connected
to others through electronic mail, news and data feeds (this is not
explicitly dealt with in Knorr-Cetina and Bruegger), and has access to a
computing system that can process very complex data in a timely and
efficient manner.
This virtual world has fast throughput of data and processed information
and the rapidity of the interaction perhaps compensates for the
disembodied nature of the electronic trading markets.


Knorr-Cetina, Karin & Bruegger, Urs. (2002). ‘Global Microstructures: The Virtual
Societies of Financial Markets’. American Journal of Sociology. Volume 107, pp 909-950.
30
The new (e-) Social Sciences?
There is a constant stream of news and e-mails in a
dealing room. Some directly from news agencies (*) and
some annotated items based on the news
Hardie, Iain & MacKenzie, Donald. (July 2005). An Economy of
Calculation: Agencement and Distributed Cognition in a Hedge Fund
(available from D.MacKenzie@ed.ac.uk)
31
The new (e-) Social Sciences?
But whilst the trader is not ‘reading’ the news off the live news wire
streams (Reuters, Bloomberg, BBC, CNN), somebody else is eyeballing the news for
the content (Brazilian economics, Chilean politics) and the sentiment (bonds so
hot that they were on fire!)
Hardie, Iain & MacKenzie, Donald. (July 2005). An Economy of
Calculation: Agencement and Distributed Cognition in a Hedge Fund
(available from D.MacKenzie@ed.ac.uk)
34
The classical Social Sciences:
Eyeballing the text!
• The key requirement in contemporary social sciences is to complement the analysis of a range of data sets, demographic, economic and political, with data related to the person (Kahneman 2002, Simon 1972), or lived experience (Sacks 1992, Silverman 2004)
Sacks, H., (1992). Lectures on Conversation. Oxford: Blackwell Publishers (Ed. Gail
Jefferson).
Silverman, David. (2004). ‘Who cares about experience?’. In (Ed.) David Silverman.
Qualitative Research. London: Sage Publications. pp 342-367.
35
The classical Social Sciences: Eyeballing the text!
Package | Function | Facilities
ATLAS.ti | Text analysis and model building | Users attach codes and annotate; search/select segments by code; manual hotlinks connect segments and display link information diagrammatically; similar segments can be coded automatically
The General Inquirer | Content analysis | Users can establish patterns in the meaning of words, supported by large content dictionaries (Lasswell Value Dictionary; Harvard Psycho-Sociological Dictionary)
Nvivo | ‘Entry’ level qualitative text analysis | Users supply text patterns and can analyse a text database through text-pattern matching to search for repetition, variant word forms and recurrent phrases
QUALRUS | General purpose qualitative analysis package | Offers intelligent suggestions throughout the coding process; analysis of data once it has been coded
TextSmart (SPSS's module) | Coding and analyzing open-ended survey questions | Automated stemming; grouping of synonyms; excludes grammatical words automatically; term clustering; text categorisation based on clustering
36
Dictionary free approach
The classical Social Sciences:
Eyeballing the text!

What is missing in the qualitative analysis packages?
• The texts have to be eyeballed: most phrases, clauses and paragraphs have to be coded/annotated by hand → an impossible task when the volume of text all around us is exploding;
• There is a need for a domain specific thesaurus (conceptually-organised terminology or ‘ontology’) for each new domain →
  • Identify ontological commitments;
  • Find terms, and the broader/narrower equivalents; synonyms and antonyms;
  • Maintain terminology data bases
• Texts that are conceptually similar within a domain have to be clustered using unsupervised learning algorithms
37
The new (e-) Social Sciences?
Towards an automatic analysis
• What is missing in the qualitative analysis packages?
38
The new (e-) Social Sciences?
Towards an automatic analysis
• One key result of close social interaction is the emergence of a sub-set of the natural language of a given community that is idiosyncratic to the desires, aspirations, goals and prejudices of the community → idiosyncratic nature of the ontological commitment of the community;
• The subset has its own lexicogrammar and is called the language for special purposes of a given specialism;
• Lexicogrammar: Vocabulary (terminology) + Local Grammar
39
The new (e-) Social Sciences?
Towards an automatic analysis
July 2005 Reuters Financial News Service: news items
disambiguated using an automatically extracted terminology and an
automatically constructed local grammar that only recognises changes in
financial instruments
Measure | Total | Per Hour
Number of News Items | 134,975 | 208
Number of Words | 46,337,111 | 71,508
Raw Sentiment | 774,507 | 1,195
Raw Positive | 520,006 | 802
Raw Negative | 254,501 | 393
Filtered Sentiment | 56,102 | 87
Filtered Positive | 17,340 | 27
Filtered Negative | 38,762 | 60
40
The new (e-) Social Sciences?
Towards an automatic analysis
Changes in ‘semantic orientation’ for a news input, for July
2005 for all shares in the FTSE.
[Figure: semantic orientation (vertical axis, -500 to 500) plotted against hours (horizontal axis, 0 to 250).]
41
The new (e-) Social Sciences?
Towards an automatic analysis
•There is no obvious technique in social science research methods that can improve the researcher's productivity in collecting and analysing large volumes of speech and text.
•Social scientists survey, and occasionally interview, interesting individuals in various social groups, then analyse the survey forms and quantify.
The real world | Genre
News Reports; Regulatory Body Reports | Informative
Commentaries; Letters to the Editors; Rumour-laden e-mails | Appellative
Semi-structured interviews; Confidence Surveys | Expressive
•So what about the data collected in the field? Data is buried in tombs, never to be taken out again.
•Most text, if coded at all, is hand-coded by the social science researcher, and then a proxy of the interpretation of the codes is presented as objective analysis.
42
The new (e-) Social Sciences?
Towards an automatic analysis
•We present a method for systematically identifying sentiment-bearing phrases in large volumes of streaming texts: a local grammar comprising templates to extract the phrases with a minimal number of false positives.
The real world | Genre
News Reports; Regulatory Body Reports | Informative
Commentaries; Letters to the Editors; Rumour-laden e-mails | Appellative
Semi-structured interviews; Confidence Surveys | Expressive
•The sentiments are aligned with quantitative (time-varying) information, and the resulting series are tested for co-integration and for Granger causality.
•The grammar itself is constructed automatically from a corpus of domain specific texts.
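The Granger causality step can be sketched as follows. This is a minimal illustration on synthetic data, not the Society Grid implementation: the function names, the lag length and the data are all assumptions, and the co-integration step is not shown. The test compares a regression of returns on their own lags with one that also includes lagged sentiment; a large F statistic suggests that past sentiment helps to predict returns.

```python
import numpy as np


def lagged(series, n_lags):
    """Columns series[t-1], ..., series[t-n_lags] for t = n_lags .. len-1."""
    n = len(series)
    return np.column_stack([series[n_lags - k: n - k] for k in range(1, n_lags + 1)])


def rss(design, target):
    """Residual sum of squares of an ordinary least-squares fit."""
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    resid = target - design @ coef
    return float(resid @ resid)


def granger_f(cause, effect, n_lags=2):
    """F statistic for the hypothesis that lagged `cause` helps predict `effect`."""
    cause = np.asarray(cause, dtype=float)
    effect = np.asarray(effect, dtype=float)
    target = effect[n_lags:]
    const = np.ones((len(target), 1))
    restricted = np.hstack([const, lagged(effect, n_lags)])        # own lags only
    unrestricted = np.hstack([restricted, lagged(cause, n_lags)])  # plus lags of the cause
    rss_r, rss_u = rss(restricted, target), rss(unrestricted, target)
    dof = len(target) - unrestricted.shape[1]
    return ((rss_r - rss_u) / n_lags) / (rss_u / dof)


# Synthetic, made-up data: returns follow the previous hour's sentiment.
rng = np.random.default_rng(0)
sentiment = rng.normal(size=300)
noise = rng.normal(scale=0.5, size=300)
returns = np.empty(300)
returns[0] = noise[0]
returns[1:] = 0.8 * sentiment[:-1] + noise[1:]

f_forward = granger_f(sentiment, returns)  # sentiment -> returns
f_reverse = granger_f(returns, sentiment)  # returns -> sentiment
print(f_forward, f_reverse)
```

In practice the statistic would be compared with a critical value of the F distribution with (n_lags, dof) degrees of freedom; a library routine such as `grangercausalitytests` in statsmodels could replace this hand-rolled version.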
43
Conclusions and Future Work

• The methods developed in the Society Grid project can be used to investigate a person’s perception of his or her own well-being, at different times and in different places, and in its various facets (social, political and economic).
• This perception can agree with, or be at variance with, for example, crime statistics, economic indicators, or the achievements or failures of (other) ethnic/racial categories.
• These methods can be extended to new areas like
  • the reassurance gap in policing
  • totalising war discourse that leads to ethnic/racial conflicts
44
Towards an automatic analysis
of sentiments?
We rely on reviews and opinion polls of various
kinds:
• Film & TV reviews; book reviews; resort reviews
• Bank reviews; automobile reviews; white goods reviews;
• Consumer surveys; ‘write your own’ reviews;
• Newspaper editorials; Editors’ choice.
45
Towards an automatic analysis
of sentiments?
• We rely on the sentiment of the reviewers, editors, investment experts, and ……
• We do know the cost of durables, shares, holidays.
• A reasonable price is rejected if the reviews are poor; an exorbitant price is acceptable if the reviews are good;
• Bad reviews stick in the mind for longer than good reviews.
46
Towards an automatic analysis
of sentiments?
• We sometimes rely on the sentiment of the more vociferous in society
• The vociferous may call black white, and white black;
• The vociferous may repudiate facts and purvey fiction.
47
Towards an automatic analysis
of sentiments?
A new bank has just been launched: Punter Smith has passed his
judgement on the bank. Which of the two columns tells us that he likes
the new outfit?
online service | unethical practices
online experience | low funds
direct deposit | other problems
local branch | old man
low fees | lesser evil
well other | virtual monopoly
small part | probably wondering
printable version | little difference
true service | other bank
other bank | possible moment
inconveniently located | extra day
Turney, Peter D. (2002). “Thumbs Up or Thumbs Down? Semantic Orientation Applied to Unsupervised
Classification of Reviews”. In Proc of the 40th Ann. Meeting of the Ass. for Comp. Linguistics (ACL).
Philadelphia, July 2002, pp. 417-424. (Available at http://acl.ldc.upenn.edu/P/P02/P02-1053.pdf).
48
Towards an automatic analysis
of sentiments?
How can a machine detect the positive/negative sentiment from texts?
We eyeball the collocation of words like excellent & poor in a text corpus.
The pointwise mutual information is computed between word1 and word2 (with the base-2 logarithm, as in Turney 2002):
$$\mathrm{PMI}(word_1, word_2) = \log_2 \left( \frac{p(word_1 \,\&\, word_2)}{p(word_1)\, p(word_2)} \right)$$
Semantic orientation of a phrase is given as:
$$\mathrm{SemOr}(phrase) = \mathrm{PMI}(\text{"excellent"}, phrase) - \mathrm{PMI}(\text{"poor"}, phrase)$$
Turney, Peter D. (2002). “Thumbs Up or Thumbs Down? Semantic Orientation Applied to Unsupervised
Classification of Reviews”. In Proc of the 40th Ann. Meeting of the Ass. for Comp. Linguistics (ACL).
Philadelphia, July 2002, pp. 417-424. (Available at http://acl.ldc.upenn.edu/P/P02/P02-1053.pdf).
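A minimal sketch of how the semantic orientation of a phrase might be computed from co-occurrence counts. The function names and the counts are hypothetical; the small smoothing constant follows Turney (2002), who adds 0.01 to the co-occurrence counts to avoid division by zero. Because the corpus size and the phrase's own frequency cancel in the difference of the two PMI values, the score reduces to a single log ratio.

```python
import math


def pmi(p_joint, p_a, p_b):
    """Pointwise mutual information (base 2) of two events."""
    return math.log2(p_joint / (p_a * p_b))


def semantic_orientation(near_excellent, near_poor, hits_excellent, hits_poor,
                         smoothing=0.01):
    """SemOr(phrase) = PMI("excellent", phrase) - PMI("poor", phrase).

    near_excellent / near_poor: counts of the phrase occurring near each word.
    hits_excellent / hits_poor: counts of each word on its own.
    """
    return math.log2(((near_excellent + smoothing) * hits_poor)
                     / ((near_poor + smoothing) * hits_excellent))


# Hypothetical counts, for illustration only.
score = semantic_orientation(near_excellent=80, near_poor=20,
                             hits_excellent=1000, hits_poor=1000)
print(score)  # positive: the phrase keeps company with "excellent" more than with "poor"
```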
49
Towards an automatic analysis
of sentiments?
How can a machine detect the positive/negative sentiment from texts?
We eyeball the collocation of words like excellent & poor in a number
of texts.
Phrase | Semantic Orientation | Phrase | Semantic Orientation
online service | 2.780 | unethical practices | -8.484
online experience | 2.253 | low funds | -6.843
direct deposit | 1.288 | other problems | -2.748
local branch | 0.421 | old man | -2.566
low fees | 0.333 | lesser evil | -2.288
well other | 0.237 | virtual monopoly | -2.050
small part | 0.053 | probably wondering | -1.830
printable version | -0.705 | little difference | -1.615
true service | -0.732 | other bank | -0.850
other bank | -0.850 | possible moment | -0.668
inconveniently located | -1.541 | extra day | -0.286
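Following Turney (2002), a review can be classified by the average semantic orientation of its phrases: recommended if the average is positive, not recommended otherwise. The sketch below applies that rule to the values listed on the slide. The assignment of phrases to the two columns is reconstructed from the slide layout and should be read as an assumption, as are the variable and function names.

```python
from statistics import mean

# Semantic orientation values from the slide (after Turney 2002).
first_column = [
    ("online service", 2.780), ("online experience", 2.253),
    ("direct deposit", 1.288), ("local branch", 0.421),
    ("low fees", 0.333), ("well other", 0.237), ("small part", 0.053),
    ("printable version", -0.705), ("true service", -0.732),
    ("other bank", -0.850), ("inconveniently located", -1.541),
]
second_column = [
    ("unethical practices", -8.484), ("low funds", -6.843),
    ("other problems", -2.748), ("old man", -2.566),
    ("lesser evil", -2.288), ("virtual monopoly", -2.050),
    ("probably wondering", -1.830), ("little difference", -1.615),
    ("other bank", -0.850), ("possible moment", -0.668),
    ("extra day", -0.286),
]


def classify(phrases):
    """Label a review by the sign of its average semantic orientation."""
    average = mean(score for _, score in phrases)
    label = "recommended" if average > 0 else "not recommended"
    return label, average


print(classify(first_column))
print(classify(second_column))
```

Individual phrases can point the wrong way (for example "inconveniently located" in the first column); it is the average over the whole review that decides the label.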
50
Towards an automatic analysis
of sentiments?

Robert Engle’s contribution: Volatility may vary
considerably over time: large (small) changes in
returns are followed by large (small) changes.
Engle, R. F. (1982). Autoregressive conditional heteroscedasticity with estimates
of the variance of United Kingdom inflation. Econometrica Vol 50, pp 987-1007.
51
Towards an automatic analysis
of sentiments?
Engle and Ng have developed the concept of the news impact
curve.


• The idea is to condition at time t on the information available at t − 2 and thus consider the effect of the shock $\varepsilon_{t-1}$ on the conditional variance $h_t$ in isolation.
• The conditional variance is affected by the latest information, “the news” $\varepsilon_{t-1}$:
  • The symmetric case: both positive and negative news have the same effect.
  $$h_t = \alpha_0 + \alpha_1 \varepsilon_{t-1}^2$$
  • The asymmetric case: a positive and an equally large negative piece of “news” do not have the same effect on the conditional variance.
  $$h_t = \alpha_0 + \alpha_1 \varepsilon_{t-1}^2 + \beta_1 h_{t-1}$$
Engle, R. F. and Ng, V. K (1993). Measuring and testing the impact of news on volatility, Journal
of Finance Vol. 48, pp 1749—1777.
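A minimal sketch of a news impact curve, with made-up parameter values. For a GARCH(1,1) model the curve is a parabola in the shock and is therefore symmetric in its sign. To illustrate an asymmetric response, the sketch also uses a GJR-style variant in which negative shocks carry an extra weight `gamma`; that variant and all names here are assumptions for illustration, not taken from the slides. In both cases the lagged variance is held fixed at a reference level `sigma2`, so that only the effect of the latest shock is shown.

```python
def nic_garch(shock, omega, alpha, beta, sigma2):
    """GARCH(1,1) news impact curve: next variance as a function of the
    latest shock, with the lagged variance held at sigma2."""
    return omega + beta * sigma2 + alpha * shock ** 2


def nic_gjr(shock, omega, alpha, gamma, beta, sigma2):
    """GJR-style curve: negative shocks get the additional weight gamma."""
    weight = alpha + (gamma if shock < 0 else 0.0)
    return omega + beta * sigma2 + weight * shock ** 2


# Made-up parameter values, for illustration only.
params = dict(omega=0.05, alpha=0.05, beta=0.90, sigma2=1.0)
for shock in (-2.0, -1.0, 0.0, 1.0, 2.0):
    symmetric = nic_garch(shock, **params)
    asymmetric = nic_gjr(shock, gamma=0.10, **params)
    print(f"shock={shock:+.1f}  symmetric={symmetric:.3f}  asymmetric={asymmetric:.3f}")
```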
52
News Analysis and Sentiment
Analysis

Dan Nelson (1992) ‘recognized that volatility
could respond asymmetrically to past forecast
errors. In a financial context, negative returns
seemed to be more important predictors of
volatility than positive returns. Large price
declines forecast greater volatility than similarly
large price increases. This is an economically
interesting effect that has wide ranging
implications’
53
Towards an automatic analysis
of sentiments?
Symmetric case
Asymmetric case
Engle, R. F. and Ng, V. K (1993). Measuring and testing the impact of news on volatility, Journal
of Finance Vol. 48, pp 1749—1777.
54
Towards an automatic analysis of
sentiments?

News Effects
• I: News Announcements Matter, and Quickly;
• II: Announcement Timing Matters;
• III: Volatility Adjusts to News Gradually;
• IV: Pure Announcement Effects are Present in Volatility;
• V: Announcement Effects are Asymmetric – Responses Vary with the Sign of the News;
• VI: The effect on traded volume persists longer than on prices.
Andersen, T. G., Bollerslev, T., Diebold, F X., & Vega, C. (2002). Micro effects of macro announcements: Real
time price discovery in foreign exchange. National Bureau of Economic Research Working Paper 8959,
http://www.nber.org/papers/w8959
55
Eyeballing News for
Sentiments



Qualitative research methods are being used in financial
economics, and in sociological studies of financial
markets, for systematically studying the hopes and fears
of the traders, investors, and regulators in the analysis of
the behaviour of the markets.
Since 2000, the analysis of news wire has become
selective and targeted.
Some researchers choose news related to economic and financial topics:
• some focus on news about employment;
• some distinguish between scheduled and non-scheduled news announcements.
Eyeballing News for
Sentiments


Some pre-select keywords that indicate change in the
value of a financial instrument – including metaphorical
terms like above, below, up and down – and use them to
‘represent’ positive/negative news stories.
Some use the frequency of collocation patterns for assigning a ‘feel-good/bad’ score to the story:
• ‘Good’ news stories appear to comprise collocates like revenues rose, share rose;
• ‘Bad’ news stories contain collocates like profit warning, poor expectation;
• ‘Neutral’ stories contain collocates such as announces product, alliance made.
The ‘sentiment’ of the story is then correlated with that of a financial instrument cited in the story, and inferences are made.
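A minimal sketch of a collocation-based ‘feel-good/bad’ score, assuming a story is scored as the number of ‘good’ collocates minus the number of ‘bad’ ones. The collocate pairs are the examples quoted above; the scoring rule and the name feel_score are assumptions for illustration, not the method of any cited study.

```python
import re

# Collocates taken from the examples above; a real list would be much longer.
GOOD = {("revenues", "rose"), ("share", "rose")}
BAD = {("profit", "warning"), ("poor", "expectation")}


def feel_score(story):
    """Return (#good - #bad) collocates found as adjacent word pairs."""
    tokens = re.findall(r"[a-z]+", story.lower())
    pairs = list(zip(tokens, tokens[1:]))
    good = sum(1 for p in pairs if p in GOOD)
    bad = sum(1 for p in pairs if p in BAD)
    return good - bad


print(feel_score("Revenues rose sharply and the share rose too."))
print(feel_score("The firm issued a profit warning."))
```

A score above zero marks a ‘good’ story, below zero a ‘bad’ one, and zero a ‘neutral’ one. Only adjacent pairs are matched here, which is a simplification.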
Automating News Analysis for
Extracting Sentiments
• We adopt a text-driven and bottom-up method: starting from a collection of texts in a specialist domain, together with a representative general language corpus, we use the following five-step algorithm for identifying discourse patterns with more or less unique meanings, without any overt access to an external knowledge base.
Automating News Analysis for
Extracting Sentiments: A method
I. Select training corpora: Reuters Corpus Volume 1 (RCV1) and a general language corpus;
II. Extract key words;
III. Extract key collocates;
IV. Extract local grammar using collocation and relevance feedback;
V. Assert the grammar as a finite state automaton.
Automating News Analysis for
Extracting Sentiments: An experiment
I. Select training corpora
• The British National Corpus, comprising 100 million tokens distributed over 4124 texts (Aston and Burnard 1998);
• Reuters Corpus Volume 1 (RCV1), comprising news texts produced in 1996-1997 and containing 181 million words distributed over 806,791 texts.
Automating News Analysis for
Extracting Sentiments: An experiment
II. Extract key words
• The frequencies of individual words in the RCV1 were computed using System Quirk;
• to describe how our method works we use a randomly selected component of the corpus, the output of February 1997, henceforth referred to as the RCV1-Feb97 corpus;
• the RCV1-Feb97 corpus contains 14 million words distributed over 63,364 texts.
Automating News Analysis for
Extracting Sentiments: An experiment
Ranks | RCV1 Feb97 (NRCV1Feb97 = 14 Million) | Cumulative Number of Tokens (%) | British National Corpus (NBNC = 100 Million) | Cumulative Number of Tokens (%)
1-10 | the, to, of, in, a, and, said, on, s, for | 0.87 M (21.3%) | the, of, and, a, in, to, for, is, as, that | 22.3 M (22.3%)
11-20 | at, that, was, is, it, by, with, from, percent, be | 0.28 M (6.8%) | was, I, on, with, as, be, he, you, at, by | 6.51 M (6.5%)
21-30 | as, he, million, year, its, will, but, has, would, were | 0.17 M (4.2%) | are, this, have, but, not, from, had, his, they, or | 4.23 M (4.2%)
31-40 | an, not, are, have, which, had, up, n, new, market | 0.13 M (3.3%) | which, an, she, where, here, we, one, there, all, been | 3.05 M (3.1%)
41-50 | this, we, after, one, last, company, u, they, bank, government | 0.10 M (2.6%) | their, if, has, will, so, would, no, what, can, when | 2.35 M (2.4%)
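A minimal sketch of how rank bands like those tabulated above could be computed from a token list: rank words by frequency, group the ranks into bands of ten, and report each band's share of all tokens. The function name rank_band_shares and the toy sentence are assumptions for illustration; System Quirk itself is not reproduced here.

```python
from collections import Counter


def rank_band_shares(tokens, band=10, bands=5):
    """Share of all tokens covered by each band of frequency ranks.

    Returns a list of (first_rank, last_rank, words, share) tuples.
    """
    counts = Counter(tokens)
    ranked = counts.most_common(band * bands)
    total = len(tokens)
    rows = []
    for start in range(0, len(ranked), band):
        chunk = ranked[start:start + band]
        words = [w for w, _ in chunk]
        share = sum(f for _, f in chunk) / total
        rows.append((start + 1, start + len(chunk), words, share))
    return rows


# Toy example with bands of two ranks instead of ten.
tokens = "the market rose and the bank said the shares fell".split()
rows = rank_band_shares(tokens, band=2, bands=2)
for first, last, words, share in rows:
    print(f"{first}-{last}", words, f"{share:.1%}")
```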
Automating News Analysis for
Extracting Sentiments: An experiment
RCV1 Feb97: NRCV1Feb97 = 14,244,349; BNC: NBNC = 100,000,000
Token | RCV1 Feb97 Rank | fRCV1Feb97 | fRCV1Feb97 / NRCV1Feb97 (a) | BNC Rank | fBNC | fBNC / NBNC (b) | Weirdness (a/b)
percent | 19 | 65763 | 0.462% | 3394 | 2928 | 0.003% | 157.84
market | 40 | 36349 | 0.255% | 301 | 30078 | 0.030% | 8.49
company | 46 | 29058 | 0.204% | 219 | 40118 | 0.040% | 5.09
bank | 49 | 28041 | 0.197% | 562 | 17932 | 0.018% | 10.99
shares | 56 | 23352 | 0.164% | 1285 | 8412 | 0.008% | 19.51
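The weirdness ratio (a/b) can be sketched directly from its definition: the relative frequency of a token in the specialist corpus divided by its relative frequency in the general language corpus. The frequencies below are those tabulated above; the function name weirdness is ours.

```python
def weirdness(f_special, n_special, f_general, n_general):
    """Ratio of relative frequencies: (f_s / N_s) / (f_g / N_g)."""
    return (f_special / n_special) / (f_general / n_general)


N_RCV1_FEB97 = 14_244_349
N_BNC = 100_000_000

# (token, frequency in RCV1-Feb97, frequency in BNC), as tabulated above
rows = [
    ("percent", 65763, 2928),
    ("market", 36349, 30078),
    ("company", 29058, 40118),
    ("bank", 28041, 17932),
    ("shares", 23352, 8412),
]
for token, f_s, f_g in rows:
    print(token, round(weirdness(f_s, N_RCV1_FEB97, f_g, N_BNC), 2))
```

The values should come out close to the Weirdness column; small differences are to be expected if the corpus size shown for the BNC is a rounded figure.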
Automating News Analysis for
Extracting Sentiments: An experiment
III. Extract key collocates
Collocates of percent (f = 65763):
Collocate | f | Left | Right | Total | z-score
up | 5315 | 4360 | 955 | 5315 | 15.91
rose | 4361 | 3988 | 373 | 4361 | 13.04
rise | 2391 | 980 | 1411 | 2391 | 7.12
down | 2291 | 1636 | 655 | 2291 | 6.82
fell | 2074 | 1844 | 230 | 2074 | 6.17
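A minimal sketch of collocate extraction for a nucleus word such as percent: count the words found within a window to its left and right, then standardise the totals. The window of five positions and the z-score definition (a collocate's total frequency against the mean and standard deviation over all collocates) are assumptions for illustration; the slides do not state the exact formula used.

```python
from collections import Counter
from statistics import mean, pstdev


def collocates(tokens, nucleus, window=5):
    """Count words within `window` positions to the left/right of `nucleus`."""
    left, right = Counter(), Counter()
    for i, tok in enumerate(tokens):
        if tok != nucleus:
            continue
        left.update(tokens[max(0, i - window):i])
        right.update(tokens[i + 1:i + 1 + window])
    return left, right


def z_scores(left, right):
    """z-score of each collocate's total frequency against all collocates."""
    total = left + right
    mu, sigma = mean(total.values()), pstdev(total.values())
    return {w: (f - mu) / sigma if sigma else 0.0 for w, f in total.items()}


tokens = "shares rose 10 percent to 5 while profits rose 20 percent".split()
left, right = collocates(tokens, "percent", window=2)
z = z_scores(left, right)
print(left["rose"], right["to"], z["rose"])
```

In the toy sentence, rose occurs twice to the left of percent and has the highest z-score, mirroring the left-heavy counts for rose in the table. The sketch assumes the nucleus occurs at least once.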
Automating News Analysis for
Extracting Sentiments: An experiment
IV. Extract local grammar using collocation and relevance feedback
Pattern | f | Collocate | Left | Right | z-score
10 percent to | 108 | rose | 24 | 0 | 5.45
by 10 percent to | 18 | rose | 5 | 0 | 2.27
rose 10 percent to | 14 | billion | 0 | 7 | 4.24
rose 20 percent to | 11 | billion | 1 | 7 | 6.02
Automating News Analysis for
Extracting Sentiments: An experiment
V. Assert the grammar as a finite state automaton
The (re-)collocation patterns can then be asserted as finite state automata, one for each of the movement verbs and spatial preposition metaphors.
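A minimal sketch of one such automaton, assuming a pattern of the form "movement word, optional by, number, percent" as suggested by the collocation patterns for percent. The states, the word lists and the polarity labels are hypothetical; the automata shown in the original slides are not reproduced here.

```python
import re

POSITIVE = {"rose", "up"}
NEGATIVE = {"fell", "down"}


def classify(token):
    """Map a token to the symbol class used on the automaton's arcs."""
    if token in POSITIVE or token in NEGATIVE:
        return "MOVE"
    if re.fullmatch(r"\d+(\.\d+)?", token):
        return "NUM"
    return token


# state -> {symbol: next state}; "by" is optional between verb and number
TRANSITIONS = {
    "START": {"MOVE": "MOVED"},
    "MOVED": {"by": "BY", "NUM": "NUMBER"},
    "BY": {"NUM": "NUMBER"},
    "NUMBER": {"percent": "ACCEPT"},
}


def find_movements(tokens):
    """Yield (polarity, matched tokens) for each accepted token sequence."""
    i = 0
    while i < len(tokens):
        state, j = "START", i
        while j < len(tokens) and state != "ACCEPT":
            nxt = TRANSITIONS.get(state, {}).get(classify(tokens[j]))
            if nxt is None:
                break
            state, j = nxt, j + 1
        if state == "ACCEPT":
            polarity = "positive" if tokens[i] in POSITIVE else "negative"
            yield polarity, tokens[i:j]
            i = j
        else:
            i += 1


text = "shares rose by 10 percent to 5 billion while profits fell 3.5 percent"
matches = list(find_movements(text.split()))
print(matches)
```

Only token sequences that reach the ACCEPT state are counted, which is one way of filtering raw positive/negative words down to those used inside a local grammar.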
Experiments and Evaluation of
sentiment analysis method
Automating News Analysis for
Extracting Sentiments: Some results
Changes in the total number of positive/negative words together with those that are used in the local grammars (filtered positive / negative words) and total number of words.
[Figure: Number of words (Log scale) against Hours from midnight Nov. 15th, 2004. Series: Raw Sentiment, Filtered Sentiment, Total number of Tokens.]
Automating News Analysis for
Extracting Sentiments: Some results
Changes in the total number of positive/negative words together with those that are used in the local grammars (filtered positive / negative words) and total number of words.
[Figure: Number of words (Log scale) against Hours from midnight Nov. 15th, 2004. Series: Raw Positive Words, Raw Negative Words, Filtered Positive Words, Filtered Negative Words, Total Number of Words.]
Automating News Analysis for
Extracting Sentiments: Bradford Riots?

BBC News tracked from 9/11/1999 to 5/08/2005 for the keywords Bradford Riots, Burnley Riots, and Oldham Riots
“City” | Number of News Items | Total # of Tokens | Average # of Tokens (±Std. Dev)
Bradford | 253 | 175191 | 3368 (±5478)
Burnley | 172 | 99059 | 2304 (±3236)
Oldham | 261 | 151696 | 3096 (±3041)
Automating News Analysis for
Extracting Sentiments: Bradford Riots?
BBC News tracked from 9/11/1999 to 5/08/2005 for the keywords Bradford Riots, Burnley Riots, and Oldham Riots. The results for the period July 2001-July 2002:
[Figure: Percentage Change 2001-2002 (y-axis) against Months 3 to 13 (x-axis), with one series each for Bradford, Oldham and Burnley.]
Automating News Analysis for
Extracting Sentiments: Bradford Riots?
Rate of change?
[Figure: Percentage occurrence of riots (y-axis) against Months 0 to 10 (x-axis), with one series each for Bradford, Oldham and Burnley.]
Automating News Analysis for
Extracting Sentiments: Bradford Riots?
The ‘common’ agencements: persons, places, institutions and acts
Shared between all 3 corpora: asian, blair, bradford, blunkett, burnley, bnp, oldham, racial, rioting, riots
Shared between 2 corpora: asians, griffin, racist, disturbances, youths, riot
Unique to a corpus: immigrant~, malik, shahid, manningham
Grids for Automating News Analysis

• We followed the word frequency counting approach of Hughes et al. (2003) to evaluate the performance of our implementation.
• The corpora used in our experiments are the Brown Corpus and the Reuters RCV1 Corpus:
Corpus | Files | Size (Mb) | Words (M)
Brown | 500 | 5.2 | 1.0
RCV1 | 806,791 | 2576.8 | 169.9
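A minimal sketch of word frequency counting split across workers, assuming a simple map-and-merge design: each text is counted independently and the partial counts are summed. Local threads stand in for grid nodes here, and the function names are ours; this is not the grid implementation evaluated in the experiments.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_words(text):
    """Map step: word frequencies for one text."""
    return Counter(text.lower().split())


def parallel_word_count(texts, workers=4):
    """Count words in each text concurrently, then merge the partial counts."""
    total = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(count_words, texts):
            total.update(partial)
    return total


texts = ["the market rose", "the bank said the market fell"]
counts = parallel_word_count(texts, workers=2)
print(counts.most_common(2))
```

Because the per-text counts are independent, the same split would carry over to separate processes or machines, with only the merge step needing to see all partial results.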
Grids for Automating News Analysis
[Figure: Time in seconds (y-axis, 0 to 8000) against Number of CPUs (x-axis, 0 to 80).]
Afterthought


Though we have devised programs that can learn unambiguous patterns of use of positive or negative sentiment, a sentence is always used in the context of other sentences, and an inference made on the basis of one sentence only may change once that context is taken into account;
One can argue that a new text is a response to some or all of the existing texts, and in that sense each text is contextualised within a network of other texts: even if all the existing texts unambiguously expressed a positive sentiment, a new text with strong negative sentiment may invalidate all of the positive sentiment.
Conclusions and Future Work
Data Sources
Disciplines: Financial Economics; Social Anthropology; Sociology of Crime; Crime Science
Quantitative: Macro-micro Economic Indicators; Census Statistics; Survey of Social Attitudes; Life-style and Well-being Statistics; Market Movement; Crime Statistics; Ethnicity-related data
Qualitative: Political News – Reports, Editorials, Letters to the Editor; Political and Social Opinion Polls; Consumer Confidence Survey; Investor/Trader Confidence Surveys; Regulatory Body Output; Financial News; Citizen Confidence Surveys; Police Forces/Home Office Reports; Crime Reports; Ethnic Minority ; Police Forces/Home Office Reports; Crime Reports