The quality of social interaction: Towards an automatic analysis of sentiments in informative and persuasive texts.
Khurshid Ahmad, Department of Computing, University of Surrey; Department of Computer Science, Trinity College, Dublin, Ireland
Workshop on Information Management and e-Science, Lancaster e-Science Centre, Lancaster University, 5th October 2005

Motivation
Newly emergent subjects and e-Science: Behavioural Economics; Investor Psychology; Social Studies of Finance; Economic Sociology.
‘The number of items of quantitative and qualitative information available to well-equipped actor is, in effect, infinite, yet the capacity of any agencement [humans, machines, algorithms, location,..] to apprehend and to interpret that data is finite’ (Hardie and Mackenzie 2005).
‘The economies of calculation’ (Mackenzie 2003, 2004, 2005)

Motivation
Newly emergent subjects and e-Science: “I remember ’29 very well,” Steinbeck writes (2002: 17), “We had it made…I remember the drugged and happy faces of people who built paper fortunes in stocks they couldn’t possibly have paid for…Their eyes had the look you see around the roulette table.” Then, however, “came panic, and panic changed to dull shock…People remembered their little bank balances, the only certainties in a treacherous world. They rushed to draw the money out. There were fights and riots and lines of policemen. Some banks failed; rumors began to fly”

Motivation
Of all the contested boundaries that define the discipline of sociology, none is more crucial than the divide between sociology and economics […] Talcott Parsons, for all [his] synthesizing ambitions, solidified the divide. “Basically,” […] “Parsons made a pact ... you, economists, study value; we, the sociologists, will study values.”
If the financial markets are the core of many high-modern economies, so at their core is arbitrage: the exploitation of discrepancies in the prices of identical or similar assets.
MacKenzie, Donald. 2000b.
“Long-Term Capital Management: a Sociological Essay.” In Ökonomie und Gesellschaft, (Eds) Herbert Kalthoff, Richard Rottenburg and Hans-Jürgen Wagener. Marburg: Metropolis. pp 277-287.

Motivation
Social studies of finance repopulates abstracted financial markets with human traders and speculators, who have particular and complex relations to what they understand to be the market; inventors of market models and formulas, that prove to be contested and fallible interpretations of economic reality rather than unproblematic representations; designers of technology and risk assessment models, which have normative choices and criteria at their hearts; and journalists who do not just write impassive financial news, but play important roles in marketing financial products and creating space for speculation in everyday life.
de Goede, Marieke (2005). "Resocialising and Repoliticising Financial Markets: Contours of Social Studies of Finance". Economic Sociology. Vol. 6, No. 3, July 2005.

Motivation
Newly emergent subjects and e-Science: Criminology: Crime Perception, Detection and Prevention; Anthropology: Ethnic and Cultural Identity.
‘The number of items of quantitative and qualitative information available to well-equipped actor is, in effect, infinite, yet the capacity of any agencement [humans, machines, algorithms, location,..] to apprehend and to interpret that data is finite’ (Hardie and Mackenzie 2005)

Motivation: Bounded Rationality
Herbert Simon
•Mechanisms of Bounded Rationality – rationality is bounded when it falls short of omniscience – largely due to failures of knowing all of the alternatives, uncertainty about relevant exogenous events, and inability to calculate consequences (pp 356)
•Human behaviour, even rational human behaviour, is not to be accounted for by a handful of invariants (pp 367)

Motivation: Sentiment Analysis?
In the 1960’s and 1970’s “The unpredictability of inflation was a primary cause of business cycles”.
Friedman: “the level of inflation was not a problem; it was the uncertainty about future costs and prices that would prevent entrepreneurs from investing and lead to a recession” (Milton Friedman 1977). Friedman’s conjecture “could only be plausible if the uncertainty were changing over time so this was my goal. Econometricians call this heteroskedasticity.” (Robert Engle 2003)
Friedman, M. (1977). "Nobel Lecture: Inflation and Unemployment," Journal of Political Economy, 85, 451-472.
Engle, Robert (2003). Risk and Volatility: Econometric Models and Financial Practice. Nobel Lecture, December 8, 2003.

Motivation: Sentiment Analysis?
Two strands of literature imply asymmetry in the response of exchange rates to news.
First Strand: bad news in “good times” should have an unusually large impact
Second Strand: “bad news should have unusually large effects”
Robert Engle shared the 2003 Nobel Prize in Economic Sciences for formulating the impact of ‘news’ on economic and financial variables. ‘News’ was code for the ‘announcement of key economic indices by various agencies’.
Torben G. Andersen, Tim Bollerslev, Francis X. Diebold & Clara Vega (2002). Micro Effects of Macro Announcements: Real-Time Price Discovery in Foreign Exchange. Working Paper 8959. Cambridge, MA: National Bureau of Economic Research. http://www.nber.org/papers/w8959

Motivation: Bounded Rationality
Daniel Kahneman
•Maps of Bounded Rationality – Two generic modes of cognitive function: an intuitive mode, where judgements and decisions are made automatically and rapidly, and a controlled mode which is deliberate and slower (pp 449)
•Kahneman and Tversky found that intuitive judgements occupy a position […] between automatic operation of perception and the deliberate operations of reasoning (e.g. discrepancy between statistical judgement and statistical knowledge). (pp 450)
•Highly accessible features will influence decisions, while features of low accessibility will be largely ignored.
(pp 459)
•Abrupt transition from risk aversion to risk seeking could not be plausibly explained by a utility function for wealth (pp 461)

Motivation: Bounded Rationality
[Figure: Japanese yen/US dollar exchange rate (decreasing solid line); US consumer price index (increasing solid line); Japanese consumer price index (increasing dashed line), 1970:1 − 2003:5, monthly observations]
Why is it that the Japanese consumer price index follows the same trend as the US CPI?

Motivation: I wrote therefore I existed; I may write and change the world
The real world | Genre
News Reports; Regulatory Body Reports | Informative
Commentaries; Letters to the Editors; Rumour-laden e-mails | Appellative
Semi-structured interviews; Confidence Surveys | Expressive
++ Language and text are constitutive (and not merely representational)
-- ‘society is not reducible to language and linguistic analysis (Hodgson 2000:62).
-- Discourses are broader than language, being constituted not just in texts, but also in definite institutional and organizational practices’ (Jackson 2004).
++ But text is all we have after the event, the interview, the survey, the news, the review – a trace of the sentiment.

The quality of social interaction or the world according to Khurshid Ahmad
Any analysis of the interaction between the members of a well defined social group, where each is engaged in optimising return on his or her economic and social investment, should involve an analysis of the 'sentiments' of the group members

The quality of social interaction or the world according to Khurshid Ahmad
The sentiment is expressed in the news and views that emanate from and on behalf of the members in free natural language writing and speech excerpts. The quantifiable aspects of the exchange of objects abstract (power) and concrete (money, goods, and services) have to be assessed in the context of how the news and views may impact on the exchange.
13 The quality of social interaction or the world according to other folk More importantly the sentiment may be expressed through action: (a) panic buying and selling of financial instruments by the investors and traders, and (b) the sometimes complacent attitude of the regulators, are good examples of economic, social and political action by individuals and groups. Simon, H.A. (1978). “Rational Decision-Making in Business Organizations”. Nobel Lectures, Economics 1969-1980, (Editor) Assar Lindbeck, World Scientific Publishing Co.: Singapore, 1992. http://www.nobel.se/economics/laureates/1978/simon-lecture.html. Kahneman, D. (2002). “Maps of Bounded Rationality: A perspective on Intuitive Judgement and Choice”, Les Prix Nobel 2002. (Editor) Professor Tore Frangsmyr. http://www.nobel.se/economics/laureates/2002/kahnemanlecture.html. Mackenzie, Donald. (2000). ‘Fear in the Markets’. London Review of Books. Vol 22 (No. 8). 14 The quality of social interaction or the world according to other folk Actions motivated by panic can equally well be seen in mass hysteria related to national/ethnic identity that, in turn, can motivate concerns related to security and safety (Jackson 2004). Jackson, Richard (2004). ‘The Social Construction of Internal War’ In (Ed.) Richard Jackson. (Re)Constructing Cultures of Violence and Peace. Rodopi: Amsterdam/New York. 15 e-Science and social interaction? The UK e-Science programme is moving towards successful completion. Major contribution has been made to UK science and technology: Bioinformatics, psychiatry, chemistry and engineering (Discovery Net and myGrid) New ways of doing chemistry (CombeChem) Visualisation of complex systems (RealityGrid); Novel design (GEODISE); Safer aircrafts (DAME) 16 e-Science and social interaction? Crime, conflict, and economy are deeply interrelated and highly interactive. 
However, data and methods in each area are in a mono-disciplinary silo, referred to by some as data tombs, where access to others requires significant mediation. Data required in each case includes quantitative data, textual data, and historical data. 17 e-Science and social interaction? Social sciences and the so-called hard sciences increasingly use complementary methodologies, and a century or more of discussion of methodology, statistical methods and structural models is witness to this. E-Science offers the potential for convergence of scientific methods through provision of a common underlying structure, or "grid", of computational methods, data-base technologies and conceptual models. 18 e-Science and social interaction? Social scientists often want to develop evidence based substantive theory. They want to know “what determines what”, e.g. long term unemployment and social exclusion And social scientists want to explore the consequences of policy changes on individual behaviour, e.g. encouragement to stay on at school on educational attainment, truancy, and social exclusion Social science data sets may be small (<10GB (some exceptions)) but they are complex (Imitation is the sincerest form of flattery – Rob) 19 e-Science and social interaction? Financial Economics Sociology of Crime; Crime Science Social Anthropology Macro-micro Economic Indicators; Census Statistics; Survey of Social Attitudes; Life-style and Well-being Statistics; Market Movement Crime Statistics Ethnicity-related data Political News – Reports, Editorials, Letters to the Editor; Political and Social Opinion Polls; Consumer Confidence Survey; Investor/Trader Confidence Surveys; Regulatory Body Output; Financial News; Citizen Confidence Surveys; Police Forces/Home Office Reports; Crime Reports; Ethnic Minority Surveys; Police Forces/Home Office Reports; Crime Reports; 20 The Surrey Society Grid Demonstrator Was developed under the aegis of the ESRC eSocial Science Programme (FINGRID). 
demonstrated how Grid technologies could support novel research activities in financial economics that involve
•the rapid processing of large volumes of time-varying qualitative and quantitative data (Monte Carlo simulation, wavelet analysis, fuzzy logic and neural network based simulations)
•fusing/visualising of such qualitative and quantitative data (qualitative data – news, e-mails – and quantitative data – non-stationary and heteroskedastic data collated at different frequencies and in different units).

The Society Grid Demonstrator
Globus Toolkit 3.0 (based on Open Grid Services Architecture (OGSA))
Java CogKit (Java Commodity Grid) for resource management and system integration
Languages for Development: Java for the implementation of the application; Reuters SSL Developer’s Kit (Java) for the connection with the Reuters streaming data
Other Technologies: XML (NewsML) for the news information; JMatlink (adapted to Linux environment for the communication with Matlab environment); CGI for communication of Java Applet with the server side

The Society Grid Demonstrator
Live financial data: news, historical time series data and tick data provided by Reuters (Reuters SSL SDK).
Time series analysis: a FORTRAN bootstrap algorithm, and the MATLAB toolkit for Wavelet Analysis (via JMatLink)
News/Sentiment analysis: System Quirk components for terminology extraction, ontology learning and local grammar analysis.
Visualisation and fusion: System Quirk components for corpus visualisation, financial charting, and data fusion.

Design and Performance of the Society Grid
[Figure: time in ms (log scale) against number of CPUs (0 to 80), with series for Preparation Time, GridFTP Upload Time, Processing Time and GridFTP Download Time]

The new (e-) Social Sciences?
Social sciences deal with collectives, or agencements comprising human beings, technical devices, algorithms, workplaces and so on (Callon 1998), such that the number of items of quantitative and qualitative information to a well equipped economic actor, or agencement, ‘is, in effect, infinite, yet the capacity of any agencement to apprehend and to interpret that data is finite’ (Hardie and MacKenzie 2005)
Callon, Michel. (1998). The Laws of the Markets. Oxford: Blackwell.
Hardie, Iain & MacKenzie, Donald. (July 2005). An Economy of Calculation: Agencement and Distributed Cognition in a Hedge Fund (http://www.sps.ed.ac.uk/staff/An%20Economy%20of%20Calculation.pdf)

The new (e-) Social Sciences?
The number of data items available to an agencement in a market place – financial instruments, commodity markets, e-Bay (?) – is potentially infinite but at any given time only a fraction of that data can be processed. The market place is a fickle place and the information derived from historical data can be so quickly outdated that ‘in any agencement for a selective, socially distributed, technologically-mediated ‘economy of calculation’.
“The economies of calculation and the agencements that underpin them stretch beyond individual firms: the sifting of information often takes place in networks of interacting participants. The features of processes involved – for instance, where agency lies, the types of information that are deemed relevant or irrelevant, how that information is processed – are consequential. They affect, for example, the possibility of a ‘global’ market and help shape how ‘markets’ and ‘politics’ interact.” (Hardie & MacKenzie 2005).
Hardie, Iain & MacKenzie, Donald. (July 2005). An Economy of Calculation: Agencement and Distributed Cognition in a Hedge Fund (available from D.MacKenzie@ed.ac.uk)

The new (e-) Social Sciences?
Sentiments and the sociology of financial markets
Mackenzie has focused on how a mathematical-economics theory is used to create a new instrument – especially arbitrage (Mackenzie 2003) and options markets (Mackenzie and Millo 2003, Mackenzie 2004) – and then the theory is used to explain and monitor the workings of the instrument. Mackenzie, Knorr-Cetina and others are studying the rise of electronic markets – where people in distant geographical locations can be ‘interactionally present’.
Mackenzie, Donald. (2003). ‘Long-Term Capital Management and the sociology of arbitrage’. Economy and Society Vol. 32 (No. 3). pp 349-380.

The new (e-) Social Sciences?
Sentiments and the sociology of financial markets
Mackenzie used interviewing techniques to understand the collapse of a large arbitrage firm (Long-Term Capital Management, LTCM), a firm that pioneered trading of financial instruments that sought to profit from price discrepancies; the 24/7 watch on price discrepancies requires a distributed computational infrastructure. Mackenzie (2003) has looked at the change in the value of the instruments and has conducted just under 70 interviews with partners and employees of the failed firm, including a Nobel Laureate who was a partner, and with other experts, and has examined documents that were found to have precipitated or hastened the demise of LTCM. The sentiment about LTCM as expressed in the interviews, and in some of the key documents, formed the basis of an analysis of a set of time series and the computation of key parameters of the time series.
Mackenzie, Donald. (2003). ‘Long-Term Capital Management and the sociology of arbitrage’. Economy and Society Vol. 32 (No. 3). pp 349-380.

The new (e-) Social Sciences?
Sentiments and the sociology of financial markets
Mackenzie found that he was working with a community of people who had organized themselves and knew each other.
There was evidence that imitation by others of the business model and practices adopted by the firm played a major role in the demise of the firm. Most importantly for us, Mackenzie cites the existence of a fax sent by one of the principals of the firm that asked investors to make more investment as problems had started to arise: this fax was posted on the Internet within five minutes of its dispatch and contributed to the demise of the firm. The sentiments expressed by the principal were misconstrued by the recipients and, despite the fairly sound reasons expressed in the fax, albeit in a febrile atmosphere, the bounded rationality of the recipients came into play.
Mackenzie, Donald. (2003). ‘Long-Term Capital Management and the sociology of arbitrage’. Economy and Society Vol. 32 (No. 3). pp 349-380.

The new (e-) Social Sciences?
Sentiments and the sociology of financial markets
Knorr-Cetina and Bruegger (2002) have looked at the emergence of electronic markets and focused on the virtual societies being formed in the financial markets through the infrastructure that supports electronic trading. The trading room operative is in a disembodied world dealing with an onscreen reality that ‘lacks an off-screen counterpart’ – a form of arepresentation (appresentation) of markets. The operative is connected to others through electronic mail, news and data feeds (this is not explicitly dealt with in Knorr-Cetina and Bruegger), and has access to a computing system that can process very complex data in a timely and efficient manner. This virtual world has fast throughput of data and processed information and the rapidity of the interaction perhaps compensates for the disembodied nature of the electronic trading markets.
Knorr-Cetina, Karin & Bruegger, Urs. (2002). ‘Global Microstructures: The Virtual Societies of Financial Markets’. American Journal of Sociology. Volume 107, pp 909-950.

The new (e-) Social Sciences?
There is a constant stream of news and e-mails in a dealing room. Some come directly from news agencies (*) and some are annotated items based on the news.
Hardie, Iain & MacKenzie, Donald. (July 2005). An Economy of Calculation: Agencement and Distributed Cognition in a Hedge Fund (available from D.MacKenzie@ed.ac.uk)

The new (e-) Social Sciences?
But whilst the trader is not ‘reading’ the news off the live news wire streams – Reuters, Bloomberg, BBC, CNN – somebody else is eyeballing the news for the content (Brazilian economics, Chilean politics) and the sentiment (bonds so hot that they were on fire!)
Hardie, Iain & MacKenzie, Donald. (July 2005). An Economy of Calculation: Agencement and Distributed Cognition in a Hedge Fund (available from D.MacKenzie@ed.ac.uk)

The classical Social Sciences: Eyeballing the text!
The key requirement in contemporary social sciences is to complement the analysis of a range of data sets, demographic, economic and political, with data related to the person (Kahneman 2002, Simon 1972), or lived experience (Sacks 1992, Silverman 2004)
Sacks, H. (1992). Lectures on Conversation. Oxford: Blackwell Publishers (Ed. Gail Jefferson).
Silverman, David. (2004). ‘Who cares about experience?’. In (Ed.) David Silverman. Qualitative Research. London: Sage Publications. pp 342-367.

The classical Social Sciences: Eyeballing the text!
Package | Function | Facilities
ATLAS.ti | text analysis and model building | Users attach code and annotate; search/select segments by code; manual hotlinks connecting segments, displays link information diagrammatically. Similar segments can be coded automatically
The General Inquirer | content analysis | Users can establish patterns in the meaning of words supported by large content dictionaries (Lasswell Value Dictionary; Harvard Psycho-Sociological Dictionary)
Nvivo | ‘Entry’ level qualitative text analysis | Users supply text patterns and can analyse text data base through text-pattern matching to search for repetition, variant word forms, recurrent phrases.
QUALRUS | General purpose qualitative analysis package | Offers intelligent suggestions throughout the coding process; analysis of data once it has already been coded
TextSmart (SPSS's module) | coding and analyzing open-ended survey questions | Automated stemming; grouping of synonyms; excludes grammatical words automatically; term clustering; text categorisation based on clustering; dictionary free approach

The classical Social Sciences: Eyeballing the text!
What is missing in the qualitative analysis packages?
•The texts have to be eye-balled – most phrases, clauses, paragraphs have to be coded/annotated by hand, an impossible task when the volume of text all around us is exploding;
•There is a need for a domain specific thesaurus (conceptually-organised terminology or ‘ontology’) for each new domain
 • Identify ontological commitments;
 • Find terms, and the broader/narrower equivalents; synonyms and antonyms;
 • Maintain terminology data bases
•Texts that are conceptually similar within a domain have to be clustered using unsupervised learning algorithms

The new (e-) Social Sciences?
Towards an automatic analysis
What is missing in the qualitative analysis packages?

The new (e-) Social Sciences?
Towards an automatic analysis
One key result of close social interaction is the emergence of a sub-set of the natural language of a given community that is idiosyncratic of the desires, aspirations, goals and prejudices of the community, that is, of the idiosyncratic nature of the ontological commitment of the community.
The subset has its own lexicogrammar and is called the language for special purposes of a given specialism.
Lexicogrammar: Vocabulary (terminology) + Local Grammar

The new (e-) Social Sciences?
Towards an automatic analysis
July 2005 Reuters Financial News Service: news items disambiguated using an automatically extracted terminology and an automatically constructed local grammar that only recognises changes in financial instruments
Measure | Total | Per Hour
Number of News Items | 134,975 | 208
Number of Words | 46,337,111 | 71,508
Raw Sentiment | 774,507 | 1,195
Raw Positive | 520,006 | 802
Raw Negative | 254,501 | 393
Filtered Sentiment | 56,102 | 87
Filtered Positive | 17,340 | 27
Filtered Negative | 38,762 | 60

The new (e-) Social Sciences?
Towards an automatic analysis
Semantic Orientation
Changes in ‘semantic orientation’ for a news input, for July 2005, for all shares in the FTSE.
[Figure: semantic orientation (−500 to 500) against hours (0 to 250)]

The new (e-) Social Sciences?
Towards an automatic analysis
•There is no obvious technique in social science research methods that can improve the researcher's productivity in collecting and analysing large volumes of speech and text.
•Social scientists survey, and occasionally interview, interesting individuals in various social groups – analyse the survey form and quantify.
The real world | Genre
News Reports; Regulatory Body Reports | Informative
Commentaries; Letters to the Editors; Rumour-laden e-mails | Appellative
Semi-structured interviews; Confidence Surveys | Expressive
•So what about the data collected in the field? Data is buried in tombs never to be taken out again.
•Most text, if coded at all, is hand-coded by the social science researcher and then the proxy of the interpretation of the codes is presented as objective analysis.

The new (e-) Social Sciences?
Towards an automatic analysis
•We present a method for systematically identifying sentiment bearing phrases in large volumes of streaming texts – a local grammar comprising templates to extract the phrases with a minimal number of false positives.
The real world | Genre
News Reports; Regulatory Body Reports | Informative
Commentaries; Letters to the Editors; Rumour-laden e-mails | Appellative
Semi-structured interviews; Confidence Surveys | Expressive
•The sentiments are aligned with quantitative (time-varying) information and the results are cointegrated and tested for Granger causality
•The grammar itself is constructed automatically from a corpus of domain specific texts

Conclusions and Future Work
The methods developed in the Society Grids project can be used to investigate a person’s perception of his or her own well being, at different times and in different places, and in its various facets: social, political and economic. This perception can be the same as, or at variance with, for example, crime statistics, economic indicators, or the achievements or failures of (other) ethnic/racial categories.
These can be extended to new areas like
•the reassurance gap in policing
•totalising war discourse that leads to ethnic/racial conflicts

Towards an automatic analysis of sentiments?
We rely on reviews and opinion polls of various kinds: Film & TV reviews; Book reviews; Resort reviews; Bank reviews; Automobile reviews; White good reviews; Consumer surveys; ‘write your own’ reviews; Newspaper editorials; Editors’ choice.

Towards an automatic analysis of sentiments?
We rely on the sentiment of the reviewers, editors, investment experts, and ……
We do know the cost of durables, shares, holidays.
A reasonable price is rejected if the reviews are poor; an exorbitant price is acceptable if the reviews are good; bad reviews stick in the mind for longer than good reviews.

Towards an automatic analysis of sentiments?
We rely on the sentiment of the more vociferous in the society sometimes
The vociferous may call black white, and white black;
The vociferous may repudiate facts and purvey fiction.

Towards an automatic analysis of sentiments?
A new bank has just been launched: Punter Smith has passed his judgement on the bank. Which of the two columns tells us that he likes the new outfit?
Column 1 | Column 2
online service | unethical practices
online experience | low funds
direct deposit | other problems
local branch | old man
low fees | lesser evil
well other | virtual monopoly
small part | probably wondering
printable version | little difference
true service | other bank
other bank | possible moment
inconveniently located | extra day
Turney, Peter D. (2002). “Thumbs Up or Thumbs Down? Semantic Orientation Applied to Unsupervised Classification of Reviews”. In Proc of the 40th Ann. Meeting of the Ass. for Comp. Linguistics (ACL). Philadelphia, July 2002, pp. 417-424. (Available at http://acl.ldc.upenn.edu/P/P02/P02-1053.pdf).

Towards an automatic analysis of sentiments?
How can a machine detect the positive/negative sentiment from texts? We eyeball the collocation of words like excellent & poor in a text corpus, using the same phrases as in the two columns above.
The pointwise mutual information is computed between word1 & word2:
\[ \mathrm{PMI}(word_1, word_2) = \log_2 \frac{p(word_1 \,\&\, word_2)}{p(word_1)\, p(word_2)} \]
Semantic orientation of a phrase is given as:
\[ \mathrm{SemOr}(phrase) = \mathrm{PMI}(\text{``excellent''}, phrase) - \mathrm{PMI}(\text{``poor''}, phrase) \]
Turney, Peter D. (2002).
“Thumbs Up or Thumbs Down? Semantic Orientation Applied to Unsupervised Classification of Reviews”. In Proc of the 40th Ann. Meeting of the Ass. for Comp. Linguistics (ACL). Philadelphia, July 2002, pp. 417-424. (Available at http://acl.ldc.upenn.edu/P/P02/P02-1053.pdf).

Towards an automatic analysis of sentiments?
How can a machine detect the positive/negative sentiment from texts? We eyeball the collocation of words like excellent & poor in a number of texts.
Phrase | Semantic Orientation | Phrase | Semantic Orientation
online service | 2.780 | unethical practices | -8.484
online experience | 2.253 | low funds | -6.843
direct deposit | 1.288 | other problems | -2.748
local branch | 0.421 | old man | -2.566
low fees | 0.333 | lesser evil | -2.288
well other | 0.237 | virtual monopoly | -2.050
small part | 0.053 | probably wondering | -1.830
printable version | -0.705 | little difference | -1.615
true service | -0.732 | other bank | -0.850
other bank | -0.850 | possible moment | -0.668
extra day | -0.286 | inconveniently located | -1.541

Towards an automatic analysis of sentiments?
Robert Engle’s contribution: Volatility may vary considerably over time: large (small) changes in returns are followed by large (small) changes.
Engle, R. F. (1982). Autoregressive conditional heteroscedasticity with estimates of the variance of United Kingdom inflation. Econometrica Vol 50, pp 987-1007.

Towards an automatic analysis of sentiments?
Engle and Ng have developed the concept of the news impact curve. The idea is to condition at time t on the information available at t − 2 and thus consider the effect of the shock \(\varepsilon_{t-1}\) on the conditional variance \(h_t\) in isolation. The conditional variance is affected by the latest information, “the news” \(\varepsilon_{t-1}\):
• The symmetric case: both positive and negative news have the same effect.
\[ h_t = \alpha_0 + \alpha_1 \varepsilon_{t-1}^2 \]
• The asymmetric case: a positive and an equally large negative piece of “news” do not have the same effect on the conditional variance.
\[ h_t = \alpha_0 + \alpha_1 \varepsilon_{t-1}^2 + \beta_1 h_{t-1} \]
Engle, R. F. and Ng, V. K (1993).
Measuring and testing the impact of news on volatility, Journal of Finance Vol. 48, pp 1749—1777. 52 News Analysis and Sentiment Analysis Dan Nelson (1992) ‘recognized that volatility could respond asymmetrically to past forecast errors. In a financial context, negative returns seemed to be more important predictors of volatility than positive returns. Large price declines forecast greater volatility than similarly large price increases. This is an economically interesting effect that has wide ranging implications’ 53 Towards an automatic analysis of sentiments? Symmetric case Asymmetric case Engle, R. F. and Ng, V. K (1993). Measuring and testing the impact of news on volatility, Journal of Finance Vol. 48, pp 1749—1777. 54 Towards an automatic analysis of sentiments? News Effects I: News Announcements Matter, and Quickly; II: Announcement Timing Matters III: Volatility Adjusts to News Gradually IV: Pure Announcement Effects are Present in Volatility V: Announcement Effects are Asymmetric – Responses Vary with the Sign of the News; VI: The effect on traded volume persists longer than on prices. Andersen, T. G., Bollerslev, T., Diebold, F X., & Vega, C. (2002). Micro effects of macro announcements: Real time price discovery in foreign exchange. National Bureau of Economic Research Working Paper 8959, http://www.nber.org/papers/w8959 55 Eyeballing News for Sentiments Qualitative research methods are being used in financial economics, and in sociological studies of financial markets, for systematically studying the hopes and fears of the traders, investors, and regulators in the analysis of the behaviour of the markets. Since 2000, the analysis of news wire has become selective and targeted. 
Some researchers:
• choose news related to economic and financial topics;
• choose news about employment;
• distinguish between scheduled and non-scheduled news announcements.

Eyeballing News for Sentiments
• Some pre-select keywords that indicate change in the value of a financial instrument (including metaphorical terms like above, below, up and down) and use them to ‘represent’ positive/negative news stories.
• Some use the frequency of collocation patterns for assigning a ‘feel-good/bad’ score to the story:
  ‘Good’ news stories appear to comprise collocates like revenues rose, share rose;
  ‘Bad’ news stories contain profit warning, poor expectation;
  ‘Neutral’ stories contain collocates such as announces product, alliance made.
• The ‘sentiment’ of the story is then correlated with that of a financial instrument cited in the stories, and inferences are made.

Automating News Analysis for Extracting Sentiments
We adopt a text-driven and bottom-up method: starting from a collection of texts in a specialist domain, together with a representative general language corpus, we use the following five-step algorithm for identifying discourse patterns with more or less unique meanings, without any overt access to an external knowledge base.

Automating News Analysis for Extracting Sentiments: A method
I. Select training corpora: Reuters Corpus Volume 1 (RCV1) and a general language corpus;
II. Extract key words;
III. Extract key collocates;
IV. Extract local grammar using collocation and relevance feedback;
V. Assert the grammar as a finite state automaton.

Automating News Analysis for Extracting Sentiments: An experiment
I. Select training corpora
Training corpus:
• The British National Corpus, comprising 100 million tokens distributed over 4124 texts (Aston and Burnard 1998);
• Reuters Corpus Volume 1 (RCV1), comprising news texts produced in 1996-1997 and containing 181 million words distributed over 806,791 texts.

Automating News Analysis for Extracting Sentiments: An experiment II.
Extract key words
The frequencies of individual words in the RCV1 were computed using System Quirk. To describe how our method works we will use a randomly selected component of the corpus: the output of February 1997, henceforth referred to as the RCV1-Feb97 corpus. The RCV1-Feb97 corpus contains 14 million words distributed over 63,364 texts.

Automating News Analysis for Extracting Sentiments: An experiment
Most frequent tokens by rank band, with the cumulative number of tokens (%) in each band, for RCV1-Feb97 (N_RCV1Feb97 = 14 Million) and the British National Corpus (N_BNC = 100 Million):

Ranks 1-10
  RCV1-Feb97: the, to, of, in, a, and, said, on, s, for; 0.87 M (21.3%)
  BNC:        the, of, and, a, in, to, for, is, as, that; 22.3 M (22.3%)
Ranks 11-20
  RCV1-Feb97: at, that, was, is, it, by, with, from, percent, be; 0.28 M (6.8%)
  BNC:        was, I, on, with, as, be, he, you, at, by; 6.51 M (6.5%)
Ranks 21-30
  RCV1-Feb97: as, he, million, year, its, will, but, has, would, were; 0.17 M (4.2%)
  BNC:        are, this, have, but, not, from, had, his, they, or; 4.23 M (4.2%)
Ranks 31-40
  RCV1-Feb97: an, not, are, have, which, had, up, n, new, market; 0.13 M (3.3%)
  BNC:        which, an, she, where, here, we, one, there, all, been; 3.05 M (3.1%)
Ranks 41-50
  RCV1-Feb97: this, we, after, one, last, company, u, they, bank, government; 0.10 M (2.6%)
  BNC:        their, if, has, will, so, would, no, what, can, when; 2.35 M (2.4%)

Automating News Analysis for Extracting Sentiments: An experiment
Weirdness of selected tokens: relative frequency in RCV1-Feb97 (N_RCV1Feb97 = 14,244,349), column (a), divided by relative frequency in the BNC (N_BNC = 100,000,000), column (b).

          RCV1-Feb97                          BNC
Token     Rank   f_RCV1Feb97   f/N (a)        Rank   f_BNC    f/N (b)     Weirdness (a/b)
percent   19     65763         0.462%         3394   2928     0.003%      157.84
market    40     36349         0.255%         301    30078    0.030%      8.49
company   46     29058         0.204%         219    40118    0.040%      5.09
bank      49     28041         0.197%         562    17932    0.018%      10.99
shares    56     23352         0.164%         1285   8412     0.008%      19.51

Automating News Analysis for Extracting Sentiments: An experiment III.
Extract key collocates
Collocates of percent (f = 65763):

Collocate   f      Left   Right   Total   z-score
up          5315   4360   955     5315    15.91
rose        4361   3988   373     4361    13.04
rise        2391   980    1411    2391    7.12
down        2291   1636   655     2291    6.82
fell        2074   1844   230     2074    6.17

Automating News Analysis for Extracting Sentiments: An experiment
IV. Extract local grammar using collocation and relevance feedback

f     Collocate   Left   Right   z-score
108   rose        24     0       5.45
18    rose        5      0       2.27
14    billion     0      7       4.24
11    billion     1      7       6.02

Patterns listed on the slide: by 10 percent to; rose 10 percent to; rose 20 percent to; 10 percent to

Automating News Analysis for Extracting Sentiments: An experiment
V. Assert the grammar as a finite state automaton
The (re-)collocation patterns can then be asserted as finite state automata for each of the movement verbs and spatial preposition metaphors.

Automating News Analysis for Extracting Sentiments: Some results
Changes in the total number of positive/negative words, together with those that are used in the local grammars (filtered positive/negative words), and the total number of words.
[Figure: number of words (log scale) against hours from midnight, Nov. 15th, 2004; series: Raw Sentiment, Filtered Sentiment, Total number of Tokens]

Automating News Analysis for Extracting Sentiments: Some results
Changes in the total number of positive/negative words, together with those that are used in the local grammars (filtered positive/negative words), and the total number of words.
[Figure: number of words (log scale) against hours from midnight, Nov. 15th, 2004; series: Raw Positive Words, Raw Negative Words, Filtered Positive Words, Filtered Negative Words, Total Number of Words]

Automating News Analysis for Extracting Sentiments: Bradford Riots?
BBC News tracked from 9/11/1999 to 5/08/2005 for the keywords Bradford Riots, Burnley Riots, and Oldham Riots.

“City”     Number of News Items   Total # of Tokens   Average # of Tokens (±Std. Dev)
Bradford   253                    175191              3368 (±5478)
Burnley    172                    99059               2304 (±3236)
Oldham     261                    151696              3096 (±3041)

Automating News Analysis for Extracting Sentiments: Bradford Riots?
BBC News tracked from 9/11/1999 to 5/08/2005 for the keywords Bradford Riots, Burnley Riots, and Oldham Riots. The results for the period July 2001-July 2002:
[Figure: Percentage Change 2001-2002 against months; series: Bradford, Oldham, Burnley]

Automating News Analysis for Extracting Sentiments: Bradford Riots?
Rate of change?
[Figure: percentage occurrence of riots against months; series: Bradford, Oldham, Burnley]

Automating News Analysis for Extracting Sentiments: Bradford Riots?
The ‘common’ agencements: persons, places, institutions and acts
Shared between all 3 corpora: asian, blair, bradford, blunkett, burnley, bnp, oldham, racial, rioting, riots
Shared between 2 corpora: asians, griffin, racist, disturbances, youths, riot
Unique to a corpus: immigrant~, malik, shahid, manningham

Grids for Automating News Analysis
We followed the Hughes et al.
(2003) word frequency counting approach to evaluate the performance of our implementation. The corpora used in our experiments are the Brown Corpus and the Reuters RCV1 Corpus.

Corpus   Files     Size (Mb)   Words (M)
Brown    500       5.2         1.0
RCV1     806,791   2576.8      169.9

Grids for Automating News Analysis
[Figure: time in seconds against number of CPUs]

Afterthought
Though we have devised programs that can learn unambiguous patterns of use of positive or negative sentiment, a sentence is always used in the context of other sentences, and that context may change an inference made on the basis of one sentence only.
One can argue that a new text is a response to some or all of the existing texts, and in that sense each text is contextualised within a network of other texts. Even if all the existing texts unambiguously expressed a positive sentiment, a new text with strong negative sentiment may invalidate all of the positive sentiment.

Conclusions and Future Work
Data Sources
Disciplines: Financial Economics; Social Anthropology; Sociology of Crime; Crime Science
Quantitative: Macro-micro Economic Indicators; Census Statistics; Survey of Social Attitudes; Life-style and Well-being Statistics; Market Movement
Qualitative: Crime Statistics; Ethnicity-related data; Political News (Reports, Editorials, Letters to the Editor); Political and Social Opinion Polls; Consumer Confidence Survey; Investor/Trader Confidence Surveys; Regulatory Body Output; Financial News; Citizen Confidence Surveys; Police Forces/Home Office Reports; Crime Reports; Ethnic Minority
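
The key-word step (II) of the method ranks tokens by their weirdness: the relative frequency of a token in the specialist corpus divided by its relative frequency in the general language corpus. A minimal sketch in Python follows. The function names `weirdness` and `rank_by_weirdness` and the dictionary layout are assumptions made for this illustration and are not taken from System Quirk; the frequencies and corpus sizes are the ones reported for RCV1-Feb97 and the BNC in the weirdness table.

```python
# Sketch only: function names and data layout are illustrative assumptions,
# not the implementation used in the experiments.

N_RCV1_FEB97 = 14_244_349  # tokens in the RCV1-Feb97 corpus (as reported)
N_BNC = 100_000_000        # tokens in the British National Corpus (as reported)

# token -> (frequency in RCV1-Feb97, frequency in BNC), as reported
FREQUENCIES = {
    "percent": (65763, 2928),
    "market": (36349, 30078),
    "company": (29058, 40118),
    "bank": (28041, 17932),
    "shares": (23352, 8412),
}


def weirdness(f_special, n_special, f_general, n_general):
    """Ratio of relative frequencies: (f_special / n_special) / (f_general / n_general)."""
    if f_general == 0:
        raise ValueError("token absent from the general corpus; ratio undefined")
    return (f_special / n_special) / (f_general / n_general)


def rank_by_weirdness(frequencies, n_special, n_general):
    """Return (token, weirdness) pairs, highest weirdness first."""
    scored = [
        (token, weirdness(f_s, n_special, f_g, n_general))
        for token, (f_s, f_g) in frequencies.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for token, score in rank_by_weirdness(FREQUENCIES, N_RCV1_FEB97, N_BNC):
        print(f"{token:10s} {score:8.2f}")
```

Run on these five tokens, the sketch orders them percent, shares, bank, market, company, with values close to those in the weirdness column. A token that never occurs in the general corpus has no defined ratio, so the sketch raises an error rather than guessing a smoothing scheme.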