Society Grids

Khurshid Ahmad, Lee Gillam and David Cheng
Dept. of Computing, University of Surrey

Abstract

A grid implementation is described that can deal with large volumes of streaming free natural language text in conjunction with large sets of time series data. Processing speed-ups on a cluster of 24 machines (81 CPUs) for dealing with texts in excess of 100 million words are reported. The application area is econometrics, specifically the behaviour of financial markets, and the methodology reported can extend the scope of the Surrey Society Grid to strategically important areas of crime science and social anthropology. The data and compute requirements identified in the three areas compare well with the traditional concerns of grid computing. Our studies indicate problems of scalability, especially when dealing with multi-modal data: texts and numbers.

1. Introduction

A large number of demonstrator projects in e-Science focus on 'real-time' data, paradoxically stored in large data archives, referred to by some as data tombs (Fayyad and Uthurusamy 2002). Terabytes of data emanate from quark-hunting expeditions and from the engines of jet aircraft crossing time zones. These data are, reportedly, efficiently stored, rapidly retrieved, intelligently processed and visualized. Neither the particle physicists nor the structural-safety engineers will receive any more data, unless new data are sought, and relatively well-established models, motivated within a crisply defined framework, will be used to find the elusive quark or the potentially damaging vibrations.

Much real-world data, by contrast, comes from multiple sources, carries contamination that cannot readily be eliminated, and is never 'complete': the data has both time- and frequency-domain components and involves both global and local influences that are difficult to quantify. Dynamic, real-world data influences, and is influenced by, other seemingly independent data sets. Models used to analyze such data are based on overlapping theoretical frameworks: there are seldom, if ever, unified theories or the opportunity to make universal approximations. In practical terms, the means used to convey information to and from real-world systems span a range of modes: natural language, images and (sets of) numbers are prominent.

A computational grid can facilitate capturing and processing data, and visualizing the processed data, in one modality, and synthesizing results across modalities. The construction of such a multi-modal grid, one that processes real-world data, will help in understanding the problems of building such systems and how to scale up the results of existing mono-modal grid systems. Multi-modal data invariably brings with it the question of how to fuse the results of computations over data that was articulated in different modes. The mode (numbers or language, for instance) imposes methodological constraints: typically, quantitative methods are used directly on numerical data for summarizing a time series, while qualitative methods are used to process texts such that one ends up with a set of statistics summarising the contents of one or more text documents. In this paper we describe the construction and testing of a multi-modal grid that can process numerical and textual data. The data sets are composed from continuous live-streaming data, captured from dedicated datafeeds and processed using up to 24 machines.
We motivate the discussion about a multi-modal grid by outlining one application area: the econometric analysis of markets, which includes the analysis of the value of transactions as well as the effect of news on such transactions. The numeric data used is a time-ordered record of financial transactions; the text data is the financial news that relates to such transactions and to economic and political events of world interest that impact on financial decisions. We describe a method for converting streaming news, information articulated in language, into time-ordered signals. This is followed by a description of the grid and its performance. The discussion includes a typology of data and methods that can be adapted to a number of other societal issues, including the perception of crime and racial violence. Note that we have reported the results of time series analysis and Monte Carlo simulations elsewhere (Ahmad, Gillam and Cheng 2005); here we focus exclusively on the problem of qualitative data (news texts) and its analysis.

2. Motivation

Consider econometric data, movements in the values of financial instruments, sometimes integrated with the economic/financial value of enterprises, that are collected in real time and rendered as time series. The data are artifacts of how humans interact: rises and falls in value represent some form of consensus; human investment activities produce data, analysis of which results in further investments producing even more data, but by the time the analysis of the data is complete, the data set may already have changed. Time series analysis involving Monte Carlo simulations and stochastic models, with which the e-Science community is familiar, is used for predicting the value of financial instruments at a certain time, based on such data. But the analysis is found to be flawed under different economic conditions, for example during boom or bust periods. Scholars have discovered that investors and traders can suspend rational reasoning, and that their sentiments interfere with their decision-making. This may involve the use of covert information, greed, false expectations, herd instincts and so on. The Yale International Center for Finance publishes a monthly survey of both investor and trader sentiment related to the performance of the US economy; the variance in the expectations of the two groups is very clear (Shiller 2003).

A study of individual or group (human) behaviour requires a conjunctive analysis of rational and other behaviour. Such work necessitates consideration of the notion of bounded rationality. Human decision making, it appears, is bounded: neither always rule-governed nor always informed by prior empirical knowledge (Simon 1992, Kahneman 2002). Econometricians have argued that the hopes, fears, aspirations and disappointments of the investors and traders in a financial market manifest themselves as changes in the price of one or more financial instruments. The effect of exaggerated hopes and aspirations on the part of naïve, greedy or confused investors/traders may lead to a short-term erratic upwards movement in prices, but proponents of the efficient market hypothesis (EMH) have argued that other, more rational, traders and investors work to counter the bounded rationality of the greedy; similarly, the fears and disappointments of some may lead to an erratic downwards movement of prices, and again the rational agents intervene to check behaviour that is not based on rational analysis.
It has been noticed, however, that whilst prices may consolidate in the short term, the order flow remains erratic for much longer and, in turn, presents a challenge for the EMH. A considerable body of literature exists in which econometricians have analysed the 'impact of news' on prices, and indirectly on order flows. The tradition here is to use the methods and techniques of time-series analysis, more precisely generalized autoregressive conditional heteroskedasticity (GARCH) models, to isolate the unexpected changes (shocks) that may occur in a price over a period of time. Processing an aggregate of such records (over a 1-, 15- or 30-minute, daily, weekly, monthly or yearly interval), especially for finding autocorrelation within the time series of one instrument or the cross-correlation amongst many (see Percival and Walden 2000), requires distributed systems with fairly short response times. The challenge here is intellectual and has concomitant strategic and commercial import.

Engle, co-winner of the 2003 Nobel Prize in Economics, who developed and used GARCH, has argued that the impact of news may be positive or negative, but that the effect of negative news lasts much longer than that of positive news (Engle and Ng 1993). This asymmetry has been observed empirically, and Engle has shown how to model it. Computing the 'news impact' is a computationally intensive task requiring, on the one hand, fast and efficient calculations and, on the other, substantial data storage and handling capabilities: a large number of key financial instruments are each bought and sold many times within one second, and in some cases in a variety of financial markets.
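As a point of reference for readers less familiar with GARCH, the following gives the standard GARCH(1,1) conditional variance equation together with a common asymmetric variant of the kind analysed by Engle and Ng (1993); this is textbook notation rather than a formulation taken from our implementation.

```latex
% GARCH(1,1): today's conditional variance depends on yesterday's
% shock (news), \varepsilon_{t-1}, and yesterday's variance.
\sigma_t^2 = \omega + \alpha\,\varepsilon_{t-1}^2 + \beta\,\sigma_{t-1}^2

% Asymmetric (GJR-type) variant: the indicator I_{t-1} equals 1 when
% \varepsilon_{t-1} < 0 (negative news) and 0 otherwise, so a positive
% \gamma makes bad news raise volatility by more than good news does,
% which is the asymmetry discussed in the text.
\sigma_t^2 = \omega + \alpha\,\varepsilon_{t-1}^2
           + \gamma\, I_{t-1}\,\varepsilon_{t-1}^2 + \beta\,\sigma_{t-1}^2
```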
The news impact analysis, however, does not explicitly use linguistic data. The econometricians, by and large, use information proxies. For instance, Engle's past and current work uses the timings of various announcements from financial and monetary authorities as a placeholder, or proxy, for the details of the news itself. Andersen et al. (2002) have noted that news announcements matter, and quickly; that the timing of an announcement matters; that the rate of change adjusts to news gradually; and that the effect on traded volume persists longer than that on prices. Other authors have used sentiment proxies: the changes in the values of economic variables, for example the turnover of a stock exchange or the number of initial public offerings, are subjected to factor analysis, and a novel sentiment index is created (Baker and Wurgler 2004). Some researchers pre-select keywords that indicate change in the value of a financial instrument, including metaphorical terms like above, below, up and down, and use them to 'represent' positive/negative news stories. Others use the frequency of collocational patterns for assigning a 'feel-good/bad' score to a story (see, for example, DeGennaro and Shrieves 1997, and Koppel and Shtrimberg 2004). Table 1 shows such sentiment proxies: frequent metaphorical or literal keywords that can be used as placeholders for investor/trader sentiment.

Sentiment in news stories    Lexical content
'Good' news stories          appear to comprise collocates like revenues rose, shares rose
'Bad' news stories           may contain profit warning, poor expectation
'Neutral' stories            usually contain collocates such as announces product, alliance made

Table 1: Lexical content of a news story and the implied sentiment.

The 'sentiment' of the story is then correlated with that of a financial instrument cited in the stories, and inferences are made.

Sentiments are difficult to detect, but there may be evidence of sentiment detectable in financial news (comprising facts as well as rumours), company reports (containing a rosier view of the world) and speeches of key players (bringing glad tidings or otherwise). A body of literature is emerging that includes descriptions of algorithms and programs that can detect sentiment in texts, ranging from film and holiday resort reviews to restaurant reviews, based on the mutual information metric. This metric compares the joint probability of one arbitrary word (w1) occurring with another arbitrary word (w2) against the probabilities of the two words occurring independently:

MI(w1, w2) = log2 [ p(w1 & w2) / (p(w1) * p(w2)) ]

where p(w) is the probability of the word occurring and p(w1 & w2) is that of the two words occurring together. Turney (2002) has used the metric to detect the semantic orientation (SemOr) of individual phrases used by a reviewer, in conjunction with the sentiment words 'excellent' and 'poor':

SemOr(phrase) = MI('excellent', phrase) - MI('poor', phrase)

The sum of SemOr over all pre-selected phrases is computed; if the sum is negative then the given review is deemed negative, otherwise it is deemed positive.
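A minimal sketch of this SemOr computation, assuming word and co-occurrence counts have already been gathered from a review corpus; the count dictionaries, the example phrase and the 0.01 smoothing constant are illustrative choices, not details taken from Turney's implementation.

```python
import math

def pmi(joint, f1, f2, n, smooth=0.01):
    """Pointwise mutual information, log2 of p(w1 & w2) / (p(w1) p(w2)),
    computed from raw counts; `smooth` keeps unseen pairs finite."""
    return math.log2(((joint + smooth) / n) / ((f1 / n) * (f2 / n)))

def sem_or(phrase, freq, cooc, n):
    """Turney-style semantic orientation: association with 'excellent'
    minus association with 'poor'."""
    return (pmi(cooc.get((phrase, "excellent"), 0), freq[phrase], freq["excellent"], n)
            - pmi(cooc.get((phrase, "poor"), 0), freq[phrase], freq["poor"], n))

# Illustrative counts from a hypothetical 1-million-token review corpus.
freq = {"excellent": 500, "poor": 400, "low fees": 120}
cooc = {("low fees", "excellent"): 30, ("low fees", "poor"): 2}
print(sem_or("low fees", freq, cooc, 1_000_000))  # > 0: positive orientation
```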
3. Method

In the above-mentioned methods of sentiment analysis, either the sentiment 'variables' and metrics use information proxies, or they rely on pre-selected keywords and phrases. These implicit and explicit methods are designed to avoid the ambiguity that is inherent in natural-language-based communication. However, it is important to explore whether or not these sentiments can be extracted with a minimum of ambiguity, where the premium is on avoiding false positives. We have discussed elsewhere the amenability of special language texts to automatic analysis: the authors of special language texts are trained to avoid ambiguity. This is not to say that specialist writers or their readers always succeed, but the chances of a writer confusing a reader are lower than for writers of non-specialist texts. It has been shown that a pre-selected collection of texts has a lexical profile: a set of single words that are characteristic of a specialism. The profile dominates most compound words and indeed a large number of meaning-bearing phrases. The texts not only have a restricted, albeit profusely used, vocabulary, but appear also to have syntactic restrictions that result in largely unambiguous phrases (see Ahmad, Gillam and Cheng 2005 and references therein).

We adopt a text-driven, bottom-up method, starting from a collection of texts in a specialist domain together with a representative general language corpus. We describe a five-step algorithm for identifying discourse patterns with more or less unique meanings, without any overt access to an external knowledge base:

I. Select training corpora. General language: the British National Corpus (BNC), which contains 100 million tokens distributed over 4,124 texts (Aston and Burnard 1998). Special language: Reuters Corpus Volume 1 (RCV1), comprising news texts produced in 1996-1997, which contains 181 million tokens distributed over 806,791 texts. (For describing how our method works we use a randomly selected component of the corpus, the output of February 1997 comprising 14,244,349 tokens, henceforth referred to as the RCV1-Feb97 corpus.)

II. Extract keywords. The frequencies of individual words in RCV1-Feb97 were computed using System Quirk and compared with the frequencies of the same words in the BNC. A word that is used disproportionately more frequently in RCV1-Feb97 than in the BNC, according to a statistical criterion referred to as weirdness, is regarded as a candidate keyword (Ahmad 1995): the weirdness of a word is the ratio of its relative frequency in the special language corpus to its relative frequency in the general language corpus. The grammatical words (the, a, an, and, but, ...), usually described as a stop list, have a very similar distribution in the two corpora, but subject-specific words have a rather different distribution (see Table 2):

Word      fR       fR/NR    fG       fG/NG    Weirdness
          (a)      (b)      (c)      (d)      (b)/(d)
percent   65,763   0.46%    2,928    0.00%    157.84
market    36,349   0.26%    30,078   0.03%    8.49
company   29,058   0.20%    40,118   0.04%    5.09
bank      28,041   0.20%    17,932   0.02%    10.99
shares    23,352   0.16%    8,412    0.01%    19.51

Table 2: Occurrences of the most frequent words in RCV1-Feb97 (fR) compared with their frequencies in the BNC (fG). NR = 14.24 million; NG = 100 million.
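A minimal sketch of the weirdness computation under these definitions; the crude tokenizer and the +1 smoothing in the denominator are our illustrative choices rather than System Quirk's actual implementation.

```python
import re
from collections import Counter

def tokenize(text):
    """Crude word tokenizer, for illustration only."""
    return re.findall(r"[a-z]+", text.lower())

def weirdness(special_tokens, general_tokens):
    """Weirdness of each word in the special corpus: its relative
    frequency there divided by its relative frequency in the general
    corpus (Ahmad 1995). The +1 avoids division by zero for words
    absent from the general corpus."""
    f_s, f_g = Counter(special_tokens), Counter(general_tokens)
    n_s, n_g = len(special_tokens), len(general_tokens)
    return {w: (f_s[w] / n_s) / ((f_g[w] + 1) / n_g) for w in f_s}

# Check against Table 2: 'percent' occurs 65,763 times in 14.24M tokens
# (RCV1-Feb97) and 2,928 times in 100M tokens (BNC):
print((65_763 / 14_244_349) / (2_928 / 100_000_000))  # ~158, as in Table 2
```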
III. Extract key collocates. Collocation patterns, combinations of words that frequently occur together, are considered indicators of the meaning and intent of the author. Techniques have been developed recently that attempt to compute the statistically significant patterns. In our method, the focus is on the collocates of the most frequently used single words; selection based on frequency can then be readily programmed, and System Quirk has modules to do just that (Ahmad, Gillam and Cheng 2005). The key collocates of the most frequent word in RCV1-Feb97, percent, are up, rose, rise, down and fell. This results in patterns such as rose X percent, X percent rise and up [by] X percent. Our method automatically selects these collocates and from them computes the collocates of the collocates. The collocation patterns suggest that the metaphorical words rose, fell, up and down, usually used to refer to the movement of objects in physical space, have been transferred (the origin of the word metaphor) to changes in the value of the rather abstract financial instruments.

IV. Extract a local grammar using collocation and relevance feedback. The frequent collocates have an unambiguous interpretation, and the avoidance of ambiguity is the cornerstone of modern information retrieval. The frequent collocates of collocates have an even less ambiguous interpretation. This has helped us write programs that recognize these patterns automatically, which is important for dealing with the deluge of texts: c. 100,000 words per hour. Ambiguities typically occur because a pivotal verb (or noun) in a sentence can be replaced by other verbs (or nouns). The specialist nature of financial news restricts the use of such verbs (or nouns) to a very small subset of the language's vocabulary and thereby minimizes ambiguity. This approach is contrary to the current paradigm of natural language processing, which is grounded in universal grammar, where many words can be used interchangeably. The approach used in Society Grids is called local grammar. Figure 1 shows a local grammar pattern, for down, that is amongst the most frequent in our training corpus. The patterns were then tested on the bulk of the RCV1 corpus, and the precision of the local grammar patterns was considerably improved over the precision of single-word retrievals. The local grammar is used to unambiguously identify sentences that contain sentiment-bearing phrases and to automatically annotate those phrases.

Figure 1: A finite state automaton for recognising negative sentiment sentences containing 'down'.

Figure 2: The results of 'filtering' raw news for a 48-hour period (top line), differentiating words that carry sentiment information within (bottom line) and without (middle line) the local grammar patterns.

Figure 3: A differentiated view of the positive and negative sentiment within the local grammar patterns for the data described above.

Figure 2 shows the filtering power of the local grammar patterns: the patterns identify between 1,000 and 10,000 sentiment words in a corpus of between 10,000 and 100,000 tokens arriving per hour, to find between 10 and 100 'true' sentiment-bearing sentences. The system differentiates between 'negative' and 'positive' sentiments (Figure 3). Here some user intuition is used to decide whether a word or phrase has a negative connotation as opposed to a positive one. The positive and negative sentiment time series can then be correlated with the time series of financial data.

4. The Society Grid demonstrator

The first prototype of our Society Grid demonstrator was developed under the aegis of the ESRC e-Social Science Programme (the FINGRID project). We demonstrated how Grid technologies could support novel research activities in financial economics that involve the rapid processing and combination of large volumes of time-varying qualitative and quantitative data. We used Globus (GT3) with the Java CoG Kit to integrate:

• Live financial data: news, historical time series data and tick data provided by Reuters (Reuters SSL SDK);
• Time series analysis: a FORTRAN bootstrap algorithm, and the MATLAB toolkit for wavelet analysis (via JMatLink);
• News/sentiment analysis: System Quirk components for terminology extraction, ontology learning and local grammar analysis;
• Visualisation and fusion: System Quirk components for corpus visualisation, financial charting and data fusion.

The Society Grid demonstrator enables the extraction of patterns of language from large collections of text that indicate changes in events or in the values of objects, and the correlation of these with movements in financial markets. These patterns are extracted semi-automatically using methods of corpus linguistics, pioneered and tested in the System Quirk framework, to discover keywords, a select group of verbs, and orthographic markers. We discovered the local grammar that governs the ordering of these keywords, verbs and markers in sentiment-bearing sentences. We have shown how econometricians and empirical and financial economists could use Grid technologies to facilitate research and collaboration.

4.1 Design and performance of the Society Grid

The Society Grid comprises 24 machines [1] in addition to a financial datafeed provided by Reuters Financial Services (c. 25 MB, or 6,000 news items, on average per day; one year is around 2 GB of text). We have developed programs using the Reuters SSL Developer Kit (Java) to capture the news, historical time series data and tick data. Reuters supply news with category, authorship and date information. We have followed the word frequency counting approach of Hughes and Bird (2003) to evaluate the performance of our implementation. The corpora used in our experiments are the Brown Corpus and the Reuters RCV1 Corpus: see Table 3 for details. The computational power of our grid implementation has been reported in Ahmad et al. (2004), with an 8-node configuration.

            Brown    RCV1
Files       500      806,791
Size (MB)   5.2      2,576.8
Words (M)   1.0      169.9

Table 3: Size of the corpora.

[1] We have 19 Dell PowerEdge 2650s with 1 GB memory and dual processors, and 5 Dell Optiplex GX150s with 256 MB memory and a single processor; 81 CPUs are available across these 24 machines.

Theoretically, the performance gain in a Grid environment is proportional to the number of machines used; in practice, the overall execution time of a parallel task is determined by the process that finishes last. To investigate this factor, we explored the use of 8-CPU, 16-CPU, 32-CPU, 48-CPU and 64-CPU configurations.
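Before turning to the measurements, the logic of the distributed word frequency count can be sketched as a scatter/merge over corpus files. The sketch below stands in for the Globus/GridFTP machinery with local worker processes; the partitioning of whole files across workers is our illustrative choice, not necessarily the demonstrator's exact scheme.

```python
import re
from collections import Counter
from concurrent.futures import ProcessPoolExecutor

def count_words(path):
    """'Map' step: word frequencies for one corpus file."""
    with open(path, encoding="utf-8", errors="ignore") as f:
        return Counter(re.findall(r"[a-z]+", f.read().lower()))

def grid_word_frequencies(paths, n_cpus=8):
    """Scatter files across workers, then merge ('reduce') the partial
    counts. On the Society Grid the scatter corresponds to a GridFTP
    upload and the merge to a GridFTP download of per-machine results."""
    total = Counter()
    with ProcessPoolExecutor(max_workers=n_cpus) as pool:
        for partial in pool.map(count_words, paths):
            total.update(partial)
    return total

# Hypothetical usage:
# freqs = grid_word_frequencies(glob.glob("rcv1/*.xml"), n_cpus=8)
```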
Each experiment was repeated 10 times, and the average was recorded. Figure 4 shows the time taken, in seconds, to complete the word frequency counting on the Reuters RCV1 corpus.

Figure 4: Time taken (in seconds) to perform word frequency counting on the Reuters RCV1 corpus with different numbers of CPUs.

We observed a performance gain of 47% in using a 16-CPU grid rather than an 8-CPU grid; a gain of 33% in moving from a 16-CPU grid to a 32-CPU grid; a gain of 16% in moving from a 32-CPU grid to a 48-CPU grid; and a mere gain of 7% in moving from a 48-CPU grid to a 64-CPU grid.

To investigate this degradation of performance, we decomposed the execution time of the word frequency counting process into four parts: preparation time (the time required to allocate the task), GridFTP upload time (the time required to upload the necessary files to each machine), processing time (the time required to perform the word frequency counting) and GridFTP download time (the time required to download the results). Figure 5 shows the decomposition of the time taken to complete the word frequency counting on the Reuters RCV1 corpus.

Figure 5: Decomposition of the time taken (in ms, log scale) to perform word frequency counting on the Reuters RCV1 corpus with different numbers of CPUs: preparation time, GridFTP upload time, processing time and GridFTP download time.

The bulk of the computation is in the word frequency counting itself, where the actual processing occurs. Considering this component alone, the performance gains were 49%, 39%, 22% and 10% respectively across the same moves from an 8-CPU to a 64-CPU grid.
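These diminishing returns are consistent with a simple fixed-overhead reading of the decomposition above; the model below is our illustration, not a fit reported from the experiments.

```latex
% If preparation and GridFTP transfers contribute a roughly fixed
% overhead t_0 while the counting work W is shared by n CPUs, then
T(n) = t_0 + \frac{W}{n}
% and the relative gain from doubling the number of CPUs is
\frac{T(n) - T(2n)}{T(n)} = \frac{W / 2n}{\,t_0 + W / n\,}
% which falls towards zero as t_0 comes to dominate, matching the
% observed decline from a 47% gain (8 to 16 CPUs) to 7% (48 to 64).
```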
5. Society Grids: what next?

The methods and techniques developed in our prototype can be used to investigate how a person's perception of his or her own well-being, at different times and in different places, and in various facets (social, political and economic), can be the same as, or at variance with, say, crime statistics, economic indicators, or the achievements or failures of (other) ethnic/racial categories. Evidence of such bounded rationality includes: (i) the reassurance gap, the difference between crime rates and the public perception of crime (Fielding 1995; Fielding, Innes and Fielding 2002); and (ii) internal war (Kaldor 1999), where emotional/affective responses are needed to compensate for the limitations of the rational action model. The reassurance gap and internal war are both mediated by discourse patterns that fuel bounded rationality: racist web sites, minority community newspapers and inflammatory speeches, all publicly accessible and all laden with sentiments.

The data in the three different disciplines have significant overlaps and a number of differences as well (Table 4). All three fields have discrete and continuous data, quantitative and qualitative data. The text types range from informative texts, primarily designed to convey information from some knowledgeable person to a less knowledgeable person, to expressive texts, where a knowledgeable person is seeking or transferring knowledge to those with equal knowledge. A new text type we have recently identified is that of appellative texts, where somebody is competing to transfer their knowledge. Each of the text types contains sentiment-bearing sentences, and once extracted this sentiment-related information can be used in conjunction with the quantitative data.

Data Mode      Type           Financial Economics         Crime Science                Social Anthropology
Numerical (D)  Quantitative   Macro-micro economic        Census statistics            Survey of Social Attitudes;
                              indicators                                               life-style/well-being statistics
Numerical (C)  Quantitative   MARKET MOVEMENT             CRIME STATISTICS             ETHNICITY DATA
Qualitative    Informative    General news reports and editorials:
                              FINANCIAL NEWS;             POLITICAL NEWS;              ETHNO-CULTURAL NEWS
                              Financial/Monetary          Police Forces/Home Office
                              Regulators' Reports         Reports; Crime Reports
Qualitative    Appellative;   Letters to the editor; rumour-laden e-mails; commentaries;
               Expressive     semi-structured interviews (traders, citizens):
                              INVESTOR SURVEYS            CITIZEN SURVEYS

Table 4: Data and mode typology for e-Social Science; 'D' indicates discrete data and 'C' continuous data.

We are currently seeking support to extend the methods, tools and techniques of e-Science, and of Society Grids, to fuse quantitative and qualitative data within a single discipline (econometrics, the sociology of crime, and social anthropology) and to fuse the analyses across the disciplines. The experts are looking at different facets of the same reality, and we aim to integrate their analyses. We believe that the methods, techniques and prototypes we have developed for analysing large data sets, both quantitative and qualitative, with reference to market sentiment and to sentiment analysis at large, can contribute to the understanding of crime, conflict and the economy. Grid technologies can benefit traditional social science analysis that begins with an attempt to find correlations between the onset of a crisis (for example racial violence, insecurity amongst citizens or a stock market crash) and variables related to system attributes (local, national or market systems), social divisions, economic-activity data, the types of systems involved and the external context. Grids for social scientists will have matured when social scientists can quickly and easily explore such phenomena through combinations of methods of textual, historical, theoretical and numeric analysis: when social scientists can focus on the science, not on the technology required to undertake the science.

Acknowledgements

The work described was part-funded by the ESRC (FINGRID: RES-149-25-0028), the EPSRC (SOCIS: GR/M89041/01; REVEAL: GR/S98450/01) and the EU (LIRICS: eContent EDC-22236). In particular we would like to thank our colleagues at Surrey, Prof Nigel Fielding (Sociology), Prof John Eade (Anthropology) and Dr M Rogers (Linguistics), and we are grateful to Prof John Nankervis (Essex) and Prof Yorick Wilks (Sheffield) for discussions on econometrics and linguistics.

References

Ahmad, K. (1995). "Pragmatics of Specialist Terms and Terminology Management". In Petra Steffens (ed.), Machine Translation and the Lexicon (LNAI, Vol. 898). Heidelberg: Springer, pp. 51-76.

Ahmad, K., Gillam, L., and Cheng, D. (2005). "Textual and Quantitative Analysis: Towards a New, e-mediated Social Science". Proc. of the 1st International Conference on e-Social Science, Manchester, June 2005.

Ahmad, K., Taskaya-Temizel, T., Cheng, D., Gillam, L., Ahmad, S., Traboulsi, H., and Nankervis, J. (2004). "Financial Information Grid: an ESRC e-Social Science Pilot". Proceedings of the Third UK e-Science Programme All Hands Meeting (AHM 2004), Nottingham, United Kingdom. EPSRC, Sept 2004 (ISBN 1-904425-21-6).
Andersen, T. G., Bollerslev, T., Diebold, F. X., and Vega, C. (2002). "Micro Effects of Macro Announcements: Real-Time Price Discovery in Foreign Exchange". National Bureau of Economic Research Working Paper 8959. http://www.nber.org/papers/w8959

Aston, G., and Burnard, L. (1998). The BNC Handbook. Edinburgh: Edinburgh University Press.

Baker, M., and Wurgler, J. (2004). "Investor Sentiment and the Cross-Section of Stock Returns". NBER Working Paper 10449. Cambridge, Mass.: National Bureau of Economic Research.

DeGennaro, R., and Shrieves, R. (1997). "Public Information Releases, Private Information Arrival and Volatility in the Foreign Exchange Market". Journal of Empirical Finance, Vol. 4, pp. 295-315.

Engle, R. F., and Ng, V. K. (1993). "Measuring and Testing the Impact of News on Volatility". Journal of Finance, Vol. 48, pp. 1749-1777.

Fayyad, U., and Uthurusamy, R. (2002). "Evolving Data Mining into Solutions for Insights". Communications of the ACM, 45(8), pp. 28-31.

Fielding, N. (1995). Community Policing. Oxford: Oxford University Press.

Fielding, N., Innes, M., and Fielding, J. (2002). "Reassurance Policing and the Visual Environmental Crime Audit in Surrey Police: a Report". Guildford: University of Surrey, Department of Sociology.

Hughes, B., and Bird, S. (2003). "Grid-Enabling Natural Language Engineering by Stealth". In Proc. of HLT-NAACL 2003 (Workshop on SEALTS), pp. 31-38. Association for Computational Linguistics.

Kahneman, D. (2002). "Maps of Bounded Rationality: A Perspective on Intuitive Judgment and Choice" (Nobel Prize Lecture, December 8, 2002).

Kaldor, M. (1999). New and Old Wars: Organised Violence in a Global Era. Cambridge: Polity Press.

Koppel, M., and Shtrimberg, I. (2004). "Good News or Bad News? Let the Market Decide". In AAAI Spring Symposium on Exploring Attitude and Affect in Text. Palo Alto: AAAI Press, pp. 86-88.

Percival, D. B., and Walden, A. T. (2000). Wavelet Methods for Time Series Analysis. Cambridge: Cambridge University Press.

Shiller, R. J. (2003). The New Financial Order: Risk in the 21st Century. Princeton: Princeton University Press.

Simon, H. (1992). "Rational Decision Making in Business Organisations" (Nobel Memorial Lecture, 8 December 1978). In Assar Lindbeck (ed.), Nobel Lectures in Economic Sciences 1969-1980. Singapore: World Scientific Publishing Company. Available at http://nobelprize.org/economics/laureates/1978/simon-lecture.pdf

Turney, P. D. (2002). "Thumbs Up or Thumbs Down? Semantic Orientation Applied to Unsupervised Classification of Reviews". Proc. 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, July 2002, pp. 417-424. Available at http://acl.ldc.upenn.edu/P/P02/P02-1053.pdf