Classifying Stocks using P-Trees and Investor Sentiment

Arijit Chatterjee
Department of Computer Science
North Dakota State University
Fargo, ND 58102, USA
arijit.chatterjee@ndsu.edu

Dr. William Perrizo
Department of Computer Science
North Dakota State University
Fargo, ND 58102, USA
william.perrizo@ndsu.edu

Abstract— For people who are not heavily exposed to the financial markets, it is important to know which ticker symbols to follow so that investment decisions can be based on them. In this work, we propose how to classify stock ticker symbols from tweets vertically using P-Trees. The solution described in this paper vertically analyzes tweets from 3000 financial and news accounts on the Twitter platform and finds the ticker symbols that are most frequently discussed. It also provides the ability to scan through the tweet texts associated with the common ticker occurrences so that context can be identified, helping users make better-informed business decisions. The paper also discusses investor bias and the effect it has on the volatility of stocks in the market.

I. INTRODUCTION

Twitter is one of the most important social media platforms in today's world, providing the unique ability for a user to connect with almost anyone else in the world. The platform supports 33 languages, has close to 288 million active users, and close to 500 million tweets are created every day. Investors increasingly use the Twitter platform to voice their opinions on particular ticker symbols and to share market-focused posts and updates. But with so much information on the platform, it is difficult to find exactly the information a particular user needs in order to make an investment decision. In the realm of stocks, it is important to understand which ticker symbols to follow so that investment decisions can be based on them. The solution described in this paper provides users with the ability to identify those ticker symbols effectively without reading through huge corpuses of tweet texts. Once the ticker symbols with the most frequent occurrences in the tweets have been classified, the paper also gives an insight into how the sentiments of investors affect the behavior of a stock in the market.

The first section of this paper introduces programming around the Twitter platform's limitations; the second introduces the concept of P-Trees and how we can structure our data vertically; the third shows how we can store the necessary information in the data cube model and represent various views of the captured information. The fourth section shows results on classifying ticker symbols from the tweets of multiple investors and gives an insight into deriving investor sentiment. The future direction of this work is discussed in the fifth section, and we conclude in the sixth.

II. PROGRAMMING AROUND THE TWITTER PLATFORM

This section explains in detail how we programmatically retrieve tweets from the Twitter platform. Twitter has a REST API that allows users to search for tweets, users, and timelines, or even post new messages. We use Tweetinvi, a C#-based .NET API, to access the Twitter API. A Twitter developer account must first be set up with the necessary credentials to query the API through Tweetinvi. The project is layered to keep the Twitter interactions, file management, application logic and P-Tree based algorithms separate. A minimal sketch of the retrieval step is given below.
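The implementation itself is written in C# against Tweetinvi; as a language-neutral illustration, the following Python sketch issues the equivalent raw REST calls against Twitter's v1.1 user_timeline endpoint. The credential placeholders and the helper name fetch_timeline are ours, not part of the paper's code.

```python
# Hedged sketch: paging a user's timeline via Twitter's v1.1 REST API.
# The paper's actual implementation uses the Tweetinvi C# library; the
# credentials and helper name here are hypothetical placeholders.
import requests
from requests_oauthlib import OAuth1

AUTH = OAuth1("CONSUMER_KEY", "CONSUMER_SECRET",
              "ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")
URL = "https://api.twitter.com/1.1/statuses/user_timeline.json"

def fetch_timeline(screen_name, max_tweets=3200):
    """Download up to max_tweets historical tweets, 200 per request,
    walking backwards with max_id as Twitter's paging requires."""
    tweets, max_id = [], None
    while len(tweets) < max_tweets:
        params = {"screen_name": screen_name, "count": 200}
        if max_id is not None:
            params["max_id"] = max_id
        batch = requests.get(URL, auth=AUTH, params=params).json()
        if not batch:
            break                          # no older tweets remain
        tweets.extend(batch)
        max_id = batch[-1]["id"] - 1       # step past the oldest tweet seen
    return tweets[:max_tweets]
```

Because Twitter allows only 300 requests per 15-minute window, the real program additionally paces these calls, rechecking each account only after the average gap between that user's recent tweets has elapsed.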
Based on the sector we are interested in following, we first create a list of usernames from that sector whose tweets we would like to analyze. For our analysis we grabbed a list of 3000 financial and news accounts taken from Twitter lists and search results. We also estimated the best time to update each user's tweets based on their tweeting frequency. The freshly downloaded tweets are serialized to JSON by JSON.NET and then written to an .xml based file, one per Twitter username. The program is designed to download the maximum number of historical tweets possible per user and then recheck the accounts for new tweets. Twitter limits the API to 300 requests per 15 minutes and allows access to a maximum of 3200 historical tweets per user. Since the focus is on the financial sector, we do not lose much information: older historical tweets have little significance compared to the present state of the markets. Additionally, each request to the platform can download a maximum of 200 tweets. So, to make the most productive use of the requests, we calculate the average time span of each user's latest tweets and set the time at which the program should recheck for new tweets. Also, when downloading tweets we check whether we have already downloaded tweets for the user; if historical tweets for the user exist, we download only the updates. Each tweet has its own format and contains a lot of information (ID, Text, Time, Retweets, Favorites, User, and Followers). The newly encoded tweets are written to a list which is then written to a username.xml file.

III. THE P-TREE TECHNOLOGY

Tremendous volumes of data cause the cardinality problem for conventional transaction-based data mining algorithms. For fast and efficient data processing, we transform the data into P-Trees, the loss-less, compressed, and data-mining-ready vertical data structure. The basic data structure exploited in the P-tree technology [1] is the Predicate Count Tree (PC-Tree), formerly known as the Peano Count Tree, or simply the P-tree. Formally, P-trees are tree-like data structures that store numeric-relational data (i.e., numeric data in relational format) in column-wise, bit-compressed format by splitting each attribute into bits (i.e., representing each attribute value by its binary equivalent), grouping together all bits in each bit position for all tuples, and representing each bit group by a P-tree. P-trees provide a lot of information and are structured to facilitate data mining processes. After representing each numeric attribute value by its bit representation, we store all bits for each position separately; in other words, we group together all the bit values at bit position x of each attribute for all tuples t. Figure 1 shows a relational table made up of three attributes and four tuples transformed from numeric to binary, and highlights all the bits in the first three bit groups of the first attribute, Attribute 1; each of those bit groups will form a P-tree. Since each attribute value in our table is made up of 8 bits, 24 bit groups are generated in total, with each attribute generating 8 bit groups. Figure 2 shows a group of 16 bits transformed into a P-tree after being divided into quadrants (i.e., subgroups of 4). Each such tree is called a basic P-tree. In the lower part of Figure 2, 7 is the total number of 1 bits in the whole bit group shown in the upper part; 4, 2, 1 and 0 are the numbers of 1's in the 1st, 2nd, 3rd and 4th quadrants respectively. A sketch of this construction appears below.
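To make the construction concrete, here is a minimal Python sketch of a basic P-tree, assuming bit groups whose length is a power of 4 (as in Figure 2), and anticipating the pure-quadrant compression and the AND/NOT/ROOTCOUNT algebra described next. All names (PTree, from_bits, ptree_and, ptree_not, root_count) are illustrative, not from [1].

```python
# Hedged sketch of a basic P-tree; names are ours, not from [1].
class PTree:
    def __init__(self, count, size, children=None):
        self.count = count        # number of 1-bits covered by this node
        self.size = size          # number of bit positions covered
        self.children = children  # None for pure-0 / pure-1 leaf quadrants

def from_bits(bits):
    """Recursively quarter a bit group; pure quadrants become leaves."""
    n, ones = len(bits), sum(bits)
    if ones in (0, n):                       # pure-0 or pure-1: no subtree
        return PTree(ones, n)
    q = n // 4
    kids = [from_bits(bits[i * q:(i + 1) * q]) for i in range(4)]
    return PTree(ones, n, kids)

def root_count(p):
    """ROOTCOUNT: the number of 1's represented by the tree."""
    return p.count

def ptree_not(p):
    """Complement: simply the count complement in each quadrant."""
    kids = None if p.children is None else [ptree_not(k) for k in p.children]
    return PTree(p.size - p.count, p.size, kids)

def ptree_and(a, b):
    """AND two P-trees built over the same bit positions."""
    if a.count == 0 or b.count == 0:   # anything AND pure-0 is pure-0
        return PTree(0, a.size)
    if a.count == a.size:              # pure-1 AND b is b
        return b
    if b.count == b.size:
        return a
    kids = [ptree_and(x, y) for x, y in zip(a.children, b.children)]
    return PTree(sum(k.count for k in kids), a.size, kids)
```

For a 16-bit group whose quadrants hold 4, 2, 1 and 0 one-bits, from_bits reproduces the tree of Figure 2: a root count of 7, a childless pure-1 node (4), a childless pure-0 node (0), and further partitions under the two non-pure quadrants.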
Since the first quadrant (the node denoted by 4 on the second level of the tree) is made up entirely of "1" bits (we call it a pure-1 quadrant), no sub-trees for it are needed. Similarly, quadrants made up entirely of "0" bits (the node denoted by 0 on the second level) are called pure-0 quadrants and have no sub-trees. This, in fact, is how compression is achieved [1]: good P-Tree compression is obtained when the data is very sparse (which increases the chances of long sequences of "0" bits) or very dense (which increases the chances of long sequences of "1" bits). Non-pure quadrants, such as nodes 2 and 1 on the second level of the tree, are recursively partitioned further into four quadrants with a node for each quadrant. We stop the recursive partitioning of a node when it becomes pure-1 or pure-0 (eventually we reach a point where the node is composed of a single bit and is therefore pure, being made up entirely of either "1" bits or "0" bits).

P-tree algebra includes operations such as AND, OR, NOT (or complement) and ROOTCOUNT (a count of the number of "1"s in the tree). Details of those operations can be found in [1]. The latest benchmark on P-tree ANDing has shown a speed of 6 milliseconds for ANDing two P-trees representing bit groups of 16 million bits each. Speed and compression aspects of P-Trees are discussed in greater detail in [1]; [2], [4] and [5] give some applications exploiting the P-tree technology. Once we have represented our data using P-trees, no scans of the database are needed to perform text categorization, as we shall demonstrate later. In fact, this is one of the important aspects of P-tree technology.

Fig. 1. Relational numeric data converted to binary format with the first three bit groups in Attribute 1 highlighted.

Fig. 2. A 16-bit group converted to a P-Tree.

IV. DATA CUBE MODEL AND VIEWS

In this section we describe how we can represent the captured tweets in the form of P-Trees and store the data losslessly in a 3-D data cube model. In the term space model, a document is presented as a vector in the term space, where terms are used as features or dimensions. The data structure resulting from representing all the documents in a given collection as term vectors is referred to as a document-by-term matrix. Given that the term space has thousands of dimensions, most current text-mining algorithms fail to scale up. This very high dimensionality of the term space is an idiosyncrasy of text mining and must be addressed carefully in any text-mining application. Within the term space model, many different representations exist. On one extreme, there is the binary representation, in which a document is represented as a binary vector where a 1 bit in slot i implies the existence of the corresponding term ti in the document at position pj, and a 0 bit implies its absence. This model is fast and efficient to implement but clearly lacks the degree of accuracy needed, because most of the semantics are lost. On the other extreme, there is the frequency representation, where a document is represented as a frequency vector. Many types of frequency measures exist: term frequency (TF), term frequency by inverse document frequency (TFxIDF), normalized TF, and the like. The frequency representation is obviously more accurate than the binary one but is not as easy or efficient to implement. A small binary document-by-term matrix is sketched below as an illustration.
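The following toy illustration of the binary representation uses our own example documents, not the paper's data.

```python
# Toy illustration of the binary document-by-term representation.
docs = ["buy AAPL now", "sell AAPL", "buy TSLA"]
vocab = sorted({w for d in docs for w in d.split()})

# matrix[d][t] = 1 iff term t occurs anywhere in document d
matrix = [[1 if t in d.split() else 0 for t in vocab] for d in docs]

for d, row in zip(docs, matrix):
    print(f"{d!r:15} -> {row}")
# Each column is a bit group of the kind the Section III P-trees compress
# (padded to a power-of-4 length before calling from_bits in practice).
```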
Our algorithm computes the term frequency (TF) measure based on ticker symbol references in the tweet texts, and is characterized by accuracy as well as space and time efficiency because it is based on the P-Tree technology. For our analysis we store the tweet ID, tweet text, and user information as different dimensions of every tweet. We also store time as supplemental information which can be used later in analysis. In a data cube model, the cubes can represent multidimensional data of different dimensions: each cube cell denotes an event, the cube edges stand for analysis dimensions, and each cube cell is given a value for each measure. So far, we have created a binary matrix similar to that depicted in Figure 1. We follow the same steps presented in Section III to create the P-tree version of the matrix, as in Figure 2. For every bit position in every term ti we create a basic P-tree. Pi,j is the P-Tree representation of the bits lying at the jth position in the ith term for all documents; each Pi,j gives the number of documents having a 1 bit in position j for term i. This representation conveys a lot of information and is structured to facilitate fast data mining processing. To get the P-tree holding all documents having a certain value for some term i, we can follow the steps of the following example: if the desired binary value for term i is 10, we calculate Pi,10 as Pi,10 = Pi,1 AND P'i,0, where ' indicates the bit-complement or NOT operation (which is simply the count complement in each quadrant).

Fig. 3. Data Cube diagram showing the dimensions of Unique Term, Position and Tweet ID. The cube shows the representation of a single tweet text "Buy AAPL always and invest in TSLA": based on the unique terms and their positions, we denote a 1 where there is an occurrence and a 0 elsewhere.

Based on this cube, different P-Tree views can be created depending on how we would like to slice the data cube. Intermediate to the creation of P-Trees, we view the data cube as tables in three ways: the Document Table (DT), the Term Table (TT) and the Position Table (PT):

Document Table:

    Doc   T1P1 … T1P7   …   T9P1 … T9P7
    1     1    … 0      …   0    … 0
    2     0    … 0      …   1    … 0
    3     0    … 0      …   1    … 1

Term Table:

    Term  P1D1 … P1D3   …   P7D1 … P7D3
    1     1    … 0      …   1    … 0
    ⋮
    9     0    … 0      …   1    … 1

Position Table:

    Pos   T1D1 … T1D3   …   T9D1 … T9D3
    1     1    … 0      …   0    … 1
    ⋮
    7     0    … 1      …   1    … 0

A. P-Tree Term View

The term view P-Trees are created by bit-slicing the Term Table: based on the tweet ID dimension and the position of occurrence of the term within the tweet, a P-Tree for every term is created. For every tweet made by a user, we parse the tweet text and store all the unique words in a list. In the same pass, we also find the maximum position of the words in the tweet. Since Twitter has a character limit of 140 per tweet, the maximum position in most of our tests has been <= 35. Every user-related data set can thus be captured as an aggregate of tweet IDs indexed on the unique word occurrences from their tweets and the respective positions of those unique terms in the tweets.

Fig. 4. Term View diagram showing an example of the term view of the term "How".

B. P-Tree Position View

The position view P-Trees are created by bit-slicing the Position Table: based on the tweet ID dimension and the unique term occurrences in each tweet ID, a P-Tree for every position is created.

Fig. 5. Position View diagram showing an example of the position view of the position "1".

Another view which can be created easily is the Document view (or tweet ID view), obtained by bit-slicing the Document Table: based on the unique terms and their positions in each of the tweets, a P-Tree for every tweet ID can be created. A sketch of slicing a tweet into these views follows.
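The following sketch, using our own names, slices the sample tweet of Figure 3 into the term-view and position-view bit groups.

```python
# Hedged sketch: slicing one tweet into (term, position) bit planes that
# feed the term view and position view; names are illustrative.
text = "Buy AAPL always and invest in TSLA"
terms = text.split()

# occurrence holds the 1-cells of the Fig. 3 cube for this tweet
occurrence = {(t, p) for p, t in enumerate(terms, start=1)}

vocab = sorted(set(terms))
positions = range(1, len(terms) + 1)

# Term view: for each term, a bit vector over positions (one tweet shown)
term_view = {t: [1 if (t, p) in occurrence else 0 for p in positions]
             for t in vocab}

# Position view: for each position, a bit vector over terms
position_view = {p: [1 if (t, p) in occurrence else 0 for t in vocab]
                 for p in positions}

print(term_view["AAPL"])   # -> [0, 1, 0, 0, 0, 0, 0]: "AAPL" at position 2
print(position_view[1])    # -> a single 1-bit, in the slot for "Buy"
```

Across many tweets, each such bit vector becomes a bit group; feeding it to from_bits from the Section III sketch yields the corresponding view P-tree, and the value query Pi,10 = Pi,1 AND P'i,0 is then a direct application of ptree_and and ptree_not.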
The data is thus stored redundantly in the form of three different P-Tree views. The computational advantage of processing the data in the form of these bitmaps becomes significant when the data is very deep and also significantly wide. With conventional horizontal processing, we have to throw away data, not because it is irrelevant, but because the massive amount of data results in data mining times that are too high. With P-Trees, thanks to the faster processing advantage, we never need to throw away data.

V. CLASSIFICATION

A. TICKER SYMBOLS

With all the terms stored in the term view P-Trees list, we can filter on the terms which begin with a "$" symbol. The term view P-Trees for every user generate the term frequency based on the 1-bit occurrences of these terms. The ticker symbol with the highest frequency of 1-bit occurrences is the most discussed among the investors. This insight is helpful especially when a user needs to know which stocks to invest in primarily but does not have much knowledge of the markets. The system scans through 3000 different Twitter usernames and retrieves the most frequently discussed ticker symbol; it can also process a given number of latest tweets. Once the most frequent occurrences of the ticker symbols are obtained, the system generates an .xml based output containing the ticker symbol frequency and the tweet texts associated with the ticker symbol occurrences. Based on how we wish to process the text from the tweets, we can either have humans read the final generated .xml file to understand the ticker symbols and derive sentiments, or have an NLP tool process these generated .xml files. Using NLP to derive sentiments from the tweet texts is outside the scope of this paper. The analysis has also been done with varied sample sizes, such as studying the latest 100 tweets across 100 investors, 10 tweets across 500 investors, and so on. A sample result with the 10 most frequent ticker symbol occurrences from the latest 500 tweets across the 25 most-tweeted investors between 03/30/2015 and 04/03/2015 is shown below in Table I; a hedged sketch of the filtering step follows the table.

TABLE I. TOP 10 TICKER SYMBOLS WITH FREQUENCY

    Rank  Ticker Symbol  Frequency
    1.    $TRIL          124
    2.    $LBIO          76
    3.    $CYBR          68
    4.    $FEYE          61
    5.    $JUNO          53
    6.    $FB            44
    7.    $CNAT          41
    8.    $KRFT          39
    9.    $SPY           39
    10.   $AXMM          37
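The sketch below reuses root_count from the Section III sketch; the mapping term_ptrees and the helper top_tickers are our names, not the paper's.

```python
# Hedged sketch: rank ticker symbols by frequency of 1-bit occurrences.
# term_ptrees maps each term to the P-tree over its occurrence bit group.
from collections import Counter

def top_tickers(term_ptrees, k=10):
    """Return the k $-prefixed terms with the highest ROOTCOUNT,
    i.e. the most tweet/position occurrences."""
    counts = Counter({term: root_count(pt)
                      for term, pt in term_ptrees.items()
                      if term.startswith("$")})
    return counts.most_common(k)

# e.g. top_tickers(all_term_ptrees) -> [("$TRIL", 124), ("$LBIO", 76), ...]
```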
We then examine the actual behavior of these stocks in the market within the same time frame in which the investors were tweeting about them; the results are shown in Table II.

TABLE II. TICKER SYMBOLS WITH % PRICE CHANGES

    Rank  Ticker Symbol  % Price Change
    1.    $TRIL          36.2
    2.    $LBIO          12.39
    3.    $CYBR          0.29
    4.    $FEYE          -2.74
    5.    $JUNO          -4.86
    6.    $FB            -0.91
    7.    $CNAT          -19.39
    8.    $KRFT          -2.04
    9.    $SPY           -0.2
    10.   $AXMM          -15.2

The results show that $TRIL and $LBIO, which had the highest frequencies of occurrence, rose close to 36.2% and 12.39% respectively in that time period. Using P-Trees we not only find the most frequently occurring ticker symbols in an efficient manner, but we can also reconstruct the tweet texts associated with every term and position for the most frequently occurring ticker symbols. We provide this information so that the end user can understand the contexts in which the investors made their tweets. Fig. 6 shows some of the sample tweet texts associated with the ticker symbol $FB (Facebook Inc.).

Fig. 6. Tweet texts captured for "Facebook" from multiple investors.

Compared to any horizontal approach, such as KNN, used to scan for the term frequency, our approach shows much better results in terms of speed and accuracy. The improvement in speed is mainly due to the complexity of the two algorithms. The high cost of KNN-based algorithms is mainly associated with their selection phases. The selection phase in our algorithm has a complexity of O(n), where n is the number of dimensions (number of terms), while the KNN approach has a complexity of O(mn) for its selection phase, where m is the size of the dataset (number of documents or tuples) and n is the number of dimensions. The improvement is drastic when the size of the matrix is very large (the case of the 5000x20000 matrix in Table III). As for accuracy, the KNN approach bases its judgment of the similarity between two document vectors on the angle between those vectors, regardless of the actual distance between them. Our approach does a more sophisticated comparison by using P-tree ANDing to compare the closeness of the value of each term in the corresponding vectors, and is thus able to judge the distance between the two vectors and not only the angle. Also, terms that would skew the result can be ignored in our algorithm, unlike in the KNN approach, which has to include all the terms in the space during the neighbor-selection process.

TABLE III. TIME COMPARISON TABLE

B. CLASSIFYING INVESTORS

With the globally most frequent ticker symbols now classified, we would like to evaluate the sentiment of the investors so that we can study their effect on certain ticker symbols. Investors can safely be assumed to be sentiment driven. Since we rely so heavily on the tweet texts, it is important to make sure that the investors are not underreacting or overreacting to particular ticker symbols. Real investors and markets are too complicated to be summarized by a few selected biases and trading frictions. The top-down approach [] [] focuses on the measurement of reduced-form, aggregate sentiment and traces its effects to market returns and individual stocks. The approach is based on two broad, undisputable assumptions of behavioral finance, sentiment and the limits to arbitrage, to explain which stocks are likely to be most affected by sentiment, rather than simply pointing out that the level of stock prices in the aggregate depends on sentiment. In particular, stocks of low-capitalization, younger, unprofitable, high-volatility, non-dividend-paying, growth companies, or stocks of firms in financial distress, are likely to be disproportionately sensitive to broad waves of investor sentiment; stocks that are difficult to arbitrage or to value are most affected by sentiment.

Fig. 8. Cross-sectional effects of investor sentiment. Stocks that are speculative and difficult to value and arbitrage will have higher relative valuations when sentiment is high.

When sentiment is low, the average future returns of speculative stocks exceed those of bond-like stocks; when sentiment is high, the average future returns of speculative stocks are on average lower than the returns of bond-like stocks. In our analysis, we focus on investors being sentiment driven on particular ticker symbols based on their tweet counts. The greater the heterogeneity of the population of investors tweeting on a ticker symbol Ts, the wider the range of the population in which Ts is being discussed. Let X: X1, X2, ..., Xn be the population of the investors who are tweeting on ticker symbol Ts and let f: f1, f2, ..., fn denote the frequencies of tweets made by those investors. We use the median divisor principle, as the tweets exhibit two different patterns of behavior relative to the median value: below the median they exhibit one pattern, and above the median another. So we segregate them into two subdivisions n1 and n2 about the median, where S1 denotes the standard deviation of the subpopulation below the median and S2 the standard deviation of the subpopulation above it, as shown in (1) and (2):

S_1 = \sqrt{ \frac{1}{n_1 - 1} \sum_{i}^{n_1} f_{1i} (x_i - \bar{x}_1)^2 }    (1)

S_2 = \sqrt{ \frac{1}{n_2 - 1} \sum_{i}^{n_2} f_{2i} (x_i - \bar{x}_2)^2 }    (2)

The composite standard deviation S of the population is calculated as shown in (3):

S = \sqrt{ \frac{n_1 S_1^2 + n_2 S_2^2}{n_1 + n_2 - 1} }    (3)

where n_1 = \sum_i f_{1i} and n_2 = \sum_i f_{2i}.

The weighted mean deviation about the mean, shown in (4), removes any bias from the measures of the tweet counts of the investors:

\frac{1}{n_1 + n_2 - 1} \left( \sum_{i}^{n_1} f_{1i} \, |X_{1i} - \hat{X}| + \sum_{i}^{n_2} f_{2i} \, |X_{2i} - \hat{X}| \right)    (4)

where \hat{X} = \frac{1}{n_1 + n_2 - 1} \left[ \sum_{i}^{n_1} X_{1i} + \sum_{i}^{n_2} X_{2i} \right].

A hedged numerical sketch of these measures follows.
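The sketch below is a minimal reading of measures (1)-(4), assuming the median split described above and taking each investor's tweet count as a single observation (frequencies of 1); the function names, the tie-handling at the median, and that simplification are our choices.

```python
# Hedged numerical sketch of equations (1)-(4); names are ours.
import statistics

def sentiment_dispersion(tweet_counts):
    """tweet_counts: tweets per investor for one ticker symbol.
    Returns the composite standard deviation S of eq. (3)."""
    med = statistics.median(tweet_counts)
    below = [x for x in tweet_counts if x < med]    # subdivision n1
    above = [x for x in tweet_counts if x >= med]   # subdivision n2
    n1, n2 = len(below), len(above)
    s1 = statistics.stdev(below) if n1 > 1 else 0.0   # eq. (1), f = 1
    s2 = statistics.stdev(above) if n2 > 1 else 0.0   # eq. (2), f = 1
    return ((n1 * s1**2 + n2 * s2**2) / (n1 + n2 - 1)) ** 0.5  # eq. (3)

def mean_deviation(tweet_counts):
    """Weighted mean deviation about the mean, eq. (4), keeping the
    paper's n1+n2-1 divisor exactly as printed."""
    n = len(tweet_counts)
    x_hat = sum(tweet_counts) / (n - 1)
    return sum(abs(x - x_hat) for x in tweet_counts) / (n - 1)

counts = [3, 1, 8, 2, 12, 5, 7]   # toy tweet counts on one ticker symbol
print(sentiment_dispersion(counts), mean_deviation(counts))
```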
VI. FUTURE DIRECTION

With the Twitter data now accessible and loaded into various P-Tree views, the future potential in this area is significant. P-Trees will let us analyze, if necessary, trillions of records deep. The main idea illustrated in this paper is solving a classification problem: understanding which ticker symbols to focus on depending on the tweets made by the investors. Some users may also trade against this information, depending on their investment pattern. The actual movement of the stocks in the market over various time frames, and their fluctuations, can be related to the information impact of this data. Investors whose predictions have been in accordance with the stock fluctuations can be assigned higher weights to further classify ticker symbols.

We also believe that adding the third dimension, namely term position in the tweet, offers a wide range of possible new decision supports, such as phrase analysis (multiple juxtaposed terms), e.g. "Buy AAPL", which could be evaluated by shifting the "AAPL" P-Tree by one position bit and then ANDing it with the "Buy" P-Tree. In addition to phrase analysis, the position dimension also opens up the potential of discovering signals based on where in a tweet a particular term appears (early on, later, etc.). Of course, introducing the third dimension of term position massively expands the size of the data cube being considered and necessitates effective compression techniques. Fortunately, when converting to P-Trees, in all but one of the dimensions (the document dimension) each P-Tree is as sparse as it could possibly be (a single one-bit) and therefore compresses maximally. In the document dimension, by replicating single one-bit P-Trees instead of using links, the same advantage can be realized. These advantages will be pursued in the future as well. We also plan to package this solution as a phone app which will be commercially available to end consumers on the Windows, Android and iOS platforms. A sketch of the phrase-analysis idea mentioned above follows.
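The phrase query is shown below on plain position bit vectors for clarity; on real data the same shift-and-AND would operate on the compressed P-trees. The helper names are ours.

```python
# Hedged sketch of phrase analysis: shift the "AAPL" occurrence bits one
# position left and AND with "Buy", so a surviving 1-bit marks "Buy"
# immediately followed by "AAPL".
def shift_left(bits):
    """Align position p+1 of the second term with position p of the first."""
    return bits[1:] + [0]

def phrase_hits(first_bits, second_bits):
    return [a & b for a, b in zip(first_bits, shift_left(second_bits))]

# positions:    1  2  3  4  5  6  7  ("Buy AAPL always and invest in TSLA")
buy_bits  =    [1, 0, 0, 0, 0, 0, 0]
aapl_bits =    [0, 1, 0, 0, 0, 0, 0]
print(phrase_hits(buy_bits, aapl_bits))  # 1-bit at position 1: "Buy AAPL" found
```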
VII. CONCLUSION AND LIMITATIONS

We have shown in this paper how, using the Twitter API and P-Trees, we can provide insight to users who would like to invest in the market but cannot identify which ticker symbols to invest in. They can use this solution to gain the insight needed to make business investment decisions. We have also classified the investors based on their tweet counts and have ensured that any bias is removed from the measures. We have used classification as a decision support and have not used sentiment analysis in this approach. We believe that if the available information is provided in an easily consumable manner, humans derive sentiments better than NLP tools. The sentiment analysis derived from NLP solutions has not always revealed accurate insights, but if necessary we would leverage NLP to derive sentiments from the tweet texts when the generated output is too large for humans to process.

REFERENCES

[1] M. Khan, Q. Ding, and W. Perrizo, "K-nearest Neighbor Classification on Spatial Data Streams Using P-trees," Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 02), pp. 517-528, Taipei, Taiwan, May 2002.
[2] Perrizo, Ding, and Roy, "Deriving high confidence rules from spatial data using peano count trees," Proceedings of WAIM, International Conference on Web-Age Information Management, Xi'an, China, pp. 91-102, July 2001.
[3] Rahal and Perrizo, "Query acceleration in multi-level secure database systems using the P-tree technology," Proceedings of the ISCA CATA International Conference on Computers and Their Applications, Honolulu, Hawaii, March 2003.
[4] Rahal and Perrizo, "An Optimized Approach for KNN Text Categorization using P-trees," Proceedings of the 2004 ACM Symposium on Applied Computing (SAC-04), Nicosia, Cyprus, March 14-17, 2004.
[5] I. Rahal, D. Ren, and W. Perrizo, "A Scalable Vertical Model for Mining Association Rules," Journal of Information and Knowledge Management (JIKM), V3:4, pp. 317-329, 2004.
[6] J. Han and Y. Fu, "Discovery of multiple-level association rules from large databases," Proceedings of the 21st International Conference on Very Large Databases, pp. 420-431, September 1995.
[7] J. Han and M. Kamber, "Data Mining: Concepts and Techniques," Morgan Kaufmann Publishers Inc., San Francisco, CA, 2000.
[8] P. Adriaans and D. Zantinge, "Data Mining," Addison Wesley, 1996.
[9] N. G. Das, "A Book on Statistical Methods," M. Das and Co., 2001.
[10] C. Yang, U. M. Fayyad, and P. S. Bradley, "Efficient discovery of error-tolerant frequent item sets in high dimensions," Proceedings of KDD 2001, pp. 194-203.
[11] Lodhi, Saunders, Shawe-Taylor, Cristianini, and Watkins, "Text classification using string kernels," Journal of Machine Learning Research, (2): 419-444, February 2002.
[12] T. Abidin and W. Perrizo, "SMART-TV: A Fast and Scalable Nearest Neighbor Based Classifier for Data Mining," Proceedings of the 21st ACM Symposium on Applied Computing (SAC-06), Dijon, France, April 23-27, 2006.
[13] A. Perera, T. Abidin, M. Serazi, G. Hamer, and W. Perrizo, "Vertical Set Squared Distance Based Clustering without Prior Knowledge of K," International Conference on Intelligent and Adaptive Systems and Software Engineering (IASSE-05), pp. 72-77, Toronto, Canada, July 20-22, 2005.
[14] D. Ren, B. Wang, and W. Perrizo, "RDF: A Density-Based Outlier Detection Method using Vertical Data Representation," Proceedings of the 4th IEEE International Conference on Data Mining (ICDM-04), pp. 503-506, November 1-4, 2004.
[15] Qin Ding, Qiang Ding, and W. Perrizo, "PARM - An Efficient Algorithm to Mine Association Rules from Spatial Data," IEEE Transactions on Systems, Man, and Cybernetics, Vol. 38, No. 6, ISSN 1083-4419, pp. 1513-1525, December 2008.
[16] I. Rahal, M. Serazi, A. Perera, Q. Ding, F. Pan, D. Ren, W. Wu, and W. Perrizo, "DataMIME™," ACM SIGMOD 04, Paris, France, June 2004.
[17] Treeminer Inc., The Vertical Data Mining Company, 175 Admiral Cochrane Drive, Suite 300, Annapolis, MD.