Stocks 'R Us: Analyzing the Volatility of Stocks in the S&P 500

Felix Chu, Jeremy Brudvik, Philip Kuo, Alex Loddengaard

Introduction

Many companies offer shares of their stock on a public, open market, where each shareholder owns a small part of the company. Owning stock typically gives the holder votes roughly proportional to the number of shares owned, allowing a small say in policy-making, as well as a share of profits and losses as the stock price fluctuates.

The buying and selling of a stock also affects its price. Generally, good news causes stock prices to rise, and bad news causes them to fall. Moreover, if a stock is in high demand, its share price increases; if it is in low demand, its share price decreases. Sometimes this can be taken as a measure of the public's confidence in the company, or a reflection of its opinion of the company's recent actions. A company's worth can be roughly estimated as the price of a single share multiplied by the total number of shares.

In order to control the price of its stock, a company may choose to split it. Usually splits are 2:1, where the share price is halved and each shareholder ends up with twice as many shares as before; the total number of shares doubles as well. Other common splits include 3:1 and 3:2. It is theorized that stock splits, by lowering the price of an individual share, make the stock more attractive to larger portions of the population, such as day traders.

For this project, we hope to determine whether this is the case by comparing a stock's price with its volatility, which we use as a measure of its attractiveness to traders. We define volatility to be the difference between the high and low price for the day, divided by the opening price. This gives the absolute percent fluctuation of a stock's price relative to its opening price on that day.
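The definition of volatility given here can be written out as a short Python sketch. The function name and the sample prices are hypothetical and chosen only for illustration; the input checks are our own assumption about how malformed rows might be handled.

```python
def daily_volatility(open_price: float, high: float, low: float) -> float:
    """Return (high - low) / open: the day's price range as a fraction of the opening price."""
    if open_price <= 0:
        raise ValueError("opening price must be positive")
    if high < low:
        raise ValueError("high price must not be below low price")
    return (high - low) / open_price


# Hypothetical stock: opens at $20.00 and trades between $19.00 and $21.00.
example = daily_volatility(open_price=20.0, high=21.0, low=19.0)
print(f"{example:.1%}")  # prints 10.0%
```

Because the range is divided by the opening price, a $2 swing counts for much more on a $20 stock than on a $200 stock, which is what makes the measure comparable across price levels.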
For example, if a stock price fluctuates greatly during a single day, it has high volatility. On the other hand, if the stock price stays stable over the day, it has low volatility.

We investigated the stocks in the S&P 500 for this analysis. The S&P 500 is a stock market index consisting of 500 "large" companies, where "large" means having a very high market value, defined as the product of the number of shares available to the public and the share price. 3M, Adobe, Amazon, Apple, Coca Cola, eBay, FedEx, Ford, Google, Heinz, Mattel, McDonalds, Safeway, Starbucks, Time Warner, and Disney are just a few of the companies found in this index. S&P stands for Standard & Poor's, which developed and maintains the index. S&P is a division of McGraw-Hill, which, besides printing textbooks, provides financial and business services, among other things (1). All companies in the index are publicly traded on the New York Stock Exchange or NASDAQ. Nearly all of the 500 are US companies. The companies to be included are chosen by a committee, which judges which companies are indicative of the performance of various industries in the United States. Companies are added or removed on an "as needed" basis. Stocks without enough liquidity, as judged by the committee, are not included in the index. <TODO: cite more in this paragraph>

Methods

Data Source

We have selected a single dataset to analyze stock volatility (2). The dataset contains daily snapshots of the S&P 500 stock index for a year starting on April 26, 2007. Each entry includes the stock symbol, date, stock ticker, opening price, high price, low price, close price, and trade volume. Since stock trading does not occur on national holidays or weekends, we have 253 days' worth of trading data (see figure 1). Given that the S&P 500 is made up of 500 companies, we would expect to have 253 x 500 = 126,500 data entries.
However, the dataset has a few missing entries for unknown reasons and includes only 126,437 entries. The set of stocks in the S&P 500 also changed over the year, but the size of the set always remained at or near 500.

Figure 1: A histogram showing that most stocks were traded when the market was open.

The S&P 500 dataset was heavily weighted towards stocks with low prices and with minimal percent changes. This heavy weighting greatly skewed our regression analysis (see figure 2), so we decided to prune the larger dataset into a sub-dataset. We created a Python script that does the following:

1. Split all entries into groups of $5 opening-price chunks ($0-4, $5-9, $10-14, etc.)
2. Compute the absolute percentage change for each entry as the difference between the high price and the low price for the day, divided by the opening price
3. Keep only the entry with the largest percent change in each $5 chunk

This gives us a dataset with a single entry per $5 slice, where that single entry is the stock-day that had the highest volatility within that opening-price range. The graph of opening price against percentage change (see figure 2) clearly shows an inverse-polynomial curve along the top of the data, disregarding outliers. Essentially, by finding the largest percentage change for each $5 chunk, we capture the curve of figure 2 and keep our regression from incurring large errors due to a weighted dataset. Figure 3 shows how the curve becomes more apparent with the pruned dataset.

Figure 2: Opening price vs. daily percentage change with full dataset (no pruning).

Figure 3: Opening price vs. daily percentage change with pruned dataset.

Incidentally, we did investigate the effects of after-hours trading (AHT), measured as the change between the closing price of a stock on one day and its opening price on the following day. By definition, AHT is the buying and selling of stocks outside of the specified regular trading hours.
Both the New York Stock Exchange and the NASDAQ operate from 9:30 a.m. to 4:00 p.m. EST (3). We discovered that the price change from one day to the next due to AHT was normally distributed around zero, with a mean of -0.007% and a standard deviation of 1.7% (see figure 4). While AHT may have had interesting impacts on our measure of volatility, we decided not to include it in our research for the following reasons. First, the demographics of AHT traders differ from those of regular-hours traders, because AHT trades are dominated by large financial institutions. Second, we could not find an AHT dataset with the same level of detail as our non-AHT dataset.

Figure 4: AHT percentage change histogram.

Data Analysis

Statistical Methods

In order to fully understand the relationship between daily change and opening price, we created two regression models to fit our dataset: one non-linear model, and one linear model fit to logarithmically transformed data.

Tests and Criteria

Our preliminary tests and criteria were common sense. That is, we directly compared our models to the dataset itself to determine whether a particular model was accurate. Moreover, we looked at the distribution of the residuals to check the relevance of each model. For the non-linear model, we looked at the residual sum of squares; for the linear model, we looked at the R-squared value, the p-value of significance, and the standard error.

Tools

Our data-pruning tool is a custom Python script. A custom Java program was used to preprocess the data for the AHT histogram. All statistical analysis was done in R with standard packages.

Results

Assaf, we thought we would include our linear and non-linear regressions as well as histograms and box-plots for the residuals. Tell us what you think.

Discussion and Conclusions

References

1.) http://www2.standardandpoors.com
2.) http://biz.swcp.com/stocks/
3.) http://www.investopedia.com/ask/answers/04/061004.asp
4.) <add a reference to the site we read about the nlm function?>

Appendix
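As an illustration of the three pruning steps listed under Data Source, a minimal Python sketch follows. It is not the script used for this project: the Entry type, its field names, the prune function, and the sample rows are all hypothetical, and we assume that each $5 chunk covers opening prices from its lower bound up to, but not including, the next multiple of 5.

```python
from typing import Dict, Iterable, List, NamedTuple


class Entry(NamedTuple):
    """One stock-day. Field names are hypothetical, not the dataset's own."""
    symbol: str
    date: str
    open_price: float
    high: float
    low: float


def percent_change(entry: Entry) -> float:
    """(high - low) / open, the volatility measure defined in the Introduction."""
    return (entry.high - entry.low) / entry.open_price


def prune(entries: Iterable[Entry], chunk_width: float = 5.0) -> List[Entry]:
    """Keep only the most volatile entry in each opening-price chunk."""
    best: Dict[int, Entry] = {}
    for entry in entries:
        if entry.open_price <= 0:
            continue  # assumption: rows without a usable opening price are skipped
        # Chunk 0 covers [$0, $5), chunk 1 covers [$5, $10), and so on.
        chunk = int(entry.open_price // chunk_width)
        current = best.get(chunk)
        if current is None or percent_change(entry) > percent_change(current):
            best[chunk] = entry
    # Return the survivors ordered by opening-price chunk.
    return [best[chunk] for chunk in sorted(best)]


# Made-up rows, for illustration only.
sample = [
    Entry("AAA", "2007-04-26", open_price=3.00, high=3.30, low=2.90),
    Entry("BBB", "2007-04-26", open_price=4.50, high=4.60, low=4.45),
    Entry("CCC", "2007-04-27", open_price=7.00, high=7.10, low=6.90),
]
for kept in prune(sample):
    print(kept.symbol, f"{percent_change(kept):.2%}")
```

In this sample, AAA and BBB fall into the same $0-4 chunk, so only the more volatile of the two survives, while CCC is the sole entry in the $5-9 chunk. Ties are resolved in favor of the entry seen first, which is an arbitrary choice of this sketch.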