In our project, we investigated the relationship between stocks in the Information Technology (IT) industry, using five years of price data from October 22, 2004 to October 20, 2009.
Our focus was Microsoft (MSFT): we wanted to determine how the returns of ten other stocks were related to the daily return on Microsoft stock. The ten stocks were Apple Computer (AAPL), Adobe Software (ADBE), Automatic Data Processing (ADP), Advanced Micro Devices (AMD), Dell Computer (DELL), Hewlett-Packard (HP), International Business Machines (IBM), Oracle Software (ORCL), Yahoo (YHOO), and Google (GOOG). We found that only six of these had a significant relationship with Microsoft. The result of our analysis is an equation that we can use to predict Microsoft stock returns based on the returns of those six companies’ stocks.
First, let’s look into why some stock returns are more highly correlated than others.
Companies in the same industry should behave the same, right? The trouble with that idea is that an industry (like Information Technology) is a very wide category. IBM makes computers and sells software, Advanced Micro Devices only makes the chips that run computers, while Microsoft and Oracle compete in database systems and other software. All four of these companies are in the information technology industry, or sector, but they operate in different sub-sectors. So when we look at the ten companies we want to compare to Microsoft, we expect the company most similar to Microsoft to have the strongest relationship with it, and we measure that relationship. In the table of our results below, where we rank the six stocks according to their relationship with Microsoft, we indeed see that Oracle has the strongest relationship with Microsoft. This makes sense, since both companies are very large and produce only software.
Stock   Rank
ORCL    0.235
ADP     0.211
IBM     0.188
ADBE    0.123
GOOG    0.095
DELL    0.072
When we look at the chart of each of these stocks vs. Microsoft (see the next section), we see that each is positively correlated with Microsoft.
In conclusion, our results show that the strongest relationship is between Microsoft and Oracle, for several reasons: both are purely software companies, they are direct competitors in the database market, and they have the same relationship with the overall market.
The least comparable company is Dell, which is solely a hardware company that manufactures personal computers.
We have included a description of each of the companies in the appendix.
Our data was originally retrieved from the Yahoo finance web site. Stock prices were converted to log returns, i.e. r(t) = log[S(t)/S(t–1)], where S(t) = closing stock price at time t, and log = natural logarithm. This is a typical transformation for financial data, which yields near-normal distributions. Our data had over 100 outliers. Some of these were found to be stock splits, which we then adjusted for. Other outliers were kept in the sample, because they are an important part of financial market data.
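The log-return transformation described above can be sketched in a few lines of Python. This is only an illustration, not the code used in the project (the project code is in SAS); the function name and the closing prices are hypothetical.

```python
import math

def log_returns(prices):
    """Return r(t) = log(S(t) / S(t-1)) for each pair of consecutive closing prices."""
    return [math.log(curr / prev) for prev, curr in zip(prices, prices[1:])]

# Hypothetical closing prices, for illustration only (not taken from the study's data).
closes = [25.00, 25.50, 24.90, 25.10]
returns = log_returns(closes)
```

One convenient property of log returns is that they add up over time: the sum of the daily values equals the log return over the whole period, log(25.10/25.00) in this example.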
The correlation table below shows the pair-wise relationships between stocks, with correlations ranging from 28% to 60%, the largest being between MSFT and ORCL. Looking at the plots of each pair, we see that there is a linear relationship between stocks. See Graphs 1 and 2 in the appendix for the highest pair-wise correlation with MSFT (MSFT vs. ORCL) and the lowest (MSFT vs. YHOO). Note that the far right outlier on the YHOO chart corresponds to Microsoft’s offer to buy Yahoo at a large premium to its then-current price.
Correlation Table:
       MSFT  AAPL  ADBE  ADP   AMD   DELL  HP    IBM   ORCL  YHOO  GOOG
MSFT   1.00  0.44  0.56  0.55  0.36  0.46  0.40  0.56  0.60  0.31  0.46
AAPL   0.44  1.00  0.45  0.38  0.38  0.42  0.37  0.49  0.48  0.34  0.49
ADBE   0.56  0.45  1.00  0.54  0.43  0.46  0.45  0.54  0.55  0.35  0.47
ADP    0.55  0.38  0.54  1.00  0.38  0.41  0.43  0.56  0.51  0.31  0.43
AMD    0.36  0.38  0.43  0.38  1.00  0.40  0.38  0.44  0.37  0.28  0.31
DELL   0.46  0.42  0.46  0.41  0.40  1.00  0.36  0.50  0.48  0.33  0.38
HP     0.40  0.37  0.45  0.43  0.38  0.36  1.00  0.45  0.42  0.34  0.39
IBM    0.56  0.49  0.54  0.56  0.44  0.50  0.45  1.00  0.58  0.37  0.40
ORCL   0.60  0.48  0.55  0.51  0.37  0.48  0.42  0.58  1.00  0.36  0.43
YHOO   0.31  0.34  0.35  0.31  0.28  0.33  0.34  0.37  0.36  1.00  0.37
GOOG   0.46  0.49  0.47  0.43  0.31  0.38  0.39  0.40  0.43  0.37  1.00
The graph below shows the distribution of the response variable is mostly normal, with heavy tails, as is often the case for financial data. The mean return was very close to zero, while the kurtosis is 10.78, which is well above the value of 3 for a normal population. The time period of our sample includes some very extreme events, including the bankruptcy of Lehman Brothers, which was largely unanticipated by the market, and had the overall market tumbling with daily returns of up to minus 8%, which is more than six standard deviations from the mean.
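To illustrate how a few extreme days push kurtosis far above the normal value of 3, here is a minimal Python sketch. The report does not state which kurtosis estimator produced 10.78, so the plain moment ratio used here is an assumption, and the sample returns are hypothetical.

```python
def kurtosis(values):
    """Moment-ratio kurtosis m4 / m2**2, which equals 3 for a normal distribution."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m4 = sum((v - mean) ** 4 for v in values) / n  # fourth central moment
    return m4 / m2 ** 2

# Hypothetical returns: forty ordinary days of +/-1% plus two extreme days of +/-8%.
sample = [0.01, -0.01] * 20 + [0.08, -0.08]
k = kurtosis(sample)  # about 12.25, far above 3
```

Only two of the 42 observations are extreme, yet they dominate the fourth moment, which is the same mechanism behind the heavy tails seen in the MSFT returns.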
The first regression procedure was run with all 10 independent variables and was examined by five algorithms, namely the C(p), adjusted R², forward (with the details option), backward, and stepwise methods. According to the selection rule of the C(p) method, a good fitting model should have a value of C(p) equal to the number of parameters, although sometimes the minimum C(p) is used. We find the best model from the C(p) algorithm:
Number in Model   C(p)     R-Square   Variables in Model
6                 5.5647   0.5017     adbe adp dell ibm orcl goog
The value of adjusted R² for the above model is 0.4993. This is slightly lower than the model chosen by the adjusted R² algorithm, which has 7 variables and an adjusted R² value of 0.4994.
Number in Model   Adjusted R-Square   R-Square   Variables in Model
7                 0.4994              0.5022     aapl adbe adp dell ibm orcl goog
However, we exclude AAPL from our model based on its high p-value of 0.2810. The remaining three methods are more complicated because there are multiple steps to review. Detailed SAS output (as well as code) is found in the appendix. We specified a significance level of 0.05, so that a variable is included only if it adds significantly to the predictive power of the model. In the case of the forward selection method, the same six stocks as in the C(p) model selection are included. See Table 1 in the appendix.
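The adjusted R² values quoted for the six- and seven-variable models can be checked from the reported R² values with the standard formula. The Python sketch below assumes a model with an intercept and uses n = 1257, the number of observations reported in the outlier analysis; the function name is ours.

```python
def adjusted_r_squared(r2, n_obs, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1) for a model with an intercept."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

N_OBS = 1257  # observations in the sample, as reported in the outlier analysis

six_var = adjusted_r_squared(0.5017, N_OBS, 6)    # rounds to 0.4993
seven_var = adjusted_r_squared(0.5022, N_OBS, 7)  # rounds to 0.4994
```

The two results differ only in the fourth decimal place, which shows how little the seventh variable contributes once the penalty for an extra parameter is applied.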
The backward elimination method begins with the full model and successively removes weak variables based on the partial F-test. For example, in the first step, AMD is removed because it has the lowest F value, 0.06. Its p-value is also very high at 0.8039, so it is not significant in its relation to MSFT. The next steps successively remove YHOO, HP, and AAPL from the model. See Table 2 in the appendix. After these variables are removed from the full model, six stocks remain (ORCL, ADP, ADBE, IBM, GOOG, DELL) to form the best fitting model.
The stepwise method is similar to the forward method, except that it iteratively adds or eliminates x-variables. More specifically, if a variable entered at a previous step has a small F value at the current step, it is removed again. From the summary table (Table 3 in the appendix), the six variables (ORCL, ADP, ADBE, IBM, GOOG, DELL) are again selected to comprise the model, and ORCL has the highest F value. When we re-examine the possible models, we notice that the C(p), forward, backward, and stepwise methods all point to the same model consisting of the same six independent variables. In addition, their p-values are below the 5% significance level, so they are the best candidates to explain our response variable, MSFT.
While there is some correlation between our independent variables, the values of the Variance Inflation Factor (VIF) indicate that collinearity is not a problem for this model. The highest value was that of IBM, at 1.95, well below the threshold of 4. See Table 5.
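The VIF is the reciprocal of the tolerance, where tolerance is 1 minus the R² obtained by regressing one independent variable on the others. As a small Python sketch (not the SAS procedure itself), the VIF column of Table 5 can be recovered from the printed tolerance values; the dictionary and variable names are ours.

```python
# Tolerance values as printed in Table 5 (tolerance = 1 - R_j^2 for predictor j).
tolerance = {
    "adbe": 0.53758,
    "adp": 0.56917,
    "dell": 0.65733,
    "ibm": 0.51282,
    "orcl": 0.53531,
    "goog": 0.69456,
}

# VIF_j = 1 / tolerance_j
vif = {name: 1.0 / tol for name, tol in tolerance.items()}

# Predictor with the largest VIF, i.e. the one most explained by the others.
worst = max(vif, key=vif.get)
```

Small differences from the printed VIF values in the last decimal place are expected, because the tolerances in the table are themselves rounded.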
Most of the time, markets exhibit the behavior of a normal random population in the log returns of stocks. The Normal Probability plot below shows this well, with data very close to the straight line for most of the distribution. However, at the ends of the distribution the points fall away from the line, showing that the data is not normal there. These are the heavy tails of stock market data.
The plot of residuals below also shows a large number of extreme data points greater than 2.5 standard deviations from the mean. Again, this is not unexpected for data from financial markets. Apart from the outliers, the residual plot shows that the data is linearly related, with no clear pattern, showing that a linear model is applicable.
We used a number of measures to identify and examine outlier points. For cutoff values, our model has p=7 parameters and n=1257 observations. The leverage statistic shows 103 data points above a cutoff of 0.0127 = (2p+2)/n. Cook’s distance shows 84 data points above a cutoff of 0.00318 = 4/n. The DFFITS statistic shows the same 84 data points beyond a cutoff of 0.15 = 2*sqrt(p/n). Using studentized residuals, we again found 84 data points beyond 2 standard deviations, 16 of which were beyond 3 standard deviations (see Table 4 in the appendix).
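The three cutoff formulas can be evaluated directly for p = 7 and n = 1257. The following Python sketch simply restates the rules of thumb given in the text; the function and key names are ours.

```python
import math

def outlier_cutoffs(p, n):
    """Rule-of-thumb cutoffs for a regression with p parameters and n observations."""
    return {
        "leverage": (2 * p + 2) / n,       # hat-matrix diagonal
        "cooks_distance": 4 / n,           # Cook's distance
        "dffits": 2 * math.sqrt(p / n),    # absolute DFFITS
    }

cutoffs = outlier_cutoffs(p=7, n=1257)
# leverage is about 0.0127, Cook's distance about 0.00318, DFFITS about 0.15
```

All three cutoffs shrink as n grows, so with more than a thousand observations even modest deviations are flagged, which is one reason the counts of flagged points are large.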
There are a number of events which can cause such extreme data: 1) earnings reported far from expectations of market participants, 2) merger/acquisition activity, and 3) changes in macroeconomic environment. We do not exclude these data from our analysis because they are part of the structure of financial markets. Market participants must continually monitor news sources for such extreme events, and realize that statistical models are unlikely to perform well if a crisis environment persists. There were additional outliers found in the data originally, which were found to be stock-splits. The data was adjusted for those events. The remaining outliers are all an expected part of stock market data. The 5-year period covered by our sample includes a housing bubble and a financial crisis, so it makes sense to have some extreme data points.
The regression analysis yields the following result (using the stock symbol to indicate a company’s log return). See Table 5 in the appendix for details of the parameter estimates.
MSFT = –0.0003 + 0.1233 × ADBE + 0.2107 × ADP + 0.0719 × DELL + 0.1875 × IBM + 0.2354 × ORCL + 0.0947 × GOOG
As an example of its predictive use, with the input data ADP = –0.050, IBM = –0.042, ORCL = –0.075, ADBE = –0.037, DELL = –0.045, GOOG = –0.071, we calculate the return for MSFT to be –0.0509, with a prediction interval of [–0.0784, –0.0233].
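The point prediction in this example can be reproduced by plugging the inputs into the fitted equation. The Python sketch below uses the unrounded parameter estimates from Table 5; it does not compute the prediction interval, which requires the full regression output. The names COEF and predict_msft are ours.

```python
# Parameter estimates from Table 5 (log-return regression for MSFT).
COEF = {
    "intercept": -0.00026956,
    "adbe": 0.12329,
    "adp": 0.21074,
    "dell": 0.07190,
    "ibm": 0.18750,
    "orcl": 0.23535,
    "goog": 0.09472,
}

def predict_msft(returns):
    """Predicted MSFT log return given the six predictor log returns."""
    return COEF["intercept"] + sum(COEF[name] * r for name, r in returns.items())

# Input data from the worked example in the text.
inputs = {"adp": -0.050, "ibm": -0.042, "orcl": -0.075,
          "adbe": -0.037, "dell": -0.045, "goog": -0.071}

predicted = predict_msft(inputs)  # about -0.0509
```

Because every coefficient is positive, a day on which all six stocks fall produces a predicted fall in MSFT, with ORCL and ADP contributing the largest shares.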
The model could be used to earn profits if there were deviations from the statistical patterns. For example, using the above prediction interval, if the values for the independent variables were actual data near the end of a trading day, and MSFT was outside the prediction interval, a position could be taken which would profit if MSFT stock moved so that its log return came inside the prediction interval before the close of the day.
However, there is always uncertainty in the markets, and there is no guarantee that at the close of any particular day, the stocks will align with their statistical past.
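The decision rule described above can be written as a simple check of the observed MSFT return against the prediction interval. This Python sketch is purely illustrative: the function name, the labels "long" and "short", and the observed values are hypothetical, and it ignores transaction costs and the uncertainty just mentioned.

```python
def position_signal(observed, lower, upper):
    """Suggest a position when the observed return lies outside the prediction interval."""
    if observed < lower:
        return "long"   # return is below the interval; profits if it rises back inside
    if observed > upper:
        return "short"  # return is above the interval; profits if it falls back inside
    return "none"       # inside the interval; no deviation to trade on

# Prediction interval from the worked example; the observed return is hypothetical.
signal = position_signal(observed=-0.09, lower=-0.0784, upper=-0.0233)
```

Here the hypothetical observed return of -0.09 lies below the interval, so the rule suggests a long position.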
While this 6-stock model shows good predictive value, it could be improved by adding macroeconomic variables to represent the categories 1=expansion, 2=recession, 3=transition. This may separate the outliers from the rest of the data, as the heavy tails in financial data tend to occur in clusters during recessionary times. One macroeconomic variable that is easily available on a daily basis is the level of long-term interest rates. An initial check shows that adding the 10-year bond as an independent variable improves the adjusted R² from 0.4993 to 0.5004.
Another possible improvement to the model is to allow the variance, or volatility, to be random. The common assumption that stock price data is lognormal dates back to the 1970s, and there have been many additional models since then. In addition to the macroeconomic variables mentioned in the previous paragraph, newer statistical models include a distribution of price or volatility jumps. These models are used more often in option markets than stock markets, and show good results in duplicating the skew found in those markets.
Overall, within the context of a multiple regression model for the response variable MSFT vs. the 10 other Information Technology stocks, we found that the best model used six of them as independent variables, with ORCL showing the strongest relationship with MSFT. Our equation above shows the estimated relationship, which accounts for almost 50% of the variance in Microsoft’s stock returns.
Appendix
Microsoft: develops, manufactures, licenses, sells, and supports software products. The Company offers operating system software, server application software, business and consumer applications software, software development tools, and Internet and intranet software. Microsoft also develops video game consoles and digital music entertainment devices. Listed on NASDAQ
Apple Computers: designs, manufactures, and markets personal computers and related personal computing and mobile communication devices along with a variety of related software, services, peripherals, and networking solutions. The Company sells its products worldwide through its online stores, its retail stores, its direct sales force, third-party wholesalers, and resellers. Listed on NASDAQ
Adobe Systems Inc.: develops, markets, and supports computer software products and technologies. The Company's products allow users to express and use information across all print and electronic media. Adobe offers a line of application software products, type products, and content for creating, distributing, and managing information. Listed on NASDAQ
Automatic Data Processing: is a global provider of business outsourcing solutions. The Company's services include a wide range of human resource, payroll, tax and benefits administration solutions. Automatic Data also provides solutions to auto, truck, motorcycle, and marine and recreational vehicle dealers. Listed on NASDAQ
Advanced Micro Devices: manufactures semiconductor products. The Company manufactures products that include microprocessors, embedded microprocessors, chipsets, graphics, video and multimedia products. Advanced Micro Devices, Inc. offers its products on a global basis. Listed on NYSE
Dell: offers a wide range of computers and related products. The Company sells personal computers, servers and networking products, storage systems, mobility products, software and peripherals, and services. Listed on NASDAQ
Google, Inc: is a global technology company that provides a web based search engine through its website. The Company offers a wide range of search options, including web, image, groups, directory, and news searches. Listed on NASDAQ
Hewlett-Packard Co: provides imaging and printing systems, computing systems, and information technology services for business and home. The Company's products include laser and inkjet printers, scanners, copiers and faxes, personal computers, workstations, storage solutions, and other computing and printing systems. Hewlett-Packard sells its products worldwide. Listed on NYSE (DJIA component)
International Business Machines: provides computer solutions through the use of advanced information technology. The Company's solutions include technologies, systems, products, services, software, and financing. IBM offers its products through its global sales and distribution organization, as well as through a variety of third party distributors and resellers. Listed on NYSE (DJIA component)
Oracle Corp.: supplies software for enterprise information management. The Company offers databases and relational servers, application development and decision support tools, and enterprise business applications. Oracle's software runs on network computers, personal digital assistants, set-top devices, PCs, workstations, minicomputers, mainframes, and massively parallel computers. Listed on NASDAQ
Yahoo Inc.: is a global Internet media company that offers an online guide to Web navigation, aggregated information content, communication services, and commerce. The Company's site includes a hierarchical, subject-based directory of Web sites, which enables users to locate and access information and services through hypertext links included in the directory. Listed on NASDAQ
Graph 1: MSFT vs. ORCL
Graph 2: MSFT vs. YHOO
Table 1:
Summary of Forward Selection
      Variable  Number   Partial   Model
Step  Entered   Vars In  R-Square  R-Square  C(p)     F Value  Pr > F
1     orcl      1        0.3560    0.3560    360.649  693.77   <.0001
2     adp       2        0.0786    0.4346    165.712  174.32   <.0001
3     adbe      3        0.0342    0.4688    81.9272  80.76    <.0001
4     ibm       4        0.0172    0.4860    40.8118  41.92    <.0001
5     goog      5        0.0106    0.4967    16.1526  26.44    <.0001
6     dell      6        0.0050    0.5017    5.5647   12.60    0.0004
Table 2:
Summary of Backward Elimination
      Variable  Number   Partial   Model
Step  Removed   Vars In  R-Square  R-Square  C(p)    F Value  Pr > F
1     amd       9        0.0000    0.5027    9.0617  0.06     0.8039
2     yhoo      8        0.0003    0.5024    7.7288  0.67     0.4140
3     hp        7        0.0002    0.5022    6.3297  0.60     0.4382
4     aapl      6        0.0005    0.5017    5.5647  1.24     0.2663
Table 3:
Summary of Stepwise Selection
      Variable  Variable  Number   Partial   Model
Step  Entered   Removed   Vars In  R-Square  R-Square  C(p)     F Value  Pr > F
1     orcl                1        0.3560    0.3560    360.649  693.77   <.0001
2     adp                 2        0.0786    0.4346    165.712  174.32   <.0001
3     adbe                3        0.0342    0.4688    81.9272  80.76    <.0001
4     ibm                 4        0.0172    0.4860    40.8118  41.92    <.0001
5     goog                5        0.0106    0.4967    16.1526  26.44    <.0001
6     dell                6        0.0050    0.5017    5.5647   12.60    0.0004
Table 4:
Studentized Residual analysis 17:54 Saturday, November 7, 2009 2156
Obs    studres  Date      msft    ibm     orcl    goog
1      -7.457   20090122  -0.125  -0.015  -0.014   0.011
2      -7.321   20060428  -0.121  -0.019  -0.023  -0.005
3      -6.961   20090724  -0.086   0.005   0.006   0.021
4      -6.085   20041115  -0.090   0.006  -0.029   0.016
5      -4.554   20080718  -0.062   0.026   0.015  -0.103
6      -4.396   20080201  -0.068   0.018   0.006  -0.090
7      -4.027   20080425  -0.064  -0.009  -0.019   0.002
8      -3.714   20090403  -0.028   0.014   0.025   0.020
9      -3.073   20090107  -0.062  -0.016  -0.041  -0.037
1251    3.192   20060127   0.048   0.004  -0.002  -0.002
1252    3.228   20060818   0.043   0.007  -0.006  -0.006
1253    4.084   20060721   0.044  -0.008   0.000   0.008
1254    5.383   20081121   0.116   0.043   0.062   0.011
1255    5.916   20071026   0.091   0.008   0.017   0.009
1256    6.250   20090424   0.100  -0.013   0.006   0.012
1257    6.465   20081013   0.171   0.050   0.123   0.138
Table 5:
               Parameter    Standard                                  Variance
Variable   DF  Estimate     Error       t Value  Pr > |t|  Tolerance  Inflation
Intercept  1   -0.00026956  0.00039454  -0.68    0.4946    .          0
adbe       1    0.12329     0.02137      5.77    <.0001    0.53758    1.86017
adp        1    0.21074     0.03348      6.30    <.0001    0.56917    1.75695
dell       1    0.07190     0.02025      3.55    0.0004    0.65733    1.52130
ibm        1    0.18750     0.03559      5.27    <.0001    0.51282    1.94999
orcl       1    0.23535     0.02604      9.04    <.0001    0.53531    1.86808
goog       1    0.09472     0.01982      4.78    <.0001    0.69456    1.43975
SAS Code:
/* Analysis of associations between daily log stock returns of 11 software companies -
   Microsoft, Apple, Adobe, ADP, Advanced Micro Devices,
   Dell, Hewlett-Packard, IBM, Oracle, Yahoo, Google */
proc import datafile="D:\DePaul\logret_11stocks.csv" out=stocks dbms=csv replace;
  getnames=yes;
run;
proc print data=stocks;
run;

/* check correlations */
proc corr;
  var msft aapl adbe adp amd dell hp ibm orcl yhoo goog;
run;

/* individual scatterplots with msft as dependent variable */
proc gplot;
  plot msft*aapl msft*adbe msft*adp msft*amd msft*dell msft*hp msft*ibm msft*orcl msft*yhoo msft*goog;
run;

/* analysis of individual stock returns */
proc univariate normal;
  var msft aapl adbe adp amd dell hp ibm orcl yhoo goog;
  histogram / normal;
  title "Normal probability plots";
  probplot / normal(mu=est sigma=est);
run;

/* regression analysis */
symbol value="plus" color=black;
title "Regression analysis";
proc reg;
  /* model selection methods */
  model msft = aapl adbe adp amd dell hp ibm orcl yhoo goog / selection=cp slstay=0.05;
  model msft = aapl adbe adp amd dell hp ibm orcl yhoo goog / selection=adjrsq;
  model msft = aapl adbe adp amd dell hp ibm orcl yhoo goog / selection=forward details slentry=0.05 slstay=0.05;
  model msft = aapl adbe adp amd dell hp ibm orcl yhoo goog / selection=backward slstay=0.05;
  model msft = aapl adbe adp amd dell hp ibm orcl yhoo goog / selection=stepwise slstay=0.05;
  /* Residual plot: residuals vs x-variables */
  plot student.*(aapl adbe adp amd dell hp ibm orcl yhoo goog);
  /* Residual plot: residuals vs predicted values */
  plot student.*predicted.;
  /* Normal probability plot of the studentized residuals */
  plot npp.*student.;
run;

/* reduced model */
proc reg;
  /* check for collinearity with variance inflation factor */
  model msft = adbe adp dell ibm orcl goog / vif tol;
  /* check for influence of outliers */
  model msft = adbe adp dell ibm orcl goog / r influence;
  /* save predicted values and influence statistics for the residual analysis */
  output out=regout1 p=pred cookd=cookd h=levg student=studres dffits=dfts press=press;
  /* Residual plot: residuals vs x-variables */
  plot student.*(adbe adp dell ibm orcl goog);
  /* Residual plot: residuals vs predicted values */
  plot student.*predicted.;
  plot npp.*student.;
run;
proc export data=regout1 outfile="D:\DePaul\regout1.csv" dbms=csv replace;
run;

/* analysis of residuals */
proc univariate data=regout1;
  var studres;
run;
proc sort data=regout1;
  by studres;
run;
proc print data=regout1;
  var studres Date msft ibm orcl goog;
  where abs(studres) > 3;
  format studres msft ibm orcl goog 6.3;
run;

/* predict new observation */
data newobs;
  input date msft aapl adbe adp amd dell hp ibm orcl yhoo goog;
  datalines;
. . . -0.037 -0.050 . -0.045 . -0.042 -0.075 . -0.071
;
data predict;
  set newobs stocks;
run;
proc reg;
  model msft = adbe adp dell ibm orcl goog / cli clm;
run;