Statistics for Finance 1. Lecture 4: Regression.

The theme of this lecture is how to fit data to certain functions. Suppose that we collect data $(x_i, y_i)$, for $i = 1, 2, \ldots, n$, and we would like to fit them to a straight line of the form $y = \beta_0 + \beta_1 x$. Then, essentially, we need to estimate the intercept $\beta_0$ and the slope $\beta_1$. A popular method to do this is the Least Squares Method, which amounts to finding $\beta_0, \beta_1$ that minimise the quantity
\[
S(\beta_0, \beta_1) = \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2.
\]
To do so we differentiate with respect to $\beta_0, \beta_1$,
\[
\frac{\partial S}{\partial \beta_0} = -2 \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i), \qquad
\frac{\partial S}{\partial \beta_1} = -2 \sum_{i=1}^n x_i (y_i - \beta_0 - \beta_1 x_i),
\]
and setting the derivatives equal to zero we get
\[
\sum_{i=1}^n y_i = n\hat\beta_0 + \hat\beta_1 \sum_{i=1}^n x_i, \qquad
\sum_{i=1}^n x_i y_i = \hat\beta_0 \sum_{i=1}^n x_i + \hat\beta_1 \sum_{i=1}^n x_i^2.
\]
Solving for $\hat\beta_0, \hat\beta_1$ we get
\[
\hat\beta_0 = \frac{\sum_{i=1}^n x_i^2 \sum_{i=1}^n y_i - \sum_{i=1}^n x_i \sum_{i=1}^n x_i y_i}{n \sum_{i=1}^n x_i^2 - \left(\sum_{i=1}^n x_i\right)^2} \tag{1}
\]
\[
\hat\beta_1 = \frac{n \sum_{i=1}^n x_i y_i - \sum_{i=1}^n x_i \sum_{i=1}^n y_i}{n \sum_{i=1}^n x_i^2 - \left(\sum_{i=1}^n x_i\right)^2} \tag{2}
\]
and it can be shown that
\[
\hat\beta_0 = \bar y - \hat\beta_1 \bar x \tag{3}
\]
\[
\hat\beta_1 = \frac{\sum_{i=1}^n (x_i - \bar x)(y_i - \bar y)}{\sum_{i=1}^n (x_i - \bar x)^2}. \tag{4}
\]
The $x$ variables are often called the predictor variables and the $y$ variables are called the response variables.

More complicated examples arise when we want to fit data to linear functions of more than one variable, e.g.
\[
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3.
\]
Be careful not to confuse the variables $x_1, x_2, x_3$ that appear above with the data values $x_i$, for $i = 1, 2, \ldots$, that appeared before. The data values, now, should be represented as $x_{ij}$, for $i = 1, 2, \ldots, n$ and $j = 1, 2, 3$. Following the least squares method, the problem amounts to finding the constants $\hat\beta_0, \hat\beta_1, \hat\beta_2, \hat\beta_3$ that minimise the functional
\[
S(\beta_0, \beta_1, \beta_2, \beta_3) = \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_{i1} - \beta_2 x_{i2} - \beta_3 x_{i3})^2.
\]
This can also be done following the previous procedure, i.e. differentiating with respect to the parameters and solving the resulting equations.

Often one would like to fit data to nonlinear functions, such as $f(t) = A e^{-at} + B e^{-bt}$. The least squares functional in this case would be
\[
S(A, B, a, b) = \sum_{i=1}^n (y_i - A e^{-a t_i} - B e^{-b t_i})^2
\]
and in this case the equations that one is led to are nonlinear and usually cannot be solved explicitly. One then needs to resort to an iterative procedure. For our purposes we will focus on fitting data to linear functions. Let us remark, though, that fitting data to nonlinear functions may sometimes be reduced to fitting data to linear functions. For example, suppose we want to fit data to the function
\[
y = \beta_0 e^{-\beta_1 x},
\]
where the unknown parameters are $\beta_0, \beta_1$. This nonlinear function can be turned into a linear one by taking logarithms,
\[
\log y = \log \beta_0 - \beta_1 x,
\]
and then we can use the data $(x_i, \log y_i)$ to find the parameters from the functional
\[
S(\beta_0, \beta_1) = \sum_{i=1}^n (\log y_i - \log \beta_0 + \beta_1 x_i)^2.
\]

Definition 1. A linear functional $f(x_1, x_2, \ldots, x_{p-1}) = \beta_0 + \beta_1 x_1 + \cdots + \beta_{p-1} x_{p-1}$ is called a linear regression of $y$ on $x_1, \ldots, x_{p-1}$.

1.1. Statistical Properties. Almost always there is some "noise" in the data which affects the reliability of the estimation of the parameters of the linear regression. A simple way to model the presence of noise is by considering independent random variables $e_i$, $i = 1, 2, \ldots$, with mean zero and variance $\sigma^2$. Then we can assert that the observed value of $y$ is a linear function of $x$ plus the noise, i.e.
\[
y_i = \beta_0 + \beta_1 x_i + e_i, \qquad i = 1, 2, \ldots, n.
\]
This is known as the standard statistical model. These equations can be written as
\[
y_i - e_i = \beta_0 + \beta_1 x_i, \qquad i = 1, 2, \ldots, n, \tag{5}
\]
and therefore we can use equations (1), (2) to derive the estimators for $\beta_0, \beta_1$.
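To make these formulas concrete, here is a minimal Python sketch (not part of the original notes; the data values are made up for illustration) computing $\hat\beta_0, \hat\beta_1$ from the centred form (3), (4), and checking against the direct solution (1), (2) of the normal equations:

```python
import numpy as np

def least_squares_line(x, y):
    """Fit y = b0 + b1*x by least squares, using relations (3) and (4)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xbar, ybar = x.mean(), y.mean()
    b1 = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)  # relation (4)
    b0 = ybar - b1 * xbar                                           # relation (3)
    return b0, b1

# Check against the direct solution of the normal equations.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # illustrative data, not from the notes
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
n = len(x)
den = n * np.sum(x ** 2) - np.sum(x) ** 2
b0_direct = (np.sum(x ** 2) * np.sum(y) - np.sum(x) * np.sum(x * y)) / den  # (1)
b1_direct = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / den               # (2)
print(least_squares_line(x, y))        # agrees with the line below
print(b0_direct, b1_direct)
```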
Notice that, because of the presence of the noise $e_i$, $i = 1, 2, \ldots$, the estimators $\hat\beta_0$ and $\hat\beta_1$ will be random variables.

Theorem 1. Under the assumptions of the standard statistical model the least squares estimates are unbiased, i.e. $E[\hat\beta_j] = \beta_j$, for $j = 0, 1$.

Proof. From the assumption that $E[e_i] = 0$ we have that $E[y_i] = \beta_0 + \beta_1 x_i$. Therefore from equation (1) we get that
\[
E[\hat\beta_0]
= \frac{\sum_{i=1}^n x_i^2 \sum_{i=1}^n E[y_i] - \sum_{i=1}^n x_i \sum_{i=1}^n x_i E[y_i]}{n \sum_{i=1}^n x_i^2 - \left(\sum_{i=1}^n x_i\right)^2}
= \frac{\sum_{i=1}^n x_i^2 \left(n\beta_0 + \beta_1 \sum_{i=1}^n x_i\right) - \sum_{i=1}^n x_i \left(\beta_0 \sum_{i=1}^n x_i + \beta_1 \sum_{i=1}^n x_i^2\right)}{n \sum_{i=1}^n x_i^2 - \left(\sum_{i=1}^n x_i\right)^2}
= \beta_0.
\]
The proof corresponding to $\beta_1$ is similar (Exercise).

Theorem 2. Under the assumptions of the standard statistical model we have
\[
\mathrm{Var}(\hat\beta_0) = \frac{\sigma^2 \sum_{i=1}^n x_i^2}{n \sum_{i=1}^n x_i^2 - \left(\sum_{i=1}^n x_i\right)^2}, \qquad
\mathrm{Var}(\hat\beta_1) = \frac{n \sigma^2}{n \sum_{i=1}^n x_i^2 - \left(\sum_{i=1}^n x_i\right)^2},
\]
\[
\mathrm{Cov}(\hat\beta_0, \hat\beta_1) = \frac{-\sigma^2 \sum_{i=1}^n x_i}{n \sum_{i=1}^n x_i^2 - \left(\sum_{i=1}^n x_i\right)^2}.
\]

In the previous theorem we see that the variances depend on the $x_i$'s and on the variance $\sigma^2$, which we therefore need to estimate. Writing $e_i = y_i - \beta_0 - \beta_1 x_i$, it is natural to try to estimate the variance $\sigma^2$ from the deviations of the data from the fitted line, i.e. from $y_i - \hat\beta_0 - \hat\beta_1 x_i$. We define the residual sum of squares (RSS) by
\[
\mathrm{RSS} = \sum_{i=1}^n (y_i - \hat\beta_0 - \hat\beta_1 x_i)^2.
\]
It can be shown that the quantity
\[
s^2 = \frac{\mathrm{RSS}}{n-2} \tag{6}
\]
is an unbiased estimator of $\sigma^2$. The divisor is $n - 2$ because two parameters ($\hat\beta_0, \hat\beta_1$) have been estimated from the data, thereby reducing the degrees of freedom to $n - 2$. Once we estimate $\sigma^2$ we can estimate the variances of $\hat\beta_0, \hat\beta_1$ by the formulae of Theorem 2, where we replace $\sigma^2$ by $s^2$. The estimators for the variances will be denoted by $s^2_{\hat\beta_0}, s^2_{\hat\beta_1}$.

1.2. Assessing the Fit. The residuals introduced in the previous section, i.e. $\hat e_i = y_i - \hat\beta_0 - \hat\beta_1 x_i$, can be used in assessing the quality of the fit. Often we plot the residuals versus the $x$ values. Such plots may reveal systematic misfit. Since the noise $e_i$ is considered to be a collection of independent random variables, the residuals should bear no relation to the $x$ values, and ideally the plot should look like a horizontal blur. For example, let us look at the following data examining the relationship between the depth of a stream and the rate of its flow:

Depth   Flow
 .32    .636
 .29    .319
 .28    .734
 .42   1.327
 .29    .487
 .41    .924
 .76   7.350
 .73   5.890
 .46   1.979
 .40   1.124

[Figures: the least squares line fitted to the data, and the corresponding residual plot.]

The above diagrams indicate some deviations from a linear fit. This is a bit more apparent from the residual plot. One can attempt to investigate a possible nonlinear dependence and therefore plot the log values of the data.

[Figures: the least squares line and residual plot for the log-transformed data.]

In this case the data seem to fit the line better, and the residuals are fairly evenly scattered. Normal probability plots can also be used to assess the fit. For examples, see the book of Rice, page 556.

1.3. Correlation and Regression. We will explore the relation between correlation and fitting data by the least squares method. First we have
\[
s_{xx} = \frac{1}{n} \sum_{i=1}^n (x_i - \bar x)^2, \tag{7}
\]
\[
s_{yy} = \frac{1}{n} \sum_{i=1}^n (y_i - \bar y)^2, \tag{8}
\]
\[
s_{xy} = \frac{1}{n} \sum_{i=1}^n (x_i - \bar x)(y_i - \bar y), \tag{9}
\]
the sample variances and covariance of the predictor and the response. We also have the sample correlation
\[
r = \frac{s_{xy}}{\sqrt{s_{xx} s_{yy}}}.
\]
As you may have seen from the homework, the least squares slope is given by
\[
\hat\beta_1 = \frac{s_{xy}}{s_{xx}},
\]
and so the sample correlation is given by
\[
r = \hat\beta_1 \sqrt{\frac{s_{xx}}{s_{yy}}}.
\]
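As a check on these identities, the following Python snippet (a sketch, not part of the original notes) computes $r$ both ways for the depth–flow data of Section 1.2, along with the estimate $s^2$ of relation (6):

```python
import numpy as np

# Depth-flow data from Section 1.2.
x = np.array([.32, .29, .28, .42, .29, .41, .76, .73, .46, .40])                  # depth
y = np.array([.636, .319, .734, 1.327, .487, .924, 7.350, 5.890, 1.979, 1.124])  # flow

n = len(x)
sxx = np.mean((x - x.mean()) ** 2)               # relation (7)
syy = np.mean((y - y.mean()) ** 2)               # relation (8)
sxy = np.mean((x - x.mean()) * (y - y.mean()))   # relation (9)

b1 = sxy / sxx                                   # least squares slope
b0 = y.mean() - b1 * x.mean()                    # relation (3)

r_direct = sxy / np.sqrt(sxx * syy)              # sample correlation
r_via_slope = b1 * np.sqrt(sxx / syy)            # the identity above
print(r_direct, r_via_slope)                     # the two values coincide

rss = np.sum((y - b0 - b1 * x) ** 2)             # residual sum of squares
s2 = rss / (n - 2)                               # unbiased estimate of sigma^2, relation (6)
print(s2)
```

Plotting the residuals `y - b0 - b1 * x` against `x` reproduces the residual diagram discussed above.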
If we denote the regression response variable by $\hat y$, i.e. $\hat y = \hat\beta_0 + \hat\beta_1 x$, then we have that (exercise)
\[
\frac{\hat y - \bar y}{\sqrt{s_{yy}}} = r \, \frac{x - \bar x}{\sqrt{s_{xx}}}. \tag{10}
\]
The interpretation of this equation is as follows: suppose that $r > 0$ and that the predictor variable is one standard deviation greater than its average. Then the predicted response deviates from its mean by $r$ standard deviations. Notice that always $|r| \le 1$, therefore the predicted response tends to deviate from its mean by less than the predictor does.

1.4. Matrix Approach to Least Squares. Matrix analysis offers a convenient way to represent and analyse linear equations. Suppose that the linear model
\[
y = \beta_0 + \beta_1 x_1 + \cdots + \beta_{p-1} x_{p-1}
\]
is to fit the data $y_i, x_{i1}, \ldots, x_{i,p-1}$, $i = 1, \ldots, n$. Then the observations $(y_1, \ldots, y_n)$ will be represented by a vector $\mathbf{Y}$ and the unknowns $(\beta_0, \ldots, \beta_{p-1})$ will be represented by a vector $\boldsymbol\beta$. Finally, we will have the $n \times p$ matrix
\[
\mathbf{X} = \begin{pmatrix}
1 & x_{11} & \ldots & x_{1,p-1} \\
1 & x_{21} & \ldots & x_{2,p-1} \\
\vdots & \vdots & & \vdots \\
1 & x_{n1} & \ldots & x_{n,p-1}
\end{pmatrix}.
\]
Then the vector of the fitted or predicted values $\hat{\mathbf{Y}}$ can be written as
\[
\hat{\mathbf{Y}} = \mathbf{X} \boldsymbol\beta.
\]
The least squares problem can then be phrased as finding the vector $\boldsymbol\beta$ that minimises
\[
S(\boldsymbol\beta) = \sum_{i=1}^n (y_i - \beta_0 - \cdots - \beta_{p-1} x_{i,p-1})^2
= \|\mathbf{Y} - \mathbf{X}\boldsymbol\beta\|^2 = \|\mathbf{Y} - \hat{\mathbf{Y}}\|^2,
\]
where we used the notation $\|\mathbf{u}\|^2 = \sum_{i=1}^n u_i^2$ for a vector $\mathbf{u}$. If $\mathbf{A}$ is a matrix then $\mathbf{A}^T$ is its transpose, meaning $A^T_{ij} = A_{ji}$. We also have that
\[
S(\boldsymbol\beta) = \|\mathbf{Y} - \mathbf{X}\boldsymbol\beta\|^2
= (\mathbf{Y} - \mathbf{X}\boldsymbol\beta)^T (\mathbf{Y} - \mathbf{X}\boldsymbol\beta)
= \mathbf{Y}^T \mathbf{Y} - (\mathbf{X}\boldsymbol\beta)^T \mathbf{Y} - \mathbf{Y}^T (\mathbf{X}\boldsymbol\beta) + (\mathbf{X}\boldsymbol\beta)^T (\mathbf{X}\boldsymbol\beta).
\]
To find the minimiser $\boldsymbol\beta$ we need to solve the equation $\nabla_{\boldsymbol\beta} S(\boldsymbol\beta) = 0$, which reads
\[
\mathbf{X}^T \mathbf{X} \hat{\boldsymbol\beta} = \mathbf{X}^T \mathbf{Y}.
\]
If $\mathbf{X}^T \mathbf{X}$ is nonsingular, i.e. invertible, then the solution of the above equation is
\[
\hat{\boldsymbol\beta} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{Y}.
\]
We can also incorporate in the above formulation the noise $e_i$, $i = 1, 2, \ldots, n$, as a noise vector $\mathbf{e} = (e_1, \ldots, e_n)^T$. Equations (5) then become
\[
\mathbf{Y} = \mathbf{X}\boldsymbol\beta + \mathbf{e}.
\]
The covariance matrix of the vector $\mathbf{e}$ is
\[
\boldsymbol\Sigma = \sigma^2 \mathbf{I},
\]
where $\mathbf{I}$ is the identity matrix and $\sigma^2 = \mathrm{Var}(e_i)$, while we assume that the $e_i$'s are i.i.d. normal with mean zero.

We can reprove that the least squares estimator $\hat{\boldsymbol\beta}$ is unbiased, since
\[
\hat{\boldsymbol\beta} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}
= (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T(\mathbf{X}\boldsymbol\beta + \mathbf{e})
= \boldsymbol\beta + (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{e}
\]
and therefore
\[
E[\hat{\boldsymbol\beta}] = \boldsymbol\beta + (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T E[\mathbf{e}] = \boldsymbol\beta.
\]
Using this formulation we can also compute the covariance matrix of the least squares estimator. To do this we need the following theorem.

Theorem 3. Let $\mathbf{Z} = \mathbf{c} + \mathbf{A}\mathbf{Y}$, where $\mathbf{Y}$ is a random vector, $\mathbf{A}$ is a fixed, nonrandom matrix and $\mathbf{c}$ a constant, nonrandom vector. Let $\boldsymbol\Sigma_{YY}$ be the covariance matrix of $\mathbf{Y}$; then the covariance matrix of $\mathbf{Z}$ is $\boldsymbol\Sigma_{ZZ} = \mathbf{A} \boldsymbol\Sigma_{YY} \mathbf{A}^T$.

Using this theorem we can prove that the covariance matrix of the least squares estimator is
\[
\boldsymbol\Sigma_{\hat\beta\hat\beta} = \sigma^2 (\mathbf{X}^T\mathbf{X})^{-1}. \tag{11}
\]
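A minimal sketch of this computation in Python (variable and function names are mine, not from the notes). In practice one would solve the normal equations with `numpy.linalg.lstsq` rather than forming the inverse explicitly, which is numerically safer; the explicit form below simply mirrors the formulas above:

```python
import numpy as np

def fit_linear_model(predictors, y, sigma2=None):
    """Least squares for y = b0 + b1*x_1 + ... + b_{p-1}*x_{p-1}.

    predictors: list of n-vectors (the columns x_{.,1}, ..., x_{.,p-1}).
    Returns beta_hat and, if sigma2 is given, the covariance of beta_hat.
    """
    y = np.asarray(y, dtype=float)
    # Design matrix X: a column of ones, then the predictor columns.
    X = np.column_stack([np.ones(len(y))] + [np.asarray(c, float) for c in predictors])
    XtX_inv = np.linalg.inv(X.T @ X)     # assumes X^T X is nonsingular
    beta_hat = XtX_inv @ (X.T @ y)       # beta_hat = (X^T X)^{-1} X^T Y
    cov = sigma2 * XtX_inv if sigma2 is not None else None  # relation (11)
    return beta_hat, cov

# With a single predictor this reproduces the scalar formulas (1), (2):
x = np.array([0.0, 1.0, 2.0, 3.0])   # illustrative data, not from the notes
y = np.array([1.1, 2.9, 5.2, 6.8])
print(fit_linear_model([x], y)[0])    # array [b0_hat, b1_hat]
```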
Some Financial Applications: Regression Hedging. We will present an application of regression to determining the optimal hedge of a bond position. This paragraph follows the exposition in Ruppert, p. 181, to which we refer for more details and applications.

Market makers buy securities at the bid price and make a profit by selling them at the ask price. Suppose a market maker has just purchased a bond from a pension fund, which he would ideally like to sell immediately; he may, however, be bound to sell it only after a certain time. The market maker is at risk that the bond price could drop due to a change in interest rates. In order to minimize this risk the market maker will resort to hedging. This means that he is willing to assume another risk that is likely to move in the opposite direction from the risk due to holding the bond. In this way the two risks may cancel.

To hedge the interest rate risk of the bond being held, the market maker can sell other, more liquid bonds short. For example, he might decide to sell a 30-year Treasury bond, which is quite liquid and can be sold immediately. Regression hedging determines the optimal amount of the 30-year Treasury to sell short in order to hedge the risk of the bond just purchased. In this way he hopes that the price of the portfolio consisting of the long position in the first bond and the short position in the 30-year Treasury changes little as yields change.

Suppose the first bond has a maturity of 25 years. Let $y_{30}$ be the yield on the 30-year bonds, i.e. the interest rate. Let $P_{30}$ be the price of \$1 in face amount of 30-year bonds, i.e. the amount paid to the holder at maturity, in this case 30 years from now. There is also a relevant quantity called the duration, $\mathrm{DUR}_{30}$, such that the change in price, $\Delta P_{30}$, and the change in yield, $\Delta y_{30}$, are related by
\[
\Delta P_{30} \simeq -P_{30}\, \mathrm{DUR}_{30}\, \Delta y_{30}
\]
for small values of $\Delta y_{30}$. (Duration is the weighted average of the maturities of future cash flows, with weights proportional to the net present value of the cash flows. For a zero-coupon bond, duration equals time to maturity. Duration is a measure of interest rate risk.) A similar equation holds for the 25-year bond.

Consider, now, a portfolio that holds $F_{25}$ in face amount of 25-year bonds and is short $F_{30}$ in face amount of 30-year bonds. The value of the portfolio then is $F_{25} P_{25} - F_{30} P_{30}$. If $\Delta y_{30}, \Delta y_{25}$ are the changes in the yields, then the change in the value of the portfolio is approximately
\[
F_{30} P_{30}\, \mathrm{DUR}_{30}\, \Delta y_{30} - F_{25} P_{25}\, \mathrm{DUR}_{25}\, \Delta y_{25}.
\]
Suppose that the regression of $\Delta y_{30}$ on $\Delta y_{25}$ is
\[
\Delta y_{30} = \hat\beta_0 + \hat\beta_1 \Delta y_{25}.
\]
We also adopt the usual assumption that $\hat\beta_0 \simeq 0$. Then we get that the change in value is approximately
\[
\left( F_{30} P_{30}\, \mathrm{DUR}_{30}\, \hat\beta_1 - F_{25} P_{25}\, \mathrm{DUR}_{25} \right) \Delta y_{25}.
\]
This change is approximately zero if
\[
F_{30} = F_{25}\, \frac{P_{25}\, \mathrm{DUR}_{25}}{P_{30}\, \mathrm{DUR}_{30}\, \hat\beta_1},
\]
and this tells us how much face value of the 30-year bond we need to sell short in order to hedge $F_{25}$ face value of the 25-year bond.

1.5. Exercises.
1. Derive relations (3), (4).
2. Finish the proof of Theorem 1.
3. Prove Theorem 2. You may find the fact that $\sum_{i=1}^n (x_i - \bar x) = 0$, as well as Exercise 1, useful.
4. Prove relation (10).
5. Prove Theorem 3.
6. Prove relation (11).
7. A study of commercial bank branches obtains data on a number of independent businesses. It records the amount of money (in GBP 1000) each business deposits in a year ($x$) and the amount (in GBP 1000) it saves within this year ($y$). The data from the research are summarised below:

x : 31.5 33.1 27.4 24.5 27.0 27.8 23.3 24.7 16.9 18.1
y : 18.1 20.0 20.8 21.5 22.0 22.4 22.9 24.0 25.4 27.3

(I) Identify the least squares estimates for $\beta_0, \beta_1$ in the model $y = \beta_0 + \beta_1 x$.
(II) Predict $y$ for $x = 19.5$.
(III) Identify the sample standard deviation about the regression line, i.e. the residual standard deviation.
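As a footnote to the regression hedging discussion, a minimal Python sketch of the hedge-ratio formula. The numbers below are hypothetical, chosen only to illustrate the calculation; they do not come from the notes or from Ruppert:

```python
def hedge_face_amount(F25, P25, DUR25, P30, DUR30, beta1):
    """Face amount F30 of the 30-year bond to sell short so that the
    portfolio value is approximately insensitive to yield changes:
    F30 = F25 * (P25 * DUR25) / (P30 * DUR30 * beta1)."""
    return F25 * (P25 * DUR25) / (P30 * DUR30 * beta1)

# Hypothetical illustration: hedge $1,000,000 face value of the 25-year bond.
F30 = hedge_face_amount(F25=1_000_000, P25=0.85, DUR25=16.0,
                        P30=0.80, DUR30=18.0, beta1=0.95)
print(round(F30))   # face amount of the 30-year Treasury to short
```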