Department of Statistics
Statistics 310 Lecture Notes
SECTION A: Simple & Multiple Regression Analysis
Compiled by Dr Paul J van Staden
© Copyright reserved

CONTENTS

A1. SIMPLE REGRESSION – THE NATURE OF REGRESSION ANALYSIS
    THE MODERN INTERPRETATION OF REGRESSION
    STATISTICAL VERSUS DETERMINISTIC RELATIONSHIPS
    REGRESSION VERSUS CAUSATION
    REGRESSION VERSUS CORRELATION
    TYPES OF DATA
        Time series data
        Cross-sectional data
        Pooled data
A2. SIMPLE REGRESSION – TWO-VARIABLE REGRESSION MODEL: BASIC IDEAS
    A HYPOTHETICAL EXAMPLE
    THE CONCEPT OF POPULATION REGRESSION FUNCTION (PRF)
    THE MEANING OF THE TERM LINEAR
    STOCHASTIC SPECIFICATION OF PRF
    THE SAMPLE REGRESSION FUNCTION (SRF)
A3. SIMPLE REGRESSION – TWO-VARIABLE REGRESSION MODEL: ESTIMATION
    METHOD OF ORDINARY LEAST SQUARES (OLS)
    CLASSICAL LINEAR REGRESSION MODEL: ASSUMPTIONS UNDERLYING METHOD OF LEAST SQUARES
    STATISTICAL PROPERTIES OF LEAST SQUARES ESTIMATORS: GAUSS-MARKOV THEOREM
    STANDARD ERRORS OF LEAST SQUARES ESTIMATORS
    COEFFICIENT OF DETERMINATION (R²): A MEASURE OF GOODNESS OF FIT
A4. SIMPLE REGRESSION – THE NORMALITY ASSUMPTION
    NORMALITY ASSUMPTION FOR STOCHASTIC ERROR TERM
    STATISTICAL PROPERTIES OF OLS ESTIMATORS UNDER THE NORMALITY ASSUMPTION
    THE METHOD OF MAXIMUM LIKELIHOOD (ML)
A5. SIMPLE REGRESSION – TWO-VARIABLE REGRESSION MODEL: INFERENCE
    THEOREMS FOR PROBABILITY DISTRIBUTIONS
    CONFIDENCE INTERVALS
        Confidence interval for β2
        Confidence interval for β1
        Confidence interval for σ²
    HYPOTHESIS TESTING
        Hypothesis testing for β2
        Hypothesis testing for β1
    ANALYSIS OF VARIANCE (ANOVA)
    PREDICTION
        Mean prediction
        Individual prediction
A6. SIMPLE REGRESSION – EXTENSIONS OF THE TWO-VARIABLE LINEAR REGRESSION MODEL
    REGRESSION THROUGH ORIGIN
    LOG-LINEAR MODEL
    LOG-LIN MODEL
    LIN-LOG MODEL
    RECIPROCAL MODEL
A7. MULTIPLE REGRESSION – ESTIMATION
    THREE-VARIABLE REGRESSION MODEL
    MEANING OF PARTIAL REGRESSION COEFFICIENTS
    OLS & ML ESTIMATION
    R² & ADJUSTED R²
    PARTIAL CORRELATION COEFFICIENTS
    POLYNOMIAL REGRESSION
A8. MULTIPLE REGRESSION – INFERENCE
    NORMALITY ASSUMPTION
    HYPOTHESIS TESTING
    PREDICTION
A9. MULTIPLE REGRESSION – MATRIX APPROACH TO REGRESSION ANALYSIS
    TWO-VARIABLE REGRESSION MODEL
    THREE-VARIABLE REGRESSION MODEL
    k-VARIABLE REGRESSION MODEL
    OLS ESTIMATION
    INFERENCE
    REFERENCES

A1. SIMPLE REGRESSION – THE NATURE OF REGRESSION ANALYSIS

THE MODERN INTERPRETATION OF REGRESSION

Regression analysis is concerned with the study of the dependence of one variable, the so-called dependent variable, on one or more other variables, referred to as the explanatory variables. With regression analysis the population mean value of the dependent variable is estimated and/or predicted in terms of the known values of the explanatory variables. Suppose, for example, we want to find out how the average height of sons changes, given their fathers' heights (Gujarati & Porter, 2009).
In other words, we want to predict the average height of a son given that we know the height of the father.

[Figure: Scatter diagram of the hypothetical distribution of sons' heights (cm) corresponding to given heights of their fathers (cm), with the regression line.]

Given a certain height for the fathers, we have a range (distribution) of heights for their sons. Furthermore, the average height of the sons increases as the height of the fathers increases. This is indicated by the regression line, which will be discussed in detail later.

STATISTICAL VERSUS DETERMINISTIC RELATIONSHIPS

We will deal with variables that are stochastic or random, that is, variables that have probability distributions. We will consider the statistical relationships between these variables. Consider the example on the heights of fathers and sons. The dependence of the height of a son on his father's height is statistical in nature, since we will not be able to predict the son's height exactly. This may be because of errors in the measurement of the variables as well as the absence of other variables that affect height but are not used in the analysis. There is therefore some random variability in the height that cannot be fully explained.

With deterministic dependency, the variables are not random or stochastic. An example is Ohm's law, which states: for metallic conductors over a limited range of temperature the current, $C$, is proportional to the voltage, $V$, so that $C = k_1 V$.

REGRESSION VERSUS CAUSATION

Regression analysis deals with the dependence of one variable on other variables. However, this does not necessarily imply causation. Consider the following example (data taken from Steyn et al., 1999):

[Figure: Scatter diagram of the annual number of road accidents in South Africa (thousands) against the annual production of eggs in South Africa (million dozen).]

As the annual production of eggs increases, the average annual number of road accidents increases. There is therefore a statistical relationship between the two variables. However, we cannot logically explain this dependence. For causality we need a priori or theoretical considerations.

REGRESSION VERSUS CORRELATION

Regression analysis and correlation analysis are closely related, but have some fundamental differences. We have already defined regression analysis as the study of the dependence (linear or nonlinear) of the dependent variable on one or more explanatory variables. We assume that the dependent variable is stochastic (random) with a probability distribution. The explanatory variables are assumed to have fixed values (in repeated sampling).

With correlation analysis the strength of the linear association between two variables is measured. We assume that both variables are stochastic. There is furthermore no distinction between dependent and explanatory variables.
This means that the correlation between the height of a son and the height of his father is the same as the correlation between the height of a father and the height of his son.

TYPES OF DATA

Time series data

A time series is a set of values that are observed sequentially over time. The data is typically collected at regular intervals – daily (e.g. stock prices), monthly (e.g. the consumer price index), quarterly (e.g. private consumption expenditure) and annually. Consider as an example the annual production of eggs in South Africa (Steyn et al., 1999).

[Figure: Time plot of the annual production of eggs in South Africa (million dozen), 1959–1992.]

Cross-sectional data

Cross-sectional data are observations of a variable that are collected at the same point in time. Consider as an example the production of eggs in the 50 states of the USA in 1990 (Gujarati & Porter, 2009). The histogram is an appropriate graphical representation for cross-sectional data.

[Figure: Histogram of the production of eggs (millions) in the 50 states of the USA in 1990.]

Pooled data

Pooled data is a combination of time series and cross-sectional data. For example, if the production of eggs is observed for each of the 50 states in the USA for a number of years, then the data is referred to as pooled data. Panel data is a special type of pooled data where the same cross-sectional unit is observed over time. The above-mentioned example is therefore an example of panel data, since we have observations for the same 50 states at each time point.

A2. SIMPLE REGRESSION – TWO-VARIABLE REGRESSION MODEL: BASIC IDEAS

A HYPOTHETICAL EXAMPLE

Suppose we have a country with a total population of 60 families and that we want to study the relationship between weekly family consumption expenditure, $Y$, and weekly income, $X$ (Gujarati & Porter, 2009, p. 34). Assume that we want to predict the expected (mean) level of consumption expenditure given that we know the income. Consider the following hypothetical dataset of weekly family consumption expenditure, $Y$, for each level of weekly family income, $X$:

X:        80    100    120    140    160    180    200    220    240    260
--------------------------------------------------------------------------
          55     65     79     80    102    110    120    135    137    150
          60     70     84     93    107    115    136    137    145    152
          65     74     90     95    110    120    140    140    155    175
          70     80     94    103    116    130    144    152    165    178
          75     85     98    108    118    135    145    157    175    180
                 88           113    125    140           160    189    185
                              115                         162           191
Σ(Y|X):  325    462    445    707    678    750    685   1043    966   1211
E(Y|X):   65     77     89    101    113    125    137    149    161    173

Corresponding to a given income level, we have a range of values for the consumption expenditure. In other words, we have a distribution of values for $Y$ for a fixed value of $X$. Since $Y$ is conditional upon the given values of $X$, this distribution is referred to as a conditional distribution.

The expected value of $Y$ is $E(Y) = \frac{1}{n}\sum Y = \frac{1}{60} \times 7272 = 121.2$, so the mean level of weekly consumption expenditure is $121.20. This is an unconditional mean, since the level of income is not taken into account.

For each level of income, a conditional expected value of consumption expenditure can be calculated. For example, $E(Y \mid X = 80) = \frac{1}{5}\sum (Y \mid X = 80) = \frac{1}{5} \times 325 = 65$. Thus, given a weekly income level of $80, the expected weekly consumption expenditure is $65. In general the conditional expected value is calculated with the formula $E(Y \mid X = X_i) = \sum [Y \times p(Y \mid X = X_i)]$, where $p(Y \mid X = X_i)$ is the conditional probability. For instance, there are 5 families with a weekly income level of $80. Assuming that one of these families is randomly selected, the probability of selecting each family is $\frac{1}{5}$. Therefore $E(Y \mid X = 80) = \sum [Y \times p(Y \mid X = 80)] = \frac{1}{5} \times 325 = 65$.
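The unconditional and conditional means can be verified directly from the table. Below is a minimal Python sketch (an illustration, not part of the original notes) that reproduces $E(Y) = 121.2$ and the conditional means $E(Y \mid X)$:

```python
# Minimal sketch (not from the notes): unconditional and conditional means
# of weekly consumption expenditure from the table above.
y_given_x = {
    80:  [55, 60, 65, 70, 75],
    100: [65, 70, 74, 80, 85, 88],
    120: [79, 84, 90, 94, 98],
    140: [80, 93, 95, 103, 108, 113, 115],
    160: [102, 107, 110, 116, 118, 125],
    180: [110, 115, 120, 130, 135, 140],
    200: [120, 136, 140, 144, 145],
    220: [135, 137, 140, 152, 157, 160, 162],
    240: [137, 145, 155, 165, 175, 189],
    260: [150, 152, 175, 178, 180, 185, 191],
}

all_y = [y for ys in y_given_x.values() for y in ys]
print(sum(all_y) / len(all_y))        # unconditional mean: 121.2

for x, ys in y_given_x.items():
    print(x, sum(ys) / len(ys))       # conditional means: 65, 77, ..., 173
```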
Consider a scatter diagram of the weekly consumption expenditure against the weekly income:

[Figure: Conditional distribution of weekly consumption expenditure ($) for various levels of weekly income ($), with the population regression line through the conditional means.]

Note again that for a given level of income, there is a range of values for consumption expenditure. Furthermore, on average the consumption expenditure increases as the income increases. In other words, the conditional expected values of $Y$ increase as the values of $X$ increase. The conditional expected values lie on a straight line with a positive slope. This line is called the population regression line and it represents the regression of $Y$ on $X$. In general we will have a population regression curve (the population regression line is a special case).

THE CONCEPT OF POPULATION REGRESSION FUNCTION (PRF)

The conditional expected value, $E(Y \mid X_i)$, is a function of $X_i$, $E(Y \mid X_i) = f(X_i)$, and is called the population regression function (PRF). The functional form of the PRF can be linear or nonlinear. For simplicity we will start with a linear form. In effect, we will assume that the PRF is a linear function of $X_i$,
$$E(Y \mid X_i) = \beta_1 + \beta_2 X_i.$$
$\beta_1$ and $\beta_2$ are unknown (but fixed) parameters called regression coefficients. $\beta_1$ is the intercept, while $\beta_2$ is the slope. With regression analysis we want to estimate the PRF. This means that we want to estimate $\beta_1$ and $\beta_2$.

THE MEANING OF THE TERM LINEAR

The population regression line, $E(Y \mid X_i) = \beta_1 + \beta_2 X_i$, is a linear function of $X_i$. We say that we have linearity in variables. In contrast, the PRF $E(Y \mid X_i) = \beta_1 + \beta_2 X_i^2$ is nonlinear with respect to the variable $X$. For both examples we have so-called linearity in parameters. This means that the PRF is linear with respect to the parameters $\beta_1$ and $\beta_2$. Linearity in parameters requires that the parameters are raised to the first power only. Furthermore, a parameter may not be multiplied or divided by any other parameter. An example of a PRF that is nonlinear with respect to the parameters is $E(Y \mid X_i) = \beta_1 + \beta_2^2 X_i$. We will consider linear regression models. A model is a linear regression model if it is linear in parameters. The model may or may not be linear in variables.
STOCHASTIC SPECIFICATION OF PRF

Returning to the example on consumption expenditure and income, we see that given a certain level of income, the consumption expenditure of some families is more than the conditional expected value, while for other families it is less. In effect, given a level of income, the consumption expenditure of the different families fluctuates around the conditional expected value. The deviation of an individual $Y_i$ around its expected value is $u_i = Y_i - E(Y \mid X_i)$. The stochastic specification of the PRF is then
$$Y_i = E(Y \mid X_i) + u_i.$$
That is, $Y_i$ is expressed as the sum of two components: $E(Y \mid X_i)$, the conditional expected value of $Y$ for a given value of $X$, is the deterministic component, while $u_i$, the so-called stochastic error (disturbance) term, is a random component. $u_i$ is a surrogate for all the explanatory variables that are omitted from the model, but that may affect the dependent variable. It can be shown that $E(u_i \mid X_i) = 0$.

The stochastic specification of the population regression line is
$$Y_i = E(Y \mid X_i) + u_i = \beta_1 + \beta_2 X_i + u_i.$$

THE SAMPLE REGRESSION FUNCTION (SRF)

With regression analysis we want to estimate the PRF on the basis of a sample regression function (SRF). Returning to the example on consumption expenditure and income, we considered a population of $Y$ values corresponding to fixed $X$ values. Generally the population values are not all known. We only have a sample of $Y$ values available. Consider the following two random samples of $Y$ values:

First sample:
X:  80  100  120  140  160  180  200  220  240  260
Y:  70   65   90   95  110  115  120  140  155  150

Second sample:
X:  80  100  120  140  160  180  200  220  240  260
Y:  55   88   90   80  118  120  145  135  145  175

For these two random samples, a scatter diagram of the weekly consumption expenditure against weekly income is given below. The pairs of $(X, Y)$ values plotted with × and the solid line correspond to the first sample, while the pairs plotted with ● and the dashed line correspond to the second sample.

[Figure: Scatter diagram of weekly consumption expenditure ($) against weekly income ($) with the regression lines based on the two different samples.]

The drawn lines are known as sample regression lines. The sample regression line is the sample counterpart of the population regression line and is written as
$$\hat{Y}_i = \hat{\beta}_1 + \hat{\beta}_2 X_i.$$
$\hat{Y}_i$ is the estimator of the conditional expected value, $E(Y \mid X_i)$, while $\hat{\beta}_1$ and $\hat{\beta}_2$ are the estimators of $\beta_1$ and $\beta_2$, the unknown population parameters. Note that the value obtained for an estimator is referred to as an estimate. Also note that, whereas the PRF and its parameters are unknown but fixed, the SRF and its estimators will differ for different random samples.

The stochastic specification of the SRF is
$$Y_i = \hat{Y}_i + \hat{u}_i = \hat{\beta}_1 + \hat{\beta}_2 X_i + \hat{u}_i,$$
where $\hat{u}_i = Y_i - \hat{Y}_i$, the difference between the observed and estimated values of $Y$, is known as the residual.
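To see that different samples give different SRFs, the least-squares line can be fitted to each sample. The following Python sketch (an illustration, not part of the original notes; it uses the OLS formulas derived in Section A3) fits both samples and shows that the two estimated lines differ:

```python
# Minimal sketch (not from the notes): fitting the sample regression line
# to each of the two random samples above.
def ols(x, y):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    b2 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) \
         / sum((xi - xbar) ** 2 for xi in x)
    b1 = ybar - b2 * xbar
    return b1, b2

x = [80, 100, 120, 140, 160, 180, 200, 220, 240, 260]
y1 = [70, 65, 90, 95, 110, 115, 120, 140, 155, 150]
y2 = [55, 88, 90, 80, 118, 120, 145, 135, 145, 175]

print(ols(x, y1))  # intercept and slope estimates for the first sample
print(ols(x, y2))  # a (different) intercept and slope for the second sample
```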
A3. SIMPLE REGRESSION – TWO-VARIABLE REGRESSION MODEL: ESTIMATION

METHOD OF ORDINARY LEAST SQUARES (OLS)

Recall that the PRF is given by
$$Y_i = E(Y \mid X_i) + u_i = \beta_1 + \beta_2 X_i + u_i.$$
Since the PRF is not observable, we estimate it from the SRF
$$Y_i = \hat{\beta}_1 + \hat{\beta}_2 X_i + \hat{u}_i = \hat{Y}_i + \hat{u}_i.$$
Given a sample of $n$ observations of $Y$ and $X$, we would like to determine the SRF so that it is as close as possible to the PRF, in effect, so that the $\hat{Y}_i$'s are as close as possible to the $Y_i$'s. This is equivalent to minimizing the $\hat{u}_i$'s. How is this done?

First consider the sum of the residuals, $\sum \hat{u}_i$. Minimizing $\sum \hat{u}_i$ is not a good criterion, since the positive residuals will be cancelled out by the negative residuals. We can therefore get a $\sum \hat{u}_i$ that is small (even zero) although the individual residuals are large. We need a criterion that ignores the sign of the residuals. The sum of the absolute residuals, $\sum |\hat{u}_i|$, can be used, but minimizing it is mathematically demanding. We therefore use the least-squares criterion.

Theorem
The OLS estimators for $\beta_2$ and $\beta_1$ are respectively
$$\hat{\beta}_2 = \frac{n \sum X_i Y_i - \sum X_i \sum Y_i}{n \sum X_i^2 - \left(\sum X_i\right)^2} = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2}$$
and
$$\hat{\beta}_1 = \tfrac{1}{n} \sum Y_i - \tfrac{1}{n} \hat{\beta}_2 \sum X_i = \bar{Y} - \hat{\beta}_2 \bar{X}.$$

Proof
With the least-squares method, the sum of the squared residuals,
$$Q = \sum \hat{u}_i^2 = \sum (Y_i - \hat{Y}_i)^2 = \sum (Y_i - \hat{\beta}_1 - \hat{\beta}_2 X_i)^2,$$
is minimized. The ordinary least squares (OLS) estimators, $\hat{\beta}_1$ and $\hat{\beta}_2$, are derived as follows:

1. Determine the partial derivatives with respect to $\hat{\beta}_1$ and $\hat{\beta}_2$:
$$\frac{\partial Q}{\partial \hat{\beta}_1} = -2 \sum (Y_i - \hat{\beta}_1 - \hat{\beta}_2 X_i) \quad ①$$
$$\frac{\partial Q}{\partial \hat{\beta}_2} = -2 \sum X_i (Y_i - \hat{\beta}_1 - \hat{\beta}_2 X_i) \quad ②$$

2. Set ① and ② equal to zero and simplify to obtain the so-called normal equations:
① = 0: $-2 \sum (Y_i - \hat{\beta}_1 - \hat{\beta}_2 X_i) = 0$, so that
$$\sum Y_i = n \hat{\beta}_1 + \hat{\beta}_2 \sum X_i \quad ③$$
② = 0: $-2 \sum X_i (Y_i - \hat{\beta}_1 - \hat{\beta}_2 X_i) = 0$, so that
$$\sum X_i Y_i = \hat{\beta}_1 \sum X_i + \hat{\beta}_2 \sum X_i^2 \quad ④$$

3. Solve ③ and ④ simultaneously:
③ × $\sum X_i$:
$$\sum X_i \sum Y_i = n \hat{\beta}_1 \sum X_i + \hat{\beta}_2 \left(\sum X_i\right)^2 \quad ⑤$$
④ × $n$:
$$n \sum X_i Y_i = n \hat{\beta}_1 \sum X_i + n \hat{\beta}_2 \sum X_i^2 \quad ⑥$$
⑥ − ⑤:
$$n \sum X_i Y_i - \sum X_i \sum Y_i = \hat{\beta}_2 \left[ n \sum X_i^2 - \left(\sum X_i\right)^2 \right],$$
so that
$$\hat{\beta}_2 = \frac{n \sum X_i Y_i - \sum X_i \sum Y_i}{n \sum X_i^2 - \left(\sum X_i\right)^2} = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2}.$$
From ③:
$$\hat{\beta}_1 = \tfrac{1}{n} \sum Y_i - \tfrac{1}{n} \hat{\beta}_2 \sum X_i = \bar{Y} - \hat{\beta}_2 \bar{X}. \qquad ■$$

The OLS estimates are interpreted as follows: $\hat{\beta}_1$ gives the mean value of $Y$ given that $X = 0$, while $\hat{\beta}_2$ gives the effect of a unit change in $X$ on the mean value of $Y$.

The OLS estimators have the following numerical properties:
I. The OLS estimators are expressed in terms of the observed $Y$ and $X$ values and are therefore easily calculated.
II. The OLS estimators are point estimators.
III. The SRF is determined using the OLS estimates and the observed $X$ values.

The deviation form of the SRF is
$$y_i = \hat{y}_i + \hat{u}_i = \hat{\beta}_2 x_i + \hat{u}_i,$$
where $y_i = Y_i - \bar{Y}$ and $x_i = X_i - \bar{X}$.

The properties of the SRF are:
1. The SRF passes through $\bar{X}$ and $\bar{Y}$.
2. $\bar{\hat{Y}} = \bar{Y}$
3. $\sum \hat{u}_i = 0$ and hence $\bar{\hat{u}} = 0$.
4. $\hat{u}_i$ is uncorrelated with $\hat{Y}_i$ since $\sum \hat{Y}_i \hat{u}_i = \sum \hat{y}_i \hat{u}_i = 0$.
5. $\hat{u}_i$ is uncorrelated with $X_i$ since $\sum X_i \hat{u}_i = 0$.
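The normal-equation solution and the numerical properties listed above are easy to check numerically. Below is a minimal Python sketch (an illustration, not part of the original notes) using the raw-sum formulas and the first sample from Section A2:

```python
# Minimal sketch (not from the notes): OLS via the raw-sum formulas,
# with numerical verification of the SRF properties.
x = [80, 100, 120, 140, 160, 180, 200, 220, 240, 260]
y = [70, 65, 90, 95, 110, 115, 120, 140, 155, 150]   # first sample from A2
n = len(x)

sx, sy = sum(x), sum(y)
sxy = sum(xi * yi for xi, yi in zip(x, y))
sxx = sum(xi ** 2 for xi in x)

b2 = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
b1 = sy / n - b2 * sx / n

yhat = [b1 + b2 * xi for xi in x]
resid = [yi - yh for yi, yh in zip(y, yhat)]

print(round(sum(resid), 10))                                   # property 3: 0
print(round(sum(xi * ui for xi, ui in zip(x, resid)), 10))     # property 5: 0
print(round(sum(yh * ui for yh, ui in zip(yhat, resid)), 10))  # property 4: 0
```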
CLASSICAL LINEAR REGRESSION MODEL: ASSUMPTIONS UNDERLYING METHOD OF LEAST SQUARES

Assumption 1: Linear regression model (linear in parameters): $Y_i = \beta_1 + \beta_2 X_i + u_i$
Assumption 2: $X$ values fixed (nonstochastic) in repeated sampling, or sampled randomly such that $\mathrm{cov}(u_i, X_i) = 0$.
Assumption 3: Zero mean value for $u_i$: $E(u_i \mid X_i) = 0$
Assumption 4: Homoscedasticity or equal variance of $u_i$: $\mathrm{var}(u_i \mid X_i) = \sigma^2$
Assumption 5: No autocorrelation between the $u_i$'s: $\mathrm{cov}(u_i, u_j \mid X_i, X_j) = 0$ for $i \neq j$
Assumption 6: Number of observations ($n$) > number of parameters ($k$).
Assumption 7: Variability in the $X$ values: $\mathrm{var}(X_i) > 0$

STATISTICAL PROPERTIES OF LEAST SQUARES ESTIMATORS: GAUSS-MARKOV THEOREM

An estimator is said to be the best linear unbiased estimator (BLUE) if
1. it is a linear function of a random variable,
2. it is unbiased in that the expected value of the estimator is equal to the (unknown) true parameter value,
3. and it is efficient since it has the minimum variance in the class of all linear unbiased estimators.

Theorem
The Gauss-Markov theorem states that, given the assumptions of the classical linear regression model, the OLS estimators are BLUE.

Proof
This theorem will be proven for $\hat{\beta}_2$. Note that this estimator can be expressed as
$$\hat{\beta}_2 = \frac{\sum x_i Y_i}{\sum x_i^2} = \sum k_i Y_i, \quad \text{where } k_i = \frac{x_i}{\sum x_i^2} \text{ and } x_i = X_i - \bar{X}.$$
The properties of $k_i$ are:
- Assuming that $X_i$ is nonstochastic (fixed), $k_i$ is also nonstochastic.
- $\sum k_i = \sum \frac{x_i}{\sum x_i^2} = \frac{\sum x_i}{\sum x_i^2} = 0$, since $\sum x_i = 0$.
- $\sum k_i^2 = \sum \frac{x_i^2}{\left(\sum x_i^2\right)^2} = \frac{\sum x_i^2}{\left(\sum x_i^2\right)^2} = \frac{1}{\sum x_i^2}$.
- $\sum k_i x_i = \sum \frac{x_i}{\sum x_i^2} \, x_i = \frac{\sum x_i^2}{\sum x_i^2} = 1$.

1. Linearity
$\hat{\beta}_2 = \sum k_i Y_i$ is a linear function of $Y$ and thus a linear estimator.

2. Unbiasedness
Using the population regression line and the properties of $k_i$, $\hat{\beta}_2$ can be written as
$$\hat{\beta}_2 = \sum k_i Y_i = \sum k_i (\beta_1 + \beta_2 X_i + u_i) = \beta_1 \sum k_i + \beta_2 \sum k_i X_i + \sum k_i u_i$$
$$= \beta_1 \sum k_i + \beta_2 \sum k_i (x_i + \bar{X}) + \sum k_i u_i = \beta_1 \sum k_i + \beta_2 \sum k_i x_i + \beta_2 \bar{X} \sum k_i + \sum k_i u_i = \beta_2 + \sum k_i u_i.$$
The expected value of $\hat{\beta}_2$ is then
$$E(\hat{\beta}_2) = E\left(\beta_2 + \sum k_i u_i\right) = \beta_2 + \sum k_i E(u_i) = \beta_2,$$
and therefore $\hat{\beta}_2$ is an unbiased estimator of $\beta_2$.

3. Efficiency
Since $\mathrm{var}(Y_i) = \mathrm{var}(u_i) = \sigma^2$ and $\mathrm{cov}(Y_i, Y_j) = \mathrm{cov}(u_i, u_j) = 0$ for $i \neq j$,
$$\mathrm{var}(\hat{\beta}_2) = \mathrm{var}\left(\sum k_i Y_i\right) = \sum k_i^2 \,\mathrm{var}(Y_i) = \frac{\sigma^2}{\sum x_i^2}.$$
Let $\beta_2^* = \sum w_i Y_i$ be an alternative linear estimator of $\beta_2$. Then
$$\beta_2^* = \sum w_i (\beta_1 + \beta_2 X_i + u_i) = \beta_1 \sum w_i + \beta_2 \sum w_i x_i + \beta_2 \bar{X} \sum w_i + \sum w_i u_i.$$
The expected value of $\beta_2^*$ is
$$E(\beta_2^*) = \beta_1 \sum w_i + \beta_2 \sum w_i x_i + \beta_2 \bar{X} \sum w_i + \sum w_i E(u_i).$$
To be unbiased, we require $\sum w_i = 0$ and $\sum w_i x_i = 1$, so that $E(\beta_2^*) = \beta_2$.
The variance of $\beta_2^*$ is
$$\mathrm{var}(\beta_2^*) = \mathrm{var}\left(\sum w_i Y_i\right) = \sigma^2 \sum w_i^2 = \sigma^2 \sum \left( w_i - \frac{x_i}{\sum x_i^2} + \frac{x_i}{\sum x_i^2} \right)^2$$
$$= \sigma^2 \sum \left( w_i - \frac{x_i}{\sum x_i^2} \right)^2 + \sigma^2 \sum \frac{x_i^2}{\left(\sum x_i^2\right)^2} + 2\sigma^2 \sum \left( w_i - \frac{x_i}{\sum x_i^2} \right) \frac{x_i}{\sum x_i^2}$$
$$= \sigma^2 \sum \left( w_i - \frac{x_i}{\sum x_i^2} \right)^2 + \frac{\sigma^2}{\sum x_i^2},$$
since $\sum w_i x_i = 1$ and $\frac{\sum x_i^2}{\left(\sum x_i^2\right)^2} = \frac{1}{\sum x_i^2}$, so that the cross-product term vanishes. If $w_i = \frac{x_i}{\sum x_i^2} = k_i$, then
$$\mathrm{var}(\beta_2^*) = \frac{\sigma^2}{\sum x_i^2} = \mathrm{var}(\hat{\beta}_2).$$
In effect, the minimum variance obtainable by any linear unbiased estimator of $\beta_2$ is the variance of $\hat{\beta}_2$, so $\hat{\beta}_2$ is an efficient estimator. ■

STANDARD ERRORS OF LEAST SQUARES ESTIMATORS

The expected value, variance and standard error of $\hat{\beta}_1$ are respectively
$$E(\hat{\beta}_1) = \beta_1, \quad \mathrm{var}(\hat{\beta}_1) = \frac{\sigma^2 \sum X_i^2}{n \sum x_i^2} \quad \text{and} \quad \mathrm{se}(\hat{\beta}_1) = \sigma \sqrt{\frac{\sum X_i^2}{n \sum x_i^2}}.$$
For $\hat{\beta}_2$ the expected value, variance and standard error are
$$E(\hat{\beta}_2) = \beta_2, \quad \mathrm{var}(\hat{\beta}_2) = \frac{\sigma^2}{\sum x_i^2} \quad \text{and} \quad \mathrm{se}(\hat{\beta}_2) = \frac{\sigma}{\sqrt{\sum x_i^2}}.$$
The covariance between $\hat{\beta}_1$ and $\hat{\beta}_2$ is
$$\mathrm{cov}(\hat{\beta}_1, \hat{\beta}_2) = -\bar{X} \,\mathrm{var}(\hat{\beta}_2) = -\bar{X} \frac{\sigma^2}{\sum x_i^2}.$$
The mean square error (MSE) is
$$\hat{\sigma}^2 = \frac{\sum \hat{u}_i^2}{n - 2}.$$

COEFFICIENT OF DETERMINATION (R²): A MEASURE OF GOODNESS OF FIT

The total deviation for the dependent variable can be divided into two components,
$$y_i = Y_i - \bar{Y} = Y_i - \bar{Y} + \hat{Y}_i - \hat{Y}_i = (\hat{Y}_i - \bar{Y}) + (Y_i - \hat{Y}_i),$$
where
$$\hat{Y}_i - \bar{Y} = (\hat{\beta}_1 + \hat{\beta}_2 X_i) - (\hat{\beta}_1 + \hat{\beta}_2 \bar{X}) = \hat{\beta}_2 (X_i - \bar{X}) = \hat{\beta}_2 x_i$$
is the explained deviation and $Y_i - \hat{Y}_i = \hat{u}_i$ is the unexplained deviation.

The total sum of squares (total variation) is given by
$$\mathrm{TSS} = \sum y_i^2 = \sum (Y_i - \bar{Y})^2 = \sum (\hat{Y}_i - \bar{Y})^2 + \sum (Y_i - \hat{Y}_i)^2 = \hat{\beta}_2^2 \sum x_i^2 + \sum \hat{u}_i^2 = \mathrm{ESS} + \mathrm{RSS},$$
where ESS is the explained sum of squares (explained variation) and RSS is the residual sum of squares (unexplained variation).

The coefficient of determination is then defined as
$$R^2 = \frac{\mathrm{ESS}}{\mathrm{TSS}} = \frac{\sum (\hat{Y}_i - \bar{Y})^2}{\sum (Y_i - \bar{Y})^2} = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}} = 1 - \frac{\sum \hat{u}_i^2}{\sum y_i^2}.$$
$R^2$ measures the percentage of the total variation in $Y$ explained by the regression model. If $\hat{Y}_i = Y_i$ so that $\hat{u}_i = 0$ for all $i$, then $R^2 = 1$ and all the variation in $Y$ is explained by the regression model. In contrast, if $\hat{Y}_i = \bar{Y}$ for all $i$, then none of the variation in $Y$ is explained by the regression model and $R^2 = 0$. It follows that $0 \leq R^2 \leq 1$.

The sample correlation coefficient is given by
$$r = \pm \sqrt{R^2} = \frac{\sum x_i y_i}{\sqrt{\sum x_i^2 \sum y_i^2}}.$$
The properties of $r$ are:
1. The sign of $r$ is the same as the sign of $\hat{\beta}_2$.
2. $-1 \leq r \leq 1$
3. $r_{XY} = r_{YX}$
4. $r$ is independent of the units and scale in which $X$ and $Y$ are measured.
5. If $X$ and $Y$ are independent, then $r = 0$. However, if $r = 0$, then $X$ and $Y$ are not necessarily independent.
6. $r$ is only a measure of linear dependence.
7. $r$ cannot be used to determine a cause-and-effect relationship between two variables.
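These quantities are straightforward to compute. Below is a minimal Python sketch (an illustration, not part of the original notes) of the TSS = ESS + RSS decomposition and of $R^2$; applied to the wages-education data in the example that follows, it should reproduce $R^2 = 0.9078$:

```python
# Minimal sketch (not from the notes): the TSS = ESS + RSS decomposition
# and the coefficient of determination for a fitted simple regression.
def r_squared(x, y):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    b2 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) \
         / sum((xi - xbar) ** 2 for xi in x)
    b1 = ybar - b2 * xbar
    yhat = [b1 + b2 * xi for xi in x]
    tss = sum((yi - ybar) ** 2 for yi in y)
    ess = sum((yh - ybar) ** 2 for yh in yhat)
    rss = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))
    assert abs(tss - (ess + rss)) < 1e-8      # TSS = ESS + RSS
    return ess / tss
```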
Example: Wages versus education (Gujarati & Porter, 2009, p. 78)
In this example education, measured by the number of years of schooling, is used to explain the mean hourly wage (in $). The dataset is:

Years of schooling: X   6       7       8       9       10      11      12      13
Hourly wage: Y          4.4567  5.7700  5.9787  7.3317  7.3182  6.5844  7.8182  7.8351

Years of schooling: X   14       15       16       17       18
Hourly wage: Y          11.0223  10.6738  10.8361  13.6150  13.5310

[Figure: Scatter diagram of hourly wage ($) against years of schooling with the sample regression line y = 0.7241x − 0.0145, R² = 0.9078.]

The estimated regression line is $\hat{Y}_i = -0.0145 + 0.7241 X_i$. There is no practical interpretation for the OLS estimate of the intercept, $\hat{\beta}_1 = -0.0145$, since wages cannot be negative. The OLS estimate of the slope, $\hat{\beta}_2 = 0.7241$, indicates that with each additional year of schooling, hourly wages on average increase by about 72 cents. Based on the coefficient of determination, $R^2 = 0.9078$, 90.78% of the variation in hourly wages is explained by the number of years of schooling. Hourly wages and education are highly positively correlated in that $r = 0.9528$.

A4. SIMPLE REGRESSION – THE NORMALITY ASSUMPTION

NORMALITY ASSUMPTION FOR STOCHASTIC ERROR TERM

The assumptions already made about $u_i$ are:
$E(u_i) = 0$
$\mathrm{var}(u_i) = \sigma^2$
$\mathrm{cov}(u_i, u_j) = 0$ for $i \neq j$

No assumption has been made yet about the probability distribution of $u_i$. The classical normal linear regression model assumes that each $u_i$ is normally distributed with mean zero and variance $\sigma^2$, denoted $u_i \sim N(0, \sigma^2)$. If two normally distributed variables have zero correlation, they are independent. Therefore $u_i \sim \mathrm{NID}(0, \sigma^2)$, where NID indicates normally and independently distributed.

Why the normal distribution?
1. Following the central limit theorem (CLT), given a very large number of independent and identically distributed (IID) random variables, the distribution of the sum of these variables will be approximately normal.
2. Following a variant of the CLT, the above still holds if the number of variables is not very large or if they are not strictly independent.
3. Any linear function of normally distributed variables is also normally distributed. Therefore, under the normality assumption on $u_i$, the OLS estimators $\hat{\beta}_1$ and $\hat{\beta}_2$ will also be normally distributed.
4. The normal distribution is fully described by only two parameters, its mean and variance.
5. The normality assumption allows the use of the $t$, $F$ and $\chi^2$ hypothesis tests in regression modeling.

STATISTICAL PROPERTIES OF OLS ESTIMATORS UNDER THE NORMALITY ASSUMPTION

Given the assumption of normality, the OLS estimators $\hat{\beta}_1$, $\hat{\beta}_2$ and $\hat{\sigma}^2$ have the following statistical properties:
1. They are unbiased estimators.
2. They have minimum variance and are therefore efficient estimators.
3. They are consistent estimators in that, as the sample size tends to infinity, the estimates converge to the true parameter values.
4. $\hat{\beta}_1 \sim N(\beta_1, \sigma_{\hat{\beta}_1}^2)$ with $\sigma_{\hat{\beta}_1}^2 = \frac{\sigma^2 \sum X_i^2}{n \sum x_i^2}$, so that $Z = \frac{\hat{\beta}_1 - \beta_1}{\sigma_{\hat{\beta}_1}} \sim N(0, 1)$.
5. $\hat{\beta}_2 \sim N(\beta_2, \sigma_{\hat{\beta}_2}^2)$ with $\sigma_{\hat{\beta}_2}^2 = \frac{\sigma^2}{\sum x_i^2}$, so that $Z = \frac{\hat{\beta}_2 - \beta_2}{\sigma_{\hat{\beta}_2}} \sim N(0, 1)$.
6. $\frac{(n-2)\hat{\sigma}^2}{\sigma^2} \sim \chi^2(n-2)$
7. $\hat{\beta}_1$ and $\hat{\beta}_2$ are distributed independently of $\hat{\sigma}^2$.
8. $\hat{\beta}_1$ and $\hat{\beta}_2$ have the minimum variance of all unbiased estimators (linear and nonlinear) and are therefore the best unbiased estimators (BUE).

THE METHOD OF MAXIMUM LIKELIHOOD (ML)

Consider the two-variable regression model, $Y_i = \beta_1 + \beta_2 X_i + u_i$. Assume that the $Y_i$'s are normally and independently distributed, in effect $Y_i \sim \mathrm{NID}(\beta_1 + \beta_2 X_i, \sigma^2)$.

Theorem
The ML estimators for $\beta_2$, $\beta_1$ and $\sigma^2$ are
$$\tilde{\beta}_2 = \hat{\beta}_2 = \frac{n \sum X_i Y_i - \sum X_i \sum Y_i}{n \sum X_i^2 - \left(\sum X_i\right)^2} = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2},$$
$$\tilde{\beta}_1 = \hat{\beta}_1 = \tfrac{1}{n} \sum Y_i - \tfrac{1}{n} \hat{\beta}_2 \sum X_i = \bar{Y} - \hat{\beta}_2 \bar{X}$$
and
$$\tilde{\sigma}^2 = \tfrac{1}{n} \sum \hat{u}_i^2.$$

Proof
The probability density function of $Y_i$ is
$$f(Y_i) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(Y_i - \beta_1 - \beta_2 X_i)^2}{2\sigma^2} \right).$$
Due to their independence, the joint probability density function of $Y_1, Y_2, \ldots, Y_n$ is obtained as
$$f(Y_1, Y_2, \ldots, Y_n) = f(Y_1) \times f(Y_2) \times \ldots \times f(Y_n).$$
The likelihood function is then given by
$$LF(\beta_1, \beta_2, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left( -\frac{1}{2} \sum \frac{(Y_i - \beta_1 - \beta_2 X_i)^2}{\sigma^2} \right) = \frac{1}{(2\pi\theta)^{n/2}} \exp\left( -\frac{1}{2} \sum \frac{(Y_i - \beta_1 - \beta_2 X_i)^2}{\theta} \right),$$
where $\theta = \sigma^2$.

With the method of maximum likelihood, the unknown parameters are estimated in such a way that the probability of observing the given $Y_i$'s is as high as possible. This is done by partial differentiation to obtain normal equations that are solved simultaneously. To simplify the differentiation, the log-likelihood function,
$$\ln LF = -\tfrac{n}{2} \ln 2\pi - \tfrac{n}{2} \ln \theta - \tfrac{1}{2} \sum \frac{(Y_i - \beta_1 - \beta_2 X_i)^2}{\theta},$$
is used. The partial derivatives are:
$$\frac{\partial \ln LF}{\partial \tilde{\beta}_1} = \frac{1}{\tilde{\theta}} \sum (Y_i - \tilde{\beta}_1 - \tilde{\beta}_2 X_i), \quad ⓐ$$
$$\frac{\partial \ln LF}{\partial \tilde{\beta}_2} = \frac{1}{\tilde{\theta}} \sum X_i (Y_i - \tilde{\beta}_1 - \tilde{\beta}_2 X_i) \quad ⓑ$$
and
$$\frac{\partial \ln LF}{\partial \tilde{\theta}} = -\frac{n}{2\tilde{\theta}} + \frac{1}{2\tilde{\theta}^2} \sum (Y_i - \tilde{\beta}_1 - \tilde{\beta}_2 X_i)^2. \quad ⓒ$$
The normal equations are obtained by setting ⓐ and ⓑ equal to zero and simplifying:
ⓐ = 0: $\frac{1}{\tilde{\theta}} \sum (Y_i - \tilde{\beta}_1 - \tilde{\beta}_2 X_i) = 0$, so that
$$\sum Y_i = n\tilde{\beta}_1 + \tilde{\beta}_2 \sum X_i \quad ⓓ$$
ⓑ = 0: $\frac{1}{\tilde{\theta}} \sum X_i (Y_i - \tilde{\beta}_1 - \tilde{\beta}_2 X_i) = 0$, so that
$$\sum X_i Y_i = \tilde{\beta}_1 \sum X_i + \tilde{\beta}_2 \sum X_i^2 \quad ⓔ$$
The normal equations of the ML estimators and the OLS estimators for $\beta_1$ and $\beta_2$ are the same, and thus $\tilde{\beta}_1 = \hat{\beta}_1$ and $\tilde{\beta}_2 = \hat{\beta}_2$.

To obtain the ML estimator of $\sigma^2$, set ⓒ equal to zero and simplify:
$$-\frac{n}{2\tilde{\theta}} + \frac{1}{2\tilde{\theta}^2} \sum (Y_i - \tilde{\beta}_1 - \tilde{\beta}_2 X_i)^2 = 0 \;\Rightarrow\; \tilde{\theta} = \tilde{\sigma}^2 = \tfrac{1}{n} \sum (Y_i - \hat{\beta}_1 - \hat{\beta}_2 X_i)^2 = \tfrac{1}{n} \sum \hat{u}_i^2. \qquad ■$$

Note that the ML estimator of $\sigma^2$ is biased, since
$$E(\tilde{\sigma}^2) = \frac{(n-2)\sigma^2}{n} \neq \sigma^2.$$
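The bias of $\tilde{\sigma}^2$ can be seen in a small simulation. Below is a Python sketch (an illustration under assumed parameter values, not part of the original notes) whose average ML estimate comes out near $(n-2)\sigma^2/n$ rather than $\sigma^2$:

```python
# Minimal sketch (not from the notes): Monte Carlo check that the ML
# estimator of sigma^2 is biased downward by the factor (n-2)/n.
import random

random.seed(1)
n, b1, b2, sigma2 = 20, 1.0, 0.5, 4.0    # assumed true parameter values
x = list(range(1, n + 1))
xbar = sum(x) / n

ml_vals = []
for _ in range(20000):
    y = [b1 + b2 * xi + random.gauss(0.0, sigma2 ** 0.5) for xi in x]
    ybar = sum(y) / n
    bh2 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) \
          / sum((xi - xbar) ** 2 for xi in x)
    bh1 = ybar - bh2 * xbar
    rss = sum((yi - bh1 - bh2 * xi) ** 2 for xi, yi in zip(x, y))
    ml_vals.append(rss / n)              # ML estimate of sigma^2

print(sum(ml_vals) / len(ml_vals))       # close to (n-2)/n * 4 = 3.6, not 4
```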
A5. SIMPLE REGRESSION – TWO-VARIABLE REGRESSION MODEL: INFERENCE

THEOREMS FOR PROBABILITY DISTRIBUTIONS

The following theorems will be of importance for statistical inference.

Theorem 1
Let $Z_1, Z_2, \ldots, Z_n$ be normally and independently distributed random variables, $Z_i \sim N(\mu_i, \sigma_i^2)$. Then $\sum k_i Z_i \sim N\left(\sum k_i \mu_i, \sum k_i^2 \sigma_i^2\right)$, where the $k_i$'s are constants.

Theorem 2
Let $Z_1, Z_2, \ldots, Z_n$ be normally distributed random variables which are not independent, $Z_i \sim N(\mu_i, \sigma_i^2)$. Then
$$\sum k_i Z_i \sim N\left(\sum k_i \mu_i, \; \sum k_i^2 \sigma_i^2 + 2\sum_{i \neq j} k_i k_j \,\mathrm{cov}(Z_i, Z_j)\right).$$

Theorem 3
Let $Z_1, Z_2, \ldots, Z_n$ be standardized normal random variables that are independent, $Z_i \sim N(0, 1)$. Then $Z_i^2 \sim \chi^2(1)$ and $\sum Z_i^2 \sim \chi^2(n)$.

Theorem 4
Let $Z_1, Z_2, \ldots, Z_n$ be independently distributed random variables, each following a $\chi^2$ distribution, $Z_i \sim \chi^2(k_i)$. Then $\sum Z_i \sim \chi^2\left(\sum k_i\right)$.

Theorem 5
If $Z_1 \sim N(0, 1)$ and $Z_2 \sim \chi^2(k)$ are independent random variables, then $t = \frac{Z_1}{\sqrt{Z_2 / k}} \sim t(k)$.

Theorem 6
If $Z_1 \sim \chi^2(k_1)$ and $Z_2 \sim \chi^2(k_2)$ are independent random variables, then $F = \frac{Z_1 / k_1}{Z_2 / k_2} \sim F(k_1, k_2)$.

Theorem 7
Let $t \sim t(k)$. Then $F = t^2 \sim F(1, k)$.

CONFIDENCE INTERVALS

Confidence interval for β2

OLS estimator: $\hat{\beta}_2 \sim N(\beta_2, \sigma_{\hat{\beta}_2}^2)$ with $\sigma_{\hat{\beta}_2}^2 = \frac{\sigma^2}{\sum x_i^2}$.
Standardization: $Z = \frac{\hat{\beta}_2 - \beta_2}{\sigma_{\hat{\beta}_2}} = \frac{(\hat{\beta}_2 - \beta_2)\sqrt{\sum x_i^2}}{\sigma} \sim N(0, 1)$. But $\sigma^2$ is unknown.
OLS estimator for $\sigma^2$: $\hat{\sigma}^2 = \frac{1}{n-2} \sum \hat{u}_i^2$, so that $\sigma_{\hat{\beta}_2}^2 = \frac{\sigma^2}{\sum x_i^2}$ is estimated by $\hat{\sigma}_{\hat{\beta}_2}^2 = \frac{\hat{\sigma}^2}{\sum x_i^2}$.
Furthermore $W = (n-2)\frac{\hat{\sigma}^2}{\sigma^2} \sim \chi^2(n-2)$, and $\hat{\beta}_2$ and $\hat{\sigma}^2$ are independent ($Z$ and $W$ independent). Therefore
$$t = \frac{Z}{\sqrt{W/(n-2)}} = \frac{(\hat{\beta}_2 - \beta_2)\sqrt{\sum x_i^2}}{\hat{\sigma}} = \frac{\hat{\beta}_2 - \beta_2}{\hat{\sigma}_{\hat{\beta}_2}} \sim t(n-2).$$
The $100(1-\alpha)\%$ confidence interval for $\beta_2$ follows from
$$P(-t_{\alpha/2} \leq t \leq t_{\alpha/2}) = 1 - \alpha$$
$$P\left(-t_{\alpha/2} \leq \frac{\hat{\beta}_2 - \beta_2}{\hat{\sigma}_{\hat{\beta}_2}} \leq t_{\alpha/2}\right) = 1 - \alpha$$
$$P(\hat{\beta}_2 - t_{\alpha/2}\hat{\sigma}_{\hat{\beta}_2} \leq \beta_2 \leq \hat{\beta}_2 + t_{\alpha/2}\hat{\sigma}_{\hat{\beta}_2}) = 1 - \alpha,$$
giving the interval $\hat{\beta}_2 \pm t_{\alpha/2}\hat{\sigma}_{\hat{\beta}_2}$.

Confidence interval for β1

OLS estimator: $\hat{\beta}_1 \sim N(\beta_1, \sigma_{\hat{\beta}_1}^2)$ with $\sigma_{\hat{\beta}_1}^2 = \frac{\sigma^2 \sum X_i^2}{n \sum x_i^2}$.
The $100(1-\alpha)\%$ confidence interval for $\beta_1$ is $\hat{\beta}_1 \pm t_{\alpha/2}\hat{\sigma}_{\hat{\beta}_1}$ with $\hat{\sigma}_{\hat{\beta}_1}^2 = \frac{\hat{\sigma}^2 \sum X_i^2}{n \sum x_i^2}$.

Confidence interval for σ²

OLS estimator: $\hat{\sigma}^2 = \frac{1}{n-2}\sum \hat{u}_i^2$ with $W = (n-2)\frac{\hat{\sigma}^2}{\sigma^2} \sim \chi^2(n-2)$.
The $100(1-\alpha)\%$ confidence interval for $\sigma^2$ is
$$\left( \frac{(n-2)\hat{\sigma}^2}{\chi^2_{\alpha/2}}, \; \frac{(n-2)\hat{\sigma}^2}{\chi^2_{1-\alpha/2}} \right).$$

Example: Wages versus education (Gujarati & Porter, 2009, p. 78)
Y: hourly wage; X: years of schooling.
Estimated regression line: $\hat{Y}_i = -0.01445 + 0.7241 X_i$
95% confidence interval for $\beta_1$: (−1.93949, 1.91058)
95% confidence interval for $\beta_2$: (0.57095, 0.87724)
Mean square error (MSE): $\hat{\sigma}^2 = 0.88116$
95% confidence interval for $\sigma^2$: (0.442187, 2.540199)

HYPOTHESIS TESTING

Hypothesis testing for β2
1. Specify the null and alternative hypotheses:
$H_0: \beta_2 = \beta_2^*$
$H_1: \beta_2 \neq \beta_2^*$
where $\beta_2^*$ is the parameter value under $H_0$.
2. Select a significance level $\alpha$ and define the test statistic
$$t = \frac{\hat{\beta}_2 - \beta_2^*}{\hat{\sigma}_{\hat{\beta}_2}} \sim t(n-2).$$
3. Give the decision rule:
Confidence interval: reject $H_0$ if $\beta_2^*$ falls outside the $100(1-\alpha)\%$ confidence interval.
Critical value vs test statistic value: reject $H_0$ if $|t| \geq t_{\alpha/2}$.
p-value vs significance level: reject $H_0$ if the p-value ≤ α.
4. Calculate the confidence interval / test statistic value / p-value.
5. Draw a conclusion.

Hypothesis testing for β1
1. $H_0: \beta_1 = \beta_1^*$; $H_1: \beta_1 \neq \beta_1^*$
2. $t = \frac{\hat{\beta}_1 - \beta_1^*}{\hat{\sigma}_{\hat{\beta}_1}} \sim t(n-2)$
3. Decision rule
4. Calculate confidence interval / test statistic value / p-value
5. Conclusion
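The interval and test for $\beta_2$ can be computed directly. Below is a minimal Python sketch (an illustration using SciPy, not part of the original notes; the notes themselves use SAS for these calculations):

```python
# Minimal sketch (not from the notes): 100(1-alpha)% confidence interval
# and t test for the slope of a simple regression.
from scipy import stats

def slope_inference(x, y, b2_null=0.0, alpha=0.05):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b2 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    b1 = ybar - b2 * xbar
    rss = sum((yi - b1 - b2 * xi) ** 2 for xi, yi in zip(x, y))
    sigma2_hat = rss / (n - 2)                 # MSE
    se_b2 = (sigma2_hat / sxx) ** 0.5
    t_crit = stats.t.ppf(1 - alpha / 2, n - 2)
    ci = (b2 - t_crit * se_b2, b2 + t_crit * se_b2)
    t_stat = (b2 - b2_null) / se_b2
    p_value = 2 * stats.t.sf(abs(t_stat), n - 2)
    return ci, t_stat, p_value
```

Applied to the wages-education data, this should reproduce the interval (0.57095, 0.87724) reported above.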
Example: Wages versus education (Gujarati & Porter, 2009, p. 78)
Y: hourly wage; X: years of schooling.
Estimated regression line: $\hat{Y}_i = -0.01445 + 0.7241 X_i$

Is the intercept parameter significant?
1. $H_0: \beta_1 = 0$; $H_1: \beta_1 \neq 0$
2. $\alpha = 0.05$; $t = \frac{\hat{\beta}_1}{\hat{\sigma}_{\hat{\beta}_1}} \sim t(11)$
3. Reject $H_0$ if: 0 falls outside the 95% confidence interval; $|t| \geq t_{0.025} = 2.201$; p-value < 0.05.
4. Calculations using SAS:
95% confidence interval: (−1.93949, 1.91058)
$t = \frac{\hat{\beta}_1}{\hat{\sigma}_{\hat{\beta}_1}} = \frac{-0.01445}{0.87462} = -0.02$
p-value = 0.9871
5. $H_0$ not rejected: the intercept parameter is not significant.

Is the slope parameter significant?
1. $H_0: \beta_2 = 0$; $H_1: \beta_2 \neq 0$
2. $\alpha = 0.05$; $t = \frac{\hat{\beta}_2}{\hat{\sigma}_{\hat{\beta}_2}} \sim t(11)$
3. Reject $H_0$ if: 0 falls outside the 95% confidence interval; $|t| \geq t_{0.025} = 2.201$; p-value < 0.05.
4. Calculations using SAS:
95% confidence interval: (0.57095, 0.87724)
$t = \frac{\hat{\beta}_2}{\hat{\sigma}_{\hat{\beta}_2}} = \frac{0.7241}{0.06958} = 10.41$
p-value < .0001
5. $H_0$ rejected: the slope parameter is significant.

For each additional year of schooling, does hourly wage on average increase by 50 cents?
1. $H_0: \beta_2 = 0.5$; $H_1: \beta_2 \neq 0.5$
2. $\alpha = 0.05$; $t = \frac{\hat{\beta}_2 - 0.5}{\hat{\sigma}_{\hat{\beta}_2}} \sim t(11)$
3. Reject $H_0$ if: 0.5 falls outside the 95% confidence interval; $|t| \geq t_{0.025} = 2.201$; p-value < 0.05.
4. Calculations using SAS:
95% confidence interval: (0.57095, 0.87724)
$t = \frac{\hat{\beta}_2 - 0.5}{\hat{\sigma}_{\hat{\beta}_2}} = \frac{0.7241 - 0.5}{0.06958} = 3.22$
p-value = 0.008
5. $H_0$ rejected: for each additional year of schooling, hourly wage does not increase on average by 50 cents.

ANALYSIS OF VARIANCE (ANOVA)

The total sum of squares is divided into the explained sum of squares and the residual sum of squares: TSS = ESS + RSS. The study of these components is known as ANOVA.

ANOVA table:

Source   Degrees of freedom   Sum of squares                  Mean square                     F
ESS      1                    $\hat{\beta}_2^2 \sum x_i^2$    $\hat{\beta}_2^2 \sum x_i^2$    $\hat{\beta}_2^2 \sum x_i^2 / \hat{\sigma}^2$
RSS      $n-2$                $\sum \hat{u}_i^2$              $\hat{\sigma}^2$
TSS      $n-1$                $\sum y_i^2$

What is the distribution of $F$?
$$Z = \frac{\hat{\beta}_2 - \beta_2}{\sigma_{\hat{\beta}_2}} = \frac{(\hat{\beta}_2 - \beta_2)\sqrt{\sum x_i^2}}{\sigma} \sim N(0, 1) \;\Rightarrow\; Z^2 \sim \chi^2(1)$$
$$W = (n-2)\frac{\hat{\sigma}^2}{\sigma^2} \sim \chi^2(n-2), \quad Z \text{ and } W \text{ independent}$$
$$F = \frac{Z^2 / 1}{W / (n-2)} = \frac{(\hat{\beta}_2 - \beta_2)^2 \sum x_i^2}{\hat{\sigma}^2} = \frac{\hat{\beta}_2^2 \sum x_i^2}{\hat{\sigma}^2} \text{ under } H_0: \beta_2 = 0,$$
so that
$$F = \frac{\hat{\beta}_2^2 \sum x_i^2}{\hat{\sigma}^2} \sim F(1, n-2).$$

Example: Wages versus education (Gujarati & Porter, 2009, p. 78)
Y: hourly wage; X: years of schooling.

ANOVA table:

Source   Degrees of freedom   Sum of squares   Mean square   F
ESS      1                    95.42552         95.42552      108.29
RSS      11                   9.69281          0.88116
TSS      12                   105.11833

Is the slope parameter significant?
1. $H_0: \beta_2 = 0$; $H_1: \beta_2 \neq 0$
2. $\alpha = 0.01$; $F = \frac{\hat{\beta}_2^2 \sum x_i^2}{\hat{\sigma}^2} \sim F(1, 11)$ under $H_0: \beta_2 = 0$
3. Reject $H_0$ if the p-value < 0.01.
4. Calculations using SAS: $F = 108.29$; p-value < .0001
5. $H_0$ rejected: the slope parameter is significant.
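The ANOVA table can also be built directly from the fitted line. Below is a minimal Python sketch (an illustration, not part of the original notes); applied to the wages-education data it should reproduce the table above:

```python
# Minimal sketch (not from the notes): the ANOVA decomposition and the
# F test of H0: beta2 = 0 for a simple regression.
from scipy import stats

def anova_simple(x, y):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b2 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    ess = b2 ** 2 * sxx                        # explained sum of squares
    tss = sum((yi - ybar) ** 2 for yi in y)    # total sum of squares
    rss = tss - ess                            # residual sum of squares
    mse = rss / (n - 2)
    f_stat = ess / mse                         # ~ F(1, n-2) under H0
    p_value = stats.f.sf(f_stat, 1, n - 2)
    return ess, rss, tss, f_stat, p_value
```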
PREDICTION

Mean prediction
Prediction of the conditional mean value of $Y$ corresponding to a given value of $X$, say $X_0$:
$$\hat{Y}_0 = \hat{\beta}_1 + \hat{\beta}_2 X_0 \quad \text{is the point estimator for } E(Y \mid X_0).$$
$$\hat{Y}_0 \sim N(\beta_1 + \beta_2 X_0, \sigma_{\hat{Y}_0}^2), \quad \text{where } \sigma_{\hat{Y}_0}^2 = \sigma^2 \left( \frac{1}{n} + \frac{(X_0 - \bar{X})^2}{\sum x_i^2} \right),$$
$$t = \frac{\hat{Y}_0 - (\beta_1 + \beta_2 X_0)}{\hat{\sigma}_{\hat{Y}_0}} \sim t(n-2).$$
The $100(1-\alpha)\%$ confidence interval for $E(Y \mid X_0)$ is $\hat{Y}_0 \pm t_{\alpha/2} \hat{\sigma}_{\hat{Y}_0}$.

Individual prediction
Prediction of an individual value of $Y$, say $Y_0$, corresponding to a given value of $X$, say $X_0$:
$$\hat{Y}_0 = \hat{\beta}_1 + \hat{\beta}_2 X_0 \quad \text{is the point estimator for } Y_0; \quad Y_0 - \hat{Y}_0 \text{ is the prediction error.}$$
$$(Y_0 - \hat{Y}_0) \sim N(0, \sigma_{(Y_0 - \hat{Y}_0)}^2), \quad \text{where } \sigma_{(Y_0 - \hat{Y}_0)}^2 = \sigma^2 \left( 1 + \frac{1}{n} + \frac{(X_0 - \bar{X})^2}{\sum x_i^2} \right),$$
$$t = \frac{Y_0 - \hat{Y}_0}{\hat{\sigma}_{(Y_0 - \hat{Y}_0)}} \sim t(n-2).$$
The $100(1-\alpha)\%$ confidence interval for $Y_0$ is $\hat{Y}_0 \pm t_{\alpha/2} \hat{\sigma}_{(Y_0 - \hat{Y}_0)}$.

Example: Wages versus education (Gujarati & Porter, 2009, p. 78)
Y: hourly wage; X: years of schooling.
Estimated regression line: $\hat{Y}_i = -0.01445 + 0.7241 X_i$

Predict the mean hourly wage for 6 years of schooling and calculate a 95% confidence interval.
$\hat{Y}_0 = -0.01445 + 0.7241 \times 6 = 4.3301$
95% confidence interval for $E(Y \mid 6)$: (3.2472, 5.413)

Nosnow Cannotski has 6 years of schooling. Predict his hourly wage and calculate a 95% confidence interval.
$\hat{Y}_0 = -0.01445 + 0.7241 \times 6 = 4.3301$
95% confidence interval for $Y_0$: (1.9975, 6.6628)
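Both intervals can be computed with the same fitted line; the individual interval is always wider because of the extra $\sigma^2$ term. Below is a minimal Python sketch (an illustration, not part of the original notes):

```python
# Minimal sketch (not from the notes): mean and individual prediction
# intervals at a new point x0.
from scipy import stats

def prediction_intervals(x, y, x0, alpha=0.05):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b2 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    b1 = ybar - b2 * xbar
    mse = sum((yi - b1 - b2 * xi) ** 2 for xi, yi in zip(x, y)) / (n - 2)
    y0_hat = b1 + b2 * x0
    t_crit = stats.t.ppf(1 - alpha / 2, n - 2)
    se_mean = (mse * (1 / n + (x0 - xbar) ** 2 / sxx)) ** 0.5
    se_indiv = (mse * (1 + 1 / n + (x0 - xbar) ** 2 / sxx)) ** 0.5
    return ((y0_hat - t_crit * se_mean, y0_hat + t_crit * se_mean),
            (y0_hat - t_crit * se_indiv, y0_hat + t_crit * se_indiv))
```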
A6. SIMPLE REGRESSION – EXTENSIONS OF THE TWO-VARIABLE LINEAR REGRESSION MODEL

REGRESSION THROUGH ORIGIN

PRF: $Y_i = \beta_2 X_i + u_i$
SRF: $Y_i = \hat{\beta}_2 X_i + \hat{u}_i$
OLS estimator for $\beta_2$: $\hat{\beta}_2 = \frac{\sum X_i Y_i}{\sum X_i^2}$, with $E(\hat{\beta}_2) = \beta_2$ and $\mathrm{var}(\hat{\beta}_2) = \frac{\sigma^2}{\sum X_i^2}$
OLS estimator for $\sigma^2$: $\hat{\sigma}^2 = \frac{1}{n-1} \sum \hat{u}_i^2$
$$\text{raw } R^2 = \frac{\left(\sum X_i Y_i\right)^2}{\sum X_i^2 \sum Y_i^2}$$

Example: Wages versus education (Gujarati & Porter, 2009, p. 78)
Y: hourly wage; X: years of schooling.
OLS regression: $\hat{Y}_i = -0.0145 + 0.7241 X_i$
OLS regression through origin: $\hat{Y}_i = 0.723 X_i$

LOG-LINEAR MODEL

Exponential regression model: $Y_i = \beta_1 X_i^{\beta_2} e^{u_i}$
Apply the double-log transformation: $\ln Y_i = \ln \beta_1 + \beta_2 \ln X_i + u_i$
Let $Y_i^* = \ln Y_i$, $X_i^* = \ln X_i$ and $\alpha = \ln \beta_1$:
$$Y_i^* = \alpha + \beta_2 X_i^* + u_i \quad \text{(log-linear / log-log / double-log model)}$$

Absolute, relative and percentage change:
Absolute change: $X_i - X_{i-1}$ and $Y_i - Y_{i-1}$
Relative change: $\frac{X_i - X_{i-1}}{X_{i-1}}$ and $\frac{Y_i - Y_{i-1}}{Y_{i-1}}$
Percentage (%) change: $100 \times \frac{X_i - X_{i-1}}{X_{i-1}}$ and $100 \times \frac{Y_i - Y_{i-1}}{Y_{i-1}}$

In the log-log model $\beta_2$ measures the absolute change in $Y^*$ for a given absolute change in $X^*$, in effect the % change in $Y$ for a given % change in $X$, that is, the elasticity.

Example: Coffee consumption (Gujarati & Porter, 2009, p. 204)
Y: USA coffee consumption (cups of coffee per person per day)
X: retail price in dollars per pound
Estimated linear model: $\hat{Y}_i = 2.69112 - 0.47953 X_i$. If the price of coffee increases by $1 per pound, the demand for coffee decreases on average by about half a cup per day.
What is the price elasticity of coffee demand?
Estimated log-log model: $\hat{Y}_i^* = 0.77742 - 0.25305 X_i^*$. If the price of coffee increases by 1%, the demand for coffee decreases on average by 0.25%.
Estimated exponential regression model: $\hat{\beta}_1 = e^{\hat{\alpha}} = e^{0.77742} = 2.17585$, so that $\hat{Y}_i = 2.17585 X_i^{-0.25305}$.

LOG-LIN MODEL

$Y_i = \beta_1 \beta_2^{X_i} e^{u_i}$
Apply the semilog transformation: $\ln Y_i = \ln \beta_1 + X_i \ln \beta_2 + u_i$
Let $Y_i^* = \ln Y_i$, $\alpha_1 = \ln \beta_1$ and $\alpha_2 = \ln \beta_2$:
$$Y_i^* = \alpha_1 + \alpha_2 X_i + u_i \quad \text{(log-lin / semilog model)}$$
$\alpha_2$ measures the % change in $Y$ for a given absolute change in $X$.

Example: Growth rate of fast food company
The growth rate of a fast food company must be measured in terms of its number of pizzerias in operation by using the growth rate formula $Y_t = Y_0 (1 + r)^t$, where $Y_t$ is the number of pizzerias for years $t = 1, 2, \ldots, 15$, $Y_0$ is the initial number of pizzerias when the company started, and $r$ is the compound rate of growth.

t:   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
Yt:  7  13  20  33  40  53  60  47  54  41  48  61  68  55  75

Apply a suitable transformation to the growth rate formula to obtain a linear regression model.
Growth rate formula: $Y_t = Y_0 (1 + r)^t$
Apply the semilog transformation: $\ln Y_t = \ln Y_0 + t \ln(1 + r)$
Let $Y_t^* = \ln Y_t$, $\alpha_1 = \ln Y_0$ and $\alpha_2 = \ln(1 + r)$: $Y_t^* = \alpha_1 + \alpha_2 t$
Add an error term, $u_t$, to obtain the linear regression model: $Y_t^* = \alpha_1 + \alpha_2 t + u_t$

Estimate the instantaneous rate of growth and the compound rate of growth of the company.
Estimated linear model: $\hat{Y}_t^* = 2.69115 + 0.12057 t$
Estimated instantaneous rate of growth: $100 \times 0.12057 = 12.057\%$
Estimated compound rate of growth: $100 \times (e^{0.12057} - 1) = 12.814\%$

LIN-LOG MODEL

$Y_i = \beta_1 + \beta_2 \ln X_i + u_i$
Let $X_i^* = \ln X_i$: $Y_i = \beta_1 + \beta_2 X_i^* + u_i$
$\beta_2$ measures the absolute change in $Y$ for a given % change in $X$.

RECIPROCAL MODEL

$Y_i = \beta_1 + \beta_2 \frac{1}{X_i} + u_i$
Let $X_i^* = \frac{1}{X_i}$: $Y_i = \beta_1 + \beta_2 X_i^* + u_i$
As $X$ increases, $Y$ approaches the limiting (asymptotic) value $\beta_1$.
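As an illustration of fitting one of these transformed models, the log-lin growth model from the pizzeria example can be estimated by ordinary least squares on the transformed data. Below is a minimal Python sketch (not part of the original notes):

```python
# Minimal sketch (not from the notes): fitting the log-lin growth model
# to the pizzeria data and recovering both growth rates.
import math

t = list(range(1, 16))
y = [7, 13, 20, 33, 40, 53, 60, 47, 54, 41, 48, 61, 68, 55, 75]
ystar = [math.log(v) for v in y]          # semilog transformation

n = len(t)
tbar, ybar = sum(t) / n, sum(ystar) / n
a2 = sum((ti - yi * 0 + (ti - tbar) * 0) for ti, yi in zip(t, ystar))  # placeholder removed below
a2 = sum((ti - tbar) * (yi - ybar) for ti, yi in zip(t, ystar)) \
     / sum((ti - tbar) ** 2 for ti in t)
a1 = ybar - a2 * tbar

print(a1, a2)                             # about 2.691 and 0.1206
print(100 * a2)                           # instantaneous growth rate (%)
print(100 * (math.exp(a2) - 1))           # compound growth rate (%)
```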
A7. MULTIPLE REGRESSION – ESTIMATION

THREE-VARIABLE REGRESSION MODEL

PRF: $Y_i = \beta_1 + \beta_2 X_{2i} + \beta_3 X_{3i} + u_i$
$\beta_1$ is the intercept parameter; $\beta_2$ and $\beta_3$ are partial regression coefficients.

Assumptions:
- Linear regression model
- $X$ values fixed (nonstochastic) in repeated sampling, or $X$ values sampled randomly such that the covariance between the error term and each $X$ variable is zero
- Zero mean value for $u_i$: $E(u_i) = 0$
- Homoscedasticity for $u_i$: $\mathrm{var}(u_i) = \sigma^2$
- No autocorrelation between the $u_i$'s: $\mathrm{cov}(u_i, u_j) = 0$, $i \neq j$
- Number of observations ($n$) > number of parameters ($k$)
- Variability in the $X$ values
- No perfect collinearity (linear relationship) between the $X$ variables
- No specification bias

Example: Y is the semester test mark obtained, $X_2$ the number of hours studied per day and $X_3$ the number of hours studied per week. Since $X_3 = 7X_2$:
$$Y_i = \beta_1 + \beta_2 X_{2i} + \beta_3 X_{3i} + u_i = \beta_1 + \beta_2 X_{2i} + 7\beta_3 X_{2i} + u_i = \beta_1 + (\beta_2 + 7\beta_3) X_{2i} + u_i = \beta_1 + \alpha X_{2i} + u_i$$
There is perfect collinearity, so the individual effects of $X_2$ and $X_3$ cannot be estimated; only the combined coefficient $\alpha = \beta_2 + 7\beta_3$ can.

MEANING OF PARTIAL REGRESSION COEFFICIENTS

$\beta_2$ measures the change in the mean value of $Y$ for a unit change in $X_2$, holding $X_3$ constant. $\beta_3$ measures the change in the mean value of $Y$ for a unit change in $X_3$, holding $X_2$ constant.

OLS & ML ESTIMATION

PRF: $Y_i = \beta_1 + \beta_2 X_{2i} + \beta_3 X_{3i} + u_i$
SRF: $Y_i = \hat{Y}_i + \hat{u}_i = \hat{\beta}_1 + \hat{\beta}_2 X_{2i} + \hat{\beta}_3 X_{3i} + \hat{u}_i$

OLS estimators:
$$\hat{\beta}_1 = \bar{Y} - \hat{\beta}_2 \bar{X}_2 - \hat{\beta}_3 \bar{X}_3$$
$$\hat{\beta}_2 = \frac{\left(\sum y_i x_{2i}\right)\left(\sum x_{3i}^2\right) - \left(\sum y_i x_{3i}\right)\left(\sum x_{2i} x_{3i}\right)}{\left(\sum x_{2i}^2\right)\left(\sum x_{3i}^2\right) - \left(\sum x_{2i} x_{3i}\right)^2}$$
$$\hat{\beta}_3 = \frac{\left(\sum y_i x_{3i}\right)\left(\sum x_{2i}^2\right) - \left(\sum y_i x_{2i}\right)\left(\sum x_{2i} x_{3i}\right)}{\left(\sum x_{2i}^2\right)\left(\sum x_{3i}^2\right) - \left(\sum x_{2i} x_{3i}\right)^2}$$
$$\hat{\sigma}^2 = \frac{\sum \hat{u}_i^2}{n - 3}$$

ML estimators:
$$\tilde{\beta}_1 = \hat{\beta}_1, \quad \tilde{\beta}_2 = \hat{\beta}_2, \quad \tilde{\beta}_3 = \hat{\beta}_3, \quad \tilde{\sigma}^2 = \frac{\sum \hat{u}_i^2}{n}$$

R² & ADJUSTED R²

$$R^2 = \frac{\mathrm{ESS}}{\mathrm{TSS}} = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}} = 1 - \frac{\sum \hat{u}_i^2}{\sum y_i^2}$$
$$\bar{R}^2 = 1 - \frac{\sum \hat{u}_i^2 / (n - k)}{\sum y_i^2 / (n - 1)} = 1 - (1 - R^2)\frac{n - 1}{n - k}$$

PARTIAL CORRELATION COEFFICIENTS

Correlation coefficients:
$r_{12}$: correlation coefficient between $Y$ and $X_2$
$r_{13}$: correlation coefficient between $Y$ and $X_3$
$r_{23}$: correlation coefficient between $X_2$ and $X_3$

Partial correlation coefficients:
$r_{12.3}$: correlation coefficient between $Y$ and $X_2$, holding $X_3$ constant
$r_{13.2}$: correlation coefficient between $Y$ and $X_3$, holding $X_2$ constant
$r_{23.1}$: correlation coefficient between $X_2$ and $X_3$, holding $Y$ constant

POLYNOMIAL REGRESSION

PRF: $Y_i = \beta_0 + \beta_1 X_i + \beta_2 Z_i + u_i$. Suppose $Z_i = X_i^2$:
$$Y_i = \beta_0 + \beta_1 X_i + \beta_2 X_i^2 + u_i$$
There is only one explanatory variable, and $\beta_1$ and $\beta_2$ cannot be interpreted individually.

A8. MULTIPLE REGRESSION – INFERENCE

NORMALITY ASSUMPTION

Assumption: $u_i \sim N(0, \sigma^2)$. Then
$$t = \frac{\hat{\beta}_j - \beta_j}{\hat{\sigma}_{\hat{\beta}_j}} \sim t(n - k) \quad \text{for } j = 1, 2, \ldots, k,$$
$$W = (n - k)\frac{\hat{\sigma}^2}{\sigma^2} \sim \chi^2(n - k),$$
$$F = \frac{\mathrm{ESS}/(k - 1)}{\mathrm{RSS}/(n - k)} = \frac{R^2/(k - 1)}{(1 - R^2)/(n - k)} \sim F(k - 1, n - k)$$
and
$$F = \frac{(\mathrm{RSS}_R - \mathrm{RSS}_{UR})/m}{\mathrm{RSS}_{UR}/(n - k)} = \frac{(R_{UR}^2 - R_R^2)/m}{(1 - R_{UR}^2)/(n - k)} \sim F(m, n - k),$$
where the subscripts $R$ and $UR$ denote the restricted and unrestricted models and $m$ is the number of restrictions.
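These statistics can all be computed from the design matrix. The following Python sketch (an illustration, not part of the original notes; it anticipates the matrix notation of Section A9) computes the OLS estimates, standard errors, t statistics and the overall F statistic for a k-variable regression:

```python
# Minimal sketch (not from the notes): OLS estimates, t statistics and
# the overall F statistic for a k-variable regression.
import numpy as np
from scipy import stats

def multiple_ols(X, y):
    """X: n-by-k design matrix whose first column is ones; y: length n."""
    n, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y                   # OLS estimates
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - k)           # MSE
    se = np.sqrt(sigma2 * np.diag(XtX_inv))    # standard errors
    t_stats = beta / se                        # tests of H0: beta_j = 0
    tss = np.sum((y - y.mean()) ** 2)
    r2 = 1 - resid @ resid / tss
    f_stat = (r2 / (k - 1)) / ((1 - r2) / (n - k))
    p_overall = stats.f.sf(f_stat, k - 1, n - k)
    return beta, se, t_stats, r2, f_stat, p_overall
```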
A8. MULTIPLE REGRESSION – INFERENCE

NORMALITY ASSUMPTION

Assumption: $u_i \sim N(0, \sigma^2)$

Under this assumption:
$$t = \frac{\hat\beta_j - \beta_j}{\hat\sigma_{\hat\beta_j}} \sim t(n-k) \quad \text{for } j = 1, 2, \ldots, k$$
$$W = (n-k)\,\frac{\hat\sigma^2}{\sigma^2} \sim \chi^2(n-k)$$
$$F = \frac{ESS/(k-1)}{RSS/(n-k)} = \frac{R^2/(k-1)}{(1-R^2)/(n-k)} \sim F(k-1,\ n-k)$$
$$F = \frac{(RSS_R - RSS_{UR})/m}{RSS_{UR}/(n-k)} = \frac{(R^2_{UR} - R^2_R)/m}{(1-R^2_{UR})/(n-k)} \sim F(m,\ n-k)$$

HYPOTHESIS TESTING

Example: Demand for chicken (Gujarati & Porter, 2009, p. 220)

Demand function for chicken:
$$Y_t = \alpha X_{2t}^{\beta_2} X_{3t}^{\beta_3} X_{4t}^{\beta_4} X_{5t}^{\beta_5} e^{u_t}$$
$Y$: per capita consumption of chicken in pounds
$X_2$: real disposable income per capita in dollars
$X_3$: real retail price of chicken in cents per pound
$X_4$: real retail price of pork in cents per pound
$X_5$: real retail price of beef in cents per pound

Regression model in linear form:
$$\ln Y_t = \beta_1 + \beta_2 \ln X_{2t} + \beta_3 \ln X_{3t} + \beta_4 \ln X_{4t} + \beta_5 \ln X_{5t} + u_t$$

Fitted regression model (standard errors in parentheses):
$$\ln Y_t = \underset{(0.15571)}{2.18979} + \underset{(0.08327)}{0.34256}\,\ln X_{2t} - \underset{(0.11089)}{0.50459}\,\ln X_{3t} + \underset{(0.09967)}{0.14855}\,\ln X_{4t} + \underset{(0.10072)}{0.09110}\,\ln X_{5t} + \hat u_t$$

Are the individual regression coefficients significant?
1. $H_0: \beta_2 = 0$ versus $H_1: \beta_2 \neq 0$
2. $t = \dfrac{\hat\beta_2}{\hat\sigma_{\hat\beta_2}} \sim t(18)$ under $H_0$
3. Reject $H_0$ if p-value $\le \alpha$.
4. $t = \dfrac{0.34256}{0.08327} = 4.11$, p-value $= 0.0007$
5. $H_0: \beta_2 = 0$ is rejected at the 1% significance level, so $\beta_2$ is significant.

Similarly it follows that $\beta_1$ and $\beta_3$ are significant, but $\beta_4$ and $\beta_5$ are not.

Is the overall regression model significant?
1. $H_0: \beta_2 = \beta_3 = \beta_4 = \beta_5 = 0$ versus $H_1$: at least one $\beta_j \neq 0$, $j = 2, 3, 4, 5$
2. $F = \dfrac{ESS/4}{RSS/18} = \dfrac{R^2/4}{(1-R^2)/18} \sim F(4,\ 18)$ under $H_0$
3. Reject $H_0$ if p-value $\le \alpha$.
4. $F = \dfrac{0.76105/4}{0.0137/18} = \dfrac{0.9823/4}{(1-0.9823)/18} = 249.93$, p-value $< 0.0001$
5. $H_0$ is rejected at the 1% significance level: the overall regression model is significant.

Are β4 and β5 both zero?
1. $H_0: \beta_4 = \beta_5 = 0$ versus $H_1$: at least one of $\beta_4$ and $\beta_5$ is not zero.
   Unrestricted model: $\ln Y_t = \beta_1 + \beta_2 \ln X_{2t} + \beta_3 \ln X_{3t} + \beta_4 \ln X_{4t} + \beta_5 \ln X_{5t} + u_t$
   Restricted model under $H_0$: $\ln Y_t = \beta_1 + \beta_2 \ln X_{2t} + \beta_3 \ln X_{3t} + u_t$
2. $F = \dfrac{(RSS_R - RSS_{UR})/2}{RSS_{UR}/18} = \dfrac{(R^2_{UR} - R^2_R)/2}{(1-R^2_{UR})/18} \sim F(2,\ 18)$ under $H_0$
3. Reject $H_0$ if p-value $\le \alpha$.
4. $F = \dfrac{(0.01544 - 0.0137)/2}{0.0137/18} = \dfrac{(0.9823 - 0.9801)/2}{(1-0.9823)/18} = 1.14$, p-value $= 0.3421$
5. $H_0: \beta_4 = \beta_5 = 0$ is not rejected: $\beta_4$ and $\beta_5$ are not significantly different from zero.

Is β4 equal to β5?
1. $H_0: \beta_4 = \beta_5$ versus $H_1: \beta_4 \neq \beta_5$
   Unrestricted model: $\ln Y_t = \beta_1 + \beta_2 \ln X_{2t} + \beta_3 \ln X_{3t} + \beta_4 \ln X_{4t} + \beta_5 \ln X_{5t} + u_t$
   Restricted model under $H_0$:
   $$\ln Y_t = \beta_1 + \beta_2 \ln X_{2t} + \beta_3 \ln X_{3t} + \beta_4 \ln X_{4t} + \beta_4 \ln X_{5t} + u_t = \beta_1 + \beta_2 \ln X_{2t} + \beta_3 \ln X_{3t} + \beta_4 (\ln X_{4t} + \ln X_{5t}) + u_t$$
2. $F = \dfrac{(RSS_R - RSS_{UR})/1}{RSS_{UR}/18} = \dfrac{(R^2_{UR} - R^2_R)/1}{(1-R^2_{UR})/18} \sim F(1,\ 18)$ under $H_0$
3. Reject $H_0$ if p-value $\le \alpha$.
4. $F = \dfrac{(0.01394 - 0.0137)/1}{0.0137/18} = \dfrac{(0.9823 - 0.9820)/1}{(1-0.9823)/18} = 0.31$, p-value $= 0.5864$
5. $H_0: \beta_4 = \beta_5$ is not rejected: $\beta_4$ is not significantly different from $\beta_5$.
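The restricted-versus-unrestricted F tests above are straightforward to reproduce. A minimal Python sketch (not from the notes; scipy is assumed to be available) for the test of $H_0: \beta_4 = \beta_5 = 0$, using the RSS values quoted in step 4:

```python
# Minimal sketch: F test of H0: beta4 = beta5 = 0 via restricted vs
# unrestricted residual sums of squares (scipy assumed available).
from scipy.stats import f

RSS_R, RSS_UR = 0.01544, 0.0137   # restricted and unrestricted RSS
m, df = 2, 18                     # number of restrictions m, error df n - k

F = ((RSS_R - RSS_UR) / m) / (RSS_UR / df)
p_value = f.sf(F, m, df)          # upper-tail probability
print(f"F = {F:.2f}, p-value = {p_value:.4f}")   # F ~ 1.14, p ~ 0.34
```

Changing m, df and the RSS values reproduces the other F tests in this example as well.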
If the real retail price of chicken increases by 1% while the real disposable income per capita and the real retail prices of pork and beef remain constant, will the per capita consumption of chicken decrease by 0.75%?
1. $H_0: \beta_3 = -0.75$ versus $H_1: \beta_3 \neq -0.75$
2. t-test: $t = \dfrac{\hat\beta_3 - (-0.75)}{\hat\sigma_{\hat\beta_3}} \sim t(18)$ under $H_0$
   F-test: $F = \dfrac{(RSS_R - RSS_{UR})/1}{RSS_{UR}/18} = \dfrac{(R^2_{UR} - R^2_R)/1}{(1-R^2_{UR})/18} \sim F(1,\ 18)$ under $H_0$
3. t-test: Reject $H_0$ if $-0.75$ falls outside the 95% confidence interval for $\beta_3$.
   F-test: Reject $H_0$ if p-value $\le \alpha$.
4. t-test: 95% confidence interval for $\beta_3$: $(-0.73757,\ -0.27161)$
   F-test: $F = 4.90$, p-value $= 0.0401$
5. $H_0: \beta_3 = -0.75$ is rejected at the 5% significance level. The per capita consumption of chicken will not decrease by 0.75% if the real retail price of chicken increases by 1% while the real disposable income per capita and the real retail prices of pork and beef remain constant.

PREDICTION

Example: Demand for chicken (Gujarati & Porter, 2009, p. 220)

Consider the following demand function for chicken:
$$Y_t = \alpha X_{2t}^{\beta_2} X_{3t}^{\beta_3} e^{u_t}$$

Regression model in linear form:
$$\ln Y_t = \beta_1 + \beta_2 \ln X_{2t} + \beta_3 \ln X_{3t} + u_t$$

Fitted regression model (standard errors in parentheses):
$$\ln Y_t = \underset{(0.11618)}{2.03282} + \underset{(0.02469)}{0.45153}\,\ln X_{2t} - \underset{(0.06347)}{0.37221}\,\ln X_{3t} + \hat u_t$$

Suppose for 1983 we have:
Real disposable income of $2 700 per capita
Real retail price of chicken of 70 cents per pound

Calculate a 95% confidence interval for the mean consumption of chicken when the real disposable income is $2 700 per capita and the real retail price of chicken is 70 cents per pound:
Predicted value: $e^{4.019} = 55.64543$
95% confidence interval: $(e^{3.9919},\ e^{4.0461}) = (54.15769,\ 57.17404)$

Calculate a 95% confidence interval for the (individual) consumption of chicken in 1983:
Predicted value: $e^{4.019} = 55.64543$
95% confidence interval: $(e^{3.955},\ e^{4.083}) = (52.19569,\ 59.32317)$
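Both intervals above are computed on the ln scale and then exponentiated. The standard errors come from the usual formulas for mean and individual prediction, written here in the matrix notation of the next section. The following Python sketch is standard regression theory rather than a computation spelled out in the notes; numpy and scipy are assumed, and the function name is illustrative:

```python
# Minimal sketch: mean vs individual prediction intervals at a point x0.
import numpy as np
from scipy.stats import t as t_dist

def prediction_intervals(X, y, x0, alpha=0.05):
    """OLS fit plus 100(1-alpha)% intervals at x0 (X includes a 1s column)."""
    n, k = X.shape
    beta = np.linalg.solve(X.T @ X, X.T @ y)
    u = y - X @ beta
    s2 = (u @ u) / (n - k)                     # sigma-hat squared
    h = x0 @ np.linalg.solve(X.T @ X, x0)      # x0'(X'X)^{-1} x0
    y0 = x0 @ beta
    tc = t_dist.ppf(1 - alpha / 2, n - k)
    se_mean = np.sqrt(s2 * h)                  # mean prediction
    se_ind = np.sqrt(s2 * (1 + h))             # individual prediction
    return (y0,
            (y0 - tc * se_mean, y0 + tc * se_mean),
            (y0 - tc * se_ind, y0 + tc * se_ind))

# For the chicken model, x0 = np.array([1.0, np.log(2700), np.log(70)]);
# applying np.exp to the ln-scale endpoints returns the intervals quoted above.
```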
A9. MULTIPLE REGRESSION – MATRIX APPROACH TO REGRESSION ANALYSIS

TWO-VARIABLE REGRESSION MODEL

PRF: $Y_i = \beta_1 + \beta_2 X_i + u_i$
SRF: $Y_i = \hat\beta_1 + \hat\beta_2 X_i + \hat u_i$

$$\hat\beta_1 = \bar Y - \hat\beta_2 \bar X$$
$$\hat\beta_2 = \frac{n\sum X_i Y_i - \sum X_i \sum Y_i}{n\sum X_i^2 - \left(\sum X_i\right)^2} = \frac{\sum x_i y_i}{\sum x_i^2}$$
$$\hat\sigma^2 = \frac{\sum \hat u_i^2}{n-2}$$

THREE-VARIABLE REGRESSION MODEL

PRF: $Y_i = \beta_1 + \beta_2 X_{2i} + \beta_3 X_{3i} + u_i$
SRF: $Y_i = \hat\beta_1 + \hat\beta_2 X_{2i} + \hat\beta_3 X_{3i} + \hat u_i$

$$\hat\beta_1 = \bar Y - \hat\beta_2 \bar X_2 - \hat\beta_3 \bar X_3$$
$$\hat\beta_2 = \frac{\left(\sum y_i x_{2i}\right)\left(\sum x_{3i}^2\right) - \left(\sum y_i x_{3i}\right)\left(\sum x_{2i} x_{3i}\right)}{\left(\sum x_{2i}^2\right)\left(\sum x_{3i}^2\right) - \left(\sum x_{2i} x_{3i}\right)^2}$$
$$\hat\beta_3 = \frac{\left(\sum y_i x_{3i}\right)\left(\sum x_{2i}^2\right) - \left(\sum y_i x_{2i}\right)\left(\sum x_{2i} x_{3i}\right)}{\left(\sum x_{2i}^2\right)\left(\sum x_{3i}^2\right) - \left(\sum x_{2i} x_{3i}\right)^2}$$
$$\hat\sigma^2 = \frac{\sum \hat u_i^2}{n-3}$$

k-VARIABLE REGRESSION MODEL

PRF: $Y_i = \beta_1 + \beta_2 X_{2i} + \beta_3 X_{3i} + \ldots + \beta_k X_{ki} + u_i$
SRF: $Y_i = \hat\beta_1 + \hat\beta_2 X_{2i} + \hat\beta_3 X_{3i} + \ldots + \hat\beta_k X_{ki} + \hat u_i$

How do we now obtain expressions for the OLS estimators?

Consider the system of $n$ equations in the PRF:
$$\begin{aligned}
Y_1 &= \beta_1 + \beta_2 X_{21} + \beta_3 X_{31} + \ldots + \beta_k X_{k1} + u_1 \\
Y_2 &= \beta_1 + \beta_2 X_{22} + \beta_3 X_{32} + \ldots + \beta_k X_{k2} + u_2 \\
&\ \,\vdots \\
Y_n &= \beta_1 + \beta_2 X_{2n} + \beta_3 X_{3n} + \ldots + \beta_k X_{kn} + u_n
\end{aligned}$$

Write this system in matrix notation:
$$\begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{pmatrix} =
\begin{pmatrix} 1 & X_{21} & \cdots & X_{k1} \\ 1 & X_{22} & \cdots & X_{k2} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & X_{2n} & \cdots & X_{kn} \end{pmatrix}
\begin{pmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_k \end{pmatrix} +
\begin{pmatrix} u_1 \\ u_2 \\ \vdots \\ u_n \end{pmatrix}$$
that is, $\mathbf{y} = \mathbf{X}\boldsymbol\beta + \mathbf{u}$, with $\mathbf{y}$ of order $n \times 1$, $\mathbf{X}$ of order $n \times k$, $\boldsymbol\beta$ of order $k \times 1$ and $\mathbf{u}$ of order $n \times 1$.

Theorem: $\mathbf{u} \sim N(\mathbf{0},\ \sigma^2 \mathbf{I})$

Proof:
$$E(\mathbf{u}) = \begin{pmatrix} E(u_1) \\ E(u_2) \\ \vdots \\ E(u_n) \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{pmatrix} = \mathbf{0}$$
$$E(\mathbf{u}\mathbf{u}') = \begin{pmatrix} E(u_1^2) & E(u_1 u_2) & \cdots & E(u_1 u_n) \\ E(u_2 u_1) & E(u_2^2) & \cdots & E(u_2 u_n) \\ \vdots & \vdots & \ddots & \vdots \\ E(u_n u_1) & E(u_n u_2) & \cdots & E(u_n^2) \end{pmatrix} = \begin{pmatrix} \sigma^2 & 0 & \cdots & 0 \\ 0 & \sigma^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma^2 \end{pmatrix} = \sigma^2 \mathbf{I}$$
since $E(u_i^2) = \sigma^2$ and $E(u_i u_j) = 0$ for $i \neq j$. Since each $u_i$ is normally distributed, $\mathbf{u}$ is multivariate normal with these moments. ■

OLS ESTIMATION

SRF in matrix notation: $\mathbf{y} = \mathbf{X}\hat{\boldsymbol\beta} + \hat{\mathbf{u}}$

Theorem: The OLS estimator for $\boldsymbol\beta$ is $\hat{\boldsymbol\beta} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$.

Proof: The OLS estimators are obtained by minimizing $\sum \hat u_i^2$; in matrix notation, minimize $\hat{\mathbf{u}}'\hat{\mathbf{u}}$:
$$\hat{\mathbf{u}}'\hat{\mathbf{u}} = (\mathbf{y} - \mathbf{X}\hat{\boldsymbol\beta})'(\mathbf{y} - \mathbf{X}\hat{\boldsymbol\beta}) = \mathbf{y}'\mathbf{y} - \hat{\boldsymbol\beta}'\mathbf{X}'\mathbf{y} - \mathbf{y}'\mathbf{X}\hat{\boldsymbol\beta} + \hat{\boldsymbol\beta}'\mathbf{X}'\mathbf{X}\hat{\boldsymbol\beta} = \mathbf{y}'\mathbf{y} - 2\hat{\boldsymbol\beta}'\mathbf{X}'\mathbf{y} + \hat{\boldsymbol\beta}'\mathbf{X}'\mathbf{X}\hat{\boldsymbol\beta}$$
Take the derivative, set it equal to zero and solve:
$$\frac{\partial\, \hat{\mathbf{u}}'\hat{\mathbf{u}}}{\partial \hat{\boldsymbol\beta}} = -2\mathbf{X}'\mathbf{y} + 2\mathbf{X}'\mathbf{X}\hat{\boldsymbol\beta} = \mathbf{0}
\quad\therefore\quad \mathbf{X}'\mathbf{X}\hat{\boldsymbol\beta} = \mathbf{X}'\mathbf{y}
\quad\therefore\quad \hat{\boldsymbol\beta} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y} \qquad \blacksquare$$

Theorem: $\hat{\boldsymbol\beta} \sim N\!\left(\boldsymbol\beta,\ \sigma^2 (\mathbf{X}'\mathbf{X})^{-1}\right)$

Proof:
$$\hat{\boldsymbol\beta} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'(\mathbf{X}\boldsymbol\beta + \mathbf{u}) = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X}\boldsymbol\beta + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{u} = \boldsymbol\beta + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{u}$$
$$E(\hat{\boldsymbol\beta}) = E\!\left(\boldsymbol\beta + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{u}\right) = \boldsymbol\beta + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\,E(\mathbf{u}) = \boldsymbol\beta \quad \text{since } E(\mathbf{u}) = \mathbf{0}$$
$$\begin{aligned}
\operatorname{var\text{-}cov}(\hat{\boldsymbol\beta}) &= E\!\left[(\hat{\boldsymbol\beta} - \boldsymbol\beta)(\hat{\boldsymbol\beta} - \boldsymbol\beta)'\right] \\
&= E\!\left[\left((\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{u}\right)\left((\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{u}\right)'\right] \\
&= E\!\left[(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\,\mathbf{u}\mathbf{u}'\,\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\right] \\
&= (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\,E(\mathbf{u}\mathbf{u}')\,\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1} \\
&= (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\,\sigma^2\mathbf{I}\,\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1} \\
&= \sigma^2 (\mathbf{X}'\mathbf{X})^{-1} \qquad \blacksquare
\end{aligned}$$

INFERENCE

$$\hat\sigma^2 = \frac{\hat{\mathbf{u}}'\hat{\mathbf{u}}}{n-k}$$
$$R^2 = \frac{\hat{\boldsymbol\beta}'\mathbf{X}'\mathbf{y} - n\bar Y^2}{\mathbf{y}'\mathbf{y} - n\bar Y^2}$$
$$F = \frac{(\hat{\boldsymbol\beta}'\mathbf{X}'\mathbf{y} - n\bar Y^2)/(k-1)}{(\mathbf{y}'\mathbf{y} - \hat{\boldsymbol\beta}'\mathbf{X}'\mathbf{y})/(n-k)} = \frac{R^2/(k-1)}{(1-R^2)/(n-k)} \sim F(k-1,\ n-k)$$
$$t = \frac{\hat\beta_j - \beta_j}{\hat\sigma_{\hat\beta_j}} \sim t(n-k)$$

REFERENCES

Gujarati, D.N. & Porter, D.C. 2009. Basic Econometrics. Fifth edition. McGraw-Hill.
Steyn, A.G.W., Smit, C.F., Du Toit, S.H.C. & Strasheim, C. 1999. Modern Statistics in Practice. J.L. van Schaik Publishers.