Chapters 4, 16 & 17, Module 8: Simple and Multiple Linear Regression
Correlation and Regression Analysis
• Correlation (Week 10, Jul 17–23)
• Simple Linear Regression (Week 11, Jul 24–30)
• Multiple Linear Regression (Week 12, Jul 31 – Aug 6)

Correlation Analysis
▪ Correlation is a statistical term indicating a relationship between two variables. For example, temperature is correlated with the number of cars that will not start in the morning: as the temperature decreases, the number of cars that will not start in the morning increases.
▪ The sample correlation coefficient, denoted r, is a measure of the strength of a linear relationship between two quantitative variables x and y.
➢ If large values of x are associated with large values of y, or if as x increases the corresponding value of y tends to increase, then x and y are positively related.
➢ If small values of x are associated with large values of y, or if as x increases the corresponding value of y tends to decrease, then x and y are negatively related.

Correlation Analysis
▪ Definition: Suppose there are n pairs of observations (x1, y1), (x2, y2), ..., (xn, yn). The sample Pearson correlation coefficient for these n pairs is defined as the covariance divided by the standard deviations of the variables:

$r = \dfrac{S_{xy}}{\sqrt{S_x^2\,S_y^2}} = \dfrac{\sum (x_i-\bar{x})(y_i-\bar{y})}{\sqrt{SS_{xx}\,SS_{yy}}} = \dfrac{\sum x_i y_i - \frac{1}{n}\left(\sum x_i\right)\left(\sum y_i\right)}{\sqrt{\left[\sum x_i^2 - \frac{1}{n}\left(\sum x_i\right)^2\right]\left[\sum y_i^2 - \frac{1}{n}\left(\sum y_i\right)^2\right]}}$

This coefficient answers the question: what is the direction, and how strong is the association, between X and Y? (The corresponding population correlation coefficient is denoted by the Greek letter ρ, "rho".)

Correlation Analysis
▪ The value of r does not depend on the order of the variables and is independent of units.
▪ −1 ≤ r ≤ +1. r is exactly +1 if and only if all the ordered pairs lie on a straight line with positive slope; r is exactly −1 if and only if all the ordered pairs lie on a straight line with negative slope.
▪ If r is near 0, there is no evidence of a linear relationship, but x and y may be related in another way.
▪ Correlation between two variables does not imply causation.

Coefficient of Correlation
r near +1: strong positive linear relationship
r near 0: no linear relationship
r near −1: strong negative linear relationship

General Guidelines
  Strength     Negative r        Positive r
  weak         −0.3 to −0.1      0.1 to 0.3
  moderate     −0.7 to −0.3      0.3 to 0.7
  strong       −1.0 to −0.7      0.7 to 1.0
(r near 0: no linear relationship.)
Example: r = +0.85. Direction: positive correlation. Strength: a strong relationship.

Example 1: Income and Credit Score
Although income is not used in calculating a credit score, some evidence suggests that income is related to credit score. A random sample of adult consumers was obtained, and their credit score and yearly income level (in hundreds of thousands of dollars) were recorded. Calculate the sample correlation coefficient between credit score and yearly income, and interpret this value.
Applying the formula above to the sample gives r = 0.2794. Because r = 0.2794 < 0.3, there is a weak positive linear relationship between income level and credit score (as a person's income level increases, the credit score tends to increase as well).
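The shortcut formula translates directly into code. The slides use Excel, so the following is only a minimal sketch in Python; the function name and the toy data are illustrative assumptions, not part of the course material.

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation via the shortcut (computational) formula."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi ** 2 for xi in x)
    sum_y2 = sum(yi ** 2 for yi in y)
    s_xy = sum_xy - sum_x * sum_y / n      # covariance term S_xy
    ss_xx = sum_x2 - sum_x ** 2 / n        # SSxx
    ss_yy = sum_y2 - sum_y ** 2 / n        # SSyy
    return s_xy / math.sqrt(ss_xx * ss_yy)

# Toy illustration (hypothetical data, not the credit-score sample):
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 8.0, 9.8]
print(round(pearson_r(x, y), 4))   # close to +1: strong positive linear relationship
```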
Example 2: Work Experience and Hourly Wage
A researcher wants to explore the relationship between work experience and the hourly wage received by selected workers. The researcher obtained sample data on the hourly wage and work experience of the selected workers, and wants to know whether work experience is related to how much a worker is paid hourly. The researcher decides to conduct a correlation analysis to explore the direction and strength of the relationship between work experience and hourly wage.

  Years of Experience (X)   Hourly Wage (Y)
  3                         17
  4                         18
  1                         15
  5                         22
  2                         19
  4                         20
  7                         23

Example 2: Work Experience and Hourly Wage (continued)
The working table:

             X     Y     X²     Y²     XY
             3    17      9    289     51
             4    18     16    324     72
             1    15      1    225     15
             5    22     25    484    110
             2    19      4    361     38
             4    20     16    400     80
             7    23     49    529    161
  Total (∑)  26   134   120   2612    527

$S_{xy} = \sum x_i y_i - \dfrac{(\sum x_i)(\sum y_i)}{n} = 527 - \dfrac{(26)(134)}{7} = 29.29$

$S_x^2 = \sum x_i^2 - \dfrac{(\sum x_i)^2}{n} = 120 - \dfrac{(26)^2}{7} = 23.43, \qquad S_y^2 = \sum y_i^2 - \dfrac{(\sum y_i)^2}{n} = 2612 - \dfrac{(134)^2}{7} = 46.86$

$r = \dfrac{S_{xy}}{\sqrt{S_x^2\,S_y^2}} = \dfrac{29.29}{\sqrt{(23.43)(46.86)}} = 0.88$

There is a strong positive relationship between work experience and hourly wage. Thus, a higher wage is strongly associated with longer work experience (a worker with longer work experience is expected to receive a higher wage).

Correlation Using Excel
Data > Data Analysis > Correlation > OK > select the 2 columns of variables in the Input Range > check the box "Labels in First Row" > OK. Alternatively, use =CORREL(array1, array2). The resulting correlation matrix:

               Experience   Wage
  Experience   1
  Wage         0.883883     1

Drawing the scatter plot in Excel: select the 2 variables > Insert > Insert Scatter.
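As a cross-check on the Excel output, the same coefficient can be computed with NumPy. This is a sketch, not part of the course's Excel workflow:

```python
import numpy as np

experience = np.array([3, 4, 1, 5, 2, 4, 7])
wage = np.array([17, 18, 15, 22, 19, 20, 23])

# np.corrcoef returns the full correlation matrix, like Excel's Correlation tool;
# the off-diagonal entry is r, the same value =CORREL() reports.
r = np.corrcoef(experience, wage)[0, 1]
print(round(r, 6))   # 0.883883
```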
Regression Analysis
▪ If we are interested only in determining whether a relationship between two quantitative variables exists, we employ correlation analysis. When we designate one variable as dependent and the other(s) as independent, we use regression analysis, which is used to predict the value of the dependent variable based on the independent variable(s).
  Dependent variable: denoted Y. Independent variables: denoted X1, X2, ..., Xk.
▪ The linear equation that relates the dependent and independent variables is called the regression model.
▪ Deterministic model: an equation or set of equations that allows us to fully determine the value of the dependent variable from the values of the independent variables. Deterministic models are usually unrealistic. E.g., is it reasonable to believe that we can determine the selling price of a house solely based on its size?
▪ Probabilistic model: a method used to capture the randomness that is part of a real-life process. E.g., do all houses of the same size (measured in square feet) sell for the same price?

A Model
▪ To construct a probabilistic model, we start with a deterministic model that approximates the relationship we want to model, and add a random term that measures the error of the deterministic component. E.g., the cost of building a new house is about $100 per square foot and most lots sell for about $100,000. Hence the approximate selling price (y) would be the deterministic model
  y = $100,000 + ($100/ft²)x,
where x is the size of the house (independent variable) in square feet. In this model, the price of the house is completely determined by its size.
▪ In real life, however, the house price will vary even among houses of the same size. We therefore represent the price of a house as a function of its size in the probabilistic model
  Price = 100,000 + 100(Size) + ε,
where ε (the Greek letter epsilon) is the random term (a.k.a. the error variable). It is the difference between the actual selling price and the estimated price based on the size of the house. Its value will vary from house sale to house sale, even if the square footage (i.e., x) remains the same: houses with the same square footage sell at different price points (e.g., décor options, lot location, ...).

Simple Linear Regression Model
A straight-line model with one independent variable is called a first-order linear model or a simple linear regression model. It is written as
  y = β0 + β1x + ε,
where y is the dependent variable, x is the independent variable, β0 is the y-intercept (constant), β1 is the slope of the line (regression coefficient), and ε is the error variable. Note that both β0 and β1 are population parameters, which are usually unknown and hence estimated from the data.
  x: independent = predictor = explanatory = covariate = input = effect
  y: dependent = response = output

Simple Linear Regression Model (least squares estimates)
The least squares estimates of the y-intercept β0 and the slope β1 of the true regression model are:

$\hat{\beta}_1 = \dfrac{S_{xy}}{S_x^2} = \dfrac{\sum x_i y_i - \frac{(\sum x_i)(\sum y_i)}{n}}{\sum x_i^2 - \frac{(\sum x_i)^2}{n}}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x} = \dfrac{\sum y_i - \hat{\beta}_1 \sum x_i}{n}$

The estimated regression model/line is $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$, where β̂0 is the y-intercept and β̂1 is the slope (= rise/run).

The Least Squares Method
The line of best fit, or estimated regression line, is obtained using the principle of least squares: minimize the sum of the squared deviations (errors), i.e., the vertical distances from the observed points to the regression line; these differences are called residuals. The principle of least squares produces an estimated regression line such that the sum of all squared vertical distances,
  $\sum e_i^2 = \sum (y_i - \hat{y}_i)^2$,
is a minimum. Here ŷ ("y hat") is the value of y determined by the line.

Interpretation of the Slope, Intercept, and R²
▪ The intercept is the estimated average value of y when the value of x is zero.
▪ The slope is the estimated change (increase or decrease) in the average value of y as a result of a one-unit increase in x.
Coefficient of determination, denoted R²:
▪ The coefficient of determination, calculated by squaring the coefficient of correlation (R² = r²), measures the proportion of the variation in the dependent variable that is explained by the variation in the independent variable. In other words: what percentage of the variation in the y-variable can be explained by the variation in the x-variable?
▪ In general, the higher the value of R², the better the model fits the data. There is no cutoff value of R² that indicates that we have a good model.
▪ 0 ≤ R² ≤ 1.

Example 16.1: The annual bonuses ($1,000s) of six employees with different years of experience were recorded as follows. We wish to determine the straight-line relationship between annual bonus and years of experience.

  Years of experience (x)   1   2   3   4   5   6
  Annual bonus (y)          6   1   9   5   17  12

To apply the shortcut formulas, we compute the four summations, the covariance, and the variance of x using a calculator. The least squares line is
  ŷ = 0.934 + 2.114x.
What is the predicted annual bonus of an employee who has 5 years of experience?
  ŷ = 0.934 + 2.114(5) = 11.504
The predicted annual bonus is 11.504 × $1,000 = $11,504. Note: the regression equation should not be used to make predictions for x values that fall outside the range of values covered by the original data.
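The calculation in Example 16.1 can be reproduced with a short script. A minimal sketch, assuming only the data from the slide (the slides themselves use a calculator or Excel):

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6], dtype=float)    # years of experience
y = np.array([6, 1, 9, 5, 17, 12], dtype=float)  # annual bonus ($1,000s)
n = len(x)

# Shortcut formulas for the least squares estimates
s_xy = (x * y).sum() - x.sum() * y.sum() / n     # 37.0
s_x2 = (x ** 2).sum() - x.sum() ** 2 / n         # 17.5
b1 = s_xy / s_x2                                 # slope ~ 2.114
b0 = y.mean() - b1 * x.mean()                    # intercept ~ 0.934

print(f"y-hat = {b0:.3f} + {b1:.3f}x")
print(f"predicted bonus at x = 5: {b0 + b1 * 5:.3f}")
# prints 11.505 with unrounded coefficients; 11.504 when using the rounded line
```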
Example: House Price and its Size
A real estate agent wishes to examine the relationship between the selling price of a home (in $1,000s) and its size (in square feet). A random sample of 10 houses is selected.

  House price (y)   House size (x)
  245               1400
  312               1600
  279               1700
  308               1875
  199               1100
  219               1550
  405               2350
  324               2450
  319               1425
  255               1700

The estimated regression line is
  Price = 98.24833 + 0.10977(Size),
and R² = 0.5808: 58.08% of the variation in house prices is explained by the variation in house size.

Example: House Price and its Size (interpreting the intercept and the slope)
• The intercept, b0 = 98.24833. One interpretation would be that when x = 0 the house price is 98.24833 × $1,000 = $98,248.33. However, in this case the intercept is probably meaningless: because our sample did not include any house with a size of zero square feet, we have no basis for interpreting b0.
• The slope, b1 = 0.10977, tells us that the average value of a house increases by 0.10977 × $1,000 = $109.77 for each additional square foot of house size.

Predict the price for a house of 2,000 square feet:
  Price = 98.24833 + 0.10977(2000) = 317.78833
The predicted price for a house of 2,000 square feet is 317.78833 × $1,000 = $317,788.33. The forecast would not be reliable if the house size were an outlier relative to the sample, such as 4,000 sq ft.

How much is the price expected to change if the house size increases by 1,400 square feet?
  Expected change = 0.10977(1400) = 153.678, i.e., 153.678 × $1,000 = $153,678.

Regression Using Excel
Data > Data Analysis > Regression. Selecting "Line Fit Plots" on the Regression dialog box will produce a scatter plot and the regression line.
To show the regression/trend line, R², and the regression equation on the chart: right-click on any point in the scatter plot > Add Trendline > select Linear, then scroll down to check the two boxes:
  Display Equation on chart
  Display R-squared value on chart
and close the pane on the right. You can then drag the equation and R² with the mouse to any place on the chart where they show clearly.
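The same output Excel produces for this example can be reproduced with NumPy. A sketch, under the assumption that the ten observations above are the whole sample:

```python
import numpy as np

size = np.array([1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700], dtype=float)
price = np.array([245, 312, 279, 308, 199, 219, 405, 324, 319, 255], dtype=float)

# np.polyfit with degree 1 performs the same least squares fit as Excel's Regression tool
b1, b0 = np.polyfit(size, price, 1)
r2 = np.corrcoef(size, price)[0, 1] ** 2          # R^2 = r^2 in simple regression

print(f"Price = {b0:.5f} + {b1:.5f} Size")        # Price = 98.24833 + 0.10977 Size
print(f"R^2 = {r2:.4f}")                          # 0.5808
print(f"Predicted price at 2000 sq ft: {b0 + b1 * 2000:.5f}")  # 317.78833
```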
Example 16.2 (data file: Xm16-02)
Car dealers use the "Blue Book" to determine the value of used cars that their customers trade in when purchasing new cars. The book, which is published monthly, lists the trade-in values for all basic models of cars, and provides alternative values for each car model according to its condition and optional features. The values are determined on the basis of the average price paid at recent used-car auctions, the source of supply for many used-car dealers. However, the Blue Book does not indicate the value determined by the odometer reading, despite the fact that a critical factor for used-car buyers is how far the car has been driven. To examine this issue, a used-car dealer randomly selected 100 three-year-old Toyota Camrys that were sold at auction during the past month. The dealer recorded the price ($1,000s) and the number of miles (thousands) on the odometer. The dealer wants to find the regression line/model.

Part of the dataset:
  Price   Odometer
  14.6    37.4
  14.1    44.8
  14.0    45.8
  15.6    30.9
  15.6    31.7
  14.7    34.0
  14.5    45.9
  15.7    19.1
  15.1    40.1
  14.8    40.2
  15.2    32.4

Example 16.2 – Using Excel (COMPUTE)
Running Data > Data Analysis > Regression gives the estimated line ŷ = 17.250 − 0.0669x, with standard error of estimate sε = 0.3266 and standard error of the slope sb1 = 0.00497. The predicted car price will typically differ from the actual price by about 0.3266 × $1,000 = $326.60. Selecting "Line Fit Plots" on the Regression dialog box will produce a scatter plot of the data and the regression line.

Example 16.2 (INTERPRET)
The slope coefficient, b1 = −0.0669: for each additional mile on the odometer, the price decreases on average by $0.0669, or 6.69¢. Equivalently, for each additional 1,000 miles on the odometer, the price decreases on average by 0.0669 × $1,000 = $66.90.
The intercept, b0 = 17.250. One interpretation would be that when x = 0 (no miles on the car, i.e., the car was not driven at all) the selling price is $17,250. However, in this case the intercept is probably meaningless: because our sample did not include any cars with zero miles on the odometer, we have no basis for interpreting b0. As a general rule, we cannot determine the value of ŷ for a value of x that is far outside (an outlier relative to) the range of the sample values of x.
Coefficient of determination: R² = 0.6483. This means 64.83% of the variation in the auction selling prices (y) is explained by the variation in the odometer readings (x). The remaining 35.17% is unexplained, i.e., due to error.

Example 16.2 (prediction)
To predict the selling price of a car with 40 (thousand) miles on it:
  ŷ = 17.250 − 0.0669x = 17.250 − 0.0669(40) = 14.574
We call this value (14.574 × $1,000 = $14,574) a point prediction. A different actual selling price is to be expected, so we can also estimate the selling price in terms of an interval (beyond the scope of this course).
Note: Suppose the regression output shows r² = 0.65 and the slope = −0.07. What is the linear correlation coefficient, r? (r takes the sign of the slope: r = −√0.65 ≈ −0.81.)
Note: If the question does not state how many decimals to round the result to, round to 2 decimals.

Testing the Slope, β1
▪ We can draw inferences about the population slope β1 from the sample slope b1.
▪ The process of testing hypotheses about β1 is identical to the process of testing any other parameter. We begin with the hypotheses. We can conduct one- or two-tail tests of β1; most often, we perform a two-tail test.
▪ The null and alternative hypotheses are:
  H0: β1 = 0 (there is no linear relationship)
  H1: β1 ≠ 0 (there is a linear relationship)
▪ Test statistic:
  $t = \dfrac{b_1 - \beta_1}{s_{b_1}}$, where $s_{b_1} = \dfrac{s_\varepsilon}{\sqrt{(n-1)s_x^2}}$ is the standard deviation (standard error) of b1.
If the error variable (ε) is normally distributed, the test statistic has a Student t-distribution with n − 2 degrees of freedom.

Example 16.4 (using Example 16.2)
Test, at the 5% significance level, whether there is a linear relationship between the price and the odometer reading.
  H0: β1 = 0
  H1: β1 ≠ 0 (claim; if the null hypothesis is true, no linear relationship exists)
The rejection region is |t| > t.025, 98 = 1.984.
Test statistic, with sx² = 43.509 and sε = 0.3265 from the Excel output:
  $s_{b_1} = \dfrac{s_\varepsilon}{\sqrt{(n-1)s_x^2}} = \dfrac{0.3265}{\sqrt{99(43.509)}} = 0.00497, \qquad t = \dfrac{b_1 - \beta_1}{s_{b_1}} = \dfrac{-0.0669 - 0}{0.00497} = -13.46$

Example 16.4 (INTERPRET)
Although the F-statistic is mainly used in multiple regression, it can be used in simple regression as an alternative to the t-statistic.
Decision: The value of the test statistic, t = −13.46 (−13.44 in the Excel output; the difference is rounding), is less than the critical value −1.984 and lies in the rejection region. Equivalently, we found p ≈ 0 < α = 0.05. Therefore, we reject the null hypothesis in favor of H1 at α = 0.05.
Conclusion: There is overwhelming evidence to infer that a significant linear relationship exists. What this means is that the odometer reading may affect the auction selling price of the cars.
Note: Regression analysis can only show that a statistical relationship exists. We cannot infer that one variable causes another.
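The hand computation in Example 16.4 is easy to script. A sketch assuming the summary numbers from the slide (b1 = −0.0669, sε = 0.3265, sx² = 43.509, n = 100); SciPy is used only for the critical value and the p-value:

```python
import math
from scipy import stats

n, b1 = 100, -0.0669
s_e, s_x2 = 0.3265, 43.509               # standard error of estimate, sample variance of x

s_b1 = s_e / math.sqrt((n - 1) * s_x2)   # ~0.00497, standard error of the slope
t = (b1 - 0) / s_b1                      # ~ -13.4 (the slide gets -13.46 using rounded s_b1)

df = n - 2
t_crit = stats.t.ppf(1 - 0.05 / 2, df)   # 1.984 for a two-tail test at alpha = .05
p_value = 2 * stats.t.sf(abs(t), df)     # ~ 0

print(f"t = {t:.2f}, critical value = ±{t_crit:.3f}, p-value = {p_value:.2e}")
# |t| > 1.984 and p < .05: reject H0, a significant linear relationship exists
```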
Testing the Slope (one-tail tests)
If we wish to test for a negative or a positive linear relationship, we conduct a one-tail test, i.e., our research hypothesis becomes:
  H1: β1 < 0 (testing for a negative slope), or
  H1: β1 > 0 (testing for a positive slope).
Of course, the null hypothesis remains H0: β1 = 0. In this case the p-value is the two-tail p-value divided by 2; using Excel's p-value for Example 16.4, this is still approximately 0. (Remember that Excel reports the two-tail p-value.)

Required Conditions
For these regression methods to be valid, the following four conditions on the error variable (ε) must be met:
• The probability distribution of ε is normal.
• The mean of the distribution is 0; that is, E(ε) = 0.
• The standard deviation of ε is σε, which is a constant regardless of the value of x.
• The value of ε associated with any particular value of y is independent of the ε associated with any other value of y.

Chapter 17: Multiple Regression
• Model and Required Conditions
• Estimating the Coefficients and Assessing the Model
• Regression Diagnostics I
• Regression Diagnostics II (Time Series)

Multiple Regression
▪ The simple linear regression model was used to analyze how one interval variable (the dependent variable y) is related to one other interval variable (the independent variable x).
▪ Multiple regression allows for any number of independent variables. We expect to develop models that fit the data better than a simple linear regression model would.
▪ The data for a simple linear regression problem consist of n observations (xi, yi) of two variables. Data for multiple linear regression consist of the value of a response variable y and k explanatory variables (x1, x2, ..., xk) on each of n cases. We write the data, and enter them into software, in the form:

  Individual   x1    x2    ...   xk    y
  1            x11   x12   ...   x1k   y1
  2            x21   x22   ...   x2k   y2
  ⁞            ⁞     ⁞           ⁞     ⁞
  n            xn1   xn2   ...   xnk   yn

The Model
We now assume we have k independent variables potentially related to the one dependent variable. This relationship is represented by the first-order linear equation
  y = β0 + β1x1 + β2x2 + ... + βkxk + ε,
where y is the dependent variable, x1, ..., xk are the independent variables, β0, β1, ..., βk are the coefficients, and ε is the error variable. The coefficient βi (i = 1, ..., k) has the following interpretation: it represents the average change (increase or decrease) in the response variable y when the independent variable xi increases by one unit and all other x variables are held constant.
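Estimating such a model amounts to solving a least squares problem with a column of ones standing in for the intercept. A minimal sketch with made-up data (the course itself uses Excel's Regression tool; the variable names and coefficients here are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: n = 50 cases, k = 2 explanatory variables
n = 50
x1 = rng.uniform(0, 10, n)
x2 = rng.uniform(0, 5, n)
y = 3.0 + 1.5 * x1 - 2.0 * x2 + rng.normal(0, 1, n)   # true model plus error

# Design matrix: a leading column of ones yields the intercept b0
X = np.column_stack([np.ones(n), x1, x2])
b, *_ = np.linalg.lstsq(X, y, rcond=None)

print(f"b0 = {b[0]:.2f}, b1 = {b[1]:.2f}, b2 = {b[2]:.2f}")
# Estimates should land near the true coefficients 3.0, 1.5, -2.0
```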
The Model (response surface)
▪ In the simple regression model, with one independent variable, we drew a straight regression line. When there is more than one independent variable in the regression model, we refer to the graphical depiction of the equation as a response surface rather than a straight line. With k = 2, the response surface is a plane; whenever k > 2, we cannot draw the response surface.

Required Conditions
For these regression methods to be valid, the following four conditions on the error variable (ε) must be met:
• The probability distribution of the error variable (ε) is normal.
• The mean of the error variable is 0.
• The standard deviation of ε is σε, which is a constant.
• The errors are independent.

Estimating the Coefficients
The estimated multiple regression equation is expressed as
  ŷ = b0 + b1x1 + b2x2 + ... + bkxk.
We will use computer output to:
  Assess the model: how well does it fit the data? Are any required conditions violated?
  Employ the model: interpret the coefficients; estimate the expected value of the dependent variable.

Regression Analysis Steps
1. Use a computer and software to generate the coefficients and the statistics used to assess the model.
2. Diagnose violations of required conditions. If there are problems, attempt to remedy them.
3. Assess the model's fit: standard error of estimate, coefficient of determination, F-test of the analysis of variance.
4. If steps 1, 2, and 3 are OK, use the model to predict or estimate the expected value of the dependent variable.

File name in Assignment 3
The Excel file name must be "Lastname, Firstname A3". In A3, the Excel file should include 1 sheet.

Chapter-Opening Example (data file: Xm17-00)
GENERAL SOCIAL SURVEY: VARIABLES THAT AFFECT INCOME
The Chapter 16 opening example showed, using the General Social Survey, that income and education are linearly related. This raises the question: what other variables affect one's income? To answer this question, we need to expand the simple linear regression technique used in the previous chapter to allow for more than one independent variable. Here is a list of selected variables the General Social Survey created:
1. Age (AGE): For most people, income increases with age.
2. Years of education (EDUC): It is possible that education and income are linearly related.
3. Hours of work per week (HRS1): More hours of work should produce more income.
4. Spouse's hours of work (SPHRS1): It is possible that, if one's spouse works more and earns more, the other spouse may choose to work less and thus earn less.
5. Number of family members earning money (EARNRS): As with SPHRS1, if more family members earn income, there may be less pressure on the respondent to work harder.
6. Number of children (CHILDS): Children are expensive, which may encourage their parents to work harder and thus earn more.

Chapter-Opening Example – Using Excel
Data > Data Analysis > Regression. From the output, the estimated regression model is
  ŷ = −110186.4004 + 921.9746x1 + 5780.6634x2 + 1095.5988x3 − 238.9901x4 + 149.7858x5 + 469.3977x6,
where y = income, x1 = age, x2 = education, x3 = hours of work, x4 = spouse's hours of work, x5 = number of family members earning money, and x6 = number of children.
Does there appear to be a significant linear relationship between income and at least one of the 6 independent variables?
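For making predictions later, the estimated equation can be wrapped in a small function. A sketch only: the coefficients are copied from the Excel output above, while the function name and the example inputs are hypothetical.

```python
# Coefficients b0..b6 from the Excel output above
B = [-110186.4004, 921.9746, 5780.6634, 1095.5988, -238.9901, 149.7858, 469.3977]

def predict_income(age, educ, hrs, sphrs, earnrs, childs):
    """Point prediction of income from the estimated multiple regression model."""
    x = [1, age, educ, hrs, sphrs, earnrs, childs]   # leading 1 pairs with the intercept
    return sum(b * xi for b, xi in zip(B, x))

# Hypothetical respondent: 40 years old, 16 years of education, works 45 hours/week,
# spouse works 20 hours/week, 2 earners in the family, 1 child
print(round(predict_income(40, 16, 45, 20, 2, 1), 2))   # ~64474.31
```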
Model Assessment
We will assess the estimated model in three ways: the standard error of estimate, the coefficient of determination, and the F-test of the analysis of variance.
In multiple regression, the standard error of estimate is defined as
  $s_\varepsilon = \sqrt{\dfrac{SSE}{n-k-1}}$,
where SSE is the sum of squares for error, n is the sample size, and k is the number of independent variables in the model. From the Excel output, sε = 35901.56. This standard error of estimate seems quite large; however, we use it to compare estimated models. The estimated model with the smallest sε is the best one (the closer the data values are to the regression line).

Model Assessment (coefficient of determination)
Again, the coefficient of determination is
  $R^2 = 1 - \dfrac{SSE}{\sum (y_i - \bar{y})^2}$.
Here R² = 34.34%. This means that 34.34% of the variation in income is explained by the six independent variables, but 65.66% remains unexplained.
The adjusted R² is the coefficient of determination adjusted for the sample size n and the number of independent variables k:
  Adjusted $R^2 = 1 - \dfrac{SSE/(n-k-1)}{\sum (y_i - \bar{y})^2/(n-1)} = 1 - (1-R^2)\dfrac{n-1}{n-k-1}$.
In the income model, the adjusted R² = 33.44%. We use it in the multiple regression model.

Testing the Validity of the Model
In a multiple regression model (i.e., more than one independent variable), we utilize an analysis of variance (ANOVA) technique to test the overall validity of the model. The hypotheses are:
  H0: β1 = β2 = ... = βk = 0 (the model is not valid/significant)
  H1: At least one βi is not equal to zero.
▪ If the null hypothesis is true, none of the independent variables is linearly related to y, and so the model is invalid.
▪ If at least one βi is not equal to 0, the model does have some validity.

ANOVA table for regression analysis (here n = 446):

  Source of variation   Degrees of freedom   Sums of squares   Mean squares            F-statistic
  Regression            k                    SSR               MSR = SSR/k             F = MSR/MSE
  Error                 n − k − 1            SSE               MSE = SSE/(n − k − 1)
  Total                 n − 1                SST

A large value of F indicates that most of the variation in y is explained by the regression equation and that the model is valid; a small value of F indicates that most of the variation in y is unexplained.

Explained and Unexplained Variation
For each observation, the total deviation splits into an explained and an unexplained part (illustrated in the slides with student grades, y, versus number of study hours, x):
  $SST = \sum (y_i - \bar{y})^2$ (total variation),
  $SSR = \sum (\hat{y}_i - \bar{y})^2$ (explained, the predicted portion),
  $SSE = \sum (y_i - \hat{y}_i)^2$ (unexplained, the unpredicted portion),
with SST = SSR + SSE.

Testing the Validity of the Model (continued)
Our rejection region is F > Fα, k, n−k−1 = F.05, 6, 439 ≈ F.05, 6, ∞ = 2.10 (Table 6). Note that v1 = k and v2 = n − k − 1.
Decision: We reject H0 in favor of H1 because the F-statistic = 38.26 > 2.10, or equivalently because p ≈ 0 < α = .05.
Conclusion: There is a great deal of evidence to infer that there is a significant linear relationship between income and at least one of the 6 independent variables.
Notes:
▪ F is always zero or positive.
▪ Large values of the F-statistic are evidence against H0.
▪ The F-test is upper one-sided.

Relationship among SSE, sε, R², and F

  SSE     sε      R²           F       Assessment of model
  0       0       1            ∞       Perfect
  small   small   close to 1   large   Good
  large   large   close to 0   small   Poor
                  0            0       Invalid

Once we are satisfied that the model fits the data as well as possible, and that the required conditions are satisfied, we can interpret and test the individual coefficients and use the model for prediction.
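The F-statistic and the adjusted R² reported above can be recovered from R², n, and k alone, which makes a quick sanity check of the Excel output possible. A sketch using the income model's numbers:

```python
from scipy import stats

n, k, r2 = 446, 6, 0.3434

# F = MSR/MSE expressed through R^2
F = (r2 / k) / ((1 - r2) / (n - k - 1))           # ~38.27 (38.26 with unrounded R^2)
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)     # 0.3344

F_crit = stats.f.ppf(0.95, k, n - k - 1)          # ~2.11 (Table 6 gives 2.10 at v2 = inf)
p_value = stats.f.sf(F, k, n - k - 1)             # ~ 0

print(f"F = {F:.2f} (critical {F_crit:.2f}), adjusted R^2 = {adj_r2:.4f}, p = {p_value:.1e}")
```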
Interpreting the Coefficients
Intercept: b0 = −110186.40. This is the average income when all the independent variables are zero. As we observed in Chapter 16, it is often misleading to try to interpret this value, particularly if 0 is outside the range of values of the independent variables (as is the case here).
Age, x1: b1 = 921.97. For each additional year of age, income increases on average by $921.97, assuming that the other independent variables in this model are held constant.
Education, x2: b2 = 5780.66. For each additional year of education, income increases on average by $5,780.66, assuming all other independent variables in this model are held constant.
Hours of work, x3: b3 = 1095.60. This is the average increase in annual income for each additional hour of work per week, keeping the other independent variables fixed.
Spouse's hours of work, x4: b4 = −238.99. For each additional hour a spouse works per week, income decreases on average by $238.99 when the other variables are held constant.
Number of family members earning income, x5: b5 = 149.79. For each additional family member earning money, annual income increases on average by $149.79, assuming that the other independent variables are held constant.
Number of children, x6: b6 = 469.40. For each additional child, annual income increases on average by $469.40, assuming that the other independent variables are held constant.

Testing the Coefficients
For each independent variable, we can test whether there is enough evidence of a linear relationship between that independent variable and the dependent variable in the entire population:
  H0: βi = 0 (for i = 1, 2, ..., 6)
  H1: βi ≠ 0
Test statistic: $t = \dfrac{b_i - \beta_i}{s_{b_i}}$, with n − k − 1 degrees of freedom.

Test of β1 (coefficient of age): t = 6.16, p-value ≈ 0. We reject H0 in favor of H1.
Test of β2 (coefficient of education): t = 9.57, p-value ≈ 0. We reject H0 in favor of H1.
Test of β3 (coefficient of hours of work per week): t = 9.34, p-value ≈ 0. We reject H0 in favor of H1.
Test of β4 (coefficient of spouse's hours of work per week): t = −1.88, p-value = .061. We do not reject H0.
Test of β5 (coefficient of number of earners in the family): t = .05, p-value = .960. We do not reject H0.
Test of β6 (coefficient of number of children): t = .387, p-value = .699. We do not reject H0.
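Each of these p-values is just the two-tail tail area of a t-distribution with n − k − 1 = 439 degrees of freedom, which is easy to confirm. A short sketch using the t-statistics from the Excel output:

```python
from scipy import stats

df = 446 - 6 - 1    # n - k - 1 = 439

# t statistics for b1..b6 from the Excel output
t_stats = {"age": 6.16, "educ": 9.57, "hrs": 9.34,
           "sphrs": -1.88, "earnrs": 0.05, "childs": 0.387}

for name, t in t_stats.items():
    p = 2 * stats.t.sf(abs(t), df)       # two-tail p-value
    print(f"{name:7s} t = {t:6.2f}  p = {p:.3f}")
# sphrs gives p = 0.061, earnrs 0.960, childs 0.699, matching the slide
```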
Testing the Coefficients (INTERPRET)
Conclusion: There is sufficient evidence at the 5% significance level to infer that each of the following variables is linearly related to income:
• Age
• Education
• Number of hours of work per week
In this model, there is not enough evidence to conclude that each of the following variables is linearly related to income:
• Spouse's number of hours of work per week
• Number of earners in the family
• Number of children
Note that this may mean that there is no linear relationship between these three independent variables and income. However, it may also mean that such a linear relationship exists but, because of a condition called multicollinearity, the t-test revealed no linear relationship. We discuss multicollinearity next.

Using the Regression Equation for Prediction
As we did with simple linear regression, we can predict the income of a 50-year-old respondent with 12 years of education, who works 40 hours per week, whose spouse also works 40 hours per week, with 2 earners in the family and 2 children:
  ŷ = −110186.4004 + 921.9746(50) + 5780.6634(12) + 1095.5988(40) − 238.9901(40) + 149.7858(2) + 469.3977(2) ≈ $40,783.01

Regression Diagnostics I
▪ Multiple regression models have a problem that simple regressions do not have, namely multicollinearity. It happens when the independent variables are highly correlated.
▪ The adverse effect of multicollinearity is that the estimated regression coefficients of the independent variables that are correlated tend to have large sampling errors.
▪ A consequence of multicollinearity is that when the coefficients are tested, the t-statistics will be small, which leads to the inference that there is no linear relationship between the affected independent variables and the dependent variable. In some cases, this inference will be wrong.
▪ Fortunately, multicollinearity does not affect the F-test of the analysis of variance.

Multicollinearity
To illustrate, we use the General Social Survey of 2012. When we conducted a regression analysis similar to the chapter-opening example, we found that the number of children in the family was not statistically significant at the 5% significance level. However, when we tested the coefficient of correlation between income and number of children, we found it to be statistically significant. How do we explain this apparent contradiction between the insignificant multiple regression t-test of the coefficient of number of children, β6, and the significant correlation between number of children and income? The answer is multicollinearity.
▪ There is a relatively high degree of correlation between the number of family members who earn income, x5, and the number of children, x6. The t-test of the correlation between number of earners and number of children is significant. This result should not be surprising, as the additional earners in a family are very likely its children.
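Multicollinearity is commonly screened by examining the correlations among the predictors, or by variance inflation factors (VIFs). A sketch with made-up predictor data (VIFs are not part of the slides): the VIF of predictor j is 1/(1 − Rj²), where Rj² comes from regressing xj on the other predictors.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200
earners = rng.integers(1, 4, n).astype(float)        # hypothetical x5
children = earners + rng.normal(0, 0.5, n)           # hypothetical x6, correlated with x5
age = rng.uniform(20, 65, n)                         # a third, unrelated predictor
X = np.column_stack([age, earners, children])

def vif(X, j):
    """Variance inflation factor of column j: regress x_j on the other columns."""
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(X)), others])   # add an intercept column
    coef, *_ = np.linalg.lstsq(A, X[:, j], rcond=None)
    resid = X[:, j] - A @ coef
    r2 = 1 - resid.var() / X[:, j].var()
    return 1 / (1 - r2)

for j, name in enumerate(["age", "earners", "children"]):
    print(f"VIF({name}) = {vif(X, j):.2f}")
# The two correlated predictors show inflated VIFs; age stays near 1
```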
Multicollinearity (continued)
▪ Another problem caused by multicollinearity is the interpretation of the coefficients. We interpret a coefficient as measuring the change in the dependent variable when the corresponding independent variable increases by one unit while all the other independent variables are held constant. This interpretation may be impossible when the independent variables are highly correlated, because when one independent variable increases by one unit, some or all of the other independent variables will change as well.

Regression Diagnostics II
▪ One of the required conditions of regression analysis is that the errors are independent (see the Required Conditions above).
▪ With time series data (i.e., when the data are gathered sequentially over a series of time periods), there is a high possibility of violating the independence condition on the errors.
▪ When this condition is violated, we have the problem called autocorrelation: a condition in which a relationship exists between consecutive residuals, i.e., between ei and ei−1 (where i is the time period).
▪ The Durbin-Watson test (beyond the scope of this course) allows us to determine whether there is evidence of first-order autocorrelation.

Regression Diagnostics II (continued)
▪ A plot of residuals against time can reveal a serious problem: a strong relationship between consecutive values of the residuals indicates that the requirement that the errors be independent has been violated.
▪ To confirm this diagnosis, we can use Excel to calculate the Durbin-Watson statistic.
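The Durbin-Watson statistic itself is simple to compute from the residuals: d = Σ(ei − ei−1)² / Σei², with d near 2 indicating no first-order autocorrelation and d near 0 indicating strong positive autocorrelation. A sketch on simulated residuals (the slides use Excel for this step):

```python
import numpy as np

def durbin_watson(e):
    """d = sum of squared successive residual differences over sum of squared residuals."""
    e = np.asarray(e, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

rng = np.random.default_rng(0)
independent = rng.normal(0, 1, 100)                 # independent errors: d should be near 2
autocorrelated = np.cumsum(rng.normal(0, 1, 100))   # strongly related consecutive values

print(f"d (independent)    = {durbin_watson(independent):.2f}")    # ~2
print(f"d (autocorrelated) = {durbin_watson(autocorrelated):.2f}") # well below 2
```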