Chapter 10.1 — Inference for Simple Linear Regression

Stat 226 – Introduction to Business Statistics I
Spring 2009
Professor: Dr. Petrutza Caragea
Section A, Tuesdays and Thursdays 9:30-10:50 a.m.

Chapter 10, Section 10.1
Inference for simple linear regression
Is the linear relationship between x and y significant or not?

Do New Jersey banks serve minority communities?

Financial institutions have a legal and social responsibility to serve all communities. Do banks adequately serve both inner-city and suburban neighborhoods, both poor and wealthy communities? In New Jersey, banks have been charged with withdrawing from urban areas with a high percentage of minorities. To examine this charge, a regional New Jersey newspaper, the Asbury Park Press, compiled county-by-county data on the number (y) of people in each county per branch bank in the county and the percentage (x) of the population in each county that is minority.

Source: McClave, J.T., Benson, P.G., Sincich, T. (2007), Statistics for Business and Economics, 10th ed., Prentice Hall, Upper Saddle River, NJ.

data:

       county        number of people      percentage of
                     per bank branch       minority population
   1   Atlantic      3,073                 23.3
   2   Bergen        2,095                 13
   3   Burlington    2,905                 17.8
   4   Camden        3,330                 23.4
   5   Cape May      1,321                 7.3
  ...  ...           ...                   ...
  21   Warren        2,349                 2.8

If the charge against New Jersey banks holds true, we should see an increase in the number of people per bank branch (i.e. fewer bank branches) as the minority percentage of the population increases.
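As a rough first look at the direction of the relationship, the sketch below fits a least-squares line to the six counties printed in the data table, using the Chapter 2 formulas b = r · sy/sx and a = ȳ − b · x̄. It is only an illustration: the remaining 15 counties are not listed here, so the numbers will differ from the fit to all 21 counties. The variable and function names are made up for this example.

```python
from statistics import mean, stdev

# The six counties printed in the data table (NOT the full 21-county data set).
minority_pct = [23.3, 13.0, 17.8, 23.4, 7.3, 2.8]         # x
people_per_branch = [3073, 2095, 2905, 3330, 1321, 2349]  # y


def correlation(x, y):
    """Sample correlation r = sum((x - x_bar)(y - y_bar)) / ((n - 1) * sx * sy)."""
    x_bar, y_bar = mean(x), mean(y)
    cross = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return cross / ((len(x) - 1) * stdev(x) * stdev(y))


def least_squares(x, y):
    """Return (intercept, slope): slope = r * sy / sx, intercept = y_bar - slope * x_bar."""
    slope = correlation(x, y) * stdev(y) / stdev(x)
    intercept = mean(y) - slope * mean(x)
    return intercept, slope


r = correlation(minority_pct, people_per_branch)
intercept, slope = least_squares(minority_pct, people_per_branch)
print(f"r = {r:.3f}; fitted line: y-hat = {intercept:.1f} + {slope:.2f} x")
```

A positive slope for these six counties points in the direction the charge suggests, but six hand-picked rows cannot establish anything about the state as a whole.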
[Figure: scatterplot of number of people per bank branch (y) against percentage of minority population (x)]

[Figure: the same scatterplot]
Correlation:
LS regression line:

population regression line

Because we have complete data for all 21 New Jersey counties and only New Jersey is of interest to us, we have data on the entire population.

The least squares regression line fitted through the 21 observations therefore corresponds to the so-called population regression line

    µy = β0 + β1 x

β0 and β1 are population parameters describing the linear relationship between x and y in the entire population.

data for the 21 New Jersey counties (the entire population):

[Figure: scatterplot of number of people per bank branch against percentage of minority population]

Note:

The population regression line µy = β0 + β1 x describes the linear relationship between the explanatory variable x and µy, i.e. the relationship between x and the mean value of y for a given x.

If we are interested in describing each individual y in the population, we need to account for the fact that not all y are equal to µy. They will not fall exactly on the straight line but will deviate from it by some error ε:

    y = β0 + β1 x + ε,   where β0 + β1 x = µy

New Jersey counties:

The simple linear regression model

    y = β0 + β1 x + ε

allows us to describe the linear relationship between each yi and the corresponding value xi of the explanatory variable (i = 1, 2, ..., 21), i.e.

    yi = β0 + β1 xi + εi

The εi's are independent and normally distributed with mean 0 and standard deviation σ. This is an important assumption to which we will come back later.

Typically we are not as fortunate and won't be able to observe an entire population. Hopefully, though, with the help of a representative random sample, we will still obtain reliable information about the true underlying linear relationship in the population.

Recall the general form of the fitted least squares regression line from Chapter 2,

    ŷ = a + bx,

where a and b are obtained from the sample as follows:

    b = r · (sy / sx)   and   a = ȳ − b · x̄

We can use a to estimate β0 and b to estimate β1:

Both a and b are sample statistics and will vary from sample to sample. If we took another sample, we would get different values of a and b (sampling variability).

Consequently, a and b have a sampling distribution.

The textbook unfortunately switches notation from Chapter 2 to Chapter 10. In the following we will denote a as b0 and b as b1.

Knowing the sampling distribution of b0 and b1 allows us to:

1. construct confidence intervals for the slope β1 and intercept β0
2. test whether the response y depends linearly on x, i.e. whether there is a significant linear relationship between x and y in the population

Generally, we will focus on the slope β1 because the value of the slope determines whether or not a linear relationship between x and y exists.

Note: in order to test whether a linear relationship exists between x and y, we need to test whether the population slope β1 = 0.

Why? If β1 = 0, we get the following regression model:

    y = β0 + β1 · x + ε
      = β0 + 0 · x + ε
      = β0 + ε

If β1 = 0 ⇒ x does not help explain the behavior of y.

assumptions for regression inference

Before we construct confidence intervals and tests, we should look at the assumptions that are necessary for inference on regression parameters:

1. simple random sample (ensuring independence of the y's)
2. linear relationship between x and µy
3. the standard deviation of the responses about the population line is the same for all values of the explanatory variable x
4. the response y varies according to a normal distribution about the population regression line for all values of the explanatory variable x

checking the assumptions

1. independence:
2. linear relationship:
checking the assumptions (cont'd)

1. normality:
2. constant variance:

confidence intervals for slope β1

Recall: the general form of a confidence interval is

    estimate ± margin of error,

where the margin of error is critical value × standard error.

The CI for the slope β1 is of the same form:

    b1 ± t* · SE_b1

The standard error SE_b1 can be obtained from the JMP output. Note that the critical value t* now corresponds to a t-distribution with df = n − 2.

New Jersey example:

Let's construct a 95% confidence interval for the slope β1:

Interpretation cont'd:

testing for a significant linear relationship, i.e. β1 ≠ 0

example: New Jersey data example

Is there a significant linear relationship between the percentage of the minority population and the number of people per bank branch?

Recall the population regression line

    µy = β0 + β1 x

We are interested in showing that β1 is significantly different from zero, i.e. β1 ≠ 0, because this implies that there is indeed a linear relationship between x and y.

We therefore set up the following hypotheses:

    H0: β1 = 0  (no linear relationship between x and y)
    Ha: β1 ≠ 0  (there exists a linear relationship between x and y)

Note: If there exists a linear relationship (β1 ≠ 0), then this linear relationship can be either positive or negative:

    β1 < 0 ⇒ negative relationship
    β1 > 0 ⇒ positive relationship

If we are simply interested in showing that a linear relationship exists and the direction (either positive or negative) is not important, we test H0 against the two-sided alternative Ha: β1 ≠ 0.

If we are specifically interested in showing either a positive or a negative relationship, we need to set up the alternative accordingly, i.e.

    Ha: β1 > 0  for a positive linear relationship
    Ha: β1 < 0  for a negative linear relationship

A general form of the test statistic is given by

    t = (b1 − β1) / SE_b1

with df = n − 2 for the t-distribution, where b1 is the estimate of β1 based on the sample.

Under the null hypothesis we assume β1 = 0, so the test statistic simplifies to

    t = (b1 − 0) / SE_b1 = b1 / SE_b1

Often this test statistic is called the t-ratio (e.g. in JMP).

finding the p-value

p-values are found in exactly the same way as before. Depending on the alternative, the p-value corresponds to

    Ha: β1 ≠ 0
    Ha: β1 > 0
    Ha: β1 < 0

Note: JMP gives p-values corresponding to the two-sided alternative Ha: β1 ≠ 0. If we are interested in testing a one-sided alternative such as Ha: β1 > 0 or Ha: β1 < 0, we need to divide the JMP p-value by 2 (provided the sign of b1 agrees with the direction of the alternative).
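The t-ratio and the halving of the two-sided p-value can be sketched as follows. The inputs b1 = 35.287737, SE_b1 = 7.676707 and n = 21 are read off the JMP Parameter Estimates output for the New Jersey data. The helper functions `t_pdf` and `upper_tail` are hypothetical names for this illustration. They approximate the tail area of the t-distribution by numerical integration, which is a stand-in for JMP or a t table, not how JMP computes it.

```python
import math


def t_pdf(t, df):
    """Density of the t-distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)


def upper_tail(t, df, steps=2000):
    """Approximate P(T > t) for t >= 0 using Simpson's rule on [0, t]."""
    h = t / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3


# Values taken from the JMP Parameter Estimates table (New Jersey data).
b1, se_b1, n = 35.287737, 7.676707, 21
df = n - 2

# Under H0: beta1 = 0 the test statistic is simply b1 / SE_b1.
t_ratio = (b1 - 0) / se_b1

# Two-sided p-value (what JMP reports as Prob>|t|).
p_two_sided = 2 * upper_tail(abs(t_ratio), df)

# One-sided p-value for Ha: beta1 > 0; halving is valid here because b1 > 0.
p_one_sided = p_two_sided / 2

print(f"t ratio = {t_ratio:.2f}, df = {df}")
print(f"two-sided p = {p_two_sided:.5f}, one-sided p = {p_one_sided:.5f}")
```

If b1 had the opposite sign from the one-sided alternative, the one-sided p-value would instead be 1 minus half the two-sided value, which is why the sign check matters.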
Linear Fit

number of people per bank branch = 2082.0153 + 35.287737 × percentage of minority population

Summary of Fit

    RSquare                        0.526538
    RSquare Adj                    0.501619
    Root Mean Square Error         400.2546
    Mean of Response               2693.333
    Observations (or Sum Wgts)     21

Analysis of Variance

    Source      DF    Sum of Squares    Mean Square    F Ratio    Prob > F
    Model        1    3385090.2         3385090        21.1299    0.0002*
    Error       19    3043870.4         160204
    C. Total    20    6428960.7

Parameter Estimates

    Term                                 Estimate     Std Error    t Ratio    Prob>|t|
    Intercept                            2082.0153    159.107      13.09      <.0001*
    percentage of minority population    35.287737    7.676707     4.60       0.0002*

decision rule: as before, we reject H0 if p-value ≤ α

conclusion: Rejecting H0 implies that there exists a statistically significant linear relationship between x and y.

Does this conclusion imply a change in the response y can be caused by a change in the explanatory variable x?

Example: New Jersey banks
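For the New Jersey example, a 95% confidence interval for the slope of the form b1 ± t* · SE_b1 can be sketched from the JMP output above. The critical value t* = 2.093 for df = 19 is an assumption taken from a standard t table rather than from the slides, and the variable names are illustrative only.

```python
# Values read from the JMP output (Parameter Estimates and Observations).
b1 = 35.287737     # estimated slope
se_b1 = 7.676707   # standard error of the slope
n = 21             # number of counties
df = n - 2         # degrees of freedom for inference on the slope

# Assumed critical value: upper 2.5% point of the t-distribution with
# df = 19, as listed in a standard t table (not given in the slides).
T_STAR_95_DF19 = 2.093

margin = T_STAR_95_DF19 * se_b1      # margin of error = t* x SE
lower, upper = b1 - margin, b1 + margin

# A 95% interval that excludes 0 agrees with rejecting H0: beta1 = 0
# against the two-sided alternative at alpha = 0.05.
excludes_zero = lower > 0 or upper < 0

print(f"95% CI for the slope: ({lower:.2f}, {upper:.2f}), df = {df}")
print(f"interval excludes 0: {excludes_zero}")
```

Under these assumptions the interval runs from roughly 19.2 to 51.4. One reading is that each additional percentage point of minority population is associated with about 19 to 51 more people per bank branch on average. Whether that association is causal is a separate question that the interval cannot answer.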