Real Statistics Using Excel
102521088 吳柏葦

• Confidence and prediction intervals for regression
• Exponential Regression Model
• Power Regression Model
• Linear regression models for comparing means

Confidence and prediction intervals for regression

The 95% confidence interval for the forecasted value ŷ at a given value x₀ is

ŷ ± t_crit · s_ŷ   where   s_ŷ = s_e · √(1/n + (x₀ − x̄)² / Σ(xᵢ − x̄)²)

Here s_e is the standard error of the estimate and t_crit is the two-tailed critical value of the t distribution with n − 2 degrees of freedom. This means that there is a 95% probability that the true linear regression line of the population will lie within the confidence interval of the regression line calculated from the sample data.

The 95% prediction interval of the forecasted value ŷ₀ for x₀ is

ŷ₀ ± t_crit · s_pred   where the standard error of the prediction is   s_pred = s_e · √(1 + 1/n + (x₀ − x̄)² / Σ(xᵢ − x̄)²)

For any specific value x₀ the prediction interval is more meaningful than the confidence interval.

Example 1: Find the 95% confidence and prediction intervals for the forecasted life expectancy of men who smoke 20 cigarettes, using the data from Example 1 of Method of Least Squares.

Figure 2 – Confidence and prediction intervals for the data in Example 1

Referring to Figure 2, we see that the forecasted value for 20 cigarettes is FORECAST(20, B4:B18, A4:A18) = 73.16. The confidence interval, calculated using the standard error 2.06 (found in cell E12) and the critical value of the t distribution with n − 2 = 13 degrees of freedom (T.INV.2T(.05, 13) ≈ 2.16), is (68.70, 77.61). The prediction interval is calculated in the same way using the prediction standard error 8.24 (found in cell J12). Thus the life expectancy of men who smoke 20 cigarettes lies in the interval (55.36, 90.95) with 95% probability.

Example 2: Test whether the y-intercept is 0.

We use the same approach as in Example 1 to find the confidence interval of ŷ when x = 0 (i.e. the y-intercept). The result is given in column M of Figure 2. Here the standard error is

s_ŷ = s_e · √(1/n + x̄² / Σ(xᵢ − x̄)²)

and the confidence interval is the forecasted value at x = 0 plus or minus t_crit times this standard error. Since 0 is not in this interval, the null hypothesis that the y-intercept is zero is rejected.

Exponential Regression Model

Sometimes linear regression can be used with relationships that are not inherently linear but can be made linear by a transformation. In particular, we consider the exponential model

y = αe^(βx)

Taking the natural log of both sides gives

ln y = ln α + βx,   i.e.   y′ = α′ + βx + ε,   where y′ = ln y and α′ = ln α

Observation: Since αe^(β(x+1)) = αe^(βx) · e^β, an increase in x of 1 unit results in y being multiplied by e^β.

Observation: A model of the form ln y = βx + δ is referred to as a log-level regression model. Clearly any such model can be expressed as an exponential regression model of the form y = αe^(βx) by setting α = e^δ.

Example 1: Determine whether the data on the left side of Figure 1 fits an exponential model.

Figure 1 – Data for Example 1 and log transform

The table on the right side of Figure 1 shows ln y (the natural log of y) instead of y. We now use the Regression data analysis tool to model the relationship between ln y and x.

Figure 2 – Regression data analysis for x vs. ln y from Example 1

The output in Figure 2 shows that the model is a good fit and that the relationship between ln y and x is given by

ln y = 2.64 + 0.016x

Applying e to both sides of the equation yields

y = e^2.64 · (e^0.016)^x ≈ 14.05 · (1.016)^x

We can also see the relationship between x and y by creating a scatter chart for the original data and choosing Layout > Analysis|Trendline in Excel and then selecting the Exponential Trendline option. We can also create a chart showing the relationship between x and ln y and use the Linear Trendline option to show the linear regression line.

As usual we can use the formula y = 14.05 · (1.016)^x described above for prediction. Thus if we want the y value corresponding to x = 26, using the above model we get ŷ = 14.05 · (1.016)^26 = 21.35. We can get the same result using Excel's GROWTH function, as described below.
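Before turning to those Excel functions, the same log-level fit and prediction can be cross-checked outside Excel. The following is a minimal Python sketch under the assumptions stated in the comments (it is not part of the Real Statistics add-in; the function name and the x_vals/y_vals arrays are placeholders for the data in A6:A16 and B6:B16):

import numpy as np

def exp_regression_predict(x, y, x_new):
    # Fit the log-level model ln y = a + b*x by least squares,
    # then convert back to the exponential form y = e^a * (e^b)^x.
    b, a = np.polyfit(np.asarray(x, float), np.log(y), 1)
    alpha = np.exp(a)      # e^a, about 14.05 for Example 1
    base = np.exp(b)       # e^b, about 1.016 for Example 1
    return alpha, base, alpha * base ** np.asarray(x_new, float)

# Usage, assuming x_vals and y_vals hold the data from A6:A16 and B6:B16:
# alpha, base, y26 = exp_regression_predict(x_vals, y_vals, 26)   # y26 ≈ 21.35

This is the same transformation that LOGEST performs internally, so the results should agree with GROWTH up to rounding.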
Excel Functions: Excel supplies two functions for exponential regression, namely GROWTH and LOGEST.

LOGEST is the exponential counterpart to the linear regression function LINEST described in Testing the Slope of the Regression Line. Once again you need to highlight a 5 × 2 range and enter the array formula =LOGEST(R1, R2, TRUE, TRUE), where R1 = the array of observed values for y (not ln y) and R2 = the array of observed values for x, and then press Ctrl-Shift-Enter. LOGEST doesn't supply any labels, so you will need to enter these manually. Essentially LOGEST is simply LINEST applied via the mapping described above for transforming an exponential model into a linear model. For Example 1 the output of LOGEST(B6:B16, A6:A16, TRUE, TRUE) is shown in Figure 4.

GROWTH is the exponential counterpart to the linear regression function TREND described in Method of Least Squares. For R1 = the array containing the y values of the observed data and R2 = the array containing the x values of the observed data, GROWTH(R1, R2, x) = EXP(a) * EXP(b)^x, where EXP(a) and EXP(b) are as given in the LOGEST output described above (a and b being the intercept and slope of the regression of ln y on x, as in the Regression data analysis). E.g., based on the data from Example 1, we have

GROWTH(B6:B16, A6:A16, 26) = 21.35

which is the same result we obtained earlier using the Regression data analysis tool.

GROWTH can also be used to predict more than one value. In this case GROWTH(R1, R2, R3) is an array function, where R1 and R2 are as described above and R3 is an array of x values. The function returns an array of predicted values for the x values in R3, based on the model determined by the values in R1 and R2.

Power Regression Model

Another non-linear regression model is the power regression model, which is based on the following equation:

y = αx^β

Taking the natural log of both sides gives

ln y = ln α + β ln x,   i.e.   y′ = α′ + βx′ + ε,   where y′ = ln y, x′ = ln x and α′ = ln α

Observation: A model of the form ln y = β ln x + δ is referred to as a log-log regression model. Since, if this equation holds, we have

y = e^(β ln x + δ) = e^δ · x^β

it follows that any such model can be expressed as a power regression model of the form y = αx^β by setting α = e^δ.

Example 1: Determine whether the data on the left side of Figure 1 is a good fit for a power model.

The table on the right side of Figure 1 shows y transformed into ln y and x transformed into ln x. We now use the Regression data analysis tool to model the relationship between ln y and ln x. Figure 2 shows that the model is a good fit and that the relationship between ln x and ln y is given by

ln y = 0.234 + 2.81 ln x

Applying e to both sides of the equation yields

y = e^0.234 · x^2.81 ≈ 1.264 · x^2.81

We can also see the relationship between x and y by creating a scatter chart for the original data and choosing Layout > Analysis|Trendline in Excel and then selecting the Power Trendline option (after choosing More Trendline Options). We can also create a chart showing the relationship between ln x and ln y and use the Linear Trendline option to show the linear regression line.

As usual we can use the formula y = 1.264 · x^2.81 described above for prediction. For example, if we want the y value corresponding to x = 26, using the above model we get ŷ = 1.264 · 26^2.81. Excel doesn't provide functions like TREND/GROWTH (nor LINEST/LOGEST) for power/log-log regression, but we can use the TREND formula as follows

=EXP(TREND(LN(B6:B16), LN(A6:A16), LN(26)))

to get the same result.

Observation: Thus the equivalent of the array formula GROWTH(R1, R2, R3) for log-log regression is =EXP(TREND(LN(R1), LN(R2), LN(R3))).
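The log-log fit can be sketched outside Excel in the same way as the exponential model. Here is a minimal Python version, again illustrative only, with x_vals and y_vals standing in for the data in A6:A16 and B6:B16:

import numpy as np

def power_regression_predict(x, y, x_new):
    # Fit the log-log model ln y = a + b*ln x by least squares,
    # then convert back to the power form y = e^a * x^b.
    b, a = np.polyfit(np.log(x), np.log(y), 1)
    alpha = np.exp(a)      # e^a, about 1.264 for Example 1
    return alpha, b, alpha * np.asarray(x_new, float) ** b

# Usage, assuming x_vals and y_vals hold the data from A6:A16 and B6:B16:
# alpha, beta, y26 = power_regression_predict(x_vals, y_vals, 26)
# This mirrors the worksheet formula =EXP(TREND(LN(B6:B16), LN(A6:A16), LN(26))).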
Observation: In the case where there is one independent variable x, there are four ways of making log transformations, namely

• level-level regression: y = βx + α
• log-level regression: ln y = βx + α
• level-log regression: y = β ln x + α
• log-log regression: ln y = β ln x + α

We dealt with the first of these in ordinary linear regression (no log transformation). The second is described in Exponential Regression and the fourth is power regression as described on this webpage. We haven't studied level-log regression, but it too can be analyzed using techniques similar to those described here.

Linear regression models for comparing means

In this section we show how to use dummy variables to model categorical variables using linear regression, in a way similar to that employed in Dichotomous Variables and the t-test. In particular we show that hypothesis testing of the difference between means using the t-test (see Two Sample t Test with Equal Variances and Two Sample t Test with Unequal Variances) can be done by using linear regression.

Example 1: Repeat the analysis of Example 1 of Two Sample t Test with Equal Variances (comparing means from populations with equal variance) using linear regression.

The leftmost table in Figure 1 contains the original data from Example 1 of Two Sample t Test with Equal Variances. We define the dummy variable x so that x = 0 when the data element is from the New group and x = 1 when the data element is from the Old group. The data can now be expressed with an independent variable and a dependent variable, as shown in the middle table of Figure 1. Running the Regression data analysis tool on x and y, we get the results on the right of Figure 1.

We can now compare this with the results we obtained using the t-test data analysis tool, which we repeat here in Figure 2. We make the following observations regarding this comparison:

• F = 4.738 in the regression analysis is equal to the square of the t-stat (2.177) from the t-test, which is consistent with Property 1 of F Distribution.
• R Square = .208 in the regression analysis is equal to t²/(t² + df) = 2.177²/(2.177² + 18), where t is the t-stat from the t-test, which is consistent with the observation following Theorem 1 of One Sample Hypothesis Testing for Correlation.
• The p-value = .043 from the regression analysis (called Significance F) is the same as the p-value from the t-test (called P(T<=t) two-tail).

Effect Size

We can also see from the above discussion that the correlation coefficient r (the Multiple R in the regression output) can be expressed as a function of the t-stat using the formula

r = √(t² / (t² + df))

The impact of this is that the effect size for the t-test can be expressed in terms of this correlation coefficient. The general guidelines are that r = .1 is viewed as a small effect, r = .3 as a medium effect and r = .5 as a large effect. For Example 1, r = 0.456, which is close to .5 and so is viewed as a large effect. Note that this formula can also be used to measure the effect size for t-tests even when the population variances are unequal (see the next example) and for the case of paired samples.
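The equivalence between the two-sample t-test and the dummy-variable regression can also be checked outside Excel. Below is a minimal Python sketch using scipy; it is illustrative only, and new_vals and old_vals are placeholders for the two samples in Figure 1:

import numpy as np
from scipy import stats

def regression_vs_ttest(new, old):
    # Equal-variance two-sample t-test on the two groups.
    t, p = stats.ttest_ind(new, old, equal_var=True)

    # Dummy-variable regression y = a + b*x with x = 0 (New) and x = 1 (Old).
    x = np.concatenate([np.zeros(len(new)), np.ones(len(old))])
    y = np.concatenate([np.asarray(new, float), np.asarray(old, float)])
    slope, intercept, r, p_reg, se = stats.linregress(x, y)

    # Expected relationships: intercept = mean(new), slope = mean(old) - mean(new),
    # t^2 = F of the regression, p = p_reg, and |r| = sqrt(t^2 / (t^2 + df)).
    return (t, p), (intercept, slope, r, p_reg)

# Usage, assuming new_vals and old_vals hold the two samples from Figure 1:
# (t, p), (a, b, r, p_reg) = regression_vs_ttest(new_vals, old_vals)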
Model coefficients

Also note that the coefficients in the regression model y = bx + a can be calculated directly from the original data as follows. First calculate the mean of the data for each flavoring (new and old): the mean of the new flavoring sample is 15 and the mean of the old flavoring sample is 11.1. Since x = 0 for the new flavoring sample and x = 1 for the old flavoring sample, we have

a = ȳ_new = 15   and   a + b = ȳ_old = 11.1

This means that a = 15 and b = 11.1 − a = 11.1 − 15 = −3.9, and so the regression line is y = 15 − 3.9x, which agrees with the coefficients in Figure 1.

Unequal variance

As was mentioned in the discussion following Figure 4 of Testing the Regression Line Slope, the Regression data analysis tool provides an optional Residuals Plot. The output for Example 1 is displayed in Figure 3. From the chart we see how the residual values corresponding to x = 0 and x = 1 are distributed about the mean of zero. The spread about x = 1 is a bit larger than the spread about x = 0, but the difference is quite small, which suggests that the variances for the New and Old samples are roughly equal.

Example 2: Repeat the analysis of Example 2 of Two Sample t Test with Unequal Variances (comparing means from populations with unequal variance) using linear regression.

We note that the regression analysis displayed in Figure 4 agrees with the t-test analysis that assumes equal variances (the table on the left of Figure 5). Unfortunately, since the variances are quite unequal, the correct results are given by the table on the right of Figure 5. This highlights the importance of the requirement that the variances of the y values for each value of x be equal in order for the results of the regression analysis to be useful. Also note that the plot of the residuals for the regression analysis clearly shows that the variances are unequal (see Figure 6).

Thanks for your attention