Presenter 07 - Linear Regression

Real Statistics Using Excel
102521088 吳柏葦
• Confidence and prediction intervals for regression
• Exponential Regression Model
• Power Regression Model
• Linear regression models for comparing means
The 95% confidence interval for the forecasted value ŷ0 at a given value x0 is
ŷ0 ± tcrit · s.e.
where tcrit is the critical value of the t distribution with n − 2 degrees of freedom and the standard error is
s.e. = sy.x · √(1/n + (x0 − x̄)² / SSx)
with sy.x the standard error of the estimate and SSx = Σ(xi − x̄)².
This means that there is a 95% probability that the true linear regression
line of the population will lie within the confidence interval of the
regression line calculated from the sample data.
The 95% prediction interval of the forecasted value ŷ0 for x0 is
ŷ0 ± tcrit · s.e.
where the standard error of the prediction is
s.e. = sy.x · √(1 + 1/n + (x0 − x̄)² / SSx)
For any specific value x0 the prediction interval is more
meaningful than the confidence interval.
Example 1: Find the 95% confidence and prediction intervals for the forecasted life expectancy
for men who smoke 20 cigarettes, using the data of Example 1 of Method of Least Squares.
Figure 2 – Confidence and prediction intervals for data in Example 1
Referring to Figure 2, we see that the forecasted value
for 20 cigarettes is given by
FORECAST(20, B4:B18, A4:A18) = 73.16. The confidence interval,
calculated using the standard error 2.06 (found in cell
E12), is (68.70, 77.61).
The prediction interval is calculated in a similar way
using the prediction standard error of 8.24 (found in cell
J12). Thus the life expectancy of men who smoke 20
cigarettes is in the interval (55.36, 90.95) with 95%
probability.
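These interval calculations can also be reproduced outside Excel. Below is a minimal Python sketch of the computation using the formulas above; the x and y arrays are illustrative placeholders, not the actual cigarette/life-expectancy data from Example 1.

import numpy as np
from scipy import stats

# placeholder data (cigarettes per day, life expectancy); not the data of Example 1
x = np.array([5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75], dtype=float)
y = np.array([80, 78, 77, 75, 73, 72, 70, 69, 67, 66, 64, 63, 61, 60, 58], dtype=float)

n = len(x)
b, a = np.polyfit(x, y, 1)                                  # slope b and intercept a of y = a + b*x
s_yx = np.sqrt(np.sum((y - (a + b * x)) ** 2) / (n - 2))    # standard error of the estimate
ss_x = np.sum((x - x.mean()) ** 2)
t_crit = stats.t.ppf(0.975, df=n - 2)

x0 = 20
y0 = a + b * x0                                             # forecasted value, like Excel's FORECAST
se_conf = s_yx * np.sqrt(1 / n + (x0 - x.mean()) ** 2 / ss_x)
se_pred = s_yx * np.sqrt(1 + 1 / n + (x0 - x.mean()) ** 2 / ss_x)

print("95% confidence interval:", (y0 - t_crit * se_conf, y0 + t_crit * se_conf))
print("95% prediction interval:", (y0 - t_crit * se_pred, y0 + t_crit * se_pred))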
Example 2: Test whether the y-intercept is 0.
We use the same approach as that used in Example 1 to
find the confidence interval of ŷ when x = 0 (this is the y-intercept).
The standard error and the resulting confidence interval are given in column M of Figure 2.
Since 0 is not in this interval, the null hypothesis that the y-intercept is zero is
rejected.
Sometimes linear regression can be
used with relationships which are not
inherently linear, but can be made
linear after a transformation. In
particular, we consider the following
exponential model:
y = αe^(βx)
Taking the natural log of both sides gives
ln y = ln α + βx
and setting y′ = ln y and α′ = ln α yields the linear model
y′ = α′ + βx + ε
Observation: Since αe^(β(x+1)) = αe^(βx) · e^β, we note that an
increase in x of 1 unit results in y being multiplied
by e^β.
Observation: A model of the form ln y = βx + δ is
referred to as a log-level regression model. Clearly
any such model can be expressed as an exponential
regression model of the form y = αe^(βx) by setting α = e^δ.
Example 1: Determine whether the data on the left side of Figure 1
fits an exponential model.
Figure 1 – Data for Example 1 and log transform
The table on the right side of Figure 1 shows ln y (the natural
log of y) instead of y. We now use the Regression data analysis
tool to model the relationship between ln y and x.
Figure 2 – Regression data analysis for x vs. ln y from Example 1
The table in Figure 2 shows that the model is a good fit and the
relationship between ln y and x is given by
ln y = 2.64 + 0.016x
Applying e to both sides of the equation yields
y = e^2.64 · e^(0.016x) = 14.05 · (1.016)^x
We can also see the relationship between x and y by
creating a scatter chart for the original data and
choosing Layout > Analysis|Trendline in Excel and
then selecting the Exponential Trendline option. We can
also create a chart showing the relationship
between x and ln y and use Linear Trendline to show the
linear regression line.
As usual we can use the formula y = 14.05 · (1.016)^x described
above for prediction. Thus if we want the y value corresponding
to x = 26, using the above model we get ŷ = 14.05 · (1.016)^26 =
21.35.
We can get the same result using Excel’s GROWTH function, as
described below.
Excel Functions: Excel supplies two functions for exponential
regression, namely GROWTH and LOGEST.
LOGEST is the exponential counterpart to the linear regression
function LINEST described in Testing the Slope of the
Regression Line. Once again you need to highlight a 5 × 2 area
and enter the array function =LOGEST(R1, R2, TRUE, TRUE),
where R1 = the array of observed values for y (not ln y) and R2
= the array of observed values for x, and then press
Ctrl-Shift-Enter. LOGEST doesn't supply any labels, so you will need
to enter these manually.
Essentially LOGEST is simply LINEST applied to the mapping
described above for transforming an exponential model into a
linear model. For Example 1 the output for LOGEST(B6:B16,
A6:A16, TRUE, TRUE) is as in Figure 4.
GROWTH is the exponential counterpart to the linear regression
function TREND described in Method of Least Squares. For R1 =
the array containing the y values of the observed data and R2 =
the array containing the x values of the observed data,
GROWTH(R1, R2, x) = EXP(a) * EXP(b)^x where EXP(a) and
EXP(b) are as defined from the LOGEST output described above
(or alternatively from the Regression data analysis). E.g., based
on the data from Example 1, we have:
GROWTH(B6:B16, A6:A16, 26) = 21.35
which is the same result we obtained earlier using the Regression
data analysis tool.
GROWTH can also be used to predict more than one value. In
this case, GROWTH(R1, R2, R3) is an array function where R1
and R2 are as described above and R3 is an array of x values. The
function returns an array of predicted values for the x values in
R3 based on the model determined by the values in R1 and R2.
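For readers working outside Excel, here is a minimal Python sketch of what LOGEST and GROWTH compute: regress ln y on x and exponentiate the coefficients. The x and y arrays below are placeholders, not the data of Example 1.

import numpy as np

x = np.arange(1.0, 12.0)                          # placeholder x values
y = np.array([14.2, 14.5, 14.7, 15.0, 15.2,
              15.5, 15.7, 16.0, 16.2, 16.5, 16.8])  # placeholder y values

b, a = np.polyfit(x, np.log(y), 1)                # regress ln y on x
alpha, base = np.exp(a), np.exp(b)                # LOGEST reports e^a (intercept) and e^b (base)

def growth(x_new):
    # equivalent of Excel's GROWTH: alpha * base**x_new
    return alpha * base ** np.asarray(x_new, dtype=float)

print(alpha, base)                                # coefficients of y = alpha * base**x
print(growth(26))                                 # prediction at x = 26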
Power Regression Model
Another non-linear regression model is the power
regression model, which is based on the following equation:
y = αx^β
Taking the natural log of both sides gives
ln y = ln α + β ln x
and setting y′ = ln y, x′ = ln x, and α′ = ln α yields the linear model
y′ = α′ + βx′ + ε
Observation: A model of the form ln y = β ln x + δ is referred to
as a log-log regression model. If this equation holds, then
y = e^(β ln x + δ) = e^δ · x^β
and so any such model can be expressed as a power
regression model of the form y = αx^β by setting α = e^δ.
Example 1: Determine whether the data on the
left side of Figure 1 is a good fit for a power
model.
The table on the right side of Figure 1 shows y transformed
into ln y and x transformed into ln x. We now use the
Regression data analysis tool to model the relationship
between ln y and ln x.
Figure 2 shows that the model is a good fit and the relationship between
ln x and ln y is given by
ln y = 0.234 + 2.81 ln x
Applying e to both sides of the equation yields
y = e^0.234 · x^2.81 = 1.264 · x^2.81
We can also see the relationship between x and y by creating a
scatter chart for the original data and choosing Layout >
Analysis|Trendline in Excel and then selecting the Power
Trendline option (after choosing More Trendline Options).
We can also create a chart showing the relationship between
ln x and ln y and use Linear Trendline to show the linear
regression line.
As usual we can use the formula described above for prediction. For example,
if we want the y value corresponding to x = 26, using the above model we get
ŷ = 1.264 · 26^2.81 ≈ 11,960.
Excel doesn’t provide functions like TREND/GROWTH (nor
LINEST/LOGEST) for power/log-log regression, but we can use the TREND
formula as follows:
=EXP(TREND(LN(B6:B16),LN(A6:A16),LN(26)))
to get the same result.
Observation: Thus the equivalent of the array formula GROWTH(R1, R2, R3)
for log-log regression is =EXP(TREND(LN(R1), LN(R2), LN(R3))).
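A minimal Python sketch of the same workaround: fit the log-log regression and back-transform. The x and y arrays are placeholders, not the data of Example 1 of this section.

import numpy as np

x = np.arange(2.0, 24.0, 2.0)                   # placeholder x values
y = 1.3 * x ** 2.8                              # placeholder y values following a power law

beta, a = np.polyfit(np.log(x), np.log(y), 1)   # regress ln y on ln x
alpha = np.exp(a)                               # so that y = alpha * x**beta

x_new = 26
print(alpha * x_new ** beta)                    # same idea as EXP(TREND(LN(R1), LN(R2), LN(26)))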
Observation: In the case where there is one independent variable x, there are
four ways of making log transformations, namely
level-level regression: y = βx + α
log-level regression: ln y = βx + α
level-log regression: y = β ln x + α
log-log regression: ln y = β ln x + α
We dealt with the first of these in ordinary linear regression (no log
transformation). The second is described in Exponential Regression and the
fourth is power regression as described on this webpage. We haven’t studied the
level-log regression, but it too can be analyzed using techniques similar to those
described here.
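As a compact illustration of the four regression forms above, the sketch below fits each of them with np.polyfit on suitably transformed variables; the x and y arrays are placeholders and are assumed to be strictly positive.

import numpy as np

x = np.linspace(1.0, 20.0, 25)                          # placeholder positive x values
y = 2.5 * x ** 1.7                                      # placeholder positive y values

fits = {
    "level-level": np.polyfit(x, y, 1),                 # y    = beta*x    + alpha
    "log-level":   np.polyfit(x, np.log(y), 1),         # ln y = beta*x    + alpha (exponential model)
    "level-log":   np.polyfit(np.log(x), y, 1),         # y    = beta*ln x + alpha
    "log-log":     np.polyfit(np.log(x), np.log(y), 1), # ln y = beta*ln x + alpha (power model)
}
for name, (beta, alpha) in fits.items():
    print(f"{name}: beta = {beta:.3f}, alpha = {alpha:.3f}")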
Linear regression models for
comparing means
In this section we show how to use dummy
variables to model categorical variables using
linear regression in a way that is similar to that
employed in Dichotomous Variables and the
t-test. In particular we show that hypothesis
testing of the difference between means using
the t-test (see Two Sample t Test with Equal
Variances and Two Sample t Test with Unequal
Variances) can be done by using linear
regression.
Example 1: Repeat the analysis of Example 1 of Two Sample t Test with
Equal Variances (comparing means from populations with equal variance)
using linear regression.
The leftmost table in Figure 1 contains the original data from Example 1
of Two Sample t Test with Equal Variances. We define the dummy
variable x so that x = 0 when the data element is from the New group
and x = 1 when the data element is from the Old group. The data can now
be expressed with an independent variable and a dependent variable as
described in the middle table in Figure 1.
Running the Regression data analysis tool on x and y,
we get the results on the right in Figure 1. We can
now compare this with the results we obtained using
the t-test data analysis tool, which we repeat here in
Figure 2.
We now make some observations regarding this
comparison:
• F = 4.738 in the regression analysis is equal to the square of the t-stat (2.177) from the t-test, which is consistent with Property 1 of F Distribution.
• R Square = .208 in the regression analysis is equal to t² / (t² + df) = 2.177² / (2.177² + 18), where t is the t-stat from the t-test, which is consistent with the observation following Theorem 1 of One Sample Hypothesis Testing for Correlation.
• The p-value = .043 from the regression analysis (called Significance F) is the same as the p-value from the t-test (called P(T<=t) two-tail).
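A minimal Python sketch of this equivalence, using scipy. The new and old arrays are placeholders chosen to have the same group means as the example (15 and 11.1); the individual values, and hence the test statistics, are not those of the referenced data.

import numpy as np
from scipy import stats

new = np.array([13, 17, 19, 11, 20, 15, 18, 9, 12, 16], dtype=float)   # placeholder New group
old = np.array([12, 8, 6, 16, 12, 14, 10, 18, 4, 11], dtype=float)     # placeholder Old group

# two-sample t test assuming equal variances
t_stat, p_t = stats.ttest_ind(new, old, equal_var=True)

# dummy-variable regression: x = 0 for the New group, x = 1 for the Old group
x = np.concatenate([np.zeros(len(new)), np.ones(len(old))])
y = np.concatenate([new, old])
reg = stats.linregress(x, y)

df = len(new) + len(old) - 2
print(t_stat ** 2, (reg.slope / reg.stderr) ** 2)          # F statistic equals t^2
print(reg.rvalue ** 2, t_stat ** 2 / (t_stat ** 2 + df))   # R Square equals t^2/(t^2 + df)
print(p_t, reg.pvalue)                                     # the two p-values agree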
Effect Size
We can also see from the above discussion that the correlation
coefficient r can be expressed as a function of the t-stat using the
following formula:
r = √(t² / (t² + df))
The impact of this is that the effect size for the t-test can be expressed in
terms of this correlation coefficient. The general guidelines are that r = .1 is
viewed as a small effect, r = .3 as a medium effect and r = .5 as a large effect.
For Example 1, r = 0.456, which is close to .5 and so is viewed as a large
effect.
Note that this formula can also be used to measure the effect size for t-tests
even when the population variances are unequal (see next example) and for
the case of paired samples.
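As a quick check, a couple of lines of Python reproduce the effect size from the t-stat and degrees of freedom quoted above for Example 1.

import math

t, df = 2.177, 18
r = math.sqrt(t ** 2 / (t ** 2 + df))
print(r)   # about 0.46 (reported as .456 above), i.e. a large effect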
Model coefficients
Also note that the coefficients in the regression model
y = bx + a can be calculated directly from the original data as
follows. First calculate the means of the data for each flavoring
(new and old). The mean of the data in the new flavoring sample
is 15 and the mean of the data in the old flavoring sample is 11.1.
Since x = 0 for the new flavoring sample and x = 1 for the old flavoring
sample, we have ŷ = a when x = 0 and ŷ = a + b when x = 1.
This means that a = 15 and b = 11.1 – a = 11.1 – 15 = -3.9, and
so the regression line is y = 15 – 3.9x, which agrees with the
coefficients in Figure 1.
Unequal variance
As was mentioned in the discussion following Figure 4 of Testing the
Regression Line Slope, the Regression data analysis tool provides an
optional Residuals Plot. The output for Example 1 is displayed in
Figure 3.
From the chart we see how the residual values corresponding to x = 0
and x = 1 are distributed about the mean of zero. The spread about x = 1
is a bit larger than for x = 0, but the difference is quite small, which
suggests that the variances for the New and Old samples are roughly equal.
Example 2: Repeat the analysis of Example 2 of Two Sample t Test
with Unequal Variances (comparing means from populations with
unequal variance) using linear regression.
We note that the regression analysis displayed in Figure 4 agrees with
the t-test analysis assuming equal variances (the table on the left of
Figure 5).
Unfortunately, since the variances are quite unequal, the correct
results are given by the table on the right in Figure 5. This highlights
the importance of the requirement that the variances of the y values for
each value of x be equal in order for the results of the regression analysis to be useful.
Also note that the plot of the Residuals for the regression analysis
clearly shows that the variances are unequal (see Figure 6).
Thanks for your attention