From [JK10 Section 13.4, Linear Regression Analysis] – this is the later-chapter presentation; the introductory discussion is in Chapter 3.
[Maybe revise this after revisiting Chapter 3.]
Line of Best Fit – Simple Linear Regression
The Method of Least Squares
The Line of Best Fit
Slope (computational): $b_1 = \dfrac{SS(xy)}{SS(x)}$

Slope (definition): $b_1 = \dfrac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2}$

where $SS(xy) = \sum xy - \dfrac{(\sum x)(\sum y)}{n}$ and $SS(x) = \sum x^2 - \dfrac{(\sum x)^2}{n}$

y-intercept: $b_0 = \dfrac{\sum y - (b_1 \cdot \sum x)}{n}$, or $b_0 = \bar{y} - (b_1 \cdot \bar{x})$ (computational)
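
A minimal numeric sketch of the computational formulas above, in Python; the five (x, y) pairs are made-up illustration data, not an example from JK10:

    # Least-squares slope and intercept via the computational formulas.
    x = [1, 2, 3, 4, 5]                  # hypothetical data
    y = [2.1, 3.9, 6.2, 7.8, 10.1]
    n = len(x)

    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi ** 2 for xi in x)

    SS_xy = sum_xy - sum_x * sum_y / n   # SS(xy)
    SS_x = sum_x2 - sum_x ** 2 / n       # SS(x)

    b1 = SS_xy / SS_x                    # slope: 1.99
    b0 = (sum_y - b1 * sum_x) / n        # y-intercept: 0.05
    print(b1, b0)

The fitted values b1 = 1.99 and b0 = 0.05 are carried through the later sketches.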
Is there a linear relationship between the two variables?
What equation expresses that relationship?
If 𝑥 and 𝑦 are unrelated, the slope is 0.
There are other regression methods:
- Curvilinear regression, involving powers of $x$ and other functions such as $e^x$ and $\ln x$.
- Multiple regression, with more than one input variable.
The Linear Model
$$y = \beta_0 + \beta_1 x + \epsilon$$
- $\beta_0$ is the $y$-intercept.
- $\beta_1$ is the slope.
- $\epsilon$ is the random experimental error in the observed value of $y$ at a given value of $x$.
The Residual
$$e = y - \hat{y}$$
- $e$ is the "residual", the estimate of the experimental error. It is the difference between the observed value of $y$ and the predicted value, $\hat{y}$.
- The sum of the residuals over all observed data points is exactly zero, by design of the Least Squares Method.
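
A quick check of that zero-sum property, reusing the hypothetical data and fitted coefficients from the sketch above:

    # The residuals from the least-squares line sum to zero
    # (up to floating-point rounding).
    x = [1, 2, 3, 4, 5]
    y = [2.1, 3.9, 6.2, 7.8, 10.1]
    b1, b0 = 1.99, 0.05                  # fitted above
    residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    print(sum(residuals))                # ~0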
Statistics with the Residual
- The mean value of the residual: $\mu_e = 0$.
- The variance of the residual: $\sigma_e^2$. We want to estimate it.
The Story
Suppose we observe several values of $y$ at a given value of $x$.
- Plot the distribution of the observed $y$ values as a curve perpendicular to the page at the $x$ value under consideration (a three-dimensional graph now: the frequency is the distance the curve jumps off the page toward you).
- We need to make an assumption: the distribution of these $y$ values at each $x$ is approximately a normal distribution.
- We also assume that the variances of the distributions of $y$ at all values of $x$ are the same.
- The mean of the observed $y$ is different from one $x$ to another, but at each one it can be estimated by $\bar{y}$.
Variance of the Error
From [JK10 Section 13.4]
Variance of the error (the residual), $e$:
$$s_e^2 = \frac{\sum (y - \hat{y})^2}{n - 2}$$
Replace $\hat{y}$ with $b_0 + b_1 x$:
$$s_e^2 = \frac{\sum (y - b_0 - b_1 x)^2}{n - 2}$$
Then do a bunch of algebra and you can arrive at this:
$$s_e^2 = \frac{\sum y^2 - b_0 \sum y - b_1 \sum xy}{n - 2}$$
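
Both forms computed side by side in Python, continuing the hypothetical data from the earlier sketch; they agree:

    # Variance of the error: definition vs. computational shortcut.
    x = [1, 2, 3, 4, 5]
    y = [2.1, 3.9, 6.2, 7.8, 10.1]
    n, b1, b0 = len(x), 1.99, 0.05

    # definition: sum of squared residuals over n - 2
    se2_def = sum((yi - (b0 + b1 * xi)) ** 2
                  for xi, yi in zip(x, y)) / (n - 2)

    # computational form
    sum_y, sum_y2 = sum(y), sum(yi ** 2 for yi in y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    se2_comp = (sum_y2 - b0 * sum_y - b1 * sum_xy) / (n - 2)

    print(se2_def, se2_comp)             # both ≈ 0.0357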
Inferences About the Slope of the Regression Line
From [JK10 Section 13.5]
Estimate for the Variance of the Slope
[JK10 page 718] The sampling distribution of the slope. If random samples of size n are repeatedly
taken from a bivariate population, then the calculated slopes, the 𝑏1 ’s, will form a sampling distribution
that is normally distributed with a mean of 𝛽1 , the population value of the slope, and with a variance of
$$\sigma_{b_1}^2 = \frac{\sigma_\epsilon^2}{\sum (x - \bar{x})^2}.$$
An appropriate estimator is
$$s_{b_1}^2 = \frac{s_e^2}{\sum (x - \bar{x})^2} = \frac{s_e^2}{\sum x^2 - \frac{(\sum x)^2}{n}}.$$
Terminology: "variance of the slope" or "variance of the regression"; "standard deviation of the slope" or "standard error of the regression".
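
A sketch in Python, with the hypothetical data and the $s_e^2$ value from the earlier sketches:

    # Estimated variance and standard error of the slope.
    x = [1, 2, 3, 4, 5]
    n, se2 = len(x), 0.0357              # s_e^2 from the sketch above
    SS_x = sum(xi ** 2 for xi in x) - sum(x) ** 2 / n
    sb1 = (se2 / SS_x) ** 0.5            # standard error of b1
    print(sb1)                           # ≈ 0.0597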
[JK10 page 719] Assumptions for inferences about the linear regression: The set of (x,y) ordered pairs
forms a random sample. The y values at each x have a normal distribution. Since the population
standard deviation is unknown and replaced with the sample standard deviation, the t-distribution will
be used with n-2 degrees of freedom.
Confidence Interval for the Slope of the Regression Line
[JK10 page 719]
The formula is $b_1 \pm t(n-2, \alpha/2) \cdot s_{b_1}$.
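
A sketch of that interval at 95% confidence, using scipy for the t critical value and the values carried over from the earlier sketches:

    # 95% confidence interval for the slope.
    from scipy.stats import t

    n, b1, sb1 = 5, 1.99, 0.0597
    alpha = 0.05
    t_crit = t.ppf(1 - alpha / 2, df=n - 2)       # t(n-2, alpha/2)
    print(b1 - t_crit * sb1, b1 + t_crit * sb1)   # ≈ (1.80, 2.18)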
Hypothesis Testing with the Slope of the Regression Line
The null hypothesis is 𝛽1 = 0, that is, no relationship.
We want to find if the test statistic
$$t^* = \frac{b_1 - \beta_1}{s_{b_1}}$$
(the slope of our regression line, minus the hypothesized slope 0, divided by the standard error of our slope) is in the critical region. The critical value needs the usual significance $\alpha$, and the degrees of freedom is $df = n - 2$.
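
The same test in Python, again with the carried-over hypothetical values; the huge $t^*$ means we would reject $\beta_1 = 0$ here:

    # t test of H0: beta1 = 0 vs. Ha: beta1 != 0.
    from scipy.stats import t

    n, b1, sb1 = 5, 1.99, 0.0597
    t_star = (b1 - 0) / sb1                    # test statistic
    p_value = 2 * t.sf(abs(t_star), df=n - 2)  # two-tailed p-value
    print(t_star, p_value)                     # ≈ 33.3, p ≈ 0.00006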
Coefficient of Linear Correlation
Recall from the introductory discussion (Chapter 3) that
$$r = \cdots = \frac{SS(xy)}{\sqrt{SS(x) \cdot SS(y)}}.$$
Coefficient of Determination
[Blu4 page 550] says more: 𝑟 2 is the coefficient of determination and it is equal to the explained
variation divided by the total variation. It answers the question of “How much of the variation is
explained by the regression line?”
And the coefficient of nondetermination is 1 − 𝑟 2 .
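
Both coefficients computed from the SS quantities, with the same hypothetical data as the earlier sketches:

    # Correlation coefficient r and coefficient of determination r^2.
    x = [1, 2, 3, 4, 5]
    y = [2.1, 3.9, 6.2, 7.8, 10.1]
    n = len(x)

    SS_x = sum(xi ** 2 for xi in x) - sum(x) ** 2 / n
    SS_y = sum(yi ** 2 for yi in y) - sum(y) ** 2 / n
    SS_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n

    r = SS_xy / (SS_x * SS_y) ** 0.5
    print(r, r ** 2)    # r ≈ 0.9987, r^2 ≈ 0.997 (99.7% explained)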
Confidence Intervals and Prediction Intervals of $\hat{y}$
Linear regression produces the linear equation $\hat{y} = b_1 x + b_0$.
[JK10 page 727] When you are talking about one specific value of the independent variable, 𝑥0 , 𝑦̂ is the
best point estimate of 𝑦. [JK10 page 727] goes on to discuss the development leading up to the
formulas for the confidence interval and the prediction interval.
Document1
2/18/2016 7:52 AM - D.R.S.
Confidence Interval for the Mean of Population Values of y at a given x0
[JK10] Notation: 𝜇𝑦|𝑥0 . Formula and example on [JK10 Page 728]
Prediction Interval for the Value of one individual y
[JK10 page 730] Notation for this concept: 𝑦𝑥=𝑥0 . Formula and example.
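
These notes point to JK10 for the formulas rather than reproducing them; as a sketch, the standard simple-regression intervals at a chosen $x_0$ look like this (hypothetical data and fitted values as before; the extra 1 under the square root is what widens the prediction interval):

    # CI for the mean of y at x0, and PI for one individual y at x0.
    from scipy.stats import t

    x = [1, 2, 3, 4, 5]
    n, b1, b0 = len(x), 1.99, 0.05
    se = 0.0357 ** 0.5                   # s_e from the earlier sketch
    x_bar = sum(x) / n
    SS_x = sum(xi ** 2 for xi in x) - sum(x) ** 2 / n
    t_crit = t.ppf(0.975, df=n - 2)

    x0 = 3.5
    y_hat = b0 + b1 * x0                 # point estimate at x0

    half_ci = t_crit * se * (1/n + (x0 - x_bar) ** 2 / SS_x) ** 0.5
    half_pi = t_crit * se * (1 + 1/n + (x0 - x_bar) ** 2 / SS_x) ** 0.5
    print(y_hat - half_ci, y_hat + half_ci)   # ≈ (6.73, 7.30): mean
    print(y_hat - half_pi, y_hat + half_pi)   # ≈ (6.35, 7.68): individual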
Comparison of these two concepts
[JK10 page 731] has illustration – Figure 13.14 – showing Confidence Belts for the mean 𝑦 and Prediction
Belts for the values of $y$. The Prediction Belts are much wider.
[Blu4 page 553] discusses only the Prediction Interval and only briefly, no more than a half page.