Lesson 2 - Simple Linear Regression

Simple Linear Regression
Deterministic Relationship
• A relationship is deterministic if the value of y (the dependent
  variable) is completely determined by the value of x (the independent
  variable), as in an equation of the form y = 2x + 10 or f(x) = 5x - 1.
• However, in most situations, the variables of interest are
  not deterministically related!
  For example, the value of y = first-year college GPA is
  certainly not determined solely by x = high school GPA.
Probabilistic Model
• A description of the relationship between two variables x and y
  that are not deterministically related.
• The general form allows y to be larger or smaller than f(x)
  by a random amount, e.

$$y = \text{deterministic function of } x + \text{random deviation}$$
$$y = f(x) + e$$

Let x* denote a specified value of x. Then

$$\begin{cases} y > f(x^*) & \text{if } e > 0 \\ y < f(x^*) & \text{if } e < 0 \\ y = f(x^*) & \text{if } e = 0 \end{cases}$$
Without the random deviation e, all observed (x, y) points
would fall exactly on the population regression line. The
inclusion of e in the model equation recognizes that
points will deviate from the line.
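The role of e can be illustrated with a short simulation. This is a minimal sketch and not part of the lesson: f(x) = 2x + 10 is borrowed from the deterministic example, while the normal deviations, the standard deviation of 3, the seed, and the names `f` and `simulate_y` are all assumptions made for illustration.

```python
import random

def f(x):
    """Deterministic part of the model (illustrative choice: 2x + 10)."""
    return 2 * x + 10

def simulate_y(x, rng, sigma=3.0):
    """Return (y, e) with y = f(x) + e.

    Assumption: e is normal with mean 0 and standard deviation sigma;
    the default of 3.0 is arbitrary.
    """
    e = rng.gauss(0.0, sigma)
    return f(x) + e, e

rng = random.Random(0)  # fixed seed so repeated runs give the same draws
x_star = 5
draws = [simulate_y(x_star, rng) for _ in range(5)]

for y, e in draws:
    if e > 0:
        position = "above"
    elif e < 0:
        position = "below"
    else:
        position = "on"
    print(f"e = {e:+.2f}  y = {y:.2f}  ({position} f(x*) = {f(x_star)})")
```

Each simulated point lands above f(x*) when its deviation is positive and below it when the deviation is negative, which is the case split the model describes.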
Simple Linear Regression Model:

$$y = \alpha + \beta x + e$$

Assumptions about the distribution of e
• Mean: $\mu_e = 0$.
• Standard deviation: $\sigma$ is the same for any value of x.
• The distribution of e at any x value is normal.
• The random deviations $e_1, e_2, e_3, \ldots, e_n$ associated
  with different observations are independent of one another.
Slope

$$y = \alpha + \beta x + e$$

• The population regression line is $y = \alpha + \beta x$. Its slope $\beta$ is
  the average change in y associated with a 1-unit increase in x.
• The point estimate of the slope is b. (The population value is $\beta$.)
• The point estimate of the y-intercept is a. (The population value is $\alpha$.)
Summary
• $b = \text{point estimate of } \beta = \dfrac{S_{xy}}{S_{xx}}$
• $a = \text{point estimate of } \alpha = \bar{y} - b\bar{x}$

where

$$S_{xy} = \sum xy - \frac{\left(\sum x\right)\left(\sum y\right)}{n} \quad \text{and} \quad S_{xx} = \sum x^2 - \frac{\left(\sum x\right)^2}{n}$$

The estimated regression line is then just the least-squares line
$\hat{y} = a + bx$.
x* denotes a specified value of the predictor variable x.
• So $a + bx^*$ has two different interpretations:
  ◦ It is a point estimate of the true mean y value
    when x = x*.
  ◦ It is a point predictor of an individual y value
    that would be observed when x = x*.
Find the point estimate of the mean y-value for
the following, where x = mother's age and y = birth weight:

Age (x), years     15    17    18    15    16    19    17    16    18    20
Weight (y), g    2289  3393  3271  2648  2897  3327  2970  2535  3138  3573
So what is the point estimate for an 18-year-old mother?
Point estimate and point prediction are identical; only the
interpretation is different.
• Prediction: the weight of a single baby whose mother is 18.
• Estimate: the average weight of all babies born to 18-year-olds.
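One way to work this example is to apply the S_xy and S_xx formulas to the ten (age, weight) pairs from the table. The variable names below are illustrative, and the code is a sketch rather than the calculator procedure the lesson uses.

```python
ages = [15, 17, 18, 15, 16, 19, 17, 16, 18, 20]                        # x, years
weights = [2289, 3393, 3271, 2648, 2897, 3327, 2970, 2535, 3138, 3573]  # y, grams

n = len(ages)
s_xy = sum(x * y for x, y in zip(ages, weights)) - sum(ages) * sum(weights) / n
s_xx = sum(x * x for x in ages) - sum(ages) ** 2 / n
b = s_xy / s_xx                            # estimated slope
a = sum(weights) / n - b * sum(ages) / n   # estimated y-intercept

x_star = 18
y_hat = a + b * x_star
print(f"y-hat = {a:.1f} + {b:.1f}x")
print(f"At x* = {x_star}: about {y_hat:.0f} g")
```

With these data, b is roughly 219.8 g per year of age and a is roughly -753.7 g, so a + b(18) is roughly 3202 g. The same number serves as both the estimate of the mean weight and the prediction for a single baby.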
Answer the following:
• Explain the slope in the context of the problem.
• Explain the y-intercept in the context of the problem.
Find SSResid.

$$\text{SSResid} = \sum \left(y - \hat{y}\right)^2$$

On a calculator, every time you calculate a linear regression, it also
calculates the residuals. Put them in list 3, square them, and add up
the list.

The point estimate of $\sigma$ is $s_e$.

$$s_e = \sqrt{\frac{\text{SSResid}}{n - 2}}$$

• It represents the typical deviation of the y values
  from the least-squares line.
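The calculator steps (store the residuals, square them, add the list) can be mirrored in code. This sketch uses the mother's age and birth weight data from the earlier example. The helper name `fit_line` is an assumption, and s_e is taken as the square root of SSResid / (n - 2).

```python
import math

ages = [15, 17, 18, 15, 16, 19, 17, 16, 18, 20]                        # x, years
weights = [2289, 3393, 3271, 2648, 2897, 3327, 2970, 2535, 3138, 3573]  # y, grams

def fit_line(xs, ys):
    """Return (a, b) for the least-squares line y-hat = a + b*x."""
    n = len(xs)
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    s_xx = sum(x * x for x in xs) - sum(xs) ** 2 / n
    b = s_xy / s_xx
    return sum(ys) / n - b * sum(xs) / n, b

a, b = fit_line(ages, weights)

# Residual = observed y minus fitted y-hat (the calculator's residual list).
residuals = [y - (a + b * x) for x, y in zip(ages, weights)]
ss_resid = sum(r ** 2 for r in residuals)          # square them and add
s_e = math.sqrt(ss_resid / (len(ages) - 2))        # n - 2 degrees of freedom

print(f"SSResid = {ss_resid:.1f}")
print(f"s_e = {s_e:.1f} g")
```

For these ten observations, SSResid is roughly 336,700 and s_e is roughly 205 g, so a typical birth weight sits about 205 g away from the fitted line.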
Find the residual for a mother who is 19.
Find the probability that a 19-year-old
mother has a baby that weighs more than 3000 g.
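One possible way to approach both questions is sketched below. The residual uses the single 19-year-old mother in the data table. The probability relies on an assumption that goes beyond the lesson: it treats birth weight at x = 19 as normal with mean a + b(19) and standard deviation s_e, ignoring the uncertainty in a and b.

```python
import math
from statistics import NormalDist

ages = [15, 17, 18, 15, 16, 19, 17, 16, 18, 20]                        # x, years
weights = [2289, 3393, 3271, 2648, 2897, 3327, 2970, 2535, 3138, 3573]  # y, grams

n = len(ages)
s_xy = sum(x * y for x, y in zip(ages, weights)) - sum(ages) * sum(weights) / n
s_xx = sum(x * x for x in ages) - sum(ages) ** 2 / n
b = s_xy / s_xx
a = sum(weights) / n - b * sum(ages) / n

# Residual for the observed 19-year-old mother: observed y minus y-hat.
i = ages.index(19)
residual_19 = weights[i] - (a + b * ages[i])

# s_e is the point estimate of sigma.
ss_resid = sum((y - (a + b * x)) ** 2 for x, y in zip(ages, weights))
s_e = math.sqrt(ss_resid / (n - 2))

# Assumption: plug the estimates into a normal model for y at x = 19.
model_19 = NormalDist(mu=a + b * 19, sigma=s_e)
p_over_3000 = 1 - model_19.cdf(3000)

print(f"residual at x = 19: {residual_19:.1f} g")
print(f"P(y > 3000 g | x = 19) is about {p_over_3000:.2f}")
```

Under these assumptions the residual is roughly -95 g (the observed baby is lighter than the line predicts) and the probability is roughly 0.98.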
Coefficient of determination (r²)

$$\text{SSTot} = \sum \left(y - \bar{y}\right)^2 = \sum y^2 - \frac{\left(\sum y\right)^2}{n}$$

$$r^2 = 1 - \frac{\text{SSResid}}{\text{SSTot}}$$

• It is the proportion of variation in the y values that can be
  explained by the least-squares line.
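As a check on the two forms of SSTot and on r², the sketch below evaluates them for the mother's age and birth weight data. The variable names are illustrative only.

```python
ages = [15, 17, 18, 15, 16, 19, 17, 16, 18, 20]                        # x, years
weights = [2289, 3393, 3271, 2648, 2897, 3327, 2970, 2535, 3138, 3573]  # y, grams

n = len(ages)
y_bar = sum(weights) / n

# Least-squares fit from the S_xy / S_xx formulas.
s_xy = sum(x * y for x, y in zip(ages, weights)) - sum(ages) * sum(weights) / n
s_xx = sum(x * x for x in ages) - sum(ages) ** 2 / n
b = s_xy / s_xx
a = y_bar - b * sum(ages) / n

# SSTot two ways: the definition and the computational shortcut.
ss_tot_definition = sum((y - y_bar) ** 2 for y in weights)
ss_tot_shortcut = sum(y * y for y in weights) - sum(weights) ** 2 / n

ss_resid = sum((y - (a + b * x)) ** 2 for x, y in zip(ages, weights))
r_squared = 1 - ss_resid / ss_tot_definition

print(f"SSTot = {ss_tot_definition:.1f}")
print(f"r^2 = {r_squared:.3f}")
```

Both forms of SSTot agree (about 1,539,183 here), and r² comes out near 0.78, so roughly 78% of the variation in birth weight is explained by the linear relationship with mother's age.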
Homework
• Worksheet