Introduction to Regression Analysis

• We use sample data to
• estimate a population mean (μ) or a difference of means (μ1 - μ2)
• estimate a population proportion (p) or a difference of proportions (p1 - p2)
• test hypotheses about μ or (μ1 - μ2)
• test hypotheses about p or (p1 - p2).
• Now we want to use sample data to investigate the
relationships among a group of variables and to create a
mathematical model that can be used to predict the value of
one of those variables in the future.
• The process of finding a mathematical model (an equation)
that best fits the data is known as regression analysis.
• The variable to be predicted (or modeled), y, is called the
dependent variable.
• The variables used to predict (or model) y are called
independent variables and are denoted by the symbols x1,
x2, x3, etc.
• General form of the probabilistic model in regression:
$y = \mu_{y|x_1, x_2, \ldots, x_k} + \varepsilon = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon$
where
y = dependent variable
$\mu_{y|x_1, x_2, \ldots, x_k}$ = mean or expected value of y, the deterministic component
$\varepsilon$ = unexplainable, or random, error component
• Estimation/prediction equation
$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k$
Form of The Simple Linear
Regression Model
$y = \mu_{y|x} + \varepsilon = \beta_0 + \beta_1 x + \varepsilon$
μy|x = β0 + β1x is the mean value of the
dependent variable y when the value of the
independent variable is x
β0 is the y-intercept, the mean of y when x is 0
(meaningful only when values of x near 0 have been observed)
β1 is the slope, the change in the mean of y per
unit change in x (over the range of sample x-values)
ε is an error term that describes the effect on y
of all factors other than x
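As a rough illustration of the model above, the sketch below generates y-values as a straight-line mean plus a random error. The parameter values (beta0 = 2.0, beta1 = 0.5), the x-values, and the choice of a normally distributed error are all assumptions made for the example; the slides do not specify them.

```python
import random

def simulate_simple_linear(xs, beta0, beta1, sigma, seed=0):
    """Draw one y for each x from y = beta0 + beta1 * x + epsilon."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same draws
    ys = []
    for x in xs:
        mean_y = beta0 + beta1 * x        # deterministic component, mu_{y|x}
        epsilon = rng.gauss(0.0, sigma)   # random error (normality is assumed here)
        ys.append(mean_y + epsilon)
    return ys

xs = [1, 2, 3, 4, 5]  # hypothetical x-values
noiseless = simulate_simple_linear(xs, beta0=2.0, beta1=0.5, sigma=0.0)
noisy = simulate_simple_linear(xs, beta0=2.0, beta1=0.5, sigma=1.0)
print("mean of y at each x:", noiseless)
print("simulated observations:", noisy)
```

With sigma set to 0 the output is exactly the line of means; with sigma greater than 0 the observations scatter around that line.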
The Simple Linear Regression Model
Illustrated
Regression Terms
• β0 and β1 are called regression
parameters
• β0 is the y-intercept and β1 is the slope
• We do not know the true values of these
parameters
• So, we must use sample data to
estimate them
• b0 is the estimate of β0 and b1 is the
estimate of β1
The Least Squares Point Estimates
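Estimation/prediction equation: $\hat{y} = b_0 + b_1 x$
Slope: $b_1 = \frac{SS_{xy}}{SS_{xx}}$
y-intercept: $b_0 = \bar{y} - b_1 \bar{x}$
where
$\bar{x} = \frac{\sum x_i}{n}$, $\bar{y} = \frac{\sum y_i}{n}$, n = sample size
$SS_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y}) = \sum x_i y_i - n\bar{x}\bar{y}$
$SS_{xx} = \sum (x_i - \bar{x})^2 = \sum x_i^2 - n\bar{x}^2$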
MS EXCEL: =SLOPE(y range, x range)
=INTERCEPT(y range, x range)
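The least squares formulas above can be sketched in a few lines of Python. The five data points are hypothetical and chosen only to keep the arithmetic easy to check by hand; the function name is likewise illustrative.

```python
def least_squares(xs, ys):
    """Return (b0, b1) using the computational SS_xy and SS_xx formulas."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    ss_xy = sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar
    ss_xx = sum(x * x for x in xs) - n * x_bar ** 2
    b1 = ss_xy / ss_xx          # slope
    b0 = y_bar - b1 * x_bar     # y-intercept
    return b0, b1

xs = [1, 2, 3, 4, 5]  # hypothetical sample
ys = [2, 4, 5, 4, 5]
b0, b1 = least_squares(xs, ys)
print(f"y_hat = {b0:.2f} + {b1:.2f} x")
```

For this sample SS_xy = 6 and SS_xx = 10, so b1 = 0.6 and b0 = 4 - 0.6(3) = 2.2, which should match what =SLOPE and =INTERCEPT return for the same ranges.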
An Estimator of σ²
$s^2 = \frac{SSE}{n - 2}$
where
$SSE = \sum (y_i - \hat{y}_i)^2 = SS_{yy} - b_1 SS_{xy} = \sum y_i^2 - n\bar{y}^2 - b_1 SS_{xy}$
n = sample size
s = standard deviation of error = standard error of estimate
A 100(1-α)% confidence interval for the simple
linear regression slope β1
$b_1 \pm t_{\alpha/2}\, s_{b_1}$
where
$s_{b_1} = \frac{s}{\sqrt{SS_{xx}}}$
$t_{\alpha/2}$ is based on (n-2) degrees of freedom
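The interval for the slope might be computed as in the sketch below. The data are hypothetical, and the critical value 3.182 (t for α/2 = 0.025 with 3 degrees of freedom) is read from a standard t table and passed in by hand rather than computed.

```python
import math

def slope_confidence_interval(xs, ys, t_crit):
    """Return (lower, upper) for b1 +/- t_crit * s_b1."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    ss_xy = sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar
    ss_xx = sum(x * x for x in xs) - n * x_bar ** 2
    ss_yy = sum(y * y for y in ys) - n * y_bar ** 2
    b1 = ss_xy / ss_xx
    s = math.sqrt((ss_yy - b1 * ss_xy) / (n - 2))  # standard error of estimate
    s_b1 = s / math.sqrt(ss_xx)                    # standard error of the slope
    margin = t_crit * s_b1
    return b1 - margin, b1 + margin

xs = [1, 2, 3, 4, 5]  # hypothetical sample, n = 5
ys = [2, 4, 5, 4, 5]
T_CRIT = 3.182        # t_{0.025} with n - 2 = 3 df, taken from a t table
lower, upper = slope_confidence_interval(xs, ys, T_CRIT)
print(f"95% CI for the slope: ({lower:.3f}, {upper:.3f})")
```

For this small sample the interval runs from about -0.300 to 1.500 and therefore contains 0.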
Testing the Significance of the Slope
Test statistic: $t = \frac{b_1}{s_{b_1}}$
One-Tailed Test
H0: β1 = 0
Ha: β1 < 0 (or Ha: β1 > 0)
Rejection region: $t < -t_{\alpha}$ (or $t > t_{\alpha}$ when Ha: β1 > 0)
where $t_{\alpha}$ is based on (n-2) degrees of freedom
Two-Tailed Test
H0: β1 = 0
Ha: β1 ≠ 0
Rejection region: $|t| > t_{\alpha/2}$
where $t_{\alpha/2}$ is based on (n-2) degrees of freedom
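A two-tailed test of the slope can be sketched as follows. The sample is hypothetical, and the critical value (3.182 for α = 0.05 and 3 degrees of freedom) is an assumption taken from a t table.

```python
import math

def slope_t_statistic(xs, ys):
    """Return t = b1 / s_b1 for testing H0: beta1 = 0."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    ss_xy = sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar
    ss_xx = sum(x * x for x in xs) - n * x_bar ** 2
    ss_yy = sum(y * y for y in ys) - n * y_bar ** 2
    b1 = ss_xy / ss_xx
    s = math.sqrt((ss_yy - b1 * ss_xy) / (n - 2))
    s_b1 = s / math.sqrt(ss_xx)
    return b1 / s_b1

xs = [1, 2, 3, 4, 5]  # hypothetical sample, n = 5
ys = [2, 4, 5, 4, 5]
T_CRIT = 3.182        # t_{0.025} with 3 df, from a t table (two-tailed, alpha = 0.05)
t_stat = slope_t_statistic(xs, ys)
reject_h0 = abs(t_stat) > T_CRIT  # two-tailed rejection region |t| > t_{alpha/2}
print(f"t = {t_stat:.3f}, reject H0: {reject_h0}")
```

Here t is about 2.121, which does not exceed 3.182, so H0: β1 = 0 is not rejected at the 5% level for this sample.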
The 100(1-α)% confidence interval for the mean
value of y for x = xp
$\hat{y} \pm t_{\alpha/2}\, s \sqrt{\frac{1}{n} + \frac{(x_p - \bar{x})^2}{SS_{xx}}}$
where $t_{\alpha/2}$ is based on (n-2) degrees of freedom
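As a hedged sketch, the confidence interval for the mean of y at a chosen x-value could be computed like this. The sample, the point xp = 4, and the tabled critical value 3.182 are all assumptions made for illustration.

```python
import math

def mean_confidence_interval(xs, ys, x_p, t_crit):
    """Return (lower, upper) for the mean of y at x = x_p."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    ss_xy = sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar
    ss_xx = sum(x * x for x in xs) - n * x_bar ** 2
    ss_yy = sum(y * y for y in ys) - n * y_bar ** 2
    b1 = ss_xy / ss_xx
    b0 = y_bar - b1 * x_bar
    s = math.sqrt((ss_yy - b1 * ss_xy) / (n - 2))
    y_hat = b0 + b1 * x_p  # point estimate at x_p
    margin = t_crit * s * math.sqrt(1 / n + (x_p - x_bar) ** 2 / ss_xx)
    return y_hat - margin, y_hat + margin

xs = [1, 2, 3, 4, 5]  # hypothetical sample
ys = [2, 4, 5, 4, 5]
ci_lower, ci_upper = mean_confidence_interval(xs, ys, x_p=4, t_crit=3.182)
print(f"95% CI for mean y at x = 4: ({ci_lower:.3f}, {ci_upper:.3f})")
```

The margin grows with (xp - x̄)², so the interval is narrowest at xp = x̄ and widens toward the edges of the sampled x-range.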
The 100(1-α)% prediction interval for an
individual y for x = xp
$\hat{y} \pm t_{\alpha/2}\, s \sqrt{1 + \frac{1}{n} + \frac{(x_p - \bar{x})^2}{SS_{xx}}}$
where $t_{\alpha/2}$ is based on (n-2) degrees of freedom
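The prediction interval differs from the interval for the mean only by the extra 1 under the square root, which accounts for the variation of an individual observation. A minimal sketch, using a hypothetical sample, xp = 4, and a critical value of 3.182 read from a t table:

```python
import math

def prediction_interval(xs, ys, x_p, t_crit):
    """Return (lower, upper) for an individual y at x = x_p."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    ss_xy = sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar
    ss_xx = sum(x * x for x in xs) - n * x_bar ** 2
    ss_yy = sum(y * y for y in ys) - n * y_bar ** 2
    b1 = ss_xy / ss_xx
    b0 = y_bar - b1 * x_bar
    s = math.sqrt((ss_yy - b1 * ss_xy) / (n - 2))
    y_hat = b0 + b1 * x_p
    # The leading 1 is what distinguishes this from the interval for the mean.
    margin = t_crit * s * math.sqrt(1 + 1 / n + (x_p - x_bar) ** 2 / ss_xx)
    return y_hat - margin, y_hat + margin

xs = [1, 2, 3, 4, 5]  # hypothetical sample
ys = [2, 4, 5, 4, 5]
pi_lower, pi_upper = prediction_interval(xs, ys, x_p=4, t_crit=3.182)
print(f"95% PI for an individual y at x = 4: ({pi_lower:.3f}, {pi_upper:.3f})")
```

For the same data and xp, this interval is always wider than the confidence interval for the mean value of y.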
Simple Coefficient of Determination
$r^2 = \frac{\text{Explained Variation}}{\text{Total Variation}} = \frac{\sum (\hat{y}_i - \bar{y})^2}{\sum (y_i - \bar{y})^2}$
About 100(r2)% of the sample variation in y can be
explained by using x to predict y in the simple linear
regression model.
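The ratio of explained to total variation can be checked numerically. The sketch below uses a hypothetical five-point sample and also confirms that explained plus unexplained variation adds up to the total.

```python
def variation_components(xs, ys):
    """Return (explained, unexplained, total) variation for a fitted line."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    ss_xy = sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar
    ss_xx = sum(x * x for x in xs) - n * x_bar ** 2
    b1 = ss_xy / ss_xx
    b0 = y_bar - b1 * x_bar
    y_hats = [b0 + b1 * x for x in xs]
    explained = sum((yh - y_bar) ** 2 for yh in y_hats)
    unexplained = sum((y - yh) ** 2 for y, yh in zip(ys, y_hats))  # this is SSE
    total = sum((y - y_bar) ** 2 for y in ys)
    return explained, unexplained, total

xs = [1, 2, 3, 4, 5]  # hypothetical sample
ys = [2, 4, 5, 4, 5]
explained, unexplained, total = variation_components(xs, ys)
r_squared = explained / total
print(f"explained = {explained:.2f}, unexplained = {unexplained:.2f}, "
      f"total = {total:.2f}, r^2 = {r_squared:.2f}")
```

For this sample the explained variation is 3.6 out of a total of 6, so r² = 0.6: about 60% of the sample variation in y is explained by x.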
[Figure: at a given xi, the total variation (yi - ȳ) splits into explained variation (ŷi - ȳ) and unexplained variation (yi - ŷi).]
The coefficient of correlation
$r = \frac{SS_{xy}}{\sqrt{SS_{xx}\, SS_{yy}}}$
where $SS_{yy} = \sum (y_i - \bar{y})^2 = \sum y_i^2 - n\bar{y}^2$
r denotes the sample correlation and ρ (rho) the population correlation
-1 ≤ r ≤ 1
r > 0 means that y tends to increase as x increases
r < 0 means that y tends to decrease as x increases
r ≈ 0 means little or no linear relationship between y and x.
The closer r is to 1 or -1, the stronger the linear relationship.
High correlation does not imply causality; it indicates only that a
linear trend may exist between x and y.
$r = +\sqrt{r^2}$ when b1 > 0, or $r = -\sqrt{r^2}$ when b1 < 0
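A short sketch of the correlation formula, including the rule that r takes the sign of the slope b1. Both samples are hypothetical; the second simply reverses the y-values of the first to produce a negative slope.

```python
import math

def correlation(xs, ys):
    """Return (r, b1) using r = SS_xy / sqrt(SS_xx * SS_yy)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    ss_xy = sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar
    ss_xx = sum(x * x for x in xs) - n * x_bar ** 2
    ss_yy = sum(y * y for y in ys) - n * y_bar ** 2
    r = ss_xy / math.sqrt(ss_xx * ss_yy)
    b1 = ss_xy / ss_xx
    return r, b1

xs = [1, 2, 3, 4, 5]       # hypothetical sample
ys_up = [2, 4, 5, 4, 5]    # upward trend
ys_down = [5, 4, 5, 4, 2]  # same values reversed, downward trend

r_up, b1_up = correlation(xs, ys_up)
r_down, b1_down = correlation(xs, ys_down)

# Recover r from r^2 by attaching the sign of the slope.
r_up_from_r2 = math.copysign(math.sqrt(r_up ** 2), b1_up)
r_down_from_r2 = math.copysign(math.sqrt(r_down ** 2), b1_down)
print(f"r_up = {r_up:.4f}, r_down = {r_down:.4f}")
```

Both samples give r² = 0.6, but r is about +0.775 for the upward trend and about -0.775 for the downward one.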
Exercise
• What is the range of values that the coefficient of
determination can assume? ___
• If the value of r is -0.96, what does this indicate
about the dependent variable as the
independent variable increases? __
• If the correlation between sales and advertising
is +0.6, what percent of the variation in sales
can be attributed to advertising? __
• What does the coefficient of determination equal
if r = 0.89?
Exercise
• In the regression equation, what does the letter
"b" represent?
• What is the null hypothesis to test the
significance of the slope in a regression
equation?
• The regression equation is Ŷ = 29.29 - 0.96X,
the sample size is 8, and the standard error of
the slope is 0.22. What is the test statistic to test
the significance of the slope?
Exercise
• Page 488 no. 26
• Page 494 no. 31
• Page 500 no. 38
• Page 502 no. 46
• Page 506 no. 56