Correlation in Simple Linear Regression
Session 03
Course: I0174 - Regression Analysis
Year: Odd semester 2007/2008
Topics covered:
• Simple linear correlation coefficient
• Coefficient of determination
• Correlation coefficient for grouped data
Bina Nusantara
Chapter Topics
• Types of Regression Models
• Determining the Simple Linear Regression Equation
• Measures of Variation
• Assumptions of Regression and Correlation
• Residual Analysis
• Measuring Autocorrelation
• Inferences about the Slope
• Correlation - Measuring the Strength of the Association
• Estimation of Mean Values and Prediction of Individual Values
• Pitfalls in Regression and Ethical Issues
Purpose of Regression Analysis
• Regression Analysis is Used Primarily to Model Causality and Provide Prediction
– Predict the values of a dependent (response) variable based on values of at least one independent (explanatory) variable
– Explain the effect of the independent variables on the dependent variable
Types of Regression Models
[Figure: four scatter plots showing a positive linear relationship, a negative linear relationship, a relationship that is not linear, and no relationship]
Simple Linear Regression Model
• Relationship between Variables is Described by a Linear Function
• The Change of One Variable Causes the Other Variable to Change
• A Dependency of One Variable on the Other
Simple Linear Regression Model
(continued)
Population regression line is a straight line that describes the dependence of the average value (conditional mean) of one variable on the other:
$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$$
where $Y_i$ is the dependent (response) variable, $\beta_0$ is the population Y intercept, $\beta_1$ is the population slope coefficient, $X_i$ is the independent (explanatory) variable, and $\varepsilon_i$ is the random error. The population regression line (conditional mean) is
$$\mu_{Y|X} = \beta_0 + \beta_1 X_i$$
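Simple Linear Regression Model
(continued)
[Figure: plot of Y against X showing the population regression line $\mu_{Y|X} = \beta_0 + \beta_1 X_i$ (conditional mean) and an observed value $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$, with the random error $\varepsilon_i$ marked between the observed value and the line]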
Linear Regression Equation
Sample regression line provides an estimate of the population regression line as well as a predicted value of Y:
$$Y_i = b_0 + b_1 X_i + e_i$$
where $b_0$ is the sample Y intercept, $b_1$ is the sample slope coefficient, and $e_i$ is the residual. The simple regression equation (fitted regression line, predicted value) is
$$\hat{Y} = b_0 + b_1 X$$
Linear Regression Equation
(continued)
• $b_0$ and $b_1$ are obtained by finding the values of $b_0$ and $b_1$ that minimize the sum of the squared residuals
$$\sum_{i=1}^{n} \left(Y_i - \hat{Y}_i\right)^2 = \sum_{i=1}^{n} e_i^2$$
• $b_0$ provides an estimate of $\beta_0$
• $b_1$ provides an estimate of $\beta_1$
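A minimal sketch of the least-squares idea, assuming the standard closed-form solution for simple linear regression, $b_1 = \sum (X_i - \bar{X})(Y_i - \bar{Y}) / \sum (X_i - \bar{X})^2$ and $b_0 = \bar{Y} - b_1 \bar{X}$. The function names and the small data set are hypothetical and chosen only for illustration.

```python
def fit_line(x, y):
    """Return (b0, b1) that minimize the sum of squared residuals."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


def sum_squared_residuals(x, y, b0, b1):
    """Sum of (Y_i - Yhat_i)^2 for the line Yhat = b0 + b1 * X."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))


# Hypothetical data, not taken from the slides.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = fit_line(x, y)
sse_fit = sum_squared_residuals(x, y, b0, b1)
# A slightly different line should give a larger sum of squared residuals.
sse_other = sum_squared_residuals(x, y, b0 + 0.1, b1 - 0.05)
print(b0, b1, sse_fit, sse_other)
```

Comparing `sse_fit` with `sse_other` illustrates the minimizing property: any line other than the fitted one has a larger sum of squared residuals.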
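Linear Regression Equation
(continued)
[Figure: plot of Y against X comparing the sample regression line $\hat{Y}_i = b_0 + b_1 X_i$ (intercept $b_0$, slope $b_1$) with the population regression line $\mu_{Y|X} = \beta_0 + \beta_1 X_i$. For an observed value, $Y_i = b_0 + b_1 X_i + e_i = \beta_0 + \beta_1 X_i + \varepsilon_i$, with the residual $e_i$ and the random error $\varepsilon_i$ marked]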
Interpretation of the Slope
and Intercept
• $\beta_0 = E(Y \mid X = 0)$ is the average value of Y when the value of X is zero
• $\beta_1 = \dfrac{\text{change in } E(Y \mid X)}{\text{change in } X}$ measures the change in the average value of Y as a result of a one-unit change in X
Interpretation of the Slope
and Intercept
(continued)
• $b_0 = \hat{E}(Y \mid X = 0)$ is the estimated average value of Y when the value of X is zero
• $b_1 = \dfrac{\text{change in } \hat{E}(Y \mid X)}{\text{change in } X}$ is the estimated change in the average value of Y as a result of a one-unit change in X
Simple Linear Regression: Example
You wish to examine the linear dependency of the annual sales of produce stores on their sizes in square footage. Sample data for 7 stores were obtained. Find the equation of the straight line that fits the data best.

Store   Square Feet   Annual Sales ($1000)
1       1,726         3,681
2       1,542         3,395
3       2,816         6,653
4       5,555         9,543
5       1,292         3,318
6       2,208         5,563
7       1,313         3,760
Scatter Diagram: Example
[Figure: Excel output, scatter plot of Annual Sales ($000) against Square Feet for the 7 stores]
Simple Linear Regression Equation: Example
$$\hat{Y}_i = b_0 + b_1 X_i = 1636.415 + 1.487 X_i$$
From Excel Printout:
               Coefficients
Intercept      1636.414726
X Variable 1   1.486633657
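The printed coefficients can be checked directly from the seven data points. The following is a minimal sketch, assuming the usual closed-form least-squares formulas rather than Excel's own routine; the variable names are illustrative.

```python
# Produce-store data from the example: size in square feet, annual sales in $1000.
square_feet = [1726, 1542, 2816, 5555, 1292, 2208, 1313]
annual_sales = [3681, 3395, 6653, 9543, 3318, 5563, 3760]

n = len(square_feet)
x_bar = sum(square_feet) / n
y_bar = sum(annual_sales) / n

# Sums of cross-products and squares of deviations from the means.
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(square_feet, annual_sales))
sxx = sum((x - x_bar) ** 2 for x in square_feet)

b1 = sxy / sxx           # sample slope
b0 = y_bar - b1 * x_bar  # sample Y intercept

print(f"b0 = {b0:.3f}, b1 = {b1:.3f}")
```

Rounded to three decimals, the result should agree with the Intercept and X Variable 1 entries of the printout.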
Graph of the Simple Linear Regression Equation: Example
[Figure: scatter plot of Annual Sales ($000) against Square Feet with the fitted regression line]
Interpretation of Results: Example
$$\hat{Y}_i = 1636.415 + 1.487 X_i$$
The slope of 1.487 means that for each increase of one unit in X, we predict the average of Y to increase by an estimated 1.487 units.
Because annual sales are recorded in thousands of dollars, the equation estimates that for each increase of 1 square foot in the size of the store, the expected annual sales are predicted to increase by $1,487.
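As a worked illustration of how the fitted equation is used, the calculation below converts the slope into dollars and evaluates the line at one store size. The 4,000 square-foot store is a hypothetical value chosen for illustration; it lies inside the range of sizes in the sample.

```latex
\Delta \hat{Y} = b_1 \,\Delta X = 1.487 \times 1 = 1.487 \ (\$1000) = \$1{,}487
\qquad
\hat{Y}\big|_{X = 4000} = 1636.415 + 1.487 \times 4000 = 7584.415 \ (\$1000)
```

Under these assumptions the predicted annual sales for such a store would be roughly $7.58 million.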
Simple Linear Regression
in PHStat
• In Excel, use PHStat | Regression | Simple Linear Regression …
• Excel Spreadsheet of Regression Sales on Footage
Measures of Variation:
The Sum of Squares
$$SST = SSR + SSE$$
Total Sample Variability = Explained Variability + Unexplained Variability
Measures of Variation:
The Sum of Squares
(continued)
• SST = Total Sum of Squares
– Measures the variation of the $Y_i$ values around their mean, $\bar{Y}$
• SSR = Regression Sum of Squares
– Explained variation attributable to the relationship between X and Y
• SSE = Error Sum of Squares
– Variation attributable to factors other than the relationship between X and Y
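Measures of Variation:
The Sum of Squares
(continued)
$$SST = \sum (Y_i - \bar{Y})^2 \qquad SSR = \sum (\hat{Y}_i - \bar{Y})^2 \qquad SSE = \sum (Y_i - \hat{Y}_i)^2$$
[Figure: plot of Y against X showing, at a given $X_i$, the deviations behind SST, SSR and SSE relative to the fitted line and the mean $\bar{Y}$]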
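The three sums of squares can be computed for the produce-store data as a check on the decomposition SST = SSR + SSE. This is a minimal sketch with illustrative variable names; it refits the line with the closed-form least-squares formulas so that the block is self-contained.

```python
# Produce-store data: size in square feet, annual sales in $1000.
square_feet = [1726, 1542, 2816, 5555, 1292, 2208, 1313]
annual_sales = [3681, 3395, 6653, 9543, 3318, 5563, 3760]

n = len(square_feet)
x_bar = sum(square_feet) / n
y_bar = sum(annual_sales) / n

# Least-squares slope and intercept.
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(square_feet, annual_sales))
      / sum((x - x_bar) ** 2 for x in square_feet))
b0 = y_bar - b1 * x_bar
fitted = [b0 + b1 * x for x in square_feet]

sst = sum((y - y_bar) ** 2 for y in annual_sales)               # total
ssr = sum((f - y_bar) ** 2 for f in fitted)                     # explained
sse = sum((y - f) ** 2 for y, f in zip(annual_sales, fitted))   # unexplained

print(f"SST = {sst:.2f}, SSR = {ssr:.2f}, SSE = {sse:.2f}")
print(f"SSR + SSE = {ssr + sse:.2f}")
```

Up to floating-point rounding, SSR + SSE reproduces SST, and the values should agree with the SS column of the Excel ANOVA output for this example.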
Venn Diagrams and Explanatory Power of Regression
[Figure: Venn diagram of two overlapping circles, Sales and Sizes]
• Overlap of the two circles: variations in Sales explained by Sizes, or variations in Sizes used in explaining variation in Sales (SSR)
• Sales circle outside the overlap: variations in Sales explained by the error term or unexplained by Sizes (SSE)
• Sizes circle outside the overlap: variations in store Sizes not used in explaining variation in Sales
The ANOVA Table in Excel
ANOVA
             df      SS    MS                  F         Significance F
Regression   k       SSR   MSR = SSR/k         MSR/MSE   P-value of the F Test
Residuals    n-k-1   SSE   MSE = SSE/(n-k-1)
Total        n-1     SST
Measures of Variation
The Sum of Squares: Example
Excel Output for Produce Stores
ANOVA
             df   SS            MS          F          Significance F
Regression   1    30380456.12   30380456    81.17909   0.000281201
Residual     5    1871199.595   374239.92
Total        6    32251655.71
The df column gives the regression (explained), error (residual) and total degrees of freedom. The SS column gives SSR, SSE and SST.
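The MS and F entries follow from the SS and df columns by the formulas in the ANOVA table. A minimal sketch, taking the sums of squares from the Excel output as given (variable names are illustrative):

```python
# Sums of squares from the Excel output for the produce stores.
ssr = 30380456.12   # regression (explained) sum of squares
sse = 1871199.595   # error (residual) sum of squares

n = 7  # number of stores
k = 1  # number of independent variables

msr = ssr / k              # mean square regression
mse = sse / (n - k - 1)    # mean square error
f_stat = msr / mse         # F test statistic

print(f"MSR = {msr:.2f}, MSE = {mse:.2f}, F = {f_stat:.5f}")
```

The Significance F entry (the p-value) is not recomputed here, since it requires the F distribution with k and n-k-1 degrees of freedom.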
The Coefficient of Determination
• $$r^2 = \frac{SSR}{SST} = \frac{\text{Regression Sum of Squares}}{\text{Total Sum of Squares}}$$
• Measures the proportion of variation in Y that is explained by the independent variable X in the regression model
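As a worked instance of the definition, the sums of squares reported in the Excel ANOVA output for the produce stores give the following value (rounded to four decimals):

```latex
r^2 = \frac{SSR}{SST} = \frac{30380456.12}{32251655.71} \approx 0.9420,
\qquad
1 - r^2 = \frac{SSE}{SST} = \frac{1871199.595}{32251655.71} \approx 0.0580
```

So about 94% of the variation in annual sales is explained by store size, and about 6% is left unexplained.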
Venn Diagrams and
Explanatory Power of Regression
[Figure: Venn diagram of the overlapping Sales and Sizes circles]
$$r^2 = \frac{SSR}{SSR + SSE}$$
Coefficients of Determination ($r^2$) and Correlation (r)
[Figure: four scatter plots of Y against X, each with the fitted line $\hat{Y}_i = b_0 + b_1 X_i$. Panels: $r^2 = 1$, $r = +1$; $r^2 = 1$, $r = -1$; $r^2 = .81$, $r = +0.9$; $r^2 = 0$, $r = 0$]
Standard Error of Estimate
• $$S_{YX} = \sqrt{\frac{SSE}{n-2}} = \sqrt{\frac{\sum_{i=1}^{n} \left(Y_i - \hat{Y}_i\right)^2}{n-2}}$$
• Measures the standard deviation (variation) of the Y values around the regression equation
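A minimal sketch of the standard error of estimate for the produce-store example, taking SSE and n from the Excel ANOVA output as given (variable names are illustrative):

```python
import math

sse = 1871199.595  # error sum of squares from the Excel ANOVA output
n = 7              # number of stores

# Standard error of estimate: square root of SSE over its degrees of freedom.
s_yx = math.sqrt(sse / (n - 2))

print(f"S_YX = {s_yx:.4f}")
```

Since annual sales are in thousands of dollars, the result is in the same units and should match the Standard Error entry in Excel's regression statistics.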
Measures of Variation:
Produce Store Example
Excel Output for Produce Stores
Regression Statistics
Multiple R          0.9705572
R Square            0.94198129   (r2 = .94)
Adjusted R Square   0.93037754
Standard Error      611.751517   (Syx)
Observations        7            (n)
94% of the variation in annual sales can be explained by the variability in the size of the store as measured by square footage.
Linear Regression Assumptions
• Normality
– Y values are normally distributed for each X
– Probability distribution of error is normal
• Homoscedasticity (Constant Variance)
• Independence of Errors
Consequences of Violation
of the Assumptions
• Violation of the Assumptions
– Non-normality (error not normally distributed)
– Heteroscedasticity (variance not constant)
• Usually happens in cross-sectional data
– Autocorrelation (errors are not independent)
• Usually happens in time-series data
• Consequences of Any Violation of the Assumptions
– Predictions and estimations obtained from the sample regression line will
not be accurate
– Hypothesis testing results will not be reliable
• It is Important to Verify the Assumptions
Purpose of Correlation Analysis
(continued)
• Population Correlation Coefficient $\rho$ (Rho) is Used to Measure the Strength between the Variables
$$\rho = \frac{\sigma_{XY}}{\sigma_X \sigma_Y}$$
Purpose of Correlation Analysis
(continued)
• Sample Correlation Coefficient r is an Estimate of $\rho$ and is Used to Measure the Strength of the Linear Relationship in the Sample Observations
$$r = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2 \, \sum_{i=1}^{n} (Y_i - \bar{Y})^2}}$$
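The sample correlation coefficient can be computed for the produce-store data straight from this formula. A minimal sketch with illustrative variable names:

```python
import math

# Produce-store data: size in square feet, annual sales in $1000.
square_feet = [1726, 1542, 2816, 5555, 1292, 2208, 1313]
annual_sales = [3681, 3395, 6653, 9543, 3318, 5563, 3760]

n = len(square_feet)
x_bar = sum(square_feet) / n
y_bar = sum(annual_sales) / n

sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(square_feet, annual_sales))
sxx = sum((x - x_bar) ** 2 for x in square_feet)
syy = sum((y - y_bar) ** 2 for y in annual_sales)

r = sxy / math.sqrt(sxx * syy)  # sample correlation coefficient

print(f"r = {r:.4f}, r^2 = {r ** 2:.4f}")
```

The value of r should agree with Multiple R in the Excel regression statistics, and its square with R Square, which is the coefficient of determination.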
Sample Observations from Various r Values
[Figure: five scatter plots of Y against X with r = -1, r = -.6, r = 0, r = .6 and r = 1]
Features of $\rho$ and r
• Unit Free
• Range between -1 and 1
• The Closer to -1, the Stronger the Negative Linear Relationship
• The Closer to 1, the Stronger the Positive Linear Relationship
• The Closer to 0, the Weaker the Linear Relationship
t Test for Correlation
• Hypotheses
– H0: $\rho = 0$ (no correlation)
– H1: $\rho \neq 0$ (correlation)
• Test Statistic
– $$t = \frac{r - \rho}{\sqrt{\dfrac{1 - r^2}{n - 2}}}$$
where
$$r = \sqrt{r^2} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2 \, \sum_{i=1}^{n} (Y_i - \bar{Y})^2}}$$
Example: Produce Stores
Is there any evidence of linear relationship between annual sales of a store and its square footage at .05 level of significance?
From Excel Printout
Regression Statistics
Multiple R          0.9705572    (r)
R Square            0.94198129
Adjusted R Square   0.93037754
Standard Error      611.751517
Observations        7
H0: $\rho = 0$ (no association)
H1: $\rho \neq 0$ (association)
$\alpha = .05$
df = 7 - 2 = 5
Example: Produce Stores Solution
$$t = \frac{r - \rho}{\sqrt{\dfrac{1 - r^2}{n - 2}}} = \frac{.9706}{\sqrt{\dfrac{1 - .9420}{5}}} = 9.0099$$
Critical Value(s): -2.5706 and 2.5706 (rejection region of area .025 in each tail)
Decision:
Reject H0.
Conclusion:
There is evidence of a linear relationship at 5% level of significance.
The value of the t statistic is exactly the same as the t statistic value for test on the slope coefficient.
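The test can be reproduced numerically. This is a minimal sketch that takes r and n from the Excel printout and the two-tailed critical value 2.5706 from the slide as given; it does not compute the critical value from the t distribution.

```python
import math

r = 0.9705572        # Multiple R from the Excel printout
n = 7                # number of stores
t_critical = 2.5706  # two-tailed critical value at alpha = .05, df = 5 (from the slide)

# Test statistic under H0: rho = 0.
t_stat = r / math.sqrt((1 - r ** 2) / (n - 2))

reject_h0 = abs(t_stat) > t_critical

print(f"t = {t_stat:.4f}, reject H0: {reject_h0}")
print(f"t^2 = {t_stat ** 2:.4f}")
```

Squaring the t statistic gives approximately the F value in the ANOVA output, which reflects that in simple linear regression the tests on the correlation, on the slope and the overall F test reach the same conclusion.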