Regression Analysis - Data Analysis and Modeling for Public Affairs

Lecture on Correlation and Regression Analyses
REVIEW - Variable
A variable is a characteristic that changes or varies over time or across different individuals or objects under consideration.
Broad Classification of Variables:
- Quantitative
  - Discrete
  - Continuous
- Qualitative
Types of Variable
Qualitative
- assumes values that are not numerical but can be categorized
- categories may be identified either by non-numerical descriptions or by numeric codes
Types of Variable
Quantitative
- indicates the quantity or amount of a characteristic
- data are always numeric
- can be discrete or continuous
Types of Quantitative Variables
Discrete – a variable with a finite or countable number of possible values
Continuous – a variable that assumes any value in a given interval
Levels/Scales of Measurement
Data may be classified into four hierarchical levels of measurement:
- Nominal
- Ordinal
- Interval
- Ratio
Note: The type of statistical analysis that is appropriate for a particular variable depends on its level of measurement.
NOMINAL SCALE
- Data collected are labels, names, or categories.
- Frequencies or counts of observations belonging to the same category can be obtained.
- It is the lowest level of measurement.
ORDINAL SCALE
- Data collected are labels with implied ordering.
- The difference between two data labels is meaningless.
INTERVAL SCALE
- Data can be ordered or ranked.
- The difference between two data values is meaningful.
- Data at this level may lack an absolute zero point.
RATIO SCALE
- Data have all the properties of the interval scale.
- The number zero indicates the absence of the characteristic being measured.
- It is the highest level of measurement.
Learning Points – PART II
1. What is a correlation analysis?
2. What is a regression analysis?
3. When do we use correlation analysis?
4. When do we use regression analysis?
5. How do we compare regression versus correlation analysis?
CORRELATION ANALYSIS
A statistical technique used to determine the strength of the relationship between two variables, X and Y.
It provides a measure of the strength of the linear relationship between two variables measured on at least an interval scale.
ILLUSTRATION
The UP Admissions Office may be interested in the relationship between the UPCAT scores in Math and in Reading Comprehension of UPCAT qualifiers.

ILLUSTRATION
A social scientist might be concerned with how a city's crime rate is related to its unemployment rate.

ILLUSTRATION
A nutritionist might try to relate the quantity of carbohydrates consumed in the diet to the amount of sugar in the blood of diabetic individuals.
PEARSON'S CORRELATION COEFFICIENT, ρ

    ρ = s_XY / (s_X s_Y),    -1 ≤ ρ ≤ 1

where
    s_XY = covariance between X and Y
         = (1/N) Σ_{i=1}^{N} (X_i - X̄)(Y_i - Ȳ)
    s_X = standard deviation of the X values
    s_Y = standard deviation of the Y values
    N = number of paired observations in the population
PEARSON'S CORRELATION COEFFICIENT, ρ
- If X and Y increase (decrease) together, ρ > 0.
  [Scatter plot: Y rises as X increases]
- If X increases (decreases) while Y decreases (increases), ρ < 0.
  [Scatter plot: Y falls as X increases]
- If X and Y have no linear relationship, ρ = 0.
  [Scatter plot: no pattern]
SAMPLE CORRELATION COEFFICIENT, r

    r = s_XY / (s_X s_Y),    -1 ≤ r ≤ 1

where
    s_XY = Σ_{i=1}^{n} (X_i - X̄)(Y_i - Ȳ) / (n - 1)
         = [ Σ_{i=1}^{n} X_i Y_i - (Σ_{i=1}^{n} X_i)(Σ_{i=1}^{n} Y_i)/n ] / (n - 1)
         = sample covariance of the X and Y values
    s_X = sample standard deviation of the X values
    s_Y = sample standard deviation of the Y values
    n = sample size
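As a quick check on the formulas above, here is a minimal Python sketch (the function name is my own) that computes the sample covariance and correlation coefficient directly from the definitions:

```python
import math

def sample_corr(x, y):
    """Sample correlation r = s_XY / (s_X * s_Y), with n - 1 in each estimate."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
    s_x = math.sqrt(sum((xi - x_bar) ** 2 for xi in x) / (n - 1))
    s_y = math.sqrt(sum((yi - y_bar) ** 2 for yi in y) / (n - 1))
    return s_xy / (s_x * s_y)

# A perfectly linear toy example: r should be exactly 1.
print(sample_corr([1, 2, 3], [2, 4, 6]))  # → 1.0
```

For any data set the result stays between -1 and 1, as the bound above states.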
QUALITATIVE INTERPRETATION OF ρ AND r

Absolute Value of the        Strength of Linear
Correlation Coefficient      Relationship
0.0 – 0.2                    Very weak
0.2 – 0.4                    Weak
0.4 – 0.6                    Moderate
0.6 – 0.8                    Strong
0.8 – 1.0                    Very strong
EXAMPLE
It is of interest to study the relationship
between the number of hours spent
studying and the student’s grade in an
examination.
A random sample of twenty students is
selected and the data are given in the
following table.
Compute and interpret the sample
correlation coefficient.
EXAMPLE

Student   Hours Studied   Score (%)     Student   Hours Studied   Score (%)
1         3               71            11        4               80
2         5               90            12        3               60
3         2               83            13        1               63
4         3               70            14        0               49
5         4               93            15        3               80
6         2               50            16        1               61
7         3               70            17        1               63
8         4               90            18        2               50
9         3               76            19        3               60
10        4               80            20        1               65
SCATTER PLOT
[Scatter plot of Examination Score (40–100) against Number of Hours Spent Studying (0–6)]
    r = s_XY / (s_X s_Y),    -1 ≤ r ≤ 1

    s_X² = [ Σ X_i² - (Σ X_i)²/n ] / (n - 1) = 1.7263  →  s_X ≈ 1.3139

    s_Y ≈ 13.5320

    s_XY = [ Σ X_i Y_i - (Σ X_i)(Σ Y_i)/n ] / (n - 1)
         = [ 3901 - (52)(1404)/20 ] / 19 = 13.1895
Sample Correlation Coefficient

                          X          Y          XY
Total                    52       1404        3901
Standard Deviation   1.3139    13.5320
Variance             1.7263   183.1158
Covariance          13.1895

    r = s_XY / (s_X s_Y) = 13.1895 / [(1.3139)(13.5320)] = 0.7418

Interpretation:
There is a strong positive linear relationship between the number of hours a student spent studying for the exam and the student's exam score.
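The hand computation above can be replicated from the raw data; a short Python sketch using the 20 observations in the table:

```python
import math

hours  = [3, 5, 2, 3, 4, 2, 3, 4, 3, 4, 4, 3, 1, 0, 3, 1, 1, 2, 3, 1]
scores = [71, 90, 83, 70, 93, 50, 70, 90, 76, 80,
          80, 60, 63, 49, 80, 61, 63, 50, 60, 65]
n = len(hours)

sum_x, sum_y = sum(hours), sum(scores)
sum_xy = sum(x * y for x, y in zip(hours, scores))
sum_x2 = sum(x * x for x in hours)
sum_y2 = sum(y * y for y in scores)

s_xy = (sum_xy - sum_x * sum_y / n) / (n - 1)          # ≈ 13.1895
s_x = math.sqrt((sum_x2 - sum_x ** 2 / n) / (n - 1))   # ≈ 1.3139
s_y = math.sqrt((sum_y2 - sum_y ** 2 / n) / (n - 1))   # ≈ 13.5320
r = s_xy / (s_x * s_y)                                 # ≈ 0.7418
```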
TEST OF HYPOTHESIS ABOUT ρ
Ho: ρ = 0; There is no linear relationship between X and Y.
vs.
Ha: ρ ≠ 0; There is a linear relationship between X and Y.
or
Ha: ρ > 0; There is a positive linear relationship between X and Y.
or
Ha: ρ < 0; There is a negative linear relationship between X and Y.
TEST OF HYPOTHESIS ABOUT ρ
The standardized form of the test statistic is

    t_c = r √(n - 2) / √(1 - r²)

which follows the Student's t distribution with n - 2 df when the null hypothesis is TRUE. This is commonly referred to as the t-test for the correlation coefficient.
TEST OF HYPOTHESIS ABOUT ρ
With a given level of significance α:

Alternative Hypothesis          Decision Rule
Ha: ρ ≠ 0 (two-tailed test)     Reject Ho if |t_c| > t_tab = t_{α/2}(n-2). Fail to reject Ho, otherwise.
Ha: ρ > 0 (one-tailed test)     Reject Ho if t_c > t_tab = t_α(n-2). Fail to reject Ho, otherwise.
Ha: ρ < 0 (one-tailed test)     Reject Ho if t_c < t_tab = -t_α(n-2). Fail to reject Ho, otherwise.
EXAMPLE
In the study of the relationship between the number of hours spent studying and the student's grade in an examination: is there evidence to say that more hours spent studying are associated with higher exam scores at the 5% level of significance?
Test of Hypothesis
Ho: ρ = 0; There is no linear relationship between the number of hours a student spent studying for the exam and his exam score.
Ha: ρ > 0; There is a positive linear relationship between the number of hours a student spent studying for the exam and his exam score.
Test of Hypothesis
The test statistic is

    t_c = r √(n - 2) / √(1 - r²)  ~  Student's t with n - 2 df

Test procedure: One-tailed t-test for the correlation coefficient
Decision rule: Reject Ho if t_c > t_tab = t_.05(18) = 1.734. Fail to reject Ho, otherwise.
Test of Hypothesis

    t_c = r √(n - 2) / √(1 - r²) = 0.7418 √(20 - 2) / √(1 - 0.7418²) = 4.6929

Decision: Reject Ho.
Conclusion: At α = 5%, there is evidence to say that more hours spent studying are associated with higher exam scores.
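The test statistic above is simple enough to verify in a couple of lines of Python (a sketch; the critical value 1.734 is the tabled t_.05(18) quoted in the text):

```python
import math

r, n = 0.7418, 20
# t-test for the correlation coefficient, df = n - 2
t_c = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)  # ≈ 4.693
reject_h0 = t_c > 1.734  # one-tailed test at alpha = 0.05
```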
WORD OF CAUTION
Correlation is a measure of the strength of
linear relationship between two variables,
with no suggestion of “cause and effect”
or causal relationship.
A correlation coefficient equal to zero only
indicates lack of linear relationship and
does not discount the possibility that other
forms of relationship may exist.
REGRESSION ANALYSIS
A statistical technique used to study the functional relationship between variables, which allows predicting the value of one variable, say Y, given the value of another variable, say X.

- Y – dependent variable: a variable whose variation/value depends on that of another.
- X – independent variable: a variable whose variation/value does not depend on that of another.
ILLUSTRATION
The relationship between the number of hours spent studying and the student's exam score may be expressed in equation form. This equation may be used to predict the student's exam score knowing the number of hours the student spent studying.

ILLUSTRATION
A child's height is studied to see whether it is related to his father's height, such that some equation can be used to predict a child's height given his father's height.
Sales of a product may be related to the corresponding advertising expenditures.
SAMPLE REGRESSION MODEL

    Ŷ_i = b0 + b1 X_i

where
    b0 = estimated Y-intercept; the predicted value of Y when X = 0
    b1 = estimated slope of the line; measures the change in the predicted value of Y per unit change in X
ESTIMATORS

    b1 = s_XY / s_X²
    b0 = Ȳ - b1 X̄

where Ȳ = mean of the Y values and X̄ = mean of the X values.

    ŝ² = s²_{Y|X} = Σ_{i=1}^{n} (Y_i - Ŷ_i)² / (n - 2)
                  = (n - 1)(s_Y² - b1 s_XY) / (n - 2)

is the estimated common variance of the Y's.
EXAMPLE
In the previous example, we may want to predict the examination score of a student given the number of hours he spent studying.

    b1 = s_XY / s_X² = 13.1895 / 1.7263 = 7.6403
    b0 = Ȳ - b1 X̄ = 70.2 - (7.6403)(2.6) = 50.3352

Estimated regression line: Ŷ_i = 50.3352 + 7.6403 X_i
Predicted exam score for X_i = 2.5 is 69.44 ≈ 69.
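A short Python sketch of the same fit, starting from the summary quantities already computed for the 20 students:

```python
# Summary values from the correlation example
s_xy, s_x2 = 13.1895, 1.7263   # sample covariance and sample variance of X
x_bar, y_bar = 2.6, 70.2       # means of hours studied and exam scores

b1 = s_xy / s_x2               # slope, ≈ 7.6403
b0 = y_bar - b1 * x_bar        # intercept, ≈ 50.335

def predict(x):
    """Predicted exam score for x hours of study."""
    return b0 + b1 * x

print(predict(2.5))  # ≈ 69.44
```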
TEST OF HYPOTHESIS ABOUT b1
Ho: b1 = b1*
vs.
Ha: b1 ≠ b1*
or
Ha: b1 > b1*
or
Ha: b1 < b1*
where b1* is the hypothesized value of b1.
TEST OF HYPOTHESIS ABOUT b1
The standardized form of the test statistic is

    t_c = (b1 - b1*) / se(b1),    where se(b1) = s_{Y|X} / (s_X √(n - 1))

and it follows the Student's t distribution with n - 2 df when the null hypothesis is TRUE. This is commonly referred to as the t-test for a regression coefficient.
TEST OF HYPOTHESIS ABOUT b1
With a given level of significance α:

Alternative Hypothesis             Decision Rule
Ha: b1 ≠ b1* (two-tailed test)     Reject Ho if |t_c| > t_tab = t_{α/2}(n-2). Fail to reject Ho, otherwise.
Ha: b1 > b1* (one-tailed test)     Reject Ho if t_c > t_tab = t_α(n-2). Fail to reject Ho, otherwise.
Ha: b1 < b1* (one-tailed test)     Reject Ho if t_c < t_tab = -t_α(n-2). Fail to reject Ho, otherwise.
EXAMPLE
Using the previous example, test at α = 5% whether a student's examination score increases by more than 1 percentage point with each additional hour of study time.

Ho: b1 ≤ 1
Ha: b1 > 1

Test statistic:

    t_c = (b1 - b1*) / se(b1)  ~  Student's t with n - 2 df

Test procedure: One-tailed t-test for the regression coefficient
EXAMPLE
Decision rule: Reject Ho if t_c > t_tab = t_.05(18) = 1.734. Fail to reject Ho, otherwise.

Computations:

    s_{Y|X} = √[ (n - 1)(s_Y² - b1 s_XY) / (n - 2) ]
            = √[ (20 - 1)(183.1158 - (7.6403)(13.1895)) / (20 - 2) ] = 9.3230

    se(b1) = s_{Y|X} / (s_X √(n - 1)) = 9.3230 / (1.3139 √(20 - 1)) = 1.6279
EXAMPLE

    t_c = (b1 - b1*) / se(b1) = (7.6403 - 1) / 1.6279 = 4.0791

Decision: Since t_c = 4.0791 > t_tab = 1.734, we reject Ho.
Conclusion: At α = 5%, the student's exam score increases by more than 1 percentage point for each additional hour of study time.
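The slope test can be checked numerically; a sketch using the summary values quoted above:

```python
import math

n = 20
s_y2, b1, s_xy = 183.1158, 7.6403, 13.1895  # variance of Y, slope, covariance
s_x = 1.3139

# Standard error of the estimate, then of the slope
s_y_given_x = math.sqrt((n - 1) * (s_y2 - b1 * s_xy) / (n - 2))  # ≈ 9.323
se_b1 = s_y_given_x / (s_x * math.sqrt(n - 1))                   # ≈ 1.6279

# t-test of Ho: b1 <= 1 vs Ha: b1 > 1
t_c = (b1 - 1) / se_b1  # ≈ 4.079, compared against t_.05(18) = 1.734
```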
TEST OF HYPOTHESIS ABOUT b0
Ho: b0 = b0*, where b0* is the hypothesized value of b0
vs.
Ha: b0 ≠ b0*
or
Ha: b0 > b0*
or
Ha: b0 < b0*
TEST OF HYPOTHESIS ABOUT b0
The standardized form of the test statistic is

    t_c = (b0 - b0*) / se(b0),    where se(b0) = (s_{Y|X} / s_X) √[ Σ_{i=1}^{n} X_i² / (n(n - 1)) ]

and it follows the Student's t distribution with n - 2 df when the null hypothesis is TRUE. This is commonly referred to as the t-test for the regression constant.
TEST OF HYPOTHESIS ABOUT b0
With a given level of significance α:

Alternative Hypothesis             Decision Rule
Ha: b0 ≠ b0* (two-tailed test)     Reject Ho if |t_c| > t_tab = t_{α/2}(n-2). Fail to reject Ho, otherwise.
Ha: b0 > b0* (one-tailed test)     Reject Ho if t_c > t_tab = t_α(n-2). Fail to reject Ho, otherwise.
Ha: b0 < b0* (one-tailed test)     Reject Ho if t_c < t_tab = -t_α(n-2). Fail to reject Ho, otherwise.
EXAMPLE
At α = 5%, test whether the data indicate that the student will fail (a score of less than 60) if he did not study.

Ho: b0 ≥ 60
Ha: b0 < 60

Test statistic:

    t_c = (b0 - b0*) / se(b0)  ~  Student's t with n - 2 df

Test procedure: One-tailed t-test for the regression constant
EXAMPLE
Decision rule: Reject Ho if t_c < -t_.05(18) = -1.734. Fail to reject Ho, otherwise.

Computations:

    se(b0) = (s_{Y|X} / s_X) √[ Σ X_i² / (n(n - 1)) ]
           = (9.3230 / 1.3139) √[ 168 / (20(20 - 1)) ] = 4.7180

    t_c = (b0 - b0*) / se(b0) = (50.3352 - 60) / 4.7180 = -2.0485
EXAMPLE
Decision: Since t_c = -2.0485 < t_tab = -1.734, we reject Ho.
Conclusion: At α = 5%, the student will get a score of less than 60, i.e., the student will fail if he/she did not study for the examination.
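The intercept test can likewise be verified in a few lines (a sketch using the quantities computed earlier; Σ X_i² = 168 for the hours data):

```python
import math

n = 20
b0 = 50.3352
s_y_given_x, s_x = 9.3230, 1.3139
sum_x2 = 168  # sum of squared hours studied

# t-test of Ho: b0 >= 60 vs Ha: b0 < 60
se_b0 = (s_y_given_x / s_x) * math.sqrt(sum_x2 / (n * (n - 1)))  # ≈ 4.718
t_c = (b0 - 60) / se_b0                                          # ≈ -2.048
reject_h0 = t_c < -1.734  # one-tailed test at alpha = 0.05
```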
ADEQUACY OF THE MODEL
Coefficient of Determination (R²)
- the proportion of the total variation in Y that is explained by X, usually expressed in percent

    R² = r² × 100% = b1 (s_XY / s_Y²) × 100% = [ s_XY² / (s_X² s_Y²) ] × 100%
EXAMPLE

    R² = b1 (s_XY / s_Y²) × 100% = 7.6403 (13.1895 / 183.1158) × 100% = 55.03%

Interpretation:
Around 55% of the total variation in examination scores is explained by the number of hours spent studying. The remaining 45% is explained by other variables not in the model, or by the fact that the relationship is not exactly linear.
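The two expressions for R² given above are algebraically equivalent, which is easy to confirm numerically (a sketch with the summary values from this example):

```python
# Two equivalent routes to R² for the study-hours example
r = 0.7418
b1, s_xy, s_y2 = 7.6403, 13.1895, 183.1158

r_squared_a = r ** 2             # ≈ 0.5503, squaring the correlation
r_squared_b = b1 * s_xy / s_y2   # ≈ 0.5503, the slide's slope-based formula
```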
SUMMARY
1. Correlation analysis
2. Regression analysis
3. Application with computer output
4. Interpretation
Regression analysis
- assumes a cause-and-effect direction, where you can predict the value of one variable given the values of the other variable/s.

Correlation analysis
- measures the relationship between two variables, but without the causality clause.

Regression analysis in policy analysis is usually used to forecast certain events. For example, our trend line is an example of a regression analysis.
Illustrations:
Knowing the effect of TV spot advertising on the number of people visiting the Family Planning clinic would allow the population commission official to decide rationally whether or not to increase the amount to be spent on TV spot advertising. The officer would be able to predict how many people the commission would be able to attract to the Family Planning clinic if it increased the number of TV ads run.
(See series p.176)
The relationship between two variables (in our example, the number of TV ad runs and the number of people visiting the Family Planning clinic) can be summarized by a line. This is called the regression line. This is the line that we will use to predict the value of one variable, given the other.
Formula of the regression line:

    Y = a + bX + e

where:
    b = the slope of the line
    a = the Y-intercept, or the value of Y when X = 0
    e = the error term
Example:
Relationship between TV ads and the number of people visiting the family planning clinic:

Municipality   Number of TV ads (X)   Number of people visiting the clinic (Y)
1              7                      42
2              5                      32
3              1                      10
4              8                      40
5              10                     61
6              2                      8
7              6                      35
8              7                      39
9              8                      48
10             9                      51
11             5                      30
12             7                      45
13             8                      41
14             2                      7
15             6                      37
16             5                      33
    b = [ N ΣXY - (ΣX)(ΣY) ] / [ N ΣX² - (ΣX)² ]

    a = [ ΣY - b ΣX ] / N

    b = [ 16(3930) - (96)(559) ] / [ 16(676) - (96)² ] = 9216 / 1600 = 5.76

    a = [ 559 - (5.76)(96) ] / 16 ≈ 0.4
The equation of the line is

    Y = 0.4 + 5.76X

If X = 5, our predicted value for Y will be Y = 0.4 + 5.76(5) = 29.2.
If X = 7, our predicted value for Y will be Y = 0.4 + 5.76(7) = 40.7.
Interpretation:
An increase of one in the number of TV ad runs will generate a 5.76 increase in the number of people visiting the family planning clinic. So the family planning officer can now proceed with evaluating the cost-effectiveness of the program ads.
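The raw-sum formulas for b and a can be applied directly to the 16-municipality table; a Python sketch:

```python
# TV ads (X) and clinic visitors (Y) for the 16 municipalities
x = [7, 5, 1, 8, 10, 2, 6, 7, 8, 9, 5, 7, 8, 2, 6, 5]
y = [42, 32, 10, 40, 61, 8, 35, 39, 48, 51, 30, 45, 41, 7, 37, 33]
n = len(x)

sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_x2 = sum(xi * xi for xi in x)

b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)  # slope = 5.76
a = (sum_y - b * sum_x) / n                                   # intercept ≈ 0.38
```

The intercept comes out at about 0.38, which the slides round to 0.4.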
Coefficient of Determination
The coefficient of determination is the percent of the variation in Y explained or accounted for by the variability of X. It is derived by squaring R and multiplying by 100, and is expressed in percentage terms. Thus, if R = 0.9, the coefficient of determination will be 81%.

Formula:

    R = [ N ΣXY - (ΣX)(ΣY) ] / √{ [ N ΣX² - (ΣX)² ][ N ΣY² - (ΣY)² ] }
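Applying this formula to the TV-ads data gives an R close to the Multiple R in the Excel output below (which was computed on a revised 15-observation data set). A sketch:

```python
import math

x = [7, 5, 1, 8, 10, 2, 6, 7, 8, 9, 5, 7, 8, 2, 6, 5]
y = [42, 32, 10, 40, 61, 8, 35, 39, 48, 51, 30, 45, 41, 7, 37, 33]
n = len(x)

num = n * sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y)
den = math.sqrt((n * sum(xi * xi for xi in x) - sum(x) ** 2)
                * (n * sum(yi * yi for yi in y) - sum(y) ** 2))
R = num / den              # ≈ 0.97
coef_det = R ** 2 * 100    # coefficient of determination, ≈ 95 percent
```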
Hypothesis Testing for a and b
We use the t-statistic to test the hypothesis that a and b are significantly different from zero.
Excel analysis of the problem

SUMMARY OUTPUT (Revised Figure)

Regression Statistics
Multiple R           0.972499
R Square             0.945755
Adjusted R Square    0.941582
Standard Error       3.796237
Observations         15

ANOVA
              df    SS         MS         F          Significance F
Regression    1     3266.385   3266.385   226.6526   1.32E-09
Residual      13    187.3484   14.41141
Total         14    3453.733

(df: k, n-(k+1), n-1)

              Coefficients   Standard Error   t Stat     P-value    Lower 95%   Upper 95%
Intercept     0.373989       2.467574         0.151562   0.88186    -4.95688    5.704857
X             5.745957       0.381665         15.05499   1.32E-09   4.921421    6.570493
DUMMY VARIABLE
Represents a nominal or categorical variable in the regression model.
For example:

    Y = b0 + b1X1 + b2X2

where Y = scores, X1 = hours spent studying, and X2 = sex (M/F), taking a value of 1 if male and 0 otherwise.
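To make the dummy-variable idea concrete, here is a sketch with hypothetical data (the six observations below are invented for illustration, constructed so that the true coefficients are b0 = 50, b1 = 5, b2 = 10). It encodes sex as a 0/1 column and fits Y = b0 + b1·X1 + b2·X2 by ordinary least squares via the normal equations:

```python
def fit_ols(rows):
    """Solve the 3x3 normal equations (X'X) b = X'y by Gaussian elimination."""
    # Build X'X and X'y for the columns [1, x1, x2].
    xtx = [[0.0] * 3 for _ in range(3)]
    xty = [0.0] * 3
    for x1, x2, y in rows:
        v = (1.0, x1, x2)
        for i in range(3):
            xty[i] += v[i] * y
            for j in range(3):
                xtx[i][j] += v[i] * v[j]
    # Gaussian elimination with partial pivoting on the augmented matrix.
    a = [xtx[i] + [xty[i]] for i in range(3)]
    for col in range(3):
        piv = max(range(col, 3), key=lambda r: abs(a[r][col]))
        a[col], a[piv] = a[piv], a[col]
        for r in range(col + 1, 3):
            f = a[r][col] / a[col][col]
            for c in range(col, 4):
                a[r][c] -= f * a[col][c]
    # Back-substitution.
    b = [0.0] * 3
    for i in (2, 1, 0):
        b[i] = (a[i][3] - sum(a[i][j] * b[j] for j in range(i + 1, 3))) / a[i][i]
    return b

# (hours studied, sex dummy: 1 = male / 0 = female, exam score)
data = [(1, 1, 65), (2, 0, 60), (3, 1, 75), (4, 0, 70), (5, 1, 85), (2, 1, 70)]
b0, b1, b2 = fit_ols(data)
```

Here b2 is the estimated difference in score between males and females at the same number of study hours, which is exactly how a dummy-variable coefficient is read.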