
REGRESSION ANALYSIS NOTES (1)

REGRESSION ANALYSIS
ECONOMETRICS
Presented by: Japheth Mutinda Muthama
Presented to: Professor Namusonge
JKUAT - Kenya
MEANING OF REGRESSION:
The dictionary meaning of the word Regression is ‘Stepping back’ or ‘Going back’.
Regression is a measure of the average relationship between two or more variables in
terms of the original units of the data.
It attempts to establish the functional relationship between the variables and thereby
provide a mechanism for prediction or forecasting.
It describes the relationship between two (or more) variables.
Regression analysis uses data to identify relationships among variables by applying
regression models.
The relationship between the variables, e.g. X and Y, can be used to make predictions
about them.
The ‘independent’ variable X is usually called the regressor (there may be one
or more of these); the ‘dependent’ variable Y is the response variable.
[Diagram: regression relates the independent variable (X) to the dependent variable (Y).]
Regression is thus an explanation of causation.
If the independent variable(s) sufficiently explain the variation in the dependent
variable, the model can be used for prediction.
APPLICATION OF REGRESSION ANALYSIS IN RESEARCH
i. It helps in the formulation and determination of the functional relationship between two or more variables.
ii. It helps in establishing a cause and effect relationship between two variables in economics and business research.
iii. It helps in predicting and estimating the value of a dependent variable such as price, production, sales, etc.
iv. It helps to measure the variability or spread of values of a dependent variable with respect to the regression line.
USE OF REGRESSION IN ORGANIZATIONS
In the field of business, regression is widely used by businessmen in:
• Predicting future production
• Investment analysis
• Forecasting on sales etc.
It is also used in sociological studies and economic planning to find
projections of population, birth rates, and death rates.
So the success of a businessman depends on the correctness of the
various estimates that he is required to make.
METHODS OF STUDYING REGRESSION:
GRAPHICALLY: free hand curve, or least squares.
ALGEBRAICALLY: least squares, deviation method from arithmetic mean, or deviation method from assumed mean.

ALGEBRAIC METHODS
1. Least Squares Method-:
The regression equation of X on Y is:
X = a + bY
Where,
X=Dependent variable and Y=Independent variable
The regression equation of Y on X is:
Y = a+bX
Where,
Y=Dependent variable
X=Independent variable
[Figure: simple linear regression. The fitted line y = a + bX ± ε relates the independent variable (x) to the dependent variable (y); a is the y intercept, b = ∆y/∆x is the slope, and ε is the error term.]
The output of a regression is a function that predicts the dependent variable
based upon values of the independent variables.
Simple regression fits a straight line to the data.
The output of a simple regression is the coefficient β and the constant A.
The equation is then:
y = A + βx + ε
where ε is the residual error.
β is the per unit change in the dependent variable for each unit change in
the independent variable. Mathematically:
β = ∆y / ∆x
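As a minimal sketch of how A and β are obtained in practice (hypothetical data, assuming Python with numpy; not part of the original notes):

```python
import numpy as np

# Hypothetical sample data, for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Least-squares slope and intercept:
# beta = sum((x - xbar)(y - ybar)) / sum((x - xbar)^2), A = ybar - beta * xbar
beta = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
A = y.mean() - beta * x.mean()

residuals = y - (A + beta * x)  # the residual error epsilon for each observation
print(A, beta)
```

The same estimates come from np.polyfit(x, y, 1), which returns the slope and intercept of the best-fitting straight line.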
Multiple Linear Regression
More than one independent variable can be used to explain variance in the
dependent variable, as long as they are not linearly related.
A multiple regression takes the form:
y = A + β1X1 + β2X2 + … + βkXk + ε
where k is the number of variables, or parameters.
Multicollinearity
Multicollinearity is a condition in which at least 2 independent variables are
highly linearly correlated. It makes the estimated coefficients unstable and can make
the least-squares computation numerically ill-conditioned.
Example table of correlations:

        Y       X1      X2
Y       1.000
X1      0.802   1.000
X2      0.848   0.578   1.000
A correlations table can suggest which independent variables may be
significant. Generally, an ind. variable that has more than a .3 correlation
with the dependent variable and less than .7 with any other ind. variable
can be included as a possible predictor.
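A sketch of this screening step on hypothetical data (the values below are invented for illustration), using numpy's correlation-matrix routine:

```python
import numpy as np

# Hypothetical columns: the response y and two candidate predictors.
y  = np.array([10.0, 12.0, 15.0, 11.0, 18.0, 20.0, 17.0])
x1 = np.array([ 2.0,  3.0,  5.0,  2.0,  6.0,  7.0,  5.0])
x2 = np.array([ 1.0,  2.0,  2.0,  1.0,  3.0,  4.0,  3.0])

# Rows are variables, so this returns the 3x3 correlation table.
corr = np.corrcoef([y, x1, x2])
print(np.round(corr, 3))

# Screening rule from the notes: include x_i as a possible predictor if
# |corr(x_i, y)| > 0.3 and |corr(x_i, x_j)| < 0.7 for other predictors x_j.
```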
Nonlinear Regression
Nonlinear functions can also be fit as regressions. Common
choices include Power, Logarithmic, Exponential, and Logistic,
but any continuous function can be used.
Example 1-: From the following data obtain the regression equations
using the method of Least Squares.

X:  3  2  7  4  8
Y:  6  1  8  5  9
Solution-:
X    Y    XY    X²    Y²
3    6    18    9     36
2    1    2     4     1
7    8    56    49    64
4    5    20    16    25
8    9    72    64    81

ΣX = 24,  ΣY = 29,  ΣXY = 168,  ΣX² = 142,  ΣY² = 207
ΣY = na + bΣX
ΣXY = aΣX + bΣX²
Substituting the values from the table we get
29 = 5a + 24b …………………(i)
168 = 24a + 142b, which on dividing throughout by 2 gives
84 = 12a + 71b ………………..(ii)
Multiplying equation (i) by 12 and (ii) by 5:
348 = 60a + 288b ………………(iii)
420 = 60a + 355b ………………(iv)
Subtracting (iii) from (iv) gives 72 = 67b, so by solving equations (iii) and (iv) we get
a = 0.66 and b = 1.07
By putting the values of a and b in the regression equation of Y on X
we get
Y = 0.66 + 1.07X
Now, to find the regression equation of X on Y,
the two normal equations are:
ΣX = na + bΣY
ΣXY = aΣY + bΣY²
Substituting the values in the equations we get
24=5a+29b………………………(i)
168=29a+207b…………………..(ii)
Multiplying equation (i) by 29 and (ii) by 5 (giving 696 = 145a + 841b and
840 = 145a + 1035b, so that 144 = 194b), we get
a = 0.49 and b = 0.74
Substituting the values of a and b in the regression equation of X on Y:
X = 0.49 + 0.74Y
2. Deviation from the Arithmetic Mean Method-:
The calculations by the least squares method are quite cumbersome when
the values of X and Y are large, so the work can be simplified by using this
method.
The formulae for the calculation of the regression equations by this method are:
Regression Equation of X on Y-: (X − X̄) = bxy(Y − Ȳ)
Regression Equation of Y on X-: (Y − Ȳ) = byx(X − X̄)
where bxy and byx are the regression coefficients:
bxy = Σxy / Σy²   and   byx = Σxy / Σx²
(here x = X − X̄ and y = Y − Ȳ are deviations from the means).
Example 2-: From the previous data obtain the regression equations by
taking deviations from the actual means of the X and Y series.

X:  3  2  7  4  8
Y:  6  1  8  5  9
Solution-:
xX X
y  Y Y
x2
y2
xy
6
-1.8
0.2
3.24
0.04
-0.36
2
1
-2.8
-4.8
7.84
23.04
13.44
7
8
2.2
2.2
4.84
4.84
4.84
4
5
-0.8
-0.8
0.64
0.64
0.64
8
9
3.2
3.2
10.24
10.24
10.24
X
Y
3
 X  24 Y  29
x  0
 y 0
 x  26.8  y 2  38.8
2
 xy  28.8
Regression Equation of X on Y is
(X − X̄) = bxy(Y − Ȳ), where bxy = Σxy / Σy² = 28.8 / 38.8 = 0.74
X − 4.8 = 0.74(Y − 5.8)
X = 0.74Y + 0.49 ………….(I)
Regression Equation of Y on X is
(Y − Ȳ) = byx(X − X̄), where byx = Σxy / Σx² = 28.8 / 26.8 = 1.07
Y − 5.8 = 1.07(X − 4.8)
Y = 1.07X + 0.66 ………….(II)
It would be observed that these regression equations are the same as
those obtained by the direct method.
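Both equations can be checked numerically. A minimal sketch (assuming Python with numpy) using the data of the example:

```python
import numpy as np

X = np.array([3, 2, 7, 4, 8], dtype=float)
Y = np.array([6, 1, 8, 5, 9], dtype=float)

x, y = X - X.mean(), Y - Y.mean()      # deviations from the actual means
byx = np.sum(x * y) / np.sum(x ** 2)   # 28.8 / 26.8 ≈ 1.07
bxy = np.sum(x * y) / np.sum(y ** 2)   # 28.8 / 38.8 ≈ 0.74

a_yx = Y.mean() - byx * X.mean()  # ≈ 0.64 (0.66 in the notes, which round b to 1.07 first)
a_xy = X.mean() - bxy * Y.mean()  # ≈ 0.49
print(byx, a_yx, bxy, a_xy)
```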
3. Deviation from Assumed Mean Method-:
When the actual means of the X and Y variables are fractional, the
calculations can be simplified by taking the deviations from
assumed means.
The Regression Equation of X on Y-: (X − X̄) = bxy(Y − Ȳ)
The Regression Equation of Y on X-: (Y − Ȳ) = byx(X − X̄)
But here the values of bxy and byx will be calculated by the following formulae:
bxy = (N Σdxdy − Σdx Σdy) / (N Σdy² − (Σdy)²)
byx = (N Σdxdy − Σdx Σdy) / (N Σdx² − (Σdx)²)
Example-: From the data given in the previous example, calculate the
regression equations by assuming 7 as the mean of the X series and 6 as
the mean of the Y series.
Solution-:
X    Y    dx = X − 7    dx²    dy = Y − 6    dy²    dxdy
3    6    −4            16     0             0      0
2    1    −5            25     −5            25     +25
7    8    0             0      2             4      0
4    5    −3            9      −1            1      +3
8    9    1             1      3             9      +3

ΣX = 24,  ΣY = 29,  Σdx = −11,  Σdx² = 51,  Σdy = −1,  Σdy² = 39,  Σdxdy = 31
X̄ = ΣX / N = 24 / 5 = 4.8
Ȳ = ΣY / N = 29 / 5 = 5.8
The Regression Coefficient of X on Y-:
bxy = (N Σdxdy − Σdx Σdy) / (N Σdy² − (Σdy)²)
bxy = [5(31) − (−11)(−1)] / [5(39) − (−1)²]
bxy = (155 − 11) / (195 − 1)
bxy = 144 / 194
bxy = 0.74
The Regression equation of X on Y-:
(X − X̄) = bxy(Y − Ȳ)
(X − 4.8) = 0.74(Y − 5.8)
X = 0.74Y + 0.49
The Regression coefficient of Y on X-:
byx = (N Σdxdy − Σdx Σdy) / (N Σdx² − (Σdx)²)
byx = [5(31) − (−11)(−1)] / [5(51) − (−11)²]
byx = (155 − 11) / (255 − 121)
byx = 144 / 134
byx = 1.07
The Regression Equation of Y on X-:
(Y − Ȳ) = byx(X − X̄)
(Y − 5.8) = 1.07(X − 4.8)
Y = 1.07X + 0.66
It would be observed that these regression equations are the same as those
obtained by the least squares method and the deviation from arithmetic mean method.
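The assumed-mean shortcut can be verified the same way; a minimal sketch (assuming Python with numpy) with assumed means 7 and 6:

```python
import numpy as np

X = np.array([3, 2, 7, 4, 8], dtype=float)
Y = np.array([6, 1, 8, 5, 9], dtype=float)
N = len(X)

dx, dy = X - 7, Y - 6  # deviations from the assumed means

num = N * np.sum(dx * dy) - np.sum(dx) * np.sum(dy)  # 5(31) - (-11)(-1) = 144
bxy = num / (N * np.sum(dy ** 2) - np.sum(dy) ** 2)  # 144 / 194 ≈ 0.74
byx = num / (N * np.sum(dx ** 2) - np.sum(dx) ** 2)  # 144 / 134 ≈ 1.07
print(bxy, byx)
```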
SIMPLE REGRESSION
This assumes the model y = β0 + β1x + ε
Example:
Assume variables Y and X are explained by the following model:
Y = β0 + β1X + ε
where Y is called the dependent (or response) variable and X the
independent (or predictor, or explanatory) variable.
The two variables can be related through the model E(Y | X = x) = β0 + β1x
(the “population line”)
The interpretation is as follows:
β0 is the (unknown) intercept and β1 is the (unknown) slope, or
incremental change in Y per unit change in X.
β0 and β1 are not known exactly, but are estimated from sample data,
and their estimates can be denoted b0 and b1.
Note that the actual value of σ, the standard deviation of the error term ε, is usually not known.
The two regression coefficients are called the slope and intercept.
Their actual values are also unknown and should always be estimated using
the empirical data at hand.
MULTIVARIATE (LINEAR) REGRESSION
This is a regression model with multiple independent variables
Here, there are independent (regressor) variables x1, x2, …, xn with only one
dependent (response) variable y.
The model therefore assumes the following format:
yi = β0 + β1x1i + β2x2i + …… + βnxni + εi
where the subscripts 1, 2, …, n label the variables: the first index labels the
variable and the second the observation.
NB: The exact values of β and ε are, and will always remain, unknown.
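A sketch of how such a model is fitted in practice, on hypothetical data with two regressors, using numpy's least-squares solver:

```python
import numpy as np

# Hypothetical observations of one response and two regressors.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
y  = np.array([4.1, 4.9, 9.2, 9.8, 15.1, 14.8])

# Design matrix: a column of ones for beta0, then one column per regressor.
A = np.column_stack([np.ones_like(x1), x1, x2])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)  # [b0, b1, b2], estimates of the betas
print(coef)
```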
Polynomial Regression
This is a special case of multivariate regression, with only one independent
variable
x, but an x-y relationship which is clearly nonlinear (at the same time, there
is no ‘physical’ model to rely on).
y = β0 + β1x + β2x² + β3x³ + …… + βnxⁿ + ε
Effectively, this is the same as having a multivariate model with x1 ≡ x, x2 ≡ x²,
x3 ≡ x³, and so on.
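A sketch of this substitution on hypothetical data: a degree-2 polynomial fitted with the same least-squares machinery, using columns x and x² in place of x1 and x2:

```python
import numpy as np

# Hypothetical data with a clearly nonlinear x-y relationship.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 4.8, 10.1, 17.2, 26.5])

A = np.column_stack([np.ones_like(x), x, x ** 2])  # columns 1, x, x^2
coef, *_ = np.linalg.lstsq(A, y, rcond=None)       # [b0, b1, b2]
print(coef)
```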
NONLINEAR REGRESSION
This is a model with one independent variable (the results can be easily
extended to several) and ‘n’ unknown parameters, which we will call b1,
b2, ... bn:
y = f (x, b) + ε
where f (x, b) is a specific (given) function of the independent variable and
the ‘n’ parameters.
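A sketch using scipy's curve_fit on hypothetical data, taking f(x, b) to be an exponential with two parameters (any other given continuous function could be substituted):

```python
import numpy as np
from scipy.optimize import curve_fit

def f(x, b1, b2):
    """The specific given function of x and the unknown parameters."""
    return b1 * np.exp(b2 * x)

# Hypothetical data roughly following exponential growth.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.6, 7.9, 19.5, 55.0])

b, _ = curve_fit(f, x, y, p0=[1.0, 1.0])  # iterative nonlinear least squares
print(b)  # estimates of b1 and b2
```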
Types of Lines
Scatter plot
[Scatter plot: Percent of Population with Bachelor's Degree by Personal Income Per Capita. Y axis: Personal Income Per Capita, current dollars, 1999 (20000–40000); X axis: Percent of Population 25 years and Over with Bachelor's Degree or More, March 2000 estimates (15.0–35.0).]
• This is a linear relationship.
• It is a positive relationship.
• As the population with BA's increases, so does the personal income per capita.
Regression Line
[The same scatter plot with the fitted regression line; R Sq Linear = 0.542.]
• The regression line is the best straight-line description of the plotted points, and you can use it to describe the association between the variables.
• If all the points fall exactly on the line, then the error is 0 and you have a perfect relationship.
Things to note
Regression focuses on association, not causation.
Association is a necessary prerequisite for inferring causation, but also:
1. The independent variable must precede the dependent variable.
2. The two variables must be in line with a given theory.
3. Competing independent variables must be eliminated.
Regression Table
[Scatter plot: Personal Income Per Capita, current dollars, 1999 against Percent of Population 25 years and Over with Bachelor's Degree or More, March 2000 estimates; R Sq Linear = 0.542.]
• The regression coefficient is not a good indicator for the strength of the relationship.
• Two scatter plots with very different dispersions could produce the same regression line.
[Scatter plot: Personal Income Per Capita, current dollars, 1999 against Population Per Square Mile (0.00–1200.00); R Sq Linear = 0.463.]
Regression coefficient
The regression coefficient is the slope of the regression line and will tell you:
• What the nature of the relationship between the variables is.
• How much change in the independent variable is associated with the change in the dependent variable.
• The larger the regression coefficient, the more the change.
Pearson’s r
• To determine strength you look at how closely the dots are clustered
around the line. The more tightly the cases are clustered, the
stronger the relationship, while the more distant, the weaker.
• Pearson’s r ranges from −1 to +1, with 0 indicating no linear
relationship at all.
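A sketch of Pearson's r computed from its definition, on hypothetical paired observations:

```python
import numpy as np

# Hypothetical paired data, for illustration only.
x = np.array([15.0, 20.0, 25.0, 30.0, 35.0])
y = np.array([21000.0, 24000.0, 27500.0, 33000.0, 36500.0])

# r = sum((x - xbar)(y - ybar)) / sqrt(sum((x - xbar)^2) * sum((y - ybar)^2))
dx, dy = x - x.mean(), y - y.mean()
r = np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2))
print(r)  # falls between -1 and +1; 0 means no linear relationship
```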
Reading the tables
Model Summary

Model   R        R Square   Adjusted R Square   Std. Error of the Estimate
1       .736(a)  .542       .532                2760.003

a. Predictors: (Constant), Percent of Population 25 years and Over with Bachelor's Degree or More, March 2000 estimates
•When you run a regression analysis in SPSS you get 3 tables. Each
tells you something about the relationship.
•The first is the model summary.
•The R is the Pearson Product Moment Correlation Coefficient.
•In this case R is .736.
•R is the square root of R-Square and is the correlation between
the observed and predicted values of the dependent variable.
R-Square
•R-Square is the proportion of variance in the dependent
variable (income per capita) which can be predicted from the
independent variable (level of education).
•This value indicates that 54.2% of the variance in income can be
predicted from the variable education. Note that this is an
overall measure of the strength of association, and does not
reflect the extent to which any particular independent variable is
associated with the dependent variable.
•R-Square is also called the coefficient of determination.
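A sketch of R-Square computed from its definition, one minus the ratio of the residual sum of squares to the total sum of squares (the observed and predicted values below are hypothetical):

```python
import numpy as np

def r_squared(y, y_hat):
    """Proportion of variance in y predictable from the fitted values y_hat."""
    ss_res = np.sum((y - y_hat) ** 2)     # residual sum of squares
    ss_tot = np.sum((y - y.mean()) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot

y     = np.array([22000.0, 25000.0, 30000.0, 28000.0, 35000.0])
y_hat = np.array([23000.0, 24500.0, 29000.0, 29500.0, 34000.0])
print(r_squared(y, y_hat))
```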
Adjusted R-square
•As predictors are added to the model, each predictor will explain some of the
variance in the dependent variable simply due to chance.
•One could continue to add predictors to the model which would continue to
improve the ability of the predictors to explain the dependent variable, although
some of this increase in R-square would be simply due to chance variation in that
particular sample.
•The adjusted R-square attempts to yield a more honest value to estimate the
R-square for the population. The value of R-square was .542, while the value of
adjusted R-square was .532. There isn't much difference because we are dealing
with only one predictor.
•When the number of observations is small and the number of predictors is large,
there will be a much greater difference between R-square and adjusted R-square.
•By contrast, when the number of observations is very large compared to the
number of predictors, the value of R-square and adjusted R-square will be much
closer.
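The adjustment itself is a simple formula. A sketch reproducing the values above, assuming n = 50 observations and k = 1 predictor (consistent with the 48 residual degrees of freedom reported in the ANOVA table that follows):

```python
def adjusted_r_squared(r2, n, k):
    """Adjusted R-square for n observations and k predictors."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

print(adjusted_r_squared(0.542, 50, 1))  # ≈ 0.532, matching the model summary
```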
ANOVA
ANOVA(b)

Model 1       Sum of Squares   df   Mean Square    F        Sig.
Regression    4.32E+08         1    432493775.8    56.775   .000(a)
Residual      3.66E+08         48   7617618.586
Total         7.98E+08         49

a. Predictors: (Constant), Percent of Population 25 years and Over with Bachelor's Degree or More, March 2000 estimates
b. Dependent Variable: Personal Income Per Capita, current dollars, 1999
•The p-value associated with this F value is very small (0.0000).
•These values are used to answer the question "Do the independent variables
reliably predict the dependent variable?".
•The p-value is compared to your alpha level (typically 0.05) and, if smaller,
you can conclude "Yes, the independent variables reliably predict the
dependent variable".
•If the p-value were greater than 0.05, you would say that the group of
independent variables does not show a statistically significant relationship
with the dependent variable, or that the group of independent variables does
not reliably predict the dependent variable.
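A sketch of how this p-value follows from the F statistic and its degrees of freedom, assuming scipy is available:

```python
from scipy.stats import f

# Values from the ANOVA table: F = 56.775 with 1 and 48 degrees of freedom.
p_value = f.sf(56.775, 1, 48)  # P(F >= observed), the upper-tail probability
print(p_value)  # a very small number, consistent with the reported Sig. = .000
```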
Part of the Regression Equation
• b represents the slope of the line
• It is calculated by dividing the change in the dependent variable by the change in
the independent variable.
• The difference between the actual value of Y and the calculated amount is called
the residual.
• The residual represents how much error there is in the prediction of the regression
equation for the y value of any individual case as a function of X, as sketched below.
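A closing sketch on hypothetical data: the slope b, the fitted y values, and the residual for each individual case:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # hypothetical data
y = np.array([2.0, 4.1, 5.9, 8.2, 9.9])

# b: change in the dependent variable per unit change in the independent one.
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()

y_hat = a + b * x        # values predicted by the regression equation
residuals = y - y_hat    # actual Y minus calculated amount, for each case
print(b, residuals)
```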