Regression 2011

Regression and correlation analysis
(RaKA)
Investigating the relationships between statistical characteristics:
• Investigating the relationship between qualitative characteristics (e.g. between attributes A and B), called measurement of association
• Investigating the relationship between quantitative characteristics – regression and correlation analysis
Regression and correlation analysis:
examining causal dependency,
exploring the relationship between cause and effect
• When one or more causes (attributes, independent variables) produce a resulting effect – the dependent variable:

Y = f(X_1, X_2, ..., X_k, B_0, B_1, ..., B_p) + e

where Y is the dependent variable (the effect), X_1, ..., X_k are the independent variables (the causes), B_0, B_1, ..., B_p are the unknown parameters of the functional relationship, and e stands for random, unspecified effects.
Example of false correlation
One of the famous spurious correlations: if skirt length gets shorter, stock prices go up.
Apart from the fact that it does not always hold, such a relationship would be a false, or spurious, correlation.
Examples of statistical (loose) dependence
• Examining how the consumption of pork depends on income, the prices of pork, beef and poultry, tradition, and other unspecified or random effects.
• Examining the dependence of GNP on labour and capital...
• Investigating whether the nutrition of the population depends on the degree of economic development of the country
The opposite of statistical dependence is functional dependence

Y = f(X_1, X_2, ..., X_k, B_0, B_1, ..., B_p)

where the dependent variable is uniquely determined by the functional relationship.
Examples come from physics and chemistry – this kind of relationship is not the subject of statistical investigation.
Regression and correlation analysis (RaKA)
• Two basic tasks of RaKA:
• Regression
a) find a functional relationship by which the dependent variable changes with the change of the independent variables – find a suitable regression function (line);
b) estimate the parameters of the regression function.
• Correlation – measure the strength of the examined dependence (relationship).
Illustration of the correlation field in two
cases (scatter plot)
[Figure: two scatter plots of y against x]
According to the number of independent variables we distinguish:
• Simple dependence, when we consider only one independent variable X and investigate the relationship between Y and X.
• Multiple dependence, when we consider at least two independent variables X_1, X_2, ..., X_k, for k ≥ 2.
Simple regression and correlation analysis
• Consider statistical variables X and Y which are, in the population, in the linear relationship

Y = B_0 + B_1·X + e

The point estimate of the regression function is a straight line y_j = b_0 + b_1·x_j + e_j, with coefficients calculated from the sample data.
Which method to use???
The least squares method (LSM)

F(b_0, b_1, ..., b_p) = Σ_{j=1}^{n} (y_j − y'_j)² = MIN

∂F(b_0, b_1, ..., b_p) / ∂b_i = 0,  i = 0, 1, 2, ..., p

We get a set of p+1 equations with p+1 unknown parameters => the ordinary least squares method (OLS).
From y_j = b_0 + b_1·x_j + e_j we can write y_j = y'_j + e_j, and therefore e_j = y_j − y'_j.

Principle of the LSM:

Σ_{j=1}^{n} (y_j − y'_j)² = MIN

where e_j = y_j − y'_j and (e_j)² = (y_j − y'_j)².
It can be proved that the coefficients b_0, b_1, ..., b_p determined by OLS are the "best estimates" of the parameters B_0, B_1, ..., B_p if the random errors meet the assumptions:
• E(e_j) = 0,
• D(e_j) = E(e_j²) = σ²,
• E(e_j1·e_j2) = 0 for each j1 ≠ j2
Verbal formulation: random errors are required to have zero mean and constant variance, and should be independent.
Coefficients of the simple regression function can be
derived:
∂F(b_0, b_1)/∂b_i = ∂/∂b_i Σ_{j=1}^{n} (y_j − y'_j)² = ∂/∂b_i Σ_{j=1}^{n} (y_j − b_0 − b_1·x_j)² = 0

∂F(b_0, b_1)/∂b_0 = Σ_{j=1}^{n} 2·(y_j − b_0 − b_1·x_j)·(−1) = 0

∂F(b_0, b_1)/∂b_1 = Σ_{j=1}^{n} 2·(y_j − b_0 − b_1·x_j)·(−x_j) = 0
After transformation we get two normal
equations with two unknown parameters:
Σ_{j=1}^{n} y_j = n·b_0 + b_1·Σ_{j=1}^{n} x_j

Σ_{j=1}^{n} x_j·y_j = b_0·Σ_{j=1}^{n} x_j + b_1·Σ_{j=1}^{n} x_j²

The system of equations can be solved by the elimination method or by using determinants. We get the coefficients b_0 and b_1.
The procedure for calculating the coefficients of the LRF (linear regression function)
x_j    | y_j    | x_j·y_j    | x_j²
x_1    | y_1    | x_1·y_1    | x_1²
x_2    | y_2    | x_2·y_2    | x_2²
...    | ...    | ...        | ...
x_n    | y_n    | x_n·y_n    | x_n²
Σ x_j  | Σ y_j  | Σ x_j·y_j  | Σ x_j²
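To make the tabular procedure concrete, here is a minimal Python sketch (the sample data and variable names are made up for illustration, not taken from the lecture) that accumulates the same four column sums and solves the two normal equations for b_0 and b_1:

```python
# Illustrative sketch: simple OLS coefficients from the column sums above.
import numpy as np

# hypothetical sample data (x_j, y_j)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

n = len(x)
sum_x, sum_y = x.sum(), y.sum()
sum_xy, sum_x2 = (x * y).sum(), (x ** 2).sum()

# normal equations:
#   sum(y)   = n*b0      + b1*sum(x)
#   sum(x*y) = b0*sum(x) + b1*sum(x^2)
b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
b0 = (sum_y - b1 * sum_x) / n
print(b0, b1)
```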
Interpretation of simple linear regression
coefficients
• b_0 ... intercept – the expected value of the dependent variable when the independent variable is equal to zero
• b_1 ... regression coefficient – expresses the change in the dependent variable when the independent variable changes by one unit
• if b_1 > 0 ... positive correlation (dependence)
• if b_1 < 0 ... negative correlation (dependence)
Properties of the least squares method:

Σ_{j=1}^{n} (y_j − y'_j)² = min

Σ_{j=1}^{n} (y_j − y'_j) = 0

The regression function passes through the point (x̄, ȳ).
When can OLS be applied?
• If the regression function is linear
• Linear in parameters (LiP)
• Or if we can transform the regression function to be linear in parameters
• Consider in which of the following regression functions OLS can be used
Some types of simple regression function:
y'_j = b_0 + b_1/x_j

y'_j = b_0 + b_1·log x_j

y'_j = b_0 + b_1·x_j + b_2·x_j²

y'_j = b_0·b_1^(x_j)

y'_j = b_0·x_j^(b_1)

y'_j = b_0 + b_1·b_2^(x_j)
Examples from micro- and macroeconomics
• Phillips curve ????
• Cobb-Douglas production curve
• Engel curves
• Curve of economic growth
• Any other? ...
Examining the consumption of selected commodities (dependence on the level of GNP)
[Figure 2. Consumption of energy of animal origin, in kcal per capita per day (y-axis, 0–1600), against GNP in US$ per capita per year (x-axis, 0–40 000); separate series for developed countries and developing countries.]
Comparison of two cases of correlation
Which correlation is closer?
[Figure: two scatter plots of y against x]
Confidence interval for linear regression
In addition to point estimates of the parameters of linear regression functions, interval estimates of the parameters, called confidence intervals, are also often calculated. Confidence intervals can be computed from the standard deviations of the parameters and the residual variance. The residual variance, if all the conditions of the classical linear model are satisfied, is an unbiased estimate of the parameter σ² and is calculated according to the equation

s²_rez = Σ_{j=1}^{n} (y_j − y'_j)² / (n − p),  where p = k + 1
Interval estimate of the parameters of the regression line
Assuming that the assumptions of the classical linear model hold for y'_j = b_0 + b_1·x_j, the variable

t_i = (b_i − β_i) / s_(b_i)

has a t distribution with n − p degrees of freedom. For the chosen confidence level 1 − α the confidence interval for the parameter β_0 is given by the relationship

P(b_0 − t_α·s_(b_0) ≤ β_0 ≤ b_0 + t_α·s_(b_0)) = 1 − α
And for the parameter β_1, with

s_(b_1) = s_r · sqrt( 1 / Σ_{j} (x_j − x̄)² )

P(b_1 − t_α·s_(b_1) ≤ β_1 ≤ b_1 + t_α·s_(b_1)) = 1 − α

Analogously, the confidence interval for the regression line is constructed as

P(y'_j − t_α·s_(y'_j) ≤ Y_j ≤ y'_j + t_α·s_(y'_j)) = 1 − α

where t_α is the quantile of the t distribution with n − p (for the regression line, n − 2) degrees of freedom.
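As a rough illustration of the interval estimates above, the following Python sketch uses made-up data; the standard error of b_0 uses the usual textbook formula, which is not shown on the slides:

```python
# Sketch: confidence intervals for b0 and b1 (illustrative data).
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n, p = len(x), 2                       # p = number of parameters (b0, b1)

Sxx = ((x - x.mean()) ** 2).sum()
b1 = ((x - x.mean()) * (y - y.mean())).sum() / Sxx
b0 = y.mean() - b1 * x.mean()
resid = y - (b0 + b1 * x)

s_r = np.sqrt((resid ** 2).sum() / (n - p))   # residual standard deviation
s_b1 = s_r * np.sqrt(1.0 / Sxx)               # slope standard error (as on the slide)
s_b0 = s_r * np.sqrt(1.0 / n + x.mean() ** 2 / Sxx)  # standard textbook formula

t_crit = stats.t.ppf(1 - 0.05 / 2, df=n - p)  # two-sided, 1 - alpha = 0.95
print("b0:", b0 - t_crit * s_b0, b0 + t_crit * s_b0)
print("b1:", b1 - t_crit * s_b1, b1 + t_crit * s_b1)
```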
Role of the correlation
• Examine the tightness – strength – of the dependence
• We use various correlation indices
• They should be bounded in an interval
• and within that interval increase with the strength of the dependence
Correlation analysis provides methods and techniques which are used for verifying the explanatory ability of quantified regression models as a whole and of their parts.
Verification of the explanatory ability of quantified regression models leads to the calculation of numerical characteristics which, in concentrated form, describe the quality of the calculated models.
Index of correlation and index of determination
The index of correlation in the population, I_yx, is estimated from sample data by i_yx (est I_yx = i_yx). The principle lies in the decomposition of the variability of the dependent variable Y:

Σ_{j=1}^{n} (y_j − ȳ)² = Σ_{j=1}^{n} (y'_j − ȳ)² + Σ_{j=1}^{n} (y_j − y'_j)²

Total variability of the dependent variable = variability explained by the regression function + variability unexplained by the regression function (residual variability).
It is obvious that there is a relationship T = E + U, where

T = Σ_{j=1}^{n} (y_j − ȳ)²  is the total sum of squares (of deviations),

E = Σ_{j=1}^{n} (y'_j − ȳ)²  is the explained sum of squares,

U = Σ_{j=1}^{n} (y_j − y'_j)²  is the unexplained (residual) sum of squares.
Index of correlation i_yx

i_yx = sqrt( Σ_{j=1}^{n} (y'_j − ȳ)² / Σ_{j=1}^{n} (y_j − ȳ)² ) = sqrt(E/T)

Index of determination i_yx²

i_yx² = (T − U)/T = 1 − U/T = 1 − Σ_{j=1}^{n} (y_j − y'_j)² / Σ_{j=1}^{n} (y_j − ȳ)²
The index of determination can take values from 0 to 1. When the value of the index is close to 1, a large proportion of the total variability is explained by the model; conversely, if the index of determination is close to zero, only a small proportion of the total variability is explained by the model.
The index of determination is commonly used as a criterion when deciding on the shape of the regression function. However, if the regression functions have different numbers of parameters, it is necessary to adjust the index of determination to the corrected form:

I²_kor = 1 − [ (n − 1)·Σ_{j=1}^{n} (y_j − y'_j)² ] / [ (n − p)·Σ_{j=1}^{n} (y_j − ȳ)² ]
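A minimal Python sketch of both indices, using the same made-up data as in the earlier examples (illustrative only):

```python
# Sketch: index of determination and its corrected form.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n, p = len(x), 2

b1 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

T = ((y - y.mean()) ** 2).sum()          # total sum of squares
U = ((y - y_hat) ** 2).sum()             # unexplained (residual) sum of squares
i2 = 1 - U / T                           # index of determination
i2_kor = 1 - (n - 1) / (n - p) * U / T   # corrected index of determination
print(i2, i2_kor)
```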
Variability   | Sum of squares                  | Degrees of freedom
Explained     | V = Σ_{j=1}^{n} (y'_j − ȳ)²     | p − 1
Unexplained   | N = Σ_{j=1}^{n} (y_j − y'_j)²   | n − p
Total         | C = Σ_{j=1}^{n} (y_j − ȳ)²      | n − 1
F test

s_y² = V / (p − 1)

s_r² = N / (n − p)

F = s_y² / s_r²

The test criterion in the table can be used for simultaneously testing the significance of the regression model, of the index of determination and of the index of correlation. We compare the calculated value of the F test with the quantile of the F distribution with p − 1 and n − p degrees of freedom:
if F ≤ F_α(p − 1, n − p), the regression model is insignificant, as well as the index of correlation and the index of determination;
if F > F_α(p − 1, n − p), the regression model is statistically significant, as well as the index of correlation and the index of determination.
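A short Python sketch of this F test (illustrative data; the critical value comes from scipy):

```python
# Sketch: F test of the significance of a simple regression model.
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n, p, alpha = len(x), 2, 0.05

b1 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

V = ((y_hat - y.mean()) ** 2).sum()    # explained sum of squares
N = ((y - y_hat) ** 2).sum()           # unexplained sum of squares
F = (V / (p - 1)) / (N / (n - p))

F_crit = stats.f.ppf(1 - alpha, p - 1, n - p)
print(F, "significant" if F > F_crit else "insignificant")
```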
For a detailed evaluation of the quality of the parameters of the regression model, t tests are used. We formulate the null hypothesis

H0: β_i = 0  for i = 0, 1
H1: β_i ≠ 0

where under H0 we assume a zero, and therefore insignificant, effect of the variable associated with the parameter. The test criterion is defined by the relationship:

t_i = b_i / s_(b_i)
where b_i is the value of the parameter of the regression function and s_(b_i) is the standard error of the parameter.
We compare the calculated value of the test criterion with the quantile of the t distribution at significance level α and n − p degrees of freedom:
- if |t| ≤ t_α(n − p), we do not reject the null hypothesis about the insignificance of the parameter;
- if |t| > t_α(n − p), we reject the null hypothesis and confirm the statistical significance of the parameter.
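A sketch of the parameter t tests in Python (illustrative data; the standard error of b_0 again uses the standard textbook formula, which the slides do not spell out):

```python
# Sketch: t tests of b0 and b1 against the hypothesis beta_i = 0.
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n, p, alpha = len(x), 2, 0.05

Sxx = ((x - x.mean()) ** 2).sum()
b1 = ((x - x.mean()) * (y - y.mean())).sum() / Sxx
b0 = y.mean() - b1 * x.mean()
resid = y - (b0 + b1 * x)
s_r = np.sqrt((resid ** 2).sum() / (n - p))

s_b1 = s_r / np.sqrt(Sxx)
s_b0 = s_r * np.sqrt(1.0 / n + x.mean() ** 2 / Sxx)

t_crit = stats.t.ppf(1 - alpha / 2, df=n - p)
for name, b, s in [("b0", b0, s_b0), ("b1", b1, s_b1)]:
    t = b / s
    print(name, t, "significant" if abs(t) > t_crit else "not significant")
```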
Nonlinear regression and correlation analysis
In addition to linear regression functions, nonlinear functions with two or more parameters are also very often used in practice. Some nonlinear regression functions can be suitably transformed to be linear in parameters, and we can then use the method of least squares. Most often, we can transform a nonlinear function with two parameters into the shape:

U = β_0 + β_1·Z + ε
We estimate the regression function in the form

u_j = b_0 + b_1·z_j,  where u = f(y) and z = f(x).

The function is then calculated as a linear function. Not all nonlinear functions can be converted in this way, only those which are linear in parameters, i.e. for which there is some form of transformation, called the linearising transformation; most often it is a substitution or a logarithmic transformation, for example:
Hyperbolic function
y_j = b_0 + b_1/x_j

u = y,  z = 1/x

u_j = b_0 + b_1·z_j
Logarithmic function
y j  b0  b1 . ln x j
uy
z  ln x j
u j  b0  b1 .z
41
Exponential function
y_j = c_0·c_1^(x_j)

log y_j = log c_0 + x_j·log c_1

u = log y,  z = x_j,  b_0 = log c_0,  b_1 = log c_1

u_j = b_0 + b_1·z
Power function (Cobb-Douglas production function)

y_j = c_0·x_j^(b_1)

log y_j = log c_0 + b_1·log x_j

u = log y,  z = log x,  b_0 = log c_0

u_j = b_0 + b_1·z
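A minimal Python sketch of this linearising (logarithmic) transformation for the power function, with made-up data; note the back-calculation of the original parameter c_0 from b_0:

```python
# Sketch: estimating y = c0 * x**b1 via the log transformation.
import numpy as np

x = np.array([1.0, 2.0, 4.0, 8.0, 16.0])
y = np.array([2.0, 2.9, 4.1, 5.8, 8.2])

u = np.log10(y)                        # u = log y
z = np.log10(x)                        # z = log x

b1 = ((z - z.mean()) * (u - u.mean())).sum() / ((z - z.mean()) ** 2).sum()
b0 = u.mean() - b1 * z.mean()          # b0 = log c0

c0 = 10 ** b0                          # back-transform to the original parameter
print(c0, b1)                          # fitted model: y = c0 * x**b1
```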
Similarly, it is possible to modify some nonlinear functions with more parameters, such as the second degree parabola:

y'_j = b_0 + b_1·x_j + b_2·x_j²

z = x_j²

y'_j = b_0 + b_1·x_j + b_2·z_j
Second degree hyperbola

y_j = b_0 + b_1/x_j + b_2/x_j²

u = y,  z = 1/x,  s = 1/x_j²

u_j = b_0 + b_1·z_j + b_2·s_j
It should be noted that the transformed regression functions do not always have the same parameters as the original nonlinear regression function, so it is necessary to calculate the original parameters back from the estimated parameters of the transformed function. The estimates of the original parameters obtained in this way do not have optimal statistical properties, but they are often sufficient for solving specific tasks.
Some regression functions cannot be adjusted or transformed into functions linear in parameters. Estimates of the parameters of such functions are obtained using various approximate or iterative methods. Most of them are based on a so-called gradual improvement of initial estimates, which may be, for example, expert estimates or estimates obtained from selected points, and so on.
Multiple regression and correlation analysis
Suppose that the dependent variable Y and the explanatory (independent) variables X_i, i = 1, 2, ..., k are in a linear relationship; as already mentioned in the previous sections, this can be written as:

Y = f(X_1, X_2, ..., X_k, β_0, β_1, β_2, ..., β_k) + ε

which we estimate by:

y'_j = f(x_1j, x_2j, ..., x_kj, b_0, b_1, ..., b_k)
The coefficients b_0, b_1, ..., b_k, which are estimates of the parameters β_0, β_1, ..., β_k, should meet the condition of the least squares method:

F(b_0, b_1, ..., b_k) = Σ_{j=1}^{n} (y_j − y'_j)² → min

Since we assume a particular shape of the regression function, we can substitute it into the previous relationship and look for the minimum of this function, i.e.:

F(b_0, b_1, ..., b_k) = Σ_{j=1}^{n} (y_j − b_0 − b_1·x_1j − ... − b_k·x_kj)² → min

We determine the minimum of the function similarly as in the case of the simple regression equation, using the partial derivatives of the function:

∂F(b_0, b_1, ..., b_k) / ∂b_i = 0
Which leads to the system of equations:

Σ_{j=1}^{n} y_j = b_0·n + b_1·Σ_{j=1}^{n} x_1j + ... + b_k·Σ_{j=1}^{n} x_kj

Σ_{j=1}^{n} x_1j·y_j = b_0·Σ_{j=1}^{n} x_1j + b_1·Σ_{j=1}^{n} x_1j² + ... + b_k·Σ_{j=1}^{n} x_1j·x_kj

...

Σ_{j=1}^{n} x_kj·y_j = b_0·Σ_{j=1}^{n} x_kj + b_1·Σ_{j=1}^{n} x_1j·x_kj + ... + b_k·Σ_{j=1}^{n} x_kj²
The solution of this system of equations gives the coefficients of the linear regression equation b_0, b_1, ..., b_k.
As for the simple linear relationship, we can calculate the estimates of the parameters from the matrix equation

b = (XᵀX)⁻¹·Xᵀy

where

b = [b_0, b_1, b_2, ..., b_k]ᵀ

X = | 1  x_11  ...  x_k1 |
    | 1  x_12  ...  x_k2 |
    | 1  x_13  ...  x_k3 |
    | ...               |
    | 1  x_1n  ...  x_kn |

y = [y_1, y_2, ..., y_n]ᵀ

The quality of a regression model can be evaluated similarly to the simple linear relationship, as described in the previous section.
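The matrix estimate above can be sketched directly with numpy (illustrative data with two regressors; the normal equations are solved with linalg.solve rather than forming the explicit inverse):

```python
# Sketch: b = (X'X)^(-1) X'y for a multiple regression with two regressors.
import numpy as np

x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
y = np.array([3.1, 3.9, 7.0, 7.8, 11.2, 11.9])

X = np.column_stack([np.ones_like(x1), x1, x2])   # first column of ones
b = np.linalg.solve(X.T @ X, X.T @ y)             # coefficients b0, b1, b2
print(b)
```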
SUMMARY OUTPUT

Regression Statistics
Multiple R          0.809324
R Square            0.655006
Adjusted R Square   0.647818
Standard Error      175.9096
Observations        50

ANOVA
            df   SS        MS        F          Significance F
Regression  1    2820028   2820028   91.13269   1.13E-12
Residual    48   1485322   30944.2
Total       49   4305350

            Coefficients  Standard Error  t Stat     P-value    Lower 95%   Upper 95%
Intercept   -584.881      150.0542        -3.8978    0.000301   -886.585    -283.177
lnGNP       164.9321      17.27699        9.546344   1.13E-12   130.1944    199.6698
Important terms:
Correlation analysis – group of techniques to measure the association
between two variables
Dependent variable – variable that is being predicted or estimated
Independent variable – variable that provides the basis for estimation.
It is the predictor variable
Coefficient of correlation – a measure of the strength of the linear
relationship
Coefficient of determination – The proportion of the total variation in
the dependent variable Y that is explained, or accounted for, by the
variation in the independent variable X
Regression equation – An equation that expresses the relationship between variables
Least squares principle – Determining a regression equation
by minimizing the sum of the squares of the vertical distances
between the actual Y values and the predicted values of Y