Department of Applied Mathematics & Statistics

AMS 572 Group Project
Simple Linear Regression &
Correlation
Instructor: Prof. Wei Zhu
11/21/2013
Outline
1. Motivation & Introduction – Lizhou Nie
2. A Probabilistic Model for Simple Linear Regression –
Long Wang
3. Fitting the Simple Linear Regression Model – Zexi Han
4. Statistical Inference for Simple Linear Regression –
Lichao Su
5. Regression Diagnostics – Jue Huang
6. Correlation Analysis – Ting Sun
7. Implementation in SAS – Qianyi Chen
8. Application and Summary – Jie Shuai
1. Motivation
Fig. 1.1 Simplified Model for Solar System
http://popperfont.net/2012/11/13/the-ultimate-solar-system-animated-gif/
Fig. 1.2 Obama & Romney during Presidential Election Campaign
http://outfront.blogs.cnn.com/2012/08/14/the-most-negative-incampaign-history/
Introduction
• Regression Analysis
 Linear Regression:
Simple Linear Regression: {y; x}
Multiple Linear Regression: {y; x1, …, xp}
Multivariate Linear Regression: {y1, …, ym; x1, …, xp}
• Correlation Analysis
 Pearson Product-Moment Correlation Coefficient: a measure of the linear relationship between two variables
History
• Adrien-Marie Legendre & Carl Friedrich Gauss: earliest form of the least squares method
• Sir Francis Galton: coining of the term "regression"
• George Udny Yule & Karl Pearson: extension of regression to a more generalized statistical context
• Further development of the theory, including the Gauss-Markov theorem
http://en.wikipedia.org/wiki/Regression_analysis
http://en.wikipedia.org/wiki/Adrien_Marie_Legendre
http://en.wikipedia.org/wiki/Carl_Friedrich_Gauss
http://en.wikipedia.org/wiki/Francis_Galton
http://www.york.ac.uk/depts/maths/histstat/people/yule.gif
http://en.wikipedia.org/wiki/Karl_Pearson
2. A Probabilistic Model
Simple Linear Regression
- A special case of linear regression
- Relates one response variable to one explanatory variable
General Setting
- We denote the explanatory variable by x_i and the response variable by y_i
- n pairs of observations {(x_i, y_i)}, i = 1, …, n
2. A Probabilistic Model
Sketch the graph of the n = 100 observations:

i   | x      | y
1   | 37.70  | 9.82
2   | 16.31  | 5.00
3   | 28.37  | 9.27
4   | -12.13 | 2.98
⋮   | ⋮      | ⋮
98  | 9.06   | 7.34
99  | 28.54  | 10.37
100 | -17.19 | 2.33

(The original scatter plot labels the point (29, 5.5).)
2. A Probabilistic Model
In simple linear regression, the data are described by the model:

Y_i = β0 + β1·x_i + ε_i,  where ε_i ~ N(0, σ²)

The fitted model:

ŷ = β̂0 + β̂1·x

where β̂0 is the intercept and β̂1 is the slope of the regression line.
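To make the model concrete, here is a minimal Python sketch (our illustration, not from the slides) that simulates n = 100 pairs from this model; the parameter values β0 = 5, β1 = 0.2, σ = 1 and the distribution of x are assumptions chosen only to mimic the scale of the table above.

import numpy as np

rng = np.random.default_rng(0)

# Assumed illustrative parameters (not from the slides)
beta0, beta1, sigma, n = 5.0, 0.2, 1.0, 100

x = rng.normal(15, 15, size=n)        # explanatory variable
eps = rng.normal(0, sigma, size=n)    # errors: eps_i ~ N(0, sigma^2), i.i.d.
y = beta0 + beta1 * x + eps           # response generated by the SLR model

print(np.round(x[:4], 2), np.round(y[:4], 2))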
3. Fitting the Simple Linear Regression Model
Table 3.1. Tire tread wear vs. mileage

Mileage (in 1000 miles) | Groove Depth (in mils)
0  | 394.33
4  | 329.50
8  | 291.00
12 | 255.17
16 | 229.33
20 | 204.83
24 | 179.00
28 | 163.83
32 | 150.33

[Fig 3.1. Scatter plot of tire tread wear vs. mileage. From: Statistics and Data Analysis; Tamhane and Dunlop; Prentice Hall.]
3. Fitting the Simple Linear Regression Model
The difference between the fitted line and the real data is e_i, the vertical distance between the fitted line and the observed point:

e_i = y_i − ŷ_i = y_i − (β̂0 + β̂1·x_i),  i = 1, 2, …, n

Our goal: minimize the sum of squares

Q = Σ_{i=1}^{n} [y_i − (β0 + β1·x_i)]²

[Fig 3.2. Scatter plot with the fitted line, showing a residual e_i.]
3. Fitting the Simple Linear Regression Model
Least Squares Method

∂Q/∂β0 = −2·Σ_{i=1}^{n} [y_i − (β0 + β1·x_i)] = 0
∂Q/∂β1 = −2·Σ_{i=1}^{n} x_i·[y_i − (β0 + β1·x_i)] = 0

which give the normal equations:

β0·n + β1·Σ x_i = Σ y_i
β0·Σ x_i + β1·Σ x_i² = Σ x_i·y_i
3. Fitting the Simple Linear Regression Model
Solving the normal equations:

β̂0 = [ (Σ x_i²)(Σ y_i) − (Σ x_i)(Σ x_i·y_i) ] / [ n·Σ x_i² − (Σ x_i)² ]

β̂1 = [ n·Σ x_i·y_i − (Σ x_i)(Σ y_i) ] / [ n·Σ x_i² − (Σ x_i)² ]
3. Fitting the Simple Linear Regression Model
To simplify, we denote:

S_xy = Σ (x_i − x̄)(y_i − ȳ) = Σ x_i·y_i − (1/n)(Σ x_i)(Σ y_i)
S_xx = Σ (x_i − x̄)² = Σ x_i² − (1/n)(Σ x_i)²
S_yy = Σ (y_i − ȳ)² = Σ y_i² − (1/n)(Σ y_i)²

so that

β̂1 = S_xy / S_xx  and  β̂0 = ȳ − β̂1·x̄
3. Fitting the Simple Linear Regression Model
Back to the example:

Σ x_i = 144, Σ y_i = 2,197.32, Σ x_i² = 3,264, Σ y_i² = 589,887.08, Σ x_i·y_i = 28,167.72
n = 9, x̄ = 16, ȳ = 244.15

S_xy = 28,167.72 − (1/9)(144 × 2,197.32) = −6,989.40
S_xx = 3,264 − (1/9)(144)² = 960

β̂1 = −6,989.40 / 960 = −7.281  and  β̂0 = 244.15 + 7.281 × 16 = 360.64
3. Fitting the Simple Linear Regression Model
Therefore, the equation of the fitted line is:

ŷ = 360.64 − 7.281·x
Not enough!
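As a numerical check on the hand computation, a minimal Python sketch applying the formulas above to the tire data of Table 3.1 (Python is our own choice of illustration language; the slides themselves use MATLAB and SAS):

import numpy as np

# Tire tread wear data (Table 3.1)
x = np.array([0, 4, 8, 12, 16, 20, 24, 28, 32], dtype=float)
y = np.array([394.33, 329.50, 291.00, 255.17, 229.33,
              204.83, 179.00, 163.83, 150.33])

n = len(x)
Sxy = np.sum(x * y) - np.sum(x) * np.sum(y) / n   # -6,989.40
Sxx = np.sum(x * x) - np.sum(x)**2 / n            # 960

b1 = Sxy / Sxx                  # slope:     -7.281
b0 = y.mean() - b1 * x.mean()   # intercept: 360.64
print(b0, b1)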
3. Fitting the Simple Linear Regression Model
Check the goodness of fit of the LS line. We define:

SST = SSR + SSE

Proof:
SST = Σ (y_i − ȳ)² = Σ (ŷ_i − ȳ)² + Σ (y_i − ŷ_i)² + 2·Σ (y_i − ŷ_i)(ŷ_i − ȳ)
    = SSR + SSE,  since the cross term 2·Σ (y_i − ŷ_i)(ŷ_i − ȳ) = 0

SST: total sum of squares
SSR: regression sum of squares
SSE: error sum of squares

The ratio

r² = SSR/SST = 1 − SSE/SST

is called the coefficient of determination.
3. Fitting the Simple Linear Regression Model
Check the goodness of fit of the LS line. Back to the example:

SST = S_yy = Σ y_i² − (1/9)(Σ y_i)² = 589,887.08 − (1/9)(2,197.32)² = 53,418.73
SSR = SST − SSE = 53,418.73 − 2,531.53 = 50,887.20

r² = 50,887.20 / 53,418.73 = 0.953
r = −√0.953 = −0.976

where the sign of r follows from the sign of β̂1. Since 95.3% of the variation in tread wear is accounted for by linear regression on mileage, the relationship between the two is strongly linear, with a negative slope.
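Continuing the Python sketch from the fit above, the decomposition SST = SSR + SSE and the coefficient of determination can be verified numerically:

# Continues the sketch above (reuses x, y, b0, b1)
yhat = b0 + b1 * x
SST = np.sum((y - y.mean())**2)   # 53,418.73
SSE = np.sum((y - yhat)**2)       #  2,531.53
SSR = SST - SSE                   # 50,887.20
r2 = SSR / SST                    # 0.953
r = np.sign(b1) * np.sqrt(r2)     # -0.976, sign taken from the slope
print(SST, SSE, SSR, r2, r)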
3. Fitting the Simple Linear Regression Model
r is the sample correlation coefficient between X and Y:

r_{X,Y} = S_xy / √(S_xx · S_yy)

For simple linear regression,

r²_{X,Y} = r²
3. Fitting the Simple Linear Regression Model
Estimation of σ²

The variance σ² measures the scatter of the Y_i around their means μ_i = β0 + β1·x_i. An unbiased estimate of σ² is given by

s² = Σ_{i=1}^{n} e_i² / (n − 2) = SSE / (n − 2)
3. Fitting the Simple Linear Regression Model
From the example we have SSE = 2,531.53 and n − 2 = 7; therefore

s² = 2,531.53 / 7 = 361.65

which has 7 d.f. The estimate of σ is

s = √361.65 = 19.02
4. Statistical Inference for SLR
Under the normal error assumption:
* Point estimators: β̂0, β̂1
* Sampling distributions of β̂0 and β̂1:

E(β̂0) = β0,  E(β̂1) = β1

β̂0 ~ N( β0, σ²·Σ x_i² / (n·S_xx) ),  β̂1 ~ N( β1, σ² / S_xx )

with estimated standard errors

SE(β̂0) = s·√( Σ x_i² / (n·S_xx) ),  SE(β̂1) = s / √S_xx
Derivation

E(β̂1) = Σ_{i=1}^{n} (x_i − x̄)·E(Y_i) / S_xx
       = Σ (x_i − x̄)(β0 + β1·x_i) / S_xx
       = β0·Σ (x_i − x̄)/S_xx + β1·Σ (x_i − x̄)·x_i / S_xx
       = β1·(1/S_xx)·[ Σ (x_i − x̄)·x_i − Σ (x_i − x̄)·x̄ ]   (the β0 term and the added x̄ term vanish since Σ (x_i − x̄) = 0)
       = β1·(1/S_xx)·Σ (x_i − x̄)² = β1

Var(β̂1) = Σ_{i=1}^{n} [ (x_i − x̄)/S_xx ]²·Var(Y_i)
        = (σ²/S_xx²)·Σ (x_i − x̄)²
        = σ² / S_xx
Derivation

E(β̂0) = E(Ȳ − β̂1·x̄) = E(Ȳ) − E(β̂1)·x̄
       = Σ E(β0 + β1·x_i)/n − β1·x̄
       = (n·β0 + β1·Σ x_i)/n − β1·x̄
       = β0 + β1·x̄ − β1·x̄ = β0

Var(β̂0) = Var(Ȳ − β̂1·x̄) = Var(Ȳ) + x̄²·Var(β̂1)
        = σ²/n + x̄²·σ²/S_xx
        = σ²·[ Σ (x_i − x̄)² + n·x̄² ] / (n·S_xx)
        = σ²·Σ x_i² / (n·S_xx)
For mathematical derivations, please refer to the Tamhane and Dunlop text book, P331.
Statistical Inference on β0 and β1
* Pivotal Quantities (P.Q.'s):

(β̂0 − β0) / SE(β̂0) ~ t_{n−2},  (β̂1 − β1) / SE(β̂1) ~ t_{n−2}

* 100(1 − α)% Confidence Intervals (C.I.'s):

β̂0 ± t_{n−2, α/2}·SE(β̂0),  β̂1 ± t_{n−2, α/2}·SE(β̂1)
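Continuing the same Python sketch, the standard errors and 95% confidence intervals for the tire data can be computed as follows (using scipy for the t quantile is our own assumption of tooling):

from scipy import stats

# Continues the sketch above (reuses x, n, Sxx, SSE, b0, b1)
s2 = SSE / (n - 2)                      # 361.65, unbiased estimate of sigma^2
s = np.sqrt(s2)                         # 19.02
se_b1 = s / np.sqrt(Sxx)
se_b0 = s * np.sqrt(np.sum(x * x) / (n * Sxx))

t_crit = stats.t.ppf(0.975, df=n - 2)   # t_{n-2, alpha/2} with alpha = 0.05
print("b0:", b0, "+/-", t_crit * se_b0)
print("b1:", b1, "+/-", t_crit * se_b1)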
Hypothesis tests:

H0: β1 = β1⁰ vs. Ha: β1 ≠ β1⁰

Reject H0 at level α if |t0| = |β̂1 − β1⁰| / SE(β̂1) > t_{n−2, α/2}

A useful application is to test whether there is a linear relationship between x and y:

H0: β1 = 0 vs. Ha: β1 ≠ 0

Reject H0 at level α if |t0| = |β̂1| / SE(β̂1) > t_{n−2, α/2}
Analysis of Variance (ANOVA)
Mean Square: a sum of squares divided by its degrees of freedom.

MSR = SSR / 1  and  MSE = SSE / (n − 2)

F0 = MSR/MSE = β̂1²·S_xx / s² = [ β̂1 / (s/√S_xx) ]² = [ β̂1 / SE(β̂1) ]² = t0²

so the F test is equivalent to the two-sided t test, using f_{1, n−2, α} = t²_{n−2, α/2}.
Analysis of Variance (ANOVA)
ANOVA Table

Source of Variation (Source) | Sum of Squares (SS) | Degrees of Freedom (d.f.) | Mean Square (MS)  | F
Regression                   | SSR                 | 1                         | MSR = SSR/1       | F = MSR/MSE
Error                        | SSE                 | n − 2                     | MSE = SSE/(n − 2) |
Total                        | SST                 | n − 1                     |                   |
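Continuing the sketch, the ANOVA entries and the identity F0 = t0² can be checked numerically:

from scipy import stats

# Continues the sketch above (reuses n, SSR, SSE, b1, se_b1)
MSR = SSR / 1
MSE = SSE / (n - 2)                 # equals s^2
F0 = MSR / MSE
t0 = b1 / se_b1
print(F0, t0**2)                    # identical for simple linear regression
print(stats.f.sf(F0, 1, n - 2))     # p-value of the F test for H0: beta1 = 0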
5. Regression Diagnostics
5.1 Checking the Model Assumptions
5.1.1 Checking for Linearity
5.1.2 Checking for Constant Variance
5.1.3 Checking for Normality
 Primary tool: residual plots
5.2 Checking for Outliers and Influential Observations
5.2.1 Checking for Outliers
5.2.2 Checking for Influential Observations
5.2.3 How to Deal with Outliers and Influential Observations
5. Regression Diagnostics
5.1.1 Checking for Linearity

i | x_i | y_i    | ŷ_i    | e_i
1 | 0   | 394.33 | 360.64 | 33.69
2 | 4   | 329.50 | 331.51 | -2.01
3 | 8   | 291.00 | 302.39 | -11.39
4 | 12  | 255.17 | 273.27 | -18.10
5 | 16  | 229.33 | 244.15 | -14.82
6 | 20  | 204.83 | 215.02 | -10.19
7 | 24  | 179.00 | 185.90 | -6.90
8 | 28  | 163.83 | 156.78 | 7.05
9 | 32  | 150.33 | 127.66 | 22.67

Table 5.1 The x_i, y_i, ŷ_i, e_i for the Tire Wear Data
Figure 5.1 Scatter plots of y_i vs. x_i and e_i vs. x_i for the Tire Wear Data
5. Regression Diagnostics
5.1.1 Checking for Linearity (Data transformation)

[Figure 5.2 Typical Scatter Plot Shapes and Corresponding Linearizing Transformations; the transformations shown include powers of x (x², x³), log x, −1/x and powers of y (y², y³), log y, −1/y]
5. Regression Diagnostics
5.1.1 Checking for Linearity (Data transformation)

i | x_i | y_i    | ln(y_i) | ŷ_i    | e_i
1 | 0   | 394.33 | 5.926   | 374.64 | 19.69
2 | 4   | 329.50 | 5.807   | 332.58 | -3.08
3 | 8   | 291.00 | 5.688   | 295.24 | -4.24
4 | 12  | 255.17 | 5.569   | 262.09 | -6.92
5 | 16  | 229.33 | 5.450   | 232.67 | -3.34
6 | 20  | 204.83 | 5.331   | 206.54 | -1.71
7 | 24  | 179.00 | 5.211   | 183.36 | -4.36
8 | 28  | 163.83 | 5.092   | 162.77 | 1.06
9 | 32  | 150.33 | 4.973   | 144.50 | 5.83

Table 5.2 The x_i, y_i, ln(y_i), ŷ_i, e_i for the Tire Wear Data
Figure 5.2 Scatter plots of ln(y_i) vs. x_i and e_i vs. x_i for the Tire Wear Data
5. Regression Diagnostics
5.1.2 Checking for Constant Variance
Plot the residuals against the fitted values. If the constant variance assumption is correct, the dispersion of the e_i's is approximately constant with respect to the ŷ_i's.

Figure 5.3 Plot of Residuals e_i vs. ŷ_i for Tire Wear Data
Figure 5.4 Plots of Residuals e_i vs. ŷ_i Corresponding to Different Functional Relationships between Var(Y) and E(Y)
5. Regression Diagnostics
5.1.3 Checking for Normality
Make a normal plot of the residuals: they have zero mean and an approximately constant variance (assuming the other assumptions about the model are correct).

Figure 5.5 Normal Probability Plot for Tire Wear Data (p-value = 0.0097)
5. Regression Diagnostics
Outlier: an observation that does not follow the general pattern of the relationship between y and x. A large residual indicates an outlier. The standardized residuals are given by

e_i* = e_i / SE(e_i) = e_i / [ s·√(1 − 1/n − (x_i − x̄)²/S_xx) ] ≈ e_i / s,  i = 1, 2, …, n

If |e_i*| > 2, the corresponding observation may be regarded as an outlier.

Influential observation: an observation with an extreme x-value, an extreme y-value, or both. If we express the fitted value of y as a linear combination of all the y_j,

ŷ_i = Σ_{j=1}^{n} h_ij·y_j,  with leverage h_ii = 1/n + (x_i − x̄)²/S_xx

then if h_ii > 2(k + 1)/n (k = number of predictors; here k = 1), the corresponding observation may be regarded as influential.
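Continuing the Python sketch, the leverages below reproduce the h_ii column of Table 5.3 exactly (leverage depends only on the x's); the standardized residuals shown are for the untransformed fit, whereas Table 5.3 computes them from the log-transformed fit:

# Continues the sketch above (reuses x, y, yhat, n, s, Sxx)
h = 1.0 / n + (x - x.mean())**2 / Sxx      # leverages h_ii
e = y - yhat
e_std = e / (s * np.sqrt(1 - h))           # e_i / SE(e_i)
print(np.round(h, 4))                      # 0.3778, 0.2611, ..., 0.3778
print(np.abs(e_std) > 2)                   # possible outliers
print(h > 2 * (1 + 1) / n)                 # influential: cutoff 2(k+1)/n = 0.44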
5. Regression Diagnostics
5.2 Checking for Outliers and Influential Observations

i | e_i*    | h_ii
1 | 2.8653  | 0.3778
2 | -0.4113 | 0.2611
3 | -0.5367 | 0.1778
4 | -0.8505 | 0.1278
5 | -0.4067 | 0.1111
6 | -0.2102 | 0.1278
7 | -0.5519 | 0.1778
8 | 0.1416  | 0.2611
9 | 0.8484  | 0.3778

Cutoffs: |e_i*| > 2 (outlier) and h_ii > 0.44 (influential).
Table 5.3 Standardized residuals & leverages for the transformed Tire Wear Data
MATLAB Code for Regression Diagnostics
clear; clc;
x = [0 4 8 12 16 20 24 28 32];
y = [394.33 329.50 291.00 255.17 229.33 204.83 179.00 163.83 150.33];
y1 = log(y);                        % data transformation
b = polyfit(x,y,1)                  % linear regression of y on x
% b = polyfit(x,y1,1)               % fit on the log scale instead
yfit = polyval(b,x);                % fitted values
yresid = y - yfit;                  % residuals
% yresid = y - exp(polyval(b,x));   % residuals on the original scale for the log fit
ssresid = sum(yresid.^2);           % residual sum of squares (SSE)
sstotal = (length(y)-1) * var(y);   % total sum of squares (SST)
rsq = 1 - ssresid/sstotal           % R-square
normplot(yresid)                    % normal plot of the residuals
[h,pval,jbstat,critval] = jbtest(yresid)  % Jarque-Bera test of normality
scatter(x,y,500,'r','.')            % scatter plot with LS line
lsline
axis([-5 35 100 450])               % axis limits chosen to show the raw data
xlabel('x_i')
ylabel('y_i')
title('Tread wear vs. mileage')
n = length(x);
s = sqrt(ssresid/(n-2));            % estimate of sigma
Sxx = sum((x - mean(x)).^2);        % = 960 for these data
estd = zeros(1,n); lev = zeros(1,n);
for i = 1:n                         % check for outliers: flag |estd| > 2
    estd(i) = yresid(i) / (s*sqrt(1 - 1/n - (x(i)-mean(x))^2/Sxx));
end
for i = 1:n                         % check for influence: flag lev > 2*(k+1)/n
    lev(i) = 1/n + (x(i)-mean(x))^2/Sxx;
end
6.1 Correlation Analysis
Why do we need this?
Regression analysis is used to model the relationship between two variables. But when there is no such distinction and both variables are random, correlation analysis is used to study the strength of the relationship.
6.1 Correlation Analysis – Example
Examples of correlated variable pairs:
• Flu cases reported & temperature
• Life expectancy & economy level
• People who get a flu shot & economic growth
Figure 6.1
6.2 Bivariate Normal Distribution
Because we need to investigate the correlation between X and Y, consider the joint density

f(x, y) = 1 / [ 2π·σ_x·σ_y·√(1 − ρ²) ] · exp{ −1/[2(1 − ρ²)] · [ ((x − μ_x)/σ_x)² − 2ρ·((x − μ_x)/σ_x)·((y − μ_y)/σ_y) + ((y − μ_y)/σ_y)² ] }

where

ρ = Corr(X, Y) = Cov(X, Y) / √(Var(X)·Var(Y)),  −1 ≤ ρ ≤ 1

Figure 6.2
Source: http://wiki.stat.ucla.edu/socr/index.php/File:SOCR_BivariateNormal_JS_Activity_Fig7.png
6.2 Why introduce the Bivariate Normal Distribution?
First, we need to do some computation:

E(Y | X = x) = [ μ_Y − (ρ·σ_Y/σ_X)·μ_X ] + (ρ·σ_Y/σ_X)·x
Var(Y | X = x) = (1 − ρ²)·σ_Y²

Compare with the regression model:

β0 = μ_Y − (ρ·σ_Y/σ_X)·μ_X,  β1 = ρ·σ_Y/σ_X

So, if (X, Y) has a bivariate normal distribution, then the regression model is true.
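A minimal simulation sketch (our illustration, with assumed parameters ρ = 0.6, σ_X = 2, σ_Y = 3) showing that for bivariate normal data the LS slope approaches β1 = ρ·σ_Y/σ_X:

import numpy as np

rng = np.random.default_rng(1)

# Assumed illustrative parameters (not from the slides)
rho, sx, sy = 0.6, 2.0, 3.0
cov = [[sx**2, rho*sx*sy], [rho*sx*sy, sy**2]]

XY = rng.multivariate_normal([0.0, 0.0], cov, size=100_000)
X, Y = XY[:, 0], XY[:, 1]

b1_hat = np.cov(X, Y)[0, 1] / np.var(X, ddof=1)  # LS slope from the sample
print(b1_hat, rho * sy / sx)                     # both approximately 0.9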
6.3 Statistical Inference of r
Define the r.v. R corresponding to r. The distribution of R is quite complicated: it is skewed unless ρ = 0.

[Figure 6.3 Densities f(r) of R for ρ = −0.7, −0.3, 0, and 0.5]
6.3 Exact test when ρ = 0
Test: H0: ρ = 0 vs. Ha: ρ ≠ 0
Test statistic:

T0 = R·√(n − 2) / √(1 − R²) ~ t_{n−2} under H0

Reject H0 iff |t0| > t_{n−2, α/2}
Example
A researcher wants to determine if two test instruments give similar results. The two test instruments are administered to a sample of 15 students. The correlation coefficient between the two sets of scores is found to be 0.7. Is this correlation statistically significant at the .01 level?

H0: ρ = 0 vs. Ha: ρ ≠ 0

t0 = 0.7·√(15 − 2) / √(1 − 0.7²) = 3.534

Since 3.534 = t0 > t_{13, .005} = 3.012, we reject H0.
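A short sketch reproducing this test numerically (scipy is our own choice of tooling):

import numpy as np
from scipy import stats

r, n = 0.7, 15
t0 = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)   # 3.534
t_crit = stats.t.ppf(1 - 0.005, df=n - 2)     # t_{13, .005} = 3.012
p_value = 2 * stats.t.sf(abs(t0), df=n - 2)   # about 0.004 < .01
print(t0, t_crit, p_value)                    # reject H0 at level .01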
6.3 Note: they are the same!
Because

r = β̂1·(s_x/s_y) = β̂1·√(S_xx/S_yy) = β̂1·√(S_xx/SST),  and  1 − r² = SSE/SST = (n − 2)·s²/SST

we have

t = r·√(n − 2) / √(1 − r²) = β̂1·√(S_xx/SST)·√[ (n − 2)·SST / ((n − 2)·s²) ] = β̂1 / (s/√S_xx) = β̂1 / SE(β̂1)

So we can say the test of H0: β1 = 0 is equivalent to the test of H0: ρ = 0.
6.3 Approximate test when ρ ≠ 0
Because the exact distribution of R is not very useful for making inferences on ρ, R. A. Fisher showed that the following transformation of R is approximately normally distributed:

tanh⁻¹(R) = (1/2)·ln[(1 + R)/(1 − R)] ≈ N( (1/2)·ln[(1 + ρ)/(1 − ρ)], 1/(n − 3) )
6.3 Steps for the approximate test on ρ (a code sketch follows below)
1. H0: ρ = ρ0 vs. Ha: ρ ≠ ρ0
2. Point estimator: r, transformed to ẑ = tanh⁻¹(r)
3. Test statistic: z0 = [tanh⁻¹(r) − tanh⁻¹(ρ0)]·√(n − 3); reject H0 at level α if |z0| > z_{α/2}
4. C.I. for ρ: tanh( tanh⁻¹(r) ± z_{α/2}/√(n − 3) )
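A sketch of these steps in Python (the helper fisher_z_test is our own illustration, not from the slides):

import numpy as np
from scipy import stats

def fisher_z_test(r, n, rho0=0.0, alpha=0.05):
    # Approximate z test and C.I. for rho via Fisher's arctanh transform
    z_hat = np.arctanh(r)                       # (1/2) ln((1+r)/(1-r))
    z0 = (z_hat - np.arctanh(rho0)) * np.sqrt(n - 3)
    zc = stats.norm.ppf(1 - alpha / 2)
    lo, hi = np.tanh([z_hat - zc / np.sqrt(n - 3),
                      z_hat + zc / np.sqrt(n - 3)])  # back to the rho scale
    return z0, (lo, hi)

# The two-instrument example again: r = 0.7, n = 15
print(fisher_z_test(0.7, 15, rho0=0.0, alpha=0.01))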
6.4 The pitfalls of correlation analysis
• Lurking variables
• Overextrapolation
7. Implementation in SAS

    | state | district | democA | voteA | expendA | expendB | prtystrA | lexpendA | lexpendB | shareA
1   | "AL"  | 7        | 1      | 68    | 328.3   | 8.74    | 41       | 5.793916 | 2.167567 | 97.41
2   | "AK"  | 1        | 0      | 62    | 626.38  | 402.48  | 60       | 6.439952 | 5.997638 | 60.88
3   | "AZ"  | 2        | 1      | 73    | 99.61   | 3.07    | 55       | 4.601233 | 1.120048 | 97.01
…   | …     | …        | …      | …     | …       | …       | …        | …        | …        | …
173 | "WI"  | 8        | 1      | 30    | 14.42   | 227.82  | 47       | 2.668685 | 5.428569 | 5.95

Table 7.1 Vote example data (N = 173)
7. Implementation in SAS
SAS code for the vote example:

proc corr data=vote1;
var F4 F10;
run;

Pearson Correlation Coefficients, N = 173
Prob > |r| under H0: Rho=0

   | F4      | F10
F4 | 1.00000 | 0.92528

Table 7.2 Correlation coefficients (F4 = voteA, F10 = shareA)

proc reg data=vote1;
model F4=F10;
label F4=voteA; label F10=shareA;
output out=fitvote residual=R;
run;
7. Implementation in SAS
SAS output:

Analysis of Variance
Source          | DF  | Sum of Squares | Mean Square | F Value | Pr > F
Model           | 1   | 41486          | 41486       | 1017.70 | <.0001
Error           | 171 | 6970.77364     | 40.76476    |         |
Corrected Total | 172 | 48457          |             |         |

Root MSE       | 6.38473  | R-Square | 0.8561
Dependent Mean | 50.50289 | Adj R-Sq | 0.8553
Coeff Var      | 12.64230 |          |

Parameter Estimates
Variable  | Label     | DF | Parameter Estimate | Standard Error | t Value
Intercept | Intercept | 1  | 26.81254           | 0.88719        | 30.22
F10       | F10       | 1  | 0.46382            | 0.01454        | 31.90

Table 7.3 SAS output for vote example
7. Implementation in SAS
Figure 7.1 Plot of Residual vs. shareA for vote example
7. Implementation in SAS
Figure 7.2 Plot of voteA vs. shareA for vote example
7. Implementation in SAS
SAS-Check Homoscedasticity
Figure 7.3 Plots of SAS output for vote example
7. Implementation in SAS
SAS-Check Normality of Residuals

SAS code:

proc univariate data=fitvote normal;
var R;
qqplot R / normal (Mu=est Sigma=est);
run;

Tests for Location: Mu0=0
Test        | Statistic  | p Value
Student's t | t = 0      | Pr > |t| = 1.0000
Sign        | M = -0.5   | Pr >= |M| = 1.0000
Signed Rank | S = -170.5 | Pr >= |S| = 0.7969

Tests for Normality
Test               | Statistic        | p Value
Shapiro-Wilk       | W = 0.952811     | Pr < W = 0.7395
Kolmogorov-Smirnov | D = 0.209773     | Pr > D > 0.1500
Cramer-von Mises   | W-Sq = 0.056218  | Pr > W-Sq > 0.2500
Anderson-Darling   | A-Sq = 0.30325   | Pr > A-Sq > 0.2500

Table 7.4 SAS output for checking normality
7. Implementation in SAS
SAS-Check Normality of Residuals
Figure 7.4 Plot of Residual vs. Normal Quantiles for vote example
8. Application
• Linear regression is widely used to describe possible relationships between variables, and it ranks as one of the most important tools in disciplines such as:
 Marketing/business analytics
 Healthcare
 Finance
 Economics
 Ecology/environmental science
8. Application
• Prediction, forecasting or deduction
 Linear regression can be used to fit a
predictive model to an observed data set
of Y and X values. After developing such a
model, if an additional value of X is then given
without its accompanying value of Y, the fitted
model can be used to make a prediction of the
value of Y.
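As a concrete illustration, a sketch continuing the Section 3 tire example: a point prediction at an assumed new mileage of 25 (in 1000 miles), together with the textbook 95% prediction interval (the interval formula is standard but not shown in these slides):

from scipy import stats

# Continues the Section 3 sketch (reuses x, n, Sxx, s, b0, b1)
x_new = 25.0                                   # assumed new mileage
y_pred = b0 + b1 * x_new                       # about 178.6 mils

se_pred = s * np.sqrt(1 + 1/n + (x_new - x.mean())**2 / Sxx)
t_crit = stats.t.ppf(0.975, df=n - 2)
print(y_pred - t_crit * se_pred, y_pred, y_pred + t_crit * se_pred)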
8. Application
• Quantifying the strength of relationship
 Given a variable Y and a number of variables X1, ..., Xp that may be related to Y, linear regression analysis can be applied to assess which Xj may have no relationship with Y at all, and to identify which subsets of the Xj contain redundant information about Y.
8. Application
Example 1. Trend line
A trend line represents a trend, the long-term movement in time series data after other components have been accounted for. Trend lines are sometimes used in business analytics to show changes in data over time.

Figure 8.1 Refrigerator sales over a 13-year period
http://www.likeoffice.com/28057/Excel-2007-Formatting-charts
8. Application
Example 2. Clinical drug trials
Regression analysis is widely utilized in healthcare. The graph shows an example in which we investigate the relationship between protein concentration and absorbance employing linear regression analysis.

Figure 8.2 BSA Protein Concentration vs. Absorbance
http://openwetware.org/wiki/User:Laura_Flynn/Notebook/Experimental_Biological_Chemistry/2011/09/13
Summary

Linear Regression Analysis:
• Probabilistic models & model assumptions (linearity, constant variance & normality)
• Least squares estimation
• Statistical inference
• Regression diagnostics: outliers & influential observations; data transformation

Correlation Analysis:
• Correlation coefficient (bivariate normal distribution, exact t-test, approximate z-test)
Acknowledgement & References
Acknowledgement
• Sincere thanks go to Prof. Wei Zhu
References
• Statistics and Data Analysis; Ajit Tamhane & Dorothy Dunlop; Prentice Hall.
• Introductory Econometrics: A Modern Approach; Jeffrey M. Wooldridge; 5th ed.
• http://en.wikipedia.org/wiki/Regression_analysis
• http://en.wikipedia.org/wiki/Adrien_Marie_Legendre
• etc. (web links have already been included in the slides)