8114-01

Session 1
Outline for Session 1
• Course Objectives & Description
• Review of Basic Statistical Ideas
– Intercept, Slope, Correlation, Causality
• Simple Linear Regression
– Statistical Model and Concepts
– Regression in Excel
Applied Regression -- Prof. Juran
Course Themes
• Learn useful and practical tools of
regression and data analysis
• Learn by example and by doing
• Learn enough theory to use regression
safely
• Shape the course experience to meet
your goals
– The agenda is flexible
– Pick your own project
– The professor also enjoys learning
• Let’s enjoy ourselves – life is too short
Basic Information
Canvas
www.columbia.edu/~dj114/
dj114@columbia.edu
Basic Requirements
• Come to class and participate
• Cases once/twice per session
• Project
What is Regression Analysis?
• A Procedure for Data Analysis
– Regression analysis is a family of
mathematical procedures for fitting
functions to data.
– The most basic procedure -- simple linear
regression -- fits a straight line to a set of
data so that the sum of the squared “y
deviations” is minimal. Regression can be
used on a completely pragmatic basis.
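The least-squares idea in the last bullet can be sketched directly. Below is a minimal Python sketch (an assumption; the course itself works in Excel) with hypothetical toy data: it computes the line that minimizes the sum of squared "y deviations" and illustrates that any nearby line does worse.

```python
import numpy as np

# Hypothetical toy data, for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

def sse(b0, b1):
    """Sum of squared y-deviations from the line b0 + b1*x."""
    return np.sum((y - (b0 + b1 * x)) ** 2)

# Closed-form least-squares estimates of intercept and slope
b1 = np.sum((y - y.mean()) * (x - x.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

# Perturbing either coefficient can only increase the sum of squares
```

This is the "completely pragmatic" use of regression: no statistical assumptions are needed just to fit the line.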
What is Regression Analysis?
• A Foundation for Statistical Inference
– If special statistical conditions hold, the
regression analysis:
• Produces statistically “best” estimates of the
“true” underlying relationship and its
components
• Provides measures of the quality and reliability
of the fitted function
• Provides the basis for hypothesis tests and
confidence and prediction intervals
Some Regression Applications
• Determining the factors that influence energy consumption in a detergent plant
• Measuring the volatility of financial securities
• Determining the influence of ambient launch temperature on Space Shuttle o-ring burn-through
• Identifying demographic and purchase history factors that predict high consumer response to catalog mailings
• Mounting a legal defense against a charge of sex discrimination in pay
• Determining the cause of leaking antifreeze bottles on a packing line
• Measuring the fairness of CEO compensation
• Predicting monthly champagne sales
Course Outline
• Basics of regression
– Bottom: inferences about effects of
independent variables on the dependent
variable
– Middle: Analysis of Variance
– Top: summary measures for the model
Course Outline
• Advanced Regression Topics
– Interval Estimation
– Full Model with Arrays
– Qualitative Variables
– Residual Analysis
– Thoughts on Nonlinear Regression
– Model-building Ideas
– Multicollinearity
– Autocorrelation, serial correlation
Course Outline
• Related Topics
– Chi-square Goodness-of-Fit Tests
– Forecasting Methods
• Exponential Smoothing
• Regression
• Two Multivariate Methods
– Cluster Analysis
– Discriminant Analysis
• Binary Logistic Regression
The Theory Underlying
Simple Linear Regression
Regression can always be used to fit a straight line to a set of data. It is a relatively easy computational task (Excel, Minitab, etc.).

If specified conditions hold, statistical theory can be employed to evaluate the quality and reliability of the line for prediction of future events.
The Standard Statistical Model
– Y: the "dependent" random variable, the effect or outcome that we wish to predict or understand.
– X: the "independent" deterministic variable, an input, cause or determinant that may cause, influence, explain or predict the values of Y.
$$Y(X) = \beta_0 + \beta_1 X + \varepsilon$$

where $Y(X)$ is the dependent random variable, $X$ is the independent deterministic variable, $\beta_0$ and $\beta_1$ are the parameters of the "true" regression relationship, and $\varepsilon$ is a random "noise" factor.
Regression Assumptions
The expected value of Y is a linear function of X:

$$E[Y(X)] = \beta_0 + \beta_1 X, \qquad E(\varepsilon) = 0$$

The variance of Y does not change with X:

$$\mathrm{Var}[Y(X)] = \sigma^2, \qquad \mathrm{Var}(\varepsilon) = \sigma^2$$
Regression Assumptions
Random variations at different X values are
uncorrelated:
$$\mathrm{Cov}(\varepsilon_i, \varepsilon_j) = 0, \qquad \forall\, i \neq j$$
Random variations from the regression line are
normally distributed:
$$Y(X) \sim N(\beta_0 + \beta_1 X,\ \sigma^2), \qquad \varepsilon \sim N(0,\ \sigma^2)$$
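These assumptions can be made concrete by simulating data that satisfies all of them. A minimal Python sketch (an assumption; not part of the course materials), using arbitrarily chosen "true" parameters β₀ = 5, β₁ = 2, σ = 1:

```python
import numpy as np

rng = np.random.default_rng(0)

beta0, beta1, sigma = 5.0, 2.0, 1.0   # hypothetical "true" parameters
x = np.linspace(0, 10, 100)           # deterministic X values

# Independent, identically distributed normal noise with mean 0 and
# constant variance: exactly the assumptions listed above
eps = rng.normal(loc=0.0, scale=sigma, size=x.size)
y = beta0 + beta1 * x + eps           # E[Y(X)] is linear in X

# Least-squares estimates should land close to the true parameters
b1 = np.sum((y - y.mean()) * (x - x.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
```

When the assumptions hold, as here by construction, the fitted line tracks the true one closely.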
Thoughts on Linearity
The significance of the word “linear” in the linear
regression model
$$Y(X) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p$$
is not linearity in the X’s, it is linearity in the Betas (the
slope coefficients). Consider the following variants – both of
which are linear:
$$Y(X) = \beta_0 + \beta_1 X_1 + \beta_2 X_1^2 + \beta_3 X_2 + \beta_4 X_1 X_2$$

$$\ln Y(X) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p$$
There are many creative ways to fit non-linear functions
by linear regression. Consider a few popular
linearizations:
$$Y = \alpha X^{\beta} \quad\Longrightarrow\quad \log Y = \log\alpha + \beta \log X$$

$$Y = \alpha e^{\beta X} \quad\Longrightarrow\quad \ln Y = \ln\alpha + \beta X$$

$$Y = \frac{X}{\alpha X - \beta} \quad\Longrightarrow\quad \frac{1}{Y} = \alpha - \beta\left(\frac{1}{X}\right)$$

$$Y = \frac{e^{\alpha + \beta X}}{1 + e^{\alpha + \beta X}} \quad\Longrightarrow\quad \ln\frac{Y}{1 - Y} = \alpha + \beta X$$
Time permitting, we will look at some of these
possibilities later in the course. These may present
interesting opportunities for student term projects.
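The first of these linearizations, the power law Y = αX^β, takes only a few lines to demonstrate. A Python sketch (an assumption; the course works in Excel) that recovers hypothetical values α = 2 and β = 1.5 from a log-log regression on noise-free data:

```python
import numpy as np

alpha, beta = 2.0, 1.5                   # hypothetical true parameters
x = np.array([1.0, 2.0, 4.0, 8.0, 16.0])
y = alpha * x ** beta                    # noise-free power-law data

# Fit the linearized model: log y = log(alpha) + beta * log x
# np.polyfit returns coefficients highest degree first: (slope, intercept)
b, a = np.polyfit(np.log(x), np.log(y), 1)
```

The fitted slope is β and the exponential of the fitted intercept is α; with real, noisy data the recovery would of course be approximate.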
Regression Estimators
We are given the data set:
 i     Y     X
 1    y1    x1
 2    y2    x2
...   ...   ...
 i    yi    xi
...   ...   ...
 n    yn    xn
We seek good estimators $\hat\beta_0$ of $\beta_0$ and $\hat\beta_1$ of $\beta_1$ that minimize the sum of the squared residuals (errors). The $i$th residual is

$$e_i = y_i - (\hat\beta_0 + \hat\beta_1 x_i), \qquad i = 1, 2, \ldots, n$$
Computer Repair Example
 i   Minutes   Units
 1      23       1
 2      29       2
 3      49       3
 4      64       4
 5      74       4
 6      87       5
 7      96       6
 8      97       6
 9     109       7
10     119       8
11     149       9
12     145       9
13     154      10
14     166      10
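The repair data can be loaded and summarized in a few lines. A Python sketch (an assumption; the slides do this in Excel):

```python
import numpy as np

# Computer repair data: repair time (minutes) and units serviced per call
minutes = np.array([23, 29, 49, 64, 74, 87, 96, 97,
                    109, 119, 149, 145, 154, 166], dtype=float)
units = np.array([1, 2, 3, 4, 4, 5, 6, 6, 7, 8, 9, 9, 10, 10], dtype=float)

n = minutes.size
mean_minutes = minutes.mean()   # corresponds to =AVERAGE(B2:B15)
mean_units = units.mean()       # corresponds to =AVERAGE(C2:C15)
```

These means (97.21 minutes, 6 units) reappear throughout the worksheets that follow.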
Statistical Basics
Basic statistical computations and graphical displays are very helpful in
doing and interpreting a regression. We should always compute:
$$\bar{y} = \frac{\sum_{i=1}^{n} y_i}{n} \quad\text{and}\quad \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$

$$s_y = \sqrt{\frac{\sum_{i=1}^{n}(y_i - \bar{y})^2}{n - 1}} \quad\text{and}\quad s_x = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1}}$$

$$r_{X,Y} = \frac{\sum_i (y_i - \bar{y})(x_i - \bar{x})}{\sqrt{\sum_i (y_i - \bar{y})^2 \sum_i (x_i - \bar{x})^2}}$$
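Applied to the repair data, those formulas give the values shown on the worksheet that follows. A Python check (an assumption; the slides compute these in Excel):

```python
import numpy as np

minutes = np.array([23, 29, 49, 64, 74, 87, 96, 97,
                    109, 119, 149, 145, 154, 166], dtype=float)
units = np.array([1, 2, 3, 4, 4, 5, 6, 6, 7, 8, 9, 9, 10, 10], dtype=float)

# Sample standard deviations (n - 1 in the denominator)
s_y = minutes.std(ddof=1)
s_x = units.std(ddof=1)

# Sample correlation coefficient, built from the deviation columns
dev_y = minutes - minutes.mean()
dev_x = units - units.mean()
r = np.sum(dev_y * dev_x) / np.sqrt(np.sum(dev_y**2) * np.sum(dev_x**2))
```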
 i   Minutes   Units   Error (min)   Error (units)
 1      23       1       -74.21          -5
 2      29       2       -68.21          -4
 3      49       3       -48.21          -3
 4      64       4       -33.21          -2
 5      74       4       -23.21          -2
 6      87       5       -10.21          -1
 7      96       6        -1.21           0
 8      97       6        -0.21           0
 9     109       7        11.79           1
10     119       8        21.79           2
11     149       9        51.79           3
12     145       9        47.79           3
13     154      10        56.79           4
14     166      10        68.79           4

          Minutes   Units
mean        97.21    6       =AVERAGE(C$2:C$15)
stdev       46.22    2.96    =STDEV(C$2:C$15)
count       14       14      =COUNT(C$2:C$15)
correl       0.9937          =CORREL(B$2:B$15,C$2:C$15)
covar      126.29            =COVAR(B$2:B$15,C$2:C$15)

correl   0.9937   =SUMPRODUCT(E2:E15,F2:F15)/SQRT(SUMPRODUCT(E2:E15,E2:E15)*SUMPRODUCT(F2:F15,F2:F15))   (Book method)
covar    136      =B20*(B18*C18)   (B6014 method)
covar    136      =SUMPRODUCT(E2:E15,F2:F15)/(B19-1)   (Book method)

Note: Excel's COVAR uses n in the denominator, giving the population covariance (126.29); the sample covariance, with n − 1, is 136.
Graphical Analysis
We should always plot:
• histograms of the y and x values,
• a time order plot of x and y (if appropriate), and
• a scatter plot of y on x.
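A sketch of those plots in Python with matplotlib (an assumption; the slides produce them in Excel). The time order plot is omitted here because the repair data has no meaningful time order:

```python
import matplotlib
matplotlib.use("Agg")   # render off-screen, no display needed
import matplotlib.pyplot as plt
import numpy as np

minutes = np.array([23, 29, 49, 64, 74, 87, 96, 97,
                    109, 119, 149, 145, 154, 166], dtype=float)
units = np.array([1, 2, 3, 4, 4, 5, 6, 6, 7, 8, 9, 9, 10, 10], dtype=float)

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
axes[0].hist(minutes, bins=range(0, 201, 25))   # histogram of y
axes[0].set_xlabel("Minutes")
axes[0].set_ylabel("Frequency")
axes[1].hist(units, bins=range(0, 12))          # histogram of x
axes[1].set_xlabel("Units")
axes[2].scatter(units, minutes)                 # scatter plot of y on x
axes[2].set_xlabel("Units")
axes[2].set_ylabel("Minutes")
fig.savefig("repair_plots.png")
```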
[Figure: histogram of Minutes, frequency vs. repair time, 0 to 200 minutes in bins of 25]
[Figure: histogram of Units, frequency vs. units serviced, 0 to 11]
[Figure: scatter plot, "Minutes vs. Units", Minutes (0 to 180) against Units (0 to 12)]
Estimating Parameters
• Using Excel
• Using Solver
• Using analytical formulas
Using Excel (Scatter Diagram)
[Figure: scatter plot, "Minutes vs. Units", with Excel trendline: y = 15.509x + 4.1617, R² = 0.9874]
Using Excel (Data Analysis)
Data Tab – Data Analysis
Using Excel (Data Analysis)
SUMMARY OUTPUT

Regression Statistics
Multiple R           0.9937
R Square             0.9874
Adjusted R Square    0.9864
Standard Error       5.3917
Observations        14

ANOVA
             df    SS           MS           F          Significance F
Regression    1    27419.5088   27419.5088   943.2009   0.0000
Residual     12      348.8484       29.0707
Total        13    27768.3571

            Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
Intercept        4.1617           3.3551     1.2404    0.2385     -3.1485     11.4718
Units           15.5088           0.5050    30.7116    0.0000     14.4085     16.6090
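The key numbers in this output can be reproduced outside Excel. A sketch using scipy.stats.linregress (an assumption; SciPy is not part of the course software):

```python
import numpy as np
from scipy.stats import linregress

minutes = np.array([23, 29, 49, 64, 74, 87, 96, 97,
                    109, 119, 149, 145, 154, 166], dtype=float)
units = np.array([1, 2, 3, 4, 4, 5, 6, 6, 7, 8, 9, 9, 10, 10], dtype=float)

# x first, then y; the result carries slope, intercept, rvalue,
# pvalue (for the slope), and stderr (standard error of the slope)
res = linregress(units, minutes)
```

res.slope and res.intercept match the Coefficients column, res.rvalue² matches R Square, and res.stderr matches the Standard Error entry for Units.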
Using Solver
 i   Minutes   Units   Prediction     Error     Error^2
 1      23       1       19.6704      3.3296     11.0861
 2      29       2       35.1792     -6.1792     38.1824
 3      49       3       50.6880     -1.6880      2.8492
 4      64       4       66.1967     -2.1967      4.8256
 5      74       4       66.1967      7.8033     60.8909
 6      87       5       81.7055      5.2945     28.0317
 7      96       6       97.2143     -1.2143      1.4745
 8      97       6       97.2143     -0.2143      0.0459
 9     109       7      112.7230     -3.7230     13.8611
10     119       8      128.2318     -9.2318     85.2265
11     149       9      143.7406      5.2594     27.6614
12     145       9      143.7406      1.2594      1.5861
13     154      10      159.2494     -5.2494     27.5558
14     166      10      159.2494      6.7506     45.5712
                                     Sum:       348.8484

Intercept    4.1617
Slope       15.5088

Formulas: Prediction =$B$17+$B$18*C3, Error =B5-E5, Error^2 =F7^2, sum of squared errors =SUM(G2:G15). Solver adjusts the intercept (B17) and slope (B18) cells to minimize the sum of squared errors.
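Solver's approach, numerically minimizing the sum of squared errors over the intercept and slope, can be mimicked with scipy.optimize.minimize (an assumption; not the course's tool):

```python
import numpy as np
from scipy.optimize import minimize

minutes = np.array([23, 29, 49, 64, 74, 87, 96, 97,
                    109, 119, 149, 145, 154, 166], dtype=float)
units = np.array([1, 2, 3, 4, 4, 5, 6, 6, 7, 8, 9, 9, 10, 10], dtype=float)

def sse(params):
    """Sum of squared errors for a candidate (intercept, slope)."""
    b0, b1 = params
    return np.sum((minutes - (b0 + b1 * units)) ** 2)

# Start from (0, 0), as one might in the Solver dialog
result = minimize(sse, x0=np.array([0.0, 0.0]))
b0, b1 = result.x
```

Because the objective is a smooth quadratic, the optimizer lands on the same estimates as the closed-form formulas and the Excel output.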
[The same worksheet after running Solver: the decision cells converge to Intercept 4.1617 and Slope 15.5088, with minimized sum of squared errors 348.8484.]
Using Formulas
$$\hat\beta_1 = \frac{\sum_i (y_i - \bar{y})(x_i - \bar{x})}{\sum_i (x_i - \bar{x})^2} \qquad \text{(RABE 2.13)}$$

$$\hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x} \qquad \text{(RABE 2.14)}$$
 i   Minutes   Units   Error (min)   Error (units)
 1      23       1      -74.2143         -5
 2      29       2      -68.2143         -4
 3      49       3      -48.2143         -3
 4      64       4      -33.2143         -2
 5      74       4      -23.2143         -2
 6      87       5      -10.2143         -1
 7      96       6       -1.2143          0
 8      97       6       -0.2143          0
 9     109       7       11.7857          1
10     119       8       21.7857          2
11     149       9       51.7857          3
12     145       9       47.7857          3
13     154      10       56.7857          4
14     166      10       68.7857          4

mean        97.21429     6

Slope      =SUMPRODUCT(E2:E15,F2:F15)/(SUMPRODUCT(F2:F15,F2:F15))   15.50877   (Eq. 2.13)
Intercept  =B18-F18*C18                                              4.161654   (Eq. 2.14)
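The same closed-form estimates (RABE Eqs. 2.13 and 2.14) in Python (an assumption; the slide computes them with SUMPRODUCT in Excel):

```python
import numpy as np

minutes = np.array([23, 29, 49, 64, 74, 87, 96, 97,
                    109, 119, 149, 145, 154, 166], dtype=float)
units = np.array([1, 2, 3, 4, 4, 5, 6, 6, 7, 8, 9, 9, 10, 10], dtype=float)

dev_y = minutes - minutes.mean()   # the "Error (min)" column
dev_x = units - units.mean()       # the "Error (units)" column

# Eq. 2.13: slope; Eq. 2.14: intercept
b1 = np.sum(dev_y * dev_x) / np.sum(dev_x ** 2)
b0 = minutes.mean() - b1 * units.mean()
```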
Correlation and Regression
There is a close relationship between regression and correlation. The correlation coefficient, ρ, measures the degree to which the random variables X and Y move together.

ρ = +1 implies a perfect positive linear relationship, while ρ = −1 implies a perfect negative linear relationship. ρ = 0 implies no linear relationship (and, for jointly normal variables, independence).
Statistical Basics: Covariance
The covariance can be calculated using:

$$\mathrm{Cov}(X, Y) = E\big[(X - \mu_X)(Y - \mu_Y)\big]$$

or equivalently

$$\mathrm{Cov}(X, Y) = E[XY] - \mu_X \mu_Y$$

Usually, we find it more useful to consider the coefficient of correlation. That is,

$$\mathrm{Corr}(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y}$$

Sometimes the inverse relation is useful:

$$\mathrm{Cov}(X, Y) = \sigma_X \sigma_Y \,\mathrm{Corr}(X, Y)$$
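These identities can be checked numerically on the repair data, using population moments (n in the denominator). A Python sketch (an assumption; the slides stay in Excel):

```python
import numpy as np

minutes = np.array([23, 29, 49, 64, 74, 87, 96, 97,
                    109, 119, 149, 145, 154, 166], dtype=float)
units = np.array([1, 2, 3, 4, 4, 5, 6, 6, 7, 8, 9, 9, 10, 10], dtype=float)

# Cov(X, Y) = E[(X - mu_X)(Y - mu_Y)]
cov1 = np.mean((units - units.mean()) * (minutes - minutes.mean()))
# Cov(X, Y) = E[XY] - mu_X * mu_Y
cov2 = np.mean(units * minutes) - units.mean() * minutes.mean()
# Corr(X, Y) = Cov(X, Y) / (sigma_X * sigma_Y), population sigmas
corr = cov1 / (units.std() * minutes.std())
```

The two covariance formulas agree, and multiplying the correlation back by the standard deviations recovers the covariance, the "inverse relation" above.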
Correlation and Regression
• The sample (Pearson) correlation coefficient is

$$r_{X,Y} = \frac{\sum_i (y_i - \bar{y})(x_i - \bar{x})}{\sqrt{\sum_i (y_i - \bar{y})^2 \sum_i (x_i - \bar{x})^2}}, \qquad -1 \le r_{X,Y} \le 1$$

• Regressions automatically produce an estimate of the squared correlation, called R² or R-square. Values of R-square close to 1 indicate a strong relationship, while values close to 0 indicate a weak or non-existent relationship.
Some Validity Issues
• We need to evaluate the strength of the relationship,
whether we have the proper functional form, and the
validity of the several statistical assumptions from a
practical and theoretical viewpoint using a multiplicity
of tools.
• Fitted regression functions are interpolations of the data
in hand, and extrapolation is always dangerous.
Moreover, the functional form that fits the data in our
range of “experience” may not fit beyond it.
• Regressions are based on past data. Why should the
same functional form and parameters hold in the
future?
• In some uses of regression the future value of x may not
be known – this adds greatly to our uncertainty.
• In collecting data to do a regression choose x values
wisely – when you have a choice. They should:
– Be in the range where you intend to work
– Be spread out along the range with some observations near
practical extremes
– Have replicated values at the same x or at very nearby x values for good estimation of σ
• Whenever possible test the stability of your model with
a “holdout” sample, not used in the original model
fitting.
Summary
• Course Objectives & Description
• Review of Basic Statistical Ideas
– Intercept, Slope, Correlation, Causality
• Simple Linear Regression
– Statistical Model and Concepts
– Regression in Excel
• Computer Repair Example