Agenda SMA 6304 / MIT 2.853 / MIT 2.854 Analysis

advertisement
SMA 6304 / MIT 2.853 / MIT 2.854
Manufacturing Systems
Lecture 10: Data and Regression
Analysis
Lecturer: Prof. Duane S. Boning
1
Copyright 2003 © Duane S. Boning.
Agenda
1.
Comparison of Treatments (One Variable)
•
2.
Analysis of Variance (ANOVA)
Multivariate Analysis of Variance
•
3.
Model forms
Regression Modeling
•
•
•
Regression fundamentals
Significance of model terms
Confidence intervals
2
Copyright 2003 © Duane S. Boning.
Is Process B Better Than Process A?
time
order
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
method
A
A
A
A
A
A
A
A
A
A
B
B
B
B
B
B
B
B
B
B
yield
89.7
81.4
84.5
84.8
87.3
79.7
85.1
81.7
83.7
84.5
84.7
86.1
83.2
91.9
86.3
79.3
82.6
89.1
83.7
88.5
yield
92
90
88
86
84
82
80
78
Copyright 2003 © Duane S. Boning.
0
1
2
3
4
5
6
7
8
9
10 11 12 13 14 15 16 17 18 19 20
time order
3
1
Two Means with Internal Estimate of Variance
Method A
Method B
Pooled estimate of σ2
with ν=18 d.o.f
Estimated variance
of
Estimated standard error
of
So only about 80% confident that
mean difference is “real” (signficant)
4
Copyright 2003 © Duane S. Boning.
Comparison of Treatments
Population A
Population C
Population B
Sample A
Sample B
Sample C
•
Consider multiple conditions (treatments, settings for some variable)
•
Key question: are the observed differences in mean “significant”?
– There is an overall mean µ and real “effects” or deltas between conditions τi.
– We observe samples at each condition of interest
– Typical assumption (should be checked): the underlying variances are all the
same – usually an unknown value (σ02)
Copyright 2003 © Duane S. Boning.
5
Steps/Issues in Analysis of Variance
1. Within group variation
– Estimates underlying population variance
2. Between group variation
– Estimate group to group variance
3. Compare the two estimates of variance
– If there is a difference between the different treatments,
then the between group variation estimate will be inflated
compared to the within group estimate
– We will be able to establish confidence in whether or not
observed differences between treatments are significant
Hint: we’ll be using F tests to look at ratios of variances
Copyright 2003 © Duane S. Boning.
6
2
(1) Within Group Variation
• Assume that each group is normally distributed and shares a
common variance σ02
• SSt = sum of square deviations within tth group (there are k groups)
• Estimate of within group variance in tth group (just variance formula)
• Pool these (across different conditions) to get estimate of common
within group variance:
• This is the within group “mean square” (variance estimate)
7
Copyright 2003 © Duane S. Boning.
(2) Between Group Variation
• We will be testing hypothesis µ1 = µ2 = … = µk
• If all the means are in fact equal, then a 2nd estimate
of σ2 could be formed based on the observed
differences between group means:
• If all the treatments in fact have different means, then
sT2 estimates something larger:
Variance is “inflated” by the
real treatment effects τt
Copyright 2003 © Duane S. Boning.
8
(3) Compare Variance Estimates
• We now have two different possibilities for sT2,
depending on whether the observed sample mean
differences are “real” or are just occurring by chance
(by sampling)
• Use F statistic to see if the ratios of these variances
are likely to have occurred by chance!
• Formal test for significance:
Copyright 2003 © Duane S. Boning.
9
3
(4) Compute Significance Level
• Calculate observed F ratio (with appropriate
degrees of freedom in numerator and
denominator)
• Use F distribution to find how likely a ratio this
large is to have occurred by chance alone
– This is our “significance level”
– If
then we say that the mean differences or treatment
effects are significant to (1-α)100% confidence or
better
10
Copyright 2003 © Duane S. Boning.
(5) Variance Due to Treatment Effects
• We also want to estimate the sum of squared
deviations from the grand mean among all
samples:
11
Copyright 2003 © Duane S. Boning.
(6) Results: The ANOVA Table
source of sum of
variation squares
degrees
of
freedom
mean square
F0
Pr(F0)
Between
treatments
Within
treatments
Also referred to
as “residual” SS
Total about
the grand
average
Copyright 2003 © Duane S. Boning.
12
4
Example: Anova
A
B
11
10
12
C
10
8
6
12
12
10
11
10
8
6
Excel: Data Analysis, One-Variation Anova
A
Anova: S ingle Factor
SUMMARY
Groups
A
B
C
Count
Sum
3
3
3
ANOVA
Source of Variation
Between Groups
Within Groups
SS
Total
33
24
33
df
2
6
30
8
C
Average Variance
11
1
8
4
11
1
MS
18
12
B
F
9
2
4.5
P-value
0.064
F crit
5.14
Copyright 2003 © Duane S. Boning.
13
ANOVA – Implied Model
• The ANOVA approach assumes a simple mathematical
model:
•
•
•
•
Where µt is the treatment mean (for treatment type t)
And τt is the treatment effect
With εti being zero mean normal residuals ~N(0,σ02)
Checks
–
–
–
–
Plot residuals against time order
Examine distribution of residuals: should be IID, Normal
Plot residuals vs. estimates
Plot residuals vs. other variables of interest
Copyright 2003 © Duane S. Boning.
14
MANOVA – Two Dependencies
• Can extend to two (or more) variables of interest. MANOVA
assumes a mathematical model, again simply capturing the means
(or treatment offsets) for each discrete variable level:
• Assumes that the effects from the two variables are additive
Copyright 2003 © Duane S. Boning.
15
5
Example: Two Factor MANOVA
•
Two LPCVD deposition tube types, three gas suppliers. Does supplier matter
in average particle counts on wafers?
– Experiment: 3 lots on each tube, for each gas; report average # particles added
Analysis of Variance
Factor 1
Gas
Factor 2 1
2
Tube
A
B
C
7
13
36
44
2
18
10
40
10
7 36 2
13 44 18
Source
Model
Error
C. Total
15
25
DF Sum of Squares Mean Square
450.0
1350.00
3
14.0
28.00
2
1378.00
5
F Ratio
32.14
Prob > F
0.0303
Effect Tests
Source
Tube
Gas
20 20 20
20 20 20
-10 20 -10
-10 20 -10
Nparm
1
2
-5
5
DF Sum of Squares
150.00
1
1200.00
2
-5 -5
5 5
2
-2
F Ratio
10.71
42.85
Prob > F
0.0820
0.0228
1 -3
-1 3
16
Copyright 2003 © Duane S. Boning.
MANOVA – Two Factors with Interactions
• May be interaction: not simply additive – effects may depend
synergistically on both factors:
2
IID, ~N(0,σ )
An effect that depends on both
t & i factors simultaneously
t = first factor = 1,2, … k
(k = # levels of first factor)
i = second factor = 1,2, … n (n = # levels of second factor)
j = replication = 1,2, … m
(m = # replications at t, jth combination of factor levels
• Can split out the model more explicitly…
Estimate by:
17
Copyright 2003 © Duane S. Boning.
MANOVA Table – Two Way with Interactions
source of
variation
sum of
squares
degrees
of
freedom
mean square
F0
Pr(F0)
Between levels
of factor 1 (T)
Between levels
of factor 2 (B)
Interaction
Within Groups
(Error)
Total about
the grand
average
Copyright 2003 © Duane S. Boning.
18
6
Measures of Model Goodness – R2
• Goodness of fit – R2
– Question considered: how much better does the model do that just
using the grand average?
– Think of this as the fraction of squared deviations (from the grand
average) in the data which is captured by the model
• Adjusted R2
– For “fair” comparison between models with different numbers of
coefficients, an alternative is often used
– Think of this as (1 – variance remaining in the residual).
Recall νR = νD - νT
Copyright 2003 © Duane S. Boning.
19
Regression Fundamentals
• Use least square error as measure of goodness to estimate coefficients in a model
• One parameter model:
–
–
–
–
–
–
–
–
Model form
Squared error
Estimation using normal equations
Estimate of experimental error
Precision of estimate: variance in b
Confidence interval for β
Analysis of variance: significance of b
Lack of fit vs. pure error
• Polynomial regression
Copyright 2003 © Duane S. Boning.
20
Least Squares Regression
• We use least-squares to estimate
coefficients in typical regression models
• One-Parameter Model:
• Goal is to estimate β with “best” b
• How define “best”?
– That b which minimizes sum of squared
error between prediction and data
– The residual sum of squares (for the
best estimate) is
Copyright 2003 © Duane S. Boning.
21
7
Least Squares Regression, cont.
• Least squares estimation via normal
equations
– For linear problems, we need not
calculate SS(β); rather, direct solution for
b is possible
– Recognize that vector of residuals will be
normal to vector of x values at the least
squares estimate
• Estimate of experimental error
– Assuming model structure is adequate,
estimate s2 of σ2 can be obtained:
Copyright 2003 © Duane S. Boning.
22
Precision of Estimate: Variance in b
• We can calculate the variance in our estimate of the slope, b:
• Why?
Copyright 2003 © Duane S. Boning.
23
Confidence Interval for β
• Once we have the standard error in b, we can calculate confidence
intervals to some desired (1-α)100% level of confidence
• Analysis of variance
– Test hypothesis:
– If confidence interval for β includes 0, then β not significant
– Degrees of freedom (need in order to use t distribution)
p = # parameters estimated
by least squares
Copyright 2003 © Duane S. Boning.
24
8
income Leverage Residuals
Example Regression
Age
Income
8
6.16
22
9.88
35
14.35
40
24.06
57
30.34
73
32.17
78
42.18
87
43.23
98
48.76
Whole Model
Analysis of Variance
Sum of Squares
8836.6440
64.6695
8901.3135
Mean Square
8836.64
8.08
F Ratio
1093.146
Prob > F
<.0001
Tested against reduced model: Y=0
Parameter Estimates
Term
Intercept
age
Zeroed
Estimate
0
0.500983
Std Error
0
0.015152
t Ratio
.
33.06
Prob>|t|
.
<.0001
Effect Tests
Source
age
50
DF
1
8
9
Source
Model
Error
C. Total
Nparm
1
DF
1
Sum of Squares
8836.6440
F Ratio
1093.146
Prob > F
<.0001
40
30
• Note that this simple model assumes an intercept of
zero – model must go through origin
20
10
0
0
25
50
75
100
age Leverage, P<.0001
• We will relax this requirement soon
25
Copyright 2003 © Duane S. Boning.
Lack of Fit Error vs. Pure Error
• Sometimes we have replicated data
– E.g. multiple runs at same x values in a designed experiment
• We can decompose the residual error contributions
• This allows us to TEST for lack of fit
Where
SSR = residual sum of squares error
SSL = lack of fit squared error
SSE = pure replicate error
– By “lack of fit” we mean evidence that the linear model form is
inadequate
Copyright 2003 © Duane S. Boning.
26
Regression: Mean Centered Models
• Model form
• Estimate by
Copyright 2003 © Duane S. Boning.
27
9
Polynomial Regression
Analysis of Variance
Source
Model
Error
C. Total
DF
2
7
9
Sum of Squares Mean Square
665.70617
332.853
45.19383
6.456
710.90000
F Ratio
51.5551
Prob > F
<.0001
• Generated using JMP package
Lack Of Fit
Source
Lack Of Fit
Pure Error
Total Error
DF
3
4
7
Sum of Squares Mean Square
18.193829
6.0646
27.000000
6.7500
45.193829
F Ratio
0.8985
Prob > F
0.5157
Max RSq
0.9620
Summary of Fit
RSquare
RSquare Adj
Root Mean Sq Error
Mean of Response
Observations (or Sum Wgts)
0.936427
0.918264
2.540917
82.1
10
Parameter Estimates
Term
Intercept
x
x*x
Estimate
35.657437
5.2628956
-0.127674
Std Error
5.617927
0.558022
0.012811
t Ratio
6.35
9.43
-9.97
Prob>|t|
0.0004
<.0001
<.0001
Effect Tests
Source
x
x*x
Nparm
1
1
DF
1
1
Sum of Squares
574.28553
641.20451
F Ratio
88.9502
99.3151
Prob > F
<.0001
<.0001
34
Copyright 2003 © Duane S. Boning.
Summary
•
•
•
Comparison of Treatments – ANOVA
Multivariate Analysis of Variance
Regression Modeling
Next Time
•
•
Time Series Models
Forecasting
Copyright 2003 © Duane S. Boning.
35
12
Download