SMA 6304 / MIT 2.853 / MIT 2.854 Manufacturing Systems
Lecture 10: Data and Regression Analysis
Lecturer: Prof. Duane S. Boning
Copyright 2003 © Duane S. Boning.

Agenda
1. Comparison of Treatments (One Variable)
   • Analysis of Variance (ANOVA)
2. Multivariate Analysis of Variance
   • Model forms
3. Regression Modeling
   • Regression fundamentals
   • Significance of model terms
   • Confidence intervals

Is Process B Better Than Process A?
  time order 1-10,  method A, yield: 89.7 81.4 84.5 84.8 87.3 79.7 85.1 81.7 83.7 84.5
  time order 11-20, method B, yield: 84.7 86.1 83.2 91.9 86.3 79.3 82.6 89.1 83.7 88.5
[Figure: yield (78 to 92) vs. time order (1 to 20)]

Two Means with Internal Estimate of Variance
• Method A: $n_A = 10$, $\bar{y}_A = 84.24$, $s_A^2 = 8.42$
• Method B: $n_B = 10$, $\bar{y}_B = 85.54$, $s_B^2 = 13.32$
• Pooled estimate of $\sigma^2$ with $\nu = 18$ d.o.f.: $s^2 = \frac{(n_A - 1) s_A^2 + (n_B - 1) s_B^2}{n_A + n_B - 2} = 10.87$
• Estimated variance of $\bar{y}_B - \bar{y}_A$: $s^2 \left( \frac{1}{n_A} + \frac{1}{n_B} \right) = 2.17$
• Estimated standard error of $\bar{y}_B - \bar{y}_A$: $\sqrt{2.17} = 1.47$
• Test statistic: $t_0 = (85.54 - 84.24)/1.47 = 0.88$, with $\Pr(t_{18} > 0.88) \approx 0.195$
• So we are only about 80% confident that the mean difference is "real" (significant)

Comparison of Treatments
[Figure: populations A, B, and C, with samples A, B, and C drawn from them]
• Consider multiple conditions (treatments, settings for some variable)
• Key question: are the observed differences in mean "significant"?
  – There is an overall mean $\mu$ and real "effects" or deltas $\tau_t$ between conditions
  – We observe samples at each condition of interest
  – Typical assumption (should be checked): the underlying variances are all the same, usually an unknown value ($\sigma_0^2$)

Steps/Issues in Analysis of Variance
1. Within group variation
   – Estimates the underlying population variance
2. Between group variation
   – Estimates the group to group variance
3. Compare the two estimates of variance
   – If there is a difference between the treatments, then the between group variation estimate will be inflated compared to the within group estimate
   – We will be able to establish confidence in whether or not observed differences between treatments are significant
Hint: we'll be using F tests to look at ratios of variances

(1) Within Group Variation
• Assume that each group is normally distributed and shares a common variance $\sigma_0^2$
• $SS_t$ = sum of squared deviations within the $t$-th group (there are $k$ groups): $SS_t = \sum_{j=1}^{n_t} (y_{tj} - \bar{y}_t)^2$
• Estimate of within group variance in the $t$-th group (just the variance formula): $s_t^2 = \frac{SS_t}{\nu_t} = \frac{SS_t}{n_t - 1}$
• Pool these (across different conditions) to get an estimate of the common within group variance: $s_R^2 = \frac{\nu_1 s_1^2 + \nu_2 s_2^2 + \dots + \nu_k s_k^2}{\nu_1 + \nu_2 + \dots + \nu_k} = \frac{SS_R}{N - k}$, where $N = \sum_t n_t$
• This is the within group "mean square" (variance estimate)

(2) Between Group Variation
• We will be testing the hypothesis $\mu_1 = \mu_2 = \dots = \mu_k$
• If all the means are in fact equal, then a second estimate of $\sigma_0^2$ can be formed from the observed differences between group means: $s_T^2 = \frac{\sum_{t=1}^{k} n_t (\bar{y}_t - \bar{y})^2}{k - 1}$
• If the treatments in fact have different means, then $s_T^2$ estimates something larger: $\sigma_0^2 + \frac{\sum_{t=1}^{k} n_t \tau_t^2}{k - 1}$
  – The variance is "inflated" by the real treatment effects $\tau_t$

(3) Compare Variance Estimates
• We now have two different possibilities for $s_T^2$, depending on whether the observed sample mean differences are "real" or are just occurring by chance (by sampling)
• Use the F statistic to see whether the ratio of these variances is likely to have occurred by chance
• Formal test for significance: the treatment differences are significant at level $\alpha$ if $\frac{s_T^2}{s_R^2} > F_{\alpha,\, k-1,\, N-k}$
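Steps (1) to (3) can be sketched in a few lines of Python. The block below is only an illustration, not part of the original slides: it applies the within group and between group formulas to the Process A / Process B yields from earlier in the lecture, treating the two methods as $k = 2$ treatments. The function name `variance_estimates` and the variable names are assumptions chosen for this sketch. With only two groups, the F ratio should equal the square of the pooled two-sample t statistic, which the last lines check.

```python
from math import sqrt
from statistics import mean

# Yields from the Process A / Process B comparison earlier in the lecture.
groups = {
    "A": [89.7, 81.4, 84.5, 84.8, 87.3, 79.7, 85.1, 81.7, 83.7, 84.5],
    "B": [84.7, 86.1, 83.2, 91.9, 86.3, 79.3, 82.6, 89.1, 83.7, 88.5],
}


def variance_estimates(groups):
    """Return (within mean square, between mean square, F ratio).

    Hypothetical helper for this sketch; not from the lecture.
    """
    k = len(groups)
    n_total = sum(len(ys) for ys in groups.values())
    grand_mean = sum(sum(ys) for ys in groups.values()) / n_total

    ss_within = 0.0   # step (1): pooled sum of squares inside each group
    ss_between = 0.0  # step (2): weighted squared deviations of group means
    for ys in groups.values():
        group_mean = mean(ys)
        ss_within += sum((y - group_mean) ** 2 for y in ys)
        ss_between += len(ys) * (group_mean - grand_mean) ** 2

    s_r2 = ss_within / (n_total - k)   # within group mean square
    s_t2 = ss_between / (k - 1)        # between group mean square
    return s_r2, s_t2, s_t2 / s_r2     # step (3): ratio of the two estimates


s_r2, s_t2, f_ratio = variance_estimates(groups)

# Pooled two-sample t statistic for the same data, for comparison.
n_a, n_b = len(groups["A"]), len(groups["B"])
std_err = sqrt(s_r2 * (1 / n_a + 1 / n_b))
t_stat = (mean(groups["B"]) - mean(groups["A"])) / std_err

print(f"within group mean square  = {s_r2:.2f}")
print(f"between group mean square = {s_t2:.2f}")
print(f"F ratio = {f_ratio:.3f}, t statistic = {t_stat:.3f}, t^2 = {t_stat**2:.3f}")
```

The F ratio comes out well below 1, consistent with the earlier conclusion that the difference between the two methods is not clearly significant.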
(4) Compute Significance Level
• Calculate the observed F ratio $F_0 = s_T^2 / s_R^2$ (with the appropriate degrees of freedom in numerator and denominator: $\nu_T = k - 1$ and $\nu_R = N - k$)
• Use the F distribution to find how likely a ratio this large is to have occurred by chance alone
  – This is our "significance level"
  – If $\Pr(F_{\nu_T, \nu_R} \ge F_0) \le \alpha$, then we say that the mean differences or treatment effects are significant to $(1-\alpha)100\%$ confidence or better

(5) Variance Due to Treatment Effects
• We also want to estimate the sum of squared deviations from the grand mean among all samples: $SS_D = \sum_{t=1}^{k} \sum_{j=1}^{n_t} (y_{tj} - \bar{y})^2 = SS_T + SS_R$, with $\nu_D = N - 1$ degrees of freedom

(6) Results: The ANOVA Table
  source of variation            sum of squares   degrees of freedom   mean square   F0                     Pr(F0)
  Between treatments             $SS_T$           $\nu_T = k - 1$      $s_T^2$       $s_T^2 / s_R^2$        $\Pr(F_0)$
  Within treatments              $SS_R$           $\nu_R = N - k$      $s_R^2$
  Total about the grand average  $SS_D$           $\nu_D = N - 1$
The within treatments sum of squares is also referred to as the "residual" SS.

Example: Anova
  A: 11 10 12
  B: 10  8  6
  C: 12 10 11
[Figure: observations for groups A, B, and C, vertical axis 6 to 12]

Excel: Data Analysis, One-Variation Anova

Anova: Single Factor

SUMMARY
  Groups   Count   Sum   Average   Variance
  A        3       33    11        1
  B        3       24    8         4
  C        3       33    11        1

ANOVA
  Source of Variation   SS   df   MS   F     P-value   F crit
  Between Groups        18   2    9    4.5   0.064     5.14
  Within Groups         12   6    2
  Total                 30   8

ANOVA – Implied Model
• The ANOVA approach assumes a simple mathematical model: $y_{ti} = \mu_t + \epsilon_{ti} = \mu + \tau_t + \epsilon_{ti}$
• Where $\mu_t$ is the treatment mean (for treatment type $t$)
• And $\tau_t$ is the treatment effect
• With $\epsilon_{ti}$ being zero mean normal residuals $\sim N(0, \sigma_0^2)$
• Checks
  – Plot residuals against time order
  – Examine the distribution of residuals: should be IID, normal
  – Plot residuals vs. estimates
  – Plot residuals vs. other variables of interest

MANOVA – Two Dependencies
• Can extend to two (or more) variables of interest. MANOVA assumes a mathematical model, again simply capturing the means (or treatment offsets) for each discrete variable level: $y_{ti} = \mu + \tau_t + \beta_i + \epsilon_{ti}$
• Assumes that the effects from the two variables are additive
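As a rough check on the single factor example (groups A, B, and C), the ANOVA table can be rebuilt by hand. This is a minimal sketch using only the Python standard library, not the Excel procedure from the slide. The names `one_way_anova` and `f_tail_two_numerator_dof` are hypothetical. The tail probability uses a closed form that holds only when the numerator has exactly 2 degrees of freedom, $\Pr(F_{2,\nu} > f) = (1 + 2f/\nu)^{-\nu/2}$; that happens to fit this example but is not a general F distribution routine.

```python
from statistics import mean

# Observations from the lecture's single factor example.
data = {"A": [11, 10, 12], "B": [10, 8, 6], "C": [12, 10, 11]}


def one_way_anova(groups):
    """Build the one-way ANOVA quantities (hypothetical helper for this sketch)."""
    k = len(groups)
    n_total = sum(len(ys) for ys in groups.values())
    grand_mean = sum(sum(ys) for ys in groups.values()) / n_total

    ss_between = sum(len(ys) * (mean(ys) - grand_mean) ** 2 for ys in groups.values())
    ss_within = sum((y - mean(ys)) ** 2 for ys in groups.values() for y in ys)
    ss_total = sum((y - grand_mean) ** 2 for ys in groups.values() for y in ys)

    df_between, df_within = k - 1, n_total - k
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within
    return {
        "ss_between": ss_between, "df_between": df_between, "ms_between": ms_between,
        "ss_within": ss_within, "df_within": df_within, "ms_within": ms_within,
        "ss_total": ss_total, "df_total": n_total - 1,
        "f0": ms_between / ms_within,
    }


def f_tail_two_numerator_dof(f0, df_den):
    """Pr(F > f0) for an F distribution with exactly 2 numerator d.o.f.

    Special-case closed form; valid only because df_between == 2 here.
    """
    return (1 + 2 * f0 / df_den) ** (-df_den / 2)


table = one_way_anova(data)
p_value = f_tail_two_numerator_dof(table["f0"], table["df_within"])

print(f"Between: SS={table['ss_between']:.0f} df={table['df_between']} MS={table['ms_between']:.1f}")
print(f"Within:  SS={table['ss_within']:.0f} df={table['df_within']} MS={table['ms_within']:.1f}")
print(f"Total:   SS={table['ss_total']:.0f} df={table['df_total']}")
print(f"F0 = {table['f0']:.2f}, Pr(F0) = {p_value:.3f}")
```

The sums of squares (18, 12, 30), the F ratio of 4.5, and the P-value of 0.064 should agree with the Excel output on the slide. Since 4.5 is below the critical value 5.14, the treatment differences are not significant at the 95% level.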
Example: Two Factor MANOVA
• Two LPCVD deposition tube types, three gas suppliers. Does supplier matter in average particle counts on wafers?
  – Experiment: 3 lots on each tube, for each gas; report average # particles added

  Factor 1: Gas, Factor 2: Tube
  Tube \ Gas    A    B    C    tube average
  1             7   36    2    15
  2            13   44   18    25
  gas average  10   40   10

  Decomposition of the data (rows are tubes 1 and 2, columns are gases A, B, C):
     data          grand mean      gas effects      tube effects     residuals
   7  36   2   =   20  20  20  +  -10  20  -10  +   -5  -5  -5   +    2   1  -3
  13  44  18       20  20  20     -10  20  -10       5   5   5       -2  -1   3

  Analysis of Variance
  Source     DF   Sum of Squares   Mean Square   F Ratio   Prob > F
  Model      3    1350.00          450.0         32.14     0.0303
  Error      2      28.00           14.0
  C. Total   5    1378.00

  Effect Tests
  Source   Nparm   DF   Sum of Squares   F Ratio   Prob > F
  Tube     1       1     150.00          10.71     0.0820
  Gas      2       2    1200.00          42.85     0.0228

MANOVA – Two Factors with Interactions
• There may be an interaction: the effects are not simply additive, and may depend synergistically on both factors: $y_{tij} = \mu_{ti} + \epsilon_{tij}$
  – $\epsilon_{tij}$ are IID, $\sim N(0, \sigma^2)$
  – $\mu_{ti}$ is an effect that depends on both the $t$ and $i$ factors simultaneously
  – $t$ = first factor = 1, 2, … $k$ ($k$ = # levels of first factor)
  – $i$ = second factor = 1, 2, … $n$ ($n$ = # levels of second factor)
  – $j$ = replication = 1, 2, … $m$ ($m$ = # replications at the $t$, $i$-th combination of factor levels)
• Can split out the model more explicitly: $y_{tij} = \mu + \tau_t + \beta_i + \omega_{ti} + \epsilon_{tij}$
• Estimate by: $\hat{\mu} = \bar{y}$, $\hat{\tau}_t = \bar{y}_t - \bar{y}$, $\hat{\beta}_i = \bar{y}_i - \bar{y}$, $\hat{\omega}_{ti} = \bar{y}_{ti} - \bar{y}_t - \bar{y}_i + \bar{y}$

MANOVA Table – Two Way with Interactions
  source of variation                  sum of squares   degrees of freedom   mean square   F0                Pr(F0)
  Between levels of factor 1 (T)       $SS_T$           $k - 1$              $s_T^2$       $s_T^2 / s_E^2$   $\Pr(F_0)$
  Between levels of factor 2 (B)       $SS_B$           $n - 1$              $s_B^2$       $s_B^2 / s_E^2$   $\Pr(F_0)$
  Interaction                          $SS_I$           $(k - 1)(n - 1)$     $s_I^2$       $s_I^2 / s_E^2$   $\Pr(F_0)$
  Within Groups (Error)                $SS_E$           $nk(m - 1)$          $s_E^2$
  Total about the grand average        $SS_D$           $nkm - 1$

Measures of Model Goodness – R2
• Goodness of fit – $R^2$
  – Question considered: how much better does the model do than just using the grand average?
  – $R^2 = \frac{SS_T}{SS_D} = 1 - \frac{SS_R}{SS_D}$
  – Think of this as the fraction of squared deviations (from the grand average) in the data which is captured by the model
• Adjusted $R^2$
  – For "fair" comparison between models with different numbers of coefficients, an alternative is often used
  – $R^2_{adj} = 1 - \frac{SS_R / \nu_R}{SS_D / \nu_D} = 1 - \frac{s_R^2}{s_D^2}$
  – Think of this as (1 – fraction of variance remaining in the residual). Recall $\nu_R = \nu_D - \nu_T$
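The additive two factor decomposition and the goodness measures above can be illustrated with the tube and gas particle counts from the lecture's example. The sketch below is a hand calculation in plain Python, not the JMP procedure that produced the slide's tables. All variable names are assumptions made for illustration. It has one observation per cell, so it fits the additive model only and has no interaction term.

```python
# Average particle counts from the lecture example: counts[tube][gas].
counts = {
    1: {"A": 7, "B": 36, "C": 2},
    2: {"A": 13, "B": 44, "C": 18},
}
tubes = list(counts)
gases = list(counts[tubes[0]])
n_obs = len(tubes) * len(gases)

# Additive model: y = grand mean + tube effect + gas effect + residual.
grand = sum(counts[t][g] for t in tubes for g in gases) / n_obs
tube_effect = {t: sum(counts[t].values()) / len(gases) - grand for t in tubes}
gas_effect = {g: sum(counts[t][g] for t in tubes) / len(tubes) - grand for g in gases}
residual = {
    (t, g): counts[t][g] - (grand + tube_effect[t] + gas_effect[g])
    for t in tubes for g in gases
}

# Sums of squares for each source of variation.
ss_tube = len(gases) * sum(e ** 2 for e in tube_effect.values())
ss_gas = len(tubes) * sum(e ** 2 for e in gas_effect.values())
ss_resid = sum(r ** 2 for r in residual.values())
ss_total = sum((counts[t][g] - grand) ** 2 for t in tubes for g in gases)

# Degrees of freedom and F ratios against the residual mean square.
df_tube, df_gas, df_total = len(tubes) - 1, len(gases) - 1, n_obs - 1
df_resid = df_total - df_tube - df_gas
ms_resid = ss_resid / df_resid
f_tube = (ss_tube / df_tube) / ms_resid
f_gas = (ss_gas / df_gas) / ms_resid

# Goodness of fit: R^2 and adjusted R^2.
r2 = 1 - ss_resid / ss_total
r2_adj = 1 - ms_resid / (ss_total / df_total)

print(f"SS tube={ss_tube:.0f}, SS gas={ss_gas:.0f}, SS error={ss_resid:.0f}, SS total={ss_total:.0f}")
print(f"F tube={f_tube:.2f}, F gas={f_gas:.2f}")
print(f"R^2={r2:.4f}, adjusted R^2={r2_adj:.4f}")
```

The sums of squares (150, 1200, 28, 1378) and the tube F ratio should match the slide's Effect Tests table. The gas F ratio comes out as 600/14 ≈ 42.86, which the slide shows as 42.85. The $R^2$ values are not reported on the slide and come only from this sketch.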
Regression Fundamentals
• Use least square error as the measure of goodness to estimate coefficients in a model
• One parameter model:
  – Model form
  – Squared error
  – Estimation using normal equations
  – Estimate of experimental error
  – Precision of estimate: variance in b
  – Confidence interval for β
  – Analysis of variance: significance of b
  – Lack of fit vs. pure error
• Polynomial regression

Least Squares Regression
• We use least squares to estimate coefficients in typical regression models
• One-Parameter Model: $y_i = \beta x_i + \epsilon_i$, with prediction $\hat{y}_i = b x_i$
• Goal is to estimate $\beta$ with the "best" $b$
• How do we define "best"?
  – That $b$ which minimizes the sum of squared error between prediction and data: $SS(\beta) = \sum_{i=1}^{n} (y_i - \beta x_i)^2$
  – The residual sum of squares (for the best estimate) is $SS_R = \sum_{i=1}^{n} (y_i - b x_i)^2$

Least Squares Regression, cont.
• Least squares estimation via normal equations
  – For linear problems, we need not calculate $SS(\beta)$; rather, a direct solution for $b$ is possible
  – Recognize that the vector of residuals will be normal to the vector of $x$ values at the least squares estimate: $\sum_i (y_i - b x_i) x_i = 0$, so $b = \frac{\sum_i x_i y_i}{\sum_i x_i^2}$
• Estimate of experimental error
  – Assuming the model structure is adequate, an estimate $s^2$ of $\sigma^2$ can be obtained: $s^2 = \frac{SS_R}{n - 1}$

Precision of Estimate: Variance in b
• We can calculate the variance in our estimate of the slope, $b$: $\hat{V}(b) = \frac{s^2}{\sum_i x_i^2}$, so the standard error is $s.e.(b) = \sqrt{s^2 / \sum_i x_i^2}$
• Why? $b$ is a linear combination of the observations, $b = \sum_i a_i y_i$ with $a_i = x_i / \sum_j x_j^2$, and the $y_i$ are independent with variance $\sigma^2$, so $V(b) = \sigma^2 \sum_i a_i^2 = \frac{\sigma^2}{\sum_i x_i^2}$

Confidence Interval for β
• Once we have the standard error in $b$, we can calculate confidence intervals to some desired $(1-\alpha)100\%$ level of confidence: $\beta = b \pm t_{\alpha/2,\, \nu} \cdot s.e.(b)$
• Analysis of variance
  – Test hypothesis: $H_0: \beta = 0$
  – If the confidence interval for $\beta$ includes 0, then $\beta$ is not significant
  – Degrees of freedom (needed in order to use the t distribution): $\nu = n - p$, where $p$ = # parameters estimated by least squares
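The one-parameter formulas above can be tried on the age and income values from the lecture's example regression. The block below is a minimal sketch under stated assumptions, not the JMP analysis itself. It fits the zero-intercept model $y = bx$, and the variable names are hypothetical. The two-sided 95% critical value `T_CRIT_95_8DOF = 2.306` is taken from a standard t table for 8 degrees of freedom rather than computed.

```python
from math import sqrt

# Age and income data from the lecture's example regression.
age = [8, 22, 35, 40, 57, 73, 78, 87, 98]
income = [6.16, 9.88, 14.35, 24.06, 30.34, 32.17, 42.18, 43.23, 48.76]

# Normal equation for the one-parameter model y = b*x (no intercept).
sxx = sum(x * x for x in age)
sxy = sum(x * y for x, y in zip(age, income))
b = sxy / sxx

# Residual sum of squares and estimate of experimental error.
ss_resid = sum((y - b * x) ** 2 for x, y in zip(age, income))
dof = len(age) - 1          # nu = n - p, with p = 1 estimated parameter
s2 = ss_resid / dof

# Precision of the estimate and the t ratio for H0: beta = 0.
se_b = sqrt(s2 / sxx)
t_ratio = b / se_b

# Two-sided 95% critical value of t with 8 d.o.f. (table lookup, assumed here).
T_CRIT_95_8DOF = 2.306
ci_low = b - T_CRIT_95_8DOF * se_b
ci_high = b + T_CRIT_95_8DOF * se_b

print(f"b = {b:.6f}, s^2 = {s2:.2f}, s.e.(b) = {se_b:.6f}, t ratio = {t_ratio:.2f}")
print(f"95% confidence interval for beta: ({ci_low:.4f}, {ci_high:.4f})")
```

The slope, standard error, and t ratio should agree with the Parameter Estimates table in the example (0.500983, 0.015152, and 33.06). The confidence interval stays well away from 0, so the slope is significant.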
Example Regression
  Age   Income
   8     6.16
  22     9.88
  35    14.35
  40    24.06
  57    30.34
  73    32.17
  78    42.18
  87    43.23
  98    48.76

  Whole Model: Analysis of Variance
  Source     DF   Sum of Squares   Mean Square   F Ratio    Prob > F
  Model      1    8836.6440        8836.64       1093.146   <.0001
  Error      8      64.6695           8.08
  C. Total   9    8901.3135
  Tested against reduced model: Y=0

  Parameter Estimates
  Term                 Estimate   Std Error   t Ratio   Prob>|t|
  Intercept (Zeroed)   0          0           .         .
  age                  0.500983   0.015152    33.06     <.0001

  Effect Tests
  Source   Nparm   DF   Sum of Squares   F Ratio    Prob > F
  age      1       1    8836.6440        1093.146   <.0001

[Figure: income leverage residuals (0 to 50) vs. age leverage (0 to 100), P<.0001]
• Note that this simple model assumes an intercept of zero: the model must go through the origin
• We will relax this requirement soon

Lack of Fit Error vs. Pure Error
• Sometimes we have replicated data
  – E.g. multiple runs at the same x values in a designed experiment
• We can decompose the residual error contributions: $SS_R = SS_L + SS_E$
  Where
  $SS_R$ = residual sum of squares error
  $SS_L$ = lack of fit squared error
  $SS_E$ = pure replicate error
• This allows us to TEST for lack of fit: compare $\frac{SS_L / \nu_L}{SS_E / \nu_E}$ against the F distribution with $\nu_L$ and $\nu_E$ degrees of freedom
  – By "lack of fit" we mean evidence that the linear model form is inadequate

Regression: Mean Centered Models
• Model form: $y_i = \alpha + \beta (x_i - \bar{x}) + \epsilon_i$
• Estimate by: $a = \bar{y}$ and $b = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}$

Polynomial Regression
• Generated using JMP package

  Analysis of Variance
  Source     DF   Sum of Squares   Mean Square   F Ratio   Prob > F
  Model      2    665.70617        332.853       51.5551   <.0001
  Error      7     45.19383          6.456
  C. Total   9    710.90000

  Lack Of Fit
  Source        DF   Sum of Squares   Mean Square   F Ratio   Prob > F
  Lack Of Fit   3    18.193829        6.0646        0.8985    0.5157
  Pure Error    4    27.000000        6.7500
  Total Error   7    45.193829
  Max RSq: 0.9620

  Summary of Fit
  RSquare                      0.936427
  RSquare Adj                  0.918264
  Root Mean Sq Error           2.540917
  Mean of Response             82.1
  Observations (or Sum Wgts)   10

  Parameter Estimates
  Term        Estimate    Std Error   t Ratio   Prob>|t|
  Intercept   35.657437   5.617927     6.35     0.0004
  x            5.2628956  0.558022     9.43     <.0001
  x*x         -0.127674   0.012811    -9.97     <.0001

  Effect Tests
  Source   Nparm   DF   Sum of Squares   F Ratio   Prob > F
  x        1       1    574.28553        88.9502   <.0001
  x*x      1       1    641.20451        99.3151   <.0001

Summary
• Comparison of Treatments – ANOVA
• Multivariate Analysis of Variance
• Regression Modeling

Next Time
• Time Series Models
• Forecasting