Session 1

Outline for Session 1
• Course Objectives & Description
• Review of Basic Statistical Ideas
  – Intercept, Slope, Correlation, Causality
• Simple Linear Regression
  – Statistical Model and Concepts
  – Regression in Excel

Course Themes
• Learn useful and practical tools of regression and data analysis
• Learn by example and by doing
• Learn enough theory to use regression safely
• Shape the course experience to meet your goals
  – The agenda is flexible
  – Pick your own project
  – The professor also enjoys learning
• Let's enjoy ourselves -- life is too short

Basic Information
• Canvas
• www.columbia.edu/~dj114/
• dj114@columbia.edu

Basic Requirements
• Come to class and participate
• One or two cases per session
• Project

What is Regression Analysis?
• A Procedure for Data Analysis
  – Regression analysis is a family of mathematical procedures for fitting functions to data.
  – The most basic procedure -- simple linear regression -- fits a straight line to a set of data so that the sum of the squared "y deviations" is minimized.
  – Regression can be used on a completely pragmatic basis.
• A Foundation for Statistical Inference
  – If special statistical conditions hold, regression analysis:
    • Produces statistically "best" estimates of the "true" underlying relationship and its components
    • Provides measures of the quality and reliability of the fitted function
    • Provides the basis for hypothesis tests and for confidence and prediction intervals

Some Regression Applications
• Determining the factors that influence energy consumption in a detergent plant
• Measuring the volatility of financial securities
• Determining the influence of ambient launch temperature on Space Shuttle o-ring burn-through
• Identifying demographic and purchase-history factors that predict high consumer response to catalog mailings
• Mounting a legal defense against a charge of sex discrimination in pay
• Determining the cause of leaking antifreeze bottles on a packing line
• Measuring the fairness of CEO compensation
• Predicting monthly champagne sales

Course Outline
• Basics of regression (the three blocks of the standard regression output)
  – Bottom: inferences about the effects of the independent variables on the dependent variable
  – Middle: Analysis of Variance
  – Top: summary measures for the model
• Advanced Regression Topics
  – Interval Estimation
  – Full Model with Arrays
  – Qualitative Variables
  – Residual Analysis
  – Thoughts on Nonlinear Regression
  – Model-Building Ideas
  – Multicollinearity
  – Autocorrelation, serial correlation
• Related Topics
  – Chi-square Goodness-of-Fit Tests
  – Forecasting Methods
    • Exponential Smoothing
    • Regression
  – Two Multivariate Methods
    • Cluster Analysis
    • Discriminant Analysis
  – Binary Logistic Regression

The Theory Underlying Simple Linear Regression
Regression can always be used to fit a straight line to a set of data. It is a relatively easy computational task (Excel, Minitab, etc.). If specified conditions hold, statistical theory can be employed to evaluate the quality and reliability of the fitted line for prediction of future events.
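The slides carry out this pragmatic fit in Excel. For readers who prefer a scripting tool, here is a minimal sketch of the same idea in Python (an added illustration, not part of the original slides; numpy is assumed to be available, and the data below are made up for demonstration only):

```python
import numpy as np

# Hypothetical illustration data (not from the slides):
# advertising spend (x) and weekly sales (y) for eight weeks.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([3.1, 4.9, 7.2, 8.8, 11.1, 12.8, 15.2, 16.9])

# np.polyfit with deg=1 returns [slope, intercept] of the straight line
# that minimizes the sum of squared vertical (y) deviations.
slope, intercept = np.polyfit(x, y, deg=1)

fitted = intercept + slope * x      # fitted values on the line
residuals = y - fitted              # the "y deviations"
sse = np.sum(residuals ** 2)        # the quantity least squares minimizes

print(f"y-hat = {intercept:.3f} + {slope:.3f} x   (SSE = {sse:.3f})")
```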
The Standard Statistical Model
• Y: the "dependent" random variable -- the effect or outcome that we wish to predict or understand.
• X: the "independent" deterministic variable -- an input, cause, or determinant that may cause, influence, explain, or predict the values of Y.

  $Y(X) = \beta_0 + \beta_1 X + \varepsilon$

where $\beta_0$ and $\beta_1$ are the parameters of the "true" regression relationship and $\varepsilon$ is a random "noise" factor.

Regression Assumptions
• The expected value of Y is a linear function of X:
  $E[Y(X)] = \beta_0 + \beta_1 X$, equivalently $E(\varepsilon) = 0$
• The variance of Y does not change with X:
  $Var[Y(X)] = \sigma^2$, equivalently $Var(\varepsilon) = \sigma^2$
• Random variations at different X values are uncorrelated:
  $Cov(\varepsilon_i, \varepsilon_j) = 0$ for all $i \neq j$
• Random variations from the regression line are normally distributed:
  $Y(X) \sim N(\beta_0 + \beta_1 X,\ \sigma^2)$, equivalently $\varepsilon \sim N(0, \sigma^2)$

Thoughts on Linearity
The significance of the word "linear" in the linear regression model
  $Y(X) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p$
is not linearity in the X's; it is linearity in the betas (the slope coefficients). Consider the following variants, both of which are linear:
  $Y(X) = \beta_0 + \beta_1 X_1 + \beta_2 X_1^2 + \beta_3 X_2 + \beta_4 X_1 X_2$
  $\ln Y(X) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p$

There are many creative ways to fit nonlinear functions by linear regression. Consider a few popular linearizations:
  $Y = \alpha X^{\beta}$  →  $\log Y = \log\alpha + \beta \log X$
  $Y = \alpha e^{\beta X}$  →  $\ln Y = \ln\alpha + \beta X$
  $Y = X/(\alpha X - \beta)$  →  $1/Y = \alpha - \beta(1/X)$
  $Y = e^{\alpha + \beta X}/(1 + e^{\alpha + \beta X})$  →  $\ln[Y/(1-Y)] = \alpha + \beta X$
Time permitting, we will look at some of these possibilities later in the course. These may present interesting opportunities for student term projects.

Regression Estimators
We are given the data set $(x_i, y_i)$, $i = 1, 2, \ldots, n$. We seek good estimators $\hat{\beta}_0$ of $\beta_0$ and $\hat{\beta}_1$ of $\beta_1$ that minimize the sum of the squared residuals (errors). The i-th residual is
  $e_i = y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)$,  $i = 1, 2, \ldots, n$

Computer Repair Example
  Obs   Minutes   Units
   1       23       1
   2       29       2
   3       49       3
   4       64       4
   5       74       4
   6       87       5
   7       96       6
   8       97       6
   9      109       7
  10      119       8
  11      149       9
  12      145       9
  13      154      10
  14      166      10

Statistical Basics
Basic statistical computations and graphical displays are very helpful in doing and interpreting a regression. We should always compute:
  $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$  and  $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
  $s_y = \sqrt{\frac{\sum_{i=1}^{n}(y_i - \bar{y})^2}{n-1}}$  and  $s_x = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}}$
  $r_{X,Y} = \frac{\sum_i (y_i - \bar{y})(x_i - \bar{x})}{\sqrt{\sum_i (y_i - \bar{y})^2 \sum_i (x_i - \bar{x})^2}}$

For the Computer Repair data, the worksheet tabulates the deviations from the means ("Error (min)" and "Error (units)") and reports (formulas shown are for the Units column):

               Minutes   Units    Excel formula
  mean           97.21    6.00    =AVERAGE(C$2:C$15)
  stdev          46.22    2.96    =STDEV(C$2:C$15)
  count          14       14      =COUNT(C$2:C$15)
  correl         0.9937           =CORREL(B$2:B$15,C$2:C$15)
  covar         126.29            =COVAR(B$2:B$15,C$2:C$15)

The correlation can also be reproduced directly from the deviation columns:
  =SUMPRODUCT(E2:E15,F2:F15)/SQRT(SUMPRODUCT(E2:E15,E2:E15)*SUMPRODUCT(F2:F15,F2:F15))  → 0.9937
Two ways to compute the sample covariance of 136:
  Book method:   =B20*(B18*C18)                        (correlation × product of the standard deviations)
  B6014 method:  =SUMPRODUCT(E2:E15,F2:F15)/(B19-1)    (sum of cross-deviations divided by n − 1)
(Excel's COVAR function divides by n rather than n − 1, which is why it returns 126.29 instead of 136.)

Graphical Analysis
We should always plot:
• histograms of the y and x values,
• a time-order plot of x and y (if appropriate), and
• a scatter plot of y on x.
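The slides do these computations and plots in Excel. Here is a minimal sketch of the same summary statistics and graphical checks for the Computer Repair data in Python (an added illustration, not from the slides; numpy and matplotlib are assumed to be available):

```python
import numpy as np
import matplotlib.pyplot as plt

# Computer Repair data from the slides: service time (minutes) and units repaired.
minutes = np.array([23, 29, 49, 64, 74, 87, 96, 97, 109, 119, 149, 145, 154, 166])
units   = np.array([1, 2, 3, 4, 4, 5, 6, 6, 7, 8, 9, 9, 10, 10])

# The basic statistics the slides recommend computing before any regression.
y_bar, x_bar = minutes.mean(), units.mean()
s_y = minutes.std(ddof=1)              # sample standard deviations (divide by n - 1)
s_x = units.std(ddof=1)
r = np.corrcoef(minutes, units)[0, 1]  # Pearson correlation

print(f"mean(y) = {y_bar:.2f}, s_y = {s_y:.2f}")   # about 97.21 and 46.22
print(f"mean(x) = {x_bar:.2f}, s_x = {s_x:.2f}")   # about 6.00 and 2.96
print(f"r(x, y) = {r:.4f}")                        # about 0.9937

# The recommended graphical checks: histograms of each variable and y vs. x.
fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].hist(minutes, bins=8)
axes[0].set_title("Minutes")
axes[1].hist(units, bins=10)
axes[1].set_title("Units")
axes[2].scatter(units, minutes)
axes[2].set_title("Minutes vs. Units")
axes[2].set_xlabel("Units")
axes[2].set_ylabel("Minutes")
plt.tight_layout()
plt.show()
```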
[Histogram of Minutes for the Computer Repair data]

[Histogram of Units for the Computer Repair data]

[Scatter plot: Minutes vs. Units]

Estimating Parameters
• Using Excel
• Using Solver
• Using analytical formulas

Using Excel (Scatter Diagram)
Adding a linear trendline to the scatter plot of Minutes vs. Units gives the fitted line
  y = 15.509x + 4.1617, with R² = 0.9874.

Using Excel (Data Analysis)
Data tab – Data Analysis. The regression output for the Computer Repair data:

  Regression Statistics
    Multiple R           0.9937
    R Square             0.9874
    Adjusted R Square    0.9864
    Standard Error       5.3917
    Observations         14

  ANOVA
                 df    SS            MS            F          Significance F
    Regression    1    27419.5088    27419.5088    943.2009   0.0000
    Residual     12      348.8484       29.0707
    Total        13    27768.3571

               Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
    Intercept      4.1617        3.3551         1.2404   0.2385     -3.1485     11.4718
    Units         15.5088        0.5050        30.7116   0.0000     14.4085     16.6090

Using Solver
Set up a worksheet with one cell for the intercept (B17) and one for the slope (B18), and for each observation compute:
  Prediction:  =$B$17+$B$18*C3       (intercept + slope × Units)
  Error:       =B5-E5                (actual Minutes − predicted Minutes)
  Error^2:     =F7^2
  SSE:         =SUM(G2:G15)          (sum of squared errors)
Solver is then asked to minimize the SSE cell by changing the intercept and slope cells. The minimized sum of squared errors is 348.8484, attained at intercept 4.1617 and slope 15.5088 -- the same values reported by the Data Analysis tool.

Using Formulas
  $\hat{\beta}_1 = \frac{\sum_i (y_i - \bar{y})(x_i - \bar{x})}{\sum_i (x_i - \bar{x})^2}$    (RABE Eq. 2.13)
  $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$    (RABE Eq. 2.14)
In the worksheet, with the deviations from the means in columns E and F:
  Slope:      =SUMPRODUCT(E2:E15,F2:F15)/(SUMPRODUCT(F2:F15,F2:F15))  → 15.50877   (Eq. 2.13)
  Intercept:  =B18-F18*C18  (that is, $\bar{y}$ minus the slope times $\bar{x}$)  → 4.161654   (Eq. 2.14)
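The same analytical formulas can be sketched in a few lines of Python (an added illustration, not from the slides; numpy is assumed), reproducing the slope, intercept, and sum of squared errors reported by Excel and Solver:

```python
import numpy as np

# Computer Repair data from the slides.
minutes = np.array([23, 29, 49, 64, 74, 87, 96, 97, 109, 119, 149, 145, 154, 166])
units   = np.array([1, 2, 3, 4, 4, 5, 6, 6, 7, 8, 9, 9, 10, 10])

y_dev = minutes - minutes.mean()   # the "Error (min)" column on the slides
x_dev = units - units.mean()       # the "Error (units)" column

# RABE Eq. 2.13: slope = sum of cross-deviations / sum of squared x-deviations
beta1_hat = np.sum(y_dev * x_dev) / np.sum(x_dev ** 2)
# RABE Eq. 2.14: intercept = y-bar minus slope times x-bar
beta0_hat = minutes.mean() - beta1_hat * units.mean()

residuals = minutes - (beta0_hat + beta1_hat * units)
sse = np.sum(residuals ** 2)       # the quantity Solver was asked to minimize

print(f"slope     = {beta1_hat:.4f}")   # about 15.5088
print(f"intercept = {beta0_hat:.4f}")   # about 4.1617
print(f"SSE       = {sse:.4f}")         # about 348.8484
```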
Correlation and Regression
There is a close relationship between regression and correlation. The correlation coefficient, $\rho$, measures the degree to which the random variables X and Y move together. $\rho = +1$ implies a perfect positive linear relationship, while $\rho = -1$ implies a perfect negative linear relationship; $\rho = 0$ essentially implies no linear relationship.

Statistical Basics: Covariance
The covariance can be calculated using
  $Cov_{XY} = E[(X - \mu_X)(Y - \mu_Y)]$
or, equivalently,
  $Cov_{XY} = E[XY] - \mu_X \mu_Y$.
Usually we find it more useful to consider the coefficient of correlation,
  $Corr_{XY} = \frac{Cov_{XY}}{\sigma_X \sigma_Y}$.
Sometimes the inverse relation is useful:
  $Cov_{XY} = \sigma_X \sigma_Y \, Corr_{XY}$.

Correlation and Regression
• The sample (Pearson) correlation coefficient satisfies
  $-1 \leq \frac{Cov(X, Y)}{\sigma_X \sigma_Y} \leq +1$
and is computed from the data as
  $r_{X,Y} = \frac{\sum_i (y_i - \bar{y})(x_i - \bar{x})}{\sqrt{\sum_i (y_i - \bar{y})^2 \sum_i (x_i - \bar{x})^2}}$.
• Regressions automatically produce an estimate of the squared correlation, called R² or R-square. Values of R-square close to 1 indicate a strong relationship, while values close to 0 indicate a weak or nonexistent relationship. (A small numerical check of this connection appears at the end of this outline.)

Some Validity Issues
• We need to evaluate the strength of the relationship, whether we have the proper functional form, and the validity of the several statistical assumptions, from both a practical and a theoretical viewpoint, using a multiplicity of tools.
• Fitted regression functions are interpolations of the data in hand, and extrapolation is always dangerous. Moreover, the functional form that fits the data in our range of "experience" may not fit beyond it.
• Regressions are based on past data. Why should the same functional form and parameters hold in the future?
• In some uses of regression the future value of x may not be known -- this adds greatly to our uncertainty.
• When collecting data for a regression, choose the x values wisely -- when you have a choice. They should:
  – be in the range where you intend to work;
  – be spread out along the range, with some observations near the practical extremes;
  – have replicated values at the same x (or at very nearby x values) for good estimation of the error variance $\sigma^2$.
• Whenever possible, test the stability of your model with a "holdout" sample not used in the original model fitting.

Summary
• Course Objectives & Description
• Review of Basic Statistical Ideas
  – Intercept, Slope, Correlation, Causality
• Simple Linear Regression
  – Statistical Model and Concepts
  – Regression in Excel
• Computer Repair Example
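As a closing check on the Computer Repair example, here is a minimal sketch (an added illustration, not part of the slides; Python with numpy assumed) showing that the covariance identity gives the sample correlation and that the regression's R-square equals the squared correlation:

```python
import numpy as np

# Computer Repair data from the slides.
minutes = np.array([23, 29, 49, 64, 74, 87, 96, 97, 109, 119, 149, 145, 154, 166])
units   = np.array([1, 2, 3, 4, 4, 5, 6, 6, 7, 8, 9, 9, 10, 10])

# Sample covariance and the identity Corr = Cov / (s_x * s_y).
cov_xy = np.cov(units, minutes, ddof=1)[0, 1]             # about 136.0
r = cov_xy / (units.std(ddof=1) * minutes.std(ddof=1))    # about 0.9937

# The R-square reported by the regression equals the squared correlation.
slope, intercept = np.polyfit(units, minutes, deg=1)
fitted = intercept + slope * units
ss_total = np.sum((minutes - minutes.mean()) ** 2)
ss_resid = np.sum((minutes - fitted) ** 2)
r_square = 1 - ss_resid / ss_total

print(f"r         = {r:.4f}")          # about 0.9937
print(f"r squared = {r ** 2:.4f}")     # about 0.9874
print(f"R-square  = {r_square:.4f}")   # matches the Excel regression output
```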