Stat 328 Formulas

Quantiles

Roughly speaking, the p quantile of a distribution is a number Q(p) such that a fraction p of the distribution is to the left and a fraction (1 − p) is to the right of the number. There are various possible conventions for making this notion precise for an empirical distribution/data set. One is that for ordered data values y_1 ≤ y_2 ≤ ... ≤ y_n, the ith ordered value is the (i − .5)/n quantile, i.e.

   y_i = Q((i − .5)/n)

(and other quantiles are obtained by interpolation). A particularly important quantile is Q(.5), the distribution median. This is a number that puts half of the distribution to its left and half to its right.

Mean, Variance, and Standard Deviation

For a data set y_1, y_2, ..., y_n the sample mean is

   ȳ = (1/n) Σ_{i=1}^n y_i

and if the data set constitutes an entire population of interest, the population mean is

   μ = (1/N) Σ_{i=1}^N y_i

Further, the sample variance is

   s² = (1/(n − 1)) Σ_{i=1}^n (y_i − ȳ)²

and the corresponding population variance is

   σ² = (1/N) Σ_{i=1}^N (y_i − μ)²

The square root of the variance is in the original units and is called the standard deviation.

Inference Based on the Extremes of a Single Sample

If one adopts a probability model that says that observations are independent random draws from a fixed continuous distribution/population/universe, it is easy to see how to make some simple inferences based on the sample extremes. For data y_1 ≤ y_2 ≤ ... ≤ y_n, the interval (y_1, y_n) can serve as an interval meant to bracket the distribution median or as an interval meant to bracket a single additional value drawn from the distribution. As a confidence interval for Q(.5), the interval (y_1, y_n) has associated confidence/reliability

   1 − 2(.5)^n

As a prediction interval for y_{n+1} (a single additional observation from this distribution), the appropriate associated confidence/reliability is

   1 − 2/(n + 1)

Calculation of Normal Probabilities

Areas under the normal curve with μ = 0 and σ = 1 ("standard normal probabilities" between 0 and z) are given in Table 1, Appendix C, page 812 of the text. The symmetry of the standard normal curve about 0 then allows one to find arbitrary standard normal probabilities using the table. Areas under the normal curve with mean μ and standard deviation σ are available by converting values on the original scale to values on the "z scale" via

   z = (y − μ)/σ

and then using the standard normal table. Many handheld calculators will compute normal probabilities.

Normal Plotting

It is extremely useful to be able to model a data-generating mechanism as "independent random draws from a normal distribution." A means of investigating the extent to which this is sensible is to make a so-called "normal plot" of a data set. This is a plot that facilitates comparison of data quantiles and normal quantiles. If these are roughly linearly related, the normal model is plausible. If they are not, the normal model is implausible. For data y_1 ≤ y_2 ≤ ... ≤ y_n and Q_z(p) the standard normal quantile function (Q_z(p) is the number that puts standard normal probability p to its left), a normal plot can be made by plotting the points

   (y_i, Q_z((i − .5)/n))
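For readers who want to check these conventions outside of JMP, here is a minimal sketch (in Python, which is not the course software) of the (i − .5)/n quantile convention and the normal-plot coordinates described above. The data vector y and all variable names are hypothetical.

```python
import numpy as np
from scipy.stats import norm

y = np.array([3.1, 4.7, 2.9, 5.0, 4.1, 3.8, 4.4, 3.5])   # hypothetical data

y_sorted = np.sort(y)                        # y_(1) <= ... <= y_(n)
n = len(y_sorted)
p = (np.arange(1, n + 1) - 0.5) / n          # the (i - .5)/n plotting positions
z = norm.ppf(p)                              # standard normal quantiles Q_z(p)

# The points (y_(i), Q_z((i - .5)/n)) form the normal plot; an approximately
# linear pattern supports the "draws from a normal distribution" model.
for yi, zi in zip(y_sorted, z):
    print(f"{yi:6.2f}  {zi:6.2f}")
```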
One- and Two-Sample Intervals for Normal Distributions

If it is plausible to think of y_1, y_2, ..., y_n as random draws from a single normal distribution, then probability theory can be invoked to support quantitative inference. Several useful estimation and prediction results (based on theory for derived distributions of appropriate statistics) are as follows.

Confidence limits for the distribution mean μ are

   ȳ ± t s/√n

where t is a quantile of the so-called "t distribution with ν = n − 1 degrees of freedom." (The t distributions are tabled in Table 2, Appendix C, page 813 of the text.)

Prediction limits for y_{n+1}, a single additional observation from the distribution, are

   ȳ ± t s √(1 + 1/n)

Confidence limits for the distribution standard deviation σ are

   s √((n − 1)/χ²_upper)   and   s √((n − 1)/χ²_lower)

where χ²_upper and χ²_lower are upper and lower quantiles of the so-called "χ² distribution with ν = n − 1 degrees of freedom." (The χ² distributions are tabled in Table 10, Appendix C, pages 827 and 828 of the text.)

If it is plausible to think of two samples as independently drawn from two normal distributions with a common variance (this might be investigated by normal plotting both samples on a single set of axes, looking for two linear plots with comparable slopes), probability theory can again be invoked to support quantitative inferences for the difference in means. That is, for

   s_P = √[ ((n_1 − 1)s_1² + (n_2 − 1)s_2²) / ((n_1 − 1) + (n_2 − 1)) ]

a "pooled estimate" of σ (based on a weighted average of the two sample variances), confidence limits for μ_1 − μ_2 are

   ȳ_1 − ȳ_2 ± t s_P √(1/n_1 + 1/n_2)

where t is a quantile of the t_{n_1+n_2−2} distribution.

If one drops the equal variance assumption (but maintains that two samples are independently drawn from two normal distributions), probability theory can again be invoked to support quantitative inferences for the ratio of standard deviations. That is, confidence limits for σ_1/σ_2 are

   (s_1/s_2) · 1/√F_{n_1−1, n_2−1}   and   (s_1/s_2) · √F_{n_2−1, n_1−1}

where F_{ν_1, ν_2} is an upper quantile of the so-called "F distribution with ν_1 numerator degrees of freedom and ν_2 denominator degrees of freedom." (The F distributions are tabled in Tables 3-6, Appendix C, pages 814-821 of the text.)

One- and Two-Sample Testing for Normal Distributions

If it is plausible to think of y_1, y_2, ..., y_n as random draws from a single normal distribution, then probability theory can be invoked to support hypothesis testing. That is, H0: μ = # can be tested using the test statistic

   T = (ȳ − #)/(s/√n)

and a t_{n−1} reference distribution. Further, H0: σ = # can be tested using the test statistic

   X² = (n − 1)s²/#²

and a χ²_{n−1} reference distribution.

If it is plausible to think of two samples as independently drawn from two normal distributions with a common variance, probability theory can again be invoked to support testing for the difference in means. That is, H0: μ_1 − μ_2 = # can be tested using the test statistic

   T = (ȳ_1 − ȳ_2 − #)/(s_P √(1/n_1 + 1/n_2))

and a t_{n_1+n_2−2} reference distribution. And if one drops the equal variance assumption (but maintains that two samples are independently drawn from two normal distributions), probability theory can again be invoked to support testing for the ratio of standard deviations. That is, H0: σ_1/σ_2 = 1 can be tested using the statistic

   F = s_1²/s_2²

and an F_{n_1−1, n_2−1} reference distribution.
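The one-sample formulas above are easy to verify numerically. The following is a small, hypothetical Python sketch in which scipy's quantile functions stand in for the t and χ² tables cited in the text.

```python
import numpy as np
from scipy.stats import t, chi2

y = np.array([9.8, 10.2, 10.1, 9.9, 10.4, 10.0])    # hypothetical sample
n, ybar, s = len(y), np.mean(y), np.std(y, ddof=1)
conf = 0.95
tq = t.ppf(1 - (1 - conf) / 2, df=n - 1)             # t quantile, nu = n - 1

ci_mean = (ybar - tq * s / np.sqrt(n), ybar + tq * s / np.sqrt(n))
pred    = (ybar - tq * s * np.sqrt(1 + 1 / n), ybar + tq * s * np.sqrt(1 + 1 / n))

chi_lo = chi2.ppf((1 - conf) / 2, df=n - 1)          # lower chi-square point
chi_hi = chi2.ppf(1 - (1 - conf) / 2, df=n - 1)      # upper chi-square point
ci_sigma = (s * np.sqrt((n - 1) / chi_hi), s * np.sqrt((n - 1) / chi_lo))

print(ci_mean, pred, ci_sigma)
```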
Simple Linear Regression Model

The basic (normal) "simple linear regression" model says that a response/output variable y depends on an explanatory/input/system variable x in a "noisy but linear" way. That is, one supposes that there is a linear relationship between x and mean y,

   μ_{y|x} = β_0 + β_1 x

and that (for fixed x) there is around that mean a distribution of y that is normal. Further, the standard assumption is that the standard deviation of the response distribution is constant in x. In symbols it is standard to write

   y = β_0 + β_1 x + ε

where ε is normal with mean 0 and standard deviation σ. This describes one y. Where several observations y_i with corresponding values x_i are under consideration, the assumption is that the y_i (the ε_i) are independent. (The ε_i are conceptually equivalent to unrelated random draws from the same fixed normal continuous distribution.) The model statement in its full glory is then

   y_i = β_0 + β_1 x_i + ε_i   for i = 1, 2, ..., n
   ε_i for i = 1, 2, ..., n are independent normal (0, σ²) random variables

The model statement above is a perfectly theoretical matter. One can begin with it, and for specific choices of β_0, β_1 and σ find probabilities for y at given values of x. In applications, the real mode of operation is instead to take n data pairs (x_1, y_1), (x_2, y_2), ..., (x_n, y_n) and use them to make inferences about the parameters β_0, β_1 and σ and to make predictions based on the estimates (based on the empirically fitted model).

Descriptive Analysis of Approximately Linear (x, y) Data

After plotting (x, y) data to determine that the "linear in x mean of y" model makes some sense, it is reasonable to try to quantify "how linear" the data look and to find a line of "best fit" to the scatterplot of the data. The sample correlation between y and x,

   r = Σ_{i=1}^n (x_i − x̄)(y_i − ȳ) / √( Σ_{i=1}^n (x_i − x̄)² · Σ_{i=1}^n (y_i − ȳ)² )

is a measure of strength of linear relationship between x and y. Calculus can be invoked to find a slope β_1 and intercept β_0 minimizing the sum of squared vertical distances from data points to a fitted line, Σ_{i=1}^n (y_i − (β_0 + β_1 x_i))². These "least squares" values are

   b_1 = Σ_{i=1}^n (x_i − x̄)(y_i − ȳ) / Σ_{i=1}^n (x_i − x̄)²

and

   b_0 = ȳ − b_1 x̄

It is further common to refer to the value of y on the "least squares line" corresponding to x_i as a fitted or predicted value

   ŷ_i = b_0 + b_1 x_i

One might take the difference between what is observed (y_i) and what is "predicted" or "explained" (ŷ_i) as a kind of leftover part or "residual" corresponding to a data value

   e_i = y_i − ŷ_i

The sum Σ_{i=1}^n (y_i − ȳ)² is most of the sample variance of the n values y_i. It is a measure of raw variation in the response variable. People often call it the "total sum of squares" and write

   SSTot = Σ_{i=1}^n (y_i − ȳ)²

The sum of squared residuals Σ_{i=1}^n e_i² is a measure of variation in response remaining unaccounted for after fitting a line to the data. People often call it the "error sum of squares" and write

   SSE = Σ_{i=1}^n e_i² = Σ_{i=1}^n (y_i − ŷ_i)²

One is guaranteed that SSTot ≥ SSE. So the difference SSTot − SSE is a non-negative measure of variation accounted for in fitting a line to (x, y) data. People often call it the "regression sum of squares" and write

   SSR = SSTot − SSE

The coefficient of determination expresses SSR as a fraction of SSTot and is

   R² = SSR/SSTot

which is interpreted as "the fraction of raw variation in y accounted for in the model fitting process."

Parameter Estimates for SLR

The descriptive statistics for (x, y) data can be used to provide "single number estimates" of the (typically unknown) parameters of the simple linear regression model. That is, the slope of the least squares line can serve as an estimate of β_1,

   β̂_1 = b_1

and the intercept of the least squares line can serve as an estimate of β_0,

   β̂_0 = b_0

The variance of y for a given x can be estimated by a kind of average of squared residuals

   s² = (1/(n − 2)) Σ_{i=1}^n e_i² = SSE/(n − 2)

Of course, the square root of this "regression sample variance" is s = √s² and serves as a single number estimate of σ.
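A minimal sketch of the least squares and sums-of-squares arithmetic above, using hypothetical (x, y) data; any regression program (JMP included) reports the same quantities.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])          # hypothetical (x, y) data
y = np.array([2.1, 2.9, 3.6, 4.8, 5.1, 6.2])
n = len(x)

sxy = np.sum((x - x.mean()) * (y - y.mean()))
sxx = np.sum((x - x.mean()) ** 2)
syy = np.sum((y - y.mean()) ** 2)

r  = sxy / np.sqrt(sxx * syy)         # sample correlation
b1 = sxy / sxx                        # least squares slope
b0 = y.mean() - b1 * x.mean()         # least squares intercept

yhat = b0 + b1 * x                    # fitted values
e    = y - yhat                       # residuals
SSTot, SSE = syy, np.sum(e ** 2)
SSR  = SSTot - SSE
R2   = SSR / SSTot
s    = np.sqrt(SSE / (n - 2))         # single number estimate of sigma
print(b0, b1, r, R2, s)
```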
Interval-Based Inference Methods for SLR

The normal simple linear regression model provides inference formulas for model parameters. Confidence limits for σ are

   s √((n − 2)/χ²_upper)   and   s √((n − 2)/χ²_lower)

where χ²_upper and χ²_lower are upper and lower quantiles of the χ² distribution with ν = n − 2 degrees of freedom. And confidence limits for β_1 (the slope of the line relating mean y to x ... the rate of change of average y with respect to x) are

   b_1 ± t s/√( Σ_{i=1}^n (x_i − x̄)² )

where t is a quantile of the t distribution with ν = n − 2 degrees of freedom. Confidence limits for μ_{y|x} = β_0 + β_1 x (the mean value of y at a given value x) are

   (b_0 + b_1 x) ± t s √( 1/n + (x − x̄)²/Σ_{i=1}^n (x_i − x̄)² )

(Note that by choosing x = 0, this formula provides confidence limits for β_0, though this parameter is rarely of independent practical interest.)

Prediction limits for an additional observation y at a given value x are

   (b_0 + b_1 x) ± t s √( 1 + 1/n + (x − x̄)²/Σ_{i=1}^n (x_i − x̄)² )

Hypothesis Tests and SLR

The normal simple linear regression model supports hypothesis testing. H0: β_1 = # can be tested using the test statistic

   T = (b_1 − #) / ( s/√(Σ_{i=1}^n (x_i − x̄)²) )

and a t_{n−2} reference distribution. H0: μ_{y|x} = # can be tested using the test statistic

   T = ((b_0 + b_1 x) − #) / ( s √(1/n + (x − x̄)²/Σ_{i=1}^n (x_i − x̄)²) )

and a t_{n−2} reference distribution.

ANOVA and SLR

The breaking down of SSTot into SSR and SSE can be thought of as a kind of "analysis of variance" in y. That enterprise is often summarized in a special kind of table. The general form is as below.

   ANOVA Table (for SLR)
   Source       SS      df      MS                  F
   Regression   SSR     1       MSR = SSR/1         F = MSR/MSE
   Error        SSE     n − 2   MSE = SSE/(n − 2)
   Total        SSTot   n − 1

In this table the ratios of sums of squares to degrees of freedom are called "mean squares." The mean square for error is, in fact, the estimate of σ² (i.e. MSE = s²). As it turns out, the ratio in the "F" column can be used as a test statistic for the hypothesis H0: β_1 = 0. The appropriate reference distribution is the F_{1, n−2} distribution. As it turns out, the value F = MSR/MSE is the square of the t statistic for testing this hypothesis, and the F test produces exactly the same p-values as a two-sided t test.

Standardized Residuals and SLR

The theoretical variances of the residuals turn out to depend upon their corresponding x values. As a means of putting these residuals all on the same footing, it is common to "standardize" them by dividing by an estimated standard deviation for each. This produces standardized residuals

   e_i* = e_i / ( s √( 1 − 1/n − (x_i − x̄)²/Σ_{i=1}^n (x_i − x̄)² ) )

These (if the normal simple linear regression model is a good one) "ought" to look as if they are approximately normal with mean 0 and standard deviation 1. Various kinds of plotting with these standardized residuals (or with the raw residuals) are used as means of "model checking" or "model diagnostics."
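Continuing the hypothetical example above, the following sketch evaluates the confidence limits for β_1 and μ_{y|x} and the prediction limits at a particular x; the value x0 = 3.5 is made up for illustration.

```python
import numpy as np
from scipy.stats import t

# Re-uses the hypothetical (x, y) data from the simple linear regression sketch.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 3.6, 4.8, 5.1, 6.2])
n, sxx = len(x), np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / sxx
b0 = y.mean() - b1 * x.mean()
s  = np.sqrt(np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2))

x0 = 3.5                                             # x value of interest
tq = t.ppf(0.975, df=n - 2)                          # 95% two-sided t quantile

se_b1   = s / np.sqrt(sxx)
ci_b1   = (b1 - tq * se_b1, b1 + tq * se_b1)

se_mean = s * np.sqrt(1 / n + (x0 - x.mean()) ** 2 / sxx)
ci_mean = (b0 + b1 * x0 - tq * se_mean, b0 + b1 * x0 + tq * se_mean)

se_pred = s * np.sqrt(1 + 1 / n + (x0 - x.mean()) ** 2 / sxx)
pred    = (b0 + b1 * x0 - tq * se_pred, b0 + b1 * x0 + tq * se_pred)
print(ci_b1, ci_mean, pred)
```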
Multiple Linear Regression Model

The basic (normal) "multiple linear regression" model says that a response/output variable y depends on explanatory/input/system variables x_1, x_2, ..., x_k in a "noisy but linear" way. That is, one supposes that there is a linear relationship between x_1, x_2, ..., x_k and mean y,

   μ_{y|x_1, x_2, ..., x_k} = β_0 + β_1 x_1 + β_2 x_2 + ... + β_k x_k

and that (for fixed x_1, x_2, ..., x_k) there is around that mean a distribution of y that is normal. Further, the standard assumption is that the standard deviation of the response distribution is constant in x_1, x_2, ..., x_k. In symbols it is standard to write

   y = β_0 + β_1 x_1 + β_2 x_2 + ... + β_k x_k + ε

where ε is normal with mean 0 and standard deviation σ. This describes one y. Where several observations y_i with corresponding values x_{1i}, x_{2i}, ..., x_{ki} are under consideration, the assumption is that the y_i (the ε_i) are independent. (The ε_i are conceptually equivalent to unrelated random draws from the same fixed normal continuous distribution.) The model statement in its full glory is then

   y_i = β_0 + β_1 x_{1i} + β_2 x_{2i} + ... + β_k x_{ki} + ε_i   for i = 1, 2, ..., n
   ε_i for i = 1, 2, ..., n are independent normal (0, σ²) random variables

The model statement above is a perfectly theoretical matter. One can begin with it, and for specific choices of β_0, β_1, β_2, ..., β_k and σ find probabilities for y at given values of x_1, x_2, ..., x_k. In applications, the real mode of operation is instead to take n data vectors (x_{11}, x_{21}, ..., x_{k1}, y_1), (x_{12}, x_{22}, ..., x_{k2}, y_2), ..., (x_{1n}, x_{2n}, ..., x_{kn}, y_n) and use them to make inferences about the parameters β_0, β_1, β_2, ..., β_k and σ and to make predictions based on the estimates (based on the empirically fitted model).

Descriptive Analysis of Approximately Linear (x_1, x_2, ..., x_k, y) Data

Calculus can be invoked to find coefficients β_0, β_1, β_2, ..., β_k minimizing the sum of squared vertical distances from data points in (k + 1) dimensional space to a fitted surface, Σ_{i=1}^n (y_i − (β_0 + β_1 x_{1i} + β_2 x_{2i} + ... + β_k x_{ki}))². These "least squares" values DO NOT have simple formulas (unless one is willing to use matrix notation). And in particular, one can NOT simply somehow use the formulas from simple linear regression in this more complicated context. We will call these minimizing coefficients b_0, b_1, b_2, ..., b_k and need to rely upon JMP to produce them for us. It is further common to refer to the value of y on the "least squares surface" corresponding to x_{1i}, x_{2i}, ..., x_{ki} as a fitted or predicted value

   ŷ_i = b_0 + b_1 x_{1i} + b_2 x_{2i} + ... + b_k x_{ki}

Exactly as in SLR, one takes the difference between what is observed (y_i) and what is "predicted" or "explained" (ŷ_i) as a kind of leftover part or "residual" corresponding to a data value

   e_i = y_i − ŷ_i

The total sum of squares, SSTot = Σ_{i=1}^n (y_i − ȳ)², is (still) most of the sample variance of the n values y_i and measures raw variation in the response variable. Just as in SLR, the sum of squared residuals Σ_{i=1}^n e_i² is a measure of variation in response remaining unaccounted for after fitting the equation to the data. As in SLR, people call it the error sum of squares and write

   SSE = Σ_{i=1}^n e_i² = Σ_{i=1}^n (y_i − ŷ_i)²

(The formula looks exactly like the one for SLR. It is simply the case that now ŷ_i is computed using all k inputs, not just a single x.) One is still guaranteed that SSTot ≥ SSE. So the difference SSTot − SSE is a non-negative measure of variation accounted for in fitting the linear equation to the data. As in SLR, people call it the regression sum of squares and write

   SSR = SSTot − SSE

The coefficient of (multiple) determination expresses SSR as a fraction of SSTot and is

   R² = SSR/SSTot

which is interpreted as "the fraction of raw variation in y accounted for in the model fitting process." This quantity can also be interpreted in terms of a correlation, as it turns out to be the square of the sample linear correlation between the observations y_i and the fitted or predicted values ŷ_i.
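Although the notes rely on JMP for MLR fitting, the least squares coefficients and the sums of squares above can be obtained from any linear algebra routine. Here is a minimal sketch with hypothetical data and k = 2 predictors.

```python
import numpy as np

# Hypothetical data with k = 2 predictors.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
x2 = np.array([0.5, 1.1, 0.9, 2.0, 1.7, 2.5, 2.2, 3.0])
y  = np.array([2.3, 3.6, 4.1, 6.0, 5.9, 7.8, 7.2, 9.1])
n, k = len(y), 2

X = np.column_stack([np.ones(n), x1, x2])         # model matrix with intercept
b, *_ = np.linalg.lstsq(X, y, rcond=None)         # least squares b0, b1, b2

yhat  = X @ b
e     = y - yhat
SSTot = np.sum((y - y.mean()) ** 2)
SSE   = np.sum(e ** 2)
SSR   = SSTot - SSE
R2    = SSR / SSTot
s     = np.sqrt(SSE / (n - k - 1))                # estimate of sigma
print(b, R2, s)
```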
Parameter Estimates for MLR

The descriptive statistics for (x_1, x_2, ..., x_k, y) data can be used to provide "single number estimates" of the (typically unknown) parameters of the multiple linear regression model. That is, the least squares coefficients b_0, b_1, b_2, ..., b_k serve as estimates of the parameters β_0, β_1, β_2, ..., β_k. The first of these is a kind of high-dimensional "intercept" and, in the case where the predictors are not functionally related, the others serve as rates of change of average y with respect to a single x, provided the other x's are held fixed. The variance of y for fixed values x_1, x_2, ..., x_k can be estimated by a kind of average of squared residuals

   s² = (1/(n − k − 1)) Σ_{i=1}^n e_i² = SSE/(n − k − 1)

The square root of this "regression sample variance" is s = √s² and serves as a single number estimate of σ.

Interval-Based Inference Methods for MLR

The normal multiple linear regression model provides inference formulas for model parameters. Confidence limits for σ are

   s √((n − k − 1)/χ²_upper)   and   s √((n − k − 1)/χ²_lower)

where χ²_upper and χ²_lower are upper and lower quantiles of the χ² distribution with ν = n − k − 1 degrees of freedom. Confidence limits for β_j (the rate of change of average y with respect to x_j) are

   b_j ± t (standard error of b_j)

where t is a quantile of the t distribution with ν = n − k − 1 degrees of freedom. There is no simple formula for "standard error of b_j" and in particular, one can NOT simply somehow use the formula from simple linear regression in this more complicated context. It IS the case that this standard error is a multiple of s, but we will have to rely upon JMP to provide it for us. Confidence limits for μ_{y|x_1, x_2, ..., x_k} = β_0 + β_1 x_1 + β_2 x_2 + ... + β_k x_k (the mean value of y at a particular choice of the x_1, x_2, ..., x_k) are

   ŷ ± t (standard error of ŷ)

There is no simple formula for "standard error of ŷ" and in particular one can NOT simply somehow use the formula from simple linear regression in this more complicated context. It IS the case that this standard error is a multiple of s, but we will have to rely upon JMP to provide it for us.

Prediction limits for an additional observation y at a given vector (x_1, x_2, ..., x_k) are

   ŷ ± t √( s² + (standard error of ŷ)² )

Hypothesis Tests and MLR

The normal multiple linear regression model supports hypothesis testing. H0: β_j = # can be tested using the test statistic

   T = (b_j − #)/(standard error of b_j)

and a t_{n−k−1} reference distribution. H0: μ_{y|x_1, x_2, ..., x_k} = # can be tested using the test statistic

   T = (ŷ − #)/(standard error of ŷ)

and a t_{n−k−1} reference distribution.

ANOVA and MLR

As in SLR, the breaking down of SSTot into SSR and SSE can be thought of as a kind of "analysis of variance" in y, and summarized in a special kind of table. The general form for MLR is as below.

   ANOVA Table (for MLR Overall F Test)
   Source       SS      df          MS                      F
   Regression   SSR     k           MSR = SSR/k             F = MSR/MSE
   Error        SSE     n − k − 1   MSE = SSE/(n − k − 1)
   Total        SSTot   n − 1

(Note that as in SLR, the mean square for error is, in fact, the estimate of σ², i.e. MSE = s².) As it turns out, the ratio in the "F" column can be used as a test statistic for the hypothesis H0: β_1 = β_2 = ... = β_k = 0. The appropriate reference distribution is the F_{k, n−k−1} distribution.

"Partial F Tests" in MLR

It is possible to use ANOVA ideas to invent F tests for investigating whether some whole group of β's (short of the entire set) are all 0.
For example, one might want to test the hypothesis

   H0: β_{p+1} = β_{p+2} = ... = β_k = 0

(This is the hypothesis that only the first p of the k input variables x_i have any impact on the mean system response ... the hypothesis that the first p of the x's are adequate to predict y ... the hypothesis that after accounting for the first p of the x's, the others do not contribute "significantly" to one's ability to explain or predict y.) If we call the model for y in terms of all k of the predictors the "full model" and the model for y involving only x_1 through x_p the "reduced model," then an F test of the above hypothesis can be made using the statistic

   F = ( (SSR_Full − SSR_Reduced)/(k − p) ) / MSE_Full

and an F_{k−p, n−k−1} reference distribution. SSR_Full ≥ SSR_Reduced, so the numerator here is non-negative.

Finding a p-value for this kind of test is a means of judging whether R² for the full model is "significantly"/detectably larger than R² for the reduced model. (Caution here: statistical significance is not the same as practical importance. With a big enough data set, essentially any increase in R² will produce a small p-value.)

It is reasonably common to expand the basic MLR ANOVA table to organize calculations for this test statistic. This is the

   (Expanded) ANOVA Table (for MLR)
   Source                                  SS                   df          MS                                 F
   Regression                              SSR_Full             k           MSR_Full = SSR_Full/k              F = MSR_Full/MSE_Full
     x_1, ..., x_p                         SSR_Red              p
     x_{p+1}, ..., x_k | x_1, ..., x_p     SSR_Full − SSR_Red   k − p       (SSR_Full − SSR_Red)/(k − p)       ((SSR_Full − SSR_Red)/(k − p))/MSE_Full
   Error                                   SSE_Full             n − k − 1   MSE_Full = SSE_Full/(n − k − 1)
   Total                                   SSTot                n − 1

Standardized Residuals in MLR

As in SLR, people sometimes wish to standardize residuals before using them to do model checking/diagnostics. While it is not possible to give a simple formula for the "standard error of e_i" without using matrix notation, most MLR programs will compute these values. The standardized residual for data point i is then (as in SLR)

   e_i* = e_i / (standard error of e_i)

If the normal multiple linear regression model is a good one, these "ought" to look as if they are approximately normal with mean 0 and standard deviation 1.

Intervals and Tests for Linear Combinations of β's in MLR

It is sometimes important to do inference for a linear combination of MLR model coefficients

   L = c_0 β_0 + c_1 β_1 + c_2 β_2 + ... + c_k β_k

(where c_0, c_1, ..., c_k are known constants). Note, for example, that μ_{y|x_1, x_2, ..., x_k} is of this form for c_0 = 1, c_1 = x_1, c_2 = x_2, ..., and c_k = x_k. Note too that a difference in mean responses at two sets of predictors, say (x_1, x_2, ..., x_k) and (x'_1, x'_2, ..., x'_k), is of this form for c_0 = 0, c_1 = x_1 − x'_1, c_2 = x_2 − x'_2, ..., and c_k = x_k − x'_k. An obvious estimate of L is

   L̂ = c_0 b_0 + c_1 b_1 + c_2 b_2 + ... + c_k b_k

Confidence limits for L are

   L̂ ± t (standard error of L̂)

There is no simple formula for "standard error of L̂". This standard error is a multiple of s, but we will have to rely upon JMP to provide it for us. (Computation of L̂ and its standard error is under the "Custom Test" option in JMP.) H0: L = # can be tested using the test statistic

   T = (L̂ − #)/(standard error of L̂)

and a t_{n−k−1} reference distribution. Or, if one thinks about it for a while, it is possible to find a reduced model that corresponds to the restriction that the null hypothesis places on the MLR model and to use a "p = k − 1" partial F test (with 1 and n − k − 1 degrees of freedom) equivalent to the t test for this purpose.
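Here is a hypothetical sketch of the partial F computation. Because SSTot is the same for the full and reduced models, SSR_Full − SSR_Red equals SSE_Red − SSE_Full, and that is the form the code uses.

```python
import numpy as np
from scipy.stats import f

def fit_sse(X, y):
    """Return the error sum of squares for a least squares fit of y on X."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.sum((y - X @ b) ** 2)

# Hypothetical data: the full model has k = 3 predictors, the reduced keeps p = 1.
rng = np.random.default_rng(0)
n, k, p = 20, 3, 1
Xfull = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = 1.0 + 2.0 * Xfull[:, 1] + rng.normal(size=n)
Xred = Xfull[:, :p + 1]                           # intercept plus first p predictors

SSE_full, SSE_red = fit_sse(Xfull, y), fit_sse(Xred, y)
MSE_full = SSE_full / (n - k - 1)
F = ((SSE_red - SSE_full) / (k - p)) / MSE_full   # partial F statistic
p_value = f.sf(F, k - p, n - k - 1)
print(F, p_value)
```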
MLR Model-Building

The MLR model and inference formulas above form the core of a set of tools in common use for building reliable predictions of y from a large set of predictors x_1, ..., x_k. There are a number of extensions and ways of using these tools (and their extensions) that combine to form a full model-building technology. Some of the additional ideas are summarized below.

Diagnostic Tools

Residual Plotting

The residuals e_i = y_i − ŷ_i from MLR are meant to be empirical approximations of the "random errors" ε_i = y_i − μ_{y|x_1, x_2, ..., x_k} in the MLR model. The MLR model says that the ε_i are normal (with mean 0 and standard deviation σ) and independent. So one should expect the residuals to be describable in approximately these terms. They should be essentially "patternless normal random noise," and if they aren't, a problem with the corresponding MLR model is indicated. (This possibility then causes one to be skeptical of the appropriateness of any probability-based inferences based on the MLR model.) Common ways of looking at MLR residuals (or their standardized/Studentized versions e_i*) are to 1) normal-plot them, hoping to see an approximately linear plot, and 2) plot them against any variables of interest (like, for example, x_1, ..., x_k, ŷ or y, time order of observation, values of any variable potentially of importance but not included in the model, etc.) looking for a pattern that can simultaneously suggest a problem with a current model and identify possible remedial measures. Chapter 7 of the text discusses residual plotting in some detail, and we'll return to this topic later.

Diagnostic Measures/Statistics

A Pooled Sample Standard Deviation and a Lack-of-Fit F Test

In problems where there are one or more (x_1, x_2, ..., x_k) vectors that have multiple responses y, it is possible to make an estimate of σ² that doesn't depend for its appropriateness on the particular form of the relationship between x_1, x_2, ..., x_k and mean y used in the MLR model. That is, if there are g groups of y's, each coming from a single (x_1, x_2, ..., x_k) combination and having a group sample variance s_j², then one can make a kind of "pooled standard deviation" from these as

   s_Pooled = √[ ((n_1 − 1)s_1² + (n_2 − 1)s_2² + ... + (n_g − 1)s_g²) / ((n_1 − 1) + (n_2 − 1) + ... + (n_g − 1)) ]

Provided that σ doesn't change with x_1, x_2, ..., x_k, this is a legitimate estimate of σ, regardless of whether or not one has an appropriate form for μ_{y|x_1, x_2, ..., x_k}. On the other hand, the MLR sample standard deviation s (= √MSE) will tend to overestimate σ if μ_{y|x_1, x_2, ..., x_k} ≠ β_0 + β_1 x_1 + β_2 x_2 + ... + β_k x_k. So informal comparison of s to s_Pooled is a means of doing model diagnosis (a large difference being indicative of poor model fit). This can be made more formal by inventing a related F test statistic. That is, the MLR error sum of squares is sometimes broken down as

   SSE = (n − g)s²_Pooled + (SSE − (n − g)s²_Pooled)

and the (nonnegative) terms on the right of this equation are given the names SSPE ("pure error" sum of squares) and SSLoF ("lack of fit" sum of squares). That is

   SSPE = (n − g)s²_Pooled

and

   SSLoF = SSE − SSPE

In this notation, the statistic

   F = ( SSLoF/((n − k − 1) − (n − g)) ) / ( SSPE/(n − g) )

is an index of whether the two estimates of σ are detectably different, with large values corresponding to cases where s is much larger than s_Pooled. As it turns out, under the MLR model this statistic has an F_{g−k−1, n−g} reference distribution, and there is a formal p-value to be associated with a disparity between s and s_Pooled (the F_{g−k−1, n−g} probability to the right of the observed value). People sometimes even go so far as to add a pair of lines to the MLR ANOVA table, breaking down the "Error" source (and corresponding sum of squares and df) into "LoF" and "Pure Error" components.
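A minimal sketch of the pure error / lack of fit decomposition for a hypothetical one-predictor (k = 1) example with repeated x values; the data are made up.

```python
import numpy as np
from scipy.stats import f

# Hypothetical example: x has g = 4 distinct levels, each with repeat y's,
# and a straight-line (k = 1) model has been fit.
x = np.array([1, 1, 1, 2, 2, 3, 3, 3, 4, 4], dtype=float)
y = np.array([2.0, 2.3, 1.8, 3.1, 2.9, 3.2, 3.6, 3.4, 5.0, 5.3])
n, k = len(y), 1

X = np.column_stack([np.ones(n), x])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
SSE = np.sum((y - X @ b) ** 2)

levels = np.unique(x)
g = len(levels)
SSPE = sum(np.sum((y[x == v] - y[x == v].mean()) ** 2) for v in levels)  # (n-g)*s_Pooled^2
SSLoF = SSE - SSPE

F = (SSLoF / ((n - k - 1) - (n - g))) / (SSPE / (n - g))
p_value = f.sf(F, g - k - 1, n - g)                  # lack-of-fit p-value
print(F, p_value)
```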
R² and s = √MSE

As one searches through myriad possible MLR models potentially describing the relationship between predictors x and a response y, one generally wants a model with a "small" number of predictors (a simple or parsimonious model), a "large" value of R² and a "small" value of s. The naive "solution" to the model search problem of just "picking the biggest possible model" is in fact no solution, in light of the potential for "overfitting" a data set (adopting a model with too many "wiggles" that can nicely reproduce the y's in the data set, but that does a very poor job when used to produce mild extrapolations or interpolations).

Other Functions of SSE

For a given prediction problem (and therefore fixed SSTot), R² and s = √MSE are "equivalent" in the sense that one could be obtained from the other (and SSTot). They don't, however, necessarily produce the same ordering of reduced models of some grand MLR model in terms of "best looking" values of the criteria (e.g. the full model has the largest R² but may not have the smallest s). There are several other functions of SSE that have been suggested as possible statistics for model diagnostics/selection. Among them are Mallows' C_p and Akaike's Information Criterion (the AIC).

Mallows' C_p is based on the fact that the average total squared difference between the n values ŷ_i from a MLR fit and their (real) means μ_i can be worked out theoretically. When divided by σ², this quantity is a quantity Γ_p that is (p + 1) when there is a choice of β's in a p-predictor MLR model that produces correct means for all n data points, and is otherwise larger. Mallows' suggestion for comparing reduced versions of a full (k predictor) MLR model is that, for a reduced model with p ≤ k predictors, one compute

   C_p = SSE_Red/MSE_Full + 2(p + 1) − n

(an estimate of Γ_p if the full k-variable MLR model is correct) and look for small p and C_p no more than about (p + 1). The thinking is that such a reduced model is simpler than the full model and appears to produce predictions comparable to the full model predictions at the (x_1, x_2, ..., x_k) vectors in the data set.

Akaike's Information Criterion is based on considerations beyond the scope of this exposition. For a MLR model with k predictors, it is

   AIC = n ln(SSE/n) + 2(k + 1)

and people look for small values of AIC.

PRESS Statistic

This is a diagnostic measure built on the notion that a model should not be terribly sensitive to individual data points used to fit it, or equivalently that one ought to be able to predict a response even without using that response to fit the model. Beginning with a particular form for a MLR model and n data points, let

   ŷ_(i) = the value of y_i predicted by a model fit to the other (n − 1) data points

(note that this is not necessarily ŷ_i). The "prediction sum of squares" is

   PRESS = Σ_{i=1}^n (y_i − ŷ_(i))²

and one wants small values of this.
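The following hypothetical sketch simply evaluates the C_p and AIC formulas above for a candidate reduced model, assuming the relevant sums of squares have already been computed (the numbers shown are made up).

```python
import numpy as np

def cp_and_aic(SSE_red, p, MSE_full, n):
    """Mallows' Cp for a p-predictor reduced model, and the criterion
    n*ln(SSE/n) + 2*(p + 1) from the notes applied to that same model."""
    Cp  = SSE_red / MSE_full + 2 * (p + 1) - n
    AIC = n * np.log(SSE_red / n) + 2 * (p + 1)
    return Cp, AIC

# Hypothetical sums of squares: full model with k = 5 predictors, n = 30,
# and a candidate reduced model with p = 2 predictors.
n, k = 30, 5
SSE_full = 40.0
MSE_full = SSE_full / (n - k - 1)
SSE_red  = 47.5
print(cp_and_aic(SSE_red, p=2, MSE_full=MSE_full, n=n))
```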
Model Search Algorithms

Given a particular full model (a particular set of k predictors), to do model selection one needs some computer-assisted means of "poking around" in the (often very large) set of possible reduced models looking for good reduced models. Statistical packages must offer some methods for this. Probably the best such methodology is of the "all possible regressions" variety. This kind of routine will, for a given set of k predictors and a choice of a number #, produce a list of the top # models with 1 predictor, the top # models with 2 predictors, ..., the top # models with (k − 1) predictors. (Slick computational schemes make this possible despite the fact that the number of models to be checked grows astronomically with the size of the full model, k.)

An older (and really, inferior and completely ad hoc) methodology attempts to "add in" or "drop out" variables in regression models one at a time on the basis of p-values for (t or F) tests of hypotheses that their regression coefficients are 0. (An algorithm adds in or drops out the most obvious predictor at each step.) This kind of "stepwise regression" methodology can be run in a purely "backwards elimination" mode (that begins with the full model and successively drops single variables), in a purely "forward selection" mode (that begins with a model containing only the predictor most highly correlated with y), or in a "mixed" mode that at any step can either add or drop a predictor variable depending upon the p-values for adding and dropping. JMP has implemented stepwise regression as its model searching tool.

The reason that stepwise searching is inferior to all-possible-regressions searches is that when one by some stepwise means or another gets to a particular p-variable model, one is NOT guaranteed that such a model is even the "best" one available of size p according to an R² (or any other) criterion. Only the exhaustive search provided by an all-possible-regressions algorithm produces such a guarantee.

Creating New Variables from Existing Ones

Where neither a full MLR model for y in terms of all available predictors x_1, ..., x_k, nor any reduction of it, is satisfactory, one is far from "out of tricks to pull" in the quest to find a useful means of predicting y. One obvious possibility is to replace y and/or one or more of the x's with "transformed" versions of themselves. One might take square roots or logarithms (or ?????) of the responses and/or one or more of the predictors, do the modeling and inference, and "untransform" back to original scales of measurement in order to interpret the inferences (by squaring or exponentiating, or ?????).

Another possibility is to fit models that are not simply linear in the predictor(s) but quadratic, or cubic, or ... For example, the full quadratic MLR regression model for y in the k = 2 predictors x_1 and x_2 is

   y = β_0 + β_1 x_1 + β_2 x_2 + β_3 x_1² + β_4 x_2² + β_5 x_1 x_2 + ε

For a single predictor x, JMP will do the fitting of a polynomial for y under the Fit Y by X menu. For k ≥ 2 different x's, one needs to create powers of individual predictors and cross product terms by using the "cross" button to add them to the "Effects in Model" portion of the dialog box (or to use the "Response Surface" macro to fill that portion of the box after highlighting the original predictors in the list of columns) under the JMP Fit Model menu.
The appearance of the cross product term in the quadratic model above raises the possibility of using predictors that are functions of more than one of a set of basic variables x_1, ..., x_k. Such terms are often called "interaction" terms. A model without interaction terms is sometimes called "additive" in that a mean response is obtained by simply adding to an overall β_0 the separate contributions due to each of the k predictors. For an additive model (one without interactions), plots of mean y against any one of the predictors, say x_j, are parallel for different sets of the other x's. Models with interactions have plots of mean y versus an x_j involved in an interaction that are NOT parallel.

Qualitative Factors/Inputs and Dummy Variables

At first look, it would seem that MLR has nothing to say about problems where some or all of the basic system inputs that determine the nature of a response are qualitative rather than quantitative. But as a matter of fact, with the proper amount of cleverness, it's possible to put even qualitative factors into the MLR framework. Consider a factor, call it A, that has I possible "levels" or settings. (A could, for example, be something like employee gender with I = 2.) It is then possible to represent A in MLR notation through the creation of I − 1 dummy variables. That is, one defines

   x_{A1}     = 1 if the observation is from level 1 of A, 0 otherwise
   x_{A2}     = 1 if the observation is from level 2 of A, 0 otherwise
   ...
   x_{A,I−1}  = 1 if the observation is from level I − 1 of A, 0 otherwise

Then, the model

   y = β_0 + β_1 x_{A1} + β_2 x_{A2} + ... + β_{I−1} x_{A,I−1} + ε

says that observations are normal with standard deviation σ and mean

   β_0 + β_1       if the observation is from level 1 of A
   β_0 + β_2       if the observation is from level 2 of A
   ...
   β_0 + β_{I−1}   if the observation is from level I − 1 of A
   β_0             if the observation is from level I of A

All of the MLR machinery is available to do inference for the β's and sums and differences thereof (that amount to means and differences in mean responses under various levels of A).
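A minimal sketch of this 0/1 dummy coding, with a hypothetical three-level factor; the least squares coefficients recover the level I mean (the intercept) and the differences of the other level means from it.

```python
import numpy as np

# Hypothetical qualitative factor A with I = 3 levels, coded as in the text:
# x_A1 = 1 for level 1, x_A2 = 1 for level 2, and level 3 gets all zeros.
levels = np.array([1, 2, 3, 1, 2, 3, 1, 2, 3, 3])
y = np.array([5.1, 6.8, 4.2, 5.4, 7.1, 4.0, 4.9, 6.5, 4.4, 4.1])
I = 3

xA = np.column_stack([(levels == j).astype(float) for j in range(1, I)])
X  = np.column_stack([np.ones(len(y)), xA])

b, *_ = np.linalg.lstsq(X, y, rcond=None)
# b[0] estimates the level-I mean; b[0] + b[j] estimates the level-j mean.
print(b, [b[0] + b[j] for j in range(1, I)] + [b[0]])
```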
The 0/1 coding above is the one taken in the textbook. But other (equivalent) versions of this business are possible. Two are of special interest because of the way that JMP does its automatic coding for qualitative "nominal" and "ordinal" variables. That is, a first alternative to the method above is to do what JMP seems to do for "nominal" variables. Define I − 1 variables

   x'_{A1}    = 1 if the observation is from level 1 of A, −1 if from level I of A, 0 otherwise
   x'_{A2}    = 1 if the observation is from level 2 of A, −1 if from level I of A, 0 otherwise
   ...
   x'_{A,I−1} = 1 if the observation is from level I − 1 of A, −1 if from level I of A, 0 otherwise

The model

   y = β_0 + β_1 x'_{A1} + β_2 x'_{A2} + ... + β_{I−1} x'_{A,I−1} + ε

then says that observations are normal with standard deviation σ and mean

   β_0 + β_1                           if the observation is from level 1 of A
   β_0 + β_2                           if the observation is from level 2 of A
   ...
   β_0 + β_{I−1}                       if the observation is from level I − 1 of A
   β_0 − (β_1 + β_2 + ... + β_{I−1})   if the observation is from level I of A

With this coding, the sum of the means is I·β_0 and thus β_0 is the arithmetic average of the I means. The other β's are then deviations of the "first" I − 1 means from this arithmetic average of the I means.

A third version of this is what JMP seems to do for "ordinal" variables. Define I − 1 variables

   x''_{A2} = 1 if the observation is from level 2, 3, ..., or I of A, 0 otherwise
   x''_{A3} = 1 if the observation is from level 3, 4, ..., or I of A, 0 otherwise
   ...
   x''_{AI} = 1 if the observation is from level I of A, 0 otherwise

Then, the model

   y = β_0 + β_2 x''_{A2} + β_3 x''_{A3} + ... + β_I x''_{AI} + ε

says that observations are normal with standard deviation σ and mean

   β_0                     if the observation is from level 1 of A
   β_0 + β_2               if the observation is from level 2 of A
   β_0 + β_2 + β_3         if the observation is from level 3 of A
   ...
   β_0 + Σ_{i=2}^I β_i     if the observation is from level I of A

The "intercept" here is the mean for the first level of A, and the other β's are the differences between means for successive levels of A (in the order 1 through I).

Once one has seen this idea of using I − 1 dummies to represent a single qualitative factor with I levels, it is easy to go on and include more than one qualitative factor in a model (through the use of a second set of dummies) and to create interactions involving a qualitative factor (by taking products with all of its dummies), etc. It is worth considering in detail what one gets from using dummy variables where there are two qualitative factors and all possible combinations of levels of those factors are represented in the data set. (The standard jargon for "all possible combinations of I levels of A and J levels of B represented in a data set" is that the data contain a "(full) two-way factorial in the factors A and B.") The table below shows the I·J different combinations of levels of the factors laid out in a table, with "cell mean" responses filling the cells.

                       Factor B
                   1      2     ...   J
   Factor A   1    μ_11   μ_12  ...   μ_1J
              2    μ_21   μ_22  ...   μ_2J
              ...
              I    μ_I1   μ_I2  ...   μ_IJ

It is reasonable to ask what dummy variables can provide in terms of modeling a response in this context. Since it is the most sensible coding provided automatically by your software, let us consider the x' coding above, instead of the x coding discussed in Chapter 5 of your text, in what follows. For i = 1, 2, ..., I − 1 let

   x'_{Ai} = 1 if the observation is from level i of A, −1 if from level I of A, 0 otherwise

and for j = 1, 2, ..., J − 1 let

   x'_{Bj} = 1 if the observation is from level j of B, −1 if from level J of B, 0 otherwise

A MLR regression model for response y (first) involving only the dummies themselves is

   y = β_0 + β_{A1} x'_{A1} + β_{A2} x'_{A2} + ... + β_{A,I−1} x'_{A,I−1} + β_{B1} x'_{B1} + β_{B2} x'_{B2} + ... + β_{B,J−1} x'_{B,J−1} + ε    (*)

This (no-interactions) model says that for i ≤ I − 1 and j ≤ J − 1

   μ_ij = β_0 + β_{Ai} + β_{Bj}

while for j ≤ J − 1

   μ_Ij = β_0 − (Σ_{i=1}^{I−1} β_{Ai}) + β_{Bj}

for i ≤ I − 1

   μ_iJ = β_0 + β_{Ai} − (Σ_{j=1}^{J−1} β_{Bj})

and

   μ_IJ = β_0 − (Σ_{i=1}^{I−1} β_{Ai}) − (Σ_{j=1}^{J−1} β_{Bj})

The no-interactions model says that with level of A (B) held fixed, as one moves across levels of B (A), the mean responses are changed by adding different β's to β_0, and that the same addition would be done on every fixed level of A (B). There are "parallel traces of means" as one moves across levels of B (A) for the different levels of A (B).

Consider too what one gets from averaging means across rows and down columns in the two-way table.
Letting a dot subscript indicate that one has averaged out over the missing subscript, one can extend the table above to get

                       Factor B
                   1      2     ...   J
   Factor A   1    μ_11   μ_12  ...   μ_1J    μ_1.
              2    μ_21   μ_22  ...   μ_2J    μ_2.
              ...
              I    μ_I1   μ_I2  ...   μ_IJ    μ_I.
                   μ_.1   μ_.2  ...   μ_.J    μ_..

Adding the above expressions for the μ_ij in terms of the β's across a row and dividing by J, it becomes clear that for i ≤ I − 1,

   β_{Ai} = μ_i. − μ_..

the difference between the row average mean and the grand mean. Similarly, adding the above expressions for the μ_ij in terms of the β's down a column and dividing by I, it becomes clear that for j ≤ J − 1,

   β_{Bj} = μ_.j − μ_..

the difference between the column average mean and the grand mean. These functions of the means μ_ij are common summaries of a complete two-way table of I·J means, and standard jargon is that the

   "main effect of A at its ith level" = μ_i. − μ_..

while the

   "main effect of B at its jth level" = μ_.j − μ_..

Our exposition here says that the regression coefficients β with the JMP "nominal" coding of qualitative factors are (at least in the no-interaction model) exactly the factor main effects. Note that for the last level of the factors, the fact that the μ_i. − μ_.. (and the μ_.j − μ_..) sum to zero means that the main effect for the last level of a factor is the negative sum of the other main effects.

Now consider what happens when one adds to the no-interaction model (*) all cross products of the x'_{Ai} and x'_{Bj} terms. One then has a model with number of predictors

   "k" = (I − 1) + (J − 1) + (I − 1)(J − 1) = IJ − 1

It should not then be completely surprising that such a model allows for any possible choice of the IJ means μ_ij. That is, the model with all the A dummies, the B dummies and the products of A and B dummies in it is really equivalent to starting with IJ levels of a single omnibus factor and making up IJ − 1 dummies from scratch. The advantage of using the present coding (instead of starting all over and making up a new coding for a single omnibus factor) is that the cross product terms are interpretable. That is, as it turns out, a model extending (*) by including all possible cross product terms still ends up implying that for i ≤ I − 1

   β_{Ai} = μ_i. − μ_..

and for j ≤ J − 1

   β_{Bj} = μ_.j − μ_..

and then that for i ≤ I − 1 and j ≤ J − 1

   μ_ij = β_0 + β_{Ai} + β_{Bj} + β_{ABij}

so that for such i and j

   β_{ABij} = μ_ij − (β_0 + β_{Ai} + β_{Bj})
            = μ_ij − (mean from the no-interaction model)
            = μ_ij − (grand mean + ith A main effect + jth B main effect)

It is common to define for all i and j

   interaction of A at level i and B at level j = μ_ij − (mean from the no-interaction model)
                                                = μ_ij − (grand mean + ith A main effect + jth B main effect)

so that the (I − 1)(J − 1) cross product β's can be interpreted as "interactions" measuring departure from additivity/parallelism. As it turns out, the "interactions" for a given row or column sum to 0, so that one can get them for the last row (level I of A) or last column (level J of B) as the negative sum of the others in the corresponding column or row.

Notice that, considering a full model including all A dummies, all B dummies and all A·B dummy products, various reduced models have sensible interpretations. The reduced model

   y = β_0 + β_{A1} x'_{A1} + β_{A2} x'_{A2} + ... + β_{A,I−1} x'_{A,I−1} + β_{B1} x'_{B1} + β_{B2} x'_{B2} + ... + β_{B,J−1} x'_{B,J−1} + ε

is obviously one of "no A × B interactions." (This model says that there are parallel traces on a plot of mean y versus level of one of the factors.) The reduced model

   y = β_0 + β_{A1} x'_{A1} + β_{A2} x'_{A2} + ... + β_{A,I−1} x'_{A,I−1} + ε

is one of "A main effects only."
(This model says that all means in a given row are the same.) And the reduced model

   y = β_0 + β_{B1} x'_{B1} + β_{B2} x'_{B2} + ... + β_{B,J−1} x'_{B,J−1} + ε

is one of "B main effects only." (This model says that all means in a given column are the same.) Of course, all of the MLR regression machinery is then available for doing everything from making confidence intervals for the factorial effects (that are β's or linear combinations thereof), to looking at plots of mean responses in JMP, to predicting new responses at particular combinations of levels of the factors, to testing reduced models against the full model, etc.

Dummy Variables and Piece-Wise Regressions

Dummy variables are very useful objects. We saw above that they can be used to incorporate basically qualitative information into a MLR analysis. They can also be used to allow one to "piece together" different fitted curves defined over different regions. To illustrate what is possible, suppose that one wants to model y so that μ_{y|x} is a continuous function of x, having the properties that for k_1 < k_2 (the known locations of so-called "knots"), μ_{y|x} is a linear function of x for x ≤ k_1, a possibly different linear function of x for k_1 ≤ x ≤ k_2, and a yet possibly different linear function of x for k_2 ≤ x. This can be done as follows. Define two dummy variables

   x_1 = 0 if x < k_1, 1 if k_1 ≤ x

and

   x_2 = 0 if x < k_2, 1 if k_2 ≤ x

A model with the target properties is then

   y = β_0 + β_1 x + β_2 (x − k_1)x_1 + β_3 (x − k_2)x_2 + ε

Notice that this model has

   μ_{y|x} = β_0 + β_1 x                                       for x ≤ k_1
   μ_{y|x} = (β_0 − β_2 k_1) + (β_1 + β_2)x                    for k_1 ≤ x ≤ k_2
   μ_{y|x} = (β_0 − β_2 k_1 − β_3 k_2) + (β_1 + β_2 + β_3)x    for k_2 ≤ x

The same kind of thing can be done with other numbers of knots and with higher order polynomials (like, e.g., quadratics or cubics). For higher order polynomials, it can even be done in a way that forces the curve defined by μ_{y|x} to not have "sharp corners" at any knot. All of this is still in the framework provided by MLR.
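A minimal sketch of the piecewise-linear construction above, with hypothetical data and made-up knot locations k_1 and k_2.

```python
import numpy as np

# Hypothetical piecewise-linear data with known knots k1 < k2.
k1, k2 = 4.0, 7.0
x = np.linspace(0.0, 10.0, 21)
rng = np.random.default_rng(1)
y = (1.0 + 0.5 * x + 0.8 * np.maximum(x - k1, 0.0)
     - 1.5 * np.maximum(x - k2, 0.0) + rng.normal(scale=0.2, size=x.size))

x1 = (x >= k1).astype(float)                     # dummy: 1 when k1 <= x
x2 = (x >= k2).astype(float)                     # dummy: 1 when k2 <= x
X  = np.column_stack([np.ones(x.size), x, (x - k1) * x1, (x - k2) * x2])

b, *_ = np.linalg.lstsq(X, y, rcond=None)        # b0, b1, b2, b3 as in the text
print(b)
```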
Diagnostic Plots and More Diagnostic Measures

There are various kinds of residuals, ways of plotting them, and measures of "influence" on a regression that are meant to help in the black art of model building. We have already alluded to the fact that under the MLR model we expect ordinary residuals

   e_i = y_i − ŷ_i

to look like mean 0 normal random noise, and that standardized residuals

   e_i* = e_i/(standard error of e_i)

should look like standard normal random noise. In the context of defining the PRESS statistic we alluded to the notion of deleted residuals

   e_(i) = y_i − ŷ_(i)

and the hope that if a model is a good one and not overly sensitive to the exact data vectors used to fit it, these shouldn't be ridiculously larger in magnitude than the regular residuals, e_i. This does not exhaust the ways in which people have suggested using the residual idea. It is possible to invent standardized/Studentized deleted residuals

   e*_(i) = e_(i)/(standard error of e_(i))

and there are yet other possibilities.

Partial Residual Plots (JMP "Effect Leverage Plots")

In somewhat nonstandard language, SAS/JMP makes what it calls "effect leverage plots" that accompany its "effect tests." These are based on another kind of residuals, sometimes called partial residuals. With k predictor variables, I might think about understanding the importance of variable j by considering residuals computed using only the other k − 1 predictor variables to do prediction (i.e. using a reduced model not including x_j). Although it is nearly impossible to see this from the manual and help functions, or from how the axes of the plots are labeled, the effect leverage plot in JMP for variable j is a plot of

   e_(j)(y_i) = the ith y residual from regressing on all predictor variables except x_j

versus

   e_(j)(x_ji) = the ith x_j residual from regressing on all predictor variables except x_j

On this plot there is a horizontal line drawn (ostensibly at ȳ) that really represents a y partial residual equal to 0 (y perfectly predicted by all predictors excepting x_j). (The vertical axis IS in the original y units, but should not really be labeled as y, but rather as partial residual.) The sum of squared vertical distances from the plotted points to this line is then SSE for a model without predictor j. The horizontal plotting positions of the points are in the original x_j units, but are partial residuals of the x_j's, NOT the x_j's themselves. The horizontal center of the plot is at an x_j partial residual of 0, not at x̄_j as JMP (inaccurately) represents things. The non-horizontal line on the plot is in fact the least squares line through the plotted points. What is interesting is that the usual residuals from that least squares line are the residuals for the full MLR fit to the data. So the sum of the squared vertical distances from the points to the sloped line is then SSE for the full model. The larger the reduction in SSE from the horizontal line to the sloped one, the smaller the p-value for testing H0: β_j = 0.

Highlighting a point on a JMP partial residual plot makes it bigger on the other plots and highlights it in the data table (for examination or, for example, potential exclusion). We can at least on these plots see which points are fit poorly in a model that excludes a given predictor, and the effect the addition of that last predictor has on the prediction of that y. (Note that points near the center of the horizontal scale are ones that have x_j that can already be predicted from the other x's, and so addition of x_j to the prediction equation does not much change the residual. Points far to the right or left of center have values of predictor j that are unlike their predictions from the other x's. They both tend to more strongly influence the nature of the change in the model predictions as x_j is added to the model, and tend to have their residuals more strongly affected than points in the middle of the plot, where x_j might be predicted from the other x's.)

Leverage

The notion of how much potential influence a single data point has on a fit is an important one. The JMP partial residual plot/"effect leverage" plot is aimed at addressing this issue by highlighting points with large x_j partial residuals. Another notion of the same kind is based on the fact that there are n² numbers h_{ii'} (i = 1, ..., n and i' = 1, ..., n), depending upon the n vectors (x_{1i}, x_{2i}, ..., x_{ki}) only (and not the y's), so that each ŷ_i is

   ŷ_i = h_{i1} y_1 + h_{i2} y_2 + ... + h_{i,i−1} y_{i−1} + h_{ii} y_i + h_{i,i+1} y_{i+1} + ... + h_{in} y_n

h_{ii} is then somehow a measure of how heavily y_i is counted in its own prediction and is usually called the leverage of the corresponding data point. It is a fact that 0 < h_{ii} < 1 and Σ_{i=1}^n h_{ii} = k + 1. So the h_{ii}'s average to (k + 1)/n, and a plausible rule of thumb is that when a single h_{ii} is more than twice this average value, the corresponding data point has an important (x_{1i}, x_{2i}, ..., x_{ki}). It is not at all obvious, but as it turns out, the PRESS statistic has the formula

   PRESS = Σ_{i=1}^n ( e_i/(1 − h_{ii}) )²

involving these leverage values. This shows that big PRESS values occur when big leverages are associated with large ordinary residuals.
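A minimal sketch of the leverage and PRESS computations above, using the matrix form H = X(X'X)⁻¹X' (the matrix notation the notes otherwise avoid); the data are simulated and purely hypothetical.

```python
import numpy as np

# Leverages h_ii from the "hat" matrix H = X (X'X)^{-1} X', plus the PRESS
# statistic computed from ordinary residuals and leverages as in the text.
rng = np.random.default_rng(2)
n, k = 15, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)                                    # leverages; they sum to k + 1
b, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ b

PRESS = np.sum((e / (1 - h)) ** 2)
flagged = np.where(h > 2 * (k + 1) / n)[0]        # rule-of-thumb "high leverage" points
print(h.sum(), PRESS, flagged)
```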
Cook's D

The leverage h_{ii} involves only predictors and no y's. A proposal by Cook to measure the overall effect that point i has on the regression is the statistic

   D_i = ( e_i/(1 − h_{ii}) )² · h_{ii}/((k + 1)MSE)

where large values of this supposedly identify points that, by virtue of either their leverage or their large ordinary residual, are "influential." D_i is Cook's Distance.

0-1 Responses

Sometimes the response, y, is an indicator of whether or not some event of interest has occurred

   y = 1 if the event occurs, 0 if the event does not occur

It is possible to think about assessing the impact of some predictor variables x_1, x_2, ..., x_k "on y." But ordinary regression analysis is not the right vehicle for doing so. The standard regression models have normal y's, not 0-1 y's. Here the mean of y is

   μ_{y|x_1, x_2, ..., x_k} = P[y = 1] = p

and is constrained to be between 0 and 1. The most commonly available technology for doing inference here is so-called "logistic regression," which says that the "log odds ratio is linear in x_1, x_2, ..., x_k." That is, the assumption is that

   ln( p/(1 − p) ) = β_0 + β_1 x_1 + β_2 x_2 + ... + β_k x_k      (**)

and the β's become the increase in log odds ratio for a unit increase in a predictor, the other predictors held fixed. (Note that as the log odds ratio increases, p increases, a log odds ratio of 0 corresponding to p = .5.) The actual technology required to fit the relationship (**) is more complicated than least squares, and the methods of inference are based on different mathematics. But using a package like JMP, these differences are largely invisible to a user, and one can reason from the JMP report mostly by analogy to ordinary regression. Both "Fit Y by X" and "Fit Model" in JMP will automatically fit (**) if one gives it a nominal response variable. One bit of confusion that is possible here concerns the fact that JMP knows numerical order and alphabetical order. So if you ask it to do logistic regression, it will do so for p corresponding to what it considers to be the "first" category. The 0-1 coding that most people use thus ends up giving the fit for (1 − p) instead of p. So with JMP and 0's and 1's as above, and wishing to do inference for p, one needs to use y' = 1 − y as the response in order to get the signs of the fitted regression coefficients, b, correct.
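The following hypothetical sketch fits the log-odds relationship (**) by maximum likelihood, assuming the Python statsmodels package is available. Unlike the JMP behavior described above, this routine models P[y = 1] directly, so no recoding of the response is needed.

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical 0-1 response data; Logit fits ln(p/(1-p)) = b0 + b1*x by
# maximum likelihood (not least squares).
rng = np.random.default_rng(3)
x = rng.normal(size=100)
p_true = 1 / (1 + np.exp(-(0.5 + 1.2 * x)))
y = rng.binomial(1, p_true)                       # y = 1 when the event occurs

X = sm.add_constant(x)
fit = sm.Logit(y, X).fit(disp=0)                  # models P[y = 1] directly
print(fit.params)                                 # estimated b0, b1
```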
Some Ideas About Looking at Data Over Time

A fundamental issue in business is the detection and prediction of change over time. What follows, as a wrap-up of this course, is a brief introduction to some statistical methods useful in this enterprise.

Shewhart "Control" Charts

Working over 70 years ago at Bell Labs (and originally interested in manufacturing), Walter Shewhart invented the so-called "control chart" as a process monitoring device, aimed at change detection. (He correctly reasoned that unnecessary process variation over time degrades performance of both the process and any product it produces. He therefore sought to invent means of detecting unnecessary/removable process change and avoiding ill-advised reaction to purely "random/inherent" fluctuation.) His method was to plot summary statistics from samples taken over time, and to compare them to "control limits" separating "common" values from "unusual" values of the same. His "control limits" for a statistic W are

   LCL_W = μ_W − 3σ_W   and   UCL_W = μ_W + 3σ_W

For various W, probability theory is called upon to supply the appropriate means and standard deviations. Some standard types of charts and their corresponding means and standard deviations (for use in control limits) are given in the following table.

   Charted Statistic                          Mean     Standard Deviation
   x̄ (sample mean of n measurements)         μ        σ/√n
   s (sample std dev of n measurements)       c_4 σ    σ √(1 − c_4²)
   R (sample range of n measurements)         d_2 σ    d_3 σ
   û (sample mean "defects per unit")         λ        √(λ/k)
   p̂ (sample fraction "defective")            p        √(p(1 − p)/n)

In the table, μ and σ are the mean and standard deviation of a supposedly normal "stable process distribution"; c_4, d_2 and d_3 are "control chart constants" derived from normal distribution assumptions and depending upon n; λ is a "stable process mean rate of occurrence of 'defects'" and û is a rate based on a count from k units; p is the "stable process rate of producing 'defectives'" and p̂ is based on n items.

Where past experience supplies values for the process parameters (μ and/or σ, λ or p), one can apply the control limits to data as they are generated and signal alarms/the need for intervention/corrective action when out-of-control points are detected, on-line. Where there is no such past experience available, one must take several samples of data in hand, temporarily assume process stability and estimate the parameters, and apply the limits only "retrospectively" to the data in hand. There is an entire "SPC" culture built around various variations on exactly how these estimates are made, etc.

Time Series Ideas

A time series, y_t, is (cleverly enough) a series of numbers collected with a time order, t, attached. There is a huge statistical literature on the modeling and prediction of time series of measurements. To finish off Stat 328 we will talk about only the most elementary ideas of the subject, ones that are supported by JMP-IN 3.2.6 and discussed in Chapter 16 of the JMP Start Statistics book. (JMP 4.0 offers far more than JMP 3 in this area.)

The default implicit assumption in (Shewhart control charting and) one-sample inference is that a data-generating mechanism is producing what look like "iid" successive measurements. In many business contexts, this is clearly NOT a sensible description of reality, and other tools are needed beyond simple parameter estimates for a single distribution. To begin with, many real time series have a clear long-term trend in them. A first step in their analysis is to account for that trend (and sometimes to "remove" it by subtraction). There are a variety of old-style smoothing methods that have been employed in this enterprise. JMP-IN offers a very nice modern facility in its "Fit Spline" option in "Fit Y by X." It is also possible to fit low order polynomials to the y_t using ordinary regression, i.e. to fit equations like

   y_t = b_0 + b_1 t + b_2 t²

in the "Fit Y by X" routine. One advantage of fitting a function of t like this (rather than using another more flexible smoother) is that one then has an equation for limited extrapolation/prediction into the future.

After accounting for a trend, there often remain medium-term seasonal or business cycle effects in business time series. One way to try to model these is using dummy variables. That is, if I invent a nominal variable "month" or "quarter" and use it in a regression for y_t (or a detrended version of y_t obtained by subtracting from it a smoothed series), I can often capture an adjustment for "period" in terms of estimated regression coefficients for the dummies JMP creates.
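A minimal sketch of trend-plus-seasonal fitting for a hypothetical quarterly series; the trend, the seasonal pattern, and all names are made up for illustration.

```python
import numpy as np

# Hypothetical quarterly series y_t: fit a linear trend plus quarter dummies
# by ordinary least squares, in the spirit of the discussion above.
rng = np.random.default_rng(5)
t = np.arange(1, 41)                              # 10 years of quarterly data
quarter = (t - 1) % 4                             # 0, 1, 2, 3, 0, 1, ...
seasonal = np.array([2.0, -1.0, 0.5, -1.5])
y = 10 + 0.3 * t + seasonal[quarter] + rng.normal(scale=0.5, size=t.size)

D = np.column_stack([(quarter == q).astype(float) for q in range(3)])  # 3 dummies
X = np.column_stack([np.ones(t.size), t, D])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted    = X @ b                                 # fitted trend + seasonal component
detrended = y - fitted                            # residual series e_t
print(b)
```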
Another useful way of operating with either raw time series or ones that have had trends and/or seasonal effects removed is through the use of "autoregressions." That is, plots of e_t versus e_{t−1} (or e_{t−2}, etc.) (residual series developed by removing trend and seasonal components) will often show substantial correlations. That suggests using the "lagged values" e_{t−1}, e_{t−2}, ... as "predictor variables" in a regression fit like

   ê_t = b_0 + b_1 e_{t−1} + b_2 e_{t−2}

This way of operating is not only often effective, it has the virtue of allowing one-step-ahead forecasts to be made from the present.
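A minimal sketch of the order-two autoregression just described, fit by ordinary least squares to a simulated (hypothetical) residual series.

```python
import numpy as np

# Regress e_t on its lagged values e_{t-1} and e_{t-2} for an already
# detrended/deseasonalized residual series e_t (simulated here).
rng = np.random.default_rng(4)
e = np.zeros(60)
for t in range(2, 60):                            # simulate an AR(2)-like series
    e[t] = 0.6 * e[t - 1] - 0.2 * e[t - 2] + rng.normal(scale=0.5)

Y = e[2:]                                         # e_t
X = np.column_stack([np.ones(Y.size), e[1:-1], e[:-2]])   # 1, e_{t-1}, e_{t-2}
b, *_ = np.linalg.lstsq(X, Y, rcond=None)         # b0, b1, b2

e_next = b[0] + b[1] * e[-1] + b[2] * e[-2]       # one-step-ahead forecast
print(b, e_next)
```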