CHE 4372/5372 – Chapter 2: ANOVA and Regression
Joe Gauthier – Spring 2023
Last updated March 26, 2023

Learning goals:
• Understand the analysis of variance (ANOVA):
  – why it is useful
  – how it is performed
  – how it is analyzed
  – how to use JMP for ANOVA
• Understand regression models and how they connect to ANOVA
• Learn how to perform residual analysis
• Learn how to evaluate the assumptions of normal responses and constant variance

Before we start, a brief recap of terminology from Chapter 0:
• response: the dependent variable (i.e. what we measure or would like to predict)
• factor: an independent (explanatory) variable
• level or treatment: specific values of the factor at which experimental runs are performed
• replicates: identical runs for a given experiment

Recall also the types of variables:
• Continuous variables are always numeric
• Discrete (categorical) variables may be numeric or character; all character variables are discrete, but not all discrete variables are character

Figure 1: Data types in JMP

The previous chapter involved analysis of data from a single–factor, two–level experiment. For example, with the lab rat weights, we had:
• response: weight of rats
• factor: group
• levels: 1, 2
• n = 5, 8

Similarly, when analyzing the fuel economy of the mid–sized SUV, we had:
• response: miles per gallon
• factor: model of car
• levels: Ford Explorer, Chevy Blazer
• n = 8, 9

In this chapter, we will focus on experiments studied at a levels with n replicates performed at each level.

1 Analysis of Variance (ANOVA)

Let's start with an example. Suppose we want to study the effect of cotton content on the tensile strength of a composite fiber containing cotton and synthetic materials (e.g. polyester). Here we have:
• response: tensile strength
• factor: weight % cotton in the fiber
• levels: 15, 20, 25, 30, 35% (a = 5)
• replicates: n = 5

The experimental data has been entered into JMP and can be found on Blackboard: Lecture content > Chapter 2 > polycotton_fiber_tensile_strength.JMP

Figure 2: Cotton tensile strength data as entered in JMP

Five measurements were made for each treatment using a completely randomized design. Note in particular that the factor is specified to be a numeric, nominal variable, and the response is a continuous variable. Just as we did before with the two–level comparisons, we can use the Fit Y by X platform in JMP, and see the following:

Figure 3: Result of Fit Y by X analysis in JMP. Display options enabled: jittered points, box and whisker plot.

Do you think the mean tensile strength depends on the weight percent of cotton? What about the variance? A reasonable question to ask is whether or not there is a significant difference between the mean tensile strength at different levels. In other words, does the factor (wt% cotton) have an effect on the response? Using what we learned in the previous chapter, we could look at each pair, and there are in fact 10 different pairs to consider:

Figure 4: Needed pairwise t tests to rigorously determine the effect of cotton weight percent on the mean strength.

If each pairwise test is performed at a significance level α, then the probability of correctly not rejecting the null hypothesis when H0 is in fact true is (1 − α) for each test. If H0 is true for all m pairs, then the probability of not committing any type I errors is (1 − α)^m – much lower!
• For α = 0.05 and m = 10, (1 − 0.05)^10 ≈ 0.60; the probability of committing at least one type I error is thus about 40%.
• For α = 0.01 and m = 10, the probability of committing a type I error is about 10%.

So, we need an alternative approach. Let's summarize the data reporting for a single–factor experiment such as the cotton tensile strength example, illustrated in the table below.

Table 1: Summary of data reporting for a single factor experiment. The number of treatments or levels is given as a, the number of replicate observations (for each treatment) is given as n, and N = an is the total number of experiments/observations.

treatment (level) | observed responses      | totals | averages
1                 | y_11  y_12  ...  y_1n   | y_1•   | ȳ_1•
2                 | y_21  y_22  ...  y_2n   | y_2•   | ȳ_2•
⋮                 | ⋮                       | ⋮      | ⋮
a                 | y_a1  y_a2  ...  y_an   | y_a•   | ȳ_a•
                  |                         | y_••   | ȳ_••

We define a few new variables under the 'totals' and 'averages' columns, which will simplify our notation later:

$$ y_{i\bullet} = \sum_{j=1}^{n} y_{ij} \qquad (1) $$
$$ y_{\bullet\bullet} = \sum_{i=1}^{a}\sum_{j=1}^{n} y_{ij} \qquad (2) $$
$$ \bar{y}_{i\bullet} = y_{i\bullet}/n \qquad (3) $$
$$ \bar{y}_{\bullet\bullet} = y_{\bullet\bullet}/N \qquad (4) $$

This might seem confusing at first. To understand the notation, ȳ_i• is just the mean of all observations of treatment i. In the case of the cotton tensile strength, this might refer to the mean tensile strength of all observations at 30 wt%. ȳ_•• is referred to as the 'grand' mean, which is just the average of all observations. In the case of the cotton tensile strength, this would just be the average of all 25 observations.

Returning to our question: does the factor affect the response? If the answer is no, then all the y values are random samples from the same population distribution, regardless of the factor level. The variability between samples at the same factor level (i.e. within treatments) will then be the same as the variability between samples at different factor levels (i.e. between treatments). Again returning to our example, this would suggest that the mean and variance of the tensile strength are the same within and between cotton weight percentages. However, if the factor does affect the response, then the variation between treatments should be greater than the variation within treatments.

Strategy: we can propose a mathematical model to describe how the response depends on the factor, and then perform statistical inference tests on the parameters of this model.

1.1 Empirical models

In this class, we will only consider problems in which the response variable is continuous. The factors (i.e. independent variables) may be discrete or continuous, leading to,
• discrete–factor effects models
• continuous–factor effects models
• mixed–factor effects models

The key point to take away is that hypotheses will now be made about model parameters, not parameters of a statistical distribution. In particular, we will focus on building empirical models that propose an additive (but not necessarily linear) relationship between the response and factor(s).

$$ \text{response} = \text{intercept} + \text{effect}_1 + \text{effect}_2 + ... + \text{error} \, . \qquad (5) $$

Each additive term in this model equation between the intercept and error terms is an effect. Each effect term is composed of one or more parameters. The error term is assumed to be random and normally distributed (as predicted by the Central Limit Theorem). Here's an example of a continuous–factor effects model similar to others we'll see again soon:

$$ y = \beta_0 + \beta_1 T + \beta_2 P + \beta_3 TP + \beta_4 T^2 + \varepsilon \qquad (6) $$

Here, β1T is the main/linear effect of factor T, β2P is the main/linear effect of factor P, β3TP is the cross (or interaction) effect of factors T and P, and β4T² is the quadratic effect of T.
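To make the additive structure of Eqs. (5)–(6) concrete, here is a minimal Python sketch that simulates one observation from a continuous–factor effects model. Every number in it (the coefficients, the error standard deviation, and the T and P operating point) is hypothetical and chosen only to show that each additive term is a separate effect and that the error term is a random normal draw.

```python
# A minimal sketch of the continuous-factor effects model in Eq. (6),
# y = b0 + b1*T + b2*P + b3*T*P + b4*T^2 + error, using made-up coefficients
# purely to illustrate the additive structure of an effects model.
import numpy as np

rng = np.random.default_rng(0)

# hypothetical "true" parameters (not from any real data set)
b0, b1, b2, b3, b4 = 10.0, 0.5, -2.0, 0.05, -0.01
sigma = 1.0          # standard deviation of the random error term

def simulate_response(T, P):
    """Additive effects model: intercept + main effects + interaction + quadratic + error."""
    main_T      = b1 * T
    main_P      = b2 * P
    interaction = b3 * T * P
    quadratic   = b4 * T**2
    error       = rng.normal(0.0, sigma)   # epsilon ~ N(0, sigma^2)
    return b0 + main_T + main_P + interaction + quadratic + error

# one simulated observation at a hypothetical operating point T = 50, P = 2
print(simulate_response(50.0, 2.0))
```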
Note that the factors' variable types determine the possible types of effects and also the number of model parameters.

Discrete–factor effects models

Consider a mathematical model for describing the main effect of a single discrete factor:

$$ y_{ij} = \mu + \tau_i + \varepsilon_{ij} \qquad \begin{cases} i = 1, 2, ..., a \\ j = 1, 2, ..., n \end{cases} \qquad (7) $$

Here µ is the intercept of the model. ȳ_•• should serve as a good estimator for this parameter, as we'll see later. τ_i are parameters that describe the main effect of the discrete factor. Remember that what we're trying to test here is whether the variation between treatments is the same as the variation within treatments. In other words, we want to test E(y_ij) = µ + τ_i = µ_i for i = 1, 2, ..., a. To probe this, we can use the following hypotheses:

$$ H_0: \mu_1 = \mu_2 = ... = \mu_a \qquad (8) $$
$$ H_1: \mu_i \neq \mu_j \text{ for at least one pair } i, j \qquad (9) $$

Each of these means refers to the mean effect of treatment i. In the context of our cotton tensile strength example, µ_1 might refer to the mean tensile strength at 15 wt%, µ_2 might refer to the mean tensile strength at 20 wt%, and so on. Typically we like to think of the mean µ as an 'overall' mean so that,

$$ \frac{\sum_{i=1}^{a}\mu_i}{a} = \mu \, . \qquad (10) $$

Recalling that each µ_i = µ + τ_i, we can plug this in to find,

$$ \mu = \frac{1}{a}\sum_{i=1}^{a}(\mu + \tau_i) = \frac{1}{a}\left(a\mu + \sum_{i=1}^{a}\tau_i\right) = \mu + \frac{1}{a}\sum_{i=1}^{a}\tau_i \qquad (11) $$

In other words, this assumption implies a constraint on the effects τ_i,

$$ \sum_{i=1}^{a}\tau_i = 0 \, . \qquad (12) $$

The interpretation of the above analysis is that our model essentially is just the 'grand' mean ȳ_•• plus an effect τ_i. That is, the effect estimates τ_i are a 'deviation' from the overall mean. Another consequence of the above constraint is that we can rewrite the hypotheses to be tested by just subtracting µ,

$$ H_0: \tau_1 = \tau_2 = ... = \tau_a = 0 \qquad (13) $$
$$ H_1: \tau_i \neq 0 \text{ for at least one treatment} \qquad (14) $$

Note that the model has a independent parameters: the overall mean, plus the a − 1 independent τ_i values. The model used to test the null hypothesis therefore has a − 1 degrees of freedom.

If you've been paying close attention, this might seem familiar to you. We already know how to use e.g. least squares regression to fit a line to data; i.e. to fit a linear model that describes the relationship between a continuous dependent variable and a continuous independent variable, illustrated below in Figure 5.

Figure 5: Fitting a continuous model (y = a + bx) to discrete data. It's just a line!

We will soon see that the discrete–factor and continuous–factor models are indeed closely related. Let's now return to our example of cotton tensile strength. Recall that our model is,

$$ y_{ij} = \mu + \tau_i + \varepsilon_{ij} \qquad \begin{cases} i = 1, 2, ..., a \\ j = 1, 2, ..., n \end{cases} \qquad (15) $$

with i referring to the factor level (in this case, the weight percent of cotton in the fiber) and j referring to the experiment at the particular level (in this case, experiments 1–5 at each level). Our first goal is to use the experimental data to obtain estimates for the parameters of our discrete factor effects model. This can be easily done:

$$ \hat{\mu} = \bar{y}_{\bullet\bullet} = 15.04 \qquad (16) $$
$$ \hat{\tau}_1 = (\bar{y}_{1\bullet} - \bar{y}_{\bullet\bullet}) = 9.8 - 15.04 = -5.24 \text{ for 15\% cotton} \qquad (17) $$
$$ \hat{\tau}_2 = (\bar{y}_{2\bullet} - \bar{y}_{\bullet\bullet}) = 15.4 - 15.04 = 0.36 \text{ for 20\% cotton} \qquad (18) $$
$$ \text{etc.} \qquad (19) $$

Here, the 'hat' means that this quantity is an estimate for the parameter. In other words µ̂ is an estimate for the model parameter µ, and τ̂_i is an estimate for τ_i. We therefore have our prediction equation, ŷ_i = µ̂ + τ̂_i,

$$ \hat{y} = 15.04 + \begin{cases} -5.24 & \text{if wt\%} = 15 \\ \phantom{-}0.36 & \text{if wt\%} = 20 \\ \phantom{-}2.56 & \text{if wt\%} = 25 \\ \phantom{-}6.56 & \text{if wt\%} = 30 \\ -4.24 & \text{if wt\%} = 35 \end{cases} \qquad (20) $$

But, we're not done yet!
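As a quick sanity check, here is a small Python sketch of this estimation step. It is not taken from the course data files; the treatment means used below are simply the ones implied by the estimates quoted above (the design is balanced, so the grand mean is the mean of the treatment means). The script also verifies the constraint of Eq. (12), that the τ̂_i sum to zero.

```python
# Sketch of the parameter estimates mu_hat and tau_hat_i for the effects model
# y_ij = mu + tau_i + eps_ij, using the treatment means implied by the
# estimates quoted above (9.8, 15.4, 17.6, 21.6, 10.8 for 15-35 wt% cotton).
import numpy as np

wt_pct          = np.array([15, 20, 25, 30, 35])
treatment_means = np.array([9.8, 15.4, 17.6, 21.6, 10.8])   # ybar_i.

mu_hat  = treatment_means.mean()        # ybar_.. = 15.04 for a balanced design
tau_hat = treatment_means - mu_hat      # deviations from the grand mean

for w, t in zip(wt_pct, tau_hat):
    print(f"tau_hat({w} wt%) = {t:+.2f}")

# the constraint sum(tau_i) = 0 should hold to machine precision
print(f"mu_hat = {mu_hat:.2f},  sum of tau_hat = {tau_hat.sum():.2e}")
```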
We now have estimates for the model parameters, but we still need to determine whether any of them are significantly different from zero.

1.2 One way ANOVA

Let's again recall our model:

$$ y_{ij} = \mu + \tau_i + \varepsilon_{ij} \qquad \begin{cases} i = 1, 2, ..., a \\ j = 1, 2, ..., n \end{cases} \qquad (21) $$

According to this effects model, there are two possible sources of variability, i.e. two possible reasons why every experiment doesn't give exactly the same response y:
1. variability due to the treatment τ_i
2. random error ε_ij

The analysis of variance method provides a way to partition the total variability between these two sources. Warning: there is a good deal of math ahead, and it's important that you understand how this method is derived. I strongly encourage you to take the time to work these steps out yourself and ensure you understand each step.

The sample variance for all N observations is calculated following the standard sample variance expression, but substituting our known value ȳ_••,

$$ S^2 = \frac{1}{N-1}\sum_{i=1}^{a}\sum_{j=1}^{n}(y_{ij} - \bar{y}_{\bullet\bullet})^2 = \frac{SS_T}{N-1} \, . \qquad (22) $$

The analysis of variance method focuses its attention on the numerator, i.e. the double sum, SS_T, referred to as the total corrected sum of squares. Our goal here is to determine how much of this variability is due to random errors, and how much is due to the effect of the factor on the response. We can decompose this total into two separate terms:

$$ SS_T = \sum_{i=1}^{a}\sum_{j=1}^{n}(y_{ij} - \bar{y}_{\bullet\bullet})^2 = n\sum_{i=1}^{a}(\bar{y}_{i\bullet} - \bar{y}_{\bullet\bullet})^2 + \sum_{i=1}^{a}\sum_{j=1}^{n}(y_{ij} - \bar{y}_{i\bullet})^2 \qquad (23) $$
$$ SS_T = SS_{Treatments} + SS_E \, , \qquad (24) $$

where SS_Treatments is the 'treatments sum of squares' and SS_E is the 'error sum of squares'. The above equation is the fundamental ANOVA identity. It illustrates precisely how we are able to decouple the total variability into two pieces: the first estimates the variability due to the treatment (i.e. it measures the variability between the treatment averages and the overall average), and the second the variability within treatments (i.e. the error of individual measurements compared to the corresponding treatment average). From the polycotton fiber tensile strength data, we can calculate these quantities,

$$ SS_T = 636.96 \qquad (25) $$
$$ SS_{Treatments} = 475.76 \qquad (26) $$
$$ SS_E = 161.20 \qquad (27) $$

This is all well and good, but how do we use these quantities? We need to identify a test statistic (and reference distribution) that will allow us to test our null hypothesis of τ_i = 0 for all i. Recall that we assumed the response is normally distributed, i.e. y_ij ∼ N(µ + τ_i, σ²). In other words,

$$ \frac{(y_{ij} - \mu - \tau_i)}{\sigma} \sim N(0, 1) \, . \qquad (28) $$

Now, if H_0 is true, then all τ_i = 0 and we can cast S² as a χ²_N statistic,

$$ \frac{1}{\sigma^2}\sum_{i=1}^{a}\sum_{j=1}^{n}(y_{ij} - \mu)^2 \sim \chi^2_N \, . \qquad (29) $$

Additionally, if H_0 is true, we also expect that each treatment mean should be a good estimate of the population mean (since there is no variability due to the treatment), and so,

$$ \frac{SS_E}{\sigma^2} = \frac{1}{\sigma^2}\sum_{i=1}^{a}\sum_{j=1}^{n}(y_{ij} - \bar{y}_{i\bullet})^2 \sim \chi^2_{N-a} \, . \qquad (30) $$

We can also invoke the CLT to assert that the treatment mean at any level i should be distributed as,

$$ \bar{y}_{i\bullet} \sim N(\mu + \tau_i, \sigma^2/n) \, , \qquad (31) $$

in other words,

$$ \frac{(\bar{y}_{i\bullet} - \mu - \tau_i)}{\sigma/\sqrt{n}} \sim N(0, 1) \, . \qquad (32) $$

Again, if H_0 is true, then all τ_i = 0 and so,

$$ \frac{n}{\sigma^2}\sum_{i=1}^{a}(\bar{y}_{i\bullet} - \mu)^2 \sim \chi^2_a \, . \qquad (33) $$

In this case, the grand mean provides a good estimate of the population mean, and so,

$$ \frac{SS_{Treatments}}{\sigma^2} = \frac{n}{\sigma^2}\sum_{i=1}^{a}(\bar{y}_{i\bullet} - \bar{y}_{\bullet\bullet})^2 \sim \chi^2_{a-1} \, . \qquad (34) $$

We also know that the ratio of these two χ² variables, each divided by its degrees of freedom, follows an F distribution with (a − 1) numerator d.o.f. and a(n − 1) = (N − a) denominator d.o.f.,

$$ \frac{\chi^2_{a-1}/(a-1)}{\chi^2_{N-a}/(N-a)} \sim F_{a-1, N-a} \, . \qquad (35) $$
Substituting SS_Treatments and SS_E,

$$ \frac{SS_{Treatments}/[\sigma^2(a-1)]}{SS_E/[\sigma^2(N-a)]} = \frac{SS_{Treatments}/(a-1)}{SS_E/(N-a)} \sim F_{a-1, N-a} \, . \qquad (36) $$

We now define two new quantities,

$$ MS_{Treatments} = \frac{SS_{Treatments}}{a-1} \, , \qquad (37) $$
$$ MS_E = \frac{SS_E}{N-a} \, . \qquad (38) $$

Here, MS_E is a pooled estimate of the common variance σ² within each of the a treatments. If the treatment means are all equal, then MS_Treatments also provides an estimate of σ². If the treatment means are not all equal, then the value of MS_Treatments will be greater than σ². From our polycotton example,

$$ MS_{Treatments} = \frac{SS_{Treatments}}{a-1} = \frac{475.76}{5-1} = 118.94 \qquad (39) $$
$$ MS_E = \frac{SS_E}{N-a} = \frac{161.20}{25-5} = 8.06 \qquad (40) $$

Summarizing the hypotheses

Recall our hypotheses:

$$ H_0: \tau_1 = \tau_2 = ... = \tau_a = 0 \qquad (41) $$
$$ H_1: \tau_i \neq 0 \text{ for at least one treatment} \qquad (42) $$

The test statistic and reference distribution are,

$$ F_0 = \frac{MS_{Treatments}}{MS_E} \sim F_{a-1, N-a} \, . \qquad (43) $$

We reject H_0 when F_0 is an extreme value in the right tail of the F_{a-1,N-a} distribution. In other words,

$$ F_0 = \frac{MS_{Treatments}}{MS_E} \approx 1 \text{ if } H_0 \text{ is true} \qquad (44) $$
$$ F_0 = \frac{MS_{Treatments}}{MS_E} \gg 1 \text{ if } H_0 \text{ is false} \qquad (45) $$

We made several assumptions, which we will come back to in a later section:
• experimental design is completely and properly randomized
• model errors are ∼ N(0, σ²)
• response variance σ² is the same for all levels of the factor
• the observations (i.e. response) are ∼ N(µ + τ_i, σ²)

Aside: is ANOVA always two sided?

$$ H_0: \tau_1 = \tau_2 = ... = \tau_a = 0 \qquad (46) $$
$$ H_1: \tau_1 \neq 0 \text{ or } \tau_2 \neq 0 \text{ or ... or } \tau_a \neq 0 \qquad (47) $$

The short answer is: yes. The ANOVA is not capable of testing a one–sided H_1. Both positive and negative values of τ_i have a tendency to increase MS_Treatments, and hence the F statistic. This is actually a good thing! If one–sided tests were possible, as the number of levels (i.e. a) increased, the number of possible alternative hypotheses would become hideously large.

Aside: is the equal variances assumption valid?

A key assumption in ANOVA is that the variances for all levels are equal, i.e. σ²_1 = σ²_2 = ... = σ²_a. There are modified ANOVA methods (e.g. Welch) that have been developed to handle the case where variances cannot be assumed to be equal. However, these are rarely needed. For a well–designed experiment, why is it often reasonable to expect that the variance should be about the same for all levels?

One–way ANOVA summary

We've just described how we will perform the analysis of variance for a single discrete factor effects model. This is referred to as "one–way analysis of variance." ANOVA is a remarkably useful and powerful tool in statistics and data analysis.

Table 2: Summary table for performing one–way ANOVA

source | sum of squares  | degrees of freedom | mean square     | F_0 ∼ F_{α, a−1, N−a}
factor | SS_Treatments   | a − 1              | MS_Treatments   | MS_Treatments / MS_E
error  | SS_E            | N − a              | MS_E            |
total  | SS_T            | N − 1              |                 |

And recall the expressions for SS_T and SS_Treatments,

$$ SS_T = \sum_{i=1}^{a}\sum_{j=1}^{n}(y_{ij} - \bar{y}_{\bullet\bullet})^2 \qquad (48) $$
$$ SS_{Treatments} = n\sum_{i=1}^{a}(\bar{y}_{i\bullet} - \bar{y}_{\bullet\bullet})^2 \qquad (49) $$

We're finally ready to return to our cotton fiber tensile strength example. We'll use JMP to build and test the following model:

$$ y_{ij} = \mu + \tau_i + \varepsilon_{ij} \qquad \begin{cases} i = 1, 2, ..., a \\ j = 1, 2, ..., n \end{cases} \qquad (50) $$
$$ \sum_{i=1}^{a}\tau_i = 0 \, , \qquad (51) $$
$$ H_0: \tau_1 = \tau_2 = ... = \tau_a = 0 \qquad (52) $$
$$ H_1: \tau_i \neq 0 \text{ for at least one treatment} \qquad (53) $$

To do this, we'll use the 'Fit Model' platform in JMP, illustrated below in Figure 6.

Figure 6: Building and testing the discrete–factor effects model for the cotton tensile strength data in the Fit Model platform of JMP.

JMP did all of the arithmetic for us.
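For reference, here is the same arithmetic done by hand in Python. The raw 25 observations are not reproduced here; instead the sketch starts from the sums of squares quoted in Eqs. (25)–(27) and only uses a library call for the F distribution tail area (the p–value lookup).

```python
# One-way ANOVA arithmetic for the cotton fiber example, starting from the
# sums of squares quoted in Eqs. (25)-(27).
from scipy import stats

a, n = 5, 5              # levels and replicates
N = a * n                # total observations

SS_treatments = 475.76
SS_error      = 161.20

MS_treatments = SS_treatments / (a - 1)    # 118.94
MS_error      = SS_error / (N - a)         # 8.06
F0            = MS_treatments / MS_error   # test statistic, Eq. (43)

# right-tail area of F_{a-1, N-a} evaluated at F0
p_value = stats.f.sf(F0, a - 1, N - a)

print(f"MS_treatments = {MS_treatments:.2f}, MS_error = {MS_error:.2f}")
print(f"F0 = {F0:.2f}, p-value = {p_value:.2e}")   # p << 0.05, so reject H0
```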
Let’s write out the model equation: yij = µ + τi + εij , and we have our prediction equation, ŷij = µ̂ + τ̂i , with parameter estimates τ̂i calculated by JMP as, −5.24 if wt% = 15 0.36 if wt% = 20 ŷ = 15.04 + 2.56 if wt% = 25 6.56 if wt% = 30 −4.24 if wt% = 35 . (54) (55) (56) The first panel in the JMP report labeled ‘Analysis of Variance’ tests the hypotheses listed above. The calculated p–value is < 0.0001, so we reject H0 . At least one treatment is significantly different than zero. Which one(s) are significantly non–zero? We’ll get to that shortly. 10 Aside: coding data Recall that in Chapter 1 we briefly discussed normalization of data, which was the process of rescaling/shifting data by a simple multiplication/addition by constants. This process is also referred to as coding. For example, consider the following dataset: data = {13.4, 7.2, 29.0, 10.1, 17.7} (57) . We can normalize (code) these data so that they range from 0 to +1: ycoded = y − ymin y − 7.2 = ymax − ymin 29.0 − 7.2 , (58) giving the following: y y (coded) 13.4 0.28 7.2 0 29.0 1 10.1 0.13 17.7 0.48 This process is actually very commonly done, for a couple of reasons. For example, non–integer data can be coded so that all values are integers. Computationally, this is beneficial for speed, accuracy, and therefore long–term data storage. Perhaps more importantly, it is also done for proprietary/security reasons. Coded data can be shared with “outsiders” without giving them your actual data. All of the statistical analyses we have done so far will work just as well with coded data as they will with uncoded data! A few lessons learned • The analysis of variance involves all of the concepts we’ve covered so far: data types, probability, distributions, formulating and testing hypotheses, ... • ANOVA is a sophisticated analysis technique, but easy to understand when we break it down • With software such as JMP, ANOVA is very easy to perform ... but correctly interpreting the results demands that you understand the fundamentals 1.3 Generalized ANOVA Consider a general example, where a response variable y is measured at three levels of a factor. Ten measurements are collected at each level, with the results illustrated below in Figure 7. 11 Figure 7: Generalized ANOVA example. The response y is measured at three levels of a factor, given by the three Normal distributions. Here, vertical lines correspond to averages of the three levels (y 1• , y 2• , y 3• , and the grand mean y •• ). Next we can follow our prescription from the previous section and calculate the sum of squares associated with the treatment and the error, SST = SSTreatments + SSE , (59) with SSTreatments = n a X (60) (y i• − y •• )2 i=1 = 10 (y 1• − y •• )2 + (y 2• − y •• )2 + (y 3• − y •• )2 (61) , and SSE = a X n X (yij − y i• )2 (62) i=1 j=1 = 10 10 10 X X X (y1j − y 1• )2 + (y2j − y 2• )2 + (y3j − y 3• )2 j=1 j=1 (63) . j=1 Remembering the hypotheses that ANOVA tests, (64) H0 : τ1 = τ2 = τ3 = 0 H1 : τ1 6= 0 or τ2 6= 0 or τ3 6= 0 (65) , with test statistic, F0 = M STreatments SSTreatments /(a − 1) SSTreatments /(3 − 1) = = ∼ F2,27 M SE SSE /(N − a) SSE /(30 − 3) . (66) If the ANOVA p–value> α, then we do not move on to the next step of comparing individual pairs of effects. The factor does not have a significant effect on the response. Case closed! 12 However, if p–value < α, then the factor has a significant effect on the response. In other words, at least one of the effects is significantly non-zero. But which one(s)? 
To determine this, our next step is to compare all possible pairs of treatments. In other words, we wish to test a series of hypotheses having the general form, (67) H0 : µi = µj H1 : µi 6= µj (68) , for all pairs i 6= j. But we can’t simply do a t–test on each pair, as the probability of committing a type I error would be substantially higher than α. What do we do? Controlling the ‘experimentwise’ or ‘family-wise’ error What we really need to do is to find some way to compare all of the pairwise means while controlling the total probability of committing a type I error. Recall that if α = 0.05 and and we have m = 10 pairs of t–tests to compute, then the probability of committing a type I error is given as 1 − (1 − α)m = 1 − (1 − 0.05)10 = 40% . (69) In other words, if we take the naive approach, we really have an effective α = 0.4, much higher than we would typically be comfortable with. But what if we just reduce α in our pairwise tests? If α were set to a lower value, say 0.005, then we would have an effective α of αeffective = 1 − (1 − 0.005)10 ≈ 0.048 . (70) This approach is called the Bonferonni correction. By simply using a (much) more conservative significance level when performing the pairwise tests, we can take the naive approach while maintaining our comfortable overall significance level. This overall significance level is referred to as the ‘experimentwise’ or ‘family-wise’ error – the error associated with the entire procedure, or entire family of procedures. In principle there’s no problem with this approach. However, in reality, using a significance level of α = 0.005 when doing the pairwise tests means that it will become much more difficult to reject the null hypothesis even when you should. In other words, the probability of committing a type II error increases, or equivalently, the power decreases. It is entirely possible to reject the null hypothesis from ANOVA (i.e., at least one effect is significant), and then fail to reject any null hypothesis during the pairwise t–tests with the Bonferonni correction (i.e. no effect is significant). An apparent contradiction caused by type II error! In 1953, an American mathematician named John Tukey developed an alternative way to do many pairwise mean comparisons while controlling the family-wise error, and without significantly reducing the power of the test. The original formulation of this approach assumed a balanced experiment design (i.e. all factors have the same number of samples), but it was later modified by Kramer to allow for a mixed design. More details can be found in the textbook, but the general idea behind the Tukey(–Kramer) Honestly Significant Difference (HSD) Test is that, rather than doing an actual pairwise comparison, it determines the minimum difference between sample means that can be considered significant. In particular, two means are said to be significantly different if s qα (a, f ) 1 1 |y i• − y j• | > √ M SE + , (71) ni nj 2 where a is the total number of treatments/levels, f is the number of degrees of freedom associated with M SE , and q is the corresponding ‘Studentized range’ statistic. This is essentially just another reference distribution, with values given in the back of the textbook just as the z, t, χ2 , and F distributions. Using a similar procedure to our previous discussion on confidence intervals, we can also compute a confidence interval for the difference of each pair of (population) means: s s qα (a, f ) 1 1 qα (a, f ) 1 1 y i• − y j• − √ M SE + ≤ µi − µj ≤ y i• − y j• + √ M SE + . 
(72) ni nj ni nj 2 2 13 If this confidence interval does not contain zero, then the population means are said to be significantly different. Generally speaking, this confidence interval will be a bit wider than the naive approach of using a standard t–test, but will be much narrower than using e.g. the Bonferonni correction to the standard t–test. We can visualize this in JMP by asking it to do pairwise comparisons with a standard t–test (the naive approach without Bonferonni correction) and also Tukey’s HSD, shown below in Figure 8. Figure 8: Pairwise mean comparisons using both the standard t–test (called Student’s t, the naive approach without Bonferonni correction), the Tukey HSD test, and the t–test with Bonferonni correction. The radius of the circle is related to the (one–sided) width of the confidence interval. To the right side of Figure 8 are three sets of comparisons. The first, labeled ‘Each Pair Student’s t’ uses α = 0.05 for each pairwise comparison, which as we discussed above gives a roughly 40% chance of committing a type I error. Each circle is roughly related (but not equivalent) to a confidence interval about the mean of a given treatment (in this case, the tensile strength given the weight % cotton). Finally, the last section shows the naive t–test with the Bonferonni correction (reducing the significance level of each pairwise comparison) for a total type I error probability of about 0.05, the same as the Tukey HSD test. JMP also reports a comparison for all pairs using the Tukey–Kramer HSD, shown in Figure 9. Figure 9: Computed values of |y i• − y j• | − HSD, where HSD = from the diagonal that JMP reports HSD = 5.3730. qα (a,f ) √ 2 r M SE 1 ni + 1 nj . Here we can see A little further down, JMP reports the significant differences in a way that is a little easier to see than 14 the circles, shown in Figure 10. Figure 10: Report of significant differences between means from the Tukey–Kramer HSD test. Finally, JMP automatically computes confidence intervals when you run a ‘Compare Means’ test, illustrated in Figure 11. Figure 11: Report of significant differences between means from the Tukey–Kramer HSD test. Wait a second... If the Tukey–Kramer HSD test is able to (effectively) perform pairwise means comparisons without increasing the risk of committing a type I error, why did we even bother with ANOVA in the first place? Why can we not just skip ANOVA and go straight to Tukey? This is a great question and is actually a pretty deep ‘rabbit hole’ and area of current research. If you ask a statistician, they would say you don’t necessarily need to do ANOVA if you’re going to use a compairson method like Tukey HSD which controls the family–wise error. In fact, from the Wikipedia page for the Tukey HSD test: “A common mistaken belief is that the Tukey hsd should only be used following a significant ANOVA. The ANOVA is not necessary because the Tukey test controls the Type I error rate on its own.” – some Wikipedia user My response: 15 In all seriousness, the Wikipedia user is in principle correct. In practice though, the scientific community is typically skeptical of results presented without first rejecting H0 via ANOVA. It’s possible, for instance, to fail to reject H0 during ANOVA (i.e., suggesting no significant difference between treatments), and to also find significant difference(s) from Tukey HSD (i.e. suggesting at least one significant difference between treatments). 
This would likely represent a ‘borderline’ case of significance, and would therefore be met with skepticism. Peer reviewers would likely demand you do a few more experiments to make sure the effect is legitimate. If it is legitimate, the increased ‘significance’ of the effect would likely lead you to now reject H0 with ANOVA. The end result of this practice is a slightly reduced overall significance level, i.e. we lower α from 0.05 to something slightly lower by this practice. 1.4 Desigining experiments with a control In some experiments, there may be one particular treatment that for some reason is special. We call this level a control. Examples include: • Exploring modifications of a process, and using the current operating conditions as a control • Exploring changes in product composition/formulation, and using the current composition/formulation as a control • Exploring the effect of different drugs and using a placebo as the control Some people enjoy the idea of a control so much that they believe all experiments should have one. This is certainly not true! But it does explain why it is so common for the first question to dribble out of some people’s mouth when asking about your results will be “well yes that’s great, but what was your control?” We do want to use a control when the goal of an experiment is to compare all treatments to a control level to another one. But that’s obviously not always the case. Did we need a control for the cotton fiber experiments? Some of the confusion related to controls is certainly related to the fact that we sometimes refer to experiments as ‘controlled.’ This is an experiment in which we set factor (independent variables) levels (values) as desired. In science and engineering, we are usually able to do controlled experiments, though not in cases where it would be unethical. For example, most of our knowledge on nutrition in humans is extrapolated from experiments using animals. As a society we are ‘okay’ with the idea of e.g. locking mice in a cage for their entire lives and completely controlling their diet and exercise, which is important when investigating the effect of diet on e.g. weight loss and longevity. But we’re less okay with doing this with human beings! In any case, a ‘controlled experiment’ is not the same thing as a control in an experiment. When you are using a control, it is usually a good idea to perform more replicates for the control level. A rule of thumb is usually √ nc = n a , (73) where a as before is the number of levels, nc is the number of replicates for the control level, and n is the number of replicates for the other levels. For example, if n = 5 and a = 5, then nc ≈ 11. As always, the overall run order must be properly randomized, including the control runs. Example: drug toxicity 16 Suppose we want to measure the toxicity of two drugs. We can measure this by exposing a culture of cells to the drug and measuring the percentage of cells that die after exposure. In this setup, it would be good to have a control, i.e. a level that receives no drug, as a benchmark to compare to the effect of the drugs. We will do 15 runs: 7 with a control (placebo, i.e. no drug), 4 with drug A, and 4 with drug B. The run order is completely randomized. Note that in this case, we are not interested in comparing drug A to drug B, only the comparisons of the drugs to the control. An important distinction! The results of the experiment are shown below in Figure 12. Figure 12: Drug toxicity experiment results. 
As we did with the tensile strength analysis, we start with ANOVA: (74) H0 : τplacebo = τA = τB = 0 H1 : τplacebo 6= 0 or τA 6= 0 or τB 6= 0 . (75) Using JMP, we load the data into the Fit Y by X platform and click the ‘Means/ANOVA’ option in the dropdown box. JMP reports: Figure 13: Results of ANOVA on the drug toxicity data. With p–value= 0.0384 < α = 0.05, we reject H0 and conclude that at least one treatment pair is significantly different. Next, we proceed to the means comparisons. One twist when using a control is that Tukey–Kramer HSD is no longer the ideal test. Recall that Tukey–Kramer compares all means. However, with a control, we really just want to compare everything to the control, i.e. we don’t care about drug A 17 vs drug B, we only care about drug A vs the control and drug B vs the control. A modification to the Tukey–Kramer HSD to account for this was developed by a person named Dunnett in the early 1960s, and allows for a slightly more accurate comparison. We can do this with JMP, with results shown in Figure 14 below. Figure 14: Result of Tukey and Dunnett analysis on the data following rejection of H0 with ANOVA. Note that the Tukey–Kramer HSD test finds no significant difference, but Dunnett’s test finds drug B to be significantly different. We get a little extra power (i.e. reduced probability of a type II error) by neglecting to compare drug A to drug B! A key question Since the primary goal of this experiment was to compare each of the drugs with the control, why not just do two separate experiments? In other words, why not do one experiment to explore the effect of drug A, and then another experiment to explore the effect of drug B? The answer is economics. The joint experiment with both drugs at the same time is more efficient and provides more reliable results, since we are able to pool the results and get a better estimate of e.g. the effect of the control level. 18 2 Residuals analysis In the last section, we made several important assumptions about the populations from which our experiments are sampling. In this section, we will see how we can analyze the residuals of a model to see if the assumptions we made are likely to be valid. First, let’s briefly review the three analysis platforms we’ve used with JMP so far: 1. Distribution – analysis of a single sample: histograms, tests on means and variances, normal quantile plots, outlier box plots, ... 2. Fit Y by X – data with a single response and a single factor having one, two, or more levels: two sample tests, one–way ANOVA, ... 3. Fit Model – data with a single response and one or more factors (i.e. can do most of what the Fit Y by X platform does and much more) Next, let’s define what we mean by a residual. A residual is simply the difference between the actual (i.e. observed) and estimated (i.e. predicted by the model) values of the response variable. The residual for observation j in treatment i is defined as, eij = yij − ŷi (76) , where as before the ‘hat’ variable is just our estimate of the response in treatment i. Residuals analysis is a very useful technique for answering a variety of questions: 1. Is the response variable normally distributed? 2. Are there any outlier data points? 3. Is the model adequate? The residual of a data point is related to the error term in the model. Recall: proposed model: yij = µ + τi + εij fitted model: ŷi = µ̂ + τ̂i residual: eij = yij − ŷi (77) (78) . 
(79) If the proposed model includes all relevant factors, and if µ̂ and τ̂i are good estimators for µ and τi , then eij should be a good estimator for εij . Checking our assumptions: normally distributed response variable The statistical methods we now have in our toolbox (z–tests, t–tests, χ2 –tests, F –tests, ANOVA, ...) all assume the response variable to be normally distributed. If the model accounts for all important factors (i.e. if the model is “adequate”) and if the response variable is normally distributed, and if the variance σ 2 is constant, and if the experiment was properly randomized, then e ≈ ε ∼ N (0, σ 2 ) . (80) In other words, the model errors (i.e. residuals) should be ∼ N (0, σ 2 )! An important point: although y is assumed to be normally distributed with constant σ, this does NOT mean we expect all observations to be from a single normal distribution. So we can NOT check the assumption simply by looking at a histogram of all the y values. Recall the figure from the beginning of our discussion of generalized ANOVA: 19 Here we have measured the response variable y at different levels of a factor. If we just plotted a histogram of the entire data set, it would not appear to be normally distributed! We have to make sure we plot the residuals, which would subtract out the effective difference in the mean to give us a single normally distributed variable ∼ N (0, σ 2 ). We can do this in one of three ways, in order of least rigorous to most rigorous: 1. visual inspection of histogram of residuals (requires large n). 2. normal quantile plots 3. statistical tests of normality All three of these can be done in the ‘Distribution’ platform of JMP. The good news is that even when the normality assumption appears to be questionable, in many cases these tests (especially t–tests) have been found to give mostly reliable results. First, let’s discuss the normal quantile plots, also referred to as a ‘Q–Q’ plot (quantile–quantile plot). In short, the way this works is by computing quantiles of the given data set and comparing the magnitude of those quantiles to what we would expect from a normal distribution. Recall that a quantile is just the x value of a distribution at which point a certain percentage of the data is to the left of that x value, and it’s closely related to a percentile. For instance, the median of a distribution is just the 50th percentile: by definition, 50% of the data in the distribution is to the left of the median. We can use Q–Q plots to estimate if a data set is normally distributed pretty easily. If the data set is actually normally distributed, then the quantiles of that data set will be roughly the same as what we would expect of the quantiles for the normal distribution. So if we plot the quantiles of our data set on the y–axis, and the quantiles of a normal distribution on the x–axis, and the points roughly fall along the line y = x, then we can say that the data are normally distributed. A few examples are shown below in Figure 15. 20 Figure 15: Example of Q–Q plots for a normal distribution (top left), χ25 distribution (top right), F5,5 distribution (bottom left), and exponential distribution (bottom right). Normally distributed data will lie approximately along the line y = x on the Q–Q plot, while other distributions will deviate substantially. In the above figure, note that the Q–Q for data drawn from an N (0, 1) distribution pretty closely follows the diagonal line. 
All others show clear deviations, which are evidence of them not being normally distributed – because they’re not! In addition to quantile plots, we can statistically test if a data set is normally distributed using the Shaprio–Wilk’s test. The details of how this works aren’t important, and JMP can do it for us easily. The null hypothesis of this test supposes that the data are normally distributed. If the p–value is less than the significance level, we reject H0 and conclude the data are not normally distributed. What should be noted however, is that we need to distinguish between a statistically significant result and a practically significant result, which we will return to later. To suffice: very large sample sizes will tend to reject H0 even if the true difference is very small! 21 Now let’s look at how we can analyze residuals in JMP. We’ll use the polycotton fiber tensile strength data to illustrate. First we will load the polycotton fiber data back into the ‘Fit Model’ platform and make sure to set the emphasis to ‘minimal report’: Figure 16: Loading data into the Fit Model platform of JMP. Setting Emphasis = ‘Minimal Report’ basically says ‘do not drown me in results!’ Next we tell JMP to calculate the residuals and save them as a column in the data table: Figure 17: Using JMP to calculate residuals and save them as a column in the data table. 22 This creates a new column in the data table containing the residuals. We then load this column into the ‘Distribution’ platform of JMP and click the ‘Normal Quantile Plot’ option: Figure 18: Using JMP to calculate and plot a Q–Q plot, testing the normality and constant variance assumption in ANOVA. The data mostly follow the diagonal line, indicating that the response variable is normally distributed and that the variance in each treatment is constant. Good news! We can also test this using the Shapiro–Wilks test, which as a reminder takes as a null hypothesis the data being normally distributed. First we ask JMP to fit a normal distribution to the residuals: Figure 19: Using JMP to fit a normal distribution to the residuals. Then within the ‘Fitted Normal’ dialogue box, we tell it to perform a ‘Goodness of Fit’ test: 23 Figure 20: Using JMP to perform a goodness of fit test on the resulting normal distribution, fit to residuals. Finally, the results of the test are shown here: Figure 21: Results of the Shapiro–Wilks test in JMP. Since the p–value=0.18 is greater than the significance level α = 0.05, we do not reject H0 and conclude the data are normally distributed. With all of this done, we can be reasonably confident that the assumptions made during ANOVA/Tukey HSD were founded. But what do we do if our data are not normally distributed? There are a few options: 1. Search the literature: is the probability distribution for your response variable y known? Have others identified an appropriate test statistic and reference distribution? 2. Use nonparametric tests – these are statistical analyses that do not make assumptions about the population distribution. Examples include kernel density estimation techniques, or the Welch ANOVA method (which does not assume constant variance between levels). 3. Look for a mathematical transformation of y that produces an approximately normally distributed quantity. As an example, sometimes y is not normally distributed, but ln(y) is normally distributed. In this case, y is said to follow a lognormal probability distribution. 
These variables are fairly common in some areas of interest, including: • survival time after diagnosis of cancer • latent period of an infectious disease • rainfall amounts • species abundance • income of working people Aside: outliers 24 For identification of outliers, we first need to define the standardized residual: dij = √ eij M SE (81) . Or, sometimes the M SE is modified slightly make a Studentized residual: dij = p eij M SE (1 − hii ) , (82) where hii is an element of the “hat” matrix H which we will discuss later. Basically we just divide the residual by an estimate for the standard deviation, i.e. either the mean square error, M SE or slightly modified M SE for the Studentized residual. A rule of thumb for outlier identification: an observation that has a scaled residual |dij | > 3 may be an outlier. An outlier is an “unusual” or “wild” observation that doesn’t appear to be consistent with most of the other data. What can cause outliers? • a mistake in experimental measurement • incorrect calculation when processing the data • a poor/inadequate model • a genuine extreme observation Note: outliers are not always “bad” data. An outlier may turn out to be the most interesting and useful observation in the entire experiment! Many famous discoveries were made by the person who did not throw out the outlier. Examples: teflon, post–it notes, corn flakes, LSD, silly putty, aniline dyes, scotchgard, cellophane, rayon, vulcanized rubber, penicillin, identification of Charon as a moon of Pluto, the ozone hole over Antarctica, ... By default, JMP will create a box plot for every histogram. In the box plot, outliers are identified as a ‘dot’ outside of the box plot, illustrated below in Figure 22 Figure 22: Identification of outliers using JMP. The outlier is illustrated as a dot outside of the box plot. This plot for the response variable itself will identify extreme or unusual values; however, these should not be called “outliers” necessarily because a model has not been specified at this stage. To identify outliers, box plots should be constructed on the residuals, not the response variable. Are there any outliers in the tensile strength experiment? Let’s check JMP. 25 Figure 23: Analyzing the polycotton tensile strength data for outliers using JMP. No outliers! Dealing with an outlier • Review your lab notebook to see if anything unusual happened on that particular run. Are there any good reasons to think this is a “bad” data point? • Explore what happens when you repeat the data analysis with the outlier excluded. Do your conclusions change? • Remember that an outlier might possibly be a genuinely interesting and novel result. Don’t throw your ticket to (scientific) fame and fortune down the drain! Remember also that outliers are determined by extreme residuals, not extreme responses. Is your data point an outlier because the response is unusual, or is your model just inadequate? Whether or not a point might be considered an outlier depends on the model under consideration! Thus, a point that is excluded as an outlier in the analysis of one model should not be deleted entirely. If we decide to try out a different model, we should re-include any previously excluded points, fit the model, and again check for outliers. Model adequacy A model is said to be adequate if it quantitatively captures how the effects influence the response, so that the only unexplained variability is due to random error. A model is therefore said to be inadequate if other factors (e.g. 
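Before moving on to the adequacy checks, here is a short sketch of the two residual screens described above: a Shapiro–Wilk normality test and the |d_ij| > 3 rule of thumb on standardized residuals (Eq. (81)). The residual values below are simulated stand–ins; in practice you would use the residual column saved from your fitted model (e.g. the one exported from JMP).

```python
# Residual screening sketch: normality test plus standardized-residual outlier flag.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
residuals = rng.normal(0.0, 2.5, size=25)   # hypothetical e_ij values
dof_error = 20                              # N - a for the one-way model

MS_E = np.sum(residuals**2) / dof_error     # pooled variance estimate
d = residuals / np.sqrt(MS_E)               # standardized residuals, Eq. (81)

# Shapiro-Wilk: H0 = "residuals are normally distributed"
W, p = stats.shapiro(residuals)
print(f"Shapiro-Wilk W = {W:.3f}, p-value = {p:.3f}")
if p > 0.05:
    print("Do not reject H0: no evidence against normality.")

outliers = np.where(np.abs(d) > 3)[0]
print("Possible outliers (|d| > 3) at rows:", outliers if outliers.size else "none")
```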
nuisance variables that were not properly randomized out of influence) are having an effect but were not included in the model. As we’ll see in the next section, even if we identify all relevant factors, a model may still be inadequate if the appropriate effects are not included in the model! How can we check for model adequacy? There are a few useful and informative plots we can construct from the residuals to check model adequacy: • time sequence plot • plot of residuals vs estimated response values • plot of residuals vs independent variables not included in the study • and of course the familiar and much–loved R2 statistic also provides some measure of model adequacy, with some important caveats Remember that if our model accurately captures all relevant effects, then the residuals should be distributed as eij ∼ N (0, σ 2 ). We can sequentially plot data drawn from a normal distribution to get an idea of what a residuals plot ‘should’ look like, illustrated in Figure 24. 26 y 1 0 1 0 5 10 15 data sequence 20 25 Figure 24: Plot of data drawn sequentially from N (0, 1). The data is structureless, with no obvious patterns or trends. Order vs disorder – successes and failures of our brains Between your ears rests the most sophisticated device in the known universe for collecting, analyzing, and interpreting data. Humans are especially good at processing visual information. We run great simulation software to model our 3D world in real time. Take advantage of this when presenting your work or choosing a computational tool. But also be wary of our susceptibility to being fooled by illusions, or seeing patterns in chaos. One particular phenomenon to be aware of is called ‘Pareidolia.’ What do you see in the picture below? Figure 25: Image of the surface of Mars taken by Viking 1 in July of 1976. Is that a human face, or a rock formation? From Wikipedia, Pareidolia is: “the tendency for perception to impose a meaningful interpretation on a nebulous stimulus, usually visual, so that one sees an object, pattern, or meaning where there is none.” As 27 human beings, we have a tendency to find patterns where none exist. If you’re old enough to remember the above image circulating in the early age of the internet (late 1990’s, early 2000’s), you certainly remember the conclusions that were being drawn as a result of this phenomenon. In 2001, a much higher resolution image of the same rock formation was taken by a NASA probe, shown below: Figure 26: Image of the same rock formation as above taken by a NASA probe in 2001. How about now: is that a human face, or a rock formation? Back to the polycotton fiber data, let’s use JMP to analyze the residuals. First we’ll load the data into the Fit Model platform as before. We can easily plot the residuals by the test sequence (called Row Number in JMP) and against the predicted values: Figure 27: Telling JMP to plot residuals by row number and predicted values to test the assumptions in our analyses. 28 We’re interested in the ‘Plot Residual by Row’ and ‘Plot Residual by Predicted’ options. Since we organized our data table to have the test sequence associated with the rows sequentially, plotting residuals by row is equivalent to plotting residuals by test sequence. This is the resulting plot from JMP: Figure 28: JMP residuals analysis: plot of residuals vs test sequence appears to be structureless, so there’s nothing to worry about. 
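If you are working outside JMP, the same run–order plot is easy to produce with matplotlib. The residuals below are placeholders, not the course data; substitute the residual column saved from your fitted model. A structureless, centered band of points is what we hope to see.

```python
# Sketch of a "residuals vs test sequence" plot outside of JMP.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
residuals = rng.normal(0.0, 2.8, size=25)     # stand-in residuals, one per run
run_order = np.arange(1, residuals.size + 1)  # row number = test sequence here

plt.scatter(run_order, residuals)
plt.axhline(0.0, linestyle="--")
plt.xlabel("test sequence (run order)")
plt.ylabel("residual")
plt.title("Residuals vs run order")
plt.show()
```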
Next, the residuals plotted against the predicted value: Figure 29: JMP residuals analysis: plot of residuals vs predicted value. This is a good way to graphically check the assumption that the variance (σ 2 ) is similar at all levels. The final thing we would want to check is the residuals plotted against other variables. The day of the week of each experiment was not a factor in the experiment, but suppose we had that data and plotted the residuals against that variable: 29 Figure 30: JMP residuals analysis: plot of residuals vs day of the week. It looks like the day of the week is a nuisance variable that we’ll need to take into account in the future. Uh oh, there’s a clear trend when we plot the residuals vs the day of the week that the experiment was performed! This is a nuisance variable we didn’t take into account and will need to be properly blocked/randomized in the future. Brief comment on R2 A familiar statistic when evaluating model adequacy is the R2 statistic, defined as, R2 = SSTreatments SST . (83) The R2 statistic can be useful, but the reference distributions are not known, and so they cannot be used to make conclusions. In other words, at a given significance level α, there is no valid way to determine a “cutoff” R2 value. Furthermore, the R2 statistic does not punish ‘overfitting’ of data, which we will learn more about towards the end of the class. Bottom line: take R2 statistics with a heap of salt! 2.1 Side note: statistical significance vs practical significance In general, increasing the sample size of an experiment will reduce the probability of committing a type II error (failing to reject H0 when H0 is false). At the same time, decreasing the significance level α reduces our chance of committing a type I error, but increases the chance of making a type II error. In the real world we never know the true population parameters, and so the sample size is a very important decision you have to make as a scientist. The statistical guidance behind this topic is called ‘power analysis’ which we unfortunately won’t have time to cover in this class, but is discussed in the textbook. To understand why this is important, let’s take an example of rat weights. Suppose a medical researcher is using two species of rats as their test subject. Rats of species A and B are randomly selected from two large populations in which rat weight is ∼ N (µA , σ 2 ) and ∼ N (µB , σ 2 ), respectively. The research doesn’t know (and never will know) the population parameters µA , µB , σ 2 . To determine if the difference in mean weight between the species is significant, the researcher plans to weigh a randomly selected sample of N rats (so N/2 of each species) and perform a t–test at α = 0.05. All hypotheses tested will be two–tailed. We’ll consider several different scenarios for the (blue) and B (orange) rat populations. 30 p(y) 0.04 0.02 ~N(500, 102) ~N(550, 102) p(y) 0.00 460 480 500 520 540 560 580 600 y 0.04 ~N(510, 102) ~N(500, 102) 0.02 0.00 460 p(y) 0.04 500 y ~N(500, 102) 0.02 0.00 480 460 520 540 560 ~N(501, 102) 480 500 y 520 540 Figure 31: Scenarios for the rat populations to be investigated. Case 1: Large difference between means and small sample size. µA = 500 g, µB = 550 g, σ = 10 g, N = 10. Figure 32: Data analysis for case 1 with µA = 500 g, µB = 550 g, σ = 10 g, N = 10. In this case we reject H0 and conclude the population means are significantly different. 31 Case 2: Small difference between means and small sample size. µA = 500 g, µB = 510 g, σ = 10 g, N = 10. 
Figure 33: Data analysis for case 2 with µA = 500 g, µB = 510 g, σ = 10 g, N = 10. In this case we fail to reject H0 and conclude the population means are not significantly different. Case 3: Small difference between means and larger sample size. µA = 500 g, µB = 510 g, σ = 10 g, N = 20. Figure 34: Data analysis for case 3 with µA = 500 g, µB = 510 g, σ = 10 g, N = 20. In this case we reject H0 and conclude the population means are significantly different. Case 4: Tiny difference between means and larger sample size. µA = 500 g, µB = 501 g, σ = 10 g, N = 20. Figure 35: Data analysis for case 4 with µA = 500 g, µB = 501 g, σ = 10 g, N = 20. In this case we fail to reject H0 and conclude the population means are not significantly different. 32 At this point you might be wondering if I have lost my mind. Would H0 ever really be rejected if the two populations being sampled are like those above? The answer is yes! We can discover very small differences, but only with sufficiently large sample sizes. Case 5: Tiny difference between means and very large sample size. µA = 500 g, µB = 501 g, σ = 10 g, N = 2000. Figure 36: Data analysis for case 5 with µA = 500 g, µB = 501 g, σ = 10 g, N = 2000. In this case we reject H0 and conclude the population means are significantly different. Statistically significant differences summarized: for comparative tests between two samples, if the sample size is too small, even large differences between µA and µB will not be statistically significant. The converse is also true. Extremely small differences between µA and µB will be found to be statistically significant if the sample size is sufficiently large. Although a difference may be statistically significant, it might not be practically significant. Question: is this a problem with our approach? Or a strength? Determining practical significance depends on common sense, familiarity with the system being studied, and how you intend to use the results. Deciding on a sample size is hard, and even experts make mistakes. If you’re curious, there is a somewhat rigorous way to do this called ‘prospective power analysis’ which we unfortunately won’t have time to cover. But the textbook has a good discussion on it, and there are many internet resources available. And, JMP can do it (of course!). 3 Effects models and regression analysis As a reminder, an effects model is an empirical mathematical model proposed to describe the relationship between a response variable and one or more factors. Each additive term in the model is called an effect. Building and analyzing a model is a stepwise process as we discussed in the last section: 1. deciding on a model 2. estimation of the model parameters (“fitting”) 3. ANOVA to perform a “whole model” test to determine whether at least one effect is significant 4. statistical tests on individual effects (e.g. Tukey HSD) 5. residuals analysis 6. power analysis – did you make a type II error? As we saw in the last section, estimating parameters for a discrete–factor effects model is relatively easy. Parameters are estimated from the group (level) sample means. In a continuous–factor effects model, it’s a bit more complicated. We have a few options, but the most common is linear least squares regression, technically a sub–class of the more general method of “maximum likelihood estimation” (MLE). See the Gauss–Markov theorem for more details. In any case, for both discrete– and continuous–factor models, ANOVA is used for the first analysis: the whole model test. 
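Before diving into regression, here is a small simulation sketch of the statistical vs practical significance point from Section 2.1 (case 5). The data are simulated from the stated population parameters, not taken from any real experiment: with σ = 10 g, a 1 g difference in means is tiny, but with N = 2000 rats the t–test will often flag it as statistically significant.

```python
# Statistical vs practical significance: case 5 from Section 2.1, simulated.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n_per_group = 1000                                  # N = 2000 total

weights_A = rng.normal(500.0, 10.0, n_per_group)    # species A ~ N(500, 10^2)
weights_B = rng.normal(501.0, 10.0, n_per_group)    # species B ~ N(501, 10^2)

t_stat, p_value = stats.ttest_ind(weights_A, weights_B)
print(f"difference in sample means = {weights_B.mean() - weights_A.mean():.2f} g")
print(f"t = {t_stat:.2f}, p-value = {p_value:.4f}")
# p is frequently < 0.05 here: statistically significant, but a ~1 g difference
# in a ~500 g rat is unlikely to be practically significant.
```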
3.1 Linear regression

What do we mean by a linear model? Consider the familiar case in which the independent variable (factor) is continuous – there are no levels of x. Given a set of observations (x_1, y_1), (x_2, y_2), ..., (x_n, y_n), we could have a variety of different models to fit this data:

$$ y = \beta_0 + \beta_1 x + \varepsilon \qquad (84) $$
$$ y = \beta_0 + \beta_1 x^2 + \varepsilon \qquad (85) $$
$$ y = \beta_0 + \beta_1 \ln(x) + \varepsilon \qquad (86) $$
$$ y = \beta_0 + \beta_1 / \sqrt{x} + \varepsilon \qquad (87) $$

All of the above models are linear models because they are linear with respect to the model parameters β_0 and β_1. These parameters are also called the regression coefficients. Let's take as an example a simple linear model we all know and love:

$$ y = \beta_0 + \beta_1 x + \varepsilon \, . \qquad (88) $$

Here, β_0 and β_1 are the "true" model parameters. From experimental data (i.e. random samples from a larger population), we hope to obtain accurate estimates of these parameters. We will then have a model that allows us to predict the response,

$$ \hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x \qquad (89) $$

However, fitting these parameters usually involves an over–determined set of equations. Take as an example an experiment with six observations studying the effect of continuous variable x on continuous response y. Writing the prediction equation once for each observation gives six equations with 2 unknowns (β̂_0 and β̂_1). This system is mathematically over–determined. Real world data will always contain some random error and so there will be no exact solution to this set of equations. So instead we focus on finding the "best" solution, the model for which predicted values provide the "best fit" to the actual values.

The best approach to solving this problem is called the method of 'maximum likelihood estimation' or MLE, which we will cover towards the end of the course. Another option in addition to MLE is to choose what is called a 'loss function' L. The loss function essentially tells the model what to minimize when optimizing the parameters. Two typical choices of loss function are,

$$ L = \sum_{i=1}^{n} |e_i| \qquad (90) $$
$$ L = \sum_{i=1}^{n} e_i^2 \, , \qquad (91) $$

where e_i is the residual of point i, i.e.

$$ e_i = y_i - \hat{y}_i \qquad (92) $$
$$ e_i = y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i) \, . \qquad (93) $$

The optimal values of the parameters are those which minimize the loss function. Taking L = Σ e_i² as our loss function, we can determine optimal values of the parameters by taking the derivative with respect to each parameter and setting them equal to zero (i.e. minimizing with respect to the parameters):

$$ L = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 = \sum_{i=1}^{n}(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2 \qquad (94) $$
$$ \frac{\partial L}{\partial \hat{\beta}_0} = -2\sum_{i=1}^{n}(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0 \qquad (95) $$
$$ \frac{\partial L}{\partial \hat{\beta}_1} = -2\sum_{i=1}^{n} x_i(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0 \qquad (96) $$

This is now just two equations with two unknowns (β̂_0 and β̂_1) which can be easily solved. In general there will be one equation for each parameter, since we need to take the derivative with respect to every parameter and set it equal to zero. The above analysis is called "least squares linear regression" – "least squares" refers to the fact that we are minimizing the square of the error (i.e. a quadratic loss function), and linear regression refers to the fact that the model is linear in the parameters. As it turns out, least squares linear regression is exactly identical to the result from the maximum likelihood estimation approach when the errors are distributed normally (i.e. the model is adequate). In other words, least squares linear regression is statistically optimal in estimating the parameters!

Least squares linear regression can be generalized to multivariable models without too much difficulty. For example, suppose x_1 and x_2 are two different continuous variables (e.g. temperature and pressure).
Least squares linear regression can be generalized to multivariable models without too much difficulty. For example, suppose x1 and x2 are two different continuous variables (e.g. temperature and pressure). The following are a few examples of linear models relating these two variables to a response variable y:

y = β0 + β1 x1 + β2 x2 + ε                      (97)
y = β0 + β1 x1 + β2 x2 + β3 x1 x2 + ε           (98)
y = β0 + β1 x1 + β2 [sin(x1/x2)]⁵ + ε           (99)

Again, these are all linear models – ‘linear’ refers to the parameters, not to the variables!

In an even more general case, suppose we have k continuous variables, x1, x2, ..., xk. We can write down the simplest linear model for this case,

y = β0 + β1 x1 + β2 x2 + ... + βk xk + ε .      (100)

Given n total observations, we can write n equations for the k + 1 unknown coefficients. For example, the ith equation in this list will be:

ŷ_i = β̂0 + β̂1 x_{i1} + β̂2 x_{i2} + ... + β̂k x_{ik} .   (101)

We end up with n equations and k + 1 = p unknown coefficients. If n > p there is no exact solution! We can apply the same least squares framework as before,

L = Σ_{i=1}^{n} e_i² = Σ_{i=1}^{n} (y_i − ŷ_i)² = Σ_{i=1}^{n} (y_i − β̂0 − Σ_{j=1}^{k} β̂_j x_{ij})²      (102)

∂L/∂β̂0 |_{β̂_{k≠0}} = −2 Σ_{i=1}^{n} (y_i − β̂0 − Σ_{j=1}^{k} β̂_j x_{ij}) = 0                            (103)

∂L/∂β̂1 |_{β̂_{k≠1}} = etc. ,                                                                             (104)

and solve this set of k + 1 equations for the estimators β̂0, β̂1, ..., β̂k. This is what the computer is doing ‘behind the scenes’ when you fit a line in Excel or elsewhere!

Aside: dimensions and units

In discrete factor effects models, all model parameters (µ, τ_i) have the same units as the response variable – recall the example with the polycotton fiber tensile strength. In continuous factor effects models, however, the model parameters will generally not all have the same units. Regression coefficient units depend on the dimensions of both the response and the factor. For example, in the single variable regression

y = β0 + β1 x² + β2/√x + ε ,                    (105)

it certainly cannot be the case that each of these coefficients has the same units, unless both y and x are unitless.

3.2 Testing regression models

Recall that for a discrete factor effects model with multiple levels, our first goal (after fitting the model to obtain parameter estimates) was to use ANOVA to determine whether the factor affects the result at any level (i.e. at least one level has a significant effect). If we find that it does, we move to the second goal, which is to compare the effects at each level. A similar strategy is used for continuous factor effects models:

1. Whole model test via ANOVA
2. Tests on the individual parameters of the model (i.e. the regression coefficients)

Just as before, the total variability can be decomposed into the variability explained by our regression model and the variability due to random error:

SST = Σ_{i=1}^{N} (y_i − ȳ••)²            (106)
SSModel = Σ_{i=1}^{N} (ŷ_i − ȳ••)²        (107)
SSE = Σ_{i=1}^{N} (ŷ_i − y_i)²            (108)

We set up ANOVA similarly to before, and first test whether there is a relationship between the response variable and any regressor variable:

H0 : β1 = β2 = ... = βk = 0               (109)
H1 : β_j ≠ 0 for at least one j           (110)

Note that β0 is not included in these hypotheses! That’s because it represents the “overall” mean, similar to µ in the discrete–factor effects model example. Summarizing into a similar table as in the discrete factor case,

Table 3: ANOVA for testing significance of regression

source    sum of squares    degrees of freedom    mean square    F0 ∼ F_{p−1, N−p}
factor    SSModel           k = p − 1             MSModel        MSModel / MSE
error     SSE               N − p                 MSE
total     SST               N − 1
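Before turning to JMP, it may help to see the whole–model test computed by hand once. Here is a short Python sketch (not part of the course materials; the data values are made up, and SciPy is used only for the F distribution) that builds Table 3 for a single–regressor fit.

```python
import numpy as np
from scipy import stats

# Hypothetical single-regressor dataset (N = 8 observations, values made up)
x = np.array([10., 14., 18., 22., 26., 30., 34., 38.])
y = np.array([73., 70., 66., 60., 58., 52., 48., 44.])

N = len(y)
k = 1                      # one regressor
p = k + 1                  # parameters, including the intercept

# Fit y = b0 + b1*x by least squares
b1, b0 = np.polyfit(x, y, deg=1)
y_hat = b0 + b1 * x

# Sums of squares, Eqs. (106)-(108)
SST     = np.sum((y - y.mean())**2)
SSModel = np.sum((y_hat - y.mean())**2)
SSE     = np.sum((y - y_hat)**2)

# Mean squares and the whole-model F test (Table 3)
MSModel = SSModel / (p - 1)
MSE     = SSE / (N - p)
F0      = MSModel / MSE
p_value = stats.f.sf(F0, p - 1, N - p)    # P(F > F0) under H0

print(F0, p_value)   # a small p-value => reject H0: beta1 = 0
```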
Example: Suppose you are doing an experiment to test the effect of oil viscosity x on the wear y of a bearing assembly:

Figure 37: Ball bearing wear experiment.

We’ll start by proposing a simple model describing how the viscosity affects the wear:

y = β0 + β1 x + ε .      (111)

We can analyze this in the Fit Y by X platform of JMP, making sure to specify that both variables are continuous. After loading the data, we fit a line by selecting ‘Fit Line’, as illustrated in Figure 38:

Figure 38: Performing linear regression with JMP, part 1.

With the resulting JMP output:

Figure 39: Performing linear regression with JMP, part 2.

In addition to the usual R² statistics, we get:

• ANOVA results (an F–test to determine the significance of the whole model)
• parameter estimates and t–tests for each of the regression coefficients

What conclusions can we make based on the R² statistics and ANOVA? How are the parameter t–tests done, and what do they mean?

We can also run this in the Fit Model platform, which is a bit more powerful and has more details available:

Figure 40: Performing linear regression with JMP, part 3.

Note that in the Fit Model platform, JMP reports confidence intervals on the model parameters.

What does the null hypothesis here really mean? Really what we have is a null model that we are comparing with our proposed model. If the factor has no effect on the response, then we fail to reject the null model. The whole model test is essentially just comparing the proposed model, i.e.

y = β0 + β1 x + ε ,      (112)

with the null hypothesis model,

y = β0 + ε .             (113)

A significant result in the ANOVA indicates that at least one of the regressor variables (we only have one in this example) has a significant effect on the response, as illustrated in Figure 41.

Figure 41: Performing linear regression with JMP, part 4.

What if the ANOVA p–value > α? In this case the proper conclusion is that the proposed model is not significantly better than the null model. Note that this is not the same as concluding that x has no significant effect on y. It could be that our model is deficient – ANOVA tests the model! So, we should analyze the residuals to determine whether the model is adequate. If the residual analysis checks out, only then can we justify a conclusion that x has no significant effect on the response y. We do that in Figure 42:

Figure 42: Performing linear regression with JMP, part 5: residuals analysis.

The residuals appear to be structureless when plotted by row and by predicted value. The residual normal quantile plot follows the line y = x without any shift or deviating trend. Plotting the actual value against the predicted value similarly shows strong parity, suggesting our model is adequate.
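JMP generates these residual diagnostics automatically, but they are easy to reproduce yourself. The sketch below (not part of the course materials; it uses the viscosity–wear values that appear in Eq. (115) of the next subsection, with matplotlib/SciPy for plotting) produces the same three plots: residuals by row, residuals versus predicted values, and a normal quantile plot of the standardized residuals.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Viscosity-wear data (the same values appear in Eq. (115) below)
x = np.array([1.6, 15.5, 22.0, 43.0, 33.0, 40.0])
y = np.array([193., 230., 172., 91., 113., 125.])

# Fit the line and compute residuals
b1, b0 = np.polyfit(x, y, deg=1)
y_hat = b0 + b1 * x
e = y - y_hat
e_std = e / np.sqrt(np.sum(e**2) / (len(y) - 2))     # standardized by sqrt(MSE), p = 2 parameters

fig, axes = plt.subplots(1, 3, figsize=(12, 3.5))
axes[0].scatter(np.arange(1, len(y) + 1), e)         # residual by row number: look for trends
axes[0].set(xlabel="row", ylabel="residual")
axes[1].scatter(y_hat, e)                            # residual vs. predicted: look for structure
axes[1].set(xlabel="predicted wear", ylabel="residual")
stats.probplot(e_std, dist="norm", plot=axes[2])     # normal quantile plot
plt.tight_layout()
plt.show()
```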
3.3 Leverage points

In a given dataset, sometimes one or a handful of points will be what are called ‘high leverage’ points. In other words, these points are particularly influential on the model parameters. To see how we can identify high leverage points, we first need to dust off some of our linear algebra knowledge.

First, we note that linear algebra lets us write linear regression models in a very compact way:

y = Xβ + ε ,             (114)

where y, β, and ε are column vectors (of the observations, the parameters, and the errors, respectively) and X is the design matrix, with one row per observation. In the case of the viscosity–wear example,

y = [193  230  172  91  113  125]ᵀ ;   X = [1 1.6; 1 15.5; 1 22; 1 43; 1 33; 1 40] (one row per observation) ;   β = [β0  β1]ᵀ .      (115)

Additionally, linear algebra provides a way to easily and analytically solve for the least squares estimates of the regression coefficients. We won’t prove this relationship (if you’re interested, look up the ‘normal equation’), but we can directly write

β̂ = (X′X)⁻¹ X′y ,        (116)

where X′ refers to the transpose of X. Plugging this result into our prediction equation, we have

ŷ = X β̂ = X (X′X)⁻¹ X′ y = H y ,     (117)

where H is called the “hat” matrix – it “puts the hat on y.” The H and (X′X)⁻¹ matrices both play an important role in regression analysis. Recall, for example, that the diagonal elements of the hat matrix were used in the calculation of Studentized residuals. For the viscosity–wear data, we can calculate the hat matrix:

H = X (X′X)⁻¹ X′ =
[ 0.64   0.37   0.24  −0.16   0.03  −0.11]
[ 0.37   0.25   0.20   0.03   0.11   0.05]
[ 0.24   0.20   0.18   0.11   0.14   0.12]
[−0.16   0.03   0.11   0.40   0.26   0.36]
[ 0.03   0.11   0.14   0.26   0.21   0.25]
[−0.11   0.05   0.12   0.36   0.25   0.33]      (118)

Diagonal elements of H may indicate observations that have high leverage, i.e. are highly influential by virtue of their location in X space. In JMP, you can save the diagonal elements of the hat matrix with ‘Save’ > ‘Columns’ > ‘Hats’:

x       y      diag(H)
1.6     193    0.64
15.5    230    0.25
22.0    172    0.18
43.0     91    0.40
33.0    113    0.21
40.0    125    0.33

Here, the high leverage points are identified by the rows where the diagonal of the corresponding hat matrix is large – in other words, the first and fourth rows. Note that these rows correspond to the minimum and maximum values of x. Leverage point analysis is more interesting and useful in multivariate problems, when X is k–dimensional.
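If you would like to verify these numbers outside of JMP, the following NumPy sketch (not part of the course materials) applies Eqs. (116)–(117) to the viscosity–wear data and recovers the leverage values in the table above.

```python
import numpy as np

# Viscosity-wear data from Eq. (115)
x = np.array([1.6, 15.5, 22.0, 43.0, 33.0, 40.0])
y = np.array([193., 230., 172., 91., 113., 125.])

# Design matrix with a column of ones for the intercept
X = np.column_stack([np.ones_like(x), x])

# Least squares estimates via the normal equation, Eq. (116)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Hat matrix, Eq. (117); its diagonal flags high-leverage observations
H = X @ np.linalg.solve(X.T @ X, X.T)
leverage = np.diag(H)

print(beta_hat)              # intercept and slope estimates
print(leverage.round(2))     # approximately [0.64 0.25 0.18 0.40 0.21 0.33], matching the table
```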
3.4 Checking the adequacy of a regression model

The final thing to check with a regression model is adequacy. We will talk about this in greater detail towards the end of the course (keep an eye out for the bias/variance tradeoff), but for now we can use many of the same tools we used when analyzing discrete factor effects models. In particular:

• residuals analysis:
  – normal quantile plot of the standardized/Studentized residuals
  – plot residuals vs. each regressor variable and look for a structureless plot
• R² and adjusted R²:

R² = SSModel / SST                                                   (119)
adj. R² = 1 − MSE / [SST/(n − 1)] = 1 − (1 − R²)(n − 1)/(n − p)      (120)

The adjusted R² is just a modification of the original R² which attempts (somewhat successfully) to penalize overfitting of the data. To see this, let’s return to our viscosity–wear data. In our initial analysis, we fit a linear model with only one coefficient (besides the intercept), i.e. we fit a line to the data. What if we included higher order terms and fit a polynomial to this data?

We can fit a higher order polynomial in the Fit Y by X platform of JMP. After loading the data, choose ‘Fit Polynomial’ > ‘4, quartic’ to fit a fourth order polynomial. The results are summarized below in Figure 43.

Figure 43: Fitting a fourth order polynomial to the viscosity–wear data.

This model looks great! Recall that with the linear model (i.e. fitting a line), R² = 0.73 and adj. R² = 0.66. Here we see the fourth order fit has R² = 0.98 and adj. R² = 0.92, a significant improvement! But hold on a second – we fail to reject H0 with ANOVA! In other words, ANOVA is telling us that this fourth order model is not significantly better than the null model. What’s going on here?

Adding terms to the model decreases the error degrees of freedom, which results in a higher p–value despite a higher F–value. To fit a higher order model, we need more observations. ANOVA can partition the variance effectively only if we have enough data to allow a good look at the random error – we need more error degrees of freedom! For example, if we had only two experimental points, a line would give a perfect fit (R² = 1) but zero error degrees of freedom, and the model would not be testable. This phenomenon is called ‘overfitting’ of the data, which we will return to later.
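To see the same effect numerically outside of JMP, here is a short Python sketch (not part of the course materials) that fits both the line and the quartic to the viscosity–wear data and reports R², adjusted R², and the whole–model F test for each; compare the two p–values with the discussion above.

```python
import numpy as np
from scipy import stats

# Viscosity-wear data again (n = 6 observations)
x = np.array([1.6, 15.5, 22.0, 43.0, 33.0, 40.0])
y = np.array([193., 230., 172., 91., 113., 125.])
n = len(y)

def fit_report(degree):
    """Fit a polynomial of the given degree; return R^2, adj. R^2, F0, and the whole-model p-value."""
    p = degree + 1                                   # parameters, including the intercept
    coeffs = np.polyfit(x, y, deg=degree)
    y_hat = np.polyval(coeffs, x)
    SST = np.sum((y - y.mean())**2)
    SSE = np.sum((y - y_hat)**2)
    R2 = 1 - SSE / SST
    adj_R2 = 1 - (1 - R2) * (n - 1) / (n - p)        # Eq. (120)
    F0 = ((SST - SSE) / (p - 1)) / (SSE / (n - p))
    p_value = stats.f.sf(F0, p - 1, n - p)
    return R2, adj_R2, F0, p_value

print(fit_report(1))   # line: modest R^2, but a significant whole-model test
print(fit_report(4))   # quartic: R^2 near 1, yet only one error d.o.f. -- note the p-value
```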