Section 16. Linear constraints in multiple linear regression. Analysis of variance.

Multiple linear regression with general linear constraints. Let us consider a multiple linear regression $Y = X\beta + \varepsilon$ and suppose that we want to test a hypothesis given by a set of $s$ linear equations. In matrix form:
\[
H_0 : A\beta = c,
\]
where $A$ is an $s \times p$ matrix and $c$ is an $s \times 1$ vector. We will assume that $s \le p$ and that the matrix $A$ has rank $s$. This generalizes two types of hypotheses from the previous lecture, where we considered only one linear combination of the parameters (the case $s = 1$) or tested a hypothesis about all parameters simultaneously (the case $s = p$).

To test this general hypothesis, a natural idea is to compare how far $A\hat\beta$ is from $c$, and to do this we need to find the distribution of $A\hat\beta$. Clearly, this distribution is normal with mean $A\beta$ and covariance
\[
\mathbb{E}\, A(\hat\beta - \beta)(\hat\beta - \beta)^T A^T = A\,\mathrm{Cov}(\hat\beta)\,A^T = \sigma^2 A(X^T X)^{-1} A^T = \sigma^2 D,
\]
where we introduced the notation $D := A(X^T X)^{-1} A^T$. The matrix $D$ is a symmetric positive definite invertible $s \times s$ matrix and, therefore, we can take its square root $D^{1/2}$. It is easy to check that the covariance of $D^{-1/2} A(\hat\beta - \beta)$ is $\sigma^2 I$. This implies that
\[
\frac{1}{\sigma^2}\bigl|D^{-1/2} A(\hat\beta - \beta)\bigr|^2 = \frac{1}{\sigma^2}\,(A(\hat\beta - \beta))^T D^{-1} A(\hat\beta - \beta) \sim \chi^2_s.
\]
Under the null hypothesis, $A\beta = c$, we get
\[
\frac{1}{\sigma^2}(A\hat\beta - c)^T D^{-1} (A\hat\beta - c) \sim \chi^2_s. \tag{16.0.1}
\]
Since $n\hat\sigma^2/\sigma^2 \sim \chi^2_{n-p}$ is independent of $\hat\beta$, we get
\[
\frac{1}{s\sigma^2}(A\hat\beta - c)^T D^{-1} (A\hat\beta - c) \Big/ \frac{n\hat\sigma^2}{(n-p)\sigma^2}
= \frac{n-p}{ns\hat\sigma^2}\,(A\hat\beta - c)^T D^{-1} (A\hat\beta - c) \sim F_{s,n-p}. \tag{16.0.2}
\]
This is enough to test the hypothesis $H_0$. However, in a variety of applications a different, equivalent representation of (16.0.1) is more useful. It is given in terms of the MLE $\hat\beta_A$ of $\beta$ that satisfies the constraint in $H_0$. In other words, $\hat\beta_A$ is obtained by solving
\[
\text{minimize } |Y - X\beta|^2 \ \text{ subject to the constraint } A\beta = c. \tag{16.0.3}
\]
Lemma. If $\hat\beta_A$ is the solution of (16.0.3) then the left-hand side of (16.0.1) is equal to
\[
\frac{1}{\sigma^2}\bigl|X(\hat\beta_A - \hat\beta)\bigr|^2. \tag{16.0.4}
\]
Proof. First, let us find the constrained MLE $\hat\beta_A$ explicitly.
By the method of Lagrange multipliers we need to solve a system of two equations:
\[
A\beta = c, \qquad \frac{\partial}{\partial\beta}\Bigl(|Y - X\beta|^2 + (A\beta - c)^T \lambda\Bigr) = 0,
\]
where $\lambda$ is an $s \times 1$ vector. The second equation is
\[
-2X^T Y + 2X^T X\beta + A^T \lambda = 0.
\]
Solving this for $\beta$ gives
\[
\hat\beta_A = (X^T X)^{-1} X^T Y - \frac{1}{2}(X^T X)^{-1} A^T \lambda = \hat\beta - \frac{1}{2}(X^T X)^{-1} A^T \lambda.
\]
Since $\hat\beta_A$ must satisfy the linear constraint, we get
\[
c = A\hat\beta_A = A\hat\beta - \frac{1}{2} A(X^T X)^{-1} A^T \lambda = A\hat\beta - \frac{1}{2} D\lambda.
\]
Solving this for $\lambda$, $\lambda = 2D^{-1}(A\hat\beta - c)$, we get
\[
\hat\beta_A = \hat\beta - (X^T X)^{-1} A^T D^{-1} (A\hat\beta - c)
\]
and, therefore,
\[
X(\hat\beta_A - \hat\beta) = -X(X^T X)^{-1} A^T D^{-1} (A\hat\beta - c).
\]
We can use this formula to compute
\begin{align*}
|X(\hat\beta_A - \hat\beta)|^2
&= (X(\hat\beta_A - \hat\beta))^T X(\hat\beta_A - \hat\beta) \\
&= (A\hat\beta - c)^T D^{-1} A(X^T X)^{-1} X^T X (X^T X)^{-1} A^T D^{-1} (A\hat\beta - c) \\
&= (A\hat\beta - c)^T D^{-1} A(X^T X)^{-1} A^T D^{-1} (A\hat\beta - c) \\
&= (A\hat\beta - c)^T D^{-1} D D^{-1} (A\hat\beta - c)
= (A\hat\beta - c)^T D^{-1} (A\hat\beta - c).
\end{align*}
Comparing with (16.0.1) proves the Lemma.

Using (16.0.2) and the Lemma, we get that under the null hypothesis $H_0$:
\[
\frac{n-p}{ns\hat\sigma^2}\bigl|X(\hat\beta_A - \hat\beta)\bigr|^2 \sim F_{s,n-p}. \tag{16.0.5}
\]
There are many models that are special cases of multiple linear regression, and many hypotheses about these models can be written as general linear constraints. We will describe one such model in detail, the one-way layout in analysis of variance, and then describe a couple of other models without going into details, since the idea will be clear.

Analysis of variance: one-way layout. Suppose that we are given $p$ independent samples
\[
Y_{11}, \ldots, Y_{1n_1} \sim N(\mu_1, \sigma^2), \quad \ldots, \quad Y_{p1}, \ldots, Y_{pn_p} \sim N(\mu_p, \sigma^2)
\]
of sizes $n_1, \ldots, n_p$ respectively. We assume that the variances of all the distributions are equal. We would like to test the hypothesis that the means of all the distributions are equal,
\[
H_0 : \mu_1 = \ldots = \mu_p.
\]
This problem is in fact a special case of multiple linear regression with a hypothesis given by linear equations. We can write
\[
Y_{ki} = \mu_k + \varepsilon_{ki}, \ \text{ where } \varepsilon_{ki} \sim N(0, \sigma^2), \ \text{ for } k = 1, \ldots, p, \ i = 1, \ldots, n_k.
\]
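Before working through this example, note that the constrained estimate $\hat\beta_A = \hat\beta - (X^TX)^{-1}A^T D^{-1}(A\hat\beta - c)$ and the Lemma are easy to verify numerically. The following is a minimal Python sketch (the course software is Matlab; the random data here is purely illustrative):

```python
import numpy as np

# Verify the constrained least-squares formula and the Lemma on random data.
rng = np.random.default_rng(1)
n, p, s = 40, 4, 2
X = rng.normal(size=(n, p))
Y = rng.normal(size=n)
A = rng.normal(size=(s, p))          # s x p constraint matrix, full rank a.s.
c = rng.normal(size=s)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ Y         # unconstrained least-squares estimate
D = A @ XtX_inv @ A.T                # D = A (X'X)^{-1} A'
diff = A @ beta_hat - c
beta_A = beta_hat - XtX_inv @ A.T @ np.linalg.solve(D, diff)

# The constraint holds, and the quadratic form in (16.0.1) equals
# |X(beta_A - beta_hat)|^2, as the Lemma claims.
assert np.allclose(A @ beta_A, c)
lhs = diff @ np.linalg.solve(D, diff)
rhs = np.sum((X @ (beta_A - beta_hat)) ** 2)
assert np.allclose(lhs, rhs)
```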
Let us consider the $n \times 1$ vector, where $n = n_1 + \ldots + n_p$,
\[
Y = (Y_{11}, \ldots, Y_{1n_1}, \ldots, Y_{p1}, \ldots, Y_{pn_p})^T
\]
and the $p \times 1$ parameter vector $\mu = (\mu_1, \ldots, \mu_p)^T$. Then we can write all the equations in matrix form $Y = X\mu + \varepsilon$, where $X$ is the following $n \times p$ matrix, whose blocks have $n_1, \ldots, n_p$ rows:
\[
X = \begin{pmatrix}
1 & 0 & \cdots & 0 \\
\vdots & \vdots & & \vdots \\
1 & 0 & \cdots & 0 \\
0 & 1 & \cdots & 0 \\
\vdots & \vdots & & \vdots \\
0 & 1 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & 1 \\
\vdots & \vdots & & \vdots \\
0 & 0 & \cdots & 1
\end{pmatrix}.
\]
Basically, the predictor matrix $X$ consists of indicators of the group to which each observation belongs. The hypothesis $H_0$ can be written in matrix form as $A\mu = 0$ for the $(p-1) \times p$ matrix
\[
A = \begin{pmatrix}
1 & 0 & \cdots & 0 & -1 \\
0 & 1 & \cdots & 0 & -1 \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & \cdots & 1 & -1
\end{pmatrix}.
\]
We need to compute the statistic in (16.0.5), which will have distribution $F_{p-1,n-p}$. First of all,
\[
X^T X = \begin{pmatrix}
n_1 & 0 & \cdots & 0 \\
0 & n_2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & n_p
\end{pmatrix}.
\]
Since $\hat\mu = (X^T X)^{-1} X^T Y$, it is easy to see that for each $i \le p$,
\[
\hat\mu_i = \frac{1}{n_i}\sum_{k=1}^{n_i} Y_{ik} = \bar Y_i, \ \text{ the average of the } i\text{th sample.}
\]
We also get
\[
\hat\sigma^2 = \frac{1}{n}|Y - X\hat\mu|^2 = \frac{1}{n}\sum_{i=1}^p \sum_{k=1}^{n_i} (Y_{ik} - \bar Y_i)^2.
\]
To find the MLE $\hat\mu_A$ under the linear constraint $A\mu = 0$ we simply need to minimize $|Y - X\mu|^2$ over vectors $\mu = (\mu_1, \ldots, \mu_1)^T$ with all coordinates equal. But then, obviously, $X\mu$ is the $n \times 1$ vector $(\mu_1, \ldots, \mu_1)^T$, so we need to solve
\[
\min_{\mu_1} \sum_{i=1}^p \sum_{k=1}^{n_i} (Y_{ik} - \mu_1)^2,
\]
and we get
\[
\mu_1 = \frac{1}{n}\sum_{i=1}^p \sum_{k=1}^{n_i} Y_{ik} = \bar Y, \ \text{ the overall average of all samples.}
\]
Therefore,
\[
\hat\mu_A - \hat\mu = (\bar Y - \bar Y_1, \ldots, \bar Y - \bar Y_p)^T
\]
and
\[
|X(\hat\mu_A - \hat\mu)|^2 = \sum_{i=1}^p \sum_{k=1}^{n_i} (\bar Y_i - \bar Y)^2 = \sum_{i=1}^p n_i (\bar Y_i - \bar Y)^2.
\]
By (16.0.5), under the null hypothesis $H_0$,
\[
F := \frac{n-p}{p-1} \cdot \frac{\sum_{i=1}^p n_i (\bar Y_i - \bar Y)^2}{\sum_{i=1}^p \sum_{k=1}^{n_i} (Y_{ik} - \bar Y_i)^2} \sim F_{p-1,n-p}. \tag{16.0.6}
\]
In order to test $H_0$, we define the decision rule
\[
\delta = \begin{cases} H_0, & F \le c_\alpha \\ H_1, & F > c_\alpha \end{cases}
\qquad \text{where } F_{p-1,n-p}(c_\alpha, +\infty) = \alpha.
\]
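The statistic (16.0.6) can be computed directly from the group means and sizes. The sketch below is in Python rather than the course's Matlab, with hypothetical data and a hypothetical helper name `one_way_f`; as a sanity check it compares the result with `scipy.stats.f_oneway`, which computes the same statistic.

```python
import numpy as np
from scipy import stats

def one_way_f(samples):
    """F-statistic of (16.0.6) for a list of 1-d arrays, one per group."""
    p = len(samples)
    n = sum(len(y) for y in samples)
    grand = np.concatenate(samples).mean()               # overall average
    between = sum(len(y) * (y.mean() - grand) ** 2 for y in samples)
    within = sum(np.sum((y - y.mean()) ** 2) for y in samples)
    F = (n - p) / (p - 1) * between / within
    return F, stats.f.sf(F, p - 1, n - p)                # statistic, p-value

# Hypothetical unbalanced samples from three groups.
rng = np.random.default_rng(2)
groups = [rng.normal(loc=m, size=sz) for m, sz in [(0.0, 10), (0.5, 12), (1.0, 8)]]
F, pval = one_way_f(groups)
F_sp, p_sp = stats.f_oneway(*groups)   # scipy's one-way ANOVA, same F
```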
The sum in the numerator of (16.0.6) represents the total variation of the sample means $\bar Y_i$ of each population around the overall mean $\bar Y$. The sum in the denominator represents the total variation of the observations $Y_{ik}$ around their particular sample means $\bar Y_i$. This interpretation of the test statistic explains the name: analysis of variance, or ANOVA.

Example. Let us again consider the normal body temperature dataset and perform an ANOVA test to compare the mean body temperatures of men and women. Previously we tested this using t-tests and the two-sample KS test. We use the Matlab function

[p,tbl,stats] = anova1([men, women])

where 'men' and 'women' are $65 \times 1$ vectors. (For unequal groups, 'anova1' requires a second argument with group labels.) The output produces the table 'tbl':

'Source'     'SS'        'df'    'MS'       'F'        'Prob>F'
'Columns'    [ 2.7188]   [  1]   [2.7188]   [5.2232]   [0.0239]
'Error'      [66.6262]   [128]   [0.5205]
'Total'      [69.3449]   [129]

'SS' gives the sums of squares: the numerator of (16.0.6) ('Columns'), the denominator ('Error'), and their total. 'df' gives the degrees of freedom $p - 1$ and $n - p$. 'MS' gives the sums of squares normalized by the corresponding degrees of freedom. 'F' is the statistic in (16.0.6) and 'Prob>F' is the p-value corresponding to this F-statistic. Since the p-value is 0.0239, at the level of significance $\alpha = 0.05$ we reject the null hypothesis that the means are equal.

Analysis of variance: two-way layout. Suppose that we again have samples from different groups, only now the groups have two categories defined by two factors. For example, if we want to compare SAT scores in different states but also separate public and private schools, then the groups are defined by two factors: state and school type. We consider the following model of the data:
\[
Y_{ijk} = \mu_{ij} + \varepsilon_{ijk} \quad \text{for } i = 1, \ldots, a, \ j = 1, \ldots, b \text{ and } k = 1, \ldots, n_{ij},
\]
i.e. we have $a$ categories of the first factor, $b$ categories of the second factor, and $n_{ij}$ observations in group $(i, j)$.
This model is not really different from one-way ANOVA, the groups are simply indexed by two parameters/factors, and the estimation of the parameters can be carried out as in the one-way ANOVA. However, to test various hypotheses about the effects of the two factors it is more convenient to write the model in an equivalent way as follows:
\[
Y_{ijk} = \mu + \alpha_i + \beta_j + \delta_{ij} + \varepsilon_{ijk},
\]
where we assume that
\[
\sum_{i=1}^a \alpha_i = 0, \quad \sum_{j=1}^b \beta_j = 0, \quad \sum_{i=1}^a \delta_{ij} = \sum_{j=1}^b \delta_{ij} = 0.
\]
These constraints define the new parameters uniquely in terms of the original parameters $\mu_{ij}$. The parameter $\mu$ is called the overall mean. The reason we separate the additive effects $\alpha_i$ and $\beta_j$ of the two factors from the most general interaction effect $\delta_{ij}$ is that it is easier to formulate various hypotheses in terms of these parameters. For example:

• $H_0: \alpha_1 = \ldots = \alpha_a = 0$, i.e. the additive effect of the first factor is insignificant;
• $H_0: \beta_1 = \ldots = \beta_b = 0$, i.e. the additive effect of the second factor is insignificant;
• $H_0$: all $\delta_{ij} = 0$, i.e. the effect of the interaction of the two factors is insignificant, that is, the effect of the factors is additive.

The Matlab function 'anova2' performs the two-way layout of ANOVA if the sizes of all groups $n_{ij}$ are equal, i.e. the data is balanced. If the group sizes differ, one should use 'anovan', a general N-way ANOVA.

Analysis of covariance. This is another special case of multiple regression, in which every group of data also has a continuous predictor variable. The model is
\[
Y_{ik} = \alpha + \alpha_i + (\beta + \beta_i) X_{ik} + \varepsilon_{ik} \quad \text{for } i = 1, \ldots, a \text{ and } k = 1, \ldots, n_i.
\]
We have $a$ groups and $n_i$ observations in the $i$th group. To determine the parameters uniquely we assume that
\[
\sum_{i=1}^a \alpha_i = 0, \quad \sum_{i=1}^a \beta_i = 0.
\]
Example (fruitfly dataset). We consider a dataset from [1] (available on the journal's website) and [2]. The experiment consisted of five groups of male fruitflies, with 25 male fruitflies in each group. The males in each group were supplied with a different number of either receptive or non-receptive females each day.
Group 1: 8 newly inseminated (non-receptive) females per day;
Group 2: no females;
Group 3: 1 newly inseminated (non-receptive) female per day;
Group 4: 1 receptive female per day;
Group 5: 8 receptive females per day.

The experiment was designed to test whether increased reproduction results in decreased longevity, so the lifespan of each male fruitfly was the response variable $Y$.

One-way ANOVA. Let us start with a one-way ANOVA, i.e. we consider the model
\[
Y_{ik} = \mu_i + \varepsilon_{ik}, \quad \text{where } i = 1, \ldots, 5, \ k = 1, \ldots, 25,
\]
and test the hypothesis $H_0: \mu_1 = \ldots = \mu_5$. Suppose that 'lifespan1' is a $25 \times 5$ matrix such that each column contains the observations from one group. Then running

[p,tbl,stats] = anova1(lifespan1);

produces the boxplot in figure 16.1 and the table 'tbl':

'Source'     'SS'            'df'    'MS'            'F'         'Prob>F'
'Columns'    [1.1939e+004]   [  4]   [2.9848e+003]   [13.6120]   [3.5156e-009]
'Error'      [2.6314e+004]   [120]   [ 219.2793]
'Total'      [3.8253e+004]   [124]

The p-value shows how unlikely the hypothesis $H_0$ is.

[Figure 16.1: Boxplot for one-way ANOVA.]

The boxplot suggests that the last group's lifespan is most different from the other four groups. As a result, we might want to test the hypothesis $H_0: \mu_1 = \ldots = \mu_4$ that the means of the first four groups are equal. Running

[p,tbl,stats] = anova1(lifespan1(:,1:4));

we get the following table:

'Source'     'SS'            'df'   'MS'         'F'        'Prob>F'
'Columns'    [ 988.0800]     [ 3]   [329.3600]   [1.3869]   [0.2515]
'Error'      [2.2798e+004]   [96]   [237.4842]
'Total'      [2.3787e+004]   [99]

We see that the p-value is 0.2515, so we accept $H_0$ whenever the level of significance $\alpha$ is at most this p-value.

Two-way ANOVA. Let us now consider the four groups without the second group (no females) and test the effects of two factors:

• Factor A: 'receptive' or 'non-receptive';
• Factor B: '1' or '8'.

This means that we consider the model
\[
Y_{ijk} = \mu + \alpha_i + \beta_j + \delta_{ij} + \varepsilon_{ijk} \quad \text{for } i = 1, 2, \ j = 1, 2 \text{ and } k = 1, \ldots, 25.
\]
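In this balanced case the parameters of the two-way decomposition are obtained from the cell means by simple averaging: $\mu$ is the overall mean of the $\mu_{ij}$, $\alpha_i$ and $\beta_j$ are centered row and column means, and $\delta_{ij}$ is what remains. A short Python sketch (with hypothetical cell means) computes this and checks the zero-sum constraints:

```python
import numpy as np

# Decompose a table of cell means mu_ij into mu + alpha_i + beta_j + delta_ij
# under the zero-sum constraints of the two-way layout (hypothetical numbers).
mu_ij = np.array([[3.0, 5.0],
                  [1.0, 6.0]])                        # a = 2 rows, b = 2 columns
mu = mu_ij.mean()                                     # overall mean
alpha = mu_ij.mean(axis=1) - mu                       # first-factor (row) effects
beta = mu_ij.mean(axis=0) - mu                        # second-factor (column) effects
delta = mu_ij - mu - alpha[:, None] - beta[None, :]   # interaction effects

assert np.isclose(alpha.sum(), 0) and np.isclose(beta.sum(), 0)
assert np.allclose(delta.sum(axis=0), 0) and np.allclose(delta.sum(axis=1), 0)
assert np.allclose(mu + alpha[:, None] + beta[None, :] + delta, mu_ij)
```

The three null hypotheses above ask, respectively, whether the $\alpha_i$, the $\beta_j$, or the $\delta_{ij}$ all vanish.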
To use the Matlab function 'anova2' we arrange the data into a $50 \times 2$ matrix 'lifespan2' such that the two columns represent the two categories of Factor A, the first 25 rows represent group '1' of Factor B, and rows 26 through 50 represent group '8' of Factor B. Then

[p,tbl,stats] = anova2(lifespan2,25)

(here 25 indicates the number of replicates in one cell) produces the table

'Source'         'SS'            'df'   'MS'            'F'         'Prob>F'
'Columns'        [6.6749e+003]   [ 1]   [6.6749e+003]   [32.3348]   [1.3970e-007]
'Rows'           [1.7223e+003]   [ 1]   [1.7223e+003]   [ 8.3430]   [ 0.0048]
'Interaction'    [2.3717e+003]   [ 1]   [2.3717e+003]   [11.4890]   [ 0.0010]
'Error'          [1.9817e+004]   [96]   [ 206.4308]
'Total'          [3.0586e+004]   [99]

The p-values in the last column correspond to three hypotheses:

• $H_0: \alpha_1 = \alpha_2 = 0$, i.e. the effect of Factor A is insignificant;
• $H_0: \beta_1 = \beta_2 = 0$, i.e. the effect of Factor B is insignificant;
• $H_0: \delta_{11} = \delta_{12} = \delta_{21} = \delta_{22} = 0$, i.e. the effect of the 'interaction' between Factors A and B is insignificant.

The small p-values suggest that all three hypotheses should be rejected.

Analysis of covariance. Besides the reproduction factors A and B, another continuous explanatory variable for longevity was used: the length of the thorax (the division of the body between the head and the abdomen, i.e. the chest). We are now in the setting of ANCOVA:
\[
Y_{ik} = \alpha + \alpha_i + (\beta + \beta_i) X_{ik} + \varepsilon_{ik} \quad \text{for } i = 1, \ldots, 5 \text{ and } k = 1, \ldots, 25.
\]
The analysis of covariance tool in Matlab,

aoctool(thorax,lifespan,groups);

produces the output in figure 16.2.

[Figure 16.2: Left column, top to bottom: graph of the fitted line for each group, estimates of the coefficients, ANCOVA test table. Right column: the same under the assumption that all slopes are equal.]

We see that the p-value of the 'groups*thorax' interaction, corresponding to the hypothesis that all $\beta_i = 0$, is equal to 0.9844, which means that we can accept this hypothesis.
As a result, we fit the model with equal slopes for all groups (figure 16.2, right column). The p-values for 'groups' and 'thorax', corresponding to the hypotheses that all $\alpha_i = 0$ and that $\beta = 0$, are almost 0, so we should reject these hypotheses.

References.

[1] Hanley, J. A., and Shapiro, S. H. (1994), "Sexual Activity and the Lifespan of Male Fruitflies: A Dataset That Gets Attention," Journal of Statistics Education, Volume 2, Number 1.

[2] Partridge, L., and Farquhar, M. (1981), "Sexual Activity and the Lifespan of Male Fruitflies," Nature, 294, 580-581.