The General Linear Hypothesis

George H. Olson, Ph.D.
Leadership and Educational Studies
Appalachian State University

Manuscript in Preparation

Outline of Manuscript (colored entries are already developed; see below)

Introduction
    What is the General Linear Hypothesis? How does it differ from the General Linear Model (GLM)?
    Early work
    Paucity of contemporary references
The Univariate General Linear Hypothesis (GLH)
    Definition of the Model (the Theory)
    Assumptions
    Estimation of β
    Expectation and variance of b
    Estimating the variance, σ²
    Testing General Linear Hypotheses
    Typical Applications
        Analysis of Variance Models
            Choice of design matrices
                Dummy variable coding
                Effect coding
                Using a super identity design matrix
            Simple examples that can be completed using a calculator (or Excel)
            More complex examples using SPSS, SAS, and SYSTAT
        Analysis of Covariance Models
            Choice of design matrices
            Simple examples that can be completed using a calculator (or Excel)
            More complex examples using SPSS, SAS, and SYSTAT
        Multiple Regression Models
            Simple examples that can be completed using a calculator (or Excel)
            More complex examples using SPSS, SAS, and SYSTAT
The Multivariate General Linear Model
    Definition of the Model (the Theory)
    Additional Assumptions
    Types of Hypotheses that can be Tested
    Most General Model: AβC = D
    Estimation of Parameter Matrices, β and Σ
    Typical Applications
        Analysis of Variance
            The Equations
            The statistical tests
            Simple examples that can be computed using a calculator (or Excel)
            More complex examples using SPSS, SAS, and SYSTAT
        Analysis of Variance with Repeated Measures
            The Equations
            The statistical tests
            Simple examples that can be computed using a calculator (or Excel)
            More complex examples using SPSS, SAS, and SYSTAT
        Multivariate Profile Analysis
            The Equations
            The statistical tests
            Simple examples that can be computed using a calculator (or Excel)
            More complex examples using SPSS, SAS, and SYSTAT
        Analysis of Covariance
            The Equations
            The statistical tests
            Simple examples that can be computed using a calculator (or Excel)
            More complex examples using SPSS, SAS, and SYSTAT
        Analysis of Covariance with Repeated Measures
            Covariance Measures Collected Prior to Repeated Measures
            Covariance Measures Collected Concomitant with Repeated Measures
        Multivariate Regression Analysis
            The Equations
            The statistical tests
            Simple examples that can be computed using a calculator (or Excel)
            More complex examples using SPSS, SAS, and SYSTAT
Appendices
    Summation Operators (use paper from my website)
    Expectation Operators
    Matrix Algebra (use paper from my website)

Introduction

By now, education researchers should be disposed to the idea that both causes and effects of educational phenomena are inherently multivariate in nature. Researchers have held this view for over six decades (see, for example, Tatsuoka, 1973, as well as the older surveys of multivariate statistical applications in education and psychology in books by Cattell, 1962, and Whitla, 1968, and the more recent surveys by Srivastava & Carter, 1983, and Myers, 2008). In their statistical applications, however, researchers, while recognizing that the causes of educational phenomena are indeed multidimensional, have tended to employ techniques of multiple linear regression (MLR), with increasing levels of sophistication, to investigate both the combined effects of, and interrelationships among, multiple independent variables. For the most part, however, researchers have tended not to investigate the "multivariateness" of educational outcomes.
Instead, the typical approach has been to study the effects of a common set of independent variables on each of several criterion variables separately. A less well-known data-analytic strategy (at least in the social sciences), one that also utilizes GLMs, is the general linear hypothesis (GLH), first proposed by Kolodziejczyk (1935). Kempthorne (1952), in a now out-of-print book, provided a very tractable development of GLH theory and showed how it could be applied to a wide variety of research situations. Graybill (1961) provided extensions. Both authors used matrix notation to extend Kolodziejczyk's original formulation.

In preparing for this paper, I did a quick, informal review of the studies published in the last three years in the American Educational Research Journal and the Journal of Educational Psychology. Over 80 percent of the experimental and comparative studies reported in these journals investigated effects on multiple criteria. Although several of these studies employed multivariate techniques, the majority (approximately three-fourths) employed multiple univariate analyses of variance. The typical example is a study in which several independent variables (including pretests, demographic data, and incidence-of-treatment measures) are examined separately for their effects on, say, measures of motivation, achievement, and attitude toward school, variables that few would deny are correlated (perhaps highly) in the population.

The possible ramifications of performing separate univariate analyses on correlated criteria are well known (e.g., Myers, 1990). If a Type I error occurs in the tests involving one criterion, the probability is greater than α that it will occur in the tests involving the other criteria as well. This fact, by itself, should provide sufficient motivation for the researcher to seek multivariate techniques for the analysis of multiple-criteria designs. At the very least, a multivariate technique offers the researcher a procedure for controlling the experiment-wise probability of a Type I error.

Multivariate statistical procedures fit the epistemological point of view that the "effects" in educational settings are rarely, if ever, unidimensional and therefore should not be studied in isolation. Studying multiple outcomes simultaneously can afford the researcher an opportunity to uncover complex relationships among treatment and outcome variables that might otherwise go undiscovered. An example might be a situation in which two experimental instructional programs lead not only to increases in several measures of achievement and motivation but also, under one of the programs, to increases in the correlations among the variables. Such a finding might have important implications for understanding the dynamics of the instructional programs.

If the need for greater application of multivariate statistical procedures can be established so easily, why then are these procedures not used more frequently in educational research? Although it is impossible to answer this question completely, there is no doubt that at least part of the answer lies in the fact that multivariate procedures are more complex, both mathematically and conceptually, than univariate procedures. Until recently, researchers who sought to use multivariate techniques first had to develop a fairly high level of mathematical skill to be able to read the existing textbooks.
Even given the requisite mathematical ability, the existing textbooks have often proven less than useful to the applied researcher, since they contain a paucity of examples bearing even a passing resemblance to the kinds of problems encountered in educational research. In terms of the practical use of the multivariate general linear model, very little has been published since the disappearance of the programs BMD 05V and BMDX 63, when SPSS, Inc. purchased the BMD suite of statistical packages. Most of the literature relating to the general linear hypothesis is quite dated. Several years (decades) ago there were a number of intermediate-level (in mathematics) books on multivariate statistics. Many of these are still useful (e.g., Cooley & Lohnes, 1971; Finn, 1974; Harris, 1975; Morrison, 1967; Press, 1972; Tatsuoka, 1971; Van de Geer, 1971). More recently, books by Srivastava & Carter (1983) and Myers (1990) have provided fairly tractable accounts of the general linear hypothesis. Although these books usually require some ability in matrix algebra, the mathematical developments tend to be quite readable. More than that, however, these books provide a rich source of examples of applications of multivariate procedures. It is likely that, due to these books alone, many more applications of multivariate procedures have begun to appear, and will continue to appear, in the published literature.

Unfortunately, much of the literature on the general linear hypothesis published in recent years has not appeared in the education research literature. Rather, it appears in such areas as brain research (Friston et al., 1994; Keselman, Wilcox, & Lix, 2003) and marketing research (Perreault & Darden, 1975). Most of the more recent literature on the general linear hypothesis consists of highly technical, mathematical treatises (e.g., Muller & Peterson, 1984, an examination of methods for computing power in tests of the multivariate general linear hypothesis; McDonald, 1975, tests of general linear hypotheses under various complex multivariate designs; and Hothorn, Bretz, & Westfall, 2008, simultaneous inference).

The Univariate General Linear Hypothesis (GLH)

We will begin with the univariate general linear hypothesis (GLH) and treat the multivariate case later. The GLH is the most general of all parametric linear statistical procedures. In fact, all linear statistical tests (univariate and multivariate) can be developed as special subclasses of GLH theory. This includes, among other procedures, factor analysis, discriminant analysis, prediction (or regression) analysis, and the analysis of variance and covariance. Although the mathematical development of GLH theory is complex, its conceptualization, at least insofar as the most common applied situations are concerned, is not particularly difficult. It does, however, require an elementary facility with matrix algebra.

A simple statement of the theory is that if a set of dependent variables, y, is linearly related to a set of independent variables, X (where the bolding denotes a matrix), by the equation

y = Xβ + ε,    (1)

where β is a set of unknown parameters of interest, then any hypothesis of the form

H0: AβC = D    (2)

is testable. In this equation, the matrices A and C are used to select particular elements from among the rows and columns of β. The matrix D is a matrix of constants specified by the investigator. Usually D is set equal to 0, a matrix of zeros. In the next section a brief introduction to the theory of the GLH is provided.
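To make the form of (2) concrete, here is a small example of my own (in the univariate model, the matrix C reduces to the scalar 1). For a single-factor, three-group design with β = (μ₁, μ₂, μ₃)', taking

\[
\underbrace{\begin{pmatrix} 1 & 0 & -1 \\ 0 & 1 & -1 \end{pmatrix}}_{\mathbf{A}}
\underbrace{\begin{pmatrix} \mu_1 \\ \mu_2 \\ \mu_3 \end{pmatrix}}_{\boldsymbol{\beta}}
\underbrace{(1)}_{C}
= \begin{pmatrix} \mu_1 - \mu_3 \\ \mu_2 - \mu_3 \end{pmatrix}
= \underbrace{\begin{pmatrix} 0 \\ 0 \end{pmatrix}}_{\mathbf{D}}
\]

states the familiar one-way ANOVA null hypothesis, μ₁ = μ₂ = μ₃. The same A and D reappear in the three-group example developed later in the paper.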
Readers who wish to pursue this development in greater depth are encouraged to consult the appropriate chapters in Kempthorne (1952), Kullback (1968), Mendenhall (1968), and Seal (1964). Early papers by Smith, Gnanadesikan, and Hughes (1962) and Bock (1963b) develop the MGLH in a way that is useful to those who may wish to program the computations. Thorough, but highly mathematical, presentations of the MGLH have been given by Anderson (1958), Rao (1965), and Seber (1966). In the third section of this paper, several general algebraic examples of typical applications are provided. Other examples of applications of the MGLH, using real data, can be found in Bock (1963a), Bock and Haggard (1968), Jones (1966), and Finn (1974).

Introduction to the Theory of the GLH

In this paper, interest is focused primarily upon the analysis of variance and the analysis of covariance. Repeated measures (or profile) analyses of variance and covariance are also covered but are treated as special instances of multivariate analyses of variance and covariance. Before proceeding to illustrative applications of the GLH, however, it is first necessary to lay down some of the mathematical groundwork.

Definition of the Model

We begin by rewriting the general equation for a univariate linear model given earlier,

y = Xβ + ε,    (1)

where y is an (n × 1) column vector of observations or measurements (dependent variates),

\[ \mathbf{y} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}, \]

X is an (n × p) matrix of known predictor or design variables,

\[ \mathbf{X} = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{pmatrix}, \]

β is a (p × 1) vector of parameters (coefficients) to be estimated,

\[ \boldsymbol{\beta} = \begin{pmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_p \end{pmatrix}, \]

and ε is an (n × 1) vector of disturbances (residuals, measurement errors),

\[ \boldsymbol{\varepsilon} = \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix}. \]

It should be noted that the model (1) is very general. Depending upon the choice of X, the model forms the basis for a wide variety of common linear statistical procedures. For instance, if X is a matrix of dummy classification variables, then (1) is the model for the analysis of variance. Furthermore, if X consists of measures taken on a set of p predictor variables, then (1) is a prediction model. Finally, if X contains both classification variables and predictor variables, (1) is a model for the analysis of covariance. If y is a vector of 1s and 0s and X contains measures on a set of predictor variables, then (1) provides a model for a two-group discriminant analysis. Additionally, if y is expanded to an (n × q) matrix, Y, of observations on q variates, then (1) serves as a model for an even wider class of statistical procedures, including factor analysis. In this sequel, only those models where X is a matrix of classification variables, predictor variables, or both are considered.

Assumptions

The assumptions applicable to the model defined above, for purposes of testing hypotheses, are the following:

1. The model is linear in terms of the parameters;
2. n > p;
3. X is of full rank, p;
4. ε is distributed normally around μ = 0, with variance σ²Iₙ, where 0 is a column vector of zeros and Iₙ is an (n × n) identity matrix.

The first assumption is rarely limiting in educational research and evaluation. Many models that appear non-linear at first sight are actually intrinsically linear. In these situations, suitable linear models can be written following acceptable transformations of the original measures (see Draper and Smith, 1968, ch. 5, for a discussion of these types of models).
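As a quick illustration of an intrinsically linear model (my sketch, not drawn from Draper and Smith; the parameter values and variable names are arbitrary), consider the exponential growth model y = α e^{βx} ε. Taking logarithms gives ln y = ln α + βx + ln ε, which is linear in its parameters and can be fit with the least squares machinery developed below:

```python
import numpy as np

rng = np.random.default_rng(1)

# Data generated from the intrinsically linear model y = alpha * exp(beta*x) * eps
alpha, beta = 2.0, 0.5
x = np.linspace(0.1, 5.0, 50)
y = alpha * np.exp(beta * x) * rng.lognormal(sigma=0.1, size=x.size)

# Taking logs yields a model linear in the parameters: ln y = ln(alpha) + beta*x + ln(eps)
X = np.column_stack([np.ones_like(x), x])      # full-rank (n x 2) design matrix
b = np.linalg.solve(X.T @ X, X.T @ np.log(y))  # b = (X'X)^{-1} X'y

print(np.exp(b[0]), b[1])                      # recovers alpha ~ 2.0 and beta ~ 0.5
```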
Assumption 2 is required to ensure the availability of a sufficient number of observations to estimate the p elements in β. The third assumption is necessary to ensure a unique solution for β. Since this solution requires the inverse of X'X, it is necessary that X'X be nonsingular; hence X must be of full rank. More general, non-unique solutions, which use generalized inverses of X'X, allow X to have more columns than its rank (i.e., they allow X to be less than full rank). Since full-rank design matrices are usually quite easily constructed for most designs in educational research, the generalized-inverse approach is rarely, if ever, needed. In any case, for any design matrix less than full rank, there always exists a transformation matrix, T, such that X* = XT is of full rank (see Bock [1963], Graybill [1961, pp. 235-239], or Smith [1972] for details on computing T). The fourth assumption provides the foundation for the theory underlying tests of GLHs; it implies that y ~ N(Xβ, σ²Iₙ).

Estimation of β and σ²

Since the elements of β and σ² are not generally known in practice, they must be estimated. Two procedures are available, viz., maximum likelihood (ML) and least squares (LS). Since the former requires the assumption of multivariate normality whereas the latter does not, and since the solutions for β are the same in either case, only the LS procedure will be pursued. The LS procedure calls for obtaining an estimate of β, say b, such that the error sum of squares, s² = e'e, is a minimum. In matrix terms, b is chosen to minimize the quadratic form

\[ s^2 = (\mathbf{y}-\mathbf{Xb})'(\mathbf{y}-\mathbf{Xb}) = \mathbf{y}'\mathbf{y} - \mathbf{y}'\mathbf{Xb} - \mathbf{b}'\mathbf{X}'\mathbf{y} + \mathbf{b}'\mathbf{X}'\mathbf{Xb} = \mathbf{y}'\mathbf{y} - 2\mathbf{b}'\mathbf{X}'\mathbf{y} + \mathbf{b}'\mathbf{X}'\mathbf{Xb}. \]

This is accomplished by differentiating with respect to b, setting the result to zero, and solving for b:

\[ \partial s^2 / \partial \mathbf{b} = -2\mathbf{X}'(\mathbf{y}-\mathbf{Xb}) = -2\mathbf{X}'\mathbf{y} + 2\mathbf{X}'\mathbf{Xb} = \mathbf{0}. \quad (3) \]

Rearranging terms, X'y = (X'X)b, and the solution for b is given by

\[ \mathbf{b} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}. \quad (4) \]

The solution (4) depends upon X'X being nonsingular, a condition that always holds when X is of full column rank (i.e., rank{X} = p). There are more general solutions to (4), using, for instance, generalized inverses, but these yield non-unique solutions for b. Usually, however, a more general solution for b is unnecessary, for when X is less than full rank (i.e., when the number of columns in X is greater than its rank) there always exists a transformation matrix, T, such that X* = XT is of full rank. Furthermore, it is fairly easy, in most applications, to construct X so that it is of full column rank.

The estimates, b, have been shown to be unbiased and minimally dispersed by several authors (e.g., Anderson [1958], Kullback [1968], Press [1972], Rao [1965]). This is fairly easy to show.

Expectation and variance of b. Recall that, in the population, the parametric equation (1) is y = Xβ + ε. Hence, (4) can be rewritten as

\[ \mathbf{b} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'(\mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}) = (\mathbf{X}'\mathbf{X})^{-1}(\mathbf{X}'\mathbf{X})\boldsymbol{\beta} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{\varepsilon} = \boldsymbol{\beta} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{\varepsilon}. \quad (5) \]

Then the expectation of b is given by

\[ E\{\mathbf{b}\} = E\{\boldsymbol{\beta} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{\varepsilon}\} = \boldsymbol{\beta} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'E\{\boldsymbol{\varepsilon}\} = \boldsymbol{\beta}, \quad (6) \]

since, under the assumptions given earlier, E{ε} = 0. Hence, the expectation of b is β, demonstrating the unbiasedness of b.
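Both the unbiasedness of b and the variance result derived next are easy to verify numerically. The following Monte Carlo sketch (my illustration; the design and parameter values are arbitrary) repeatedly samples ε, recomputes b, and compares the empirical mean and covariance of b with β and σ²(X'X)⁻¹:

```python
import numpy as np

rng = np.random.default_rng(0)

n, sigma = 40, 2.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # fixed, full-rank (n x 3) design
beta = np.array([1.0, -0.5, 2.0])

bs = np.empty((20000, 3))
for r in range(bs.shape[0]):
    y = X @ beta + rng.normal(scale=sigma, size=n)          # y = X beta + eps
    bs[r] = np.linalg.solve(X.T @ X, X.T @ y)               # b = (X'X)^{-1} X'y

print(bs.mean(axis=0))                    # approximately beta (unbiasedness)
print(np.cov(bs, rowvar=False))           # approximately sigma^2 (X'X)^{-1}
print(sigma**2 * np.linalg.inv(X.T @ X))  # theoretical covariance of b
```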
From (1), Xβ = y − ε, so that

\[ \mathbf{b} - \boldsymbol{\beta} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y} - (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'(\mathbf{y} - \boldsymbol{\varepsilon}) = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{\varepsilon}. \quad (7) \]

Then, using the expectation operator E{·}, the variance of b is given by

\[ V\{\mathbf{b}\} = E\{(\mathbf{b}-\boldsymbol{\beta})(\mathbf{b}-\boldsymbol{\beta})'\} = E\{[(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{\varepsilon}][(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{\varepsilon}]'\} \quad (8) \]

\[ = E\{(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{\varepsilon}\boldsymbol{\varepsilon}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'E\{\boldsymbol{\varepsilon}\boldsymbol{\varepsilon}'\}\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1} = \sigma^2(\mathbf{X}'\mathbf{X})^{-1}. \quad (9) \]

Any other linear unbiased estimate of β, say b* = [(X'X)⁻¹X' + a']y, would yield the expected value

\[ E\{\mathbf{b}^*\} = E\{[(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' + \mathbf{a}']\mathbf{y}\} = [(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' + \mathbf{a}']E\{\mathbf{y}\} = [(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' + \mathbf{a}']\mathbf{X}\boldsymbol{\beta} = \boldsymbol{\beta} + \mathbf{a}'\mathbf{X}\boldsymbol{\beta}. \quad (10) \]

However, since b* is unbiased, a'X = 0. Now, repeating the steps leading to (9), only this time substituting b* for b, the variance of b* is given by

\[ V\{\mathbf{b}^*\} = \sigma^2[(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' + \mathbf{a}'][(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' + \mathbf{a}']' = \sigma^2[(\mathbf{X}'\mathbf{X})^{-1} + \mathbf{a}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{a} + \mathbf{a}'\mathbf{a}] = \sigma^2[(\mathbf{X}'\mathbf{X})^{-1} + \mathbf{a}'\mathbf{a}], \quad (11) \]

which is clearly greater (by σ²a'a) than V{b}. Hence b is the minimum-variance linear unbiased estimate of β.

Estimating the variance, σ². In the population, the residual variance for a single observation is σ² = V{εᵢ}, so that E{ε'ε} = nσ². Expanding ε'ε = (y − Xβ)'(y − Xβ),

\[ n\sigma^2 = E\{(\mathbf{y}-\mathbf{X}\boldsymbol{\beta})'(\mathbf{y}-\mathbf{X}\boldsymbol{\beta})\} = E\{\mathbf{y}'\mathbf{y} - \mathbf{y}'\mathbf{X}\boldsymbol{\beta} - \boldsymbol{\beta}'\mathbf{X}'\mathbf{y} + \boldsymbol{\beta}'\mathbf{X}'\mathbf{X}\boldsymbol{\beta}\} \]

\[ = E\{\mathbf{y}'\mathbf{y} - (\mathbf{X}\boldsymbol{\beta}+\boldsymbol{\varepsilon})'\mathbf{X}\boldsymbol{\beta} - \boldsymbol{\beta}'\mathbf{X}'(\mathbf{X}\boldsymbol{\beta}+\boldsymbol{\varepsilon}) + \boldsymbol{\beta}'\mathbf{X}'\mathbf{X}\boldsymbol{\beta}\} = E\{\mathbf{y}'\mathbf{y}\} - \boldsymbol{\beta}'\mathbf{X}'\mathbf{X}\boldsymbol{\beta} - 2\boldsymbol{\beta}'\mathbf{X}'E\{\boldsymbol{\varepsilon}\} = E\{\mathbf{y}'\mathbf{y}\} - \boldsymbol{\beta}'\mathbf{X}'\mathbf{X}\boldsymbol{\beta}, \quad (12) \]

since, again under the assumptions, E{ε} = 0.

Typically, however, σ² is estimated from sample observations. Given an independent random sample of observations, (1) can be rewritten as y = Xb + e, to reflect that all terms are computed from a sample, with residual vector

\[ \mathbf{e} = \mathbf{y} - \mathbf{Xb}. \quad (13) \]

But it was shown earlier that b = (X'X)⁻¹X'y, so that

\[ \mathbf{e} = \mathbf{y} - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y} = [\mathbf{I} - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}']\mathbf{y} = [\mathbf{I} - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'](\mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}) = [\mathbf{I} - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}']\boldsymbol{\varepsilon}, \quad (14) \]

because [I − X(X'X)⁻¹X']Xβ = Xβ − Xβ = 0. The expected residual sum of squares is then

\[ E\{s^2\} = E\{\mathbf{e}'\mathbf{e}\} = E\{\boldsymbol{\varepsilon}'[\mathbf{I} - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}']'[\mathbf{I} - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}']\boldsymbol{\varepsilon}\} = E\{\boldsymbol{\varepsilon}'[\mathbf{I} - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}']\boldsymbol{\varepsilon}\}, \quad (15) \]

since [I − X(X'X)⁻¹X'] is symmetric and idempotent (X(X'X)⁻¹X' can easily be shown to be idempotent). Note that the quadratic form under the expectation operator in (15) is a scalar and is thus equal to its trace. Using the property that the trace of a product of matrices is invariant under cyclic permutations of the matrices, (15) can be rewritten as

\[ E\{s^2\} = E\{\mathrm{tr}\{[\mathbf{I} - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}']\boldsymbol{\varepsilon}\boldsymbol{\varepsilon}'\}\} = \mathrm{tr}\{\mathbf{I} - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\}\,\sigma^2 = \sigma^2[\mathrm{tr}\{\mathbf{I}_n\} - \mathrm{tr}\{(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X}\}] = \sigma^2(\mathrm{tr}\{\mathbf{I}_n\} - \mathrm{tr}\{\mathbf{I}_p\}) = \sigma^2(n - p). \quad (16) \]

Equation (16) demonstrates that the residual sum of squares, s², is not itself an unbiased estimate of the population variance: its expectation is inflated by the factor (n − p). Therefore, the unbiased estimate is given by σ̂² = s²/(n − p).

Testing General Linear Hypotheses

Typically, we are interested in testing one or more of three types of linear hypotheses: (i) a hypothesis that the parameters in β are equal to a specific vector of constants, H01: Aβ = d, where A = I_p; (ii) a hypothesis that some subset of the parameters is equal to a given vector of constants, H02: A(k × p)β = d, where k < p; and (iii) a hypothesis that some linear function (or functions) of the parameters in β is (are) equal to a given vector of constants, H03: A(k × p)β = d, where k ≤ p.
In (i), A is a p-order identity matrix; in (ii), A is a (k × p) matrix in which each row contains one element equal to "1" and the rest equal to "0"; in (iii), the rows of A (i.e., the aᵢ') are vectors of weights for obtaining linear combinations of the parameters in β. Actually, any general linear hypothesis may contain any combination of the hypothesis types (i), (ii), and (iii), subject to the constraint that the inverse (AA')⁻¹ exists (i.e., that A be of full row rank). All three types of hypotheses have the same general form, which can be expressed as H0: Aβ = β*, or, equivalently, as H0: Aβ − β* = 0.

Estimates of parameters under H0. Consider first case (ii), above, where A is a (k × p) matrix. Under this hypothesis, A constrains k of the parameters in β by setting them equal to specified constants, while leaving the remaining p − k parameters unconstrained (i.e., free to vary). The hypothesis can be written as

\[ H_0:\; \begin{pmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_k \\ \beta_{k+1} \\ \vdots \\ \beta_p \end{pmatrix} = \begin{pmatrix} \beta_1^* \\ \beta_2^* \\ \vdots \\ \beta_k^* \\ \gamma_{k+1} \\ \vdots \\ \gamma_p \end{pmatrix}, \quad (17) \]

where the βᵢ, i = 1, 2, …, k, are selected by the rows of A; the βᵢ*, i = 1, 2, …, k, are known constants; and the γ's are new parameters resulting from the constraints. To estimate the γ's, let

\[ \mathbf{A} = \begin{pmatrix} \mathbf{A}_1 \\ \mathbf{A}_2 \end{pmatrix}, \]

where A₁ is now the (k × p) matrix given in (ii) above and A₂ is ((p − k) × p), typically having the form A₂ = [0  I]. Under the null hypothesis, H02, the model, when X is partitioned in conformance with A, can be written as

\[ \mathbf{y} = [\mathbf{X}_1\; \mathbf{X}_2]\begin{pmatrix} \boldsymbol{\beta}^* \\ \boldsymbol{\gamma} \end{pmatrix} + \boldsymbol{\varepsilon} = \mathbf{X}_1\boldsymbol{\beta}^* + \mathbf{X}_2\boldsymbol{\gamma} + \boldsymbol{\varepsilon}. \quad (18) \]

Or, since the solutions are computed on sample data, y = X₁β* + X₂g + e. As before, the objective is to find a solution for g such that the scalar

\[ s^2 = \mathbf{e}'\mathbf{e} = (\mathbf{y} - \mathbf{X}_1\boldsymbol{\beta}^* - \mathbf{X}_2\mathbf{g})'(\mathbf{y} - \mathbf{X}_1\boldsymbol{\beta}^* - \mathbf{X}_2\mathbf{g}) \quad (19) \]

is a minimum. This is accomplished by differentiating (19) with respect to g and setting the result equal to zero:

\[ \partial s^2/\partial \mathbf{g} = -2\mathbf{X}_2'(\mathbf{y} - \mathbf{X}_1\boldsymbol{\beta}^* - \mathbf{X}_2\mathbf{g}) = -2\mathbf{X}_2'\mathbf{y} + 2\mathbf{X}_2'\mathbf{X}_1\boldsymbol{\beta}^* + 2\mathbf{X}_2'\mathbf{X}_2\mathbf{g} = \mathbf{0}. \quad (20) \]

Solving for g,

\[ (\mathbf{X}_2'\mathbf{X}_2)\mathbf{g} = \mathbf{X}_2'\mathbf{y} - \mathbf{X}_2'\mathbf{X}_1\boldsymbol{\beta}^* = \mathbf{X}_2'(\mathbf{y} - \mathbf{X}_1\boldsymbol{\beta}^*), \quad (21) \]

and

\[ \mathbf{g} = (\mathbf{X}_2'\mathbf{X}_2)^{-1}\mathbf{X}_2'(\mathbf{y} - \mathbf{X}_1\boldsymbol{\beta}^*). \quad (22) \]

Expectation and variance of g. The expected value of g is given by

\[ E\{\mathbf{g}\} = E\{(\mathbf{X}_2'\mathbf{X}_2)^{-1}\mathbf{X}_2'(\mathbf{y} - \mathbf{X}_1\boldsymbol{\beta}^*)\} = E\{(\mathbf{X}_2'\mathbf{X}_2)^{-1}\mathbf{X}_2'[(\mathbf{X}_1\boldsymbol{\beta}^* + \mathbf{X}_2\boldsymbol{\gamma} + \boldsymbol{\varepsilon}) - \mathbf{X}_1\boldsymbol{\beta}^*]\} = E\{(\mathbf{X}_2'\mathbf{X}_2)^{-1}\mathbf{X}_2'(\mathbf{X}_2\boldsymbol{\gamma} + \boldsymbol{\varepsilon})\} = \boldsymbol{\gamma} + (\mathbf{X}_2'\mathbf{X}_2)^{-1}\mathbf{X}_2'E\{\boldsymbol{\varepsilon}\} = \boldsymbol{\gamma}. \quad (23) \]

Hence, g provides an unbiased estimate of γ. The variance of g is given by

\[ V\{\mathbf{g}\} = E\{(\mathbf{g} - \boldsymbol{\gamma})(\mathbf{g} - \boldsymbol{\gamma})'\} = E\{[(\mathbf{X}_2'\mathbf{X}_2)^{-1}\mathbf{X}_2'\boldsymbol{\varepsilon}][(\mathbf{X}_2'\mathbf{X}_2)^{-1}\mathbf{X}_2'\boldsymbol{\varepsilon}]'\} = E\{(\mathbf{X}_2'\mathbf{X}_2)^{-1}\mathbf{X}_2'\boldsymbol{\varepsilon}\boldsymbol{\varepsilon}'\mathbf{X}_2(\mathbf{X}_2'\mathbf{X}_2)^{-1}\} = \sigma^2(\mathbf{X}_2'\mathbf{X}_2)^{-1}. \quad (24) \]

Residual variance under the null hypothesis. Under H0, y = X₁β* + X₂γ + e, where e = y − X₁β* − X₂γ. The variance of e can be shown to be

\[ V\{\mathbf{e}\} = \boldsymbol{\beta}^{*\prime}[\mathbf{X}_1'\mathbf{X}_1 - \mathbf{X}_1'\mathbf{X}_2(\mathbf{X}_2'\mathbf{X}_2)^{-1}\mathbf{X}_2'\mathbf{X}_1]\boldsymbol{\beta}^* + \sigma^2\mathbf{I}. \]

But [X₁'X₁ − X₁'X₂(X₂'X₂)⁻¹X₂'X₁] = [A₁(X'X)⁻¹A₁']⁻¹, so that

\[ V\{\mathbf{e}\} = \boldsymbol{\beta}^{*\prime}[\mathbf{A}_1(\mathbf{X}'\mathbf{X})^{-1}\mathbf{A}_1']^{-1}\boldsymbol{\beta}^* + \sigma^2\mathbf{I}, \]

which is greater than the variance given for the unrestricted model. Since the identity used here may not be obvious, it may be useful to show its derivation. Begin with the partitioned matrix X = [X₁ X₂] and compute

\[ \mathbf{X}'\mathbf{X} = \begin{pmatrix} \mathbf{X}_1'\mathbf{X}_1 & \mathbf{X}_1'\mathbf{X}_2 \\ \mathbf{X}_2'\mathbf{X}_1 & \mathbf{X}_2'\mathbf{X}_2 \end{pmatrix} = \begin{pmatrix} \mathbf{X}_{11} & \mathbf{X}_{12} \\ \mathbf{X}_{21} & \mathbf{X}_{22} \end{pmatrix}. \]

Then, by the standard result for the inverse of a partitioned matrix,

\[ (\mathbf{X}'\mathbf{X})^{-1} = \begin{pmatrix} \mathbf{X}_{11} & \mathbf{X}_{12} \\ \mathbf{X}_{21} & \mathbf{X}_{22} \end{pmatrix}^{-1} = \begin{pmatrix} (\mathbf{X}_{11} - \mathbf{X}_{12}\mathbf{X}_{22}^{-1}\mathbf{X}_{21})^{-1} & \cdot \\ \cdot & (\mathbf{X}_{22} - \mathbf{X}_{21}\mathbf{X}_{11}^{-1}\mathbf{X}_{12})^{-1} \end{pmatrix}, \]

where the off-diagonal blocks are not needed here. Now, if A is partitioned as above with A₁ = [I  0], then

\[ \mathbf{A}_1(\mathbf{X}'\mathbf{X})^{-1}\mathbf{A}_1' = (\mathbf{X}_{11} - \mathbf{X}_{12}\mathbf{X}_{22}^{-1}\mathbf{X}_{21})^{-1} = [\mathbf{X}_1'\mathbf{X}_1 - \mathbf{X}_1'\mathbf{X}_2(\mathbf{X}_2'\mathbf{X}_2)^{-1}\mathbf{X}_2'\mathbf{X}_1]^{-1}, \]

so that [A₁(X'X)⁻¹A₁']⁻¹ = X₁'X₁ − X₁'X₂(X₂'X₂)⁻¹X₂'X₁, as claimed. (Note that X₂(X₂'X₂)⁻¹X₂', like X(X'X)⁻¹X' earlier, is idempotent.)
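The machinery above is easily exercised numerically. The standard F statistic for H0: Aβ = d (a well-known result, though not derived in this manuscript) is F = (Ab − d)'[A(X'X)⁻¹A']⁻¹(Ab − d) / (k σ̂²), on k and n − p degrees of freedom. The sketch below (mine; Python with numpy/scipy, all names arbitrary) applies it to a three-group design of the kind developed in the next section:

```python
import numpy as np
from scipy import stats

def glh_test(y, X, A, d):
    """F test of the general linear hypothesis H0: A beta = d (A of full row rank)."""
    n, p = X.shape
    k = A.shape[0]
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y                      # b = (X'X)^{-1} X'y
    e = y - X @ b
    sigma2_hat = (e @ e) / (n - p)             # unbiased estimate s^2 / (n - p)
    r = A @ b - d
    F = (r @ np.linalg.inv(A @ XtX_inv @ A.T) @ r) / (k * sigma2_hat)
    return F, stats.f.sf(F, k, n - p)          # statistic and p value

# Three-group design using a "super identity" design matrix (see the next section)
rng = np.random.default_rng(2)
groups = [rng.normal(m, 1.0, size=10) for m in (5.0, 5.5, 7.0)]
y = np.concatenate(groups)
X = np.kron(np.eye(3), np.ones((10, 1)))       # one block of 1s per group
A = np.array([[1.0, 0.0, -1.0],
              [0.0, 1.0, -1.0]])               # mu1 - mu3 = 0, mu2 - mu3 = 0
print(glh_test(y, X, A, np.zeros(2)))
```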
Typical Applications

Having described the basic equations of the GLH, we now turn to an algebraic exposition of some of the more typical models found in educational research. Examples of typical applications using real data can be found in many of the references cited earlier, as well as in Olson (1971). In the discussion which follows, we begin with the model for the analysis of variance, move to a brief discussion of the analysis of covariance, and finally present a discussion of the analysis of repeated measures designs.

Choice of Design Matrices

Dummy variable coding.

Effect coding.

Using a super identity design matrix. While there exists a mathematical equivalence between hypothesis tests performed using the GLH and the more traditional MLR approach, the GLH provides a more elegant, straightforward approach to hypothesis testing. This is especially so in the cases illustrated here. If we let

\[ \mathbf{X} = \begin{pmatrix} \mathbf{1}_{n_1} & \mathbf{0} & \cdots & \mathbf{0} \\ \mathbf{0} & \mathbf{1}_{n_2} & \cdots & \mathbf{0} \\ \vdots & \vdots & & \vdots \\ \mathbf{0} & \mathbf{0} & \cdots & \mathbf{1}_{n_p} \end{pmatrix}, \]

what I prefer to call a "super identity matrix," then the usual solution to the normal equations yields b = (X'X)⁻¹X'y = (ȳ₁, ȳ₂, …, ȳ_p)', the vector of group means. The matrix A can then be used in a straightforward fashion to test hypotheses involving linear combinations of group means.

As a simple example, suppose we have a single-factor, three-group design. In this situation, the terms in (1) would be specified as

\[ \mathbf{y} = \begin{pmatrix} \mathbf{y}_1 \\ \mathbf{y}_2 \\ \mathbf{y}_3 \end{pmatrix},\quad \mathbf{X} = \begin{pmatrix} \mathbf{1}_{n_1} & \mathbf{0} & \mathbf{0} \\ \mathbf{0} & \mathbf{1}_{n_2} & \mathbf{0} \\ \mathbf{0} & \mathbf{0} & \mathbf{1}_{n_3} \end{pmatrix},\quad \boldsymbol{\beta} = \begin{pmatrix} \mu_1 \\ \mu_2 \\ \mu_3 \end{pmatrix},\quad \text{and}\quad \boldsymbol{\varepsilon} = \begin{pmatrix} \boldsymbol{\varepsilon}_1 \\ \boldsymbol{\varepsilon}_2 \\ \boldsymbol{\varepsilon}_3 \end{pmatrix}, \]

where, in general, yᵢ is an (nᵢ × 1) column vector of observations for individuals in group i, 1_{nᵢ} is an (nᵢ × 1) column vector of 1s corresponding to the vector yᵢ for group i, μᵢ is the mean of the observations in group i, and εᵢ is an (nᵢ × 1) column vector of random disturbances. Letting A and d in (2) be

\[ \mathbf{A} = \begin{pmatrix} 1 & 0 & -1 \\ 0 & 1 & -1 \end{pmatrix} \quad \text{and} \quad \mathbf{d} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \]

respectively, sets up the following test of the null hypothesis for the "group effect":

\[ H_0:\; \begin{pmatrix} \mu_1 - \mu_3 \\ \mu_2 - \mu_3 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}. \]

Appendices

Summation Operators

The summation operator (∑) {the Greek capital letter sigma} is an instruction to sum over a series of values. For instance, if we have the set of values for the variable X = {X₁, X₂, X₃, X₄, X₅}, then

\[ \sum_{i=1}^{n=5} X_i = X_1 + X_2 + X_3 + X_4 + X_5. \]

Literally, the expression $\sum_{i=1}^{n=5} X_i$ says: beginning with i = 1 and ending with i = 5, sum over the variables Xᵢ. As an example, let X₁ = 8, X₂ = 10, X₃ = 11, X₄ = 15, X₅ = 16. Then n = 5 {the number of cases}, and

\[ \sum_{i=1}^{n=5} X_i = 8 + 10 + 11 + 15 + 16 = 60. \]

In many contexts, it is clear that the summation is over all cases and we do not need the superscript over the summation operator. Furthermore, in most contexts it is assumed that the summation begins with i = 1. Hence, the notation $\sum_i X_i$ is taken to imply $\sum_{i=1}^{n=5} X_i$. In most situations, where the variable has only one subscript, as in Xᵢ, the subscript can be omitted. In these situations, $\sum X$ implies $\sum_{i=1}^{n} X_i$.

In other contexts, the variable X may have more than one subscript, e.g., X_{ij}. This occurs, for instance, when individuals belong to two or more subgroupings or cross-classifications. We might have a situation as shown below in Table 1.

Table 1: Three groups.
Group 1: X11, X21, X31, X41
Group 2: X12, X22, X32, X42, X52, X62
Group 3: X13, X23, X33, X43, X53

Here we have three groups, each with a different number of cases. We denote the ith case in the jth group with the symbol X_{ij}. To sum all the cases, over all three groups, we would use the following double summation operator,

\[ \sum_{j=1}^{3} \sum_{i=1}^{n_j} X_{ij}, \]

which instructs us to sum over the three groups (j = 1, 2, and 3) and, within each group, sum over the number of cases in the group (i = 1, 2, 3, 4 for Group 1; i = 1, 2, 3, 4, 5, 6 for Group 2; i = 1, 2, 3, 4, 5 for Group 3).
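For readers who think in code, the double summation over ragged groups is simply a nested loop (or a nested comprehension). A minimal Python sketch of my own, using the values introduced in Table 2 below:

```python
# Groups of unequal size, as in Tables 1 and 2 (values taken from Table 2, below)
groups = [[10, 8, 12, 13],          # Group 1 (n1 = 4)
          [6, 11, 8, 10, 8, 12],    # Group 2 (n2 = 6)
          [14, 6, 6, 10, 9]]        # Group 3 (n3 = 5)

# sum_j sum_i X_ij : outer sum over groups j, inner sum over cases i within group j
total = sum(x for group in groups for x in group)
print([sum(g) for g in groups], total)   # [43, 55, 45] and 143
```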
For simplicity, we often write the summation expression as $\sum\sum X_{ij}$, where it is assumed that we are to sum over all groups and all cases within each group. For a more concrete example, let's substitute the following numbers for the symbolic values given above.

Table 2: Three groups with real numbers.
Group 1: 10, 8, 12, 13
Group 2: 6, 11, 8, 10, 8, 12
Group 3: 14, 6, 6, 10, 9

Then,

\[ \sum\sum X_{ij} = [10+8+12+13] + [6+11+8+10+8+12] + [14+6+6+10+9] = 43 + 55 + 45 = 143. \]

A more complex situation occurs when cases are grouped into cross-classifications. Table 3 represents a situation where cases are cross-classified by sex and age category.

Table 3: Cross-classification of sex and age.
                            Age Category
Sex       Child                     Adolescent                Adult
Male      X111, X211, X311, X411    X112, X212, X312, X412    X113, X213, X313, X413
Female    X121, X221, X321, X421    X122, X222, X322, X422    X123, X223, X323, X423

In the table, the notation X_{ijk} denotes the ith individual in the jth row (Sex category) and kth column (Age category). Hence, X₄₂₃ is the last case in the Adult-Female cell. To indicate summation over all the cases in the table, we would use the notation

\[ \sum_k \sum_j \sum_i X_{ijk}, \]

where it is assumed that the summation is over all cases, i; all J rows, j; and all K columns, k.

Dot Notation

It is often easier to denote aggregates using dot notation. In an expression such as

\[ X_{\cdot} = \sum_i X_i, \]

the dot (·) represents aggregation, or summation, over the missing (dotted) subscript. For instance, in the example given earlier, where we had $\sum_{i=1}^{5} X_i = 60$, we could simply write X. = 60 {read "X-dot equals 60"}.

Using dot notation we would represent the cell aggregates in the Sex by Age Category table above as shown in Table 4.

Table 4: Cross-classification using dot notation.
                      Age Category
Sex       Child    Adolescent    Adult    Total
Male      X·11     X·12          X·13     X·1·
Female    X·21     X·22          X·23     X·2·
Total     X··1     X··2          X··3     X···

where, for the Male-Adolescent cell, X·12 = $\sum_i X_{i12}$. To denote the sum of the values for all the males, we write X·1·; for all females, X·2·. Similarly, the sum of the values for all children is X··1; all adolescents, X··2; and all adults, X··3. The sum of the values for all the cases is X···. Note that

\[ X_{\cdot 1 \cdot} = \sum_k \sum_i X_{i1k}, \qquad X_{\cdot\cdot 1} = \sum_j \sum_i X_{ij1}, \qquad \text{and} \qquad X_{\cdot\cdot\cdot} = \sum_k \sum_j \sum_i X_{ijk}. \]

Rules of Summation

Summation of a constant. Let c be some constant value. Then

\[ \sum_{i=1}^{N} c = Nc. \]

In other words, summing a constant N times is the same as multiplying the constant by N. Hence, if c = 5, then

\[ \sum_{i=1}^{12} c = 12 \times 5 = 60. \]

This rule can be extended to double summation. Thus,

\[ \sum_j^J \sum_i^{n_j} c = \sum_j^J n_j c. \]

As an example, consider the situation involving the three groups given earlier in Table 2. If all cases, in all groups, have the constant value 10, then

\[ \sum_j^3 \sum_i^{n_j} 10 = [10+10+10+10] + [10+10+10+10+10+10] + [10+10+10+10+10] = (4 \times 10) + (6 \times 10) + (5 \times 10) = 15 \times 10 = 150. \]

Multiplication by a constant. If all the values of a variable X are multiplied by the same constant c, then

\[ \sum_i^N cX_i = c \sum_i^N X_i. \]

For example, let the set of 5 values of the variable X be {3, 9, 5, 7, 10}, and assume that each is multiplied by the constant 2. Then

\[ \sum_i 2X_i = 2(3) + 2(9) + 2(5) + 2(7) + 2(10) = 2(3+9+5+7+10) = 2(34) = 68 = 2\sum_i X_i = cX_{\cdot}. \]

Again, this can be expanded to multiple summations:

\[ \sum_j \sum_i cX_{ij} = c\sum_j \sum_i X_{ij} = cX_{\cdot\cdot}, \qquad \sum_k \sum_j \sum_i cX_{ijk} = c\sum_k \sum_j \sum_i X_{ijk} = cX_{\cdot\cdot\cdot}, \]

and so on. As an example, again consider the three-group situation given earlier in Table 2.
If all cases were multiplied by 5, then

\[ \sum_j^3 \sum_i^{n_j} 5X_{ij} = [(5\times10) + (5\times8) + (5\times12) + (5\times13)] + [(5\times6) + (5\times11) + (5\times8) + (5\times10) + (5\times8) + (5\times12)] + [(5\times14) + (5\times6) + (5\times6) + (5\times10) + (5\times9)] = 5(43+55+45) = 5\sum_j^3 \sum_i^{n_j} X_{ij} = 5(143) = 715. \]

In some situations, the values in different groups are multiplied by different constants. For instance, suppose the values in Group 1 (in the example just given) were multiplied by the constant c₁, the values in Group 2 by c₂, and the values in Group 3 by c₃. Then the sum of all the cases would be given by

\[ \sum_j c_j \sum_i X_{ij}. \]

In this situation, it is necessary that the constants remain within the outer summation operator. Now, suppose that in the Sex by Age Category example given earlier, we have the values shown below in Table 5.

Table 5: Cross-classification with values.
                      Age Category
Sex       Child         Adolescent    Adult
Male      2, 3, 5, 4    5, 1, 1, 3    2, 2, 3, 0
Female    1, 3, 5, 2    2, 0, 3, 1    5, 3, 4, 3

Suppose all the male values are multiplied by the constant 5 and all the female values are multiplied by the constant 10. Then, letting c₁ = 5 and c₂ = 10, the sum over all cases, rows, and columns is given by

\[ \sum_k \sum_j \left( c_j \sum_i X_{ijk} \right) = \sum_j c_j \left( \sum_k \sum_i X_{ijk} \right) = 5(14+10+7) + 10(11+6+15) = 5(31) + 10(32) = 155 + 320 = 475. \]

Note that the right-hand side of the summation notation above could have been written as $\sum_j c_j \sum_k \sum_i X_{ijk}$, without the parentheses. Using dot notation, the above expression can be written as $\sum_j c_j \sum_k X_{\cdot jk}$.

Order of operations. The order of operations, when using summation operators, is the same as that for arithmetic: operations applied to the values inside a summation are indicated by mathematical punctuation and are to be carried out prior to summation. Some examples should suffice:

\[ \sum_i X_i^2 = X_1^2 + X_2^2 + X_3^2 + \cdots + X_N^2, \qquad \sum_i \sqrt{X_i} = \sqrt{X_1} + \sqrt{X_2} + \cdots + \sqrt{X_N}, \qquad \sum_i \log X_i = \log X_1 + \log X_2 + \cdots + \log X_N. \]

On the other hand,

\[ \left(\sum_i X_i\right)^2 = (X_1 + X_2 + \cdots + X_N)^2, \qquad \sqrt{\sum_i X_i} = \sqrt{X_1 + X_2 + \cdots + X_N}, \qquad \log \sum_i X_i = \log(X_1 + X_2 + \cdots + X_N). \]

Note that these rules apply even when we have multiple summations. For example, letting capital J represent the number of groups,

\[ \sum_{j=1}^{J} \left(\sum_i X_{ij}\right)^2 = \sum_{j=1}^{J} (X_{1j} + X_{2j} + \cdots + X_{n_j j})^2. \]

In words: within each group, j, sum the values over cases, i, then square the sum; after squaring the sums within all of the J groups, sum the squared sums over groups. For example, consider the data given in Table 2, earlier:

\[ \sum_{j=1}^{J} \left(\sum_i X_{ij}\right)^2 = 43^2 + 55^2 + 45^2 = 6899. \]

Distributive rule of summation. The summation operator is distributive when the value being operated upon is itself a sum (or difference). For example,

\[ \sum_i (X_i + Y_i) = \sum_i X_i + \sum_i Y_i, \]

and

\[ \sum_j^J \sum_i (X_{ij} - \bar{X})^2 = \sum_j^J \sum_i (X_{ij}^2 - 2\bar{X}X_{ij} + \bar{X}^2) = \sum_j^J \sum_i X_{ij}^2 - 2\bar{X}\sum_j^J \sum_i X_{ij} + \sum_j^J \sum_i \bar{X}^2 = \sum_j^J \sum_i X_{ij}^2 - 2\bar{X}\sum_j^J \sum_i X_{ij} + \bar{X}^2 \sum_j^J n_j. \]

Note that the last step in this equation follows from the rule pertaining to summation over a constant, given earlier:

\[ \sum_j^J \sum_i \bar{X}^2 = n_1\bar{X}^2 + n_2\bar{X}^2 + \cdots + n_J\bar{X}^2 = \bar{X}^2 \sum_j^J n_j. \]

Summation involving two or more variables. If each case has values on two variables, Xᵢ and Yᵢ, say, then

\[ \sum_i^N X_i Y_i \neq \left(\sum_i X_i\right)\left(\sum_i Y_i\right). \]

Instead,

\[ \sum_i X_i Y_i = X_1Y_1 + X_2Y_2 + \cdots + X_NY_N. \]

These rules can be extended, quite logically, to situations involving more than two variables.

Expectation Operators

This appendix has yet to be developed.

Matrix Algebra

<Need an introduction>

Dimensions and order of a matrix. A p by q dimensioned matrix is a p (rows) by q (columns) array of numbers, or symbols, and will itself be represented by an upper-case, bold, italic symbol, e.g., A, β.
For instance, a 2 by 3 matrix, A, can be represented symbolically as the array

\[ \mathbf{A} = \begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \end{pmatrix}, \]

or as an actual array of numbers,

\[ \mathbf{A} = \begin{pmatrix} 3 & 5 & 6 \\ 2 & 4 & 3 \end{pmatrix}. \]

In either case, the order of A is said to be 2 × 3. On the other hand, the matrix

\[ \mathbf{D} = \begin{pmatrix} d_{11} & d_{12} & d_{13} \\ d_{21} & d_{22} & d_{23} \\ d_{31} & d_{32} & d_{33} \\ d_{41} & d_{42} & d_{43} \end{pmatrix} \]

is a 4 × 3 matrix. That is, its order is 4 × 3.

A matrix having only one column or one row is called a vector and is represented by a lower-case, bold, italic symbol, e.g., a, β. Hence the 1 × 4 and 3 × 1 matrices

\[ \mathbf{a} = (a_1\;\; a_2\;\; a_3\;\; a_4) \quad \text{and} \quad \boldsymbol{\beta} = \begin{pmatrix} \beta_1 \\ \beta_2 \\ \beta_3 \end{pmatrix} \]

are both vectors; a is a row vector, and β is a column vector.

The numbers (or symbols) inside a matrix are called elements. Thus, in the first matrix, A, given above, a₁₁ and a₂₃ are elements of the matrix A. Similarly, in the second matrix, A, above, the numbers 4 and 6 are elements. Elements in a matrix are indexed (or referred to) by subscripts that give their row and column locations. For instance, in the first matrix, A, above, a₂₃ is the element in the second row of the third column. Similarly, in the matrix, D, above, element d₄₁ is found in the fourth row, first column. In general, a_{ij} is the i,jth element of A. When referring to elements of vectors, however, only a single subscript is needed, since the first row (or column) is assumed. Hence, in the vector a, above, a₂ is its second element; in β, β₃ is the third element. In general, the jth element is indicated for row (i.e., 1 × q) vectors and the ith element for column (i.e., p × 1) vectors.

A matrix in which the number of rows (p) equals the number of columns (q) is called a square matrix; otherwise, providing it is not a vector, it is called a rectangular matrix. The natural order of a rectangular matrix has more rows than columns (i.e., p > q). Hence,

\[ \mathbf{A} = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \\ a_{31} & a_{32} \end{pmatrix}, \]

a 3 × 2 matrix, is presented in its natural order.

The transpose of a matrix is obtained by interchanging its rows and columns. Thus, the transpose of the matrix A just given, represented as A', is

\[ \mathbf{A}' = \begin{pmatrix} a_{11} & a_{21} & a_{31} \\ a_{12} & a_{22} & a_{32} \end{pmatrix}, \]

a 2 × 3 matrix. As a concrete example, consider

\[ \mathbf{X} = \begin{pmatrix} 2 & 4 & 3 \\ 5 & 7 & 4 \\ 1 & 6 & 7 \\ 2 & 2 & 6 \end{pmatrix}, \]

which, when transposed, becomes

\[ \mathbf{X}' = \begin{pmatrix} 2 & 5 & 1 & 2 \\ 4 & 7 & 6 & 2 \\ 3 & 4 & 7 & 6 \end{pmatrix}. \]

The natural order of a vector is a p × 1 column vector. Here, the 4 × 1 column vector

\[ \mathbf{a} = \begin{pmatrix} a_1 \\ a_2 \\ a_3 \\ a_4 \end{pmatrix} \]

is presented in its natural order. The transpose of a,

\[ \mathbf{a}' = (a_1\;\; a_2\;\; a_3\;\; a_4), \]

is a 1 × 4 row vector.

As mentioned earlier, a matrix where p = q is a square matrix. At least two types of square matrices are particularly important: symmetric matrices and diagonal matrices. A symmetric matrix is a square matrix in which the elements above the main diagonal are mirror images of the elements below the main diagonal (the main diagonal is the set of elements running from the upper left-hand corner of a square matrix to the lower right-hand corner). In the symmetric matrix V, given below, the elements in bold represent the main diagonal. Note that the elements above the main diagonal are mirror images of the elements below the main diagonal:

\[ \mathbf{V} = \begin{pmatrix} \mathbf{56} & 52 & 38 \\ 52 & \mathbf{64} & 61 \\ 38 & 61 & \mathbf{59} \end{pmatrix}. \]

Diagonal matrices are square matrices in which all elements except those on the main diagonal are zero (0). For instance,

\[ \mathbf{D} = \begin{pmatrix} 37 & 0 & 0 & 0 \\ 0 & 52 & 0 & 0 \\ 0 & 0 & 49 & 0 \\ 0 & 0 & 0 & 61 \end{pmatrix} \]

is a (symmetric) diagonal matrix. An important diagonal matrix is the identity matrix, a diagonal matrix in which all the elements along the main diagonal are unity (1). For example,

\[ \mathbf{I} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix} \]

is a 3 × 3 identity matrix. Symmetric matrices have important properties. For instance, a symmetric matrix is equal to its transpose.
That is, for the matrix just given, V = V'.

Matrix Operations

Addition. The addition of two matrices, A and B, is accomplished by adding corresponding elements, i.e., c_{ij} = a_{ij} + b_{ij}; hence,

\[ \mathbf{C} = \mathbf{A} + \mathbf{B} = \begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \\ a_{41} & a_{42} & a_{43} \end{pmatrix} + \begin{pmatrix} b_{11} & b_{12} & b_{13} \\ b_{21} & b_{22} & b_{23} \\ b_{31} & b_{32} & b_{33} \\ b_{41} & b_{42} & b_{43} \end{pmatrix} = \begin{pmatrix} a_{11}+b_{11} & a_{12}+b_{12} & a_{13}+b_{13} \\ a_{21}+b_{21} & a_{22}+b_{22} & a_{23}+b_{23} \\ a_{31}+b_{31} & a_{32}+b_{32} & a_{33}+b_{33} \\ a_{41}+b_{41} & a_{42}+b_{42} & a_{43}+b_{43} \end{pmatrix} = \begin{pmatrix} c_{11} & c_{12} & c_{13} \\ c_{21} & c_{22} & c_{23} \\ c_{31} & c_{32} & c_{33} \\ c_{41} & c_{42} & c_{43} \end{pmatrix}. \]

It is obvious (or should be obvious) that the orders of the two matrices being added need to be identical. Here, for instance, both matrices are of order 4 × 3. When the orders of the two matrices are not identical, addition is impossible. Addition of matrices is commutative; hence A + B = B + A. As a concrete example, let

\[ \mathbf{A} = \begin{pmatrix} 7 & 3 & 5 \\ 4 & 1 & 8 \\ 5 & 8 & 3 \\ 1 & 1 & 4 \end{pmatrix} \quad \text{and} \quad \mathbf{B} = \begin{pmatrix} 2 & 1 & 1 \\ 4 & 3 & 2 \\ 0 & 3 & 1 \\ 3 & 3 & 3 \end{pmatrix}. \]

Then

\[ \mathbf{C} = \mathbf{A} + \mathbf{B} = \mathbf{B} + \mathbf{A} = \begin{pmatrix} 7+2 & 3+1 & 5+1 \\ 4+4 & 1+3 & 8+2 \\ 5+0 & 8+3 & 3+1 \\ 1+3 & 1+3 & 4+3 \end{pmatrix} = \begin{pmatrix} 9 & 4 & 6 \\ 8 & 4 & 10 \\ 5 & 11 & 4 \\ 4 & 4 & 7 \end{pmatrix}. \]

Multiplication. Multiplication is somewhat more complicated. When multiplying matrices we need to distinguish among multiplication by a scalar, vector multiplication, pre-multiplication and post-multiplication, and inner products and outer products. But first, let us define a scalar. A scalar is simply a single number, such as 2, 17.6, or -300.335. Symbolically, we typically use the symbol c to represent a scalar.

Scalar multiplication. Scalar multiplication is easily shown by example. Given the 3 × 2 matrix X and the scalar c, the product cX is given by

\[ c\mathbf{X} = c\begin{pmatrix} x_{11} & x_{12} \\ x_{21} & x_{22} \\ x_{31} & x_{32} \end{pmatrix} = \begin{pmatrix} cx_{11} & cx_{12} \\ cx_{21} & cx_{22} \\ cx_{31} & cx_{32} \end{pmatrix} = \begin{pmatrix} x_{11}c & x_{12}c \\ x_{21}c & x_{22}c \\ x_{31}c & x_{32}c \end{pmatrix} = \mathbf{X}c. \]

As a concrete example, let

\[ \mathbf{X} = \begin{pmatrix} 4 & 7 \\ 3 & 0 \\ 1 & 2 \end{pmatrix} \quad \text{and} \quad c = 5. \]

Then

\[ c\mathbf{X} = 5\begin{pmatrix} 4 & 7 \\ 3 & 0 \\ 1 & 2 \end{pmatrix} = \begin{pmatrix} 20 & 35 \\ 15 & 0 \\ 5 & 10 \end{pmatrix}. \]

In the expression above, the product cX results from pre-multiplying X by the scalar c, whereas the product Xc results from post-multiplying X by the scalar c. Since scalar multiplication is commutative, cX will always be equal to Xc. This is not the case with matrix multiplication, however. Only in certain special cases will the products formed by pre-multiplying and post-multiplying two matrices, X and Y, be equal. In general, XY ≠ YX.

Matrix multiplication. First, note that given two matrices, e.g., A of order n × m and B of order p × q, multiplication is only possible when the number of columns in the pre-multiplier is equal to the number of rows in the post-multiplier. Hence the product AB is only possible when m = p. Similarly, the product BA is only possible when q = n. When m = p, the product C = AB will have order n × q. Similarly, when q = n, the product C = BA will have order p × m. We can illustrate this by defining the two matrices

\[ \mathbf{X} = \begin{pmatrix} x_{11} & x_{12} & x_{13} \\ x_{21} & x_{22} & x_{23} \\ x_{31} & x_{32} & x_{33} \\ x_{41} & x_{42} & x_{43} \end{pmatrix} \quad \text{and} \quad \mathbf{Y}' = \begin{pmatrix} y_{11} & y_{12} & y_{13} \\ y_{21} & y_{22} & y_{23} \end{pmatrix}. \]

Note that X has order 4 × 3 and Y' has order 2 × 3 (so that Y itself has order 3 × 2). Here, only the products XY (not XY') and Y'X' (not YX) are possible. Post-multiplying X by Y, we have

\[ \mathbf{XY} = \begin{pmatrix} x_{11}y_{11} + x_{12}y_{12} + x_{13}y_{13} & x_{11}y_{21} + x_{12}y_{22} + x_{13}y_{23} \\ x_{21}y_{11} + x_{22}y_{12} + x_{23}y_{13} & x_{21}y_{21} + x_{22}y_{22} + x_{23}y_{23} \\ x_{31}y_{11} + x_{32}y_{12} + x_{33}y_{13} & x_{31}y_{21} + x_{32}y_{22} + x_{33}y_{23} \\ x_{41}y_{11} + x_{42}y_{12} + x_{43}y_{13} & x_{41}y_{21} + x_{42}y_{22} + x_{43}y_{23} \end{pmatrix}, \]

where the product is of order 4 × 2. As a more concrete example, let

\[ \mathbf{X} = \begin{pmatrix} 2 & 0 & 1 \\ 1 & 1 & 0 \\ 2 & 3 & 3 \\ 1 & 1 & 2 \end{pmatrix} \quad \text{and} \quad \mathbf{Y}' = \begin{pmatrix} 2 & 1 & 1 \\ 3 & 0 & 2 \end{pmatrix}. \]

Then

\[ \mathbf{XY} = \begin{pmatrix} 4+0+1 & 6+0+2 \\ 2+1+0 & 3+0+0 \\ 4+3+3 & 6+0+6 \\ 2+1+2 & 3+0+4 \end{pmatrix} = \begin{pmatrix} 5 & 8 \\ 3 & 3 \\ 10 & 12 \\ 5 & 7 \end{pmatrix}, \]

a 4 × 2 matrix.
The only other product possible between these two matrices is Y'X':

\[ \mathbf{Y}'\mathbf{X}' = \begin{pmatrix} y_{11}x_{11} + y_{12}x_{12} + y_{13}x_{13} & y_{11}x_{21} + y_{12}x_{22} + y_{13}x_{23} & y_{11}x_{31} + y_{12}x_{32} + y_{13}x_{33} & y_{11}x_{41} + y_{12}x_{42} + y_{13}x_{43} \\ y_{21}x_{11} + y_{22}x_{12} + y_{23}x_{13} & y_{21}x_{21} + y_{22}x_{22} + y_{23}x_{23} & y_{21}x_{31} + y_{22}x_{32} + y_{23}x_{33} & y_{21}x_{41} + y_{22}x_{42} + y_{23}x_{43} \end{pmatrix}, \]

which is a 2 × 4 matrix. Indeed, Y'X' = (XY)', so, for the concrete example just given,

\[ \mathbf{Y}'\mathbf{X}' = \begin{pmatrix} 5 & 3 & 10 & 5 \\ 8 & 3 & 12 & 7 \end{pmatrix}. \]

When A and B are both square matrices of the same order, then all of the products AB, BA, A'B, AB', B'A, BA', A'B', and B'A' are possible. Furthermore, except in certain special cases, the matrices formed by these various products will be different from each other. As an exercise, try computing all possible products of the following two square matrices:

\[ \mathbf{A} = \begin{pmatrix} 2 & 2 & 1 \\ 2 & 3 & 1 \\ 2 & 1 & 1 \end{pmatrix} \quad \text{and} \quad \mathbf{B} = \begin{pmatrix} 1 & 2 & 3 \\ 3 & 1 & 0 \\ 0 & 2 & 1 \end{pmatrix}. \]

Vector multiplication. Having learned how to compute matrix products, vector multiplication is easy; it follows the same rules as matrix multiplication except that here one of the matrices is a vector. For instance, given the 4 × 1 column vector

\[ \mathbf{n} = \begin{pmatrix} n_1 \\ n_2 \\ n_3 \\ n_4 \end{pmatrix} \quad \text{and the 4 × 2 matrix} \quad \mathbf{Y} = \begin{pmatrix} y_{11} & y_{12} \\ y_{21} & y_{22} \\ y_{31} & y_{32} \\ y_{41} & y_{42} \end{pmatrix}, \]

the product n'Y is the only possible product. Hence,

\[ \mathbf{n}'\mathbf{Y} = (n_1\;\; n_2\;\; n_3\;\; n_4)\begin{pmatrix} y_{11} & y_{12} \\ y_{21} & y_{22} \\ y_{31} & y_{32} \\ y_{41} & y_{42} \end{pmatrix} = \left( \sum_i n_i y_{i1} \quad \sum_i n_i y_{i2} \right). \]

To put this in concrete terms, let

\[ \mathbf{n} = \begin{pmatrix} 3 \\ 1 \\ 2 \\ 5 \end{pmatrix} \quad \text{and} \quad \mathbf{Y} = \begin{pmatrix} 3 & 2 \\ 2 & 4 \\ 1 & 3 \\ 2 & 4 \end{pmatrix}. \]

Then

\[ \mathbf{n}'\mathbf{Y} = (9 + 2 + 2 + 10 \qquad 6 + 4 + 6 + 20) = (23 \quad\; 36). \]

Some special cases of vector and matrix multiplication are particularly important. For instance, let the vector 1 be defined as a vector with all elements equal to 1. If

\[ \mathbf{1}' = (1\;\; 1\;\; 1\;\; 1) \quad \text{and} \quad \mathbf{Y} = [\mathbf{y}_1\;\; \mathbf{y}_2] = \begin{pmatrix} 4 & 1 \\ 3 & 2 \\ 4 & 2 \\ 3 & 3 \end{pmatrix} \]

{note: y₁ and y₂ are column vectors}, then

\[ \mathbf{1}'\mathbf{Y} = (\mathbf{1}'\mathbf{y}_1 \quad \mathbf{1}'\mathbf{y}_2) = (14 \quad 8). \]

If we let

\[ \mathbf{X} = [\mathbf{x}_1\;\; \mathbf{x}_2] = \begin{pmatrix} 1 & 2 \\ 1 & 1 \\ 1 & 2 \\ 1 & 2 \\ 1 & 3 \end{pmatrix} \quad \text{and} \quad \mathbf{Y} = [\mathbf{y}_1\;\; \mathbf{y}_2] = \begin{pmatrix} 2 & 4 \\ 2 & 3 \\ 2 & 4 \\ 1 & 3 \\ 2 & 2 \end{pmatrix}, \]

then

\[ \mathbf{X}'\mathbf{Y} = \begin{pmatrix} \mathbf{x}_1'\mathbf{y}_1 & \mathbf{x}_1'\mathbf{y}_2 \\ \mathbf{x}_2'\mathbf{y}_1 & \mathbf{x}_2'\mathbf{y}_2 \end{pmatrix} = \begin{pmatrix} 9 & 16 \\ 18 & 31 \end{pmatrix}. \]

Note also that

\[ \mathbf{X}'\mathbf{X} = \begin{pmatrix} \mathbf{x}_1'\mathbf{x}_1 & \mathbf{x}_1'\mathbf{x}_2 \\ \mathbf{x}_2'\mathbf{x}_1 & \mathbf{x}_2'\mathbf{x}_2 \end{pmatrix} = \begin{pmatrix} n & \sum x_2 \\ \sum x_2 & \sum x_2^2 \end{pmatrix} = \begin{pmatrix} 5 & 10 \\ 10 & 22 \end{pmatrix}. \]

Inner products and outer products. Given the two matrices

\[ \mathbf{A} = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \\ a_{31} & a_{32} \end{pmatrix} \quad \text{and} \quad \mathbf{B} = \begin{pmatrix} b_{11} & b_{12} & b_{13} \\ b_{21} & b_{22} & b_{23} \end{pmatrix}, \]

verify that the two products AB and BA are both possible (of course, A'B' and B'A' also are possible). The orders of the two products are not identical: AB has order 3 × 3, while BA has order 2 × 2. The product AB is an outer product, while the product BA is an inner product. In general, whenever both an inner product and an outer product exist for two matrices, the order of the inner product will be less than that of the outer product. Furthermore, in general, the inner product will not be identical to the outer product. For example, let

\[ \mathbf{A} = \begin{pmatrix} 2 & 1 \\ 3 & 2 \\ 3 & 1 \end{pmatrix} \quad \text{and} \quad \mathbf{B} = \begin{pmatrix} 1 & 2 & 4 \\ 3 & 2 & 1 \end{pmatrix}. \]

Then

\[ \mathbf{AB} = \begin{pmatrix} 5 & 6 & 9 \\ 9 & 10 & 14 \\ 6 & 8 & 13 \end{pmatrix} \quad \text{and} \quad \mathbf{BA} = \begin{pmatrix} 20 & 9 \\ 15 & 8 \end{pmatrix}. \]

Inner and outer products are particularly important in vector multiplication. Let

\[ \mathbf{a}' = (1\;\; 1\;\; 1\;\; 1), \quad \mathbf{b}' = (1\;\; 1\;\; 1), \quad \text{and} \quad \mathbf{X} = \begin{pmatrix} 1 & 2 & 3 \\ 4 & 2 & 3 \\ 3 & 4 & 2 \\ 1 & 1 & 2 \end{pmatrix}. \]

Then

\[ \mathbf{a}'\mathbf{X} = (9\;\; 9\;\; 10) = \left( \sum_i x_{i1} \quad \sum_i x_{i2} \quad \sum_i x_{i3} \right) \]

{note: the summation is over rows}, and

\[ \mathbf{Xb} = \begin{pmatrix} 6 \\ 9 \\ 9 \\ 4 \end{pmatrix} = \begin{pmatrix} \sum_j x_{1j} \\ \sum_j x_{2j} \\ \sum_j x_{3j} \\ \sum_j x_{4j} \end{pmatrix} \]

{here the summation is over columns}. Also, it is easy to verify that

\[ \mathbf{a}'\mathbf{Xb} = 28 = \sum_j \sum_i x_{ij}, \]

where the summation is over rows and columns.
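The examples in this appendix are easy to check numerically. A short numpy sketch (my addition; Python is not otherwise used in this manuscript) verifying the inner/outer-product and vector-multiplication results above:

```python
import numpy as np

# Inner and outer products from the example above
A = np.array([[2, 1], [3, 2], [3, 1]])
B = np.array([[1, 2, 4], [3, 2, 1]])
print(A @ B)       # outer product, 3 x 3: [[5 6 9] [9 10 14] [6 8 13]]
print(B @ A)       # inner product, 2 x 2: [[20 9] [15 8]]

# Row sums, column sums, and the grand total via vectors of 1s
a = np.ones(4)     # a' = (1 1 1 1)
b = np.ones(3)     # b' = (1 1 1)
X = np.array([[1, 2, 3], [4, 2, 3], [3, 4, 2], [1, 1, 2]])
print(a @ X)       # column sums: [9. 9. 10.]
print(X @ b)       # row sums: [6. 9. 9. 4.]
print(a @ X @ b)   # grand total: 28.0
```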