Dispersion assessment in the location of facial landmarks on photographs

B. R. Campomanes-Álvarez 1, O. Ibáñez 1, F. Navarro 2, I. Alemán 2, O. Cordón 1,3,4, S. Damas 1

1 European Centre for Soft Computing, 33600 Mieres, Asturias, Spain
2 Physical Anthropology Laboratory, University of Granada, 18012 Granada, Spain
3 Department of Computer Science and Artificial Intelligence, University of Granada, 18014 Granada, Spain
4 Research Center on Information and Communication Technologies (CITIC-UGR), University of Granada, 18014 Granada, Spain

rosario.campomanes@softcomputing.es

The Analysis of Variance (ANOVA)

ANOVA is a statistical procedure for determining the impact that independent variables have on the dependent variable in a regression analysis; that is, it identifies the factors that influence a given data set. A factor in an ANOVA experiment is a controlled independent variable, i.e., a variable whose levels are set by the experimenter. Within the ANOVA test, the simple effect of a factor on the dependent variable is called a main effect, whereas an interaction is the variation among the differences between the means of the levels of one factor across the levels of another factor. When the data are unbalanced, there are at least three approaches in ANOVA, commonly called Type I, Type II, and Type III Sums of Squares (SS), for testing different hypotheses about the data; in balanced designs the three tests are identical [1, 2, 3].

Types of Sums of Squares for ANOVA

Consider a model that includes two factors, A and B; there are therefore two main effects and an interaction, AB. The full model is represented by SS(A, B, AB). Other models are represented similarly: SS(A, B) denotes the model with no interaction, SS(B, AB) the model that does not account for effects from factor A, and so on. The influence of particular factors (including interactions) can be tested by examining the differences between models.
For example, to determine the presence of an interaction effect, an F-test comparing the full model SS(A, B, AB) with the no-interaction model SS(A, B) would be carried out. It is convenient to define incremental sums of squares to represent these differences. Let

SS(AB | A, B) = SS(A, B, AB) - SS(A, B)
SS(A | B, AB) = SS(A, B, AB) - SS(B, AB)
SS(B | A, AB) = SS(A, B, AB) - SS(A, AB)
SS(A | B) = SS(A, B) - SS(B)
SS(B | A) = SS(A, B) - SS(A)

The notation shows the incremental differences in sums of squares. For example, SS(AB | A, B) represents "the sum of squares for the interaction after the main effects", and SS(A | B) is "the sum of squares for the A main effect after the B main effect, ignoring interactions" [1]. The different types of sums of squares then arise depending on the stage of model reduction at which they are computed. In particular [1]:

Type I: Sequential

The SS for each factor is the incremental improvement in the error SS as that factor's effect is added to the regression model. In other words, it is the effect of each factor considered one at a time, in the order in which the factors are entered into the model. The SS can also be viewed as the reduction in the residual sum of squares obtained by adding a term to a fit that already includes the terms listed before it:

- SS(A) for factor A.
- SS(B | A) for factor B.
- SS(AB | A, B) for the interaction AB.

This tests the main effect of factor A, followed by the main effect of factor B after the main effect of A, followed by the interaction effect AB after the main effects. Because of this sequential nature, and because the two main factors are tested in a particular order, this type of sums of squares will give different results for unbalanced data depending on which main effect is considered first. For unbalanced data, this approach tests for a difference in the weighted marginal means.
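The order dependence of the sequential scheme can be illustrated with a short numerical sketch. The two-factor data below are hypothetical and deliberately unbalanced (unequal cell counts), so the columns coding A and B are correlated; only NumPy is assumed.

```python
import numpy as np

# Hypothetical unbalanced two-factor data: cell counts differ on purpose,
# so the dummy columns for A and B are correlated.
A = np.array([0, 0, 0, 0, 1, 1, 1])                # factor A, two levels
B = np.array([0, 0, 1, 1, 0, 1, 1])                # factor B, two levels
y = np.array([3.1, 2.9, 4.2, 4.0, 5.1, 6.3, 6.1])  # response

def rss(cols):
    """Residual sum of squares of an OLS fit on an intercept plus the given columns."""
    X = np.column_stack([np.ones(len(y))] + cols)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return r @ r

a, b = A.astype(float), B.astype(float)

# Type I with order A then B: SS(A), then SS(B | A).
ss_A      = rss([]) - rss([a])
ss_B_then = rss([a]) - rss([a, b])

# Type I with order B then A: SS(B), then SS(A | B).
ss_B      = rss([]) - rss([b])
ss_A_then = rss([b]) - rss([a, b])

print(f"SS(A) = {ss_A:.4f}   but  SS(A | B) = {ss_A_then:.4f}")
print(f"SS(B) = {ss_B:.4f}   but  SS(B | A) = {ss_B_then:.4f}")
```

Because the data are unbalanced, each main effect's sum of squares changes with its position in the ordering; with balanced cell counts the two orderings would agree.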
In practical terms, this means that the results depend on the realized sample sizes, namely the proportions in the particular data set. Note that this is often not the hypothesis of interest when dealing with unbalanced data.

Type II: Hierarchical or partially sequential

The Type II SS for a term is the reduction in residual error due to adding that term to the model after all other terms except those that contain it; equivalently, it is the reduction in the residual sum of squares obtained by adding that term to a model consisting of all other terms that do not contain the term in question. An interaction comes into play only when all of the involved factors are included in the model. For example, the SS for the main effect of factor A is not adjusted for any interactions involving A (AB, AC, and ABC), sums of squares for two-way interactions control for all main effects and all other two-way interactions, and so on:

- SS(A | B) for factor A.
- SS(B | A) for factor B.

This type tests each main effect after the other main effect. Note that no significant interaction is assumed; in other words, one should test for the interaction first, SS(AB | A, B), and only if AB is not significant continue with the analysis of the main effects. If there is indeed no interaction, then Type II is statistically more powerful than Type III (see [2] for further details). Computationally, this is equivalent to running a Type I analysis with different orderings of the factors and taking the appropriate output.

Type III: Marginal or orthogonal

Type III SS gives the sum of squares that would be obtained for each variable if it were entered into the model last. That is, the effect of each variable is evaluated after all other factors have been accounted for. Therefore, the result for each term is equivalent to what is obtained with a Type I analysis when that term enters the model last in the ordering:

- SS(A | B, AB) for factor A.
- SS(B | A, AB) for factor B.
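The Type II and Type III comparisons just listed, together with the interaction test SS(AB | A, B) they share, can be sketched in the same least-squares style. The data are again hypothetical and unbalanced, this time with a mild interaction built in; sum-to-zero (effects) coding is used because Type III comparisons are only meaningful under such a parameterization.

```python
import numpy as np

# Hypothetical unbalanced two-factor data with a mild interaction built in.
A = np.array([0, 0, 0, 0, 1, 1, 1])
B = np.array([0, 0, 1, 1, 0, 1, 1])
y = np.array([3.1, 2.9, 4.2, 4.0, 5.1, 6.6, 6.4])

# Sum-to-zero (effects) coding: the two levels become -1 / +1.
a = 2.0 * A - 1.0
b = 2.0 * B - 1.0
ab = a * b                                   # interaction column

def rss(cols):
    """Residual sum of squares of an OLS fit on an intercept plus the given columns."""
    X = np.column_stack([np.ones(len(y))] + cols)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return r @ r

full   = rss([a, b, ab])                     # full model SS(A, B, AB)
no_int = rss([a, b])                         # additive model SS(A, B)

ss_AB = no_int - full                        # SS(AB | A, B), shared by both types

# Type II: each main effect after the other, ignoring the interaction.
ss2_A = rss([b]) - no_int                    # SS(A | B)
ss2_B = rss([a]) - no_int                    # SS(B | A)

# Type III: each main effect after the other main effect AND the interaction.
ss3_A = rss([b, ab]) - full                  # SS(A | B, AB)
ss3_B = rss([a, ab]) - full                  # SS(B | A, AB)

print(f"Type II : SS(A) = {ss2_A:.4f}  SS(B) = {ss2_B:.4f}  SS(AB) = {ss_AB:.4f}")
print(f"Type III: SS(A) = {ss3_A:.4f}  SS(B) = {ss3_B:.4f}  SS(AB) = {ss_AB:.4f}")
```

With two factors at two levels each, the full model reproduces the four cell means exactly; a clearly nonzero SS(AB | A, B), as here, would argue for interpreting the interaction rather than the main effects.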
This type tests for the presence of a main effect after the other main effect and the interaction. This approach is therefore valid in the presence of significant interactions. However, it is often not interesting to interpret a main effect if interactions are present (generally speaking, if a significant interaction is present, the main effects should not be further analyzed). If the interactions are not significant, Type II gives a more powerful test.

The type of SS only influences computations on unbalanced data: for orthogonal (balanced) designs it does not matter which type of SS is used, since they are essentially the same. In summary, the hypothesis of interest is usually about the significance of one factor while controlling for the levels of the other factors. If the data are unbalanced, this equates to using Type II or Type III SS. In general, if there is no significant interaction effect, then Type II is more powerful and follows the principle of marginality. If an interaction is present, then Type II is inappropriate, while Type III can still be used, but the results need to be interpreted with caution (in the presence of interactions, main effects are rarely interpretable) [1, 2].

The ANOVA Estimated Model

The final result obtained by the ANOVA test corresponds to the ANOVA estimated model, calculated from an initial model that can be represented as a factorial design with three factors, A with a levels, B with b levels, and C with c levels:

Y_{ijkl} = \mu_{ijk} + \varepsilon_{ijkl},    (1)

(i = 1, 2, ..., a; j = 1, 2, ..., b; k = 1, 2, ..., c; l = 1, 2, ..., n_{ijk}), where Y_{ijkl} is the l-th observation in the cell defined by the i-th level of A, the j-th level of B, and the k-th level of C, \mu_{ijk} is the cell mean (expected value), and \varepsilon_{ijkl} is the residual error. Usually this model is written using another parameterization:

Y_{ijkl} = \mu + \alpha_i + \beta_j + \gamma_k + (\alpha\beta)_{ij} + \varepsilon_{ijkl},    (2)

where \mu is the overall mean and \alpha_i, \beta_j, \gamma_k are the main effects of factors A, B, and C.
(\alpha\beta)_{ij} is the interaction effect of level i of factor A and level j of factor B, and \varepsilon_{ijkl} is the random error for experimental unit l receiving level i of factor A, level j of factor B, and level k of factor C. Hence, the estimated model calculated by the ANOVA test can be denoted by:

Y_{ijkl} = \hat{\mu} + \hat{\alpha}_i + \hat{\beta}_j + \hat{\gamma}_k + (\widehat{\alpha\beta})_{ij} + \hat{\varepsilon}_{ijkl}.    (3)

The statistics \hat{\mu}, \hat{\alpha}_1, \hat{\alpha}_2, ..., \hat{\beta}_1, \hat{\beta}_2, ..., \hat{\gamma}_1, \hat{\gamma}_2, ..., (\widehat{\alpha\beta})_{11}, (\widehat{\alpha\beta})_{12}, ... are the best estimators of the parameters \mu, \alpha_1, \alpha_2, ..., \beta_1, \beta_2, ..., \gamma_1, \gamma_2, ..., (\alpha\beta)_{11}, (\alpha\beta)_{12}, ..., respectively. These results can be used to analyze the impact of the independent variables (factors) on the dependent variable, specifically which levels of each factor have a greater influence on the modeled variable.

References

1. Fox J (2008) Applied Regression Analysis and Generalized Linear Models. 2nd ed. SAGE Publications, Thousand Oaks, California
2. Langsrud O (2003) ANOVA for unbalanced data: Use Type II instead of Type III sums of squares. Statistics and Computing 13:163-167
3. Little RJA, Rubin DB (1987) Statistical Analysis with Missing Data. Wiley, New York