Dispersion assessment in the location of facial
landmarks on photographs
B. R. Campomanes-Álvarez¹, O. Ibáñez¹, F. Navarro², I. Alemán², O. Cordón¹,³,⁴, S. Damas¹
¹ European Centre for Soft Computing, 33600 Mieres, Asturias, Spain
² Physical Anthropology Laboratory, University of Granada, 18012, Granada, Spain
³ Department of Computer Science and Artificial Intelligence, University of Granada, 18014, Granada, Spain
⁴ Research Center on Information and Communication Technologies (CITIC-UGR), University of Granada, 18014, Granada, Spain
rosario.campomanes@softcomputing.es
The Analysis of Variance (ANOVA)
ANOVA is a statistical procedure for determining the impact of the independent variables on the dependent
variable in a regression analysis. The test identifies the factors that influence a given data set. A
factor in an ANOVA experiment is a controlled independent variable, i.e., a variable whose levels are set
by the experimenter. Within the ANOVA test, the simple effect of a factor on the dependent variable is called
a main effect, whereas an interaction is the variation among the differences between the means of the
levels of one factor across the levels of another factor. When the data are unbalanced, there are at least three
approaches in ANOVA, commonly called Type I, Type II, and Type III Sums of Squares (SS), for testing
different hypotheses about the data; in balanced designs these tests are identical [1, 2, 3].
Types of Sum of Squares for ANOVA
Consider a model that includes two factors A and B; there are therefore two main effects, and an
interaction, AB. The full model is represented by SS(A, B, AB).
Other models are represented similarly: SS(A, B) indicates the model with no interaction, SS(B, AB)
indicates the model that does not account for effects from factor A, and so on.
The influence of particular factors (including interactions) can be tested by examining the differences
between models. For example, to determine the presence of an interaction effect, an F-test comparing the
full model SS(A, B, AB) with the no-interaction model SS(A, B) would be carried out.
It is convenient to define incremental sums of squares to represent these differences. Let
SS(AB | A, B) = SS(A, B, AB) - SS(A, B)
SS(A | B, AB) = SS(A, B, AB) - SS(B, AB)
SS(B | A, AB) = SS(A, B, AB) - SS(A, AB)
SS(A | B) = SS(A, B) - SS(B)
SS(B | A) = SS(A, B) - SS(A)
The notation shows the incremental differences in sums of squares, for example SS(AB | A, B) represents
"the sum of squares for interaction after the main effects", and SS(A | B) is "the sum of squares for the A
main effect after the B main effect and ignoring interactions" [1].
The different types of sums of squares then arise depending on the stage of model reduction at which they
are performed. In particular [1]:
Type I: Sequential
The SS for each factor is the incremental improvement in the error SS as each factor effect is added to the
regression model. In other words, it is the effect of each factor when the factors are entered into the
model one at a time, in the order specified in the model selection. The SS can also be viewed as the reduction
in the residual sum of squares obtained by adding that term to a fit that already includes the terms listed
before it.
o SS(A | B) for factor A.
o SS(B | A) for factor B.
o SS(AB | A, B) for the interaction AB.
o This tests the main effect of factor A, followed by the main effect of factor B after the main effect of A, followed by the interaction effect AB after the main effects.
o Because of the sequential nature, and the fact that the two main factors are tested in a particular order, this type of sums of squares will give different results for unbalanced data depending on which main effect is considered first.
o For unbalanced data, this approach tests for a difference in the weighted marginal means. In practical terms, this means that the results depend on the realized sample sizes, namely the proportions in the particular data set.
o Note that this is often not the hypothesis of interest when dealing with unbalanced data.
Type II: Hierarchical or partially sequential
Type II SS is the reduction in residual error due to adding a term to the model after all other terms
except those that contain it; equivalently, it is the reduction in the residual sum of squares obtained by adding
that term to a model consisting of all the other terms that do not contain it. An interaction comes into
play only when all the factors involved are included in the model. For example, the SS for the main effect of
factor A is not adjusted for any interaction involving A (AB, AC, and ABC), sums of squares for two-way interactions control for all main effects and all other two-way interactions, and so on.
o SS(A | B) for factor A.
o SS(B | A) for factor B.
o This type tests for each main effect after the other main effect.
o Note that no significant interaction is assumed; in other words, the interaction should be tested first (SS(AB | A, B)), and the analysis of the main effects continued only if AB is not significant.
o If there is indeed no interaction, then Type II is statistically more powerful than Type III (see [2] for further details).
o Computationally, this is equivalent to running a Type I analysis with different orders of the factors and taking the appropriate output.
Type III: Marginal or orthogonal
Type III SS gives the sum of squares that would be obtained for each variable if it were entered last into
the model. That is, the effect of each variable is evaluated after all other factors have been accounted for.
Therefore, the result for each term is equivalent to what is obtained with a Type I analysis in which that
term enters the model last in the ordering.
o SS(A | B, AB) for factor A.
o SS(B | A, AB) for factor B.
o This type tests for the presence of a main effect after the other main effect and the interaction. This approach is therefore valid in the presence of significant interactions.
o However, it is often not interesting to interpret a main effect if interactions are present (generally speaking, if a significant interaction is present, the main effects should not be further analyzed).
o If the interactions are not significant, Type II gives a more powerful test.
The type of SS only influences computations on unbalanced data: for orthogonal (balanced) designs it does
not matter which type of SS is used, since they are essentially the same.
In summary, the hypothesis of interest is usually about the significance of one factor while controlling for
the levels of the other factors. If the data are unbalanced, this equates to using Type II or Type III SS. In general, if
there is no significant interaction effect, Type II is more powerful and follows the principle of
marginality. If an interaction is present, Type II is inappropriate, while Type III can still be used, but its
results need to be interpreted with caution (in the presence of interactions, main effects are rarely
interpretable) [1, 2].
The ANOVA Estimated Model
The final result obtained by the ANOVA test corresponds to the ANOVA estimated model calculated
from an initial model that can be represented as a factorial design with three factors, A with a levels, B
with b levels, and C with c levels:
Y_{i,j,k,l} = \mu_{i,j,k} + \epsilon_{i,j,k,l},     (1)

(i = 1, 2, ..., a; j = 1, 2, ..., b; k = 1, 2, ..., c; l = 1, 2, ..., n_{ijk}), where Y_{i,j,k,l} is the l-th observation in the cell
defined by the i-th level of A, the j-th level of B, and the k-th level of C, \mu_{i,j,k} is the
cell mean (expected value), and \epsilon_{i,j,k,l} is the residual error. Usually this model is written using another
parameterization:

Y_{i,j,k,l} = \mu + \alpha_i + \beta_j + \gamma_k + (\alpha\beta)_{ij} + \epsilon_{ijkl},     (2)

where \mu is the overall mean; \alpha_i, \beta_j, and \gamma_k are the main effects of factors A, B, and C; (\alpha\beta)_{ij} is the interaction
effect of level i of factor A and level j of factor B; and \epsilon_{ijkl} is the random error for the l-th experimental unit
receiving level i of factor A, level j of factor B, and level k of factor C.
Hence, the estimated model calculated by the ANOVA test can be denoted by:

Y_{i,j,k,l} = \hat{\mu} + \hat{\alpha}_i + \hat{\beta}_j + \hat{\gamma}_k + (\widehat{\alpha\beta})_{ij} + \hat{\epsilon}_{ijkl}.     (3)

The statistics

\hat{\mu}, \hat{\alpha}_1, \hat{\alpha}_2, ..., \hat{\beta}_1, \hat{\beta}_2, ..., \hat{\gamma}_1, \hat{\gamma}_2, ..., (\widehat{\alpha\beta})_{11}, (\widehat{\alpha\beta})_{12}, ...

are the best estimators of the parameters

\mu, \alpha_1, \alpha_2, ..., \beta_1, \beta_2, ..., \gamma_1, \gamma_2, ..., (\alpha\beta)_{11}, (\alpha\beta)_{12}, ...,

respectively.
Those results can be used to analyze the impact of the independent variables, or factors, on the dependent
variable, and specifically which levels of each factor have the greatest influence on the modeled variable.
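The estimates above are what a least-squares fit returns. As a sketch with a hypothetical two-factor data set (the model in the article also includes a third factor C, omitted here for brevity), a saturated fit reproduces every sample cell mean:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
df = pd.DataFrame({
    "A": ["a1"] * 5 + ["a2"] * 3,   # illustrative unbalanced data
    "B": ["b1", "b2"] * 4,
    "y": rng.normal(size=8),
})

# fit.params holds the estimates: the overall mean, the main effects, and
# the interaction terms under the chosen parameterization.
fit = smf.ols("y ~ A * B", data=df).fit()
print(fit.params)

# In the saturated model the fitted value in each cell is the cell mean:
cell_means = df.groupby(["A", "B"])["y"].transform("mean")
print(np.allclose(fit.fittedvalues, cell_means))  # True
```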
References
1. Fox J (2008) Applied Regression Analysis and Generalized Linear Models. 2nd ed. SAGE Publications, Thousand Oaks, California
2. Langsrud O (2003) ANOVA for unbalanced data: Use Type II instead of Type III sums of squares. Statistics and Computing 13:163-167
3. Little RJA, Rubin DB (1987) Statistical Analysis with Missing Data. Wiley, New York