Multivariate Discriminant Analysis
Multivariate Discriminant Analysis is used to determine which variables best discriminate the
occurrence of an event between two or more groups. The basic concept of discriminant function
analysis is to establish whether groups differ with respect to the mean of a variable, and then to use
that variable as a member of the group of predictors.
Discriminant analysis is very similar to the analysis of variance (ANOVA): it answers, with a yes or a
no, whether two or more groups differ significantly with respect to the mean of a particular variable. If
the mean of a variable is significantly different across groups, that variable is said to discriminate
between the groups. In the case of a single variable, the Fisher test is what permits verification of
whether or not the variable discriminates between groups.
As described in the elementary concepts and in the ANOVA analysis of variance and/or the MANOVA
multivariate analysis of variance, the Fisher F statistic is evaluated as the ratio of the between-groups
variance to the pooled (average) within-groups variance. If the variance between groups is significantly
higher, then there are significant differences between the means. Generally several variables are
included in the study to find out which one or ones contribute to the discrimination between groups.
In that case, the total matrix of variances and covariances is computed, together with the pooled
within-groups matrix of variances and covariances. These two matrices can be compared through a
multivariate F-test to determine whether there is a significant difference between the groups with
respect to all the variables. This procedure is identical to the MANOVA multivariate analysis of
variance. Just as in MANOVA, the multivariate test is done first; if it is statistically significant, one can
then examine which of the variables has means that differ significantly across the groups. The
procedure with multiple variables is more complex, but the main reasoning is still to search for
variables that discriminate between groups by looking for differences in the means. The most common
application of Multivariate Discriminant Analysis is to include many variables in order to determine
those that best discriminate between groups. In this way a model is constructed that yields the best
predictor for each group.
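As a rough illustration of this idea, here is a minimal sketch in Python using scikit-learn's LinearDiscriminantAnalysis on synthetic two-group data; the group means, sample sizes, and variable count are illustrative assumptions, not values from this text.

```python
# Minimal discriminant-analysis sketch on synthetic two-group data.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
# Two groups of 50 cases, each measured on three candidate predictors;
# only the first predictor's mean differs between the groups.
group_a = rng.normal(loc=[0.0, 0.0, 0.0], size=(50, 3))
group_b = rng.normal(loc=[1.5, 0.0, 0.0], size=(50, 3))
X = np.vstack([group_a, group_b])
y = np.array([0] * 50 + [1] * 50)

lda = LinearDiscriminantAnalysis().fit(X, y)
# A large absolute coefficient marks a variable that discriminates strongly;
# here the first variable should dominate.
print("discriminant coefficients:", lda.coef_)
print("training accuracy:", lda.score(X, y))
```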
Stepwise Discriminant Analysis
One of the methods to build a statistical prediction model is the "Forward Stepwise Discriminant
Analysis", in which the discriminant model is built step by step. Specifically, at each step the statistical
program reviews all the variables and evaluates which one contributes most to the discrimination
between groups. That variable is included in the model, and the analysis goes on with the following step.
Another method is the "Backward Stepwise Discriminant Analysis", in which the statistical program
first includes all the variables in the model and then, at each step, eliminates the variable contributing
least to the prediction. A successful analysis results in a model containing only the most important
variables, that is, those that contribute most to the discrimination between groups. The stepwise
process is governed by the respective F-to-enter and F-to-remove values.
The F-to-enter/F-to-remove value for a variable indicates its statistical significance in the
discrimination between groups; it is therefore a measure of how much the variable, as a member of the
model, uniquely contributes to the prediction. The F-to-enter and F-to-remove values can be
interpreted in the same sense as in the stepwise procedure of multiple regression. In general, the
statistical program will keep choosing variables to include in the statistical model as long as their
respective F values are higher than the specified F-to-enter, and it will exclude from the model the
variables whose significance is lower than the specified F-to-remove value.
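A minimal sketch of the forward procedure follows, assuming the det(W)/det(T) form of Wilks' Lambda and the F formula given later in this text; the F-to-enter threshold of 3.84 is an illustrative default, not a value from the source.

```python
# Forward stepwise selection driven by F-to-enter (sketch).
import numpy as np

def sscp(A):
    """Sums-of-squares-and-cross-products matrix about the column means."""
    d = A - A.mean(axis=0)
    return d.T @ d

def wilks_lambda(X, y, cols):
    """Lambda = det(W) / det(T) for the variable subset `cols`."""
    W = sum(sscp(X[y == g][:, cols]) for g in np.unique(y))
    return np.linalg.det(W) / np.linalg.det(sscp(X[:, cols]))

def forward_stepwise(X, y, f_enter=3.84):
    n, q = len(y), len(np.unique(y))
    selected, remaining = [], list(range(X.shape[1]))
    lam_before = 1.0
    while remaining:
        best_j, best_f, best_lam = None, -np.inf, None
        for j in remaining:
            lam_after = wilks_lambda(X, y, selected + [j])
            partial = lam_after / lam_before          # partial Lambda
            p = len(selected) + 1                     # model size if j enters
            f = ((n - q - p) / (q - 1)) * ((1 - partial) / partial)
            if f > best_f:
                best_j, best_f, best_lam = j, f, lam_after
        if best_f < f_enter:
            break          # no remaining variable contributes enough
        selected.append(best_j)
        remaining.remove(best_j)
        lam_before = best_lam
    return selected        # indices of the retained predictor columns
```

A backward variant would start from all the variables and, at each step, drop the one with the smallest F-to-remove until every remaining F exceeds the threshold.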
A common error in interpreting the SDA results is to take the levels of statistical significance at their
nominal value. When the statistical program decides which variable to include or exclude at the next
step of the analysis, it computes the significance of the contribution of each variable under
consideration. The stepwise procedure capitalizes on chance, since it picks the variables that yield
maximum discrimination. Therefore, when the stepwise approach is used, the significance levels do not
reflect the true alpha error rate, that is, the probability of erroneously rejecting the null hypothesis H0
that there is no discrimination between groups. The multivariate analysis of variance was originally
developed by Wilks (1932) through a generalized likelihood-ratio principle.
Statistical Program
A statistical program for PC was used to carry out the SDA in order to obtain a statistical model for the
prediction of the Zonda wind and its severity. Once the variables have been chosen and the procedure
has been specified, the program carries out the SDA and provides the following results:
- Number of steps of the analysis
- Number of variables entered in the model
- Last entered variable
- Variables in the model:
Name of the variable, Wilks' Lambda, Partial Lambda, F to remove, p-level for F to remove,
Tolerance, 1-Tolerance
- Variables outside the model:
Name of the variable, Wilks' Lambda, Partial Lambda, F to enter, p-level for F to enter,
Tolerance, 1-Tolerance
- Distance among groups
- Summary of the Stepwise Analysis for the chosen variables:
Number of the step, F to enter or to remove, degrees of freedom for the respective F, number of
variables in the model after that step, Wilks' Lambda after the respective step, the F value associated
with that Lambda, the degrees of freedom for that F, and the p-level for that F value.
- Canonical analysis and graphics
- Classification functions
- Classification Matrix
- Classification of the cases
- Squared Mahalanobis distances
- Posterior probabilities
The Wilks' Lambda value represents the discrimination among groups, taking values in the range
between 0 and 1, where 0 corresponds to total discrimination and 1 to no discrimination. It is evaluated
over all the variables in the model as the ratio of the determinant of the within-groups
variance-covariance matrix to the determinant of the total variance-covariance matrix:
Wilks' Lambda = det(W) / det(T)
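A direct numeric check of this formula on synthetic data (the group shift and sample sizes are made up for illustration):

```python
# Compute Lambda = det(W) / det(T) for two synthetic groups.
import numpy as np

rng = np.random.default_rng(1)
X1 = rng.normal(loc=0.0, size=(40, 2))   # group 1
X2 = rng.normal(loc=2.0, size=(40, 2))   # group 2, with shifted means

def sscp(A):
    d = A - A.mean(axis=0)
    return d.T @ d

W = sscp(X1) + sscp(X2)                  # pooled within-groups SSCP matrix
T = sscp(np.vstack([X1, X2]))            # total SSCP matrix
print(np.linalg.det(W) / np.linalg.det(T))  # well below 1: groups separate
```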
The partial Lambda is defined as the multiplicative increase in Lambda resulting from adding the
respective variable, that is, the Wilks' Lambda associated with the unique contribution of the respective
variable to the discriminatory power of the model:
Partial Lambda = Lambda after / Lambda before
In other words, it is the Wilks' Lambda after incorporating the variable divided by the Wilks' Lambda
before its incorporation.
The F-to-remove value is the Fisher F value associated with the respective partial Lambda, and it is
calculated as:
F = [(n - q - p) / (q - 1)] * [(1 - partial Lambda) / partial Lambda]
where:
n is the number of cases,
q is the number of groups,
p is the number of variables in the model,
partial Lambda is the partial Lambda defined above.
The tolerance for each variable is calculated as 1 - R2 of the respective variable with all the other
variables in the model; it is a measure of the variable's redundancy.
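A sketch of that computation, obtaining R2 from an ordinary least-squares regression of the variable on all the others (the intercept column is an implementation detail, not part of the definition):

```python
# Tolerance of column j: 1 - R^2 of column j regressed on the other columns.
import numpy as np

def tolerance(X, j):
    others = np.delete(X, j, axis=1)
    A = np.column_stack([others, np.ones(len(X))])   # add intercept term
    coef, *_ = np.linalg.lstsq(A, X[:, j], rcond=None)
    resid = X[:, j] - A @ coef
    r2 = 1.0 - resid.var() / X[:, j].var()
    return 1.0 - r2        # values near 0 flag a highly redundant variable
```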
The classification functions permit the determination of scores for each case and each group. The
scores are calculated through the following formula:
Si = ci + wi1 * x1 + wi2 * x2 + wi3 * x3 + ...... + wim * xm
where the subindex i indicates the respective group; the subindexes 1, 2, 3, ......, m indicate the m
variables (predictors) chosen by the discriminant analysis; ci is a constant for the i-th group; wij is the
weight of the j-th variable when computing the score for the i-th group; xj is the value observed for the
respective case on the j-th variable; and Si is the resulting score.
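For example, with hypothetical constants and weights (placeholders, not values produced by the actual analysis), one case would be scored and classified as follows:

```python
# Score one case against the classification function of each group.
import numpy as np

c = np.array([-4.0, -7.5])              # constants c_i (hypothetical)
w = np.array([[1.2, 0.4, -0.3],         # weights w_0j for group 0 (hypothetical)
              [2.1, 0.1,  0.5]])        # weights w_1j for group 1 (hypothetical)

x = np.array([1.8, 0.2, 0.0])           # observed values of the m=3 predictors
scores = c + w @ x                      # S_i = c_i + sum_j w_ij * x_j
# The case is assigned to the group with the highest score.
print(scores, "-> classified into group", np.argmax(scores))
```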
The Classification Matrix contains the number and percentage of cases correctly classified in each
group, given the a priori classification probabilities.
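A small sketch of how such a matrix is tallied from observed and predicted group memberships (the labels below are made up for illustration):

```python
# Classification matrix: rows = observed groups, columns = predicted groups.
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])   # observed group of each case
y_pred = np.array([0, 0, 1, 0, 1, 1, 1, 0])   # group assigned by the model
cm = confusion_matrix(y_true, y_pred)
print(cm)
print("percent correct:", 100.0 * np.trace(cm) / cm.sum())   # 75.0 here
```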