DEVELOPMENT OF STATISTICAL PROGRAM FOR DISCRIMINANT ANALYSIS FOR MIXED VARIABLES USING R-CODE

AsPoly Journal of Sciences, Engineering and Environmental Studies
Volume 1: No.1, March 2020, pp. 58 – 79
Iwuagwu, Chukwuma E.
Ibegbulem, Zebulon O.
Nwosu, Obinna M.
Ekwuribe, Celestine
Department of Statistics,
Abia State Polytechnic, Aba, Abia State.
Abstract
Discriminant analysis is a statistical technique that tries
to predict, on the basis of one or more independent
variables, whether an individual or other object can be
placed into a particular category of a categorical
dependent variable. This research aims at the
development of a statistical program for discriminant
analysis with mixed variables (discrete and continuous)
using the R programming language. The data for the research
were obtained from simulated data and real data. In the
simulated experiment, a training data set was generated
using R-code. The real data came from secondary data
consisting of fifteen variables obtained from the UNDP
report on human development. The results obtained with the
R programming language revealed that the developed
program gave satisfactory results in terms of minimizing
the average error rate.
Keywords: Location model, Discriminant, Discrete and
Continuous, R-Code
1.1 Introduction
Discriminant analysis is a statistical technique that allows
one to understand the differences between two or more
groups of objects with respect to several variables
simultaneously. The procedure tries to predict, on the
basis of one or more predictors or independent variables,
whether an individual or other subject can be placed into
a particular category of a categorical dependent variable.
The purpose is to determine the predictor variables on the
basis of which the groups can be distinguished; the
discriminant model is built from a set of observations for
which the groups are known.
Lachenbruch (1975) viewed the problem of
discriminant analysis as that of assigning an unknown
observation to a group with a low error rate.
Johnson and Wichern (1992) defined discriminant analysis
and classification as multivariate techniques concerned
with separating distinct sets of objects (or observations)
and with allocating new objects to previously defined
groups.
In general, discriminant analysis is concerned with the
development of a rule for allocating objects into one of
several distinct groups; the rule is then used to determine
the group of future objects.
Some examples of classification problems are:
1) A patient is admitted to a hospital with diagnosis
of Myocardial infarction. The medical doctor
examined the patient in order to obtain the
systolic blood pressure, heart rate, stroke index
and the mean arterial pressure. The measurements
will be used to determine the probability of survival
of the patient, Onyeagu (2003).
2) In admission of a candidate in a programme in a
Higher Institution, the candidate must be assigned
to the categories of Admit or Do-not admit, Osuji
(2010).
3) In banking, an officer in charge of credit facility
may wish to classify loan or credit given to
customers as either good or bad, or low or high,
credit risk, Osuji (2010).
4) An archeologist obtains a skull and wants to know
if it belongs to a tribe that inhabited an area
20,000 years ago or to a successor tribe that lived
nearby. On the basis of measurements made on a set of
skulls from each of the two populations, the assignment
may be made, Onyeagu (2003).
Real-world problems show that data are often a mixture of
continuous and discrete variables. A discrete variable is a
random variable whose set of possible values is a finite or
countably infinite sequence of numbers, 𝑥1 , 𝑥2 , ⋯;
typically the values form a subset of the non-negative
integers, such as 0 and 1. A continuous variable is a
variable whose set of possible values is a continuous
interval of real numbers.
1.2 Statement of problem
Real-world problems indicate that we are often faced with
data comprising discrete and continuous variables. The
researcher is therefore faced with the problem of
discriminating between two groups and allocating
individuals to one or the other, and there is no current
program that handles such mixed variables.
1.3 Objectives of the study
The objectives of the study include:
(a) To develop a current programme using the R programming
language that can handle mixed variables.
(b) To estimate the error rate
(c) To apply the developed programme on Location
model using simulated and real data
1.4 Significance of the study
The successful development of this programme will spur
researchers in different fields of study like Sciences,
Business and Social Sciences into further research in this
area and related ones thereby enhancing the expansion of
knowledge.
2.1 Literature Review
Hamid, Mei and Yahaya (2017), in their work "New
Discrimination Procedure of Location Model for Handling
Large Categorical Variables", noted that the location model
proposed in the past is a predictive discriminant rule that
can classify new observations into one of two predefined
groups based on mixtures of continuous and categorical
variables. The ability of location model to discriminate
new observation correctly is highly dependent on the
number of multinomial cells created by the number of
categorical variables. This study conducted a preliminary
investigation showing that the location model using
maximum likelihood estimation has a high misclassification
rate, up to 45% on average, when dealing with more than six
categorical variables, for all 36 data sets tested. The
model gave highly incorrect predictions, performing badly
with a large number of categorical variables even when the
sample size was large. To alleviate the high rate of
misclassification, a new strategy is embedded in the
discriminant rule by introducing nonlinear principal
component analysis (NPCA) into the classical location
model (cLM), mainly to handle the large number of
categorical variables. This new strategy is investigated on
some simulation and real datasets through the estimation
of misclassification rate using leave-one-out method.
The results from numerical investigations manifest
the feasibility of the proposed model as the
misclassification rate is dramatically decreased compared
to the cLM for all 18 different data settings. A practical
application using real dataset demonstrates a significant
improvement and obtains comparable result among the
best methods that are compared. The overall findings
reveal that the proposed model extended the applicability
range of the location model as previously it was limited to
only six categorical variables to achieve acceptable
performance. This study proved that the proposed model
with the new discrimination procedure can be used as an
alternative for mixed-variable classification problems,
primarily when facing a large number of categorical
variables.
Ikechukwu (2016), in his work titled “Evaluation of
Error Rate Estimators in Discriminant Analysis with
Multivariate Binary Variables”, observed that classification
problems often suffer from small samples in conjunction
with large number of features, which makes error
estimation problematic. When a sample is small, there is
insufficient data to split the sample and the same data are
used for both classifier design and error estimation. Error
estimation can suffer from high variance, bias or both. The
problem of choosing a suitable error estimator is
exacerbated by the fact that estimation performance
depends on the rule used to design the classifier, the
feature-label distribution to which the classifier is to be
applied and the sample size.
His paper was concerned with the evaluation of error rate
estimators in two-group discriminant analysis with
multivariate binary variables. The behaviour of the eight
most commonly used estimators was compared and contrasted
by means of Monte Carlo simulation. The criterion used for
comparing the error rate estimators was the sum of squared
errors (SSE). Four experimental factors were considered in
the simulation, namely: the number of variables, the
sample size relative to the number of variables, the prior
probability and the correlation between the variables in
the populations. Two major results were obtained from the
study. Firstly, the simulation experiments ranked the
estimators as follows: DS, O, OS, U, R, JK, P and D, with
the DS estimator the best method. Secondly, it was
concluded that it is better to increase the number of
variables because accuracy increases with an increasing
number of variables. Also, the general trend for the
estimators was an increase in error rate as sample size
decreases, while decreasing the distance between
populations generally increases the error rate. The DS
estimator was the most consistent, and thus the most
reliable, over all combinations of probability pattern and
sample size.
El-Hanjouri and Hamad (2015), in their work titled
“Using Cluster Analysis and Discriminant Analysis
Methods in Classification with Application on Standard of
Living Family in Palestinian Areas”, applied methods of
multivariate statistical analysis, especially cluster
analysis (CA), to identify the disparity in family living
standards among the Palestinian areas. The results showed
a convergence in family living standards between the two
areas that formed the first cluster, of high living
standards: the urban and the camp areas of the middle West
Bank. There was also a convergence among the seven areas
that formed the second cluster, of middle living
standards: the urban, camp and rural areas of the North
West Bank, the urban, camp and rural areas of the South
West Bank, and the rural area of the middle West Bank.
In addition, there was a convergence among the three areas
that formed the third cluster, of low living standards:
the urban, rural and camp areas of the Gaza Strip. After a
comparison among several methods of cluster analysis
through cluster validation (Hierarchical Cluster Analysis,
K-means Clustering and K-medoids Clustering), the
preference was for Hierarchical Cluster Analysis. Then,
after an examination of the agglomerative coefficient to
choose the best linkage method in Hierarchical Cluster
Analysis (single, complete, average and Ward linkage), the
preference was for the Ward linkage method, which was
selected for use in the classification. Moreover,
Discriminant Analysis (DA) was applied to distinguish the
variables that contribute significantly to this disparity
among families inside the Palestinian areas, and the
results show that monthly income, assistance, agricultural
land, animal holdings, total expenditure, imputed rent,
remittances and non-consumption expenditure contribute
significantly to the disparity.
El-Habil and El-Jazzar (2013), in their paper titled “A
Comparative Study between Linear Discriminant Analysis
and Multinomial Logistic Regression”, aimed
to compare between the two different methods of
classification: linear discriminant analysis (LDA) and
multinomial logistic regression (MLR) using the overall
classification accuracy, investigating the quality of their
prediction in terms of sensitivity and specificity, and
examining area under the ROC curve (AUC) in order to
make the choice between the two methods easier, and to
understand how the two models behave under different
data and group characteristics. Model performance had
been assessed from two special cases of the k-fold
partitioning technique, the ‘leave-one-out’ and ‘hold out’
procedures. The performance evaluation for the two
methods was carried out using real data and also by
simulation.
Results show that logistic regression slightly exceeds
linear discriminant analysis in the correct classification
rate, but when sensitivity, specificity and AUC are taken
into account, the differences in the AUC were negligible.
By simulation, the authors examined the impact of changes
in sample size, distance between group means,
categorization, and correlation matrices between the
predictors on the performance of each method. Results
indicate that variation in sample size, values of
Euclidean distance and number of categories have a
similar impact on the results for the two methods, and
both LDA and MLR show a significant improvement in
classification accuracy in the absence of
multicollinearity among the explanatory variables.
Fernandez (2009), in his work "Discriminant Analysis, a
Powerful Classification Technique in Predictive Modeling",
observed that discriminant analysis is one of the
classical classification techniques used to discriminate a
single categorical variable using multiple attributes.
Discriminant analysis also assigns observations to one of
the pre-defined groups based on the knowledge of the
multi-attributes. When the distribution within each group
is multivariate normal, a parametric method can be used
to develop a discriminant function using generalized
squared distance measure.
The classification criterion is derived based on either
the individual within-group covariance matrices or the
pooled covariance matrix that also takes into account the
prior probabilities of the classes. Non-parametric
discriminant methods are based on non-parametric
group-specific probability densities. Either a kernel or
the k-nearest-neighbor method can be used to generate a
non-parametric density estimate in each group and to produce
a classification criterion. The performance of a
discriminant criterion could be evaluated by estimating
probabilities of misclassification of new observations in
the validation data.
Krzanowski (1975) considered the problem of
discriminating between two groups and allocating
individuals to one or the other when the available data
consists of both binary and continuous variables.
The researcher derived a discriminant function from a
probabilistic model for mixed binary and continuous
variables and assessed the utility of the allocation rule
with probabilities of misclassification, or error rates.
Krzanowski compared the location model so derived with
Fisher's linear discriminant function (LDF). The
simulation study was conducted in two situations. In
situation (a), with no interaction, the average error
rates for the two methods were very similar for all
combinations of parameter values; under situation (b),
with interactions considered, the average error rate when
Fisher's LDF was used was higher than that for the
location model for all parameter values.
Krzanowski (1982) derived an allocation rule based on
hypothesis testing for mixtures of variables. Let the $j$th
member of the training set from $\pi_i$ be denoted by

$$V_j^{(i)} = \begin{pmatrix} x_j^{(i)} \\ y_j^{(i)} \end{pmatrix},$$

where $x_j^{(i)}$, the vector of dummy binary variables, is
obtained from the discrete component of $V_j^{(i)}$, and
$y_j^{(i)}$ is the vector of continuous variables.
Furthermore, let $n_{im}$ denote the number of members of
the training set from $\pi_i$ that fall in cell $m$ of the
contingency table defined by $x$, and let
$n_i = \sum_{m=1}^{k} n_{im}$ be the sample size of the
training set from $\pi_i$ $(i = 1, 2;\ m = 1, 2, \ldots, k)$.
Then, using the location model assumption, the likelihood
of the two training sets is given by

$$L = (2\pi)^{-\frac{1}{2}p(n_1+n_2)}\,|\Sigma|^{-\frac{1}{2}(n_1+n_2)} \prod_{i=1}^{2}\left\{\prod_{m=1}^{k} p_{im}^{\,n_{im}} \prod_{j=1}^{n_i} \exp\left[-\tfrac{1}{2}\left(y_j^{(i)}-\mu_j^{(i)}\right)'\Sigma^{-1}\left(y_j^{(i)}-\mu_j^{(i)}\right)\right]\right\} \qquad (2)$$

where $\mu_j^{(i)}$ takes the value $\mu_i^{(m)}$ defined by

$$\mu_i^{(m)} = \nu_i + \sum_{j=1}^{q}\alpha_{i,j}\,x_j^{(m)} + \sum_{j<k}\beta_{i,jk}\,x_j^{(m)}x_k^{(m)}$$

for all $j$ which fall in the $m$th cell of the contingency
table constructed from $x$ $(i = 1, 2;\ m = 1, 2, \ldots, k)$.
Consider the extra individual, $v$, to be classified, and
suppose that its discrete components place it in cell $m$
of the contingency table. If this individual is included
with the training set from $\pi_i$, then an extra
multiplying factor

$$L_{im} = (2\pi)^{-\frac{1}{2}p}\,|\Sigma|^{-\frac{1}{2}}\,p_{im}\exp\left[-\tfrac{1}{2}\left(y-\mu_i^{(m)}\right)'\Sigma^{-1}\left(y-\mu_i^{(m)}\right)\right]$$

must be incorporated in equation (2) above. The
hypothesis testing allocation rule is therefore given by
evaluating the statistic

$$\lambda = \frac{\sup\,(L_{1m}L)}{\sup\,(L_{2m}L)}$$

and allocating $v$ to $\pi_1$ if $\lambda \ge 1$ and to
$\pi_2$ otherwise. Maximization of $L_{im}L$ follows the
same general lines as in the estimation approach to the
problem. The only difference is that for
$\sup\,(L_{1m}L)$ we have $n_{1m}+1$ observations in cell
$m$ for $\pi_1$, and $y$ is included with the other
$n_{1m}$ continuous vectors having mean $\mu_1^{(m)}$,
while for $\sup\,(L_{2m}L)$ we have $n_{2m}+1$ observations
in cell $m$ for $\pi_2$, and $y$ is included with the
$n_{2m}$ continuous vectors having mean $\mu_2^{(m)}$. All
other cells are unchanged in both maximizations.
Hamid (2010) focused on constructing the model when the
data involve a combination of mixed variables. The
researcher used the following steps to arrive at the final
algorithm of the location model.
i) Omit object $i$ from the sample of size $n$, where
$i = 1, 2, \ldots, n$.
ii) Perform Principal Component Analysis (PCA) on the
continuous variables of the remaining objects
$(n_1 + n_2 - i)$ to choose the best combination of
components or to reduce the dimension.
iii) Repeat step (ii), conducting PCA on the binary
variables, then combine the results from steps (ii) and
(iii) to produce 2PCA.
iv) Compute and estimate $\mu_i$, $\Sigma$ and $p_i$ using
the new components resulting from 2PCA, then construct
the location model function.
v) Pool and run steps (ii)-(iv) together to produce a new
algorithm of 2PCA plus LM.
vi) Predict the group of the omitted object $i$ using the
newly constructed model; if the prediction is correct
assign error $\varepsilon_{ij} = 0$, otherwise
$\varepsilon_{ij} = 1$.
vii) Repeat steps (i)-(vi) for all objects in turn.
viii) Compute the leave-one-out error rate using
$\frac{\sum \varepsilon_{ij}}{n} \times 100$.
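The leave-one-out loop in steps (i) and (vi)-(viii) can be sketched as follows. This is a minimal illustration in Python, not the procedure's original code: the PCA and location model steps (ii)-(v) are replaced by a hypothetical nearest-mean rule on one continuous variable, and all function names and data values are assumptions made for the example.

```python
from statistics import mean

def fit_nearest_mean(xs, labels):
    """Toy stand-in for steps (ii)-(v): store the mean of each group."""
    groups = sorted(set(labels))
    return {g: mean(x for x, lab in zip(xs, labels) if lab == g) for g in groups}

def predict_nearest_mean(model, x):
    """Assign x to the group whose stored mean is closest."""
    return min(model, key=lambda g: abs(x - model[g]))

def leave_one_out_error_rate(xs, labels, fit, predict):
    """Percentage of objects misclassified when each one is omitted in turn."""
    n = len(xs)
    errors = 0
    for i in range(n):
        train_x = xs[:i] + xs[i + 1:]            # step (i): omit object i
        train_lab = labels[:i] + labels[i + 1:]
        model = fit(train_x, train_lab)          # rebuild the rule without object i
        if predict(model, xs[i]) != labels[i]:   # step (vi): error indicator is 1
            errors += 1
    return errors / n * 100                      # step (viii)

# Hypothetical data: one continuous measurement per object, two groups.
xs = [1.0, 1.2, 0.8, 3.0, 3.2, 2.9, 1.9]
labels = [1, 1, 1, 2, 2, 2, 2]
rate = leave_one_out_error_rate(xs, labels, fit_nearest_mean, predict_nearest_mean)
print(f"leave-one-out error rate: {rate:.2f}%")
```

Passing `fit` and `predict` as arguments keeps the loop independent of the classifier, so a location model fit could be substituted without changing the error-rate calculation.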
The constructed model was evaluated with simulated data,
where the researcher selected two groups having
multivariate normal distributions, $MVN(\mu_1, \Sigma)$ and
$MVN(\mu_2, \Sigma)$, with $p = 50$ continuous variables
and $q = 50$ binary variables. The objects in the sample
were divided into two parts, a learning set and a test
set. The learning set was used in the construction of the
model and the test set for evaluation. The result showed
that the new approach can be used for data reduction and
is able to manage different types of variables (binary and
continuous simultaneously).
Miller et al. (1998) considered the problem of
discriminant analysis with discrete (categorical) and
continuous variables with data missing at random. They
used a hypothesis testing approach based on the
generalized likelihood ratio and used bootstrapping to
determine the critical values in order to control the type 1
error rate. The work used three algorithms for dealing
with this case, each assuming a different model for the
data.
1. The Indicator Algorithm replaces categorical variables
with indicator variables and treats these as if they were
continuous.
2. The Full Algorithm assumes a multinomial
distribution for the discrete part and a multivariate
normal distribution (with mean and covariance
depending on the discrete part) as the conditional
distribution of the continuous part, given the discrete
part.
3. The Common Algorithm assumes a multinomial
distribution for the discrete part and a multivariate
normal distribution (with only the mean depending on
the discrete part) as the conditional distribution of the
continuous part, given the discrete part (that is, a
common covariance matrix is assumed across all
multinomial cells).
The performance of these algorithms was compared through a
simulation study. The Indicator Algorithm seems to have the
highest power, but it also tends to display a higher type 1
error rate than desired. The Full and Common Algorithms
have very similar power, but the Common Algorithm appeared
to control the type 1 error more effectively and to be
least susceptible to problems occurring when some
multinomial cells are sparsely represented.
Ganeshanandam and Krzanowski (1990) adopted Fisher's
linear discriminant function for allocating a new
observation into one of two existing groups. The methods of
estimating the misclassification error rates were reviewed
and evaluated by Monte Carlo simulations. The
investigation was carried out under both ideal
(multivariate normal data) and non-ideal (multivariate
binary data) conditions. The assessment was based on the
usual mean square error (MSE) criterion and also on a
new criterion of optimism. The results showed that,
although there is a common cluster of good estimators
for both ideal and non-ideal conditions, the single best
estimator varies with the criterion used.
Efron (1975) looked at the efficiency of logistic
regression compared to normal discriminant analysis.
Suppose that a random vector $x$ can arise from one of
two $p$-dimensional normal populations differing in mean
but not in covariance:

$$x \sim N_p(\mu_1, \Sigma) \text{ with probability } \pi_1,$$
$$x \sim N_p(\mu_0, \Sigma) \text{ with probability } \pi_0,$$

where $\pi_1 + \pi_0 = 1$. If the parameters
$\pi_1, \pi_0, \mu_1, \mu_0, \Sigma$ are known, then $x$
can be assigned to a population on the basis of Fisher's
linear discriminant function

$$\lambda(x) = \beta_0 + \beta' x,$$
$$\beta_0 = \log\frac{\pi_1}{\pi_0} - \tfrac{1}{2}\left(\mu_1'\Sigma^{-1}\mu_1 - \mu_0'\Sigma^{-1}\mu_0\right),$$
$$\beta = \Sigma^{-1}(\mu_1 - \mu_0).$$

The assignment is to population 1 if $\lambda(x) > 0$ and
to population 0 if $\lambda(x) < 0$.
This method of assignment minimizes the expected
probability of misclassification. The author computed the
asymptotic relative efficiency of the two procedures;
logistic regression is shown to be between one half and two
thirds as effective as normal discrimination for
statistically interesting values of the parameters.
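As an illustration of Fisher's linear discriminant function with known parameters, the following Python sketch computes the intercept and coefficient vector for two variables and assigns observations. It is a sketch under stated assumptions: the helper names (`inv2`, `fisher_ldf`, `assign`) and the numerical parameter values are hypothetical, and the 2x2 inverse restricts it to two variables.

```python
import math

def inv2(s):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = s
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def matvec(m, v):
    """Matrix-vector product for nested lists."""
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def fisher_ldf(mu1, mu0, sigma, pi1, pi0):
    """Return (beta0, beta) of lambda(x) = beta0 + beta'x for known parameters."""
    s_inv = inv2(sigma)
    beta = matvec(s_inv, [a - b for a, b in zip(mu1, mu0)])
    quad1 = dot(mu1, matvec(s_inv, mu1))
    quad0 = dot(mu0, matvec(s_inv, mu0))
    beta0 = math.log(pi1 / pi0) - 0.5 * (quad1 - quad0)
    return beta0, beta

def assign(x, beta0, beta):
    """Population 1 if lambda(x) > 0, otherwise population 0 (ties go to 0 here)."""
    return 1 if beta0 + dot(beta, x) > 0 else 0

# Hypothetical parameters: equal priors, identity covariance.
beta0, beta = fisher_ldf([2.0, 0.0], [0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], 0.5, 0.5)
print(beta0, beta)
print(assign([1.5, 0.3], beta0, beta), assign([0.5, -1.0], beta0, beta))
```

With these values the boundary is the line where the first coordinate equals 1, the midpoint between the two means, as expected for equal priors and a common identity covariance.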
Takiokurita and Otsu (2009) worked on Logistic
Discriminant Analysis. They proposed a novel non-linear
discriminant analysis (LGDA) in which the posterior
probabilities are estimated by multinomial logistic
regression (MLR). The discriminant spaces constructed by
Logistic Discriminant Analysis (LGDA) and Linear
Discriminant Analysis (LDA) were compared on standard
repository data sets. The experimental results show that
the discriminant space constructed by LGDA is better than
the one obtained by LDA.
James and Wilson (1978) looked at choosing between
logistic regression and discriminant analysis. They carried
out two experimental studies of non-normal classification
problems, compared the two methods and found that
logistic regression with maximum likelihood estimation
outperformed classical linear discriminant analysis in
both cases.
Panagiotakos (2006), in his paper titled “A Comparison
between logistic regression and linear discriminant
analysis for the prediction of categorical health outcome”,
investigated whether these two methods of analysis
result in similar findings in evaluating categorical health
outcomes. He concluded that logistic regression resulted
in the same model as did discriminant analysis.
Maja (2004), in the paper "Comparison of Logistic
Regression and Linear Discriminant Analysis: a simulation
study", considered the problem of choosing between the two
methods and set some guidelines for a proper choice.
The performance of the methods was assessed by several
measures of predictive accuracy and studied by
simulation. It was found that Linear Discriminant
Analysis (LDA) is the more appropriate method when the
explanatory variables are normally distributed. In the
case of categorized variables, LDA remains preferable and
fails only when the number of categories is really small.
The results of Logistic Regression (LR), however, are in
all these cases consistently close to, and a little worse
than, those of LDA.
But whenever the assumptions of LDA are not met,
the use of LDA is not justified, while LR gives good
results regardless of the distribution.
Osuji (2010) considered seven classification rules for
the discrete discriminant problem. A simulation
experiment was conducted to compare the performance of
all those rules. He observed that the procedure that is
most favoured is the optimal rule when three variables are
considered whereas when four variables are involved the
full multinomial rule is preferred in terms of minimizing
the expected actual error rate or expected cost of
misclassification.
Nocairi et al. (2006) said that linear and quadratic
discriminant analyses are likely to lead to unstable models
and poor prediction in the presence of quasi-collinearity
among variables or in small-sample, high-dimensional
settings. A simple regularization procedure was proposed
to cope with this problem. It is based on the introduction
of a tuning parameter that draws a line between linear or
quadratic discriminant analysis based on Mahalanobis
distance and discriminant analysis based on the identity
matrix. The tuning
parameter is customized to individual situation by
minimizing the cross-validated misclassification risk. The
efficiency of the method of analysis in comparison with
existing procedure was demonstrated on the basis of a
data set and a large simulation study.
3.1 Methodology
The data for the work comprise simulated and real
data. In the simulation experiment, a training data set was
generated using R-code and the average error rates were
computed for 2 ≤ 𝑞 ≤ 6 and 0.1 ≤ 𝑃1 , 𝑃2 ≤ 0.9 , for two
situations: situation (a), a case with no interaction
between discrete and continuous variables, and situation
(b), a case involving interaction between discrete and
continuous variables. Here 𝑞 is the number of components
of the discrete variables 𝑥 and 𝑝 is the number of
components of the continuous variables 𝑦 . The real data
came from secondary data consisting of 15 variables
obtained from the UNDP report on the Human Development
Index. Thirteen variables were continuous and two were
discrete; the data consist of countries classified as High
and Low Human Development in reports from 2014 to 2018.
The data were analyzed with discriminant analysis using the
location model.
3.2 Location Model
The model used in the development of this program is the
location model formulated by Krzanowski in 1975.
Let $x$ denote the vector of discrete variables with $q$
components, and let $y$ represent the vector of continuous
variables with $p$ components. The location model then
assumes that the probability of obtaining an observation
in the $j$th cell of the multinomial table is $p_{ij}$ in
population $\pi_i$ $(i = 1, 2;\ j = 1, 2, \ldots, k)$,
where $k$ is the number of cells.
If the discrete variables have allocated an individual to
cell $j$, the continuous variables $y$ have a multivariate
normal distribution with mean $\mu_i^{(j)}$ and dispersion
matrix $\Sigma$ in population $\pi_i$.
Then the conditional probability density of $y$, given
that the discrete variables locate the individual in cell
$j$, is

$$(2\pi)^{-p/2}\,|\Sigma|^{-1/2}\exp\left\{-\tfrac{1}{2}\left(y-\mu_i^{(j)}\right)'\Sigma^{-1}\left(y-\mu_i^{(j)}\right)\right\}$$

in $\pi_i$ $(i = 1, 2)$. Thus the joint probability density
of obtaining the individual in cell $j$ and observing the
continuous variable value $y$ is

$$p_{ij}\,(2\pi)^{-p/2}\,|\Sigma|^{-1/2}\exp\left\{-\tfrac{1}{2}\left(y-\mu_i^{(j)}\right)'\Sigma^{-1}\left(y-\mu_i^{(j)}\right)\right\}$$

in $\pi_i$ $(i = 1, 2)$.
3.3 Optimum Allocation Rule
If all population parameters are known, the allocation rule
for an observation $(x', y')'$, where $x$ is the vector of
discrete variables and $y$ that of the continuous
variables, is: assign to $\pi_1$ if

$$\left(\mu_1^{(m)}-\mu_2^{(m)}\right)'\Sigma^{-1}\left\{y-\tfrac{1}{2}\left(\mu_1^{(m)}+\mu_2^{(m)}\right)\right\} \ge \log\left(p_{2m}/p_{1m}\right)$$

and otherwise to $\pi_2$, where

$$m = 1 + \sum_{i=1}^{q} x_i\,2^{\,i-1}.$$
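The cell index and the optimum allocation rule can be illustrated with a short Python sketch. To stay self-contained it assumes a single continuous variable (p = 1), so the dispersion matrix reduces to a variance; the function names and all parameter values are hypothetical and are not taken from the study.

```python
import math

def cell_index(x):
    """m = 1 + sum_i x_i * 2**(i-1) for a binary vector x = (x_1, ..., x_q)."""
    return 1 + sum(xi * 2 ** i for i, xi in enumerate(x))  # enumerate starts at 0

def allocate(x, y, mu1, mu2, sigma2, p1, p2):
    """Optimum rule for one continuous variable (p = 1).

    mu1, mu2, p1, p2 are dicts keyed by cell index m; sigma2 is the common variance.
    Returns 1 for population pi_1 and 2 for population pi_2.
    """
    m = cell_index(x)
    lhs = (mu1[m] - mu2[m]) / sigma2 * (y - 0.5 * (mu1[m] + mu2[m]))
    return 1 if lhs >= math.log(p2[m] / p1[m]) else 2

# Hypothetical known parameters for q = 2 binary variables (k = 4 cells).
mu1 = {1: 0.0, 2: 1.0, 3: 1.0, 4: 2.0}
mu2 = {1: 1.0, 2: 2.0, 3: 2.0, 4: 3.0}
p1 = {1: 0.4, 2: 0.3, 3: 0.2, 4: 0.1}
p2 = {1: 0.1, 2: 0.2, 3: 0.3, 4: 0.4}

print(cell_index((1, 0, 1)))                               # cell for x = (1, 0, 1)
print(allocate((1, 0), 1.2, mu1, mu2, 1.0, p1, p2))
print(allocate((1, 1), 2.4, mu1, mu2, 1.0, p1, p2))
```

The second observation lies slightly closer to the first population's cell mean but is still assigned to the second population, because that cell is four times as probable there; this is the effect of the log-ratio threshold.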
3.4 Estimated Allocation Rule
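Let $n_{1m}$ and $n_{2m}$ denote the number of observations
falling in cell $m$ of the table from $\pi_1$ and $\pi_2$,
and let $y_{ij}^{(m)}$ denote the vector of continuous
variables associated with the $j$th observation in cell
$m$ of the sample from $\pi_i$. Then, if

$$\bar{y}_i^{(m)} = \frac{1}{n_{im}}\sum_{j=1}^{n_{im}} y_{ij}^{(m)},$$

the maximum likelihood estimates of the population
parameters $\hat{p}_{im}$, $\hat{\mu}_i^{(m)}$ and
$\hat{\Sigma}$ (the last with the degrees-of-freedom
divisor $n_1 + n_2 - 2k$) are given by:

$$\hat{p}_{im} = \frac{n_{im}}{n_i}; \qquad \hat{\mu}_i^{(m)} = \bar{y}_i^{(m)}$$

$$\hat{\Sigma} = \frac{1}{n_1+n_2-2k}\sum_{i=1}^{2}\sum_{m=1}^{k}\sum_{j=1}^{n_{im}}\left(y_{ij}^{(m)}-\bar{y}_i^{(m)}\right)\left(y_{ij}^{(m)}-\bar{y}_i^{(m)}\right)'$$

where $i = 1, 2;\ m = 1, 2, \ldots, k$.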
3.5 Estimation of Error Rates
The success of an allocation rule can be assessed by the
probabilities of misclassification, or error rates, that it
gives rise to. The error rates are given by:

$$P(2 \mid 1) = \sum_{m=1}^{k} p_{1m}\,\Phi\left\{\frac{\log\left(p_{2m}/p_{1m}\right) - \tfrac{1}{2}D_m^2}{D_m}\right\}$$

and

$$P(1 \mid 2) = \sum_{m=1}^{k} p_{2m}\,\Phi\left\{\frac{\log\left(p_{1m}/p_{2m}\right) - \tfrac{1}{2}D_m^2}{D_m}\right\}$$

where $\Phi$ is the cumulative standard normal distribution
function and

$$D_m^2 = \left(\mu_1^{(m)}-\mu_2^{(m)}\right)'\Sigma^{-1}\left(\mu_1^{(m)}-\mu_2^{(m)}\right)$$

is the Mahalanobis squared distance between $\pi_1$ and
$\pi_2$ in cell $m$ of the multinomial table.
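The two error rates can be evaluated directly once the cell probabilities and the squared Mahalanobis distances are available. The Python sketch below is illustrative only: the function names and parameter values are assumptions, and the standard normal distribution function is obtained from `math.erf`.

```python
import math

def std_normal_cdf(z):
    """Cumulative standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def error_rates(p1, p2, d2):
    """Return (P(2|1), P(1|2)) for the optimum rule.

    p1[m], p2[m]: cell probabilities in pi_1 and pi_2;
    d2[m]: squared Mahalanobis distance D_m^2 in cell m (must be positive).
    """
    e21 = e12 = 0.0
    for m in d2:
        d = math.sqrt(d2[m])
        e21 += p1[m] * std_normal_cdf((math.log(p2[m] / p1[m]) - 0.5 * d2[m]) / d)
        e12 += p2[m] * std_normal_cdf((math.log(p1[m] / p2[m]) - 0.5 * d2[m]) / d)
    return e21, e12

# Hypothetical case: two cells, identical cell probabilities, D_m = 2 in each cell.
p1 = {1: 0.5, 2: 0.5}
p2 = {1: 0.5, 2: 0.5}
d2 = {1: 4.0, 2: 4.0}
e21, e12 = error_rates(p1, p2, d2)
print(e21, e12)
```

With equal cell probabilities the log-ratio term vanishes and both error rates reduce to the standard normal distribution function evaluated at minus half the distance, which gives a quick check of the implementation.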
4.1 Data analysis
The results for the simulated data are shown in the table
below. In the simulated experiment, with the training set
generated by R-code, every average error rate was below
30%, which indicates that the rule performs well in
minimizing the probability of misclassification.
The average error rate for the real data is 0.2694, which
shows that the proportion of errors made by the rule is
low and that the rule performs well in minimizing the
probability of misclassification.
THE RESULTS OF THE AVERAGE ERROR RATE (LOCATION MODEL, LM)
FOR 2 ≤ q ≤ 6 AND 0.1 ≤ P1, P2 ≤ 0.9

P1    P2    Situation   q=2      q=3      q=4      q=5      q=6
0.1   0.1   a           0.2323   0.2490   0.2520   0.2790   0.2777
0.1   0.1   b           0.2560   0.2478   0.2585   0.2510   0.2493
0.1   0.3   a           0.2605   0.2660   0.2315   0.2540   0.2550
0.1   0.3   b           0.2765   0.2583   0.2463   0.2433   0.2460
0.1   0.5   a           0.2665   0.2405   0.2568   0.2560   0.2569
0.1   0.5   b           0.2388   0.2455   0.2485   0.2435   0.2437
0.1   0.7   a           0.2425   0.2668   0.2665   0.2420   0.2447
0.1   0.7   b           0.2415   0.2595   0.2555   0.2586   0.2581
0.1   0.9   a           0.2265   0.2443   0.2383   0.2570   0.2575
0.1   0.9   b           0.2475   0.2538   0.2535   0.2660   0.2661
0.3   0.3   a           0.2490   0.2520   0.2518   0.2403   0.2405
0.3   0.3   b           0.2295   0.2615   0.2483   0.2518   0.2494
0.3   0.5   a           0.2463   0.2505   0.2550   0.2570   0.2561
0.3   0.5   b           0.2570   0.2473   0.2560   0.2528   0.2520
0.3   0.7   a           0.2385   0.2515   0.2590   0.2405   0.2428
0.3   0.7   b           0.2248   0.2523   0.2513   0.2323   0.2222
0.3   0.9   a           0.2565   0.2378   0.2558   0.2363   0.2346
0.3   0.9   b           0.2278   0.2293   0.2533   0.2478   0.2487
0.5   0.5   a           0.2235   0.2513   0.2520   0.2438   0.2428
0.5   0.5   b           0.2428   0.2533   0.2540   0.2430   0.2450
0.5   0.7   a           0.2595   0.2503   0.2510   0.2350   0.2353
0.5   0.7   b           0.2458   0.2508   0.2298   0.2436   0.2436
0.5   0.9   a           0.2503   0.2575   0.2533   0.2598   0.2634
0.5   0.9   b           0.2438   0.2288   0.2505   0.2498   0.2493
0.7   0.7   a           0.2388   0.2635   0.2595   0.2503   0.2508
0.7   0.7   b           0.2715   0.2403   0.2505   0.2703   0.2708
0.7   0.9   a           0.2363   0.2473   0.2420   0.2438   0.2778
0.7   0.9   b           0.2315   0.2383   0.2458   0.2430   0.2493

Confusion Matrix using Location Model Approach for
Real Data

               Location Model
Observed       Sample 1    Sample 2
Sample 1       2           11
Sample 2       2           19

Source: R-Console 3.1.20
Misclassification Rate = (2 + 11)/34 = 0.38
5.1 Findings
The average error rate obtained from the developed
program with R-code is comparable with those obtained by
Krzanowski (1975).
The average error rate remained stable when the
experiment was repeated a number of times under
identical conditions.
In both situations (a) and (b), that is, without
interaction and with interaction respectively, the results
were satisfactory.
The number of misclassifications was not large and the
percentage of misclassification was under control.
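The misclassification rate reported for the real data can be recomputed from the confusion matrix counts, as in the short sketch below. The counts are those given in the confusion matrix for the real data. The object names are illustrative, and the reading of rows as observed samples and columns as allocated samples is an assumption.

```r
# Counts copied from the confusion matrix reported for the real data.
# Assumption: rows = observed sample, columns = sample allocated by the location model.
confusion <- matrix(c(2, 11,
                      2, 19),
                    nrow = 2, byrow = TRUE,
                    dimnames = list(observed  = c("Sample 1", "Sample 2"),
                                    allocated = c("Sample 1", "Sample 2")))

n_total <- sum(confusion)

# Sum used in the text: (2 + 11) / 34
rate_reported <- (2 + 11) / n_total

# Off-diagonal share of the matrix, which gives the same value here
rate_offdiag <- 1 - sum(diag(confusion)) / n_total

print(round(rate_reported, 2))
```

Both expressions give 13/34, which rounds to the 0.38 quoted with the confusion matrix.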
5.2 Conclusion
The classification rule based on the program developed
in the R programming language using the location model
gave a satisfactory result in terms of minimizing the
average error rate, and the result is comparable with
that of the original work by Krzanowski (1975).
5.3 Recommendation
Since the developed R program gave a satisfactory result
in terms of minimizing the average error rate in
classification involving mixed variables, we recommend
that statisticians use the location model program when
classifying objects described by mixed variables.
References
Efron, B. (1975). The efficiency of logistic regression
compared to normal discriminant analysis. Journal of
American Statistical Association. Vol. 70, No. 352
El-Habil, A. M., & El-Jazzar, M. (2013). A Comparative
Study between Linear Discriminant Analysis and
Multinomial Logistic Regression. An-Najah University
Journal for Research - Humanities, 28, 1525-1548.
El-Hanjouri, M. M. R. and Hamad, B. S. (2015). Using
Cluster Analysis and Discriminant Analysis Methods in
Classification with Application on Standard of Living
Family in Palestinian Areas. International Journal of
Statistics and Applications. 5(5):213-222
Fernandez, G. (2009). Discriminant Analysis, a Powerful
Classification Technique in Predictive Modeling.
George Fernandez University of Nevada. Reno.
Ganeshanandam and Krzanowski (1990). Variable
selection in discriminant analysis based on the
location model for mixed variables. Advance Data
Analysis. 1: 105–122.
Hamid, H. H. (2010). A new approach for classifying large
number of mixed variables. World Academy of Science
Engineering & Technology. Vol. 46, 10-22
Hamid, H., Mei, L. M. & Yahaya, S. S. (2017). New
Discrimination Procedure of Location Model for
Handling Large Categorical Variables. Sains
Malaysiana. 46(6): 1001–1010
Ikechukwu, E. (2016). Evaluation of Error Rate Estimators
in Discriminant Analysis with Multivariate Binary
Variables. American Journal of Theoretical and Applied
Statistics. 5, 173.
James, P. & Wilson, S. (1978). Choosing between logistic
regression and discriminant analysis. Journal of
American Statistical Association. Vol. 73, No. 364, 699-705
Johnson, R. A. and Wichern, D. W. (1992). Applied
multivariate statistical analysis. Englewood Cliffs, New
Jersey
Krzanowski, W. J. (1975). Discrimination and classification
using both binary and continuous variables. Journal of
American Statistical Association. 70, 782–790.
Krzanowski, W. J. (1982). Mixture of continuous and
categorical variables in discriminant analysis: A
Hypothesis - Testing Approach. Biometrics. Vol. 38,
991-1002
Lachenbruch, P. A. (1975). Discriminant analysis. Hafner
Press, New York.
Maja, B. L. (2004). Discriminant analysis using mixed
continuous, dichotomous, and ordered categorical
variables, Multivariate. Res. 42 (2007), 631–645.
Miller, J. W., Woodward, W. A. & Gray, H. L. (1998). A
hypothesis testing approach to discriminant analysis
with mixed categories and continuous variables when
data are missing. A scientific report. No. 1 Philips
laboratory directorate of geophysics air force material.
MA 0173-3010
Nocairi, H., Qannari, E. M. & Hanafi, M. (2006). A simple
regularization procedure for discriminant analysis.
Communications in Statistics - Simulation and
Computation. Vol. 35, 957-967
Onyeagu, S. I. (2003). A first course in multivariate
statistical analysis. Mega Concept.
Osuji, G. A. (2010). Evaluation of some classification
procedures for binary variables. PhD thesis. Nnamdi
Azikiwe University, Awka
Panagiotakos, D. B. (2006). A comparison between
logistic regression and linear discriminant analysis for
the Prediction of categorical health outcome. Journal
of Statistical Science. Vol. 5, 73-84
Takiokurita, K. & Otsu, N. (2009). Logistic discriminant.
Proceedings of the International Conference on Systems,
Man and Cybernetics. San Antonio, Texas, USA.