DEVELOPMENT OF STATISTICAL PROGRAM FOR DISCRIMINANT ANALYSIS FOR MIXED VARIABLES USING R-CODE

AsPoly Journal of Sciences, Engineering and Environmental Studies, Volume 1: No. 1, March 2020, pp. 58 – 79
Iwuagwu, Chukwuma E.; Ibegbulem, Zebulon O.; Nwosu, Obinna M.; Ekwuribe, Celestine
Department of Statistics, Abia State Polytechnic, Aba, Abia State.
Abstract
Discriminant analysis is a statistical technique that predicts, on the basis of one or more independent variables, whether an individual or other object can be placed into a particular category of a categorical dependent variable. This research aims to develop a statistical program for discriminant analysis with mixed (discrete and continuous) variables using the R programming language. The data for the research were obtained from simulation and from a real data set. In the simulation experiment, a training data set was generated using R code. The real data are secondary data consisting of fifteen variables obtained from the UNDP report on human development. The results obtained with the R programming language show that the developed program performs satisfactorily in terms of minimizing the average error rate.
Keywords: Location model, Discriminant, Discrete and Continuous, R-Code
1.1 Introduction
Discriminant analysis is a statistical technique that helps one understand the differences between two or more groups of objects with respect to several variables simultaneously. The procedure predicts, on the basis of one or more predictors or independent variables, whether an individual or other subject can be placed into a particular category of a categorical dependent variable. The purpose is to determine the predictor variables on the basis of which the groups can be distinguished; the discriminant model is built from a set of observations for which the groups are known.
Lachenbruch (1975) viewed the problem of discriminant analysis as that of assigning an unknown observation to a group with a low error rate. Johnson and Wichern (1992) defined discriminant analysis and classification as multivariate techniques concerned with separating distinct sets of objects (or observations) and with allocating new objects to previously defined groups. In general, discriminant analysis is concerned with developing a rule for allocating objects to one of several distinct groups; the rule is then used to determine the group of future objects. Some examples of classification problems are:
1) A patient is admitted to a hospital with a diagnosis of myocardial infarction. The medical doctor examines the patient to obtain the systolic blood pressure, heart rate, stroke index and mean arterial pressure. The measurements are used to determine the patient's probability of survival (Onyeagu, 2003).
2) In the admission of a candidate to a programme in a higher institution, the candidate must be assigned to one of the categories Admit or Do-not-admit (Osuji, 2010).
3) In banking, an officer in charge of credit facilities may wish to classify loans or credit given to customers as either good or bad, or as low or high credit risk (Osuji, 2010).
4) An archaeologist obtains a skull and wants to know whether it belongs to a tribe that inhabited an area 20,000 years ago or to a successor tribe that lived nearby. On the basis of measurements made on a set of skulls from each of the two populations, the assignment may be made (Onyeagu, 2003).
Real-world problems often involve a mixture of continuous and discrete variables. A discrete variable is a random variable whose set of possible values is a finite or countably infinite sequence of numbers, $x_1, x_2, \cdots$. The set of values is a subset of the non-negative integers, for example 0 and 1.
A continuous variable, by contrast, is a variable whose set of possible values is a continuous interval of real numbers.
1.2 Statement of problem
Real-world data often comprise both discrete and continuous variables. The researcher is therefore faced with the problem of discriminating between two groups and allocating individuals to one or the other, and there is no current program for this task.
1.3 Objectives of the study
The objectives of the study are:
(a) To develop a current programme in the R programming language that can handle mixed variables.
(b) To estimate the error rate.
(c) To apply the developed programme to the location model using simulated and real data.
1.4 Significance of the study
The successful development of this programme will spur researchers in fields such as the Sciences, Business and Social Sciences into further research in this and related areas, thereby expanding knowledge.
2.1 Literature Review
Hamid, Mei and Yahaya (2017), in their work "New Discrimination Procedure of Location Model for Handling Large Categorical Variables", note that the location model is a predictive discriminant rule that can classify new observations into one of two predefined groups based on mixtures of continuous and categorical variables. The ability of the location model to discriminate new observations correctly depends strongly on the number of multinomial cells created by the categorical variables. Their preliminary investigation shows that the location model fitted by maximum likelihood estimation has a high misclassification rate, up to 45% on average, when dealing with more than six categorical variables, for all 36 data sets tested. The model performed badly for large numbers of categorical variables even with large sample sizes.
To alleviate the high rate of misclassification, the authors embedded a new strategy in the discriminant rule by introducing nonlinear principal component analysis (NPCA) into the classical location model (cLM), mainly to handle the large number of categorical variables. The strategy was investigated on simulated and real data sets, with the misclassification rate estimated by the leave-one-out method. The numerical results demonstrate the feasibility of the proposed model: the misclassification rate decreased dramatically compared with the cLM for all 18 data settings. A practical application using a real data set shows a significant improvement, with results comparable to the best of the methods compared. Overall, the proposed model extended the applicability of the location model, which was previously limited to six categorical variables for acceptable performance. The study showed that the proposed model can be used as an alternative for mixed-variable classification problems, primarily when a large number of categorical variables is involved.
Ikechukwu (2016), in his work titled "Evaluation of Error Rate Estimators in Discriminant Analysis with Multivariate Binary Variables", observed that classification problems often suffer from small samples in conjunction with a large number of features, which makes error estimation problematic. When a sample is small, there are insufficient data to split the sample, and the same data are used for both classifier design and error estimation. Error estimation can suffer from high variance, bias or both.
The problem of choosing a suitable error estimator is exacerbated by the fact that estimation performance depends on the rule used to design the classifier, the feature-label distribution to which the classifier is applied, and the sample size. The paper evaluated error rate estimators in two-group discriminant analysis with multivariate binary variables. The behaviour of the eight most commonly used estimators was compared by means of Monte Carlo simulation. The criterion used for comparing the estimators was the sum squared error rate (SSE). Four experimental factors were considered in the simulation: the number of variables, the sample size relative to the number of variables, the prior probability, and the correlation between the variables in the populations. Two major results were obtained. First, the simulation experiments ranked the estimators as follows: DS, O, OS, U, R, JK, P and D, with the DS estimator the best. Second, it is better to increase the number of variables, because accuracy increases with the number of variables. The general trend for the estimators was an increase in error rate as the sample size decreases, while decreasing the distance between populations generally increases the error rate. The DS estimator was the most consistent, and thus the most reliable, over all combinations of probability pattern and sample size.
El-Hanjouri and Hamad (2015), in their work titled "Using Cluster Analysis and Discriminant Analysis Methods in Classification with Application on Standard of Living Family in Palestinian Areas", applied methods of multivariate statistical analysis, especially cluster analysis (CA), to identify the disparity in family living standards among the Palestinian areas.
The results showed a convergence in family living standards between two areas that formed the first cluster, of high living standards: the urban and camp areas of the middle West Bank. There was also a convergence among seven areas that formed the second cluster, of middle living standards: the urban, camp and rural areas of the North West Bank; the urban, camp and rural areas of the South West Bank; and the rural area of the middle West Bank. In addition, three areas formed the third cluster, of low living standards: the urban, rural and camp areas of the Gaza Strip. After comparing several methods of cluster analysis through cluster validation (Hierarchical Cluster Analysis, K-means Clustering and K-medoids Clustering), the authors preferred Hierarchical Cluster Analysis. An examination of the linkage methods through the agglomerative coefficient (single, complete, average and Ward linkage) then favoured Ward linkage, which was used in the classification. Moreover, Discriminant Analysis (DA) was applied to identify the variables that contribute significantly to this disparity among families within the Palestinian areas; the results show that monthly income, assistance, agricultural land, animal holdings, total expenditure, imputed rent, remittances and non-consumption expenditure contribute significantly to the disparity.
El-Habil and El-Jazzar (2013), in their paper titled "A Comparative Study between Linear Discriminant Analysis and Multinomial Logistic Regression", compared two methods of classification, linear discriminant analysis (LDA) and multinomial logistic regression (MLR), using the overall classification accuracy, the quality of prediction in terms of sensitivity and specificity, and the area under the ROC curve (AUC), in order to ease the choice between the two methods and to understand how the two models behave under different data and group characteristics. Model performance was assessed with two special cases of the k-fold partitioning technique, the 'leave-one-out' and 'hold-out' procedures. The performance evaluation was carried out using real data and also by simulation. The results show that logistic regression slightly exceeds linear discriminant analysis in the correct classification rate, but when sensitivity, specificity and AUC are taken into account, the differences in AUC were negligible. By simulation, the authors examined the impact of changes in sample size, distance between group means, categorization, and correlation matrices between the predictors on the performance of each method. The results indicate that variation in sample size, values of the Euclidean distance and number of categories have a similar impact on the two methods, and that both LDA and MLR show a significant improvement in classification accuracy in the absence of multicollinearity among the explanatory variables.
Fernandez (2009), in his work "Discriminant Analysis, a Powerful Classification Technique in Predictive Modeling", observed that discriminant analysis is one of the classical classification techniques used to discriminate a single categorical variable using multiple attributes.
Discriminant analysis also assigns observations to one of the pre-defined groups based on knowledge of the multiple attributes. When the distribution within each group is multivariate normal, a parametric method can be used to develop a discriminant function using a generalized squared distance measure. The classification criterion is derived from either the individual within-group covariance matrices or the pooled covariance matrix, and also takes into account the prior probabilities of the classes. Non-parametric discriminant methods are based on non-parametric group-specific probability densities. Either a kernel or the k-nearest-neighbor method can be used to generate a non-parametric density estimate in each group and to produce a classification criterion. The performance of a discriminant criterion can be evaluated by estimating the probabilities of misclassification of new observations in the validation data.
Krzanowski (1975) considered the problem of discriminating between two groups and allocating individuals to one or the other when the available data consist of both binary and continuous variables. He derived a discriminant function from a probabilistic model for mixed binary and continuous variables and assessed the utility of the allocation rule through its probabilities of misclassification, or error rates. Krzanowski compared the location model with Fisher's linear discriminant function (LDF). The simulation study was conducted in two situations. In situation (a), with no interaction, the average error rates of the two methods were very similar for all combinations of parameter values; in situation (b), with interactions, the average error rate of Fisher's LDF was higher than that of the location model for all parameter values.
Krzanowski (1982) derived an allocation rule for mixtures of variables based on hypothesis testing. Let the $j$th member of the training set from $\pi_i$ be denoted by
$$v_j^{(i)} = \begin{pmatrix} x_j^{(i)} \\ y_j^{(i)} \end{pmatrix}$$
where $x_j^{(i)}$, the vector of dummy binary variables, is obtained from the discrete component of $v_j^{(i)}$, and $y_j^{(i)}$ is the vector of continuous variables. Furthermore, let $n_{im}$ denote the number of members of the training set from $\pi_i$ that fall in cell $m$ of the contingency table defined by $x$, and let $n_i = \sum_{m=1}^{k} n_{im}$ be the sample size of the training set from $\pi_i$ $(i = 1, 2;\ m = 1, 2, \ldots, k)$. Then, using the location model assumption, the likelihood of the two training sets is given by:
$$L = (2\pi)^{-p(n_1+n_2)/2}\,|\Sigma|^{-(n_1+n_2)/2} \left\{\prod_{i=1}^{2}\prod_{m=1}^{k} p_{im}^{\,n_{im}}\right\} \exp\left\{-\tfrac{1}{2}\sum_{i=1}^{2}\sum_{j=1}^{n_i} \left(y_j^{(i)} - \mu_{ij}\right)'\Sigma^{-1}\left(y_j^{(i)} - \mu_{ij}\right)\right\} \qquad (2)$$
where $\mu_{ij}$ takes the value $\mu_i^{(m)}$ defined by
$$\mu_i^{(m)} = \nu_i + \sum_{j=1}^{q} \alpha_{i,j}\, x_j^{(m)} + \sum_{j<k} \beta_{i,jk}\, x_j^{(m)} x_k^{(m)}$$
for all $j$ which fall in the $m$th cell of the contingency table constructed from $x$ $(i = 1, 2;\ m = 1, 2, \ldots, k)$. Consider an extra individual, $v$, to be classified, and suppose that its discrete components place it in cell $m$ of the contingency table. If this individual is included with the training set from $\pi_i$, then an extra multiplying factor
$$L_{im} = (2\pi)^{-p/2}\,|\Sigma|^{-1/2}\, p_{im} \exp\left\{-\tfrac{1}{2}\left(y - \mu_i^{(m)}\right)'\Sigma^{-1}\left(y - \mu_i^{(m)}\right)\right\}$$
must be incorporated in equation (2) above. The hypothesis-testing allocation rule is therefore given by evaluating the statistic
$$\lambda = \frac{\sup\left(L_{1m} L\right)}{\sup\left(L_{2m} L\right)}$$
and allocating $v$ to $\pi_1$ if $\lambda \geq 1$ and to $\pi_2$ otherwise.
Maximization of $L_{im}L$ follows the same general lines as in the estimation approach to the problem. The only difference is that for $\sup(L_{1m}L)$ we have $n_{1m} + 1$ observations in cell $m$ for $\pi_1$, and $y$ is included with the other continuous vectors having mean $\mu_1^{(m)}$, while for $\sup(L_{2m}L)$ we have $n_{2m} + 1$ observations in cell $m$ for $\pi_2$, and $y$ is included with the $n_{2m}$ continuous vectors having mean $\mu_2^{(m)}$. All other cells are unchanged in both maximizations.
Hamid (2010) focused on constructing a model when the data involve a combination of mixed variables. The researcher used the following steps to arrive at the final algorithm of the location model:
i) Omit object $i$ from the sample of size $n$, where $i = 1, 2, \ldots, n$.
ii) Perform principal component analysis (PCA) on the continuous variables of the remaining objects (the $n_1 + n_2$ objects less object $i$) to choose the best combination of components or to reduce the dimension.
iii) Repeat step (ii), conducting PCA on the binary variables, then combine the results of steps (ii) and (iii) to produce 2PCA.
iv) Estimate $\mu_i$, $\Sigma$ and $p_i$ using the new components resulting from 2PCA, and construct the location model function.
v) Pool and run steps (ii)-(iv) together to produce a new algorithm, 2PCA plus LM.
vi) Predict the group of the omitted object $i$ using the newly constructed model; if the prediction is correct, assign the error $e_{ij} = 0$, otherwise $e_{ij} = 1$.
vii) Repeat steps (i)-(vi) for all objects in turn.
viii) Compute the leave-one-out error rate as $\left(\sum e_{ij} / n\right) \times 100$.
The model was evaluated with simulated data, in which the researcher selected two groups having multivariate normal distributions, $MVN(\mu_1, \Sigma)$ and $MVN(\mu_2, \Sigma)$, with $p = 50$ continuous variables and $q = 50$ binary variables. The objects in the sample were divided into two parts, a learning set and a test set.
The learning set was used to construct the model and the test set to evaluate it. The results showed that the new approach can be used for data reduction and can manage different types of variables (binary and continuous) simultaneously.
Miller et al. (1998) considered the problem of discriminant analysis with discrete (categorical) and continuous variables when data are missing at random. They used a hypothesis-testing approach based on the generalized likelihood ratio and used bootstrapping to determine the critical values in order to control the type 1 error rate. The work used three algorithms, each assuming a different model for the data:
1. The Indicator Algorithm replaces categorical variables with indicator variables and treats these as if they were continuous.
2. The Full Algorithm assumes a multinomial distribution for the discrete part and, as the conditional distribution of the continuous part given the discrete part, a multivariate normal distribution with mean and covariance depending on the discrete part.
3. The Common Algorithm assumes a multinomial distribution for the discrete part and, as the conditional distribution of the continuous part given the discrete part, a multivariate normal distribution with only the mean depending on the discrete part (that is, a common covariance matrix is assumed across all multinomial cells).
The performance of these algorithms was compared through a simulation study. The Indicator Algorithm seems to have the highest power, but it also tends to display a higher type 1 error rate than desired. The Full and Common Algorithms have very similar power, but the Common Algorithm appeared to control the type 1 error more effectively and to be the least susceptible to problems that occur when some multinomial cells are sparsely represented.
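The leave-one-out error-rate estimate that recurs in the studies reviewed above (for example, steps (i)-(viii) of Hamid, 2010) can be sketched as follows. This is a minimal illustration in Python, not the R program developed in this paper: the function names are hypothetical, and the nearest-group-mean classifier on a single continuous variable is only a stand-in for the location model (the PCA steps are omitted).

```python
def leave_one_out_error_rate(samples, labels, fit, predict):
    """Percentage of objects misclassified when each is held out in turn."""
    n = len(samples)
    errors = 0
    for i in range(n):
        # Omit object i and fit the rule on the remaining objects.
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = fit(train_x, train_y)
        # Error indicator: 1 if the omitted object is misclassified, else 0.
        if predict(model, samples[i]) != labels[i]:
            errors += 1
    return 100.0 * errors / n


def fit_nearest_mean(xs, ys):
    """Stand-in classifier (an assumption, not the location model):
    store the mean of the single continuous variable in each group."""
    means = {}
    for group in set(ys):
        values = [x for x, y in zip(xs, ys) if y == group]
        means[group] = sum(values) / len(values)
    return means


def predict_nearest_mean(means, x):
    """Allocate x to the group whose mean is closest."""
    return min(means, key=lambda group: abs(x - means[group]))


# Hypothetical toy data: one continuous variable, two groups.
samples = [1.0, 1.2, 0.8, 3.0, 3.1, 2.9, 2.1]
labels = [1, 1, 1, 2, 2, 2, 1]
rate = leave_one_out_error_rate(
    samples, labels, fit_nearest_mean, predict_nearest_mean
)
print(f"leave-one-out error rate: {rate:.2f}%")
```

Because the classifier is passed in as a pair of functions, the same loop could wrap any allocation rule, including one built from the location model estimates.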
Ganeshanandam and Krzanowski (1990) adopted Fisher's linear discriminant function for allocating a new observation to one of two existing groups. Methods of estimating the misclassification error rates were reviewed and evaluated by Monte Carlo simulation. The investigation was carried out under both ideal (multivariate normal data) and non-ideal (multivariate binary data) conditions. The assessment was based on the usual mean square error (MSE) criterion and also on a new criterion of optimism. The results showed that, although there is a common cluster of good estimators for both ideal and non-ideal conditions, the single best estimator varies with the criterion.
Efron (1975) examined the efficiency of logistic regression compared with normal discriminant analysis. Suppose that a random vector $x$ can arise from one of two $p$-dimensional normal populations differing in mean but not in covariance:
$$x \sim N_p(\mu_1, \Sigma) \text{ with probability } \pi_1, \qquad x \sim N_p(\mu_0, \Sigma) \text{ with probability } \pi_0,$$
where $\pi_1 + \pi_0 = 1$. If the parameters $\pi_1$, $\pi_0 = 1 - \pi_1$, $\mu_1$, $\mu_0$ and $\Sigma$ are known, then $x$ can be assigned to a population on the basis of Fisher's linear discriminant function
$$\lambda(x) = \beta_0 + \beta' x, \qquad \beta_0 = \log\frac{\pi_1}{\pi_0} - \tfrac{1}{2}\left(\mu_1'\Sigma^{-1}\mu_1 - \mu_0'\Sigma^{-1}\mu_0\right), \qquad \beta' = (\mu_1 - \mu_0)'\Sigma^{-1}.$$
The assignment is to population 1 if $\lambda(x) > 0$ and to population 0 if $\lambda(x) < 0$. This method of assignment minimizes the expected probability of misclassification. The author computed the asymptotic relative efficiency of the two procedures; logistic regression was shown to be between one half and two thirds as effective as normal discrimination for statistically interesting values of the parameters.
Takiokurita and Otsu (2009) worked on logistic discriminant analysis.
They proposed a novel non-linear discriminant analysis, logistic discriminant analysis (LGDA), in which the posterior probabilities are estimated by multinomial logistic regression (MLR). Experimental results compare the discriminant spaces constructed by LGDA and by linear discriminant analysis (LDA) on standard repository data sets, and show that the space constructed by LGDA is better than the one obtained by LDA.
James and Wilson (1978) looked at choosing between logistic regression and discriminant analysis. They carried out two empirical studies of non-normal classification problems, compared the two methods, and found that logistic regression with maximum likelihood estimation outperformed classical linear discriminant analysis in both cases.
Panagiotakos (2006), in his paper titled "A Comparison between logistic regression and linear discriminant analysis for the prediction of categorical health outcome", investigated whether the two methods give similar findings in evaluating categorical health outcomes. He concluded that logistic regression resulted in the same model as discriminant analysis.
Maja (2004), in a simulation study comparing logistic regression and linear discriminant analysis, considered the problem of choosing between the two methods and set some guidelines for a proper choice. The performance of the methods, studied by simulation, was assessed on several measures of predictive accuracy. Linear discriminant analysis (LDA) was found to be the more appropriate method when the explanatory variables are normally distributed. In the case of categorized variables, LDA remains preferable and fails only when the number of categories is really small.
The results of logistic regression (LR), however, are in all these cases consistently close to, and a little worse than, those of LDA. But whenever the assumptions of LDA are not met, the use of LDA is not justified, while LR gives good results regardless of the distribution.
Osuji (2010) considered seven classification rules for the discrete discriminant problem. A simulation experiment was conducted to compare the performance of the rules. He observed that the most favoured procedure is the optimal rule when three variables are considered, whereas with four variables the full multinomial rule is preferred, in terms of minimizing the expected actual error rate or the expected cost of misclassification.
Nocairi et al. (2006) noted that linear and quadratic discriminant analyses are likely to lead to unstable models and poor prediction in the presence of quasi-collinearity among variables or in small-sample, high-dimensional settings. A simple regularization procedure was proposed to cope with this problem. It is based on introducing a tuning parameter that spans the range between linear or quadratic discriminant analysis based on the Mahalanobis distance and discriminant analysis based on the identity matrix. The tuning parameter is customized to each situation by minimizing the cross-validated misclassification risk. The efficiency of the method in comparison with existing procedures was demonstrated on a data set and a large simulation study.
3.1 Methodology
The work is based on simulated and real data.
In the simulation experiment, a training data set was generated using R code and the average error rates were computed for $2 \le q \le 6$ and $0.1 \le P_1, P_2 \le 0.9$ in two situations: situation (a), with no interaction between the discrete and continuous variables, and situation (b), with interaction between the discrete and continuous variables. Here $q$ is the number of components of the vector of discrete variables $x$ and $p$ is the number of components of the vector of continuous variables $y$. The real data are secondary data consisting of 15 variables obtained from the UNDP Human Development Index reports for 2014 – 2018. Thirteen variables are continuous and two are discrete; the observations are countries classified as High or Low Human Development. The data were analyzed by discriminant analysis using the location model.
3.2 Location Model
The model used in the development of this program is the location model formulated by Krzanowski in 1975. Let $x$ denote the vector of discrete variables with $q$ components and $y$ the vector of continuous variables with $p$ components. The location model assumes that the probability of obtaining an observation in the $j$th cell of the multinomial table is $p_{ij}$ in population $\pi_i$ $(i = 1, 2;\ j = 1, 2, \ldots, k)$, where $k = 2^q$ is the number of cells. If the discrete variables locate an individual in cell $j$, the continuous variables $y$ have a multivariate normal distribution with mean $\mu_i^{(j)}$ and dispersion matrix $\Sigma$ in population $\pi_i$. Then the conditional probability density of $y$, given that the discrete variables locate the individual in cell $j$, is
$$(2\pi)^{-c/2}\,|\Sigma|^{-1/2} \exp\left\{-\tfrac{1}{2}\left(y - \mu_i^{(j)}\right)'\Sigma^{-1}\left(y - \mu_i^{(j)}\right)\right\}$$
in $\pi_i$ $(i = 1, 2)$, where $c$ is the number of continuous variables (denoted $p$ above).
Thus the joint probability density of obtaining the individual in cell $j$ and observing the continuous variable value $y$ is
$$p_{ij}\,(2\pi)^{-c/2}\,|\Sigma|^{-1/2} \exp\left\{-\tfrac{1}{2}\left(y - \mu_i^{(j)}\right)'\Sigma^{-1}\left(y - \mu_i^{(j)}\right)\right\}$$
in $\pi_i$ $(i = 1, 2)$.
3.3 Optimum Allocation Rule
If all population parameters are known, the allocation rule for an observation $(x', y')'$, where $x$ is the vector of discrete variables and $y$ that of the continuous variables, is: assign to $\pi_1$ if
$$\left(\mu_1^{(m)} - \mu_2^{(m)}\right)'\Sigma^{-1}\left\{y - \tfrac{1}{2}\left(\mu_1^{(m)} + \mu_2^{(m)}\right)\right\} \geq \log\left(p_{2m}/p_{1m}\right)$$
and otherwise to $\pi_2$, where $m = 1 + \sum_{i=1}^{q} x_i 2^{i-1}$ is the cell in which $x$ places the observation.
3.4 Estimated Allocation Rule
Let $n_{1m}$ and $n_{2m}$ denote the number of observations falling in cell $m$ of the table from $\pi_1$ and $\pi_2$ respectively, and let $y_{ij}^{(m)}$ denote the vector of continuous variables associated with the $j$th observation in cell $m$ of the sample from $\pi_i$. Then, if
$$\bar{y}_i^{(m)} = \frac{1}{n_{im}} \sum_{j=1}^{n_{im}} y_{ij}^{(m)},$$
the maximum likelihood estimates of the population parameters $p_{im}$, $\mu_i^{(m)}$ and $\Sigma$ are given by:
$$\hat{p}_{im} = \frac{n_{im}}{n_i}; \qquad \hat{\mu}_i^{(m)} = \bar{y}_i^{(m)}; \qquad \hat{\Sigma} = \frac{1}{n_1 + n_2 - 2k} \sum_{i=1}^{2} \sum_{m=1}^{k} \sum_{j=1}^{n_{im}} \left(y_{ij}^{(m)} - \bar{y}_i^{(m)}\right)\left(y_{ij}^{(m)} - \bar{y}_i^{(m)}\right)'$$
where $i = 1, 2$; $m = 1, 2, \ldots, k$.
3.5 Estimation of Error Rates
The success of an allocation rule can be assessed by the probabilities of misclassification, or error rates, to which it gives rise. The error rates are given by:
$$P(2|1) = \sum_{m=1}^{k} p_{1m}\, \Phi\left\{\frac{\log\left(p_{2m}/p_{1m}\right) - \tfrac{1}{2} D_m^2}{D_m}\right\}$$
and
$$P(1|2) = \sum_{m=1}^{k} p_{2m}\, \Phi\left\{\frac{\log\left(p_{1m}/p_{2m}\right) - \tfrac{1}{2} D_m^2}{D_m}\right\}$$
where $\Phi$ is the cumulative standard normal distribution function and
$$D_m^2 = \left(\mu_1^{(m)} - \mu_2^{(m)}\right)'\Sigma^{-1}\left(\mu_1^{(m)} - \mu_2^{(m)}\right)$$
is the Mahalanobis squared distance between $\pi_1$ and $\pi_2$ in cell $m$ of the multinomial table.
4.1 Data analysis
The results for the simulated data are shown in the table below.
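As a worked illustration of how such error rates are computed, the cell index of Section 3.3 and the error-rate formulas of Section 3.5 can be sketched in a few lines. The sketch is written in Python for illustration and is not the R program developed in this work; the function names and the input values are hypothetical, the parameters are assumed known, and taking the simple mean of the two error rates as the "average error rate" is an assumption.

```python
import math


def cell_index(x):
    """Cell number m = 1 + sum_{i=1}^{q} x_i * 2**(i-1) for a binary vector x."""
    return 1 + sum(xi * 2 ** i for i, xi in enumerate(x))


def std_normal_cdf(z):
    """Cumulative standard normal distribution function Phi(z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def error_rates(p1, p2, d):
    """Return (P(2|1), P(1|2)) from the cell probabilities p1[m], p2[m]
    and the per-cell Mahalanobis distances d[m] (all assumed positive)."""
    p21 = sum(
        a * std_normal_cdf((math.log(b / a) - 0.5 * dm ** 2) / dm)
        for a, b, dm in zip(p1, p2, d)
    )
    p12 = sum(
        b * std_normal_cdf((math.log(a / b) - 0.5 * dm ** 2) / dm)
        for a, b, dm in zip(p1, p2, d)
    )
    return p21, p12


# Hypothetical inputs: q = 2 binary variables give k = 2**2 = 4 cells.
p1 = [0.25, 0.25, 0.25, 0.25]   # cell probabilities in population 1
p2 = [0.25, 0.25, 0.25, 0.25]   # cell probabilities in population 2
d = [1.0, 1.0, 1.0, 1.0]        # Mahalanobis distance D_m in each cell

p21, p12 = error_rates(p1, p2, d)
# Assumption: the average error rate is the simple mean of the two rates.
average_error = 0.5 * (p21 + p12)
print(cell_index([0, 1]), round(p21, 4), round(p12, 4), round(average_error, 4))
```

With equal cell probabilities the logarithmic term vanishes, so each error rate reduces to $\Phi(-D_m/2)$; for $D_m = 1$ this is about 0.3085.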
In the simulation experiment, with a training set generated by R code, the average error rates were all below 30%, which indicates that the rule performs well in minimizing the probability of misclassification. The average error rate for the real data is 0.2694, which shows that the proportion of errors made by the rule is satisfactory in terms of minimizing the probability of misclassification.
THE RESULTS OF THE AVERAGE ERROR RATE OF THE LOCATION MODEL (LM) FOR $2 \le q \le 6$ AND $0.1 \le P_1, P_2 \le 0.9$
P1    P2    Situation   q=2      q=3      q=4      q=5      q=6
0.1   0.1   a           0.2323   0.2490   0.2520   0.2790   0.2777
0.1   0.1   b           0.2560   0.2478   0.2585   0.2510   0.2493
0.1   0.3   a           0.2605   0.2660   0.2315   0.2540   0.2550
0.1   0.3   b           0.2765   0.2583   0.2463   0.2433   0.2460
0.1   0.5   a           0.2665   0.2405   0.2568   0.2560   0.2569
0.1   0.5   b           0.2388   0.2455   0.2485   0.2435   0.2437
0.1   0.7   a           0.2425   0.2668   0.2665   0.2420   0.2447
0.1   0.7   b           0.2415   0.2595   0.2555   0.2586   0.2581
0.1   0.9   a           0.2265   0.2443   0.2383   0.2570   0.2575
0.1   0.9   b           0.2475   0.2538   0.2535   0.2660   0.2661
0.3   0.3   a           0.2490   0.2520   0.2518   0.2403   0.2405
0.3   0.3   b           0.2295   0.2615   0.2483   0.2518   0.2494
0.3   0.5   a           0.2463   0.2505   0.2550   0.2570   0.2561
0.3   0.5   b           0.2570   0.2473   0.2560   0.2528   0.2520
0.3   0.7   a           0.2385   0.2515   0.2590   0.2405   0.2428
0.3   0.7   b           0.2248   0.2523   0.2513   0.2323   0.2222
0.3   0.9   a           0.2565   0.2378   0.2558   0.2363   0.2346
0.3   0.9   b           0.2278   0.2293   0.2533   0.2478   0.2487
0.5   0.5   a           0.2235   0.2513   0.2520   0.2438   0.2428
0.5   0.5   b           0.2428   0.2533   0.2540   0.2430   0.2450
0.5   0.7   a           0.2595   0.2503   0.2510   0.2350   0.2353
0.5   0.7   b           0.2458   0.2508   0.2298   0.2436   0.2436
0.5   0.9   a           0.2503   0.2575   0.2533   0.2598   0.2634
0.5   0.9   b           0.2438   0.2288   0.2505   0.2498   0.2493
0.7   0.7   a           0.2388   0.2635   0.2595   0.2503   0.2508
0.7   0.7   b           0.2715   0.2403   0.2505   0.2703   0.2708
0.7   0.9   a           0.2363   0.2473   0.2420   0.2438   0.2778
0.7   0.9   b           0.2315   0.2383   0.2458   0.2430   0.2493
Confusion Matrix using Location Model Approach for Real Data
Observed      Location Model
              Sample 1   Sample 2
Sample 1      2          11
Sample 2      2          19
Source: R-Console 3.1.20
Misclassification Rate = (2 + 11)/34 = 0.38
5.1 Findings
The average error rates obtained from the developed R program are comparable with those obtained by Krzanowski (1975). The average error rates from repeated runs under identical conditions were stable. In both situation (a) and situation (b), that is, without and with interaction respectively, the results were satisfactory. The number of misclassifications was not large and the percentage of misclassification was under control.
5.2 Conclusion
The classification rule based on the programme developed in the R programming language using the location model gave a satisfactory result in terms of minimizing the average error rate, and the result does not differ from that of the original work by Krzanowski (1975).
5.3 Recommendation
Since the developed R programme gave good results in terms of minimizing the average error rate in classification involving mixed variables, we recommend that statisticians use the location model program when classifying objects.
References
Efron, B. (1975). The efficiency of logistic regression compared to normal discriminant analysis. Journal of American Statistical Association. Vol. 70, No. 352
El-Habil, A. M., & El-Jazzar, M. (2013). A Comparative Study between Linear Discriminant Analysis and Multinomial Logistic Regression. An-Najah University Journal for Research - Humanities, 28, 1525-1548.
El-Hanjouri, M. M. R. and Hamad, B. S. (2015). Using Cluster Analysis and Discriminant Analysis Methods in Classification with Application on Standard of Living Family in Palestinian Areas. International Journal of Statistics and Applications. 5(5): 213-222
Fernandez, G. (2009). Discriminant Analysis, a Powerful Classification Technique in Predictive Modeling.
George Fernandez University of Nevada. Reno. Ganeshanandam and Krzanowski (1990). Variable selection in discriminant analysis based on the location model for mixed variables. Advance Data Analysis. 1: 105–122. Hamid, H. H. (2010). A new approach for classifying large number of mixed variables. World Academy of Science Engineering & Technology. Vol. 46, 10-22 Hamid, H., Mei, L. M. & Yahaya, S. S. (2017). New Discrimination Procedure of Location Model for Handling Large Categorical Variables. Sains Malaysiana. 46(6) (2017): 1001–1010 Ikechukwu, E. (2016). Evaluation of Error Rate Estimators in Discriminant Analysis with Multivariate Binary Variables. American Journal of Theoretical and Applied Statistics. 5, 173. James, P. & Wilson, S. (1978). Choosing between logistic regression and discriminant analysis. Journal of american statistical association. Vol. 73, No. 364, 699705 Johnson, R. A. and Wichem D. W. (1992). Applied multivariate statistical analysis, Englewood, Cliffs, New Jersey Krzanowski, w. J. (1975). Discrimination and classification using both binary and continuous variables. Journal of American Statistical Association. 70, 782–790. Krzanowski, W. J. (1982). Mixture of continuous and categorical variables in discriminant analysis: A Hypothesis - Testing Approach. Biometrics. Vol. 38, 991-1002 Lachenbruch, P. A. (1975). Discriminant analysis. Hafner Press, New York. Maja, B. L. (2004). Discriminant analysis using mixed continuous, dichotomous, and ordered categorical variables, Multivariate. Res. 42 (2007), 631–645. 78 AsPoly Journal of Sciences, Engineering and Environmental Studies Miller, J. W., Woodward, W. A. & Gray, H. L. (1998). A hypothesis testing approach to discriminant analysis with mixed categories and continuous variables when data are missing. A scientific report. No. 1 Philips laboratory directorate of geophysics air force material. MA 0173-3010 Nocairi, H., Qannari, E. M. & Hanafi, M. (2006). 
A simple regularization procedure for discriminant analysis. Journal of communication in statistics simulation and computation. Vol. 35, 957-967 Onyeagu, S. I. (2003). A first course in multivariate statistical analysis. Mega Concept. Osuji, G. A. (2010). Evaluation of some classification procedures for binary variables. A ph.d thesis. Nnamdi Azikiwe university, Awka Panagiotakos, D. B. (2006). A comparison between logistic regression and linear discriminant analysis for the Prediction of categorical health outcome. Journal of Statistical Science. Vol. 5, 73-84 Takiokurita, K. & Otsu, N. (2009). Logistic discriminant proceeding of the international conference on systems man and cybernetics. San Antonis, Texas, USA. 79
