International Journal of Engineering Trends and Technology, Volume 3, Issue 3, 2012
ISSN: 2231-5381, http://www.internationaljournalssrg.org, pp. 358-360

A PERSPECTIVE MISSING VALUES IN DATAMINING APPLICATIONS

Dr. S.S. Dhenakaran, M.Sc., M.Phil., Ph.D. (1), T. Kalaivani, M.Sc., M.Phil. (2)
(1) Associate Professor, Department of Computer Science and Engineering, Alagappa University, Karaikudi - 630001.
(2) Research Scholar, Department of Computer Science and Engineering, Alagappa University, Karaikudi - 630001.

Abstract

In a large database, some values may be missing for some of the attributes. To estimate a missing value, the attribute is first identified as discrete or continuous, and the value is then calculated using the mean or median. In this paper, the values calculated in this way are used to impute the missing values in the data set. Methods are discussed for learning and applying decision rules for the classification of data with many missing values. A method is presented to induce decision rules from data with missing values in which the format of the rules is no different from that of rules for data without missing values and no special features are specified to prepare the original data. Data with missing values complicate both the learning process and the application of a solution to new data. The most common preprocessing technique involves filling in the missing values.

KEYWORDS: Data mining, Missing data, Decision tree.

Introduction

1. Data mining

“Data mining is the application of statistics in the form of exploratory data analysis and predictive models to reveal patterns and trends in very large data sets.” “Data mining is the process of discovering meaningful new correlations, patterns and trends by sifting through large amounts of data stored in repositories, using pattern recognition technologies as well as statistical and mathematical techniques.” Data mining involves the use of sophisticated data analysis tools to discover previously unknown, valid patterns and relationships in large data sets. These tools can include statistical models, mathematical algorithms, and machine learning methods.

Data mining is still growing and needs further improvement to mature. More products are being developed, and more businesses are incorporating data mining into their decision-making processes. Most successful business decisions are made from reliable data sources and validated through the application of tools and techniques. Most of the literature on data mining focuses on its benefits and burdens in making business decisions.

2. Decision Tree

Decision trees are powerful and popular tools for classification and prediction. The attractiveness of decision trees is due to the fact that, in contrast to neural networks, decision trees represent rules. Rules can readily be expressed so that humans can understand them, or even used directly in a database access language like SQL so that records falling into a particular category may be retrieved. In some applications, the accuracy of a classification or prediction is the only thing that matters. In such situations we do not necessarily care how or why the model works. In other situations, the ability to explain the reason for a decision is crucial.

Related work

3. Missing attribute values

One or more of the attribute values may be missing, both for examples in the training set and for objects that are to be classified. Missing data might occur because the value is not relevant to a particular case, could not be recorded when the data was collected, or was withheld by users because of privacy concerns. If attribute values are missing for an object in the training set, the system may ignore the object entirely; try to take it into account by, for instance, finding the most probable value of the missing attribute; or use the value “missing”, “unknown” or “NULL” as a separate value for the attribute.

4. Data Preparation

Data preparation is an essential part of model fitting and must be done before the design and analysis of data sets.

4.1. Missing Value Estimation

In this module, we handle the discrete attributes (both ordered and unordered discrete variables) in the data sets. A mixture kernel function is then proposed by combining a discrete kernel function with a continuous one. Furthermore, a new estimator is constructed based on the mixture kernel, and novel kernel estimators are developed for discrete and continuous target values, respectively.

4.2. Imputation of missing values

In this module, we use the estimated missing-value sets to impute the missing values in the data set. We impute missing values by means of kernel-based iterative nonparametric mean, median, and mode estimators; a novel frequency estimator is also proposed for the case in which data sets have both continuous and discrete independent attributes.

4.3. Performance Analysis

In this module, we measure performance metrics to assess whether the proposed algorithms outperform the existing ones in imputing both discrete and continuous missing values.

5. Corrupted values

Sometimes some of the values in the training set are altered from what they should have been. This may result in one or more tuples in the data set conflicting with the rules already established. The system may then regard these extreme values as noise and ignore them. The problem is that one never knows whether the extreme values are correct, and the challenge is how to handle “weird” values in the best manner.

6. Ignore the tuple

This is usually done when the class label is missing. It is also recommended if the tuple contains several attributes with missing values. The alternatives to ignoring the tuple fill in the missing value instead. Use a global constant to fill in the missing value: replace all missing attribute values by the same constant, such as a label like “Missing”, “Unknown”, “-”, or “?”. Use the attribute mean to fill in the missing value. Use the attribute mean for all samples belonging to the same class as the given tuple: for example, if classifying customers according to credit risk, replace the missing value with the average income value for customers in the same credit risk category as that of the given tuple. Use the most probable value to fill in the missing value: this technique is appropriate for sparse missing values. Difficulties arise if the tuple contains more than one missing attribute value.

7. Experimental Results

In this section, we consider several data sets from real applications and data sets taken from the UCI repository. None of these data sets has missing values, so the selected data sets let us compare the imputed values with their real values. For these complete data sets, missing values are generated at random so as to study the performance of the proposed method systematically. The percentage of missing values (the “missing rate”) was fixed at 10, 20, 30, 50, and 80 percent for each data set. For comparison with the proposed method (denoted by Mixing), four imputation methods were selected: the nonparametric iterative single-kernel imputation method with a polynomial kernel (denoted by Poly), a nonparametric [...]

Since the number of individual components of missing values is high, it is not feasible to monitor convergence for every imputed missing value. Schafer considered that, since convergence rates are closely related to missing information, it makes sense to focus on parameters for which the fractions of missing information are high. In our paper, these are the variance and mean of the imputed values; other parameters, such as the distribution function, could also be used. This will be discussed in future work.

8. Conclusion and Future Work

In this paper, we propose an imputation method for dealing with missing values in the target attribute during data preprocessing. The method computes the mean or median from the original data set and uses it to impute the missing values. In practice, data sets usually have missing values in both conditional attributes and class attributes, which makes the problem of missing value imputation more complicated. In our future work, we will deal with this problem.

9. REFERENCES

[1]. M.L. Brown, “Data Mining and the Impact of Missing Data,” Industrial Management and Data Systems, vol. 103, no. 8, pp. 611-621, 2003.
[2]. Z. Ghahramani and M. Jordan, “Mixture Models for Learning from Incomplete Data,” Computational Learning Theory and Natural Learning Systems, R. Greiner, T. Petsche, and S.J. Hanson, eds., vol. IV: Making Learning Systems Practical, pp. 67-85, The MIT Press, 1997.
[3]. J. Han and M. Kamber, Data Mining Concepts and Techniques, second ed. Morgan Kaufmann Publishers, 2006.
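
As an illustration of the fill-in strategies listed in Section 6 (global constant, attribute mean, most frequent value, and attribute mean within the same class), the following minimal Python sketch may be helpful. It is not the paper's implementation and does not include the kernel-based estimators of Section 4. The function names, the use of None to mark a missing value, and the small customer table are all assumptions made for this example.

```python
from collections import defaultdict
from statistics import mean, mode

MISSING = None  # assumption: a missing value is represented by None


def fill_constant(values, constant="Unknown"):
    """Replace every missing entry with the same global constant."""
    return [constant if v is MISSING else v for v in values]


def fill_mean(values):
    """Replace missing numeric entries with the mean of the observed ones."""
    observed = [v for v in values if v is not MISSING]
    fill = mean(observed)
    return [fill if v is MISSING else v for v in values]


def fill_mode(values):
    """Replace missing discrete entries with the most frequent observed value."""
    observed = [v for v in values if v is not MISSING]
    fill = mode(observed)
    return [fill if v is MISSING else v for v in values]


def fill_class_mean(rows, attr, class_attr):
    """Replace a missing `attr` with the mean of `attr` over rows of the same class.

    Returns new row dictionaries; the input rows are not modified. If a class
    has no observed value for `attr`, the entry stays missing.
    """
    by_class = defaultdict(list)
    for row in rows:
        if row[attr] is not MISSING:
            by_class[row[class_attr]].append(row[attr])
    class_means = {label: mean(vals) for label, vals in by_class.items()}
    result = []
    for row in rows:
        new_row = dict(row)
        if new_row[attr] is MISSING:
            new_row[attr] = class_means.get(new_row[class_attr], MISSING)
        result.append(new_row)
    return result


# Hypothetical data: income is missing for two customers.
customers = [
    {"income": 30000, "risk": "high"},
    {"income": None, "risk": "high"},
    {"income": 50000, "risk": "high"},
    {"income": 80000, "risk": "low"},
    {"income": None, "risk": "low"},
]

imputed = fill_class_mean(customers, "income", "risk")
for row in imputed:
    print(row)
```

In this sketch the class-conditional variant fills each missing income from customers in the same credit risk category, as in the example of Section 6, whereas fill_mean would use a single average over all customers regardless of class.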