A PERSPECTIVE MISSING VALUES IN DATAMINING APPLICATIONS

International Journal of Engineering Trends and Technology, Volume 3, Issue 3, 2012
Dr. S.S. Dhenakaran, M.Sc., M.Phil., Ph.D.¹, T. Kalaivani, M.Sc., M.Phil.²
¹ Associate Professor, Department of Computer Science and Engineering,
Alagappa University, Karaikudi - 630001.
² Research Scholar, Department of Computer Science and Engineering,
Alagappa University, Karaikudi - 630001.
ISSN: 2231-5381 http://www.internationaljournalssrg.org
Abstract
In a large database, some attribute values may be missing. Each missing value is first identified as either discrete or continuous, and a replacement is then calculated using the mean or median. In this paper, the calculated values are used to impute the missing values in the data set. Methods are discussed for learning and applying decision rules for the classification of data with many missing values. A method is presented to induce decision rules from data with missing values, where the format of the rules is no different from that of rules for data without missing values and no special features are specified to prepare the original data. Data with missing values complicates both the learning process and the application of the solution to new data. The most common preprocessing technique involves filling in the missing values.
KEYWORDS: Data mining, Missing data, Decision tree.
Introduction
1. Data mining
“Data mining is the application of statistics in the form of exploratory data analysis and predictive models to reveal patterns and trends in very large data sets.” “Data mining is the process of discovering meaningful new correlations, patterns and trends by sifting through large amounts of data stored in repositories, using pattern recognition technologies as well as statistical and mathematical techniques.” Data mining involves the use of sophisticated data analysis tools to discover previously unknown, valid patterns and relationships in large data sets. These tools can include statistical models, mathematical algorithms, and machine learning methods. Data mining is still a growing field and needs further improvement to reach maturity. More products are being developed, and more businesses are incorporating data mining into their decision-making processes. Most successful business decisions are made from reliable data sources that are validated through the application of tools and techniques. Most of the literature on data mining focuses on its benefits and burdens in making business decisions.
2. Decision Tree
Decision trees are powerful and popular tools for classification and prediction. The attractiveness of
decision trees is due to the fact that, in contrast to neural
networks, decision trees represent rules. Rules can readily be expressed so that humans can understand them, or even be used directly in a database access language like SQL so that records falling into a particular category may be retrieved. In some applications, the accuracy of a classification or prediction is the only thing that matters. In such situations we do not necessarily care how or why the model works. In other situations, the ability to explain the reason for a decision is crucial.
Related work
3. Missing attribute values
One or more of the attribute values may be missing both for examples in the training set and for objects which are to be classified. Missing data might occur because the value is not relevant to a particular case, could not be recorded when the data was collected, or is withheld by users because of privacy concerns. If attribute values are missing for an object in the training set, the system may ignore the object entirely, try to take it into account by, for instance, finding the missing attribute's most probable value, or use the value “missing”, “unknown” or “NULL” as a separate value for the attribute.
4. Data Preparation
Data preparation is an essential part of model fitting and must be done before the design and analysis of data sets.
4.1. Missing Value Estimation
In this module, we handle the discrete attributes (both ordered and unordered discrete attributes/variables) of the data sets. Then, a mixture kernel function is proposed by combining a discrete kernel function with a continuous one. Furthermore, a new estimator is constructed based on the mixture kernel, and novel kernel estimators are developed for discrete and continuous target values, respectively.
4.2. Imputation of missing values
In this module, we use the resulting estimates to impute the missing values in the data set. We impute missing values by making use of mean, median, and mode based iterative nonparametric estimators; a novel frequency estimator is proposed for the case in which data sets have both continuous and discrete independent attributes.
4.3. Performance Analysis
In this module, we measure performance metrics to assess whether the proposed algorithms outperform the existing ones for imputing both discrete and continuous missing values.
5. Corrupted values
Sometimes some of the values in the training set are altered from what they should have been. This may result in one or more tuples in the data set conflicting with the rules already established. The system may then regard these extreme values as noise and ignore them. The problem is that one never knows whether the extreme values are correct or not, and the challenge is how to handle “weird” values in the best manner.
6. Ignore the tuple
This is usually done when the class label is missing. It is also recommended if the tuple contains several attributes with missing values. The alternatives to ignoring the tuple are as follows.
Use a global constant to fill in the missing value: replace all missing attribute values by the same constant, such as a label like “Missing”, “Unknown”, “-”, or “?”.
Use the attribute mean to fill in the missing value.
Use the attribute mean for all samples belonging to the same class: for example, if classifying customers according to credit risk, replace the missing value with the average income value for customers in the same credit risk category as that of the given tuple.
Use the most probable value to fill in the missing value: this technique is appropriate for sparse missing values. Difficulties arise if the tuple contains more than one missing attribute value.
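The filling strategies listed in this section can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation. The function names, the use of `None` to encode a missing entry, and the fallback to the overall mean for a class with no observed values are all assumptions made for the example. The "most probable value" is approximated here by the most frequent observed value (the mode), which is only the simplest stand-in for a model-based estimate.

```python
from collections import Counter, defaultdict
from statistics import mean

MISSING = None  # assumption: a missing entry is encoded as None


def fill_constant(values, constant="Unknown"):
    """Replace every missing entry by the same global constant."""
    return [constant if v is MISSING else v for v in values]


def fill_mean(values):
    """Replace missing entries by the mean of the observed entries."""
    m = mean([v for v in values if v is not MISSING])
    return [m if v is MISSING else v for v in values]


def fill_class_mean(values, labels):
    """Replace missing entries by the mean of the observed entries
    that share the same class label (e.g. the same credit risk)."""
    by_class = defaultdict(list)
    for v, c in zip(values, labels):
        if v is not MISSING:
            by_class[c].append(v)
    overall = mean([v for v in values if v is not MISSING])
    # Assumption: a class with no observed value falls back to the overall mean.
    class_means = {c: mean(vs) for c, vs in by_class.items()}
    return [class_means.get(c, overall) if v is MISSING else v
            for v, c in zip(values, labels)]


def fill_mode(values):
    """Replace missing entries by the most frequent observed value,
    a simple stand-in for the 'most probable value'."""
    observed = [v for v in values if v is not MISSING]
    most_common = Counter(observed).most_common(1)[0][0]
    return [most_common if v is MISSING else v for v in values]


# Hypothetical income column with two missing entries, plus credit risk labels.
income = [30, None, 50, 80, None, 100]
risk = ["high", "high", "high", "low", "low", "low"]
print(fill_mean(income))
print(fill_class_mean(income, risk))
```

In this toy column the class-conditional mean fills the two gaps with different values, one per credit risk category, whereas the plain attribute mean fills both with the same number.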
7. Experimental Results
In this section, we consider several data sets from real applications and data sets taken from the UCI repository. None of these data sets has missing values, so the selected data sets let us compare the imputed values with their real values. For these complete data sets, missing values are generated at random so as to systematically study the performance of the proposed method. The percentage of missing values (the “missing rate”) was fixed at 10, 20, 30, 50, and 80 percent for each data set.
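The protocol described so far (start from a complete data set, hide values at random at a fixed missing rate, impute, and compare against the hidden true values) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the experiment reported here: the synthetic Gaussian column, plain mean imputation in place of the proposed method, the root mean squared error as the comparison metric, and all function names are assumptions introduced for the example.

```python
import math
import random
from statistics import mean


def mask_at_random(values, missing_rate, rng):
    """Hide round(missing_rate * n) entries chosen uniformly at random.
    Returns the masked column and the sorted indices that were hidden."""
    n_missing = round(missing_rate * len(values))
    hidden = set(rng.sample(range(len(values)), n_missing))
    masked = [None if i in hidden else v for i, v in enumerate(values)]
    return masked, sorted(hidden)


def impute_mean(masked):
    """Fill every hidden entry with the mean of the observed entries."""
    fill = mean([v for v in masked if v is not None])
    return [fill if v is None else v for v in masked]


def rmse(truth, imputed, hidden):
    """Root mean squared error over the hidden positions only."""
    return math.sqrt(mean([(truth[i] - imputed[i]) ** 2 for i in hidden]))


rng = random.Random(0)  # fixed seed so the run is repeatable
truth = [rng.gauss(50, 10) for _ in range(200)]  # synthetic complete column

for rate in (0.10, 0.20, 0.30, 0.50, 0.80):
    masked, hidden = mask_at_random(truth, rate, rng)
    imputed = impute_mean(masked)
    print(f"missing rate {rate:.0%}: RMSE = {rmse(truth, imputed, hidden):.3f}")
```

Because the true values are retained, any imputation method can be swapped in for `impute_mean` and scored on the same hidden positions.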
For comparison with the proposed method (denoted by
Mixing), four selected imputation methods are the
nonparametric iterative single-kernel imputation method
with a polynomial kernel (denoted by Poly), a
nonparametric
This will be discussed in future work. Since the number of individual components of missing values is high, it is not feasible to monitor convergence for every imputed missing value. Schafer considered that, since convergence rates are closely related to missing information, it makes sense to focus on parameters for which the fractions of missing information are high (in our paper, these are the variance and mean of the imputed values; obviously, we can also use other parameters, such as the distribution function).
8. Conclusion and Future Work
In this paper, we propose an imputation method for dealing with missing values in the target attribute during data preprocessing, which computes the mean or median of the original data set in order to impute the missing values. In practice, data sets usually present missing values in both conditional attributes and class attributes, which makes the problem of missing value imputation more complicated. In our future work, we will deal with this problem.
9. REFERENCES
[1] M.L. Brown, “Data Mining and the Impact of Missing Data,” Industrial Management and Data Systems, vol. 103, no. 8, pp. 611-621, 2003.
[2] Z. Ghahramani and M. Jordan, “Mixture Models for Learning from Incomplete Data,” Computational Learning Theory and Natural Learning Systems, R. Greiner, T. Petsche, and S.J. Hanson, eds., vol. IV: Making Learning Systems Practical, pp. 67-85, The MIT Press, 1997.
[3] J. Han and M. Kamber, Data Mining: Concepts and Techniques, second ed. Morgan Kaufmann Publishers, 2006.