G54DMT – Data Mining Techniques and
Applications
http://www.cs.nott.ac.uk/~jqb/G54DMT
Dr. Jaume Bacardit
jqb@cs.nott.ac.uk
Topic 2: Data Preprocessing
Lecture 4: Handling missing values
Structure of the lecture taken from http://sci2s.ugr.es/MVDM/index.php
Outline of the lecture
• What is a missing value and why is it
important?
• Strategies for missing values handling
• Imputation techniques
What is a missing value?
• A missing value (Mv) is an empty cell in the
table that represents a dataset
(Figure: the dataset as a table, with attributes as columns and instances as rows; a '?' marks a missing cell)
Missing values in the ARFF format
@relation labor
@attribute 'duration' real
@attribute 'wage-increase-first-year' real
@attribute 'wage-increase-second-year' real
@attribute 'wage-increase-third-year' real
@attribute 'cost-of-living-adjustment' {'none','tcf','tc'}
@attribute 'working-hours' real
@attribute 'pension' {'none','ret_allw','empl_contr'}
@attribute 'standby-pay' real
@attribute 'shift-differential' real
@attribute 'education-allowance' {'yes','no'}
@attribute 'statutory-holidays' real
@attribute 'vacation' {'below_average','average','generous'}
@attribute 'longterm-disability-assistance' {'yes','no'}
@attribute 'contribution-to-dental-plan' {'none','half','full'}
@attribute 'bereavement-assistance' {'yes','no'}
@attribute 'contribution-to-health-plan' {'none','half','full'}
@attribute 'class' {'bad','good'}
@data
1,5,?,?,?,40,?,?,2,?,11,'average',?,?,'yes',?,'good'
2,4.5,5.8,?,?,35,'ret_allw',?,?,'yes',11,'below_average',?,'full',?,'full','good'
?,?,?,?,?,38,'empl_contr',?,5,?,11,'generous','yes','half','yes','half','good'
3,3.7,4,5,'tc',?,?,?,?,'yes',?,?,?,?,'yes',?,'good'
Why do missing values exist?
• Faulty equipment
• Incorrect measurements
• Missing cells in manual data entry
– Very frequent in questionnaires for medical
scenarios
– Actually having a low rate of missing values may
be suspicious (Barnard and Meng, 99)
• Censored/anonymous data
Why are missing values important?
• Three reasons
– Loss of efficiency: fewer patterns extracted from
the data, or statistically weaker conclusions
– Complications in handling and analyzing the data.
Methods are in general not prepared to handle
them
– Bias resulting from differences between missing
and complete data. Data Mining methods
generate different models
• (Barnard and Meng, 99)
Characterisation of missing values
(Little & Rubin, 87)
• Three categories of missing values
– Missing completely at random (MCAR), when the distribution of an
example having a missing value for an attribute does not depend on
either the observed data or the missing data
– Missing at random (MAR), when the distribution of an example having
a missing value for an attribute depends on the observed data, but
does not depend on the missing data
– Not missing at random (NMAR), when the distribution of an example
having a missing value for an attribute depends on the missing values.
• Depending on the type of missing value, some of the handling
methods will be suitable or not
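The three categories can be illustrated by simulating how the missingness mask is generated over a toy dataset (a hypothetical age/income survey; the attribute names and probabilities below are illustrative assumptions, not from the lecture):

```python
import random

random.seed(0)
# Toy dataset: (age, income); we will delete income values.
data = [(random.randint(20, 70), random.gauss(50, 10)) for _ in range(1000)]

# MCAR: every income value has the same 20% chance of being missing,
# regardless of any value in the record.
mcar = [(age, None if random.random() < 0.2 else inc) for age, inc in data]

# MAR: the chance of income being missing depends on the *observed* age
# (older respondents skip the question more often), not on income itself.
mar = [(age, None if random.random() < (0.05 if age < 45 else 0.4) else inc)
       for age, inc in data]

# NMAR would make missingness depend on the income value itself (e.g. high
# earners refusing to answer), which cannot be detected from observed data.
missing_rate = lambda rows: sum(inc is None for _, inc in rows) / len(rows)
print(missing_rate(mcar), missing_rate(mar))
```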
Strategies for missing values handling
(Farhangfar et al., 08)
• Discarding examples with missing values
– Simplest approach. Allows the use of unmodified data
mining methods
– Only practical if there are few examples with missing
values. Otherwise, it can introduce bias
• Convert the missing values into a new value
– Use a special value for it
– Add an attribute that indicates if value is missing or not
– Greatly increases the difficulty of the data mining process
• Imputation methods
– Assign a value to the missing one, based on the rest of the
dataset
– Use the unmodified data mining methods
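The first two strategies can be sketched on a toy table (a pure-Python illustration; the row layout and the use of `None` for missing cells are assumptions):

```python
rows = [[1.0, 5.0, 'good'],
        [2.0, None, 'good'],
        [None, 7.0, 'bad'],
        [4.0, 8.0, 'bad']]

# Strategy 1 -- discard every example containing a missing value.
complete = [r for r in rows if None not in r]

# Strategy 2 -- keep the example, but append one indicator attribute per
# original attribute flagging whether its value was missing.
flagged = [r[:-1] + [v is None for v in r[:-1]] + [r[-1]] for r in rows]

print(len(complete))   # 2 rows survive
print(flagged[1])      # [2.0, None, False, True, 'good']
```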
Imputation methods
• As they extract a model from the dataset to
perform the imputation, they are suitable for
MCAR and, to a lesser extent, MAR types of
missing values
• Not suitable for NMAR type of missing data
– It would be necessary in this case to go back to
the source of the data to obtain more information
• In the next slides several imputation methods
will be described. All references to the original
papers presenting them are available at
http://sci2s.ugr.es/MVDM/index.php
Do Not Impute (DNI)
• Simply use the default MV policy of the data
mining method
• Only suitable if such policy exists
• Example for rule learning
– Attributes with missing values would be
considered as irrelevant
def match(instance, rule):
    # DNI policy for rule learning: an attribute whose value is
    # missing in the instance is skipped, i.e. treated as irrelevant
    for att, (lower, upper) in rule.items():
        value = instance.get(att)
        if value is not None and not (lower <= value <= upper):
            return False
    return True
Most Common (MC) value
• If the missing value is continuous
– Replace it with the mean value of the attribute for
the dataset
• If the missing value is discrete
– Replace it with the most frequent value of the
attribute for the dataset
• Simple and fast to compute
• Assumes that each attribute
presents a normal distribution
(Figure: distribution of the attribute, with the missing value '?' replaced by the mean)
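The MC policy fits in a few lines of Python (a minimal sketch, not the KEEL implementation; the helper name and the `None`-for-missing convention are assumptions):

```python
from statistics import mean
from collections import Counter

def impute_most_common(column, missing=None):
    """Most Common (MC) imputation for one attribute column:
    mean for continuous values, mode for discrete ones."""
    observed = [v for v in column if v is not missing]
    if all(isinstance(v, (int, float)) for v in observed):
        fill = mean(observed)                          # continuous: mean
    else:
        fill = Counter(observed).most_common(1)[0][0]  # discrete: mode
    return [fill if v is missing else v for v in column]

print(impute_most_common([1.0, None, 3.0]))            # [1.0, 2.0, 3.0]
print(impute_most_common(['yes', None, 'yes', 'no']))  # fills with 'yes'
```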
Concept Most Common (CMC) value
• Refinement of the MC
policy
• The MV is replaced with
the mean/most frequent
value computed from the
instances belonging to the
same class
• Assumes that the
distribution for an attribute
of all instances from the
same class is normal
(Figure: per-class distributions of the attribute, with the missing value '?' replaced by the mean of its class)
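A minimal sketch of the CMC policy for a continuous attribute (the function name and `None`-for-missing convention are assumptions):

```python
from statistics import mean

def impute_cmc(values, labels, missing=None):
    """Concept Most Common (CMC): impute each missing value using
    only the instances that belong to the same class."""
    result = list(values)
    for i, (v, c) in enumerate(zip(values, labels)):
        if v is missing:
            same_class = [x for x, l in zip(values, labels)
                          if l == c and x is not missing]
            result[i] = mean(same_class)   # mean for continuous attributes
    return result

vals   = [1.0, None, 3.0, 10.0, None]
labels = ['a', 'a',  'a', 'b',  'b']
print(impute_cmc(vals, labels))  # [1.0, 2.0, 3.0, 10.0, 10.0]
```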
Imputation with k-Nearest Neighbour
(KNNI)
• k-NN machine learning algorithm
– Given an unlabeled new instance
– Select the k instances from the training
set most similar to the new instance
• What counts as similar? E.g. the Euclidean distance
– Predict the majority class from these k
instances
• k-NN for MV imputation
– Select the k nearest neighbours
– Replace the MV with the most
frequent/mean value from these k
instances
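The KNNI procedure above can be sketched for numeric attributes (a minimal illustration assuming `None` marks missing values; distances use only the attributes observed in both instances):

```python
import math

def knn_impute(data, target_idx, att, k=2, missing=None):
    """KNNI sketch: fill attribute `att` of instance `target_idx` with the
    mean of that attribute over its k nearest neighbours."""
    target = data[target_idx]

    def dist(row):
        common = [(a, b) for a, b in zip(target, row)
                  if a is not missing and b is not missing]
        return math.sqrt(sum((a - b) ** 2 for a, b in common))

    # Only neighbours that actually have a value for `att` are candidates.
    candidates = [row for i, row in enumerate(data)
                  if i != target_idx and row[att] is not missing]
    neighbours = sorted(candidates, key=dist)[:k]
    return sum(row[att] for row in neighbours) / len(neighbours)

data = [[1.0, None],   # instance with the missing value
        [1.0, 5.0],
        [1.2, 6.0],
        [9.0, 100.0]]
print(knn_impute(data, 0, att=1, k=2))  # 5.5, mean of the two closest rows
```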
Weighted imputation with K-Nearest
Neighbour (WKNNI)
• Refinement of KNNI
– Select the k neighbours as before
– MV generation is now performed through a
weighted average of the values for the missing
attribute from these k neighbours
– The closest of the k neighbours have more weight
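A sketch of the weighted variant, using inverse-distance weights (the weighting scheme and the small `eps` guard against zero distances are assumptions of this illustration):

```python
import math

def wknn_impute(data, target_idx, att, k=2, missing=None, eps=1e-9):
    """WKNNI sketch: weighted average of the k nearest neighbours' values,
    with weights inversely proportional to distance."""
    target = data[target_idx]

    def dist(row):
        common = [(a, b) for a, b in zip(target, row)
                  if a is not missing and b is not missing]
        return math.sqrt(sum((a - b) ** 2 for a, b in common))

    candidates = [row for i, row in enumerate(data)
                  if i != target_idx and row[att] is not missing]
    neighbours = sorted(candidates, key=dist)[:k]
    weights = [1.0 / (dist(row) + eps) for row in neighbours]  # closer = heavier
    total = sum(weights)
    return sum(w * row[att] for w, row in zip(weights, neighbours)) / total

data = [[1.0, None], [1.0, 5.0], [2.0, 9.0], [9.0, 100.0]]
print(round(wknn_impute(data, 0, att=1, k=2), 3))  # 5.0: coincident row dominates
```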
K-means Clustering Imputation (KMI)
• Clustering: automatic aggregation of
instances in groups
• K-means: Given a dataset it identifies k
(predefined parameter) groups (clusters)
of similar instances.
• For each cluster it computes the
centroid
– Artificial representative of the cluster
– Mean/most frequent value of instances
in a cluster
• MV imputation
– Identify the cluster to which the instance
with MV belongs to
– Take the value of the centroid
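The KMI steps above can be sketched with a tiny Lloyd's k-means over the complete rows (a simplified illustration for numeric attributes; clustering only the complete instances is an assumption of this sketch):

```python
import math
import random

def kmeans(rows, k, iters=10, seed=0):
    """Tiny Lloyd's k-means over complete numeric rows."""
    random.seed(seed)
    centroids = random.sample(rows, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for r in rows:
            nearest = min(range(k), key=lambda j: math.dist(r, centroids[j]))
            clusters[nearest].append(r)
        centroids = [[sum(col) / len(c) for col in zip(*c)] if c else centroids[j]
                     for j, c in enumerate(clusters)]
    return centroids

def kmi(rows, incomplete, att, k=2, missing=None):
    """KMI sketch: find the nearest centroid using the instance's observed
    attributes, then take the centroid's value for the missing attribute."""
    centroids = kmeans(rows, k)
    obs = [i for i, v in enumerate(incomplete) if v is not missing]
    best = min(centroids,
               key=lambda c: math.dist([incomplete[i] for i in obs],
                                       [c[i] for i in obs]))
    return best[att]

complete = [[1.0, 5.0], [1.2, 6.0], [9.0, 100.0], [9.5, 110.0]]
print(kmi(complete, [1.1, None], att=1, k=2))  # 5.5, mean of the low cluster
```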
Imputation with Fuzzy K-means
Clustering (FKMI)
• Fuzzy logic: Reasoning framework that
explicitly takes into account uncertainty
• In fuzzy k-means, each instance does not
simply belong to a cluster or not
• It has a membership degree to each cluster
• Missing values are computed as a weighted sum
of all centroids, using the membership
degree for each cluster as the weight
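Given the cluster centroids, the imputation step can be sketched with the standard fuzzy c-means membership formula (fuzzifier `m = 2`; computing memberships only over the observed attributes is an assumption of this illustration):

```python
import math

def fkmi(centroids, incomplete, att, m=2.0, missing=None):
    """FKMI sketch: compute the instance's fuzzy membership to every
    cluster, then impute the missing attribute as the
    membership-weighted sum of the centroid values."""
    obs = [i for i, v in enumerate(incomplete) if v is not missing]
    dists = [math.dist([incomplete[i] for i in obs], [c[i] for i in obs])
             for c in centroids]
    if 0.0 in dists:   # sitting exactly on a centroid: full membership
        return centroids[dists.index(0.0)][att]
    # Standard fuzzy c-means membership: u_j = 1 / sum_k (d_j / d_k)^(2/(m-1))
    memberships = [1.0 / sum((d / dk) ** (2 / (m - 1)) for dk in dists)
                   for d in dists]
    return sum(u * c[att] for u, c in zip(memberships, centroids))

centroids = [[1.0, 5.0], [9.0, 100.0]]
value = fkmi(centroids, [2.0, None], att=1)
print(round(value, 2))  # 6.9: 0.98 * 5.0 + 0.02 * 100.0
```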
Other methods
• Bayesian Principal Component Analysis (BPCA)
• Local Least Squares Imputation (LLSI)
• Event Covering (EC)
• Support Vector Machine Imputation (SVMI)
• Expectation Maximisation Imputation (EMI)
• Singular Value Decomposition Imputation (SVDI)
Details and references of these methods in http://sci2s.ugr.es/MVDM/index.php
And what is the global picture?
• In terms of attributes
– Methods that treat each attribute separately
– Methods that take decisions from the whole
record
– Methods that consider a subset of attributes
• In terms of instances
– Imputation based on the complete instance set
– Imputation based on a subset of similar records
• Methods that decompose the dataset and
take decisions in a different space
Which method is the best?
• The literature is full of comparisons of methods
– Impact of imputation of missing values on classification
error for discrete data
– A Study on the Use of Imputation Methods for
Experimentation with Radial Basis Function Network
Classifiers Handling Missing Attribute Values: The good
synergy between RBFs and EventCovering method
– Missing value estimation for DNA microarray gene
expression data: Local least squares imputation
Resources
• Web site dedicated to missing values
– http://sci2s.ugr.es/MVDM/index.php#four
• Bibliography on missing values
– http://sci2s.ugr.es/MVDM/biblio.php
• Implementation of the methods described in
this lecture is available in the KEEL package
Questions?