G54DMT – Data Mining Techniques and Applications
http://www.cs.nott.ac.uk/~jqb/G54DMT
Dr. Jaume Bacardit – jqb@cs.nott.ac.uk

Topic 2: Data Preprocessing
Lecture 4: Handling missing values
Structure of the lecture taken from http://sci2s.ugr.es/MVDM/index.php

Outline of the lecture
• What is a missing value and why is it important?
• Strategies for handling missing values
• Imputation techniques

What is a missing value?
• A missing value (MV) is an empty cell in the table that represents a dataset
(figure: a dataset table with rows = instances, columns = attributes, and '?' marking a missing cell)

Missing values in the ARFF format
• Missing values are marked with '?'

@relation labor
@attribute 'duration' real
@attribute 'wage-increase-first-year' real
@attribute 'wage-increase-second-year' real
@attribute 'wage-increase-third-year' real
@attribute 'cost-of-living-adjustment' {'none','tcf','tc'}
@attribute 'working-hours' real
@attribute 'pension' {'none','ret_allw','empl_contr'}
@attribute 'standby-pay' real
@attribute 'shift-differential' real
@attribute 'education-allowance' {'yes','no'}
@attribute 'statutory-holidays' real
@attribute 'vacation' {'below_average','average','generous'}
@attribute 'longterm-disability-assistance' {'yes','no'}
@attribute 'contribution-to-dental-plan' {'none','half','full'}
@attribute 'bereavement-assistance' {'yes','no'}
@attribute 'contribution-to-health-plan' {'none','half','full'}
@attribute 'class' {'bad','good'}
@data
1,5,?,?,?,40,?,?,2,?,11,'average',?,?,'yes',?,'good'
2,4.5,5.8,?,?,35,'ret_allw',?,?,'yes',11,'below_average',?,'full',?,'full','good'
?,?,?,?,?,38,'empl_contr',?,5,?,11,'generous','yes','half','yes','half','good'
3,3.7,4,5,'tc',?,?,?,?,'yes',?,?,?,?,'yes',?,'good'

Why do missing values exist?
• Faulty equipment
• Incorrect measurements
• Missing cells in manual data entry
  – Very frequent in questionnaires for medical scenarios
  – In fact, a very low rate of missing values may itself be suspicious (Barnard and Meng, 99)
• Censored/anonymised data

Why are missing values important?
• Three reasons (Barnard and Meng, 99):
  – Loss of efficiency: fewer patterns extracted from the data, or statistically weaker conclusions
  – Complications in handling and analysing the data: most methods are not designed to cope with them
  – Bias resulting from differences between the missing and the complete data: data mining methods generate different models

Characterisation of missing values (Little & Rubin, 87)
• Three categories of missing values:
  – Missing completely at random (MCAR): the probability that an example has a missing value for an attribute depends on neither the observed data nor the missing data
  – Missing at random (MAR): the probability that an example has a missing value for an attribute depends on the observed data, but not on the missing data
  – Not missing at random (NMAR): the probability that an example has a missing value for an attribute depends on the missing values themselves
• Depending on the type of missing value, some handling methods will be suitable and others will not

Strategies for handling missing values (Farhangfar et al., 08)
• Discarding examples with missing values
  – Simplest approach; allows the use of unmodified data mining methods
  – Only practical if there are few examples with missing values.
Otherwise, it can introduce bias
• Convert the missing values into a new value
  – Use a special value for it
  – Add an attribute that indicates whether the value is missing or not
  – Greatly increases the difficulty of the data mining process
• Imputation methods
  – Assign a value to the missing one, based on the rest of the dataset
  – Allow the use of unmodified data mining methods

Imputation methods
• As they extract a model from the dataset to perform the imputation, they are suitable for MCAR and, to a lesser extent, MAR types of missing values
• Not suitable for the NMAR type of missing data
  – In that case it would be necessary to go back to the source of the data to obtain more information
• The next slides describe several imputation methods. References to the original papers presenting them are available at http://sci2s.ugr.es/MVDM/index.php

Do Not Impute (DNI)
• Simply use the default MV policy of the data mining method
• Only suitable if such a policy exists
• Example for rule learning: attributes with missing values are treated as irrelevant

Function match(instance x, rule r)
  Foreach attribute att in the domain
    If att is present in x and (x.att < r.att.lower or x.att > r.att.upper)
      Return false
    EndIf
  EndFor
  Return true

Most Common (MC) value
• If the missing value is continuous: replace it with the mean value of the attribute over the dataset
• If the missing value is discrete: replace it with the most frequent value of the attribute over the dataset
• Simple and fast to compute
• Assumes that each attribute follows a normal distribution

Concept Most Common (CMC) value
• Refinement of the MC policy
• The MV is replaced with the mean/most frequent value computed from the instances belonging to the same class
• Assumes that the distribution of an attribute over all instances of the same class is normal
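The MC and CMC policies can be sketched in a few lines of Python. This is a minimal illustration only, not KEEL's implementation; the function names and the use of None to mark a missing value are conventions chosen here, not part of the lecture:

```python
from statistics import mean, mode

def impute_most_common(values):
    """MC policy: fill None with the attribute mean (continuous)
    or the most frequent value (discrete), computed over the whole dataset."""
    observed = [v for v in values if v is not None]
    if all(isinstance(v, (int, float)) for v in observed):
        fill = mean(observed)   # continuous attribute: use the mean
    else:
        fill = mode(observed)   # discrete attribute: most frequent value
    return [fill if v is None else v for v in values]

def impute_concept_most_common(values, labels):
    """CMC policy: as MC, but the fill value is computed only from
    the instances belonging to the same class."""
    result = list(values)
    for i, v in enumerate(values):
        if v is None:
            same_class = [x for x, y in zip(values, labels)
                          if y == labels[i] and x is not None]
            if all(isinstance(x, (int, float)) for x in same_class):
                result[i] = mean(same_class)
            else:
                result[i] = mode(same_class)
    return result

# MC fills with the global mean; CMC fills with the class-conditional mean
print(impute_most_common([1.0, None, 3.0]))                          # mean of observed = 2.0
print(impute_concept_most_common([1.0, None, 5.0, 7.0],
                                 ['a', 'a', 'b', 'b']))              # class 'a' mean = 1.0
```

Note how the two policies disagree on the same column once class labels are taken into account, which is exactly the refinement the CMC slide describes.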
Imputation with K-Nearest Neighbour (KNNI)
• k-NN machine learning algorithm:
  – Given an unlabelled new instance
  – Select the k instances from the training set most similar to the new instance (similar according to, e.g., the Euclidean distance)
  – Predict the majority class among these k instances
• k-NN for MV imputation:
  – Select the k nearest neighbours
  – Replace the MV with the most frequent/mean value from these k instances

Weighted imputation with K-Nearest Neighbour (WKNNI)
• Refinement of KNNI
  – Select the k neighbours as before
  – The MV is now filled with a weighted average of the values for the missing attribute from these k neighbours
  – The closest of the k neighbours have more weight

K-means Clustering Imputation (KMI)
• Clustering: automatic aggregation of instances into groups
• K-means: given a dataset, it identifies k (a predefined parameter) groups (clusters) of similar instances
• For each cluster it computes the centroid
  – An artificial representative of the cluster
  – Mean/most frequent value of the instances in the cluster
• MV imputation:
  – Identify the cluster to which the instance with the MV belongs
  – Take the value of the centroid

Imputation with Fuzzy K-means Clustering (FKMI)
• Fuzzy logic: a reasoning framework that explicitly takes uncertainty into account
• In fuzzy k-means an instance does not simply belong to a cluster or not: it has a membership degree for each cluster
• Missing values are computed as a weighted sum of all centroids, using the membership degree for each cluster as the weight

Other methods
• Bayesian Principal Component Analysis (BPCA)
• Local Least Squares Imputation (LLSI)
• Event Covering (EC)
• Support Vector Machine Imputation (SVMI)
• Expectation Maximisation Imputation (EMI)
• Singular Value Decomposition Imputation (SVDI)
Details and references for these methods at http://sci2s.ugr.es/MVDM/index.php

And what is the global picture?
• In terms of attributes:
  – Methods that treat each attribute separately
  – Methods that take decisions from the whole record
  – Methods that consider a subset of attributes
• In terms of instances:
  – Imputation based on the complete instance set
  – Imputation based on a subset of similar records
• Methods that decompose the dataset and take decisions in a different space

Which method is the best?
• The literature is full of comparisons of methods, e.g.:
  – "Impact of imputation of missing values on classification error for discrete data"
  – "A Study on the Use of Imputation Methods for Experimentation with Radial Basis Function Network Classifiers Handling Missing Attribute Values: The good synergy between RBFs and EventCovering method"
  – "Missing value estimation for DNA microarray gene expression data: Local least squares imputation"

Resources
• List of web sites dedicated to missing values:
  http://sci2s.ugr.es/MVDM/index.php#four
• Bibliography on missing values:
  http://sci2s.ugr.es/MVDM/biblio.php
• Implementations of the methods described in this lecture are available in the KEEL package

Questions?
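Appendix: the KNNI policy from the lecture can be sketched in Python as well. This is a minimal sketch under stated assumptions, not KEEL's implementation: instances are lists of numbers with None marking the missing value, the distance is Euclidean over the attributes observed in both instances, and the function name and parameters are conventions chosen here:

```python
import math
from collections import Counter

def knn_impute(dataset, row, miss_idx, k=3):
    """KNNI: fill dataset[row][miss_idx] from its k nearest neighbours."""
    target = dataset[row]

    def dist(other):
        # Euclidean distance over attributes observed in both instances,
        # skipping the attribute being imputed (assumes numeric attributes)
        d = [(a - b) ** 2 for j, (a, b) in enumerate(zip(target, other))
             if j != miss_idx and a is not None and b is not None]
        return math.sqrt(sum(d)) if d else math.inf

    # candidate neighbours: instances where the missing attribute is observed
    candidates = [r for i, r in enumerate(dataset)
                  if i != row and r[miss_idx] is not None]
    neighbours = sorted(candidates, key=dist)[:k]
    vals = [n[miss_idx] for n in neighbours]
    if all(isinstance(v, (int, float)) for v in vals):
        return sum(vals) / len(vals)               # continuous: neighbour mean
    return Counter(vals).most_common(1)[0][0]      # discrete: majority value

# The last instance is missing attribute 1; its two nearest neighbours
# (on attribute 0) have values 2.0 and 1.0, so KNNI fills in their mean
data = [[1.0, 1.0], [1.1, 2.0], [9.0, 9.0], [1.2, None]]
print(knn_impute(data, 3, 1, k=2))   # mean of the 2 nearest neighbours
```

Extending this to WKNNI would only require replacing the plain mean with an average weighted by inverse distance, as the WKNNI slide describes.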