Missing values problem in Data Mining
JELENA STOJANOVIC
03/20/2014

Outline
Missing data problem
Missing values in attributes
Missing values in target variable
Missingness mechanisms
Approaches to missing values
Eliminate Data Objects
Estimate Missing Values
Handling the Missing Value During Analysis
Experimental analysis
Conclusion

Missing Data problem
Real datasets have many serious data quality problems: incomplete, redundant, inconsistent and noisy data reduce the performance of data mining algorithms.
Missing data is a common issue in almost every real dataset.
It is caused by varied factors: the high cost of measuring variables, failure of sensors, reluctance of respondents to answer certain questions, or an ill-designed questionnaire.

Missing values in datasets
The missing data problem arises when values for one or more variables are missing from recorded observations.

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         (missing)       No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Missing values in attributes (independent variables)
Missing labels

Missingness mechanism
Missing Completely At Random
Missing At Random
Missing Not At Random

Missing Completely at Random - MCAR
Missing Completely at Random: the missingness mechanism does not depend on the variable of interest, or on any other variable observed in the dataset.
The data are collected and observed arbitrarily, and the collected data do not depend on any other variable of the dataset.
Example: respondents decide whether to reveal their income levels based on coin flips.
This type of missing data is very rarely found, and the best method is to ignore such cases.

MCAR (continued)
Estimate E(X) from partially observed data:
X* = [0, 1, m, m, 1, 1, m, 0, 0, m…]
True data: X = [0, 1, 0, 0, 1, 1, 0, 0, 0, 1…]
E(X) = ?
E(X) = 0.5
Rx = [0, 0, 1, 1, 0, 0, 1, 0, 0, 1…] (missingness indicator: 1 = missing)
If MCAR: X* = [0, 1, m, m, 1, 1, m, 0, 0, m…] and E(X) = 3/6 = 0.5

Missing At Random - MAR
Missing at random: the probability of an instance having a missing value for an attribute may depend on the known values, but not on the value of the missing data itself.
Missingness can only be explained by variables that are fully observed, whereas those that are partially observed cannot be responsible for missingness in others; this is an unrealistic assumption in many cases.
Example: women in the population are more likely not to reveal their age, so the percentage of missing data among female individuals will be higher.

Missing Not At Random - MNAR
When data are neither MCAR nor MAR.
The missingness mechanism depends on another partially observed variable.
A situation in which the missingness mechanism depends on the actual value of the missing data.
The probability of an instance having a missing value for an attribute could depend on the value of that attribute.
A difficult task: the missingness itself has to be modeled.

Missing data consequences
Missing data can significantly bias the outcome of research studies.
Response profiles of non-respondents and respondents can be significantly different from each other.
Performing the analysis using only complete cases and ignoring the cases with missing values can reduce the sample size, thereby substantially reducing estimation efficiency.
Many algorithms and statistical techniques are tailored to draw inferences from complete datasets.
It may be difficult or even inappropriate to apply these algorithms and statistical techniques to incomplete datasets.

Handling missing values
In general, methods to handle missing values belong either to sequential methods (preprocessing methods) or to parallel methods (methods in which missing attribute values are taken into account during the main process of acquiring knowledge).
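
The effect of the missingness mechanism on a simple estimate, as in the E(X) example above, can be illustrated with a small simulation. This is only a sketch under stated assumptions: the binary variable, the sample size and the missingness probabilities (0.4, 0.7, 0.1) are hypothetical values chosen for illustration, and the helper `observed_mean` is not part of the original slides.

```python
import random


def observed_mean(values, missing_prob):
    """Mean of the values that survive a missingness mechanism.

    missing_prob(x) returns the probability that value x goes missing.
    """
    observed = [x for x in values if random.random() >= missing_prob(x)]
    return sum(observed) / len(observed)


random.seed(0)

# Hypothetical binary variable X with true E(X) = 0.5.
x = [0, 1] * 50_000
true_mean = sum(x) / len(x)

# MCAR: every value is dropped with the same probability (a coin flip).
mcar_mean = observed_mean(x, lambda v: 0.4)

# MNAR: the chance of being missing depends on the missing value itself.
mnar_mean = observed_mean(x, lambda v: 0.7 if v == 1 else 0.1)

print(f"true {true_mean:.3f}  MCAR {mcar_mean:.3f}  MNAR {mnar_mean:.3f}")
```

Under MCAR the mean of the observed values stays close to the true mean, while under this MNAR setting it drifts toward 0.25, because ones are dropped far more often than zeros.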
Existing approaches:
Eliminate Data Objects or Attributes
Estimate Missing Values
Handling the Missing Values During Analysis

Eliminate data objects
Eliminating data attributes
Estimate Missing Values - most common / mean value
Imputation
Imputation - nearest neighbor

Handling the Missing Value During Analysis
Missing values are taken into account during the main process of acquiring knowledge.
Some examples:
Clustering - similarity between the objects is calculated using only the attributes that do not have missing values.
C4.5 - splitting cases with missing attribute values into fractions and adding these fractions to new case subsets.
CART - a method of surrogate splits to handle missing attribute values.
Rule-based induction algorithms - missing values treated as "do not care" conditions.
Pairwise deletion is used to evaluate statistical parameters from available information.
CRF - marginalizing out the effect of missing-label instances on labeled data.

Internal missing data strategy used by C4.5
C4.5 uses a probabilistic approach to handle missing data.
C4.5: multiple split (each node T can be partitioned into T1, T2 … Tn subsets).
Evaluation measure: information gain ratio.
If there are missing values in an attribute X, C4.5 uses the subset with all known values of X to calculate the information gain.
Once a test based on an attribute X is chosen, C4.5 uses a probabilistic approach to partition the instances with missing values in X.

Internal missing data strategy used by C4.5 (continued)
When an instance in T with a known value is assigned to a subset Ti:
the probability of that instance belonging to subset Ti is 1
the probability of that instance belonging to all other subsets is 0
C4.5 associates with each instance in Ti a weight representing the probability of that instance belonging to Ti.
If the instance has a known value and satisfies the test with outcome Oi, then this instance is assigned to Ti with weight 1.
If the instance has an unknown value, it is assigned to all partitions, with a different weight for each one:
The weight for the partition Ti is the probability that the instance belongs to Ti.
This probability is estimated as the sum of the weights of instances in T known to satisfy the test with outcome Oi, divided by the sum of the weights of the cases in T with known values of the attribute X.

Experimental Analysis*
Using error rates estimated by cross-validation, compare the performance of:
K-nearest neighbour algorithm as an imputation method
Mean or mode imputation method
Internal algorithms used by C4.5 and CN2 to learn with missing data
Missing values were artificially implanted, in different rates and attributes (more than 50%)
Data sets from UCI [10]: Bupa, Cmc, Pima and Breast
*G. Batista and M.C. Monard, "An Analysis of Four Missing Data Treatment Methods for Supervised Learning," Applied Artificial Intelligence, vol. 17, pp. 519-533, 2003

Comparative results for the Breast data set
Comparative results for the Bupa data set
Comparative results for the Cmc data set
Comparative results for the Pima data set

Conclusion
Missing data is a huge data quality problem.
There is a vast variety of causes of missingness.
In general, there is no best, universal method of handling missing values.
Different types of missingness mechanism (MCAR, MAR, MNAR) and different datasets require different approaches to dealing with missing values.

Thank you for your attention!
Questions?

Homework problem:
1. List the types of missingness mechanisms. State one approach you think would be appropriate for handling each of them and briefly explain why.
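
Returning to the internal C4.5 strategy described in the slides above, the weighting rule can be sketched in a few lines. This is a minimal illustration, not Quinlan's implementation: the function name `split_with_missing`, the dictionary record format and the use of `None` for a missing value are assumptions made for the example. Multiplying the fraction by the instance's current weight is assumed here as the way weights carry over when an already fractional instance is split again.

```python
from collections import defaultdict


def split_with_missing(instances, attribute):
    """Partition weighted instances on `attribute` using fractional weights.

    `instances` is a list of (record, weight) pairs; a missing value is None.
    Assumes at least one instance has a known value for `attribute`.
    Returns {outcome: [(record, weight), ...]}.
    """
    known = [(r, w) for r, w in instances if r[attribute] is not None]
    unknown = [(r, w) for r, w in instances if r[attribute] is None]

    # Instances with a known value go to exactly one subset, weight unchanged.
    subsets = defaultdict(list)
    for record, weight in known:
        subsets[record[attribute]].append((record, weight))

    # Probability of each outcome, estimated from the known cases only:
    # (weight of known cases with outcome Oi) / (weight of all known cases).
    total_known = sum(w for _, w in known)
    probs = {
        outcome: sum(w for _, w in items) / total_known
        for outcome, items in subsets.items()
    }

    # Instances with an unknown value go to every subset, with reduced weight.
    for record, weight in unknown:
        for outcome, p in probs.items():
            subsets[outcome].append((record, weight * p))
    return dict(subsets)


# Hypothetical node T: one "Yes", three "No", one missing value of Refund.
node = [({"refund": v}, 1.0) for v in ["Yes", "No", "No", "No", None]]
parts = split_with_missing(node, "refund")
weights = {o: sum(w for _, w in items) for o, items in parts.items()}
print(weights)
```

In this example the instance with the missing Refund value is sent to the "Yes" subset with weight 0.25 and to the "No" subset with weight 0.75, so the total weight of the node is preserved.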
Eliminate data objects or attributes
Eliminate objects with missing values (listwise deletion)
A simple and effective strategy.
Even partially specified objects contain some information.
If many objects have missing values, a reliable analysis can be difficult or impossible.
Unless data are missing completely at random, listwise deletion can bias the outcome.
Eliminate attributes that have missing values
Do this carefully: these attributes may be critical for the analysis.
Listwise deletion and pairwise deletion are used in approximately 96% of studies in the social and behavioral sciences.

Estimate Missing Values
Missing data can sometimes be estimated reliably using the values of the remaining cases or attributes:
replacing a missing attribute value by the most common value of that attribute,
replacing a missing attribute value by the mean for numerical attributes,
assigning all possible values to the missing attribute value,
assigning to a missing attribute value the corresponding value taken from the closest case,
replacing a missing attribute value by a new value, computed from a new data set, considering the original attribute as a decision (imputation).
Machine learning algorithms are commonly used for this strategy:
Unstructured (Decision trees, Naive Bayes, K-Nearest Neighbors…)
Structured (Hidden Markov Models, Conditional Random Fields, Structured SVM…)
Some of these methods are more accurate but more computationally expensive, so different situations require different solutions.

Handling the Missing Value During Analysis
Missing attribute values are taken into account during the main process of acquiring knowledge.
In clustering, the similarity between objects is calculated using only the attributes that do not have missing values. The similarity in this case is only an approximation, but unless the total number of attributes is small or the number of missing values is high, the degree of inaccuracy may not matter much.
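
The clustering approach described above, computing similarity from only the attributes observed in both objects, can be sketched as follows. The function name `partial_distance`, the choice of Euclidean distance and the use of `None` to mark a missing value are assumptions made for this illustration.

```python
import math


def partial_distance(a, b):
    """Euclidean distance over the attributes observed in both objects.

    Missing values are represented by None. Returns NaN when the two
    objects share no observed attribute.
    """
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return math.nan
    return math.sqrt(sum((x - y) ** 2 for x, y in shared))


# Hypothetical objects with four numerical attributes each.
obj_a = [1.0, None, 3.0, 4.0]
obj_b = [1.0, 2.0, None, 8.0]

# Only the first and last attributes are observed in both objects.
print(partial_distance(obj_a, obj_b))  # 4.0
```

Because each pair of objects may be compared on a different subset of attributes, the resulting distances are approximations, which is the inaccuracy the slide refers to.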
While generating a decision tree, C4.5 splits cases with missing attribute values into fractions and adds these fractions to new case subsets.
A method of surrogate splits to handle missing attribute values was introduced in CART.
In a modification of the LEM2 (Learning from Examples Module, version 2) rule induction algorithm, rules are induced from the original data set, with missing attribute values considered to be "do not care" conditions or lost values.
In statistics, pairwise deletion is used to evaluate statistical parameters from available information: to compute the covariance of variables X and Y, all those cases or observations in which both X and Y are observed are used, regardless of whether other variables in the dataset have missing values.
In CRFs, the effect of missing-label instances on the labeled data is marginalized out, thus utilizing the information of all observations and preserving the observed graph structure.
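
Pairwise deletion for the covariance of X and Y, as described above, can be sketched as follows. The function name `pairwise_cov`, the sample data and the use of `None` for a missing value are hypothetical; the sample covariance (dividing by n - 1) is an assumed choice of estimator.

```python
def pairwise_cov(x, y):
    """Sample covariance using only the cases where both x and y are observed."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two cases with both values observed")
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in pairs) / (n - 1)


# Hypothetical variables; None marks a missing value.
x = [1.0, 2.0, None, 3.0, 5.0]
y = [2.0, 4.0, 7.0, 6.0, None]

# Cases 3 and 5 are skipped for this pair only; the complete pairs are
# (1, 2), (2, 4) and (3, 6).
print(pairwise_cov(x, y))  # 2.0
```

A case that is dropped for this pair can still contribute to the covariance of another pair of variables, which is what distinguishes pairwise deletion from listwise deletion.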