Missing Completely at Random

Missing values problem in Data Mining
JELENA STOJANOVIC
03/20/2014
Outline

Missing data problem
Missing values in attributes
 Missing values in target variable
 Missingness mechanisms


Approaches to Missing values
Eliminate Data Objects
 Estimate Missing Values
 Handling the Missing Value During Analysis

Experimental analysis
 Conclusion

Missing Data problem

Real datasets have many serious data quality problems:
they are incomplete, redundant, inconsistent and noisy

These problems reduce the performance of data mining algorithms

Missing data is a common issue in almost every real dataset.

Caused by varied factors:

high cost involved in measuring variables,

failure of sensors,

reluctance of respondents in answering certain questions or

an ill-designed questionnaire.
Missing values in datasets

The missing data problem
arises when values for one
or more variables are
missing from recorded
observations.
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         ?               No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
60K
Missing values in attributes (independent variables)
Missing labels
Missingness mechanism

Missing Completely At Random

Missing At Random

Missing Not At Random
Missing Completely at Random - MCAR




Missing Completely at Random - the
missingness mechanism does not depend
on the variable of interest or on any other
variable observed in the dataset.
Whether a value is observed is arbitrary
and does not depend on any variable of the
dataset.
Example: respondents decide whether to
reveal their income levels based on coin flips.
This type of missing data is very rarely
found, and the best method is to ignore
such cases (deleting them does not bias the estimates).
MCAR (continued)

Estimate E(X) from partially observed data (m = missing):
X* = [0, 1, m, m, 1, 1, m, 0, 0, m, …]
E(X) = ?

True data:
X = [0, 1, 0, 0, 1, 1, 0, 0, 0, 1, …]
E(X) = 0.5

Missingness indicator (1 = missing):
Rx = [0, 0, 1, 1, 0, 0, 1, 0, 0, 1, …]

If MCAR, use only the observed values of
X* = [0, 1, m, m, 1, 1, m, 0, 0, m, …]
and
E(X) = 3/6 = 0.5
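
The estimate above can be reproduced with a few lines of Python. This is only a minimal sketch: it uses the ten values shown on the slide, and encoding "m" as None is an assumption made for illustration.

```python
# Partially observed data from the slide; None stands for "m" (missing).
x_star = [0, 1, None, None, 1, 1, None, 0, 0, None]

# Missingness indicator Rx: 1 where the value is missing, 0 where it is observed.
r_x = [1 if v is None else 0 for v in x_star]

# Under MCAR the observed values are a random subsample of X,
# so their mean is a reasonable estimate of E(X).
observed = [v for v in x_star if v is not None]
estimate = sum(observed) / len(observed)

print(r_x)       # [0, 0, 1, 1, 0, 0, 1, 0, 0, 1]
print(estimate)  # 0.5
```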
Missing At Random - MAR

Missing at random - when the probability of an
instance having a missing value for an attribute may
depend on the known values, but not on the value
of the missing data itself;

Missingness can only be explained by variables that
are fully observed whereas those that are partially
observed cannot be responsible for missingness in
others; an unrealistic assumption in many cases.

Example: women in the population are more likely not to
reveal their age, so the percentage of missing
data among female individuals will be higher.
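
The age example can be simulated to see what MAR looks like in data. The sketch below is hypothetical: the population size, the age distribution and the two missingness probabilities (0.5 for women, 0.1 for men) are made-up values, chosen only so that missingness depends on the fully observed gender and not on age itself.

```python
import random

random.seed(0)

# Hypothetical population: gender is always observed, age is sometimes missing.
people = []
for _ in range(20000):
    gender = random.choice(["F", "M"])
    age = random.gauss(40, 10)
    # MAR: the chance of hiding the age depends only on the observed gender.
    p_missing = 0.5 if gender == "F" else 0.1
    observed_age = None if random.random() < p_missing else age
    people.append((gender, observed_age))

def missing_rate(g):
    """Fraction of individuals of gender g whose age is missing."""
    ages = [a for gen, a in people if gen == g]
    return sum(a is None for a in ages) / len(ages)

rate_f = missing_rate("F")
rate_m = missing_rate("M")
print(rate_f, rate_m)  # the rate for "F" comes out clearly higher
```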
Missing Not At Random - MNAR

When data are neither MCAR nor MAR

Missingness mechanism depends on another
partially observed variable

Situation in which the missingness mechanism
depends on the actual value of the missing data: the
probability of an instance having a missing value
for an attribute could depend on the value of
that attribute

Difficult task: the missingness itself has to be modeled
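
A small simulation illustrates why MNAR is harder than MCAR. It is a sketch under made-up assumptions: incomes are drawn uniformly between 20 and 200 (thousands), and the probability of hiding an income grows linearly with the income, up to 0.8. The mean of the observed values then underestimates the true mean.

```python
import random

random.seed(1)

# Hypothetical incomes in thousands.
incomes = [random.uniform(20, 200) for _ in range(20000)]

def p_missing(x):
    """MNAR: the probability of a missing value depends on the value itself."""
    return 0.8 * (x - 20) / 180

# Keep a value only if it is not hidden.
observed = [x for x in incomes if random.random() >= p_missing(x)]

true_mean = sum(incomes) / len(incomes)
observed_mean = sum(observed) / len(observed)
print(true_mean, observed_mean)  # the observed mean is noticeably lower
```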
Missing data consequences

They can significantly bias the outcome of research studies.

Response profiles of non-respondents and respondents can be significantly
different from each other.

Performing the analysis using only complete cases and
ignoring the cases with missing values can reduce the sample
size thereby substantially reducing estimation efficiency.

Many of the algorithms and statistical techniques are generally
tailored to draw inferences from complete datasets.

It may be difficult or even inappropriate to apply these algorithms and
statistical techniques on incomplete datasets.
Handling missing values


In general, methods to handle missing values belong
either to sequential methods (preprocessing methods)
or to parallel methods (methods in which missing
attribute values are taken into account during the main
process of acquiring knowledge).
Existing approaches:

Eliminate Data Objects or Attributes

Estimate Missing Values

Handling the Missing Values During Analysis
Eliminate data objects
Eliminating data attributes
Estimate Missing Values
most common / mean value
Imputation
Imputation - nearest neighbor
Handling the Missing Value During Analysis

Missing values are taken into account during the main process of acquiring
knowledge

Some examples:

Clustering - similarity between the objects is calculated using only the attributes
that do not have missing values.

C4.5 - splitting cases with missing attribute values into fractions and adding
these fractions to new case subsets.

CART - a method of surrogate splits to handle missing attribute values

Rule-based induction algorithms - missing values treated as "do not care" conditions

Pairwise deletion is used to evaluate statistical parameters from available
information

CRF - marginalizing out the effect of missing-label instances on labeled data
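
The clustering example can be sketched as a distance function that skips missing attributes. The function name partial_distance is hypothetical, and the rescaling by the number of usable attributes is an assumption added here so that distances based on different numbers of attributes stay comparable; the slides do not prescribe it.

```python
import math

def partial_distance(a, b):
    """Euclidean distance over the attributes observed (not None) in both objects."""
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not pairs:
        raise ValueError("no attribute is observed in both objects")
    # Assumption: rescale by (total attributes / usable attributes).
    scale = len(a) / len(pairs)
    return math.sqrt(scale * sum((x - y) ** 2 for x, y in pairs))

# Only the first attribute is observed in both objects.
d = partial_distance([1.0, None, 3.0], [4.0, 2.0, None])
print(d)
```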
Internal missing data strategy used by C4.5

C4.5 uses a probabilistic approach to handle missing data

C4.5:

Multiple split (Each node T can be partitioned into T1 , T2 … Tn subsets)

Evaluation measure: Information Gain ratio

If there exist missing values in an attribute X, C4.5 uses the subset with all
known values of X to calculate the information gain.

Once a test based on an attribute X is chosen, C4.5 uses a probabilistic
approach to partition the instances with missing values in X
Internal missing data strategy used by C4.5


When an instance in T with known value is assigned to a subset Ti,

probability of that instance belonging to subset Ti is 1

probability of that instance belonging to all other subsets is 0
C4.5 associates to each instance in Ti a weight representing the probability of
that instance belonging to Ti.

If the instance has a known value, and satisfies the test with outcome Oi, then this instance is
assigned to Ti with weight 1

If the instance has an unknown value, this instance is assigned to all partitions with different
weights for each one:

The weight for the partition Ti is the probability that instance belongs to Ti.

This probability is estimated as the sum of the weights of instances in T known to satisfy the
test with outcome Oi, divided by the sum of weights of the cases in T with known values on
the attribute X.
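
The weighting scheme described above can be sketched as follows. This is an illustration, not C4.5's actual code: the function name, the representation of instances as (row, weight) pairs and the use of None for an unknown value are assumptions, and at least one instance is assumed to have a known value on the attribute.

```python
from collections import defaultdict

def partition_with_weights(instances, attribute):
    """Split (row, weight) pairs on `attribute`; returns {outcome: [(row, weight), ...]}."""
    known = [(row, w) for row, w in instances if row[attribute] is not None]
    unknown = [(row, w) for row, w in instances if row[attribute] is None]

    subsets = defaultdict(list)
    for row, w in known:
        # Known value: the instance goes to exactly one subset and keeps its weight.
        subsets[row[attribute]].append((row, w))

    # P(outcome) = weight of known cases with that outcome / weight of all known cases.
    total_known = sum(w for _, w in known)
    probs = {o: sum(w for _, w in rows) / total_known for o, rows in subsets.items()}

    for row, w in unknown:
        # Unknown value: a fraction of the instance is sent to every subset.
        for outcome, p in probs.items():
            subsets[outcome].append((row, w * p))
    return dict(subsets)

data = [({"Refund": v}, 1.0) for v in ["Yes", "Yes", "Yes", "No", None]]
parts = partition_with_weights(data, "Refund")
weights = {o: [w for _, w in rows] for o, rows in parts.items()}
print(weights)  # {'Yes': [1.0, 1.0, 1.0, 0.75], 'No': [1.0, 0.25]}
```

The fractions of the instance with the unknown value sum to its original weight, so the total weight in the tree is preserved.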
Experimental Analysis*

Using error rates estimated by cross-validation, compare the performance of:

K-nearest neighbour algorithm as an imputation method

Mean or mode imputation method

Internal algorithms used by C4.5 and CN2 to learn with missing data

Missing values were artificially implanted at different rates (including
rates above 50%) and into different attributes

Data sets from UCI [10]: Bupa, Cmc, Pima and Breast
*G. Batista and M.C. Monard, "An Analysis of Four Missing Data Treatment Methods for Supervised Learning," Applied
Artificial Intelligence, vol. 17, pp. 519-533, 2003
Comparative results for the Breast data set
Comparative results for the Bupa data set
Comparative results for the Cmc data set
Comparative results for the Pima data set
Conclusion

Missing data is a huge data quality problem

Vast variety of causes of missingness

In general, there is no best, universal method of handling missing
values

Different types of missingness mechanism (MCAR, MAR, MNAR)
and datasets require different approaches of dealing with
missing values
Thank you for your attention!
Questions?
Homework problem:

1. List the types of missingness mechanisms. For each of
them, state one approach you think is appropriate for
handling it and briefly explain why.
Eliminate data objects or attributes


Eliminate objects with missing values (listwise deletion)

Simple and effective strategy

Even partially specified objects contain some information

If many objects have missing values, reliable analysis can be difficult or impossible

Unless data are missing completely at random, listwise deletion can bias the outcome.
Eliminate attributes that have missing values


Proceed carefully: these attributes may be critical for analysis
Listwise deletion and pairwise deletion are used in approximately 96% of studies in the
social and behavioral sciences.
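
Both elimination strategies can be sketched on three rows of the example table (Tid 5 to 7, where Tid 6 has no Taxable Income). Representing rows as dictionaries with None for a missing value is an assumption made for illustration.

```python
rows = [
    {"Tid": 5, "Refund": "No", "Marital Status": "Divorced", "Taxable Income": "95K", "Cheat": "Yes"},
    {"Tid": 6, "Refund": "No", "Marital Status": "Married", "Taxable Income": None, "Cheat": "No"},
    {"Tid": 7, "Refund": "Yes", "Marital Status": "Divorced", "Taxable Income": "220K", "Cheat": "No"},
]

# Listwise deletion: drop every object that has at least one missing value.
complete_rows = [r for r in rows if all(v is not None for v in r.values())]

# Attribute elimination: drop every attribute that is missing in at least one object.
kept = [k for k in rows[0] if all(r[k] is not None for r in rows)]
reduced_rows = [{k: r[k] for k in kept} for r in rows]

print([r["Tid"] for r in complete_rows])  # [5, 7]
print(kept)                               # ['Tid', 'Refund', 'Marital Status', 'Cheat']
```

The first strategy loses an object, the second loses the whole Taxable Income attribute, which is why the slide asks for care.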
Estimate Missing Values

Missing data can sometimes be estimated reliably using the values of the remaining cases or
attributes:

replacing a missing attribute value by the most common value of that attribute,

replacing a missing attribute value by the mean for numerical attributes,

assigning all possible values to the missing attribute value,

assigning to a missing attribute value the corresponding value taken from the closest case,

replacing a missing attribute value by a new value, computed from a new data set, considering
the original attribute as a decision (imputation)


For this strategy, machine learning algorithms are commonly used:

Unstructured (Decision trees, Naive Bayes, K-Nearest neighbors…)

Structured (Hidden Markov Models, Conditional Random Fields, Structured SVM…)
Some of these methods are more accurate but more computationally expensive, so different situations
require different solutions
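
Three of the simpler estimates listed above (mean, most common value, value from the closest case) can be sketched in plain Python. The function names and the toy data are hypothetical; the nearest-case version assumes that all attributes other than the target are numeric and fully observed.

```python
from collections import Counter

def impute_mean(values):
    """Replace None by the mean of the known numerical values."""
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]

def impute_mode(values):
    """Replace None by the most common known value."""
    known = [v for v in values if v is not None]
    mode = Counter(known).most_common(1)[0][0]
    return [mode if v is None else v for v in values]

def impute_nearest(rows, target):
    """Fill rows[i][target] with the value taken from the closest complete case."""
    donors = [r for r in rows if r[target] is not None]
    filled = []
    for r in rows:
        if r[target] is None:
            others = [k for k in r if k != target]
            # Squared Euclidean distance on the other attributes.
            nearest = min(donors, key=lambda d: sum((d[k] - r[k]) ** 2 for k in others))
            r = {**r, target: nearest[target]}
        filled.append(r)
    return filled

print(impute_mean([100, 70, None, 130]))                    # [100, 70, 100.0, 130]
print(impute_mode(["Single", "Married", None, "Single"]))   # ['Single', 'Married', 'Single', 'Single']
people = [{"age": 30, "income": 50}, {"age": 60, "income": 90}, {"age": 58, "income": None}]
print(impute_nearest(people, "income")[2])                  # {'age': 58, 'income': 90}
```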
Handling the Missing Value During Analysis

Missing attribute values are taken into account during the main process of acquiring knowledge

In clustering, similarity between the objects is calculated using only the attributes that do not have missing values.
Similarity in this case is only an approximation, but unless the total number of attributes is small or the number of
missing values is high, the degree of inaccuracy may not matter much.

C4.5 handles missing values while the decision tree is generated, splitting cases with missing attribute values into fractions
and adding these fractions to new case subsets.

A method of surrogate splits to handle missing attribute values was introduced in CART.

In a modification of the LEM2 (Learning from Examples Module, version 2) rule induction algorithm, rules are
induced from the original data set, with missing attribute values considered to be "do not care" conditions or
lost values.

In statistics, pairwise deletion is used to evaluate statistical parameters from available information:


to compute the covariance of variables X and Y, all those cases or observations in which both X and Y
are observed are used, regardless of whether other variables in the dataset have missing values.
In CRFs, the effect of missing-label instances on labeled data is marginalized out, thus utilizing the information of all
observations and preserving the observed graph structure.
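
Pairwise deletion for the covariance of X and Y can be sketched as below. The function name and the toy data are hypothetical, None marks a missing value, and the sample covariance with divisor n - 1 is assumed, so at least two cases must have both values observed.

```python
def pairwise_cov(xs, ys):
    """Sample covariance using every case in which both x and y are observed."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in pairs) / (n - 1)

x = [1, 2, None, 4]
y = [2, None, 6, 8]
cov_xy = pairwise_cov(x, y)  # uses only the cases (1, 2) and (4, 8)
print(cov_xy)                # 9.0
```

Because each pair of variables may use a different subset of cases, statistics computed this way are not guaranteed to be mutually consistent.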