APPLICATION OF EM ALGORITHM ON MISSING CATEGORICAL DATA ANALYSIS

NORAIM BINTI HASAN

A report submitted in partial fulfilment of the requirements for the award of the degree of Master of Science (Mathematics)

Faculty of Science
Universiti Teknologi Malaysia

DECEMBER 2009
To my beloved husband, son and all my family members
ACKNOWLEDGEMENT
In preparing this thesis, I was in contact with many people, researchers, academicians, and practitioners. They have contributed towards my understanding and thoughts. In particular, I wish to express my sincere appreciation to my thesis supervisor, Assoc. Prof. Dr. Ismail b. Mohamad, for encouragement, guidance, critics and friendship. Without his continued support and interest, this thesis would not have been the same as presented here. The librarians at UTM also deserve my special thanks for their assistance in supplying the relevant literature.

My colleagues should also be recognised for their support and the assistance provided on various occasions. Their views and tips were useful indeed. My sincere appreciation also extends to my beloved husband and son, my family and, not to be forgotten, my in-laws, for their understanding and sacrifices. Unfortunately, it is not possible to list all of them in this limited space.
ABSTRAK

The EM algorithm is one of the methods for solving problems involving incomplete data based on a complete framework. The EM algorithm is a parametric approach to finding the ML estimates for incomplete data. The algorithm consists of two steps: the first step, the Expectation step, better known as the E-step, finds the expectation of the loglikelihood, conditional on the data that can be observed and the current estimate, $\theta^{(r)}$. The second step, the Maximization step or M-step, maximizes the loglikelihood to find a new estimate of the parameters. The procedure alternates between these two steps until the parameter estimates are constant.
ABSTRACT

The Expectation-Maximization algorithm, or in short, the EM algorithm, is one of the methodologies for solving incomplete data problems sequentially based on a complete framework. The EM algorithm is a parametric approach to find the Maximum Likelihood (ML) parameter estimates for incomplete data. The algorithm consists of two steps. The first step, the Expectation step, better known as the E-step, finds the expectation of the loglikelihood, conditional on the observed data and the current parameter estimates, say $\theta^{(r)}$. The second step is the Maximization step, or M-step, which maximizes the loglikelihood to find new estimates of the parameters. The procedure alternates between the two steps until the parameter estimates converge to some fixed values.
TABLE OF CONTENTS

DECLARATION
DEDICATION
ACKNOWLEDGEMENT
ABSTRAK
ABSTRACT
TABLE OF CONTENTS
LIST OF TABLES
LIST OF SYMBOLS

1 INTRODUCTION
  1.1 Problem Statement
  1.2 Objective Of The Study
  1.3 Scope Of The Study
  1.4 Significance Of The Study

2 LITERATURE REVIEW
  2.1 Missing Data
    2.1.1 Classes of Missing Data
      2.1.1.1 Censored Data
      2.1.1.2 Latent Variable
      2.1.1.3 Non-Response Item
    2.1.2 Missing Data Mechanism
  2.2 The Expectation-Maximization Algorithm

3 RESEARCH METHODOLOGY
  3.1 Missing Data Patterns
  3.2 General Definition of Missingness Mechanism
  3.3 EM Theory in General
  3.4 Incomplete Contingency Table
    3.4.1 ML Estimation in Incomplete Contingency Table
    3.4.2 The EM Algorithm
      3.4.2.1 Multinomial Sampling
      3.4.2.2 Product Multinomial Sampling
      3.4.2.3 EM Algorithm to Determine the ML Estimates of Cell Probabilities in an Incomplete I × J Contingency Table: Data Missing on Both Categories
  3.5 Chi-Squared Test
    3.5.1 Goodness-of-fit Test
    3.5.2 Independence Test

4 RESULT AND DISCUSSION
  4.1 Data Construction
    4.1.1 Missing Completely At Random (MCAR)
    4.1.2 Missing At Random (MAR)
    4.1.3 Not Missing At Random (NMAR)
    4.1.4 The Chi-Squared Test

5 CONCLUSION AND RECOMMENDATION
  5.1 Conclusion
  5.2 Recommendation

REFERENCES
LIST OF TABLES

3.1  Classification of sample units in an incomplete I × J contingency table
3.2  Frequency distribution
3.3  The calculation of the χ² statistic
3.4  The observed frequency of category i
3.5  A two-way contingency table
3.6  A two-dimensional contingency table of joint events
4.1  An example dataset of full data: (a) continuous data; (b) categorical data
4.2  An example dataset for MCAR: (a) continuous data; (b) categorical data
4.3  An example dataset for MAR: (a) continuous data; (b) categorical data
4.4  An example dataset for NMAR: (a) continuous data; (b) categorical data
4.5  Full data
4.6  Artificial incomplete data for MCAR: (a) 10%; (b) 20%; (c) 30% data missing
4.7  Marginal totals of probabilities for MCAR with 10% data missing
4.8  Iteration of the EM algorithm for MCAR with 10% data missing
4.9  Complete data obtained by the EM algorithm for the 10% MCAR problem
4.10 MCAR with 20% of the data missing: (a) iteration of the EM algorithm; (b) complete data obtained by the EM algorithm
4.11 MCAR with 30% of the data missing: (a) iteration of the EM algorithm; (b) complete data obtained by the EM algorithm
4.12 Artificial incomplete data for MAR: (a) 10%; (b) 20%; (c) 30% data missing
4.13 MAR with 10% of the data missing: (a) iteration of the EM algorithm; (b) complete data obtained by the EM algorithm
4.14 MAR with 20% of the data missing: (a) iteration of the EM algorithm; (b) complete data obtained by the EM algorithm
4.15 MAR with 30% of the data missing: (a) iteration of the EM algorithm; (b) complete data obtained by the EM algorithm
4.16 Artificial incomplete data for NMAR: (a) 10%; (b) 20%; (c) 30% data missing
4.17 NMAR with 10% of the data missing: (a) iteration of the EM algorithm; (b) complete data obtained by the EM algorithm
4.18 NMAR with 20% of the data missing: (a) iteration of the EM algorithm; (b) complete data obtained by the EM algorithm
4.19 NMAR with 30% of the data missing: (a) iteration of the EM algorithm; (b) complete data obtained by the EM algorithm
4.20 The χ² calculation for full data
4.21 The χ² values for all cases
LIST OF SYMBOLS

$Y_{obs}$        The observed values
$Y_{mis}$        The missing values
$n$              Number of observations or total counts
$\hat{\theta}$   Estimate of $\theta$
$\theta^{(r)}$   Current estimate of $\theta$
$Y_{ij}$         The counts in cell $(i, j)$
$y_{ij}$         The observed value of $Y_{ij}$
$\pi_{ij}$       The probability that an observation falls in cell $(i, j)$
$\pi^{(r)}$      The $r$th estimate of $\pi$
$o_i$            Observed frequencies
$e_i$            Expected frequencies
CHAPTER 1
INTRODUCTION
1.1 PROBLEM STATEMENT

An incomplete table refers to a table in which the entries or information on one or more of the categorical variables are missing, a priori zero, or undetermined (Fienberg, 1980). Missing data treatment is an important data quality issue in data mining, data warehousing, and database management. Real-world data often have missing values.
The presence of missing values can cause serious problems when the data is used for reporting, information sharing, and decision support. First, data with missing values may provide biased information. For example, a survey question that relates to personal information is more likely to be left unanswered by those who are more sensitive about privacy. Second, many data modeling and analysis techniques cannot deal with missing values and have to discard a whole record if one of the attribute values is missing. Third, even though some data modeling and analysis tools can handle missing values, there are often restrictions on the domain of missing values. For example, classification systems typically do not allow missing values in the class attribute.
Missing data is often a major obstacle for researchers seeking to further their studies. Some researchers simply ignore, truncate, censor, or collapse over the missing data. This may make the problem easier, but it can lead to inappropriate conclusions and confusion. Therefore, a proper strategy should be used to treat such missing data.
1.2 OBJECTIVE OF THE STUDY

This research is carried out with the objectives listed below:

1) To apply the EM algorithm on the multinomial model in missing categorical data analysis.
2) To compare the results of the independence test for complete and incomplete data.
1.3 SCOPE OF THE STUDY

This study concentrates on contingency tables in which some values are missing, to which the EM algorithm is then applied. Only Missing At Random (MAR) data and Not Missing At Random (NMAR) data are considered in this study.
1.4 SIGNIFICANCE OF THE STUDY

The EM algorithm is expected to deal successfully with missing values in contingency tables; in other words, the missing values can be recovered by applying the EM algorithm. By the end of this study, we will also have examined a further dimension of the problem, namely the missingness mechanism, which has a direct impact on the missing values.
CHAPTER 2
LITERATURE REVIEW
Missing data analysis has been well studied, especially by Little and Rubin (2002), but incomplete categorical data analysis is still under study. In recent years, many researchers have been concerned with the analysis of incomplete categorical data. Xiao-Bai Li (2009) proposed a new Bayesian approach for estimating and replacing missing categorical data. With this approach, the posterior probabilities of a missing value belonging to a certain category are estimated using the simple Bayes method. Based on the estimated probabilities, two alternative methods for replacing the missing values are proposed. The first replaces the missing value with the value having the maximum probability; the second uses a value that is selected with probability proportional to the estimated posterior distribution. The approach is nonparametric and does not require prior knowledge about the distributions of the data. It is not tied to any specific data analysis/mining task and thus can be applied to a wide variety of tasks. A major problem is that the variability associated with the missing data is represented with bias when only the observed values are taken into account while the missing values of an attribute are not. In other words, the missing values are not taken into account in predicting the missing values. As a result, the statistical distribution of the data is altered and the quality of the data is affected.
2.1 Missing Data

Appropriate treatment of missing values is essential in all analyses and is critical in some, such as time series analysis. Inappropriate handling of missing values will distort analysis because, until proven otherwise, the researcher must assume that missing cases differ in analytically important ways from cases where values are at hand. That is, the problem with missing values is not so much reduced sample size as it is the possibility that the remaining data set is biased.
2.1.1 Classes of Missing Data

There are several classes of missing data problems in the statistical literature, each of which is unique.
2.1.1.1 Censored Data
A major object of study in statistics is survival analysis where interest centers on
the failure time of a group or groups of individuals. For example, in a clinical trial,
interest may center on the survival time of cancer patients from the time of receiving
chemotherapy treatment. In a life-testing experiment in industrial reliability, interest
centers on the lifetimes of machine components.
Survival times and lifetimes are known as failure times, the response variable in a survival analysis study. In the clinical trial above, some patients may opt to withdraw from the study or otherwise become unavailable before the experiment ends, so that their true survival times are not known.

In a life-testing experiment in industrial reliability, not all components in the study may have failed before the end of the study. The only information available is that the subject's survival time exceeds a certain value; the true survival time is not known. These incidents create the problem of missing values, and the incompleteness of the observations on the failure time is called censoring. This class of missing data is known as censored data.
2.1.1.2 Latent Variable
Some variables cannot be measured nor can they be observed directly although
some other observed measurable variables are thought to be related to the unobserved
variables. The observed variable is called a manifest variable and the unobserved
variable is known as a latent variable. A classic example of latent variable is intelligence
which is immeasurable but the Intelligence Quotient (IQ) test score is thought to reflect
one’s level of intelligence.
Another example of a latent variable is religious commitment which cannot be
measured but is thought to be related to the observed frequency of one’s performance of
religious rituals. Other examples of latent variables include stereotyping in sociology,
mathematics anxiety in education, and economic trust or confidence in economics. These latent variables are hypothetical constructs which are not measurable, but some of their effects on manifest variables are observable. In general, a latent variable model models the relationship between the manifest and the latent variables.
2.1.1.3 Non-response Item
Non-response item refers to the fact that due to fatigue, sensitivity, lack of
knowledge or other factors, respondents not infrequently leave particular items blank on
mail questionnaires or decline to give any response during interviews. This forces the
researchers to decide whether to leave cases with missing data out of analysis when data
are missing for a variable being analyzed, or if a value should be imputed for the case
and the blank replaced by the imputed value. Similar issues arise with archival data,
where the researcher may find no recorded data for certain values of certain records.
Whereas a latent variable is entirely missing, non-response creates an incomplete
data set with gaps in the data matrix. This class of missing data deprives us of the
familiar data structure. The non-response problem is a straightforward problem faced by many practicing statisticians and the non-statistical community, such that a three-volume work addressing this issue was written by Madow et al. (1983). It is easy to understand why this problem has received such attention, especially in the United States and the United Kingdom; governments have an interest in clean and reliable official statistics.
2.1.2 Missing Data Mechanism

The occurrence of missing data is caused by certain mechanisms. Three different mechanisms that cause missing data are distinguished by Rubin (1976). Missing Completely At Random (MCAR) exists when missing values are randomly distributed across all observations. In this case, the probability of an observation being missing does not depend on the data values. This means that each item of the data has the same probability of being missing.
Missing At Random (MAR) is a condition which exists when missing values are not randomly distributed among all observations but are randomly distributed within one or more subsamples. The probability of an observation being missing depends on the observed values but not on the missing values.
The other mechanism, called Not Missing At Random (NMAR) or non-ignorable missingness, is the most problematic form. It exists when missing values are not randomly distributed across observations and the probability of missingness cannot be predicted from the variables in the model.
To understand the missingness mechanism better, suppose we consider a case in
which we are interested in studying the relationship between age and income where
subjects are chosen to participate in the study. Suppose all n measurements of age are
fully observed but some measurements of income are missing.
The missing income data are MCAR if the probability of being missing does not
depend on the values of age or income, that is the missing income values are not related
to age or income. The missing data are MAR if the probability of being missing depends
on age values and not on income values, which means the missing income values are
related to age values. The missing income data are NMAR if the probability of being
missing depends on the values of income. Diggle and Kenward (1994) introduced the term "informative drop-out" for non-ignorable drop-out in longitudinal data analysis.

Other situations, where the missing data are MAR depending on an outside variable, i.e. a variable that is not in the study, or depending on a combination of two or more variables, were observed by Kim & Curry (1977) and Roth (1994).
2.2 The Expectation-Maximization Algorithm

A modern statistical procedure for dealing with missing data, called the Expectation-Maximization Algorithm, or EM Algorithm in short, is an efficient iterative procedure for computing Maximum Likelihood (ML) estimates in the presence of missing or hidden data, and it is based on an old ad-hoc idea. The idea is to impute estimated values where there are missing values, estimate the parameters, re-estimate the missing values assuming the new parameter estimates are the true ones, and then re-estimate the parameters. This sequence is repeated until the parameter estimates converge to some stationary values.
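As an illustration of this impute, estimate, re-estimate cycle, the sketch below applies it to a toy problem: ML estimation of the mean and variance of a normal sample in which some values are missing. The names and data here are illustrative only; the categorical-data version used in this study is derived in Chapter 3.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=10.0, scale=2.0, size=100)
lost = rng.random(100) < 0.2                 # MCAR: 20% of the values are lost
y_obs = np.where(lost, np.nan, y)

mu, sigma2 = 0.0, 1.0                        # provisional starting values
for _ in range(100):
    # Impute: replace each missing value by its expectation given the
    # current estimates (for the second moment, E[y^2] = mu^2 + sigma2)
    y1 = np.where(np.isnan(y_obs), mu, y_obs)
    y2 = np.where(np.isnan(y_obs), mu**2 + sigma2, y_obs**2)
    # Re-estimate the parameters from the completed data
    mu_new, sigma2_new = y1.mean(), y2.mean() - y1.mean()**2
    if abs(mu_new - mu) + abs(sigma2_new - sigma2) < 1e-12:
        break                                # estimates have become stationary
    mu, sigma2 = mu_new, sigma2_new

print(mu, sigma2)
```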
This approach is called the missing information principle by Orchard and Woodbury
(1972). Even though this approach had been proposed as early as 1926 by McKendrick,
it was not until 1977 that it was presented in its general form by Dempster, Laird and
Rubin and formally called the EM Algorithm. This influential work started a new area
of EM application in many statistical problems including factor analysis and survival
analysis.
Hartley and Hocking (1971) advocated an algorithm for directly deriving a likelihood equation from an incomplete data matrix and determining a standard solution based on the scoring method. On the other hand, Orchard and Woodbury (1972), Beal and Little (1975), and Sundberg (1976) derived a method for finding the maximum likelihood solution based on an algorithm which later became generally known as the EM algorithm through Dempster, Laird, and Rubin (DLR, 1977).
Rubin (1991) regarded the EM algorithm as one of the methodologies for solving incomplete data problems sequentially based on a complete framework. The EM algorithm is a parametric approach to find the ML parameter estimates for incomplete data. The algorithm consists of two steps. The first step, the Expectation step, better known as the E-step, finds the expectation of the loglikelihood, conditional on the observed data and the current parameter estimates, say $\theta^{(r)}$. The second step is the Maximization step, or M-step, which maximizes the loglikelihood to find new estimates of the parameters. The procedure alternates between the two steps until the parameter estimates converge to some fixed values.
CHAPTER 3
RESEARCH METHODOLOGY
3.1 Missing Data Patterns

Standard statistical methods are developed to deal with complete data matrices such as

$$Y = \begin{bmatrix} y_{11} & y_{12} & \cdots & y_{1p} \\ y_{21} & y_{22} & \cdots & y_{2p} \\ \vdots & \vdots & & \vdots \\ y_{n1} & y_{n2} & \cdots & y_{np} \end{bmatrix}$$

where all entries of the matrix are observed. This matrix can be represented as $Y = (Y_{obs})$, where $Y_{obs}$ represents the observed values. When some values are missing, some of the entries required for the complete matrix are no longer observed, and the result is an incomplete data matrix. A hypothetical complete data matrix can be written as $Y = (Y_{obs}, Y_{mis})$, where $Y_{mis}$ represents the missing values.

A typical incomplete data matrix looks like the matrix below, where $*$ represents a missing value and each row vector represents a unit:

$$Y = \begin{bmatrix} y_{11} & y_{12} & y_{13} & y_{14} \\ y_{21} & y_{22} & y_{23} & * \\ y_{31} & * & * & y_{34} \\ y_{41} & y_{42} & * & * \\ y_{51} & y_{52} & * & * \end{bmatrix}$$

In this particular example of an incomplete data matrix $Y$, variable $Y_1$ is fully observed and is more completely observed than variable $Y_2$. Variable $Y_2$ is more completely observed than variables $Y_3$ and $Y_4$. Variables $Y_1$, $Y_2$ and $Y_3$ form a monotone pattern of missing data when variable $Y_4$ is discarded, and the resulting matrix is given as

$$\begin{bmatrix} y_{11} & y_{12} & y_{13} \\ y_{21} & y_{22} & y_{23} \\ y_{31} & * & * \\ y_{41} & y_{42} & * \\ y_{51} & y_{52} & * \end{bmatrix}$$

Discarding unit three, which is the third row of the incomplete data matrix, also forms another monotone missing data pattern, formed by variables $Y_1$, $Y_2$, $Y_3$ and $Y_4$, and the resulting data matrix is given as

$$\begin{bmatrix} y_{11} & y_{12} & y_{13} & y_{14} \\ y_{21} & y_{22} & y_{23} & * \\ y_{41} & y_{42} & * & * \\ y_{51} & y_{52} & * & * \end{bmatrix}$$
When a row vector contains some missing values the incomplete data is known
as item non-response. When the row vector contains only missing values, it is known as
unit non-response. Clearly unit non-response reduces the whole sample size and item
non-response reduces the sample size for the corresponding variable. Another obvious
effect of missing data is the presence of gaps in the data matrix. Statistical packages like
GLIM, SAS, SPSS and MINITAB do not recognize these gaps and will opt to work on
the complete units only, thus reducing the size of the sample which also means throwing
away some information contained in the incomplete units.
One effect of this treatment of missing data can be non-response bias. The non-respondents may possess certain characteristics that make them different from the respondents. This makes the two groups distinguishable from each other. Treating both groups as if they are the same will give biased results: bias due to non-response.
3.2 General Definition of Missingness Mechanism

Let $Y$ denote a hypothetical complete $(n \times p)$ data matrix of $n$ observations on $p$ variables and $R$ an $(n \times p)$ missingness indicator matrix, such that $r_{ij} = 1$ if $y_{ij}$ is missing and $r_{ij} = 0$ if $y_{ij}$ is present. Suppose that a distribution for $Y$ is $f(Y \mid \theta)$, indexed by parameter $\theta$, and a distribution for $R$ given $Y$ is $f(R \mid Y, \psi)$, indexed by parameter $\psi$. Rewriting $Y = (Y_{obs}, Y_{mis})$, where $Y_{obs}$ represents the observed values of $Y$ and $Y_{mis}$ represents the missing values, $f(R \mid Y, \psi)$ can be written as $f(R \mid Y_{obs}, Y_{mis}, \psi)$.

Definition 3.1 The missing data are MCAR if $f(R \mid Y_{obs}, Y_{mis}, \psi) = f(R \mid \psi)$.

Definition 3.2 The missing data are MAR if $f(R \mid Y_{obs}, Y_{mis}, \psi) = f(R \mid Y_{obs}, \psi)$.

Definition 3.3 The missing data are NMAR if $f(R \mid Y_{obs}, Y_{mis}, \psi)$ depends on $Y_{mis}$, so that no simplification of $f(R \mid Y_{obs}, Y_{mis}, \psi)$ is possible.

These definitions are due to Rubin (1976). The MCAR situation implies that missingness depends neither on the observed values nor on the missing values of $Y$. MAR implies that missingness does not depend on the missing values of $Y$, and NMAR implies that missingness depends on the missing values of $Y$. Rubin (1976) further showed that if the missing data are MCAR or MAR and $\theta$ and $\psi$ are distinct, the likelihood inference for $\theta$ can be based on the likelihood obtained by integrating $Y_{mis}$ out of $f(Y_{obs}, Y_{mis} \mid \theta)$, without including a model for the missing data mechanism. Under this condition, the missing-data mechanism is termed ignorable. When the missing data are NMAR, maximum likelihood estimation requires a model for the missing-data mechanism. This situation is termed non-ignorable. In practice, the type of missingness mechanism is often assumed, but Little (1988) provides a method to test whether the missingness mechanism is MCAR or not.
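To make Definitions 3.1 to 3.3 concrete, the following sketch (illustrative names and synthetic data, not taken from the thesis) generates a missingness indicator for an income variable under each of the three mechanisms, for an age/income sample like the one analysed in Chapter 4:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
age = rng.uniform(20, 70, n)
income = 1000 + 40 * age + rng.normal(0, 300, n)

# MCAR: P(r = 1) is a constant, free of both age and income
r_mcar = rng.random(n) < 0.2
# MAR: P(r = 1) depends only on the fully observed variable, age
r_mar = rng.random(n) < np.where(age > 55, 0.5, 0.05)
# NMAR: P(r = 1) depends on the value that would be missing, income
r_nmar = rng.random(n) < np.where(income > np.median(income), 0.4, 0.05)

for name, r in [("MCAR", r_mcar), ("MAR", r_mar), ("NMAR", r_nmar)]:
    print(name, "mean of observed income:", round(income[~r].mean(), 1))
# Under MCAR the observed mean stays roughly unbiased; under MAR and
# NMAR it drifts away from the mean of the full sample.
```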
3.3 EM Theory in General

Rubin (1991) regarded the EM algorithm as one of the methodologies for solving incomplete data problems sequentially based on a complete framework. The idea on which it is based is simple, as summarized in the steps below, assuming that $Y_{obs}$ is the observed data portion and $Y_{mis}$ is the missing data portion:

1) If the problem is so difficult that the solution cannot be derived immediately just from the data at hand, $Y_{obs}$, make the data "complete" to the extent that the problem becomes easy to formulate (assuming that the missing data portion $Y_{mis}$ exists).
2) For example, if the objective for the time being is to derive the estimate of parameter $\theta$, which is $\hat{\theta}$, enter a provisional value into $Y_{mis}$ to determine $\hat{\theta}$.
3) Improve $\hat{\theta}$ using $Y_{obs}$ and enter the value into $Y_{mis}$.
4) Repeat the aforementioned two steps until the value of $\hat{\theta}$ converges.

Let $Y = (Y_{obs}, Y_{mis})$ denote the hypothetical complete data, where $Y_{obs}$ represents the observed values of $Y$ and $Y_{mis}$ represents the missing values. Suppose that a model exists for $Y$ with probability density $f(Y \mid \theta)$, where $\theta = (\theta_1, \theta_2, \ldots, \theta_d)$. On factoring this density we would get

$$f(Y \mid \theta) = f(Y_{obs}, Y_{mis} \mid \theta) = f(Y_{obs} \mid \theta) \cdot f(Y_{mis} \mid Y_{obs}, \theta) \qquad (3.1)$$

where $f(Y_{obs} \mid \theta)$ is the density of the observed data and $f(Y_{mis} \mid Y_{obs}, \theta)$ is the density of the missing data given the observed data. For the missing data problem, Dempster, Laird and Rubin (1977) assume that:

1) the parameters to be estimated are independent of the missing data process, and
2) the missing data are missing at random.
The loglikelihood that corresponds to Equation 3.1 is

$$\ell(\theta \mid Y_{obs}, Y_{mis}) = \ell(\theta \mid Y_{obs}) + \log[f(Y_{mis} \mid Y_{obs}, \theta)] \qquad (3.2)$$

where $\ell(\theta \mid Y_{obs}, Y_{mis})$ is referred to as the complete-data loglikelihood, $\ell(\theta \mid Y_{obs})$ is referred to as the observed-data loglikelihood, and $\log[f(Y_{mis} \mid Y_{obs}, \theta)]$ is the missing part of the complete-data loglikelihood. The purpose is to estimate $\theta$ by maximizing $\ell(\theta \mid Y_{obs})$ with respect to $\theta$. Rearranging Equation 3.2, we obtain

$$\ell(\theta \mid Y_{obs}) = \ell(\theta \mid Y_{obs}, Y_{mis}) - \log[f(Y_{mis} \mid Y_{obs}, \theta)] \qquad (3.3)$$

Taking the expectation of Equation 3.3 over the distribution of the missing data given $Y_{obs}$ and a current estimate of $\theta$, say $\theta^{(r)}$, gives Equation 3.4 below:

$$\ell(\theta \mid Y_{obs}) = \int \ell(\theta \mid Y_{obs}, Y_{mis})\, f(Y_{mis} \mid Y_{obs}, \theta^{(r)})\, dY_{mis} - \int [\log f(Y_{mis} \mid Y_{obs}, \theta)]\, f(Y_{mis} \mid Y_{obs}, \theta^{(r)})\, dY_{mis} \qquad (3.4)$$

By letting

$$Q(\theta \mid \theta^{(r)}) = \int \ell(\theta \mid Y_{obs}, Y_{mis})\, f(Y_{mis} \mid Y_{obs}, \theta^{(r)})\, dY_{mis}$$

and

$$H(\theta \mid \theta^{(r)}) = \int [\log f(Y_{mis} \mid Y_{obs}, \theta)]\, f(Y_{mis} \mid Y_{obs}, \theta^{(r)})\, dY_{mis},$$

Equation (3.3) can be written as

$$\ell(\theta \mid Y_{obs}) = Q(\theta \mid \theta^{(r)}) - H(\theta \mid \theta^{(r)})$$

Jensen's inequality (Rao, 1972) states that if $g(x)$ is convex, then $E[g(X)] \geq g(E[X])$, which implies that

$$H(\theta^{(r)} \mid \theta^{(r)}) \geq H(\theta \mid \theta^{(r)})$$
Taking $\theta^{(0)}$ as a starting point, the EM algorithm performs a sequence of iterations to maximize $\ell(\theta \mid Y_{obs})$. The new estimate of $\theta$, say $\theta^{(1)}$, is a function of $\theta^{(0)}$, that is, $\theta^{(1)} = M(\theta^{(0)})$. In general, the successive estimate $\theta^{(r+1)} = M(\theta^{(r)})$ is a function of the previous estimate $\theta^{(r)}$ for some function $M(\cdot)$. The difference in values of $\ell(\theta \mid Y_{obs})$ from the previous iteration is thus

$$\ell(\theta^{(r+1)} \mid Y_{obs}) - \ell(\theta^{(r)} \mid Y_{obs}) = \left[Q(\theta^{(r+1)} \mid \theta^{(r)}) - Q(\theta^{(r)} \mid \theta^{(r)})\right] - \left[H(\theta^{(r+1)} \mid \theta^{(r)}) - H(\theta^{(r)} \mid \theta^{(r)})\right] \qquad (3.5)$$

The score of the observed-data loglikelihood is

$$S(\theta) = \frac{\partial \ell(\theta \mid Y_{obs})}{\partial \theta} = \int \frac{\partial \ell(\theta \mid Y_{obs}, Y_{mis})}{\partial \theta}\, f(Y_{mis} \mid Y_{obs}, \theta^{(r)})\, dY_{mis} - \int \frac{\partial \log f(Y_{mis} \mid Y_{obs}, \theta)}{\partial \theta}\, f(Y_{mis} \mid Y_{obs}, \theta^{(r)})\, dY_{mis} \qquad (3.6)$$

so that

$$S(\theta) = \frac{\partial Q(\theta \mid \theta^{(r)})}{\partial \theta} - \int \frac{\partial \log f(Y_{mis} \mid Y_{obs}, \theta)}{\partial \theta}\, f(Y_{mis} \mid Y_{obs}, \theta^{(r)})\, dY_{mis} \qquad (3.7)$$

Since $\int f(Y_{mis} \mid Y_{obs}, \theta)\, dY_{mis} = 1$, and the second part of the right hand side of Equation (3.7), evaluated at $\theta = \theta^{(r)}$, is the expected score of the conditional distribution of the missing data, which, as can be proven, equals zero, we have

$$S(\theta^{(r)}) = \left.\frac{\partial Q(\theta \mid \theta^{(r)})}{\partial \theta}\right|_{\theta = \theta^{(r)}}$$

Maximizing $Q(\theta \mid \theta^{(r)})$ is usually easier than maximizing the loglikelihood of the incomplete data, $\ell(\theta \mid Y_{obs})$. The EM algorithm does just that, by maximizing $Q(\theta \mid \theta^{(r)})$ with respect to $\theta$.
When the complete data Y has a distribution from the regular exponential family, the following theorem (Mood et al., 1974) is useful in deriving the EM algorithm.
Theorem 3.1 Let $X_1, X_2, \cdots, X_n$ be a random sample from a density $f(x; \theta_1, \theta_2, \cdots, \theta_k)$. If

$$f(x; \theta_1, \cdots, \theta_k) = a(\theta_1, \cdots, \theta_k)\, b(x) \exp\left[\sum_{j=1}^{k} c_j(\theta_1, \cdots, \theta_k)\, t_j(x)\right],$$

that is, $f(x; \theta_1, \cdots, \theta_k)$ is a member of the $k$-parameter exponential family, then

$$\left(\sum_{i=1}^{n} t_1(x_i), \cdots, \sum_{i=1}^{n} t_k(x_i)\right)$$

is a minimal set of jointly complete and sufficient statistics.

The regular exponential family distribution is defined by

$$f(y \mid \theta) = \exp\left[\theta' t(y) + c(\theta) + d(y)\right] \qquad (3.8)$$

where $t(y)$ denotes the vector of complete-data sufficient statistics, $\theta$ denotes a parameter vector, $c$ is a function of $\theta$, and $d$ is a function of $y$.

Many well-known distributions, such as the Binomial, Multinomial, Poisson, Gamma, Normal, and Multivariate normal distributions, belong to the regular exponential family. In this case, the E-step of the algorithm finds

$$Q(\theta \mid \theta^{(r)}) = E\left[\theta' t(y) + c(\theta) + d(y) \mid Y_{obs}, \theta^{(r)}\right] = \theta'\, E[t(y) \mid Y_{obs}, \theta^{(r)}] + E[d(y) \mid Y_{obs}, \theta^{(r)}] + c(\theta).$$

Thus, the E-step reduces to estimating the complete-data sufficient statistics $t(y)$, assuming the current parameter estimates are the true values of the parameters. These expected sufficient statistics are then used to find the ML parameter estimates in the M-step. However, the EM algorithm does not give the standard errors automatically, as can be seen below.
Let $y$ denote the complete data and $x$ denote the observed data. Given $x$, the complete data $y$ lies in a set $Y(x)$. The likelihood, $L$, is given by

$$L = \int_{Y(x)} f(y)\, dy$$

Then, the loglikelihood is

$$\ell = \log \int_{Y(x)} f(y)\, dy$$

At the M-step of the EM algorithm, we maximize

$$Q = \int_{Y(x)} k(y) \log f(y)\, dy$$

where $k(y)$ uses the parameter estimates from the previous iteration. Now consider the derivatives with respect to the parameters $\beta$:

$$\frac{\partial \ell}{\partial \beta} = \frac{1}{\int_{Y(x)} f(y)\, dy} \int_{Y(x)} \frac{\partial f(y)}{\partial \beta}\, dy = \int_{Y(x)} \frac{f(y)}{\int_{Y(x)} f(y)\, dy} \cdot \frac{1}{f(y)} \frac{\partial f(y)}{\partial \beta}\, dy$$

$$\frac{\partial Q}{\partial \beta} = \int_{Y(x)} k(y)\, \frac{1}{f(y)} \frac{\partial f(y)}{\partial \beta}\, dy$$

Thus, at the ML solution, $k(y) = f(y) / \int_{Y(x)} f(y)\, dy$, such that

$$\frac{\partial Q}{\partial \beta} \propto \frac{\partial \ell}{\partial \beta}$$

Thus, a solution of $\partial Q / \partial \beta = 0$ is the ML solution. Note that, in principle, there may be other solutions. Dempster et al. (1977) showed that the procedure converges to the ML solution.
However:

1. The second derivatives of $Q$ and $\ell$ are not the same, so that using the second derivatives from the EM procedure does not yield the correct asymptotic covariance matrix of the parameter estimates.
2. $Q$ and $\ell$ are not simply related, so that the correction of the second derivatives is not trivial.

The above derivation is valid for the Missing Completely At Random (MCAR) situation, in which the probability of the missing data pattern does not depend on $y$ or $\beta$.
To extend this to the more general cases, we consider a different representation of the data. Let $r$ be a missing data indicator vector, such that $r_k = 1$ if the $k$th observation is missing or incomplete, and $0$ otherwise. Let the complete data be $(y, r)$ and the incomplete data $(x, r)$. Now we consider a missing data mechanism defined by the conditional distribution of $r$ given $y$, with parameters $\psi$. The complete data likelihood can be written as

$$L(\beta, \psi) = f(y; \beta)\, f(r \mid y; \psi)$$

Note that the complete data situation includes the complete observations $y$, the indicators of which will be missing in the incomplete data.

In the missing at random, MAR, case we assume that the conditional distribution of $r$ depends only on $x$, so that

$$L(\beta, \psi) = f(y; \beta)\, f(r \mid x; \psi)$$

Thus,

$$\ell = \log \int_{Y(x)} L\, dy = \log \int_{Y(x)} f(y; \beta)\, f(r \mid x; \psi)\, dy = \log f(r \mid x; \psi) + \log \int_{Y(x)} f(y; \beta)\, dy$$

It is clear that for inference about $\beta$ the first term can be ignored, and we are in the same situation as in the previous section.
3.4 Incomplete Contingency Table

An incomplete contingency table, also called missing categorical data, is a contingency table where information on one or more of the categorical variables is missing. It is assumed that the data are MAR and the missing data mechanism is ignorable. We will discuss ML estimation of cell probabilities in an incomplete contingency table using all the observed data, including data where information on one or more of the categorical variables is missing. Lipsitz, Parzen and Molenberghs (1998) use the Poisson generalized linear model to obtain ML estimates of cell probabilities for the saturated loglinear model, whilst Little and Rubin (1988) describe and use the EM algorithm to determine the ML estimates of cell probabilities for any loglinear model.
3.4.1 ML Estimation in Incomplete Contingency Table

Consider an $I \times J$ contingency table with categorical variables $Y_1 = \{1, 2, \cdots, I\}$ and $Y_2 = \{1, 2, \cdots, J\}$. A multinomial sampling procedure is assumed. Let $Y_{ij}$ be the count in cell $(i, j)$, $y_{ij}$ the observed value of $Y_{ij}$, and $n = \sum_i \sum_j Y_{ij}$ the total count. The counts in each cell can be arranged to form the complete data vector

$$Y = (Y_{11}, Y_{12}, \cdots, Y_{IJ})'$$

with $E[Y] = \mu$, the vector of expected counts.
If information on one or both of the categories is missing, the contingency table
is said to be incomplete. The data to be classified in the contingency table can be split
into two parts, namely:
1. The fully classified cases, where the information on all of the categories is
available, and
2. The partially classified cases, where information on some of the categories is
missing.
It is assumed that the data are MAR and the missing data mechanism is ignorable.
3.4.2 The EM Algorithm

3.4.2.1 Multinomial Sampling

If the probability that an observation falls in cell $(i, j)$ is $\pi_{ij}$, where $\pi_{ij} \geq 0$ and $\sum_i \sum_j \pi_{ij} = 1$, then the complete data $Y$ have a multinomial distribution,

$$Y \sim \mathrm{Mult}(n; \pi_{11}, \pi_{12}, \cdots, \pi_{IJ})$$

with probability

$$f(y \mid \pi) = \frac{n!}{y_{11}! \cdots y_{IJ}!} \prod_i \prod_j \pi_{ij}^{y_{ij}} \qquad (3.9)$$

where $\pi = (\pi_{11}, \pi_{12}, \cdots, \pi_{IJ})$. The kernel of the complete data log-likelihood is

$$\ell(\pi \mid y) = y_{11} \log \pi_{11} + y_{12} \log \pi_{12} + \cdots + y_{IJ} \log \pi_{IJ}$$

The cell counts, $y_{ij}$, are the sufficient statistics, and the MLE of $\pi_{ij}$ is

$$\hat{\pi}_{ij} = \frac{y_{ij}}{n}$$
3.4.2.2 Product Multinomial Sampling

Let $n_i = \sum_j y_{ij}$ be the total count in row $i$ and $\pi_{i+} = \sum_j \pi_{ij}$ be the probability that an element falls in row $i$. If the $n_i$ elements of row $i$ are independent, each having a probability distribution $\pi_{ij}/\pi_{i+}$ for $j = 1, 2, \cdots, J$, then, given the row total $n_i$ and the vector of cell probabilities $\pi$, the elements of row $i$ have a multinomial distribution

$$(Y_{i1}, Y_{i2}, \cdots, Y_{iJ}) \mid n_i, \pi \sim \mathrm{Mult}\left(n_i; \frac{\pi_{i1}}{\pi_{i+}}, \cdots, \frac{\pi_{iJ}}{\pi_{i+}}\right) \qquad (3.10)$$

When samples from different rows are independent, the joint probability function for the entire data set is the product of $I$ multinomial probability functions,

$$f(y \mid n_1, n_2, \cdots, n_I, \pi) = \prod_{i=1}^{I} \frac{n_i!}{y_{i1}! \, y_{i2}! \cdots y_{iJ}!} \prod_{j=1}^{J} \left(\frac{\pi_{ij}}{\pi_{i+}}\right)^{y_{ij}}$$

Similarly, if the column totals $n_j$ are fixed, then the elements of column $j$ have a multinomial distribution

$$(Y_{1j}, Y_{2j}, \cdots, Y_{Ij}) \mid n_j, \pi \sim \mathrm{Mult}\left(n_j; \frac{\pi_{1j}}{\pi_{+j}}, \cdots, \frac{\pi_{Ij}}{\pi_{+j}}\right) \qquad (3.11)$$

with $\pi_{+j} = \sum_i \pi_{ij}$.
3.4.2.3 EM Algorithm to Determine the ML Estimates of the Cell Probabilities in an Incomplete $I \times J$ Contingency Table: Data Missing on Both Categories

If missing values occur on both $Y_1$ and $Y_2$, the observed data can be partitioned into three parts, denoted by $A$, $B$ and $C$ respectively, where $A$ includes units having both $Y_1$ and $Y_2$ observed, $B$ includes those having only $Y_1$ observed, and $C$ includes those where only $Y_2$ was observed. In part A the observations are fully classified, while in B and C they are only partially classified. The three parts of the sample are displayed in Table 3.1. The objective is to determine the ML estimates of the cell probabilities in the $I \times J$ table by using the fully and partially classified data.
Sample part A (both variables observed): an $I \times J$ array of counts $y_{ij}^A$, with rows $Y_1 = 1, 2, \cdots, I$ and columns $Y_2 = 1, 2, \cdots, J$.

Sample part B ($Y_2$ is missing): a column of counts $y_i^B$, $i = 1, 2, \cdots, I$, classified by $Y_1$ only.

Sample part C ($Y_1$ is missing): a row of counts $y_j^C$, $j = 1, 2, \cdots, J$, classified by $Y_2$ only.

Table 3.1: Classification of sample units in an incomplete $I \times J$ contingency table.
Assume that the data are MAR and the missingness mechanism is ignorable. Let $y^{A\prime} = (y_{11}^A, y_{12}^A, \cdots, y_{IJ}^A)$, $y^{B\prime} = (y_1^B, y_2^B, \cdots, y_I^B)$ and $y^{C\prime} = (y_1^C, y_2^C, \cdots, y_J^C)$. Since $Y_2$ is missing in sample part B, the counts observed are totals across $Y_2$. Hence, compared to sample part A, row totals are observed in sample part B and column totals in sample part C. The observed data are

$$Y_{obs} = \{y_{ij}^A, y_i^B, y_j^C : i = 1, 2, \cdots, I;\ j = 1, 2, \cdots, J\}$$

Let $y_{obs} = (y^{A\prime}, y^{B\prime}, y^{C\prime})$ be the observed data vector, $Y = (Y_{11}, Y_{12}, \cdots, Y_{IJ})'$ be the complete data vector, and $\pi = (\pi_{11}, \pi_{12}, \cdots, \pi_{IJ})'$ be the vector of cell probabilities for which the ML estimates must be determined.

Each complete data count, $Y_{ij}$, can be expressed as the sum of contributions from each of the three sample parts, that is, $Y_{ij} = Y_{ij}^A + Y_{ij}^B + Y_{ij}^C$. For sample part B, totals across $Y_2$ are observed, that is, $y_i^B = \sum_j y_{ij}^B$, whilst the individual cell counts, $Y_{ij}^B$, are missing. It follows from (3.10) that the predictive distribution of the missing data in part B, given $y_i^B$ and $\pi$, is a product multinomial,

$$(Y_{i1}^B, \cdots, Y_{iJ}^B) \mid y_i^B, \pi \sim \mathrm{Mult}\left(y_i^B; \frac{\pi_{i1}}{\pi_{i+}}, \cdots, \frac{\pi_{iJ}}{\pi_{i+}}\right) \qquad (3.12)$$

where $\pi_{i+} = \sum_j \pi_{ij}$.
For part C only the totals across $Y_1$ are observed, that is, $y_j^C = \sum_i y_{ij}^C$. From (3.11), the predictive distribution of the missing data in sample part C, given $y_j^C$ and $\pi$, is a product multinomial,

$$(Y_{1j}^C, \cdots, Y_{Ij}^C) \mid y_j^C, \pi \sim \mathrm{Mult}\left(y_j^C; \frac{\pi_{1j}}{\pi_{+j}}, \cdots, \frac{\pi_{Ij}}{\pi_{+j}}\right) \qquad (3.13)$$

where $\pi_{+j} = \sum_i \pi_{ij}$. Thus,

$$E[Y_{ij} \mid Y_{obs}, \pi] = y_{ij}^A + E[Y_{ij}^B \mid y_i^B, \pi] + E[Y_{ij}^C \mid y_j^C, \pi] = y_{ij}^A + y_i^B \frac{\pi_{ij}}{\pi_{i+}} + y_j^C \frac{\pi_{ij}}{\pi_{+j}} \qquad (3.14)$$

The distribution of the complete data belongs to the regular exponential family, with the cell counts, $Y_{ij}$, as sufficient statistics. In the E-step of the EM algorithm, $Y_{ij}^{(r)}$ is calculated, where $\pi^{(r)}$, $r = 0, 1, 2, \cdots$, is the $r$th estimate of $\pi$. From (3.14),

$$Y_{ij}^{(r)} = y_{ij}^A + y_i^B \frac{\pi_{ij}^{(r)}}{\pi_{i+}^{(r)}} + y_j^C \frac{\pi_{ij}^{(r)}}{\pi_{+j}^{(r)}} \qquad (3.15)$$
In the M-step, $\pi^{(r+1)}$ is calculated by substituting the results from the E-step into the expression for the MLE of $\pi$ for the complete data. That is,

$$\pi_{ij}^{(r+1)} = \frac{Y_{ij}^{(r)}}{n} = \frac{1}{n}\left(y_{ij}^A + y_i^B \frac{\pi_{ij}^{(r)}}{\pi_{i+}^{(r)}} + y_j^C \frac{\pi_{ij}^{(r)}}{\pi_{+j}^{(r)}}\right) \qquad (3.16)$$

The process iterates between (3.15) and (3.16) until convergence is attained.
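A compact sketch of the iteration between (3.15) and (3.16) in code, using the notation above (the function and argument names are ours, not part of the thesis):

```python
import numpy as np

def em_incomplete_table(A, b, c, tol=1e-10, max_iter=1000):
    # EM estimates of the cell probabilities pi in an incomplete I x J table.
    #   A : I x J array of fully classified counts (sample part A)
    #   b : length-I counts with Y2 missing (part B, row totals only)
    #   c : length-J counts with Y1 missing (part C, column totals only)
    A = np.asarray(A, float)
    b = np.asarray(b, float)
    c = np.asarray(c, float)
    n = A.sum() + b.sum() + c.sum()
    pi = A / A.sum()                  # start from the fully classified cases
    for _ in range(max_iter):
        # E-step (3.15): allocate the partially classified counts to the
        # cells in proportion to the current conditional cell probabilities
        Y = (A + b[:, None] * pi / pi.sum(axis=1, keepdims=True)
               + c[None, :] * pi / pi.sum(axis=0, keepdims=True))
        pi_new = Y / n                # M-step (3.16): complete-data MLE
        if np.abs(pi_new - pi).max() < tol:
            return pi_new
        pi = pi_new
    return pi
```

Applied to the 10% MCAR data of Chapter 4, this reproduces the iterates reported in Table 4.8.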
3.5 Chi-Squared Test

3.5.1 Goodness-of-fit Test
A problem that arises frequently in statistical work is the testing of the
compatibility of a set of observed and theoretical frequencies. This type of problem has
already been discussed and solved for the special case in which there are only two pairs
of frequencies to be compared.
Consider the results obtained from an experiment of tossing a die 300 times, as shown in Table 3.2 below:
Outcome     1    2    3    4    5    6
Frequency   45   52   60   58   44   41

Table 3.2: Frequency distribution.
There are six possible outcomes for each trial, that is, obtaining the number 1, 2, 3, 4, 5 or 6. These outcomes are also referred to as categories. The question we would like to answer is whether the die is fair. The results obtained from the experiment are the evidence for concluding whether the die is fair or otherwise. We know that a fair die has the following characteristic:

$$P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = \frac{1}{6}$$

If $X$ is a random variable representing the outcome obtained in each trial, then $X$ follows the uniform distribution with $P(X = i) = \frac{1}{6}$ for $i = 1, 2, \cdots, 6$. The objective is to test the hypothesis that the die is fair, which can be stated as follows:

$$H_0: P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = \frac{1}{6}$$
$$H_1: P(i) \neq P(j) \text{ for some } i, j = 1, 2, \cdots, 6;\ i \neq j$$
The statement in $H_0$ is equivalent to the die being fair, and the statement in $H_1$ is equivalent to the die not being fair.

If the die is fair, we expect the frequency for outcome or category $i$ to be

$$e_i = n \cdot P(i) \quad \text{for } i = 1, 2, \cdots, 6,$$

where $n$ is the number of trials. This then gives us the expected frequencies

$$e_1 = e_2 = e_3 = e_4 = e_5 = e_6 = 300 \times \frac{1}{6} = 50$$

However, the observed frequencies obtained from the experiment are

$$o_1 = 45, \quad o_2 = 52, \quad o_3 = 60, \quad o_4 = 58, \quad o_5 = 44, \quad o_6 = 41$$

which differ from the frequencies expected of a fair die.
The logic is that if the die is fair, the difference between the observed and the expected frequencies, $(o_i - e_i)$, is either zero or a small number. The difference between the observed and the expected frequencies forms the statistic used to test the hypothesis regarding the probability distribution of the random variable. The statistic is stated in the following theorem.

Theorem 3.2 The statistic

$$\chi^2 = \sum_{i=1}^{k} \frac{(o_i - e_i)^2}{e_i}$$

follows the Chi-Square distribution with $(k - p - 1)$ degrees of freedom, where $k$ is the number of categories and $p$ is the number of unknown parameters to be estimated from the data. If there is no unknown parameter, then the degrees of freedom are $k - 1$, where $p = 0$. This test is a one-tailed test where $H_0$ is rejected if the calculated statistic

$$\chi_c^2 = \sum_{i=1}^{k} \frac{(o_i - e_i)^2}{e_i} > \chi^2_{\alpha,\, k-p-1}$$

at significance level $\alpha$.
o_i     e_i = 300 × (1/6)     (o_i − e_i)²/e_i
45      50                    (45 − 50)²/50 = 0.50
52      50                    (52 − 50)²/50 = 0.08
60      50                    (60 − 50)²/50 = 2.00
58      50                    (58 − 50)²/50 = 1.28
44      50                    (44 − 50)²/50 = 0.72
41      50                    (41 − 50)²/50 = 1.62

Table 3.3: The calculation of the χ² statistic.
Since the statistic is calculated from the observed sample, we use the notation $\chi_c^2$ for the calculated statistic. So,

$$\chi_c^2 = \sum_{i=1}^{6} \frac{(o_i - e_i)^2}{e_i} = 0.50 + 0.08 + 2.00 + 1.28 + 0.72 + 1.62 = 6.20$$

At significance level $\alpha = 0.05$, we reject $H_0$ if $\chi_c^2 > \chi^2_{(0.05,\, 5)} = 11.070$, and accept $H_0$ if $\chi_c^2 \leq \chi^2_{(0.05,\, 5)}$. Note that the degrees of freedom are $k - 1 = 5$, since unknown parameters are absent. Since $\chi_c^2 = 6.20 < 11.070$, we accept $H_0$ and conclude that there is no evidence that the die is not fair; in other words, we can say that the die is fair.
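This goodness-of-fit calculation can be verified in a few lines of code; a sketch using scipy (assuming it is available):

```python
from scipy.stats import chisquare

observed = [45, 52, 60, 58, 44, 41]   # frequencies from Table 3.2
expected = [50] * 6                   # fair-die expectation, 300 * (1/6)
stat, pvalue = chisquare(observed, expected)
print(stat, pvalue)  # stat = 6.2, p about 0.287 > 0.05: H0 is not rejected
```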
The test we have seen above is called the goodness-of-fit test. In general, we would observe the following table, where $o_i$ represents the observed frequency for category $i$, for $i = 1, 2, \cdots, k$, and $n = o_1 + o_2 + \cdots + o_k$.

Category    1     2     ⋯     k
Frequency   o_1   o_2   ⋯     o_k

Table 3.4: The observed frequency for category i.

The belief is that the probability of category $i$ occurring, $P(i)$, is stated in the null hypothesis as

$$H_0: P(i) = p_i \quad \text{for } i = 1, 2, \cdots, k.$$

Assuming $H_0$ is correct, the expected frequency for each category $i$, $e_i$, is calculated by $e_i = n \cdot P(i)$, and with the help of Theorem 3.2 we can test the hypothesis stated in $H_0$.
3.5.2 Independence Test

A very useful application of the $\chi^2$ test occurs in connection with testing the compatibility of observed and expected frequencies in two-way tables. Such two-way tables are usually called contingency tables. Table 3.5 below is an illustration of a contingency table.

A contingency table is usually constructed for the purpose of studying the relationship between the two variables of classification. In particular, one may wish to know whether the two variables are related. By means of the $\chi^2$ test, it is possible to test the hypothesis that the two variables are independent. This test is called the independence test, and it capitalizes on the property of independent events in probability,

$$P(A \cap B) = P(A) \cdot P(B)$$

                          Column variable
                  B_1     B_2     ⋯     B_c
Row       A_1     o_11    o_12    ⋯     o_1c
variable  A_2     o_21    o_22    ⋯     o_2c
          ⋮       ⋮       ⋮             ⋮
          A_r     o_r1    o_r2    ⋯     o_rc

Table 3.5: A two-dimensional contingency table.
The above contingency table is an $(r \times c)$ contingency table, where $r$ denotes the number of categories of the row variable, $c$ denotes the number of categories of the column variable, and $o_{ij}$ is the observed frequency in cell $(i, j)$, that is, the observed frequency for the $i$th category of the row variable and the $j$th category of the column variable.

Let:
$n_{i.}$ be the total frequency for row category $i$,
$n_{.j}$ be the total frequency for column category $j$,
$n$ be the grand total frequency over all cells $(i, j)$.

Each cell represents the joint event $A_i \cap B_j$. Thus,

                          Column variable
                  B_1            B_2            ⋯     B_c
Row       A_1     (A_1 ∩ B_1)    (A_1 ∩ B_2)    ⋯     (A_1 ∩ B_c)
variable  A_2     (A_2 ∩ B_1)    (A_2 ∩ B_2)    ⋯     (A_2 ∩ B_c)
          ⋮       ⋮              ⋮                    ⋮
          A_r     (A_r ∩ B_1)    (A_r ∩ B_2)    ⋯     (A_r ∩ B_c)

Table 3.6: A two-dimensional contingency table of joint events.
If the events $A_i$ and $B_j$ are independent, then $P(A_i \cap B_j) = P(A_i)\, P(B_j)$. Most often, we do not know the true values of $P(A_i)$ and $P(B_j)$, but we know from probability estimation that the best estimator for a population proportion or probability is the sample proportion. Thus,

$$\hat{P}(A_i) = \frac{n_{i.}}{n} \quad \text{and} \quad \hat{P}(B_j) = \frac{n_{.j}}{n}$$

Therefore, the estimated probability for the joint categories is

$$\hat{P}(A_i \cap B_j) = \hat{P}(A_i)\, \hat{P}(B_j) = \frac{n_{i.}}{n} \times \frac{n_{.j}}{n}$$

With this estimated joint probability, we can find the expected frequency in each cell if $A_i$ and $B_j$ are independent. The expected frequency in cell $(i, j)$ is

$$e_{ij} = n \cdot \hat{P}(A_i \cap B_j) = n \cdot \frac{n_{i.}}{n} \cdot \frac{n_{.j}}{n} = \frac{n_{i.}\, n_{.j}}{n}$$

Now, if $A_i$ and $B_j$ are truly independent, we anticipate that $o_{ij}$ and $e_{ij}$ do not differ, and if they differ, the difference is not significant. The statistic $(o_{ij} - e_{ij})$ forms the basis for the independence test, which is stated in Theorem 3.3.
Theorem 3.3 The statistic

$$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(o_{ij} - e_{ij})^2}{e_{ij}}$$

follows the Chi-Square distribution with $(r - 1)(c - 1)$ degrees of freedom, where $o_{ij}$ is the observed frequency in cell $(i, j)$ and $e_{ij}$ is the expected frequency in cell $(i, j)$. The theorem can be written simply as

$$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(o_{ij} - e_{ij})^2}{e_{ij}} \sim \chi^2_{(r-1)(c-1)}$$

This theorem is useful in testing the following hypotheses:

$H_0$: the row and column variables are independent.
$H_1$: the row and column variables are not independent.

This test is a one-tailed test on the right, where $H_0$ is rejected if the calculated $\chi^2$ value is greater than the $\chi^2$ value with $(r - 1)(c - 1)$ degrees of freedom at significance level $\alpha$. Again, denoting the calculated value $\chi_c^2$, we reject $H_0$ if $\chi_c^2 > \chi^2_{\alpha,(r-1)(c-1)}$.
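The critical value $\chi^2_{\alpha,(r-1)(c-1)}$ can be read from a chi-square table or computed numerically; a small sketch using scipy:

```python
from scipy.stats import chi2

alpha = 0.05
r, c = 2, 3                                # e.g. the 2 x 3 table of Chapter 4
df = (r - 1) * (c - 1)                     # (r-1)(c-1) degrees of freedom
print(round(chi2.ppf(1 - alpha, df), 3))   # 5.991
```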
CHAPTER 4
RESULT AND DISCUSSION
Suppose we are examining the effect of age on income. If missingness on income is a function of age, in other words, if older individuals do not report their income, then the data are MAR. If missingness on income is a function of income, i.e. persons with high income refuse to report their income, then the data are NMAR.

To understand these mechanisms better, let us consider a simple example of missing data. Suppose that we have the full data as shown in Table 4.1 below.
Age         Income
<30         High
30 to 55    High
>55         High
⋮           ⋮
<30         Low
30 to 55    Low
>55         Low

(each of the three age categories appears ten times; income is High for the first fifteen units and Low for the last fifteen)

(a) Continuous data

Income    <30    30 to 55    >55
High       5        5         5
Low        5        5         5

(b) Categorical data

Table 4.1: An example dataset of full data.
Consider a situation where the data are fully observed on the age values, but some income values are missing. An asterisk (*) in the table denotes the missing information.
The age values cycle through <30, 30 to 55, >55 (ten times); the corresponding income values, in unit order and with * denoting a missing observation, are:

*, High, High, High, *, High, High, High, *, *, High, High, High, *, *,
Low, Low, *, *, *, Low, *, Low, Low, Low, *, Low, Low, Low, *

(a) Continuous data

Income    <30    30 to 55    >55
High       2        2         2
Low        2        2         2

(b) Categorical data

Table 4.2: An example dataset for MCAR.
Table 4.2 is created such that some of the observations are Missing Completely At Random (MCAR); that is, the missing values do not depend on the age values or the income values. This means that the missing income values are not related to the age values. From Table 4.2(b), we can see that each cell of the table has the same probability of being missing.

The data can be confirmed as MCAR by performing t-tests of mean differences on income and age after dividing respondents into those with and without missing data. This is done to establish that those two variables do not differ significantly. The SPSS Missing Values Analysis (MVA) option supports Little's MCAR test, which is a chi-squared test for missing completely at random. If the p value for Little's MCAR test is not significant, then the data may be assumed to be MCAR.
Table 4.3 is constructed in such a way that some observations are Missing At Random (MAR), where missing values are not randomly distributed among all observations but are randomly distributed within one or more subsamples. The probability of an observation being missing depends on the observed values but not on the missing values. The observed values in this case are the age values; therefore the missingness depends on the age values. For example, in this study we find that the information about individuals whose age is greater than 55 years tends to be missing.
Besides that, Table 4.4 is created so that some of its observations are unknown and said to be Not Missing At Random (NMAR), since the probability of income data being missing depends on the values of income itself, i.e. those who obtain a high income refuse to reveal it.
The age values cycle through <30, 30 to 55, >55 (ten times); the corresponding income values, in unit order and with * denoting a missing observation, are:

High, High, *, High, *, *, High, High, High, *, High, High, High, High, *,
*, Low, Low, Low, Low, *, Low, Low, *, Low, *, Low, Low, Low, *

(a) Continuous data

Income    <30    30 to 55    >55
High       4        4         2
Low        4        4         2

(b) Categorical data

Table 4.3: An example dataset for MAR.
The age values cycle through <30, 30 to 55, >55 (ten times); the corresponding income values, in unit order and with * denoting a missing observation, are:

*, *, High, High, High, *, *, *, High, *, High, *, *, High, *,
*, Low, Low, Low, Low, Low, Low, Low, *, Low, Low, Low, Low, *, Low

(a) Continuous data

Income    <30    30 to 55    >55
High       2        2         2
Low        4        4         4

(b) Categorical data

Table 4.4: An example dataset for NMAR.
For Missing At Random (MAR) and Not Missing At Random (NMAR), testing can be done through the SPSS Missing Values Analysis (MVA). By default, it will generate a table of "Separate Variance t-Tests" in which the rows are all variables with 1% missing or more, and the columns are all variables. In any cell, if $P(2\text{-tail}) \leq 0.05$, the missing cases in the row variable are significantly correlated with the column variable and thus are not missing at random; otherwise, the case is said to be missing at random.
4.1 Data Construction

Suppose we have the following data in Table 4.5. The row variable is the income obtained by the respondents and the column variable refers to their ages.
Income    <30    30-55    >55    Total
High      100    130      110    340
Low       150    155      195    500
Total     250    285      305    840

Table 4.5: Full data.
4.1.1 Missing Completely At Random (MCAR)

In Table 4.6(a), the values for 10% of the total number of candidates are missing. The values 34 and 49 refer to candidates whose income is High and Low respectively but whose age information is missing. Meanwhile, the values 25, 28 and 30 refer to candidates whose age is below 30 years, between 30 and 55 years, and above 55 years respectively; information about their income status is missing.

In Table 4.6(b), the values for 20% of the total number of candidates are missing. The values 67 and 101 refer to candidates whose income is High and Low respectively but whose age information is missing. Meanwhile, the values 50, 57 and 61 refer to candidates whose age is below 30 years, between 30 and 55 years, and above 55 years respectively; information about their income status is missing.

In Table 4.6(c), the values for 30% of the total number of candidates are missing. The values 101 and 151 refer to candidates whose income is High and Low respectively but whose age information is missing. Meanwhile, the values 75, 86 and 91 refer to candidates whose age is below 30 years, between 30 and 55 years, and above 55 years respectively; information about their income status is missing.
Income     <30    30-55    >55    Missing
High        90    117       99    34
Low        135    140      176    49
Missing     25     28       30    83

(a) MCAR with 10% data missing.

Income     <30    30-55    >55    Missing
High        80    105       88    67
Low        120    123      156    101
Missing     50     57       61    168

(b) MCAR with 20% data missing.

Income     <30    30-55    >55    Missing
High        30     38       33    101
Low         45     48       58    151
Missing     75     86       91    252

(c) MCAR with 30% data missing.

Table 4.6: Artificial incomplete data for MCAR cases.
Following the notation in the previous chapter, for Table 4.6(a),

$$y_{obs}' = (y^{A\prime}, y^{B\prime}, y^{C\prime})$$

where $y^{A\prime}$ is the fully classified data: $y^{A\prime} = (y_{ij}^A;\ i = 1, 2,\ j = 1, 2, 3) = (90, 117, 99, 135, 140, 176)$, $y^{B\prime}$ is the partially classified data with age missing: $y^{B\prime} = (y_i^B;\ i = 1, 2) = (34, 49)$, and $y^{C\prime}$ is the partially classified data with income missing: $y^{C\prime} = (y_j^C;\ j = 1, 2, 3) = (25, 28, 30)$.

The fully classified data, $y^A$, were used to determine the starting values for the algorithm,

$$\pi^{(0)\prime} = \frac{1}{757}(90, 117, 99, 135, 140, 176) \approx (0.11889, 0.15456, 0.13078, 0.17834, 0.18494, 0.23250) \qquad (4.1)$$

π11(0) = 0.11889   π12(0) = 0.15456   π13(0) = 0.13078   row total π1+(0) = 0.40423
π21(0) = 0.17834   π22(0) = 0.18494   π23(0) = 0.23250   row total π2+(0) = 0.59578
column totals: π+1(0) = 0.29723,  π+2(0) = 0.33950,  π+3(0) = 0.36328

Table 4.7: Marginal totals of probabilities for MCAR with 10% data missing.
From (3.16), the first estimate of $\pi_{11}$ is

$$\pi_{11}^{(1)} = \frac{1}{n}\left(y_{11}^A + y_1^B \frac{\pi_{11}^{(0)}}{\pi_{1+}^{(0)}} + y_1^C \frac{\pi_{11}^{(0)}}{\pi_{+1}^{(0)}}\right) = \frac{1}{923}\left(90 + 34 \cdot \frac{0.11889}{0.40423} + 25 \cdot \frac{0.11889}{0.29723}\right) = 0.119177$$

Similarly, the first estimates of $\pi_{12}$, $\pi_{13}$, $\pi_{21}$, $\pi_{22}$ and $\pi_{23}$ are

$$\pi_{12}^{(1)} = \frac{1}{923}\left(117 + 34 \cdot \frac{0.154557}{0.40423} + 28 \cdot \frac{0.154557}{0.33950}\right) = 0.154656$$

$$\pi_{13}^{(1)} = \frac{1}{923}\left(99 + 34 \cdot \frac{0.130779}{0.40423} + 30 \cdot \frac{0.130779}{0.36328}\right) = 0.130878$$

$$\pi_{21}^{(1)} = \frac{1}{923}\left(135 + 49 \cdot \frac{0.178336}{0.59578} + 25 \cdot \frac{0.178336}{0.29723}\right) = 0.178405$$

$$\pi_{22}^{(1)} = \frac{1}{923}\left(140 + 49 \cdot \frac{0.184941}{0.59578} + 28 \cdot \frac{0.184941}{0.33950}\right) = 0.184684$$

$$\pi_{23}^{(1)} = \frac{1}{923}\left(176 + 49 \cdot \frac{0.232497}{0.59578} + 30 \cdot \frac{0.232497}{0.36328}\right) = 0.232201$$

This gives

$$\pi^{(1)\prime} = (0.119177, 0.154656, 0.130878, 0.178405, 0.184684, 0.232201)$$

which is used to calculate the second estimate of $\pi$. The process continues until convergence is attained. Table 4.8 shows the values at different steps of the algorithm.

r    π11(r)     π12(r)     π13(r)     π21(r)     π22(r)     π23(r)
0    0.118890   0.154557   0.130779   0.178336   0.184941   0.232497
1    0.119177   0.154656   0.130878   0.178405   0.184684   0.232201
2    0.119203   0.154663   0.130887   0.178410   0.184660   0.232170
3    0.119205   0.154663   0.130888   0.178411   0.184657   0.232175
∞    0.119206   0.154663   0.130889   0.178411   0.184657   0.232175

Table 4.8: Iteration of the EM algorithm for MCAR with 10% data missing.
From the probabilities obtained in Table 4.8, we can find the complete data table via

$$\hat{n}_{ij} = \hat{\pi}_{ij} \cdot n \qquad (4.2)$$

Making use of Equation (4.2), we have

$$\hat{n}_{11} = 0.119206 \times 840 = 100.13556 \approx 100$$

Similarly, the values of $\hat{n}_{12}$, $\hat{n}_{13}$, $\hat{n}_{21}$, $\hat{n}_{22}$ and $\hat{n}_{23}$ are

$$\hat{n}_{12} = 0.154663 \times 840 \approx 130, \quad \hat{n}_{13} = 0.130889 \times 840 \approx 110,$$
$$\hat{n}_{21} = 0.178411 \times 840 \approx 150, \quad \hat{n}_{22} = 0.184657 \times 840 \approx 155, \quad \hat{n}_{23} = 0.232175 \times 840 \approx 195$$

Therefore, the complete data are

Income    <30    30-55    >55
High      100    130      110
Low       150    155      195

Table 4.9: Complete data obtained by the EM algorithm for the 10% MCAR problem.
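The computations behind Tables 4.7 to 4.9 can be reproduced directly; a self-contained sketch with the counts taken from Table 4.6(a):

```python
import numpy as np

A = np.array([[90., 117., 99.],      # fully classified counts
              [135., 140., 176.]])
b = np.array([34., 49.])             # income observed, age missing
c = np.array([25., 28., 30.])        # age observed, income missing
n = A.sum() + b.sum() + c.sum()      # 923 units in all

pi = A / A.sum()                     # starting values of Table 4.7
for _ in range(100):                 # iterate (3.15) and (3.16)
    Y = (A + b[:, None] * pi / pi.sum(axis=1, keepdims=True)
           + c[None, :] * pi / pi.sum(axis=0, keepdims=True))
    pi = Y / n

print(np.round(pi, 6))               # converged row of Table 4.8
print(np.round(pi * 840))            # completed table, as in Table 4.9
```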
Repeating the same process for the MCAR problems with 20% and 30% missing data, we have the following.

r    π11(r)     π12(r)     π13(r)     π21(r)     π22(r)     π23(r)
0    0.119408   0.156250   0.130952   0.178571   0.183036   0.232143
1    0.118684   0.155773   0.130553   0.178944   0.183418   0.232628
2    0.118623   0.155701   0.130481   0.179006   0.183490   0.232699
3    0.118613   0.155690   0.130468   0.179016   0.183054   0.232710
∞    0.118611   0.155688   0.130465   0.179018   0.183507   0.232712

(a) Iteration of the EM algorithm.

Income    <30    30-55    >55
High      100    131      110
Low       150    154      195

(b) Complete data obtained by the EM algorithm.

Table 4.10: MCAR with 20% of the data missing.
r    π11(r)     π12(r)     π13(r)     π21(r)     π22(r)     π23(r)
0    0.119048   0.156463   0.130952   0.178571   0.181973   0.232993
1    0.118664   0.156261   0.130296   0.178965   0.182726   0.233088
2    0.118571   0.156219   0.130133   0.179059   0.182918   0.233101
3    0.118548   0.156209   0.130092   0.179082   0.182967   0.233102
∞    0.118540   0.156206   0.130080   0.179089   0.182985   0.233101

(a) Iteration of the EM algorithm.

Income    <30    30-55    >55
High      100    131      109
Low       151    154      196

(b) Complete data obtained by the EM algorithm.

Table 4.11: MCAR with 30% of the data missing.
4.1.2 Missing At Random (MAR)

For this type of mechanism, the proportion of missing values in the age column for ages above 55 years is greater than in the rest of the table. In Table 4.12(a), the values for 10% of the total number of candidates are missing. The values 33 and 51 refer to candidates whose income is Low and High respectively but whose age information is missing. Meanwhile, the values 20, 22 and 42 refer to candidates whose age is below 30 years, between 30 and 55 years, and above 55 years respectively; information about their income status is missing.

In Table 4.12(b), the values for 20% of the total number of candidates are missing. The values 66 and 102 refer to candidates whose income is Low and High respectively but whose age information is missing. Meanwhile, the values 40, 44 and 84 refer to candidates whose age is below 30 years, between 30 and 55 years, and above 55 years respectively; information about their income status is missing.

In Table 4.12(c), the values for 30% of the total number of candidates are missing. The values 100 and 152 refer to candidates whose income is Low and High respectively but whose age information is missing. Meanwhile, the values 58, 68 and 126 refer to candidates whose age is below 30 years, between 30 and 55 years, and above 55 years respectively; information about their income status is missing.
Income     <30    30-55    >55    Missing
Low         92    120       95    33
High       138    143      168    51
Missing     20     22       42    83

(a) MAR with 10% data missing.

Income     <30    30-55    >55    Missing
Low         84    110       80    66
High       126    131      141    102
Missing     40     44       84    168

(b) MAR with 20% data missing.

Income     <30    30-55    >55    Missing
Low         76     99       65    100
High       116    118      114    152
Missing     58     68      126    252

(c) MAR with 30% data missing.

Table 4.12: Artificial incomplete data for MAR.
Tables 4.13, 4.14 and 4.15 below show the values of the estimate, $\pi^{(r)}$, at different steps of the algorithm for each case.
r    π11(r)     π12(r)     π13(r)     π21(r)     π22(r)     π23(r)
0    0.121693   0.158730   0.125661   0.182540   0.189153   0.222222
1    0.118928   0.154694   0.130284   0.179302   0.185287   0.231506
2    0.118715   0.154382   0.130703   0.178966   0.184989   0.232340
∞    0.118698   0.154357   0.130741   0.178928   0.184850   0.232426

(a) Iteration of the EM algorithm.

Income    <30    30-55    >55
High      100    130      110
Low       150    155      195

(b) Complete data obtained by the EM algorithm for the 10% MAR problem.

Table 4.13: MAR with 10% of the data missing.
r    π11(r)     π12(r)     π13(r)     π21(r)     π22(r)     π23(r)
0    0.125000   0.163690   0.119048   0.187520   0.194940   0.209821
1    0.119279   0.155337   0.128648   0.180845   0.186994   0.228897
2    0.118471   0.154155   0.130237   0.179578   0.185513   0.232045
∞    0.118348   0.153970   0.130526   0.179284   0.185172   0.232700

(a) Iteration of the EM algorithm.

Income    <30    30-55    >55
High       99    129      110
Low       151    156      195

(b) Complete data obtained by the EM algorithm.

Table 4.14: MAR with 20% of the data missing.
r    π11(r)     π12(r)     π13(r)     π21(r)     π22(r)     π23(r)
0    0.129252   0.168367   0.110544   0.197279   0.200680   0.193878
1    0.119620   0.156843   0.126225   0.184715   0.189118   0.223479
2    0.117676   0.154558   0.129876   0.181509   0.186170   0.230211
∞    0.117200   0.154004   0.130940   0.180435   0.185165   0.232256

(a) Iteration of the EM algorithm.

Income    <30    30-55    >55
High       98    129      110
Low       152    156      195

(b) Complete data obtained by the EM algorithm.

Table 4.15: MAR with 30% of the data missing.
4.1.3 Not Missing At Random (NMAR)

Table 4.16 below shows data in which the proportion of missing values in the high-income row is greater than in the rest of the table. In Table 4.16(a), the values for 10% of the total number of candidates are missing. The values 59 and 25 refer to candidates whose income is High and Low respectively but whose age information is missing. Meanwhile, the values 23, 32 and 29 refer to candidates whose age is below 30 years, between 30 and 55 years, and above 55 years respectively; information about their income status is missing.

In Table 4.16(b), the values for 20% of the total number of candidates are missing. The values 118 and 50 refer to candidates whose income is High and Low respectively but whose age information is missing. Meanwhile, the values 50, 60 and 58 refer to candidates whose age is below 30 years, between 30 and 55 years, and above 55 years respectively; information about their income status is missing.

In Table 4.16(c), the values for 30% of the total number of candidates are missing. The values 176 and 76 refer to candidates whose income is High and Low respectively but whose age information is missing. Meanwhile, the values 74, 91 and 87 refer to candidates whose age is below 30 years, between 30 and 55 years, and above 55 years respectively; information about their income status is missing.
                        Age
Income      <30     30-55     >55     Missing
High         83      107       91       59
Low         144      146      185       25
Missing      23       32       29       84

(a) NMAR with 10% data missing.
                        Age
Income      <30     30-55     >55     Missing
High         65       85       72      118
Low         135      140      175       50
Missing      50       60       58      168

(b) NMAR with 20% data missing.
                        Age
Income      <30     30-55     >55     Missing
High         48       63       53      176
Low         128      131      165       76
Missing      74       91       87      252

(c) NMAR with 30% data missing.

Table 4.16: Artificial Incomplete Data for NMAR.
Tables 4.17, 4.18 and 4.19 below show the values of the estimates πᵢⱼ at different steps of the algorithm for the three cases respectively. The sketch given after Table 4.13 applies here as well; only the input counts change, as noted after Table 4.17.
r        0           1           2           ∞
π₁₁      0.109788    0.117789    0.118384    0.118403
π₁₂      0.141534    0.154762    0.156093    0.156287
π₁₃      0.12037     0.129511    0.130303    0.130365
π₂₁      0.190476    0.179837    0.179022    0.178963
π₂₂      0.193122    0.18631     0.185357    0.185214
π₂₃      0.244709    0.231791    0.23084     0.230766

(a) Iteration of the EM algorithm.
                        Age
Income      <30     30-55     >55
High         99      131      110
Low         150      156      194

(b) Complete data obtained by EM algorithm for 10% NMAR problem.

Table 4.17: NMAR with 10% of the data missing.
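The sketch given after Table 4.13 reproduces Table 4.17 (a) once its inputs are replaced by the counts of Table 4.16 (a); only the three input arrays change. The variable names below are those of the earlier sketch:

    # Inputs for the 10% NMAR case of Table 4.16 (a).
    n = np.array([[83.0, 107.0, 91.0],
                  [144.0, 146.0, 185.0]])
    row_sup = np.array([59.0, 25.0])         # income observed, age missing
    col_sup = np.array([23.0, 32.0, 29.0])   # age observed, income missing

The 20% and 30% NMAR cases follow in the same way from Tables 4.16 (b) and (c).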
r        0           1           2           ∞
π₁₁      0.0967262   0.114881    0.117907    0.118526
π₁₂      0.126488    0.151634    0.155844    0.156811
π₁₃      0.107143    0.126168    0.129216    0.129686
π₂₁      0.200893    0.182292    0.179245    0.178641
π₂₂      0.208333    0.191358    0.187727    0.186826
π₂₃      0.260417    0.233668    0.230061    0.229511

(a) Iteration of the EM algorithm.
                        Age
Income      <30     30-55     >55
High         99      132      109
Low         150      157      193

(b) Complete data obtained by EM algorithm.

Table 4.18: NMAR with 20% of the data missing.
r        0           1           2           ∞
π₁₁      0.081633    0.10961     0.115906    0.117593
π₁₂      0.107143    0.146668    0.156005    0.159553
π₁₃      0.090136    0.11999     0.1266      0.128036
π₂₁      0.217687    0.187511    0.180905    0.179073
π₂₂      0.222789    0.197738    0.189872    0.186754
π₂₃      0.280612    0.238484    0.230712    0.228991

(a) Iteration of the EM algorithm.
                        Age
Income      <30     30-55     >55
High         99      134      108
Low         150      157      192

(b) Complete data obtained by EM algorithm.

Table 4.19: NMAR with 30% of the data missing.
4.1.4 The Chi-Squared Test

In connection with the tables in the previous section, the χ² test can be used to test the hypothesis that there is a relationship between an individual's age and his income. Consider the application of Theorem 3.3 to testing independence in the tables of the previous section. Suppose we have the hypotheses

H₀: An individual's age and the income he earns are independent.
H₁: An individual's age and the income he earns are not independent.
From Table 4.4 above, i.e. the table of full data, we have r = 2 rows and c = 3 columns, with marginal totals

n₁. = 100 + 130 + 110 = 340,    n₂. = 150 + 155 + 195 = 500,
n.₁ = 100 + 150 = 250,    n.₂ = 130 + 155 = 285,    n.₃ = 110 + 195 = 305,

and grand total n = 840.
The observed counts are n₁₁ = 100, n₁₂ = 130, n₁₃ = 110, n₂₁ = 150, n₂₂ = 155 and n₂₃ = 195. The expected counts eᵢⱼ = nᵢ. n.ⱼ / n and the cell contributions (nᵢⱼ − eᵢⱼ)² / eᵢⱼ are:

Cell     Expected count eᵢⱼ           Contribution (nᵢⱼ − eᵢⱼ)²/eᵢⱼ
(1,1)    340(250)/840 = 101.19048     (100 − 101.19048)²/101.19048 = 0.01401
(1,2)    340(285)/840 = 115.35714     (130 − 115.35714)²/115.35714 = 1.85870
(1,3)    340(305)/840 = 123.45238     (110 − 123.45238)²/123.45238 = 1.46586
(2,1)    500(250)/840 = 148.80952     (150 − 148.80952)²/148.80952 = 0.00952
(2,2)    500(285)/840 = 169.64286     (155 − 169.64286)²/169.64286 = 1.26391
(2,3)    500(305)/840 = 181.54762     (195 − 181.54762)²/181.54762 = 0.99680

Table 4.20: χ² calculation for full data.

Summing the contributions gives

χ² = 0.01401 + 1.85870 + 1.46586 + 0.00952 + 1.26391 + 0.99680 = 5.60880.
The critical value at the 5% significance level, with (r − 1)(c − 1) = (2 − 1)(3 − 1) = 2 degrees of freedom, is χ²(0.05, 2) = 5.991, and the rule is to reject H₀ if χ² > 5.991. Since χ² = 5.60880 < 5.991, the result is not significant and the hypothesis H₀ is therefore accepted. We can conclude that, for the case of full data, one's age and the income he earns are independent. A scripted check of this calculation is given below.
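The following is a minimal sketch of the above calculation, assuming NumPy and SciPy are available; the expected counts, the statistic and the critical value match the hand calculation.

    import numpy as np
    from scipy.stats import chi2

    # Full data from Table 4.4: rows = income, columns = age.
    n = np.array([[100.0, 130.0, 110.0],
                  [150.0, 155.0, 195.0]])
    # Expected counts under independence: e_ij = (row total)(column total)/n.
    e = np.outer(n.sum(axis=1), n.sum(axis=0)) / n.sum()
    stat = ((n - e) ** 2 / e).sum()
    crit = chi2.ppf(0.95, df=(n.shape[0] - 1) * (n.shape[1] - 1))
    print(round(stat, 4), round(crit, 3))   # 5.6088 5.991

The same statistic, together with its p-value and the table of expected counts, can also be obtained directly from scipy.stats.chi2_contingency(n).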
The same calculation is repeated on the complete data obtained by the EM algorithm for all cases; the results are summarized as follows.
Type of        Missing       Available Data               With EM
Missingness    Percentage    χ² Value    Accept/Reject    χ² Value    Accept/Reject
MCAR           10%           5.01459     Accept           5.6088      Accept
               20%           4.91984     Accept           6.0393      Reject
               30%           4.53351     Accept           6.4433      Reject
MAR            10%           4.97741     Accept           5.6088      Accept
               20%           4.33358     Accept           5.6393      Accept
               30%           3.69837     Accept           5.6696      Accept
NMAR           10%           4.96211     Accept           5.6411      Accept
               20%           3.87375     Accept           5.76997     Accept
               30%           3.44817     Accept           5.8016      Accept

Table 4.21: The χ² value for all cases.
From Table 4.21 above, we can clearly see a pattern across the mechanisms, i.e. MCAR, MAR and NMAR: as the percentage of missing data increases, the χ² values obtained after applying the EM algorithm become larger. For all cases we expect the χ² values to be approximately equal to the χ² value of the full data. From the calculation of χ² for the full data above, i.e. Table 4.20, we obtained χ² = 5.60880. Since H₀ is accepted for the full data, all of the incomplete-data problems are also expected to accept H₀. By accepting H₀ we conclude that, at the 5% significance level, there is no significant evidence that an individual's age influences the income he earns.
For the MCAR case, when 10% of the data are missing we have χ² = 5.60880. This value is exactly the same as the χ² value of the full data; thus, we accept H₀. But when 20% and 30% of the data are missing, we have χ² = 6.0393 and χ² = 6.4433 respectively. Both values are greater than the critical value at the 5% significance level, χ²(0.05, 2) = 5.991. Therefore, we reject H₀ for both 20% and 30% MCAR. We can conclude that when 10% of the data are missing there is no significant evidence, at the 5% significance level, that an individual's age influences the income he earns; but when the amount of missing values becomes larger, the individual's age and the income he earns appear to be related. This does not correspond to our expectation.
For the MAR case, when only a small proportion of the data, namely 10%, is missing, H₀ is accepted since χ² = 5.6088 < χ²(0.05, 2) = 5.991. Likewise, when 20% of the data are missing, H₀ is still accepted since χ² = 5.6393 < 5.991. Even when a large proportion of the data, 30%, is missing, we still accept H₀ since χ² = 5.6696 < 5.991. Thus we can conclude that, at the 5% significance level, there is no significant evidence that an individual's age is related to the income he earns, no matter how much of the data is missing.
Besides that, for the NMAR case, when 10% and 20% of the data are missing we have χ² = 5.6411 and χ² = 5.76997 respectively. Compared with χ²(0.05, 2) = 5.991, both values are smaller, and therefore we accept H₀. Also, when 30% of the data are missing, χ² = 5.8016; this value is again less than the critical value at the 5% significance level, χ²(0.05, 2) = 5.991, and therefore we accept H₀. We can conclude that, at the 5% significance level, there is no significant evidence that an individual's age influences the income he earns, at all levels of missing data. The decision rule for all nine cases is summarized in the sketch below.
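The following sketch is illustrative only; it hard-codes the χ² values reported in the "With EM" column of Table 4.21 and applies the decision rule above, reproducing the Accept/Reject column.

    # Chi-squared values after EM, copied from Table 4.21.
    crit = 5.991
    with_em = {"MCAR": [5.6088, 6.0393, 6.4433],
               "MAR":  [5.6088, 5.6393, 5.6696],
               "NMAR": [5.6411, 5.76997, 5.8016]}
    for mech, values in with_em.items():
        for pct, x in zip((10, 20, 30), values):
            decision = "Reject H0" if x > crit else "Accept H0"
            print(f"{mech} {pct}%: chi-squared = {x} -> {decision}")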
CHAPTER 5
CONCLUSION AND RECOMMENDATION
5.1 Conclusion
Missing data pose problems to practicing statisticians in the sense that standard statistical methods cannot be applied directly to an incomplete data set. In general, this study focuses on the performance of the EM algorithm under different types of data missingness mechanisms and at different levels of missing data.
The missingness mechanism, which in practice is assumed, is considered. This consideration is very useful in understanding the effect of the missingness mechanism on the observed sample, and in reasoning about why a missing data technique succeeds or fails in a particular situation. For the purposes of this study, we found that under the MAR and NMAR mechanisms the EM algorithm gives better recovery of the missing values than under the MCAR mechanism.

At different levels of missingness, i.e. as the percentage of missing data grows, the χ² values become greater, and at some higher missing percentage the value may become misleading. We can say that as the portion of missing values increases, the values recovered by the EM algorithm reproduce the full data less faithfully.
5.2 Recommendation
The main purpose of this study is not the interpretation of the data; it is more concerned with how we handle incomplete tables before we analyze the data. From the discussion in the previous chapter, the EM algorithm can be used if we are interested in the relationship between the variables. If we prefer to carry out further analysis on the incomplete data, then we can use the EM algorithm to estimate the missing values even when the missing portion is large, or when the missing data are present in only one cell of the contingency table. Therefore, the results of this analysis of incomplete categorical data can be of great benefit to the public and can remove obstacles for researchers in the future.