APPLICATION OF EM ALGORITHM ON MISSING CATEGORICAL DATA ANALYSIS

NORAIM BINTI HASAN

A report submitted in partial fulfilment of the requirements for the award of the degree of Master of Science (Mathematics)

Faculty of Science
Universiti Teknologi Malaysia

DECEMBER 2009

To my beloved husband, son and all my family members

ACKNOWLEDGEMENT

In preparing this thesis, I was in contact with many people: researchers, academicians, and practitioners. They have contributed towards my understanding and thoughts. In particular, I wish to express my sincere appreciation to my thesis supervisor, Assoc. Prof. Dr. Ismail b. Mohamad, for his encouragement, guidance, criticism and friendship. Without his continued support and interest, this thesis would not have been the same as presented here. The librarians at UTM also deserve my special thanks for their assistance in supplying the relevant literature. My colleagues should also be recognised for their support and assistance on various occasions; their views and tips were indeed useful. My sincere appreciation also extends to my beloved husband and son, my family, and my in-laws for their understanding and sacrifices. Unfortunately, it is not possible to list all of them in this limited space.

ABSTRAK

Algoritma EM merupakan salah satu kaedah untuk menyelesaikan masalah berkaitan dengan data tidak lengkap berdasarkan kepada satu rangka lengkap. Algoritma EM merupakan satu pendekatan berparameter untuk mencari taksiran ML bagi data tidak lengkap. Algoritma ini terbahagi kepada dua langkah. Langkah pertama, langkah Ekspektasi atau langkah E, mencari ekspektasi loglikelihood, bersyarat kepada data yang diperolehi dan anggaran terkini. Langkah kedua, langkah Pemaksimuman atau langkah M, memaksimumkan loglikelihood tersebut untuk mencari anggaran parameter yang baru. Prosedur ini berselang-seli antara kedua-dua langkah sehingga anggaran parameter menjadi malar.

ABSTRACT

The Expectation-Maximization algorithm, or EM algorithm in short, is one of the methodologies for solving incomplete data problems sequentially based on a complete-data framework. The EM algorithm is a parametric approach to finding the Maximum Likelihood (ML) parameter estimates for incomplete data. The algorithm consists of two steps. The first step, the Expectation step, better known as the E-step, finds the expectation of the loglikelihood conditional on the observed data and the current parameter estimates. The second step, the Maximization step or M-step, maximizes this expected loglikelihood to find new estimates of the parameters. The procedure alternates between the two steps until the parameter estimates converge to some fixed values.
TABLE OF CONTENTS

DECLARATION
DEDICATION
ACKNOWLEDGEMENT
ABSTRAK
ABSTRACT
TABLE OF CONTENTS
LIST OF TABLES
LIST OF SYMBOLS

1 INTRODUCTION
1.1 Problem Statement
1.2 Objective of the Study
1.3 Scope of the Study
1.4 Significance of the Study

2 LITERATURE REVIEW
2.1 Missing Data
2.1.1 Classes of Missing Data
2.1.1.1 Censored Data
2.1.1.2 Latent Variable
2.1.1.3 Non-Response Item
2.1.2 Missing Data Mechanism
2.2 The Expectation-Maximization Algorithm

3 RESEARCH METHODOLOGY
3.1 Missing Data Patterns
3.2 General Definition of Missingness Mechanism
3.3 EM Theory in General
3.4 Incomplete Contingency Table
3.4.1 ML Estimation in Incomplete Contingency Table
3.4.2 The EM Algorithm
3.4.2.1 Multinomial Sampling
3.4.2.2 Product Multinomial Sampling
3.4.2.3 EM Algorithm to Determine the ML Estimates of Cell Probabilities in an Incomplete I × J Contingency Table: Data Missing on Both Categories
3.5 Chi-Squared Test
3.5.1 Goodness-of-fit Test
3.5.2 Independence Test

4 RESULT AND DISCUSSION
4.1 Data Construction
4.1.1 Missing Completely At Random (MCAR)
4.1.2 Missing At Random (MAR)
4.1.3 Not Missing At Random (NMAR)
4.1.4 The Chi-Squared Test

5 CONCLUSION AND RECOMMENDATION
5.1 Conclusion
5.2 Recommendation

REFERENCES

LIST OF TABLES

3.1 Classification of sample units in an incomplete I × J contingency table
3.2 Frequency distribution
3.3 The calculation of the χ² statistic
3.4 The observed frequency for category i
3.5 A two-dimensional contingency table
3.6 A two-dimensional contingency table of joint events
4.1 An example dataset of full data: (a) continuous data; (b) categorical data
4.2 An example dataset for MCAR: (a) continuous data; (b) categorical data
4.3 An example dataset for MAR: (a) continuous data; (b) categorical data
4.4 An example dataset for NMAR: (a) continuous data; (b) categorical data
4.5 Full data
4.6 Artificial incomplete data for MCAR: (a) 10%, (b) 20%, (c) 30% data missing
4.7 Marginal totals of probabilities for MCAR with 10% data missing
4.8 Iteration of the EM algorithm for the MCAR 10% missing data problem
4.9 Complete data obtained by the EM algorithm for the 10% MCAR problem
4.10 MCAR with 20% of the data missing: (a) EM iteration, (b) completed data
4.11 MCAR with 30% of the data missing: (a) EM iteration, (b) completed data
4.12 Artificial incomplete data for MAR: (a) 10%, (b) 20%, (c) 30% data missing
4.13 MAR with 10% of the data missing: (a) EM iteration, (b) completed data
4.14 MAR with 20% of the data missing: (a) EM iteration, (b) completed data
4.15 MAR with 30% of the data missing: (a) EM iteration, (b) completed data
4.16 Artificial incomplete data for NMAR: (a) 10%, (b) 20%, (c) 30% data missing
4.17 NMAR with 10% of the data missing: (a) EM iteration, (b) completed data
4.18 NMAR with 20% of the data missing: (a) EM iteration, (b) completed data
4.19 NMAR with 30% of the data missing: (a) EM iteration, (b) completed data
4.20 The χ² calculation for the full data
4.21 The χ² values for all cases

LIST OF SYMBOLS

$Y_{obs}$ : the observed values
$Y_{mis}$ : the missing values
$n$ : number of observations (total count)
$\hat{\theta}$ : estimate of $\theta$
$\theta^{(r)}$ : current ($r$th) estimate of $\theta$
$x_{ij}$ : the count in cell $(i, j)$
$a_{ij}$ : the observed (fully classified) count for cell $(i, j)$
$\pi_{ij}$ : the probability that an observation falls in cell $(i, j)$
$\pi_{ij}^{(r)}$ : $r$th estimate of $\pi_{ij}$
$O_i$ : observed frequencies
$E_i$ : expected frequencies

CHAPTER 1

INTRODUCTION

1.1 PROBLEM STATEMENT

An incomplete table is a table in which the entries, or the information on one or more of the categorical variables, are missing, a priori zero, or undetermined (Fienberg, 1980). The treatment of missing data is an important data quality issue in data mining, data warehousing and database management, because real-world data often contain missing values.

The presence of missing values can cause serious problems when the data are used for reporting, information sharing and decision support. First, data with missing values may provide biased information. For example, a survey question related to personal information is more likely to be left unanswered by respondents who are sensitive about privacy. Second, many data modelling and analysis techniques cannot deal with missing values and must discard a whole record if even one attribute value is missing. Third, even when a tool can handle missing values, there are often restrictions on where they may occur; classification systems, for example, typically do not allow missing values in the class attribute.

Missing data are often a major obstacle for researchers. Some researchers simply ignore, truncate, censor or collapse over the missing data. This may make the problem easier, but it can lead to inappropriate conclusions and confusion. A proper strategy should therefore be used to treat missing data.

1.2 OBJECTIVE OF THE STUDY

This research is carried out with the following objectives:

1) To apply the EM algorithm to the multinomial model in missing categorical data analysis.
2) To compare the results of the independence test for complete and incomplete data.

1.3 SCOPE OF THE STUDY

This study concentrates on contingency tables in which some values are missing, to which the EM algorithm is then applied. Missing Completely At Random (MCAR), Missing At Random (MAR) and Not Missing At Random (NMAR) data are considered in this study.

1.4 SIGNIFICANCE OF THE STUDY

The EM algorithm deals successfully with missing values in a contingency table; in other words, the missing values can be recovered by applying the EM algorithm. By the end of this study we will also see a further dimension of the problem: the missingness mechanism has a direct effect on the recovered values.

CHAPTER 2

LITERATURE REVIEW

Missing data analysis has been well studied, in particular by Little and Rubin (2002), but the analysis of incomplete categorical data is still an active area, and in recent years many researchers have been concerned with it. Xiao-Bai Li (2009) has proposed a new Bayesian approach for estimating and replacing missing categorical data.
With this approach, the posterior probabilities of a missing value belonging to a certain category are estimated using the simple Bayes method. Based on the estimated probabilities, two alternative methods for replacing the missing values are proposed: the first replaces the missing value with the value having the maximum probability; the second uses a value selected with probability proportional to the estimated posterior distribution. The approach is nonparametric and does not require prior knowledge about the distributions of the data. It is not tied to any specific data analysis or mining task and can therefore be applied to a wide variety of tasks.

A major problem is that the variability associated with the missing data is represented in a biased way when only the observed values are taken into account and the missing values of an attribute are not. In other words, the missing values are not taken into account in predicting the missing values. As a result, the statistical distribution of the data is altered and the quality of the data is affected.

2.1 Missing Data

Appropriate treatment of missing values is essential in all analyses and critical in some, such as time series analysis. Inappropriate handling of missing values will distort an analysis because, until proven otherwise, the researcher must assume that missing cases may differ in analytically important ways from cases where values are at hand. That is, the problem with missing values is not so much the reduced sample size as the possibility that the remaining data set is biased.

2.1.1 Classes of Missing Data

There are several classes of missing data problems in the statistical literature, each of which is unique.

2.1.1.1 Censored Data

A major object of study in statistics is survival analysis, where interest centres on the failure time of a group or groups of individuals. For example, in a clinical trial, interest may centre on the survival time of cancer patients from the time of receiving chemotherapy treatment. In a life-testing experiment in industrial reliability, interest centres on the lifetimes of machine components. Survival times and lifetimes are known as failure times, the response variable in a survival analysis study. In the clinical trial above, some patients may opt to withdraw from the study or otherwise become unavailable before the experiment ends, so that their true survival times are not known. In a life-testing experiment, not all components in the study may have failed before the end of the study. The only information available is that the subject's survival time exceeds a certain value; the true survival time is not known. Such incidents create missing values, and this incompleteness of the observations on the failure time is called censoring. This class of missing data is known as censored data.

2.1.1.2 Latent Variable

Some variables can be neither measured nor observed directly, although other observed, measurable variables are thought to be related to them. The observed variable is called a manifest variable and the unobserved variable is known as a latent variable. A classic example of a latent variable is intelligence, which is immeasurable, but the Intelligence Quotient (IQ) test score is thought to reflect one's level of intelligence.
Another example of a latent variable is religious commitment, which cannot be measured but is thought to be related to the observed frequency of one's performance of religious rituals. Other examples of latent variables include stereotyping in sociology, mathematics anxiety in education, and economic trust or confidence in economics. These latent variables are hypothetical constructs which are not measurable, but some of their effects on manifest variables are observable. In general, a latent variable model describes the relationship between the manifest and the latent variables.

2.1.1.3 Non-response Item

Item non-response refers to the fact that, owing to fatigue, sensitivity, lack of knowledge or other factors, respondents not infrequently leave particular items blank on mail questionnaires or decline to give a response during interviews. This forces the researcher to decide whether to leave cases with missing data out of the analysis when data are missing for a variable being analysed, or whether a value should be imputed and the blank replaced by the imputed value. Similar issues arise with archival data, where the researcher may find no recorded data for certain values of certain records. Whereas a latent variable is entirely missing, non-response creates an incomplete data set with gaps in the data matrix, depriving us of the familiar data structure. The non-response problem is a straightforward problem faced by many practising statisticians and by the non-statistical community; indeed, a three-volume work addressing this issue was written by Madow et al. (1983). It is easy to understand why the problem has received such attention, especially in the United States and the United Kingdom, where governments have an interest in clean and reliable official statistics.

2.1.2 Missing Data Mechanism

The occurrence of missing data is caused by certain mechanisms. Three different mechanisms that cause missing data were distinguished by Rubin (1976). Missing Completely At Random (MCAR) exists when missing values are randomly distributed across all observations. In this case the probability of an observation being missing does not depend on the data values; each item of the data has the same probability of being missing. Missing At Random (MAR) is a condition which exists when missing values are not randomly distributed among all observations but are randomly distributed within one or more subsamples; the probability of an observation being missing depends on the observed values but not on the missing values. The third mechanism, Not Missing At Random (NMAR), also called non-ignorable missingness, is the most problematic form: missing values are not randomly distributed across observations, and the probability of missingness cannot be predicted from the variables in the model.

To understand the missingness mechanisms better, consider a case in which we are interested in studying the relationship between age and income for a sample of subjects. Suppose all n measurements of age are fully observed but some measurements of income are missing. The missing income data are MCAR if the probability of being missing does not depend on the values of age or income, that is, the missing income values are related to neither age nor income. The missing data are MAR if the probability of being missing depends on the age values but not on the income values, which means the missing income values are related to the age values.
The missing income data are NMAR if the probability of being missing depends on the values of income itself. Diggle and Kenward (1994) introduced the term "informative drop-out" for non-ignorable drop-out in longitudinal data analysis. Other situations, in which the missing data are MAR depending on an outside variable (a variable that is not in the study) or on a combination of two or more variables, were observed by Kim and Curry (1977) and Roth (1994).

2.2 The Expectation-Maximization Algorithm

A modern statistical procedure for dealing with missing data, the Expectation-Maximization algorithm, or EM algorithm in short, is an efficient iterative procedure for computing the Maximum Likelihood (ML) estimates in the presence of missing or hidden data. It is based on an old ad hoc idea: impute estimated values where there are missing values, estimate the parameters, re-estimate the missing values assuming the new parameter estimates are the true ones, and then re-estimate the parameters. This sequence is repeated until the parameter estimates converge to some stationary values.

This approach was called the missing information principle by Orchard and Woodbury (1972). Even though the approach had been proposed as early as 1926 by McKendrick, it was not until 1977 that it was presented in its general form by Dempster, Laird and Rubin and formally called the EM algorithm. This influential work started a new area of EM applications in many statistical problems, including factor analysis and survival analysis. Hartley and Hocking (1971) advocated an algorithm for directly deriving a likelihood equation from an incomplete data matrix and determining a standard solution based on the scoring method. On the other hand, Orchard and Woodbury (1972), Beale and Little (1975) and Sundberg (1976) derived methods for finding the maximum likelihood solution based on an algorithm which later became generally known as the EM algorithm through Dempster, Laird and Rubin (DLR, 1977).

Rubin (1991) regarded the EM algorithm as one of the methodologies for solving incomplete data problems sequentially based on a complete-data framework. The EM algorithm is a parametric approach to finding the ML parameter estimates for incomplete data. The algorithm consists of two steps. The first step, the Expectation step, better known as the E-step, finds the expectation of the loglikelihood conditional on the observed data and the current parameter estimates, say $\theta^{(r)}$. The second step, the Maximization step or M-step, maximizes this expected loglikelihood to find new estimates of the parameters. The procedure alternates between the two steps until the parameter estimates converge to some fixed values.

CHAPTER 3

RESEARCH METHODOLOGY

3.1 Missing Data Patterns

Standard statistical methods are developed to deal with complete data matrices such as

$$Y = \begin{bmatrix} y_{11} & y_{12} & \cdots & y_{1p} \\ y_{21} & y_{22} & \cdots & y_{2p} \\ \vdots & \vdots & & \vdots \\ y_{n1} & y_{n2} & \cdots & y_{np} \end{bmatrix}$$

where all entries of the matrix are observed. This matrix can be represented as $Y = (Y_{obs})$, where $Y_{obs}$ represents the observed values. When some values are missing, so that some of the entries required for the complete matrix are no longer observed, the result is an incomplete data matrix. A hypothetical complete data matrix can then be written as $Y = (Y_{obs}, Y_{mis})$, where $Y_{mis}$ represents the missing values. A typical incomplete data matrix looks like the matrix below, where * represents a missing value and each row vector represents a unit.
$$Y = \begin{bmatrix} y_{11} & y_{12} & y_{13} & y_{14} \\ y_{21} & y_{22} & y_{23} & * \\ y_{31} & y_{32} & * & y_{34} \\ y_{41} & * & * & * \end{bmatrix}$$

In this particular example of an incomplete data matrix, variable $Y_1$ is fully observed and variable $Y_2$ is more completely observed than variables $Y_3$ and $Y_4$. Variables $Y_1$, $Y_2$ and $Y_3$ form a monotone pattern of missing data when variable $Y_4$ is discarded, and the resulting matrix is given as

$$Y = \begin{bmatrix} y_{11} & y_{12} & y_{13} \\ y_{21} & y_{22} & y_{23} \\ y_{31} & y_{32} & * \\ y_{41} & * & * \end{bmatrix}$$

Discarding unit three, the third row of the incomplete data matrix, also forms a monotone missing data pattern, this time over all of $Y_1$, $Y_2$, $Y_3$ and $Y_4$, and the resulting data matrix is given as

$$Y = \begin{bmatrix} y_{11} & y_{12} & y_{13} & y_{14} \\ y_{21} & y_{22} & y_{23} & * \\ y_{41} & * & * & * \end{bmatrix}$$

When a row vector contains some missing values, the incomplete data are known as item non-response. When the row vector contains only missing values, it is known as unit non-response. Clearly, unit non-response reduces the whole sample size, while item non-response reduces the sample size for the corresponding variable. Another obvious effect of missing data is the presence of gaps in the data matrix. Statistical packages like GLIM, SAS, SPSS and MINITAB do not recognize these gaps and opt to work on the complete units only, thus reducing the size of the sample, which also means throwing away some of the information contained in the incomplete units.

One effect of this treatment of missing data can be non-response bias. The non-respondents may possess certain characteristics that make them different from the respondents, making the two groups distinguishable from each other. Treating both groups as if they were the same will give biased results: bias due to non-response.

3.2 General Definition of Missingness Mechanism

Let $Y$ denote a hypothetical complete $(n \times p)$ data matrix of $n$ observations on $p$ variables, and $R$ an $(n \times p)$ missingness indicator matrix such that $r_{ij} = 1$ if $y_{ij}$ is missing and $r_{ij} = 0$ if $y_{ij}$ is present. Suppose that a distribution for $Y$ is $f(Y \mid \theta)$, indexed by parameter $\theta$, and a distribution for $R$ given $Y$ is $f(R \mid Y, \psi)$, indexed by parameter $\psi$. Rewriting $Y = (Y_{obs}, Y_{mis})$, where $Y_{obs}$ represents the observed values of $Y$ and $Y_{mis}$ the missing values, $f(R \mid Y, \psi)$ can be written as $f(R \mid Y_{obs}, Y_{mis}, \psi)$.

Definition 3.1. The missing data are MCAR if $f(R \mid Y_{obs}, Y_{mis}, \psi) = f(R \mid \psi)$.

Definition 3.2. The missing data are MAR if $f(R \mid Y_{obs}, Y_{mis}, \psi) = f(R \mid Y_{obs}, \psi)$.

Definition 3.3. The missing data are NMAR if $f(R \mid Y_{obs}, Y_{mis}, \psi)$ cannot be so simplified, that is, it depends on $Y_{mis}$.

These definitions are due to Rubin (1976). The MCAR situation implies that missingness depends neither on the observed nor on the missing values of $Y$; MAR implies that missingness does not depend on the missing values of $Y$; and NMAR implies that missingness depends on the missing values of $Y$. Rubin (1976) further showed that if the missing data are MCAR or MAR and $\theta$ and $\psi$ are distinct, likelihood inference for $\theta$ can be based on the likelihood obtained by integrating $Y_{mis}$ out of $f(Y_{obs}, Y_{mis} \mid \theta)$, without including a model for the missing data mechanism. Under this condition the missing data mechanism is termed ignorable. When the missing data are NMAR, maximum likelihood estimation requires a model for the missing data mechanism; this situation is termed non-ignorable. In practice the type of missingness mechanism is often assumed, but Little (1988) provides a method to test whether the missingness mechanism is MCAR or not.
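To make these three definitions concrete, the short sketch below (an illustration added for this write-up, assuming Python with NumPy rather than the packages named above) generates an age-income sample and deletes income values under each mechanism. Comparing the mean age of cases with and without missing income shows how MAR and NMAR missingness, unlike MCAR, leave a distorted observed subsample.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Hypothetical complete data: age fully observed, income possibly missing.
age = rng.uniform(20, 70, n)
income = 1000 + 40 * age + rng.normal(0, 300, n)

# MCAR: the probability of being missing is constant (Definition 3.1).
r_mcar = rng.random(n) < 0.2

# MAR: the probability depends only on the observed age (Definition 3.2).
r_mar = rng.random(n) < np.where(age > 55, 0.5, 0.1)

# NMAR: the probability depends on the income value itself (Definition 3.3).
r_nmar = rng.random(n) < np.where(income > np.median(income), 0.5, 0.1)

for name, r in [("MCAR", r_mcar), ("MAR", r_mar), ("NMAR", r_nmar)]:
    # Under MCAR the mean age of the missing-income group matches the
    # overall mean; under MAR and NMAR it drifts upward.
    print(name, round(float(age[r].mean()), 1), round(float(age[~r].mean()), 1))
```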
3.3 EM Theory in General

Rubin (1991) regarded the EM algorithm as one of the methodologies for solving incomplete data problems sequentially based on a complete-data framework. The idea on which it is based is simple, as summarized in the steps below, where $Y_{obs}$ is the observed data portion and $Y_{mis}$ is the missing data portion:

1) If the problem is so difficult that the solution cannot be derived immediately from the data at hand, $Y_{obs}$, make the data "complete" to the extent that the problem becomes easy to formulate (assuming that the missing data portion $Y_{mis}$ exists).
2) For example, if the objective for the time being is to derive an estimate $\hat{\theta}$ of a parameter $\theta$, enter a provisional value of $\theta$ to determine $Y_{mis}$.
3) Improve the estimate of $\theta$ using $(Y_{obs}, Y_{mis})$ and enter the new value into step 2.
4) Repeat the two steps until the value of $\hat{\theta}$ converges.

Let $Y = (Y_{obs}, Y_{mis})$ denote the hypothetical complete data, where $Y_{obs}$ represents the observed values of $Y$ and $Y_{mis}$ the missing values. Suppose that a model exists for $Y$ with probability density $f(Y \mid \theta)$, where $\theta = (\theta_1, \theta_2, \ldots, \theta_d)$. On factoring this density we get

$$f(Y \mid \theta) = f(Y_{obs}, Y_{mis} \mid \theta) = f(Y_{obs} \mid \theta)\, f(Y_{mis} \mid Y_{obs}, \theta) \tag{3.1}$$

where $f(Y_{obs} \mid \theta)$ is the density of the observed data and $f(Y_{mis} \mid Y_{obs}, \theta)$ is the conditional density of the missing data given the observed data. For the missing data problem, Dempster, Laird and Rubin (1977) assume that:

1) the parameters to be estimated are independent of the missing data process, and
2) the missing data are missing at random.

The loglikelihood corresponding to Equation (3.1) is

$$l(\theta \mid Y) = l(\theta \mid Y_{obs}) + \log f(Y_{mis} \mid Y_{obs}, \theta) \tag{3.2}$$

where $l(\theta \mid Y)$ is referred to as the complete-data loglikelihood, $l(\theta \mid Y_{obs})$ as the observed-data loglikelihood, and $\log f(Y_{mis} \mid Y_{obs}, \theta)$ as the missing part of the complete-data loglikelihood. The purpose is to estimate $\theta$ by maximizing $l(\theta \mid Y_{obs})$ with respect to $\theta$. Rearranging Equation (3.2), we obtain

$$l(\theta \mid Y_{obs}) = l(\theta \mid Y) - \log f(Y_{mis} \mid Y_{obs}, \theta) \tag{3.3}$$

Taking the expectation of Equation (3.3) over the distribution of the missing data given $Y_{obs}$ and a current estimate of $\theta$, say $\theta^{(r)}$, gives

$$l(\theta \mid Y_{obs}) = Q(\theta \mid \theta^{(r)}) - H(\theta \mid \theta^{(r)}) \tag{3.4}$$

where

$$Q(\theta \mid \theta^{(r)}) = \int l(\theta \mid Y)\, f(Y_{mis} \mid Y_{obs}, \theta^{(r)})\, dY_{mis}$$

and

$$H(\theta \mid \theta^{(r)}) = \int \log f(Y_{mis} \mid Y_{obs}, \theta)\, f(Y_{mis} \mid Y_{obs}, \theta^{(r)})\, dY_{mis}.$$

Jensen's inequality (Rao, 1972) states that if $g(\cdot)$ is convex then $E[g(X)] \ge g(E[X])$; applied to the logarithm it implies that

$$H(\theta^{(r)} \mid \theta^{(r)}) \ge H(\theta \mid \theta^{(r)}) \quad \text{for all } \theta,$$

so any $\theta$ that increases $Q$ cannot decrease $l(\theta \mid Y_{obs})$. Taking $\theta^{(0)}$ as a starting point, the EM algorithm performs a sequence of iterations to maximize $l(\theta \mid Y_{obs})$. The new estimate $\theta^{(r+1)}$ is a function of the previous estimate $\theta^{(r)}$, that is, $\theta^{(r+1)} = M(\theta^{(r)})$ for some mapping $M(\cdot)$. The difference in values of $l(\theta \mid Y_{obs})$ from the previous iteration is thus

$$l(\theta^{(r+1)} \mid Y_{obs}) - l(\theta^{(r)} \mid Y_{obs}) = \big[Q(\theta^{(r+1)} \mid \theta^{(r)}) - Q(\theta^{(r)} \mid \theta^{(r)})\big] - \big[H(\theta^{(r+1)} \mid \theta^{(r)}) - H(\theta^{(r)} \mid \theta^{(r)})\big] \tag{3.5}$$

The expected score is

$$S(\theta) = \frac{\partial\, l(\theta \mid Y_{obs})}{\partial \theta} = \int \frac{\partial \log f(Y \mid \theta)}{\partial \theta}\, f(Y_{mis} \mid Y_{obs}, \theta)\, dY_{mis} \tag{3.6}$$

giving

$$\frac{\partial\, Q(\theta \mid \theta^{(r)})}{\partial \theta} = \int \frac{\partial \log f(Y \mid \theta)}{\partial \theta}\, f(Y_{mis} \mid Y_{obs}, \theta^{(r)})\, dY_{mis} \tag{3.7}$$

Since $\int f(Y_{mis} \mid Y_{obs}, \theta)\, dY_{mis} = 1$, the derivative of $H(\theta \mid \theta^{(r)})$ evaluated at $\theta = \theta^{(r)}$ is the expected score of the conditional density, which, as can be proven, equals zero. We therefore have

$$\frac{\partial\, l(\theta \mid Y_{obs})}{\partial \theta}\bigg|_{\theta = \theta^{(r)}} = \frac{\partial\, Q(\theta \mid \theta^{(r)})}{\partial \theta}\bigg|_{\theta = \theta^{(r)}}.$$

Maximizing $Q(\theta \mid \theta^{(r)})$ with respect to $\theta$ is usually easier than maximizing the loglikelihood of the incomplete data, $l(\theta \mid Y_{obs})$. The EM algorithm does just that by maximizing $Q(\theta \mid \theta^{(r)})$.
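As a toy illustration of steps 1) to 4), and again an added sketch rather than anything taken from the thesis, the loop below estimates the mean of a normal sample with values assumed missing completely at random: the E-step "completes" the data with the current estimate and the M-step re-estimates the mean. For this very simple model the iteration settles at once on the observed-data mean, but the alternating structure is exactly that of the EM algorithm.

```python
import numpy as np

# Toy version of steps 1-4: estimate a normal mean with missing values.
y = np.array([5.1, np.nan, 4.3, 6.0, np.nan, 5.5, 4.8])
obs = ~np.isnan(y)

theta = y[obs].mean()                    # step 2: provisional value
for _ in range(100):
    y_filled = np.where(obs, y, theta)   # E-step: complete the data
    theta_new = y_filled.mean()          # M-step: improve the estimate
    if abs(theta_new - theta) < 1e-10:   # step 4: stop at convergence
        break
    theta = theta_new
print(theta)                             # the observed-data mean
```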
When the complete data $Y$ have a distribution from the regular exponential family, the following theorem (Mood et al., 1974) is useful in deriving the EM algorithm.

Theorem 3.1. Let $x_1, x_2, \ldots, x_n$ be a random sample from a density $f(x;\, \theta_1, \ldots, \theta_k)$. If

$$f(x;\, \theta_1, \ldots, \theta_k) = a(\theta_1, \ldots, \theta_k)\, b(x)\, \exp\!\left[\sum_{j=1}^{k} c_j(\theta_1, \ldots, \theta_k)\, d_j(x)\right],$$

that is, if $f(x;\, \theta_1, \ldots, \theta_k)$ is a member of the $k$-parameter exponential family, then $\left(\sum_{i=1}^{n} d_1(x_i), \ldots, \sum_{i=1}^{n} d_k(x_i)\right)$ is a minimal set of jointly complete and sufficient statistics.

The regular exponential family is defined by

$$f(y \mid \theta) = \exp\!\big[t(y)'\theta + c(\theta) + d(y)\big] \tag{3.8}$$

where $t(y)$ denotes the vector of complete-data sufficient statistics, $\theta$ denotes a parameter vector, $c$ is a function of $\theta$, and $d$ is a function of $y$. Many of the well-known distributions, such as the Binomial, Multinomial, Poisson, Gamma, Normal and Multivariate Normal distributions, belong to the regular exponential family. In this case the E-step of the algorithm finds

$$Q(\theta \mid \theta^{(r)}) = E\big[t(Y)'\theta + c(\theta) + d(Y) \mid Y_{obs}, \theta^{(r)}\big] = E\big[t(Y) \mid Y_{obs}, \theta^{(r)}\big]'\theta + c(\theta) + E\big[d(Y) \mid Y_{obs}, \theta^{(r)}\big].$$

Thus the E-step reduces to estimating the complete-data sufficient statistics $t(Y)$, assuming the current parameter estimates are the true values of the parameters. These expected sufficient statistics are then used to find the ML parameter estimates in the M-step.

However, the EM algorithm does not give the standard errors automatically, as can be seen below. Let $y$ denote the complete data and $x$ the observed data; given $x$, the complete data lie in a set $Y(x)$. The likelihood $L$ is given by

$$L = \int_{Y(x)} f(y)\, dy$$

and the loglikelihood is

$$l = \log \int_{Y(x)} f(y)\, dy.$$

At the M-step of the EM algorithm we maximize

$$Q = \int_{Y(x)} f^{(r)}(y \mid x)\, \log f(y)\, dy$$

where $f^{(r)}(y \mid x)$ uses the parameter estimates from the previous iteration. Now consider the derivatives with respect to the parameters $\beta$:

$$\frac{\partial l}{\partial \beta} = \frac{1}{\int_{Y(x)} f(y)\, dy} \int_{Y(x)} \frac{\partial f(y)}{\partial \beta}\, dy, \qquad \frac{\partial Q}{\partial \beta} = \int_{Y(x)} f^{(r)}(y \mid x)\, \frac{1}{f(y)}\, \frac{\partial f(y)}{\partial \beta}\, dy.$$

At the ML solution $f^{(r)}(y \mid x) = f(y \mid x)$, so that

$$\frac{\partial l}{\partial \beta} \propto \frac{\partial Q}{\partial \beta}.$$

Thus a solution of $\partial Q / \partial \beta = 0$ is the ML solution. Note that, in principle, there may be other solutions; Dempster et al. (1977) showed that the procedure converges to the ML solution. However:

1. The second derivatives of $l$ and $Q$ are not the same, so using the second derivatives from the EM procedure does not yield the correct asymptotic covariance matrix of the parameter estimates.
2. $l$ and $Q$ are not simply related, so the correction of the second derivatives is not trivial.

The above derivation is valid for the Missing Completely At Random (MCAR) situation, in which the probability of the missing data pattern depends on neither $y$ nor $\beta$. To extend this to more general cases, we consider a different representation of the data. Let $r$ be a missing data indicator vector such that $r_k = 1$ if the $k$th observation is missing or incomplete, and $0$ otherwise. Let the complete data be $(y, r)$ and the incomplete data $(x, r)$. Now consider a missing data mechanism defined by the conditional distribution of $r$ given $y$, with parameters $\psi$. The complete-data likelihood can be written as

$$L_c = f(y;\, \beta)\, g(r \mid y;\, \psi).$$

Note that the complete-data situation includes the complete observations $y$, the indicators of which will be missing in the incomplete data. In the missing at random (MAR) case we assume that the conditional distribution of $r$ depends only on $x$, so that

$$L_c = f(y;\, \beta)\, g(r \mid x;\, \psi),$$

and thus

$$l = \log \int_{Y(x)} f(y;\, \beta)\, g(r \mid x;\, \psi)\, dy = \log g(r \mid x;\, \psi) + \log \int_{Y(x)} f(y;\, \beta)\, dy.$$

It is clear that for inference about $\beta$ the first term can be ignored, and we are in the same situation as in the previous section.

3.4 Incomplete Contingency Table

An incomplete contingency table, also called missing categorical data, is a contingency table in which information on one or more of the categorical variables is missing. It is assumed that the data are MAR and that the missing data mechanism is ignorable.
We will discuss ML estimation of the cell probabilities in an incomplete contingency table using all the observed data, including data for which information on one or more of the categorical variables is missing. Lipsitz, Parzen and Molenberghs (1998) use the Poisson generalized linear model to obtain ML estimates of cell probabilities for the saturated loglinear model, whilst Little and Rubin (1988) describe and use the EM algorithm to determine the ML estimates of cell probabilities for any loglinear model.

3.4.1 ML Estimation in Incomplete Contingency Table

Consider an $I \times J$ contingency table with categorical variables $X \in \{1, 2, \ldots, I\}$ and $Y \in \{1, 2, \ldots, J\}$. A multinomial sampling procedure is assumed. Let $x_{ij}$ be the count in cell $(i, j)$ and $n = \sum_i \sum_j x_{ij}$ the total count. The counts in the cells can be arranged to form the complete data vector $x = (x_{11}, x_{12}, \ldots, x_{IJ})'$ with $E[x] = m$, the vector of expected counts.

If information on one or both of the categories is missing, the contingency table is said to be incomplete. The data to be classified in the contingency table can then be split into two parts, namely:

1. the fully classified cases, where the information on all of the categories is available, and
2. the partially classified cases, where information on some of the categories is missing.

It is assumed that the data are MAR and the missing data mechanism is ignorable.

3.4.2 The EM Algorithm

3.4.2.1 Multinomial Sampling

If the probability that an observation falls in cell $(i, j)$ is $\pi_{ij}$, where $\pi_{ij} \ge 0$ and $\sum_i \sum_j \pi_{ij} = 1$, then the complete data $x$ have a multinomial distribution,

$$x \sim M(n;\, \pi_{11}, \pi_{12}, \ldots, \pi_{IJ})$$

with probability

$$f(x \mid \pi) = \frac{n!}{\prod_i \prod_j x_{ij}!}\; \pi_{11}^{x_{11}}\, \pi_{12}^{x_{12}} \cdots \pi_{IJ}^{x_{IJ}} \tag{3.9}$$

where $\pi = (\pi_{11}, \pi_{12}, \ldots, \pi_{IJ})'$. The kernel of the complete-data loglikelihood is

$$l(\pi \mid x) = x_{11} \log \pi_{11} + x_{12} \log \pi_{12} + \cdots + x_{IJ} \log \pi_{IJ}.$$

The cell counts $x_{ij}$ are the sufficient statistics, and the MLE of $\pi_{ij}$ is

$$\hat{\pi}_{ij} = \frac{x_{ij}}{n}.$$

3.4.2.2 Product Multinomial Sampling

Let $n_{i+} = \sum_j x_{ij}$ be the total count in row $i$ and $\pi_{i+} = \sum_j \pi_{ij}$ the probability that an element falls in row $i$. If the $n_{i+}$ elements of row $i$ are independent, each having probability distribution $\pi_{ij}/\pi_{i+}$ for $j = 1, 2, \ldots, J$, then, given the row totals and the vector of cell probabilities $\pi$, the elements of row $i$ have a multinomial distribution,

$$(x_{i1}, \ldots, x_{iJ}) \mid n_{i+}, \pi \;\sim\; M\!\left(n_{i+};\, \frac{\pi_{i1}}{\pi_{i+}}, \ldots, \frac{\pi_{iJ}}{\pi_{i+}}\right) \tag{3.10}$$

When samples from different rows are independent, the joint probability function for the entire data set is the product of $I$ multinomial probability functions,

$$f(x \mid n_{1+}, \ldots, n_{I+}, \pi) = \prod_{i=1}^{I} \frac{n_{i+}!}{x_{i1}!\, x_{i2}! \cdots x_{iJ}!} \left(\frac{\pi_{i1}}{\pi_{i+}}\right)^{x_{i1}} \cdots \left(\frac{\pi_{iJ}}{\pi_{i+}}\right)^{x_{iJ}}.$$

Similarly, if the column totals $n_{+j} = \sum_i x_{ij}$ are fixed, then the elements of column $j$ have a multinomial distribution,

$$(x_{1j}, \ldots, x_{Ij}) \mid n_{+j}, \pi \;\sim\; M\!\left(n_{+j};\, \frac{\pi_{1j}}{\pi_{+j}}, \ldots, \frac{\pi_{Ij}}{\pi_{+j}}\right) \tag{3.11}$$

with $\pi_{+j} = \sum_i \pi_{ij}$.
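A quick simulation can confirm the complete-data MLE $\hat{\pi}_{ij} = x_{ij}/n$. The snippet below is an illustrative sketch (Python with NumPy; the cell probabilities are made up for the example): it draws one large multinomial sample over a 2 × 3 table and compares the relative frequencies with the true probabilities.

```python
import numpy as np

rng = np.random.default_rng(1)

# Multinomial sampling over a 2 x 3 table as in (3.9); the probabilities
# here are illustrative and sum to one.
pi = np.array([[0.12, 0.15, 0.13],
               [0.18, 0.19, 0.23]])
n = 10_000
x = rng.multinomial(n, pi.ravel()).reshape(2, 3)
print(x / n)   # the MLE x_ij / n, close to pi for large n
```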
3.4.2.3 EM Algorithm to Determine the ML Estimates of the Cell Probabilities in an Incomplete I × J Contingency Table: Data Missing on Both Categories

If missing values occur on both $X$ and $Y$, the observed data can be partitioned into three parts, denoted by A, B and C, where A includes units having both $X$ and $Y$ observed, B includes those having only $X$ observed, and C includes those where only $Y$ was observed. In part A the observations are fully classified; in B and C they are only partially classified. The three parts of the sample are displayed in Table 3.1. The objective is to determine the ML estimates of the cell probabilities in the $I \times J$ table by using both the fully and the partially classified data.

(a) Sample part A: both variables observed

          Y = 1    Y = 2    ⋯    Y = J
X = 1     a_11     a_12     ⋯    a_1J
X = 2     a_21     a_22     ⋯    a_2J
⋮         ⋮        ⋮             ⋮
X = I     a_I1     a_I2     ⋯    a_IJ

(b) Sample part B: Y is missing

X = 1     b_1
X = 2     b_2
⋮         ⋮
X = I     b_I

(c) Sample part C: X is missing

Y = 1     Y = 2    ⋯    Y = J
c_1       c_2      ⋯    c_J

Table 3.1: Classification of sample units in an incomplete I × J contingency table.

Assume that the data are MAR and that the missingness mechanism is ignorable. Let $a' = (a_{11}, a_{12}, \ldots, a_{IJ})$, $b' = (b_1, b_2, \ldots, b_I)$ and $c' = (c_1, c_2, \ldots, c_J)$. Since $Y$ is missing in sample part B, the counts observed there are totals across $j$. Hence, compared with sample part A, row totals are observed in sample part B and column totals in sample part C. The observed data are

$$Y_{obs} = \{a_{ij},\, b_i,\, c_j :\, i = 1, 2, \ldots, I;\ j = 1, 2, \ldots, J\}.$$

Let $Y_{obs} = (a', b', c')'$ be the observed data vector, $x = (x_{11}, \ldots, x_{IJ})'$ the complete data vector, and $\pi = (\pi_{11}, \ldots, \pi_{IJ})'$ the vector of cell probabilities for which the ML estimates must be determined. Each complete-data count can be expressed as the sum of contributions from each of the three sample parts, that is, $x_{ij} = a_{ij} + b_{ij} + c_{ij}$. For sample part B the totals across $j$ are observed, that is $b_i = \sum_j b_{ij}$, whilst the individual cell counts $b_{ij}$ are missing. It follows from (3.10) that the predictive distribution of the missing data in part B given $b_i$ and $\pi$ is product multinomial,

$$(b_{i1}, \ldots, b_{iJ}) \mid b_i, \pi \;\sim\; M\!\left(b_i;\, \frac{\pi_{i1}}{\pi_{i+}}, \ldots, \frac{\pi_{iJ}}{\pi_{i+}}\right) \tag{3.12}$$

For part C only the totals across $i$ are observed, that is $c_j = \sum_i c_{ij}$. From (3.11), the predictive distribution of the missing data in sample part C given $c_j$ and $\pi$ is product multinomial,

$$(c_{1j}, \ldots, c_{Ij}) \mid c_j, \pi \;\sim\; M\!\left(c_j;\, \frac{\pi_{1j}}{\pi_{+j}}, \ldots, \frac{\pi_{Ij}}{\pi_{+j}}\right) \tag{3.13}$$

Thus,

$$E[x_{ij} \mid Y_{obs}, \pi] = a_{ij} + b_i\, \frac{\pi_{ij}}{\pi_{i+}} + c_j\, \frac{\pi_{ij}}{\pi_{+j}} \tag{3.14}$$

The distribution of the complete data belongs to the regular exponential family with the cell counts $x_{ij}$ as sufficient statistics. In the E-step of the EM algorithm, $x_{ij}^{(r)}$ is calculated, where $\pi^{(r)}$, $r = 0, 1, 2, \ldots$, is the $r$th estimate of $\pi$. From (3.14),

$$x_{ij}^{(r)} = a_{ij} + b_i\, \frac{\pi_{ij}^{(r)}}{\pi_{i+}^{(r)}} + c_j\, \frac{\pi_{ij}^{(r)}}{\pi_{+j}^{(r)}} \tag{3.15}$$

In the M-step, $\pi^{(r+1)}$ is calculated by substituting the results from the E-step into the expression for the MLE of $\pi$ for complete data. That is,

$$\pi_{ij}^{(r+1)} = \frac{x_{ij}^{(r)}}{n} = \frac{1}{n}\left(a_{ij} + b_i\, \frac{\pi_{ij}^{(r)}}{\pi_{i+}^{(r)}} + c_j\, \frac{\pi_{ij}^{(r)}}{\pi_{+j}^{(r)}}\right) \tag{3.16}$$

The process iterates between (3.15) and (3.16) until convergence is attained.
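The two updates (3.15) and (3.16) translate directly into code. The function below is a sketch of the procedure just described; the implementation details, including the use of Python with NumPy and the name em_incomplete_table, are choices made for this write-up, not part of the thesis. Starting from the fully classified part, each pass allocates the partially classified counts $b_i$ and $c_j$ over the cells in proportion to the current probabilities and then renormalizes.

```python
import numpy as np

def em_incomplete_table(a, b, c, tol=1e-10, max_iter=1000):
    """EM estimates of the cell probabilities pi_ij for an I x J table.

    a : I x J array of counts fully classified on both variables (part A)
    b : length-I counts with the column variable missing (part B)
    c : length-J counts with the row variable missing (part C)
    """
    a = np.asarray(a, float)
    b = np.asarray(b, float)
    c = np.asarray(c, float)
    n = a.sum() + b.sum() + c.sum()
    pi = a / a.sum()                     # start from the fully classified part
    for _ in range(max_iter):
        # E-step (3.15): expected complete-data counts x_ij^(r)
        x = (a
             + b[:, None] * pi / pi.sum(axis=1, keepdims=True)
             + c[None, :] * pi / pi.sum(axis=0, keepdims=True))
        # M-step (3.16): new estimates pi_ij^(r+1) = x_ij^(r) / n
        pi_new = x / n
        if np.max(np.abs(pi_new - pi)) < tol:
            return pi_new
        pi = pi_new
    return pi
```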
3.5 Chi-Squared Test

3.5.1 Goodness-of-fit Test

A problem that arises frequently in statistical work is testing the compatibility of a set of observed and theoretical frequencies. This type of problem has already been discussed and solved for the special case in which there are only two pairs of frequencies to be compared. Consider the result obtained from an experiment of tossing a die 300 times, shown in Table 3.2 below:

Outcome     1    2    3    4    5    6
Frequency   45   52   60   58   44   41

Table 3.2: Frequency distribution.

There are six possible outcomes for each trial, namely obtaining the number 1, 2, 3, 4, 5 or 6; these outcomes are also referred to as categories. The question we would like to answer is whether the die is fair, and the results obtained from the experiment are the evidence for concluding whether it is or not. We know that a fair die has the characteristic

$$p(1) = p(2) = p(3) = p(4) = p(5) = p(6) = \frac{1}{6}.$$

If $X$ is a random variable representing the outcome obtained on each trial, then $X$ follows the uniform distribution with $P(X = i) = 1/6$ for $i = 1, 2, \ldots, 6$. The objective is to test the hypothesis that the die is fair, which can be stated as follows:

$H_0$: $p(1) = p(2) = p(3) = p(4) = p(5) = p(6) = \frac{1}{6}$
$H_1$: $p(i) \ne p(j)$ for some $i \ne j$.

The statement in $H_0$ is equivalent to the die being fair and the statement in $H_1$ to the die not being fair. If the die is fair, we expect the frequency for outcome (category) $i$ to be $E_i = n\, p(i)$ for $i = 1, 2, \ldots, 6$, where $n$ is the number of trials. This then gives us the expected frequencies

$$E_i = n\, p(i) = 300 \times \frac{1}{6} = 50, \qquad i = 1, 2, \ldots, 6.$$

However, the observed frequencies obtained from the experiment are

$$O_1 = 45, \quad O_2 = 52, \quad O_3 = 60, \quad O_4 = 58, \quad O_5 = 44, \quad O_6 = 41,$$

which differ from the expected frequencies of a fair die. The logic is that if the die is fair, each difference $(O_i - E_i)$ is zero or a small number. The differences between the observed and the expected frequencies form the statistic for testing hypotheses about the probability distribution of the random variable, stated in the following theorem.

Theorem 3.2. The statistic

$$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}$$

follows the Chi-Square distribution with $(k - p - 1)$ degrees of freedom, where $k$ is the number of categories and $p$ is the number of unknown parameters that must be estimated from the data. If there are no unknown parameters, then $p = 0$ and the degrees of freedom are $k - 1$. This is a one-tailed test in which $H_0$ is rejected if the calculated statistic $\chi^2_{calc}$ exceeds $\chi^2_{\alpha,\, k-p-1}$ at significance level $\alpha$.

i    O_i    E_i                    (O_i − E_i)²/E_i
1    45     300 × (1/6) = 50       (45 − 50)²/50 = 0.50
2    52     300 × (1/6) = 50       (52 − 50)²/50 = 0.08
3    60     300 × (1/6) = 50       (60 − 50)²/50 = 2.00
4    58     300 × (1/6) = 50       (58 − 50)²/50 = 1.28
5    44     300 × (1/6) = 50       (44 − 50)²/50 = 0.72
6    41     300 × (1/6) = 50       (41 − 50)²/50 = 1.62

Table 3.3: The calculation of the χ² statistic.

Since the statistic is calculated from the observed sample, we denote it by $\chi^2_{calc}$. So,

$$\chi^2_{calc} = 0.50 + 0.08 + 2.00 + 1.28 + 0.72 + 1.62 = 6.20.$$

At significance level $\alpha = 0.05$, with no unknown parameters so that the degrees of freedom are $k - 1 = 5$, the critical value is $\chi^2_{0.05,\,5} = 11.070$; we reject $H_0$ if $\chi^2_{calc} > 11.070$ and accept it otherwise. Since $\chi^2_{calc} = 6.20 < 11.070$, we accept $H_0$ and conclude that there is no evidence that the die is not fair; in other words, the die is fair.

The test we have just seen is called the goodness-of-fit test. In general, we observe the following table, with categories $i = 1, 2, \ldots, k$ and $n = O_1 + O_2 + \cdots + O_k$, where $O_i$ represents the observed frequency for category $i$:

Category    1     2     ⋯    k
Frequency   O_1   O_2   ⋯    O_k

Table 3.4: The observed frequency for category i.

The belief about the probability $p(i)$ of category $i$ occurring is stated in the null hypothesis as

$$H_0: \; p(i) = p_i \quad \text{for } i = 1, 2, \ldots, k.$$

Assuming $H_0$ is correct, the expected frequency for each category $i$ is calculated as $E_i = n\, p_i$, and with the help of Theorem 3.2 we can test the hypothesis stated in $H_0$.
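The arithmetic of Table 3.3 can be verified in a few lines. The sketch below (Python with NumPy and SciPy assumed) reproduces $\chi^2_{calc} = 6.20$ and the tabled critical value $\chi^2_{0.05,\,5} = 11.070$.

```python
import numpy as np
from scipy.stats import chi2

# Goodness-of-fit test for the die-tossing data of Table 3.2.
observed = np.array([45, 52, 60, 58, 44, 41])
expected = np.full(6, observed.sum() / 6)        # 300 * (1/6) = 50

chi_sq = ((observed - expected) ** 2 / expected).sum()
critical = chi2.ppf(0.95, df=len(observed) - 1)  # chi^2_{0.05, 5}
print(round(chi_sq, 2), round(critical, 3))      # 6.2 11.07, so accept H0
```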
3.5.2 Independence Test

A very useful application of the $\chi^2$ test occurs in connection with testing the compatibility of observed and expected frequencies in two-way tables. Such two-way tables are usually called contingency tables; Table 3.5 below is an illustration. A contingency table is usually constructed for the purpose of studying the relationship between the two variables of classification. In particular, one may wish to know whether the two variables are related. By means of the $\chi^2$ test it is possible to test the hypothesis that the two variables are independent. This test is called the independence test, and it capitalizes on the definition of independent events in probability,

$$P(A \cap B) = P(A) \cdot P(B).$$

                    Column variable
Row variable   B_1     B_2     ⋯    B_c
A_1            o_11    o_12    ⋯    o_1c
A_2            o_21    o_22    ⋯    o_2c
⋮              ⋮       ⋮            ⋮
A_r            o_r1    o_r2    ⋯    o_rc

Table 3.5: A two-dimensional contingency table.

The above contingency table is an $(r \times c)$ contingency table, where $r$ denotes the number of categories of the row variable, $c$ denotes the number of categories of the column variable, and $o_{ij}$ is the observed frequency in cell $(i, j)$, that is, the observed frequency for category $i$ of the row variable and category $j$ of the column variable. Let:

$n_{i+}$ be the total frequency for row category $i$,
$n_{+j}$ be the total frequency for column category $j$, and
$n$ be the grand total frequency over all cells $(i, j)$.

Each cell represents the joint event $A_i \cap B_j$. Thus:

                    Column variable
Row variable   B_1              B_2              ⋯    B_c
A_1            P(A_1 ∩ B_1)     P(A_1 ∩ B_2)     ⋯    P(A_1 ∩ B_c)
A_2            P(A_2 ∩ B_1)     P(A_2 ∩ B_2)     ⋯    P(A_2 ∩ B_c)
⋮              ⋮                ⋮                     ⋮
A_r            P(A_r ∩ B_1)     P(A_r ∩ B_2)     ⋯    P(A_r ∩ B_c)

Table 3.6: A two-dimensional contingency table of joint events.

If the events $A_i$ and $B_j$ are independent, then $P(A_i \cap B_j) = P(A_i)\, P(B_j)$. Most often we do not know the true values of $P(A_i)$ and $P(B_j)$, but we know from probability estimation that the best estimator of a population proportion or probability is the sample proportion. Thus

$$\hat{P}(A_i) = \frac{n_{i+}}{n} \quad \text{and} \quad \hat{P}(B_j) = \frac{n_{+j}}{n}.$$

Therefore the estimated probability for the joint categories is

$$\hat{P}(A_i \cap B_j) = \hat{P}(A_i)\, \hat{P}(B_j) = \frac{n_{i+}}{n} \times \frac{n_{+j}}{n}.$$

With this estimated joint probability we can find the expected frequency in each cell if $A_i$ and $B_j$ are independent. The expected frequency in cell $(i, j)$ is

$$e_{ij} = n\, \hat{P}(A_i \cap B_j) = n \cdot \frac{n_{i+}}{n} \cdot \frac{n_{+j}}{n} = \frac{n_{i+}\, n_{+j}}{n}.$$

Now, if $A_i$ and $B_j$ are truly independent, we anticipate that $o_{ij}$ and $e_{ij}$ do not differ much, and if they do differ, the difference is not significant. The statistic based on $(o_{ij} - e_{ij})$ forms the basis of the independence test, which is stated in Theorem 3.3.

Theorem 3.3. The statistic

$$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(o_{ij} - e_{ij})^2}{e_{ij}}$$

follows the Chi-Square distribution with $(r - 1)(c - 1)$ degrees of freedom, where $o_{ij}$ is the observed and $e_{ij}$ the expected frequency in cell $(i, j)$. The theorem can be written simply as

$$\chi^2 = \sum_i \sum_j \frac{(o_{ij} - e_{ij})^2}{e_{ij}} \;\sim\; \chi^2_{(r-1)(c-1)}.$$

This theorem is useful in testing the following hypotheses:

$H_0$: the row and column variables are independent.
$H_1$: the row and column variables are not independent.

This is a one-tailed test on the right, where $H_0$ is rejected at significance level $\alpha$ if the calculated value $\chi^2_{calc}$ is greater than $\chi^2_{\alpha,\,(r-1)(c-1)}$.
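The whole recipe, expected frequencies from the margins, the statistic of Theorem 3.3 and its critical value, fits in a short function. The following is an illustrative sketch (Python with NumPy and SciPy assumed; the function name is mine):

```python
import numpy as np
from scipy.stats import chi2

def independence_test(table, alpha=0.05):
    """Chi-squared independence test for an r x c contingency table.

    Returns the calculated statistic and the critical value; H0
    (independence) is rejected when the statistic exceeds the critical value.
    """
    o = np.asarray(table, float)
    n = o.sum()
    e = np.outer(o.sum(axis=1), o.sum(axis=0)) / n   # e_ij = n_i+ n_+j / n
    stat = ((o - e) ** 2 / e).sum()
    df = (o.shape[0] - 1) * (o.shape[1] - 1)
    return stat, chi2.ppf(1 - alpha, df)
```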
CHAPTER 4

RESULT AND DISCUSSION

Suppose we are examining the effect of age on income. If missingness on income is a function of age, in other words, if older individuals do not report their income, then the data are MAR. If missingness on income is a function of income, i.e. persons with a high income refuse to report their income, then the data are NMAR. To understand these mechanisms better, let us consider a simple example of missing data. Suppose that we have the full data shown in Table 4.1 below.

(a) Continuous data: thirty records, in which the age values cycle through <30, 30 to 55 and >55; the first fifteen records have income High and the last fifteen have income Low.

(b) Categorical data:

        <30   30 to 55   >55
High     5       5        5
Low      5       5        5

Table 4.1: An example dataset of full data.

Now consider a situation where the data are fully observed on the age values but some income values are missing; * in the tables denotes missing information.

(a) Continuous data: the ages cycle as before and the corresponding income values are
*, High, High, High, *, High, High, High, *, *, High, High, High, *, *, Low, Low, *, *, *, Low, *, Low, Low, Low, *, Low, Low, Low, *.

(b) Categorical data (fully observed cases):

        <30   30 to 55   >55
High     2       2        2
Low      2       2        2

Table 4.2: An example dataset for MCAR.

Table 4.2 is created such that the observations are Missing Completely At Random (MCAR): the missingness depends on neither the age values nor the income values, which means the missing income values are not related to the age values. From Table 4.2(b) we can see that each cell of the table has the same probability of being missing. The data can be confirmed as MCAR by performing t-tests of mean differences on income and age after dividing the respondents into those with and those without missing data, to establish that the two groups do not differ significantly. The SPSS Missing Values Analysis (MVA) option supports Little's MCAR test, a chi-squared test for missing completely at random; if the p-value of Little's MCAR test is not significant, the data may be assumed to be MCAR.

Table 4.3 is constructed in such a way that some observations are Missing At Random (MAR): the missing values are not randomly distributed among all observations, but are randomly distributed within one or more subsamples. The probability of an observation being missing depends on the observed values but not on the missing values. The observed values in this case are the age values, so the missingness depends on age. For example, in this construction the information about individuals whose age is greater than 55 tends to be missing.

(a) Continuous data: the ages cycle as before and the corresponding income values are
High, High, *, High, *, *, High, High, High, *, High, High, High, High, *, *, Low, Low, Low, Low, *, Low, Low, *, Low, *, Low, Low, Low, *.

(b) Categorical data (fully observed cases):

        <30   30 to 55   >55
High     4       4        2
Low      4       4        2

Table 4.3: An example dataset for MAR.

Besides that, Table 4.4 is created so that some of its observations are unknown in a Not Missing At Random (NMAR) fashion: the probability of an income value being missing depends on the value of income itself, i.e. those with a high income refused to reveal it.

(a) Continuous data: the ages cycle as before and the corresponding income values are
*, *, High, High, High, *, *, *, High, *, High, *, *, High, *, *, Low, Low, Low, Low, Low, Low, Low, *, Low, Low, Low, Low, *, Low.

(b) Categorical data (fully observed cases):

        <30   30 to 55   >55
High     2       2        2
Low      4       4        4

Table 4.4: An example dataset for NMAR.

For Missing At Random (MAR) and Not Missing At Random (NMAR), testing can be done through the SPSS Missing Values Analysis (MVA). By default it generates a table of "Separate Variance t-Tests", in which the rows are all variables with 1% missing or more and the columns are all variables. In any cell, if $p(2\text{-tail}) \le 0.05$, the missing cases on the row variable are significantly related to the column variable and so are not missing completely at random; otherwise the missingness may be treated as random.
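The diagnostic just described is easy to imitate outside SPSS. The sketch below (Python with NumPy and SciPy assumed; the data are simulated here, not taken from the study) deletes income values by a MAR rule and then runs a separate-variances t-test comparing age between cases with and without missing income.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(2)

# Simulated data: age observed for everyone, income missing by a MAR rule.
age = rng.uniform(20, 70, 500)
missing = rng.random(500) < np.where(age > 55, 0.5, 0.1)

# Separate-variances (Welch) t-test on the observed variable.
t, p = ttest_ind(age[missing], age[~missing], equal_var=False)
# A small p-value means the missing-income group differs in age, so the
# data are not MCAR; the missingness is related to the observed ages.
print(round(float(t), 2), round(float(p), 4))
```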
4.1 Data Construction

Suppose we have the data in Table 4.5. The row variable is the income obtained by the respondents and the column variable refers to their ages.

Income    <30    30-55   >55    Total
High      100    130     110    340
Low       150    155     195    500
Total     250    285     305    840

Table 4.5: Full data.

4.1.1 Missing Completely At Random (MCAR)

In Table 4.6(a), the information for about 10% of the total number of candidates is missing. The values 34 and 49 refer to candidates whose income is known to be High and Low respectively but whose age is missing, whilst the values 25, 28 and 30 refer to candidates whose age is known to be below 30, between 30 and 55, and above 55 respectively but whose income status is missing. In Table 4.6(b), 20% of the candidates have missing information: 67 and 101 are the counts with income High and Low respectively but age missing, and 50, 57 and 61 the counts with age below 30, between 30 and 55, and above 55 respectively but income missing. In Table 4.6(c), 30% of the candidates have missing information: 101 and 151 are the counts with income High and Low respectively but age missing, and 75, 86 and 91 the counts with age below 30, between 30 and 55, and above 55 respectively but income missing. In each table the corner entry is the total number of partially classified units in each direction.

(a) MCAR with 10% data missing:

Income     <30    30-55   >55    Age missing
High       90     117     99     34
Low        135    140     176    49
Missing    25     28      30     83

(b) MCAR with 20% data missing:

Income     <30    30-55   >55    Age missing
High       80     105     88     67
Low        120    123     156    101
Missing    50     57      61     168

(c) MCAR with 30% data missing:

Income     <30    30-55   >55    Age missing
High       30     38      33     101
Low        45     48      58     151
Missing    75     86      91     252

Table 4.6: Artificial incomplete data for MCAR.

Following the notation of the previous chapter, for Table 4.6(a) we have $Y_{obs} = (a', b', c')$, where $a'$ is the fully classified data,

$$a' = \{a_{ij} :\, i = 1, 2;\ j = 1, 2, 3\} = \{90, 117, 99, 135, 140, 176\},$$

$b'$ is the partially classified data with age missing, $b' = \{b_i :\, i = 1, 2\} = \{34, 49\}$, and $c'$ is the partially classified data with income missing, $c' = \{c_j :\, j = 1, 2, 3\} = \{25, 28, 30\}$.

The fully classified data, whose total is $90 + 117 + 99 + 135 + 140 + 176 = 757$, were used to determine a starting value for the algorithm:

$$\pi^{(0)\prime} = \frac{1}{757}\,\{90, 117, 99, 135, 140, 176\} \approx \{0.11889, 0.15456, 0.13078, 0.17834, 0.18494, 0.23250\} \tag{4.1}$$

The corresponding marginal totals are shown in Table 4.7.

$\pi^{(0)}$         <30        30-55      >55        $\pi_{i+}^{(0)}$
High                0.11889    0.15456    0.13078    0.40423
Low                 0.17834    0.18494    0.23250    0.59578
$\pi_{+j}^{(0)}$    0.29723    0.33950    0.36328

Table 4.7: Marginal totals of probabilities for MCAR with 10% data missing.

From (3.16), with $n = 757 + 83 + 83 = 923$ units in the three sample parts, the first estimate of $\pi_{11}$ is

$$\pi_{11}^{(1)} = \frac{1}{n}\left(a_{11} + b_1 \frac{\pi_{11}^{(0)}}{\pi_{1+}^{(0)}} + c_1 \frac{\pi_{11}^{(0)}}{\pi_{+1}^{(0)}}\right) = \frac{1}{923}\left(90 + 34\cdot\frac{0.11889}{0.40423} + 25\cdot\frac{0.11889}{0.29723}\right) = 0.119177$$

Similarly, the first estimates of $\pi_{12}$, $\pi_{13}$, $\pi_{21}$, $\pi_{22}$ and $\pi_{23}$ are

$$\pi_{12}^{(1)} = \frac{1}{923}\left(117 + 34\cdot\frac{0.154557}{0.40423} + 28\cdot\frac{0.154557}{0.33950}\right) = 0.154656$$

$$\pi_{13}^{(1)} = \frac{1}{923}\left(99 + 34\cdot\frac{0.130779}{0.40423} + 30\cdot\frac{0.130779}{0.36328}\right) = 0.130878$$

$$\pi_{21}^{(1)} = \frac{1}{923}\left(135 + 49\cdot\frac{0.178336}{0.59578} + 25\cdot\frac{0.178336}{0.29723}\right) = 0.178405$$

$$\pi_{22}^{(1)} = \frac{1}{923}\left(140 + 49\cdot\frac{0.184941}{0.59578} + 28\cdot\frac{0.184941}{0.33950}\right) = 0.184684$$

$$\pi_{23}^{(1)} = \frac{1}{923}\left(176 + 49\cdot\frac{0.232497}{0.59578} + 30\cdot\frac{0.232497}{0.36328}\right) = 0.232201$$

This gives

$$\pi^{(1)\prime} = \{0.119177, 0.154656, 0.130878, 0.178405, 0.184684, 0.232201\},$$

which is used to calculate the second estimate of $\pi$. The process continues until convergence is attained. Table 4.8 shows the values of $\pi^{(r)}$ at different steps of the algorithm.
r    $\pi_{11}^{(r)}$   $\pi_{12}^{(r)}$   $\pi_{13}^{(r)}$   $\pi_{21}^{(r)}$   $\pi_{22}^{(r)}$   $\pi_{23}^{(r)}$
0    0.118890   0.154557   0.130779   0.178336   0.184941   0.232497
1    0.119177   0.154656   0.130878   0.178405   0.184684   0.232201
2    0.119203   0.154663   0.130887   0.178410   0.184660   0.232170
3    0.119205   0.154663   0.130888   0.178411   0.184657   0.232175
∞    0.119206   0.154663   0.130889   0.178411   0.184657   0.232175

Table 4.8: Iteration of the EM algorithm for the MCAR 10% missing data problem.

From the probabilities obtained in Table 4.8 we can find the complete data table using

$$x_{ij} = \hat{\pi}_{ij} \cdot N \tag{4.2}$$

where $N = 840$ is the size of the full data set. Making use of Equation (4.2), we have

$$x_{11} = \hat{\pi}_{11} \cdot 840 = 0.119206 \times 840 = 100.13556 \approx 100.$$

Similarly, the values of $x_{12}$, $x_{13}$, $x_{21}$, $x_{22}$ and $x_{23}$ are

$$x_{12} = 0.154663 \times 840 \approx 130, \quad x_{13} = 0.130889 \times 840 \approx 110,$$
$$x_{21} = 0.178411 \times 840 \approx 150, \quad x_{22} = 0.184657 \times 840 \approx 155, \quad x_{23} = 0.232175 \times 840 \approx 195.$$

Therefore the complete data table is

Income    <30    30-55   >55
High      100    130     110
Low       150    155     195

Table 4.9: Complete data obtained by the EM algorithm for the 10% MCAR problem.

Repeating the same process on the 20% and 30% MCAR missing data problems, we have the following.

(a) Iteration of the EM algorithm:

r    $\pi_{11}^{(r)}$   $\pi_{12}^{(r)}$   $\pi_{13}^{(r)}$   $\pi_{21}^{(r)}$   $\pi_{22}^{(r)}$   $\pi_{23}^{(r)}$
0    0.119408   0.156250   0.130952   0.178571   0.183036   0.232143
1    0.118684   0.155773   0.130553   0.178944   0.183418   0.232628
2    0.118623   0.155701   0.130481   0.179006   0.183490   0.232699
3    0.118613   0.155690   0.130468   0.179016   0.183504   0.232710
∞    0.118611   0.155688   0.130465   0.179018   0.183507   0.232712

(b) Complete data obtained by the EM algorithm:

Income    <30    30-55   >55
High      100    131     110
Low       150    154     195

Table 4.10: MCAR with 20% of the data missing.

(a) Iteration of the EM algorithm:

r    $\pi_{11}^{(r)}$   $\pi_{12}^{(r)}$   $\pi_{13}^{(r)}$   $\pi_{21}^{(r)}$   $\pi_{22}^{(r)}$   $\pi_{23}^{(r)}$
0    0.119048   0.156463   0.130952   0.178571   0.181973   0.232993
1    0.118664   0.156261   0.130296   0.178965   0.182726   0.233088
2    0.118571   0.156219   0.130133   0.179059   0.182918   0.233101
3    0.118548   0.156209   0.130092   0.179082   0.182967   0.233102
∞    0.118540   0.156206   0.130080   0.179089   0.182985   0.233101

(b) Complete data obtained by the EM algorithm:

Income    <30    30-55   >55
High      100    131     109
Low       151    154     196

Table 4.11: MCAR with 30% of the data missing.
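For completeness, the sketch below (Python with NumPy assumed; added for this write-up, not code from the thesis) runs the iteration (3.15)-(3.16) on the counts of Table 4.6(a) and reproduces the converged probabilities of Table 4.8 and the completed table of Table 4.9.

```python
import numpy as np

# Counts from Table 4.6(a): 10% MCAR.
a = np.array([[90, 117, 99],
              [135, 140, 176]], float)   # fully classified (part A)
b = np.array([34, 49], float)            # income observed, age missing
c = np.array([25, 28, 30], float)        # age observed, income missing
n = a.sum() + b.sum() + c.sum()          # 923 units in the three parts

pi = a / a.sum()                         # starting values (Table 4.7)
for _ in range(100):
    x = (a + b[:, None] * pi / pi.sum(axis=1, keepdims=True)
           + c[None, :] * pi / pi.sum(axis=0, keepdims=True))  # E-step (3.15)
    pi_new = x / n                                             # M-step (3.16)
    if np.allclose(pi_new, pi, atol=1e-12):
        break
    pi = pi_new
print(np.round(pi, 6))      # converged probabilities of Table 4.8
print(np.round(pi * 840))   # completed table of Table 4.9
```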
4.1.2 Missing At Random (MAR)

For this type of mechanism, the proportion of missing values in the column for ages above 55 is greater than in the rest of the table. In Table 4.12(a), the information for 10% of the total number of candidates is missing. The values 33 and 51 refer to candidates whose income is known to be High and Low respectively but whose age is missing, whilst the values 20, 22 and 42 refer to candidates whose age is known to be below 30, between 30 and 55, and above 55 respectively but whose income status is missing. In Table 4.12(b), 20% of the candidates have missing information: 66 and 102 are the counts with income High and Low respectively but age missing, and 40, 44 and 84 the counts with age below 30, between 30 and 55, and above 55 respectively but income missing. In Table 4.12(c), 30% of the candidates have missing information: 100 and 152 are the counts with income High and Low respectively but age missing, and 58, 68 and 126 the counts with age below 30, between 30 and 55, and above 55 respectively but income missing.

(a) MAR with 10% data missing:

Income     <30    30-55   >55    Age missing
High       92     120     95     33
Low        138    143     168    51
Missing    20     22      42     84

(b) MAR with 20% data missing:

Income     <30    30-55   >55    Age missing
High       84     110     80     66
Low        126    131     141    102
Missing    40     44      84     168

(c) MAR with 30% data missing:

Income     <30    30-55   >55    Age missing
High       76     99      65     100
Low        116    118     114    152
Missing    58     68      126    252

Table 4.12: Artificial incomplete data for MAR.

Tables 4.13, 4.14 and 4.15 below show the values of the estimates at different steps of the algorithm for the three cases, together with the completed tables.

(a) Iteration of the EM algorithm:

r    $\pi_{11}^{(r)}$   $\pi_{12}^{(r)}$   $\pi_{13}^{(r)}$   $\pi_{21}^{(r)}$   $\pi_{22}^{(r)}$   $\pi_{23}^{(r)}$
0    0.121693   0.158730   0.125661   0.182540   0.189153   0.222222
1    0.118928   0.154694   0.130284   0.179302   0.185287   0.231506
2    0.118715   0.154382   0.130703   0.178966   0.184989   0.232340
∞    0.118698   0.154357   0.130741   0.178928   0.184850   0.232426

(b) Complete data obtained by the EM algorithm:

Income    <30    30-55   >55
High      100    130     110
Low       150    155     195

Table 4.13: MAR with 10% of the data missing.

(a) Iteration of the EM algorithm:

r    $\pi_{11}^{(r)}$   $\pi_{12}^{(r)}$   $\pi_{13}^{(r)}$   $\pi_{21}^{(r)}$   $\pi_{22}^{(r)}$   $\pi_{23}^{(r)}$
0    0.125000   0.163690   0.119048   0.187500   0.194940   0.209821
1    0.119279   0.155337   0.128648   0.180845   0.186994   0.228897
2    0.118471   0.154155   0.130237   0.179578   0.185513   0.232045
∞    0.118348   0.153970   0.130526   0.179284   0.185172   0.232700

(b) Complete data obtained by the EM algorithm:

Income    <30    30-55   >55
High      99     129     110
Low       151    156     195

Table 4.14: MAR with 20% of the data missing.

(a) Iteration of the EM algorithm:

r    $\pi_{11}^{(r)}$   $\pi_{12}^{(r)}$   $\pi_{13}^{(r)}$   $\pi_{21}^{(r)}$   $\pi_{22}^{(r)}$   $\pi_{23}^{(r)}$
0    0.129252   0.168367   0.110544   0.197279   0.200680   0.193878
1    0.119620   0.156843   0.126225   0.184715   0.189118   0.223479
2    0.117676   0.154558   0.129876   0.181509   0.186170   0.230211
∞    0.117200   0.154004   0.130940   0.180435   0.185165   0.232256

(b) Complete data obtained by the EM algorithm:

Income    <30    30-55   >55
High      98     129     110
Low       152    156     195

Table 4.15: MAR with 30% of the data missing.

4.1.3 Not Missing At Random (NMAR)

Table 4.16 below shows data missing in such a way that the proportion of missing values in the High income row is greater than in the rest of the table. In Table 4.16(a), the information for 10% of the total number of candidates is missing. The values 59 and 25 refer to candidates whose income is known to be High and Low respectively but whose age is missing, whilst the values 23, 32 and 29 refer to candidates whose age is known to be below 30, between 30 and 55, and above 55 respectively but whose income status is missing. In Table 4.16(b), 20% of the candidates have missing information: 118 and 50 are the counts with income High and Low respectively but age missing, and 50, 60 and 58 the counts with age below 30, between 30 and 55, and above 55 respectively but income missing. In Table 4.16(c), 30% of the candidates have missing information: 176 and 76 are the counts with income High and Low respectively but age missing, and 74, 91 and 87 the counts with age below 30, between 30 and 55, and above 55 respectively but income missing.

(a) NMAR with 10% data missing:

Income     <30    30-55   >55    Age missing
High       83     107     91     59
Low        144    146     185    25
Missing    23     32      29     84

(b) NMAR with 20% data missing:

Income     <30    30-55   >55    Age missing
High       65     85      72     118
Low        135    140     175    50
Missing    50     60      58     168
4.1.3 Not Missing At Random (NMAR)

Table 4.16 below shows data in which the proportion of missing values in the High income row is greater than in the rest of the table. In Table 4.16(a), 10% of the total number of candidates is missing. The values 59 and 25 refer to candidates whose income is High and Low respectively but whose age is missing. Meanwhile, the values 23, 32 and 29 refer to candidates whose age is below 30 years old, between 30 and 55 years old, and above 55 years old respectively, but whose income status is missing.

In Table 4.16(b), 20% of the total number of candidates is missing. The values 118 and 50 refer to candidates whose income is High and Low respectively but whose age is missing. Meanwhile, the values 50, 60 and 58 refer to candidates whose age is below 30 years old, between 30 and 55 years old, and above 55 years old respectively, but whose income status is missing.

In Table 4.16(c), 30% of the total number of candidates is missing. The values 176 and 76 refer to candidates whose income is High and Low respectively but whose age is missing. Meanwhile, the values 74, 91 and 87 refer to candidates whose age is below 30 years old, between 30 and 55 years old, and above 55 years old respectively, but whose income status is missing.

                        Age
Income       <30      30-55      >55    Missing
High          83        107       91         59
Low          144        146      185         25
Missing       23         32       29         84

(a) NMAR with 10% data missing.

                        Age
Income       <30      30-55      >55    Missing
High          65         85       72        118
Low          135        140      175         50
Missing       50         60       58        168

(b) NMAR with 20% data missing.

                        Age
Income       <30      30-55      >55    Missing
High          48         63       53        176
Low          128        131      165         76
Missing       74         91       87        252

(c) NMAR with 30% data missing.

Table 4.16: Artificial Incomplete Data for NMAR.

Tables 4.17, 4.18 and 4.19 below show the values of the estimates pij(r) at the different steps of the algorithm for each case.

r              0          1          2          ∞
p11(r)     0.109788   0.117789   0.118384   0.118403
p12(r)     0.141534   0.154762   0.156093   0.156287
p13(r)     0.120370   0.129511   0.130303   0.130365
p21(r)     0.190476   0.179837   0.179022   0.178963
p22(r)     0.193122   0.186310   0.185357   0.185214
p23(r)     0.244709   0.231791   0.230840   0.230766

(a) Iteration of the EM algorithm.

                        Age
Income       <30      30-55      >55
High          99        131      110
Low          150        156      194

(b) Complete data obtained by the EM algorithm for the 10% NMAR problem.

Table 4.17: NMAR with 10% of the data missing.

r              0          1          2          ∞
p11(r)     0.096726   0.114881   0.117907   0.118526
p12(r)     0.126488   0.151634   0.155844   0.156811
p13(r)     0.107143   0.126168   0.129216   0.129686
p21(r)     0.200893   0.182292   0.179245   0.178641
p22(r)     0.208333   0.191358   0.187727   0.186826
p23(r)     0.260417   0.233668   0.230061   0.229511

(a) Iteration of the EM algorithm.

                        Age
Income       <30      30-55      >55
High          99        132      109
Low          150        157      193

(b) Complete data obtained by the EM algorithm.

Table 4.18: NMAR with 20% of the data missing.

r              0          1          2          ∞
p11(r)     0.081633   0.109610   0.115906   0.117593
p12(r)     0.107143   0.146668   0.156005   0.159553
p13(r)     0.090136   0.119990   0.126600   0.128036
p21(r)     0.217687   0.187511   0.180905   0.179073
p22(r)     0.222789   0.197738   0.189872   0.186754
p23(r)     0.280612   0.238484   0.230712   0.228991

(a) Iteration of the EM algorithm.

                        Age
Income       <30      30-55      >55
High          99        134      108
Low          150        157      192

(b) Complete data obtained by the EM algorithm.

Table 4.19: NMAR with 30% of the data missing.

4.1.4 The Chi-Squared Test

In connection with the tables in the previous sections, the chi-squared test can be used to test the hypothesis that there is a relationship between an individual's age and his income. Consider the application of Theorem 3.3 to testing independence in these tables. Suppose we have the hypotheses

    H0 : An individual's age and the income he earned are independent.
    H1 : An individual's age and the income he earned are not independent.

From Table 4.5, the table of full data, we have r = 2 rows, c = 3 columns, and the marginal totals

    n1. = 100 + 130 + 110 = 340,        n2. = 150 + 155 + 195 = 500,
    n.1 = 100 + 150 = 250,   n.2 = 130 + 155 = 285,   n.3 = 110 + 195 = 305,
    n = 840.

The expected frequencies Eij = ni. n.j / n and the contributions (Oij − Eij)²/Eij are

Cell (i, j)    Oij    Eij = ni. n.j / n            (Oij − Eij)²/Eij
(1, 1)         100    340(250)/840 = 101.19048     0.01401
(1, 2)         130    340(285)/840 = 115.35714     1.85870
(1, 3)         110    340(305)/840 = 123.45238     1.46586
(2, 1)         150    500(250)/840 = 148.80952     0.00952
(2, 2)         155    500(285)/840 = 169.64286     1.26391
(2, 3)         195    500(305)/840 = 181.54762     0.99680

Table 4.20: χ² calculation for the full data.

    χ² = 0.01401 + 1.85870 + 1.46586 + 0.00952 + 1.26391 + 0.99680 = 5.60880.

The critical value at the 5% significance level is χ²(0.05, (r−1)(c−1)) = χ²(0.05, 2) = 5.991, and the rule is to reject H0 if the computed χ² is greater than 5.991. Since χ² = 5.60880 < 5.991, H0 is accepted. We can conclude that, for the case of full data, one's age and the income he earned are independent. The same calculation is repeated on the complete data obtained by the EM algorithm for all the cases, and the results are summarized in Table 4.21 below.
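Before turning to that summary, the arithmetic of Table 4.20 can be verified in a few lines. The following Python sketch, included for illustration only, computes the expected frequencies and the χ² statistic for the full data table.

    import numpy as np

    observed = np.array([[100, 130, 110],    # High income row
                         [150, 155, 195]])   # Low income row

    n = observed.sum()                                        # 840
    expected = (observed.sum(axis=1, keepdims=True) *
                observed.sum(axis=0, keepdims=True)) / n      # ni. * n.j / n

    chi2_stat = ((observed - expected) ** 2 / expected).sum()
    print(round(chi2_stat, 5))   # 5.6088, matching Table 4.20

The same calculation applied to each EM-completed table gives the "With EM" column of Table 4.21.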
Type of        Missing       Available Data           With EM
Missingness    Percentage    Value     Accept/Reject  Value     Accept/Reject
MCAR           10%           5.01459   Accept         5.6088    Accept
               20%           4.91984   Accept         6.0393    Reject
               30%           4.53351   Accept         6.4433    Reject
MAR            10%           4.97741   Accept         5.6088    Accept
               20%           4.33358   Accept         5.6393    Accept
               30%           3.69837   Accept         5.6696    Accept
NMAR           10%           4.96211   Accept         5.6411    Accept
               20%           3.87375   Accept         5.76997   Accept
               30%           3.44817   Accept         5.8016    Accept

Table 4.21: The χ² values for all cases.

From Table 4.21 above, we can clearly see a pattern for the mechanisms, i.e. MCAR, MAR and NMAR: as the percentage of missing data increases, the χ² values become larger. For all cases we expect the χ² values to be approximately equal to the value for the full data. From the calculation of χ² for the full data above, i.e. Table 4.20, we obtained χ² = 5.60880. Since H0 is accepted for the full data, we expect H0 to be accepted for all the incomplete data problems as well. By accepting H0 we conclude that there is significant evidence, at the 5% significance level, that the age of an individual does not have any influence on the income he earned.

For the MCAR case, when 10% of the data are missing we have χ² = 5.60880. This value is exactly the same as the χ² value of the full data; thus we accept H0. But when 20% and 30% of the data are missing, we have χ² = 6.0393 and χ² = 6.4433 respectively. Both values are greater than the critical value at the 5% significance level, χ²(0.05, 2) = 5.991. Therefore we reject H0 for both 20% and 30% MCAR. We can conclude that when 10% of the data are missing there is significant evidence at the 5% significance level that age does not influence income, but when the missing portion becomes larger the individual's age and the income he earned appear to be related. This does not correspond to the conclusion drawn from the full data.

For the MAR case, when only a small portion of the data, namely 10%, is missing, H0 is accepted since χ² = 5.6088 < χ²(0.05, 2) = 5.991. When 20% of the data are missing, H0 is still accepted since χ² = 5.6393 < 5.991. Even when a large portion of the data, i.e. 30%, is missing, we still accept H0 since χ² = 5.6696 < 5.991. Thus we can conclude that, at the 5% significance level, there is significant evidence that the age of an individual is not related to the income he earned, no matter how much data is missing.

For the NMAR case, when 10% and 20% of the data are missing we have χ² = 5.6411 and χ² = 5.76997 respectively. Compared with χ²(0.05, 2) = 5.991, both values are smaller, and therefore we accept H0. Also, when 30% of the data are missing, χ² = 5.8016, which is again less than the critical value at the 5% significance level, 5.991, so we accept H0. We can conclude that there is significant evidence at the 5% significance level that the age of an individual does not influence the income he earned, for all levels of missing data.
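As a quick cross-check of the decision rule used above, the critical value and the p-value of the full data statistic can be computed with SciPy; this snippet is illustrative and not part of the original analysis.

    from scipy.stats import chi2

    df = (2 - 1) * (3 - 1)            # (r - 1)(c - 1) = 2 degrees of freedom
    print(chi2.ppf(0.95, df))         # critical value, approximately 5.991
    print(chi2.sf(5.60880, df))       # p-value, approximately 0.061 > 0.05

Since the p-value exceeds 0.05, H0 is not rejected for the full data, in agreement with the conclusion drawn from Table 4.20.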
CHAPTER 5

CONCLUSION AND RECOMMENDATION

5.1 Conclusion

Missing data pose problems to practising statisticians in the sense that standard statistical methods cannot be used directly on an incomplete data set. In general, this study focuses on the performance of the EM algorithm under different types of missingness mechanisms and different levels of missing data.

The missingness mechanism, which in practice has to be assumed, is considered explicitly. This consideration is very useful in understanding the effect of the missingness mechanism on the observed sample, and in reasoning about why a missing data technique succeeds or fails in a particular situation.

For the purposes of this study, we found that for the MAR and NMAR missingness mechanisms the EM algorithm gives better recovery of the missing values than for the MCAR mechanism. Across the different levels of missingness, as the percentage of missing data becomes larger the χ² values become greater, and at the higher missing percentages the χ² value may become misleading. We can say that as the portion of missing values increases, the values recovered by the EM algorithm are no longer as good a substitute for the full data.

5.2 Recommendation

The main purpose of this study is not the interpretation of the data; it is concerned with how incomplete contingency tables are handled before the data are analysed. From the discussion in the previous chapter, the EM algorithm can be used when we are interested in the relationship between the variables. If further analysis of the incomplete data is preferred, the EM algorithm can also be used to estimate the missing values when the missing portion is larger, or when the missing data are present in only one cell of the contingency table. The contribution of this analysis of incomplete categorical data is therefore of benefit to the public, and it removes an obstacle for researchers in the future.