Project Summary & Key Readings

Imputation of Missing Values from Hierarchical US Census Microdata 1850-1880
Muhammad Aurangzeb Ahmad
Nupur Bhatnagar
Data Description:
The Integrated Public Use Microdata Series (IPUMS) consists of thirty-eight high-precision samples of the American population drawn from fifteen federal censuses and
from the American Community Surveys of 2000-2004. The thirty-eight samples, which
draw on every surviving census from 1850-2000, collectively comprise the richest source
of quantitative information on long-term changes in the American population. The
IPUMS assigns uniform codes across all the samples to facilitate analysis of social and
economic change.
Problem Statement:
The census microdata consists of a large number of variables collected as
responses to different population surveys. One such variable of interest is the RELATE
variable, which describes an individual's relationship to the head of
household or householder. From the 1880 United States census onwards, a question
regarding the relationship of every person in the household to every other person in the
household was asked. For the three previous censuses (1850, 1860 and 1870), however,
this variable was not part of the census survey. From the point of view of a
demographics researcher, this variable is therefore missing for those datasets.
The aim of the proposed project is to build probabilistic predictive models that
predict each person's relationship to the head of the household for the censuses
where this data is 'missing.' What follows is a list of suggested key readings for the project
and an outline of the various components of the project.
Project Components:
1. Study different supervised and unsupervised learning models in order to find the
probabilistic distribution best suited to the multivariate data set.
2. Build a classification model based on the readings. The model should be able
to predict missing values for household relationship.
3. Evaluate the predictive model with some statistical testing. This will involve
conferring the
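As a rough illustration of components 2 and 3, the following minimal sketch trains a categorical naive Bayes classifier on labelled household records and scores it on held-out records. Naive Bayes is only one candidate model, and the feature names (`sex`, `age_group`, `same_surname`), the relationship labels, and all data values are hypothetical and are not taken from IPUMS.

```python
import math
from collections import Counter, defaultdict


class CategoricalNaiveBayes:
    """Naive Bayes over categorical features with add-one smoothing."""

    def fit(self, rows, labels):
        self.label_counts = Counter(labels)
        self.value_counts = defaultdict(Counter)  # (label, feature) -> value counts
        self.feature_values = defaultdict(set)    # feature -> values seen in training
        for row, label in zip(rows, labels):
            for feature, value in row.items():
                self.value_counts[(label, feature)][value] += 1
                self.feature_values[feature].add(value)
        return self

    def predict_proba(self, row):
        total = sum(self.label_counts.values())
        log_scores = {}
        for label, count in self.label_counts.items():
            score = math.log(count / total)  # log prior
            for feature, value in row.items():
                seen = self.value_counts[(label, feature)]
                n_values = len(self.feature_values[feature]) or 1
                score += math.log((seen[value] + 1) / (count + n_values))
            log_scores[label] = score
        # Normalise in a numerically stable way.
        peak = max(log_scores.values())
        weights = {label: math.exp(s - peak) for label, s in log_scores.items()}
        norm = sum(weights.values())
        return {label: w / norm for label, w in weights.items()}

    def predict(self, row):
        probs = self.predict_proba(row)
        return max(probs, key=probs.get)


def accuracy(model, rows, labels):
    """Share of rows whose predicted label matches the known label."""
    hits = sum(model.predict(row) == label for row, label in zip(rows, labels))
    return hits / len(rows)


# Hypothetical labelled records (standing in for a census where RELATE is known).
train_rows = [
    {"sex": "F", "age_group": "adult", "same_surname": True},
    {"sex": "M", "age_group": "child", "same_surname": True},
    {"sex": "F", "age_group": "child", "same_surname": True},
    {"sex": "M", "age_group": "adult", "same_surname": False},
    {"sex": "F", "age_group": "adult", "same_surname": False},
]
train_labels = ["spouse", "child", "child", "boarder", "boarder"]

# Hypothetical held-out records used only for evaluation.
holdout_rows = [
    {"sex": "M", "age_group": "child", "same_surname": True},
    {"sex": "M", "age_group": "adult", "same_surname": False},
]
holdout_labels = ["child", "boarder"]

model = CategoricalNaiveBayes().fit(train_rows, train_labels)
holdout_accuracy = accuracy(model, holdout_rows, holdout_labels)
```

In the project setting, the labelled rows would come from a census that records the relationship variable, and the fitted model would then be applied to records where it is absent.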
Summary of Key Readings
Nordbotten, S. (1996): Neural Network Imputation Applied to the Norwegian 1990
Census Data. Journal of Official Statistics, Vol. 12, No.4, pp. 385-401.
The paper describes the use of neural networks for imputing individual values for survey
attributes utilizing the available administrative data. The 1990 census data for Norway is
used for imputation.
Nordbotten, S. (1996): Editing and Imputation by Means of Neural Networks.
Statistical Journal of UN/ECE, Vol.13, No.2 , pp. 119-129.
The author describes the role and importance of imputation in census data and why
imputation is considered to be an expensive part of censuses and surveys. Applications of
neural networks to imputation are also discussed.
Nordbotten, S. (1998): Estimating Population Proportions from Imputed Data.
Computational Statistics & Data Analysis, Vol. 27, 1998, pp. 291-309.
An "impute first - aggregate next" approach is taken for the Norwegian census. The
imputation estimates are compared with simple unbiased estimates obtained by the
traditional "aggregate first - estimate next" approach. The former approach is found to
be better than the latter. The author also proposes predictors for the
accuracy of such imputation estimates.
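The contrast between the two approaches can be sketched on a toy population. This is only an illustration under simple assumptions, not the paper's procedure (which, per the entries above, uses neural networks): here a binary attribute is observed for a sample, an auxiliary group is known for everyone, and unsampled units are imputed with the sample rate of their group. All names and data are hypothetical.

```python
from collections import defaultdict


def sample_proportion(sample_values):
    """Aggregate first: estimate the proportion from the sampled units only."""
    return sum(sample_values.values()) / len(sample_values)


def impute_then_aggregate(population_groups, sample_values):
    """Impute first: fill in every unsampled unit, then average over everyone."""
    overall = sample_proportion(sample_values)
    hits = defaultdict(int)
    sizes = defaultdict(int)
    for unit, value in sample_values.items():
        group = population_groups[unit]
        hits[group] += value
        sizes[group] += 1
    total = 0.0
    for unit, group in population_groups.items():
        if unit in sample_values:
            total += sample_values[unit]        # observed value
        elif sizes[group]:
            total += hits[group] / sizes[group]  # imputed from the unit's group
        else:
            total += overall                     # no sampled unit in this group
    return total / len(population_groups)


# Hypothetical population: unit id -> auxiliary group known for every unit.
population_groups = {1: "a", 2: "a", 3: "a", 4: "b", 5: "b", 6: "b", 7: "b", 8: "b"}
# Hypothetical sample: unit id -> observed binary attribute.
sample_values = {1: 1, 2: 1, 4: 0, 5: 0}

aggregate_first = sample_proportion(sample_values)
impute_first = impute_then_aggregate(population_groups, sample_values)
```

The two estimates differ here because the sample over-represents group "a" relative to the population, which the imputation step corrects for by using the auxiliary information.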
United Nations Statistical Commission and Economic Commission
Working Paper No. 15 (2003)
The paper describes the statistical disclosure limitation techniques used for the US
Census 2000 tabular data. Many of these tables are published for very small geographic
areas. The paper covers procedures for short form tables, long form tables, special tabulations,
and an online query system for tables.
Analysis of Incomplete Multivariate Data
By J. L. Schafer
This book provides approaches to the analysis of multivariate data with missing values.
The first topic is likelihood-based inference with incomplete data, in particular the EM
algorithm; the second concerns Markov chain Monte Carlo techniques. It gives methods of
statistical inference from multivariate datasets.
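To make the EM idea concrete, here is a minimal sketch for a two-way table of categorical counts in which some records have only the row variable observed. The setup and the function name `em_two_way` are illustrative assumptions and are not taken from the book. The E-step spreads each partially observed record over the columns in proportion to the current cell probabilities; the M-step re-estimates the probabilities from the completed counts.

```python
def em_two_way(full, row_only, iterations=200):
    """Estimate cell probabilities of a two-way table by EM.

    full[i][j]  : count of records with both variables observed
    row_only[i] : count of records where only the row variable is observed
    Assumes every row has at least one fully observed record.
    """
    n_rows, n_cols = len(full), len(full[0])
    total = sum(sum(row) for row in full) + sum(row_only)
    theta = [[1.0 / (n_rows * n_cols)] * n_cols for _ in range(n_rows)]
    for _ in range(iterations):
        # E-step: expected complete-data counts given the current theta.
        expected = []
        for i in range(n_rows):
            row_prob = sum(theta[i])
            expected.append([
                full[i][j] + row_only[i] * theta[i][j] / row_prob
                for j in range(n_cols)
            ])
        # M-step: maximum likelihood estimate from the expected counts.
        theta = [[expected[i][j] / total for j in range(n_cols)]
                 for i in range(n_rows)]
    return theta


# Hypothetical counts: 100 fully observed records, 20 with the column missing.
full_counts = [[30, 10], [20, 40]]
row_only_counts = [20, 0]
theta_hat = em_two_way(full_counts, row_only_counts)
```

For this simple missingness pattern the estimate also has a closed form (row share times the within-row share among fully observed records), which gives a convenient check on the iteration.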
Imputation of Missing values when the probability of response depends on the value
being imputed
JS Greenlees, WS Reece, KD Zieschang - Journal of the American Statistical Association
http://www.jstor.org/view/01621459/di985952/98p0573d/0?frame=noframe&userID=80654f68@umn.edu/01cce4403700501cadbcb&dpi=3&config=jstor
This paper views the missing value problem as one of parameter estimation in a
regression model with stochastic censoring of the dependent variable. The
authors discuss a “hot deck” approach, which consists of imputing a randomly selected
cell respondent’s value to each nonrespondent. For validation, the paper uses Current
Population Survey data, imputing the income of nonrespondents.
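The hot deck step described above can be sketched as follows. This covers only the random donor selection within a cell, not the regression model with stochastic censoring; the function name, field names, and data are hypothetical.

```python
import random
from collections import defaultdict


def hot_deck_impute(records, cell_key, target, seed=0):
    """Fill missing `target` values with a random respondent's value from the same cell.

    Returns new records; the input list is left unchanged. Records in cells
    without any respondent keep their missing value (None).
    """
    rng = random.Random(seed)
    donors = defaultdict(list)
    for record in records:
        if record[target] is not None:
            donors[record[cell_key]].append(record[target])
    completed = []
    for record in records:
        filled = dict(record)
        cell_donors = donors[filled[cell_key]]
        if filled[target] is None and cell_donors:
            filled[target] = rng.choice(cell_donors)
        completed.append(filled)
    return completed


# Hypothetical survey records; None marks a nonrespondent.
records = [
    {"age_group": "young", "income": 20000},
    {"age_group": "young", "income": None},
    {"age_group": "old", "income": 50000},
    {"age_group": "old", "income": 55000},
    {"age_group": "old", "income": None},
    {"age_group": "middle", "income": None},
]
completed = hot_deck_impute(records, "age_group", "income")
```

Because donors are drawn at random within a cell, the imputed values preserve the within-cell distribution of respondents, which is only appropriate if nonrespondents resemble respondents in the same cell.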
The treatment of missing values and its effect in the classifier accuracy:
E Acuna, C Rodriguez - Classification, Clustering and Data Mining Applications, 2004
http://academic.uprm.edu/~eacuna/IFCS04r.pdf
In this paper the authors use a predictive model based on the KNN algorithm to impute
missing values. In this method the missing values of an instance are imputed by considering a
given number of instances that are most similar to the instance of interest. The similarity of
two instances is determined using a distance function. The authors evaluate their methods using
twelve data sets from the Machine Learning Repository at the University of
California, Irvine.
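A minimal sketch of KNN imputation for a categorical attribute, assuming numeric features, Euclidean distance, and a majority vote among the neighbours. These are illustrative choices, not necessarily the ones used in the paper, and the field names and data are hypothetical.

```python
import math
from collections import Counter


def knn_impute(complete_rows, query, target, k=3):
    """Impute `target` for `query` by majority vote among its k nearest complete rows."""
    features = [name for name in query if name != target]

    def distance(a, b):
        # Euclidean distance over the shared numeric features.
        return math.sqrt(sum((a[f] - b[f]) ** 2 for f in features))

    neighbours = sorted(complete_rows, key=lambda row: distance(row, query))[:k]
    votes = Counter(row[target] for row in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical complete instances.
complete_rows = [
    {"age": 25, "household_size": 2, "relate": "spouse"},
    {"age": 27, "household_size": 2, "relate": "spouse"},
    {"age": 6, "household_size": 5, "relate": "child"},
    {"age": 8, "household_size": 5, "relate": "child"},
    {"age": 10, "household_size": 4, "relate": "child"},
]
# Instance of interest with a missing value for `relate`.
query = {"age": 7, "household_size": 5, "relate": None}
imputed = knn_impute(complete_rows, query, "relate", k=3)
```

In practice the features would need comparable scales (for example through standardisation) before distances are computed, since otherwise the attribute with the largest range dominates.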
Wikipedia
Wikipedia gives a head start for getting acquainted with different multivariate models.
It discusses:
• Regression analysis: types of regression such as simple and multiple linear
regression, nonlinear regression and other models. It also discusses some aspects
of Bayesian statistics.
• Principal component analysis: a technique to simplify a dataset by reducing a
multidimensional dataset to lower dimensions for analysis. This is useful for our
project since the population data has many attributes and feature selection is one
of the major tasks.
• Linear discriminant analysis: finds the linear combination of features which
best separates two or more classes of object or event. The resulting combination
may be used as a linear classifier or, more commonly, for dimensionality reduction
before later classification.
• Logistic regression: a statistical regression model widely used in machine
learning for binary dependent variables.
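Of the models listed above, logistic regression is the most compact to sketch. The example below fits a binary logistic regression by plain batch gradient descent on the log-loss; the learning rate, epoch count, and toy data are arbitrary assumptions chosen for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, x):
    """Probability that the binary dependent variable equals 1 for features x."""
    return sigmoid(bias + sum(w * v for w, v in zip(weights, x)))


def fit_logistic(xs, ys, learning_rate=0.1, epochs=2000):
    """Fit weights and bias by batch gradient descent on the mean log-loss."""
    n_features = len(xs[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(xs, ys):
            error = predict_proba(weights, bias, x) - y
            for j, value in enumerate(x):
                grad_w[j] += error * value
            grad_b += error
        weights = [w - learning_rate * g / len(xs) for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / len(xs)
    return weights, bias


# Hypothetical one-feature data: small values belong to class 0, large to class 1.
xs = [[0.1], [0.2], [0.3], [0.7], [0.8], [0.9]]
ys = [0, 0, 0, 1, 1, 1]
weights, bias = fit_logistic(xs, ys)
low = predict_proba(weights, bias, [0.2])
high = predict_proba(weights, bias, [0.8])
```

Since the relationship variable has more than two categories, a binary model like this would need an extension such as one-versus-rest or multinomial logistic regression to be applied to it.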
Approximate Association Rule Mining:
Jyothsna R. Nayak and Diane J. Cook
http://ranger.uta.edu/~cook/pubs/flairsj01.pdf
This paper extends association rule mining, in which potentially important discoveries
may be ignored because of small variations in the data. It develops an
approximate association rule mining algorithm that searches for approximate rules. This
approach is useful in processing missing data, which contributes probabilistically to the
support of a possibly matching pattern.
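One way to read "contributes probabilistically to the support" is sketched below. This is not Nayak and Cook's algorithm; it is an assumed simplification in which a record with a missing attribute counts towards a pattern with a weight equal to the observed relative frequency of the required value. Function names and data are hypothetical.

```python
from collections import Counter


def value_frequencies(records, attribute):
    """Relative frequency of each observed (non-missing) value of an attribute."""
    observed = [r[attribute] for r in records if r.get(attribute) is not None]
    if not observed:
        return {}
    counts = Counter(observed)
    return {value: count / len(observed) for value, count in counts.items()}


def approximate_support(records, pattern):
    """Support of `pattern`, letting records with missing values match fractionally."""
    frequencies = {attr: value_frequencies(records, attr) for attr in pattern}
    total = 0.0
    for record in records:
        weight = 1.0
        for attribute, wanted in pattern.items():
            actual = record.get(attribute)
            if actual is None:
                # Missing value: match with the probability of the wanted value.
                weight *= frequencies[attribute].get(wanted, 0.0)
            elif actual != wanted:
                weight = 0.0
                break
        total += weight
    return total / len(records)


# Hypothetical records; None marks a missing value.
records = [
    {"sex": "F", "relate": "spouse"},
    {"sex": "F", "relate": None},
    {"sex": "M", "relate": "child"},
    {"sex": "F", "relate": "spouse"},
]
support = approximate_support(records, {"sex": "F", "relate": "spouse"})
```

With exact matching the second record would contribute nothing and the support would be 2/4; here it contributes 2/3, the share of "spouse" among observed values.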
Summary of Search Results
Google Scholar
Keywords: imputation census
Results 6820
CiteSeer
Keywords: imputation census
Results 1
DBLP
Keywords: imputation census
Results 0