chap1 - National Bureau of Economic Research

IPUMS-98 VOLUME 1
USER’S GUIDE
IPUMS Design - Introduction
Page 1.1.1
Chapter 1
INTRODUCTION
This document describes the Integrated Public Use Microdata Series
(IPUMS-98), created at the University of Minnesota in November 1997.
The IPUMS consists of twenty-five high-precision samples of the
American population drawn from thirteen federal censuses. Some of these
samples have existed for years, and others were created specifically for
this database. The twenty-five samples, which span the censuses of 1850
to 1990, collectively constitute our richest source of quantitative
information on long-term changes in the American population. However,
because different investigators created these samples at different times,
they employed a wide variety of record layouts, coding schemes, and
documentation. This has complicated efforts to use them to study change
over time. The IPUMS assigns uniform codes across all the samples and
brings relevant documentation into a coherent form to facilitate analysis of
social and economic change.
Citation and Use
All persons are granted a limited license to use and distribute this
documentation and the accompanying data, subject to the following
conditions:
1. No fee may be charged for use or distribution.
2. Publications and research reports based on the database must cite it
appropriately. The citation should include the following:
Steven Ruggles and Matthew Sobek
Integrated Public Use Microdata Series: Version 2.0
Minneapolis: Minnesota Historical Census Projects,
University of Minnesota, 1997
In addition, we request that users send us a copy of any publications,
research reports, or educational material making use of the data or
documentation. Printed matter should be sent to:
IPUMS
Historical Census Projects
Department of History
University of Minnesota
614 Social Sciences
267 19th Avenue South
Minneapolis, MN 55455
Electronic material should be sent to:
ipums@hist.umn.edu
Universe
The Integrated Public Use Microdata Series consists of a series of
compatible-format individual-level representative samples of the United
States for the years 1850, 1860, 1870, 1880, 1900, 1910, 1920, 1940,
1950, 1960, 1970, 1980, and 1990. Each of these samples is
independent; it is not possible to trace individuals from one census year
to the next. The data represent all persons in states, territories that
eventually became states, and the District of Columbia, with the
following exceptions:
1. The 1850 and 1860 samples exclude the slave population.
2. The pre-1890 samples exclude “Indians not taxed.”
3. The pre-1960 samples, except 1900 and 1910, exclude
Alaska and Hawaii.
In addition, for the period since 1970 the data series includes unoccupied
housing units.
Date of enumeration varied by census year, as described below:
1850-1900:  June 1
1910:       April 15
1920:       January 1
1930-1990:  April 1
Subject Content
The data series includes information on a broad range of population
characteristics, including fertility, nuptiality, life-course transitions,
immigration, internal migration, labor-force participation, occupational
structure, education, ethnicity, and household composition. The
information available in each sample varies according to the questions
asked in that year and by differences in post-enumeration processing. In
general, the later census years provide a greater range of characteristics
than the earlier ones, though the earlier censuses often contain greater
detail for the variables that are available. A full listing of available
variables can be found in the section entitled “Variable Availability and
Record Layout.” For the period since 1960, the data series also provides
detailed housing characteristics.
Organization of Documentation
The IPUMS documentation is divided into five volumes. Volume 1:
User’s Guide covers the minimum information most users require to use the
database, including the overall design of the database, variable
descriptions, and coding schemes for each variable. Volume 2: User’s
Guide Supplement includes supplemental information on some of the
more complicated variables, such as maps and coding schemes for the
geographic areas identified in the samples, and alternate occupation and
industry coding schemes available for particular census years. Volume
3: Counting the Past provides enumerator instructions, replicas of the
census forms, procedural histories of the censuses, and descriptions of
missing-data allocation procedures and other post-enumeration
processing. Volumes 4 and 5 consist of complete descriptions of
all the data transformations we carried out to create the IPUMS.
This introduction to the database is essential reading for all
prospective users. Additional chapters in this volume cover sample
designs, sampling errors, occupational coding, and family
interrelationship codes. We then provide a guide to the availability of
variables in each IPUMS sample, variable descriptions, coding schemes,
and marginal frequencies. Finally, we briefly describe the missing and
inconsistent data allocation and data quality flags.
Uses of Microdata
Most population data, especially historical census data, has
traditionally been available only in aggregated tabular form. The
IPUMS is microdata, which means that it provides information about
individual persons and households. This makes it possible for
researchers to create tabulations tailored to their particular questions.
Since the IPUMS includes nearly all the detail originally recorded by the
census enumerations, users can construct a great variety of tabulations
interrelating any desired set of variables. The flexibility offered by
microdata is particularly important for historical research because the
aggregate tabulations produced by the Census Bureau are often not
comparable across time, and until recently the subject coverage of
census publications was limited.
Microdata does pose some limitations, however. Most important,
for the period since 1940 census microdata are subject to strict
confidentiality measures that limit their usefulness for some
applications. The IPUMS samples for these years include no names,
addresses or other potentially identifying information. To further ensure
that no individuals can be identified, the Census Bureau limits the detail
on place of residence, place of work, very high incomes, and several
other variables. Most important, the microdata records for the period
since 1940 identify no geographic areas with fewer than 100,000
inhabitants (250,000 in 1960 and 1970). Therefore the IPUMS is
inappropriate for research that requires the identification of specific
small geographic areas in those census years.
Sample Availability
The IPUMS samples for the years since 1960 were originally created
by the Census Bureau as part of each decennial enumeration; the
samples for earlier years were created by a variety of individual
researchers with funding from the National Institutes of Health and the
National Science Foundation. Table 1 lists the samples now available as
part of the IPUMS as well as those which will be added to it in the near
future, including samples for 1860 and 1870 and an enlarged sample for
1900. We hope eventually to add an additional sample for 1930.
Table 1
IPUMS File Characteristics

                                                                                                Release dates    Sample    Records (thousands)
Filename IPUMS Sample Name                    Original Sample Name     Principal Investigators  IPUMS  Original  Density   HH    Person
ip18501  1850 sample                                                   Menard/Ruggles           1995   1994      1 in 100  37    198
ip18601  1860 sample                                                   Ruggles                  1998   1998      1 in 500  TBA   TBA
ip18701  1870 sample                                                   Ruggles                  1998   1998      1 in 500  TBA   TBA
ip18801  1880 sample                                                   Ruggles/Menard           1995   1994      1 in 100  107   503
ip19001  1900 sample                                                   Preston                  1993   1980      1 in 760  27    100
ip19002  1900 sample–enlarged version                                  Ruggles                  2002   2002      1 in 100  205   760
ip19101  1910 unweighted sample                                        Preston                  1993   1989      1 in 250  89    366
ip19102  1910 sample combined w/ oversamples                           Preston et al.           1998   1990      varies    5     24
ip19103  1910 Hispanic oversample                                      Gutmann/Ruggles          1998   1996      varies    22    96
ip19201  1920 sample                                                   Ruggles/Menard           1997   1997      1 in 200  129   521
ip19401  1940 sample                                                   Winsborough et al.       1993   1984      1 in 100  391   1351
ip19501  1950 sample                                                   Winsborough et al.       1993   1984      1 in 100  461   1922
ip19601  1960 sample                                                   Census Bureau            1993   1971      1 in 100  579   1780
ip19701  1970 Form 1 State sample             5% State sample          Census Bureau            1993   1972      1 in 100  744   2030
ip19702  1970 Form 2 State sample             15% State sample         Census Bureau            1995   1972      1 in 100  744   2030
ip19703  1970 Form 1 Metro sample             5% County group sample   Census Bureau            1995   1972      1 in 100  744   2030
ip19704  1970 Form 2 Metro sample             15% County group sample  Census Bureau            1995   1972      1 in 100  744   2030
ip19705  1970 Form 1 Neighborhood sample      5% Neighborhood sample   Census Bureau            1998   1972      1 in 100  744   2030
ip19706  1970 Form 2 Neighborhood sample      15% Neighborhood sample  Census Bureau            1998   1972      1 in 100  744   2030
ip19801  1980 State sample                    A sample                 Census Bureau            1996   1983      1 in 20   4711  11337
ip19802  1980 Metro sample                    B sample                 Census Bureau            1993   1983      1 in 100  942   2267
ip19803  1980 Urban/rural sample              C sample                 Census Bureau            1998   1983      1 in 100  942   2267
ip19901  1990 State sample                    5% sample                Census Bureau            1996   1992      1 in 20   5528  12500
ip19902  1990 Metro sample                    1% sample                Census Bureau            1993   1992      1 in 100  1106  2500
ip19903  1990 Elderly sample                  3% Elderly sample        Census Bureau            1999   1993      1 in 33   n.a.  n.a.
ip19904  1990 Flat State sample               1% Unweighted sample     Census Bureau            1998   1995      1 in 100  1106  2500

Note: the Original Sample Name column also lists three entries for pre-1970
samples (Preliminary version 1, Preliminary version 1, Preliminary version 2);
the rows they belong to cannot be determined from this copy of the table.
Users should be aware that the 1920 sample is currently in
preliminary form. The sample will be expanded before the final version
is released in late 1998.
File Structure
Most users will access the data through our on-line data extraction
system, available at our web site (http://www.ipums.umn.edu). This
system allows users to select a subset of cases and variables for analysis,
so they do not have to cope with multiple gigabytes of data. The
extraction system creates a record layout tailored to the needs of each
user. For users who want to use the raw data or who plan complicated
data manipulations, however, it is essential to understand the raw
IPUMS record layout and file structure.
The raw data files for each IPUMS sample consist of a series of
334-character logical records. These records are of two types: household
records and person records. Household records, identified with an “H”
in the first column, contain information pertaining to an entire
household or group quarters residence. Each household record is
followed by a series of person records, identified by a “P” in the first
column, which contain information about each sampled individual in
the unit.
In the text of this document, variables are identified according to
their record type and column location. For example, the variable AGE is
identified as P52-54 because it is located in columns 52 through 54 of
the person record.
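The column notation can be illustrated with a minimal sketch in Python. It assumes the raw file is read as text lines of 334 characters and that column numbers are 1-based and inclusive, as in "P52-54". The helper names `field` and `parse_person` and the sample values are hypothetical and are not part of the IPUMS distribution.

```python
def field(record: str, first: int, last: int) -> str:
    """Return columns first..last (1-based, inclusive) of a fixed-width record."""
    return record[first - 1:last]


def parse_person(record: str) -> dict:
    """Extract a few person-record fields as raw strings (no decoding applied)."""
    if field(record, 1, 1) != "P":
        raise ValueError("not a person record")
    return {
        "RECTYPE": field(record, 1, 1),
        "RELATE": field(record, 48, 51),  # P48-51
        "AGE": field(record, 52, 54),     # P52-54
    }


# Build a made-up 334-character person record; the field values are invented.
chars = list("P" + " " * 333)
chars[47:51] = "0101"  # hypothetical RELATE value in columns 48-51
chars[51:54] = "035"   # hypothetical AGE value in columns 52-54
sample = "".join(chars)

parsed = parse_person(sample)
print(parsed)
```

The fields are kept as strings because missing-data values may not be numeric; conversion is left to the analysis step.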
The IPUMS record layout stresses column-comparability rather than
compactness. In general, all variables available across multiple census
years appear in the same columns in every year. When a variable is not
available for a given year, the columns are either filled with a missing data
value or, in some cases, with another variable available for different years.
Within households, individuals are generally sequenced according to
their household relationship codes (variable RELATE, P48-51).
Household heads (or householders after 1970) appear first. They are
ordinarily followed by a spouse (if any), children (in descending age
order), other relatives, and non-relatives. In the pre-1960 census years
there are a few exceptions to this sequence, since the original order of
enumeration is preserved and not all enumerators followed the
prescribed sequence.
Most analyses of IPUMS data require manipulation of both
household and person records. Each record contains a serial number
which links the persons in the housing unit to the appropriate household
record. Many statistical software packages, such as SAS, SPSS, Stata,
and BMDP, will handle the IPUMS datasets easily. They offer various
methods to rectangularize the hierarchical file structure by attaching
household information to each person record. In addition, constructed
variables in the IPUMS make it easy to attach characteristics of
household and family members, such as spouses, children, and parents.
Use of these variables is discussed in Chapter 5, “Family
Interrelationships” in this volume.
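For users working outside those packages, rectangularizing can be sketched in a few lines of Python. This sketch relies only on the file order described above (each household record is followed by its person records) rather than on the serial number, whose column position is not given in this chapter. The function name `rectangularize` and the short placeholder records are hypothetical.

```python
from typing import Iterable, Iterator, Tuple


def rectangularize(records: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (household_record, person_record) pairs from a hierarchical file."""
    household = None
    for record in records:
        rectype = record[:1]  # column 1 holds RECTYPE: "H" or "P"
        if rectype == "H":
            household = record
        elif rectype == "P":
            if household is None:
                raise ValueError("person record appears before any household record")
            yield household, record
        else:
            raise ValueError(f"unknown record type {rectype!r}")


# Placeholder records for illustration; real records are 334 characters long.
lines = ["H-unit-A", "P-person-1", "P-person-2", "H-unit-B", "P-person-3"]
pairs = list(rectangularize(lines))
print(pairs)
```

Each pair can then be sliced by column position to build one flat row per person.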
In general, the variables in the IPUMS are coded into numeric
categories, but there are several exceptions. The following variables
contain alphabetic characters:
H1         RECTYPE   Record Type (all samples)
H168-199   STREET    Street Address (1880 and 1920 only)
P1         RECTYPE   Record Type (all samples)
P219-234   NAMELAST  Name, Last (1850, 1880 and 1920 only)
P235-250   NAMEFRST  Name, First (1850, 1880 and 1920 only)
The files identified as available in Table 1 may be downloaded from
the internet via our World Wide Web page (http://www.ipums.umn.edu)
or our anonymous FTP site (ftp.ipums.umn.edu) and are also available
through our interactive data-extraction system. For further details, see
our web site.
Sample-line Characteristics in 1940 and 1950
Since 1940, the census has asked an extended set of questions of a
sample of the population. In the first two years that the census used
sampling, it was done on an individual basis: individuals who fell on a
designated “sample line” of the census enumeration form were asked an
extra set of questions (the enumeration forms are reproduced in Volume
3: Counting the Past). The 1940 and 1950 samples were constructed to
ensure that each household contains one of these sample-line
individuals. In the IPUMS, these variables are coded as “not applicable”
for all individuals except the sample-line individual. Thus, users will
find that for many variables in 1940 and 1950 the number of available
cases is limited. The sample-line format of the census imposes some
limitations on the kinds of analysis that can be carried out. For example,
one might want to find out if native-born persons with foreign-born
parents tended to marry other persons of the same ethnicity. Since the
variables on parental birthplaces in 1940 and 1950 are on the sample
line, and only one sample-line person is included in each household, it is
impossible to compare the parental birthplaces of husbands and wives in
these years.
In the censuses of 1960 through 1990, the Census Bureau also
employed sampling, but it was carried out on a household basis: a
portion of households received “long form” questionnaires with an
additional set of questions to be filled out for each member of the
household. Since the public use files were constructed entirely from
these long forms, most sample items are available for all individuals in
the IPUMS; the few exceptions are noted in the section of this volume
entitled “Variable Availability and Record Layout.”
Sample Weights
With the exception of the samples for 1860, 1870, 1940, 1950, 1990,
and the 1910 oversamples of Blacks and Hispanics, the IPUMS samples
are unweighted “flat” samples. This means that each
observation, whether a household or an individual, represents a fixed
number of persons in the general U.S. population. That number is
roughly the inverse of the sample density shown in Table 1. For
example, each individual in the samples for 1880, 1940, and 1960
represents approximately 100 persons in the general population, and
each individual in the original 1900 sample represents approximately
760 persons in the general population.
The 1860, 1870, 1940, 1950, and 1990 samples are weighted. This
means that persons with some characteristics are over-represented in the
samples, while others are underrepresented. In the 1860 and 1870
samples, households containing any black person were sampled at twice
the density of other households in order to allow more in-depth analysis
of the black population. In the case of 1940 and 1950, a flat sample was
impossible because of the complexities in sample design necessitated by
the sample-line individuals. In the case of 1990, the Census Bureau
opted against a flat sample design in order to maximize precision for
persons residing in small localities. In all three cases, users must take an
additional step if they want to obtain statistics that are representative of
the general population.
To obtain representative statistics, users have two basic options: they
can either apply sample weights or select a representative unweighted
subsample of the data.
To apply sample weights to an IPUMS file, users should follow one
of the following procedures:
1. For household-level analyses in 1860, 1870, or 1990, or
household-level analyses in 1940 or 1950 that do not involve sample-line
characteristics, weight the analysis by the variable HHWT (H18-21).
HHWT gives the number of households in the general population
represented by each household in the sample.
2. For person-level analyses in 1860, 1870, or 1990, or person-level
analyses in 1940 or 1950 that do not involve sample-line
characteristics, apply the variable PERWT (P20-23). PERWT gives
the number of individuals in the general population represented by
each individual in the sample.
3. For any analyses in 1940 or 1950 involving sample-line
characteristics, apply the variable SLWT (P16-19). SLWT gives the
number of individuals in the general population represented by each
sample-line individual.
The variables HHWT, PERWT, and SLWT also exist for the
unweighted IPUMS samples, but they are simply the overall ratio of the
full population count to the number of cases in the sample. For example,
the weights in 1880 average 99.74, and all those in 1900 average 758.94.
Thus, applying the weights to the unweighted samples will inflate case
counts to correspond to the entire population.
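A person-level weighted tabulation might look like the following sketch. It assumes PERWT occupies columns 20-23 of the person record, as stated above, and that the field holds a plain integer with no implied decimal places; that storage format is an assumption and should be checked against the variable description. The helper names and the three sample records are invented for illustration.

```python
def perwt(person_record: str) -> int:
    """Read PERWT from columns 20-23 (assumed here to be a plain integer)."""
    return int(person_record[19:23])


def age(person_record: str) -> int:
    """Read AGE from columns 52-54."""
    return int(person_record[51:54])


def weighted_count(person_records, predicate) -> int:
    """Sum the weights of all person records that satisfy the predicate."""
    return sum(perwt(r) for r in person_records if predicate(r))


def make_person(weight: int, years: int) -> str:
    """Build a made-up 334-character person record for this example."""
    chars = list("P" + " " * 333)
    chars[19:23] = f"{weight:04d}"
    chars[51:54] = f"{years:03d}"
    return "".join(chars)


people = [make_person(100, 30), make_person(250, 70), make_person(50, 68)]

aged_65_plus = weighted_count(people, lambda r: age(r) >= 65)
total = weighted_count(people, lambda r: True)
share = aged_65_plus / total
print(aged_65_plus, total, share)
```

The same pattern applies to HHWT on household records and to SLWT for sample-line characteristics, with the column positions changed accordingly.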
To select an unweighted subsample of an IPUMS file, users should
follow one of the following procedures:
1. For any analyses of 1940, 1950, or 1990 that do not involve
sample-line characteristics, select entire households with a value of “2” for
the variable SELFWTHH (H30).
2. For any analyses of 1940 or 1950 involving sample-line
characteristics, select individuals with a value of “2” for the variable
SELFWTSL (P24).
The unweighted subsamples have the disadvantage that the number
of cases may be sharply reduced, but for some applications they are
easier to work with than the weighted samples. To make things simpler
for users of the 1990 sample, we have created an unweighted 1-in-100
extract of the 1-in-20 weighted sample. This version (ip19904) requires
no weights, and the values of HHWT and PERWT are always 100.
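Selecting entire households by SELFWTHH can be sketched as a filter over the raw records. The sketch assumes SELFWTHH sits in column 30 of the household record, as stated above, and that each household record precedes its person records. The function name and the value "1" used for a non-selected household are hypothetical; only the value "2" is documented here.

```python
def self_weighting_subsample(records):
    """Yield household records with SELFWTHH == "2" and their person records."""
    keep = False
    for record in records:
        if record[:1] == "H":
            keep = record[29:30] == "2"  # SELFWTHH, column H30
        if keep:
            yield record


def make_household(selfwthh: str) -> str:
    """Build a made-up 334-character household record for this example."""
    chars = list("H" + " " * 333)
    chars[29] = selfwthh
    return "".join(chars)


person = "P" + " " * 333
records = [
    make_household("2"), person,          # selected household with one person
    make_household("1"), person, person,  # hypothetical non-selected household
]
subsample = list(self_weighting_subsample(records))
print(len(subsample))
```

Person records are kept or dropped together with their household, so household composition is preserved in the subsample.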
We plan to add oversamples of the Black and Hispanic populations
in 1910 (ip19102 and ip19103). When we add them, we plan to create
appropriate weights to allow users to obtain representative statistics from
a combination of all three 1910 samples.
Additional details on sample weights and sample designs can be
found in Chapter 2, “Sample Designs.”
Variations in Group Quarters Definitions
In all the IPUMS files, persons in large units such as institutions and
boarding houses were sampled at the individual level, and not as members
of households. Such units are termed “group quarters.” Unlike members
of households, group quarters residents cannot be compared to their
co-residents because the co-residents are not ordinarily contained in the
sample. Furthermore, the samples for 1940 and subsequent census years
provide fewer variables for residents of group quarters than they do for
household members.
Unfortunately, different census years did not define group quarters
consistently, and this complicates inter-year comparisons for many
variables. Thus, individuals sampled as group quarters residents in some
years might have been sampled as household members in other years. For
example, the 1940, 1950, 1960, and 1970 census years define group
quarters as units containing five or more persons unrelated to the
household head/householder. The IPUMS samples for the remaining
census years define group quarters as units with 10 or more persons
unrelated to the household head.1
Sufficient information is available in all census years to determine
whether a sampling unit would have been treated as group quarters or a
household under the 1940-1970 rules, and this is done in the IPUMS
variable GQ (H72). Users who wish to create a common household
universe for all census years can use GQ to eliminate from their
research universe all units that would have been sampled as group quarters
in the 1940-1970 census years, by simply selecting households coded “1.”
However, many researchers, particularly those concentrating on the
earlier IPUMS years when boarders, lodgers, servants, and secondary
families were much more common, will not wish to use GQ in this way.
Doing so would eliminate from the universe many pre-1940 units for
which the pre-1940 samples contain full information on all household
members.
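Before restricting the universe in this way, it can help to see how many units each GQ code covers. The following sketch tallies household records by the value in column 72. The function names are hypothetical, and the code "3" in the sample data is invented; the text above documents only that households are coded "1".

```python
from collections import Counter


def gq_code(household_record: str) -> str:
    """Read GQ from column 72 of a household record."""
    return household_record[71:72]


def tally_gq(records) -> Counter:
    """Count household records by GQ code, ignoring person records."""
    return Counter(gq_code(r) for r in records if r[:1] == "H")


def make_household(gq: str) -> str:
    """Build a made-up 334-character household record for this example."""
    chars = list("H" + " " * 333)
    chars[71] = gq
    return "".join(chars)


person = "P" + " " * 333
records = [
    make_household("1"), person,
    make_household("3"), person,  # "3" is an invented non-household code
    make_household("1"), person, person,
]
tally = tally_gq(records)
print(tally)
```

Comparing the tally across census years shows how many pre-1940 units the common-universe restriction would remove.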
The differences among the samples in treatment of group quarters, like
other variations in sample design, also affect the precision of the samples.
See Chapters 2 and 3, “Sample Designs” and “Sampling Errors,” in this
volume.
1 The original samples for 1850 through 1920 used slightly broader definitions of
group quarters, but in the IPUMS the larger units are treated as group quarters.
See Chapter 2 on “Sample Designs,” and the variable description for Group
Quarters (GQ) in this volume.
Alternate Census Forms, 1960-1980
In two census years, 1960 and 1970, the Census Bureau used two
different sample “long forms” with slightly differing questions when they
took the census. Moreover, in 1980 certain information was collected for
only a fraction of forms. Depending on the census year, these have
varying implications for the IPUMS.
In 1960, certain housing information (such as stories and elevators)
was only collected in cities with 50,000 or more residents, where form
PH-4 was distributed, and other information (such as sewage disposal) was
only collected outside large cities, where form PH-3 was used. A third set
of housing questions was asked for a 5 percent sample of all housing units,
and a fourth set was asked of 20 percent of all units. The 1960 IPUMS
sample contains all four sets of questions, but each is available only for a
subset of cases. The IPUMS variable SAMP1960 identifies which set of
questions the household answered. For those housing items asked of
only a subset of the population, the universe statement identifies which
form included the question, and those not included in the universe are
coded “not applicable.” In the case of the PH-3 and PH-4 samples, users
must be aware of the universe limitations imposed by the different forms.
The distinction between the 20% form and the 5% form should have little
effect for most analyses beyond limiting the number of available cases, but
if users want to estimate the absolute number of occurrences of an item in
the 1960 population, they should multiply 20% form items by 125 and the
5% form items by 500.
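As a small worked example of that scaling, the sketch below applies the two multipliers given above to a made-up sample count. The dictionary and function names are hypothetical, and the count of 40 cases is invented.

```python
# Multipliers stated in the text for 1960 housing items.
MULTIPLIER_1960 = {"20%": 125, "5%": 500}


def estimate_population_count(sample_cases: int, form: str) -> int:
    """Scale a count of 1960 sample cases to an estimated population count."""
    return sample_cases * MULTIPLIER_1960[form]


estimate_20 = estimate_population_count(40, "20%")
estimate_5 = estimate_population_count(40, "5%")
print(estimate_20, estimate_5)
```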
In 1970 the Census Bureau used two substantially different long forms
with varying questions on both the person and household records. One
form, referred to in the IPUMS as Form 1, was filled out by 5% of the
population, and the other form, Form 2, was filled out by 15% of the
population. For example, the Form 1 (5%) sample included questions such
as age at first marriage, citizenship status, and occupation five years ago,
whereas the Form 2 (15%) sample inquired about parental birthplaces,
school attendance, and migration status. The IPUMS includes separate
samples for each form. Users must decide which version they need before
carrying out their analysis.
The 1980 census used a single long form, but the makers of the sample
used only a portion of the forms for questions on travel time to work, place
of work, and migration. In the IPUMS, these variables are available for
half of all cases; the rest are coded “not applicable.” This should have
little effect on analysis, but if users wish to estimate the absolute number
of occurrences of one of these items in the 1980 population, they should
multiply the weight by two or apply the variable MIGSAMP.
Geographic Coding and Comparability
Every census collected precise information on residential location.
This information is preserved in the pre-1940 census samples, but the 1940
and subsequent census years suppressed much of it in order to meet
confidentiality requirements. Thus, the 1940 and 1950 samples do not
identify places that had fewer than 100,000 residents in 1980; the 1960 and
1970 samples do not identify places with fewer than 250,000 inhabitants;
and the 1980 and 1990 samples do not identify places with fewer than
100,000 inhabitants. The impact of these rules on the comparability and
geographic coverage of areas smaller than states is explained in the
documentation for each geographic variable; see particularly the variables
CNTYGP97 (County group, 1970), CNTYGP98 (County group, 1980),
PUMA (Public Use Microdata Area, 1990), SEA (State economic area),
METRO (Metropolitan status), METAREA (Metropolitan area), CITY
(City identifier), and URBAN (Urban/rural status).
In the 1970, 1980, and 1990 census years the Bureau produced
alternate versions of the public use files containing different geographic
codes. Because of confidentiality regulations, the samples do not identify
any places smaller than 250,000 in 1970 or 100,000 in 1980 and 1990. By
providing multiple samples with different coding schemes, the Census
Bureau was able to maximize flexibility without violating the
confidentiality thresholds.
In 1970, three geographic versions were produced, and for each one
the Census Bureau created a Form 1 sample and a Form 2 sample (see
above):
• The State samples identify all states. No geographic
  subdivisions of states are identified, but for most states it is
  possible to distinguish rural from urban areas and
  metropolitan from non-metropolitan areas.
• The Metro (also known as County Group) samples identify
  economic areas within states of 250,000 or more, but those
  areas do not always follow state boundaries. Every
  metropolitan area of 250,000 or more is identified. Only four
  states can be completely identified, because metropolitan
  areas frequently cross state boundaries and identification of
  both state and metropolitan area would violate the
  confidentiality rules. No rural/urban distinctions are
  available, and smaller metropolitan areas cannot be identified.
• The Neighborhood samples provide only regional information
  and size of place, but they also give specific characteristics of
  the surrounding neighborhood that allow contextual analysis.
  The neighborhoods are not identified by name but represent
  areas approximately the size of census tracts, which contained
  about 4,000 people. See “Geographic Tools” in Volume 2:
  User’s Guide Supplement for the record layout and
  descriptions of the characteristics.
The Census Bureau also created three geographic variations in 1980:
• The State (A) sample identifies all states, larger metropolitan
  areas, and most counties over 100,000 population. In many
  cases individual cities are also identified. It does not identify
  urban/rural residence or residence in smaller metropolitan
  areas. The State sample is very large, including 1-in-20 (5%)
  of the U.S. population.
• The Metro (B) sample identifies 282 metropolitan areas over
  100,000 population. Only twenty states can be completely
  identified, because metropolitan areas frequently cross state
  boundaries and identification of both state and metropolitan
  area would violate the confidentiality rules. Metropolitan
  areas are distinguished from non-metropolitan areas, but the
  sample does not identify urban/rural residence.
• The Urban/Rural (C) sample identifies urban/rural residence,
  central city residence, and particular urbanized areas. For
  confidentiality reasons, only 28 states can be entirely
  identified, and no metropolitan areas are identified.
The Census Bureau produced two geographic versions in 1990:
• The State (5%) sample identifies all states, and within states,
  most counties or parts of counties with 100,000 or more
  population. It also identifies most metropolitan areas over
  100,000 completely. The sample is the largest one in the
  IPUMS.
• The Metro (1%) sample is similar, but more metropolitan
  areas are identified and some states cannot be completely
  identified because of confidentiality restrictions.
The universe statements, variable descriptions, and comparability
discussions that comprise the data dictionaries for each IPUMS variable
explain how variables are affected by confidentiality restrictions. For
example, the universe statement for METAREA indicates that the variable
is available only for the Metro samples in 1970 and only for the State and
Metro samples in 1980.
To maximize historical comparability, the IPUMS constructs a variety
of geographic variables for earlier years that were not contained in the
original samples. The variables SEA (State economic area), METRO
(Metropolitan status), and METAREA (Metropolitan area), among others,
represent concepts not yet in use when the earlier censuses were taken.
The IPUMS applies these concepts to the pre-1940 samples in order to
extend the series of comparable geographic codes backwards. In addition,
the 1850-1920 census years include COUNTY (County of residence),
which is especially useful for constructing variables describing the local
socioeconomic context of each case. These county-level data are available
in machine-readable form (ICPSR file 0003) from the Inter-university
Consortium for Political and Social Research
(http://www.icpsr.umich.edu).
Occupational and Industrial Classifications
Occupation and industry are among the most important variables for
analyses of long-term social change because the early census years provide
few alternative indicators of socioeconomic status or labor-force
participation. The Census Bureau has modified its classification systems
every decade, so all comparisons of occupation and industry require
extensive reconciliation of codes. There are nine different occupational
classification systems consisting of between 285 and 550 categories each.
Although a complete reconciliation of these coding schemes is impossible,
we provide variables that maximize the potential for consistent
comparisons of occupational status. In each census year, we provide the
contemporary occupational classification as well as imposing a common
coding scheme based on the 1950 Census Bureau classification. This and
other comparable occupational variables are described at length in
Chapter 4, “Occupation Codes and Income Scores.”
Constructed Variables on Family Interrelationships
The IPUMS contains three pointer variables—MOMLOC, POPLOC,
and SPLOC—that give the location within the household of each
individual's mother, father, and spouse. These variables allow users to
easily attach characteristics of individuals to those of their kin, and they
are convenient tools for constructing measures of fertility and
co-residence. The IPUMS also includes several of the most commonly
requested variables on own children: Number of Own Children
(NCHILD), Number of Own Children Under Age Five (NCHLT5), Age of
Eldest Own Child (ELDCH), and Age of Youngest Own Child (YNGCH).
These and other constructed family interrelationship variables are fully
described in this volume in Chapter 5, “Family Interrelationships.”
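As a minimal sketch of how a pointer variable can be used, the Python
fragment below attaches each person's mother's age to the person's own
record. It assumes that MOMLOC holds the mother's person number within
the household and is 0 when no mother is present. The field names
`pernum` and `age` and the sample household are hypothetical and are
chosen only for illustration.

```python
def attach_mother_age(household):
    """Return copies of the person records with a 'mom_age' field added.

    `household` is a list of dicts, one per person. Assumed (hypothetical)
    fields: 'pernum' (position within the household), 'age', and 'momloc'
    (the mother's pernum, or 0 if she is not in the household).
    """
    by_pernum = {p["pernum"]: p for p in household}
    result = []
    for person in household:
        mother = by_pernum.get(person["momloc"])  # None when momloc is 0
        enriched = dict(person)
        enriched["mom_age"] = mother["age"] if mother else None
        result.append(enriched)
    return result


# Illustrative household: a mother (person 1) and her child (person 2).
example_household = [
    {"pernum": 1, "age": 34, "momloc": 0},
    {"pernum": 2, "age": 8, "momloc": 1},
]
linked = attach_mother_age(example_household)
```

The same lookup would work for POPLOC and SPLOC by substituting the
pointer field, under the same assumptions.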
Coding Schemes
The original census samples that make up the IPUMS employed
different classification systems and coding schemes in every census year.
A central goal of the IPUMS is to reconcile these in order to create
comparable codes for each variable. Perfect uniformity across years was
our ideal, but it is achieved only in a few variables (most notably SEX)
that are classified in precisely the same way in each sample. However,
for most variables, such perfection could not be achieved without an
unacceptable loss of information. This is primarily because the variables
often contain more detail in some census years than in others; if we had
reduced all census years to their lowest common denominator, this detail
would have been lost.
The person-record variable RELATE (Relationship to Household
Head/Householder) is illustrative. Most of the samples code it differently.
For instance, household heads/householders are coded “1000” in the
original 1910 sample, “01” in the 1940 sample, and “0” in the 1960
sample. To reconcile these, the IPUMS translates the various codes into a
single code (“01”) for household heads/householders in all years, thus
easily eliminating this simple incompatibility. But some censuses code
RELATE in more detail — that is, using more categories — than others.
For example, the 1960 and 1970 samples used only 15 categories for
RELATE, while the 1910 sample distinguishes 161 categories. Tables 2
and 3 reproduce part of the original codebook pages describing
Relationship to Household Head/Householder for both the 1910 and 1960
samples.

Table 2
1910 Relationship Codes (Partial Listing)
P06 REL Relationship to head
Columns 14-17   Width 4

Value  Description           No. of individuals        %
  -3   Unknown                               12     0.00
  -2   Illegible                             44     0.01
  -1   Blank                                910     0.25
1000   Head                              80,589    22.00
1201   Husband                               26     0.01
1202   Wife                              63,773    17.41
1300   Child                                  3     0.00
1301   Son                               82,168    22.44
1302   Daughter                          78,407    21.41
1310   Stepchild                              9     0.00
1311   Stepson                            1,429     0.39
1313   Stepdaughter                       1,279     0.35
1320   Adopted child                         21     0.01
1321   Adopted son                          187     0.05
1322   Adopted daughter                     219     0.06
1331   Son-in-law                         1,020     0.28
1332   Daughter-in-law                      840     0.23
1341   Stepson-in-law                         8     0.00
1342   Stepdaughter-in-law                   10     0.00
   .
2301   Nephew                             1,430     0.39
2302   Niece                              1,485     0.41
   .
2501   Uncle                                137     0.04
2502   Aunt                                 245     0.07
   .
6000   Servant                            3,802     1.04
Total                                   366,239   100.00

Source: Michael A. Strong, et al., User’s Guide, Public Use Sample, 1910 United States
Census of Population, Population Studies Center, University of Pennsylvania, 1989.

Table 3
1960 Relationship Codes

Character  Item and data             Code
[Column]   descriptor name           [Value]  Description of codes
P1         Basic relationship        0        Head of household
           HEADRELA                  1        Wife of head
                                     2        Son or daughter of head
                                     3        Other relative of head
                                     4        Roomer, boarder, or lodger
                                     5        Patient or inmate
                                     6        Other not related to head
P2         Detailed relationship              Head, wife, or child not in subfamily, or GQ
           of persons in             1        Grandson or granddaughter
           households                2        Father or mother or stepparent
           HEADRELB                  3        Father-in-law or mother-in-law
                                     4        Brother or sister or stepbrother or stepsister
                                     5        Brother- or sister-in-law
                                     6        Other relative
                                     7        Partner or friend
                                     8        Roomer, boarder, or lodger
                                     9        Resident employee

Source: United States Bureau of the Census, Technical Documentation for the 1960 Public
Use Sample, Inter-university Consortium for Political and Social Research (ICPSR) edition,
1973.

If we forced the 1910 categories into those for 1960, we would
lose such categories as nephew, aunt, and domestic servant.
To avoid such problems and still maximize code comparability, we
designed two-part coding systems for many variables. A general code,
constituting the first one, two, or sometimes three columns of these
variables, serves as a lowest common denominator that classifies
information available in all samples containing the variable. The general
codes are usually fully comparable across years. A detailed code,
contained in the columns that follow the general code, preserves subcategories that are available in some but not all years. It must be read with
the general code. The general-detailed coding system maximizes
comparability without losing information, and has been applied to all
complex categorical classifications except for occupation, which is
discussed at length in Chapter 4, “Occupation Codes and Income Scores.”
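The reconciliation of household-head codes described earlier amounts to a
lookup keyed on census year and original code. The sketch below covers
only the head codes quoted in the text. The function name and the table
structure are illustrative assumptions, not the procedure actually used to
build the IPUMS.

```python
# Original head/householder codes quoted in the text, keyed by sample year.
ORIGINAL_HEAD_CODES = {1910: "1000", 1940: "01", 1960: "0"}

IPUMS_HEAD_GENERAL = "01"  # single general code used in all years


def general_relate(year, original_code):
    """Translate an original head code to the common general code.

    Returns None for anything other than the head codes listed above,
    because the full crosswalk is not reproduced here.
    """
    if ORIGINAL_HEAD_CODES.get(year) == original_code:
        return IPUMS_HEAD_GENERAL
    return None
```

A full crosswalk would need one entry per original category and would
also assign the detailed columns.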
Table 4 shows part of the IPUMS data dictionary for RELATE. The
header indicates that RELATE is available in both general and detailed
form. The general codes comprise the first two columns of RELATE
(P48-49). These first two digits are comparable for all years. They
generally follow the 1960 and 1970 codes, which served as our lowest
common denominator for creating RELATE, as described above. The
general codes allow researchers to comparatively analyze RELATE across
all ten census years for which it is available. Researchers who wish to use
RELATE in its detailed form—that is, those who wish to analyze
information that is available only for some years—should read all four
columns of RELATE (P48-51), thus incorporating the detailed subcategories contained in the detailed columns (P50-51).
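A minimal sketch of reading RELATE from a fixed-column person record
follows. It assumes that the column positions P48-51 are counted from 1
within the person record. The function name and the sample record are
hypothetical.

```python
def read_relate(person_record):
    """Return (general, detailed) RELATE codes from a person record string.

    Columns are assumed to be 1-based, so P48-49 is the slice [47:49]
    and P48-51 is the slice [47:51].
    """
    general = person_record[47:49]   # comparable across all years
    detailed = person_record[47:51]  # general code plus the detail columns
    return general, detailed


# Hypothetical record: 47 filler characters followed by RELATE "0302".
sample = "x" * 47 + "0302" + "x" * 10
general, detailed = read_relate(sample)
```

Here `general` is "03" (Child) and `detailed` is "0302", which Table 4
lists as Adopted child.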
Using the Data Dictionaries
The data dictionaries, split into household and person sections,
describe each variable. For each variable, the dictionaries provide a
universe statement, a variable description, and a discussion of variable
comparability across census years. In certain cases, additional user notes
caution researchers about potential problems that some uses of the variable
may entail. Users should pay close attention to the universe statement,
comparability discussion, and user notes. In many cases, variables with
apparently comparable codes are actually defined slightly differently or
are available for different populations in various census years.
For most variables a frequency table gives the value label for each
IPUMS code. The RELATE frequency table shown in Table 4 serves as
an example. The numbers under each census year are the number of cases
(frequencies) in each category for that year; a blank indicates that the
category is not available for that year. The codes and frequencies table for
the general codes has no blanks, indicating that all categories are available
in all census years. By contrast, the table for the detailed codes is filled
with blanks, because many codes are available only in a subset of years. A
blank in the pre-1940 censuses usually means that the response did not
occur in the population, because, for the most part, the pre-1940 census
samples preserved virtually all available detail. A blank in a table for the
census years from 1940 to the present ordinarily means that the variable
was simply coded into broader categories.
Note that the Detailed Relationship distribution shows a space
between the general and detailed codes only to improve readability—no
blank spaces exist in the dataset. To save space, the documentation
provides frequencies only for the detailed version of some variables.
The frequency counts in the data dictionaries are unweighted;
therefore, they do not necessarily accurately reflect the distribution for the
general population. In particular, frequencies for 1950 (a heavily weighted
sample) often appear inconsistent with surrounding years. By applying
appropriate weights, all samples can be made representative of the general
population.
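As an illustration of the difference between unweighted and weighted
frequencies, the sketch below tallies a categorical variable both ways.
The field name `weight` stands in for whatever sample weight applies to a
given sample, and the records are invented for the example.

```python
from collections import defaultdict


def frequencies(records, variable, weight_field=None):
    """Tally `variable`, optionally summing a weight instead of counting."""
    totals = defaultdict(float)
    for rec in records:
        totals[rec[variable]] += rec[weight_field] if weight_field else 1
    return dict(totals)


# Invented records: the second category is under-sampled and so
# carries a larger weight.
people = [
    {"relate": "01", "weight": 100.0},
    {"relate": "01", "weight": 100.0},
    {"relate": "03", "weight": 300.0},
]
unweighted = frequencies(people, "relate")
weighted = frequencies(people, "relate", "weight")
```

In this invented case the unweighted tally makes code "01" look twice as
common as "03", while the weighted tally reverses the ordering.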
In years with multiple samples, we chose one for purposes of
presenting frequencies. The 1970 frequencies are from the Form 2 (15%)
State sample, by default. If a variable is only available in the Form 1 (5%)
samples, then the Form 1 State sample frequencies are presented. In the
case of METAREA in 1970, the Form 2 Metro (County Group) sample
was used. The 1980 frequencies are taken from the Metro (“B”) sample,
and the 1990 frequencies are from the Metro (1%) sample.
Indentations in the value labels column of the data dictionary are
meaningful. Any item indented beneath another is a subset of the larger
category. If a subcategory is not available in a given year, in general the
cases would have been coded into the larger category. For example, in
Table 4, panel 2, the category of “adopted child” is not available after
1940. Adopted children in more recent years are recorded in the larger
category “child,” under which “adopted child” is indented.
Public Use Microdata Samples Source Materials
For all census years from 1850 to 1950 (except for 1890, whose records
were destroyed by fire), the original manuscript population schedules are
preserved on microfilm at the National Archives in Washington, D.C. In
each year, the microfilm reels and the schedules within reels are
organized geographically: alphabetically by state, within states
alphabetically by county, and within counties numerically by
enumeration district. For census years since 1960, the census schedules
exist in machine-readable form.
The basic sources for most of the IPUMS documentation are the
documentation provided for each of the individual public use microdata
samples. These are listed below. All of them should be available
through the Inter-university Consortium for Political and Social
Research (ICPSR), P.O. Box 1248, Ann Arbor, MI, 48106.
1850: Steven Ruggles, Russell R. Menard, et al., Public Use
Microdata Sample of the 1850 United States Census of Population:
User’s Guide and Technical Documentation, Social History Research
Laboratory, Department of History, University of Minnesota, 1995.
1860-1870: See 1920.
1880: Steven Ruggles, Russell R. Menard, et al., Public Use
Microdata Sample of the 1880 United States Census of Population:
User’s Guide and Technical Documentation, Social History Research
Laboratory, Department of History, University of Minnesota, 1994.
1900: Stephen N. Graham, 1900 Public Use Sample, User’s
Handbook, Center for Studies in Demography and Ecology, University
of Washington, 1980.
1910: Michael A. Strong, et al., User’s Guide, Public Use Sample,
1910 United States Census of Population, Population Studies Center,
University of Pennsylvania, 1989.
1920: The 1920 sample is currently being created by Steve Ruggles
and others at the Historical Census Projects, Department of History,
University of Minnesota. It was designed at the outset to be comparable
to the IPUMS, so a separate volume of documentation will be issued for
it only over Ruggles’s dead body.
1940: United States Bureau of the Census, Census of Population,
1940: Public Use Sample Technical Documentation, Government
Printing Office, 1984.
1950: United States Bureau of the Census, Census of Population,
1950: Public Use Sample Technical Documentation, Government
Printing Office, 1984.
1960: United States Bureau of the Census, Technical Documentation
for the 1960 Public Use Sample, Government Printing Office, 1973. See
also the entry for 1970.
1970: United States Bureau of the Census, Public Use Samples of
Basic Records from the 1970 Census: Description and Technical
Documentation, Government Printing Office, 1972. Also contains
important information about the 1960 sample that is not contained in the
1960 documentation.
1980: United States Bureau of the Census, Public Use Samples of
Basic Records from the 1980 Census: Description and Technical
Documentation, Government Printing Office, 1983.
1990: United States Bureau of the Census, Census of Population and
Housing, 1990: Public Use Microdata Samples, Technical
Documentation, Government Printing Office, 1993.
User Feedback Requested
If users encounter major errors in the IPUMS, such as undocumented
categories, we would appreciate a note to that effect. We would also
appreciate any comments or suggestions for improvement of the next
edition of the IPUMS. Send these messages via electronic mail to
ipums@hist.umn.edu.
Users should not be too concerned if their frequencies do not precisely
match those in the documentation. We have made a number of changes
since those frequencies were produced that may have small effects on the
number of cases in some categories. The purpose of the frequencies is to
indicate the availability of particular categories and to provide a general
guide for recoding.
(Table 4 appears on the following page.)
Table 4
Frequency Table for RELATE (Partial Listing)

Codes and Frequencies - General:

                            Code   1850   1880   1900   1910   1920   1940   1950   1960   1970   1980   1990
Relatives
  Head/Householder          01          101865  21338  80631 120597 350354 443719 529984 634408 804615 918782
  Spouse                    02           81371  16676  63785  95759 267997 375009 396000 437135 490262 532985
  Child                     03          246600  47052 163632 229713 540738 854306 697452 783186 763029 785662
  Child-in-law              04            2296    457   1876   3349  12275  22678   5835   5002   4434   3801
  Parent                    05            3831    943   3030   4493  12315  18448  12843  12440  14492  16593
  Parent-in-law             06            2423    571   2460   3867  10654  19873  12905  10241   6260   3686
  Sibling                   07            6228   1375   5015   7373  17231  19839  14398  15043  20282  23376
  Sibling-in-law            08            2858    692   2924   4395  10318  15798   7264   5737   4507   3167
  Grandchild                09            7766   1580   5431   7576  26400  56257  26366  25623  26552  42869
  Other relatives           10            5588   1132   4232   5994  15257  25977  21150  13248  15494  23889
Non-relatives
  Partner, friend, visitor  11             278    196    731    241   1757   2194   2529  10073  33466  64252
  Other non-relatives       12           39136   7730  29896  33870  73605  63280  54219  56163  58997  55918
  Institutional inmates     13            2600    683   2596   3905  12831   4820  18943  21334  24930  25072
1880
1900
1910
1920
1940
1950
1960
1970
1980
1990
Codes and Frequencies - Detailed:
Code
1850
Relatives
Head/householder
01 01
101865
21338
80631
120597
350354
443719
529984
634408
804615
978782
Spouse
02 01
81354
16673
63785
95758
267997
375009
396000
437135
490262
532985
02 02
17
3
03 01
241941
46186
226155
531847
837706
697452
783186
763029
745382
3177
8891
16600
3336
12275
22678
5835
5002
4434
3801
12315
18448
12843
12440
14492
16593
10654
19873
12905
10241
6260
3686
2nd/3rd wife (polygamous)
Child
1
160427
Adopted child
03 02
755
103423
381
Stepchild
03 03
3904
763
2725
Adopted, n.s.
03 04
Child-in-law
Step child-in-law
04 01
2272
457
1859
04 02
24
17
13
05 01
3781
919
2978
4442
Stepparent
05 02
50
24
52
51
Parent-in-law
06 01
2418
571
2453
3864
06 02
5
7
3
Parent
Stepparent-in-law
40280
57