
IPUMS-98 VOLUME 1 USER’S GUIDE
IPUMS Design - Introduction

Chapter 1
INTRODUCTION

This document describes the Integrated Public Use Microdata Series (IPUMS-98), created at the University of Minnesota in November 1997. The IPUMS consists of twenty-five high-precision samples of the American population drawn from thirteen federal censuses. Some of these samples have existed for years, and others were created specifically for this database. The twenty-five samples, which span the censuses of 1850 to 1990, collectively constitute our richest source of quantitative information on long-term changes in the American population. However, because different investigators created these samples at different times, they employed a wide variety of record layouts, coding schemes, and documentation. This has complicated efforts to use them to study change over time. The IPUMS assigns uniform codes across all the samples and brings relevant documentation into a coherent form to facilitate analysis of social and economic change.

Citation and Use

All persons are granted a limited license to use and distribute this documentation and the accompanying data, subject to the following conditions:

1. No fee may be charged for use or distribution.
2. Publications and research reports based on the database must cite it appropriately. The citation should include the following:

   Steven Ruggles and Matthew Sobek
   Integrated Public Use Microdata Series: Version 2.0
   Minneapolis: Minnesota Historical Census Projects, University of Minnesota, 1997

In addition, we request that users send us a copy of any publications, research reports, or educational material making use of the data or documentation.
Printed matter should be sent to:

   IPUMS
   Historical Census Projects
   Department of History
   University of Minnesota
   614 Social Sciences
   267 19th Avenue South
   Minneapolis, MN 55455

Electronic material should be sent to: ipums@hist.umn.edu

Universe

The Integrated Public Use Microdata Series consists of a series of compatible-format individual-level representative samples of the United States for the years 1850, 1860, 1870, 1880, 1900, 1910, 1920, 1940, 1950, 1960, 1970, 1980, and 1990. Each of these samples is independent; it is not possible to trace individuals from one census year to the next. The data represent all persons in states, territories that eventually became states, and the District of Columbia, with the following exceptions:

1. The 1850 and 1860 samples exclude the slave population.
2. The pre-1890 samples exclude “Indians not taxed.”
3. The pre-1960 samples, except 1900 and 1910, exclude Alaska and Hawaii.

In addition, for the period since 1970 the data series includes unoccupied housing units.

Date of enumeration varied by census year, as described below:

   1850-1900:   June 1
   1910:        April 15
   1920:        January 1
   1930-1990:   April 1

Subject Content

The data series includes information on a broad range of population characteristics, including fertility, nuptiality, life-course transitions, immigration, internal migration, labor-force participation, occupational structure, education, ethnicity, and household composition. The information available in each sample varies according to the questions asked in that year and by differences in post-enumeration processing. In general, the later census years provide a greater range of characteristics than the earlier ones, though the earlier censuses often contain greater detail for the variables that are available.
A full listing of available variables can be found in the section entitled “Variable Availability and Record Layout.” For the period since 1960, the data series also provides detailed housing characteristics.

Organization of Documentation

The IPUMS documentation is divided into five volumes. Volume 1: User’s Guide covers the minimum information most users require to use the database, including the overall design of the database, variable descriptions, and coding schemes for each variable. Volume 2: User’s Guide Supplement includes supplemental information on some of the more complicated variables, such as maps and coding schemes for the geographic areas identified in the samples, and alternate occupation and industry coding schemes available for particular census years. Volume 3: Counting the Past provides enumerator instructions, replicas of the census forms, procedural histories of the censuses, and descriptions of missing-data allocation procedures and other post-enumeration processing. Volumes 4 and 5 consist of complete descriptions of all the data transformations we carried out to create the IPUMS.

This introduction to the database is essential reading for all prospective users. Additional chapters in this volume cover sample designs, sampling errors, occupational coding, and family interrelationship codes. We then provide a guide to the availability of variables in each IPUMS sample, variable descriptions, coding schemes, and marginal frequencies. Finally, we briefly describe the missing and inconsistent data allocation and data quality flags.

Uses of Microdata

Most population data, especially historical census data, has traditionally been available only in aggregated tabular form. The IPUMS is microdata, which means that it provides information about individual persons and households. This makes it possible for researchers to create tabulations tailored to their particular questions.
Since the IPUMS includes nearly all the detail originally recorded by the census enumerations, users can construct a great variety of tabulations interrelating any desired set of variables. The flexibility offered by microdata is particularly important for historical research because the aggregate tabulations produced by the Census Bureau are often not comparable across time, and until recently the subject coverage of census publications was limited.

Microdata do pose some limitations, however. Most important, for the period since 1940 census microdata are subject to strict confidentiality measures that limit their usefulness for some applications. The IPUMS samples for these years include no names, addresses, or other potentially identifying information. To further ensure that no individuals can be identified, the Census Bureau limits the detail on place of residence, place of work, very high incomes, and several other variables. In particular, the microdata records for the period since 1940 identify no geographic areas with fewer than 100,000 inhabitants (250,000 in 1960 and 1970). Therefore the IPUMS is inappropriate for research that requires the identification of specific small geographic areas in those census years.

Sample Availability

The IPUMS samples for the years since 1960 were originally created by the Census Bureau as part of each decennial enumeration; the samples for earlier years were created by a variety of individual researchers with funding from the National Institutes of Health and the National Science Foundation. Table 1 lists the samples now available as part of the IPUMS as well as those which will be added to it in the near future, including samples for 1860 and 1870 and an enlarged sample for 1900. We hope eventually to add a sample for 1930.
Table 1
IPUMS File Characteristics

| Filename | IPUMS Sample Name | Original Sample Name | Principal Investigators | Release Date: IPUMS Version | Release Date: Original Version | Sample Density | Households (thousands) | Persons (thousands) |
|---|---|---|---|---|---|---|---|---|
| ip18501 | 1850 sample | | Menard/Ruggles | 1995 | 1994 | 1 in 100 | 37 | 198 |
| ip18601 | 1860 sample | | Ruggles | 1998 | 1998 | 1 in 500 | TBA | TBA |
| ip18701 | 1870 sample | | Ruggles | 1998 | 1998 | 1 in 500 | TBA | TBA |
| ip18801 | 1880 sample | | Ruggles/Menard | 1995 | 1994 | 1 in 100 | 107 | 503 |
| ip19001 | 1900 sample | | Preston | 1993 | 1980 | 1 in 760 | 27 | 100 |
| ip19002 | 1900 sample, enlarged version | | Ruggles | 2002 | 2002 | 1 in 100 | 205 | 760 |
| ip19101 | 1910 unweighted sample | | Preston | 1993 | 1989 | 1 in 250 | 89 | 366 |
| ip19102 | 1910 sample combined w/ oversamples | | Preston et al. | 1998 | 1990 | varies | 5 | 24 |
| ip19103 | 1910 Hispanic oversample | | Gutmann/Ruggles | 1998 | 1996 | varies | 22 | 96 |
| ip19201 | 1920 sample | | Ruggles/Menard | 1997 | 1997 | 1 in 200 | 129 | 521 |
| ip19401 | 1940 sample | | Winsborough et al. | 1993 | 1984 | 1 in 100 | 391 | 1351 |
| ip19501 | 1950 sample | | Winsborough et al. | 1993 | 1984 | 1 in 100 | 461 | 1922 |
| ip19601 | 1960 sample | | Census Bureau | 1993 | 1971 | 1 in 100 | 579 | 1780 |
| ip19701 | 1970 Form 1 State sample | 5% State sample | Census Bureau | 1993 | 1972 | 1 in 100 | 744 | 2030 |
| ip19702 | 1970 Form 2 State sample | 15% State sample | Census Bureau | 1995 | 1972 | 1 in 100 | 744 | 2030 |
| ip19703 | 1970 Form 1 Metro sample | 5% County group sample | Census Bureau | 1995 | 1972 | 1 in 100 | 744 | 2030 |
| ip19704 | 1970 Form 2 Metro sample | 15% County group sample | Census Bureau | 1995 | 1972 | 1 in 100 | 744 | 2030 |
| ip19705 | 1970 Form 1 Neighborhood sample | 5% Neighborhood sample | Census Bureau | 1998 | 1972 | 1 in 100 | 744 | 2030 |
| ip19706 | 1970 Form 2 Neighborhood sample | 15% Neighborhood sample | Census Bureau | 1998 | 1972 | 1 in 100 | 744 | 2030 |
| ip19801 | 1980 State sample | A sample | Census Bureau | 1996 | 1983 | 1 in 20 | 4711 | 11337 |
| ip19802 | 1980 Metro sample | B sample | Census Bureau | 1993 | 1983 | 1 in 100 | 942 | 2267 |
| ip19803 | 1980 Urban/rural sample | C sample | Census Bureau | 1998 | 1983 | 1 in 100 | 942 | 2267 |
| ip19901 | 1990 State sample | 5% sample | Census Bureau | 1996 | 1992 | 1 in 20 | 5528 | 12500 |
| ip19902 | 1990 Metro sample | 1% sample | Census Bureau | 1993 | 1992 | 1 in 100 | 1106 | 2500 |
| ip19903 | 1990 Elderly sample | 3% Elderly sample | Census Bureau | 1999 | 1993 | 1 in 33 | n.a. | n.a. |
| ip19904 | 1990 Flat State sample | 1% Unweighted sample | Census Bureau | 1998 | 1995 | 1 in 100 | 1106 | 2500 |

Note: The Original Sample Name column also contains three entries for samples earlier than 1970 (“Preliminary version 1,” “Preliminary version 1,” and “Preliminary version 2”); the rows they belong to cannot be determined from this copy of the table, so those cells are left blank.

Users should be aware that the 1920 sample is currently in preliminary form. The sample will be expanded before the final version is released in late 1998.

File Structure

Most users will access the data through our on-line data extraction system, available at our web site (http://www.ipums.umn.edu). This system allows users to select a subset of cases and variables for analysis, so they do not have to cope with multiple gigabytes of data. The extraction system creates a record layout tailored to the needs of each user. For users who want to use the raw data or who plan complicated data manipulations, however, it is essential to understand the raw IPUMS record layout and file structure.

The raw data files for each IPUMS sample consist of a series of 334-character logical records. These records are of two types: household records and person records.
Household records, identified with an “H” in the first column, contain information pertaining to an entire household or group quarters residence. Each household record is followed by a series of person records, identified by a “P” in the first column, which contain information about each sampled individual in the unit. In the text of this document, variables are identified according to their record type and column location. For example, the variable AGE is identified as P52-54 because it is located in columns 52 through 54 of the person record.

The IPUMS record layout stresses column-comparability rather than compactness. In general, all variables available across multiple census years appear in the same columns in every year. When a variable is not available for a given year, the columns are either filled with a missing data value or, in some cases, with another variable available for different years.

Within households, individuals are generally sequenced according to their household relationship codes (variable RELATE, P48-51). Household heads (or householders after 1970) appear first. They are ordinarily followed by a spouse (if any), children (in descending age order), other relatives, and non-relatives. In the pre-1960 census years there are a few exceptions to this sequence, since the original order of enumeration is preserved and not all enumerators followed the prescribed sequence.

Most analyses of IPUMS data require manipulation of both household and person records. Each record contains a serial number which links the persons in the housing unit to the appropriate household record. Many statistical software packages, such as SAS, SPSS, Stata, and BMDP, will handle the IPUMS datasets easily. They offer various methods to rectangularize the hierarchical file structure by attaching household information to each person record.
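As an illustration of the rectangularizing step described above, here is a minimal Python sketch. It is not part of the IPUMS software. It assumes that each 334-character logical record occupies one line of the raw file (the text does not say how records are delimited) and that every household record immediately precedes its person records, as stated above. The function names `columns` and `rectangularize` are hypothetical. The column positions for AGE (P52-54) and RELATE (P48-51) come from the text above, and GQ (H72) is the household variable described later in this chapter.

```python
from typing import Dict, Iterable, Iterator

RECORD_LENGTH = 334  # length of each logical record, as stated in the text


def columns(record: str, first: int, last: int) -> str:
    """Return columns first..last (1-based, inclusive) of a fixed-width record."""
    return record[first - 1:last]


def rectangularize(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one flat row per person record, with household fields attached.

    Assumption: one logical record per line, households followed by their persons.
    """
    household = None
    for line in lines:
        # Strip only the line ending, then pad so short lines can still be sliced.
        record = line.rstrip("\r\n").ljust(RECORD_LENGTH)
        rectype = columns(record, 1, 1)
        if rectype == "H":
            household = record  # remember the current household record
        elif rectype == "P":
            if household is None:
                raise ValueError("person record found before any household record")
            yield {
                "gq": columns(household, 72, 72),   # GQ, H72 (household level)
                "relate": columns(record, 48, 51),  # RELATE, P48-51
                "age": columns(record, 52, 54),     # AGE, P52-54
            }
```

Because the function accepts any iterable of lines, an open file object can be passed directly, for example `rows = list(rectangularize(open(path)))` for some local raw file `path`. Values are returned as raw column strings so that any missing-data codes are preserved for the user to interpret.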
In addition, constructed variables in the IPUMS make it easy to attach characteristics of household and family members, such as spouses, children, and parents. Use of these variables is discussed in Chapter 5, “Family Interrelationships,” in this volume.

In general, the variables in the IPUMS are coded into numeric categories, but there are several exceptions. The following variables contain alphabetic characters:

   H1          RECTYPE    Record Type (all samples)
   H168-199    STREET     Street Address (1880 and 1920 only)
   P1          RECTYPE    Record Type (all samples)
   P219-234    NAMELAST   Name, Last (1850, 1880 and 1920 only)
   P235-250    NAMEFRST   Name, First (1850, 1880 and 1920 only)

The files identified as available in Table 1 may be downloaded from the internet via our World Wide Web page (http://www.ipums.umn.edu) or our anonymous FTP site (ftp.ipums.umn.edu) and are also available through our interactive data-extraction system. For further details, see our web site.

Sample-line Characteristics in 1940 and 1950

Since 1940, the census has asked an extended set of questions of a sample of the population. In the first two years that the census used sampling, it was done on an individual basis: individuals who fell on a designated “sample line” of the census enumeration form were asked an extra set of questions (the enumeration forms are reproduced in Volume 3: Counting the Past). The 1940 and 1950 samples were constructed to ensure that each household contains one of these sample-line individuals. In the IPUMS, the sample-line variables are coded as “not applicable” for all individuals except the sample-line individual. Thus, users will find that for many variables in 1940 and 1950 the number of available cases is limited.

The sample-line format of the census imposes some limitations on the kinds of analysis that can be carried out. For example, one might want to find out if native-born persons with foreign-born parents tended to marry other persons of the same ethnicity.
Since the variables on parental birthplaces in 1940 and 1950 are on the sample line, and only one sample-line person is included in each household, it is impossible to compare the parental birthplaces of husbands and wives in these years.

In the censuses of 1960 through 1990, the Census Bureau also employed sampling, but it was carried out on a household basis: a portion of households received “long form” questionnaires with an additional set of questions to be filled out for each member of the household. Since the public use files were constructed entirely from these long forms, most sample items are available for all individuals in the IPUMS; the few exceptions are noted in the section of this volume entitled “Variable Availability and Record Layout.”

Sample Weights

With the exception of the samples for 1860, 1870, 1940, 1950, 1990, and the 1910 oversamples of Blacks and Hispanics, the IPUMS samples are unweighted “flat” samples. This means that each observation, whether a household or an individual, represents a fixed number of persons in the general U.S. population. That number is roughly the inverse of the sample density shown in Table 1. For example, each individual in the samples for 1880, 1940, and 1960 represents approximately 100 persons in the general population, and each individual in the original 1900 sample represents approximately 760 persons in the general population.

The 1860, 1870, 1940, 1950, and 1990 samples are weighted. This means that persons with some characteristics are over-represented in the samples, while others are underrepresented. In the 1860 and 1870 samples, households containing any black person were sampled at twice the density of other households in order to allow more in-depth analysis of the black population. In the case of 1940 and 1950, a flat sample was impossible because of the complexities in sample design necessitated by the sample-line individuals.
In the case of 1990, the Census Bureau opted against a flat sample design in order to maximize precision for persons residing in small localities. In all of these cases, users must take an additional step if they want to obtain statistics that are representative of the general population.

To obtain representative statistics, users have two basic options: they can either apply sample weights or select a representative unweighted subsample of the data.

To apply sample weights to an IPUMS file, users should follow one of the following procedures:

1. For household-level analyses in 1860, 1870, or 1990, or household-level analyses in 1940 or 1950 that do not involve sample-line characteristics, weight the analysis by the variable HHWT (H18-21). HHWT gives the number of households in the general population represented by each household in the sample.
2. For person-level analyses in 1860, 1870, or 1990, or person-level analyses in 1940 or 1950 that do not involve sample-line characteristics, apply the variable PERWT (P20-23). PERWT gives the number of individuals in the general population represented by each individual in the sample.
3. For any analyses in 1940 or 1950 involving sample-line characteristics, apply the variable SLWT (P16-19). SLWT gives the number of individuals in the general population represented by each sample-line individual.

The variables HHWT, PERWT, and SLWT also exist for the unweighted IPUMS samples, but they are simply the overall ratio of the full population count to the number of cases in the sample. For example, the weights in 1880 average 99.74, and those in 1900 average 758.94. Thus, applying the weights to the unweighted samples will inflate case counts to correspond to the entire population.

To select an unweighted subsample of an IPUMS file, users should follow one of the following procedures:

1. For any analyses of 1940, 1950, or 1990 that do not involve sample-line characteristics, select entire households with a value of “2” for the variable SELFWTHH (H30).
2. For any analyses of 1940 or 1950 involving sample-line characteristics, select individuals with a value of “2” for the variable SELFWTSL (P24).

The unweighted subsamples have the disadvantage that the number of cases may be sharply reduced, but for some applications they are easier to work with than the weighted samples.

To make things simpler for users of the 1990 sample, we have created an unweighted 1-in-100 extract of the 1-in-20 weighted sample. This version (ip19904) requires no weights, and the values of HHWT and PERWT are always 100.

We plan to add oversamples of the Black and Hispanic populations in 1910 (ip19102 and ip19103). When we add them, we plan to create appropriate weights to allow users to obtain representative statistics from a combination of all three 1910 samples. Additional details on sample weights and sample designs can be found in Chapter 2, “Sample Designs.”

Variations in Group Quarters Definitions

In all the IPUMS files, persons in large units such as institutions and boarding houses were sampled at the individual level, and not as members of households. Such units are termed “group quarters.” Unlike members of households, group quarters residents cannot be compared to their co-residents because the co-residents are not ordinarily contained in the sample. Furthermore, the samples for 1940 and subsequent census years provide fewer variables for residents of group quarters than they do for household members.

Unfortunately, different census years did not define group quarters consistently, and this complicates inter-year comparisons for many variables. Thus, individuals sampled as group quarters residents in some years might have been sampled as household members in other years.
For example, the 1940, 1950, 1960, and 1970 census years define group quarters as units containing five or more persons unrelated to the household head/householder. The IPUMS samples for the remaining census years define group quarters as units with 10 or more persons unrelated to the household head [1].

Sufficient information is available in all census years to determine whether a sampling unit would have been treated as group quarters or a household under the 1940-1970 rules, and this is done in the IPUMS variable GQ (H72). Users who wish to create a common household universe for all census years can use GQ to eliminate from their research universe all units that would have been sampled as group quarters in the 1940-1970 census years, by simply selecting households coded “1.” However, many researchers, particularly those concentrating on the earlier IPUMS years when boarders, lodgers, servants, and secondary families were much more common, will not wish to use GQ in this way. Doing so would eliminate from the universe many pre-1940 units for which the pre-1940 samples contain full information on all household members.

The differences among the samples in treatment of group quarters, like other variations in sample design, also affect the precision of the samples. See Chapters 2 and 3, “Sample Designs” and “Sampling Errors,” in this volume.

Alternate Census Forms, 1960-1980

In two census years, 1960 and 1970, the Census Bureau used two different sample “long forms” with slightly differing questions when they took the census. Moreover, in 1980 certain information was collected for only a fraction of forms. Depending on the census year, these have varying implications for the IPUMS.
In 1960, certain housing information (such as stories and elevators) was only collected in cities with 50,000 or more residents, where form PH-4 was distributed, and other information (such as sewage disposal) was only collected outside large cities, where form PH-3 was used. A third set of housing questions was asked for a 5 percent sample of all housing units, and a fourth set was asked of 20 percent of all units. The 1960 IPUMS sample contains all four sets of questions, but each is available only for a subset of cases. The IPUMS variable SAMP1960 identifies which set of questions the household answered. For those housing items asked of only a subset of the population, the universe statement identifies which form included the question, and those not included in the universe are coded “not applicable.” In the case of the PH-3 and PH-4 samples, users must be aware of the universe limitations imposed by the different forms. The distinction between the 20% form and the 5% form should have little effect for most analyses beyond limiting the number of available cases, but if users want to estimate the absolute number of occurrences of an item in the 1960 population, they should multiply 20% form items by 125 and the 5% form items by 500.

[1] The original samples for 1850 through 1920 used slightly broader definitions of group quarters, but in the IPUMS the larger units are treated as group quarters. See Chapter 2 on “Sample Designs,” and the variable description for Group Quarters (GQ) in this volume.

In 1970 the Census Bureau used two substantially different long forms with varying questions on both the person and household records. One form, referred to in the IPUMS as Form 1, was filled out by 5% of the population, and the other form, Form 2, was filled out by 15% of the population.
For example, the Form 1 (5%) sample included questions such as age at first marriage, citizenship status, and occupation five years ago, whereas the Form 2 (15%) sample inquired about parental birthplaces, school attendance, and migration status. The IPUMS includes separate samples for each form. Users must decide which version they need before carrying out their analysis.

The 1980 census used a single long form, but the makers of the sample used only a portion of the forms for questions on travel time to work, place of work, and migration. In the IPUMS, these variables are available for half of all cases; the rest are coded “not applicable.” This should have little effect on analysis, but if users wish to estimate the absolute number of occurrences of one of these items in the 1980 population, they should multiply the weight by two or apply the variable MIGSAMP.

Geographic Coding and Comparability

Every census collected precise information on residential location. This information is preserved in the pre-1940 census samples, but the 1940 and subsequent census years suppressed much of it in order to meet confidentiality requirements. Thus, the 1940 and 1950 samples do not identify places that had fewer than 100,000 residents in 1980; the 1960 and 1970 samples do not identify places with fewer than 250,000 inhabitants; and the 1980 and 1990 samples do not identify places with fewer than 100,000 inhabitants. The impact of these rules on comparability and geographic coverage of areas smaller than states is explained in the documentation for each geographic variable. See particularly the variables CNTYGP97 (County group, 1970), CNTYGP98 (County group, 1980), PUMA (Public Use Microdata Area, 1990), SEA (State economic area), METRO (Metropolitan status), METAREA (Metropolitan area), CITY (City identifier), and URBAN (Urban/rural status).
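The confidentiality thresholds in the preceding paragraph can be summarized as a small lookup, which may be convenient when checking whether a study area could appear in a given sample. The Python sketch below is illustrative only: the function name `min_identified_population` is hypothetical, and the figures are simply those quoted in the text.

```python
from typing import Optional

# Census years covered by the IPUMS, as listed under "Universe".
IPUMS_YEARS = {1850, 1860, 1870, 1880, 1900, 1910, 1920,
               1940, 1950, 1960, 1970, 1980, 1990}

# Smallest place population identified in each sample, per the text.
# For 1940 and 1950 the text states the threshold in terms of 1980 population.
THRESHOLDS = {
    1940: 100_000,
    1950: 100_000,
    1960: 250_000,
    1970: 250_000,
    1980: 100_000,
    1990: 100_000,
}


def min_identified_population(census_year: int) -> Optional[int]:
    """Return the confidentiality threshold for a census year.

    Returns None for pre-1940 samples, which preserve precise location.
    """
    if census_year not in IPUMS_YEARS:
        raise ValueError(f"{census_year} is not an IPUMS census year")
    return THRESHOLDS.get(census_year)
```

For example, `min_identified_population(1970)` returns 250000, while `min_identified_population(1880)` returns `None` because the pre-1940 samples are not subject to these limits.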
In the 1970, 1980, and 1990 census years the Bureau produced alternate versions of the public use files containing different geographic codes. Because of confidentiality regulations, the samples do not identify any places smaller than 250,000 in 1970 or 100,000 in 1980 and 1990. By providing multiple samples with different coding schemes, the Census Bureau was able to maximize flexibility without violating the confidentiality thresholds.

In 1970, three geographic versions were produced, and for each one the Census Bureau created a Form 1 sample and a Form 2 sample (see above):

• The State samples identify all states. No geographic subdivisions of states are identified, but for most states it is possible to distinguish rural from urban areas and metropolitan from non-metropolitan areas.

• The Metro (also known as County Group) samples identify economic areas within states of 250,000 or more, but those areas do not always follow state boundaries. Every metropolitan area of 250,000 or more is identified. Only four states can be completely identified, because metropolitan areas frequently cross state boundaries and identification of both state and metropolitan area would violate the confidentiality rules. No rural/urban distinctions are available, and smaller metropolitan areas cannot be identified.

• The Neighborhood samples provide only regional information and size of place, but they also give specific characteristics of the surrounding neighborhood that allow contextual analysis. The neighborhoods are not identified by name but represent areas approximately the size of census tracts, which contained about 4,000 people. See “Geographic Tools” in Volume 2: User’s Guide Supplement for the record layout and descriptions of the characteristics.

The Census Bureau also created three geographic variations in 1980:

• The State (A) sample identifies all states, larger metropolitan areas, and most counties over 100,000 population. In many cases individual cities are also identified. It does not identify urban/rural residence or residence in smaller metropolitan areas. The State sample is very large, including 1-in-20 (5%) of the U.S. population.

• The Metro (B) sample identifies 282 metropolitan areas over 100,000 population. Only twenty states can be completely identified, because metropolitan areas frequently cross state boundaries and identification of both state and metropolitan area would violate the confidentiality rules. Metropolitan areas are distinguished from non-metropolitan areas, but the sample does not identify urban/rural residence.

• The Urban/Rural (C) sample identifies urban/rural residence, central city residence, and particular urbanized areas. For confidentiality reasons, only 28 states can be entirely identified, and no metropolitan areas are identified.

The Census Bureau produced two geographic versions in 1990:

• The State (5%) sample identifies all states, and within states, most counties or parts of counties with 100,000 or more population. It also identifies most metropolitan areas over 100,000 completely. The sample is the largest one in the IPUMS.

• The Metro (1%) sample is similar, but more metropolitan areas are identified and some states cannot be completely identified because of confidentiality restrictions.

The universe statements, variable descriptions, and comparability discussions that make up the data dictionaries for each IPUMS variable explain how variables are affected by confidentiality restrictions. For example, the universe statement for METAREA indicates that the variable is available only for the Metro samples in 1970 and only for the State and Metro samples in 1980.

To maximize historical comparability, the IPUMS constructs a variety of geographic variables for earlier years that were not contained in the original samples.
The variables SEA (State economic area), METRO (Metropolitan status), and METAREA (Metropolitan area), among others, represent concepts not yet in use when the earlier censuses were taken. The IPUMS applies these concepts to the pre-1940 samples in order to extend the series of comparable geographic codes backwards. In addition, the 1850-1920 census years include COUNTY (County of residence), which is especially useful for constructing variables describing the local socioeconomic context of each case. These county-level data are available in machine-readable form (ICPSR file 0003) from the Inter-university Consortium for Political and Social Research (http://www.icpsr.umich.edu).

Occupational and Industrial Classifications

Occupation and industry are among the most important variables for analyses of long-term social change because the early census years provide few alternative indicators of socioeconomic status or labor-force participation. The Census Bureau has modified its classification systems every decade, so all comparisons of occupation and industry require extensive reconciliation of codes. There are nine different occupational classification systems consisting of between 285 and 550 categories each. Although a complete reconciliation of these coding schemes is impossible, we provide variables that maximize the potential for consistent comparisons of occupational status. In each census year, we provide the contemporary occupational classification and also impose a common coding scheme based on the 1950 Census Bureau classification. This and other comparable occupational variables are described at length in Chapter 4, “Occupation Codes and Income Scores.”

Constructed Variables on Family Interrelationships

The IPUMS contains three pointer variables (MOMLOC, POPLOC, and SPLOC) that give the location within the household of each individual's mother, father, and spouse.
These variables allow users to easily attach characteristics of individuals to those of their kin, and they are convenient tools for constructing measures of fertility and coresidence. The IPUMS also includes several of the most commonly requested variables on own children: Number of Own Children (NCHILD), Number of Own Children Under Age Five (NCHLT5), Age of Eldest Own Child (ELDCH), and Age of Youngest Own Child (YNGCH). These and other constructed family interrelationship variables are fully described in this volume in Chapter 5, “Family Interrelationships.”

Coding Schemes

The original census samples that make up the IPUMS employed different classification systems and coding schemes in every census year. A central goal of the IPUMS is to reconcile these in order to create comparable codes for each variable. Perfect uniformity across years was our ideal, but is only achieved in a few variables (most notably SEX) which are classified in precisely the same way in each sample. However, for most variables, such perfection could not be achieved without an unacceptable loss of information. This is primarily because the variables often contain more detail in some census years than in others; if we had reduced all census years to their lowest common denominator, this detail would have been lost.

The person-record variable RELATE (Relationship to Household Head/Householder) is illustrative. Most of the samples code it differently. For instance, household heads/householders are coded “1000” in the original 1910 sample, “01” in the 1940 sample, and “0” in the 1960 sample. To reconcile these, the IPUMS translates the various codes into a single code (“01”) for household heads/householders in all years, thus easily eliminating this simple incompatibility. But some censuses code RELATE in more detail (that is, using more categories) than others. For example, the 1960 and 1970 samples used only 15 categories for RELATE, while the 1910 sample distinguishes 161 categories.
Tables 2 and 3 reproduce part of the original codebook pages describing Relationship to Household Head/Householder for both the 1910 and 1960 samples. If we forced the 1910 categories into those for 1960, we would lose such categories as nephew, aunt, and domestic servant.

Table 2
1910 Relationship Codes (Partial Listing)

P06  REL  Relationship to head    Columns 14-17    Width 4

                                   No. of
Value  Description            individuals       %
   -3  Unknown                         12    0.00
   -2  Illegible                       44    0.01
   -1  Blank                          910    0.25
 1000  Head                        80,589   22.00
 1201  Husband                         26    0.01
 1202  Wife                        63,773   17.41
 1300  Child                            3    0.00
 1301  Son                         82,168   22.44
 1302  Daughter                    78,407   21.41
 1310  Stepchild                        9    0.00
 1311  Stepson                      1,429    0.39
 1313  Stepdaughter                 1,279    0.35
 1320  Adopted child                   21    0.01
 1321  Adopted son                    187    0.05
 1322  Adopted daughter               219    0.06
 1331  Son-in-law                   1,020    0.28
 1332  Daughter-in-law                840    0.23
 1341  Stepson-in-law                   8    0.00
 1342  Stepdaughter-in-law             10    0.00
    .
 2301  Nephew                       1,430    0.39
 2302  Niece                        1,485    0.41
    .
 2501  Uncle                          137    0.04
 2502  Aunt                           245    0.07
    .
 6000  Servant                      3,802    1.04
       Total                      366,239  100.00

Source: Michael A. Strong, et al., User’s Guide, Public Use Sample, 1910 United States Census of Population, Population Studies Center, University of Pennsylvania, 1989.

Table 3
1960 Relationship Codes

Character [Column] P1
Item and data descriptor name: Basic relationship (HEADRELA)

Code [Value]  Description of codes
0             Head of household
1             Wife of head
2             Son or daughter of head
3             Other relative of head
4             Roomer, boarder, or lodger
5             Patient or inmate
6             Other not related to head

Character [Column] P2
Item and data descriptor name: Detailed relationship of persons in households (HEADRELB)

Codes [Values]: 1 2 3 4 5 6 7 8 9
Description of codes:
  Head, wife, or child not in subfamily, or GQ
  Grandson or granddaughter
  Father or mother or stepparent
  Father-in-law or mother-in-law
  Brother or sister or stepbrother or stepsister
  Brother- or sister-in-law
  Other relative
  Partner or friend
  Roomer, boarder, or lodger
  Resident employee

Source: United States Bureau of the Census, Technical Documentation for the 1960 Public Use Sample, Inter-university Consortium for Political and Social Research (ICPSR) edition, 1973.

To avoid such problems and still maximize code comparability, we designed two-part coding systems for many variables. A general code, constituting the first one, two, or sometimes three columns of these variables, serves as a lowest common denominator that classifies information available in all samples containing the variable. The general codes are usually fully comparable across years. A detailed code, contained in the columns that follow the general code, preserves subcategories that are available in some but not all years. It must be read with the general code. The general-detailed coding system maximizes comparability without losing information, and has been applied to all complex categorical classifications except for occupation, which is discussed at length in Chapter 4, “Occupation Codes and Income Scores.”

Table 4 shows part of the IPUMS data dictionary for RELATE. The header indicates that RELATE is available in both general and detailed form. The general codes comprise the first two columns of RELATE (P48-49). These first two digits are comparable for all years. They generally follow the 1960 and 1970 codes, which served as our lowest common denominator for creating RELATE, as described above. The general codes allow researchers to comparatively analyze RELATE across all ten census years for which it is available. Researchers who wish to use RELATE in its detailed form (that is, those who wish to analyze information that is available only for some years) should read all four columns of RELATE (P48-51), thus incorporating the detailed subcategories contained in the detailed columns (P50-51).

Using the Data Dictionaries

The data dictionaries, split into a household and person section, describe each variable.
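Before going further into the dictionary entries, the general and detailed columns of RELATE described above can be illustrated with a minimal Python sketch. It assumes the four RELATE columns (P48-51) have already been read from a person record as a four-character string. The names split_relate, describe_relate, GENERAL_LABELS, and DETAILED_LABELS are hypothetical, and the labels are a small subset taken from the RELATE listing in Table 4.

```python
# Illustrative sketch only: function and dictionary names are hypothetical.
# Labels are a small subset of the RELATE categories listed in Table 4.
GENERAL_LABELS = {"01": "Head/Householder", "02": "Spouse", "03": "Child"}

# Detailed labels are keyed by all four digits, because the detailed code
# must be read together with the general code.
DETAILED_LABELS = {
    "0301": "Child",
    "0302": "Adopted child",
    "0303": "Stepchild",
}


def split_relate(relate):
    """Split a four-digit RELATE string into (general, detailed) parts."""
    if len(relate) != 4 or not relate.isdigit():
        raise ValueError(f"expected four digits, got {relate!r}")
    return relate[:2], relate[2:]


def describe_relate(relate):
    """Return (general label, detailed label) for a four-digit RELATE value.

    Falls back to the general label when no detailed label is listed.
    """
    general, _detail = split_relate(relate)
    general_label = GENERAL_LABELS.get(general, "unlisted")
    detailed_label = DETAILED_LABELS.get(relate, general_label)
    return general_label, detailed_label


if __name__ == "__main__":
    print(split_relate("0303"))
    print(describe_relate("0303"))
```

An analysis that spans all years would group on the first element returned by split_relate, while an analysis confined to years with more detail would use the full four-digit value.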
For each variable, the dictionaries provide a universe statement, a variable description, and a discussion of variable comparability across census years. In certain cases, additional user notes caution researchers about potential problems that some uses of the variable may entail. Users should pay close attention to the universe statement, comparability discussion, and user notes. In many cases, variables with apparently comparable codes are actually defined slightly differently or are available for different populations in various census years.

For most variables a frequency table gives the value label for each IPUMS code. The RELATE frequency table shown in Table 4 serves as an example. The numbers under each census year are the number of cases (frequencies) in each category for that year; a blank indicates that the category is not available for that year. The codes and frequencies table for the general codes has no blanks, indicating that all categories are available in all census years. By contrast, the table for the detailed codes is filled with blanks, because many codes are available only in a subset of years. A blank in the pre-1940 censuses usually means that the response did not occur in the population, because, for the most part, the pre-1940 census samples preserved virtually all available detail. A blank in a table for the census years from 1940 to the present ordinarily means that the variable was simply coded into broader categories. Note that the Detailed Relationship distribution shows a space between the general and detailed codes only to improve readability; no blank spaces exist in the dataset.

To save space, the documentation provides frequencies only for the detailed version of some variables.

The frequency counts in the data dictionaries are unweighted; therefore, they do not necessarily accurately reflect the distribution for the general population.
In particular, frequencies for 1950 (a heavily weighted sample) often appear inconsistent with surrounding years. By applying appropriate weights, all samples can be made representative of the general population.

In years with multiple samples, we chose one for purposes of presenting frequencies. The 1970 frequencies are from the Form 2 (15%) State sample, by default. If a variable is only available in the Form 1 (5%) samples, then the Form 1 State sample frequencies are presented. In the case of METAREA in 1970, the Form 2 Metro (County Group) sample was used. The 1980 frequencies are taken from the Metro (“B”) sample, and the 1990 frequencies are from the Metro (1%) sample.

Indentations in the value labels column of the data dictionary are meaningful. Any item indented beneath another is a subset of the larger category. If a subcategory is not available in a given year, in general the cases would have been coded into the larger category. For example, in Table 4, panel 2, the category of “adopted child” is not available after 1940. Adopted children in more recent years are recorded in the larger category “child,” under which “adopted child” is indented.

Public Use Microdata Samples Source Materials

For all census years from 1850 to 1950 (except for 1890, whose schedules were destroyed by fire), the original manuscript population schedules are preserved on microfilm at the National Archives in Washington D.C. In each year, the microfilm reels and the schedules within reels are organized geographically: alphabetically by state, within states alphabetically by county, and within counties numerically by enumeration district. For census years since 1960, the census schedules exist in machine-readable form.

The basic sources for most of the IPUMS documentation are the documentation provided for each of the individual public use microdata samples. These are listed below.
All of them should be available through the Inter-university Consortium for Political and Social Research (ICPSR), P.O. Box 1248, Ann Arbor, MI, 48106.

1850: Steven Ruggles, Russell R. Menard, et al., Public Use Microdata Sample of the 1850 United States Census of Population: User’s Guide and Technical Documentation, Social History Research Laboratory, Department of History, University of Minnesota, 1995.

1860-1870: See 1920.

1880: Steven Ruggles, Russell R. Menard, et al., Public Use Microdata Sample of the 1880 United States Census of Population: User’s Guide and Technical Documentation, Social History Research Laboratory, Department of History, University of Minnesota, 1994.

1900: Stephen N. Graham, 1900 Public Use Sample, User’s Handbook, Center for Studies in Demography and Ecology, University of Washington, 1980.

1910: Michael A. Strong, et al., User’s Guide, Public Use Sample, 1910 United States Census of Population, Population Studies Center, University of Pennsylvania, 1989.

1920: The 1920 sample is currently being created by Steve Ruggles and others at the Historical Census Projects, Department of History, University of Minnesota. It was designed at the outset to be comparable to the IPUMS, so a separate volume of documentation will be issued for it only over Ruggles’s dead body.

1940: United States Bureau of the Census, Census of Population, 1940: Public Use Sample Technical Documentation, Government Printing Office, 1984.

1950: United States Bureau of the Census, Census of Population, 1950: Public Use Sample Technical Documentation, Government Printing Office, 1984.

1960: United States Bureau of the Census, Technical Documentation for the 1960 Public Use Sample, Government Printing Office, 1973. See also the entry for 1970.

1970: United States Bureau of the Census, Public Use Samples of Basic Records from the 1970 Census: Description and Technical Documentation, Government Printing Office, 1972.
Also contains important information about the 1960 sample that is not contained in the 1960 documentation.

1980: United States Bureau of the Census, Public Use Samples of Basic Records from the 1980 Census: Description and Technical Documentation, Government Printing Office, 1983.

1990: United States Bureau of the Census, Census of Population and Housing, 1990: Public Use Microdata Samples, Technical Documentation, Government Printing Office, 1993.

User Feedback Requested

If users encounter major errors in the IPUMS, such as undocumented categories, we would appreciate a note to that effect. We would also appreciate any comments or suggestions for improvement of the next edition of the IPUMS. Send these messages via electronic mail to ipums@hist.umn.edu.

Users should not be too concerned if their frequencies do not precisely match those in the documentation. We have made a number of changes since those frequencies were produced that may have small effects on the number of cases in some categories. The purpose of the frequencies is to indicate the availability of particular categories and to provide a general guide for recoding.

(Table 4 appears below.)
Table 4
Frequency Table for RELATE (Partial Listing)

Codes and Frequencies - General:

                               Code       1850    1880    1900    1910    1920    1940    1950    1960    1970    1980    1990
Relatives
  Head/Householder             01               101865   21338   80631  120597  350354  443719  529984  634408  804615  918782
  Spouse                       02                81371   16676   63785   95759  267997  375009  396000  437135  490262  532985
  Child                        03               246600   47052  163632  229713  540738  854306  697452  783186  763029  785662
  Child-in-law                 04                 2296     457    1876    3349   12275   22678    5835    5002    4434    3801
  Parent                       05                 3831     943    3030    4493   12315   18448   12843   12440   14492   16593
  Parent-in-law                06                 2423     571    2460    3867   10654   19873   12905   10241    6260    3686
  Sibling                      07                 6228    1375    5015    7373   17231   19839   14398   15043   20282   23376
  Sibling-in-law               08                 2858     692    2924    4395   10318   15798    7264    5737    4507    3167
  Grandchild                   09                 7766    1580    5431    7576   26400   56257   26366   25623   26552   42869
  Other relatives              10                 5588    1132    4232    5994   15257   25977   21150   13248   15494   23889
Non-relatives
  Partner, friend, visitor     11                  278     196     731     241    1757    2194    2529   10073   33466   64252
  Other non-relatives          12                39136    7730   29896   33870   73605   63280   54219   56163   58997   55918
  Institutional inmates        13                 2600     683    2596    3905   12831    4820   18943   21334   24930   25072

Codes and Frequencies - Detailed:

                               Code       1850    1880    1900    1910    1920    1940    1950    1960    1970    1980    1990
Relatives
  Head/householder             01 01            101865   21338   80631  120597  350354  443719  529984  634408  804615  918782
  Spouse                       02 01             81354   16673   63785   95758  267997  375009  396000  437135  490262  532985
    2nd/3rd wife (polygamous)  02 02                17       3               1
  Child                        03 01            241941   46186  160427  226155  531847  837706  697452  783186  763029  745382
    Adopted child              03 02               755     103     423     381
    Stepchild                  03 03              3904     763    2725    3177    8891   16600                           40280
    Adopted, n.s.              03 04                                57
  Child-in-law                 04 01              2272     457    1859    3336   12275   22678    5835    5002    4434    3801
    Step child-in-law          04 02                24              17      13
  Parent                       05 01              3781     919    2978    4442   12315   18448   12843   12440   14492   16593
    Stepparent                 05 02                50      24      52      51
  Parent-in-law                06 01              2418     571    2453    3864   10654   19873   12905   10241    6260    3686
    Stepparent-in-law          06 02                 5               7       3
