6347895 Project Summary

Broader Impact: Census Research Data Centers (RDCs), located in Ann Arbor, Berkeley, Boston, Chicago, Durham, Ithaca, Los Angeles, New York City, and Washington, provide approved scientists with access to confidential Census data for research that directly benefits both the Census Bureau and society. The RDC directors, administrators, board members, and researchers, together with the Center for Economic Studies (CES) and the Longitudinal Employer-Household Dynamics (LEHD) Program, constitute a collaborative research network that builds and supports a secure distributed computing network enabling research critical to our economic and civic prosperity and security. The network operates under physical security constraints dictated by the Census Bureau and the Internal Revenue Service. These constraints essentially eliminate the possibility of distributing computations to facilities outside the Bureau's main computing facility. Instead, researchers use the RDCs as supervised remote-access facilities that provide a secure, encrypted connection to the RDC computing network. The research conducted in the RDCs and at LEHD over the past decade has made important contributions to our understanding of essential social, economic, and environmental issues, contributions that would not have been possible without the confidential data accessible via the RDC network.
It is difficult to overstate the significance of this research, which has used more than 30 years of longitudinally integrated establishment micro-data from the Census Business Register and Economic Censuses; confidential micro-data from all the major Census surveys (Current Population Survey, Survey of Income and Program Participation, American Housing Survey); confidential micro-data from the Decennial Censuses of Population in 1990 and 2000; longitudinally integrated Unemployment Insurance wage records, ES-202 establishment data, and Social Security Administration data; federal tax information linked to major surveys; environmental data on air quality linked to Business Register and Economic Census data; Medicaid data linked to the Survey of Income and Program Participation; and many others.

Intellectual Merit for the National Priority Area of Economic Prosperity and Vibrant Civil Society: We propose to address the technical and logistical issues raised by the creation, maintenance, and growth of the RDC network while maintaining the confidentiality guaranteed to participants in Census data collections. Over the next four years, we expect the RDCs and LEHD to lead a new wave of research with the development of innovative, large-scale linked data products that integrate Census Bureau surveys, censuses, and administrative records with data from state governments and surveys conducted by private institutions. Both CES and LEHD have extensive experience in creating these products. The RDC network researchers will enhance that experience and contribute their own expertise to the data-linking research. The newly created data will be richer than any presently available to researchers, with no increase in respondent burden. They will also raise complicated and vexing issues regarding disclosure avoidance and participant privacy.
Intellectual Merit for the National Priority Area of Advances in Science and Engineering: The second part of our proposal, the creation of synthetic versions of these confidential data sets, will increase the accessibility of these data to social science researchers while preserving the confidentiality of private information. Synthetic and partially synthetic data are new confidentiality protection techniques that rely on computationally intensive sampling from the posterior predictive distribution of the underlying confidential data. The result is micro-data that preserve important analytical properties of the original data and are, thus, inference-valid. The synthetic versions of confidential data are for public use. At the same time, ongoing research within the RDCs using the gold-standard confidential data will constantly test the quality of the synthesized data and allow for continuous improvement. As a result, a continuous feedback relationship will be established between the research activities conducted in RDCs on confidential Census Bureau data and the quality of the Bureau's public-use data products, namely, the synthetic micro-data created by these projects. To accomplish these computationally intensive activities, as well as to allow researchers to engage in such innovative research as agent-based simulations and geo-spatial analysis, we will install a supercluster of SMP nodes optimized for the applications of creating linked data, analyzing the gold-standard data, and processing the data to produce multiply-synthesized public-use data sets. Two industry partners, Intel and Unisys, have promised to directly support the creation of this supercluster by donating 256 Itanium 2 processors and providing the computing crossbars, cluster infrastructure, and disk storage arrays at manufacturer's cost. The Linux-based system will be integrated and tuned by the proposal team from Argonne National Laboratory.
The synthetic data specialists on the proposal team will port existing multi-threaded data synthesizers and develop new ones.

6347895 Project Description and Coordination Plan

I. Introduction

Disciplines that rely on empirical validation require that appropriate resources be available to support rigorous, scientific, empirical research. Thanks to previous National Science Foundation investments, enormously rich new datasets on households and businesses, previously collected by the federal government but unavailable to most researchers, have been created and are becoming more broadly available to the scientific community. These data have already begun to open up new avenues of research, such as analysis of the geospatial characteristics of social activity and agent-based simulation and modeling. But the full return on investment will not be achieved without solving two critical problems: first, providing access to these confidential and highly sensitive data, and second, providing sufficient computing resources to support such access. The solutions to these problems will not only bear immediate fruit for the social sciences by increasing access to these new data, but will also bear future fruit by providing access to other sensitive datasets in areas such as health care and biological research. Our proposal addresses these two challenges by creating a multi-layered confidentiality protection system that integrally links the Census Research Data Center network and the Bureau's public-use products. We propose to develop and enhance methods for creating new public-use products based on the idea of inference-valid synthetic data (Rubin 1993) and partially synthetic data (Abowd and Woodcock 2001; Reiter 2003). Inference-valid synthetic data solve the confidentiality protection problem by replacing the confidential micro-data with draws from the posterior predictive distribution conditional on these micro-data.
Thus, the synthetic data can preserve most of the distributional characteristics of the original data while dramatically reducing the risk of disclosure of confidential information about any individual or business in the original census, sample, or administrative data. Since we will always use the term "synthetic data" in this sense, we do not always use the qualifier "inference-valid." The distinction between synthetic and partially synthetic data is elaborated below. The scientific research community has voiced legitimate concerns about the quality of synthetic data, especially among empirical researchers familiar with the elaborate multivariate distributions embodied in a complicated integrated micro-data product. It is along this dimension that our proposal is truly unique and innovative. We propose that the creation and testing of the synthetic data be undertaken by that very research community in direct collaboration with the Census Bureau. To accomplish this ambitious mission, we rely upon the RDC network and the LEHD Program to create and use gold-standard versions of the confidential Census micro-data in accordance with the existing access and approval protocols. These data would be available within the secure network of RDCs for approved projects. Researchers would, as a condition for working in the RDCs, directly assist in the testing of synthetic data by archiving the meta-data and results from their studies in the model database that the proposal would create within the RDC network. We recognize that the creation and testing of synthetic data place enormous computational demands on the RDC network. Our proposal addresses this second challenge by proposing the installation of a supercluster computer to be designed, tested, and implemented by the proposal team in collaboration with critical industry partners.
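To make the posterior-predictive mechanism concrete, the following is a deliberately minimal sketch: a conjugate normal model with a flat prior synthesizing a single sensitive variable. It illustrates only the idea; the synthesizers proposed here handle far richer multivariate and longitudinal structures, and all names and parameters below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Confidential" micro-data: a single sensitive numeric variable.
confidential = rng.normal(loc=50_000, scale=12_000, size=1_000)

def synthesize(data, n_implicates=4, rng=rng):
    """Draw synthetic copies of `data` from the posterior predictive
    distribution of a normal model with a flat prior (a toy stand-in
    for the far richer models described in the proposal)."""
    n = len(data)
    xbar, s2 = data.mean(), data.var(ddof=1)
    implicates = []
    for _ in range(n_implicates):
        # Posterior draws for the variance (scaled inverse chi-square)
        # and the mean, then predictive draws for the records themselves.
        sigma2 = (n - 1) * s2 / rng.chisquare(n - 1)
        mu = rng.normal(xbar, np.sqrt(sigma2 / n))
        implicates.append(rng.normal(mu, np.sqrt(sigma2), size=n))
    return implicates

synthetic = synthesize(confidential)
# The implicates preserve broad distributional features of the original
# data without reproducing any individual confidential record.
print(round(confidential.mean()), [round(s.mean()) for s in synthetic])
```

Because each synthetic record is a fresh draw rather than a perturbed original, the released file carries parameter uncertainty as well as sampling variability, which is what makes valid inference from the synthetic data possible.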
We will use the expertise of Argonne National Laboratory and specialists within the LEHD Program and the RDC network to develop the appropriate parallel statistical computing environment, which will then be used by the RDC-based researchers and the data synthesizers. The feedback loop between the synthetic data and the models built on the gold-standard confidential data will permit the creation of new public-use synthetic micro-data. The loop will also permit the continuous improvement of these synthetic data as new models are tested using the gold-standard confidential data and the current version of the synthetic data.

II. Overview of the Role of the RDC Network

Social science research relies significantly, if not principally, on detailed micro-data collected from the primary agents of the economy and society: households and businesses. When these data are collected, the agency that assumes custody of them has a dual responsibility. On the one hand, it must facilitate the research that motivated the collection of the data. On the other hand, it normally has an ethical and legal obligation to protect the privacy and confidentiality of the respondents who provided the actual micro-data. The execution of this dual responsibility results in two distinct, but related, outcomes. The custodian develops public-use products from the data and, at the same time, enforces an access-control protocol for the underlying confidential micro-data. Over the course of the twentieth century, the United States Bureau of the Census became the leading custodian of American micro-data, primarily through its role as the collector of the Decennial Census of Population and Housing and the Economic Censuses (Office of Management and Budget, 2003). In the context of the Census Bureau, the dual responsibility cited above has concrete expressions.
The Bureau publishes data in the form of a variety of public-use products ranging from printed reports to detailed micro-data. These are authorized by Chapters 3, 5, and 9 of Title 13 of the U.S. Code. The Bureau's legal responsibilities to protect the confidentiality of the respondents are embodied in Chapter 1, Sub-chapter 1, Section 9, which prohibits:

(1) the use of the information furnished under the provisions of this title for any purpose other than the statistical purposes for which it is supplied; or
(2) making any publication whereby the data furnished by any particular establishment or individual under this title can be identified; or
(3) permitting anyone other than the sworn officers and employees of the Department or bureau or agency thereof to examine the individual reports.

Violation of these prohibitions can result in prison terms of up to five years and fines of up to $250,000 for each offense. The prohibitions have been consistently interpreted by the Bureau and the courts to divide all of its data products into two mutually exclusive categories: (1) public-use products, which are published without copyright or any other restriction on how they may be used, and (2) confidential data products, which may only be used by sworn employees of the Bureau under the Bureau's direct supervision. Section 24 of Chapter 1, Sub-chapter 2 makes provision for temporary employees who are granted what is now called "Special Sworn Status" and who may use the Bureau's confidential data for work directly related to a statutorily authorized Census activity. The most direct consequence of the Census Bureau's dual responsibility to publish statistics and to protect confidentiality is that the overwhelming bulk of the micro-data it has collected over the course of the twentieth century remains confidential. All micro-data collected from businesses are confidential.
Except for the one and five percent public-use micro samples, all micro-data from the Decennial Census of Population and Housing are confidential. The Bureau does produce public-use micro-data from its demographic survey programs, which are often conducted with another government agency. These include the Current Population Survey, the Survey of Income and Program Participation, the American Community Survey, and the Consumer Expenditure Survey. These survey-based public-use micro-data products are heavily used by statisticians, social scientists, health scientists, demographers, geographers, public policy researchers, and many others around the world (Duncan et al. 1993). There is widespread recognition that the Census Bureau’s confidential data represent a national asset of enormous value whose proper stewardship demands both scrupulous adherence to strict standards of confidentiality protection and vigorous efforts to ensure that the data are fully used by the American public (Duncan et al. 1993). The inherent tension between protecting the confidentiality of the underlying micro-data, as well as the privacy of the respondents, and providing extensive public-use products from those same data has led the Bureau to cultivate a direct relationship with the academic research community. This relationship is at the heart of the present research proposal. Academic researchers agree to undertake research projects that meet the Bureau’s statutory mission, as described in Title 13 and, in particular, Chapter 5. These researchers undertake their projects under the direct supervision of the Bureau, either at its headquarters or in a Research Data Center (RDC), formally acting as employees with Special Sworn Status (SSS). During their research the academics provide direct benefits to the Bureau in the form of new public-use data products and improvements to existing censuses and surveys. 
The Center for Economic Studies (CES) at the Census Bureau has welcomed academic researchers since 1981. Most of the original CES-based external researchers used confidential business micro-data. CES now operates an RDC at its offices in Upper Marlboro, MD. The first RDC located outside the Bureau's suburban DC headquarters was opened in Boston in 1994, funded in part by a grant from the National Science Foundation and now operated by the NBER in Cambridge, MA. The RDC network was expanded to include the Carnegie Mellon RDC (opened in 1996; to close in 2004); the California Census RDC (opened in 1998, with sites in Berkeley and Los Angeles); the Research Triangle RDC (operated by Duke University, opened in 1999); the Michigan Census RDC (operated by the University of Michigan, opened in 2001); the Chicago Census RDC (operated by Northwestern University, opened in 2001); and the newest addition, the New York Census RDC (with sites at Baruch College and Cornell University, to open in 2004). Except for the Carnegie Mellon RDC, all of these facilities were initially partially funded by the NSF. In addition to the RDCs, Census has operated the Longitudinal Employer-Household Dynamics Program at its Suitland, MD headquarters. This program has developed integrated household and business data products under the scientific direction of an academic research team and with significant NSF support from a social data infrastructure grant to Cornell University. The terms and conditions of academic research access to Census confidential micro-data have been refined over the last decade, since the opening of the Boston RDC. These terms provide the legal and technological constraints under which access operates. Prior review of all projects and conduct of the research by SSS employees working in a Census-supervised facility are the fundamental tenets of RDC-based researcher access.
Because the confidential micro-data under the Bureau's stewardship come from a variety of sources and are often commingled in the underlying data files, the review and access protocols are subject to a variety of statutes, most notably Title 13 (discussed above), Title 26 (Internal Revenue Service), and Title 42 (Social Security Administration). Combining Census (Title 13) data with data from different administrative sources requires the cover of a "Memorandum of Understanding" (MOU) between the Census Bureau and the other statutory data custodian(s). Because of the commingling of Census (Title 13) and IRS (Title 26) data, particularly in the Bureau's business micro-data, the MOUs and a general policy memo jointly issued by Census and the IRS on September 15, 2000 mandate that the IRS be granted prior review authority for all projects that access commingled Federal Tax Information (FTI, Title 26 data) and Census (Title 13) data. This prior review ensures that the project meets the Census Bureau's statutory mission, as delineated in Title 13, Chapter 5, and that the researcher's access to the data is limited to the minimal amount of FTI necessary to undertake the project. In addition, the presence of commingled Title 26 data in the Bureau's confidential data files mandates that all of the IRS data security provisions (see IRS Publication 1075 and citations therein) be enforced in the RDCs. This latter requirement to a large degree dictates the physical characteristics of the RDC computing network and prohibits any meaningful distributed processing. The process for reviewing, approving, conducting, and certifying projects has evolved since the creation of the CES, the RDCs, and the LEHD Program. This evolution has been difficult and costly for the same reasons that there is a tension between the provision of data products and the protection of data confidentiality in the Bureau's mission.
A viable process has to be transparent: it must be evident to a disinterested third party how any researcher (internal or external) defines and conducts a research project using confidential Census data. More importantly, the process must stand up to public scrutiny by potentially hostile observers: there must be an auditable trail from application to certification that permits Census and the other statutory data custodians (e.g., the IRS) to demonstrate that their stewardship has been lawful and prudent. Transparency and auditability require that the process be formal and subject to written rules. The September 15, 2000 "Criteria for the Review and Approval of Census Projects that Use Federal Tax Information" has provided the general framework for determining whether a project's predominant purpose is consistent with Title 13, Chapter 5. This framework, while important and necessary, has entailed very real resource and management costs. At the same time as there has been heightened awareness and formalization of the legal conditions under which the Bureau nurtures access to its confidential data, there has also been a significant degradation of the Bureau's ability to produce public-use micro-data. Massive increases in the computational power available to an individual snooper, combined with comparably massive increases in the quantity of identifiable data distributed over the Internet (e.g., exact names, locations, and dates of significant events like births, deaths, and marriages; exact names and addresses of businesses), have placed increasingly draconian restrictions on the detail provided in public-use micro-data (see Sweeney 2001, Elliot 2001, and related articles in Doyle et al. 2001).
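The snooper threat can be made concrete with a toy screen for sample uniques: records whose combination of quasi-identifiers occurs only once are exactly those most exposed to linkage against external files. The records and variable names below are invented for illustration and are not Census fields.

```python
from collections import Counter

# Toy micro-data records: (age band, sex, county) stand in for the kind
# of quasi-identifiers a snooper could match against external sources
# such as online vital records or business directories.
records = [
    ("30-34", "F", "Tompkins"),
    ("30-34", "F", "Tompkins"),
    ("85+", "M", "Tompkins"),    # a sample unique: high disclosure risk
    ("45-49", "F", "Suffolk"),
    ("45-49", "F", "Suffolk"),
    ("45-49", "F", "Suffolk"),
]

def sample_uniques(rows):
    """Return the records whose quasi-identifier combination occurs
    exactly once -- a usual first screen for re-identification risk."""
    counts = Counter(rows)
    return [r for r in rows if counts[r] == 1]

risky = sample_uniques(records)
print(risky)  # [('85+', 'M', 'Tompkins')]
```

As the quantity of external identified data grows, more cells become population uniques, which is why ever coarser categories are forced onto public-use files and why synthetic records, which correspond to no real respondent, are attractive.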
The prospect of greatly reduced access to public-use household and individual data, as well as the diminution of the already minuscule prospects for public-use business data, leads us to propose a significant enhancement of the collaboration between the Census Bureau and the academic research community. Our proposal builds on the existing RDC and LEHD structures and is rooted in the Bureau's core mission of providing public-use data products while maintaining the confidentiality of the underlying micro-data. We provide a framework that is fully consistent with the September 15, 2000 criteria document and is both transparent and comprehensively auditable. Equally important, our proposal integrates the provision of public-use data products with the scientific use of confidential micro-data in a manner that creates a permanent, formal, and essential feedback relationship between the two activities.

III. Completing the Loop: The Integration of Census Micro and Aggregate Data Products and the RDC Network

Current data products developed and used by the scientific research community can be classified into three basic types: (i) public-use aggregate tabulations from micro-data on businesses, households/persons, and/or integrated business/employee data; (ii) public-use micro-data on households/persons; and (iii) confidential micro-data on businesses, households/persons, and/or integrated data made accessible through the Census/NSF RDC network. Our proposal ambitiously integrates all of these data products in a manner that will dramatically increase the quantity and quality of public-use products (especially micro-data) and dramatically increase the usefulness of, and access to, the confidential data inside the RDC network. As described below in much more detail, the proposal outlines a fully integrated data production and research access system that will create synthetic data as a part of its regular research activities.
These synthetic data will be made available to the general scientific research community in a variety of ways. The proposal outlines the potential of making publicly available, for the first time, inference-valid synthetic micro-data on businesses and integrated business/worker data. The key here is the synthetic data. When the scientific research community has access to micro-data along the lines we propose, it will revolutionize research on businesses and households and their interaction. As outlined below, pilot projects have already demonstrated that the program is feasible in this context. However, a large number of difficult statistical and confidentiality protection issues remain unresolved, particularly for applications involving business micro-data and for applications using longitudinal, integrated, and geospatial data. Much of the proposed project is devoted to addressing these unresolved issues. A fundamental component is the use of the RDC network as the central laboratory for developing and testing our proposed synthetic data methods. Using the RDCs as laboratories directly enhances researcher access to the confidential micro-data in the RDC network, so that the scientific benefits associated with the individual RDC projects are also a direct outcome of supporting the RDC network through this proposal. To understand the laboratory role of the RDC, consider a prototypical project. A researcher develops an idea that can only be investigated with the micro-data available in the RDC network. The researcher writes a proposal, obtains funding, and, critically, makes a compelling case that the project has the Title 13 benefits to the Census Bureau explained above. The researcher must meet all of these conditions prior to using any of the confidential micro-data. Once the project is approved and started, all of the analysis must be completed at a secure RDC location.
When the project is completed, the researcher must leave behind a fully documented version of the methods and data used (so that, among other things, the analysis could be replicated by another researcher) and also make a full report on the Title 13 benefits. Consider the benefits to such a researcher of having public-use synthetic data (or limited-access synthetic micro-data available through an electronic network). First, in the development of both the research idea and the demonstration of Title 13 benefits, access to the synthetic data significantly enhances the researcher's capabilities. Second, while the project is ongoing, testing and exploration of alternative specifications could be more ambitious if the project proceeded in parallel using both the gold-standard and the synthetic micro-data. Third, the full documentation of methods and data in a meta-data system available to the research community for purposes of further work and/or replication is both easier and more accessible if the project proceeds with the synthetic data in such a parallel fashion. Fourth, the full report on Title 13 benefits is enhanced by the parallel analysis, as the researcher would be able to explore data and methodological issues in a much richer manner. Such parallel analyses act as key "observations" in developing methods for certifying and maintaining the inference-validity of the synthetic data. This completes the loop. The parallel analysis provides direct evidence on the degree to which, and over what dimensions, the synthetic data are indeed inference-valid. The joint products over time are (i) improved access and functionality of the RDC system and (ii) high-quality public-use synthetic micro-data. Does completing this loop eventually yield such rich and complete synthetic micro-data that most social science research could be conducted with the public-use synthetic micro-data? Would success imply that the need for RDCs would diminish over time?
While we are optimistic that the public-use synthetic data will be analytically valid for a wide range of applications, we are also certain that the answer to both of these questions is "no." It will always be important to have access to the confidential gold-standard micro-data on the RDC network. First, confidentiality restrictions make it difficult to create synthetic data on some dimensions that are required for analysis (e.g., analysis of a specific geographic location and/or sector of activity, or analysis of the workers transiting from welfare to work in a metropolitan area and the firms that employ such workers). Second, there are still many open questions on the production of synthetic data. It may be that for certain classes of models (e.g., linear models) and applications, the synthetic data completely summarize the original confidential micro-data. However, for other classes of models and applications (e.g., non-linear models), we may establish that the best available synthetic data have severe limitations. Finally, the full integration of the vast range of data for which we propose creating synthetic data is a very ambitious agenda and, by itself, will take a substantial period of time. Moreover, what we can contemplate now certainly does not reflect the future imagination and creativity of the research community in developing ideas that require even richer and more complex integrated data. Put simply, we anticipate that the research community will demand access to the gold-standard data for the indefinite future, both for the direct results from such research and for the further development of new synthetic data.
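The parallel gold-standard and synthetic analyses described above need a concrete score of inference-validity. One simple schematic, sketched here under our own assumptions rather than as the project's specified certification procedure, estimates the same quantity on the gold-standard data and on m synthetic implicates, combines the synthetic estimates with the partially synthetic combining rule of Reiter (2003) (variance T = u-bar + b/m), and measures how much the two confidence intervals overlap.

```python
import numpy as np

rng = np.random.default_rng(1)

# Gold-standard confidential data and m synthetic implicates of it.
# Both are simulated here; in practice the implicates would come from
# a posterior-predictive synthesizer.
gold = rng.normal(10.0, 2.0, size=500)
m = 5
implicates = [gold + rng.normal(0, 0.5, size=gold.size) for _ in range(m)]

def ci_from_gold(data, z=1.96):
    se = data.std(ddof=1) / np.sqrt(data.size)
    return data.mean() - z * se, data.mean() + z * se

def ci_from_synthetic(implicates, z=1.96):
    """Combine per-implicate estimates of the mean with Reiter's (2003)
    rule for partially synthetic data: T = u_bar + b / m."""
    q = np.array([d.mean() for d in implicates])
    u = np.array([d.var(ddof=1) / d.size for d in implicates])
    q_bar, u_bar = q.mean(), u.mean()
    b = q.var(ddof=1)                 # between-implicate variance
    T = u_bar + b / len(implicates)   # total variance estimate
    return q_bar - z * np.sqrt(T), q_bar + z * np.sqrt(T)

def overlap(ci1, ci2):
    """Fraction of the union covered by the intersection (0 = disjoint,
    1 = identical): one crude inference-validity score."""
    lo, hi = max(ci1[0], ci2[0]), min(ci1[1], ci2[1])
    union = max(ci1[1], ci2[1]) - min(ci1[0], ci2[0])
    return max(0.0, hi - lo) / union

score = overlap(ci_from_gold(gold), ci_from_synthetic(implicates))
print(f"CI overlap: {score:.2f}")
```

Scores near one across many estimands and model classes would be the kind of evidence the parallel RDC analyses feed back into the synthesizers; scores near zero would flag dimensions where the synthetic data fail.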
We further anticipate that, because researcher access to the confidential micro-data is so essential to the Census Bureau's quality assurance for its public-use products, particularly public-use synthetic data, the feedback relationship between RDC-based research and the Census Bureau's statutory mission will substantially enhance their cooperation and sense of common mission. The fully integrated synthetic data production and access system is described in detail below. The key point of this overview is that the range of data products actively used by the scientific research community would be dramatically enhanced by public-domain (and perhaps some restricted-access) synthetic micro-data.

IV. A Selection of Research Projects from the RDC Network and LEHD

Most of the RDCs have, since their inception, pursued matching projects with highly confidential administrative and Census survey data sets. Indeed, most indicated that they would pursue this objective as part of the initial NSF grant establishing the RDC. While there are substantial logistical, legal, and administrative hurdles to overcome, these projects have nonetheless been undertaken, their results disseminated to the wider community, and use made of the results in both state and Census data programs. Further, recent work has looked beyond the matching of administrative and Census data to data records from engineering and sociology, promoting a highly effective interdisciplinary center of study for researchers on the west coast of the US (Table 1). The LEHD Program was created in 1998 explicitly for the purpose of longitudinally integrating Census household and economic data using administrative records.
Its flagship products include the integration of quarterly unemployment insurance wage records from twenty-two states (covering 60% of the workforce) with quarterly micro-data on establishments from the ES-202 system, demographic data from the Social Security Administration and Census 2000, and economic data from the Business Register, Economic Censuses and Annual Surveys. The flagship public use product is the Quarterly Workforce Indicators (http://lehd.dsd.census.gov). A research team at LEHD is developing synthetic data based on the 1990-1996 Surveys of Income and Program Participation, complete W-2 earnings histories, and complete Social Security Administration benefit and disability histories. The "results of prior NSF research" section of the proposal contains a catalogue of the data products produced by this program.

Table 1. Selection of RDC-based Research Projects with Data Matching

| Project No. | Principal Investigators | Title | Data matched |

Chicago RDC
| 299 | Mazumder, B. (FRB Chicago) | Using Earnings Data from the SIPP to Study the Persistence of Income Inequality in the U.S. | SIPP to SSA-DER and SSA-SER |
| 275 | Lubotsky, D. (U of Illinois) | How Well Do Immigrants Assimilate into the U.S. Labor Market? | March CPS and SIPP to SSA-SER |
| 406 | Heckman, J., et al. (U of Chicago, Argonne Natl Labs, FRB Chicago) | Hedonic Models of Real Estate and Labor Markets | ACS and other surveys, linked internally using GIS |

Boston RDC
| Pending | Liebman, J. (Harvard U) | Measuring Income and Poverty from a Multi-year Perspective | SIPP to SSA-SER, MEF, MBR, SSR, and proposed link to LBD |

Research Triangle RDC
| 345 | Sloan, F., Ridley, D. (Duke U) | Alcohol Markets and Consumption | LBD, BR, HRS to Behavioral Risk Factor Surveillance System Survey, CARDIA |
| 418 | Sloan, F., et al. (Duke U & Research Triangle Institute) | Markets for local sports and recreation facilities | LBD, LRD, Economic Census, American Housing Survey, Decennial Census, Census of Government, phone directories |
| 196 | Chen, S. & van der Klaauw, W. (Duke and NC State) | An Evaluation of the Impact of the Social Security Disability Insurance Program on Labor Force Participation in the 1990s | SIPP matched to SSA data |

California RDC
| 429 | Card, D., et al. (UC Berkeley) | Using Matched Employee Data to Examine Labor Market Dynamics and the Quality of DWS/CPS Data | State UI records matched to DWS |
| 70 | Hildreth, A. & Hotz, V. J. (UCB, UCLA) | Income receipt from welfare transitions | SIPP matched to State UI data and MEDS |
| 269 | Simcoe, T., & Rawley, E. (UC Berkeley) | Taxis and Technology: Contracting with Drivers and the Diffusion of Computerized Dispatching | Census of Taxicab & Livery Companies to data from hardware firms supplying technology for computerized taxicab dispatch |

New York RDC
| Pending | Lu, H. H., & Bennett, N. G. | Earned Income Credit pickup rate (temp) | Decennial Census to IRS data |
6347895 Project Description and Coordination Plan Page 6
| Pending | Goldstein, J. and Bennett, N. G. | Multiple race reporting in the 2000 Census (temp) | 2000 Decennial Census to 1990 Decennial Census |
| Pending | Freeman, L. | Spatial assimilation of immigrants (temp) | NYCHVS to tract data from 1990 and 2000 Decennial Census |
| Pending | Caplin, A., & Leahy, J. | Family characteristics and health outcomes (temp) | 1990 and 2000 Decennial Census |
| Pending | Ellen, I. G., et al. | School choices of native-born and immigrant students (temp) | 2000 PUMS to administrative data |
| Pending | Mollenkopf, J., Kasinitz, P. | How do the dynamics of multi-ethnic families differ from immigrant and other families? (temp) | ISGMNY matched to 2000 Census |
| Pending | Groshen, E. (FRB-NY) | Alternative aggregate measures of wages (temp) | LRD, LBD with Community Salary Surveys for 3 cities |

Michigan RDC
| 409 | Carter, G. R. III, Young, A. A. Jr. (U of Michigan) | From Exclusion to Destitution: Residential Segregation, Affordable Housing, and the Racial Composition of the Homeless Population | Decennial Census to 1996 National Survey of Homeless Assistance Providers and Clients |
| 419 | Morash, M., Bui, H., Zhang, Y. (Michigan State U) | Race, Ethnicity, and Sex Differences in Estimates and Explanations of Crime Victimization and Crime Reporting | 2000 Decennial Census to National Crime Victimization Survey |
| 13 | Prestergard, D. M. (Cleveland State U) | Evaluating the Economic Impact of an Economic Development Program: Measuring the Performance of the Manufacturing Extension Program | LRD to NIST/MEP |
| Pending | Shapiro, M. | Matching ISR Surveys to Census Data | HRS and PSID to LBD and LEHD |

Glossary:
BLS: Bureau of Labor Statistics
BR: Business Register (formerly the SSEL)
CPS: Current Population Survey (U.S. Census)
DWS: bi-annual survey administered by the U.S. Census Bureau for the BLS with the aim of understanding the factors affecting the displacement of workers from their previous employment
HRS: Health and Retirement Survey
ISGMNY: Immigrant Second Generation in Metropolitan New York. A random sample survey of 4,000 individuals aged 18-32 from five immigrant backgrounds (Dominican, South American, West Indian, Chinese, and Russian) and three native-born backgrounds (Puerto Ricans, African-Americans, and white Americans) taken at the same time as the 2000 Census.
LBD: Longitudinal Business Database
LRD: Longitudinal Research Database (predecessor to LBD)
MBR: Master Beneficiary Record (SSA)
MEF: Master Earnings File (SSA)
MEDS: State of California Medicaid Records (State of California)
NIST: National Institute of Standards and Technology
NIST/MEP: NIST's Manufacturing Extension Partnership
NYCHVS: New York City Housing and Vacancy Survey
PSID: Panel Study of Income Dynamics
PUMS: Public Use Micro Sample of the Census
SER: Summary Earnings Records (SSA)
SIPP: Surveys of Income and Program Participation (U.S. Census)
SSA: Social Security Administration
SSR: Supplemental Security Record (SSA)
GIS: Geographic Information Systems
ACS: American Community Survey

V. Other RDC-based Proposed Research Activities

Geospatial matching

Many important social science applications of micro-data involve examination of the relationship between economic agents, both businesses and households, and their surroundings. The impact of location-specific characteristics on behavior is the subject of numerous studies, most commonly characteristics involving government policy and geopolitical boundaries, e.g., governmental tax districts, air quality regulatory zones, etc. Hedonic models have been applied to real estate, where housing characteristics include school district quality or other location-related (dis)amenities. Business clustering and geographical markets are just two relevant industrial organization topics. Racial discrimination is a major social science topic where geo-spatial identification is critical. All of these studies require, to varying levels of detail, the ability to match the individual micro-data response to its physical location within a geographic information system (GIS). Robust application of GIS tools to micro-data would allow information to be generated for each establishment or household on the physical proximity of that economic agent to the study-relevant components. For example, for a study of business clustering and geographical markets, the amount of geo-spatial information that must be matched to each micro record can be large, e.g., distances to other nearby supporting or competing businesses or to relevant households (markets). Other important geo-spatial relationships include the proximity of physical infrastructure, e.g., roads, rail spurs, etc. This creates a potentially large "many to many" assignment problem.
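The many-to-many proximity assignment described above can be sketched with a simple great-circle computation. This is only an illustrative toy (the establishment tuples and the brute-force search are invented for the example); a production system would use a spatial index and distribute the record-level loop across the cluster.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlam / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def nearby(establishments, radius_km):
    """Brute-force many-to-many assignment: for each establishment
    (id, lat, lon), collect the ids of all others within radius_km.
    O(n^2), so each record's search is an obvious unit for parallel work."""
    out = {}
    for i, (id_i, lat_i, lon_i) in enumerate(establishments):
        out[id_i] = [
            id_j
            for j, (id_j, lat_j, lon_j) in enumerate(establishments)
            if i != j and haversine_km(lat_i, lon_i, lat_j, lon_j) <= radius_km
        ]
    return out
```

Note that the choice of `radius_km` is exactly the modeling question raised in the text: what counts as "nearby" is itself an object of analysis.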
The notion of how to define "nearby" is a subject of analysis in itself. Creating these geo-spatial characteristics, including those that are geo-physical, requires substantial computing power and would be amenable to parallel processing.

Agent-based modeling and simulation

Agent-based modeling and simulation (ABMS) uses micro-scale rules to computationally determine macro-scale patterns (Bonabeau 2002). ABMS is primarily used for highly non-linear systems that are difficult to model using traditional techniques. The focus of ABMS in social science is designing artificial adaptive agents that behave like human agents (Arthur 1991, Holland and Miller 1991). In this case, the micro-rules describe the behavior of individual people and the macro-scale outcomes include results such as market prices, supply availabilities and group consensus. ABMS has been successful in deriving meaningful macro-scale results from micro-scale data in a variety of contexts ranging from medical care units (Riano 2002) to energy markets (North 2001). ABMS systems operate at the level of individual behavior, which makes such models easily factorable for parallel execution (Resnick 1994). ABMS systems factored in this way can make full use of available cluster computers. The non-linear behavior embedded in most ABMS systems means that this approach is highly appropriate for the analysis of sensitive personal micro-data: the complexities of the non-linear outputs from ABMS, combined with other standard Census checks, will effectively mask all identifiable personal information.

VII. Organization of CES, the National RDC Network and LEHD

Three inter-related Census Bureau activities lie at the heart of this proposal. The oldest of these is CES, whose acting director Ron Jarmin is a Co-PI of this proposal. CES is the archival repository of the Census Bureau's main economic sampling frame, census and survey products.
Researchers at CES have built and maintained historical Business Registers (formerly called the Standard Statistical Establishment List, SSEL), Economic Censuses (all sectors authorized for census by the Department of Commerce), Annual Surveys, and many specialized economic surveys. CES also administers the Census RDC network. Many of the data files built by CES can be used only in RDCs. A description of the CES-maintained files, many of which originate in other parts of Census, can be found at http://www.ces.census.gov/ces.php/home. Over the last 20 years CES has created and improved longitudinally linked establishment-level micro data. The two best known products of this type are the original Longitudinal Research Database (LRD), which contains longitudinally linked records from the Census of Manufactures and the Annual Survey of Manufactures, and the Longitudinal Business Database (LBD), which contains longitudinally linked records from the Business Register/SSEL. These flagship CES products are an important component of the research program in this proposal. Research using these data was some of the first to identify the tremendous amount of dynamism in the US economy (Davis and Haltiwanger 1990, 1992). Most of the data linking projects described above use the LBD to bridge to other Census business data. Synthesizing a version of the LBD is a high priority for the proposal team. CES operates the RDC-network and employs the administrator in each of the 9 RDCs. The RDC-network, however, is a much broader activity. It is held together by the RDC directors, who coordinate the activities at their own facility and across RDCs using regularly scheduled conference calls. The process of nurturing proposals and educating potential users about the Census data and its research uses is a collaborative effort of the RDC directors, administrators, review panels and Census staff (at CES, LEHD and in other areas).
The RDCs hold an annual meeting at one of the RDC locations to discuss management and research issues. The LEHD Program at Census was founded in 1998 with seed money from the Census Bureau, the National Science Foundation, the National Institute on Aging and the Russell Sage Foundation. In five years it has grown from a staff of 1 (plus senior research fellows Abowd, Haltiwanger and Lane) to a staff of 28 (some are regular Census employees, others are contractors). LEHD is the part of Census charged with creating integrated micro-data that combines information from the Census Bureau’s Demographic and Economic directorates, usually with the essential use of administrative records. The flagship product of LEHD is a set of infrastructure files described in the “results from prior NSF research” section. Research using these data has begun to identify the consequences of a changing economy on both firms and workers (Bowlus and Vilhuber, 2002; Abowd et al. 2002, Abowd, Lengermann and McKinney 2002, Andersson et al. 2002). Because of the joint efforts of CES and the RDC directors, much of the framework for implementing the projects at the heart of this proposal is already in place. But, the maintenance of the RDC network and, therefore, of the essential organization of this proposal requires recognition of the laboratory role of the RDCs and their sustained funding. Every RDC has contributed at least one member of the senior scientific team in this proposal. Every RDC has written a letter of support for the research program proposed herein or joins the proposal as one of the directly funded work units. Finally, two of the proposal’s senior scientists have been senior research fellows at LEHD since its inception. CES, the RDCs and LEHD are now functioning as a well-integrated scientific network. This network is absolutely essential for the research activities described herein and the continued viability of Census’s public-use data products. 
Census Bureau Access Modalities

Research using confidential Census data is driven by the RDC access modality. Researchers work over a dedicated, encrypted connection to the Census-operated RDC computing network (in Bowie, MD). Researchers must perform their work in the supervised space operated by the RDC. The RDC administrator must conduct a detailed disclosure avoidance control before any results (final or intermediate) may be physically removed from the RDC. From the researcher's perspective, beyond the obvious cost of traveling to an RDC, there are many other barriers to access. The typical researcher has limited knowledge of the requirements for access and even more limited knowledge about the micro data. This limited knowledge often requires multiple proposal submissions before the researcher satisfies the legal and feasibility requirements for working with the confidential micro-data. Moreover, given the nature of the micro data, there are substantial learning costs once access has been granted. These barriers to access can be reduced somewhat with education, training and enhanced documentation. However, the current proposal offers the possibility of substantial reductions in the costs of access and, in turn, much greater access by the research community. To understand these benefits, it is worth elaborating on each barrier to access and how this proposal can help alleviate key aspects of these costs. First, the most important point to emphasize is that access to the confidential micro data can only be for valid Title 13, Chapter 5 statistical purposes. While CES and Census have developed extensive documentation of valid Title 13 purposes and the RDC-network offers assistance and training on Title 13, it is fair to say that the research community still struggles to understand fully the nature and importance of this part of the U.S. Code that provides the statutory basis for the Census Bureau's operations.
As such, many proposals to CES do not adequately state their Title 13 purposes. While many of these projects have potential Title 13 justification, the typical proposal requires several iterations to meet these requirements. A second, related, point is that in the typical proposal received by CES, the researchers are not sufficiently familiar with the pitfalls and nuances associated with analyzing the micro-data. Researchers often must revise proposals to address feasibility and core measurement issues. A third point is that since most of these micro-data have not been developed as public use files, the editing, imputation and documentation of these data (especially business establishment data) do not meet the standards of conventional public-use data like those associated with household surveys. CES has partially remedied this shortcoming with substantial documentation and data editing, but there has been less standardization of the treatment of missing/erroneous data than for Census public-use micro-data products. Instead, individual statistical/research projects have made their own idiosyncratic data adjustments. "Knowledge" capital has been accumulated over time and shared among users, but researchers still experience substantial learning costs in working with the confidential micro-data. Our proposal will dramatically increase the accessibility of the micro-data at Census by creating a multi-layered access system that gives researchers access to public-use synthetic data, substantially lowering the learning and proposal preparation costs of access to the gold-standard data. In the next section, we outline this multi-layered access system. We conclude with implementation and dissemination issues.

VIII. Multi-layered Confidentiality Protection

A.
Public use synthetic data: These are micro data files constructed by synthesizing the underlying confidential data using an appropriate probability distribution. Though the files would contain synthetic or virtual data, not real data, it is possible to draw valid inferences about the unit of observation. The synthetic data would share many of the same statistical properties as the real data. (Rubin 1993, Raghunathan et al. 2003)

B. Multiply-imputed partially synthetic data: This is a variant on approach A, sometimes referred to as pseudosynthetic data, where synthetic data are generated using a probability distribution that conditions on actual values of the confidential micro data; that is, the synthetic observation has a (potentially) identifiable source in the original confidential data. This approach requires additional procedures to assure confidentiality protection. (Little 1993, Abowd and Woodcock 2001, Reiter 2003)

C. Gold-standard confidential data: These files contain micro data prepared and documented to public use standards with complete missing data and data edit models. Gold-standard confidential data are the normal product released for use in RDCs, for approved projects. These data are Title 13 (and sometimes Title 26 or Title 42) confidential. The gold-standard files are used to support the public use files by allowing rapid, straightforward verification of statistical results generated from the first two methods. The feedback between the public use files and the gold-standard files continually improves the quality of the public use products.

D. Raw confidential data: These are the original archival data files upon which the other layers are built. In this case, confidentiality would result from the restrictions put on their use rather than from features built into the files themselves. That is, the files could be accessed at RDCs if Title 13, Chapter 5 benefits were sufficient.
Synthetic data

Synthetic data (also called virtual data) originated in confidentiality protection papers by Rubin (1993) and Fienberg (1994). A synthesizer estimates the posterior predictive distribution of the confidential data, draws a synthetic population from it, and then samples from that synthetic population; the synthetic samples can be designed for ease of analysis. Repeating the process of synthesizing the population and sampling from the synthetic population creates a set of M synthetic samples. Raghunathan et al. (2003) provide appropriate Bayesian formulae for the statistical analysis of synthetic samples. Synthetic data have the property that observations in the synthetic sample bear no necessary relation to observations in the actual sample, the one from which the posterior predictive distribution was estimated. Some techniques exclude the originally sampled units when constructing the synthetic samples. Other techniques allow the originally sampled units to occur in a synthetic sample if they are selected for that sample. All data in the synthetic samples are draws from the posterior predictive distribution. Hence, no observations in the synthetic samples contain the actual data from the original sample, except for the variables constructed from the sampling frame information. Following Duncan and Lambert (1986), we can assess the confidentiality protection afforded by the synthetic data along two dimensions: identity disclosure and attribute disclosure. Identity disclosure is protected so long as the synthetic observation, which normally contains information from both the sampling frame and the posterior predictive distribution, cannot be used to re-identify the corresponding unit in the population. For synthetic samples of households, identity disclosure protection is straightforward to provide. The standard technique is to coarsen enough of the information in the sampling frame variables to ensure that the most detailed tabulation of individuals or households that can be produced from the synthetic population has multiple units (usually at least three) in each populated cell.
Alternative techniques include swapping and other confidentiality edits. We note that protection against identity disclosure is necessitated by the presence of variables derived from the sampling frame information. Since conventional public use files always coarsen this information as a standard confidentiality protection, the identity disclosure protection problem in synthetic household data is not generally considered a technically challenging problem. Attribute disclosure protection, however, remains a potentially important problem for synthetic household samples. An attribute disclosure occurs if a confidential characteristic of a unit in the synthetic sample can be correctly associated with the actual unit in the population. Duncan and Lambert (1989), Lambert (1993), and Trottini and Fienberg (2002) model the control of attribute disclosure using a loss function for the snooper's estimates of sensitive attributes. Attribute disclosure risk is assessed based on that loss function. Duncan and Keller-McNulty (2000) provide a Bayesian framework for assessing the attribute disclosure risk of masked and synthetic data that use the inverse squared error as an estimate of the snooper's loss. Reiter (2004) applies these methods to the attribute disclosure risk control problem arising from synthetic data.

Following Duncan et al. (2001) we recognize that there is a tradeoff between the usefulness of a data product and the disclosure risk that the product presents. Efficient data dissemination methods seek the production possibility frontier of this tradeoff. Hence, they deliberately compromise data usefulness for confidentiality protection, for a given level of resources devoted to collection and dissemination. We consider here the applicability of this tradeoff to synthetic household data. We elaborate on these concerns below in our discussion of layered confidentiality protection.
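The inverse-squared-error snooper loss mentioned above (in the spirit of Duncan and Keller-McNulty 2000) can be made concrete with a one-line risk proxy. This is a hedged illustration, not the cited authors' exact estimator: the closer a snooper's estimate of a sensitive attribute is to the true value, the higher the assessed disclosure risk.

```python
def snooper_risk(estimate, true_value, eps=1e-9):
    """Attribute-disclosure risk proxy: the reciprocal of the snooper's
    squared error.  eps guards against division by zero when the
    snooper's estimate is exactly right (infinite risk in the limit)."""
    return 1.0 / ((estimate - true_value) ** 2 + eps)
```

A data product that forces the snooper's best estimate further from the truth lowers this risk measure, at some cost in data usefulness, which is exactly the tradeoff discussed in the text.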
The usefulness of a set of M synthetic samples of household data is determined by the valid inferences that can be drawn from the posterior predictive distribution. Raghunathan et al. (2003) and Yancey et al. (2002) have suggested related criteria for making this assessment. There are two essential issues, common to all statistical methods: bias and precision. An inference based on synthetic data is biased if it differs substantially from inferences based upon actual samples. The precision of an inference based upon synthetic data depends only upon the between-sample variability in the estimated quantity.1 If one could be relatively certain of controlling bias in the synthetic data, then the precision of inferences could always be improved by generating additional synthetic populations to sample from. One cannot, unfortunately, easily control bias in synthetic data. This problem is not unique to synthetic data: in confidentiality protection systems for tabular data, primary and complementary cell suppression are known to introduce substantial bias that synthetic data may eliminate without decreasing the precision of inferences (Dandekar and Cox 2002). Controlling the bias in synthetic data requires feedback from the analysis to the synthesizer. The analyst must regularly ask for an assessment of the inferential bias based on analysis of the underlying confidential data. The data synthesizer must collect these assessments and modify the methods for estimating the posterior predictive distribution accordingly. For synthetic samples of businesses, identity disclosure protection is problematic even in synthetic data. Most business sampling frames (including all of those used by the Census Bureau) have self-representing ultimate sampling units (called "certainty cases" in Census samples). Re-identification of a certainty case in a synthetic sample is trivial; however, re-identification of a unit in a business sample is not always an identity disclosure.
For example, in the Census Bureau’s County and Zipcode Business Patterns the existence of unique establishments in certain geographic and/or industrial classifications is published. We consider the problem of identity disclosure protection in synthetic business samples to be a high priority in our proposed research program. Attribute disclosure protection for business samples is also complicated by self-representing and large (in some sense) units. Standard practice (e.g., County and Zipcode Business Patterns, Current Employment Statistics) requires coarsening or suppressing attributes when there is a risk of attribute disclosure at the publication level. While the technical problems of identity and attribute disclosure control in synthetic data can be addressed by known methods in statistics and substantial computing power, two important perception issues remain. First, the research community is justifiably suspicious of synthetic data because the properties are not yet well-studied and because the use of this technique may further limit access to underlying confidential data without broadening general data access. Second, statistical agencies are justifiably suspicious of synthetic data because they must declare simultaneously that confidentiality protections applied to the synthetic data have eliminated all actual responses from the public use product but the data themselves can still be used to make valid statistical inferences. We are fully cognizant of both of these problems in our proposal. The layered confidentiality protection system is designed to address the researchers’ concerns about inference validity and improved access to confidential data. As we show below, current practice is inefficient: more data usefulness can be achieved at no additional confidentiality risk by using the mixture of data products that we propose. The statistical agencies’ problem requires some education of the data-using public. 
The production of synthetic data requires a substantial investment of resources by the data provider. Essentially all of the steps necessary to produce public-use micro data must be performed on the underlying confidential micro data. In addition, multiple methods of estimating the posterior predictive distribution must be developed. These estimation methods must be designed to allow subject matter concerns to influence the posterior distribution just as they would have influenced data editing and variable creation in a standard public use file. These methods must also be designed to allow the data to "speak for themselves." The posterior predictive distribution is an elaborate joint probability distribution among all of the confidential variables in the original sample, given the data in the sampling frame. Needless to say, the estimation of such a distribution is challenging even accounting for the massive computing power that can be harnessed for the effort. Researcher skepticism is just one manifestation of Shannon's (1948) information principle. The originally collected data represent information about the target population to the extent that they are not predictable from less information about that same population. Synthetic data consist of draws from an estimated posterior predictive distribution conditional on a relatively small amount of information about the population; namely, the data in the sampling frame. The information in the original sample is transmitted to the synthetic data via this posterior distribution.

1 Raghunathan and Rubin (2000) demonstrate that the correct precision measure for the pure synthetic data case does not include a term for the within-sample variance of the estimated quantity.
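To make the posterior-predictive mechanism concrete, here is a deliberately minimal sketch of drawing M fully synthetic samples under a toy univariate normal model with a flat prior on the mean. The model, variable, and function names are all invented for the illustration; a real synthesizer must model the full joint distribution of every confidential variable given the sampling frame.

```python
import numpy as np

def synthesize(y, M=5, seed=0):
    """Draw M fully synthetic samples of a single confidential variable y.
    Toy normal model: draw a parameter from its posterior (flat prior,
    plug-in variance), then draw n posterior predictive values."""
    rng = np.random.default_rng(seed)
    n = len(y)
    ybar, s2 = np.mean(y), np.var(y, ddof=1)
    samples = []
    for _ in range(M):
        mu = rng.normal(ybar, np.sqrt(s2 / n))            # posterior draw of the mean
        samples.append(rng.normal(mu, np.sqrt(s2), size=n))  # predictive draws
    return samples
```

No synthetic record is a copy of a confidential record; every value is a draw from the estimated predictive distribution, which is the property the text emphasizes.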
Researcher skepticism is justified to the extent that the estimated posterior predictive distributions used to synthesize the data might hide important relations that a direct use of the confidential data would reveal. This is especially important for biased negative results from the synthetic data, since such results may discourage further research. As we make clear in our discussion of the layered confidentiality protection system, feedback from users of the synthetic data to the data synthesizer is essential to develop research confidence in these products and to ensure their continuous improvement. This feedback is designed to estimate and reduce the bias in the synthetic data while maintaining the maximum feasible level of precision. Researcher skepticism based on the intrinsic difficulty of producing forecasts for variables in most micro databases is misplaced. A draw from the posterior predictive distribution is not a "forecast" in this sense. A better analogy is a stochastic simulation. It is straightforward to provide examples in which the posterior predictive distribution perfectly reproduces the information in the underlying confidential micro data (see, e.g., Fienberg et al. 1998). These examples tend to be for much simpler problems than one encounters in practice. The crux of the researcher concern about synthetic data removing too much information from the underlying confidential data lies in the implementation of data synthesizers using variants of Gibbs samplers or sequential multivariate regression models (see Raghunathan et al. 2001 and 2003). These implementations rely on having good quality conditional distributions for every variable (continuous and discrete) and every distinct cell of the sampling frame. An important part of our research program focuses on the interaction of data users and data providers to refine this part of the data synthesizer. Testing of nonlinear models and of models with important timing discontinuities is critical.
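The sequential-regression idea behind these synthesizers can be sketched in a few lines. This is a drastically simplified, continuous-variables-only stand-in for SRMI (the function name and the linear-Gaussian conditional models are assumptions of the example; Raghunathan et al. use a generalized linear model appropriate to each variable's type):

```python
import numpy as np

def srmi_synthesize(X, sweeps=3, seed=0):
    """One partially synthetic copy of the n-by-p array X via a simplified
    SRMI cycle: regress each column on all the others, then replace it
    with draws from the fitted (normal) predictive distribution."""
    rng = np.random.default_rng(seed)
    Z = X.astype(float).copy()
    n, p = Z.shape
    for _ in range(sweeps):
        for j in range(p):
            others = np.delete(Z, j, axis=1)
            A = np.column_stack([np.ones(n), others])     # design: intercept + other columns
            beta, *_ = np.linalg.lstsq(A, Z[:, j], rcond=None)
            resid = Z[:, j] - A @ beta
            sigma = resid.std(ddof=A.shape[1])            # residual scale
            Z[:, j] = A @ beta + rng.normal(0.0, sigma, size=n)  # predictive draws
    return Z
```

Repeating the call with different seeds yields the multiple implicates; the quality of the result hinges entirely on how well the per-variable conditional models fit, which is the concern the paragraph above raises.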
Although these techniques are not yet in widespread use, the Survey of Consumer Finances (Kennickell 1998) makes use of multiple imputations for confidentiality protection using methods very similar to those we propose here.

Intermediate public use and confidential data products

As promising as synthetic data are for the front end of a layered confidentiality protection system, we also propose developing a series of related intermediate products that combine some features of synthetic data with traditional confidentiality protection systems and with newer micro-data-based confidentiality protection systems that are based on other probability models. The term "masking" is used in many contexts (see Domingo-Ferrer and Torra 2001 and Abowd and Woodcock 2001), and there is not yet a standard vocabulary for these products in the statistical disclosure limitation literature. Normally, masking means some variant of the matrix mask (see Domingo-Ferrer and Torra 2001) for linear methods. For nonlinear methods there is no general definition of which we are aware. Whenever masking is used as part of confidentiality protection, one must take care to understand the precise definition. The term "swapping" is used more consistently. Swapping generally means exchanging all the values of a set of variables between two different observations conditional on a match for another set of variables, as is done in the 1990 and 2000 Decennial Census public use micro-data samples. Swapping is a special case of a matrix mask. The term "shuffle" means reusing all the values of every variable but not exchanging them in pairs: the values from one source observation are shuffled to many different destination observations (see Sarathy et al., 2002).
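The swap just defined (exchange a set of variables between records that agree on a set of match variables) can be sketched as follows. The function, column names, and pairing-by-random-permutation scheme are illustrative assumptions, not the Census Bureau's production confidentiality edit:

```python
import numpy as np
import pandas as pd

def swap_within_cells(df, match_vars, swap_vars, seed=0):
    """Simplified data swap: within each cell defined by match_vars,
    randomly pair records and exchange their swap_vars values.
    Marginal distributions are preserved exactly; the record-level
    link between match_vars and swap_vars is broken for swapped pairs."""
    rng = np.random.default_rng(seed)
    out = df.copy()
    cols = [df.columns.get_loc(c) for c in swap_vars]
    for idx in df.groupby(match_vars).indices.values():
        idx = rng.permutation(idx)
        half = len(idx) // 2
        a, b = idx[:half], idx[half:2 * half]
        out.iloc[a, cols] = df.iloc[b][swap_vars].to_numpy()
        out.iloc[b, cols] = df.iloc[a][swap_vars].to_numpy()
    return out
```

Because swapping only permutes existing values, every released value has a true source record, which is why (as noted below for masked products generally) identity-disclosure risk still has to be assessed by conventional means.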
New masking and shuffling algorithms share two important features: (1) the resulting confidentiality-protected product has the same structure as the underlying confidential data product, and (2) every observation in the masked/shuffled data has a true source record in the underlying confidential data (although not necessarily vice versa). We can, once again, use the Duncan and Lambert identity and attribute disclosure dichotomy to assess these methods. Because there is a true source record in the underlying confidential micro data, all of these methods require conventional assessments of their ability to control identity disclosures. In general, the same coarsening and suppression techniques applied to the variables in the sampling frame can be used to measure and control this risk. Controlling attribute disclosure risk is very similar to the synthetic data case. Reiter (2004) discusses this case briefly. Abowd and Woodcock (2001) and more recently Raghunathan et al. (2003) propose data masking methods based on multiply imputing the values of every confidential data item given all other confidential and nonconfidential data, including variables from the sampling frame that are not collected as a part of the survey or administrative record extraction. Both approaches use the sequential regression multivariate imputation (SRMI) technique developed by Raghunathan et al. (2001). Both can accommodate generalized linear models for (a transform of) every variable in the underlying confidential data. Thus, both can handle a mixture of discrete and continuous data. Abowd and Woodcock focus on methods that can be applied to linked survey and administrative data with multiple populations/sampling frames. Raghunathan et al. focus on single population methods; however, it is clear that their methods could be extended to linked data with multiple populations.
The basic difference between the two approaches is that Abowd and Woodcock take a draw from the conditional posterior predictive distribution for each confidential variable given the realized values of all of the other variables (confidential and nonconfidential). Hence, their preferred method, "masking" or, more precisely, "multiply-imputed partially synthetic data," must be subjected to further analysis to assess its ability to control identity and attribute disclosure. Reiter's (2003) partially synthetic data has the same feature, and he derives the correct standard error formula. The confidentiality protection system proposed by Raghunathan et al. (2003) uses the SRMI technique to take a draw from the complete posterior predictive distribution of the confidential data given the nonconfidential variables (e.g., information from the sampling frame). Thus, the Raghunathan et al. technique is a full synthetic data method. We propose to use both methods in our layered confidentiality protection system. The Abowd and Woodcock and Reiter methods have already been specialized to complex integrated statistical systems like those LEHD and CES routinely produce. For this reason, several of the PIs have already undertaken extensive development work on potential public use files based on these methods. First, work is already underway developing a prototype multiply-imputed synthesizer for the core LEHD infrastructure files, described below in the "Results from previous NSF research" section. This confidentiality protection system uses the SRMI method.
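Reiter's (2003) combining rules for partially synthetic data, mentioned above, can be sketched directly. Given an estimate and its variance computed on each of the m implicates, the point estimate is the mean of the per-implicate estimates and the total variance adds the within-imputation variance to the between-imputation variance scaled by 1/m (the function name is ours; the formulas follow Reiter 2003):

```python
import numpy as np

def combine_partially_synthetic(estimates, variances):
    """Combine results from m partially synthetic datasets:
    qbar  = mean of the per-implicate estimates,
    b     = between-imputation variance (sample variance of estimates),
    ubar  = mean within-imputation variance,
    T     = ubar + b/m  (total variance of qbar, per Reiter 2003)."""
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(variances, dtype=float)
    m = len(q)
    qbar = q.mean()
    b = q.var(ddof=1)
    ubar = u.mean()
    T = ubar + b / m
    return qbar, T
```

An analyst of the released implicates needs only these two quantities to form confidence intervals, which is what makes the multiply-imputed release usable without access to the gold-standard data.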
Second, the LEHD Program, in conjunction with an interagency committee that includes the Social Security Administration, Internal Revenue Service, Congressional Budget Office and other parts of the Census Bureau, is developing a public use file containing data from the Survey of Income and Program Participation (1990-1996 panels) and Social Security administrative/tax data (W-2 information separately by employer, Summary Earnings Records, Master Beneficiary Records, Supplemental Security Records and Form 831 Disability Records). The confidentiality protection of this public use file is particularly challenging because the SIPP source records must not be re-identifiable via the existing SIPP public use files; that is, this new public use file must be usable independently of the existing SIPP public use files. Third, Reiter has applied these techniques to data from the Current Population Survey. The development of these partially synthetic multiply-imputed files has provided much-needed experience in developing the layers of our confidentiality program. We illustrate using the SIPP-SSA files. Since this public use file is targeted at retirement and disability research for national programs, all geography has been removed from the public use portion. Of course, geography is still present in the internal files available within the secure RDC facilities. Removal of geography was necessary to limit the potential for re-identifying SIPP source records in the existing SIPP public use files. Preserving marital relations as well as basic demographic and education variables exploited conventional identity disclosure control methods to the maximum extent possible. The interagency committee concluded that linking a handful of extremely coarse demographic and educational variables from the SIPP to the massive amounts of administrative and tax data was not the most effective method of providing access to these data. As an alternative, a layered approach was adopted.
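The conventional identity-disclosure controls retained for the SIPP demographic variables amount to coarsening quasi-identifiers and then checking how many records remain unique on them. A minimal sketch (our illustration; the field names and band width are hypothetical, not the committee's actual specification):

```python
from collections import Counter

def coarsen_age(age, width=10):
    """Collapse an exact age into a band, a standard coarsening step."""
    lo = (age // width) * width
    return f"{lo}-{lo + width - 1}"

def sample_uniques(records, keys):
    """Count records that are unique on the quasi-identifying keys; sample
    uniques are the usual first-pass measure of identity-disclosure risk."""
    combos = Counter(tuple(r[k] for k in keys) for r in records)
    return sum(1 for r in records if combos[tuple(r[k] for k in keys)] == 1)
```

Records that remain unique after coarsening are candidates for further suppression before release.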
Successive confidential versions of the linked data, including a long list of proposed variables from the SIPP and all of the administrative variables from SSA (including tax data), were developed. Researchers at Census, SSA, IRS, and CBO are studying the variables in these files, deemed gold-standard files because they contain all of the original confidential data. Once the research teams are satisfied that the gold-standard files adequately provide for the study of statistical models relating the variables of interest from the SIPP and the administrative data, a variety of masked potential public use files will be produced using the methods described in this section of the proposal. The same research teams will then assess the bias and loss of precision from the various techniques. Other research teams will assess the identity and attribute disclosure risks from each of the methods. The team will then be equipped with reasonable quantitative measures of the disclosure risks, scientific biases, and losses of precision associated with feasible implementations of these new confidentiality protection techniques. It is expected that a public use product will be available within two years. Interim products include full RDC support for the gold-standard files, which contain links that permit RDC use of any variable in the existing public-use SIPPs. The SIPP-SSA public use file is not a static product. We fully expect the interaction of RDC-based researchers with the data to provide much-needed feedback to the process of variable selection and confidentiality protection for such files.

IX. Dissemination and Adoption

The cornerstone of our dissemination system is the virtual RDC, a replica of the research environment on the RDC network using synthetic data and the exact programming environment of the RDC network. The virtual RDC can be used for primary research, since the synthetic data are inference valid.
More importantly, it can be used as an incubator for proposals to analyze the confidential data. The proposal development process, which can now take more than a year, would be improved and simplified using the virtual RDC. A researcher would benefit from the fact that the structure of the synthetic data and the structure of the gold-standard confidential data were identical. The researcher would develop the proposal in the same environment as a real RDC, thus guaranteeing that the tools needed to do the modeling were available and working properly. Some of the PIs and senior scientists on this proposal have participated in the development and use of the Cornell Restricted Access Data Center (CRADC, part of CISER at Cornell). The CRADC, which was developed under the NSF Social Data Infrastructure grant that supported LEHD, is our model for the virtual RDC. Authorized users access data from authorized providers using a “window” on the CRADC machines (which appears to the user as a Windows desktop). The CRADC provides a complete research and reporting environment that fully supports collaboration among authorized users of the same data.2 Although the CRADC is a reasonable model for a virtual RDC, our proposal goes further. Real RDCs operate with “thin client” interfaces to the RDC computing network, a specialized Linux environment. The virtual RDC will provide an exact replica of the supercluster computing system that we will implement to create the synthetic data and support the complex modeling on the gold-standard and synthetic data. The Census Bureau has agreed to support an advisory panel of ten experts and users as a part of its internal disclosure research program. One of the PIs (Abowd) is participating in the organization of this group, which will already be in place at the start date of the work proposed here.
The panel's role will be to provide regular (three times per year) feedback on which data sets should be synthesized and on the quality of the synthesizers. The LEHD program also has such a panel, run under the auspices of NIA. Abowd teaches “Social and Economic Data” at Cornell University in a distance-learning-equipped classroom. He will make the class available to graduate students at any RDC host institution who are potential RDC users (thesis or RA). The course covers many of the data sets used in the RDCs. It also covers data protection law, secure access protocols (including RDCs), confidentiality protection systems, RDC proposal development, and appropriate statistical tools. The RDC coordinators on the proposal will make appropriate arrangements for their institutions to subscribe to the course using each institution's distance learning facility. If there is over-subscription, we will investigate means of expanding availability via recorded lectures and asynchronous distance learning.

X. Implementation of the Computing Environment

Five of the senior scientists on the proposal (Abowd, Raghunathan, Reiter, Roehrig and Vilhuber) have extensive experience implementing different components of the statistical software required to create the synthetic data. Bair and Boyd bring the expertise of Argonne National Laboratory in designing and implementing parallel cluster systems. They also bring the expertise required to implement the geographical linking and the agent-based modeling and simulation. The Center for Economic Studies, directed by Co-PI Jarmin, designed and manages the RDC network. LEHD and the RDCs bring significant specialized computational skills to the research effort. What is missing is a comprehensive computing environment in which to perform the proposed work. The current RDC network is a cluster of six 4-processor Xeon systems attached to a 10TB storage array network. The system runs SuSE Linux on each node. The nodes are coupled with gigabit Ethernet.
Each node runs an independent copy of the operating system. Statistical software, including SAS, Stata and other specialized programs, compilers and libraries, is maintained on the system. The computational demands of the present proposal would swamp this system. As a part of this proposal, we have assembled an industry support team that will permit us to add a supercluster to the RDC network while remaining fully compliant with the security provisions of that network. From a computational viewpoint, the programming used in the data synthesizers provides an ideal problem for configuring and optimizing a supercluster of SMP nodes with relatively many processors on each node. The reason for this choice is that the statistical programming for the data synthesizers doesn't decompose into threads the same way as problems optimized for clusters like Jazz at ANL or Velocity at Cornell. Those clusters have only a few (one to four) processors per node and many nodes. The data synthesizers rely on a statistical programming language (SAS) whose data handling and core statistical modules are taken as given. The multi-threading comes from applying the SAS MP CONNECT facility to two aspects of the problem: (1) multiple statistical models fit from the same input data and (2) parallel production of the “implicates” of the synthetic data. Optimal performance for the statistical modeling normally means keeping all the threads on the same node when they are sharing the input data set, which will be (partially) cached in the memory of the node. Using multiple nodes for this part of the processing requires creating many local copies of large input files on other nodes, which is not usually worth the time spent to do it. Optimal performance for the implicate production usually means distributing these calculations to as many nodes as needed in order to have one node working on each implicate.
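Because the implicates are independent draws, their production is embarrassingly parallel and the scheduling logic is simple. A minimal Python sketch of the fan-out (a stand-in for the SAS MP CONNECT jobs; the synthesizer body is a hypothetical placeholder):

```python
from concurrent.futures import ProcessPoolExecutor

import numpy as np

def synthesize_implicate(seed):
    """Placeholder for one full run of the data synthesizer; each implicate
    is an independent draw, so each run gets its own seed."""
    rng = np.random.default_rng(seed)
    return rng.normal(size=(4, 2))  # toy stand-in for a synthetic data set

def make_implicates(m, parallel=True):
    """Produce m implicates, one worker per implicate when parallel,
    mirroring the one-node-per-implicate strategy described above."""
    seeds = range(m)
    if parallel:
        with ProcessPoolExecutor(max_workers=m) as pool:
            return list(pool.map(synthesize_implicate, seeds))
    return [synthesize_implicate(s) for s in seeds]
```

By contrast, the within-model threads in aspect (1) should share a single node's memory, exactly as argued above.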
[Footnote 2: Technically, all of the Census products on the CRADC are “public use” files; that is, they have been approved by the Census Disclosure Review Board for general distribution.]

In prototypes we have used 64-bit nodes because the matrices built by SAS in their statistical modules routinely violate the 2GB memory limit of 32-bit systems. Since this is the part of the code we take as “given,” the solution is to use a 64-bit environment with sufficient memory, where the matrix limit problem disappears. The characteristics of the data synthesizers discussed here are based on the computations in Abowd and Woodcock (2001) and extensions programmed at Census by the LEHD staff. In view of the security requirements, which dictate that the supercluster must either run SuSE Linux or be fully compatible, and the computational requirements, which dictate that the system must run 64-bit SAS, we approached Intel, Unisys, and SAS to assemble a proposed computing environment. Intel donated 256 Itanium 2 processors. Unisys configured a modular supercluster using their ES-7000 server and Myrinet cluster interface. Each node on the supercluster has 16 Itaniums with 32GB of memory running under a single image of SuSE Linux. The proposal budget allows for the immediate implementation of a four-node cluster with this architecture (64 processors). The PIs will seek additional funding to complete the 256-processor supercluster. SAS has committed to delivering the required version (9.2 for Linux IA-64) in beta by the proposed start date of the grant. Letters from Intel, Unisys and SAS are in the supplemental documents to this proposal. Ray Bair of the ANL cluster computing team is a senior scientist on this proposal. The ANL team will implement and optimize the supercluster after it has been installed at Census. Parallel computing can dramatically reduce the time to solution as well as increase the size of problems being addressed.
Argonne National Laboratory has considerable experience configuring and operating such clusters. It currently operates 350-node and 512-node Linux clusters, which are used for large-scale applications and parallel computing research. The extensive experience gained from this computing environment across a wide range of applications will be leveraged for the Census Bureau applications. Argonne also has experience with porting complex and large-scale military, scientific and engineering simulation models to a parallel processing environment using a wide range of parallel computers (e.g., Cray, IBM, Linux Networx, NEC, SGI). Additionally, Argonne staff have a long history of forefront research on programming models for parallel applications.

XI. Results from Prior NSF Research

Dynamic Employer-Household Data and the Social Data Infrastructure, National Science Foundation, SES-9978093 to Cornell University, September 28, 1999 – September 27, 2004, $4,084,634 (PI: John M. Abowd, Co-PIs: John Haltiwanger and Julia Lane). The LEHD program is an integral part of this grant, since $2.5 million of the award was subcontracted to that program. The LEHD program has created a vast array of new micro-data and time series products at Census. These include (1) the Quarterly Workforce Indicators, a system of employment flows, job flows, and earnings measures provided at very detailed geographic and industrial levels for very detailed age and sex groups, 1990-2003. Full quarterly production began in September 2003 (see lehd.dsd.census.gov); (2) the supporting infrastructure files for the QWI: (a) the Individual Characteristics File: records for every individual in the system (approximately 65% of the US workforce) with demographic data, place of residence latitude/longitude (1999-2001), and links to other LEHD files including the 2000 Decennial Census, SIPP (1990-1996 panels), CPS (March files, 1976-2001), and Census Numident (provides sex, birth date, death date, and place of birth from SSA records).
Except for place of residence, data cover the period 1990-2001, updated quarterly; (b) the Employment History File: records for every employer-individual pair in the system with earnings data from state Unemployment Insurance wage records for CA, CO, FL, GA, IA, ID, IL, KS, MD, MN, MO, MT, NC, NJ, NM, OR, PA, TX, VA, WA, WI, and WV. Six more states have agreed to join as soon as funding is available (AK, DE, KY, MI, ND, OK). Data cover the period 1990-2003, updated quarterly; (c) the Employer Characteristics File: records for every employer in the system with full ES-202 data, summary data from the UI wage records, SIC and NAICS industry coding, federal EIN, and location geo-coded to the latitude/longitude. Data cover the period 1990-2003, updated quarterly. The Human Capital Project produces a system of micro-data files with full-time, full-year annual earnings estimates for each individual in his/her dominant employer during each year (1990-2003, updated annually). Additional variables include the LEHD estimate of the individual's human capital and the decomposed wage components upon which the estimate is based. The supporting infrastructure files for the Human Capital Project: (a) EIN/Census File Number links to the Business Register (formerly known as the SSEL); (b) EIN/Census File Number links to the 1997 and 1992 Economic Censuses; (c) establishment micro-data summarizing the distribution of human capital at in-scope establishments (those in the states affiliated with the QWI project). Integrated SIPP/Social Security Administration data: for the 1990-1996 panels, information from every Detailed Earnings Record (W-2) for every respondent providing a valid SSN, plus Social Security benefit and application information from the Master Beneficiary Record, the 831 File, and the Supplemental Security Record. These data are being used to prepare files for RDC use and to develop a new public use file.
The Sloan Project: a collaborative study with five of the Sloan Industry Centers (steel, food retail, trucking, semiconductors, and software). The objective is to provide a better understanding of productivity, job and wage outcomes at the micro and industry levels and, in so doing, help Census update its measurement methodology. The project combines the expertise of Sloan Industry Center experts with the LEHD infrastructure data and, in turn, with the expertise in the Census industry operating divisions. This project takes a bottom-up approach, as the Sloan industry experts have considerable knowledge of individual businesses and associated trends in their industries. Integrated IRS/DOL Form 5500 data: records from the public use Form 5500 have been integrated into the LEHD business file system using the federal EIN and supplementary information. Data are being developed on employee benefit plan coverage. A detailed geocoding system capable of geo-coding business and residential addresses to the latitude/longitude from 1990 forward. California Census Research Data Center, SES-9812174, to the University of California at Berkeley, September 1, 1998 – August 31, 2004, $687,300. New York Research Data Center, SES-0322902, to the National Bureau of Economic Research, August 1, 2003 – July 31, 2004, $300,000. Michigan Census Research Data Center, SES-0004322, to the University of Michigan, August 31, 2002 – August 31, 2004, $300,000. Triangle Research Data Center: Supplement, SES-9900447, to Duke University, July 1999 – June 2002, $300,000, and renewal July 2002 – 2004, $200,000. Expanding Access to Census Data: The Boston Research Data Center (RDC) and Beyond, SBR-9311572, to the National Bureau of Economic Research, September 1, 1993 – July 31, 1997, $404,991. Boston Research Data Center – Renewal, SBR-9610331, to the National Bureau of Economic Research, May 15, 1997 – April 30, 1999, $188,080.
Chicago Research Data Center, SES-0004335, to Northwestern University, August 2001 – August 2004, $300,000. Each of these grants has provided basic support to the operations of the respective RDCs. Research conducted in these RDCs has led to over 100 research papers in economics, sociology, public, environmental, and occupational health, urban studies, and political science (see Research Produced at NSF-Supported RDCs in the References Cited section). There are dozens of active research projects, as well as a pipeline of proposals under review, in all these fields at these RDCs. National Bureau of Economic Research Summer Institute Supplement grant SES-0314087 (to SES-9911686) to the National Bureau of Economic Research, April 1, 2003 – September 30, 2003, $10,000, funded a research conference in Washington, D.C. that brought together the researchers involved in the heretofore distinct LEHD and RDC projects with their sponsors and collaborators at Census. This conference provided the impetus for the increased cooperation and integration of these two Census micro-data projects necessary to undertake the present proposal. Workshop on Confidentiality Research, National Science Foundation Grant SES-0328395, awarded to the Urban Institute, June 1, 2003 – May 31, 2004, $43,602 (PI: Julia Lane, Co-PIs: John Abowd and George Duncan), supported a workshop that was conducted in June 2003 and issued a report to NSF.

Coordination Plan

Specific roles of PI, Co-PIs, and Other Senior Personnel

The plan coordinates the activities of the nine RDCs, the LEHD Program, and the Argonne National Laboratory, Cornell, Carnegie-Mellon, Duke, and Michigan teams as they relate to the activities of this proposal. Each RDC and the LEHD Program has a designated senior scientist whose role is to coordinate these activities. Their roles are listed below.

• John M.
Abowd, Cornell University, general grant coordinator and Cornell RDC coordinator; directs the Cornell and LEHD components of the synthetic data collaboration.
• Ron S. Jarmin, U.S. Bureau of the Census, CES RDC coordinator; directs the CES component of the synthetic data collaboration and supervises the Census staff who operate the RDC network.
• Trivellore E. Raghunathan, University of Michigan, directs the Michigan component of the synthetic data collaboration.
• Stephen R. Roehrig, Carnegie-Mellon University, directs the CMU component of the synthetic data collaboration. Since CMU's RDC will close before this grant work starts, it is not one of the supported RDCs. Roehrig will do algorithmic development on his Carnegie-Mellon workstation. These algorithms will be tested using the Cornell Restricted Access Data Center (initially) and the virtual RDC (when ready). Tested algorithms will be transmitted to the RDC network through the Cornell RDC, where the administrator will be the Census-mandated human interface. Roehrig has a limited travel budget that will permit occasional trips to the Washington RDC to work on site.
• Matthew D. Shapiro, University of Michigan, Michigan RDC co-coordinator.
• Neil Bennett, CUNY Baruch, New York Baruch RDC coordinator.
• Gale Boyd, Argonne National Laboratory, Chicago RDC coordinator; directs the geo-spatial integration and agent-based modeling and simulation research components of the proposal. Boyd's research component will be accomplished by collaborating with other Chicago RDC-based researchers. The support of the Chicago RDC through its lab fee includes support for this coordination activity. See the Cornell budget justification for a general explanation.
• Marjorie McElroy, Duke University, Duke RDC coordinator.
• Wayne Gray, Clark University, Boston RDC coordinator.
• John Haltiwanger, University of Maryland, LEHD coordinator.
• Andrew Hildreth, University of California, California Census RDC coordinator (includes both Berkeley and Los Angeles).
• Margaret C. Levenstein, University of Michigan, Michigan RDC co-coordinator.
• Jerome P. Reiter, Duke University, directs the Duke component of the synthetic data research program. Reiter will work in the Duke RDC.
• Ray Bair, Argonne National Laboratory, team leader for the ANL team that will optimize the RDC supercluster. The ANL team will work at the Chicago RDC.
• Lars Vilhuber, Cornell University, senior scientist on the Cornell data synthesizing team; responsible for integration of the different data components.

A central tenet of the proposal is that the core RDC activities need to be supported in order to preserve the network of access points to the critical confidential micro-data that contribute to basic scientific knowledge and provide the input models and data preparation for our synthetic data development. Since the RDCs receive a lab fee each year to assist in the proposed work, the coordinator for each RDC does not appear explicitly in the budget. See the budget justification for Cornell University for an explanation of how the lab fee supports the operations of the RDCs. Each year of the proposal, the PIs and senior scientists performing the RDC coordination role will, as part of the operation of the RDC, assist a wide variety of scholars in preparing proposals to use the RDCs. The active commitment of this group of senior scientists is required to ensure that the RDC network is fully utilized. Each proposal involves preliminary discussions, a pre-proposal submission to Census, a feasibility and Title 13 assessment, review by the RDC review board, review by the Census review board, approval, research, data-basing of the project meta-data, and certification. Many of these steps iterate.
The RDC-based PIs and senior scientists will specify and implement a meta-database in which all RDC-based projects will be archived. The meta-data will include specifications for all input data sets; complete, reusable program streams for producing all major statistical results (some disclosable, some confidential) emanating from the project; and the disclosure-proofed result file and its supporting data. The project meta-database specifications will be completed within the first year and implemented retroactively on all RDC projects active at the time this grant is awarded. All future RDC users will be required to use the meta-database to archive their results. The ANL team will take charge of creating a high-performance supercluster. The supercluster must be acquired quickly. Each of the cooperating industry sponsors (Intel, Unisys, and SAS) is aware of the requirements and will respond quickly to the acquisition RFP that Census will be required to issue before the supercluster can be acquired. (Census must own and operate the RDC-network computers, as required by Title 13 and IRS Publication 1075.) The ANL team will assist in writing the specifications for the acquisition and assessing all bids. It is anticipated that the Intel/Unisys bid will be successful. This will allow the acquisition of the supercluster during the first months that the grant is in progress. The ANL team will spend much of the first year tuning and optimizing the supercluster, with the assistance of RDC-network staff and the vendors, particularly SAS. Some testing will also be performed at Cornell and at ANL. Specific computing tasks are outlined below. Benchmarking. The applications placed on the cluster will be benchmarked for performance against the current computing environment of a single server, and in terms of the optimum number of processors for particular models and problem sizes.
Benchmark results will provide guidance for supercluster users and reference points for code improvements. Training. Researchers will need to be trained to run their applications on the supercluster. Training will take the form of short seminars and documentation, which will present the types of applications suited for a distributed cluster computer, the architecture of the machine, the essentials of running a job, and the tools to be used in adapting applications to this environment. Tools Development. Tools will be developed and procured to assist the researchers in porting their applications to a distributed environment and to minimize the transition process. Parallel tools for clusters range from high-level (full scientific application analysis) to low-level, where the programmer has explicit control over all parallelism. In addition, a series of code performance analysis tools (e.g., Jumpshot) will provide the researcher with metrics on the execution itself, which help to locate and explain performance problems in the code and suggest avenues for improvement. The GeoViewer is an object-oriented geographic information system (GIS) developed by Argonne that can be leveraged as needed for this large-scale matching application. Since Argonne owns the code, this GIS application can be customized for RDC-network applications and integrated with other code modules as needed.

Computing Environment Task Outline
1. Distributed cluster procurement
2. System setup, software installations and security configuration
3. System checkout, configuration and creation of administrative procedures
4. Tool development and deployment
5. Training development (mini-courses and documentation) and delivery
6. Porting applications and benchmarking

How the project will be managed across institutions and disciplines

The data synthesizing projects will work as a coordinated team, with Abowd coordinating.
The team will meet weekly by video conference call to review the work plan and discuss issues that have arisen in the research effort. Generally, Abowd will coordinate the creation of RDC projects that provide access to each of the gold-standard data sets to be synthesized. Abowd's team at the Cornell RDC and inside LEHD will port the existing synthesizers to the RDC network for review and use by the other teams. Raghunathan and Reiter, working with teams at the Michigan and Research Triangle RDCs, respectively, will develop appropriate synthesizers for data sets assigned to their teams. Roehrig's team at CMU, which does not have access to an RDC, will develop algorithms in support of all the synthesizing teams. Roehrig's team will also assist in developing algorithms for assessing the identity and item disclosure risk associated with a particular synthetic data product. Each synthetic data product will go through the following cycle.
1. Identification of the data product to be synthesized (all PIs and senior scientists working with the advisory panel). At this step the costs and potential benefits of each product are evaluated so that the grant's resources are spent on the projects that are within its means and yield the greatest research benefits.
2. Assignment of a synthesizing team consisting of a senior scientist from this proposal, Census staff at CES, LEHD and selected RDCs, and external researchers with projects authorized to use the candidate data.
3. Certification of the gold-standard file by the synthesizing team and appropriate Census staff (CES or LEHD). At this point the structure and content of the gold-standard file are locked.
4. Production of a trial synthetic version by the assigned team.
5. Testing of the trial synthetic data product using models from the RDC meta-database and the active RDC researchers on the project team.
6.
Disclosure risk assessment of the synthetic data product by a team at one of the RDCs working jointly with the Census Disclosure Review Board.
7. Assessment of the tests; then either return to step 3 and repeat, or release and go to step 8.
8. Installation of the synthetic data product in the virtual RDC for use by the general research community.
9. Solicitation of RDC projects to further test the validity of the synthetic data product. These researchers can develop their proposals using the virtual RDC. The proposal must include a complete set of results from the synthetic data in RDC meta-database format.
10. Acceptance of some projects to test the synthetic data against the gold-standard.
11. Execution of those projects and certification of their results.
12. Assessment of the projects in step 11; then return to step 1 with a proposal to enhance this product based on the tests.

Identification of the specific coordination mechanisms

The RDC network will be the primary coordination mechanism for the grant. The RDC executive directors meet regularly by conference call. Several executive directors are PIs or senior scientists on this proposal (Bennett, Hildreth, and Levenstein). They will keep the work of this grant on the active agenda of the RDC executive directors. The complete senior scientific staff of this proposal will meet monthly by video conference to discuss progress on all of the grant's activities. Each team will make periodic detailed reports on the products they are developing. The necessary video conference equipment will be acquired by each RDC (or exists already at the host organization). The grant will acquire an MCU to support the multipoint conferences. So that Census staff at CES and LEHD can participate, the MCU will have both IP and ISDN capabilities. The budget justification for Cornell describes the MCU equipment.
Each RDC location will decide whether to install a workstation-based H.323-compatible video conference unit or use the facilities of the host organization/university. The virtual RDC will be maintained at Cornell University. This facility will be expanded as more synthetic data become available. The specifications for mirroring the virtual RDC will be developed and published.

References Cited

Abowd, John M. and Lars Vilhuber, “The Sensitivity of Economic Statistics to Coding Errors in Personal Identifiers,” Journal of Business and Economic Statistics, forthcoming.
Abowd, John M., John Haltiwanger and Julia Lane, “Integrated Longitudinal Employee-Employer Data for the United States,” American Economic Review Papers and Proceedings, Vol. 94, No. 2 (May 2004): forthcoming.
Abowd, John M., John Haltiwanger, Ron Jarmin, Julia Lane, Paul Lengermann, Kristin McCue, Kevin McKinney, and Kristin Sandusky, “The Relation among Human Capital, Productivity and Market Value: Building Up from Micro Evidence,” in Measuring Capital in the New Economy, C. Corrado, J. Haltiwanger, and D. Sichel (eds.), (Chicago: University of Chicago Press for the NBER, forthcoming); working paper version LEHD Technical Paper TP-2002-14 (final version December 2002).
Abowd, John M., Paul Lengermann and Kevin McKinney, “Measuring the Human Capital Input for American Businesses,” LEHD Technical Paper TP-2002-09 (last revised August 2002).
Abowd, John M. and Simon Woodcock, “Disclosure Limitation in Longitudinal Linked Data,” in Confidentiality, Disclosure and Data Access: Theory and Practical Applications for Statistical Agencies, P. Doyle, J. Lane, J. Theeuwes, and L. Zayatz (eds.), (Amsterdam: North Holland, 2001), 215-277.
Andersson, Fredrik, Harry Holzer and Julia Lane, “The Interactions of Workers and Firms in the Low-Wage Labor Market,” Urban Institute Research Report, available online at http://www.urban.org/url.cfm?ID=410608 (December 2002), cited on February 23, 2004.
Arthur, W. B.
(1991). “Designing Economic Agents that Act Like Human Agents: A Behavioral Approach to Bounded Rationality,” American Economic Review Papers and Proceedings (May): 353-359.
Bonabeau, E., “Agent-based modeling: Methods and techniques for simulating human systems,” Proceedings of the National Academy of Sciences of the USA, vol. 99, suppl. 3, pp. 7280-7287, Washington, DC (May 14, 2002).
Bowlus, Audra and Lars Vilhuber, “Displaced workers, early leavers, and re-employment wages,” LEHD Technical Paper TP-2002-18 (last revised November 2002).
Card, D.E., Hildreth, A.K.G., and Shore-Sheppard, L.D., “The Measurement of Medicaid Coverage in the SIPP: Evidence from California 1990-1996,” NBER Working Paper 8514 (2001). Forthcoming: Journal of Business and Economic Statistics.
Dandekar, Ramesh A. and Lawrence H. Cox, “Synthetic Tabular Data: An Alternative to Complementary Cell Suppression for Disclosure Limitation of Tabular Data,” manuscript (2002).
Davis, Steven J. and John Haltiwanger, “Gross Job Creation and Destruction: Microeconomic Evidence and Macroeconomic Implications,” NBER Macroeconomics Annual 1990, O. Blanchard and S. Fischer, eds. (Cambridge: MIT Press, 1990), pp. 123-68.
Davis, Steven J. and John Haltiwanger, “Gross Job Creation, Gross Job Destruction and Employment Reallocation,” Quarterly Journal of Economics 107 (1992): 819-63.
Domingo-Ferrer, Josep and Vicenc Torra, “Disclosure Control Methods and Information Loss,” in Confidentiality, Disclosure and Data Access: Theory and Practical Applications for Statistical Agencies, P. Doyle, J. Lane, J. Theeuwes, and L. Zayatz (eds.), (Amsterdam: North Holland, 2001), 91-110.
Doyle, Pat, Julia Lane, Jules Theeuwes, and Laura Zayatz, “Introduction,” in Confidentiality, Disclosure and Data Access: Theory and Practical Applications for Statistical Agencies, P. Doyle, J. Lane, J. Theeuwes, and L.
Zayatz (eds.), (Amsterdam: North Holland, 2001), 1-15.
Duncan, George and Stephen R. Roehrig, “Mediating the Tension Between Information Privacy and Information Access: The Role of Digital Government,” in Public Information Technology: Policy and Management Issues, Idea Group Publishing, Hershey, PA, 2002.
Duncan, George, Karthik Kannan and Stephen R. Roehrig, Final Report on the American FactFinder Disclosure Audit Project for the U.S. Census Bureau, confidential report to the U.S. Bureau of the Census, 2000.
Duncan, George T., Thomas B. Jabine, and Virginia A. de Wolf, eds., Private Lives and Public Policies: Confidentiality and Accessibility of Government Statistics, National Academy Press, Washington, 1993.
Duncan, George T., Stephen E. Fienberg, Ramayya Krishnan, Rema Padman, and Stephen R. Roehrig, “Disclosure Limitation Methods and Information Loss for Tabular Data,” in Confidentiality, Disclosure and Data Access: Theory and Practical Applications for Statistical Agencies, P. Doyle, J. Lane, J. Theeuwes, and L. Zayatz (eds.), (Amsterdam: North Holland, 2001), 135-166.
Duncan, George T. and Diane Lambert, “Disclosure Limited Data Dissemination,” Journal of the American Statistical Association, 81:393 (March 1986): 10-18.
Duncan, George and Diane Lambert, “The risk of disclosure for micro-data,” Journal of Business and Economic Statistics, 7 (1989): 207-217.
Duncan, G., R. Krishnan, R. Padman, P. Reuther and S. Roehrig, “Exact and Heuristic Methods for Cell Suppression in Multi-Dimensional Linked Tables,” accepted for Operations Research.
Duncan, G. and S.
Keller-McNulty, “Bayesian Insights on Disclosure Limitation: Mask or Impute,” in Bayesian Methods with Applications to Science, Policy, and Official Statistics: Selected Papers from ISBA 2000: The Sixth World Meeting of the International Society for Bayesian Analysis (2000), available online at http://www.stat.cmu.edu/ISBA/165f.pdf, cited on February 23, 2004.
Dutta Chowdhury, S., G. Duncan, R. Krishnan, S. Mukherjee and S. Roehrig, “Disclosure Detection in Multivariate Categorical Databases: Auditing Confidentiality Protection Through Two New Matrix Operators,” Management Science, December 1999.
Elliot, Mark, “Disclosure Risk Assessment,” in Confidentiality, Disclosure and Data Access: Theory and Practical Applications for Statistical Agencies, P. Doyle, J. Lane, J. Theeuwes, and L. Zayatz (eds.), (Amsterdam: North Holland, 2001), 75-90.
Entwisle, Barbara, “The Contexts of Social Research,” in Nancy M.P. King, Gail E. Henderson and Jane Stein (eds.), Reexamining Research Ethics: From Regulations to Relationships (Chapel Hill, NC: University of North Carolina Press, 1999), pp. 153-160.
Fienberg, Stephen E., “A Radical Proposal for the Provision of Micro-Data Samples and the Preservation of Confidentiality,” Carnegie-Mellon University Department of Statistics Technical Report No. 611 (December 1994).
Fienberg, Stephen E., U.E. Makov, and R.J. Steele, “Disclosure Limitation Using Perturbation and Related Methods for Categorical Data,” Journal of Official Statistics, 14:4 (1998): 385-397.
Feldstein, Martin and Jeffrey Liebman, “Social Security,” Handbook of Public Economics, Volume 4, 2002.
Hildreth, A.K.G. and Lee, D.S., “A Guide to Fuzzy Name Matching: Application to Matching NLRB/FMCS to ES-202 Data,” mimeo, Department of Economics, University of California-Berkeley, September 2003.
Holland, J. and J. Miller (1991), “Artificial Adaptive Agents in Economic Theory,” AEA Papers and Proceedings (May): 365-370.
Kennickell, Arthur B.
“Multiple Imputation in the Survey of Consumer Finances,” SCF Working Paper presented at the 1998 meetings of the American Statistical Association, available online at http://www.federalreserve.gov/Pubs/OSS/oss2/papers/impute98.pdf, cited on February 23, 2004.
Lambert, Diane, “Measures of disclosure risk and harm” (Disc: p333-334), Journal of Official Statistics, 9 (1993): 313-331.
Liebman, Jeffrey, “The Optimal Design of the Earned Income Tax Credit,” in Making Work Pay: The Earned Income Tax Credit and Its Impact on American Families, edited by Bruce D. Meyer and Douglas Holtz-Eakin. New York: Russell Sage Foundation Press, 2002.
Liebman, Jeffrey B., “Redistribution in the Current U.S. Social Security System,” NBER Working Paper, 2003, available online at http://papers.nber.org/papers/w8625.pdf, cited on February 23, 2004.
Hildreth, A.K.G. and Lee, D.S., “A Guide to Fuzzy Name Matching: Application to Matching NLRB/FMCS to ES-202 Data,” report for the Labor Market Information Division (LMID), Sacramento, California; mimeo, Department of Economics, University of California (2003).
Little, Roderick J. A., “Statistical analysis of masked data” (Disc: p455-474) (Corr: 94V10 p469), Journal of Official Statistics, 9 (1993): 407-426.
Mayer, Thomas S., “Privacy and Confidentiality Research at the US Census Bureau: Recommendations Based on a Review of the Literature,” Research Report Series Survey Methodology 2002-01, available online at http://www.census.gov/srd/papers/pdf/rsm2002-01.pdf, cited on February 23, 2004.
Mazumder, Bhashkar, “Revised Estimates of Intergenerational Income Mobility in the United States,” Federal Reserve Bank of Chicago Working Paper, available online at http://www.chicagofed.org/publications/workingpapers/papers/wp2003-16.pdf, cited on February 23, 2004.
North, M., “Towards strength and stability: agent-based modeling of infrastructure markets,” Social Science Computer Review, pp.
307-323, Sage Publications, Thousand Oaks, California (Fall 2001).
Raghunathan, T. E., J. M. Lepkowski, J. Van Hoewyk and P. Solenberger, “A Multivariate Technique for Multiply Imputing Missing Values Using a Sequence of Regression Models,” Survey Methodology, 27 (2001): 85-95.
Raghunathan, T.E., J.P. Reiter, and D.B. Rubin, “Multiple Imputation for Statistical Disclosure Limitation,” Journal of Official Statistics, 19 (2003): 1-16.
Raghunathan, T.E. and D.B. Rubin, “Multiple Imputation for Disclosure Limitation,” technical presentation (available from the authors on request), 2000.
Reiter, J., “Inference for Partially Synthetic Public Use Microdata Sets,” Survey Methodology, 29 (2003): 181-188.
Reiter, J., “Releasing multiply-imputed, synthetic public use microdata: An illustration and empirical study,” Journal of the Royal Statistical Society, Series A (forthcoming, 2004).
Resnick, M., Turtles, Termites, and Traffic Jams: Explorations in Massively Parallel Microworlds, MIT Press, Cambridge, Massachusetts, 1994.
Riano, D., S. Prado, A. Pascual and S. Martin, “A multi-agent system model to support palliative care units,” Proceedings of the 15th IEEE Symposium on Computer-Based Medical Systems, pp. 35-40, IEEE, Piscataway, New Jersey (June 4-7, 2002).
Rindfuss, Ronald, “Conflicting Demands: Confidentiality Promises and Data Availability,” IHDP Update: Newsletter of the International Human Dimensions Programme on Global Environmental Change (February 2002), article 1, available online at http://www.ihdp.uni-bonn.de/html/publications/update/update02_02/Update02_02_art1.html, cited on February 23, 2004.
Rubin, Donald B., “Satisfying Confidentiality Constraints Through the Use of Synthetic Multiply-imputed Microdata,” Journal of Official Statistics, 9 (1993): 461-468.
Sarathy, Rathindra, Krishnamurthy Muralidhar and Rahul Parsa, “Perturbing Nonnormal Confidential Attributes: The Copula Approach,” Management Science 48:12 (December 2002): 1613-27.
Shannon, Claude E., “A Mathematical Theory of Communication,” Bell System Technical Journal, 27 (1948): 379-423, 623-656.
Sweeney, Latanya, “Information Explosion,” in Confidentiality, Disclosure and Data Access: Theory and Practical Applications for Statistical Agencies, P. Doyle, J. Lane, J. Theeuwes, and L. Zayatz (eds.), (Amsterdam: North Holland, 2001), 43-74.
Trottini, Mario and Stephen E. Fienberg, “Modelling User Uncertainty for Disclosure Risk and Data Utility,” International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 10(5) (2002): 511-528.
Tax Information Security Guidelines for Federal, State, and Local Agencies: Safeguards for Protecting Federal Tax Returns and Return Information, IRS Publication 1075, June 2000.
Yancey, William E., William E. Winkler and Robert H. Creecy, “Disclosure Risk Assessment in Perturbative Microdata Protection,” Research Report Series Statistics 2002-01, available online at http://www.census.gov/srd/papers/pdf/rrs2002-01.pdf, cited on February 23, 2004.
Research Produced at NSF-Supported RDCs
Berkeley RDC
Card, D., Hildreth, A., and Shore-Sheppard, L., “The Measurement of Medicaid Coverage in the SIPP: Evidence from California, 1990-1996,” NBER Working Paper 8514, 2001. Forthcoming: Journal of Business and Economic Statistics.
DiNardo, John and David Lee, “Do Unions Cause Business Failures?” revised March 2003. Previous version: “The Impact of Unionization on Establishment Closure: A Regression Discontinuity Analysis of Representation Elections,” NBER Working Paper #8993, June 2002.
Lubotsky, D., “The Labor Market Effects of Welfare Reform.” Forthcoming: Industrial and Labor Relations Review.
Lubotsky, D., “Chutes or Ladders? A Longitudinal Analysis of Immigrant Earnings.”
Forthcoming: The Journal of Political Economy.
Van Biesebroeck, J., “Productivity Dynamics with Technology Choice: An Application to Automobile Assembly,” Review of Economic Studies, 70, 2003.
Boston RDC 2000-2004
Black, S. & L. Lynch, “How to Compete: The Impact of Workplace Practices and Information Technology on Productivity,” Review of Economics and Statistics, August 2001, 83(3), 434-45.
Black, S. & L. Lynch, “What’s Driving the New Economy?: The Benefit of Workplace Innovation,” Economic Journal, February 2004, 114(493), 97-116.
Black, S., L. Lynch & A. Krivelyova, “How Workers Fare When Employers Innovate,” Industrial Relations, January 2004, 43(1), 44-66.
Black, S. and L. Lynch, “Measuring Organizational Capital in the New Economy,” in Carol Corrado, John Haltiwanger and Dan Sichel, editors, Measuring Capital in the New Economy, University of Chicago Press, forthcoming.
Black, S. and L. Lynch, “The New Economy and the Organization of Work,” in Derek Jones, ed., The Handbook of the New Economy, Academic Press, 2003.
Dumais, G., G. Ellison & E. Glaeser, “Geographic Concentration as a Dynamic Process,” Review of Economics and Statistics, May 2002, 84(2), 193-204.
Bertrand, M. and S. Mullainathan, “Enjoying the Quiet Life? Corporate Governance and Managerial Preferences,” Journal of Political Economy, 2003, 111(5), 1043-75.
Bernard, A. & J.B. Jensen, “Why Some Firms Export,” The Review of Economics and Statistics, forthcoming 2004.
Bernard, A., J. Eaton, J.B. Jensen & S. Kortum, “Plants and Productivity in International Trade,” American Economic Review, September 2003, 93(4), 1268-1290.
Bernard, A. & J.B. Jensen, “Entry, Expansion and Intensity in the U.S. Export Boom, 1987-1992,” Review of International Economics, forthcoming.
Bernard, A. & J.B. Jensen, “Exceptional Exporter Performance: Cause, Effect, or Both?” Journal of International Economics, February 1999, 47(1), 1-25.
Berman, E. & L.
Bui, “Environmental Regulation and Productivity: Evidence from Oil Refineries,” Review of Economics and Statistics, August 2001, 83(3), 498-510.
Berman, E. & L. Bui, “Environmental Regulation and Labor Demand: Evidence from the South Coast Air Basin,” Journal of Public Economics, February 2001, 79(2), 265-95.
Feldstein, M. & J. Liebman, “The Distributional Effects of an Investment-based Social Security System,” in M. Feldstein & J. Liebman, eds., The Distributional Aspects of Social Security and Social Security Reform, 2002, NBER Conference Report series, Chicago and London: University of Chicago Press.
Liebman, J., “Redistribution in the Current U.S. Social Security System,” in M. Feldstein & J. Liebman, eds., Distributional Aspects of Social Security and Social Security Reform, 2002, NBER Conference Report series, Chicago and London: University of Chicago Press.
Becker, R. & J.V. Henderson, “Effects of Air Quality Regulations on Polluting Industries,” Journal of Political Economy, 2000, 108, 379-421. Reprinted in Economic Costs and Consequences of Environmental Regulation, W.B. Gray (ed.), Ashgate Publishing Limited, 2001.
Henderson, J.V., “Effects of Air Quality Regulation,” American Economic Review, September 1996, 86(4), 789-813.
Henderson, J.V., “Marshall’s Scale Economies,” Journal of Urban Economics, January 2003, 53(1), 1-28.
Davis, J. & J.V. Henderson, “Evidence on the Political Economy of the Urbanization Process,” Journal of Urban Economics, January 2003, 53(1), 98-125.
Beardsell, M. & J.V. Henderson, “Spatial Evolution of the Computer Industry in the USA,” European Economic Review, February 1999, 43(2), 431-56.
Berry, Steve, Sam Kortum and Ariel Pakes, “Environmental Change and Hedonic Cost Functions for Automobiles,” Proceedings of the National Academy of Sciences, Vol. 93, No. 23, pp. 12731-12738.
Cooper, R. & A. Johri, “Learning by Doing and Aggregate Fluctuations,” Journal of Monetary Economics, November 2002, 49(8), 1539-66.
Villalonga, B., “Diversification discount or premium? New evidence from BITS establishment-level data,” Journal of Finance, forthcoming.
Ioannides, Y. & J. Zabel, “Neighborhood Effects and Housing Demand,” Journal of Applied Econometrics, Sept.-Oct. 2003, 18(5), 563-84.
Downes, T. & J. Zabel, “The Impact of School Characteristics on House Prices: Chicago 1987-1991,” Journal of Urban Economics, July 2002, 52(1), 1-25.
Ono, Y., “Outsourcing Business Services and the Role of Central Administrative Offices,” Journal of Urban Economics, May 2003, 53(3), 377-395.
Gray, W. & R. Shadbegian, “Plant Vintage, Technology, and Environmental Regulation,” Journal of Environmental Economics and Management, November 2003, 46(3), 384-402.
Gray, W., ed., Economic Costs and Consequences of Environmental Regulation, Ashgate Publications, 2002.
Gray, Wayne B. and Ronald J. Shadbegian, “Environmental Regulation, Investment Timing, and Technology Choice,” Journal of Industrial Economics, June 1998, pp. 235-256 (also NBER Working Paper 6036, May 1997).
Gray, Wayne B. and Ronald J. Shadbegian, “Pollution Abatement Costs, Regulation, and Plant-Level Productivity,” in Economic Costs and Consequences of Environmental Regulation, 2002 (also NBER Working Paper 4994, January 1995).
Kahn, M., “City Quality-of-Life Dynamics: Measuring the Costs of Growth,” Journal of Real Estate Finance and Economics, March-May 2001, 22(2-3), 339-52.
Kahn, Matthew, “Particulate Pollution Trends in the United States,” Regional Science and Urban Economics, February 1997, 27(1), 87-107.
Kahn, M., “The Silver Lining of Rust Belt Manufacturing Decline,” Journal of Urban Economics, November 1999, 46(3), 360-76.
Kahn, M., “Smog Reductions Impact on California County Growth,” Journal of Regional Science, August 2000, 40(3), 565-82.
Lynch, Lisa M. and Black, Sandra E., “Beyond the Incidence of Employer-Provided Training,” Industrial and Labor Relations Review, October 1998, 52(1), 64-81.
Mead, C.I., “The Impact of Federal, State, and Local Taxes on the User Cost of Capital,” Proceedings: Ninety-second Annual Conference on Taxation, National Tax Association, 2000, 487-92, Washington, D.C.: National Tax Association.
Okamoto, Y., “Multinationals, Production Efficiency, and Spillover Effects: The Case of the U.S. Auto Parts Industry,” Weltwirtschaftliches Archiv, 1999, 135(2), 241-60.
Rajan, R., P. Volpin, & L. Zingales, “The Eclipse of the U.S. Tire Industry,” in Kaplan, S., ed., Mergers and Productivity, 2000, 51-86, NBER Conference Report series, Chicago and London: University of Chicago Press.
Schuh, Scott and Robert Triest, “Gross Job Flows and Firms,” American Statistical Association Proceedings of the Government Statistics Section, 1999.
Schuh, S. & R. Triest, “The Role of Firms in Job Creation and Destruction in U.S. Manufacturing,” Federal Reserve Bank of Boston New England Economic Review, March-April 2000, 29-44.
Shadbegian, Ronald J. and Wayne B. Gray, “What Determines Environmental Performance at Paper Mills? The Roles of Abatement Spending, Regulation, and Efficiency,” Topics in Economic Analysis & Policy, November 2003.
Tootell, G., R. Kopcke & R. Triest, “Investment and Employment by Manufacturing Plants,” Federal Reserve Bank of Boston New England Economic Review, 2001(2), 41-58.
Unpublished papers
Davis, James, “Headquarters, Localization Economies and Differentiated Service Inputs,” Mimeo, December 2000.
Freeman, Richard and Morris Kleiner, “The Last American Shoe Manufacturers: Changing the Method of Pay to Survive Foreign Competition,” Mimeo, March 1998.
Gray, Wayne and Ronald Shadbegian, “Technology Change, Emissions Reductions, and Productivity,” Mimeo, January 2001.
Gray, Wayne and Ronald Shadbegian, “When is Enforcement Effective – or Necessary?” Mimeo, August 2000.
Hwang, Margaret and David Weil, “Who Holds the Bag?: The Impact of Information Technology and Workplace Practices on Inventory,” Mimeo, December 1997.
Kahn, Matthew, “Does Smog Regulation Displace Private Self Protection?” Mimeo, January 1998.
Kiel, Katherine and Jeffrey Zabel, “The Impact of Neighborhood Characteristics on House Prices: What Geographic Area Constitutes a Neighborhood?” Mimeo, March 2000.
Lynch, Lisa and Anya Krivelyova, “How Workers Fare When Workplaces Innovate,” Mimeo, June 2000.
Mead, C. Ian, “An Empirical Examination of the Determinants of New Plant Location in Four Industries,” Mimeo, November 2000.
Ono, Yukako, “Outsourcing Business Service and the Scope of Local Markets,” Mimeo, July 2001.
Schuh, Scott and Robert Triest, “Job Reallocation and the Business Cycle: New Facts for an Old Debate,” Mimeo, June 1998.
Shadbegian, Ronald, Wayne Gray and Jonathan Levy, “Spatial Efficiency of Pollution Abatement Expenditures,” Mimeo, April 2000.
Chicago RDC
Boyd, G., G. Tolley, and J. Pang, “Plant Level Productivity, Efficiency, and Environmental Performance: An Example from the Glass Industry,” Environmental and Resource Economics, September 2002, 23(1), 29-43.
Boyd, G., and J. Pang, “Estimating the Linkage between Energy Efficiency and Productivity,” Energy Policy, 2000, 28, 289-96.
Boyd, G., and J. McClelland, “The Impact of Environmental Constraints on Productivity Improvement and Energy Efficiency in Integrated Paper Plants,” Journal of Environmental Economics and Management, 1999, 38, 121-146.
Boyd, G., J. Dowd, J. Freidman, and J. Quinn, “Productivity, Energy Efficiency, and Environmental Compliance in Integrated Pulp and Paper and Steel Plants,” ACEEE Industrial Summer Study, Grand Island, NY (Aug. 14, 1995).
Boyd, G., J.
McClelland, and M. Ross, “The Impact of Environmental Constraints on Productivity Improvement and Energy Efficiency in Integrated Paper and Steel Plants,” Proceedings of the 18th IAEE International Conference, Into the 21st Century: Harmonizing Energy Policy, Environment, and Sustainable Economic Growth, Washington, DC (July 5-8, 1995).
Boyd, G., and J. McClelland, “Strategies for Reconciling Environmental Goals, Productivity Improvement and Increased Energy Efficiency in the Industrial Sector: An Analytic Framework,” Proceedings, Energy Efficiency and the Global Environment: Industrial Competitiveness and Sustainability, Newport Beach, CA, sponsored by Southern California Edison Company (Feb. 8-9, 1995).
Bock, M., G. Boyd, S. Karlson, and M. Ross, “Best Practice Electricity Use in Steel Minimills,” Iron and Steel Maker, pp. 63-67 (May 1994).
Boyd, G., S. Karlson, M. Neifer, and M. Ross, “Energy Intensity Improvements in Steel Minimills,” Contemporary Policy Issues, July 1993, XI(3), 88-99.
Boyd, G., S. Karlson, M. Neifer, and M. Ross, “Vintage Effects in the Steel Industry: The Potential for Energy Intensity Improvements,” Western Economics Association International 67th Annual Conference, San Francisco, CA (July 9-13, 1992).
U.S. Department of Energy, The Interrelationship between Environmental Goals, Productivity Improvement, and Increased Energy Efficiency in Integrated Paper and Steel Plants, USDOE/PO-055 Technical Report 5 (June 1997).
Boyd, G., M. Bock, S. Karlson, and M. Ross, Vintage-Level Energy and Environmental Performance in Manufacturing Establishments, ANL/DIS/TM-15 (May 1994).
Bock, M.J., G. Boyd, D. Rosenbaum, and M. Ross, Case Studies of the Potential Effects of Carbon Taxation on the Stone, Clay, and Glass Industries, ANL/EAIS/TM-91 (Dec. 1992).
Boyd, G., M. Neifer, and M.
Ross, Modeling Plant Level Industrial Energy Demand with the Longitudinal Research Database and the Manufacturing Energy Consumption Survey Database, ANL/EAIS/TM-96 (Jan. 1992).
Syverson, Chad, “Product Substitutability and Productivity Dispersion,” Review of Economics and Statistics, 2004.
Syverson, Chad, “Market Structure and Productivity: A Concrete Example.”
Syverson, Chad, “Prices, Spatial Competition, and Heterogeneous Producers: An Empirical Test.”
Conference Proceedings (published paper)
Boyd, G., “Parametric Approaches for Measuring the Efficiency Gap between Average and Best Practice Energy Use,” 2003 ACEEE Summer Study on Energy Efficiency in Industry: Sustainability and Industry: Increasing Energy Efficiency and Reducing Emissions (July 29-Aug. 1, 2003).
Submitted under review
Boyd, G., “A Statistical Model for Measuring the Efficiency Gap between Average and Best Practice Energy Use: The ENERGY STAR(tm) Industrial Energy Performance Indicator,” submitted to Journal of Industrial Ecology. Paper also presented at the 2003 International Society for Industrial Ecology Second International Conference, University of Michigan, Ann Arbor, MI (June 29-July 2, 2003).
Oral Presentations
Boyd, G., “Development and Testing Experience with the Energy Performance Indicator (EPI),” ENERGY STAR(r) Corn Refiners Industry Focus Meeting, Washington, DC (December 2003).
Boyd, G., and T. Hicks, “Energy Performance Indicators,” ENERGY STAR(r) Motor Vehicle Industry Focus Meeting and ENERGY STAR(r) Brewery Industry Focus Meeting, Washington, DC (June 2002).
Hubbard, Tom, “Hierarchies and the Organization of Specialization,” presented at: Carnegie-Mellon University, February 2004; Columbia University, January 2004; American Economic Association, January 2004; University of Pennsylvania–The Wharton School, December 2003; Yale University, December 2003; University of Chicago, December 2003; University of California, Los Angeles, November 2003; University of Indiana, October 2003; University of Toronto, September 2003; University of California, San Diego, September 2003.
Hubbard, Tom, “Specialization, Firms, and Markets: The Division of Labor Within and Between Law Firms,” presented at: Federal Reserve Bank of Chicago, May 2003; Dartmouth College, April 2003; Cornell University, April 2003; U.S. Department of Justice, April 2003; University of Virginia, April 2003; University of California, Los Angeles, April 2003; National Bureau of Economic Research, January 2003; INSEAD, November 2002; London School of Economics, November 2002; University of Chicago, November 2002; Yale University, November 2002; New York University, November 2002; Bureau of the Census, September 2002; European Economic Association, July 2002; Stanford Institute for Theoretical Economics, June 2002.
Michigan RDC
Conference Presentations
Christopher Kurz and James Levinsohn, “Plant-Level Responses to Administered Trade Protection,” Midwestern Economics Association, Chicago, IL, March 2004.
Christopher Kurz, “Production Sharing at the Plant Level,” Midwestern Economics Association, Chicago, IL, March 2004.
Triangle RDC
P. J. Cook and J. Ludwig, “The Effects of Gun Prevalence on Burglary: Deterrence vs. Inducement,” in J. Ludwig and P.J. Cook (eds.), Evaluating Gun Policy, Washington, DC: Brookings Institution Press, 2003, 74-118.
Submitted under review
Chen, Susan and Wilbert van der Klaauw, “The Effect of Disability Insurance on Labor Supply of Older Individuals in the 1990s,” unpublished manuscript, January 2004 (submitted for publication).
Presented at the University of Michigan Economics of Aging Seminar, March 2004.
Cutler, David, Edward Glaeser and Jacob Vigdor, “The Decline and Rise of the Immigrant Ghetto,” North American Regional Science Council annual meeting, November 2003.
UCLA RDC
Publications
Ellis, Mark, Richard Wright, and Virginia Parks (2003), “Work together, live apart? Geographies of Residential and Workplace Segregation in Los Angeles,” forthcoming, Annals of the Association of American Geographers.
Ellis, M. and Odland, J., “Intermetropolitan Variation in the Labor Force Participation of White and Black Men in the United States,” Urban Studies, 38, 2001.
Moretti, Enrico, “Workers’ Education, Spillovers and Productivity: Evidence from Plant-Level Production Functions,” American Economic Review, forthcoming.
Moretti, E., “Estimating the Social Return to Higher Education: Evidence from Longitudinal and Repeated Cross-Section Data,” forthcoming: Journal of Econometrics.
Robert Pedace and David Fairris, “The Impact of Minimum Wages on Job Training: An Empirical Exploration with Establishment Data,” Southern Economic Journal, Vol. 70, No. 3 (2004) (CES Discussion Paper 03-04).
John R. Logan, Richard D. Alba, and Wenquan Zhang (2002), “Immigrant Enclaves and Ethnic Communities in New York and Los Angeles,” American Sociological Review, 67 (April): 299-322.
Wright, R. and Ellis, M., “Race, Region and the Territorial Politics of Immigration,” International Journal of Population Geography, 6, 2000.
Unpublished Papers
Holloway, Steven R., Mark Ellis, Richard Wright, Margaret Hudson (2003), “Partnering ‘Out’ and Fitting In: Residential Segregation and the Neighborhood Contexts of Mixed-Race Households” (under review).
Krivelyova, Anya and Lisa Lynch, “Changes in the Demand for Labor – The Role of Skill Biased Organizational Change,” Mimeo, September 2000.
Valdez, Zulema (2003), “Beyond Ethnic Entrepreneurship: Ethnicity and the Economy in Enterprise.”
Center for U.S.-Mexican Studies, Working Paper Series, usmex_02_07, http://repositories.cdlib.org/usmex/usmex_02_07. Under review at American Sociological Review.
Valdez, Zulema (2002), “Two Sides of the Same Coin? The Relationship between Socioeconomic Assimilation and Ethnic Entrepreneurship among Mexicans in Los Angeles,” in Latina/o Los Angeles: Global Transformations, Settlement, and Activism, edited by Enrique C. Ochoa and Gilda Laura Ochoa. Volume currently under review at Rutgers University Press.
Valdez, Zulema (2003), “From the Enclave to the Ethnic Economy: The Effects of Ethnic Agglomeration and Ethnic Solidarity on Entrepreneurship.” Under review at International Migration Review.
Wright, Richard, Mark Ellis and Virginia Parks (2003), “Re-placing Whiteness in Spatial Assimilation Research” (under review).
New York City Baruch RDC
The Baruch RDC is scheduled to open in 2004. It has not yet begun accepting proposals.
Cornell RDC
The Cornell RDC is scheduled to open in 2004. John Abowd will transfer many of his approved projects from the LEHD computers to the Cornell RDC. His participation in the LEHD Program is documented in the main proposal.