Proceedings of the Annual Meeting of the American Statistical Association, August 5-9, 2001

Preliminary Results from the Census 2000 Industry and Occupation Coding

Mary Kirk, Ester Buckles, Wynona Mims, Marty Appel and Patricia Johnson, U.S. Census Bureau

Key Words: Census, coding, evaluation

Introduction:

For Census 2000, the U.S. Census Bureau collected information about the industry in which a respondent worked and the occupation the respondent performed for a 1-in-6 sample of households. The process of classifying these responses into an industry code category and an occupation code category is called industry and occupation (I&O) coding. This paper describes that process and presents some preliminary results.

Background:

The Census Bureau has collected industry and occupation (I&O) data in censuses and household surveys for many years. The questions on the decennial census form date back to 1820 for industry and 1850 for occupation. Each decennial census since 1940 has included a series of questions that asked people to describe their jobs. In the 1990 Census, for the first time, the Census Bureau used a combination of automated coding and computer-assisted clerical coding to assign the numeric code categories that identify the respondent's industry and occupation. The Automated I&O Coding Software (AIOCS) assigned a substantial fraction of codes without human intervention, and the computer-assisted clerical coding operation classified the remaining cases that the AIOCS was unable to complete. Because of the success of the 1990 integrated coding system, the Census Bureau decided to use this approach again for Census 2000.

The I&O coding system developed for Census 2000 receives inputs from the Data Capture System (DCS). All responses to Census 2000, including the I&O questions,[1] were captured using a combination of data scanning and data keying methodology.
The Census Bureau used those raw data files to produce data response files that became input to the automated coding software (Scopp et al., 2001). The full I&O coding system is made up of four subsystems, which are discussed below.

I&O Coding System Descriptions

Autocoder Subsystem:

The autocoder subsystem is a computerized batch processing system that extracts respondent replies[2] from the data capture file and uses a series of sophisticated computer algorithms and rules to classify a respondent to an industry and occupation category (Gillman, 2000).

Computer-assisted Clerical Coding Subsystem:

Coding clerks at computer terminals assign I&O codes to the entries for which the autocoder was unable to assign an acceptable code. The acceptable code rate is the percentage of records in the validation sample assigned a code by the autocoder whose confidence scores are equal to or above the cutoff score for that code. This clerical coding operation is divided into two parts: residual coding and referral coding.

Residual Coding:

The first part of the clerical coding process is called residual coding (coding the residual from the autocoder process). In residual coding, clerks assigned I&O codes based on the information provided by the respondent on the decennial census questionnaire. For the Census 2000 I&O coding process, the Census Bureau extracted the responses that were not assigned a code, or that received an unacceptable code, from the autocoder and grouped these cases into batches. For active duty military cases, the residual and referral coders assigned an industry code for the branch of service and an occupation code.

[1] The I&O questions request information on type of industry, name of company, kind of work performed, and most important activities and duties.
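The accept/reject decision described above — keep a machine-assigned code only when its confidence score meets the cutoff for that particular code — can be sketched as follows. This is a minimal illustration only; the field names, cutoff values, and scores are hypothetical, not the Bureau's actual AIOCS implementation.

```python
# Hypothetical sketch of cutoff-based acceptance of autocoder output.
# A response is acceptably coded only if the autocoder assigned a code
# AND the confidence score meets the cutoff set for that particular code.

# Per-code cutoff scores (illustrative values, not the real ones).
CUTOFFS = {"077": 0.85, "770": 0.90}

def accept(code, score):
    """Return True if the autocoded result clears its code-specific cutoff."""
    if code is None:                         # autocoder assigned no code at all
        return False
    return score >= CUTOFFS.get(code, 1.0)   # unknown codes never auto-accept

responses = [
    {"id": 1, "code": "077", "score": 0.92},  # accepted
    {"id": 2, "code": "770", "score": 0.80},  # below cutoff -> clerical coding
    {"id": 3, "code": None,  "score": 0.0},   # uncoded      -> clerical coding
]

# Everything the autocoder could not acceptably code becomes residual workload.
residual_workload = [r for r in responses if not accept(r["code"], r["score"])]
print([r["id"] for r in residual_workload])  # -> [2, 3]
```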
A special button accessed the military occupations on the occupation screen and sorted them in two ways: by branch of service, military occupation title, and pay grade; or by branch of service, Military Occupation Specialty (MOS) code, and pay grade. The Military Census Report (MCR), Shipboard Census Report (SCR), and household responses were processed together.

For U.S. questionnaires with responses written in Spanish, and for all Puerto Rican questionnaires, a special unit of bilingual coders was formed. This special unit translated and coded the I&O items using resource materials similar to those used by the referral coders, along with a special Employer Name List (ENL) for Puerto Rico.

Problem Referral Coding:

The last step in the coding process was referral coding. During the referral coding process, the referralists researched responses that neither the autocoder system nor the residual coders were able to code. The problem referralists were the last decisionmakers in the coding process.[3] These referralists used other research methods and materials to assist in assigning I&O codes. These materials included the North American Industry Classification System (NAICS) Manual, the Standard Occupational Classification (SOC) Manual, the Dun and Bradstreet directories, and other on-line electronic reference files. All cases that reached this final stage were coded.

Quality Assurance Subsystem:

The quality assurance system measured the coding speed and accuracy of the residual and referral coding clerks. Each residual coding work unit included a sample of designated QA cases. The sample cases were replicated twice and distributed among the work units. Thus, a coder did not know which cases were in the QA sample, and all coders, in essence, did verification.

[2] Response variables used for I&O coding: company name, kind of business, industry type, kind of work performed, duties, class of worker, age, education.
The QA sample checked for coding consistency among the residual coders. A majority code is a code on which two or more coders agreed for a given response. If there was a majority code, the minority code was considered an error; if there was no majority code, no coder received an error. Cases where all three codes disagreed were sent to the referral unit. The problem referral coders also used a 3-way QA.

Clerical Coding Training:

A new training system was designed for the Census 2000 clerical coding unit. This new training system included a separate two-week training program for the residual coders and an additional two-day training session for the referral coders. It was based on an interactive computer training system, and all materials were accessed on the computer. The traditional training for the I&O coding operation in both 1980 and 1990 had been a two-week session with paper references, procedures, and indexes. For Census 2000, the logic behind developing the system was to create an effective training mechanism combining examples from the I&O coding procedures with the principles of I&O coding.

The interactive training software had a built-in help system that allowed the coders to reference information covered during their training. The software had: selected practice examples for each section of the training; an online help system; hot keys for traditional computer users; buttons for pull-down menus of abbreviations, indexes, self-employed people, and keywords; and other buttons to help coders. The Census 2000 I&O coders were trained on basic computer literacy, industry, occupation, class of worker, and military coding. Practice cases appeared on the screen as the coders were trained. At the end of the training, the residual coders were tested to see if they qualified for production coding.

[3] Headquarters staff served as a resource for referralists for some cases.
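The three-way QA rule described above — two or more matching codes form a majority, a lone dissenting code is charged as an error, and three-way disagreements go to the referral unit — can be sketched as follows. This is a hypothetical illustration of the rule, not the Bureau's actual QA software; the sample codes are made up.

```python
from collections import Counter

def qa_adjudicate(codes):
    """Apply the 3-way QA rule to three coders' codes for one response.

    Returns (majority_code, error_indices, refer):
      - majority_code: code agreed on by two or more coders, or None
      - error_indices: positions of coders charged with a minority-code error
      - refer: True when all three codes disagree (case goes to referral)
    """
    code, n = Counter(codes).most_common(1)[0]
    if n >= 2:                        # a majority exists
        errors = [i for i, c in enumerate(codes) if c != code]
        return code, errors, False
    return None, [], True             # all three disagree: no errors, refer

print(qa_adjudicate(["077", "077", "770"]))  # -> ('077', [2], False)
print(qa_adjudicate(["077", "770", "679"]))  # -> (None, [], True)
```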
Coders were required to pass one of three tests for industry and one of three tests for occupation, as well as pass a final test with both industry and occupation cases.

Preliminary Industry & Occupation Coding Results:

Autocoder:

Prior to production coding, the automated coder was validated using Census 2000 data. The purpose of the validation process was to analyze the performance of the automated coder with actual Census 2000 responses. We processed the 22.5 million I&O responses from Census 2000 through the autocoder (Scopp et al., 2001). After this process was complete, we created the clerical workload batches. Table 1, below, provides detailed information on autocoder production and accuracy rates.

The 2000 autocoder processed the entire 22.5 million I&O responses in 31 hours, compared with 90 days in 1990. The gross autocoder production rates were 86% for industry and 81% for occupation. When we applied the cutoff score criteria, the net production rates for the validation sample became 59% for industry and 56% for occupation. The accuracy levels were 94% and 92%, respectively. The industry net production rate was about the same as the 1990 level. However, the occupation rate was a vast improvement over the production figure for 1990, 56% versus 37%, and accuracy improved for both industry and occupation compared with 1990.

TABLE 1 – Autocoder Production and Accuracy Rates

                                                        2000            1990
                                                     IND     OCC     IND     OCC
Total records/people through autocoder (millions)    22.5    22.5    N/A     N/A
Gross production rate (assigned a code)              86.4%   80.8%   94.0%   88.0%
Net production rate (gross x acceptance)             58.6%   56.0%   58.0%   37.0%
Accuracy rate, from validation sample (percent)      94.0%   92.3%   90.0%   87.0%

Residual Coding:

The residual-coding unit assigned codes to approximately 15.3 million people. This workload included 10.0 million industry codes and 10.6 million occupation codes. After 28 weeks they had coded 100% of the 15.3 million people.
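The relationship used in Table 1 — net production rate = gross production rate x acceptance rate — can be checked directly from the published 2000 figures. The quick arithmetic below backs out the implied acceptance rates; it is an illustration of the table's arithmetic, not a computation from the paper.

```python
# Net production rate = gross production rate x acceptance rate (Table 1).
# From the published gross and net rates we can back out the implied
# acceptance rate: the share of autocoder-assigned codes whose confidence
# score cleared the cutoff.

gross = {"IND": 0.864, "OCC": 0.808}   # assigned any code
net   = {"IND": 0.586, "OCC": 0.560}   # assigned an acceptable code

for item in ("IND", "OCC"):
    acceptance = net[item] / gross[item]
    print(f"{item}: implied acceptance rate = {acceptance:.1%}")
# -> IND: implied acceptance rate = 67.8%
# -> OCC: implied acceptance rate = 69.3%
```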
Table 2, below, shows detailed information for the 28 weeks of production coding by the residual coding unit. During the 28 weeks, the average number of person records processed by a coder was 67.4 per hour. In 1990, at the completion of the coding operation, the production rate for the day shift was 95.9 per hour. After the 28 weeks, 9.2 percent of the industry codes and 5.6 percent of the occupation codes were sent to the problem referral coding unit. These rates are lower than the final 1990 referral rates of 13 percent for industry and 9 percent for occupation. Production rates in some weeks are biased downward by the introduction of new coders. After processing the 2000 data, a study will be done to evaluate the productivity rates of the I&O coding process compared to previous years.

TABLE 2 – 28 Reporting Weeks of Residual Production Coding, 2001

Reporting       Residual                         Total Coding   People Coded   % Referred
Period          Coders    Batches    People      Hours          per Hour       IND    OCC
Overall          -         75398   13899470      223474.6       67.4 (avg)      9.2    5.6
02/17 – 03/10   307         6307     120331       22848.9       55.2           10.0    5.7
03/17 – 04/07   292        11611    2296706       40091.7       57.3            9.3    5.3
04/14 – 05/05   373        10316    2060990       34032.2       60.6            9.8    6.0
05/12 – 06/02   350        10810    2160395       30959.4       69.8            9.2    5.4
06/09 – 06/30   335        13161    2629425       35751.4       73.5            8.6    5.0
07/07 – 07/28   327        12029    2401731       31706.8       75.7            8.3    5.3
08/04 – 08/25   322        11164    2229892       28084.2       79.4            9.0    6.4

Problem Referral Coding:

In 2000, the referral coding process started two months after the clerical coding process began. Most referral coders were selected after five weeks of residual coding experience for the 2000 operation. There were 72 referralists hired to code problem referral cases for the U.S. questionnaires.
TABLE 3 – 21 Reporting Weeks of Referral Coding, 2001

Reporting        Batches    People     Total Coding   People Coded
Period                                 Hours          per Hour
Overall            9736    1905687       53081.5      42.0 (avg)
04/14 – 05/05       819     160644        5827.0      27.4
05/12 – 06/02      1530     296864        9626.8      30.8
06/09 – 06/30      2171     416736       11375.1      36.6
07/07 – 07/28      1935     381586       10418.3      36.6
08/04 – 08/25      2639     523408       13296.4      39.4
09/01 (1 week)      642     126449        2537.9      49.8

The problem referral coders were tasked with resolving the more difficult coding cases. Consequently, their productivity fluctuated. As shown in Table 3, during the 21 weeks of this operation, the number of batches and coding hours steadily increased. However, the number of people coded per hour varied. The addition of new coders to the unit definitely affected the per-hour rate.

Quality Assurance:

For Census 2000, there were three objectives for the QA program: identify and correct clustering of significant errors due to the autocoder; identify and correct clustering of significant errors due to clerical coding; and promote continuous improvement within the clerical coding operation. As mentioned earlier, the QA sample checked for coding consistency among the residual coders as well as the referral coders. If there was a majority code among the three coders, the minority code was considered an error. A majority code was a code on which two or more coders agreed for a given response.
TABLE 4 – Residual and Referral Coders' Minority Rates

Residual Coders                % Minority Codes
(28 Reporting Weeks)        TOT      IND      OCC
Overall Average             14.1     12.8     15.5
02/17 – 03/10               14.8     13.8     15.9
03/17 – 04/07               14.0     12.7     15.4
04/14 – 05/05               14.3     13.0     15.6
05/12 – 06/02               13.7     12.0     15.5
06/09 – 06/30               13.7     12.0     15.4
07/07 – 07/28               14.1     12.6     15.5
08/04 – 08/25               14.3     13.2     15.4

Referral Coders                % Minority Codes
(21 Reporting Weeks)        TOT      IND      OCC
Overall Average             18.2     17.7     18.8
04/14 – 05/05               15.8     15.8     15.7
05/12 – 06/02               16.3     16.0     16.7
06/09 – 06/30               15.7     15.0     16.4
07/07 – 07/28               15.9     15.4     16.5
08/04 – 08/25               15.9     15.5     16.4
09/01 (1 week)              16.1     15.2     17.1

As Table 4 shows, the residual coders' minority rates after 28 weeks were 12.8 percent for industry and 15.5 percent for occupation. The referral coders' minority rates after 21 weeks were 17.7 percent for industry and 18.8 percent for occupation.

Coding Training:

Challenges and difficulties encountered in creating interactive software for training included: making frequent index changes based on the 2000 I&O classification system; updating the I&O system; coordinating the training and help screens; making improvements based on new clerical coding input; introducing an online help system; and customizing the help linkage system with examples. These new features helped improve the coders' understanding of coding procedures.

Our plan was to hire 350 residual, 100 referral, and 25 bilingual coders. The bilingual coders were hired to simultaneously translate and code Spanish entries written on the U.S. long-form Spanish questionnaire and the Puerto Rican questionnaires. Because of the large number of new hires, we planned a series of training sessions to coordinate the staffing and the training. Table 5, below, provides information on the hiring and training of coders.
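The minority rates in Table 4 are, in effect, the share of QA codes that disagreed with a 2-of-3 majority code. A small hypothetical illustration of how such a rate could be tallied from three-way QA results follows; it sketches the rule as described in the text, not the Bureau's actual QA software, and the sample codes are made up.

```python
from collections import Counter

def minority_rate(qa_triples):
    """Fraction of codes charged as minority codes across 3-way QA cases.

    Each element of qa_triples holds three coders' codes for one response.
    A code disagreeing with a 2-of-3 majority counts as a minority code;
    cases where all three codes disagree contribute no minority codes.
    """
    minority = total = 0
    for codes in qa_triples:
        total += len(codes)
        code, n = Counter(codes).most_common(1)[0]
        if n >= 2:                                      # a majority exists
            minority += sum(1 for c in codes if c != code)
    return minority / total

sample = [("077", "077", "077"),   # unanimous: 0 minority codes
          ("077", "077", "770"),   # one minority code
          ("077", "770", "679")]   # no majority: 0 minority codes
print(f"{minority_rate(sample):.1%}")  # -> 11.1% (1 of 9 codes)
```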
Of the 410 coders trained through May 2001, 301 were residual coders (410 less the 96 problem referralists and 13 bilingual coders), 96 were referral coders, and 13 were bilingual coders. There were 19 coding units of at least 25 coders each. Each unit had a supervisor and a lead clerk. Also, to manage the overall operation, there was a section chief and two lead supervisors. The supervisors and lead clerks were also the trainers for their respective units. This provided an easier transition for trainers to supervise their individual units.

TABLE 5 – Training and Staffing of Coders

Coder Training            Total Trainees   Problem Referral      Bilingual
Activity                  (New Coders)     Coders (New Coders)   Coders
Prior to Production            69               24                  0
Session 1 (1/29 – 2/9)        100                -                  -
Session 2 (2/12 – 2/23)        95                -                  -
Session 3 (2/26 – 3/9)         90                -                  -
Session 4 (3/12 – 3/23)        41                1                  -
Session 5 (3/26 – 4/6)          0               71                  -
Session 6 (4/9 – 4/20)          0                -                  -
Session 7 (4/23 – 5/4)          0                -                  8
Session 8 (5/7 – 5/18)         15                -                  5
Total Trainees                341               72                 13
Total Staff                  (410)             (96)*              (13)*

* Trained on both residual and problem referral procedures.

Conclusion:

Preliminary results from the Census 2000 Industry and Occupation (I&O) coding operation indicate that the automated system designed to code write-in entries to the I&O questionnaire items and the assisted coder training were successful and an improvement over their 1990 counterparts. Major improvements noted thus far are: the ability of the autocoder to assign I&O codes increased dramatically, from 47 percent for the 1990 operation to 56 percent for the 2000 operation, a production increase that resulted in an estimated savings of about $2.4 million; and the new interactive training system provided improved teaching capabilities, accomplished through a built-in help system that allowed the coders to reference information that was covered during their training.
This new system also gave the coders hands-on training on the new production computer-assisted clerical coding software. The work by the clerical coding units also showed specific achievements when compared to 1990. The 2000 final referral rates for residual coders were lower than the 1990 rates: in 2000, only 9.2 percent of the industry codes and 5.6 percent of the occupation codes were sent to the referral unit, compared with the final 1990 rates of 13 percent for industry and 9 percent for occupation. The incorporation of a bilingual coding unit to translate and code respondents' Spanish entries along with stateside coding was a major improvement for 2000. In 1990, the Puerto Rico questionnaires were coded on the paper forms and the responses keyed. We fully expect that, because these new and innovative training, coding, and QA systems are easily trainable and/or modifiable, the entire I&O coding system will improve over time.

References

Appel, Martin V., "Automated Industry and Occupation Coding," Presented at the Seminar on Development of Statistical Expert Systems (DOSES), December 1-3, 1987.

Appel, Martin V.; Hellerman, Eli, "Census Bureau Experience with Automated Industry and Occupation Coding," 6/21/82.

Appel, Martin V.; Scopp, Thomas A., "Automated Industry and Occupation Coding," Presented at the Joint Advisory Committee Meeting of the American Statistical Association and on Population Statistics, April 25, 1985, Rosslyn, VA.

Chen, Bor-Chung; Creecy, Robert; Appel, Martin V., "Error Control of Automated Industry and Occupation Coding," Journal of Official Statistics, Vol. 9, No. 4, 1993, pp. 729-745, Statistics Sweden.

Dalzell, Donald; Johnson, Patricia; Kirk, Mary; Ross, Dwayne; Scopp, Thomas, "Industry and Occupation Coding for Census 2000," Presented at the annual meeting of the American Statistical Association, August 2000, Indianapolis, IN.
Gillman, Daniel, "Developing an Industry and Occupation Autocoder for the 2000 Census," Presented at the annual meeting of the American Statistical Association, August 2000, Indianapolis, IN.

Gillman, Daniel; Appel, Martin V., "Automated Coding Research at the Census Bureau," SRD Research Report RR 94/04, 10/5/94.

Johnson, Patricia A., "Results of Phase II I&O Autocoder Development and Recommendation for Phase III," Draft BOC memorandum, March 17, 2000.

Knaus, Rodger, "Methods and Problems in Coding Natural Language Survey Data," Journal of Official Statistics, Vol. 3, No. 1, 1987, pp. 45-67, Statistics Sweden.

Lyberg, Lars; Dean, Patricia, "International Review of Approaches to Automated Coding," Prepared for the conference "Advanced Computing for the Social Sciences," Williamsburg, VA, April 10-12, 1990.

Scopp, Thomas; Dalzell, Donald; Haley, Kevin, "A Preliminary Look at the Effects of Optical Character Recognition (OCR) and Keying on the Quality of Industry and Occupation Coding in Census 2000," Presented at the annual meeting of the American Statistical Association, August 2001, Atlanta, GA.

Scopp, Thomas S.; Priebe, John; Earle, Katharine, "Adaptation of the Standard Industry and Occupation Classification System for Census 2000," Presented at the annual meeting of the American Statistical Association, August 2000, Indianapolis, IN.

Scopp, Thomas S.; Tornell, Steven W., "The 1990 Census Experience with Industry and Occupation Coding," Presented at the Southern Demographic Association Annual Meeting, Jacksonville, FL, October 11, 1991.

Tornell, Steven W., "Documentation of 1990 Decennial Census Industry and Occupation Coding (Stateside)," DPD, 9/4/91.

This paper reports the results of research and analysis undertaken by Census Bureau staff. It has undergone a Census Bureau review more limited in scope than that given to official Census Bureau publications.
This report is released to inform interested parties of ongoing research and to encourage discussion of work in progress.