Preliminary Results from the Census 2000 Industry and Occupation Coding

Proceedings of the Annual Meeting of the American Statistical Association, August 5-9, 2001
Mary Kirk, Ester Buckles, Wynona Mims, Marty Appel and Patricia Johnson,
U.S. Census Bureau
Key Words: Census, coding, evaluation
Introduction: For Census 2000, the U.S. Census Bureau collected information about the industry in which a respondent worked and the occupation the respondent performed for a 1-in-6 sample of households. The process of classifying these responses into an industry code category and an occupation code category is called industry and occupation (I&O) coding. This paper describes this process and presents some preliminary results.
Background: The Census Bureau has collected industry and occupation (I&O) data in censuses and household surveys for many years. The questions on the decennial census form date back to 1820 for industry and 1850 for occupation. Each decennial census since 1940 has included a series of questions that asked people to describe their jobs. In the 1990 Census, for the first time, the Census Bureau used a combination of automated coding and computer-assisted clerical coding to assign the numeric code categories that identify the respondent's industry and occupation. The Automated I&O Coding Software (AIOCS) assigned a substantial fraction of codes without human intervention, and the computer-assisted clerical coding operation classified the remaining cases that the AIOCS was unable to complete. Because of the success of the 1990 integrated coding system, the Census Bureau decided to use this approach again for Census 2000. The I&O coding system developed for Census 2000 receives inputs from the Data Capture System (DCS). All responses to Census 2000, including the I&O questions[1], were captured using a combination of data scanning and data keying. The Census Bureau used those raw data files to produce data response files that became input to the automated coding software (Scopp et al., 2001). The full I&O coding system is made up of four subsystems, which are described below.
I&O Coding System Descriptions
Autocoder Subsystem: The autocoder subsystem is a computerized batch processing system that extracts respondent replies[2] from the data capture file and uses a series of sophisticated computer algorithms and rules to classify a respondent to an industry and an occupation category (Gillman, 2000).

[1] The I&O questions request information on type of industry, name of company, kind of work performed, and most important activities and duties.
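The production AIOCS is not public, but the accept-or-refer decision described here and in the quality assurance discussion below can be sketched in a few lines. The sketch assumes a per-code cutoff table and a coder that returns a candidate code with a confidence score; all names and the example cutoff values are hypothetical, not the Bureau's.

```python
# Hypothetical sketch of the autocoder accept/refer decision: a candidate
# code is kept only when its confidence score meets the cutoff score for
# that code; otherwise the case falls through to clerical (residual) coding.
CUTOFFS = {"0170": 0.80, "7380": 0.90}  # illustrative per-code cutoff scores

def classify(candidate_code, confidence, cutoffs=CUTOFFS):
    """Return the code if acceptable, else None (send to clerical coding)."""
    cutoff = cutoffs.get(candidate_code, 1.01)  # unknown codes are never auto-accepted
    return candidate_code if confidence >= cutoff else None

print(classify("0170", 0.85))  # accepted: confidence above the 0.80 cutoff
print(classify("7380", 0.85))  # referred: confidence below the 0.90 cutoff
```

The per-code cutoff is the key design point: it lets the operation trade production rate against accuracy separately for codes that the autocoder handles well and codes that it does not.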
Computer-assisted Clerical Coding Subsystem:
Coding clerks at computer terminals assign I&O codes to
the entries for which the autocoder was unable to assign
an acceptable code. The acceptable code rate is the percentage of records in the validation sample that the autocoder assigned a code with a confidence score at or above the cutoff score for that code. This clerical coding operation is divided into two parts: residual coding and referral coding.
Residual Coding: The first part of the clerical coding
process is called residual coding (the residual from the
autocoder process). In residual coding, clerks assigned
I&O codes based on the information provided by the
respondent on the decennial census questionnaire. For the
Census 2000 I&O coding process, the Census Bureau extracted the responses that the autocoder left uncoded or coded unacceptably and grouped these cases into batches.
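The batching step itself is simple to illustrate. The paper does not state the production batch size; Table 2 implies roughly 13.9 million people across 75,398 batches, or about 184 people per batch, so 184 is used below purely for illustration.

```python
# Hypothetical sketch of grouping residual cases into clerical work units.
# The batch size of 184 is an assumption inferred from Table 2's totals,
# not a documented parameter of the 2000 operation.
def make_batches(cases, batch_size=184):
    """Split the list of residual cases into fixed-size work units."""
    return [cases[i:i + batch_size] for i in range(0, len(cases), batch_size)]

batches = make_batches(list(range(1000)))
print(len(batches))       # 6 work units: 5 full batches plus a remainder
print(len(batches[-1]))   # the short final batch holds the leftover cases
```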
For active duty military cases, the residual and
referral coders assigned an industry code for the branch of
service and an occupation code. A special button accessed
the military occupations on the occupation screen and
sorted them in two ways: by branch of service, military
occupation title, and pay grade or by branch of service,
Military Occupation Specialty (MOS) code, and pay
grade.
The Military Census Report (MCR), Shipboard
Census Report (SCR), and household responses were
processed together. For U.S. questionnaires with
responses written in Spanish and for all Puerto Rican
questionnaires, a special unit of bilingual coders was
formed. This special unit translated and coded the I&O
items using resource materials similar to those used by
the referral coders along with a special Employer Name
List (ENL) for Puerto Rico.
Problem Referral Coding: The last step in the coding process was referral coding. During the referral coding process, the referralists researched responses that neither the autocoder nor the residual coders were able to code. The problem referralists were the last decision-makers in the coding process.[3] These referralists used other research methods and materials to assist in assigning I&O codes. These materials included the North American Industry Classification System (NAICS) Manual, the Standard Occupational Classification (SOC) Manual, the Dun and Bradstreet directories, and other on-line electronic reference files. All cases that reached this final stage were coded.

[2] Response variables used for I&O coding: company name, kind of business, industry type, kind of work performed, duties, class of worker, age, education.
Quality Assurance Subsystem: The quality assurance
system measured the coding speed and accuracy of the
residual and referral coding clerks. Each residual coding
work unit included a sample of designated QA cases. The
sample cases were replicated twice and distributed among
the work units. Thus, a coder did not know which cases
were the QA sample and all coders, in essence, did
verification. The QA sample checked for coding consistency among the residual coders. If there was a majority code, the minority code was considered an error. A majority code is one on which two or more coders agreed for a given response. If there was no majority code, no coder received an error. Cases where all three codes disagreed were sent to the referral unit. The problem referral coders also used a 3-way QA.
Clerical Coding Training: A new training system was
designed for the Census 2000 clerical coding unit. This
new training system included a separate two-week training
program for the residual coders and an additional two-day
training session for the referral coders. It was based on an
interactive computer training system, and all materials
were accessed on the computer.
The traditional training for the I&O coding operation in both 1980 and 1990 was a two-week session with paper references, procedures, and indexes. For Census 2000, the goal in developing the system was to create an effective training mechanism combining examples from the I&O coding procedures with the principles of I&O coding. The interactive training software had a built-in help system that allowed the coders to reference information covered during their training. The software included selected practice examples for each section of the training; an online help system; hot keys for experienced computer users; buttons for pull-down menus of abbreviations, indexes, self-employed categories, and keywords; and other buttons to help coders.
The Census 2000 I&O coders were trained on basic computer literacy, industry, occupation, class of worker, and military coding. Practice cases appeared on the screen as the coders were trained. At the end of the training, the residual coders were tested to see if they qualified for production coding. Coders were required to pass one of three tests for industry and one of three tests for occupation, as well as a final test containing both industry and occupation cases.

[3] Headquarters staff served as a resource for referralists for some cases.
Preliminary Industry & Occupation Coding Results:
Autocoder: Prior to production coding, the automated
coder was validated using Census 2000 data. The purpose
of the validation process was to analyze the performance
of the automated coder with actual Census 2000
responses. We processed the 22.5 million I&O responses from Census 2000 through the autocoder (Scopp et al., 2001). After this process was complete, we created the clerical workload batches.
Table 1, below, provides detailed information on
autocoder production and accuracy rates. The 2000
autocoder processed the entire 22.5 million I&O
responses in 31 hours compared with 90 days in 1990.
The gross autocoder production rates were 86% for
industry and 81% for occupation. When we applied the
cutoff scores criteria, the net production rates for the
validation sample became 59% for industry and 56% for
occupation. The accuracy levels were 94% and 92%,
respectively. The industry net production rate was about
the same as the 1990 level. However, the occupation rate
was a vast improvement over the production figures for
1990, 56% versus 37%, and accuracy has improved for
both industry and occupation compared with 1990.
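Since the net production rate is the gross production rate times the acceptance rate, the implied acceptance rates can be backed out of the published 2000 figures. This is a back-of-the-envelope check on the arithmetic, not the Bureau's computation:

```python
# Net production rate = gross production rate x acceptance rate, so the
# implied acceptance rate is net / gross (2000 figures from Table 1).
gross = {"IND": 0.864, "OCC": 0.808}   # share of records assigned any code
net   = {"IND": 0.586, "OCC": 0.560}   # share assigned an acceptable code

implied_acceptance = {k: net[k] / gross[k] for k in gross}
print({k: round(v, 2) for k, v in implied_acceptance.items()})  # {'IND': 0.68, 'OCC': 0.69}
```

In other words, roughly two-thirds of the codes the autocoder proposed cleared their cutoff scores, for both industry and occupation.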
TABLE 1 – Autocoder Production and Accuracy Rates

                                              2000            1990
  Production                               IND     OCC     IND     OCC
  Total records/people through
  autocoder (in millions)                  22.5    22.5    N/A     N/A
  Gross production rate
  (assigned a code)                        86.4%   80.8%   94.0%   88.0%
  Net production rate (gross x
  acceptance), from validation sample      58.6%   56.0%   58.0%   37.0%
  Accuracy rate, from validation sample    94.0%   92.3%   90.0%   87.0%
Residual Coding: The residual coding unit assigned codes to approximately 15.3 million people. This workload included 10.0 million industry codes and 10.6 million occupation codes. After 28 weeks, the unit had coded 100% of the 15.3 million people.
Table 2, below, shows detailed information for the 28
weeks of production coding for the residual coding unit. During
the 28 weeks, the average number of person records processed
by a coder was 67.4 per hour. In 1990, at the completion of the
coding operation, the production rate for the day shift was 95.9
per hour. After the 28 weeks, 9.2 percent of the industry codes
and 5.6 percent of the occupation codes were sent to the
problem referral coding unit. These rates are lower than the
final 1990 referral rates of 13 percent for industry and 9 percent
for occupation. Rates in some weeks are biased downward by the introduction of new coders. After the 2000 data are processed, a study will evaluate the productivity rates of the I&O coding process against previous censuses.
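The per-hour figures in Table 2 are simple ratios of people coded to coding hours. A quick check of one reporting period, using the values as published, confirms the arithmetic:

```python
# People coded per hour = people coded / coding hours.
# Values as published for the 03/17 - 04/07 reporting period in Table 2.
people, hours = 2_296_706, 40_091.7
rate = people / hours
print(round(rate, 1))  # 57.3, matching the published rate for that period
```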
TABLE 2 – 28 Reporting Weeks of Residual Production Coding

  2001 Reporting   Residual  Production                   Total Coding  People Coded  % Referred
  Period           Coders    Batches     People           Hours         per Hour      IND    OCC
  Overall (28 wks)           75,398      13,899,470       223,474.6     67.4          9.2    5.6
  02/17 – 03/10    307        6,307         120,331        22,848.9     55.2         10.0    5.7
  03/17 – 04/07    292       11,611       2,296,706        40,091.7     57.3          9.3    5.3
  04/14 – 05/05    373       10,316       2,060,990        34,032.2     60.6          9.8    6.0
  05/12 – 06/02    350       10,810       2,160,395        30,959.4     69.8          9.2    5.4
  06/09 – 06/30    335       13,161       2,629,425        35,751.4     73.5          8.6    5.0
  07/07 – 07/28    327       12,029       2,401,731        31,706.8     75.7          8.3    5.3
  08/04 – 08/25    322       11,164       2,229,892        28,084.2     79.4          9.0    6.4
Problem Referral Coding: In 2000, the referral coding
process started two months after the clerical coding
process began. Most referral coders were selected after
five weeks of residual coding experience for the 2000
operation. There were 72 referralists hired to code
problem referral cases for the U.S. questionnaires.
TABLE 3 – 21 Reporting Weeks of Referral Coding

  2001 Reporting   Production                Total Coding  People Coded
  Period           Batches      People       Hours         per Hour
  Overall (21 wks)   9,736      1,905,687    53,081.5      42.0
  04/14 – 05/05        819        160,644     5,827.0      27.4
  05/12 – 06/02      1,530        296,864     9,626.8      30.8
  06/09 – 06/30      2,171        416,736    11,375.1      36.6
  07/07 – 07/28      1,935        381,586    10,418.3      36.6
  08/04 – 08/25      2,639        523,408    13,296.4      39.4
  09/01 (1 week)       642        126,449     2,537.9      49.8
The problem referral coders were tasked with resolving the more difficult coding cases. Consequently, their productivity fluctuated. As shown in Table 3, during the 21 weeks of this operation, the number of batches and coding hours generally increased, while the number of people coded per hour varied. The addition of new coders to the unit clearly affected the per-hour rate.
Quality Assurance: For the Census 2000, there
were three objectives for the QA program: identify and
correct clustering of significant errors due to the
autocoder; identify and correct clustering of significant
errors due to clerical coding; and promote continuous
improvement within the clerical coding operation.
As mentioned earlier, the QA sample checked for coding consistency among the residual coders as well as the referral coders. If there was a majority code among the three coders, the minority code was considered an error. A majority code was one on which two or more coders agreed for a given response.
TABLE 4 – Residual and Referral Coders Minority Rates

  Residual Coders                 % Minority Codes
  (28 Reporting Weeks)         TOT     IND     OCC
  Overall Average:             14.1    12.8    15.5
  02/17 – 03/10                14.8    13.8    15.9
  03/17 – 04/07                14.0    12.7    15.4
  04/14 – 05/05                14.3    13.0    15.6
  05/12 – 06/02                13.7    12.0    15.5
  06/09 – 06/30                13.7    12.0    15.4
  07/07 – 07/28                14.1    12.6    15.5
  08/04 – 08/25                14.3    13.2    15.4
  09/01

  Referral Coders                 % Minority Codes
  (21 Reporting Weeks)         TOT     IND     OCC
  Overall Average:             18.2    17.7    18.8
  04/14 – 05/05                15.8    15.8    15.7
  05/12 – 06/02                16.3    16.0    16.7
  06/09 – 06/30                15.7    15.0    16.4
  07/07 – 07/28                15.9    15.4    16.5
  08/04 – 08/25                15.9    15.5    16.4
  09/01 – 1 week               16.1    15.2    17.1
As Table 4 shows, the residual coders' overall minority rates after 28 weeks were 12.8 percent for industry and 15.5 percent for occupation. The referral coders' overall minority rates after 21 weeks were 17.7 percent for industry and 18.8 percent for occupation.
Coding Training: Challenges encountered in creating the interactive training software included: making frequent index changes based on the 2000 I&O classification system; updating the I&O system; coordinating the training and help screens; making improvements based on new clerical coding input; introducing an online help system; and customizing the help linkage system with examples.
These new features helped improve the coders' understanding of coding procedures. Our plan was to hire 350 residual, 100 referral, and 25 bilingual coders. The bilingual coders were hired to simultaneously translate and code Spanish entries written on the U.S. Spanish-language long form and on the Puerto Rico questionnaires. Because of the large number of new hires, we planned a series of training sessions to coordinate the staffing and the training. Table 5, below, provides information on the hiring and training of coders. Of the 410 coders trained through May 2001, 301 were residual coders (410 minus the problem referralists and bilingual coders), 96 were referral coders, and 13 were bilingual coders. There were 19 coding units of at least 25 coders each. Each unit had a supervisor and a lead clerk, and a section chief and two lead supervisors managed the overall operation. The supervisors and lead clerks also served as trainers for their respective units, which eased the trainers' transition into supervising their individual units.
TABLE 5 – Training and Staffing of Coders

                              Total        Problem
                              Trainees     Referral      Bilingual
  Coder Training              (New         Coders        Coders
  Activity                    Coders)      (New Coders)
  Prior to Production
  Coding                        69            24            0
  For Production Coding:
  Session 1 (1/29 – 2/9)       100
  Session 2 (2/12 – 2/23)       95
  Session 3 (2/26 – 3/9)        90
  Session 4 (3/12 – 3/23)       41             1
  Session 5 (3/26 – 4/6)         0            71
  Session 6 (4/9 – 4/20)         0             -
  Session 7 (4/23 – 5/4)         0                          8
  Session 8 (5/7 – 5/18)        15                          5
  Total Trainees               341            72           13
  Total Staff                 (410)         (96)*        (13)*

  * Trained on both residual and problem referral procedures.
Conclusion: Preliminary results from the Census 2000 Industry and Occupation (I&O) coding operation indicate that the automated system designed to code write-in entries to the I&O questionnaire items, together with the computer-assisted coder training, was successful and an improvement over the 1990 counterparts. Several major improvements have been noted thus far. First, the ability of the autocoder to assign I&O codes increased dramatically, from 47 percent for the 1990 operation to 56 percent for the 2000 operation; this production increase resulted in an estimated savings of about $2.4 million. Second, the new interactive training system provided improved teaching capabilities through a built-in help system that allowed the coders to reference information covered during their training; the new system also gave the coders hands-on training on the new production computer-assisted clerical coding software. Third, the work of the clerical coding units showed specific achievements when compared with 1990: the 2000 final referral rates for residual coders, 9.2 percent for industry and 5.6 percent for occupation, were lower than the final 1990 rates of 13 percent and 9 percent, respectively. In addition, the incorporation of a bilingual coding unit to translate and code respondents' Spanish entries alongside the stateside coding was a major improvement for 2000; in 1990, the Puerto Rico questionnaires were coded on the paper forms and the responses keyed. We fully expect that, because these new and innovative training, coding, and QA systems are easy to teach and to modify, the entire I&O coding system will improve over time.
References
Appel, Martin V., "Automated Industry and Occupation
Coding." Presented at the Development of Statistical
Tools Seminar on Development of Statistical Expert
Systems (DOSES), December 1-3, 1987.
Appel, Martin V., Hellerman, Eli, "Census Bureau
Experience with Automated Industry and Occupation
Coding," 6/21/82.
Appel, Martin V., Scopp, Thomas A., "Automated
Industry and Occupation Coding." Presented at the
American Statistical Association and on Population
Statistics at the Joint Advisory Committee Meeting,
April 25, 1985, Rosslyn, VA.
Chen, Bor-Chung; Creecy, Robert; Appel, Martin V.,
"Error Control of Automated Industry and Occupation
Coding," Journal of Official Statistics, Vol. 9, No. 4,
1993 pp 729-745, Statistics Sweden.
Dalzell, Donald; Johnson, Patricia; Kirk, Mary; Ross,
Dwayne; Scopp, Thomas, "Industry and Occupation
Coding for Census 2000," Presented at the annual
meeting of the American Statistical Association,
August 2000, Indianapolis, IN.
Gillman, Daniel, "Developing an Industry and Occupation
Autocoder for the 2000 Census," Presented at the
annual meeting of the American Statistical
Association, August 2000, Indianapolis, IN.
Gillman, Daniel, Appel, Martin V., "Automated Coding
Research at the Census Bureau," SRD Research
Report RR 94/04, 10/5/94.
Johnson, Patricia A. (2000), "Results of Phase II I&O
Autocoder Development and Recommendation for
Phase III", Draft BOC memorandum, March 17, 2000.
Knaus, Rodger, "Methods and Problems in Coding
Natural Language Survey Data," Journal of Official
Statistics, Vol. 3, No 1, 1987, pp 45-67, Statistics
Sweden.
Lyberg, Lars; Dean, Patricia, "International Review of
Approaches to Automated Coding," Prepared for the
conference, "Advanced Computing for the Social
Sciences," Williamsburg, VA, April 10-12, 1990.
Scopp, Thomas; Dalzell, Donald; Haley, Kevin, "A
Preliminary Look at the Effects of Optical Character
Recognition (OCR) and Keying on the Quality of
Industry and Occupation Coding in Census 2000,"
Presented at the annual meeting of the American
Statistical Association, August 2001, Atlanta, GA.
Scopp, Thomas S.; Priebe, John; Earle, Katharine,
"Adaptation of the Standard Industry and Occupation
Classification System for Census 2000", Presented at
the annual meeting of the American Statistical
Association, August 2000, Indianapolis, IN.
Scopp, Thomas S.; Tornell, Steven W., "The 1990 Census
Experience with Industry and Occupation Coding."
For presentation at the Southern Demographic
Association Annual Meeting, Jacksonville, FL,
October 11, 1991.
Tornell, Steven W., "Documentation of 1990 Decennial
Census Industry and Occupation Coding
(Stateside)," DPD, 9/4/91
This paper reports the results of research and analysis undertaken by Census Bureau staff. It has undergone a Census Bureau review
more limited in scope than that given to official Census Bureau publications. This report is released to inform interested parties of
ongoing research and to encourage discussion of work in progress.