COMMON FRAMEWORK FOR A LITERACY SURVEY PROJECT
Literacy Survey
Data Processing Guidelines
May 2014
<< Insert other relevant information/Logos on cover page of Guide>>
Table of Contents
Preface
Data Processing - An Overview
Data Capture Standards
Coding Standards
Data-File Creation Standards
Data Editing
Reference
Annex: Scoring Guideline
Preface
The Data Processing Guide builds upon the project "Common Framework for a Literacy
Survey", which was executed by the Caribbean Community (CARICOM) Secretariat under
funding provided by the Inter-American Development Bank (IDB) Regional Public Goods
Facility. The main aim of the project was to design a common approach to the
measurement of literacy in countries. This common framework is built upon
international methodologies, fundamentally the International Survey of Reading
Skills (ISRS), and enables more reliable measurement of literacy than what
presently exists in the Region.
The literacy assessment is designed to measure functional literacy. In other words, it will
determine an individual’s literacy level by employing a series of questions designed to
demonstrate the use of their literacy skills. This involves two steps – the objective testing of an
adult’s skill level and the application of a proficiency standard that defines the level of mastery
achieved. The assessment measures the proficiency of respondents on three continuous literacy
scales - prose, document and numeracy. In addition, it will collect information on
component skills. Component skills are thought to be the building blocks upon which the
emergence of reading fluency is based. Information on the reading component skills will be
collected from people at the lower end of the literacy scale only. The testing
phase is preceded by a selection phase, which includes the administering of a
Background or Household Questionnaire; after the selection of the respondent
from the specific household, an initial pre-assessment is undertaken through a
filter test booklet to determine what type of assessment should be undertaken
in the testing phase.
A consultant, Mr. Scott Murray of Canada, was engaged to provide services on
this project. The CARICOM Secretariat (including Regional Statistics and Human and Social
Development Directorate) and the CARICOM Advisory Group on Statistics (AGS) were
instrumental in the execution of the project throughout all phases. In addition, there was
participation by Member States and some Associate Members relative to the technical rollout of
the instruments and documents.
This Data Processing Guide is aimed at providing <<country undertaking a Literacy Survey>>
with guidelines that can enable the effective processing of the data obtained from the literacy
survey as recommended under the IDB-funded CARICOM project.
Data Processing - An Overview
Data processing is, broadly, "the collection and manipulation of items of data to
produce meaningful information."[1] Data processing may involve various
processes, including[2]:

- Validation – ensuring that supplied data is "clean, correct and useful."
- Sorting – "arranging items in some sequence and/or in different sets."
- Summarization – reducing detailed data to its main points.
- Aggregation – combining multiple pieces of data.
- Analysis – the "collection, organization, analysis, interpretation and
  presentation of data."
- Reporting – listing detailed or summary data or computed information.

[1] French, Carl (1996). Data Processing and Information Technology (10th ed.).
Thomson. p. 2. ISBN 1844801004.
[2] http://en.wikipedia.org/wiki/Data_processing
Therefore, data processing refers to all operations performed after data
collection, from the time the questionnaire and assessment booklets are
returned from the field, through to the final publication and dissemination of
results. Data processing operations should be as simple and straightforward as
possible. In addition, it is necessary to be able to solve problems as they arise.
Planning of and preparation for the data processing tasks are essential to a
successful literacy survey operation and hence, the initial phase should
coincide with the planning and preparation of the non-data processing phase of
the survey.
Plans for data processing should be formulated as an integral part of the
overall plan of the survey, and those responsible for the processing of the
survey data should be involved from the inception of the planning process.
Data processing will be required in connection with various survey activities
including analysis of the data from the survey pilot test, compilation of
preliminary results, preparation of tabulations, evaluation of survey results,
analysis of main survey data, arrangements for storage in and retrieval from a
database, and identification and correction of errors.
In addition, data-processing technologies are playing an increasing role in the
planning and control of field operations and other aspects of survey and
census administration. Data processing has an impact on almost all aspects
of the survey operation ranging from the design of the questionnaire to the
analysis of the final results. Therefore, data-processing requirements in terms
of personnel, space, equipment and software (computer programs) need to be
considered at an early stage in the planning process.
Careful planning and control are required to ensure an uninterrupted flow of
work through the various stages from receipt of the survey questionnaire,
assessment booklets and scoring sheets through preparation of the database
and final tabulations. The plan should provide for the computer edit to follow
closely the coding/checking/recording of the data so that errors can be
detected while knowledge related to them is fresh and appropriate remedial
actions may be taken.
Countries are required to capture and process their assessment and
background questionnaires. In the computer-based collection option, virtually
all of these activities are managed by the collection software. Countries will
also be required to code and link to the master database any open-ended fields,
including industry and occupation.
This data processing guidelines document serves to provide details on the
processing standards and guidelines required for the capturing, coding, file
creation and editing of the literacy survey data.
Data Capture Standards
The purpose of data capture standards is to ensure that data are captured
from the survey instruments using uniform methods, and that the captured data
are as free of capture errors as possible.
For paper and pencil assessments, a manual data capture method is preferred
since the data capture errors, usually key-entry operator errors, are easier to
identify and control. Scanning methods have been found to introduce errors
different from those of manual data capture methods. For example, scanning
errors may occur because of the scanner's failure to correctly read the
information provided (e.g. reading 4 as 9). Additionally, the data capture
system must be fully tested prior to the commencement of data capture. In
addition to a fully-tested data capture system, sound quality control
procedures such as 100% verification of the data capture will ensure that the
dataset is free of data capture errors.
The data capture standards required for a literacy assessment are:
1. The responses from the Background Questionnaire will be manually keyed
   from the completed questionnaire, or captured by the interviewer using a
   Computer-Assisted method.
2. The data capture specifications and system must be tested before
   implementation.
3. Data capture of the Background Questionnaire (paper and pencil), Filter
   Booklet, Locator Booklet and Main Task Booklets will be 100% verified.
Guidelines
1. If data collection is done via the paper and pencil method, a manual data
   capture method should be used to capture the assessment booklets'
   responses. Likewise, if the scoring of the assessment booklets is done
   directly on the booklets, then a manual data capture method should be used
   to capture the scores.
2. Although a manual data capture method is preferred for the paper and
   pencil collection method, an optical scanning method may be used to
   capture the Background Questionnaire responses provided that a modern
   scanning technology is utilized and the scanning method has been fully
   tested using the Background Questionnaire. Similarly, an optical scanning
   method may be used to capture the scoring data. It is also acceptable to
   employ a computer-assisted method to capture the Background Questionnaire
   information, but this method should not be used for capturing the
   assessment booklet responses if a paper and pencil data collection method
   is used.
3. If a computer-assisted data collection method is to be used for the
   Background Questionnaire, it is imperative that the system be thoroughly
   tested before implementation.
4. The testing of the data capture system involves a thorough review of the
   programming specifications prior to the development of the computer
   programming code, and the subsequent testing of the programs prior to the
   start of the data capture operation. Testing is carried out by preparing
   mock survey instruments (Background Questionnaire and Scoring Sheets),
   passing them through the data capture system, and then reviewing the
   resultant data file outputs. When satisfactory data capture results are
   obtained, the capture of the live survey data can commence.
5. In a 100% data capture verification operation, the data capture is done
   twice, by two different operators. In a fully automated system, the second
   operator will be alerted when his/her input does not match that of the
   original. In these cases, the second operator will check the questionnaire
   or the assessment Booklet Scoring sheet and enter the correct value.
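To illustrate, the core comparison of a 100% verification operation can be
sketched in a few lines of code. The following Python fragment is a minimal
sketch only: it assumes both keying passes have been exported as dictionaries
keyed by record identifier, and all names in it (find_capture_mismatches, the
sample field names) are hypothetical rather than prescribed by the survey.

    # Minimal sketch of a 100% (double-key) verification comparison.
    # Both keying passes are assumed to be available as
    # {record_id: {field: value}} dictionaries; all names here are
    # illustrative, not prescribed by the survey.

    def find_capture_mismatches(first_pass, second_pass):
        """Return (record_id, field, first_value, second_value) tuples
        wherever the two keying passes disagree."""
        mismatches = []
        for record_id, first_record in first_pass.items():
            second_record = second_pass.get(record_id, {})
            for field, first_value in first_record.items():
                second_value = second_record.get(field)
                if first_value != second_value:
                    mismatches.append(
                        (record_id, field, first_value, second_value))
        return mismatches

    # Each mismatch is referred back to the source questionnaire or scoring
    # sheet so that the correct value can be keyed, as described above.
    first = {"0001": {"Q1": "2", "Q2": "5"}}    # first operator's pass
    second = {"0001": {"Q1": "2", "Q2": "9"}}   # second operator's pass
    print(find_capture_mismatches(first, second))  # -> [('0001', 'Q2', '5', '9')]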
Before the data capture operation takes place, countries will have to provide
documentation relevant to this operation. The documentation should include
information on the data capture software and system to be used, a description
of the data verification process, a description of the data capture system testing
procedures, and information on how the 100% data verification will be
implemented.
For quality assurance, the document prepared relative to the data capture
operation will have to be reviewed by an international expert before the
operation is executed. The data capture specifications will be based on a
standard international record layout for the data file so as to enable
comparative analysis.
Coding Standards
The purpose of coding standards relative to the literacy survey is to ensure
that the coding of the questionnaires and assessment booklets is performed in
a uniform way, with acceptable quality, within and across countries.
The coding standards required for a literacy assessment are:
1. The data from the Background Questionnaire and the assessment booklets
   will be coded as specified by the International Record Layout (IRL).
2. The following codebooks will be used to code the education, occupation and
   industry information respectively from the Background Questionnaire:
   (a) the 'International Standard Classification of Education (ISCED)' - the
       most recent - will be used to code the education variable, i.e.,
       'highest level of education/schooling';
   (b) the 'International Standard Classification of Occupations (ISCO) Job
       Titles' will be used to code the occupation variable;
   (c) the 'International Standard Industrial Classification of All Economic
       Activities, Third Revision' will be used to code the industry
       variable.
3. Data that has been manually coded will be 100% verified by another coder.
   The average error rate for manually coded data must not exceed 6%.
Guidelines
1. Countries should train approximately five (5) coders. These coders should
   preferably have extensive experience in coding Education, Industry and
   Occupation data from censuses or other large-scale surveys. Training
   materials for coders should consist of a master set of descriptions with
   associated expert codes for the data to be coded. By the end of the
   coders' training sessions, the coder error rate should not exceed 6%.
2. Some countries may opt to utilize software for automated coding. However,
   since automated coding software is rarely able to successfully code 100%
   of the data, a manual coding operation is still necessary. In this case,
   fewer manual coders may be required.
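Because both the training standard and the production standard turn on the 6%
error-rate ceiling, the calculation can be made explicit. The short Python
sketch below is illustrative only; it assumes the 100% verification pass
yields, for each item, the original coder's code and the verifier's code.

    # Illustrative calculation of a coder error rate against the 6% ceiling.
    # `verified_items` is assumed to be a list of (original_code,
    # verifier_code) pairs produced by the 100% verification pass.

    ERROR_RATE_CEILING = 0.06  # maximum acceptable average error rate

    def coder_error_rate(verified_items):
        """Fraction of items where original coder and verifier disagree."""
        if not verified_items:
            return 0.0
        errors = sum(1 for original, verifier in verified_items
                     if original != verifier)
        return errors / len(verified_items)

    pairs = [("1234", "1234"), ("5678", "5679"),
             ("2345", "2345"), ("1111", "1111")]
    rate = coder_error_rate(pairs)
    print(f"error rate: {rate:.1%}; acceptable: {rate <= ERROR_RATE_CEILING}")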
Countries will be required to prepare a document that includes a full
description of the coding system and coding quality control procedures.
Data-File Creation Standards
The purpose of the data-file creation standards is to ensure that countries
create their data files in the format that is required for estimation and
tabulation of survey results.
The data file provided by each country will be used for producing survey
estimates and for tabulating and publishing survey results. It is imperative
that the data are provided in the same format so that processing delays and
errors are avoided. An international consultant/firm may be required to
assist in the database development.
Standard and Guidelines
1. The literacy survey data file will be created according to the
   International Record Layout (IRL). An international consultant/firm may be
   required to assist in this process. Deviations from the prescribed record
   format must be reconciled before the test data can be scaled and before
   the survey estimates can be produced.
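File creation against the IRL typically amounts to writing each respondent's
values into fixed column positions in an ASCII file (the Data Editing
standards below call for ASCII delivery). The sketch below is a minimal
illustration under that assumption: the field names, widths and ordering are
invented for the example and must be taken from the actual International
Record Layout.

    # Sketch of fixed-width ASCII record creation. The layout below is
    # invented for illustration; the real field names, positions and
    # widths must be taken from the International Record Layout (IRL).

    LAYOUT = [            # (field name, width in characters)
        ("CASEID", 8),    # unique respondent identifier
        ("AGE",    3),
        ("ISCED",  2),    # coded highest level of education
        ("PROSE",  3),    # prose scale score
    ]

    def format_record(values):
        """Render one respondent's values as a fixed-width ASCII record."""
        parts = []
        for name, width in LAYOUT:
            text = str(values.get(name, "")).rjust(width)
            if len(text) > width:
                raise ValueError(f"{name} value too wide for layout: {text!r}")
            parts.append(text)
        return "".join(parts)

    with open("survey.dat", "w", encoding="ascii") as out:
        out.write(format_record({"CASEID": "00000001", "AGE": 42,
                                 "ISCED": 3, "PROSE": 276}) + "\n")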
Database development
Countries will be responsible for creating and validating their data files by
following prescribed procedures for both the pilot survey and the main survey.
A standard codebook will be used in processing the survey data and for
creating a database with a standard record layout. Countries will be required to
follow the specifications for creating their data files and ensuring their
compliance with the prescribed format. All paper and pencil psychometric test
data must be key-entered with 100 percent verification. The captured
background questionnaire data must be subjected to statistical verification to
ensure an outgoing quality level of 99.5%.
Countries will be required to develop their own data capture and data editing
programs since the background questionnaires and data capture methods may
vary by country. All data files must be edited to ensure that each record has a
unique identification number, all data fall within allowable ranges, each
respondent has answered all applicable questions, the question responses are
consistent with the responses to other questionnaire items where applicable,
and that extreme data values are identified and dealt with. Countries will be
obliged to correct any anomalies identified in the editing process. The edit
should flag missing information and remove data that do not respect the skip
patterns in the background questionnaire. Data on education, industry and
occupation collected in national form must be mapped into the international
ISCED, ISIC and ISCO classifications respectively.
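Once a concordance has been agreed, the mapping of nationally coded education,
industry and occupation data into the international classifications is a
table lookup. The following sketch assumes a purely hypothetical national
education coding; the actual crosswalk to ISCED must come from the country's
own classification documentation.

    # Sketch of a national-to-ISCED concordance lookup. The national codes
    # and their ISCED equivalents below are hypothetical; the real
    # crosswalk must come from the country's classification documentation.

    NATIONAL_TO_ISCED = {
        "01": "1",  # e.g. national "primary" -> ISCED level 1
        "02": "2",  # e.g. national "lower secondary" -> ISCED level 2
        "03": "3",  # e.g. national "upper secondary" -> ISCED level 3
    }

    def to_isced(national_code):
        """Map a national education code to ISCED, flagging unmapped codes."""
        try:
            return NATIONAL_TO_ISCED[national_code]
        except KeyError:
            # Unmapped codes should go back to the coding team for review
            # rather than being silently defaulted.
            raise ValueError(
                f"no ISCED mapping for national code {national_code!r}")

    print(to_isced("02"))  # -> "2"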
Countries will be responsible for creating a ‘clean’ data file. An international
technical consultant/firm will be required to review the data files from both
the pilot and main surveys to ensure compliance with the prescribed
procedures. If necessary, countries will be obliged to correct any anomalies
detected during the course of the review.
Data Editing
The purpose of data editing is to ensure the production of a Literacy Survey
data file that is error-free. The data file produced at the end of the data
processing phase will be used in subsequent steps such as weighting and
estimation. It is imperative that the data file has been properly edited so
that a 'clean' dataset is available for subsequent processing.
The data editing standards necessary for a literacy assessment are:
1. Countries will perform an edit of their Literacy Survey data file in order
   to identify and resolve errors in the data.
2. The edit of the Literacy Survey data file will include the following
   minimum checks. For each of these edits, if errors are discovered they
   will be resolved, i.e., the original erroneous value will be replaced with
   a corrected value.
   (a) ID check - The record identification numbers on the data file should
       be checked for uniqueness and integrity to ensure that there is only
       one record per respondent on the file, and to ensure that the record
       identification number is unique and in the specified format.
   (b) Range checks - A range check will be carried out for all those
       variables that can only take on specific values.
   (c) Logic checks, i.e., question flows - The data file will be edited to
       check the flow of respondents through the various sections of the
       Background Questionnaire. The objective of this edit is to ensure
       that the responses for respondents who should have skipped a given
       set of questions have been properly coded as a 'valid skip', and that
       there are appropriately coded responses for respondents who should
       have completed a given set of questions.
   (d) Consistency checks - An edit of the data file will be performed to
       identify inconsistencies that may have arisen as a result of response
       errors, coding errors, and data capture errors.
   (e) Outlier check - An edit should be performed to identify possible
       outliers, i.e., extreme quantitative data values. All identified
       outliers will be reviewed for legitimacy and to assess the potential
       effect on the survey estimates.
3. Imputation methods will not be used to treat missing Background
   Questionnaire data, i.e., item non-response and complete non-response.
4. Countries will be required to produce an edited data file, in ASCII
   format, that conforms to the International Record Layout (IRL).
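As an illustration of the ID check described in 2(a) above, the following
sketch verifies uniqueness and format in a single pass over the file's
identifiers. The ten-digit identifier pattern is an assumption made for the
example; the actual format is specified by the IRL.

    import re

    # Sketch of the ID check: one record per respondent, with a unique
    # identifier in the specified format. The ten-digit pattern below is
    # an assumption for the example; the real format comes from the IRL.

    ID_PATTERN = re.compile(r"^\d{10}$")

    def check_ids(record_ids):
        """Return (duplicate, malformed) identifier lists for review."""
        seen, duplicates, malformed = set(), [], []
        for record_id in record_ids:
            if record_id in seen:
                duplicates.append(record_id)
            seen.add(record_id)
            if not ID_PATTERN.match(record_id):
                malformed.append(record_id)
        return duplicates, malformed

    dups, bad = check_ids(["0000000001", "0000000002", "0000000001", "ABC"])
    print(dups, bad)  # -> ['0000000001'] ['ABC']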
Guidelines
1. The checks outlined above should form part of the overall Editing System.
   The Editing System should be run after all data has been captured.
2. The errors to be resolved by the Editing System are those for which a
   correct value can be logically determined from the respondent's record.
   In cases where a respondent does not provide a question response, i.e.,
   item non-response or complete non-response, imputation methods must not
   be used to furnish a response.
3. Range checks should be carried out as part of the overall Editing System.
   Where an invalid value is found on the data set, a correct value should be
   entered to replace it. If such a step is not possible, the field should be
   edited based on other variables for that record. Otherwise, such fields
   should be set to the "Not Stated" code. (An illustrative sketch of these
   edits follows this list.)
4. The process of checking question flows is a manual procedure, typically
   done by studying frequency counts for each of the variables in the
   database. Each of the respondents must be accounted for in each question -
   either by having a valid response or by being coded as a "valid skip" for
   those questions not applicable to them. In general, 'top-down' editing
   rules should be followed for this procedure. This process involves
   path-cleaning based on the answers to questions and the skip patterns
   associated with these answers. Missing responses should be coded as 'not
   stated' and any answers between skips should be blanked out. If a missing
   response occurs in a filter question (single or multiple skips) and, by
   looking forward, one is able to clearly identify the path, one may then
   exercise the option of imputing the missing answer to the value associated
   with that path. Otherwise, one must impute a 'not stated' value for all
   the questions in each of these paths until the common question is reached.
5. Consistency edits are usually specified using a series of edit rules or
   decision rules. For the Literacy Survey these rules must be automated to
   ensure their consistent application. Typically, the specifications are
   programmed, thoroughly tested, and then run on the pre-edited data.
   Consistency edits should be performed on both the Background Questionnaire
   data and on the scores from the assessment Booklets. For the scores from
   the assessment Booklets, one must check that the scores are correctly
   entered into the fields associated with the tests that were administered.
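Guidelines 3 to 5 describe edits that, once specified, can be automated as
simple rules applied to each record. The sketch below illustrates one range
check, one skip-pattern (path-cleaning) edit and one consistency rule on a
hypothetical record; the variable names, valid ranges, reserved codes (96 for
'valid skip', 99 for 'not stated') and rules are invented for the example and
would be replaced by the survey's own edit specifications.

    # Illustrative editing pass combining a range check, a skip-pattern
    # (path-cleaning) edit and a consistency rule. All variable names,
    # ranges, reserved codes and rules here are hypothetical; the real
    # edits come from the survey's own specifications.

    NOT_STATED = 99  # assumed reserved code for "not stated"
    VALID_SKIP = 96  # assumed reserved code for "valid skip"

    def edit_record(rec):
        """Apply the example edits to one record, returning edit messages."""
        messages = []

        # Range check: an out-of-range value is replaced with "Not Stated"
        # when no correct value can be determined from the record itself.
        age = rec.get("AGE")
        if age is not None and age != NOT_STATED and not 16 <= age <= 65:
            messages.append(f"AGE out of range ({age}); set to Not Stated")
            rec["AGE"] = NOT_STATED

        # Skip-pattern edit: if Q10 = 2 means "skip Q11", then Q11 must be
        # coded as a valid skip and any stray answer blanked out.
        if rec.get("Q10") == 2 and rec.get("Q11") != VALID_SKIP:
            messages.append(f"Q11 should be a valid skip (was {rec.get('Q11')})")
            rec["Q11"] = VALID_SKIP

        # Consistency rule: years of schooling cannot exceed age
        # (not applied when age is Not Stated).
        if (rec.get("AGE") != NOT_STATED
                and rec.get("YRSSCHOOL", 0) > rec.get("AGE", 0)):
            messages.append("YRSSCHOOL exceeds AGE; flag for review")

        return rec, messages

    record, notes = edit_record({"AGE": 70, "Q10": 2, "Q11": 3, "YRSSCHOOL": 12})
    for note in notes:
        print(note)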
In the event a country deviates from the international editing standard for its
data set, a description of the editing system must be prepared and reviewed by
an International Consultant prior to implementation.
After the running of the Editing System, countries will be required to prepare relevant
documentation. The Editing System documentation will include the following:
1. a list of all data cleaning rules that have been applied to the data,
2. a list of all edits that have been globally applied to the data,
3. a list of specific edits that have been applied to individual observations,
4. computational definitions for any variables that have been re-coded or
   derived from other variables, and
5. documentation of any deviations from the international standard that
   remain upon completion of the Editing phase.
Countries will prepare a clean data set in the record layout specified by the
International Consultant.
Reference
French, Carl (1996). Data Processing and Information Technology (10th ed.).
Thomson. p. 2. ISBN 1844801004.
Statistics Canada and Organization for Economic Cooperation and Development
(2003). Standards and Guidelines for the Design and Implementation of the
Adult Literacy and Life-skills Survey.
Annex: Scoring Guideline