COMMON FRAMEWORK FOR A LITERACY SURVEY PROJECT

Literacy Survey Data Processing Guidelines

May 2014

<< Insert other relevant information/Logos on cover page of Guide>>

Table of Contents

Preface
Data Processing – An Overview
Data Capture Standards
Coding Standards
Data-File Creation Standards
Data Editing
Reference
Annex: Scoring Guideline

Preface

The Data Processing Guide builds upon the project “Common Framework for a Literacy Survey”, which was executed by the Caribbean Community (CARICOM) Secretariat with funding provided by the Inter-American Development Bank (IDB) Regional Public Goods Facility. The main aim of the project was to design a common approach to the measurement of literacy across countries. This common framework is built upon international methodologies, fundamentally the International Survey of Reading Skills (ISRS), that enable more reliable measurement of literacy than presently exists in the Region. The literacy assessment is designed to measure functional literacy.
In other words, it will determine an individual’s literacy level by employing a series of questions designed to demonstrate the use of his or her literacy skills. This involves two steps – the objective testing of an adult’s skill level and the application of a proficiency standard that defines the level of mastery achieved. The assessment measures the proficiency of respondents on three continuous literacy scales – prose, document and numeracy. In addition, it will collect information on reading component skills. Component skills are thought to be the building blocks upon which the emergence of reading fluency is based. Information on the reading component skills will be collected only from people at the lower end of the literacy scale.

The testing phase is preceded by a selection phase, which includes the administration of a Background or Household Questionnaire; once the respondent has been selected from the household, an initial pre-assessment is undertaken using a filter test booklet to determine what type of assessment should be carried out in the testing phase.

A consultant, Mr. Scott Murray of Canada, was hired to provide services on this project. The CARICOM Secretariat (including the Regional Statistics and the Human and Social Development Directorates) and the CARICOM Advisory Group on Statistics (AGS) were instrumental in the execution of the project throughout all phases. In addition, Member States and some Associate Members participated in the technical rollout of the instruments and documents.

This Data Processing Guide is aimed at providing <<country undertaking a Literacy Survey>> with guidelines that can enable the effective processing of the data obtained from the literacy survey as recommended under the IDB-funded CARICOM project.
Data Processing – An Overview

Data processing is, broadly, "the collection and manipulation of items of data to produce meaningful information" (French, 1996). Data processing may involve various processes, including:

Validation – ensuring that supplied data is "clean, correct and useful."
Sorting – "arranging items in some sequence and/or in different sets."
Summarization – reducing detailed data to its main points.
Aggregation – combining multiple pieces of data.
Analysis – the "collection, organization, analysis, interpretation and presentation of data."
Reporting – listing detail or summary data or computed information.

Therefore, data processing refers to all operations performed after data collection, from the time the questionnaires and assessment booklets are returned from the field through to the final publication and dissemination of results. Data processing operations should be as simple and straightforward as possible. In addition, it is necessary to be able to solve problems as they arise. Planning of and preparation for the data processing tasks are essential to a successful literacy survey operation; hence, the initial phase should coincide with the planning and preparation of the non-data-processing phases of the survey. Plans for data processing should be formulated as an integral part of the overall plan of the survey, and those responsible for the processing of the survey data should be involved from the inception of the planning process. Data processing will be required in connection with various survey activities, including analysis of the data from the survey pilot test, compilation of preliminary results, preparation of tabulations, evaluation of survey results, analysis of main survey data, arrangements for storage in and retrieval from a database, and identification and correction of errors.
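The basic operations defined above can be illustrated with a short sketch. This is purely illustrative: the record fields (age, prose_score) and the validation range are hypothetical examples, not part of the survey specification.

```python
# Illustrative sketch of the basic data-processing operations:
# validation, sorting, summarization/aggregation. Field names and
# the allowable age range are hypothetical.

records = [
    {"id": "001", "age": 34, "prose_score": 251},
    {"id": "002", "age": 57, "prose_score": 198},
    {"id": "003", "age": -4, "prose_score": 305},   # invalid age
]

# Validation: keep only records whose values fall within allowable ranges.
valid = [r for r in records if 15 <= r["age"] <= 99]

# Sorting: arrange the valid records in sequence by score.
valid.sort(key=lambda r: r["prose_score"])

# Summarization/aggregation: reduce the detail data to a single figure.
mean_score = sum(r["prose_score"] for r in valid) / len(valid)

print(len(valid))    # 2 records survive validation
print(mean_score)    # 224.5
```

A reporting step would then list the validated detail records or the computed summary figures.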
In addition, data-processing technologies are playing an increasing role in the planning and control of field operations and other aspects of survey and census administration. Data processing has an impact on almost all aspects of the survey operation, ranging from the design of the questionnaire to the analysis of the final results. Therefore, data-processing requirements in terms of personnel, space, equipment and software (computer programs) need to be considered at an early stage in the planning process.

Careful planning and control are required to ensure an uninterrupted flow of work through the various stages, from receipt of the survey questionnaires, assessment booklets and scoring sheets through preparation of the database and final tabulations. The plan should provide for the computer edit to follow closely the coding/checking/recording of the data so that errors can be detected while knowledge related to them is fresh and appropriate remedial action may be taken.

Countries are required to capture and process their assessment and background questionnaires. In the computer-based collection option, virtually all of these activities are managed by the collection software. Countries will also be required to code and link to the master database any open-ended fields, including industry and occupation.

This Data Processing Guidelines document provides details on the processing standards and guidelines required for the capture, coding, file creation and editing of the literacy survey data.

Data Capture Standards

The purpose of data capture standards is to ensure that the data captured from the survey instruments are captured using uniform methods and are as free of capture errors as possible.
For paper and pencil assessments, a manual data capture method is preferred, since data capture errors, usually key-entry operator errors, are easier to identify and control. Scanning methods have been found to introduce errors different from those of manual data capture methods. For example, scanning errors may occur because of the scanner's failure to correctly read the information provided (e.g. reading 4 as 9). Additionally, the data capture system must be fully tested prior to the commencement of data capture. In addition to a fully-tested data capture system, sound quality control procedures, such as 100% verification of the data capture, will help ensure that the dataset is free of data capture errors.

The data capture standards required for a literacy assessment are:

1. The responses from the Background Questionnaire will be manually keyed from the completed questionnaire, or captured by the interviewer using a computer-assisted method.

2. The data capture specifications and system must be tested before implementation.

3. Data capture of the Background Questionnaire (paper and pencil), Filter Booklet, Locator Booklet and Main Task Booklets will be 100% verified.

Guidelines

1. If data collection is done via the paper and pencil method, a manual data capture method should be used to capture the assessment booklets’ responses. Likewise, if the scoring of the assessment booklets is done directly on the booklets, then a manual data capture method should be used to capture the scores.

2. Although a manual data capture method is preferred for the paper and pencil collection method, an optical scanning method may be used to capture the Background Questionnaire responses, provided that a modern scanning technology is utilized and the scanning method has been fully tested using the Background Questionnaire. Similarly, an optical scanning method may be used to capture the scoring data.
It is also acceptable to employ a computer-assisted method to capture the Background Questionnaire information, but this method should not be used for capturing the assessment booklet responses if the paper and pencil data collection method is used.

3. If a computer-assisted data collection method is to be used for the Background Questionnaire, it is imperative that the system be thoroughly tested before implementation.

4. The testing of the data capture system involves a thorough review of the programming specifications prior to the development of the computer programming code, and the subsequent testing of the programs prior to the start of the data capture operation. Testing is carried out by preparing mock survey instruments (Background Questionnaire and Scoring Sheets), passing them through the data capture system, and then reviewing the resultant data file outputs. When satisfactory data capture results are obtained, the capture of the live survey data can commence.

5. In a 100% data capture verification operation, the data capture is done twice, by two different operators. In a fully automated system, the second operator will be alerted when his/her input does not match that of the original. In these cases, the second operator will check the questionnaire or the assessment booklet scoring sheet and enter the correct value.

Before the data capture operation takes place, countries will have to provide documentation relevant to this operation. The documentation should include information on the data capture software and system to be used, a description of the data verification process, a description of the data capture system testing procedures, and information on how the 100% data verification will be implemented. For quality assurance, the document prepared relative to the data capture operation will have to be reviewed by an international expert before the operation is executed. The data capture specifications will be based on a standard record layout for the data file so as to enable international comparative analysis.

Coding Standards

The purpose of coding standards relative to the literacy survey is to ensure that the coding of the questionnaire and assessment booklets is performed in a uniform way within and across countries and with an acceptable level of quality.

The coding standards required for a literacy assessment are:

1. The data from the Background Questionnaire and the assessment booklets will be coded as specified by the International Record Layout (IRL).

2. The following codebooks will be used to code education, occupation and industry information respectively from the Background Questionnaire:

(a) The most recent ‘International Standard Classification of Education (ISCED)’ will be used to code the education variable, i.e., ‘highest level of education/schooling’.

(b) ‘International Standard Classification of Occupations (ISCO) Job Titles’ will be used to code the occupation variable.

(c) ‘International Standard Industrial Classification of All Economic Activities, Third Revision’ will be used to code the industry variable.

3. Data that have been manually coded will be 100% verified by another coder. The average error rate for manually coded data must not exceed 6%.

Guidelines

1. Countries should train approximately five (5) coders. These coders should preferably have extensive experience in coding education, industry and occupation data from censuses or other large-scale surveys. Training materials for coders should consist of a master set of descriptions with associated expert codes for the data to be coded. By the end of the coders’ training sessions, the coder error rate should not exceed 6%.

2. Some countries may opt to utilize software for automated coding.
However, since automated coding software is rarely able to successfully code 100% of the data, a manual coding operation is still necessary. In this case, fewer manual coders may be required.

Countries will be required to prepare a document that includes a full description of the coding system and the coding quality control procedures.

Data-File Creation Standards

The purpose of the data-file creation standards is to ensure that countries create their data files in the format that is required for the estimation and tabulation of survey results. The data file provided by each country will be used for producing survey estimates and for tabulating and publishing survey results. It is imperative that the data be provided in the same format so that processing delays and errors are avoided. An international consultant/firm may be required to assist in the database development.

Standard and Guidelines

1. The literacy survey data file will be created according to the International Record Layout (IRL). An international consultant/firm may be required to assist in this process. Deviations from the prescribed record format must be reconciled before the test data can be scaled and before the survey estimates can be produced.

Database development

Countries will be responsible for creating and validating their data files by following prescribed procedures for both the pilot survey and the main survey. A standard codebook will be used in processing the survey data and for creating a database with a standard record layout. Countries will be required to follow the specifications for creating their data files and to ensure their compliance with the prescribed format. All paper and pencil psychometric test data must be key-entered with 100 percent verification. The captured background questionnaire data must be subjected to statistical verification to ensure an outgoing quality level of 99.5%.
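In practice, 100-percent key-entry verification amounts to comparing two independently keyed copies of the same instrument and flagging every mismatching field for a check against the paper original. A minimal sketch of that comparison, with hypothetical field names, might look like this:

```python
# Minimal sketch of 100% key-entry verification: the same questionnaire
# is keyed twice by different operators, and every field on which the two
# passes disagree is flagged so the operator can recheck the paper source.
# Record ID and question field names are hypothetical.

first_pass  = {"id": "0457", "q1": "2", "q2": "5", "q3": "1"}
second_pass = {"id": "0457", "q1": "2", "q2": "3", "q3": "1"}

# Collect every field where the two keying passes disagree.
mismatches = [field for field in first_pass
              if first_pass[field] != second_pass.get(field)]

for field in mismatches:
    print(f"ID {first_pass['id']}: field {field} keyed as "
          f"{first_pass[field]!r} vs {second_pass[field]!r} - recheck source")
```

In a production system this comparison would run interactively at entry time, alerting the second operator the moment a keyed value differs from the first pass, as described in the data capture guidelines above.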
Countries will be required to develop their own data capture and data editing programs, since the background questionnaires and data capture methods may vary by country. All data files must be edited to ensure that each record has a unique identification number, all data fall within allowable ranges, each respondent has answered all applicable questions, the question responses are consistent with the responses to other questionnaire items where applicable, and extreme data values are identified and dealt with. Countries will be obliged to correct any anomalies identified in the editing process. The edit should flag missing information and remove data that do not respect the skip patterns in the background questionnaire. Data on education, industry and occupation collected in national form must be mapped into the international ISCED, ISIC and ISCO classifications respectively. Countries will be responsible for creating a ‘clean’ data file.

An international technical consultant/firm will be required to review the data files from both the pilot and main surveys to ensure compliance with the prescribed procedures. If necessary, countries will be obliged to correct any anomalies detected during the course of the review.

Data Editing

The purpose of data editing is to ensure the production of a Literacy Survey data file that is error-free. The data file produced at the end of the data processing phase will be used in subsequent steps such as weighting and estimation. It is imperative that the data file be properly edited so that a ‘clean’ dataset is available for subsequent processing.

The data editing standards necessary for a literacy assessment are:

1. Countries will perform an edit of their Literacy Survey data file in order to identify and resolve errors in the data.

2. The edit of the Literacy Survey data file will include the following minimum checks.
For each of these edits, if errors are discovered they will be resolved, i.e., the original erroneous value will be replaced with a corrected value.

(a) ID check – The record identification numbers on the data file should be checked for uniqueness and integrity, to ensure that there is only one record per respondent on the file and that each record identification number is unique and in the specified format.

(b) Range checks – A range check will be carried out for all those variables that can only take on specific values.

(c) Logic checks, i.e., question flows – The data file will be edited to check the flow of respondents through the various sections of the Background Questionnaire. The objective of this edit is to ensure that the responses for respondents who should have skipped a given set of questions have been properly coded as a ‘valid skip’, and that there are appropriately coded responses for respondents who should have completed a given set of questions.

(d) Consistency checks – An edit of the data file will be performed to identify inconsistencies that may have arisen as a result of response errors, coding errors and data capture errors.

(e) Outlier check – An edit should be performed to identify possible outliers, i.e., extreme quantitative data values. All identified outliers will be reviewed for legitimacy and to assess their potential effect on the survey estimates.

3. Imputation methods will not be used to treat missing Background Questionnaire data, i.e., item non-response and complete non-response.

4. Countries will be required to produce an edited data file, in ASCII format, that conforms to the International Record Layout (IRL).

Guidelines

1. The checks outlined above should form part of the overall Editing System. The Editing System should be run after all data have been captured.

2. The errors to be resolved by the Editing System are those for which a correct value can be logically determined from the respondent's record.
In cases where a respondent does not provide a question response, i.e., item non-response or complete non-response, imputation methods must not be used to furnish a response.

3. Range checks should be carried out as part of the overall Editing System. Where an invalid value is found on the data set, a correct value should be entered to replace it. If such a step is not possible, the field should be edited based on other variables for that record. Otherwise, such fields should be set to the "Not Stated" code.

4. The process of checking question flows is a manual procedure, typically done by studying frequency counts for each of the variables in the database. Each respondent must be accounted for in each question, either by having a valid response or by being coded as a "valid skip" for those questions not applicable to them. In general, ‘top-down’ editing rules should be followed for this procedure. This process involves path-cleaning based on the answers to questions and the skip patterns associated with those answers. Missing responses should be coded as ‘not stated’ and any answers between skips should be blanked out. If a missing response occurs in a filter question (single or multiple skips) and if by looking forward one is able to clearly identify the path, one may then exercise the option of imputing the missing answer to the value associated with that path. Otherwise, one must impute a ‘not stated’ value for all the questions in each of these paths until the common question is reached.

5. Consistency edits are usually specified using a series of edit rules or decision rules. For the Literacy Survey, these rules must be automated to ensure their consistent application. Typically, the specifications are programmed, thoroughly tested, and then run on the pre-edited data. Consistency edits should be performed on both the Background Questionnaire data and on the scores from the assessment booklets.
For the scores from the assessment booklets, one must check that the scores are correctly entered into the fields associated with the tests that were administered.

In the event that a country deviates from the international editing standard for its data set, a description of the editing system must be prepared and reviewed by an International Consultant prior to implementation.

After the running of the Editing System, countries will be required to prepare the relevant documentation. The Editing System documentation will include the following:

1. a list of all data cleaning rules that have been applied to the data,
2. a list of all edits that have been globally applied to the data,
3. a list of specific edits that have been applied to individual observations,
4. computational definitions for any variables that have been re-coded or derived from other variables, and
5. documentation of any deviations from the international standard that remain upon completion of the Editing phase.

Countries will prepare a clean data set in the record layout specified by the International Consultant.

Reference

French, Carl (1996). Data Processing and Information Technology (10th ed.). Thomson. p. 2. ISBN 1844801004.

Statistics Canada and Organization for Economic Cooperation and Development (2003). Standards and Guidelines for the Design and Implementation of the Adult Literacy and Life-skills Survey.

Wikipedia. "Data processing". http://en.wikipedia.org/wiki/Data_processing

Annex: Scoring Guideline