UNECA ACS ASSD
African Handbook on Census Data Processing, Analysis and Dissemination
St. Georges Hotel, Pretoria
15 November 2009

Data Capture Methods

- Traditional
  - Key from Paper (KFP)
- Scanning model
  - Key from Image (KFI)
  - Optical Mark Recognition (OMR)
  - Optical Character Recognition (OCR)
  - Intelligent Character Recognition (ICR)
  - Intelligent Recognition (IR)
- Internet (IRS)
- Handheld (PDA, laptop, netbook, etc.)

Forms Type (Source)

- Structured
- Semi-Structured
- Unstructured

Scanning Models

Type (Acronym): Description
- Key from Image (KFI): Identical to the key from paper method, except that data is entered from a scanned image.
- Optical Mark Recognition (OMR): Data is produced from response marks on the instrument during or after scanning.
- Optical Character Recognition (OCR): OCR technology recognizes machine-printed characters on an instrument.
- Intelligent Character Recognition (ICR): ICR technology recognizes handwritten characters on an instrument.
- Intelligent Recognition (IR): IR technology recognizes handwritten and cursive characters on an instrument.

OMR

OMR is a technology that allows an input device (e.g. an imaging scanner) to read hand-drawn marks, such as small circles or squares, on specially designed paper.

OMR data is captured by contrasting reflectivity at predetermined positions on a page. The marks are then converted into numbers or letters and entered into the computer.

There are two known methods of applying OMR technology in data processing: form based OMR and image based OMR.

In form based OMR, one works with a specialized document that contains timing tracks along one edge of the form. The timing tracks look like black boxes on the top or bottom of the form and indicate to the scanner where to read for marks.

In image based OMR, the scanned image is run through processing or interpretation engines so that a computer can electronically determine the marks made on the form.
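As a rough illustration of the image based approach, the sketch below decides whether a response box is marked by measuring how much of a predetermined region of the scanned page is dark. It is not any vendor's recognition engine: the grayscale page representation, the field names, the box coordinates and both thresholds are assumptions made for the example.

```python
from typing import Dict, List, Tuple

Image = List[List[int]]          # grayscale pixels, 0 = black, 255 = white
Box = Tuple[int, int, int, int]  # (top, left, bottom, right), bottom/right exclusive


def fill_ratio(image: Image, box: Box, dark_threshold: int = 128) -> float:
    """Return the fraction of pixels inside the box that are darker than the threshold."""
    top, left, bottom, right = box
    pixels = [image[r][c] for r in range(top, bottom) for c in range(left, right)]
    if not pixels:
        raise ValueError("box covers no pixels")
    dark = sum(1 for p in pixels if p < dark_threshold)
    return dark / len(pixels)


def read_marks(
    image: Image,
    boxes: Dict[str, Box],
    fill_threshold: float = 0.4,   # assumed cut-off; would be tuned during testing
    dark_threshold: int = 128,
) -> Dict[str, bool]:
    """Report, for each named response box, whether it counts as marked."""
    return {
        name: fill_ratio(image, box, dark_threshold) >= fill_threshold
        for name, box in boxes.items()
    }


# Synthetic 4 x 8 "page": white everywhere, with a dark stroke in the left box.
page = [[255] * 8 for _ in range(4)]
for r in range(4):
    for c in range(1, 3):
        page[r][c] = 0

# Hypothetical field layout: two response boxes side by side.
boxes = {"sex_male": (0, 0, 4, 4), "sex_female": (0, 4, 4, 8)}
marks = read_marks(page, boxes)
```

Because the box positions are plain data, a new field can be read from stored images simply by adding an entry to `boxes` and running the function again.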
In effect, form based OMR does the ‘reading’ of data at scan time, whilst image based OMR can create data during any subsequent process.

The key difference is that with form based OMR one cannot add fields for interpretation after scanning, whilst with image based OMR fields can be added as and when required.

However, with form based OMR, images can be saved during the scanning process. Any further verification or exceptions management would then require a KFI process.

KFP, KFI, OMR

OMR Advantages and Disadvantages

Advantages
- Form based OMR is a data collection technology that does not require a recognition engine. It is therefore fast, uses minimal processing power to process forms, and its costs are predictable and defined.
- OMR capture speeds are around 4,000 forms per hour, so large volumes can be processed within a short period of time.

Disadvantages
- OMR cannot recognize hand-printed or machine-printed characters.
- Where the scanner does not capture images of the forms, electronic retrieval is not possible.
- Tick boxes may not be suitable for all types of questions.
- If a user wants to gather large amounts of text, OMR can complicate data collection.
- Data can go missing in the scanning process: incorrectly numbered or unnumbered pages can be scanned in the wrong order.
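The last disadvantage, pages scanned in the wrong order, can be caught with a simple completeness check on each batch. The sketch below is only an illustration: it assumes that every scanned page yields a form identifier (for example from a barcode) and a page number, with `None` standing for a page whose number could not be read. The function name and the record layout are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

ScannedPage = Tuple[str, Optional[int]]  # (form id, page number or None if unreadable)


def check_page_order(scanned: List[ScannedPage], pages_per_form: int) -> Dict[str, List[str]]:
    """Return the problems found for each form id, in scan order of first appearance."""
    pages_by_form: Dict[str, List[Optional[int]]] = {}
    for form_id, page_no in scanned:
        pages_by_form.setdefault(form_id, []).append(page_no)

    problems: Dict[str, List[str]] = {}
    for form_id, pages in pages_by_form.items():
        issues: List[str] = []
        if any(p is None for p in pages):
            issues.append("unnumbered page")
        numbered = [p for p in pages if p is not None]
        if numbered != sorted(numbered):
            issues.append("pages out of order")
        missing = sorted(set(range(1, pages_per_form + 1)) - set(numbered))
        if missing:
            issues.append(f"missing pages: {missing}")
        if issues:
            problems[form_id] = issues
    return problems


# Example batch of two-page forms: F001 is fine, F002 is reversed,
# and the second page of F003 has no readable page number.
batch = [
    ("F001", 1), ("F001", 2),
    ("F002", 2), ("F002", 1),
    ("F003", 1), ("F003", None),
]
report = check_page_order(batch, pages_per_form=2)
```

Forms listed in `report` would be pulled aside for manual checking before their data is accepted.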
OMR Best Practices

- The entire process must be tested:
  - Information capture
  - Recognizing
  - Verifying results
- Questionnaire design and preparation is a critical aspect.
- Forms must be easily scannable and in good condition at scan time, otherwise transcription will be required.
- Enumerators must take particular care in filling out questionnaires.
- Completeness and consistency checks must be in place.
- Care must be taken over the condition of the questionnaire (dust, humidity, transportation, etc.).

OMR Lessons Learnt

- OMR, in any form, can be an extremely powerful tool for data processing of large surveys and censuses; however, it needs to be carefully controlled and managed.
- To achieve high accuracy, well structured design and good quality printing of forms is critical. This brings the issue of cost to the fore, as such printing can be extremely costly and geographically limited, since service providers are few and far between.
- Although OMR data is relatively accurate, it is important to do detailed testing and constant review of the data being produced to ensure that the right fields are being read. One way to do this is an independent comparison of OMR read values against KFI based values from the same images.
- Exceptions can also be corrected easily when images are available on hand.

KFI

The actual process of KFI is quite similar to that of KFP, in that the data capturer still enters data manually; however, instead of capturing from a paper form, he/she captures data directly from an image.

KFI Advantages and Disadvantages

Advantages
- Preparatory time
- Online verification
- Minimal time required to implement changes and modifications

A major advantage is that verification of instruments occurs at the time of data entry, so errors and discrepancies can be picked up easily.
However, this advantage can be negated if data entry clerks independently change content on the instrument because constant error messages from the system hamper their performance.

Disadvantages
- Production time
- Keying errors

As with KFP, no computer aided recognition occurs. The data capturer therefore types each and every character as displayed on the questionnaire.

Keying errors are bound to occur because every character of information is captured manually. As capturers try to reach their targets and increase performance, errors will start to creep in.

Entry clerk changes data due to tight validation

If tight validation is put in place that allows the clerk only a set number of values for entry, any inconsistent information will be changed to the easiest value the clerk can select. In this way invalid and out of range data is not consistently edited and corrected, which results in data problems downstream.

Example of multi-type form (OCR, ICR, OMR)

Example of Census Form

OCR/ICR

With scanning technology steadily becoming cheaper and more accessible, and with advances in recognition algorithms, OCR and ICR technology have become the foundation of image and forms processing around the world.

Recognition is done via two primary methods, OCR and ICR. OCR technology recognizes machine-printed characters on a form, whilst ICR technology recognizes handwritten characters on a form.

Reading machine-printed characters with OCR is largely a solved problem, with accuracy mainly between 99 and 100%.

The key difference between OCR and ICR is that OCR is more accurate than ICR, because of the large amount of variation that occurs in handwriting.

Nevertheless, ICR is a great advancement in character recognition, as there is virtually no limit on the types of data that can be collected and converted.
However, this needs to be done with great care and attention to editing and data confrontation to avoid problems.

[Figure: OCR/ICR segmentation of text, Engine A + Engine B; sample values 312 and 2430891]

Types of Recognition Engines

Different types of OCR/ICR/OMR engines are used to recognize characters (numeric or alphanumeric):

Clear Image, ParaScript, KADMOS, TISICR, NESTOR, RecoStar, AEG, EXPERVISION, LIGATURE, A2iA, JustICR

Majority Voting Rules

Engine: Result
- ICR 1: 3
- ICR 2: 3
- ICR 3: 8
- ICR 4: 3

Majority = 3
Unanimous = ?

Alpha Recognition - Voting

Engine: Result
- ICR A: *oshua
- ICR B: Jo*hu*
- ICR C: J*sh*a

VOTING: Joshua

False Positive Marking

OCR/ICR Advantages and Disadvantages

Advantages
- Recognition engines used with imaging can capture highly specialized data sets.
- Engines can be made to learn regional characteristics and their effects on handwriting.
- Large savings on resources (human and machine) due to computer assistance in 80% of keying processes.
- OCR/ICR recognizes machine-printed or hand-printed characters.
- Scanning and recognition allow efficient management and planning for the rest of the processing workload.
- Quick retrieval of images for editing and reprocessing.

Disadvantages
- Technology is costly.
- May require significant manual intervention if not implemented properly.
- Additional workload for enumerators: ICR has severe limitations when it comes to human handwriting.
- Characters must be hand-printed or machine-printed as separate characters in boxes.
- Ineffective when dealing with cursive characters.

OCR/ICR Lessons Learnt

ICR/OCR is a technology that can benefit data processing immensely. However, it must be carefully designed and implemented to avoid problems creeping into the production cycle.

Recognition algorithms have improved over time and continue to get better. However, if handwriting is poor, more data will be sent for correction, resulting in a greater workload for operators.
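The voting rules from the engine slides above can be sketched in a few lines. This is a minimal illustration, not the logic of any engine listed: it assumes each engine returns a string of the same length, that `*` marks a character the engine rejected, and that a value is accepted only when more than half of the engines agree on it. The function names are hypothetical.

```python
from collections import Counter
from typing import Optional, Sequence

REJECT = "*"  # assumed marker for a character an engine could not read


def majority_vote(readings: Sequence[str]) -> Optional[str]:
    """Return the value read by more than half of the engines, or None."""
    votes = Counter(r for r in readings if r != REJECT)
    if not votes:
        return None
    value, count = votes.most_common(1)[0]
    return value if count > len(readings) / 2 else None


def unanimous_vote(readings: Sequence[str]) -> Optional[str]:
    """Return the value only if every engine read the same, non-rejected value."""
    if not readings:
        return None
    first = readings[0]
    if first != REJECT and all(r == first for r in readings):
        return first
    return None


def vote_word(words: Sequence[str]) -> str:
    """Vote character by character; positions without a majority stay rejected."""
    if len({len(w) for w in words}) != 1:
        raise ValueError("all engine outputs must have the same length")
    return "".join(majority_vote(chars) or REJECT for chars in zip(*words))


digit = majority_vote(["3", "3", "8", "3"])          # numeric example from the slide
agreed = unanimous_vote(["3", "3", "8", "3"])        # engines disagree, so no result
name = vote_word(["*oshua", "Jo*hu*", "J*sh*a"])     # alpha example from the slide
```

With the slide's inputs, the majority rule yields 3 while the unanimous rule yields no result, and the three partial readings combine into "Joshua". A stricter rule sends more fields to operators but lowers the risk of accepting a wrong value.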
Forms design and proper printing are key to the success of the process.

Barcodes can play a vital part in providing a unique identification for each form, and instruments should be treated as forms before being treated as households.

OCR/ICR QA/Exceptions

One of the major issues with ICR/OCR is that one places trust in the processing engine to provide data of excellent quality that is a direct reproduction of the instrument.

It is therefore vital to undertake QA processes on any OCR/ICR data to ensure that the conversion process was of adequate quality.

This can be done by a sample based recapture of data in an independent system to ascertain a data quality rate or, as its inverse, the error rate.

This rate can be used as a measure of quality, with the further option of rejection to ensure that only data of an acceptable level is sent through the system.

For exceptions, it has been found that tracking and correcting small numbers of cases through a bulk system can prove problematic, and that it is more advantageous to follow a KFP solution for all exceptions. In this way the bulk production system keeps running and is not hampered by exceptions.

Internet Data Collection

The most common methods of data collection for surveys and censuses are personal interviewing and self enumeration.

The growing number of respondents with access to the Internet introduces a new data collection alternative that is likely to become increasingly important in the future.

Like computer assisted telephone and personal interviewing, computer assisted self interviewing using the Internet permits an interactive exchange with the respondent through intelligence built into the computer application.
While promising, Internet surveys also face a variety of challenges in survey coverage, in survey design, in security of confidential information, and in mastery of new and rapidly changing technologies.

The most important factor in deciding whether internet data collection is a viable alternative is the rate of internet penetration in the respective country.

Some countries have high penetration rates; in Europe, for example, some countries boast penetration rates of between 80 and 90 percent.

In Africa, recent statistics indicate average internet penetration of around 6.7%. Even so, the internet can play an important part in a multi channel data collection system for censuses and surveys.

The functional requirements for Internet questionnaires describe an interactive application in which interview questions are presented to the respondent and actions are taken based on the responses.

The Internet consists of heterogeneous client hardware and software. The software, or browser, supports published and de facto standards which allow Web pages to be displayed and executed on the client computer.

One needs to take care to design an interface that is as simple and adaptable as possible, so that it displays correctly in any browser or web interface.

Since the Internet is a public network, security vulnerabilities exist. They include the following:
- Eavesdropping, i.e. intermediaries can listen in on private conversations;
- Theft, i.e. data stolen during the course of transmission or from a computer or network; and
- Impersonation, i.e. a sender or receiver using a false identity for communication.

The NSO needs to address these issues to provide respondents with a secure and private method of using the Internet for data collection.
Security for Internet data collection has to be addressed at three levels:
(1) the security of communication between the respondent and the NSO;
(2) the security of respondent data at the NSO; and
(3) the security of the NSO network.

Since Web data collection is in its infancy, this is only the beginning.

As Web technology matures, guidelines for Web questionnaire design will be further tested, standardized, and documented.

With these advances and increasing Web skills in the general public, respondents will find Web questionnaires increasingly easy to use.

The ease of use and intuitiveness of a Web questionnaire are important, since we do not have the luxury of training the respondent.

The Web also offers the opportunity to use graphics, audio, and video to improve the overall interview experience for the respondent.

Thank you… I reiterate… We still need your valuable inputs to make this document better….