African Handbook on Census Data Processing, Analysis

UNECA ACS
ASSD
African Handbook on Census Data
Processing, Analysis and Dissemination
St. Georges Hotel, Pretoria
15 November 2009
Data Capture Methods

Traditional
Key from Paper (KFP)

Scanning model
Key from Image (KFI)
Optical Mark Recognition (OMR)
Optical Character Recognition (OCR)
Intelligent Character Recognition (ICR)
Intelligent recognition (IR)

Internet (IRS)
Handheld (PDA, laptop, netbook etc.)
Forms Type (Source)
Structured
Semi-Structured
Unstructured
Scanning Models
Type (Acronym): Description
Key from Image (KFI): Identical to the Key from Paper method, but with data entry from a scanned image.
Optical Mark Recognition (OMR): Data is produced from response marks on the instrument during or post scanning.
Optical Character Recognition (OCR): OCR technology recognizes machine-printed characters on an instrument.
Intelligent Character Recognition (ICR): ICR technology recognizes handwritten characters on an instrument.
Intelligent recognition (IR): IR technology recognizes handwritten and cursive characters on an instrument.
OMR

OMR is a technology that allows an input device (e.g. imaging scanner) to
read hand-drawn marks such as small circles or squares on specially
designed paper. OMR is captured by contrasting reflectivity at
predetermined positions on a page.


OMR information is converted from marks into numbers or letters and
entered into the computer.
There are two known methods of applying OMR technology in data
processing, namely:
Form based OMR, and
Image based OMR
In form based OMR, one works with a specialized document that
contains timing tracks along one edge of the form. These tracks, which
look like black boxes on the top or bottom of the form, indicate to the
scanner where to read for marks.
In image based OMR, the scanned image is run through processing or
interpretation engines so that a computer can electronically determine
the marks made on the form.
In effect, form based OMR does the 'reading' of data at scan time,
whilst image based OMR can create the data during any
subsequent process.
The key difference is that with form based OMR one cannot add fields for
interpretation after scanning, whilst with image based OMR these can
be added as and when required. However, with form based OMR,
images can be saved during the scanning process; these would then require a
KFI process for any further verification or exceptions management.
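The image based approach can be illustrated with a minimal sketch. It assumes a scanned page is available as a grid of darkness values (0 for white, 255 for black) and that the positions of the answer boxes are known in advance. The `TEMPLATE` layout, the field name and the threshold are hypothetical and are not taken from any particular OMR product.

```python
from statistics import mean

# Hypothetical layout: field name -> {answer: (row, col, height, width)}.
# These boxes stand in for the predetermined positions on the page.
TEMPLATE = {
    "sex": {"M": (0, 0, 2, 2), "F": (0, 3, 2, 2)},
}


def darkness(image, box):
    """Mean darkness of a box; pixels run from 0 (white) to 255 (black)."""
    row, col, height, width = box
    return mean(
        image[r][c]
        for r in range(row, row + height)
        for c in range(col, col + width)
    )


def read_field(image, options, threshold=128):
    """Return the single marked answer, or None for blank or multiple marks."""
    marked = [
        answer
        for answer, box in options.items()
        if darkness(image, box) >= threshold
    ]
    return marked[0] if len(marked) == 1 else None


def read_form(image, template=TEMPLATE):
    """Interpret every field in the template from one scanned image."""
    return {
        field: read_field(image, options)
        for field, options in template.items()
    }


# A tiny 2x5 "scan" in which only the F box is filled in.
scan = [
    [0, 0, 0, 255, 255],
    [0, 10, 0, 240, 255],
]
print(read_form(scan))
```

Because the template is applied to a stored image, a new field can be added to `TEMPLATE` and the same image read again, which is the flexibility attributed to image based OMR above. Fields that come back as `None` are natural candidates for a KFI exceptions step.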
KFP, KFI, OMR
OMR Advantages and Disadvantages

Advantages



Form based OMR is a data collection technology that does not
require a recognition engine. It is therefore fast, uses minimal
processing power to process forms, and its costs are predictable
and defined.
OMR capture speeds are around 4000 forms per hour, so large
volumes can be processed within a short period of time.
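As a rough planning aid, the quoted capture speed can be turned into an estimate of scanning time. The sketch below assumes the figure of about 4000 forms per hour applies per scanner and ignores downtime, rescans and form preparation; the form count and the number of scanners in the example are hypothetical.

```python
def scanning_hours(total_forms, scanners, forms_per_hour=4000):
    """Estimated scanning hours, assuming every scanner runs at the same rate."""
    return total_forms / (scanners * forms_per_hour)


# Hypothetical workload: 2 million forms on 5 scanners.
print(scanning_hours(2_000_000, 5))  # 100.0
```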
Disadvantages





OMR cannot recognize hand-printed or machine-printed
characters.
With OMR, images of forms are not captured by scanners so
electronic retrieval is not possible.
Tick boxes may not be suitable for all types of questions.
If a user wants to gather large amounts of text, then OMR can
complicate data collection.
There is also the possibility of missing data in the scanning
process: incorrectly numbered or unnumbered pages can be
scanned in the wrong order.
OMR Best Practices

The entire process must be tested:








Information Capture
Recognizing
Verifying Results
Questionnaire design and preparation is a critical aspect.
Forms must be easily scannable and in good condition at
scan time; otherwise transcription will be required.
Enumerators must take particular care in filling out
questionnaires.
Completeness and consistency checks must be in place.
Care must be taken over the condition of the
questionnaire (dust, humidity, transportation, etc.).
OMR Lessons Learnt




OMR, in any form, can be an extremely powerful tool for
data processing of large surveys and censuses; however, it
needs to be carefully controlled and managed.
To achieve high accuracy, well structured design and good
quality printing of forms is critical. This brings to the
fore the issue of cost, as such printing can be extremely costly
and geographically limited, since service providers are few and far
between.
Although OMR data is relatively accurate, it is important to do
detailed testing and constant review of the data being produced to
ensure that the right fields are being read. One method is
an independent comparison of OMR read
values against KFI based values from the same images.
Exceptions can also be easily corrected when images are available
on hand for correction.
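The comparison of machine-read and keyed values can be sketched as a per-field disagreement rate. This is only an illustration: the record structure (form ID mapped to field values), the function name `field_mismatch_rates` and the sample records are assumptions, not part of any system described here.

```python
from collections import Counter


def field_mismatch_rates(machine_records, keyed_records):
    """Share of compared forms, per field, where the two values differ.

    Both arguments map a form ID to a dict of field name -> value.
    Forms or fields that were not re-keyed are skipped.
    """
    compared = Counter()
    mismatched = Counter()
    for form_id, machine in machine_records.items():
        keyed = keyed_records.get(form_id)
        if keyed is None:
            continue  # this form was not in the re-keyed sample
        for field, machine_value in machine.items():
            if field not in keyed:
                continue
            compared[field] += 1
            if keyed[field] != machine_value:
                mismatched[field] += 1
    return {field: mismatched[field] / compared[field] for field in compared}


# Hypothetical sample of two forms read by the scanner and re-keyed from image.
machine = {
    "F001": {"sex": "F", "age_group": "3"},
    "F002": {"sex": "M", "age_group": "5"},
}
keyed = {
    "F001": {"sex": "F", "age_group": "3"},
    "F002": {"sex": "M", "age_group": "6"},
}
print(field_mismatch_rates(machine, keyed))
```

A field with a persistently high rate suggests that the wrong position is being read or that the form design is causing problems for that question.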
KFI

The actual process of KFI is quite similar to that of KFP in that
the data capturer still enters data manually; however, instead of
capturing from a paper form, he/she captures data directly from an
image.
KFI Advantages and Disadvantages

Advantages
Preparatory time: Minimal time is required to implement changes and modifications.
Online verification: A major advantage is that verification of instruments occurs at the time of
data entry, so errors and discrepancies can be picked up easily.
However, this can be negated if data entry clerks independently change
content on the instrument because constant error messages from the system
hamper their performance.

Disadvantages
Production time: As with KFP, no computer aided recognition occurs. The data
capturer therefore types each and every character as displayed on the questionnaire.
Keying errors: Keying errors are bound to occur as each and every character of information is
captured manually. As capturers try to reach their targets and increase
performance, errors will start to creep in.
Entry clerk changes data due to tight validation: If tight validation is put in place,
allowing the clerk only a set number of values for entry, any inconsistent information
will be changed to the easiest value the clerk can select. Invalid and out of range data
is then not consistently edited and corrected, which results in data problems downstream.
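One possible way to reduce the incentive for clerks to alter values is to store what was keyed and flag it for later editing, rather than blocking entry. The sketch below illustrates that idea only; it is not a method prescribed in this presentation, and the field names and ranges in `VALID_RANGES` are hypothetical.

```python
# Hypothetical validation rules: field name -> (lowest, highest) valid value.
VALID_RANGES = {"age": (0, 120), "household_size": (1, 50)}


def capture(field, keyed_value, ranges=VALID_RANGES):
    """Keep the value exactly as keyed and flag it if it is out of range."""
    low, high = ranges[field]
    return {
        "field": field,
        "value": keyed_value,
        "needs_edit": not (low <= keyed_value <= high),
    }


print(capture("age", 130))
print(capture("age", 35))
```

Flagged records can then be routed to a consistent editing step instead of being resolved differently by each clerk.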
[Figure: Example of multi-type form, with OCR, ICR and OMR areas labelled]
[Figure: Example of Census Form]
OCR/ICR

With scanning technology steadily becoming cheaper and
more accessible, and with advances in the development of
recognition algorithms, character recognition has become
the foundation of image and forms processing around the
world. It takes two primary forms, OCR and ICR.


OCR technology recognizes machine-printed characters on a
form, whilst ICR technology recognizes handwritten characters on
a form. The problem of reading machine-printed characters has
largely been solved, as OCR accuracy is mainly between 99 and 100%.
The key difference between OCR and ICR is that OCR is more
accurate than ICR, due to the large amount of variation that
occurs in handwriting. Nevertheless, ICR is a great advancement in
character recognition, as there is virtually no limit on the types of
data that can be collected and converted. This does, however, need to be
done with great care and attention to editing and data
confrontation to avoid problems.
[Figure: OCR/ICR segmentation of text, combining Engine A + Engine B; sample value 312 2430891]
Types of Recognition Engines

Different types of OCR/ICR/OMR engines are used to
recognize characters (numeric or alpha-numeric).
Clear Image
ParaScript
KADMOS
TISICR
NESTOR
RecoStar
AEG
EXPERVISION
LIGATURE
A2iA
JustICR
Majority Voting Rules: Engines
ICR 1: 3
ICR 2: 3
ICR 3: 8
ICR 4: 3
Majority = 3
Unanimous = ?
Alpha Recognition - Voting
ICR A: *oshua
ICR B: Jo*hu*
ICR C: J*sh*a
Voting result: Joshua
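The two voting examples can be reproduced with a short sketch. It assumes each engine returns a string of the same length and uses `*` for a character it could not recognize; the function names and the tie rule (a tie counts as unrecognized) are illustrative choices, not the rules of any particular engine.

```python
from collections import Counter


def majority_vote(readings, min_votes):
    """Return the most common reading if it has at least min_votes, else None."""
    value, votes = Counter(readings).most_common(1)[0]
    return value if votes >= min_votes else None


def vote_characters(readings, unknown="*"):
    """Vote position by position across engines, ignoring unrecognized marks.

    Assumes all readings have the same length. A position with no
    recognized character, or with a tie, stays unknown.
    """
    result = []
    for chars in zip(*readings):
        top = Counter(c for c in chars if c != unknown).most_common(2)
        if not top or (len(top) == 2 and top[0][1] == top[1][1]):
            result.append(unknown)
        else:
            result.append(top[0][0])
    return "".join(result)


digit_readings = ["3", "3", "8", "3"]
print(majority_vote(digit_readings, min_votes=3))  # majority rule
print(majority_vote(digit_readings, min_votes=4))  # unanimous rule
print(vote_characters(["*oshua", "Jo*hu*", "J*sh*a"]))
```

Under a majority rule of three votes, the digit is accepted as 3. Under a unanimous rule the function returns `None`, so the field would go to an operator for correction.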
False Positive Marking
OCR/ICR Advantages and Disadvantages

Advantages







Recognition engines used with imaging can capture highly
specialized data sets
Engines can be made to learn regional characteristics and their
effects on handwriting.
Large savings on resources (human and machine) due to computer
assistance in 80% of keying processes.
OCR/ICR recognizes machine-printed or hand-printed characters.
Scanning and recognition allow efficient management and
planning for the rest of the processing workload.
Quick retrieval of images for editing and reprocessing.
Disadvantages





Technology is costly
May require significant manual intervention if not implemented
properly
Additional workload for enumerators: ICR has severe limitations
when it comes to human handwriting.
Characters must be hand-printed or machine-printed as separate
characters in boxes.
Ineffective when dealing with cursive characters.
OCR/ICR Lessons Learnt




ICR/OCR is a technology that can benefit data processing
immensely. However, it must be carefully designed and
implemented to avoid problems creeping into the production
cycle.
Algorithms have improved over time and continue to get
better; however, if handwriting is poor, more data will be
sent for correction, resulting in a greater workload
for operators.
Forms design and proper printing are key to the success of the
process.
Barcodes can play a vital part in providing a unique description
of the form, and instruments should be treated as forms before
being treated as households.
OCR/ICR QA/Exceptions



One of the major issues with ICR/OCR is that one places
trust in the processing engine to provide data that is of
excellent quality and is a direct reproduction of the instrument.
It is therefore vital to undertake QA processes on any OCR/ICR
data to ensure that the conversion process was of adequate
quality. This can be done by a sample based recapture of data in
an independent system to ascertain a data quality rate or,
inversely, the error rate. This can be used as a measure of
quality, with the further option of rejection, to ensure that only
data of an acceptable level is sent through the system.
For exceptions, it has been found that tracking and correcting
small cases through a bulk system can prove to be problematic,
and that it is more advantageous to follow a KFP solution for all
exceptions. In this way, the bulk production system runs and is not
hampered by exceptions.
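The sample based check can be sketched as follows. The function names, the sample values and the 2% rejection threshold are hypothetical; an actual threshold would be set by the statistical office.

```python
def error_rate(engine_values, recaptured_values):
    """Share of sampled fields where the engine differs from the recapture."""
    if not recaptured_values or len(engine_values) != len(recaptured_values):
        raise ValueError("samples must be non-empty and of equal length")
    errors = sum(
        1 for engine, recaptured in zip(engine_values, recaptured_values)
        if engine != recaptured
    )
    return errors / len(recaptured_values)


def accept_batch(engine_values, recaptured_values, max_error_rate=0.02):
    """Accept the batch only if the sampled error rate is within the limit."""
    return error_rate(engine_values, recaptured_values) <= max_error_rate


# Hypothetical sample of four fields: one disagreement out of four.
engine = ["12", "07", "45", "30"]
recaptured = ["12", "01", "45", "30"]
print(error_rate(engine, recaptured))    # 0.25
print(accept_batch(engine, recaptured))  # False
```

The data quality rate mentioned above is simply one minus this error rate. A rejected batch would be sent back for reprocessing or full recapture.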
Internet Data Collection



The most common methods of data collection for surveys
and censuses are personal interviewing and self
enumeration. The growing number of respondents with
access to the Internet introduces a new data collection
alternative that is likely to become increasingly important in
the future.
Like computer assisted telephone and personal
interviewing, computer assisted self interviewing using the
Internet permits an interactive exchange with the
respondent through intelligence built into the computer
application.
While promising, Internet surveys also face a variety of
challenges in survey coverage, in survey design, in security
of confidential information, and in mastery of new and
rapidly changing technologies


The most important factor in deciding whether
Internet data collection is a viable
alternative is the rate of Internet penetration in the
respective country.
Some countries have high penetration rates, as
in Europe, where some countries boast penetration
rates of between 80 and 90 percent. In
Africa, recent statistics indicate average
Internet penetration of around 6.7%; even so, the Internet
can play an important part in a multi-channel data
collection system for censuses and surveys.


The functional requirements for Internet
questionnaires describe an interactive application
in which interview questions are presented to the
respondent and actions are taken based on the
responses.
The Internet consists of heterogeneous client
hardware and software. The software, or browser,
supports published and de facto standards which
allow Web pages to be displayed and executed on
the client computer. The interface should therefore
be designed to be as simple and adaptable as
possible, so that it displays correctly on
any browser or web interface.
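The idea of taking actions based on responses can be sketched as simple skip logic. The questions, their identifiers and the routing rules below are hypothetical examples, and the sketch walks through prepared answers rather than a real Web form.

```python
# Hypothetical questionnaire: each question names the next one to ask,
# depending on the answer given (None ends the interview).
QUESTIONS = {
    "employed": {
        "text": "Did you work in the last 7 days?",
        "next": lambda answer: "occupation" if answer == "yes" else "seeking_work",
    },
    "occupation": {
        "text": "What is your main occupation?",
        "next": lambda answer: None,
    },
    "seeking_work": {
        "text": "Did you look for work in the last 4 weeks?",
        "next": lambda answer: None,
    },
}


def run_interview(answers, start="employed"):
    """Return the (question, answer) pairs actually asked for these answers."""
    asked = []
    current = start
    while current is not None:
        answer = answers[current]
        asked.append((current, answer))
        current = QUESTIONS[current]["next"](answer)
    return asked


print(run_interview({"employed": "no", "seeking_work": "yes"}))
```

A respondent who answers "no" to the first question is never shown the occupation question, which is the kind of built-in intelligence a paper self-enumeration form cannot enforce.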

Since the Internet is a public network,
security vulnerabilities exist. They include
the following:




Eavesdropping, i.e., intermediaries can listen in on
private conversations;
Theft, i.e., data stolen during the course of transmission or
from a computer or network; and
Impersonation, i.e., a sender or receiver using a false identity
for communication.
The NSO needs to address these issues
to provide respondents with a secure and
private method to use the Internet for data
collection.

Security for Internet data collection
has to be addressed at three levels:
(1) the security of communication
between the respondent and the NSO;
(2) the security of respondent data at
the NSO; and
(3) the security of the NSO network.
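For the first level, encrypted and authenticated transport is the usual starting point. The sketch below uses Python's standard `ssl` module to build a client-side context that verifies the server certificate and host name and refuses protocol versions older than TLS 1.2. The function name `make_client_context` is hypothetical, and the sketch does not address the second and third levels.

```python
import ssl


def make_client_context():
    """Client TLS settings that guard against eavesdropping and impersonation."""
    # The default context verifies the server certificate and host name.
    context = ssl.create_default_context()
    # Refuse protocol versions older than TLS 1.2.
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context


context = make_client_context()
print(context.verify_mode == ssl.CERT_REQUIRED, context.check_hostname)
```

Certificate verification addresses impersonation of the NSO server, while encryption of the channel addresses eavesdropping and theft in transit.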






Since Web data collection is in its infancy, this is
only the beginning.
As Web technology matures, guidelines for Web
questionnaire design will be further tested,
standardized, and documented.
With these advances and increasing Web skills in
the general public, respondents will find Web
questionnaires increasingly easy to use.
The ease of use and intuitiveness of a Web
questionnaire is important since we do not have
the luxury of training the respondent.
The Web also offers the opportunity to use
graphics, audio, and video to improve the overall
interview experience for the respondent.
Thank you…
I reiterate…
We still need your valuable inputs to
make this document better….