Computer-Assisted Coding of Text CASCOT Software demonstration Rob Jones and Peter Elias

Structure of presentation
• Background – manual text coding
• Development of software – history, aims
• CASCOT – demonstration
Coding text to a classification
• Coding is the process of assigning each of the
possible answers to one of a pre-defined set of
categories.
• The full set of categories is termed a
classification. Examples are:
– SOC 2000 (Standard Occupational Classification 2000)
– SIC 92 (Standard Industrial Classification 1992)
• Three parts to a classification: the structure, the
index and the classification rules
Text responses in surveys
Q: What is your job title?
Q: Briefly describe your duties.
Q: What does the organisation you work for mainly make or do?
Manual coding procedures
• Manual methods
– code books;
– temporary labour;
– query resolution systems.
• No standardised approach, major
variations between institutions,
companies, etc. in quality of coding.
• Time-consuming, expensive.
Development of software
CASOC: Pascal/C++ text coding software
for DOS 1993 – 2001.
CASCOT: Java text coding software for any
operating system.
CASOC was an ad hoc development, funded
from sales revenue.
CASCOT funded by ESRC.
Occupational coding in practice
• Quality of coding reflects quality of text available
for coding.
• Need rules which specify how to deal with problems
such as ambiguous job titles (e.g. engineer,
teacher).
• Need to be aware that machine coding of text
can introduce bias.
• Need to establish ‘trade off’ between accuracy
and cost.
Cascot
• Cascot will provide:
A list of recommendations.
Code, title, best matching index entry, and certainty score
• Certainty Score
Approximates the probability that the recommended code
is correct.
This is represented by a number in the range 0-100.
People are never 100% right, and a computer cannot be 100% right either.
Text Input Area
civil engineer
Type job title
Press enter, or click ‘Code’ button
Recommendations Table
Columns: Code, Group Title, Index Entry, Score
Recommendations Table
Classification Structure
Index Entries
Output
Best recommendation selected automatically.
Select another by clicking a different line.
Structure & Index entry list will change.
And output has changed.
Change selection via structure
Index entry list will change.
And output has changed.
Reading from a file
• Instead of typing every job title in, we can
read job titles from a file.
• File must be in an acceptable format.
Reading from a file.
• Simplest file - each line is a job title.
• But how do we know which job title is for which
person?
(solution: use a delimited file)
Reading from a file.
• Tab delimited file.
• Each line = Person ID |TAB| Job Title
Reading from a file.
• Comma delimited file.
• Each line = Person ID |Comma| Job Title
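The two delimited layouts above can be read with a few lines of Python. This is a minimal sketch, not Cascot's own file reader: the function name read_job_titles and the two-column layout (Person ID first, job title second) are assumptions made for illustration.

```python
import csv
import io


def read_job_titles(handle, delimiter="\t"):
    """Yield (person_id, job_title) pairs from a delimited file.

    Assumes two columns per line: Person ID, then Job Title.
    Lines with fewer than two fields (including blank lines) are skipped.
    """
    reader = csv.reader(handle, delimiter=delimiter)
    for row in reader:
        if len(row) < 2:
            continue
        yield row[0].strip(), row[1].strip()


# Hypothetical sample data standing in for a tab delimited input file.
sample = io.StringIO("1001\tcivil engineer\n1002\tteacher\n")
records = list(read_job_titles(sample))
```

For a comma delimited file, pass delimiter="," instead. Keeping the Person ID next to each title is what lets a code be traced back to the right person.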
Recording codes from Cascot
• Rather than having to copy the code
produced by Cascot we can have Cascot
record the codes to a file.
• Open an Output File.
• One line written for each piece of text
coded.
Output Items
• After coding we have the following facts:
– The text that was coded.
– The code it was given.
– The title for that code.
– The best matching index entry within that code.
– The score Cascot assigned the match.
• Each of these facts is an “Output Item”
• We can choose which we wish to output (on the
screen or to a file).
• Can also output items from the input file.
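The idea of choosing output items can be sketched in Python as selecting named fields from each coded result and writing one line per result. This is an illustration under stated assumptions, not Cascot's implementation: the item names, the write_output function and the sample values (including the code and score) are all hypothetical.

```python
import csv
import io

# Hypothetical item names mirroring the five facts listed above.
AVAILABLE_ITEMS = ("text", "code", "title", "index_entry", "score")


def write_output(handle, results, items, header=True):
    """Write one tab delimited line per coded text, keeping only `items`."""
    unknown = [item for item in items if item not in AVAILABLE_ITEMS]
    if unknown:
        raise ValueError(f"unknown output items: {unknown}")
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    if header:
        writer.writerow(items)  # optional first row of column titles
    for result in results:
        writer.writerow([result[item] for item in items])


# Illustrative values only; the code and score are made up, not Cascot output.
results = [{"text": "civil engineer", "code": "0000", "title": "Example group",
            "index_entry": "engineer, civil", "score": 90}]
out = io.StringIO()
write_output(out, results, items=["text", "code", "score"])
```

Here only three of the five items are written, in the order requested, which mirrors moving items from an "available" list to a "current" list.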
Example: Using Files
Input file (tab delimited).
Example: Using Files.
Step 1: Open Input File.
Select file, click open.
Confirm / Select File Format.
Choose selection options.
Click ok.
Input File Details
First job title coded
Step 2: Choose Output Items.
Click Edit.
Available Items
Current Items
Current Output
To add score click ‘Add’
Then, click OK
Step 3: Open Output File.
Select file, or type in name for new file.
Click Save
MyOutputFile.txt
You will be asked whether you wish
the first output row to contain
column titles.
Output File Details
Select the preferred recommendation.
Or navigate to the correct code.
Once you are happy with the
code - click 'Accept'
The next job title appears.
(Automatically read from file after ‘Accept’)
Select the preferred recommendation.
Or navigate to the correct code.
Once you are happy with the
code - click 'Accept'
If you don’t know the code, or
wish to defer coding to a more
expert coder, click ‘No Conclusion’.
Output set to zeros.
The ‘no conclusion’ output is not
final until you click Accept.
Example: Using Files
Input file (tab delimited).
Example: Using Files
Output file (Output items = “Input Record, Code, Title, Score”)
A fully automated run.
• Rather than clicking “Accept” to agree to
the best recommendation every time we
can automate the process.
• But how good is this ?
• Example follows:
– Random sample of real data
– 1200 unique job titles
– Coded automatically, sorted by score.
Skipping some pages …..
Semi automatic coding.
• Job titles with high certainty scores are usually coded correctly.
– Humans agree with Cascot for high scores.
• Job titles with low certainty scores are often coded wrongly.
– We need human intelligence to decide the correct code when we
have low certainty.
• Automatically accept high scores but leave low scores
to a human decision.
• What score threshold ?
– Small study by IER, University of Warwick suggests manual coders
are happy with a threshold of 70-75 (some with 60).
– Balance between time (& money) vs. quality
• Best practice: sort input file alphabetically by job title.
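The best-practice step of sorting by job title can be done before the file is given to Cascot. The sketch below is one possible way to do it in Python; the function name and the case-insensitive comparison are assumptions, and the presumed benefit is that identical or similar titles are handled together.

```python
def sort_by_job_title(records):
    """Return (person_id, job_title) pairs sorted alphabetically by title.

    Comparison ignores case and surrounding spaces; ties keep their input
    order because Python's sort is stable.
    """
    return sorted(records, key=lambda rec: rec[1].strip().casefold())


# Hypothetical records: (person_id, job_title).
records = [("3", "Teacher"), ("1", "civil engineer"), ("2", "teacher")]
ordered = sort_by_job_title(records)
```

Because the Person ID travels with each title, the sorted output can still be matched back to the original respondents.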
Automated & Assisted Modes
• Requires input and output files.
• Threshold level = certainty score.
• Assisted mode
– score below threshold = user prompted
• Fully Automatic mode
– score below threshold = no code/zeros written
• Set Automation using “Options |
Automation” from the menu bar.
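The difference between the two modes can be expressed as a small decision rule. This is a hedged sketch of the behaviour described above, not Cascot's code: the function decide, the mode names and the four-digit NO_CODE placeholder are assumptions made for illustration.

```python
NO_CODE = "0000"  # assumed placeholder for the "zeros" written when no code is given


def decide(best_code, score, threshold, mode):
    """Return (code, needs_human) for the best recommendation.

    mode is "assisted" or "automatic" (names chosen for this sketch).
    A score equal to the threshold is accepted; only scores below it
    trigger the fallback.
    """
    if mode not in ("assisted", "automatic"):
        raise ValueError(f"unknown mode: {mode}")
    if score >= threshold:
        return best_code, False   # accepted without prompting
    if mode == "assisted":
        return None, True         # user is prompted to choose a code
    return NO_CODE, False         # fully automatic: no code / zeros written


# Example with a made-up code "1234" and a threshold of 70.
accepted = decide("1234", 80, 70, "assisted")
prompted = decide("1234", 50, 70, "assisted")
zeroed = decide("1234", 50, 70, "automatic")
```

Raising the threshold sends more job titles to a human (or to zeros), which is the time-versus-quality trade-off in practice.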
Using additional information to aid
coding.
• Ambiguous job title.
• Coding manually – look at other questions
E.g. Q: ‘Briefly describe your duties’
• Do the same with Cascot.
– But: The data must be present in the input file.
– Best if: The input file is a delimited file.
“Teacher” is ambiguous.
Click “View Record” Button
This Information can be used
to determine that we want
“Secondary Teacher”
Click ‘X’ to close.
Now select “Secondary Teachers”
And Accept to move on.