Computer-Assisted Coding of Text
CASCOT
Software demonstration
Rob Jones and Peter Elias
• Background – manual text coding
• Development of software – history, aims
• CASCOT – demonstration
• Coding is the process of assigning each of the possible answers to a question to one of a predefined set of categories.
• The full set of categories is termed a classification. Examples are:
– SOC 2000 (Standard Occupational Classification 2000)
– SIC 92 (Standard Industrial Classification 1992)
• Three parts to a classification: the structure, the index and the classification rules
Q: What is your job title?
Q: Briefly describe your duties.
Q: What does the organisation you work for mainly make or do?
• Manual methods
– code books;
– temporary labour;
– query resolution systems.
• No standardised approach; major variations in quality of coding between institutions, companies, etc.
• Time-consuming, expensive.
CASOC: Pascal/C++ text coding software for DOS 1993 – 2001.
CASCOT: Java text coding software for any operating system.
CASOC was an ad hoc development, funded from sales revenue.
CASCOT funded by ESRC.
• Quality of coding reflects quality of text available for coding.
• Need rules which specify how to deal with problems such as ambiguous job titles (e.g. engineer, teacher).
• Need to be aware that machine coding of text can introduce bias.
• Need to establish ‘trade off’ between accuracy and cost.
• Cascot will provide:
A list of recommendations.
Code, title, best matching index entry, and certainty score
• Certainty Score
Approximates the probability that the recommended code is correct.
This is represented by a number in the range 0-100.
People are never 100% right. A computer can’t be 100% right either.
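As a rough illustration of what one recommendation carries, the sketch below models the four pieces of information listed above and picks the entry with the highest certainty score. The class name, field names, codes and scores are all invented for illustration; they are not Cascot's internal representation and not real classification codes.

```python
from dataclasses import dataclass


@dataclass
class Recommendation:
    code: str         # classification code (placeholder values below)
    title: str        # title of the group the code belongs to
    index_entry: str  # best matching index entry within that group
    score: int        # certainty score in the range 0-100


def best_recommendation(recommendations):
    """Return the recommendation with the highest certainty score."""
    return max(recommendations, key=lambda r: r.score)


# Hypothetical recommendations for the text "civil engineer".
# Codes and scores are made up for this example.
recs = [
    Recommendation("B2", "Engineering technicians", "technician, engineering", 40),
    Recommendation("A1", "Civil engineers", "engineer, civil", 90),
]
best = best_recommendation(recs)
print(best.code, best.title, best.score)
```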
civil engineer
Type job title
Press enter, or click ‘Code’ button
Recommendations Table
Columns: Code, Group Title, Index Entry, Score
Recommendations Table
Classification Structure
Index Entries
Output
Best recommendation selected automatically.
Select another by clicking a different line.
Structure & Index entry list will change.
And output has changed.
Change selection via structure
Index entry list will change.
And output has changed.
• Instead of typing every job title in, we can read job titles from a file.
• File must be in an acceptable format.
• Simplest file - each line is a job title.
• But how do we know which job title is for which person? (solution: use a delimited file)
• Tab delimited file.
• Each line = Person ID |TAB| Job Title
• Comma delimited file.
• Each line = Person ID |Comma| Job Title
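The two delimited layouts above can be read with the same few lines of code. The following is a minimal sketch, assuming the person ID is in the first field and the job title in the second; the function name and sample data are hypothetical, and in-memory text stands in for a real input file.

```python
import csv
import io


def read_job_titles(stream, delimiter="\t"):
    """Yield (person_id, job_title) pairs from a delimited text stream."""
    for row in csv.reader(stream, delimiter=delimiter):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        yield row[0].strip(), row[1].strip()


# Hypothetical sample data in each of the two layouts.
tab_sample = io.StringIO("001\tcivil engineer\n002\tteacher\n")
comma_sample = io.StringIO("001,civil engineer\n002,teacher\n")

tab_records = list(read_job_titles(tab_sample, delimiter="\t"))
comma_records = list(read_job_titles(comma_sample, delimiter=","))
print(tab_records)
```

Keeping the person ID next to the job title means every code produced later can be traced back to the person it belongs to.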
• Rather than having to copy the code produced by Cascot we can have Cascot record the codes to a file.
• Open an Output File.
• One line written for each piece of text coded.
• After coding we have the following facts:
• The text that was coded.
• The code it was given.
• The title for that code.
• The best matching index entry within that code.
• The score Cascot assigned the match.
• Each of these facts is an “Output Item”.
• We can choose which we wish to output (on the screen or to a file).
• Can also output items from the input file.
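To make the idea of choosing output items concrete, here is a small sketch that writes only the selected items for each coded text, one line per text, with an optional first row of column titles. The key names, code and score are hypothetical placeholders, not Cascot's own item names or real classification codes.

```python
import csv
import io

# The facts available after coding one piece of text.
# Key names and values are invented for this example.
coded = {
    "text": "civil engineer",
    "code": "A1",
    "title": "Civil engineers",
    "index_entry": "engineer, civil",
    "score": 90,
}


def write_output(stream, results, items, header=True):
    """Write the chosen output items as tab-delimited lines."""
    writer = csv.writer(stream, delimiter="\t", lineterminator="\n")
    if header:
        writer.writerow(items)  # first row holds the column titles
    for result in results:
        writer.writerow([result[item] for item in items])


out = io.StringIO()  # stands in for an output file
write_output(out, [coded], items=["text", "code", "score"])
output_text = out.getvalue()
print(output_text)
```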
Input file (tab delimited).
Example: Using Files.
Step 1: Open Input File.
Select file, click open.
Confirm / Select File Format.
Choose selection options.
Click ok.
Input File Details
First job title coded
Step 2: Choose Output Items.
Click Edit.
Available Items
Current Items
Current Output
To add score click ‘Add’
Then, click OK
Step 3: Open Output File.
Select file, or type in name for new file.
MyOutputFile.txt
Click Save
You will be asked if you wish to make the first output row be column titles.
Output File Details
Select the preferred recommendation.
Or navigate to the correct code.
Once you are happy with the code - click 'Accept'
The next job title appears.
(Automatically read from file after ‘Accept’)
Select the preferred recommendation.
Or navigate to the correct code.
Once you are happy with the code - click 'Accept'
If you don’t know the code, or wish to defer coding to a more expert coder, click ‘No Conclusion’.
Output set to zeros.
The ‘no conclusion’ output is not final until you click Accept.
Input file (tab delimited).
Output file (Output items = “Input Record, Code, Title, Score”)
• Rather than clicking “Accept” to agree to the best recommendation every time we can automate the process.
• But how good is this?
• Example follows:
– Random sample of real data
– 1200 unique job titles
– Coded automatically, sorted by score.
Skipping some pages …..
Skipping some more pages …..
• Job titles with high certainty scores are usually coded correctly.
– Humans agree with Cascot for high scores.
• Job titles with low certainty scores are often coded wrongly.
– We need human intelligence to decide the correct code when certainty is low.
• Automatically agree to high scores but have human decision for low scores.
• What score threshold?
– A small study by IER, University of Warwick, shows manual coders are happy with a threshold of 70-75 (some with 60).
– Balance between time (& money) vs. quality
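The trade-off can be explored by counting how many coded texts would still need a human decision at each candidate threshold. This is a sketch only: the function name is hypothetical and the scores are invented, not taken from the sample of 1200 job titles.

```python
def share_needing_review(scores, threshold):
    """Return the fraction of scores that fall below the threshold."""
    if not scores:
        return 0.0
    below = sum(1 for s in scores if s < threshold)
    return below / len(scores)


# Hypothetical certainty scores from an automatic coding run.
scores = [95, 88, 76, 72, 64, 58, 41, 30]

for threshold in (60, 70, 75):
    share = share_needing_review(scores, threshold)
    print(f"threshold {threshold}: {share:.1%} need a human decision")
```

A higher threshold sends more texts to a human coder, which costs more time but accepts fewer doubtful codes automatically.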
• Best practice: sort input file alphabetically by job title.
• Requires input and output files.
• Threshold level = certainty score.
• Assisted mode
– score below threshold = user prompted
• Fully Automatic mode
– score below threshold = no code/zeros written
• Set Automation using “Options | Automation” from the menu bar.
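The two automation modes differ only in what happens when a score falls below the threshold. The sketch below shows that logic under stated assumptions: `recommend` is a hypothetical stand-in for the coding engine, `ask_user` stands in for the prompt shown in assisted mode, and writing code "0" with score 0 is one guess at what "zeros" looks like in the output. None of this is Cascot's actual code.

```python
def auto_code(records, recommend, threshold, ask_user=None):
    """Code (person_id, job_title) records against a score threshold.

    recommend(job_title) must return (code, score).
    ask_user(job_title, code, score) returns the code chosen by a person;
    passing None selects fully automatic mode.
    """
    results = []
    # Sort alphabetically by job title so similar titles are handled together.
    for person_id, title in sorted(records, key=lambda r: r[1].lower()):
        code, score = recommend(title)
        if score >= threshold:
            results.append((person_id, title, code, score))  # accepted automatically
        elif ask_user is not None:
            chosen = ask_user(title, code, score)            # assisted mode
            results.append((person_id, title, chosen, score))
        else:
            results.append((person_id, title, "0", 0))       # fully automatic: zeros
    return results


# Toy recommender with invented codes and scores.
toy = {"civil engineer": ("A1", 90), "teacher": ("C3", 45)}


def toy_recommend(title):
    return toy.get(title.lower(), ("0", 0))


records = [("002", "teacher"), ("001", "civil engineer")]
automatic = auto_code(records, toy_recommend, threshold=70)
assisted = auto_code(records, toy_recommend, threshold=70,
                     ask_user=lambda title, code, score: "D4")
print(automatic)
print(assisted)
```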
• Ambiguous job title.
• Coding manually – look at other questions
E.g. Q ‘Briefly describe your duties’
• Do the same with Cascot.
– But: The data must be present in the input file.
– Best if: The input file is a delimited file.
“Teacher” is ambiguous.
Click “View Record” Button
This information can be used to determine that we want “Secondary Teacher”.
Click ‘X’ to close.
Now select “Secondary Teachers”
And Accept to move on.