Computer-Assisted Coding of Text CASCOT Software demonstration Rob Jones and Peter Elias

Structure of presentation

• Background – manual text coding

• Development of software – history, aims

• CASCOT – demonstration

Coding text to a classification

• Coding is the process of assigning each of the possible answers to one of a predefined set of categories.

• The full set of categories is termed a classification. Examples are:

– SOC 2000 (Standard Occupational Classification 2000)

– SIC 92 (Standard Industrial Classification 1992)

• Three parts to a classification: the structure, the index and the classification rules

Text responses in surveys

Q: What is your job title?

Q: Briefly describe your duties.

Q: What does the organisation you work for mainly make or do?

Manual coding procedures

• Manual methods

– code books;

– temporary labour;

– query resolution systems.

• No standardised approach; major variations in coding quality between institutions, companies, etc.

• Time-consuming, expensive.

Development of software

CASOC: Pascal/C++ text coding software for DOS 1993 – 2001.

CASCOT: Java text coding software for any operating system.

CASOC was an ad hoc development, funded from sales revenue.

CASCOT was funded by the ESRC.

Occupational coding in practice

• Quality of coding reflects quality of text available for coding.

• Need rules which specify how to deal with problems such as ambiguous job titles (e.g. engineer, teacher).

• Need to be aware that machine coding of text can introduce bias.

• Need to establish ‘trade off’ between accuracy and cost.

Cascot

• Cascot will provide:

A list of recommendations.

Code, title, best matching index entry, and certainty score

• Certainty Score

Approximates the probability that the recommended code is correct.

This is represented by a number in the range 0-100.

People are never 100% right. A computer can’t be 100% right either.
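
As an illustration of the list of recommendations described above, here is a minimal Python sketch of one recommendation (code, title, best matching index entry, certainty score) and of picking the highest-scoring one. The class name, field names, codes and scores are all hypothetical; they are not taken from Cascot itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Recommendation:
    code: str         # classification code (placeholder values below)
    title: str        # title of the group the code belongs to
    index_entry: str  # best matching index entry within that group
    score: int        # certainty score in the range 0-100


def best_recommendation(recommendations):
    """Return the recommendation with the highest score, or None if empty."""
    recs = list(recommendations)
    for rec in recs:
        if not 0 <= rec.score <= 100:
            raise ValueError(f"score out of range: {rec.score}")
    return max(recs, key=lambda rec: rec.score, default=None)


# Illustrative values only: the codes, titles and scores are made up.
candidates = [
    Recommendation("A001", "Example group A", "engineer, civil", 92),
    Recommendation("B002", "Example group B", "engineer, site", 41),
]
print(best_recommendation(candidates))
```

The score is treated here simply as a number to rank by; how Cascot actually computes it is not covered in these slides.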

Text Input Area

civil engineer

Type job title

Press enter, or click ‘Code’ button

Recommendations Table

Columns: Code, Group Title, Index Entry, Score

Recommendations Table

Classification Structure

Index Entries

Output

Best recommendation selected automatically.

Select another by clicking a different line.

Structure & Index entry list will change.

And output has changed.

Change selection via structure

Index entry list will change.

And output has changed.

Reading from a file

• Instead of typing every job title in, we can read job titles from a file.

• File must be in an acceptable format.

Reading from a file.

• Simplest file - each line is a job title.

• But how do we know which job title is for which person? (solution: use a delimited file)

Reading from a file.

• Tab delimited file.

• Each line = Person ID |TAB| Job Title

Reading from a file.

• Comma delimited file.

• Each line = Person ID |Comma| Job Title
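
The two delimited layouts above can be read with a few lines of Python. This is only a sketch of the file format, not of how Cascot reads files; the function name and the sample IDs are assumptions made for the example.

```python
import csv
import io


def read_job_titles(lines, delimiter="\t"):
    """Yield (person_id, job_title) pairs from a delimited text source."""
    for row in csv.reader(lines, delimiter=delimiter):
        if len(row) < 2:
            continue  # skip blank or incomplete lines
        yield row[0].strip(), row[1].strip()


# In-memory stand-ins for a tab delimited and a comma delimited file.
tab_data = io.StringIO("P001\tcivil engineer\nP002\tteacher\n")
comma_data = io.StringIO("P001,civil engineer\nP002,teacher\n")

print(list(read_job_titles(tab_data)))
print(list(read_job_titles(comma_data, delimiter=",")))
```

With a comma delimited file, a job title that itself contains a comma has to be quoted (for example "manager, retail"), which is one practical reason to prefer tabs.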

Recording codes from Cascot

• Rather than having to copy the code produced by Cascot we can have Cascot record the codes to a file.

• Open an Output File.

• One line written for each piece of text coded.

Output Items

• After coding we have the following facts:

• The text that was coded.

• The code it was given.

• The title for that code

• The best matching index entry within that code.

• The score Cascot assigned the match.

• Each of these facts is an “Output Item”

• We can choose which we wish to output (on the screen or to a file).

• Can also output items from the input file.
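
To make the idea of choosing output items concrete, here is a small Python sketch that writes one line per coded text, restricted to the selected items. The item names, the tab delimited output and the sample record are assumptions for illustration; the actual Cascot output format may differ.

```python
import csv
import io

# The five items listed above, under hypothetical field names.
AVAILABLE_ITEMS = ("text", "code", "title", "index_entry", "score")


def write_output(results, items, out, header=True):
    """Write one tab delimited line per coded text, limited to `items`."""
    unknown = [item for item in items if item not in AVAILABLE_ITEMS]
    if unknown:
        raise ValueError(f"unknown output items: {unknown}")
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    if header:
        writer.writerow(items)  # optional first row of column titles
    for result in results:
        writer.writerow([result[item] for item in items])


# Illustrative record only: the code, title and score are made up.
results = [
    {"text": "civil engineer", "code": "A001", "title": "Example group A",
     "index_entry": "engineer, civil", "score": 92},
]
buffer = io.StringIO()
write_output(results, ["text", "code", "score"], buffer)
print(buffer.getvalue())
```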

Example: Using Files

Input file (tab delimited).

Example: Using Files.

Step 1: Open Input File.

Select file, click open.

Confirm / Select File Format.

Choose selection options.

Click ok.

Input File Details

First job title coded

Step 2: Choose Output Items.

Click Edit.

Available Items

Current Items

Current Output

To add score click ‘Add’

Then, click OK

Step 3: Open Output File.

Select file, or type in name for new file.

MyOutputFile.txt

Click Save

You will be asked if you wish to make the first output row be column titles.

Output File Details

Select the preferred recommendation.

Or navigate to the correct code.

Once you are happy with the code - click 'Accept'

The next job title appears.

(Automatically read from file after ‘Accept’)

Select the preferred recommendation.

Or navigate to the correct code.

Once you are happy with the code - click 'Accept'

If you don’t know the code, or wish to defer coding to a more expert coder, click ‘No Conclusion’.

Output set to zeros.

The ‘no conclusion’ output is not final until you click Accept.

Example: Using Files

Input file (tab delimited).

Example: Using Files

Output file (Output items = “Input Record, Code, Title, Score”)

A fully automated run.

• Rather than clicking “Accept” to agree to the best recommendation every time we can automate the process.

• But how good is this?

• Example follows:

– Random sample of real data

– 1200 unique job titles

– Coded automatically, sorted by score.

Skipping some pages …..

Skipping some more pages …..

Semi automatic coding.

• Job titles with high certainty scores = right.

– Humans agree with Cascot for high scores.

• Job titles with low certainty scores = wrong

– We need human intelligence to decide the correct code when we have low certainty.

• Automatically agree to high scores but have human decision for low scores.

• What score threshold ?

– A small study by IER, University of Warwick, shows manual coders are happy with a threshold of 70-75 (some with 60).

– Balance between time (& money) vs. quality

• Best practice: sort input file alphabetically by job title.
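
Sorting the input by job title can be done before the file is opened in Cascot. The Python sketch below sorts (person ID, job title) pairs alphabetically, ignoring case. The function name and sample records are hypothetical, and the presumed benefit (identical titles arrive one after another, so the coder sees them together) is an assumption rather than something stated in the slides.

```python
def sort_by_job_title(records):
    """Sort (person_id, job_title) pairs alphabetically by title, ignoring case."""
    return sorted(records, key=lambda record: record[1].strip().casefold())


# Made-up sample records.
records = [("3", "Teacher"), ("1", "civil engineer"), ("2", "teacher")]
for person_id, job_title in sort_by_job_title(records):
    print(person_id, job_title, sep="\t")
```

Because the person ID travels with each title, the original order can always be restored after coding.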

Automated & Assisted Modes

• Requires input and output files.

• Threshold level = certainty score.

• Assisted mode

– score below threshold = user prompted

• Fully Automatic mode

– score below threshold = no code/zeros written

• Set Automation using “Options | Automation” from the menu bar.
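
The difference between the two modes comes down to what happens when the score falls below the threshold. A minimal Python sketch of that decision follows; the function name, the mode strings, the `ask_user` callback and the "0000" placeholder are assumptions for illustration (the slides say only that zeros are written).

```python
NO_CONCLUSION = "0000"  # placeholder for the "no code/zeros" output


def decide(code, score, threshold, mode, ask_user=None):
    """Return the code to record for one job title.

    Scores at or above the threshold are accepted automatically.
    Below it, assisted mode asks the user and automatic mode
    records no conclusion.
    """
    if mode not in ("assisted", "automatic"):
        raise ValueError(f"unknown mode: {mode}")
    if score >= threshold:
        return code
    if mode == "assisted":
        return ask_user(code, score)  # human decides on low scores
    return NO_CONCLUSION


# Illustrative calls with made-up codes and a threshold of 70.
print(decide("A001", 92, 70, "automatic"))
print(decide("B002", 41, 70, "automatic"))
print(decide("B002", 41, 70, "assisted", ask_user=lambda code, score: "C003"))
```

Raising the threshold sends more titles to a human, which costs time but should improve quality; lowering it does the opposite.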

Using additional information to aid coding.

• Ambiguous job title.

• Coding manually – look at other questions

E.g. Q ‘Briefly describe your duties’

• Do the same with Cascot.

– But: The data must be present in the input file.

– Best if: The input file is a delimited file.

“Teacher” is ambiguous.

Click “View Record” Button

This information can be used to determine that we want “Secondary Teacher”.

Click ‘X’ to close.

Now select “Secondary Teachers”

And Accept to move on.
