Uploaded by Bá Hùng Lê

The path to Human-level IELTS Essay Grading AI
Confidential – Do NOT publish this video
Overview
● Benefits & limitations of grading IELTS essays with ChatGPT
● How we achieved 90%+ accuracy on a really challenging Task Response task
● How it all works under the hood
● What challenges we must overcome before reaching human-level IELTS essay grading AI
Intro
Using ChatGPT to grade IELTS essays can help:
● Teachers be more productive
  ○ Traditionally 30 min/essay. Now 1 min/essay.
● Students improve learning outcomes
  ○ Traditionally 1-2 days to get feedback. Now 1 click.
Many education platforms have applied this technology:
● ZIM
● Prep
● Ed Micro
● IELTS Science (Writify)
Current problems
● Relatively low score evaluation accuracy
● High cost: ~200-400 VND/essay
● The technology is not fully owned; it depends on OpenAI
What we achieved
● Our model achieved over 90% accuracy on a small but challenging task within the Task Response criterion of the IELTS Writing test.
● Trained in-house, meaning we keep the IP.
● 10-20 times cheaper to serve at scale than OpenAI models.
How we achieved high accuracy
Common approach of ZIM, Prep, …:
● Prompt engineering:
  ○ "Given the Band Descriptor…, please grade this essay…"
● Can be effective but cannot achieve high accuracy because:
  ○ ChatGPT does not have high-quality IELTS assessment data
  ○ The Band Descriptor is complex and subjective
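The prompt-engineering approach above can be sketched as a simple template. This is a minimal illustration only: the function name `build_grading_prompt`, the template wording, and the placeholder texts are assumptions, not the prompts actually used by any of the platforms named.

```python
def build_grading_prompt(band_descriptor: str, question: str, essay: str) -> str:
    """Assemble one grading prompt from its parts (hypothetical template)."""
    return (
        "Given the Band Descriptor below, please grade this essay.\n\n"
        f"Band Descriptor:\n{band_descriptor.strip()}\n\n"
        f"Question:\n{question.strip()}\n\n"
        f"Essay:\n{essay.strip()}\n\n"
        "Reply with a single band score."
    )


if __name__ == "__main__":
    # Placeholder inputs; real descriptor and essay text would go here.
    print(
        build_grading_prompt(
            band_descriptor="(descriptor text goes here)",
            question="Why do people want to look younger?",
            essay="(essay text goes here)",
        )
    )
```

The resulting string would be sent to a general-purpose chat model as-is, so everything the model knows about grading has to fit in the prompt, which is one reason this approach struggles with a complex descriptor.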
Our approach
● We took an open-source model, Mistral 7B: the best 7B model to date, licensed under Apache 2.0
● Trained it on an IELTS-specific dataset
● Broke down the Band Descriptor from 4 to 15 fine-grained criteria with clear definitions for each level
  ○ Task Response:
    ■ Answers all parts of the question
    ■ Idea development
    ■ On-topic
    ■ …
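One way to prepare an IELTS-specific dataset for fine-tuning is to store each labeled essay as a prompt/completion record, one per fine-grained criterion. The sketch below is an assumption about what such a record could look like: the field names, the `to_training_record` and `to_jsonl` helpers, and the JSON Lines layout are hypothetical and are not taken from the training pipeline described here.

```python
import json
from typing import Dict, List


def to_training_record(question: str, essay: str, criterion: str, label: str) -> Dict[str, str]:
    """Turn one labeled essay into a prompt/completion pair (hypothetical schema)."""
    prompt = (
        f"Criterion: {criterion}\n"
        f"Question: {question}\n"
        f"Essay: {essay}\n"
        "Label:"
    )
    # A leading space keeps the label separate from the "Label:" cue.
    return {"prompt": prompt, "completion": " " + label}


def to_jsonl(records: List[Dict[str, str]]) -> str:
    """Serialize records as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


if __name__ == "__main__":
    record = to_training_record(
        question="(question text)",
        essay="(essay text)",
        criterion="Answers all parts of the question",
        label="Incomplete",
    )
    print(to_jsonl([record]))
```

Scoring one criterion per record keeps each target small and well defined, which matches the idea of giving every level a clear definition.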
The training process
The Task
Did the essay answer all parts of the question?
● Off-topic - 3
● Tangential - 4
● Incomplete - 5
● Appropriately answered - 6+
Clearly define each level
● No part of the prompt is adequately addressed: assigned to essays that are completely off-topic, equating to Band 3.
● Minimal or tangential response: the essay shares a broad topic with the question but addresses different aspects, aligning with Band 4.
  ○ The question asks: More and more people today are spending large amounts of money on their complexions in order to look younger. Why do people want to look younger? Do you think this is a positive or negative development?
  ○ The learner writes about: More and more people today are spending large amounts of money on their bodies in order to look more attractive. Why do people want to look more attractive? Do you think this is a positive or negative development?
● Incompletely addresses the main parts: essays that don't deviate from the question but don't fully answer it, corresponding to Band 5.
● Appropriately addresses the main parts of the prompt: essays that are on topic and fully address the question, meeting the criteria for Band 6 or above.
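The four levels defined above map naturally onto a small enumeration. The following is a minimal sketch; the names `ResponseLevel`, `MIN_BAND`, and `describe` are illustrative assumptions, while the level names and band numbers come from the definitions above.

```python
from enum import Enum


class ResponseLevel(Enum):
    """The four 'answers all parts of the question' levels."""
    OFF_TOPIC = "Off-topic"
    TANGENTIAL = "Tangential"
    INCOMPLETE = "Incomplete"
    APPROPRIATE = "Appropriately answered"


# Band associated with each level.
# APPROPRIATE means "Band 6 or above", so 6 is a floor, not an exact score.
MIN_BAND = {
    ResponseLevel.OFF_TOPIC: 3,
    ResponseLevel.TANGENTIAL: 4,
    ResponseLevel.INCOMPLETE: 5,
    ResponseLevel.APPROPRIATE: 6,
}


def describe(level: ResponseLevel) -> str:
    """Render a level in the 'Label - band' form used on the slides."""
    suffix = "+" if level is ResponseLevel.APPROPRIATE else ""
    return f"{level.value} - {MIN_BAND[level]}{suffix}"


if __name__ == "__main__":
    for level in ResponseLevel:
        print(describe(level))
```

Treating the task as a four-way classification, rather than asking for a free-form band score, is what makes a plain accuracy figure meaningful for it.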
The Training Data
16,000+ data points, GPT-labeled*
● Off-topic - 3: 25%
● Tangential - 4: 25%
● Incomplete - 5: 25%
● Appropriately answered - 6+: 25%
* Augmented from 4,000+ data points. The data was relatively good but not perfect.
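A 25% share per class means the training set is balanced across the four labels. The sketch below shows one generic way to check class shares and to balance a labeled set by downsampling. It is an assumption for illustration: the helper names are hypothetical, and the source does not say how its data was augmented or balanced.

```python
import random
from collections import Counter
from typing import Dict, List, Tuple

Example = Tuple[str, str]  # (essay_text, label)


def class_shares(examples: List[Example]) -> Dict[str, float]:
    """Return the fraction of examples carrying each label."""
    counts = Counter(label for _, label in examples)
    total = len(examples)
    return {label: n / total for label, n in counts.items()}


def downsample_to_balance(examples: List[Example], seed: int = 0) -> List[Example]:
    """Keep the same number of examples per label (the size of the rarest label)."""
    rng = random.Random(seed)
    by_label: Dict[str, List[Example]] = {}
    for example in examples:
        by_label.setdefault(example[1], []).append(example)
    smallest = min(len(group) for group in by_label.values())
    balanced: List[Example] = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, smallest))
    rng.shuffle(balanced)
    return balanced


if __name__ == "__main__":
    # Toy data only; not the real training set.
    toy = (
        [("essay", "Off-topic")] * 3
        + [("essay", "Tangential")] * 1
        + [("essay", "Incomplete")] * 2
        + [("essay", "Appropriately answered")] * 2
    )
    print(class_shares(downsample_to_balance(toy)))
```

With equal shares, a model that always predicts one label scores only 25%, so accuracy on such a set is easy to interpret.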
The Testing Data
400 data points, GPT-labeled and rechecked manually
● Off-topic - 3: 25%
● Tangential - 4: 25%
● Incomplete - 5: 25%
● Appropriately answered - 6+: 25%
Experiment Goals
Find out:
● The accuracy of existing models (GPT-3.5 & GPT-4) on this task compared to the model we were able to train.
● The impact of data size & data quality.
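Comparing models on this task comes down to measuring how often each model's predicted level matches the manually rechecked label. The following is a minimal sketch of such a comparison; the function names and the `model_a`/`model_b` keys are hypothetical, and the toy labels in the usage example are not experimental results.

```python
from collections import defaultdict
from typing import Dict, List


def accuracy(gold: List[str], predicted: List[str]) -> float:
    """Fraction of predictions that exactly match the gold labels."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty test set")
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def per_class_accuracy(gold: List[str], predicted: List[str]) -> Dict[str, float]:
    """Accuracy computed separately for each gold label."""
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for g, p in zip(gold, predicted):
        totals[g] += 1
        hits[g] += int(g == p)
    return {label: hits[label] / totals[label] for label in totals}


def compare_models(gold: List[str], predictions_by_model: Dict[str, List[str]]) -> Dict[str, float]:
    """Overall accuracy for every model, scored against the same test set."""
    return {name: accuracy(gold, preds) for name, preds in predictions_by_model.items()}


if __name__ == "__main__":
    # Toy labels for illustration only.
    gold = ["Off-topic", "Tangential", "Incomplete", "Appropriately answered"]
    predictions = {
        "model_a": ["Off-topic", "Tangential", "Incomplete", "Incomplete"],
        "model_b": list(gold),
    }
    print(compare_models(gold, predictions))
```

The same functions can be rerun on models trained with different amounts or qualities of data to study the second goal, since only the predictions change while the test set stays fixed.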