The Path to Human-Level IELTS Essay Grading AI

Confidential: do NOT publish this video

Overview
● Benefits & limitations of grading IELTS essays with ChatGPT
● How we achieved 90%+ accuracy on a really challenging Task Response task
● How it all works under the hood
● What challenges we must overcome before reaching human-level IELTS essay grading AI

Intro
Using ChatGPT to grade IELTS essays can help:
● Teachers be more productive
  ○ Traditionally 30 min/essay. Now 1 min/essay.
● Students improve learning outcomes
  ○ Traditionally 1-2 days to get feedback. Now 1 click.

Many education platforms have adopted this technology
● ZIM
● Prep
● Ed Micro
● IELTS Science (Writify)

Current problems
● Relatively low score-evaluation accuracy
● High cost: ~200-400 VND/essay
● The technology is not fully owned; it depends on OpenAI

What we achieved
● Our model reached over 90% accuracy on a small but challenging task within the Task Response criterion of the IELTS Writing test.
● Trained in-house, meaning we keep the IP.
● 10-20 times cheaper to serve at scale than OpenAI models.

How we achieved high accuracy
Common approach of ZIM, Prep,…:
● Prompt engineering:
  ○ "Given the Band Descriptor…, please grade this essay…"
● Can be effective but cannot achieve high accuracy because:
  ○ ChatGPT does not have high-quality IELTS assessment data
  ○ The Band Descriptor is complex and subjective

Our approach
● We took an open-source model, Mistral 7B: the best 7B model to date, licensed under Apache 2.0
● Trained it on an IELTS-specific dataset
● Broke the Band Descriptor down from 4 criteria into 15 fine-grained criteria, with clear definitions for each level
  ○ Task Response:
    ■ Answers all parts of the question
    ■ Idea development
    ■ On-topic
    ■ …

The training process

The Task
Did the essay answer all parts of the question?
● Off-topic: Band 3
● Tangential: Band 4
● Incomplete: Band 5
● Appropriately answered: Band 6+

Clearly define the level
● No part of the prompt is adequately addressed: assigned to essays that are completely off-topic, equating to Band 3.
● Minimal or tangential response: essays that share a broad topic with the question but address a different aspect, aligning with Band 4.
  ○ The question asks: More and more people today are spending large amounts of money on their complexions in order to look younger. Why do people want to look younger? Do you think this is a positive or negative development?
  ○ The learner writes about: More and more people today are spending large amounts of money on their bodies in order to look more attractive. Why do people want to look more attractive? Do you think this is a positive or negative development?
● Incompletely addresses the main parts: essays that don't deviate from the question but don't fully answer it, corresponding to Band 5.
● Appropriately addresses the main parts of the prompt: essays that are on topic and fully address the question, meeting the criteria for Band 6 or above.

The Training Data
● 16,000+ data points, GPT-labeled*
● Augmented from 4,000+ data points
● Class distribution:
  ○ Off-topic (Band 3): 25%
  ○ Tangential (Band 4): 25%
  ○ Incomplete (Band 5): 25%
  ○ Appropriately answered (Band 6+): 25%
* The data was relatively good but not perfect.

The Testing Data
● 400 data points, GPT-labeled & rechecked manually
● Class distribution:
  ○ Off-topic (Band 3): 25%
  ○ Tangential (Band 4): 25%
  ○ Incomplete (Band 5): 25%
  ○ Appropriately answered (Band 6+): 25%

Experiment Goals
Find out:
● The accuracy of existing models (GPT-3.5 & GPT-4) on this task compared with the model we were able to train.
● The impact of data size & data quality.
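
The comparison described in the experiment goals comes down to checking how often each model picks the correct one of the four levels on the test set. Below is a minimal sketch of that measurement, not the actual evaluation code used here. The level keys (`off_topic`, `tangential`, `incomplete`, `appropriate`) and the function names are hypothetical, chosen only for illustration; the band mapping follows the level definitions above.

```python
from collections import Counter

# Hypothetical label keys; the bands follow the level definitions in the slides.
LEVEL_TO_BAND = {
    "off_topic": 3,
    "tangential": 4,
    "incomplete": 5,
    "appropriate": 6,  # stands for "Band 6 or above"
}


def accuracy(gold, predicted):
    """Fraction of essays whose predicted level equals the gold level."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("evaluation set is empty")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


def per_level_accuracy(gold, predicted):
    """Accuracy computed separately for each gold level present in the data."""
    totals = Counter(gold)
    hits = Counter(g for g, p in zip(gold, predicted) if g == p)
    # Counter returns 0 for levels that were never predicted correctly.
    return {level: hits[level] / totals[level] for level in totals}


if __name__ == "__main__":
    # Toy data for illustration only, not real results.
    gold = ["off_topic", "tangential", "incomplete", "appropriate"]
    pred = ["off_topic", "tangential", "appropriate", "appropriate"]
    print(accuracy(gold, pred))            # 0.75
    print(per_level_accuracy(gold, pred))
```

Because the test set is balanced at 25% per level, overall accuracy is not inflated by a dominant class, and the per-level breakdown shows which boundaries (for example Incomplete versus Appropriately answered) a model confuses most.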