Can AI Mark Essays More Accurately Than Humans?
New research suggests...no. But it can mark more consistently, and that matters.
Marking (or ‘grading’ in US money) is probably the most time-consuming and loathed aspect of a teacher’s weekly workload, but emerging evidence suggests that things could be about to change.
Marking is a contentious topic. Every teacher knows they have to do it, but many feel the opportunity cost is simply not worth it. Spending an hour marking 30 essays and writing extensive feedback, only for students to glance at the grade, is a dispiritingly common experience. The suggestion that marking is largely a waste of time has recently ruffled more than a few feathers. Assessment overlord Dylan Wiliam once argued to me (quite convincingly, it has to be said) that marking is the most expensive public relations exercise in history.
The emergence of AI has intensified this debate, raising a fundamental question: if large language models can mark with greater consistency and objectivity than human teachers, should we be rethinking the entire practice of marking?
A recent study published in npj Science of Learning examined how large language models perform in criterion-based marking/grading compared to human teachers. The findings suggest that AI may offer a more consistent and structured approach to assessment. But does that mean AI is ready to replace human marking?
How AI Performed
The study tested LLMs on IELTS academic writing tasks, a standardised test with clear grading criteria, and evaluated the models on their ability to follow those criteria accurately. The AI models demonstrated superior consistency compared with human graders, but a crucial caveat emerged: even under optimal conditions, with carefully engineered prompts incorporating detailed grading criteria, AI achieved only "moderate" absolute agreement with official IELTS examiners. Prompt 3, which provided the most comprehensive criteria and band descriptors, achieved an interrater agreement of 0.61 [0.02, 0.85]: technically significant, but far from ideal.
This moderate performance occurred despite AI's superior consistency, suggesting a fundamental gap between algorithmic assessment and expert human judgment. That gap is particularly telling given that IELTS writing tasks are highly structured, with clear evaluation rubrics: precisely the type of assessment where we might expect AI to excel. If AI can only achieve moderate agreement in such a well-defined context, it raises serious questions about its readiness for more open-ended or complex assessment tasks. So while AI might be valuable for reducing grading variability and managing teacher workload, it may not yet be sophisticated enough to replicate the evaluative judgment of experienced human examiners.
Ultimately, however, the fundamental problem with human assessment is inconsistency (*gives VAR gesture*). Even trained educators can give different scores to the same piece of writing depending on factors like fatigue, implicit bias, or differing interpretations of rubrics. Research has long shown that inter-rater reliability, i.e. how much agreement there is between different graders, varies significantly among human assessors.
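For the technically curious, here is a toy sketch of how inter-rater agreement is commonly quantified. I’m using quadratic-weighted Cohen’s kappa as a stand-in (the study’s exact statistic may differ), and the band scores below are invented purely for illustration:

```python
# Toy sketch: quantifying agreement between an examiner and an AI marker
# with quadratic-weighted Cohen's kappa. Scores are invented; the study's
# exact agreement statistic may differ.
from sklearn.metrics import cohen_kappa_score

# Hypothetical IELTS-style band scores for ten essays
examiner_scores = [6, 7, 5, 8, 6, 7, 6, 5, 7, 8]
ai_scores = [6, 7, 6, 8, 5, 7, 7, 5, 7, 7]

# Quadratic weighting penalises large disagreements more than near-misses,
# which suits ordinal scales like writing bands.
kappa = cohen_kappa_score(examiner_scores, ai_scores, weights="quadratic")
print(f"Quadratic-weighted kappa: {kappa:.2f}")  # ~0.79 on this toy data
```

A value of 1 means perfect agreement and 0 means no better than chance, so the 0.61 reported for Prompt 3 sits in ‘decent, but nowhere near interchangeable with an expert examiner’ territory.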
Potential Benefits of AI Marking
Despite the issues, one of the biggest potential advantages of AI marking is its ability to scale. AI can process vast amounts of student work quickly, making it particularly valuable in large educational settings. Unlike human assessors, AI doesn’t suffer from fatigue, mood swings, lack of coffee, or implicit biases (though, weirdly, it can inherit biases from its training data). The tantalising claim is that it *could* deliver consistent, criterion-based feedback, reducing the variability that plagues human grading. By automating the assessment of lower-stakes assignments, AI could also free teachers up to focus on what most of them prefer, and arguably what truly matters: teaching.
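To make that concrete, here is a minimal sketch of what criterion-based AI marking might look like in practice, assuming access to the OpenAI Python SDK. The model name, abbreviated rubric, and essay text are placeholders of mine, not the prompts used in the study:

```python
# Minimal sketch of criterion-based AI marking via the OpenAI Python SDK.
# Model name, rubric, and essay are illustrative placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

RUBRIC = """Band 7: fully addresses the task; clear progression throughout;
uses vocabulary with flexibility and precision.
Band 6: addresses the task; generally coherent; adequate vocabulary range.
(Abbreviated, illustrative descriptors.)"""

def mark_essay(essay: str) -> str:
    """Ask the model for a band score plus a brief rationale."""
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[
            {"role": "system",
             "content": ("You are an IELTS writing examiner. Mark strictly "
                         f"against these band descriptors:\n{RUBRIC}\n"
                         "Reply with a band score and a two-sentence rationale.")},
            {"role": "user", "content": essay},
        ],
        temperature=0,  # minimise run-to-run variability in scores
    )
    return response.choices[0].message.content

print(mark_essay("Some people believe that technology has made life easier..."))
```

Setting the temperature to 0 is part of how you get the consistency the study describes: the same essay should come back with much the same score on every run.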
One interesting finding was that comparisons between different AI models revealed that more sophisticated versions like ChatGPT 4.0 and Claude 3.0 Haiku performed similarly to ChatGPT 3.5, with no statistically significant improvements in accuracy or consistency. This is particularly revealing because it suggests that more advanced models with superior general capabilities don't necessarily produce better grading outcomes; domain-specific knowledge and clear criteria seem to matter more than the underlying sophistication of the model. That is a crucial point for institutions considering which tools to implement.
All Too Human
So while AI grading shows promise, it’s still some way off. As discussed above, a major limitation is that AI doesn’t truly understand meaning; it identifies patterns but lacks the deeper comprehension that human discernment brings to assessment. There’s also the issue of systematic bias: if AI is trained on flawed datasets, it may reinforce those biases rather than eliminate them. Another concern is over-reliance, with both students and teachers becoming too dependent on AI-generated feedback and potentially weakening their critical evaluation skills. And then there’s resistance from educators. Marking isn’t just about assigning scores; for many teachers it’s a core part of ‘knowing’ their students, which makes them understandably hesitant to hand it over to an algorithm. Very often, kids don’t care how good a piece of work is so much as they want to know that you noticed their effort.
In my experience, most teachers prefer *teaching*: specifically, talking and interacting with students. Marking can be the most turgid aspect of the job, and this research signals the potential of AI to handle routine marking tasks, freeing teachers to focus on the relational enterprise of teaching. As imperfect as AI may be, if it allows teachers to focus on being human, many will see that as a trade-off worth making. And while the research suggests AI may be more reliable in terms of consistency, it hasn’t achieved the accuracy that expert human graders can provide.
In the end, I think the most promising path is merging human and AI capabilities; check out Daisy Christodoulou’s recent post on the ‘human in the loop’, which strikes me as an exciting way forward. AI has certainly not *solved* the problem of marking, but it has shone a spotlight on the many issues with how it has traditionally been done, issues that will become harder to justify in the face of more efficient alternatives that could seriously address workload and assessment consistency.