Matada Research

How AI marking tools are transforming teacher workloads: Evidence from global implementation studies

6 minutes to read

Gerald Naepi | BSc, PgDipSci, BHSc Physio

Gerald has a background in science and health and a passion for impactful research that drives strengths-based community outcomes. Under Gerald's directorship, Matada has worked with many national and international organisations, including Health New Zealand Te Whatu Ora, the Human Rights Commission, Te Papa and the United Nations.

New Zealand’s government recently announced plans to deploy AI marking tools for student assessment, joining a global trend toward automated grading that promises significant efficiency gains but faces notable technical challenges.

Automated essay scoring platforms like Gradescope report up to 90% reductions in grading time for large engineering and science courses. In China, AI grading systems demonstrated 92% agreement with human teachers across trials involving 120 million students (Alsharif, 2025).

However, these statistics reveal only part of the story. Despite the technology's clear potential to transform educational assessment, researchers are identifying critical gaps between its promise and its real-world performance, pointing to significant implementation challenges and technical limitations.

The reality behind AI marking performance

Contrary to widespread assumptions that artificial intelligence can uniformly automate grading tasks, new research exposes significant variations in AI marking performance. ChatGPT demonstrates markedly different accuracy levels depending on the type of assessment, showing far greater consistency when grading reflective essays than coding assignments. The system achieved a coefficient of variation of just 6.70% for high-quality written work, compared to a dramatic 33.89% variation for poor-quality coding assessments (Li et al., 2024). These disparities reflect AI training that heavily favors written content over technical materials.
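
To make the consistency figures concrete, the coefficient of variation (CV) simply expresses the spread of repeated AI-assigned scores relative to their average. The sketch below, with made-up scores rather than data from the study, shows how figures like the 6.70% and 33.89% cited above can be computed.

```python
# Illustrative only: computing a coefficient of variation (CV) from repeated
# AI-assigned scores for the same piece of work. The score lists are invented.
import statistics

def coefficient_of_variation(scores: list[float]) -> float:
    """CV as a percentage: standard deviation relative to the mean."""
    mean = statistics.mean(scores)
    return statistics.stdev(scores) / mean * 100

# Hypothetical repeated markings of one essay and one coding assignment.
essay_scores = [82, 80, 84, 81, 83]
coding_scores = [55, 38, 71, 44, 62]

print(f"Essay CV:  {coefficient_of_variation(essay_scores):.2f}%")
print(f"Coding CV: {coefficient_of_variation(coding_scores):.2f}%")
```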

An unexpected pattern emerges in how AI systems handle student work of varying quality levels. Rather than providing consistent feedback regardless of submission quality, ChatGPT becomes increasingly unreliable when evaluating weaker student work (Li et al., 2024). This creates a troubling dynamic where students who struggle most – and who need the most accurate guidance – receive the least dependable assessments from AI systems.

However, when machine and human scorers disagreed, follow-up analysis revealed instances where the AI-MLS score was correct rather than the human score, suggesting AI systems may actually help minimise human bias in academic assessment (Terrazas-Arellanes et al., 2025).

Technical architecture and implementation mechanics

Modern AI marking systems deploy sophisticated architectures that combine multiple approaches to evaluate student work. Automated Essay Scoring (AES) systems utilise hybrid models that blend rule-based statistical features with deep-learning algorithms, incorporating natural language processing frameworks such as PyTorch and Hugging Face, alongside transformer models like Longformer (Fischer, 2023). This technical complexity enables nuanced analysis beyond simple keyword matching.
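
A minimal sketch of what such a scoring head can look like is shown below, assuming the Hugging Face Transformers library, PyTorch, and the publicly available allenai/longformer-base-4096 checkpoint. It attaches a single regression output to the transformer as a stand-in for a predicted essay score; it is not the configuration of any particular commercial AES product, and the head would need fine-tuning on human-scored essays before its output meant anything.

```python
# Sketch of a transformer-based essay scorer (assumed setup, not a real product).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "allenai/longformer-base-4096"  # Longformer handles long documents
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
# num_labels=1 gives a single regression output, i.e. one predicted score.
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=1)

essay = "Renewable energy adoption depends on both policy settings and public trust..."
inputs = tokenizer(essay, truncation=True, max_length=4096, return_tensors="pt")

with torch.no_grad():
    score = model(**inputs).logits.item()
print(f"Raw model output (meaningless until fine-tuned on scored essays): {score:.3f}")
```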

Training these systems reveals significant implementation challenges. While some applications require as few as two pre-scored essays for calibration, others demand up to 1,000 examples for reliable performance (Fischer, 2023). ChatGPT-based systems achieve optimal results through repeated testing, with accuracy and consistency improving when the same assessment is marked multiple times and results averaged (Li et al., 2024).
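
The repeated-marking strategy is straightforward to express in code. In the sketch below, mark_with_llm() is a hypothetical placeholder for whatever LLM marking call an institution uses (it is not a real API); the point is the averaging and the run-to-run spread, which is what Li et al. report improves consistency.

```python
# Sketch of repeated marking with averaging. mark_with_llm() is a placeholder.
import random
import statistics

def mark_with_llm(submission: str, rubric: str) -> float:
    """Placeholder for a single LLM marking call; returns a noisy score so the
    averaging logic can be demonstrated end to end."""
    return random.gauss(75, 5)

def averaged_mark(submission: str, rubric: str, runs: int = 5) -> tuple[float, float]:
    """Mark the same work several times; report the mean and the spread."""
    scores = [mark_with_llm(submission, rubric) for _ in range(runs)]
    return statistics.mean(scores), statistics.stdev(scores)

mean_score, spread = averaged_mark("essay text...", "rubric text...", runs=5)
print(f"Averaged score: {mean_score:.1f} (run-to-run SD {spread:.1f})")
```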

Assessment of these systems has moved beyond basic accuracy measures. The Quadratic Weighted Kappa (QWK) has emerged as the gold standard, measuring agreement between AI and human scores while accounting for the magnitude of disagreement. Research consistently shows that QWK values above 0.70 indicate substantial agreement, with some systems achieving 0.72, comparable to human inter-rater reliability ranges of 0.60 to 0.75 (Terrazas-Arellanes et al., 2025).
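
QWK is easy to compute once human and AI scores are side by side; one common implementation is scikit-learn's Cohen's kappa with quadratic weights, shown below on invented scores.

```python
# Quadratic Weighted Kappa between human and AI scores (scores are made up).
from sklearn.metrics import cohen_kappa_score

human_scores = [3, 4, 2, 5, 3, 4, 1, 4, 3, 5]
ai_scores    = [3, 4, 3, 5, 2, 4, 1, 4, 4, 5]

qwk = cohen_kappa_score(human_scores, ai_scores, weights="quadratic")
print(f"QWK = {qwk:.2f}")  # values above ~0.70 are read as substantial agreement
```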

Despite these technical advances, transparency remains problematic. Even with advanced large language models like ChatGPT providing explanations for scores, systems remain fundamentally opaque. As Fischer (2023) notes, “the processing side is still a massive area of learning,” with creators unable to predict conclusions their systems will reach, creating trust and accountability challenges in high-stakes assessment environments.

Why current implementation strategies are failing

AI marking systems face critical flaws that undermine their educational value. The most significant issue is prompt dependency: research shows that identical AI systems produce significantly different outcomes depending purely on how the marking instructions are phrased (Li et al., 2024). This variability undermines the very consistency that is supposed to be AI marking's primary advantage, and it demands extensive testing and refinement that educators rarely have time to conduct.

Students can exploit these systems through gaming strategies that expose fundamental weaknesses. Studies reveal that using long but meaningless words artificially inflates scores, while changing up to 20% of essay content may leave AI scores unchanged (Fischer, 2023). More troubling, adding just three words to a 350-word essay can increase scores by 50%, demonstrating how AI systems prioritise superficial markers over genuine understanding.
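
One practical safeguard is an adversarial spot-check: re-score a submission after padding it with semantically empty filler and see whether the grade moves. The sketch below is illustrative only; score_submission() is a hypothetical stand-in for whatever marker is under evaluation, and here it deliberately rewards length so the failure mode is visible.

```python
# Hypothetical robustness spot-check for score inflation from meaningless filler.
def score_submission(text: str) -> float:
    """Placeholder scorer: naively rewards length and long words, which is
    exactly the weakness this check is designed to catch."""
    words = text.split()
    return min(100.0, len(words) * 0.2 + sum(len(w) > 9 for w in words) * 2.0)

essay = "The experiment shows that plants grow faster with more light. " * 6
filler = " Fundamentally, notwithstanding multifaceted considerations."

baseline = score_submission(essay)
padded = score_submission(essay + filler * 3)

inflation = (padded - baseline) / baseline * 100
print(f"Baseline {baseline:.1f}, padded {padded:.1f} ({inflation:+.1f}% change)")
if inflation > 10:
    print("Warning: score inflated by meaningless filler; marker may be gameable.")
```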

These technical problems compound when teachers lack proper preparation. Upper elementary teachers who lack science training and are less familiar with Next Generation Science Standards (NGSS) three-dimensional performance-based assessments struggle to validate or interpret AI-generated scores effectively (Terrazas-Arellanes et al., 2025). Without training in both the subject matter and the limitations of AI systems, teachers cannot identify inappropriate or biased automated scores.

The equity concerns prove equally troubling. While AI marking promises to reduce human bias, research reveals it can amplify existing inequities: one analysis found a small but significant bias against male upper elementary school learners, partially linked to essay word count, a factor that may disadvantage certain student populations (Fischer, 2023). Similar algorithmic disadvantages have been documented in other sectors.

These findings carry critical implications for educational leaders considering AI marking implementation. The evidence demonstrates that successful deployment requires far more strategic planning than simple technology adoption.

Business impact and resource allocation

AI marking systems carry financial costs that extend far beyond initial licensing fees. Schools must invest heavily in data collection, computing resources, and ongoing maintenance, with expenses mounting as models require updates for new languages or curriculum changes (Alsharif, 2025). The burden hits developing regions hardest, where licensing fees and infrastructure demands create a digital divide that makes AI marking a luxury of wealthy regions.

The operational reality often undermines efficiency promises. While AI marking pledges to reduce teacher workloads, this benefit only emerges through proper implementation. Quality assurance demands can become so extensive that, as Fischer (2023) notes, “you might as well pay a person to do it in the first place.” Schools must budget for pilot testing, prompt optimisation, and continuous monitoring, costs frequently overlooked in initial planning.

Time requirements further challenge efficiency claims. Teachers must learn to interpret AI outputs, validate scores for bias, and recognise system failures. This professional development represents a significant ongoing expense that institutions routinely underestimate when adopting AI marking systems.

Competitive landscape and market dynamics

Educational institutions worldwide are adopting AI assessment tools at dramatically different rates, creating a fragmented global landscape. China has deployed AI systems across approximately 60,000 schools, while fewer than 10% of Western institutions have established formal AI policies (Alsharif, 2025). This regulatory gap positions early adopters of comprehensive AI governance frameworks at a competitive advantage.

The high-stakes testing industry reflects this same divide in approach. Pearson’s PTE Academic now relies entirely on AI scoring for language assessments, while competitors maintain human oversight for accountability. However, market acceptance remains mixed; a UCL-Pearson study found that while test-takers valued AI’s objectivity, many felt unsettled by fully automated evaluation (Fischer, 2023).

Meanwhile, platform specialisation is becoming a critical market differentiator. Gradescope has carved out the STEM assessment niche, contrasting sharply with general-purpose tools like ChatGPT. Research indicates that domain-specific training significantly affects AI performance, requiring institutions to strategically align tools with their specific assessment needs rather than relying on one-size-fits-all solutions.

Future development trajectory

Regulatory frameworks are evolving rapidly to address AI assessment challenges. The EU AI Act, effective in 2024, classifies certain educational AI applications as high-risk, mandating risk assessments and user information disclosures (Fischer, 2023). This regulatory shift will compel institutions to develop more rigorous validation processes and transparency measures.

Technological advances are simultaneously reshaping the landscape. Large language models like ChatGPT represent the next evolutionary step, addressing some explainability issues by providing scoring rationales, though fundamental transparency problems persist. These models enable zero-shot learning with minimal training data, achieving 70-80% accuracy without specific training sets, impressive for research but insufficient for high-stakes assessment (Fischer, 2023).

International coordination efforts are establishing global standards for ethical AI implementation. OECD’s large-scale AI assessment tool and UNESCO’s competency frameworks will likely accelerate adoption while ensuring quality benchmarks across educational systems.

Educational leaders should pursue phased implementation strategies that prioritise validation and teacher training over rapid deployment. Beginning with low-stakes formative assessments allows system testing without consequences. Rigorous prompt testing protocols must evaluate multiple instruction variations against human scoring before deploying any AI marking system.
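
Such a prompt-testing protocol can start small: score an already human-marked pilot set under several instruction variants and keep the prompt that agrees best with the human markers. The sketch below is an assumed workflow, with mark_with_prompt() standing in for the institution's own LLM marker (not a real API) and simulated scores so the comparison logic runs end to end.

```python
# Sketch of a prompt-testing protocol against human-marked pilot essays.
import random
from statistics import mean

def mark_with_prompt(essay: str, prompt: str) -> float:
    """Placeholder: returns a simulated score so the comparison can be run."""
    random.seed(hash((essay, prompt)) % 10_000)
    return random.uniform(50, 95)

prompts = {
    "terse":   "Score this essay out of 100.",
    "rubric":  "Score this essay out of 100 against the rubric, criterion by criterion.",
    "persona": "You are an experienced examiner. Score this essay out of 100 and justify the mark.",
}

pilot_essays = [f"pilot essay {i}" for i in range(20)]
human_scores = [random.uniform(50, 95) for _ in pilot_essays]

for name, prompt in prompts.items():
    ai_scores = [mark_with_prompt(e, prompt) for e in pilot_essays]
    mae = mean(abs(a - h) for a, h in zip(ai_scores, human_scores))
    print(f"{name:8s} mean absolute error vs human markers: {mae:.1f}")
```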

Successful implementation requires comprehensive governance frameworks addressing algorithmic transparency, bias detection, and ongoing monitoring. Substantial investment in teacher professional development ensures educators understand both AI capabilities and limitations. Human-in-the-loop validation processes for high-stakes assessments should position AI as a second marker rather than primary assessor.
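
In practice, "AI as second marker" can be as simple as comparing the two marks and routing large disagreements to moderation rather than letting the AI grade alone. The sketch below uses an assumed threshold and invented marks purely to illustrate the workflow.

```python
# Sketch of human-in-the-loop moderation: flag large human-AI disagreements.
DISAGREEMENT_THRESHOLD = 10  # marks out of 100 (illustrative assumption)

submissions = [
    {"id": "s01", "human": 72, "ai": 70},
    {"id": "s02", "human": 55, "ai": 81},
    {"id": "s03", "human": 88, "ai": 86},
]

for sub in submissions:
    gap = abs(sub["human"] - sub["ai"])
    if gap > DISAGREEMENT_THRESHOLD:
        print(f"{sub['id']}: human {sub['human']} vs AI {sub['ai']} -> send to moderation")
    else:
        print(f"{sub['id']}: scores agree within {DISAGREEMENT_THRESHOLD} marks; human mark stands")
```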

Regular auditing of AI marking outcomes across different student populations remains critical for detecting bias patterns early. The evidence suggests that while AI marking offers significant potential benefits, successful implementation requires treating it as a complex socio-technical system rather than simple automation technology. Institutions investing in comprehensive validation, training, and governance frameworks will likely achieve sustainable benefits, while those pursuing quick deployment may face significant equity and accuracy challenges.
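
Such an audit can begin very simply: group the AI-minus-human score gap by student population and look for systematic differences. The groups and numbers in the sketch below are invented; a real audit would draw on the institution's own marking records.

```python
# Illustrative bias audit: mean AI-human score gap per student population.
from collections import defaultdict
from statistics import mean

records = [
    {"group": "male",   "human": 70, "ai": 66},
    {"group": "male",   "human": 62, "ai": 59},
    {"group": "female", "human": 71, "ai": 72},
    {"group": "female", "human": 64, "ai": 65},
    {"group": "ELL",    "human": 58, "ai": 52},
    {"group": "ELL",    "human": 66, "ai": 61},
]

gaps = defaultdict(list)
for r in records:
    gaps[r["group"]].append(r["ai"] - r["human"])

for group, values in gaps.items():
    print(f"{group:7s} mean AI-human gap: {mean(values):+.1f} marks (n={len(values)})")
# A consistently negative gap for one group is a red flag worth investigating.
```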

References

Alsharif, A. (2025). Artificial Intelligence and the Future of Assessment: Opportunities for Scalable, Fair, and Real-Time Evaluation. Libyan Journal of Educational Research and E-Learning (LJERE), 42–52. https://ljere.com.ly/index.php/ljere/article/view/5

Fischer, I. (2023). Evaluating the ethics of machines assessing humans. Journal of Information Technology Teaching Cases. https://doi.org/10.1177/20438869231178844

Li, J., Jangamreddy, N., Bhansali, R., Hisamoto, R., Zaphir, L., Dyda, A., & Glencross, M. (2024). AI-assisted marking: Functionality and limitations of ChatGPT in written assessment evaluation. Australasian Journal of Educational Technology, 40(4), 56–72.

Terrazas-Arellanes, F. E., Strycker, L., Alvez, G. G., Miller, B., & Vargas, K. (2025). Promoting Agency Among Upper Elementary School Teachers and Students with an Artificial Intelligence Machine Learning System to Score Performance-Based Science Assessments. Education Sciences, 15(1), 54. https://doi.org/10.3390/educsci15010054

Matada is a forward-thinking social enterprise delivering transformative research, evaluation, and strategic consultancy to shape legislation, policy, and practice, driving actionable solutions for the well-being and prosperity of the next generation. Supported by a team of highly qualified researchers and consultants with both global and local expertise, we operate with a values-driven approach centered on relationships, respect, reciprocity, community, and service.

By: Gerald Naepi

geraldnaepi@matadaresearch.co.nz