On the Use of Large Language Models as Judges for Mathematics Teaching Quality Assessment

LEPENNETIER, SOPHIE MADELEINE MICHELINE

2025/2026

Abstract

Although teaching quality is an important factor in learning, evaluating it systematically remains complex and costly because it still relies heavily on expert human judgment based on classroom observation. In mathematics, the Mathematical Quality of Instruction (MQI) framework makes it possible to analyze classroom interactions in detail. This thesis explores a new direction: using large language models (LLMs) as raters (the LLM-as-Judge paradigm) to automate the evaluation of classroom transcripts at scale.

The aim was to determine whether a model from the Gemini 2.5 family could reliably reproduce expert human scores. Using the NCTE dataset, we also tested the influence of several parameters: codebook precision, the inclusion of examples, explicit reasoning (chain-of-thought) or implicit reasoning (thinking mode), and model version. Reliability was assessed through non-inferiority tests based on Krippendorff’s alpha.

The results are mixed. Across the 38 MQI dimensions examined, no LLM configuration reached the level of reliability required to match human raters’ performance: prompt optimization improved performance, but the models remained consistently below human expert agreement. However, the analysis also showed that LLMs were able to rank teachers and lessons correctly according to their overall quality. Although they cannot yet match expert-level performance, they still appear useful for the exploratory analysis of large volumes of data. More broadly, this study highlights how difficult it remains to delegate pedagogical evaluation to AI when such judgments depend on a high level of interpretation and content knowledge.
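
To make the reliability criterion more concrete, here is a minimal Python sketch of how Krippendorff's alpha and a bootstrap-based non-inferiority check could be computed. The abstract does not describe the thesis's actual procedure, so everything here is an assumption made for illustration: the function names, the choice of nominal or interval distance (MQI scores may call for an ordinal metric), the comparison of one human-LLM pair against one human-human pair, the percentile bootstrap, and the toy ratings.

```python
import random
from itertools import permutations


def nominal(a, b):
    """Nominal disagreement: 0 if the two ratings match, 1 otherwise."""
    return 0.0 if a == b else 1.0


def interval(a, b):
    """Interval disagreement: squared difference between numeric ratings."""
    return float((a - b) ** 2)


def krippendorff_alpha(units, delta=nominal):
    """Krippendorff's alpha; each unit lists the ratings given to one item
    (None marks a missing rating)."""
    units = [[v for v in u if v is not None] for u in units]
    units = [u for u in units if len(u) >= 2]  # only pairable ratings count
    values = [v for u in units for v in u]
    n = len(values)
    if n < 2:
        raise ValueError("need at least two pairable ratings")
    # Observed disagreement: ordered pairs within each item.
    observed = sum(
        sum(delta(a, b) for a, b in permutations(u, 2)) / (len(u) - 1)
        for u in units
    ) / n
    # Expected disagreement: ordered pairs across all pooled ratings.
    expected = sum(delta(a, b) for a, b in permutations(values, 2)) / (n * (n - 1))
    if expected == 0:
        raise ValueError("alpha is undefined when all ratings are identical")
    return 1.0 - observed / expected


def non_inferiority_lower_bound(human_a, human_b, llm, delta=nominal,
                                n_boot=1000, level=0.95, seed=0):
    """One-sided bootstrap lower bound for
    alpha(human_a, llm) - alpha(human_a, human_b), resampling items."""
    rng = random.Random(seed)
    items = list(zip(human_a, human_b, llm))
    diffs = []
    for _ in range(n_boot):
        sample = [rng.choice(items) for _ in items]
        try:
            a_llm = krippendorff_alpha([[h, m] for h, _, m in sample], delta)
            a_hum = krippendorff_alpha([[h, g] for h, g, _ in sample], delta)
        except ValueError:
            continue  # skip degenerate resamples with no rating variation
        diffs.append(a_llm - a_hum)
    if not diffs:
        raise ValueError("every bootstrap resample was degenerate")
    diffs.sort()
    return diffs[int((1 - level) * len(diffs))]


# Toy data (hypothetical, not from the thesis): two raters, ten binary items.
rater_a = [0, 1, 0, 0, 0, 0, 0, 0, 1, 0]
rater_b = [1, 1, 1, 0, 0, 1, 0, 0, 0, 0]
print(round(krippendorff_alpha(list(zip(rater_a, rater_b))), 3))  # 0.095
```

Under these assumptions, an LLM configuration would be declared non-inferior when the lower bound returned by `non_inferiority_lower_bound` lies above `-margin`, where `margin` is a tolerance fixed before the analysis; the value used in the thesis is not given in the abstract.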

Record

Faculty/Department: Dipartimento di Psicologia Generale - DPG
Degree programme: COGNITIVE NEUROSCIENCE AND CLINICAL NEUROPSYCHOLOGY, Master's degree (Laurea Magistrale, D.M. 270/2004)
Academic year: 2025
English title: On the Use of Large Language Models as Judges for Mathematics Teaching Quality Assessment
Keywords: LLM-as-Judge; Education; AI; Teaching; Assessment
Supervisor: TESTOLIN, ALBERTO
Co-supervisor: DE KOCK, CHRISTIAAN
Appears in collections: Master's theses (Lauree magistrali)

Files in this item:

File	Access	Size	Format
Sophie_Lepennetier.pdf	open access	11.1 MB	Adobe PDF

The text of this website is © Università degli studi di Padova. Full texts are published under a non-exclusive license. Metadata are released under a CC0 license.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.12608/108075