On the Use of Large Language Models as Judges for Mathematics Teaching Quality Assessment

LEPENNETIER, SOPHIE MADELEINE MICHELINE
2025/2026

Abstract

Teaching quality is an important factor in learning, yet its systematic evaluation remains complex and costly because it still relies heavily on human experts observing classrooms. In mathematics, the MQI (Mathematical Quality of Instruction) framework makes it possible to analyze classroom interactions in detail. This thesis explores a new direction: using large language models (LLMs) as raters (the LLM-as-Judge paradigm) to automate the evaluation of classroom transcripts at scale. The aim was to determine whether a model from the Gemini 2.5 family could reliably reproduce expert human scores. Using the NCTE dataset, we also tested the influence of several parameters: codebook precision, the inclusion of examples, explicit reasoning (chain-of-thought) or implicit reasoning (thinking mode), and model version. Reliability was assessed through non-inferiority tests based on Krippendorff’s alpha. The results are mixed: across the 38 MQI dimensions examined, no LLM configuration reached the level of reliability required to match human raters’ performance. Although prompt optimization improved performance, the models remained consistently below human expert agreement. However, the analysis also showed that LLMs were able to rank teachers and lessons correctly according to their overall quality. While they cannot yet match expert-level performance, they still appear useful for the exploratory analysis of large volumes of data. More broadly, this study highlights how difficult it remains to delegate pedagogical evaluation to AI when such judgments depend on a high level of interpretation and content knowledge.
2025
LLM-as-Judge
Education
AI
Teaching
Assessment
Files in this item:
Sophie_Lepennetier.pdf (open access, 11.1 MB, Adobe PDF)

The text of this website © Università degli studi di Padova. Full texts are published under a non-exclusive license. Metadata are under a CC0 license.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.12608/108075