On the Use of Large Language Models as Judges for Mathematics Teaching Quality Assessment
LEPENNETIER, SOPHIE MADELEINE MICHELINE
2025/2026
Abstract
Although teaching quality is an important factor in learning, its systematic evaluation remains complex and costly because it still relies heavily on expert human judgment of classroom observations. In mathematics, the MQI (Mathematical Quality of Instruction) framework makes it possible to analyze classroom interactions in detail. This thesis explores a new direction: using large language models (LLMs) as raters (the LLM-as-Judge paradigm) to automate the evaluation of classroom transcripts at scale. The aim was to determine whether a model from the Gemini 2.5 family could reliably reproduce expert human scores. Using the NCTE dataset, we also tested the influence of several parameters: codebook precision, the inclusion of examples, explicit reasoning (chain-of-thought) versus implicit reasoning (thinking mode), and model version. Reliability was assessed through non-inferiority tests based on Krippendorff's alpha. The results are mixed: across the 38 MQI dimensions examined, no LLM configuration reached the level of reliability required to match human raters' performance. Although prompt optimization improved performance, the models remained consistently below the level of agreement among human experts. However, the analysis also showed that LLMs were able to rank teachers and lessons correctly according to their overall quality. While they cannot yet match expert-level performance, they still appear useful for the exploratory analysis of large volumes of data. More broadly, this study highlights how difficult it remains to delegate pedagogical evaluation to AI when such judgments depend on a high level of interpretation and content knowledge.

| File | Size | Format |
|---|---|---|
| Sophie_Lepennetier.pdf (open access) | 11.1 MB | Adobe PDF |
The text of this website © Università degli studi di Padova. Full texts are published under a non-exclusive license. Metadata are under a CC0 license.
https://hdl.handle.net/20.500.12608/108075