Examining the consistency of instructor versus large language model ratings on summary content: Toward checklist-based feedback provision with second language writers

Published in Language Testing, 2025

Abstract
This study examined the consistency between instructor ratings of learner-generated summaries and ratings estimated by a large language model (LLM) on summary content checklist items designed for undergraduate second language (L2) writing instruction in Japan. The effects of LLM prompt design on instructor-LLM consistency were also explored by comparing six prompt types that varied in the amount of information included in the prompt and in the instructed order in which the two parts of the LLM output (a checklist-based rating and its rationale) were generated. Ninety-seven summaries written by Japanese undergraduate students were analyzed using three checklist items on the use of topic sentences from the source text. Agreement between instructor and LLM ratings reached a level satisfactory for low-stakes use for certain checklist item-by-prompt type combinations. Where the two diverged, LLM ratings tended to be harsher than instructor ratings. Furthermore, the amount of information included in the LLM prompt affected instructor-LLM rating agreement more than the order in which the rating and its rationale were generated. The results offered initial empirical support for employing LLM-generated formative feedback on summary content in L2 writing classrooms.
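
The sketch below is purely illustrative and not taken from the paper: it shows one way the two prompt manipulations described in the abstract (the amount of information included in the prompt, and whether the checklist rating or its rationale is generated first) could be crossed to produce six prompt variants. All template text, dictionary keys, and the `build_prompt` helper are hypothetical; the study's actual prompts and design may differ.

```python
# Hypothetical sketch of crossing two prompt manipulations described in the
# abstract: information amount x output order. Not the paper's actual prompts.
from itertools import product

# Assumed levels of information included in the prompt (three, for illustration).
INFO_BLOCKS = {
    "minimal": "Checklist item: The summary includes the topic sentence of paragraph 1.",
    "with_source": ("Checklist item: The summary includes the topic sentence of paragraph 1.\n"
                    "Source text:\n{source_text}"),
    "with_source_and_criteria": ("Checklist item: The summary includes the topic sentence of paragraph 1.\n"
                                 "Rating criteria:\n{criteria}\n"
                                 "Source text:\n{source_text}"),
}

# Two output orderings: rating before rationale vs. rationale before rating.
ORDER_INSTRUCTIONS = {
    "rating_first": "First give a Yes/No rating for the checklist item, then explain your rationale.",
    "rationale_first": "First explain your reasoning about the checklist item, then give a Yes/No rating.",
}


def build_prompt(info_key: str, order_key: str, summary: str,
                 source_text: str = "", criteria: str = "") -> str:
    """Assemble one prompt variant from an information block and an ordering instruction."""
    info = INFO_BLOCKS[info_key].format(source_text=source_text, criteria=criteria)
    return f"{info}\n\nStudent summary:\n{summary}\n\n{ORDER_INSTRUCTIONS[order_key]}"


if __name__ == "__main__":
    # Enumerate the 3 x 2 = 6 hypothetical prompt variants for one summary.
    for info_key, order_key in product(INFO_BLOCKS, ORDER_INSTRUCTIONS):
        prompt = build_prompt(info_key, order_key,
                              summary="(learner summary text)",
                              source_text="(source text)",
                              criteria="(criteria)")
        print(f"--- {info_key} / {order_key} ---")
        print(prompt, "\n")
```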

Recommended citation:
Yasuyo Sawaki, Yutaka Ishii, Hiroaki Yamada, Takenobu Tokunaga. Examining the consistency of instructor versus large language model ratings on summary content: Toward checklist-based feedback provision with second language writers, Language Testing (2025).

Download paper here