The Role of Prompt Engineering in Ensuring the Consistency Between Instructor and LLM Checklist Ratings on Written Summary Content

The rapid growth of large language model (LLM) applications in L2 instruction and assessment in recent years has prompted exploration of options for the timely provision of fine-grained feedback on traditionally underexplored, complex task types such as summary writing. Yet key validity issues, such as the consistency between LLM ratings and instructor ratings and the effects of different prompts for automated scoring and feedback on that consistency, require careful examination. The present study addressed these issues, focusing specifically on checklist-based rating of main idea representation in written summaries (Kintsch & van Dijk, 1978; van Dijk & Kintsch, 1983). Ninety-seven summaries written in English by undergraduates in Japan were analyzed. Two writing course instructors rated all summaries, with partial double rating. We then developed six prompts by manipulating two features: the amount of information included (three types, including few-shot examples) and the order in which the rating and its explanation were generated in the LLM output (two types). Employing OpenAI GPT-4 Turbo through the OpenAI API, we examined the consistency between instructor and LLM ratings based on agreement indices and confusion matrices. Results showed satisfactory levels of agreement for low-stakes purposes for certain prompt type-by-item combinations, with a notable effect of the amount of information included in the prompt. LLM ratings were also found to be generally harsher than human ratings. Key results will be discussed along with a qualitative analysis of the LLM output, as well as study implications for LLM-based analysis of checklists and their application to granular feedback provision.
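
The abstract does not report implementation details; the sketch below is only an illustration of how the two prompt features (amount of information and rating/explanation order) and the instructor-LLM agreement analysis could be operationalized with the OpenAI Python SDK and scikit-learn. The prompt wording, checklist item, rating scale, and sample ratings are hypothetical placeholders, not the study's actual materials or data.

```python
# Hypothetical sketch: checklist rating of one summary with GPT-4 Turbo,
# followed by an agreement analysis against instructor ratings.
# Prompt text, checklist item, and ratings are placeholders, not the study's materials.
from openai import OpenAI
from sklearn.metrics import cohen_kappa_score, confusion_matrix

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def rate_summary(summary_text: str, rating_first: bool = True) -> str:
    """Ask the model for a binary judgment on one (hypothetical) checklist item."""
    order = (
        "Give the rating (0 or 1) first, then a one-sentence explanation."
        if rating_first
        else "Give a one-sentence explanation first, then the rating (0 or 1)."
    )
    prompt = (
        "You are rating an English summary written by an undergraduate L2 writer.\n"
        "Checklist item: Does the summary represent the source text's main idea?\n"
        f"{order}\n\n"
        f"Summary:\n{summary_text}"
    )
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        temperature=0,  # keep scoring output as stable as possible
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


# Agreement analysis once model output is parsed into 0/1 labels (placeholder data).
instructor = [1, 0, 1, 1, 0, 1]
llm = [1, 0, 0, 1, 0, 1]
print(cohen_kappa_score(instructor, llm))
print(confusion_matrix(instructor, llm))
```

In a sketch like this, the rating_first flag corresponds to the output-order manipulation described above, while the amount-of-information manipulation (e.g., adding few-shot examples) would vary the prompt text passed to the model.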