Abstract. Scaling high-quality tutoring is a major challenge in education. Because of the growing demand, many platforms employ novice tutors who, unlike professional educators, struggle to effectively address student mistakes and thus fail to seize prime learning opportunities for students. In this paper, we explore the potential for large language models (LLMs) to assist math tutors in remediating student mistakes. We present ReMath, a benchmark co-developed with experienced math teachers that deconstructs their thought process for remediation. The benchmark consists of three step-by-step tasks: (1) infer the type of student error, (2) determine the strategy to address the error, and (3) generate a response that incorporates that information. We evaluate the performance of state-of-the-art instruction-tuned and dialog models on ReMath. Our findings suggest that although models consistently improve upon original tutor responses, we cannot rely on models alone to remediate mistakes. Providing models with the error type (e.g., the student is guessing) and strategy (e.g., simplify the problem) leads to a 75% improvement in response quality over models without that information. Despite this improvement, the quality of the best model's responses still falls short of that of experienced math teachers. Our work sheds light on the potential and limitations of using current LLMs to provide high-quality learning experiences for both tutors and students at scale.
The benchmark has three core tasks. Task A: infer the type of student error; Task B: determine a response strategy and the intention behind that strategy; and Task C: generate a response that incorporates this information. The ReMath framework emerged from several months of co-development with math teachers. We designed it to be comprehensive, intuitive, and aligned with the process that teachers actually follow.
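To make the three-task structure concrete, here is a minimal sketch of how a single annotated example might be represented. The field names and the sample dialogue are illustrative assumptions, not the benchmark's actual schema.

```python
# A minimal sketch of a ReMath-style annotation record.
# Field names and the example dialogue are illustrative assumptions.
from dataclasses import dataclass
from typing import List

@dataclass
class RemediationExample:
    conversation: List[str]   # tutoring dialogue up to the student's mistake
    error_type: str           # Task A: inferred cause of the student's error
    strategy: str             # Task B: chosen response strategy
    intention: str            # Task B: intention behind that strategy
    response: str             # Task C: remediation response

example = RemediationExample(
    conversation=[
        "Tutor: What is 3/4 of 20?",
        "Student: 12?",
    ],
    error_type="guess",  # e.g., the student appears to be guessing
    strategy="Provide a hint",
    intention="Hint at the student's mistake",
    response="Close! Try splitting 20 into 4 equal groups first. How big is each group?",
)
```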
Identifying the student's error is a prerequisite to successful remediation (Easley and Zwoyer, 1975; Bamberger et al., 2010). Task A asks teachers to infer the most likely cause of the mistake from context. Prior research, particularly in math teacher education, has often focused on topic-specific categories of misconceptions (e.g., Bamberger et al., 2010). Because our approach is intended to support tutors who are not necessarily content experts, we instead define ``error'' in terms of a student's degree of understanding. This definition aligns with literature on the hierarchical design of mathematics curricula (Gagne, 1962, 1968; White, 1973; Resnick et al., 1973) and with psychometrics, including constructs like the Zone of Proximal Development and item-response theory, which use continuous scales of mastery and update the inferred level of student understanding from student responses (Glaser and Nitko, 1970; Vygotsky and Cole, 1978; Wertsch, 1985; Embretson and Reise, 2013). As such, our error categories are topic-agnostic descriptions of a student's understanding and complement the topic-agnostic strategies in Task B.
Student errors tend to persist unless the teacher intervenes pedagogically (Radatz, 1980). Task B involves selecting a course of action that guides the student towards improving their understanding, and specifying the intention behind that action. The strategies and intentions are listed below; a small sketch encoding the two taxonomies follows the lists.
Strategies: Explain a concept, Ask a question, Provide a hint, Provide a strategy, Provide a worked example, Provide a minor correction, Provide a similar problem, Simplify the question, Affirm the correct answer, Encourage the student, Other.
Intentions: Motivate the student, Get the student to elaborate their answer, Correct the student's mistake, Hint at the student's mistake, Clarify a student's misunderstanding, Help the student understand the lesson topic or solution strategy, Diagnose the student's mistake, Support the student in their thinking or problem-solving, Explain the student's mistake (e.g., what is wrong in their answer or why is it incorrect), Signal to the student that they have solved or not solved the problem, Other.
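For convenience, the two Task B taxonomies above can be written down directly as constants, for instance to validate annotations. This is just an encoding of the lists above; the variable and function names are ours.

```python
# The Task B taxonomies from the lists above, encoded as plain constants.
STRATEGIES = [
    "Explain a concept", "Ask a question", "Provide a hint",
    "Provide a strategy", "Provide a worked example", "Provide a minor correction",
    "Provide a similar problem", "Simplify the question",
    "Affirm the correct answer", "Encourage the student", "Other",
]

INTENTIONS = [
    "Motivate the student",
    "Get the student to elaborate their answer",
    "Correct the student's mistake",
    "Hint at the student's mistake",
    "Clarify a student's misunderstanding",
    "Help the student understand the lesson topic or solution strategy",
    "Diagnose the student's mistake",
    "Support the student in their thinking or problem-solving",
    "Explain the student's mistake",
    "Signal to the student that they have solved or not solved the problem",
    "Other",
]

def validate_task_b(strategy: str, intention: str) -> bool:
    """Return True if both labels come from the fixed taxonomies."""
    return strategy in STRATEGIES and intention in INTENTIONS
```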
The table above summarizes the evaluation results for Task C. Notably, the models consistently outperform the original tutor response (indicated by positive values) on all dimensions, with two exceptions: strategy-constrained Flan-T5 is worse on all dimensions, and ChatGPT is worse on care. The teacher-written responses are rated highest on all dimensions except, surprisingly, human-soundingness. The best model across all settings is GPT-4, and it benefits most from the teacher annotations: its ratings nearly double from the unconstrained to the strategy-constrained setting on all dimensions except care, which improves only when error information is added. The figure above shows GPT-4 responses that illustrate the diversity of remediation strategies. In the unconstrained setting, GPT-4 directly corrects the student, while the other models use different approaches to prompt the student to try again. Error-constrained GPT-4 provides a solution strategy tailored to the specific problem, whereas strategy-constrained GPT-4 abstracts away the details of applying the strategy. Complete-constrained GPT-4 combines both approaches, offering a comprehensive but long remediation response. These results highlight the challenge models face in generating responses to student mistakes that are simultaneously useful and caring.
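As a rough illustration of how the four settings differ, the sketch below builds a model prompt with optional error and strategy annotations. The prompt wording here is an assumption; only the presence or absence of the annotations reflects the setup described above.

```python
# A hedged sketch of the four Task C evaluation settings.
# The prompt text is an assumption, not the paper's actual prompt.
from typing import Optional

def build_prompt(conversation: str,
                 error_type: Optional[str] = None,
                 strategy: Optional[str] = None,
                 intention: Optional[str] = None) -> str:
    parts = [
        "You are a math tutor. The student has just made a mistake.",
        f"Conversation so far:\n{conversation}",
    ]
    if error_type:  # error-constrained / complete-constrained
        parts.append(f"The student's error type is: {error_type}.")
    if strategy:    # strategy-constrained / complete-constrained
        line = f"Respond using the strategy '{strategy}'."
        if intention:
            line += f" The intention is: {intention}."
        parts.append(line)
    parts.append("Write a short, caring response that remediates the mistake.")
    return "\n\n".join(parts)

# unconstrained:        build_prompt(dialogue)
# error-constrained:    build_prompt(dialogue, error_type="guess")
# strategy-constrained: build_prompt(dialogue, strategy="Provide a hint",
#                                    intention="Hint at the student's mistake")
# complete-constrained: build_prompt(dialogue, error_type="guess",
#                                    strategy="Provide a hint",
#                                    intention="Hint at the student's mistake")
```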
Icons used in some of the above figures were made by Freepik, ThoseIcons, dDara, Pixel perfect, mynamepong, Icongeek26, and Vitaly Gorbachev from flaticon.com.