Step-by-Step Remediation of Students' Mathematical Mistakes

Rose E. Wang
Qingyang Zhang
Carly Robinson
Susanna Loeb
Dorottya Demszky
rewang@cs.stanford.edu, ddemszky@stanford.edu
2023
[Paper]
[Code]

Abstract. Scaling high-quality tutoring is a major challenge in education. Because of the growing demand, many platforms employ novice tutors who, unlike professional educators, struggle to effectively address student mistakes and thus fail to seize prime learning opportunities for students. In this paper, we explore the potential for large language models (LLMs) to assist math tutors in remediating student mistakes. We present ReMath, a benchmark co-developed with experienced math teachers that deconstructs their thought process for remediation. The benchmark consists of three step-by-step tasks: (1) infer the type of student error, (2) determine the strategy to address the error, and (3) generate a response that incorporates that information. We evaluate the performance of state-of-the-art instruct-tuned and dialog models on ReMath. Our findings suggest that although models consistently improve upon original tutor responses, we cannot rely on models alone to remediate mistakes. Providing models with the error type (e.g., the student is guessing) and strategy (e.g., simplify the problem) leads to a 75% improvement in the response quality over models without that information. Nonetheless, despite the improvement, the quality of the best model's responses still falls short of experienced math teachers. Our work sheds light on the potential and limitations of using current LLMs to provide high-quality learning experiences for both tutors and students at scale.



ReMath: A Step-by-Step Remediation Framework

Example of remediation.

The benchmark has three core tasks. Task A: infer the type of student error; Task B: determine a response strategy and the intention behind that strategy; and Task C: generate a response that incorporates that information. The ReMath framework emerged from several months of co-development with experienced math teachers. We designed it to be comprehensive, intuitive, and aligned with the process that teachers actually follow.
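To make the structure of a single ReMath example concrete, here is a minimal Python sketch of how one annotated tutoring turn might be represented. The class and field names are illustrative assumptions, not the released dataset's schema.

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class RemathExample:
    # Illustrative container for one annotated tutoring turn (names are hypothetical).
    conversation_history: List[str]         # prior tutor/student turns, ending with the student's mistake
    lesson_topic: str                       # e.g., "multiplying fractions"
    error_type: Optional[str] = None        # Task A label, e.g., "guess" or "careless"
    strategy: Optional[str] = None          # Task B strategy, e.g., "Provide a hint"
    intention: Optional[str] = None         # Task B intention, e.g., "Hint at the student's mistake"
    teacher_response: Optional[str] = None  # Task C: the remediation response written by a teacher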

Task A. Infer the Type of Student Error

Identifying the student's error is a prerequisite to successful remediation (Easley and Zwoyer, 1975; Bamberger et al., 2010). Task A asks teachers to infer the most likely cause of the mistake from the conversation context. Prior research, particularly in math teacher education, has often focused on topic-specific categories of misconceptions (e.g., Bamberger et al., 2010). Because our approach intends to support tutors who are not necessarily content experts, we instead define ``error'' as a student's degree of understanding. This definition aligns with literature on the hierarchical design of mathematics curricula (Gagne, 1962, 1968; White, 1973; Resnick et al., 1973) and with psychometrics, including constructs like the Zone of Proximal Development and item-response theory, which use continuous scales of mastery and update the inferred level of student understanding from student responses (Glaser and Nitko, 1970; Vygotsky and Cole, 1978; Wertsch, 1985; Embretson and Reise, 2013). As such, our error categories are topic-agnostic descriptions of a student's understanding, and they complement the topic-agnostic strategies in Task B. The categories are listed below; a small sketch encoding them follows the list.

  • guess: The student does not seem to understand or guessed the answer. This error type is characterized by expressions of uncertainty or answers unrelated to the problem or target answer.
  • misinterpret: The student misinterpreted the question. This error type is characterized by answers that arise from a misunderstanding of the question. Students may mistakenly address a subtly different question, leading to an incorrect response. One example is reversing numbers, e.g., interpreting ``2 divided by 6'' as ``6 divided by 2.''
  • careless: The student made a careless mistake. This error type is characterized by answers that appear to utilize the correct mathematical operation but contain a small numerical mistake, resulting in an answer that is slightly off.
  • right-idea: Student has the right idea, but is not quite there. This error type is characterized by situations where the student demonstrates a general understanding of the underlying concept but falls short of reaching the correct solution. For example, a student with a ``right-idea'' error may recognize that multiplication is required to compute areas but may struggle with applying it to a specific problem.
  • imprecise: Student’s answer is not precise enough or the tutor is being too picky about the form of the student’s answer. This error type is characterized by answers that lack the necessary level of precision, or by cases where the tutor places excessive emphasis on the form of the student's response.
  • not-sure: Not sure, but I’m going to try to diagnose the student. This option is used if the teacher is not sure why the student made the mistake from the context provided. We encourage the teachers to use this error type sparingly.
  • N/A: None of the above, I have a different description. This option is used when none of the other options reflect the error type. As with not-sure, we encourage teachers to use this error type sparingly.
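The error categories above form a small, closed label set, so they can be encoded directly. The enum below is an illustrative sketch whose string values mirror the labels above; it is not part of the released code.

from enum import Enum

class ErrorType(str, Enum):
    # Task A error categories (string values mirror the labels above; illustrative only).
    GUESS = "guess"                # does not understand or guessed the answer
    MISINTERPRET = "misinterpret"  # misinterpreted the question
    CARELESS = "careless"          # correct operation, small numerical slip
    RIGHT_IDEA = "right-idea"      # right idea, but not quite there
    IMPRECISE = "imprecise"        # answer not precise enough (or tutor too picky)
    NOT_SURE = "not-sure"          # annotator unsure of the cause
    NA = "n/a"                     # none of the above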

Task B. Determine a Response Strategy and Intention of the Strategy

Student errors tend to persist unless the teacher intervenes pedagogically (Radatz, 1980). This task involves selecting a course of action that guides the student towards improving their understanding, as well as specifying the intention behind that action. The strategy and intention label sets are listed below, followed by a short sketch that transcribes them.

Strategies: Explain a concept, Ask a question, Provide a hint, Provide a strategy, Provide a worked example, Provide a minor correction, Provide a similar problem, Simplify the question, Affirm the correct answer, Encourage the student, Other.

Intentions: Motivate the student, Get the student to elaborate their answer, Correct the student's mistake, Hint at the student's mistake, Clarify a student's misunderstanding, Help the student understand the lesson topic or solution strategy, Diagnose the student's mistake, Support the student in their thinking or problem-solving, Explain the student's mistake (e.g., what is wrong in their answer or why is it incorrect), Signal to the student that they have solved or not solved the problem, Other.
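Because Task B also uses closed label sets, the strategies and intentions can be written down the same way. The constants below simply transcribe the lists above; they are an illustrative sketch rather than the dataset's official schema.

# Task B label sets, transcribed from the lists above (illustrative constants).
STRATEGIES = [
    "Explain a concept", "Ask a question", "Provide a hint", "Provide a strategy",
    "Provide a worked example", "Provide a minor correction", "Provide a similar problem",
    "Simplify the question", "Affirm the correct answer", "Encourage the student", "Other",
]

INTENTIONS = [
    "Motivate the student",
    "Get the student to elaborate their answer",
    "Correct the student's mistake",
    "Hint at the student's mistake",
    "Clarify a student's misunderstanding",
    "Help the student understand the lesson topic or solution strategy",
    "Diagnose the student's mistake",
    "Support the student in their thinking or problem-solving",
    "Explain the student's mistake (e.g., what is wrong in their answer or why is it incorrect)",
    "Signal to the student that they have solved or not solved the problem",
    "Other",
]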

Task C. Generate the Response

Once the student error has been identified and a response strategy has been determined, the next task is to generate a suitable response. We instruct teachers to respond in a useful and caring way. Experienced educators possess the instructional expertise to generate responses tailored to the individual student's needs (e.g., their error type) and age group. This matters because the students in this tutoring program are elementary school students, who require different pedagogical strategies than older students (Anghileri, 2006).
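To illustrate how the outputs of Tasks A and B can be provided to a model for Task C, here is a minimal prompt-assembly sketch. The function name and template wording are assumptions for illustration, not the exact prompts used in the paper; the assembled string would then be sent to an LLM of choice.

def build_remediation_prompt(history, error_type=None, strategy=None, intention=None):
    # Assemble a Task C prompt. Which annotations are included mirrors the
    # unconstrained / error-constrained / strategy-constrained / complete-constrained
    # settings discussed in the results below. (Wording is illustrative, not the paper's.)
    lines = ["You are a math tutor helping an elementary school student.",
             "Conversation so far:"]
    lines += ["  " + turn for turn in history]
    if error_type:
        lines.append("The student's error type is: " + error_type + ".")
    if strategy:
        lines.append("Respond using this strategy: " + strategy + ".")
    if intention:
        lines.append("The intention of the response is to: " + intention + ".")
    lines.append("Write a useful and caring response that remediates the student's mistake.")
    return "\n".join(lines)

# Example: a fully constrained prompt with all teacher annotations provided.
print(build_remediation_prompt(
    history=["Tutor: What is 2 divided by 6?", "Student: 3"],
    error_type="misinterpret",
    strategy="Ask a question",
    intention="Hint at the student's mistake",
))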

Results

Evaluation of Task C.

The table above summarizes the evaluation results for Task C. Notably, the models consistently outperform the original tutor responses (indicated by positive values) on all dimensions, with two exceptions: strategy-constrained Flan-T5 is worse on all dimensions, and ChatGPT is worse on care. The teacher-written responses are rated highest on all dimensions except, surprisingly, human-soundingness. The best model across all settings is GPT4, and it benefits the most from teacher annotations: its ratings nearly double on every dimension except care when moving from the unconstrained to the strategy-constrained setting, and its care rating improves only when error information is added. The figure above shows GPT4 responses that illustrate the diversity of remediation strategies. In the unconstrained setting, GPT4 directly corrects the student, while the other models use different approaches to prompt the student to try again. Error-constrained GPT4 provides a solution strategy tailored to the specific problem, whereas strategy-constrained GPT4 abstracts away the details of applying the strategy. Complete-constrained GPT4 combines both approaches, offering a comprehensive but long remediation response. These results highlight how challenging it is for models to generate responses that are simultaneously useful and caring.

For more details, please check out our paper linked below!



Paper and BibTeX

@misc{wang2023stepbystep,
      title={Step-by-Step Remediation of Students' Mathematical Mistakes},
      author={Rose E. Wang and Qingyang Zhang and Carly Robinson and Susanna Loeb and Dorottya Demszky},
      year={2023},
      eprint={2310.10648},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}


Acknowledgements

This website is adapted from this website, which was adapted from this website, which was in turn adapted from this website. Feel free to use this website as a template for your own projects; please reference this page if you do!

Icons used in some of the above figures were made by Freepik, ThoseIcons, dDara, Pixel perfect, ThoseIcons, mynamepong, Icongeek26, and Vitaly Gorbachev from flaticon.com.