Authors:
(1) Gladys Tyen, University of Cambridge, Dept. of Computer Science & Technology, ALTA Institute, and Work done during an internship at Google Research (e-mail: [email protected]);
(2) Hassan Mansoor, Google Research (e-mail: [email protected]);
(3) Victor Carbune, Google Research (e-mail: [email protected]);
(4) Peter Chen, Google Research and Equal leadership contribution ([email protected]);
(5) Tony Mak, Google Research and Equal leadership contribution (e-mail: [email protected]).
Table of Links
Conclusion, Limitations, and References
5 Related work
Datasets To our knowledge, the only publicly available dataset containing mistake annotations in LLM outputs is PRM800K (Lightman et al., 2023), which is a dataset of solutions to Olympiad-level math questions. Our dataset BIG-Bench Mistake covers a wider range of tasks to explore the reasoning capabilities of LLMs more thoroughly.
Additionally, the generator LLM used in PRM800K has been fine-tuned on 1.5B math tokens as well as a dataset of step-by-step math solutions. For this paper, we wanted to explore few-shot in-context learning methods, which is typically used in realworld applications with API-based LLMs.
Self-correction Pan et al. (2023) present a plethora of self-correction methods in recent literature. While their list includes training-time correction strategies such as RLHF (Ouyang et al., 2022) and self-improve (Huang et al., 2022), our backtracking method falls into the category of post-hoc correction, where the correction process is applied to outputs that have already been generated.
Our paper focuses on correction of logical and reasoning errors, rather than stylistic or qualitative improvements. Previous post-hoc correction methods that are applied to reasoning errors include Reflexion (Shinn et al., 2023) and RCI (Kim et al., 2023), both of which cause performance deterioration when the oracle label is not used (Huang et al., 2023). Other methods such as Self-Refine (Madaan et al., 2023) and iterative refinement (Chen et al., 2023) focus on q ualitative or stylistic improvements rather than correcting logical errors.
This paper is available on arxiv under CC 4.0 license.