Test-Time Training for LLMs: Scaling Problem Solving via TextGrad

4 Jun 2026

Table of Links

Abstract and 1. Introduction

2. TEXTGRAD: Optimizing AI systems by backpropagating text feedback
3. Results

3.1 Code optimization

3.2 Solution optimization by test-time training to improve problem solving

3.3 Prompt optimization for reasoning

3.4 Molecule optimization

3.5 Radiotherapy treatment plan optimization
4. Related work
5. Discussion, Acknowledgements, and References

A. TEXTGRAD Details

B. Optimizer Extensions

C. Code Optimization

D. Solution Optimization

E. Prompt Optimization

F. Molecule Optimization

G. Treatment Plan Optimization

3.2 Solution optimization by test-time training to improve problem solving

Here, we focus on the task of solution optimization via TEXTGRAD. In solution optimization, the goal is to improve the solution to a complex problem, such as a question about quantum mechanics or organic chemistry. In particular, we often have a computation graph like the following:

Solution → LLM → Loss

where the parameter we optimize is the Solution, and the loss function is obtained by evaluating the solution, e.g., with an LLM. At each iteration, the LLM is prompted with the question, the current solution, and a test-time instruction asking it to critique or investigate the current iteration. Over the optimization trajectory, the solution is refined using this test-time self-evaluation. More generally, this idea is known as test-time training [42, 43], where a machine learning model is trained on a test instance at test time, often with a self-supervised objective. Similarly, recent work has shown the merits of self-refinement for reasoning tasks as well [26, 30, 44]. In particular, even though an LLM may not answer a question or solve a problem correctly on its first attempt, it can improve its response through iterative refinement.
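The loop described above can be sketched without any model access. The following is a minimal, library-free illustration and not the TEXTGRAD API: `refine_solution`, `RefinementTrace`, `toy_critique`, and `toy_revise` are hypothetical names, and the two toy functions are assumed stand-ins for the LLM calls that would produce the critique (the textual "loss") and the revised solution (the "update").

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class RefinementTrace:
    """Keeps every solution produced during test-time refinement."""
    solutions: List[str] = field(default_factory=list)


def refine_solution(
    question: str,
    initial_solution: str,
    critique: Callable[[str, str, str], str],
    revise: Callable[[str, str, str], str],
    instruction: str,
    num_iterations: int = 3,
) -> RefinementTrace:
    """Refine a solution using only self-evaluation (no ground-truth answer)."""
    trace = RefinementTrace(solutions=[initial_solution])
    solution = initial_solution
    for _ in range(num_iterations):
        # "Loss": a textual critique of the current solution.
        feedback = critique(question, solution, instruction)
        # "Update": rewrite the solution in light of the critique.
        solution = revise(question, solution, feedback)
        trace.solutions.append(solution)
    return trace


# Toy stand-ins (assumptions) so the sketch runs without an LLM.
def toy_critique(question: str, solution: str, instruction: str) -> str:
    return "Recheck the arithmetic." if "5" in solution else "No issues found."


def toy_revise(question: str, solution: str, feedback: str) -> str:
    return "Answer: 4" if feedback.startswith("Recheck") else solution


trace = refine_solution(
    question="What is 2 + 2?",
    initial_solution="Answer: 5",
    critique=toy_critique,
    revise=toy_revise,
    instruction="Critique the solution step by step.",
)
print(trace.solutions)
```

In a real setting, both callables would wrap LLM calls; the trace keeps the initial solution and every update so that all of them remain available afterwards.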

For instance, the objective function for the refinement looks like the following:

Below is a representative implementation of solution optimization in TEXTGRAD.

Code Snippet 1: A representative implementation of solution optimization in TEXTGRAD.

Task: Google-proof Question Answering (GPQA) [27] is a recent benchmark where challenging multiple-choice questions in physics, biology, and chemistry are created and labeled by domain experts who have or are pursuing PhD degrees. In this benchmark, experts and skilled non-experts are reported to achieve 81% and 22% accuracy respectively, demonstrating the difficulty of the questions. Importantly, performance on this benchmark has not yet saturated: to our knowledge, the best reported result, achieved by gpt-4o, is 53.6% accuracy on the diamond subset. We also use two challenging subsets (Machine Learning and College Physics) of the MMLU [45] question answering benchmark, which is used to track the progress of language modeling and whether LLMs have reached human-level performance. Here, the average expert human accuracy is around 90%. For the details of the question format and prompts, please see Appendix D.

Method: We report two baselines. First, the gpt-4o release document reports 53.6% accuracy. However, the official implementation uses a temperature of 0.5 for generation; thus, we also test gpt-4o with temperature 0 and the Chain-of-Thought (CoT) [46, 47] prompting provided in the official implementation[1]. For TEXTGRAD, we perform 3 iterations of test-time updates (i.e., we update the solution three times) and perform majority voting across all solutions to get the final answer. We use string-based metrics to compute the final accuracy of each answer.
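The aggregation step (majority voting across all solutions, then a string-based comparison against the gold label) might look like the sketch below. The answer format is an assumption: each solution is taken to contain a line such as `Answer: C` with a choice in A to D. The helper names `extract_choice`, `majority_vote`, and `accuracy` are hypothetical and are not taken from the paper's code.

```python
import re
from collections import Counter
from typing import List, Optional

# Assumed answer format: "Answer: C" or "Answer: (C)", with choices A-D.
ANSWER_PATTERN = re.compile(r"Answer:\s*\(?([A-D])\)?(?![A-Za-z])")


def extract_choice(solution: str) -> Optional[str]:
    """Return the last multiple-choice letter stated in a solution, if any."""
    matches = ANSWER_PATTERN.findall(solution)
    return matches[-1] if matches else None


def majority_vote(solutions: List[str]) -> Optional[str]:
    """Pick the most frequent choice across all solutions in a trajectory."""
    choices = [c for c in (extract_choice(s) for s in solutions) if c is not None]
    if not choices:
        return None
    return Counter(choices).most_common(1)[0][0]


def accuracy(predictions: List[Optional[str]], gold: List[str]) -> float:
    """String-based accuracy: exact match between predicted and gold letters."""
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)


# One question: the initial solution plus three test-time updates.
solutions = [
    "The field is uniform, so the force vanishes. Answer: B",
    "On reflection, the gradient matters. Answer: (C)",
    "Answer: C",
    "Rechecked the units. Answer: C",
]
final_answer = majority_vote(solutions)
print(final_answer)
```

Voting over the whole trajectory, rather than taking only the last update, means a single poor revision cannot overturn an answer that the other solutions agree on.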

Results: With TEXTGRAD, we improve the performance of gpt-4o on the challenging question-answering tasks and report the results in Table 2. To the best of our knowledge, 55% is the best known result on the GPQA dataset so far. Similarly, we improve the performance on the MMLU subsets from 85.7% to 88.4% (Machine Learning) and from 91.2% to 95.1% (College Physics). These results show that by spending more computation at test time through TEXTGRAD self-refinement, we can improve the question answering performance of even the most capable models.
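As a quick arithmetic check on the MMLU figures quoted above, the sketch below computes the absolute gain in percentage points and the share of the remaining errors that the gain removes. The function name `summarize_gain` is hypothetical, and the error-reduction view is an added illustration rather than a metric reported in the paper.

```python
from typing import Dict, Tuple


def summarize_gain(before: float, after: float) -> Tuple[float, float]:
    """Return (absolute gain in points, fraction of remaining errors removed)."""
    gain = after - before
    error_reduction = gain / (100.0 - before)
    return gain, error_reduction


# Accuracies (in percent) quoted in the text: baseline vs. TEXTGRAD.
mmlu_results: Dict[str, Tuple[float, float]] = {
    "MMLU Machine Learning": (85.7, 88.4),
    "MMLU College Physics": (91.2, 95.1),
}

for name, (before, after) in mmlu_results.items():
    gain, error_reduction = summarize_gain(before, after)
    print(f"{name}: +{gain:.1f} points, {error_reduction:.1%} of remaining errors removed")
```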

Authors:

(1) Mert Yuksekgonul, Co-first author from Department of Computer Science, Stanford University;

(2) Federico Bianchi, Co-first author from Department of Computer Science, Stanford University;

(3) Joseph Boen, Co-first author from Department of Biomedical Data Science, Stanford University;

(4) Sheng Liu, Co-first author from Department of Biomedical Data Science, Stanford University;

(5) Zhi Huang, Co-first author from Department of Biomedical Data Science, Stanford University;

(6) Carlos Guestrin, Department of Computer Science, Stanford University and Chan Zuckerberg Biohub;

(7) James Zou, Department of Computer Science, Stanford University, Department of Biomedical Data Science, Stanford University, and Chan Zuckerberg Biohub.

This paper is available on arXiv under the CC BY 4.0 license.

[1] We do this to minimize randomness; however, discussions claim there may be other sources of non-determinism with gpt-4.
