Benchmarking TextGrad: Automated Code Optimization on LeetCode Hard

17 Jun 2026

Table of Links

Abstract and 1. Introduction

TEXTGRAD: Optimizing AI systems by backpropagating text feedback

Results

3.1 Code optimization

3.2 Solution optimization by test-time training to improve problem solving

3.3 Prompt optimization for reasoning

3.4 Molecule optimization

3.5 Radiotherapy treatment plan optimization

Related work

Discussion, Acknowledgements, and References

A. TEXTGRAD Details

B. Optimizer Extensions

C. Code Optimization

D. Solution Optimization

E. Prompt Optimization

F. Molecule Optimization

G. Treatment Plan Optimization

C Code Optimization

C.1 Methodology

Baseline Details. We re-ran Reflexion [26] using the authors’ publicly available code. We had to make minor changes to ensure it ran correctly in our setting, and we contacted the original authors and asked for feedback on our edits to ensure that our evaluation was consistent. Moreover, we had to re-extract the LeetCodeHard [26] dataset using the authors’ pipeline; as a result, the dataset used in this manuscript is likely not identical to the one used in the Reflexion paper. Although the dataset includes a set of simple local tests to check whether the code works as expected, the real evaluation is whether a solution passes the tests on the LeetCode platform.

Reflexion is run with a one-shot prompt that instructs the model on how to provide feedback. The pipeline run by Reflexion is as follows: given a prompt to generate code, a language model generates a first solution. If this solution passes the local tests, it is submitted to the LeetCode platform to be evaluated on harder tests. If it does not pass the local tests, we ask for feedback through Reflexion and optimize the code. If the new solution then passes the local tests, we submit it to the LeetCode platform. We run this optimization for 5 iterations.
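The control flow described above can be sketched as follows. This is a minimal illustration, not the Reflexion or TEXTGRAD implementation: `generate`, `passes_local_tests`, `refine`, and `submit` are hypothetical callables standing in for the language-model call, the local test harness, the feedback-and-rewrite step, and the LeetCode submission. The text does not say what happens to a solution that still fails the local tests after the last iteration; the sketch assumes it is not submitted.

```python
from typing import Callable, Optional


def optimize_and_submit(
    prompt: str,
    generate: Callable[[str], str],             # hypothetical: prompt -> first solution
    passes_local_tests: Callable[[str], bool],  # hypothetical: local test harness
    refine: Callable[[str, str], str],          # hypothetical: feedback + rewrite step
    submit: Callable[[str], bool],              # hypothetical: LeetCode verdict
    max_iterations: int = 5,
) -> Optional[bool]:
    """Return the platform verdict, or None if no solution passed the local tests."""
    solution = generate(prompt)
    for _ in range(max_iterations):
        # A solution that passes the local tests is submitted immediately.
        if passes_local_tests(solution):
            return submit(solution)
        # Otherwise ask for feedback and produce a revised solution.
        solution = refine(prompt, solution)
    # Check the solution produced by the final refinement.
    if passes_local_tests(solution):
        return submit(solution)
    return None  # assumption: solutions that never pass locally are not submitted
```

Passing the four steps in as callables keeps the loop independent of any particular model client or submission script, so the same skeleton applies whether the refinement step is Reflexion feedback or a TEXTGRAD update.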

We ran the experiment 5 times with 5 different seeds and averaged the results. At each optimization iteration, TEXTGRAD makes 1 call to gpt-4o to evaluate the test-time loss, 1 call to collect gradients, and 1 call to update the code snippet. LeetCodeHard contains 39 coding problems.
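As a rough illustration of the bookkeeping in this paragraph, the sketch below computes an upper bound on optimization calls per seed and averages a per-seed completion rate. Several things here are assumptions rather than statements from the text: that TEXTGRAD also runs for 5 iterations (the count is given only in the Reflexion description), that every problem uses all of its iterations, and that the call generating the initial solution is excluded. The function names are hypothetical.

```python
from statistics import mean

NUM_PROBLEMS = 39         # problems in LeetCodeHard (from the text)
CALLS_PER_ITERATION = 3   # loss evaluation + gradient collection + update (from the text)
NUM_ITERATIONS = 5        # assumption: same iteration count as the Reflexion runs
NUM_SEEDS = 5             # from the text


def max_optimization_calls(
    problems: int = NUM_PROBLEMS,
    iterations: int = NUM_ITERATIONS,
    calls_per_iteration: int = CALLS_PER_ITERATION,
) -> int:
    """Upper bound on calls for one seed, if no problem stops early."""
    return problems * iterations * calls_per_iteration


def completion_rate(num_solved: int, num_problems: int = NUM_PROBLEMS) -> float:
    """Fraction of problems whose submitted solution was accepted."""
    return num_solved / num_problems


def average_over_seeds(solved_per_seed: list[int]) -> float:
    """Mean completion rate across independent seeded runs."""
    return mean(completion_rate(n) for n in solved_per_seed)
```

Under these assumptions the bound is 39 × 5 × 3 = 585 calls per seed. Problems that pass the local tests early stop sooner, so the actual count would be lower.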

Authors:

(1) Mert Yuksekgonul, Co-first author from Department of Computer Science, Stanford University;

(2) Federico Bianchi, Co-first author from Department of Computer Science, Stanford University;

(3) Joseph Boen, Co-first author from Department of Biomedical Data Science, Stanford University;

(4) Sheng Liu, Co-first author from Department of Biomedical Data Science, Stanford University;

(5) Zhi Huang, Co-first author from Department of Biomedical Data Science, Stanford University;

(6) Carlos Guestrin, Department of Computer Science, Stanford University and Chan Zuckerberg Biohub;

(7) James Zou, Department of Computer Science, Stanford University, Department of Biomedical Data Science, Stanford University, and Chan Zuckerberg Biohub.

This paper is available on arXiv under the CC BY 4.0 license.
