Algorithmic Prompt Refining: Elevating Smaller LLMs with Textual Gradients

cover
11 Jun 2026

Abstract and 1. Introduction

  1. TEXTGRAD: Optimizing AI systems by backpropagating text feedback

  2. Results

    3.1 Code optimization

    3.2 Solution optimization by test-time training to improve problem solving

    3.3 Prompt optimization for reasoning

    3.4 Molecule optimization

    3.5 Radiotherapy treatment plan optimization

  3. Related work

  4. Discussion, Acknowledgements, and References

A. TEXTGRAD Details

B. Optimizer Extensions

C. Code Optimization

D. Solution Optimization

E. Prompt Optimization

F. Molecule Optimization

G. Treatment Plan Optimization

3.3 Prompt optimization for reasoning

While LLMs demonstrate an impressive performance in reasoning tasks, their performance can be sensitive to the prompt used to guide their behavior. In particular, with the right choice of a prompt, their performance can be significantly improved [49]. In prompt optimization, the goal is to find a prompt or an instruction to guide the behavior of an LLM, such that it performs well on a given task. In particular, we often have a computation graph like the following:

Table 2: Solution optimization for zero-shot question answering with gpt-4o.

where we have a Question for the task, an Answer to the question, and an Evaluation Metric indicating the quality of the output given the ground truth answer. For instance, for a question-answering task, the evaluation metric would be the accuracy of the answer, evaluated e.g. using string-based metrics.

Here, given a handful of training examples to optimize a Prompt, the goal is to maximize the performance of an LLM on the given task. In our experiments, our goal is to improve the performance of a weaker and cheaper model (e.g., gpt-3.5-turbo-0125) using the feedback generated by stronger models (e.g., gpt-4o). This is useful in practice because by paying a fixed cost to optimize a Prompt, the prompt-optimized weaker model can be used with cheaper inference costs instead of using the strong and more expensive model.

Code Snippet 2: A simple implementation of prompt optimization with TEXTGRAD.

Here, we use TEXTGRAD in a minibatch stochastic gradient descent setting [18, 34]. In particular, at each iteration, we use a few training examples to run the forward pass in Equation 16. A pseudocode and short implementation can be found in the snippet above. Full details of prompt optimization can be found in Appendix E. Unlike instance optimization, where TEXTGRAD tries to optimize each individual solution, the goal here is to optimize a single prompt that works well across all the questions in a benchmark.

Tasks: We explore prompt optimization in multiple datasets, including the two standard reasoning tasks (Object Counting and Word Sorting) from Big Bench Hard (randomly split into 50/100/100 train/validation/test samples) [50, 51], GSM8k grade-school math problem solving [52] (using train/validation/test splits from DSPy). For GSM8k and Object Counting, we use a string-based exact match metric to quantify accuracy (i.e. whether the final number provided in the response is the same as the ground truth answer). For Word Sorting, we use an LLM to compare the response and the ground truth answer. We give more details about the tasks, prompts, and example queries in Appendix E.1.

Methods: We explore improving the performance of gpt-3.5-turbo-0125 using gpt-4o [48] to provide feedback during backpropagation. In particular, while the forward model that performs the reasoning is gpt-3.5-turbo-0125, we use gpt-4o to provide the feedback and improve the prompt. We use a batch size of 3 with 12 iterations, i.e., the model sees 36 training examples in total, sampled randomly with replacement. After each iteration, we run a validation loop with the validation set of the datasets, and if the performance is better than the previous iteration we update the Prompt.

Baselines: We have two main baselines:

  1. Zero-shot Chain-of-Thought (CoT) [46, 47]: We initialize all prompts as zero-shot CoT prompt, where a model is instructed to ‘Think step-by-step’ to explain its reasoning before giving its answer. This strategy is well-known to be a strong baseline for prompting.

  2. DSPy is a state-of-the-art language model programming and prompt optimization framework [10], thus we use it as the reference baseline. We instantiate DSPy’s BootstrappedFewShotRandomSearch (BFSR) optimizer with 10 candidate programs and 8 few-shot examples. This optimizer identifies demonstrations to include in the prompt as few-shot examples. This is done through generating traces of LLM inputs and outputs that individually pass the metric (in this case, accuracy) and includes CoT reasoning. It then applies random search over subsets of up to size eight shots with these demonstrations.

Table 3: Prompt optimization for reasoning tasks. With TEXTGRAD, we optimize a system prompt for gpt-3.5-turbo using gpt-4o as the gradient engine that provides the feedback during backpropagation.

Results: Across all three tasks, TEXTGRAD improves the performance of the 0-shot prompt significantly. It performs similarly to DSPy [10] for Word Sorting and GSM8k, and improves over DSPy by 7% for Object Counting. While the 8 demonstrations in the context can help guide the behavior of the LLM, it can increase the cost of inference. Interestingly, the DSPy optimizer and TEXTGRAD make complementary adjustments—the former adds in-context demonstration examples and latter optimizes the system prompt. Adding the examples selected by DSPy to TEXTGRAD’s optimized prompt could further improve performance (for GSM8k, directly combining the demonstrations from DSPy with the instruction from TextGrad increases the accuracy to 82.1%), suggesting that a fruitful direction is to combine both approaches.

Authors:

(1) Mert Yuksekgonul, Co-first author from Department of Computer Science, Stanford University ([email protected]);

(2) Federico Bianchi, Co-first author from Department of Computer Science, Stanford University ([email protected]);

(3) Joseph Boen, Co-first author from Department of Biomedical Data Science, Stanford University ([email protected]);

(4) Sheng Liu, Co-first author from Department of Biomedical Data Science, Stanford University ([email protected]);

(5) Zhi Huang, Co-first author from Department of Biomedical Data Science, Stanford University ([email protected]);

(6) Carlos Guestrin, Department of Computer Science, Stanford University and Chan Zuckerberg Biohub ([email protected]);

(7) James Zou, Department of Computer Science, Stanford University, Department of Biomedical Data Science, Stanford University, and Chan Zuckerberg Biohub ([email protected]).


This paper is available on arxiv under CC BY 4.0 license.