Textual Gradient Descent: A New Tuning Paradigm for Compound AI

3 Jun 2026

Table of Links

Abstract and 1. Introduction

2 TEXTGRAD: Optimizing AI systems by backpropagating text feedback

3 Results

3.1 Code optimization

3.2 Solution optimization by test-time training to improve problem solving

3.3 Prompt optimization for reasoning

3.4 Molecule optimization

3.5 Radiotherapy treatment plan optimization

4 Related work

5 Discussion, Acknowledgements, and References

A. TEXTGRAD Details

B. Optimizer Extensions

C. Code Optimization

D. Solution Optimization

E. Prompt Optimization

F. Molecule Optimization

G. Treatment Plan Optimization

Abstract

AI is undergoing a paradigm shift, with breakthroughs achieved by systems orchestrating multiple large language models (LLMs) and other complex components. As a result, developing principled and automated optimization methods for compound AI systems is one of the most important new challenges. Neural networks faced a similar challenge in their early days until backpropagation and automatic differentiation transformed the field by making optimization turn-key. Inspired by this, we introduce TEXTGRAD, a powerful framework performing automatic “differentiation” via text. TEXTGRAD backpropagates textual feedback provided by LLMs to improve individual components of a compound AI system. In our framework, LLMs provide rich, general, natural language suggestions to optimize variables in computation graphs, ranging from code snippets to molecular structures. TEXTGRAD follows PyTorch’s syntax and abstraction and is flexible and easy to use. It works out of the box for a variety of tasks: users only provide the objective function, without tuning components or prompts of the framework. We showcase TEXTGRAD’s effectiveness and generality across a diverse range of applications, from question answering and molecule optimization to radiotherapy treatment planning. Without modifying the framework, TEXTGRAD improves the zero-shot accuracy of GPT-4o in Google-Proof Question Answering from 51% to 55%, yields a 20% relative performance gain in optimizing solutions to LeetCode-Hard coding problems, improves prompts for reasoning, designs new druglike small molecules with desirable in silico binding, and designs radiation oncology treatment plans with high specificity. TEXTGRAD lays a foundation to accelerate the development of the next generation of AI systems.

1 Introduction

There is an emerging paradigm shift in how AI systems are built, owing to the breakthroughs of Large Language Models (LLMs) [1–6]. The new generation of AI applications increasingly consists of compound systems involving multiple sophisticated components, where each component could be an LLM-based agent, a tool such as a simulator, or a web search engine. For instance, a system of LLMs communicating with symbolic solvers can solve olympiad-level math problems [7]; a system of LLMs using search engines and code interpreter tools performs comparably to human competitive programmers [8]; and similar systems are solving real-world GitHub issues [9]. However, many of these breakthroughs came from systems that were hand-crafted by experts in the application domain and tweaked through heuristics. Therefore, developing principled and automated ways to optimize AI systems is one of the most crucial challenges for building compound systems with LLMs, and is necessary for unlocking the power of AI [10–12].

For the past 15 years, many advances in AI have relied on artificial neural networks and differentiable optimization [7, 13–17]. Different parts of a neural network (e.g., two artificial neurons) communicate through differentiable functions such as matrix multiplications [18]. Therefore, using numerical gradients and backpropagation [19], which provide the direction in which to adjust each parameter to improve a model, has been the natural way to train AI models. Flexible automatic differentiation frameworks implementing backpropagation [20–24] have been indispensable to the development of AI models.
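As a minimal sketch of what such frameworks automate, the toy scalar reverse-mode differentiator below records each operation in a computation graph, propagates gradients backward with the chain rule, and uses them for gradient descent. The `Value` class and the slope-fitting problem are our own illustration, not the design of any framework cited above.

```python
class Value:
    """A scalar node in a computation graph that remembers how it was produced."""

    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self.parents = parents
        self.backward_fn = None  # pushes self.grad to the parents (chain rule)

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))

        def backward_fn():
            self.grad += out.grad
            other.grad += out.grad

        out.backward_fn = backward_fn
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))

        def backward_fn():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad

        out.backward_fn = backward_fn
        return out

    def backward(self):
        # Collect nodes so that each node appears after the nodes it was computed from.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node.parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0  # d(self)/d(self)
        for node in reversed(order):  # outputs first, inputs last
            if node.backward_fn is not None:
                node.backward_fn()


# Fit the slope w so that w * x matches the target, minimizing (w * x - target)^2.
x, target = Value(2.0), 6.0
w = Value(0.0)
for _ in range(50):
    w.grad = 0.0
    err = w * x + Value(-target)
    loss = err * err
    loss.backward()
    w.data -= 0.05 * w.grad  # step against the gradient
print(round(w.data, 6))  # 3.0
```

The gradient of the loss with respect to `w` is a number whose sign and size say which way to move `w`; this is the role that natural-language feedback plays in the textual setting.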

To optimize the new generation of AI systems, we introduce TEXTGRAD, automatic differentiation via text. Here we use differentiation and gradients as a metaphor for textual feedback from LLMs. In this framework, each AI system is transformed into a computation graph, where variables are the inputs and outputs of complex (not necessarily differentiable) function calls. The feedback to the variables (dubbed ‘textual gradients’ [25]) is provided in the form of informative and interpretable natural language criticism, describing how a variable should be changed to improve the system. The gradients are propagated through arbitrary functions, such as LLM API calls, simulators, or external numerical solvers.
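The mechanism described above can be illustrated with a minimal offline sketch in Python. It is not TEXTGRAD's implementation or API: every name in it (`TextVariable`, `call`, `critic`, `backward`, `tgd_step`) is hypothetical, and the deterministic functions `model`, `evaluate`, and `rewrite` are assumed stand-ins for the LLM calls that would generate answers, evaluations, and updates. The forward pass records which call produced each variable; the backward pass visits variables in reverse topological order and turns feedback on each output into feedback on its inputs, keeping it only for variables that require gradients; a descent step then rewrites the trainable variable from its accumulated feedback.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class TextVariable:
    value: str
    role: str  # natural-language description of what this variable is
    requires_grad: bool = False
    parents: List["TextVariable"] = field(default_factory=list)
    # Maps feedback on this variable to feedback for each of its parents.
    backward_fn: Optional[Callable[[str], List[str]]] = None
    gradients: List[str] = field(default_factory=list)  # textual "gradients"


def critic(output: TextVariable, feedback: str, inp: TextVariable) -> str:
    # Stand-in (assumption) for a backward LLM call: feedback on `output` -> feedback on `inp`.
    return (f"The {output.role} is {output.value!r} and should be improved as follows: "
            f"{feedback} Change the {inp.role} to make that happen.")


def call(fn: Callable[..., str], *inputs: TextVariable, role: str) -> TextVariable:
    """Run any (not necessarily differentiable) function and record it in the graph."""
    out = TextVariable(fn(*(v.value for v in inputs)), role,
                       requires_grad=any(v.requires_grad for v in inputs),
                       parents=list(inputs))
    out.backward_fn = lambda fb: [critic(out, fb, v) for v in inputs]
    return out


def backward(loss: TextVariable, objective: str) -> None:
    order, seen = [], set()

    def visit(v: TextVariable) -> None:
        if id(v) not in seen:
            seen.add(id(v))
            for p in v.parents:
                visit(p)
            order.append(v)

    visit(loss)
    loss.gradients.append(objective)  # seed, playing the role of d(loss)/d(loss) = 1
    for v in reversed(order):  # each variable is handled before its inputs
        if v.backward_fn is None or not v.gradients:
            continue
        for parent, fb in zip(v.parents, v.backward_fn(" ".join(v.gradients))):
            if parent.requires_grad:
                parent.gradients.append(fb)


# Deterministic stand-ins (assumptions) for the forward LLM, evaluator, and optimizer LLM.
def model(instruction: str, question: str) -> str:
    # The toy model ignores the question and always answers 2 + 2.
    if "step by step" in instruction:
        return "Reasoning: adding 2 and 2 gives 4. Answer: 4"
    return "Answer: 4"


def evaluate(answer: str) -> str:
    if "Reasoning:" in answer:
        return "The answer is correct and justified."
    return "The answer gives no reasoning; it should show its steps."


def rewrite(value: str, feedback: str) -> str:
    if "show its steps" in feedback and "step by step" not in value:
        return value + " Think step by step."
    return value


def tgd_step(params: List[TextVariable]) -> None:
    # One "textual gradient descent" step: rewrite each variable from its feedback.
    for p in params:
        if p.gradients:
            p.value = rewrite(p.value, " ".join(p.gradients))
            p.gradients.clear()


instruction = TextVariable("Answer the question.", "system prompt", requires_grad=True)
question = TextVariable("What is 2 + 2?", "question")
answer = call(model, instruction, question, role="answer to the question")
loss = call(evaluate, answer, role="evaluation of the answer")
backward(loss, "Make this evaluation positive.")
tgd_step([instruction])
print(instruction.value)  # Answer the question. Think step by step.
```

In a real system the three stand-ins would be LLM prompts, so the feedback would be free-form text rather than a template; the graph traversal is the part that carries over directly from numerical automatic differentiation.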

We demonstrate the power of our framework in a diverse set of domains, ranging from question answering benchmarks to radiotherapy treatment plan optimization and molecule generation (Fig. 1). LLMs can provide very rich, legible, and expressive natural language gradients to variables in this wide range of domains, such as proposing modifications to molecules, prompts to other LLMs, and code snippets. Our framework is built on the assumption that current state-of-the-art LLMs are able to reason about the individual components and subtasks of the system they are asked to optimize. We demonstrate the flexibility of TEXTGRAD with the following results:

Coding: In Section 3.1, we optimize solutions to difficult coding problems from LeetCode [26], where we boost the performance of GPT-4o and the best existing method, achieving a 20% relative performance gain.
Problem Solving: In Section 3.2, we optimize solutions to complex scientific questions to improve the zero-shot performance of GPT-4o. For instance, on the Google-Proof Question Answering (GPQA) benchmark [27], we improve the zero-shot accuracy from 51% to 55% by refining the solutions at test time.
Reasoning: In Section 3.3, we optimize prompts to improve LLM performance, bringing GPT-3.5 close to GPT-4 on several reasoning tasks.
Chemistry: In Section 3.4, we design new small molecules with desirable druglikeness and in silico binding affinity to drug targets.
Medicine: In Section 3.5, we optimize radiation treatment plans for prostate cancer patients to achieve desirable target dosage and reduce side effects.

Our results in a broad set of applications demonstrate the promise of TEXTGRAD to automatically optimize compound AI systems via backpropagation of text feedback. To accelerate progress in this direction, we open-source our framework at https://github.com/zou-group/textgrad.

This paper is available on arXiv under the CC BY 4.0 license.

Authors:

(1) Mert Yuksekgonul, Co-first author from Department of Computer Science, Stanford University;

(2) Federico Bianchi, Co-first author from Department of Computer Science, Stanford University;

(3) Joseph Boen, Co-first author from Department of Biomedical Data Science, Stanford University;

(4) Sheng Liu, Co-first author from Department of Biomedical Data Science, Stanford University;

(5) Zhi Huang, Co-first author from Department of Biomedical Data Science, Stanford University;

(6) Carlos Guestrin, Department of Computer Science, Stanford University and Chan Zuckerberg Biohub;

(7) James Zou, Department of Computer Science, Stanford University, Department of Biomedical Data Science, Stanford University, and Chan Zuckerberg Biohub.
