TEXTGRAD: Optimizing AI systems by backpropagating text feedback
3 Results
We demonstrate the flexibility of TEXTGRAD in a diverse array of applications. In § 3.1, we optimize code snippets to solve hard coding problems from LeetCode. In § 3.2, we optimize solutions to science questions. In § 3.3, we optimize prompts to improve the reasoning of LLMs. In § 3.4, we optimize chemical structures for improved molecular properties. In § 3.5, we optimize treatment plans for prostate cancer patients.
3.1 Code optimization
Code optimization is a hallmark use case of instance optimization. Here, the goal is to optimize a piece of code to improve, for example, its correctness or runtime complexity. We often have a computation graph like the following:

where we optimize the Code to solve a given Problem with limited, local test supervision and self-evaluation through a test-time instruction that asks for a critique of the current iteration of the code. Figure 1e shows an example for the following problem: "You are given an array nums of size n consisting of distinct integers from 1 to n and a positive integer k. Return the number of non-empty subarrays in nums that have a median equal to k." The first solution proposed by gpt-4o does not pass the tests. TEXTGRAD identifies an edge case in the first solution and suggests how to improve it. The optimized implementation passes all tests.
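The loop described above (run local tests, ask for a critique, revise the code) can be sketched as follows. This is a minimal illustration under stated assumptions, not the TEXTGRAD implementation: `run_local_tests`, `refine_code`, `critic`, and `editor` are hypothetical names, the two LLM roles are modeled as plain callables from prompt text to text, and stopping as soon as the local tests pass is an assumption of this sketch.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical stand-in: an "LLM" is any callable mapping a prompt to text.
LLM = Callable[[str], str]
TestCase = Tuple[tuple, object]  # (positional arguments, expected result)


def run_local_tests(code: str, func_name: str,
                    tests: Sequence[TestCase]) -> List[str]:
    """Execute the candidate code and return one message per failing test.

    Note: exec() runs arbitrary code; a real harness would sandbox this.
    """
    namespace: dict = {}
    try:
        exec(code, namespace)
    except Exception as exc:
        return [f"code failed to load: {exc!r}"]
    func = namespace.get(func_name)
    if func is None:
        return [f"function {func_name} is not defined"]
    failures = []
    for args, expected in tests:
        try:
            got = func(*args)
        except Exception as exc:
            failures.append(f"{func_name}{args} raised {exc!r}")
            continue
        if got != expected:
            failures.append(
                f"{func_name}{args} returned {got!r}, expected {expected!r}")
    return failures


def refine_code(problem: str, code: str, func_name: str,
                tests: Sequence[TestCase], critic: LLM, editor: LLM,
                max_iters: int = 5) -> str:
    """Iteratively revise `code` using local test results and a critique."""
    for _ in range(max_iters):
        failures = run_local_tests(code, func_name, tests)
        if not failures:  # assumption: stop once the local tests pass
            break
        # The critique plays the role of textual feedback on the code.
        feedback = critic(
            f"Problem:\n{problem}\n\nCode:\n{code}\n\n"
            f"Local test results:\n" + "\n".join(failures) +
            "\n\nCritique the current iteration of the code.")
        # The editor applies the feedback and returns a new candidate.
        code = editor(
            f"Problem:\n{problem}\n\nCode:\n{code}\n\n"
            f"Feedback:\n{feedback}\n\nReturn the improved code only.")
    return code


# Usage with stub "LLMs" so the sketch runs without any API access.
buggy = "def add(a, b):\n    return a - b\n"
fixed = "def add(a, b):\n    return a + b\n"
result = refine_code("Add two numbers.", buggy, "add", [((1, 2), 3)],
                     critic=lambda prompt: "Use + instead of -.",
                     editor=lambda prompt: fixed)
print(run_local_tests(result, "add", [((1, 2), 3)]))
```

In practice, the stubs would be replaced by calls to a language model. The local tests here are only a small, visible subset and do not guarantee success on the hidden evaluation tests.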
Task: We use the LeetCode Hard dataset [26] to benchmark code optimization. LeetCode is an online platform that offers coding practice questions in preparation for technical interviews. The LeetCode Hard dataset contains hard coding problems that are meant to be challenging for both humans and language models. The success metric is Completion Rate, i.e., the fraction of problems for which the solution passes all test cases (GPT-4 reportedly achieved a 7% completion rate [26]). LeetCode test cases are not public, so after generation the code has to be submitted to the LeetCode platform for evaluation on the unseen test cases. This makes the platform better suited to evaluating the performance of language models.
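As a concrete reading of the Completion Rate metric, a problem counts as solved only if every one of its test cases passes. The sketch below assumes a hypothetical `results` mapping from problem identifier to per-test outcomes; the data shown is made up for illustration and is not from the paper.

```python
from typing import Dict, List


def completion_rate(results: Dict[str, List[bool]]) -> float:
    """Fraction of problems whose test cases all passed."""
    if not results:
        raise ValueError("results must contain at least one problem")
    # A problem with no recorded outcomes is treated as unsolved.
    solved = sum(1 for outcomes in results.values()
                 if outcomes and all(outcomes))
    return solved / len(results)


# Illustrative data only: two of the four problems pass every test.
example = {
    "problem-a": [True, True, True],
    "problem-b": [True, False, True],
    "problem-c": [True],
    "problem-d": [False, False],
}
print(completion_rate(example))  # 0.5
```

Partial credit is deliberately absent: passing most of the tests for a problem contributes nothing to the score.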
Baseline: Reflexion [26] is the state-of-the-art method on the LeetCode Hard dataset. This approach prompts an LLM to self-reflect on code snippets and on the errors generated at test time by candidate unit tests. Given the self-reflection, the LLM is prompted again to provide an updated piece of code, conditioned on the self-reflection and the errors. We ran Reflexion on LeetCode Hard with gpt-4o, using one in-context demonstration to guide the behavior (one-shot). In addition to Reflexion, we also ran a zero-shot baseline with gpt-4o, mimicking the zero-shot baseline described in [26]. In comparison, TEXTGRAD runs in a zero-shot setting, without any demonstrations.
Results: Existing results [26] showed a 7% pass rate for GPT-4 zero-shot and 15% for GPT-4 with Reflexion. We show that these results have now been boosted to 23% for gpt-4o zero-shot and 31% when Reflexion is used. With TEXTGRAD, we can optimize solutions to achieve a performance of 36%. These improvements are more impressive considering that Reflexion was run with an in-context demonstration, whereas TEXTGRAD did not use any demonstrations (i.e., it ran zero-shot).


Authors:
(1) Mert Yuksekgonul, Co-first author from Department of Computer Science, Stanford University;
(2) Federico Bianchi, Co-first author from Department of Computer Science, Stanford University;
(3) Joseph Boen, Co-first author from Department of Biomedical Data Science, Stanford University;
(4) Sheng Liu, Co-first author from Department of Biomedical Data Science, Stanford University;
(5) Zhi Huang, Co-first author from Department of Biomedical Data Science, Stanford University;
(6) Carlos Guestrin, Department of Computer Science, Stanford University and Chan Zuckerberg Biohub;
(7) James Zou, Department of Computer Science, Stanford University, Department of Biomedical Data Science, Stanford University, and Chan Zuckerberg Biohub.
This paper is
