Evolutionary Chemistry via LLM Agents: Multi-Objective SMILES Optimization

11 Jun 2026

3.4 Molecule optimization

TEXTGRAD supports a variety of optimization problems, including multi-objective optimization tasks commonly found in science and engineering applications. For example, in drug discovery, researchers seek to discover or design molecules that maximize a variety of objectives with regard to synthesizability, efficacy, and safety [53, 54]. To demonstrate TEXTGRAD’s applications to multi-objective optimization, we apply TEXTGRAD to drug molecule optimization, and show how our framework can interface with computational tools and optimize chemical structures to simultaneously improve their binding affinity and druglikeness.

Task: A critical consideration for potential drug molecules is their binding affinity, which represents the strength of the interactions between the molecule and its protein target. Drug designers seek molecules with high binding affinities to relevant drug targets, as such molecules require lower and less frequent doses to achieve efficacy. This affinity can be quantified by the free energy ∆G, which describes the ratio of probabilities between bound and unbound ligand-receptor pairs. ∆G can be estimated using “docking” simulations of protein-ligand binding [55, 56]. In our experiments, we employ the Vina score from the AutoDock Vina tool, a widely used physics-based docking simulator [57]. The more negative the Vina score, the greater the probability that the drug will bind to its intended target.
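
To make the link between free energy and binding probability concrete, the sketch below converts a free energy into a bound-to-unbound ratio using the standard Boltzmann relation, ratio = exp(-∆G / RT). The function name, the default temperature of 298.15 K, and the treatment of a docking score as if it were a true ∆G in kcal/mol are assumptions made for illustration; docking scores are only rough estimates of ∆G.

```python
import math

# Gas constant in kcal/(mol*K), matching the kcal/mol units of Vina scores.
R_KCAL_PER_MOL_K = 0.0019872


def bound_to_unbound_ratio(delta_g_kcal_per_mol: float,
                           temperature_k: float = 298.15) -> float:
    """Return exp(-dG / RT), the Boltzmann ratio of bound to unbound states.

    Hypothetical helper: it treats the input as an exact free energy,
    which a docking score only approximates.
    """
    rt = R_KCAL_PER_MOL_K * temperature_k
    return math.exp(-delta_g_kcal_per_mol / rt)


# Illustrative scores only; these are not values from the experiments.
for score in (-6.0, -8.0, -10.0):
    ratio = bound_to_unbound_ratio(score)
    print(f"{score:+.1f} kcal/mol -> bound/unbound ratio {ratio:.3g}")
```

Because the relation is exponential, each additional unit of negative free energy multiplies the ratio by a constant factor, which is why small differences in docking score can matter.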

Potential drug molecules are also evaluated by their druglikeness, which estimates how the molecule will behave in vivo, with respect to solubility, permeability, metabolic stability, and transporter effects. Molecules with high druglikeness are more likely to be absorbed by the body and reach their targets [58]. One popular metric for druglikeness is the Quantitative Estimate of Druglikeness (QED) score, a weighted composite metric of important chemical characteristics such as molecular weight, lipophilicity, and polar surface area, among others. The QED score ranges from 0 to 1, where 1 indicates high druglikeness [59].
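
As a rough illustration of what "weighted composite" means here, QED aggregates per-property desirability values d_i in (0, 1] through a weighted geometric mean, QED = exp(Σ w_i ln d_i / Σ w_i). The sketch below implements only that aggregation step. The function name, the desirability values, and the weights are placeholders chosen for this example; they are not the fitted values from [59] or from RDKit.

```python
import math
from typing import Mapping


def weighted_geometric_mean(desirabilities: Mapping[str, float],
                            weights: Mapping[str, float]) -> float:
    """Aggregate desirabilities in (0, 1] into one score in (0, 1]."""
    if set(desirabilities) != set(weights):
        raise ValueError("desirabilities and weights must share the same keys")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    log_sum = 0.0
    for name, d in desirabilities.items():
        if not 0.0 < d <= 1.0:
            raise ValueError(f"desirability for {name!r} must be in (0, 1]")
        log_sum += weights[name] * math.log(d)
    return math.exp(log_sum / total_weight)


# Placeholder inputs, invented for this example.
example_desirabilities = {
    "molecular_weight": 0.9,
    "lipophilicity": 0.7,
    "polar_surface_area": 0.8,
}
example_weights = {
    "molecular_weight": 1.0,
    "lipophilicity": 1.0,
    "polar_surface_area": 0.5,
}
print(round(weighted_geometric_mean(example_desirabilities, example_weights), 3))
```

A geometric mean penalizes a single poor property more strongly than an arithmetic mean would, so a molecule cannot score well by excelling on one property while failing another.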

Though there are many more considerations for successful molecules, in our experiments, we restrict our objectives to minimizing the Vina score and maximizing the QED, due to the relative maturity of these two metrics. The competing tradeoffs between these two metrics make the optimization task realistic and challenging. In particular, docking scores tend to prefer larger molecules with many functional groups that maximize interactions with a binding site [55, 56, 60]. In contrast, druglikeness encourages lighter, simpler molecules that have better absorption properties [58, 59]. Thus, simultaneously optimizing both objectives is non-trivial.
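
One common way to reason about two competing objectives is Pareto dominance: a candidate dominates another if it is no worse on both objectives and strictly better on at least one. The sketch below applies that idea to (Vina, QED) pairs. It is an illustration of the tradeoff, not a procedure described in this section, and the candidate names and scores are invented.

```python
from typing import List, NamedTuple


class Scored(NamedTuple):
    name: str
    vina: float  # lower (more negative) is better
    qed: float   # higher is better


def dominates(a: Scored, b: Scored) -> bool:
    """True if a is at least as good as b on both objectives and better on one."""
    no_worse = a.vina <= b.vina and a.qed >= b.qed
    strictly_better = a.vina < b.vina or a.qed > b.qed
    return no_worse and strictly_better


def pareto_front(candidates: List[Scored]) -> List[Scored]:
    """Keep the candidates that no other candidate dominates."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]


# Invented scores that mimic the size-versus-druglikeness tension.
candidates = [
    Scored("large", vina=-10.0, qed=0.3),
    Scored("balanced", vina=-8.0, qed=0.7),
    Scored("small", vina=-6.0, qed=0.9),
    Scored("poor", vina=-6.0, qed=0.3),
]
front = pareto_front(candidates)
print([c.name for c in front])
```

In this toy set, the large and small molecules both survive because each wins on one objective, which mirrors why neither score alone can rank candidates.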

Methods: We apply TEXTGRAD to drug molecule optimization by encoding molecules as SMILES strings and constructing a multi-objective loss from the Vina and QED scores. Namely, we perform instance optimization over SMILES strings, where the gradients generated by TEXTGRAD with respect to the multi-objective loss are used to update the text representing the molecule.
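
A minimal sketch of how the two scores might be combined into a single textual loss for an LLM to critique is given here. The function name, the wording of the message, and the example scores are assumptions for illustration only. The actual prompt is the one given in Appendix F.2, and this sketch does not call the TEXTGRAD library.

```python
def format_multiobjective_loss(smiles: str, vina_score: float,
                               qed_score: float) -> str:
    """Render both objectives as text so a language model can critique them.

    Hypothetical helper: the template below is invented for this example.
    """
    if not 0.0 <= qed_score <= 1.0:
        raise ValueError("QED must lie in [0, 1]")
    return (
        f"Molecule (SMILES): {smiles}\n"
        f"Vina score: {vina_score:.2f} kcal/mol (lower is better)\n"
        f"QED: {qed_score:.2f} (higher is better, range 0 to 1)\n"
        "Suggest structural changes that lower the Vina score "
        "while raising the QED."
    )


# Placeholder scores for a benzene ring; not measured values.
loss_text = format_multiobjective_loss("c1ccccc1", -4.5, 0.44)
print(loss_text)
```

Keeping both scores and their preferred directions in one message lets the feedback address the tradeoff directly instead of optimizing one objective at a time.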

We use gpt-4o as our LLM with the prompt text found in Appendix F.2. At each iteration, the current molecule is evaluated by estimating its binding affinity to the target protein using the Vina score from AutoDock Vina and its druglikeness using the QED score from RDKit (Appendix F.1). Each molecule is initialized as a small chemical fragment from a functional group. We apply TEXTGRAD to all 58 targets in the DOCKSTRING molecule evaluation benchmark [61]. These 58 targets consist of clinically relevant proteins sampled from a variety of structural classes, 29 of which have clinically approved drugs. For each target, we run TEXTGRAD for 10 iterations from each of 3 distinct initial fragments. To evaluate our performance, we compare the characteristics of the molecules generated by TEXTGRAD to clinically approved drugs for the respective protein (Appendix F.3).
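
The protocol above (3 initial fragments per target, 10 iterations each, scoring at every step) can be sketched as a plain loop with pluggable callbacks. Here `score_fn` and `propose_fn` are hypothetical stand-ins for the docking and QED evaluation and for the LLM-driven update. The toy callbacks exist only so the sketch runs without AutoDock Vina, RDKit, or an LLM, and they carry no chemical meaning.

```python
from typing import Callable, Dict, List, Tuple

Scores = Tuple[float, float]  # (vina, qed)
History = List[Tuple[str, Scores]]


def optimize_fragment(fragment: str,
                      score_fn: Callable[[str], Scores],
                      propose_fn: Callable[[str, Scores], str],
                      n_iterations: int = 10) -> History:
    """Score the current molecule, then ask for an updated one, n times."""
    history: History = []
    current = fragment
    for _ in range(n_iterations):
        scores = score_fn(current)
        history.append((current, scores))
        current = propose_fn(current, scores)
    # Score the final proposal so it also appears in the history.
    history.append((current, score_fn(current)))
    return history


def run_target(fragments: List[str],
               score_fn: Callable[[str], Scores],
               propose_fn: Callable[[str, Scores], str],
               n_iterations: int = 10) -> Dict[str, History]:
    """Run one independent optimization per starting fragment."""
    return {f: optimize_fragment(f, score_fn, propose_fn, n_iterations)
            for f in fragments}


def toy_score(smiles: str) -> Scores:
    # Toy stand-in: scores depend only on string length.
    return (-0.5 * len(smiles), min(1.0, 0.05 * len(smiles)))


def toy_propose(smiles: str, scores: Scores) -> str:
    # Toy stand-in for the LLM update: append one carbon atom.
    return smiles + "C"


results = run_target(["C", "N", "O"], toy_score, toy_propose)
for start, history in results.items():
    final_smiles, final_scores = history[-1]
    print(start, "->", final_smiles, final_scores)
```

Passing the scorer and the proposer as callbacks keeps the loop independent of any particular docking tool or model, so either can be swapped without touching the control flow.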

Results: For all 58 targets, TEXTGRAD consistently generates molecules with improved binding affinity and druglikeness irrespective of the initial starting fragment (Appendix F.4). For the 29 protein targets with clinically approved drugs, we observe that TEXTGRAD generates molecules with highly competitive affinity and druglikeness when compared to clinical molecules evaluated using the same loss function (Figure 2 (b)). The resulting molecules exhibit unique structures compared to their clinically approved counterparts and existing compounds (Appendix F.5), while maintaining similar in silico safety profiles (Appendix F.6). While there exist alternative machine learning methods for de novo molecular generation, TEXTGRAD offers two key advantages over its counterparts. Firstly, by combining traditional chemoinformatics tools with the general knowledge and reasoning capabilities of LLMs, TEXTGRAD produces competitive results even without a prior training set. Secondly, TEXTGRAD’s framework of natural language gradients produces explainable decisions, enabling researchers to understand precisely how and why a molecule’s structure was constructed. Together, these two characteristics suggest a promising future for the role of AI agents in scientific discovery.

Authors:

(1) Mert Yuksekgonul, Co-first author from Department of Computer Science, Stanford University ([email protected]);

(2) Federico Bianchi, Co-first author from Department of Computer Science, Stanford University ([email protected]);

(3) Joseph Boen, Co-first author from Department of Biomedical Data Science, Stanford University ([email protected]);

(4) Sheng Liu, Co-first author from Department of Biomedical Data Science, Stanford University ([email protected]);

(5) Zhi Huang, Co-first author from Department of Biomedical Data Science, Stanford University ([email protected]);

(6) Carlos Guestrin, Department of Computer Science, Stanford University and Chan Zuckerberg Biohub ([email protected]);

(7) James Zou, Department of Computer Science, Stanford University, Department of Biomedical Data Science, Stanford University, and Chan Zuckerberg Biohub ([email protected]).

This paper is available on arXiv under a CC BY 4.0 license.
