Agentic Molecular Design: Generative Chemistry via TextGrad

23 Jun 2026

Abstract and 1. Introduction

  2. TEXTGRAD: Optimizing AI systems by backpropagating text feedback

  3. Results

    3.1 Code optimization

    3.2 Solution optimization by test-time training to improve problem solving

    3.3 Prompt optimization for reasoning

    3.4 Molecule optimization

    3.5 Radiotherapy treatment plan optimization

  4. Related work

  5. Discussion, Acknowledgements, and References

A. TEXTGRAD Details

B. Optimizer Extensions

C. Code Optimization

D. Solution Optimization

E. Prompt Optimization

F. Molecule Optimization

G. Treatment Plan Optimization

F Molecule Optimization

F.1 Docking and Druglikeness Evaluation

To optimize molecules, we evaluate the binding affinity and druglikeness of chemical structures encoded as SMILES strings. To compute both metrics, the generated SMILES string is first converted into an octet-complete Lewis dot structure using RDKit’s MolFromSmiles function. This method “sanitizes” molecules by adding explicit hydrogens, kekulizing aromatic rings, standardizing valence states, and assigning radicals [100]. If this sanitization process fails at any step, whether because of a structural ambiguity or an invalid molecule, the QED and Vina scores are replaced with a single string informing TEXTGRAD that the molecule is invalid.

If the generated SMILES string does represent a valid chemical structure, we compute the QED score using RDKit’s Chem.QED module. While QED scores can be quickly and reliably computed, docking scores can vary significantly depending on target and ligand structure preparation. To ensure consistency, we calculate Vina scores using the DOCKSTRING package [61], which implements a standardized docking workflow and provides a set of gold-standard targets. In particular, after the ligand has been sanitized by RDKit, DOCKSTRING (de)protonates it at pH 7.4 using Open Babel, and prepares and refines the 3D geometric structure using the Euclidean distance geometry algorithm ETKDG and the classical force field MMFF94. Finally, Gasteiger charges are computed for all ligand atoms, and the resulting structure is saved as a ligand PDBQT file that is passed to AutoDock Vina. For target preparation, the DOCKSTRING benchmark suite provides 58 curated crystal structures of clinically relevant proteins, with the majority at less than 2.5 Å resolution. These structures are specially prepared to improve correlations between theoretical and experimental binding affinities, for example by manual addition of polar hydrogens and removal of residual water and solute molecules. DOCKSTRING also standardizes simulation parameters such as numerical seeds, search box coordinates, and sampling exhaustiveness, to ensure reproducible scoring.
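The evaluation flow described in F.1 can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the wording of the invalid-molecule message, and the shape of the returned dictionary are assumptions. The scorers are passed in as callables so the control flow can be read on its own; `rdkit_dockstring_backends` shows one plausible way to wire in RDKit and DOCKSTRING, assuming both packages are installed.

```python
from typing import Callable, Dict, Optional, Union

# Hypothetical wording; the text only says that a single string reports invalidity.
INVALID_MESSAGE = "Invalid molecule: the SMILES string could not be sanitized."


def evaluate_smiles(
    smiles: str,
    parse: Callable[[str], Optional[object]],
    qed_fn: Callable[[object], float],
    dock_fn: Callable[[str], float],
) -> Union[Dict[str, float], str]:
    """Return QED and Vina scores, or one message if the molecule is invalid."""
    mol = parse(smiles)  # expected to return None when parsing/sanitization fails
    if mol is None:
        return INVALID_MESSAGE
    return {"qed": qed_fn(mol), "vina": dock_fn(smiles)}


def rdkit_dockstring_backends(target_name: str):
    """Build real scorers. Imports are local so the sketch loads without them."""
    from rdkit import Chem
    from rdkit.Chem import QED
    from dockstring import load_target

    target = load_target(target_name)

    def dock(smiles: str) -> float:
        # DOCKSTRING handles protonation, embedding, and charges internally.
        score, _aux = target.dock(smiles)
        return score

    # Chem.MolFromSmiles sanitizes by default and returns None on failure.
    return Chem.MolFromSmiles, QED.qed, dock
```

With the real backends, a call would look like `evaluate_smiles(smiles, *rdkit_dockstring_backends(name))`, where `name` is one of the DOCKSTRING target names.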

F.2 Objective Functions

Once the scores have been computed, they are passed to an LLM in the following format.

Note that this prompt allows us to specify both the target name and a prioritization between these two objectives. When optimizing promising molecules in late-stage drug discovery, medicinal chemists use detailed structural knowledge about a protein target’s binding pocket in addition to docking scores, for example through interactive 3D molecular visualization software. To preserve the generality of TEXTGRAD in our experiments, we do not attempt to provide similar geometric information, as not all LLMs support multimodal inputs. However, we do include the protein_name in the loss prompt to inject supplementary structural information.

F.3 Benchmarks

To benchmark the performance of TEXTGRAD-generated molecules, we compare their characteristics to clinically approved drugs for the same protein targets found in DrugBank, a database of 16,619 drugs [103]. To ensure that the DrugBank molecules were both comparable and high quality, we filtered for drugs that were small molecules, had full clinical approval, and were designed for orthosteric binding with the same active site in the DOCKSTRING benchmark suite. After applying these filtering criteria, we identified 118 drugs targeting 29 of the 58 DOCKSTRING proteins. When evaluating binding affinity and druglikeness for the clinically approved drugs, we compute the Vina and QED scores using the same tools and workflow described in section F.1, exactly as applied to the TEXTGRAD molecules.
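The three filtering criteria can be expressed as a simple predicate over drug records. The sketch below is illustrative only: the `DrugRecord` fields are hypothetical names, not DrugBank's schema, and the example records in the usage note are placeholders rather than real drugs.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass(frozen=True)
class DrugRecord:
    # All field names are hypothetical; they mirror the criteria in the text.
    name: str
    target: str
    is_small_molecule: bool
    fully_approved: bool
    binds_dockstring_site: bool  # orthosteric, same active site as the DOCKSTRING structure


def filter_benchmark_drugs(
    records: Iterable[DrugRecord], dockstring_targets: Set[str]
) -> List[DrugRecord]:
    """Keep drugs that satisfy every criterion and hit a DOCKSTRING target."""
    return [
        r
        for r in records
        if r.target in dockstring_targets
        and r.is_small_molecule
        and r.fully_approved
        and r.binds_dockstring_site
    ]
```

For example, given placeholder records `DrugRecord("drug_a", "target_x", True, True, True)` and `DrugRecord("drug_b", "target_x", False, True, True)`, only `drug_a` would survive the filter for the target set `{"target_x"}`.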

F.4 Initialization

Supplementary Figure 1: Molecule Initialization. We initialize TEXTGRAD with fragment molecules from three diverse functional groups and optimize each initial molecule for 10 iterations for all 58 targets in the DOCKSTRING benchmark suite. For each fragment and each protein, we perform post-selection on the generated molecules using the summary score outlined in equation 25 and visualize the resulting distribution. We observe that while the initial QED and Vina score distributions of the starting fragments vary greatly, the distributions of the optimized molecules overlap substantially.

In practice, molecular optimization is typically accelerated by large-scale pre-optimization screening, where libraries of millions or even billions of existing chemical structures are scored and ranked by docking, druglikeness, and other metrics. Only the most promising structures, termed "leads", are further refined by medicinal chemists [104]. While TEXTGRAD is capable of optimizing any initial molecule, in this work, to more accurately characterize its performance and avoid biasing its designs towards existing drugs, we instead select our three initial molecules from simple fragments of common functional groups, shown in Figure 1. These fragments are highly diverse and belong to different functional groups. Although these initial fragments have differing druglikeness and binding affinity characteristics, the impact of the starting fragment on TEXTGRAD’s performance is minimal, and TEXTGRAD is capable of generating molecules with high druglikeness and binding affinity from all three fragments.

F.5 Chemical Novelty

Supplementary Figure 2: Structural Novelty. In panel (a), we observe increasing novelty over optimization updates, where a generated molecule is considered “novel” if no molecule in ChEMBL has a Tanimoto similarity score to it greater than 0.8. In panels (b,c,d), we estimate substructure similarity using the Tversky metric between TEXTGRAD molecules and clinically approved drugs for each target. While there is a wide range of similarity scores, we observe that similarity with known drugs has little correlation with molecule performance, suggesting that TEXTGRAD’s generation process is weakly influenced by “memorized” knowledge of clinically approved structures.

One of the primary concerns with using LLMs in scientific discovery is their capacity to simply memorize and regurgitate their training sets instead of performing logical reasoning [105]. Since recent models like gpt-4o are trained on massive, opaque datasets, a key concern is that the molecules TEXTGRAD generates may simply be duplicates of clinically approved molecules for their respective targets. To test for and quantify this “memorization” effect, we compare the molecules that TEXTGRAD generates with approved drugs for their respective protein targets, as well as with unrelated but existing druglike chemical compounds.

We first compare TEXTGRAD-generated molecules for a particular protein target with all the clinically approved small molecules found in DrugBank for the same target (see F.3 for filtering criteria) using the Tversky similarity score on RDKit Daylight chemical fingerprints. The Tversky similarity score is not symmetric, and compares substructures between a variant and reference molecule [106]. In our application, this is preferred as it assigns a high score to generated (variant) molecules that contain most or all of the substructures in the DrugBank (reference) molecule, even if there exist other extraneous substructures in the generated molecule that reduce the symmetric overlap between the two molecules. A Tversky score of 1.0 indicates that the TEXTGRAD molecule is a complete subset of the DrugBank molecule, while 0.0 indicates no overlap.
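To make the asymmetry concrete, the sketch below computes the Tversky index on fingerprints represented as plain Python sets of "on" bit positions. The weights `alpha` and `beta` are not stated in this section, so the values used here are assumptions chosen only to show how the weighting determines which containment direction scores 1.0. For real fingerprints, RDKit provides `DataStructs.TverskySimilarity`; the set-based version is used here to keep the arithmetic visible.

```python
from typing import Set


def tversky(reference: Set[int], variant: Set[int], alpha: float, beta: float) -> float:
    """Tversky index: common / (common + alpha*|ref only| + beta*|variant only|)."""
    common = len(reference & variant)
    only_reference = len(reference - variant)
    only_variant = len(variant - reference)
    denominator = common + alpha * only_reference + beta * only_variant
    return common / denominator if denominator else 0.0


# Toy bit sets (illustrative, not real fingerprints): the variant contains
# every reference feature plus two extra ones.
reference_bits = {1, 2, 3}
variant_bits = {1, 2, 3, 4, 5}

# Penalize only missing reference features: extras in the variant are free.
contains_reference = tversky(reference_bits, variant_bits, alpha=1.0, beta=0.0)  # 1.0

# Penalize only extra variant features: the score drops to 3/5.
within_reference = tversky(reference_bits, variant_bits, alpha=0.0, beta=1.0)  # 0.6

# Equal weights of 1.0 recover the symmetric Tanimoto coefficient.
symmetric = tversky(reference_bits, variant_bits, alpha=1.0, beta=1.0)  # 0.6
```

Swapping the reference and variant arguments changes the result whenever `alpha` differs from `beta`, which is why the text fixes the DrugBank molecule as the reference.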

To analyze the most relevant chemical structures, we restricted our analysis to a set of high-performing TEXTGRAD molecules, rather than all molecules from all iterations. In particular, using the overall score described in Equation 25, we selected the best molecule for each protein target and for each initial fragment, for a total of 87 generated molecules (29 druggable targets × 3 initial fragments) and 118 clinically approved drugs across 29 targets. For each target, we compute all pairwise Tversky scores between the 3 TEXTGRAD molecules and the DrugBank molecules approved for that target, setting the DrugBank molecule as the reference, and the TEXTGRAD molecule as the variant.
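The post-selection step amounts to a group-by over (target, fragment) pairs. In the sketch below, the record keys and the `overall_score` field are hypothetical, the score is assumed to be precomputed (Equation 25 is not reproduced in this appendix section), and the direction of "best" is exposed as a parameter because it depends on how that score is defined.

```python
from typing import Dict, Iterable, Tuple


def select_best(
    candidates: Iterable[dict], higher_is_better: bool = True
) -> Dict[Tuple[str, str], dict]:
    """Keep one candidate per (target, fragment) pair, judged by overall_score."""
    best: Dict[Tuple[str, str], dict] = {}
    for cand in candidates:
        key = (cand["target"], cand["fragment"])
        current = best.get(key)
        if current is None:
            best[key] = cand
            continue
        improved = (
            cand["overall_score"] > current["overall_score"]
            if higher_is_better
            else cand["overall_score"] < current["overall_score"]
        )
        if improved:
            best[key] = cand
    return best


# Placeholder data: 29 targets x 3 fragments x 10 iterations, synthetic scores.
candidates = [
    {"target": f"target_{t}", "fragment": f"fragment_{f}", "iteration": i,
     "overall_score": float(i)}
    for t in range(29) for f in range(3) for i in range(10)
]
selected = select_best(candidates)
```

Here `len(selected)` is 87, matching the 29 × 3 count in the text; the pairwise Tversky comparison would then run over the three selected molecules for each target.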

We observed that the distribution of Tversky similarity scores was quite broad, with a median of 0.42 but ranging from 0.14 to 0.90 (Figure 2 (b)). However, we observe that Tversky similarity between TEXTGRAD and DrugBank molecules is actually slightly anti-correlated with molecule performance as measured by the overall score, suggesting that TEXTGRAD’s optimization procedure is at best weakly influenced by prior knowledge of approved drugs. In fact, for a variety of generated molecules along a gradient of similarity scores, the generated molecules exhibit QED and Vina scores that match or even exceed their DrugBank counterparts (Figure 2 (d)).
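One way to check for such an anti-correlation is a correlation coefficient between per-molecule similarity and overall score. The text does not say which statistic was used, so the Pearson coefficient below is an assumption, and the function name is illustrative.

```python
from math import sqrt
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return covariance / sqrt(var_x * var_y)
```

A negative value of `pearson(similarities, overall_scores)` over the selected molecules would correspond to the anti-correlation described above, with a magnitude near zero indicating a weak relationship.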

Beyond the similarity to known drugs, we are also interested in whether TEXTGRAD is generating novel molecules or discovering previously unknown properties in existing compounds, for example a strong binding affinity to a protein target in a compound that has not been associated with that protein. To answer this question, we perform a Tanimoto similarity search across all of ChEMBL, a manually curated database of 2.4 million bioactive molecules with drug-like properties [107]. Unlike the Tversky score, the Tanimoto metric measures the symmetric overlap between two chemical structures, where 1.0 indicates an exact match, and 0.0 no overlap. We classify a TEXTGRAD compound as “novel” if and only if there does not exist any molecule in ChEMBL with a Tanimoto similarity score over 0.80.
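The novelty criterion reduces to a threshold test over a reference library. The sketch below again uses sets of "on" bit positions as stand-in fingerprints; the function names are illustrative. For a library the size of ChEMBL, a bulk routine such as RDKit's `DataStructs.BulkTanimotoSimilarity` would be more practical than this linear scan.

```python
from typing import Iterable, Set


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Symmetric overlap: |a & b| / |a | b|."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def is_novel(candidate: Set[int], library: Iterable[Set[int]], threshold: float = 0.80) -> bool:
    """Novel iff no library molecule has Tanimoto similarity above the threshold."""
    return all(tanimoto(candidate, known) <= threshold for known in library)


# Toy bit sets (illustrative only).
candidate = {1, 2, 3, 4, 5}
near_duplicate = {1, 2, 3, 4, 5, 6}   # 5/6 overlap, above 0.80
distant = {1, 2, 3, 4, 9, 10}          # 4/7 overlap, below 0.80
```

With these toy sets, `is_novel(candidate, [distant])` is true, while adding `near_duplicate` to the library makes it false.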

While the starting fragments are known molecules, we observe that as the number of iterations increases, TEXTGRAD generates molecules that are progressively less likely to be previously known compounds. By the 6th iteration, 95% of all the molecules generated by TEXTGRAD across all 58 targets and 3 starting fragments are novel using the criteria above (Figure 2 (a)). By observing the trajectories of generated molecules and analyzing the textual gradients, we hypothesize that the feedback provided by the Vina and QED scores encourages TEXTGRAD to explore chemical space beyond known molecules by progressively adding and removing combinations of functional groups. Since these chemical updates can form a combinatorial number of unique structures, it is reasonable that TEXTGRAD would reach previously unexplored regions of chemical space in a relatively small number of iterations.

F.6 Implicit Objectives

Another key concern in applying LLMs in science is their propensity for hallucinations, where models generate factually incorrect or illogical responses while attempting to satisfy user requests [105]. In our setting, these hallucinations could manifest as TEXTGRAD proposing invalid, toxic, or otherwise undesirable molecules in order to optimize its objective function. We control for severe hallucinations by preprocessing molecules using RDKit sanitization, ensuring at the bare minimum that TEXTGRAD generates chemically valid molecules. However, this simple preprocessing step does not completely specify desirable chemical behavior. A direct strategy would be to exhaustively encode all possible metrics for desirability in drug molecules beyond druglikeness and binding affinity into the objective function, extending it to include synthesizability and toxicity, among other criteria. Unfortunately, this approach is not realistically feasible, as not all criteria for desirability have mature computational metrics. Thus, a key question is whether TEXTGRAD obeys so-called implicit objectives during its optimization process that curtail illogical or undesirable behavior.

To evaluate the extent or existence of undesirable molecules, we characterize the harmfulness of the generated molecules, focusing on mutagenicity and clinical toxicity. Mutagenicity refers to the ability of a drug to induce genetic alterations, which may lead to DNA damage and harmful long-term effects. Clinical toxicity refers to a broad range of adverse short- or long-term side effects. Importantly, neither druglikeness nor binding affinity is strongly correlated with these criteria, and thus these desirability metrics are not directly optimized or otherwise explicitly encoded in TEXTGRAD’s objectives.

Supplementary Figure 3: Safety Properties. We evaluate the predicted harmfulness of TEXTGRAD molecules using the ADMET-AI model and compare them to clinically approved drugs. For both mutagenicity and clinical toxicity, 1.0 indicates a high likelihood of harm and 0.0 a low likelihood. We observe that despite the fact that neither of these characteristics is directly encoded into TEXTGRAD’s objective function, TEXTGRAD implicitly avoids proposing harmful molecules.

To evaluate the propensity for mutagenicity and clinical toxicity, we employ the ADMET-AI model, which predicts these scores from the chemical structures of molecules [108]. ADMET-AI employs a deep learning model trained on multiple relevant datasets. In particular, for mutagenicity, ADMET-AI is trained on 7,255 drugs from the Ames dataset, a bacterial reverse mutation assay for rapidly screening large numbers of compounds that can induce genetic damage and frameshift mutations. A predicted label of 1.0 indicates a high likelihood that the drug will induce mutagenesis, while a label of 0.0 indicates a low likelihood. For clinical toxicity, ADMET-AI is trained on the ClinTox dataset, a dataset of 1,484 drugs consisting of molecules that have failed clinical trials for toxicity reasons and also drugs that are associated with successful trials. Similarly, a predicted label of 1.0 indicates a high likelihood of clinical toxicity, while a label of 0.0 indicates a low likelihood.

Once again, we restrict our analysis to the best-performing generated molecules with druggable targets as measured by the overall score, and select the best molecule for each protein target and for each initial fragment, for a total of 87 generated molecules. We then compare their predicted mutagenicity and clinical toxicity to those of the 118 clinically approved molecules from DrugBank. We observe that for both mutagenicity and clinical toxicity, the molecules generated by TEXTGRAD have predicted distributions that indicate a low likelihood of harmful effects, and closely match the distributions of the clinically approved molecules. Together, these results suggest that TEXTGRAD implicitly avoids proposing harmful molecules, even though these criteria are not directly encoded in its loss function.
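A comparison of this kind can be summarized with a few distribution statistics per group. The sketch below assumes the predicted probabilities have already been obtained (for example from ADMET-AI) and are supplied as plain lists; the function name, the choice of statistics, and the 0.5 cutoff are illustrative assumptions rather than the analysis used for Supplementary Figure 3.

```python
from statistics import median
from typing import Dict, Sequence


def summarize_risk(probabilities: Sequence[float], cutoff: float = 0.5) -> Dict[str, float]:
    """Median predicted probability and the share of molecules at or above a cutoff."""
    if not probabilities:
        raise ValueError("need at least one prediction")
    flagged = sum(1 for p in probabilities if p >= cutoff)
    return {
        "median": median(probabilities),
        "fraction_flagged": flagged / len(probabilities),
    }
```

Calling `summarize_risk` once on the predictions for the 87 generated molecules and once on those for the 118 approved drugs, for each of the two endpoints, gives four summaries that can be compared side by side.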

Authors:

(1) Mert Yuksekgonul, Co-first author from Department of Computer Science, Stanford University;

(2) Federico Bianchi, Co-first author from Department of Computer Science, Stanford University;

(3) Joseph Boen, Co-first author from Department of Biomedical Data Science, Stanford University;

(4) Sheng Liu, Co-first author from Department of Biomedical Data Science, Stanford University;

(5) Zhi Huang, Co-first author from Department of Biomedical Data Science, Stanford University;

(6) Carlos Guestrin, Department of Computer Science, Stanford University and Chan Zuckerberg Biohub;

(7) James Zou, Department of Computer Science, Stanford University, Department of Biomedical Data Science, Stanford University, and Chan Zuckerberg Biohub.


This paper is available on arXiv under a CC BY 4.0 license.