Invalid SMILES are beneficial rather than detrimental to chemical language models, Nature Machine Intelligence

Nature Machine Intelligence (IF 23.8), Pub Date: 2024-03-29, DOI: 10.1038/s42256-024-00821-x
Michael A. Skinnider

Generative machine learning models have attracted intense interest for their ability to sample novel molecules with desired chemical or biological properties. Among these, language models trained on SMILES (Simplified Molecular-Input Line-Entry System) representations have been subject to the most extensive experimental validation and have been widely adopted. However, these models have what is perceived to be a major limitation: some fraction of the SMILES strings that they generate are invalid, meaning that they cannot be decoded to a chemical structure. This perceived shortcoming has motivated a remarkably broad spectrum of work designed to mitigate the generation of invalid SMILES or correct them post hoc. Here I provide causal evidence that the ability to produce invalid outputs is not harmful but is instead beneficial to chemical language models. I show that the generation of invalid outputs provides a self-corrective mechanism that filters low-likelihood samples from the language model output. Conversely, enforcing valid outputs produces structural biases in the generated molecules, impairing distribution learning and limiting generalization to unseen chemical space. Together, these results refute the prevailing assumption that invalid SMILES are a shortcoming of chemical language models and reframe them as a feature, not a bug.
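
As a rough illustration of what "invalid" means here and of the post hoc filtering the abstract describes, the sketch below discards generated strings that fail a toy syntactic check. The functions `looks_decodable` and `filter_valid` and the sample strings are hypothetical and are not taken from the paper. The check only tests for balanced branches and paired single-digit ring closures; a real pipeline would parse each string with a cheminformatics toolkit, which also enforces valence and aromaticity rules.

```python
def looks_decodable(smiles: str) -> bool:
    """Toy syntactic check (an assumption, not a real SMILES parser).

    Ignores valence, aromaticity, bracket atoms such as [13C],
    and two-digit (%nn) ring closures.
    """
    if not smiles:
        return False
    depth = 0           # current branch nesting level
    open_rings = set()  # ring-closure digits seen an odd number of times
    for ch in smiles:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False  # branch closed before it was opened
        elif ch.isdigit():
            # First use of a digit opens a ring, second use closes it.
            open_rings ^= {ch}
    return depth == 0 and not open_rings


def filter_valid(samples):
    """Keep only strings that pass the toy check, mimicking post hoc filtering."""
    return [s for s in samples if looks_decodable(s)]


# Made-up sample strings: the last two have an unclosed ring and an unclosed branch.
samples = ["CCO", "c1ccccc1", "CC(=O)O", "C1CCCC", "CC(C"]
print(filter_valid(samples))
```

The abstract's argument is that this discard step is not merely cleanup: the strings it removes tend to be low-likelihood samples, so the filter acts as a quality control on the model output.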

Updated: 2024-03-30
