ORDerly: Data Sets and Benchmarks for Chemical Reaction Data, Journal of Chemical Information and Modeling

ORDerly: Data Sets and Benchmarks for Chemical Reaction Data
Journal of Chemical Information and Modeling (IF 5.6), Pub Date: 2024-04-22, DOI: 10.1021/acs.jcim.4c00292
Daniel S. Wigh, Joe Arrowsmith, Alexander Pomberger, Kobi C. Felton, Alexei A. Lapkin

Machine learning has the potential to provide tremendous value to life sciences by providing models that aid in the discovery of new molecules and reduce the time for new products to come to market. Chemical reactions play a significant role in these fields, but there is a lack of high-quality open-source chemical reaction data sets for training machine learning models. Herein, we present ORDerly, an open-source Python package for the customizable and reproducible preparation of reaction data stored in accordance with the increasingly popular Open Reaction Database (ORD) schema. We use ORDerly to clean United States patent data stored in ORD and generate data sets for forward prediction, retrosynthesis, as well as the first benchmark for reaction condition prediction. We train neural networks on data sets generated with ORDerly for condition prediction and show that data sets missing key cleaning steps can lead to silently overinflated performance metrics. Additionally, we train transformers for forward and retrosynthesis prediction and demonstrate how non-patent data can be used to evaluate model generalization. By providing a customizable open-source solution for cleaning and preparing large chemical reaction data, ORDerly is poised to push forward the boundaries of machine learning applications in chemistry.

Updated: 2024-04-22
