Machine learning for predicting halogen radical reactivity toward aqueous organic chemicals
Journal of Hazardous Materials (IF 13.6), Pub Date: 2024-05-06, DOI: 10.1016/j.jhazmat.2024.134501
Youheng Liang, Xiaoliu Huangfu, Ruixing Huang, Zhenpeng Han, Sisi Wu, Jingrui Wang, Xinlong Long, Jun Ma, Qiang He

Rapid advances in machine learning (ML) provide fast, accurate, and widely applicable methods for predicting the reactivity of organic pollutants toward free radicals. In this study, the rate constants of four halogen radicals were predicted using Morgan fingerprints (MF) and Mordred descriptors (MD) in combination with a series of ML models. The findings showed that accurate prediction across datasets depended on an effective combination of descriptors and algorithms. To alleviate the challenge of limited sample size, we introduced a data combination strategy that merged the different datasets, which improved prediction accuracy and mitigated overfitting. The Light Gradient Boosting Machine (LightGBM) with MF and the Random Forest (RF) with MD, both based on the unified dataset, were selected as the optimal models. SHapley Additive exPlanations (SHAP) analysis showed that the MF-LightGBM model captured the influence of electron-withdrawing and electron-donating groups, while autocorrelation, walk count, and information content descriptors were identified as key features in the MD-RF model. The important contribution of pH was also emphasized. Applicability domain analysis further supported that the developed model can make reliable predictions for query compounds across a broader range. Finally, a practical web application for the calculations was built.
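
The abstract does not say which applicability-domain method was used. As a rough illustration of the idea only, the sketch below assumes a nearest-neighbour Tanimoto similarity check on fingerprint bit sets (for example, the on-bits of a Morgan fingerprint). The names `tanimoto` and `in_domain`, the toy bit sets, and the 0.3 threshold are all hypothetical and are not taken from the paper.

```python
from typing import FrozenSet, Iterable

# A fingerprint is represented here as the set of its "on" bit indices.
# (Assumption for illustration; the paper's own representation is not given.)
Fingerprint = FrozenSet[int]


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto (Jaccard) similarity between two sets of on-bits."""
    if not a and not b:
        return 1.0  # two empty fingerprints are treated as identical
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def in_domain(query: Fingerprint,
              training: Iterable[Fingerprint],
              threshold: float = 0.3) -> bool:
    """Return True if the query's nearest training neighbour is similar enough.

    The threshold is a hypothetical choice, not a value from the paper.
    """
    best = max((tanimoto(query, fp) for fp in training), default=0.0)
    return best >= threshold


# Toy training set of two fingerprints (made-up bit indices).
training_set = [frozenset({1, 2, 3, 4}), frozenset({2, 3, 5})]

similar_query = frozenset({1, 2, 3})    # shares 3 bits with the first entry
unrelated_query = frozenset({7, 8, 9})  # shares no bits with any entry

print(in_domain(similar_query, training_set))
print(in_domain(unrelated_query, training_set))
```

A query that falls outside the domain under such a check would be flagged as an extrapolation, so its predicted rate constant should be treated with more caution.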

Updated: 2024-05-06