Response to “The perpetual motion machine of AI-generated data and the distraction of ChatGPT as a ‘scientist’”
Nature Biotechnology (IF 46.9), Pub Date: 2024-05-01, DOI: 10.1038/s41587-024-02230-2
William Stafford Noble

Many of Jennifer Listgarten’s arguments are compelling: in particular, that the protein folding problem is an outlier relative to other grand challenges in science, both in the precision with which the problem can be stated and performance measured and in the amount of available, high-quality data1. However, although existing biological databases tend to be small relative to the compendia used to train large language models, it seems plausible that one type of biological data, whole-genome sequencing, will soon be generated at massive scale, contrary to what was argued1. As genome sequencing costs fall and the potential for clinical use of genomic data grows, it will make economic sense to fully sequence everyone. Each 3-billion-base-pair individual genome can be represented as 30 million unique bases, so fully sequencing the US population of 300 million individuals yields a total of 9 × 10^15 bases, comparable in size to the 400-terabyte Common Crawl dataset used to train large language models. Using such data to train large-scale machine learning models will be challenging because of privacy considerations. Nonetheless, I see at least four paths by which such models could be built on massive genomic data.
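
As a rough check of this arithmetic, the short Python sketch below recomputes the total. The per-genome figure of 30 million representative bases and the population of 300 million are taken from the paragraph above; the byte-level storage estimates are illustrative assumptions added here, not figures from the original text.

```python
# Back-of-the-envelope check of the genome-scale arithmetic above.
bases_per_genome = 30_000_000   # unique bases representing one genome (from the text)
us_population = 300_000_000     # individuals to be sequenced (from the text)

total_bases = bases_per_genome * us_population
print(f"total bases: {total_bases:.1e}")          # 9.0e+15

# Illustrative storage estimates (assumptions, not from the text):
bytes_at_1_byte_per_base = total_bases            # naive 1 byte per base
bytes_at_2_bits_per_base = total_bases // 4       # packed 2-bit encoding
print(f"~{bytes_at_1_byte_per_base / 1e15:.2f} PB at 1 byte/base")
print(f"~{bytes_at_2_bits_per_base / 1e15:.2f} PB at 2 bits/base")
print("Common Crawl reference point from the text: 400 TB = 0.4 PB")
```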

The first path involves federated data access. A federated approach uses software to enable multiple databases to function as one, facilitating interoperability while maintaining autonomy and decentralization2. Federation capabilities are supported by existing genomic biobanks, such as the UK Biobank, NIH All of Us and Finland’s FinnGen initiative3, and are further facilitated by commercial entities such as lifebit.ai. In a federated approach, a deep learning model can be trained on data drawn from multiple biobanks while maintaining privacy guarantees, as sketched below.
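
To make the federated training idea concrete, the sketch below shows a minimal federated-averaging (FedAvg-style) loop in PyTorch. Everything here is a hypothetical skeleton: the model, the biobank data loaders and the usage example are placeholders, and a real deployment across biobanks would add secure aggregation and differential privacy on top of this pattern rather than relying on weight averaging alone.

```python
import copy
import torch
import torch.nn as nn

def local_update(global_model, data_loader, epochs=1, lr=1e-3):
    """Train a copy of the shared model on one biobank's local data.
    Only the resulting weights, never the raw genotypes, leave the site."""
    model = copy.deepcopy(global_model)
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.BCEWithLogitsLoss()
    model.train()
    for _ in range(epochs):
        for genotypes, phenotypes in data_loader:
            optimizer.zero_grad()
            loss = loss_fn(model(genotypes), phenotypes)
            loss.backward()
            optimizer.step()
    return model.state_dict()

def federated_round(global_model, biobank_loaders):
    """One communication round: each biobank trains locally on its own data,
    then the coordinator averages the returned weights (FedAvg)."""
    local_states = [local_update(global_model, loader) for loader in biobank_loaders]
    averaged = copy.deepcopy(local_states[0])
    for key in averaged:
        averaged[key] = torch.stack([state[key].float() for state in local_states]).mean(dim=0)
    global_model.load_state_dict(averaged)
    return global_model

# Hypothetical usage: a simple variant-to-phenotype predictor trained across
# several biobanks' loaders (placeholder names standing in for federated access).
# model = nn.Linear(num_variants, 1)
# for _ in range(num_rounds):
#     model = federated_round(model, [biobank_a_loader, biobank_b_loader, biobank_c_loader])
```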


