Multimodal data fusion using sparse canonical correlation analysis and cooperative learning: a COVID-19 cohort study
npj Digital Medicine (IF 15.2), Pub Date: 2024-05-07, DOI: 10.1038/s41746-024-01128-2
Ahmet Gorkem Er, Daisy Yi Ding, Berrin Er, Mertcan Uzun, Mehmet Cakmak, Christoph Sadee, Gamze Durhan, Mustafa Nasuh Ozmen, Mine Durusu Tanriover, Arzu Topeli, Yesim Aydin Son, Robert Tibshirani, Serhat Unal, Olivier Gevaert

Through technological innovations, patient cohorts can be examined from multiple views with high-dimensional, multiscale biomedical data to classify clinical phenotypes and predict outcomes. Here, we present our approach for analyzing multimodal data using unsupervised and supervised sparse linear methods in a COVID-19 patient cohort. This prospective cohort study of 149 adult patients was conducted in a tertiary care academic center. First, we used sparse canonical correlation analysis (CCA) to identify and quantify relationships across different data modalities, including viral genome sequencing, imaging, clinical data, and laboratory results. Then, we used cooperative learning to predict the clinical outcome of COVID-19 patients: intensive care unit admission. We show that serum biomarkers representing severe disease and acute phase response correlate with original and wavelet radiomics features in the LLL frequency channel (cor(Xu_1, Zv_1) = 0.596, p value < 0.001). Among radiomics features, histogram-based first-order features reporting the skewness, kurtosis, and uniformity have the lowest (negative) coefficients, whereas entropy-related features have the highest (positive) coefficients. Moreover, unsupervised analysis of clinical data and laboratory results gives insights into distinct clinical phenotypes. Leveraging the availability of global viral genome databases, we demonstrate that the Word2Vec natural language processing model can be used for viral genome encoding. It not only separates major SARS-CoV-2 variants but also preserves the phylogenetic relationships among them. Our quadruple model using Word2Vec encoding achieves better prediction results in the supervised task. The model yields area under the curve (AUC) and accuracy values of 0.87 and 0.77, respectively. Our study illustrates that sparse CCA and cooperative learning are powerful techniques for handling high-dimensional, multimodal data to investigate multivariate associations in unsupervised and supervised tasks.
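
The abstract does not say how genome sequences were turned into "words" for Word2Vec. One common approach, assumed here purely for illustration, is to split each sequence into overlapping k-mers and treat every genome as a sentence of k-mer tokens. The sketch below shows only that tokenization step; the function names, the choice of k = 3, and the toy sequences are hypothetical and not taken from the study.

```python
from collections import Counter


def kmer_tokens(sequence, k=3, stride=1):
    """Split a nucleotide sequence into k-mers, one 'word' per window.

    With stride=1 the windows overlap; with stride=k they do not.
    Windows that would run past the end of the sequence are dropped.
    """
    seq = sequence.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


def kmer_counts(sentences):
    """Count how often each k-mer occurs across all tokenized genomes."""
    return Counter(token for sentence in sentences for token in sentence)


# Toy sequences (hypothetical, far shorter than a real viral genome).
genomes = ["ACGTAC", "acgtTT"]

# Each genome becomes one "sentence" of k-mer tokens.
sentences = [kmer_tokens(g, k=3) for g in genomes]
counts = kmer_counts(sentences)
```

A list of token lists like `sentences` is the usual input shape for a Word2Vec implementation; how per-k-mer vectors are then aggregated into a single genome-level embedding is a separate design choice that the abstract does not specify.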




Updated: 2024-05-07