LP-SLAM: language-perceptive RGB-D SLAM framework exploiting large language model

Complex & Intelligent Systems (IF 5.8), Pub Date: 2024-04-30, DOI: 10.1007/s40747-024-01408-0
Weiyi Zhang, Yushi Guo, Liting Niu, Peijun Li, Zeyu Wan, Fei Shao, Cheng Nian, Fasih Ud Din Farrukh, Debing Zhang, Chun Zhang, Qiang Li, Jianwei Zhang

With the development of deep learning, higher levels of environmental perception, such as the semantic level, have become achievable in simultaneous localization and mapping (SLAM). However, previous works have not reached a natural-language level of perception. We therefore propose LP-SLAM (Language-Perceptive RGB-D SLAM), a framework that leverages large language models (LLMs). Text in the scene is detected by scene text recognition (STR) and, after task-driven selection, mapped as landmarks. A text error correction chain (TECC), built from a similarity classification method, a two-stage memory strategy, and a text clustering method, handles STR mis-detection and mis-recognition and supplies accurate text information to the framework. The framework takes input images and generates a 3D map containing a sparse point cloud and task-related text. Finally, a natural user interface (NUI) built on the constructed map and the LLM gives position instructions in response to users' natural-language queries. Experimental results validate both the TECC design and the overall framework. We release the virtual dataset with ground truth, together with the source code, for further research: https://github.com/GroupOfLPSLAM/LP_SLAM.
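The abstract does not say how TECC's similarity classification or text clustering is implemented. As a rough illustration only, the sketch below assumes (a) a normalized Levenshtein similarity between a raw STR string and a hypothetical task vocabulary, with a hypothetical acceptance threshold of 0.6, and (b) greedy radius-based merging of same-label 3D observations into a single landmark. The names `classify_text`, `cluster_observations`, and `TextLandmark` are hypothetical and this is not the authors' code (the linked repository holds that); the two-stage memory strategy is not modeled here.

```python
from dataclasses import dataclass
from typing import Optional


def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Normalized similarity in [0, 1]; 1.0 means identical strings."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


def classify_text(raw: str, vocabulary: list[str],
                  threshold: float = 0.6) -> Optional[str]:
    """Hypothetical: map a raw STR string to the closest task keyword.

    Returns None when no keyword is similar enough, i.e. the detection is
    treated as a mis-detection and discarded. Assumes a non-empty vocabulary.
    """
    raw = raw.strip().upper()
    best = max(vocabulary, key=lambda word: similarity(raw, word.upper()))
    return best if similarity(raw, best.upper()) >= threshold else None


@dataclass
class TextLandmark:
    label: str
    position: tuple[float, float, float]
    count: int = 1


def cluster_observations(observations, radius: float = 0.5) -> list[TextLandmark]:
    """Hypothetical: greedily merge same-label observations within `radius` (meters assumed)."""
    landmarks: list[TextLandmark] = []
    for label, pos in observations:
        for lm in landmarks:
            dist = sum((p - q) ** 2 for p, q in zip(pos, lm.position)) ** 0.5
            if lm.label == label and dist <= radius:
                # Update the landmark position as a running mean of its observations.
                n = lm.count
                lm.position = tuple((q * n + p) / (n + 1)
                                    for p, q in zip(pos, lm.position))
                lm.count += 1
                break
        else:
            # No matching landmark nearby: start a new one.
            landmarks.append(TextLandmark(label, tuple(pos)))
    return landmarks
```

For example, with a vocabulary of `["EXIT", "OFFICE 101", "ELEVATOR"]`, a misread `"EX1T"` scores 0.75 against `"EXIT"` and is corrected, while noise such as `"qwzz"` matches nothing and is rejected. Repeated sightings of the same sign then collapse into one landmark instead of cluttering the map.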

Updated: 2024-04-30