Scaling Up Multi-domain Semantic Segmentation with Sentence Embeddings
International Journal of Computer Vision (IF 19.5) Pub Date: 2024-05-01, DOI: 10.1007/s11263-024-02060-4
Wei Yin, Yifan Liu, Chunhua Shen, Baichuan Sun, Anton van den Hengel

State-of-the-art semantic segmentation methods have achieved impressive performance on predefined, closed-set individual datasets, but their generalization to zero-shot domains and unseen categories is limited. Because labeling a large-scale dataset is challenging and expensive, training a robust semantic segmentation model on multiple domains has drawn much attention. However, inconsistent taxonomies hinder the naive merging of currently available public annotations. To address this, we propose a simple solution for scaling up multi-domain semantic segmentation datasets with less human effort. We replace each class label with a sentence embedding: a vector-valued embedding of a sentence describing the class. This approach enables the merging of multiple datasets from different domains, each with varying class labels and semantics. We merged publicly available noisy and weak annotations with the most finely annotated data, over 2 million images in total, which enables training a model that matches the performance of state-of-the-art supervised methods on 7 benchmark datasets, despite not using any images from them. Instead of manually tuning a consistent label space, we utilized a vector-valued embedding of short paragraphs to describe the classes. By fine-tuning the model on standard semantic segmentation datasets, we also achieve a significant improvement over state-of-the-art supervised segmentation on NYUD-V2 (Silberman et al., in: European conference on computer vision, Springer, pp 746–760, 2012) and PASCAL-Context (Everingham et al. in Int J Comput Vis 111(1):98–136, 2015), reaching 60% and 65% mIoU, respectively. Our method can segment unseen labels based on the closeness of language embeddings, showing strong generalization to unseen image domains and labels. Additionally, it yields impressive performance improvements in several adaptation applications, such as depth estimation and instance segmentation.
Code is available at https://github.com/YvanYin/SSIW.
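The core idea of predicting labels by closeness of language embeddings can be illustrated with a minimal sketch. Assuming the model produces a per-pixel feature map and each class is represented by the sentence embedding of its description, a label map is obtained by assigning each pixel the class with the highest cosine similarity. The function name and toy embeddings below are hypothetical, not the authors' implementation:

```python
import numpy as np

def assign_labels(pixel_embeds, class_embeds):
    """Assign each pixel the class whose sentence embedding is closest
    in cosine similarity.
    pixel_embeds: (H, W, D) per-pixel features; class_embeds: (C, D)."""
    # L2-normalize so the dot product equals cosine similarity.
    p = pixel_embeds / np.linalg.norm(pixel_embeds, axis=-1, keepdims=True)
    c = class_embeds / np.linalg.norm(class_embeds, axis=-1, keepdims=True)
    sim = p @ c.T          # (H, W, C) similarity to every class
    return sim.argmax(-1)  # (H, W) label map

# Toy example: 3 classes with hypothetical 4-D sentence embeddings.
class_embeds = np.eye(3, 4)
pixel_embeds = np.tile(class_embeds[1], (2, 2, 1))  # all pixels match class 1
labels = assign_labels(pixel_embeds, class_embeds)
```

Because unseen classes only require a new sentence embedding, no retraining of the label space is needed to add them at test time.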




Updated: 2024-05-01