SegViT v2: Exploring Efficient and Continual Semantic Segmentation with Plain Vision Transformers, International Journal of Computer Vision


International Journal of Computer Vision (IF 19.5), Pub Date: 2024-04-01, DOI: 10.1007/s11263-023-01894-8
Bowen Zhang, Liyang Liu, Minh Hieu Phan, Zhi Tian, Chunhua Shen, Yifan Liu

Abstract

This paper investigates the capability of plain Vision Transformers (ViTs) for semantic segmentation using the encoder–decoder framework and introduces SegViTv2. In this study, we introduce a novel Attention-to-Mask (ATM) module to design a lightweight decoder that is effective for plain ViTs. The proposed ATM converts the global attention map into semantic masks for high-quality segmentation results. Our decoder outperforms the popular UPerNet decoder across various ViT backbones while consuming only about 5% of the computational cost. For the encoder, we address the relatively high computational cost of ViT-based encoders and propose a Shrunk++ structure that incorporates edge-aware query-based down-sampling (EQD) and query-based up-sampling (QU) modules. The Shrunk++ structure reduces the computational cost of the encoder by up to 50% while maintaining competitive performance. Furthermore, we adapt SegViT for continual semantic segmentation, demonstrating nearly zero forgetting of previously learned knowledge. Experiments show that the proposed SegViTv2 surpasses recent segmentation methods on three popular benchmarks: ADE20k, COCO-Stuff-10k, and PASCAL-Context. The code is available at https://github.com/zbwxp/SegVit.
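
The Attention-to-Mask idea mentioned in the abstract can be illustrated with a minimal sketch. It assumes details the abstract does not state: that each class query is compared with the patch tokens by scaled dot product, that a sigmoid turns the similarity into a soft mask, and that masks are weighted by per-query class probabilities. The names `attention_to_mask` and `combine_masks` and the toy inputs are hypothetical; this is not the authors' implementation, which is in the linked repository.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def attention_to_mask(queries, patch_tokens):
    """Return one soft mask per class query (hypothetical helper).

    queries:      N lists of length d, one query per class
    patch_tokens: L lists of length d, encoder outputs for L patches
    """
    d = len(queries[0])
    masks = []
    for q in queries:
        # Scaled dot-product similarity between this query and every patch.
        sims = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                for k in patch_tokens]
        # Assumption: a sigmoid maps similarities to mask values in (0, 1).
        masks.append([sigmoid(s) for s in sims])
    return masks


def combine_masks(masks, class_probs):
    """Weight masks by class probabilities and pick the best class per patch."""
    num_classes = len(class_probs[0])
    num_patches = len(masks[0])
    labels = []
    for p in range(num_patches):
        scores = [sum(class_probs[n][c] * masks[n][p] for n in range(len(masks)))
                  for c in range(num_classes)]
        labels.append(scores.index(max(scores)))
    return labels


# Toy inputs: 2 class queries, 3 patches, feature dimension 2.
queries = [[1.0, 0.0], [0.0, 1.0]]
patch_tokens = [[2.0, 0.0], [0.0, 2.0], [3.0, 1.0]]
class_probs = [[1.0, 0.0], [0.0, 1.0]]  # each query votes for its own class

masks = attention_to_mask(queries, patch_tokens)
labels = combine_masks(masks, class_probs)
print(labels)
```

In this toy setup each patch is assigned to the class whose query it resembles most, which is the intuition behind reading segmentation masks directly from query-to-patch similarity.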


Updated: 2024-03-27
