Matching Compound Prototypes for Few-Shot Action Recognition
International Journal of Computer Vision (IF 19.5), Pub Date: 2024-04-29, DOI: 10.1007/s11263-024-02017-7
Yifei Huang, Lijin Yang, Guo Chen, Hongjie Zhang, Feng Lu, Yoichi Sato

The task of few-shot action recognition aims to recognize novel action classes using only a small number of labeled training samples. How to better describe the action in each video and how to compare the similarity between videos are two of the most critical factors in this task. Describing a video globally, or frame by frame, cannot well represent the spatiotemporal dependencies within an action. On the other hand, naively matching the global representations of two videos is also suboptimal, since an action can happen at different locations in a video and at different speeds. In this work, we propose a novel approach that describes each video using multiple types of prototypes and then computes video similarity with a matching strategy tailored to each type of prototype. To better model spatiotemporal dependencies, we describe a video by generating prototypes that capture multi-level spatiotemporal relations via transformers. There are three types of prototypes in total. The first type is trained to describe a specific aspect of the action in the video (e.g., the start of the action) regardless of its timestamp; these prototypes are matched one-to-one between two videos to compare their similarity. The second type consists of timestamp-centered prototypes that are trained to focus on specific timestamps of the video; to handle the temporal variation of actions within a video, we apply bipartite matching so that prototypes of different timestamps can be matched. The third type is generated from the timestamp-centered prototypes, regularizing their temporal consistency while serving as an auxiliary summarization of the whole video. Experiments demonstrate that our proposed method achieves state-of-the-art results on multiple benchmarks.
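As a concrete illustration, below is a minimal Python sketch (not the authors' implementation) of the two matching strategies the abstract describes: one-to-one matching for the action-aspect prototypes, and bipartite matching for the timestamp-centered prototypes. Cosine similarity and the Hungarian algorithm (scipy.optimize.linear_sum_assignment) stand in for the paper's learned similarity and matching; the prototype counts and feature dimension are hypothetical.

```python
# Sketch of the two prototype-matching strategies described above.
# Assumptions (not from the paper): cosine similarity as the similarity
# measure, the Hungarian algorithm for bipartite matching, and
# illustrative prototype counts / feature dimensions.
import numpy as np
from scipy.optimize import linear_sum_assignment


def cosine_sim(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between rows of a (m, d) and b (n, d)."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T


def match_aspect_prototypes(query: np.ndarray, support: np.ndarray) -> float:
    """Type 1: each prototype is bound to a fixed aspect of the action
    (e.g., its start), so the i-th query prototype is compared directly
    with the i-th support prototype (one-to-one matching)."""
    return float(np.diag(cosine_sim(query, support)).mean())


def match_timestamp_prototypes(query: np.ndarray, support: np.ndarray) -> float:
    """Type 2: timestamp-centered prototypes may correspond to different
    timestamps across videos, so we solve a bipartite assignment that
    maximizes the total similarity between the two prototype sets."""
    sim = cosine_sim(query, support)
    rows, cols = linear_sum_assignment(-sim)  # negate to maximize similarity
    return float(sim[rows, cols].mean())


# Toy usage with random features standing in for transformer outputs.
rng = np.random.default_rng(0)
q_aspect, s_aspect = rng.normal(size=(4, 256)), rng.normal(size=(4, 256))
q_time, s_time = rng.normal(size=(8, 256)), rng.normal(size=(8, 256))
video_similarity = match_aspect_prototypes(q_aspect, s_aspect) \
                 + match_timestamp_prototypes(q_time, s_time)
print(f"combined similarity: {video_similarity:.4f}")
```

In the paper, the final video similarity would also incorporate the third, summary-level prototypes; here the two matching scores are simply summed for illustration.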