Few-shot action recognition, i.e., recognizing new action classes given only a few
examples, benefits from incorporating temporal information. Prior work either
encodes such information in the representation itself and learns classifiers at test
time, or obtains frame-level features and performs pairwise temporal matching at
test time.
We first evaluate a number of matching-based approaches using features from
spatio-temporal backbones, a comparison missing from the literature, and show that
the performance gap between simple baselines and more sophisticated methods narrows
significantly. Inspired by this, we propose Chamfer++, a non-temporal
matching function that achieves state-of-the-art results in few-shot action
recognition. We show that, when starting from temporal features, our parameter-free
and interpretable approach can outperform all other matching-based and classifier-based
methods for one-shot action recognition on three common datasets without using
temporal information in the matching stage.
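
To make the idea of non-temporal matching concrete, here is a minimal sketch of a Chamfer-style matching score between a query clip and a support clip, assuming cosine similarity over L2-normalized per-frame features; the function name and the symmetric averaging are illustrative assumptions and not necessarily the exact Chamfer++ formulation.

```python
import torch
import torch.nn.functional as F

def chamfer_match_score(query_feats: torch.Tensor,
                        support_feats: torch.Tensor) -> torch.Tensor:
    """Chamfer-style (order-free) matching score between two clips.

    query_feats:   (n_q, d) frame-level features of the query clip
    support_feats: (n_s, d) frame-level features of a support clip
    Returns a scalar similarity; no temporal ordering is used.
    """
    q = F.normalize(query_feats, dim=-1)
    s = F.normalize(support_feats, dim=-1)
    sim = q @ s.t()                        # (n_q, n_s) cosine similarities
    # Each query frame is matched to its best support frame, and vice versa.
    q_to_s = sim.max(dim=1).values.mean()
    s_to_q = sim.max(dim=0).values.mean()
    return 0.5 * (q_to_s + s_to_q)         # symmetric, parameter-free score
```

In a one-shot episode, such a score would be computed between the query clip and each class's support clip, with the query assigned to the highest-scoring class; because the score involves only pairwise frame similarities and a max-mean reduction, it is parameter-free and easy to interpret.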