Few-shot action recognition, i.e., recognizing new action classes given only a few
examples, benefits from incorporating temporal information. Prior work either
encodes such information in the representation itself and learns classifiers at test
time, or obtains frame-level features and performs pairwise temporal matching at
test time.
We first evaluate a number of matching-based approaches using features from
spatio-temporal backbones, a comparison missing from the literature, and show that
the performance gap between simple baselines and more sophisticated methods narrows
significantly. Inspired by this, we propose Chamfer++, a non-temporal
matching function that achieves state-of-the-art results in few-shot action
recognition. We show that, when starting from temporal features, our parameter-free
and interpretable approach can outperform all other matching-based and classifier-based
methods for one-shot action recognition on three common datasets without using
temporal information in the matching stage.
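
To make the idea of non-temporal matching concrete, here is a minimal sketch of a Chamfer-style matching score between a query clip and a support clip, assuming cosine similarity over L2-normalized per-frame features; the function name and the symmetric averaging are illustrative assumptions and not necessarily the exact Chamfer++ formulation.

```python
import torch
import torch.nn.functional as F

def chamfer_match_score(query_feats: torch.Tensor,
                        support_feats: torch.Tensor) -> torch.Tensor:
    """Chamfer-style (order-free) matching score between two clips.

    query_feats:   (n_q, d) frame-level features of the query clip
    support_feats: (n_s, d) frame-level features of a support clip
    Returns a scalar similarity; no temporal ordering is used.
    """
    q = F.normalize(query_feats, dim=-1)
    s = F.normalize(support_feats, dim=-1)
    sim = q @ s.t()                        # (n_q, n_s) cosine similarities
    # Each query frame is matched to its best support frame, and vice versa.
    q_to_s = sim.max(dim=1).values.mean()
    s_to_q = sim.max(dim=0).values.mean()
    return 0.5 * (q_to_s + s_to_q)         # symmetric, parameter-free score
```

In a one-shot episode, such a score would be computed between the query clip and each class's support clip, with the query assigned to the highest-scoring class; because the score involves only pairwise frame similarities and a max-mean reduction, it is parameter-free and easy to interpret.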