Test-Time Training for Matching-Based Video Object Segmentation
Juliette Bertrand1,2*
Giorgos Kordopatis-Zilos1*
Yannis Kalantidis2
Giorgos Tolias1
1VRG, FEE, Czech Technical University in Prague
2NAVER LABS Europe
*Equal contribution
Paper
GitHub
Dataset
Examples

Abstract

The video object segmentation (VOS) task involves the segmentation of an object over time based on a single initial mask. Current state-of-the-art approaches use a memory of previously processed frames and rely on matching to estimate segmentation masks of subsequent frames. Lacking any adaptation mechanism, such methods are prone to test-time distribution shifts. This work focuses on matching-based VOS under distribution shifts such as video corruptions, stylization, and sim-to-real transfer. We explore test-time training strategies that are agnostic to the specific task as well as strategies that are designed specifically for VOS. This includes a variant based on MCC tailored to matching-based VOS methods. The experimental results on common benchmarks demonstrate that the proposed test-time training yields significant improvements in performance. In particular for the sim-to-real scenario and despite using only a single test video, our approach manages to recover a substantial portion of the performance gain achieved through training on real videos. Additionally, we introduce DAVIS-C, an augmented version of the popular DAVIS test set, featuring extreme distribution shifts like image-video-level corruptions and stylizations. Our results illustrate that test-time training enhances performance even in these challenging cases.


The DAVIS-C Dataset

To model distribution shifts during the testing, we process the set of test videos of the original DAVIS dataset to create the DAVIS-C dataset. This newly created test set offers a test bed for measuring robustness to corrupted input and generalization to previously unseen appearance. It is the outcome of processing the original videos to introduce image-level and video-level corruption and image-level appearance modification. We perform 14 different transformations to each video, each one applied at three different strengths, namely low, medium, and high strength, making it a total of 42 different variants of DAVIS.

The 14 types of corruption in DAVIS-C at “medium” strength



Qualitative results

Sim-to-real transfer

RGB frames
Groundtruth mask
STCN
tt-MCC

Corrupted test examples

RGB frames
Groundtruth mask
STCN
tt-MCC

More qualitative examples comparing the proposed method (ttMCC) on top of STCN can be found here.

Paper and Supplementary Material

J. Bertrand, G. Kordopatis Zilos, Y. Kalantidis, G. Tolias.
Test-Time Training for Matching-Based Video Object Segmentation.
(hosted on Openreview)




Acknowledgements

This work was supported in part by Naver Labs Europe, by Junior Star GACR GM 21-28830M, and by student grant SGS23/173/OHK3/3T/13. This template was originally made by Phillip Isola and Richard Zhang for a colorful ECCV project; the code can be found here.