Monocular Video Segmentation with Transformers

Sep 7, 2024

Moving Object Segmentation for Monocular Video with Transformers

Presenter: Christian Hohmeyer, University of Heidelberg

Introduction

  • Research conducted at Bosch Research Germany.
  • Focus on accurately detecting and segmenting autonomous agents in dynamic environments.
  • Challenge: from monocular video alone, it is difficult to segment unknown, previously unseen objects.

Problem Statement

  • Moving object segmentation is posed as a binary (moving vs. static) instance segmentation problem.
  • Generic detection may include object classes not seen during training.
  • Non-rigid motion complicates segmentation (e.g., only a part of a person moving).

Approach

  • Utilize both reconstruction and recognition (similar to the two-stream hypothesis in neuroscience).
  • Introduce pseudo-modalities such as optical flow and 3D scene flow to improve segmentation (standard definitions are sketched after this list).
  • Combine motion and appearance features for robust segmentation.
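
For reference, the two pseudo-modalities can be written as displacement fields. These are the standard definitions, not notation taken from the talk:

```latex
% Optical flow: 2D displacement of a pixel x between frames t and t+1
\mathbf{u}(\mathbf{x}_t) = \mathbf{x}_{t+1} - \mathbf{x}_t \in \mathbb{R}^{2}
% Scene flow: 3D displacement of the corresponding scene point X
\mathbf{s}(\mathbf{X}_t) = \mathbf{X}_{t+1} - \mathbf{X}_t \in \mathbb{R}^{3}
```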

Literature Review

  • Motion segmentation is a long-standing problem in computer vision.
  • Previous works often focused on single modalities (e.g., optical flow) or fixed data combinations using CNNs.
  • Datasets vary widely in motion diversity, semantic classes, and degenerate cases, complicating comparisons.

Experiment Design

  • Systematic comparison of motion representations, from 2D to 3D, using a transformer architecture.
  • Compute 2D optical flow or 3D scene flow for motion representation.
  • Infer single-image depth maps so that 3D scene flow can be computed across multiple frames (a sketch of this lifting step follows the list).
  • Input both appearance and motion data into a state-of-the-art transformer for segmentation.
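
The lifting from 2D optical flow plus per-frame depth to 3D scene flow can be sketched as follows. This is a minimal illustration, assuming known pinhole intrinsics K, nearest-neighbour depth sampling, and no ego-motion compensation or occlusion handling; function names are illustrative and not taken from the paper:

```python
import numpy as np

def backproject(depth, K):
    """Lift a depth map (H, W) into 3D camera coordinates (H, W, 3) using intrinsics K."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    x = (u - K[0, 2]) * depth / K[0, 0]
    y = (v - K[1, 2]) * depth / K[1, 1]
    return np.stack([x, y, depth], axis=-1)

def scene_flow_from_flow_and_depth(flow, depth_t, depth_t1, K):
    """Approximate per-pixel 3D scene flow from 2D optical flow and two depth maps.

    flow:     (H, W, 2) forward optical flow from frame t to t+1
    depth_t:  (H, W) monocular depth at frame t
    depth_t1: (H, W) monocular depth at frame t+1
    K:        (3, 3) camera intrinsics
    """
    H, W = depth_t.shape
    pts_t = backproject(depth_t, K)  # 3D points at time t

    # Follow the flow to the matched pixel in frame t+1 and sample its depth
    # (nearest neighbour for simplicity).
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    u1 = np.clip(np.round(u + flow[..., 0]).astype(int), 0, W - 1)
    v1 = np.clip(np.round(v + flow[..., 1]).astype(int), 0, H - 1)
    depth_warped = depth_t1[v1, u1]

    # Back-project the matched pixel at time t+1 and take the 3D difference.
    x1 = (u + flow[..., 0] - K[0, 2]) * depth_warped / K[0, 0]
    y1 = (v + flow[..., 1] - K[1, 2]) * depth_warped / K[1, 1]
    pts_t1 = np.stack([x1, y1, depth_warped], axis=-1)

    return pts_t1 - pts_t  # (H, W, 3) scene flow
```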

Transformer Architecture

  • Extend transformer architecture to handle multiple modalities.
  • Use a two-stream architecture with separate weights for the motion and appearance branches.
  • Freeze the appearance branch after pre-training on COCO for general object appearance knowledge.
  • Features are combined via an attention mechanism across multiple scales of the feature pyramid (a minimal fusion sketch follows this list).
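
Below is a minimal sketch of how appearance and motion features might be fused with attention at one pyramid scale. Module and parameter names are illustrative assumptions, not the paper's actual implementation; the same fusion would be applied at every scale of the pyramid:

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Fuse appearance and motion features at one pyramid scale via cross-attention.

    Appearance tokens act as queries; motion tokens provide keys and values, so
    the (frozen) appearance features are enriched with motion evidence.
    """
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, app_feat: torch.Tensor, mot_feat: torch.Tensor) -> torch.Tensor:
        # app_feat, mot_feat: (B, C, H, W) feature maps at the same scale
        B, C, H, W = app_feat.shape
        q = app_feat.flatten(2).transpose(1, 2)   # (B, H*W, C) queries from appearance
        kv = mot_feat.flatten(2).transpose(1, 2)  # (B, H*W, C) keys/values from motion
        fused, _ = self.attn(q, kv, kv)
        fused = self.norm(q + fused)              # residual connection
        return fused.transpose(1, 2).reshape(B, C, H, W)

# Applied independently per scale, e.g.:
# fused_pyramid = [fuse(a, m) for fuse, a, m in zip(fusion_blocks, app_pyramid, mot_pyramid)]
```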

Alignment Problem

  • With multiple information sources, multimodal models face an alignment problem: the network can learn to rely on one modality and ignore the others.
  • Negative examples were introduced to force the model to take the motion maps into account for the final class labels (a hypothetical construction is sketched below).
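
One plausible way to construct such negative examples is sketched below. This is an assumption about the general idea, not the authors' exact recipe: the appearance input is kept unchanged, the motion input is replaced by a map with no object motion, and the targets are relabelled as static, so a model that ignores the motion stream is penalised:

```python
import torch

def make_negative_example(image, motion_map, moving_mask):
    """Build a negative training example (hypothetical recipe, not the paper's exact one).

    Keep the appearance input, pair it with a motion map that contains no object
    motion, and relabel everything as static. Appearance alone then no longer
    predicts the correct moving/static label, forcing the model to use motion.
    """
    static_motion = torch.zeros_like(motion_map)  # motion map with no apparent object motion
    static_mask = torch.zeros_like(moving_mask)   # every instance relabelled as static
    return image, static_motion, static_mask
```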

Initial Experiments

  • Analyzed motion representation effects on segmentation performance.
  • 2D optical flow is robust but fails in degenerate motion cases (e.g., objects moving colinearly with the camera).
  • 3D scene flow offers a more expressive representation but depends on the quality of the depth maps.
  • Higher-dimensional handcrafted motion costs offer a middle ground but still fall short of appearance-based detector performance.

Findings

  • A pure motion representation is insufficient for high-precision segmentation.
  • Non-rigid motion patterns must be grouped into whole objects for the model to perform well.
  • Using multiple modalities can enhance performance over single modalities.
  • Diverse training datasets are necessary to achieve state-of-the-art performance.

Model Behavior

  • The models behave consistently: when certain motion patterns are missing from the training data, performance on them degrades.
  • Strong performance on KITTI is achieved using appearance and optical flow, even though MIX2 contains no real driving data.
  • Scene flow performs better when high-quality depth is available; optical flow is more effective on in-the-wild datasets such as DAVIS.

Conclusion

  • Appearance data drives detection and segmentation, while motion data helps classify objects as moving or static.
  • Robust training on diverse data improves resilience to out-of-distribution data (e.g., camouflaged animals).
  • Attention patterns and gradients differ between the streams: the appearance stream attends more locally, while the motion stream acts more globally for the moving/static classification.
  • Despite improvements, failure cases remain likely due to data limitations rather than architectural issues.

Acknowledgments

  • Thank you for your attention!