Monocular Video Segmentation with Transformers

Sep 7, 2024

Moving Object Segmentation for Monocular Video with Transformers

Presenter: Christian Hohmeyer, University of Heidelberg

Introduction

  • Research conducted at Bosch Research Germany.
  • Focus on accurately detecting and segmenting autonomous agents in dynamic environments.
  • Challenge: from monocular video alone, it is difficult to segment unknown, previously unseen objects.

Problem Statement

  • Moving object segmentation is posed as a binary (moving vs. static) instance segmentation problem.
  • Generic detection may include object classes not seen during training.
  • Non-rigid motion complicates segmentation (e.g., only a part of a person moving).

Approach

  • Utilize both reconstruction and recognition (similar to the two-stream hypothesis in neuroscience).
  • Introduce pseudo-modalities such as optical flow and 3D scene flow to improve segmentation (standard definitions are sketched after this list).
  • Combine motion and appearance features for robust segmentation.
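
For reference, the two pseudo-modalities can be written as displacement fields. These are the standard definitions, not notation taken from the talk:

```latex
% Optical flow: 2D displacement of a pixel x between frames t and t+1
\mathbf{u}(\mathbf{x}_t) = \mathbf{x}_{t+1} - \mathbf{x}_t \in \mathbb{R}^{2}
% Scene flow: 3D displacement of the corresponding scene point X
\mathbf{s}(\mathbf{X}_t) = \mathbf{X}_{t+1} - \mathbf{X}_t \in \mathbb{R}^{3}
```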

Literature Review

  • Motion segmentation is a long-standing problem in computer vision.
  • Previous works often focused on single modalities (e.g., optical flow) or fixed data combinations using CNNs.
  • Datasets vary widely in motion diversity, semantic classes, and degenerate cases, complicating comparisons.

Experiment Design

  • Systematic comparison of motion representations, from 2D to 3D, using a transformer architecture.
  • Compute 2D optical flow or 3D scene flow for motion representation.
  • Infer single-image depth maps so that 3D scene flow can be computed across multiple frames (a sketch of this lifting step follows the list).
  • Input both appearance and motion data into a state-of-the-art transformer for segmentation.
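
The lifting from 2D optical flow plus per-frame depth to 3D scene flow can be sketched as follows. This is a minimal illustration, assuming known pinhole intrinsics K, nearest-neighbour depth sampling, and no ego-motion compensation or occlusion handling; function names are illustrative and not taken from the paper:

```python
import numpy as np

def backproject(depth, K):
    """Lift a depth map (H, W) into 3D camera coordinates (H, W, 3) using intrinsics K."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    x = (u - K[0, 2]) * depth / K[0, 0]
    y = (v - K[1, 2]) * depth / K[1, 1]
    return np.stack([x, y, depth], axis=-1)

def scene_flow_from_flow_and_depth(flow, depth_t, depth_t1, K):
    """Approximate per-pixel 3D scene flow from 2D optical flow and two depth maps.

    flow:     (H, W, 2) forward optical flow from frame t to t+1
    depth_t:  (H, W) monocular depth at frame t
    depth_t1: (H, W) monocular depth at frame t+1
    K:        (3, 3) camera intrinsics
    """
    H, W = depth_t.shape
    pts_t = backproject(depth_t, K)  # 3D points at time t

    # Follow the flow to the matched pixel in frame t+1 and sample its depth
    # (nearest neighbour for simplicity).
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    u1 = np.clip(np.round(u + flow[..., 0]).astype(int), 0, W - 1)
    v1 = np.clip(np.round(v + flow[..., 1]).astype(int), 0, H - 1)
    depth_warped = depth_t1[v1, u1]

    # Back-project the matched pixel at time t+1 and take the 3D difference.
    x1 = (u + flow[..., 0] - K[0, 2]) * depth_warped / K[0, 0]
    y1 = (v + flow[..., 1] - K[1, 2]) * depth_warped / K[1, 1]
    pts_t1 = np.stack([x1, y1, depth_warped], axis=-1)

    return pts_t1 - pts_t  # (H, W, 3) scene flow
```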

Transformer Architecture

  • Extend transformer architecture to handle multiple modalities.
  • Use a two-stream architecture with separate weights for the motion and appearance branches.
  • Freeze the appearance branch after pre-training on COCO for general object appearance knowledge.
  • Features are combined via an attention mechanism across multiple scales of the feature pyramid (a minimal fusion sketch follows this list).
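
Below is a minimal sketch of how appearance and motion features might be fused with attention at one pyramid scale. Module and parameter names are illustrative assumptions, not the paper's actual implementation; the same fusion would be applied at every scale of the pyramid:

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Fuse appearance and motion features at one pyramid scale via cross-attention.

    Appearance tokens act as queries; motion tokens provide keys and values, so
    the (frozen) appearance features are enriched with motion evidence.
    """
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, app_feat: torch.Tensor, mot_feat: torch.Tensor) -> torch.Tensor:
        # app_feat, mot_feat: (B, C, H, W) feature maps at the same scale
        B, C, H, W = app_feat.shape
        q = app_feat.flatten(2).transpose(1, 2)   # (B, H*W, C) queries from appearance
        kv = mot_feat.flatten(2).transpose(1, 2)  # (B, H*W, C) keys/values from motion
        fused, _ = self.attn(q, kv, kv)
        fused = self.norm(q + fused)              # residual connection
        return fused.transpose(1, 2).reshape(B, C, H, W)

# Applied independently per scale, e.g.:
# fused_pyramid = [fuse(a, m) for fuse, a, m in zip(fusion_blocks, app_pyramid, mot_pyramid)]
```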

Alignment Problem

  • With multiple information sources, multimodal models face an alignment problem: the network can learn to rely on one modality and ignore the others.
  • Negative examples were introduced to force the model to take the motion maps into account for the final class labels (a hypothetical construction is sketched below).
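
One plausible way to construct such negative examples is sketched below. This is an assumption about the general idea, not the authors' exact recipe: the appearance input is kept unchanged, the motion input is replaced by a map with no object motion, and the targets are relabelled as static, so a model that ignores the motion stream is penalised:

```python
import torch

def make_negative_example(image, motion_map, moving_mask):
    """Build a negative training example (hypothetical recipe, not the paper's exact one).

    Keep the appearance input, pair it with a motion map that contains no object
    motion, and relabel everything as static. Appearance alone then no longer
    predicts the correct moving/static label, forcing the model to use motion.
    """
    static_motion = torch.zeros_like(motion_map)  # motion map with no apparent object motion
    static_mask = torch.zeros_like(moving_mask)   # every instance relabelled as static
    return image, static_motion, static_mask
```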

Initial Experiments

  • Analyzed motion representation effects on segmentation performance.
  • 2D optical flow is robust but fails in degenerate motion cases (e.g., objects moving colinearly with the camera).
  • 3D scene flow offers a more expressive representation but depends on the quality of the depth maps.
  • Higher-dimensional handcrafted motion costs offer a middle ground but still fall short of appearance-based detector performance.

Findings

  • A pure motion representation is insufficient for high-precision segmentation.
  • Non-rigid motion patterns must be grouped into whole objects for the model to perform well.
  • Using multiple modalities can enhance performance over single modalities.
  • Diverse training datasets are necessary to achieve state-of-the-art performance.

Model Behavior

  • The models behave consistently: when certain motion patterns are missing from the training data, performance on them degrades.
  • Strong performance on KITTI is achieved using appearance and optical flow, even though MIX2 contains no real driving data.
  • Scene flow performs better when high-quality depth is available; optical flow is more effective on in-the-wild datasets such as DAVIS.

Conclusion

  • Appearance data drives detection and segmentation, while motion data helps classify objects as moving or static.
  • Robust training on diverse data improves resilience to out-of-distribution data (e.g., camouflaged animals).
  • Attention patterns and gradients differ between the streams: the appearance stream attends more locally, while the motion stream acts more globally for the moving/static classification.
  • Despite improvements, failure cases remain likely due to data limitations rather than architectural issues.

Acknowledgments

  • Thank you for your attention!