Monocular Video Segmentation with Transformers
Sep 7, 2024
Moving Object Segmentation for Monocular Video with Transformers
Presenter: Christian Hohmeyer, University of Heidelberg
Introduction
Research conducted at Bosch Research Germany.
Focus on accurately detecting and segmenting autonomous agents in dynamic environments.
Challenge: from monocular video alone, it is difficult to segment objects of unknown classes.
Problem Statement
Moving object segmentation is framed as a binary instance segmentation problem: each detected object is labeled as moving or static (sketched below).
Generic detection may include object classes not seen during training.
Non-rigid motion complicates segmentation (e.g., only a part of a person moving).
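A minimal sketch of this binary framing, assuming instance masks plus a per-instance moving flag as the data layout (both names are hypothetical); the semantic class is discarded and only the moving/static decision is kept per instance:

```python
# Minimal sketch: moving object segmentation as binary instance segmentation.
# The data layout (instance masks + per-instance moving flags) is an assumption.
import numpy as np

def moving_instance_targets(instance_masks, moving_flags):
    """instance_masks: list of HxW boolean arrays, one per object.
    moving_flags: list of bools, True if that object moves.
    Returns only the masks of moving objects, i.e. the binary targets."""
    return [m for m, is_moving in zip(instance_masks, moving_flags) if is_moving]

# Toy example: two objects, only the second one is moving.
h, w = 4, 6
static_obj = np.zeros((h, w), dtype=bool); static_obj[:2, :2] = True
moving_obj = np.zeros((h, w), dtype=bool); moving_obj[2:, 3:] = True
targets = moving_instance_targets([static_obj, moving_obj], [False, True])
print(len(targets))  # -> 1
```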
Approach
Utilize both reconstruction and recognition (similar to the two-stream hypothesis in neuroscience).
Introduce pseudo-modalities such as optical flow and 3D scene flow as additional inputs to improve segmentation (see the sketch after this list).
Combine motion and appearance features for robust segmentation.
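A minimal sketch of one such pseudo-modality, assuming optical flow is packed into an image-like tensor for the motion stream; the normalization scheme is an assumption, not the recipe from the talk:

```python
# Minimal sketch: encoding optical flow as an image-like "pseudo-modality"
# so a second network stream can consume it next to the RGB appearance stream.
import torch

def flow_to_pseudo_image(flow):
    """flow: (2, H, W) per-pixel (dx, dy) displacements.
    Returns (3, H, W): normalized dx, dy, and flow magnitude."""
    mag = torch.linalg.norm(flow, dim=0, keepdim=True)   # (1, H, W)
    scale = mag.max().clamp(min=1e-6)                     # avoid divide-by-zero
    return torch.cat([flow / scale, mag / scale], dim=0)

rgb = torch.rand(3, 240, 320)        # appearance input
flow = torch.randn(2, 240, 320)      # motion input (e.g., from an off-the-shelf flow network)
motion = flow_to_pseudo_image(flow)  # image-like motion input for the second stream
```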
Literature Review
Motion segmentation has been a significant issue in computer vision.
Previous works often focused on single modalities (e.g., optical flow) or fixed data combinations using CNNs.
Datasets vary widely in motion diversity, semantic classes, and degenerate cases, complicating comparisons.
Experiment Design
Systematic comparison of motion representations from 2D to 3D using a transformer architecture.
Compute 2D optical flow or 3D scene flow for motion representation.
Infer single-image depth maps so that scene flow can be computed across multiple frames (see the sketch after this list).
Input both appearance and motion data into a state-of-the-art transformer for segmentation.
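A minimal sketch of the depth-to-scene-flow step, assuming a pinhole camera model, metric depth maps, and precomputed optical flow as pixel correspondences; camera egomotion compensation, which a real pipeline would need, is omitted:

```python
# Minimal sketch: lift per-pixel depth to 3D and form scene flow as the 3D
# displacement between matched pixels in consecutive frames.
import numpy as np

def backproject(depth, fx, fy, cx, cy):
    """depth: (H, W) metric depth -> (H, W, 3) camera-space points."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    return np.stack([x, y, depth], axis=-1)

def scene_flow(depth_t, depth_t1, flow, fx, fy, cx, cy):
    """flow: (H, W, 2) optical flow from frame t to t+1 (assumed given).
    Returns (H, W, 3) 3D motion vectors via nearest-pixel correspondence."""
    h, w = depth_t.shape
    pts_t = backproject(depth_t, fx, fy, cx, cy)
    pts_t1 = backproject(depth_t1, fx, fy, cx, cy)
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    u1 = np.clip(np.round(u + flow[..., 0]).astype(int), 0, w - 1)
    v1 = np.clip(np.round(v + flow[..., 1]).astype(int), 0, h - 1)
    return pts_t1[v1, u1] - pts_t
```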
Transformer Architecture
Extend transformer architecture to handle multiple modalities.
Use a two-stream architecture with separately trained weights for the motion and appearance streams.
Freeze the appearance branch after pre-training on COCO for general object appearance knowledge.
Features are combined via an attention mechanism at multiple scales of the feature pyramid (see the sketch after this list).
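A minimal sketch of the per-scale fusion, assuming both streams emit feature pyramids of equal channel width; using appearance features as queries and motion features as keys/values, and the placeholder backbone names, are assumptions rather than the exact architecture from the talk:

```python
# Minimal sketch: frozen appearance stream + trainable motion stream, fused
# per pyramid scale with multi-head attention.
import torch
import torch.nn as nn

class TwoStreamFusion(nn.Module):
    def __init__(self, dim=256, num_scales=3, heads=8):
        super().__init__()
        self.attn = nn.ModuleList(
            [nn.MultiheadAttention(dim, heads, batch_first=True)
             for _ in range(num_scales)]
        )

    def forward(self, appearance_feats, motion_feats):
        """Both inputs: lists of (B, C, H_i, W_i) tensors, one per scale.
        Appearance features act as queries, motion features as keys/values."""
        fused = []
        for attn, fa, fm in zip(self.attn, appearance_feats, motion_feats):
            b, c, h, w = fa.shape
            qa = fa.flatten(2).transpose(1, 2)   # (B, H*W, C)
            km = fm.flatten(2).transpose(1, 2)
            out, _ = attn(qa, km, km)
            fused.append(out.transpose(1, 2).reshape(b, c, h, w))
        return fused

# Freezing the COCO-pre-trained appearance stream (backbone name is hypothetical):
# for p in appearance_backbone.parameters():
#     p.requires_grad = False
```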
Alignment Problem
Multimodal models face alignment issues with multiple information sources.
Introduced negative examples to force the model to take the motion maps into account when predicting the final moving/static labels (sketched below).
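A minimal sketch of such a negative example, assuming it pairs an unchanged image with a zeroed motion map and relabels every instance as static; the field names and exact construction are hypothetical:

```python
# Minimal sketch: a "negative" sample shows no motion, so every instance must
# be predicted as static; this penalizes ignoring the motion stream.
import torch

def make_negative_example(sample):
    """sample: dict with 'image', 'flow' (2, H, W), and 'moving_labels' (N,)."""
    neg = dict(sample)
    neg["flow"] = torch.zeros_like(sample["flow"])                  # no observed motion
    neg["moving_labels"] = torch.zeros_like(sample["moving_labels"])  # all static
    return neg
```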
Initial Experiments
Analyzed motion representation effects on segmentation performance.
2D optical flow is robust but ineffective in degenerate motion cases.
3D scene flow offers a unique representation but is dependent on depth map quality.
Higher dimensional handcrafted motion costs offer a middle ground but still fall short of image detector performance.
Findings
Pure motion representation is insufficient for high-precision segmentation.
Non-rigid motion patterns must be grouped into whole objects for the model to perform well.
Using multiple modalities can enhance performance over single modalities.
Diverse training datasets are necessary to achieve state-of-the-art performance.
Model Behavior
Models behave as expected: a lack of training data for certain motion patterns leads to poor performance on those patterns.
Strong performance on KITTI is achieved using appearance and optical flow, despite MIX2 containing no real driving data.
Scene flow performs better when high-quality depth is available; optical flow is more effective on in-the-wild datasets like DAVIS.
Conclusion
Appearance data drives detection and segmentation, while motion data helps classify objects as moving or static.
Robust training on diverse data improves resilience to out-of-distribution data (e.g., camouflaged animals).
Attention and gradients differ between the streams: more local in the appearance stream, more global in the motion stream used for the moving/static classification.
Despite improvements, failure cases remain, likely due to data limitations rather than architectural issues.
Acknowledgments
Thank you for your attention!