Exploring Object Detection with VLM
May 30, 2025
Lecture Notes: Object Detection Using Natural Language and VLM
Overview
Discussion of using vision-language models (VLMs) for object detection.
Focus on mapping and tracking objects in 3D space (XY for location, Z for depth).
Use of robotic arms for interaction with detected objects.
Key Concepts
Object Tracking
The VLM allows tracking a wide array of objects based on natural-language descriptions.
Example: Tracking a "black robotic hand" and a "graphics card."
Objects marked with different colored plus signs in the UI for identification.
Functionality of VLM
Can track any object described in natural language, with no preset list of classes.
Allows the tracking list to be updated and changed dynamically at runtime (see the sketch below).
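A minimal sketch of what such a tracking loop might look like, assuming a hypothetical vlm_point() wrapper around the model's point/detect query (the wrapper name and the exact VLM call are assumptions; check the model's documentation). The colored plus-sign overlay mirrors the UI described above:

```python
import cv2

# Hypothetical wrapper: replace the body with the actual Moondream point/detect
# call from the model card. It should return a list of (x, y) pixel coordinates,
# one per instance of the described object found in the frame.
def vlm_point(frame, description):
    raise NotImplementedError("call the VLM's point/detect API here")

# The tracking list can be edited at runtime; no preset class list is required.
tracked = {
    "black robotic hand": (0, 255, 0),   # green plus sign
    "graphics card": (0, 0, 255),        # red plus sign
}

def annotate(frame):
    """Query the VLM once per tracked description and draw a plus sign per hit."""
    for description, color in tracked.items():
        for (x, y) in vlm_point(frame, description):
            cv2.drawMarker(frame, (int(x), int(y)), color,
                           markerType=cv2.MARKER_CROSS, markerSize=20, thickness=2)
    return frame
```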
Challenges
Occasional misidentification (e.g., confusing the hand with other objects like a tripod).
Suggestions to improve accuracy include adding distinguishing features (e.g., colored tape).
System Performance
The robotic arm moves slowly by design for safety during testing.
Current prediction rate is low (~0.5 FPS), but the model can run faster with optimizations (~150 ms per prediction, i.e. roughly 6-7 FPS); see the timing sketch below.
Proof of concept stage; aims to determine if the approach is viable.
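To make the latency numbers concrete, a sketch (under the same assumptions as the tracking-loop sketch above) that times each prediction and runs inference in a background thread, so the UI and arm-control loop are never blocked by a slow prediction. The camera.read frame source is a placeholder:

```python
import threading
import time

# Most recent detections, read by the control loop; written only by the
# inference thread. Reuses the hypothetical vlm_point() wrapper and `tracked`
# dict from the earlier sketch.
latest_points = {}

def inference_loop(get_frame, tracked):
    global latest_points
    while True:
        frame = get_frame()                       # grab the newest camera frame
        start = time.perf_counter()
        results = {desc: vlm_point(frame, desc) for desc in tracked}
        elapsed = time.perf_counter() - start
        latest_points = results
        print(f"prediction took {elapsed * 1000:.0f} ms (~{1 / elapsed:.1f} FPS)")

# `camera.read` stands in for whatever frame source the robot actually uses:
# threading.Thread(target=inference_loop, args=(camera.read, tracked), daemon=True).start()
```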
Hardware and Design Issues
Camera placement on the robot's head presents challenges with depth perception and object visibility.
Proposed solution: use additional cameras, ideally mounted on the gripper, for accurate depth perception (see the back-projection sketch below).
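One way the detected 2D point plus a depth reading could be lifted into 3D is standard pinhole back-projection, sketched below. The intrinsics (fx, fy, cx, cy) would come from calibrating whichever camera is used, and a gripper-mounted camera would additionally need a transform from its own frame into the arm's base frame. The intrinsic values in the example are placeholders:

```python
import numpy as np

def pixel_to_3d(u, v, depth_m, fx, fy, cx, cy):
    """Back-project a pixel (u, v) with a measured depth (meters) into a 3D
    point in the camera frame using the pinhole model. fx, fy, cx, cy come
    from the camera's intrinsic calibration."""
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return np.array([x, y, depth_m])

# Example with made-up intrinsics for a 640x480 camera:
point_cam = pixel_to_3d(410, 220, depth_m=0.55, fx=600.0, fy=600.0, cx=320.0, cy=240.0)
```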
Hand Functionality
Issue with the right robotic hand not working, confirmed through various tests.
Potential solutions included swapping parts or using adapter boards.
Thermal imaging used to diagnose the state of the equipment.
Vision Language Model (VLM) Details
VLM Capabilities
Moondream 2 VLM with nearly 2 billion parameters.
Features include captioning, object detection, and point marking.
Model requires around 5 GB of memory (see the loading sketch below).
Capable of understanding abstract object descriptions.
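A sketch of loading the model from the Hugging Face Hub, assuming the vikhyatk/moondream2 checkpoint; the exact inference methods (captioning, detection, point marking) vary by model revision, so check the model card before relying on them. At fp16 the ~2B parameters account for roughly 4 GB of the quoted ~5 GB runtime footprint:

```python
import torch
from transformers import AutoModelForCausalLM

# Moondream ships custom modeling code, so trust_remote_code is required.
model = AutoModelForCausalLM.from_pretrained(
    "vikhyatk/moondream2",
    trust_remote_code=True,
    torch_dtype=torch.float16,   # half precision keeps the weights small
).to("cuda")

# Rough check of how much memory the weights alone occupy.
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e9:.2f}B parameters, ~{n_params * 2 / 1e9:.1f} GB at fp16")
```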
Application Demonstrations
GUI developed for interacting with the VLM for object detection and description.
Examples include detecting a "red bottle of water" or "device to heat food" (microwave), showing the model’s flexibility.
Additional Exploration
Head Tilt and SLAM
Experimenting with head tilt to improve visibility without losing SLAM data.
Adjustments made to correct the occupancy grid's orientation when the head is tilted (see the rotation sketch below).
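A sketch of the kind of correction involved: rotating depth points from the tilted-camera frame back into a level frame before inserting them into the occupancy grid. This assumes the head tilt is a pure pitch (rotation about the camera's x-axis); the sign convention depends on the robot's frame definitions:

```python
import numpy as np

def correct_for_head_tilt(points_cam, tilt_rad):
    """Rotate an (N, 3) array of points from the tilted-camera frame back into
    a level frame so the occupancy grid is not skewed by the head pitch."""
    c, s = np.cos(tilt_rad), np.sin(tilt_rad)
    R = np.array([[1, 0,  0],
                  [0, c, -s],
                  [0, s,  c]])
    return points_cam @ R.T

# e.g. undoing a 20-degree downward tilt:
# level_points = correct_for_head_tilt(points_cam, np.radians(-20))
```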
Future Directions
Arm Policy Improvement
Interest in inverse kinematics (IK) for smoother, more natural-looking arm motion (a toy IK sketch follows this list).
Current challenges include accurate depth perception and path planning.
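As a toy illustration of IK (the real arm has more joints and would typically use a robotics library or a numerical solver), the closed-form solution for a planar 2-link arm:

```python
import numpy as np

def two_link_ik(x, y, l1, l2):
    """Closed-form IK for a planar 2-link arm with link lengths l1, l2.
    Returns (shoulder, elbow) angles in radians for one of the two elbow
    configurations, or None if the target (x, y) is out of reach."""
    d2 = x * x + y * y
    cos_elbow = (d2 - l1 * l1 - l2 * l2) / (2 * l1 * l2)
    if abs(cos_elbow) > 1.0:
        return None                      # target outside the arm's workspace
    elbow = np.arccos(cos_elbow)
    shoulder = np.arctan2(y, x) - np.arctan2(l2 * np.sin(elbow),
                                             l1 + l2 * np.cos(elbow))
    return shoulder, elbow

# angles = two_link_ik(0.25, 0.10, l1=0.20, l2=0.15)
```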
Simulation and Path Planning
Debating whether to use simulators for training robots.
Importance of real-life data vs. simulation for path planning and the arm policy.
Conclusion
Emphasis on the potential of VLM in robotics.
Open questions about camera placement and path planning.
Discussion on future steps and improvements for robotic arm and overall system efficiency.
Closing Remarks
Encouragement for feedback and suggestions.
Plan to improve arm policy and address camera and hand detection challenges in future work.