Exploring Object Detection with VLM
May 30, 2025
Lecture Notes: Object Detection Using Natural Language and VLM
Overview
Discussion of using vision-language models (VLMs) for object detection.
Focus on mapping and tracking objects in 3D space (XY for location, Z for depth).
Use of robotic arms for interaction with detected objects.
Key Concepts
Object Tracking
The VLM allows tracking a wide array of objects based on natural-language descriptions.
Example: Tracking a "black robotic hand" and a "graphics card."
Objects marked with different colored plus signs in the UI for identification.
Functionality of VLM
Can track any object described in natural language, with no preset list of classes.
Allows the tracking list to be updated and changed dynamically at runtime (see the sketch below).
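A minimal sketch of what such a tracking loop might look like, assuming a hypothetical vlm_point() wrapper around the model's point/detect query (the wrapper name and the exact VLM call are assumptions; check the model's documentation). The colored plus-sign overlay mirrors the UI described above:

```python
import cv2

# Hypothetical wrapper: replace the body with the actual Moondream point/detect
# call from the model card. It should return a list of (x, y) pixel coordinates,
# one per instance of the described object found in the frame.
def vlm_point(frame, description):
    raise NotImplementedError("call the VLM's point/detect API here")

# The tracking list can be edited at runtime; no preset class list is required.
tracked = {
    "black robotic hand": (0, 255, 0),   # green plus sign
    "graphics card": (0, 0, 255),        # red plus sign
}

def annotate(frame):
    """Query the VLM once per tracked description and draw a plus sign per hit."""
    for description, color in tracked.items():
        for (x, y) in vlm_point(frame, description):
            cv2.drawMarker(frame, (int(x), int(y)), color,
                           markerType=cv2.MARKER_CROSS, markerSize=20, thickness=2)
    return frame
```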
Challenges
Occasional misidentification (e.g., confusing the hand with other objects like a tripod).
Suggestions to improve accuracy include adding distinguishing features (e.g., colored tape).
System Performance
The robotic arm moves slowly by design for safety during testing.
Current prediction rate is low (~0.5 FPS), but the model can run faster with optimizations (~150 ms per prediction, i.e. roughly 6-7 FPS); see the timing sketch below.
Proof of concept stage; aims to determine if the approach is viable.
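To make the latency numbers concrete, a sketch (under the same assumptions as the tracking-loop sketch above) that times each prediction and runs inference in a background thread, so the UI and arm-control loop are never blocked by a slow prediction. The camera.read frame source is a placeholder:

```python
import threading
import time

# Most recent detections, read by the control loop; written only by the
# inference thread. Reuses the hypothetical vlm_point() wrapper and `tracked`
# dict from the earlier sketch.
latest_points = {}

def inference_loop(get_frame, tracked):
    global latest_points
    while True:
        frame = get_frame()                       # grab the newest camera frame
        start = time.perf_counter()
        results = {desc: vlm_point(frame, desc) for desc in tracked}
        elapsed = time.perf_counter() - start
        latest_points = results
        print(f"prediction took {elapsed * 1000:.0f} ms (~{1 / elapsed:.1f} FPS)")

# `camera.read` stands in for whatever frame source the robot actually uses:
# threading.Thread(target=inference_loop, args=(camera.read, tracked), daemon=True).start()
```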
Hardware and Design Issues
Camera placement on the robot's head presents challenges with depth perception and object visibility.
Proposed solution: use additional cameras, ideally mounted on the gripper, for accurate depth perception (see the back-projection sketch below).
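One way the detected 2D point plus a depth reading could be lifted into 3D is standard pinhole back-projection, sketched below. The intrinsics (fx, fy, cx, cy) would come from calibrating whichever camera is used, and a gripper-mounted camera would additionally need a transform from its own frame into the arm's base frame. The intrinsic values in the example are placeholders:

```python
import numpy as np

def pixel_to_3d(u, v, depth_m, fx, fy, cx, cy):
    """Back-project a pixel (u, v) with a measured depth (meters) into a 3D
    point in the camera frame using the pinhole model. fx, fy, cx, cy come
    from the camera's intrinsic calibration."""
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return np.array([x, y, depth_m])

# Example with made-up intrinsics for a 640x480 camera:
point_cam = pixel_to_3d(410, 220, depth_m=0.55, fx=600.0, fy=600.0, cx=320.0, cy=240.0)
```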
Hand Functionality
Issue with the right robotic hand not working, confirmed through various tests.
Potential solutions included swapping parts or using adapter boards.
Thermal imaging used to diagnose the state of the equipment.
Vision Language Model (VLM) Details
VLM Capabilities
Moondream 2 VLM with nearly 2 billion parameters.
Features include captioning, object detection, and point marking.
Model requires around 5 GB of memory (see the loading sketch below).
Capable of understanding abstract object descriptions.
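A sketch of loading the model from the Hugging Face Hub, assuming the vikhyatk/moondream2 checkpoint; the exact inference methods (captioning, detection, point marking) vary by model revision, so check the model card before relying on them. At fp16 the ~2B parameters account for roughly 4 GB of the quoted ~5 GB runtime footprint:

```python
import torch
from transformers import AutoModelForCausalLM

# Moondream ships custom modeling code, so trust_remote_code is required.
model = AutoModelForCausalLM.from_pretrained(
    "vikhyatk/moondream2",
    trust_remote_code=True,
    torch_dtype=torch.float16,   # half precision keeps the weights small
).to("cuda")

# Rough check of how much memory the weights alone occupy.
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e9:.2f}B parameters, ~{n_params * 2 / 1e9:.1f} GB at fp16")
```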
Application Demonstrations
GUI developed for interacting with the VLM for object detection and description.
Examples include detecting a "red bottle of water" or "device to heat food" (microwave), showing the model’s flexibility.
Additional Exploration
Head Tilt and SLAM
Experimenting with head tilt to improve visibility without losing SLAM data.
Adjustments made to correct the occupancy grid's orientation when the head is tilted (see the rotation sketch below).
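A sketch of the kind of correction involved: rotating depth points from the tilted-camera frame back into a level frame before inserting them into the occupancy grid. This assumes the head tilt is a pure pitch (rotation about the camera's x-axis); the sign convention depends on the robot's frame definitions:

```python
import numpy as np

def correct_for_head_tilt(points_cam, tilt_rad):
    """Rotate an (N, 3) array of points from the tilted-camera frame back into
    a level frame so the occupancy grid is not skewed by the head pitch."""
    c, s = np.cos(tilt_rad), np.sin(tilt_rad)
    R = np.array([[1, 0,  0],
                  [0, c, -s],
                  [0, s,  c]])
    return points_cam @ R.T

# e.g. undoing a 20-degree downward tilt:
# level_points = correct_for_head_tilt(points_cam, np.radians(-20))
```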
Future Directions
Arm Policy Improvement
Interest in inverse kinematics (IK) for smoother, more natural-looking arm motion (a toy IK sketch follows this list).
Current challenges include accurate depth perception and path planning.
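As a toy illustration of IK (the real arm has more joints and would typically use a robotics library or a numerical solver), the closed-form solution for a planar 2-link arm:

```python
import numpy as np

def two_link_ik(x, y, l1, l2):
    """Closed-form IK for a planar 2-link arm with link lengths l1, l2.
    Returns (shoulder, elbow) angles in radians for one of the two elbow
    configurations, or None if the target (x, y) is out of reach."""
    d2 = x * x + y * y
    cos_elbow = (d2 - l1 * l1 - l2 * l2) / (2 * l1 * l2)
    if abs(cos_elbow) > 1.0:
        return None                      # target outside the arm's workspace
    elbow = np.arccos(cos_elbow)
    shoulder = np.arctan2(y, x) - np.arctan2(l2 * np.sin(elbow),
                                             l1 + l2 * np.cos(elbow))
    return shoulder, elbow

# angles = two_link_ik(0.25, 0.10, l1=0.20, l2=0.15)
```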
Simulation and Path Planning
Debating whether to use simulators for training robots.
Importance of real-life data vs. simulation for path planning and the arm policy.
Conclusion
Emphasis on the potential of VLM in robotics.
Open questions about camera placement and path planning.
Discussion on future steps and improvements for robotic arm and overall system efficiency.
Closing Remarks
Encouragement for feedback and suggestions.
Plan to improve arm policy and address camera and hand detection challenges in future work.