
Exploring Object Detection with VLM

May 30, 2025

Lecture Notes: Object Detection Using Natural Language and VLM

Overview

  • Discussion on using Vision Language Models (VLMs) for object detection.
  • Focus on mapping and tracking objects in 3D space (XY for image location, Z for depth); see the back-projection sketch after this list.
  • Use of robotic arms for interaction with detected objects.
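
A minimal sketch of the back-projection step, assuming a standard pinhole camera model: the VLM returns a pixel location, a depth reading supplies Z, and the intrinsics recover XY in the camera frame. The focal lengths and principal point below are placeholder values, not the robot's actual calibration.

```python
import numpy as np

# Placeholder pinhole intrinsics -- substitute the head camera's calibration.
FX, FY = 600.0, 600.0   # focal lengths in pixels (assumed)
CX, CY = 320.0, 240.0   # principal point (assumed 640x480 image)

def backproject(u: float, v: float, depth_m: float) -> np.ndarray:
    """Convert a pixel (u, v) plus a depth reading (meters) into an
    XYZ point in the camera frame using the pinhole model."""
    x = (u - CX) * depth_m / FX
    y = (v - CY) * depth_m / FY
    return np.array([x, y, depth_m])

# Example: a detection centered at pixel (400, 260) seen 0.75 m away.
print(backproject(400, 260, 0.75))  # -> [0.1, 0.025, 0.75]
```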

Key Concepts

Object Tracking

  • Use of a VLM allows tracking a wide array of objects based on natural-language descriptions.
  • Example: Tracking a "black robotic hand" and a "graphics card."
  • Objects marked with different colored plus signs in the UI for identification.
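
A minimal sketch of how the plus-sign overlay could work, assuming OpenCV for drawing: a mutable mapping from natural-language prompts to marker colors (which is also what makes the tracking list easy to change at runtime), with `point_fn` as a placeholder for whatever point-marking call the VLM exposes.

```python
import cv2

# Natural-language prompts -> BGR marker colors; entries can be added or
# removed at runtime, which keeps the tracking list dynamic.
tracked = {
    "black robotic hand": (0, 255, 0),   # green plus
    "graphics card": (0, 0, 255),        # red plus
}

def draw_markers(frame, point_fn):
    """Overlay a colored plus sign for each tracked prompt.

    `point_fn(frame, prompt)` is a placeholder for the VLM's point-marking
    call; it should return an (x, y) pixel or None if nothing is found."""
    for prompt, color in tracked.items():
        pt = point_fn(frame, prompt)
        if pt is None:
            continue
        x, y = int(pt[0]), int(pt[1])
        cv2.drawMarker(frame, (x, y), color,
                       markerType=cv2.MARKER_CROSS,
                       markerSize=20, thickness=2)
    return frame
```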

Functionality of VLM

  • Can track any object described in natural language, without preset limitations.
  • Allows for dynamic updates and changes in the tracking list.

Challenges

  • Occasional misidentification (e.g., confusing the hand with other objects like a tripod).
  • Suggestions to improve accuracy include adding distinguishing features (e.g., colored tape).

System Performance

  • The robotic arm moves slowly by design for safety during testing.
  • Current prediction rate is low (~0.5 frames per second), but the model itself can run at roughly 150 ms per prediction (~6–7 FPS) with optimizations.
  • Proof of concept stage; aims to determine if the approach is viable.

Hardware and Design Issues

  • Camera placement on the robot's head presents challenges with depth perception and object visibility.
  • Proposed solution: Use additional cameras, ideally on the gripper, for accurate depth perception.

Hand Functionality

  • Issue with the right robotic hand not working, confirmed through various tests.
  • Potential solutions included swapping parts or using adapter boards.
  • Thermal imaging used to diagnose the state of the equipment.

Vision Language Model (VLM) Details

VLM Capabilities

  • Moondream 2, a VLM with nearly 2 billion parameters.
  • Features include captioning, object detection, and point marking.
  • The model requires around 5 GB of memory.
  • Capable of understanding abstract object descriptions.
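
A rough sketch of loading Moondream 2 from Hugging Face and querying it with an abstract description. The `vikhyatk/moondream2` checkpoint is the published one, but the helper method names differ across model revisions, so treat the calls below as an assumption and pin a revision whose model card documents them.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from PIL import Image

# Published checkpoint; helper method names vary by revision (assumption here:
# the classic encode_image / answer_question interface; newer revisions also
# expose point/detect helpers for the point-marking feature).
model_id = "vikhyatk/moondream2"
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

image = Image.open("frame.jpg")
encoded = model.encode_image(image)

# Abstract descriptions work as prompts, e.g. the "device to heat food" demo.
print(model.answer_question(encoded, "Where is the device to heat food?", tokenizer))
```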

Application Demonstrations

  • GUI developed for interacting with the VLM to detect and describe objects.
  • Examples include detecting a "red bottle of water" or "device to heat food" (microwave), showing the model’s flexibility.

Additional Exploration

Head Tilt and SLAM

  • Experimenting with head tilt to improve visibility without losing SLAM data.
  • Adjustments made to keep the occupancy grid correctly oriented when the head is tilted.
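
A minimal sketch of that correction, assuming camera-frame points (x right, y down, z forward) and a known head pitch; the axis and sign conventions are assumptions to verify on the real robot before the points are fed into the occupancy grid.

```python
import numpy as np

def correct_for_head_tilt(points_cam: np.ndarray, pitch_rad: float) -> np.ndarray:
    """Rotate Nx3 camera-frame points by the head pitch so they land in a
    level frame before occupancy-grid insertion.
    Assumed sign convention: positive pitch = head looking down."""
    c, s = np.cos(pitch_rad), np.sin(pitch_rad)
    # Rotation about the camera x-axis undoes the downward tilt.
    rot_x = np.array([
        [1, 0,  0],
        [0, c, -s],
        [0, s,  c],
    ])
    return points_cam @ rot_x.T

# Example: a point 1 m straight ahead, with the head tilted 20 degrees down.
pts = np.array([[0.0, 0.0, 1.0]])
print(correct_for_head_tilt(pts, np.deg2rad(20)))
```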

Future Directions

Arm Policy Improvement

  • Interest in inverse kinematics (IK) for smoother, more natural-looking arm motion; see the sketch after this list.
  • Current challenges include accurate depth perception and path planning.
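
As a starting point for the IK interest above, a textbook closed-form solution for a planar two-link arm, with placeholder link lengths; the real arm has more joints and would need a full solver, so this only illustrates the idea.

```python
import math

L1, L2 = 0.30, 0.25  # placeholder link lengths in meters

def two_link_ik(x: float, y: float, elbow_up: bool = True):
    """Closed-form IK for a planar 2-link arm: returns (shoulder, elbow)
    joint angles in radians that place the end effector at (x, y),
    or None if the target is out of reach."""
    d2 = x * x + y * y
    cos_elbow = (d2 - L1 * L1 - L2 * L2) / (2 * L1 * L2)
    if abs(cos_elbow) > 1.0:
        return None  # target outside the reachable workspace
    elbow = math.acos(cos_elbow)
    if elbow_up:
        elbow = -elbow
    shoulder = math.atan2(y, x) - math.atan2(L2 * math.sin(elbow),
                                             L1 + L2 * math.cos(elbow))
    return shoulder, elbow

print(two_link_ik(0.35, 0.20))
```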

Simulation and Path Planning

  • Debating whether to use simulators for training robots.
  • Importance of real-life data vs. simulation in path planning and the arm policy.

Conclusion

  • Emphasis on the potential of VLM in robotics.
  • Open questions about camera placement and path planning.
  • Discussion on future steps and improvements for robotic arm and overall system efficiency.

Closing Remarks

  • Encouragement for feedback and suggestions.
  • Plan to improve the arm policy and address the camera placement and hand-detection challenges in future work.