Transcript for:
Exploring Object Detection with VLM

Welcome everyone to the next installment of the Jeff the G1 series. This time we have object detection using natural language and a VLM: the idea is to map the objects we're tracking to locations in XY space, extrapolate the Z delta from the depth camera, and then translate that into movements with the arm policy.

At the bottom of the UI you can see "black robotic hand" and "graphics card." These are comma-separated objects that we're tracking with the VLM, and they can be changed on the fly. It's not some preset list; it can be anything you can describe, because it's a VLM and they're magical. In this case, the black robotic hand is going to attempt to seek out the graphics card. The VLM marks the locations of both: the red plus sign is the point for the black robotic hand, and the yellow plus sign is the location of the second object, the graphics card.

The idea is that we can track any object. Gone are the days of having maybe 30 preset objects you could track. Now you can track literally anything, and the end user can tell the robot in natural language what they want: "go to the kitchen and get me a bottle of water." You'd use function calling to work out where to go and what object we're seeking. Also, "black robotic hand" might not be the best identifier; sometimes it messes up and detects a tripod or something else that isn't the hand. You could improve on that, for example by putting a ring of colored tape around the hand and prompting "robotic hand with green tape," which would probably increase the accuracy a ton. There's lots to do here; "black robotic hand" was literally the first thing I tried.

Obviously this is very much a proof of concept. The arm is moving exceptionally slowly by design; it can move much quicker, and this is not a quirk of the arm policy. The rate at which we generate the point predictions is pretty slow right now, around one frame per second or even half that, but the Moondream model itself runs in roughly 150 milliseconds, so with a little optimization this could all run much faster. This really is just a proof of concept: will it work, can it work, and what should the next steps along this journey be? I'm also intentionally moving the arm very slowly because my own arms are in the way and I don't want to break one.

At the moment, the weak point is the camera being on the head, above the arms and angled down. Angled down like that, you walk into the kitchen and you can't see what's on the countertops. Angle the head back so the camera points outward and you can see the countertops, but you can't see the robot's hands in space, so tracking becomes pretty hard. Really, what you need is two cameras.
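For the curious, the loop described above boils down to something like this. It's a minimal sketch, not the actual code running on Jeff: `point_fn` stands in for the Moondream point call, `move_fn` for the arm policy hook, and the depth frame is assumed to be in millimeters and aligned to the color image.

```python
def depth_at(depth_frame, x, y):
    """Depth (meters) at a pixel. Assumes a depth image in millimeters,
    aligned to the color frame -- an assumption about the camera setup."""
    return float(depth_frame[int(y), int(x)]) / 1000.0

def track_step(color_frame, depth_frame, point_fn, move_fn,
               hand_prompt="black robotic hand",
               target_prompt="graphics card", gain=0.1):
    """One iteration of the seek loop: point at the hand and the target with
    the VLM, estimate the XY/Z delta, and command a small relative arm move.

    point_fn(frame, prompt) -> (x_px, y_px) or None   (e.g. a Moondream point call)
    move_fn(dx, dy, dz)                               (the arm-policy hook)
    Both are placeholders for whatever the real stack exposes.
    """
    hand = point_fn(color_frame, hand_prompt)      # red '+' in the UI
    target = point_fn(color_frame, target_prompt)  # yellow '+'
    if hand is None or target is None:
        return  # missed detection this frame; skip and try the next one

    dx_px = target[0] - hand[0]                    # image-plane offset
    dy_px = target[1] - hand[1]
    dz_m = depth_at(depth_frame, *target) - depth_at(depth_frame, *hand)

    # Scale everything down so the arm creeps toward the target slowly.
    move_fn(gain * dx_px, gain * dy_px, gain * dz_m)
```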
The depth measurement being at the head also isn't really what we want. We need depth with respect to the objects from the hand's position. From the head, you might get the impression that two objects are at the same depth, but they're the same depth from the camera on the head, not from each other. This is why a lot of robot arms end up having cameras mounted on the gripper itself.

Before getting to arm control and seeking objects, I actually started by trying to program the grab functionality with the Dex3-1 hands, and I hit a total wall; I couldn't figure out why nothing was working. Then, just as a debugging step, I tried controlling the left hand instead of the right, and it immediately worked fine, so I was pretty confident something was wrong with the right hand itself. Unfortunately, the hands are side-specific, and the right arm is the one we made the arm policy for, so that kind of sucked. At first I thought I could swap in one of the Inspire hands, but going through the manual for swapping hands I realized we'd need an adapter board for the Inspire hands, which we don't have. I contacted Robboto and we're supposed to get one at some point, but for now that's not an option.

From there, the idea was to confirm whether the hand is actually dead. We took off the back plate and checked the ports. Strangely, there are two port options on the back, a top set and a bottom set, and the sides can also be swapped, so you can take the left hand and plug it into the right-hand side; that tells you whether the problem is the port or the hand. Unfortunately, nothing worked, and it's very clear the hand is simply not receiving power. The adage about robots being cold and dead is just not true: look at a robot under thermals and it has warmth to it, and it feels warm to the touch. Look at this robot under thermals and it is very clear that hand is dead.

This did give us a pretty cool look at the internals of the Unitree G1 with the whole back plate off. I posted a picture of it to X and got a lot of comments about how it just looks like a laptop, as if that's a bad thing. Honestly, I'm not sure what anyone should expect when you take the back plate off: many years have gone into stuffing the most performance possible into a small form factor while dealing with heat, so it makes a lot of sense to essentially use laptop technology in the back of a robot. As far as thermals are concerned, the computing components look perfectly fine, well within operating range. Some of these thermal shots were taken with the back plate off, which is probably worse for airflow, since the fans are meant to have that plate on to direct the air, so this is more like a worst-case scenario.
And it looks fine. I took these thermals after letting Jeff walk around for about five minutes, until temperatures stabilized, with SLAM and everything else running; the internal computer is handling the IMU, balance, gait, and all of that. I don't think heat is a real concern. But if you are using thermals, you could definitely spot Jeff in the forest if you needed to.

A big piece of the intelligence making this all work is the VLM, or vision language model. The one we're using is Moondream 2. It's a little shy of two billion parameters and takes about 5 GB in memory. There's also a quantized version, but it doesn't support points, and points are exactly what we need; for some reason there's also a bit of an accuracy difference between points and object detection, which I'll try to illustrate in a moment. I'll put a link in the description, and I might put the UI I'm about to show you in the GitHub as well.

This model is actually really cool. It can caption an image, either a short caption or a normal one. Why would you want either? The short caption is just faster to generate; the normal caption takes a little longer and has more detail. You can ask it questions, you can do object detection, which essentially draws a bounding box, and you can have it mark a point.

So I made a little GUI, or rather I used Codex with o3. Since the last video, Claude 4 has been released, so I do need to try that out; we might make a change in the series, we'll see. I've been very happy with o3 so far, and there were some things I didn't like about Claude 3.5 with Claude Code, not Codex; it's hard to keep track of all of these. Anyway, the GUI can do object detection, point, query, and the captions.

Let me show you a few things; I'll make the window a little wider and type in what we want to look for. There's lots of stuff on the counter here in the kitchen, and for the record, the head in this photo is tilted back, I want to say 25 degrees; we'll talk more about that, since we've already touched on the issue with the head tilt. If we search with point for "red bottle of water" and run it, it apparently took 466 milliseconds, probably because it's the first query. Running it again gives 142, and if we keep running it we're mostly in the 140s and 150s. That's our red bottle of water. I think this GUI also handles multiple objects; let's check with "yellow bottle of water." Yes: we have the red bottle of water and then the yellow bottle of water. Hopefully that's coming through on the video. Now let's run a detection.
And here you can see a perfect example of how the two are not the same: with object detection, "red bottle of water" detects the yellow bottle of water, while point marks the right one. In general, I've found that point seems to be more accurate for some reason, and not just on red and yellow bottles of water.

Because it's a VLM, you can describe things in lots of ways. You could say "sink," and there's our sink. You could say "microwave," and it finds the microwave. But it doesn't have to be an object described in that literal way: you could say something like "device to heat food," and it still detects the microwave. So it handles much more abstract descriptions, and the power of these VLMs for robotics is honestly just staggering to me because of how much you can do with them. Another one: "canned air." Yep, there's our little bottle of canned air.

In the old days you had object detection models trained on maybe 30 objects, and they would work pretty well for those 30. If you wanted new objects, you had to fine-tune the model, sometimes even at the expense of a previous class. So this approach is really cool, and it's fast: 143 milliseconds is crazy.

We can also check the short caption just to get an idea: "white dishwasher and a sink, a blue robotic arm reaching towards it." Interesting that it calls the arm blue; it's probably picking up the blue from the head, which lights up with little blue LEDs. It also mentions a kitchen with white cabinets and a black countertop, sure, and even the window, which is a bit surprising but accurate; there is a window there. Then the long caption, just to see the difference: this time it's a white robotic arm, interesting. The kitchen features cabinets with silver handles, so it gets more descriptive; brown laminate floor, yep; a sink visible in the background; various bottles and containers on the countertop; a black microwave also present on the countertop. It even mentions the angle, emphasizing the height and reach of the arm. I'm not using the captions right now, but it's a way for your robot to begin to get an understanding of its surroundings, and the fact that it handles very general, abstract descriptions of what things are, like "a thing to heat up food," is just cool.

What an incredible time to be doing robotics, when you have these tiny models and the Nano board on the Unitree has 16 or 32 GB of memory; I can't even remember which ours has. It might be 32; I'll have to check.
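If you want to poke at the model yourself, here's roughly what the calls behind that GUI look like, following the usage shown on the Moondream 2 Hugging Face model card at the time of writing. The revision string, the image path, and the assumption that points come back normalized to 0–1 are things to double-check against the current card.

```python
from transformers import AutoModelForCausalLM
from PIL import Image

# Load Moondream 2 (~2B params, ~5 GB in memory). The revision string and the
# caption/query/detect/point methods follow the Hugging Face model card;
# verify against the current card before copying this verbatim.
model = AutoModelForCausalLM.from_pretrained(
    "vikhyatk/moondream2",
    revision="2025-01-09",
    trust_remote_code=True,
    device_map={"": "cuda"},   # drop this line to run on CPU
)

image = Image.open("kitchen.jpg")  # illustrative path

# Captions: short is faster, normal is more descriptive.
print(model.caption(image, length="short")["caption"])
print(model.caption(image, length="normal")["caption"])

# Free-form visual question answering.
print(model.query(image, "What appliances are on the countertop?")["answer"])

# Object detection (bounding boxes) vs. pointing (single coordinates).
boxes = model.detect(image, "red bottle of water")["objects"]
points = model.point(image, "red bottle of water")["points"]

# Points appear to come back normalized to 0..1, so scale to pixels to draw them.
for p in points:
    print(int(p["x"] * image.width), int(p["y"] * image.height))
```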
And this model is only about 5 GB; it's just incredible, and a really cool model, so definitely check it out if you have VLM needs.

One of the other side quests was figuring out whether we can keep SLAM, or really the LiDAR that gives us SLAM, and the occupancy grid while tilting the head backwards so the camera aims outward. Part of the problem shows up as soon as we run it with the head tilted back; let me get this started up. He sticks his hand out fast, which is exactly why I don't really want to be standing there. You can already see the occupancy grid here, and the LiDAR itself works: you can see the pitched roof of the space I'm working in. The problem is that the system can't work out the orientation of the robot, and as we walk around a little more, you can see the occupancy grid is just totally saturated.

So the question was, can we figure this out? On the one hand it's obviously still working; this would essentially be the occupancy grid, so it's still very functional. It just doesn't understand the orientation, and since the orientation shown for the robot is all wrong, the occupancy grid is messed up. The first thing I wanted to figure out was whether we can adjust for that, and why not? You should be able to identify what is the floor, and from there it should just be a simple calculation from the detected angle of the robot with respect to the floor to do the translation. I couldn't find a way to do that automatically, but we can't move the head dynamically anyway; we can only position it manually. So the thought was: I'll just set the head to 25 degrees, and then I know we're at 25 degrees. Can we feed that 25-degree angle into the SLAM and occupancy grid calculations to fix it? The answer, of course, is going to be yes.

Before I do that, I'm going to bring our buddy Jeff back here, mostly so I can hook him back up and do the reset safely in case anything goes wrong. One cool thing I do want to point out is that with the head tilted back you get the actual ceiling in here, and even without entering the kitchen you get so much more fidelity in the SLAM data, which I think is kind of cool; you just get more out of the LiDAR unit. I don't know if we're going to use it like this, but it is interesting to me. So anyway, let me copy and paste: we export an environment variable saying the LiDAR tilt is 25 degrees, and run that. He'll probably drop his hand; he's so aggressive.
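Conceptually, all that tilt variable has to drive is a fixed pitch rotation applied to the LiDAR points before the occupancy checks. Here's a rough sketch of the idea; the environment variable name, the axis convention, and the floor/ceiling thresholds are illustrative, not what's actually in my scripts.

```python
import os
import numpy as np

# The head is manually tilted back, so the LiDAR frame is pitched relative to
# the body. Read the tilt (degrees) from an environment variable (name is
# illustrative) and rotate the cloud back before the occupancy-grid checks.
TILT_DEG = float(os.environ.get("LIDAR_TILT_DEG", "0"))

def correct_tilt(points, tilt_deg=TILT_DEG):
    """Rotate an (N, 3) point cloud about the pitch axis to undo a fixed head
    tilt. Assumes x forward, y left, z up, and that the tilt is the only
    misalignment -- head bobble while walking is not handled here."""
    a = np.radians(tilt_deg)
    rot = np.array([
        [ np.cos(a), 0.0, np.sin(a)],
        [ 0.0,       1.0, 0.0      ],
        [-np.sin(a), 0.0, np.cos(a)],
    ])
    return points @ rot.T

def occupancy_candidates(points, floor_z=0.05, max_z=1.8):
    """Keep points that should land in the occupancy grid: above the
    (hard-coded) floor height but below the ceiling. These are exactly the
    kinds of thresholds that need retuning once the tilt is corrected."""
    corrected = correct_tilt(points)
    z = corrected[:, 2]
    return corrected[(z > floor_z) & (z < max_z)]
```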
As you can see now, with that change, we have an improved occupancy grid. Some of it over here does look off; my guess is the head isn't quite at 25 degrees, and you'd probably want to use a level or something, but you get the idea. It is likely working. There's also just a lot of junk over there; as you can very clearly see, my working space is run by toddlers now for the most part, so it gets messy. We'll move Jeff out a little, hopefully without him tripping on his wiring. There we go. Bye, Jeff.

I almost wonder if that artifact is because of the roof; I actually don't know, because in the SLAM it doesn't look like it should look like that. It might just be the height calculation screwing up; I'm not really sure. Let's head over this way. He looks pretty silly with his head back, I'm not going to lie. At some point he's definitely picking up a little too much, and I don't know if that's because he's bouncing as he walks, but you can tell it's detecting some of these as higher points than they really are. So we might have to change that hard-coded floor-height value, which basically determines what gets marked in the occupancy grid. We obviously need to adjust that, but you should be able to tell visually that we're at least much closer to having the robot's level correct.

So you can do that, and like I said, the geek in me really likes that the roof and the ceiling are mapped over here; it's just kind of cool. If we want to continue with the head tilted back, we'll have to do something like this and probably improve it a little. I think the problem is that as he walks, his head bobbles, and we're likely detecting some pixels as too high; if the angle is off even a little, at a distance it makes a big difference. I think that's why, at the start of all this, it was detecting that area over here as higher: the degrees are probably off just a bit, and at a distance that makes things appear to come up. We can work on that. Long story short, it's fixable; we just have to get the angle right, and it looks like we're off by maybe two degrees or so. That matters because I do think we're going to need the occupancy grid at some point. But that's the SLAM side quest, which we're currently not using at all.

Okay, so what's next? I'd like to improve the arm policy for Jeff. Along the way, the phrase of the day from the previous video was inverse kinematics, and it does seem like a very appealing way to control the arm; it may still be a viable option. But let me talk about at least some of the problems we have.
First, going back to the angle of the head: say you've got a bottle of water here and the hand here. The robot is likely going to detect that the bottle of water is closer than the robot hand. From my perspective it's obvious they're at the same depth, but the camera is up here on the head, so the depths it measures are different: it might read the bottle as two feet from the camera and the hand as three feet, even though they're at the same depth with respect to the hand, and the depth with respect to the hand is what actually matters. A lot of robot arms with grippers will have a camera right on the hand for exactly this reason, and you wouldn't need much; even a 240p camera there would be very useful. But our camera is up on the head, so besides needing it for a general idea of what's around us, we have the trade-off from before: angle up to get a better view of the room and we can't see the hands; angle down to see the hands and we can't see the room. So we probably need more cameras. We could strap a camera on near the wrist, or put one on the chest somewhere, which would be kind of cool; cutting into the chest probably isn't an option since that's likely an air intake, but we could strap a camera onto it. If we had one at about hand level, where we'd most frequently grab things, I think that would be more useful.

The other thing to think about, getting back to inverse kinematics: just for the arm policy of moving the arm up, down, left, right, forward, and back, inverse kinematics really is an attractive option, and I might pursue it a bit more simply because I want the arm to look better when it's moving. But then there's another problem. If the hand is here and the bottle of water is here, the way things are currently written, even in the most perfect world the robot is just going to go and bang the bottle away. And if the bottle is on the other side of a bar, it's going to try to go straight through the bar. It has no understanding of path planning, and we're going to need path planning. So no matter what, you're going to need some sort of understanding of the environment, then you have to plan a path, and then you need an arm policy to execute it. The path itself will likely need something different from the inverse kinematics part, but once you have the path, you could use inverse kinematics to follow it (see the sketch below).
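To make the "follow the path with IK" part concrete, here's a generic damped-least-squares sketch. This is the standard textbook method, not Unitree's API, and it assumes you can supply forward kinematics and a positional Jacobian for the arm.

```python
import numpy as np

def ik_step(q, target_pos, fk, jacobian, damping=0.05, gain=0.5):
    """One damped-least-squares IK update toward a Cartesian waypoint.

    q           -- current joint angles, shape (n,)
    target_pos  -- desired hand position in the base frame, shape (3,)
    fk(q)       -- forward kinematics, returns the current hand position (3,)
    jacobian(q) -- 3 x n positional Jacobian at q
    fk and jacobian are assumed to come from the arm's kinematic model.
    """
    err = target_pos - fk(q)                      # Cartesian position error
    J = jacobian(q)
    # Damped pseudo-inverse: dq = J^T (J J^T + lambda^2 I)^-1 (gain * err)
    JJt = J @ J.T + (damping ** 2) * np.eye(3)
    dq = J.T @ np.linalg.solve(JJt, gain * err)
    return q + dq

def follow_path(q, waypoints, fk, jacobian, tol=0.01, max_iters=200):
    """Walk the hand through a list of Cartesian waypoints (the planned path),
    converging on each one before moving to the next."""
    for wp in waypoints:
        wp = np.asarray(wp, dtype=float)
        for _ in range(max_iters):
            if np.linalg.norm(wp - fk(q)) < tol:
                break
            q = ik_step(q, wp, fk, jacobian)
    return q
```

The waypoints would come from whatever path planner we end up with; the damping term just keeps the steps well-behaved near singular arm poses.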
The question I keep coming back to is that, at the end of the day, we're likely going to need either a simulator or something very specific to this robot, and the problem is what the inputs would be to, say, a neural network that at least handles path planning. I haven't fully decided. But think about the point we had earlier: it's just a pixel, so you have its XY coordinates, and you extrapolate the distance delta from the depth camera. I think we could write an algorithm that knows the head is up here, angled down at a known degree, and from that computes the position of the hand and the position of the detected object from their XY coordinates and depth values, then does a bit of math to get the depth with respect to the hand, using the camera angle, the hand position, and the depth readings for the hand and the object. Since we also know the X and Y coordinates, I think we can translate the camera-relative depth into a hand-relative depth (I'll drop a rough sketch of that translation below). Then the input to the network would essentially just be the two XY values plus that extrapolated depth, and from there you could use inverse kinematics to bring the arm to the object.

To start, before we get too crazy, and because I'm trying to avoid the sim for as long as possible, I want to try something like that, working in Cartesian space; I'll say IK from now on, otherwise people are going to make fun of me for saying inverse kinematics too many times. That's probably the next thing I want to try, because I just want the arm to look better when it moves; I really want that. Either IK will work, or if it doesn't, we could use the simulator just to train an arm policy, and I think that would work pretty well. But trying to use a simulator to learn to grab objects, including the path planning, is actually kind of hard. Physics in simulators is hard, and I don't want to belabor the topic, but if you're training a gait, or making the robot dance or do a backflip, you can train that in a simulator with essentially no external sensory inputs; you just have whatever is on the robot itself reading physics values back. You don't have to model the camera, the depth camera, or the LiDAR, all of which are going to behave very differently in sim. Getting a good enough match with those to go from sim to real will, I think, be a challenging task. And it's tough, because that mismatch is already one of the most challenging parts of training something like a gait: the physics in the sim just isn't going to match real physics.
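Coming back to that camera-to-hand depth translation for a second, the math is basically back-projecting both points into 3D and taking the difference. A rough sketch, assuming pinhole intrinsics from the depth camera (placeholders here) and a pure downward pitch for the head; the exact sign conventions depend on how the camera is mounted.

```python
import numpy as np

def pixel_to_camera(x_px, y_px, depth_m, fx, fy, cx, cy):
    """Back-project a pixel plus its depth reading into a 3D point in the
    camera frame using a pinhole model. fx/fy/cx/cy are the depth camera's
    intrinsics (placeholders -- read them from the camera driver)."""
    x = (x_px - cx) * depth_m / fx
    y = (y_px - cy) * depth_m / fy
    return np.array([x, y, depth_m])

def object_relative_to_hand(hand_px, hand_depth, obj_px, obj_depth,
                            intrinsics, head_pitch_deg):
    """Both points are measured from the head camera, which is pitched down.
    Back-project them, undo the pitch, and take the difference so the offset
    is expressed relative to the hand instead of the camera."""
    fx, fy, cx, cy = intrinsics
    hand = pixel_to_camera(*hand_px, hand_depth, fx, fy, cx, cy)
    obj = pixel_to_camera(*obj_px, obj_depth, fx, fy, cx, cy)

    a = np.radians(head_pitch_deg)       # undo the fixed downward pitch
    rot = np.array([[1.0, 0.0,        0.0       ],
                    [0.0, np.cos(a), -np.sin(a)],
                    [0.0, np.sin(a),  np.cos(a)]])
    return rot @ obj - rot @ hand        # offset of the object from the hand
```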
So whatever your task is, it's always best, or at least easier, if you can use real-life data. That said, I think the way forward for robotics is that eventually everything will be done in sim; that's the inevitable path, and we're probably just wasting time not getting there. Even then, I'd still like to see what works best in reality first, and then move to the simulator working from that principle.

Anyway, that's my idea. If you have suggestions, and I'm sure you do, feel free to leave them below. Otherwise, I'll see you in another video, hopefully with a somewhat more attractive arm policy, and hopefully having figured out what I want to do about the cameras. That's really the hardship: even with all these other things solved, we still have a real problem where we either can't walk into the kitchen and see what's on the counters, or we can see the counters but not the hands. Potentially, what we do is give the robot some awareness of where the hand is in space, because you can calculate that. It doesn't really matter whether the camera can see the hands; it just needs to see the objects and then know where the hand is even when it can't see it, because you know the positions of all the motors down the arm. Just because the camera can't see the hand doesn't mean you can't know where it is.

So yeah, lots of rabbit holes have been uncovered here, and we only have one working hand at the moment, which kind of stinks. But that's all for now; this is going to be too long of a video already. Like I said, questions, comments, concerns, inverse kinematics, whatever, feel free to leave those below. Otherwise, I'll see you guys in another video.