Transcript for:
Advancements in Low-Level Embodied Intelligence

All right, hey guys, thanks for coming to our second class. Today we have the pleasure of welcoming Fei Xia. He is a senior research scientist at Google DeepMind, where he works on the robotics team. He received his PhD here at Stanford, working with Silvio Savarese in the Stanford Vision and Learning Lab, as well as with Leonidas Guibas. His mission is to build intelligent embodied agents that can interact with complex and unstructured real-world environments, with applications in home robotics. Recently he has been exploring the use of foundation models for robot decision making and action generation. I'll now hand it off to Fei.

Hi everyone, I'm super happy to be here and happy to be back. I graduated from here two years ago, and now I'm a research scientist at Google DeepMind on the robotics team. Today I'll be talking about low-level embodied intelligence with foundation models. It's definitely an interesting topic: I will introduce what embodied intelligence is, what low-level embodied intelligence is, and how we can accelerate building it with foundation models.

So why are we working on embodied intelligence? Embodied intelligence is an integral part of artificial intelligence and an important milestone toward artificial general intelligence, and it has a lot of use cases. For example, we would all like a home robot that can be in our home 24/7, clean up our messy rooms, cook for us, or take care of our aging family members. We are not quite there yet; in fact, we are quite far from it. That is because our AI is currently mostly in the virtual world: we have agents that can draft emails or write eloquent essays, but they are not very good at interacting with the messy, unstructured, complex real-world environments that humans live in.

To give you a couple of examples of how messy, and even hostile, the real world can be for robots, I want to show a curious error from one of our robots. The task is to put the Coke can in the sink; watch what the robot does: it grabs the Coke can and opens the tab. This is somewhat dangerous, but it's also interesting, because we never expected it to do something like that. From what is essentially random noise it starts to open the tab, and the liquid starts to come out. For an agent to have this type of physical intelligence, it needs to understand the effects of its actions, what is sometimes called a world model. People have complained that language models so far don't have a world model: they don't understand geometry, the spatial relationships of objects, or the effects of actions, basically how objects move according to physical laws. We are not quite there yet.

In another case, our robot is ready to deliver or throw away a can, but as you can see, we have a pre-programmed behavior of tucking the arm behind the body, and in doing that the can ends up upside down. If there is any liquid in the can it will spill and damage the robot. It is another example of how complex the real world is and how much there is to model: for our robots to have this sort of ambient intelligence, they really need to understand very nuanced details of the environment, the physical laws, and the effects of their actions. How do we do that?
There are many ways to achieve embodied intelligence. Throughout my PhD I was fascinated by the idea of creating interactive environments and letting agents explore in them: create environments that are complex enough that, if an agent is to survive in them, it must develop intelligence. This is an ecological view of perception and agency, popularized by the American psychologist James J. Gibson, who has a famous quote: "Ask not what's inside your head, but what your head is inside of." Humans acquired this type of embodied intelligence; we can manipulate objects effortlessly, first because of evolution and second because of childhood experience: we played with toys, interacted with them, and watched the physical effects, and that is how we learned. Similarly, we can give robots a safe playpen where they can explore, interact with the environment, play, watch the effects of their actions, and effectively learn how to manipulate objects.

I have been developing such simulation environments. One of them is the Gibson Environment, published at CVPR, which mainly aims at simulating the visual world faithfully, and the physical world to some extent. We scanned a large number of real houses to build this environment, and then we can spawn an agent in it, in this case a humanoid. The agent can learn to walk or run in the environment, and we simulate all of its perception, so we can create a perception-action loop for the agent. We can also put other types of agents in the environment, such as a little cart, a quadruped, or an "ant". Essentially, we create an environment where we can simulate perception for the agent, and then we train a neural network that maps perception to action. This gives us some sort of physical intelligence, mostly for navigation and locomotion, but it is not enough: the environment is one monolithic piece of mesh, so as you can see, the agent runs into the wall and just bounces back. There is no articulation in this environment, so it does not simulate the full complexity of the world, and the things the agent can do are rather limited.

That is why we created other simulation environments, one of which is iGibson, or interactive Gibson. We again scan a lot of real-world houses, but then convert them into CAD assets, mesh assets that are interactable. In this example a simulated agent goes into the environment and closes all the drawers. We can do that because we model the complexity of the world a bit more: we go beyond just modeling the visual world and start to model the physics, basically the degrees of freedom in the environment, so our agent can do more than just navigate around. We can go even further and model even more degrees of freedom, and the agent can develop more complicated behaviors, such as unloading a dishwasher, finding a bowl, taking it out, and putting it on the table. As we scale up the complexity of the environment, we are able to learn much more complicated skills in simulation. That is one way to achieve embodied intelligence: build simulation environments that are complex enough.
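To make the perception-action loop idea concrete, here is a minimal sketch of spawning an agent in a simulated environment and mapping its observations to actions with a small neural network. The environment class, observation contents, and action dimensions are illustrative assumptions, not the actual Gibson/iGibson API.

```python
# Minimal sketch of a perception-action loop in a simulated environment.
# "SimHouseEnv" is a stand-in for an interactive simulator (not the real Gibson API);
# the policy is a small MLP over flattened RGB pixels.
import numpy as np
import torch
import torch.nn as nn

class SimHouseEnv:
    """Placeholder simulator: returns random RGB observations and sparse rewards."""
    def reset(self):
        return np.random.rand(64, 64, 3).astype(np.float32)
    def step(self, action):
        obs = np.random.rand(64, 64, 3).astype(np.float32)
        reward, done = 0.0, np.random.rand() < 0.01   # sparse, mostly zero reward
        return obs, reward, done

policy = nn.Sequential(                    # maps perception (pixels) to action
    nn.Flatten(), nn.Linear(64 * 64 * 3, 256), nn.ReLU(),
    nn.Linear(256, 2), nn.Tanh(),          # 2-D action: e.g. linear and angular velocity
)

env = SimHouseEnv()
obs, done = env.reset(), False
while not done:
    obs_t = torch.from_numpy(obs).unsqueeze(0)        # add batch dimension
    action = policy(obs_t).squeeze(0).detach().numpy()
    obs, reward, done = env.step(action)              # close the perception-action loop
```

In the actual work the policy weights would be trained, for example with model-free reinforcement learning on the simulated reward, which is exactly why so many environment interactions are needed, as discussed next.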
And it is not just my research: the entire field of computer vision has been undergoing a paradigm shift. Previously we focused on internet AI: we curated large internet datasets to study problems like classification, segmentation, and detection, basically the classic computer vision problems. Now we focus much more on embodied AI, which adds the action dimension to the problems we study: visual navigation, manipulation, rearrangement, embodied question answering, instruction following. Simulators, in some sense, replace the original role of datasets. One thing that does not change is that data is still super important: we still rely on a large amount of data to learn intelligent behavior, whether it comes from a static dataset or from a simulator.

Learning in simulation can take a lot of interactions. To give you an example, in the iGibson environment we wanted to learn a behavior called "go into a room through a closed door", shown at the top right of the screen. It is a rather simple behavior: the agent needs to stop in front of the door at the right distance (if it stops too close it cannot extend its arm; if it is too far it cannot reach the door), then open the door, and once there is enough clearance, go through. However, it takes about 50,000 episodes, or 1.25 million environment interactions, to learn this behavior. That is because we are using model-free reinforcement learning: the agent is exploring the environment, it can push at any point and stop at any point, and although we give it a reward for going into the room, it is very rare that it stumbles upon this behavior.

I would argue that with foundation models we can do things very differently. What do you do nowadays? You just ask ChatGPT "how do you go into a room through a closed door?", and it will say "open the door, walk through the door". This is of course a gross simplification; the problem is not that simple. But the point is that we can leverage a lot of semantic priors from foundation models. If we really need a lot of data, a foundation model is a compressed version of the internet's data: it is a knowledge base you can query to accelerate the development of robotics. Of course, simulation and real-world data are still super important, but maybe we can get the best of both worlds: foundation models plus a limited amount of simulation or real-world data. That is what I am going to talk about today.

So where are we in terms of foundation models plus robotics? Our team at Google DeepMind has been pioneering foundation models plus robotics. We developed high-level planning algorithms, one of the first being PaLM-SayCan, an algorithm that can parse a user command. Here is a demo scenario with the command "I spilled my Coke on the table, how would you throw it away and bring me something to help clean?" It queries a large language model, which gives a score, highlighted in blue, and there is also an affordance score that tells you whether an action is possible in the given state. The affordance score augments the language model so that it only proposes possible things. Essentially it does the semantic planning with a language model while also taking into consideration what the robot can actually do: language models tend to hallucinate, but here the system only outputs what is possible and actionable for the robot, and the robot executes the step that advances the long-horizon task. Each step is then executed by a low-level policy.
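A minimal sketch of this SayCan-style selection, combining a language-model score with an affordance score. The function names and the toy numbers are hypothetical stand-ins, not the actual PaLM-SayCan implementation.

```python
# Sketch: pick the next skill by combining an LLM score with an affordance score,
# in the spirit of SayCan. All functions and numbers below are illustrative.
import math

def llm_log_likelihood(instruction: str, history: list[str], skill: str) -> float:
    """Stand-in for scoring how well `skill` continues the plan under the LLM."""
    return {"pick up the sponge": -0.4, "pick up the coke can": -0.9,
            "go to the table": -1.2}.get(skill, -5.0)

def affordance(skill: str, state) -> float:
    """Stand-in for a value function: probability the skill succeeds from `state`."""
    return {"pick up the sponge": 0.8, "pick up the coke can": 0.1,
            "go to the table": 0.9}.get(skill, 0.0)

def select_skill(instruction, history, skills, state):
    # Combined score = p_LLM(skill | instruction, history) * p_success(skill | state)
    scores = {s: math.exp(llm_log_likelihood(instruction, history, s)) * affordance(s, state)
              for s in skills}
    return max(scores, key=scores.get)

skills = ["pick up the sponge", "pick up the coke can", "go to the table"]
print(select_skill("I spilled my coke, bring me something to clean", [], skills, state=None))
```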
In this demo it does not actually wipe the table, because we have not added that low-level skill, but if there were a low-level skill to clean the table, it would finish the entire task. The low-level policy used here is Robotics Transformer 1, RT-1, our team's homegrown Transformer. We collected a large dataset of human demonstrations and trained a Transformer on these expert trajectories; it can do about 700 tasks with a 97% success rate, and it shows interesting generalization behavior, for example operating in a new kitchen it has never seen before. So there is a successful recipe for applying foundation models to robotics. That is roughly where we are, and I will talk about a few new works that bring this to the next level.

Actually, my teammate Ted gave a talk on foundation models plus robotics at the beginning of this year, also in this class, CS25. I highly recommend it; it is available on YouTube, and I actually watched it last night so that I don't repeat the content. He described our team's progress in building robotic foundation models: we took a few detours and have now more or less figured out a recipe. 2021 to 2022 was about how to scale to many tasks with demonstrations, how to collect a large amount of data, in fact about 100,000 demonstrations, and we tried different approaches: behavior cloning, imitation learning plus reinforcement learning, and other ways of combining them with language models, such as SayCan. 2022 to 2023 was about how to leverage foundation models to accelerate robotics, and we saw a proliferation of such work, both for high-level planning and for low-level control, leaning more toward high-level planning. The recipe is essentially to combine a large-scale, diverse offline dataset with a high-capacity architecture such as a Transformer, using language as a universal glue. That would be the recipe for building foundation models for robotics.

Okay, so if this recipe works, what do we do next? Essentially, just scale everything by orders of magnitude, be done with it, and solve robotics. And guess what, that's what we did, so that's the end of the lecture, maybe we can cut this a little short. That's a joke; that's not happening. We are still on our quest to solve low-level embodied intelligence. When I tell people that you can use foundation models for robotics, their reaction is usually that foundation models mostly do high-level reasoning and don't really handle low-level manipulation very well.
And that is for a reason. One reason is Moravec's paradox: the observation that in artificial intelligence and robotics, contrary to traditional assumptions and our intuitions, reasoning requires very little computation, while sensorimotor control and perception require enormous computational resources. That is because, as biological creatures, we acquired our sensorimotor skills through evolution. We may not be great at large-scale computation or formal reasoning, but sensorimotor control is integral to our survival, so it is essentially already encoded in our DNA. In robotics it is the other way around: chatbots are very good at reasoning and computation, but they have not experienced the world and have not acquired the sensorimotor skills necessary to do tasks in the real world. For example, when the computer beat Kasparov, the human chess champion, someone still had to move the chess pieces for it. Similarly, in the AlphaGo moment, when Lee Sedol was beaten by AlphaGo, there was still a person moving the stones; it was not a robot doing that. This shows that the hard things are easy and the easy things are hard.

Another thing that prevents us from using foundation models more widely in robotics is training data bias. The training data of foundation models and large language models consists mostly of language tasks. It is perhaps not surprising that a model knows how to clean up a kitchen, because there are wikiHow articles teaching you how to do that in a procedural way, but there is no wikiHow article teaching you how to move your finger five centimeters to the left, because people simply do not write that down. There is very little low-level control data in the training corpora of large language models. So we have real challenges in bringing foundation models down to the low level, and that is what I mean by low-level embodied intelligence. Any questions so far? I want to keep this interactive, so feel free to interrupt me at any time. All right, let's continue.

There are a couple of challenges in using large language models for low-level control. The first, as I just mentioned, is lack of data. We only have perhaps 100,000 episodes of human demonstration data, and it took about 13 robots 17 months to collect, a huge amount of effort. By contrast, large language models are trained on the order of a trillion tokens: PaLM was trained on 780 billion tokens, and a model trained following the Chinchilla rule would need on the order of 1.35 trillion tokens. There is a huge discrepancy between how much data we can collect in robotics and how much is available for large language models, so we will always be bounded by robot data. Maybe we can scale on other fronts instead: keep the robotics data the same and scale the pre-training mixture of text and images, or image-text pairs. Maybe we can bake this cake where the robotics data is just the cherry on top.
And then scale the foundation really well. Some of the work I am going to talk about today actually reuses the RT-1 data; we did not collect new data for RT-2, but we want to do more with the same amount of data.

The second challenge is related to the first: language models lack an interface for low-level control. If you ask a language model how to make a robot dog stand up on two feet, it will tell you many things that sound reasonable and plausible: the robot dog's torso is upright, balanced over the two hind feet, standing shoulder-width apart. This is all great, but we cannot put it on the robot. On the other hand, we could ask the language model to write control code that directly controls the robot, but that usually requires curating an API that is friendly to the language model. If I directly ask it for joint angles that make the robot stand upright, it will not give the right answer, because it does not have enough context. Essentially, large language models don't speak "robot language". Can we find the right robot language, the interface between large language models and robot control, or can we just treat robot actions as another language? That is what we want to find out.

So today's agenda is low-level embodied intelligence with foundation models, in two parts, addressing the two challenges I just mentioned. Part one is about model consolidation, joint scaling, and positive transfer; I put them in one part because they are closely related. Part two is about developing new interfaces for large language models.

[Audience question] Why not fine-tune a language model to directly output low-level code or actions?
That's a great question. I will be talking about RT-2, which does something quite similar: it fine-tunes a language model to output actions as a language, to output our action representation. There are downsides to that, for example you need to collect additional data for fine-tuning. So we can either fine-tune, or use a language model zero-shot if we find the right interface, which I will talk about in part two.

Model consolidation means doing the high-level reasoning and the low-level control in one model. Joint scaling means we not only scale the robot data, which is expensive, but also scale the pre-training data, or start from a pre-trained vision-language model. Positive transfer means the model benefits from diverse joint training across internet-scale language, vision, and vision-language domains combined with robotics. This is a continuation of the axes that Ted drew in his previous talk, and you can see a trend. This visualization highlights some of the work on our team; each column is a robotic system that can do both high-level reasoning and low-level control. Previously we needed separate models for each piece: in the initial release of SayCan, the planning was done by a large language model, the affordances were estimated by a QT-Opt-like policy trained with sim-to-real, and the low-level policy was Robotics Transformer 1.
So each model did its own dedicated thing, and we had to train each model differently, perhaps with different types of data. Later we had Q-Transformer, an offline RL method that leverages a Transformer architecture. It is a high-capacity architecture that can train on both positive and negative data, and with it we get a policy that also understands affordances, so we can unify the low-level policy and the affordance model, but the planning is still a large language model. Then we have PaLM-E, a vision-language model, a large language model also trained on vision-language domains, which can do planning and affordance in one model, but the low level is still RT-1. Finally we unify everything: RT-2, which I will talk about today, can do high-level planning to some extent, generate affordances, and produce low-level actions.

Behind model consolidation is the consolidation of tasks: we can represent every task as a vision-plus-text-to-text task. It is a really universal representation, so you can train on a lot of data, and you see positive transfer; learning affordances can also tell you how to achieve a task, and there is transfer between tasks when you pool them all together.

To understand this joint scaling and model consolidation, we need to understand PaLM-E a little. PaLM-E is an embodied multimodal language model. It is based on the PaLM architecture; PaLM is a large language model, and we made some adaptations to the architecture so it can understand multimodal input. In a large language model, each word is tokenized and mapped to an embedding, which is fed into the model. In PaLM-E, instead of using only words, we can use multimodal tokens, which can come from a vision Transformer (ViT) or from robot sensory data. We map every multimodal token into the text embedding space: we train a linear, affine transform between the multimodal tokens and the text embedding space, and then we can treat multimodal tokens just like words. Essentially we take a language model as a solid base and adapt it to understand multimodal tokens. Interestingly, this does not require a ton of adaptation or fine-tuning; the model aligns naturally to multimodal input such as images, and I will show a couple of examples of what it can do. We can also train it the same way we train large language models, reusing the same infrastructure and training algorithms.
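Here is a minimal sketch of the token-interleaving idea just described: multimodal tokens are projected into the text embedding space and mixed with word embeddings in one sequence. The dimensions, module names, and toy encoders are illustrative assumptions, not the actual PaLM-E implementation.

```python
# Sketch: interleave projected image tokens with text token embeddings,
# in the spirit of PaLM-E. Dimensions and stand-in encoders are illustrative.
import torch
import torch.nn as nn

d_vit, d_text = 1024, 4096                 # ViT token dim, LLM embedding dim (assumed)
project = nn.Linear(d_vit, d_text)         # learned affine map: ViT space -> text space

def embed_multimodal(prompt_parts, text_embedder, vit_encoder):
    """prompt_parts: list of strings or image tensors, in the order they appear."""
    pieces = []
    for part in prompt_parts:
        if isinstance(part, str):
            pieces.append(text_embedder(part))           # (n_text_tokens, d_text)
        else:
            vit_tokens = vit_encoder(part)               # (n_img_tokens, d_vit)
            pieces.append(project(vit_tokens))           # mapped into the text space
    return torch.cat(pieces, dim=0)                      # one sequence for the LLM

# Toy stand-ins so the sketch runs end to end.
text_embedder = lambda s: torch.randn(len(s.split()), d_text)
vit_encoder   = lambda img: torch.randn(16, d_vit)       # e.g. 16 image tokens
seq = embed_multimodal(["Given <img>", torch.zeros(3, 224, 224), "what should the robot do?"],
                       text_embedder, vit_encoder)
print(seq.shape)   # (num_tokens, d_text), ready to feed into the language model
```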
A couple of other things we found along the way include positive transfer, which I will share in a bit. I also want to mention that PaLM-E is one of the largest models we have explored so far: it has 562 billion parameters, obtained by concatenating the 540-billion-parameter PaLM and the 22-billion-parameter ViT. We find a lot of emergent capabilities in these models that we had not expected at training time; we can prompt them to do interesting things. We have also explored using a neural scene representation, an object-centric representation, fed into PaLM-E. An object-centric representation assigns one token to each object, and we find this representation is very helpful for robot planning tasks, because the traditional ViT representation is based on a grid and does not have a full understanding of objects and their relationships. We have done an extensive study of the scaling behavior, catastrophic forgetting, and other interesting experiments in the paper, so please refer to the paper for more.

Here I will just show some interesting qualitative examples, some emergent capabilities of PaLM-E that we found. First, the model has some reasoning capability: you can give it an image and ask questions that require a bit of reasoning, and you can prompt it with "let's think step by step", a technique used to elicit reasoning in large language models. In multimodal language models you can do the same; people are also experimenting with this these days with GPT-4V, prompting it to think step by step or count row by row, but this was before GPT-4V, and we were able to elicit reasoning with such prompts. For example, we can ask "in this photo, are there more cats or more dogs? Let's think step by step", and PaLM-E works out that there are equal numbers of dogs and cats. On the right, given an image, "can I go down this street on a bicycle, yes or no? Let's think step by step", and the reply is: "do not enter", "except bicycles", "do not enter except bicycles", so yes. It is doing this multimodal reasoning, mixing an understanding of symbols with an understanding of text. This was quite amazing to me; to be honest, when I first saw it I did not expect a multimodal language model to be able to do that.

We also tried something that is traditionally very difficult for language models, which is telling a joke. Language models can understand jokes, but they often cannot tell one: they produce something that sounds plausible and joke-like, but when it comes to the punch line they do not really know what to say. Here I give it an image and ask it to first come up with a description and then a joke, which guides the model to think step by step. The description is "a donkey is carrying a dog, a cat, and a rooster", and the joke is "what do you call a donkey with a rooster on his back? A rooster booster." It is very creative; when I saw this I was pleasantly surprised, and I searched online and could not find this joke anywhere, so it is an original joke by PaLM-E. Finally, we see some math reasoning with this model: I give it a messy menu from a pizza store and ask "I'm buying a pizza for me and my friend, how much should I pay? Let's think step by step", and it figures out that a pizza is $9.99 and tells you the price. In some of the answers it even tries to calculate tax, but the tax rate is hallucinated, so that part does not work. All right, let's talk about positive transfer.
Apart from the amazing things PaLM-E can do, it also shows interesting positive transfer behavior. When we train PaLM-E on a single domain, just a single robotics task, the performance is not great, but when we pool all the data together and also include internet-scale vision-language tasks such as captioning and visual question answering, it does much better. This shows it is important to mix all the data together and train jointly: the internet-scale data acts as a regularizer that keeps the model from forgetting its representations, and those representations are in turn very useful for robotics. That is the positive transfer result, and we are starting to see more and more positive transfer in our other studies.

[Audience question] How much data did you collect, in simulation or the real world? The block sorting on the table looks very impressive.
That's a very good point. These are all planning data, high-level planning data, so let me talk about two things. First, the sorting results: the low-level policy is still a traditional controller, a policy called LAVA, and that policy is trained on 68,000 episodes. The high-level planning is probably easier than you think, because it only needs to issue commands to the low-level policy, such as "put the red block in the top-left corner", "put another red block in the top-left corner". It is a fairly standard autoregressive language-modeling task; the only thing it needs to do is determine which tasks are not finished yet. For example, if a block is already in the corner, it should not call the low-level policy to move it to the corner again. So it is mostly about parsing and understanding the state, and this high-level policy only requires about 50 to 100 demonstrations to learn, which is quite data efficient. In the future, and this is a very good point, a lot of these tasks could be taught in context: maybe we demonstrate a task once to the large language model and it then knows how to do it.

[Audience question] How does the language model know which level it is operating at?
This is also through human demonstration. At the low level, a human can demonstrate a skill by teleoperating the robot. At the high level, imagine your control interface is text: a human can guide the low-level policy through text to accomplish a task, and that data can then be used to train the large language model. That is the block sorting case. The second case is a bit more interesting, because the planning steps are actually generated by PaLM: we essentially distilled PaLM plus the affordance model into PaLM-E, which is like using AI data to bootstrap itself. That one has about 3,000 episodes, also not a lot, but it is able to learn complex planning behavior, replanning, and error recovery, which I will show on this slide. With PaLM-E as the high-level planner, we can do "take the rice chips out of the drawer", with a twist: I will be messing with the robot. As it puts the chips on the counter, I put them back in the drawer; it picks them up again, and I put them back again.
It understands the state, it understands that the task is not finished and it cannot proceed to the next step, and once I stop messing with it, it closes the drawer and picks up the bag of chips. So PaLM-E is able to combine affordance and planning in one model and do complex reasoning about the scene and the environment. Interestingly, we can use the exact same model checkpoint to do block sorting as well: the same checkpoint that reasons about how to bring a bag of chips to a user can also sort blocks, and it responds to adversarial perturbations; if the user puts a block back in the middle, it is able to recover. These all come from the same model, and it can tell jokes too. That is the power of vision-language models.

Now we want to go a level deeper. These are all vision-language models used for planning or high-level reasoning; can we use them for low-level control? It turns out we can, and that is the RT-2 work: a vision-language-action model that transfers web knowledge to robotic control. When asked to "pick up the extinct animal", with a whole range of objects on the table, it picks up the dinosaur: it can link "extinct animal" to the dinosaur, and then to the action that picks the dinosaur up. It is really doing this emergent reasoning and the manipulation in just one model. By the way, the robot has not seen any of these objects before, at least not in the robot training data; it might have seen them in an internet catalog, but never in the robotics training data. It is interesting how we have to evaluate these robots nowadays: when we evaluate language models, to prevent data contamination you have to give them new questions every time, because otherwise they might have memorized the answers from training. When we evaluate these robots, we literally go to the dollar store to buy new toys, to make sure the robot has not seen them before. As we run more evaluations there may be some repetition, but as you can see, it is able to pick up the dinosaur toy.

How did we do that? We start from a vision-language model trained on internet-scale data, combine it with robot action data, the RT-1 data, and we get RT-2. Let's dive a little deeper into RT-2. First, what is a vision-language model? It is a Transformer that takes in images and text and outputs text. Within Google there is a vision-language model called PaLI, which is an encoder-decoder architecture: a ViT to understand images, a Transformer encoder, and a Transformer decoder. These models capture both the visual and the semantic aspects of the world, and in robotics we have to deal with a lot of both, so the question is whether we can leverage the knowledge in vision-language models and apply it to robotics. On the other hand we have RT-1; if you want to learn more about RT-1, the previous CS25 episode by Ted gives a detailed introduction. But if you stand far enough back, RT-1 is also a vision-language-to-action model: it takes in a human instruction and the current camera image;
the image passes through a FiLM EfficientNet, which tokenizes it into 81 tokens, a TokenLearner then compresses everything into 8 tokens, and a Transformer block with several self-attention layers generates the actions. The action is also tokenized: the robot has seven degrees of freedom for control; the end effector has six degrees of freedom, position and rotation, the gripper can open and close, and there is one more dimension representing whether to terminate the episode, where terminating means the task is done. We discretize every dimension into 256 bins and use a cross-entropy loss over those bins. That is the RT-1 architecture in a nutshell; it is quite similar to a vision-language model, just with different output tokens. So it is rather natural to use a large pre-trained vision-language model directly as the policy: we can use PaLI or PaLM-E as a policy. One question is how to deal with actions when using pre-trained vision-language models, and here is the action representation we use: the robot action has the eight dimensions I mentioned, termination, position change, and rotation change, and we discretize everything into 256 bins. We tried other, alternative representations, but they were not as good as this naive one.

[Audience question] What is the FiLM EfficientNet?
The FiLM EfficientNet is a pre-trained convolutional neural network used to tokenize the images. Through ablation studies we compared different ways of tokenizing the image, for example with a plain ResNet or with a FiLM EfficientNet. FiLM means the network also takes the language embedding and injects it into its intermediate layers, so vision and language are fused early.

[Audience question] Is the action represented as code?
The action is not code; it is text, basically what is shown here: eight numbers, each ranging from 0 to 255. And one more note on the FiLM network: this is about how we tokenize images and how we combine vision and language information, and there are many ways to do it. There is early fusion, late fusion, and cross-attention, where you tokenize the image by itself and then use cross-attention to combine the image and text representations. Here we are using this particular model because this is RT-1, built for robotics, so we have constraints such as latency; the FiLM EfficientNet is fast and outputs a limited number of tokens, which we can further compress with the TokenLearner.

[Audience question] And this is autoregressive, at every single step?
Yes, it is autoregressive, and each time we use a history of up to six steps, so the model sees the current image plus about two seconds of history as its input. Again, if you have more questions about RT-1, I recommend watching the previous episode.
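A minimal sketch of the action discretization just described: eight dimensions, 256 bins each, tokenized to integers and detokenized back to continuous values. The bin count matches the talk, but the per-dimension ranges and ordering are illustrative assumptions, not the actual RT-1/RT-2 scheme.

```python
# Sketch: discretize an 8-D robot action into 256 bins per dimension and back.
# The per-dimension bounds below are illustrative assumptions.
import numpy as np

NUM_BINS = 256
# [terminate, dx, dy, dz, droll, dpitch, dyaw, gripper] with assumed bounds
LOW  = np.array([0.0, -0.05, -0.05, -0.05, -0.25, -0.25, -0.25, 0.0])
HIGH = np.array([1.0,  0.05,  0.05,  0.05,  0.25,  0.25,  0.25, 1.0])

def tokenize_action(action: np.ndarray) -> np.ndarray:
    """Continuous action -> 8 integer tokens in [0, 255]."""
    frac = (np.clip(action, LOW, HIGH) - LOW) / (HIGH - LOW)
    return np.minimum((frac * NUM_BINS).astype(int), NUM_BINS - 1)

def detokenize_action(tokens: np.ndarray) -> np.ndarray:
    """8 integer tokens -> continuous action (bin centers)."""
    return LOW + (tokens + 0.5) / NUM_BINS * (HIGH - LOW)

tokens = tokenize_action(np.array([0.0, 0.02, -0.01, 0.0, 0.1, 0.0, -0.05, 1.0]))
print(tokens)                      # e.g. "0 179 102 127 178 127 102 255" as text for the model
print(detokenize_action(tokens))   # approximately recovers the original action
```

During training, each of the eight bin indices is treated as a classification target with a cross-entropy loss, which is why the representation plugs so naturally into a language-model-style decoder.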
Now, about RT-2. We convert the action into a string of numbers, and that string is the output of our Transformer, which is a vision-language model. We tried other alternatives: floating-point numbers are not very friendly to a language-model tokenizer because of the decimal points, and we also tried human language such as "left" or "right", which is a more semantic representation, but those cannot be directly executed on the robot, which is a limitation of that approach. If we commit to this action representation, just a string of numbers, we essentially get a vision-language-action model. We tried different variants, including PaLI-X, a Pathways language-and-image model with 5-billion-parameter and 55-billion-parameter variants, and PaLM-E with 12 billion parameters. The training procedure is co-fine-tuning: we put the internet-scale data and the robot data together and fine-tune on this mixture, so the model retains the internet-scale knowledge. Maybe that is an artifact of our robot data being too small and not diverse enough: if you fine-tune only on robot data, the model quickly overfits and forgets the pre-training mixture. Maybe it is a dynamic of scale; we will see.

At inference time we run this autoregressively. We have a task instruction, and we format it as question answering: "what should the robot do to achieve <task>", where the task is the string the human gives the robot. We also have the current observation, the RGB camera image, which passes through the ViT and then the large language model, which outputs a list of tokens. We use constrained decoding to make sure there are always eight numbers, because otherwise we cannot detokenize; it is very easy for a language model to simply drop one number, so we have mechanisms such as constrained decoding and beam search to make sure the format is correct. Once we get the string of numbers, we detokenize it into a delta translation and delta rotation of the end effector, and the robot directly executes it. Then we repeat the process: we get a new image, run it through the model, get a new action, and continue until a termination action is decoded. Some people might be concerned that this is slow, and it is in fact quite slow: with 5 or 12 billion parameters we cannot run it on the robot, so we run it on a TPU cluster and the robot queries the cluster for the numbers. The 12-billion-parameter model can actually run at about 10 Hz, which is quite fast, and all of the models can run at at least 3 Hz, which is sufficient for controlling the robot.
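A hedged sketch of this closed-loop inference procedure: format the instruction as a question, decode eight action tokens under a format constraint, detokenize, execute, repeat. The model call and the robot object are placeholders, and `detokenize_action` is the helper from the earlier discretization sketch; none of this is the real system's API.

```python
# Sketch of RT-2-style closed-loop control with placeholder model and robot calls.
import numpy as np

def vlm_decode_action_tokens(image, question):
    """Placeholder for the vision-language model with constrained decoding:
    it must return exactly 8 integer tokens, each in [0, 255]."""
    return [0, 130, 127, 120, 127, 127, 127, 255]

def run_episode(robot, instruction, max_steps=200):
    prompt = f"What should the robot do to {instruction}?"
    for _ in range(max_steps):
        image = robot.get_camera_image()                 # current RGB observation
        tokens = vlm_decode_action_tokens(image, prompt)
        action = detokenize_action(np.array(tokens))     # back to continuous values
        if tokens[0] > 127:                              # termination bin decoded (assumed rule)
            break
        delta_pose, gripper = action[1:7], action[7]
        robot.apply_end_effector_delta(delta_pose, gripper)  # execute, then re-observe
```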
We see a lot of emergent skills that are not in the training set. As I mentioned, we are probing what RT-2 can do; we actually don't know, so we test it with many new tasks, such as "put the strawberry into the correct bowl", or "move the banana to Germany", just to test its understanding of symbols and flags, or "pick the land animal" when there is a horse and an octopus, basically testing semantic reasoning together with low-level manipulation skills. We divide the tasks into symbol understanding, reasoning, and human recognition. On average, RT-1, which is not trained on internet-scale data, does quite poorly on these emergent evaluation tasks, while the RT-2 variants, which are co-fine-tuned on internet data and robotics data, do much better. There is also an effect of scale: RT-2 with the 55-billion-parameter PaLI-X performs better than the 12-billion-parameter variant, although they perform quite similarly on in-domain tasks; the generalization is where the larger scale seems to help. Here are some videos of the robot achieving these tasks: moving the banana to a number, putting the strawberry into the correct bowl, moving the Rubik's Cube toward the water bottle while I give the instruction in Chinese, moving the banana to the German flag. In the quantitative evaluations we also found that the RT-2 policy is quite robust to unseen objects, unseen backgrounds, and unseen environments.

Here is another piece of evidence for positive transfer: co-fine-tuning with VQA data outperforms fine-tuning on robot data only, and training on robot data from scratch barely works at all, because it overfits to the robot data, which is just too small. So we do need co-fine-tuning, or at least fine-tuning from a pre-trained model, so that it retains its internet-scale knowledge. This is also a recipe for how one would develop a domain-specific vision-language model: you start from a very general vision-language model and fine-tune, or co-fine-tune, on your specific domain data. This is a problem that every vertical of artificial intelligence will probably encounter someday. We also tested on other platforms, which shows some cross-embodiment transfer: RT-2 with the 3-billion-parameter PaLI outperforms previous models at moving blocks around in a tabletop environment.

In large language models we have chain-of-thought reasoning, a method to elicit reasoning: you can do zero-shot chain of thought by saying "let's think step by step", or give examples of reasoning; basically the model decodes more tokens before coming to its conclusion. We can use a similar procedure for RT-2. In RT-2 PaLM-E, instead of directly decoding the actions, we can decode a plan first and then append the actions. This gives the language model an opportunity to parse the question differently and to reason about it a little: for example, if you say "bring me a drink", it will say "pick up 7up can", because there is a 7up can on the table. We synthesized a couple hundred such examples using a large language model, simply by augmenting the instructions, and then fine-tuned RT-2 for just a couple hundred steps, so it is somewhere between full fine-tuning and in-context learning, and it is able to do some reasoning. Some interesting reasoning tasks include "I need to hammer a nail; which object from the scene might be useful?", where the scene contains a headphone, a rock, and a sticky note; the robot says "rocks" and then generates the actions to pick up the rock. Here is a demonstration of chain-of-thought reasoning with RT-2 PaLM-E: the task is "pick up the thing that is different from all the other objects", and it picks up the chocolate, because that is a snack and the other objects are drinks. I can also give the instruction in a different language, and the plan is to translate it into the language the model is most familiar with, which is English, and then do the task.
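To make the plan-then-act format concrete, here is an illustrative example of what a chain-of-thought-augmented training example might look like; the field names and the action token values are assumptions, not the actual RT-2 data format.

```python
# Illustrative chain-of-thought-augmented example for a vision-language-action model:
# the target sequence contains a short plan followed by the discretized action tokens.
example = {
    "image": "<current camera frame>",
    "prompt": "What should the robot do to bring me a drink?",
    "target": "Plan: pick up 7up can. Action: 1 128 91 241 5 101 127 255",
}
# At inference time the model decodes the plan tokens first, then the eight action
# numbers, which are detokenized and executed exactly as in the non-CoT case.
```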
There are also potential failure cases of the chain-of-thought reasoning. Here I say "move the green objects together", and as you can see the robot oscillates between the two green objects, because there are two possible plans: it could move the can toward the bag of chips, or move the bag of chips toward the can. It oscillates between the two plans until one action brings it close to one object, and then it commits to that plan rather than the other. It is not always guaranteed to work, but it is quite interesting, and it is also interesting that we are now testing a manipulation policy the way we test the intelligence of humans, animals, or kids, because these policies are getting more and more advanced.

As a summary, we have a vision-language-action model that achieves improved generalization: it can do new tasks and operate on new objects, and it can do chain-of-thought reasoning. By improving the underlying model, for example scaling up the vision-language model or training it on larger or higher-quality internet-scale data, we get better robot control. That is quite remarkable, because robotics has traditionally developed slowly, bounded by hardware, by operations, by many different things, but now it seems we can piggyback on the development of the foundation-model field, and whatever they achieve will trickle down to our field as well. Future work will be to increase the motion diversity, extend the chain-of-thought reasoning capability, and much more.

There is another example of positive transfer, which you might have seen recently. So far I have been talking about scaling differently: don't scale the robotics data, scale the other data, because robot data is so hard to collect. The purpose is not to avoid collecting robot data; it is to develop a recipe that lets you do more with limited robot data. However, there is also an effort from our team and the entire robotics field to scale up robot data collection, called Open X-Embodiment, and the model trained on it is called RT-X, Robotics Transformer X. It pools together 22 types of embodiments, 527 skills, and 60 datasets. This will be the ultimate dataset for studying positive transfer and joint scaling, and there is already evidence of positive transfer: we pooled all the data from these labs, found a common action representation that we can use to train a robotics Transformer, and found that the jointly trained model can outperform the task-specific models developed in each individual lab. So there are real benefits to pooling all the data together, and scaling robot data is also quite important.

The summary for this part is that we are seeing model consolidation: we can now do the high-level reasoning and the low-level control in one model, and the low-level control part is what excites me most, because it is so far from the traditional language-model domain; it shows signs of life that much more can trickle down than we used to think possible. We can scale the pre-training of vision-language models as well as the robotics data, and we observe more and more positive transfer, with models benefiting from
diverse joint training across internet-scale language, vision, and vision-language domains.

All right, I notice we are close to running out of time, so I will go through the second part quickly and only at a high level; I think it is also interesting. It is about finding new interfaces for language models. As we have seen, language models can directly output action tokens if we find an action representation, so we can treat action as yet another language; language models can do translation, so they should be able to generate actions as well. But that requires fine-tuning. Can we do it without fine-tuning, or can we generate more expressive actions beyond the scope of that fine-tuning? That is about finding the right interface. We have already established that a language model does not have an action interface, or at least not an effective one, so what is the best interface between language and low-level actions? I would argue that the best interface between the language model and low-level actions is reward functions. Reward functions are universal, they have long been used in reinforcement learning, and they are a reparameterization of actions. What is an action? Say I want to pick up this bottle. One definition of a skill is a mapping from my observations to my actions. But a skill has an alternative definition: a set of constraints and objectives. Picking up the bottle means the bottle is in my right hand and the bottle is off the supporting surface; that is what picking up means, and how exactly I pick it up does not really matter. That is a broader definition of a skill, and it transfers better across embodiments. These constraints and objectives can be represented as rewards, so we can ask a language model to generate reward functions, and then an optimizer, which could be reinforcement learning or model predictive control, optimizes for those rewards, and we run the result on the robot.

What is inside the reward translator? Let's open the box. The reward translator is a two-stage process that uses the same language model with two different prompts. The motion descriptor describes the motion: earlier we saw that the language model can output a description of how a robot dog should stand up; it cannot execute it, but the description is still sensible, it gives you the right thing. So we first generate this motion description, and then a reward coder translates the motion description into a piece of code representing reward functions. These reward functions cannot be executed directly on the robot, but they go through an optimization process that learns how to achieve them. So we use reward as the interface between the language model and the low-level controller. For the low-level controller we use MJPC (MuJoCo MPC), a model predictive control algorithm; it is essentially a black-box controller that samples many trajectories and finds the one that optimizes your reward. We test this on a quadruped robot and on a dexterous manipulator, an arm with six or seven degrees of freedom plus a multi-fingered hand, which has so many degrees of freedom that it is very challenging to control.
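A minimal sketch of the two-stage reward-translator pipeline just described: one language-model call produces a motion description, a second call turns it into reward-setting code, and an MPC-style optimizer consumes the rewards. The prompt wording, `call_llm`, `mpc_optimize`, and the `set_reward` API mentioned in the prompt are placeholders, not the actual system.

```python
# Sketch of the language-to-reward pipeline: motion descriptor -> reward coder -> MPC.
# `call_llm` and `mpc_optimize` are placeholders standing in for the real LLM and MJPC.
MOTION_DESCRIPTOR_PROMPT = """Describe the desired robot motion in plain language,
using quantities like torso height, body pitch, and foot positions.
User instruction: {instruction}"""

REWARD_CODER_PROMPT = """Translate the motion description below into Python calls to
set_reward(name, target, weight), using only the documented reward terms.
Motion description: {motion_description}"""

def call_llm(prompt: str) -> str:
    raise NotImplementedError("placeholder for the language-model call")

def mpc_optimize(reward_code: str):
    raise NotImplementedError("placeholder for the MJPC-style optimizer")

def language_to_reward(instruction: str):
    motion = call_llm(MOTION_DESCRIPTOR_PROMPT.format(instruction=instruction))
    reward_code = call_llm(REWARD_CODER_PROMPT.format(motion_description=motion))
    try:
        compile(reward_code, "<reward>", "exec")       # sanity-check the generated code
    except SyntaxError as err:
        # Feed the error back to the reward coder only; no need to redo the description.
        reward_code = call_llm(REWARD_CODER_PROMPT.format(motion_description=motion)
                               + f"\nPrevious attempt failed with: {err}. Please fix it.")
    return mpc_optimize(reward_code)
```

The two-stage split matters: the motion description keeps the model in a register it is good at (natural language about motions), and only then is that description compiled into the more obscure reward representation.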
To showcase some of the examples, I omit the motion-description part and only show the reward code. It turns out the language model is able to generate the right reward functions to make the robot stand up on its two back feet like a human. Then we got a bit more ambitious: we know it can stand up, so can we make the robot do a moonwalk while standing up? The moonwalk is from Michael Jackson and it is very challenging; how do we make the robot do it? The model generates a motion description and the reward code, but the motion is not quite correct, not exactly what we want. The nice thing about using a language model and reward functions is that you can coach the robot: you can go back, explain what went wrong, and ask the language model to fix it. So you can very patiently say: "moonwalk means the robot should walk backward while the feet swing as if they were moving forward; correct your answer and make it work at a speed of 0.5 meters per second." Such a great explanation, kudos to my colleague. After you patiently give it the right instruction, it modifies the motion description and generates the right set of rewards to make this happen. Now you can teach a robot to moonwalk just by using language as the interface, and one day we will be able to do this on the real robot as well.

[Audience question] In the previous section you showed how you constrain the model to generate valid numbers; here, how do you prevent it from just hallucinating?
That's a great question. In this work we do not prevent hallucination in a fully programmatic way; we have a set of system prompts, a set of rules explaining the API. After all, the reward functions need to compile for the optimizer, so we do need some checks. What's more, if the code does not compile, we can simply give the error message back to the language model; the error does not have to propagate all the way to the motion descriptor, it can stay at the reward coder: if there are errors, please fix them, and after that it is usually able to fix itself.

We can also chain multiple tasks together in this framework: we can say "open the drawer, take the apple, put it into the drawer, and close the drawer", and it will do that. We tried using only the reward coder, and that is not good enough; the two-stage prompting really helps. I think that is an inspiration for other fields as well: when your domain is too different from the language domain, it may be good to find an intermediate representation and ask the language model to explain things in that intermediate representation before going directly to a more obscure one. Finally, we want to transfer this to the real world, and there is a challenge: in simulation the model might generate motions that are too dexterous, like this one, which is not possible in the real world. So we add a few regularizer terms to stabilize the motion, and we also run state estimation on the real robot so it knows where the cubes are; then we can take the motion from simulation and achieve it in the real world.
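For concreteness, here is an illustrative guess at the kind of reward code the reward coder might emit for the "stand up on two back feet" example shown earlier. The reward term names, target values, weights, and the `set_reward` interface are all hypothetical, not the real API.

```python
# Hypothetical reward code for "stand up on the two back feet like a human".
# Term names, targets, and weights are illustrative, not the real API or values.
def set_reward(name: str, target: float, weight: float) -> None:
    """Stub for the optimizer-provided reward-setting call (assumed interface)."""
    print(f"reward term {name}: target={target}, weight={weight}")

set_reward("torso_height",      target=0.55, weight=1.0)   # raise the torso
set_reward("body_pitch",        target=1.40, weight=1.0)   # pitch the body upright (radians)
set_reward("front_feet_height", target=0.40, weight=0.5)   # lift the front feet off the ground
set_reward("back_feet_contact", target=1.0,  weight=1.0)   # keep both back feet planted
set_reward("base_velocity",     target=0.0,  weight=0.3)   # stay in place while balancing
```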
Here are some of the executions in the real world: you can say "pick up the Rubik's Cube" and it will generate the motion to pick up the Rubik's Cube. This is quite different from RT-2: the motions are quite smooth and quite fast, much faster than 3 Hz; here it can run at 10 Hz or even 30 Hz, so it's comparable with a human.

All right, so that's language to reward. There is one last thing I want to talk about in terms of finding a new interface. A lot of the time we have been thinking about the language model as a semantic engine, a semantic machine: it understands semantics. For example, if you say "the student takes out the...", it will say "book"; the language model is able to reason about such a sequence. But what about low-level patterns: if you just give it obscure numbers, what can it do? There is actually a low-level interface, and we can open up that low-level interface of the language model and ask it to do robotics tasks. In the paper "Large Language Models as General Pattern Machines" we explore using this low-level interface of a large language model, essentially asking it to reason about arbitrary sequences, and it is surprisingly effective: it can solve tasks like the ARC challenge and PCFG tasks, and it can even do sequence improvement.

I will dig a little bit into sequence improvement because it's quite relevant to robotics. Sequence improvement means you prompt the language model with state, action, and reward tuples, then prompt it with a higher reward and see if it can generate actions that achieve that higher reward. So it's doing reinforcement learning, or something reinforcement-learning-like, but in context. This is quite amazing: previously you would need a dedicated algorithm, data collection, and a replay buffer to do reinforcement learning, but now you can build everything into the language model's context by leveraging its low-level interface. With that, we can do something like clicker training. If you're not familiar with clicker training, it's how you train a dog: when it does the right thing, you give it a reward with a click. We can now use clicker training to train robots as well. Here the robot is exploring, and I give it a click whenever it does the right thing or moves in the right direction, and over time it learns to push the bag of chips, which is the objective of this training. So you can do this entire decision-transformer-like operation purely in context, just by giving the language model a bunch of patterns and asking it to figure out the regularity of the sequence; in this way it can generate new actions that improve on the previous sequence.
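As a rough illustration of the sequence-improvement idea, here is a sketch of prompting a model with state-action-reward triples and asking it to continue the pattern under a higher target reward; the serialization format and the `llm` callable are assumptions for illustration, not the paper's exact setup.

```python
# Sketch of in-context "sequence improvement": serialize past rollouts as plain
# number sequences, then ask the model to complete the pattern for a reward
# higher than anything it has seen so far.

def improve_action(episodes, target_reward, current_state, llm):
    """episodes: list of (state, action, reward) tuples from earlier rollouts."""
    # Order episodes by reward so the regularity "larger reward, better action"
    # is visible inside the prompt itself.
    lines = [
        f"{reward} | {' '.join(f'{s:.2f}' for s in state)} | "
        f"{' '.join(f'{a:.2f}' for a in action)}"
        for state, action, reward in sorted(episodes, key=lambda e: e[2])
    ]
    # Condition on the higher target reward and the current state, and let the
    # model fill in the action that continues the pattern.
    prompt = "\n".join(lines) + (
        f"\n{target_reward} | {' '.join(f'{s:.2f}' for s in current_state)} | "
    )
    completion = llm(prompt)
    return [float(x) for x in completion.strip().split()]  # proposed action
```

A clicker-training-style loop would then execute the proposed action, record the reward from the human's click, append the new triple to `episodes`, and repeat.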
So for language models we can find new interfaces that are more suitable for teaching them low-level skills. Reward is a bridge between the language model and low-level control, and we can fully leverage it as a universal interface that you can optimize in real time; sometimes it outperforms generating actions directly, which really motivates using reward functions as the interface. And with language models as general pattern machines, we can use language models beyond semantic tasks: we can ask them to reason about low-level things, and robotics is a domain rich in sequence transformation, sequence completion, and sequence improvement tasks, so we can really study the lower-level mechanisms of language models.

The key takeaway of this talk is that we are seeing more and more use of foundation models not only on the semantic-reasoning side of robotics but also on the dexterous, action-generating, lower-level embodied intelligence side. We need to rethink the scaling laws of robotics and Transformers: how do we scale with the limited amount of data we have? We have a new recipe for scaling robot models and data. RT-2 shows that you can do more with the same data: with essentially the RT-1 data plus internet data, you can generalize to a lot more things. RT-X shows that you can do a lot more with more data: there are real benefits to collecting more robotics data, and there is positive transfer everywhere. And in terms of new interfaces for language models, I think it's worth it for the robotics field to think about developing new, lower-level interfaces to language models that facilitate learning low-level skills. With that, I would like to conclude my talk. If you find it interesting, there are a lot of references for you to look into, and special thanks to my team, the Google DeepMind robotics team; we're at the forefront of developing foundation models for robotics, so stay tuned for more in the future. Thank you.

Yes? You mentioned that floating-point numbers are difficult for large language models, but if you're just generating the action tokens, why don't you just append a linear layer to the Transformer that outputs the numbers directly? So the question is: if large language models have difficulty understanding numbers, why don't we use a linear layer to output the actions directly? Language models do find numbers difficult, but sometimes we still want to bring in knowledge from the pre-training mixture; if I add a new layer like that, that layer is not present in pre-training, so how can I expect it to transfer? I think that's an interesting question. At the same time, I don't necessarily think using raw numbers is the right interface either; we could probably do some action-representation learning to learn a representation that the language model can output. We're still trying to figure out what the right representation is. Among the representations we have tried before, such as decimal numbers, floating-point numbers, and extra tokens, we found that just using plain number tokens is good enough.
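To make the "numbers as tokens" idea concrete, here is a small sketch of discretizing a continuous action into integer bins that a language model can emit as ordinary number tokens; the 256-bin vocabulary and the [-1, 1] action range are illustrative assumptions, not the exact scheme used in the talk's models.

```python
import numpy as np

# Map each continuous action dimension into a small integer vocabulary so a
# language model can emit actions as ordinary number tokens. Bin count and
# action range below are assumptions for illustration.
NUM_BINS = 256
ACTION_LOW, ACTION_HIGH = -1.0, 1.0

def action_to_tokens(action):
    """Continuous action vector -> list of integer bin indices."""
    clipped = np.clip(action, ACTION_LOW, ACTION_HIGH)
    bins = (clipped - ACTION_LOW) / (ACTION_HIGH - ACTION_LOW) * (NUM_BINS - 1)
    return np.round(bins).astype(int).tolist()

def tokens_to_action(tokens):
    """Integer bin indices -> approximate continuous action vector."""
    bins = np.asarray(tokens, dtype=float)
    return (bins / (NUM_BINS - 1) * (ACTION_HIGH - ACTION_LOW) + ACTION_LOW).tolist()

# Example: a 3-DoF end-effector delta becomes a short sequence of integer tokens.
print(action_to_tokens([0.04, 0.0, -0.5]))
```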
Yes? [Question partly inaudible, about generating actions directly versus using language to reward.] I think both directions are worth exploring, and they have different advantages. Generating actions directly borrows the autoregressive nature of language modeling and aligns really well with other tasks like visual question answering; the limitation is that the generated actions are heavily regularized, so generating dexterous actions that are far out of distribution is difficult. Language to reward, on the other hand, takes a page from the book of traditional robotics, with optimization-based or model-predictive control, so you can take safety constraints into account more easily and it can generate more diverse actions. Maybe one recipe is to generate a lot of data with the language-to-reward system and distill it into a Transformer, because then you're imbuing your large model with all of these other desirable behaviors. For the language-to-reward system itself, I don't know how scalable it is: we're not fine-tuning the language model, so you're limited to, you're at the mercy of, the training data of the language model. The language model can do the moonwalk because it knows what a moonwalk is and roughly how to do it, but if you want to scale to completely new things, maybe you use language to reward to bootstrap your data generation and then distill that into the other policy.

What are the next directions, is it scaling, like with language models? That's a good question. The remark about scaling at the end of the lecture was half a joke, but I'm being quite serious: it's actually a promising recipe. Everybody believes in the power of the scaling law: just by giving the model more data and more compute, you see interesting capabilities coming out.

Like GPT-2 to GPT-3, those big jumps, do you think robotics is ready to have that kind of moment, can you see the capabilities? I still think we don't quite have enough data; that is probably still the biggest bottleneck. We are trying to find ways to do more with limited data, and we are trying to collect more data, and I think it will take some time for us to accumulate enough. Currently I'd say we have signs of life for positive transfer, but in language models people don't talk about positive transfer anymore because it's so commonplace, you see it everywhere, and robotics is not at that stage yet.

How much is your team thinking about safety and alignment? Are you just relying on the ethics that emerge from the large language models, like it won't tell you to kill to achieve a goal? That's a very good question. We actually take safety very seriously, because unlike most other domains where language models are developed, which don't have a direct impact on the physical world, here there could be potential harm to humans and to the environment. Gary Marcus actually commented on our earlier work: what if you say "bring out a bowl, feed the cat, and put it in the dishwasher" and it puts the cat in the dishwasher? If it misunderstands, there will be a catastrophic failure case. We handle safety carefully by designing hardware and software safety layers, and there is also some constitutional-style safety work coming out sometime soon; I can't share many details right now, but we will release that work before long.

Is it something like, if there's a human nearby, just don't interact? I think it's a little more nuanced and detailed than that, but we do take safety quite seriously. In some of our experiments the robot's finger would actually break off before it could apply too much force to the environment, so that's yet another way of ensuring safety.

Can we have some kind of vision module for this, and maybe this is a bit of a...? Right, yeah, I think that would be possible. Thank you.