So today we will jump into the goals and logistics of the course, and also talk a little bit about what multitask and meta-learning is and why we might want to study it. Before we get started, I want to make some introductions: my name is Chelsea, I'm the main instructor for the course, and we also have seven really awesome TAs. So I'd love to welcome you to the course, and as a first question, to get a sense of where everyone's at: how are you doing? If you want to answer briefly, raise your hand and share how things are going. [Student: at my first class I went to the wrong room; five minutes into class I saw the title said public policy, when I was supposed to be in AI.] Great, well, I'm glad you made it to the right class this time. Anyone else? Cool, good to figure everything out. Anyone else? Okay. One thing I wanted to share is that it feels great to be closer to something that's a little more normal, and I'm really excited about that. At the same time, there's lots of other stuff going on in the world right now that's not great, and we acknowledge that; we've tried to set course policies that give you a little bit of flexibility and that recognize this course isn't the only thing going on in your lives. Hopefully those policies will help with that. We have a lot of information and resources about the course. Your first place to go is the course website; we put a lot of information there, so please read through it, and if you have any questions, make sure you read through it first, but then feel free to post any questions you have on Ed. Ed is connected to Canvas, and you can also reach out to the staff mailing list. We encourage you to post any questions you have on Ed, because that makes
it so that other students can see your questions; other people probably have the same question as you. But you are allowed to make a private post on Ed, or email the staff mailing list, in cases where you don't want something shared with the class. For example, if you have an OAE letter, you can send it either to the staff mailing list or in a private Ed post. We also have office hours, all posted on the course website; the Zoom links are on Canvas, and office hours will start on Wednesday. Cool, so what will you learn in this course? There are really three main things we hope you'll learn. The first is the foundations of modern deep learning methods for multitask learning and, generally, learning across tasks. The second is not just learning about those methods, but actually getting experience implementing them and working with them in code, in PyTorch, and trying to understand how these systems actually work in practice, beyond just how they're supposed to work from lectures. And lastly, I want to give you a glimpse of the process behind actually building these kinds of methods. A lot of courses will present you knowledge and ideas as they are, without talking about the scientific and engineering process of arriving at those ideas. I'm hoping that a glimpse into the process of building and understanding these algorithms will encourage you not to take them as given, to challenge them, and to help you learn about the process of developing these kinds of algorithms in the first place. Along those lines, we'll cover a wide range of topics in the course. We'll start with the basics of multitask learning and transfer learning, then move into three broad classes of meta-learning algorithms: black-box approaches, optimization-based approaches, and
metric learning. We'll also cover more advanced meta-learning topics, such as overfitting in meta-learning, unsupervised meta-learning, and Bayesian meta-learning methods, before moving into other approaches for few-shot learning and adaptation, including unsupervised pre-training, domain adaptation, and domain generalization. Throughout, there will be an emphasis on deep learning techniques, and we'll also study a number of case studies in real applications. This includes things like multitask learning in recommender systems (like the recommendation system behind YouTube), meta-learning for land cover classification and education, and few-shot learning in large language models. Now, one thing that's a little different from the last time we offered this course is that these topics right here are all new to the course, and we're not going to have any lectures or homeworks covering reinforcement learning. Essentially, we're removing the reinforcement learning topics from previous quarters and adding this new content, including few-shot learning with unsupervised pre-training, how this relates to foundation models, and domain adaptation and domain generalization. Now, you might ask: why are we removing the reinforcement learning content? What if I want to work on reinforcement learning? For that, we're introducing a new course in the spring quarter on deep reinforcement learning that I think will do a nice job of complementing some of the other reinforcement learning offerings on campus. Removing reinforcement learning will also make the course more accessible to people who don't have a lot of background in it; we found in previous quarters that just one refresher on reinforcement learning often wasn't enough to get to the more advanced topics. That said, if you're really excited about applying some of the ideas in the course to reinforcement learning
topics, you can still explore that in the final project and get support from the course staff, many of whom are experts in reinforcement learning problems. Awesome. So for lectures: all the lectures are in person; they're also live-streamed and recorded, and you'll have access to the recordings on Canvas. We're also going to have two guest lectures toward the end of the quarter; those aren't fully sorted out yet, but we'll announce them shortly. I really encourage you to ask questions during lecture; this serves a lot of different purposes. You can ask questions by raising your hand, or by entering questions in the Zoom chat, and we have a TA who will monitor the chat and make sure those questions get answered during the lecture. I find it really helpful when people ask questions, because it helps me gauge whether you're understanding what I'm saying. If you don't understand some concept, or want to learn more, or I'm not covering something important, that's my fault, not your fault, and chances are other people have the same misunderstanding. So if you ask that question, it will help everyone in the course and help me help you. Basically, my goal is to help you learn the topics in the course, rather than standing up and listening to myself speak. Office hours will be a mix of in-person and remote, mostly in person, but there will be two remote options, especially for SCPD students in the course. And then prerequisites: the main prerequisite is sufficient background in machine learning, something like CS229 or equivalent, because we'll be building a lot on the basic concepts of machine learning. We're not going to cover topics like cross-validation, training sets and test sets, and so forth, or the basics of neural networks. All the assignments will require training neural networks in PyTorch. If you really hate PyTorch for
whatever reason, you could implement the assignments in some other framework, but all the starter code is in PyTorch and you'll be able to get more support if you do everything in PyTorch, so we'd encourage you to use it. A few quarters ago the course used TensorFlow instead, and people seemed to really like the switch to PyTorch; but hopefully we provide some flexibility there. We're also going to have a PyTorch review session on Thursday at 4:30 PM in this room, so if you want a refresher on some of the PyTorch concepts that will be useful for the assignments, you can come to that review session. It will also be live-streamed and recorded on Zoom. Cool, so digging in a little more into the content: we'll have a number of different assignments. The first assignment is basically just a warm-up to make sure you're familiar with PyTorch, going over some of the basics of multitask learning. Then we'll start the core assignments. The first will be on black-box meta-learning and how you set up the data to make few-shot learning with these black-box models more effective. The second goes into gradient-based meta-learning and metric learning; both of these assignments involve few-shot character recognition kinds of problems. The third homework goes into fine-tuning pre-trained models, with an emphasis on natural language processing and language models rather than image recognition. And the last homework is going to be optional, covering some of the more conceptual aspects of the course: the first four homeworks are all implementation-based, whereas this last homework will be more of a pen-and-paper exercise. The grading for the course is 50% homework, 50% project. The first homework is just a warm-up, so it's only five percent of the grade, whereas the next three homeworks are each 15% of the grade, so this adds up to fifty
percent. I mentioned that we want to provide some flexibility to students in the course, so if you complete this fourth homework, it can be used to replace one of the previous homeworks, or it can be used to replace part of the project grade, and we'll always do whatever is best for your grade. So if you don't do well on that homework but you still try to complete it, it won't hurt your grade; it will only help. The second aspect of flexibility is six late days across homeworks and project-related assignments. You can use up to two late days per assignment, no questions asked, up to the six total. Of course, if you have other extenuating circumstances that make it difficult to submit coursework on time, feel free to send us an email and we may be able to make accommodations beyond these six late days. And lastly, for the collaboration policy, please read the course website and the honor code. For the homeworks, you're allowed to talk to other people, but you should document your collaborators, and please write up the homework solutions on your own, without referring to other students' solutions and without referring to solutions on the internet. Cool. For the final project: the main parts of the course, in terms of your work, are the assignments and the project. The project is a research-level project of your choice that you can do in groups of one to three. If you're doing research on campus, we really encourage you to use your research for this project, as long as it's applicable to the topics covered in the course. You can also share the project with other courses, although we'll have slightly higher expectations in that case. The same late-day policy applies as for the homeworks, but there are no late days for the poster session, because the poster session will be a live event and
there won't be a late poster session or anything like that. The poster session is on December 7th, which is basically the last day of classes, and it will happen instead of a lecture. Cool, so any questions on course logistics? [Student: what types of topics can we work on for the final project?] So the question is what kind of topics you can work on for the final project: basically whatever you want, as long as it pertains to the course content; it's very open-ended. One other thing I can mention is that we're soliciting ideas from the broader Stanford AI community, and we'll post a list of project ideas on Monday next week. So if you're not sure what kind of project you want to do, you can look at that list for some nice ideas; but if you have something in mind already, or if you want to be creative and think of something else, it just needs to pertain to the topics of the course. We also have detailed guidelines on the project, and what we expect, posted on the course website, so you can refer to that document for a lot more detail. [Student: are there examples of what people have done in the past?] That's a great question. We haven't posted examples yet, but we want to post some previous projects. One thing you can see already is that if you look at a previous offering of the course, I think two years ago, we posted the titles of all the course projects, and some of them have links from students who were willing to make them public, so you could already take a look at that. We're planning to provide some more explicit examples in the coming week. Cool. In terms of initial steps: homework zero has already been posted; it should be pretty lightweight, and it's due in a week. All the assignments are due at 11:59 PM Pacific time. And I'd encourage you to start trying to form groups if
you want to work in a group; making posts on Ed and so forth can be helpful for that, and we're also happy to try to help you connect with other students in the course if that would be helpful. [Student question about working in a group.] Yeah, I think there can be pros and cons to working in a group. The benefit is that there's a little bit more you can do in a group, you might have complementary expertise, and it can also be fun to work with other people. The downside is that you are relying on other people a little bit, and you want to make sure that the people you're working with are compatible and that you can rely on them to some degree. Generally we recommend it, but it's certainly not required, and you're welcome to work alone. Cool, so let's dive into why we might want to study multitask learning and meta-learning. The first thing I'll cover here is a little bit of my perspective on why I find multitask learning and meta-learning really cool and exciting. In particular, a lot of the research I do in my lab is trying to think about this question of how we can allow agents to learn a breadth of different skills in the real world. What I mean by agents is actually working with real robots, and allowing them to learn skills like this: here I'm holding the block in front of the robot, and it's learning to place the red block into the shape-sorting cube. But not just something like that; also being able to do something like watching a video of a human place an object into a bowl and having the robot figure out how to do the task as well, or figuring out how to use tools to complete a task, even if it's not explicitly told that it should use that tool. I think robots are really cool and interesting because they can teach us things about intelligence. They are faced with the real world; they aren't just
looking at images in a static way; they actually have to contend with the complexity of the real world in order to be useful in it. They have to generalize across lots of different tasks, objects, and environments, and I think they also need some sort of common-sense understanding: they need to understand what will happen if they try to pick up objects or try to move their arm in a certain way. And then the last thing is that it's not clear exactly what the supervision should be, either. Beyond trying to teach us things about intelligence, there's also the aspect that if we can actually build robots that are very useful, they can help out in a wide range of aspects of society, where people are doing jobs that are dangerous, jobs that are tedious, or jobs they would rather not be in. So from that standpoint, I'd like to tell a little bit of a story. At the beginning of my PhD, I was a PhD student on the other side of the bay, and I was working with this robot right here. In this project, the robot was learning through trial and error; it was trying to figure out how to insert the wheels of this toy airplane into the corresponding holes. What you can see is that at the very beginning it didn't really know anything about the task, and over time it gets better and better at figuring out how to assemble this part into the plane. This seemed pretty cool; I found it really cool to see this whole process. But one caveat here is that the robot effectively had its eyes closed: it couldn't actually see anything, it wasn't using the camera in any way, it was just using the positions of its joints. So the next step I really wanted to explore in a follow-up project was to think about: can we allow the
robot to complete these kinds of tasks but with its eyes open, actually using vision to solve these tasks. This was the resulting project. Here we were trying to learn a neural network policy that maps from images taken by the robot's camera directly to torques applied at the robot's joints, and you can see that over time it gets better at the task. At the beginning it was just moving its arm around pretty randomly, and it gets closer and closer to inserting the block into the red hole. Not only can it insert the block into the red hole for one position of the cube, but it can do it for multiple different positions of the cube, and this is why it needs vision in order to succeed. So this was pretty cool. These days this maybe isn't that impressive, but six years ago no one had really applied neural networks to this kind of task before. Here are the results of the final policy: you can see the robot's perspective right here, and if I held the cube in different positions, the robot was able to insert the block into the correct place. Now, I think what was exciting about this wasn't that the robot had figured out how to do this one particular task, but rather that we had a reinforcement learning algorithm that could allow robots to do lots of different tasks. If you took the same exact algorithm and gave it a different reward function, it could figure out how to do other tasks: it could figure out how to place the claw of the toy hammer underneath the nail, or, given a different task, it could figure out how to screw a cap onto a bottle. As an example of one more task, in follow-up work we got the robot to use a spatula to lift an object into a bowl. This last task is actually surprisingly challenging, because the robot has to fairly aggressively maneuver the spatula underneath the object in order to lift it up.
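To make the pixels-to-torques idea concrete, here is a minimal sketch of that policy interface. This is not the actual architecture from the project described above (the real policy was a convolutional network trained with reinforcement learning); the single linear map, the tiny image size, and the joint count of a 7-DoF arm are all illustrative assumptions.

```python
import random

# Toy stand-in for a visuomotor policy: a single random linear layer
# mapping a flattened camera image directly to one torque per joint.
# (A real policy would be a learned convolutional network; this sketch
# only illustrates the pixels-in, torques-out interface.)

IMG_H, IMG_W = 8, 8   # tiny stand-in for the robot's camera image
N_JOINTS = 7          # assumed 7 joints on the arm

random.seed(0)
# One row of pixel weights per output torque.
W = [[random.uniform(-0.01, 0.01) for _ in range(IMG_H * IMG_W)]
     for _ in range(N_JOINTS)]

def policy(image):
    """Map an IMG_H x IMG_W grayscale image to one torque per joint."""
    flat = [px for row in image for px in row]  # flatten pixels
    return [sum(w * x for w, x in zip(row, flat)) for row in W]

image = [[0.5] * IMG_W for _ in range(IMG_H)]   # dummy camera frame
torques = policy(image)
print(len(torques))  # prints 7: one torque command per joint
```

In the real system, the weights would be learned from trial-and-error experience rather than initialized randomly, but the input/output contract is the same: pixels in, joint torques out.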
This was really exciting; this was around the first or second year of my PhD, and we got the robot to do these kinds of tasks, all with learned neural network policies. Other people used the same kind of algorithm and built on it, extending it to other tasks and other robots, to learn things like hitting a puck into a goal, opening a door, and throwing an object to hit a target. Around the same time, people were also using deep reinforcement learning algorithms to play Atari games, to play the game of Go, and to learn how to walk in simulation. So in general, around 2016 and 2017 was a very exciting time for reinforcement learning and for deep learning. The catch, though, is that we have a bit of a problem here. It was all very exciting progress, but in each of these cases, when we trained the robot to do a task, like lifting the object into the bowl, we didn't train it to use spatulas generally and to lift objects into bowls; we trained it to lift that particular object, with that particular spatula, into that bowl. If you gave the robot a different spatula or put it in a different environment, the robot wouldn't successfully complete the task. This is a huge problem, because it means that if we actually want to put the robot into real-world situations, it hasn't learned something that will actually work in general. You might say: okay, maybe we can just give the robot a lot more spatulas and train it with more data. But the tricky part is that when you train these systems, if you actually look at the learning process behind the scenes, it often looks something like this: the robot attempts the task, and then it needs to attempt the task again, and for it to attempt the task again, you need to put the environment back to where it was so that it can attempt the task again from that state. In this video, this is my
friend Yevgen, and one of the things you probably noticed here is that Yevgen is doing more work than the robot is. This doesn't seem like the right way to go about things; importantly, it doesn't seem very scalable, and it's not practical to collect a ton of data in this fashion. So this is starting to get at why multitask learning and meta-learning matter: we're training these systems to do one very narrow thing, and this very narrow thing requires detailed supervision and a lot of extensive human effort to get that system to do that one particular thing. Then, if we want to do something else, we again need a lot of human effort to train from scratch on that new thing. This isn't just a problem with reinforcement learning and robotics. If you look at problems in speech recognition or object detection, those systems are trained on more diverse data, but they're still learning one task, starting from scratch, with a lot of supervision and engineering for that one task. I'd refer to all these systems as specialists: we're training a machine learning system to do one thing. In many cases, what would be more useful are systems that are trained not on a single task but on many different tasks. For example, if we look at what people can do: people aren't trained from day one to use spatulas to lift up objects; they're trained to learn much more broadly about things in the world. In that sense, I would refer to humans as generalists, and I'm interested in this question of how we can build machine learning systems that are more general. As maybe one more note on this: if you take a system like AlphaGo, which became champion at the game of Go, this is another example of a specialist system. It's maybe analogous to training a baby from day one to try to figure out
how to play Go, without teaching them lots of other things about the world. In fact, it turns out that even training a robot to pick up Go pieces and place them into the right configuration is still beyond the capabilities of AI systems. So if you ever watched any of the AlphaGo matches, you'll notice that the "AlphaGo player" is a human who is watching a computer screen and moving the pieces for the system. Cool, so that's my perspective in terms of why I'm excited about these algorithms. Any questions on all that before I move on to things beyond robots and general-purpose machine learning? [Student: the question has to do with all of the work happening on large pre-trained models now, especially in NLP, where the models are maybe implicitly learning a lot of tasks internally. Would you say it's still important to explicitly teach a model how to encode a bunch of these different tasks, or can implicit learning get us there?] I might go into this question in the next section. So the question is about how a lot of these large pre-trained models are trained on very broad datasets; they aren't explicitly trained to do multiple tasks, but they're implicitly trained to learn very broad things. We'll talk about this a little bit later in the course, but there are ways to connect that to multitask learning, and I view it as an example of something that's more of a generalist, rather than something that's learning one very narrow task. So we'll definitely connect to that. I think that also gets at some of the motivation behind trying to train generalist systems, which is that if we can train a pre-trained model on very broad data and have it learn something more general about the world, then if we want it to do something
narrow after that, we can use it as initialization. We don't have to start from scratch; we can start from this more general understanding of the world and use that as initialization to learn much more quickly on a new task. Two of the lectures we're adding this year will be precisely on training these pre-trained models in a more general way with unsupervised pre-training, and then fine-tuning them with a small amount of data to a new task. [Student: a lot of general NLP pre-training tasks are things like fill-in-the-blank; is there an analog in something like robotics?] In general, there are some things that are somewhat analogous: you can take video data from the robot's experience and have it interpolate frames, say, predict this frame, or you can mask out part of an image and have it fill in that part of the image. So you can make very direct analogs like that, and approaches like that have shown some success in robotics. But there are also other aspects of robotics that make it very challenging. For example, in NLP we have all of Wikipedia, and we don't have Wikipedia for robotics: we don't have data of robots tying their shoes or robots learning how to pour water just lying around on the internet in massive quantities. That brings up another challenge, one that is more readily solved in NLP. [Student: would you still call this multitask learning if it has a single task, and we give it that single task to learn?] I'll get to that at the very end of the lecture. Cool. So why should we care about multitask learning and meta-learning beyond robotics and general-purpose machine learning systems, and specifically, why should we care about deep learning in this context? I don't think deep learning needs too much motivation these days if you've taken a machine learning class, but in terms of a couple slides:
historically, the approach to things like computer vision was to design features by hand, then to design mid-level features, and then to train a classifier on top of those mid-level features; many aspects of this pipeline were designed by hand. The more modern approach to computer vision isn't to try to hand-design low-level features, mid-level features, and so forth, but rather to train a single neural network end to end, training the parameters end to end to do the entire task. There are some benefits to the former approach (you get some notion of interpretability, for example), but in general the second approach works a lot better, as we'll see on the next slide, and it allows us to handle unstructured inputs: things like pixels, language, sensor readings, really any input you could imagine, without having to engineer really good features for that particular domain. So it allows us to perform tasks without a lot of domain knowledge. And as we saw over the years on the ImageNet benchmark, this slide shows the error rate on the ImageNet benchmark between the years 2011 and 2016.
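To make the contrast between the two pipelines concrete, here is a toy sketch. The "features" and "classifier" below are crude illustrative stand-ins of my own (real pipelines used hand-engineered features like HOG or SIFT, and the end-to-end replacement is a deep network rather than a single linear map):

```python
# Hand-designed pipeline: a fixed feature-extraction stage, then a
# classifier trained separately on top of those features.
def hand_designed_features(image):
    # Crude stand-ins for low/mid-level features: mean brightness and
    # total horizontal-gradient energy.
    flat = [px for row in image for px in row]
    mean_brightness = sum(flat) / len(flat)
    gradient_energy = sum(abs(row[i + 1] - row[i])
                          for row in image for i in range(len(row) - 1))
    return [mean_brightness, gradient_energy]

def classifier(features, weights=(1.0, -0.5)):
    # Only this stage is learned; the features above stay fixed.
    score = sum(w * f for w, f in zip(weights, features))
    return 1 if score > 0 else 0

# End-to-end pipeline: a single learned map from raw pixels to the
# label, with no hand-engineered stages in between.
def end_to_end(image, params):
    flat = [px for row in image for px in row]
    score = sum(w * x for w, x in zip(params, flat))
    return 1 if score > 0 else 0

img = [[0.2, 0.8],
       [0.4, 0.6]]
print(classifier(hand_designed_features(img)))         # prints 1
print(end_to_end(img, params=[0.1, -0.2, 0.3, -0.1]))  # prints 0
```

The point of the slide is that the second approach, scaled up to deep networks and enough data, ended up outperforming years of careful feature engineering.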
Overall we see a downward trend, but what's notable here is that this dot right here is AlexNet, which was the first end-to-end approach in the ImageNet competition, and everything after it is also a deep-learning-based approach. So we saw this really striking paradigm shift, and also a very striking shift in the performance you can get on these kinds of computer vision tasks. In a completely different domain, natural language processing: this is a paper from, I think, 2016 or 2017, where they were using deep learning for machine translation. Before this paper, Google Translate wasn't doing end-to-end deep learning; it used what was called a phrase-based system, PBMT, whereas GNMT stands for Google's neural machine translation, which is an end-to-end approach. We again see really large improvements, something like 60% to 87% improvement on these different translation tasks, and now systems like Google Translate use exactly these kinds of models when making predictions. They work well; there's still obviously lots of room for improvement in the translations, but they work far better than the previous systems. Cool, so that was some brief motivation for why we might focus on deep learning systems. Now, why might we focus on deep multitask and meta-learning systems? In deep learning, we've seen that if we have a large and diverse dataset and a large model, that leads to good generalization on the tasks I showed on the previous slides; we saw this with ImageNet, and with things like Transformers for machine translation. But there are a lot of scenarios where you don't have a large and diverse dataset at the outset. There are scenarios like medical imaging, where there are privacy concerns with sharing lots of data, or robotics, where we don't have Wikipedia for robotics, or personalized education, or translation for rare languages, where we don't have a large data set
already sitting on the internet, and it would be very expensive and costly to collect one. These are the scenarios where this kind of recipe starts to break down, and where it's impractical to learn from scratch for each of these different circumstances: for each rare disease, for each robot, for each person, for each language. Beyond that, there are also scenarios where maybe you have a large dataset, but that dataset is very skewed: you have a long-tail distribution. This slide shows a histogram of the number of data points for different slices of your distribution, and those slices could correspond to objects the system has encountered, or interactions with different people, or the words it has heard, or driving scenarios, and so on. These kinds of datasets don't arise in a lot of machine learning benchmark problems, but they come up all the time in real-world applications. There are a lot of words, for example, that you hear all the time, and a very, very long tail of words that come up much less frequently. This long tail of edge cases presents a major problem for modern machine learning systems, and I would argue that's why we don't have self-driving cars on the road today: there are so many edge cases that come up in self-driving situations. Multitask learning and meta-learning won't solve this problem in and of themselves, but there are some signs of life indicating that if you can leverage priors from the big data and translate that to the tail of situations, you might be able to better handle these kinds of distributions. Cool. Beyond that, what if you want your system to quickly learn something new? This is again a scenario where you don't have a lot of data, because you want to learn something
very quickly about a new person, like a new user, or about a new environment that you've placed your system into. For this I'd like to give you a little test where I want you to learn something new. I'm going to give you a training dataset: the six images on the left. The far-left images are all paintings by Braque, and the next three columns are paintings by Cezanne. Your goal is to learn a binary classifier between paintings by Braque and paintings by Cezanne. I'll let you train your classifier a little bit. Now that you've hopefully learned a decent classifier, your goal is to classify this test data point. Raise your hand if you think this is by Braque. Okay, and raise your hand if you think it's by Cezanne. Cool, so most people got it right: this is indeed by Braque. I tried to give you a little time to train your classifier, though maybe some of you didn't converge. You can see it in the style of the edges here, for example. I picked this example to be one that's a little bit harder, maybe closer to the decision boundary.

This is an example of few-shot learning: you took a really tiny training dataset, just six data points, and were able to generalize to a new data point. So how were you able to do that? If you were to train a machine learning system, like a convolutional neural network, from scratch on those data points, it probably wouldn't have gotten the right answer the way many of you did. The way you were able to do it is that, while you may not have seen these particular paintings before, or maybe even paintings by these painters, you've learned how to see: you've learned how to recognize patterns in images, how to recognize paintings, and all of your previous experience allows you to learn new tasks with small amounts of data. You weren't starting from scratch on this problem. This is what's called few-shot learning, where your training dataset has only a few data points, and it's something you should be able to achieve if you leverage prior experience rather than starting from scratch.

So, each of the four settings we went over, where you want a more general-purpose system, you don't have a large dataset, you have a long tail, or you want to learn something new quickly: all of these are scenarios where ideas from multi-task learning and meta-learning might be useful and where these methods come into play.

Now, beyond why we should study this, there's the question of why we should study it now. If you take some papers from the late 90s (I think probably everyone was born by 1997... is anyone born after 1997? Oh wow, I'm getting a little old here), if you take a paper from before most of you were born, it already says things like: we can train tasks in parallel using a shared representation, and we can do multitask inductive transfer by adding extra tasks to a backpropagation network. So they were already doing deep multi-task learning in that paper. You can take a paper from 1998 talking about few-shot learning: the ability to generalize correctly from a single training example; when faced with new things to learn, humans can usually exploit an enormous amount of training data and experience that stem from other related learning tasks. Or from even earlier, from 1992, folks like Samy Bengio and Yoshua Bengio, who you may have heard of, were talking about the possibility of learning a learning rule to solve new tasks: ideas from meta-learning. So a lot of these ideas aren't that new; they've existed for a pretty long time at this point, and yet even
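though these ideas are decades old, the core mechanic of few-shot classification is easy to make concrete in code. Here's a minimal, purely illustrative sketch in the spirit of the painting exercise (the toy 2-D "features" and the function name are my own, not from the lecture): rather than training a network from scratch on six images, we classify a query by its distance to per-class centroids of features that are assumed to come from prior experience, say a pretrained encoder.

```python
import numpy as np

def nearest_centroid_predict(support_feats, support_labels, query_feat):
    # Few-shot classification sketch: average the tiny support set's
    # features per class, then assign the query to the nearest centroid.
    # The heavy lifting is assumed to have happened earlier, in whatever
    # pretrained encoder produced these features.
    labels = np.array(support_labels)
    centroids = {c: support_feats[labels == c].mean(axis=0)
                 for c in sorted(set(support_labels))}
    return min(centroids, key=lambda c: np.linalg.norm(query_feat - centroids[c]))

# Toy stand-in for encoder features of the 6 training paintings:
support = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],   # "braque"
                    [1.0, 1.0], [0.9, 1.0], [1.0, 0.9]])  # "cezanne"
labels = ["braque"] * 3 + ["cezanne"] * 3
print(nearest_centroid_predict(support, labels, np.array([0.05, 0.05])))
# → braque
```

This is essentially the idea behind prototypical networks, where the encoder itself is meta-trained so that nearest-centroid classification works well on new classes.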
though they've existed for a long time, they're continuing to play a major role in AI systems.

Question: does meta-learning also help generalization when you have a large dataset? In general, these methods will give you the most bang for your buck when you have a small dataset, because that's where leveraging previous experience is most useful. If you have a really massive dataset, you'll probably do pretty well just training from scratch on it. It's possible that some prior knowledge comes in handy if you have distribution shift: if you have a large dataset but your test set is from a slightly different distribution, you might be able to learn invariances from your previous data that let you do better even with a large dataset. But in general, if you have a standard i.i.d. problem and a large dataset, things like prior experience will be much less useful.

Cool. So what are some examples of these kinds of systems in action? At DeepMind, they trained a system to do lots of different kinds of vision and language tasks, and they found they could train a single model that can do something like object recognition: in this example, they describe a chinchilla, then describe a shiba, and then give it an image of a new animal, and it's able to recognize that the new animal is a flamingo and also describe where flamingos are found. So it can do object recognition, and the same exact model can also do things like reading: you can give it a few example images of arithmetic problems, like 2+1 and 5+6, and then if you give it a new image, in this case 3x6, it's able to both read the numbers and complete the arithmetic problem. And again, the same model can do yet another task, in this case counting animals: it's given an image of a few pandas and told that this corresponds to three pandas, and at the end it's given an image of four giraffes and is able to count them and identify that they're giraffes. What's really cool about this system is that it isn't just specialized for one thing: it can do lots of different things, and it can do them in a few-shot way. You give it a few examples of how you want it to perform the task, and it's able to leverage that to figure out what you want and ultimately do the task. So that's one modern example that was pretty exciting.

One other modern example is a paper I co-authored in 2021 that I personally was very excited about, where we used meta-learning in an education application: trying to provide feedback to students on open-ended work in an intro CS course. In particular, Code in Place was a really massive online course offered by a few folks at Stanford, and at the end of the course there were 16,000 solutions that students wrote. Giving feedback on and grading 16,000 solutions is a massive undertaking. We were able to get volunteers to give feedback on around a thousand of the student programs, but if you train on those thousand programs from scratch (these are Python programs that we're trying to give feedback on), you don't get a very good system; that doesn't work very well. With meta-learning, though, we were able to train on previous student data, of students taking exams and receiving feedback on those exams, adapt that meta-learning system to this new course with these new problems, and ultimately give feedback on the remaining solutions.

In particular, what it looked like is: we were given a student program, like the program right here, and there's an error in this program that would make unit tests not useful at all. There were actually a few thousand cases where unit tests were helpful, but for most of the solutions they were not very useful. The system was able to generate feedback anyway; in this case the feedback is that there's a minor error in getting the input from the user, which could be something like forgetting to convert the user input to a float. This system worked well enough to actually deploy in the course to give feedback to real students, which wouldn't have been possible without some of the ideas we'll cover in this course. So those are two examples: the Flamingo model and the education model.

Question: did you compare the learning outcomes for the two groups? We did do a blind A/B test with the system: a thousand students got human feedback, and 15,000 got feedback from the AI system. The students agreed with the meta-learning system's feedback slightly more, by about one percent; I think they agreed with it around 97% of the time, and with the human feedback around 96% of the time. Of course, they might just be agreeing because they like the feedback, so we also asked them how useful it was, and on a scale of one to five I think they rated it around an average of 4.6 out of 5. So they found it useful. We weren't able to measure learning outcomes between the human feedback and the system's feedback, because this was a diagnostic toward the end of the course, and we also didn't feel it would make sense to withhold feedback from students to compare no feedback against the meta-learning system. But I think the real win is being able to give feedback in scenarios where it would otherwise be very difficult to provide it.

Question: what's the difference between multi-objective learning and multi-task learning? We'll cover multi-objective learning in the lecture on Wednesday, but you can basically think of it as a subset of multi-task learning, a special case.

Question: are there any applications for real-time or streaming data? I can certainly imagine a lot of applications making sense, because with streaming data you may want to adapt very quickly to your current circumstance with a small amount of compute and a small amount of data; from that standpoint, things like few-shot learning may be very applicable. But nothing comes to mind immediately in terms of a specific application I've seen.

Question: what's the difference between few-shot prompting and few-shot learning? I think it's a very gray area; I don't think it's black and white. We'll talk about this a little in some of the future lectures; it's fuzzy.

Question: does meta-learning always require a model that has previously been trained on some data? Is meta-learning essentially adapting that model to a new, smaller dataset? Transfer learning and meta-learning are, in many cases, both trying to adapt to new circumstances. I'll actually move on a little, because I'll give some definitions of what multi-task learning, meta-learning, and transfer learning are, or at least what
the problem statements are, and that might answer your question.

As a few more recent example applications of multi-task learning and meta-learning: one from 2019 looks at machine translation. It turns out that if, instead of translating between just one pair of languages, you translate between lots of different languages (in this case 102 languages), you're able to surpass very strong baselines that train on just a pair of languages. There's a lot of shared structure and shared information that you can leverage from the datasets of the other languages. People have also been using multi-task learning systems for multi-objective optimization, where you have multiple competing objectives, in a YouTube recommendation system, thinking about how to optimize those objectives; we'll consider a case study of that paper in the next lecture. Those are a bit more on the applied side. A bit more on the research side (there are lots of papers on these topics these days, so I'm just highlighting a few), one example is a paper called A Generalist Agent, which trained on a really wide range of tasks, ranging from dialogue to playing Atari games to controlling a quadruped robot in simulation to controlling a real robot arm. They found that you could actually stuff the data from all of these different tasks into a single model and have it do all of those tasks. And lastly, I showed this example on one of the earlier slides, but you can also apply this in real robotics, where you want a robot to take experience from previous objects and ultimately perform a task with a new object. In this case the robot hadn't seen this red bowl before, or this peach, or the distractor objects for that matter, and it's able to figure out that it should place the object into the red bowl. So those are a few modern examples that are quite interesting.

Question: is there any work trying to understand what these multi-task models learn, for example whether they disentangle the tasks in some way? In general, no particular paper comes to mind, and I think it's a very challenging question, because we don't have good tools for interpreting neural networks and what they're learning. I think the biggest tool is just to observe the model's behavior on new inputs. For example, if it's able to generalize to new tasks effectively, that's an indication that it's not learning the tasks completely separately, that it's actually learning the shared structure; whereas if it's completely unable to generalize to a new task, that's maybe an indication that it's not learning a unified representation of the tasks, or that the new task is just too far out of distribution compared to the previous ones.

Question: why is the robot result considered impressive; isn't it overfitting to the placing task? This paper is from a few years ago, and it certainly is, in this case, specialized to the task of placing; overfitting is a term that means many things, but you can think of it that way. I think what's interesting is that this paper was one of the first examples of taking a raw video and actually interpreting that raw video to figure out how to do the task, and we've certainly seen more interesting and more impressive things done since then.

Question: in this one-shot imitation learning setup, how should the AI decide whether the human intended to put the object into the bowl versus always putting it in the same position? So the question is that the task seems a little underdefined: it's unclear from a single example whether the goal was to put the peach exactly in this spot, or to put it into the red bowl, or maybe there are other ways to interpret the video as well. The reason the system is able to do what, at least to me, aligns with human judgment is by having prior data. If you only gave it this one example and learned from scratch, the problem of inferring the intent would be mathematically underdefined in many ways; from images it's especially underdefined, because it could be that you just wanted to change these pixels to be orange, which maybe doesn't involve moving the peach at all. It's underdefined if you learn from scratch, but with previous data and previous experience with other objects, the system can leverage that experience to figure out what was intended. In this case the previous examples involved placing into containers, and in each case the system was trained to generalize to placing into the container, rather than placing the object in the same position. If instead you gave it previous experience that said "if I see a demo like this, place the object in that exact same position," it would learn that from the previous experience instead. Cool question.

Okay. One other thing that I think is important in the context of why we might study these methods now is making deep learning methods accessible to many different people. I mentioned that deep learning works really well when we have a large dataset, and if we take some of the most common datasets out there, like ImageNet or these machine translation datasets, they have a lot of data: ImageNet has 1.2 million images, this English-to-French translation dataset has 40 million paired sentences, and the Switchboard dataset has 300 hours of labeled data. So if these are exactly the problems or data distributions you care about, you're in great shape. But a lot of problems don't have this much data. For example, Kaggle's diabetic retinopathy detection dataset has only 35,000 labeled images, and this is a setting where deep learning isn't going to work as well if you train from scratch. Likewise, there's a dataset on adaptively treating epilepsy with less than an hour of data, and one of the papers I showed before, from the beginning of my PhD where we were learning the spatula task, had less than 15 minutes of data. This is much, much less data. In many applications we don't have tons of data, or maybe we're looking at a population with much less data available. One reason I think multi-task learning and meta-learning are important is that if we can extract prior information from other, larger datasets, we might be able to better solve tasks that have much less data, which in turn makes this kind of technology accessible to people who don't have the money to collect a huge dataset, or don't already have one collected for their problem.

Question: is there a standard way to quantify how useful data from one task is for learning a new task? The short answer is that it's an open problem; it's a really useful and interesting problem, and it's unsolved. But there is some work on trying to relate the similarity between two tasks, and it's not really a symmetric similarity function, it's more of a directional one: how useful is one thing for another. There's also some work purely on data valuation in general, like how valuable a data point is in the context of a larger dataset; James Zou on campus, for example, has some work on that.

Question: a lot of the examples we've talked about have one large dataset and one small dataset; what if you have lots of small datasets? Absolutely, we've seen examples where these kinds of techniques can be super useful in that scenario. In your homework you'll work with the Omniglot dataset, which has only about 20 examples of each character, but lots of characters, around 1,200 of them or actually more than that. That's an example where these kinds of systems can work quite well, and there are other examples as well where we can amortize the cost of learning.

Question: what about using machine learning models to generate data? For example, something like OpenPose can extract the joints of a person, and maybe that's helpful for robotics, where we don't have much data but do have a lot of information about humans; is that applicable? So the question is: can we generate datasets, and is that useful? There's a little bit of work on that; Phil Isola, for example, has done some research on that topic. In general, though, there's possibly a no-free-lunch issue: if you're learning to generate data from a dataset, then you're not really creating additional information when you train that generative model, so the generated data might not be more useful than the original data the generative model was trained on. From that standpoint it's a little tricky to get value out of that sort of thing, but if you have domain knowledge that you can put into the generative model, that might help. It's an interesting problem, and there have been a few works that do interesting things along those lines, like dataset distillation, which I could talk about in office hours; but in general it's tricky.

Cool. So I've talked about a few different successes and exciting applications of these kinds of systems, but I'd also like to emphasize that there are lots of open questions and challenges. We've seen some of these in your questions, like how to determine the usefulness of one dataset for another, and I think that makes this equally exciting to study, because it means there are open problems that we, and all of you, can solve.

In the last 15 minutes or so, I'd like to dive into what a task actually is and what multi-task learning is. We'll do this fairly informally in this lecture and define things more formally in the next one. Informally, we can think of a task (or rather, a machine learning task) as something that takes as input a dataset and a loss function and tries to produce a model. I think this is a fairly intuitive way to think about a machine learning task, because when you want a machine learning system to solve a task, you typically give it a dataset and a loss function and optimize to get a model. Different tasks can vary in a number of ways: at a high level, you could have different objects, different people, different objective functions (as was mentioned), different lighting conditions, different words, different languages, and so on.
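To make the informal definition concrete (a task takes a dataset and a loss function and produces a model), here's a toy sketch. The specifics are mine, not from the lecture: the "model" is a single scalar slope w, the loss is squared error, and train just runs gradient descent on the task's own loss over the task's own data.

```python
# A task, informally: (dataset, loss_fn) -> model.

def sq_loss_grad(w, x, y):
    # Gradient of the squared error (w*x - y)**2 with respect to w.
    return 2 * (w * x - y) * x

def train(dataset, loss_grad, lr=0.1, steps=200):
    # "Solving the task": fit the model (here, a single slope w) by
    # gradient descent on the task's loss over the task's dataset.
    w = 0.0
    for _ in range(steps):
        g = sum(loss_grad(w, x, y) for x, y in dataset) / len(dataset)
        w -= lr * g
    return w

# The task: noiseless data from y = 2x, plus a squared-error loss.
task = ([(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)], sq_loss_grad)
model = train(*task)
print(round(model, 2))  # → 2.0
```

Swapping in a different dataset or a different loss function gives a different task, even if the model class and training procedure stay the same; that's the sense in which different objects, people, or lighting conditions can count as different "tasks" under this technical definition.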
The different tasks you might throw into a multi-task learning system can be fairly varied, and they can vary along lots of different axes. The reason I bring this up is that multi-task learning doesn't just cover what you might think of as different tasks under the everyday English definition of the word: typically you don't think of different objects as different tasks, but if you want a system that can handle lots of different objects, you might train it across lots of objects, and that still fits this more technical definition of a machine learning task.

As I mentioned, though, there's one really critical assumption that comes up with these kinds of systems. The bad news is that the tasks you train on need to share some structure: if they're completely independent of one another, you won't get any benefit from training them together, because there's no shared structure to exploit, and you're better off just using single-task learning. The good news is that many tasks do share structure. As one example, consider unscrewing a jar lid or a bottle cap, or even using a pepper grinder: using a pepper grinder and opening a water bottle might seem like very different tasks, but they involve a very similar motion. And even if tasks are seemingly unrelated, the laws of physics underlie all real data, so there's already a lot of common structure there (unless maybe you're on a different planet, and even then the laws of physics won't change, though gravity might). People are all organisms with intentions. The rules of English underlie all English-language data, and languages are all developed for similar purposes, so even across languages there's a lot of shared structure. And so on. I think there are actually very few cases where the tasks you come up with are independent in a statistical sense, and so these kinds of methods can pick up on the shared structure and leverage it to do better.

Question: in a model like Gato that handles both language and images, what is the shared structure there? I actually think that in that particular case, the Gato paper didn't show significant indications, in terms of generalization, that the model was learning shared structure between those two things, so it's not actually clear to me that it was learning shared structure across images and text. I could also imagine that it might be easier for a model to learn shared structure if, for example, you gave it images of text rather than tokens of text, because those modalities are very different from one another. For us humans, we see everything through the same embodiment, through our eyes and so forth; we never get one-hot vectors of tokens passed directly into our brains, but that's what neural networks get: they don't see things in a unified way. So the short answer is that it's unclear whether Gato was actually learning structure between the two, although there is some work showing that you can learn a shared embedding space for images and text that is more unified.

Question: how do you quantify the amount of shared structure? It's difficult to quantify given a pair of datasets, but one nice way to think about it, which we'll cover in the Bayesian meta-learning lecture, is through Bayesian graphical models: if two random variables are independent of one another, they have no shared structure, whereas if there's some dependency between them, a direct or indirect edge, then they do share structure. Quantifying how much structure they share is hard, but conceptually I think the Bayesian standpoint is useful.

Question: what happens if you train together on tasks without shared structure? The model might basically learn to use half the model for one task and half for the other. There actually aren't that many downsides to that, but there aren't any upsides either.

Question: is it necessary that the tasks we train on share structure with each other, or is it also fine if the task we want to generalize to is related to each training task independently, while the training tasks themselves don't overlap? Can you repeat that? So, there's a task C that relates to both A and B in some way, but A and B aren't related to each other, and we train on A and B and want to generalize to C. I see: so you have two unrelated tasks and want to learn a new task that's related to both of them. For example, task A is to pick up a fork, task B is to use a fork to skewer something, and task C is to pick up the fork and then skewer something. There are certainly instances where things like that come up, and yes, that's a scenario where these kinds of techniques make sense, although some techniques will be more useful than others.

Question: do the tasks have to belong to the same modality, like text or images, or can we have a combination of modalities? The Flamingo model I showed before is actually already an example of a multimodal model that takes as input both images and text, and Gato is another example where different tasks have different data modalities. If you represent those modalities differently, it may be harder for the model to find the shared structure, but it's certainly possible with these models.

Cool, I'm going to try to move on a bit; I only have a few more slides, and then we can take more questions at the end. So, what are some of the problem definitions we'll cover in this course? Informally, we can think of the multi-task learning problem as learning a set of tasks, generally trying to learn them more quickly or more proficiently than learning them independently. Here the set of tasks is fixed: we see the same set of tasks during training and during testing, so we're not trying to handle a new task. In contrast, in the transfer learning problem, we're given data on previous tasks and our goal is to learn a new task more quickly or more proficiently; this is also the problem that meta-learning algorithms aim to solve. Basically, in this course we'll look at any kind of method that tries to solve one or both of these problem statements.

Now, one question that came up earlier: doesn't multi-task learning reduce to single-task learning? One thing you could do is say: you have a dataset D_i for each task and a loss function L_i for each task; can you just sum up your loss functions and combine your datasets? And then you have a
single-task learning problem, with one dataset and one loss function. So are we done? In some sense, multi-task learning can reduce to single-task learning, and aggregating the data across tasks and learning a single model is one very viable approach to multi-task learning. The transfer learning problem, where you want to learn new tasks, is what we'll focus on more in this course, because it's a bit more challenging, and this solution doesn't just reduce it to single-task learning. But we will have one lecture on multi-task learning, on Wednesday, where we'll cover other things, like how we tell the model which task we want it to do, and what to do when aggregating the data and training on all of it doesn't work. So we'll focus more on the second problem statement, but there are still challenges that come up in the multi-task problem statement beyond just training a single model with a single loss function.

Cool, so that's it. We could take a few more questions as a group, but we can also take questions up front if people have additional questions, and we'll end there. As a couple of reminders: homework zero is out and is due on Monday next week, and if you want to work in a group for your final project, we'd encourage you to start forming project groups.
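As a small appendix, the reduction discussed a moment ago (sum the per-task losses over the combined data to turn several tasks into one single-task problem) can be sketched in a few lines. This is a hypothetical illustration, not code from the course: each task is a (dataset, loss_fn) pair, matching the informal definition of a task, and the combined objective is just the sum of each task's loss on its own data.

```python
# Multi-task -> single-task reduction: one model, one combined objective.

def multitask_loss(w, tasks):
    # tasks: list of (dataset, loss_fn) pairs sharing the same model w.
    # Summing per-task losses over the aggregated data turns T tasks
    # into a single-task learning problem.
    return sum(loss_fn(w, x, y)
               for dataset, loss_fn in tasks
               for x, y in dataset)

sq = lambda w, x, y: (w * x - y) ** 2

task_a = ([(1.0, 2.0)], sq)   # this task wants w = 2
task_b = ([(1.0, 3.0)], sq)   # this task wants w = 3

print(multitask_loss(2.5, [task_a, task_b]))  # → 0.5
```

Note that the compromise w = 2.5 minimizes the combined loss here even though it's optimal for neither task alone; when tasks conflict like this, naively summing losses may be the wrong thing to do, which previews the multi-objective issues mentioned for the Wednesday lecture.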