Okay. So the plan for today is this: we're going to start with a little bit of admin, just to outline what this course is about, particularly for those of you who are actually doing it for credit — I know that's probably about half of the audience here; there are a few people just auditing it or trying to learn something about reinforcement learning, so please, everyone, feel very welcome. Then we're going to move on to the real content of this whole class, which is reinforcement learning. I'm going to describe what it is, so you have a clear idea of what reinforcement learning is really about. To do that, I'll start by introducing the problem: first of all, what the problem setting is — what do we mean when we want to solve a reinforcement learning problem, where does this fit within machine learning, within science, what are we really talking about — before we start talking about solution methods, where we'll discuss what it means to actually build an agent that can solve the reinforcement learning problem. That will be the section on what's inside an RL agent. And then finally we'll talk about some of the key problems within reinforcement learning and start to get some insight into the key components of what it actually means to try to solve this problem.

Okay, so just to begin with, some admin. We're trying something new this year. In previous years this has been a 9:30 class, and today we are starting at 9:30, but for future classes could we try to start at 9:15? Because there are a number of people here, I'd like to encourage people to ask questions and make this a little bit interactive, and typically I've found that means we need a little more time to get through the content — so that this can be a bit more relaxed and not something where we're cramming everything in at the last minute.

There is a website, on which all the teaching material for this part of the class — the reinforcement learning class — will be posted; all the slides are up there. However, I should note that I reserve the right to change the slides right up to the last minute, and I often do. So if you print them out in advance and turn up with them, you might find they're slightly different from the ones I actually teach with. But I will update the website and make it clear when those slides have been updated after the class, so that if you're trying to revise for the exam at the end, you know what the actual material is that I've taught with — and that is the examinable material for this year.

It would be great if everyone could join the Google group; that will just help us to coordinate. This will also cover the whole of Advanced Topics, with Arthur Gretton as well — we're going to coordinate on this same group. So if there are cancellations, or issues that come up, or issues regarding the assignments in particular, they can be communicated on this group. It's just handy for us if there's one place where we can get in touch with everyone. And feel free to email me if any issues come up during the course of this class.

Okay, so for those of you who are taking this course for credit: I think Arthur explained this in the first class, but very briefly, this is a split class — the Advanced Topics class covers both
the kernel methods component and the reinforcement learning part of the course. We teach them as if they're two different courses, so you can think of them as two half-courses that happen to sit under the same umbrella of what we call the Advanced Topics class. What that means is that we arrange it so that you can get away with taking just one or the other; you don't have to sit both parts of the class. The way we do that is there's an assessment and an exam for the RL part, and an assessment for the kernel methods part, and in the exam you can choose which of those questions — or both — to answer.

Specifically, the overall assessment will be 50% coursework and 50% exam, and there will be two different assignments. You can choose one or the other; if you happen to do both of them, the bonus you get is that your overall score will be the max of how you did on the two assignments. So it can be a significant advantage to do both assignments, but you can also get away with doing just one. In the exam, the way we get around this issue of having two half-classes is that there are three reinforcement learning questions and three kernel methods questions, and you can choose any three to answer. So if you do study for both parts, you have the flexibility of choosing amongst them, but you can again get away with learning just one part or the other — although then you're stuck with the questions that I offer if you're doing the RL side, or Arthur if you only do the kernel methods side. Okay, is that clear to everyone? Good.

Okay, so textbooks. The main textbook which we'll be semi-following on this course is called An Introduction to Reinforcement Learning, by Rich Sutton and Andy Barto. This book is considered the standard textbook for reinforcement learning, and it's actually available free online. In fact, Rich Sutton is currently working on a second edition of the book, which is also available online in its current draft. We'll be using the notation from the second edition in this class, so if you want to be compatible and compare between them exactly, that's probably the one to look at. It's a really good choice: it's maybe 400 pages and very readable, and it gives a good overview and a sense of the big ideas of reinforcement learning at an intuitive level. However, some people prefer something more concise and more rigorous — a more mathematical textbook — and the one I'd recommend if you're one of those people is by Csaba Szepesvári, called Algorithms for Reinforcement Learning. This is a much shorter textbook — less than 100 pages — and covers all the main ideas very concisely, in a much more theorem-and-proof style. You'll get less of the intuition but more of the rigor, and for those of you who prefer the theoretical side, I find this is what some people actually prefer reading. But the style of this course will be somewhat closer to the first text.

Okay, so let's start by trying to understand reinforcement learning and get some idea of what this thing is that we're all sitting here to learn about and talk about. To begin with,
let's try to place it within the whole field of science. I think one of the special things about reinforcement learning is that it sits at the intersection of many different fields of science. Here, in the middle of this sort of Venn diagram, we've got reinforcement learning, and what this is supposed to illustrate is that for many of these different fields of endeavor there is a branch of that field which is trying to study the same problem we're going to talk about in reinforcement learning. So what is that problem? It's essentially the science of decision-making. I think that's what makes it so general and so interesting across many different fields: it's a fundamental science, trying to understand the optimal way to make decisions. And this comes up again and again.

In computer science we study this under the umbrella of machine learning, and specifically reinforcement learning — that's what we'll look at in this course. But if you go and talk to people in the engineering world, a large part of engineering is devoted to what's called optimal control, which is essentially studying the same type of problems, with many of the same methods under different names — in other words, how to optimally decide a sequence of actions so as to get the best results at the end of the day. In neuroscience, it turns out that one of the major discoveries of the last couple of decades is an understanding of how the human brain is believed to make decisions: a large part of the brain is devoted to what's called the dopamine system, and the neurotransmitter dopamine appears to exactly reflect one of the main algorithms we'll study in this course. The reward system of the human brain is now widely studied and a major part of neuroscience; people there are trying to understand exactly the reinforcement learning methods we'll talk about in this course, and it's believed to underlie human decision-making as well. Again, in psychology there's been a lot of work, going back to Skinner and so forth, on classical conditioning and operant conditioning, which is trying to understand how and why animal behavior occurs — if you give animals some rewards and see that they start to salivate, the theory that underlies that is essentially reinforcement learning again. In mathematics there's again an equivalent to reinforcement learning, studying the mathematics of optimal control, known as operations research. And finally, in economics there are the fields of game theory, utility theory and bounded rationality, and these are all getting at the same questions of how and why people make decisions when they're trying to optimize their utility. So it's really something fundamental, and I think something of general interest in many different areas.

So if we zoom in a bit on this diagram: in this course we're going to talk about reinforcement learning. There are a few things which differentiate reinforcement learning from supervised learning, and from unsupervised learning as well, so let's see if we can understand those distinctions. The first one is maybe the most obvious: there's no supervisor when we do reinforcement learning — no one tells us the right action
to take. Instead it's a trial-and-error paradigm: there's no supervisor, there's just this reward signal saying "that was good", or "that was bad", or "that was worth three points", or "that was worth minus ten points". No one actually says "that was the best thing to do — take that action in this situation". So there's no supervisor. The second major distinction is that when you get that feedback saying good or bad, it doesn't come instantaneously — it may be delayed by many steps. In the reinforcement learning paradigm you make a decision now, and it may be many, many steps later that you actually see whether that was a good decision or a bad one, as the consequences of the decisions you make unfold over time. It may be that only retrospectively do you realize it was a bad decision, because at the time it looked very good — maybe you even got positive rewards for several steps before some catastrophically large negative reward. So in RL the feedback is delayed, and that makes it very different again.

Furthermore, time really matters in RL. We're talking about sequential decision-making processes, where step after step the agent gets to make decisions, pick actions, see how much reward it gets, and then optimize those rewards to get the best possible outcomes. So we're not talking about i.i.d. data here; we're not talking about the classical supervised or unsupervised learning settings where you get some i.i.d. dataset and you just learn on that dataset. Here we've got a dynamic system: there's an agent moving through a world. Imagine a robot walking through some environment — what that robot sees at one second is going to be very correlated with what it sees at the next second. The i.i.d. paradigm we're familiar with doesn't apply to RL.

And perhaps the most important way in which i.i.d. breaks down is that in reinforcement learning the agent gets to take actions: it gets to influence its environment, it gets to move around. Imagine that robot again — if I'm the robot and I walk to this side of the room, I will receive very different data, I'll see different things, I'll get different rewards, than if I'd moved over to that side of the room. So the agent is actually influencing the data that it sees; this is like an active learning process. We've basically got this combined set of differences that defines the RL paradigm and makes it quite distinct, but in many ways this is really the paradigm faced by all of those fields of science we just talked about, when we try to understand what it really means to optimize a sequence of decisions.

Okay, so let's try to make this concrete by talking about some examples of what reinforcement learning problems would look like — let's not talk in the abstract, let's get a feel for what RL is all about. I'm going to illustrate this with a few examples and then we'll have a couple of videos; this is the first lecture, and it's nice to have a bit of fun. So one example would be flying stunt maneuvers in a helicopter — we'll see a video in just a second. Here you've got a helicopter being controlled, and you want this helicopter to make a particular maneuver. It's not that anyone tells you at any given moment "yes, you've done the right thing" or "no, you've done the wrong thing"; it's that at the end
of some period of time you want to have executed this maneuver, and maybe someone says good or bad at the end of it — and crashing is typically very bad. Another example would be to play the game of backgammon, and one of the famous successes of reinforcement learning is when Gerry Tesauro's program defeated the world human champion at backgammon, just by reinforcement learning. This was a system that was basically playing the game again and again, and just by trial-and-error learning it figured out a way to play backgammon better than humans. Another example would be to manage an investment portfolio, where now the time steps are decisions being made perhaps in real time — perhaps there's some stream of data coming in to this trading agent, and it has to make decisions about what to invest in, where, and in what products — and the reward in this case might be money: it's trying to maximize the amount that it makes over time. That's a clear example of a reinforcement learning problem. Or to control a power station: a power station is an RL problem in the sense that there's a sequence of controls you can apply. Perhaps the controls are the torques on the motors, or how things are being converted; it could be different ratios controlling the batteries and so forth; in a wind turbine you have the blade pitches and so on. There are all kinds of different parameters which can be controlled every second so as to optimize the throughput of the power station — many different decisions that can be made over time, with some long-term goal, which is efficient generation. Another example would be to make a humanoid robot walk: you don't want this thing to fall over, and maybe you want it to get to the other side of the room. How do you do that? This is a reinforcement learning problem: there are rewards at every step telling you whether it's falling over or whether it's making progress, and at every step it has to learn for itself how to figure out this behavior of walking across the room. Finally, this is an example from DeepMind, where we've been working on this recently: how do you get a single program to play a suite of different games? You just sit down in front of an Atari emulator and you want to play this Atari game, say, better than a human. How do you do that, if you just watch a stream of video coming in and get to control the joystick? How do you learn to play that game to some good level and get the maximum score?

Okay, so let's have some videos, just to make this a little bit concrete. Each of those examples I just gave, I should add, is a real success of reinforcement learning — they weren't just abstract examples, they're problems where reinforcement learning agents have been successfully built to solve them. So this is a nice example of the helicopter. It's not a full-size helicopter — it's a fairly substantial model helicopter, so when it crashes it doesn't cost a million dollars — but it has learned, through reinforcement learning, to perform particular stunt maneuvers, which are described down in the bottom right there, and it's being asked now to show off and execute all of these behaviors that it's learned essentially
through trial and error. Basically this thing learned by accumulating experience, learning from that experience, being told — via a reward function — what's good and what's bad, and then executing that behavior, to produce all of these different maneuvers. Some of them are quite fun, and believe me, if you just take a model helicopter and try to control it yourself, it's very hard to make it fly upside down, or do a split-S, or some of these maneuvers.

[Student] Did they train it in simulation?

So the way this was done: they built a model — they learned a model; we'll come to this later in the course. They built a model first, and once they had a model of how the helicopter behaves, they did some planning with respect to that model. So the learning was done against the model, offline: the model was learned from real data, and then the system learned, with respect to that model, how to perform optimally.

I just want to show one more, maybe. Okay, so this is an example of something we've been working on: this is the agent which plays the Atari games. This is basically a system that learns by trial and error. It doesn't know anything about each of these games — these are all different Atari games, for those of you familiar with the classic Atari console — and all it's shown is basically the video we're seeing here; it gets to control the joystick and it's told how much score it gets. It basically has to figure out how to play the game with no knowledge — no one tells it the rules of the game or what's going to happen — and it just figures out how to play very well and maximize its score through reinforcement learning. You can see the games are very different. This is a side-scrolling game where you have to blow up all of these things and accumulate score by shooting stuff and moving along. Some of them are sort of pseudo-3D, like this Battlezone here. We don't claim these are good games, by the way — just that we built a single reinforcement learning agent able to play them all, and it learns to do better than humans in more than half of the games we played it against. This is a really ridiculous game where the goal is literally to make a chicken cross the road, and it learns to do that. This is more of a classic shoot-em-up, Space Invaders-type game called Demon Attack — it learns to duck in and out of the bullets and shoot all the aliens. This one is Pong, the classic original Atari game; we're controlling the green player here, and it's basically learning to hit the ball right — you have to get these sharp angles off the corner to get it past the opponent — and it learns this strategy and eventually learns a perfect strategy of winning every game.

[Student] Are you allowing the reinforcement learning algorithm to respond to stimuli in a much shorter space of time than a human would be able to, simply because of the hand-eye coordination, for example? There must be a lower bound...

Actually, no — we tried to match it roughly to humans: we make decisions at 15 Hz, which is not way beyond what humans are capable of, so I think it's reasonably fair.

Yeah, so this game here is Seaquest. You'll notice it has to do delayed decision-making: it has to go up to the top and fill up with fuel, which doesn't get it any
score, so that it can then go back down and shoot all these different monsters and fish and sharks, and come back and get more score. Here it has to jump around this whole grid and turn everything yellow whilst avoiding this coily thing here; it can use this little teleport, and it has to figure out, just from raw joystick maneuvers, how to fill up the whole screen and get the big score at the end. And again, we haven't told it anything about these different games — it's just figured out what to do by trial and error on the joystick. This game's kind of weird: it has to sweep this thing around and collect items whilst avoiding the things that will kill it. This one I guess you're probably familiar with — Space Invaders. We've got this agent being controlled, and it learns a very human-like strategy: first of all it goes for the mothership, and it also learns the very human-like strategy of doing things a column at a time, which gives it more time as the game speeds up — it takes longer for the aliens to move from one side of the screen to the other and come down, and that gives it more time to sweep up and kill them all. It figures out that strategy for itself. This game, Boxing, is again a bit of a strange Atari game — I don't think racial stereotypes are intended with this one, but anyway — we're controlling the one on the left and learning this strategy of pummeling the other guy into the corner. Right, so a real variety of different games there.

[Student] How long does it take to train this algorithm, and were there any games that were particularly tricky to train? Could you do it on a MacBook?

So, yes — on a MacBook it would take quite a while; we run it on a GPU, using deep learning — there's a whole story there of exactly what we do — but roughly speaking it takes around three or four days of training to reach human-level performance on each game. That's per game, so three or four days of compute time for each one.

Okay, so that hopefully gives a flavor of what RL is about. I don't want you to think RL is just about games, but a lot of my background in reinforcement learning has been in applying it to games — to board games, to video games — and I think they're fun to talk about, so some of the examples will be drawn from my experience. I really want you to see these as just examples of how RL can be applied; in that sense, games are just little microcosms of real things which happen in the real world, which have very clearly defined rules and help us understand how these ideas can be applied. So it's not a game-specific idea — it's very generally applicable.

Okay, so let's talk now about the reinforcement learning problem. Any questions first, before we move on? I think we'll start to get into more details and understand things a bit more clearly in just a second.

[Student] Just on the Atari games: do you use what it learns on one game and transfer that knowledge to a different game, or does it start fresh each time?

It starts fresh each time.

Okay, so the RL problem. One of the most fundamental quantities in reinforcement learning, which we're going to talk about in a bit more detail now, is the idea of rewards. So, a reward — what is
it? Well, it's basically just a number: a scalar feedback signal, a random variable R_t. At every time step t we define this feedback signal R_t, saying basically how well the agent is doing at that time step. The job of the agent is to sum up these rewards and get as much reward as possible in total — that's the agent's goal.

So is this really a good way to understand what we mean by goals? Does it cover what we mean by goals in all kinds of different problems? Reinforcement learning is based on the following premise — a hypothesis, if you like, stated informally — which is that all goals can be described by the maximization of expected cumulative reward. In other words, there's nothing that we mean by goals which can't be described by some scalar feedback signal and the maximization of that feedback signal over time into the future. That's a little bit controversial, so you should think about it: do you agree with it? What do people think? We're certainly going to be using it as the premise for the remainder of the course, but it's okay if in the back of your mind you're wondering whether you can really do this. So just have a think about that for a second. Any thoughts? Does anyone object, or are people happy? I'll take silence as happiness.

[Student] What if there's no intermediate reward — only something at the end of the game? You wouldn't be able to use reinforcement learning in that case, would you?

So — intermediate rewards are fine, and no intermediate rewards, where it's all at the end and you don't know anything until you get there, is absolutely fine too. That just means that's how we're defining the goal. If there are no intermediate rewards, then what we define is an end of episode and a reward at the end of the episode, and now the sum of the rewards is exactly how well you do at the end of that episode; the goal of the agent is to pick actions so as to maximize that expected sum of rewards at the end of the episode.

[Student] What if the goal is, say, to pass some sort of challenge within the shortest amount of time?

Okay, so the question was: what about a time-based goal, like trying to do something in the shortest amount of time? Typically what we do there is define the reward signal to be minus one per time step, and again there's a termination of the episode — when you actually achieve your goal, you stop — and now it's a well-defined objective: maximizing your cumulative reward basically minimizes the time it takes to reach the goal.

So let's make this concrete again by talking about some different rewards, looking at the examples we used. We started by talking about stunt maneuvers in a helicopter. In that case the rewards might take the following form: a positive reward each time we follow the desired trajectory, or come within some epsilon radius of where we want to be, and a large negative reward for crashing — crashing really should be bad; it should learn not to do that. If we're playing a game like backgammon, there would be zero intermediate rewards — to follow the earlier question — but at the end of the game we would give a signal saying: if you won the game, that's good; if you lost the game, that's bad.
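To make those reward specifications a bit more tangible, here is a minimal sketch in Python of what such reward signals might look like. The function names, numbers and thresholds are my own illustrative assumptions, not values given in the lecture.

```python
# Illustrative only: toy reward signals in the spirit of the examples above.
# The numbers and thresholds here are assumptions, not values from the lecture.

def helicopter_reward(distance_from_trajectory, crashed, epsilon=1.0):
    """Positive reward for staying within epsilon of the desired trajectory,
    a large negative reward for crashing."""
    if crashed:
        return -1000.0
    return 1.0 if distance_from_trajectory <= epsilon else 0.0

def backgammon_reward(game_over, won):
    """Zero intermediate reward; +1 / -1 only at the end of the game."""
    if not game_over:
        return 0.0
    return 1.0 if won else -1.0

def time_based_reward(goal_reached):
    """For 'finish as fast as possible' goals: -1 per time step until the
    episode terminates, so maximizing total reward minimizes time to goal."""
    return 0.0 if goal_reached else -1.0
```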
The agent then just figures out for itself how to maximize those rewards, and it will learn to take decisions along the way that maximize how well it does at the end of the game. If you're managing an investment portfolio, one of the good things about finance is that it has a very clear reward signal — dollars or pounds — and in that case the goal is simply to maximize the total reward. Controlling a power station, there would typically be some positive reward for each unit of power produced, but there might also be negative rewards for exceeding safety thresholds — maybe there are regulations from the regulator which have to be respected. And if you want to make a robot walk, you might have a positive reward for forward motion — each unit of distance travelled might give a unit of reward — and there might again be a large negative reward for falling over.

So these are very different problems; it might not feel at first glance like there's a common framework for all of them. But our goal is going to be to build a unifying framework where, within machine learning, we can address all of these different types of problems within the same formalism, and therefore solve them all using the same agents and the same ideas. The first step is understanding the reward signal: we get a reward at every time step. And in the Atari example, finally, we just gave a positive or negative reward for each change in score at every step — if you got 10 more points in that step, it would be a +10 reward at that time step.

Okay, so now, what is this framework? You should be asking yourself: these are all really very different problems, how can we even imagine a unifying framework for all of them? We think of this as sequential decision-making, and the goal in each case is the same — that's what unifies them: the goal is to select actions so as to maximize total future reward. We basically want to pick a sequence of actions so that we get the best results, the most total reward, along our trajectory. In particular, that means we have to plan ahead, we have to think ahead, because actions may have long-term consequences, and the reward we get might not come now — it might come at some future step. Sometimes that might even mean you have to give up some good reward now so as to get more reward later. You can't be greedy when you do reinforcement learning; you have to think ahead. Examples of that would be a financial investment, where you have to spend some money now — so you're losing money — but you believe that later you'll get more money back once the investment matures. Or in the helicopter example, maybe it's running low on fuel, so you might want to stop and lose some reward for following your maneuvers while you refuel for a while, but that might prevent a crash in several hours' time and therefore lead to longer runs and more reward in the long run. Or if you're playing a game of backgammon or chess, you might want to choose a move which doesn't maximize your immediate gains — you don't take the opponent's queen, but instead you do some strategic thing which helps much later on, because you think that blocking your opponent now might help you do better later in the game.
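Written down as a formula — a sketch using the Sutton-and-Barto-style notation this course says it follows, and ignoring discounting since it hasn't been introduced yet — the quantity the agent is trying to maximize is the total future reward from each time step:

```latex
G_t = R_{t+1} + R_{t+2} + R_{t+3} + \dots
\qquad \text{goal: choose actions to maximize } \mathbb{E}\left[ G_t \right]
```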
Okay. So the formalism we're going to use — we're just going to develop it a little bit in terms of this interaction between agent and environment. I'm going to use this big brain here to represent the agent. This is the thing we are controlling: our goal is to build this brain. We want to build an algorithm which is going to sit inside the brain of something we call the agent, and that thing is responsible for taking actions — deciding what the torques are going to be on the motors controlling the robot, or deciding what investments to make, or what moves to play. Those are the actions our agent is able to take. At each step, those actions are taken based on information the agent is receiving: every step it gets to see something of the world — it might be a robot with a camera that gets to see a snapshot of what's happening at that time — and it gets some reward signal, the reward we've just discussed, saying how well it's doing at that step. And that's it — that's all the agent sees: the observation coming in, the reward coming in, and it has to make a decision. Our goal is to figure out the algorithms that sit in this brain.

On the other side of the fence we have the environment, which I'm going to represent by this picture of the world — this is what's out there, on the other side of the agent, the thing it's interacting with. What happens is there's this loop over time: the agent interacts with the environment, and at every step it sees some observation — a snapshot of the world at this moment, the agent wandering around seeing stuff — and the environment is generating what that observation will be and what the reward is. If you imagine this is the Atari environment, then there's some actual Atari game generating the next observation — the next screen — and the score. But we're not controlling that part: we have no control over the environment except through this channel here; we get to influence the environment solely by the actions we take within it. The agent influences the environment by taking actions — a robot can move around and change where it is within the environment, or where objects are, and so forth.

Okay, so there's an interaction between agent and environment, and this just goes on and on. The trial-and-error loop that we define for reinforcement learning is basically a time series of observations, rewards and actions, and that time series defines the experience of the agent. That experience is the data we use for reinforcement learning: the machine learning problem of RL is concerned with this source of data, this stream of data coming through — observations, actions, rewards.
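Here is a minimal sketch of that interaction loop in Python. The `env.reset()` / `env.step()` interface and the `RandomAgent` are hypothetical stand-ins of my own, not an API from the lecture; they just make the observation-reward-action cycle concrete.

```python
import random

# A minimal sketch of the agent-environment loop described above.
# The environment interface (reset/step) and RandomAgent are assumed, illustrative names.

class RandomAgent:
    """Trivial agent: ignores the observation and reward and picks a random action."""
    def __init__(self, actions):
        self.actions = actions

    def act(self, observation, reward):
        return random.choice(self.actions)

def run_episode(env, agent, max_steps=1000):
    """Repeat: the agent sees an observation and reward and emits an action;
    the environment consumes the action and emits the next observation and reward."""
    observation, reward, total_reward = env.reset(), 0.0, 0.0
    for _ in range(max_steps):
        action = agent.act(observation, reward)        # agent maps what it has seen to an action
        observation, reward, done = env.step(action)   # environment emits next observation, reward
        total_reward += reward
        if done:                                       # end of episode
            break
    return total_reward
```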
[Student] Do we always have to limit ourselves to strictly a scalar reward? Because, for example, I can imagine interacting with a world and making a decision that increases my standing with my boss but annoys my girlfriend.

Good, okay. So the question was: do we have to limit ourselves to a scalar reward? I thought people were suspiciously silent earlier, so let's go back to the hypothesis. The reward hypothesis was that the reward is a single scalar feedback signal, and that a scalar feedback signal is sufficient to describe everything we mean by goals. The question is: sometimes there are different, conflicting goals that you might have, and how do you know which of them you care about optimizing at a given time? I guess the RL view would be that ultimately the agent has to pick actions, and to pick actions you have to be able to weigh up these different goals — you have to decide whether you're going to stay late at work or hang out with your girlfriend, and ultimately you have to pick amongst those. To pick amongst them you need to be able to compare them, and to compare two things you have to be able to line them up on some axis, which implies there's some scale on which you can compare them — a scale which can be converted into a scalar reward. So ultimately there must be a conversion into a single thing that you can decide over, and so a scalar reward ultimately must be enough. That's the RL view; not everyone has to agree with it.

Okay. So this stream of experience that comes in — this sequence of observations, actions and rewards — we call the history. The history is what the agent has seen so far: at each step it takes its action, sees an observation, sees a reward, all the way up to the current time step t, and the history H_t is basically the sequence of everything it has seen so far. In some sense these are all the observable variables: there might be other things hiding inside the environment, but the agent doesn't know about those, it can't observe them, and so in some sense they're irrelevant to the algorithms we create. The algorithms we create can only be concerned with what the agent can actually see — remember, we're trying to build that brain, the algorithm sits in that brain, and so we should only concern ourselves with what the brain is actually exposed to. You can think of this as the sensorimotor stream of a robot or an embodied agent, but it also applies more generally: a game-playing agent or a trading agent has its own sensorimotor stream of inputs it sees and decisions it can take. There's some well-defined interface between agent and environment, and we just get to control and see this stream of stuff coming into the agent from its environment.

What happens next actually depends on this history. From the agent's point of view, our algorithm is essentially a mapping from this history to an action — that's our goal, to build an algorithm that maps one of these histories to the next action. And the environment, for its part, looks at the history of what's happened and uses it to decide what observation it's going to emit at the next time step. So this history literally determines how things proceed: the history determines what happens next.
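In symbols — roughly the notation the slides use — the history up to time t is just everything the agent has observed, received and done so far:

```latex
H_t = O_1, R_1, A_1,\; O_2, R_2, A_2,\; \dots,\; A_{t-1}, O_t, R_t
```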
But the history isn't very useful, because it's typically enormous — we want agents that have long lives and can deal with very long streams of interaction, and each of these observations may be a video frame — and we don't want to have to go back through the whole history every time. So typically what we talk about instead is state. State is a summary of the information that's used to determine what happens next. In other words, if we can replace the history by some concise summary that captures all the information we need to determine what happens next, then we've got a much better chance of doing something real with our machine learning. We're going to spend some time talking about state — I think it's a really crucial concept that's widely misunderstood, and I want to go over it really carefully.

So formally, what is state? At every time step t we can construct a state, which is a function of the history — any function of the history; that's the definition of state. A valid definition of state would be to look only at the last observation, for example, and ignore everything that came before, or to look at the last four observations — that's actually what we did in Atari; pretty simple. But there are many different definitions of state, and what I want to pick apart now are at least three different definitions, to understand what they really mean and how they relate to each other.

The first definition is what we call the environment state. The environment state is basically the information used within the environment to determine what happens next. If you think about it — imagine this is the robot interacting with some real environment — that real environment has some set of numbers determining what's going to happen next. Or if this is the Atari emulator, the emulator has some internal state that decides what it does next. Or if this is a factory, there's some process that decides what happens to that factory next. There's some set of numbers that determines what happens next from the environment's point of view, based on what it's seen so far — its history — some state that summarizes everything it has seen and which will spit out the very next observation and reward. We call that the environment state. This is literally, if you were informally to ask "what state is the environment in?", the set of numbers contained in there — the information necessary to determine what happens next from the environment's perspective.

The special thing about the environment state is that it's not usually visible to the agent. We don't get to look inside a crystal ball and see everything that's happening within the environment. If you've got a robot walking around in the world, it doesn't know the atomic configuration of some rock in Australia — it only gets to see what's in front of it, in the video stream coming through to it right now. So the environment state — this set of numbers determining what happens next — isn't usually visible to the agent. It's more a formalism that helps us understand what an environment is than something practical that helps us build our algorithms: our algorithms cannot depend on these numbers, because our algorithms don't see these numbers.
Our algorithms just see the observations coming in, the rewards, and the actions going out. Furthermore, even if we could see this information, sometimes it might not be the right information to use to make effective decisions. Again, the configuration of Ayers Rock probably isn't very relevant to a robot wandering around this room right now; it needs to look at its own stream and understand what's locally around it to make good decisions, so actually having its own subjective representation may be a good thing.

[Student] If you put a lot of agents together influencing the same environment, do you get any sort of self-organized behavior patterns?

Okay, so the question is about multi-agent systems. I just want to say one thing about multi-agent systems, which is that essentially they're beyond the scope of this course, but let me briefly say that from the perspective of each individual agent, it can consider all of the other agents it's interacting with to be part of the environment. So from the perspective of one brain here, the fact that there are other brains wandering around in this environment doesn't have to change the formalism — there's still just one agent interacting with an environment. Your question, I think, is whether the emergent behavior shows patterns between the brains here and the brains there — it can, and there's a lot of work on that, but it's beyond the scope of what I'm going to talk about in this course.

Right, so the environment state doesn't tell us anything useful for actually building algorithms — we don't see it. Instead we talk about the agent state. The agent state is the set of numbers that actually lives inside our algorithm. Within our algorithm we're going to have some set of numbers that we use to capture what's happened to the agent so far — to summarize everything it has seen — and we use those numbers to pick the next action. Whatever information we choose to store and capture here is what we call the agent state: whatever information is used to pick the next action. And that's our decision — our decision is how to process those observations, what to remember and what to throw away, and there are many different choices there, which we'll talk about. This is the information used by a reinforcement learning algorithm: when we build an RL algorithm, we'll really always be talking about the agent state — a formalism where we take an agent state and build something that picks actions from that state. Our goal, then, is to understand how to pick actions given some state that summarizes everything we've seen so far. And this state can be any function of the history — that's our choice; we can build this function, it's part of the agent. The agent gets to decide what this function is going to be: how it converts the history of all the actions, observations and rewards it has seen so far into, say, a vector of information that will be used to characterize its future behavior.
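As an equation — again roughly following the course notation, with the superscript a marking the agent state — the agent state is just some function of the history, where the choice of function is up to us:

```latex
S^{a}_{t} = f(H_t), \qquad \text{e.g. } f(H_t) = O_t \;\; \text{(keep only the last observation)}
```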
Okay, so that was two definitions of state: the agent state and the environment state. Now here's a more mathematical definition of state, which we call the information state, or sometimes the Markov state. The Markov state, or information state, is basically an information-theoretic concept: it tells us when we've built a state representation that contains all the useful information from the history. What we're going to do is define the Markov property — probably familiar to those of you who've studied anything to do with Markov chains; it's the same idea. The Markov property basically says that the probability of the next state, conditioned on the state you're in, is the same as the probability of the next state if you showed all of the previous states to the system. In other words, you can throw away all of the previous states and just retain your current state, and you get the same characterization of the future.

So let's try to understand what that really means. I like to think of it this way: what this is really saying is that, if you have the Markov property, the future is independent of the past given the present. In other words, you only need to store this state: if you've got a state representation S and it's Markov, you can throw away the whole rest of the history — you don't need it, because that history doesn't give you any more information about what will happen in the future than the state you have. That's what the Markov property means: you can throw away everything that came before, just keep your state, and you're fine, because you haven't given anything up — that state still characterizes everything about the future. All future observations, actions and rewards — the distribution over those events is the same conditioned on your state as it is conditioned on the whole history. So you can give the history up and work with something much more compact, and everything will be fine. That's the definition of a Markov state. Another way to say the same thing is that the state is a sufficient statistic of the future: it fully characterizes the distribution over future actions, observations and rewards. Once we have this state, we can say everything there is to say about that distribution.

[Student] I think you said at the beginning that the reward may come many time steps after the observation and the action. How does that reconcile with throwing everything away except the previous time step?

That's fine. All we're saying here is that the way this environment is going to evolve into the future — everything we can say about the distribution of events that might happen to us — is the same conditioned on this state as if we conditioned on the full history. We still have to figure out how to take the right actions, we still have to figure out what the optimal behavior will be; all this tells us is that if we make decisions based on S, those decisions can still be optimal, because we haven't thrown away any information. We haven't talked yet about how to make decisions based on S.
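To restate the Markov property in symbols — this is the standard definition, written in the course's state notation:

```latex
\mathbb{P}\left[ S_{t+1} \mid S_t \right] \;=\; \mathbb{P}\left[ S_{t+1} \mid S_1, S_2, \dots, S_t \right]
```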
[Student question — asking for a concrete example, like the helicopter.]

Okay, so the question was how this fits with the helicopter example. For the helicopter, a Markov state would be the current position, velocity, angular velocity and angular position of the helicopter — that would be roughly a Markov state, once you have all of those things, and maybe you also need to know the wind direction and things like that as well. But if you know all of those things, it doesn't matter what the position of the helicopter was ten minutes ago — that makes no difference to where the helicopter will be at the next moment. All that matters is where it is now, and the wind direction, and that fully determines what will happen to the helicopter at the next moment; you don't need to remember its detailed previous trajectory, because it's irrelevant. In contrast, say you took an imperfect state representation that's non-Markov, like just the position but not the velocity. Now where the helicopter is does not fully determine where it will be next, because you don't know how fast it's moving, and so you have to look back in time to figure out what its velocity actually is and what its momentum will be, all these kinds of things. Okay, so that's the information state, or Markov state.

[Student] Can I just ask one question on this? You were saying that if we know the state, then we know everything we need to know about future states — but that in itself is not necessarily very helpful, because the state might have nothing to do with the reward.

Okay — don't forget that the history includes all rewards. To say that the state is a sufficient statistic of the future means it's a sufficient statistic of all future rewards, so by definition a Markov state contains enough information to characterize all future rewards.

Okay, so I want to give two brief examples before we move on. One thing to note is that the environment state — if only we had access to it — would be great, because the environment state is Markov by definition: it fully characterizes what will happen next in the environment, because it's what the environment is using to pick the next observation it's going to emit, and the reward. So by definition the environment state is Markov. Also by definition, if we keep the entire history of everything, that is also a Markov state — just not a very useful one. If we retain everything and make our decisions based on the entire history, then the entire history contains as much information as the entire history — it's sort of tautological — so it's Markov. These are two examples showing that there's always a Markov state; it's always possible to come up with one. The only question is how, in practice, we find a useful representation.

Okay, so let's make this concrete with one example. In this example we've got a rat, and I want you guys to be the rat, and I'm going to be the evil experimenter who's going to either give you a piece of cheese or, if you don't do what I like, electrocute you. So here are a few sequences that you observe — you're learning by trial and error. First episode: a light comes on, a light comes on again, you pull a lever, you hear a bell, and then — zap — I electrocute you. Sorry. Second episode: you hear a bell, see a light come on, pull a lever, pull the lever again, and you get a delicious piece of cheese. Third episode: you pull a lever, see a light come on, pull the lever again, hear a bell — and now you need to predict for me: are you going to get electrocuted, or are you going to get a delicious piece of cheese? Let's have a show of hands, based on the experience that you've seen so far.
If you were this rat, how many people think you'll be electrocuted? Okay, a lot of people. How many people think you'll get a delicious piece of cheese? Okay, a few. Can someone say why they thought so — someone from the electrocution camp, why did you think you'd be electrocuted?

Right, so the comment was that the recent history was the same: we've seen light, lever, bell here and light, lever, bell here, so based on the last three items we would expect electrocution. Another way to say that is: if you're the rat, and your agent state — the state you're using to make your decisions — is the last three things in the sequence, then you would believe you're going to be electrocuted. But that's not the only choice of state. You might, for example, choose the agent state to be counts of the lights, bells and levers. If that was your agent state, you might count them up and say: hey, in this example the lever was pressed twice, the light came on once and the bell sounded once — that's exactly the same as the second episode. So with that representation of state, we would expect the cheese to appear. So it should be clear that what we believe will happen next depends on our representation of state. And what about this one: what if the agent state is the complete sequence? What will happen next, if we think what happens next depends on the complete sequence we've seen so far? We don't know — and that's another legitimate case. It might be that what happens next really does depend on the entire sequence, and you have to see all four things before you know what happens next, in which case we just don't have the data to decide yet. So the state representation really defines what we predict will happen next, and in some sense our job is to build an agent state that's useful — one that's effective in doing the best job of predicting what happens next.

Okay, so finally on this part, let's talk about a couple of special cases. First, fully observable environments. This is what we're going to work with for a lot of the remainder of the course; it's the nice case. This is the type of environment where we get to see everything: the agent literally gets to see the environment state — it gets to look inside, see the numbers inside the environment, and work with them. That's the best case. Another way to say it is that all of these things collapse to the same quantity: the observation we see is the same as the agent state, which is the same as the environment state. We get to see the environment state and use it as our agent state. When we work with this type of representation, we arrive at the main formalism for reinforcement learning, which is called a Markov decision process, or MDP for short — and we'll spend the next lecture understanding MDPs in more detail, because they're a really powerful tool. But I really want people, as we go through the rest of the course, to remember these ideas of state representation, because not everything is fully observable, and it's crucial that we understand how to deal with these other, very realistic problems. Luckily, the MDP formalism which we'll develop actually also helps us to deal with those cases too.
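As a formula, full observability is the case where observation, agent state and environment state all coincide (using the a and e superscripts for agent and environment state introduced above):

```latex
O_t = S^{a}_{t} = S^{e}_{t}
```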
So what about the other case? The other case is partially observable environments. This is where the agent indirectly observes the environment: it doesn't get to see everything about the environment. For example, a robot with a camera might not know exactly where it is unless it has GPS — maybe it just gets this camera stream, and it has to figure out for itself where it is within the room; it has to localize. Another example would be a trading agent: it might only observe the current prices, but not the trends or where those prices are coming from — it has to figure out what to do next based on partial information. Or a poker-playing agent: it only observes the public cards, the cards face up on the table, and it doesn't know what's hiding in someone else's hand. These are all examples of partial observability, and in all of these cases we have to build an agent state that's distinct from the environment state, because we don't know the environment state — we don't know what's in the opponent's hand, or what's hiding behind the environment that's driving the prices. It may as well not exist, as far as the brain we're trying to create is concerned. So this needs a different formalism, and it's known as a partially observable MDP, or POMDP. And now our job is to build this agent state. So how do we do that?

Well, there are lots of ways we could do it. The naive approach is just to remember everything: remember all of the observations, actions and rewards we've seen so far and say, hey, that whole sequence is going to be our state — let's work with it. Or we can build beliefs — this is the probabilistic, or Bayesian, approach. At every step you keep a probability distribution over where you think you are in the environment: you say, okay, I don't know what's happening in the environment, but I'm going to have some probability that the environment state is s¹, some probability that the environment state is sⁿ, and so on, and this whole vector of probabilities defines the state we're going to use to decide what to do. So you keep a whole probability distribution over all of those possibilities. That's not the only choice — you don't need to use probabilities at all; it's not necessary, we can have any numbers we want inside our agent. For example, a recurrent neural network: that basically means you take the agent state you had at the previous time step, form a linear combination of that old state with your latest observation, put some nonlinearity around the whole thing, and that gives you a new state that takes account of the latest observation. That's the recurrent neural network approach.
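A sketch of those two choices in symbols, under the assumption that the environment has states s¹,…,sⁿ and that W_s and W_o are learned weight matrices with σ some nonlinearity (these symbols are mine, not defined in the transcript):

```latex
\text{Beliefs:}\quad S^{a}_{t} = \big( \mathbb{P}[S^{e}_{t} = s^{1}],\; \dots,\; \mathbb{P}[S^{e}_{t} = s^{n}] \big)
\qquad
\text{Recurrent update:}\quad S^{a}_{t} = \sigma\!\left( S^{a}_{t-1} W_s + O_t W_o \right)
```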
Okay, any questions before I move on? I think I missed one. (The question, roughly, was to clarify what counts as the state in the Atari example, and whether the rules of the game would be part of it.) So the question was: can I clarify what's going on with the Atari example? Let me just jump forward. I have a figure here to help clarify that, which I was going to show later, but maybe it's helpful to do it now.

So this is what the RL cycle looks like between agent and environment in the Atari example. In the Atari example we basically have the agent making its decisions: it gets to take actions, which are to control the joystick. The environment in this case is the actual machine, and it's got some cartridge in it which determines the rules of the game. We don't know what's on that cartridge, we don't get to see it, and we don't know how it picks the next screens or the next scores it's going to show us. But that cartridge, given the actions we take on our joystick, will determine both what it shows on the screen and also the score. Those things are fed into the agent, and it gets to take a new action as a result. The environment state in this case would be the set of numbers inside the machine: there are 1,024 bits in the emulator, it's a 1,024-bit machine, so there are exactly 1,024 binary numbers characterizing what's going on inside this Atari, and those have to be enough to determine what happens next. And inside the agent we have some representation which, in our particular algorithm, was just a stack of the last four things we'd seen, and that was the information we used to decide what to do next.

Okay, right, so let's open up an RL agent and see what's inside. So far, just to clarify, we've talked only about the problem; we haven't even mentioned how to solve it, we've just talked about what it is to define this reinforcement learning problem. What I'd like to do now is talk about the main components of an RL agent, the main pieces that go inside it, and I'm going to use that to build a bit of a taxonomy of how we talk about reinforcement learning. So an RL agent may include one of these three things. These are the main objects for reinforcement learning; they're not the only ones, so I don't want you to think this is an exclusive list, but these are the main concepts we're going to talk about, the main pieces you may or may not have inside your agent. The first is what we call a policy: this is how the agent picks its actions, its behaviour function, the way it goes from its state to a decision about what action to take. The second idea is what we call a value function: the thing which basically says how good it is to be in a particular state, how good it is to take a particular action, how much reward we expect to get if we take that action in this particular state. That's the idea of the value function, a really central idea in a lot of reinforcement learning: estimating how well we're doing in a particular situation. And the third quantity is what we call a model, which is how the agent thinks the environment works: not the real environment, but the agent's own view of how the environment may or may not work. These are not always required for an agent, but they are three pieces that may or may not be used.

Okay, so let's go through them one by one. The policy is essentially a map from state to action. If we're in some state s, then we could have a deterministic policy; that's one choice, it's just a function. This function pi tells us how to get from some state s, summarizing the situation we're in, to some action a, the decision the agent takes. This policy is really the thing we care about: we want to learn it from experience, and we want it to be such that we get the most possible reward. That's the game we're playing: the brain we're trying to build is the thing that figures out this policy.
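Spelled out (again a reconstruction of the standard notation rather than the slide itself), a deterministic policy is just a function from states to actions:

$$a = \pi(s)$$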
We can also learn a stochastic policy. There's no reason our policies have to be deterministic functions; they can also be stochastic, and often this is very useful: it helps us make random exploratory decisions, for example, to see more of the state space. A stochastic policy is just the probability of taking a particular action conditioned on being in some state, so it's a stochastic map, now, from states to actions. We'll talk about both of these in this course.

Okay, the next central quantity is the value function, so let's really try to unpack this a little. The value function is a prediction of expected future reward. Why do we need this? Well, we need it to say: if you want to choose between state one and state two, or between action one and action two, how do you choose? We want to choose on the correct basis, and the basis we're trying to use here is how much total reward we expect to get in the future. The expected total future reward is what we mean by a value function. Formally, we typically write it something like this, and the value function depends on the way in which you're behaving, so we have to index it by pi. If I had a robot that was falling over a lot, that robot would get a very different total reward from an agent that is standing up or walking effectively, so the amount of reward we get depends on the policy. The value function for a policy tells us how much total reward we expect to get going into the future: we look at the expectation, over the future, of the reward at time t, plus the reward at the next time step, plus the reward at the time step after that, and we can also have discounting going into the future, which says we care more about immediate rewards than later rewards. We'll understand this in more detail next lecture, but just to give you some intuition, this is a really central idea: we care about how much total reward we can get from some state onwards if we follow this particular behaviour. If I'm in my helicopter and I know I'm going to execute some particular trajectory, how much reward will that trajectory get me? That's what we're trying to say here, and if we can compare these things then we've got the basis for good decisions, because we know which to choose: should we pick the trajectory that's going to get us 73 points of reward in total, or the one that's going to get us 65 points? It's obvious when you say it like that. So the value function is a really helpful quantity for optimizing behaviour.
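In symbols (reconstructing the usual notation from the verbal description above rather than copying the slide), the stochastic policy and the value function are:

$$\pi(a \mid s) = P[A_t = a \mid S_t = s]$$
$$v_\pi(s) = \mathbb{E}_\pi\left[ R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \cdots \mid S_t = s \right]$$

where the discount factor gamma, between 0 and 1, is what makes immediate rewards count for more than later ones.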
Okay, so I'm just going to show one more video, to make this idea a bit more concrete. This is the value function in Atari. What we're looking at on the top right is exactly this quantity V we were just talking about, how much reward the agent thinks it's going to get going into the future, and the one down the bottom is how much reward it thinks it will get for each different action. Let me just back up a little. What you'll notice is this sort of oscillating behaviour: each time the ball is getting closer to hitting something, it knows it's just about to get that reward, there's no uncertainty left, so you get these big spikes in the predicted reward, and each time the ball is coming back down, the predicted future reward goes down. And it's only when the agent is actually close that it matters which way it goes: it has to make a decision, when the ball is coming down, whether to go right or left, and until it gets right there it doesn't really matter which one it chooses.

Now for Space Invaders, what we see, just back up again, is basically a big spike when the mothership starts coming along the top. It realizes that the mothership might mean a big spike in possible reward: if it actually shoots that mothership it should expect a very large positive reward, so when that appears in its state, when it starts to see this thing, its prediction of future reward should go up. And that's what we see: the mothership comes along, the value function starts to climb; it misses the opportunity to get it, and the value function goes back down again. Then we see another example: the value function is fairly flat while it's shooting these things, here comes the mothership, this time it shoots it, and the value function still goes back down, because we're always just talking about future reward. The value function is how much reward we expect to get from now on, so now that we've received that reward, we predict less reward from here, because the mothership has already come and probably won't come again for a while. So we've got this agent's internal measure of how much score it's going to get going into the future from here. It's like an oscilloscope of how well it's doing at any given moment in time, and it can be used to make decisions.

Question: this reward is predicted, but does it have a time scale to it? Let me just go back to the time scale. So the question was: is there a time scale to how far ahead we look, a horizon on how far we look into the future for these scores? The answer is that the horizon is given by this discount factor: every step we discount the reward a little bit more, and a little bit more, going into the future, until essentially we don't care about things very far into the future. For that particular example the discount was about 0.99, which means we're essentially looking, say, 100 steps into the future.
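As a quick sanity check on that number (my arithmetic, not a figure from the slides): with a discount of gamma = 0.99, the total weight placed on future rewards is

$$\sum_{k=0}^{\infty} \gamma^k = \frac{1}{1 - \gamma} = \frac{1}{0.01} = 100,$$

so 1/(1 minus gamma) is a reasonable rule of thumb for the effective horizon, which is where the "roughly 100 steps" comes from.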
Question: is there some trade-off between value functions, for example caring about more than just the expectation and preferring an action with a more stable outcome? Okay, so the question is: is there some kind of risk trade-off you can make, can you for example balance the variance of these reward predictions? The answer is that ultimately we've defined a formalism in which what we care about is just the amount of reward we get, and if you care about expected reward then risk is already factored in: you will figure out the behaviour which correctly balances risk automatically if you optimize for this. However, there are people, for example in trading, who specifically take account of risk, and there are risk-sensitive Markov decision processes which do try to account for this explicitly. So if you really want to bring in risk, you can, but I think the right way to think of it is that risk is already accounted for: the behaviour that maximizes value is the one that correctly trades off the agent's risks so as to get the maximum amount of reward going into the future, and that just emerges automatically from the formalism.

Okay, so the third quantity. Remember, we're talking about these different pieces; we've had the policy and the value function, so now let's talk about models. A model is not the environment itself, but sometimes it's useful to imagine what the environment might do: to try to learn the behaviour of the environment, and then use that model of the environment to help make a plan, to help figure out what to do next. Specifically, the way we normally do this is to have two parts to our model, a transition model and a reward model, where the transition model predicts what the next state will be; it predicts the dynamics of the environment. If this was the helicopter, this is the thing that says: if the helicopter is facing this way, in this position, at this angle, and the wind is coming from this direction, then it's likely to move to this position and this angle, and so forth. That's the dynamics of the environment, and we can try to model those dynamics, to learn what will happen to the helicopter at the next step and figure that out. We can also learn to predict how much reward we'll get: we can learn that if the helicopter is in this situation, it's not crashing, and we'll get, say, one unit of reward for staying alive, or whatever it is. Formally, we break this up into a state transition model, which tells us the probability of the next state given the previous state and action, and a reward model, which tells us the expected reward given the current state and action. It's optional to do this, and actually a lot of the course will focus on model-free methods that don't use a model at all, so this isn't a requirement. It's not necessary to explicitly build a model of the environment, but it is something you can do.
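Reconstructing that verbal description in one common notation (indexing conventions vary slightly between texts):

$$\mathcal{P}^a_{ss'} = P[S_{t+1} = s' \mid S_t = s, A_t = a] \qquad \text{(transition model)}$$
$$\mathcal{R}^a_s = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a] \qquad \text{(reward model)}$$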
So let's try to make these concrete in a very simple example, which is a maze. This maze is like a canonical RL problem; we use these little grid worlds for didactic purposes. The idea is that we're going to start here and try to get to this goal here, and there's a minus-one-per-time-step reward. Just like the question earlier, the goal here is essentially to solve the maze as quickly as possible, and that's reflected by the minus one per time step; the problem finishes when the agent reaches the goal. So the optimal solution is just to whiz around it as quickly as possible. The actions are the four connected moves, north, east, south and west, and the state is basically the agent's location in the grid.

So an example of a policy would be represented like this: these arrows are an example policy that represents what the agent would choose to do in any one of these states. This is an example of a deterministic policy function that says: if I'm in this state, go left; if I'm in this state, go up; if I'm in this state, go right. It's a mapping from states to actions, and once you have this mapping you can just read it out and behave: the agent can read out its behaviour and move all the way to the goal. So that's what we mean by a policy: we mean these arrows here.

The value function, remember we're going through these three key concepts, looks something like this: a value for each of these states. We know that if we're just about to reach the goal, only one step away, this state must have a value of minus one; here we're two steps away, so this has a value of minus two; if we go all the way back to here, we have a value of minus 16; and we can have worse values, for example if we're stuck over here and have gone the wrong way, we might have a value of minus 24. Now that we have these values, it's very easy to build an optimal policy, because, for example, when we get here we can just look up the value of going north compared with the value of going south, and it's very clear that the value of going north is greater. So this gives us a basis for optimal behaviour.

And finally, a model. Let's say the agent has seen one trajectory through this maze, moved around like this, and then reached the goal, and now it tries to build a model of how this environment works. You can think of this as the agent building its own map of the environment, trying to figure out what will happen in it, and the model in this case would look something like this. These minus ones represent the fact that it has seen that you get minus one per step, and this map represents what it has understood of the dynamics of the environment so far: it understands that, as you move around, these are the transitions where you can move from state to state. It's not reality, it's just the agent's model of reality.

Okay, so when we categorize RL agents we can basically build a taxonomy now, where we can taxonomize all of reinforcement learning according to these three quantities, and the way that I taxonomize RL is by which of these key components the agent contains. We say that a reinforcement learning agent is a value-based algorithm if it contains a value function, and if it contains just a value function the policy is kind of implicit: it only has to look at the value function and pick the best action, just like when we were here we didn't need an explicit representation of the policy, because we could look at the value function and say, hey, the value function here is higher than the value function there. So you always pick actions greedily with respect to your value function, and that basically gives us a way to behave, which means a value-based algorithm could look like this: just a value function, and no policy, or only an implicit policy.

Question about the previous slide: where were the minus ones on the previous slide from? So the minus ones were the agent's model of how much immediate reward it gets at every step. The setup of this problem was minus one per time step, no matter what happens, and so the agent's map says that each step it takes, no matter where it moves, it's going to get minus one. This isn't the value, it isn't a long-term prediction of reward; it's a one-step prediction of reward, saying that every single step you can go from here to here and get minus one, from here to here and get minus one.
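To make that "implicit policy" idea a bit more concrete, here's a minimal sketch of acting greedily with respect to a value function. The grid, the values and the tie-breaking below are invented for illustration; this is not the maze from the slides.

```python
# Hypothetical state values for a tiny 2x2 grid; in a value-based agent these would be learned.
values = {(0, 0): -3.0, (0, 1): -2.0, (1, 0): -2.0, (1, 1): -1.0}
step_reward = -1.0  # minus one per time step, as in the maze example

moves = {"north": (-1, 0), "south": (1, 0), "east": (0, 1), "west": (0, -1)}

def greedy_action(state):
    """Pick the action whose one-step look-ahead value (step reward + next-state value) is largest."""
    best_action, best_value = None, float("-inf")
    for action, (dr, dc) in moves.items():
        next_state = (state[0] + dr, state[1] + dc)
        if next_state not in values:   # treat moves off the grid as unavailable
            continue
        value = step_reward + values[next_state]
        if value > best_value:
            best_action, best_value = action, value
    return best_action

print(greedy_action((0, 0)))  # -> "south" here (ties broken by iteration order; "east" is equally good)
```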
Question: but we didn't define which direction it would go; down here it's just a single number? So the numbers here represent the immediate reward from each state, and it's the same number for all actions, just as an easy way to define this. In general you would want a different number for each action. There are different formalisms for reward, sometimes action-dependent and sometimes action-independent. It's a good question, but it's a detail.

Okay, so to categorize the agents: they can be value-based, or they can be policy-based. To build a policy-based agent means that, instead of representing the value function inside the agent, how well it will do from each of these states, we explicitly represent the policy and we work with the policy. So those arrows we had: a policy-based agent would maintain some kind of data structure equivalent to those arrows and work directly in that space. It would look at how well it does and adjust those arrows so as to get the most possible reward, without ever explicitly storing the value function. So that's what a policy-based agent means: a policy-based agent is one that stores the policy, and a value-based agent is one that stores the value function. We can also have what's called an actor-critic, which combines them both and tries to get the best of both worlds. An actor-critic, which we'll see later in the course, is an agent that stores both the policy, directly representing those arrows, and also how much reward it's getting from each state.

Question: what do you do in the way of exploration? I'm going to come back to that; the question was about exploration, and I've got a section on that coming.

Okay, so to finish the taxonomy: we can either have a model or not have a model. One of the major distinctions in reinforcement learning, perhaps the fundamental distinction in RL, is between model-free and model-based reinforcement learning. Model-free means that we do not try to explicitly understand the environment; we don't try to build the dynamics of how the helicopter moves. Instead we go directly to the policy or the value function: we just see experience and figure out a policy for how to behave so as to get the most reward, without going through the indirect step of figuring out how the environment works. That's model-free reinforcement learning, and a lot of the course will be about understanding how that can even be possible, how you do this without building a representation of the environment first. Alternatively, we can do model-based reinforcement learning, where our first step is to build up a model of how the environment works, we first build our dynamics model of the helicopter, and then we plan with it: we do look-ahead, we ask what would happen if we looked ahead with this model, and we figure out the optimal way to behave. So we have something like a Venn diagram now, where we can either have a value function or not, have a policy or not, and have a model or not. Ultimately the agent has to behave, it has to select actions, and to select actions it really needs a policy to pick them, or a value function that implicitly defines a policy, so the majority of algorithms look like that.
Okay, so this is our taxonomy of RL agents, and if you pick up a random paper on reinforcement learning, it will fit in somewhere in this diagram.

Okay, so briefly, for the last five or ten minutes, I just want to talk about some problems within reinforcement learning. We've talked about the reinforcement learning problem overall, but there are some key sub-problems. We're not going to talk about how to solve them; my goal at this stage is just to name them, make you aware of their existence, and get you to start thinking about them.

So let's start with learning and planning. I think this is maybe the most important conceptual distinction to understand. There are actually two different problems in sequential decision making. If we go back to this science of what it means to make optimal decisions, there are two very different problems we might care about. First of all, there's what we call the reinforcement learning problem, where the environment is unknown: the agent isn't told how the environment works. You literally just drop your robot onto the factory floor and tell it to get the most reward, or you turn on your power station and tell it to go and figure out how much reward it can get. You don't tell it how the wind blows or how the factory operates; it just figures things out for itself. And the way it does that is by interacting with the environment and, through those interactions, figuring out a better policy that's going to maximize reward and get the most possible future reward. So this is the reinforcement learning problem: the agent doesn't know about the environment, it interacts with the environment, and essentially through trial-and-error learning it figures out the best way to behave within that environment and maximize its reward.

Okay, there's a second problem setup, which we call planning, and planning is different: we tell the agent the environment. In the planning problem we say, this is exactly the environment you're working within; we tell it all the rules of the game, if you like. We tell it the details of the dynamics of the wind for the helicopter and how helicopters move, we give it the differential equations describing the wind and so forth. The model of the environment is fully known, and instead of interacting with the environment the agent performs internal computations with its model. It's given this perfect model, it's given computation time, and it interacts with its model; it doesn't actually have to do anything in the real world, it just thinks for a while, figures out what to do next, maybe doing look-ahead planning, and as a result of those interactions it improves its policy and comes up with a way to behave. But they're very different setups: in RL the environment is unknown, in planning the environment is known. Now, of course, one way to do reinforcement learning is to first learn how the environment works and then do planning, so these two things are intimately linked, but they're separate problem setups.

So let's try to make this concrete again with the Atari problem. I already went through this slide earlier, but this was the reinforcement learning problem: no one told us how this emulator works, we just put our machine in front of it, and it tries to figure it out through trial and error. We don't know the rules of the game; we just see the scores and the screen and get to control the joystick. That's the RL problem.
What about the planning problem? What would that look like? What if someone actually told us how the emulator works, what if we actually had access to a perfect emulator of these Atari games and were able to do look-ahead planning with it? We can query the emulator; you can think of it as our perfect model now. We could say: if I was in this state here, in this game, Seaquest, and I moved the joystick right, where would I end up next? Or, if I'm in this situation here and I move the joystick left, where will I end up next? And this perfect model will tell us what happens. That gives us the ability to do things like look-ahead search, or tree search, or planning. There are all these different planning methods where we can think ahead and say: well, if I went right, and then went right again, I'd get some oxygen, and then I'd shoot these different fish and end up with a reward of 100, without ever actually interacting with the real environment. So you can build this whole search tree and figure things out, and that's the planning process. But it's a different problem setup: we're told the rules of the game in advance, the environment is known.

Question: what if there are more than a couple of actions, say a multidimensional action space? Doesn't this explode into many possibilities? Right, okay, so the question is what happens to the branching factor of this search space. At the moment we're just talking about problem definitions; we'll have much more on how to plan effectively and efficiently, and deal with large branching factors, in later classes. So in practice we can do this.
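As a minimal sketch of what "planning with a perfect model" means: the `model(state, action)` function below is a hypothetical stand-in for the emulator, assumed deterministic, and discounting is ignored. Real planners are far cleverer than this brute-force look-ahead, whose cost grows exponentially with exactly the branching factor that was just asked about.

```python
def plan(state, model, actions, depth):
    """Exhaustive look-ahead: return (best total reward, best first action) over `depth` steps."""
    if depth == 0:
        return 0.0, None
    best_value, best_action = float("-inf"), None
    for a in actions:
        next_state, reward = model(state, a)                 # query the model, never the real world
        future_value, _ = plan(next_state, model, actions, depth - 1)
        value = reward + future_value                        # total reward along this branch
        if value > best_value:
            best_value, best_action = value, a
    return best_value, best_action

# e.g. value, action = plan(start_state, emulator_model, ["left", "right", "fire"], depth=5)
```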
So, just before I take more questions, I want to cover exploration, since someone asked about it. This is another problem within reinforcement learning: how do we balance exploration and exploitation? It's another key part of the identity of what it means to do reinforcement learning. Reinforcement learning can be thought of as a type of trial-and-error learning: we don't know what the environment looks like, we have to figure it out through trial and error, we have to go and find out which parts of the space are good and which parts are bad. But the problem is that, while we're doing that, we might actually be losing reward along the way. The goal is to figure out a good policy, to find the best part of this space, to find where the gold is. Imagine a pirate hunting for treasure: it wants to find the policy with the most gold, but along the way it wants to make sure it isn't giving up opportunities while it searches for the treasure chest; it shouldn't be giving up the chance to exploit the things it has already discovered. So one way to say this is that the agent interacts and gathers experience, and it's trying to find the best possible policy without losing too much reward along the way while exploring; it has to balance exploration and exploitation. In other words, exploration means choosing to give up some reward that you know about in order to find out more about the environment. Maybe, given everything you've seen so far, you think that by going left you'll get the most reward, but you know you haven't really explored to the right very much, so maybe it's better to go right for a while, discover that there's a big treasure chest there, and end up getting a lot more reward in the long run.

So we've got this balance between exploration and exploitation: exploration means finding out more about the environment, exploitation means exploiting the information you've already found to maximize your reward. This exploration-exploitation trade-off is universal to reinforcement learning, and it's kind of special to reinforcement learning; it doesn't really come up elsewhere in machine learning.

So let's have some examples. You all go to restaurants occasionally; what happens if you're trying to pick the best restaurant? Exploitation means you go to your favourite restaurant, and exploration would be trying a new restaurant. Now, how do you balance these? If you never explore and keep going to your favourite restaurant, you can get stuck on something suboptimal; but if you explore all the time, you end up eating food you don't like, because you're exploring a lot when maybe you should have gone to your favourite restaurant a little more often. So you have to balance these things out, so that you end up finding the best restaurant while eating delicious food along the way. The classic example of this, one of the big success stories of machine learning, is online advertising, which is a big use case for a lot of this machinery. You're trying to decide which advert to show on a banner on some website, like Google AdSense or something similar. Exploitation would be to show the most successful advert, the one people are clicking on the most, but sometimes you also need to explore and show some different adverts that haven't been displayed before, to see if people will click on them; maybe those adverts will turn out to be better and make more income. Or oil drilling: you might want to drill at the best known location, you know there's likely to be oil over here, so maybe you just keep drilling and drilling, but sometimes it might be a good idea to try a new location, drill down a bit and see if there's oil there. Or in game playing: you might play the move you believe is best right now, or you might try a completely different strategy, something quite experimental, and see how your opponent responds; maybe over the course of many games that will end up being more effective.
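One common way to strike this balance, not something defined in this lecture (the exploration problem is covered properly later in the course) but a tiny illustration of the idea, is to exploit most of the time and explore with some small probability. The restaurant values below are invented for the example.

```python
import random

# Hypothetical estimated values for the restaurant example; in practice these would be learned from experience.
estimated_values = {"favourite restaurant": 8.0, "new thai place": 5.0, "new sushi place": 6.5}

def epsilon_greedy(values, epsilon=0.1):
    """With probability epsilon, explore (pick at random); otherwise exploit the best-known option."""
    if random.random() < epsilon:
        return random.choice(list(values))     # explore: give up some known reward to learn more
    return max(values, key=values.get)         # exploit: use what we already know

# Most calls return "favourite restaurant"; occasionally another choice is tried, which is the only
# way the estimates for the other restaurants could ever improve.
```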
Okay, so just to finish, the final problem I'll cover briefly is prediction and control. This is another distinction within reinforcement learning. Prediction basically means: how well will I do if I follow my current policy? Someone gives me a policy, say, walk straight forward, and asks how much reward I will get. That's a different problem from someone asking: what's the optimal policy, which direction should you walk to get the most reward? And typically what happens in reinforcement learning is that we need to solve the prediction problem in order to solve the control problem: we need to be able to evaluate all of our policies in order to figure out the best one. So if we go back to our grid world example, you can have a grid world like this, again an agent moving around a grid, getting minus one per time step, but if it goes to A it gets a plus-ten reward and teleports to A', and if it goes to B it gets a plus-five reward and teleports to B'. We can ask: what's the value function for the uniform random policy? If we just move around this grid randomly, how much reward will we get? That's a prediction problem: we're asking, for this fixed policy of behaving randomly, how much reward we will get. And this is the answer you get: at best, if you're in this top state here, you get roughly 8.8 units of reward in total going into the future, and we'll see how to solve these things in subsequent lectures. If we were doing control, we would ask: what's the optimal behaviour in this grid world? If I'm anywhere on this grid and I behave optimally, what's the value function now? And the value function is very different. Looking again at that best state, where before we were getting 8.8, if we behave optimally in this grid world the optimal value, according to the best possible behaviour, is more than 24 units of reward, because we're going to follow this optimal strategy and loop round and round through these links, getting more and more reward. So now, if we're in state A, we can expect to get much more reward in the future. The optimization problem, the control problem, is very different from the prediction problem.

Okay, so I just want to finish with the course outline, so people understand what's coming next. The course is roughly divided into two halves. I'm afraid, because of last week, it won't align with reading week, but we'll figure that out as we go. The first part roughly covers the basics, a really fundamental understanding of reinforcement learning: introducing the problem, which we've just done; understanding Markov decision processes; solving the planning problem using dynamic programming, which gives the elementary building blocks that underlie all of this; and then understanding model-free methods, first for prediction and then for control. At the end of this part you won't really be able to go off and solve interesting problems, and you shouldn't have that expectation, but all of the ideas we'll learn here do scale up, and in the second part of the course we'll understand how they scale up and apply to more interesting problems: using value function approximation, policy gradient methods, integrating learning and planning with more sophisticated approaches, and solving the exploration-exploitation problem. That's where things will become really interesting. And then finally, in the last lecture, the one we'll have to find a good time for, I'll give a case study covering some of the success stories of reinforcement learning in games, and we'll see that RL has been used to achieve not just world-champion performance in backgammon but actually across all of the different classic games are