Hello and welcome to this course on reinforcement learning. My name is Hado van Hasselt; I'm a research scientist at DeepMind in London, and every year we teach this course on reinforcement learning at UCL. This year is a little different: due to the pandemic situation with COVID-19 we are pre-recording the lectures, so instead of talking from a lecture hall I'm talking to you from my home. The topic, as mentioned, is reinforcement learning. I will explain what those words mean, and over multiple lectures we'll go into some depth on different concepts and different algorithms that we can build. I'm not teaching this course by myself: some of the lectures will be taught by Diana Borsa and some by Matteo Hessel. Today will be about introducing reinforcement learning. There is also a really good book on this topic by Rich Sutton and Andrew Barto, which I highly recommend and which will serve as background material for this course; if you go to the URL shown on the slide you can access a free copy of that book.

Just a little bit of admin before we get started. For students taking this for credit at UCL, there is a portal called Moodle which we will use to communicate with you, so please check it for updates, and please use the forum there for asking questions. If you do, then when we answer those questions other people can also benefit from the interaction: multiple people might have the same question, or might have a question without even realizing it, so it's very useful to ask publicly on that forum. In terms of grading, we will have assignments which will be graded; this year there will not be an exam.

So now about this course specifically. The main question, especially for this first lecture, is simply: what is reinforcement learning? I'll explain it a little and then we'll go into a lot of depth on different subtopics of this question. To understand what reinforcement learning is, it's actually useful to first ask what artificial intelligence is and how the two are related, because it turns out they are closely related. To explain at least what I mean when I say artificial intelligence, I'm going to pop up a level and turn first to the industrial revolution. This is a period that started a couple of hundred years ago, and one could argue it was all about automating repeated physical solutions, manual solutions if you will. Think for instance of a steam train or a steamboat, and how these replaced the manual labour of pulling a cart yourself, or the animal labour of horses drawing those carts. Of course some of that still happens, we still have manual labour, but we replaced a lot of it with machines. This led to the machine age, where we started replacing more and more things with machines, and in addition came up with new things that we could solve with machines, even things we weren't doing before. This led to a huge productivity increase worldwide, and it fed into a new stage that you could argue comes after it, which you could call the digital revolution. One way to interpret the digital revolution is to say it was all about automating repeated mental solutions.
A classic example here is the calculator. We know how to add two numbers together and how to multiply two numbers together, and in fact we know it precisely enough that we can write a program, implement it on a machine, a computer if you will, and automate that process so that it is very fast and very precise, thereby replacing the slower mental calculations we had to do before. This of course also led to a lot of productivity increase. But both of these phases have something in common: we first had to come up with the solutions ourselves.

I'm going to argue that there is a next stage, which is already ongoing, and that is to allow machines to find solutions themselves. This you could call the domain of artificial intelligence. It has huge potential upside, because if we can build machines that learn to find solutions for themselves, this takes away the burden on us to find a solution in advance and then automate it. Instead, all we need to do is specify a problem and a goal, and have the machine figure out how to solve it. As we'll see later, this will often involve interaction: you have to have some data to find the solution, and this means there is a process of learning. In addition it requires making decisions autonomously. So I'm putting these terms front and centre: learning, autonomy, and decisions. They are all quite central to this generic problem of trying to find solutions.

Of course we're not the first to talk about artificial intelligence; it has been a topic of investigation for many decades now. There is a wonderful paper by Alan Turing from 1950, called "Computing Machinery and Intelligence", and the very first sentence of that paper reads: "I propose to consider the question, 'Can machines think?'" I recommend you read this paper: it is wonderfully written, very accessible, and it has lots of really interesting thoughts. There is one paragraph I want to highlight specifically, and I'll read it to you now. Turing writes:

"In the process of trying to imitate an adult human mind we are bound to think a good deal about the process which has brought it to the state that it is in. We may notice three components: the initial state of the mind, say at birth; the education to which it has been subjected; and other experience, not to be described as education, to which it has been subjected. Instead of trying to produce a programme to simulate the adult mind, why not rather try to produce one which simulates the child's? If this were then subjected to an appropriate course of education one would obtain the adult brain. Presumably the child brain is something like a notebook as one buys it from the stationer's: rather little mechanism, and lots of blank sheets. Mechanism and writing are from our point of view almost synonymous. Our hope is that there is so little mechanism in the child brain that something like it can be easily programmed."

So what is Alan Turing talking about here? He is essentially talking about learning. He is conjecturing that trying to write the program which constitutes an adult mind might be quite complicated, which makes sense, because we are subjected to a lot of experience throughout our lives. This means we learn a lot; you can think of this as rules or pattern matching that we learn how to do, skills that we acquire.
Enumerating all of that, describing all of it clearly and cleanly enough that you end up with something with the same capabilities as an adult mind, he is conjecturing, might be quite tricky. Maybe it's easier to write a program that can itself learn, in the same way we do, or a similar way, or maybe a slightly different way, by interacting with the world, in his words by subjecting itself to education, to find similar solutions to the ones the adult mind has found. He conjectures that maybe this is easier. This is a really interesting thought, and it's worth dwelling on, so this might be a good time for you to pause the video and ponder whether you agree with the conjecture: that it might be easier to write a program that can learn than to write a program that already has the capabilities the learning program will eventually achieve.

So what is artificial intelligence? One way to define it would be that the goal is to be able to learn to make decisions to achieve goals. This is not the only possible definition; other people have proposed slightly different or vastly different versions, and I'm not going to argue that this is the best definition, or that there aren't other types of artificial intelligence we could consider. But this is the one that is central to us. We are basically going to ask how we could build something that is able to learn to make decisions to achieve goals; that is our central question. Note that learning, decisions, and goals are all central concepts here, and I'll go into a bit more detail on what I mean by each of them.

This brings us to the question: what is reinforcement learning? It is related to the experience Alan Turing was talking about, because we know that people and animals learn by interacting with their environment, and this differs from certain other types of learning in ways that are good to appreciate. First of all, it is active rather than passive, and we'll come back to that extensively in the next lecture. What this means is that you are subjected to some data, some experience if you will, but that experience is not fully out of your control: the actions you take might influence the experience you get. In addition, interactions can be sequential: future interactions might depend on earlier ones. If you do something, it might change the world in such a way that later other things become possible or impossible. We are also goal-directed: we don't just randomly meander, we do things with a purpose, both at a large scale and at a small scale. I might have the goal of picking up a glass, for instance; that is a small thing perhaps, but you can think of it as a directed action, and of course it consists of many little micro-actions of me sending signals to my muscles to actually execute it. Finally, we can learn without examples of optimal behaviour. This one is interesting, and it's good to think about what I mean by it, because obviously we are subjected to education, such as engineering courses, so we do get examples of behaviour that we want, or that other people want us to exhibit, and in many cases we try to follow those examples.
But what I mean here is something a little bit different. When you pick up a cup, maybe somebody showed you at some point that it's useful to pick up a cup, or, perhaps a better example, somebody taught you how to write or how to do maths, but nobody actually told you exactly how to steer your muscles so as to move your arm to pick up a pen or a cup. So clearly we still learn some sort of behaviour there, we learn to control our muscles, but not in a way where somebody tells you exactly how you should have moved your muscles so that you can just replicate it. That is what I mean when I say we learn without examples: nobody gives you exactly the low-level actions required to execute the thing you want to execute. And this maybe constitutes most of the learning we do: we may interpret something we see as an example, but typically at a much higher level of abstraction, and in order to actually fill in that example, to execute what we want to mimic, we might still have to learn skills in a much more autonomous way, without clear examples. One way to think about this, and I'll come back to it, is as optimizing some reward signal: we want to achieve something, and achieving it gives us some sense of satisfaction or happiness, and this is what steers our behaviour; we notice that some things are more pleasing than others.

This brings us to a very central picture that I'm going to show multiple times during this course: the interaction loop. One can perceive this as the setting we find ourselves in: we are considering an agent interacting with an environment. Here I drew them separately, but you could also think of the agent as being inside that environment; there is a huge world out there and the agent lives somewhere in that world. This could be quite concrete: the agent could be a robot and the environment the real world. It could also be much more abstract: the environment could be some abstract game, or a virtual environment, or the internet, and the agent could be some program that tries to interact with that environment. So it's quite a flexible framework. We basically say that the agent executes actions and observes the environment. This is typically drawn as I did here, with the actions going from the agent into the environment and the observations going from the environment into the agent, but you could also think of the observations as something the agent pulls in. In fact the observations typically depend on the agent, because the agent has some sensorimotor stream defined by its interface: for instance, the agent could have a camera, and that defines which observations it gets. The main purpose of this course is then to go inside that agent and figure out how we could build learning algorithms that help the agent learn to interact better. And what does better mean here? The agent is going to try to optimize some reward signal; this is how we will specify the goal.
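To make this interaction loop concrete, here is a minimal sketch in Python. The environment and agent objects and their methods (reset, step, act, observe) are hypothetical names I'm introducing just for illustration, not a specific library API; the sketch simply assumes the environment returns an observation and a reward on each step, which, as noted, could equally be thought of as a signal internal to the agent.

```python
def run_interaction_loop(environment, agent, num_steps):
    """Minimal agent-environment loop: act, observe, repeat.

    Assumed (hypothetical) interfaces:
      environment.reset() -> first observation
      environment.step(action) -> (next observation, reward)
      agent.act(observation) -> action
      agent.observe(action, reward, observation) -> lets the agent update itself
    """
    observation = environment.reset()
    total_reward = 0.0
    for _ in range(num_steps):
        action = agent.act(observation)                  # agent selects an action
        observation, reward = environment.step(action)   # environment responds
        agent.observe(action, reward, observation)       # agent can learn from this step
        total_reward += reward                           # cumulative reward over the interaction
    return total_reward
```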
The goal is not just to optimize the immediate reward: we're not interested in taking an action and then forgetting about everything that might happen afterwards. We are interested in optimizing the sum of rewards into the future. I'll explain this a little more clearly in a moment, but it's good to appreciate that there must be some goal: if no goal is specified, it's unclear what we're actually optimizing and unclear what the agent will learn to do, so we need some mechanism to specify that goal. In many cases, when people show versions of this interaction loop, they put the rewards next to the observation, and that's one useful way to think about it: you take an action and the environment gives you an observation and a reward. But sometimes it's more natural to think of the reward as being internal to the agent. You could think of the reward signal as a preference function over observations, or over sequences of observations, that the agent receives: the agent observes the world, feels happier or less happy about what it sees, and tries to optimize its behaviour so that it achieves more of these rewards. This is why I didn't put the reward in the figure: sometimes it's easier to think of it as coming from the environment, external to the agent, and sometimes it's easier to think of it as in some sense part of the agent. But it should be there, and it should be clearly specified, because otherwise it's unclear what the goal of the whole system would be.

This reward function is quite central, so it's good to stop and think about why this is a good way to specify a goal. This is formulated in the reward hypothesis that we see on the slide, which states that any goal can be formalized as the outcome of maximizing a cumulative reward. I want to encourage you to think about that critically, and to try to break it: breaking it would mean coming up with a counterexample, a goal that you cannot specify by maximizing a cumulative reward. Feel free to pause the video and think about this for a bit. I have not been able to come up with counterexamples myself, and maybe the hypothesis is even somewhat trivially true, because you could define a reward signal that simply checks whether you have achieved the goal you want to specify: the reward is one whenever you have achieved the goal and zero before that. Optimizing this cumulative reward then clearly corresponds to achieving that goal. That doesn't mean it's easy: sometimes it's hard to specify your goal precisely, and sometimes it's hard to specify a reward that is easy to optimize, but that is a completely different problem which is not covered by the reward hypothesis; the hypothesis just states that such a reward must exist. Indeed, there are often many different ways to specify the same goal. For instance, instead of a reward of plus one whenever you achieve the goal and zero before that, you could give a reward of minus one, in some sense a penalty, on every step before you have achieved the goal, and zero rewards afterwards. Then you can think of the agent maximizing this cumulative reward as minimizing these penalties, which would also lead to the behaviour of achieving the goal, but now as quickly as possible, because minimizing the number of minus-one rewards, the number of steps until you achieve the goal, becomes relevant.
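As a tiny illustration of these two equivalent specifications, here is a sketch; goal_reached is a hypothetical predicate, true once the goal has been achieved.

```python
def reward_goal_bonus(goal_reached):
    """+1 once the goal is achieved, 0 before that."""
    return 1.0 if goal_reached else 0.0

def reward_step_penalty(goal_reached):
    """-1 on every step before the goal, 0 afterwards.

    Maximizing the cumulative sum of these rewards amounts to minimizing the
    number of steps before the goal, i.e. reaching it as quickly as possible.
    """
    return 0.0 if goal_reached else -1.0
```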
So we see that a goal can include not just that something happens but also when it happens, if we specify it in this way. It's quite a flexible framework, and it seems to be a useful one that we can use to create concrete algorithms that work rather well.

Here are some concrete examples of reinforcement learning problems: flying a helicopter, managing an investment portfolio, controlling a power station, making a robot walk, or playing video or board games. All of these examples were picked because reinforcement learning has actually been applied to them successfully. For instance, we could have a reward function for the helicopter that is related to air time or to the inverse distance to some goal; for the video or board games you could think of a reward function that just looks at whether you win or not. Think of the game of chess: you could have a reward function that gives you plus one whenever you win and minus one whenever you lose. If the goal is to learn via interaction, these are all reinforcement learning problems, irrespective of which solution you use. I put that on the slide because sometimes people conflate the current set of algorithms that we have in reinforcement learning with the field of reinforcement learning itself, but it's good to separate those: there is the reinforcement learning problem, and there is a set of current solutions that people have devised to solve that problem. That set of solutions is under a lot of development and might change over time, but it is good to first think about whether we agree with the goal, with the problem statement, and if we agree with the problem statement we can think flexibly about the solutions; we don't have to be dogmatic about them, and we can think about new solutions that achieve the same goal. I would argue that if you're working on any of these problems, where there is a reward function and sequential interaction, then you're doing reinforcement learning, whether or not you call your algorithms reinforcement learning algorithms.

In each of these reinforcement learning problems there are actually two distinct reasons to learn. The first, maybe obviously, is to find solutions. Going back to the helicopter example, you might want to find a policy, a way of behaving, for the helicopter so that it flies to a goal as quickly as possible; but maybe, in order to optimize its cumulative reward, it also sometimes has to do more complicated things, such as first going somewhere else to refuel, because otherwise it won't even reach the goal. In the end you run some learning process and you find a solution. Two concrete examples: you might want a program that plays chess really well, or you might want a manufacturing robot with a specific purpose, and reinforcement learning could potentially be used to solve these problems and then to deploy the solution. A subtly but importantly different thing you might want is a system that can adapt online, and the purpose of that would be to deal with unforeseen circumstances. To take the same two examples and contrast how this is different:
in the chess program, for instance, you might not want a program that plays the most optimal chess it can find; instead, you might want a program that learns to adapt to you. Why would you want that? Well, you might want a program that doesn't win too often, because then maybe your enjoyment is less. So instead of optimizing the number of times it wins, maybe it should optimize so that it wins roughly half of the time, or maybe it should optimize how often or how long you play it, because that might be a good proxy for how much you enjoy playing it. Similarly, you can think of a robot that learns to navigate unknown terrains. Maybe you can pre-train the manufacturing robot from the first example because you have a very good simulator for the setting it will be in, but in other cases you don't have a very good simulator, or you have good simulators for various types of terrain but you do not yet know what the terrain will look like where the robot will be deployed. There might be unknown unknowns, things you haven't foreseen, and in those cases it's obviously quite useful if the system can continue to adapt, can continue to learn. We do that as well: we continue to learn throughout our lifetimes. So that is a different purpose, but fortunately reinforcement learning can provide algorithms for both of these cases. It's still good to keep in mind that they are different goals, and sometimes that becomes important. Note that the second point, about algorithms that can adapt online, is not just about generalizing; it's not about finding a solution as in the first category but one that is very general in some sense. It's really about unknown unknowns, about what happens if the environment changes. For instance, you have a robot that gets deployed, and at some point there is wear and tear that you haven't foreseen; there was no way to know exactly what would happen, and all of a sudden the robot has to deal with it somehow. If it can't adapt online, it's really hard to find a solution that is generic enough, general enough, to deal with that. Indeed, there are other reasons why it might be useful to learn online: it might be easier to have a smaller program that continues to track the world around it than to find one humongous solution that can deal with all of the unforeseen circumstances you could possibly come up with. So these are really different settings.

Okay, so now we are finally ready to answer the question: what is reinforcement learning? I'm going to say that reinforcement learning is the science and framework of learning to make decisions from interaction. Reinforcement learning is not a set of algorithms, and it is also not a set of problems; sometimes, as shorthand, we say reinforcement learning when referring to the algorithms, but maybe it's better to say reinforcement learning problems or reinforcement learning algorithms when we want to single out those two different parts, and reinforcement learning itself is the science and the framework around all of that. This has some interesting properties. It requires us to think about time and about the consequences of actions, and this is a little different from many other types of learning.
In much of machine learning, for instance, you are given a data set and you want to find, say, a classifier, and then maybe there are no long-term consequences: you basically just specify that your goal is to minimize the errors that the system makes. In reinforcement learning we would argue that maybe you want to consider the whole system: not just the classifier, but also the consequences of classifying something wrongly, and that can be taken into account if you consider the whole framework. This makes it more challenging. It also means we have to actively gather experience, because our actions will change the data that we see. We might want to predict the future, and not just one step ahead: unlike a classifier, where you get an input and you're only interested in the output for that specific input, we might want to consider steps further into the future, which is an interesting and tricky subject. In addition, and this is more typical of machine learning in general, we have to deal with uncertainty somehow. The benefit of all this is that the potential scope is huge, but you might also have realized that this is a very complicated and difficult question in general: how to solve this very generic problem. The upside is that if we are able to find good generic algorithms that can deal with this very generic setting, then maybe we can apply them to many different problems successfully. Indeed, one way to think about reinforcement learning is that it is a formalization of the AI problem as I defined it earlier, so it's good to appreciate the ambition here: reinforcement learning is quite an ambitious endeavour. That doesn't mean it sits on an island. In fact, we will see during this course that current-day reinforcement learning is very synergistic with deep learning, which is all about training deep neural networks, and which seems a very suitable component for the full AI problem. The reinforcement learning description is just about formalizing the problem; that doesn't mean we don't need solutions from all sorts of subfields of machine learning.

Okay, now I'm going to show you an example. What we see here is an Atari game, an old video game from the 1980s called Beam Rider, and the agent that is playing this game has learned to play it by itself; its observations were the pixels, just as you see them on the screen. Here is a different Atari game with different pixels. In each of these cases the actions the agent takes are the motor controls, which are basically the joystick inputs for the Atari games: the agent can press up, down, left, right, or diagonally, and it can press a fire button, and then it just has to deal with that input-output stream. It gets these observations, the pixels from the screen, and it outputs joystick controls, and we see that it did relatively well at learning to play each of these different games, even though they are quite different. Here is a racing game, Enduro, and it's good to appreciate that the agent is not even told what it is controlling: it just gets pixels. It is not told that there is this thing at the bottom which is meant to be a racing car and that your goal is to pass the other cars. Instead, you just get the pixels and your motor controls.
You also get a reward signal. In these games the reward signal was defined as the difference in score on every time step. On a lot of time steps this difference is zero, and that's fine, but on other time steps it will be positive, and the agent tries to maximize the summation of that over time: it wants to take actions that will lead to good rewards later on. The most important thing to take away from this is that we used a learning system to find these solutions, but we didn't need to know anything about these games ourselves. Nothing was put into the agent in terms of strategy, or even in terms of prior knowledge about what it is controlling on the screen. The agent, when it started playing Space Invaders, did not know that it was going to control this thing at the bottom which is shooting, or that it was controlling one of these boxes in this other example. That is the benefit of having a generic learning algorithm. In this case the algorithm is called DQN, and we'll discuss it later in the course as well.

Okay, now I'll go back to the slides. I've given you a couple of examples, I've shown you these Atari games, and now is a good time to start formalizing things a little more completely so that we know what is happening. In future lectures we will make this much more clear and rigorous; for now I'm going to give you a high-level overview of what happens here: what is this reinforcement learning problem, what is inside the agent, how could this work. We go back to the interaction loop and introduce a little bit of notation. At every time step t the agent receives some observation O_t and some reward R_t; as I mentioned, the reward could also be thought of as being inside the agent, as some function of the observations, or you can think of it as coming with the observations from the environment. The agent then executes some action A_t, which can be based on the observation O_t and the earlier interactions; the environment receives that action and emits a new observation, or we can think of the agent as pulling in a new observation, together with a next reward. Note that we increment the time step after taking the action: we say the action is emitted at time step t, and the next observation is received at time step t+1. That's just convention; this is where we increment the time index. You can actually extend reinforcement learning to continuous time rather than these discrete time steps, but we won't cover that in this course; the extensions are in some sense not too difficult, but there are some subtleties one would have to consider. The reward here is a scalar feedback signal, just a number; it can be positive or negative. A negative reward you could call a penalty, but we just call it a negative reward, so that we have one word that refers to the feedback signal. And just to recall, I've put the reward hypothesis on the slide again: any goal can be formalized as the outcome of maximizing a cumulative reward. This instantaneous reward indicates how well the agent is doing at that time step t, and it helps define the goal of the agent. The cumulative reward is the accumulation, the sum, of these rewards over time, and it's useful to devote a letter to it, which we will call G. Roughly speaking, you can think of G as specifying the goal,
but we will use the term return to refer to it. The return is just shorthand for the cumulative reward, the sum of rewards into the future. Note that the return is only about the future: at some time step t, this is what is useful for determining which action to take, because your actions cannot influence the past, they can only influence the future. So the return is defined as all of the future rewards summed together; the past rewards are in the past and we can't change them anymore. We can't always hope to optimize the return itself, so instead we define the expected return, which we call a value. The value of a state s is simply the expectation of the return, the sum of rewards going into the future, conditioned on being in that state s. I haven't defined what a state is yet; for simplicity you can think of it for now as just being your observation, but I'll talk more about that in a moment. This value depends on the actions the agent takes, and I will make that a little more explicit in the notation later on; it's good to know that the expectation depends both on the dynamics of the world and on the policy the agent is following. The goal is then to maximize values: we want to pick actions such that this value becomes large. One way to think about this is that rewards and values together define the utility of states and actions, and there is no supervised feedback: we're not saying this action is correct and that action is wrong. Instead we're saying this sequence of actions has this value, that sequence of actions has that value, and then maybe we pick the one with the highest value. Conveniently, and this is used in many algorithms, the returns and the values can be defined recursively: the return at time step t can be thought of as the first reward plus the return from time step t+1, and similarly the value of a state s is the expected first reward you get after being in that state, plus the value of the state you expect to be in afterwards. So the goal is maximizing value by taking actions, and actions might have long-term consequences. This is captured in the value function, because the value is defined as the expected return, and the return sums the rewards into the future. One way to think about this is that the actual rewards associated with certain actions can be delayed: you might pick an action that has consequences later on which are important to keep in mind, but which do not show up immediately in the reward you get right after taking that action. This also means that sometimes it is better to sacrifice immediate reward to gain more long-term reward, and I'll talk more about that in the next lecture. Some examples: as mentioned before, refuelling a helicopter might be an important action to take even if it takes you slightly farther away from where you want to go; this could be formalized so that the reward for the act of refuelling is low or even negative, but the sum of rewards over time is higher, because eventually you get closer to your goal than if you hadn't refuelled. Or, to pick one last example, learning a new skill might be costly and time-consuming at first, and not hugely enjoyable, but in the long term it yields more benefit.
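Collecting the definitions from this passage in one place, here are the return, the value, and their recursive forms written as equations (undiscounted for now; a discount factor is introduced later in this lecture):

```latex
G_t = R_{t+1} + R_{t+2} + R_{t+3} + \ldots                      % return: sum of future rewards
v(s) = \mathbb{E}\,[\, G_t \mid S_t = s \,]                     % value: expected return from state s
G_t = R_{t+1} + G_{t+1}                                         % recursive form of the return
v(s) = \mathbb{E}\,[\, R_{t+1} + v(S_{t+1}) \mid S_t = s \,]    % recursive form of the value
```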
You learn the new skill to maximize your value rather than the instantaneous reward; maybe, for instance, that's why you're following this course.

Just in terms of terminology, we call a mapping from states to actions a policy; this is shorthand, in some sense, for an action-selection policy. It's also possible to define values not just on states but on state-action pairs; these are typically denoted with the letter q, for historical reasons. So we have the letter v to denote the value function of states, and the letter q to denote the value function of states and actions, which is simply defined as the expected return conditioned on being in state s and then taking that action a. Instead of considering a policy which could immediately pick a different action in state s, we're saying: no, we are in state s and we are considering taking this first action a. This total expectation will of course still depend on the future actions that you take, so it still depends on some policy that we have to define for the future actions, but we are pinning down the first action and conditioning the expectation on that. We'll talk about this in much more depth in lectures three, four, five, and six.

Now we can summarize the core concepts before we continue. The reinforcement learning formalism includes an environment, which defines the dynamics of the problem; a reward signal, which specifies the goal and is sometimes taken to be part of the environment but is good to list separately; and an agent. The agent might contain different parts, and most of this course will essentially be about what is in the agent, what should be in the agent, and how we can build learning algorithms that work well. Some of the parts are listed here: the agent will contain some agent state, the internal state of the agent; it will contain some policy; and it could contain a value function estimate, a prediction of the value, or a model, which might be a prediction of the dynamics of the world. I put question marks on the last two because they are in some sense more optional than the first two. The agent must have some internal state: this could be a very simplistic state, it could be a null state, or it could simply be the immediate observation you've just received, but it could also be something more complicated. And it must have some policy: it must select actions in some way. Again, this policy could be particularly simple, it could be a random policy that selects actions completely uniformly at random, but there must be some policy. The value function and the model are more optional in the sense that they are not essential parts, but they are very common parts, and I will discuss them in the remainder of this lecture.

So now it's time to go inside the agent, and we'll start with the agent state. Here is one way to depict the internals of the agent: in this schematic on the right-hand side, time increments as we go to the right, and we see, from the view inside the agent, that at every time step an observation comes in, there is some internal state of the agent, and from that state the agent might make predictions and it should somehow define a policy, from which the action then gets selected.
I could have drawn another arrow going from the policy to the action, which would then go back into the environment, but here we are focusing on the state component. State here refers to everything the agent takes along with it from one time step to the next. Some things are not taken along: the instantaneous policy at a given time step might not be, and the predictions might not be either, or they could be, in which case they would just be part of the state. There might be other things in the state as well: there might be some memory in the state, there might be learned components in the state. Everything that you take along with you from one time step to the next we call the agent state.

We can also talk about the environment state, which is the other side of that coin. In many cases the environment will have some really complicated internal state. For instance, in the example where the agent is a robot and the environment is the real world, the state of the environment is basically the state of all of the physical quantities of the world, all of the atoms, all of the quantum mechanics of the world; that is the environment state. In many smaller examples, say a virtual environment, it could be much smaller, but it could still be quite complicated. This also means that it is usually invisible to the agent: it's really, really large, and it's not part of the observation stream per se. Even if it were visible, it might contain lots of irrelevant information and might simply be too large to process, but the first point is the more interesting one: it is usually just invisible to the agent, and we can only see a small sub-slice of it via our observation stream.

An important concept to keep in mind is that we can also collect the whole interaction sequence into something we call the history of the agent. This is simply everything the agent could have observed so far: the observations from the environment, the actions the agent took, and the rewards it received. It is really just taking that interface and storing everything that happens at the interface level, and we call that the history of the agent; for instance, it could be the full sensorimotor stream of a robot. We can then say that the history is the only thing that can be used to construct the agent state, apart from whatever prior knowledge you put in all the way at the beginning; let's set that aside for a moment. Everything else must be a function of the history; there is nothing else, the agent has no additional information apart from its sensorimotor stream, so that is what you should use to construct your agent state.

A special case is when the agent can see the full environment state, so that the observation is the full environment state. This is called full observability. As I mentioned before, this is a very special case, not the common case at all, but it is a useful one, and it is sometimes used, for instance in theoretical statements, just because it is easier to reason about. In that case the agent state can just be the observation: we don't need to worry about the whole interaction stream, we can just observe whatever the environment state is, and that should be sufficient to tell us where we are. You don't need additional memory, you don't need anything else; you just use the environment state as your state.
In addition to that, there are the learnable parts of the agent: the agent might have some parameters that it is learning, and you could also consider those to be part of the agent state. In this case I'm not considering them part of the agent state; they are also part of the agent, but let's set them aside and say the agent's mind is essentially separate from its state in this sense. So in the fully observable case you can just look at your observation and say: this tells me everything I need to know about the environment, so I don't need to log any of the previous interactions.

This leads us to an important concept in reinforcement learning, which is the Markov property. This has been used to formulate the reinforcement learning problem and also its precursors, and importantly, a Markov decision process is a very useful mathematical framework that allows us to reason about algorithms for solving these decision problems. The Markov property itself states that a process is Markovian, or a state is Markovian for this process, if the probability of a reward and a subsequent state does not change if we add more history. That is what the equation on the slide means: the probability of a reward and a state, which you should interpret as the probability of those occurring at time step t+1, conditioned on the state S_t, is equal to that same probability conditioned on the full history up to that time step. If this is true, the state contains everything you need to know, so we don't need to store anything else from the history. It doesn't mean the state contains everything; it just means that adding more history doesn't help. For instance, if your observations are particularly uninformative, then adding more uninformative observations might not help, so that might give you a Markovian state even though you cannot observe the full environment state. If you can observe the full environment state, however, then your state is also Markovian. So once the state is known, the history may be thrown away, if you have this Markov property, and that sounds very useful, because the state itself might be a lot smaller than the full history. As an example, the full agent-plus-environment state is Markov, but it might be really, really large, because as I mentioned the environment state might be humongous; it might be the real world. The full history is also Markov, which you can read off the equation quite clearly, because if you put H_t on the left-hand side where it says S_t then it is obviously true. The problem with that is that this state keeps growing: if we use the full history as our agent state, then the amount of memory we are using inside the agent's head keeps growing linearly over time, and often that also becomes too large. So typically the agent state is some compression of the history, and whether it satisfies the Markov property is maybe not even the most important question, but it is an interesting thing to keep in mind. Note that we use S_t to denote the agent state, not the environment state, and we will use that convention basically throughout. Sometimes, as a special case, these will be the same, because the environment state might be fully observable, but in general we will not assume that.
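Writing out the history from the previous slide and the Markov property described here, with the indexing convention introduced earlier (action at time t, then reward and observation at time t+1); note that some formulations also condition both sides on the action A_t, a detail not spelled out in this lecture:

```latex
H_t = O_0, A_0, R_1, O_1, \ldots, A_{t-1}, R_t, O_t     % the history: everything observed so far
% A state S_t is Markov for this process if conditioning on more history changes nothing:
p(R_{t+1}, S_{t+1} \mid S_t) = p(R_{t+1}, S_{t+1} \mid H_t)
```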
Whenever we say state, we basically mean the agent state, the state on the side of the agent, unless specified otherwise.

Now, I said that fully observable cases are very rare, so we should talk about the complement of that, which is the partially observable case. In this case the observations are not assumed to be Markovian, and I'll give you a couple of examples. For instance, a robot with a camera which is not told its absolute location would not have Markovian observations, because at some point it might be staring at a wall, and it cannot tell where it is; it cannot tell what is behind it, or behind the wall, it can just see the wall. This observation is not Markovian, because the probability of something happening might depend on things the robot has seen before but does not see right now: it may have just turned around, and there might be information about what is behind it which should influence the probability of what happens next, but it cannot see this from its current observation. Similarly, a poker-playing agent only observes the public cards and its own cards; it does not observe the cards of the other players, but obviously these are important for its future rewards. Part of the environment state is then hidden from the agent, so using the observation as the state would not be Markovian. That does not mean it is necessarily a bad idea, but it can be a bad idea, because you are ignoring information that might be contained in your past observations. This setting is called a partially observable Markov decision process, or POMDP for short, and it is basically an extension of the Markov decision processes that we will define more rigorously in future lectures. It is good to keep in mind that this is basically the common case. Note that the environment state itself could still be Markov; it's just that the agent cannot see it and therefore cannot know it. In addition, we might still be able to construct a Markov agent state: the example I gave on the previous slide is that you could always take your full history, and that would be Markovian; the problem is just that it is too large. But maybe there are smaller agent states we can construct which still hold enough information to be Markovian.

So the agent state is an important concept. It must depend on the information you have seen before, on this interaction stream, and the agent's actions then depend on this state, which is some function of the history. The examples I gave were that the state could be the observation, or it could be your full history, but more generally you can also write this down recursively, where the state at the next time step t+1 is some function u of your previous state, the action you have taken, the reward you have received, and the observation that you see: we take one step in this interaction loop and we update the state to be aware of this new time step. Clearly, if we just concatenate the action, reward, and observation, then S_{t+1} could be your full history, if S_t is your full history up to time step t, so the full history is contained within this formulation; quite clearly, the special case of just looking at the observation is contained in this formulation as well. So this is a more flexible way to think about it, and u is called the state update function. Now, as I mentioned, it is often useful to consider the agent state to be much, much smaller than the environment state,
and in addition you typically also want it to be much smaller than the full history. So we want this state update function to give us some compression of the full history, maybe recursively, and maybe the state actually stays the same size: S_t could be of a certain size, we see a new action, reward, and observation, and we condense all of that information together into something that is the same size as S_t.

Here is an example to make all of that a little more concrete. Let's consider a maze, and let's say that the full state of the environment is the layout of the maze plus where you are in it; that defines the full environment state. But let's say the agent cannot observe all of that: it cannot observe its location in the maze, and instead it can only see the little three-by-three patch around itself. The agent sits in the centre of this three-by-three block and what it can see are exactly the cells around it. It can see, for instance, that above it it's empty, that to the left and to the right there is a wall, shown in black, and that below it it's empty; so it could walk up or walk down. It can also look slightly around the corner, where it can see that if it goes up and then right there is an empty spot, but if it goes up and then left it would bump into a wall. That is all it can see. Now, this observation is not Markovian, because if we look at a different location in the maze, the observations are indistinguishable: if we just use the observation, the agent will not be able to tell where it is. We can also talk about why that might be problematic. Let's say the agent starts in the top right corner, and let's say the goal for the agent is to reach the top left corner. If you consider the shortest path, then in the observation shown at the top right the optimal action would be to step down, because that is in the direction of the goal: we have to go via the bottom of this maze to reach the top left corner. If you look at the observation on the left, however, the optimal action there would be to go up. But the agent cannot distinguish between these two: if it uses the observation as its full agent state, and its action-selection policy must depend only on that observation, then it is unclear what it should be doing. In the top right it should go down, on the left it should go up, but there is no single policy, no single function of this observation, that will do the right thing in both cases. This is why it can be problematic not to have a Markovian state.

Now I actually want you to think about this for a second, so feel free to pause the video: how might you construct a Markovian agent state for this specific problem, and maybe one that works for any reward signal, not just the one that goes from the top right to the top left? Pause the video, and then I'll talk about it a little more. One thing you may have come up with is: maybe we can use that thing I mentioned, the full history. Yes, the full history would be Markovian, but it would be rather large, so I think many of you will have discarded that as not the most pleasant or feasible solution. Maybe we can do something that goes a little in that direction but is not quite the same: let's say we consider storing not just the observation we see right now but also the previous observation.
Would that work? Well, it actually depends: it depends on the policy, and it depends on whether the state transitions in the world are completely deterministic, so that if you go down you really go down, or whether there is some noise, where sometimes when you press down you actually go up. Note that if you look at both of the observations that are highlighted right now, then if you step down one step the observation is still the same. So if you had come from the situation below where we currently are, just concatenating these two observations would not be sufficient to tell where you are; concatenating two observations is not necessarily Markovian in this environment. However, it can be sufficient if your policy never does this. If we start in the top right corner and first step left and then step down, that brings us to where we currently are; if we had stored the fact that we just stepped down, and we then see this observation, we know where we are, because then the previous observation is sufficient. But if the policy can take that same action in the state on the left as well, for instance under a uniformly random policy, then in general just concatenating two observations is not sufficient to obtain a Markovian state in this case.

So, in general, what I was doing there is trying to construct a suitable state representation to deal with the partial observability in the maze. As examples, I mentioned that using just the observation might not be enough and that using the full history might be too large, but generically you can think of some update function, and then the question is how to pick that update function. That is exactly what we were doing just now: trying to hand-pick a function u that updates the state in such a way as to take the stream of observations into account. The example where I concatenate two observations is one where you keep a small buffer, and whenever you see a new observation it replaces the oldest observation in the buffer and the newest one is added on top, so you have a two-observation buffer in that case. This is a generic kind of update, and you can of course do other things as well. It is good to note, though, that constructing a fully Markovian agent state might not be feasible: your observations might be really complicated, and it might be really hard to construct a fully Markovian agent state. So instead of always shooting for complete Markovianness, maybe that is not necessary; maybe it is more important that the state allows good policies and good value predictions, and sometimes that is easier. Going for optimal is really, really hard, but going for very good is substantially easier, and that is something we will keep in mind more generally when we want to deal with messy, big, real-world problems where optimality might be out of reach.
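Here is a minimal sketch of such a state-update function u, implementing the two-observation buffer discussed above; the function and variable names are mine, and, as argued, a state like this is not guaranteed to be Markovian for every policy.

```python
def update_state(state, action, reward, observation, buffer_size=2):
    """State update u(s, a, r, o): keep only the most recent observations.

    The state is a tuple of recent observations; the newest observation is
    appended and the oldest is dropped once the buffer is full. One could also
    store the actions and rewards to disambiguate more situations.
    """
    return (state + (observation,))[-buffer_size:]

# Example usage: start from an empty state and take two steps.
state = ()
state = update_state(state, action=None, reward=0.0, observation="wall above")
state = update_state(state, action="down", reward=0.0, observation="corridor")
# state is now ("wall above", "corridor"): the two most recent observations.
```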
Okay, now we're going to continue our journey inside the agent and move on to the next parts, which are the policy, the value function, and the model, starting with the policy. We've covered the agent state; now we go into the policy, and then into the value function and the model. The policy is simply something that defines the agent's behaviour. It's not a very complicated construct: it is a mapping from agent states to actions. For a deterministic policy, for instance, we can write it as a function that takes a state as input and outputs an action. It will actually be more common, and often more useful, to think of stochastic policies, where pi denotes the probability of an action given a state; pi is just the conventional notation for policies, and the stochastic policy is the more general case, so typically we consider a probability distribution over actions. And that is basically it in terms of policies. Of course we're going to say a lot more about how to represent these policies and how to optimize them, but in terms of definitions, all you need to remember is that pi denotes the probability of an action given a state.

Then we can move on to value functions and value estimates. What I have on the slide is a version of the value function as I defined it earlier, and I want to mention a couple of things about it. First of all, it's good to appreciate that this is the definition of the value; later we'll talk about how to approximate it. I have also extended it in two ways from the previous definition. First, I have made it explicit that the value function depends on the policy: conditioning on pi means, if I were to write it out in long form, that every action at subsequent time steps is selected according to this policy pi. Note that we are not conditioning on a sequence of actions; we are conditioning on a function that is allowed to look at the states we encounter and then pick an action, which is slightly different. The other thing I have done on this slide is introduce a discount factor. This is a somewhat orthogonal addition, but I include it here so that we have the generic form of a value function, which conditions on the policy and potentially includes this discount factor, a very common construct in reinforcement learning. One way to think about it is that the discount factor helps determine the goal, in addition to the reward function. For instance, if you consider a reward of plus one on every time step, the value could be infinitely large. Alternatively, if you think of a maze where the reward is zero on every time step until you reach the goal and one when you reach it, then any policy that eventually reaches the goal gets a value of one, so we cannot distinguish policies that get there quickly from ones that get there slowly. So discount factors are sometimes used to define the goal in the sense that it may be better to weight near-term rewards a little more heavily than long-term rewards; this allows us to trade off the importance of immediate versus long-term rewards.
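Written out, the discounted return and the policy-conditioned value from this slide, together with the state-action value q mentioned earlier, look as follows:

```latex
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \ldots          % discounted return, with discount factor gamma
v_\pi(s) = \mathbb{E}\,[\, G_t \mid S_t = s, \pi \,]                 % value of state s when following policy pi
q_\pi(s, a) = \mathbb{E}\,[\, G_t \mid S_t = s, A_t = a, \pi \,]     % value of taking action a in s, then following pi
```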
and one way to think about that is that the discount factor helps determine the goal in addition to the reward function for instance if you consider a reward function that is plus one on every time step then the value could be infinitely large alternatively if you think of a maze where the reward is zero on every time step and one when you reach the goal then any policy that eventually reaches the goal gets a value of one so then we can't distinguish between getting there quickly or slowly so sometimes discount factors are used to define goals in the sense that maybe it's better to look at the near-term rewards a little bit more than at the long-term rewards so this allows us to trade off the importance of immediate versus long-term rewards to look at the extremes to make it a bit more concrete you can consider a discount factor of zero if you plug that into the definition of the value as it's written on the slide you see that the value function just becomes the immediate reward all of the other rewards are cancelled out because they're multiplied with the zero discount so that means if your discount factor is small or in the special case where it's zero then you only care about the near-term future and if you then optimize your policy the resulting policy would also be a myopic policy a short-sighted policy which only cares about immediate reward conversely the other extreme would be when the discount factor is one this is sometimes called the undiscounted case because then the discounts basically disappear from the value definition we get the definition that we had before where all rewards are equally important not just the first one but the second one is equally important as the first one and that also means that you no longer care in which order you receive these rewards and sometimes it's useful to have a discount factor that is in between these two extremes in order to define the problem that you actually want to be solving now as i mentioned the value depends on the policy and ultimately we want to optimize these so we want to be able to reason about how we can pick different policies and we can now do that because the value function can be used to evaluate the desirability of states and also we can compare different policies in the same state we can say one policy might have a higher value than a different policy and then we can maybe talk about the desirability of different policies and ultimately we can also then use this to select between actions so here we've defined the value function as a function of a policy but if we have a value function or an estimated value function we can then maybe use that to determine a new policy this will be talked about in a lot more depth in future lectures but you can think of this as kind of being an incremental learning system where you first estimate the value of a policy and then you improve your policy by picking better policies according to these values and that's indeed a relevant algorithmic idea that we'll get back to later as i mentioned before the value functions and returns have recursive forms so the return now has its discount factor in the more general case and the value function is also recursive where again as i mentioned before the value of a state can be defined as the expected value of the immediate reward plus now the discounted value at the next state for that same policy and here the notation a ∼ pi just means that a is sampled according to the probability distribution pi and we'll just use that same notation even if the probability distribution is deterministic for simplicity this is called a bellman equation it was first described by richard bellman in the 1950s and it's useful because you can turn it into algorithms so these equations are heavily exploited and a similar equation can be written down for the optimal value which is really interesting so note that the equation above is conditioned on some policy so we have some policy and we can then determine its value it turns out we can also write down an equation for the optimal value that you can have so there is no higher value that you can get in this setting and this turns out to adhere to the recursion that is written on the slide where v star the optimal value of state s is equal to the maximization over actions of the expected reward plus discounted next value conditioned on that state and action importantly this does not depend on any policy it just depends on the state and this recursion is useful it defines the optimal value recursively because note that v star appears on the left hand side and the right hand side and we can use this to construct algorithms that can then learn to approximate v star
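again the slides are only described in words here, so as a reconstruction in standard notation the recursive form of the return and the two recursions just mentioned can be written as follows

```latex
% recursive form of the discounted return:
G_t \;=\; R_{t+1} + \gamma\, G_{t+1}

% bellman equation for the value of a policy pi:
v_\pi(s) \;=\; \mathbb{E}\left[\, R_{t+1} + \gamma\, v_\pi(S_{t+1}) \,\middle|\, S_t = s,\; A_t \sim \pi(\cdot \mid S_t) \right]

% bellman optimality equation (no policy appears; there is a max over actions instead):
v_*(s) \;=\; \max_a\; \mathbb{E}\left[\, R_{t+1} + \gamma\, v_*(S_{t+1}) \,\middle|\, S_t = s,\; A_t = a \right]
```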
in future lectures we will heavily exploit these equations and we'll use them to create concrete algorithms and in particular of course we often need to approximate these so the previous slide just defines the value of a certain policy and it defines the optimal value it doesn't tell you how to get them and in practice you can't actually get them exactly we'll have to approximate them somehow and we will discuss several algorithms to learn these efficiently and the goal of this would be that if we have an accurate value function then we can behave optimally i mean if we have a fully accurate value function because then we can just look at the value function we could define a similar equation to the one on the previous slide for state action values rather than just for state values and then the optimal policy could just be picking the optimal action according to those values so if we have a fully accurate value function we can use that to construct an optimal policy this is why these value functions are important but if we have a suitable approximation which might not be optimal and might not be perfect it might still be possible to behave very well even in intractably large domains and this is kind of the promise of these approximations that we don't need to find the precise optimal value in many cases it might be good enough to get close and then the resulting policies might also perform very well okay so the final component inside the agent will be a potential model this is an optional component similar to how the value functions are optional although they are very common and a model here refers to a dynamics model of the environment the term is sometimes used more generally for other things as well in artificial intelligence or machine learning but in reinforcement learning when we say we have a model we typically mean a model of the world in some sense so that means the model predicts what the environment will do next for instance we could have a model p which predicts the next state where maybe if you give it as inputs a state an action and a next state the output of this thing is an approximation to the actual probability of seeing that next state after observing this previous state and action again for simplicity it might be good to keep in mind a specific agent state where for instance the agent state could be your observation then this would be the probability of the next observation given the previous observation and the previous action and we could try to model that we could try to approximate this and then in addition we could also approximate the reward function which could be for instance conditioned on state and action where this would just be the expected reward given that you are in that state and take that action a model doesn't immediately give us a good policy for value functions we can actually just kind of read off a policy if we have state action value functions we can pick actions according to these values for a model we don't immediately have that we would still need to conduct some sort of planning mechanism we'll talk about specific algorithms that can be used on top of a model in order to extract a policy but it's good to keep that in mind in general the model would still require additional computation in order to extract a good policy in addition to the expectation above for instance for the reward we consider the expected reward we could also consider a stochastic model or an expectation model for the state
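to make the idea of a learned model a bit more concrete, here is a minimal sketch of a count based tabular model in python, assuming small discrete hashable agent states, the class and the names are illustrative and this is not the specific model used in any of the examples in this lecture

```python
from collections import defaultdict

class TabularModel:
    """a minimal count-based model of the environment, as a sketch only.

    it estimates p(next_state | state, action) from observed transitions and
    the expected reward r(state, action) as a running average.
    """
    def __init__(self):
        self.transition_counts = defaultdict(lambda: defaultdict(int))
        self.reward_sums = defaultdict(float)
        self.visit_counts = defaultdict(int)

    def update(self, state, action, reward, next_state):
        # learning the model from one observed transition
        self.transition_counts[(state, action)][next_state] += 1
        self.reward_sums[(state, action)] += reward
        self.visit_counts[(state, action)] += 1

    def transition_prob(self, state, action, next_state):
        # estimated probability of next_state given state and action
        n = self.visit_counts[(state, action)]
        if n == 0:
            return 0.0  # no data yet for this state-action pair
        return self.transition_counts[(state, action)][next_state] / n

    def expected_reward(self, state, action):
        # estimated expected reward for taking action in state
        n = self.visit_counts[(state, action)]
        return self.reward_sums[(state, action)] / n if n > 0 else 0.0
```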
so the state model here in particular would be an example of a distribution model where we try to actually capture the full distribution of the next state given the current state and action you could also instead try to approximate the expected next state or you could try to find a model that just outputs a plausible next state or maybe randomly gives you one of the states that could happen these are all design choices and it's not 100% clear in general yet what the best choices are and now i'll go through an example to talk about all of these agent components a little bit it's just a very simple example we'll see much more extensive examples in later lectures and in particular we're going to consider this maze so we'll start at the left and the goal is at the right and we define a certain reward function which gives you a minus one per time step that means that the optimal thing to do is to go to the goal as quickly as possible because then you'll have the lowest number of minus ones the actions will be up down left and right or north east south and west if you prefer and the agent location is the state let's say that this is fully observable so you can basically just tell where you are maybe you could think of this as x y coordinates which are easily shown to be markovian in this setting so here's an example which shows a policy and in fact it shows the optimal policy in every state we see an arrow and this arrow depicts which action to take so for instance in the leftmost state the arrow points right so we say that in the leftmost state the policy will take the action right this policy is a deterministic policy that indeed gives us the shortest path to the goal and it will be an optimal policy you could also consider a stochastic policy which might select multiple actions with non-zero probability here is the value of that policy on the previous slide which happens to also be the optimal value function which as you can see decrements every time you step away from the goal and this is because the value function is defined as the expected sum of rewards into the indefinite future but if the episode ends at the goal then the rewards stop there so if you're one step away from the goal the value will just be minus one for that optimal policy if you're two steps away it will be minus two and so on this is a model and specifically this is an inaccurate model because note that all of a sudden a part of the maze went missing so in this case the numbers inside the squares are the rewards and this model has just learned that the reward is basically minus one everywhere maybe this is very quick and easy to learn and the dynamics model was learned by simply interacting with the environment but it turns out maybe we haven't actually gone to that portion there in the bottom left corner and therefore the model is inaccurate and wrong there if you would then use this model to plan it would still come up with the optimal solution for the other states that it can see but it might not have any solution for the states it hasn't seen it's just an example of course it's unrealistic to have an accurate value function but an inaccurate model in this specific way but it's just an example to say that your model doesn't have to be perfect if you learn it it could be imperfect the same of course holds for the policy and value function these could also be imperfect okay now finally before we reach the end of this lecture i'm going
to talk about some different agent categories and in particular this is basically a categorization it's good to have this terminology in mind which refers to which parts of the agent are used or not used a value-based agent is a very common version of an agent and in this agent we learn a value function but there is no explicit separate policy instead the policy is based on the value function the agent that i showed earlier that was playing atari games is actually of this form where the agent learns state action value functions and then picks the highest rated action in every state with high probability conversely you can think of a policy-based agent which has an explicit notion of a policy but doesn't have a value function i haven't yet told you any algorithms for how you could learn such a policy if you're not learning values but we'll actually see an example of that in the next lecture and then there's the terminology actor critic the term actor critic refers to an agent which has both an explicit representation of a policy and an explicit representation of a value function these are called actor critics because the actor refers to the policy part there's some part of the agent that acts and the value function is then typically used to update that policy in some way so this is interpreted as a critic that critiques the actions that the policy takes and helps it select better policies over time now all of these agents could be model-free which means they could have a policy and or a value function but they don't have an explicit model of the environment note in particular that a value function can of course also be considered some model of some part of the environment it's a model of the cumulative expected rewards but we're not calling that a model in reinforcement learning parlance typically so if you just have a value function we tend to call this model free i'm saying that not because it's a great definition or a great division between agents but because it's a very common one so if you read papers and they say something about model-free reinforcement learning this is what they mean there's no explicit dynamics model conversely a model-based agent could still optionally have an explicit policy and or a value function but it does in any case have a model some model-based agents only have a model and then have to plan in order to extract their policy other model-based agents have a model but in addition to that have an explicit policy and for instance use the model to incrementally improve the value function or the policy so now finally we're going to talk about some sub-problems of the rl problem prediction is about evaluating the future so for instance learning a value function you could call a prediction problem and this is indeed often the terminology that is used typically when we say prediction we mean for a given policy so you could think about predicting the value of the uniformly random policy for instance or of a policy that always goes left or something of that form conversely control is about optimizing the future finding the best policy it's good to note that this terminology is used quite frequently in papers so it's good to have that in mind and often of course these are quite related because if we have good predictions then we can use them to pick new policies in fact the optimal policy pi star can be defined as the argmax over policies of these value functions
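to connect this back to the value-based agent mentioned above, here is a minimal sketch of how a policy can be read off from estimated state action values, the epsilon greedy scheme and the dictionary of q values are illustrative assumptions on my part, not necessarily exactly what the atari agent did

```python
import random

def greedy_action(q_values, state, actions):
    """pick the action with the highest estimated value in this state.

    q_values is assumed to be a dict mapping (state, action) pairs to
    estimated state-action values; these names are illustrative.
    """
    return max(actions, key=lambda a: q_values.get((state, a), 0.0))

def epsilon_greedy_action(q_values, state, actions, epsilon=0.1):
    # with high probability act greedily with respect to the value estimates,
    # otherwise pick a uniformly random action (one common way to keep exploring)
    if random.random() < epsilon:
        return random.choice(actions)
    return greedy_action(q_values, state, actions)
```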
by definition the value function gives a ranking on policies essentially your preference over policies that doesn't mean that you need to have these value functions per se in order to learn policies but it just shows how strongly related the problems of prediction and control are in addition there's an interesting question that i encourage you to ponder a little bit this is something that rich sutton often says that in one way or another prediction is maybe a very good form of knowledge and in particular if we could predict everything it's unclear that we need additional types of knowledge and i want you to ponder that and think about whether you agree with it or not so if you could predict everything is there anything else that we need feel free to pause the video and think about that for a second i'm going to give you one suggestion so indeed if you can predict everything about the world this gives you a lot of knowledge but it might not immediately tell you how to do things so maybe it's sometimes useful similar to these policies and value functions especially if we're approximating so we can't predict everything perfectly to separately store predictions and separately store policies or you could think of these as being skills in some sense but indeed predictions are a very rich form of knowledge and many things can be phrased as a predictive problem even if they're not immediately clearly a predictive problem when you first think about them and as i referred to when i was talking about models there are two different parts to the reinforcement learning problem one is about learning and this is the common setting which we assume where the environment is initially unknown and the agent interacts with the environment and somehow has to learn whether it's learning a value function a policy or a model all of that could be put under the header of learning and then separately we could talk about planning planning is a common term in artificial intelligence research and planning is typically about when you have a model so let's say the model of the environment is just given to you and then the agent somehow figures out how best to optimize that problem that would be planning so that means you're using some compute to infer from the statement of the problem from the model that is given what the best thing to do is now importantly the model doesn't have to be given but could also be learned but then it's good to keep in mind that the model might be slightly inaccurate so if you plan exhaustively in a learned model you might find a certain policy but it's unclear that this policy is actually optimal in the true world because the model might not be completely accurate and indeed the planning might latch on to certain inaccuracies in the model and hence might find solutions that are actually not that suitable for the real world because for instance the model might have a hole in a wall somewhere that is not actually there and then the shortest path might take the agent through that hole which isn't actually there and the policy that you get from that might not be great but we can think of planning more generally as some sort of internal computation process so then learning refers to absorbing new experience from this interaction loop and planning is something that sits internally inside the agent's head it's a purely computational process and indeed i personally like to define planning as any computational process that helps you improve your policies or predictions or other things inside the agent without looking at new experience learning is the part that looks at new experience that takes in your experience and somehow condenses that and planning is the part that does the additional compute that maybe turns a model that you've learned into a new policy
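as one possible illustration of that idea, here is a sketch of planning as pure computation that reuses the count based model sketched earlier, which was itself only an illustration, it performs value-iteration-style backups using the learned model and never touches new experience, all names here are illustrative and this is just one of many possible planning procedures

```python
def plan_with_model(model, states, actions, values, gamma=0.9, sweeps=10):
    """improve value estimates using only a learned model (see the TabularModel
    sketch above): no new experience is consumed, only extra computation."""
    for _ in range(sweeps):
        for s in states:
            # value-iteration-style backup through the learned model
            values[s] = max(
                model.expected_reward(s, a)
                + gamma * sum(model.transition_prob(s, a, s2) * values[s2]
                              for s2 in states)
                for a in actions
            )
    return values
```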
it's important also to note that all of these components that we've talked about so far can be represented as functions we could have policies that map states to actions or to probabilities over actions value functions that map states to expected rewards or indeed also to probabilities of these we have models that map states to states or state actions to states and we could have reward models that map states to rewards again or to distributions over these and we have a state update function that takes a state and an observation and potentially an action and a reward and maps them to a subsequent state all of these are functions and that's important because we have very good tools to learn functions specifically these days neural networks are very popular and very successful and the field of researching how to train neural networks is called deep learning and indeed in reinforcement learning we can use these deep learning techniques to learn each of these functions and this has been done with great success it is good to take a little bit of care when we do so because we often violate assumptions from say supervised learning for instance the data coming at us might be correlated think for instance of a robot operating in a room it might spend some substantial time in that room so if you look at the data coming into the agent it might be correlated over time and then sometime later it might go somewhere else and this might be less correlated but in the near term there might be quite some strong correlations in the data which are sometimes assumed not to be there when you do supervised learning in addition the problem is often assumed to be stationary in supervised learning in many supervised learning problems not in all of course but in reinforcement learning we're often interested in non-stationary things think for instance of a value function as i mentioned the value function is typically conditioned on a policy but if we're doing control if we're trying to optimize our policy the policy keeps on changing that means that the relevant value functions maybe also keep on changing over time because maybe we want to keep track of the value of the current policy but if the policy keeps on changing that means that the value function also needs to change so this is what i mean when i say we often violate assumptions from supervised learning that's not necessarily a huge problem but it does mean that whenever we want to use some sort of deep learning technique sometimes it doesn't work out of the box so deep learning is an important tool for us when we want to apply reinforcement learning to big problems but deep reinforcement learning which is basically the research field at the merger of deep learning and reinforcement learning or how to use deep learning in reinforcement learning is a very rich and active research field you can't just plug in deep learning and then hope that everything will immediately work that works up to a point but there's lots of research to be done exactly at that intersection of deep learning and reinforcement learning we'll talk much more about that later in this course
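as a small illustration of what it means to represent one of these functions with a neural network, here is a sketch of a value function approximator, the use of pytorch, the layer sizes and the input dimension are illustrative assumptions rather than anything prescribed in the lecture

```python
import torch
from torch import nn

# a tiny value network: it maps an agent-state vector to a single scalar value
# estimate. the input size of 4 is an arbitrary illustrative choice.
value_net = nn.Sequential(
    nn.Linear(4, 64),
    nn.ReLU(),
    nn.Linear(64, 1),
)

state = torch.randn(4)             # a stand-in for some agent-state vector
value_estimate = value_net(state)  # a differentiable scalar estimate of v(state)

# the parameters can then be updated with standard deep learning tools, keeping
# in mind the caveats above (correlated data, non-stationary targets):
optimizer = torch.optim.Adam(value_net.parameters(), lr=1e-3)
```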
okay now that brings us to the final examples so i talked about atari let's make it a little bit more specific now what was happening in the atari game that i showed you you can think of the observations as the pixels as i mentioned at that point as well the output is the action which is the joystick controls and the input is the reward here on the slide it actually shows the score but the actual reward was the difference in score on every time step note that the rules of the game are unknown and you learn directly from interactive gameplay so you pick actions on the joystick you see pixels and scores and this is a well-defined reinforcement learning problem and we have algorithms that can learn to deal well with this as a different example here's a schematic example a little bit more of an illustrative example and this is easier to reason through this is why we sometimes use these much smaller examples and oftentimes the conclusions still transfer so the atari example is an example of a rich messy hard problem in some sense and this would be an example of a very small-scale illustrative problem and we do this because we can often learn something from these smaller problems that we can apply to those much harder to understand difficult big problems so in this specific example which is from the sutton and barto book it's basically a grid world without any walls although there are walls at the edges essentially but not any walls inside the 5x5 grid and there's a reward function which is defined as minus one when bumping into a wall zero on most steps but if you take any action from state a the state that is labeled with a you get a plus 10 reward and you transition to a prime so even if you press say up from state a you still find yourself in a prime and you get plus 10.
similarly from state b you would transition to state b prime and you get plus five now we can ask several different questions about this setting and there might be reasons why we might be interested in these different questions so a first question could be a prediction question which is for instance what is the value of the uniformly random policy that selects all of the actions uniformly at random and that's depicted here on the right hand side in figure b and what we see here is that this is quite a complicated construct i wouldn't have been able to tell you immediately just by looking at the problem what the value function is for the uniformly random policy but we can use reinforcement learning algorithms which we'll talk about in future lectures to infer this and to figure out what that value is and it turns out just to look at this a little bit more in detail that of course the value of state a is quite high because from this state you always get a high reward but it's lower than 10 because the rewards after this first reward of 10 are negative you see that the value of state a prime is actually minus 1.3 i didn't mention it yet but there's a discount factor here as well of 0.9 which is why the value of state a is 8.8 and the value of state a prime is minus 1.3 and the difference between them is not quite 10 but from there you often get a minus one because you often find yourself bumping into the wall at the bottom or you don't get a minus one but then you might get a minus one on the next step because you might have walked left into the corner and it's quite a complicated thing because of the discount factor and because of the dynamics of the world but we can see that state a is desirable state b is somewhat desirable and the states in the bottom left are quite undesirable
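to make this prediction question concrete, here is a minimal sketch of iterative policy evaluation for this grid world, the positions of a, a prime, b and b prime are taken from the sutton and barto example and are assumptions to the extent that they are not spelled out here, and the code is an illustration rather than the method used to produce the figure

```python
import numpy as np

SIZE, GAMMA = 5, 0.9
A, A_PRIME = (0, 1), (4, 1)   # positions as in the sutton and barto example
B, B_PRIME = (0, 3), (2, 3)
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

def step(state, action):
    # special states: any action from a or b gives a big reward and teleports
    if state == A:
        return A_PRIME, 10.0
    if state == B:
        return B_PRIME, 5.0
    row, col = state[0] + action[0], state[1] + action[1]
    if 0 <= row < SIZE and 0 <= col < SIZE:
        return (row, col), 0.0
    return state, -1.0  # bumping into the edge: stay put and get minus one

# iterative policy evaluation for the uniformly random policy
values = np.zeros((SIZE, SIZE))
for _ in range(1000):
    new_values = np.zeros_like(values)
    for row in range(SIZE):
        for col in range(SIZE):
            for action in ACTIONS:
                (nrow, ncol), reward = step((row, col), action)
                new_values[row, col] += 0.25 * (reward + GAMMA * values[nrow, ncol])
    values = new_values

print(np.round(values, 1))  # the entry at a should be about 8.8, at a prime about -1.3
```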
but you might actually be more interested in what the optimal thing to do is and that to me is not immediately obvious should you be going to state a and then loop to a prime and get this plus 10 every time you could but it takes you a couple of steps in between each two times you do that transition you could also go to state b and b prime and then you can do these transitions more often now it turns out we can also figure out what the optimal value function is for this problem and what the optimal policy is if you look at the optimal values they're all positive now because you never have to bump into a wall anymore because the optimal policy doesn't bump into walls so even the bottom left corner now has positive values and in fact the lowest positive values are in the bottom right corner now because from there it takes you a long time to get to the best possible state and it turns out the best state you can be in is state a looping with these plus 10 rewards is apparently more beneficial than looping with these plus 5 rewards even though the distance between b and b prime is smaller so you can get more plus fives in a row very quickly by going from b to b prime again and again but going from a to a prime is apparently more profitable in the long term and we can see this in figure c here as well where the optimal policy is depicted we see that if you're in almost any state what you should be doing is moving to state a this will transition you all the way to the bottom to a prime and from there you'll just move straight up again up to state a and repeat conversely if you're in state b prime if you just look at where b prime is you would either go up or left it doesn't actually matter which one they're equally good but if you go up you would then move left so you wouldn't move into state b instead you would move left and then you move up or left again in order to get to state a there's only one state from which you would move into b which is the top right corner because from the top right going around state b and then going all the way to a would take so long that it's actually more beneficial to jump into state b which will transition you to b prime and then from there you'll go to state a and then loop indefinitely so this is quite subtle i wouldn't have been able to tell you just from looking at the problem that this would be the optimal policy but fortunately we have learning and planning algorithms that can sort that out for us and they can find this optimal solution without us having to find it so popping up a level in this course we will discuss how to learn by interaction we didn't really discuss that in this lecture in this lecture we just talked about the concepts and the terminology and things like that but we haven't really given you algorithms yet we will do that in the subsequent lectures and the focus will be on understanding the core principles and learning algorithms so it's less about what the current state of the art is we will touch upon that a little bit for sure but it's less about specific algorithms that people happen to use right now and then going all the way to the depth of those we will do that for some algorithms but it's much more important to understand the core principles and learning algorithms because the algorithms that people currently use will change next year there will be new algorithms and if you understand the core principles then you can understand these new algorithms and maybe you could even invent your own algorithms topics include exploration in the next lecture in something called bandits which are basically one-step markov decision processes we will talk more about what markov decision processes actually are how they are mathematically defined and what we can say about them and we will talk about how to plan in those with dynamic programming these will be the lectures after the next lecture and they will be given by diana and then we will use that to go into model-free prediction and control algorithms you may have heard of an algorithm called q-learning or i mentioned earlier in this lecture an algorithm called dqn dqn is short for deep q network q as i mentioned is often used to refer to state action values q-learning is an algorithm that can learn state action values and then the dqn algorithm is an algorithm that uses q-learning in combination with deep neural networks to learn these atari games this falls under model-free prediction and control because no explicit model of the environment is learned in that algorithm we will also talk about policy gradient methods we in fact already touch upon them in the next lecture but we'll talk about them more later and these are methods that can be used to learn policies directly without necessarily using a value function but we also discuss actor critic algorithms in which you have both an explicit policy network or function and an explicit value function and this brings us also to deep reinforcement learning because as i mentioned these functions are often represented these days with deep neural networks that's not the only
choice they could also be linear or it could be something else but it's a popular choice for a reason and it works really well and we'll discuss that at some length later in this course and also we will talk about how to integrate learning and planning i talked a little bit about planning being an internal computation process and learning meaning the process that takes new experience and learns from that and of course we could have both of those happening at the same time in an agent and then we want them to play nicely together and there will be other topics that we'll touch upon when we go through all of this okay now finally i want to show you one final example of a reinforcement learning problem again what we'll see here is a little bit more of a complicated example so what we'll see is a system which learned to control a body you can see the body already here on the still frame i'll press play in a moment and what will happen is that there's an algorithm that controls basically the forces applied to these body parts so this agent specifically it can run right and it had to learn by itself how to move its limbs in such a way as to produce forward motion the reward was a very simple one the reward was just go in that one direction and you get positive reward basically proportional to how fast you go in that direction so it really wants to go really fast in one direction it was not told how to do that so at first it doesn't know how to control its limbs it just knows that it perceives the world in some sense through sensors which i won't go into in much depth it's not too important here but the point is it doesn't know how to move its limbs it has to figure that out by itself and it just notices that when it moves in certain ways it gets more reward than when it moves in other ways doing that you get the following behavior with simplified vision as it says on the slide and proprioception which means it can feel essentially where its own limbs are in some sense and then it can traverse through this very complicated domain and it can learn how to jump over things and how to maybe even climb in some sense just because it wants to go to the right not everything is easy but it does manage to get there now interestingly by using this setting by just having a simple reward you can learn to traverse different types of terrain and do this in very non-trivial ways it would be very hard to specify a policy by hand that does this and in fact because we have a learning system it's not just that we don't have to specify a thing by hand but we can also apply the exact same learning system to different body types so this was learned with the exact same system that was used for the other one and you can use this in two dimensions or you can use it in three dimensions and in each of these cases the agent can learn by interaction how to actually scale these obstacles so the reward is particularly simple we didn't have to think about how to move the limbs in order to do that we can just have the learning system come up with that and that's the important point here and you can apply this to more difficult terrains you can apply this to different body types and you can get quite non-trivial behavior in doing so okay so that brings us to the end of this lecture thank you for paying attention and we'll see you in the next lecture which
will be on the topic of exploration