welcome back so today we're continuing this
lecture series on reinforcement learning which is one of my favorite topics at the intersection of
control theory and machine learning and actually also at the intersection of biological
learning so really really old rich topic decades of development and especially recently
there's a lot of renewed interest in reinforcement learning because of its combination with deep
neural networks and so today i'm going to focus primarily on model-free reinforcement learning
and specifically on gradient-free methods so this lecture is going to culminate in q-learning
which is one of the most popular reinforcement learning techniques available today so that's
this q learning here which is an off policy gradient free method and to build up to q-learning
we're going to have to introduce some concepts like Monte Carlo learning and temporal difference (TD) learning. and TD learning especially has really deep connections to biological learning, how we actually learn in our coupled network of neurons. so
i'm really excited, i think this is going to be a pretty neat lecture. and just a reminder, this is roughly following a new chapter in the second edition of our book, data-driven science and engineering. so this is a chapter i put together to be the end cap of the control section, because reinforcement learning really is such an exciting and growing set of techniques. okay, good. so: model-free reinforcement learning, specifically gradient-free methods, today. now, before jumping into all of the gradient-free optimizations i'm going to talk about, which will culminate in Q-learning, i want to recap a couple of topics that we talked about in the last few video lectures. so the first thing that is really important is just to remind you that there exists
this quality function q which essentially tells you the joint quality or value of taking an action
a given a current state s so the value function v tells you what the value of being in a current
state s is, assuming you take the best action. this quality function is somehow richer, and it contains more information: it tells you the quality of being in this state s for any action i might take. so "joint quality function" is sometimes what we call it. and this quality function can be defined using the standard Markov decision process (MDP) language that we've also introduced, where it is the expected future reward given that i am in a state s now and i take an action a. so it is the expectation of the instantaneous reward i'm going to get taking action a now, which will lead me to state s prime, plus any future rewards that i might get just for being in that next state s prime.
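as an equation (using roughly the notation from the previous lectures, writing the reward as R(s', s, a) and gamma as the discount factor), this is:

$$ Q(s,a) = \mathbb{E}\!\left[\, R(s',s,a) + \gamma\, V(s') \,\right] $$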
okay, so we talked a lot about this in the last lecture, and this is essentially just saying that the quality of taking an action in a state now is my expected current reward plus all of my future rewards for being in the next state that i'm going to jump to. and you can write it out in terms of this summation
over probabilities, because this Markov decision process, this MDP, which is the cornerstone of a lot of reinforcement learning algorithms, assumes that the environment you find yourself in evolves probabilistically. so there's kind of a coin-flip randomness element to how the state evolves, even given the same action: even if i take the same action in the same state, i might have different future states according to some probability, and that's given by this function P. so s prime again is my future state, s is my current state, and a is my current action, and so this expected value here has to be summed
up over all of the possible future states s prime i might find myself in, which are given by this model, this probabilistic evolution P. so right off the bat, what i want to point out, and this is why we need this lecture today, is that all of the learning that we have demonstrated using the Bellman equation and dynamic programming up until now, so policy iteration and value iteration, assumes that you have a model for this P, this probabilistic Markov decision process that evolves your system forward in time, and we also assume that we have a model for this reward function R. and you
can then optimize using this knowledge. and if you have this quality function, remember that you can derive the value function at a state s: it's simply the quality for the best possible action a you could take. and similarly you can derive the optimal policy pi: it's simply the action a that maximizes the quality function. but again, the central challenge: what we talked about in the past lecture was dynamic programming using Bellman optimality when we actually have a model for P and for R, when we have a model for how the system evolves and what the reward is as a function of the state and action. but in many systems we don't actually know how the system is going to evolve, we don't have a model for that, and we don't know what the reward function or the reward structure is before exploring and trying things out.
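to summarize this recap in equations (same notation as above; P(s'|s,a) is the transition probability, R the reward, gamma the discount factor), roughly:

$$ Q(s,a) = \sum_{s'} P(s'\mid s,a)\,\big( R(s',s,a) + \gamma\, V(s') \big), \qquad V(s) = \max_a Q(s,a), \qquad \pi = \operatorname*{argmax}_a Q(s,a) $$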
so that's going to be the motivation for these kind of model free reinforcement learning
algorithms i'm going to talk about today so this is just recap i wanted you
to remember the quality function is really important we also have
a value function and a policy pi. and again, if i had a model for P and a model for R, i could run through this value iteration where i would essentially take actions that maximize my value, and i would update my value function at every state that i visit on some trajectory. so let's say i'm playing chess: i would just play chess over and over and over again, and at every state i find myself in i would use my best policy to get my best action a, and i would somehow update the value of my current state s using the new reward information that i get. so this all relies on Bellman's equation. but today we're going to talk about
when we don't actually have access to a model so when we don't have access to a model for how
the system evolves or what the reward structure is we essentially are stuck using trial and
error learning and that makes a lot of sense in a lot of biological situations too right
the first time that you play tic-tac-toe maybe the person that taught you told you you know that
three x's in a row wins or three o's in a row wins but they didn't tell you what the value of
a particular state was and they didn't give you a model or a rule for how this game is going
to unfold you just had to play a bunch of games learn the rules and learn the reward structure
okay. and so the simplest approach to model-free learning, learning purely from trial and error, is called Monte Carlo learning. i don't actually recommend doing Monte Carlo learning unless you have to, but it illustrates the simplest possible approach. and so in Monte Carlo learning, what we're going to do first is define this cumulative reward function. so
Monte Carlo learning is what's called an episodic learning algorithm, which means that it needs to run an entire episode, a long state-action sequence leading to some final end of the game, before it can use that information for learning. and so Monte Carlo learning works, in theory, for things like games that have a definitive end, or maybe if you're trying to make a racer, say a bipedal walker racing to a finish line, that would have a definitive end for that episode as well. and so what you do
is essentially: you pick some policy, you enact that policy, and you run through your game; you play your game of chess or you run your race, and you compute your cumulative total reward, of course discounted by this gamma (we always discount by gamma because that's how we can prove optimality in certain conditions). and so based on this total cumulative reward, the simplest thing we could possibly do is divvy up that reward equally among every state that my system visited on the way to the finish line. okay, so i'm going to demonstrate this for value function learning, and it can also be done for Q learning. and so essentially what we would do is, at the end of an episode, once we've computed this cumulative reward, how much did i gain over that episode, i would say that my new value function at every single state along the way is equal to my old value function plus a little bit of this R sigma divided by how many steps there were in my trajectory. so this is essentially an update that says that our best estimate for our value function is still roughly the old value function, but for any states that we hit along the way to the finish line, we are going to add the cumulative reward divided
by the total number of steps. now, this is a very inefficient way of learning: you have to play tons and tons of games of chess or tic-tac-toe to actually learn this way, because what you're assuming is that every single step you took along the way was equally important to the victory or the loss. so for example, if i played a brilliant game of chess and then made a bad move at the end, all of those moves would be considered equal in this scheme; or if i played a bad game of chess and my opponent got unlucky and lost, then i would actually weight all of those bad moves into this updated value function. but again, this is the simplest, most common-sense thing you could do to start accumulating the information you've experienced: this is the experienced reward, and you just divvy it up over every single step you took to get that reward. okay, that's what Monte Carlo learning is.
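written as an update equation (this is the difference form discussed just below, where n is the number of steps in the episode and R sigma is the discounted cumulative reward; the exact indexing convention for the discount may differ slightly from the book), roughly:

$$ R_\Sigma = \sum_{k=1}^{n} \gamma^{k} r_k, \qquad V^{new}(s_k) = V^{old}(s_k) + \frac{1}{n}\big( R_\Sigma - V^{old}(s_k) \big) \quad \text{for every } s_k \text{ visited in the episode.} $$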
and you can do the same thing for Q learning, so you can also update your quality function in exactly the same way, where basically every state-action pair that you saw along the trajectory you just played would get updated, again by a small amount given by this reward divided by n. okay, and you can convince yourself that this should be an optimizer, that this should actually converge under certain conditions. and we're only updating based on the difference between the reward and the old value function, or the reward and the old Q function, because if my old value function or Q function were perfect, then they would
perfectly predict what this future reward would be. so if i had a perfect value function, this difference would be zero here and zero here. so that's why we're updating with the difference between the actual reward and the estimated reward, because that's what this is: this V old and this Q old here are essentially my estimated future reward for being in
state s k. okay, good, that's Monte Carlo learning: relatively inefficient, but one of the nice things about it is that it has no bias. it is very sample inefficient, and i don't recommend doing it in practice, but it introduces no bias into the learning procedure, and if you play enough games, in principle this should converge. and you should do a thought experiment of whether or not we should be enacting our optimal policy pi or a sub-optimal policy; if i was doing this, i would always be enacting my optimal policy pi, based on my new Q function or value function, but i want you to think through what would happen if you didn't enact an optimal policy pi. okay, good. so that's Monte Carlo
learning and that's kind of one extreme and on the other extreme is what
we have called temporal difference learning. and TD learning is one of the most important advances that has happened in reinforcement learning; this is decades old, and it gives us a better strategy in many cases for learning, in that it essentially highlights the events that most recently gave me rewards. so i have some state-action sequence and i have some rewards i've gotten along the way, and to some extent i want to say that maybe the events that happened more recently are somehow related to the rewards i'm getting. and so that kind of difference in time
is also a property of biological learning systems so if you think about pavlov's dog right pavlov's
dog was conditioned to expect a food reward at some point after hearing this
bell ring and so if you ring the bell you know the dog starts to salivate and so there
is a finite difference in time between the action or the state the environmental cue and the actual
expected reward so in monte carlo learning you're basically saying that the right action could have
happened at any point in the past leading up to the reward now what temporal difference learning
does is it says there is kind of an optimal time in the past where i am most likely to be
correlated with a current reward and that's a tunable parameter i'll tell you about okay so i'm
going to start with something called td0 there's a whole family of temporal difference learning
algorithms we're going to start with the kind of zeroth model here and again we're going to start
by looking at this in a value function formulation. so remember that the value function at the state now (subscript k means at time k), so the value function at my state now at time k, is given by my expected reward now, r k, plus a discounted value function at the next state i find myself at, s k plus one. okay, and this is essentially a Bellman optimality condition, that this can be written recursively: the value function now can be recursively written in terms of the value function at the next state i find myself in. and so what temporal difference learning does, and this is different than Monte Carlo learning, is that it updates the value function at my state s k. and again it similarly starts with my old value function, and there is a correction term, and what
that correction term does is instead of averaging over all of the trajectory like what monte carlo
learning does, in TD learning we have this first term here, called the TD target estimate, and if you look, this term is almost exactly the expected value function. okay, so this term here is what i expect my value to be given a state s k, and i take its difference with the old estimate of my value function at s k. so this is my old best estimate of the value of being in the current state now, and this is new information about what actually happened: i take an action, i get a reward r k, and i find myself in a new state s k plus one where i have an estimate of the value function. so this right here is my expected value of being at state s k. i want to say that again, because this is really
important, and it's a little confusing, because of the Bellman optimality condition here, where the value of being in a state at time now is equal to my reward now plus a discounted value of being in the state one time step into the future. so i can essentially say that my estimate of my value function should look like my current reward plus gamma times my value function at the next state. but if i actually am playing a trial-and-error game, if i'm playing a game of chess using my supposedly best policy pi, i can actually take that step to s k plus one, i can get a reward or lose a reward, whatever, and i can compute what this actual cumulative reward, which we call R sigma, is. and if there's a discrepancy between my actual R sigma and my estimated value, then i will update my value function with some weight alpha here. okay, so in some sense this has the same flavor as Monte Carlo learning: i start with my old value function, and i'm iterating to find a new value function; every time i take a step, i'm iterating this, but now i'm using this Bellman-nested reward R sigma to update. and so this is called the TD target, and this is called the TD error, the temporal difference error; written out, the TD(0) update looks like the sketch below.
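a compact restatement of the TD(0) update (alpha is the learning rate, gamma the discount factor):

$$ V^{new}(s_k) = V^{old}(s_k) + \alpha\,\big(\, \underbrace{r_k + \gamma\, V^{old}(s_{k+1})}_{\text{TD target}} - V^{old}(s_k) \,\big) $$

the whole parenthesized quantity is the TD error.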
and it's kind of interesting: there is actually evidence that there are analogs of this in neuroscience. cells that fire together wire together, that's one of the principles of neuroscience, and when dopamine is released, it can strengthen connections between the cells that fired together right before the dopamine was released; those cells will strengthen their correlation of firing together. and so there is a lot of really interesting theory and experiment connecting these TD error signals and the release of dopamine in actual biological systems. so it's believed that there are circuits in your brain and in your nervous system that are actually estimating this future reward, this R sigma, and i guess that would be this term here, the estimate of your future reward, and then when you actually get a future reward and it's different than what you estimated, you fire to strengthen those connections. okay, so this is nice because TD0 is only looking one time step into the future: you get your reward now and you get your estimated reward one time step in the future, and so this is somehow weighting events right before a reward takes place. so this is kind of a one-delta-t delay between an action and a reward, and things that are correlated at that one-delta-t delay will get strengthened in this TD0 framework. okay, good. so i already told
you about the strong connections to biological learning with dopamine but
you could expand this value function out further: this would be r k, plus gamma r k plus one, plus, maybe i'll write it out here, gamma squared times my value function two steps into the future, and i can do this as many times as i want. so this is kind of a two-step expansion of my value function, using again the Bellman optimality; these are both completely equivalent to each other, and i can replace this estimate for R sigma with this longer expression that has two events in the past, two state-action steps. and so if i do that, now i have a richer set of experiences of what my actual cumulative reward is, now taking two steps into the future, and i can use that to generate the temporal difference error signal that i use to update my value function. and you can imagine doing this as many times as you want: you could have an n-step cumulative reward, and in the limit as n goes to infinity, or to the end of your episode, this will actually converge to Monte Carlo learning, which is kind of neat. so at the very limit of temporal difference learning, as n goes to infinity or to the end of your sequence, this will converge in some sense to Monte Carlo learning; but if i take n equals zero or one or two, then i'm weighting only a few steps in the past more heavily to update my value function. good, so that's the idea of temporal difference
learning. in a minute we're going to show that Q-learning is essentially just temporal difference learning on the quality function Q, so i'll show you that shortly. but there's one last variant, called TD lambda, and essentially what it does is it says: instead of just using a one-step temporal difference (TD0), or a two-step temporal difference (TD1), or an n-step temporal difference, what i'm going to do is compute that target, that R sigma, for all of those, and i'm going to weight them in a certain way using an exponential weighting with some factor lambda that's between 0 and 1. okay, so essentially what i'm doing is weighting all of the TD targets together, and then i use that weighted reward signal to compute my TD error. and this has some nice properties: you can tune this lambda, depending on the problem you're working on, to get nice convergence properties, to weight recent stuff more or older stuff more, but you're weighting all of them instead of just ignoring past events (the n-step target and this lambda-weighted combination are sketched below). good, so TD lambda is pretty interesting; i believe this was introduced originally by Sutton.
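a sketch of these targets in equations, assuming the standard forward-view form of TD(lambda) (consistent with Sutton's formulation); here the n-step target looks n rewards ahead and then bootstraps on the current value estimate, so the TD0 target from before corresponds to n = 1:

$$ R_\Sigma^{(n)} = r_k + \gamma\, r_{k+1} + \cdots + \gamma^{\,n-1} r_{k+n-1} + \gamma^{\,n}\, V(s_{k+n}), \qquad R_\Sigma^{\lambda} = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{\,n-1} R_\Sigma^{(n)} $$

as lambda approaches 1 (or n the episode length), this recovers something like Monte Carlo learning.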
Okay, so: Q-learning. This is the culmination of what I really wanted
to show you, is that Q-learning is essentially just temporal difference learning on the Q function
So I'm going to write it out... So Q learning, it's exactly the same template
as temporal difference learning we are going to play games we're going to you know do trial
and error learning by trying our optimal policy and we have an estimate of what the quality is, given
being in a state *s* taking action *a* But that's not a perfect estimate -- we're still learning -- and
so what we do is we measure what the actual reward and future Q function is, at the next step we find
ourselves in, and we use this as the TD target for TD learning. So again, this is the TD target estimate; this is essentially my observed cumulative reward, and I use that to generate an error signal which I feed back to update my Q function. So essentially what this means is that if I got more reward than I thought I would, I should increase my Q function at that *s* and that *a* ...
so that's what that means: if I experienced more reward than what I expected, then I should have a
positive increase in my quality function at that *s_k* and *a_k*. And you'll notice here, this is a very
subtle difference, and I'm going to say this again because it is a very subtle difference... here we're maximizing over *a* ... so what we're actually doing is we take some step, we take some action to get our reward *r_k*, but it doesn't have to be our optimal on-policy action *a_k*... I could
actually do some random action to get this *r_k* but then when I'm computing my quality
function here for what my value is in that next state *s_k+1*, we're going to
optimize over actions *a*, at least to compute this TD target. Now this is a little bit strange,
but I'll tell you what this means in a minute. Okay so we call that "off policy", which
means that I can actually take sub-optimal actions *a* -- I don't have to enact my optimal current policy
*pi* -- I can take sub-optimal actions *a* to get this reward, but then I do need
to maximize over *a* to get my best Q function in this step of the TD target.
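Written out compactly (alpha is the learning rate, gamma the discount factor; this mirrors the TD(0) update above, but on Q, with the max over the next action):

$$ Q^{new}(s_k, a_k) = Q^{old}(s_k, a_k) + \alpha\, \big( r_k + \gamma \max_{a} Q^{old}(s_{k+1}, a) - Q^{old}(s_k, a_k) \big) $$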
And this has a lot of benefits -- essentially I can learn from experience, I can replay old
experiences, even when I know that those were not using an optimal policy *pi*, because I've
learned more since then. Or I can watch a grand master play and even if my quality function is not
very good, because I'm an amateur, I can watch their actions and learn from their actions. So I'll talk
a little bit more about this but that essentially is what we mean by off-policy, is that you can
still learn even if you take sub-optimal actions *a*. And you'll notice that this is a TD0 variant, because this is only taking one step forward and then using the recursive Q formulation. Good. There is another variant: so Q-learning
is off-policy TD learning of the Q function, and there's something called SARSA, which is on-policy TD learning of the Q function. very closely related, only a tiny difference, but it's an important one, and i'll tell you about it. so SARSA means state-action-reward-state-action: it basically means you're in a state and you take an action, you measure a reward, and then you are in another state and you take another action.
okay, SARSA. and so here you'll notice that this is the exact same temporal difference learning expression, TD0, except here i am not doing a max over a in this part of the TD estimate. and so what that means is that i actually have to be taking my best-possible-policy actions, a k plus 1 and a k, at every step, or else this will start to degrade my estimate, because the quality function Q is always my estimate of the quality of being in that next state, taking that next action, assuming i do the best thing forever after. okay, and so if i don't plug my best possible a in here, and if i didn't take my best possible action a k to get this reward, then this will actually be a sub-optimal estimate of my quality function, and it will start to degrade as i iterate. it's a very subtle difference, and i'll put them next to each other so you can see the difference, but this is what's called on-policy, because you have to enact the optimal policy for this to work, for this to converge and not degrade. one of
the cool things about SARSA is that it can actually work for any of the TD variants: i showed it for TD0 here, but you can do it for TD1, TD2, TDn, and you would basically just replace this TD target here with a more complex or richer expression taking more steps. and again, if you think about this, what this means is that you have to take actual steps, you actually have to take actions and get that trial-and-error experience, and then you look back and you see, okay, well, what was my sequence of events and rewards, and you can use that to update your quality function after gaining
that experience. good. and you can essentially just replace this TD target for the cumulative reward with any TD algorithm you like: TD0, TD1, TDn, or the temporal difference lambda scheme i showed you a minute ago. okay, and so this is on-policy TD learning of the quality function Q. so i want to show these simultaneously, next to each other, so that we can compare and contrast, because they are almost perfectly identical. okay, except again, Q-learning is what's called off-policy, and we enable Q-learning
to be off policy by replacing this one a k plus one here with a max over a here because again the
quality function and the value function are always assuming you do whatever is optimal in the future
so they're always assuming that after the next step you do the best thing forever after so it's
the optimal quality of being in that state. and so if you didn't maximize over a here, then if you took an off-policy, sub-optimal action a, this would actually degrade your quality function over time. but because we are actually optimizing over a in our quality function at the next step, we can take sub-optimal actions a to explore and to learn from off-policy experiences. the two updates are sketched side by side below.
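to make the comparison concrete, here is a minimal tabular sketch of the two update rules in Python. this is just an illustration, not code from the lecture or the book: Q is assumed to be a dictionary keyed by (state, action) pairs (for example a defaultdict(float)), actions(s) is an assumed helper returning the available actions in state s, alpha is the learning rate, and gamma the discount factor.

```python
# Minimal sketch of the two TD(0)-style updates on the quality function Q.
# Assumptions (illustrative, not from the lecture): Q is a dict keyed by
# (state, action), and actions(s) returns the actions available in state s.

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.95):
    """Off-policy: the TD target maximizes over the next action."""
    td_target = r + gamma * max(Q[(s_next, a_next)] for a_next in actions(s_next))
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.95):
    """On-policy: the TD target uses the next action actually taken."""
    td_target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])
```

the only difference is the max over the next action versus the next action actually taken, which is exactly the off-policy versus on-policy distinction being discussed here.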
and again, this is a huge benefit. so this single difference here that allows us to be off-policy is one of the things that allows Q-learning to learn from a lot of different types of experiences. so you can watch someone else play a
chess game and still learn in an off policy manner even if your quality function is not as good
as the grand masters that you're learning from you can accumulate information about how they
play and improve your quality function so you can learn from imitation and you can also
learn by replaying your past experiences even if they were sub-optimal so when
you go to sleep at night and you dream of playing tennis even if you are replaying
things that were not done perfectly because you've learned throughout the day you can
still use that experience to improve your quality function. and so Q-learning often is faster at learning, because you can explore more, so you get better optimal solutions and faster convergence, and oftentimes you get lower-variance solutions; i'm sorry, i think this should actually say higher variance, i think this is a typo on the slide, you'll have to check the book chapter. okay, good. so when you do Q-learning,
there are lots of strategies for how you introduce randomness into the policy, that is, into what action you take. so there's something called epsilon-greedy, which basically means that one minus epsilon of the time, so let's say epsilon is 0.05, then 95% of the time you would do your optimal policy, and 5% of the time you would do something random, you would do a random action, to get more exploratory actions, to explore your space more. so that's an epsilon-greedy
algorithm and in these epsilon greedy algorithms often you start with a big epsilon so you start by
only doing random stuff and then eventually that epsilon cools down to zero in a kind of annealing
or simulated annealing strategy so as you get more experience as you learn better you eventually
get to go more on-policy as that epsilon decays to zero. so that's a really important one, called epsilon-greedy (a minimal sketch of it is below), but there's tons of different ways of introducing randomness and exploration into your off-policy search strategies.
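as a sketch, epsilon-greedy action selection with an annealing schedule might look something like this in Python; the function name, the list-valued actions argument, and the exponential decay schedule are illustrative assumptions, not the lecture's specific choices.

```python
import random

def epsilon_greedy_action(Q, s, actions, epsilon):
    """With probability epsilon take a random action; otherwise act greedily on Q."""
    if random.random() < epsilon:
        return random.choice(actions)                 # explore
    return max(actions, key=lambda a: Q[(s, a)])      # exploit: argmax_a Q(s, a)

# simple annealing schedule: start mostly random, decay toward (near-)greedy
epsilon, decay, epsilon_min = 1.0, 0.995, 0.01
for episode in range(1000):
    # ... run one episode, selecting actions with epsilon_greedy_action(...) ...
    epsilon = max(epsilon_min, epsilon * decay)
```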
okay, good. and in contrast, SARSA has some benefits also: the benefit of SARSA is that it is often safer during training, because it is always going to do what you think is the best thing to do; it's not going to randomly
explore. so if you're teaching a teenager how to drive, you probably want to be doing something more like SARSA than Q-learning, because you don't want to flip a coin every so often and decide on some random action that might take you off the road or cause a safety issue. so SARSA, because it's on-policy, means you're always doing what you think is the best thing, even though it's more conservative; so it will learn less quickly, and probably less optimally, but it is safer while it's learning. and similarly, SARSA, because it's always on-policy, will typically have more cumulative reward during the learning process. so if i care about how much reward i get during my learning process, then SARSA will actually give me more cumulative reward, because it's always on-policy; it's not making these occasional bad or sub-optimal moves just to gain experience. and so it's an interesting trade-off, right?
like this actually has a lot of parallels with human life as you might expect and in
biological learning sometimes you want to be more conservative sometimes you want to maximize
your rewards throughout the learning process sometimes you put a premium on being able
to take chances and introduce some randomness and learn from experiences that are not what you think is the optimal experience; you know, go backpacking in Europe, right, that's a Q-learning type of action. okay, good. so what i hope i've convinced you of is that when you don't have a model for your MDP, for your Markov decision process,
there are things that are inspired by dynamic programming and Bellman's equation; you can think of these almost as approximate dynamic programming based on trajectories of real-world experience: i actually take actions in the real world through trial and error, and i start to accumulate information that can give me a better view of what the quality of a given state-action pair is. and there are off-policy and on-policy variations of this that have different strengths and weaknesses. okay, good. so that's what we talked about here, this
gradient-free, model-free reinforcement learning. Q-learning is used a lot for deep reinforcement learning, and so i'll definitely talk about that in a future lecture. all right, stay tuned, there's more to come. thank you.