Transcript for:
Lecture Notes on Reinforcement Learning

Welcome back. Today we're continuing this lecture series on reinforcement learning, which is one of my favorite topics at the intersection of control theory, machine learning, and also biological learning. It's a really old, rich topic with decades of development, and especially recently there has been a lot of renewed interest in reinforcement learning because of its combination with deep neural networks. Today I'm going to focus primarily on model-free reinforcement learning, and specifically on gradient-free methods. This lecture will culminate in Q-learning, which is one of the most popular reinforcement learning techniques available today; Q-learning is an off-policy, gradient-free method. To build up to Q-learning we're going to introduce some concepts like Monte Carlo learning and temporal difference (TD) learning. TD learning especially has really deep connections to biological learning, to how we actually learn in our coupled network of neurons, so I'm really excited; I think this is going to be a pretty neat lecture. Just a reminder: this roughly follows a new chapter in the second edition of our book, Data-Driven Science and Engineering. I put that chapter together to be the end cap of the control section, because reinforcement learning really is such an exciting, growing set of techniques.

Okay, good. So: model-free reinforcement learning, specifically gradient-free methods, today. Before jumping into all of the gradient-free optimizations I'm going to talk about, which will culminate in Q-learning, I want to recap a couple of topics from the last few video lectures. The first thing that is really important is to remind you that there exists this quality function Q, which essentially tells you the joint quality, or value, of taking an action a given a current state s. The value function V tells you what the value of being in a current state s is, assuming you take the best action; the quality function is somehow richer and contains more information: it tells you the quality of being in this state s for any action a I might take, which is why we sometimes call it the joint quality function. This quality function can be defined using the standard Markov decision process (MDP) language we've also introduced, where it is the expected future reward given that I am in state s now and I take action a. It is the expected value of the instantaneous reward I'm going to get taking action a now, which will lead me to state s', plus any discounted future rewards I might get for being in that next state s'. We talked a lot about this in the last lecture, and it essentially says that the quality of being in a state now is my expected current reward plus all of my future rewards for being in the next state I'm going to jump to. You can write it out as a summation over probabilities, because the Markov decision process, which is the cornerstone of a lot of reinforcement learning algorithms, assumes that the environment you find yourself in evolves probabilistically: there is a coin-flip element of randomness to how the state evolves, even given the same action.
So even if I take the same action in the same state, I might end up in different future states, according to some probability given by this function P. Again, s' is my future state, s is my current state, and a is my current action, and the expected value has to be summed over all of the possible future states s' I might find myself in, which are given by this probabilistic model P.

Right off the bat, what I want to point out, and this is why we need today's lecture, is that all of the learning we have demonstrated using the Bellman equation and dynamic programming up until now, so policy iteration and value iteration, assumes that you have a model for P, this probabilistic Markov decision process that evolves your system forward in time. We also assumed that we have a model for the reward function R, and you can then optimize using this knowledge. If you have this quality function, remember that you can derive the value function at a state s: it is simply the quality for the best action a you could take. Similarly, you can derive the optimal policy pi: it is simply the action a that maximizes the quality function. But again, the central topic of the past lecture was dynamic programming using Bellman optimality when we actually have a model for P and for R, when we have a model for how the system evolves and what the reward is as a function of the state and action. In many systems we don't actually know how the system is going to evolve, we don't have a model for that, and we don't know what the reward function or the reward structure is before exploring and trying things out. That is the motivation for the model-free reinforcement learning algorithms I'm going to talk about today.

So this is just recap: I wanted you to remember the quality function, which is really important; we also have a value function and a policy pi. Again, if I had a model for P and a model for R, I could run through value iteration, where I would essentially take actions that maximize my value and update my value function at every state I visit along some trajectory. Let's say I'm playing chess: I would just play chess over and over again, and at every state I find myself in I would use my best policy to get my best action a, and I would update the value of my current state s using the new reward information I get. This all relies on Bellman's equation. But today we're going to talk about what happens when we don't have access to a model. When we don't have access to a model for how the system evolves or what the reward structure is, we are essentially stuck using trial-and-error learning, and that makes a lot of sense in a lot of biological situations too: the first time you played tic-tac-toe, maybe the person who taught you told you that three X's in a row wins or three O's in a row wins, but they didn't tell you what the value of a particular state was, and they didn't give you a model or a rule for how the game was going to unfold. You just had to play a bunch of games and learn the rules and the reward structure.
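To keep this recap concrete, the relationships above can be written compactly in the MDP notation from the previous lectures (gamma is the discount factor, R(s',s,a) the reward for landing in s' from s under action a, and P(s'|s,a) the transition probability):

```latex
% Quality function: expected immediate reward plus discounted future value
Q(s,a) = \mathbb{E}\bigl[ R(s',s,a) + \gamma\, V(s') \bigr]
       = \sum_{s'} P(s' \mid s,a)\,\bigl[ R(s',s,a) + \gamma\, V(s') \bigr]

% Value function and optimal policy recovered from Q
V(s) = \max_{a} Q(s,a), \qquad \pi(s) = \arg\max_{a} Q(s,a)
```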
The simplest approach to model-free learning, learning purely from trial and error, is called Monte Carlo learning. I don't actually recommend doing Monte Carlo learning unless you have to, but it illustrates the simplest thing you could possibly do. In Monte Carlo learning, the first thing we do is define a cumulative reward function. Monte Carlo learning is what's called an episodic learning algorithm, which means it needs to run an entire episode, a long state-action sequence leading to some definitive end of the game, before it can use that information for learning. So Monte Carlo learning works, in theory, for things like games that have a definitive end, or maybe if you're trying to make a bipedal walker race to a finish line; that episode would also have a definitive end. What you do is pick some policy, enact that policy, and run through your game: you play your game of chess or you run your race, and you compute your cumulative total reward, of course discounted by gamma. We always discount by gamma, because that's how we can prove optimality in certain conditions. Based on this total cumulative reward, the simplest thing we could possibly do is divvy that reward up equally among every state my system visited on the way to the finish line. I'm going to demonstrate this for value-function learning, and it can also be done for the Q function. Essentially, at the end of an episode, once we've computed this cumulative reward R_Sigma, how much I gained over that episode, I say that my new value function at every single state along the way is equal to my old value function plus a fraction 1/n, where n is the number of steps in my trajectory, of the difference between R_Sigma and my old value estimate. This update says that our best estimate of the value function is still essentially the old value function, but every state we hit along the way to the finish line gets a share of the cumulative reward, divided by the total number of steps. Now, this is a very inefficient way of learning: you have to play tons and tons of games of chess or tic-tac-toe to actually learn this way, because you're assuming that every single step you took along the way to a victory or a loss was equally important. For example, if I played a brilliant game of chess and then made one bad move at the end, all of those moves would be weighted equally in this scheme; or if I played a bad game of chess and my opponent got unlucky and lost, I would actually weight all of those bad moves into the updated value function. But again, this is the simplest, most common-sense thing you could do to start accumulating the information you've experienced: this is the experienced reward, and you just divvy it up over every single step you took to get that reward. That's what Monte Carlo learning is, and you can do the same thing for the quality function: you update your quality function in exactly the same way, where every state-action pair you saw along the trajectory you just played gets updated by a small amount, given by this cumulative reward divided by n.
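A minimal sketch of this episodic Monte Carlo update for a tabular value function, assuming the episode has already been recorded as a list of visited states and a list of rewards; the dictionary V and the variable names are illustrative, and the update is written in the difference form discussed in the next paragraph:

```python
def monte_carlo_value_update(V, states, rewards, gamma=0.9):
    """Episodic Monte Carlo update: spread the discounted cumulative reward
    equally over every state visited during the episode."""
    # Discounted cumulative reward R_Sigma for the whole episode
    R_sigma = sum(gamma**k * r for k, r in enumerate(rewards))
    n = len(states)
    for s in states:
        # V_new(s_k) = V_old(s_k) + (1/n) * (R_Sigma - V_old(s_k))
        V[s] = V.get(s, 0.0) + (R_sigma - V.get(s, 0.0)) / n
    return V
```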
You can convince yourself that this should actually converge as an optimizer under certain conditions. Notice that we are only updating based on the difference between the reward and the old value function, or between the reward and the old Q function, because if my old value function or Q function were perfect, it would perfectly predict that future reward: with a perfect value function, this difference would be zero. That's why we update with the difference between the actual reward and the estimated reward; V_old and Q_old here are essentially my estimated future reward for being in state s_k. Good, that's Monte Carlo learning. It is relatively inefficient, very sample-inefficient in fact, and I don't recommend doing it in practice, but one of the nice things about it is that it introduces no bias into the learning procedure: if you play enough games, in principle this should converge. You should also do a thought experiment about whether we should be enacting our optimal policy pi or a sub-optimal policy. If I were doing this, I would always be enacting my optimal policy pi, based on my new Q function or value function, but I want you to think through what would happen if you didn't enact the optimal policy.

Good, so that's Monte Carlo learning, and it is one extreme. On the other extreme is what we call temporal difference learning. TD learning is one of the most important advances that has happened in reinforcement learning; it is decades old, and in many cases it gives us a better strategy for learning, one that essentially highlights events that more recently gave me rewards. I have some state-action sequence and some rewards I've gotten along the way, and to some extent I want to say that the events that happened more recently are more likely to be related to the rewards I'm getting. That difference in time is also a property of biological learning systems. Think about Pavlov's dog: the dog was conditioned to expect a food reward at some point after hearing a bell ring, so when you ring the bell, the dog starts to salivate. There is a finite difference in time between the action or the state, the environmental cue, and the actual expected reward. In Monte Carlo learning you're basically saying that the right action could have happened at any point in the past leading up to the reward; what temporal difference learning says is that there is an optimal time in the past when my actions are most likely to be correlated with a current reward, and that's a tunable parameter I'll tell you about.

I'm going to start with something called TD(0). There is a whole family of temporal difference learning algorithms, and we'll start with the zeroth one, again in a value-function formulation. Remember that the value function at my state now, where the subscript k means at time k, is given by my expected reward now, r_k, plus the discounted value function at the next state I find myself in, s_{k+1}. This is essentially a Bellman optimality condition: the value function can be written recursively.
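Written out in the notation above, that one-step Bellman recursion is:

```latex
% One-step Bellman recursion for the value function
% (expectation taken over the next state s_{k+1})
V(s_k) = \mathbb{E}\bigl[ r_k + \gamma\, V(s_{k+1}) \bigr]
```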
So the value function now can be written recursively in terms of the value function at the next state I find myself in. What temporal difference learning does, and this is different from Monte Carlo learning, is update the value function at my state s_k. It similarly starts with my old value function, and there is a correction term; but instead of averaging over the whole trajectory the way Monte Carlo learning does, in TD learning the first term of that correction, called the TD target estimate, is almost exactly the expected value function. That term is what I expect my value to be given the state s_k, and I take its difference with the old estimate of my value function at s_k. So one piece is my old best estimate of the value of being in the current state, and the other is new information about what actually happened: I take an action, I get a reward r_k, and I find myself in a new state s_{k+1}, where I have an estimate of the value function. That expression is my expected value of being at state s_k.

I want to say that again, because it is really important and a little confusing. Because of the Bellman optimality condition, where the value of being in a state at the current time equals my reward now plus the discounted value of being in the state one time step into the future, I can essentially say that my estimate of my value function should look like my current reward plus gamma times my value function at the next state. But if I am actually playing a trial-and-error game, if I'm playing a game of chess using my supposedly best policy pi, I can actually take that step to s_{k+1}, gain or lose some reward, and compute this actual cumulative quantity, which we call R_Sigma. If there is a discrepancy between my actual R_Sigma and my estimated value, then I update my value function, with some weight alpha. So in some sense this has the same flavor as Monte Carlo learning: I start with my old value function and iterate to find a new value function, but now I iterate every time I take a step, and I use this Bellman-nested reward R_Sigma for the update. The first piece is called the TD target, and the difference is called the TD error, the temporal difference error.

It is quite interesting that there is actually evidence of analogs of this in neuroscience. Cells that fire together wire together; that is one of the principles of neuroscience, and when dopamine is released, it can strengthen connections between the cells that fired together right before the dopamine was released: those cells will strengthen their correlation of firing together. There is a lot of really interesting theory and experiment connecting these TD error signals and the release of dopamine in actual biological systems. It is believed that there are circuits in your brain and nervous system that are estimating this future reward, this R_Sigma, and when you actually get a future reward that is different from what you estimated, you fire to strengthen those connections.
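A minimal sketch of one TD(0) update on a tabular value function, assuming we observe a single transition (s_k, r_k, s_{k+1}) at a time; alpha and gamma are the learning rate and discount factor from the discussion above, and the variable names are illustrative:

```python
def td0_value_update(V, s_k, r_k, s_k1, alpha=0.1, gamma=0.9):
    """One TD(0) step: nudge V(s_k) toward the TD target r_k + gamma*V(s_{k+1})."""
    td_target = r_k + gamma * V.get(s_k1, 0.0)   # estimate of the value of being at s_k
    td_error = td_target - V.get(s_k, 0.0)       # temporal difference error
    V[s_k] = V.get(s_k, 0.0) + alpha * td_error  # weighted correction
    return V
```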
This is nice because TD(0), at least, is only looking one time step into the future: you get your reward now and your estimated reward one time step in the future. So it is weighting events right before a reward takes place; there is a one-delta-t delay between an action and a reward, and things that are correlated at that one-delta-t delay get strengthened in this TD(0) framework.

Okay, good. I already told you about the strong connections to biological learning with dopamine. But you could also expand this value function out: it would be r_k plus gamma r_{k+1} plus, maybe I'll write it out here, gamma squared times my value function two steps into the future, and I can do this as many times as I want. That is a two-step expansion of my value function, again using Bellman optimality, and the two forms are completely equivalent. I can replace my estimate of R_Sigma with this longer expression that has two state-action steps in it, and if I do that I now have a richer set of experiences of what my actual cumulative reward is, looking two steps into the future, and I can use that to generate the temporal difference error signal that updates my value function. You can imagine doing this as many times as you want: you could have an n-step cumulative reward, and in the limit as n goes to infinity, or to the end of your episode, this actually converges to Monte Carlo learning, which is kind of neat. So at the very limit of temporal difference learning, as n goes to infinity or to the end of your sequence, this converges in some sense to Monte Carlo learning; but if I take n equal to zero, or one, or two, then I am weighting only a few steps in the past more heavily when I update my value function.

Good, so that's the idea of temporal difference learning. In a minute I'm going to show you that Q-learning is essentially just temporal difference learning on the quality function Q, but there is one last variant called TD(lambda). What it does is say: instead of just using a one-step temporal difference, TD(0), or a two-step temporal difference, TD(1), or an n-step temporal difference, I am going to compute that target, that R_Sigma, for all of them, and weight them using an exponential weighting with some factor lambda between 0 and 1. So essentially I am weighting all of the TD algorithms, and then I use that weighted reward signal to compute my TD error. This has some nice properties: you can tune lambda depending on the problem you're working on to get nice convergence properties, to weight recent events more or older events more, but you are weighting all of them instead of ignoring past events. So TD(lambda) is pretty interesting; I believe it was originally introduced by Sutton.
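For reference, one standard way to write the n-step target and the exponentially weighted TD(lambda) target (this follows Sutton's lambda-return; the (1 - lambda) factor is just the normalization of the geometric weights):

```latex
% n-step TD target: expand the Bellman recursion n steps forward
R_{\Sigma,k}^{(n)} = r_k + \gamma\, r_{k+1} + \cdots + \gamma^{\,n-1} r_{k+n-1}
                   + \gamma^{\,n}\, V(s_{k+n})

% TD(lambda) target: exponentially weighted average of all n-step targets
R_{\Sigma,k}^{\lambda} = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{\,n-1}\, R_{\Sigma,k}^{(n)}
```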
Okay, so Q-learning. This is the culmination of what I really wanted to show you: Q-learning is essentially just temporal difference learning on the Q function, so I'm going to write it out. Q-learning uses exactly the same template as temporal difference learning. We play games, we do trial-and-error learning by trying our supposedly optimal policy, and we have an estimate of the quality of being in a state s and taking an action a. But that is not a perfect estimate, we are still learning, so we measure the actual reward and the Q function at the next step we find ourselves in, and we use that as the TD target for TD learning. So again, this is the TD target estimate, essentially my observed cumulative reward, and I use it to generate an error signal that I feed back to update my Q function. What this means is that if I got more reward than I thought I would, I should increase my Q function at that s and that a: if I experienced more reward than I expected, I get a positive increase in my quality function at that s_k and a_k.

And you'll notice here, and this is a very subtle difference, I'm going to say this again, a very subtle difference: here we are maximizing over a. What we are actually doing is taking some action a to get our reward r_k, but it does not have to be our optimal, on-policy action a_k; I could take some random action to get this r_k. But then, when I compute the quality of that next state s_{k+1} for the TD target, we optimize over actions a. This is a little bit strange, but I'll tell you what it means in a minute. We call this "off-policy", which means that I can take sub-optimal actions a, I don't have to enact my current optimal policy pi, to get this reward; but then I do need to maximize over a to get my best Q function in this step of the TD target. This has a lot of benefits: essentially, I can learn from experience, I can replay old experiences even when I know they were not generated by an optimal policy pi, because I have learned more since then. Or I can watch a grandmaster play, and even if my quality function is not very good, because I'm an amateur, I can watch their actions and learn from them.
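A minimal sketch of one tabular Q-learning update, with Q stored as a dictionary keyed by (state, action) pairs; the list of candidate actions, alpha, and gamma are illustrative assumptions:

```python
def q_learning_update(Q, s_k, a_k, r_k, s_k1, actions, alpha=0.1, gamma=0.9):
    """One off-policy Q-learning step:
    Q(s_k,a_k) += alpha * (r_k + gamma * max_a Q(s_{k+1}, a) - Q(s_k, a_k))."""
    # The TD target maximizes over actions at s_{k+1}, not the action actually taken
    best_next = max(Q.get((s_k1, a), 0.0) for a in actions)
    td_target = r_k + gamma * best_next
    td_error = td_target - Q.get((s_k, a_k), 0.0)
    Q[(s_k, a_k)] = Q.get((s_k, a_k), 0.0) + alpha * td_error
    return Q
```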
So I'll talk  a little bit more about this but that essentially   is what we mean by off-policy, is that you can  still learn even if you take sub-optimal actions *a*   and you'll notice that this is a  td0 variant because this is only   taking one step forward and then  using the recursive queue formulation   good there is another variant so q learning  is off policy td learning of the q function   and there's something called sarsa which is  on policy td learning of the q function very   closely related only a tiny tiny difference but  it's an important one and i'll tell you about it   so sarsa means state action reward state action  it basically means you start with a state action   you take you're in a state and you take an  action you measure a reward and then you uh   are in another state and you take another action  okay sarsa and so here you'll notice that this   is the exact same expression the same exact  temporal difference learning expression td0   except here i am not doing a max over q a max over  a in this part of the td estimate and so what that   means is that i actually have to be taking my  best possible policy actions a a k plus 1 and a   k at every step or else this will start to  degrade my estimate because the quality function   q is always my estimate of the quality of being in  that next state taking that next action assuming i   do the best thing forever after okay and so if i  don't plug into here my best possible a and into   here if i didn't take my best possible action ak  to get this reward then this will actually be a   sub-optimal estimate of my my quality function and  it will start to degrade as i iterate through very   subtle difference and i'll put them next to each  other so you can see the difference but this is   what's called on policy because you have to  enact the optimal policy for this to work   for this to converge and not degrade one of  the cool things about sarsa is that sarsa can   actually work for any of the td variants i showed  it for td0 here but you can do it for td1 td2 tdn   and you would basically just replace this td  target here with a more complex or rich expression   taking more steps and again if you think about  this what this means is that you have to take   actual steps you actually have to take actions  and get that trial and error experience and then   you look back and you see okay well what were my  sequence of events and rewards and you can use   that to update your quality function after gaining  that experience good so so uh yeah good this is   um yeah and you can essentially just replace this  td target for the cumulative reward with any old   td algorithm you like td0 td1 tdn or the temporal  difference lambda scheme i showed you a minute ago   okay and so this is on policy tv  learning of the quality function q   so i want to show these kind of simultaneously  next to each other so that we can compare and   contrast because they are almost perfectly  identical okay except again q learning is   what's called off policy and we enable q learning  to be off policy by replacing this one a k plus   one here with a max over a here because again the  quality function and the value function are always   assuming you do whatever is optimal in the future  so they're always assuming that after the next   step you do the best thing forever after so it's  the optimal kind of quality of being in that state   and so if you didn't maximize over a here then if  you took an off policy a a sub-optimal action a   this would actually 
I want to show these side by side so we can compare and contrast, because they are almost perfectly identical, except again that Q-learning is what's called off-policy. We enable Q-learning to be off-policy by replacing this one a_{k+1} with a max over a, because the quality function and the value function always assume you do whatever is optimal in the future: they always assume that after the next step you do the best thing forever after, so it's the optimal quality of being in that state. If you didn't maximize over a here, then taking an off-policy, sub-optimal action a would actually degrade your quality function over time. But because we are optimizing over a in the quality function at the next step, we can take sub-optimal actions a to explore and to learn from off-policy experiences, and again this is a huge benefit. This single difference, the one that allows us to be off-policy, is one of the things that allows Q-learning to learn from many different types of experiences. You can watch someone else play a chess game and still learn in an off-policy manner; even if your quality function is not as good as the grandmaster's you are learning from, you can accumulate information about how they play and improve your quality function. So you can learn from imitation, and you can also learn by replaying your past experiences, even if they were sub-optimal. When you go to sleep at night and dream of playing tennis, even if you are replaying things that were not done perfectly, because you have learned throughout the day you can still use that experience to improve your quality function. So Q-learning is often faster at learning because you can explore more, so you get better solutions and faster convergence, and often lower-variance solutions; actually, I'm sorry, I think that should say higher variance, I'm pretty sure that's a typo, so you'll have to check the book chapter.

Okay, good. When you do Q-learning, there are lots of strategies for how you introduce randomness into the policy, into what action you take. There is something called epsilon-greedy, which basically means that (1 - epsilon) of the time, say epsilon is 0.05, so 95% of the time, you enact your optimal policy, and 5% of the time you do something random, a random action, to explore your space more. That's an epsilon-greedy algorithm. In these epsilon-greedy algorithms you often start with a big epsilon, so you start by doing mostly random things, and then that epsilon cools down to zero in an annealing, or simulated annealing, strategy. As you get more experience and learn better, you eventually go more on-policy as that epsilon decays to zero. That's a really important one, epsilon-greedy, but there are tons of different ways of introducing randomness and exploration into your off-policy search strategies.
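A small sketch of epsilon-greedy action selection with an annealed epsilon, in the same tabular setting as the sketches above; the decay schedule and parameter values are illustrative assumptions:

```python
import random

def epsilon_greedy_action(Q, s, actions, epsilon):
    """With probability epsilon take a random exploratory action,
    otherwise take the greedy on-policy action argmax_a Q(s, a)."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((s, a), 0.0))

# Example annealing schedule: start mostly random, cool toward greedy behavior.
epsilon = 1.0
for episode in range(500):
    # ... run one episode, choosing actions with epsilon_greedy_action(...)
    epsilon = max(0.0, epsilon * 0.99)  # decay epsilon toward zero over time
```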
Okay, good. In contrast, SARSA has some benefits too. The benefit of SARSA is that it is often safer in training, because it always does what you currently think is the best thing to do; it is not going to randomly explore. If you're teaching a teenager how to drive, you probably want to be doing something more like SARSA than Q-learning, because you don't want to flip a coin every so often and decide on some random action that might take you off the road or cause a safety issue. SARSA being on-policy means you are always doing what you think is best, even though it is more conservative: it will learn less quickly, and probably converge to a less optimal solution, but it is safer while it is learning. Similarly, because SARSA is always on-policy, it will typically accumulate more reward during the learning process. So if I care about how much reward I get while I'm learning, SARSA will actually give me more cumulative reward, because it's always on-policy; it's not making those occasional bad or sub-optimal moves just to gain experience.

So it's an interesting trade-off, and it actually has a lot of parallels with human life and biological learning, as you might expect. Sometimes you want to be more conservative and maximize your rewards throughout the learning process; sometimes you put a premium on being able to take chances, introduce some randomness, and learn from experiences that are not what you think is optimal. Going backpacking in Europe is a Q-learning type of action.

Okay, good. What I hope I've convinced you of is that when you don't have a model for your Markov decision process, there are methods inspired by dynamic programming and Bellman's equation; you can think of them almost as approximate dynamic programming based on trajectories of real-world experiences. I actually take actions in the real world through trial and error, and I start to accumulate information that gives me a better view of the quality of a given state-action pair, and there are off-policy and on-policy variations with different strengths and weaknesses. Good, so that's what we talked about here: gradient-free, model-free reinforcement learning. Q-learning is used a lot in deep reinforcement learning, so I'll definitely talk about that in a future lecture. All right, stay tuned; there's more to come. Thank you.