welcome back so today we're continuing this
lecture series on reinforcement learning which is one of my favorite topics at the intersection of
control theory and machine learning and actually also at the intersection of biological
learning so really really old rich topic decades of development and especially recently
there's a lot of renewed interest in reinforcement learning because of its combination with deep
neural networks and so today i'm going to focus primarily on model-free reinforcement learning
and specifically on gradient-free methods so this lecture is going to culminate in q-learning
which is one of the most popular reinforcement learning techniques available today so that's
this q learning here which is an off policy gradient free method and to build up to q-learning
we're going to have to introduce some concepts like Monte Carlo learning and temporal difference (TD) learning. and TD learning especially has really deep connections to biological learning, how we actually learn in our coupled network of neurons. so
i'm really excited, i think this is going to be a pretty neat lecture. and just a reminder, this is roughly following a new chapter in the second edition of our book, data-driven science and engineering. so this is a chapter i put together to be the end cap of the control section, because reinforcement learning really is such an exciting and growing set of techniques. okay, good. so: model-free reinforcement learning, specifically gradient-free methods, today. now, before jumping into all of the gradient-free optimizations i'm going to talk about, which will culminate in Q-learning, i want to recap a couple of topics that we talked about in the last few video lectures. so the first thing that is really important is just to remind you that there exists
this quality function q which essentially tells you the joint quality or value of taking an action
a given a current state s so the value function v tells you what the value of being in a current
state s is, assuming you take the best action. this quality function is somehow richer, and it contains more information: it tells you the quality of being in this state s for any action i might take. so "joint quality function" is sometimes what we call it. and this quality function can be defined using the standard Markov decision process (MDP) language that we've also introduced, where it is the expected future reward given that i am in a state s now and i take an action a. so it is the expectation of the instantaneous reward i'm going to get taking action a now, which will lead me to state s prime, plus any future rewards that i might get just for being in that next state s prime.
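as an equation (using roughly the notation from the previous lectures, writing the reward as R(s', s, a) and gamma as the discount factor), this is:

$$ Q(s,a) = \mathbb{E}\!\left[\, R(s',s,a) + \gamma\, V(s') \,\right] $$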
okay, so we talked a lot about this in the last lecture, and this is essentially just saying that the quality of taking an action in a state now is my expected current reward plus all of my future rewards for being in the next state that i'm going to jump to. and you can write it out in terms of this summation
over probabilities, because this Markov decision process, this MDP, which is the cornerstone of a lot of reinforcement learning algorithms, assumes that the environment you find yourself in evolves probabilistically. so there's kind of a coin-flip randomness element to how the state evolves, even given the same action: even if i take the same action in the same state, i might have different future states according to some probability, and that's given by this function P. so s prime again is my future state, s is my current state, and a is my current action, and so this expected value here has to be summed
up over all of the possible future states s prime i might find myself in, which are given by this model, this probabilistic evolution P. so right off the bat, what i want to point out, and this is why we need this lecture today, is that all of the learning that we have demonstrated using the Bellman equation and dynamic programming up until now, so policy iteration and value iteration, assumes that you have a model for this P, this probabilistic Markov decision process that evolves your system forward in time, and we also assume that we have a model for this reward function R. and you
can then optimize using this knowledge. and if you have this quality function, remember that you can derive the value function at a state s: it's simply the quality for the best possible action a you could take. and similarly you can derive the optimal policy pi: it's simply the action a that maximizes the quality function. but again, the central challenge: what we talked about in the past lecture was dynamic programming using Bellman optimality when we actually have a model for P and for R, when we have a model for how the system evolves and what the reward is as a function of the state and action. but in many systems we don't actually know how the system is going to evolve, we don't have a model for that, and we don't know what the reward function or the reward structure is before exploring and trying things out.
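to summarize this recap in equations (same notation as above; P(s'|s,a) is the transition probability, R the reward, gamma the discount factor), roughly:

$$ Q(s,a) = \sum_{s'} P(s'\mid s,a)\,\big( R(s',s,a) + \gamma\, V(s') \big), \qquad V(s) = \max_a Q(s,a), \qquad \pi = \operatorname*{argmax}_a Q(s,a) $$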
so that's going to be the motivation for these kind of model free reinforcement learning
algorithms i'm going to talk about today so this is just recap i wanted you
to remember the quality function is really important we also have
a value function and a policy pi. and again, if i had a model for P and a model for R, i could run through this value iteration where i would essentially take actions that maximize my value, and i would update my value function at every state that i visit on some trajectory. so let's say i'm playing chess: i would just play chess over and over and over again, and at every state i find myself in i would use my best policy to get my best action a, and i would somehow update the value of my current state s using the new reward information that i get. so this all relies on Bellman's equation. but today we're going to talk about
when we don't actually have access to a model so when we don't have access to a model for how
the system evolves or what the reward structure is we essentially are stuck using trial and
error learning and that makes a lot of sense in a lot of biological situations too right
the first time that you play tic-tac-toe maybe the person that taught you told you you know that
three x's in a row wins or three o's in a row wins but they didn't tell you what the value of
a particular state was and they didn't give you a model or a rule for how this game is going
to unfold you just had to play a bunch of games learn the rules and learn the reward structure
okay. and so the simplest approach to model-free learning, learning purely from trial and error, is called Monte Carlo learning. i don't actually recommend doing Monte Carlo learning unless you have to, but it illustrates the simplest possible approach. and so in Monte Carlo learning, what we're going to do first is define this cumulative reward function. so
Monte Carlo learning is what's called an episodic learning algorithm, which means that it needs to run an entire episode, a long state-action sequence leading to some final end of the game, before it can use that information for learning. and so Monte Carlo learning works, in theory, for things like games that have a definitive end, or maybe if you're trying to make a racer, say a bipedal walker racing to a finish line, that would have a definitive end for that episode as well. and so what you do
is essentially: you pick some policy, you enact that policy, and you run through your game; you play your game of chess or you run your race, and you compute your cumulative total reward, of course discounted by this gamma (we always discount by gamma because that's how we can prove optimality in certain conditions). and so based on this total cumulative reward, the simplest thing we could possibly do is divvy up that reward equally among every state that my system visited on the way to the finish line. okay, so i'm going to demonstrate this for value function learning, and it can also be done for Q learning. and so essentially what we would do is, at the end of an episode, once we've computed this cumulative reward, how much did i gain over that episode, i would say that my new value function at every single state along the way is equal to my old value function plus a little bit of this R sigma divided by how many steps there were in my trajectory. so this is essentially an update that says that our best estimate for our value function is still roughly the old value function, but for any states that we hit along the way to the finish line, we are going to add the cumulative reward divided
by the total number of steps. now, this is a very inefficient way of learning: you have to play tons and tons of games of chess or tic-tac-toe to actually learn this way, because what you're assuming is that every single step you took along the way was equally important to the victory or the loss. so for example, if i played a brilliant game of chess and then made a bad move at the end, all of those moves would be considered equal in this scheme; or if i played a bad game of chess and my opponent got unlucky and lost, then i would actually weight all of those bad moves into this updated value function. but again, this is the simplest, most common-sense thing you could do to start accumulating the information you've experienced: this is the experienced reward, and you just divvy it up over every single step you took to get that reward. okay, that's what Monte Carlo learning is.
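written as an update equation (this is the difference form discussed just below, where n is the number of steps in the episode and R sigma is the discounted cumulative reward; the exact indexing convention for the discount may differ slightly from the book), roughly:

$$ R_\Sigma = \sum_{k=1}^{n} \gamma^{k} r_k, \qquad V^{new}(s_k) = V^{old}(s_k) + \frac{1}{n}\big( R_\Sigma - V^{old}(s_k) \big) \quad \text{for every } s_k \text{ visited in the episode.} $$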
and you can do the same thing for Q learning, so you can also update your quality function in exactly the same way, where basically every state-action pair that you saw along the trajectory you just played would get updated, again by a small amount given by this reward divided by n. okay, and you can convince yourself that this should be an optimizer, that this should actually converge under certain conditions. and we're only updating based on the difference between the reward and the old value function, or the reward and the old Q function, because if my old value function or Q function were perfect, then they would
perfectly predict what this future reward would be. so if i had a perfect value function, this difference would be zero here and zero here. so that's why we're updating with the difference between the actual reward and the estimated reward, because that's what this is: this V old and this Q old here are essentially my estimated future reward for being in
state s k. okay, good, that's Monte Carlo learning: relatively inefficient, but one of the nice things about it is that it has no bias. it is very sample inefficient, and i don't recommend doing it in practice, but it introduces no bias into the learning procedure, and if you play enough games, in principle this should converge. and you should do a thought experiment of whether or not we should be enacting our optimal policy pi or a sub-optimal policy; if i was doing this, i would always be enacting my optimal policy pi, based on my new Q function or value function, but i want you to think through what would happen if you didn't enact an optimal policy pi. okay, good. so that's Monte Carlo
learning and that's kind of one extreme and on the other extreme is what
we have called temporal difference learning. and TD learning is one of the most important advances that has happened in reinforcement learning; this is decades old, and it gives us a better strategy in many cases for learning, in that it essentially highlights the events that most recently gave me rewards. so i have some state-action sequence and i have some rewards i've gotten along the way, and to some extent i want to say that maybe the events that happened more recently are somehow related to the rewards i'm getting. and so that kind of difference in time
is also a property of biological learning systems so if you think about pavlov's dog right pavlov's
dog was conditioned to expect a food reward at some point after hearing this
bell ring and so if you ring the bell you know the dog starts to salivate and so there
is a finite difference in time between the action or the state the environmental cue and the actual
expected reward so in monte carlo learning you're basically saying that the right action could have
happened at any point in the past leading up to the reward now what temporal difference learning
does is it says there is kind of an optimal time in the past where i am most likely to be
correlated with a current reward and that's a tunable parameter i'll tell you about okay so i'm
going to start with something called td0 there's a whole family of temporal difference learning
algorithms we're going to start with the kind of zeroth model here and again we're going to start
by looking at this in a value function formulation. so remember that the value function at the state now (subscript k means at time k), so the value function at my state now at time k, is given by my expected reward now, r k, plus a discounted value function at the next state i find myself at, s k plus one. okay, and this is essentially a Bellman optimality condition, that this can be written recursively: the value function now can be recursively written in terms of the value function at the next state i find myself in. and so what temporal difference learning does, and this is different than Monte Carlo learning, is that it updates the value function at my state s k. and again it similarly starts with my old value function, and there is a correction term, and what
that correction term does is instead of averaging over all of the trajectory like what monte carlo
learning does, in TD learning we have this first term here, called the TD target estimate, and if you look, this term is almost exactly the expected value function. okay, so this term here is what i expect my value to be given a state s k, and i take its difference with the old estimate of my value function at s k. so this is my old best estimate of the value of being in the current state now, and this is new information about what actually happened: i take an action, i get a reward r k, and i find myself in a new state s k plus one where i have an estimate of the value function. so this right here is my expected value of being at state s k. i want to say that again, because this is really
important, and it's a little confusing, because of the Bellman optimality condition here, where the value of being in a state at time now is equal to my reward now plus a discounted value of being in the state one time step into the future. so i can essentially say that my estimate of my value function should look like my current reward plus gamma times my value function at the next state. but if i actually am playing a trial-and-error game, if i'm playing a game of chess using my supposedly best policy pi, i can actually take that step to s k plus one, i can get a reward or lose a reward, whatever, and i can compute what this actual cumulative reward, which we call R sigma, is. and if there's a discrepancy between my actual R sigma and my estimated value, then i will update my value function with some weight alpha here. okay, so in some sense this has the same flavor as Monte Carlo learning: i start with my old value function, and i'm iterating to find a new value function; every time i take a step, i'm iterating this, but now i'm using this Bellman-nested reward R sigma to update. and so this is called the TD target, and this is called the TD error, the temporal difference error; written out, the TD(0) update looks like the sketch below.
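a compact restatement of the TD(0) update (alpha is the learning rate, gamma the discount factor):

$$ V^{new}(s_k) = V^{old}(s_k) + \alpha\,\big(\, \underbrace{r_k + \gamma\, V^{old}(s_{k+1})}_{\text{TD target}} - V^{old}(s_k) \,\big) $$

the whole parenthesized quantity is the TD error.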
and it's kind of interesting: there is actually evidence that there are analogs of this in neuroscience. cells that fire together wire together, that's one of the principles of neuroscience, and when dopamine is released, it can strengthen connections between the cells that fired together right before the dopamine was released; those cells will strengthen their correlation of firing together. and so there is a lot of really interesting theory and experiment connecting these TD error signals and the release of dopamine in actual biological systems. so it's believed that there are circuits in your brain and in your nervous system that are actually estimating this future reward, this R sigma, and i guess that would be this term here, the estimate of your future reward, and then when you actually get a future reward and it's different than what you estimated, you fire to strengthen those connections. okay, so this is nice because TD0 is only looking one time step into the future: you get your reward now and you get your estimated reward one time step in the future, and so this is somehow weighting events right before a reward takes place. so this is kind of a one-delta-t delay between an action and a reward, and things that are correlated at that one-delta-t delay will get strengthened in this TD0 framework. okay, good. so i already told
you about the strong connections to biological learning with dopamine but
you could expand this value function out further: this would be r k, plus gamma r k plus one, plus, maybe i'll write it out here, gamma squared times my value function two steps into the future, and i can do this as many times as i want. so this is kind of a two-step expansion of my value function, using again the Bellman optimality; these are both completely equivalent to each other, and i can replace this estimate for R sigma with this longer expression that has two events in the past, two state-action steps. and so if i do that, now i have a richer set of experiences of what my actual cumulative reward is, now taking two steps into the future, and i can use that to generate the temporal difference error signal that i use to update my value function. and you can imagine doing this as many times as you want: you could have an n-step cumulative reward, and in the limit as n goes to infinity, or to the end of your episode, this will actually converge to Monte Carlo learning, which is kind of neat. so at the very limit of temporal difference learning, as n goes to infinity or to the end of your sequence, this will converge in some sense to Monte Carlo learning; but if i take n equals zero or one or two, then i'm weighting only a few steps in the past more heavily to update my value function. good, so that's the idea of temporal difference
learning. in a minute we're going to show that Q-learning is essentially just temporal difference learning on the quality function Q, so i'll show you that shortly. but there's one last variant, called TD lambda, and essentially what it does is it says: instead of just using a one-step temporal difference (TD0), or a two-step temporal difference (TD1), or an n-step temporal difference, what i'm going to do is compute that target, that R sigma, for all of those, and i'm going to weight them in a certain way using an exponential weighting with some factor lambda that's between 0 and 1. okay, so essentially what i'm doing is weighting all of the TD targets together, and then i use that weighted reward signal to compute my TD error. and this has some nice properties: you can tune this lambda, depending on the problem you're working on, to get nice convergence properties, to weight recent stuff more or older stuff more, but you're weighting all of them instead of just ignoring past events (the n-step target and this lambda-weighted combination are sketched below). good, so TD lambda is pretty interesting; i believe this was introduced originally by Sutton.
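a sketch of these targets in equations, assuming the standard forward-view form of TD(lambda) (consistent with Sutton's formulation); here the n-step target looks n rewards ahead and then bootstraps on the current value estimate, so the TD0 target from before corresponds to n = 1:

$$ R_\Sigma^{(n)} = r_k + \gamma\, r_{k+1} + \cdots + \gamma^{\,n-1} r_{k+n-1} + \gamma^{\,n}\, V(s_{k+n}), \qquad R_\Sigma^{\lambda} = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{\,n-1} R_\Sigma^{(n)} $$

as lambda approaches 1 (or n the episode length), this recovers something like Monte Carlo learning.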
Okay, so: Q-learning. This is the culmination of what I really wanted
to show you, is that Q-learning is essentially just temporal difference learning on the Q function
So I'm going to write it out... So Q learning, it's exactly the same template
as temporal difference learning we are going to play games we're going to you know do trial
and error learning by trying our optimal policy and we have an estimate of what the quality is, given
being in a state *s* taking action *a* But that's not a perfect estimate -- we're still learning -- and
so what we do is we measure what the actual reward and future Q function is, at the next step we find
ourselves in, and we use this as the TD target for TD learning. So again, this is the TD target estimate; this is essentially my observed cumulative reward, and I use that to generate an error signal which I feed back to update my Q function. So essentially what this means is that if I got more reward than I thought I would, I should increase my Q function at that *s* and that *a* ...
so that's what that means: if I experienced more reward than what I expected, then I should have a
positive increase in my quality function at that *s_k* and *a_k*. And you'll notice here, this is a very
subtle difference, and I'm going to say this again because it is a very subtle difference... here we're maximizing over *a* ... so what we're actually doing is we take some step, we take some action to get our reward *r_k*, but it doesn't have to be our optimal on-policy action *a_k*... I could
actually do some random action to get this *r_k* but then when I'm computing my quality
function here for what my value is in that next state *s_k+1*, we're going to
optimize over actions *a*, at least to compute this TD target. Now this is a little bit strange,
but I'll tell you what this means in a minute. Okay so we call that "off policy", which
means that I can actually take sub-optimal actions *a* -- I don't have to enact my optimal current policy
*pi* -- I can take sub-optimal actions *a* to get this reward, but then I do need
to maximize over *a* to get my best Q function in this step of the TD target.
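Written out compactly (alpha is the learning rate, gamma the discount factor; this mirrors the TD(0) update above, but on Q, with the max over the next action):

$$ Q^{new}(s_k, a_k) = Q^{old}(s_k, a_k) + \alpha\, \big( r_k + \gamma \max_{a} Q^{old}(s_{k+1}, a) - Q^{old}(s_k, a_k) \big) $$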
And this has a lot of benefits -- essentially I can learn from experience, I can replay old
experiences, even when I know that those were not using an optimal policy *pi*, because I've
learned more since then. Or I can watch a grand master play and even if my quality function is not
very good, because I'm an amateur, I can watch their actions and learn from their actions. So I'll talk
a little bit more about this but that essentially is what we mean by off-policy, is that you can
still learn even if you take sub-optimal actions *a*. And you'll notice that this is a TD0 variant, because this is only taking one step forward and then using the recursive Q formulation. Good. There is another variant: so Q-learning
is off-policy TD learning of the Q function, and there's something called SARSA, which is on-policy TD learning of the Q function. very closely related, only a tiny difference, but it's an important one, and i'll tell you about it. so SARSA means state-action-reward-state-action: it basically means you're in a state and you take an action, you measure a reward, and then you are in another state and you take another action.
okay, SARSA. and so here you'll notice that this is the exact same temporal difference learning expression, TD0, except here i am not doing a max over a in this part of the TD estimate. and so what that means is that i actually have to be taking my best-possible-policy actions, a k plus 1 and a k, at every step, or else this will start to degrade my estimate, because the quality function Q is always my estimate of the quality of being in that next state, taking that next action, assuming i do the best thing forever after. okay, and so if i don't plug my best possible a in here, and if i didn't take my best possible action a k to get this reward, then this will actually be a sub-optimal estimate of my quality function, and it will start to degrade as i iterate. it's a very subtle difference, and i'll put them next to each other so you can see the difference, but this is what's called on-policy, because you have to enact the optimal policy for this to work, for this to converge and not degrade. one of
the cool things about SARSA is that it can actually work for any of the TD variants: i showed it for TD0 here, but you can do it for TD1, TD2, TDn, and you would basically just replace this TD target here with a more complex or richer expression taking more steps. and again, if you think about this, what this means is that you have to take actual steps, you actually have to take actions and get that trial-and-error experience, and then you look back and you see, okay, well, what was my sequence of events and rewards, and you can use that to update your quality function after gaining
that experience. good. and you can essentially just replace this TD target for the cumulative reward with any TD algorithm you like: TD0, TD1, TDn, or the temporal difference lambda scheme i showed you a minute ago. okay, and so this is on-policy TD learning of the quality function Q. so i want to show these simultaneously, next to each other, so that we can compare and contrast, because they are almost perfectly identical. okay, except again, Q-learning is what's called off-policy, and we enable Q-learning
to be off policy by replacing this one a k plus one here with a max over a here because again the
quality function and the value function are always assuming you do whatever is optimal in the future
so they're always assuming that after the next step you do the best thing forever after so it's
the optimal quality of being in that state. and so if you didn't maximize over a here, then if you took an off-policy, sub-optimal action a, this would actually degrade your quality function over time. but because we are actually optimizing over a in our quality function at the next step, we can take sub-optimal actions a to explore and to learn from off-policy experiences. the two updates are sketched side by side below.
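to make the comparison concrete, here is a minimal tabular sketch of the two update rules in Python. this is just an illustration, not code from the lecture or the book: Q is assumed to be a dictionary keyed by (state, action) pairs (for example a defaultdict(float)), actions(s) is an assumed helper returning the available actions in state s, alpha is the learning rate, and gamma the discount factor.

```python
# Minimal sketch of the two TD(0)-style updates on the quality function Q.
# Assumptions (illustrative, not from the lecture): Q is a dict keyed by
# (state, action), and actions(s) returns the actions available in state s.

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.95):
    """Off-policy: the TD target maximizes over the next action."""
    td_target = r + gamma * max(Q[(s_next, a_next)] for a_next in actions(s_next))
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.95):
    """On-policy: the TD target uses the next action actually taken."""
    td_target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])
```

the only difference is the max over the next action versus the next action actually taken, which is exactly the off-policy versus on-policy distinction being discussed here.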
and again, this is a huge benefit. so this single difference here that allows us to be off-policy is one of the things that allows Q-learning to learn from a lot of different types of experiences. so you can watch someone else play a
chess game and still learn in an off policy manner even if your quality function is not as good
as the grand masters that you're learning from you can accumulate information about how they
play and improve your quality function so you can learn from imitation and you can also
learn by replaying your past experiences even if they were sub-optimal so when
you go to sleep at night and you dream of playing tennis even if you are replaying
things that were not done perfectly because you've learned throughout the day you can
still use that experience to improve your quality function. and so Q-learning often is faster at learning, because you can explore more, so you get better optimal solutions and faster convergence, and oftentimes you get lower-variance solutions; i'm sorry, i think this should actually say higher variance, i think this is a typo on the slide, you'll have to check the book chapter. okay, good. so when you do Q-learning,
there are lots of strategies for how you introduce randomness into the policy, that is, into what action you take. so there's something called epsilon-greedy, which basically means that one minus epsilon of the time, so let's say epsilon is 0.05, then 95% of the time you would do your optimal policy, and 5% of the time you would do something random, you would do a random action, to get more exploratory actions, to explore your space more. so that's an epsilon-greedy
algorithm and in these epsilon greedy algorithms often you start with a big epsilon so you start by
only doing random stuff and then eventually that epsilon cools down to zero in a kind of annealing
or simulated annealing strategy so as you get more experience as you learn better you eventually
get to go more on-policy as that epsilon decays to zero. so that's a really important one, called epsilon-greedy (a minimal sketch of it is below), but there's tons of different ways of introducing randomness and exploration into your off-policy search strategies.
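as a sketch, epsilon-greedy action selection with an annealing schedule might look something like this in Python; the function name, the list-valued actions argument, and the exponential decay schedule are illustrative assumptions, not the lecture's specific choices.

```python
import random

def epsilon_greedy_action(Q, s, actions, epsilon):
    """With probability epsilon take a random action; otherwise act greedily on Q."""
    if random.random() < epsilon:
        return random.choice(actions)                 # explore
    return max(actions, key=lambda a: Q[(s, a)])      # exploit: argmax_a Q(s, a)

# simple annealing schedule: start mostly random, decay toward (near-)greedy
epsilon, decay, epsilon_min = 1.0, 0.995, 0.01
for episode in range(1000):
    # ... run one episode, selecting actions with epsilon_greedy_action(...) ...
    epsilon = max(epsilon_min, epsilon * decay)
```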
okay, good. and in contrast, SARSA has some benefits also: the benefit of SARSA is that it is often safer during training, because it is always going to do what you think is the best thing to do; it's not going to randomly
explore. so if you're teaching a teenager how to drive, you probably want to be doing something more like SARSA than Q-learning, because you don't want to flip a coin every so often and decide on some random action that might take you off the road or cause a safety issue. so SARSA, because it's on-policy, means you're always doing what you think is the best thing, even though it's more conservative; so it will learn less quickly, and probably less optimally, but it is safer while it's learning. and similarly, SARSA, because it's always on-policy, will typically have more cumulative reward during the learning process. so if i care about how much reward i get during my learning process, then SARSA will actually give me more cumulative reward, because it's always on-policy; it's not making these occasional bad or sub-optimal moves just to gain experience. and so it's an interesting trade-off, right?
like this actually has a lot of parallels with human life as you might expect and in
biological learning sometimes you want to be more conservative sometimes you want to maximize
your rewards throughout the learning process sometimes you put a premium on being able
to take chances and introduce some randomness and learn from experiences that are not what you think is the optimal experience; you know, go backpacking in Europe, right, that's a Q-learning type of action. okay, good. so what i hope i've convinced you of is that when you don't have a model for your MDP, for your Markov decision process,
there are things that are inspired by dynamic programming and Bellman's equation; you can think of these almost as approximate dynamic programming based on trajectories of real-world experience: i actually take actions in the real world through trial and error, and i start to accumulate information that can give me a better view of what the quality of a given state-action pair is. and there are off-policy and on-policy variations of this that have different strengths and weaknesses. okay, good. so that's what we talked about here, this
gradient-free, model-free reinforcement learning. Q-learning is used a lot for deep reinforcement learning, and so i'll definitely talk about that in a future lecture. all right, stay tuned, there's more to come. thank you.