Today I'll be discussing an exciting area of research that has been instrumental to many recent successes in reinforcement learning, typically referred to as deep reinforcement learning. This is the combination of reinforcement learning algorithms, such as the ones we have discussed so far, with deep neural networks as function approximators. The motivation for function approximation and the core ideas behind it have already been introduced, so you will remember how we discussed that tabular RL cannot possibly scale to large, complex problems.
The reason is that if we want to estimate, say, the value of each state separately, this has a memory cost that naively scales linearly with the size of the state space, which by itself would make it impractical. But even if memory were not a limit, there would still be a fundamental problem: it would simply be very slow to learn the values of all states separately. If we need to visit every single state, potentially multiple times, to make even a reasonable guess of its value, we are definitely in trouble. So our response, as discussed in previous lectures, is function approximation, which is our key tool to generalize what we learn about one state to all other states that are close, according to a reasonable definition of close.
We have already introduced function approximation, but the purpose of this chapter is to discuss the use of deep neural networks specifically for the purpose of function approximation, and this is what is typically referred to as deep reinforcement learning. I will soon delve into some of the practical challenges that arise in this setting and some exciting research in the area, but before we go there I want to use this introduction to recap a few core ideas of function approximation in general, and also to discuss the role of automatic differentiation in making it easy for any one of you to freely experiment with deep reinforcement learning ideas. When using function approximation to estimate values in previous lectures, we typically proposed a simple scheme where we would have some fixed mapping that transforms any state into a feature representation phi, and then a parametric function that is linear and maps features to values. The problem of reinforcement learning then becomes fitting these parameters theta, so that, for instance, the value function v_theta makes predictions that are as close as possible to the true values v_pi for whatever policy pi we wish to evaluate. And we saw that we can turn this into concrete algorithms.
The first step is to formalize the goal of minimizing the difference between v_theta and v_pi. For instance, we could use a loss function like the expected squared error over states. Typically this would be weighted by the state visitation distribution under the policy pi itself, in order to allocate capacity sensibly. Then, given a loss function, we can use gradient descent to optimize it.
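For concreteness, with a linear value function the pieces fit together roughly as follows (here v_pi is the true value, d^pi the state visitation distribution under pi, and alpha a step size):

$$
v_\theta(s) = \theta^\top \phi(s), \qquad
L(\theta) = \tfrac{1}{2}\, \mathbb{E}_{s \sim d^\pi}\!\big[ (v_\pi(s) - v_\theta(s))^2 \big], \qquad
\theta \leftarrow \theta - \alpha \nabla_\theta L(\theta).
$$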
This sounds simple and easy, but of course the devil is in the details, and actually carrying out this process for reinforcement learning introduces quite a few subtle challenges. First of all, computing the expectation over all states is too expensive. But this is maybe the least of the problems. The deeper problem is that the very target v_pi that we want to learn to predict accurately is actually unknown.
The solution to both problems is to sample the gradient descent update, by considering just one or a few states in each update, and to use sampled estimates of v_pi as targets. To do so we can reuse the ideas that we discussed for model-free algorithms. For instance, we could do deep Monte Carlo prediction by using the episodic return as the target in the gradient update.
Or we could implement a deep TD prediction algorithm by bootstrapping on our own value estimates to construct the one-step target, with the bootstrap itself being parameterized by the very same parameters theta that we wish to update.
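As a rough sketch, the two sampled updates look like this (with step size alpha, discount gamma, and G_t denoting the Monte Carlo return):

$$
\theta \leftarrow \theta + \alpha \big( G_t - v_\theta(S_t) \big) \nabla_\theta v_\theta(S_t) \qquad \text{(Monte Carlo)},
$$
$$
\theta \leftarrow \theta + \alpha \big( R_{t+1} + \gamma v_\theta(S_{t+1}) - v_\theta(S_t) \big) \nabla_\theta v_\theta(S_t) \qquad \text{(TD)}.
$$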
This is all good in principle, and we already saw that if some mapping phi is given, it is possible to train a linear function on top of it to make reasonable value predictions. But today we want to consider a more complicated setting: one where maybe the feature mapping is too naive and not informative enough to support reasonable value predictions with just a linear mapping on top. In this case we might want to use a more complicated nonlinear transformation from states to values, for instance a deep neural network as function approximator.
You may wonder why we would choose a neural network, and this is definitely a good question. It's worth stressing that neural networks are by no means the only possible choice of a more complex function approximator. But they do have some advantages.
The first is that this class of parametric functions is well understood and known to be able to discover quite effective feature representations, tailored to whatever task you apply it to. In our case that is reinforcement learning, but neural networks have also been used for processing language or vision. Importantly, the feature representation learned by a neural network is optimized end-to-end by the same gradient-based process that, in linear function approximation, is reserved just for fitting the linear mapping.
So we have a unified way of training the entire parametric model, both to represent the state in a way that is expressive and to make a reasonably good value prediction from that representation. The second reason to consider deep neural networks is that, given the extensive adoption of deep learning in machine learning, using neural networks allows us to leverage lots of great research. All the ideas that have been introduced in supervised learning, for instance for network architectures or optimization, we can benefit from when we use neural networks for function approximation in reinforcement learning. What does parameterizing a value function with a neural network actually look like in practice, though? In the simplest case we could consider what is called a multi-layer perceptron (MLP).
This is a model that takes a very basic encoding of a state; in a robot, for instance, this might be the raw sensor readings. The MLP takes these as inputs and computes a hidden representation by applying a linear mapping, W s plus some bias b, followed by a nonlinear transformation such as a tanh or a ReLU. The actual value estimate is then computed as a linear function of this embedding. The important thing is that this embedding is not fixed; it is learned.
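Written out, a one-hidden-layer version might look like this (the specific nonlinearity and layer sizes are just illustrative):

$$
h = \mathrm{ReLU}(W_1 s + b_1), \qquad v_\theta(s) = w_2^\top h + b_2, \qquad \theta = \{W_1, b_1, w_2, b_2\}.
$$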
The parameters theta of our value estimate, which we then train using deep RL, include not just the final linear mapping but also the parameters of the hidden representation, so the entire system is trained end-to-end. This sounds appealing, of course, but what if, to compute the gradient with respect to theta, we need to differentiate through not an MLP but, say, a convnet? Or what if we have a batch norm layer, or we use a transformer? Well, it turns out that there is a way to compute exact derivatives for any of these architectures in a computationally efficient way, which basically allows us to get any gradient we need without having to derive its expression ourselves.
Given the popularity of deep learning in modern machine learning, these methods, typically referred to as autodiff, are available in most scientific computing packages. In this chapter we'll be focusing mostly on the conceptual challenges of combining RL and deep learning, and we'll mostly just assume that we have these kinds of tools to support us in getting gradients through any arbitrary neural network architecture that we may want to use for function approximation.
But I want to give at least a brief intuition about how these tools work, since they are so fundamental to everything that we do and to the practice of deep RL. Specifically, I will give you a very brief introduction and also show you how you can use them to implement a very simple deep RL agent: a Q-learning agent with a neural network as function approximator. The first important concept behind automatic differentiation is that of a computational graph. This is an abstract representation of any computation that you might want to perform, in our case estimating a value, in the form of a directed acyclic graph. For instance, on this slide I show a very simple instance of such a computational graph, where I have two inputs, a and b, and I compute two intermediate quantities, a plus b and b plus 1 respectively.
This is of course a very simple example. I then compute the output by taking the product of these two intermediate quantities. The reason computational graphs are interesting to us is that, if we know how to compute gradients for the individual nodes in a computational graph, we can automatically compute the gradient of any node with respect to any other node in the graph. We can do so by running the computation once forward, from the inputs to all outputs, and then performing a single backward sweep through the graph, accumulating gradients along each path
and summing gradients when paths merge. For instance, in this simple example the gradient of the output e with respect to input a can be computed by taking the product of the derivative of e with respect to its input c and the derivative of c with respect to its input a, which is trivial to compute because you can always decompose your computation so that the individual nodes are really just simple arithmetic operations. If we take a slightly more complicated derivative in the same example, the gradient of e with respect to the input b, then this is the sum of two terms: the derivative of e with respect to c times the derivative of c with respect to b, which is one path from e to b, plus, following the other path, the derivative of e with respect to d times the derivative of d with respect to b. We sum these two terms because the two paths merge.
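Written out for this example, with c = a + b, d = b + 1, and e = c · d, the two derivatives are:

$$
\frac{\partial e}{\partial a} = \frac{\partial e}{\partial c}\,\frac{\partial c}{\partial a} = d \cdot 1 = d,
\qquad
\frac{\partial e}{\partial b} = \frac{\partial e}{\partial c}\,\frac{\partial c}{\partial b} + \frac{\partial e}{\partial d}\,\frac{\partial d}{\partial b} = d + c.
$$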
This might even seem slightly magical at first, but if you actually write it down it's just the standard chain rule that you have all seen in a calculus course. The advantage of doing this in the form of a computational graph is that it lets us implement the chain rule, and thereby the gradient computation, in an efficient way for any arbitrary numerical function: we just decompose the computation, which in our case will typically be a program, into a sequence of basic operations, additions, multiplications and so on, and then treat each of these basic operations as a node in the graph that we can differentiate using the graph algorithm I just described. And I want to stress that the entire process is not just computationally efficient, on the order of the cost of the forward computation; it is also exact, so this is not a numerical approximation like the one you would get by computing gradients with finite differences.
This is an actual way of evaluating the true gradient of an arbitrary numerical function in a fully automatic way. It's really really exciting if you think about it. AutoDiff is natively implemented in many modern machine learning frameworks, including JAX, which is the framework of choice for this course's assignment.
Some of these frameworks, like TensorFlow, require you to explicitly define your computation in terms of a computational graph, so that they can then do things like autodiff very naturally on top of this graph. Others, like JAX, allow you to write your computation in a standard imperative way and then implicitly recover the graph by tracing your computation.
In JAX, autodiff is implemented based on this tracing mechanism, and it is exposed to users via the jax.grad program transformation. This is a utility that takes a Python function, which you can write in standard NumPy-style code, and returns another Python function that computes the gradient of the original function and evaluates it at any given input, instead of just computing a forward pass.
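As a minimal sketch (not the exact code from the slides), this is what jax.grad looks like on the small computational graph from before:

```python
import jax

# The computational graph from before: c = a + b, d = b + 1, e = c * d.
def e(a, b):
    c = a + b
    d = b + 1.0
    return c * d

# jax.grad returns new Python functions that evaluate de/da and de/db.
de_da = jax.grad(e, argnums=0)
de_db = jax.grad(e, argnums=1)

print(de_da(2.0, 1.0))  # d = b + 1 = 2.0
print(de_db(2.0, 1.0))  # c + d = (a + b) + (b + 1) = 5.0
```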
To conclude the introduction, I want to show a very simple example of implementing a basic Q-learning agent that uses a neural network as function approximator, using these autodiff tools from JAX. So what does such a deep Q-learning agent look like? First we need to choose how to approximate Q-values. For this we will use a single neural network that takes the state as input and outputs a vector with one element for each available action. For instance, this network could be an MLP as before. Note that we have implicitly made a design choice here:
the network takes a single state and outputs all action values, but this is not a strict requirement. We could also pass both state and action as inputs, and the network would then return just the Q-value for that action. In general, though, it tends to be more computationally efficient if we can compute all Q-values in a single pass, so this is a fairly common choice in practice. In JAX we can define such a neural network very easily using the Haiku library. With this library we just need to define the forward pass of the network, for instance, in the case of an MLP, a sequence of a linear layer, a ReLU, and another linear layer.
Then we can use haiku.transform to extract two Python functions: one, network.init, that initializes the parameters of the network, and a second, network.apply, that computes the forward pass, taking the parameters and the state as inputs.
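A minimal sketch of this, with illustrative layer sizes and names (not the exact code from the slides), might look like:

```python
import haiku as hk
import jax
import jax.numpy as jnp

num_actions = 4  # assumed number of discrete actions

# Forward pass of the Q-network: an MLP mapping a state to one value per action.
def q_network(state):
    mlp = hk.Sequential([
        hk.Linear(50), jax.nn.relu,
        hk.Linear(num_actions),
    ])
    return mlp(state)

# hk.transform turns this into a pair of pure functions:
#   network.init(rng, state)     -> initial parameters
#   network.apply(params, state) -> Q-values for all actions
network = hk.without_apply_rng(hk.transform(q_network))

rng = jax.random.PRNGKey(42)
dummy_state = jnp.zeros(8)  # e.g. an 8-dimensional observation
params = network.init(rng, dummy_state)
q_values = network.apply(params, dummy_state)  # shape: (num_actions,)
```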
Once we have a network, though, our job is not done. If we want to implement deep Q-learning, we also need to define a gradient update to the network parameters. With Q-learning, this looks very much like the deep TD update that I showed at the beginning, but we update one specific action value, the Q_theta for a specific S_t and A_t, and we use as a target, as a sample estimate of the return, the immediate reward plus the discounted max Q-value on the next state, where again we bootstrap on our own estimates, which depend on the parameters theta. It's interesting to point out that, while you can write your update exactly in this form, which matches the math very naturally, if you look at implementations of deep Q-learning agents you will often see it written in a slightly different way. For consistency with standard deep learning, instead of directly defining a gradient update, they may define a pseudo-loss function, which is the second equation on this slide: one half times the squared error between the value of the chosen action and the target, which is given as usual by the reward plus the discounted max Q-value. And this is fine. But for the gradient of this loss function to actually recover the correct Q-learning update, there are a couple of important caveats.
First of all, when we compute the gradient of this loss we need to ignore the dependency of the max Q term on the parameters theta. This is denoted by the double vertical bars around max Q. If we did not ignore this, then the gradient of this loss would not recover the first equation; we would have an additional term. Second, it's good to realize that this is not a true loss function; it's just a device that has the property of returning the right update when you take its gradient.
How does this translate into code? This is again fairly simple in JAX, because we can define a pseudo-loss function that takes the parameters theta and a transition, so an observation, an action, a reward, a discount and the subsequent observation, and then first computes the Q-values in both s_{t-1} and s_t, assembles the target by summing the reward and the discounted max Q-value, and finally computes the squared error as one half the square of the difference between the target and the chosen action value. What is critical, to ensure that the gradient of the pseudo-loss implements our deep Q-learning update, is that we must take care to add the stop-gradient on line 25. The stop-gradient implements the idea that the gradient computation should ignore the dependency of the target on the parameters theta.
To actually get the update, we then need to take the gradient. This is done on line 30, where we compute jax.grad of this loss function. This gives us back another Python function, which we then evaluate at theta, obs_{t-1}, a_{t-1}, r_t, d_t, and obs_t.
Finally, to update the parameters, we can use plain stochastic gradient descent, which in this case is equivalent to taking the parameters theta and subtracting a small step size times the gradients. This is of course a very simple agent, but it already shows the full pipeline for defining a deep RL agent: we have defined the network and the gradient updates, and we have applied these updates. The reason it looks so simple and clean is that we have exploited JAX's beautiful autodiff capabilities, which allow us to get an update from a loss function, or a pseudo-loss function in this case, by just calling jax.grad on the relevant numerical function.
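Putting the pieces together, a minimal sketch of this pipeline, reusing the illustrative `network` from the Haiku sketch above (the function and argument names here are my own, not the exact code from the slides), might look like:

```python
import jax
import jax.numpy as jnp

def pseudo_loss(params, obs_tm1, a_tm1, r_t, discount_t, obs_t):
    # Q-values in the previous and the current state.
    q_tm1 = network.apply(params, obs_tm1)
    q_t = network.apply(params, obs_t)
    # Bootstrap target; stop_gradient ensures we do not differentiate through
    # the max Q term, so the gradient recovers the Q-learning update.
    target = jax.lax.stop_gradient(r_t + discount_t * jnp.max(q_t))
    td_error = target - q_tm1[a_tm1]
    return 0.5 * td_error ** 2

@jax.jit
def sgd_step(params, obs_tm1, a_tm1, r_t, discount_t, obs_t, step_size=1e-3):
    grads = jax.grad(pseudo_loss)(params, obs_tm1, a_tm1, r_t, discount_t, obs_t)
    # Vanilla SGD: subtract a small step size times the gradient, parameter by parameter.
    return jax.tree_util.tree_map(lambda p, g: p - step_size * g, params, grads)
```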
In the next section we'll delve into the challenges of deep RL: how we can make the reinforcement learning part aware of our specific choice of function approximation, and, vice versa, how we can make our deep networks more suitable for RL by understanding the unique properties of updating the parameters with a reinforcement learning objective. In this section I want to give you some insight into what happens when ideas from reinforcement learning are combined with deep learning, both in terms of how known RL issues manifest when using deep learning for function approximation, and in terms of how we can control these issues by keeping in mind the effect that our reinforcement learning choices have on function approximation. Let's start with the simple online deep Q-learning algorithm from the previous section.
What are the potential issues with this algorithm? Well, to start, we know from the deep learning literature that stochastic gradient descent assumes gradients are sampled i.i.d., and this is definitely not the case if we update parameters using gradients computed on consecutive transitions in an MDP, because what you observe on one step is strongly correlated with what you observed, and the decisions you made, in the previous steps. On the other side, we also know that in deep learning using mini-batches, instead of single samples, is typically better in terms of finding a good bias-variance trade-off. And again this doesn't quite fit the online deep Q-learning algorithm that I described in the previous section, because there we perform an update on every new step: every time we are in a state, execute an action, and get a reward, a discount, and another observation, we compute an update to our parameters theta.
So how can we make RL more deep learning friendly? If we look back to previous lectures, it's quite clear that certain algorithms may better reflect deep learning's assumptions than others. So it's good to keep deep learning in mind when choosing what to do on the RL side.
For instance, during the planning lecture we discussed Dyna-Q and experience replay, where we mixed online updates with updates computed on data sampled either from a learned model of the environment, in the case of Dyna-Q, or from a buffer of past experience, in the case of experience replay. Both can address very directly the two concerns we just highlighted for the vanilla deep Q-learning agent I described before, because by sampling state-action pairs or entire transitions from a memory buffer, we effectively reduce the correlations between consecutive updates to the parameters.
We also get support for mini-batches basically for free: instead of doing a loop where we apply n planning updates, we can batch them together and do mini-batch gradient descent instead of vanilla SGD. Similarly, if we know we are using deep learning for function approximation, there are many things we can do to help learning be stable and effective. We could use, for instance, alternative online RL algorithms, such as eligibility traces, that naturally integrate into each update to the parameters theta information that comes from multiple steps and multiple interactions with the environment,
without requiring explicit planning or explicit replay, as in Dyna-Q or experience replay. Alternatively, we could consider certain optimizers from the deep learning literature that can alleviate the issues that come with online deep reinforcement learning by integrating information across multiple updates in a different way, for instance by using momentum or the Adam update. And finally, in some cases, if we keep in mind the properties of deep learning, we might even be able or willing to change the problem itself to make it more amenable.
For instance, we could have multiple copies of the agent interacting with parallel environments, and then mix the diverse data coming from these multiple instances of the environment into each single update to the parameters. Let's now delve even deeper. We said that if we use Dyna-Q or experience replay, we can address certain issues and better fit certain assumptions of deep learning. At the same time, if you think about what happens when we use algorithms like Dyna-Q or DQN in combination with function approximation, then we are combining function approximation, of course, but also bootstrapping, because we are using, for instance, Q-learning as the model-free algorithm, and off-policy learning, because by replaying past data we are effectively updating on data sampled from a mixture of past policies rather than just from the latest one.
And you might remember that we called the combination of exactly these three things the deadly triad, which doesn't sound good, right? The reason is that we know from theory that when you combine these three things there is a possibility of divergence. At the same time, if you read the deep RL literature, you will find that many successful agents do combine these three ingredients. So there is maybe a tension here. How is this possible?
Well, a partial resolution is that the deadly triad says divergence is possible when you combine these elements, not that it is certain, and not even that it is likely. So if we understand and keep in mind the properties that underlie both RL and deep learning algorithms, there is actually much we can do to ensure the stability and reliability of deep RL agents, even when they combine all the elements of the deadly triad. In the following slides, I want to
help you develop an understanding and an intuition of how and when the deadly triad manifests when combining reinforcement learning with neural networks for function approximation. As with the two initial issues we discussed, batching and correlations, understanding and keeping in mind the properties underlying RL and deep learning will already go a long way towards designing deep RL algorithms that are quite robust. Let's start with a large empirical study that I performed with Hado, where we looked at the emergence of divergence due to the deadly triad across a large number of variants of deep Q-learning agents and across many domains. What we found is that, empirically, unbounded divergence is actually very rare: parameters don't tend to actually go to infinity, even if you are combining all the elements of the deadly triad.
What is more common is a different phenomenon that we called soft divergence. This is shown on this slide, where we plot the distribution of value estimates across many agents and many environments, for different networks on the left and the right respectively. What you see is that the values explode initially: they grow orders of magnitude larger than is reasonable to expect in any of the environments we are considering, since the maximum true values were at most 100, while here we are seeing values on the order of tens of thousands, hundreds of thousands, or even millions. But these values that initially diverge do not go to infinity; over time the estimates actually recover and come back down to reasonable values.
You may wonder: if soft divergence mostly resolves itself when doing deep RL in practice, is it even a concern? Should we even be discussing how to minimize this initial instability? I think the answer is yes, it's worth discussing these things.
Because even if the values don't diverge fully to infinity, having many hundreds of thousands or millions of interactions with the environment during which the values are wildly incorrect does affect the behavior of the agent and the speed of convergence to the correct solution. So let's discuss what we can do about it.
How do different reinforcement learning ideas help us ensure that the learning dynamics, when using deep networks for function approximation, are stable and effective? The first approach I want to tell you about was introduced by the DQN algorithm and is known as target networks. The idea here is to hold fixed the parameters that are used to compute the bootstrap targets.
In Q-learning, for instance, these would be the parameters used to estimate the max Q-value on the next step. We then only update these parameters periodically, maybe every few hundred or even thousand updates. By doing so we interfere with the feedback loop that is at the heart of the deadly triad. If you remember, the core issue with the deadly triad is that when you update the value of a state you may also inadvertently update the value of the next state, the one you are going to bootstrap on, and this can create certain feedback loops. But if the parameters you use to bootstrap are frozen, at least for a while, then this feedback loop is broken.
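As a sketch, the pseudo-loss from before could be modified along these lines (reusing the illustrative `network`, jax and jnp from the earlier sketches):

```python
def pseudo_loss_with_target(params, target_params,
                            obs_tm1, a_tm1, r_t, discount_t, obs_t):
    q_tm1 = network.apply(params, obs_tm1)
    # Bootstrap from a separate, periodically refreshed copy of the parameters.
    q_t = network.apply(target_params, obs_t)
    target = jax.lax.stop_gradient(r_t + discount_t * jnp.max(q_t))
    return 0.5 * (target - q_tm1[a_tm1]) ** 2

# Every few hundred or thousand updates, refresh the frozen copy:
# target_params = params
```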
Target networks are not the only approach, though. We know, for instance, that Q-learning by itself, even in the tabular setting, has an overestimation bias, which can result in unreasonably high value estimates, at least in an initial phase. So it's possible that this could interact with the deadly triad and increase the likelihood of observing these explosions in the value estimates.
If this were the case, then we could use, for instance, double Q-learning to reduce the overestimation of the update, and this would also help make the algorithm more stable with respect to the deadly triad. In double Q-learning, you may remember, we use separate networks to choose which action to bootstrap on and to evaluate that action. Neatly, this combines really well with the target network idea, because we can use the frozen parameters, the ones of the target network, as one of the two networks. In this plot you can see and compare the effects of using only target networks, using only the double estimator, which is dubbed here inverse double Q, and doing both.
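A sketch of the resulting target, using the online network to select the bootstrap action and the target network to evaluate it (again reusing the illustrative `network` from before):

```python
def double_q_target(params, target_params, r_t, discount_t, obs_t):
    a_star = jnp.argmax(network.apply(params, obs_t))      # selection: online network
    q_eval = network.apply(target_params, obs_t)[a_star]   # evaluation: target network
    return jax.lax.stop_gradient(r_t + discount_t * q_eval)
```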
So double Q-learning as typically used combines the separation of the evaluation from the selection of the bootstrap action with a target network for the target estimation. What we see is that both these ideas have a strong stabilizing effect. Just using a target network, for instance, dropped the likelihood of observing soft divergence, across the many agent instances and environments, from 61% of the experiments to just 14%.
And if we consider full double Q-learning, which combines the idea of decoupling action selection and evaluation with the target network idea, then the likelihood of observing soft divergence dropped to only 10%. In general, I personally find it quite insightful to see how different choices on the reinforcement learning side interact with the deadly triad, which is triggered by the combination of bootstrapping and off-policy learning with function approximation. And it's quite interesting to see that this is not just about, for instance, target networks or double Q-learning.
Across all the design choices we make in our agents, each choice can interact with function approximation and the learning dynamics of our deep networks. Consider, for instance, a different aspect of our agent. If we are using Dyna or experience replay, we will be sampling either state-action pairs or entire transitions. One choice is to just sample uniformly. This is the simplest, but also the safest choice, as we'll see.
But it's also somewhat unsatisfying, because not all transitions that we have seen in the past are equally useful or important. In some states we might already be able to make very accurate predictions, while in others there might be a lot we can still learn, maybe because they are states that we haven't seen many times. Prioritized replay was introduced to allow us to make more efficient use of resources by sampling these more interesting transitions more often. There are many ways you can prioritize, so I don't want to go into too much detail here, but for instance you could sample transitions with a higher TD error more often. Regardless of the details, what I want you to think about for the purpose of this lecture is how the choice of prioritization affects the emergence of the deadly triad, for instance in terms of the likelihood of observing soft divergence.
And if you think about it carefully, by prioritizing you are increasing the amount of off-policyness. Indeed, what we see in our empirical study is that the stronger the prioritization, the more likely it is that we observe soft divergence. This by itself does not mean that prioritization is a bad idea, just that we always need to strike a balance between the benefits of seeing the most useful transitions more often and the potential risk of reduced value accuracy, at least in the short term,
due to the emergence of the deadly triad. By being aware of this subtle interaction, we can also modify the RL algorithm to reduce the adverse effects on the learning dynamics of our function approximator. For instance, the orange line in this plot shows the likelihood of soft divergence if we at least partially correct the off-policyness via importance sampling.
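For reference, in prioritized experience replay the sampling probabilities and importance-sampling corrections typically take roughly this form (as in the original prioritized replay formulation, with priorities based for instance on the absolute TD error):

$$
P(i) = \frac{p_i^{\alpha}}{\sum_k p_k^{\alpha}}, \qquad p_i = |\delta_i| + \epsilon, \qquad
w_i = \left( \frac{1}{N \cdot P(i)} \right)^{\beta},
$$

where alpha controls the strength of the prioritization and beta controls how much of the resulting off-policyness is corrected by the importance weights w_i.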
As you can see, we can then push prioritization much further along the x-axis while still keeping the risk of soft divergence reasonably low. But there are other things that are important to consider, if you are thinking about the deadly triad, to ensure that our deep RL agents are stable. At its core, the deadly triad is an issue of inappropriate generalization.
We have already said this many times: the reason we observe the deadly triad is that the value of a state and the value of the subsequent state we bootstrap on are tied via generalization. But we don't need to bootstrap immediately when we compute a deep RL update. For instance, instead of accumulating one reward and then bootstrapping on the max Q-value on the next step, which is very close to the source state and therefore more affected by generalization, we could accumulate many rewards and only then bootstrap on the max Q-value.
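As a sketch, an n-step Q-learning target (here ignoring any off-policy corrections within the n steps) would look like:

$$
G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^{n} \max_a Q_\theta(S_{t+n}, a).
$$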
Potentially we could even combine this, for instance, with a double estimator for extra stability. In the deep RL setting this results in a multi-step double Q update, shown in this slide, and this goes a long way to help us: it is still a well-defined target, it still gives policy improvement, but it is much less susceptible to the triad, because we reduce the reliance on bootstrapping and we reduce the amount of inappropriate generalization we get on the bootstrap target. In this plot we show again, in our usual empirical study across many agents and many environments, the likelihood of soft divergence as we increase the number of steps before bootstrapping from 1 to 3 to 10. What you can see is that the likelihood of soft divergence drops radically, from more than 80 percent to just 9 percent.
It might seem a bit scary at first, the way basically every design choice in our RL updates subtly interacts with the learning dynamics of our deep function approximator. And while in this lecture I focused on the deadly triad, this is by no means the only case. But hopefully you can also see this in a positive way, because this is what makes deep RL quite an exciting area of research: it really requires you to think holistically about the whole process of learning and decision making and how all the components fit together.
But if you have a good understanding of the core ideas behind both RL and deep learning, which I hope you will have by the end of this course, you can do this: you can reason about how these components fit together, and you have a chance to build really powerful learning systems that push the limits of what is currently possible with AI. In this section I want to take a
rather different perspective from the previous section. Instead of focusing on being aware of our function approximation choices when designing our RL algorithms, I want to discuss the deep learning side. Can we design our neural network architectures to be especially effective for reinforcement learning? Can we understand what it means to optimize a neural network with targets constructed via bootstrapping?
What does it mean to increase, for instance, the capacity of the network if we are training on a reinforcement learning problem? This of course is a huge area of research, so I won't be able to cover all ideas in the space of this lecture, but hopefully I can give you at least some intuitions and some ideas. If you think about the recent history of deep learning, much of its success comes from being able to encode certain inductive biases in the network architecture, so that networks can best support learning in certain broad categories of problems without limiting their ability to build representations tailored to each specific task, purely from end-to-end gradient descent.
For instance, convolutional nets derive much of their power from their ability to learn translation-invariant detectors, which makes learning computer vision tasks much easier. Similarly, LSTMs support long-term memory through a specific architecture that uses gating, learned by gradient descent, to preserve information over long horizons, and this was instrumental to many early successes in natural language processing. Since RL is radically different from vision, NLP, and the most common supervised learning tasks that deep learning is applied to, I think it would be surprising if just copying network architectures designed for supervised learning gave us optimal results.
Instead, I think we should consider what the right architectures for reinforcement learning are, and which inductive biases we should incorporate to make, for instance, value estimation as easy as possible. This was the motivation of the dueling networks paper from a few years ago. It introduced a network architecture that could improve the performance of deep Q-learning agents quite significantly, basically out of the box, without requiring any change on the reinforcement learning side. As with convolutional nets and LSTMs, the idea is actually remarkably simple. Normally, deep Q-learning agents use a network architecture like the one at the top of this slide.
They take an observation as input, for instance the pixels on the screen if the agent is learning to master a video game, process this input via a number of hidden layers, typically conv layers if the observation is visual, and then apply some fully connected layers to output all action values as a vector with as many elements as there are actions. In general, though, we know that we can decompose action values as the sum of a state value estimate, which looks at the long term and depends only on the state, not the action, plus an advantage term that depends also on which action we are going to take right now. This suggests, somewhat naively, a different architecture: as before we take an observation as input and process it with some conv layers if it's an image, but then the network generates the Q-values by summing a scalar and a vector output, each with its own separate stream of processing, therefore forcing the network to represent action values as the sum of an action-independent and an action-dependent term. If you use something that looks like this, you can train the resulting Q-values using standard deep Q-learning.
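A minimal sketch of such a dueling head in Haiku, reusing the illustrative names from the earlier sketches (the mean-advantage subtraction is the identifiability trick commonly used with this architecture, not something spelled out on the slide):

```python
def dueling_q_network(state):
    # Shared torso, then separate value (scalar) and advantage (vector) streams.
    h = hk.Sequential([hk.Linear(50), jax.nn.relu])(state)
    value = hk.Linear(1)(h)                # V(s): action-independent, long-term
    advantage = hk.Linear(num_actions)(h)  # A(s, a): action-dependent
    # Combine into Q-values; subtracting the mean advantage keeps the split identifiable.
    return value + advantage - jnp.mean(advantage)
```

This head can be passed to hk.transform and trained exactly like the plain Q-network from before.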
You basically don't need to change anything else. But will this help? Well, let's consider a concrete problem where we have trained this dueling Q-network to estimate Q-values for an agent that must control a car driving on a highway.
There are some turns, and there are other cars that you must avoid crashing into. In the images on these slides you can see what the scalar and vector components of the dueling network are mostly attending to. The plots are generated by highlighting in red the pixels that most affect the two outputs; this is what is called a saliency map, with, on the left, what the scalar term is mostly attending to and, on the right, what the vector, action-dependent term is attending to. What you can see is that the scalar term on the left learns to attend mostly to the direction of the road far ahead of the current location of the car, as this is what matters for the long-term value of the current state.
Instead, the vector term on the right learns to attend mostly to what is straight in front of the car, as this is what is important for estimating the immediate advantage of each action. This I think is very interesting by itself, but it also results in quite significantly improved learning, with agents learning both quicker and more stably. If we're thinking about RL-aware deep learning, though, the topology of the network is not the only important factor. In supervised learning, for instance, we often find that more data plus more capacity, not just better architectures, equals better performance, for instance because the loss might be easier to optimize.
But how does network capacity affect, for instance, value function estimation in reinforcement learning? In my experience, larger networks typically do tend to perform better in reinforcement learning as well. But it's not all roses. If we use larger networks we will, for instance, be more susceptible to the deadly triad, especially if we use vanilla Q-learning.
In the large empirical study that I already discussed extensively in the previous section, we also looked at different network capacities, and what we found is that with Q-learning the likelihood of soft divergence, where at least initially the values grow to unreasonable estimates, actually increased as we made the network larger and larger. There are many possible reasons for this. One leading candidate is that larger networks tend to have a smoother landscape.
At least initially, this means that they might suffer more from inappropriate generalization. Of course, as always in reinforcement learning and deep reinforcement learning, these are trade-offs. It doesn't mean that we should not use large neural networks, because those do tend to result in better final performance at convergence. But we should be aware that increasing the capacity of our network might result in some instability early on if we don't take care to use double Q estimators, multi-step targets, or, more generally, the function-approximation-friendly updates that we discussed in the previous section.
Another issue worth being aware of, which also relates to smoothness, is that many deep learning architectures have a bit of a problem approximating sharp discontinuities. This is not much of an issue in supervised learning, but it can be quite tricky when we're doing RL, and I'll try to give you a bit of insight into why this smoothness property can be problematic. In the plot on this slide we have a grid world where an agent can move from any cell to any of the neighboring cells, but the grid world is split into two parts, separated by a wall.
If an agent starts in the top half, it will not be able to reach the bottom portion of the grid world, and vice versa. Critically, though, a rewarding location is present only in the bottom half, not in the top half of the grid, meaning that the values in the upper half are exactly zero, while they are non-zero immediately below the wall. If we attempt to learn state values for this problem, for instance under a fixed random policy, we could use Monte Carlo updates on the parameters of the network to do so.
If you do this, what you will see is basically the error heatmap on the left. The predictions are mostly accurate everywhere, with some noise in the bottom half due to the fact that Monte Carlo updates have high variance, and some error in the upper half, in a very tight band just above the wall. This is because a network with a canonical architecture, such as those we use in supervised learning, has trouble approximating the sharp discontinuity in values, which go from non-zero to exactly zero as soon as you cross the wall.
Overall this might seem reasonable: sure, smoothness introduced some errors, but maybe that's expected, and it's not too concerning. But now consider what happens in the error map on the right. This corresponds to training the same value network but using TD instead of Monte Carlo, and the result is quite fascinating, I would say. In the lower half the values are still very accurate, actually even more so, because the TD updates have lower variance, so you don't get the noise you were seeing in the bottom half with the Monte Carlo prediction. But in the upper half, the errors are no longer confined to a small band close to the wall; instead, the difficulty in fitting the discontinuity just above the wall leaks into the value predictions across the entire upper half. This might be surprising at first, but the reason is that states throughout the upper half bootstrap on the incorrectly valued states just above the wall.
This means that the error propagates, a phenomenon that has been called leakage propagation, and it is an example of how certain smoothness properties of a function approximator that might be acceptable in supervised learning can be quite problematic in RL. This is maybe a further motivation to double down on reinforcement-learning-aware deep architectures, to make sure that we can support the kinds of problems and function classes that we care about in RL. And there is much more that could be said about architectural choices that could alleviate these issues.
But for now I want to conclude by highlighting what is behind many of the things I discussed in this section, and this is, again, as we have seen in the past, inappropriate generalization. Smoothness leads to inappropriate generalization, and this leads to leakage propagation and to the deadly triad, for instance. So while RL-aware deep learning architectures will certainly help, and I think this is a very important, though comparatively underexplored, topic in deep RL research, the issue of learning good representations for RL will likely not be entirely solved by architecture alone. To help, I think we will also need to change the way we train the representations in the network, because with better representations we will have less inappropriate generalization.
One leading idea in this space, which I just want to mention now but will be the topic of several lectures in the remaining part of the course, is that maybe we should learn state representations that are not there just for the singular purpose of predicting a single value function for a single reward; instead we should maybe share representations across many tasks. For instance, we might predict values for different policies, or for the same policy but under different discounts, or for the same policy under the same discount but for a different cumulant than the main task reward.
If we can find good ways of doing this, it might alleviate many of the issues we discussed throughout this deep RL chapter, both in terms of the deadly triad and of leakage propagation, and so on. So this is, I think, a very interesting area of research in deep RL that we are going to touch on quite extensively.