Hello everyone, this is Alice Gao. In the previous videos, I talked about the high-level ideas of the policy iteration algorithm. To make it more concrete, let me now run through an entire example of executing the policy iteration algorithm. Obviously, we can't do this with a very large example, because I have to calculate everything by hand, so I constructed a tiny example that we can work with.
So here is a tiny grid world, a two-by-two world, where we have two goal states with rewards plus one and minus one. So there are really only two states that we're concerned with: S11 and S21. Those are the two states for which we want to determine the policy and also figure out the utility values. The setting is pretty much the same as in the more complicated example that I talked about before.
So the reward of entering a non-goal state is minus 0.04, same as before. And we'll also keep the transition probabilities the same. So whenever we are trying to go in a direction, there is an 80% chance that we will end up going in the intended direction, a 10% chance we'll go to the left of the intended direction, and a 10% chance we'll go to the right of the intended direction. And same as before, if I ever hit a wall, then I'm going to remain in the same location. So let's execute the policy iteration algorithm on this tiny grid world to find out what the utility values for these two states are and also what the optimal policy should be.
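To make the setup concrete, here is a minimal sketch of this tiny world in Python. The layout and the names (S11, S21, the successors helper, and so on) are my own reconstruction of what's on the slide: S11 in the top-left, the plus one goal to its right, S21 below S11, and the minus one goal in the bottom-right, with walls around the outside.

```python
# A sketch of the 2x2 grid world, assuming S11 is the top-left square,
# the +1 goal is top-right, S21 is bottom-left, and the -1 goal is bottom-right.
REWARD = {"S11": -0.04, "S21": -0.04, "+1": 1.0, "-1": -1.0}

# Where the agent lands if a move in each compass direction actually happens.
# Moves that would run into a wall leave the agent where it started.
MOVE = {
    "S11": {"up": "S11", "down": "S21", "left": "S11", "right": "+1"},
    "S21": {"up": "S11", "down": "S21", "left": "S21", "right": "-1"},
}

# The two perpendicular "slip" directions for each intended direction.
SLIPS = {
    "up": ("left", "right"),
    "down": ("left", "right"),
    "left": ("up", "down"),
    "right": ("up", "down"),
}

def successors(state, action):
    """Return [(probability, next_state)]: 80% intended, 10% each perpendicular slip."""
    slip_a, slip_b = SLIPS[action]
    return [
        (0.8, MOVE[state][action]),
        (0.1, MOVE[state][slip_a]),
        (0.1, MOVE[state][slip_b]),
    ]

# Example: trying to go right from S11.
print(successors("S11", "right"))  # [(0.8, '+1'), (0.1, 'S11'), (0.1, 'S21')]
```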
Now, if you remember, for the policy iteration algorithm we are going to alternate between two steps: one is evaluating the policy, and the other one is improving the policy. Let's start with an initial policy where we're going right in both states. I didn't set this entirely arbitrarily; I chose it so that our example is substantial enough that we have to go through multiple steps, but we don't have to repeat them for too long. So starting from this initial policy, we're going to do policy evaluation, which means that given this policy, we will figure out the corresponding utility values for the two states. Let's take a look.
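Before working through the numbers, here is a sketch of the overall control flow, just to show how the two steps alternate. The helper names evaluate and improve are placeholders (they are not from the lecture); they stand for the policy evaluation and policy improvement steps that the rest of the video works through by hand.

```python
def policy_iteration(policy, evaluate, improve):
    """Alternate policy evaluation and improvement until the policy is stable.

    `policy` maps each state to an action, e.g. {"S11": "right", "S21": "right"}.
    `evaluate(policy)` returns the utility of each state under that fixed policy.
    `improve(values)` returns the greedy policy with respect to those utilities.
    """
    while True:
        values = evaluate(policy)      # policy evaluation step
        new_policy = improve(values)   # policy improvement step
        if new_policy == policy:       # no change: we have found the optimal policy
            return new_policy, values
        policy = new_policy
```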
So I've drawn the picture for our initial policy. And next, we're going to do policy evaluation. So policy evaluation essentially means we are going to take the Bellman equations and write them down for this particular policy, right?
And if you recall, the Bellman equations basically give us a relationship between the utility values for neighboring states, right? For states that are next to each other. Because there are two states we're interested in, we're going to end up writing down two equations, one for the V value of S11 and the other one for the V value of S21. All right, so what do the Bellman equations say?
Well, for each state, it's going to say that we have some immediate reward of entering that state. In this case, either state is a non-goal state, so we'll get minus 0.04. And then we also get some long-term utility, right? The long-term discounted reward of getting into the next state.
So based on our transition probabilities for S11, for example: our current policy for S11 is going right. So with an 80% chance, we'll go to our right and end up in the plus one reward state, S12, so we'll get plus one. With a 10% chance, we're going to go up, which is to the left of the intended direction. There's a wall there, so we're going to come back, and that's why we get V of S11 as the second term. And finally, the third possible direction is going down, to the right of the intended direction, which gets us to S21. So the third term is 0.1 multiplied by V of S21. Notice here that this is different from the original Bellman equations, because the original Bellman equation says we're going to take the maximum over all possible actions.
And there are four possible actions in each location. But here we have a fixed policy already, so we don't need to take the maximum. We can simply follow that fixed policy and see what happens.
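Written out (this is my own reconstruction of what's on the slide, assuming no discounting, that is, a discount factor of 1), the evaluation equation for S11 under the "go right" policy is:

V(S11) = -0.04 + 0.8 * (+1) + 0.1 * V(S11) + 0.1 * V(S21)

The 0.8 term is the plus one goal to our right, the first 0.1 term is bouncing off the wall above us, and the second 0.1 term is slipping down into S21.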
And also notice the equation is linear, right, because of that. If we had the maximum term in the equation, it would not be linear. So the equation for S21 is very similar. Let me just talk through it very quickly.
So we get the immediate reward again, and because our policy is also going right, with an 80% chance we'll get into the minus one reward state. With a 10% chance, we'll go up and get into S11. And with another 10% chance, we'll go down and, because we hit a wall, stay in S21, right? So now we have the two equations. And notice what's happening here.
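Again as my reconstruction of the slide, the matching equation for S21 under "go right" is:

V(S21) = -0.04 + 0.8 * (-1) + 0.1 * V(S11) + 0.1 * V(S21)

Here the 0.8 term is the minus one goal to our right, the first 0.1 term is slipping up into S11, and the second 0.1 term is bouncing off the wall below.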
These two equations have two unknowns, right? V of S11 and V of S21. And they're both linear equations.
So we basically have a linear system of equations, with two equations and two unknowns. And we can solve this exactly, by hand or with standard linear algebra. Here, it's simple enough that we can do it by hand. So the following steps are simply me simplifying the equations and solving them by hand.
So first, I wrote things down and combined some of the terms, which gives you these two equations. Next, I multiplied one of them by a factor and then added them together to eliminate one of the terms. Anyway, you should be able to do this with basic linear algebra. So the final result is that V of S11 is 0.75 and V of S21 is minus 0.85.
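If you'd rather not do the elimination by hand, here's a quick numerical check (a sketch using NumPy; the coefficients come from moving the V terms in the two equations above to the left-hand side):

```python
import numpy as np

# Rearranging the two evaluation equations:
#    0.9 * V(S11) - 0.1 * V(S21) =  0.76
#   -0.1 * V(S11) + 0.9 * V(S21) = -0.84
A = np.array([[0.9, -0.1],
              [-0.1, 0.9]])
b = np.array([0.76, -0.84])

v_s11, v_s21 = np.linalg.solve(A, b)
print(round(v_s11, 2), round(v_s21, 2))  # 0.75 and -0.85, matching the hand calculation
```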
And these two numbers somewhat make sense, right? Because the policy is going right in both squares. So from the top square, going right, we will probably get into the plus one state, which is great.
So that's why V of S11 has a positive value and it's close to one. Whereas for S21, we'll probably get into the minus one state. So that's why the number is negative and close to minus one as well, right? So this is the first policy evaluation step. So starting from the initial policy, which is going right in both squares, we derive the utility values for the two squares.
So next, based on these two utility values, we're going to derive a new policy. This is the policy improvement step. The policy improvement step asks: given the utility values, what is the best policy?
So on this page and the next page, I'm going to solve for the best policy given the current estimates of the utility values. Well, what is the best policy for S11 here? So I consider all possible actions, that is, going right, going up, going left, and going down. And for each of them, I'm going to calculate my expected reward, or utility.
I ignore the immediate reward, because I always get the same immediate reward, so the only difference is really what happens next, right? So I calculated: if I go in the intended direction, how much do I expect to get? If I go to the left of the intended direction, how much do I expect to get? If I go to the right of the intended direction, how much do I expect to get? And finally, I get a number.
So this is my expected long-term reward if my policy is going right. Having done the four calculations, we just have to compare all the numbers and pick the policy that gives us the largest number. In this case, the policy of going right gives us the largest expected reward, which is 0.79.
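Here is the same four-way comparison for S11 as a quick script (again under my assumed layout, where the plus one goal is to the right of S11 and S21 is below it); it reproduces the 0.79 for going right:

```python
# Current utility estimates from the first policy evaluation.
V = {"S11": 0.75, "S21": -0.85, "+1": 1.0, "-1": -1.0}

# Expected next-state utility for each action at S11 (the immediate reward is
# ignored, since it is the same -0.04 for every action).
q_s11 = {
    "right": 0.8 * V["+1"]  + 0.1 * V["S11"] + 0.1 * V["S21"],  # slips: wall above, down to S21
    "up":    0.8 * V["S11"] + 0.1 * V["S11"] + 0.1 * V["+1"],   # wall above; slips: wall left, right to +1
    "left":  0.8 * V["S11"] + 0.1 * V["S11"] + 0.1 * V["S21"],  # wall left; slips: wall above, down to S21
    "down":  0.8 * V["S21"] + 0.1 * V["S11"] + 0.1 * V["+1"],   # slips: wall left, right to +1
}
best = max(q_s11, key=q_s11.get)
print(best, round(q_s11[best], 2))  # right 0.79
```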
So given the current utility values, the optimal policy for S11 is going right. Similarly, we can do this for S21. The calculations are similar, so you can take a look at them in your own time.
But in the end, we found that the best policy is going up. And this makes sense, right? If you look at S21, going right will most likely give us a reward of minus one, whereas going up, we have a positive value above us: S11 has a utility of 0.75. So that looks pretty good, and much better than minus one, so going up is probably a good idea. That's it for our policy improvement step.
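And here is the mirror-image check for S21 (same assumed layout: S11 above, the minus one goal to the right, walls below and to the left); going up comes out clearly ahead:

```python
V = {"S11": 0.75, "S21": -0.85, "+1": 1.0, "-1": -1.0}

# Expected next-state utility for each action at S21.
q_s21 = {
    "right": 0.8 * V["-1"]  + 0.1 * V["S11"] + 0.1 * V["S21"],  # slips: up to S11, wall below
    "up":    0.8 * V["S11"] + 0.1 * V["S21"] + 0.1 * V["-1"],   # slips: wall left, right to -1
    "left":  0.8 * V["S21"] + 0.1 * V["S11"] + 0.1 * V["S21"],  # wall left; slips: up to S11, wall below
    "down":  0.8 * V["S21"] + 0.1 * V["S21"] + 0.1 * V["-1"],   # wall below; slips: wall left, right to -1
}
best = max(q_s21, key=q_s21.get)
print(best, round(q_s21[best], 3))  # up comes out on top
```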
So after this policy improvement step, what happened? Well, before the policy improvement step, our policy was going right for both squares. After the policy improvement step, we found a better one. So the better policy is going right in S11 and going up in S21. And since the policy changed, we have to keep executing the algorithm.
Remember, for policy iteration, as long as the policy changes, we have to keep going, and we can only terminate when the policy stays the same, when we cannot improve the policy anymore. Okay. That means we're going to do policy evaluation again, right? So our current policy is right and up, and we'll follow a very similar procedure to do the policy evaluation.
Of course, these two equations may look different now. Well, in fact, the equation for S11 will look the same, identical even, because our policy for S11 did not change, right? But the equation for S21 has certainly changed, because our policy there changed from going right to going up.
So you can check this yourself. So we wrote down these two equations. Again, we have two equations, two unknowns. They're both linear.
So using your linear algebra knowledge, you should be able to solve it. Eventually we solved them and got the two utility values: 0.918 for S11 and 0.660 for S21. So this is another policy evaluation step, where we take the current policy and figure out the utility values given that policy.
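For reference, here is my reconstruction of the new system (the S11 equation is unchanged; the S21 equation now follows the "go up" action), along with a quick numerical check:

V(S11) = -0.04 + 0.8 * (+1) + 0.1 * V(S11) + 0.1 * V(S21)
V(S21) = -0.04 + 0.8 * V(S11) + 0.1 * V(S21) + 0.1 * (-1)

```python
import numpy as np

# Rearranged:   0.9 * V(S11) - 0.1 * V(S21) =  0.76
#              -0.8 * V(S11) + 0.9 * V(S21) = -0.14
A = np.array([[0.9, -0.1],
              [-0.8, 0.9]])
b = np.array([0.76, -0.14])

v_s11, v_s21 = np.linalg.solve(A, b)
print(round(v_s11, 3), round(v_s21, 3))  # approximately 0.918 and 0.660
```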
Next, we have to do another policy improvement step. So given our current utility values, which are 0.918 and 0.660, can we derive a better policy? Is the policy going to change? Is the best policy going to change?
So the process is exactly the same as before. Given these numbers, we are going to consider all four possible policies, and for each policy, we'll calculate our expected long-term reward. And the key is thinking about where we're going to go next: with 80% chance, where are we going to end up, and so on and so forth.
So after all of these calculations, the largest number we have is for going right. Therefore, the optimal policy for S11 is still going right, the same as our current policy. We'll do a very similar calculation for S21. Again, trying all four possibilities, we end up with these numbers.
And it turns out the best policy for S21 is going up. So can we stop the algorithm now? Well, our previous policy was right and up, and our new policy is still right and up.
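As a last sanity check, here is the second improvement step in one small script (same assumed layout and slip model as before). With V(S11) = 0.918 and V(S21) = 0.660, the greedy action comes out as right for S11 and up for S21 again:

```python
V = {"S11": 0.918, "S21": 0.660, "+1": 1.0, "-1": -1.0}

# Where each (possibly slipped) move lands, under the assumed 2x2 layout.
MOVE = {
    "S11": {"up": "S11", "down": "S21", "left": "S11", "right": "+1"},
    "S21": {"up": "S11", "down": "S21", "left": "S21", "right": "-1"},
}
SLIPS = {"up": ("left", "right"), "down": ("left", "right"),
         "left": ("up", "down"), "right": ("up", "down")}

for s in ("S11", "S21"):
    q = {}
    for a in ("up", "down", "left", "right"):
        slip_a, slip_b = SLIPS[a]
        q[a] = (0.8 * V[MOVE[s][a]]
                + 0.1 * V[MOVE[s][slip_a]]
                + 0.1 * V[MOVE[s][slip_b]])
    print(s, max(q, key=q.get))  # S11 right, S21 up: the policy is stable
```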
So the policy did not change. Because of this, we can terminate the algorithm and say that we have found the optimal policy. That's everything for this example of the policy iteration algorithm. Thank you very much for watching.
I will see you in the next video. Bye for now.