Understanding Policy Iteration Algorithm Steps
Jan 13, 2025
Lecture on Policy Iteration Algorithm - Example by Alice Gao
Introduction
Previous videos covered the high-level ideas of the policy iteration algorithm.
This video walks through an example execution of the policy iteration algorithm.
The example uses a small 2x2 grid world with two goal states, with rewards +1 and -1.
The states of interest are the two non-goal states, S11 and S21, which sit next to the +1 and -1 goal states respectively.
Grid World Setup
Non-goal state reward: -0.04.
Transition probabilities:
80% intended direction.
10% left of intended direction.
10% right of intended direction.
Hitting a wall results in the agent staying in its current square (a code sketch of this transition model follows below).
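
A minimal Python sketch of this transition model. The coordinate convention and the exact positions of the goal states are assumptions made for illustration; the -0.04 step reward and the 80/10/10 probabilities follow the lecture.

# Assumed layout: rows 1-2, columns 1-2; S11 and S21 form the left column,
# and the +1 / -1 goal states sit directly to their right (assumed names).
GOALS = {(1, 2): +1.0, (2, 2): -1.0}
NON_GOALS = [(1, 1), (2, 1)]                  # S11 and S21
STEP_REWARD = -0.04                           # reward of every non-goal state

MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
LEFT_OF = {"up": "left", "left": "down", "down": "right", "right": "up"}
RIGHT_OF = {"up": "right", "right": "down", "down": "left", "left": "up"}

def transitions(state, action):
    """80% intended direction, 10% left of it, 10% right of it;
    moving into a wall leaves the agent in the same square."""
    outcomes = []
    for prob, direction in [(0.8, action), (0.1, LEFT_OF[action]), (0.1, RIGHT_OF[action])]:
        r = state[0] + MOVES[direction][0]
        c = state[1] + MOVES[direction][1]
        nxt = (r, c) if 1 <= r <= 2 and 1 <= c <= 2 else state
        outcomes.append((prob, nxt))
    return outcomes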
Policy Iteration Algorithm Steps
Policy Evaluation
Evaluate the initial policy (going right in both states).
Use the Bellman equations to compute the utility of each state under this policy.
Because the policy is fixed, the Bellman equations contain no max operator and form a linear system that can be solved directly (see the sketch below).
Solving the system gives V(S11) = 0.75 and V(S21) = -0.85.
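
A sketch of the policy evaluation step, continuing the model above. The goal utilities are fixed at their rewards, so the Bellman equations reduce to a 2x2 linear system over the non-goal states; the lecture's numbers match an undiscounted formulation.

import numpy as np

# Bellman equations for a fixed policy pi:
#   V(s) = R(s) + sum_{s'} P(s' | s, pi(s)) * V(s')
# Written as A V = b over the non-goal states and solved directly.
def evaluate_policy(policy):
    index = {s: i for i, s in enumerate(NON_GOALS)}
    A = np.eye(len(NON_GOALS))
    b = np.full(len(NON_GOALS), STEP_REWARD)
    for s in NON_GOALS:
        for prob, nxt in transitions(s, policy[s]):
            if nxt in GOALS:
                b[index[s]] += prob * GOALS[nxt]   # known utility of a goal state
            else:
                A[index[s], index[nxt]] -= prob
    values = np.linalg.solve(A, b)
    return {s: values[index[s]] for s in NON_GOALS}

# evaluate_policy({(1, 1): "right", (2, 1): "right"})
# gives approximately {(1, 1): 0.75, (2, 1): -0.85}, matching the lecture.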
Policy Improvement
Derive a new policy from the computed utility values.
For each state, calculate the expected utility of each possible action and choose the action with the highest value (a sketch of this step follows below).
For S11, the optimal action remains going right.
For S21, the optimal action changes to going up.
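
A sketch of the policy improvement step, continuing the sketches above: for each non-goal state, score every action by its expected utility under the current value estimates and keep the best one.

def improve_policy(V):
    """Return the greedy policy with respect to the utility estimates V."""
    def expected_utility(s, a):
        return sum(prob * GOALS.get(nxt, V.get(nxt, 0.0))
                   for prob, nxt in transitions(s, a))
    return {s: max(MOVES, key=lambda a: expected_utility(s, a)) for s in NON_GOALS}

# With the utilities from the first evaluation (0.75 and -0.85), this keeps
# "right" for S11 and switches S21 to "up", as in the lecture.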
Repeating the Process
Continue alternating evaluation and improvement until the policy stabilizes.
Updated policy:
S11 continues to go right.
S21 is confirmed to go up.
Reevaluating the utilities under the updated policy gives V(S11) = 0.918 and V(S21) = 0.660.
Convergence
Keep alternating between evaluation and improvement until the policy no longer changes (the full loop is sketched below).
Final policy: S11 goes right, S21 goes up.
Once an improvement step leaves the policy unchanged, convergence is reached and the example is complete.
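
A sketch of the overall loop, combining the evaluation and improvement sketches from the earlier steps: alternate the two until the policy stops changing.

def policy_iteration(initial_policy):
    policy = dict(initial_policy)
    while True:
        V = evaluate_policy(policy)       # solve the linear system for this policy
        new_policy = improve_policy(V)    # greedy one-step lookahead on V
        if new_policy == policy:          # policy unchanged: it is optimal
            return policy, V
        policy = new_policy

# Starting from "right" in both states, this converges to S11 -> "right",
# S21 -> "up", with utilities of about 0.918 and 0.660, as in the lecture.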
Conclusion
The example illustrated how to execute the policy iteration algorithm step by step.
The detailed calculations build an understanding of the policy evaluation and policy improvement steps.
The algorithm terminates when the policy stabilizes, which signifies that an optimal policy has been found.
Closing Remarks
Thank you for watching, and see you in the next video.