Understanding Policy Iteration Algorithm Steps
Jan 13, 2025
Lecture on Policy Iteration Algorithm - Example by Alice Gao
Introduction
Previous videos covered the high-level ideas of the policy iteration algorithm.
This video walks through an example execution of the policy iteration algorithm.
The example uses a small 2x2 grid world with two goal states, with rewards +1 and -1.
The states of interest are the two non-goal states, S11 and S21, which sit next to the +1 and -1 goal states respectively.
Grid World Setup
Non-goal state reward: -0.04.
Transition probabilities:
80% intended direction.
10% left of intended direction.
10% right of intended direction.
Hitting a wall results in the agent staying in its current square (a code sketch of this transition model follows below).
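
A minimal Python sketch of this transition model. The coordinate convention and the exact positions of the goal states are assumptions made for illustration; the -0.04 step reward and the 80/10/10 probabilities follow the lecture.

# Assumed layout: rows 1-2, columns 1-2; S11 and S21 form the left column,
# and the +1 / -1 goal states sit directly to their right (assumed names).
GOALS = {(1, 2): +1.0, (2, 2): -1.0}
NON_GOALS = [(1, 1), (2, 1)]                  # S11 and S21
STEP_REWARD = -0.04                           # reward of every non-goal state

MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
LEFT_OF = {"up": "left", "left": "down", "down": "right", "right": "up"}
RIGHT_OF = {"up": "right", "right": "down", "down": "left", "left": "up"}

def transitions(state, action):
    """80% intended direction, 10% left of it, 10% right of it;
    moving into a wall leaves the agent in the same square."""
    outcomes = []
    for prob, direction in [(0.8, action), (0.1, LEFT_OF[action]), (0.1, RIGHT_OF[action])]:
        r = state[0] + MOVES[direction][0]
        c = state[1] + MOVES[direction][1]
        nxt = (r, c) if 1 <= r <= 2 and 1 <= c <= 2 else state
        outcomes.append((prob, nxt))
    return outcomes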
Policy Iteration Algorithm Steps
Policy Evaluation
Evaluate the initial policy (going right in both states).
Use the Bellman equations to compute the utility of each state under this policy.
Because the policy is fixed, the Bellman equations contain no max operator and form a linear system that can be solved directly (see the sketch below).
Solving the system gives V(S11) = 0.75 and V(S21) = -0.85.
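
A sketch of the policy evaluation step, continuing the model above. The goal utilities are fixed at their rewards, so the Bellman equations reduce to a 2x2 linear system over the non-goal states; the lecture's numbers match an undiscounted formulation.

import numpy as np

# Bellman equations for a fixed policy pi:
#   V(s) = R(s) + sum_{s'} P(s' | s, pi(s)) * V(s')
# Written as A V = b over the non-goal states and solved directly.
def evaluate_policy(policy):
    index = {s: i for i, s in enumerate(NON_GOALS)}
    A = np.eye(len(NON_GOALS))
    b = np.full(len(NON_GOALS), STEP_REWARD)
    for s in NON_GOALS:
        for prob, nxt in transitions(s, policy[s]):
            if nxt in GOALS:
                b[index[s]] += prob * GOALS[nxt]   # known utility of a goal state
            else:
                A[index[s], index[nxt]] -= prob
    values = np.linalg.solve(A, b)
    return {s: values[index[s]] for s in NON_GOALS}

# evaluate_policy({(1, 1): "right", (2, 1): "right"})
# gives approximately {(1, 1): 0.75, (2, 1): -0.85}, matching the lecture.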
Policy Improvement
Derive a new policy from the computed utility values.
For each state, calculate the expected utility of each possible action and choose the action with the highest value (a sketch of this step follows below).
For S11, the optimal action remains going right.
For S21, the optimal action changes to going up.
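
A sketch of the policy improvement step, continuing the sketches above: for each non-goal state, score every action by its expected utility under the current value estimates and keep the best one.

def improve_policy(V):
    """Return the greedy policy with respect to the utility estimates V."""
    def expected_utility(s, a):
        return sum(prob * GOALS.get(nxt, V.get(nxt, 0.0))
                   for prob, nxt in transitions(s, a))
    return {s: max(MOVES, key=lambda a: expected_utility(s, a)) for s in NON_GOALS}

# With the utilities from the first evaluation (0.75 and -0.85), this keeps
# "right" for S11 and switches S21 to "up", as in the lecture.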
Repeating the Process
Continue alternating evaluation and improvement until the policy stabilizes.
Updated policy:
S11 continues to go right.
S21 is confirmed to go up.
Reevaluating the utilities under the updated policy gives V(S11) = 0.918 and V(S21) = 0.660.
Convergence
Keep alternating between evaluation and improvement until the policy no longer changes (the full loop is sketched below).
Final policy: S11 goes right, S21 goes up.
Once an improvement step leaves the policy unchanged, convergence is reached and the example is complete.
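
A sketch of the overall loop, combining the evaluation and improvement sketches from the earlier steps: alternate the two until the policy stops changing.

def policy_iteration(initial_policy):
    policy = dict(initial_policy)
    while True:
        V = evaluate_policy(policy)       # solve the linear system for this policy
        new_policy = improve_policy(V)    # greedy one-step lookahead on V
        if new_policy == policy:          # policy unchanged: it is optimal
            return policy, V
        policy = new_policy

# Starting from "right" in both states, this converges to S11 -> "right",
# S21 -> "up", with utilities of about 0.918 and 0.660, as in the lecture.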
Conclusion
The example illustrated how to execute the policy iteration algorithm step by step.
The detailed calculations build an understanding of the policy evaluation and policy improvement steps.
The algorithm terminates when the policy stabilizes, which signifies that an optimal policy has been found.
Closing Remarks
Thank you for watching, and see you in the next video.