
Understanding Policy Iteration Algorithm Steps

Jan 13, 2025

Lecture on Policy Iteration Algorithm - Example by Alice Gao

Introduction

  • Previous videos covered the high-level ideas of the policy iteration algorithm.
  • This video walks through an example execution of the algorithm.
  • The example is a small 2x2 grid world with two goal states, S12 (reward +1) and S22 (reward -1).
  • The states of interest are the two non-goal states, S11 and S21.

Grid World Setup

  • Each non-goal state has a reward of -0.04.
  • Transition probabilities (modelled in the sketch below):
    • 80% chance of moving in the intended direction.
    • 10% chance of veering left of the intended direction.
    • 10% chance of veering right of the intended direction.
  • Hitting a wall leaves the agent in the same location.
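
To make the setup concrete, here is a minimal Python sketch of the grid world as an MDP. The layout is inferred from the calculations later in these notes rather than stated here: S12 (+1) and S22 (-1) are the goal states, S11 and S21 the non-goal states, and the discount factor is taken to be 1; all names and helper functions are illustrative.

```python
# A minimal sketch of the 2x2 grid world as an MDP. The layout is inferred
# from the calculations below (not stated explicitly in these notes):
# S12 (+1) and S22 (-1) are absorbing goal states in the right column,
# S11 and S21 are the non-goal states in the left column, and the
# discount factor is 1. Names and helpers are illustrative.

GOALS = {"S12": 1.0, "S22": -1.0}   # goal states and their rewards
NON_GOALS = ["S11", "S21"]          # states whose utilities we solve for
STEP_REWARD = -0.04                 # reward of every non-goal state

POS = {"S11": (1, 1), "S12": (1, 2),   # (row, column), row 1 on top
       "S21": (2, 1), "S22": (2, 2)}
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

# The two directions the agent can veer to, relative to the one intended.
SIDES = {"up": ("left", "right"), "down": ("right", "left"),
         "left": ("down", "up"), "right": ("up", "down")}

def step(state, direction):
    """Deterministic move; hitting a wall keeps the agent in place."""
    r, c = POS[state]
    dr, dc = MOVES[direction]
    target = (r + dr, c + dc)
    for s, pos in POS.items():
        if pos == target:
            return s
    return state  # off the grid: stay where we are

def transitions(state, action):
    """Return {next_state: probability} under the 80/10/10 noise model."""
    left, right = SIDES[action]
    probs = {}
    for direction, p in [(action, 0.8), (left, 0.1), (right, 0.1)]:
        s2 = step(state, direction)
        probs[s2] = probs.get(s2, 0.0) + p
    return probs
```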

Policy Iteration Algorithm Steps

  1. Policy Evaluation

    • Evaluate the initial policy: go right in both non-goal states.
    • Write the Bellman equations for the utility values.
    • With the policy fixed, the Bellman equations become a linear system (solved in the sketch after this list).
    • Solving gives V(S11) = 0.75 and V(S21) = -0.85.
  2. Policy Improvement

    • Derive a new policy from the computed utility values.
    • For each state, calculate the expected utility of every possible action.
    • For S11, the optimal action remains going right.
    • For S21, the optimal action changes to going up.
  3. Repeating the Process

    • Repeat evaluation and improvement until the policy stabilizes.
    • New policy:
      • S11 continues to go right.
      • S21 now goes up.
    • Re-evaluating utilities under the updated policy gives V(S11) = 0.918 and V(S21) = 0.660.
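
The following sketch runs these three steps to convergence, building on the grid-world definitions above. It is a reconstruction of the lecture's calculations, not code from the lecture: policy evaluation solves the linear Bellman system exactly with numpy (the two concrete equations for the initial policy appear as comments), and policy improvement is a greedy one-step lookahead over all four actions.

```python
# A reconstruction of the lecture's calculations (not code from the lecture),
# building on the grid-world definitions in the sketch above.
import numpy as np

ACTIONS = ["up", "down", "left", "right"]

def evaluate(policy):
    """Policy evaluation: solve the linear Bellman system for a fixed policy.

    For the initial right/right policy the two equations are:
        V(S11) = -0.04 + 0.8*V(S12) + 0.1*V(S11) + 0.1*V(S21)
        V(S21) = -0.04 + 0.8*V(S22) + 0.1*V(S11) + 0.1*V(S21)
    where the goal utilities V(S12) = +1 and V(S22) = -1 are known constants.
    """
    n = len(NON_GOALS)
    A = np.eye(n)                      # coefficients of the unknown utilities
    b = np.full(n, STEP_REWARD)        # immediate reward of each non-goal state
    for i, s in enumerate(NON_GOALS):
        for s2, p in transitions(s, policy[s]).items():
            if s2 in GOALS:
                b[i] += p * GOALS[s2]            # known goal utility -> RHS
            else:
                A[i, NON_GOALS.index(s2)] -= p   # unknown utility -> LHS
    V = dict(zip(NON_GOALS, np.linalg.solve(A, b)))
    V.update(GOALS)
    return V

def improve(V):
    """Policy improvement: greedy one-step lookahead over all four actions."""
    def q(s, a):
        return sum(p * V[s2] for s2, p in transitions(s, a).items())
    return {s: max(ACTIONS, key=lambda a: q(s, a)) for s in NON_GOALS}

# Start from the initial policy (go right in both states) and alternate
# evaluation and improvement until the policy stops changing.
policy = {"S11": "right", "S21": "right"}
while True:
    V = evaluate(policy)
    print(policy, {s: round(V[s], 3) for s in NON_GOALS})
    new_policy = improve(V)
    if new_policy == policy:
        break
    policy = new_policy
```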

Convergence

  • Alternate between evaluation and improvement until the policy no longer changes.
  • Final policy: S11 goes right, S21 goes up.
  • A further improvement step leaves the policy unchanged, so convergence is reached and the example is complete.
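
Running the sketch above reproduces this trace: the initial right/right policy evaluates to utilities 0.75 and -0.85, improvement switches S21 to up, the new right/up policy evaluates to 0.918 and 0.660, and the next improvement step leaves the policy unchanged, so the loop terminates.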

Conclusion

  • The example illustrated how to execute the policy iteration algorithm step by step.
  • The detailed calculations clarify both the policy evaluation and the policy improvement steps.
  • Termination upon policy stabilization signifies that an optimal policy has been found.

Closing Remarks

  • Thank you for watching, and see you in the next video.