Reinforcement Learning - Lecture 14

May 19, 2024

Administrative Details

  • Midterm Grades: Released last night, see Piazza for details.
  • A2 and Milestone Grades: Scheduled to be released later this week.
  • Project Registration: All teams must register projects via a form on Piazza.
  • Tiny ImageNet Challenge: Evaluation servers are now online.
  • Course Survey: Available on Piazza, please provide feedback.

Reinforcement Learning Overview

  • Previous Topics:
    • Supervised Learning: Learning a function mapping from x (data) to y (labels).
    • Unsupervised Learning: Learning underlying structure from data without labels.
  • Today's Topic: Reinforcement Learning (RL)
    • Agent takes actions in an environment to maximize rewards.

Key Concepts in Reinforcement Learning

  • Agent and Environment:
    • The agent observes a state from the environment, takes an action, and receives a reward along with the next state.
    • This loop repeats until a terminal state is reached (see the interaction-loop sketch after this list).
  • Episodes and Examples:
    • Cart-Pole Problem: Balance a pole on a movable cart.
    • Robot Locomotion: Making a robot move forward.
    • Atari Games: Maximizing the score in classic arcade games.
    • AlphaGo: Go game; agent aims to win against human players.
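
As a concrete picture of the agent-environment loop described above, here is a minimal sketch in Python. The env object with reset() and step(action) methods and the random policy are illustrative assumptions about the interface, not something defined in the lecture.

  import random

  def run_episode(env, policy):
      """One episode of the agent-environment loop: observe state, act, receive reward and next state."""
      state = env.reset()                         # assumed interface: returns the initial state s_0
      total_reward, done = 0.0, False
      while not done:
          action = policy(state)                  # the agent's decision rule (the policy)
          state, reward, done = env.step(action)  # assumed interface: returns (next state, reward, terminal flag)
          total_reward += reward
      return total_reward

  # Example usage with a trivial random policy over a discrete action set:
  # run_episode(env, lambda s: random.randrange(env.n_actions))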

Formalizing the RL Problem: Markov Decision Processes (MDP)

  • MDP Components: State set S, action set A, reward function R, transition probability P, discount factor γ.
  • MDP Process:
    1. Sample the initial state s_0 from the distribution P(s_0).
    2. The agent selects an action a_t.
    3. The environment returns a reward r_t and samples the next state s_{t+1} from the transition probability P.
    4. Steps 2-3 repeat until a terminal state is reached, which ends the episode.
  • Policy π: Function mapping states to actions; can be deterministic or stochastic. The goal is to find the policy π* that maximizes the expected cumulative discounted reward.
  • Value function (V): Expected cumulative reward obtained by following the policy from state s.
  • Q-value function (Q): Expected cumulative reward obtained by taking action a in state s and then following the policy.
  • Bellman Equation: Recursive identity satisfied by the optimal Q-value function Q*; acting greedily with respect to Q* yields the optimal policy (definitions are written out after this list).
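
For reference, these quantities written out in standard notation (γ is the discount factor and r_t the reward at step t):

  V^π(s)   = E[ Σ_{t≥0} γ^t r_t | s_0 = s, π ]
  Q^π(s,a) = E[ Σ_{t≥0} γ^t r_t | s_0 = s, a_0 = a, π ]
  Q*(s,a)  = E_{s'}[ r + γ max_{a'} Q*(s', a') ]      (Bellman optimality equation)
  π*(s)    = argmax_a Q*(s, a)                        (optimal policy read off from Q*)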

Solving for Optimal Policy

  • Q-Learning and Deep Q-Learning:
    • Value Iteration: Apply the Bellman equation as an iterative update to refine the approximation of Q*; this converges, but it is not scalable because it must compute Q(s, a) for every state-action pair.
    • Deep Q-Learning: Use a neural network Q(s, a; θ) to approximate the Q-value function instead.
    • Loss Function: Minimize the squared Bellman error between Q(s, a; θ) and the target r + γ max_{a'} Q(s', a') (a minimal sketch of this loss follows the list).
    • Experience Replay: Store (state, action, reward, next state) transitions in a replay memory.
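
A minimal sketch of that Bellman-error loss in PyTorch. The names q_net, target_net (a frozen copy playing the role of the previous iteration's parameters θ_{i-1}), and the batch layout are illustrative assumptions, not from the lecture.

  import torch
  import torch.nn.functional as F

  def dqn_loss(q_net, target_net, batch, gamma=0.99):
      """Squared Bellman error: (r + gamma * max_a' Q_target(s', a') - Q(s, a; theta))^2."""
      states, actions, rewards, next_states, done = batch   # tensors; actions are int64 indices, done is 0/1
      # Q(s, a; theta) for the actions that were actually taken
      q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
      # Targets come from the frozen network and are not backpropagated through
      with torch.no_grad():
          max_next_q = target_net(next_states).max(dim=1).values
          targets = rewards + gamma * max_next_q * (1.0 - done)
      return F.mse_loss(q_sa, targets)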

Case Study: Deep Q-Learning for Atari Games

  • Environment Setup: State is raw game pixels; actions are game controls; rewards are game scores.
  • Network Architecture: Convolutional layers with final fully-connected layer outputting Q-values for each action.
  • Training: Gradient updates on the Bellman-error loss so that the network's Q-values satisfy the Bellman equation.
  • Experience Replay: Sample random mini-batches from replay memory to break the correlation among consecutive samples.
  • Algorithm: Act ε-greedily with respect to the current Q-network (mixing exploration with the greedy policy), store transitions, and update the Q-network on mini-batches drawn from replay memory (see the sketch after this list).
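
A compact sketch of these pieces in PyTorch. The convolutional layer sizes follow the original Atari DQN paper (an assumption here, not stated in these notes), and the class and function names are illustrative.

  import random
  from collections import deque

  import torch
  import torch.nn as nn

  class AtariQNetwork(nn.Module):
      """Raw game pixels in, one Q-value per game action out: conv layers plus a fully-connected head."""
      def __init__(self, n_actions, in_frames=4):
          super().__init__()
          self.net = nn.Sequential(
              nn.Conv2d(in_frames, 16, kernel_size=8, stride=4), nn.ReLU(),  # 4x84x84 -> 16x20x20
              nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),         # 16x20x20 -> 32x9x9
              nn.Flatten(),
              nn.Linear(32 * 9 * 9, 256), nn.ReLU(),
              nn.Linear(256, n_actions),                                     # Q-values for each action
          )

      def forward(self, frames):  # frames: [batch, 4, 84, 84] stack of preprocessed grayscale frames
          return self.net(frames)

  class ReplayMemory:
      """Store (s, a, r, s', done) transitions; random mini-batches break correlation between samples."""
      def __init__(self, capacity=100_000):
          self.buffer = deque(maxlen=capacity)

      def push(self, transition):
          self.buffer.append(transition)

      def sample(self, batch_size):
          return random.sample(self.buffer, batch_size)

  def epsilon_greedy(q_net, state, n_actions, eps=0.1):
      """Explore with probability eps; otherwise act greedily with respect to the current Q estimates."""
      if random.random() < eps:
          return random.randrange(n_actions)
      with torch.no_grad():
          return int(q_net(state.unsqueeze(0)).argmax(dim=1).item())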

Challenges with Q-Learning

  • State-Action Space Complexity: For problems with large state-action spaces, such as robot control, the Q-function of every (state, action) pair can be very complicated and hard to learn.
  • Function Approximation for Q: Deep networks can approximate Q, but this is computationally expensive; the policy itself is often much simpler, which motivates learning it directly.
  • Example: Google DeepMind's DQN playing Atari games, showing the learned agent's behavior.

Policy Gradients

  • Objective: Directly maximize the expected cumulative reward J(θ) by adjusting the policy parameters θ.
  • REINFORCE Algorithm: Gradient ascent on J(θ); the gradient is estimated from sampled trajectories as the sum over time of ∇_θ log π_θ(a_t|s_t), scaled by the trajectory's reward.
  • Variance Reduction Techniques (a minimal update sketch follows this list):
    • Scale each log-probability by the future (reward-to-go) return rather than the total trajectory reward.
    • Apply a discount factor so that near-future rewards are weighted more heavily.
    • Subtract a baseline from the returns, which reduces variance without biasing the gradient estimate.
  • Actor-Critic Methods: Combine policy gradients with Q-learning to train both an actor (the policy) and a critic (the value function used to judge the actor's actions).
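
A minimal sketch of a REINFORCE update that uses these variance-reduction tricks (reward-to-go, discounting, and a simple constant baseline), written in PyTorch. policy_net, the trajectory format, and the mean-return baseline are illustrative assumptions.

  import torch

  def reinforce_loss(policy_net, states, actions, rewards, gamma=0.99):
      """REINFORCE with reward-to-go and a baseline: ascend sum_t log pi_theta(a_t|s_t) * (G_t - b)."""
      # Discounted reward-to-go: G_t = sum_{t' >= t} gamma^(t'-t) * r_{t'}
      returns, g = [], 0.0
      for r in reversed(rewards):            # rewards: list of per-step rewards from one sampled trajectory
          g = r + gamma * g
          returns.append(g)
      returns = torch.tensor(list(reversed(returns)), dtype=torch.float32)
      advantages = returns - returns.mean()  # constant baseline; a learned critic (actor-critic) also works
      # log pi_theta(a_t | s_t) for the actions taken along the trajectory (discrete action policy assumed)
      log_probs = torch.distributions.Categorical(logits=policy_net(states)).log_prob(actions)
      return -(log_probs * advantages).sum() # negate so a standard gradient-descent optimizer performs ascent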

Applications: Recurrent Attention Model and AlphaGo

  • Recurrent Attention Model:
    • Focus on parts of an image for tasks like image classification or captioning.
    • Using an RNN to model the state and determine where to look next.
  • AlphaGo:
    • Combining supervised learning, reinforcement learning, and Monte Carlo Tree Search.
    • Policy network initialized from expert moves, updated with policy gradients through self-play.
  • Outcome: AlphaGo defeated top human Go players using RL techniques.

Summary

  • Policy Gradients: Very general, but the gradient estimates have high variance, so more samples are required.
  • Q-Learning: More sample-efficient, but harder to apply and computationally intensive.
  • Key Guarantees:
    • Policy gradients converge to a local optimum of the objective J(θ), which is often good enough.
    • Q-learning has no such guarantee, because it approximates the Bellman equation with a complicated function approximator.

Next Lecture: Guest lecture by Song Han on model compression and energy-efficient deep learning.