Reinforcement Learning - Lecture 14
Administrative Details
- Midterm Grades: Released last night, see Piazza for details.
- A2 and Milestone Grades: Scheduled for later this week.
- Project Registration: All teams must register projects via a form on Piazza.
- Tiny ImageNet Challenge: Evaluation servers are now online.
- Course Survey: Available on Piazza, please provide feedback.
Reinforcement Learning Overview
- Previous Topics:
- Supervised Learning: Learning a function that maps x (data) to y (labels).
- Unsupervised Learning: Learning underlying structure from data without labels.
- Today's Topic: Reinforcement Learning (RL)
- Agent takes actions in an environment to maximize rewards.
Key Concepts in Reinforcement Learning
- Agent and Environment:
- Agent receives a state from the environment, takes an action, and receives a reward and the next state.
- The loop continues until a terminal state is reached (a minimal loop sketch follows this list).
- Episodes and Examples:
- Cart-Pole Problem: Balance a pole on a movable cart.
- Robot Locomotion: Making a robot move forward.
- Atari Games: Maximizing the score in classic arcade games.
- AlphaGo: Go game; agent aims to win against human players.
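To make the agent-environment loop concrete, here is a minimal sketch; the Environment and Agent interfaces are hypothetical placeholders rather than any specific library.

```python
# Minimal sketch of the agent-environment loop described above. The env and agent
# objects are hypothetical placeholders, not a specific library's API.
def run_episode(env, agent, max_steps=1000):
    state = env.reset()                  # environment provides an initial state
    total_reward = 0.0
    for _ in range(max_steps):
        action = agent.act(state)        # agent chooses an action given the state
        next_state, reward, done = env.step(action)   # environment responds
        agent.observe(state, action, reward, next_state, done)
        total_reward += reward
        state = next_state
        if done:                         # terminal state ends the episode
            break
    return total_reward
```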
Formalizing the RL Problem: Markov Decision Processes (MDP)
- MDP Components: A tuple (S, A, R, P, γ): set of states S, set of actions A, reward function R, transition probability distribution P, and discount factor γ.
- MDP Process:
- Initialize state from the distribution P(s_0).
- Agent takes actions.
- Transition to next state and get reward.
- Episode ends on terminal state.
- Policy π: Function mapping states to actions; can be deterministic or stochastic.
- Bellman Equation: The optimal Q-value function Q* satisfies a recursive relationship that is used to find the optimal policy (see the equations after this list).
- Value function V: Expected cumulative discounted reward when starting from state s and following the policy.
- Q-value function Q: Expected cumulative discounted reward when starting from state-action pair (s, a) and then following the policy.
Solving for Optimal Policy
- Q-Learning and Deep Q-Learning:
- Value Iteration: Iteratively apply the Bellman equation to refine an approximation of Q*; tabulating Q for every state-action pair does not scale to large state spaces.
- Deep Q-Learning: Use a neural network to approximate the Q-value function.
- Loss Function: Minimize the Bellman error, the squared difference between the predicted Q-value and the target r + γ max_a' Q(s', a') (see the update sketch after this list).
- Experience Replay: Store (state, action, reward, next state) transitions in a replay memory and train on samples drawn from it.
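A minimal PyTorch-style sketch of one deep Q-learning update, assuming a small fully-connected Q-network and a mini-batch of (state, action, reward, next state, done) transitions; the names QNetwork and q_learning_step are illustrative, not the lecture's code.

```python
# Sketch of a deep Q-learning gradient step on the Bellman error (illustrative).
import torch
import torch.nn as nn
import torch.nn.functional as F

class QNetwork(nn.Module):
    """Small fully-connected Q-network: state -> one Q-value per action."""
    def __init__(self, state_dim, num_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, num_actions),
        )

    def forward(self, state):
        return self.net(state)

def q_learning_step(q_net, optimizer, batch, gamma=0.99):
    """One gradient step on a mini-batch of (s, a, r, s', done) transitions."""
    states, actions, rewards, next_states, dones = batch

    # Q(s, a) for the actions that were actually taken.
    q_pred = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # Bellman target: r + gamma * max_a' Q(s', a'); no bootstrapping at terminal states.
    with torch.no_grad():
        q_next = q_net(next_states).max(dim=1).values
        target = rewards + gamma * q_next * (1.0 - dones)

    loss = F.mse_loss(q_pred, target)   # minimize the Bellman error
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```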
Case Study: Deep Q-Learning for Atari Games
- Environment Setup: State is raw game pixels; actions are game controls; rewards are game scores.
- Network Architecture: Convolutional layers with final fully-connected layer outputting Q-values for each action.
- Training: Gradient updates on the loss function push predicted Q-values toward the Bellman targets.
- Experience Replay: Sample random mini-batches from replay memory to break correlations between consecutive frames.
- Algorithm: Act with an epsilon-greedy policy (mixing random exploration with greedy actions) and update the Q-network from experience replay (a sketch of the network and buffer follows this list).
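A sketch of a DQN-style setup for Atari: a convolutional Q-network over stacked frames plus a replay buffer. Layer sizes follow the original DQN architecture (4 stacked 84x84 grayscale frames); the class names and exact details here are illustrative.

```python
# DQN-style convolutional Q-network and replay buffer (illustrative sketch).
import random
from collections import deque

import torch.nn as nn

class AtariQNetwork(nn.Module):
    def __init__(self, num_actions, in_frames=4):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_frames, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, num_actions),   # one Q-value per game control
        )

    def forward(self, frames):             # frames: (batch, in_frames, 84, 84)
        return self.fc(self.conv(frames))

class ReplayBuffer:
    """Fixed-size buffer of (s, a, r, s', done) transitions."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        # Random sampling breaks temporal correlation between consecutive frames.
        return random.sample(self.buffer, batch_size)
```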
Challenges with Q-Learning
- State-Action Space Complexity: Computing Q-values for every state-action pair is infeasible in large spaces such as continuous robot-control problems.
- Function Approximation for Q: Deep networks can approximate Q, but training them is computationally expensive and can be unstable.
- Example: Google DeepMind's DQN on Atari games showing learned agent behavior.
Policy Gradients
- Objective: Directly maximizing expected cumulative reward by adjusting policy parameters (θ).
- REINFORCE Algorithm: Gradient ascent on the policy parameters using sampled trajectories (a sketch follows this list).
- Variance Reduction Techniques:
- Credit each action only with the rewards obtained from that time step onward, not the total trajectory reward.
- Apply a discount factor so that near-future rewards are weighted more heavily than distant ones.
- Subtract a baseline from the returns to normalize rewards without biasing the gradient.
- Actor-Critic Methods: Combine policy gradients with Q-learning, jointly training an actor (the policy) and a critic (a value function that scores the actor's actions).
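A minimal REINFORCE sketch with returns-to-go and a baseline, assuming a stochastic policy network that outputs action logits; the helper names (PolicyNet, reinforce_loss) are hypothetical, not from the lecture.

```python
# REINFORCE surrogate loss for one sampled trajectory (illustrative sketch).
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    def __init__(self, state_dim, num_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.Tanh(),
            nn.Linear(64, num_actions),   # logits over actions
        )

    def forward(self, state):
        return self.net(state)

def reinforce_loss(policy, states, actions, rewards, gamma=0.99):
    """Policy-gradient surrogate loss; minimizing it performs gradient ascent on return."""
    # Discounted returns-to-go: only rewards from time t onward credit action a_t.
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns = torch.tensor(list(reversed(returns)))

    # Baseline (here simply the mean return) reduces variance without biasing the gradient.
    advantages = returns - returns.mean()

    log_probs = torch.distributions.Categorical(logits=policy(states)).log_prob(actions)
    return -(log_probs * advantages).sum()
```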
Applications: Recurrent Attention Model and AlphaGo
- Recurrent Attention Model:
- Focus on parts of an image for tasks like image classification or captioning.
- Using an RNN to model the state (what has been seen so far) and to decide where to look next (a sketch follows this list).
- AlphaGo:
- Combining supervised learning, reinforcement learning, and Monte Carlo Tree Search.
- Policy network initialized from expert moves, updated with policy gradients through self-play.
- Outcome: AlphaGo defeated top human Go players using RL techniques.
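A hypothetical sketch of one step of such a recurrent attention ("hard attention") model: an RNN carries the state of what has been glimpsed so far, and a location head proposes where to look next. Because locations are sampled (non-differentiable), the model is trained with REINFORCE; the class and layer names here are illustrative.

```python
# One glimpse step of a recurrent attention agent (illustrative sketch).
import torch
import torch.nn as nn

class GlimpseAgent(nn.Module):
    def __init__(self, glimpse_dim=128, hidden_dim=256):
        super().__init__()
        self.rnn = nn.GRUCell(glimpse_dim, hidden_dim)   # summarizes what has been seen
        self.loc_head = nn.Linear(hidden_dim, 2)         # mean (x, y) of the next glimpse

    def step(self, glimpse_feat, hidden):
        hidden = self.rnn(glimpse_feat, hidden)
        loc_mean = torch.tanh(self.loc_head(hidden))     # location in [-1, 1]^2
        # Sampling the location is non-differentiable, so gradients come from REINFORCE.
        loc = torch.distributions.Normal(loc_mean, 0.1).sample()
        return loc, hidden
```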
Summary
- Policy Gradients: Very general, but the gradient estimates have high variance, so they require many samples.
- Q-Learning: More sample-efficient when it works, but harder to apply and computationally intensive.
- Key Guarantees:
- Policy gradients are guaranteed to converge to a local optimum of the expected-reward objective, which is often good enough.
- Q-learning has no such guarantee, since it approximates the Bellman equation with a complicated function approximator.
Next Lecture: Guest lecture by Song Han on model compression and energy-efficient deep learning.