Reinforcement Learning - Lecture 14

May 19, 2024

Administrative Details

  • Midterm Grades: Released last night, see Piazza for details.
  • A2 and Milestone Grades: Scheduled to be released later this week.
  • Project Registration: All teams must register projects via a form on Piazza.
  • Tiny ImageNet Challenge: Evaluation servers are now online.
  • Course Survey: Available on Piazza, please provide feedback.

Reinforcement Learning Overview

  • Previous Topics:
    • Supervised Learning: Learning a function mapping from x (data) to y (labels).
    • Unsupervised Learning: Learning underlying structure from data without labels.
  • Today's Topic: Reinforcement Learning (RL)
    • Agent takes actions in an environment to maximize rewards.

Key Concepts in Reinforcement Learning

  • Agent and Environment:
    • The agent observes a state from the environment, takes an action, and receives a reward along with the next state.
    • This loop repeats until a terminal state is reached (see the interaction-loop sketch after this list).
  • Episodes and Examples:
    • Cart-Pole Problem: Balance a pole on a movable cart.
    • Robot Locomotion: Making a robot move forward.
    • Atari Games: Maximizing the score in classic arcade games.
    • AlphaGo: Go game; agent aims to win against human players.
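
As a concrete picture of the agent-environment loop described above, here is a minimal sketch in Python. The env object with reset() and step(action) methods and the random policy are illustrative assumptions about the interface, not something defined in the lecture.

  import random

  def run_episode(env, policy):
      """One episode of the agent-environment loop: observe state, act, receive reward and next state."""
      state = env.reset()                         # assumed interface: returns the initial state s_0
      total_reward, done = 0.0, False
      while not done:
          action = policy(state)                  # the agent's decision rule (the policy)
          state, reward, done = env.step(action)  # assumed interface: returns (next state, reward, terminal flag)
          total_reward += reward
      return total_reward

  # Example usage with a trivial random policy over a discrete action set:
  # run_episode(env, lambda s: random.randrange(env.n_actions))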

Formalizing the RL Problem: Markov Decision Processes (MDP)

  • MDP Components: State set S, action set A, reward function R, transition probability P, discount factor γ.
  • MDP Process:
    1. Sample the initial state s_0 from the distribution P(s_0).
    2. The agent selects an action a_t.
    3. The environment returns a reward r_t and samples the next state s_{t+1} from the transition probability P.
    4. Steps 2-3 repeat until a terminal state is reached, which ends the episode.
  • Policy π: Function mapping states to actions; can be deterministic or stochastic. The goal is to find the policy π* that maximizes the expected cumulative discounted reward.
  • Value function (V): Expected cumulative reward obtained by following the policy from state s.
  • Q-value function (Q): Expected cumulative reward obtained by taking action a in state s and then following the policy.
  • Bellman Equation: Recursive identity satisfied by the optimal Q-value function Q*; acting greedily with respect to Q* yields the optimal policy (definitions are written out after this list).
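
For reference, these quantities written out in standard notation (γ is the discount factor and r_t the reward at step t):

  V^π(s)   = E[ Σ_{t≥0} γ^t r_t | s_0 = s, π ]
  Q^π(s,a) = E[ Σ_{t≥0} γ^t r_t | s_0 = s, a_0 = a, π ]
  Q*(s,a)  = E_{s'}[ r + γ max_{a'} Q*(s', a') ]      (Bellman optimality equation)
  π*(s)    = argmax_a Q*(s, a)                        (optimal policy read off from Q*)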

Solving for Optimal Policy

  • Q-Learning and Deep Q-Learning:
    • Value Iteration: Apply the Bellman equation as an iterative update to refine the approximation of Q*; this converges, but it is not scalable because it must compute Q(s, a) for every state-action pair.
    • Deep Q-Learning: Use a neural network Q(s, a; θ) to approximate the Q-value function instead.
    • Loss Function: Minimize the squared Bellman error between Q(s, a; θ) and the target r + γ max_{a'} Q(s', a') (a minimal sketch of this loss follows the list).
    • Experience Replay: Store (state, action, reward, next state) transitions in a replay memory.
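
A minimal sketch of that Bellman-error loss in PyTorch. The names q_net, target_net (a frozen copy playing the role of the previous iteration's parameters θ_{i-1}), and the batch layout are illustrative assumptions, not from the lecture.

  import torch
  import torch.nn.functional as F

  def dqn_loss(q_net, target_net, batch, gamma=0.99):
      """Squared Bellman error: (r + gamma * max_a' Q_target(s', a') - Q(s, a; theta))^2."""
      states, actions, rewards, next_states, done = batch   # tensors; actions are int64 indices, done is 0/1
      # Q(s, a; theta) for the actions that were actually taken
      q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
      # Targets come from the frozen network and are not backpropagated through
      with torch.no_grad():
          max_next_q = target_net(next_states).max(dim=1).values
          targets = rewards + gamma * max_next_q * (1.0 - done)
      return F.mse_loss(q_sa, targets)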

Case Study: Deep Q-Learning for Atari Games

  • Environment Setup: State is raw game pixels; actions are game controls; rewards are game scores.
  • Network Architecture: Convolutional layers with final fully-connected layer outputting Q-values for each action.
  • Training: Gradient updates on the Bellman-error loss so that the network's Q-values satisfy the Bellman equation.
  • Experience Replay: Sample random mini-batches from replay memory to break the correlation among consecutive samples.
  • Algorithm: Act ε-greedily with respect to the current Q-network (mixing exploration with the greedy policy), store transitions, and update the Q-network on mini-batches drawn from replay memory (see the sketch after this list).
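
A compact sketch of these pieces in PyTorch. The convolutional layer sizes follow the original Atari DQN paper (an assumption here, not stated in these notes), and the class and function names are illustrative.

  import random
  from collections import deque

  import torch
  import torch.nn as nn

  class AtariQNetwork(nn.Module):
      """Raw game pixels in, one Q-value per game action out: conv layers plus a fully-connected head."""
      def __init__(self, n_actions, in_frames=4):
          super().__init__()
          self.net = nn.Sequential(
              nn.Conv2d(in_frames, 16, kernel_size=8, stride=4), nn.ReLU(),  # 4x84x84 -> 16x20x20
              nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),         # 16x20x20 -> 32x9x9
              nn.Flatten(),
              nn.Linear(32 * 9 * 9, 256), nn.ReLU(),
              nn.Linear(256, n_actions),                                     # Q-values for each action
          )

      def forward(self, frames):  # frames: [batch, 4, 84, 84] stack of preprocessed grayscale frames
          return self.net(frames)

  class ReplayMemory:
      """Store (s, a, r, s', done) transitions; random mini-batches break correlation between samples."""
      def __init__(self, capacity=100_000):
          self.buffer = deque(maxlen=capacity)

      def push(self, transition):
          self.buffer.append(transition)

      def sample(self, batch_size):
          return random.sample(self.buffer, batch_size)

  def epsilon_greedy(q_net, state, n_actions, eps=0.1):
      """Explore with probability eps; otherwise act greedily with respect to the current Q estimates."""
      if random.random() < eps:
          return random.randrange(n_actions)
      with torch.no_grad():
          return int(q_net(state.unsqueeze(0)).argmax(dim=1).item())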

Challenges with Q-Learning

  • State-Action Space Complexity: For problems with large state-action spaces, such as robot control, the Q-function of every (state, action) pair can be very complicated and hard to learn.
  • Function Approximation for Q: Deep networks can approximate Q, but this is computationally expensive; the policy itself is often much simpler, which motivates learning it directly.
  • Example: Google DeepMind's DQN playing Atari games, showing the learned agent's behavior.

Policy Gradients

  • Objective: Directly maximize the expected cumulative reward J(θ) by adjusting the policy parameters θ.
  • REINFORCE Algorithm: Gradient ascent on J(θ); the gradient is estimated from sampled trajectories as the sum over time of ∇_θ log π_θ(a_t|s_t), scaled by the trajectory's reward.
  • Variance Reduction Techniques (a minimal update sketch follows this list):
    • Scale each log-probability by the future (reward-to-go) return rather than the total trajectory reward.
    • Apply a discount factor so that near-future rewards are weighted more heavily.
    • Subtract a baseline from the returns, which reduces variance without biasing the gradient estimate.
  • Actor-Critic Methods: Combine policy gradients with Q-learning to train both an actor (the policy) and a critic (the value function used to judge the actor's actions).
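
A minimal sketch of a REINFORCE update that uses these variance-reduction tricks (reward-to-go, discounting, and a simple constant baseline), written in PyTorch. policy_net, the trajectory format, and the mean-return baseline are illustrative assumptions.

  import torch

  def reinforce_loss(policy_net, states, actions, rewards, gamma=0.99):
      """REINFORCE with reward-to-go and a baseline: ascend sum_t log pi_theta(a_t|s_t) * (G_t - b)."""
      # Discounted reward-to-go: G_t = sum_{t' >= t} gamma^(t'-t) * r_{t'}
      returns, g = [], 0.0
      for r in reversed(rewards):            # rewards: list of per-step rewards from one sampled trajectory
          g = r + gamma * g
          returns.append(g)
      returns = torch.tensor(list(reversed(returns)), dtype=torch.float32)
      advantages = returns - returns.mean()  # constant baseline; a learned critic (actor-critic) also works
      # log pi_theta(a_t | s_t) for the actions taken along the trajectory (discrete action policy assumed)
      log_probs = torch.distributions.Categorical(logits=policy_net(states)).log_prob(actions)
      return -(log_probs * advantages).sum() # negate so a standard gradient-descent optimizer performs ascent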

Applications: Recurrent Attention Model and AlphaGo

  • Recurrent Attention Model:
    • Focus on parts of an image for tasks like image classification or captioning.
    • Using an RNN to model the state and determine where to look next.
  • AlphaGo:
    • Combining supervised learning, reinforcement learning, and Monte Carlo Tree Search.
    • Policy network initialized from expert moves, updated with policy gradients through self-play.
  • Outcome: AlphaGo defeated top human Go players using RL techniques.

Summary

  • Policy Gradients: Very general, but the gradient estimates have high variance, so more samples are required.
  • Q-Learning: More sample-efficient, but harder to apply and computationally intensive.
  • Key Guarantees:
    • Policy gradients converge to a local optimum of the objective J(θ), which is often good enough.
    • Q-learning has no such guarantee, because it approximates the Bellman equation with a complicated function approximator.

Next Lecture: Guest lecture by Song Han on model compression and energy-efficient deep learning.