Lecture Notes on Reinforcement Learning

Jul 26, 2024

Overview

  • Topic: Model-free reinforcement learning focused on gradient-free methods.
  • Main Technique: Q-learning, an off-policy gradient-free method, will be covered.
  • Connections: Deep learning has ignited recent interest in reinforcement learning.

Key Concepts Recap

Quality Function (Q)

  • Definition: Q(s, a) assesses the joint quality of being in state s and taking action a.
  • Components:
    • Value function (v): Value of being in state s assuming the best action is taken.
    • Goal: Q provides richer information than the value function: it scores being in state s for any choice of action, not only the best one.
  • Markov Decision Processes (MDP): The Q function can be defined within this probabilistic framework.
  • Expectation: The Q value is an expectation over future rewards, reflecting the environment's randomness (written out below).
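
In symbols, a standard textbook formulation (the notation is not taken verbatim from the lecture; γ is the discount factor that reappears under Monte Carlo learning):

```latex
Q^{\pi}(s, a) \;=\; \mathbb{E}\!\left[\,\sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1} \;\middle|\; s_t = s,\; a_t = a,\; \pi \right],
\qquad
V^{*}(s) \;=\; \max_{a} Q^{*}(s, a).
```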

Bellman Equation and Dynamic Programming

  • Assumptions:
    • A known MDP model (i.e., system's evolution/transition probabilities and rewards).
  • Dynamic Programming Techniques: Policy iteration and value iteration depend on knowing the MDP.
  • Policy (π): Selects, in each state, the action that maximizes the Q function (a minimal value-iteration sketch follows this list).
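
To make the model dependence concrete, here is a minimal Q-value-iteration sketch under an assumed model interface: `P[s][a]` is taken to be a list of `(prob, next_state, reward)` tuples, which is exactly the knowledge that model-free methods lack. The interface and names are illustrative, not from the lecture.

```python
import numpy as np

def value_iteration(P, n_states, n_actions, gamma=0.99, tol=1e-6):
    """Q-value iteration for a *known* MDP.

    P[s][a] is assumed to be a list of (prob, next_state, reward) tuples,
    i.e. the transition/reward model that model-free methods do not have.
    """
    Q = np.zeros((n_states, n_actions))
    while True:
        Q_new = np.zeros_like(Q)
        for s in range(n_states):
            for a in range(n_actions):
                # Bellman backup: expected reward plus discounted value of the best next action
                Q_new[s, a] = sum(p * (r + gamma * Q[s_next].max())
                                  for p, s_next, r in P[s][a])
        if np.abs(Q_new - Q).max() < tol:
            return Q_new
        Q = Q_new

# The greedy policy then picks argmax over Q[s, :] in each state.
```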

Model-free Reinforcement Learning Motivation

  • Context: Many systems lack knowledge of the MDP or the reward model.
  • Approach: Learning through trial and error, similar to biological processes.

Monte Carlo Learning

  • Description: An episodic learning algorithm requiring an entire episode to learn.
  • Process:
    • Enacts a selected policy and computes the cumulative reward, discounted by gamma (γ).
    • Credits that discounted return to every state visited during the episode.
  • Efficiency:
    • Inefficient, because every visited state receives equal credit for the return, but the estimates are unbiased (a minimal sketch follows this list).
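
A minimal sketch of episodic (every-visit) Monte Carlo value estimation, assuming a `run_episode(policy)` helper that returns a list of `(state, reward)` pairs, where `reward` is received on leaving that state; the helper and its interface are assumptions for illustration.

```python
from collections import defaultdict

def monte_carlo_values(run_episode, policy, gamma=0.99, n_episodes=1000):
    """Every-visit Monte Carlo estimation of V(s) under a fixed policy."""
    returns_sum = defaultdict(float)
    returns_cnt = defaultdict(int)
    for _ in range(n_episodes):
        episode = run_episode(policy)          # assumed: [(state, reward), ...]
        G = 0.0
        # Walk the episode backwards, accumulating the discounted return
        for state, reward in reversed(episode):
            G = reward + gamma * G
            returns_sum[state] += G
            returns_cnt[state] += 1
    return {s: returns_sum[s] / returns_cnt[s] for s in returns_sum}
```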

Temporal Difference Learning (TD Learning)

  • Importance: A significant advance that weighs recent events more heavily than past ones.
  • Example: TD(0) updates the value function from the current reward and the predicted value of the next state (see the sketch after this list):
    • TD Target: the reward plus the discounted value of the next state, r + γV(s').
    • TD Error: the difference between the TD target and the current value estimate V(s).
  • Connection to Biology: Potential biological parallels, linking dopamine release and neural connection strengthening to TD errors.
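
A minimal TD(0) update sketch; `V` is a table (dict or array) of value estimates and `alpha` a learning rate, both illustrative names rather than anything fixed by the lecture.

```python
def td0_update(V, state, reward, next_state, alpha=0.1, gamma=0.99, done=False):
    """One TD(0) step: move V(state) toward the TD target r + gamma * V(next_state)."""
    td_target = reward + (0.0 if done else gamma * V[next_state])
    td_error = td_target - V[state]   # the prediction error discussed above
    V[state] += alpha * td_error
    return td_error
```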

Advanced TD Learning Variants

  • N-Step TD Learning: Uses the cumulative reward over several future steps before bootstrapping on the value estimate.
  • TD(λ): Combines all n-step returns with exponential weighting controlled by a parameter λ (0 ≤ λ ≤ 1); the standard formulas follow this list.
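
In standard notation, the n-step return and the λ-return that TD(λ) averages over can be written as:

```latex
G_t^{(n)} = r_{t+1} + \gamma r_{t+2} + \dots + \gamma^{n-1} r_{t+n} + \gamma^{n} V(s_{t+n}),
\qquad
G_t^{\lambda} = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{\,n-1}\, G_t^{(n)}.
```

With λ = 0 this reduces to the TD(0) target, and as λ → 1 it approaches the Monte Carlo return.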

Q-Learning

  • Framework: Off-policy TD learning extended to the Q function.
  • Core Concept: Updates the Q function using TD targets built from experienced transitions, without requiring direct execution of the optimal policy (a minimal update sketch follows this list).
    • The target maximizes Q over the next actions to update the current estimate of future rewards.
  • Advantages:
    • Learn from sub-optimal actions,
    • Replay past experiences.
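
A minimal tabular Q-learning update sketch; the `Q` array layout (states × actions) and the hyperparameter names are assumptions for illustration.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99, done=False):
    """Off-policy TD update: the target maximizes over next actions,
    regardless of which action the behaviour policy takes next."""
    target = r + (0.0 if done else gamma * np.max(Q[s_next]))
    td_error = target - Q[s, a]
    Q[s, a] += alpha * td_error
    return td_error
```

Because the target does not depend on the action actually taken in the next state, stored transitions (s, a, r, s') can be replayed later without invalidating the update.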

Comparison with SARSA

  • SARSA Overview: On-policy TD learning for the Q function; its update uses the action actually taken, so learning the optimal Q requires following the best known policy.
  • Key Differences:
    1. Q-learning uses a max over possible next actions in its update target, while SARSA uses the action actually selected by the current policy.
    2. SARSA's estimates track the policy being followed, so they may degrade as estimates of the optimal Q when actions deviate from the optimal policy (the two targets are written side by side below).
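
The difference shows up directly in the two update targets (standard notation):

```latex
\text{Q-learning target: } r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a'),
\qquad
\text{SARSA target: } r_{t+1} + \gamma\, Q(s_{t+1}, a_{t+1}).
```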

Exploration Strategies

  • Epsilon-Greedy: Balances exploitation and exploration by selecting a random action with probability epsilon (ε) and the greedy action otherwise.
  • Annealing: Start with high exploration (large ε) and gradually lower it as learning progresses (a small sketch follows this list).
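
A minimal ε-greedy selection with a linear annealing schedule; the schedule shape and default values are illustrative assumptions, not prescriptions from the lecture.

```python
import numpy as np

def epsilon_greedy(Q, s, epsilon, rng=None):
    """With probability epsilon pick a uniformly random action, otherwise the greedy one."""
    rng = rng if rng is not None else np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))   # explore
    return int(np.argmax(Q[s]))                # exploit

def annealed_epsilon(step, eps_start=1.0, eps_end=0.05, decay_steps=10_000):
    """Linearly anneal epsilon from eps_start down to eps_end over decay_steps."""
    frac = min(step / decay_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)
```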

Summary

  • Learning Without a Model: Applies dynamic-programming-inspired updates using experience gathered directly from the environment, without a known MDP.
  • Strategic Exploration: Balancing risk versus reward in the exploration strategy improves the overall learning process.
  • Future Directions: Applications and advanced techniques, including deep reinforcement learning, will follow in upcoming lectures.