Lecture Notes on Reinforcement Learning

Jul 26, 2024

Overview

  • Topic: Model-free reinforcement learning focused on gradient-free methods.
  • Main Technique: Q-learning, an off-policy gradient-free method, will be covered.
  • Connections: Deep learning has ignited recent interest in reinforcement learning.

Key Concepts Recap

Quality Function (Q)

  • Definition: Q(s, a) assesses the joint quality of being in state s and taking action a.
  • Components:
    • Value function (v): Value of being in state s assuming the best action is taken.
    • Goal: Q provides richer information than the value function: it scores being in state s for any choice of action, not only the best one.
  • Markov Decision Processes (MDP): The Q function can be defined within this probabilistic framework.
  • Expectation: The Q value is an expectation over future rewards, reflecting the environment's randomness (written out below).
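
In symbols, a standard textbook formulation (the notation is not taken verbatim from the lecture; γ is the discount factor that reappears under Monte Carlo learning):

```latex
Q^{\pi}(s, a) \;=\; \mathbb{E}\!\left[\,\sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1} \;\middle|\; s_t = s,\; a_t = a,\; \pi \right],
\qquad
V^{*}(s) \;=\; \max_{a} Q^{*}(s, a).
```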

Bellman Equation and Dynamic Programming

  • Assumptions:
    • A known MDP model (i.e., system's evolution/transition probabilities and rewards).
  • Dynamic Programming Techniques: Policy iteration and value iteration depend on knowing the MDP.
  • Policy (π): Selects, in each state, the action that maximizes the Q function (a minimal value-iteration sketch follows this list).
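
To make the model dependence concrete, here is a minimal Q-value-iteration sketch under an assumed model interface: `P[s][a]` is taken to be a list of `(prob, next_state, reward)` tuples, which is exactly the knowledge that model-free methods lack. The interface and names are illustrative, not from the lecture.

```python
import numpy as np

def value_iteration(P, n_states, n_actions, gamma=0.99, tol=1e-6):
    """Q-value iteration for a *known* MDP.

    P[s][a] is assumed to be a list of (prob, next_state, reward) tuples,
    i.e. the transition/reward model that model-free methods do not have.
    """
    Q = np.zeros((n_states, n_actions))
    while True:
        Q_new = np.zeros_like(Q)
        for s in range(n_states):
            for a in range(n_actions):
                # Bellman backup: expected reward plus discounted value of the best next action
                Q_new[s, a] = sum(p * (r + gamma * Q[s_next].max())
                                  for p, s_next, r in P[s][a])
        if np.abs(Q_new - Q).max() < tol:
            return Q_new
        Q = Q_new

# The greedy policy then picks argmax over Q[s, :] in each state.
```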

Model-free Reinforcement Learning Motivation

  • Context: Many systems lack knowledge of the MDP or the reward model.
  • Approach: Learning through trial and error, similar to biological processes.

Monte Carlo Learning

  • Description: An episodic learning algorithm requiring an entire episode to learn.
  • Process:
    • Enacts a selected policy and computes the cumulative reward, discounted by gamma (γ).
    • Credits that discounted return to every state visited during the episode.
  • Efficiency:
    • Inefficient, because every visited state receives equal credit for the return, but the estimates are unbiased (a minimal sketch follows this list).
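
A minimal sketch of episodic (every-visit) Monte Carlo value estimation, assuming a `run_episode(policy)` helper that returns a list of `(state, reward)` pairs, where `reward` is received on leaving that state; the helper and its interface are assumptions for illustration.

```python
from collections import defaultdict

def monte_carlo_values(run_episode, policy, gamma=0.99, n_episodes=1000):
    """Every-visit Monte Carlo estimation of V(s) under a fixed policy."""
    returns_sum = defaultdict(float)
    returns_cnt = defaultdict(int)
    for _ in range(n_episodes):
        episode = run_episode(policy)          # assumed: [(state, reward), ...]
        G = 0.0
        # Walk the episode backwards, accumulating the discounted return
        for state, reward in reversed(episode):
            G = reward + gamma * G
            returns_sum[state] += G
            returns_cnt[state] += 1
    return {s: returns_sum[s] / returns_cnt[s] for s in returns_sum}
```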

Temporal Difference Learning (TD Learning)

  • Importance: A significant advance that weighs recent events more heavily than past ones.
  • Example: TD(0) updates the value function from the current reward and the predicted value of the next state (see the sketch after this list):
    • TD Target: the reward plus the discounted value of the next state, r + γV(s').
    • TD Error: the difference between the TD target and the current value estimate V(s).
  • Connection to Biology: Potential biological parallels, linking dopamine release and neural connection strengthening to TD errors.
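
A minimal TD(0) update sketch; `V` is a table (dict or array) of value estimates and `alpha` a learning rate, both illustrative names rather than anything fixed by the lecture.

```python
def td0_update(V, state, reward, next_state, alpha=0.1, gamma=0.99, done=False):
    """One TD(0) step: move V(state) toward the TD target r + gamma * V(next_state)."""
    td_target = reward + (0.0 if done else gamma * V[next_state])
    td_error = td_target - V[state]   # the prediction error discussed above
    V[state] += alpha * td_error
    return td_error
```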

Advanced TD Learning Variants

  • N-Step TD Learning: Uses the cumulative reward over several future steps before bootstrapping on the value estimate.
  • TD(λ): Combines all n-step returns with exponential weighting controlled by a parameter λ (0 ≤ λ ≤ 1); the standard formulas follow this list.
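
In standard notation, the n-step return and the λ-return that TD(λ) averages over can be written as:

```latex
G_t^{(n)} = r_{t+1} + \gamma r_{t+2} + \dots + \gamma^{n-1} r_{t+n} + \gamma^{n} V(s_{t+n}),
\qquad
G_t^{\lambda} = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{\,n-1}\, G_t^{(n)}.
```

With λ = 0 this reduces to the TD(0) target, and as λ → 1 it approaches the Monte Carlo return.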

Q-Learning

  • Framework: Off-policy TD learning extended to the Q function.
  • Core Concept: Updates the Q function using TD targets built from experienced transitions, without requiring direct execution of the optimal policy (a minimal update sketch follows this list).
    • The target maximizes Q over the next actions to update the current estimate of future rewards.
  • Advantages:
    • Learn from sub-optimal actions,
    • Replay past experiences.
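
A minimal tabular Q-learning update sketch; the `Q` array layout (states × actions) and the hyperparameter names are assumptions for illustration.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99, done=False):
    """Off-policy TD update: the target maximizes over next actions,
    regardless of which action the behaviour policy takes next."""
    target = r + (0.0 if done else gamma * np.max(Q[s_next]))
    td_error = target - Q[s, a]
    Q[s, a] += alpha * td_error
    return td_error
```

Because the target does not depend on the action actually taken in the next state, stored transitions (s, a, r, s') can be replayed later without invalidating the update.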

Comparison with SARSA

  • SARSA Overview: On-policy TD learning for the Q function; its update uses the action actually taken, so learning the optimal Q requires following the best known policy.
  • Key Differences:
    1. Q-learning uses a max over possible next actions in its update target, while SARSA uses the action actually selected by the current policy.
    2. SARSA's estimates track the policy being followed, so they may degrade as estimates of the optimal Q when actions deviate from the optimal policy (the two targets are written side by side below).
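
The difference shows up directly in the two update targets (standard notation):

```latex
\text{Q-learning target: } r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a'),
\qquad
\text{SARSA target: } r_{t+1} + \gamma\, Q(s_{t+1}, a_{t+1}).
```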

Exploration Strategies

  • Epsilon-Greedy: Balances exploitation and exploration by selecting a random action with probability epsilon (ε) and the greedy action otherwise.
  • Annealing: Start with high exploration (large ε) and gradually lower it as learning progresses (a small sketch follows this list).
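
A minimal ε-greedy selection with a linear annealing schedule; the schedule shape and default values are illustrative assumptions, not prescriptions from the lecture.

```python
import numpy as np

def epsilon_greedy(Q, s, epsilon, rng=None):
    """With probability epsilon pick a uniformly random action, otherwise the greedy one."""
    rng = rng if rng is not None else np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))   # explore
    return int(np.argmax(Q[s]))                # exploit

def annealed_epsilon(step, eps_start=1.0, eps_end=0.05, decay_steps=10_000):
    """Linearly anneal epsilon from eps_start down to eps_end over decay_steps."""
    frac = min(step / decay_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)
```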

Summary

  • Learning Without a Model: Applies dynamic-programming-inspired updates using experience gathered directly from the environment, without a known MDP.
  • Strategic Exploration: Balancing risk versus reward in the exploration strategy improves the overall learning process.
  • Future Directions: Applications and advanced techniques, including deep reinforcement learning, will follow in upcoming lectures.