Lecture Notes on Reinforcement Learning
Jul 26, 2024
Reinforcement Learning Lecture Notes
Overview
Topic:
Model-free reinforcement learning focused on gradient-free methods.
Main Technique:
Q-learning, an off-policy gradient-free method, will be covered.
Connections:
Deep learning has ignited recent interest in reinforcement learning.
Key Concepts Recap
Quality Function (Q)
Definition:
It assesses the joint quality of taking action a in the current state s.
Components:
Value function (v):
Value of being in state s, assuming the best action is taken.
Goal:
Q provides richer information about the quality of being in state s for any action.
Markov Decision Processes (MDP):
The Q function can be defined within this probabilistic framework.
Expectation:
The Q function is an expectation over future rewards, accounting for the randomness of the environment's transitions.
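For reference, one standard way to write these quantities for a policy π with discount factor γ (notation assumed, not quoted from the lecture):

```latex
\begin{align*}
V^{\pi}(s)   &= \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^{t} r_{t} \,\Big|\, s_{0}=s,\ \pi\Big] \\
Q^{\pi}(s,a) &= \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^{t} r_{t} \,\Big|\, s_{0}=s,\ a_{0}=a,\ \pi\Big] \\
V^{*}(s)     &= \max_{a} Q^{*}(s,a)
\end{align*}
```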
Bellman Equation and Dynamic Programming
Assumptions:
A known MDP model (i.e., system's evolution/transition probabilities and rewards).
Dynamic Programming Techniques:
Policy iteration and value iteration depend on knowing the MDP.
Policy (π):
Determines actions that maximize the Q function's value.
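Under the known-MDP assumption, the Bellman optimality equation that these dynamic programming methods solve can be written (in standard notation, with transition probabilities P and rewards R assumed known) as:

```latex
\begin{align*}
Q^{*}(s,a) &= \sum_{s'} P(s' \mid s, a)\,\Big[R(s,a,s') + \gamma \max_{a'} Q^{*}(s',a')\Big] \\
\pi^{*}(s) &= \arg\max_{a} Q^{*}(s,a)
\end{align*}
```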
Model-free Reinforcement Learning Motivation
Context:
Many systems lack knowledge of the MDP or the reward model.
Approach:
Learning through trial and error, similar to biological processes.
Monte Carlo Learning
Description:
An episodic learning algorithm: an entire episode must be completed before learning can occur.
Process:
Enacts a selected policy and computes the cumulative reward, discounted by gamma (γ).
Distributes the total reward among all states visited in the episode (see the sketch below).
Efficiency:
Inefficient, because all visited states are weighted equally, but the resulting estimates are unbiased.
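The sketch below is one minimal way to implement this; it assumes a Gymnasium-style env with reset()/step(), a policy(state) callable, and illustrative hyper-parameters, none of which come from the lecture.

```python
from collections import defaultdict

# Minimal every-visit Monte Carlo value estimation sketch (assumed
# Gymnasium-style env and a policy(state) placeholder).
def monte_carlo_values(env, policy, episodes=1000, gamma=0.9, alpha=0.1):
    V = defaultdict(float)
    for _ in range(episodes):
        # Run one full episode before learning anything (episodic requirement).
        state, _ = env.reset()
        trajectory = []
        done = False
        while not done:
            action = policy(state)
            next_state, reward, terminated, truncated, _ = env.step(action)
            trajectory.append((state, reward))
            state = next_state
            done = terminated or truncated
        # Walk backwards, accumulating the discounted return G,
        # and nudge each visited state's value toward its return.
        G = 0.0
        for state, reward in reversed(trajectory):
            G = reward + gamma * G
            V[state] += alpha * (G - V[state])
    return V
```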
Temporal Difference Learning (TD Learning)
Importance:
A significant advance: the value estimate is updated at every step, weighing recent events more heavily than distant past ones.
Example:
TD(0) updates the value function based on current rewards and next state predictions:
TD Target:
The immediate reward plus the discounted value estimate of the next state: r + γV(s').
TD Error:
The difference between the TD target and the current value estimate: δ = r + γV(s') − V(s) (see the TD(0) sketch below).
Connection to Biology:
Potential biological parallels, linking dopamine release and neural connection strengthening to TD errors.
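A minimal TD(0) sketch under the same assumptions as the Monte Carlo example (Gymnasium-style env, a policy(state) callable, illustrative hyper-parameters):

```python
from collections import defaultdict

# Minimal TD(0) value estimation sketch; updates happen at every step,
# no need to wait for the episode to finish.
def td0_values(env, policy, episodes=1000, gamma=0.9, alpha=0.1):
    V = defaultdict(float)
    for _ in range(episodes):
        state, _ = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            # TD target: immediate reward plus discounted next-state estimate.
            target = reward + (0.0 if done else gamma * V[next_state])
            # TD error: target minus current estimate drives the update.
            V[state] += alpha * (target - V[state])
            state = next_state
    return V
```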
Advanced TD Learning Variants
N-Step TD Learning:
Uses the cumulative reward over the next n steps, plus the bootstrapped value of the state reached after n steps.
TD(λ):
Combines n-step returns for all n, with exponentially decaying weights controlled by a parameter λ (0 < λ < 1).
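Written out in standard notation (assumed, not quoted from the lecture), the n-step return and the λ-return it is averaged into are:

```latex
\begin{align*}
G_{t}^{(n)}     &= r_{t+1} + \gamma r_{t+2} + \dots + \gamma^{\,n-1} r_{t+n} + \gamma^{\,n} V(s_{t+n}) \\
G_{t}^{\lambda} &= (1-\lambda) \sum_{n=1}^{\infty} \lambda^{\,n-1} G_{t}^{(n)}
\end{align*}
```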
Q-Learning
Framework:
Off-policy TD learning extending to the Q function.
Core Concept:
Updates the Q function using TD targets built from past experiences, independent of directly executing the optimal policy.
The maximum of Q over actions at the next state is used to update current beliefs about future rewards.
Advantages:
Can learn from sub-optimal or exploratory actions.
Can replay past experiences (see the sketch below).
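A minimal tabular Q-learning sketch, assuming a Gymnasium-style environment with a discrete action space and an epsilon-greedy behaviour policy; all names and hyper-parameters are illustrative, not taken from the lecture:

```python
import random
from collections import defaultdict

def q_learning(env, episodes=5000, gamma=0.9, alpha=0.1, epsilon=0.1):
    n_actions = env.action_space.n
    Q = defaultdict(lambda: [0.0] * n_actions)
    for _ in range(episodes):
        state, _ = env.reset()
        done = False
        while not done:
            # Behaviour policy may act sub-optimally (exploration) ...
            if random.random() < epsilon:
                action = random.randrange(n_actions)
            else:
                action = max(range(n_actions), key=lambda a: Q[state][a])
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            # ... but the TD target maximizes over next actions (off-policy).
            best_next = 0.0 if done else max(Q[next_state])
            Q[state][action] += alpha * (reward + gamma * best_next - Q[state][action])
            state = next_state
    return Q
```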
Comparison with SARSA
SARSA Overview:
On-policy TD learning, which requires the agent to follow and learn from its current (best-known) policy.
Key Differences:
Q-learning uses a max operation over potential actions for updates while SARSA uses the current policy's action.
SARSA's estimates may degrade if the executed actions deviate from the policy being evaluated.
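Side by side, the two update rules differ only in how the next-state value enters the TD target (α is the learning rate; a' is the action SARSA actually takes in s'):

```latex
\begin{align*}
\text{Q-learning:} \quad & Q(s,a) \leftarrow Q(s,a) + \alpha\Big[r + \gamma \max_{a'} Q(s',a') - Q(s,a)\Big] \\
\text{SARSA:}      \quad & Q(s,a) \leftarrow Q(s,a) + \alpha\Big[r + \gamma\, Q(s',a') - Q(s,a)\Big]
\end{align*}
```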
Exploration Strategies
Epsilon-Greedy:
Balances exploitation and exploration: a random action is selected with probability epsilon (ε), the best-known action otherwise.
Annealing:
Starting with high exploration (large epsilon) and gradually lowering it as learning progresses.
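A minimal sketch of epsilon-greedy selection with annealing; the decay schedule and default values are illustrative assumptions, not from the lecture:

```python
import random

def epsilon_greedy(q_values, epsilon):
    """Pick a random action with probability epsilon, else the greedy one."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def annealed_epsilon(episode, eps_start=1.0, eps_end=0.05, decay=0.995):
    """Start with heavy exploration and decay epsilon toward a small floor."""
    return max(eps_end, eps_start * decay ** episode)
```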
Summary
Learning Without a Model:
Applies dynamic-programming-inspired update rules using experience gathered directly from the environment, without a model.
Strategic Exploration:
Balancing risk versus reward in learning strategies enhances the overall learning process.
Future Directions:
Applications and advanced techniques, including deep reinforcement learning concepts, will follow in upcoming lectures.