What is the primary goal in reinforcement learning?
The primary goal is for the agent to learn to select actions in an environment so as to maximize the expected cumulative reward over time.
How does experience replay contribute to the efficiency of Deep Q-Learning?
Experience replay stores (state, action, reward, next state) transitions and samples random mini-batches from this replay memory, breaking the correlation among consecutive samples and improving learning stability.
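As a rough illustration, here is a minimal replay buffer sketch in Python; names such as `ReplayBuffer` and `capacity` are illustrative, not taken from the original notes:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size store of (state, action, reward, next_state, done) transitions."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are dropped automatically

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random sampling breaks the temporal correlation
        # between consecutive transitions.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```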
How does Monte Carlo Tree Search (MCTS) integrate with reinforcement learning in systems like AlphaGo?
MCTS is used to explore possible future game moves and their outcomes efficiently, combined with policy and value networks to evaluate moves and guide the decision-making process in a more informed way.
Describe the components of a Markov Decision Process (MDP).
An MDP consists of a state set `S`, action set `A`, reward function `R`, transition probability `P`, and a discount factor `γ`.
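For concreteness, a toy two-state MDP could be written out explicitly like this; the states, actions, and numbers are made up purely for illustration:

```python
# A hypothetical two-state MDP spelled out as plain Python data.
states = ["s0", "s1"]
actions = ["stay", "move"]
gamma = 0.9  # discount factor

# P[(s, a)] lists (next_state, probability) pairs for that state-action pair.
P = {
    ("s0", "stay"): [("s0", 1.0)],
    ("s0", "move"): [("s1", 0.8), ("s0", 0.2)],
    ("s1", "stay"): [("s1", 1.0)],
    ("s1", "move"): [("s0", 1.0)],
}

# R[(s, a)] is the expected immediate reward for taking action a in state s.
R = {
    ("s0", "stay"): 0.0,
    ("s0", "move"): 1.0,
    ("s1", "stay"): 2.0,
    ("s1", "move"): 0.0,
}
```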
Explain what the discount factor (γ) signifies in reinforcement learning.
The discount factor (γ) determines how future rewards are weighted compared to immediate rewards, with a value between 0 and 1. A higher γ values future rewards more.
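A quick numeric sketch of how γ down-weights later rewards (the reward values are chosen purely for illustration):

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over a reward sequence."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# With gamma = 0.9, a reward of 1 received three steps from now
# contributes only 0.9**3 = 0.729 to the return.
print(discounted_return([1, 1, 1, 1], gamma=0.9))  # 1 + 0.9 + 0.81 + 0.729 = 3.439
```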
Explain the difference between the value function (V) and the Q-value function (Q).
The value function (V) represents the expected cumulative reward from a state `s`, while the Q-value function (Q) represents the expected cumulative reward from a state-action pair `(s,a)`.
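In standard notation, the two quantities are usually written as:

```latex
V^{\pi}(s) = \mathbb{E}_{\pi}\left[ \sum_{t \ge 0} \gamma^{t} r_{t} \;\middle|\; s_{0} = s \right]
\qquad
Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[ \sum_{t \ge 0} \gamma^{t} r_{t} \;\middle|\; s_{0} = s,\ a_{0} = a \right]
```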
What is a baseline function in policy gradient methods, and why is it used?
A baseline function reduces the variance of policy gradient estimates by subtracting a reference value (typically an estimate of the state value) from the sampled returns, making learning more stable and efficient without biasing the gradient.
What role does the policy network play in the AlphaGo system?
In AlphaGo, the policy network is initialized from expert moves and continuously updated using policy gradients through self-play, ultimately guiding the agent's decision-making in the game of Go.
What is the purpose of the Bellman Equation in reinforcement learning?
The Bellman Equation is used to recursively define the value of state-action pairs, helping to find the optimal policy by breaking down the value function into immediate rewards and expected future rewards.
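For the Q-value function, the Bellman optimality equation is commonly written as:

```latex
Q^{*}(s, a) = \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\left[ R(s, a) + \gamma \max_{a'} Q^{*}(s', a') \right]
```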
Why is it beneficial to use convolutional layers in the network architecture for Deep Q-Learning in visual-based environments?
Convolutional layers are effective at capturing spatial hierarchies and patterns in visual data, which is critical for tasks like playing Atari games where the state is represented by raw game pixels.
How does the network architecture of Deep Q-Learning typically look for tasks like Atari games?
It typically consists of convolutional layers followed by a fully-connected layer that outputs Q-values for each possible action in the game environment.
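A minimal sketch of such a network, assuming PyTorch; the layer sizes follow the commonly cited Atari DQN setup (4 stacked 84×84 grayscale frames) but are illustrative rather than prescriptive:

```python
import torch
import torch.nn as nn

class AtariQNetwork(nn.Module):
    """Convolutional Q-network: stacked frames in, one Q-value per action out."""

    def __init__(self, num_actions, in_channels=4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),  # 7x7 feature map for 84x84 inputs
            nn.Linear(512, num_actions),
        )

    def forward(self, x):
        return self.head(self.features(x))

# q_values = AtariQNetwork(num_actions=6)(torch.zeros(1, 4, 84, 84))  # shape (1, 6)
```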
What are some challenges associated with Q-learning?
Challenges with Q-learning include high computational cost for large state-action spaces, the difficulty of approximating Q-values with function approximators, and poor scalability to complex continuous-control problems such as robotics.
What does 'variance reduction' mean in the context of policy gradients, and how can it be achieved?
Variance reduction aims to make policy gradient estimates more stable and accurate. It can be achieved by weighting each action only by the rewards that follow it (reward-to-go), applying a discount factor, and subtracting a baseline function, as in the sketch below.
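A small Python sketch of reward-to-go plus a simple baseline; the helper name and the numbers are illustrative:

```python
def rewards_to_go(rewards, gamma=0.99):
    """Discounted sum of future rewards from each time step onward."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Subtracting a baseline (here, the mean return) further reduces variance
# without biasing the gradient estimate.
rtg = rewards_to_go([0.0, 0.0, 1.0])
baseline = sum(rtg) / len(rtg)
advantages = [g - baseline for g in rtg]
```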
Describe the REINFORCE algorithm in policy gradients.
The REINFORCE algorithm uses gradient ascent on the policy by adjusting its parameters (θ) based on sampled trajectories to directly maximize the expected cumulative reward.
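A minimal sketch of the REINFORCE surrogate loss, assuming PyTorch and a hypothetical `policy` network that maps states to action logits; all names are illustrative:

```python
import torch

def reinforce_loss(policy, states, actions, returns):
    """Surrogate loss whose gradient is the REINFORCE policy gradient estimate."""
    logits = policy(states)                               # (batch, num_actions)
    log_probs = torch.log_softmax(logits, dim=-1)
    chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    # Minimizing -E[log pi(a|s) * G] performs gradient ascent on expected return.
    return -(chosen * returns).mean()
```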
What are actor-critic methods in reinforcement learning?
Actor-critic methods combine policy gradients with value-based learning (as in Q-learning), optimizing both the policy (the actor) and a value function (the critic) to improve sample efficiency and stability; see the sketch below.
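A rough one-step advantage actor-critic sketch, again assuming PyTorch; `actor` and `critic` are hypothetical networks returning action logits and state values respectively:

```python
import torch

def actor_critic_losses(actor, critic, states, actions, rewards,
                        next_states, dones, gamma=0.99):
    """One-step advantage actor-critic losses (illustrative sketch)."""
    values = critic(states).squeeze(-1)
    with torch.no_grad():
        next_values = critic(next_states).squeeze(-1)
        targets = rewards + gamma * next_values * (1 - dones)  # TD target
    advantages = targets - values

    log_probs = torch.log_softmax(actor(states), dim=-1)
    chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)

    actor_loss = -(chosen * advantages.detach()).mean()  # policy gradient with critic baseline
    critic_loss = advantages.pow(2).mean()               # regress values toward TD targets
    return actor_loss, critic_loss
```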