Understanding Multi-Armed Bandit Techniques

Aug 22, 2024

Multi-Armed Bandit Problem - Chapter 2 Notes

Introduction to Multi-Armed Bandit Problem

  • Explore the fundamentals of the multi-armed bandit problem.
  • Focus on exploration vs. exploitation.
  • Techniques covered:
    • Basic approaches and the Epsilon-Greedy policy.
    • Upper Confidence Bound (UCB) algorithm.
    • Thompson Sampling algorithm.

Key Concepts

Exploration vs. Exploitation

  • Exploration: Trying out new options (e.g., new restaurants).
  • Exploitation: Utilizing existing knowledge (e.g., returning to a favorite restaurant).
  • Important to balance both for long-term strategy.
    • Sacrificing some short-term gains for better long-term rewards.

Multi-Armed Bandit Problem Overview

  • Scenario: Facing multiple slot machines (bandits) with different winning probabilities.
  • Goal: Identify the best strategy to maximize long-term rewards by playing different machines.
  • With infinitely many trials, each machine's winning probability could be estimated exactly; in practice the number of plays is limited.
  • The problem has a single state, and the reward distributions are stationary (they do not change over time).
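
A minimal sketch of this setup in Python (the class name, seed, and win probabilities below are illustrative assumptions, not values from the lecture):

```python
import numpy as np

class BernoulliBandit:
    """A single-state bandit: each arm pays 1 with a fixed, hidden probability."""

    def __init__(self, win_probs, seed=0):
        self.win_probs = np.asarray(win_probs)  # hidden per-machine win probabilities
        self.rng = np.random.default_rng(seed)

    @property
    def n_arms(self):
        return len(self.win_probs)

    def pull(self, arm):
        # Reward is 1 (win) or 0 (loss), drawn from the chosen machine's distribution.
        return int(self.rng.random() < self.win_probs[arm])

# Three slot machines with different winning probabilities (unknown to the player).
bandit = BernoulliBandit([0.25, 0.55, 0.40])
```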

Basic Approaches

  1. Play each machine a fixed number of times - naive and inefficient.
  2. Random exploration - lacks strategy and knowledge accumulation.
  3. Epsilon-Greedy Policy:
    • Randomly explore with a probability of epsilon and exploit known rewards otherwise.
    • Epsilon might decay over time to shift from exploration to exploitation.

Epsilon-Greedy Strategy

  • Each round, generate a random number: if it falls below epsilon, explore; otherwise exploit the best-known action.
  • Epsilon decay: Start with high exploration and gradually reduce it as knowledge increases.
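
In code, the decision rule and the decay step might look like this small sketch (epsilon, the decay rate, and the floor value are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n_arms = 3
q_values = np.zeros(n_arms)                      # current reward estimates (all zero here)
epsilon, decay, min_epsilon = 1.0, 0.995, 0.01   # illustrative hyperparameters

# Generate a random number: explore with probability epsilon, exploit otherwise.
if rng.random() < epsilon:
    arm = int(rng.integers(n_arms))              # explore: pick a random machine
else:
    arm = int(np.argmax(q_values))               # exploit: pick the best machine so far

# Epsilon decay: exploration becomes slightly less likely each round.
epsilon = max(min_epsilon, epsilon * decay)
print(arm, epsilon)
```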

Updating Actions and Rewards

  • Maintain a Q-value for each action as the running average of its observed rewards.
  • Argmax: retrieve the index of the action with the highest Q-value.
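
Putting the pieces together, a complete epsilon-greedy run over the BernoulliBandit sketch above might look as follows; the Q-value is kept as an incremental running average, and np.argmax retrieves the index of the best action (hyperparameters are illustrative):

```python
import numpy as np

def epsilon_greedy(bandit, n_rounds=1000, epsilon=1.0, decay=0.995, min_epsilon=0.01, seed=1):
    """Epsilon-greedy over the BernoulliBandit sketch above (illustrative hyperparameters)."""
    rng = np.random.default_rng(seed)
    counts = np.zeros(bandit.n_arms)     # N(a): times each machine was played
    q_values = np.zeros(bandit.n_arms)   # Q(a): running average reward of each machine
    total_reward = 0

    for _ in range(n_rounds):
        if rng.random() < epsilon:
            arm = int(rng.integers(bandit.n_arms))   # explore
        else:
            arm = int(np.argmax(q_values))           # exploit

        reward = bandit.pull(arm)
        total_reward += reward

        # Incremental average: Q(a) <- Q(a) + (reward - Q(a)) / N(a).
        counts[arm] += 1
        q_values[arm] += (reward - q_values[arm]) / counts[arm]

        epsilon = max(min_epsilon, epsilon * decay)  # shift toward exploitation

    return q_values, total_reward

q, total = epsilon_greedy(bandit)
print(q, total)
```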

Challenges in Epsilon-Greedy

  • Tuning the epsilon value and decay rate can be complex and crucial for performance.

Upper Confidence Bound (UCB)

  • A smarter approach that favors actions whose estimates are still uncertain and could plausibly turn out to be optimal.
  • Calculate a confidence bound for each action based on play frequency and reward estimates.
  • UCB Equation:
    • UCB(a) = Q(a) + sqrt((2 * log(T)) / N(a))
      • Q(a): estimated value of action a
      • T: total number of trials so far
      • N(a): number of times action a has been chosen

Thompson Sampling Algorithm

  • Chooses actions probabilistically by sampling from a Beta distribution maintained for each action.
  • Parameters: alpha (number of successful trials) and beta (number of unsuccessful trials) are updated as each machine wins or loses.
  • Each round, sample a value for every action and play the action with the highest sample; over time the samples concentrate on the best machine, leading to better decision-making.
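
A sketch of Thompson Sampling for the same Bernoulli setting, keeping one Beta(alpha, beta) distribution per machine; starting from a uniform Beta(1, 1) prior is an assumption of this sketch:

```python
import numpy as np

def thompson_sampling(bandit, n_rounds=1000, seed=2):
    """Thompson Sampling with Beta posteriors over the BernoulliBandit sketch above."""
    rng = np.random.default_rng(seed)
    alpha = np.ones(bandit.n_arms)   # 1 + number of successful trials per machine
    beta = np.ones(bandit.n_arms)    # 1 + number of unsuccessful trials per machine
    total_reward = 0

    for _ in range(n_rounds):
        # Sample a plausible win rate for every machine and play the highest sample.
        samples = rng.beta(alpha, beta)
        arm = int(np.argmax(samples))

        reward = bandit.pull(arm)
        total_reward += reward

        # Update the chosen machine's distribution: a win raises alpha, a loss raises beta.
        alpha[arm] += reward
        beta[arm] += 1 - reward

    return alpha, beta, total_reward

print(thompson_sampling(bandit))
```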

Conclusion

  • Reviewed the concepts of the multi-armed bandit problem, Epsilon-Greedy, UCB, and Thompson Sampling.
  • Assignment upcoming: Compare the performance of these techniques in a practical implementation.
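
As a hypothetical starting point for that comparison (not the assignment itself), the three sketches above can be run on the same bandit and their total rewards compared:

```python
probs = [0.25, 0.55, 0.40]  # same illustrative machines as above

for name, algo in [("Epsilon-Greedy", epsilon_greedy),
                   ("UCB", ucb),
                   ("Thompson Sampling", thompson_sampling)]:
    result = algo(BernoulliBandit(probs, seed=0))
    print(f"{name:18s} total reward over 1000 rounds: {result[-1]}")
```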