Understanding Multi-Armed Bandit Techniques

Aug 22, 2024

Multi-Armed Bandit Problem - Chapter 2 Notes

Introduction to Multi-Armed Bandit Problem

  • Explore the fundamentals of the multi-armed bandit problem.
  • Focus on exploration vs. exploitation.
  • Techniques covered:
    • Basic approaches and the Epsilon-Greedy policy.
    • Upper Confidence Bound (UCB) algorithm.
    • Thompson Sampling algorithm.

Key Concepts

Exploration vs. Exploitation

  • Exploration: Trying out new options (e.g., new restaurants).
  • Exploitation: Utilizing existing knowledge (e.g., returning to a favorite restaurant).
  • Important to balance both for long-term strategy.
    • Sacrificing some short-term gains for better long-term rewards.

Multi-Armed Bandit Problem Overview

  • Scenario: Facing multiple slot machines (bandits) with different winning probabilities.
  • Goal: Identify the best strategy to maximize long-term rewards by playing different machines.
  • With infinitely many trials, each machine's winning probability could be estimated exactly; in practice the number of plays is limited.
  • The problem has a single state, and the reward distributions are stationary (they do not change over time).
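
A minimal sketch of this setup in Python (the class name, seed, and win probabilities below are illustrative assumptions, not values from the lecture):

```python
import numpy as np

class BernoulliBandit:
    """A single-state bandit: each arm pays 1 with a fixed, hidden probability."""

    def __init__(self, win_probs, seed=0):
        self.win_probs = np.asarray(win_probs)  # hidden per-machine win probabilities
        self.rng = np.random.default_rng(seed)

    @property
    def n_arms(self):
        return len(self.win_probs)

    def pull(self, arm):
        # Reward is 1 (win) or 0 (loss), drawn from the chosen machine's distribution.
        return int(self.rng.random() < self.win_probs[arm])

# Three slot machines with different winning probabilities (unknown to the player).
bandit = BernoulliBandit([0.25, 0.55, 0.40])
```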

Basic Approaches

  1. Play each machine a fixed number of times - naive and inefficient.
  2. Random exploration - lacks strategy and knowledge accumulation.
  3. Epsilon-Greedy Policy:
    • Randomly explore with a probability of epsilon and exploit known rewards otherwise.
    • Epsilon might decay over time to shift from exploration to exploitation.

Epsilon-Greedy Strategy

  • Each round, generate a random number: if it falls below epsilon, explore; otherwise exploit the best-known action.
  • Epsilon decay: Start with high exploration and gradually reduce it as knowledge increases.
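
In code, the decision rule and the decay step might look like this small sketch (epsilon, the decay rate, and the floor value are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n_arms = 3
q_values = np.zeros(n_arms)                      # current reward estimates (all zero here)
epsilon, decay, min_epsilon = 1.0, 0.995, 0.01   # illustrative hyperparameters

# Generate a random number: explore with probability epsilon, exploit otherwise.
if rng.random() < epsilon:
    arm = int(rng.integers(n_arms))              # explore: pick a random machine
else:
    arm = int(np.argmax(q_values))               # exploit: pick the best machine so far

# Epsilon decay: exploration becomes slightly less likely each round.
epsilon = max(min_epsilon, epsilon * decay)
print(arm, epsilon)
```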

Updating Actions and Rewards

  • Maintain a Q-value for each action as the running average of its observed rewards.
  • Argmax: retrieve the index of the action with the highest Q-value.
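
Putting the pieces together, a complete epsilon-greedy run over the BernoulliBandit sketch above might look as follows; the Q-value is kept as an incremental running average, and np.argmax retrieves the index of the best action (hyperparameters are illustrative):

```python
import numpy as np

def epsilon_greedy(bandit, n_rounds=1000, epsilon=1.0, decay=0.995, min_epsilon=0.01, seed=1):
    """Epsilon-greedy over the BernoulliBandit sketch above (illustrative hyperparameters)."""
    rng = np.random.default_rng(seed)
    counts = np.zeros(bandit.n_arms)     # N(a): times each machine was played
    q_values = np.zeros(bandit.n_arms)   # Q(a): running average reward of each machine
    total_reward = 0

    for _ in range(n_rounds):
        if rng.random() < epsilon:
            arm = int(rng.integers(bandit.n_arms))   # explore
        else:
            arm = int(np.argmax(q_values))           # exploit

        reward = bandit.pull(arm)
        total_reward += reward

        # Incremental average: Q(a) <- Q(a) + (reward - Q(a)) / N(a).
        counts[arm] += 1
        q_values[arm] += (reward - q_values[arm]) / counts[arm]

        epsilon = max(min_epsilon, epsilon * decay)  # shift toward exploitation

    return q_values, total_reward

q, total = epsilon_greedy(bandit)
print(q, total)
```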

Challenges in Epsilon-Greedy

  • Tuning the epsilon value and decay rate can be complex and crucial for performance.

Upper Confidence Bound (UCB)

  • A smarter approach that favors actions whose estimates are still uncertain and could plausibly turn out to be optimal.
  • Calculate a confidence bound for each action based on play frequency and reward estimates.
  • UCB Equation:
    • UCB(a) = Q(a) + sqrt((2 * log(T)) / N(a))
      • Q(a): estimated value of action a
      • T: total number of trials so far
      • N(a): number of times action a has been chosen

Thompson Sampling Algorithm

  • Chooses actions probabilistically by sampling from a Beta distribution maintained for each action.
  • Parameters: alpha (number of successful trials) and beta (number of unsuccessful trials) are updated as each machine wins or loses.
  • Each round, sample a value for every action and play the action with the highest sample; over time the samples concentrate on the best machine, leading to better decision-making.
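
A sketch of Thompson Sampling for the same Bernoulli setting, keeping one Beta(alpha, beta) distribution per machine; starting from a uniform Beta(1, 1) prior is an assumption of this sketch:

```python
import numpy as np

def thompson_sampling(bandit, n_rounds=1000, seed=2):
    """Thompson Sampling with Beta posteriors over the BernoulliBandit sketch above."""
    rng = np.random.default_rng(seed)
    alpha = np.ones(bandit.n_arms)   # 1 + number of successful trials per machine
    beta = np.ones(bandit.n_arms)    # 1 + number of unsuccessful trials per machine
    total_reward = 0

    for _ in range(n_rounds):
        # Sample a plausible win rate for every machine and play the highest sample.
        samples = rng.beta(alpha, beta)
        arm = int(np.argmax(samples))

        reward = bandit.pull(arm)
        total_reward += reward

        # Update the chosen machine's distribution: a win raises alpha, a loss raises beta.
        alpha[arm] += reward
        beta[arm] += 1 - reward

    return alpha, beta, total_reward

print(thompson_sampling(bandit))
```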

Conclusion

  • Reviewed the concepts of the multi-armed bandit problem, Epsilon-Greedy, UCB, and Thompson Sampling.
  • Assignment upcoming: Compare the performance of these techniques in a practical implementation.
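
As a hypothetical starting point for that comparison (not the assignment itself), the three sketches above can be run on the same bandit and their total rewards compared:

```python
probs = [0.25, 0.55, 0.40]  # same illustrative machines as above

for name, algo in [("Epsilon-Greedy", epsilon_greedy),
                   ("UCB", ucb),
                   ("Thompson Sampling", thompson_sampling)]:
    result = algo(BernoulliBandit(probs, seed=0))
    print(f"{name:18s} total reward over 1000 rounds: {result[-1]}")
```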