Exploring Entropy in Data Science

Mar 21, 2025

Understanding Entropy in Data Science

Introduction

  • Host: Josh Starmer
  • Topic: Entropy for data science
  • Assumes familiarity with expected values.

Applications of Entropy

  • Used in:
    • Building classification trees
    • Mutual information (relationship quantification)
    • Relative entropy (Kullback-Leibler distance) and cross entropy
    • Dimension reduction algorithms (e.g., t-SNE and UMAP)

Concept of Surprise

Example with Chickens

  • Types of chickens: Orange and Blue
  • StatSquatch organizes the chickens into areas A, B, and C:
    • Area A: 6 Orange, 1 Blue
      • High probability of picking Orange → low surprise
      • Low probability of picking Blue → high surprise
    • Area B: More Blue than Orange
      • Higher probability of picking Blue → low surprise
      • Lower probability of picking Orange → high surprise
    • Area C: Equal numbers of Orange and Blue
      • Equal probability → equal surprise

Relationship Between Probability and Surprise

  • Inverse relationship:
    • High probability → low surprise
    • Low probability → high surprise

Calculating Surprise

  • The simple inverse of the probability is inadequate: an event with probability 1 would get surprise 1/1 = 1, when it should get 0.

  • Instead, use the log of the inverse of the probability (a short sketch follows this list):

    Surprise = log(1 / p(x))

  • For a coin that always lands on heads: probability of heads = 1 → Surprise = log(1) = 0

  • Probability of tails = 0 → 1/0 is undefined, so the Surprise is undefined
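
  • A minimal Python sketch of this surprise calculation (assuming log base 2, which matches the coin-flip numbers below; the function name is just for illustration):

    import math

    def surprise(p):
        """Surprise = log2(1 / p); undefined when p == 0."""
        return math.log2(1 / p)

    print(surprise(1.0))  # 0.0 -> a certain event carries no surprise
    # surprise(0.0) raises ZeroDivisionError, i.e. the surprise is undefined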

Coin Flipping Example

  • Coin flips with probabilities:
    • Heads: 90%
    • Tails: 10%
  • Surprise for each outcome (using log base 2):
    • Heads → log2(1/0.9) ≈ 0.15
    • Tails → log2(1/0.1) ≈ 3.32
  • Total surprise for a sequence of flips is the sum of the per-flip surprises:
    • E.g., for 2 heads and 1 tail: 0.15 + 0.15 + 3.32 ≈ 3.62
  • Estimated total surprise for 100 flips (about 90 heads and 10 tails expected): 90 × 0.15 + 10 × 3.32 ≈ 46.7 (see the sketch below)
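
  • A sketch of the coin-flip arithmetic (log base 2 assumed; the 3.62 and 46.7 above come from using the rounded values 0.15 and 3.32):

    import math

    p_heads, p_tails = 0.9, 0.1
    s_heads = math.log2(1 / p_heads)   # ≈ 0.15
    s_tails = math.log2(1 / p_tails)   # ≈ 3.32

    # Total surprise for 2 heads and 1 tail
    print(2 * s_heads + 1 * s_tails)   # ≈ 3.63 (3.62 with the rounded values)

    # Total surprise for 100 flips: expect about 90 heads and 10 tails
    print(90 * s_heads + 10 * s_tails) # ≈ 46.9 (46.7 with the rounded values)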

Entropy Calculation

  • Average surprise per flip:
    • Entropy = Total Surprise / Number of Flips
    • Example: 46.7 / 100 ≈ 0.47 (see the sketch after this list)
  • Definition:
    • Entropy is the expected value of surprise.
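
  • A sketch showing that the average surprise per flip is the same number as the expected value of surprise (log base 2 assumed):

    import math

    p_heads, p_tails = 0.9, 0.1
    s_heads = math.log2(1 / p_heads)
    s_tails = math.log2(1 / p_tails)

    total_surprise = 90 * s_heads + 10 * s_tails  # total for ~100 flips
    print(total_surprise / 100)                   # ≈ 0.47, average surprise per flip

    # The same number written as an expected value of surprise:
    print(p_heads * s_heads + p_tails * s_tails)  # ≈ 0.47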

Deriving Entropy

  • In Sigma notation, entropy is the sum over outcomes of (surprise for that outcome × probability of that outcome):

    Entropy = Σ p(x) * log(1 / p(x))

  • Rewriting log(1 / p(x)) as -log(p(x)) and pulling the minus sign out gives the standard entropy equation (a sketch follows below):

    Entropy = -Σ p(x) * log(p(x))

  • Claude Shannon's original version published in 1948.
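
  • A minimal Python sketch of the standard entropy equation (log base 2, so entropy is in bits; zero-probability terms are skipped, following the usual 0·log(0) = 0 convention):

    import math

    def entropy(probabilities):
        """Shannon entropy: -sum(p * log2(p)) over the outcome probabilities."""
        return -sum(p * math.log2(p) for p in probabilities if p > 0)

    print(entropy([0.9, 0.1]))  # ≈ 0.47, the coin-flip entropy from above
    print(entropy([0.5, 0.5]))  # 1.0, a fair coin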

Chicken Areas Entropy Calculation

  • Area A: 0.59
  • Area B: 0.44
  • Area C: 1.00
    • Highest entropy because the two colors are equally represented (see the sketch below).
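
  • A sketch that reproduces these numbers from the chicken counts (Area A uses the 6 orange / 1 blue split from above; Area B's counts are an assumption chosen to reproduce 0.44; any equal split gives 1.00 for Area C):

    import math

    def entropy_from_counts(counts):
        """Convert counts to probabilities, then apply -sum(p * log2(p))."""
        total = sum(counts)
        return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

    print(entropy_from_counts([6, 1]))   # Area A: ≈ 0.59
    print(entropy_from_counts([10, 1]))  # Area B (assumed counts): ≈ 0.44
    print(entropy_from_counts([7, 7]))   # Area C (any equal split): 1.00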

Key Takeaways

  • Entropy quantifies how similar or different the numbers of each outcome are in a collection (e.g., orange vs. blue chickens in an area).
  • Highest entropy occurs with equal representation of outcomes.

Conclusion

  • Encouraged to review statistics and machine learning offline:
    • StatQuest study guides available at statquest.org
  • Call to action: Subscribe and support StatQuest.

Final Note

  • Whispering the log of the inverse of the probability can surprise someone!