Decision Trees and Random Forests

Jul 28, 2024

Notes on Decision Trees and Random Forests

Introduction to Decision Trees

  • Definition: A tree-like model that makes predictions by applying a sequence of feature-based decisions; used for both classification and regression.
  • Decision trees can be used for:
    • Classification Problems: Classifying data into predefined classes.
    • Regression Problems: Predicting continuous target values.

Key Components of Decision Trees

  • Nodes: Each node holds a subset of the data; internal nodes apply a splitting rule, and leaf nodes produce predictions.
  • Metrics for Tree Development: Algorithms used to grow trees differ, but metrics such as Gini impurity and information entropy are common.
    • Classification: Gini impurity, Information Entropy.
    • Regression: Variance reduction.
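  • Example: for regression, variance reduction compares the variance of the targets in a node with the weighted variance of its children. A minimal Python sketch (the helper name variance_reduction is ours, not a library function):

    import numpy as np

    def variance_reduction(parent, left, right):
        # Reduction in target variance from splitting `parent` into `left` and `right`.
        # Each argument is a 1-D array of target values; child variances are
        # weighted by their share of the parent's samples.
        n = len(parent)
        weighted = (len(left) / n) * np.var(left) + (len(right) / n) * np.var(right)
        return np.var(parent) - weighted

    # Toy example: a split that separates low target values from high ones
    parent = np.array([1.0, 1.2, 0.9, 5.1, 4.8, 5.3])
    print(variance_reduction(parent, parent[:3], parent[3:]))  # large reduction -> good split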

Gini Impurity

  • Concept: A measure of how often a randomly chosen element from the set would be incorrectly labeled.
  • Gini Impurity Formula:

    Gini = 1 - Σ(F_i^2), where F_i is the fraction of samples in the node belonging to class i.

  • Pure Node: A node is pure when it contains samples from only one class (Gini impurity = 0).
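  • Example: the formula translates directly into a few lines of Python (gini_impurity below is an illustrative helper, not a library function):

    from collections import Counter

    def gini_impurity(labels):
        # 1 minus the sum of squared class fractions in this node
        n = len(labels)
        return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

    print(gini_impurity(["setosa"] * 10))                     # 0.0  (pure node)
    print(gini_impurity(["setosa"] * 5 + ["virginica"] * 5))  # 0.5  (evenly mixed, 2 classes)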

Decision Tree Example: Iris Flower Classification

  • Data Set: 149 samples of iris flowers classified into 3 species:
    • Setosa
    • Versicolor
    • Virginica
  • Features:
    • Sepal Length
    • Sepal Width
    • Petal Length
    • Petal Width

Building the Decision Tree

  1. Root Node: Contains all samples (50 Versicolor, 50 Virginica, 49 Setosa).
  2. Splitting Nodes: Each internal node applies a rule of the form "feature < threshold" to decide which branch a data point follows.
    • Example: If Petal Length < 2.4, go left; otherwise, go right.
  3. Split Calculation: Evaluate the Gini impurity of every candidate split (feature/threshold pair) and select the split that gives the largest weighted reduction in impurity (a scikit-learn sketch follows this list).
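  • In practice the split search is handled by a library. A minimal scikit-learn sketch (the exact thresholds the library learns may differ slightly from the example above):

    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier, export_text

    # Load the iris data (4 features, 3 species) and fit a tree using Gini impurity
    iris = load_iris()
    tree = DecisionTreeClassifier(criterion="gini", random_state=0)
    tree.fit(iris.data, iris.target)

    # Print the learned splits (feature/threshold pairs) as text
    print(export_text(tree, feature_names=iris.feature_names))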

Traversing the Decision Tree

  • A data point traverses the tree according to its feature values and the rule at each node.
  • End result: the leaf it reaches determines the predicted class (or value, for regression).
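  • Continuing the scikit-learn sketch above, prediction is exactly this traversal:

    import numpy as np

    # One new flower: sepal length, sepal width, petal length, petal width (cm)
    sample = np.array([[5.1, 3.5, 1.4, 0.2]])

    # The sample is routed through the learned splits until it reaches a leaf
    print(iris.target_names[tree.predict(sample)])  # -> ['setosa']
    print(tree.predict_proba(sample))               # class fractions in that leaf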

Random Forests

  • Concept: An ensemble method that constructs many decision trees and combines their predictions.
  • Purpose: Improve accuracy and robustness to noise and overfitting compared with a single tree.
  • How It Works:
    • Train each tree on a different random subset of the original data (a random sample of the rows and/or a random subset of the features), so the trees differ from one another.
    • Combine the trees' predictions: majority voting for classification, averaging for regression (see the sketch after this list).
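  • A minimal scikit-learn sketch of this idea (the parameter choices are illustrative):

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier

    iris = load_iris()

    # 100 trees, each grown on a bootstrap sample of the rows, with a random
    # subset of the features considered at every split
    forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
    forest.fit(iris.data, iris.target)

    # The final prediction is the majority vote across the 100 trees
    print(forest.predict([[5.1, 3.5, 1.4, 0.2]]))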

Key Advantages of Random Forests

  • Reduces variance: averaging over trees trained on different subsets of the data smooths out the errors of individual trees.
  • More robust against overfitting compared to a single decision tree.
  • Provides better generalization capability on unseen data.
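  • A quick way to see this is to compare cross-validated accuracy of a single tree and a forest (scores depend on the data set and random seeds; on an easy data set like iris the gap may be small):

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    iris = load_iris()
    models = [("single tree", DecisionTreeClassifier(random_state=0)),
              ("random forest", RandomForestClassifier(n_estimators=100, random_state=0))]

    # 5-fold cross-validation: train on 4 folds, score on the held-out fold
    for name, model in models:
        scores = cross_val_score(model, iris.data, iris.target, cv=5)
        print(name, round(scores.mean(), 3))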

Conclusion

  • Decision trees are foundational for both classification and regression tasks.
  • Understanding their construction helps in grasping more complex methods like random forests.
  • Leveraging libraries and tools simplifies building models in practice, but understanding the underlying mechanics is crucial for effective use.