Decision Trees and Random Forests

Jul 28, 2024

Notes on Decision Trees and Random Forests

Introduction to Decision Trees

  • Definition: A tree-like model that makes predictions by applying a sequence of feature-based decisions; used for both classification and regression.
  • Decision trees can be used for:
    • Classification Problems: Classifying data into predefined classes.
    • Regression Problems: Predicting continuous target values.

Key Components of Decision Trees

  • Nodes: Each node holds a subset of the data; internal nodes apply a splitting rule, and leaf nodes produce predictions.
  • Metrics for Tree Development: Algorithms used to grow trees differ, but metrics such as Gini impurity and information entropy are common.
    • Classification: Gini impurity, Information Entropy.
    • Regression: Variance reduction.
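  • Example: for regression, variance reduction compares the variance of the targets in a node with the weighted variance of its children. A minimal Python sketch (the helper name variance_reduction is ours, not a library function):

    import numpy as np

    def variance_reduction(parent, left, right):
        # Reduction in target variance from splitting `parent` into `left` and `right`.
        # Each argument is a 1-D array of target values; child variances are
        # weighted by their share of the parent's samples.
        n = len(parent)
        weighted = (len(left) / n) * np.var(left) + (len(right) / n) * np.var(right)
        return np.var(parent) - weighted

    # Toy example: a split that separates low target values from high ones
    parent = np.array([1.0, 1.2, 0.9, 5.1, 4.8, 5.3])
    print(variance_reduction(parent, parent[:3], parent[3:]))  # large reduction -> good split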

Gini Impurity

  • Concept: A measure of how often a randomly chosen element from the set would be incorrectly labeled.
  • Gini Impurity Formula:

    Gini = 1 - Σ(F_i^2), where F_i is the fraction of samples in the node belonging to class i.

  • Pure Node: A node is pure when it contains samples from only one class (Gini impurity = 0).
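  • Example: the formula translates directly into a few lines of Python (gini_impurity below is an illustrative helper, not a library function):

    from collections import Counter

    def gini_impurity(labels):
        # 1 minus the sum of squared class fractions in this node
        n = len(labels)
        return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

    print(gini_impurity(["setosa"] * 10))                     # 0.0  (pure node)
    print(gini_impurity(["setosa"] * 5 + ["virginica"] * 5))  # 0.5  (evenly mixed, 2 classes)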

Decision Tree Example: Iris Flower Classification

  • Data Set: 149 samples of iris flowers classified into 3 species:
    • Setosa
    • Versicolor
    • Virginica
  • Features:
    • Sepal Length
    • Sepal Width
    • Petal Length
    • Petal Width

Building the Decision Tree

  1. Root Node: Contains all samples (50 Versicolor, 50 Virginica, 49 Setosa).
  2. Splitting Nodes: Each internal node applies a rule of the form "feature < threshold" to decide which branch a data point follows.
    • Example: If Petal Length < 2.4, go left; otherwise, go right.
  3. Split Calculation: Evaluate the Gini impurity of every candidate split (feature/threshold pair) and select the split that gives the largest weighted reduction in impurity (a scikit-learn sketch follows this list).
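  • In practice the split search is handled by a library. A minimal scikit-learn sketch (the exact thresholds the library learns may differ slightly from the example above):

    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier, export_text

    # Load the iris data (4 features, 3 species) and fit a tree using Gini impurity
    iris = load_iris()
    tree = DecisionTreeClassifier(criterion="gini", random_state=0)
    tree.fit(iris.data, iris.target)

    # Print the learned splits (feature/threshold pairs) as text
    print(export_text(tree, feature_names=iris.feature_names))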

Traversing the Decision Tree

  • A data point traverses the tree according to its feature values and the rule at each node.
  • End result: the leaf it reaches determines the predicted class (or value, for regression).
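  • Continuing the scikit-learn sketch above, prediction is exactly this traversal:

    import numpy as np

    # One new flower: sepal length, sepal width, petal length, petal width (cm)
    sample = np.array([[5.1, 3.5, 1.4, 0.2]])

    # The sample is routed through the learned splits until it reaches a leaf
    print(iris.target_names[tree.predict(sample)])  # -> ['setosa']
    print(tree.predict_proba(sample))               # class fractions in that leaf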

Random Forests

  • Concept: An ensemble method that constructs many decision trees and combines their predictions.
  • Purpose: Improve accuracy and robustness to noise and overfitting compared with a single tree.
  • How It Works:
    • Train each tree on a different random subset of the original data (a random sample of the rows and/or a random subset of the features), so the trees differ from one another.
    • Combine the trees' predictions: majority voting for classification, averaging for regression (see the sketch after this list).
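  • A minimal scikit-learn sketch of this idea (the parameter choices are illustrative):

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier

    iris = load_iris()

    # 100 trees, each grown on a bootstrap sample of the rows, with a random
    # subset of the features considered at every split
    forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
    forest.fit(iris.data, iris.target)

    # The final prediction is the majority vote across the 100 trees
    print(forest.predict([[5.1, 3.5, 1.4, 0.2]]))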

Key Advantages of Random Forests

  • Reduces variance: averaging over trees trained on different subsets of the data smooths out the errors of individual trees.
  • More robust against overfitting compared to a single decision tree.
  • Provides better generalization capability on unseen data.
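  • A quick way to see this is to compare cross-validated accuracy of a single tree and a forest (scores depend on the data set and random seeds; on an easy data set like iris the gap may be small):

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    iris = load_iris()
    models = [("single tree", DecisionTreeClassifier(random_state=0)),
              ("random forest", RandomForestClassifier(n_estimators=100, random_state=0))]

    # 5-fold cross-validation: train on 4 folds, score on the held-out fold
    for name, model in models:
        scores = cross_val_score(model, iris.data, iris.target, cv=5)
        print(name, round(scores.mean(), 3))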

Conclusion

  • Decision trees are foundational for both classification and regression tasks.
  • Understanding their construction helps in grasping more complex methods like random forests.
  • Leveraging libraries and tools simplifies building models in practice, but understanding the underlying mechanics is crucial for effective use.