Hello people from the future, welcome to Normalize Nerd. Today we will set up our camp in the random forest. First we will see why the random forest algorithm is better than our good old decision trees and then I will explain how it works with visualization.
If you want to see more videos like this, please subscribe to my channel and hit the bell icon, because I make videos about machine learning and data science regularly. So without any further ado, let's get started. To begin our journey, we need a dataset. Here I am taking a small dataset with only 6 instances and 5 features. As you can see, the target variable y takes two values, 0 and 1. Hence, it's a binary classification problem.
First of all, we need to understand why we even need the random forest when we already have decision trees. Let's draw the decision tree for this dataset. Now, if you don't know what a decision tree really is or how it is trained, then I would highly recommend watching my previous video. In short, a decision tree splits the dataset recursively at the decision nodes until we are left with pure leaf nodes. And it finds the best split by maximizing the information gain, that is, the reduction in entropy.
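If you want to see that split criterion in code, here is a minimal sketch of entropy and information gain for a binary target. The column values and the threshold below are made up purely for illustration; they are not the dataset from the video.

```python
import numpy as np

def entropy(y):
    """Shannon entropy of a binary label array."""
    if len(y) == 0:
        return 0.0
    p = np.bincount(y, minlength=2) / len(y)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def information_gain(y, left_mask):
    """Parent entropy minus the weighted entropy of the two children."""
    left, right = y[left_mask], y[~left_mask]
    n = len(y)
    child = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(y) - child

# Made-up feature column and labels for 6 rows. A tree would try every
# candidate threshold on every feature and keep the split with the largest gain.
x0 = np.array([2, 7, 1, 8, 3, 9])
y = np.array([0, 1, 0, 1, 0, 1])
print(information_gain(y, x0 <= 3))  # 1.0 here, since this toy split is perfect
```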
If a data sample satisfies the condition at a decision node, then it moves to the left child; otherwise, it moves to the right. Finally, it reaches a leaf node, where a class label is assigned to it. So what's the problem with decision trees? Let's change our training data slightly.
Focus on the row with ID 1. We are changing the x0 and x1 features. Now, if we train our tree on this modified dataset, we will get a completely different tree. This shows us that decision trees are highly sensitive to the training data, which could result in high variance.
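If you want to see this effect for yourself, here is a quick sketch using scikit-learn (the video doesn't use it). The feature values are invented, deliberately arranged so that x0 and x1 carry the signal; the "modified" copy just nudges those two values for the row with ID 1.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Invented stand-in for the 6x5 toy dataset (not the values from the video).
X = np.array([[2, 5, 1, 0, 7],
              [7, 1, 1, 0, 7],
              [1, 6, 2, 1, 8],
              [8, 2, 4, 1, 8],
              [3, 7, 5, 2, 9],
              [9, 1, 5, 2, 9]], dtype=float)
y = np.array([0, 1, 0, 1, 0, 1])

names = [f"x{i}" for i in range(5)]
tree_a = DecisionTreeClassifier(random_state=0).fit(X, y)

# Change x0 and x1 of the row with ID 1, as in the video.
X_mod = X.copy()
X_mod[1, 0], X_mod[1, 1] = 0.5, 9.0
tree_b = DecisionTreeClassifier(random_state=0).fit(X_mod, y)

print(export_text(tree_a, feature_names=names))
print(export_text(tree_b, feature_names=names))  # a noticeably different, deeper tree
```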
So our model might fail to generalize. Here comes the random forest algorithm. It is a collection of multiple random decision trees, and it's much less sensitive to the training data.
You can guess that we use multiple trees, hence the name forest. But why is it called random? Keep this question in the back of your mind.
You will get the answer by the end of this video. Let me show you the process of creating a random forest. The first step is to build new datasets from our original data. To maintain simplicity, we will build only 4. We are gonna randomly select rows from the original data to build our new datasets.
And every dataset will contain the same number of rows as the original one. Here's the first dataset. Due to lack of space, I am writing only the row IDs. Notice that rows 2 and 5 came up more than once. That's because we are performing random sampling with replacement: after selecting a row, we put it back before drawing the next one. And here are the rest of the datasets. The process we just followed to create the new data is called bootstrapping; a minimal code sketch of it is shown below. Now, we will train a decision tree on each of the bootstrapped datasets independently.
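In code, that sampling step might look something like this. The row IDs 0 to 5 stand in for the six rows of the toy dataset, and the seed is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(42)   # arbitrary seed
n_rows, n_trees = 6, 4

# One bootstrap sample of row IDs per tree: same size as the original data,
# drawn with replacement, so duplicate rows are expected.
bootstrap_rows = [rng.integers(0, n_rows, size=n_rows) for _ in range(n_trees)]
for i, rows in enumerate(bootstrap_rows):
    print(f"bootstrapped dataset {i}: rows {rows.tolist()}")
```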
But here's a twist: we won't use every feature for training the trees. We will randomly select a subset of features for each tree and use only those for training. For example, in the first case, we will only use the features x0 and x1.
Similarly, here are the subsets used for the remaining trees. Now that we have got the data and the feature subsets, let's build the trees. Just see how different the trees look from each other.
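Here is a rough sketch of that whole construction in code, using scikit-learn's DecisionTreeClassifier for the individual trees. X and y are invented stand-ins for the 6x5 toy dataset, and the seeds and subset size are just the ones from this walkthrough.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 10, size=(6, 5)).astype(float)  # invented 6x5 feature matrix
y = np.array([0, 1, 0, 1, 0, 1])

n_trees, subset_size = 4, 2
forest = []                                    # list of (feature subset, tree)
for _ in range(n_trees):
    rows = rng.integers(0, len(X), size=len(X))                      # bootstrap rows
    cols = rng.choice(X.shape[1], size=subset_size, replace=False)   # 2 random features
    tree = DecisionTreeClassifier().fit(X[np.ix_(rows, cols)], y[rows])
    forest.append((cols, tree))
```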
And this, my friend, is the random forest, containing four trees. But how do we make a prediction using this forest? Let's take a new data point.
We will pass this data point through each tree one by one and note down the predictions. Now, we have to combine all the predictions. As it's a classification problem, we will take a majority vote. Clearly, 1 is the winner.
Hence, the prediction from our random forest is 1. This process of combining results from multiple models is called aggregation. So in a random forest, we first perform bootstrapping and then aggregation, and in the jargon, this combination is called bagging.
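Continuing the forest-building sketch from above (this reuses the forest list and the invented 5-feature layout from that snippet), prediction and aggregation might look like this; the new data point is made up.

```python
from collections import Counter
import numpy as np

x_new = np.array([[3.0, 8.0, 1.0, 7.0, 2.0]])   # a made-up new data point

# Each tree votes using only its own feature subset...
votes = [int(tree.predict(x_new[:, cols])[0]) for cols, tree in forest]

# ...and the majority vote is the forest's prediction (bootstrapping + aggregation = bagging).
prediction = Counter(votes).most_common(1)[0][0]
print(votes, "->", prediction)
```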
So that was how we built a random forest. Now I should discuss some very important points related to this algorithm. First, why is it called random? Because we have used two random processes: bootstrapping and random feature selection.
But what is the motivation behind bootstrapping and feature selection? Well, bootstrapping ensures that we are not using the same data for every tree. So in a way it helps our model to be less sensitive to the original training data.
The random feature selection helps to reduce the correlation between the trees. If you used every feature, then most of your trees would have the same decision nodes and would act very similarly, and an ensemble of such highly correlated trees does little to reduce the variance.
There's another benefit of the random feature selection: some of the trees will be trained on less important features, so they will give poor predictions. But there will also be trees whose errors go in the opposite direction, so the errors tend to balance out in the aggregation. Next point: what's the ideal size of the feature subset? Well, in our case we took two features, which is close to the square root of the total number of features, five.
Researchers have found that subset sizes close to the square root or the logarithm of the total number of features tend to work well. And how do you use a random forest for a regression problem? Pretty easy: while combining the predictions, just take the average instead of the majority vote, and you are all set.
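In practice you rarely wire all of this up by hand. In scikit-learn, for example, the subset size is the max_features parameter ("sqrt" or "log2"), and the regression variant averages the trees' outputs for you; note that scikit-learn draws the random feature subset at every split rather than once per tree, a small difference from the per-tree scheme described here. A minimal sketch on made-up data:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))                     # made-up features
y_class = (X[:, 0] + X[:, 1] > 0).astype(int)     # made-up binary target
y_reg = X[:, 0] * 2.0 + rng.normal(size=100)      # made-up continuous target

# Classification: sqrt(n_features) candidate features per split,
# predictions aggregated by majority vote.
clf = RandomForestClassifier(n_estimators=100, max_features="sqrt").fit(X, y_class)

# Regression: same idea, but the per-tree predictions are averaged.
reg = RandomForestRegressor(n_estimators=100, max_features="log2").fit(X, y_reg)

print(clf.predict(X[:3]), reg.predict(X[:3]))
```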
So that was all about it. I hope now you have a pretty good understanding of random forest.
If you enjoyed this video, please share it and subscribe to my channel. Stay safe, and thanks for watching.