Random Forest:

• Random Forest is a supervised learning algorithm: an ensemble classifier built from many decision tree models.
• Ensemble models combine the results from different models.
• Random Forest uses the bagging (bootstrap aggregating) method: each tree is trained on a bootstrap sample of the data, and combining the trees' predictions usually gives a better overall result than a single tree.
In simple words: Random Forest builds multiple decision trees and merges their predictions to get a more accurate and stable result.

The concept of sampling with and without replacement should be understood before diving into the Random Forest algorithm.

Let us assume the dataset D = {1,2,3,4,5,6} and that we are asked to generate a sample of size 3.

Sampling without replacement:

Let’s say 2 is fetched from the dataset, giving {2}; then for the next selection D -> {1,3,4,5,6}
Let’s say 4 is fetched next, giving {2,4}; then for the next selection D -> {1,3,5,6}
6 is fetched this time, giving {2,4,6}; then for the next selection D -> {1,3,5}
We stop fetching here because the required sample size of 3 has been reached. Each element can appear in the sample at most once.

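Sampling with replacement:

Let’s say 2 is fetched, giving {2}; then for the next selection D -> {1,2,3,4,5,6}
4 is fetched next, giving {2,4}; then for the next selection D -> {1,2,3,4,5,6}
2 is fetched again, giving {2,4,2}; then for the next selection D -> {1,2,3,4,5,6}
The dataset is unchanged after every draw, so the same element can appear in the sample more than once.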
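
The two sampling schemes can be tried out with Python's standard random module. This is only a small sketch: the seed and the variable names are arbitrary choices for illustration, and the values drawn will differ from the hand-worked example above.

```python
import random

random.seed(0)  # fixed seed only so that repeated runs give the same draws
D = [1, 2, 3, 4, 5, 6]

# Without replacement: a picked element is removed from the pool,
# so the sample never contains duplicates.
without_replacement = random.sample(D, k=3)

# With replacement: every draw sees the full dataset,
# so the same element may be picked more than once.
with_replacement = random.choices(D, k=3)

print("without replacement:", without_replacement)
print("with replacement:   ", with_replacement)
```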

How does the Random Forest algorithm work?

• Let the number of training cases be N and the number of features in the dataset be M.
• A number m of input variables, with m less than M, is used to determine the decision at each node of the tree.
• Draw a training set of size N by sampling with replacement from the N available training cases; the cases that are not drawn (the out-of-bag cases) are used to estimate the error of the tree.
• At each node, m features are chosen at random and the best of them, measured by the Gini index or entropy, is used for the split. The root node is chosen this way, and each internal node is chosen from a fresh random set of m features, until the decision tree is complete.
• Build further trees in the same way, each from its own sample of N training cases drawn with replacement and with m random features per node.
• The final prediction is based on majority voting across all the decision trees.
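
The steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not scikit-learn's own random forest implementation: it assumes NumPy and scikit-learn are installed, uses a synthetic dataset from make_classification, delegates the per-node choice of m random features to DecisionTreeClassifier(max_features=m), and the helper names fit_forest and predict_forest are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier


def fit_forest(X, y, n_trees=10, m=3, seed=0):
    """Hypothetical helper: train n_trees trees, each on its own bootstrap sample."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    forest = []
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)       # N draws with replacement
        oob = np.setdiff1d(np.arange(n), idx)  # cases that were never drawn
        tree = DecisionTreeClassifier(
            criterion="gini",
            max_features=m,                    # m random features tried per split
            random_state=int(rng.integers(1_000_000)),
        )
        tree.fit(X[idx], y[idx])
        forest.append((tree, oob))
    return forest


def predict_forest(forest, X):
    """Hypothetical helper: majority vote over the trees (integer class labels)."""
    votes = np.stack([tree.predict(X) for tree, _ in forest]).astype(int)
    return np.array([np.bincount(column).argmax() for column in votes.T])


# Synthetic data (an assumption for illustration): N = 1000 cases, M = 20 features.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
forest = fit_forest(X, y, n_trees=10, m=3)
preds = predict_forest(forest, X)

# Error estimate for the first tree from the cases it did not see.
tree0, oob0 = forest[0]
oob_accuracy = tree0.score(X[oob0], y[oob0])
print("out-of-bag cases for tree 0:", len(oob0))
print("out-of-bag accuracy of tree 0:", oob_accuracy)
```

Roughly a third of the cases are left out of each bootstrap sample, which is what makes the out-of-bag error estimate possible without a separate validation set.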

Let us understand the Random Forest algorithm through a simple example:

(1) Assume a dataset with N = 1000 rows and M = 20 attributes.

(2) Suppose m = 3, so three random attributes are chosen from the twenty attributes. The number of attributes chosen is always the same, but the attributes themselves may differ each time a node of the decision tree is built.

(3) Draw a random sample N1 of 1000 rows from the dataset with replacement, and build the decision tree on the N1 sample only.

(4) Suppose the three random attributes A1, A6, A8 are chosen in the first decision tree to determine the root node. Find the best attribute (based on the Gini index or information gain) among these three. Let’s say the best attribute is A6; it is set as the root node.

(5) Now split the dataset N1 on A6, which gives two subsets, N1_1 and N1_2. For N1_1, again select three random attributes, say A5, A7, A9, find the best attribute among them, and set it as the internal (child) node. Similarly for N1_2, choose the best attribute from a new random set of three.

(6) Continue in this way until the first decision tree is complete.

(7) Similarly, build the other decision trees with m = 3 and a new bootstrap sample of size 1000 for each tree. Prediction is based on voting. If, say, there are 10 decision trees of which 8 say “Yes” and 2 say “No”, the final output follows the majority vote, that is “Yes”.
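
Two pieces of this walkthrough, picking m = 3 candidate attributes out of 20 and taking the majority vote of 10 trees, can be illustrated with the standard library alone. The attribute names A1..A20 follow the example; the list of votes is written by hand to match step (7) and does not come from trained trees.

```python
import random
from collections import Counter

random.seed(42)  # arbitrary seed so the run is repeatable

# Step (2): choose m = 3 distinct candidate attributes out of M = 20.
attributes = [f"A{i}" for i in range(1, 21)]
m = 3
candidates = random.sample(attributes, m)
print("candidate attributes for this node:", candidates)

# Step (7): 10 trees vote, 8 say "Yes" and 2 say "No".
votes = ["Yes"] * 8 + ["No"] * 2
winner, count = Counter(votes).most_common(1)[0]
print(winner, count)  # Yes 8
```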

Implementing the Random Forest algorithm using Python

Please find the full implementation here.

Parameters used in the Random Forest algorithm (as named in scikit-learn's RandomForestClassifier) and ways to tune them:

1. n_estimators: The number of trees in the forest. The default was 10 in older scikit-learn releases and has been 100 since version 0.22.

2. criterion: The function used to measure the quality of a split. Gini impurity ("gini") is the default, although the criterion can be set to "entropy".

3. max_features: The number of features to consider when looking for the best split. For the classifier the default is "sqrt", meaning max_features=sqrt(n_features). Older releases called this default "auto"; that value was deprecated and later removed.

4. max_depth: The maximum depth of the tree. If None, nodes are expanded until all leaves are pure or contain fewer than min_samples_split samples.

5. min_samples_split: The minimum number of samples required to split an internal node. Its default value is 2.

6. min_samples_leaf: The minimum number of samples required to be at a leaf node. Its default value is 1.

7. bootstrap: Whether bootstrap samples are used when building the trees. The default is True; if False, the whole dataset is used to build each tree.

8. oob_score: Whether to use out-of-bag samples to estimate the generalization accuracy. The default is False, and it is only available when bootstrap=True.

9. n_jobs: The number of jobs to run in parallel for both fit and predict. If -1, all available processors are used.
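
As a rough illustration of how these parameters are passed, the sketch below builds a RandomForestClassifier with each of them written out explicitly. It assumes a recent scikit-learn release (where "sqrt" is accepted for max_features) and uses a synthetic dataset from make_classification, which is a stand-in and not the data used in this article.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data: 1000 rows, 20 features.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

clf = RandomForestClassifier(
    n_estimators=100,      # number of trees
    criterion="gini",      # or "entropy"
    max_features="sqrt",   # features considered at each split
    max_depth=None,        # grow until leaves are pure
    min_samples_split=2,
    min_samples_leaf=1,
    bootstrap=True,        # each tree sees a bootstrap sample
    oob_score=True,        # estimate accuracy from out-of-bag samples
    n_jobs=-1,             # use all available processors
    random_state=0,        # only for repeatable runs
)
clf.fit(X, y)

print("number of trees:", len(clf.estimators_))
print("out-of-bag accuracy estimate:", clf.oob_score_)
```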

Parameter tuning: Try different combinations of parameter values, measure the accuracy for each, and keep the combination that gives the highest accuracy. scikit-learn provides the GridSearchCV class (in sklearn.model_selection), which automates this by evaluating every combination in a given parameter grid with cross-validation.

Please find the full code for GridSearchCV here.
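
For orientation, a small grid search might look like the sketch below. It is not the full code referred to above: the dataset is synthetic, and the parameter grid, the 3-fold cross-validation and the accuracy scoring are arbitrary choices kept small so that the search runs quickly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data, kept small so the search is fast.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# Illustrative grid: 2 x 2 x 2 = 8 parameter combinations.
param_grid = {
    "n_estimators": [10, 50],
    "max_depth": [None, 5],
    "max_features": ["sqrt", 5],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,                # 3-fold cross-validation per combination
    scoring="accuracy",
)
search.fit(X, y)

print("best parameters:", search.best_params_)
print("best cross-validated accuracy:", search.best_score_)
```

After fitting, search.best_estimator_ holds a forest refitted on all of the data with the best parameter combination found.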
