Decision Tree is a graphical representation that identifies ways to split a data set based on different conditions. It is one of the most widely used and practical methods for supervised learning used for both classification and regression tasks.

Decision Tree is a tree-like structure or model of decisions with nodes where pick an attribute and ask the questions , each edges represent the answer to the question, and each leaf node represents the actual output . Root to Leaf path is also known as Classification Rules.
Decision Tree algorithms are referred to as CART(Classification and Regression Trees).

Pseudo Code for Decision Tree Algorithm

• Find the best attribute(Highest information gain or lowest gini index ) out of all the features and set as root node.
• Data is splitted into the subsets against root node and find the best attribute(Highest Information gain or lowest gini index ) out of rest feature that will decide the internal nodes and set as left child node and right child node.
• Find the Impurity(Entropy,gini) of the node, if it is zero then set as leaf node.
• Repeat above steps until get minimum or zero impurity.

Common terms used with Decision trees:

1. Root Node: It represents entire sample and this further gets divided into two or more nodes.
2. Splitting:It is a process of dividing a node into two or more sub-nodes.
3. Leaf/ Terminal Node:Nodes do not split is called Leaf or Terminal node.
4. Pruning: When we remove sub-nodes of a decision node, this process is called pruning.
5. Branch / Sub-Tree: A sub section of entire tree is called branch or sub-tree.

How decision tree algorithm works internally?
It is easier to understand the algorithm when we know the answer for below questions.

(a)What do you mean by Impurity in the dataset?
(b)How to find the root or intermediate nodes for decision Tree?
(c)What all criterion to split the the dataset and form a decision tree?
(d)How to find the leaf nodes of the decision tree?
(e)When to stop the decision tree algorithm?

Ans(a)– Impurity represents how much data is mixed.

Lets say there are ten records in dataset in which five are “Yes” and five are “No”, we say it is 100% impure data.
Lets say again if there are ten records in which all ten records are either “Yes” or “No”, we say it is 0% impurity or 100% pure data.

Ans(b)- Decision tree majorly use Gini index, Gini Gain or Entropy, Information Gain techniques to find the root node and intermediate nodes.
Entropy- It is the indicator of how messy the data is. Its value varies between 0 and 1.
If data is 100% impure then Entropy is 1 and if data consist 0% impurity then Entropy is 0.
Information Gain- Reduction in Entropy is called information gain. It also decides the root node, internal nodes of the decision tree.
Information Gain=Entropy(S) – [(Weighted Avg)* Entropy(each feature)]﻿
Gini Impurity– It is similar to entropy but don’t use the log function to calculate the impurity. It is faster than Entropy. Scikit Learn uses gini index by default criteria to prepare the tree.

Gini Gain-Gini Gain is the difference between the “Gini index for the whole dataset” and “weighted average of Gini index for the individual features”.

Root node are selected out of all the feature on the basis of information gain or gini gain. Feature with highest information gain or highest gini gain are selected as root node. Similarly intermediate nodes are chosen.

Please go through the link to understand these techniques entropy,gini,entropy vs gini in depth.

Ans-(c)Below are the criteria to split the dataset in which gini index,entropy,information gain are the most popular.

Ans-(d) and (e)-When the entropy or gini impurity is zero or minimum then set the node as leaf node and when all the conditions on the features applied and get minimum entropy or gini impurity from the output of each node then stop the algorithm.

Let us apply the discuss concepts in one of the dataset and form the tree.

Gini Index,Gini Gain will be calculated for Outlook,Humidity and Wind features.

Or Information Gain can be calculated for Outlook,Humidity and Wind features.

As we can see that Outlook is the feature which has maximum information gain so chosen as root node.

Outlook->Sunny and Outlook->Rain branches will be splitted further as it is still impure. Outlook->Overcast branch output has pure data so set it as leaf node and won’t split further.
Now Gini gain or Information gain will be calculated for the rest of features and chosen higher value of it as intermediate node. Final decision tree will be look like:

Decision Tree Algorithm implementation using Python

Please find the full code here . It includes the data preprocessing, standardization,algorithm implementation,accuracy,gridsearchcv.

Parameters and ways to tune it :

1-criterion- Scikit learn uses gini as a criterion,although we can set it as entropy as well.
2-max_features- The number of features to consider when looking for the best split.

• When max_feature is None means consider all the features.
• When max_feature is auto means sqrt(number of features).
Usually it is set as None.
3-max_depth- The maximum depth of the tree.If it is None, then nodes are expanded until all leaves are pure.Usually it is set as None only

4-min_samples_split- The minimum number of samples required to split an internal node. Its default value is two.

5-min_samples_leaf- The minimum number of samples required to be a leaf node. Its default valued is one.

Tuning the parameters- We need to try the random value for different parameters in decision tree algorithm and check the accuracy against the parameters. There is a gridSearchCV library available which does the same thing.

You can find the full code here.

$${}$$