- It measures the randomness in the data.
- It helps to find the root node, intermediate nodes, and leaf nodes while developing the decision tree.
- It is simply a metric that measures impurity.
- It reaches its minimum (zero) when all cases in the node fall into a single target class, and its maximum when the cases are split equally between the classes.
In the above graph, H(X) is the entropy: it is maximum when the probability is 0.5 and minimum (zero) when the probability is either 0 or 1.
Que: How do we calculate the Entropy if there are ten records in the dataset?
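As a quick sketch in Python (assuming, purely for illustration, that six of the ten records are "Yes" and four are "No"):

```python
import math

def entropy(counts):
    """Shannon entropy (base 2) of a list of class counts."""
    total = sum(counts)
    ent = 0.0
    for c in counts:
        if c > 0:  # a count of 0 contributes nothing (0 * log 0 is treated as 0)
            p = c / total
            ent -= p * math.log2(p)
    return ent

# Hypothetical ten-record dataset: 6 "Yes" and 4 "No"
print(entropy([6, 4]))  # -> ~0.971
```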
- It measures the reduction in Entropy.
- The greater the reduction in Entropy, the higher the Information Gain.
- It decides which feature should be selected as the root node or an intermediate node in the decision tree.
Information Gain can be calculated using the formula below:
Information Gain = Entropy(S) - [(Weighted Avg) * Entropy(each feature)]
We are going to discuss the terms below and then prepare the decision tree (a small code sketch follows the definitions):
Entropy: We have already discussed the use of Entropy and the formula to calculate it. Decision tree nodes are split until we reach minimum (ideally zero) entropy, so entropy is calculated in each iteration, and a node is treated as a leaf when its entropy is lowest or zero.
Information: (Weighted average) * (Entropy of each feature category).
Information Gain: The reduction in entropy. It also decides which feature should be treated as the root node or an intermediate node.
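Building on the entropy helper sketched above, Information and Information Gain can be expressed as follows (here `subset_counts` is an assumed representation: one [Yes, No] count pair per category of the feature):

```python
def information(subset_counts):
    """Weighted average entropy over a feature's category subsets."""
    total = sum(sum(counts) for counts in subset_counts)
    return sum(sum(counts) / total * entropy(counts) for counts in subset_counts)

def information_gain(dataset_counts, subset_counts):
    """Information Gain = Entropy(S) - Information of the feature."""
    return entropy(dataset_counts) - information(subset_counts)
```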
Let us calculate the Entropy and Information Gain for the dataset below:
There are three independent columns (Outlook, Humidity, Wind) and one target column (Play) in the dataset.
Que: How do we find which feature becomes the root or an intermediate node, and prepare the decision tree to decide whether the child will play outside or not?
To decide the root node:
1: Find the Entropy of the whole dataset.
We have fourteen rows in the dataset, of which nine are "Yes" records and five are "No" records. So the probability of "Yes" is 9/14 and, similarly, the probability of "No" is 5/14.
Here the probabilities of Yes and No are put into the Entropy formula:
E(S) = -P(Yes)*log2(P(Yes)) - P(No)*log2(P(No))
E(S) = -(9/14)*log2(9/14) - (5/14)*log2(5/14) = 0.94
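Using the entropy helper sketched earlier, this value can be verified:

```python
print(entropy([9, 5]))  # 9 "Yes", 5 "No" -> ~0.94
```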
Now we are going to calculate the Entropy and Information Gain of each individual feature (Outlook, Humidity, Wind).
Outlook feature: There are three categories, Sunny, Overcast, and Rain, in this feature; the counts of each category in the dataset are shown below.
These categories are then broken down with respect to the target variable, and the Entropy and Information Gain are calculated as below:
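A sketch of this calculation with the helpers above, assuming the per-category (Yes, No) counts read from the table (Sunny: 2/3, Overcast: 4/0, Rain: 3/2):

```python
# (Yes, No) counts per Outlook category: Sunny, Overcast, Rain
outlook = [[2, 3], [4, 0], [3, 2]]
print(information(outlook))               # -> ~0.693
print(information_gain([9, 5], outlook))  # -> ~0.247
```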
Wind feature: There are two categories, Strong and Weak, in this feature; their counts are shown below.
Humidity feature: There are two categories, High and Normal, in this feature; their counts are shown below.
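The same sketch for Wind and Humidity, again assuming the (Yes, No) counts shown in the tables:

```python
wind = [[6, 2], [3, 3]]      # Weak: 6 Yes / 2 No, Strong: 3 Yes / 3 No
humidity = [[3, 4], [6, 1]]  # High: 3 Yes / 4 No, Normal: 6 Yes / 1 No
print(information(wind))      # -> ~0.892
print(information(humidity))  # -> ~0.788
```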
- The Entropy of the whole dataset is already known: 0.94.
- The Information of each feature is also known: Outlook = 0.693, Humidity = 0.788, Wind = 0.892.
- Information Gain is the difference between the Entropy of the dataset and the Information of each feature: Outlook = 0.94 - 0.693 = 0.247, Humidity = 0.94 - 0.788 = 0.152, Wind = 0.94 - 0.892 = 0.048.
The feature with the highest Information Gain should be chosen as the root node of the decision tree, so here the Outlook feature is chosen as the root node. We then repeat the above approach to find the intermediate and leaf nodes. The final tree will look like:
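For comparison, here is a minimal scikit-learn sketch that fits an entropy-based decision tree on data of this shape (the five rows below are an illustrative assumption, not the exact table, and one-hot encoding is used because scikit-learn trees need numeric inputs):

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Illustrative sample rows (assumed), with the same columns as the dataset above
data = pd.DataFrame({
    "Outlook":  ["Sunny", "Sunny", "Overcast", "Rain", "Rain"],
    "Humidity": ["High", "High", "High", "High", "Normal"],
    "Wind":     ["Weak", "Strong", "Weak", "Weak", "Strong"],
    "Play":     ["No", "No", "Yes", "Yes", "No"],
})

X = pd.get_dummies(data[["Outlook", "Humidity", "Wind"]])  # one-hot encode categories
y = data["Play"]

clf = DecisionTreeClassifier(criterion="entropy")  # entropy-based splits, as above
clf.fit(X, y)
print(export_text(clf, feature_names=list(X.columns)))  # text view of the fitted tree
```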