Gini Impurity:

  • It measures how often a randomly chosen element from the set would be incorrectly labelled if it were labelled randomly according to the class distribution.
  • It helps to identify the root node, intermediate nodes, and leaf nodes while building a decision tree.
  • It is used by the CART (Classification And Regression Tree) algorithm for classification trees.
  • It reaches its minimum (zero) when all cases in the node fall into a single target category.

Gini impurity is calculated using the formula below:

Gini(D)=1-\sum_{i=1}^{n}p_i^2

where p_i is the probability of class i and n is the number of classes.
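As a minimal sketch, the formula can be written as a small Python function (the helper name `gini` is ours, not part of any library):

```python
def gini(probabilities):
    """Gini impurity: 1 minus the sum of squared class probabilities."""
    return 1 - sum(p ** 2 for p in probabilities)

# A pure node (all one class) has impurity 0; a 50/50 split has 0.5.
print(gini([1.0]))        # 0.0
print(gini([0.5, 0.5]))   # 0.5
```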

When we start modelling on a dataset that has independent columns and a target column, we follow the steps below to calculate the Gini impurity and Gini gain, which tell us which feature becomes the root, an intermediate, or a leaf node of the Decision Tree.

Que- How do we calculate the Gini impurity if there are ten records in the dataset?

Let us see another dataset and calculate its Gini impurity:

Fig-3 Dataset

There are three independent columns (Outlook, Humidity, Wind) and a target column (Play) in the dataset.

Que- How do we find which feature becomes the root or an intermediate node, and build the decision tree that decides whether the child will play outside or not?

To decide the root node:
1: Find Gini(D) for the whole dataset:

We have fourteen rows in the dataset, of which nine are “Yes” records and five are “No” records. So the probability of “Yes” is 9/14, and similarly the probability of “No” is 5/14.


Putting the probabilities of Yes and No into the Gini formula:

Gini(D)=1-(9/14)^2-(5/14)^2 ≈ 0.459
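As a quick check, this calculation can be reproduced in a few lines of Python:

```python
# Class counts for the whole dataset: 9 "Yes" and 5 "No" records.
yes, no = 9, 5
total = yes + no

# Gini(D) = 1 - p(yes)^2 - p(no)^2
gini_d = 1 - (yes / total) ** 2 - (no / total) ** 2
print(round(gini_d, 3))  # 0.459
```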

Now we calculate the Gini index (weighted impurity) of each individual feature (Wind, Humidity, Outlook).

Outlook feature: There are three categories, Sunny, Overcast, and Rain, in this feature, with the following counts in the dataset:
Sunny: 5
Overcast: 4
Rain: 5
Each category is then broken down by the target variable, the Gini of each subset is calculated, and the Gini index of the feature is the weighted sum (D1/D)gini(D1)+(D2/D)gini(D2)+(D3/D)gini(D3), where:
(D1/D)->5/14
(D2/D)->4/14
(D3/D)->5/14
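The per-category Yes/No splits needed for each gini(Di) appear only in Fig-3; the counts below assume the standard play-tennis dataset (Sunny: 2 Yes / 3 No, Overcast: 4 Yes / 0 No, Rain: 3 Yes / 2 No), so treat them as illustrative:

```python
def gini(counts):
    """Gini impurity from raw (yes, no) class counts."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

# Assumed (yes, no) splits per Outlook category, from the standard
# play-tennis dataset that Fig-3 appears to show.
splits = {"Sunny": (2, 3), "Overcast": (4, 0), "Rain": (3, 2)}
n = sum(sum(c) for c in splits.values())  # 14 rows in total

# Weighted sum (D1/D)gini(D1) + (D2/D)gini(D2) + (D3/D)gini(D3)
weighted = sum(sum(c) / n * gini(c) for c in splits.values())
print(round(weighted, 2))  # 0.34
```

Note that the pure Overcast subset (all Yes) contributes zero impurity, which is what pulls Outlook's Gini index down.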

Wind feature: There are two categories, Strong and Weak, with the following counts:
Strong: 6
Weak: 8
Each category is then broken down by the target variable, and the Gini index of the feature is the weighted sum (D1/D)gini(D1)+(D2/D)gini(D2), where:
(D1/D)->6/14
(D2/D)->8/14
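The same weighted-sum sketch works for Wind; the per-category Yes/No splits (Strong: 3 Yes / 3 No, Weak: 6 Yes / 2 No) are again assumed from the standard play-tennis dataset shown in Fig-3:

```python
def gini(counts):
    """Gini impurity from raw (yes, no) class counts."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

# Assumed (yes, no) splits per Wind category.
splits = {"Strong": (3, 3), "Weak": (6, 2)}
n = sum(sum(c) for c in splits.values())  # 14

# Weighted sum (D1/D)gini(D1) + (D2/D)gini(D2)
weighted = sum(sum(c) / n * gini(c) for c in splits.values())
print(round(weighted, 2))  # 0.43
```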

Humidity feature: There are two categories, High and Normal, with the following counts:
High: 7
Normal: 7
Each category is then split by the target variable, and the Gini index of the feature is the weighted sum (D1/D)gini(D1)+(D2/D)gini(D2), where:
(D1/D)->7/14
(D2/D)->7/14
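And likewise for Humidity; the per-category Yes/No splits (High: 3 Yes / 4 No, Normal: 6 Yes / 1 No) are assumed from the standard play-tennis dataset shown in Fig-3:

```python
def gini(counts):
    """Gini impurity from raw (yes, no) class counts."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

# Assumed (yes, no) splits per Humidity category.
splits = {"High": (3, 4), "Normal": (6, 1)}
n = sum(sum(c) for c in splits.values())  # 14

# Weighted sum (D1/D)gini(D1) + (D2/D)gini(D2)
weighted = sum(sum(c) / n * gini(c) for c in splits.values())
print(round(weighted, 2))  # 0.37
```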

Gini Gain:

Gini Gain is the difference between the Gini index of the whole dataset and the weighted-average Gini index of an individual feature.
Both have already been calculated:
Gini(D) ≈ 0.46, and the Gini indexes of the individual features are ≈ 0.34 (Outlook), 0.37 (Humidity), and 0.43 (Wind).
Now Gini Gain can be calculated as per below:


Fig-7 Gini Index and Gini Gain

The feature with the highest Gini Gain (equivalently, the lowest Gini index) should be chosen as the root node of the Decision Tree. Here the Outlook feature must be chosen as the root node. We then repeat the same approach on each branch to find the intermediate and leaf nodes. The final tree will look like:
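The whole root-selection procedure above can be sketched end to end. The per-category Yes/No splits are assumptions taken from the standard play-tennis dataset (they appear only in Fig-3 here):

```python
def gini(counts):
    """Gini impurity from raw (yes, no) class counts."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

# Assumed (yes, no) splits per category for each feature.
features = {
    "Outlook":  {"Sunny": (2, 3), "Overcast": (4, 0), "Rain": (3, 2)},
    "Humidity": {"High": (3, 4), "Normal": (6, 1)},
    "Wind":     {"Strong": (3, 3), "Weak": (6, 2)},
}

n = 14                   # total rows in the dataset
gini_d = gini((9, 5))    # impurity of the whole dataset, ~0.459

# Gini Gain = Gini(D) - weighted Gini index of the feature
gains = {}
for name, splits in features.items():
    weighted = sum(sum(c) / n * gini(c) for c in splits.values())
    gains[name] = gini_d - weighted

root = max(gains, key=gains.get)
print(root)  # Outlook
```

Outlook wins because its pure Overcast branch drives its weighted impurity down, giving it the largest gain.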

Fig 8 -Decision Tree