ML 101: Gini Index vs. Entropy for Decision Trees (Python)

The Gini Index and Entropy are two important concepts in decision trees and data science. While both seem similar, underlying mathematical differences separate the two.

Understanding these subtle differences is important as one may work better for your machine learning algorithm.

In this blog post, you’ll learn

  • What the Gini Index Is
  • Is this different from Gini Impurity?
  • Is this different from the Gini Coefficient?
  • What Entropy Is
  • Is Entropy Different Than Information Gain?
  • The Difference between the Gini Index and Entropy
  • An Example Coded in Python on a Real Dataset


Grab some headphones and a coffee (you’ll need it)



What is the Gini Index

The Gini Index is simply a tree-splitting criterion.

When your decision tree has to make a "split" in your data, it chooses, at each node, the split that minimizes the Gini index.

Below, we can see the Gini Index Formula:

Gini = 1 − Σᵢ (pᵢ)²


Where pᵢ is the probability of a randomly chosen point in the node belonging to class i.

For a binary problem, the Gini index will always be between 0 and 0.5, where 0 means the node contains only one class (pure) and 0.5 means the two classes are mixed in equal proportions (maximally impure).
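
As a quick illustration (not from the original post), here is a minimal sketch of the formula above in Python, with made-up class counts:

import numpy as np

# gini index of a single node: 1 - sum of squared class proportions
def gini(class_counts):
    p = np.asarray(class_counts, dtype=float)
    p = p / p.sum()              # turn counts into class proportions
    return 1.0 - np.sum(p ** 2)

print(gini([50, 0]))    # a pure node -> 0.0
print(gini([25, 25]))   # a 50/50 binary node -> 0.5 (maximally impure)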


Is The Gini Index Different from Gini Impurity?

There is no difference between the Gini index and Gini impurity. Both of these are commonly referenced as splitting criteria that are used in decision trees. Since these mean the same thing, you’d calculate them the same way.


Is The Gini Index Different from Gini Coefficient?

The Gini index is not the same as the Gini Coefficient in data mining or machine learning.

In the study of economics, many will use these terms interchangeably, but within data science and decision trees, these terms are not the same.


While the two are commonly confused, the Gini index is a classification measure of the level of impurity at each node (how well a split separates the classes).

The Gini Coefficient (in machine learning) is a ranking metric for binary classifiers that measures how well a model orders positive cases above negative ones.
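
For illustration, here is a minimal sketch of that ranking metric, computed as 2 × AUC − 1; the labels and scores below are made up, and scikit-learn's roc_auc_score is used only for convenience.

from sklearn.metrics import roc_auc_score

y_true  = [0, 0, 1, 0, 1, 1, 0, 1]                    # made-up actual classes
y_score = [0.1, 0.3, 0.7, 0.2, 0.9, 0.6, 0.65, 0.8]   # made-up model scores

# the ML Gini Coefficient measures ranking quality, not node purity
auc = roc_auc_score(y_true, y_score)
gini_coefficient = 2 * auc - 1
print(round(gini_coefficient, 3))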


Why Is The Gini Index seen as the Gini Coefficient in fields outside of data science?

Outside of data science and machine learning, most commonly in economics, the two terms start to "blend together."

This is because the Gini Index measures a categorical variable's impurity (how mixed the classes are), while the Gini Coefficient measures a numerical variable's inequality (how unevenly values are spread), usually income.

Due to this subtle difference, some fields have started to use the terms interchangeably, making the situation quite confusing for others!
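
To make the contrast concrete, here is a minimal sketch of the economics-style Gini Coefficient (income inequality), using made-up incomes; the helper function is illustrative, not a standard library call.

import numpy as np

# economics Gini Coefficient: mean absolute difference between every pair
# of values, scaled by twice the mean (0 = perfect equality)
def gini_coefficient_econ(values):
    x = np.asarray(values, dtype=float)
    mad = np.abs(x[:, None] - x[None, :]).mean()
    return mad / (2 * x.mean())

print(round(gini_coefficient_econ([30_000, 35_000, 40_000, 250_000]), 3))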



What is Entropy

Entropy is simply a measure of how "mixed" (impure) a dataset is.

If we have a dataset (like in fraud detection) where most of the records are not fraudulent, that dataset would have low entropy.

If we had a dataset that was 50% “No” and 50% “Yes,” this dataset would have high entropy.

Below, we have the formula for entropy:

Entropy = − Σᵢ pᵢ log₂(pᵢ)

Where pᵢ is the probability of randomly picking an element of class i (the proportion of the dataset that class makes up).
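
Here is a minimal sketch of the formula, using made-up label proportions: a fraud-like 95/5 split versus a perfectly balanced 50/50 split.

import numpy as np

# entropy of a node given its class proportions
def entropy_of(proportions):
    p = np.asarray(proportions, dtype=float)
    p = p[p > 0]                    # treat 0 * log(0) as 0
    return -(p * np.log2(p)).sum()

print(round(entropy_of([0.95, 0.05]), 3))   # heavily imbalanced -> low entropy
print(round(entropy_of([0.50, 0.50]), 3))   # 50/50 -> maximum entropy (1.0)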


Is Entropy Different Than Information Gain?

Entropy is different from information gain: information gain uses entropy as part of its calculation to decide where to make a split.

Information gain is the actual splitting criterion. It measures how much a candidate split reduces entropy, and the tree chooses the split with the largest reduction.

Here is the formula for information gain:

Information Gain = Entropy(parent) − Σⱼ (nⱼ / n) × Entropy(childⱼ)


As we can see, information gain increases as entropy decreases.

This makes sense: entropy measures how mixed our dataset is, so if a split brings the children's entropy close to 0, we've (almost) perfectly separated the classes in our training data.

This is shown below, where we've reached minimum entropy (zero) and, for a balanced binary target, maximum information gain (one).

[Figure: the inverse relationship between information gain and entropy]
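
As a worked example (with made-up counts), here is a minimal sketch of the information gain formula above for a single binary split.

import numpy as np

def entropy_of(counts):
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

# made-up parent node: 10 "yes" / 10 "no", split into two children
parent = [10, 10]
left, right = [8, 2], [2, 8]

n = sum(parent)
weighted_children = (sum(left) / n) * entropy_of(left) + (sum(right) / n) * entropy_of(right)
info_gain = entropy_of(parent) - weighted_children
print(round(info_gain, 3))   # entropy falls from 1.0, so the gain is positive (~0.278)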


Gini Index Vs. Entropy In Decision Trees

According to a paper by Laura Elena Raileanu and Kilian Stoffel, the Gini Index and Entropy usually give very similar results when used to score splits. (Source)

However, compared to the Gini Index, entropy is more expensive to compute at every single node because of the logarithm.

For these reasons, in my data mining projects, I will usually stick with the Gini index unless I can show that using Entropy is worth it through cross-validation.


If you’re low on time and are looking for a computationally efficient solution between Gini and Entropy, use Gini – as the switch probably is not worth it.
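
If you do want to check for yourself, here is a minimal sketch of comparing the two criteria with cross-validation in scikit-learn; the built-in breast cancer dataset is used purely for illustration, not the car dataset from this post.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# fit the same tree with each splitting criterion and compare CV accuracy
for criterion in ("gini", "entropy"):
    tree = DecisionTreeClassifier(criterion=criterion, random_state=42)
    scores = cross_val_score(tree, X, y, cv=5)
    print(criterion, round(scores.mean(), 3))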


When is Entropy Better than the Gini Index?

Entropy can perform better than the Gini Index on datasets that are heavily imbalanced.

While the math is a little involved, the logarithm in the Entropy formula gives relatively more weight to low-occurring events.

This difference can make Entropy a better fit for naturally imbalanced problems like fraud detection (and it is closely related to the cross-entropy loss used in deep learning).
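
To see the effect of the logarithm, here is a small sketch (with made-up class proportions) comparing the two measures, each divided by its two-class maximum (0.5 for Gini, 1.0 for entropy); on the skewed node, entropy keeps a relatively higher impurity than Gini.

import numpy as np

def gini(p):
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

def entropy_of(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

# a heavily skewed (fraud-like) node versus a balanced one
for dist in ([0.99, 0.01], [0.5, 0.5]):
    print(dist, round(gini(dist) / 0.5, 3), round(entropy_of(dist) / 1.0, 3))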


Gini Index vs. Entropy (Python Code)

import pandas as pd
import numpy as np

# like always, a publicly available dataset
# https://www.kaggle.com/datasets/hellbuoy/car-price-prediction
df = pd.read_csv('car-price.csv')

# we will keep a few categorical variables,
# use doornumber as our target,
# and the others as our independent variables
df = df[['drivewheel','fueltype','aspiration','doornumber','carbody']]

df.sample(n=10)

[Output: a random sample of the dataset used for modeling]

# function will calculate the gini index for each column
# in a dataframe (from scratch)
# and print out the best column to split on
def gini_index(dataset, targetcol):

    # store all of our columns and gini scores
    gini_scores = []

    # total number of rows, used to weight each branch
    n = len(dataset)

    # iterate through each column in your dataframe
    for col in dataset.columns:

        # skip our target column
        # we can't split on the target itself
        if col == targetcol:
            continue

        # resets for each column in your dataset
        gini = 0

        # get the value counts for that column
        unique_values = dataset[col].value_counts()

        # iterate through each unique value for that column
        for key, val in unique_values.items():

            # get the target variable separated, based on
            # the independent variable
            filteredDf = dataset[targetcol][dataset[col] == key].value_counts()

            # sum of the value counts for that branch
            ValueSum = filteredDf.sum()

            # sum of squared class probabilities within this branch
            p = 0
            for i, j in filteredDf.items():
                p += (j / ValueSum) ** 2

            # weighted gini for the column is the sum of
            # (branch size / total rows) * branch impurity
            gini += (val / n) * (1 - p)

        print(f'Variable {col} has Gini Index of {round(gini,4)}\n')

        # append our column name and gini score
        gini_scores.append((col, gini))

    # the best split is the column with the lowest gini index
    split_pair = min(gini_scores, key=lambda x: x[1])

    # print out the best score
    print(f'''Split on {split_pair[0]} With Gini Index of {round(split_pair[1],3)}''')

    return split_pair


final = gini_index(df, 'doornumber')

[Output: Gini index for each candidate column]

 

import numpy as np
import pandas as pd

def entropy(dataset, targetcol):
    # store all of our columns and information gain scores
    entropy_scores = []

    # entropy of the target column itself (the "parent" node)
    parent_counts = dataset[targetcol].value_counts(normalize=True)
    parent_entropy = -(parent_counts * np.log2(parent_counts)).sum()

    # total number of rows, used to weight each branch
    n = len(dataset)

    # iterate through each column in your dataframe
    for col in dataset.columns:

        if col == targetcol:
            continue

        # weighted entropy of the target after splitting on this column
        weighted_entropy = 0

        for key, val in dataset[col].value_counts().items():

            # distribution of the target within this branch
            branch = dataset[targetcol][dataset[col] == key].value_counts(normalize=True)

            # entropy of the branch, weighted by the branch size
            branch_entropy = -(branch * np.log2(branch)).sum()
            weighted_entropy += (val / n) * branch_entropy

        # information gain = parent entropy - weighted child entropy
        info_gain = parent_entropy - weighted_entropy

        print(f'Variable {col} has Entropy of {round(weighted_entropy,4)} '
              f'and Information Gain of {round(info_gain,4)}\n')

        # append our column name and information gain
        entropy_scores.append((col, info_gain))

    # the best split is the column with the highest information gain
    split_pair = max(entropy_scores, key=lambda x: x[1])

    # print out the best score
    print(f'''Split on {split_pair[0]} With Information Gain of {round(split_pair[1],3)}''')

    return split_pair


final = entropy(df, 'carbody')
final

[Output: entropy and information gain for each candidate column]

Conclusions and Recap

Now that you are well equipped to use Entropy and the Gini index in decision trees and other tree algorithms, make sure you consider the things we’ve listed.

  • Time
  • Type of Problem
  • Computational Costs

While it’s easy to get these two confused, they aren’t the same thing, and depending on your specific situation, you’ll want to choose the right splitting criterion for your machine-learning models.


Other Articles In Our Machine Learning 101 Series

We have many quick guides that go over some of the fundamental parts of machine learning. Some of those guides include:

  • Reverse Standardization: Another staple in our 101 series is an introductory article teaching you about scaling and standardization.
  • Parameter Versus Variable: Commonly misunderstood – these two aren’t the same thing. This article will break down the difference.
  • Criterion Vs. Predictor: One of the first steps in statistical testing is the independent and dependent variables.
  • Heatmaps In Python: Visualizing data is key in data science; this post will teach eight different libraries to plot heatmaps.
  • SelectKBest (Sklearn): Now that you know how to perfect your decision trees, make sure your feature selection is good!
  • CountVectorizer vs. TfidfVectorizer: Interested in learning NLP? This is a great guide to jump into after learning about two famous distributions.
  • Welch’s T-Test: Do you know the difference between Student’s t-test and Welch’s t-test? Don’t worry, we explain it in-depth here.
  • Normal Distribution Vs. Uniform Distribution: Two key distributions that will pop up everywhere in data science.
Stewart Kaplan