ML 101: Gini Index vs. Entropy for Decision Trees (Python)

The Gini Index and Entropy are two important concepts in decision trees and data science. While both seem similar, underlying mathematical differences separate the two.

Understanding these subtle differences is important as one may work better for your machine learning algorithm.

In this blog post, you’ll learn

  • What the Gini Index Is
  • Is this different from Gini Impurity?
  • Is this different from the Gini Coefficient?
  • What Entropy Is
  • Is Entropy Different Than Information Gain?
  • The Difference between the Gini Index and Entropy
  • An Example Coded in Python on a Real Dataset


Grab some headphones and a coffee (you’ll need it)



What is the Gini Index

The Gini Index is simply a tree-splitting criterion.

When your decision tree has to make a "split" in your data, it chooses, at each node, the split that minimizes the Gini index.

Below, we can see the Gini Index Formula:

Gini = 1 − Σᵢ (pᵢ)²


Where pᵢ is the probability of a randomly chosen point in the node belonging to class i.

For a binary problem, the Gini index will always be between 0 and 0.5, where 0 means the node contains only one class (pure) and 0.5 means the two classes are mixed in equal proportions (maximally impure).
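
As a quick illustration (not from the original post), here is a minimal sketch of the formula above in Python, with made-up class counts:

import numpy as np

# gini index of a single node: 1 - sum of squared class proportions
def gini(class_counts):
    p = np.asarray(class_counts, dtype=float)
    p = p / p.sum()              # turn counts into class proportions
    return 1.0 - np.sum(p ** 2)

print(gini([50, 0]))    # a pure node -> 0.0
print(gini([25, 25]))   # a 50/50 binary node -> 0.5 (maximally impure)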


Is The Gini Index Different from Gini Impurity?

There is no difference between the Gini index and Gini impurity. Both of these are commonly referenced as splitting criteria that are used in decision trees. Since these mean the same thing, you’d calculate them the same way.


Is The Gini Index Different from Gini Coefficient?

The Gini index is not the same as the Gini Coefficient in data mining or machine learning.

In the study of economics, many will use these terms interchangeably, but within data science and decision trees, these terms are not the same.


While the two are commonly confused, the Gini index is a classification measure of the level of impurity at each node (how well a split separates the classes).

The Gini Coefficient (in machine learning) is a ranking metric for binary classifiers that measures how well a model orders positive cases above negative ones.
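
For illustration, here is a minimal sketch of that ranking metric, computed as 2 × AUC − 1; the labels and scores below are made up, and scikit-learn's roc_auc_score is used only for convenience.

from sklearn.metrics import roc_auc_score

y_true  = [0, 0, 1, 0, 1, 1, 0, 1]                    # made-up actual classes
y_score = [0.1, 0.3, 0.7, 0.2, 0.9, 0.6, 0.65, 0.8]   # made-up model scores

# the ML Gini Coefficient measures ranking quality, not node purity
auc = roc_auc_score(y_true, y_score)
gini_coefficient = 2 * auc - 1
print(round(gini_coefficient, 3))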


Why Is The Gini Index seen as the Gini Coefficient in fields outside of data science?

Outside of data science and machine learning, most commonly in economics, the two terms start to "blend together."

This is because the Gini Index measures a categorical variable's impurity (how mixed the classes are), while the Gini Coefficient measures a numerical variable's inequality (how unevenly values are spread), usually income.

Due to this subtle difference, some fields have started to use the terms interchangeably, making the situation quite confusing for others!
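
To make the contrast concrete, here is a minimal sketch of the economics-style Gini Coefficient (income inequality), using made-up incomes; the helper function is illustrative, not a standard library call.

import numpy as np

# economics Gini Coefficient: mean absolute difference between every pair
# of values, scaled by twice the mean (0 = perfect equality)
def gini_coefficient_econ(values):
    x = np.asarray(values, dtype=float)
    mad = np.abs(x[:, None] - x[None, :]).mean()
    return mad / (2 * x.mean())

print(round(gini_coefficient_econ([30_000, 35_000, 40_000, 250_000]), 3))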



What is Entropy

Entropy is simply a measure of how "mixed" (impure) a dataset is.

If we have a dataset (like in fraud detection) where most of the records are not fraudulent, that dataset would have low entropy.

If we had a dataset that was 50% “No” and 50% “Yes,” this dataset would have high entropy.

Below, we have the formula for entropy:

Entropy = − Σᵢ pᵢ log₂(pᵢ)

Where pᵢ is the probability of randomly picking an element of class i (the proportion of the dataset that class makes up).
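
Here is a minimal sketch of the formula, using made-up label proportions: a fraud-like 95/5 split versus a perfectly balanced 50/50 split.

import numpy as np

# entropy of a node given its class proportions
def entropy_of(proportions):
    p = np.asarray(proportions, dtype=float)
    p = p[p > 0]                    # treat 0 * log(0) as 0
    return -(p * np.log2(p)).sum()

print(round(entropy_of([0.95, 0.05]), 3))   # heavily imbalanced -> low entropy
print(round(entropy_of([0.50, 0.50]), 3))   # 50/50 -> maximum entropy (1.0)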


Is Entropy Different Than Information Gain?

Entropy is different from information gain: information gain uses entropy as part of its calculation to decide where to make a split.

Information gain is the actual splitting criterion. It measures how much a candidate split reduces entropy, and the tree chooses the split with the largest reduction.

Here is the formula for information gain:

Information Gain = Entropy(parent) − Σⱼ (nⱼ / n) × Entropy(childⱼ)


As we can see, information gain increases as entropy decreases.

This makes sense: entropy measures how mixed our dataset is, so if a split brings the children's entropy close to 0, we've (almost) perfectly separated the classes in our training data.

This is shown below, where we've reached minimum entropy (zero) and, for a balanced binary target, maximum information gain (one).

[Figure: the inverse relationship between information gain and entropy]
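
As a worked example (with made-up counts), here is a minimal sketch of the information gain formula above for a single binary split.

import numpy as np

def entropy_of(counts):
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

# made-up parent node: 10 "yes" / 10 "no", split into two children
parent = [10, 10]
left, right = [8, 2], [2, 8]

n = sum(parent)
weighted_children = (sum(left) / n) * entropy_of(left) + (sum(right) / n) * entropy_of(right)
info_gain = entropy_of(parent) - weighted_children
print(round(info_gain, 3))   # entropy falls from 1.0, so the gain is positive (~0.278)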


Gini Index Vs. Entropy In Decision Trees

According to a paper by Laura Elena Raileanu and Kilian Stoffel, the Gini Index and Entropy usually give very similar results when used to score splits. (Source)

However, compared to the Gini Index, entropy is more expensive to compute at every single node because of the logarithm.

For these reasons, in my data mining projects, I will usually stick with the Gini index unless I can show that using Entropy is worth it through cross-validation.


If you’re low on time and are looking for a computationally efficient solution between Gini and Entropy, use Gini – as the switch probably is not worth it.
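
If you do want to check for yourself, here is a minimal sketch of comparing the two criteria with cross-validation in scikit-learn; the built-in breast cancer dataset is used purely for illustration, not the car dataset from this post.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# fit the same tree with each splitting criterion and compare CV accuracy
for criterion in ("gini", "entropy"):
    tree = DecisionTreeClassifier(criterion=criterion, random_state=42)
    scores = cross_val_score(tree, X, y, cv=5)
    print(criterion, round(scores.mean(), 3))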


When is Entropy Better than the Gini Index?

Entropy can perform better than the Gini Index on datasets that are heavily imbalanced.

While the math is a little involved, the logarithm in the Entropy formula gives relatively more weight to low-occurring events.

This difference can make Entropy a better fit for naturally imbalanced problems like fraud detection (and it is closely related to the cross-entropy loss used in deep learning).
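
To see the effect of the logarithm, here is a small sketch (with made-up class proportions) comparing the two measures, each divided by its two-class maximum (0.5 for Gini, 1.0 for entropy); on the skewed node, entropy keeps a relatively higher impurity than Gini.

import numpy as np

def gini(p):
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

def entropy_of(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

# a heavily skewed (fraud-like) node versus a balanced one
for dist in ([0.99, 0.01], [0.5, 0.5]):
    print(dist, round(gini(dist) / 0.5, 3), round(entropy_of(dist) / 1.0, 3))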


Gini Index vs. Entropy (Python Code)

import pandas as pd
import numpy as np

# like always, a publicly available dataset
# https://www.kaggle.com/datasets/hellbuoy/car-price-prediction
df = pd.read_csv('car-price.csv')

# we will keep a few categorical variables,
# use doornumber as our target,
# and the others as our independent variables
df = df[['drivewheel','fueltype','aspiration','doornumber','carbody']]

df.sample(n=10)

[Output: a random sample of the dataset used for modeling]

# function will calculate the gini index for each column
# in a dataframe (from scratch)
# and print out the best column to split on
def gini_index(dataset, targetcol):

    # store all of our columns and gini scores
    gini_scores = []

    # total number of rows, used to weight each branch
    n = len(dataset)

    # iterate through each column in your dataframe
    for col in dataset.columns:

        # skip our target column
        # we can't split on the target itself
        if col == targetcol:
            continue

        # resets for each column in your dataset
        gini = 0

        # get the value counts for that column
        unique_values = dataset[col].value_counts()

        # iterate through each unique value for that column
        for key, val in unique_values.items():

            # get the target variable separated, based on
            # the independent variable
            filteredDf = dataset[targetcol][dataset[col] == key].value_counts()

            # sum of the value counts for that branch
            ValueSum = filteredDf.sum()

            # sum of squared class probabilities within this branch
            p = 0
            for i, j in filteredDf.items():
                p += (j / ValueSum) ** 2

            # weighted gini for the column is the sum of
            # (branch size / total rows) * branch impurity
            gini += (val / n) * (1 - p)

        print(f'Variable {col} has Gini Index of {round(gini,4)}\n')

        # append our column name and gini score
        gini_scores.append((col, gini))

    # the best split is the column with the lowest gini index
    split_pair = min(gini_scores, key=lambda x: x[1])

    # print out the best score
    print(f'''Split on {split_pair[0]} With Gini Index of {round(split_pair[1],3)}''')

    return split_pair


final = gini_index(df, 'doornumber')

[Output: Gini index for each candidate column]

 

import numpy as np
import pandas as pd

def entropy(dataset, targetcol):
    # store all of our columns and information gain scores
    entropy_scores = []

    # entropy of the target column itself (the "parent" node)
    parent_counts = dataset[targetcol].value_counts(normalize=True)
    parent_entropy = -(parent_counts * np.log2(parent_counts)).sum()

    # total number of rows, used to weight each branch
    n = len(dataset)

    # iterate through each column in your dataframe
    for col in dataset.columns:

        if col == targetcol:
            continue

        # weighted entropy of the target after splitting on this column
        weighted_entropy = 0

        for key, val in dataset[col].value_counts().items():

            # distribution of the target within this branch
            branch = dataset[targetcol][dataset[col] == key].value_counts(normalize=True)

            # entropy of the branch, weighted by the branch size
            branch_entropy = -(branch * np.log2(branch)).sum()
            weighted_entropy += (val / n) * branch_entropy

        # information gain = parent entropy - weighted child entropy
        info_gain = parent_entropy - weighted_entropy

        print(f'Variable {col} has Entropy of {round(weighted_entropy,4)} '
              f'and Information Gain of {round(info_gain,4)}\n')

        # append our column name and information gain
        entropy_scores.append((col, info_gain))

    # the best split is the column with the highest information gain
    split_pair = max(entropy_scores, key=lambda x: x[1])

    # print out the best score
    print(f'''Split on {split_pair[0]} With Information Gain of {round(split_pair[1],3)}''')

    return split_pair


final = entropy(df, 'carbody')
final

[Output: entropy and information gain for each candidate column]

Conclusions and Recap

Now that you are well equipped to use Entropy and the Gini index in decision trees and other tree algorithms, make sure you consider the things we’ve listed.

  • Time
  • Type of Problem
  • Computational Costs

While it’s easy to get these two confused, they aren’t the same thing, and depending on your specific situation, you’ll want to choose the right splitting criterion for your machine-learning models.


Other Articles In Our Machine Learning 101 Series

We have many quick guides that go over some of the fundamental parts of machine learning. Some of those guides include:

  • Reverse Standardization: Another staple in our 101 series is an introductory article teaching you about scaling and standardization.
  • Parameter Versus Variable: Commonly misunderstood – these two aren’t the same thing. This article will break down the difference.
  • Criterion Vs. Predictor: One of the first steps in statistical testing is the independent and dependent variables.
  • Heatmaps In Python: Visualizing data is key in data science; this post will teach eight different libraries to plot heatmaps.
  • SelectKBest (Sklearn): Now that you know how to perfect your decision trees, make sure your feature selection is good!
  • CountVectorizer vs. TfidfVectorizer: Interested in learning NLP? This is a great guide to jump into after learning about two famous distributions.
  • Welch’s T-Test: Do you know the difference between Student’s t-test and Welch’s t-test? Don’t worry, we explain it in-depth here.
  • Normal Distribution Vs. Uniform Distribution: Two key distributions that will pop up everywhere in data science.
Stewart Kaplan