Loud and Proud: Verbose in Machine Learning

In machine learning, there are two types: those who like to keep things short and sweet and those who want to explain everything in detail.

I fall into the latter category – I love verbosity.

Some might call it overkill, but I see it as a way to ensure no stone is left unturned.

In this guide, we will explore the ins and outs of verbose in machine learning, including when it should be used and how to implement it correctly in your models.

By the end, you’ll know the following:

  • What Verbosity is
  • Understanding The Output From Verbose
  • Setting Up Verbose with Two Famous Machine Learning Algos
  • When You Should and Shouldn’t use Verbose Settings



What is Verbose in Machine Learning?

In machine learning, “verbose” refers to a particular setting used when training and validating models.

When verbose is turned on, the algorithm will provide more detailed information about its progress as your model iterates through the training process.

It’ll push this output right to your console!

This can be useful for:

  • Debugging
  • Error Finding
  • Understanding your model's progression with offline metrics
  • Early Stopping In Deep Learning

As a warning, the verbose output can sometimes slow down the training process since printing output to your console is much slower than just running the model.

I like to run the algorithm without any verbose setting output and only use it if there are problems with the results.

Realize that in most standard modeling packages, there are different “levels” of verbosity.



For example, here are the levels for the famous Sklearn package.

We will use the GridsearchCV for this example:

Setting Verbose = 0

Silent Modeling! 

Setting Verbose = 1

This will display the computation time for each fold and the parameter candidate.

Setting Verbose = 2

This will display everything from level 1, plus the score for each fold.

Setting Verbose = 3

This will display everything from levels 1 and 2, along with the fold and candidate parameter indexes and the start time of each computation.

This will be slightly different for each model we choose, but we get the general gist: as verbosity increases, we get more output information to our console.
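To see these levels in action, here is a minimal sketch; the estimator and parameter grid are made up for illustration, but verbose is the real GridSearchCV parameter described above.

from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_wine(return_X_y=True)

# verbose controls how much of the search is printed to the console (0-3)
grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={'n_estimators': [50, 100]},
    cv=3,
    verbose=2,
)
grid.fit(X, y)

Try re-running this with verbose set to 0, 1, or 3 and watch the console output grow.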

Reference: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html 


Understanding Verbose Output Within Data Science

One way to approach verbose output from your models is to break it down into smaller chunks.

If you’re looking at a massive block of text, try focusing on one section at a time.

Before you go through each word of your log, you should understand what value you’re trying to optimize (score, computation time, etc.) and focus mainly on that.

What I like to do is look for patterns in the output. Watching the numbers increase or decrease is usually much more helpful than fixating on the EXACT number at an EXACT time.

Verbosity settings are to be used as a guide, and if you come at your modeling process with a goal in mind, verbosity can help you get to the finish line.


Below, we’ll explain how to set up the verbose setting in Python with some famous models so you can use it in any situation.


How To Set Up Verbose in XGBoost Models

For our XGBoost model, we only changed the verbosity setting, moving it from 1 to 3.

Note that while the models below use a setting called “verbose,” in XGBoost this setting is called “verbosity.”

Here is the code we used:

from sklearn.datasets import load_wine
import xgboost as xgb
import numpy as np
from sklearn.metrics import mean_squared_error

wine_df = load_wine()

X = wine_df.data
y = wine_df.target

xgb_model = xgb.XGBRegressor(objective="reg:squarederror", random_state=42, verbosity=1)

xgb_model.fit(X, y)

y_pred = xgb_model.predict(X)

mse = mean_squared_error(y, y_pred)

# np.sqrt(mse) is the root of the MSE, so label it as RMSE
print(f'\n\nModel root mean squared error: {np.sqrt(mse)}')

We can see that our model is silent when verbosity is set to 1.


When increasing this to 2, we see much more output.


Finally, once we’ve passed 3 into our model, we see everything from 2, plus some final metrics and KPIs on model performance.



How To Set Up Verbose in Scikit Learn Models

Here is our code; the model is scikit-learn’s gradient-boosted regressor, and we only changed its verbose setting to get the images below.

from sklearn.datasets import load_wine
from sklearn.ensemble import GradientBoostingRegressor
import numpy as np
from sklearn.metrics import mean_squared_error

wine_df = load_wine()

X = wine_df.data
y = wine_df.target

gb_model = GradientBoostingRegressor(random_state=42, verbose=1)

gb_model.fit(X, y)

y_pred = gb_model.predict(X)

mse = mean_squared_error(y, y_pred)

# np.sqrt(mse) is the root of the MSE, so label it as RMSE
print(f'\n\nModel root mean squared error: {np.sqrt(mse)}')

With Verbose set to 0, we see that our model is “silent.”


When we upgrade this to 1, we see gaps in our iterations.


Finally, when we push this to 2, we see a complete breakdown from each iteration of our model.


How To Set Up Verbose in Deep Learning (Keras) Models

Here is the code we used for our Deep learning example, only changing the verbose setting in the .fit method.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from sklearn.datasets import load_wine

wine_df = load_wine()

X = wine_df.data
y = wine_df.target

model = Sequential()
model.add(Dense(12, input_shape=(X.shape[1],), activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(1, activation='sigmoid'))

model.compile(loss='mean_squared_error', optimizer='adam', metrics=['accuracy'])

model.fit(X, y, epochs=150, batch_size=10, verbose=1)

_, accuracy = model.evaluate(X, y)
print('Accuracy: %.2f' % (accuracy*100))

When verbose is set to 0, our model is silent.


When this is pushed up to 1, we get a live progress bar that updates after every batch within each epoch.


Finally, when this is set to 2, we get a single summary line printed after each epoch, with no progress bar.



When Should You Use Verbose Settings in Machine Learning?

Verbosity will give you a ton of information while building out your models.

This can be super helpful when you’re in the fine-tuning stage or trying to dive deep into a problem you need help with.


Since most verbosity settings are in levels, you can choose a level that gives you the amount of output you need.


When Do We Turn Off Verbose?

When working in data science and on models, it’s critical to strike the right balance between training speed and model performance.

Your boss needs to see results, but they also need to see the right results.

If you build inaccurate models, nobody will want them – but if you never get models out the door, you’ll find yourself needing a new job.

One way to strike this balance is to be selective with the verbose parameter.

You can start your modeling without verbose turned on; if you do not see the results you want, turn it on, as it can help you dive deeper into your model and find the areas that need improvement.


Other Quick Machine Learning Tutorials

At EML, we have a ton of cool data science tutorials that break things down so anyone can understand them.

Below we’ve listed a few that are similar to this guide:

How To Determine Keras Feature Importance
Seeing what features are most important in your models is key to optimizing and increasing model accuracy.

Feature importance is one of the most crucial aspects of machine learning, and sometimes how you got to an answer is more important than the output.

Below we’ll go over how to determine the essential features in your neural networks and other models.

Finding the Feature Importance in Keras Models

The easiest way to find the importance of the features in Keras is to use the SHAP package. SHAP grew out of Professor Su-In Lee’s research at the AIMS Lab, and it estimates each feature’s importance by measuring how much the model’s output and accuracy change when that feature is withheld.

Using SHAP with Keras Neural Network (CNN)

Remember, before using this package, we will need to install it.

pip install shap

Now that our package is installed, we can use SHAP to figure out which features are most important.

import shap

import numpy as np

# Assumes training_data is a NumPy array of model inputs; we sample 100
# random rows as the background dataset for the explainer
letters = training_data[np.random.choice(training_data.shape[0], 100, replace=False)]

This step will assume you’ve already designed your model.

Since this function works by removing features one at a time, you’ll need a finished model before SHAP can test all of your features.

# Assumes model is your trained Keras model and test_data holds held-out inputs
explain = shap.DeepExplainer(model, letters)

our_values_for_shap = explain.shap_values(test_data[1:5])

shap.image_plot(our_values_for_shap, -test_data[1:5])

Explaining a Keras SHAP output

This will result in an image like the one displayed here, a SHAP image plot of handwritten digits.

This image may seem overwhelming, but this answer is helpful.

To explain this, we will start with the four. On the left, we see the bolded four, signifying what a “4” actually looks like.

Blue pixels lower the model’s confidence; for example, the second four from the left has a bit of blue at the bottom. If those pixels are present in our hand drawing, they will decrease our accuracy at predicting “4”.

This makes total sense; if there is a curve at the bottom of the four, the number will look much more like a 0 or a 6.

Here is another example.

This one is much more obvious. Sticking with the colors from before, we quickly notice that our model uses the eyes of the meerkat to accurately predict a specific image as a meerkat, and uses the beak to predict the dowitcher.

This makes sense, as Meerkats have a distinct face region and lack a beak.

Starting your model correctly is how you have success during modeling, we will teach you how in Keras Input Shape.

Some others, like Keras Shuffle, are also super important for modeling accuracy.

What is PCA Used For?

Principal component analysis (PCA) reduces the number of dimensions in your data.

This could take data with 10,000+ columns (variables) down to just two linear combinations (eigenvectors), representing the original 10,000 column dataset.

You could use these linear combinations to model and still see similar accuracies to original modeling techniques, as long as the linear combinations from your projections account for a high amount of variance.

We get a ton of questions about how to know steps per epoch in Keras and we go over it in-depth in that linked article.

PCA For Feature Selection

Feature selection is tough, but some other tools besides SHAP can help you extract the most important data from your training data.

Here’s a unique situation to think through, and it indeed shows the importance of PCA and how reducing your data down into smaller dimensions can help you extract the most essential features.

Suppose you had a super lean dataset with only 1200 rows.


Right away, we know this is a very small dataset, but there is no way that you will be able to get more data.

Let’s also say that you have 2000 columns, which all have predictability and are equally important columns.

We’re now in the data scientist’s worst spot, where we have more variables (dimensions) than rows of data.

We’re going to assume some tricks like columns swapping (where your columns become your rows) aren’t available.

When the number of predictors (your dimensions) is greater than the number of rows, models have difficulty finding unique solutions based on the column make-up.

Without going too far, your models will probably be inadequate and inaccurate.

This is where PCA comes in.


Imagine now that we perform our PCA; we notice that we can cut our 2000 columns down to about 15 eigenvectors while maintaining about 95% of the variance.

We now have our 15 features!

We will build a model over these 15 eigenvectors and use this model for testing and production.
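As a rough sketch of that workflow (the data here is random, so real, correlated columns would compress far more dramatically), scikit-learn’s PCA can pick the number of components for you when you pass a variance fraction:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
X = rng.random((1200, 2000))  # 1200 rows, 2000 columns, as in the example

# Passing a float asks PCA to keep just enough components for 95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                       # (1200, k); on real data k can be tiny
print(pca.explained_variance_ratio_.sum())   # roughly 0.95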

Dense Layer is fundamental in machine learning and is something you should probably check out.

Using PCA in Production Systems

One commonly missed detail is that if you train a model on PCA-transformed data, every other input into that model, whether it’s validation, testing, or production, must also pass through the same fitted PCA process.

People will often perform PCA on their dataset and train up a model utilizing the eigenvectors, but in production, they try to feed regular inputs through it.

This obviously won’t work, as our model is expecting 15 eigenvectors; if you were to push through the original 2000 variables, our model wouldn’t know what to do with them.
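A minimal sketch of the fix, with made-up data shapes: wrap the PCA and the model in a single scikit-learn Pipeline, so every later input is projected through the same fitted PCA before it reaches the model.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X_train, y_train = rng.random((1200, 2000)), rng.random(1200)
X_production = rng.random((5, 2000))  # raw production rows, all 2000 columns

pipe = make_pipeline(PCA(n_components=0.95), LinearRegression())
pipe.fit(X_train, y_train)  # the PCA step is fitted on the training data only

# Production inputs pass through the already-fitted PCA automatically
print(pipe.predict(X_production))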

How does PCA work in Keras?

PCA in Keras works exactly how PCA works in other packages, by projecting your dataset into a different subspace with vectors that maximize the variance in the dataset.

The amount of variance each of these vectors explains is given by its eigenvalue.

This is why we always sort our eigenvalues in order from greatest to least, as we want the features that explain the most variance in our dataset.

Difference between PCA and SHAP

The easiest way to break this down is by keeping it simple.

SHAP is used after a model is already built to see which features are most important and what impact they have on the outcome.

PCA is used before a model is built to reduce the dimensions of your dataset into an advantageous situation.

Mathematically, SHAP and PCA are not at all similar and do not rely on the same techniques to get to their destinations.


SHAP is an iterative approach, trying out a model feature by feature to understand its significance, while PCA is a dataset-wide approach, computing eigenvectors from the covariance matrix between each of the features.

In PCA, the final eigenvectors will have little to no resemblance to the original features, as these new features were computed from the covariance matrix of features.

When to use PCA and when to use SHAP

If you currently have a model that ran and is showing great accuracy on out-of-sample tests, you should use SHAP to understand what features are more critical and to get a better understanding of the features in the model.

Suppose you are running into trouble during the modeling process, like your data being too big or having more columns than rows.

In that case, you should use PCA to transform your dataset into significant eigenvectors that could prove advantageous to the modeling process.

The last use of PCA over SHAP is charting. As humans, once we move past three dimensions (3 columns), it becomes impossible to chart our data.

If we used PCA to reduce our data down into 2-3 dimensions where we could now plot it, this might allow us and our teammates to see insights in the data that initially weren’t visible.

Keras Input Shape: The Beginning of Every Model
Creating different machine learning models in Keras becomes super easy once we understand the fundamentals.

Getting the correct output shape starts with correctly defining the right input shape for your deep learning models.

If you mess this up, you’ll spend a ton of time googling around to figure out why your model will not run correctly.

What is the Keras Input Shape?

The Keras input shape is a parameter for the input layer (InputLayer). You’ll use the input shape parameter to define a tensor for the first layer in your neural network. If your input is an array of n integers, then your input shape would be (n,).

Different Usages of the Input layer

When defining your input layer, you need to consider the specific Keras model you are building.

Image input shape

If your input data is an image and your model is a classification model, you’ll want to define the input shape by the number of pixels and channels.


For classification models, think about your dataset being constrained to some subset of values; for example, if you’re trying to predict on the MNIST dataset, your classification model will try to put each image into a group between [0,9].

A 250×250 pixel image with three channels will be

input_shape=(250,250,3)

However, if the model you are building is more regression-focused, your shape will be much different.

Array input_shape

Let’s say your input will be an array of 600 values; this means you’ll need to define your input shape a bit differently.

You will usually see an array of inputs in supervised learning, where you’re trying to find patterns in a dataset that lead you to a specific target column that you will predict (regression).

A 600-value array would look something like this.

input_shape=(600,)

Many people leave out the comma in the input_shape, but the comma is mandatory in Python.


This is because tensors are created from tuples, and without the comma, Python does not treat the expression as a tuple, making it impossible for the Input function to build the tensor.
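A quick sketch of why (plain Python, nothing Keras-specific):

shape_a = (600,)  # the comma makes this a one-element tuple
shape_b = (600)   # without the comma, this is just the integer 600

print(type(shape_a))  # <class 'tuple'>
print(type(shape_b))  # <class 'int'>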

Keras Input Shapes Batch Dimension

One of the most confusing aspects of the input shape when using Keras is understanding how batching works with this input tensor.

Along with batching, we get a ton of questions about how to know steps per epoch in Keras and we go over it in-depth in that linked article.

Since the input shape describes a single instance of the training data, you will see None for the first dimension whenever you inspect your model during the training process; that slot is reserved for the batch size.

In this Keras example

input_shape=(600,)

If you were to print this out in model.summary(), you would see this.

(None, 600)

The shape tuple shows None in the first position because the batch size is not fixed yet.

Remember, each array was 600 values long, but during training, we will probably be passing in batches that have a similar structure.

If we were to pass in batches of size 30, each batch flowing through the model would have the shape

(30, 600)

as we now have 30 tensors of 600 values each (the batch size).

Knowing how many dimensions your data has is crucial for accurate modeling, and correctly connecting each layer to the previous layer is how you create accurate models.
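Here is a minimal sketch showing where that None appears; the layer sizes are made up for illustration:

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(600,)),          # one sample is a 600-value array
    layers.Dense(4, activation='relu'),
])

# The first dimension of every output shape is the batch size, shown as None
model.summary()  # the Dense layer prints as (None, 4)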


Some people will try defining the batch size in their models; however, this can prove problematic.

Allowing Keras to choose the batch size without user contributions will allow for a fluid input size, meaning the batch size can change at any time.

This is optimal and will allow flexibility in your sequential model and output shape.

Keras Sequential Model

In Keras, much of your modeling can be done with the Sequential parameter.

Here are some imports you may need for this modeling tutorial:

from tensorflow import keras

from tensorflow.keras import layers

Think of the sequential model as a one-way road, where the entrance will be your input layer, then go through some hidden layers to a single output layer.


The input tensor is fundamental; remember, we can define that in a couple of different ways.

In newer versions of TensorFlow (Keras integrated), you’ll see the layer input defined as the following, using Conv2D (dense layers require inputs also).

CNNModel.add(layers.Conv2D(32, (3, 3), activation='elu', input_shape=(32, 32, 3)))

In older versions, you’ll see the Input layer defined as we discussed earlier with something like

CNNModel.add(keras.Input(shape=(32, 32, 3)))

CNNModel.add(layers.Conv2D(32, 3, activation="relu"))

Both of these will work the same way and have the same shape.

When using the model summary, you will be able to see the outline of the model.

If you do not define an input layer while defining your model, you will not be able to call the model summary method until your input shape is defined.

Using the model summary is one of the easiest ways to understand how your model will progress, as you’ll be able to see how the dense layers will be laid out, even if the layer is hidden.

Keras Functional API

In Keras, there is another type of modeling philosophy that you can use.

This is called the functional API and compared to the sequential model, it will allow for multiple inputs and outputs throughout the model.

Instead of having one input layer and one final output layer, you could have multiple input layers and multiple output layers.


This logic also follows for the different hidden layers within the model, as they can also have separate inputs and outputs.

Initially, this may not be very clear, but think of it like an ensemble method from classical machine learning.

Sometimes, you need a little more than just the training data to get to the accuracy or outcome you are looking for.

If you are having accuracy problems, Keras shuffle could help you figure it out.


When you Would Use the Functional Keras API

Let’s say you wanted to classify images, but along with those images, you had some text input (like tags) that existed in a separate database.

While we know we can represent the tags in a one-dimensional array, and we saw previously how we could classify images with a CNN, how would we use these together?

Understanding features during modeling is important. We wrote Keras Feature Importance to give a good intro so you could understand your models better.

What if we handled our tags on one side, our image on the other, and brought them together into a final softmax function for classification?

Instead of now just relying on the image data, utilizing the functional API gave us a bit more data to classify our images correctly.

This is a massive upgrade over our other sequential models, which could only handle images or the tags one at a time.
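Here is a hedged sketch of that idea; every size and name below is made up, but it shows the functional API pattern of two input branches merging into one softmax output:

from tensorflow import keras
from tensorflow.keras import layers

# Branch 1: a small CNN for 32x32 RGB images
image_in = keras.Input(shape=(32, 32, 3), name='image')
x = layers.Conv2D(16, 3, activation='relu')(image_in)
x = layers.GlobalAveragePooling2D()(x)

# Branch 2: a dense layer for a 50-value tag vector
tags_in = keras.Input(shape=(50,), name='tags')
t = layers.Dense(16, activation='relu')(tags_in)

# Merge both branches into a single softmax classification head
merged = layers.concatenate([x, t])
out = layers.Dense(10, activation='softmax')(merged)

model = keras.Model(inputs=[image_in, tags_in], outputs=out)
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam')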

Keras Model Output Shape

The Keras Model output shapes depend entirely on the units defined in the previous layer.

If your previous dense layer was defined with an input shape of (600,), something like

model.add(Dense(units=4, activation='...', input_shape=(600,)))

You will quickly notice

Output Shape = (None, 4)

And we know from earlier that None signifies the batch size.

Keras Model and Reduced Sizing

When building Keras models, you will quickly notice that your models will decrease in size as you move down throughout your model.

The reason for this is simply due to the nature of matrix multiplication.

For example

[Batch, 600] * [600, 4] = [Batch, 4] (output shape)

This brings us back to earlier, where we saw our (None, 4) output tensor.

Keras Shuffle: A full in-depth guide (Get THIS right)

Deep learning can be tricky, but we have some APIs that help us create wonderful models that can quickly converge to a great solution.

The Keras API used for neural networks has risen in popularity for modeling with TensorFlow.

Keras Shuffle is easy to mess up and is essential for your success with modeling and data science.

What is Keras Shuffle?

Keras Shuffle is a modeling parameter asking you if you want to shuffle your training data before each epoch. This parameter should be set to false if your data is time-series and true anytime the training data points are independent.

A successful model starts way before you write your first line of code.

Understanding how you want to set up your batching and epochs is crucial for your model’s success.

We go over Keras Shuffle, the different parameters of Keras Shuffle, when you should set it to True or False, and how to get the best usage out of it below.

Messing up this model parameter will create an overfitting model that isn’t reproducible in the real world.

This is one of the last things we want as machine learning engineers, and we will show you how to avoid this.

What is Keras Shuffle?

In the most basic explanation, Keras Shuffle is a modeling parameter asking you if you want to shuffle your training data before each epoch.

To break this down a little further, if we have one dataset and the number of epochs is set to 5, the model will see the whole dataset 5 times.

Many will set shuffle=True, so your model does not see the training data in the same order for each epoch.

This can improve the model’s accuracy and potentially cover up some bias in your data.

Realize this does not shuffle the validation or test set, so exact reproducibility of each training epoch is impossible. However, model runs can still be compared fairly, as the validation set remains unshuffled and identical for each epoch.

model.fit(x, y, batch_size=400, epochs=5, shuffle=True)

In the above line, the dataset will be used five times and split into chunks of 400 rows each time. Because shuffle=True, the data will be shuffled differently for each of the five epochs.

model.fit(x, y, batch_size=400, epochs=5, shuffle=False)

In the above line, the dataset will again be used five times and split into chunks of 400 rows. Because shuffle=False, your data will be taken in its original sequential order for each of the five epochs.

More information about building out these models is in our article all about the Dense Layer

When would you use Keras Shuffle?

Anytime you are modeling with Keras, Keras Shuffle is in effect.

You do not have to pass it explicitly; it is a parameter of the .fit method with a default value of true.

More information on another .fit parameter can be found here at steps per epoch keras.

Let’s go over a couple of different instances of where’d you would set Keras Shuffle to true and when you would want to set it to false.

If you’re doing any classification, you’re going to want to have shuffle set to true. 

Also, if you are dealing with any independent data, you’re going to want to set shuffle to true.

This is because shuffling has been shown to reduce overfitting in in-sample scores.

Your goal is to have shuffle set to true, as this will improve your model.

Sometimes, you cannot have shuffle set to true, and your model must see the data in its original order.

Most machine learning algorithms have an underlying assumption.


This assumption is that each instance or row of your data is independent of the others.

We cannot shuffle time-series data because the data are no longer independent from each other.

Think about the stock market; one of the most significant indicators of a stock’s current position is the previous one.

For that to be true, how could this current instance be independent of the last?

(They aren’t Independent, and stock market data is time-series)

Now think what would happen if you shuffle that data.

Let’s say your training data included the highlighted data points below (times t-2, t-1, and t+1), and the value at time t was put inside your testing set.

Time   Value
t-2    36
t-1    42
t      x
t+1    58

Quickly we see how unfair it is to possess data from both the past and the future in the training set, as predictions for t are now capped between [42, 58].

We will see an incredibly high test accuracy when running our tests and validations.

However, once this model is deployed, the accuracy will quickly fall off.

Because in the real world we will never possess time t+1, as the future doesn’t exist (at least in machine learning), we won’t have a data point for it.

Our testing accuracy will quickly plummet without this upper bound on the current prediction. 

Starting your model correctly is how you have success during modeling, we will teach you how in Keras Input Shape.

Parameters of Keras Shuffle?

Remember from earlier that Keras Shuffle is either true or false.

Keras Shuffle is set to true by default, so even if you omit the parameter entirely, your data will automatically be shuffled during training.

Keras Shuffle or Train Test Split?

This is a great question, but it is fundamentally wrong to compare them.

Keras Shuffle is an intra-training set decision, meaning that whatever you choose, this will only be applied to the training set. 

This does not affect validation or test sets, and only the trained model will be different based on this parameter.

The trained model will be different because it will see either shuffled or non-shuffled data.

Understanding features during modeling is important. We wrote Keras Feature Importance to give a good intro so you could understand your models better.


Train Test Split is much more about separating training and testing data sets. 

Whenever you apply train test split, you’re slicing your data into entirely different data sets.

These two work very well together.

And using train test split to create the training and validation sets for your deep learning model (Keras API) will enable you to test the accuracy during training quickly.

The same rules apply for shuffling during train test split as they do for Keras Shuffle.

Most of the time, you’ll want to shuffle while splitting your data,  but if your data is not independent (time-series), you will not be able to shuffle at any part of your pipeline. 

So the question is not Keras Shuffle or Train Test Split? 

It’s More Keras Shuffle and Train Test Split?
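As a small sketch of the two working together (assuming X, y, and a compiled model are already defined, as in the pandas example below):

from sklearn.model_selection import train_test_split

# Shuffle at the split for independent rows; pass shuffle=False in both
# places if your data is time-series
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=42)

model.fit(X_train, y_train, batch_size=400, epochs=5,
          shuffle=True, validation_data=(X_val, y_val))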

Keras Pandas Example

Remember to import pandas, with the assumption of import pandas as pd

df = pd.read_csv(file)

You will now need to grab your target variable 

y = df['target']

Remove that target variable from your training set

df = df.drop(['target'], axis=1)

Convert this to a tensor (this assumes TensorFlow is imported as import tensorflow as tf); that way, Keras Shuffle is available, and you may also need to convert your target variable.

df = tf.convert_to_tensor(df.values)

You will need to build out your model in a function (named define_some_model here) and define it with some name (we use model)

model = define_some_model()

Finally, call your model on your dataset and target variable.

model.fit(df, y, batch_size=400, epochs=5, shuffle=True)

Frequently Asked Questions

Keras Databricks

Databricks runtime includes both TensorFlow and the Keras API. Databricks is a perfect pick for deep learning. Having access to distributed training will allow you to create deep learning models that wouldn’t be available on your computer due to resource limits.

Keras Softmax Loss

The softmax activation is the perfect last layer of probabilistic models. This is because softmax produces a vector of K values that sum to 1, giving an output that indicates which class the model prefers. You will see this a lot in categorical models.

Keras Pyspark

Pyspark and Keras are an incredible duo. Pyspark allows you access to distributed data, meaning you will have more data for modeling. Since Keras is an API that sits on TensorFlow, and deep learning networks are known for doing best with high quantities of data, combining these two is very harmonious. 

How to Know Steps Per Epoch Keras (Set This Correctly)
Keras, while powerful, does have many different hyperparameters to choose from.

Messing up steps_per_epoch while modeling with the .fit method in Keras can create a ton of problems.

This guide will show you what steps_per_epoch does, how to figure out the correct number of steps, and what happens if you choose steps_per_epoch wrong.

Steps Per Epoch Keras

The best way to set steps per epoch in Keras is by monitoring your computer memory and validation scores. If your computer runs out of memory during training, increase the steps_per_epoch parameter. If your training score is high, but your validation score is low, you’ll want to decrease steps_per_epoch as you are overfitting.

How does the number of steps affect batch size?

Remember, in machine learning, an epoch is one forward pass and backward pass of all the available training data.

If you have a dataset with 2500 lines, once all 2500 lines have been through your neural network’s forward and backward pass, this will count as an epoch.

Batch size in Keras

Continuing with our previous example, we still have 2500 lines in our dataset.

However, what happens if your computer cannot load 2500 lines into memory to train on?

This data needs to be split up.

We will need to split up our dataset into smaller chunks; that way, we can process our input.


This would work for us: our computer can handle 1250 lines at a time, and using more than one batch allows us to update the weights twice per epoch instead of once.

Understanding features during modeling is important. We wrote Keras Feature Importance to give a good intro so you could understand your models better.

Downside of setting the batch size in Keras

The above process worked great, but what if we don’t always know the size of our training data?

For example, if one epoch is 3000 lines, the next epoch is 3103 lines, and the third epoch is 3050 lines.

These extra 100 lines probably wouldn’t matter much for our model or computer memory, but how would you know what to set the batch_size to?

You could write a function or try some default argument to test if you could figure it out, but it would probably be a waste of development time.

Using steps_per_epoch with training data

Let’s continue with our example above, where one epoch is 3000 lines, the next is 3103 lines, and the third is 3050 lines.

We have a general idea of the maximum number of rows our machine can handle per batch, but it would be hard to know whether it should be 1500 or 1525.

This is where steps_per_epoch comes in.

Instead of our picture above, what if we just set steps_per_epoch = 2?


Setting steps_per_epoch within your model allows you to handle any situation where the number of samples in your epoch is different.

We can see that the number of batches has not changed, and even though the batch size has gone up and down, we will still have the same batch size per epoch.
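Here is a minimal sketch of that setup; the model and data sizes are invented, but it shows steps_per_epoch defining the epoch boundary when the dataset repeats:

import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

X = np.random.rand(3000, 10).astype('float32')
y = np.random.rand(3000).astype('float32')

model = tf.keras.Sequential([
    layers.Dense(8, activation='relu', input_shape=(10,)),
    layers.Dense(1),
])
model.compile(loss='mse', optimizer='adam')

# Repeat the data so every epoch can draw as many batches as it needs,
# then let steps_per_epoch cap an epoch at 2 batches of 1500 rows
dataset = tf.data.Dataset.from_tensor_slices((X, y)).batch(1500).repeat()
model.fit(dataset, epochs=3, steps_per_epoch=2)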

Validation Steps

Now that we’ve reviewed steps per epoch, how does this affect our model outcomes?

During training, you’ll want to make sure your setup is correct, and one way to test this while you train your model is by using validation data.

Continuing with our example above, we know we want to increase the number of iterations in each epoch, as it increases the number of times we update the weights in our neural network.

Many newcomers will then ask: why don’t we just set steps per epoch equal to the number of samples in our epoch?

This is a great thought, but doing this while training will result in some horrible accuracy problems.

Validation data

Continuing on the thought above, why can’t we just set the batch size = 1, or steps per epoch equal to the amount of data in the epoch?

Doing this will result in a flawed model, as the model is trained incorrectly and will overfit the training data.

So, we know we need multiple batches, but we can’t set the number of batches equal to the amount of data.

How do we know how big our batch size is supposed to be?

This is where we use our validation data.

Finding the correct steps per epoch

Modeling in machine learning is an iterative process, and very rarely will you get it right on the first try.

One of the keys to modeling correctly is the different layers. Dense Layer is fundamental in machine learning and is something you should probably check out.

It doesn’t matter how long you’ve been at this or how much knowledge you have on the topic; machine learning is (literally) about trial and error.

The best way to find the steps_per_epoch hyperparameter is by testing.

I like to start with a value at around ten and go up or down based on the size of my training data and the accuracy of my validation dataset.

For example, if you start modeling and quickly run out of memory, you need to increase your steps per epoch. This will lower the amount of data being pushed into memory and will (theoretically) allow you to continue modeling.

Starting your model correctly is how you have success during modeling, we will teach you how in Keras Input Shape.

But remember, as we increase this number, we become susceptible to overfitting the training data.

How to know when you’re overfitting on training data in Keras

In Keras, there is another parameter called validation_split.

Some others, like Keras Shuffle, are also super important for modeling accuracy.

This value is a decimal value that will tell your Keras model how much of the data to leave out to test against.

Your model will have never seen this data before, and after each epoch, Keras will test your trained model against this validation data.

Let’s say that we set validation_split = .2; this will hold out 20% of the data from our training.
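A minimal sketch (random stand-in data, sizes made up) of what that looks like and how to compare the two scores:

import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

X = np.random.rand(500, 13)
y = np.random.rand(500)

model = Sequential([
    Dense(12, input_shape=(13,), activation='relu'),
    Dense(1),
])
model.compile(loss='mean_squared_error', optimizer='adam')

# Keras holds out the last 20% of rows and scores them after every epoch
history = model.fit(X, y, epochs=20, batch_size=32, validation_split=0.2)

# A training loss far below val_loss is a sign of overfitting
print(history.history['loss'][-1], history.history['val_loss'][-1])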

While modeling, we expect our training accuracy and validation accuracy to be pretty close.

What happens when our training accuracy is much higher than our scores on the validation data?

This sadly means we have overfit on our data and need to make changes in the model to combat this.

To make things clear, I do want to say that there are many reasons why you can overfit a model, and steps_per_epoch is just one of them.

But one of the first things that I do if I am overfitting during modeling is reduce the number of steps_per_epoch that I previously set.

This will increase the amount of data in each iteration, hopefully keeping the model from overlearning some of the noise in our data.

Dense Layer: The Building Block to Neural Networks
The Dense layer is a critical component in Machine Learning.

While the most straightforward layer, the dense layer is still vital in any neural network design and is one of the most commonly used layers.

Below we will be breaking down the output generated from a dense layer, input arrays, and the difference between a dense layer versus some other layers.

What is a layer?

Layers are made of nodes, and the nodes provide an environment to perform computations on data.

In simpler terms, think of a neural network as a stadium, a layer as a row of seats in a stadium, and a node as each seat.

A node combines the inputs of a data set with weighted coefficients that either amplify or dampen those inputs.


These rows and seats work together to get us to the final output layer, which will contain our final answers (based on how we defined the previous layers).

What is Keras?

Keras is a Python API that runs on top of the machine learning platform TensorFlow.

Keras enables users to add several prebuilt layers in different Neural network architectures.

When TensorFlow was initially released, it was pretty challenging to use.

Learning any Machine Learning framework will not be easy, and there will always be a learning curve, but early TensorFlow was pretty low-level and took a ton of time to learn.

Keras is a Python library that builds on top of TensorFlow, offering a user-friendly interface, faster production deployment, and faster initial development of machine learning models.

Using Keras makes the overall experience of TensorFlow easier.

Realize, before 2017, Keras was only a stand-alone API.

Now, TensorFlow has fully integrated Keras, but you can still use the Keras API by itself, and the stand-alone API usually is more up-to-date with newer features.

Understanding features during modeling is important. We wrote Keras Feature Importance to give a good intro so you could understand your models better.

Keras Layers

Keras Layers are the building blocks of the whole API.

We will stack these layers together to create our models, but you could also have a single dense layer that acts as something as simple as a linear regression model or multiple dense layers (with a hidden layer) to create a neural network.

Changing one of the layers in a neural network will change the results in the final output arrays.

Types of Layers in Keras

The core layers within the Keras API are

  • Dense Layer
  • Input Layer
  • Activation Layer
  • Embedding Layer
  • Masking Layer
  • Lambda Layer

The Dense Layer is the most commonly used, and there is some slight overlap in these Keras layers.

For example, a parameter passed within a dense layer can be the activation function, or you can pass an activation function as a layer in a sequential model.

In future posts, we will be going more in-depth into activation functions and other deep learning model features. More information on modeling can be found here at steps per epoch keras.

What the Dense Layer Performs

The dense layer performs the following calculation

outputs = activation(dot(input, kernel) + bias)

Let’s break this down a bit (from the inside out).

What is the Input Matrix

Your input data will be passed as a matrix into your dense layer.

If your input data, for example, is a data frame with m rows and n columns, your matrix will have the same m rows and n columns, just without the column identifiers.


We go over input data in-depth and much more about Keras in our other post, Keras Shuffle.

What is the Kernel Weights Matrix

Each Kernel weight matrix is specific to that dense layer and node (think about row number and seat number).

The kernel weights matrix is the heart of the neural network; as data progresses from dense layer to dense layer, these weights are updated through backpropagation.

The Kernel weights matrix is updated after every run, and the new weights matrix created will contain new weights to multiply the input data by.

The weight matrix is crucial to understand. Many newcomers to machine learning have trouble understanding the vector shape needed to do the dot product between the input data and weight matrix.

What is the Dot Product of the Input and Kernel?

For each node, the dot product between the input vector and that node’s kernel weights produces a single scalar value.

This throws off some people who expect another matrix from the dot product and are unfamiliar with the difference between a dot product and matrix multiplication.

The value received from this dot product of the Input and Kernel is the value that will be passed onward in your neural network before applying any bias to it.

What is the Bias Vector?

To understand the bias vector, let’s go back to one of the most simple fundamentals of mathematics.

The equation of a line

y = mx + B

Now, I know it isn’t talked about a bunch, but that B term is the bias of a line.

Understanding bias’s effect is simpler when you can see it in action.

Here is the equation y = 1x + 0

(plot of y = 1x + 0)

Here is the equation y= 1x + 2

(plot of y = 1x + 2)

In our first plot, even though the slope is the same, the line never passes through (2,2); if the function we’re trying to fit needs to predict the point (2,2), that wouldn’t be possible.

However, once we added bias, our function went right through the point (2,2) and would give us that exact prediction with input x = 2.

A bias vector applies this same logic; instead of a single bias term, there are n bias terms, where n is the number of nodes in the layer.

Keras Dense Layer Activation Function

Here is a list of the different dense layer activation functions

  • relu
  • sigmoid
  • softmax
  • softplus
  • softsign
  • tanh
  • selu
  • elu
  • exponential



We know how the inside of our dense layer formula works; the last part is the activation function.

Remember, after our bias is applied, we will have a vector.

So, we have

outputs = activation(vector)

Our activation can be any of the functions listed above; we will select the relu activation function here.

The relu function will take each value in the vector and keep it if it’s above zero or replace it with zero if not.

Dense Layer Examples

For example, an input vector of [-1, 2, -4, 2, 4] (after our dot product and applying our bias vector) will become the output vector [0, 2, 0, 2, 4], with the same output shape.

Starting your model correctly is how you have success during modeling; we will teach you how in Keras Input Shape.
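To make the whole formula concrete, here is a tiny NumPy sketch of one node’s computation (all numbers invented):

import numpy as np

x = np.array([1.0, -2.0, 0.5])   # input vector (one row of data)
w = np.array([0.4, 0.1, -0.6])   # this node's kernel weights
b = 0.2                          # this node's bias term

z = np.dot(x, w) + b             # dot(input, kernel) + bias -> one scalar
output = np.maximum(z, 0.0)      # relu: keep positives, zero out negatives

print(z, output)                 # 0.1 0.1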

Frequently Asked Questions

Are the Dense Layers Always Hidden?

Dense layers are typically hidden because a neural network is initialized with an input layer, and the outputs come from an output layer. The dense layers in the middle are not directly accessible, so they are hidden.

What’s the difference between a hidden layer and a fully connected layer?

A fully connected layer has weights connected to all the output values from the previous layer, while a hidden layer is just a layer that is not the input or output layers. A fully connected layer can be a hidden layer, but these two can also exist separately.

What is a densely connected layer?

A densely connected layer is another word for a dense layer. A dense layer is densely connected to the output layer before it, whether an input layer or another dense layer.
