Keras Normalization Layers- Batch Normalization and Layer Normalization Explained for Beginners

Keras Normalization Layers- Batch Normalization and Layer Normalization Explained for Beginners
Keras Normalization Layers- Batch Normalization and Layer Normalization Explained for Beginners


In this article, we will go through the tutorial for Keras Normalization Layer where will understand why a normalization layer is needed. We will also see what are the two types of normalization layers in Keras – i) Batch Normalization Layer and ii) Layer Normalization Layer and understand them in detail with the help of examples. We will finally do a comparison between batch normalization vs layer normalization with their advantages and disadvantages.

But before going into that let us first understand why Normalization is needed in the first place.

What is Normalization?

Normalization is a method usually used for preparing data before training the model. The main purpose of normalization is to provide a uniform scale for numerical values. If the dataset contains numerical data varying in a huge range, it will skew the learning process, resulting in a bad model. The normalization method ensures there is no loss of information and even the range of values isn’t affected.

Normalization is done by the below formula, by subtracting the mean and dividing by the standard deviation.

Batch Normalization vs Layer Normalization

Deep Learning Specialization on Coursera

In spite of normalizing the input data, the value of activations of certain neurons in the hidden layers can start varying across a wide scale during the training process. This means the input to the neurons to the next hidden layer will also range across the wide range, bringing instability.

To deal with this problem, we use the techniques of “batch normalization” layer and “layer normalization” layer. Let us see these two techniques in detail along with their implementation examples in Keras.

Batch Normalization Layer

keras batch normalization
batch normalization (Source)

Batch Normalization Layer is applied for neural networks where the training is done in mini-batches. We divide the data into batches with a certain batch size and then pass it through the network. Batch normalization is applied on the neuron activation for all the samples in the mini-batch such that the mean of output lies close to 0 and the standard deviation lies close to 1. It also introduces two learning parameters gama and beta in its calculation which are all optimized during training.

Advantages of Batch Normalization Layer

  1. Batch normalization improves the training time and accuracy of the neural network.
  2. It decreases the effect of weight initialization.
  3. It also adds a regularization effect on the network.
  4. It works better with the fully Connected Neural Network (FCN) and Convolutional Neural Network.

Disadvantages of Batch Normalization Layer

  1. Batch normalization is dependent on mini-batch size which means if the mini-batch size is small, it will have little to no effect
  2. If there is no batch size involved, like in traditional gradient descent learning, we cannot use it at all.
  3. Batch normalization does not work well with Recurrent Neural Networks (RNN)

Syntax for Batch Normalization Layer in Keras


Keras Batch Normalization Layer Example

In this example, we’ll be looking at how batch normalization layer is implemented.

First, we load the libraries and packages that are required. We also import kmnist dataset for our implementation.

Install Keras Dataset

In [1]:
! pip install extra_keras_datasets
Collecting extra_keras_datasets
Requirement already satisfied: numpy in /usr/local/lib/python3.6/dist-packages (from extra_keras_datasets) (1.18.5)
Requirement already satisfied: scipy in /usr/local/lib/python3.6/dist-packages (from extra_keras_datasets) (1.4.1)
Requirement already satisfied: pandas in /usr/local/lib/python3.6/dist-packages (from extra_keras_datasets) (1.1.4)
Requirement already satisfied: scikit-learn in /usr/local/lib/python3.6/dist-packages (from extra_keras_datasets) (0.22.2.post1)
Requirement already satisfied: python-dateutil>=2.7.3 in /usr/local/lib/python3.6/dist-packages (from pandas->extra_keras_datasets) (2.8.1)
Requirement already satisfied: pytz>=2017.2 in /usr/local/lib/python3.6/dist-packages (from pandas->extra_keras_datasets) (2018.9)
Requirement already satisfied: joblib>=0.11 in /usr/local/lib/python3.6/dist-packages (from scikit-learn->extra_keras_datasets) (0.17.0)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.6/dist-packages (from python-dateutil>=2.7.3->pandas->extra_keras_datasets) (1.15.0)
Installing collected packages: extra-keras-datasets
Successfully installed extra-keras-datasets-1.2.0

Import Libraries

In [2]:
from extra_keras_datasets import kmnist
import tensorflow
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.layers import Conv2D, MaxPooling2D
from tensorflow.keras.layers import BatchNormalization

Model Configuration

In the below code snippet we are specifying the batch size as 250, the number of epochs executed is 25, the data will be classified into 10 different classes, 20% of the training data is used as the validation set and lastly, verbosity is set to true.

In [3]:
# Model configuration
batch_size = 250
no_epochs = 25
no_classes = 10
validation_split = 0.2
verbosity = 1

Download and Load Dataset

Next, we load the MNIST dataset from Keras datasets module.

In [4]:
# Load KMNIST dataset
(input_train, target_train), (input_test, target_test) = kmnist.load_data(type='kmnist')
Downloading data from
18391040/18384171 [==============================] - 4s 0us/step
Downloading data from
32768/29700 [=================================] - 0s 5us/step
Downloading data from
3080192/3079479 [==============================] - 2s 1us/step
Downloading data from
8192/5304 [==============================================] - 0s 0us/step

Here we process the data, this helps it in preparing data for training.

In [5]:
# Shape of the input sets
input_train_shape = input_train.shape
input_test_shape = input_test.shape

Now we are obtaining the shape of both training and testing datasets. We also set the shape of keras input data.

In [6]:
# Keras layer input shape
input_shape = (input_train_shape[1], input_train_shape[2], 1)

To include channels during training, we are using the below code. There are instances when there are no apt channels, so we need to reshape the data for including those channels.

In [7]:
# Reshape the training data to include channels
input_train = input_train.reshape(input_train_shape[0], input_train_shape[1], input_train_shape[2], 1)
input_test = input_test.reshape(input_test_shape[0], input_test_shape[1], input_test_shape[2], 1)

The pre-processing of data is almost completed, we just need to convert data type to float32 format, it definitely makes the process faster.

In [8]:
# Parse numbers as floats
input_train = input_train.astype('float32')
input_test = input_test.astype('float32')

At last, we normalize the data for better results.

In [9]:
# Normalize input data
input_train = input_train / 255
input_test = input_test / 255

Designing the model architecture

In this section, we will be designing our neural network using the Sequential API of Keras. Notice we have used batch normalization layers in the design. Along with that, we have also used dense, convolutional, and pooling layers in the architecture.

In [10]:
# Create the model
model = Sequential()
model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Conv2D(64, kernel_size=(3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dense(256, activation='relu'))
model.add(Dense(no_classes, activation='softmax'))

Model compilation & fitting data

Now we need to compile our model. Model Compilation helps in instantiating the model architecture. In this step we are using sparse categorical crossentropy loss, adam optimizer is used for model optimization.

In [11]:
# Compile the model

Training the Model

At last, we are fitting the data to our model for training.

Here the input training set is provided corresponding targets, and then it is trained to pre-decided batch_size and number of epochs.

In [12]:
# Fit data to model
history =, target_train,
Epoch 1/25
192/192 [==============================] - 52s 269ms/step - loss: 0.2153 - accuracy: 0.9375 - val_loss: 4.0184 - val_accuracy: 0.1009
Epoch 2/25
192/192 [==============================] - 52s 270ms/step - loss: 0.0504 - accuracy: 0.9870 - val_loss: 2.3883 - val_accuracy: 0.2734
Epoch 3/25
192/192 [==============================] - 52s 269ms/step - loss: 0.0205 - accuracy: 0.9963 - val_loss: 0.3584 - val_accuracy: 0.8750
Epoch 4/25
192/192 [==============================] - 52s 270ms/step - loss: 0.0083 - accuracy: 0.9991 - val_loss: 0.0639 - val_accuracy: 0.9811
Epoch 5/25
192/192 [==============================] - 52s 270ms/step - loss: 0.0033 - accuracy: 0.9999 - val_loss: 0.0576 - val_accuracy: 0.9830
Epoch 6/25
192/192 [==============================] - 52s 270ms/step - loss: 0.0015 - accuracy: 1.0000 - val_loss: 0.0568 - val_accuracy: 0.9830
Epoch 7/25
192/192 [==============================] - 52s 269ms/step - loss: 9.4551e-04 - accuracy: 1.0000 - val_loss: 0.0558 - val_accuracy: 0.9839
Epoch 8/25
192/192 [==============================] - 57s 299ms/step - loss: 6.4900e-04 - accuracy: 1.0000 - val_loss: 0.0560 - val_accuracy: 0.9838
Epoch 9/25
192/192 [==============================] - 52s 269ms/step - loss: 4.9284e-04 - accuracy: 1.0000 - val_loss: 0.0561 - val_accuracy: 0.9837
Epoch 10/25
192/192 [==============================] - 52s 269ms/step - loss: 3.9835e-04 - accuracy: 1.0000 - val_loss: 0.0559 - val_accuracy: 0.9843
Epoch 11/25
192/192 [==============================] - 52s 269ms/step - loss: 3.0890e-04 - accuracy: 1.0000 - val_loss: 0.0560 - val_accuracy: 0.9843
Epoch 12/25
192/192 [==============================] - 52s 268ms/step - loss: 2.5462e-04 - accuracy: 1.0000 - val_loss: 0.0570 - val_accuracy: 0.9842
Epoch 13/25
192/192 [==============================] - 52s 269ms/step - loss: 2.2261e-04 - accuracy: 1.0000 - val_loss: 0.0565 - val_accuracy: 0.9843
Epoch 14/25
192/192 [==============================] - 52s 270ms/step - loss: 1.8208e-04 - accuracy: 1.0000 - val_loss: 0.0576 - val_accuracy: 0.9842
Epoch 15/25
192/192 [==============================] - 52s 270ms/step - loss: 1.5766e-04 - accuracy: 1.0000 - val_loss: 0.0579 - val_accuracy: 0.9845
Epoch 16/25
192/192 [==============================] - 52s 270ms/step - loss: 1.3558e-04 - accuracy: 1.0000 - val_loss: 0.0583 - val_accuracy: 0.9846
Epoch 17/25
192/192 [==============================] - 52s 269ms/step - loss: 1.1254e-04 - accuracy: 1.0000 - val_loss: 0.0590 - val_accuracy: 0.9846
Epoch 18/25
192/192 [==============================] - 52s 269ms/step - loss: 1.0031e-04 - accuracy: 1.0000 - val_loss: 0.0587 - val_accuracy: 0.9846
Epoch 19/25
192/192 [==============================] - 51s 268ms/step - loss: 8.5360e-05 - accuracy: 1.0000 - val_loss: 0.0594 - val_accuracy: 0.9844
Epoch 20/25
192/192 [==============================] - 51s 267ms/step - loss: 7.5436e-05 - accuracy: 1.0000 - val_loss: 0.0598 - val_accuracy: 0.9847
Epoch 21/25
192/192 [==============================] - 51s 267ms/step - loss: 6.5648e-05 - accuracy: 1.0000 - val_loss: 0.0601 - val_accuracy: 0.9847
Epoch 22/25
192/192 [==============================] - 51s 267ms/step - loss: 5.8587e-05 - accuracy: 1.0000 - val_loss: 0.0606 - val_accuracy: 0.9844
Epoch 23/25
192/192 [==============================] - 52s 269ms/step - loss: 5.1551e-05 - accuracy: 1.0000 - val_loss: 0.0614 - val_accuracy: 0.9843
Epoch 24/25
192/192 [==============================] - 52s 271ms/step - loss: 4.4635e-05 - accuracy: 1.0000 - val_loss: 0.0616 - val_accuracy: 0.9847
Epoch 25/25
192/192 [==============================] - 52s 270ms/step - loss: 3.8689e-05 - accuracy: 1.0000 - val_loss: 0.0613 - val_accuracy: 0.9848

Generating evaluation metrics

Now we will evaluate the performance of our model.

In [13]:
# Generate generalization metric  s
score = model.evaluate(input_test, target_test, verbose=0)
print(f'Test loss: {score[0]} / Test accuracy: {score[1]}')
Test loss: 0.22573837637901306 / Test accuracy: 0.9553999900817871

Layer Normalization Layer

Batch Normalization vs Layer Normalization
Batch Normalization vs Layer Normalization (Source)

The next type of normalization layer in Keras is Layer Normalization which addresses the drawbacks of batch normalization. This technique is not dependent on batches and the normalization is applied on the neuron for a single instance across all features. Here also mean activation remains close to 0 and mean standard deviation remains close to 1.

Advantages of Layer Normalization

  1. It is not dependent on any batch sizes during training.
  2. It works better with Recurrent Neural Network.

Disadvantages of Layer Normalization

  1. It may not produce good results with Convolutional Neural Networks (CNN)

Syntax of Layer Normalization Layer in Keras


Keras Normalization Layer Example

Let us see the example of how does LayerNormalization works in Keras. For this, we will be using the same dataset that we had used in the above example of batch normalization. Hence we are skipping the data download and preprocessing part for which you can refer to the above example. We will directly go to designing and training the neural network.

Designing the model architecture

In [15]:
import tensorflow as tf
In [16]:
model_lay = tf.keras.models.Sequential([
  # Reshape into "channels last" setup.
  tf.keras.layers.Reshape((28,28,1), input_shape=(28,28)),
  tf.keras.layers.Conv2D(filters=10, kernel_size=(3,3),data_format="channels_last"),
  # LayerNorm Layer
  tf.keras.layers.LayerNormalization(axis=3 , center=True , scale=True),
  tf.keras.layers.Dense(128, activation='relu'),
  tf.keras.layers.Dense(10, activation='softmax')

Compiling the Model

In [17]:
# Compiling Layer Normalization Model

Fitting the Model

In [19]:
# Fitting Layer Normalization Model
his_lay =, target_train)
1875/1875 [==============================] - 41s 22ms/step - loss: 0.3809 - accuracy: 0.8847

Evaluating Layer Normalization Model

In [20]:
# Generate generalization metric  s
score = model_lay.evaluate(input_test, target_test, verbose=0)
print(f'Test loss: {score[0]} / Test accuracy: {score[1]}')
Test loss: 0.45526155829429626 / Test accuracy: 0.8698999881744385

Comparing the results of two methods

As we look at the accuracy of the two methods on test data, we can see that batch normalization achieved 96% accuracy whereas layer normalization achieved 87% accuracy. We can see that on CNN, batch normalization produced better results than layer normalization.

Batch Normalization vs Layer Normalization

Before wrapping up, let us summarise the key difference between Batch Normalization and Layer Normalization in deep learning that we discussed above.

  1. Batch Normalization depends on mini-batch size and may not work properly for smaller batch sizes. On the other hand, Layer normalization does not depend on mini-batch size.
  2. In batch normalization, input values of the same neuron for all the data in the mini-batch are normalized. Whereas in layer normalization, input values for all neurons in the same layer are normalized for each data sample.
  3. Batch normalization works better with fully connected layers and convolutional neural network (CNN) but it shows poor results with recurrent neural network (RNN). On the other hand, the main advantage of Layer normalization is that it works really well with RNN.


In this tutorial, we learned about the Keras normalization layer and its different types i.e. batch normalization and layer normalization. We saw the syntax, examples, and did a detailed comparison of batch normalization vs layer normalization in Keras with their advantages and disadvantages.

Reference Keras Documentation


Please enter your comment!
Please enter your name here