# Batch vs Layer Normalization in Deep Neural Nets. The Illustrated Way!

Batch Normalization (BN) and Layer Normalization (LN) are widely used techniques in deep learning. They ease the optimization process and help very deep networks converge faster.

BN has been applied successfully to vision tasks, while LN is used mostly for sequential tasks, mainly in NLP.

Both are normalization techniques applied to the input of each layer, and both
calculate the same two statistics, *mean* and *variance*. They differ only in the
dimensions over which those statistics are computed.

Fully understanding the difference between *BN* and *LN* is not quite
straightforward. For this reason, in this blog we explain batch and layer normalization
with intuitive illustrations.

# Batch Normalization

Batch Normalization (BN) was first introduced to address *internal covariate shift*,
i.e. the change in the distributions of the inputs to the hidden layers in the course
of training.

In general, *BN* accelerates the training of deep neural nets. It also reduces the dependence
of gradients on the scale of the parameters (or of their initial values), which in turn
allows the use of much higher learning rates. However, it has one drawback: it requires
a sufficiently large batch size.
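The reduced dependence on parameter scale can be checked numerically. The snippet below is a minimal sketch, not part of the original post: `batch_normalize` is a hypothetical helper, and the shapes and the scale factor of 10 are arbitrary choices. It shows that scaling the weights of a linear layer leaves the batch-normalized output practically unchanged.

```python
import torch

torch.manual_seed(0)


def batch_normalize(h: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Normalize each feature (column) over the batch dimension (hypothetical helper)."""
    mean = h.mean(dim=0, keepdim=True)
    var = h.var(dim=0, unbiased=False, keepdim=True)
    return (h - mean) / torch.sqrt(var + eps)


x = torch.randn(64, 8, dtype=torch.float64)  # 64 samples, 8 input features
w = torch.randn(8, 4, dtype=torch.float64)   # weights of a linear layer with 4 outputs

out = batch_normalize(x @ w)
out_scaled = batch_normalize(x @ (10.0 * w))  # same layer, weights scaled by 10

# The two outputs agree up to the small effect of eps
print(torch.allclose(out, out_scaled, atol=1e-4))
```

Because the mean and the standard deviation scale together with the weights, the scale factor cancels out in the normalized output.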

To save us the pain of reading the entire paper, without going too much into the details,
the essential part on how *Batch Normalization* works is illustrated in the image below:

In *Batch Normalization* the *mean* and *variance* are calculated for each individual
channel across all elements (pixels or tokens) in all samples of the batch.

It may sound counterintuitive at first, but it is called *Batch Normalization* because
the statistics are computed over all samples in the batch.
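As a rough sketch of the per-channel statistics described above, the snippet below computes them by hand and compares the result with PyTorch's `torch.nn.functional.batch_norm`. The `(batch, channels, length)` layout is an assumption made here because that is what `F.batch_norm` expects; the tensor sizes are arbitrary.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Assumed layout: (batch, channels, length)
x = torch.randn(4, 3, 5)
eps = 1e-5

# One mean and one variance per channel, reduced over the batch and length axes
mean = x.mean(dim=(0, 2), keepdim=True)                # shape (1, 3, 1)
var = x.var(dim=(0, 2), unbiased=False, keepdim=True)  # shape (1, 3, 1)
x_hat = (x - mean) / torch.sqrt(var + eps)

# Reference: functional batch norm in training mode, without scale and shift
reference = F.batch_norm(x, running_mean=None, running_var=None, training=True, eps=eps)
print(mean.shape, torch.allclose(x_hat, reference, atol=1e-5))
```

Note that the statistics have one entry per channel, regardless of how many samples or positions the input contains.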

# Layer Normalization

Having a sufficiently large batch size is impractical for sequential tasks, where sequences can be very long. To mitigate this constraint, the Layer Normalization (LN) technique was introduced.

Because its statistics are computed per sample, *LN* does not depend on the batch size
and can be used with small batches. It can also help to reduce vanishing gradients in
recurrent neural networks.

Again, to save us the time of reading the entire paper, the essential part on how
*Layer Normalization* works is illustrated in the image below:

In *Layer Normalization* the *mean* and *variance* are calculated for each individual
sample in the batch across all elements (pixels or tokens) in all channels.

It may seem counterintuitive at first, but it is called *Layer Normalization* because
the statistics are computed over all channels, i.e. all features of a layer.
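The per-sample statistics can be sketched in the same way. The example below assumes a simple `(batch, features)` input with arbitrary sizes, computes the statistics over the feature dimension by hand, and compares them with `torch.nn.functional.layer_norm`. It also checks that a batch of one sample gives the same result, which is why LN works with small batches.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

x = torch.randn(4, 6)  # assumed layout: (batch, features)
eps = 1e-5

# One mean and one variance per sample, reduced over the feature axis
mean = x.mean(dim=-1, keepdim=True)                # shape (4, 1)
var = x.var(dim=-1, unbiased=False, keepdim=True)  # shape (4, 1)
x_hat = (x - mean) / torch.sqrt(var + eps)

# Reference: functional layer norm without scale and shift
reference = F.layer_norm(x, normalized_shape=(6,), eps=eps)

# Each sample is normalized on its own, so a batch of one gives the same values
single = F.layer_norm(x[:1], normalized_shape=(6,), eps=eps)
print(torch.allclose(x_hat, reference, atol=1e-5))
print(torch.allclose(single, reference[:1], atol=1e-6))
```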

# PyTorch Implementation

The *PyTorch* implementation is given in the code snippets below. We create two
learnable parameters, `gamma` and `beta`, to scale and shift the normalized input.

For *Batch Normalization*, inference should not depend on the composition of the test
batch, so during training we track the *moving mean* and *moving variance*. Later,
during inference, we use these moving averages in place of the test data *mean* and
*variance*.
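PyTorch's built-in `nn.BatchNorm1d` follows the same moving-average idea, which makes it a convenient way to see the mechanism in action. The sketch below is only an illustration with arbitrary input sizes: it runs one training step, checks the update rule with a momentum of 0.1 (PyTorch stores the unbiased variance estimate in `running_var`), and then verifies that evaluation mode normalizes with the stored statistics.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

bn = nn.BatchNorm1d(num_features=3, momentum=0.1)  # running stats start at mean 0, var 1
x = torch.randn(16, 3) * 2.0 + 5.0

bn.train()
_ = bn(x)  # training step: normalizes with batch stats and updates the running stats

batch_mean = x.mean(dim=0)
batch_var = x.var(dim=0, unbiased=True)
expected_mean = 0.9 * torch.zeros(3) + 0.1 * batch_mean
expected_var = 0.9 * torch.ones(3) + 0.1 * batch_var

bn.eval()
with torch.no_grad():
    y = bn(x)  # inference: uses the running stats, not the statistics of x
    manual = (x - bn.running_mean) / torch.sqrt(bn.running_var + bn.eps)

print(torch.allclose(bn.running_mean, expected_mean, atol=1e-6))
print(torch.allclose(y, manual, atol=1e-5))
```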

```python
import torch
import torch.nn as nn
```

Below you can find the *Batch Normalization* implementation in PyTorch:

```python
class BatchNorm(nn.Module):
    def __init__(self, num_features: int, training: bool, eps: float = 1e-6) -> None:
        super().__init__()
        # nn.Module already tracks `training`; train() / eval() toggle it later
        self.training = training
        # learnable parameters (scale and shift)
        self.gamma = nn.Parameter(torch.ones(num_features))
        self.beta = nn.Parameter(torch.zeros(num_features))
        # hyperparams
        self.eps = eps
        # moving statistics are buffers: saved with the model but not trained
        self.register_buffer("moving_mean", torch.zeros(num_features))
        self.register_buffer("moving_var", torch.ones(num_features))

    def forward(self, x):
        # x is assumed channels-last, e.g. (batch, features) or (batch, tokens, features)
        if self.training:
            # reduce over every dimension except the last (channel) one
            dims = tuple(range(x.dim() - 1))
            mean = x.mean(dim=dims)
            var = x.var(dim=dims, unbiased=False)
            # update the moving averages in place, without tracking gradients
            with torch.no_grad():
                self.moving_mean.mul_(0.9).add_(0.1 * mean)
                self.moving_var.mul_(0.9).add_(0.1 * var)
        else:
            mean = self.moving_mean
            var = self.moving_var
        x = (x - mean) / torch.sqrt(var + self.eps)
        x = self.gamma * x + self.beta
        return x
```

Below you can find the *Layer Normalization* implementation in PyTorch:

```python
class LayerNorm(nn.Module):
    def __init__(self, num_features: int, eps: float = 1e-6) -> None:
        super().__init__()
        # learnable parameters (scale and shift), one per feature
        self.gamma = nn.Parameter(torch.ones(num_features))
        self.beta = nn.Parameter(torch.zeros(num_features))
        # hyperparams
        self.eps = eps

    def forward(self, x):
        # Statistics are computed per sample over the feature (last) dimension,
        # so no moving averages are needed: training and inference follow
        # the same path.
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, unbiased=False, keepdim=True)
        x = (x - mean) / torch.sqrt(var + self.eps)
        x = self.gamma * x + self.beta
        return x
```
