Batch Normalization (BatchNorm)



Batch Normalization is a technique that was introduced to reduce the large variations in activation distributions (differences in magnitude) between layers. With BatchNorm, training deeper models becomes more stable and efficient.

We will discuss the following:

  • Introduction: What is Batch Normalization?
  • How BatchNorm Works
  • Why use the learnable parameters \(\gamma\) and \(\beta\)?
  • Benefits
  • Code & Example

Introduction: What is Batch Normalization?

Batch Normalization (BatchNorm) is a mechanism for keeping the intermediate activation values well-scaled and stable during training, leading to easier optimization and faster convergence.

How BatchNorm Works

During training, BatchNorm first normalizes the inputs by subtracting their batch mean and dividing by their batch standard deviation, then applies a learnable scale coefficient \(\gamma\) and an offset \(\beta\).

Here are the functions used:

\[ \hat{x} = \frac{x - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} \]

\[ y = \gamma \hat{x} + \beta \]

where \(\mu_B\) and \(\sigma_B^2\) are the batch mean and variance, and \(\epsilon\) is a small constant for numerical stability.
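The two equations above can be sketched directly in NumPy (a minimal forward pass only; the function name and the toy input are illustrative, not from any particular library):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Normalize x over the batch axis, then scale and shift."""
    mu = x.mean(axis=0)                      # batch mean, mu_B
    var = x.var(axis=0)                      # batch variance, sigma_B^2
    x_hat = (x - mu) / np.sqrt(var + eps)    # normalized activations
    return gamma * x_hat + beta              # y = gamma * x_hat + beta

# Batch of 32 examples with 4 features, deliberately off-center and off-scale
x = np.random.randn(32, 4) * 5.0 + 3.0
y = batchnorm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0))  # ≈ 0 for every feature
print(y.std(axis=0))   # ≈ 1 for every feature
```

With \(\gamma = 1\) and \(\beta = 0\), the output is simply the normalized activations: zero mean and unit standard deviation per feature.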

Why use the learnable parameters \(\gamma\) and \(\beta\)?

These learnable parameters let the network scale and shift the normalized activations as needed, allowing it to recover the original distribution if that is optimal for the task. This helps restore the representational power of the network after normalization.
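To see why this recovers the original distribution, note that choosing \(\gamma = \sqrt{\sigma_B^2 + \epsilon}\) and \(\beta = \mu_B\) exactly undoes the normalization (a small NumPy sanity check; the variable names are illustrative):

```python
import numpy as np

# Activations with a nonzero mean and non-unit scale
x = np.random.randn(64, 3) * 2.0 + 7.0
eps = 1e-5

mu, var = x.mean(axis=0), x.var(axis=0)
x_hat = (x - mu) / np.sqrt(var + eps)        # normalized activations

# gamma = sqrt(var + eps) and beta = mu invert the normalization step
gamma, beta = np.sqrt(var + eps), mu
y = gamma * x_hat + beta
print(np.allclose(y, x))  # True: the original activations are recovered
```

So the identity mapping is always within reach; the network only normalizes to the extent that it helps.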

Benefits:

  • Stabilizes training by reducing internal covariate shift.
  • Allows higher learning rates and faster training.
  • Acts as a regularizer, reducing the need for dropout in some cases.
  • Improves generalization.

Code & Example

Check out this simple example of a LeNet network, and observe how Batch Norm improved training and reached a better final loss:

Batch Norm on LeNet Model.
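A sketch of what such a model could look like in PyTorch: a LeNet-style network with a BatchNorm layer inserted after each convolutional and fully connected layer. The layer sizes assume 1×28×28 inputs (e.g. MNIST); this is an illustrative architecture, not the exact model from the linked example.

```python
import torch
import torch.nn as nn

class LeNetBN(nn.Module):
    """LeNet-style CNN with BatchNorm after each conv/linear layer."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5, padding=2),   # -> 6 x 28 x 28
            nn.BatchNorm2d(6), nn.Sigmoid(), nn.AvgPool2d(2),  # -> 6 x 14 x 14
            nn.Conv2d(6, 16, kernel_size=5),             # -> 16 x 10 x 10
            nn.BatchNorm2d(16), nn.Sigmoid(), nn.AvgPool2d(2), # -> 16 x 5 x 5
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 5 * 5, 120), nn.BatchNorm1d(120), nn.Sigmoid(),
            nn.Linear(120, 84), nn.BatchNorm1d(84), nn.Sigmoid(),
            nn.Linear(84, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = LeNetBN()
out = model(torch.randn(8, 1, 28, 28))  # batch of 8 grayscale images
print(out.shape)  # torch.Size([8, 10])
```

Because BatchNorm keeps each layer's inputs well-scaled, this variant typically trains faster and more stably than the original LeNet, which is the comparison the linked example shows.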
