Weight Initialization & Why It Matters
Weight initialization is an underrated step in the model training process. It is a fundamental part of training that, done correctly, can lead to better results. In this explainer, we will introduce how this technique works, why it leads to better results, and some of the most popular initialization methods used in practice.
Introduction: Why Weight Initialization Matters?
Weight initialization is the process of choosing the starting values for the weights and biases in a neural network before training begins. This step can lead to a significant improvement in training speed and efficiency. On the other hand, poor initialization can cause the network to learn very slowly, get stuck, or even fail to learn at all. In this section, we'll see why initialization matters and compare the results of different initialization techniques.
What Goes Wrong With “Bad” Initialization?
Think of a hiking analogy, where you are trying to reach the bottom of a mountain (minimize the loss to the lowest value possible): if you start at the top of a hill, it's easy to walk down to the lowest point. But if you start deep inside a cave, you might spend a lot of time just trying to get out before you can even begin heading toward the minimum. In neural networks, good initialization gives the model a clear path to learn quickly, while bad initialization can trap it and slow down learning.
[Figure placeholder: the hiking analogy, shown as a sine-like curve descending toward its minimum]
Random Initialization
Random initialization means setting the starting weights of a neural network to random values. Randomization is important: we don't want to give all weights and biases the same initial value (all zeros, for example), because randomness is what allows neurons to learn different features during training. If all weights start from the same point, they all learn the same feature (only curves, or only edges, and so on). But if weights and biases have different starting points, each neuron can learn a different feature, making the network more effective. A tiny sketch of this symmetry problem follows below.
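To make the symmetry problem concrete, here is a small illustrative sketch (not from the original article): when every weight in a layer starts at the same value, all neurons produce the same output, so they receive the same gradient and can never specialize.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])

W_same = np.full((3, 4), 0.5)                             # every neuron starts identical
W_rand = np.random.default_rng(0).normal(0, 0.1, (3, 4))  # each neuron starts different

print(x @ W_same)  # four identical outputs -> identical gradients -> identical neurons
print(x @ W_rand)  # four different outputs -> each neuron can learn its own feature
```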
Experiment
Let’s try a simple experiment where we use different initialization values, then observe how activations propagate through each layer. This experiment will show how choosing the right starting values can lead to smooth training, or cause problems like vanishing or exploding activations.
Given the input \(X\): \[ X = [1, 2, 3, 4, 5] \]
We will use 8 hidden layers, each with 10 neurons.
The weights for each layer will be initialized using a normal distribution. The parameter \(\sigma\) (the standard deviation) controls how large the weights are initially. We will try a different \(\sigma\) value for each run: \[ W \sim \mathcal{N}(0, \sigma^2) \]
\[ \sigma \in \{10^{-5}, 10^{-3}, 10^{-2}, 10^{-1}, 0.5\} \]
The biases will be set to zero: \[ b = 0 \]
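A minimal setup sketch for this experiment (the variable names are illustrative, not taken from the original notebook):

```python
import numpy as np

X = np.array([[1., 2., 3., 4., 5.]])    # input, shape (1, 5)
n_layers = 8                            # number of hidden layers
n_neurons = 10                          # neurons per hidden layer
sigmas = [1e-5, 1e-3, 1e-2, 1e-1, 0.5]  # standard deviations to try
```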
In the following function, we will define the forward pass. The function takes the input \(X\), then computes the pre-activation and post-activation for each layer:
\[ \begin{gather*} \text{Pre-activation} \\ Z = H_{\text{prev}} \cdot W + b \end{gather*} \]
\[ \begin{gather*} \text{Post-activation} \\ H = \text{ReLU}(Z) \end{gather*} \]
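A minimal sketch of such a forward pass (the function name and signature are assumptions, not the original notebook's code):

```python
def relu(z):
    return np.maximum(0.0, z)

def forward_pass(X, n_layers, n_neurons, sigma, seed=0):
    """Propagate X through ReLU layers with W ~ N(0, sigma^2) and b = 0.

    Returns the pre-activations (Z) and post-activations (H) of every layer.
    """
    rng = np.random.default_rng(seed)
    pre_acts, post_acts = [], []
    H = X
    for _ in range(n_layers):
        W = rng.normal(0.0, sigma, size=(H.shape[1], n_neurons))  # weights
        b = np.zeros(n_neurons)                                   # zero biases
        Z = H @ W + b        # pre-activation
        H = relu(Z)          # post-activation
        pre_acts.append(Z)
        post_acts.append(H)
    return pre_acts, post_acts
```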
We will store the activations (pre and post), then compute the variance of each using the following equation:
\[ \text{Var}(Z) = \frac{1}{N} \sum_{i=1}^N (z_i - \mu)^2 \]
where \(z_i\) is the \(i\)-th pre-activation value of the layer, \(N\) is the number of values, and \(\mu\) is the mean of \(Z\), calculated as:
\[ \mu = \frac{1}{N} \sum_{i=1}^N z_i \]
The np.var() function in Python efficiently calculates this variance; we will use it to get the variance of both the \(Z\) and \(H\) values. This work is performed by the function measure_layer_variances.
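A version of measure_layer_variances consistent with this description could look like the following (again, a sketch rather than the original cell):

```python
def measure_layer_variances(pre_acts, post_acts):
    """Return the variance of Z and H for every layer, using np.var()."""
    z_vars = [np.var(Z) for Z in pre_acts]
    h_vars = [np.var(H) for H in post_acts]
    return z_vars, h_vars
```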
Tip:
Make sure to run the above code cells in order, then run the final plotting cell below.
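The plotting cell could be as simple as the sketch below, which plots the pre-activation variance per layer for every \(\sigma\) on a log scale (building on the hypothetical helpers above):

```python
import matplotlib.pyplot as plt

for sigma in sigmas:
    pre_acts, post_acts = forward_pass(X, n_layers, n_neurons, sigma)
    z_vars, h_vars = measure_layer_variances(pre_acts, post_acts)
    plt.plot(range(1, n_layers + 1), z_vars, marker="o", label=f"sigma={sigma}")

plt.xlabel("Layer")
plt.ylabel("Variance of pre-activations Z")
plt.yscale("log")   # the variances span many orders of magnitude
plt.legend()
plt.show()
```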
Notice that the variance drops rapidly across layers when \(\sigma\) is small. The activations shrink toward zero as they pass through each layer; this is called vanishing activations. Learning becomes slow and the network struggles because the signal is too weak to propagate through the layers. The opposite is also true: if \(\sigma\) is too large, the variance can grow quickly, causing exploding activations and making training unstable.
Xavier Initialization
Xavier initialization is a technique designed to keep the variance of the activations stable as they flow forward and backward through the network. We saw in the previous experiment that initializing the weights with too small a variance causes the activations to vanish across layers. Xavier initialization resolves this by keeping the activation variance consistent from layer to layer during the forward pass.
This is achieved by drawing the weights from a distribution with mean 0 and variance \(\frac{2}{n_{\text{in}} + n_{\text{out}}}\), where \(n_{\text{in}}\) and \(n_{\text{out}}\) are the number of input and output connections of each layer:
For normal (Gaussian) distribution: \[ W \sim \mathcal{N}\left(0, \frac{2}{n_{\text{in}} + n_{\text{out}}}\right) \]
For uniform distribution: \[ W \sim \mathcal{U}\left(-\sqrt{\frac{6}{n_{\text{in}} + n_{\text{out}}}},\ +\sqrt{\frac{6}{n_{\text{in}} + n_{\text{out}}}}\right) \]
With this initialization, the network is less likely to suffer from vanishing or exploding activations.
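As a concrete sketch, both Xavier variants can be written in a few lines of NumPy (the helper names are illustrative; deep learning frameworks ship equivalent initializers, e.g. PyTorch's xavier_normal_ and xavier_uniform_):

```python
def xavier_normal(n_in, n_out, rng=None):
    """W ~ N(0, 2 / (n_in + n_out)); the std is the square root of that variance."""
    rng = np.random.default_rng() if rng is None else rng
    std = np.sqrt(2.0 / (n_in + n_out))
    return rng.normal(0.0, std, size=(n_in, n_out))

def xavier_uniform(n_in, n_out, rng=None):
    """W ~ U(-limit, +limit) with limit = sqrt(6 / (n_in + n_out))."""
    rng = np.random.default_rng() if rng is None else rng
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_in, n_out))
```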
Note:
The main argument is that the variance of the activations at initialization should remain constant across all layers during forward and backward propagation. This helps ensure stable learning and prevents issues like vanishing or exploding gradients.
| Initialization Type | Distribution Shape | When to Use | Key Behavior |
|---|---|---|---|
| Normal (Gaussian) Xavier | Bell-shaped, values clustered near zero | When you want stable, smooth training and reproducible behavior | Activations/gradients stay controlled, less noisy, gentle exploration |
| Uniform Xavier | Even spread between positive & negative range | For shallow networks or when you want broader initial exploration | Allows more variation in early learning, more exploratory weight space |
He Initialization
One thing we did not mention about Xavier initialization is that each layer's weights are initialized differently depending on the number of input connections to the layer (\(n_{\text{in}}\)) and the number of output neurons (\(n_{\text{out}}\)). In contrast, He initialization only considers the number of input neurons (\(n_{\text{in}}\)) and is specifically designed for use with the ReLU activation function.
In He initialization, for each layer in the network, we initialize the weights so that: \[ \mathrm{Var}[w] = \frac{2}{n_{\text{in}}} \]
The logic for choosing this value is as follows:
The pre-activation of a layer in the network is: \[ Z = \sum_{i=1}^{n_{\text{in}}} w_i x_i \]
We get the variance: \[ \mathrm{Var}[Z] = \mathrm{Var}\left(\sum_{i=1}^{n_{\text{in}}} w_i x_i\right) \]
If we assume the terms \(w_i x_i\) are independent (separate variables that don't influence one another), the variance of the sum is the sum of the variances: \[ \mathrm{Var}[Z] = \sum_{i=1}^{n_{\text{in}}} \mathrm{Var}[w_i x_i] \]
If we further assume the weights and inputs are zero-mean and identically distributed, each term satisfies \(\mathrm{Var}[w_i x_i] = \mathrm{Var}[W]\,\mathrm{Var}[X]\), so the equation becomes: \[ \mathrm{Var}[Z] = n_{\text{in}}\, \mathrm{Var}[W]\, \mathrm{Var}[X] \]
Each layer has inputs \(X\) and post-activation outputs \(H\), and to ensure the variance doesn't vanish or explode across layers, we enforce: \[ \mathrm{Var}[H] = \mathrm{Var}[X] \]
Also, remember that applying the ReLU activation function roughly halves the variance of the pre-activation \(Z\): for a zero-mean, symmetric \(Z\), half of the values are zeroed out. This means that after each ReLU layer, the variance of the activations becomes \(\frac{1}{2}\) of the variance of the pre-activations, so:
\[ \mathrm{Var}[H] = \frac{1}{2} n_{\text{in}} \cdot \mathrm{Var}[W] \cdot \mathrm{Var}[X] \]
Substituting \(\mathrm{Var}[H] = \mathrm{Var}[X]\) into this equation and cancelling \(\mathrm{Var}[X]\) leads to:
\[ \mathrm{Var}[W] = \frac{2}{n_{\text{in}}} \]
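A sketch of He initialization in the same style, plus a quick way to check it against the earlier experiment (replacing the fixed \(\sigma\) with \(\sqrt{2/n_{\text{in}}}\) per layer should keep the activation variance roughly stable; the helper names and reuse of the earlier sketch variables are assumptions):

```python
def he_normal(n_in, n_out, rng=None):
    """W ~ N(0, 2 / n_in), the He initialization for ReLU layers."""
    rng = np.random.default_rng() if rng is None else rng
    std = np.sqrt(2.0 / n_in)      # Var[W] = 2 / n_in
    return rng.normal(0.0, std, size=(n_in, n_out))

# Reusing the experiment setup above: the per-layer std now depends on n_in,
# so the activation variance should stay roughly stable across the 8 layers.
rng = np.random.default_rng(0)
H = X
for layer in range(n_layers):
    W = he_normal(H.shape[1], n_neurons, rng)
    H = np.maximum(0.0, H @ W)
    print(f"layer {layer + 1}: variance of H = {np.var(H):.4f}")
```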
Summary
We have explored weight initialization as one of the crucial steps in training a deep neural network. We started with random initialization and saw, using a small experiment, how bad initialization can lead to vanishing activations, where information is lost as the signal propagates across layers, leading to inefficient training. Advanced initialization methods like Xavier and He initialization are specifically designed to keep the variance stable across all layers. In summary, choosing the right initialization method leads to faster, more reliable training and better model performance.