Softmax & Cross-Entropy
In this explainer we will dig deeper into Softmax and Cross-Entropy. We'll explore how softmax transforms raw scores into probabilities, and how Cross-Entropy measures how well those probabilities match the true labels. We will start with the general intuition, then walk through the math, and finish with a code implementation.
We will discuss the following:
- Introduction: from raw scores to probabilities
- What Softmax does: intuition, math, and code
- Cross-Entropy loss: intuition, math, and code
- Summary
Introduction: From Raw Scores to Probabilities
Depending on the machine learning task, binary classification (True/False, Male/Female) or multi-class classification (such as digit recognition: 0-9, or animal types: cat/dog/horse), the activation function will be different. In a binary classification task, the model needs to output a single probability: "What is the probability that this input is True?". In multi-class classification, the model must output a probability for each possible class, for example the probability that the input belongs to class 0, 1, 2, …, or 9 in digit recognition.
Because of this, the choice of activation function at the output layer also changes:
- Sigmoid is used for binary classification, where the output is a single value between 0 and 1.
- Softmax is used for multi-class classification, where the model outputs several values that must form a valid probability distribution (all values between 0 and 1, and they sum to 1).
The choice of loss function also differs:
- For binary classification, we use Binary Cross-Entropy.
- For multi-class classification, we use Categorical Cross-Entropy.
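To make the contrast concrete, here is a minimal sketch of the two output activations (assuming NumPy; the input scores are made-up illustrative numbers, not outputs of a real model):

```python
import numpy as np

def sigmoid(z):
    """Binary case: squashes one raw score into a single probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    """Multi-class case: turns a vector of raw scores into a distribution that sums to 1."""
    exp_z = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return exp_z / exp_z.sum()

print(sigmoid(1.2))                        # one value, e.g. ~0.77
print(softmax(np.array([2.0, 1.0, 0.1])))  # e.g. ~[0.66, 0.24, 0.10], sums to 1
```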
To get an intuition of what the two (activation and loss) are doing, we can say that the activation function expresses the model's prediction, while the loss measures the model's mistakes.
┌────────────────────────────┐
│    Logits (Raw Scores)     │
│      [2.1, 0.5, -1.3]      │
└─────────────┬──────────────┘
              │
              ▼
       ┌─────────────┐
       │   Softmax   │
       └──────┬──────┘
              │
              ▼
┌──────────────────────────────┐
│   Probabilities (Sum = 1)    │
│     [0.72, 0.23, 0.05]       │
└──────────────┬───────────────┘
               │
               ▼
┌─────────────────────────────────┐
│   Categorical Cross-Entropy     │───▶  Final Loss Value
└─────────────────────────────────┘
                ▲
                │
  ┌──────────────────────────┐
  │   True Label: Class 2    │
  │        [0, 1, 0]         │
  └──────────────────────────┘
The Categorical Cross-Entropy then takes these probabilities and compares them to the true label to compute the loss.
What Does Softmax Do?
To avoid repeating the explanation from the previous section, the important thing to know about Softmax's purpose is that it converts the raw scores from the last layer into probabilities. Each class gets a probability between 0 and 1, and all these probabilities together always sum to 1.
Why do we need to use Softmax?
Neural networks often produce raw scores (logits) that can be any numbers: positive, negative, or very large. These scores are not meaningful by themselves because they don't behave like probabilities. Softmax fixes this by transforming those raw scores into probabilities.
With this, we can see which class the model thinks is most likely, and how confident the model is.
Softmax: math
Softmax takes a list of scores, for example [0.72, 0.23, 0.05], and passes them through this formula: \[
p_i = \frac{e^{z_i}}{\sum_j e^{z_j}}
\]
First, we apply the exponential \(e^{z}\) to each value in the list: \[ e^{z} = [e^{0.72},\ e^{0.23},\ e^{0.05}] \]
Results:
\[ e^{z} \approx [2.054,\ 1.259,\ 1.051] \]
Sum the values:
\[ S = 2.054 + 1.259 + 1.051 = 4.364 \]
Then divide each exponentiated value by the sum to get the probabilities:
\[ p = \left[ \frac{2.054}{4.364},\ \frac{1.259}{4.364},\ \frac{1.051}{4.364} \right] \]
Results:
\[ p \approx [0.471,\ 0.288,\ 0.241] \]
Now:
- every number is between 0 and 1
- all the numbers sum to 1
- the biggest logit becomes the biggest probability
Softmax: code
Try changing the values in the logits list and see how the probabilities change:
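Here is a minimal sketch of what such a snippet could look like, assuming NumPy; the function name `softmax` and the example `logits` values (taken from the walkthrough above) are just illustrative:

```python
import numpy as np

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    z = np.asarray(logits, dtype=float)
    exp_z = np.exp(z - z.max())     # subtract the max for numerical stability
    return exp_z / exp_z.sum()      # normalize so the outputs sum to 1

logits = [0.72, 0.23, 0.05]         # same scores as in the worked example
probs = softmax(logits)

print("Probabilities:", np.round(probs, 3))   # ~[0.471, 0.288, 0.241]
print("Sum:", probs.sum())                    # ~1.0 (up to floating-point rounding)
```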
Cross-Entropy Loss (Intuition)
We saw how Softmax gives us probabilities for each class; now we need a way to measure how good these probabilities are compared to the true label. This is where Categorical Cross-Entropy comes in. It looks only at the probability assigned to the correct class: it gives a small loss if this probability is high, and a large loss if it is low.
Cross-Entropy: math
Categorical Cross-Entropy measures how well the predicted probability of the correct class matches the true label.
The general formula is \( \text{Loss} = -\sum_i y_i \log(p_i) \); with a one-hot true label, only the term for the correct class survives, so it reduces to:
\[ \text{Loss} = -\log(p_{\text{true}}) \]
Let's assume that the true class is class 0:
\[ y_{true} = [1, 0, 0] \]
Now we look at the probability list output by the softmax: [0.471, 0.288, 0.241]. The model assigned a probability of 0.471 to class 0, which is the true class. We use this value to evaluate how good the prediction is.
\[ p_{\text{true}} = 0.471 \]
Applying the categorical cross-entropy formula:
\[ \text{Loss} = -\log(p_{\text{true}}) \]
Plugging in the value:
\[ \text{Loss} = -\log(0.471) \approx 0.753 \]
Note:
The log function penalizes small probabilities very strongly.
See these examples:
\[ -\log(1) = 0 \]
\[ -\log(0.9) \approx 0.105 \]
\[ -\log(0.5) \approx 0.693 \]
\[ -\log(0.1) \approx 2.30 \]
\[ -\log(0.01) \approx 4.60 \]
\[ -\log(0.0001) \approx 9.21 \]
💡 Notice:
- Small mistakes → small penalty
- Huge mistakes → huge penalty
The cross-entropy loss value (for example, \(0.753\)) measures how well the model predicted the correct class.
- A lower loss means the model assigned a higher probability to the true class (better prediction).
- A higher loss means the model was less confident or incorrect about the true class.
In training, the goal is to minimize this loss so the model becomes more accurate.
Cross-Entropy: code
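As in the softmax section, here is a small sketch of what the code here might look like (assuming NumPy; the function name and the probability values are illustrative, with the probabilities taken from the walkthrough above):

```python
import numpy as np

def categorical_cross_entropy(probs, true_class, eps=1e-12):
    """Loss = -log of the probability assigned to the true class."""
    return -np.log(probs[true_class] + eps)   # eps guards against log(0)

probs = np.array([0.471, 0.288, 0.241])   # softmax output from the walkthrough

print(categorical_cross_entropy(probs, true_class=0))   # ~0.753: decent probability on the true class
print(categorical_cross_entropy(probs, true_class=2))   # ~1.42: the true class got a low probability
```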
Notice how the loss is lower when the model assigns a higher probability to the correct class, and the opposite is true: the loss increases when the model assigns a high probability to the wrong class.
Summary
In summary, you need to know that Softmax converts raw model outputs into probabilities for each class, and it is used in multi-class classification tasks. Categorical Cross-Entropy is then used to measure how well these probabilities match the true labels. If the model gives a low probability to the correct class, the penalty (loss) will be high, and the opposite is true: if the model assigns a high probability to the correct class, the loss will be low. This encourages the model to become more confident and accurate in its predictions during training.
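To tie the two pieces together, here is a compact end-to-end sketch (assuming NumPy; the logits and true class are the illustrative values used throughout this explainer):

```python
import numpy as np

def softmax(z):
    exp_z = np.exp(z - np.max(z))          # stabilized exponentials
    return exp_z / exp_z.sum()             # probabilities that sum to 1

def cross_entropy(probs, true_class):
    return -np.log(probs[true_class])      # penalty on the true class only

logits = np.array([0.72, 0.23, 0.05])      # raw scores from the walkthrough
true_class = 0                             # the ground-truth label

probs = softmax(logits)                    # ~[0.471, 0.288, 0.241]
loss = cross_entropy(probs, true_class)    # -log(0.471) ~ 0.753

print("probs:", np.round(probs, 3), "loss:", round(float(loss), 3))
```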