Loss Functions in Machine Learning

How do algorithms know if their predictions are good?



This is an explainer of another fundamental part of Machine Learning and AI: Loss Functions.

We will discuss the following:

  • Definition: what a loss function is
  • Common types: regression losses and classification losses
  • Math: simple worked examples for each loss

Definition: What is a loss function?

A loss function is a measure of how far the model’s prediction is from the actual “ground truth” value. It measures the performance of the model: the lower the loss, the better the performance. In training, we aim to get the loss as low as possible.

Common types

The loss function differs depending on the machine learning task we are trying to perform. There are different types of loss functions, each suited to a particular task:

Regression losses (predicting continuous values)

  • Mean Squared Error (MSE)
    This loss calculates how far off the predictions are by averaging the squares of the differences between the predicted and actual values. It’s the most common loss for regression tasks. \[ \text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \] Here, \(n\) is the number of samples, \(y_i\) is the actual value, and \(\hat{y}_i\) is the predicted value.

  • Mean Absolute Error (MAE)
    This loss measures the average of the absolute differences between the predicted and actual values. It’s more robust to outliers than MSE. \[ \text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i| \]

    Note

    Because MSE squares each error, large errors are amplified and have a much stronger impact on the total loss. This makes MSE more sensitive to outliers than Mean Absolute Error (MAE), which weights all errors equally.

    If we are building a model that does not need to pay significant attention to outliers (rare events), we use MAE as the loss function. But if those rare events are important, we use MSE so the model pays more attention to them; the short sketch after this note illustrates the difference.
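
To make this concrete, below is a minimal plain-Python sketch (the helper functions and the numbers are made up for illustration) comparing MSE and MAE on the same predictions, once with only small errors and once with a single large outlier:

def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [2, 4, 6, 8]
y_pred_small_errors = [3, 5, 5, 9]    # every prediction is off by 1
y_pred_with_outlier = [3, 5, 5, 20]   # the last prediction is off by 12

print(mse(y_true, y_pred_small_errors), mae(y_true, y_pred_small_errors))  # 1.0 1.0
print(mse(y_true, y_pred_with_outlier), mae(y_true, y_pred_with_outlier))  # 36.75 3.75

A single outlier multiplies the MSE by almost 37 while the MAE grows by less than a factor of 4, which is exactly the sensitivity difference described in the note.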


Classification losses

  • Binary Cross-Entropy (Log loss) — binary classification:
    This loss is used when the output has two categories (binary) and the model needs to output a probability for each category being the correct one. In other words: given an input sample X, what is the probability that this sample belongs to a particular class or category (A or B, for example)? \[ \text{Loss} = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1- \hat{y}_i) \right] \]

    The output of the above function is a number that tells us how well the model’s predicted probabilities match the actual labels. A lower value means the predictions are closer to the true answers, while a higher value means the model is making more mistakes.

  • Categorical Cross-Entropy — multi-class classification:

    Binary cross-entropy is used for binary classification, while categorical cross-entropy is used in multi-class classification problems, where each sample belongs to one of multiple possible classes. \[ \text{Loss} = -\frac{1}{n} \sum_{i=1}^{n} \sum_{c=1}^{C} y_{i,c} \log(\hat{y}_{i,c}) \]

    Here, \(C\) is the number of classes, \(y_{i,c}\) is 1 if sample \(i\) belongs to class \(c\) (otherwise 0), and \(\hat{y}_{i,c}\) is the predicted probability that sample \(i\) belongs to class \(c\).


Math

Let’s take each loss function mentioned above and work through a simple example to help visualize how each one actually works.

Starting with MSE, consider that we got the following results at the end of an epoch:

y_true = [2, 4, 6]
y_pred = [3, 5, 4]

We now need to measure how far our predictions are from the true values. The MSE function takes each value of y_pred, subtracts it from the corresponding value of y_true, squares the result, sums all the squared differences, and finally averages the sum by dividing by the number of samples (in this case, 3).
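
As a minimal sketch in plain Python, that computation looks like this:

y_true = [2, 4, 6]
y_pred = [3, 5, 4]

# Squared differences: (2 - 3)^2 = 1, (4 - 5)^2 = 1, (6 - 4)^2 = 4
squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]

mse = sum(squared_errors) / len(y_true)
print(mse)  # (1 + 1 + 4) / 3 = 2.0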

The lower the Mean Squared Error, the closer the predictions are to the actual values. Therefore, during training we aim for a lower MSE.

Now let’s apply the same example to the MAE function.

Because the error is not squared in MAE, it is less affected by large errors (outliers), whereas MSE emphasizes bigger mistakes more strongly.
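
A quick sketch of the same example with MAE:

y_true = [2, 4, 6]
y_pred = [3, 5, 4]

# Absolute differences: |2 - 3| = 1, |4 - 5| = 1, |6 - 4| = 2
absolute_errors = [abs(t - p) for t, p in zip(y_true, y_pred)]

mae = sum(absolute_errors) / len(y_true)
print(mae)  # (1 + 1 + 2) / 3 = 1.33...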


Now turning to cross-entropy, we cannot use a direct subtraction between y_pred and y_true as we did in MSE and MAE, because here the model output is a probability (how confident the model is). If the model predicts a high probability for the wrong class, for example 0.9 while the true value is 0, the loss becomes very large, pushing the model to be careful with its confidence.

A lower binary cross-entropy means the model’s predicted probabilities are close to the actual labels; a higher value means the predictions are less accurate.
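
Here is a minimal sketch of binary cross-entropy in plain Python; the labels, predicted probabilities, and helper name are made up for illustration:

import math

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    # Clip probabilities away from exactly 0 and 1 so log() stays finite.
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)

y_true = [1, 0, 1]
confident_and_right = [0.9, 0.1, 0.8]  # high probability on the correct class
confident_but_wrong = [0.1, 0.9, 0.2]  # high probability on the wrong class

print(binary_cross_entropy(y_true, confident_and_right))  # ~0.14, low loss
print(binary_cross_entropy(y_true, confident_but_wrong))  # ~2.07, high loss

Confident wrong answers are penalized far more heavily than confident right answers are rewarded, which is the behaviour described above.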


In the case of categorical cross-entropy loss, each sample’s true output is one-hot encoded, meaning that if the model has to predict one class out of 3 possible classes, the true label for each sample will look like [1, 0, 0], [0, 1, 0], or [0, 0, 1], indicating which class is correct.

The model outputs a probability for each class, such as [0.7, 0.2, 0.1], showing how confident it is about each class. The categorical cross-entropy loss then compares the predicted probabilities to the one-hot true labels, penalizing the model more when it assigns low probability to the correct class.

A lower categorical cross-entropy means the model is assigning high probability to the correct class for each sample.
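
A short sketch of this in plain Python, with two samples, three classes, and hypothetical predicted probabilities:

import math

def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    # y_true holds one-hot label vectors; y_pred holds predicted probability vectors.
    total = 0.0
    for true_vec, pred_vec in zip(y_true, y_pred):
        for y, p in zip(true_vec, pred_vec):
            total += y * math.log(max(p, eps))
    return -total / len(y_true)

y_true = [[1, 0, 0], [0, 1, 0]]
y_pred = [[0.7, 0.2, 0.1],   # 0.7 on the correct class: small loss term
          [0.1, 0.2, 0.7]]   # only 0.2 on the correct class: larger loss term

print(categorical_cross_entropy(y_true, y_pred))  # (0.36 + 1.61) / 2, about 0.98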


Summary

Loss functions are a fundamental building block of the training process for a machine learning model. They measure how far off the model’s predictions are from the true values. We explored common loss functions like MSE and MAE for regression, and cross-entropy for classification. Using simple code examples, we saw how each loss works and why choosing the right one matters for your task. Lower loss means better predictions, and the goal during training is always to minimize the loss.