Intuition Behind the Training Process in Machine Learning

Get a glimpse of how training works



This is an explainer of what is meant by training in Machine Learning. It is a simple guide to help you grasp the big picture only; in reality, there are more details. Just follow the examples and try changing the numbers to experiment with different values.

We will discuss the following:

  • Introduction

  • Math

  • Gradient Descent

  • Code

  • Summary


Introduction

Going into the details of every part of the machine learning process is not the goal of this page. To keep it short, we need to perform the following:

  • Define model: \(y = w \cdot x + b\)

  • Define target and input:

    • \(y = 24\)

    • \(x = 3\)

  • Define initial values: \(w=0.2\), \(b=0.4\)

Our goal is to adjust the values of \(w\) and \(b\) so that when \(w\) is multiplied by \(x\) and then added to \(b\), the result is our target value \(y = 24\).

Math

In the previous section, we defined the values of \(w\), \(b\), and \(x\). Let's see what value of \(y\) we get if we plug these values into the model \(y = w \cdot x + b\):

\[ \hat{y} = w \cdot x + b \\ = 0.2 \times 3 + 0.4 \\ = 0.6 + 0.4 \\ = 1.0 \]

This is far from our target \(y = 24\). How do we modify the values of \(w\) and \(b\) so that the above equation returns the target value \(y = 24\)?

How did we know that we are far from the target? It is simple: we just subtracted the result \((\hat{y}=1)\), also called the predicted value, from the target \(y = 24\).

In machine learning terms, this is called finding the error using a loss function.

Therefore, in this example we define the loss function as follows:

\[ Loss = (\hat{y} - y_{target})^2 \]

We use the square only to avoid dealing with negative values, because we only care about the distance from the target, not the direction.
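As a quick sanity check, here is the loss for our initial values, sketched in Python (the variable names are just for illustration):

```python
# Initial values from the introduction
w, b, x, y_target = 0.2, 0.4, 3, 24

y_hat = w * x + b                # prediction: 0.2 * 3 + 0.4 = 1.0
loss = (y_hat - y_target) ** 2   # squared distance from the target

print(y_hat, loss)               # ~1.0 and ~529.0
```

A loss of roughly 529 confirms numerically what we saw above: the initial guess is very far from the target.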


Great! Now, how can we actually build an algorithm—a training loop—that automatically tweaks the values of \(w\) and \(b\) until our model’s output matches the target \(y = 24\)? In other words, how does the machine learn the best values for \(w\) and \(b\)?

Here are the steps:

  1.  Start with a random guess for \(w\) and \(b\).

  2.  Plug the values into the model to get the prediction \(\hat{y}\).

  3.  Calculate the error using the loss function.

  4.  Based on the loss, adjust the values of \(w\) and \(b\).

  5.  Loop back using the new \(w\) and \(b\).

  6.  Repeat until hitting the target.

We already saw how to do steps 1–3; these are simply plugging numbers into the model equation \(y = w \cdot x + b\) and the loss function. Now, and this is the important part, how do we perform step 4?

Gradient Descent

Gradient Descent is a clever trick that helps us change \(w\) and \(b\) little by little, so the model gets closer and closer to the target. It's like giving the machine a recipe to learn from its mistakes and improve every time.

Learning from its mistakes is exactly what Gradient Descent does! Each time the model makes a prediction, it checks how far off it was (the loss), and then uses that information to make a better guess next time. By repeating this process over and over, the model slowly gets better at finding the best values for \(w\) and \(b\).

In math terms, this is how Gradient Descent is performed:

  • Using the Loss, find \[ \frac{\partial \text{Loss}}{\partial w}, \frac{\partial \text{Loss}}{\partial b} \]

  • The above is called the gradient; it tells us which way (and how much) to adjust \(w\) and \(b\) to make the loss smaller.

  • Using the gradient, simply adjust \(w\) like this: \[ w_{new} = w_{old} - lr \cdot \frac{\partial \text{Loss}}{\partial w} \]

Here \(lr\) (the learning rate) is a small positive number that controls the step size (how much to adjust the value). It regulates the learning and prevents overshooting far past the best fit.
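To see the update rule in action, here is one gradient-descent step on our example, as a minimal sketch (the learning rate lr = 0.01 is an assumed value, not fixed by the text):

```python
w, b, x, y_target = 0.2, 0.4, 3, 24
lr = 0.01                        # assumed learning rate; try other small values

y_hat = w * x + b                # prediction: ~1.0
grad = 2 * (y_hat - y_target)    # shared factor 2(y_hat - y_target)
w = w - lr * grad * x            # dLoss/dw = grad * x
b = b - lr * grad                # dLoss/db = grad

new_loss = (w * x + b - y_target) ** 2
print(w, b, new_loss)            # the loss drops from ~529 to ~338.6
```

One step was enough to cut the loss by more than a third; repeating this step is all the training loop does.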


One last thing remains: how is \(\frac{\partial \text{Loss}}{\partial w}\) calculated?

Recall: \[ \text{Loss} = (\hat{y} - y_{\text{target}})^2 \]

And \[ \hat{y} = w \cdot x + b \]

So, \[ \text{Loss} = (w \cdot x + b - y_{\text{target}})^2 \]

Now, take the derivative of Loss with respect to \(w\): \[ \frac{\partial \text{Loss}}{\partial w} = 2(w \cdot x + b - y_{\text{target}}) \cdot x = gradient \cdot x \]

\[ \frac{\partial \text{Loss}}{\partial b} = 2(w \cdot x + b - y_{\text{target}}) = gradient \]
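To convince yourself these derivatives are correct, you can compare them against a numerical approximation using finite differences. A minimal sketch, assuming the same model and values as above:

```python
w, b, x, y_target = 0.2, 0.4, 3, 24

def loss(w, b):
    return (w * x + b - y_target) ** 2

# Analytic derivatives from the formulas above
grad = 2 * (w * x + b - y_target)
dw_analytic = grad * x
db_analytic = grad

# Numerical approximation: (Loss(w+h) - Loss(w-h)) / (2h)
h = 1e-6
dw_numeric = (loss(w + h, b) - loss(w - h, b)) / (2 * h)
db_numeric = (loss(w, b + h) - loss(w, b - h)) / (2 * h)

print(dw_analytic, dw_numeric)   # both ~ -138.0
print(db_analytic, db_numeric)   # both ~ -46.0
```

The analytic and numerical values agree, so the derivative formulas above check out.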

Remember, \(\frac{\partial \text{Loss}}{\partial w}\) tells us how sensitive the loss is to small changes in \(w\). By subtracting this term from \(w_{old}\), we nudge \(w\) in the direction of a smaller loss, and a smaller loss means we're moving closer to the target value.

Code

Now that we have explained the intuition and the math, let's turn everything into code and see how the algorithm finds the best values of \(w\) and \(b\).

Define initial values of \(w\) and \(b\), and define the target we want the model to reach.
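The original code cell is not shown here, but a minimal reconstruction of what it might contain looks like this (the learning rate lr = 0.01 is an assumed value):

```python
# Initial guesses for the model parameters
w = 0.2
b = 0.4

# Input and the target the model should reach
x = 3
y_target = 24

# Assumed learning rate (step size for gradient descent)
lr = 0.01
```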

Now define the functions that compute the prediction and the loss, and that update the weights using the gradients.
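A sketch of what those functions could look like (the names predict, loss_fn, and update are illustrative, not from the original notebook):

```python
def predict(w, b, x):
    """The model: y_hat = w * x + b."""
    return w * x + b

def loss_fn(y_hat, y_target):
    """Squared distance from the target."""
    return (y_hat - y_target) ** 2

def update(w, b, x, y_hat, y_target, lr):
    """One gradient-descent step on w and b."""
    grad = 2 * (y_hat - y_target)
    w_new = w - lr * grad * x   # dLoss/dw = grad * x
    b_new = b - lr * grad       # dLoss/db = grad
    return w_new, b_new

# Quick sanity check with the initial values
y_hat = predict(0.2, 0.4, 3)
print(y_hat, loss_fn(y_hat, 24))  # ~1.0 and ~529.0
```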

Now loop until the prediction converges to the target, then break.
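A self-contained sketch of such a training loop (the learning rate and the stopping tolerance are assumed values):

```python
w, b, x, y_target = 0.2, 0.4, 3, 24
lr = 0.01         # assumed learning rate
tolerance = 1e-6  # stop once the prediction is this close to the target

for epoch in range(1000):
    y_hat = w * x + b                  # step 2: prediction
    loss = (y_hat - y_target) ** 2     # step 3: error
    if abs(y_hat - y_target) < tolerance:
        print(f"converged at epoch {epoch}")
        break
    grad = 2 * (y_hat - y_target)      # step 4: adjust w and b
    w -= lr * grad * x
    b -= lr * grad

print(w, b, w * x + b)                 # the prediction lands on ~24
```

With these settings the loop converges in well under a hundred epochs; each pass shrinks the remaining error by a constant factor.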

See how the model is getting closer and closer to the target value. Also notice how the loss is going down, meaning the prediction is no longer far from the target value. We can visualize the change in prediction values and loss over epochs as follows:
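The plotting cell is not shown here, but a sketch of collecting the values to plot might look like this (pass the lists to, e.g., matplotlib's plt.plot to draw the curves):

```python
w, b, x, y_target = 0.2, 0.4, 3, 24
lr = 0.01  # assumed learning rate

predictions, losses = [], []
for epoch in range(100):
    y_hat = w * x + b
    predictions.append(y_hat)
    losses.append((y_hat - y_target) ** 2)
    grad = 2 * (y_hat - y_target)
    w -= lr * grad * x
    b -= lr * grad

# The loss shrinks each epoch while the prediction approaches 24
print(predictions[0], predictions[-1])  # ~1.0 ... ~24.0
print(losses[0], losses[-1])            # ~529.0 ... ~0.0
```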

You can modify the numbers, rerun the cells, and notice how starting with higher or lower weight values affects the number of epochs the model needs to reach the target. Also, try a different learning rate value.

Summary

In this simple explainer, we started by defining the basic idea of model training, which is adjusting the weight values in the given model equation \(y = w \cdot x + b\) to hit a target value \(y_{target}\). Then, in the math section, we explored gradient descent and how it is the main driver for tweaking the weights' values. Lastly, the code section showed step by step how gradient descent updates \(w\) and \(b\) until the prediction lands on the target, so you can see the learning process in action.