Linear Regression
The First Step Toward Machine Learning
This is an explainer of a fundamental part of Machine Learning and AI: Linear Regression.
We will cover the definition, the line equation, a simple use case, and how to train a linear regression model step by step, including the math and the code.
Definition
Linear regression is a statistical approach to predict a variable Y based on another variable X. It models the relationship between the two variables by fitting a straight line (the regression line) to the observed data. This line represents the best estimate of Y for any given value of X.
See the image below for an example: a simple linear relationship between temperature and energy consumption.

When the temperature (x-axis) increases, the energy consumption (y-axis) increases. This relationship can be modeled with a straight line, which tells us how the two variables are related: as one goes up, so does the other, in a predictable way. The line can also be used to predict the energy consumption for a given temperature. This is the simplest form of regression analysis.

Machine Learning involves finding the best-fitting line that describes the relationship between temperature and energy consumption, as shown in the example above. This process allows us to make predictions based on observed data. The question is: how do we find the best-fitting line?
The Equation
Going back to the previous example, we consider the following:
- x : Temperature
- y : Energy Consumption
The linear equation that represents the relationship between the two variables is:
\(y = wx + b\)
We already know what \(x\) and \(y\) represent. But what are \(w\) and \(b\)?
To get the intuition of these two variables, we will vary the value of \(w\) while keeping \(b\) constant, then observe how the line changes:

The line is changing when different values of \(w\) are plugged in. This should give you an intuition of the purpose of \(w\): it controls the slope of the line. A higher \(w\) means the line rises more steeply as \(x\) increases, while a lower (or negative) \(w\) means the line is flatter or slopes downward.
Similarly, \(b\) is called the intercept. It determines where the line crosses the \(y\)-axis. Changing \(b\) shifts the line up or down without changing its slope.

In summary:
- \(w\) (weight): Controls the angle or steepness of the line (how much \(y\) changes when \(x\) increases by one unit).
- \(b\) (bias): Moves the line up or down.
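To make this concrete, here is a minimal sketch in plain Python (the values are illustrative only, not taken from the data below) that evaluates \(y = wx + b\) for a few different slopes and intercepts:

```python
# Evaluate y = w*x + b at a fixed x for different slopes and intercepts.
# The specific numbers are illustrative only.
def predict(x, w, b):
    return w * x + b

x = 2.0
for w in [-1.0, 0.5, 2.0]:      # varying the slope, intercept fixed at 0
    print(f"w = {w:+.1f}, b = 0.0 -> y = {predict(x, w, 0.0):+.1f}")
for b in [-1.0, 0.0, 3.0]:      # varying the intercept, slope fixed at 1
    print(f"w = +1.0, b = {b:+.1f} -> y = {predict(x, 1.0, b):+.1f}")
```

A larger \(w\) changes the prediction more for the same \(x\), while changing \(b\) shifts every prediction by the same amount.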
Use Case
Now that we understand the line equation and its parameters, let’s see how to find the best values for \(w\) and \(b\) that fit our data. The goal is to choose \(w\) and \(b\) so that the line is as close as possible to all the data points.
In order to find those values, we need to start with the historical data collected on the input x (temperature) and the actual measured y value (energy consumption). Here is a preview of the historical data (training data):
| x (Temperature) | y (Energy Consumption) |
|---|---|
| 0.0 | 2.32 |
| 0.53 | 0.72 |
| 1.05 | 1.32 |
| 1.58 | 2.72 |
| 2.11 | 3.65 |
| 2.63 | 7.83 |
| 3.16 | 6.6 |
| 3.68 | 5.08 |
| 4.21 | 7.17 |
| 4.74 | 8.48 |
The approach to finding the values of \(w\) and \(b\) is to start with a random guess, then slowly adjust them on each iteration until the line fits the data as well as possible.
How can we know that we have reached the best result?
We need a tool—a mathematical function—that checks at the end of each iteration whether we have reached the best fit or not. This is called a loss function. The loss function measures how far we are from the true value; the returned value of this function is called error. The bigger the value, the farther we are from the best fit.
A common loss function for linear regression is the Mean Squared Error (MSE), which is calculated as:
\[ \text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \]
Where:
\(y_i\) is the actual value,
\(\hat{y}_i\) is the predicted value from our line,
\(n\) is the number of data points.
What does the equation represent? For each data point, predict \(\hat{y}_i\), subtract it from the true value \(y_i\), and square the difference (the squared error); then average all the squared errors by dividing by the total number of points \(n\).
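As a quick sketch of how this formula looks in code (assuming NumPy arrays; the function name `mse` is just for illustration):

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean Squared Error: the average of the squared differences."""
    return np.mean((y_true - y_pred) ** 2)

# Tiny example: two points with errors of 1 and 3 -> MSE = (1 + 9) / 2 = 5.0
print(mse(np.array([2.0, 4.0]), np.array([3.0, 1.0])))
```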
Training a Linear Regression Model
In this section, I will guide you through training a simple linear regression model.
Remember that training in machine learning requires the following:
- Training Data: A set of input-output pairs (in our case, temperature and energy consumption).
- A Model: The linear equation \(y = wx + b\).
- A Loss Function: To measure how well the model predicts the data (we use Mean Squared Error).
- An Optimization Algorithm: To adjust \(w\) and \(b\) to minimize the loss (commonly, Gradient Descent).
Let’s see how this works in practice using Python:
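Below is a minimal sketch of how such training data could be defined. The exact generating process of the original data is not shown here, so I assume 20 evenly spaced temperature values with an approximately linear energy consumption plus random noise:

```python
import numpy as np

rng = np.random.default_rng(42)   # fixed seed so the results are reproducible

n_samples = 20
X = np.linspace(0, 10, n_samples)                     # temperature values
y = 2.0 * X + 1.0 + rng.normal(0, 1.5, n_samples)     # energy consumption (assumed linear trend + noise)
```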
Don’t panic about the numbers; we are only defining the data that will be used for training. We are training on just 20 sample points, but in real scenarios millions of records are used, and the more training data we have, the better the results we get.
If you’re not quite ready for the example above, I’ve created a simpler explainer to help you build intuition about what actually happens during training and what “training” really means.
👉 See the Simple Training Intuition page
After defining the training data above, here is a sample for preview:
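A simple way to preview the first few (x, y) pairs, assuming the `X` and `y` arrays defined above:

```python
for xi, yi in zip(X[:5], y[:5]):
    print(f"temperature = {xi:5.2f} -> energy = {yi:5.2f}")
```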
Remember this is our model:
\[ y = wX + b \]
To start, we need initial values of \(w\) and \(b\); let's set them to:
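For instance (any starting values would do; these are just the ones assumed here):

```python
w = 0.5   # initial guess for the slope
b = 1.0   # initial guess for the intercept
```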
These are arbitrary starting values; you can pick different ones if you want.
Now, with the above values of \(w\) and \(b\), let's see how the line fits the \(X\) and \(y\) points defined earlier:
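A minimal plotting sketch, assuming Matplotlib and the `X`, `y`, `w`, and `b` defined above:

```python
import matplotlib.pyplot as plt

plt.scatter(X, y, label="training data")
plt.plot(X, w * X + b, color="red", label=f"initial line: y = {w}x + {b}")
plt.xlabel("Temperature")
plt.ylabel("Energy Consumption")
plt.legend()
plt.show()
```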
You can adjust the values of w and b and then rerun the above cell to see how the line changes.
Math
REMEMBER, training means finding the best values of \(w\) and \(b\) that fit the data. For that we will use the gradient descent technique to update the weights on each iteration.
What is gradient descent?
It is a mathematical approach that helps the model learn by slowly adjusting \(w\) and \(b\) so the loss gets smaller and smaller. At each step, gradient descent checks how far off the prediction is, then changes the weights in the direction that makes the error smaller. By repeating this process, the model gets closer to the best fit.
For that, we repeat the following steps:
1. Start with a random guess for \(w\) and \(b\).
2. Get \(\hat{y}_i\) (the predicted value) by plugging \(x\) into the model equation (\(\hat{y} = wx + b\)).
3. Compute the loss (how far is the predicted value from the target?).
4. Using the gradients of the loss with respect to \(w\) and \(b\), update their values.
5. Repeat the process.
Only one thing remains before turning everything into code: what is the gradient?
The gradient tells us how much the loss will change if we tweak \(w\) or \(b\) just a little. It points in the direction where the loss increases, so we step in the opposite direction to make the loss smaller; this is the purpose of the whole process, since we want the weight values that lead to the smallest loss.
In math terms:
To find the gradient of \(w\) for example, we start with the loss function:
\[ \text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \]
Substitute \(\hat{y}_i\) with the model equation \(\hat{y}_i = w x_i + b\):
\[ \text{MSE} = \text{Loss} = \frac{1}{n} \sum_{i=1}^{n} \left(y_i - (w x_i + b)\right)^2 \]
Compute the gradient with respect to \(w\) (how much does the loss change when \(w\) changes?). By the chain rule, the derivative of the inner term \(y_i - (w x_i + b)\) with respect to \(w\) is \(-x_i\), so:
\[ \frac{\partial \text{Loss}}{\partial w} = -2\,\left(y_i - (w x_i + b)\right) \cdot x_i \]
The summation symbol \(\sum_{i=1}^{n}\) is used when calculating the loss over all data points. Here, we’re working with just one example, so the sum disappears from the equation.
The same process gives the gradient with respect to \(b\):
\[ \frac{\partial \text{Loss}}{\partial b} = -2\,\left(y_i - (w x_i + b)\right) \]
Now we use these gradient values to get the updated values of the weights:
\[ w_{new} = w_{old} - lr \cdot \frac{\partial \text{Loss}}{\partial w} \]
\[ b_{new} = b_{old} - lr \cdot \frac{\partial \text{Loss}}{\partial b} \]
The \(lr\) in the equations stands for the learning rate. It controls how big a step we take when updating \(w\) and \(b\). A small learning rate means slower, more careful steps; a large learning rate means faster, bigger steps. Choosing the right learning rate helps the model learn efficiently.
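As a tiny worked sketch of a single update step (one data point and made-up numbers, just to show the arithmetic):

```python
# One gradient-descent step on a single (x, y) pair; all numbers are illustrative.
x_i, y_i = 2.0, 5.0
w, b, lr = 0.5, 1.0, 0.01

y_hat = w * x_i + b              # prediction: 0.5 * 2.0 + 1.0 = 2.0
dw = -2 * (y_i - y_hat) * x_i    # gradient w.r.t. w: -2 * 3.0 * 2.0 = -12.0
db = -2 * (y_i - y_hat)          # gradient w.r.t. b: -2 * 3.0 = -6.0

w = w - lr * dw                  # 0.5 - 0.01 * (-12.0) = 0.62
b = b - lr * db                  # 1.0 - 0.01 * (-6.0)  = 1.06
print(w, b)                      # the line has moved slightly toward the point
```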
Code
Now that we have the mathematical tools and an intuition for how training works, let's turn everything into code and see how the training process modifies the initial values of \(w\) and \(b\) until it finds the best-fitting line:
Start by defining the initial values:
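A sketch of one possible set of starting values (the learning rate and number of epochs are values I assume here, not prescribed ones):

```python
w = 0.5          # initial slope
b = 1.0          # initial intercept
lr = 0.01        # learning rate
epochs = 1000    # number of passes over the training data
```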
Define the Loss function and the gradient function:
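A minimal sketch of both functions, assuming NumPy arrays for `X` and `y` (the gradients are averaged over all points, as explained below):

```python
import numpy as np

def loss_fn(y_true, y_pred):
    """Mean Squared Error over all data points."""
    return np.mean((y_true - y_pred) ** 2)

def gradients(X, y_true, y_pred):
    """Average gradients of the MSE loss with respect to w and b."""
    dw = np.mean(-2 * (y_true - y_pred) * X)
    db = np.mean(-2 * (y_true - y_pred))
    return dw, db
```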
If you checked the other simple explainer, we used only a single \(x\) and \(y\) value there. Here we have a list of \(X\) values and \(y\) values.
👉 See the Simple Training Intuition page
Because we have a list of \(X\) and \(y\) values, in each epoch we use all values of \(X\) to produce a list of predictions. Then we calculate the average gradient across all points and use it to update the weights. This way, the model learns from the whole dataset at once, not just a single example.
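Putting the pieces together, a sketch of the training loop using the values and functions defined above:

```python
for epoch in range(epochs):
    y_pred = w * X + b                    # predict for all points at once
    dw, db = gradients(X, y, y_pred)      # average gradients over the dataset
    w -= lr * dw                          # update the weights
    b -= lr * db
    if epoch % 100 == 0:
        print(f"epoch {epoch:4d}  loss = {loss_fn(y, y_pred):.4f}")

print(f"final: w = {w:.3f}, b = {b:.3f}")
```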
Using the final weights we found, plot the line and see how it fits the data:
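A plotting sketch (again assuming Matplotlib), drawing the initial guess from earlier alongside the fitted line:

```python
import matplotlib.pyplot as plt

plt.scatter(X, y, label="training data")
plt.plot(X, 0.5 * X + 1.0, "r--", label="initial guess (w = 0.5, b = 1.0)")
plt.plot(X, w * X + b, color="green", label=f"fitted line: y = {w:.2f}x + {b:.2f}")
plt.xlabel("Temperature")
plt.ylabel("Energy Consumption")
plt.legend()
plt.show()
```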
Notice in the resulting graph the difference between the initial guess we made at the beginning of this explainer and the position of the line the model found after training.
Summary
In this explainer, we learned how linear regression works by fitting a straight line to data using the equation \(y = wx + b\). We started with a random guess of the values of \(w\) and \(b\) (the weights), then, using the gradient descent approach, we looped over the data and updated the weights for a certain number of epochs. The result is the best-fitting line.