Gradient Descent

We will extend our simple gradient descent example by implementing the following:

  • Add a bias term

  • Add an Activation Function

  • Stack Layers

Example used: \[ x = [1, 2, 3, 4, 5] \] \[ y = 25 \]

Find the weights satisfying the equation: \[ w_1 \cdot 1 + w_2 \cdot 2 + w_3 \cdot 3 + w_4 \cdot 4 + w_5 \cdot 5 = 25 \]

The loss function is: \[ L(w) = (w_1 x_1 + w_2 x_2 + w_3 x_3 + w_4 x_4 + w_5 x_5 - y)^2 \]
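Before adding the bias term, the loss can be sanity-checked numerically. A minimal sketch (the zero initialization and the exact solution \(w = [0, 0, 0, 0, 5]\) are illustrative choices, not from the text):

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = 25

def loss(w, x, y):
    # L(w) = (w . x - y)^2 for this single example
    return (np.dot(w, x) - y) ** 2

w_exact = np.array([0, 0, 0, 0, 5])  # 5 * 5 = 25, so the loss vanishes
print(loss(w_exact, x, y))           # 0

w0 = np.zeros(5)
print(loss(w0, x, y))                # (0 - 25)^2 = 625
```

Because one equation constrains five weights, the system is underdetermined: many weight vectors reach zero loss, and gradient descent simply converges to one of them.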

Add a bias term

In this section, we will add a bias term to our model.

The equation: \[ w_1 \cdot 1 + w_2 \cdot 2 + w_3 \cdot 3 + w_4 \cdot 4 + w_5 \cdot 5 + b = 25 \]

Where \(b\) is the bias term.

The gradient of \(w_1\) is: \[ \frac{\partial L}{\partial w_1} = (w_1 x_1 + w_2 x_2 + w_3 x_3 + w_4 x_4 + w_5 x_5 + b - y) \cdot x_1 \]

The gradient of \(b\) is: \[ \frac{\partial L}{\partial b} = (w_1 x_1 + w_2 x_2 + w_3 x_3 + w_4 x_4 + w_5 x_5 + b - y) \]

Show Code
import numpy as np
# Dataset
x = np.array([1, 2, 3, 4, 5])
y = 25
# Initial weights
w = np.array([4.2, 3.6, 1.5, 0.7, 0.1])
b = 0.0

# Learning rate
alpha = 0.01

# Compute the loss
def compute_loss(w, b, x, y):
    return (np.dot(w, x) + b - y) ** 2  # squared error of the single prediction w·x + b

# Compute the gradient (This is for a single data point)
def compute_gradient(w, b, x, y):
    error = np.dot(w, x) + b - y
    dw = 2 * error * x             # gradient w.r.t each weight
    db = 2 * error                 # gradient w.r.t bias
    return dw, db

# Update weights
def update_weights(w, b, x, y, alpha):
    dw, db = compute_gradient(w, b, x, y)
    w = w - alpha * dw
    b = b - alpha * db
    return w, b

# Training loop
for epoch in range(1000):
    loss = compute_loss(w, b, x, y)
    w, b = update_weights(w, b, x, y, alpha)
    if epoch % 100 == 0:
        print(f"Epoch {epoch}: Loss = {loss}")

print(f"Final weights: {w}")
print(f"Final bias: {b}")
print(f"Final loss: {compute_loss(w, b, x, y)}")

#multiply weights with x:
print(f"Prediction: {np.dot(w, x) + b}")
Epoch 0: Loss = 452.56399999999996
Epoch 100: Loss = 398.9825051020408
Epoch 200: Loss = 398.9825051020408
Epoch 300: Loss = 398.9825051020408
Epoch 400: Loss = 398.9825051020408
Epoch 500: Loss = 398.9825051020408
Epoch 600: Loss = 398.9825051020408
Epoch 700: Loss = 398.9825051020408
Epoch 800: Loss = 398.9825051020408
Epoch 900: Loss = 398.9825051020408
Final weights: [4.30357143 3.80714286 1.81071429 1.11428571 0.61785714]
Final bias: 0.10357142857142854
Final loss: 398.9825051020408
Prediction: 25.0

Before moving forward, let's create a dataset with multiple examples.

\[ x = [[1, 2, 3, 4, 5], [2, 3, 4, 5, 6], [3, 4, 5, 6, 7]] \] \[ y = [25, 30, 35] \]

Show Code
#Dataset with multiple examples
x = np.array([[1, 2, 3, 4, 5], [2, 3, 4, 5, 6], [3, 4, 5, 6, 7]])
y = np.array([25, 30, 35])

w = np.zeros(x.shape[1])  # Initialize weights to zero

b = 0.0
alpha = 0.01


# Compute the loss
def compute_loss(w, b, x, y):
    return np.mean((np.dot(x, w) + b - y) ** 2)

def compute_gradient(w, b, x, y):
    errors = np.dot(x, w) + b - y       # shape (3,) — one error per example
    dw = (2 / len(x)) * np.dot(x.T, errors)  # shape (5,)
    db = (2 / len(x)) * np.sum(errors)        # scalar
    return dw, db

def update_weights(w, b, x, y, alpha):
    dw, db = compute_gradient(w, b, x, y)
    w = w - alpha * dw
    b = b - alpha * db
    return w, b

for epoch in range(1000):
    loss = compute_loss(w, b, x, y)
    w, b = update_weights(w, b, x, y, alpha)
    if epoch % 100 == 0:
        print(f"Epoch {epoch}: Loss = {loss}")
        

print(f"Final weights: {w}")
print(f"Final bias: {b}")
print(f"Predictions: {np.dot(x, w) + b}")
print(f"Targets:     {y}")
Epoch 0: Loss = 916.6666666666666
Epoch 100: Loss = 0.3542681599414232
Epoch 200: Loss = 0.07389895348457104
Epoch 300: Loss = 0.015415033100318507
Epoch 400: Loss = 0.0032155157046107573
Epoch 500: Loss = 0.0006707440184727005
Epoch 600: Loss = 0.00013991458280599502
Epoch 700: Loss = 2.918563556681734e-05
Epoch 800: Loss = 6.088009600987926e-06
Epoch 900: Loss = 1.269935027346966e-06
Final weights: [-0.81758997  0.09126705  1.00012407  1.9089811   2.81783812]
Final bias: 0.9088570222521724
Predictions: [24.99928835 29.99990871 35.00052908]
Targets:     [25 30 35]

Why x.T?

The gradient for weight \(w_j\) sums, over all examples, the error multiplied by feature \(x_j\). Transposing \(x\) puts each feature in its own row (the first row of x.T holds \(x_1\) for every example), so np.dot(x.T, errors) computes the gradients for all weights in a single matrix multiplication.
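To make this concrete, here is the vectorized gradient compared against an explicit per-weight loop (the error values are made up for illustration):

```python
import numpy as np

x = np.array([[1, 2, 3, 4, 5],
              [2, 3, 4, 5, 6],
              [3, 4, 5, 6, 7]])
errors = np.array([0.5, -1.0, 2.0])  # hypothetical per-example errors

# Row j of x.T holds feature j across all examples,
# so np.dot(x.T, errors) yields one gradient entry per weight.
vectorized = np.dot(x.T, errors)

# Equivalent explicit loop over weights:
looped = np.array([np.sum(x[:, j] * errors) for j in range(x.shape[1])])

assert np.allclose(vectorized, looped)
print(vectorized)
```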

Training real data

Let’s test the above implementation on real data. Can it produce accurate results on a real dataset?

Show Code
# import dataset
from sklearn.datasets import load_diabetes
data = load_diabetes()
X, y = data.data, data.target

#split dataset into train and test
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Show Code
#print the first 5 examples in a dataframe
import pandas as pd
df = pd.DataFrame(X_train, columns=data.feature_names)
df.head()
age sex bmi bp s1 s2 s3 s4 s5 s6
0 0.070769 0.050680 0.012117 0.056301 0.034206 0.049416 -0.039719 0.034309 0.027364 -0.001078
1 -0.009147 0.050680 -0.018062 -0.033213 -0.020832 0.012152 -0.072854 0.071210 0.000272 0.019633
2 0.005383 -0.044642 0.049840 0.097615 -0.015328 -0.016345 -0.006584 -0.002592 0.017036 -0.013504
3 -0.027310 -0.044642 -0.035307 -0.029770 -0.056607 -0.058620 0.030232 -0.039493 -0.049872 -0.129483
4 -0.023677 -0.044642 -0.065486 -0.081413 -0.038720 -0.053610 0.059685 -0.076395 -0.037129 -0.042499

Features in the diabetes dataset:

Column Full Name Description
age Age Age of the patient
sex Sex Sex of the patient
bmi Body Mass Index Weight relative to height
bp Blood Pressure Average blood pressure
s1 TC Total cholesterol
s2 LDL Low-density lipoprotein
s3 HDL High-density lipoprotein
s4 TCH Total cholesterol / HDL ratio
s5 LTG Log of serum triglycerides
s6 GLU Blood sugar level

The values like 0.070769 and -0.044642 appear because sklearn ships this dataset already preprocessed. Each feature column has been:

  • Mean-centered (mean = 0)

  • Scaled by its standard deviation times \(\sqrt{n}\), so the column's sum of squares equals 1
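This preprocessing can be reproduced on synthetic data to verify both properties (a minimal NumPy sketch; the raw values are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
raw = rng.normal(loc=50, scale=10, size=200)  # a made-up raw feature column

# Reproduce sklearn's preprocessing: center, then divide by std * sqrt(n)
centered = raw - raw.mean()
scaled = centered / (raw.std() * np.sqrt(len(raw)))

print(scaled.mean())          # ~0
print(np.sum(scaled ** 2))    # ~1
```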

Show Code
print(y[:5])
[151.  75. 141. 206. 135.]
Show Code
b = 0.0

# Compute the loss
def compute_loss(w, b, x, y):
    return np.mean((np.dot(x, w) + b - y) ** 2)

def compute_gradient(w, b, x, y):
    errors = np.dot(x, w) + b - y            # shape (n_samples,) — one error per example
    dw = (2 / len(x)) * np.dot(x.T, errors)  # shape (n_features,) — one gradient per weight
    db = (2 / len(x)) * np.sum(errors)       # scalar
    return dw, db

def update_weights(w, b, x, y, alpha):
    dw, db = compute_gradient(w, b, x, y)
    w = w - alpha * dw
    b = b - alpha * db
    return w, b
Show Code
#store losses for plotting
losses = []
alpha = 0.01

w = np.zeros(X_train.shape[1])  # Initialize weights to zero

# training loop
for epoch in range(1000):
    loss = compute_loss(w, b, X_train, y_train)
    w, b = update_weights(w, b, X_train, y_train, alpha)
    losses.append(loss)
    if epoch % 100 == 0:
        print(f"Epoch {epoch}: Loss = {loss}")
        


#plot the loss curve
import matplotlib.pyplot as plt

plt.plot(losses)
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Loss Curve')
plt.show()
Epoch 0: Loss = 6078.967813535314
Epoch 100: Loss = 5997.582985597988
Epoch 200: Loss = 5921.177398303041
Epoch 300: Loss = 5847.190151819955
Epoch 400: Loss = 5775.497046711073
Epoch 500: Loss = 5706.019503961029
Epoch 600: Loss = 5638.682415619815
Epoch 700: Loss = 5573.413309446693
Epoch 800: Loss = 5510.1422434676015
Epoch 900: Loss = 5448.801716577623

We notice from the above output that the model is not performing well: the loss decreases only very slowly, and after 1,000 epochs the predictions are still far from the actual values.

We need to improve the training process by:

  • Randomly initializing the weights

  • Training for more epochs
Show Code
#store losses for plotting
losses = []
alpha = 0.01
b = 0.0  # reset the bias along with the weights

w = np.random.randn(X_train.shape[1]) * 0.01  # small random values

# training loop
for epoch in range(10000):
    loss = compute_loss(w, b, X_train, y_train)
    w, b = update_weights(w, b, X_train, y_train, alpha)
    losses.append(loss)
    if epoch % 100 == 0:
        print(f"Epoch {epoch}: Loss = {loss}")
        


#plot the loss curve
import matplotlib.pyplot as plt

plt.plot(losses)
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Loss Curve')
plt.show()
Epoch 0: Loss = 6076.474836574767
Epoch 100: Loss = 5997.59499482191
Epoch 200: Loss = 5921.2326574172075
Epoch 300: Loss = 5847.245383693941
Epoch 400: Loss = 5775.551507061216
Epoch 500: Loss = 5706.073196041579
Epoch 600: Loss = 5638.735355362369
Epoch 700: Loss = 5573.465512551738
Epoch 800: Loss = 5510.193725192475
Epoch 900: Loss = 5448.852491747228
Epoch 1000: Loss = 5389.376665541143
Epoch 1100: Loss = 5331.7033717851
Epoch 1200: Loss = 5275.771927531907
Epoch 1300: Loss = 5221.523764461674
Epoch 1400: Loss = 5168.902354396241
Epoch 1500: Loss = 5117.853137446141
Epoch 1600: Loss = 5068.323452696923
Epoch 1700: Loss = 5020.262471345028
Epoch 1800: Loss = 4973.621132196499
Epoch 1900: Loss = 4928.352079445013
Epoch 2000: Loss = 4884.4096026484995
Epoch 2100: Loss = 4841.749578826645
Epoch 2200: Loss = 4800.329416604203
Epoch 2300: Loss = 4760.108002327755
Epoch 2400: Loss = 4721.045648086098
Epoch 2500: Loss = 4683.1040415669295
Epoch 2600: Loss = 4646.246197684861
Epoch 2700: Loss = 4610.436411918091
Epoch 2800: Loss = 4575.640215293314
Epoch 2900: Loss = 4541.824330960554
Epoch 3000: Loss = 4508.956632301666
Epoch 3100: Loss = 4477.006102518284
Epoch 3200: Loss = 4445.942795646865
Epoch 3300: Loss = 4415.7377989503575
Epoch 3400: Loss = 4386.363196637797
Epoch 3500: Loss = 4357.792034864874
Epoch 3600: Loss = 4329.998287970148
Epoch 3700: Loss = 4302.956825903229
Epoch 3800: Loss = 4276.643382802739
Epoch 3900: Loss = 4251.034526683425
Epoch 4000: Loss = 4226.10763019316
Epoch 4100: Loss = 4201.840842402028
Epoch 4200: Loss = 4178.213061586983
Epoch 4300: Loss = 4155.2039089768505
Epoch 4400: Loss = 4132.793703423759
Epoch 4500: Loss = 4110.963436968196
Epoch 4600: Loss = 4089.6947512661195
Epoch 4700: Loss = 4068.96991484763
Epoch 4800: Loss = 4048.7718011777933
Epoch 4900: Loss = 4029.0838674912743
Epoch 5000: Loss = 4009.890134373396
Epoch 5100: Loss = 3991.175166061249
Epoch 5200: Loss = 3972.9240514393928
Epoch 5300: Loss = 3955.122385705588
Epoch 5400: Loss = 3937.756252682872
Epoch 5500: Loss = 3920.812207755135
Epoch 5600: Loss = 3904.2772614041537
Epoch 5700: Loss = 3888.1388633268048
Epoch 5800: Loss = 3872.3848871119835
Epoch 5900: Loss = 3857.0036154574154
Epoch 6000: Loss = 3841.983725907285
Epoch 6100: Loss = 3827.3142770922846
Epoch 6200: Loss = 3812.9846954543054
Epoch 6300: Loss = 3798.984762438673
Epoch 6400: Loss = 3785.3046021373743
Epoch 6500: Loss = 3771.9346693673483
Epoch 6600: Loss = 3758.8657381684848
Epoch 6700: Loss = 3746.0888907064686
Epoch 6800: Loss = 3733.595506566194
Epoch 6900: Loss = 3721.3772524219276
Epoch 7000: Loss = 3709.4260720709226
Epoch 7100: Loss = 3697.734176817641
Epoch 7200: Loss = 3686.294036196189
Epoch 7300: Loss = 3675.0983690190333
Epoch 7400: Loss = 3664.140134740463
Epoch 7500: Loss = 3653.412525123671
Epoch 7600: Loss = 3642.908956200761
Epoch 7700: Loss = 3632.623060515295
Epoch 7800: Loss = 3622.5486796374344
Epoch 7900: Loss = 3612.6798569420366
Epoch 8000: Loss = 3603.0108306404154
Epoch 8100: Loss = 3593.536027056818
Epoch 8200: Loss = 3584.250054140981
Epoch 8300: Loss = 3575.147695208411
Epoch 8400: Loss = 3566.2239029003786
Epoch 8500: Loss = 3557.473793355847
Epoch 8600: Loss = 3548.8926405878565
Epoch 8700: Loss = 3540.475871057171
Epoch 8800: Loss = 3532.219058436192
Epoch 8900: Loss = 3524.117918556445
Epoch 9000: Loss = 3516.1683045331647
Epoch 9100: Loss = 3508.366202060707
Epoch 9200: Loss = 3500.707724872788
Epoch 9300: Loss = 3493.189110361716
Epoch 9400: Loss = 3485.8067153510106
Epoch 9500: Loss = 3478.5570120160155
Epoch 9600: Loss = 3471.4365839472507
Epoch 9700: Loss = 3464.4421223515074
Epoch 9800: Loss = 3457.5704223858047
Epoch 9900: Loss = 3450.8183796195267

Show Code
# NumPy Implementation - Accuracy Curve (R²)
from sklearn.metrics import r2_score

losses = []
r2_scores = []
alpha = 0.01
w = np.random.randn(X_train.shape[1]) * 0.01
b = 0.0

for epoch in range(10000):
    loss = compute_loss(w, b, X_train, y_train)
    w, b = update_weights(w, b, X_train, y_train, alpha)
    
    # compute R² every epoch
    y_pred = np.dot(X_test, w) + b
    r2 = r2_score(y_test, y_pred)
    
    losses.append(loss)
    r2_scores.append(r2)

# Plot both curves side by side
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

ax1.plot(losses)
ax1.set_xlabel('Epoch')
ax1.set_ylabel('Loss (MSE)')
ax1.set_title('Loss Curve')

ax2.plot(r2_scores, color='green')
ax2.set_xlabel('Epoch')
ax2.set_ylabel('R² Score')
ax2.set_title('Accuracy Curve (R²)')

plt.tight_layout()
plt.show()

print(f"Final R²:   {r2_scores[-1]:.4f}")
print(f"Final RMSE: {np.sqrt(losses[-1]):.2f}")

Final R²:   0.4167
Final RMSE: 58.68
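For context, the best a linear model can achieve on this split is computable in closed form with ordinary least squares. A sketch, assuming the same train/test split as above:

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

data = load_diabetes()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42)

# Append a column of ones so lstsq fits the bias term as well
A = np.hstack([X_train, np.ones((len(X_train), 1))])
coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)
w_opt, b_opt = coef[:-1], coef[-1]

r2 = r2_score(y_test, X_test @ w_opt + b_opt)
print(f"Closed-form test R²: {r2:.4f}")
```

After 10,000 epochs the gradient-descent solution is still short of this optimum, which is why the loss curve keeps creeping downward.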

Test on PyTorch

Let’s reimplement the model in PyTorch. Can we get the same results?

Show Code
import torch
import torch.nn as nn


# Convert to tensors
X_tensor = torch.tensor(X_train, dtype=torch.float32)
y_tensor = torch.tensor(y_train, dtype=torch.float32).view(-1, 1)  # shape (n_train, 1)

# Define model — same as ours: linear layer (weights + bias)
model = nn.Linear(X_train.shape[1], 1)  # 10 inputs, 1 output

# Loss function — same as ours: MSE
loss_fn = nn.MSELoss()

# Optimizer — same as ours: gradient descent
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Training loop
# PyTorch Implementation - Accuracy Curve (R²)
losses = []
r2_scores = []

for epoch in range(10000):
    y_pred = model(X_tensor)
    loss = loss_fn(y_pred, y_tensor)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # compute R² every epoch
    with torch.no_grad():
        X_test_tensor = torch.tensor(X_test, dtype=torch.float32)
        y_test_tensor = torch.tensor(y_test, dtype=torch.float32).view(-1, 1)
        preds = model(X_test_tensor).squeeze().numpy()
        r2 = r2_score(y_test, preds)

    losses.append(loss.item())
    r2_scores.append(r2)

# Plot both curves side by side
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

ax1.plot(losses)
ax1.set_xlabel('Epoch')
ax1.set_ylabel('Loss (MSE)')
ax1.set_title('Loss Curve (PyTorch)')

ax2.plot(r2_scores, color='green')
ax2.set_xlabel('Epoch')
ax2.set_ylabel('R² Score')
ax2.set_title('Accuracy Curve (PyTorch)')

plt.tight_layout()
plt.show()

print(f"Final R²:   {r2_scores[-1]:.4f}")
print(f"Final RMSE: {np.sqrt(losses[-1]):.2f}")

Final R²:   0.4167
Final RMSE: 58.69

Import the model class defined in model_functions.py

Show Code
from model_functions import MyModel

model = MyModel(alpha=0.01, epochs=10000)
model.fit(X_train, y_train)

model.plot_loss()
Epoch 0: Loss = 29711.4162
Epoch 100: Loss = 6409.4053
Epoch 200: Loss = 5924.7243
Epoch 300: Loss = 5843.6582
Epoch 400: Loss = 5771.9388
Epoch 500: Loss = 5702.5559
Epoch 600: Loss = 5635.3125
Epoch 700: Loss = 5570.1342
Epoch 800: Loss = 5506.9511
Epoch 900: Loss = 5445.6957
Epoch 1000: Loss = 5386.3030
Epoch 1100: Loss = 5328.7102
Epoch 1200: Loss = 5272.8567
Epoch 1300: Loss = 5218.6841
Epoch 1400: Loss = 5166.1359
Epoch 1500: Loss = 5115.1576
Epoch 1600: Loss = 5065.6967
Epoch 1700: Loss = 5017.7023
Epoch 1800: Loss = 4971.1255
Epoch 1900: Loss = 4925.9191
Epoch 2000: Loss = 4882.0373
Epoch 2100: Loss = 4839.4360
Epoch 2200: Loss = 4798.0729
Epoch 2300: Loss = 4757.9068
Epoch 2400: Loss = 4718.8980
Epoch 2500: Loss = 4681.0084
Epoch 2600: Loss = 4644.2010
Epoch 2700: Loss = 4608.4402
Epoch 2800: Loss = 4573.6915
Epoch 2900: Loss = 4539.9216
Epoch 3000: Loss = 4507.0986
Epoch 3100: Loss = 4475.1915
Epoch 3200: Loss = 4444.1703
Epoch 3300: Loss = 4414.0061
Epoch 3400: Loss = 4384.6712
Epoch 3500: Loss = 4356.1386
Epoch 3600: Loss = 4328.3822
Epoch 3700: Loss = 4301.3771
Epoch 3800: Loss = 4275.0989
Epoch 3900: Loss = 4249.5243
Epoch 4000: Loss = 4224.6307
Epoch 4100: Loss = 4200.3963
Epoch 4200: Loss = 4176.8000
Epoch 4300: Loss = 4153.8214
Epoch 4400: Loss = 4131.4408
Epoch 4500: Loss = 4109.6394
Epoch 4600: Loss = 4088.3988
Epoch 4700: Loss = 4067.7013
Epoch 4800: Loss = 4047.5297
Epoch 4900: Loss = 4027.8676
Epoch 5000: Loss = 4008.6990
Epoch 5100: Loss = 3990.0084
Epoch 5200: Loss = 3971.7811
Epoch 5300: Loss = 3954.0025
Epoch 5400: Loss = 3936.6589
Epoch 5500: Loss = 3919.7368
Epoch 5600: Loss = 3903.2232
Epoch 5700: Loss = 3887.1055
Epoch 5800: Loss = 3871.3718
Epoch 5900: Loss = 3856.0102
Epoch 6000: Loss = 3841.0095
Epoch 6100: Loss = 3826.3588
Epoch 6200: Loss = 3812.0474
Epoch 6300: Loss = 3798.0652
Epoch 6400: Loss = 3784.4024
Epoch 6500: Loss = 3771.0493
Epoch 6600: Loss = 3757.9968
Epoch 6700: Loss = 3745.2360
Epoch 6800: Loss = 3732.7583
Epoch 6900: Loss = 3720.5553
Epoch 7000: Loss = 3708.6190
Epoch 7100: Loss = 3696.9416
Epoch 7200: Loss = 3685.5156
Epoch 7300: Loss = 3674.3338
Epoch 7400: Loss = 3663.3890
Epoch 7500: Loss = 3652.6746
Epoch 7600: Loss = 3642.1839
Epoch 7700: Loss = 3631.9106
Epoch 7800: Loss = 3621.8484
Epoch 7900: Loss = 3611.9916
Epoch 8000: Loss = 3602.3343
Epoch 8100: Loss = 3592.8709
Epoch 8200: Loss = 3583.5961
Epoch 8300: Loss = 3574.5047
Epoch 8400: Loss = 3565.5916
Epoch 8500: Loss = 3556.8519
Epoch 8600: Loss = 3548.2810
Epoch 8700: Loss = 3539.8742
Epoch 8800: Loss = 3531.6271
Epoch 8900: Loss = 3523.5355
Epoch 9000: Loss = 3515.5953
Epoch 9100: Loss = 3507.8023
Epoch 9200: Loss = 3500.1528
Epoch 9300: Loss = 3492.6429
Epoch 9400: Loss = 3485.2691
Epoch 9500: Loss = 3478.0278
Epoch 9600: Loss = 3470.9156
Epoch 9700: Loss = 3463.9292
Epoch 9800: Loss = 3457.0654
Epoch 9900: Loss = 3450.3211

Improve the model by adding an activation function

First sample: \[ x = [x_1, x_2, x_3, x_4, x_5, x_6, x_7, x_8, x_9, x_{10}] \] \[ x = [0.070769,0.050680,0.012117,0.056301,0.034206,0.049416,-0.039719,0.034309,0.027364,-0.001078] \] \[ y = 144.0\]

Equation: \[ z_1 = w_1 \cdot x_1 + ... + w_{10} \cdot x_{10} + b = 144.0 \]

\[ \hat{y} = ReLU(z_1) \]

Loss function (For all examples): \[ L(w) = \sum_{i=1}^{n}(\hat{y}_i - y_i)^2 \]

Gradient of \(w_1\): \[ \frac{\partial L}{\partial w_1} = \sum_{i=1}^{n} 2(\hat{y}_i - y_i) \cdot \frac{\partial \hat{y}_i}{\partial z_1} \cdot x_1 \]

Gradient of \(b\): \[ \frac{\partial L}{\partial b} = \sum_{i=1}^{n} 2(\hat{y}_i - y_i) \cdot \frac{\partial \hat{y}_i}{\partial z_1} \]
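These equations can be sketched directly in NumPy, independent of the MyModelReLU class used below (the tiny dataset and the all-ones initialization here are illustrative):

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def relu_grad(z):
    return (z > 0).astype(float)

def compute_gradient_relu(w, b, x, y):
    z = np.dot(x, w) + b                 # pre-activation, shape (n,)
    y_hat = relu(z)                      # prediction after the activation
    dz = 2 * (y_hat - y) * relu_grad(z)  # chain rule: d y_hat / d z is the ReLU derivative
    dw = np.dot(x.T, dz) / len(x)
    db = np.sum(dz) / len(x)
    return dw, db

x = np.array([[1.0, 2.0], [3.0, 4.0]])
y = np.array([5.0, 11.0])
dw, db = compute_gradient_relu(np.ones(2), 0.0, x, y)
print(dw, db)  # [-14. -20.] -6.0
```

One caveat the derivative makes visible: with zero initialization, z = 0 everywhere, ReLU'(0) = 0, and every gradient vanishes — another argument for small random initialization.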

Show Code
import importlib
import model_functions
importlib.reload(model_functions)


from model_functions import MyModelReLU

model = MyModelReLU(alpha=0.01, epochs=10000)
model.fit(X_train, y_train)

model.plot_loss() 

y_pred = model.predict(X_test)  # ⚠️ need to override predict too!
Epoch 0: Loss = 29711.2339
Epoch 100: Loss = 6420.0955
Epoch 200: Loss = 5925.5228
Epoch 300: Loss = 5844.2659
Epoch 400: Loss = 5772.5269
Epoch 500: Loss = 5703.1279
Epoch 600: Loss = 5635.8690
Epoch 700: Loss = 5570.6756
Epoch 800: Loss = 5507.4778
Epoch 900: Loss = 5446.2083
Epoch 1000: Loss = 5386.8020
Epoch 1100: Loss = 5329.1960
Epoch 1200: Loss = 5273.3297
Epoch 1300: Loss = 5219.1447
Epoch 1400: Loss = 5166.5845
Epoch 1500: Loss = 5115.5946
Epoch 1600: Loss = 5066.1224
Epoch 1700: Loss = 5018.1171
Epoch 1800: Loss = 4971.5298
Epoch 1900: Loss = 4926.3130
Epoch 2000: Loss = 4882.4213
Epoch 2100: Loss = 4839.8105
Epoch 2200: Loss = 4798.4380
Epoch 2300: Loss = 4758.2629
Epoch 2400: Loss = 4719.2454
Epoch 2500: Loss = 4681.3473
Epoch 2600: Loss = 4644.5316
Epoch 2700: Loss = 4608.7628
Epoch 2800: Loss = 4574.0063
Epoch 2900: Loss = 4540.2290
Epoch 3000: Loss = 4507.3987
Epoch 3100: Loss = 4475.4845
Epoch 3200: Loss = 4444.4565
Epoch 3300: Loss = 4414.2857
Epoch 3400: Loss = 4384.9443
Epoch 3500: Loss = 4356.4054
Epoch 3600: Loss = 4328.6430
Epoch 3700: Loss = 4301.6320
Epoch 3800: Loss = 4275.3481
Epoch 3900: Loss = 4249.7680
Epoch 4000: Loss = 4224.8690
Epoch 4100: Loss = 4200.6293
Epoch 4200: Loss = 4177.0279
Epoch 4300: Loss = 4154.0443
Epoch 4400: Loss = 4131.6590
Epoch 4500: Loss = 4109.8529
Epoch 4600: Loss = 4088.6078
Epoch 4700: Loss = 4067.9058
Epoch 4800: Loss = 4047.7300
Epoch 4900: Loss = 4028.0637
Epoch 5000: Loss = 4008.8910
Epoch 5100: Loss = 3990.1966
Epoch 5200: Loss = 3971.9654
Epoch 5300: Loss = 3954.1831
Epoch 5400: Loss = 3936.8359
Epoch 5500: Loss = 3919.9103
Epoch 5600: Loss = 3903.3932
Epoch 5700: Loss = 3887.2723
Epoch 5800: Loss = 3871.5353
Epoch 5900: Loss = 3856.1706
Epoch 6000: Loss = 3841.1668
Epoch 6100: Loss = 3826.5130
Epoch 6200: Loss = 3812.1988
Epoch 6300: Loss = 3798.2138
Epoch 6400: Loss = 3784.5481
Epoch 6500: Loss = 3771.1924
Epoch 6600: Loss = 3758.1373
Epoch 6700: Loss = 3745.3739
Epoch 6800: Loss = 3732.8937
Epoch 6900: Loss = 3720.6882
Epoch 7000: Loss = 3708.7496
Epoch 7100: Loss = 3697.0699
Epoch 7200: Loss = 3685.6416
Epoch 7300: Loss = 3674.4576
Epoch 7400: Loss = 3663.5107
Epoch 7500: Loss = 3652.7942
Epoch 7600: Loss = 3642.3015
Epoch 7700: Loss = 3632.0261
Epoch 7800: Loss = 3621.9621
Epoch 7900: Loss = 3612.1033
Epoch 8000: Loss = 3602.4442
Epoch 8100: Loss = 3592.9790
Epoch 8200: Loss = 3583.7024
Epoch 8300: Loss = 3574.6093
Epoch 8400: Loss = 3565.6945
Epoch 8500: Loss = 3556.9532
Epoch 8600: Loss = 3548.3806
Epoch 8700: Loss = 3539.9722
Epoch 8800: Loss = 3531.7237
Epoch 8900: Loss = 3523.6306
Epoch 9000: Loss = 3515.6888
Epoch 9100: Loss = 3507.8944
Epoch 9200: Loss = 3500.2435
Epoch 9300: Loss = 3492.7323
Epoch 9400: Loss = 3485.3571
Epoch 9500: Loss = 3478.1145
Epoch 9600: Loss = 3471.0010
Epoch 9700: Loss = 3464.0133
Epoch 9800: Loss = 3457.1483
Epoch 9900: Loss = 3450.4027

Improve the model by adding a second layer

Each hidden neuron works just like the single-layer case: a weighted sum plus (optionally) an activation function.

Hidden neuron 1

\[ z_1^{(1)} = w_{11} x_1 + w_{12} x_2 + w_{13} x_3 + b_1 \]

\[ a_1^{(1)} = \sigma(z_1^{(1)}) \]


Hidden neuron 2

\[ z_2^{(1)} = w_{21} x_1 + w_{22} x_2 + w_{23} x_3 + b_2 \]

\[ a_2^{(1)} = \sigma(z_2^{(1)}) \]


Hidden neuron 3

\[ z_3^{(1)} = w_{31} x_1 + w_{32} x_2 + w_{33} x_3 + b_3 \]

\[ a_3^{(1)} = \sigma(z_3^{(1)}) \]


So now the hidden layer outputs:

\[ (a_1^{(1)},\ a_2^{(1)},\ a_3^{(1)}) \]

Show Code
#forward pass
def forward_pass(w, b, x):
    return np.dot(x, w) + b

# Compute the loss
def compute_loss(w, b, x, y):
    return np.mean((forward_pass(w, b, x) - y) ** 2)

def compute_gradient(w, b, x, y):
    errors = forward_pass(w, b, x) - y       # shape (n_samples,) — one error per example
    dw = (2 / len(x)) * np.dot(x.T, errors)  # shape (n_features,) — one gradient per weight
    db = (2 / len(x)) * np.sum(errors)       # scalar
    return dw, db

def update_weights(w, b, x, y, alpha):
    dw, db = compute_gradient(w, b, x, y)
    w = w - alpha * dw
    b = b - alpha * db
    return w, b
Show Code
# weight initialization: an (input_size, hidden_size) matrix mapping the previous layer to the current one
def initialize_weights(input_size, hidden_size):
    return np.random.randn(input_size, hidden_size) * 0.01  # small random values


initialize_weights(3, 3)
array([[-0.00115139, -0.00137835, -0.00057377],
       [ 0.00497807, -0.00097957,  0.0127437 ],
       [ 0.00732416, -0.00475363,  0.00061397]])
Show Code
X = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
Y = np.array([10, 28, 40])
x_one_example = np.array([1, 2, 3])
w = initialize_weights(3, 3)
b = np.zeros(3)  # one bias per hidden neuron

print("For one example:", forward_pass(w, b, x_one_example), "... These are the activations of the hidden layer.")
print("==="*20)
print("For multiple examples:\n", forward_pass(w, b, X))
For one example: [ 0.01380295 -0.0005757  -0.01468555] ... These are the activations of the hidden layer.
============================================================
For multiple examples:
 [[ 0.01380295 -0.0005757  -0.01468555]
 [ 0.03393964 -0.0283276  -0.02297084]
 [ 0.05407633 -0.05607951 -0.03125612]]

Now pass the hidden layer outputs to the second layer with 2 output neurons: Output neuron 1 \[ z_1^{(2)} = w_{11}^{(2)} a_1^{(1)} + w_{21}^{(2)} a_2^{(1)} + w_{31}^{(2)} a_3^{(1)} + b_1^{(2)} \] \[ a_1^{(2)} = \sigma(z_1^{(2)}) \]

Output neuron 2 \[ z_2^{(2)} = w_{12}^{(2)} a_1^{(1)} + w_{22}^{(2)} a_2^{(1)} + w_{32}^{(2)} a_3^{(1)} + b_2^{(2)} \]

\[ a_2^{(2)} = \sigma(z_2^{(2)}) \]


The final output of the second layer is: \[ (a_1^{(2)},\ a_2^{(2)}) \]


Last layer:

\[ z_1^{(3)} = w_{11}^{(3)} a_1^{(2)} + w_{21}^{(3)} a_2^{(2)} + b_3^{(3)} \]

\[ a_1^{(3)} = \sigma(z_1^{(3)}) \]

\[ \hat{y} = a_1^{(3)} \]

Backpropagation equations:

Loss function: \[L(w) = \sum_{i=1}^{n}(\hat{y}_i - y_i)^2\]

Gradient of loss with respect to \(\hat{y}_i\): \[\frac{\partial L}{\partial \hat{y}_i} = 2(\hat{y}_i - y_i)\] Notice the summation disappears because we are computing the gradient for a single example at a time.

Gradient of loss with respect to \(a_1^{(3)}\) (output layer activation): \[\frac{\partial L}{\partial a_1^{(3)}} = \frac{\partial L}{\partial \hat{y}} = 2(\hat{y} - y)\]

Gradient of loss with respect to \(z_1^{(3)}\) (output layer pre-activation): \[\frac{\partial L}{\partial z_1^{(3)}} = \frac{\partial L}{\partial a_1^{(3)}} \cdot \frac{\partial a_1^{(3)}}{\partial z_1^{(3)}} = 2(\hat{y} - y) \cdot \sigma'(z_1^{(3)})\]

Where \(\sigma'(z)\) is the derivative of the activation function. \[\sigma'(z) = \sigma(z)(1 - \sigma(z)) \quad \]

\[\frac{\partial L}{\partial z_1^{(3)}} = 2(\hat{y} - y) \cdot a_1^{(3)}(1 - a_1^{(3)}) = \delta^{(3)}\] Where \(\delta^{(3)}\) is the error term for the current layer (3).

Gradient of loss with respect to \(w_{11}^{(3)}\): \[\frac{\partial L}{\partial w_{11}^{(3)}} = \frac{\partial L}{\partial z_1^{(3)}} \cdot \frac{\partial z_1^{(3)}}{\partial w_{11}^{(3)}} = 2(\hat{y} - y) \cdot a_1^{(3)}(1 - a_1^{(3)}) \cdot a_1^{(2)} = \delta^{(3)} \cdot a_1^{(2)}\]

\[\frac{\partial L}{\partial w_{21}^{(3)}} = \delta^{(3)} \cdot a_2^{(2)}\] \[\frac{\partial L}{\partial b_{3}^{(3)}} = \delta^{(3)}\]


Propagate the error back to the second layer: Gradient of loss with respect to \(a_1^{(2)}\): \[\frac{\partial L}{\partial a_1^{(2)}} = \frac{\partial L}{\partial z_1^{(3)}} \cdot \frac{\partial z_1^{(3)}}{\partial a_1^{(2)}}\]

We know that: \[ z_1^{(3)} = w_{11}^{(3)} a_1^{(2)} + w_{21}^{(3)} a_2^{(2)} + b_3^{(3)} \] So: \[ \frac{\partial z_1^{(3)}}{\partial a_1^{(2)}} = w_{11}^{(3)} \] Therefore: \[\frac{\partial L}{\partial a_1^{(2)}} = \delta^{(3)} \cdot w_{11}^{(3)}\]

Gradient of loss with respect to \(z_1^{(2)}\): \[\frac{\partial L}{\partial z_1^{(2)}} = \frac{\partial L}{\partial a_1^{(2)}} \cdot \frac{\partial a_1^{(2)}}{\partial z_1^{(2)}} = \delta^{(3)} \cdot w_{11}^{(3)} \cdot \sigma'(z_1^{(2)}) = \delta^{(2)}\]

Where: \[\sigma'(z_1^{(2)}) = \sigma(z_1^{(2)})(1 - \sigma(z_1^{(2)}))\]

Gradient of loss with respect to \(w_{11}^{(2)}\): \[\frac{\partial L}{\partial w_{11}^{(2)}} = \frac{\partial L}{\partial z_1^{(2)}} \cdot \frac{\partial z_1^{(2)}}{\partial w_{11}^{(2)}} = \delta^{(2)} \cdot a_1^{(1)}\] \[\frac{\partial L}{\partial w_{21}^{(2)}} = \delta^{(2)} \cdot a_2^{(1)}\] \[\frac{\partial L}{\partial w_{31}^{(2)}} = \delta^{(2)} \cdot a_3^{(1)}\] \[\frac{\partial L}{\partial b_{1}^{(2)}} = \delta^{(2)}\]
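The sigmoid-derivative identity \(\sigma'(z) = \sigma(z)(1 - \sigma(z))\) used throughout this derivation can be verified against a finite-difference approximation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-2.0, -0.5, 0.0, 1.0, 3.0])
analytic = sigmoid(z) * (1 - sigmoid(z))

# central finite difference: (f(z + eps) - f(z - eps)) / (2 * eps)
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)

assert np.allclose(analytic, numeric, atol=1e-9)
```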

Turning the above into code:

General layer backward pass: \[\frac{\partial L}{\partial z^{(l)}} = \frac{\partial L}{\partial a^{(l)}} \cdot a^{(l)}(1 - a^{(l)}) = dZ^{(l)} \]

\[ dW^{(l)} = A^{(l-1)T} \cdot dZ^{(l)} \] \[ db^{(l)} = \sum dZ^{(l)} \]

\[ \frac{\partial L}{\partial a^{(l-1)}} = dZ^{(l)} \cdot W^{(l)T} = dA_{prev} \]

Show Code
X = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
Y = np.array([10, 28, 40])

def initialize_weights(input_size, hidden_size):
    W = np.random.randn(input_size, hidden_size) * 0.01  # small random values
    b = np.zeros((1, hidden_size))  # initialize biases to zero
    return W, b

def forward_pass(w, b, x): # this is the forward pass for a single layer
    return np.dot(x, w) + b
Show Code
# get the output using the above functions for 2 hidden layers (first 3 neurons, then 2 neurons, then 1 output neuron)
w_1, b_1 = initialize_weights(3, 3)  # weights for layer 1
h_1 = forward_pass(w_1, b_1, X)  # hidden layer 1 activations

w_2, b_2 = initialize_weights(3, 2)  # weights for layer 2
h_2 = forward_pass(w_2, b_2, h_1)  # hidden layer 2 activations

w_3, b_3 = initialize_weights(2, 1)  # weights for output layer
y_hat = forward_pass(w_3, b_3, h_2)  # output layer activations

print("Output of hidden layer 1:\n", h_1)
print("Output of hidden layer 2:\n", h_2)
print("Output of output layer:\n", y_hat)
Output of hidden layer 1:
 [[-0.01928725  0.0253243  -0.02441398]
 [-0.04851654  0.06566209 -0.06542295]
 [-0.07774583  0.10599988 -0.10643192]]
Output of hidden layer 2:
 [[ 5.69294394e-05 -8.43031920e-04]
 [ 1.20687056e-04 -2.22565488e-03]
 [ 1.84444672e-04 -3.60827785e-03]]
Output of output layer:
 [[-3.27748402e-06]
 [-8.72778649e-06]
 [-1.41780890e-05]]
Show Code
# implement the backpropagation equations for the above 3-layer network:
def relu(x):
    return np.maximum(0, x)
def relu_derivative(x):
    return (x > 0).astype(float)

def backward_pass(dZ, dA, A, A_prev, W):
    """
    dZ: gradient of loss w.r.t. pre-activation (linear output) of current layer
    dA: gradient of loss w.r.t. activations of current layer
    A: activation of current layer
    A_prev: activation from previous layer
    W: weights of current layer
    """

    # Step 1: gradients
    dW = np.dot(A_prev.T, dZ) # A_prev.T will give the a_1 for all examples in the first row
    db = np.sum(dZ, axis=0, keepdims=True) # sum over all examples to get the bias gradient

    # Step 2: propagate backward
    dA_prev = np.dot(dZ, W.T)

    return dA_prev, dW, db
Show Code
#call backward pass for the output layer
dA3 = y_hat - Y.reshape(-1, 1)  # gradient of loss w.r.t. output activations (constant factor of 2 dropped)
dZ3 = dA3                       # linear output layer, so dZ equals dA
dA2, dW3, db3 = backward_pass(dZ3, dA3, y_hat, h_2, w_3)
print("Gradient w.r.t. output layer weights:\n", dW3)
print("Gradient w.r.t. output layer bias:\n", db3) 
print("Gradient w.r.t. hidden layer 2 activations (to be propagated back to hidden layer 2):\n", dA2)
Gradient w.r.t. output layer weights:
 [[-0.01132632]
 [ 0.21507984]]
Gradient w.r.t. output layer bias:
 [[-78.00002618]]
Gradient w.r.t. hidden layer 2 activations (to be propagated back to hidden layer 2):
 [[-0.02534    -0.04058855]
 [-0.07095199 -0.11364794]
 [-0.10135998 -0.1623542 ]]
Show Code
# He initialization
def initialize_weights(input_size, hidden_size):
    W = np.random.randn(input_size, hidden_size) * np.sqrt(2.0 / input_size)  # ✅ He init
    b = np.zeros((1, hidden_size))
    return W, b

Notice: since we skip the ReLU in the forward pass for the output layer, we must also skip the ReLU derivative in the backward pass.

Show Code
#simple training loop
network_arck = (3,5,1) # define the network architecture

# define weights and biases for each layer
def get_network_weights(network):
    weights = []
    for i in range(len(network) - 1):
        W, b = initialize_weights(network[i], network[i+1])
        weights.append((W,b))
    
    return weights
    

# get weights for the network
weights = get_network_weights(network_arck)
alpha = 0.001
losses = []
weight_history = [] 
grad_history = []
    
for i in range(100):
    # ---- forward pass ------
    activations = [X]
    A = X
    for idx, (W, b) in enumerate(weights):
        A = forward_pass(W, b, A)
        if idx < len(weights) - 1:  # ✅ ReLU on hidden layers only
            A = relu(A)
        activations.append(A)
    
    y_hat = activations[-1]
    
    # ----- loss ------
    loss = np.mean((y_hat - Y.reshape(-1, 1)) ** 2)
    losses.append(loss)
    
    if i == 0:
        print(f"Activations of first hidden layer:\n{activations[1]}")
    
    # ----- backward pass ------
    dA = y_hat - Y.reshape(-1,1)
    epoch_grads = {}
    
    
    for l in reversed(range(len(weights))):
        W, b  = weights[l]
        A = activations[l+1]
        A_prev = activations[l]
        
        if l < len(weights) - 1:  # ✅ ReLU derivative for hidden layers
            dZ = dA * relu_derivative(A)
        else:
            dZ = dA  # output layer, no activation function
        
        
        dA_prev, dW, db = backward_pass(dZ, dA, A, A_prev, W)
        epoch_grads[l] = np.mean(np.abs(dW))  # ✅ record gradient magnitude
        dA = dA_prev
        # update weights
        W -= alpha * dW
        b -= alpha * db
        weights[l] = (W, b)  # update the weights in the list
        
    
    if i % 10 == 0:
        print(f"Epoch {i}: Loss = {loss}")
        weight_history.append({l: (W.copy(), b.copy()) for l, (W, b) in enumerate(weights)})
        grad_history.append(epoch_grads)
    
Activations of first hidden layer:
[[ 3.37221711  0.          0.          0.          6.03112655]
 [ 7.01142676  0.          0.          0.         14.33448017]
 [10.65063641  0.          0.          0.         22.63783379]]
Epoch 0: Loss = 177.14630229031638
Epoch 10: Loss = 2.336859495246242
Epoch 20: Loss = 2.007355747623226
Epoch 30: Loss = 2.00403004917375
Epoch 40: Loss = 2.003657202476628
Epoch 50: Loss = 2.003341098355025
Epoch 60: Loss = 2.003052510150096
Epoch 70: Loss = 2.002788838094523
Epoch 80: Loss = 2.0025479301822084
Epoch 90: Loss = 2.0023278221129566
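Note the three columns of zeros in the first-hidden-layer activations above: those units never fire for any sample (the "dying ReLU" effect). A quick check, with the activations copied from the output:

```python
import numpy as np

A1 = np.array([[ 3.37221711, 0.0, 0.0, 0.0,  6.03112655],
               [ 7.01142676, 0.0, 0.0, 0.0, 14.33448017],
               [10.65063641, 0.0, 0.0, 0.0, 22.63783379]])

dead = np.all(A1 == 0, axis=0)   # True for units that are zero on every sample
print(dead)                      # [False  True  True  True False]
print(int(dead.sum()), "of", A1.shape[1], "hidden units are dead")
```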
Show Code
def predict(X, weights):
    """
    X: input data (n_samples, n_features)
    weights: list of (W, b) tuples for each layer
    """
    A = X
    for idx, (W, b) in enumerate(weights):
        A = forward_pass(W, b, A)
        if idx < len(weights) - 1:  # ✅ ReLU on hidden layers only, linear output
            A = relu(A)
    return A

# Usage
y_pred = predict(X, weights)
print("Predictions:\n", y_pred.flatten())
print("Targets:    ", Y)
Predictions:
 [11.07071144 26.01956646 40.96842148]
Targets:     [10 28 40]
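Plugging the printed predictions back into the training loss confirms the fit (values copied from the output above):

```python
import numpy as np

y_pred = np.array([11.07071144, 26.01956646, 40.96842148])
Y = np.array([10, 28, 40])

mse = np.mean((y_pred - Y) ** 2)
print(round(float(mse), 3))   # 2.002, in line with the final training losses above
```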
Show Code
# test using the class
import importlib
import model_functions
importlib.reload(model_functions)

from model_functions import MyDeepModel

X = np.array([[1, 2, 3], [7, 5, 3], [9, 6, 4]], dtype=float)
Y = np.array([10, 28, 40], dtype=float)

model = MyDeepModel(layer_sizes=[X.shape[1], 5, 1], alpha=0.001, epochs=1000)
model.fit(X, Y)

y_pred = model.predict(X)
print(f"Prediction: {y_pred}")
Epoch 0: Loss = 827.936695770421
Epoch 10: Loss = 3.5213098426782707
Epoch 20: Loss = 2.960573090533107
Epoch 30: Loss = 2.929184145669646
Epoch 40: Loss = 2.9021968361545465
Epoch 50: Loss = 2.8781677341821297
Epoch 60: Loss = 2.8561364855957607
Epoch 70: Loss = 2.835463747202823
Epoch 80: Loss = 2.8157236353571222
Epoch 90: Loss = 2.7966322215883355
Epoch 100: Loss = 2.7779999782903833
Epoch 110: Loss = 2.7597001510158363
Epoch 120: Loss = 2.7416477255069585
Epoch 130: Loss = 2.723785442882935
Epoch 140: Loss = 2.7060745030469278
Epoch 150: Loss = 2.6884883859241904
Epoch 160: Loss = 2.6710087456821583
Epoch 170: Loss = 2.6536226829035683
Epoch 180: Loss = 2.6363209325015733
Epoch 190: Loss = 2.6190966600739922
Epoch 200: Loss = 2.6019446624366545
Epoch 210: Loss = 2.584860836597919
Epoch 220: Loss = 2.56784182699165
Epoch 230: Loss = 2.5508847910649446
Epoch 240: Loss = 2.5339872434365787
Epoch 250: Loss = 2.5171469522101244
Epoch 260: Loss = 2.5003618699045984
Epoch 270: Loss = 2.4836300873628385
Epoch 280: Loss = 2.466949802912609
Epoch 290: Loss = 2.450319301655487
Epoch 300: Loss = 2.433736941483407
Epoch 310: Loss = 2.417201143568071
Epoch 320: Loss = 2.40071038582766
Epoch 330: Loss = 2.384263198379637
Epoch 340: Loss = 2.3678581603217874
Epoch 350: Loss = 2.351493897406296
Epoch 360: Loss = 2.335169080317146
Epoch 370: Loss = 2.3188824233597063
Epoch 380: Loss = 2.3026326834347763
Epoch 390: Loss = 2.286418659212698
Epoch 400: Loss = 2.270239190451083
Epoch 410: Loss = 2.254093157418428
Epoch 420: Loss = 2.237979480398442
Epoch 430: Loss = 2.221897119257763
Epoch 440: Loss = 2.205845073065554
Epoch 450: Loss = 2.1898223797565084
Epoch 460: Loss = 2.1738281158319155
Epoch 470: Loss = 2.157861396094122
Epoch 480: Loss = 2.141921373411506
Epoch 490: Loss = 2.126007238511179
Epoch 500: Loss = 2.110118219797664
Epoch 510: Loss = 2.094253583195368
Epoch 520: Loss = 2.0784126320137144
Epoch 530: Loss = 2.062594706833032
Epoch 540: Loss = 2.0467991854102245
Epoch 550: Loss = 2.031025482602543
Epoch 560: Loss = 2.015273050308442
Epoch 570: Loss = 1.9995413774240918
Epoch 580: Loss = 1.9838299898142084
Epoch 590: Loss = 1.9681384502962562
Epoch 600: Loss = 1.9524663586361886
Epoch 610: Loss = 1.9368133515551638
Epoch 620: Loss = 1.9211791027452076
Epoch 630: Loss = 1.9055633228931619
Epoch 640: Loss = 1.8899657597111386
Epoch 650: Loss = 1.8743861979724155
Epoch 660: Loss = 1.8588244595515089
Epoch 670: Loss = 1.8432804034665928
Epoch 680: Loss = 1.827753925923756
Epoch 690: Loss = 1.8122449603608024
Epoch 700: Loss = 1.7967534774898504
Epoch 710: Loss = 1.781279485337188
Epoch 720: Loss = 1.7658230292787884
Epoch 730: Loss = 1.7503841920703624
Epoch 740: Loss = 1.7349630938703668
Epoch 750: Loss = 1.719559892254459
Epoch 760: Loss = 1.7041747822203746
Epoch 770: Loss = 1.6888079961812226
Epoch 780: Loss = 1.67345980394627
Epoch 790: Loss = 1.6581305126875903
Epoch 800: Loss = 1.6428204668910726
Epoch 810: Loss = 1.627530048290524
Epoch 820: Loss = 1.6122596757834788
Epoch 830: Loss = 1.5970098053271975
Epoch 840: Loss = 1.5817809298135053
Epoch 850: Loss = 1.566573578921328
Epoch 860: Loss = 1.5513883189452358
Epoch 870: Loss = 1.5362257525991427
Epoch 880: Loss = 1.5210865187933391
Epoch 890: Loss = 1.5059712923842279
Epoch 900: Loss = 1.4908722004314787
Epoch 910: Loss = 1.475806757630216
Epoch 920: Loss = 1.4607675647762688
Epoch 930: Loss = 1.445755437489904
Epoch 940: Loss = 1.4307712256316167
Epoch 950: Loss = 1.4158161585212146
Epoch 960: Loss = 1.40089046527652
Epoch 970: Loss = 1.3859954378405455
Epoch 980: Loss = 1.3711320582855373
Epoch 990: Loss = 1.35630134036942
Prediction: [ 9.97039161 29.5843936  38.7698239 ]
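`MyDeepModel` itself lives in `model_functions` and is not reproduced in this post. The skeleton below is a hypothetical reconstruction: it matches the `fit`/`predict` interface used above (`layer_sizes`, `alpha`, `epochs`) and packages the same forward/backward logic as the training loop, but the real class may differ in detail.

```python
import numpy as np

class MyDeepModel:
    """Hypothetical skeleton matching the interface used above, not the real class."""

    def __init__(self, layer_sizes, alpha=0.001, epochs=1000):
        self.alpha, self.epochs = alpha, epochs
        self.weights = []
        for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
            W = np.random.randn(n_in, n_out) * np.sqrt(2.0 / n_in)  # He init
            self.weights.append((W, np.zeros((1, n_out))))

    def _forward(self, X):
        activations, A = [X], X
        for idx, (W, b) in enumerate(self.weights):
            A = A @ W + b
            if idx < len(self.weights) - 1:   # ReLU on hidden layers only
                A = np.maximum(0, A)
            activations.append(A)
        return activations

    def fit(self, X, Y):
        Y = Y.reshape(-1, 1)
        for _ in range(self.epochs):
            activations = self._forward(X)
            dA = activations[-1] - Y          # gradient of MSE w.r.t. the output
            for l in reversed(range(len(self.weights))):
                W, b = self.weights[l]
                A, A_prev = activations[l + 1], activations[l]
                dZ = dA * (A > 0) if l < len(self.weights) - 1 else dA
                dW = A_prev.T @ dZ
                db = dZ.sum(axis=0, keepdims=True)
                dA = dZ @ W.T                 # propagate to the previous layer
                self.weights[l] = (W - self.alpha * dW, b - self.alpha * db)
        return self

    def predict(self, X):
        return self._forward(X)[-1].flatten()

# Same data as above; a slightly smaller learning rate keeps this standalone run stable
np.random.seed(0)
X = np.array([[1, 2, 3], [7, 5, 3], [9, 6, 4]], dtype=float)
Y = np.array([10, 28, 40], dtype=float)
model = MyDeepModel(layer_sizes=[3, 5, 1], alpha=0.0005, epochs=2000).fit(X, Y)
print("Prediction:", model.predict(X))
```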
