# import
import numpy as np

We will start by implementing a simple RNN to predict the next character in the sequence “hello”. We will use one-hot encoding for the characters and a simple RNN cell to process the input sequence.
In the code below, we define the text (the single word “hello”), build the set of unique characters, and then create mappings from characters to indices and vice versa.
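A minimal sketch of that cell (assuming the characters are sorted, which matches the output that follows):

text = "hello"
chars = sorted(set(text))  # unique characters, in sorted order
char_to_idx = {c: i for i, c in enumerate(chars)}  # character -> index
idx_to_char = {i: c for i, c in enumerate(chars)}  # index -> character
print(chars)
print("char_to_idx:", char_to_idx)
print("idx_to_char:", idx_to_char)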
['e', 'h', 'l', 'o']
char_to_idx: {'e': 0, 'h': 1, 'l': 2, 'o': 3}
idx_to_char: {0: 'e', 1: 'h', 2: 'l', 3: 'o'}
Next, a helper that one-hot encodes a single character:

def one_hot_encode(char, char_to_idx):
    one_hot = np.zeros(len(char_to_idx))  # one slot per character, all zeros
    one_hot[char_to_idx[char]] = 1        # flip on the slot for this character
    return one_hot

# Example usage:
char = 'h'
one_hot_vector = one_hot_encode(char, char_to_idx)
print(f"One-hot encoding for '{char}': {one_hot_vector}")

One-hot encoding for 'h': [0. 1. 0. 0.]
h = np.zeros(3) # h_0: blank memory
inputs = [one_hot_encode(c, char_to_idx) for c in text]
W_xh = np.random.rand(3, len(chars)) # input to hidden weights
W_hh = np.random.rand(3, 3) # hidden to hidden weights
b_h = np.random.rand(3) # hidden bias
W_hy = np.random.rand(len(chars), 3) # hidden to output weights
b_y = np.random.rand(len(chars)) # output bias
for char, x in zip(text, inputs):
    h = np.tanh(W_xh @ x + W_hh @ h + b_h)  # new hidden state from current input + previous state
    y = W_hy @ h + b_y                      # output at this time step
    print(f"'{char}' h = {np.round(h, 4)}")

'h' h = [0.508 0.5977 0.7204]
'e' h = [0.8949 0.9406 0.8871]
'l' h = [0.9745 0.9789 0.9229]
'l' h = [0.9767 0.9821 0.9295]
'o' h = [0.9683 0.9786 0.9534]
The hidden state h is computed as follows:

\[h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1} + b_h)\]

Where:

- \(W_{xh}\): Weights for input to hidden layer
- \(W_{hh}\): Weights for hidden to hidden layer
- \(b_h\): Bias for hidden layer

We will also compute the output y at each time step using the hidden state:

\[y_t = W_{hy} h_t + b_y\]

Where:

- \(W_{hy}\): Weights for hidden to output layer
- \(b_y\): Bias for output layer
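Since \(h_0\) is all zeros, the \(W_{hh} h_{t-1}\) term vanishes at the first step, so \(h_1 = \tanh(W_{xh} x_1 + b_h)\). A quick check, reusing the weights defined above:

x_h = one_hot_encode('h', char_to_idx)
h_1 = np.tanh(W_xh @ x_h + b_h)  # W_hh @ h_0 drops out because h_0 = 0
print(np.round(h_1, 4))          # matches the 'h' row in the output above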
One more interesting thing to note about the loop output above is that the hidden state h takes different values for the same character 'l' at different time steps. See the 'l' rows in the output below:
'h' hidden_state = [0.508 0.5977 0.7204]
'e' hidden_state = [0.8949 0.9406 0.8871]
'l' hidden_state = [0.9745 0.9789 0.9229]
'l' hidden_state = [0.9767 0.9821 0.9295]
'o' hidden_state = [0.9683 0.9786 0.9534]
This is because the hidden state is influenced by both the current input and the previous hidden state, which allows the RNN to carry context from earlier in the sequence.
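We can verify this directly: the same 'l' input produces different hidden states when the previous hidden state differs (a small check, reusing the weights from above):

x_l = one_hot_encode('l', char_to_idx)
h_blank = np.tanh(W_xh @ x_l + W_hh @ np.zeros(3) + b_h)  # 'l' with no history
h_after = np.tanh(W_xh @ x_l + W_hh @ h_blank + b_h)      # 'l' right after another 'l'
print(np.round(h_blank, 4))
print(np.round(h_after, 4))  # differs from h_blank even though the input is identical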
We will implement Backpropagation Through Time (BPTT) to train our RNN. BPTT is a method for training RNNs by unrolling the network through time and applying backpropagation to compute gradients for all time steps.
For that, we need:

- input sequence (X)
- target sequence (Y)
The input is the first 4 characters of “hello” and the target is the next character for each input character. So we will have:

- X = [‘h’, ‘e’, ‘l’, ‘l’]
- Y = [‘e’, ‘l’, ‘l’, ‘o’]
input_chars = text[:-1]   # drop the final character
target_chars = text[1:]   # drop the first character
print("input_chars:", input_chars)
print("target_chars:", target_chars)

# Then convert the input characters to one-hot vectors:
input_vectors = [one_hot_encode(c, char_to_idx) for c in input_chars]
target_vectors = [char_to_idx[c] for c in target_chars]  # indices, not one-hot
print("input_vectors:")
for c, v in zip(input_chars, input_vectors):
    print(f"'{c}': {v}")
print("target_vectors:")
for c, v in zip(target_chars, target_vectors):
    print(f"'{c}': {v}")
input_chars: hell
target_chars: ello
input_vectors:
'h': [0. 1. 0. 0.]
'e': [1. 0. 0. 0.]
'l': [0. 0. 1. 0.]
'l': [0. 0. 1. 0.]
target_vectors:
'e': 0
'l': 2
'l': 2
'o': 3
Why did we not include the last character ‘o’ in the input sequence? Because it is the final character we want to predict, so it belongs only to the target sequence.

Why is the target sequence the next character for each input character? Because we want to train the RNN to predict the next character in the sequence, so the target for each input character is simply the character that follows it.

Why are the target vectors not one-hot encoded? Because we will use the cross-entropy loss function, and indexing the predicted probabilities with the target index is a shortcut for the same math.
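A small loop makes the pairing explicit; it produces the output below, matching each one-hot input with its target index:

for x, t_idx in zip(input_vectors, target_vectors):
    print(f"x: {x}, t_idx: {t_idx}")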
x: [0. 1. 0. 0.], t_idx: 0
x: [1. 0. 0. 0.], t_idx: 2
x: [0. 0. 1. 0.], t_idx: 2
x: [0. 0. 1. 0.], t_idx: 3
Below is the code for the forward pass and loss computation. We loop through each input-target pair, compute the hidden state h and the output y, and then compute the loss using cross-entropy.
- h_prevs: stores the previous hidden state for each time step.
- h_list: stores the current hidden state for each time step.
- p_list: stores the output probabilities for each time step.

The loss is computed using the cross-entropy loss function, which compares the predicted probabilities with the target indices, as follows:

\[\text{loss} = -\sum_{t} \log(p_t[\text{t\_idx}_t])\]

Where:

- \(p_t\): Predicted probabilities at time step t
- \(\text{t\_idx}_t\): Target index at time step t
# wrap this in a forward_pass function that returns the total loss and the stored values for the backward pass
hidden_size = 3
input_size = len(input_vectors[0])  # length of the one-hot vector
output_size = len(char_to_idx)      # number of unique characters (for output layer size)

# initialize weights and biases
W_xh = np.random.randn(hidden_size, input_size) * 0.1
W_hh = np.random.randn(hidden_size, hidden_size) * 0.1
b_h = np.zeros(hidden_size)
W_hy = np.random.randn(output_size, hidden_size) * 0.1
b_y = np.zeros(output_size)

def forward_pass(input_vectors, target_vectors):
    h = np.zeros(hidden_size)
    total_loss = 0  # store total loss for the sequence (1 epoch only)
    h_prevs, h_list, p_list = [], [], []  # store for backward pass
    for x, t_idx in zip(input_vectors, target_vectors):
        # --- Forward pass ---
        h_prevs.append(h.copy())  # h_{t-1}
        h = np.tanh(W_xh @ x + W_hh @ h + b_h)
        y = W_hy @ h + b_y
        # softmax and cross-entropy loss
        exp_y = np.exp(y - np.max(y))  # stable softmax
        p = exp_y / exp_y.sum()
        total_loss += -np.log(p[t_idx])  # cross-entropy
        h_list.append(h.copy())  # h_t
        p_list.append(p)  # softmax probs
    total_loss /= len(input_vectors)  # average loss per time step (for stability in training)
    return total_loss, h_prevs, h_list, p_list

input_size: 4
hidden_size: 3
output_size: 4
W_xh: (3, 4)
W_hh: (3, 3)
W_hy: (4, 3)
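As a quick sanity check before training (not one of the original cells, but using only the functions defined above): with 4 equally likely characters, an untrained model should report a loss near \(-\log(1/4) \approx 1.386\).

total_loss, h_prevs, h_list, p_list = forward_pass(input_vectors, target_vectors)
print(f"initial loss: {total_loss:.4f}")  # roughly 1.386 with near-uniform untrained outputs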
Now that we have stored the hidden states and the output probabilities for each time step, we can perform the backward pass to compute the gradients and update the weights. The backward pass computes the gradients of the loss with respect to the weights and biases, and then updates them using gradient descent.
# ============================================
# Backpropagation Through Time (BPTT)
# ============================================
def backward_pass(input_vectors, target_vectors, h_prevs, h_list, p_list):
    # --------------------------------------------------
    # Gradients for shared weights/biases
    # (accumulated across ALL timesteps)
    # --------------------------------------------------
    dW_xh = np.zeros_like(W_xh)
    dW_hh = np.zeros_like(W_hh)
    dW_hy = np.zeros_like(W_hy)
    db_h = np.zeros_like(b_h)
    db_y = np.zeros_like(b_y)

    # --------------------------------------------------
    # No future gradient at final timestep
    # dh(T+1) = 0
    # --------------------------------------------------
    dh_next = np.zeros(hidden_size)

    # --------------------------------------------------
    # Backward through time
    # T → 1
    # --------------------------------------------------
    for t in reversed(range(len(input_vectors))):
        # ==================================================
        # 1. OUTPUT ERROR
        #
        # dy(t) = p(t) - y_true(t)
        # ==================================================
        dy = p_list[t].copy()
        dy[target_vectors[t]] -= 1

        # ==================================================
        # 2. OUTPUT WEIGHT GRADIENT
        #
        # dL/dW_hy += dy(t) · h(t)^T
        # ==================================================
        dW_hy += np.outer(dy, h_list[t])
        # Bias gradient
        db_y += dy

        # ==================================================
        # 3. HIDDEN STATE ERROR
        #
        # dh(t) =
        #   current output error
        #   + future timestep error
        #
        # dh(t) = W_hy^T·dy(t) + dh_next
        # ==================================================
        dh = W_hy.T @ dy + dh_next

        # ==================================================
        # 4. BACKPROP THROUGH TANH
        #
        # dtanh(t) = dh(t) ⊙ (1 - h(t)^2)
        # ==================================================
        dtanh = dh * (1 - h_list[t]**2)

        # ==================================================
        # 5. INPUT → HIDDEN WEIGHT GRADIENT
        #
        # dL/dW_xh += dtanh(t) · x(t)^T
        # ==================================================
        dW_xh += np.outer(dtanh, input_vectors[t])

        # ==================================================
        # 6. HIDDEN → HIDDEN WEIGHT GRADIENT
        #
        # dL/dW_hh += dtanh(t) · h(t-1)^T
        # ==================================================
        dW_hh += np.outer(dtanh, h_prevs[t])
        # Bias gradient
        db_h += dtanh

        # ==================================================
        # 7. PASS ERROR TO PREVIOUS TIMESTEP
        #
        # dh_next = W_hh^T · dtanh(t)
        # ==================================================
        dh_next = W_hh.T @ dtanh

    return dW_xh, dW_hh, db_h, dW_hy, db_y

for epoch in range(1000):
    total_loss, h_prevs, h_list, p_list = forward_pass(input_vectors, target_vectors)
    dW_xh, dW_hh, db_h, dW_hy, db_y = backward_pass(input_vectors, target_vectors, h_prevs, h_list, p_list)
    # --- Update weights ---
    # (backward_pass accumulates gradients of the summed loss; the reported loss is
    # averaged, so the learning rate simply absorbs the constant factor)
    lr = 0.01
    for W, dW in [(W_xh, dW_xh), (W_hh, dW_hh), (W_hy, dW_hy)]:
        W -= lr * dW
    b_h -= lr * db_h
    b_y -= lr * db_y
    if epoch % 10 == 0:
        print(f"Epoch {epoch}, Loss: {total_loss:.4f}")

After training, the weights have learned the pattern in “hello”. To predict, we start from a seed character (here 'h') and generate one character at a time.
This is called autoregressive generation: the model feeds its own output back as the next input.
def predict(seed_char, n_chars=4):
    h = np.zeros(hidden_size)
    current_char = seed_char
    result = seed_char
    for _ in range(n_chars):
        x = one_hot_encode(current_char, char_to_idx)
        # forward pass (no loss needed)
        h = np.tanh(W_xh @ x + W_hh @ h + b_h)
        y = W_hy @ h + b_y
        exp_y = np.exp(y - np.max(y))
        p = exp_y / exp_y.sum()
        next_idx = np.argmax(p)  # pick most likely character
        next_char = idx_to_char[next_idx]
        result += next_char
        current_char = next_char  # feed prediction back as next input
    return result
print(predict('h'))  # should output: hello

hello
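One common variation (not used in the code above): instead of np.argmax, sample the next index from the softmax distribution, which makes generation stochastic and more varied:

# inside predict(), replace the argmax line with:
next_idx = np.random.choice(len(p), p=p)  # sample an index with probability given by the softmax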