Building a simple RNN with PyTorch

# import 
import numpy as np

Manual RNN Implementation

We will start by implementing a simple RNN to predict the next character in the sequence “hello”. We will use one-hot encoding for the characters and a simple RNN cell to process the input sequence.

In the code below, we define the text (the single word “hello”), build the set of unique characters, and then create mappings from characters to indices and vice versa.

text = "hello"
chars = sorted(set(text))
print(chars)

# Then we create a mapping from characters to indices and vice versa:
char_to_idx = {c: i for i, c in enumerate(chars)}
idx_to_char = {i: c for i, c in enumerate(chars)}

print("char_to_idx:", char_to_idx)
print("idx_to_char:", idx_to_char)
['e', 'h', 'l', 'o']
char_to_idx: {'e': 0, 'h': 1, 'l': 2, 'o': 3}
idx_to_char: {0: 'e', 1: 'h', 2: 'l', 3: 'o'}
# one-hot encoding
def one_hot_encode(char, char_to_idx):
    one_hot = np.zeros(len(char_to_idx))
    one_hot[char_to_idx[char]] = 1
    return one_hot

# Example usage:
char = 'h'
one_hot_vector = one_hot_encode(char, char_to_idx)
print(f"One-hot encoding for '{char}': {one_hot_vector}")
One-hot encoding for 'h': [0. 1. 0. 0.]
h = np.zeros(3)   # h_0: blank memory
inputs = [one_hot_encode(c, char_to_idx) for c in text]

W_xh = np.random.rand(3, len(chars))  # input to hidden weights
W_hh = np.random.rand(3, 3)           # hidden to hidden weights
b_h = np.random.rand(3)              # hidden bias
W_hy = np.random.rand(len(chars), 3) # hidden to output weights
b_y = np.random.rand(len(chars))     # output bias


for char, x in zip(text, inputs):
    h = np.tanh(W_xh @ x  +  W_hh @ h  +  b_h)
    y = W_hy @ h + b_y
    print(f"'{char}'  h = {np.round(h, 4)}")
'h'  h = [0.508  0.5977 0.7204]
'e'  h = [0.8949 0.9406 0.8871]
'l'  h = [0.9745 0.9789 0.9229]
'l'  h = [0.9767 0.9821 0.9295]
'o'  h = [0.9683 0.9786 0.9534]

The hidden state h at each time step is computed as:

\[h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1} + b_h)\]

Where:

  • \(W_{xh}\): weights from the input to the hidden layer
  • \(W_{hh}\): weights from the previous hidden state to the hidden layer
  • \(b_h\): bias for the hidden layer

We also compute the output y at each time step from the hidden state:

\[y_t = W_{hy} h_t + b_y\]

Where:

  • \(W_{hy}\): weights from the hidden layer to the output layer
  • \(b_y\): bias for the output layer
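As a quick dimension check for this toy setup (hidden size 3, vocabulary size 4), the shapes in these two equations work out as:

\[
\underbrace{h_t}_{3 \times 1} = \tanh\big(\underbrace{W_{xh}}_{3 \times 4}\,\underbrace{x_t}_{4 \times 1} + \underbrace{W_{hh}}_{3 \times 3}\,\underbrace{h_{t-1}}_{3 \times 1} + \underbrace{b_h}_{3 \times 1}\big), \qquad \underbrace{y_t}_{4 \times 1} = \underbrace{W_{hy}}_{4 \times 3}\,\underbrace{h_t}_{3 \times 1} + \underbrace{b_y}_{4 \times 1}
\]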

One more interesting thing to note about the output of the previous cell is that the hidden state h takes different values for the same character ‘l’ at different time steps. See the ‘l’ rows in the output below:

'h'  hidden_state = [0.508  0.5977 0.7204]
'e'  hidden_state = [0.8949 0.9406 0.8871]
'l'  hidden_state = [0.9745 0.9789 0.9229]
'l'  hidden_state = [0.9767 0.9821 0.9295]
'o'  hidden_state = [0.9683 0.9786 0.9534]

This is because the hidden state is influenced by both the previous hidden state and the current input, which is what allows the RNN to carry context from earlier in the sequence.
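To make this concrete, here is a minimal sketch (not part of the original notebook) that reuses the weights from the cell above and feeds the same one-hot vector for ‘l’ into the cell twice: once starting from a blank hidden state, and once starting from the hidden state obtained after seeing “he”. The two resulting hidden states differ, because the previous hidden state carries the earlier context:

# Minimal sketch (assumes W_xh, W_hh, b_h and one_hot_encode from the cells above):
# the same input 'l' yields different hidden states depending on the previous hidden state.
x_l = one_hot_encode('l', char_to_idx)

h_blank = np.zeros(3)                                 # no context
h_after_he = np.tanh(W_xh @ one_hot_encode('h', char_to_idx) + W_hh @ h_blank + b_h)
h_after_he = np.tanh(W_xh @ one_hot_encode('e', char_to_idx) + W_hh @ h_after_he + b_h)

print(np.tanh(W_xh @ x_l + W_hh @ h_blank + b_h))     # 'l' with no context
print(np.tanh(W_xh @ x_l + W_hh @ h_after_he + b_h))  # 'l' after seeing "he"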

Backpropagation Through Time (BPTT)

We will implement Backpropagation Through Time (BPTT) to train our RNN. BPTT is a method for training RNNs by unrolling the network through time and applying backpropagation to compute gradients for all time steps.
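For reference, applying the chain rule across the unrolled time steps gives the gradient of the loss with respect to the shared recurrent weights:

\[
\frac{\partial\,\text{loss}}{\partial W_{hh}} = \sum_{t=1}^{T} \sum_{k=1}^{t} \frac{\partial\,\text{loss}_t}{\partial h_t} \left( \prod_{j=k+1}^{t} \frac{\partial h_j}{\partial h_{j-1}} \right) \frac{\partial h_k}{\partial W_{hh}}
\]

The product of Jacobians \(\prod_j \partial h_j / \partial h_{j-1}\) is what the backward-pass code below accumulates step by step through dh_next, and it is also the term that can vanish or explode over long sequences.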

For that, we need:

  • an input sequence (X)
  • a target sequence (Y)

The input is the first 4 characters of “hello” and the target is the next character for each input character. So, we will have:

  • X = [‘h’, ‘e’, ‘l’, ‘l’]
  • Y = [‘e’, ‘l’, ‘l’, ‘o’]

input_chars  = text[:-1] 
target_chars = text[1:]
print("input_chars:", input_chars)
print("target_chars:", target_chars)


# Then convert the input characters to one-hot vectors:
input_vectors = [one_hot_encode(c, char_to_idx) for c in input_chars]
target_vectors = [char_to_idx[c] for c in target_chars]
print("input_vectors:")
for c, v in zip(input_chars, input_vectors):    
    print(f"'{c}': {v}")
print("target_vectors:")
for c, v in zip(target_chars, target_vectors):    
    print(f"'{c}': {v}")    
    
input_chars: hell
target_chars: ello
input_vectors:
'h': [0. 1. 0. 0.]
'e': [1. 0. 0. 0.]
'l': [0. 0. 1. 0.]
'l': [0. 0. 1. 0.]
target_vectors:
'e': 0
'l': 2
'l': 2
'o': 3

Why did we not include the last character ‘o’ in the input sequence? Because it is the final character we want to predict; there is nothing after it, so it appears only in the target sequence.

Why is the target for each input character the next character in the sequence? Because we want to train the RNN to predict the next character, so each input character is paired with the character that follows it.

Why are the target vectors not one-hot encoded? Because we will use the cross-entropy loss, and indexing the predicted probabilities by the target index is a shortcut for the same math as multiplying by a one-hot target vector.
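As a quick illustration of that shortcut, here is a small sketch (with made-up probabilities, not from the notebook) showing that indexing with the target class and multiplying by a one-hot target give the same cross-entropy value:

# Hypothetical probabilities for one time step, just to illustrate the equivalence
p = np.array([0.1, 0.2, 0.6, 0.1])        # predicted probabilities over ['e', 'h', 'l', 'o']
t_idx = 2                                 # target index for 'l'

one_hot_t = np.zeros(len(p))
one_hot_t[t_idx] = 1

print(-np.log(p[t_idx]))                  # index form
print(-np.sum(one_hot_t * np.log(p)))     # one-hot form, same value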

for x, t_idx in zip(input_vectors, target_vectors):
    print(f"x: {x}, t_idx: {t_idx}")
x: [0. 1. 0. 0.], t_idx: 0
x: [1. 0. 0. 0.], t_idx: 2
x: [0. 0. 1. 0.], t_idx: 2
x: [0. 0. 1. 0.], t_idx: 3

Below is the code for the forward pass and loss computation. We loop through each input-target pair, compute the hidden state h and the output y, and then compute the loss using cross-entropy.

  • h_prevs: stores the previous hidden state \(h_{t-1}\) for each time step.
  • h_list: stores the current hidden state \(h_t\) for each time step.
  • p_list: stores the output (softmax) probabilities for each time step.

The loss is computed using the cross-entropy loss function, which compares the predicted probabilities with the target indices (in the code below it is then averaged over the time steps):

\[\text{loss} = -\sum_{t} \log\big(p_t[\text{target}_t]\big)\]

Where:

  • \(p_t\): predicted probabilities at time step t
  • \(\text{target}_t\): target index at time step t

# wrap this in a forward_pass function that returns the total loss and the stored values for the backward pass

hidden_size = 3
input_size = len(input_vectors[0]) # length of the one-hot vector
output_size = len(char_to_idx) # number of unique characters (for output layer size)

#initialize weights and biases
W_xh = np.random.randn(hidden_size, input_size) * 0.1
W_hh = np.random.randn(hidden_size, hidden_size) * 0.1
b_h = np.zeros(hidden_size)
W_hy = np.random.randn(output_size, hidden_size) * 0.1
b_y = np.zeros(output_size)


def forward_pass(input_vectors, target_vectors):
    h = np.zeros(hidden_size)
    total_loss = 0 # store total loss for the sequence (1 epoch only)
    h_prevs, h_list, p_list = [], [], []   # store for backward pass

    for x, t_idx in zip(input_vectors, target_vectors):
        # --- Forward pass ---
        h_prevs.append(h.copy())           # h_{t-1}
        h = np.tanh(W_xh @ x + W_hh @ h + b_h)
        y = W_hy @ h + b_y

        # softmax and cross-entropy loss
        exp_y = np.exp(y - np.max(y))      # stable softmax
        p = exp_y / exp_y.sum()

        total_loss += -np.log(p[t_idx])    # cross-entropy
        
        h_list.append(h.copy())            # h_t
        p_list.append(p)                   # softmax probs
    
    total_loss /= len(input_vectors) # average loss per time step (For stability in training)
        
    return total_loss, h_prevs, h_list, p_list
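
As a quick sanity check on the untrained network (not part of the original notebook): with the small random initialization, the output probabilities start out close to uniform over the 4 characters, so the average loss should be near \(-\log(1/4) \approx 1.386\):

# Quick check: the initial average cross-entropy should be close to -log(1/4)
loss0, _, _, _ = forward_pass(input_vectors, target_vectors)
print(f"initial loss: {loss0:.4f}, -log(1/4) = {-np.log(1/4):.4f}")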
print("input_size:", input_size)     # should be 4
print("hidden_size:", hidden_size)   # should be 3
print("output_size:", output_size)   # should be 4

print("W_xh:", W_xh.shape)  # (3, 4)
print("W_hh:", W_hh.shape)  # (3, 3)
print("W_hy:", W_hy.shape)  # (4, 3)
input_size: 4
hidden_size: 3
output_size: 4
W_xh: (3, 4)
W_hh: (3, 3)
W_hy: (4, 3)

Now that we have stored the hidden states and the output probabilities for each time step, we can perform the backward pass: compute the gradients of the loss with respect to the weights and biases, and then update them using gradient descent.

# ============================================
# Backpropagation Through Time (BPTT)
# ============================================

def backward_pass(input_vectors, target_vectors, h_prevs, h_list, p_list):

    # --------------------------------------------------
    # Gradients for shared weights/biases
    # (accumulated across ALL timesteps)
    # --------------------------------------------------
    dW_xh = np.zeros_like(W_xh)
    dW_hh = np.zeros_like(W_hh)
    dW_hy = np.zeros_like(W_hy)

    db_h  = np.zeros_like(b_h)
    db_y  = np.zeros_like(b_y)

    # --------------------------------------------------
    # No future gradient at final timestep
    # dh(T+1) = 0
    # --------------------------------------------------
    dh_next = np.zeros(hidden_size)

    # --------------------------------------------------
    # Backward through time
    # T → 1
    # --------------------------------------------------
    for t in reversed(range(len(input_vectors))):

        # ==================================================
        # 1. OUTPUT ERROR
        #
        # dy(t) = p(t) - y_true(t)
        # ==================================================
        dy = p_list[t].copy()
        dy[target_vectors[t]] -= 1


        # ==================================================
        # 2. OUTPUT WEIGHT GRADIENT
        #
        # dL/dW_hy += dy(t) · h(t)^T
        # ==================================================
        dW_hy += np.outer(dy, h_list[t])

        # Bias gradient
        db_y += dy


        # ==================================================
        # 3. HIDDEN STATE ERROR
        #
        # dh(t) =
        #     current output error
        #   + future timestep error
        #
        # dh(t) = W_hy^T·dy(t) + dh_next
        # ==================================================
        dh = W_hy.T @ dy + dh_next


        # ==================================================
        # 4. BACKPROP THROUGH TANH
        #
        # dtanh(t) = dh(t) ⊙ (1 - h(t)^2)
        # ==================================================
        dtanh = dh * (1 - h_list[t]**2)


        # ==================================================
        # 5. INPUT → HIDDEN WEIGHT GRADIENT
        #
        # dL/dW_xh += dtanh(t) · x(t)^T
        # ==================================================
        dW_xh += np.outer(dtanh, input_vectors[t])


        # ==================================================
        # 6. HIDDEN → HIDDEN WEIGHT GRADIENT
        #
        # dL/dW_hh += dtanh(t) · h(t-1)^T
        # ==================================================
        dW_hh += np.outer(dtanh, h_prevs[t])

        # Bias gradient
        db_h += dtanh


        # ==================================================
        # 7. PASS ERROR TO PREVIOUS TIMESTEP
        #
        # dh_next = W_hh^T · dtanh(t)
        # ==================================================
        dh_next = W_hh.T @ dtanh


    return dW_xh, dW_hh, db_h, dW_hy, db_y
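
Before training, we can run a quick numerical sanity check on the gradients (this check is not part of the original notebook). One detail to keep in mind: forward_pass returns the loss averaged over the time steps, while backward_pass accumulates gradients of the summed loss, so the finite-difference estimate below is taken on the summed loss:

# Hypothetical gradient check: compare one entry of the analytic dW_hy with a
# centered finite-difference estimate of the summed cross-entropy loss.
eps = 1e-5
T = len(input_vectors)

_, h_prevs, h_list, p_list = forward_pass(input_vectors, target_vectors)
dW_xh, dW_hh, db_h, dW_hy, db_y = backward_pass(input_vectors, target_vectors, h_prevs, h_list, p_list)

i, j = 0, 0                                   # which entry of W_hy to check
orig = W_hy[i, j]

W_hy[i, j] = orig + eps
loss_plus = forward_pass(input_vectors, target_vectors)[0] * T
W_hy[i, j] = orig - eps
loss_minus = forward_pass(input_vectors, target_vectors)[0] * T
W_hy[i, j] = orig                             # restore the weight

numeric = (loss_plus - loss_minus) / (2 * eps)
print(f"analytic: {dW_hy[i, j]:.6f}   numeric: {numeric:.6f}")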
With the forward and backward passes defined, we train for 1000 epochs with plain gradient descent:

for epoch in range(1000):
    total_loss, h_prevs, h_list, p_list = forward_pass(input_vectors, target_vectors)
    dW_xh, dW_hh, db_h, dW_hy, db_y = backward_pass(input_vectors, target_vectors, h_prevs, h_list, p_list)

    # --- Update weights ---
    lr = 0.01
    for W, dW in [(W_xh, dW_xh), (W_hh, dW_hh), (W_hy, dW_hy)]:
        W -= lr * dW
    b_h -= lr * db_h
    b_y -= lr * db_y

    if epoch % 10 == 0:
        print(f"Epoch {epoch}, Loss: {total_loss:.4f}")

Prediction

After training, the weights have learned the pattern in “hello”. To predict, we:

  1. Give the model a seed character (e.g. 'h')
  2. Run a forward pass → get probabilities over the 4 characters
  3. Pick the most likely next character (argmax)
  4. Feed that predicted character back as the next input
  5. Repeat

This is called autoregressive generation — the model feeds its own output back as the next input.

def predict(seed_char, n_chars=4):
    h = np.zeros(hidden_size)
    current_char = seed_char
    result = seed_char

    for _ in range(n_chars):
        x = one_hot_encode(current_char, char_to_idx)

        # forward pass (no loss needed)
        h = np.tanh(W_xh @ x + W_hh @ h + b_h)
        y = W_hy @ h + b_y

        exp_y = np.exp(y - np.max(y))
        p = exp_y / exp_y.sum()

        next_idx = np.argmax(p)           # pick most likely character
        next_char = idx_to_char[next_idx]

        result += next_char
        current_char = next_char          # feed prediction back as next input

    return result

print(predict('h'))   # should output: hello
hello
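
A small optional variation (not in the original notebook): instead of always taking the argmax, the generation loop can sample the next character from the predicted distribution p. This is a common choice when generating from larger character models because it produces more varied output:

# Hypothetical variant of predict(): sample the next character from p instead of argmax
def predict_sampled(seed_char, n_chars=4):
    h = np.zeros(hidden_size)
    current_char = seed_char
    result = seed_char

    for _ in range(n_chars):
        x = one_hot_encode(current_char, char_to_idx)
        h = np.tanh(W_xh @ x + W_hh @ h + b_h)
        y = W_hy @ h + b_y

        exp_y = np.exp(y - np.max(y))
        p = exp_y / exp_y.sum()

        next_idx = np.random.choice(len(p), p=p)   # sample instead of argmax
        next_char = idx_to_char[next_idx]

        result += next_char
        current_char = next_char

    return result

print(predict_sampled('h'))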
