The "Attention Is All You Need" paper, authored by Vaswani et al., introduced a groundbreaking architecture called the transformer. This architecture has played a pivotal role in the field of natural language processing (NLP) and has become the foundation for many state-of-the-art models, including GPT-2 and BERT. In this article, we'll explore the key components and concepts presented in the paper, and we'll provide accompanying code to illustrate the main ideas.
Introduction
The "Attention Is All You Need" paper revolutionized NLP by introducing the transformer, a novel neural network architecture that excelled in handling sequential data. Unlike previous models that relied on recurrent or convolutional layers, the transformer leveraged self-attention mechanisms to process input sequences. This led to significant improvements in efficiency and accuracy, making it a cornerstone in the development of modern NLP models.
Code Overview
To better understand the concepts from the paper, let's first examine a compact PyTorch implementation of a decoder-only, character-level transformer language model. Below, we provide an overview of the code, highlighting important components and their functions:
import torch
import torch.nn as nn
from torch.nn import functional as F
# Hyperparameters
batch_size = 64 # Number of sequences processed in parallel
block_size = 256 # Maximum context length for predictions
max_iters = 5000 # Maximum training iterations
eval_interval = 500 # Interval for evaluating loss
learning_rate = 3e-4 # Learning rate
device = 'cuda' if torch.cuda.is_available() else 'cpu' # Check if GPU is available
eval_iters = 200
n_embd = 384 # Embedding dimension
n_head = 6 # Number of attention heads
n_layer = 6 # Number of transformer layers
dropout = 0.2 # Dropout rate
# Set a random seed for reproducibility
torch.manual_seed(1337)
# Load the input text data
with open('7-tiny-shakespeare.txt', 'r', encoding='utf-8') as f:
    text = f.read()
# Create a vocabulary of unique characters in the text
chars = sorted(list(set(text)))
vocab_size = len(chars)
# Create mappings between characters and integers
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}
# Define encoder and decoder functions
encode = lambda s: [stoi[c] for c in s] # Encoder: string to list of integers
decode = lambda l: ''.join([itos[i] for i in l]) # Decoder: list of integers to string
....
This code sets up the hyperparameters, loads text data, creates character mappings, and defines encoder and decoder functions. It's essential to preprocess data and define these elements before constructing the transformer model.
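For example, a string can be round-tripped through these mappings, and the encoded corpus can then be split into training and validation sets. The split below is a small illustrative sketch rather than part of the excerpted code:
# Round-trip a sample string through the character-level codec defined above
sample = "hello there"
ids = encode(sample)              # list of integers, one per character
assert decode(ids) == sample      # decoding recovers the original string

# Illustrative next step: tensorize the corpus and hold out 10% for validation
data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9 * len(data))
train_data, val_data = data[:n], data[n:]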
Multi-Head Self-Attention
One of the key innovations in the transformer architecture is multi-head self-attention. This mechanism allows the model to attend to different parts of the input sequence simultaneously, with each attention head learning different relationships between tokens. Here's the code for a single attention head, the building block of multi-head attention:
# Define a single head of self-attention
class Head(nn.Module):
    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B, T, C = x.shape
        k = self.key(x)  # (B, T, hs)
        q = self.query(x)  # (B, T, hs)
        wei = q @ k.transpose(-2, -1) * k.shape[-1]**-0.5
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))
        wei = F.softmax(wei, dim=-1)
        wei = self.dropout(wei)
        v = self.value(x)  # (B, T, hs)
        out = wei @ v
        return out
....
In this code, we define a single head of self-attention, which includes key, query, and value linear transformations. The attention scores are computed as scaled dot products between queries and keys, and a lower-triangular (causal) mask prevents each position from attending to future tokens. The softmax-normalized weights are then used to average the value vectors, producing the attention output.
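To make the masking step concrete, here is a tiny standalone illustration of how the lower-triangular mask turns raw scores into causal attention weights (a sketch with a 3-token sequence, independent of the Head class):
T = 3
scores = torch.randn(T, T)                              # stand-in for q @ k^T scores
tril = torch.tril(torch.ones(T, T))
scores = scores.masked_fill(tril == 0, float('-inf'))   # hide future positions
weights = F.softmax(scores, dim=-1)
# Each row i now sums to 1 and puts zero weight on positions j > i,
# so token i attends only to itself and earlier tokens.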
Multi-Head Attention
The multi-head attention mechanism runs several attention heads in parallel and combines their outputs. This allows the model to capture various patterns and dependencies within the input data. Here's the code for multi-head attention:
# Define multi-head self-attention
class MultiHeadAttention(nn.Module):
    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(head_size * num_heads, n_embd)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        out = self.dropout(self.proj(out))
        return out
....
In this code, we create multiple attention heads and concatenate their outputs along the channel dimension. The concatenated output is then projected back to the embedding dimension with a linear layer.
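A quick shape check (illustrative, reusing the hyperparameters defined earlier) confirms that multi-head attention maps a (batch, time, n_embd) tensor to a tensor of the same shape:
mha = MultiHeadAttention(num_heads=n_head, head_size=n_embd // n_head)
x = torch.randn(4, block_size, n_embd)   # (B, T, C) dummy input
print(mha(x).shape)                       # torch.Size([4, 256, 384])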
Feedforward Neural Networks
The transformer also employs feedforward neural networks to process information after self-attention layers. These networks enable the model to capture complex patterns in the data. Here's the code for the feedforward layer:
# Define feedforward layer
class FeedForward(nn.Module):
    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)
....
This code defines a position-wise feedforward network consisting of two linear layers with a ReLU activation in between, followed by dropout. It further processes the information gathered by the self-attention layers.
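The inner layer expands each token's representation from n_embd to 4 * n_embd before projecting back, mirroring the paper's choice of an inner dimension four times the model dimension; with n_embd = 384 this is a 384 → 1536 → 384 transformation applied to every position independently. A small illustrative check:
ffwd = FeedForward(n_embd)
x = torch.randn(2, 8, n_embd)   # small dummy batch
print(ffwd(x).shape)            # torch.Size([2, 8, 384]); the shape is preserved per position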
Transformer Blocks
Transformer blocks are the fundamental building blocks of the model. They contain multi-head self-attention, feedforward layers, and layer normalization. Here's the code for a transformer block:
# Define a transformer block
class Block(nn.Module):
    def __init__(self, n_embd, n_head):
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedForward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))
        x = x + self.ffwd(self.ln2(x))
        return x
....
A transformer block combines multi-head self-attention with a position-wise feedforward network. Each sub-layer is wrapped in a residual connection, and layer normalization is applied before each sub-layer (a pre-norm variant of the arrangement described in the original paper).
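Written out step by step, the residual stream looks like this (an illustrative sketch equivalent to Block.forward, using the classes defined above):
block = Block(n_embd, n_head=n_head)
x = torch.randn(2, 8, n_embd)
y = x + block.sa(block.ln1(x))     # attention sub-layer with residual connection
y = y + block.ffwd(block.ln2(y))   # feedforward sub-layer with residual connection
print(y.shape)                     # torch.Size([2, 8, 384]); blocks can therefore be stacked freely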
GPT Language Model
The GPT (Generative Pre-trained Transformer) language model is built using multiple transformer blocks. It includes token and positional embeddings, transformer layers, layer normalization, and a linear head for predicting the next token. Here's the code for the GPT language model:
# Define the GPT Language Model
class GPTLanguageModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd)  # final layer norm
        self.lm_head = nn.Linear(n_embd, vocab_size)
        self.apply(self._init_weights)

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)

    def forward(self, idx, targets=None):
        B, T = idx.shape
        tok_emb = self.token_embedding_table(idx)  # (B, T, n_embd)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device))  # (T, n_embd)
        x = tok_emb + pos_emb  # add positional information to token embeddings
        x = self.blocks(x)
        x = self.ln_f(x)
        logits = self.lm_head(x)  # (B, T, vocab_size)
        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B * T, C)
            targets = targets.view(B * T)
            loss = F.cross_entropy(logits, targets)
        return logits, loss

    def generate(self, idx, max_new_tokens):
        for _ in range(max_new_tokens):
            idx_cond = idx[:, -block_size:]  # crop context to the last block_size tokens
            logits, loss = self(idx_cond)
            logits = logits[:, -1, :]  # focus on the last time step
            probs = F.softmax(logits, dim=-1)
            idx_next = torch.multinomial(probs, num_samples=1)  # sample the next token
            idx = torch.cat((idx, idx_next), dim=1)  # append it to the running sequence
        return idx
This code defines the GPT language model, including token and positional embeddings, the stack of transformer blocks, and the linear head that predicts a distribution over the next token. It also includes a generate method that autoregressively samples new tokens, repeatedly cropping the context to the last block_size tokens.
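The excerpt above omits the training loop. A minimal sketch of how such a model is typically trained and sampled is shown below; note that get_batch is an assumed helper (not part of the code shown here) that would return a random batch of input and target index tensors of shape (batch_size, block_size):
model = GPTLanguageModel().to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for step in range(max_iters):
    xb, yb = get_batch('train')            # assumed data-loading helper
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

# Sample from the trained model, starting from a single zero token
context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(decode(model.generate(context, max_new_tokens=200)[0].tolist()))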
Conclusion
The "Attention Is All You Need" paper introduced the transformer architecture, which has since become a cornerstone in NLP research. This article provided a high-level overview of the paper's concepts and accompanying code to illustrate the key components of a transformer-based model. Understanding these components is crucial for working with and building upon state-of-the-art NLP models like GPT-2 and BERT. As NLP research continues to advance, transformers remain at the forefront of innovation, driving progress in language understanding and generation.