Transformer Models: BERT, GPT & Beyond

Transformer models now power 90% of state-of-the-art NLP systems (Google Research 2024). This tutorial explores the architecture, implementation, and applications of BERT, GPT, and their variants.

[Figure: Transformer Model Adoption (2024)]
1. Transformer Architecture
Core Components:
- Self-Attention: Context-aware word relationships
- Multi-Head Attention: Parallel attention mechanisms
- Positional Encoding: Captures word order (see the sketch after this list)
- Encoder-Decoder: BERT (encoder), GPT (decoder)
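As a brief sketch of the positional-encoding idea (the sinusoidal form from the original Transformer paper; the helper function below is illustrative, not from this tutorial's codebase), each position gets a fixed pattern of sines and cosines that is added to the token embeddings:

import math
import torch

def sinusoidal_positions(seq_len, d_model):
    # PE[pos, 2i] = sin(pos / 10000^(2i/d_model)),  PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))
    pe = torch.zeros(seq_len, d_model)
    position = torch.arange(seq_len, dtype=torch.float).unsqueeze(1)
    div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe  # shape (seq_len, d_model); added to the token embeddings before the first layer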
Key Equations:
Attention(Q, K, V) = softmax(QKᵀ / √dₖ) V

MultiHead(Q, K, V) = Concat(head₁, ..., headₕ) W^O, where headᵢ = Attention(Q Wᵢ^Q, K Wᵢ^K, V Wᵢ^V)
PyTorch Implementation:
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    def __init__(self, embed_size, heads):
        super().__init__()
        assert embed_size % heads == 0, "embed_size must be divisible by heads"
        self.embed_size = embed_size
        self.heads = heads
        self.head_dim = embed_size // heads
        self.values = nn.Linear(embed_size, embed_size)
        self.keys = nn.Linear(embed_size, embed_size)
        self.queries = nn.Linear(embed_size, embed_size)
        self.fc_out = nn.Linear(embed_size, embed_size)

    def forward(self, values, keys, query, mask):
        N = query.shape[0]
        value_len, key_len, query_len = values.shape[1], keys.shape[1], query.shape[1]
        # Project, then split the embedding into `heads` separate heads
        values = self.values(values).view(N, value_len, self.heads, self.head_dim)
        keys = self.keys(keys).view(N, key_len, self.heads, self.head_dim)
        queries = self.queries(query).view(N, query_len, self.heads, self.head_dim)
        # Attention scores per head: shape (N, heads, query_len, key_len)
        energy = torch.einsum("nqhd,nkhd->nhqk", [queries, keys])
        if mask is not None:
            energy = energy.masked_fill(mask == 0, float("-1e20"))
        # Scale by √d_k (the per-head dimension), then normalize over the key axis
        attention = torch.softmax(energy / (self.head_dim ** 0.5), dim=3)
        # Weighted sum of values, then merge the heads back into one embedding
        out = torch.einsum("nhqk,nkhd->nqhd", [attention, values])
        out = out.reshape(N, query_len, self.embed_size)
        return self.fc_out(out)
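As a quick sanity check (the tensor sizes here are arbitrary), the module maps a (batch, seq_len, embed_size) input back to the same shape:

attn = SelfAttention(embed_size=256, heads=8)
x = torch.randn(2, 10, 256)        # hypothetical batch of 2 sequences of length 10
out = attn(x, x, x, mask=None)     # self-attention: values, keys, and queries are the same tensor
print(out.shape)                   # torch.Size([2, 10, 256])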
2. BERT & Encoder Models
Key Features:
Feature | Description | Impact |
---|---|---|
Masked LM | Predict hidden words | Bidirectional context |
Next Sentence Prediction | Sentence relationship | Better discourse |
WordPiece Tokenization | Subword units | Handles rare words |
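To see the WordPiece behavior from the table in practice, here is a minimal sketch; out-of-vocabulary words are split into '##'-prefixed subword pieces:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
print(tokenizer.tokenize("I have a new GPU!"))
# A rare word like 'GPU' is typically split into subword pieces such as ['gp', '##u']
# with the bert-base-uncased vocabulary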
HuggingFace Implementation:
from transformers import BertTokenizer, BertForSequenceClassification
import torch

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# Note: the classification head is randomly initialized until you fine-tune on labeled data
model = BertForSequenceClassification.from_pretrained('bert-base-uncased')

inputs = tokenizer("Hello world!", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
logits = outputs.logits
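To turn the logits into a prediction, apply a softmax over the class dimension and take the argmax (with the untuned head above, the result is only meaningful after fine-tuning):

probs = torch.softmax(logits, dim=-1)      # per-class probabilities
predicted_class = probs.argmax(dim=-1)     # index of the highest-probability class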
Variants:
RoBERTa (optimized pretraining), DistilBERT (smaller), ALBERT (parameter efficiency)
3. GPT & Decoder Models
Key Features:
- Autoregressive: Predicts the next token in the sequence, which enables open-ended text generation
- Causal Attention: Masked self-attention stops each position from attending to future tokens (see the mask sketch below)
- Scaling Laws: More data and parameters → better performance, which drives ever-larger models
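As an illustrative sketch (reusing the SelfAttention module from Section 1, and assuming the query and key lengths match the mask size), a causal mask is simply a lower-triangular matrix: position i may only attend to positions ≤ i.

import torch

seq_len = 5
causal_mask = torch.tril(torch.ones(seq_len, seq_len))  # 1s on and below the diagonal
# Passed as `mask` to the SelfAttention module above, scores where mask == 0
# (the future positions) are filled with -1e20 before the softmax, so they get ~0 weight.
print(causal_mask)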
Text Generation Example:
from transformers import GPT2Tokenizer, GPT2LMHeadModel
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')
input_ids = tokenizer.encode("The future of AI is", return_tensors='pt')
output = model.generate(input_ids, max_length=50, num_return_sequences=1)
print(tokenizer.decode(output[0], skip_special_tokens=True))
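By default, generate() decodes greedily; a common alternative (the parameter values below are illustrative, not prescriptive) is to sample with temperature and top-k/top-p filtering:

output = model.generate(
    input_ids,
    max_length=50,
    do_sample=True,     # sample from the distribution instead of greedy decoding
    top_k=50,           # consider only the 50 most likely next tokens
    top_p=0.95,         # nucleus sampling: smallest token set with cumulative probability 0.95
    temperature=0.8,    # <1 sharpens the distribution, >1 flattens it
)
print(tokenizer.decode(output[0], skip_special_tokens=True))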
Transformer Model Comparison
Model | Type | Parameters | Best For |
---|---|---|---|
BERT | Encoder | 110M-340M | Classification, QA |
GPT-4 | Decoder | Undisclosed | Text generation |
T5 | Encoder-Decoder | 11B | Text-to-text tasks |
BART | Encoder-Decoder | 400M | Summarization |
4. Practical Applications
Implementation Techniques:
- Fine-tuning: Adapt pretrained models to a downstream task (see the sketch after this list)
- Prompt Engineering: Craft effective inputs for zero-shot learning
- Model Distillation: Create smaller, faster models
- Quantization: Reduce model size for deployment
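As a minimal fine-tuning sketch (assuming the Hugging Face datasets library and the GLUE SST-2 sentiment dataset; swap in your own data and hyperparameters as needed):

from datasets import load_dataset
from transformers import (BertTokenizer, BertForSequenceClassification,
                          Trainer, TrainingArguments)

dataset = load_dataset("glue", "sst2")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

def tokenize(batch):
    return tokenizer(batch["sentence"], truncation=True, padding="max_length", max_length=128)

encoded = dataset.map(tokenize, batched=True)

args = TrainingArguments(output_dir="bert-sst2", num_train_epochs=1,
                         per_device_train_batch_size=16)
trainer = Trainer(model=model, args=args,
                  train_dataset=encoded["train"],
                  eval_dataset=encoded["validation"])
trainer.train()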
Real-World Use Cases:
Text Classification
# Sentiment analysis with BERT
from transformers import pipeline
classifier = pipeline("sentiment-analysis",
                      model="distilbert-base-uncased-finetuned-sst-2-english")
classifier("This tutorial is incredibly helpful!")
Named Entity Recognition
# NER with spaCy's transformer pipeline
# (requires: pip install spacy-transformers && python -m spacy download en_core_web_trf)
import spacy
nlp = spacy.load("en_core_web_trf")
doc = nlp("Apple reached a $3T market cap in 2023")
print([(ent.text, ent.label_) for ent in doc.ents])
Text Generation
# Creative writing with GPT
generator = pipeline("text-generation", model="gpt2")
output = generator("In a future where AI governs humanity,",
                   max_length=100,
                   num_return_sequences=1)
5. Advanced Topics & Future Directions
Emerging Architectures:
Model | Innovation | Advantage |
---|---|---|
Switch Transformers | Mixture-of-Experts | Efficient scaling |
Perceiver IO | Cross-modal attention | Handles any input |
RetNet | Retentive networks | Parallel + recurrent |
Optimization Techniques:
- Flash Attention: Faster attention computation
- LoRA: Low-rank adaptation for efficient fine-tuning (see the sketch after this list)
- 8-bit Inference: Memory-efficient deployment
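As a minimal sketch of the LoRA item above (assuming the peft library is installed; target_modules=["c_attn"] matches GPT-2's fused attention projection):

from transformers import GPT2LMHeadModel
from peft import LoraConfig, get_peft_model

base_model = GPT2LMHeadModel.from_pretrained("gpt2")
lora_config = LoraConfig(r=8, lora_alpha=16, target_modules=["c_attn"], lora_dropout=0.05)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only the small low-rank adapter matrices are trainable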
Research Frontiers:
- Multimodal: Text + image + audio (e.g. Flamingo, GPT-4V)
- Efficiency: Smaller, faster models (e.g. TinyBERT, MobileBERT)
- Reasoning: Chain-of-thought prompting for improved logic
Conclusion & Next Steps
Transformer models have revolutionized NLP and are expanding into other domains. Key takeaways:
- Understand the attention mechanism that powers all transformers
- Choose the right architecture (BERT, GPT, etc.) for your task
- Leverage pretrained models through fine-tuning
- Consider efficiency techniques for production deployment
Learning Resources:
Ready to implement? Start with these colab notebooks:
