Transformer Models: BERT, GPT & Beyond

Transformer models now power 90% of state-of-the-art NLP systems (Google Research 2024). This tutorial explores the architecture, implementation, and applications of BERT, GPT, and their variants.

[Figure: Transformer Model Adoption (2024)]
1. Transformer Architecture
Core Components:
- Self-Attention: Context-aware word relationships
- Multi-Head Attention: Parallel attention mechanisms
- Positional Encoding: Captures word order (see the sketch after this list)
- Encoder-Decoder: BERT (encoder), GPT (decoder)
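As a brief sketch of the positional-encoding idea (the sinusoidal form from the original Transformer paper; the helper function below is illustrative, not from this tutorial's codebase), each position gets a fixed pattern of sines and cosines that is added to the token embeddings:

import math
import torch

def sinusoidal_positions(seq_len, d_model):
    # PE[pos, 2i] = sin(pos / 10000^(2i/d_model)),  PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))
    pe = torch.zeros(seq_len, d_model)
    position = torch.arange(seq_len, dtype=torch.float).unsqueeze(1)
    div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe  # shape (seq_len, d_model); added to the token embeddings before the first layer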
Key Equations:
Attention(Q, K, V) = softmax(QKᵀ / √dₖ) V

MultiHead(Q, K, V) = Concat(head₁, ..., headₕ) W^O, where headᵢ = Attention(Q Wᵢ^Q, K Wᵢ^K, V Wᵢ^V)
PyTorch Implementation:
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    def __init__(self, embed_size, heads):
        super().__init__()
        assert embed_size % heads == 0, "embed_size must be divisible by heads"
        self.embed_size = embed_size
        self.heads = heads
        self.head_dim = embed_size // heads
        self.values = nn.Linear(embed_size, embed_size)
        self.keys = nn.Linear(embed_size, embed_size)
        self.queries = nn.Linear(embed_size, embed_size)
        self.fc_out = nn.Linear(embed_size, embed_size)

    def forward(self, values, keys, query, mask):
        N = query.shape[0]
        value_len, key_len, query_len = values.shape[1], keys.shape[1], query.shape[1]
        # Project, then split the embedding into `heads` separate heads
        values = self.values(values).view(N, value_len, self.heads, self.head_dim)
        keys = self.keys(keys).view(N, key_len, self.heads, self.head_dim)
        queries = self.queries(query).view(N, query_len, self.heads, self.head_dim)
        # Attention scores per head: shape (N, heads, query_len, key_len)
        energy = torch.einsum("nqhd,nkhd->nhqk", [queries, keys])
        if mask is not None:
            energy = energy.masked_fill(mask == 0, float("-1e20"))
        # Scale by √d_k (the per-head dimension), then normalize over the key axis
        attention = torch.softmax(energy / (self.head_dim ** 0.5), dim=3)
        # Weighted sum of values, then merge the heads back into one embedding
        out = torch.einsum("nhqk,nkhd->nqhd", [attention, values])
        out = out.reshape(N, query_len, self.embed_size)
        return self.fc_out(out)
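As a quick sanity check (the tensor sizes here are arbitrary), the module maps a (batch, seq_len, embed_size) input back to the same shape:

attn = SelfAttention(embed_size=256, heads=8)
x = torch.randn(2, 10, 256)        # hypothetical batch of 2 sequences of length 10
out = attn(x, x, x, mask=None)     # self-attention: values, keys, and queries are the same tensor
print(out.shape)                   # torch.Size([2, 10, 256])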
2. BERT & Encoder Models
Key Features:
Feature | Description | Impact |
---|---|---|
Masked LM | Predict hidden words | Bidirectional context |
Next Sentence Prediction | Sentence relationship | Better discourse |
WordPiece Tokenization | Subword units | Handles rare words |
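To see the WordPiece behavior from the table in practice, here is a minimal sketch; out-of-vocabulary words are split into '##'-prefixed subword pieces:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
print(tokenizer.tokenize("I have a new GPU!"))
# A rare word like 'GPU' is typically split into subword pieces such as ['gp', '##u']
# with the bert-base-uncased vocabulary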
HuggingFace Implementation:
from transformers import BertTokenizer, BertForSequenceClassification
import torch

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# Note: the classification head is randomly initialized until you fine-tune on labeled data
model = BertForSequenceClassification.from_pretrained('bert-base-uncased')

inputs = tokenizer("Hello world!", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
logits = outputs.logits
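To turn the logits into a prediction, apply a softmax over the class dimension and take the argmax (with the untuned head above, the result is only meaningful after fine-tuning):

probs = torch.softmax(logits, dim=-1)      # per-class probabilities
predicted_class = probs.argmax(dim=-1)     # index of the highest-probability class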
Variants:
RoBERTa (optimized pretraining), DistilBERT (smaller), ALBERT (parameter efficiency)
3. GPT & Decoder Models
Key Features:
- Autoregressive: Predicts the next token in the sequence, which enables open-ended text generation
- Causal Attention: Masked self-attention stops each position from attending to future tokens (see the mask sketch below)
- Scaling Laws: More data and parameters → better performance, which drives ever-larger models
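As an illustrative sketch (reusing the SelfAttention module from Section 1, and assuming the query and key lengths match the mask size), a causal mask is simply a lower-triangular matrix: position i may only attend to positions ≤ i.

import torch

seq_len = 5
causal_mask = torch.tril(torch.ones(seq_len, seq_len))  # 1s on and below the diagonal
# Passed as `mask` to the SelfAttention module above, scores where mask == 0
# (the future positions) are filled with -1e20 before the softmax, so they get ~0 weight.
print(causal_mask)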
Text Generation Example:
from transformers import GPT2Tokenizer, GPT2LMHeadModel
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')
input_ids = tokenizer.encode("The future of AI is", return_tensors='pt')
output = model.generate(input_ids, max_length=50, num_return_sequences=1)
print(tokenizer.decode(output[0], skip_special_tokens=True))
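By default, generate() decodes greedily; a common alternative (the parameter values below are illustrative, not prescriptive) is to sample with temperature and top-k/top-p filtering:

output = model.generate(
    input_ids,
    max_length=50,
    do_sample=True,     # sample from the distribution instead of greedy decoding
    top_k=50,           # consider only the 50 most likely next tokens
    top_p=0.95,         # nucleus sampling: smallest token set with cumulative probability 0.95
    temperature=0.8,    # <1 sharpens the distribution, >1 flattens it
)
print(tokenizer.decode(output[0], skip_special_tokens=True))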
Transformer Model Comparison
Model | Type | Parameters | Best For |
---|---|---|---|
BERT | Encoder | 110M-340M | Classification, QA |
GPT-4 | Decoder | Undisclosed | Text generation |
T5 | Encoder-Decoder | 11B | Text-to-text tasks |
BART | Encoder-Decoder | 400M | Summarization |
4. Practical Applications
Implementation Techniques:
- Fine-tuning: Adapt pretrained models to a downstream task (see the sketch after this list)
- Prompt Engineering: Craft effective inputs for zero-shot learning
- Model Distillation: Create smaller, faster models
- Quantization: Reduce model size for deployment
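As a minimal fine-tuning sketch (assuming the Hugging Face datasets library and the GLUE SST-2 sentiment dataset; swap in your own data and hyperparameters as needed):

from datasets import load_dataset
from transformers import (BertTokenizer, BertForSequenceClassification,
                          Trainer, TrainingArguments)

dataset = load_dataset("glue", "sst2")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

def tokenize(batch):
    return tokenizer(batch["sentence"], truncation=True, padding="max_length", max_length=128)

encoded = dataset.map(tokenize, batched=True)

args = TrainingArguments(output_dir="bert-sst2", num_train_epochs=1,
                         per_device_train_batch_size=16)
trainer = Trainer(model=model, args=args,
                  train_dataset=encoded["train"],
                  eval_dataset=encoded["validation"])
trainer.train()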
Real-World Use Cases:
Text Classification
# Sentiment analysis with BERT
from transformers import pipeline
classifier = pipeline("sentiment-analysis",
                      model="distilbert-base-uncased-finetuned-sst-2-english")
classifier("This tutorial is incredibly helpful!")
Named Entity Recognition
# NER with spaCy's transformer pipeline
# (requires: pip install spacy-transformers && python -m spacy download en_core_web_trf)
import spacy
nlp = spacy.load("en_core_web_trf")
doc = nlp("Apple reached a $3T market cap in 2023")
print([(ent.text, ent.label_) for ent in doc.ents])
Text Generation
# Creative writing with GPT
generator = pipeline("text-generation", model="gpt2")
output = generator("In a future where AI governs humanity,",
                   max_length=100,
                   num_return_sequences=1)
5. Advanced Topics & Future Directions
Emerging Architectures:
Model | Innovation | Advantage |
---|---|---|
Switch Transformers | Mixture-of-Experts | Efficient scaling |
Perceiver IO | Cross-modal attention | Handles any input |
RetNet | Retentive networks | Parallel + recurrent |
Optimization Techniques:
- Flash Attention: Faster attention computation
- LoRA: Low-rank adaptation for efficient fine-tuning (see the sketch after this list)
- 8-bit Inference: Memory-efficient deployment
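As a minimal sketch of the LoRA item above (assuming the peft library is installed; target_modules=["c_attn"] matches GPT-2's fused attention projection):

from transformers import GPT2LMHeadModel
from peft import LoraConfig, get_peft_model

base_model = GPT2LMHeadModel.from_pretrained("gpt2")
lora_config = LoraConfig(r=8, lora_alpha=16, target_modules=["c_attn"], lora_dropout=0.05)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only the small low-rank adapter matrices are trainable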
Research Frontiers:
- Multimodal: Text + image + audio (e.g. Flamingo, GPT-4V)
- Efficiency: Smaller, faster models (e.g. TinyBERT, MobileBERT)
- Reasoning: Chain-of-thought prompting for improved logic
Conclusion & Next Steps
Transformer models have revolutionized NLP and are expanding into other domains. Key takeaways:
- Understand the attention mechanism that powers all transformers
- Choose the right architecture (BERT, GPT, etc.) for your task
- Leverage pretrained models through fine-tuning
- Consider efficiency techniques for production deployment
Learning Resources:
Ready to implement? Start with these colab notebooks:
