Understanding GPT
Build a GPT language model from scratch in PyTorch — from a bigram baseline through self-attention, multi-head attention, and full transformer blocks to generating Shakespeare.
Download Courseware
Prefer to inspect a complete implementation? Download pre-completed courseware for this course.
- Pre-completed project: complete reference implementation — bigram baseline and full GPT model, ready to train on TinyShakespeare.
The Big Picture
What a GPT is, how the dataset works, and the plan for building one from scratch.
- [ ] What Is a GPT?
- Explain what a GPT does at a high level — next-token prediction on sequences
- Identify the key components of the transformer architecture
- Describe the progression from bigram model to full GPT
- [ ] The Dataset and Tokenization
- Explain how TinyShakespeare is structured and why it is a good training dataset
- Describe character-level tokenization and how encode/decode functions work
- Explain the relationship between block_size, batch_size, and training batches
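The encode/decode pair above can be sketched in a few lines. This is a minimal character-level tokenizer; the sample string here stands in for the full TinyShakespeare text, and the variable names (`stoi`, `itos`) are one common convention, not necessarily the course's exact identifiers.

```python
# Character-level tokenizer: every unique character becomes a token id.
# `text` stands in for the full TinyShakespeare corpus.
text = "First Citizen: Before we proceed any further, hear me speak."
chars = sorted(set(text))
vocab_size = len(chars)

stoi = {ch: i for i, ch in enumerate(chars)}  # string -> int
itos = {i: ch for ch, i in stoi.items()}      # int -> string

def encode(s):
    return [stoi[c] for c in s]

def decode(ids):
    return "".join(itos[i] for i in ids)

# Round-tripping recovers the original string.
assert decode(encode("hear me")) == "hear me"
```

A training batch is then `batch_size` independent chunks of `block_size` consecutive token ids sampled from the encoded text, with targets shifted one position to the right.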
The Bigram Baseline
The simplest possible language model — predicting the next character from just the current one.
- [ ] The Bigram Model
- Explain how a bigram model uses an embedding table as its only learned component
- Trace the forward pass from token input through embedding lookup to logits
- Calculate the expected loss of an untrained model with a uniform distribution
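The forward pass and the expected-loss calculation can be sketched as follows. This is a minimal version of a bigram model, assuming TinyShakespeare's 65-character vocabulary; the class and variable names are illustrative, not necessarily the course's exact ones.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size = 65  # TinyShakespeare's character vocabulary (assumed)

class BigramLanguageModel(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        # The embedding table is the model's only learned component:
        # row i holds the logits for the token that follows token i.
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        logits = self.token_embedding_table(idx)  # (B, T, vocab_size)
        loss = None
        if targets is not None:
            B, T, C = logits.shape
            loss = F.cross_entropy(logits.view(B * T, C), targets.view(B * T))
        return logits, loss

model = BigramLanguageModel(vocab_size)
idx = torch.randint(vocab_size, (4, 8))      # batch of 4 sequences of length 8
targets = torch.randint(vocab_size, (4, 8))
logits, loss = model(idx, targets)

# With a perfectly uniform distribution the loss would be exactly
# -ln(1/65) = ln(65) ~ 4.17; random initialization lands somewhat above.
uniform_loss = -math.log(1.0 / vocab_size)
```

The `ln(65) ≈ 4.17` figure is the benchmark to check the untrained model against before any training begins.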
- [ ] Training the Bigram Model
- Describe the training loop: forward pass, loss computation, backward pass, optimizer step
- Explain why the training loss and validation loss diverge
- Evaluate the quality of bigram-generated text and identify its specific failures
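The four-step loop above can be sketched as below. An `nn.Embedding` stands in for the bigram model, and one fixed random batch stands in for a `get_batch` helper, so this shows the mechanics rather than real training; hyperparameters are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size = 65
model = nn.Embedding(vocab_size, vocab_size)  # stand-in for the bigram model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-2)

# One fixed batch stands in for sampling fresh batches each step.
xb = torch.randint(vocab_size, (32, 8))
yb = torch.randint(vocab_size, (32, 8))

losses = []
for step in range(100):
    logits = model(xb)                                   # 1. forward pass
    loss = F.cross_entropy(logits.view(-1, vocab_size),  # 2. loss computation
                           yb.view(-1))
    optimizer.zero_grad(set_to_none=True)
    loss.backward()                                      # 3. backward pass
    optimizer.step()                                     # 4. optimizer step
    losses.append(loss.item())
```

Because this loop memorizes one batch, the loss falls steadily; on real data, the gap that opens between training and validation loss is the overfitting signal the lesson examines.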
Self-Attention
How tokens learn to communicate — from simple averaging to scaled dot-product attention with queries, keys, and values.
- [ ] Averaging as a First Step
- Explain why averaging previous token embeddings is a first step toward attention
- Describe how a lower-triangular matrix implements causal averaging
- Identify the limitations of uniform averaging compared to learned weighting
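The lower-triangular trick can be shown in a few lines. This is a minimal sketch with tiny illustrative dimensions: row-normalizing `tril` and matrix-multiplying gives each position the mean of itself and everything before it.

```python
import torch

torch.manual_seed(0)
B, T, C = 1, 4, 2
x = torch.randn(B, T, C)

# A lower-triangular matrix of ones, row-normalized, implements
# causal averaging: position t cannot see positions after t.
tril = torch.tril(torch.ones(T, T))
wei = tril / tril.sum(dim=1, keepdim=True)  # each row sums to 1
out = wei @ x                               # (T, T) @ (B, T, C) -> (B, T, C)

# Position 2's output is the mean of x at positions 0, 1, 2.
assert torch.allclose(out[0, 2], x[0, :3].mean(dim=0))
```

The limitation is visible in `wei` itself: every past position gets the same weight, which is exactly what the query/key mechanism later replaces with learned, content-dependent weights.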
- [ ] Masking and Softmax
- Explain how softmax converts raw scores into a probability distribution for attention weights
- Describe the masking-then-softmax pattern for causal attention
- Trace how changing the weight matrix changes which information each position receives
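The mask-then-softmax pattern can be sketched directly. Random scores stand in for the learned affinities; the point is that `-inf` entries become exactly zero weight after softmax, so each row is a probability distribution over allowed positions only.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
T = 4
scores = torch.randn(T, T)  # raw affinity scores (random, for illustration)

# Mask future positions with -inf, then softmax: exp(-inf) = 0,
# so masked positions receive exactly zero attention weight.
tril = torch.tril(torch.ones(T, T))
scores = scores.masked_fill(tril == 0, float("-inf"))
wei = F.softmax(scores, dim=-1)

assert torch.allclose(wei.sum(dim=-1), torch.ones(T))  # rows sum to 1
assert wei[0, 1] == 0 and wei[1, 3] == 0               # no peeking ahead
```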
- [ ] Queries, Keys, and Values
- Explain the roles of query, key, and value projections in self-attention
- Trace the computation from token embeddings through Q/K/V to attention output
- Describe why scaling by 1/√d_k is necessary for stable training
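A single attention head, end to end, can be sketched as below. The dimensions are illustrative, not the course's exact hyperparameters. The `1/sqrt(d_k)` factor keeps the dot products near unit variance so the softmax does not saturate into near-one-hot weights early in training.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
B, T, C, head_size = 2, 8, 32, 16
x = torch.randn(B, T, C)

# Each token emits a query ("what am I looking for?"), a key
# ("what do I contain?"), and a value ("what do I communicate?").
query = nn.Linear(C, head_size, bias=False)
key = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)

q, k, v = query(x), key(x), value(x)              # each (B, T, head_size)
wei = q @ k.transpose(-2, -1) * head_size**-0.5   # scale by 1/sqrt(d_k)

tril = torch.tril(torch.ones(T, T))
wei = wei.masked_fill(tril == 0, float("-inf"))   # causal mask
wei = F.softmax(wei, dim=-1)                      # (B, T, T)
out = wei @ v                                     # (B, T, head_size)
```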
Building the Transformer
Multi-head attention, feedforward networks, residual connections, and layer normalization — the full transformer block.
- [ ] Multi-Head Attention
- Explain why multiple attention heads outperform a single large head
- Describe how head outputs are concatenated and projected
- Trace the validation loss improvement from single-head to multi-head attention
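The concatenate-and-project step can be sketched as below. A `Linear` layer stands in for the single-head module from the previous lesson so the sketch stays short; the shapes and wiring are the part that matters.

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, num_heads, head_size, n_embd):
        super().__init__()
        # Stand-ins for attention heads: each maps
        # (B, T, n_embd) -> (B, T, head_size).
        self.heads = nn.ModuleList(
            nn.Linear(n_embd, head_size) for _ in range(num_heads)
        )
        # Mix the concatenated head outputs back to n_embd channels.
        self.proj = nn.Linear(num_heads * head_size, n_embd)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        return self.proj(out)

mha = MultiHeadAttention(num_heads=4, head_size=8, n_embd=32)
x = torch.randn(2, 8, 32)
```

With `num_heads * head_size == n_embd`, the block's input and output widths match, which is what lets it drop into a residual stream.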
- [ ] Feedforward and Residual Connections
- Explain the role of the feedforward network in a transformer block
- Describe how residual connections enable training of deep networks
- Trace the data flow through attention → add → feedforward → add
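The attention → add → feedforward → add flow can be sketched as below. A `Linear` stands in for the attention sublayer; the feedforward uses the standard 4x inner expansion. The two additions are the residual connections that give gradients a direct path through deep stacks.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, n_embd):
        super().__init__()
        self.sa = nn.Linear(n_embd, n_embd)   # stand-in for self-attention
        self.ffwd = nn.Sequential(            # position-wise feedforward
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
        )

    def forward(self, x):
        x = x + self.sa(x)    # attention, then add (residual)
        x = x + self.ffwd(x)  # feedforward, then add (residual)
        return x

block = Block(32)
x = torch.randn(2, 8, 32)
```

Intuitively, attention is where tokens communicate; the feedforward is where each token, independently, computes on what it gathered.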
- [ ] Layer Normalization
- Explain what layer normalization does and why it helps training
- Distinguish pre-norm and post-norm transformer architectures
- Describe the complete data flow through a transformer block
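Both points can be seen in a short sketch: layer norm standardizes each token's features to roughly zero mean and unit variance, and the pre-norm layout (used by GPT-2) applies it *before* each sublayer, versus the original transformer's post-norm `ln(x + sublayer(x))`. Stand-in `Linear` sublayers keep the sketch short.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# LayerNorm normalizes over the feature dimension, per token.
ln = nn.LayerNorm(32)
y = ln(torch.randn(4, 32))
assert torch.allclose(y.mean(dim=-1), torch.zeros(4), atol=1e-5)

class PreNormBlock(nn.Module):
    def __init__(self, n_embd):
        super().__init__()
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)
        self.sa = nn.Linear(n_embd, n_embd)    # stand-in for attention
        self.ffwd = nn.Linear(n_embd, n_embd)  # stand-in for feedforward

    def forward(self, x):
        # Pre-norm: normalize *before* each sublayer, then add.
        x = x + self.sa(self.ln1(x))
        x = x + self.ffwd(self.ln2(x))
        return x

block = PreNormBlock(32)
x = torch.randn(2, 8, 32)
```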
Scaling Up
Dropout, deeper networks, and the complete GPT — from 2.49 loss to 1.48 loss.
- [ ] Dropout and Regularization
- Explain how dropout regularizes a neural network during training
- Describe the relationship between model size, training data, and overfitting
- Trace how scaling hyperparameters transforms generated text quality
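Dropout's behavior can be verified directly: in training mode it zeroes a random fraction `p` of activations and scales the survivors by `1/(1-p)` so the expected activation is unchanged; in eval mode it does nothing. The rate here is illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.2)
x = torch.ones(1000)

# Training mode: roughly 20% of values are zeroed,
# the rest become 1 / (1 - 0.2) = 1.25.
drop.train()
y = drop(x)
zeroed = (y == 0).float().mean().item()
assert 0.1 < zeroed < 0.3

# Eval mode: dropout is a no-op.
drop.eval()
assert torch.equal(drop(x), x)
```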
- [ ] The Complete GPT
- Trace the full data flow from raw text through the complete GPT model to generated output
- Identify which components are essential to the architecture and which are efficiency optimizations
- Explain the connection between this character-level model and production language models
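The generation end of that data flow can be sketched as the standard autoregressive sampling loop: crop the context to `block_size`, take the logits at the last position, sample one token, append, and repeat. An `nn.Embedding` stands in for the trained GPT so the sketch runs standalone.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, block_size = 65, 8
model = torch.nn.Embedding(vocab_size, vocab_size)  # stand-in for the GPT

idx = torch.zeros((1, 1), dtype=torch.long)  # start from a single token
for _ in range(20):
    idx_cond = idx[:, -block_size:]          # crop to the context window
    logits = model(idx_cond)                 # (1, T, vocab_size)
    logits = logits[:, -1, :]                # last time step only
    probs = F.softmax(logits, dim=-1)
    idx_next = torch.multinomial(probs, num_samples=1)
    idx = torch.cat((idx, idx_next), dim=1)  # append and continue
```

Production models run the same loop; the differences are scale, a learned subword tokenizer in place of characters, and efficiency machinery such as KV caching.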