Understanding GPT
Build a GPT language model from scratch in PyTorch — from a bigram baseline through self-attention, multi-head attention, and full transformer blocks to generating Shakespeare.
Download Courseware
Prefer to inspect a complete implementation? Download pre-completed courseware for this course.
- Pre-completed project: complete reference implementation — bigram baseline and full GPT model, ready to train on TinyShakespeare.
The Big Picture
What a GPT is, how the dataset works, and the plan for building one from scratch.
- [ ] What Is a GPT?
- Explain what a GPT does at a high level — next-token prediction on sequences
- Identify the key components of the transformer architecture
- Describe the progression from bigram model to full GPT
- [ ] The Dataset and Tokenization
- Explain how TinyShakespeare is structured and why it is a good training dataset
- Describe character-level tokenization and how encode/decode functions work
- Explain the relationship between block_size, batch_size, and training batches
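The encode/decode pair above can be sketched in a few lines. This is a minimal character-level tokenizer; the sample string here stands in for the full TinyShakespeare text, and the variable names (`stoi`, `itos`) are one common convention, not necessarily the course's exact identifiers.

```python
# Character-level tokenizer: every unique character becomes a token id.
# `text` stands in for the full TinyShakespeare corpus.
text = "First Citizen: Before we proceed any further, hear me speak."
chars = sorted(set(text))
vocab_size = len(chars)

stoi = {ch: i for i, ch in enumerate(chars)}  # string -> int
itos = {i: ch for ch, i in stoi.items()}      # int -> string

def encode(s):
    return [stoi[c] for c in s]

def decode(ids):
    return "".join(itos[i] for i in ids)

# Round-tripping recovers the original string.
assert decode(encode("hear me")) == "hear me"
```

A training batch is then `batch_size` independent chunks of `block_size` consecutive token ids sampled from the encoded text, with targets shifted one position to the right.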
The Bigram Baseline
The simplest possible language model — predicting the next character from just the current one.
- [ ] The Bigram Model
- Explain how a bigram model uses an embedding table as its only learned component
- Trace the forward pass from token input through embedding lookup to logits
- Calculate the expected loss of an untrained model with a uniform distribution
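The forward pass and the expected-loss calculation can be sketched as follows. This is a minimal version of a bigram model, assuming TinyShakespeare's 65-character vocabulary; the class and variable names are illustrative, not necessarily the course's exact ones.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size = 65  # TinyShakespeare's character vocabulary (assumed)

class BigramLanguageModel(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        # The embedding table is the model's only learned component:
        # row i holds the logits for the token that follows token i.
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        logits = self.token_embedding_table(idx)  # (B, T, vocab_size)
        loss = None
        if targets is not None:
            B, T, C = logits.shape
            loss = F.cross_entropy(logits.view(B * T, C), targets.view(B * T))
        return logits, loss

model = BigramLanguageModel(vocab_size)
idx = torch.randint(vocab_size, (4, 8))      # batch of 4 sequences of length 8
targets = torch.randint(vocab_size, (4, 8))
logits, loss = model(idx, targets)

# With a perfectly uniform distribution the loss would be exactly
# -ln(1/65) = ln(65) ~ 4.17; random initialization lands somewhat above.
uniform_loss = -math.log(1.0 / vocab_size)
```

The `ln(65) ≈ 4.17` figure is the benchmark to check the untrained model against before any training begins.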
- [ ] Training the Bigram Model
- Describe the training loop: forward pass, loss computation, backward pass, optimizer step
- Explain why the training loss and validation loss diverge
- Evaluate the quality of bigram-generated text and identify its specific failures
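The four-step loop above can be sketched as below. An `nn.Embedding` stands in for the bigram model, and one fixed random batch stands in for a `get_batch` helper, so this shows the mechanics rather than real training; hyperparameters are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size = 65
model = nn.Embedding(vocab_size, vocab_size)  # stand-in for the bigram model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-2)

# One fixed batch stands in for sampling fresh batches each step.
xb = torch.randint(vocab_size, (32, 8))
yb = torch.randint(vocab_size, (32, 8))

losses = []
for step in range(100):
    logits = model(xb)                                   # 1. forward pass
    loss = F.cross_entropy(logits.view(-1, vocab_size),  # 2. loss computation
                           yb.view(-1))
    optimizer.zero_grad(set_to_none=True)
    loss.backward()                                      # 3. backward pass
    optimizer.step()                                     # 4. optimizer step
    losses.append(loss.item())
```

Because this loop memorizes one batch, the loss falls steadily; on real data, the gap that opens between training and validation loss is the overfitting signal the lesson examines.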
Self-Attention
How tokens learn to communicate — from simple averaging to scaled dot-product attention with queries, keys, and values.
- [ ] Averaging as a First Step
- Explain why averaging previous token embeddings is a first step toward attention
- Describe how a lower-triangular matrix implements causal averaging
- Identify the limitations of uniform averaging compared to learned weighting
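The lower-triangular trick can be shown in a few lines. This is a minimal sketch with tiny illustrative dimensions: row-normalizing `tril` and matrix-multiplying gives each position the mean of itself and everything before it.

```python
import torch

torch.manual_seed(0)
B, T, C = 1, 4, 2
x = torch.randn(B, T, C)

# A lower-triangular matrix of ones, row-normalized, implements
# causal averaging: position t cannot see positions after t.
tril = torch.tril(torch.ones(T, T))
wei = tril / tril.sum(dim=1, keepdim=True)  # each row sums to 1
out = wei @ x                               # (T, T) @ (B, T, C) -> (B, T, C)

# Position 2's output is the mean of x at positions 0, 1, 2.
assert torch.allclose(out[0, 2], x[0, :3].mean(dim=0))
```

The limitation is visible in `wei` itself: every past position gets the same weight, which is exactly what the query/key mechanism later replaces with learned, content-dependent weights.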
- [ ] Masking and Softmax
- Explain how softmax converts raw scores into a probability distribution for attention weights
- Describe the masking-then-softmax pattern for causal attention
- Trace how changing the weight matrix changes which information each position receives
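The mask-then-softmax pattern can be sketched directly. Random scores stand in for the learned affinities; the point is that `-inf` entries become exactly zero weight after softmax, so each row is a probability distribution over allowed positions only.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
T = 4
scores = torch.randn(T, T)  # raw affinity scores (random, for illustration)

# Mask future positions with -inf, then softmax: exp(-inf) = 0,
# so masked positions receive exactly zero attention weight.
tril = torch.tril(torch.ones(T, T))
scores = scores.masked_fill(tril == 0, float("-inf"))
wei = F.softmax(scores, dim=-1)

assert torch.allclose(wei.sum(dim=-1), torch.ones(T))  # rows sum to 1
assert wei[0, 1] == 0 and wei[1, 3] == 0               # no peeking ahead
```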
- [ ] Queries, Keys, and Values
- Explain the roles of query, key, and value projections in self-attention
- Trace the computation from token embeddings through Q/K/V to attention output
- Describe why scaling by 1/√d_k is necessary for stable training
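A single attention head, end to end, can be sketched as below. The dimensions are illustrative, not the course's exact hyperparameters. The `1/sqrt(d_k)` factor keeps the dot products near unit variance so the softmax does not saturate into near-one-hot weights early in training.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
B, T, C, head_size = 2, 8, 32, 16
x = torch.randn(B, T, C)

# Each token emits a query ("what am I looking for?"), a key
# ("what do I contain?"), and a value ("what do I communicate?").
query = nn.Linear(C, head_size, bias=False)
key = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)

q, k, v = query(x), key(x), value(x)              # each (B, T, head_size)
wei = q @ k.transpose(-2, -1) * head_size**-0.5   # scale by 1/sqrt(d_k)

tril = torch.tril(torch.ones(T, T))
wei = wei.masked_fill(tril == 0, float("-inf"))   # causal mask
wei = F.softmax(wei, dim=-1)                      # (B, T, T)
out = wei @ v                                     # (B, T, head_size)
```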
Building the Transformer
Multi-head attention, feedforward networks, residual connections, and layer normalization — the full transformer block.
- [ ] Multi-Head Attention
- Explain why multiple attention heads outperform a single large head
- Describe how head outputs are concatenated and projected
- Trace the validation loss improvement from single-head to multi-head attention
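The concatenate-and-project step can be sketched as below. A `Linear` layer stands in for the single-head module from the previous lesson so the sketch stays short; the shapes and wiring are the part that matters.

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, num_heads, head_size, n_embd):
        super().__init__()
        # Stand-ins for attention heads: each maps
        # (B, T, n_embd) -> (B, T, head_size).
        self.heads = nn.ModuleList(
            nn.Linear(n_embd, head_size) for _ in range(num_heads)
        )
        # Mix the concatenated head outputs back to n_embd channels.
        self.proj = nn.Linear(num_heads * head_size, n_embd)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        return self.proj(out)

mha = MultiHeadAttention(num_heads=4, head_size=8, n_embd=32)
x = torch.randn(2, 8, 32)
```

With `num_heads * head_size == n_embd`, the block's input and output widths match, which is what lets it drop into a residual stream.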
- [ ] Feedforward and Residual Connections
- Explain the role of the feedforward network in a transformer block
- Describe how residual connections enable training of deep networks
- Trace the data flow through attention → add → feedforward → add
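The attention → add → feedforward → add flow can be sketched as below. A `Linear` stands in for the attention sublayer; the feedforward uses the standard 4x inner expansion. The two additions are the residual connections that give gradients a direct path through deep stacks.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, n_embd):
        super().__init__()
        self.sa = nn.Linear(n_embd, n_embd)   # stand-in for self-attention
        self.ffwd = nn.Sequential(            # position-wise feedforward
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
        )

    def forward(self, x):
        x = x + self.sa(x)    # attention, then add (residual)
        x = x + self.ffwd(x)  # feedforward, then add (residual)
        return x

block = Block(32)
x = torch.randn(2, 8, 32)
```

Intuitively, attention is where tokens communicate; the feedforward is where each token, independently, computes on what it gathered.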
- [ ] Layer Normalization
- Explain what layer normalization does and why it helps training
- Distinguish pre-norm and post-norm transformer architectures
- Describe the complete data flow through a transformer block
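Both points can be seen in a short sketch: layer norm standardizes each token's features to roughly zero mean and unit variance, and the pre-norm layout (used by GPT-2) applies it *before* each sublayer, versus the original transformer's post-norm `ln(x + sublayer(x))`. Stand-in `Linear` sublayers keep the sketch short.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# LayerNorm normalizes over the feature dimension, per token.
ln = nn.LayerNorm(32)
y = ln(torch.randn(4, 32))
assert torch.allclose(y.mean(dim=-1), torch.zeros(4), atol=1e-5)

class PreNormBlock(nn.Module):
    def __init__(self, n_embd):
        super().__init__()
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)
        self.sa = nn.Linear(n_embd, n_embd)    # stand-in for attention
        self.ffwd = nn.Linear(n_embd, n_embd)  # stand-in for feedforward

    def forward(self, x):
        # Pre-norm: normalize *before* each sublayer, then add.
        x = x + self.sa(self.ln1(x))
        x = x + self.ffwd(self.ln2(x))
        return x

block = PreNormBlock(32)
x = torch.randn(2, 8, 32)
```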
Scaling Up
Dropout, deeper networks, and the complete GPT — from 2.49 loss to 1.48 loss.
- [ ] Dropout and Regularization
- Explain how dropout regularizes a neural network during training
- Describe the relationship between model size, training data, and overfitting
- Trace how scaling hyperparameters transforms generated text quality
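Dropout's behavior can be verified directly: in training mode it zeroes a random fraction `p` of activations and scales the survivors by `1/(1-p)` so the expected activation is unchanged; in eval mode it does nothing. The rate here is illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.2)
x = torch.ones(1000)

# Training mode: roughly 20% of values are zeroed,
# the rest become 1 / (1 - 0.2) = 1.25.
drop.train()
y = drop(x)
zeroed = (y == 0).float().mean().item()
assert 0.1 < zeroed < 0.3

# Eval mode: dropout is a no-op.
drop.eval()
assert torch.equal(drop(x), x)
```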
- [ ] The Complete GPT
- Trace the full data flow from raw text through the complete GPT model to generated output
- Identify which components are essential to the architecture and which are efficiency optimizations
- Explain the connection between this character-level model and production language models
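The generation end of that data flow can be sketched as the standard autoregressive sampling loop: crop the context to `block_size`, take the logits at the last position, sample one token, append, and repeat. An `nn.Embedding` stands in for the trained GPT so the sketch runs standalone.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, block_size = 65, 8
model = torch.nn.Embedding(vocab_size, vocab_size)  # stand-in for the GPT

idx = torch.zeros((1, 1), dtype=torch.long)  # start from a single token
for _ in range(20):
    idx_cond = idx[:, -block_size:]          # crop to the context window
    logits = model(idx_cond)                 # (1, T, vocab_size)
    logits = logits[:, -1, :]                # last time step only
    probs = F.softmax(logits, dim=-1)
    idx_next = torch.multinomial(probs, num_samples=1)
    idx = torch.cat((idx, idx_next), dim=1)  # append and continue
```

Production models run the same loop; the differences are scale, a learned subword tokenizer in place of characters, and efficiency machinery such as KV caching.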