
Understanding GPT

Build a GPT language model from scratch in PyTorch — from a bigram baseline through self-attention, multi-head attention, and full transformer blocks to generating Shakespeare.


Download Courseware

Prefer to inspect a complete implementation? Download pre-completed courseware for this course.

  • Pre-completed project
    Complete reference implementation — bigram baseline and full GPT model, ready to train on TinyShakespeare.

The Big Picture

What a GPT is, how the dataset works, and the plan for building one from scratch.

  • [ ] What Is a GPT? [text] free
    • Explain what a GPT does at a high level — next-token prediction on sequences
    • Identify the key components of the transformer architecture
    • Describe the progression from bigram model to full GPT
    • Explain how TinyShakespeare is structured and why it is a good training dataset
    • Describe character-level tokenization and how encode/decode functions work
    • Explain the relationship between block_size, batch_size, and training batches
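The tokenization and batching objectives above can be sketched in a few lines. This is a minimal illustration, using a short stand-in string rather than the actual TinyShakespeare file:

```python
# Stand-in corpus; the course uses the full TinyShakespeare text here.
text = "First Citizen: Before we proceed any further, hear me speak."

# Vocabulary: every unique character, sorted for a stable ordering.
chars = sorted(set(text))
vocab_size = len(chars)

# encode maps a string to integer token ids; decode is its inverse.
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}
encode = lambda s: [stoi[c] for c in s]
decode = lambda ids: "".join(itos[i] for i in ids)

# One training example is block_size consecutive tokens as input (x),
# with the same window shifted one position as the target (y).
block_size = 8
data = encode(text)
x = data[:block_size]
y = data[1:block_size + 1]
```

A batch is then just `batch_size` such windows sampled at random offsets and stacked, which is why `block_size` bounds the context a model can see while `batch_size` only affects how many examples are processed per step.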

The Bigram Baseline

The simplest possible language model — predicting the next character from just the current one.

    • Explain how a bigram model uses an embedding table as its only learned component
    • Trace the forward pass from token input through embedding lookup to logits
    • Calculate the expected loss of an untrained model with a uniform distribution
    • Describe the training loop: forward pass, loss computation, backward pass, optimizer step
    • Explain why the training loss and validation loss diverge
    • Evaluate the quality of bigram-generated text and identify its specific failures
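The bigram baseline above can be sketched as follows. Shapes and names are illustrative rather than the course's exact code; the final check reflects the "uniform distribution" objective, since an untrained model should score near -ln(1/vocab_size):

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class BigramLanguageModel(nn.Module):
    """The embedding table is the only learned component: row i holds
    the logits for whichever token follows token i."""
    def __init__(self, vocab_size):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        logits = self.token_embedding_table(idx)  # (B, T, vocab_size)
        loss = None
        if targets is not None:
            B, T, C = logits.shape
            loss = F.cross_entropy(logits.view(B * T, C), targets.view(B * T))
        return logits, loss

torch.manual_seed(0)
vocab_size = 65  # TinyShakespeare's character vocabulary size
model = BigramLanguageModel(vocab_size)

idx = torch.randint(vocab_size, (4, 8))  # batch of 4 sequences, 8 tokens each
logits, loss = model(idx, targets=torch.randint(vocab_size, (4, 8)))

# A uniform predictor over 65 classes has loss -ln(1/65) = ln(65) ≈ 4.17;
# an untrained model should land in that neighborhood.
expected = math.log(vocab_size)
```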

Self-Attention

How tokens learn to communicate — from simple averaging to scaled dot-product attention with queries, keys, and values.

    • Explain why averaging previous token embeddings is a first step toward attention
    • Describe how a lower-triangular matrix implements causal averaging
    • Identify the limitations of uniform averaging compared to learned weighting
    • Explain how softmax converts raw scores into a probability distribution for attention weights
    • Describe the masking-then-softmax pattern for causal attention
    • Trace how changing the weight matrix changes which information each position receives
    • Explain the roles of query, key, and value projections in self-attention
    • Trace the computation from token embeddings through Q/K/V to attention output
    • Describe why scaling by 1/√d_k is necessary for stable training
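The progression above ends at scaled dot-product attention. A single causal head might be sketched like this, with the mask-then-softmax pattern made explicit (dimensions are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
B, T, C = 2, 8, 32        # batch, time (block_size), embedding dim
head_size = 16
x = torch.randn(B, T, C)  # stand-in token embeddings

# Query, key, and value are learned linear projections of the embeddings.
query = nn.Linear(C, head_size, bias=False)
key = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)

q, k, v = query(x), key(x), value(x)  # each (B, T, head_size)

# Scaled dot-product scores; the 1/sqrt(d_k) factor keeps their variance
# near 1 so softmax does not saturate early in training.
scores = q @ k.transpose(-2, -1) * head_size ** -0.5  # (B, T, T)

# Mask, then softmax: future positions become -inf, so their weight is 0
# and each row is a probability distribution over past positions only.
tril = torch.tril(torch.ones(T, T))
scores = scores.masked_fill(tril == 0, float("-inf"))
weights = F.softmax(scores, dim=-1)

out = weights @ v  # (B, T, head_size): weighted sum of value vectors
```

Setting `weights` to the uniform lower-triangular average recovers the simple averaging step; learned Q/K projections replace that uniform weighting with data-dependent weights.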

Building the Transformer

Multi-head attention, feedforward networks, residual connections, and layer normalization — the full transformer block.

    • Explain why multiple attention heads outperform a single large head
    • Describe how head outputs are concatenated and projected
    • Trace the validation loss improvement from single-head to multi-head attention
    • Explain the role of the feedforward network in a transformer block
    • Describe how residual connections enable training of deep networks
    • Trace the data flow through attention → add → feedforward → add
    • Explain what layer normalization does and why it helps training
    • Distinguish pre-norm and post-norm transformer architectures
    • Describe the complete data flow through a transformer block
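The pre-norm data flow above can be sketched as a block. Note this uses PyTorch's built-in `nn.MultiheadAttention` as a stand-in for the hand-built heads the course develops, so it is a structural sketch rather than the course's implementation:

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """Position-wise MLP: expand 4x, nonlinearity, project back."""
    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
        )

    def forward(self, x):
        return self.net(x)

class Block(nn.Module):
    """Pre-norm transformer block: LayerNorm is applied *before* each
    sub-layer, and residual connections add each sub-layer's output
    back onto its input (attention -> add -> feedforward -> add)."""
    def __init__(self, n_embd, n_head):
        super().__init__()
        self.ln1 = nn.LayerNorm(n_embd)
        self.attn = nn.MultiheadAttention(n_embd, n_head, batch_first=True)
        self.ln2 = nn.LayerNorm(n_embd)
        self.ffwd = FeedForward(n_embd)

    def forward(self, x):
        T = x.size(1)
        # Boolean mask: True entries may NOT be attended to (the future).
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        a = self.ln1(x)
        attn_out, _ = self.attn(a, a, a, attn_mask=causal)
        x = x + attn_out                  # residual around attention
        x = x + self.ffwd(self.ln2(x))    # residual around feedforward
        return x

torch.manual_seed(0)
block = Block(n_embd=32, n_head=4)
y = block(torch.randn(2, 8, 32))  # shape is preserved: (2, 8, 32)
```

A post-norm variant would instead normalize after each residual add; the pre-norm form shown here is what modern GPTs use because it trains more stably at depth.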

Scaling Up

Dropout, deeper networks, and the complete GPT — driving validation loss from 2.49 down to 1.48.

    • Explain how dropout regularizes a neural network during training
    • Describe the relationship between model size, training data, and overfitting
    • Trace how scaling hyperparameters transforms generated text quality
  • [ ] The Complete GPT [text]
    • Trace the full data flow from raw text through the complete GPT model to generated output
    • Identify which components are essential to the architecture and which are efficiency optimizations
    • Explain the connection between this character-level model and production language models
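The data flow traced above ends in autoregressive generation. A sketch of that sampling loop, using a hypothetical placeholder model with the `(logits, loss)` interface assumed throughout (any trained GPT would slot in the same way):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BigramStandIn(nn.Module):
    # Placeholder with the (logits, loss) forward interface; purely
    # illustrative, standing in for the full trained model.
    def __init__(self, vocab_size):
        super().__init__()
        self.table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx):
        return self.table(idx), None

@torch.no_grad()
def generate(model, idx, max_new_tokens, block_size):
    """Autoregressive sampling: crop to the last block_size tokens,
    run the model, sample from the final position, append, repeat."""
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -block_size:]   # respect the context window
        logits, _ = model(idx_cond)
        logits = logits[:, -1, :]         # logits for the next token only
        probs = F.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)
        idx = torch.cat([idx, next_id], dim=1)
    return idx

torch.manual_seed(0)
model = BigramStandIn(vocab_size=65)
start = torch.zeros(1, 1, dtype=torch.long)  # a single start token
out = generate(model, start, max_new_tokens=20, block_size=8)
```

Decoding `out` with the tokenizer's `decode` function turns the sampled ids back into text; production language models generate the same way, just with subword tokens and far larger models.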