Understanding microGPT
Walk through Karpathy's 200-line GPT implementation — from autograd and tokenization to multi-head attention, training, and text generation.
Download courseware
Prefer to inspect a complete implementation? Download the pre-completed courseware for this course.
- Pre-completed project: a complete reference implementation of microGPT in pure Python — the full autograd engine, transformer architecture, training loop, and generation in a single file.
The Big Picture
What microGPT is, where the data comes from, and how text becomes numbers.
- [ ] What microGPT Is
- Identify the components contained in microGPT's 200 lines
- Explain how microGPT relates to Karpathy's earlier projects (micrograd, makemore, nanoGPT)
- Distinguish essential algorithmic content from efficiency optimizations
- [ ] Data and Tokenization
- Explain how microGPT loads and prepares its training data
- Describe character-level tokenization and the role of the BOS token
- Compare character-level tokenization with subword tokenization
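The character-level tokenization described above can be sketched in a few lines of plain Python. This is an illustrative sketch, not microGPT's exact code; the example documents and the convention of giving BOS the last vocabulary id are assumptions.

```python
# Illustrative sketch of character-level tokenization with a BOS token.
docs = ["emma", "olivia", "ava"]  # example training documents (names)

# Build the vocabulary from every unique character, plus one BOS token.
chars = sorted(set("".join(docs)))
BOS = len(chars)                       # assumption: BOS gets the last id
vocab_size = len(chars) + 1
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}

def encode(doc):
    """Map a document to token ids, prepending BOS to mark the start."""
    return [BOS] + [stoi[ch] for ch in doc]

def decode(tokens):
    """Map token ids back to text, skipping the BOS marker."""
    return "".join(itos[t] for t in tokens if t != BOS)

ids = encode("emma")
```

Unlike subword tokenization, this vocabulary is tiny (one entry per character) at the cost of longer token sequences.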
Automatic Differentiation
How microGPT computes gradients with a scalar autograd engine.
- [ ] The Value Class
- Explain what the Value class stores and why each field exists
- Describe how operator overloading builds a computation graph
- Trace forward computation through a simple Value expression
- [ ] Backpropagation
- Apply the chain rule to compute gradients through a multi-step computation
- Explain why backpropagation uses reverse topological order
- Trace gradient flow through a simple computation graph
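The Value class and backpropagation objectives above can be condensed into a minimal sketch in the micrograd style. Field and method names here are illustrative, not necessarily microGPT's; only `+` and `*` are overloaded to keep the graph small.

```python
class Value:
    """Minimal sketch of a scalar autograd node (micrograd style)."""
    def __init__(self, data, children=(), local_grads=()):
        self.data = data                 # scalar from the forward pass
        self.grad = 0.0                  # dL/d(this node), set by backward()
        self._children = children        # inputs that produced this node
        self._local_grads = local_grads  # d(this)/d(child) for each child

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data * other.data, (self, other),
                     (other.data, self.data))

    def backward(self):
        # Reverse topological order guarantees each node's gradient is
        # complete before it is propagated on to its children.
        topo, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for c in v._children:
                    build(c)
                topo.append(v)
        build(self)
        self.grad = 1.0                  # dL/dL = 1 at the output
        for v in reversed(topo):
            for child, lg in zip(v._children, v._local_grads):
                child.grad += lg * v.grad   # chain rule, accumulated

# Forward: f = (a * b) + a with a=2, b=3, so f = 8.
a, b = Value(2.0), Value(3.0)
f = a * b + a
f.backward()   # df/da = b + 1 = 4, df/db = a = 2
```

Note that gradients are accumulated with `+=`, since a node (here `a`) can feed into more than one operation.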
The Transformer Architecture
Embeddings, normalization, attention, and feedforward layers — the components that make a GPT.
- [ ] Embeddings
- Explain the purpose of token embeddings and position embeddings
- Describe how the two embedding tables combine to form the input to the transformer
- Identify why position information is necessary for attention
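How the two embedding tables combine can be sketched as below, assuming plain-Python lists of floats as in microGPT; the table names and sizes are illustrative.

```python
import random
random.seed(0)

vocab_size, block_size, n_embd = 8, 4, 3  # illustrative sizes

# Two lookup tables: one row per token id, one row per position.
wte = [[random.gauss(0, 0.02) for _ in range(n_embd)] for _ in range(vocab_size)]
wpe = [[random.gauss(0, 0.02) for _ in range(n_embd)] for _ in range(block_size)]

def embed(token_id, pos):
    """The transformer's input at one position is the elementwise sum of
    the token embedding and the position embedding. Without wpe, attention
    could not tell identical tokens at different positions apart."""
    return [t + p for t, p in zip(wte[token_id], wpe[pos])]

x = embed(token_id=5, pos=2)  # one n_embd-dimensional input vector
```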
- [ ] Normalization
- Explain why normalization is needed in deep networks
- Compute RMSNorm for a simple vector
- Describe the pre-norm transformer design pattern
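The RMSNorm computation above is small enough to write out directly. A minimal sketch, without the learned gain some variants add; microGPT's version may differ in such details.

```python
import math

def rmsnorm(x, eps=1e-5):
    """Scale x so its root-mean-square is ~1. Unlike LayerNorm,
    RMSNorm does not subtract the mean; eps guards against division
    by zero on an all-zero vector."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

# For [3, 4]: rms = sqrt((9 + 16) / 2) = sqrt(12.5), about 3.5355.
y = rmsnorm([3.0, 4.0])
```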
- [ ] Attention
- Trace the computation of scaled dot-product attention in microGPT
- Explain the purpose of causal masking in autoregressive models
- Describe how multiple attention heads capture different relationships
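Scaled dot-product attention with a causal mask, for a single head, can be sketched as follows. This is an illustrative re-implementation, not microGPT's exact code; here the mask is applied by simply excluding future positions, which is equivalent to giving them scores of negative infinity.

```python
import math

def softmax(scores):
    m = max(scores)                       # subtract max for stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(q, k, v):
    """One attention head: position t attends only to positions <= t.
    q, k, v are lists of head_dim-sized vectors, one per position."""
    head_dim = len(q[0])
    out = []
    for t in range(len(q)):
        # Scores against keys up to and including position t, scaled
        # by sqrt(head_dim) to keep dot products from growing with size.
        scores = [sum(qi * ki for qi, ki in zip(q[t], k[s])) / math.sqrt(head_dim)
                  for s in range(t + 1)]
        weights = softmax(scores)
        # Output is the attention-weighted average of the value vectors.
        out.append([sum(w * v[s][d] for s, w in enumerate(weights))
                    for d in range(head_dim)])
    return out

# Three positions, head_dim = 2; position 0 can only attend to itself.
q = k = v = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = causal_attention(q, k, v)
```

Multi-head attention runs several such heads on different learned projections of the input and concatenates their outputs.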
- [ ] MLP and Residual Connections
- Explain what the MLP sublayer computes and why it is needed alongside attention
- Describe how residual connections help gradient flow in deep networks
- Trace data flow through a complete transformer block
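The data flow through a pre-norm transformer block reduces to two residual additions. A wiring sketch with stand-in sublayers (the real attention, MLP, and norm are assumed to exist elsewhere); identity stand-ins make the flow easy to trace by hand.

```python
def add(a, b):
    """Elementwise sum of two vectors: one residual connection."""
    return [ai + bi for ai, bi in zip(a, b)]

def transformer_block(x, attn, mlp, norm):
    """Pre-norm wiring: each sublayer sees a normalized input, and its
    output is added back onto the residual stream, so gradients always
    have a direct additive path through the block."""
    x = [add(xi, ai) for xi, ai in zip(x, attn([norm(xi) for xi in x]))]
    x = [add(xi, mi) for xi, mi in zip(x, [mlp(norm(xi)) for xi in x])]
    return x

# Identity sublayers: each residual add then contributes an exact copy
# of the input, so one block doubles the vector twice (x -> 2x -> 4x).
identity_seq = lambda xs: xs   # stand-in attention over the sequence
identity_vec = lambda v: v     # stand-in MLP / norm over one vector
y = transformer_block([[1.0, 2.0]], identity_seq, identity_vec, identity_vec)
```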
Training
Loss functions, the Adam optimizer, and the training loop that teaches the model to predict the next token.
- [ ] Cross-Entropy Loss
- Explain what cross-entropy loss measures in the context of next-token prediction
- Trace the computation from logits through softmax to loss
- Describe why the loss is computed at every position in the sequence
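The logits-to-softmax-to-loss computation at a single position can be traced concretely. A sketch in plain Python; microGPT repeats this at every position and averages.

```python
import math

def cross_entropy(logits, target):
    """Softmax the logits, then take the negative log-probability the
    model assigned to the actual next token."""
    m = max(logits)                      # subtract max for stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return -math.log(probs[target])

# Uniform logits over 4 tokens: loss = -log(1/4) = log(4), about 1.386.
loss = cross_entropy([0.0, 0.0, 0.0, 0.0], target=2)
```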
- [ ] The Adam Optimizer
- Explain why plain gradient descent is insufficient for training neural networks
- Describe the two running averages that Adam maintains per parameter
- Explain the purpose of bias correction in Adam
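One Adam update for a single scalar parameter can be sketched as below; the hyperparameter values are illustrative defaults and may not match microGPT's.

```python
def adam_step(p, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.95, eps=1e-8):
    """One Adam update. m and v are the two per-parameter running
    averages: of the gradient, and of the squared gradient."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    # Bias correction: m and v start at 0, so early averages
    # underestimate the true moments; dividing by (1 - beta**t)
    # compensates, most strongly at small t.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    p = p - lr * m_hat / (v_hat ** 0.5 + eps)
    return p, m, v

# First step (t=1): bias correction makes m_hat equal the raw gradient,
# so the step size is very close to lr regardless of gradient scale.
p, m, v = adam_step(p=1.0, grad=0.5, m=0.0, v=0.0, t=1)
```

The per-parameter scaling by `v_hat ** 0.5` is what plain gradient descent lacks: parameters with consistently large gradients take proportionally smaller steps.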
- [ ] The Training Loop
- Describe the three phases of each training step
- Explain how the training loop processes documents of varying lengths
- Interpret training loss values and what they indicate about model progress
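The three phases of a training step can be seen in miniature on a toy scalar model with loss (w*x - y)^2. This plain-SGD stand-in only illustrates the phase structure; microGPT does the same phases with its Value graph and Adam.

```python
def train_step(param, batch, lr=0.1):
    x, y = batch
    # Phase 1: forward. Compute the loss for this example.
    pred = param["data"] * x
    loss = (pred - y) ** 2
    # Phase 2: backward. Zero the stale gradient, then backpropagate.
    param["grad"] = 0.0
    param["grad"] += 2 * (pred - y) * x   # d(loss)/dw by the chain rule
    # Phase 3: update. Step each parameter against its gradient.
    param["data"] -= lr * param["grad"]
    return loss

w = {"data": 0.0, "grad": 0.0}
losses = [train_step(w, (1.0, 2.0)) for _ in range(50)]  # learn w -> 2
```

A falling loss curve like `losses` is the signal that the model is fitting the data; in microGPT the analogous signal is the per-token cross-entropy dropping below the uniform-guessing baseline.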
Generation
Turning a trained model into a text generator, and connecting microGPT to the full GPT lineage.
- [ ] Autoregressive Generation
- Explain how autoregressive generation produces text one token at a time
- Describe how temperature controls the tradeoff between diversity and quality
- Trace the inference loop from a BOS token to a complete generated name
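The inference loop and temperature scaling can be sketched as follows. The `toy` model is a hypothetical stand-in for the trained transformer, and treating BOS as the stop marker is an assumption about the setup.

```python
import math, random
random.seed(0)

def sample_next(logits, temperature=1.0):
    """Sample one token id. Dividing logits by temperature sharpens
    (< 1) or flattens (> 1) the distribution before softmax."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return random.choices(range(len(probs)), weights=probs)[0]

def generate(model, bos_id, max_new_tokens, temperature=1.0):
    """Autoregressive loop: start from BOS, repeatedly feed the growing
    context back in until BOS reappears (as an end marker) or a length
    cap is hit."""
    tokens = [bos_id]
    for _ in range(max_new_tokens):
        nxt = sample_next(model(tokens), temperature)
        if nxt == bos_id:
            break
        tokens.append(nxt)
    return tokens[1:]                 # drop the leading BOS

# Hypothetical "model": strongly prefers token 1 for the first few
# steps, then BOS (id 0) to stop.
toy = lambda toks: [5.0, 0.0] if len(toks) > 3 else [0.0, 5.0]
out = generate(toy, bos_id=0, max_new_tokens=10, temperature=0.5)
```

At low temperature sampling approaches greedy decoding (high quality, low diversity); at high temperature it approaches uniform sampling (high diversity, lower quality).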
- [ ] The Full Pipeline
- Trace the full data flow from raw text to generated output
- Identify which microGPT components are essential algorithms and which are efficiency optimizations in production systems
- Explain what production GPT systems add beyond the core algorithm