Understanding microGPT
Walk through Karpathy's 200-line GPT implementation — from autograd and tokenization to multi-head attention, training, and text generation.
Download courseware
Prefer to inspect a complete implementation? Download the pre-completed courseware for this course.
- Pre-completed project: a complete reference implementation of microGPT in pure Python — the full autograd engine, transformer architecture, training loop, and generation in a single file.
The Big Picture
What microGPT is, where the data comes from, and how text becomes numbers.
- [ ] What microGPT Is
- Identify the components contained in microGPT's 200 lines
- Explain how microGPT relates to Karpathy's earlier projects (micrograd, makemore, nanoGPT)
- Distinguish essential algorithmic content from efficiency optimizations
- [ ] Data and Tokenization
- Explain how microGPT loads and prepares its training data
- Describe character-level tokenization and the role of the BOS token
- Compare character-level tokenization with subword tokenization
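The character-level tokenization described above can be sketched in a few lines of plain Python. This is an illustrative sketch, not microGPT's exact code; the example documents and the convention of giving BOS the last vocabulary id are assumptions.

```python
# Illustrative sketch of character-level tokenization with a BOS token.
docs = ["emma", "olivia", "ava"]  # example training documents (names)

# Build the vocabulary from every unique character, plus one BOS token.
chars = sorted(set("".join(docs)))
BOS = len(chars)                       # assumption: BOS gets the last id
vocab_size = len(chars) + 1
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}

def encode(doc):
    """Map a document to token ids, prepending BOS to mark the start."""
    return [BOS] + [stoi[ch] for ch in doc]

def decode(tokens):
    """Map token ids back to text, skipping the BOS marker."""
    return "".join(itos[t] for t in tokens if t != BOS)

ids = encode("emma")
```

Unlike subword tokenization, this vocabulary is tiny (one entry per character) at the cost of longer token sequences.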
Automatic Differentiation
How microGPT computes gradients with a scalar autograd engine.
- [ ] The Value Class
- Explain what the Value class stores and why each field exists
- Describe how operator overloading builds a computation graph
- Trace forward computation through a simple Value expression
- [ ] Backpropagation
- Apply the chain rule to compute gradients through a multi-step computation
- Explain why backpropagation uses reverse topological order
- Trace gradient flow through a simple computation graph
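The Value class and backpropagation objectives above can be condensed into a minimal sketch in the micrograd style. Field and method names here are illustrative, not necessarily microGPT's; only `+` and `*` are overloaded to keep the graph small.

```python
class Value:
    """Minimal sketch of a scalar autograd node (micrograd style)."""
    def __init__(self, data, children=(), local_grads=()):
        self.data = data                 # scalar from the forward pass
        self.grad = 0.0                  # dL/d(this node), set by backward()
        self._children = children        # inputs that produced this node
        self._local_grads = local_grads  # d(this)/d(child) for each child

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data * other.data, (self, other),
                     (other.data, self.data))

    def backward(self):
        # Reverse topological order guarantees each node's gradient is
        # complete before it is propagated on to its children.
        topo, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for c in v._children:
                    build(c)
                topo.append(v)
        build(self)
        self.grad = 1.0                  # dL/dL = 1 at the output
        for v in reversed(topo):
            for child, lg in zip(v._children, v._local_grads):
                child.grad += lg * v.grad   # chain rule, accumulated

# Forward: f = (a * b) + a with a=2, b=3, so f = 8.
a, b = Value(2.0), Value(3.0)
f = a * b + a
f.backward()   # df/da = b + 1 = 4, df/db = a = 2
```

Note that gradients are accumulated with `+=`, since a node (here `a`) can feed into more than one operation.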
The Transformer Architecture
Embeddings, normalization, attention, and feedforward layers — the components that make a GPT.
- [ ] Embeddings
- Explain the purpose of token embeddings and position embeddings
- Describe how the two embedding tables combine to form the input to the transformer
- Identify why position information is necessary for attention
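How the two embedding tables combine can be sketched as below, assuming plain-Python lists of floats as in microGPT; the table names and sizes are illustrative.

```python
import random
random.seed(0)

vocab_size, block_size, n_embd = 8, 4, 3  # illustrative sizes

# Two lookup tables: one row per token id, one row per position.
wte = [[random.gauss(0, 0.02) for _ in range(n_embd)] for _ in range(vocab_size)]
wpe = [[random.gauss(0, 0.02) for _ in range(n_embd)] for _ in range(block_size)]

def embed(token_id, pos):
    """The transformer's input at one position is the elementwise sum of
    the token embedding and the position embedding. Without wpe, attention
    could not tell identical tokens at different positions apart."""
    return [t + p for t, p in zip(wte[token_id], wpe[pos])]

x = embed(token_id=5, pos=2)  # one n_embd-dimensional input vector
```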
- [ ] Normalization
- Explain why normalization is needed in deep networks
- Compute RMSNorm for a simple vector
- Describe the pre-norm transformer design pattern
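The RMSNorm computation above is small enough to write out directly. A minimal sketch, without the learned gain some variants add; microGPT's version may differ in such details.

```python
import math

def rmsnorm(x, eps=1e-5):
    """Scale x so its root-mean-square is ~1. Unlike LayerNorm,
    RMSNorm does not subtract the mean; eps guards against division
    by zero on an all-zero vector."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

# For [3, 4]: rms = sqrt((9 + 16) / 2) = sqrt(12.5), about 3.5355.
y = rmsnorm([3.0, 4.0])
```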
- [ ] Attention
- Trace the computation of scaled dot-product attention in microGPT
- Explain the purpose of causal masking in autoregressive models
- Describe how multiple attention heads capture different relationships
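Scaled dot-product attention with a causal mask, for a single head, can be sketched as follows. This is an illustrative re-implementation, not microGPT's exact code; here the mask is applied by simply excluding future positions, which is equivalent to giving them scores of negative infinity.

```python
import math

def softmax(scores):
    m = max(scores)                       # subtract max for stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(q, k, v):
    """One attention head: position t attends only to positions <= t.
    q, k, v are lists of head_dim-sized vectors, one per position."""
    head_dim = len(q[0])
    out = []
    for t in range(len(q)):
        # Scores against keys up to and including position t, scaled
        # by sqrt(head_dim) to keep dot products from growing with size.
        scores = [sum(qi * ki for qi, ki in zip(q[t], k[s])) / math.sqrt(head_dim)
                  for s in range(t + 1)]
        weights = softmax(scores)
        # Output is the attention-weighted average of the value vectors.
        out.append([sum(w * v[s][d] for s, w in enumerate(weights))
                    for d in range(head_dim)])
    return out

# Three positions, head_dim = 2; position 0 can only attend to itself.
q = k = v = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = causal_attention(q, k, v)
```

Multi-head attention runs several such heads on different learned projections of the input and concatenates their outputs.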
- [ ] MLP and Residual Connections
- Explain what the MLP sublayer computes and why it is needed alongside attention
- Describe how residual connections help gradient flow in deep networks
- Trace data flow through a complete transformer block
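The data flow through a pre-norm transformer block reduces to two residual additions. A wiring sketch with stand-in sublayers (the real attention, MLP, and norm are assumed to exist elsewhere); identity stand-ins make the flow easy to trace by hand.

```python
def add(a, b):
    """Elementwise sum of two vectors: one residual connection."""
    return [ai + bi for ai, bi in zip(a, b)]

def transformer_block(x, attn, mlp, norm):
    """Pre-norm wiring: each sublayer sees a normalized input, and its
    output is added back onto the residual stream, so gradients always
    have a direct additive path through the block."""
    x = [add(xi, ai) for xi, ai in zip(x, attn([norm(xi) for xi in x]))]
    x = [add(xi, mi) for xi, mi in zip(x, [mlp(norm(xi)) for xi in x])]
    return x

# Identity sublayers: each residual add then contributes an exact copy
# of the input, so one block doubles the vector twice (x -> 2x -> 4x).
identity_seq = lambda xs: xs   # stand-in attention over the sequence
identity_vec = lambda v: v     # stand-in MLP / norm over one vector
y = transformer_block([[1.0, 2.0]], identity_seq, identity_vec, identity_vec)
```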
Training
Loss functions, the Adam optimizer, and the training loop that teaches the model to predict the next token.
- [ ] Cross-Entropy Loss
- Explain what cross-entropy loss measures in the context of next-token prediction
- Trace the computation from logits through softmax to loss
- Describe why the loss is computed at every position in the sequence
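The logits-to-softmax-to-loss computation at a single position can be traced concretely. A sketch in plain Python; microGPT repeats this at every position and averages.

```python
import math

def cross_entropy(logits, target):
    """Softmax the logits, then take the negative log-probability the
    model assigned to the actual next token."""
    m = max(logits)                      # subtract max for stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return -math.log(probs[target])

# Uniform logits over 4 tokens: loss = -log(1/4) = log(4), about 1.386.
loss = cross_entropy([0.0, 0.0, 0.0, 0.0], target=2)
```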
- [ ] The Adam Optimizer
- Explain why plain gradient descent is insufficient for training neural networks
- Describe the two running averages that Adam maintains per parameter
- Explain the purpose of bias correction in Adam
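One Adam update for a single scalar parameter can be sketched as below; the hyperparameter values are illustrative defaults and may not match microGPT's.

```python
def adam_step(p, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.95, eps=1e-8):
    """One Adam update. m and v are the two per-parameter running
    averages: of the gradient, and of the squared gradient."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    # Bias correction: m and v start at 0, so early averages
    # underestimate the true moments; dividing by (1 - beta**t)
    # compensates, most strongly at small t.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    p = p - lr * m_hat / (v_hat ** 0.5 + eps)
    return p, m, v

# First step (t=1): bias correction makes m_hat equal the raw gradient,
# so the step size is very close to lr regardless of gradient scale.
p, m, v = adam_step(p=1.0, grad=0.5, m=0.0, v=0.0, t=1)
```

The per-parameter scaling by `v_hat ** 0.5` is what plain gradient descent lacks: parameters with consistently large gradients take proportionally smaller steps.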
- [ ] The Training Loop
- Describe the three phases of each training step
- Explain how the training loop processes documents of varying lengths
- Interpret training loss values and what they indicate about model progress
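The three phases of a training step can be seen in miniature on a toy scalar model with loss (w*x - y)^2. This plain-SGD stand-in only illustrates the phase structure; microGPT does the same phases with its Value graph and Adam.

```python
def train_step(param, batch, lr=0.1):
    x, y = batch
    # Phase 1: forward. Compute the loss for this example.
    pred = param["data"] * x
    loss = (pred - y) ** 2
    # Phase 2: backward. Zero the stale gradient, then backpropagate.
    param["grad"] = 0.0
    param["grad"] += 2 * (pred - y) * x   # d(loss)/dw by the chain rule
    # Phase 3: update. Step each parameter against its gradient.
    param["data"] -= lr * param["grad"]
    return loss

w = {"data": 0.0, "grad": 0.0}
losses = [train_step(w, (1.0, 2.0)) for _ in range(50)]  # learn w -> 2
```

A falling loss curve like `losses` is the signal that the model is fitting the data; in microGPT the analogous signal is the per-token cross-entropy dropping below the uniform-guessing baseline.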
Generation
Turning a trained model into a text generator, and connecting microGPT to the full GPT lineage.
- [ ] Autoregressive Generation
- Explain how autoregressive generation produces text one token at a time
- Describe how temperature controls the tradeoff between diversity and quality
- Trace the inference loop from a BOS token to a complete generated name
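The inference loop and temperature scaling can be sketched as follows. The `toy` model is a hypothetical stand-in for the trained transformer, and treating BOS as the stop marker is an assumption about the setup.

```python
import math, random
random.seed(0)

def sample_next(logits, temperature=1.0):
    """Sample one token id. Dividing logits by temperature sharpens
    (< 1) or flattens (> 1) the distribution before softmax."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return random.choices(range(len(probs)), weights=probs)[0]

def generate(model, bos_id, max_new_tokens, temperature=1.0):
    """Autoregressive loop: start from BOS, repeatedly feed the growing
    context back in until BOS reappears (as an end marker) or a length
    cap is hit."""
    tokens = [bos_id]
    for _ in range(max_new_tokens):
        nxt = sample_next(model(tokens), temperature)
        if nxt == bos_id:
            break
        tokens.append(nxt)
    return tokens[1:]                 # drop the leading BOS

# Hypothetical "model": strongly prefers token 1 for the first few
# steps, then BOS (id 0) to stop.
toy = lambda toks: [5.0, 0.0] if len(toks) > 3 else [0.0, 5.0]
out = generate(toy, bos_id=0, max_new_tokens=10, temperature=0.5)
```

At low temperature sampling approaches greedy decoding (high quality, low diversity); at high temperature it approaches uniform sampling (high diversity, lower quality).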
- [ ] The Full Pipeline
- Trace the full data flow from raw text to generated output
- Identify which microGPT components are essential algorithms and which are efficiency optimizations in production systems
- Explain what production GPT systems add beyond the core algorithm