
Understanding microGPT

Walk through Karpathy's 200-line GPT implementation — from autograd and tokenization to multi-head attention, training, and text generation.


Download Courseware

Prefer to inspect a complete implementation? Download pre-completed courseware for this course.

  • Pre-completed project
    A complete reference implementation of microGPT in pure Python — the full autograd engine, transformer architecture, training loop, and generation in a single file.

The Big Picture

What microGPT is, where the data comes from, and how text becomes numbers.

  • [ ] What Is microGPT? [text] free
    • Identify the components contained in microGPT's 200 lines
    • Explain how microGPT relates to Karpathy's earlier projects (micrograd, makemore, nanoGPT)
    • Distinguish essential algorithmic content from efficiency optimizations
  • [ ] Data and Tokenization [text] free
    • Explain how microGPT loads and prepares its training data
    • Describe character-level tokenization and the role of the BOS token
    • Compare character-level tokenization with subword tokenization
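The character-level scheme these lessons describe can be sketched in a few lines of Python. This is an illustrative sketch, not microGPT's actual code: the tiny dataset, the `stoi`/`itos` names, and the choice to reserve the last id for BOS are all assumptions made for the example.

```python
# Hypothetical sketch of character-level tokenization with a BOS token.
docs = ["emma", "liam"]                       # tiny illustrative "dataset" of names
chars = sorted(set("".join(docs)))            # unique characters form the vocabulary
BOS = len(chars)                              # reserve one extra id as the BOS token
stoi = {ch: i for i, ch in enumerate(chars)}  # char -> integer id
itos = {i: ch for ch, i in stoi.items()}      # integer id -> char

def encode(doc):
    """Prefix each document with BOS so the model knows where a name starts."""
    return [BOS] + [stoi[ch] for ch in doc]

def decode(ids):
    """Drop the BOS token and map ids back to characters."""
    return "".join(itos[i] for i in ids if i != BOS)

tokens = encode("emma")
assert decode(tokens) == "emma"
```

Note the contrast with subword tokenization: here the vocabulary is just the distinct characters plus one special token, so it stays tiny, at the cost of longer token sequences.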

Automatic Differentiation

How microGPT computes gradients with a scalar autograd engine.

  • [ ] The Value Class [text]
    • Explain what the Value class stores and why each field exists
    • Describe how operator overloading builds a computation graph
    • Trace forward computation through a simple Value expression
  • [ ] Backpropagation [text]
    • Apply the chain rule to compute gradients through a multi-step computation
    • Explain why backpropagation uses reverse topological order
    • Trace gradient flow through a simple computation graph
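The ideas in these two lessons can be sketched with a stripped-down `Value` class in the micrograd style. microGPT's real class supports more operations; this sketch keeps only `+` and `*` to show the core mechanism: operator overloading records the computation graph, and `backward()` walks it in reverse topological order applying the chain rule.

```python
# Minimal sketch of a scalar autograd Value class (micrograd-style, not microGPT's exact code).
class Value:
    def __init__(self, data, _children=()):
        self.data = data                 # the scalar result of the forward pass
        self.grad = 0.0                  # dL/d(this value), filled in by backward()
        self._backward = lambda: None    # the local chain-rule step for this node
        self._prev = set(_children)      # the inputs that produced this node

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def _backward():                 # d(a+b)/da = 1 and d(a+b)/db = 1
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():                 # d(a*b)/da = b and d(a*b)/db = a
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # Build reverse topological order so each node's grad is complete
        # before it propagates to its inputs.
        topo, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build(child)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()

# Trace: L = (a * b) + a, so dL/da = b + 1 and dL/db = a.
a, b = Value(2.0), Value(3.0)
L = a * b + a
L.backward()
print(a.grad, b.grad)  # 4.0 2.0
```

Note that gradients accumulate with `+=`: `a` feeds both the multiply and the add, so its gradient is the sum of both paths, which is exactly why reverse topological order matters.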

The Transformer Architecture

Embeddings, normalization, attention, and feedforward layers — the components that make a GPT.

  • [ ] Embeddings [text]
    • Explain the purpose of token embeddings and position embeddings
    • Describe how the two embedding tables combine to form the input to the transformer
    • Identify why position information is necessary for attention
  • [ ] RMSNorm [text]
    • Explain why normalization is needed in deep networks
    • Compute RMSNorm for a simple vector
    • Describe the pre-norm transformer design pattern
  • [ ] Attention [text]
    • Trace the computation of scaled dot-product attention in microGPT
    • Explain the purpose of causal masking in autoregressive models
    • Describe how multiple attention heads capture different relationships
  • [ ] The MLP and the Transformer Block [text]
    • Explain what the MLP sublayer computes and why it is needed alongside attention
    • Describe how residual connections help gradient flow in deep networks
    • Trace data flow through a complete transformer block
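Two of these computations, RMSNorm and causal scaled dot-product attention, are small enough to sketch with plain Python lists, mirroring microGPT's pure-Python style. This is a single-head sketch under illustrative shapes, not the actual implementation (the real code operates on scalar `Value` objects and splits the vectors across multiple heads).

```python
import math

def rmsnorm(x, eps=1e-5):
    """Scale x by the reciprocal of its root-mean-square (no mean subtraction)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def causal_attention(q, k, v):
    """q, k, v: lists of per-position vectors (seq_len x head_dim), one head."""
    d = len(q[0])
    out = []
    for t in range(len(q)):                        # each query position t ...
        scores = [sum(qi * ki for qi, ki in zip(q[t], k[s])) / math.sqrt(d)
                  for s in range(t + 1)]           # ... attends only to s <= t (causal mask)
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]   # numerically stable softmax
        total = sum(exps)
        weights = [e / total for e in exps]
        out.append([sum(w * v[s][j] for s, w in enumerate(weights))
                    for j in range(d)])            # weighted average of the values
    return out

# Worked RMSNorm example: rms of [3, 4] is sqrt(25/2) ~ 3.536,
# so the output is roughly [0.849, 1.131].
x = rmsnorm([3.0, 4.0])
```

Because position 0 can only attend to itself, its output is exactly its own value vector, which is a handy sanity check when tracing the computation by hand.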

Training

Loss functions, the Adam optimizer, and the training loop that teaches the model to predict the next token.

  • [ ] Cross-Entropy Loss [text]
    • Explain what cross-entropy loss measures in the context of next-token prediction
    • Trace the computation from logits through softmax to loss
    • Describe why the loss is computed at every position in the sequence
  • [ ] The Adam Optimizer [text]
    • Explain why plain gradient descent is insufficient for training neural networks
    • Describe the two running averages that Adam maintains per parameter
    • Explain the purpose of bias correction in Adam
  • [ ] The Training Loop [text]
    • Describe the three phases of each training step
    • Explain how the training loop processes documents of varying lengths
    • Interpret training loss values and what they indicate about model progress
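The two core computations in this section can be sketched for a single scalar parameter and a single position. This is an illustrative sketch: the function names, the hyperparameter values, and the scalar-by-scalar framing are assumptions for the example, not microGPT's actual settings.

```python
import math

def cross_entropy(logits, target):
    """-log softmax(logits)[target], using the max-subtraction trick for stability."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[target]

def adam_step(p, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.95, eps=1e-8):
    """One Adam update for a single scalar parameter p.
    m and v are the two running averages Adam keeps per parameter:
    the gradient and the squared gradient."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad * grad
    m_hat = m / (1 - b1 ** t)   # bias correction: m and v start at 0,
    v_hat = v / (1 - b2 ** t)   # so early steps would otherwise be too small
    p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
    return p, m, v

# Uniform logits give loss = log(vocab_size): the "untrained" baseline
# to compare training loss values against.
loss = cross_entropy([0.0, 0.0, 0.0, 0.0], target=2)  # ~ log(4) ~ 1.386
```

The baseline in the last line is why an initial loss near `log(vocab_size)` indicates a correctly initialized model, and anything persistently above it indicates a bug.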

Generation

Turning a trained model into a text generator, and connecting microGPT to the full GPT lineage.

  • [ ] Autoregressive Generation [text]
    • Explain how autoregressive generation produces text one token at a time
    • Describe how temperature controls the tradeoff between diversity and quality
    • Trace the inference loop from a BOS token to a complete generated name
  • [ ] microGPT and the GPT Lineage [text]
    • Trace the full data flow from raw text to generated output
    • Identify which microGPT components are essential algorithms and which are efficiency optimizations in production systems
    • Explain what production GPT systems add beyond the core algorithm
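The inference loop these lessons trace can be sketched as follows. Everything here is a hypothetical stand-in: `model(tokens)` represents a forward pass that returns next-token logits, and the separate `eos_id` end marker is an assumption made for clarity, not necessarily how microGPT signals the end of a name.

```python
import math, random

def sample_next(logits, temperature=1.0):
    """Divide logits by temperature, softmax, then sample one token id.
    temperature -> 0 approaches argmax (safe, repetitive);
    temperature > 1 flattens the distribution (diverse, riskier)."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]   # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    return random.choices(range(len(probs)), weights=probs)[0]

def generate(model, bos_id, eos_id, max_len=20, temperature=1.0):
    """Feed the growing sequence back into the model one token at a time."""
    tokens = [bos_id]
    for _ in range(max_len):
        next_id = sample_next(model(tokens), temperature)
        if next_id == eos_id:   # stop at the end marker (eos_id is an
            break               # assumption of this sketch)
        tokens.append(next_id)
    return tokens[1:]           # drop BOS; the caller decodes ids back to text
```

Note how temperature enters before the softmax: dividing all logits by a small number widens the gaps between them, so the resulting distribution concentrates on the most likely token.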